Introduction: In this talk, Zhai Jia, co-founder of StreamNative, introduces Apache Pulsar, the next-generation cloud-native messaging platform, and explains how Pulsar's native storage-compute separation architecture provides a foundation for batch-stream integration, and how Pulsar combines with Flink to realize unified batch and stream computing.
GitHub address
https://github.com/apache/flink
Everyone is welcome to like and star the Flink project~
Apache Pulsar is relatively new: it joined the Apache Software Foundation in 2017 and only graduated to become a top-level project in 2018. Because Pulsar natively adopts a storage-compute separated architecture and has BookKeeper, a storage engine designed specifically for messages and streams, combined with Pulsar's own enterprise-grade features, it has attracted the attention of more and more developers. Today's sharing is divided into three parts:
- What is Apache Pulsar;
- Pulsar's data view;
- Batch-stream integration of Pulsar and Flink.
1. What is Apache Pulsar
The picture below shows the open-source tools in the messaging field; developers working on messaging or infrastructure will be familiar with them. Although Pulsar started development in 2012 and was not open-sourced until 2016, it had already been running in production at Yahoo for a long time before meeting everyone. This is why it received a lot of attention from developers as soon as it was open-sourced: it was already an online-tested system.
The most fundamental difference between Pulsar and other messaging systems lies in two aspects:
- On the one hand, Pulsar adopts a cloud-native architecture that separates storage and computing;
- On the other hand, Pulsar has a storage engine specially designed for messages, Apache BookKeeper.
Architecture
The following figure shows the architecture of Pulsar's storage and computing separation:
- First, in the computing layer, the Pulsar Broker stores no state data and does no data storage; we also call it the service layer.
- Secondly, Pulsar has a storage engine BookKeeper designed specifically for messages and streams, which we also call the data layer.
This layered architecture is very convenient for users' cluster expansion:
- If you want to support more Producers and Consumers, you can expand the stateless Broker layer above;
- If you want to store more data, you can expand the underlying storage layer separately.
This cloud-native architecture has two main features:
- The first is the separation of storage and compute;
- The other is that each layer is a peer-to-peer architecture.
In terms of node peering, the Broker layer stores no data, so node peering is easy to achieve. But Pulsar's bottom storage layer is peer-to-peer as well: at the storage layer, BookKeeper does not use master/slave replication but a quorum mechanism.
If the user wants to keep multiple backups of the data, the broker writes concurrently to, say, three storage nodes, and each copy of the data is in a peer state, so the bottom-layer nodes are peers as well; expanding and managing the bottom-layer nodes becomes easy for the user. Having such a peer-to-peer basis brings users great cloud-native convenience, makes it easy to scale each layer independently, and also improves the availability and maintainability of users' online systems.
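The quorum write described here can be sketched as a toy model. Note that all names below are illustrative and this is not the real BookKeeper API: an entry is considered durable once an ack-quorum of the write-quorum storage nodes acknowledge it.

```java
import java.util.List;

// Toy sketch of a BookKeeper-style quorum write (illustrative names,
// NOT the real BookKeeper API): the broker writes an entry to
// writeQuorum storage nodes in parallel and treats the write as
// durable once ackQuorum of them have acknowledged it.
public class QuorumWriteSketch {
    static boolean isDurable(List<Boolean> acks, int ackQuorum) {
        // Count how many of the writeQuorum nodes acknowledged the entry.
        long acked = acks.stream().filter(a -> a).count();
        return acked >= ackQuorum;
    }

    public static void main(String[] args) {
        // writeQuorum = 3, ackQuorum = 2: one slow or failed node is tolerated.
        System.out.println(isDurable(List.of(true, true, false), 2));  // true
        System.out.println(isDurable(List.of(true, false, false), 2)); // false
    }
}
```

Because every replica is written directly by the broker rather than forwarded master-to-slave, all storage nodes stay in a peer state.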
At the same time, this layered architecture lays a solid foundation for batch-stream integration with Flink. Because Pulsar is natively divided into two layers, it can provide two different APIs according to the user's scenario and the different access patterns of batch and stream:
- For real-time data access, use the Consumer interface provided by the upper-layer Broker;
- For historical data access, you can skip the Broker and use the Reader interface of the storage layer to access the underlying storage directly.
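As a toy sketch of these two access paths (illustrative names only; in the real Pulsar Java client they correspond to the `Consumer` and `Reader` interfaces):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of Pulsar's two access paths (illustrative, not real client
// code): the broker serves the active tail in real time, while sealed
// historical shards can be read straight from the storage layer.
public class TopicAccessSketch {
    // Sealed historical shards living in the storage layer...
    final List<List<String>> sealedShards = new ArrayList<>();
    // ...and the active shard that the broker serves in real time.
    final List<String> activeShard = new ArrayList<>();

    void publish(String msg) { activeShard.add(msg); }

    // "Consumer" path: the broker returns the latest message at the tail.
    String consumeLatest() { return activeShard.get(activeShard.size() - 1); }

    // "Reader" path: bypass the broker and read a historical shard
    // directly from the storage layer.
    List<String> readShard(int shardIndex) { return sealedShards.get(shardIndex); }
}
```

The point of the split is that real-time tailing and historical scans take entirely different code paths, so one does not slow down the other.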
Storage: BookKeeper
Another advantage of Pulsar is Apache BookKeeper, a storage engine designed specifically for streams and messages. It is a simple write-ahead-log abstraction. A log abstraction is similar to a stream abstraction: all data is continuously appended directly at the end.
The advantage this brings to users is that the write pattern is relatively simple and delivers relatively high throughput. In terms of consistency, BookKeeper combines ideas from PAXOS and ZooKeeper's ZAB protocol. What BookKeeper exposes is a log abstraction; you can simply assume that its consistency is very high and that it achieves log-level storage similar to Raft. BookKeeper was originally born to serve the HA of the HDFS NameNode, a scenario with particularly high consistency requirements. This is why, in many critical scenarios, people choose Pulsar and BookKeeper for storage.
BookKeeper's design provides dedicated read/write isolation; simply put, reads and writes happen on different disks. The benefit is that, in batch-stream fusion scenarios, interference from historical-data reads can be reduced. When users read the latest real-time data, they often inevitably read historical data as well; with a separate disk dedicated to historical data, historical and real-time reads and writes do not compete for IO, which gives a better experience for batch-stream-integrated IO services.
Application scenarios
Pulsar scenarios are widely used. The following are several common application scenarios of Pulsar:
- First, because Pulsar has BookKeeper, data consistency is particularly high. Pulsar can be used in billing platforms, payment platforms, transaction systems, and other scenarios with very demanding requirements on data service quality, consistency, and availability.
- The second application scenario is Worker Queue / Push Notifications / Task Queue, which is mainly to achieve mutual decoupling between systems.
- The third scenario relates to Pulsar's support for both message and queue scenarios: Pulsar supports the Queue consumption model as well as Kafka's high-bandwidth consumption model. Later, I will explain the advantages of combining the Queue consumption model with Flink.
- The fourth scenario is IoT applications, because Pulsar supports parsing the MQTT protocol on the server side and lightweight compute with Pulsar Functions.
- The fifth aspect is unified data processing, which uses Pulsar as the basis of batch-stream-integrated storage.
At the Pulsar Summit Asia at the end of November 2020, we invited more than 40 speakers to share their Pulsar adoption cases. If you are interested in Pulsar application scenarios, you can follow the StreamNative account on Bilibili and watch the related videos.
2. Pulsar's data view
Among these application scenarios, unified data processing is particularly important. For batch-stream integration, the first reaction of many domestic users is to choose Flink. Let's take a look: what advantages does the combination of Pulsar and Flink have, and why do users choose Pulsar and Flink for batch-stream integration?
First, let's start with Pulsar's data view. Like other messaging systems, Pulsar takes the message as the subject and the topic as the center. All data is handed to the topic by producers, and consumers then subscribe to the topic to consume messages.
Partition
To facilitate expansion, Pulsar also has the concept of partitions within a topic, similar to many messaging systems. As mentioned above, Pulsar is a layered architecture: it exposes topics to users via partitions, but internally each partition can be cut into shards according to a time or size the user specifies. When a topic is first created, there is only one active shard; when the user-specified time arrives, a new shard is cut. In the process of opening a new shard, the storage layer can select the node with the most free capacity to store it, according to the capacity of each node.
The advantage of this is that the shards of a topic are evenly distributed over the nodes of the storage layer, achieving balanced data storage. If users wish, they can use the entire storage cluster to store a partition and are no longer limited by the capacity of a single node. As shown in the figure below, the topic has 4 partitions, and each partition is split into multiple shards. The user can cut a shard by time (for example, 10 minutes or one hour) or by size (for example, 1 GB or 2 GB). Shards themselves are sequential, with IDs that increase gradually, and all messages within a shard also increase monotonically by ID, which makes it easy to guarantee ordering.
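The time/size rollover rule described above can be sketched as a toy model. All names here are hypothetical, not Pulsar internals: a shard is sealed and a new one opened once it exceeds the configured age or size, and shard IDs increase monotonically.

```java
// Toy model of time/size-based shard (segment) rollover; names are
// hypothetical, not Pulsar internals. Shard IDs increase monotonically.
public class ShardRolloverSketch {
    long currentShardId = 0;
    long currentShardBytes = 0;
    long currentShardOpenedAtMs = 0;

    final long maxShardBytes; // e.g. roll at 1 GB
    final long maxShardAgeMs; // e.g. roll every 10 minutes

    ShardRolloverSketch(long maxShardBytes, long maxShardAgeMs) {
        this.maxShardBytes = maxShardBytes;
        this.maxShardAgeMs = maxShardAgeMs;
    }

    // Returns the shard ID the message lands in, rolling over first if needed.
    long append(long messageBytes, long nowMs) {
        boolean tooBig = currentShardBytes + messageBytes > maxShardBytes;
        boolean tooOld = nowMs - currentShardOpenedAtMs >= maxShardAgeMs;
        if (tooBig || tooOld) {
            currentShardId++; // seal the old shard, open a new one
            currentShardBytes = 0;
            currentShardOpenedAtMs = nowMs;
        }
        currentShardBytes += messageBytes;
        return currentShardId;
    }
}
```

Each sealed shard can then be placed independently, which is what lets the storage layer pick the node with the most free capacity for every new shard.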
Stream storage
Let's look at a single shard through the lens of the usual stream data-processing concepts. All user data is continuously appended at the end of the stream; like the concept of a stream, new data of a topic in Pulsar is constantly added at the tail of the topic. The difference is that Pulsar's topic abstraction provides some additional advantages:
- First, it uses an architecture that separates storage and computing. In the computing layer, it is more of a message service layer, which can quickly return the latest data to the user through the consumer interface, and the user can obtain the latest data in real time;
- Another advantage is that it is divided into multiple shards. If the user specifies the time, the corresponding shard can be found from the metadata, and the user can bypass the real-time stream and directly read the shards of the storage layer;
- Another advantage is that Pulsar can provide unlimited stream storage.
Those with an infrastructure background, on seeing a time-sharded architecture, will readily think of moving old shards to secondary storage, and this is exactly what Pulsar does. Based on the consumption heat of a topic, users can automatically move old data, or data exceeding a time or size limit, to secondary storage. Users can choose Google Cloud, Microsoft Azure, or AWS to store old shards, and HDFS storage is also supported.
The advantage of this is that the latest data can be returned quickly through BookKeeper, while cloud storage resources can be used to provide unlimited stream storage for old, cold data. This is why Pulsar can support unlimited stream storage, and it is also a basis for batch-stream integration.
In general, through storage-compute separation, Pulsar provides two different access interfaces for real-time data and historical data. Users can choose which interface to use according to the position and metadata of the internal shards. At the same time, thanks to the sharding mechanism, old shards can be placed in secondary storage, which supports unlimited stream storage.
Pulsar's unification is embodied in the management of shard metadata. Each shard can be stored in different storage media or formats according to time, but by managing the metadata of every shard, Pulsar externally presents the logical concept of a single partition. When accessing a shard in the partition, I can get its metadata and know its order within the partition, the storage location of its data, and the storage type. Through unified management of each shard's metadata, Pulsar provides a unified topic abstraction.
3. Batch-stream integration of Pulsar and Flink
In Flink, the stream is a basic concept, and Pulsar can be used as the carrier that stores the data. If the user runs a batch computation, it can be regarded as a bounded stream; for Pulsar, that is a bounded range of shards within a topic.
In the figure, we can see that the topic has many shards. If the start and end times are determined, the user can determine the range of shards to read based on those times. For real-time data, it corresponds to a continuous query or access; in the Pulsar scenario, that means constantly consuming the tail data of the topic. In this way, Pulsar's topic model combines well with Flink's concept of a stream, and Pulsar can serve as a carrier for Flink's stream computation.
- Bounded calculation can be regarded as a bounded stream, corresponding to some limited fragments of Pulsar;
- Real-time calculation is an unbounded stream, querying and accessing the latest data in Topic.
Pulsar adopts different response modes for bounded and unbounded flows:
The first is the response to historical data. As shown in the figure below, the lower left corner is the user's query; the given start and end times limit the scope of the stream. Pulsar's response is divided into several steps:
- The first step is to find the topic. Based on our unified metadata management, the metadata list of all the shards in this topic can be obtained;
- In the second step, based on the time limits in the metadata list, the starting and ending shards are located by binary search, selecting the shards that need to be scanned;
- In the third step, after finding these shards, access them through the interface of the underlying storage layer to complete one lookup of historical data.
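The lookup in the second step can be sketched as a binary search over shard metadata sorted by start time. This is a simplified model; the field and method names are hypothetical, not Pulsar's actual metadata API.

```java
import java.util.List;

// Simplified sketch of the second step: given shard metadata sorted by
// start time, binary-search for the range of shards overlapping
// [startMs, endMs]. Names are hypothetical, not Pulsar's metadata API.
public class ShardTimeLookup {
    record ShardMeta(long shardId, long startTimeMs) {}

    // Index of the last shard whose startTimeMs <= t (the shard containing t).
    static int shardContaining(List<ShardMeta> shards, long t) {
        int lo = 0, hi = shards.size() - 1, ans = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (shards.get(mid).startTimeMs() <= t) { ans = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        return ans;
    }

    // Shards that must be scanned for a query bounded by [startMs, endMs].
    static List<ShardMeta> shardsToScan(List<ShardMeta> shards, long startMs, long endMs) {
        int first = shardContaining(shards, startMs);
        int last = shardContaining(shards, endMs);
        return shards.subList(first, last + 1);
    }
}
```

Because shard IDs and start times both increase monotonically, the search is logarithmic in the number of shards, and only the selected shards are handed to the storage-layer readers in the third step.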
For real-time data lookup, Pulsar also provides the same interface as Kafka: the last shard (that is, the latest data) can be read through the consumer, and data is accessed in real time through the consumer interface. It keeps looking up the latest data and performs the next lookup once it is done. In this case, using the Pulsar Pub/Sub interface is the most direct and effective way.
Simply put, Flink provides a unified view so that users can use a unified API to process streaming and historical data. In the past, data scientists may need to write two sets of applications to process real-time data and historical data, but now they only need one set of models to solve this problem.
Pulsar mainly provides a data carrier, and provides a streaming storage carrier for the upper computing layer through an architecture based on partition and sharding. Because Pulsar uses a hierarchical sharding architecture, it has the latest data access interface for streams, and it also has a storage layer access interface for batches that has higher requirements for concurrency. At the same time, it provides unlimited stream storage and a unified consumption model.
4. Pulsar's existing capabilities and progress
Finally, let's talk about Pulsar's current capabilities and some recent developments.
Existing capabilities
Schema
In big data, schema is a particularly important abstraction, and the same is true in the messaging field. In Pulsar, if the producer and the consumer can agree on a contract through the schema, there is no need for the producing and consuming sides to communicate offline about the format of the data being sent and received. We need the same support in the compute engine.
In the Pulsar-Flink connector, we borrow Flink's schema interface to connect to the schema that comes with Pulsar, so that Flink can directly parse the schema stored in Pulsar data. This schema includes two types:
- The first is the common metadata of each message, including the message key, the time when the message was generated, and other metadata information.
- The other is the description of the data structure of the message content; the most common is the Avro format. When the user accesses a message, the corresponding data structure can be known through the schema.
At the same time, we incorporated FLIP-107, integrating Flink's metadata schema with the Avro metadata and combining the two schemas to support more complex queries.
Source
With this schema, users can easily use Pulsar as a source, because each message can be understood from the information in the schema.
Pulsar Sink
We can also return the calculation result in Flink to Pulsar and use it as a sink.
Streaming Tables
With Sink and Source support, we can directly expose Flink tables to users. Users can simply use Pulsar as a Flink table to query data.
Write to Streaming Tables
The following figure shows how to write calculation results or data to Pulsar's Topic.
Pulsar Catalog
Pulsar comes with many enterprise-grade features. A Pulsar topic (e.g. `persistent://tenant_name/namespace_name/topic_name`) is not a flat concept but is divided into levels: the tenant level and the namespace level. This can easily be combined with the catalog concept commonly used in Flink.
As shown in the figure below, a Pulsar Catalog is defined whose database is tn/ns, a path expression: first the tenant, then the namespace, and finally the topic. In this way, a Pulsar namespace can be regarded as a Flink catalog database; there are many topics under the namespace, and each topic can be a table of the catalog. This corresponds easily to the Flink Catalog. In the figure below, the upper part is the definition of the catalog and the lower part demonstrates how to use it. However, further improvement is needed here, and there are plans to add partition support later.
FLIP-27
FLIP-27 is representative of Pulsar-Flink batch-stream integration. As introduced above, Pulsar provides a unified view that manages the metadata of all topics. In this view, the information of each shard is marked in the metadata, and the FLIP-27 framework is then used to achieve batch-stream integration. FLIP-27 involves two concepts: the splitter (split enumerator) and the reader.
It works like this: first, a splitter cuts the data source into splits and hands them to readers to read the data. For Pulsar, the splitting still follows the shards of a Pulsar topic. After fetching the metadata of the Pulsar topic, the connector determines where each shard is stored based on its metadata and then selects the most suitable reader to access it. Pulsar provides a unified storage layer, and Flink selects different readers according to the splitter's information about each shard's location and format to read the data in Pulsar.
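A toy sketch of this selection step follows; the enum values and names are illustrative only, not the actual Pulsar-Flink connector code.

```java
// Toy sketch of FLIP-27-style reader selection: the splitter enumerates
// shard splits, and a reader is picked per split based on where the
// shard's metadata says the data lives. Names are illustrative only,
// not the actual Pulsar-Flink connector code.
public class ReaderSelectionSketch {
    enum StorageTier { BROKER_TAIL, BOOKKEEPER, TIERED_OBJECT_STORE }

    record ShardSplit(long shardId, StorageTier tier) {}

    static String pickReader(ShardSplit split) {
        return switch (split.tier()) {
            case BROKER_TAIL -> "streaming reader (consumer interface on the broker)";
            case BOOKKEEPER -> "bookkeeper reader (storage-layer segment read)";
            case TIERED_OBJECT_STORE -> "offloaded-segment reader (e.g. S3/HDFS)";
        };
    }
}
```

The same job can therefore mix readers: tail splits go through the streaming path while sealed or offloaded shards are read in batch fashion, which is the essence of the batch-stream-integrated source.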
Source high concurrency
Another point is closely related to Pulsar's consumption model. A problem many Flink users face is how to make Flink execute jobs faster. For example, if the user sets a parallelism of 10, there will be 10 concurrent subtasks; but if a Kafka topic has only 5 partitions, then since each partition can only be consumed by one subtask, 5 Flink subtasks will be idle. The only way to increase consumption parallelism is to coordinate with the business side to open a few more partitions, which makes everything from the consumer side to the production side and the operations behind it particularly complicated, and it is difficult to update on demand in real time.
Pulsar supports not only the Kafka model, where each partition can be consumed by only one active consumer, but also the Key-Shared model, where multiple consumers can consume one partition together while messages for each key are still delivered to only one consumer. This guarantees consumer concurrency while also guaranteeing per-key message ordering.
For the previous scenario, we have supported the Key-Shared consumption model in Pulsar-Flink. There are still 5 partitions and 10 concurrent Flink subtasks, but I can split the key range into 10 parts, and each Flink subtask consumes one of the 10 key ranges. In this way, on the consumer side, the number of partitions and Flink's parallelism are well decoupled, and data concurrency can be provided better.
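The key-range split can be sketched like this. This is a simplified model: Pulsar's actual Key_Shared implementation differs in detail, and the constants and names here are illustrative.

```java
// Simplified sketch of Key-Shared style consumption: the key hash range
// is split evenly among subtasks, so parallelism is decoupled from the
// partition count while all messages for one key go to one subtask.
// (Pulsar's real implementation differs in detail; this is illustrative.)
public class KeySharedSketch {
    static final int HASH_RANGE = 65536; // size of the key hash ring (assumed)

    // Which of `numSubtasks` consumers receives messages for this key.
    static int subtaskForKey(String key, int numSubtasks) {
        int slot = Math.floorMod(key.hashCode(), HASH_RANGE);
        int rangeSize = HASH_RANGE / numSubtasks;
        // Clamp so the remainder slots fall into the last sub-range.
        return Math.min(slot / rangeSize, numSubtasks - 1);
    }
}
```

Because assignment depends only on the key hash and the number of sub-ranges, 10 subtasks can share 5 partitions while per-key ordering is preserved.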
Automatic Reader selection
Another direction builds on the unified storage foundation mentioned above: on this basis, we can choose different readers according to the metadata of the user's different segments. We have already implemented this function.
Recent work
Recently, we have also been working on integration with Flink 1.12. The Pulsar-Flink project keeps iterating: for example, we added support for the transactions introduced in Pulsar 2.7 and integrated end-to-end exactly-once into the Pulsar-Flink repo; other work includes how to read Parquet-format columnar data from the second-level (tiered) storage, and using the Pulsar storage layer for Flink's state storage, etc.
Copyright statement: The content of this article is voluntarily contributed by real-name registered users of Alibaba Cloud, and the copyright belongs to the original author. The Alibaba Cloud Developer Community does not own the copyright and does not assume the corresponding legal responsibility. For specific rules, please refer to the "Alibaba Cloud Developer Community User Service Agreement" and the "Alibaba Cloud Developer Community Intellectual Property Protection Guidelines". If you find suspected plagiarism in this community, fill in the infringement complaint form to report it; once verified, the community will immediately delete the suspected infringing content.