Editor's Note:

In this talk, Zhai Jia, co-founder of StreamNative, introduces Apache Pulsar, a next-generation cloud-native messaging and streaming platform, explains how Pulsar's native storage-compute separation architecture provides the basis for batch-stream fusion, and shows how Pulsar combines with Flink to achieve batch-stream integrated computing.

About Apache Pulsar

Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight functional computing. It supports multi-datacenter and cross-region data replication, and offers streaming data storage features such as strong consistency, high throughput, low latency, and high scalability.

GitHub address: http://github.com/apache/pulsar/

Apache Pulsar is relatively new: it joined the Apache Software Foundation in 2017 and graduated as a top-level project in 2018. Pulsar has attracted more and more developers' attention thanks to its native storage-compute separation architecture, its storage engine BookKeeper, which is purpose-built for messages and streams, and its enterprise-grade features.

What is Apache Pulsar

The following figure shows the open-source tools in the messaging field; developers who work on messaging or infrastructure will be familiar with them. Although Pulsar only started development in 2012 and was not open-sourced until 2016, it had already been running in production at Yahoo for a long time before it reached the community. This is why it attracted the attention of many developers as soon as it was open sourced: it was already a production-tested system.

[Image]

The most fundamental difference between Pulsar and other messaging systems lies in two aspects:

  • On the one hand, Pulsar adopts a cloud-native architecture that separates storage and computing;
  • On the other hand, Pulsar has a storage engine specially designed for messages, Apache BookKeeper.

Architecture

The following figure shows the architecture of Pulsar's storage and computing separation:

  • First, the computing layer: the Pulsar Broker holds no state and stores no data, so we also call it the service layer.
  • Second, Pulsar has a storage engine, BookKeeper, designed specifically for messages and streams, which we also call the data layer.

[Image]

This layered architecture makes it very convenient for users to scale the cluster:

  • If you want to support more Producers and Consumers, you can expand the stateless Broker layer above;
  • If you want to store more data, you can expand the underlying storage layer separately.

This cloud-native architecture has two main characteristics:

  • The first is the separation of storage and computation;
  • Another feature is that each layer is a peer-to-peer architecture.

In terms of node peering: the Broker layer stores no data, so it is easy to make its nodes peers. But Pulsar's underlying storage is peer-to-peer as well: in the storage layer, BookKeeper does not use master/slave replication, but quorum-based replication.

To keep multiple copies of the data, the broker writes to, say, three storage nodes concurrently, and every copy of the data is equivalent, so the underlying nodes are peers as well, which makes expansion and management easy. Having node peering at every layer brings great cloud-native convenience: users can scale each layer independently, improving the availability and maintainability of their online systems.

At the same time, this layered architecture lays the foundation for batch-stream fusion with Flink. Because Pulsar is natively divided into two layers, it can provide two different sets of APIs for the different access patterns of batch and streaming workloads:

  • For real-time data, access goes through the Consumer interface provided by the upper-layer Broker;
  • For historical data, you can bypass the Broker and use the reader interface of the storage layer to access the underlying storage directly (a minimal client-level sketch follows).
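The following is a minimal, client-level sketch of these two access paths, assuming a local broker at pulsar://localhost:6650 and a hypothetical topic name. In practice the connector's batch path reads sealed shards from the storage layer directly; the Reader API below is simply the easiest way to illustrate the two patterns:

```java
import org.apache.pulsar.client.api.*;

public class TwoAccessPaths {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // assumed broker address
                .build();

        // Real-time path: subscribe through the broker and tail the newest messages.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/events")   // hypothetical topic
                .subscriptionName("realtime-sub")
                .subscribe();
        Message<byte[]> latest = consumer.receive();
        consumer.acknowledge(latest);

        // Historical path: a reader positioned at the earliest message,
        // scanning straight through the stored shards.
        Reader<byte[]> reader = client.newReader()
                .topic("persistent://public/default/events")
                .startMessageId(MessageId.earliest)
                .create();
        while (reader.hasMessageAvailable()) {
            Message<byte[]> msg = reader.readNext();
            // process historical message ...
        }

        reader.close();
        consumer.close();
        client.close();
    }
}
```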

Storage: BookKeeper

Another advantage of Pulsar is Apache BookKeeper, a storage engine designed specifically for streams and messages. It is a simple write-ahead-log abstraction. A log is very similar to a stream: all data is continuously appended directly at the tail.

The benefit this brings to users is that the write path is relatively simple and delivers relatively high throughput. In terms of consistency, BookKeeper combines ideas from the PAXOS and ZooKeeper ZAB protocols. What BookKeeper exposes is a log abstraction; you can simply assume it is strongly consistent and can implement log-layer storage similar to Raft. BookKeeper was originally created to serve the HA of the HDFS NameNode, a scenario with particularly high consistency requirements. This is also why people choose Pulsar and BookKeeper for storage in many critical scenarios.
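To get a feel for this log abstraction, here is a rough sketch against BookKeeper's classic ledger API (the ZooKeeper address, quorum sizes and payloads are illustrative assumptions). A ledger is an append-only log that becomes immutable once closed, and entries are replicated to a write quorum in parallel rather than through a master/slave chain:

```java
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerEntry;
import org.apache.bookkeeper.client.LedgerHandle;

import java.util.Enumeration;

public class LedgerSketch {
    public static void main(String[] args) throws Exception {
        // Assumed ZooKeeper address for the BookKeeper cluster's metadata store.
        BookKeeper bk = new BookKeeper("localhost:2181");

        // A ledger is the append-only log unit: ensemble of 3 bookies,
        // write quorum 3, ack quorum 2; entries go to the quorum in parallel.
        LedgerHandle writer = bk.createLedger(3, 3, 2,
                BookKeeper.DigestType.CRC32, "secret".getBytes());
        long lastEntryId = 0;
        for (int i = 0; i < 10; i++) {
            lastEntryId = writer.addEntry(("entry-" + i).getBytes());
        }
        long ledgerId = writer.getId();
        writer.close();   // closing seals the ledger; it becomes immutable

        // Entries are read back in order by (ledgerId, entryId).
        LedgerHandle reader = bk.openLedger(ledgerId,
                BookKeeper.DigestType.CRC32, "secret".getBytes());
        Enumeration<LedgerEntry> entries = reader.readEntries(0, lastEntryId);
        while (entries.hasMoreElements()) {
            System.out.println(new String(entries.nextElement().getEntry()));
        }
        reader.close();
        bk.close();
    }
}
```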

BookKeeper is also designed with read/write isolation; simply put, reads and writes happen on different disks. The benefit is that, in batch-stream fusion scenarios, reading historical data interferes less with the real-time path. Users often end up reading historical data while also reading the latest real-time data; with read/write isolation there is no IO contention between the two, which gives a better IO experience for batch-stream fusion.

[Image]

Application scenarios

Pulsar is used in a wide range of scenarios. The following are several common application scenarios of Pulsar:

  • First, because Pulsar has BookKeeper, data consistency is particularly strong. Pulsar can be used in billing platforms, payment platforms, transaction systems and other scenarios that demand high data service quality, consistency and availability.
  • The second application scenario is worker queues / push notifications / task queues, mainly to decouple systems from each other.
  • The third scenario relates to Pulsar supporting both streaming and queue consumption: Pulsar supports the queue consumption model as well as the Kafka-style high-bandwidth consumption model. Later, I will explain the advantages of combining the queue consumption model with Flink.
  • The fourth scenario is IoT applications, because Pulsar supports MQTT protocol parsing on the server side, as well as lightweight computing with Pulsar Functions.
  • The fifth is unified data processing, which uses Pulsar as the foundation of a batch-stream fused storage.

At Pulsar Summit Asia at the end of November 2020, we invited more than 40 speakers to share their Pulsar adoption stories. If you are interested in Pulsar application scenarios, you can follow the StreamNative account on Bilibili, or click the related reading at the end of this article to watch the summit talk videos.

[Image]

Pulsar's data view

Among these application scenarios, unified data processing is especially important. When it comes to batch-stream integration, the first reaction of many users in China is to choose Flink. So what are the advantages of combining Pulsar and Flink, and why do users choose Pulsar and Flink for batch-stream fusion?

First, let's start with Pulsar's data view. Like other messaging systems, Pulsar is message-based and centered on topics. All data is published to a topic by producers, and consumers subscribe to the topic to consume messages from it.

[Image]

Partition

To make scaling easier, Pulsar also has the concept of partitions within a topic, similar to many messaging systems. As mentioned above, Pulsar is a layered architecture: it exposes topics to users through partitions, but internally each partition can be cut into shards according to the time or size the user specifies. When a topic is first created, there is only one active shard; when the specified time arrives, a new shard is cut. While opening a new shard, the storage layer can choose the node with the most free capacity to store it, based on the capacity of each node.

The advantage of this is that the shards of a topic are evenly distributed across the nodes of the storage layer, balancing the stored data. If the user wishes, the entire storage cluster can be used to store a partition, no longer limited by the capacity of a single node. As shown in the figure below, the topic has 4 partitions, and each partition is divided into multiple shards. The user can cut a shard by time (such as 10 minutes or an hour) or by size (such as 1 GB or 2 GB). Shards themselves are ordered, with IDs that increase gradually, and all messages inside a shard have monotonically increasing IDs, so ordering is easy to guarantee.

[Image]
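For illustration only (the admin URL and topic name are assumptions), creating such a partitioned topic through the Java admin client looks like the sketch below; the time- or size-based rollover of shards inside each partition is governed by broker-side settings rather than by this call:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;

public class CreatePartitionedTopic {
    public static void main(String[] args) throws Exception {
        // Assumed admin endpoint.
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();

        // Create a topic with 4 partitions; inside each partition the broker rolls
        // over to a new BookKeeper ledger (shard) based on broker-side time/size
        // settings, and each new shard can land on the storage node with the most
        // free capacity.
        admin.topics().createPartitionedTopic(
                "persistent://public/default/events", 4);   // hypothetical topic

        admin.close();
    }
}
```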

Stream storage

Now let's look at a single shard using the common concept of a stream in data processing. All user data is continuously appended at the tail of the stream; similarly, new data for a Pulsar topic is continuously appended at the tail of the topic. The difference is that Pulsar's topic abstraction provides some additional advantages:

  • First, it adopts a storage-compute separated architecture. The computing layer is essentially a message service layer that can quickly return the latest data to users through the consumer interface, so users obtain the latest data in real time;
  • Another advantage is that the topic is divided into multiple shards. If the user specifies a time, the corresponding shard can be located from the metadata, and the user can bypass the real-time stream and read that shard of the storage layer directly;
  • Another advantage is that Pulsar can provide unlimited streaming storage.

If you work on infrastructure, seeing this time-sharded structure naturally suggests moving old shards to secondary (tiered) storage, and that is exactly what Pulsar does. Users can configure a topic so that old data, or data exceeding a time or size limit, is automatically moved to secondary storage. For old shards, users can choose Google Cloud, Microsoft Azure or AWS object storage, and HDFS is also supported.

The advantage is that the latest data can be returned quickly from BookKeeper, while old, cold data can be kept in effectively unlimited stream storage on cloud storage resources. This is why Pulsar can support unlimited stream storage, and it is another foundation for batch-stream fusion.
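As a hedged example (the namespace, threshold and admin URL are assumptions, and the tiered-storage driver itself must be configured on the broker side), automatic offload of old shards can be enabled per namespace through the Java admin client:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;

public class OffloadThreshold {
    public static void main(String[] args) throws Exception {
        // Assumed admin endpoint, tenant and namespace names.
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();

        // Once topics in this namespace accumulate more than ~10 GB in BookKeeper,
        // older sealed shards are offloaded to the configured tiered storage
        // (S3, GCS, Azure, HDFS, ...); the offload driver is set in broker.conf.
        admin.namespaces().setOffloadThreshold("public/default", 10L * 1024 * 1024 * 1024);

        admin.close();
    }
}
```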

In general, through the separation of storage and computing, Pulsar provides two different access interfaces for real-time data and historical data, and users can choose which one to use according to the location of the shards and their metadata. At the same time, thanks to the sharding mechanism, old shards can be placed in secondary storage, which supports unlimited stream storage.

Pulsar's unification is reflected in the management of shard metadata. Each shard may be stored in a different storage medium or format depending on its age, but Pulsar provides the logical concept of a partition by managing the metadata of every shard. When accessing a shard within a partition, I can obtain its metadata: its order within the partition, and the storage location and storage type of its data. This unified management of per-shard metadata is what gives Pulsar a single, unified topic abstraction.

[Image]

Batch-stream fusion of Pulsar and Flink

In Flink, the stream is a basic concept, and Pulsar can serve as the carrier that stores a stream's data. If the user runs a batch computation, it can be regarded as a bounded stream; for Pulsar, that corresponds to a bounded range of shards within a topic.

As we can see in the figure, the topic has many shards. If the start and end times are determined, the user can derive the range of shards to read from those times. For real-time data, the counterpart is a continuous query or access, which in Pulsar means continuously consuming the tail of the topic. In this way, Pulsar's topic model fits well with Flink's concept of a stream, and Pulsar can serve as the carrier for Flink's stream computing.

  • Bounded computation can be regarded as a bounded stream, corresponding to some limited shards of Pulsar;
  • Real-time computing is an unbounded stream, querying and accessing the latest data in topics.

[Image]

For bounded and unbounded streams, Pulsar adopts different response modes:

  • The first is the response to historical data. As shown in the figure below, the lower-left corner is the user's query; the given start and end times limit the scope of the stream. Pulsar's response is divided into several steps:

    • The first step is to find the topic and, from the uniformly managed metadata, obtain the list of metadata for all shards in this topic;
    • The second step is to locate, by binary search over the time ranges in that metadata list, the starting and ending shards, and so select the shards that need to be scanned (a conceptual sketch of this lookup appears after this list);
    • The third step is to access the selected shards through the interface of the underlying storage layer, completing a lookup of historical data.

[Image]

  • For real-time data, Pulsar provides the same kind of interface as Kafka: the consumer interface reads the last shard (that is, the latest data) and accesses data in real time. It keeps polling for the latest data and, once a message is processed, moves on to the next lookup. In this case, Pulsar's Pub/Sub interface is the most direct and efficient way.

[Image]
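Conceptually, the binary-search step described above boils down to a search over ordered shard metadata. The sketch below is only an illustration with hypothetical types and fields, not the connector's actual code:

```java
import java.util.List;

public class SegmentLookup {
    // Hypothetical view of the per-shard metadata kept for a partition:
    // each shard knows the time range of the messages it covers.
    static class SegmentInfo {
        long startTime;   // timestamp of the first message in this shard
        long endTime;     // timestamp of the last message in this shard
        long segmentId;
    }

    /** First shard (in time order) whose end time is at or after the query start. */
    static int firstSegmentAtOrAfter(List<SegmentInfo> segments, long queryStart) {
        int lo = 0, hi = segments.size() - 1, answer = segments.size();
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (segments.get(mid).endTime >= queryStart) {
                answer = mid;
                hi = mid - 1;
            } else {
                lo = mid + 1;
            }
        }
        return answer;
    }

    /** Last shard whose start time is at or before the query end. */
    static int lastSegmentAtOrBefore(List<SegmentInfo> segments, long queryEnd) {
        int lo = 0, hi = segments.size() - 1, answer = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (segments.get(mid).startTime <= queryEnd) {
                answer = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return answer;
    }
    // Every shard whose index lies in [firstSegmentAtOrAfter, lastSegmentAtOrBefore]
    // is then handed to a reader against the storage layer.
}
```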

In short, Flink provides a unified view that lets users process streaming and historical data with a single API. In the past, data scientists might have needed to write two applications to handle real-time data and historical data separately; now a single set of code solves the problem.

Pulsar mainly provides the data carrier: through its partition-plus-shard architecture it provides a stream storage carrier for the compute layer above. Because Pulsar adopts a layered, sharded architecture, it has an access interface for the latest streaming data as well as a storage-layer access interface for batch access with higher concurrency requirements. At the same time, it provides unlimited stream storage and a unified consumption model.

[Image]

Pulsar's current capabilities and progress

Finally, let's talk about Pulsar's current capabilities and some recent progress.

Existing capabilities

Schema

In big data, the schema is a particularly important abstraction, and the same is true in the messaging field. In Pulsar, if the producer and the consumer agree on a contract through the schema, the people behind the producer and the consumer no longer need to coordinate the data format for sending and receiving offline. We need the same kind of support in the compute engine.

In the Pulsar-Flink connector, we plug Pulsar's built-in schema into Flink's schema interface, so Flink can directly parse the schema of the data stored in Pulsar. This schema consists of two parts:

  • The first is the common metadata of each message, including the message key, the time the message was produced, and other metadata information.
  • The other is the description of the data structure of the message contents, commonly in Avro format. Through the schema, the user knows the data structure corresponding to each message at access time.

At the same time, combining this with FLIP-107, we integrate the Flink metadata schema and the Avro metadata, so the two schemas can be combined for more complex queries.

[Image]
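As a minimal illustration of the Pulsar side of this contract (the topic name and POJO are hypothetical), a producer and a consumer can share an Avro schema derived from the same class, so neither side has to parse raw bytes by hand:

```java
import org.apache.pulsar.client.api.*;

public class SchemaExample {
    // Hypothetical POJO; Pulsar derives an Avro schema from its fields.
    public static class UserEvent {
        public String userId;
        public String action;
        public long   ts;
    }

    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // assumed broker address
                .build();

        // Producer and consumer both declare Schema.AVRO(UserEvent.class); the broker
        // registers the schema and checks compatibility, so no out-of-band agreement
        // on the wire format is needed.
        Producer<UserEvent> producer = client.newProducer(Schema.AVRO(UserEvent.class))
                .topic("persistent://public/default/user-events")
                .create();

        Consumer<UserEvent> consumer = client.newConsumer(Schema.AVRO(UserEvent.class))
                .topic("persistent://public/default/user-events")
                .subscriptionName("schema-demo")
                .subscribe();

        UserEvent e = new UserEvent();
        e.userId = "u-1";
        e.action = "click";
        e.ts = System.currentTimeMillis();
        producer.send(e);

        Message<UserEvent> msg = consumer.receive();
        System.out.println(msg.getValue().action);   // typed access, no manual parsing

        consumer.close();
        producer.close();
        client.close();
    }
}
```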

Source

With this schema, the user can easily use Pulsar as a source, because each message can be understood from the schema information.

[Image]

Pulsar Sink

We can also write Flink's computation results back to Pulsar as a sink.

[Image]

Streaming Tables

With Sink and Source support, we can expose Flink tables directly to users, who can simply use Pulsar as a Flink table and query its data.

[Image]

Write to Streaming Tables

The following figure shows how to write computation results or data into a Pulsar topic; a combined sketch covering both reading and writing Pulsar-backed tables follows the figure.

[Image]
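Here is a rough sketch covering the two previous sections together: declaring Pulsar-backed source and sink tables and continuously writing a query result back to Pulsar. The connector option keys follow the pulsar-flink connector's documented style but should be treated as assumptions and checked against the connector version in use:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PulsarStreamingTables {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Source table backed by a Pulsar topic (option keys are assumptions).
        tEnv.executeSql(
                "CREATE TABLE user_events (\n"
                + "  user_id STRING,\n"
                + "  action  STRING,\n"
                + "  ts      TIMESTAMP(3)\n"
                + ") WITH (\n"
                + "  'connector'   = 'pulsar',\n"
                + "  'topic'       = 'persistent://public/default/user-events',\n"
                + "  'service-url' = 'pulsar://localhost:6650',\n"
                + "  'admin-url'   = 'http://localhost:8080',\n"
                + "  'format'      = 'json'\n"
                + ")");

        // Sink table backed by another Pulsar topic, same (assumed) options.
        tEnv.executeSql(
                "CREATE TABLE purchase_events (\n"
                + "  user_id STRING,\n"
                + "  action  STRING,\n"
                + "  ts      TIMESTAMP(3)\n"
                + ") WITH (\n"
                + "  'connector'   = 'pulsar',\n"
                + "  'topic'       = 'persistent://public/default/purchase-events',\n"
                + "  'service-url' = 'pulsar://localhost:6650',\n"
                + "  'admin-url'   = 'http://localhost:8080',\n"
                + "  'format'      = 'json'\n"
                + ")");

        // Continuously read the source topic as a table and write a filtered
        // result back into the sink topic.
        tEnv.executeSql(
                "INSERT INTO purchase_events "
                + "SELECT user_id, action, ts FROM user_events WHERE action = 'purchase'");
    }
}
```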

Pulsar Catalog

Pulsar comes with many enterprise-grade streaming features. A Pulsar topic (e.g. persistent://tenant_name/namespace_name/topic_name) is not a flat concept but is organized into levels: there is a tenant level and a namespace level below it. This maps naturally onto Flink's commonly used Catalog concept.

As shown in the figure below, a Pulsar Catalog is defined whose database is tn/ns, a path expression: first the tenant, then the namespace, and finally the topic. A Pulsar namespace can thus be regarded as a Flink catalog database: there are many topics under the namespace, and each topic is a table of the catalog. This corresponds naturally to the Flink Catalog. In the figure, the upper part is the definition of the Catalog and the lower part shows how to use it. Further improvement is still needed here, and partition support is planned for later.

[Image]
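A sketch of what registering and using such a catalog could look like in SQL follows; the catalog option keys here are assumptions in the spirit of the connector's documentation, not an authoritative reference:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PulsarCatalogSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Register a catalog backed by a Pulsar cluster: the tenant/namespace path
        // acts as the database, and every topic under it becomes a table.
        // The option keys are assumptions; verify them against your connector version.
        tEnv.executeSql(
                "CREATE CATALOG pulsar_catalog WITH (\n"
                + "  'type'             = 'pulsar',\n"
                + "  'default-database' = 'public/default',\n"
                + "  'service-url'      = 'pulsar://localhost:6650',\n"
                + "  'admin-url'        = 'http://localhost:8080'\n"
                + ")");

        tEnv.executeSql("USE CATALOG pulsar_catalog");
        tEnv.executeSql("SHOW TABLES").print();   // topics in the namespace appear as tables
    }
}
```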

FLIP-27

FLIP-27 is representative of Pulsar-Flink batch-stream fusion. As mentioned earlier, Pulsar provides a unified view that manages the metadata of all topics. In this view, the information of each shard is marked in the metadata, and the FLIP-27 framework is then used on top of it to achieve batch-stream fusion. FLIP-27 has two key concepts: the splitter (split enumerator) and the reader.

It works like this: first, the splitter cuts the data source into splits, which are then handed to readers to read the data. For Pulsar, the splitter still works against a Pulsar topic: after fetching the topic's metadata, it judges where each shard is stored according to that shard's metadata, and then selects the most suitable reader to access it. Pulsar provides a unified storage layer, and Flink selects different readers to read the data in Pulsar according to the location and format information of each split.

[Image]
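For illustration, the sketch below uses the FLIP-27-style PulsarSource that later shipped as Flink's upstream flink-connector-pulsar; the StreamNative pulsar-flink connector discussed in this talk exposes a similar but not identical API, so treat class names and builder methods as assumptions:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.pulsar.source.PulsarSource;
import org.apache.flink.connector.pulsar.source.enumerator.cursor.StartCursor;
import org.apache.flink.connector.pulsar.source.enumerator.cursor.StopCursor;
import org.apache.flink.connector.pulsar.source.reader.deserializer.PulsarDeserializationSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PulsarFlip27Source {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The enumerator (splitter) discovers the topic's partitions/shards and hands
        // splits to readers; StartCursor/StopCursor define the bounded or unbounded range.
        PulsarSource<String> source = PulsarSource.builder()
                .setServiceUrl("pulsar://localhost:6650")           // assumed addresses
                .setAdminUrl("http://localhost:8080")
                .setTopics("persistent://public/default/events")    // hypothetical topic
                .setSubscriptionName("flink-demo")
                .setStartCursor(StartCursor.earliest())
                // Unbounded (streaming) by default; uncomment for a bounded (batch) read
                // over the shards that exist right now:
                // .setBoundedStopCursor(StopCursor.latest())
                .setDeserializationSchema(PulsarDeserializationSchema.flinkSchema(new SimpleStringSchema()))
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "pulsar-source").print();
        env.execute("pulsar-flip27-demo");
    }
}
```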

Source high concurrency

Another closely related topic is Pulsar's consumption model. A problem many Flink users face is how to make a Flink job run faster. For example, if the user sets a parallelism of 10, there will be 10 concurrent subtasks; but if the Kafka topic being consumed has only 5 partitions, then, because each partition can be consumed by only one subtask, 5 Flink subtasks will sit idle. The only way to increase consumption parallelism is to coordinate with the business side to add more partitions, which makes operations, from the consumer side through to the producer side, feel particularly complicated, and it is hard to adjust on demand in real time.

Pulsar not only supports the Kafka model, where each partition can be consumed by only one active consumer, but also supports the Key_Shared mode: multiple consumers can jointly consume one partition while messages for each key are still delivered to only one consumer, which guarantees both consumer concurrency and per-key message ordering.

For the scenario above, we have added support for the Key_Shared consumption mode in Pulsar-Flink. With the same 5 partitions and 10 concurrent Flink subtasks, the key range can be split into 10 slices, and each Flink subtask consumes one of the 10 key ranges. In this way, on the consumer side, the number of partitions is decoupled from the Flink parallelism, which provides better data concurrency.

[Image]
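The connector builds on Pulsar's Key_Shared subscription. As a plain-client illustration (broker address and topic are assumptions), several consumers can attach to the same subscription and the broker splits the key space among them:

```java
import org.apache.pulsar.client.api.*;

public class KeySharedConsumer {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // assumed broker address
                .build();

        // Several consumers can attach to the same subscription in Key_Shared mode;
        // the broker splits the key space among them, so each key's messages still
        // arrive in order at exactly one consumer, independent of the partition count.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/orders")   // hypothetical topic
                .subscriptionName("key-shared-sub")
                .subscriptionType(SubscriptionType.Key_Shared)
                .subscribe();

        while (true) {
            Message<byte[]> msg = consumer.receive();
            // process the message for this consumer's share of the key range ...
            consumer.acknowledge(msg);
        }
    }
}
```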

Automatic Reader selection

Another direction builds on the unified storage foundation mentioned above: we can select different readers according to the metadata of each of the user's segments. We have already implemented this capability.

[Image]

Recent work

Recently, we have been working on integration with Flink 1.12, and the Pulsar-Flink project keeps iterating. For example, we added support for the transactions introduced in Pulsar 2.7 and integrated end-to-end exactly-once semantics into the Pulsar-Flink repo; another piece of work is reading Parquet-format columnar data from the secondary storage; and we are also working on using the Pulsar storage layer as Flink's state storage, among other things.

[Image]

Related Reading

Click the "link" to review the video of the author's original talk.

