
About Apache Pulsar

Apache Pulsar is a top-level project of the Apache Software Foundation and a next-generation cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight functional computing. Its architecture separates computing from storage, and it supports multi-tenancy, persistent storage, and multi-datacenter, cross-region data replication, with strong consistency, high throughput, low latency, and high scalability for streaming data storage.
GitHub address: http://github.com/apache/pulsar/

Founded in 2014, BIGO is a fast-growing technology company. Built on its audio/video processing, global real-time audio/video transmission, artificial intelligence, and CDN technologies, BIGO has launched a series of audio and video social and content products, including Bigo Live (live streaming) and Likee (short video). It has nearly 100 million users worldwide, and its products and services cover more than 150 countries and regions.

Challenges

Initially, BIGO's message streaming platform was built mainly on open-source Kafka. As data volume kept growing and products kept iterating, the amount of data carried by the platform multiplied. Downstream businesses such as online model training, online recommendation, real-time data analysis, and real-time data warehousing placed ever higher demands on the platform's real-time performance and stability. The open-source Kafka clusters struggled to support these massive data-processing scenarios, and we had to invest more and more manpower in maintaining multiple Kafka clusters, so costs kept climbing, mainly in the following aspects:

  • Data storage is bound to the message queue service, so cluster expansion and partition rebalancing require copying large amounts of data, which degrades cluster performance.
  • When a partition's replicas are not in the ISR (in-sync) state, a single broker failure may cause data loss or leave the partition unable to serve reads and writes.
  • When a Kafka broker's disk fails or disk usage gets too high, manual intervention is required.
  • Cross-region synchronization relies on KMM (Kafka Mirror Maker), whose performance and stability fall short of expectations.
  • In catch-up read scenarios, PageCache pollution occurs easily, degrading read and write performance.
  • The number of topic partitions a Kafka broker can store is limited: the more partitions, the less sequential the disk I/O and the lower the read/write performance.
  • As the Kafka clusters grew, operation and maintenance costs rose sharply, requiring substantial manpower for daily operations; at BIGO, adding one machine to a Kafka cluster and rebalancing partitions takes 0.5 person-days, and removing one machine takes 1 person-day.

If we had continued with Kafka, costs would have kept rising: scaling machines in and out, and adding operations manpower. Meanwhile, as business grew, we had higher requirements for the messaging system: it should be more stable and reliable, easy to scale, and low latency. To improve the real-time performance, stability, and reliability of the message queue and to reduce operation and maintenance costs, we began to consider whether to do in-house secondary development on top of open-source Kafka, or whether the community already had a better solution to the problems we encountered while maintaining Kafka clusters.

Why choose Pulsar

In November 2019, we started evaluating message queues, comparing the strengths and weaknesses of the mainstream message streaming platforms against our requirements. During this investigation we found that Apache Pulsar, a next-generation cloud-native distributed messaging and streaming platform, integrates messaging, storage, and lightweight functional computing. Pulsar scales seamlessly, delivers low latency and high throughput, and supports multi-tenancy and cross-region replication. Most importantly, Pulsar's storage-compute separation architecture can perfectly solve Kafka's scaling problems: a Pulsar producer sends messages to a broker, and the broker writes them to the second-tier storage, BookKeeper, through the bookie client.

Pulsar adopts a layered architecture that separates storage from computing. It supports multi-tenancy, persistent storage, and multi-datacenter cross-region data replication, and offers strongly consistent, high-throughput, low-latency, and highly scalable streaming data storage:

  • Horizontal scalability: seamlessly scales out to hundreds of nodes.
  • High throughput: proven in Yahoo!'s production environment, supporting publish-subscribe at millions of messages per second.
  • Low latency: maintains low latency (less than 5 ms) even under heavy message load.
  • Persistence: Pulsar's persistence layer is built on Apache BookKeeper, which implements read/write separation.
  • Read/write separation: BookKeeper's read/write-separated I/O model makes full use of sequential disk writes and is relatively friendly to mechanical hard drives; the number of topics a single bookie node can serve is not limited.

To deepen our understanding of Apache Pulsar and assess whether it could truly meet the large-scale pub-sub needs of our production environment, we began a series of stress tests in December 2019. Since we use mechanical hard drives rather than SSDs, we ran into some performance problems during the tests. With StreamNative's assistance, we performed a series of performance tunings on the broker and BookKeeper, and Pulsar's throughput and stability both improved.

After three to four months of stress testing and tuning, we concluded that Pulsar could fully solve the problems we had encountered with Kafka, and we brought Pulsar online in the test environment in April 2020.

Apache Pulsar at BIGO: Pub-Sub consumption model

In May 2020, we officially put the Pulsar cluster into production. BIGO's Pulsar usage is mainly the classic pub-sub production and consumption model. On the producer side there are the Baina service (a data-receiving service implemented in C++), Kafka Mirror Maker, Flink, and clients in other languages such as Java, Python, and C++ writing data to topics. On the consumer side, data is consumed by Flink and Flink SQL, as well as by clients in other languages.

Downstream, the connected business scenarios include real-time data warehousing, real-time ETL (Extract-Transform-Load, the process of extracting data from sources, transforming it, and loading it into a destination), real-time data analysis, and real-time recommendation. Most business scenarios use Flink to consume data from Pulsar topics and apply business logic; the client languages used in other scenarios are mainly C++, Go, and Python. After being processed by the respective business logic, the data is ultimately written to Hive, Pulsar topics, or third-party storage services such as ClickHouse, HDFS, and Redis.

Pulsar + Flink real-time streaming platform

At BIGO, we have built a real-time streaming platform on top of Flink and Pulsar. Before introducing this platform, let's first look at the internal mechanics of the Pulsar Flink Connector. With the Pulsar Flink Source/Sink API, there is a Pulsar topic upstream, a Flink job in the middle, and another Pulsar topic downstream. How do we consume the upstream topic, process the data, and write it to the downstream topic?

Following the code example on the left side of the figure above, we first initialize a StreamExecutionEnvironment and apply the relevant configuration, such as setting the property and topic values. We then create a FlinkPulsarSource object; this source is constructed with the serviceUrl (broker list), the adminUrl (admin address), and the deserialization schema for the topic data, and finally the properties are passed in, so the data in the Pulsar topic can be read. Using the sink is just as simple: create a FlinkPulsarSink, specify the target topic, specify a TopicKeyExtractor as the key, and call addSink to write the data out. This production and consumption model is very simple and very similar to Kafka's.
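Since the original figure is not reproduced here, below is a minimal sketch of the source side based on the description above. The class names (FlinkPulsarSource, FlinkPulsarSink, addSink) come from the text; the exact constructor signatures vary across pulsar-flink connector versions (newer versions wrap the deserializer in a PulsarDeserializationSchema), and the endpoints and topic are placeholders.

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.pulsar.FlinkPulsarSource;

public class PulsarSourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder endpoints and topic; replace with the real cluster addresses.
        String serviceUrl = "pulsar://localhost:6650"; // broker list
        String adminUrl = "http://localhost:8080";     // admin address

        Properties props = new Properties();
        props.setProperty("topic", "persistent://public/default/input-topic");

        // Source construction as described in the text: serviceUrl + adminUrl +
        // deserialization schema + properties (signature may differ by connector version).
        FlinkPulsarSource<String> source = new FlinkPulsarSource<>(
                serviceUrl, adminUrl, new SimpleStringSchema(), props);

        DataStream<String> stream = env.addSource(source);

        // ... business logic; a FlinkPulsarSink configured with a target topic and a
        // TopicKeyExtractor would then be attached via addSink() to write back to Pulsar.

        env.execute("pulsar-flink-source-sketch");
    }
}
```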

How are Pulsar topic consumption and Flink wired together? As shown in the figure below, when a FlinkPulsarSource is created, a reader object is created for each partition of the topic. Note that under the hood the Pulsar Flink Connector consumes with the reader API: a reader is created first, and this reader uses a Pulsar non-durable cursor. A reader commits immediately after reading each message, so on monitoring dashboards you may see that the subscription corresponding to the reader shows no backlog.

In Pulsar 2.4.2, when a topic is subscribed with a non-durable cursor, data written by producers is not kept in the broker cache, so a large number of read requests fall through to BookKeeper and reading efficiency drops. BIGO fixed this problem in Pulsar 2.5.1.

After the reader subscribes to the Pulsar topic and consumes its data, how does Flink guarantee exactly-once semantics? The Pulsar Flink Connector uses a second, independent subscription, which uses a durable cursor. When Flink triggers a checkpoint, the Pulsar Flink Connector checkpoints the reader's state (including the consumption position of each Pulsar topic partition) to files, memory, or RocksDB. When the checkpoint completes, a Notify Checkpoint Complete notification is issued. Upon receiving this notification, the Pulsar Flink Connector commits the current consumption offsets (i.e. message IDs) of all readers to the Pulsar broker under the independent subscription name, and only then is the consumption offset actually recorded.

After the offset commit completes, the Pulsar broker stores the offset information (represented in Pulsar by a cursor) in the underlying distributed storage system, BookKeeper. The benefit is that when the Flink task restarts there are two layers of recovery guarantees. The first case is recovery from the checkpoint: the last consumed message ID is obtained directly from the checkpoint, data is fetched starting from that message ID, and the data stream continues from there. If the task is not restored from a checkpoint, then after restarting it fetches from Pulsar the offset of the last commit under the subscription name and resumes consumption from there. This effectively prevents a corrupted checkpoint from making the entire Flink task unable to start.

The Checkpoint process is shown in the figure below.

Checkpoint N is performed first; when it completes, a Notify Checkpoint Complete is issued. After a certain interval, checkpoint N+1 is performed, followed by another Notify Checkpoint Complete, at which point a commit is performed on the durable cursor and finally committed to the Pulsar topic on the server side. This guarantees exactly-once semantics for checkpoints and also keeps messages retained according to the configured subscription.
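Because the offset commit to Pulsar happens on the checkpoint-complete notification, periodic checkpointing must be enabled on the Flink job. A minimal sketch follows; the 60-second interval is an assumption, not a value from the text.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 s (assumed value) in exactly-once mode; each completed
        // checkpoint triggers the connector's offset commit on the durable cursor.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
    }
}
```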

What problem does topic/partition discovery solve? When a Flink task consumes a topic and partitions are later added to that topic, the task needs to discover the new partitions automatically. How does the Pulsar Flink Connector achieve this? Readers subscribing to topic partitions are independent of each other. Each task manager contains multiple reader threads, and the topic partitions handled by a single task manager are assigned via a hash function. When a new partition is added to the topic, it is mapped onto one of the task managers; once that task manager discovers the new partition, it creates a reader and consumes the new data. Users can adjust the detection frequency with partition.discovery.interval-millis.
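The discovery interval is passed through the source properties. Below is a minimal sketch using the property name mentioned above; the topic name and the 30-second interval are placeholders.

```java
import java.util.Properties;

// Builds source properties with partition discovery enabled (values are placeholders).
public final class DiscoveryProps {
    static Properties sourceProps() {
        Properties props = new Properties();
        props.setProperty("topic", "persistent://public/default/input-topic");
        // Probe for newly added topic partitions every 30 s (interval is an assumption).
        props.setProperty("partition.discovery.interval-millis", "30000");
        return props;
    }
}
```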

To lower the barrier for consuming Pulsar topics from Flink and to let the Pulsar Flink Connector support more of Flink's newer features, the BIGO message queue team added Pulsar Flink SQL DDL (Data Definition Language) support and Flink 1.11 support to the Pulsar Flink Connector. Previously, the official Pulsar Flink SQL only supported the Catalog approach, which was inconvenient for consuming and processing Pulsar topic data via DDL. In the BIGO scenario, most topic data is stored in JSON format, and the JSON schema is not registered in advance, so the data can only be consumed after the topic's DDL is specified in Flink SQL. To address this, BIGO did secondary development on the Pulsar Flink Connector and provided a code framework for consuming, parsing, and processing Pulsar topic data in the form of Pulsar Flink SQL DDL (as shown in the figure below).

In the code on the left, the first step configures consumption of the Pulsar topic: specify the topic's DDL, including fields such as rip, rtime, and uid, followed by the basic consumption configuration, such as the topic name, service-url, and admin-url. After the underlying reader reads a message, it decodes the message according to the DDL and stores the data in the test_flink_sql table. The second step is the usual logical processing (extracting fields from the table, doing joins, and so on); the resulting statistics or other results are then written to HDFS or other systems. The third step extracts the relevant fields and inserts them into a Hive table. Since Flink 1.11 has much better Hive write support than 1.9.1, BIGO also did API compatibility work and a version upgrade so that the Pulsar Flink Connector supports Flink 1.11. BIGO's real-time streaming platform based on Pulsar and Flink is mainly used for real-time ETL processing and AB-test scenarios.
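Since the figure is not reproduced here, below is a hedged sketch of what such a DDL might look like, using the field names (rip, rtime, uid), table name (test_flink_sql), and options (topic, service-url, admin-url) mentioned above. The field types, connector option names, and endpoints are assumptions and depend on the pulsar-flink SQL connector version in use.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PulsarSqlDdlSketch {
    public static void main(String[] args) {
        TableEnvironment tableEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Step 1: declare the Pulsar topic as a table via DDL
        // (field types and connector options are assumptions).
        tableEnv.executeSql(
                "CREATE TABLE test_flink_sql (\n"
                + "  rip   STRING,\n"
                + "  rtime STRING,\n"
                + "  uid   BIGINT\n"
                + ") WITH (\n"
                + "  'connector' = 'pulsar',\n"
                + "  'topic' = 'persistent://public/default/test_flink_sql',\n"
                + "  'service-url' = 'pulsar://localhost:6650',\n"
                + "  'admin-url' = 'http://localhost:8080',\n"
                + "  'format' = 'json'\n"
                + ")");

        // Step 2/3: regular SQL processing over the decoded rows would follow,
        // with results inserted into a Hive table or written to HDFS.
    }
}
```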

Real-time ETL processing scenarios

The real-time ETL processing scenario mainly uses Pulsar Flink Source and Pulsar Flink Sink. In this scenario there are hundreds or even thousands of Pulsar topics, each with its own independent schema. We need to apply routine processing to these topics, such as field conversion and fault-tolerant handling, and then write the results to HDFS. Each topic corresponds to a table on HDFS, so hundreds or thousands of topics map to hundreds or thousands of HDFS tables, each with different fields. This is the real-time ETL scenario we faced.

The difficulty of this scenario lies in the sheer number of topics. If each topic had its own Flink task, the maintenance cost would be too high. We originally wanted to sink the data from Pulsar topics directly to HDFS through the HDFS Sink Connector, but handling the processing logic inside it was very cumbersome. In the end we decided to use one or a few Flink tasks to consume the hundreds of topics: each topic carries its own schema, the readers subscribe to all the topics directly, parse the schemas, apply the processing, and write the processed data to HDFS.

As the program ran, we found that this scheme has a problem of its own: load is not balanced across operators. Some topics carry heavy traffic and others very little, so if topics are mapped to task managers by random hashing, some task managers end up handling high traffic and others almost none, causing serious backlog on some task managers and slowing down the whole Flink pipeline. We therefore introduced the concept of slot groups, grouping topics by their traffic. Traffic also determines the number of topic partitions: when partitions are created, the count is based on traffic, so higher-traffic topics get more partitions and lower-traffic topics fewer. When grouping, low-traffic topics are grouped together, while each high-traffic topic is placed in its own group, which isolates resources well and keeps the overall traffic across task managers balanced.
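The "slot group" described above most plausibly corresponds to Flink's slot sharing groups; below is a minimal sketch of the idea under that assumption, with hypothetical source variables and group names.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public final class SlotGroupSketch {
    // Assigns each group of topics to its own slot sharing group so that one
    // high-traffic topic cannot starve the operators handling low-traffic topics.
    static void wire(StreamExecutionEnvironment env,
                     SourceFunction<String> highTrafficSource,  // hypothetical sources
                     SourceFunction<String> lowTrafficSource) {
        DataStream<String> high = env.addSource(highTrafficSource)
                .slotSharingGroup("high-traffic-topic");  // one busy topic per group
        DataStream<String> low = env.addSource(lowTrafficSource)
                .slotSharingGroup("low-traffic-topics");  // many quiet topics share a group

        // ... schema parsing and HDFS sinks would follow, kept in the same groups.
    }
}
```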

AB-test scenario

The real-time data warehouse needs to provide hourly or daily tables for data analysts and recommendation algorithm engineers to query. Simply put, the app contains many event-tracking points, and events of various types are reported to the server. If the raw events were exposed directly to the business side, different business users would have to access many raw tables, extract data from different dimensions, and perform correlated computations across those tables. Frequent extraction and correlation on the underlying base tables would seriously waste computing resources, so we extract the dimensions users care about from the base tables in advance and merge multiple tracking events into one or more wide tables, covering 80% to 90% of the tasks in the recommendation and data analysis scenarios above.

The real-time data warehouse scenario also needs real-time intermediate tables. Our solution is to use Pulsar Flink SQL to parse the consumed data into the corresponding tables for topics A through K. Normally, the common way to aggregate multiple tables into one is a join, for example joining tables A through K on uid to form a very wide table; but joining multiple wide tables in Flink SQL is inefficient. BIGO therefore uses union instead of join to build a wide view, emits the view hourly, writes it to ClickHouse, and provides it to downstream business parties for real-time queries. Using union instead of join to accelerate table aggregation keeps the output of the hourly intermediate tables at the minute level.
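A hedged illustration of the union-instead-of-join idea, with hypothetical table and column names: each per-topic table is padded with NULL columns for the fields it does not carry, so a UNION ALL yields one wide view that can later be aggregated by uid instead of joining the tables directly.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class UnionWideViewSketch {
    public static void main(String[] args) {
        TableEnvironment tableEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // table_a / table_b stand in for the per-topic tables A..K (assumed to be
        // declared via DDL as above); column names are hypothetical.
        Table wide = tableEnv.sqlQuery(
                "SELECT uid, watch_time, CAST(NULL AS BIGINT) AS click_cnt FROM table_a\n"
                + "UNION ALL\n"
                + "SELECT uid, CAST(NULL AS BIGINT) AS watch_time, click_cnt FROM table_b");
        tableEnv.createTemporaryView("wide_view", wide);

        // An hourly aggregation over wide_view (e.g. GROUP BY uid) is then written to
        // ClickHouse, instead of joining table_a..table_k directly.
    }
}
```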

The daily output table may also need to be joined with tables stored in Hive or offline tables on other storage media, i.e. the problem of joining a streaming table with an offline table. A direct join would require storing a relatively large intermediate state in the checkpoint, so we optimized along another dimension.

The left part is similar to the hourly table: each topic is consumed with Pulsar Flink SQL and converted into the corresponding table, the tables are unioned, and the unioned result is written into HBase on a daily basis (HBase is introduced here to replace the join).

On the right side, the offline data needs to be joined in: Spark aggregates the offline Hive tables (such as tables a1, a2, and a3), and the aggregated data is written into HBase through a carefully designed row key. After aggregation, the data looks like this: suppose the keys of the data from the left fill the first 80 columns of the wide table; the data computed by the subsequent Spark task, corresponding to the same keys, fills the remaining 20 columns, forming a large wide table in HBase. The final data is then read back out of HBase and written into ClickHouse for upper-layer users to query. This is the overall structure of the AB-test pipeline.

Business benefits

Since going into production in May 2020, Pulsar has been running stably, processing tens of billions of messages per day with an inbound byte rate of 2 to 3 GB/s. The high throughput, low latency, and high reliability provided by Apache Pulsar have greatly improved BIGO's message processing capability, reduced message queue operation and maintenance costs, and saved nearly 50% of hardware costs. At present we have deployed hundreds of Pulsar broker and bookie processes on dozens of physical hosts, co-locating a bookie and a broker on each node. We have migrated ETL from Kafka to Pulsar and are gradually migrating the services that consume the Kafka clusters in production (such as Flink, Flink SQL, and ClickHouse) to Pulsar. As more businesses migrate, the traffic on Pulsar will continue to grow.

Our ETL jobs involve more than 10,000 topics; each topic has an average of 3 partitions and uses a 3-replica storage policy. With Kafka, as the number of partitions grows, disk access gradually degrades from sequential to random reads and writes and read/write performance drops severely. Apache Pulsar's tiered storage design can easily support millions of topics and provides elegant support for our ETL scenario.

Future outlook

BIGO has done a lot of work on Pulsar broker load balancing, broker cache hit rate optimization, broker-related monitoring, BookKeeper read/write performance, BookKeeper disk IO performance, and the integration of Pulsar with Flink and Flink SQL. This has improved Pulsar's stability and throughput and lowered the barrier to combining Flink and Pulsar, laying a solid foundation for promoting Pulsar adoption.

In the future, we will expand Pulsar's use in more scenarios at BIGO and help the community further optimize and improve Pulsar's functionality, as follows:

  1. Develop new features for Apache Pulsar, such as support for topic-policy related features.
  2. Migrate more tasks to Pulsar. This involves two aspects: first, migrating tasks that previously used Kafka to Pulsar; second, connecting new business directly to Pulsar.
  3. BIGO plans to use KoP to ensure a smooth data migration. Because BIGO has a large number of Flink tasks consuming Kafka clusters, we want to add a KoP layer directly in Pulsar to simplify the migration process.
  4. Continue performance optimization of Pulsar and BookKeeper. Given the high traffic in the production environment, BIGO has high requirements for system reliability and stability.
  5. Continue to optimize BookKeeper's IO stack. Pulsar's underlying storage is an IO-intensive system; ensuring high IO throughput at the bottom layer raises throughput at the upper layers and keeps performance stable.

Author profile

Chen Hang, Apache Pulsar Committer and head of the BIGO big data messaging platform team, is responsible for building and developing a centralized publish-subscribe messaging platform that serves large-scale services and applications. He introduced Apache Pulsar to the BIGO messaging platform and connected it with upstream and downstream systems such as Flink, ClickHouse, and other real-time recommendation and analysis systems. He is currently mainly responsible for Pulsar performance tuning, new feature development, and Pulsar ecosystem integration.


