About Apache Pulsar
Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight functional computing. Its design separates computing from storage, and it supports multi-tenancy, persistent storage, and multi-datacenter, cross-region data replication, offering strong consistency, high throughput, low latency, and high scalability.
GitHub address: http://github.com/apache/pulsar/

Project background

Founded in 2005, Lakala Payment is a leading third-party payment company in China, committed to integrating information technology to serve offline businesses, starting from payments and empowering small, medium, and micro merchants across the board. In 2011 it became one of the first companies to obtain a Payment Business License, and in the first half of 2019 it served more than 21 million merchants. On April 25, 2019, it was listed on the Growth Enterprise Market (ChiNext).

Functional Requirements

Lakala has a large number of project teams, and during construction each project selected its own messaging system according to its needs. This caused two problems: on the one hand, the business logic of many systems became coupled to a specific messaging system, making subsequent maintenance and upgrades troublesome; on the other hand, teams differed in how they managed and used their messaging systems, so overall service quality and performance were unstable. In addition, maintaining multiple systems at the same time drove up physical resource and management costs.

We therefore planned to build a distributed basic messaging platform to serve all teams at once. The platform needed the following characteristics: high reliability, low coupling, tenant isolation, easy horizontal scaling, easy operation and maintenance, unified management, and on-demand provisioning, while supporting both traditional message queues and streaming queues. Table 1 shows the characteristics these two types of service should have.

Table 1. Requirements for message queues and streaming queues

Why choose Apache Pulsar

Open-source backing from major companies

There are many mature open-source messaging platforms to choose from, and most share similar architecture designs: Kafka and RocketMQ, for example, both use an architecture that integrates storage and computing, while only Pulsar uses a multi-layer architecture that separates the two. We shortlisted three messaging systems: Kafka, RocketMQ, and Pulsar. Before testing, we made a rough comparison of their performance and features using publicly available data; Table 2 shows the results. Pulsar clearly fits our needs best.

Table 2. Performance and feature comparison of Kafka, RocketMQ, and Pulsar

Pulsar's architectural advantages

Pulsar is a cloud-native distributed messaging and streaming platform born at Yahoo!, where it supported Yahoo! applications, served 1.4 million topics, and processed more than 100 billion messages per day. Yahoo! open-sourced Pulsar in 2016 and donated it to the Apache Software Foundation; in 2018 Pulsar became a top-level Apache project.

As a high-performance solution, Pulsar has the following features: multi-tenancy support, where each tenant can have its own authentication mechanism, storage quota, isolation strategy, and so on; high throughput, low latency, and high fault tolerance; native support for multi-cluster deployment with seamless data replication between clusters; high scalability, supporting millions of topics; multi-language clients, such as Java, Go, Python, and C++; and multiple message subscription modes (exclusive, shared, failover, and key_shared).
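As a quick illustration of the subscription modes, here is a minimal Java client sketch; the service URL, topic, and subscription names are placeholders of ours, not from the original platform:

    import java.util.concurrent.TimeUnit;
    import org.apache.pulsar.client.api.*;

    public class SubscriptionModeDemo {
        public static void main(String[] args) throws Exception {
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650") // assumed local broker
                    .build();

            // Failover subscription: one active consumer at a time,
            // the others stand by and take over on failure.
            Consumer<byte[]> consumer = client.newConsumer()
                    .topic("persistent://public/default/demo-topic")
                    .subscriptionName("demo-sub")
                    .subscriptionType(SubscriptionType.Failover) // or Exclusive / Shared / Key_Shared
                    .subscribe();

            Message<byte[]> msg = consumer.receive(5, TimeUnit.SECONDS);
            if (msg != null) {
                consumer.acknowledge(msg);
            }
            consumer.close();
            client.close();
        }
    }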

A sound architecture

Kafka uses an integrated computing-and-storage architecture; when the number of topics is large, Kafka's storage mechanism causes cache pollution and degrades performance. Pulsar instead separates computing from storage (see Figure 1). The stateless computing layer consists of a group of brokers that receive and deliver messages; brokers communicate with business systems and handle protocol conversion, serialization and deserialization, and leader election. The stateful storage layer consists of a group of bookie storage nodes that store messages durably.

Figure 1. Pulsar architecture diagram

Broker architecture

The broker consists mainly of four modules, each of which we can extend with custom development as needed.

  • Dispatcher: the scheduling and distribution module, responsible for protocol conversion, serialization and deserialization, and so on.
  • Load balancer: the load-balancing module, which controls and manages access traffic.
  • Global replicator: the cross-cluster replication module, which handles asynchronous message synchronization between clusters.
  • Service discovery: the service-discovery module, which looks up a stateless owner node for each topic.

Figure 2. Broker architecture diagram

Persistence layer (BookKeeper) architecture

Figure 3 shows the architecture of Pulsar's persistence layer. Bookies are BookKeeper's storage nodes and provide independent storage services; ZooKeeper stores metadata and provides service discovery and metadata management. BookKeeper uses a typical "slave-slave" architecture: all bookie nodes act as slaves responsible for persisting data, and every node runs the same processing logic, while the BookKeeper client plays the leader role and handles coordination. Because the client is stateless, failover can happen quickly.

Figure 3. Pulsar persistence-layer architecture diagram

Isolation architecture

Pulsar's excellent performance rests mainly on the following mechanisms:

  • IO isolation: writes, tailing reads, and catch-up reads are isolated from one another.
  • High-throughput writes that exploit inbound network bandwidth and sequential disk writes: traditional disks offer high bandwidth for sequential writes, while scattered random reads and writes reduce effective disk bandwidth, so writing sequentially improves performance.
  • High-throughput reads that exploit outbound network bandwidth and the IOPS of multiple disks: after data is received, it is first written to a faster SSD as a first-level cache, and an asynchronous thread then moves it to conventional HDDs, reducing storage cost.
  • Low-latency delivery through caching at every level: when a producer sends a message, it is written to the broker cache; for real-time consumption (tailing reads), data is read from the broker cache first, avoiding reads from the persistence layer (bookies) and reducing delivery latency. When historical messages are read (catch-up reads), bookies load disk messages into the bookie read cache, avoiding a disk read on every access and reducing read latency.

Figure 4. Pulsar isolation architecture diagram

Comparison summary

The left side of Figure 5 shows the architecture used by messaging systems such as Kafka and RabbitMQ, in which the broker node is responsible for both computing and storage. In some scenarios this architecture achieves high throughput, but as the number of topics grows the cache becomes polluted and performance suffers.

The right side shows Pulsar's architecture: the broker is split out and a BookKeeper persistence layer is added. Although this increases the system's design complexity, it reduces coupling and makes features such as scaling and failover easier to implement. Table 3 summarizes the main characteristics of the partition-based and segment-based architectures.

Figure 5. Partition-based versus segment-based architecture

Table 3. Characteristics of partition-based and segment-based architectures

Based on Pulsar's architecture and features, we put it to the test. At the operating-system level we monitored with NetData and ran stress tests using packets of different sizes and frequencies. The key indicators we watched were fluctuations in disk and network bandwidth.

Figure 6. Pulsar testing process

The test conclusions are as follows:

  • Deployment: co-located deployment beat separate deployment in our case. Brokers and bookies can be deployed on the same node or on separate nodes. With many nodes, separate deployment is better; with few nodes or high performance requirements, deploying both on the same node saves network bandwidth and reduces latency.
  • Load size: as the test load increases, TPS decreases while throughput remains stable.
  • Flush method: asynchronous disk flush outperforms synchronous flush.
  • Compression algorithm: we recommend LZ4. We tested the compression codecs that ship with Pulsar, and LZ4 had the lowest CPU usage. Using compression reduces network bandwidth consumption, with a compression ratio of 82% in our tests (see the producer sketch after this list).
  • Number of partitions: if a single topic does not hit the physical resource limits of a single node, we recommend a single partition; since Pulsar's storage is not coupled to partitions, the partition count can be adjusted at any time as the business grows.
  • Number of topics: during stress testing, increasing the number of topics did not affect performance.
  • Resource constraints: with gigabit network bandwidth, the network becomes the bottleneck, with network IO reaching 880 MB/s; with 10-gigabit bandwidth, the disk becomes the bottleneck, with disk IO utilization at around 85%.
  • Memory and threads: on physical hosts, pay attention to the ratio of memory to thread count. By default the number of IO threads equals twice the number of CPU cores; our physical machines had 48 cores, so if memory is set too small, OOM problems occur easily.
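A minimal Java producer sketch showing the LZ4 setting referenced above; the service URL and topic name are illustrative:

    import org.apache.pulsar.client.api.*;

    public class Lz4ProducerDemo {
        public static void main(String[] args) throws Exception {
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650")
                    .build();

            // LZ4 had the lowest CPU usage among the built-in codecs in our tests.
            Producer<byte[]> producer = client.newProducer()
                    .topic("persistent://public/default/bench-topic")
                    .compressionType(CompressionType.LZ4)
                    .create();

            producer.send("payload".getBytes());
            producer.close();
            client.close();
        }
    }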

In addition to the tests above, we also reproduced the destructive test cases published by Jack Vanlightly (a test engineer at RabbitMQ) and reached the following conclusions:

  1. In all test scenarios, no messages were lost or reordered;
  2. With message deduplication enabled, no messages were duplicated.

A professional support team

In addition, we communicated early on with the core developers of the Apache Pulsar project. They have rich hands-on experience from Yahoo! and Twitter and were preparing to found a company to promote Pulsar worldwide, with China as their most important base, which gave us strong assurance. As everyone now knows, they went on to found StreamNative, which has raised multiple funding rounds, and the team keeps growing.

Pulsar in practice on our basic messaging platform

The architecture of our Pulsar-based basic messaging platform is shown in the figure below; the green parts are functions or components we developed on top of Pulsar. This section walks through our actual usage scenarios and how we apply Pulsar and the components we built on it.

Figure 7. Architecture of the basic messaging platform built on Pulsar

Scenario 1: Streaming queue

1. OGG For Pulsar Adapter

The source data resides in Oracle. We want to capture Oracle's change data in real time, perform real-time computation and data analysis, and serve query scenarios for downstream business systems.

We use Oracle's OGG (Oracle GoldenGate) tool for real-time capture; it comprises two modules, the source-side OGG and the target-side OGG. Since OGG does not officially provide a sink-to-Pulsar component, we developed the OGG For Pulsar component ourselves. The figure below shows the data flow. OGG captures every insert, update, and delete on each record in a table and pushes each operation as a message to the OGG For Pulsar component, which calls the Pulsar client's producer interface to deliver it. Message order must be strictly preserved during delivery, so we use the table's primary key as the message key; when data volume is large, the topic can be partitioned by key, and messages with the same key are routed to the same partition, which preserves the order of inserts, updates, and deletes on records sharing a primary key.
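A minimal sketch of keyed delivery with the Pulsar Java client; the topic, primary-key value, and payload are illustrative placeholders:

    import org.apache.pulsar.client.api.*;

    public class KeyedDeliveryDemo {
        public static void main(String[] args) throws Exception {
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650")
                    .build();

            Producer<byte[]> producer = client.newProducer()
                    .topic("persistent://public/default/ogg-changes")
                    .create();

            // Use the table's primary key as the message key so that all
            // changes to the same row land in the same partition, in order.
            String primaryKey = "order-10086";        // illustrative primary key
            byte[] changeEvent = "{...}".getBytes();  // serialized change record

            producer.newMessage()
                    .key(primaryKey)
                    .value(changeEvent)
                    .send();

            producer.close();
            client.close();
        }
    }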

Figure 8. OGG For Pulsar component diagram

2. Pulsar To TiDB component

We persist the captured change messages into TiDB through the Pulsar To TiDB component and provide query services to downstream systems. The component's processing logic is as follows (a consumer sketch follows the list):

  1. Consume Pulsar messages using the failover subscription mode.
  2. Hash each message by its key, so that messages with the same key always go to the same persistence thread.
  3. Enable Pulsar's message deduplication to avoid duplicate delivery; if, say, MessageID 2 were delivered twice, data consistency would be broken.
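A minimal sketch of steps 1 and 2, with illustrative names; the real component adds deduplication handling and the TiDB write path:

    import java.util.concurrent.*;
    import org.apache.pulsar.client.api.*;

    public class PulsarToTidbSketch {
        private static final int THREADS = 4;
        // One single-threaded executor per hash slot preserves per-key ordering.
        private static final ExecutorService[] workers = new ExecutorService[THREADS];

        public static void main(String[] args) throws Exception {
            for (int i = 0; i < THREADS; i++) {
                workers[i] = Executors.newSingleThreadExecutor();
            }
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650")
                    .build();

            Consumer<byte[]> consumer = client.newConsumer()
                    .topic("persistent://public/default/ogg-changes")
                    .subscriptionName("tidb-sink")
                    .subscriptionType(SubscriptionType.Failover) // failover subscription
                    .subscribe();

            while (true) {
                Message<byte[]> msg = consumer.receive();
                // Same key -> same slot -> same persistence thread (key assumed present).
                int slot = Math.floorMod(msg.getKey().hashCode(), THREADS);
                workers[slot].submit(() -> {
                    // writeToTidb(msg) would go here (omitted),
                    // then acknowledge only after a successful write.
                    consumer.acknowledgeAsync(msg);
                });
            }
        }
    }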

Figure 9. Pulsar To TiDB component flow diagram

3. Pulsar's message persistence process analysis

Pulsar's message persistence process includes the following four steps:

  1. The OGG For Pulsar component calls the producer interface of the Pulsar client to deliver messages.
  2. The Pulsar client picks one broker from the broker address list in its configuration, then queries the topic-ownership lookup service to obtain the address of the broker that serves the topic (broker2 in the example below).
  3. The Pulsar client delivers the message to Broker2.
  4. Broker2 calls the BookKeeper client for persistent storage. The storage strategy specifies how many bookies may be chosen for the write, the number of copies, and how many successful acknowledgments are required; a small admin sketch follows.
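The storage strategy in step 4 corresponds to Pulsar's namespace-level persistence policies (ensemble, write quorum, ack quorum). A hedged Java admin sketch; the namespace name, admin URL, and values are illustrative:

    import org.apache.pulsar.client.admin.PulsarAdmin;
    import org.apache.pulsar.common.policies.data.PersistencePolicies;

    public class PersistencePolicyDemo {
        public static void main(String[] args) throws Exception {
            PulsarAdmin admin = PulsarAdmin.builder()
                    .serviceHttpUrl("http://localhost:8080") // assumed admin endpoint
                    .build();

            // ensemble=3: each ledger may be spread over 3 bookies;
            // writeQuorum=3: each entry is written to 3 of them;
            // ackQuorum=2: 2 acks are enough to consider the write successful.
            admin.namespaces().setPersistence("public/default",
                    new PersistencePolicies(3, 3, 2, 0.0));

            admin.close();
        }
    }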

Figure 10. Pulsar message persistence diagram

4. Dynamically propagating the database table structure

When OGG serializes with AVRO and multiple tables are delivered to the same topic, the AVRO schema has a two-level structure: a wrapper schema and a table schema. The wrapper schema is always the same and contains three fields: table_name, schema_fingerprint, and payload. When OGG captures data, it detects changes in a table's structure and notifies OGG For Pulsar; the table structure determines its table schema, and the table schema in turn generates the corresponding schema_fingerprint.

We publish each obtained table schema to a dedicated Schema topic for storage; messages in the Data topic then carry only the schema_fingerprint, which shrinks the serialized message size. When Pulsar To TiDB starts, it consumes the Schema topic and caches the table schemas in memory, keyed by schema_fingerprint. When deserializing a message from the Data topic, it fetches the table schema from the cache by schema_fingerprint and deserializes the payload with it.
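A minimal sketch of the fingerprint-keyed schema cache using the Avro Java API; the class structure and method names are illustrative:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.DecoderFactory;

    public class SchemaCache {
        // schema_fingerprint -> parsed table schema, filled from the Schema topic.
        private final Map<String, Schema> cache = new ConcurrentHashMap<>();

        public void onSchemaMessage(String fingerprint, String schemaJson) {
            cache.put(fingerprint, new Schema.Parser().parse(schemaJson));
        }

        // Deserialize a Data-topic payload using the schema looked up by fingerprint.
        public GenericRecord decode(String fingerprint, byte[] payload) throws Exception {
            Schema schema = cache.get(fingerprint);
            GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
            return reader.read(null, decoder);
        }
    }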

Figure 11. Table-schema management flow diagram

5. Consistency Guarantee

To guarantee message ordering and deduplication, settings are needed in three places: the broker, the producer, and the consumer.

Broker
  • Enable deduplication at the namespace level: bin/pulsar-admin namespaces set-deduplication namespace --enable
  • Fix/work around the deadlock problem in the Pulsar client; this is fixed in version 2.7.1. For more information, see PR 9552.
Producer
  • pulsar.producer.batchingEnabled=false

In the producer settings, turn off batching; with batch sending enabled, messages may be delivered out of order.

  • pulsar.producer.blockIfQueueFull=true

For efficiency we send messages asynchronously, so blocking on a full queue must be enabled; otherwise messages may be lost.

When an asynchronous send times out, the message is routed to an exception topic. Duplicates produced by resending after an asynchronous timeout can be handled by the automatic deduplication feature; send timeouts in other situations need separate handling, so we store those messages in the exception topic and a reconciliation program later fetches the final-state data directly from the source database. The producer settings above combine roughly as in the sketch below.
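A minimal sketch of the producer side against the Pulsar Java client; the topic name is illustrative, and in our platform these values actually live in configuration as the pulsar.producer.* keys shown above:

    import java.util.concurrent.TimeUnit;
    import org.apache.pulsar.client.api.*;

    public class ConsistentProducerDemo {
        public static void main(String[] args) throws Exception {
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650")
                    .build();

            Producer<byte[]> producer = client.newProducer()
                    .topic("persistent://public/default/orders")
                    .enableBatching(false)   // batching may reorder messages
                    .blockIfQueueFull(true)  // block instead of failing on a full queue
                    .sendTimeout(3, TimeUnit.SECONDS)
                    .create();

            producer.sendAsync("payload".getBytes())
                    .exceptionally(e -> {
                        // On timeout/failure, route the message to an exception
                        // topic for later reconciliation (omitted here).
                        return null;
                    });

            producer.flush();
            producer.close();
            client.close();
        }
    }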

Consumer

  • Implement an interceptor: ConsumerInterceptorImpl implements ConsumerInterceptor
  • Configure the acknowledgment timeout: ackTimeout(3000, TimeUnit.MILLISECONDS).ackTimeoutTickTime(500, TimeUnit.MILLISECONDS)
  • Use cumulative acknowledgment: consumer.acknowledgeCumulative(sendMessageID)

Note: with the acknowledgment-timeout parameters configured, any message not acknowledged within ackTimeout is redelivered. To strictly guarantee consistency, we must use cumulative acknowledgment; a consumer sketch follows.
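The consumer side, expressed against the Java client's ConsumerBuilder (in the Pulsar API these timeout settings are builder methods; the interceptor class named above is our own and is only referenced in a comment here):

    import java.util.concurrent.TimeUnit;
    import org.apache.pulsar.client.api.*;

    public class ConsistentConsumerDemo {
        public static void main(String[] args) throws Exception {
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650")
                    .build();

            Consumer<byte[]> consumer = client.newConsumer()
                    .topic("persistent://public/default/orders")
                    .subscriptionName("tidb-sink")
                    .subscriptionType(SubscriptionType.Failover)
                    .ackTimeout(3000, TimeUnit.MILLISECONDS)        // redeliver if unacked
                    .ackTimeoutTickTime(500, TimeUnit.MILLISECONDS) // polling interval
                    // .intercept(new ConsumerInterceptorImpl())    // our interceptor plugs in here
                    .subscribe();

            Message<byte[]> msg = consumer.receive();
            // Cumulative ack: confirms this message and everything before it.
            consumer.acknowledgeCumulative(msg);

            consumer.close();
            client.close();
        }
    }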

6. Acknowledging message consumption

Suppose the message with MessageID 1 has already been acknowledged and cumulative acknowledgment is in use. When the message with MessageID 3 is acknowledged, the message with MessageID 2, which was consumed but not yet individually acknowledged, is confirmed along with it. If no acknowledgment arrives within the acknowledgment timeout, the messages with MessageID 2, 3, 4, and 5 are redelivered in their original order.

Figure 12. Message acknowledgment flow (1)

With individual acknowledgment, suppose the messages with MessageID 1, 3, and 4 in the figure are acknowledged successfully while the message with MessageID 2 times out. If the application does not handle this properly and does not acknowledge messages one by one in consumption order, then when the timeout occurs only the timed-out message (MessageID 2) is redelivered, and the consumption order becomes confused.

Figure 12. Message acknowledgment flow (2)

Summary: individual acknowledgment is recommended for queue-style consumption, and cumulative acknowledgment for stream-style consumption.

7. Client-side acknowledgment-timeout detection mechanism

The acknowledgment-timeout mechanism has two parameters: the timeout period and the polling interval (tick time). The detection mechanism is implemented with a double-ended queue plus multiple HashSets; the number of HashSets is the timeout divided by the polling interval, rounded. Each polling tick processes only one HashSet, which effectively avoids the performance cost of a global lock.
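A simplified sketch of that bucketing idea (not Pulsar's actual implementation): unacknowledged message IDs enter the newest bucket, and each tick retires the oldest bucket for redelivery, so no per-message timer or global lock is needed:

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Set;

    public class AckTimeoutTracker<T> {
        private final ArrayDeque<Set<T>> buckets = new ArrayDeque<>();

        public AckTimeoutTracker(long timeoutMs, long tickMs) {
            // Number of buckets = timeout / tickTime, rounded up.
            int n = (int) Math.ceil((double) timeoutMs / tickMs);
            for (int i = 0; i < n; i++) {
                buckets.addLast(new HashSet<>());
            }
        }

        // Called when a message is delivered but not yet acknowledged.
        public synchronized void track(T messageId) {
            buckets.peekLast().add(messageId);
        }

        // Called when a message is acknowledged.
        public synchronized void ack(T messageId) {
            for (Set<T> b : buckets) {
                if (b.remove(messageId)) {
                    return;
                }
            }
        }

        // Called once per tick: the oldest bucket has timed out; redeliver its contents.
        public synchronized Set<T> tick() {
            Set<T> expired = buckets.pollFirst();
            buckets.addLast(new HashSet<>());
            return expired;
        }
    }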

Figure 13. Client-side acknowledgment-timeout detection mechanism

Scenario 2: Message queue: OpenMessaging protocol implementation (transparent layer protocol)

Many of our legacy business systems were strongly coupled to their messaging systems, which made subsequent upgrades and maintenance troublesome, so we decided to use the OpenMessaging protocol as a decoupling middle layer.

  1. Implement the OpenMessaging protocol through Pulsar.
  2. The development framework (based on Spring Boot) calls the OpenMessaging interface to send and receive messages (a sketch of the decoupling idea follows).
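A hypothetical sketch of what the transparent layer buys us: business code depends on a neutral interface, and the Pulsar binding is swapped in by the framework. All names here are illustrative, not the actual OpenMessaging API:

    // A neutral messaging interface the business code depends on (hypothetical).
    public interface MessagePublisher {
        void publish(String topic, byte[] payload) throws Exception;
    }

    // One binding of the interface, backed by Pulsar; another messaging
    // system could be swapped in without touching business code.
    class PulsarMessagePublisher implements MessagePublisher {
        private final org.apache.pulsar.client.api.PulsarClient client;

        PulsarMessagePublisher(String serviceUrl) throws Exception {
            this.client = org.apache.pulsar.client.api.PulsarClient.builder()
                    .serviceUrl(serviceUrl)
                    .build();
        }

        @Override
        public void publish(String topic, byte[] payload) throws Exception {
            // A real binding would cache producers instead of creating one per call.
            try (org.apache.pulsar.client.api.Producer<byte[]> producer =
                         client.newProducer().topic(topic).create()) {
                producer.send(payload);
            }
        }
    }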

Figure 14. Transparent-layer protocol flow diagram

Scenario 3: Streaming queue: custom Kafka 0.8 source (source development)

Pulsar IO makes it easy to connect Pulsar to various data platforms. Some of our business systems still use Kafka 0.8, for which no official source connector exists, so we developed a Kafka 0.8 source component against the Pulsar IO interface definition; a skeleton follows.
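A skeleton of a custom source against the Pulsar IO Source interface; the Kafka 0.8 consumption details are elided, and the class name and internals are illustrative:

    import java.util.Map;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import org.apache.pulsar.functions.api.Record;
    import org.apache.pulsar.io.core.Source;
    import org.apache.pulsar.io.core.SourceContext;

    public class Kafka08Source implements Source<byte[]> {
        private final BlockingQueue<Record<byte[]>> queue = new LinkedBlockingQueue<>();

        @Override
        public void open(Map<String, Object> config, SourceContext context) {
            // Start the Kafka 0.8 high-level consumer threads here; each
            // received message is wrapped as a Record and put on the queue.
        }

        @Override
        public Record<byte[]> read() throws Exception {
            // Pulsar IO pulls records one at a time; block until one arrives.
            return queue.take();
        }

        @Override
        public void close() {
            // Shut down the Kafka consumer connector here.
        }
    }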

Figure 15. Kafka 0.8 Source component diagram

Scenario 4: Streaming queue: Function message filtering (message filtering)

We use Pulsar Functions to desensitize sensitive fields (such as ID numbers and mobile phone numbers) in messages in the Pulsar IDC cluster and synchronize them to the cloud cluster in real time for consumption by cloud applications; a minimal function sketch follows.
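A minimal sketch of such a desensitization function using the Pulsar Functions Java API; the masking rule and field layout are illustrative, not our production logic:

    import org.apache.pulsar.functions.api.Context;
    import org.apache.pulsar.functions.api.Function;

    public class MaskingFunction implements Function<String, String> {
        @Override
        public String process(String input, Context context) {
            // Mask anything that looks like an 11-digit mobile number,
            // keeping the first 3 and last 4 digits.
            return input.replaceAll("(\\d{3})\\d{4}(\\d{4})", "$1****$2");
        }
    }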

Figure 16. Pulsar Functions message filtering diagram

Scenario 5: Streaming queue: Pulsar Flink Connector streaming computing (streaming computing)

In the merchant business-analysis scenario, Flink connects to Pulsar through the Pulsar Flink Connector, computes over the pipeline data in real time along different dimensions, and then persists the results to TiDB through Pulsar. In our usage so far, the Pulsar Flink Connector's performance and stability have both been good. A rough job sketch follows.
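A rough sketch assuming the StreamNative pulsar-flink connector; class names and constructor signatures differ across connector versions, so treat the names below as assumptions rather than a definitive API:

    import java.util.Properties;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.pulsar.FlinkPulsarSource;
    import org.apache.flink.streaming.util.serialization.PulsarDeserializationSchema;

    public class PulsarFlinkJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("topic", "persistent://public/default/merchant-events");

            // Read the pipeline data from Pulsar.
            FlinkPulsarSource<String> source = new FlinkPulsarSource<>(
                    "pulsar://localhost:6650",  // broker service URL (placeholder)
                    "http://localhost:8080",    // admin URL (placeholder)
                    PulsarDeserializationSchema.valueOnly(new SimpleStringSchema()),
                    props);

            env.addSource(source)
               // ... real-time aggregation by business dimensions would go here ...
               .print(); // stand-in for the Pulsar sink that feeds TiDB

            env.execute("merchant-analysis");
        }
    }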

Figure 17. Pulsar Flink Connector streaming computation diagram

Scenario 6: Streaming queue: TiDB CDC adaptation (TiDB adaptation)

We need to capture TiDB data changes in real time, but TiDB CDC For Pulsar's serialization does not support AVRO, so we customized this path: the data emitted by TiDB is first wrapped, then delivered into Pulsar. The TiDB CDC For Pulsar component is written in Go.

Figure 18. TiDB CDC For Pulsar component diagram

Future plans

Our Pulsar-based basic messaging platform has improved the efficiency of our physical resource usage; running a single messaging platform simplifies maintenance and upgrades, and overall service quality has improved as well. Our future plans for Pulsar mainly include the following two points:

  1. Continue decommissioning the other messaging systems until everything runs on the Pulsar basic messaging platform;
  2. Make deeper use of Pulsar's resource isolation and flow-control mechanisms.

In practice, thanks to Pulsar's many native features and the components we developed on top of it, the new messaging platform has fully met our expected functional requirements.

About the author
Jiang Dianjing is an architect in the Infrastructure Department, responsible for building and operating the basic messaging platform and its ecosystem. He worked with the team to introduce Apache Pulsar into Lakala's core architecture, achieving successful practice in both strongly consistent streaming-consumption and queue-consumption scenarios. He is currently mainly responsible for Pulsar performance tuning, new feature development, and Pulsar ecosystem integration.

