Blog post | Technical practice of Apache Pulsar in a self-developed data pipeline

About Apache Pulsar

Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight function-based computing. Its architecture separates computing from storage, and it supports multi-tenancy, persistent storage, and data replication across data centers and regions. As a streaming data store it offers strong consistency, high throughput, low latency, and high scalability.
GitHub address: http://github.com/apache/pulsar/

The author of this article is Jiang Moujing, a senior R&D engineer at Yipin Fresh Food. He leads the design and development of the data pipeline system, which uses Apache Pulsar as its data synchronization tool and implements several application scenarios for incremental data synchronization. He plans to turn the data pipeline into a visual platform and to add support for more database types.

Background

A data pipeline moves data from one place to another through a transmission medium, so that data can be synchronized or replicated to meet application requirements. With the substantial growth in business volume and data volume, our existing microservices need to be split further.

How can the system be split without users noticing? When the new service goes live, a traffic-diversion strategy routes some users to it, and the old and new systems run in parallel for a period of time, from trial operation of the new service through to full rollout, which minimizes production failures. To keep the new service's data consistent with the old system's data in real time, the data must be synchronized. In addition, as the amount of data grows significantly, the data can be copied to ElasticSearch to speed up queries.

There are open source data synchronization products and commercial data channel tools on the market that can replicate data in both directions without manual intervention. However, system reconstruction may change table structures or table objects in ways that commercial synchronization tools cannot handle, so developers have to step in. We therefore adopted Maxwell + Pulsar: Maxwell reads the binlog, and Pulsar transports the data. Maxwell + Pulsar handles the upstream data reading, and the downstream business side implements the corresponding synchronization logic. Examples include synchronizing data when a system is reconstructed and split, and read-write separation scenarios in which data is replicated to a search engine such as ElasticSearch.

Why choose Pulsar?

In the system reconstruction of the data pipeline, we chose Apache Pulsar for the following reasons:

Stateless brokers. In a microservice architecture, middleware is best kept stateless: it starts quickly, can be replaced at any time, and scales out seamlessly and elastically. Kafka is not stateless. Each broker stores the complete log of its partitions, so if a broker goes down, not just any broker can take over, and brokers cannot simply be added to share the load; brokers must synchronize their state. In the Pulsar architecture, data is separated from the broker and kept in shared storage. The upper layer is a stateless computing layer (Broker) that is responsible for message distribution and serving, and the lower layer is a persistent storage layer (Bookie). Computing and storage are therefore independent of each other, so each can be scaled independently and data can be recovered quickly.
Pulsar supports both streaming and traditional message-queue semantics, which greatly improves subscription flexibility.
Pulsar's cloud-native architecture makes horizontal, elastic scaling easy and supports cross-region replication.
Pulsar supports partitioned topics, with high throughput and low latency.
The open source community is active, and technical support responds quickly and helpfully.

How does Pulsar ensure ordering in distributed consumption?

First, let's look at Pulsar's subscription model. Pulsar has four subscription modes: Exclusive, Failover, Shared, and Key_Shared. In Exclusive mode, a subscription has only one consumer, which receives all the messages of the topic.

(Figure: consumption strategy in Pulsar Exclusive mode)

In Failover mode, only one consumer is active at any time. The other consumers act as standbys, and one of them takes over when the active consumer becomes unavailable. This mode suits scenarios where the data volume is small and a single point of failure must be avoided.

In Shared mode, multiple consumers can attach to the same subscription. Messages are distributed among the consumers in round-robin fashion, and any given message is delivered to only one consumer. We adopted Shared mode at first because it consumes messages in a distributed way and therefore quickly. In production, however, we found that the source database and the synchronization targets (ElasticSearch, MySQL) frequently diverged and became inconsistent. Investigation showed that messages were being consumed out of order: when a user modifies the same record repeatedly and generates several MQ messages, Shared mode lets multiple consumers process those messages in parallel, so a later change can be applied before an earlier one.
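The out-of-order effect described above can be reproduced with a small simulation. This is only an illustrative sketch that does not use a Pulsar client: the function name `apply_in_shared_mode`, the per-consumer delays, and the "last write wins" target store are all assumptions made for the example.

```python
# Hypothetical simulation of round-robin (Shared-style) dispatch.
# No Pulsar client is involved; all names here are illustrative.
from itertools import cycle


def apply_in_shared_mode(messages, consumer_delays):
    """Dispatch messages round-robin, then apply them to the target in the
    order the consumers finish (arrival time + that consumer's delay)."""
    consumers = cycle(range(len(consumer_delays)))
    finished = []
    for arrival_time, msg in enumerate(messages):
        consumer = next(consumers)
        finish_time = arrival_time + consumer_delays[consumer]
        finished.append((finish_time, arrival_time, msg))

    target = {}
    for _, _, (row_id, value) in sorted(finished):
        target[row_id] = value  # last write wins in the target store
    return target


# Two updates to the same row; the source database ends with "v2".
updates = [(42, "v1"), (42, "v2")]

# Consumer 0 is slow (5 ticks) and consumer 1 is fast (1 tick), so the
# older update "v1" is applied last and the target keeps the stale value.
print(apply_in_shared_mode(updates, consumer_delays=[5, 1]))
```

With equal delays the two updates are applied in their original order, which is why the problem tends to appear only under load or with uneven consumers.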

(Figure: consumption strategy in Pulsar Shared mode)

Pulsar introduced Key_Shared mode, which builds on Shared mode, in version 2.4.0. In Key_Shared mode, multiple consumers can attach to the same subscription and messages are distributed among them, but messages with the same key or the same ordering key are delivered to only one consumer. No matter how many times a message is redelivered, it goes to the same consumer. When a consumer connects or disconnects, the consumer that serves some message keys changes. Key_Shared mode thus guarantees that messages with the same key go to the same consumer, which preserves order while still allowing concurrency.

(Figure: consumption strategy in Pulsar Key_Shared mode)

Data synchronization places very strict requirements on message order. When a user updates a record repeatedly, the corresponding row in the database table is updated repeatedly as well. Under high data volume and high concurrency, the order of the messages generated by those changes must match the order of the operations; otherwise the synchronized copy diverges from the source data and the system fails.

Ordering is a common problem in distributed consumption. To ensure ordered consumption on the client side, we use the Key_Shared subscription mode, which is an extension of Shared mode. Several consumers can consume messages from one partition concurrently, but messages with the same key are routed to only one consumer. The principle is that the target consumer is determined by hashing: each consumer serves a fixed range of hash values, and the consumers' ranges together cover the entire hash range. Specifying the key when producing the message (as shown below) closes the loop, so that messages are stored in order in the designated partition and consumed in order. For the detailed principles and usage, please refer to the official Pulsar website.

key: {"database":"your_db_name","table":"your_table_name","pk.id":"your_table_primary_key"}
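To make the hashing principle concrete, the sketch below builds a key in the format above from a change event and routes it to a consumer by hash range. It is a simplified model under stated assumptions, not Pulsar's implementation: the slot count `HASH_RANGE_SIZE`, the even split in `split_ranges`, the use of `zlib.crc32` as a stand-in hash, and the sample event fields are all chosen for illustration.

```python
# Illustrative model of key-based routing; not Pulsar's actual code.
import json
import zlib

HASH_RANGE_SIZE = 65536  # assumed number of hash slots for this sketch


def build_key(event, pk_columns):
    """Build a message key from database, table, and primary-key values."""
    key = {"database": event["database"], "table": event["table"]}
    for col in pk_columns:
        key[f"pk.{col}"] = event["data"][col]
    return json.dumps(key, separators=(",", ":"))


def split_ranges(consumers):
    """Give each consumer a fixed, contiguous slice of the hash range."""
    step = HASH_RANGE_SIZE // len(consumers)
    ranges = []
    for i, name in enumerate(consumers):
        start = i * step
        end = HASH_RANGE_SIZE if i == len(consumers) - 1 else start + step
        ranges.append((start, end, name))
    return ranges


def route(key, ranges):
    """Hash the key into a slot and return the consumer owning that slot."""
    slot = zlib.crc32(key.encode("utf-8")) % HASH_RANGE_SIZE  # stand-in hash
    for start, end, name in ranges:
        if start <= slot < end:
            return name
    raise ValueError(f"hash ranges do not cover slot {slot}")


# Two hypothetical change events for the same row of the same table.
e1 = {"database": "shop", "table": "orders", "data": {"id": 7, "status": "new"}}
e2 = {"database": "shop", "table": "orders", "data": {"id": 7, "status": "paid"}}

ranges = split_ranges(["consumer-a", "consumer-b", "consumer-c"])
k1, k2 = build_key(e1, ["id"]), build_key(e2, ["id"])
print(k1)
print(route(k1, ranges) == route(k2, ranges))  # same key, same consumer
```

Because the key depends only on the database, table, and primary key, every change to one row hashes to the same slot and is therefore handled by a single consumer, while different rows spread across consumers.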

How to filter duplicate messages?

There are generally three types of message transmission guarantees: At least once, At most once, and Exactly once.

At least once: Each message may be transmitted several times but is delivered at least once, so it may be duplicated but will not be lost;
At most once: Each message is transmitted at most once, so it may be lost;
Exactly once: Each message is transmitted exactly once, so it is neither lost nor duplicated.

In the data synchronization scenario, we want messages to reach their target as reliably as possible, so we use Maxwell's At least once mode. When the network is unreliable, a message may already have reached the target, but if a timeout is returned or the acknowledgement is not received, Pulsar delivers the message again, which produces a duplicate message.
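The way a lost acknowledgement turns into a duplicate can be sketched in a few lines. This is a toy model rather than Maxwell or Pulsar code: `send_at_least_once`, `FlakyTarget`, and the `lost_acks` parameter are hypothetical names introduced only for this example.

```python
# Toy model of at-least-once delivery; all names are hypothetical.


def send_at_least_once(message, deliver, max_attempts=5):
    """Resend until an acknowledgement arrives; return the attempt count."""
    for attempt in range(1, max_attempts + 1):
        if deliver(message):
            return attempt
    raise RuntimeError(f"no acknowledgement after {max_attempts} attempts")


class FlakyTarget:
    """Stores every message it receives but loses the first few acks."""

    def __init__(self, lost_acks):
        self.received = []
        self.lost_acks = lost_acks

    def deliver(self, message):
        self.received.append(message)  # the message did reach the target
        if self.lost_acks > 0:
            self.lost_acks -= 1
            return False  # the acknowledgement was lost on the way back
        return True


target = FlakyTarget(lost_acks=1)
attempts = send_at_least_once("update row 42", target.deliver)
print(attempts, target.received)
```

The sender cannot tell a lost message from a lost acknowledgement, so it has to resend in both cases; the message is never lost, but the target ends up holding two copies.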

(Figure: At least once delivery)

To solve the duplicate-message problem, we added filters to the data link model of the data pipeline. They filter out duplicate, invalid, and retried messages.

(To keep the focus on functionality, the figure omits other components of the microservice architecture, such as logging and distributed tracing.)
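The article does not describe how the filters are implemented. One possible approach, shown here purely as an assumption, is to remember the highest sequence number applied for each message key and drop anything that is not newer. The class `DuplicateFilter` and the monotonically increasing `seq` value are hypothetical and are not fields defined by Maxwell or Pulsar.

```python
# One possible duplicate filter; a sketch, not the authors' implementation.


class DuplicateFilter:
    """Accept a message only if its sequence number is newer than the last
    one accepted for the same key."""

    def __init__(self):
        self._last_seq = {}

    def accept(self, key, seq):
        last = self._last_seq.get(key)
        if last is not None and seq <= last:
            return False  # duplicate or stale retry: drop it
        self._last_seq[key] = seq
        return True


incoming = [("row-1", 1), ("row-1", 2), ("row-1", 2), ("row-2", 1), ("row-1", 1)]
dedup = DuplicateFilter()
accepted = [m for m in incoming if dedup.accept(*m)]
print(accepted)
```

A filter like this relies on messages for one key arriving in order at a single consumer, which is the property the Key_Shared subscription mode provides. In a real deployment the per-key state would also need to be bounded or persisted.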

Summary

For scenarios in which a large amount of incremental data must be synchronized, we adopted a self-developed solution based on Maxwell + Pulsar. Pulsar's Key_Shared subscription mode solves the ordering problem in distributed message consumption well, and the filters added to the data link of the data pipeline ensure that messages are neither duplicated nor lost.

In the future, we plan to build on the existing solution and make full use of Pulsar's features to turn the data pipeline into a visual data synchronization center, with support for more database types and a complete monitoring and logging system.
