About the author: Liu Dezhi, expert engineer at Tencent, billing-system developer in the TEG (Technology and Engineering Group).
Apache Pulsar is a next-generation distributed messaging and streaming platform. It adopts an architecture that separates compute from storage, and offers advantages such as multi-tenancy, high consistency, high performance, support for millions of topics, and smooth data migration. More and more companies are using Pulsar, or trialing it for production.
Tencent uses Pulsar as the message bus of its billing system, supporting hundreds of billions of online transactions. Tencent's billing volume is huge, and the core problem is keeping money and goods consistent. First, every payment transaction must be processed without error, which demands high consistency and high reliability. Second, every service carried by billing must be available 24/7, which demands high availability and high performance. The billing message bus must provide all of these capabilities.
Pulsar architecture analysis
In terms of consistency, Pulsar uses a quorum algorithm: the write quorum controls how many replicas of each entry are written, and the ack quorum controls how many responses must arrive before a write is acknowledged; choosing A > W/2 (ack quorum larger than half the write quorum) gives strong consistency. In terms of performance, Pulsar produces messages in a pipelined fashion, reduces disk I/O pressure through sequential and striped writes, and uses multiple layers of caching to cut network requests and speed up consumption.
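The quorum arithmetic above can be sketched in a few lines. This is an illustrative model, not Pulsar's or BookKeeper's actual code; the function names are my own:

```python
# Sketch of BookKeeper-style quorum accounting (illustrative only).
# An entry is sent to `write_quorum` bookies and acknowledged to the
# client once `ack_quorum` responses arrive; ack_quorum > write_quorum/2
# guarantees that any two acknowledged quorums share at least one bookie.

def is_committed(acks_received: int, ack_quorum: int) -> bool:
    """The client treats the write as durable once the ack quorum is met."""
    return acks_received >= ack_quorum

def quorums_overlap(write_quorum: int, ack_quorum: int) -> bool:
    """With A > W/2, any two sets of A bookies out of W must intersect."""
    return 2 * ack_quorum > write_quorum

# Example: write quorum 3, ack quorum 2.
assert quorums_overlap(write_quorum=3, ack_quorum=2)    # 2 > 3/2, overlap holds
assert is_committed(acks_received=2, ack_quorum=2)      # durable after 2 acks
assert not is_committed(acks_received=1, ack_quorum=2)  # still in flight
```

The overlap property is what makes A > W/2 "strong": no two acknowledged states of the same entry can disagree without a common witness.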
Pulsar's high performance is mainly reflected in its network model, communication protocol, queue model, disk I/O, and striped writes. I will explain each of these below.
The Pulsar broker is a typical Reactor model. It contains a network thread pool responsible for handling network requests — sending, receiving, encoding, and decoding — which pushes each request through a request queue to a core thread pool for processing. First, Pulsar is multi-threaded to take full advantage of modern multi-core hardware, and it assigns requests for the same task to the same thread to avoid the overhead of switching between threads. Second, Pulsar uses queues to asynchronously decouple the network-processing module from the core-processing module, so network processing and file I/O run in parallel, which greatly improves the efficiency of the whole system.
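The queue-based decoupling described above can be sketched with standard threads and a queue. This is a minimal illustration of the pattern, not Pulsar's broker code:

```python
# Reactor-style decoupling sketch (illustrative): a network thread decodes
# requests and hands them off via a queue; core threads process them in
# parallel, so slow processing never blocks the network side.
import queue
import threading

request_queue: "queue.Queue[str]" = queue.Queue()
results = []
lock = threading.Lock()

def network_thread(raw_requests):
    # Decode on the network side, then push to the core pool via the queue.
    for raw in raw_requests:
        request_queue.put(raw.decode("utf-8"))

def core_thread():
    # Core threads do the actual work (e.g. file I/O) asynchronously.
    while True:
        req = request_queue.get()
        if req == "STOP":          # sentinel to shut the worker down
            break
        with lock:
            results.append(f"processed:{req}")

workers = [threading.Thread(target=core_thread) for _ in range(2)]
for w in workers:
    w.start()
network_thread([b"produce", b"consume"])
for _ in workers:
    request_queue.put("STOP")
for w in workers:
    w.join()
print(sorted(results))  # ['processed:consume', 'processed:produce']
```

The key property is that `network_thread` returns as soon as requests are enqueued; the core pool drains the queue at its own pace.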
Communication protocol
Messages use a simple binary encoding. The client produces the binary payload and sends it directly to the Pulsar broker; the broker forwards it to bookies for storage without decoding it, and the storage format is also binary, so no encode or decode step occurs anywhere on the production and consumption path. Message compression and batching are done entirely on the client side, which further improves the broker's message-handling capacity.
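The pass-through idea can be illustrated with a toy length-prefixed frame. The layout below is hypothetical, not Pulsar's real wire format; the point is that only the endpoints touch the bytes:

```python
# Illustrative binary framing (hypothetical layout, not Pulsar's actual
# protocol): the client packs [4-byte big-endian length][payload], and the
# broker can relay the frame byte-for-byte with no decode/re-encode step.
import struct

def encode_frame(payload: bytes) -> bytes:
    return struct.pack(">I", len(payload)) + payload

def relay(frame: bytes) -> bytes:
    # Broker-side "handling": pass the bytes through untouched.
    return frame

def decode_frame(frame: bytes) -> bytes:
    (length,) = struct.unpack(">I", frame[:4])
    return frame[4:4 + length]

frame = encode_frame(b"hello")
assert relay(frame) == frame            # the broker never re-encodes
assert decode_frame(frame) == b"hello"  # only the consumer decodes
```

Because the broker treats frames as opaque bytes, its cost per message is essentially a memory copy plus a network write.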
Pulsar partitions topics and assigns different partitions to different brokers to scale horizontally. Pulsar supports adjusting the number of partitions online and, in theory, supports unlimited throughput. Although ZooKeeper's capacity and performance do constrain the number of brokers and partitions, the limit is so large that in practice it can be treated as unbounded.
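A minimal sketch of spreading partitions over brokers, assuming a simple round-robin placement (Pulsar's real load manager is far more dynamic; the function name is my own):

```python
# Illustrative round-robin placement of topic partitions across brokers.
def assign_partitions(topic: str, partitions: int, brokers: list) -> dict:
    # Partition p lands on broker p mod len(brokers).
    return {f"{topic}-partition-{p}": brokers[p % len(brokers)]
            for p in range(partitions)}

assignment = assign_partitions("billing", 4, ["broker-1", "broker-2"])
assert assignment["billing-partition-0"] == "broker-1"
assert assignment["billing-partition-3"] == "broker-2"
# Adding brokers and partitions spreads load, scaling throughput horizontally.
```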
A message queue is a disk-I/O-intensive system, so optimizing disk I/O is crucial. Pulsar's disk operations fall into two categories: the operation log (journal) and the data log. The journal is used for data recovery and is written strictly sequentially; once the journal write succeeds, the produce is considered successful. This is why Pulsar can support millions of topics without the sharp performance drop that random writes would cause.
Entries in the journal may also be out of order, which lets the journal sustain its best write rate, while the data log is sorted and deduplicated before flushing. Although this causes write amplification, the benefit is worth it: the journal and the data log are mounted on different disks, separating read I/O from write I/O and further improving the I/O capacity of the whole system.
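The journal/data-log split can be sketched as two structures fed by one write path. This is an illustrative model of the idea, not BookKeeper's implementation:

```python
# Journal vs. data log sketch (illustrative): entries from many ledgers are
# appended to one journal strictly in arrival order (pure sequential I/O),
# while the data log groups them per ledger for efficient later reads.
from collections import defaultdict

journal = []                   # append-only, arrival order; fsync'd for durability
data_log = defaultdict(list)   # per-ledger lists, flushed lazily in sorted order

def write_entry(ledger_id: int, entry_id: int, data: bytes):
    journal.append((ledger_id, entry_id, data))   # success here = produce success
    data_log[ledger_id].append((entry_id, data))  # grouped for read locality

write_entry(7, 0, b"a")
write_entry(3, 0, b"b")   # interleaved ledgers: the journal stays sequential
write_entry(7, 1, b"c")

assert [e[:2] for e in journal] == [(7, 0), (3, 0), (7, 1)]
assert [e[0] for e in data_log[7]] == [0, 1]  # contiguous per ledger
```

Writing the two logs to different disks is what separates the write-heavy journal I/O from read traffic against the data log.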
Striped writes spread I/O across more bookie nodes. Each bookie maintains a write cache and a read cache: the latest messages sit in the write cache, while older messages are read from files in batches into the read cache to improve read efficiency.
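Striping can be sketched as rotating each entry's write quorum through the ensemble. This mirrors the round-robin idea; the exact placement logic in BookKeeper is configurable, so treat this as illustrative:

```python
# Round-robin striping sketch (illustrative): with an ensemble of 5 bookies
# and a write quorum of 3, each entry goes to 3 of the 5, rotating the
# starting bookie so I/O load is shared across the whole ensemble.
def stripe(entry_id: int, ensemble: list, write_quorum: int) -> list:
    start = entry_id % len(ensemble)
    return [ensemble[(start + i) % len(ensemble)] for i in range(write_quorum)]

bookies = ["b1", "b2", "b3", "b4", "b5"]
assert stripe(0, bookies, 3) == ["b1", "b2", "b3"]
assert stripe(1, bookies, 3) == ["b2", "b3", "b4"]
assert stripe(4, bookies, 3) == ["b5", "b1", "b2"]  # wraps around the ensemble
```

Over many entries, every bookie serves the same share of writes, which is exactly the "I/O sharing" the text describes.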
From an architectural point of view, Pulsar has no obvious choke point anywhere on the message path. Journal persistence does use a single thread to flush to disk, which could cause lag, but depending on the disk characteristics you can configure multiple disks and multiple directories to improve read/write performance, which fully meets our needs.
In the Tencent billing scenario, we set up identical environments and benchmarked Pulsar and Kafka against each other. The specific test scenarios were as follows.
The stress-test data is as follows:
As the data above shows, in terms of network I/O, with 3 replicas and multiple partitions, Pulsar nearly saturates the broker's network card, because each piece of data has to be fanned out 3 times on the broker side. This is the cost of separating compute from storage.
Kafka's performance numbers were somewhat disappointing; its overall performance did not improve. This is likely related to Kafka's own replica synchronization mechanism: followers replicate by pulling from the leader, which lowers overall efficiency.
In terms of latency, Pulsar performs better on the producer side: as long as resources are not at a bottleneck, the 99th-percentile latency stays within 10 milliseconds, with fluctuations only during garbage collection (GC) and when new journal files are created.
From the stress-test results, Pulsar outperforms Kafka in high-consistency scenarios. If you set log.flush.interval.messages=1, Kafka's performance gets even worse: Kafka was designed for high throughput from the start, and it has no parameter for this kind of direct synchronous persistence.
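For reference, the Kafka settings usually combined to push it toward the high-consistency end are shown below. These are real Kafka configuration keys, but the values are a sketch of one possible test setup, not necessarily the exact configuration used in our benchmark:

```properties
# Broker side: force a flush to disk after every message (hurts throughput)
log.flush.interval.messages=1
# Require at least 2 in-sync replicas before a write is accepted
min.insync.replicas=2
```

On the producer side, `acks=all` is additionally required so the producer waits for all in-sync replicas to acknowledge each write.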
We also tested other scenarios, such as a million topics and cross-region replication. In the million-topic produce-and-consume tests, Pulsar's performance did not drop sharply as the number of topics grew, whereas Kafka slowed down rapidly because of the large number of random writes.
Pulsar natively supports cross-region replication, in both synchronous and asynchronous modes. In Kafka's same-city cross-region replication, throughput was low and replication was very slow. We therefore tested Pulsar's synchronous replication mode in the cross-region scenario: the storage cluster was deployed across cities, the awaited acks had to include responses from multiple locations, and the other parameters matched the same-city test. The results show that Pulsar's throughput can reach 280,000 QPS across cities. Of course, cross-city and cross-region replication performance depends heavily on the network quality at the time.
As a new distributed messaging and streaming platform, Pulsar has many advantages. Thanks to the bookies' sharded storage and the ledger-level strategy for choosing storage nodes, Pulsar is very easy to operate and maintain, free of the manual data rebalancing that plagues Kafka. But Pulsar is not perfect: it has problems of its own, and the community is still improving it.
Pulsar's strong dependence on ZooKeeper
Pulsar has a strong dependency on ZooKeeper: in extreme cases, if the ZooKeeper cluster goes down or blocks, the entire service goes down with it. The probability of a ZooKeeper cluster crashing is relatively low; after all, ZooKeeper has been hardened by a large number of production systems and is still widely used. However, ZooKeeper blocking is considerably more likely. For example, in a million-topic scenario, millions of pieces of ledger metadata are generated, and all of this data has to go through ZooKeeper.
For example, creating a topic once requires creating the topic partition metadata, the topic name node, and the node for the topic's storage ledgers; creating a ledger once requires creating and deleting the unique ledger ID node and the ledger metadata node. That is 5 ZooKeeper write operations in total, and a subscription requires roughly 4 similar ZooKeeper writes, so 9 writes altogether. Creating topics at the million scale at the same time inevitably puts enormous pressure on ZooKeeper.
Pulsar can deploy these ZooKeeper dependencies separately, which relieves the pressure on ZooKeeper to some extent. The ZooKeeper cluster the brokers depend on is the most critical; from the analysis above, its write volume is relatively controllable, since topics can be created through the console. The ZooKeeper that the bookies depend on has the highest operation frequency, but if it blocks, in-flight writes are not affected.
The brokers' ZooKeeper dependency can be optimized along the same lines. First, the current service should be able to keep running for a while without ZooKeeper, giving ZooKeeper enough time to recover. Second, reduce the number of ZooKeeper writes and use it only where necessary, such as broker elections. Data like broker load information can be kept in other storage media, especially since, when a broker serves a large number of topics, this information reaches the megabyte (MB) level. We are working with the Pulsar community to optimize the broker load function.
Pulsar memory management is slightly more complicated
Pulsar's memory consists of JVM heap memory and off-heap (direct) memory. Messages are sent and cached through off-heap memory to reduce the garbage collection (GC) that I/O would otherwise trigger. Heap memory mainly caches ZooKeeper-related data, such as ledger metadata and the message-ID cache for messages redelivered to subscribers. Heap-dump analysis showed that one piece of ledger metadata occupies about 10 KB, and the initial redelivered-message-ID cache per subscriber is 16 KB and keeps growing. As the broker's memory keeps growing, it eventually runs full garbage collections (full GC) frequently, until it finally exits.
To solve this problem, we first looked for fields whose memory footprint could be reduced, such as the bookie address information inside ledger metadata: each ledger creates its own address objects even though the set of bookie nodes is very limited, so shared global instances can eliminate the unnecessary object creation. The subscriber redelivered-message-ID cache can be initialized at under 1 KB and resized periodically. These changes greatly improve the broker's availability.
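The shared-instance idea is essentially the flyweight pattern. A minimal sketch, with names of my own choosing (the real fix lives in the broker's Java code):

```python
# Flyweight sketch (illustrative): instead of every ledger's metadata holding
# its own bookie-address objects, all ledgers share one interned instance
# per bookie, since the set of bookie nodes is small and stable.
_bookie_cache: dict = {}

def bookie_address(host: str, port: int):
    # Return one shared object per (host, port) instead of a fresh one.
    key = (host, port)
    if key not in _bookie_cache:
        _bookie_cache[key] = key
    return _bookie_cache[key]

a = bookie_address("bk-1", 3181)
b = bookie_address("bk-1", 3181)
assert a is b                  # one object, however many ledgers reference it
assert len(_bookie_cache) == 1
```

With millions of ledgers but only dozens of bookies, interning turns per-ledger address objects into a constant-size cache.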
Compared with Kafka, the Pulsar broker has many advantages: Pulsar balances load automatically, so no single broker becomes unstable from excessive load, and the cluster can scale out quickly to reduce overall load.
Overall, Pulsar performs excellently in high-consistency scenarios and is already widely used inside Tencent, for example in Tencent Finance and in big-data scenarios. The big-data usage is mainly through the KoP (Kafka-on-Pulsar) mode, whose performance is now very close to Kafka's and in some scenarios even surpasses it. We are confident that, with the joint efforts of the community and its contributors, Pulsar will keep getting better, opening a new chapter for next-generation cloud-native messaging.