About Apache Pulsar
Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight functional computing. With an architecture that separates compute from storage, it supports multi-tenancy, persistent storage, multi-datacenter and cross-region data replication, and provides strong consistency, high throughput, low latency, and high scalability for streaming data storage.
GitHub address: http://github.com/apache/pulsar/
About the author
The author of this article, Bao Mingyu, is a senior engineer in the Data Platform Department of Tencent TEG and an Apache Pulsar contributor. He is passionate about open source technology, has extensive experience in the message queue field, and currently focuses on bringing Pulsar into production and promoting its adoption.
The MQ team of Tencent's Data Platform Department has done in-depth research on Pulsar together with a large amount of performance and stability optimization work, and Pulsar has gone live in TDbank. This article is one of the Pulsar technology series; it summarizes the cleanup mechanisms for Pulsar message storage and BookKeeper storage files. BookKeeper can be understood as a NoSQL storage system that uses RocksDB to store index data by default.
Pulsar message storage
Pulsar's messages are stored in BookKeeper. BookKeeper is a fat-client system: the client side is the BookKeeper client, and each storage node in the server-side cluster is called a bookie. The Pulsar broker acts as a client of the BookKeeper storage system and writes Pulsar's messages to the bookie cluster through the client SDK provided by BookKeeper.
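To make the fat-client model concrete, the sketch below talks to a bookie cluster directly with the BookKeeper client API, creating a ledger and appending an entry. This is conceptually what the broker does on behalf of producers; it is a minimal sketch, the ZooKeeper address and quorum settings are placeholders, and the broker actually goes through its managed ledger layer rather than this raw API.

```java
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.BookKeeper.DigestType;
import org.apache.bookkeeper.client.LedgerHandle;

public class BookieWriteSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the bookie cluster via its ZooKeeper metadata service (placeholder address).
        BookKeeper bk = new BookKeeper("localhost:2181");

        // Create a ledger: ensemble of 3 bookies, write quorum 3, ack quorum 2.
        LedgerHandle lh = bk.createLedger(3, 3, 2, DigestType.CRC32, "passwd".getBytes());

        // Each append returns the entry id, which increases monotonically within the ledger.
        long entryId = lh.addEntry("hello pulsar".getBytes());
        System.out.println("ledgerId=" + lh.getId() + ", entryId=" + entryId);

        lh.close();
        bk.close();
    }
}
```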
Each partition of each topic in Pulsar (a non-partitioned topic can be understood as partition 0; partition numbering starts from 0) corresponds to a series of ledgers, and each ledger only stores data for its own partition. For each partition, only one ledger is open and writable at any given time.
When Pulsar produces and stores a message, it first finds the ledger currently in use for the partition and then generates the entry ID for the message; entry IDs increase monotonically within the same ledger. Without batching (the producer can configure this; batching is enabled by default), one entry contains one message; with batching, one entry may contain multiple messages. Bookies write, look up, and fetch data only at the entry granularity.
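Whether an entry holds one message or a batch is controlled by the producer's batching settings. A minimal sketch with the Pulsar Java client follows; the service URL, topic name, and tuning values are placeholders.

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class BatchingProducerSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // placeholder address
                .build();

        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("persistent://public/default/demo-topic")
                .enableBatching(true)                            // batching is on by default
                .batchingMaxMessages(100)                        // up to 100 messages packed into one entry
                .batchingMaxPublishDelay(5, TimeUnit.MILLISECONDS)
                .create();

        producer.send("hello");
        producer.close();
        client.close();
    }
}
```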
Therefore, the msgID of each message in Pulsar consists of four parts (three in older versions): (ledgerID, entryID, partition-index, batch-index). For a non-partitioned topic, partition-index is -1; for a non-batched message, batch-index is -1.
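The four components can be observed from the Java client by inspecting the returned message ID. A small sketch, noting that MessageIdImpl and BatchMessageIdImpl live in the client's impl package and are not part of the public API contract:

```java
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.impl.BatchMessageIdImpl;
import org.apache.pulsar.client.impl.MessageIdImpl;

public class MessageIdSketch {
    // Print the components of a message id returned by send() or carried by a received message.
    static void dump(MessageId id) {
        MessageIdImpl impl = (MessageIdImpl) id;
        System.out.println("ledgerId=" + impl.getLedgerId()
                + ", entryId=" + impl.getEntryId()
                + ", partitionIndex=" + impl.getPartitionIndex());
        if (id instanceof BatchMessageIdImpl) {
            // Only present when the message was published as part of a batch.
            System.out.println("batchIndex=" + ((BatchMessageIdImpl) id).getBatchIndex());
        }
    }
}
```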
A ledger is switched when its lifetime or the number of entries it stores exceeds a threshold; new messages for the same partition are then stored in the next ledger. A ledger is only a logical concept, a logical grouping of data, and has no corresponding physical entity.
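Rollover is driven by broker-side limits on ledger lifetime and entry count (in recent Pulsar versions these correspond to broker.conf settings such as managedLedgerMaxEntriesPerLedger and managedLedgerMaxLedgerRolloverTimeMinutes; treat the exact names as version-dependent). Conceptually, the switch decision looks like this simplified sketch:

```java
class LedgerRolloverSketch {
    // Simplified rollover check (illustrative, not the broker's actual code): a new ledger
    // is started once the current one is old enough or holds enough entries. The real broker
    // also considers ledger size and a minimum rollover time.
    static boolean shouldRollover(long entriesInCurrentLedger, long ledgerAgeMinutes,
                                  long maxEntriesPerLedger, long maxRolloverTimeMinutes) {
        return entriesInCurrentLedger >= maxEntriesPerLedger
                || ledgerAgeMinutes >= maxRolloverTimeMinutes;
    }
}
```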
After each bookie node in the BookKeeper cluster receives a message, the data is stored and processed in three parts: journal files, entry log files, and index files.
Entry data is written to the journal file in write-ahead-log (WAL) fashion. Each journal file has a size limit; when a single file exceeds that limit, writing switches to the next file. Because the journal is flushed in real time, it is recommended, in order to improve performance and avoid read and write IO interfering with each other, to keep the journal storage directory separate from the entry log directory and to mount a dedicated disk (preferably SSD) for each journal directory. Only a small number of journal files are retained; files beyond the configured count are deleted. Entries are written to the journal strictly in arrival order, with no grouping by ledger. The journal file exists to guarantee that messages are not lost.
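On each bookie this separation is configured through the journal and ledger directory settings in bookkeeper.conf. Below is a minimal programmatic sketch of the same idea, assuming the standard ServerConfiguration setters; the directory paths and sizes are placeholders.

```java
import org.apache.bookkeeper.conf.ServerConfiguration;

public class BookieDirsSketch {
    static ServerConfiguration buildConf() {
        ServerConfiguration conf = new ServerConfiguration();
        // Put the WAL-style journal on its own (ideally SSD) mount point...
        conf.setJournalDirName("/mnt/ssd/bookkeeper/journal");
        // ...and keep entry log data on separate disks to avoid read/write IO interference.
        conf.setLedgerDirNames(new String[] {
                "/mnt/data1/bookkeeper/ledgers",
                "/mnt/data2/bookkeeper/ledgers"
        });
        // Journal files roll at a size limit and only a few are retained.
        conf.setMaxJournalSizeMB(2048);
        conf.setMaxBackupJournals(5);
        return conf;
    }
}
```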
As shown in the figure below, after a bookie receives a request to add an entry, it maps the entry to a journal directory and an entry log directory according to the ledger ID and stores the entry data in the corresponding directories. At present, bookie does not support changing storage directories while running (adding or removing directories in use will make some data inaccessible).
As shown in the figure below, after a bookie receives an entry write request, it writes the entry to the journal file and saves it to the write cache at the same time. The write cache has two parts: one that is currently accepting writes and one that is being flushed, and the two are used alternately.
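The alternating write cache can be pictured as a simple double buffer: writes always go into the active part, and a flush swaps the two parts so that writing never blocks on flushing. The following is a simplified illustration, not bookie's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified double-buffered write cache: new entries always go to 'active',
// while the swapped-out buffer is drained to the entry log in the background.
class DoubleWriteCacheSketch {
    private List<byte[]> active = new ArrayList<>();
    private List<byte[]> flushing = new ArrayList<>();

    synchronized void add(byte[] entry) {
        active.add(entry);
    }

    synchronized List<byte[]> swapForFlush() {
        List<byte[]> toFlush = active;
        active = new ArrayList<>();   // writes continue into a fresh buffer
        flushing = toFlush;           // this buffer is now being flushed to the entry log
        return toFlush;
    }
}
```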
The write cache contains an index structure through which the corresponding entry can be found. This index is purely in memory and is implemented with the ConcurrentLongLongPairHashMap structure defined in bookie.
In addition, each entry log storage directory corresponds to an instance of the SingleDirectoryDbLedgerStorage class, and each SingleDirectoryDbLedgerStorage object has an index based on RocksDB through which the entry log file containing any given entry can be located quickly. Each write cache sorts entries as they are added, so within the same write cache the data of the same ledger is adjacent and ordered; when the write cache is flushed to an entry log file, the data written to the file is therefore partially ordered. This design greatly improves subsequent read efficiency.
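Here is a simplified sketch of the two ideas above: an in-memory location index keyed by (ledgerId, entryId), standing in for bookie's ConcurrentLongLongPairHashMap and the RocksDB-backed index, and a flush step that sorts cached entries by ledger ID and entry ID before appending them, which is what makes the entry log partially ordered. It is illustrative only, not BookKeeper's actual code.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CachedEntry {
    final long ledgerId;
    final long entryId;
    final byte[] payload;
    CachedEntry(long ledgerId, long entryId, byte[] payload) {
        this.ledgerId = ledgerId; this.entryId = entryId; this.payload = payload;
    }
}

class SortedFlushSketch {
    // Stand-in for the RocksDB-backed location index: "ledgerId:entryId" -> offset in the entry log.
    final Map<String, Long> locationIndex = new HashMap<>();

    // Sort by (ledgerId, entryId) so entries of the same ledger sit next to each other on disk.
    void flush(List<CachedEntry> writeCache) {
        writeCache.sort(Comparator.comparingLong((CachedEntry e) -> e.ledgerId)
                                  .thenComparingLong(e -> e.entryId));
        long offset = 0;
        for (CachedEntry e : writeCache) {
            // A real bookie appends the payload to the current entry log file at this offset.
            locationIndex.put(e.ledgerId + ":" + e.entryId, offset);
            offset += e.payload.length;
        }
    }
}
```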
The index data in SingleDirectoryDbLedgerStorage is also flushed to index files as entries are flushed. When a bookie goes down and restarts, data can be recovered from the journal and entry log files, ensuring that no data is lost.
When a Pulsar consumer consumes data, multiple layers of cache are used to accelerate reads, as shown in the following figure:
The order of obtaining data is as follows:
- Fetch from the entry cache on the broker side; if it misses, continue;
- Fetch from the part of the bookie write cache that is currently accepting writes; if it misses, continue;
- Fetch from the part of the bookie write cache that is being flushed; if it misses, continue;
- Fetch from the bookie read cache; if it misses, continue;
- Read the entry from the entry log file on disk through the index.
In each step above, if the data is found it is returned directly and the subsequent steps are skipped. If the data is read from a disk file, it is also placed in the read cache on the way back. In addition, a disk read fetches a chunk of nearby data: because the data on disk has local order, the probability that adjacent data will be requested next is very high, and this read-ahead greatly improves the efficiency of subsequent reads.
When using Pulsar, we should try to avoid or reduce scenarios where consuming old data forces messages to be read from disk files, so that the performance of the overall system is not affected.
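Putting the steps together, the read path is a cascade of caches in front of an indexed disk read. The sketch below is conceptual; the class and cache names are illustrative, and in reality the broker cache lives in the broker while the others live in the bookie, behind an RPC.

```java
import java.util.Optional;

// Conceptual read path for one (ledgerId, entryId); each layer is tried in order,
// the first hit is returned, and disk reads back-fill the read cache.
class ReadPathSketch {
    interface Cache { Optional<byte[]> get(long ledgerId, long entryId); }

    Cache brokerEntryCache;      // broker-side entry cache
    Cache activeWriteCache;      // bookie write cache currently accepting writes
    Cache flushingWriteCache;    // bookie write cache currently being flushed
    Cache readCache;             // bookie read cache

    byte[] readEntry(long ledgerId, long entryId) {
        for (Cache c : new Cache[] { brokerEntryCache, activeWriteCache, flushingWriteCache, readCache }) {
            Optional<byte[]> hit = c.get(ledgerId, entryId);
            if (hit.isPresent()) {
                return hit.get();
            }
        }
        // Cache miss: locate the entry log file through the index and read from disk.
        // Neighbouring entries read in the same pass are placed into the read cache,
        // which is why sequential reads of old data become cheaper after the first miss.
        return readFromEntryLogAndPopulateReadCache(ledgerId, entryId);
    }

    byte[] readFromEntryLogAndPopulateReadCache(long ledgerId, long entryId) {
        throw new UnsupportedOperationException("disk read omitted in this sketch");
    }
}
```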
BookKeeper's GC mechanism
Each bookie in BookKeeper periodically performs data cleanup; by default the check runs every 15 minutes. The main cleanup process is as follows:
- Clean up the ledgers stored on the bookie: compare the ledger IDs stored on the bookie with those stored on ZooKeeper, and delete from the bookie any ledger that no longer exists on ZooKeeper;
- Compute the proportion of live entries in each entry log, and delete an entry log when the number of live ledgers it contains is 0;
- Clean up entry log files based on their metadata (an entry log is deleted when all of the ledger IDs it contains are no longer valid);
- Compact entry log files. When the proportion of live entries in an entry log file falls below 0.5 (major GC, run once per day by default) or 0.2 (minor GC, run once per hour by default), compaction copies the surviving entries from the old file into a new file and then deletes the old entry log file. If the entry log files handled in a single GC run are large, the run can take a long time. The relevant bookie settings are sketched after this list.
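The check interval and the two compaction thresholds and intervals correspond to bookie settings (gcWaitTime, minorCompactionThreshold/Interval, majorCompactionThreshold/Interval in bookkeeper.conf). Here is a sketch of the defaults described above, assuming the usual ServerConfiguration setters and their documented units:

```java
import org.apache.bookkeeper.conf.ServerConfiguration;

public class GcConfigSketch {
    static ServerConfiguration gcDefaults() {
        ServerConfiguration conf = new ServerConfiguration();
        conf.setGcWaitTime(15 * 60 * 1000);            // run the GC check every 15 minutes (milliseconds)
        conf.setMinorCompactionThreshold(0.2);         // minor GC: compact logs with < 20% live entries
        conf.setMinorCompactionInterval(60 * 60);      // ...at most once per hour (seconds)
        conf.setMajorCompactionThreshold(0.5);         // major GC: compact logs with < 50% live entries
        conf.setMajorCompactionInterval(24 * 60 * 60); // ...at most once per day (seconds)
        return conf;
    }
}
```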
Through the process above, we can understand the general flow with which a bookie cleans up entry log files.
What needs special explanation is that whether a ledger can be deleted is entirely triggered by the client side; in Pulsar, this trigger comes from the broker.
The broker has a periodic thread (running every 2 minutes by default) that cleans up the ledgers whose messages have already been consumed. It obtains the last confirmed (acknowledged) position of the cursors contained in the topic, then deletes all ledgers in the topic's ledger list that come before the ledger holding that position (note that the ledger holding the position itself is not included), which means deleting the metadata in ZooKeeper and notifying the bookies to delete the corresponding ledgers.
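The trimming rule can be sketched as follows: in the topic's ordered ledger list, every ledger strictly older than the ledger that holds the slowest cursor's last acknowledged position is eligible for deletion (retention policies, which can keep data longer, are omitted). This is a simplified illustration, not the broker's actual managed ledger code.

```java
import java.util.ArrayList;
import java.util.List;

class LedgerTrimSketch {
    // 'ledgers' is the partition's ledger id list in creation order;
    // 'slowestMarkDeleteLedgerId' is the ledger containing the earliest
    // unacknowledged position across all cursors (subscriptions).
    static List<Long> ledgersToDelete(List<Long> ledgers, long slowestMarkDeleteLedgerId) {
        List<Long> deletable = new ArrayList<>();
        for (long ledgerId : ledgers) {
            if (ledgerId < slowestMarkDeleteLedgerId) {
                // Fully consumed ledgers before the slowest cursor can be dropped
                // (the metadata in ZooKeeper is removed and bookies are notified).
                deletable.add(ledgerId);
            } else {
                break; // the ledger holding the cursor position, and newer ones, are kept
            }
        }
        return deletable;
    }
}
```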
Analysis of problems encountered in operation
In practice we have run into insufficient bookie disk space multiple times, with a large number of entry log files accumulating on the bookies. There are two typical causes.
Reason 1:
Production is spread across too many topics. For example, in an extreme scenario, there are 10,000 topics and production round-robins over them, writing one message to each topic in turn. In this case, the ledger of each topic is not switched for a long time because neither the duration nor the size threshold is reached, so ledger IDs that are still active are scattered across a large number of entry log files, and those files can be neither deleted nor compacted in time.
If you encounter such a scenario, you can force the ledgers to switch by restarting. Of course, if consumption has not caught up at that point, the ledger holding the last acknowledged position of a subscription is still active and cannot be deleted.
Reason 2:
During a GC run, if there are many existing entry log files and a large number of them meet the minor or major GC thresholds, then a single minor or major GC pass takes too long, and expired entry log files cannot be cleaned up while it is running.
This is caused by the fact that a single cleanup pass executes its steps sequentially: the next round only starts after the previous round has finished. This part of the process is currently being improved to prevent an overly long sub-step from affecting the whole.
Summary
This article first introduced how Pulsar messages are organized and stored and how they are read, and then described the GC process of a single bookie in detail. When using Pulsar, try to avoid consuming old historical data, that is, scenarios that require reading from disk to fetch the data.
When operating and maintaining bookies, the number of storage directories cannot be adjusted at runtime, so capacity needs to be fully evaluated at deployment time. If adjustments are needed during operation, they have to be made by scaling individual bookie nodes out or in.
Related Reading
- Blog post recommendation | Apache Pulsar message storage model explained in detail with diagrams
- Blog post recommendation | Analysis and solution of the problem that Pulsar storage space is not released