This article is translated from "A Guide to the BookKeeper Replication Protocol (TLA+ Series Part 2)" by Jack Vanlightly. Original link: https://medium.com/splunk-maas/a-guide-to-the-bookkeeper-replication-protocol-tla-series-part-2-29f3371fe395 .
Translator Profile
Wang Jialing @ China Mobile Cloud Competence Center, Product Manager of Mobile Cloud Pulsar and Apache Pulsar Contributor, active in the Apache Pulsar open source project and community.
We know that data in a relational database is stored in tables, and clients write data to and read data from those tables. Data in Apache BookKeeper is stored in logs, and clients read and write data in log form. A log is a simple data structure that only supports appending, while allowing multiple clients to read it simultaneously and non-destructively.
As data structures, logs and queues behave very similarly. The difference is that a log allows multiple clients to independently read the complete data from different positions at the same time, so log reads must be non-destructive. A queue read is destructive: the head element is deleted once it is read, which means each element in a queue is read by exactly one client.
Apache BookKeeper, the data storage layer of Apache Pulsar, is itself a complex distributed system. BookKeeper achieves data safety and high availability through replication: every entry is copied to multiple nodes, so reads and writes can continue even when some nodes fail, and stored data is not lost. BookKeeper uses its own replication protocol, which specifies how the service nodes cooperate to keep the service highly available and the data safe.
Segment-based log structure
Messaging systems built on whole queues or logs, such as RabbitMQ and Apache Kafka, store each queue or partition as a single unit, so all of its data must live on the same storage node. BookKeeper instead uses a segment-based log structure: each log consists of a series of data segments concatenated together. A Pulsar topic partition's data is in fact split across multiple segments.
We know that each Pulsar topic has exactly one Pulsar broker as its owner. This broker is responsible for creating the data segments of the topics it owns and for concatenating those segments into, logically, one complete log.
Figure 1: A Pulsar topic's data consists of a series of concatenated data segments
BookKeeper calls these data segments ledgers and stores them on BookKeeper server nodes, called bookies.
Figure 2: Pulsar broker stores topic data to multiple Bookie nodes
The BookKeeper replication protocol is closely tied to the life cycle of each ledger. The protocol itself is implemented inside the BookKeeper client library. Each Pulsar broker interacts with BookKeeper by calling this library's interfaces, such as creating a ledger, closing a ledger, and reading and writing entries. Behind these interfaces sits quite complex protocol logic, which this post will unpack layer by layer.
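To make the shape of this client API concrete, here is a minimal sketch of a ledger's life cycle using the classic BookKeeper client; the ZooKeeper address, digest type, and password below are placeholder values, not a prescribed configuration:

```java
import java.util.Enumeration;

import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerEntry;
import org.apache.bookkeeper.client.LedgerHandle;

public class LedgerLifecycleSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the cluster via ZooKeeper (address is a placeholder).
        BookKeeper bk = new BookKeeper("127.0.0.1:2181");

        // Create a ledger: ensemble = 3, write quorum = 3, ack quorum = 2.
        LedgerHandle lh = bk.createLedger(3, 3, 2,
                BookKeeper.DigestType.CRC32, "password".getBytes());
        lh.addEntry("hello bookkeeper".getBytes());  // append an entry
        lh.close();                                  // seal the ledger

        // Re-open the ledger for reading (opening also fences and recovers it).
        LedgerHandle reader = bk.openLedger(lh.getId(),
                BookKeeper.DigestType.CRC32, "password".getBytes());
        Enumeration<LedgerEntry> entries =
                reader.readEntries(0, reader.getLastAddConfirmed());
        while (entries.hasMoreElements()) {
            System.out.println(new String(entries.nextElement().getEntry()));
        }
        reader.close();
        bk.close();
    }
}
```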
First, the client that creates a ledger is its sole owner, and only the owner may write data to it. For Pulsar, this client is the broker that owns the topic partition; that broker creates the ledgers that form the topic's data segments. If the owner fails for some reason, another client (for Pulsar, another broker) steps in and takes over the topic. At that point the ledger may contain under-replicated entries, so the new client must first repair it by re-replicating those entries (the recovery operation) and then close the ledger.
Figure 3: Ledger life cycle
Each Pulsar topic has exactly one ledger in the open state and any number of ledgers in the closed state. All writes go to the open ledger, while reads can come from any ledger.
Figure 4: Write operations only write to open ledgers
Each ledger is stored on multiple bookies, and the mapping between a ledger and the pool of bookies that stores it (called its ensemble) is kept in ZooKeeper. When the open ledger reaches a size threshold, or its owner fails, the ledger is closed and a new one is created. Depending on the configured replication parameters, the new ledger may be placed on a different set of bookies.
Figure 5: Ledger data is replicated across multiple bookies; each ledger's metadata and the list of ledgers that make up a topic are stored in ZooKeeper
The process of writing data to a ledger
A ledger's replication behaviour is controlled by the following parameters:
- Write Quorum (WQ): the number of bookies each entry is written to.
- Ack Quorum (AQ): the number of bookies that must acknowledge a write before the entry is considered successfully written.
- Ensemble size (E): the number of bookies in the pool that stores the ledger. When E > WQ, entries are striped across different bookies.
The set of bookies an entry is actually written to is called its write set. When E > WQ, the write sets of adjacent entries may differ, as the sketch below illustrates.
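As a rough illustration of that striping, the sketch below derives an entry's write set by round-robin placement over the ensemble. The helper name and parameters are illustrative, though BookKeeper's default distribution schedule behaves in the same spirit:

```java
import java.util.ArrayList;
import java.util.List;

class StripingSketch {
    // For entry e, pick WQ consecutive ensemble slots starting at (e mod E).
    static List<Integer> writeSet(long entryId, int ensembleSize, int writeQuorum) {
        List<Integer> bookieIndices = new ArrayList<>();
        for (int i = 0; i < writeQuorum; i++) {
            bookieIndices.add((int) ((entryId + i) % ensembleSize));
        }
        return bookieIndices;
    }
    // With E=5, WQ=3: entry 0 -> [0,1,2], entry 1 -> [1,2,3], entry 2 -> [2,3,4], ...
}
```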
Pulsar exposes APIs for setting E, WQ, and AQ per topic, so replication can be customized.
Figure 6: Message writing and acknowledgment with WQ=3, AQ=2
Last Add Confirmed (LAC)
The BookKeeper client continuously tracks the highest entry id up to which every entry has been confirmed as written; this is the Last Add Confirmed (LAC). It is a watermark: entries above this id are not yet confirmed, while entries at or below it are confirmed. Every entry sent to a bookie carries the client's current LAC, so each bookie learns the latest LAC value, albeit with some delay. As we will see below, the LAC plays further roles beyond marking the committed entries.
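A minimal sketch of the watermark logic, assuming the client learns of each entry's confirmation individually (the class and method names here are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

class LacTracker {
    private long lac = -1;                             // nothing confirmed yet
    private final Set<Long> acked = new HashSet<>();   // entries acked by >= AQ bookies

    // Called when an entry has gathered Ack Quorum acknowledgments.
    synchronized long onEntryConfirmed(long entryId) {
        acked.add(entryId);
        while (acked.remove(lac + 1)) {                // only consecutive entries move the LAC
            lac++;
        }
        return lac;                                    // piggybacked on subsequent add requests
    }
}
```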
Ledger fragments
A ledger itself is divided into one or more fragments. A newly created ledger contains a single fragment, with an ensemble of bookies allocated to store its data. When a write to a bookie fails, the client replaces that bookie with a new one: a new fragment is created, and the unconfirmed entries plus all subsequent entries are resent to the new bookie. If another write fails, yet another fragment is created, and so on. A write failure does not necessarily mean the bookie is down; transient conditions such as network fluctuations can also fail a single write. Different fragments may be stored on different ensembles, and a fragment is also commonly referred to as an ensemble.
Figure 7: Creation of the second fragment
Fragments can be seen as metadata that tells the BookKeeper client where to find the entries of a ledger. Bookies themselves know nothing about this metadata; they simply store the entries they receive and index them by ledger id and entry id.
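One way to picture that metadata is as a map from each fragment's first entry id to the ensemble storing it, which the client consults to locate any entry. The bookie names and fragment boundaries below are illustrative:

```java
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class FragmentIndex {
    private final NavigableMap<Long, List<String>> fragments = new TreeMap<>();

    FragmentIndex() {
        fragments.put(0L,    List.of("B1", "B2", "B3"));  // entries 0..999
        fragments.put(1000L, List.of("B1", "B2", "B4"));  // B3 replaced after a write failure
    }

    List<String> ensembleFor(long entryId) {
        return fragments.floorEntry(entryId).getValue();  // fragment containing entryId
    }
}
```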
Figure 8: A failed write of entry 1000 to bookie B3 causes the ledger to create a second fragment
The process of reading data from a ledger
Reads from a ledger fall into the following cases:
- Regular entry reads
- Long-poll LAC reads
- Quorum LAC reads
- Recovery reads
Unlike a write, a read only needs a response from one bookie that holds the data. If that read fails, the client simply reads the data again from another bookie that stores a copy.
Clients usually only want to read confirmed data, so they read no further than the position marked by the LAC. When reading historical data, the bookie uses its current LAC value to tell the client where to stop. Once a client has caught up to the LAC, it can issue a long-poll LAC read: the bookie parks the request and responds with new entries only once they have been confirmed. A sketch of a tailing reader follows.
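Below is a sketch of a tailing reader built on the classic client, assuming the ledger was opened without recovery (so the reader does not fence the writer). The real client also offers a long-poll LAC read in which the bookie parks the request until the LAC advances; here that is approximated with a simple sleep:

```java
import java.util.Enumeration;

import org.apache.bookkeeper.client.LedgerEntry;
import org.apache.bookkeeper.client.LedgerHandle;

class TailingReaderSketch {
    // Read everything up to the current LAC, then poll for a higher LAC.
    static void tail(LedgerHandle reader) throws Exception {
        long nextEntry = 0;
        while (true) {
            long lac = reader.readLastConfirmed();   // refresh the LAC from the bookies
            if (lac < nextEntry) {                   // nothing newly confirmed yet
                Thread.sleep(100);                   // stand-in for a long-poll LAC read
                continue;
            }
            Enumeration<LedgerEntry> entries = reader.readEntries(nextEntry, lac);
            while (entries.hasMoreElements()) {
                byte[] data = entries.nextElement().getEntry();
                // handle(data): application-specific processing (placeholder)
            }
            nextEntry = lac + 1;
        }
    }
}
```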
The remaining two read types occur mainly during data repair, which we cover below.
Different operations require different numbers of responses
Different operations require different numbers of successful responses from bookies. A regular read, for example, completes as soon as a single bookie responds successfully, while some operations need successful responses from a quorum of bookies. Operations can be grouped by the number of responses they require:
- Ack quorum (AQ)
- Write quorum (WQ)
- Quorum Coverage (QC), where QC = (WQ - AQ) + 1
- Ensemble Coverage (EC), where EC = (E - AQ) + 1
Quorum Coverage (QC) and Ensemble Coverage (EC) both satisfy the same definition (stated two equivalent ways below); they differ only in which set of bookies the definition ranges over:
- For a given request, successful responses have been received from enough bookies that every possible subset of AQ bookies from the given set contains at least one bookie that responded successfully.
- For a given request, successful responses have been received from enough bookies that no subset of AQ bookies from the given set exists in which none responded successfully.
For Quorum Coverage (QC), the set is an entry's write set; QC is used to guarantee consistency for a single entry, for example to verify whether a single entry's write could have been acknowledged to the client. For Ensemble Coverage (EC), the set is the ensemble storing the current ledger fragment; EC is used to guarantee consistency at the fragment level, for example when fencing the ledger.
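The two thresholds can be written down directly; a small sketch of the formulas above:

```java
class CoverageSketch {
    // QC: enough responses that no AQ-sized subset of an entry's
    // write set is entirely without a responder.
    static int quorumCoverage(int writeQuorum, int ackQuorum) {
        return (writeQuorum - ackQuorum) + 1;
    }
    // EC: the same property over the fragment's whole ensemble.
    static int ensembleCoverage(int ensembleSize, int ackQuorum) {
        return (ensembleSize - ackQuorum) + 1;
    }
    // Example: E=5, WQ=3, AQ=2 gives QC = 2 and EC = 4.
}
```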
WQ and AQ govern the writing of data, while QC and EC are used mainly during the ledger repair process.
Ledger Repair Process
As mentioned earlier, each ledger has a single client as its owner. When that client becomes unavailable, another client steps in, triggers the ledger repair process, and then closes the ledger. For Pulsar, this corresponds to a topic's owning broker becoming unavailable and the topic's ownership being transferred to another broker.
The ledger repair process consists of finding the highest entry id that has been confirmed, and making sure every entry up to and including it is replicated to enough bookies. The ledger is then closed: its state is set to CLOSED, and its last entry id is set to the last confirmed entry id.
How to prevent split brain
BookKeeper is a distributed system, which means a network partition can split the cluster into two or more parts. Imagine a client loses its connection to ZooKeeper: it is presumed unavailable, and another client takes over the ledger it was responsible for and starts the repair process. But the first client may still be running normally and still able to reach the BookKeeper cluster, so two clients end up operating on the same ledger at once. This is split brain: a network partition fractures a distributed system into independent parts that each keep operating, leaving the data inconsistent once the network recovers.
BookKeeper prevents split brain with the concept of fencing. When a second client (e.g. another Pulsar broker) wants to start the repair process, it first puts the ledger into the fenced state, in which the ledger rejects all new write requests. Once enough bookies have marked the ledger as fenced, the first client cannot complete any new write, even if it is still up and running. The second client can then run the repair process safely, knowing that no other client can keep writing to the ledger or attempt to repair it at the same time.
Figure 9: A new topic owner starts fencing the ledger; when the original owner writes new data, it can no longer gather Ack Quorum acknowledgments, so the write cannot complete
The first step of the repair process - set fence state
Set the ledger to fence state and confirm the LAC value.
A fence request is in fact an Ensemble Coverage read that fetches the LAC with the fencing flag set. When a bookie receives this request, it sets the ledger's state to fenced and returns its local LAC value for that ledger. Once the client has received responses from enough bookies, the request has succeeded and the next step can begin. So how many bookies are enough?
Fencing exists to stop the previous client from writing more data to the ledger. It is enough to ensure that fewer than Ack Quorum bookies remain unfenced, because then the previous client can never gather enough write acknowledgments. The new client therefore does not need to wait for every bookie to fence the ledger; it only needs enough responses that the number of bookies not yet fenced is below AQ. The number of responses that guarantees this is exactly the Ensemble Coverage, as the sketch below shows.
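A sketch of that stopping condition (names are illustrative): fencing is complete once the number of bookies that might still be unfenced drops below AQ:

```java
class FencingSketch {
    // With ensemble size E and ack quorum AQ, after (E - AQ) + 1 successful
    // fence responses at most AQ - 1 bookies can remain unfenced, so the
    // previous owner can never gather AQ acknowledgments for a new write.
    static boolean fencingComplete(int fencedResponses, int ensembleSize, int ackQuorum) {
        int ec = (ensembleSize - ackQuorum) + 1;   // Ensemble Coverage threshold
        return fencedResponses >= ec;
    }
    // Example: E=3, AQ=2 -> EC = 2; two fenced bookies leave only one unfenced.
}
```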
The second step of the repair process - repair the entry data
Next, the client issues recovery reads starting from entry id LAC + 1 and re-writes the entries it reads to the bookie pool. Writes are idempotent: if an entry has already been written to a bookie, writing the same entry to that bookie again does not duplicate the data. The client keeps reading and writing until all entries have been read, ensuring that every bookie in each entry's write set holds a copy before the ledger is closed.
A regular read needs a response from only one bookie. A recovery read, by contrast, must determine whether the entry was ever confirmed, based on the responses received from all bookies in the entry's write set. Two outcomes can be established:
- Confirmed: positive responses are received from at least Ack Quorum bookies.
- Unconfirmed: negative (no such entry) responses are received from at least Quorum Coverage bookies, meaning the entry cannot have been written to Ack Quorum bookies.
If all responses have arrived but neither threshold is reached, the entry's status cannot be determined and the repair attempt is aborted (some responses may be errors, for example due to network fluctuations, which reveal nothing about whether the entry exists on that bookie). The repair process can simply be retried until every entry's status can be decided; a sketch of this per-entry decision follows.
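A sketch of that decision, assuming each bookie in the write set answers positively (has the entry), negatively (no such entry), or with an error:

```java
class RecoveryReadSketch {
    enum EntryStatus { CONFIRMED, UNCONFIRMED, UNKNOWN }

    static EntryStatus decide(int positive, int negative, int writeQuorum, int ackQuorum) {
        int qc = (writeQuorum - ackQuorum) + 1;                    // Quorum Coverage
        if (positive >= ackQuorum) return EntryStatus.CONFIRMED;   // enough copies exist
        if (negative >= qc)        return EntryStatus.UNCONFIRMED; // AQ acks were impossible
        return EntryStatus.UNKNOWN;  // error responses leave it undecided; retry recovery
    }
}
```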
Figure 10: While reading entry 3, the new client receives enough negative responses to conclude that entry 3 was never confirmed; it then ensures the data up to entry 2 is replicated to enough bookies
The third step of the repair process - closing the Ledger
Once all confirmed entries have been identified and replicated to enough bookies, the client closes the ledger. Closing a ledger is purely a metadata update in ZooKeeper: the state is set to CLOSED and the Last Entry Id is set to the last confirmed entry id. Bookies play no part in this and never learn whether a ledger is closed; a bookie has no concept of open or closed.
The metadata update in ZooKeeper is a versioned compare-and-swap (CAS) operation. If another client has been repairing the ledger at the same time and has already closed it, the CAS fails. This prevents multiple clients from completing repair of the same ledger concurrently; a sketch of the idea follows.
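A sketch of the versioned update using ZooKeeper's conditional setData. The znode path and the LedgerMetadata, State, serialize, and deserialize helpers are hypothetical, not BookKeeper's actual metadata layout:

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

class CloseLedgerSketch {
    // LedgerMetadata, State, serialize and deserialize are hypothetical helpers.
    static void closeLedger(ZooKeeper zk, long lastConfirmedEntryId) throws Exception {
        Stat stat = new Stat();
        byte[] raw = zk.getData("/ledgers/L0042", false, stat);  // read metadata + version
        LedgerMetadata md = deserialize(raw);
        md.state = State.CLOSED;
        md.lastEntryId = lastConfirmedEntryId;                   // result of the repair
        try {
            zk.setData("/ledgers/L0042", serialize(md), stat.getVersion()); // CAS on version
        } catch (KeeperException.BadVersionException e) {
            // Another client closed or modified the ledger first: re-read and re-check.
        }
    }
}
```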
Summary
This post has walked through most of the BookKeeper replication protocol. The key point to remember is that a bookie is just a storage node that stores and serves entries. The protocol lives in the BookKeeper client: it creates ledgers, chooses the bookie pools that store them, creates fragments, enforces replication through Write Quorum and Ack Quorum, and repairs and closes ledgers when a failure occurs.
Related Reading
- Recommended reading | In-depth analysis of Apache BookKeeper, Part 1: Architecture Principles
- Recommended reading | 5 diagrams to get you started with Pulsar's storage engine, BookKeeper