云原生 - In-depth analysis of Apache BookKeeper series: Part 2 - Principles of Write Operation - ApachePulsar

In the previous article , we explained the principle of bookie server from three aspects: components, threads, and read and write processes. In this article, we will describe in detail how write operations are efficiently written and quickly dropped to disk through the cooperation of various components and threading models. We try to analyze at the architectural level.

This series of articles is based on BookKeeper version 4.14 configured in Apache Pulsar.

There are many threads calling the Journal and LedgerStorage APIs during write operations. In the previous article , we already know that Journal is a synchronous operation and DbLedgerStorage is an asynchronous operation in the write operation.

Figure 1: How each thread handles write operations

We know that multiple instances of Journal and DbLedgerStorage can be configured, each with its own thread, queue and cache. So when it comes to certain threads, caches and queues, they may exist in parallel.

Netty thread

Netty threads handle all TCP connections and all requests on those connections. And forward these write requests to the write thread pool, which includes the entry request to be written, handle the callback when the request ends, and send the response to the client.

write thread pool

The write thread pool does not have much to do, so it doesn't need a lot of threads (default is 1). Each write request adds an Entry to the Write Cache of DbLedgerStorage, and if successful, adds the write request to the Journal's memory queue (BlockingQueue). At this point, the work of the writing thread is completed, and the rest of the work is handed over to other threads.

Each DbLedgerStorage instance has two write caches, one active and one idle. The idle cache can flush data to disk in the background. When DbLedgerStorage needs to flush data to disk (after the active write cache is full), the two write caches are swapped. While the idle write cache is flushing data to disk, an empty write cache can continue to provide write services. As long as the data in the idle write cache is flushed to disk before the active write cache is full, there should be no problem.

The refresh operation of DbLedgerStorage can be triggered by the synchronization thread (Sync Thread) periodically executing the checkpoint mechanism or through the DbStorage thread (DbStorage Thread, each DbLedgerStorage instance corresponds to a DbStorage thread).

If the write cache is full when the write thread tries to add an Entry to the write cache, the write thread submits the flush operation to the DbStorage thread; if the swapped write cache has completed the flush operation, the two write caches will immediately Execute the swap operation (swap), and then the write thread adds the Entry to the newly swapped write cache, and this part of the write operation is completed.

However, if the active write buffer is full and the swapped out write buffer is still flushing, the write thread will wait for a while and eventually reject the write request. The time to wait for the write cache is controlled by the parameter in the configuration file dbStorage_maxThrottleTimeMs , the default value is 10000 (10 seconds).

By default, there is only one thread in the write thread pool. If the flush operation is too long, this will cause the write thread to block for 10 seconds, which will cause the task queue of the write thread pool to quickly fill up with write requests, thereby rejecting additional writes. ask. This is the back pressure mechanism of DbLedgerStorage. Once the flushed write cache can be written again, the blocking state of the write thread pool will be released.

The size of the write cache defaults to 25% of the available direct memory, which can be set by dbStorage_writeCacheMaxSizeMb in the configuration file. The total available memory is allocated to the two write caches in each DbLedgerStorage instance, one DbLedgerStorage instance per ledger directory. If there are 2 ledger directories and 1GB of free write cache memory, each DbLedgerStorage instance will be allocated 500MB, of which each write cache will be allocated 250MB.

DbStorage thread

Each DbLedgerStorage instance has its own DbStorage thread. When the write cache is full, this thread is responsible for flushing the data to disk.

Sync thread

This thread is outside the Journal module and the DbLedgerStorage module. Its job is mainly to periodically perform checkpoints, checkpoints

There are the following:

Ledger's brush operation (long-term storage)
Mark the location in the Journal where data has been safely flushed to the ledger disk by writing the log mark file to the disk.
Clean up old Journal files that are no longer needed

This synchronization prevents two different threads from flushing at the same time.

When DbLedgerStorage is flushed, the swapped write cache will be written to the current entry log file (there will also be a log split operation), first these entries will be sorted by ledgerId and entryId, and then the entry will be written to entry log files and write their locations to the Entry Locations Index. This ordering when writing entries is to optimize the performance of read operations, which we will cover in the next article in this series.

Once all write-requested data has been flushed to disk, the swapped-out write-cache is flushed to be swapped with the active write-cache again.

Journal thread

The Journal thread is a loop, it gets the entry from the memory queue (BlockingQueue), writes the entry to disk, and periodically adds forced write requests to the Force Write queue, which triggers the fsync operation.

Journal does not execute the write system call for each entry obtained in the queue, it accumulates the entries and writes them to disk in batches (this is how BookKeeper flushes the disk), which is also called group commit. The following conditions will trigger the flush operation:

Maximum wait time is reached (configured by journalMaxGroupWaitMSec , the default value is 2ms)
Reached the maximum accumulated bytes (configured by journalBufferedWritesThreshold , the default value is 512Kb)
The accumulated number of entries reaches the maximum value (configured by journalBufferedEntriesThreshold , the default value is 0, 0 means not to use this configuration)
When the last entry in the queue is taken out, that is, the queue changes from non-empty to empty (configured through journalFlushWhenQueueEmpty , the default value is false )

Every time the disk is flushed, a Force Write Request is created, which contains the entry to be flushed.

Force write thread

The forced write thread is a loop that gets the forced write request from the forced write queue and performs an fsync operation on the journal file. The mandatory write request includes the entries to be written and the callbacks for these entry requests, so that after persisting to disk, the callbacks for these entry write requests can be submitted to the callback thread for execution.

Journal callback thread

This thread executes the callback for the write request and sends the response to the client.

Frequently asked questions

The bottleneck for write operations is usually disk IO in Journal or DbLedgerStorage. If the journal or synchronization operation (Fsync) is too slow, then the Journal thread and the forced write thread (can't quickly get entries from their respective queues. Similarly, if the DbLedgerStorage is too slow to flush the disk, then the Write Cache cannot be emptied and cannot be Swap quickly.
If Journal encounters a bottleneck, the number of tasks in the task queue of the writing thread pool will reach the upper limit, the entry will be blocked in the Journal queue, and the writing thread will also be blocked. Once the thread pool task queue is full, write operations will be rejected at the Netty layer because the Netty threads will not be able to submit more write requests to the writer thread pool. If you use a flame graph, you will find that the writer threads in the writer thread pool are busy. If the bottleneck is DbLedgerStorage, then DbLedgerStorage itself can refuse writes, and after 10 seconds (by default), the writer thread pool will quickly fill up, causing Netty threads to refuse write requests.
If the disk IO is not the bottleneck, but the CPU utilization is very high, it is very likely that high-performance disks are used, but the CPU performance is relatively low, resulting in lower processing efficiency of Netty threads and various other threads. This situation can be easily located through the monitoring indicators of the system.

Summarize

This article explains the write operation flow at the Journal and DBLedgerStorage level, and how the threads involved in the write operation work. In the next article, we will cover read operations.

In-depth analysis of Apache BookKeeper series: Part 2 - Principles of Write Operation

Netty thread

write thread pool

DbStorage thread

Sync thread

Journal thread

Force write thread

Journal callback thread

Frequently asked questions

Summarize

Related Reading

Translator Profile

ApachePulsar

引用和评论

【万字长文】大模型开源开发全景与趋势解读

基于 MCP 的 AI Agent 应用开发实践

OSPO Summit 2025 正式定档！议题征集同步开启

OSPO Summit 2025 首批议程发布！

定档 7 月！Community Over Code Asia 2025 议题征集全面启动！

祝贺陈梓立(Tison)当选 2025 年度 Apache 软件基金会董事会

强烈推荐|新手从搭建到二开TinyEngine低代码引擎