In-depth analysis of the RocketMQ source code: the message storage module

1. Introduction

RocketMQ is a distributed messaging middleware open-sourced by Alibaba. It draws on Kafka's design and supports publish/subscribe messaging, ordered messages, transactional messages, scheduled messages, message backtracking, dead-letter queues, and other features. The RocketMQ architecture has four main parts, as shown in the following figure:

Producer: Message producer, supports distributed cluster deployment.
Consumer: Message consumer, supports distributed cluster deployment.
NameServer: The name service is a very simple Topic routing registry that supports dynamic registration and discovery of Brokers. Producers and Consumers learn the Brokers' routing information dynamically through NameServer.
Broker: Broker is mainly responsible for message storage, forwarding and query.

This article analyzes how the message storage module in Broker is designed based on Apache RocketMQ version 4.9.1.

2. Storage architecture

The message file path of RocketMQ is shown in the figure.

CommitLog

CommitLog is the main storage for message bodies and metadata. It stores the message content written by the Producer, and messages are variable-length. A single file is 1 GB by default. The file name is 20 digits long: it is the file's starting offset, left-padded with zeros. For example, 00000000000000000000 is the first file, with starting offset 0 and file size 1 GB = 1073741824 bytes. When the first file is full, the second file is 00000000001073741824, with starting offset 1073741824, and so on.
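The naming rule can be checked with a few lines of Java. This is a minimal sketch of the rule as described above, not RocketMQ's own utility code; the class and method names (CommitLogFileNames, fileNameFor, fileStartOffset) are made up for illustration.

```java
import java.util.Locale;

class CommitLogFileNames {
    // Default CommitLog file size from the text: 1 GB = 1073741824 bytes.
    static final long FILE_SIZE = 1024L * 1024 * 1024;

    // The file name is the starting offset, left-padded with zeros to 20 digits.
    static String fileNameFor(long startOffset) {
        return String.format(Locale.ROOT, "%020d", startOffset);
    }

    // Starting offset of the file that contains the given physical offset.
    static long fileStartOffset(long physicalOffset) {
        return physicalOffset - (physicalOffset % FILE_SIZE);
    }

    public static void main(String[] args) {
        System.out.println(fileNameFor(0L));                              // first file
        System.out.println(fileNameFor(FILE_SIZE));                       // second file
        System.out.println(fileNameFor(fileStartOffset(1_500_000_000L))); // file holding offset 1.5e9
    }
}
```

Because the name encodes the starting offset, the file that holds any physical offset can be located by arithmetic alone, without an extra lookup table.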

ConsumeQueue

ConsumeQueue is the message consumption queue. A ConsumeQueue file can be regarded as an index file built on top of the CommitLog. ConsumeQueue files use a fixed-length design: each entry is 20 bytes, made up of an 8-byte CommitLog physical offset, a 4-byte message length, and an 8-byte tag hashcode. A single file holds 300,000 entries, and each entry can be accessed randomly like an array element, so each ConsumeQueue file is about 5.72 MB.
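The fixed-length layout is what makes array-style access possible: entry i always starts at byte i * 20. The following sketch illustrates this with a plain heap ByteBuffer standing in for the mapped file; the class and method names are hypothetical and only the 8 + 4 + 8 byte layout comes from the text.

```java
import java.nio.ByteBuffer;

class ConsumeQueueEntryDemo {
    static final int UNIT_SIZE = 8 + 4 + 8;      // offset + length + tag hashcode = 20 bytes
    static final int ENTRIES_PER_FILE = 300_000; // entries per file, as stated in the text

    // Write entry number `index` at its fixed position.
    static void putEntry(ByteBuffer file, int index, long commitLogOffset, int msgSize, long tagsCode) {
        int pos = index * UNIT_SIZE;
        file.putLong(pos, commitLogOffset);
        file.putInt(pos + 8, msgSize);
        file.putLong(pos + 12, tagsCode);
    }

    static long commitLogOffsetAt(ByteBuffer file, int index) {
        return file.getLong(index * UNIT_SIZE);
    }

    static int msgSizeAt(ByteBuffer file, int index) {
        return file.getInt(index * UNIT_SIZE + 8);
    }

    static long tagsCodeAt(ByteBuffer file, int index) {
        return file.getLong(index * UNIT_SIZE + 12);
    }

    // Writes one entry into a full-size buffer and reads it back.
    static long[] roundTrip(int index, long commitLogOffset, int msgSize, long tagsCode) {
        ByteBuffer file = ByteBuffer.allocate(ENTRIES_PER_FILE * UNIT_SIZE);
        putEntry(file, index, commitLogOffset, msgSize, tagsCode);
        return new long[] {commitLogOffsetAt(file, index), msgSizeAt(file, index), tagsCodeAt(file, index)};
    }

    public static void main(String[] args) {
        int bytes = ENTRIES_PER_FILE * UNIT_SIZE;
        System.out.println(bytes + " bytes = " + bytes / (1024.0 * 1024.0) + " MiB");
        System.out.println(java.util.Arrays.toString(roundTrip(299_999, 1073741824L, 256, 42L)));
    }
}
```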

IndexFile

The index file provides a way to query messages by key or by time interval. A single IndexFile is about 400 MB and can store 20 million index entries. The underlying storage design of IndexFile is similar to the JDK's HashMap data structure.

Other files: these include the config folder, which stores runtime configuration; the abort file, which indicates whether the Broker was shut down normally; and the checkpoint file, which stores the last flush timestamps of the CommitLog, ConsumeQueue, and Index files. These are beyond the scope of this article.

Compare this with Kafka, where each partition of each topic corresponds to its own file, which is written sequentially and flushed periodically. Once a single Broker hosts too many topics, those sequential writes degrade into random writes. In RocketMQ, all topics on a Broker are written sequentially into the same CommitLog, which guarantees strictly sequential writes. The cost is on the read side: RocketMQ must first get the physical offset of a message from the ConsumeQueue and then read the message content from the CommitLog, which results in random reads.

2.1 Page Cache and mmap

Before introducing the implementation of the Broker message storage module, this section explains two concepts: Page Cache and mmap.

Page Cache is the operating system's cache for files, used to speed up file reads and writes. In general, sequential file reads and writes run at close to memory speed, mainly because the OS sets aside part of memory as Page Cache to optimize read and write performance. For writes, the OS first writes to the cache, and the pdflush kernel thread then flushes the cached data to the physical disk asynchronously. For reads, if a read misses the Page Cache, the OS reads the file from the physical disk and also prefetches the adjacent data blocks.

mmap maps a physical file on disk directly into a user-space memory address. This removes the overhead that traditional I/O incurs by copying file data back and forth between a buffer in kernel address space and a buffer in the application's address space. In Java NIO, FileChannel provides the map() method to implement mmap. For a read/write performance comparison between FileChannel (file channel) and mmap (memory mapping), refer to this article.
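To make the map() call concrete, here is a small self-contained sketch that writes bytes through a MappedByteBuffer and then reads them back with ordinary file I/O. It uses only standard JDK classes; the class name MmapDemo and the temporary file are assumptions for the example and have nothing to do with RocketMQ's own file layout.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class MmapDemo {
    // Writes the text through a memory mapping, then reads the file back normally.
    static String writeViaMmapAndReadBack(String text) {
        byte[] payload = text.getBytes(StandardCharsets.UTF_8);
        try {
            Path path = Files.createTempFile("mmap-demo", ".bin");
            path.toFile().deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "rw");
                 FileChannel channel = raf.getChannel()) {
                // Mapping in READ_WRITE mode grows the empty file to the mapped size.
                MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, payload.length);
                mapped.put(payload); // a plain memory write, no write() system call
                mapped.force();      // ask the OS to flush the dirty pages to disk
            }
            return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeViaMmapAndReadBack("hello mmap"));
    }
}
```

The put() call only touches memory; whether and when the data reaches the disk is decided by the OS unless force() is called.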

2.2 Broker module

The following figure shows the Broker storage architecture and the flow through the Broker module from receiving a message to returning a response.

Service access layer: RocketMQ implements its underlying communication on Netty's multi-threaded Reactor model. The Reactor main thread pool, eventLoopGroupBoss, is responsible for creating TCP connections and has only one thread by default. Once a connection is established, it is handed to the Reactor sub-thread pool, eventLoopGroupSelector, which handles read and write events.

defaultEventExecutorGroup is responsible for SSL verification, encoding and decoding, idle checking, and network connection management. Then, based on the request code of the RemotingCommand, the corresponding processor is looked up in the local cache variable processorTable, the request is wrapped in a task, and the task is submitted to that processor's thread pool for execution. The Broker module improves system throughput through these four levels of thread pools.
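The last step, looking up a processor by request code and running it on its own thread pool, can be sketched as follows. This is a simplified illustration under assumed names: the RequestProcessor interface, the Pair class, and the request codes 10 and 11 are placeholders for the example, not RocketMQ's actual types or constants.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ProcessorTableDemo {
    interface RequestProcessor {
        String process(String body);
    }

    // A processor together with the thread pool that executes it.
    static final class Pair {
        final RequestProcessor processor;
        final ExecutorService executor;

        Pair(RequestProcessor processor, ExecutorService executor) {
            this.processor = processor;
            this.executor = executor;
        }
    }

    private final Map<Integer, Pair> processorTable = new HashMap<>();

    void register(int requestCode, RequestProcessor processor, ExecutorService executor) {
        processorTable.put(requestCode, new Pair(processor, executor));
    }

    // Look up the processor by request code, wrap the call in a task, submit it to its pool.
    Future<String> dispatch(int requestCode, String body) {
        Pair pair = processorTable.get(requestCode);
        if (pair == null) {
            throw new IllegalArgumentException("unsupported request code: " + requestCode);
        }
        Callable<String> task = () -> pair.processor.process(body);
        return pair.executor.submit(task);
    }

    static String runDemo(int requestCode, String body) {
        ExecutorService sendPool = Executors.newFixedThreadPool(2);
        ExecutorService pullPool = Executors.newFixedThreadPool(2);
        try {
            ProcessorTableDemo server = new ProcessorTableDemo();
            server.register(10, b -> "stored:" + b, sendPool); // hypothetical "send" code
            server.register(11, b -> "pulled:" + b, pullPool); // hypothetical "pull" code
            return server.dispatch(requestCode, body).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            sendPool.shutdown();
            pullPool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo(10, "hello"));
        System.out.println(runDemo(11, "hello"));
    }
}
```

Giving each kind of request its own pool keeps a slow request type, such as large pulls, from starving the others.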

Business processing layer: Processes the various business requests invoked via RPC, among which:

SendMessageProcessor is responsible for processing the request of the Producer to send messages;
PullMessageProcessor is responsible for processing Consumer's request to consume messages;
QueryMessageProcessor is responsible for processing the request for querying the message according to the message key and so on.

Storage logic layer: DefaultMessageStore is the core storage logic of RocketMQ, providing the ability to store, read, and delete messages.

File mapping layer: Maps the CommitLog, ConsumeQueue, and IndexFile files to the storage object MappedFile.

Data transmission layer: Supports reading and writing messages through mmap memory mapping. It also supports a mode in which messages are read through mmap and written to off-heap memory.

The following chapters analyze, from the perspective of the source code, how RocketMQ achieves high-performance storage.

3. Message writing

Taking the production of a single message as an example, the sequence of a message write is shown in the figure below. The request flows between the layers shown in the Broker storage architecture above.

The core code for writing messages at the lowest level is in the asyncPutMessage method of CommitLog. It has three main steps: obtaining a MappedFile, writing the message to the buffer, and submitting a flush request. Note that these steps are bracketed by acquiring and releasing a lock (a spin lock or a ReentrantLock), which ensures that a single Broker writes messages serially.

//org.apache.rocketmq.store.CommitLog::asyncPutMessage
public CompletableFuture<PutMessageResult> asyncPutMessage(final MessageExtBrokerInner msg) {
    ...
    putMessageLock.lock(); // spin lock or ReentrantLock, depending on store config
    try {
        // Get the latest MappedFile
        MappedFile mappedFile = this.mappedFileQueue.getLastMappedFile();
        ...
        // Write the message to the buffer
        result = mappedFile.appendMessage(msg, this.appendMessageCallback, putMessageContext);
        ...
        // Submit the flush request
        CompletableFuture<PutMessageStatus> flushResultFuture = submitFlushRequest(result, msg);
        ...
    } finally {
        putMessageLock.unlock();
    }
    ...
}
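The two lock flavors mentioned in the code comment can be sketched as below. This is an illustrative version, not RocketMQ's implementation: the PutLock interface, the class names, and the useReentrantLock parameter are assumptions for the example. The spin lock uses an AtomicBoolean with compare-and-set, and a shared counter stands in for the write position of the CommitLog.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

class PutMessageLockDemo {
    interface PutLock {
        void lock();
        void unlock();
    }

    // Busy-waits with compare-and-set; cheap when the critical section is very short.
    static final class SpinLock implements PutLock {
        private final AtomicBoolean free = new AtomicBoolean(true);

        public void lock() {
            while (!free.compareAndSet(true, false)) {
                Thread.yield(); // added for the demo so it behaves well on few cores
            }
        }

        public void unlock() {
            free.set(true);
        }
    }

    // Parks waiting threads instead of spinning; better under heavy contention.
    static final class BlockingLock implements PutLock {
        private final ReentrantLock lock = new ReentrantLock();

        public void lock() {
            lock.lock();
        }

        public void unlock() {
            lock.unlock();
        }
    }

    static PutLock choose(boolean useReentrantLock) {
        return useReentrantLock ? new BlockingLock() : new SpinLock();
    }

    // Several threads "append" under the lock; returns the final write position.
    static int appendConcurrently(PutLock lock, int threads, int appendsPerThread) {
        int[] wrotePosition = {0};
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < appendsPerThread; i++) {
                    lock.lock();
                    try {
                        wrotePosition[0]++; // serialized by the lock
                    } finally {
                        lock.unlock();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return wrotePosition[0];
    }

    public static void main(String[] args) {
        System.out.println(appendConcurrently(choose(false), 4, 1000));
        System.out.println(appendConcurrently(choose(true), 4, 1000));
    }
}
```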

The following sections describe what each of these three steps does.

3.1 MappedFile initialization

When the Broker is initialized, it starts the AllocateMappedFileService asynchronous thread, which manages the creation of MappedFile instances. The message processing thread and the AllocateMappedFileService thread communicate through the requestQueue queue.

When a message is written, the putRequestAndReturnMappedFile method of AllocateMappedFileService is called to submit MappedFile creation requests to requestQueue. Two AllocateRequests are constructed and put in the queue at the same time: one for the file needed now and one for the file after it.

The AllocateMappedFileService thread loops, taking AllocateRequests from requestQueue and creating the MappedFile for each. The message processing thread waits on a CountDownLatch until the first MappedFile has been created, and then returns.

The next time the message processing thread needs a new MappedFile, it can directly take the one that was created in advance. Pre-creating a MappedFile in this way reduces the time spent waiting for file creation.

//org.apache.rocketmq.store.AllocateMappedFileService::putRequestAndReturnMappedFile
public MappedFile putRequestAndReturnMappedFile(String nextFilePath, String nextNextFilePath, int fileSize) {
    // Request creation of the MappedFile
    AllocateRequest nextReq = new AllocateRequest(nextFilePath, fileSize);
    boolean nextPutOK = this.requestTable.putIfAbsent(nextFilePath, nextReq) == null;
    ...
    // Request pre-creation of the next MappedFile
    AllocateRequest nextNextReq = new AllocateRequest(nextNextFilePath, fileSize);
    boolean nextNextPutOK = this.requestTable.putIfAbsent(nextNextFilePath, nextNextReq) == null;
    ...
    // Get the MappedFile created for this request
    AllocateRequest result = this.requestTable.get(nextFilePath);
    ...
}
 
//org.apache.rocketmq.store.AllocateMappedFileService::run
public void run() {
    ...
    while (!this.isStopped() && this.mmapOperation()) {
    }
    ...
}
 
//org.apache.rocketmq.store.AllocateMappedFileService::mmapOperation
private boolean mmapOperation() {
    ...
    // Take an AllocateRequest from the queue
    req = this.requestQueue.take();
    ...
    // Check whether the off-heap memory pool is enabled
    if (messageStore.getMessageStoreConfig().isTransientStorePoolEnable()) {
        // MappedFile backed by off-heap memory
        mappedFile = ServiceLoader.load(MappedFile.class).iterator().next();
        mappedFile.init(req.getFilePath(), req.getFileSize(), messageStore.getTransientStorePool());
    } else {
        // Ordinary MappedFile
        mappedFile = new MappedFile(req.getFilePath(), req.getFileSize());
    }
    ...
    // Warm up the MappedFile
    if (mappedFile.getFileSize() >= this.messageStore.getMessageStoreConfig()
        .getMappedFileSizeCommitLog()
        &&
        this.messageStore.getMessageStoreConfig().isWarmMapedFileEnable()) {
        mappedFile.warmMappedFile(this.messageStore.getMessageStoreConfig().getFlushDiskType(),
            this.messageStore.getMessageStoreConfig().getFlushLeastPagesWhenWarmMapedFile());
    }
    req.setMappedFile(mappedFile);
    ...
}
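The hand-off between the two threads can be reduced to a small runnable model. The sketch below keeps the names requestTable, requestQueue, and AllocateRequest from the excerpt, but everything else is simplified: a String stands in for the MappedFile, the submitted counter exists only for the demo, and error handling is minimal.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PreallocationDemo {
    static final class AllocateRequest {
        final String filePath;
        final CountDownLatch done = new CountDownLatch(1);
        volatile String mappedFile; // stand-in for a real MappedFile

        AllocateRequest(String filePath) {
            this.filePath = filePath;
        }
    }

    private final ConcurrentMap<String, AllocateRequest> requestTable = new ConcurrentHashMap<>();
    private final BlockingQueue<AllocateRequest> requestQueue = new LinkedBlockingQueue<>();
    private final AtomicInteger submitted = new AtomicInteger(); // demo-only counter

    PreallocationDemo() {
        Thread allocator = new Thread(() -> {
            try {
                while (true) {
                    AllocateRequest req = requestQueue.take();
                    req.mappedFile = "mapped:" + req.filePath; // the slow creation would happen here
                    req.done.countDown();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "allocate-sketch");
        allocator.setDaemon(true);
        allocator.start();
    }

    private void submitIfAbsent(String filePath) {
        AllocateRequest req = new AllocateRequest(filePath);
        if (requestTable.putIfAbsent(filePath, req) == null) {
            requestQueue.offer(req);
            submitted.incrementAndGet();
        }
    }

    // Requests the file needed now plus the one after it, then waits only for the first.
    String putRequestAndReturnMappedFile(String nextFilePath, String nextNextFilePath) {
        submitIfAbsent(nextFilePath);
        submitIfAbsent(nextNextFilePath);
        AllocateRequest req = requestTable.get(nextFilePath);
        try {
            if (!req.done.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("allocation timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        requestTable.remove(nextFilePath);
        return req.mappedFile;
    }

    // Two consecutive calls submit three requests in total, because file1 was pre-created.
    static int runDemo() {
        PreallocationDemo service = new PreallocationDemo();
        String first = service.putRequestAndReturnMappedFile("file0", "file1");
        String second = service.putRequestAndReturnMappedFile("file1", "file2");
        if (!first.equals("mapped:file0") || !second.equals("mapped:file1")) {
            throw new IllegalStateException("unexpected result");
        }
        return service.submitted.get();
    }

    public static void main(String[] args) {
        System.out.println("requests submitted: " + runDemo());
    }
}
```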

Each time an ordinary MappedFile is created, a mappedByteBuffer is created for it. The following code shows how mmap is done in Java.

//org.apache.rocketmq.store.MappedFile::init
private void init(final String fileName, final int fileSize) throws IOException {
    ...
    this.fileChannel = new RandomAccessFile(this.file, "rw").getChannel();
    this.mappedByteBuffer = this.fileChannel.map(MapMode.READ_WRITE, 0, fileSize);
    ...
}

If off-heap memory is enabled, that is, when transientStorePoolEnable = true, mappedByteBuffer is used only to read messages and off-heap memory is used to write them, which separates message reads from message writes. The off-heap memory is not allocated each time a new MappedFile is created. Instead, it is initialized at system startup according to the size of the off-heap memory pool. Each off-heap DirectByteBuffer is the same size as a CommitLog file, and the off-heap memory is locked so that it will not be swapped out.

//org.apache.rocketmq.store.TransientStorePool
public void init() {
    for (int i = 0; i < poolSize; i++) {
        // Allocate off-heap memory of the same size as a CommitLog file
        ByteBuffer byteBuffer = ByteBuffer.allocateDirect(fileSize);
        final long address = ((DirectBuffer) byteBuffer).address();
        Pointer pointer = new Pointer(address);
        // Lock the off-heap memory so that it is not swapped out
        LibC.INSTANCE.mlock(pointer, new NativeLong(fileSize));
        availableBuffers.offer(byteBuffer);
    }
}
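Leaving out the native mlock call, the pool itself is just a queue of pre-allocated direct buffers that are borrowed and returned. The following sketch shows that life cycle under simplified assumptions: the class name, the tiny buffer size, and the method names borrowBuffer, returnBuffer, and availableBufferNums are chosen for the example and are not claimed to match RocketMQ's API.

```java
import java.nio.ByteBuffer;
import java.util.Deque;
import java.util.concurrent.ConcurrentLinkedDeque;

class TransientPoolDemo {
    private final Deque<ByteBuffer> availableBuffers = new ConcurrentLinkedDeque<>();

    // Allocate all direct (off-heap) buffers up front, once, at startup.
    TransientPoolDemo(int poolSize, int bufferSize) {
        for (int i = 0; i < poolSize; i++) {
            availableBuffers.offer(ByteBuffer.allocateDirect(bufferSize));
        }
    }

    // Returns null when the pool is exhausted.
    ByteBuffer borrowBuffer() {
        return availableBuffers.pollFirst();
    }

    // Reset position and limit so the buffer can be reused for the next file.
    void returnBuffer(ByteBuffer buffer) {
        buffer.clear();
        availableBuffers.offerFirst(buffer);
    }

    int availableBufferNums() {
        return availableBuffers.size();
    }

    // Returns {available after borrow, available after return, position after return}.
    static int[] runDemo() {
        TransientPoolDemo pool = new TransientPoolDemo(2, 1024);
        ByteBuffer buffer = pool.borrowBuffer();
        buffer.putInt(7); // pretend to write a message
        int afterBorrow = pool.availableBufferNums();
        pool.returnBuffer(buffer);
        int afterReturn = pool.availableBufferNums();
        return new int[] {afterBorrow, afterReturn, buffer.position()};
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(runDemo()));
    }
}
```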

The mmapOperation method above contains MappedFile warm-up logic. Why do files need to be warmed up, and how is it done?

mmap only establishes the mapping between the process's virtual address space and the file; it does not load the file content into physical memory. When data is then read or written and the page is not in the Page Cache, a page fault occurs and the data has to be loaded from disk into memory, which hurts read and write performance. To avoid page faults, and to stop the operating system from moving the relevant memory pages to swap space, RocketMQ warms up files as follows.

//org.apache.rocketmq.store.MappedFile::warmMappedFile
public void warmMappedFile(FlushDiskType type, int pages) {
        ByteBuffer byteBuffer = this.mappedByteBuffer.slice();
        int flush = 0;
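        // Write a zero byte into every page of the file (1 GB in total) so that the OS allocates physical memory. Without these writes the OS does not actually allocate physical memory, and page faults would occur when messages are written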
        for (int i = 0, j = 0; i < this.fileSize; i += MappedFile.OS_PAGE_SIZE, j++) {
            byteBuffer.put(i, (byte) 0);
            // force flush when flush disk type is sync
            if (type == FlushDiskType.SYNC_FLUSH) {
                if ((i / OS_PAGE_SIZE) - (flush / OS_PAGE_SIZE) >= pages) {
                    flush = i;
                    mappedByteBuffer.force();
                }
            }
 
            //prevent gc
            if (j % 1000 == 0) {
                Thread.sleep(0);
            }
        }
 
        //force flush when prepare load finished
        if (type == FlushDiskType.SYNC_FLUSH) {
            mappedByteBuffer.force();
        }
        ...
        this.mlock();
}
 
//org.apache.rocketmq.store.MappedFile::mlock
public void mlock() {
    final long beginTime = System.currentTimeMillis();
    final long address = ((DirectBuffer) (this.mappedByteBuffer)).address();
    Pointer pointer = new Pointer(address);
 
    // Lock the file's Page Cache with the mlock system call so that it is not moved to swap space
    int ret = LibC.INSTANCE.mlock(pointer, new NativeLong(this.fileSize));

    // Advise the OS with the madvise system call that the file will be accessed in the near future
    ret = LibC.INSTANCE.madvise(pointer, new NativeLong(this.fileSize), LibC.MADV_WILLNEED);
}

In summary, RocketMQ always creates the next file in advance to reduce file creation latency, and it warms up files to avoid page faults during reads and writes.

3.2 Message writing

3.2.1 Write CommitLog

The logical view of each message storage in CommitLog is shown in the figure below. TOTALSIZE is the size of the storage space occupied by the entire message.

The following table explains which fields each message contains, as well as the space occupied by these fields and a brief introduction to the fields.

The message is written by calling the appendMessagesInner method of MappedFile.

//org.apache.rocketmq.store.MappedFile::appendMessagesInner
public AppendMessageResult appendMessagesInner(final MessageExt messageExt, final AppendMessageCallback cb,
        PutMessageContext putMessageContext) {
    // Decide whether to write through the DirectBuffer (writeBuffer) or the MappedByteBuffer
    ByteBuffer byteBuffer = writeBuffer != null ? writeBuffer.slice() : this.mappedByteBuffer.slice();
    ...
    byteBuffer.position(currentPos);
    AppendMessageResult result = cb.doAppend(this.getFileFromOffset(), byteBuffer, this.fileSize - currentPos,
        (MessageExtBrokerInner) messageExt, putMessageContext);
    ...
    return result;
}
 
//org.apache.rocketmq.store.CommitLog.DefaultAppendMessageCallback::doAppend
public AppendMessageResult doAppend(final long fileFromOffset, final ByteBuffer byteBuffer, final int maxBlank,
    final MessageExtBrokerInner msgInner, PutMessageContext putMessageContext) {
    ...
    ByteBuffer preEncodeBuffer = msgInner.getEncodedBuff();
    ...
    // This only writes the message into the buffer; nothing has been flushed to disk yet
    byteBuffer.put(preEncodeBuffer);
    msgInner.setEncodedBuff(null);
    ...
    return result;
}

At this point, the message has been written into the ByteBuffer but has not yet been persisted to disk. When it is persisted is covered in the next section on the flushing mechanism. One question remains: how are ConsumeQueue and IndexFile written?

The answer is the ReputMessageService in the storage logic layer of the storage architecture diagram. When the MessageStore is initialized, it starts a ReputMessageService asynchronous thread. Once started, this thread calls the doReput method in a loop to notify ConsumeQueue and IndexFile to update. ConsumeQueue and IndexFile can be updated asynchronously because the CommitLog stores the queue and topic information needed to rebuild them. Even if the Broker goes down abnormally, it can rebuild ConsumeQueue and IndexFile from the CommitLog after restarting.

//org.apache.rocketmq.store.DefaultMessageStore.ReputMessageService::run
public void run() {
    ...
    while (!this.isStopped()) {
        Thread.sleep(1);
        this.doReput();
    }
    ...
}
 
//org.apache.rocketmq.store.DefaultMessageStore.ReputMessageService::doReput
private void doReput() {
    ...
    // Get the next new message stored in the CommitLog
    DispatchRequest dispatchRequest =
        DefaultMessageStore.this.commitLog.checkMessageAndReturnSize(result.getByteBuffer(), false, false);
    int size = dispatchRequest.getBufferSize() == -1 ? dispatchRequest.getMsgSize() : dispatchRequest.getBufferSize();

    if (dispatchRequest.isSuccess()) {
        if (size > 0) {
            // If there is a new message, call CommitLogDispatcherBuildConsumeQueue and CommitLogDispatcherBuildIndex to build ConsumeQueue and IndexFile
            DefaultMessageStore.this.doDispatch(dispatchRequest);
        }
    }
    ...
}
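The idea behind doReput, a cursor that chases the CommitLog write position and hands each newly found message to a list of dispatchers, can be modeled in memory. The sketch below is a toy version under stated assumptions: the record layout (a 4-byte total size followed by the topic bytes), the Dispatcher interface, and all names are invented for the illustration and are much simpler than the real message format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReputDemo {
    interface Dispatcher {
        void dispatch(long commitLogOffset, int size, String topic);
    }

    private final ByteBuffer commitLog = ByteBuffer.allocate(1024);
    private int writePos = 0;        // where the next message is appended
    private int reputFromOffset = 0; // how far dispatching has progressed
    final List<Dispatcher> dispatchers = new ArrayList<>();

    // Toy record layout (an assumption for this sketch): [int totalSize][topic bytes].
    void append(String topic) {
        byte[] topicBytes = topic.getBytes(StandardCharsets.UTF_8);
        ByteBuffer view = commitLog.duplicate();
        view.position(writePos);
        view.putInt(4 + topicBytes.length);
        view.put(topicBytes);
        writePos += 4 + topicBytes.length;
    }

    // Dispatch every message between reputFromOffset and the current write position.
    int doReput() {
        int dispatched = 0;
        while (reputFromOffset < writePos) {
            ByteBuffer view = commitLog.duplicate();
            view.position(reputFromOffset);
            int size = view.getInt();
            byte[] topicBytes = new byte[size - 4];
            view.get(topicBytes);
            String topic = new String(topicBytes, StandardCharsets.UTF_8);
            for (Dispatcher dispatcher : dispatchers) {
                dispatcher.dispatch(reputFromOffset, size, topic);
            }
            reputFromOffset += size;
            dispatched++;
        }
        return dispatched;
    }

    // Returns the CommitLog offsets recorded in the "consume queue" of TopicA.
    static List<Long> runDemo() {
        ReputDemo store = new ReputDemo();
        Map<String, List<Long>> consumeQueues = new HashMap<>();
        List<Long> indexed = new ArrayList<>();
        store.dispatchers.add((offset, size, topic) ->
            consumeQueues.computeIfAbsent(topic, k -> new ArrayList<>()).add(offset));
        store.dispatchers.add((offset, size, topic) -> indexed.add(offset));
        store.append("TopicA");
        store.append("TopicB");
        store.append("TopicA");
        store.doReput();
        return consumeQueues.get("TopicA");
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

Because the dispatchers only need what is already stored in the log, the same loop can rebuild both indexes from scratch after a crash by restarting the cursor from an earlier offset.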

3.2.2 Write to ConsumeQueue

As shown in the figure below, each record of ConsumeQueue has a total of 20 bytes, which are 8 bytes of CommitLog physical offset, 4 bytes of message length, and 8 bytes of tag hashcode.

The persistence logic for a ConsumeQueue record is as follows.

//org.apache.rocketmq.store.ConsumeQueue::putMessagePositionInfo
private boolean putMessagePositionInfo(final long offset, final int size, final long tagsCode,
    final long cqOffset) {
    ...
    this.byteBufferIndex.flip();
    this.byteBufferIndex.limit(CQ_STORE_UNIT_SIZE);
    this.byteBufferIndex.putLong(offset);
    this.byteBufferIndex.putInt(size);
    this.byteBufferIndex.putLong(tagsCode);
 
    final long expectLogicOffset = cqOffset * CQ_STORE_UNIT_SIZE;
 
    MappedFile mappedFile = this.mappedFileQueue.getLastMappedFile(expectLogicOffset);
    if (mappedFile != null) {
        ...
        return mappedFile.appendMessage(this.byteBufferIndex.array());
    }
    return false;
}

3.2.3 Write IndexFile

The file logical structure of IndexFile is shown in the figure below, which is similar to the array and linked list structure of JDK's HashMap. It is mainly composed of three parts: Header, Slot Table and Index Linked List.

Header: The header of the IndexFile, which occupies 40 bytes. It mainly contains the following fields:

beginTimestamp: The minimum storage time of the messages contained in this IndexFile.
endTimestamp: The maximum storage time of the messages contained in this IndexFile.
beginPhyoffset: The minimum CommitLog file offset of the messages contained in this IndexFile.
endPhyoffset: The maximum CommitLog file offset of the messages contained in this IndexFile.
hashSlotcount: The number of hash slots in use in this IndexFile.
indexCount: The number of Index entries in use in this IndexFile.

Slot Table: Contains 5 million hash slots by default. Each hash slot stores the position of the most recent IndexItem whose key hashes to that slot.

Index Linked List: Contains up to 20 million IndexItems by default. Each IndexItem is composed as follows:

Key Hash: The hash of the message key. When searching by key, the hash is compared first, and then the key itself is compared.
CommitLog Offset: The physical offset of the message in the CommitLog.
Timestamp: The difference between the storage time of the message and the timestamp of the first message.
Next Index Offset: The position of the next IndexItem in the chain when a hash conflict occurs.

Each hash slot in the Slot Table stores the position of an IndexItem in the Index Linked List. On a hash conflict, the new IndexItem is inserted at the head of the linked list: its Next Index Offset stores the position of the previous head IndexItem, and the hash slot in the Slot Table is overwritten with the position of the new IndexItem. The code is as follows:

//org.apache.rocketmq.store.index.IndexFile::putKey
public boolean putKey(final String key, final long phyOffset, final long storeTimestamp) {
    int keyHash = indexKeyHashMethod(key);
    int slotPos = keyHash % this.hashSlotNum;
    int absSlotPos = IndexHeader.INDEX_HEADER_SIZE + slotPos * hashSlotSize;
    ...
    // Read the position of the current newest IndexItem from the Slot Table
    int slotValue = this.mappedByteBuffer.getInt(absSlotPos);
    ...
    int absIndexPos =
        IndexHeader.INDEX_HEADER_SIZE + this.hashSlotNum * hashSlotSize
            + this.indexHeader.getIndexCount() * indexSize;
 
    this.mappedByteBuffer.putInt(absIndexPos, keyHash);
    this.mappedByteBuffer.putLong(absIndexPos + 4, phyOffset);
    this.mappedByteBuffer.putInt(absIndexPos + 4 + 8, (int) timeDiff);
    // Store the position of the previous head IndexItem of the linked list
    this.mappedByteBuffer.putInt(absIndexPos + 4 + 8 + 4, slotValue);
    // Update the hash slot in the Slot Table to point to the newest IndexItem
    this.mappedByteBuffer.putInt(absSlotPos, this.indexHeader.getIndexCount());
 
    if (this.indexHeader.getIndexCount() <= 1) {
        this.indexHeader.setBeginPhyOffset(phyOffset);
        this.indexHeader.setBeginTimestamp(storeTimestamp);
    }
 
    if (invalidIndex == slotValue) {
        this.indexHeader.incHashSlotCount();
    }
    this.indexHeader.incIndexCount();
    this.indexHeader.setEndPhyOffset(phyOffset);
    this.indexHeader.setEndTimestamp(storeTimestamp);
 
    return true;
    ...
}
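To see the head insertion and the lookup side together, the structure can be rebuilt at toy scale in a heap ByteBuffer. This is a sketch under assumptions, not the real IndexFile: the 40-byte header, 4-byte slots, and 20-byte items follow the layout used in putKey above, while the class name, the hash function, the tiny slot and item counts, and the lookup method are written for the illustration. Position 0 is treated as the "empty slot" marker, so item numbering starts at 1.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class IndexFileDemo {
    static final int HEADER_SIZE = 40;
    static final int SLOT_SIZE = 4;
    static final int INDEX_SIZE = 4 + 8 + 4 + 4; // key hash + phyOffset + timeDiff + next = 20
    static final int INVALID_INDEX = 0;

    private final int slotNum;
    private final int indexNum;
    private final ByteBuffer buf;
    private int indexCount = 1; // position 0 is reserved to mean "no item"

    IndexFileDemo(int slotNum, int indexNum) {
        this.slotNum = slotNum;
        this.indexNum = indexNum;
        this.buf = ByteBuffer.allocate((int) fileSize(slotNum, indexNum));
    }

    static long fileSize(long slotNum, long indexNum) {
        return HEADER_SIZE + slotNum * SLOT_SIZE + indexNum * INDEX_SIZE;
    }

    // Simplified non-negative hash, an assumption for this sketch.
    static int hash(String key) {
        return key.hashCode() & Integer.MAX_VALUE;
    }

    boolean putKey(String key, long phyOffset) {
        if (indexCount >= indexNum) {
            return false; // file is full
        }
        int keyHash = hash(key);
        int absSlotPos = HEADER_SIZE + (keyHash % slotNum) * SLOT_SIZE;
        int previousHead = buf.getInt(absSlotPos);
        int absIndexPos = HEADER_SIZE + slotNum * SLOT_SIZE + indexCount * INDEX_SIZE;
        buf.putInt(absIndexPos, keyHash);
        buf.putLong(absIndexPos + 4, phyOffset);
        buf.putInt(absIndexPos + 12, 0);            // time difference omitted in this sketch
        buf.putInt(absIndexPos + 16, previousHead); // link to the old head
        buf.putInt(absSlotPos, indexCount);         // the slot now points at the new head
        indexCount++;
        return true;
    }

    // Walk the chain from the slot; only hashes are compared here, newest item first.
    List<Long> lookup(String key) {
        List<Long> phyOffsets = new ArrayList<>();
        int keyHash = hash(key);
        int indexPos = buf.getInt(HEADER_SIZE + (keyHash % slotNum) * SLOT_SIZE);
        while (indexPos != INVALID_INDEX) {
            int absIndexPos = HEADER_SIZE + slotNum * SLOT_SIZE + indexPos * INDEX_SIZE;
            if (buf.getInt(absIndexPos) == keyHash) {
                phyOffsets.add(buf.getLong(absIndexPos + 4));
            }
            indexPos = buf.getInt(absIndexPos + 16);
        }
        return phyOffsets;
    }

    // "a" and "e" fall into the same slot when there are 4 slots, so they share one chain.
    static List<Long> demoLookup(String key) {
        IndexFileDemo index = new IndexFileDemo(4, 16);
        index.putKey("a", 100L);
        index.putKey("e", 200L);
        index.putKey("a", 300L);
        return index.lookup(key);
    }

    public static void main(String[] args) {
        System.out.println(fileSize(5_000_000L, 20_000_000L) + " bytes with default sizes");
        System.out.println(demoLookup("a") + " " + demoLookup("e"));
    }
}
```

With the default 5 million slots and 20 million items, fileSize gives 420,000,040 bytes, which is consistent with the "about 400M" figure quoted earlier.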

In summary, a complete message write consists of synchronously writing to the CommitLog file buffer and asynchronously building the ConsumeQueue and IndexFile files.

3.3 Message flushing

RocketMQ flushes messages to disk in one of two ways: synchronous flushing and asynchronous flushing.

(1) Synchronous flushing: The Broker returns a success ACK to the Producer only after the message has actually been persisted to disk. Synchronous flushing gives a strong guarantee of message reliability, but it has a significant impact on performance. Financial services commonly use this mode.

(2) Asynchronous flushing: This mode takes full advantage of the OS Page Cache. As soon as the message is written into the Page Cache, the success ACK is returned to the Producer. Flushing is carried out by a background thread, which reduces read and write latency and improves the performance and throughput of the MQ. Asynchronous flushing has two variants: with off-heap memory enabled and with off-heap memory disabled.

When CommitLog submits the flush request, the current Broker configuration determines whether the flush is synchronous or asynchronous.

//org.apache.rocketmq.store.CommitLog::submitFlushRequest
public CompletableFuture<PutMessageStatus> submitFlushRequest(AppendMessageResult result, MessageExt messageExt) {
    // Synchronous flush
    if (FlushDiskType.SYNC_FLUSH == this.defaultMessageStore.getMessageStoreConfig().getFlushDiskType()) {
        final GroupCommitService service = (GroupCommitService) this.flushCommitLogService;
        if (messageExt.isWaitStoreMsgOK()) {
            GroupCommitRequest request = new GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes(),
                    this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout());
            service.putRequest(request);
            return request.future();
        } else {
            service.wakeup();
            return CompletableFuture.completedFuture(PutMessageStatus.PUT_OK);
        }
    }
    // Asynchronous flush
    else {
        if (!this.defaultMessageStore.getMessageStoreConfig().isTransientStorePoolEnable()) {
            flushCommitLogService.wakeup();
        } else  {
            // Asynchronous flush with off-heap memory enabled
            commitLogService.wakeup();
        }
        return CompletableFuture.completedFuture(PutMessageStatus.PUT_OK);
    }
}

The inheritance relationship of GroupCommitService, FlushRealTimeService, and CommitRealTimeService is shown in the figure.

GroupCommitService: The synchronous flushing thread. As shown in the figure below, after the message is written to the Page Cache, it is flushed synchronously through GroupCommitService, and the message processing thread blocks while waiting for the flush result.

//org.apache.rocketmq.store.CommitLog.GroupCommitService::run
public void run() {
    ...
    while (!this.isStopped()) {
        this.waitForRunning(10);
        this.doCommit();
    }
    ...
}
 
//org.apache.rocketmq.store.CommitLog.GroupCommitService::doCommit
private void doCommit() {
    ...
    for (GroupCommitRequest req : this.requestsRead) {
        boolean flushOK = CommitLog.this.mappedFileQueue.getFlushedWhere() >= req.getNextOffset();
        for (int i = 0; i < 2 && !flushOK; i++) {
            CommitLog.this.mappedFileQueue.flush(0);
            flushOK = CommitLog.this.mappedFileQueue.getFlushedWhere() >= req.getNextOffset();
        }
        // Wake up the message processing thread that is waiting for the flush to complete
        req.wakeupCustomer(flushOK ? PutMessageStatus.PUT_OK : PutMessageStatus.FLUSH_DISK_TIMEOUT);
    }
    ...
}
 
//org.apache.rocketmq.store.MappedFile::flush
public int flush(final int flushLeastPages) {
    if (this.isAbleToFlush(flushLeastPages)) {
        ...
        // If writeBuffer is in use or fileChannel's position is not 0, force the flush through fileChannel
        if (writeBuffer != null || this.fileChannel.position() != 0) {
            this.fileChannel.force(false);
        } else {
            // Otherwise force the flush through MappedByteBuffer
            this.mappedByteBuffer.force();
        }
        ...
    }
}
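The benefit of grouping is that one physical flush can satisfy several waiting requests. The following single-threaded model shows that effect: each append registers a request carrying the offset it needs flushed, and one doCommit pass completes all of them. It is a sketch with assumed names and a fake flush (the flush method only advances a counter); the real service runs doCommit on its own thread and flushes mapped files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class GroupCommitDemo {
    enum Status { PUT_OK, FLUSH_DISK_TIMEOUT }

    static final class Request {
        final long nextOffset; // the flush must reach at least this offset
        final CompletableFuture<Status> future = new CompletableFuture<>();

        Request(long nextOffset) {
            this.nextOffset = nextOffset;
        }
    }

    private List<Request> requestsWrite = new ArrayList<>();
    private List<Request> requestsRead = new ArrayList<>();
    private long wroteWhere;   // bytes appended to the buffer so far
    private long flushedWhere; // bytes known to be on disk
    private int flushCount;    // demo-only: number of physical flushes

    // Append to the buffer and register a request that completes once the data is flushed.
    synchronized CompletableFuture<Status> appendAndRequestFlush(int bytes) {
        wroteWhere += bytes;
        Request request = new Request(wroteWhere);
        requestsWrite.add(request);
        return request.future;
    }

    // One pass of the flush thread: swap the lists, flush if needed, complete the futures.
    synchronized void doCommit() {
        List<Request> tmp = requestsWrite;
        requestsWrite = requestsRead;
        requestsRead = tmp;
        for (Request request : requestsRead) {
            if (flushedWhere < request.nextOffset) {
                flush();
            }
            request.future.complete(
                flushedWhere >= request.nextOffset ? Status.PUT_OK : Status.FLUSH_DISK_TIMEOUT);
        }
        requestsRead.clear();
    }

    // Stand-in for forcing the mapped file to disk.
    private void flush() {
        flushedWhere = wroteWhere;
        flushCount++;
    }

    // Three writers, one doCommit pass: returns how many physical flushes were needed.
    static int runDemo() {
        GroupCommitDemo service = new GroupCommitDemo();
        List<CompletableFuture<Status>> futures = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            futures.add(service.appendAndRequestFlush(100));
        }
        service.doCommit();
        for (CompletableFuture<Status> future : futures) {
            if (future.getNow(null) != Status.PUT_OK) {
                throw new IllegalStateException("request was not flushed");
            }
        }
        return service.flushCount;
    }

    public static void main(String[] args) {
        System.out.println("flushes for 3 requests: " + runDemo());
    }
}
```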

FlushRealTimeService: The asynchronous flushing thread used when off-heap memory is not enabled. As shown in the figure below, after the message is written to the Page Cache, the message processing thread returns immediately and the disk flush is done asynchronously by FlushRealTimeService.

//org.apache.rocketmq.store.CommitLog.FlushRealTimeService
public void run() {
    ...
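    // Check whether flushing should run on a fixed schedule
    if (flushCommitLogTimed) {
        // Sleep for the fixed interval
        Thread.sleep(interval);
    } else {
        // Not scheduled: flush as soon as the thread is woken up
        this.waitForRunning(interval);
    }
    ...
    // This uses the same forced-flush method as GroupCommitService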
    CommitLog.this.mappedFileQueue.flush(flushPhysicQueueLeastPages);
    ...
}

CommitRealTimeService: The asynchronous flushing thread used when off-heap memory is enabled. As shown in the figure below, the message processing thread returns immediately after writing the message to off-heap memory. The message is then committed asynchronously from off-heap memory to the Page Cache by CommitRealTimeService, and finally flushed asynchronously by the FlushRealTimeService thread.

Note: Only after the message has been asynchronously committed to the Page Cache can the business read it from the MappedByteBuffer.

After the message is written to the off-heap writeBuffer, the isAbleToCommit method determines whether at least the minimum number of commit pages (4 pages by default) has accumulated. If so, the pages are committed in a batch; otherwise the data stays in off-heap memory, where there is a risk of losing messages. Because of this batching, reads and writes to the Page Cache are separated by several pages, which reduces the probability of Page Cache read/write conflicts and achieves read/write separation. The specific implementation logic is as follows:

//org.apache.rocketmq.store.CommitLog.CommitRealTimeService
class CommitRealTimeService extends FlushCommitLogService {
    @Override
    public void run() {
        while (!this.isStopped()) {
            ...
            int commitDataLeastPages = CommitLog.this.defaultMessageStore.getMessageStoreConfig().getCommitCommitLogLeastPages();
 
            ...
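            // Commit messages to the memory buffer; this ends up calling MappedFile::commit0. The commit only succeeds once the minimum number of commit pages is reached; otherwise the data stays in off-heap memory
            boolean result = CommitLog.this.mappedFileQueue.commit(commitDataLeastPages);
            if (!result) {
                // result == false means some data was committed; wake up flushCommitLogService to flush it to disk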
                flushCommitLogService.wakeup();
            }
            ...
            this.waitForRunning(interval);
        }
    }
}
 
//org.apache.rocketmq.store.MappedFile::commit0
protected void commit0() {
    int writePos = this.wrotePosition.get();
    int lastCommittedPosition = this.committedPosition.get();
     
    // Commit the messages to the Page Cache; nothing is flushed to disk yet
    if (writePos - lastCommittedPosition > 0) {
        ByteBuffer byteBuffer = writeBuffer.slice();
        byteBuffer.position(lastCommittedPosition);
        byteBuffer.limit(writePos);
        this.fileChannel.position(lastCommittedPosition);
        this.fileChannel.write(byteBuffer);
        this.committedPosition.set(writePos);
    }
}
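The page threshold check described above comes down to comparing page numbers rather than byte positions. The sketch below is one way to express it, written from the description in the text; the method signature and the 4096-byte page size are assumptions of the example rather than a copy of the RocketMQ method.

```java
class CommitThresholdDemo {
    static final int OS_PAGE_SIZE = 4096; // assumed page size for the example

    // Decide whether enough dirty data has accumulated in off-heap memory to commit.
    static boolean isAbleToCommit(int wrotePosition, int committedPosition, int fileSize, int commitLeastPages) {
        if (wrotePosition == fileSize) {
            return true; // the file is full: commit whatever is left
        }
        if (commitLeastPages > 0) {
            // Compare page numbers, so partial pages do not count toward the threshold.
            return (wrotePosition / OS_PAGE_SIZE) - (committedPosition / OS_PAGE_SIZE) >= commitLeastPages;
        }
        // A threshold of 0 means "commit as soon as there is anything new".
        return wrotePosition > committedPosition;
    }

    public static void main(String[] args) {
        int fileSize = 1 << 20;
        System.out.println(isAbleToCommit(3 * OS_PAGE_SIZE, 0, fileSize, 4)); // 3 pages, below threshold
        System.out.println(isAbleToCommit(4 * OS_PAGE_SIZE, 0, fileSize, 4)); // 4 pages, reaches threshold
        System.out.println(isAbleToCommit(100, 0, fileSize, 0));              // no threshold
    }
}
```

Data that sits below the threshold exists only in off-heap memory, which is the window in which a Broker crash can lose messages.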

The following summarizes the usage scenarios, advantages, and disadvantages of the three flushing mechanisms.

4. Message reading

The message reading logic is much simpler than the writing logic. This chapter focuses on how messages are queried by offset and by key.

4.1 Query by offset

To read a message, the Broker first finds the physical offset of the message in the CommitLog from the ConsumeQueue, and then reads the message content from the CommitLog file.

//org.apache.rocketmq.store.DefaultMessageStore::getMessage
public GetMessageResult getMessage(final String group, final String topic, final int queueId, final long offset,
    final int maxMsgNums,
    final MessageFilter messageFilter) {
    long nextBeginOffset = offset;
 
    GetMessageResult getResult = new GetMessageResult();
 
    final long maxOffsetPy = this.commitLog.getMaxOffset();
    // Find the corresponding ConsumeQueue
    ConsumeQueue consumeQueue = findConsumeQueue(topic, queueId);
    ...
    // Find the ConsumeQueue MappedFile that corresponds to the offset
    SelectMappedBufferResult bufferConsumeQueue = consumeQueue.getIndexBuffer(offset);
    status = GetMessageStatus.NO_MATCHED_MESSAGE;
    long maxPhyOffsetPulling = 0;
 
    int i = 0;
    // Upper bound on the ConsumeQueue bytes scanned in one pull: at least 16000 bytes (800 entries)
    final int maxFilterMessageCount = Math.max(16000, maxMsgNums * ConsumeQueue.CQ_STORE_UNIT_SIZE);
    for (; i < bufferConsumeQueue.getSize() && i < maxFilterMessageCount; i += ConsumeQueue.CQ_STORE_UNIT_SIZE) {
        // Physical offset in the CommitLog
        long offsetPy = bufferConsumeQueue.getByteBuffer().getLong();
        int sizePy = bufferConsumeQueue.getByteBuffer().getInt();
        maxPhyOffsetPulling = offsetPy;
        ...
        // Fetch the actual message from the CommitLog by offset and size
        SelectMappedBufferResult selectResult = this.commitLog.getMessage(offsetPy, sizePy);
        ...
        // Add the message to the result set
        getResult.addMessage(selectResult);
        status = GetMessageStatus.FOUND;
    }
 
    // Update the offset for the next pull
    nextBeginOffset = offset + (i / ConsumeQueue.CQ_STORE_UNIT_SIZE);
 
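    // If the gap between the newest CommitLog offset and the last pulled message exceeds the
    // configured share of physical memory, the data is likely no longer in the page cache,
    // so the consumer is advised to pull from the slave
    long diff = maxOffsetPy - maxPhyOffsetPulling;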
    long memory = (long) (StoreUtil.TOTAL_PHYSICAL_MEMORY_SIZE
        * (this.messageStoreConfig.getAccessMessageInMemoryMaxRatio() / 100.0));
    getResult.setSuggestPullingFromSlave(diff > memory);
    ...
    getResult.setStatus(status);
    getResult.setNextBeginOffset(nextBeginOffset);
    return getResult;
}
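The loop above steps through fixed-length ConsumeQueue entries and advances the consumer offset by the number of entries scanned. A minimal sketch of that layout follows, assuming an in-memory ByteBuffer in place of the mapped file. The class and its helper methods are hypothetical; only the 20-byte entry layout (8-byte CommitLog offset, 4-byte size, 8-byte tag hashcode) comes from the article.

```java
import java.nio.ByteBuffer;

class ConsumeQueueSketch {
    // One entry: 8-byte CommitLog offset + 4-byte message size + 8-byte tag hashcode.
    static final int CQ_STORE_UNIT_SIZE = 20;

    /** Appends one index entry at the buffer's current position. */
    static void append(ByteBuffer queue, long commitLogOffset, int size, long tagsCode) {
        queue.putLong(commitLogOffset).putInt(size).putLong(tagsCode);
    }

    /** Reads the entry at a logical offset; returns {commitLogOffset, size, tagsCode}. */
    static long[] read(ByteBuffer queue, long logicalOffset) {
        int pos = (int) (logicalOffset * CQ_STORE_UNIT_SIZE);
        return new long[] {queue.getLong(pos), queue.getInt(pos + 8), queue.getLong(pos + 12)};
    }

    /** Next logical offset to pull from after scanning scannedBytes worth of entries. */
    static long nextBeginOffset(long offset, int scannedBytes) {
        return offset + scannedBytes / CQ_STORE_UNIT_SIZE;
    }

    public static void main(String[] args) {
        ByteBuffer queue = ByteBuffer.allocate(3 * CQ_STORE_UNIT_SIZE);
        append(queue, 0L, 200, 11L);
        append(queue, 200L, 150, 22L);
        append(queue, 350L, 300, 33L);

        long[] entry = read(queue, 1);
        System.out.println(entry[0] + " " + entry[1] + " " + entry[2]); // 200 150 22
        System.out.println(nextBeginOffset(1, 2 * CQ_STORE_UNIT_SIZE)); // 3
    }
}
```

Since every entry has the same length, a logical offset converts to a byte position with one multiplication, which is what allows array-like random access to the queue.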

4.2 Query by key

To read a message by key, the Broker uses the topic and key to find a record in the IndexFile, and then reads the message body from the CommitLog file at the CommitLog offset stored in that record.

//org.apache.rocketmq.store.DefaultMessageStore::queryMessage
public QueryMessageResult queryMessage(String topic, String key, int maxNum, long begin, long end) {
    QueryMessageResult queryMessageResult = new QueryMessageResult();
    long lastQueryMsgTime = end;
     
    for (int i = 0; i < 3; i++) {
        // Look up the CommitLog physical offsets recorded in the IndexFile for this key
        QueryOffsetResult queryOffsetResult = this.indexService.queryOffset(topic, key, maxNum, begin, lastQueryMsgTime);
        ...
        for (int m = 0; m < queryOffsetResult.getPhyOffsets().size(); m++) {
            long offset = queryOffsetResult.getPhyOffsets().get(m);
            ...
            MessageExt msg = this.lookMessageByOffset(offset);
            if (0 == m) {
                lastQueryMsgTime = msg.getStoreTimestamp();
            }
            ...
            // Read the message content from the CommitLog
            SelectMappedBufferResult result = this.commitLog.getData(offset, false);
            ...
            queryMessageResult.addMessage(result);
            ...
        }
    }
 
    return queryMessageResult;
}

Within the IndexFile, the lookup of CommitLog physical offsets is implemented as follows:

//org.apache.rocketmq.store.index.IndexFile::selectPhyOffset
public void selectPhyOffset(final List<Long> phyOffsets, final String key, final int maxNum,
    final long begin, final long end, boolean lock) {
    int keyHash = indexKeyHashMethod(key);
    int slotPos = keyHash % this.hashSlotNum;
    int absSlotPos = IndexHeader.INDEX_HEADER_SIZE + slotPos * hashSlotSize;
    // Read the position of the first index item for keys with this hash, i.e. the head of the linked list
    int slotValue = this.mappedByteBuffer.getInt(absSlotPos);
     
    // Walk the linked list
    for (int nextIndexToRead = slotValue; ; ) {
        if (phyOffsets.size() >= maxNum) {
            break;
        }
 
        int absIndexPos =
            IndexHeader.INDEX_HEADER_SIZE + this.hashSlotNum * hashSlotSize
                + nextIndexToRead * indexSize;
 
        int keyHashRead = this.mappedByteBuffer.getInt(absIndexPos);
        long phyOffsetRead = this.mappedByteBuffer.getLong(absIndexPos + 4);
 
        long timeDiff = (long) this.mappedByteBuffer.getInt(absIndexPos + 4 + 8);
        int prevIndexRead = this.mappedByteBuffer.getInt(absIndexPos + 4 + 8 + 4);
 
        if (timeDiff < 0) {
            break;
        }
 
        timeDiff *= 1000L;
 
        long timeRead = this.indexHeader.getBeginTimestamp() + timeDiff;
        boolean timeMatched = (timeRead >= begin) && (timeRead <= end);
 
        // Add matching results to phyOffsets
        if (keyHash == keyHashRead && timeMatched) {
            phyOffsets.add(phyOffsetRead);
        }
 
        if (prevIndexRead <= invalidIndex
            || prevIndexRead > this.indexHeader.getIndexCount()
            || prevIndexRead == nextIndexToRead || timeRead < begin) {
            break;
        }
         
        // Continue with the previous item in the list
        nextIndexToRead = prevIndexRead;
    }
    ...
}
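The traversal above relies on the IndexFile being laid out as a hash table with chained entries: a header, an array of hash slots, and 20-byte index items that each point back to the previous item in the same slot. The sketch below reproduces that layout in a heap ByteBuffer to make the write and read sides easier to follow. It is a simplified model under stated assumptions: the class, the `put` method, the 40-byte header constant, and the `topic#key` strings are illustrative choices, not the RocketMQ implementation.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class IndexFileSketch {
    // Assumed sizes: item = keyHash(4) + phyOffset(8) + timeDiff(4) + prevIndex(4).
    static final int HEADER_SIZE = 40, SLOT_SIZE = 4, INDEX_SIZE = 20, INVALID_INDEX = 0;

    final int slotNum;
    final ByteBuffer buf;
    final long beginTimestamp;
    int indexCount = 1; // item 0 is never used, so a slot value of 0 means "empty"

    IndexFileSketch(int slotNum, int maxItems, long beginTimestamp) {
        this.slotNum = slotNum;
        this.beginTimestamp = beginTimestamp;
        this.buf = ByteBuffer.allocate(HEADER_SIZE + slotNum * SLOT_SIZE + (maxItems + 1) * INDEX_SIZE);
    }

    static int hash(String key) {
        int h = Math.abs(key.hashCode());
        return h < 0 ? 0 : h; // Math.abs(Integer.MIN_VALUE) stays negative
    }

    int slotPos(int keyHash) { return HEADER_SIZE + (keyHash % slotNum) * SLOT_SIZE; }

    int itemPos(int index) { return HEADER_SIZE + slotNum * SLOT_SIZE + index * INDEX_SIZE; }

    /** Writes a new item and makes it the head of its slot's chain. */
    void put(String key, long phyOffset, long storeTimestamp) {
        int keyHash = hash(key);
        int slot = slotPos(keyHash);
        int pos = itemPos(indexCount);
        buf.putInt(pos, keyHash);
        buf.putLong(pos + 4, phyOffset);
        buf.putInt(pos + 12, (int) ((storeTimestamp - beginTimestamp) / 1000)); // seconds
        buf.putInt(pos + 16, buf.getInt(slot)); // link to the previous head
        buf.putInt(slot, indexCount++);
    }

    /** Follows the chain from newest to oldest, collecting offsets in [begin, end]. */
    List<Long> select(String key, int maxNum, long begin, long end) {
        List<Long> phyOffsets = new ArrayList<>();
        int keyHash = hash(key);
        int next = buf.getInt(slotPos(keyHash));
        while (next > INVALID_INDEX && next < indexCount && phyOffsets.size() < maxNum) {
            int pos = itemPos(next);
            long time = beginTimestamp + buf.getInt(pos + 12) * 1000L;
            if (buf.getInt(pos) == keyHash && time >= begin && time <= end) {
                phyOffsets.add(buf.getLong(pos + 4));
            }
            if (time < begin) {
                break; // older items are even further outside the window
            }
            next = buf.getInt(pos + 16);
        }
        return phyOffsets;
    }

    public static void main(String[] args) {
        IndexFileSketch index = new IndexFileSketch(4, 16, 0L);
        index.put("TopicA#order-1", 100L, 1_000L);
        index.put("TopicA#order-2", 300L, 2_000L);
        index.put("TopicA#order-1", 500L, 3_000L);
        System.out.println(index.select("TopicA#order-1", 10, 0L, 5_000L)); // [500, 100]
    }
}
```

Results come back newest first because each new item becomes the head of its chain. Items whose keys merely share a slot are filtered out by the stored key hash, so two different keys with an identical hash would still both match and have to be told apart by the caller.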

Five, summary

This article introduces the core implementation of the RocketMQ storage system from the source-code perspective, covering the storage architecture, message writing, and message reading.

RocketMQ writes the messages of all topics into the CommitLog, which makes disk writes strictly sequential. File preheating keeps the Page Cache from being swapped out to swap space and reduces page-fault interruptions when reading and writing files. CommitLog files are read and written through mmap, which turns file operations into direct operations on memory addresses and greatly improves file I/O efficiency.

For scenarios with high performance requirements and relaxed data-consistency requirements, you can enable off-heap memory to separate reads from writes and improve disk throughput. In short, studying the storage module requires some understanding of operating system principles, and the thorough performance optimizations adopted by its authors are well worth learning from.

6. References

1. RocketMQ official document

Author: vivo internet server team-Zhang Zhenglin
