7 pictures reveal the essence of RocketMQ storage design

RocketMQ is a disk-based message middleware that can hold an effectively unlimited message backlog while still providing high throughput and low latency. At its core is an elegant storage design.

Storage overview

The files stored by RocketMQ mainly include Commitlog files, ConsumeQueue files, and Index files.

RocketMQ stores the messages of all topics in the same file so that writes are strictly sequential when messages are sent, which does as much as possible to keep message sending highly available and high-throughput.

However, message middleware generally follows a topic-based publish/subscribe model, so consumers must select messages by topic. Filtering messages by topic directly from the Commitlog file would be extremely inefficient. To make topic-based retrieval efficient, RocketMQ introduces the ConsumeQueue file, commonly known as the consumption queue file.

Relational databases can retrieve records by field values. As message middleware aimed mainly at business development, RocketMQ also provides retrieval by message attribute. The underlying idea is to build a hash index over the Commitlog file and store it in the Index file.

After a message is written sequentially to the Commitlog file, the ConsumeQueue and Index files are built asynchronously. The data flow is shown in the following diagram:

Storage file organization

RocketMQ pursues purely sequential disk writes when storing messages. All messages of all topics are written to a single log, the Commitlog file, and are appended in order of arrival. Once written, a message cannot be modified. The layout of the Commitlog file is shown in the figure below:

File-based programming differs greatly from memory-based programming. In memory we have ready-made data structures such as List and HashMap, which make reading and writing data convenient. Once a message has been stored in the Commitlog file, how do we find it again?

Just as a relational database introduces an ID field for each row, the file-based programming model introduces an identifier for each message: the message's physical offset, that is, the position in the file at which the message starts.

Building on the concept of physical offset, Commitlog file names are chosen deliberately: each file is named after the offset, within the whole Commitlog file group, of the first message it stores. For example, the first Commitlog file is 00000000000000000000, the second is 00000000001073741824, and so on.

The advantage is that, given the physical offset of any message, the file that contains it can be found quickly by binary search over the file names. For example, a message with offset 73741824 is located in the first file. Subtracting the file name (the file's starting offset) from the message's physical offset then gives the message's position within that file.
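The lookup described above can be sketched in a few lines of Java. This is only an illustration, not RocketMQ's own code: the class and method names are hypothetical, and the 1 GiB file size is an assumption inferred from the second file name (1073741824). Offsets are assumed to be non-negative.

```java
import java.util.Arrays;

final class CommitLogLocator {
    // Assumed file size: 1 GiB, inferred from the file name 00000000001073741824.
    static final long FILE_SIZE = 1073741824L;

    /** Formats a file's starting offset as a 20-digit, zero-padded file name. */
    static String fileName(long startOffset) {
        return String.format("%020d", startOffset);
    }

    /** Binary-searches the sorted start offsets for the file that contains physicalOffset. */
    static int findFileIndex(long[] startOffsets, long physicalOffset) {
        int pos = Arrays.binarySearch(startOffsets, physicalOffset);
        // Exact hit: the offset is the first byte of that file. Otherwise binarySearch
        // returns -(insertionPoint) - 1, and the containing file is the one before it.
        return pos >= 0 ? pos : -pos - 2;
    }

    /** Position of the message inside its file: physical offset minus the file's start offset. */
    static long positionInFile(long[] startOffsets, long physicalOffset) {
        return physicalOffset - startOffsets[findFileIndex(startOffsets, physicalOffset)];
    }

    public static void main(String[] args) {
        long[] starts = {0L, FILE_SIZE, 2 * FILE_SIZE};
        long offset = 73741824L;
        int idx = findFileIndex(starts, offset);
        System.out.println(fileName(starts[idx]) + " @ " + positionInFile(starts, offset));
    }
}
```

Because the file name is itself the starting offset, no separate mapping table from offsets to files has to be stored.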

The Commitlog file is designed for the fastest possible message writes. Consumption, however, follows a topic-based subscription model: a consumer group consumes messages of a specific topic. Retrieving messages by topic directly from the Commitlog file is a poor approach, because the only option is to scan message by message from the start of the file, and the performance is predictably bad. To solve topic-based retrieval, RocketMQ introduces the ConsumeQueue file, whose structure is shown in the figure below.

The ConsumeQueue file is the message consumption queue file. It is a topic-based index over the Commitlog file, used mainly by consumers to consume messages by topic. Files are organized as /topic/queue, and a single queue can span multiple files.

The design of ConsumeQueue is clever: each entry has a fixed length of 20 bytes (8-byte Commitlog physical offset, 4-byte message length, 8-byte tag hashcode).

The entry stores the hashcode of the tag rather than the original tag string. This keeps every entry the same length, so an entry can be located the way an array element is accessed by index, which greatly improves ConsumeQueue read performance.

Consider how a consumer reads a message given a topic and a consumption progress (the ConsumeQueue logical offset, that is, the index of a ConsumeQueue entry). Multiplying the logical offset by 20 (logicOffset * 20) gives the entry's starting offset in the ConsumeQueue file, and reading the 20 bytes from there yields the entry. There is no need to traverse the ConsumeQueue file.
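A minimal Java sketch of this fixed-length addressing follows. It uses an in-memory ByteBuffer as a stand-in for the ConsumeQueue file; the class and method names are hypothetical, and only the 8 + 4 + 8 byte entry layout comes from the text above.

```java
import java.nio.ByteBuffer;

final class ConsumeQueueEntryDemo {
    static final int ENTRY_SIZE = 20; // 8-byte offset + 4-byte length + 8-byte tag hashcode

    /** Writes one entry at the slot addressed by logicOffset * 20. */
    static void putEntry(ByteBuffer queue, long logicOffset,
                         long commitLogOffset, int messageLength, long tagHash) {
        int pos = Math.toIntExact(logicOffset * ENTRY_SIZE);
        queue.putLong(pos, commitLogOffset);
        queue.putInt(pos + 8, messageLength);
        queue.putLong(pos + 12, tagHash);
    }

    /** Reads the entry back as {commitLogOffset, messageLength, tagHash} without scanning. */
    static long[] readEntry(ByteBuffer queue, long logicOffset) {
        int pos = Math.toIntExact(logicOffset * ENTRY_SIZE);
        return new long[] {queue.getLong(pos), queue.getInt(pos + 8), queue.getLong(pos + 12)};
    }

    public static void main(String[] args) {
        ByteBuffer queue = ByteBuffer.allocate(ENTRY_SIZE * 4); // room for 4 entries
        putEntry(queue, 0, 0L, 120, "TagA".hashCode());
        putEntry(queue, 1, 120L, 80, "TagB".hashCode());
        long[] entry = readEntry(queue, 1);
        System.out.println("offset=" + entry[0] + " length=" + entry[1] + " tagHash=" + entry[2]);
    }
}
```

The read touches exactly 20 bytes regardless of how many entries the queue holds, which is the point of keeping the tag as a fixed-width hashcode.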

RocketMQ has a significant advantage over Kafka: it supports retrieving messages by message attribute. The ConsumeQueue file solves topic-based lookup, but it cannot help when you want to find messages by one of their attributes.

RocketMQ introduces the Index file to implement a file-based hash index. The storage structure of IndexFile is shown in the figure below:

IndexFile implements a hash index on top of a physical disk file. The file consists of a 40-byte file header, 5 million hash slots of 4 bytes each, and finally 20 million index entries. Each entry is 20 bytes: a 4-byte index key hashcode, an 8-byte message physical offset, a 4-byte timestamp, and a 4-byte pointer to the previous index entry (forming a linked list that resolves hash conflicts).

In other words, the file maps the hashcode of the index key to a physical offset, so that a message can be located quickly in the Commitlog file by its key.
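The byte positions implied by this layout can be worked out with simple arithmetic. The Java sketch below is illustrative only: the sizes come from the description above, while the class name, method names, and the way the key is reduced to a slot number (Math.floorMod of String.hashCode) are assumptions rather than RocketMQ's actual implementation.

```java
final class IndexFileLayout {
    static final int HEADER_SIZE = 40;
    static final int SLOT_COUNT = 5_000_000;
    static final int SLOT_SIZE = 4;
    static final int ENTRY_COUNT = 20_000_000;
    static final int ENTRY_SIZE = 20; // 4 hashcode + 8 physical offset + 4 timestamp + 4 previous entry

    /** Byte position of the hash slot for a key (hashing scheme is an assumption). */
    static long slotPosition(String key) {
        int slot = Math.floorMod(key.hashCode(), SLOT_COUNT);
        return HEADER_SIZE + (long) slot * SLOT_SIZE;
    }

    /** Byte position of the n-th index entry, stored after the header and all slots. */
    static long entryPosition(int entryIndex) {
        return HEADER_SIZE + (long) SLOT_COUNT * SLOT_SIZE + (long) entryIndex * ENTRY_SIZE;
    }

    /** Total size of a full index file: header + slots + entries. */
    static long fileSize() {
        return entryPosition(ENTRY_COUNT);
    }

    public static void main(String[] args) {
        System.out.println("slot for key 'order-1001' at byte " + slotPosition("order-1001"));
        System.out.println("first entry at byte " + entryPosition(0));
        System.out.println("full file size " + fileSize() + " bytes");
    }
}
```

A lookup would read the slot to get the most recent entry for that hash, then follow each entry's 4-byte previous-entry field to walk the conflict chain.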

Sequential write

For disk-based reads and writes, another design principle that improves write performance is sequential disk writing.

Sequential disk writes are widely used in file-based storage models. Consider why MySQL introduces the redo log. The InnoDB storage engine keeps a memory pool that caches disk file blocks. When a statement modifies data, the change is first applied in memory, then written to the redo log (which is flushed to disk), and the data in the InnoDB memory pool is flushed to disk periodically.

Why not update the target data file directly on every change? A MySQL InnoDB database can contain thousands of tables, and each table's data is stored in a separate file. If every change to every table were flushed straight to its data file, the result would be a large number of random writes and poor performance. Writing a redo log sequentially looks like an extra flush step, but because the writes are sequential rather than random, the performance gain is significant.

Memory mapping mechanism

Although sequential disk writes greatly improve IO write efficiency, the gain is limited if the storage layer uses conventional Java file APIs such as FileOutputStream. RocketMQ therefore uses memory mapping: disk files are mapped into memory and operated on as if they were memory, which raises performance to another level.

In Java, a memory-mapped file can be created with the map method of FileChannel.

On a Linux server, a file mapped in this way uses the operating system's pagecache, that is, the page cache.

Linux uses as much of the machine's physical memory as possible for caching file data and keeps it resident in memory; this is the page cache. When memory runs short, the operating system applies a cache replacement algorithm such as LRU to reclaim rarely used pages. In other words, the operating system manages this memory automatically.

If the RocketMQ Broker process exits abnormally, the data stored in the page cache will not be lost. The operating system will periodically persist the data in the page cache to disk to ensure data safety and reliability. However, if there is an abnormal situation such as a machine power failure, the data stored in the page cache may be lost.
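As a small illustration of the FileChannel map method mentioned in this section, the Java sketch below maps a temporary file, writes through the mapping, and reads the bytes back from the file. The class name, method names, file name prefix, and the 1024-byte mapping size are all hypothetical choices for the example, not RocketMQ settings.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedFileDemo {
    /** Maps the file, writes the message through the mapping, then reads it back from the file. */
    static String writeAndReadBack(Path file, String message, int mappedSize) {
        byte[] payload = message.getBytes(StandardCharsets.UTF_8);
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, mappedSize);
            mapped.put(payload); // a plain memory write; the data lands in the page cache
            mapped.force();      // explicitly ask the OS to write the dirty pages to disk
            byte[] onDisk = Files.readAllBytes(file);
            return new String(onDisk, 0, payload.length, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience wrapper that uses a temporary file (hypothetical name prefix). */
    static String roundTrip(String message) {
        try {
            Path file = Files.createTempFile("mapped-demo", ".log");
            file.toFile().deleteOnExit();
            return writeAndReadBack(file, message, 1024);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello page cache"));
    }
}
```

Without the force call the write would still succeed; the data would simply stay in the page cache until the operating system decided to persist it, which is the behavior the paragraphs above describe.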

Flexible flush strategies

With sequential writes and memory mapping, RocketMQ's write performance is well supported, but every design has trade-offs. Because of memory mapping and the page cache, a message is first written to the page cache and is not yet truly persisted to disk. So when the broker receives a message from a client, should it return success as soon as the message is in the page cache, or only after the message has been persisted to disk?

This is a "difficult" choice, a trade-off between performance and message reliability. RocketMQ therefore offers two strategies: synchronous flush and asynchronous flush.

1. Synchronous flush

In RocketMQ, synchronous flush is implemented as a group commit: not every message triggers its own flush. The design is shown in the figure:

With synchronous flush, each sending thread appends its data to memory, submits a flush request to the flush thread, and then blocks. The flush thread takes a request from the task queue and triggers a flush. The flush does not cover only the messages of that request: it writes out all pending data in memory at once, after which a whole group of waiting request threads can be woken up. This is group commit.
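The group commit idea can be sketched with a queue of pending requests and a single flush thread. The Java code below is a simplified illustration under stated assumptions, not RocketMQ's implementation: all names are hypothetical, and a 5 ms sleep stands in for the real disk flush.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

final class GroupCommitDemo {
    private final BlockingQueue<CompletableFuture<Void>> requests = new LinkedBlockingQueue<>();
    private final AtomicInteger flushCount = new AtomicInteger();

    GroupCommitDemo() {
        Thread flusher = new Thread(this::flushLoop, "group-commit-flusher");
        flusher.setDaemon(true); // do not keep the JVM alive for the demo
        flusher.start();
    }

    /** Called by a sending thread: submit a flush request, then block until it is covered. */
    void appendAndWaitForFlush() {
        // A real broker would first append the message to memory here.
        CompletableFuture<Void> flushed = new CompletableFuture<>();
        requests.add(flushed);
        flushed.join();
    }

    private void flushLoop() {
        List<CompletableFuture<Void>> batch = new ArrayList<>();
        try {
            while (true) {
                batch.add(requests.take()); // wait for at least one request
                requests.drainTo(batch);    // pick up every other pending request
                Thread.sleep(5);            // stand-in for one disk flush covering all of them
                flushCount.incrementAndGet();
                for (CompletableFuture<Void> waiter : batch) {
                    waiter.complete(null);  // wake the whole group
                }
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Runs the given number of writer threads and returns how many flushes served them. */
    static int runWriters(int writers) {
        GroupCommitDemo store = new GroupCommitDemo();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < writers; i++) {
            Thread t = new Thread(store::appendAndWaitForFlush);
            threads.add(t);
            t.start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return store.flushCount.get();
    }

    public static void main(String[] args) {
        System.out.println("8 writers were served by " + runWriters(8) + " flush(es)");
    }
}
```

The number of flushes is typically smaller than the number of writers, because requests that arrive while a flush is in progress are grouped into the next one.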

2. Asynchronous flush

The advantage of synchronous flush is that messages are not lost: if success is returned to the client, the message has already been persisted to disk. This reliability comes at the cost of higher write latency. Because RocketMQ writes messages to the page cache first, the chance of losing a message is small even without a synchronous flush. If you can tolerate a small probability of message loss, consider asynchronous flush.

Asynchronous flush means that the broker returns success as soon as the message is stored in the page cache. A background thread then periodically calls the force method of FileChannel to flush in-memory data to disk. The default interval is 500ms.
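A rough Java sketch of this pattern is shown below: append returns as soon as the data has been handed to the operating system, and a scheduled background task calls FileChannel.force. The class and method names and the temporary file are hypothetical; only the periodic force call and the 500ms interval come from the text.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class AsyncFlushDemo implements AutoCloseable {
    private final FileChannel channel;
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final AtomicInteger flushes = new AtomicInteger();

    AsyncFlushDemo(Path file, long intervalMillis) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        scheduler.scheduleWithFixedDelay(this::flush, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    /** Hands the bytes to the OS (page cache) and returns without waiting for the disk. */
    void append(String message) throws IOException {
        ByteBuffer data = ByteBuffer.wrap(message.getBytes(StandardCharsets.UTF_8));
        while (data.hasRemaining()) {
            channel.write(data);
        }
    }

    private void flush() {
        try {
            channel.force(false); // persist file content; metadata is not forced
            flushes.incrementAndGet();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() throws IOException {
        scheduler.shutdown(); // stop scheduling; let a running flush finish
        try {
            scheduler.awaitTermination(1, TimeUnit.SECONDS);
            channel.force(false); // final flush before closing
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            channel.close();
        }
    }

    /** Appends one message and waits (up to 5 s) until the background flush has run. */
    static int flushesObserved(long intervalMillis) {
        try {
            Path file = Files.createTempFile("async-flush-demo", ".log");
            file.toFile().deleteOnExit();
            try (AsyncFlushDemo store = new AsyncFlushDemo(file, intervalMillis)) {
                store.append("hello");
                long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
                while (store.flushes.get() == 0 && System.nanoTime() < deadline) {
                    Thread.sleep(10);
                }
                return store.flushes.get();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("background flushes observed: " + flushesObserved(500));
    }
}
```

Anything appended after the most recent background flush exists only in the page cache, which is why a machine power failure can lose up to one interval's worth of messages in this mode.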

Memory-level read and write separation

To reduce the pressure on the page cache, RocketMQ introduces the transientStorePoolEnable mechanism, a memory-level read-write separation mechanism.

By default, RocketMQ writes messages to the page cache and also reads from the page cache when messages are consumed. Under high concurrency the page cache comes under heavy pressure, which easily leads to transient broker busy errors. With transientStorePoolEnable, a message is first written to off-heap memory and the write returns immediately; the data in off-heap memory is then committed asynchronously to the page cache, and finally flushed asynchronously to disk. The working mechanism is shown in the figure below:

When messages are consumed, reads do not go to off-heap memory but to the page cache. This creates memory-level read-write separation: writes mainly target off-heap memory, while reads target the page cache.

The advantage of this scheme is that messages are written directly to off-heap memory and then written to the page cache asynchronously. Compared with appending each message directly to the page cache, its biggest benefit is that writes to the page cache are batched.

The disadvantage of this scheme is that if the Broker process exits abnormally, the data still held in off-heap memory is lost, whereas data already in the page cache survives a Broker process exit.
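The write path of this mechanism can be sketched as a direct (off-heap) ByteBuffer that is committed to a FileChannel in batches. The Java code below is a simplified, single-buffer illustration with hypothetical names; the real commit step runs on a background thread, which is omitted here so that the sketch stays short.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class TransientStoreDemo {
    private final ByteBuffer writeBuffer; // off-heap staging area for incoming messages
    private final FileChannel channel;    // writes to the channel land in the page cache
    private int committed = 0;            // bytes already handed to the page cache

    TransientStoreDemo(FileChannel channel, int capacity) {
        this.channel = channel;
        this.writeBuffer = ByteBuffer.allocateDirect(capacity);
    }

    /** Write path: copy the message into off-heap memory and return immediately. */
    synchronized void append(String message) {
        writeBuffer.put(message.getBytes(StandardCharsets.UTF_8));
    }

    /** Commit path: hand everything appended since the last commit to the page cache in one batch. */
    synchronized int commit() throws IOException {
        ByteBuffer batch = writeBuffer.duplicate(); // shares content, has its own position and limit
        batch.position(committed);
        batch.limit(writeBuffer.position());
        int bytes = batch.remaining();
        while (batch.hasRemaining()) {
            channel.write(batch);
        }
        committed += bytes;
        return bytes;
    }

    /** Returns the file size before and after one commit of two appended messages. */
    static long[] fileSizesAroundCommit() {
        try {
            Path file = Files.createTempFile("transient-demo", ".log");
            file.toFile().deleteOnExit();
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
                TransientStoreDemo store = new TransientStoreDemo(ch, 1024);
                store.append("msg-1");
                store.append("msg-2");
                long before = ch.size(); // nothing has reached the file yet
                store.commit();          // both messages are written in one batch
                long after = ch.size();
                ch.force(false);         // the separate flush step
                return new long[] {before, after};
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] sizes = fileSizesAroundCommit();
        System.out.println("file size before commit: " + sizes[0] + ", after commit: " + sizes[1]);
    }
}
```

Until commit runs, the appended bytes exist only in the process's off-heap buffer, which makes the drawback described above concrete: a process crash at that point loses them.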
