Text | Huang (SOFAJRaft project team)
Department of Computer Science, Fuzhou University
Research direction | Distributed middleware, distributed databases
GitHub homepage | https://github.com/hzh0425
Proofreading | Feng Jiachun (Head of the SOFAJRaft open source community)
This article is 9,402 words, an 18-minute read.
PART. 1 Project Introduction
1.1 Introduction to SOFAJRaft
SOFAJRaft is a production-grade, high-performance Java implementation of the RAFT consensus algorithm. It supports MULTI-RAFT-GROUP and is suitable for high-load, low-latency scenarios. With SOFAJRaft you can focus on your own business logic, while SOFAJRaft takes care of all the technical details of RAFT. It is also very easy to use: you can master it in a short time by working through a few examples.
Github address:
https://github.com/sofastack/sofa-jraft
1.2 Task requirements
Goal: the current LogStorage implementation uses a design that separates index and data. The key and the offset of the value are written to RocksDB as the index, while the log entries (the data) are written to a Segment Log. Because users of SOFAJRaft often depend on different versions of RocksDB, they have to change their own RocksDB version to adapt to SOFAJRaft. We therefore want to make an improvement: remove the dependency on RocksDB and build a pure Java index module instead.
PART. 2 Pre-knowledge
Log Structured File Systems
If you have learned about message queues like Kafka, you should be familiar with log-based systems.
As shown in the figure, we store a set of log files on a single machine's disk. These files generally fall into old files and a new file. The difference is that the Active Data File, the new file currently being written, is generally mapped into memory (based on mmap memory-mapping technology), while the Older Data Files have already been fully written and flushed to disk. Once an Active File has been fully written, it is closed and a new Active File is opened to continue writing.
On every write, each Log Entry is appended to the end of the Active File. Since the Active File is usually mapped into the OS Page Cache via mmap, every write is a sequential write in memory, and the performance is therefore very high.
In the end, a file is nothing more than a collection of Log Entries, as shown in the figure:
At the same time, it is not enough to just write the logs to files: when we need to look up a log, we cannot afford to scan every file in order; the performance would be far too poor. So we also need to build a "directory" for these files, namely index files. An index file is essentially a collection of fixed-size index items, each of which stores the meta-information of a LogEntry, such as:
-File_Id: the data file in which the corresponding LogEntry is stored
-Value_sz: the size of the LogEntry's data
(Note: the LogEntry is serialized and stored in binary form)
-Value_pos: the offset within the corresponding file at which the data starts
-Others, such as a crc, a timestamp, etc.
The index file has the following characteristics, which make it very convenient to find an IndexEntry:
-Every IndexEntry has a fixed size
-An IndexEntry stores the meta-information of a LogEntry
-IndexEntries are monotonically increasing
For example, to find the log with LogIndex = 4 (a minimal sketch follows the list):
-Step 1: from LogIndex = 4, compute the position of the index item: IndexPos = IndexEntrySize * 4
-Step 2: using IndexPos, read the corresponding IndexEntry from the index file
-Step 3: using the meta-information in the IndexEntry, such as File_Id and Value_pos, go to the corresponding Data File
-Step 4: read the corresponding LogEntry from that file
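Putting the four steps together, a minimal lookup sketch could look like the following. The file names, the 10-byte index layout (4-byte position + 4-byte size + 2-byte file id) and the method names are illustrative assumptions, not SOFAJRaft's actual format or API:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

public class LogLookupSketch {
    // Assumed fixed layout of one index item: 4-byte position + 4-byte size + 2-byte file id.
    static final int INDEX_ENTRY_SIZE = 10;

    static byte[] readLogEntry(final long logIndex) throws IOException {
        // Step 1: the position of the index item follows directly from the fixed entry size.
        final long indexPos = INDEX_ENTRY_SIZE * logIndex;

        // Step 2: read the fixed-size IndexEntry out of the index file.
        final byte[] indexBytes = new byte[INDEX_ENTRY_SIZE];
        try (RandomAccessFile indexFile = new RandomAccessFile("index.idx", "r")) {
            indexFile.seek(indexPos);
            indexFile.readFully(indexBytes);
        }
        final ByteBuffer buf = ByteBuffer.wrap(indexBytes);
        final int valuePos  = buf.getInt();   // Value_pos: where the data starts in the data file
        final int valueSize = buf.getInt();   // Value_sz: size of the serialized LogEntry
        final short fileId  = buf.getShort(); // File_Id: which data file holds the LogEntry

        // Steps 3 and 4: open the data file identified by File_Id and read the LogEntry bytes.
        final byte[] logBytes = new byte[valueSize];
        try (RandomAccessFile dataFile = new RandomAccessFile("segment-" + fileId + ".log", "r")) {
            dataFile.seek(valuePos);
            dataFile.readFully(logBytes);
        }
        return logBytes; // the serialized LogEntry, ready to be deserialized
    }
}
```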
Memory-mapping technology: mmap
A key technique was mentioned above: mapping a file into memory and writing the Active File through that mapping. This is also central to log-based systems. Under Unix/Linux there are generally two ways to read and write files.
Traditional file IO model
A standard IO flow is to open a file and then read part or all of it with the read system call. The read path works like this: the kernel reads the data from the disk into the kernel page cache, and then copies it from the kernel page cache into the address space of the user process. This involves two copies of the data: disk -> kernel, and kernel -> user space.
Moreover, when several processes read the same file at the same time, each process keeps its own copy in its address space, which is clearly not optimal and wastes physical memory, as shown in the following figure:
Memory mapping technology
The second way is to use memory mapping.
The specific procedure is: open a file, then call the mmap system call to map all or part of the file content directly into the address space of the process (that is, directly establish a mapping between a region of the process's private address space and the file object). Once the mapping is complete, the process can operate on the region like ordinary memory, for example with memcpy. mmap does not allocate physical memory in advance; it only occupies virtual address space in the process.
When the process first accesses a mapped page, the data has not actually been copied yet, so the MMU cannot find a physical address for that virtual address in the page table and a page fault is triggered. The kernel then reads that page of the file into the kernel page cache and updates the process's page table to point to that page of the Page Cache. Later, when another process accesses the same page, it is already in memory, and the kernel only needs to update that process's page table to point to the same page in the kernel page cache, as shown in the following figure:
For larger files (a single mapped file generally needs to be limited to less than about 1.5~2 GB), the read/write efficiency and performance of mmap are very high.
Of course, with mmap a write does not go to disk; it goes to the Page Cache. So if we want the written data to actually reach the disk, we also need to consider when it is most appropriate to flush (described later).
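To make this concrete in Java terms, here is a minimal sketch of appending to a memory-mapped file and explicitly flushing it. The file name and sizes are assumptions for illustration, not the project's actual code:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapAppendExample {
    public static void main(String[] args) throws IOException {
        // Map a 16 MB region of an (illustrative) segment file into memory.
        try (FileChannel channel = FileChannel.open(Path.of("segment-0.log"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer mmap = channel.map(FileChannel.MapMode.READ_WRITE, 0, 16 * 1024 * 1024);

            // Appending is just a sequential write into the mapped Page Cache region.
            byte[] entry = "log entry payload".getBytes(StandardCharsets.UTF_8);
            mmap.putInt(entry.length);
            mmap.put(entry);

            // The data is only in the Page Cache so far; force() flushes it to disk.
            // A real log store batches many writes before flushing (see group commit in Section 4.2).
            mmap.force();
        }
    }
}
```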
PART. 3 Architecture Design
3.1 SOFAJRaft's original log system architecture
The following figure shows the overall design of SOFAJRaft's original log system:
Among them, LogManager provides the log-related interfaces, such as:
```java
/**
 * Append log entry vector and wait until it's stable (NOT COMMITTED!)
 *
 * @param entries log entries
 * @param done    callback
 */
void appendEntries(final List<LogEntry> entries, StableClosure done);

/**
 * Get the log entry at index.
 *
 * @param index the index of log entry
 * @return the log entry with {@code index}
 */
LogEntry getEntry(final long index);

/**
 * Get the log term at index.
 *
 * @param index the index of log entry
 * @return the term of log entry
 */
long getTerm(final long index);
```
In fact, when the upper-level Node calls these methods, LogManager does not handle them directly; instead, it publishes events (done, EventType) to the high-performance concurrent queue Disruptor.
LogManager can therefore be regarded as a "facade": it provides the interface for accessing logs, while the actual work is scheduled concurrently through Disruptor.
"Note": many places in SOFAJRaft use Disruptor for decoupling, asynchronous callbacks, and parallel scheduling, such as SnapshotExecutor and NodeImpl. Interested readers can explore the community code; it is of great benefit for learning Java concurrent programming!
For an introduction to the Disruptor concurrent queue, you can see here:
https://tech.meituan.com/2016/11/18/disruptor.html
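As a rough illustration of this pattern only (the event type, handler, and fields below are assumptions, not LogManager's actual code), publishing to a Disruptor ring buffer and consuming the events looks roughly like this:

```java
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class DisruptorSketch {
    /** A hypothetical event carrying the type of log operation and its completion callback. */
    static class StableClosureEvent {
        Runnable done; // callback to run once the entries are stable
        String   type; // e.g. "APPEND", "TRUNCATE_PREFIX", ...
    }

    public static void main(String[] args) {
        // The ring buffer size must be a power of two.
        Disruptor<StableClosureEvent> disruptor = new Disruptor<>(
                StableClosureEvent::new, 16 * 1024, DaemonThreadFactory.INSTANCE);

        // A single consumer processes events in order, off the caller's thread.
        disruptor.handleEventsWith((EventHandler<StableClosureEvent>) (event, sequence, endOfBatch) -> {
            // ... write the entries to storage here, then invoke the callback ...
            if (event.done != null) {
                event.done.run();
            }
        });
        disruptor.start();

        // The "facade" side just publishes an event and returns immediately.
        disruptor.getRingBuffer().publishEvent((event, sequence) -> {
            event.type = "APPEND";
            event.done = () -> System.out.println("entries are stable");
        });

        // Wait for outstanding events to be processed, then stop the consumer.
        disruptor.shutdown();
    }
}
```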
Finally, the component where logs are actually stored is LogStorage, which LogManager calls into.
LogStorage is also an interface:
```java
/**
 * Append entries to log.
 */
boolean appendEntry(final LogEntry entry);

/**
 * Append entries to log, return append success number.
 */
int appendEntries(final List<LogEntry> entries);

/**
 * Delete logs from storage's head, [first_log_index, first_index_kept) will
 * be discarded.
 */
boolean truncatePrefix(final long firstIndexKept);

/**
 * Delete uncommitted logs from storage's tail, (last_index_kept, last_log_index]
 * will be discarded.
 */
boolean truncateSuffix(final long lastIndexKept);
```
In the original system, the default implementation class is RocksDBLogStorage, which adopts a design that stores index and log separately: the index is stored in RocksDB, and the log is stored in SegmentFiles.
As shown in the figure, RocksDBSegmentLogStorage extends RocksDBLogStorage: RocksDBSegmentLogStorage is responsible for log storage, and RocksDBLogStorage is responsible for index storage.
3.2 Project task analysis
From the description of the original log system above, combined with the requirements of the project, we can see that my task is to implement a new LogStorage in pure Java, without relying on RocksDB. In practice, log storage and index storage share many implementation concerns, such as memory-mapped files (mmap), file pre-allocation, and asynchronous flushing. So my task is not just to build a new index module, but also to deliver the following:
-A reusable file system layer, so that both logs and indexes can build their storage directly on top of it
-Compatibility with SOFAJRaft's storage system: a new LogStorage implementation that LogManager can call
-A high-performance storage system that significantly improves on the original one
-Highly readable code that complies with SOFAJRaft's coding conventions
......
For this task, my mentor and I discussed and revised the storage architecture design many times, and finally arrived at a complete solution that meets all of the requirements above.
3.3 Improved log system
Architecture design
The following figure shows the improved log system. DefaultLogStorage is the implementation class of the LogStorage interface described above. The three DBs are logical storage objects; the actual data lives in the AbstractFiles managed by the FileManager. In addition, the Services in the ServiceManager play a supporting role; for example, FlushService provides the flushing capability.
Why do we need three DBs to store the data? And what is ConfDB for?
The following picture explains the roles of the three DBs:
In SOFAJRaft's original storage system, Configuration-type logs and ordinary logs are stored separately in order to speed up reading Configuration-type logs. Therefore, we also need a ConfDB here to store Configuration-type logs.
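As an illustration only (the class and method names below are assumptions, not the project's actual API), appending could route an entry to ConfDB or SegmentDB depending on its type, while the index always goes to IndexDB:

```java
// Illustrative only: routing a log entry to the right logical DB by its type.
public class LogRoutingSketch {

    /** Assumed minimal interface of a logical DB. */
    interface AbstractDB {
        /** Returns the position at which the entry was appended inside the DB's files. */
        long appendLog(long logIndex, byte[] data);
    }

    private final AbstractDB segmentDB; // ordinary log entries
    private final AbstractDB confDB;    // Configuration-type entries, stored separately
    private final AbstractDB indexDB;   // fixed-size index items for both of the above

    public LogRoutingSketch(AbstractDB segmentDB, AbstractDB confDB, AbstractDB indexDB) {
        this.segmentDB = segmentDB;
        this.confDB = confDB;
        this.indexDB = indexDB;
    }

    public void append(long logIndex, boolean isConfiguration, byte[] data) {
        // Configuration logs and ordinary logs live in different DBs,
        // so that Configuration logs can be read back quickly.
        final AbstractDB target = isConfiguration ? this.confDB : this.segmentDB;
        final long position = target.appendLog(logIndex, data);

        // Every entry, regardless of type, gets a fixed-size index item in IndexDB.
        this.indexDB.appendLog(logIndex, encodeIndexItem(position, data.length, isConfiguration));
    }

    private static byte[] encodeIndexItem(long position, int size, boolean isConfiguration) {
        // Encoding of the fixed-size index item is omitted here (see Section 2.1).
        return new byte[0];
    }
}
```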
3.4 Code module description
The code is mainly divided into four modules:
-DB module (under the db folder)
-File module (under the file folder)
-Service module (under the service folder)
-Factory module (under the factory folder)
-DefaultLogStorage, the new LogStorage implementation class described above
3.5 Performance test
Test background:
-Operating system: Windows
-Total size of written data: 8 GB
-Memory: 24 GB
-CPU: 4 cores, 8 threads
-Test code: DefaultLogStorageBenchmark
Data:
-Log Number: a total of 524,288 logs were written
-Log Size: each log is 16,384 bytes
-Total Size: a total of 8,589,934,592 bytes (8 GB) were written
-Write time: 45 s (roughly 190 MB/s)
-Read time: 5 s (roughly 1.7 GB/s)
```
Test write:
  Log number   : 524288
  Log Size     : 16384
  Cost time(s) : 45
  Total size   : 8589934592
Test read:
  Log number   : 524288
  Log Size     : 16384
  Cost time(s) : 5
  Total size   : 8589934592
Test done!
```
PART. 4 System Highlights
4.1 Log System File Management
Section 2.1 introduced the basic concepts of a log-structured system; to recap:
So how are the log files managed in this project? As shown in the figure, all the log files of each DB (IndexDB corresponds to IndexFiles, SegmentDB corresponds to SegmentFiles) are managed by a FileManager.
Take the IndexFiles used by IndexDB as an example. Suppose each IndexFile is 126 bytes, of which the FileHeader takes 26 bytes; the file can then hold ten index items, each of which is 10 bytes.
And FileHeader stores the basic meta information of a file:
```java
// Index of the first element stored in this file: the StartIndex in the figure
private volatile long FirstLogIndex = BLANK_OFFSET_INDEX;
// Offset of this file: the BaseOffset in the figure
private long FileFromOffset = -1;
```
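With these two fields, the in-file position of an index item can be computed directly. The method below is a minimal sketch under the 26-byte header / 10-byte item assumption from the example above, not the project's actual code:

```java
public final class IndexPositionSketch {
    static final int FILE_HEADER_SIZE = 26; // bytes, per the example above
    static final int INDEX_ENTRY_SIZE = 10; // bytes, per the example above

    /**
     * Position of the index item for logIndex inside the file whose FileHeader
     * records firstLogIndex (the StartIndex of that file).
     */
    static long positionInFile(final long logIndex, final long firstLogIndex) {
        return FILE_HEADER_SIZE + (logIndex - firstLogIndex) * INDEX_ENTRY_SIZE;
    }

    public static void main(String[] args) {
        // E.g. the item for logIndex 4 in a file starting at logIndex 0 sits at 26 + 4 * 10 = 66.
        System.out.println(positionInFile(4, 0)); // 66
    }
}
```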
With these two pieces of meta-information, the FileManager can manage all files in a unified way, which brings the following advantages:
-Unified management of all files
-It is easy to find which file a given log lives in from its LogIndex: since all files are ordered by FirstLogIndex, a binary search does the job:
```java
int lo = 0, hi = this.files.size() - 1;
while (lo <= hi) {
    final int mid = (lo + hi) >>> 1;
    final AbstractFile file = this.files.get(mid);
    if (file.getLastLogIndex() < logIndex) {
        lo = mid + 1;
    } else if (file.getFirstLogIndex() > logIndex) {
        hi = mid - 1;
    } else {
        return this.files.get(mid);
    }
}
```
-It makes flushing convenient (mentioned in Section 4.2)
4.2 Group Commit
In Section 2.2 we saw that because of mmap, a write cannot simply return once the data is written: a flush is needed to guarantee that the data reaches the disk. At the same time, we cannot flush on every single write, because disk IO is extremely slow; flushing after every log would make performance very poor.
Therefore, to keep the disk from becoming the bottleneck, this project introduces a group commit mechanism. The idea of group commit is to delay the flush: first write as large a batch of logs as possible into the Page Cache, then flush them all at once, thereby reducing the number of flushes, as the picture shows:
-LogManager writes logs in batches by calling appendEntries()
-DefaultLogStorage writes the logs by calling the DB interfaces
-DefaultLogStorage registers a FlushRequest with the FlushService of the corresponding DB and blocks waiting for it. The FlushRequest carries the ExpectedFlushPosition, the position up to which the caller expects the data to be flushed.
```java
private boolean waitForFlush(final AbstractDB logDB, final long exceptedLogPosition,
                             final long exceptedIndexPosition) {
    try {
        final FlushRequest logRequest = FlushRequest.buildRequest(exceptedLogPosition);
        final FlushRequest indexRequest = FlushRequest.buildRequest(exceptedIndexPosition);

        // Register the FlushRequests
        logDB.registerFlushRequest(logRequest);
        this.indexDB.registerFlushRequest(indexRequest);

        // Block and wait to be woken up
        final int timeout = this.storeOptions.getWaitingFlushTimeout();
        CompletableFuture.allOf(logRequest.getFuture(), indexRequest.getFuture())
            .get(timeout, TimeUnit.MILLISECONDS);
    } catch (final Exception e) {
        LOG.error(.....);
        return false;
    }
    return true;
}
```
-Once the FlushService has flushed past the expectedFlushPosition, doWakeupConsumer() wakes up the blocked DefaultLogStorage:
```java
while (!isStopped()) {
    // Block waiting for flush requests
    while ((size = this.requestQueue.blockingDrainTo(this.tempQueue, QUEUE_SIZE,
            WAITING_TIME, TimeUnit.MILLISECONDS)) == 0) {
        if (isStopped()) {
            break;
        }
    }
    if (size > 0) {
        .......
        // Flush to disk
        doFlush(maxPosition);
        // Wake up DefaultLogStorage
        doWakeupConsumer();
        .....
    }
}
```
So how does FlushService cooperate with FileManager to flush the disk? In other words, how does FlushService find the right files to flush?
In fact, FileManager maintains a variable, FlushedPosition, which records the position flushed so far. From Section 4.1 we know that the FileHeader of each file managed by FileManager records that file's BaseOffset. So we only need to work out which file's interval the FlushedPosition currently falls into. For example:
If the current FlushedPosition = 130, we know that the second file is the one currently being flushed.
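A minimal sketch of that lookup, assuming each file exposes its BaseOffset and size (the names are illustrative, not the actual FileManager code):

```java
import java.util.List;

public class FindFileByPositionSketch {

    /** Illustrative stand-in for a managed file whose header records a BaseOffset. */
    record ManagedFile(long baseOffset, long fileSize) {
        boolean contains(long position) {
            return position >= baseOffset && position < baseOffset + fileSize;
        }
    }

    /** Files are ordered by BaseOffset; a linear scan is shown for brevity, a binary search also works. */
    static ManagedFile findFileByPosition(List<ManagedFile> files, long flushedPosition) {
        for (ManagedFile file : files) {
            if (file.contains(flushedPosition)) {
                return file;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Two 126-byte files, as in the IndexFile example: base offsets 0 and 126.
        List<ManagedFile> files = List.of(new ManagedFile(0, 126), new ManagedFile(126, 126));
        // FlushedPosition = 130 falls inside [126, 252), i.e. the second file is being flushed.
        System.out.println(findFileByPosition(files, 130));
    }
}
```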
4.3 File pre-allocation
When the log system fills up a file and has to open a new one, opening a fresh file is often a time-consuming step. File pre-allocation means mapping a number of empty files via mmap ahead of time and keeping them in a container; the next time an append finds the previous file full, it can simply take an empty file from the container and use it right away. In this project, a background thread, AllocateFileService, does this work. Inside this allocator I use the classic producer-consumer pattern, implemented with ReentrantLock + Condition.
```java
// Pre-allocated files
private final ArrayDeque<AllocatedResult> blankFiles = new ArrayDeque<>();

private final Lock      allocateLock;
private final Condition fullCond;
private final Condition emptyCond;
```
Here, fullCond indicates whether the container is currently full, and emptyCond indicates whether it is currently empty.
```java
private void doAllocateAbstractFileInLock() throws InterruptedException {
    this.allocateLock.lock();
    try {
        // If the container is full, block until woken up
        while (this.blankAbstractFiles.size() >= this.storeOptions.getPreAllocateFileCount()) {
            this.fullCond.await();
        }
        // Allocate a file
        doAllocateAbstractFile0();
        // The container is no longer empty, wake up a blocked consumer
        this.emptyCond.signal();
    } finally {
        this.allocateLock.unlock();
    }
}

public AbstractFile takeEmptyFile() throws Exception {
    this.allocateLock.lock();
    try {
        // If the container is empty, the consumer blocks and waits
        while (this.blankAbstractFiles.isEmpty()) {
            this.emptyCond.await();
        }
        final AllocatedResult result = this.blankAbstractFiles.pollFirst();
        // Wake up the producer
        this.fullCond.signal();
        return result.abstractFile;
    } finally {
        this.allocateLock.unlock();
    }
}
```
4.4 File preheating
When mmap was introduced in Section 2.2, we saw that the operating system does not allocate physical memory right after the mmap system call; only when a page is accessed for the first time does a page fault occur and a physical page get assigned. Imagine mapping a 1 GB file with 4 KB pages: faulting the whole file in triggers a page fault for every single page, hundreds of thousands of them, so this also needs to be optimized.
When AllocateFileService pre-allocates a file, it also makes two system calls:
-madvise(): in short, it advises the operating system to read the file ahead; the operating system may or may not take the advice
-mlock(): locks part or all of the process's address space into physical memory, preventing it from being swapped out by the operating system
For the SOFAJRaft scenario, what we pursue is low latency for reading and writing messages, so we definitely want to use as much physical memory as possible to improve the efficiency of data access.
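Java has no direct madvise/mlock API (native calls, e.g. via JNA, would be needed for a real mlock), so warm-up is commonly done by touching every page of the mapped region right after allocation. The sketch below only shows that page-touch part and is an illustration, not the project's actual AllocateFileService code:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WarmUpSketch {
    private static final int OS_PAGE_SIZE = 4 * 1024;

    /**
     * Touch every page of the mapped file so that the page faults happen here,
     * during pre-allocation, instead of later on the write path.
     */
    static void warmUp(MappedByteBuffer mmap, int fileSize) {
        for (int pos = 0; pos < fileSize; pos += OS_PAGE_SIZE) {
            mmap.put(pos, (byte) 0);
        }
        mmap.force(); // make sure the zeroed pages are written back as well
    }

    public static void main(String[] args) throws IOException {
        final int fileSize = 16 * 1024 * 1024;
        try (FileChannel channel = FileChannel.open(Path.of("preallocated.blank"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer mmap = channel.map(FileChannel.MapMode.READ_WRITE, 0, fileSize);
            warmUp(mmap, fileSize);
        }
    }
}
```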
-Harvest-
Through this process, I gradually learned the general workflow of a project:
-First, polish the project plan carefully and think through whether it is feasible.
-Second, communicate frequently with your mentor during the project and surface problems as early as possible. I also ran into some problems I could not solve on my own, and Teacher Jiachun patiently helped me track them down. Thank you very much!
-Finally, pay attention to every detail of the code, including naming and comments.
As Teacher Jiachun mentioned in the final comment, "What really makes xxx stand out is attention to low-level details".
In future project development, I will pay more attention to the details of the code, aiming for beautiful code while also taking performance into account.
Going forward, I plan to contribute more to the SOFAJRaft project, and I hope to be promoted to community Committer as soon as possible. I will also keep exploring cloud native with the excellent projects of the SOFAStack community!
-Acknowledgements-
First of all, I feel very lucky to have taken part in this Summer of Open Source activity. Thank you, Feng Jiachun, for your patient guidance and help!
Thanks to the Open Source Software Supply Chain Lighting Program and the SOFAStack community for giving me this opportunity!
*Recommended reading this week*
SOFAJRaft's practice at Tongcheng Travel
The next frontier of Kubernetes: multi-cluster management
A production-grade, high-performance Java implementation based on RAFT: the SOFAJRaft series collection
Finally! SOFATracer has completed its link visualization journey