LotusDB Design and Implementation—1 Basic Concepts

LotusDB is a brand-new KV storage engine. The source is on GitHub at https://github.com/flower-corp/lotusdb. If you find it useful, please star the repository or get involved!

LotusDB is a standalone KV storage engine based on the LSM Tree design that also borrows the strengths of the B+ Tree, giving it fast and stable read and write performance.

In the traditional LSM Tree architecture, inserts and deletes are written to SST files in sorted order. Several versions of the data for the same key may exist at once, so a complex compaction strategy is needed to reclaim space, which in turn causes space amplification and write amplification.

An LSM Tree maintains multiple levels of SSTable files on disk. A read has to search the files level by level to find the requested key. In the worst case it searches the SSTables of every level, so read performance is unstable.
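
To make the level-by-level search concrete, here is a minimal sketch in Go. It is not taken from any real LSM implementation: each level is modeled as a plain map, and the `level` type and `lookup` function are names made up for this illustration. The point is only that a key living in the oldest level, or not present at all, costs one probe per level.

```go
package main

import "fmt"

// level is a stand-in for all SSTables of one LSM level (assumption:
// a map instead of sorted on-disk files).
type level map[string]string

// lookup probes the levels from newest (index 0) to oldest and reports
// how many levels had to be checked before the search ended.
func lookup(levels []level, key string) (value string, probed int, found bool) {
	for _, l := range levels {
		probed++
		if v, ok := l[key]; ok {
			return v, probed, true // newest version wins
		}
	}
	return "", probed, false
}

func main() {
	levels := []level{
		{"a": "a2"},            // newest data
		{"a": "a1", "b": "b1"}, // older version of "a" still on disk
		{"c": "c1"},            // oldest level
	}
	for _, k := range []string{"a", "c", "z"} {
		v, n, ok := lookup(levels, k)
		fmt.Printf("key=%s value=%q levels probed=%d found=%v\n", k, v, n, ok)
	}
}
```

Real engines reduce this cost with bloom filters and caches, which is part of the complexity discussed later in this article.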

The other common data storage model, alongside the LSM Tree, is the B+ Tree. Because it maps well onto disk pages, the B+ Tree is widely used in database storage engines. The best-known example is MySQL's InnoDB engine.

A B+ Tree keeps its data in the leaf nodes at the bottom of the tree, so read performance is relatively stable. Inserts and updates, however, are performed with random IO, which makes B+ Tree write performance relatively low.

The LSM storage model was born in the era of the HDD (mechanical hard disk), where random and sequential read and write speeds differ enormously. The LSM design therefore exploits sequential IO as much as possible: all data is first cached in an in-memory buffer and then written to files sequentially in batches. As storage hardware has evolved, however, the gap between random and sequential reads and writes has narrowed, and on some media there is hardly any difference between the two.

Some of the LSM Tree's design choices made for the sake of sequential IO are overly complicated, which makes the whole system hard to implement and control (anyone familiar with RocksDB will know this well).

Designing the underlying storage engine of a system yourself can be easier than mastering a complex existing project, and it makes related problems easier to locate and fix. This is why CockroachDB replaced RocksDB with its self-developed Pebble storage engine. LotusDB aims to be that kind of storage engine: concise, intuitive, and efficient, and therefore easy to learn and master.

The overall architecture diagram of LotusDB is as follows:

LotusDB keeps the LSM Tree write path, because it maximizes both the durability of written data and write throughput. A WAL is maintained on disk, and new data is first appended to the WAL before it is written to memory, so that it is not lost.
Several skip lists are maintained in memory. The newest one is called the active memtable. When a memtable fills up it becomes an immutable memtable, which accepts no new writes and waits for a background thread to flush it to disk.
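
The write path described above can be sketched in a few lines of Go. This is a simplified illustration, not LotusDB's actual code: the `db` and `memtable` types, the `put` method, the record layout, and the `newDB` helper are all assumptions made for the example. The memtable is a plain map rather than a skip list, and the WAL is an in-memory buffer rather than a file on disk.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// memtable stands in for the in-memory skip list; a map keeps the sketch short.
type memtable struct {
	data map[string][]byte
	size int // approximate bytes held, used to decide when the table is full
}

func newMemtable() *memtable {
	return &memtable{data: make(map[string][]byte)}
}

// db is a hypothetical type for this sketch, not LotusDB's API.
type db struct {
	wal       io.Writer   // append-only log; a file on disk in a real engine
	active    *memtable   // receives all new writes
	immutable []*memtable // full memtables waiting for the background flush
	limit     int         // memtable size limit in bytes
}

// newDB returns a db whose WAL is an in-memory buffer, so the sketch needs no files.
func newDB(limit int) (*db, *bytes.Buffer) {
	walBuf := &bytes.Buffer{}
	return &db{wal: walBuf, active: newMemtable(), limit: limit}, walBuf
}

// put appends the record to the WAL first and only then updates the memtable.
func (d *db) put(key, value []byte) error {
	// Assumed record layout: keyLen | valueLen | key | value.
	var header [8]byte
	binary.LittleEndian.PutUint32(header[0:4], uint32(len(key)))
	binary.LittleEndian.PutUint32(header[4:8], uint32(len(value)))
	record := append(append(header[:], key...), value...)
	if _, err := d.wal.Write(record); err != nil {
		return err // the memtable stays untouched if the log write fails
	}

	// Rotate: a full active memtable becomes immutable and a new one takes over.
	entrySize := len(key) + len(value)
	if d.active.size > 0 && d.active.size+entrySize > d.limit {
		d.immutable = append(d.immutable, d.active)
		d.active = newMemtable()
	}
	d.active.data[string(key)] = append([]byte(nil), value...)
	d.active.size += entrySize
	return nil
}

func main() {
	d, walBuf := newDB(16)
	for _, kv := range [][2]string{{"k1", "value-1"}, {"k2", "value-2"}} {
		if err := d.put([]byte(kv[0]), []byte(kv[1])); err != nil {
			panic(err)
		}
	}
	fmt.Println("immutable memtables:", len(d.immutable), "WAL bytes:", walBuf.Len())
}
```

A real engine would also sync the WAL to disk according to its durability settings and hand the immutable memtables to a background flush routine. Both are left out here.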

During a flush, the index information is stored in the B+ Tree and the values are stored separately in the value log. The value log is structured like the WAL, and data is written by appending. Each value log file has a size threshold. Once a file is full, a new one is opened, so there are multiple value log files.
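
A rough Go sketch of the flush step follows. It rests on several assumptions that are not from LotusDB itself: the B+ Tree is replaced by a map, value log files are byte slices in memory, and the `store`, `position`, `appendValue`, and `flush` names are invented for this example. The `size` field in `position` is also an addition so that the sketch knows how many bytes to read back; the article only mentions the file ID and offset.

```go
package main

import (
	"fmt"
	"sort"
)

// position is what the index stores for a key. fid and offset come from the
// article; size is an assumption added for this sketch.
type position struct {
	fid    int
	offset int
	size   int
}

// store is hypothetical: index stands in for the B+ Tree, files for the value logs.
type store struct {
	index     map[string]position
	files     [][]byte // files[fid] holds the bytes of one value log file
	threshold int      // size limit of a single value log file
}

func newStore(threshold int) *store {
	return &store{index: make(map[string]position), files: [][]byte{nil}, threshold: threshold}
}

// appendValue appends to the newest value log file, opening a new file first
// if the current one would grow past the threshold.
func (s *store) appendValue(value []byte) position {
	active := len(s.files) - 1
	if len(s.files[active]) > 0 && len(s.files[active])+len(value) > s.threshold {
		s.files = append(s.files, nil)
		active++
	}
	pos := position{fid: active, offset: len(s.files[active]), size: len(value)}
	s.files[active] = append(s.files[active], value...)
	return pos
}

// flush moves one immutable memtable to disk: values go to the value log,
// and the resulting positions go to the index.
func (s *store) flush(mem map[string][]byte) {
	keys := make([]string, 0, len(mem))
	for k := range mem {
		keys = append(keys, k)
	}
	sort.Strings(keys) // a skip list would already iterate in key order
	for _, k := range keys {
		s.index[k] = s.appendValue(mem[k])
	}
}

func main() {
	s := newStore(8)
	s.flush(map[string][]byte{
		"a": []byte("1111"),
		"b": []byte("2222"),
		"c": []byte("3333"),
	})
	fmt.Println("value log files:", len(s.files), "index entry for c:", s.index["c"])
}
```

With a threshold of 8 bytes, the first two 4-byte values fill the first file and the third value opens a second one, which is the rollover behavior described above.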

Note that the B+ Tree should be placed on modern storage media such as solid-state drives wherever possible because, as mentioned earlier, B+ Tree writes are random. On a traditional mechanical hard disk, write performance is limited and write amplification is severe, so flushing may become a bottleneck.

That is the overall design of LotusDB. Let's now look at the basic read and write paths under this design.
Writing a key/value: as mentioned earlier, this is exactly the same as in the LSM model. The key/value is first encoded as a log record and appended to the WAL, and then written to the active memtable in memory.
Reading a value by key: the active memtable and then the immutable memtables in memory are searched in turn, and the value is returned immediately if found. Otherwise the value may be on disk, so the index information for the key is fetched from the B+ Tree. The index information is a pair <fid, offset> that identifies a specific value log file and a position within it, and the value is then read directly from that value log file.
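
The read path can be sketched the same way. Again this is only an illustration under assumptions: the `engine` type and `get` method are hypothetical, maps stand in for the skip lists and the B+ Tree, value log files are byte slices, and the index entry carries an extra `size` field that the article does not mention. Deletions (tombstones) are left out entirely.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("key not found")

// position mirrors the <fid, offset> pair; size is an assumption of this sketch.
type position struct{ fid, offset, size int }

// engine is a hypothetical container for the structures a read touches.
type engine struct {
	active    map[string][]byte   // active memtable
	immutable []map[string][]byte // immutable memtables, oldest first
	index     map[string]position // stand-in for the B+ Tree
	valueLogs map[int][]byte      // fid -> contents of one value log file
}

// get checks memory first, newest data first, and only then goes to disk.
func (e *engine) get(key string) ([]byte, error) {
	if v, ok := e.active[key]; ok {
		return v, nil
	}
	for i := len(e.immutable) - 1; i >= 0; i-- {
		if v, ok := e.immutable[i][key]; ok {
			return v, nil
		}
	}
	// Not in memory: one index lookup, then one read from the value log.
	pos, ok := e.index[key]
	if !ok {
		return nil, errNotFound
	}
	file, ok := e.valueLogs[pos.fid]
	if !ok || pos.offset+pos.size > len(file) {
		return nil, errors.New("index points outside the value log")
	}
	return file[pos.offset : pos.offset+pos.size], nil
}

func main() {
	e := &engine{
		active:    map[string][]byte{"a": []byte("new")},
		immutable: []map[string][]byte{{"b": []byte("mem")}},
		index:     map[string]position{"c": {fid: 0, offset: 5, size: 4}},
		valueLogs: map[int][]byte{0: []byte("staledisk")},
	}
	for _, k := range []string{"a", "b", "c", "z"} {
		v, err := e.get(k)
		fmt.Printf("get(%s) = %q, err = %v\n", k, v, err)
	}
}
```

However many value log files exist, a key that is not in memory costs one index lookup plus one value log read, which is where the more predictable read latency comes from.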

Finally, let's summarize the advantages of the LotusDB architecture:
1. The write path is identical to the traditional LSM model, which preserves the high throughput of sequential IO and the durability of data.
2. Compared with a native LSM model, read performance is more stable and read amplification is lower. Lookups on disk go through the B+ Tree, whose stable read performance makes overall read efficiency more predictable.
3. The multi-level SSTables of the LSM Tree model are removed entirely, so there is no SSTable maintenance, and an existing B+ Tree implementation (BoltDB) is used, which greatly reduces the complexity of the system.
4. Less compaction means less wear on the storage media. In LotusDB only the value log is compacted, whereas a native LSM model has to compact its SSTables and, if KV separation is used, the value log as well.
5. The read and write paths are simple and intuitive, with no bloom filter, block cache, and so on.

LotusDB GitHub address: https://github.com/flower-corp/lotusdb

roseduan
