Look at LSM-Tree algorithm design from RocksDB

This is an original article; please credit the source when reprinting.

Preface

I am currently building my company's internal messaging platform on Pulsar, so I naturally looked into its underlying storage. Pulsar uses BookKeeper as its storage layer, and BookKeeper in turn uses RocksDB to store the position index of each Entry (the unit of data storage in BookKeeper). RocksDB is a storage engine I have been following for a while: when I was evaluating persistent KV stores, I found that the mainstream open source options pika and kvrocks, as well as the cloud vendor's persistent KV service we finally chose, are all built on RocksDB. The well-known TiDB also uses RocksDB as its storage engine.

Out of curiosity, I started to study RocksDB. RocksDB is mostly used for low-level development, so unless you work on data storage middleware you rarely touch it day to day. I therefore decided to focus on the data structure design behind it: the LSM tree.

This article first introduces RocksDB's implementation of the LSM tree, then summarizes the design ideas of the LSM tree, compares them with the storage design of Lucene (the engine underneath Elasticsearch), and finally compares the LSM tree with the familiar B+ tree.

Introduction to LSM Tree

LSM tree stands for Log-Structured Merge-Tree. From the name you might expect a tree, but it is not one: the LSM tree is really a storage algorithm design made up of several cooperating parts. The structure was originally described by Patrick O'Neil et al. in 1996, and the terms SSTable and memtable come from Google's Bigtable paper.

The storage engine designed and implemented based on the LSM tree algorithm is called the LSM storage engine. LevelDB, RocksDB, Cassandra, and HBase all implement corresponding storage engines based on the LSM tree algorithm.

Let's use RocksDB's implementation to understand the design of the LSM tree. If you only want the summary of the design ideas, you can jump straight to the summary section near the end, which I think is worth reading on its own.

RocksDB LSM tree implementation

1. Core composition

First, let's look at the three basic building blocks of RocksDB: the memtable, the WAL, and the SSTable.

The following figure describes the core composition and key process steps (read & write & flush & compaction) of the RocksDB LSM tree.

1.1 memtable (active & immutable)

The memtable is an in-memory data structure in RocksDB that serves both reads and writes. Writes always go into the active memtable, and queries always check the memtable first, because it holds the most recent data. By default the memtable is implemented as a skiplist, which suits both range queries and inserts.

memtable life cycle

When an active memtable is full, it is switched to read-only and becomes an immutable memtable. A new active memtable is then created to accept writes.

Immutable memtables stay in memory until a background thread flushes them. A flush is triggered once the number of immutable memtables reaches min_write_buffer_number_to_merge. The flush merges the immutable memtables and writes the result to an L0 SST file on disk. The flushed memtables are then destroyed.

Related parameters:
write_buffer_size: the capacity of a single memtable
max_write_buffer_number: the maximum number of memtables held in memory
min_write_buffer_number_to_merge: the minimum number of memtables that are merged before being flushed to an SST file (if set to 1, each memtable is flushed on its own, with no merging during the flush)
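
The life cycle above can be sketched in a few lines of Python. This is only a toy model under my own assumptions, not RocksDB code: the class MemtableSet and its fields are hypothetical, a plain dict stands in for the skiplist, capacity is counted in entries rather than bytes, and the flush runs inline instead of on a background thread. Only the parameter names are borrowed from RocksDB.

```python
class MemtableSet:
    """Toy model of active/immutable memtables (hypothetical, not RocksDB code)."""

    def __init__(self, write_buffer_size=2, min_write_buffer_number_to_merge=2):
        self.write_buffer_size = write_buffer_size   # capacity, in entries here
        self.min_to_merge = min_write_buffer_number_to_merge
        self.active = {}       # a dict stands in for the skiplist
        self.immutable = []    # read-only memtables, oldest first
        self.l0_files = []     # each flush appends one sorted list of (key, value)

    def put(self, key, value):
        self.active[key] = value
        if len(self.active) >= self.write_buffer_size:
            # The full memtable becomes read-only; a fresh one takes new writes.
            self.immutable.append(self.active)
            self.active = {}
            if len(self.immutable) >= self.min_to_merge:
                self.flush()

    def flush(self):
        merged = {}
        for table in self.immutable:   # oldest first, so newer values win
            merged.update(table)
        self.l0_files.append(sorted(merged.items()))
        self.immutable = []            # flushed memtables are destroyed


db = MemtableSet()
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    db.put(k, v)
print(db.l0_files)   # [[('a', 3), ('b', 2), ('c', 4)]]
```

Two memtables fill up here before anything reaches "disk", and the merged L0 file keeps only the newer value of "a".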

1.2 WAL (write-ahead log)

Most readers will be familiar with the WAL: a persistent log file that is only appended to, which makes for disk-friendly sequential writes. Many storage systems have a similar design (for example MySQL's redo log/undo log and ZooKeeper's WAL).

Every time RocksDB writes data, it writes to the WAL first and then to the memtable. After a failure, the in-memory data is rebuilt by replaying the WAL, which keeps the data consistent.

What does this design buy us? It lets the LSM tree treat volatile memory as if it were persistent storage and trust the data held there.

As for when WAL files are created and deleted: every time a CF (column family, described below) is flushed, a new WAL is created. The old WAL is not deleted right away, because data from other CFs may not have reached the disk yet. A WAL is deleted only once every CF has flushed all of the data recorded in it.
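
To make the "log first, memory second, replay on restart" idea concrete, here is a small sketch. Everything in it is an assumption for illustration: the class TinyStore, the one-JSON-record-per-line log format, and the file name are mine, and RocksDB's real WAL format is different.

```python
import json
import os
import tempfile


class TinyStore:
    """Hypothetical store: every write goes to the log before the memtable."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.memtable = {}
        self._replay()                     # recover state after a restart
        self.wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as f:
            for line in f:
                key, value = json.loads(line)
                self.memtable[key] = value

    def put(self, key, value):
        self.wal.write(json.dumps([key, value]) + "\n")   # 1. log first
        self.wal.flush()
        self.memtable[key] = value                        # 2. then memory

    def close(self):
        self.wal.close()


path = os.path.join(tempfile.mkdtemp(), "demo.wal")
store = TinyStore(path)
store.put("k1", "v1")
store.put("k1", "v2")
store.close()                 # pretend the process died: the memtable is gone

recovered = TinyStore(path)   # replaying the log rebuilds it
print(recovered.memtable)     # {'k1': 'v2'}
recovered.close()
```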

1.3 SSTable (sorted string table)

An SSTable (Sorted String Table) lives on disk and is a persistent, ordered, immutable map. Keys and values are arbitrary byte strings. As mentioned above, a memtable in memory is flushed once the conditions are met, and the result of that flush is an SSTable.

The file structure of SSTable is as follows:

Each area of the file has its own role:

Data Block: stores the ordered key-value pairs and is the actual data of the SSTable. To save space, a data block does not store the complete key of every pair; it stores only the part not shared with the previous key, which avoids repeating common key prefixes (this kind of delta encoding is also common at the bottom layer of other storage middleware).
Meta Block: stores filter information that speeds up lookups in the SST. The filter is a Bloom filter, used to decide whether the requested data can exist in a given data block.
Meta Index Block: the index of the Meta Block. It holds a single record whose key is the name of the meta block (that is, the name of the filter) and whose value is the location of the meta block.
Index Block: stores the index information of the data blocks. It contains several records, each describing one data block.
Footer: records the location and size of the other sections. The footer has a fixed length, so a reader loads a fixed number of bytes from the end of the SSTable file to obtain it. The footer gives the locations of the Meta Index Block and the Index Block, which in turn lead to the Meta Block and the Data Blocks.
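
The delta encoding of keys mentioned for the Data Block can be illustrated like this. It is a simplified sketch: the function names are hypothetical, keys are Python strings rather than byte strings, and real block formats also store a full key every so often (a restart point) so that a block can be searched without decoding it from the start.

```python
def encode_keys(sorted_keys):
    """Return (shared_len, suffix) pairs for an already sorted list of keys."""
    encoded, prev = [], ""
    for key in sorted_keys:
        shared = 0
        while shared < min(len(prev), len(key)) and prev[shared] == key[shared]:
            shared += 1
        encoded.append((shared, key[shared:]))   # keep only the non-shared tail
        prev = key
    return encoded


def decode_keys(encoded):
    """Rebuild the full keys from the (shared_len, suffix) pairs."""
    keys, prev = [], ""
    for shared, suffix in encoded:
        prev = prev[:shared] + suffix
        keys.append(prev)
    return keys


keys = ["user:1000", "user:1001", "user:1002", "user:2000"]
encoded = encode_keys(keys)
print(encoded)   # [(0, 'user:1000'), (8, '1'), (8, '2'), (5, '2000')]
```

Because neighbouring keys in a sorted block tend to share long prefixes, most entries shrink to a length plus a short suffix.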

As you can see, besides the actual data, an SSTable file also carries an index structure and a filter to speed up queries. It is an elegant design.
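
The way a reader starts from the fixed-length footer and works backwards to the data can be sketched as follows. The layout is invented for illustration and is not the real SSTable format: blocks are JSON-encoded, the footer is just two 8-byte integers (index offset and index size), and the meta/filter blocks are left out.

```python
import json
import struct

FOOTER = struct.Struct("<QQ")   # (index offset, index size); hypothetical layout


def build_table(blocks):
    """blocks: list of non-empty dicts, each one data block, in key order."""
    out, index = b"", []
    for block in blocks:
        payload = json.dumps(block).encode()
        index.append((max(block), len(out), len(payload)))   # last key, offset, size
        out += payload
    index_bytes = json.dumps(index).encode()
    footer = FOOTER.pack(len(out), len(index_bytes))
    return out + index_bytes + footer


def lookup(table, key):
    # 1. Read the fixed-size footer from the end of the file.
    index_off, index_size = FOOTER.unpack(table[-FOOTER.size:])
    # 2. Use it to locate and load the index block.
    index = json.loads(table[index_off:index_off + index_size])
    # 3. Pick the first data block whose last key is >= the wanted key.
    for last_key, off, size in index:
        if key <= last_key:
            return json.loads(table[off:off + size]).get(key)
    return None


table = build_table([{"a": "1", "c": "3"}, {"m": "13", "z": "26"}])
print(lookup(table, "m"), lookup(table, "b"))   # 13 None
```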

2. Other terms

Column Family (CF)

RocksDB 3.0 added the Column Family feature. Every key-value pair is stored in a specific CF. For compatibility with earlier versions, RocksDB creates a CF named "default"; if no CF is specified when storing a key-value pair, it goes into the "default" CF.

RocksDB allows users to create multiple Column Families. Each Column Family has its own memtables and SST files, but they all share the same WAL. The benefit is that each Column Family can be configured according to the characteristics of the application without increasing the number of writes to the WAL.

By analogy with a relational database, a column family can be thought of as a table.

3. Read & Write

3.1 Read operation

Search the active memtable;
If the key is not in the active memtable, search the immutable memtables;
If it is not in the immutable memtables, search the L0 SSTables (RocksDB has to go through the L0 SSTables one by one, so it limits the number of L0 files to keep lookups efficient);
If it is still not found, search the SSTables of the remaining levels (from L1 downwards, the SSTables within a level do not overlap, so binary search quickly locates the SSTable that may hold the key).
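
The lookup order above can be sketched as a single function. This is a simplification under my own assumptions: the function get and its arguments are hypothetical, every memtable and SST "file" is a plain dict, and each level from L1 down is a list of non-empty files sorted by their largest key.

```python
from bisect import bisect_left


def get(key, active, immutables, l0_files, levels):
    """Hypothetical lookup following the four steps of the read path."""
    if key in active:
        return active[key]
    for table in reversed(immutables):     # newest immutable memtable first
        if key in table:
            return table[key]
    for sst in reversed(l0_files):         # L0 ranges overlap: check file by file
        if key in sst:
            return sst[key]
    for files in levels:                   # L1, L2, ...: one candidate per level
        max_keys = [max(f) for f in files]
        i = bisect_left(max_keys, key)     # first file whose largest key >= key
        if i < len(files) and key in files[i]:
            return files[i][key]
    return None


active = {"a": "new"}
immutables = [{"b": "imm"}]
l0_files = [{"a": "old", "c": "l0-old"}, {"c": "l0-new"}]
levels = [[{"d": "l1", "e": "l1"}, {"x": "l1"}]]

print([get(k, active, immutables, l0_files, levels) for k in "acxq"])
# ['new', 'l0-new', 'l1', None]
```

The search stops at the first hit, which is why newer data in the memtable or in a newer L0 file hides older values further down.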

Before searching an SSTable, RocksDB consults its Bloom filter to quickly determine whether the key can exist in that file, which avoids unnecessary I/O.
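
For readers who have not met a Bloom filter before, here is a minimal one. The bit-array size, the number of hash functions, and the use of SHA-256 are arbitrary choices for this sketch and say nothing about how RocksDB builds its filters.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; sizes and hashing scheme are illustrative only."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True only means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


bf = BloomFilter()
for key in ["apple", "banana", "cherry"]:
    bf.add(key)
print(bf.might_contain("banana"))   # True: added keys are never reported absent
```

A "no" answer lets the reader skip the file entirely; a "yes" answer can be a false positive, so the file still has to be checked.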

RocksDB also provides a read cache, the block cache, which keeps frequently accessed data blocks of SST files in memory. Two implementations are available out of the box: LRUCache and ClockCache.

3.2 Write operation

A write first goes to the WAL file, so that the data is not lost;
After the WAL write completes, the data is written to the active memtable in memory (to keep the keys ordered, RocksDB implements the memtable as a skip list);
When the memtable reaches a certain size, it is turned into an immutable memtable, and a new memtable is created to keep serving writes;
Once the flush conditions are met, the immutable memtables are merged and flushed to SST files on disk.

By the way, RocksDB writes to disk asynchronously by default: a write returns as soon as the data has been handed to the operating system's page cache, and the transfer from the cache to the disk happens later. Asynchronous writes can have more than a thousand times the throughput of synchronous writes. The downside is that if the machine or operating system crashes, the most recent writes still sitting in the operating system's cache may be lost. A crash of the RocksDB process alone does not lose data. Because machine and operating system crashes are relatively rare, asynchronous writes can be considered safe enough in most cases.
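
The difference between the two modes comes down to whether the writer waits for the operating system to push the data to stable storage. The sketch below shows this at the file level with Python's standard library; the helper append_record is hypothetical and this is not RocksDB's API (in RocksDB the choice is made through the sync flag of the write options).

```python
import os
import tempfile


def append_record(f, record, sync):
    f.write(record)
    f.flush()                  # hand the bytes to the operating system
    if sync:
        os.fsync(f.fileno())   # block until the OS reports them on stable storage


path = os.path.join(tempfile.mkdtemp(), "demo.log")
with open(path, "ab") as f:
    append_record(f, b"fast, but may be lost if the machine crashes\n", sync=False)
    append_record(f, b"durable once this call returns\n", sync=True)

with open(path, "rb") as f:
    lines = f.read().splitlines()
print(len(lines))   # 2
```

Both records survive a crash of the writing process, because they are already in the page cache; only the synced one is guaranteed to survive a power failure.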

4. Compaction

The LSM tree turns scattered random writes into batched sequential writes (WAL + memtable) to improve write performance. But this also brings some problems:

Read Amplification. As described under "Read operation" above, a single read may have to access many files;
Space Amplification. Because all writes are append-only rather than in-place updates, obsolete data is not cleaned up immediately.

The number of SST files therefore has to be kept in check. RocksDB runs compaction according to the configured compaction strategy. Compaction removes keys that are expired, marked as deleted, or superseded by a newer version, and reorganizes the data to improve query efficiency.

4.1 Level Style Compaction (default compaction style)

By default, RocksDB uses Level Style Compaction as the compaction strategy of the LSM tree.

With Level Style Compaction, L0 holds the most recent data and Lmax holds the oldest. The same key may appear in several L0 files, but within any other level a key appears in at most one file. Each compaction task picks a file from level Ln and the files in the adjacent level Ln+1 whose key ranges overlap with it, merges them, drops keys that are expired, marked as deleted, or duplicated, and places the merged output in level Ln+1.
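
One such compaction step can be sketched as a merge of one newer file with the older files it overlaps. The function compact, the TOMBSTONE marker, and the is_bottom_level flag are hypothetical names for this sketch, and files are plain dicts. Dropping delete markers only at the last level is my assumption here: a marker that disappeared earlier could let an older value in a deeper level become visible again.

```python
TOMBSTONE = object()   # hypothetical marker for "this key was deleted"


def compact(upper_file, lower_files, is_bottom_level=False):
    """Merge one Ln file (newer) with the overlapping Ln+1 files (older)."""
    merged = {}
    for sst in lower_files:     # older data first
        merged.update(sst)
    merged.update(upper_file)   # newer data overwrites older values
    if is_bottom_level:
        # No older version can exist below, so delete markers can be dropped.
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
    return sorted(merged.items(), key=lambda kv: kv[0])


ln_file = {"b": "b2", "c": TOMBSTONE}
ln1_files = [{"a": "a1", "b": "b1"}, {"c": "c1", "d": "d1"}]
print(compact(ln_file, ln1_files, is_bottom_level=True))
# [('a', 'a1'), ('b', 'b2'), ('d', 'd1')]
```

Five stored entries and one delete marker shrink to three live entries, which is exactly the space that compaction reclaims.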


Although compaction reduces read amplification (fewer SST files) and space amplification (invalid data is cleaned up), it introduces write amplification: the low-level I/O consumed by compaction can be much larger than the I/O issued by upper-level requests.

RocksDB also supports other Compaction strategies.

4.2 Universal Compaction

Only the L0 files are compacted, and the merged output is put back into L0;
The goal is lower write amplification, at the cost of higher read amplification and space amplification;

The specific algorithm is not covered in this article.

4.3 FIFO Compaction

As the name suggests, FIFO is first in, first out: this mode periodically deletes old data. In FIFO mode all SST files sit at L0, and when their total size exceeds compaction_options_fifo.max_table_files_size, the oldest SST file is deleted.
This mechanism is well suited to storing time series data.
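
The FIFO rule is simple enough to write down directly. In this sketch the function fifo_compact, the file names, and the sizes are made up; only the parameter name max_table_files_size comes from RocksDB.

```python
from collections import deque


def fifo_compact(sst_files, max_table_files_size):
    """sst_files: deque of (name, size), oldest first. Returns the dropped names."""
    dropped = []
    total = sum(size for _, size in sst_files)
    while sst_files and total > max_table_files_size:
        name, size = sst_files.popleft()   # the oldest file goes first
        total -= size
        dropped.append(name)
    return dropped


files = deque([("001.sst", 40), ("002.sst", 40), ("003.sst", 40)])
print(fifo_compact(files, max_table_files_size=100), list(files))
# ['001.sst'] [('002.sst', 40), ('003.sst', 40)]
```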

⭐ Summary of LSM tree design ideas

The design idea of the LSM tree is very interesting. I will summarize here.

The LSM tree converts random disk writes into disk-friendly sequential writes (on both mechanical disks and SSDs, random reads and writes are much slower than sequential ones), which greatly improves write performance.

So how is the conversion done? The core idea is to keep an ordered in-memory table (the memtable). When it grows past a threshold, it is flushed to disk in one batch, producing the newest SSTable file. Because the memtable already keeps its key-value pairs sorted by key, this step is efficient.

Every write is appended to the WAL before it goes into the memtable, so that after a failure the in-memory data can be restored by replaying the WAL, which keeps the database consistent.

In this append-only mode, deleting data means adding a delete marker, and updating data means writing a new value, so the database holds both the new and the old value of the same key at the same time. This effect is called Space Amplification.
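
A tiny append-only store makes the effect visible. Everything here is a hypothetical sketch: a Python list plays the role of the on-disk data, and DELETED is an invented delete marker.

```python
DELETED = object()   # hypothetical delete marker
log = []             # append-only storage: entries are never changed in place


def put(key, value):
    log.append((key, value))


def delete(key):
    log.append((key, DELETED))    # a delete is just another write


def get(key):
    for k, v in reversed(log):    # the newest entry for a key wins
        if k == key:
            return None if v is DELETED else v
    return None


put("a", 1)
put("a", 2)    # update: the old value 1 stays in the log
put("b", 3)
delete("b")    # delete: both the value 3 and the marker stay in the log

live = sum(1 for k in {k for k, _ in log} if get(k) is not None)
print(get("a"), get("b"), len(log), live)   # 2 None 4 1
```

Four entries are stored for a single live key, and that ratio is what space amplification measures.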

As data is written, there will be more and more underlying SSTable files.

In this mode, a read first looks for the key in memory and, if it is not found there, searches the SSTable files on disk from newest to oldest. To speed up this access pattern, storage engines usually add common read optimizations such as Bloom filters and a read cache.

This need for multiple reads to serve one request is called Read Amplification. Clearly, read amplification hurts the read performance of the LSM tree.

To improve read performance (read amplification) and storage usage (space amplification) at the same time, the LSM tree reduces the number of SSTable files by running a merge process that discards invalid (deleted or overwritten) old values. This process is called compaction.

However, compaction has a cost of its own. Over the lifetime of the database, one logical write actually causes several disk writes. This effect is called Write Amplification. In write-heavy applications, the bottleneck may be the rate at which the database can write to disk. In that case write amplification is a direct performance penalty: the more often the storage engine writes the same data to disk, the fewer application writes per second fit into the available disk bandwidth.
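
The relationship in the last sentence is plain arithmetic: if every byte written by the application ends up being written W times, the application can use at most 1/W of the disk bandwidth. The numbers below are hypothetical and chosen only to illustrate this.

```python
def max_user_write_rate(disk_bandwidth_mb_s, write_amplification):
    """Upper bound on application write throughput, in MB/s."""
    return disk_bandwidth_mb_s / write_amplification


# Hypothetical disk with 500 MB/s of write bandwidth.
print(max_user_write_rate(500, 1))    # 500.0
print(max_user_write_rate(500, 10))   # 50.0
```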

This is also what I see as a shortcoming of LSM storage engines: compaction may interfere with ongoing read and write requests. Although the storage engine tries to compact incrementally without affecting concurrent access, disk resources are limited, so a request can easily end up waiting while the disk finishes an expensive compaction. The impact on throughput and average response time is usually small, but at high percentiles (such as P99 response time) queries can occasionally take a long time.

The above is my personal summary of the LSM tree. The abstract descriptions that skip implementation details (such as the memtable and compaction) all have concrete counterparts in real storage engines. The details may differ, but the design ideas are similar.

In RocksDB specifically, write amplification, read amplification, and space amplification behave somewhat like the three properties in the CAP theorem: they cannot all be optimized at once. This is why RocksDB exposes so many tuning parameters to fit different application scenarios, and much of that tuning is a trade-off among these three amplification factors.

LSM-like design ideas in Elasticsearch Lucene

The segment design of Lucene, the search library underneath ES, is very similar to the LSM tree. It also uses ideas such as a WAL, an in-memory buffer, and segment merging.

After a document is indexed, it will be added to the memory buffer and appended to the translog.

When the shard is refreshed, the documents in the memory buffer are written to a new segment, the segment is opened so that it becomes searchable, and the memory buffer is emptied.

As data keeps being written, a commit is eventually triggered to perform a full commit, after which the corresponding translog is deleted. (The details are omitted here for brevity.)

Until a refresh or commit turns it into a segment, the data lives only in memory and cannot be searched. This is why Lucene is said to provide near-real-time rather than real-time search.

A document written to a segment cannot be modified, but it can be deleted. Deletion does not change the segment file in place; instead, a separate file records the DocIDs of the deleted documents, so that the data file itself is never modified. A query has to search multiple segments, merge the results, and filter out deleted documents. To keep queries fast, Lucene merges segments according to a merge policy, which is similar to compaction of SSTables in an LSM tree.

This mechanism avoids random writes: data is written in batches and by appending, which yields high throughput. In addition, the file system cache and memory are used to speed up writes, and the log guards against data loss.
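
The interplay of immutable segments, a separate list of deleted DocIDs, and merging can be sketched as follows. This is not Lucene's API: the functions search and merge_segments are hypothetical, segments are tuples of (doc_id, text), and matching is a naive whole-word test.

```python
def search(segments, deleted_ids, term):
    """Query every segment, then filter out documents marked as deleted."""
    hits = []
    for segment in segments:
        hits += [doc_id for doc_id, text in segment if term in text.split()]
    return [doc_id for doc_id in hits if doc_id not in deleted_ids]


def merge_segments(segments, deleted_ids):
    """Rewrite several segments into one, physically dropping deleted documents."""
    return [(doc_id, text) for seg in segments for doc_id, text in seg
            if doc_id not in deleted_ids]


segments = [((1, "lsm tree"), (2, "b tree")), ((3, "lsm engine"),)]
deleted_ids = {1}   # kept apart; the segments themselves stay untouched

print(search(segments, deleted_ids, "lsm"))    # [3]
print(merge_segments(segments, deleted_ids))   # [(2, 'b tree'), (3, 'lsm engine')]
```

Before the merge, document 1 still occupies space and has to be filtered out on every query; after the merge it is gone for good, just as obsolete keys disappear during compaction.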

⭐ LSM tree vs B+ tree

Having covered all that, let's compare the LSM tree with the familiar B+ tree.

Different design concepts

Like LSM trees, B+ trees keep key-value pairs sorted by key (which allows efficient key lookups and range queries), but the design concepts of the two are completely different.

The LSM tree breaks the database into variable-sized segments, usually several megabytes or more in size, and always writes the segments sequentially.
In contrast, the B+ tree breaks the database into fixed-size blocks or pages, which are traditionally 4KB in size (sometimes larger), and only one page can be read or written at a time. This design is closer to the underlying hardware, because the disks are also arranged in fixed-size blocks.

Data update and deletion

The B+ tree can update and delete data in place (in-place update). This makes transaction support easier, because each key exists in exactly one page;
The LSM tree, by contrast, only supports out-of-place updates, and the key ranges of the L0 SSTables overlap, so its transaction support is weaker, and old data is physically updated or deleted only during compaction.

Performance

The strength of the LSM tree is high write throughput (a write can be regarded as O(1)), a property that matters a lot in distributed systems. Reads, on the other hand, are O(N) in a plain LSM tree structure and can reach O(logN) once indexes or caches are used.
The strength of the B+ tree is efficient reads (a stable O(logN)). Under heavy write load (each write is O(logN)), however, it becomes relatively inefficient: as inserts proceed, nodes keep splitting and merging to maintain the tree structure, which increases the share of random disk reads and writes and degrades performance.

Generally, we say that the LSM tree writes faster than the B+ tree and the B+ tree reads faster than the LSM tree. But do not overlook the write amplification of the LSM tree; performance judgments need to weigh both sides.

References

"Designing Data-intensive Applications" Chapter 3: Storage and Retrieval (a book strongly recommended!!!)
https://github.com/facebook/rocksdb/wiki
https://www.lxkaka.wang/rocksdb-lsm/
http://alexstocks.github.io/html/rocksdb.html
https://cloud.tencent.com/developer/article/1441835
https://www.elastic.co/guide/cn/elasticsearch/guide/current/translog.html
