Here is a simple exploration of the LevelDB architecture and its implementation, based on the source code of LevelDB (Go version).
Overall structure
log
LevelDB does not write data straight to disk; a write goes to memory first. If the leveldb process crashed or the host went down before that in-memory data was persisted, the user's writes would be lost. So before writing to memory, leveldb first appends every write operation to a log file. When the process fails, the log can be replayed to recover. Each log corresponds to a memdb or frozen memdb; once a frozen memdb has been persisted to an sstable, its log is deleted.
version
This concept is not shown in the figure above, but it is an important one.
First, look at the version structure in the code:
```go
type version struct {
	id int64 // unique monotonous increasing version id

	s *session

	levels []tFiles

	// Level that should be compacted next and its compaction score.
	// Score < 1 means compaction is not strictly needed. These fields
	// are initialized by computeCompaction()
	cLevel int
	cScore float64

	cSeek unsafe.Pointer

	closing  bool
	ref      int
	released bool
}
```
```go
// tFiles hold multiple tFile.
type tFiles []*tFile

// tFile holds basic information about a table.
type tFile struct {
	fd         storage.FileDesc
	seekLeft   int32
	size       int64
	imin, imax internalKey
}
```
The levels field of the version structure is a two-dimensional slice: the index is the level, and the value is the set of all files at that level.
This is also the origin of the name LevelDB.
manifest
From the definition of version above, we can see that version records in memory every sst file (LevelDB's storage file) and the level it belongs to. But version is only an in-memory structure: if the program restarts, the version information is gone. The main job of the manifest is therefore to persist the version information. There are generally two ways to record it, incremental and full; the manifest uses a mixture of the two.
Before continuing with what the manifest stores, let's first introduce a data structure:
```go
type sessionRecord struct {
	hasRec   int
	comparer string
	// file number of the journal (log) file
	journalNum int64
	// deprecated
	prevJournalNum int64
	// file number of the next sst file
	nextFileNum int64
	// sequence number of the log; incremented by 1 for every log entry
	seqNum   uint64
	compPtrs []cpRecord
	// tables added in this compaction
	addedTables []atRecord
	// tables deleted in this compaction
	deletedTables []dtRecord

	scratch [binary.MaxVarintLen64]byte
	err     error
}
```
```go
type atRecord struct {
	level int
	num   int64
	size  int64
	imin  internalKey
	imax  internalKey
}

type dtRecord struct {
	level int
	num   int64
}
```
sessionRecord records the tables added and deleted in a compaction, the level of each added table, and other bookkeeping; a version can likewise be mapped to a sessionRecord. When the program restarts, it reads the old manifest file, builds a version from it, writes that version into the manifest as a full sessionRecord, and then appends an incremental sessionRecord after each subsequent compaction.
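As a hedged illustration of the incremental recording, the sketch below reduces each table to its bare file number and uses a hypothetical applyRecord helper (goleveldb's real logic lives in its version staging code): rebuilding a version amounts to folding sessionRecords one after another.

```go
// applyRecord is a hypothetical helper: it applies one sessionRecord delta
// to per-level lists of file numbers, yielding the next version's levels.
func applyRecord(levels [][]int64, rec *sessionRecord) [][]int64 {
	for _, d := range rec.deletedTables { // drop the deleted tables
		files := levels[d.level]
		for i, num := range files {
			if num == d.num {
				levels[d.level] = append(files[:i], files[i+1:]...)
				break
			}
		}
	}
	for _, a := range rec.addedTables { // register the added tables
		for len(levels) <= a.level { // a compaction may create a new level
			levels = append(levels, nil)
		}
		levels[a.level] = append(levels[a.level], a.num)
	}
	return levels
}
```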
current
Points to the manifest file that is currently in use.
memdb
The memory table. When users add or delete data, files are not manipulated directly; the operation is written to the log and into the memory table at the same time. When the memtable reaches a certain threshold (configurable by the user), it no longer accepts writes and is converted into a frozen memdb, which is then compacted. The underlying data structure of memdb is a skip list.
frozen memdb
When a memdb reaches the configured threshold, it is converted into a frozen memdb. This is only a logical notion: its structure is identical to memdb, it just can no longer be written. Converting a memdb into a frozen memdb triggers a background compaction that turns it into a file on disk.
sstable
sstables are the files LevelDB finally lands on disk. They are merged periodically, i.e. compacted, and are logically divided into several levels: files dumped directly from a frozen memdb land on level 0, level-0 files move to level 1 after compaction, and so on. Note that keys may overlap between files at level 0, but never between files within any level above 0.
Read and write operations
Write
A write operation only touches the journal (log) and the memdb, as sketched after the list below:
- When a write starts, first check whether write merging is enabled. Write merging combines requests from multiple goroutines into one goroutine's operation to improve write efficiency
- If write merging is enabled and another goroutine already holds the write lock, hand your data over to that goroutine and wait for it to return the write result
- If write merging is enabled and this goroutine obtains the write lock, check whether other goroutines are waiting so their data can ride along and be written in the same pass
- The data is put into a batch structure, which is allocated for each batch operation (so whether you write a single entry or a batch, the final operation is always a batch write). Then check whether the remaining space of the memdb is enough for the current data; if not, freeze the current memdb, allocate a new one, and let the frozen memdb undergo asynchronous compaction
- Then write the log
- After the log is written, write to memdb
- If write merging is enabled, notify the other merged goroutines of the write result
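To make the ordering concrete, here is a deliberately tiny, hypothetical model of the write path (all names are illustrative; goleveldb's real implementation is in db_write.go, and write merging is omitted for brevity):

```go
package main

import "fmt"

// memDB is a toy stand-in for the memtable; the real one is a skip list.
type memDB struct {
	data      map[string]string
	size, cap int
}

func newMemDB(cap int) *memDB    { return &memDB{data: map[string]string{}, cap: cap} }
func (m *memDB) free() int       { return m.cap - m.size }
func (m *memDB) put(k, v string) { m.data[k] = v; m.size += len(k) + len(v) }

type db struct {
	journal []string // stand-in for the on-disk log file
	mem     *memDB
	frozen  *memDB
}

func (d *db) put(k, v string) {
	// 1. Not enough room? Freeze the memdb and start a fresh one; the frozen
	//    memdb would be compacted to a level-0 sstable in the background.
	if d.mem.free() < len(k)+len(v) {
		d.frozen, d.mem = d.mem, newMemDB(d.mem.cap)
		// go d.compactMemTable(d.frozen) // async compaction, elided
	}
	// 2. Log first, so the write can be replayed after a crash...
	d.journal = append(d.journal, k+"="+v)
	// 3. ...and only then apply it to the memdb.
	d.mem.put(k, v)
}

func main() {
	d := &db{mem: newMemDB(8)}
	for _, kv := range [][2]string{{"k1", "v1"}, {"k2", "v2"}, {"k3", "v3"}} {
		d.put(kv[0], kv[1])
	}
	fmt.Println(d.journal, d.frozen != nil) // all 3 writes logged; memdb frozen once
}
```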
Read
First, let's lay out the memdb structure:
```go
type DB struct {
	cmp comparer.BasicComparer
	rnd *rand.Rand

	mu sync.RWMutex
	// where the kv data is stored
	kvData []byte
	// Node data:
	// [0]         : KV offset
	// [1]         : Key length
	// [2]         : Value length
	// [3]         : Height
	// [3..height] : Next nodes
	// the skip list
	nodeData  []int
	prevNode  [tMaxHeight]int
	maxHeight int
	n         int
	kvSize    int
}
```
The read logic is relatively simple: first look in memdb and frozen memdb, then fall back to the files. Some details are still fairly subtle though; the main one is what an ikey is.
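The ikey (internal key) is the user key followed by an 8-byte trailer that packs the sequence number together with the entry type (value or delete). A sketch of the encoding, following LevelDB's key format:

```go
import "encoding/binary"

// makeInternalKey appends the 8-byte trailer (sequence << 8 | type) to the
// user key; type 1 marks a value, type 0 a delete.
func makeInternalKey(ukey []byte, seq uint64, keyType byte) []byte {
	ikey := append(append([]byte(nil), ukey...), make([]byte, 8)...)
	binary.LittleEndian.PutUint64(ikey[len(ukey):], seq<<8|uint64(keyType))
	return ikey
}
```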
The picture above is an example of kv storage: if two pairs k1=v1 and k2=v2 are stored in memory, the bytes in memdb look roughly as follows
Then the general structure of the snapshot is as follows
After seeing the kv storage structure, it may feel a little odd: keys and values are packed back to back, so how do we tell which bytes are a key and which are a value? That is what the nodeData field in the DB structure is for; it serves as the skip list and also records the length of each kv
With this, how memdb finds data in skip-list fashion based on nodeData and kvData becomes clearer
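As a sketch, recovering one node's kv from those two slices looks like this (nodeKV is a hypothetical helper, not goleveldb's API):

```go
// nodeKV slices a node's key and value out of kvData using the per-node
// metadata laid out in nodeData: [node] kv offset, [node+1] key length,
// [node+2] value length.
func nodeKV(kvData []byte, nodeData []int, node int) (key, value []byte) {
	o, kl, vl := nodeData[node], nodeData[node+1], nodeData[node+2]
	return kvData[o : o+kl], kvData[o+kl : o+kl+vl]
}
```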
Besides memdb, the sstables are searched as well; this is covered in detail later.
Log read and write
Log structure
The following figure shows the data layout of a chunk
(Figure: Log read and write-chunk structure)
And a log structure is composed of multiple blocks
(Figure: Log Read and Write-Log Structure)
The data field in (Figure: Log read and write - chunk structure) holds the batch1...N records
checksum: used to verify data integrity
chunk type: one of four types, first/middle/last/full, indicating the completeness of a chunk; full means a self-contained chunk
If a piece of data fits completely in one chunk, that chunk's type is full
If a piece of data is large and spans several chunks, the first chunk's type is first, the last one's is last, and any in between are middle
length: the length of the data stored in the chunk
Log write
Log writing is relatively simple. Internally the program encapsulates a singleWriter, which is responsible for one write and splits large data across blocks.
When writing the log, it first checks whether the block's remaining space can hold a header: if not, the rest of the block is zero-padded; if so, the write proceeds there. The chunk type is determined by whether the current block can hold all of the data: if it fits, the chunk is full; if not, the first chunk is first, followed by middle/last chunks.
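A hedged sketch of this chunking, assuming LevelDB's standard 32 KiB block and 7-byte chunk header of 4-byte checksum, 2-byte data length, and 1-byte type (writeRecord and its checksum parameter are illustrative, not goleveldb's actual journal code):

```go
import "encoding/binary"

const (
	blockSize  = 32 * 1024
	headerSize = 7 // 4-byte checksum + 2-byte length + 1-byte chunk type
)

const (
	fullChunk byte = iota + 1
	firstChunk
	middleChunk
	lastChunk
)

// writeRecord appends one record to buf, splitting it into chunks and
// zero-padding any block tail too small to hold a header.
func writeRecord(buf, data []byte, checksum func([]byte) uint32) []byte {
	first := true
	for {
		if rest := blockSize - len(buf)%blockSize; rest < headerSize {
			buf = append(buf, make([]byte, rest)...) // pad with zeros
		}
		avail := blockSize - len(buf)%blockSize - headerSize
		n, last := len(data), true
		if n > avail {
			n, last = avail, false
		}
		var typ byte
		switch {
		case first && last:
			typ = fullChunk
		case first:
			typ = firstChunk
		case last:
			typ = lastChunk
		default:
			typ = middleChunk
		}
		var hdr [headerSize]byte
		binary.LittleEndian.PutUint32(hdr[0:4], checksum(data[:n]))
		binary.LittleEndian.PutUint16(hdr[4:6], uint16(n))
		hdr[6] = typ
		buf = append(append(buf, hdr[:]...), data[:n]...)
		data, first = data[n:], false
		if last {
			return buf
		}
	}
}
```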
Log reading
Log reading is the write path in reverse. The only thing to note: when reading a chunk, verify the checksum and inspect the chunk type. If the chunk type is not full or last, this is not the final chunk and the data is incomplete, so keep reading until the record is complete.
sstable read
sstable structure
When the data of a frozen memdb is persisted to a file, it is organized according to a fixed structure; this storage structure of leveldb is called an sstable.
The data structure of an sstable is as follows
data block: stores the kv pairs
filter block: when filtering is enabled, stores the filter built from the keys. LevelDB only provides a Bloom filter, and it is off by default; when off, nothing is stored here
meta index block: stores the index information for the filter block
index block: stores the index information for each data block
footer: stores the index information of the index block and of the meta index block, plus a magic number - "\x57\xfb\x80\x8b\x24\x75\x47\xdb"
Note: each block stores its compression type and CRC check information, except the filter block, which is neither compressed nor checksummed
footer
The footer is 48 bytes long, a constant. It stores the offset and length of the meta index block and of the index block within the file, and its last 8 bytes hold the magic number.
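A sketch of decoding that footer (blockHandle and decodeFooter are hypothetical names; the two handles are varint offset/length pairs, as in LevelDB's block handle encoding):

```go
import "encoding/binary"

// blockHandle locates a block inside the sstable file.
type blockHandle struct{ offset, length uint64 }

// decodeFooter reads the meta index handle, the index handle, and checks the
// magic number in the last 8 of the footer's 48 bytes.
func decodeFooter(footer []byte) (metaIndex, index blockHandle, ok bool) {
	const magic = "\x57\xfb\x80\x8b\x24\x75\x47\xdb"
	if len(footer) != 48 || string(footer[40:]) != magic {
		return metaIndex, index, false
	}
	mo, n1 := binary.Uvarint(footer)
	ml, n2 := binary.Uvarint(footer[n1:])
	io, n3 := binary.Uvarint(footer[n1+n2:])
	il, _ := binary.Uvarint(footer[n1+n2+n3:])
	return blockHandle{mo, ml}, blockHandle{io, il}, true
}
```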
meta index block
The meta index block records the offset and length of the filter block
index block
The index block is analogous to the meta index block: each record stores the max key of a data block together with that block's index information, which is used to quickly locate the target data block.
filter block
The filter block is the filter itself, which is off by default in LevelDB. When off, the filter block holds no data; when on, a filter can be chosen, though only the Bloom filter is implemented in LevelDB.
The structure of the filter block is as above
base lg: the size of the data range covered by one Bloom filter; the default value is 11, so each filter covers 2^11 bytes = 2 KiB
filter offsets' offset: the position of the first filter offset, used to separate the filter data from the offset array
filter n offset: although filter offset n points to filter data n, a lookup does not actually read the offsets' offset first and then walk the array; in practice, since base lg fixes the range covered by each filter, a direct shift operation locates the right filter
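A one-line sketch of that shift trick (filterIndex is a hypothetical helper name):

```go
// With base lg = 11 each filter covers a 2 KiB range of the file, so the
// filter responsible for a data block is found by shifting its offset.
func filterIndex(blockOffset uint64, baseLg uint) uint64 {
	return blockOffset >> baseLg // e.g. offset 5000 >> 11 = filter #2
}
```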
data block
Let's first look at the structure of the data block
In this structure there are a couple of unfamiliar faces: entry and restart pointer
First you need to understand how kv pairs are stored in the data block
An entry is the storage structure of a kv pair in the data block, and can be expressed as
shared key length: the length of the prefix shared with the previous key. For example, if the previous key is test1 and the current key is test2, then the shared part is test (shared key length = 4) and the unshared part is 2
As you can see, to save storage space LevelDB does not store kv pairs the way memdb does; keys are prefix-compressed, which is also what gives rise to the restart pointer
restart pointer: the position of each entry that stores a complete key
Restart pointers exist to find keys quickly. When looking up a key, we first use the restart pointers to locate complete keys and compare, to decide which pair of restart pointers the key falls between; if it does, we traverse only the entries between those two points. Without restart pointers we would have to decode and check entries one by one, so the pointers improve efficiency considerably
For example 🌰:
```
restart_interval=2  // set one restart pointer every 2 entries
entry1: key=deck,value=v1
entry2: key=dock,value=v2
entry3: key=duck,value=v3
```
The stored data structure is as follows
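As a hedged sketch of how those three entries are laid out (appendEntry is illustrative, not LevelDB's real encoder, though the varint field order follows the entry structure above):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// sharedLen returns the length of the common prefix of a and b.
func sharedLen(a, b []byte) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// appendEntry encodes one entry: shared key length, unshared key length,
// value length (all varints), then the unshared key bytes and the value.
func appendEntry(buf, prevKey, key, value []byte, restart bool) []byte {
	shared := 0
	if !restart { // an entry at a restart point stores the full key
		shared = sharedLen(prevKey, key)
	}
	buf = binary.AppendUvarint(buf, uint64(shared))
	buf = binary.AppendUvarint(buf, uint64(len(key)-shared))
	buf = binary.AppendUvarint(buf, uint64(len(value)))
	buf = append(buf, key[shared:]...)
	return append(buf, value...)
}

func main() {
	var block []byte
	block = appendEntry(block, nil, []byte("deck"), []byte("v1"), true) // restart point
	block = appendEntry(block, []byte("deck"), []byte("dock"), []byte("v2"), false)
	block = appendEntry(block, []byte("dock"), []byte("duck"), []byte("v3"), true) // restart_interval=2
	fmt.Printf("%q\n", block)
}
```

With deck/dock/duck only the first byte d is shared, so compression saves little here; with realistic keys like user_0001, user_0002 the savings are substantial.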
Read operation
After understanding the above data structures, the sstable read flow can be roughly sorted out
When searching for a key in the sstable files, there is a small difference between level-0 files and non-level-0 files:
Level-0 files are allowed to overlap, while non-level-0 files are not
Level-0 files 1..N are not guaranteed to be mutually ordered, while non-level-0 files 1..N are: at a level above 0, file 1 may hold keys 0-20 and file 2 keys 20-40, whereas at level 0 file 1 may hold keys 0-20 and file 2 keys 10-30
These two differences lead to the difference in how level-0 and non-level-0 files are searched
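A sketch of the two strategies, simplified to plain byte-slice keys (fileRange and candidates are hypothetical; the real code works over tFiles and internalKey with a comparator):

```go
import (
	"bytes"
	"sort"
)

// fileRange is a toy stand-in for tFile's imin/imax pair.
type fileRange struct{ min, max []byte }

// candidates returns the files that may contain key at the given level.
func candidates(files []fileRange, level int, key []byte) []fileRange {
	if level == 0 {
		// level 0 may overlap: every file covering the key must be consulted
		var hits []fileRange
		for _, f := range files {
			if bytes.Compare(key, f.min) >= 0 && bytes.Compare(key, f.max) <= 0 {
				hits = append(hits, f)
			}
		}
		return hits
	}
	// levels > 0 are sorted and disjoint: binary-search the first file whose
	// max key is >= key, then check the min bound
	i := sort.Search(len(files), func(i int) bool {
		return bytes.Compare(files[i].max, key) >= 0
	})
	if i < len(files) && bytes.Compare(key, files[i].min) >= 0 {
		return files[i : i+1]
	}
	return nil
}
```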
compaction
There are two types of compaction
- memory compaction: persist the data of a frozen memdb into level-0 files
- table compaction: merge the files of level N into level N+1
memory compaction
Memory compaction is relatively simple: it iterates over the memdb data and writes it into an sstable sequentially, adding some auxiliary data. In essence it is the persistence of in-memory data; each memory compaction adds one new level-0 file.
Memory compaction is highly time-sensitive and needs to complete as quickly as possible, so its priority is higher than table compaction's; table compaction must yield to memory compaction.
table compaction
Table compaction is a bit more complicated than memory compaction, because it has to answer two questions:
- how to decide which level to compact
- how to decide which files at that level to compact
To answer these, compaction triggers are designed, and compaction starts when one of the following conditions is met (a score sketch follows the list):
- When the number of files at level 0 exceeds the predetermined upper limit (default is 4)
- When the total size of level i files exceeds (10 ^ i) MB
- When a file is read invalidly (a seek miss) too many times
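A sketch of how the first two triggers reduce to a score, mirroring the cScore/cLevel fields seen in the version struct earlier (compactionScore is a hypothetical stand-in for goleveldb's computeCompaction; the invalid-read trigger is tracked separately, via cSeek and the per-file seekLeft counter):

```go
import "math"

// compactionScore returns >= 1 when the level should be compacted.
func compactionScore(level, numFiles int, totalSize int64) float64 {
	if level == 0 {
		// level 0 is scored by file count, because its files may overlap
		// and every extra file slows reads down
		return float64(numFiles) / 4 // default upper limit: 4 files
	}
	limit := math.Pow(10, float64(level)) * 1024 * 1024 // (10^i) MB
	return float64(totalSize) / limit
}
```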
Once the level to compact is determined, the next step is to pick the sstable(s) at that level. The main logic is:
- If it is a level-0 compaction, select the oldest sstable, namely tFiles[0]
- If it is not a level-0 compaction and it was triggered by too many invalid reads, select that sstable
- If it is neither a level-0 compaction nor caused by too many invalid reads, select the first sstable whose keys come after the maxKey of the previous compaction at this level
compaction process
Level 0 compaction
Because level 0 allows keys to overlap, its compaction is special: after an sstable is chosen, it first traverses all sstables at the current level and collects every one whose keys overlap, forming a new, larger [minKey, maxKey] range, and then goes to level 1 to look for overlapping files.
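A sketch of that expansion loop, reusing the fileRange type from the lookup sketch above (byte-slice keys, hypothetical helper name):

```go
import "bytes"

// expandLevel0 keeps widening [min, max] until no level-0 file both overlaps
// the range and extends beyond it, then returns the final range.
func expandLevel0(files []fileRange, min, max []byte) ([]byte, []byte) {
	for changed := true; changed; {
		changed = false
		for _, f := range files {
			if bytes.Compare(f.max, min) < 0 || bytes.Compare(f.min, max) > 0 {
				continue // no overlap with the current range
			}
			if bytes.Compare(f.min, min) < 0 {
				min, changed = f.min, true
			}
			if bytes.Compare(f.max, max) > 0 {
				max, changed = f.max, true
			}
		}
	}
	return min, max
}
```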
Level N compaction
Level N (N > 0) has no key overlap, so a level-N compaction does not need to traverse all sstables at the current level to expand the key range that must be searched.