Here is a simple exploration of the LevelDB architecture and its implementation, based on the source code of LevelDB (Go version).
Overall structure
log
LevelDB does not write data straight to disk; a write goes to memory first. If the leveldb process crashed or the host went down before that in-memory data was persisted, the user's writes would be lost. So before writing to memory, leveldb first appends every write operation to a log file. When the process fails, the log can be replayed to recover. Each log corresponds to a memdb or frozen memdb; once a frozen memdb has been persisted to an sstable, its log is deleted.
version
This concept is not shown in the figure above, but it is an important one.
First, look at the version structure in the code:
```go
type version struct {
	id int64 // unique monotonous increasing version id

	s *session

	levels []tFiles

	// Level that should be compacted next and its compaction score.
	// Score < 1 means compaction is not strictly needed. These fields
	// are initialized by computeCompaction()
	cLevel int
	cScore float64

	cSeek unsafe.Pointer

	closing  bool
	ref      int
	released bool
}
```
```go
// tFiles hold multiple tFile.
type tFiles []*tFile

// tFile holds basic information about a table.
type tFile struct {
	fd         storage.FileDesc
	seekLeft   int32
	size       int64
	imin, imax internalKey
}
```
The levels field of the version structure is a two-dimensional slice: the index is the level, and the value is the set of all files at that level.
This is also the origin of the name LevelDB.
manifest
From the definition of version above, we can see that version records in memory every sst file (LevelDB's storage file) and the level it belongs to. But version is only an in-memory structure: if the program restarts, the version information is gone. The main job of the manifest is therefore to persist the version information. There are generally two ways to record it, incremental and full; the manifest uses a mixture of the two.
Before continuing with what the manifest stores, let's first introduce a data structure:
```go
type sessionRecord struct {
	hasRec   int
	comparer string
	// file number of the journal (log) file
	journalNum int64
	// deprecated
	prevJournalNum int64
	// file number of the next sst file
	nextFileNum int64
	// sequence number of the log; incremented by 1 for every log entry
	seqNum   uint64
	compPtrs []cpRecord
	// tables added in this compaction
	addedTables []atRecord
	// tables deleted in this compaction
	deletedTables []dtRecord

	scratch [binary.MaxVarintLen64]byte
	err     error
}
```
```go
type atRecord struct {
	level int
	num   int64
	size  int64
	imin  internalKey
	imax  internalKey
}

type dtRecord struct {
	level int
	num   int64
}
```
sessionRecord records the tables added and deleted in a compaction, the level of each added table, and other bookkeeping; a version can likewise be mapped to a sessionRecord. When the program restarts, it reads the old manifest file, builds a version from it, writes that version into the manifest as a full sessionRecord, and then appends an incremental sessionRecord after each subsequent compaction.
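As a hedged illustration of the incremental recording, the sketch below reduces each table to its bare file number and uses a hypothetical applyRecord helper (goleveldb's real logic lives in its version staging code): rebuilding a version amounts to folding sessionRecords one after another.

```go
// applyRecord is a hypothetical helper: it applies one sessionRecord delta
// to per-level lists of file numbers, yielding the next version's levels.
func applyRecord(levels [][]int64, rec *sessionRecord) [][]int64 {
	for _, d := range rec.deletedTables { // drop the deleted tables
		files := levels[d.level]
		for i, num := range files {
			if num == d.num {
				levels[d.level] = append(files[:i], files[i+1:]...)
				break
			}
		}
	}
	for _, a := range rec.addedTables { // register the added tables
		for len(levels) <= a.level { // a compaction may create a new level
			levels = append(levels, nil)
		}
		levels[a.level] = append(levels[a.level], a.num)
	}
	return levels
}
```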
current
Points to the manifest file that is currently in use.
memdb
The memory table. When users add or delete data, files are not manipulated directly; the operation is written to the log and into the memory table at the same time. When the memtable reaches a certain threshold (configurable by the user), it no longer accepts writes and is converted into a frozen memdb, which is then compacted. The underlying data structure of memdb is a skip list.
frozen memdb
When a memdb reaches the configured threshold, it is converted into a frozen memdb. This is only a logical notion: its structure is identical to memdb, it just can no longer be written. Converting a memdb into a frozen memdb triggers a background compaction that turns it into a file on disk.
sstable
sstables are the files LevelDB finally lands on disk. They are merged periodically, i.e. compacted, and are logically divided into several levels: files dumped directly from a frozen memdb land on level 0, level-0 files move to level 1 after compaction, and so on. Note that keys may overlap between files at level 0, but never between files within any level above 0.
Read and write operations
Write
A write operation only touches the journal (log) and the memdb, as sketched after the list below:
- When a write starts, first check whether write merging is enabled. Write merging combines requests from multiple goroutines into one goroutine's operation to improve write efficiency
- If write merging is enabled and another goroutine already holds the write lock, hand your data over to that goroutine and wait for it to return the write result
- If write merging is enabled and this goroutine obtains the write lock, check whether other goroutines are waiting so their data can ride along and be written in the same pass
- The data is put into a batch structure, which is allocated for each batch operation (so whether you write a single entry or a batch, the final operation is always a batch write). Then check whether the remaining space of the memdb is enough for the current data; if not, freeze the current memdb, allocate a new one, and let the frozen memdb undergo asynchronous compaction
- Then write the log
- After the log is written, write to memdb
- If write merging is enabled, notify the other merged goroutines of the write result
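To make the ordering concrete, here is a deliberately tiny, hypothetical model of the write path (all names are illustrative; goleveldb's real implementation is in db_write.go, and write merging is omitted for brevity):

```go
package main

import "fmt"

// memDB is a toy stand-in for the memtable; the real one is a skip list.
type memDB struct {
	data      map[string]string
	size, cap int
}

func newMemDB(cap int) *memDB    { return &memDB{data: map[string]string{}, cap: cap} }
func (m *memDB) free() int       { return m.cap - m.size }
func (m *memDB) put(k, v string) { m.data[k] = v; m.size += len(k) + len(v) }

type db struct {
	journal []string // stand-in for the on-disk log file
	mem     *memDB
	frozen  *memDB
}

func (d *db) put(k, v string) {
	// 1. Not enough room? Freeze the memdb and start a fresh one; the frozen
	//    memdb would be compacted to a level-0 sstable in the background.
	if d.mem.free() < len(k)+len(v) {
		d.frozen, d.mem = d.mem, newMemDB(d.mem.cap)
		// go d.compactMemTable(d.frozen) // async compaction, elided
	}
	// 2. Log first, so the write can be replayed after a crash...
	d.journal = append(d.journal, k+"="+v)
	// 3. ...and only then apply it to the memdb.
	d.mem.put(k, v)
}

func main() {
	d := &db{mem: newMemDB(8)}
	for _, kv := range [][2]string{{"k1", "v1"}, {"k2", "v2"}, {"k3", "v3"}} {
		d.put(kv[0], kv[1])
	}
	fmt.Println(d.journal, d.frozen != nil) // all 3 writes logged; memdb frozen once
}
```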
Read
First, let's lay out the memdb structure:
```go
type DB struct {
	cmp comparer.BasicComparer
	rnd *rand.Rand

	mu sync.RWMutex
	// where the kv data is stored
	kvData []byte
	// Node data:
	// [0]         : KV offset
	// [1]         : Key length
	// [2]         : Value length
	// [3]         : Height
	// [3..height] : Next nodes
	// the skip list
	nodeData  []int
	prevNode  [tMaxHeight]int
	maxHeight int
	n         int
	kvSize    int
}
```
The read logic is relatively simple: first look in memdb and frozen memdb, then fall back to the files. Some details are still fairly subtle though; the main one is what an ikey is.
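The ikey (internal key) is the user key followed by an 8-byte trailer that packs the sequence number together with the entry type (value or delete). A sketch of the encoding, following LevelDB's key format:

```go
import "encoding/binary"

// makeInternalKey appends the 8-byte trailer (sequence << 8 | type) to the
// user key; type 1 marks a value, type 0 a delete.
func makeInternalKey(ukey []byte, seq uint64, keyType byte) []byte {
	ikey := append(append([]byte(nil), ukey...), make([]byte, 8)...)
	binary.LittleEndian.PutUint64(ikey[len(ukey):], seq<<8|uint64(keyType))
	return ikey
}
```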
The picture above is an example of kv storage: if two pairs k1=v1 and k2=v2 are stored in memory, the bytes in memdb look roughly as follows
Then the general structure of the snapshot is as follows
After seeing the kv storage structure, it may feel a little odd: keys and values are packed back to back, so how do we tell which bytes are a key and which are a value? That is what the nodeData field in the DB structure is for; it serves as the skip list and also records the length of each kv
With this, how memdb finds data in skip-list fashion based on nodeData and kvData becomes clearer
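As a sketch, recovering one node's kv from those two slices looks like this (nodeKV is a hypothetical helper, not goleveldb's API):

```go
// nodeKV slices a node's key and value out of kvData using the per-node
// metadata laid out in nodeData: [node] kv offset, [node+1] key length,
// [node+2] value length.
func nodeKV(kvData []byte, nodeData []int, node int) (key, value []byte) {
	o, kl, vl := nodeData[node], nodeData[node+1], nodeData[node+2]
	return kvData[o : o+kl], kvData[o+kl : o+kl+vl]
}
```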
Besides memdb, the sstables are searched as well; this is covered in detail later.
Log read and write
Log structure
The following figure shows the data layout of a chunk
(Figure: Log read and write-chunk structure)
And a log structure is composed of multiple blocks
(Figure: Log Read and Write-Log Structure)
The data field in (Figure: Log read and write - chunk structure) holds the batch1...N records
checksum: used to verify data integrity
chunk type: one of four types, first/middle/last/full, indicating the completeness of a chunk; full means a self-contained chunk
If a piece of data fits completely in one chunk, that chunk's type is full
If a piece of data is large and spans several chunks, the first chunk's type is first, the last one's is last, and any in between are middle
length: the length of the data stored in the chunk
Log write
Log writing is relatively simple. Internally the program encapsulates a singleWriter, which is responsible for one write and splits large data across blocks.
When writing the log, it first checks whether the block's remaining space can hold a header: if not, the rest of the block is zero-padded; if so, the write proceeds there. The chunk type is determined by whether the current block can hold all of the data: if it fits, the chunk is full; if not, the first chunk is first, followed by middle/last chunks.
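A hedged sketch of this chunking, assuming LevelDB's standard 32 KiB block and 7-byte chunk header of 4-byte checksum, 2-byte data length, and 1-byte type (writeRecord and its checksum parameter are illustrative, not goleveldb's actual journal code):

```go
import "encoding/binary"

const (
	blockSize  = 32 * 1024
	headerSize = 7 // 4-byte checksum + 2-byte length + 1-byte chunk type
)

const (
	fullChunk byte = iota + 1
	firstChunk
	middleChunk
	lastChunk
)

// writeRecord appends one record to buf, splitting it into chunks and
// zero-padding any block tail too small to hold a header.
func writeRecord(buf, data []byte, checksum func([]byte) uint32) []byte {
	first := true
	for {
		if rest := blockSize - len(buf)%blockSize; rest < headerSize {
			buf = append(buf, make([]byte, rest)...) // pad with zeros
		}
		avail := blockSize - len(buf)%blockSize - headerSize
		n, last := len(data), true
		if n > avail {
			n, last = avail, false
		}
		var typ byte
		switch {
		case first && last:
			typ = fullChunk
		case first:
			typ = firstChunk
		case last:
			typ = lastChunk
		default:
			typ = middleChunk
		}
		var hdr [headerSize]byte
		binary.LittleEndian.PutUint32(hdr[0:4], checksum(data[:n]))
		binary.LittleEndian.PutUint16(hdr[4:6], uint16(n))
		hdr[6] = typ
		buf = append(append(buf, hdr[:]...), data[:n]...)
		data, first = data[n:], false
		if last {
			return buf
		}
	}
}
```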
Log reading
Log reading is the write path in reverse. The only thing to note: when reading a chunk, verify the checksum and inspect the chunk type. If the chunk type is not full or last, this is not the final chunk and the data is incomplete, so keep reading until the record is complete.
sstable read
sstable structure
When the data of a frozen memdb is persisted to a file, it is organized according to a fixed structure; this storage structure of leveldb is called an sstable.
The data structure of an sstable is as follows
data block: stores the kv pairs
filter block: when filtering is enabled, stores the filter built from the keys. LevelDB only provides a Bloom filter, and it is off by default; when off, nothing is stored here
meta index block: stores the index information for the filter block
index block: stores the index information for each data block
footer: stores the index information of the index block and of the meta index block, plus a magic number - "\x57\xfb\x80\x8b\x24\x75\x47\xdb"
Note: each block stores its compression type and CRC check information, except the filter block, which is neither compressed nor checksummed
footer
The footer is 48 bytes long, a constant. It stores the offset and length of the meta index block and of the index block within the file, and its last 8 bytes hold the magic number.
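A sketch of decoding that footer (blockHandle and decodeFooter are hypothetical names; the two handles are varint offset/length pairs, as in LevelDB's block handle encoding):

```go
import "encoding/binary"

// blockHandle locates a block inside the sstable file.
type blockHandle struct{ offset, length uint64 }

// decodeFooter reads the meta index handle, the index handle, and checks the
// magic number in the last 8 of the footer's 48 bytes.
func decodeFooter(footer []byte) (metaIndex, index blockHandle, ok bool) {
	const magic = "\x57\xfb\x80\x8b\x24\x75\x47\xdb"
	if len(footer) != 48 || string(footer[40:]) != magic {
		return metaIndex, index, false
	}
	mo, n1 := binary.Uvarint(footer)
	ml, n2 := binary.Uvarint(footer[n1:])
	io, n3 := binary.Uvarint(footer[n1+n2:])
	il, _ := binary.Uvarint(footer[n1+n2+n3:])
	return blockHandle{mo, ml}, blockHandle{io, il}, true
}
```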
meta index block
The meta index block records the offset and length of the filter block
index block
The index block is analogous to the meta index block: each record stores the max key of a data block together with that block's index information, which is used to quickly locate the target data block.
filter block
The filter block is the filter itself, which is off by default in LevelDB. When off, the filter block holds no data; when on, a filter can be chosen, though only the Bloom filter is implemented in LevelDB.
The structure of the filter block is as above
base lg: the size of the data range covered by one Bloom filter; the default value is 11, so each filter covers 2^11 bytes = 2 KiB
filter offsets' offset: the position of the first filter offset, used to separate the filter data from the offset array
filter n offset: although filter offset n points to filter data n, a lookup does not actually read the offsets' offset first and then walk the array; in practice, since base lg fixes the range covered by each filter, a direct shift operation locates the right filter
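A one-line sketch of that shift trick (filterIndex is a hypothetical helper name):

```go
// With base lg = 11 each filter covers a 2 KiB range of the file, so the
// filter responsible for a data block is found by shifting its offset.
func filterIndex(blockOffset uint64, baseLg uint) uint64 {
	return blockOffset >> baseLg // e.g. offset 5000 >> 11 = filter #2
}
```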
data block
Let's first look at the structure of the data block
In this structure there are a couple of unfamiliar faces: entry and restart pointer
First you need to understand how kv pairs are stored in the data block
An entry is the storage structure of a kv pair in the data block, and can be expressed as
shared key length: the length of the prefix shared with the previous key. For example, if the previous key is test1 and the current key is test2, then the shared part is test (shared key length = 4) and the unshared part is 2
As you can see, to save storage space LevelDB does not store kv pairs the way memdb does; keys are prefix-compressed, which is also what gives rise to the restart pointer
restart pointer: the position of each entry that stores a complete key
Restart pointers exist to find keys quickly. When looking up a key, we first use the restart pointers to locate complete keys and compare, to decide which pair of restart pointers the key falls between; if it does, we traverse only the entries between those two points. Without restart pointers we would have to decode and check entries one by one, so the pointers improve efficiency considerably
For example 🌰:
```
restart_interval=2  // set one restart pointer every 2 entries
entry1: key=deck,value=v1
entry2: key=dock,value=v2
entry3: key=duck,value=v3
```
The stored data structure is as follows
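As a hedged sketch of how those three entries are laid out (appendEntry is illustrative, not LevelDB's real encoder, though the varint field order follows the entry structure above):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// sharedLen returns the length of the common prefix of a and b.
func sharedLen(a, b []byte) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// appendEntry encodes one entry: shared key length, unshared key length,
// value length (all varints), then the unshared key bytes and the value.
func appendEntry(buf, prevKey, key, value []byte, restart bool) []byte {
	shared := 0
	if !restart { // an entry at a restart point stores the full key
		shared = sharedLen(prevKey, key)
	}
	buf = binary.AppendUvarint(buf, uint64(shared))
	buf = binary.AppendUvarint(buf, uint64(len(key)-shared))
	buf = binary.AppendUvarint(buf, uint64(len(value)))
	buf = append(buf, key[shared:]...)
	return append(buf, value...)
}

func main() {
	var block []byte
	block = appendEntry(block, nil, []byte("deck"), []byte("v1"), true) // restart point
	block = appendEntry(block, []byte("deck"), []byte("dock"), []byte("v2"), false)
	block = appendEntry(block, []byte("dock"), []byte("duck"), []byte("v3"), true) // restart_interval=2
	fmt.Printf("%q\n", block)
}
```

With deck/dock/duck only the first byte d is shared, so compression saves little here; with realistic keys like user_0001, user_0002 the savings are substantial.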
Read operation
After understanding the above data structures, the sstable read flow can be roughly sorted out
When searching for a key in the sstable files, there is a small difference between level-0 files and non-level-0 files:
Level-0 files are allowed to overlap, while non-level-0 files are not
Level-0 files 1..N are not guaranteed to be mutually ordered, while non-level-0 files 1..N are: at a level above 0, file 1 may hold keys 0-20 and file 2 keys 20-40, whereas at level 0 file 1 may hold keys 0-20 and file 2 keys 10-30
These two differences lead to the difference in how level-0 and non-level-0 files are searched
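A sketch of the two strategies, simplified to plain byte-slice keys (fileRange and candidates are hypothetical; the real code works over tFiles and internalKey with a comparator):

```go
import (
	"bytes"
	"sort"
)

// fileRange is a toy stand-in for tFile's imin/imax pair.
type fileRange struct{ min, max []byte }

// candidates returns the files that may contain key at the given level.
func candidates(files []fileRange, level int, key []byte) []fileRange {
	if level == 0 {
		// level 0 may overlap: every file covering the key must be consulted
		var hits []fileRange
		for _, f := range files {
			if bytes.Compare(key, f.min) >= 0 && bytes.Compare(key, f.max) <= 0 {
				hits = append(hits, f)
			}
		}
		return hits
	}
	// levels > 0 are sorted and disjoint: binary-search the first file whose
	// max key is >= key, then check the min bound
	i := sort.Search(len(files), func(i int) bool {
		return bytes.Compare(files[i].max, key) >= 0
	})
	if i < len(files) && bytes.Compare(key, files[i].min) >= 0 {
		return files[i : i+1]
	}
	return nil
}
```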
compaction
There are two types of compaction
- memory compaction: persist the data of a frozen memdb into level-0 files
- table compaction: merge the files of level N into level N+1
memory compaction
Memory compaction is relatively simple: it iterates over the memdb data and writes it into an sstable sequentially, adding some auxiliary data. In essence it is the persistence of in-memory data; each memory compaction adds one new level-0 file.
Memory compaction is highly time-sensitive and needs to complete as quickly as possible, so its priority is higher than table compaction's; table compaction must yield to memory compaction.
table compaction
Table compaction is a bit more complicated than memory compaction, because it has to answer two questions:
- how to decide which level to compact
- how to decide which files at that level to compact
To answer these, compaction triggers are designed, and compaction starts when one of the following conditions is met (a score sketch follows the list):
- When the number of files at level 0 exceeds the predetermined upper limit (default is 4)
- When the total size of level i files exceeds (10 ^ i) MB
- When a file is read invalidly (a seek miss) too many times
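A sketch of how the first two triggers reduce to a score, mirroring the cScore/cLevel fields seen in the version struct earlier (compactionScore is a hypothetical stand-in for goleveldb's computeCompaction; the invalid-read trigger is tracked separately, via cSeek and the per-file seekLeft counter):

```go
import "math"

// compactionScore returns >= 1 when the level should be compacted.
func compactionScore(level, numFiles int, totalSize int64) float64 {
	if level == 0 {
		// level 0 is scored by file count, because its files may overlap
		// and every extra file slows reads down
		return float64(numFiles) / 4 // default upper limit: 4 files
	}
	limit := math.Pow(10, float64(level)) * 1024 * 1024 // (10^i) MB
	return float64(totalSize) / limit
}
```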
Once the level to compact is determined, the next step is to pick the sstable(s) at that level. The main logic is:
- If it is a level-0 compaction, select the oldest sstable, namely tFiles[0]
- If it is not a level-0 compaction and it was triggered by too many invalid reads, select that sstable
- If it is neither a level-0 compaction nor caused by too many invalid reads, select the first sstable whose keys come after the maxKey of the previous compaction at this level
compaction process
Level 0 compaction
Because level 0 allows keys to overlap, its compaction is special: after an sstable is chosen, it first traverses all sstables at the current level and collects every one whose keys overlap, forming a new, larger [minKey, maxKey] range, and then goes to level 1 to look for overlapping files.
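A sketch of that expansion loop, reusing the fileRange type from the lookup sketch above (byte-slice keys, hypothetical helper name):

```go
import "bytes"

// expandLevel0 keeps widening [min, max] until no level-0 file both overlaps
// the range and extends beyond it, then returns the final range.
func expandLevel0(files []fileRange, min, max []byte) ([]byte, []byte) {
	for changed := true; changed; {
		changed = false
		for _, f := range files {
			if bytes.Compare(f.max, min) < 0 || bytes.Compare(f.min, max) > 0 {
				continue // no overlap with the current range
			}
			if bytes.Compare(f.min, min) < 0 {
				min, changed = f.min, true
			}
			if bytes.Compare(f.max, max) > 0 {
				max, changed = f.max, true
			}
		}
	}
	return min, max
}
```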
Level N compaction
Level N (N > 0) has no key overlap, so a level-N compaction does not need to traverse all sstables at the current level to expand the key range that must be searched.