LSM-Tree - LevelDb source code analysis
introduction
In the previous article [[LSM-Tree - LevelDb Understanding and Implementation]], I introduced LevelDb's data structures and core components, the core read and write paths of LevelDB, and why writes in this database are several times faster than reads.
The source code of LevelDB is relatively easy to understand, so easy that someone who has only learned JAVA and has just entry-level C knowledge can follow it. On top of that, the author comments the key places and even explains why things are designed the way they are <s>(it's so well written that it makes me cry that I don't have a colleague like this)</s>.
If that is still not enough, the author has also written a number of md documents (in the doc directory) that introduce the data structures and explain the role of the core components.
In short, don't be intimidated by this database: whether as an example of excellent code and design patterns, or as a showcase of mainstream data structures and algorithms, it is well worth studying and referencing.
Tip: this section contains a lot of code, so it is not recommended to read it on a phone or other mobile device; it is better viewed on a PC.
running the source code
Compiling LevelDB is straightforward; the code can be cloned directly from the official repository.
Address: https://github.com/google/leveldb
The specific steps are as follows (you can also refer to the README in the repository):
git clone --recurse-submodules https://github.com/google/leveldb.git
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release .. && cmake --build .
After the build completes, we get a dynamic library, a static library and the test targets, and we can then write unit tests of our own. The official source also ships with many unit tests, and we can add our own test programs for debugging; here, though, we skip all of that and go straight to the source code.
underlying storage structure
Association: [[SSTable]]
In LevelDB, the SSTable is the most important structure in the whole database, and the content of SSTable files is never modified. Data is normally manipulated in memory, but it cannot stay there indefinitely: once it reaches a certain volume it has to be persisted to disk, and the resulting compaction and merging are a real test of system performance. For this reason LevelDb stores data in a hierarchical (leveled) structure. Let's start from the externally visible structure and work inward to understand the internal design.
The entire external black box is the database itself. Taking a transactional database as an example, the usual operations are essentially the four CRUD operations, but the LSM-Tree data structure is a bit different: updates and deletions are actually performed through "adds" plus "merges", with the new data overriding the old.
With that, let's start from the simple concepts, beginning with the source code of the DB itself, which can be found at the following path:
https://github.com/google/leveldb/blob/main/include/leveldb/db.h
First of all, we need to understand the DB storage structure; the interface the storage engine exposes is very simple:
class LEVELDB_EXPORT DB {
public:
// Set the key-value entry in the database; anything other than OK means the operation failed.
// Note: consider enabling the sync=true option by default. Internally `Put` ends up calling `Write`;
// the two methods simply give callers two different entry points.
virtual Status Put(const WriteOptions& options, const Slice& key,
const Slice& value) = 0;
// Returns OK on success and a non-OK status on error; it still returns OK if the
// key to be deleted did not exist.
virtual Status Delete(const WriteOptions& options, const Slice& key) = 0;
// Apply a batch of updates
virtual Status Write(const WriteOptions& options, WriteBatch* updates) = 0;
// Look up the value for the given key
virtual Status Get(const ReadOptions& options, const Slice& key,
std::string* value) = 0;
};
Get and Put are the read and write interfaces LevelDB exposes to the upper layer. Note that Update and Delete are both carried out through Put: the operation type is distinguished internally by a value-type tag, which is an interesting detail worth keeping in mind.
write section
Let's start with the write operation to see how the data enters LevelDb and how it is managed internally.
The internal logic of Write is relatively complex, so here is the basic flowchart:
We start from the Write() interface method of DB. After simplifying the code, the general flow is as follows:
// Make room for the write; no lock is needed at this point.
Status status = MakeRoomForWrite(updates == nullptr);
// Append a record of the write operation to the log via `AddRecord`.
status = log_->AddRecord(WriteBatchInternal::Contents(write_batch));
// If the log record succeeded, apply the write to the memtable.
if (status.ok()) {
status = WriteBatchInternal::InsertInto(write_batch, mem_);
}
The entire execution process is as follows:

- First, MakeRoomForWrite is called to provide enough space for the upcoming write.
  - If the current space is insufficient and the current memtable needs to be frozen, a Minor Compaction happens and a new MemTable object is created.
  - If the trigger conditions for a Major Compaction are satisfied, the SSTables are compacted and merged.
- A write-operation record is appended to the log through the AddRecord method.
- Finally, the memtable is called to add the key/value pair to its in-memory structure, completing the write.
The source code logic of the write operation is simplified as follows:
Status DBImpl::Write(const WriteOptions& options, WriteBatch* my_batch) {
Writer w(&mutex_);
w.batch = my_batch;
MakeRoomForWrite(my_batch == NULL);
uint64_t last_sequence = versions_->LastSequence();
Writer* last_writer = &w;
WriteBatch* updates = BuildBatchGroup(&last_writer);
WriteBatchInternal::SetSequence(updates, last_sequence + 1);
// Record the final sequence number covered by this batch
last_sequence += WriteBatchInternal::Count(updates);
// Write the log record
log_->AddRecord(WriteBatchInternal::Contents(updates));
// Apply the batch to the memtable
WriteBatchInternal::InsertInto(updates, mem_);
versions_->SetLastSequence(last_sequence);
return Status::OK();
}
The code above wraps several helper methods; let's look at them one by one.
MaybeScheduleCompaction()
This handles compaction and merging (if it feels abrupt here, refer to the flowchart above). In the source code the system periodically checks whether a compaction can run, and the if/else chain deals with concurrent writes from multiple threads: when a thread finds another thread already operating, it either waits for that thread's result or waits until it acquires the lock and takes over the combined write.
If you have any questions about the code below, you can read the section on "merge writing" in [[LSM-Tree - LevelDb Understanding and Implementation]]; to save time you can search the page for the keyword "merge writing" to jump straight to it. The rest of this section assumes you already understand that basic workflow, so it is not repeated here.
LevelDb merge write operations
void DBImpl::MaybeScheduleCompaction() {
mutex_.AssertHeld();
if (background_compaction_scheduled_) {
// Already scheduled
} else if (shutting_down_.load(std::memory_order_acquire)) {
// DB is being deleted; no more background compactions
} else if (!bg_error_.ok()) {
// Already got an error; no more changes
} else if (imm_ == nullptr && manual_compaction_ == nullptr &&
!versions_->NeedsCompaction()) {
// No compaction needed, so nothing to do
} else {
// Mark a compaction as scheduled
background_compaction_scheduled_ = true;
// Kick off the background compaction
env_->Schedule(&DBImpl::BGWork, this);
}
}
Immutable memtable:
The write function contains the following line, where the writer may unlock and wait before continuing. What is this call for?
Status status = MakeRoomForWrite(updates == nullptr);
Stepping into the method, you will find that the state of the current memtable is checked inside a while loop. Once the memtable is found to be full, writing to it must stop: the current memtable is converted into an immutable memtable, a new memtable is created and writes are switched over to it, and at the same time the method decides, based on several conditions, whether the old memtable can be compacted.
Here is an additional explanation of the meaning of GUARDED_BY in the source code:
GUARDED_BY is an attribute on a data member declaring that the member is protected by a given capability (typically a mutex). Read operations on the data require shared access, while write operations require exclusive access.
For example, the GUARDED_BY attribute declares that a thread must lock listener_list_mutex before it can read or write listener_list, which keeps the updates to the list atomic.
In short, this is the classic exclusive-write / shared-read locking discipline; how it is implemented is not the focus of this article.
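As a small illustration (assuming clang's thread-safety annotation macros, similar to those in leveldb's port/thread_annotations.h; the class and member names below are made up):

#include <mutex>
#include <vector>

// GUARDED_BY expands to clang's thread-safety attribute when supported,
// and to nothing on other compilers (leveldb defines it similarly in
// port/thread_annotations.h).
#if defined(__clang__)
#define GUARDED_BY(x) __attribute__((guarded_by(x)))
#else
#define GUARDED_BY(x)
#endif

class ListenerRegistry {
 public:
  void Add(int id) {
    std::lock_guard<std::mutex> lock(listener_list_mutex_);  // writes need exclusive access
    listener_list_.push_back(id);
  }
  size_t Size() {
    std::lock_guard<std::mutex> lock(listener_list_mutex_);  // the lock is also required for reads
    return listener_list_.size();
  }

 private:
  std::mutex listener_list_mutex_;
  std::vector<int> listener_list_ GUARDED_BY(listener_list_mutex_);
};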
mem can be regarded as the system's current scratch pad or temporary ledger. As in most log-based or relational databases, the log is written first and all subsequent "transaction" operations follow from it, i.e. the log takes precedence over the data record; correctness under concurrent operations is then ensured by locking around the log write.
The more critical parts of the MakeRoomForWrite method are commented; the author explains the intent behind most operations and the code logic is fairly simple, so reading it a few times is enough to get the general idea. (If you are not fluent in C++ syntax, don't worry too much; just understand what it is trying to do. Honestly, I can't read all of it either, haha.)
while (true) {
    if (!bg_error_.ok()) {
      // Yield previous error
      s = bg_error_;
      break;
    } else if (allow_delay && versions_->NumLevelFiles(0) >=
               config::kL0_SlowdownWritesTrigger) {
      // We are getting close to hitting the hard limit on the number of L0 files.
      // Rather than delaying a single write by several seconds once we hit that
      // hard limit, start delaying each individual write by 1ms to reduce latency
      // variance. Also, this delay hands over some CPU to the compaction thread in
      // case it is sharing the same core as the writer.
      mutex_.Unlock();
      env_->SleepForMicroseconds(1000);
      // Do not delay a single write more than once
      allow_delay = false;
      mutex_.Lock();
    } else if (!force &&
               (mem_->ApproximateMemoryUsage() <= options_.write_buffer_size)) {
      // There is still room in the current memtable
      break;
    } else if (imm_ != nullptr) {
      // The current memtable is full, but the previous one is still being written
      // out, so we have to wait.
      background_work_finished_signal_.Wait();
    } else if (versions_->NumLevelFiles(0) >=
               config::kL0_StopWritesTrigger) {
      background_work_finished_signal_.Wait();
    } else {
      // Attempt to switch to a new memtable and trigger compaction of the old one
      assert(versions_->PrevLogNumber() == 0);
      // Allocate a new file number
      uint64_t new_log_number = versions_->NewFileNumber(); //return next_file_number_++;
      WritableFile* lfile = nullptr;
      // Create a new writable file; internally (in the in-memory Env) a map acts as a
      // simple file system mapping file names to file state:
      // typedef std::map<std::string, FileState*> FileSystem;
      s = env_->NewWritableFile(LogFileName(dbname_, new_log_number), &lfile);
      if (!s.ok()) {
        // Avoid burning through file numbers in a repeating loop
        versions_->ReuseFileNumber(new_log_number);
        break;
      }
      delete log_;
      delete logfile_;
      logfile_ = lfile;
      logfile_number_ = new_log_number;
      // Switch the log writer to the new file
      log_ = new log::Writer(lfile);
      // Key point: imm_ is the immutable memtable. It takes over the reference to the
      // memtable that is now full; it is essentially the same object as mem, only
      // protected by an exclusive-write / shared-read discipline from here on.
      imm_ = mem_;
      has_imm_.store(true, std::memory_order_release);
      // Create a new memtable
      mem_ = new MemTable(internal_comparator_);
      // Take a reference on the new memtable
      mem_->Ref();
      force = false;  // Do not force another compaction if have room
      // Try to schedule a compaction of the now-full memtable (see MaybeScheduleCompaction above)
      MaybeScheduleCompaction();
    }
  }
Let's use a simple schematic diagram to understand the general process above:
Note that this differs noticeably from the structure described by the original [[SSTable]] theory; such gaps between theory and practice are normal.
Under normal circumstances the memtable can simply delay read and write requests briefly while waiting for compaction to complete. But once the memory used by mem grows too large, the current mem must be locked into the immutable (imm_) state, a new MemTable instance is created, and incoming requests are redirected to the new mem, so that external writes can continue to be accepted without waiting for the Minor Compaction to finish.
Note again that whether a compaction actually runs is decided by the MaybeScheduleCompaction function.
This wait-free design idea comes from [[Dynamic-sized NonBlocking Hash table]]; you can read the paper yourself, or wait for a later article of mine.
log part
Having understood the general flow of a write, let's look at LevelDb's log management, namely the AddRecord() function.
Note that the core of the logging logic is not inside AddRecord() itself, which only does some simple string splicing; the interesting part is RecordType. The type is chosen according to how much of the record remains to be written, and RecordType identifies where the current fragment sits within the block:
//....
enum RecordType {
// Zero is reserved for preallocated files
kZeroType = 0,
kFullType = 1,
// For fragments
kFirstType = 2,
kMiddleType = 3,
kLastType = 4
};
//....
RecordType type;
const bool end = (left == fragment_length);
if (begin && end) {
type = kFullType;
} else if (begin) {
type = kFirstType;
} else if (end) {
type = kLastType;
} else {
type = kMiddleType;
}
- First: the type of the first fragment of a user record.
- Last: the type of the last fragment of a user record.
- Middle: the type of all interior fragments of a user record.
If the source code is unclear, you can follow the author's md documentation to get a general picture of the log file structure:
record :=
checksum: uint32 // crc32c of type and data[] ; little-endian
length: uint16 // little-endian
type: uint8 // One of FULL, FIRST, MIDDLE, LAST
data: uint8[length]
We can simply draw a diagram based on the description:
From the definitions around RecordType you can see that the log file is divided into fixed 32KB blocks; a record may be split into several fragments across those blocks, but each block belongs to a single log file.
Each record within a block is laid out as follows:
- the first 4 bytes hold the CRC checksum
- the next two bytes hold the data length
- then one byte holds the type identifier (the record's position within the block)
- finally comes the data payload
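As a concrete illustration of that 7-byte header, here is a minimal sketch (an illustrative helper, not the actual code in log_writer.cc; the CRC value is passed in rather than computed, whereas leveldb computes it with crc32c over the type byte and the payload):

#include <cstdint>
#include <string>

// Assembles the header described above: 4-byte little-endian checksum,
// 2-byte little-endian length, 1-byte type (FULL/FIRST/MIDDLE/LAST).
std::string BuildRecordHeader(uint32_t crc, uint16_t length, uint8_t type) {
  char header[7];
  header[0] = static_cast<char>(crc & 0xff);            // checksum
  header[1] = static_cast<char>((crc >> 8) & 0xff);
  header[2] = static_cast<char>((crc >> 16) & 0xff);
  header[3] = static_cast<char>((crc >> 24) & 0xff);
  header[4] = static_cast<char>(length & 0xff);          // payload length
  header[5] = static_cast<char>((length >> 8) & 0xff);
  header[6] = static_cast<char>(type);                   // record type
  return std::string(header, sizeof(header));
}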
The 32KB size was chosen with the disk alignment of log records and log read/write efficiency in mind. Log writing is also very fast: the record is first appended to an in-memory buffer, then flushed and persisted to disk via fflush/fdatasync(...), and the log is finally what makes failure recovery possible.
It should be noted that if the log record is large, it may exist in multiple blocks.
A record never starts within the last six bytes of a block, because a record always needs its header in front of the payload (the checksum, data length and type identifier, which together take 7 bytes).
This bit of "waste" is considered acceptable, both to avoid splitting a single block across files and for compaction-related reasons.
If you are curious about what actually ends up in those last few bytes, the following line answers it:
dest_->Append(Slice("\x00\x00\x00\x00\x00\x00", leftover));
Log writing flow chart:
The log writing process is fairly simple. The main branching point is whether the remaining space in the current block is enough to hold a header; if not, the last few (fewer than 7) bytes are filled with zeros.
During log writing, a while(true) loop keeps checking the buffer size; once the data reaches 32KB minus those trailing bytes, writing stops and everything written up to this point is emitted as a block fragment.
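A minimal sketch of that fragmentation loop (the 32KB block and 7-byte header constants follow the description above; the function is illustrative rather than the real log::Writer::AddRecord):

#include <cstddef>

const int kBlockSize = 32768;   // 32KB blocks
const int kHeaderSize = 7;      // checksum + length + type

enum FragmentType { FULL, FIRST, MIDDLE, LAST };

// Walks through a record of record_size bytes and decides, fragment by fragment,
// which type each physical record gets. block_offset is the current write
// position inside the active block and is updated as we go.
void AddRecordSketch(size_t record_size, int* block_offset) {
  size_t left = record_size;
  bool begin = true;
  do {
    int leftover = kBlockSize - *block_offset;
    if (leftover < kHeaderSize) {
      // Not enough room even for a header: zero-fill the trailer, switch blocks.
      *block_offset = 0;
    }
    size_t avail = static_cast<size_t>(kBlockSize - *block_offset - kHeaderSize);
    size_t fragment = (left < avail) ? left : avail;
    bool end = (left == fragment);
    FragmentType type = (begin && end) ? FULL
                        : begin        ? FIRST
                        : end          ? LAST
                                       : MIDDLE;
    (void)type;  // a real writer would emit header(type) + fragment bytes here
    *block_offset += kHeaderSize + static_cast<int>(fragment);
    left -= fragment;
    begin = false;
  } while (left > 0);
}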
The following is the log writing flow chart:
The following is the log reading flow chart:
Since the block size is 32KB, the log is also read in 32KB units and then scanned block by block. While scanning a chunk, a CRC check failure produces an error message, and a corrupted chunk is discarded.
Skimming the code: in short, it keeps calling read inside a while(true) loop until it reads a chunk of type Last, at which point the log record has been fully read.
memtable
An interesting feature is that both insertion and deletion are implemented by "adding" (you read that right). The overall state is maintained through the MANIFEST, while the sequence number attached to each entry records whether it was an add or a delete and guarantees that reads always see the latest value. The details were covered in the previous article [[LSM-Tree - LevelDb Understanding and Implementation]].
Note that a record is not yet queryable once it has only been written to the log (a power failure could still occur before the real record lands); the log exists purely for fault recovery, and the data becomes readable only after it has been written into mem.
The code for adding and deleting mem is as follows:
namespace {
class MemTableInserter : public WriteBatch::Handler {
public:
SequenceNumber sequence_;
MemTable* mem_;
void Put(const Slice& key, const Slice& value) override {
mem_->Add(sequence_, kTypeValue, key, value);
sequence_++;
}
void Delete(const Slice& key) override {
mem_->Add(sequence_, kTypeDeletion, key, Slice());
sequence_++;
}
};
} // namespace
In the Add() function, the insertion is carried out through a [[LSM-Tree - LevelDb Skiplist skip table]], and the key and value are packed into the data node. To ensure that reads always see the latest data, records inside the skiplist must be kept sorted; node ordering uses the familiar comparator Compare, and if you want a custom order (for example, to handle different character encodings) you can write your own comparator implementation.
As for the structure of an entry, we can see the author's note in the Add() function:
// Format of an entry is concatenation of:
// key_size : varint32 of internal_key.size()
// key bytes : char[internal_key.size()]
// tag : uint64((sequence << 8) | type)
// value_size : varint32 of value.size()
// value bytes : char[value.size()]
[[VarInt32 encoding]]: a variable-length integer encoding. Although it encodes a 32-bit (4-byte) integer, the number of bytes actually used varies (see the note below).
uint64((sequence << 8) | type): after the shift the sequence number effectively occupies the upper 7 bytes, and the low byte of the tag holds the ValueType, which marks whether the record is an add or a delete.
VarInt32 (varying int 32) is a variable-length encoding of a 32-bit integer. Normally an int has a fixed length of 32 bits (4 bytes), but the length of a VarInt32 is not fixed: in each byte the highest bit has a special meaning, and if it is 1 the next byte is also part of the number.
An integer therefore takes at least 1 and at most 5 bytes to encode. If most of the numbers in a system need 4 or more bytes anyway, VarInt32 is not a particularly good fit.
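A minimal sketch of that encoding rule (written independently here; leveldb's own version is EncodeVarint32 in util/coding.cc):

#include <cstdint>
#include <string>

// Appends v to dst using the varint32 rule described above: 7 data bits per
// byte, with the high bit set on every byte except the last.
void PutVarint32Sketch(std::string* dst, uint32_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));  // more bytes follow
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));  // final byte, high bit clear
}
// Example: 300 encodes as the two bytes 0xAC 0x02.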
In the get() path, records are distinguished by the ValueType, which occupies one byte and indicates whether the record is an add or a delete. With the default comparator, the logic that decides between the two cases looks like this:
if (comparator_.comparator.user_comparator()->Compare(
Slice(key_ptr, key_length - 8), key.user_key()) == 0) {
// Correct user key
const uint64_t tag = DecodeFixed64(key_ptr + key_length - 8);
switch (static_cast<ValueType>(tag & 0xff)) {
case kTypeValue: {
Slice v = GetLengthPrefixedSlice(key_ptr + key_length);
value->assign(v.data(), v.size());
return true;
}
case kTypeDeletion:
*s = Status::NotFound(Slice());
return true;
}
}
Draw the following structure diagram according to the code definition and the above description:
Compare key sorting
LevelDb's memtable maintains its keys in a skip list and by default uses InternalKeyComparator to compare them. The internal logic of the comparison is as follows:
The comparator orders entries by user_key and sequence_number: ascending by user_key, and for equal user keys, descending by sequence_number so that the newest record comes first.
/*
A comparator for internal keys that uses a specified comparator for the
user-key portion and breaks ties by decreasing sequence number.
*/
int InternalKeyComparator::Compare(const Slice& akey, const Slice& bkey) const {
  // Order by:
  //    increasing user key (according to the user-supplied comparator)
  //    decreasing sequence number
  //    decreasing type (though the sequence number should be enough to disambiguate)
  int r = user_comparator_->Compare(ExtractUserKey(akey), ExtractUserKey(bkey));
  if (r == 0) {
    const uint64_t anum = DecodeFixed64(akey.data() + akey.size() - 8);
    const uint64_t bnum = DecodeFixed64(bkey.data() + bkey.size() - 8);
    if (anum > bnum) {
      r = -1;
    } else if (anum < bnum) {
      r = +1;
    }
  }
  return r;
}
Note that the keys being compared may contain completely different content. Readers may wonder whether this affects extracting the value for a key, but as the get logic shows, the user key can always be recovered from the key length, the sequence number and the rest of the header information, so the comparison works regardless of what the key contains.
record query
Now let's go back and look at how the memtable is read. From the relationship between the memtable and the immutable memtable you can see that it behaves somewhat like a cache: when the memtable is full it becomes the immutable memtable and waits to be synced to disk.
A key lookup proceeds in the following order:
- Look for the key in the memtable; if a matching entry is found, the search ends.
- Look for the key in the immutable memtable; if a matching entry is found, the search ends.
- Search the SSTable files level by level, from the lowest level to the highest. If a matching entry is found the search ends; otherwise a Not Found error is returned, indicating that the key does not exist in the database.
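As a toy illustration of this lookup order (using std::map as a stand-in for the memtable, immutable memtable and per-level SSTables; the real DBImpl::Get additionally handles snapshots, sequence numbers and seek statistics):

#include <map>
#include <string>
#include <vector>

using Layer = std::map<std::string, std::string>;

bool LookupSketch(const Layer& mem, const Layer& imm,
                  const std::vector<Layer>& levels,  // index 0 = Level 0
                  const std::string& key, std::string* value) {
  auto it = mem.find(key);
  if (it != mem.end()) { *value = it->second; return true; }   // 1. memtable
  it = imm.find(key);
  if (it != imm.end()) { *value = it->second; return true; }   // 2. immutable memtable
  for (const Layer& level : levels) {                          // 3. levels, low to high
    it = level.find(key);
    if (it != level.end()) { *value = it->second; return true; }
  }
  return false;  // Not Found
}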
Records are thus searched following the hierarchy: first the memtable currently being written, then the immutable memtable, and finally the SSTables. SSTable files carry the *.ldb suffix, so they can be located quickly.
Finally, we can think of the LevelDb query as the following form:
Summary :
In this part we looked at the source code of LevelDB's basic DB structure and introduced its external interface, which looks very much like a map. The focus was on the read and write paths in the source code, along with some of the internal merge and compaction details.
In addition, record insertion and lookup follow roughly the same read/write flow of LevelDB introduced earlier. The code shown here is of course heavily simplified; readers can dig further according to their own interests.
SSTable operations
Earlier we covered the low-level add/delete/update/query paths for records and the details of log reading and writing. Next comes an introduction to the special data structure introduced by Google: the SSTable.
How does SSTable work?
From the original paper, an SSTable can be summarized by the following characteristics:
- Writes do not go to disk directly; they are first written to an in-memory table structure.
- When the memory used by that structure exceeds a certain threshold, it is written straight out to a disk file. Since the data is already sorted, the write is efficient, and new writes can keep flowing into a fresh in-memory structure at the same time.
- Reads check memory first, then the disk files from the most recently written backwards, and finally report not-found.
- A background thread periodically merges and compacts the sorted segments, overwriting or discarding obsolete values.
[[SSTable]] first appeared in Google's 2006 paper. LevelDB's SSTable design reflects several characteristics of this data structure, though not identically; LevelDB uses SSTables to maintain multi-level data nodes on disk.
It can be considered that understanding the SSTable structure is equivalent to understanding the core data structure design of LevelDb.
Multilevel SSTable
Let's focus on the multi-level SSTable part. When LevelDB scans the SSTables on disk it does not skip levels, which naturally raises questions about the efficiency of scanning every level. To address this, the author designed the following data structure in the db:
struct FileMetaData {
  FileMetaData() : refs(0), allowed_seeks(1 << 30), file_size(0) {}
  int refs;
  int allowed_seeks;     // Seeks allowed until a compaction is triggered
  uint64_t number;
  uint64_t file_size;    // File size
  InternalKey smallest;  // Smallest internal key served by this table
  InternalKey largest;   // Largest internal key served by this table
};
This structure records all the metadata of a compacted SSTable file: the largest and smallest keys, the remaining allowed seeks, the reference count and the file number. SSTables are stored in the same directory with a fixed naming scheme, so a file can be located quickly by its number.
Searching follows the same ordering as the record keys, from small to large. Take Level 0 as an example: it usually holds up to 4 SSTables whose key ranges can overlap, so they are read in order from SSTable 1 to 4; at higher levels, lookups instead use the smallest and largest keys recorded in the structure above to decide which file to check.
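The idea behind that smallest/largest check can be shown with a tiny hypothetical helper (above Level 0 the files do not overlap, so at most one file per level can contain a given user key):

#include <string>

// Returns true if user_key could live in a file covering [smallest, largest].
bool MayContain(const std::string& smallest, const std::string& largest,
                const std::string& user_key) {
  return user_key >= smallest && user_key <= largest;
}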
The details of finding a specific file live in TableCache::FindTable; for space reasons the code is not shown here. In short, it works together with a cache and RandomAccessFile to read the file, and the loaded file information is kept in memory so the next lookup is cheaper.
If you are familiar with MySQL's B+tree design, this file lookup will feel similar to the page-directory lookup there; the difference is that a B+tree locates records through sparse page directories.
SSTable merge
Let's look at how SSTables are merged. As mentioned earlier, compaction is attempted through MaybeScheduleCompaction. Note that this merge/compaction behaves much like Bigtable's: various conditions decide whether a compaction should run, and once one is possible, the BackgroundCompaction operation is executed.
Compaction comes in two flavors. One is Minor Compaction: the Memtable is turned into an immutable object (effectively just locked) and CompactMemTable is executed to compact it. The other, Major Compaction, is covered further below.
The source code of the simplified version of the merge operation is as follows:
void DBImpl::CompactMemTable() {
VersionEdit edit;
Version* base = versions_->current();
WriteLevel0Table(imm_, &edit, base);
versions_->LogAndApply(&edit, &mutex_);
RemoveObsoleteFiles();
}
CompactMemTable first builds a VersionEdit against the current version, then calls WriteLevel0Table() to try to write the current immutable memtable out to Level 0.
If Level 0 is then found to contain too many SSTables, a Major Compaction follows, with BackgroundCompaction() choosing the appropriate level and compaction strategy.
Here is the simplified code for WriteLevel0Table():
The last few lines of the simplified code take the smallest and largest keys of the new file in order to decide which level the SSTable should be placed at.
Writing the immutable memtable down to Level 0 in this way is the Minor Compaction path.
Status DBImpl::WriteLevel0Table(MemTable* mem, VersionEdit* edit,
                                Version* base) {
  // SSTable file metadata
  FileMetaData meta;
  meta.number = versions_->NewFileNumber();
  pending_outputs_.insert(meta.number);
  Iterator* iter = mem->NewIterator();
  // Build the SSTable file
  BuildTable(dbname_, env_, options_, table_cache_, iter, &meta);
  pending_outputs_.erase(meta.number);
  // Note that if file_size is zero, the file has been deleted and
  // should not be added to the manifest.
  // Take the smallest and largest keys of the file
  const Slice min_user_key = meta.smallest.user_key();
  const Slice max_user_key = meta.largest.user_key();
  // Pick the level for this memtable output
  const int level = base->PickLevelForMemTableOutput(min_user_key, max_user_key);
  // Register the file in the version edit
  edit->AddFile(level, meta.number, meta.file_size, meta.smallest, meta.largest);
  return Status::OK();
}
Putting the pieces together, file management is ultimately handled by VersionEdit: if the write succeeds, the FileMetaData of the new SSTable is recorded, the file changes captured in the VersionEdit are applied through the LogAndApply method (the version/MANIFEST logging discussed in the VersionSet section below), and finally RemoveObsoleteFiles() cleans up files that are no longer needed.
If Level0 is full, a Major Compaction is required. This compaction is more complicated than the previous one because it moves data from lower levels to higher levels. For this we need to look back at BackgroundCompaction; the code is as follows:
void DBImpl::BackgroundCompaction() {
  // If there is an immutable memtable, compact and merge it first
  if (imm_ != nullptr) {
    CompactMemTable();
  }
  Compaction* c = versions_->PickCompaction();
  VersionSet::LevelSummaryStorage tmp;
  CompactionState* compact = new CompactionState(c);
  DoCompactionWork(compact);
  CleanupCompaction(compact);
  c->ReleaseInputs();
  RemoveObsoleteFiles();
}
First, the VersionSet determines what needs to be compacted and packages it into a Compaction object. This object selects the two levels to compact based on seek counts and size limits; because Level 0 contains many overlapping keys, SSTables with overlapping key ranges at the next level are pulled in as well, and FileMetaData is used to find the concrete files to compact. In addition, an SSTable that is queried frequently is "promoted" into a higher-level compaction, and its file information is updated so the next lookup is cheaper.
Merge triggers
When an SSTable file is registered, its allowed_seeks budget is initialized (derived from the file size, with a floor of 100). Once allowed_seeks drops below zero, a compaction of that file with the next level is triggered, because data that keeps being sought at lower levels gradually drags down system performance.
The reasoning behind this design is that having to look for a key at a higher level implies the same key was already looked for, unsuccessfully, at the levels below it, and compacting reduces how many levels have to be scanned again and again to find the data. Ultimately the core of the design is to update FileMetaData so that the next query is cheaper.
You can think of it like looking up a deeply nested folder in an operating system: if reaching some deeply buried data keeps being a hassle, one fix is to create a "shortcut" folder, another is to pin the directory as a label; both amount to the same thing, and the compaction design follows the same intuition.
LevelDB's DoCompactionWork method merge-sorts the key/value pairs of all input SSTables and finally produces a new SSTable at the higher level.
The merge sort exists mainly so that the keys come out in order and can be merged straight into the target level during iteration; the key line is the one below:
Iterator* input = versions_->MakeInputIterator(compact->compaction);
merge sort
The source code of DoCompactionWork merge sort is as follows:
Status DBImpl::DoCompactionWork(CompactionState* compact) {
  int64_t imm_micros = 0;  // Micros spent doing imm_ compactions
  if (snapshots_.empty()) {
    // No snapshots: simply use the last sequence number of the recorded writes
    compact->smallest_snapshot = versions_->LastSequence();
  } else {
    // Snapshots exist: everything older than the oldest snapshot can be discarded
    compact->smallest_snapshot = snapshots_.oldest()->sequence_number();
  }
  // Build a MergingIterator over the data to be compacted; once the iterator is
  // constructed the keys come out in sorted order, which is the merge-sort part
  // mentioned above.
  Iterator* input = versions_->MakeInputIterator(compact->compaction);
  input->SeekToFirst();
  Status status;
  ParsedInternalKey ikey;
  std::string current_user_key;
  // User key of the current record
  bool has_current_user_key = false;
  SequenceNumber last_sequence_for_key = kMaxSequenceNumber;
  while (input->Valid() && !shutting_down_.load(std::memory_order_acquire)) {
    // Give priority to compacting the immutable memtable
    if (has_imm_.load(std::memory_order_relaxed)) {
      const uint64_t imm_start = env_->NowMicros();
      imm_micros += (env_->NowMicros() - imm_start);
    }
    Slice key = input->key();
    if (compact->compaction->ShouldStopBefore(key) &&
        compact->builder != nullptr) {
      status = FinishCompactionOutputFile(compact, input);
    }
    // Handle key/value, add to state, etc.
    bool drop = false;
    if (!ParseInternalKey(key, &ikey)) {
      // Do not hide error keys
      current_user_key.clear();
      has_current_user_key = false;
      // Reset the sequence number
      last_sequence_for_key = kMaxSequenceNumber;
    } else {
      if (!has_current_user_key ||
          user_comparator()->Compare(ikey.user_key, Slice(current_user_key)) !=
              0) {
        // First occurrence of this user key
        current_user_key.assign(ikey.user_key.data(), ikey.user_key.size());
        has_current_user_key = true;
        last_sequence_for_key = kMaxSequenceNumber;
      }
      if (last_sequence_for_key <= compact->smallest_snapshot) {
        // The old entry is hidden by a newer entry for the same user key
        drop = true;  // (A)
      } else if (ikey.type == kTypeDeletion &&
                 ikey.sequence <= compact->smallest_snapshot &&
                 compact->compaction->IsBaseLevelForKey(ikey.user_key)) {
        // For this user key:
        // (1) there is no data in higher levels
        // (2) data in lower levels will have larger sequence numbers
        // (3) data in the levels being compacted here with smaller sequence
        //     numbers will be dropped in the next few iterations of this loop
        //     (by rule (A) above).
        // Therefore this deletion marker is obsolete and can be dropped.
        drop = true;
      }
      last_sequence_for_key = ikey.sequence;
    }
  }
Once the merge sort has processed the key/value information and the cross-level compaction is done, some wrap-up work remains: statistics about the compaction need to be collected.
CompactionStats stats;
stats.micros = env_->NowMicros() - start_micros - imm_micros;
// The two input levels of the compaction
for (int which = 0; which < 2; which++) {
  for (int i = 0; i < compact->compaction->num_input_files(which); i++) {
    stats.bytes_read += compact->compaction->input(which, i)->file_size;
  }
}
for (size_t i = 0; i < compact->outputs.size(); i++) {
  stats.bytes_written += compact->outputs[i].file_size;
}
// The output is compacted into the next (higher) level
stats_[compact->compaction->level() + 1].Add(stats);
// Install the compaction results
InstallCompactionResults(compact);
// Store a summary of the compaction
VersionSet::LevelSummaryStorage tmp;
Log(options_.info_log, "compacted to: %s", versions_->LevelSummary(&tmp));
return status;
By default LevelDB has 7 levels in total, defined in the source code as follows:
static const int kNumLevels = 7;
summary
Here we summarize the two compaction operations, Minor Compaction and Major Compaction:
Minor Compaction: this mainly covers compaction around Level 0. Since Level 0 is used most frequently, a bit like a first-level cache, its key ranges are not forced into a non-overlapping order, so there are more overlapping keys. The process itself is easy to understand: the key step is to iterate the skiplist, build a new SSTable and insert it at the chosen level.
Note: while a Minor Compaction is in progress, the Major Compaction operation is paused.
Major Compaction: this is much more complicated than Minor Compaction. It involves not only cross-level compaction but also key-range determination, iterator-based merge sorting and the final statistics work. The most critical part is the merge-sorted compaction list: old files are merged with new ones to produce new VersionSet information, on top of managing the overall compaction progress.
In addition, after a Minor Compaction completes, another compaction is attempted, because the Minor Compaction may introduce more overlapping keys and a further compaction improves lookup efficiency.
Major Compaction also needs to hold up reads and writes of the entire LevelDB, because it performs cross-level merging across LevelDB's multiple levels, which is considerably more complex; the specific details are introduced later.
Presumably the author ran into this situation during testing and optimized for it.
Storage Status - VersionSet
As the name suggests, this object is a "collection of versions". Internally, a Version structure "version-controls" the key-value file state, which is a direct consequence of multi-threaded compaction; the result is a doubly linked list chaining together historical versions, of which exactly one is always the current version.
The most frequently used and most critical function of VersionSet is LogAndApply; below is a simplified version of VersionSet::LogAndApply.
You can loosely compare this to the undo log used for MVCC in the relational database MySQL.
Status VersionSet::LogAndApply(VersionEdit* edit, port::Mutex* mu) {
  // Update the version linked-list bookkeeping
  if (!edit->has_prev_log_number_) {
    edit->SetPrevLogNumber(prev_log_number_);
  }
  edit->SetNextFile(next_file_number_);
  edit->SetLastSequence(last_sequence_);
  Version* v = new Version(this);
  // Build the new version; the work is delegated to a Builder
  Builder builder(this, current_);
  builder.Apply(edit);
  builder.SaveTo(v);
  // Key method: a scoring mechanism decides which level files belong to; note that
  // the rules for Level 0 are described at length in the source comments.
  Finalize(v);
  // If necessary, initialize a new descriptor log file by creating a temporary
  // file that contains a snapshot of the current version.
  std::string new_manifest_file;
  // There is no reason to unlock *mu here, since we only hit this path on the
  // first call to LogAndApply (when opening the database).
  new_manifest_file = DescriptorFileName(dbname_, manifest_file_number_);
  // Write the MANIFEST file
  env_->NewWritableFile(new_manifest_file, &descriptor_file_);
  // Write a snapshot of the version state
  WriteSnapshot(descriptor_log_);
  // Append the record to the MANIFEST
  descriptor_log_->AddRecord(record);
  // If a new file was created, point CURRENT at it
  SetCurrentFile(env_, dbname_, manifest_file_number_);
  // Install the new version
  AppendVersion(v);
  return Status::OK();
}
The key parts are commented above. The MANIFEST has not been covered before; the author's impl.md describes it like this:
The MANIFEST file lists the set of sorted tables that make up each level, the corresponding key ranges, and other important metadata. Whenever the database is reopened, a new MANIFEST file is created (with a new number embedded in the filename). MANIFEST files are formatted as a log, and changes made to the service state (as files are added or removed) are appended to this log.
From a personal point of view, this file plays a role somewhat like the metadata (META) in BigTable.
SSTable file format
You don't need to rush into the source code for this part; the repository's table_format.md describes it as well. Here is the official documentation, translated:
leveldb file format
<beginning_of_file>
[Data Block 1]
[Data Block 2]
...
[Data Block N]
[Metablock 1]
...
[metablock K]
[meta index block]
[index block]
[footer] (fixed size; starts from file_size - sizeof(Footer))
<end_of_file>
We can draw a corresponding structure diagram according to the description:
The structure diagram above is described from top to bottom as follows:
- Data blocks: following the LSM-Tree storage convention, key/value pairs are stored in sorted order. Data blocks are formatted by the logic in block_builder.cc, and you can choose whether to compress them.
- Meta blocks: also formatted by block_builder.cc, and optionally compressed; more meta block types will be added later (mainly for recording data types).
- "metaindex" block: an index over every other meta block; the key is the name of the meta block and the value is a BlockHandle pointing to that meta block.
- "index" block: an index over the data blocks; each key is a string >= the last key in the corresponding data block and before the first key of the next data block, and the value is the BlockHandle of that data block.
- At the end of the file is a fixed-length footer that contains the BlockHandles of the metaindex and index blocks, plus a magic number.
A magic number is just an arbitrary marker value; for example, the first four bytes of a compiled JAVA class file are CAFEBABE. The specific value and size carry no meaning beyond identification and mostly reflect the author's taste.
Note that the Footer has a fixed size of 48 bytes; from it we can obtain the positions of the metaindex block and the index block, and from those two indexes locate everything else.
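As a small illustration, the following sketch reads just the magic number from the end of the footer (the 48-byte footer size comes from table_format.md and the constant from table/format.h; this is not leveldb's own Footer::DecodeFrom, which also parses the two block handles stored in the first 40 bytes):

#include <cstdint>
#include <fstream>
#include <string>

bool CheckTableMagic(const std::string& path) {
  const uint64_t kTableMagicNumber = 0xdb4775248b80fb57ULL;
  std::ifstream in(path, std::ios::binary | std::ios::ate);
  if (!in || static_cast<std::streamoff>(in.tellg()) < 48) return false;
  in.seekg(-8, std::ios::end);  // the magic number occupies the footer's last 8 bytes
  unsigned char buf[8];
  in.read(reinterpret_cast<char*>(buf), 8);
  uint64_t magic = 0;
  for (int i = 7; i >= 0; --i) magic = (magic << 8) | buf[i];  // little-endian decode
  return magic == kTableMagicNumber;
}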
For more details you can keep reading the table_format.md introduction; I won't repeat it here.
TableBuilder :
The interface for building SSTables is defined in the TableBuilder class. TableBuilder provides the interface for building a Table; its definition is as follows:
TableBuilder provides the interface for building tables (an immutable and ordered map from keys to values).
Multiple threads can call const methods on a TableBuilder without external synchronization. But if any one thread may call a non-constant method, all threads accessing the same TableBuilder must use external synchronization.
// TableBuilder provides the interface used to build a Table
// (an immutable and sorted map from keys to values).
//
// Multiple threads can invoke const methods on a TableBuilder without
// external synchronization, but if any of the threads may call a
// non-const method, all threads accessing the same TableBuilder must use
// external synchronization.
class LEVELDB_EXPORT TableBuilder {
 public:
  TableBuilder(const Options& options, WritableFile* file);
  TableBuilder(const TableBuilder&) = delete;
  TableBuilder& operator=(const TableBuilder&) = delete;
  /*
    Change the options used by this builder. Note: only some of the option
    fields can be changed after construction. If a field is not allowed to
    change dynamically and its value in the passed-in structure differs from
    the value in the structure used to construct this builder, this method
    returns an error without changing any field.
  */
  Status ChangeOptions(const Options& options);
  void Add(const Slice& key, const Slice& value);
  void Flush();
  Status status() const;
  Status Finish();
  /*
    Indicate that the contents of this builder should be abandoned. Stops using
    the file passed to the constructor after this function returns. If the
    caller is not going to call Finish(), it must call Abandon() before
    destroying this builder.
    Requires: Finish() and Abandon() have not been called.
  */
  void Abandon();
  uint64_t NumEntries() const;
  uint64_t FileSize() const;
 private:
  bool ok() const { return status().ok(); }
  void WriteBlock(BlockBuilder* block, BlockHandle* handle);
  void WriteRawBlock(const Slice& data, CompressionType, BlockHandle* handle);
  struct Rep;
  Rep* rep_;
};
}  // namespace leveldb
Summary :
The SSTable-related design occupies an important position in LevelDB as a whole. We walked through the details of multi-level SSTable merging and compaction, and the two different forms of compaction. The first is the simple compaction out of Level 0, which only needs to write the in-memory SSTable, i.e. the immutable memtable, down to disk; note that after it completes, another compaction attempt is usually scheduled.
The other is multi-level compaction, triggered among other things by frequent key lookups. It is far more complicated than the simple form, but it is the key to LevelDB's overall write and query performance.
Finally, LevelDB also shows off implementations of many classic data structures and algorithms. For key management it combines a skip list with merge sorting to improve efficiency; keeping the content sorted benefits not only queries but also sequential scans of the stored data.
Skiplist skip list
The skip list is used not only in LevelDb but also in many other middleware systems; it will be introduced separately in the next article.
Compaction merges keys via merge sort, while inside the database the critical [[LSM-Tree - LevelDb Skiplist skip table]] handles ordered key-value management alongside that merge sort. Before digging into the skip list's implementation details, you need to understand the basic concept of the data structure.
[[LevelDb skip table implementation]]
Bloom filter
Bloom Filter is a random data structure with high space efficiency. It uses a bit array to represent a collection succinctly and can judge whether an element belongs to this collection. This efficiency of Bloom Filter has a certain cost: when judging whether an element belongs to a certain set, it may mistake elements that do not belong to this set as belonging to this set (false positive). Therefore, Bloom Filter is not suitable for those "zero error" applications. In applications where a low error rate can be tolerated, Bloom Filter trades very few errors for great savings in storage space.
In leveldb, the Bloom filter is used to decide whether a given key might exist in an sstable: if the filter says it does not exist, the key definitely does not exist, which speeds up lookups.
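Enabling it takes a single option; here is a minimal sketch using the public API (the path is an arbitrary example, and 10 bits per key is the commonly used setting, giving roughly a 1% false-positive rate):

#include "leveldb/db.h"
#include "leveldb/filter_policy.h"

int main() {
  leveldb::Options options;
  options.create_if_missing = true;
  // Attach a Bloom filter policy so table files carry filter blocks.
  options.filter_policy = leveldb::NewBloomFilterPolicy(10);

  leveldb::DB* db;
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/testdb_bloom", &db);
  // ... use db ...
  delete db;
  delete options.filter_policy;  // the caller owns the policy object
  return 0;
}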
Bloom filters appear in many other open source components as well, so they also get a separate article:
[[LSM-Tree - LevelDb Bloom Filter]]
write at the end
LevelDB's design is genuinely interesting, and crucially most of the code comes with explanations and commentary.
There is a lot of source code, but it is not hard to analyze if you go through it carefully. Thank you for reading to the end.
References
- leveldb-handbook
- bigtable-leveldb
- Skip List--Skip List (one of the most detailed skip list articles on the entire network)
- Bloom Filter concept and principle