▌Directory
Design ideas
- memory table
- WAL
- Structure of SSTable
- Structure of SSTable elements and indexes
- SSTable Tree
- SSTable in memory
- data lookup process
- What is LSM-Tree
- References
- Overall structure
Implementation process
- file compression test
- Insertion test
- load test
- Lookup test
- SSTable structure
- SSTable file structure
- SSTable Tree structure and management of SSTable files
- Read SSTable file
- SSTable file merge
- SSTable lookup process
- Insert SSTable file procedure
- WAL file recovery process
- Binary Sort Tree Structure Definition
- insert operation
- find
- delete
- Traversal algorithm
- Representation of Key/Value
- Implementation of memory table
- WAL
- SSTable and SSTable Tree
- Simple usage test
While studying data structures some time ago, I came across the LSM Tree, so I tried to implement a simple KV database myself based on its design ideas.
The code is open source; the repository address is: https://github.com/whuanle/lsm
I implemented this LSM Tree database in Go. Because the implementation involves reading and writing files, locking, data lookup, file compaction, and so on, the coding process was also good practice with Go. The project also uses simple structures and algorithms such as stacks and binary sorting trees, which helps consolidate basic algorithm skills. Setting appropriately challenging goals for yourself is a good way to raise your technical level.
Next, let's look at the design ideas behind the LSM Tree and how to implement a KV database.
Design ideas
▌What is LSM-Tree
The full name of LSM Tree is Log-Structured Merge Tree, a data structure for key-value databases. As far as I know, NoSQL databases such as Cassandra and ScyllaDB currently use the LSM Tree.
The core insight behind the LSM Tree is that sequential disk writes are much faster than random writes. Whatever the database, disk IO is the biggest factor affecting read and write performance, so organizing database files sensibly and making full use of the disk's read/write characteristics can improve the performance of a database program. The LSM Tree first buffers all write operations in memory; when memory usage reaches a threshold, it flushes the memory to disk. This flush involves only sequential writes, no random writes, which gives the LSM Tree its superior write performance.
I will not repeat the basic concepts of the LSM Tree here; readers can refer to the materials listed below.
▌References
- " What is a LSM Tree? "
- Raw Cake: " Understanding LSM Tree: An Efficient Read and Write Storage Engine "
- https://mp.weixin.qq.com/s/7kdg7VQMxa4TsYqPfF8Yug
- Xiao Hansong: " Starting from 0: Implementing LSM Database with 500 Lines of Code "
- Little House Hero: " Golang Practices LSM Related Content "
- " SM-based storage techniques: a survey " Chinese translation
▌Overall structure
The following figure shows the overall structure of the LSM Tree, which can be divided into two parts: memory and disk files. Besides the database files themselves (SSTable files), the disk also holds WAL log files.
The memory table buffers write operations. When a Key/Value is written to the memory table, it is also recorded in the WAL file. The WAL file can serve as the basis for restoring the memory table data: when the program starts, if a WAL file is found in the directory, it must be read to rebuild the program's memory table.
On disk there are multiple layers of database files, and each layer can contain several SSTable files, which store the data, i.e. they are the database files. The files of each layer are generated by compacting and merging the files of the layer above, so the deeper the layer, the larger its database files.
Let's take a closer look at the design of each part of the LSM Tree, and what stages read and write operations go through.
memory table
In the memory area of the LSM Tree there are two memory tables: a mutable Memory Table and an immutable Immutable Memory Table. Both have the same data structure, typically a binary sorting tree.
At the beginning the database holds no data: the Memory Table is empty, i.e. has no elements, and the Immutable Memory Table is nil, i.e. no memory has been allocated. At this point all write operations act on the Memory Table, where writes include setting the value of a Key and deleting a Key. Whenever a write to the Memory Table succeeds, the operation is also recorded in the WAL log file.
Of course, the Memory Table should not hold too many Key/Value pairs, or it would consume too much memory. Therefore, when the number of Keys in the Memory Table reaches a threshold, the Memory Table becomes the Immutable Memory Table and a new Memory Table is created. The Immutable Memory Table is converted to an SSTable at a suitable time and stored in a disk file.
The Immutable Memory Table is therefore a temporary object that exists only while in-memory elements are being synchronized to an SSTable.
Note also that when the memory table is synchronized to an SSTable, the WAL file needs to be deleted. The data recoverable from the WAL file should be consistent with the KV elements currently in memory; in other words, the WAL file restores the program's last running state. Once the current memory table has been flushed to an SSTable, there is no need to keep the WAL file; it should be deleted and an empty WAL file recreated.
There are different approaches to the WAL: some implementations keep a single global WAL file, others use multiple WAL files. The specific implementation varies with the scenario.
WAL
WAL stands for Write Ahead Log. When performing a write operation (inserting, modifying, or deleting a Key), the data lives only in memory, so to avoid losing it if the program crashes or the host shuts down, the write operation must be recorded in the WAL file in time. The next time the program starts, it can read the operation records from the WAL file and replay them to restore the state before the previous exit.
The WAL records every operation applied to the current memory table. To restore the previous memory table from the WAL, each operation must be read from the WAL file and re-applied to the memory table, that is, the write operations are re-executed. Writing directly to the memory table and rebuilding the memory table by replaying the WAL therefore produce the same result.
It can be said that the WAL records the operation process, while the binary sorting tree stores the final result.
What the WAL needs to do is replay all write operations against the memory table in order, restoring the memory table to its previous state.
The WAL file is not a binary backup of the memory table. It is a backup of the write operations, and restoring from it replays the operation process; it does not copy memory data.
Structure of SSTable
The full name of SSTable is Sorted String Table; it is the persisted form of the memory table.
The SSTable file consists of three parts: data area, sparse index area, and metadata, as shown in the following figure.
When the memory table is converted to an SSTable, the Immutable Memory Table is traversed first: each KV is serialized to binary data in turn, and a corresponding index structure is created recording the insert position and length of that binary KV. All binary KVs are placed at the beginning of the disk file, then all index structures are serialized and written after the data area. Finally, information about the data area and the index area is packed into a metadata structure and written to the end of the file.
Each element in memory has a Key. When the memory table is converted into an SSTable, the element set is sorted by Key, and the elements are then serialized and stored at the beginning of the file, i.e. in the data area.
But how do we separate the individual elements within the data area?
Different developers structure their SSTables differently, and convert the memory table to an SSTable in different ways, so here I only describe my own practice when writing this LSM Tree.
My approach is that, when generating the data area, the element set is not serialized to binary all at once; instead the elements are traversed and processed one by one.
First, serialize one Key/Value element to binary and place it at the beginning of the file; then generate an index recording the start position and length of this element's binary data within the file, and keep that index in memory for now.
Then continue processing the remaining elements, generating the corresponding indexes in memory.
Each index in the sparse index points to one data block in the file.
When all the elements have been processed, the data area of the SSTable file is complete. Next, the whole index set is serialized and appended to the file.
Then we also need to write file metadata recording the start position and length of the data area and the sparse index area, so that when the file is later read, the two areas can be split apart and processed separately.
The metadata structure is also very simple, with four main values:
// start offset of the data area
dataStart int64
// length of the data area
dataLen int64
// start offset of the sparse index area
indexStart int64
// length of the sparse index area
indexLen int64
Metadata is appended to the end of the file and has a fixed size in bytes.
When reading an SSTable file, we first read the last few bytes of the file (64 bytes, for example), restore the field values 8 bytes at a time to rebuild the metadata, and then use it to process the data area and the sparse index area.
Structure of SSTable elements and indexes
We store Key/Values in the data area; the chunk of the file that holds one Key/Value element is called a block. To represent a Key/Value, we can define a structure like this:
`Key
Value
Deleted`
This structure is then converted into binary data and written to the data area of the file.
In order to locate the position of a Key/Value in the data area, we also need to define an index, whose structure is as follows:
`Key
Start
Length
`
Each Key/Value is located using an index.
SSTable Tree
Every time a memory table is converted to an SSTable, a new SSTable file is generated, so we need to manage the SSTable files to keep their number from growing too large.
The following is the SSTable file organization structure of LSM Tree.
As you can see in the figure above, the database consists of many SSTable files, separated into different layers. To manage the SSTables of the different layers, all SSTable disk files are organized in a tree structure; through this SSTable Tree, the total file size or the number of SSTables at each layer is managed.
There are three main points about SSTable Tree:
1. The SSTable files of layer 0 are all converted from memory tables.
2. Except for layer 0, the SSTable files of each layer can only be generated by compacting and merging the SSTable files of the layer above. When the total file size or the number of files in one layer reaches a threshold, its SSTables are merged and the resulting new SSTable is inserted into the next layer.
3. The SSTables within each layer are ordered, sorted by generation time; this property is used when searching for data across all SSTables.
Since an SSTable file is created every time the memory table is persisted, the number of SSTable files keeps growing. With more files, more file handles must be kept open, and reading data spread across many files becomes slower. Left unchecked, too many files lead to poor read performance and a bloated footprint, phenomena known as space amplification and read amplification.
Since an SSTable cannot be changed, deleting a Key or modifying the value of a Key can only be recorded as a mark in a newer SSTable, never applied in place. As a result, the same Key can exist in multiple SSTables, bloating the files.
Therefore, it is also necessary to compact the small SSTable files, merging them into a large SSTable file placed in the next layer, in order to improve read performance.
When the total size of the SSTable files in one layer exceeds the threshold, or when there are too many SSTable files, a merge is triggered: a new SSTable file is generated and placed in the next layer, and the original SSTable files are deleted. The following figure shows this process.
Although merging and compacting SSTables suppresses space amplification and read amplification, merging several SSTables into one requires loading every SSTable file, reading its contents into memory, creating a new SSTable file, and deleting the old files. This consumes a lot of CPU time and disk IO, a phenomenon called write amplification.
The figure below demonstrates the storage space change before and after the merge.
SSTable in memory
When the program starts, the metadata and sparse index area of each SSTable are loaded into memory; in effect, each SSTable caches its Key list in memory. To look up a Key in an SSTable, first search the in-memory sparse index. If the Key is found, read the binary Key/Value data from the disk file according to the index's Start and Length, then deserialize it into a Key/Value structure.
Therefore, determining whether a Key exists in an SSTable happens entirely in memory, which is very fast; the file only needs to be read when the Key's value is actually required.
However, when the number of Keys is very large, caching them all consumes a lot of memory, and checking them one by one takes time. A Bloom filter can also be used to determine more quickly whether a Key exists.
data lookup process
First, query from the Memory Table according to the Key to be found.
If the corresponding Key cannot be found in the Memory Table, it will be searched from the Immutable Memory Table.
In the LSM Tree database written by the author, there is only Memory Table and no Immutable Memory Table.
If the Key cannot be found in both memory tables, it must be searched from the SSTable list.
First query the SSTables of layer 0, starting from the newest SSTable in that layer; if the Key is not found, query the other SSTables in the same layer; if still not found, move on to the next layer.
Once a Key is found, regardless of its state (valid or deleted), the search stops and the Key's value and delete flag are returned.
Implementation process
In this section I explain the general implementation of the LSM Tree and give some code examples, but the complete code should be read in the repository. Only the implementation-related definitions are shown here, not all the code details.
The following figure shows the main components of the LSM Tree:
For the memory table, we need to implement insert, delete, lookup, and traversal;
For WAL, the operation information needs to be written to the file, and the memory table can be restored from the WAL file;
For SSTable, it can load file information and find corresponding data from it;
For the SSTable Tree, we need to manage all SSTables, merge files, and so on.
▌Representation of Key/Value
As a Key/Value database, we need to be able to store values of any type. Although Go 1.18 adds generics, a generic structure cannot hold arbitrary values of any type, so generics do not solve the problem of storing various kinds of Value, and I do not use them. Moreover, whatever data is stored is unimportant to the database; the database does not need to understand the meaning of the Value at all. The type and meaning of the value matter only to the user, so we can simply store the value as binary, and when the user fetches the data, convert the binary back to the corresponding type.
Define a struct to hold any type of value:
// Value represents one KV pair
type Value struct {
Key string
Value []byte
Deleted bool
}
The Value structure reference path is kv.Value.
If there is a struct like this:
type TestValue struct {
A int64
B int64
C int64
D string
}
Then you can put the serialized binary data of the structure into the Value field.
data, _ := json.Marshal(value)
v := Value{
Key: "test",
Value: data,
Deleted: false,
}
The Key/Value's value is serialized via json into binary and stored in memory.
Because in the LSM Tree even a deleted Key's element is not cleaned up, only marked as deleted, we need to define an enumeration for lookup results: whether the Key was found, and if found, whether it is still valid.
// SearchResult is the result of a lookup
type SearchResult int
const (
// None: the key was not found
None SearchResult = iota
// Deleted: the key has been deleted
Deleted
// Success: the key was found
Success
)
For the code part, readers can refer to:
https://github.com/whuanle/lsm/blob/1.0/kv/Value.go
▌Implementation of memory table
The memory table in the LSM Tree is a binary sorting tree. Its main operations are setting a value, inserting, searching, and traversing. The detailed code can be found in the repository.
The following is a brief description of the implementation of the binary sorted tree.
Assuming that the Key list we want to insert is [30, 45, 25, 23, 17, 24, 26, 28], then after insertion, the structure of the memory table is as follows:
While writing the binary sorting tree, I found several error-prone places, so I list them here.
First of all, remember: after a node is inserted, its position does not change; it cannot be removed or moved.
The first point: a newly inserted node can only become a leaf.
The following is a correct insert operation:
As shown in the figure, 23, 17, and 24 already exist, so when 18 is inserted, it must become the right child of 17.
The following is an incorrect insert operation:
When performing an insert operation, the position of the old node cannot be moved, and the relationship between the left child and the right child cannot be changed.
The second point: when deleting a node, it can only be marked as deleted; it cannot actually be removed.
Binary Sort Tree Structure Definition
The structures and methods of a binary sorted tree are defined as follows:
// treeNode is a node of the sorted tree
type treeNode struct {
KV kv.Value
Left *treeNode
Right *treeNode
}
// Tree is the sorted tree
type Tree struct {
root *treeNode
count int
rWLock *sync.RWMutex
}
// Search looks up the value of a Key
func (tree *Tree) Search(key string) (kv.Value, kv.SearchResult) {
}
// Set sets the value of a Key and returns the old value
func (tree *Tree) Set(key string, value []byte) (oldValue kv.Value, hasOld bool) {
}
// Delete removes a Key and returns the old value
func (tree *Tree) Delete(key string) (oldValue kv.Value, hasOld bool) {
}
For specific code implementation, please refer to:
https://github.com/whuanle/lsm/blob/1.0/sortTree/SortTree.go
Because Go's string is a value type that can be compared directly, a lot of code can be simplified when inserting Key/Values.
insert operation
Because the tree is ordered, inserting a Key/Value requires comparing Keys from the root of the tree downward, and the new node is inserted into the tree as a leaf.
The insertion process falls into several cases.
The first: when the Key does not exist yet, the node is inserted directly as a leaf, becoming the left or right child of an existing element.
if key < current.KV.Key {
// the left child is empty, insert on the left
if current.Left == nil {
current.Left = newNode
// ... ...
}
// continue comparing at the next level
current = current.Left
} else {
// the right child is empty, insert on the right
if current.Right == nil {
current.Right = newNode
// ... ...
}
current = current.Right
}
The second: when the Key already exists, the node may be valid, in which case we just replace the Value; or the node may be marked as deleted, in which case we replace the Value and reset the Deleted flag to false.
node.KV.Value = value
isDeleted := node.KV.Deleted
node.KV.Deleted = false
So, what is the time complexity when inserting a Key/Value into a binary sorted tree?
If the binary sorting tree is relatively balanced, i.e. roughly symmetrical left and right, the time complexity of insertion is O(log n).
As shown in the figure below, the tree has 7 nodes in only three levels, so an insertion needs at most three comparisons.
If the binary sorting tree is unbalanced, the worst case is that all nodes lie on one side, and the time complexity of insertion is O(n).
As shown in the figure below, the tree has four nodes in four levels, so an insertion needs at most four comparisons.
Please refer to the code for inserting nodes:
https://github.com/whuanle/lsm/blob/5ea4f45925656131591fc9e1aa6c3678aca2a72b/sortTree/SortTree.go#L64
find
When searching for a Key in the binary sorting tree, the left or right child is chosen for the next level according to the comparison of the Keys. A search code example is as follows:
currentNode := tree.root
// ordered lookup
for currentNode != nil {
if key == currentNode.KV.Key {
if currentNode.KV.Deleted == false {
return currentNode.KV, kv.Success
} else {
return kv.Value{}, kv.Deleted
}
}
if key < currentNode.KV.Key {
// continue comparing at the next level
currentNode = currentNode.Left
} else {
// continue comparing at the next level
currentNode = currentNode.Right
}
}
Its time complexity is consistent with insertion.
To find the code, please refer to: https://github.com/whuanle/lsm/blob/5ea4f45925656131591fc9e1aa6c3678aca2a72b/sortTree/SortTree.go#L34
delete
When deleting, you only need to find the corresponding node, clear its Value, and set the delete flag; the node itself cannot be removed.
currentNode.KV.Value = nil
currentNode.KV.Deleted = true
Its time complexity is consistent with insertion.
Please refer to the deletion code: https://github.com/whuanle/lsm/blob/5ea4f45925656131591fc9e1aa6c3678aca2a72b/sortTree/SortTree.go#L125
Traversal algorithm
Reference code: https://github.com/whuanle/lsm/blob/5ea4f45925656131591fc9e1aa6c3678aca2a72b/sortTree/SortTree.go#L175
To traverse the nodes of a binary sorting tree in order, recursion is the simplest approach, but when the tree is very deep, recursion consumes a lot of stack space, so we use an explicit stack to traverse the tree in order and collect all the nodes.
In the Go language, the stack is implemented using slices:
https://github.com/whuanle/lsm/blob/1.0.0/sortTree/Stack.go
The sorted traversal of a binary sorting tree is actually an in-order traversal; after an in-order traversal, the Keys of the resulting node set are guaranteed to be in order.
The reference code is as follows:
// use a stack instead of recursion; the stack is backed by a slice, which grows automatically, so there is no need to worry about it filling up
stack := InitStack(tree.count / 2)
values := make([]kv.Value, 0)
tree.rWLock.RLock()
defer tree.rWLock.RUnlock()
// collect the tree's elements from smallest to largest
currentNode := tree.root
for {
if currentNode != nil {
stack.Push(currentNode)
currentNode = currentNode.Left
} else {
popNode, success := stack.Pop()
if success == false {
break
}
values = append(values, popNode.KV)
currentNode = popNode.Right
}
}
Traversal code:
https://github.com/whuanle/lsm/blob/33d61a058d79645c7b20fd41f500f2a47bc95357/sortTree/SortTree.go#L175
The stack capacity defaults to half the number of tree nodes, which is about right if the tree is balanced. It is also unnecessary to push all nodes before reading: as soon as a node has no left child, elements can be popped from the stack.
If the tree is unbalanced, the stack space actually needed may be larger, but since the stack is backed by a slice, it expands automatically when space runs out.
The traversal process is shown in the following animation:
Animation is not easy to make~
As you can see, how much stack space is required is related to the height of the binary tree.
▌WAL
The structure of WAL is defined as follows:
type Wal struct {
f *os.File
path string
lock sync.Locker
}
WAL requires two capabilities:
1. When the program starts, it can read the content of the WAL file and restore it to a memory table (binary sorting tree).
2. After the program is started, when writing or deleting the memory table, the operation should be written into the WAL file.
Reference Code:
https://github.com/whuanle/lsm/blob/1.0/wal/Wal.go
Let me explain my WAL implementation.
Here is the simplified code to write to the WAL file:
// write a log record
func (w *Wal) Write(value kv.Value) {
data, _ := json.Marshal(value)
err := binary.Write(w.f, binary.LittleEndian, int64(len(data)))
err = binary.Write(w.f, binary.LittleEndian, data)
}
As you can see, an 8-byte length is written first, and then the serialized Key/Value.
To correctly restore data from the WAL file at startup, the WAL file must be properly delimited so that each operation can be read back correctly.
Therefore, each element written to the WAL records its own length first, represented as an int64, which occupies 8 bytes.
WAL file recovery process
The previous section wrote one element into the WAL file: its length followed by the element data. The file structure of the WAL can therefore be viewed like this:
Therefore, when restoring data from a WAL file, first read the 8 bytes at the start of the file to determine the byte count n of the first element, then load the bytes in the range 8 ~ (8+n) into memory and deserialize them into a kv.Value via json.Unmarshal().
Next, read the 8 bytes at positions (8+n) ~ (8+n)+8 to determine the length of the next element, and so on, reading the entire WAL file piece by piece.
Generally, the WAL file is not very large, so at startup the recovery process can load the whole WAL file into memory, read and deserialize the records one by one, identify whether each operation is a Set or a Delete, and call the binary sorting tree's Set or Delete methods to apply it.
The reference code can be found at:
https://github.com/whuanle/lsm/blob/4faddf84b63e2567118f0b34b5d570d1f9b7a18b/wal/Wal.go#L43
▌SSTable and SSTable Tree
The SSTable involves a lot of code, which can be divided into three parts: saving the SSTable file, parsing an SSTable from a file, and searching for a Key.
The list of all my SSTable code files is as follows:
SSTable structure
The structure of SSTable is defined as follows:
// SSTable, stored in a disk file
type SSTable struct {
// file handle
f *os.File
filePath string
// metadata
tableMetaInfo MetaInfo
// the file's sparse index list
sparseIndex map[string]Position
// sorted list of keys
sortIndex []string
lock sync.Locker
}
The elements in sortIndex are ordered and stored contiguously in memory, which is friendly to the CPU cache and improves search performance. A Bloom filter can also be used to quickly determine whether a Key exists in the SSTable.
Once the SSTable is determined, the element's index is looked up in sparseIndex so that the element can be located in the file.
The structures of metadata and sparse indexes are defined as follows:
type MetaInfo struct {
// version number
version int64
// start offset of the data area
dataStart int64
// length of the data area
dataLen int64
// start offset of the sparse index area
indexStart int64
// length of the sparse index area
indexLen int64
}
// Position locates an element; stored in the sparse index area, it records the element's start position and length
type Position struct {
// start offset
Start int64
// length
Len int64
// whether the Key has been deleted
Deleted bool
}
As you can see, besides pointing to a disk file, an SSTable structure also needs to cache some data in memory, though different developers do this differently. In my implementation this layout was fixed from the start: the Key list is cached in memory, and a dictionary caches the element positions.
// the file's sparse index list
sparseIndex map[string]Position
// sorted list of keys
sortIndex []string
In fact, keeping only sparseIndex map[string]Position is enough for all search operations; sortIndex []string is not strictly necessary.
SSTable file structure
SSTable files are divided into three parts: the data area, the sparse index area, and the metadata/file index. The stored content depends on the data structures defined by the developer, as shown below:
The data area is a list of serialized Value structures, and the sparse index area is a list of serialized Positions. However, the two areas are serialized differently.
The sparse index area is the binary serialization of a map[string]Position, so when reading the file we can deserialize the entire sparse index area directly into a map[string]Position.
The data area is built by appending individually serialized kv.Value records, so the whole data area cannot be deserialized into a []kv.Value at once. Each block can only be read from the data area step by step via its Position and then deserialized into a kv.Value.
SSTable Tree structure and management of SSTable files
To organize a large number of SSTable files, we also need a structure that manages all the disk files in a hierarchical layout.
We define a TableTree structure as follows:
// TableTree
type TableTree struct {
levels []*tableNode // an array of linked lists
// avoids conflicts when inserting, compacting, or deleting SSTables
lock *sync.RWMutex
}
// linked list of the SSTables in one layer
type tableNode struct {
index int
table *SSTable
next *tableNode
}
To make it easy to layer the SSTables and mark their insertion order, we need a naming convention for the SSTable files.
As shown in the following file:
├── 0.0.db
├── 1.0.db
├── 2.0.db
├── 3.0.db
├── 3.1.db
├── 3.2.db
An SSTable file name has the form {level}.{index}.db: the first number is the level the file belongs to, and the second number is the file's index within that level.
The larger the index, the newer the file.
Insert SSTable file procedure
When the memory table is converted to an SSTable, each newly converted SSTable is inserted at the end of level 0.
Each layer of SSTable is managed using a linked list:
type tableNode struct {
index int
table *SSTable
next *tableNode
}
Therefore, inserting an SSTable means walking down the layer's linked list and appending the table at its end.
An example of the code section for inserting a node into a linked list is as follows:
for node != nil {
if node.next == nil {
newNode.index = node.index + 1
node.next = newNode
break
} else {
node = node.next
}
}
When converting from memory table to SSTable, it will involve more operations, please refer to the code: https://github.com/whuanle/lsm/blob/1.0/ssTable/createTable.go
Read SSTable file
When the program starts, it needs to read all SSTable files in the directory into TableTree, and then load the sparse index area and metadata of each SSTable.
My LSM Tree's loading process is shown in the figure:
Loading these files took a total of 19.4259983s.
The code for the loading process is in:
https://github.com/whuanle/lsm/blob/1.0/ssTable/Init.go
Below I describe the general loading process.
First read all .db files in the directory:
infos, err := ioutil.ReadDir(dir)
if err != nil {
log.Println("Failed to read the database file")
panic(err)
}
for _, info := range infos {
// if it is an SSTable file
if path.Ext(info.Name()) == ".db" {
tree.loadDbFile(path.Join(dir, info.Name()))
}
}
Then create an SSTable object to load the file's metadata and sparse index area:
// Load the table's metadata along with the file handle
table.loadMetaInfo()
// Load the sparse index area
table.loadSparseIndex()
Finally, based on the .db file name, the table is inserted at the corresponding position in the TableTree.
SSTable file merge
When there are too many SSTable files in one level, or the files grow too large, the SSTables of that level need to be merged into a new SSTable with no duplicate elements and placed in the next level.
To do this, after the program starts, a separate goroutine periodically checks whether the memory table needs to be converted to an SSTable and whether any SSTable level needs to be compacted. Starting from level 0, two thresholds are checked: the number of SSTables at that level, and the total file size of the SSTables at that level.
The SSTable file merging thresholds are set when the program starts:
lsm.Start(config.Config{
DataDir: `E:\项目\lsm数据测试目录`,
Level0Size: 1, // threshold for the total size of all SSTable files at level 0
PartSize: 4, // threshold for the number of SSTables per level
Threshold: 500, // memory table element threshold
CheckInterval: 3, // compaction check interval
})
The size threshold of each level is derived from level 0, with each level 10 times the previous one. For example, if level 0 is set to 1 MB, then level 1 is 10 MB and level 2 is 100 MB, so users only need to set the total file size threshold for level 0.
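Under that convention, the per-level threshold can be derived from the level-0 setting with repeated multiplication; a sketch (the function name levelMaxSize is my own, not the repository's):

```go
package main

import "fmt"

// levelMaxSize derives the size threshold (in MB) of a given level
// from the level-0 threshold: each level is 10x the previous one.
func levelMaxSize(level0Size int, level int) int {
	size := level0Size
	for i := 0; i < level; i++ {
		size *= 10
	}
	return size
}

func main() {
	for level := 0; level <= 2; level++ {
		fmt.Printf("level %d threshold: %d MB\n", level, levelMaxSize(1, level))
	}
	// level 0 threshold: 1 MB
	// level 1 threshold: 10 MB
	// level 2 threshold: 100 MB
}
```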
The following describes the SSTable file merging process.
For the complete code of compaction, please refer to: https://github.com/whuanle/lsm/blob/1.0/ssTable/compaction.go
Here is the initial file tree:
First create a binary sorted tree object:
memoryTree := &sortTree.Tree{}
Then, in level 0, starting from the SSTable with the smallest index, each block in the file's data area is read, deserialized, and applied as an insert or a delete operation:
for k, position := range table.sparseIndex {
if !position.Deleted {
// decode the block and insert the live value
value, err := kv.Decode(newSlice[position.Start:(position.Start + position.Len)])
if err != nil {
log.Fatal(err)
}
memoryTree.Set(k, value.Value)
} else {
// replay the delete marker
memoryTree.Delete(k)
}
}
All SSTables of level 0 are loaded into the binary sorted tree, i.e. all their elements are merged.
The binary sorted tree is then converted into an SSTable and inserted into level 1.
Next, all level-0 SSTable files are deleted.
Note that because this compaction method loads whole files into memory and stores the file data in slices, it can run out of memory when the files are large; this is something to watch out for.
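The replay order matters: reading tables from the smallest index to the largest means newer writes and deletes overwrite older ones. A minimal sketch of that merge, using a plain map in place of the binary sorted tree (simplified; the real code streams SSTable blocks from disk):

```go
package main

import (
	"fmt"
	"sort"
)

type record struct {
	key     string
	value   string
	deleted bool
}

// merge replays records from the oldest table to the newest; later
// writes and deletes overwrite earlier ones, leaving one live value
// per key.
func merge(tables [][]record) map[string]string {
	merged := map[string]string{}
	for _, table := range tables { // oldest table first
		for _, r := range table {
			if r.deleted {
				delete(merged, r.key)
			} else {
				merged[r.key] = r.value
			}
		}
	}
	return merged
}

func main() {
	tables := [][]record{
		{{key: "a", value: "1"}, {key: "b", value: "1"}}, // like 0.0.db
		{{key: "a", value: "2"}, {key: "b", deleted: true}}, // like 0.1.db
	}
	merged := merge(tables)
	keys := make([]string, 0, len(merged))
	for k := range merged {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, merged[k]) // a 2
	}
}
```

Replaying delete markers during the merge is what lets the new SSTable drop dead keys instead of carrying tombstones forward forever.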
SSTable lookup process
For the complete code, please refer to:
https://github.com/whuanle/lsm/blob/1.0/ssTable/Search.go
When an element needs to be looked up, the memory table is searched first; if the element is not found there, the SSTables in the TableTree are searched one by one.
// iterate over the SSTables at each level
for _, node := range tree.levels {
// collect the SSTable list
tables := make([]*SSTable, 0)
for node != nil {
tables = append(tables, node.table)
node = node.next
}
// the search must start from the last (newest) SSTable
for i := len(tables) - 1; i >= 0; i-- {
value, searchResult := tables[i].Search(key)
// not found, try the next SSTable
if searchResult == kv.None {
continue
} else { // found, or marked deleted: return the result
return value, searchResult
}
}
}
When searching inside SSTable, binary search method is used:
// element position
var position Position = Position{
Start: -1,
}
l := 0
r := len(table.sortIndex) - 1
// binary search to check whether the key exists
for l <= r {
mid := int((l + r) / 2)
if table.sortIndex[mid] == key {
// get the element's position
position = table.sparseIndex[key]
// if the element has been deleted, return
if position.Deleted {
return kv.Value{}, kv.Deleted
}
break
} else if table.sortIndex[mid] < key {
l = mid + 1
} else if table.sortIndex[mid] > key {
r = mid - 1
}
}
if position.Start == -1 {
return kv.Value{}, kv.None
}
That concludes the implementation of the LSM Tree database. Next, let's look at its performance and how to use it.
▌Simple usage test
Sample code location:
https://gist.github.com/whuanle/1068595f46824466227b93ef583499d3
First download the dependency package:
go get -u github.com/whuanle/lsm@v1.0.0
Then use lsm.Start() to initialize the database, and then add, delete, check and change the Key. The sample code is as follows:
package main
import (
"fmt"
"github.com/whuanle/lsm"
"github.com/whuanle/lsm/config"
)
type TestValue struct {
A int64
B int64
C int64
D string
}
func main() {
lsm.Start(config.Config{
DataDir: `E:\项目\lsm数据测试目录`,
Level0Size: 1,
PartSize: 4,
Threshold: 500,
CheckInterval: 3, // compaction check interval
})
// 64 bytes
testV := TestValue{
A: 1,
B: 1,
C: 3,
D: "00000000000000000000000000000000000000",
}
lsm.Set("aaa", testV)
value, success := lsm.Get[TestValue]("aaa")
if success {
fmt.Println(value)
}
lsm.Delete("aaa")
}
testV is 64 bytes; the kv.Value that wraps it is 131 bytes.
File compression test
We can generate keys by taking every combination of 6 letters from the 26 lowercase letters, insert them into the database, and observe file compaction and merging as well as the insertion speed.
Number of elements inserted at different loop depths:
List of generated test files:
The file compaction and merging process is shown in the following animation (about 20 seconds):
Insertion test
Below are some rough test results.
Set the configuration when starting the database:
lsm.Start(config.Config{
DataDir: `E:\项目\lsm数据测试目录`,
Level0Size: 10, // total size of level-0 SSTable files
PartSize: 4, // number of files per level
Threshold: 3000, // memory table threshold
CheckInterval: 3, // compaction check interval
})
lsm.Start(config.Config{
DataDir: `E:\项目\lsm数据测试目录`,
Level0Size: 100,
PartSize: 4,
Threshold: 20000,
CheckInterval: 3,
})
Insert data:
func insert() {
// 64 bytes
testV := TestValue{
A: 1,
B: 1,
C: 3,
D: "00000000000000000000000000000000000000",
}
count := 0
start := time.Now()
key := []byte{'a', 'a', 'a', 'a', 'a', 'a'}
lsm.Set(string(key), testV)
for a := 0; a < 1; a++ {
for b := 0; b < 1; b++ {
for c := 0; c < 26; c++ {
for d := 0; d < 26; d++ {
for e := 0; e < 26; e++ {
for f := 0; f < 26; f++ {
key[0] = 'a' + byte(a)
key[1] = 'a' + byte(b)
key[2] = 'a' + byte(c)
key[3] = 'a' + byte(d)
key[4] = 'a' + byte(e)
key[5] = 'a' + byte(f)
lsm.Set(string(key), testV)
count++
}
}
}
}
}
}
elapse := time.Since(start)
fmt.Println("Insert complete, count:", count, ", elapsed:", elapse)
}
In both tests, the total file size of the generated SSTable is about 82MB.
Time spent on two tests:
Insert complete, count: 456976 , elapsed: 1m43.4541747s
Insert complete, count: 456976 , elapsed: 1m42.7098146s
So at 131 bytes per element, the database inserts about 450,000 records in roughly 100 s, i.e. about 4,500 records per second.
With a larger kv.Value (3,231 bytes in this test), inserting 456,976 records produces about 1.5 GB of files and takes 2m10.8385817s, i.e. about 3,500 records per second.
Insert kv.Value of larger value, code example: https://gist.github.com/whuanle/77e756801bbeb27b664d94df8384b2f9
Load test
The following is a list of SSTable files after inserting 450,000 pieces of data when each element is 3231 bytes. When the program starts, we need to load these files.
2022/05/21 21:59:30 Loading wal.log...
2022/05/21 21:59:32 Loaded wal.log,Consumption of time : 1.8237905s
2022/05/21 21:59:32 Loading database...
2022/05/21 21:59:32 The SSTable list are being loaded
2022/05/21 21:59:32 Loading the E:\项目\lsm数据测试目录/1.0.db
2022/05/21 21:59:32 Loading the E:\项目\lsm数据测试目录/1.0.db ,Consumption of time : 92.9994ms
2022/05/21 21:59:32 Loading the E:\项目\lsm数据测试目录/1.1.db
2022/05/21 21:59:32 Loading the E:\项目\lsm数据测试目录/1.1.db ,Consumption of time : 65.9812ms
2022/05/21 21:59:32 Loading the E:\项目\lsm数据测试目录/2.0.db
2022/05/21 21:59:32 Loading the E:\项目\lsm数据测试目录/2.0.db ,Consumption of time : 331.6327ms
2022/05/21 21:59:32 The SSTable list are being loaded,consumption of time : 490.6133ms
As you can see, apart from the WAL, which is slow to load because each record must be inserted into memory one by one, the SSTable files load relatively quickly.
Lookup test
If all elements are in memory, lookups are very fast even with 450,000 records; for example, looking up aaaaaa (the smallest key) or aazzzz (the largest key) takes very little time.
The test below uses a database file with about 3 KB per element.
Lookup code (note that time.Since must be taken after the Get call so that the lookup is actually measured):
start := time.Now()
v, _ := lsm.Get[TestValue]("aaaaaa") // or aazzzz
elapse := time.Since(start)
fmt.Println("Lookup complete, elapsed:", elapse)
fmt.Println(v)
If the lookup has to go to the SSTables, aaaaaa, which was written first, ends up in the bottom-level SSTable file and therefore takes longer to find.
SSTable file list:
├── 1.0.db 116MB
├── 2.0.db 643MB
├── 2.1.db 707MB
about 1.5 GB in total
aaaaaa is in 2.0.db; the lookup searches the files in the order 1.0.db, 2.1.db, 2.0.db.
Query speed test:
2022/05/22 08:25:43 Get aaaaaa
Lookup of aaaaaa complete, elapsed: 19.4338ms
2022/05/22 08:25:43 Get aazzzz
Lookup of aazzzz complete, elapsed: 0s
Regarding the author's LSM Tree database, I will introduce it here. For the detailed implementation code, please refer to the Github repository.