Implement a kv storage engine from scratch

I wrote this article to help more people understand rosedb. In it, I implement from scratch a simple kv storage engine that supports PUT, GET, and DELETE operations. You can think of it as a simplified version of rosedb, so let's call it minidb (a mini version of rosedb).

Whether you are a Go beginner, want to improve your Go skills, or are interested in kv storage, try implementing it yourself. I believe it will help you a great deal.

When it comes to storage, the core problem is how to store data and how to retrieve it. In the computer world, this problem takes many forms.

A computer has memory and disks. Memory is volatile: everything stored in it is lost on power failure. So if the data is to survive a crash and restart, it has to be stored on a non-volatile medium, most commonly a disk.

Therefore, for a stand-alone kv store, we need to design how data is laid out in memory and how it is laid out on disk.

Of course, many excellent predecessors have already explored this problem, and the classic summary divides data storage models into two main types: the B+ tree and the LSM tree.

These two models are not the focus of this article, so I will only introduce them briefly.

B+ tree


The B+ tree evolved from the binary search tree. Letting each node hold many keys and children reduces the height of the tree, lets a node map onto a disk page, and minimizes the number of disk IO operations.

B+ tree query performance is relatively stable. A write or update first locates the position on disk and then modifies it in place. Note that this is random IO, and a large number of insertions or deletions may also trigger page splits and merges. Write performance is therefore only average, so the B+ tree suits read-heavy workloads with relatively few writes.

LSM tree


The LSM tree (Log-Structured Merge Tree) is not actually a specific tree data structure but a data storage model. Its core idea rests on the fact that sequential IO is much faster than random IO.

Unlike the B+ tree, an LSM tree records every insertion, update, and deletion as a log record and appends it to a disk file, so that all writes are sequential IO.

LSM is better suited to write-heavy workloads with relatively few reads.

With these two basic storage models in mind, you should have a basic picture of how data is stored and retrieved. minidb is based on a simpler storage structure that broadly resembles LSM.

Rather than describe this model in the abstract, I will walk through a simple example of how PUT, GET, and DELETE work in minidb, so that you can understand this simple storage model.

PUT

Suppose we need to store a piece of data, that is, a key and a value. First, to prevent data loss, we wrap the key and value in a record (called an Entry here) and append it to the disk file. An Entry contains roughly the key, the value, the key size, the value size, and the write time.
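
To make the record layout concrete, here is a minimal sketch of how such an Entry could be encoded to bytes and decoded again. The field widths, the big-endian byte order, the 16-byte header, and the helper names `newEntry`, `Encode`, and `Decode` are my assumptions for illustration, not necessarily what minidb does.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"time"
)

// headerSize is an assumed fixed-length prefix:
// key size (4 bytes) + value size (4 bytes) + write time (8 bytes).
const headerSize = 16

// Entry is one record in the data file (field names are assumptions).
type Entry struct {
	Key       []byte
	Value     []byte
	Timestamp int64 // write time in Unix nanoseconds
}

// newEntry is a hypothetical constructor that stamps the current time.
func newEntry(key, value []byte) *Entry {
	return &Entry{Key: key, Value: value, Timestamp: time.Now().UnixNano()}
}

// Encode lays the entry out as header, then key, then value.
func (e *Entry) Encode() []byte {
	buf := make([]byte, headerSize+len(e.Key)+len(e.Value))
	binary.BigEndian.PutUint32(buf[0:4], uint32(len(e.Key)))
	binary.BigEndian.PutUint32(buf[4:8], uint32(len(e.Value)))
	binary.BigEndian.PutUint64(buf[8:16], uint64(e.Timestamp))
	copy(buf[headerSize:], e.Key)
	copy(buf[headerSize+len(e.Key):], e.Value)
	return buf
}

// Decode parses one entry from the start of buf.
func Decode(buf []byte) (*Entry, error) {
	if len(buf) < headerSize {
		return nil, fmt.Errorf("need %d header bytes, got %d", headerSize, len(buf))
	}
	keySize := int(binary.BigEndian.Uint32(buf[0:4]))
	valueSize := int(binary.BigEndian.Uint32(buf[4:8]))
	end := headerSize + keySize + valueSize
	if len(buf) < end {
		return nil, fmt.Errorf("need %d bytes, got %d", end, len(buf))
	}
	return &Entry{
		Key:       append([]byte(nil), buf[headerSize:headerSize+keySize]...),
		Value:     append([]byte(nil), buf[headerSize+keySize:end]...),
		Timestamp: int64(binary.BigEndian.Uint64(buf[8:16])),
	}, nil
}

func main() {
	raw := newEntry([]byte("A"), []byte("10")).Encode()
	e, err := Decode(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes, key=%s value=%s\n", len(raw), e.Key, e.Value)
}
```

Storing the two sizes in a fixed-length header is what lets a reader know how many bytes to read for the key and the value that follow.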


So the structure of the disk file is very simple: it is a sequence of Entry records.


After the disk is updated, update the memory. A simple data structure such as a hash table is enough in memory. The hash table maps each key to the location of its Entry in the disk file, which makes lookups easy.

That completes the data storage process in minidb. It takes only two steps: appending a record to the disk file and updating the index in memory.
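
The two steps can be sketched in a few lines. This is a toy model under stated assumptions: a byte slice stands in for the disk file, the record has only an 8-byte size header, and the names `store`, `put`, `log`, and `index` are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// store is a toy stand-in for minidb: log plays the role of the
// append-only disk file and index is the in-memory hash table.
type store struct {
	log   []byte
	index map[string]int64
}

// put appends one record to the log, then points the index at it.
func (s *store) put(key, value string) {
	// The new record starts at the current end of the log.
	offset := int64(len(s.log))

	// Step 1: append [key size][value size][key][value].
	var hdr [8]byte
	binary.BigEndian.PutUint32(hdr[0:4], uint32(len(key)))
	binary.BigEndian.PutUint32(hdr[4:8], uint32(len(value)))
	s.log = append(s.log, hdr[:]...)
	s.log = append(s.log, key...)
	s.log = append(s.log, value...)

	// Step 2: record where the latest version of the key lives.
	s.index[key] = offset
}

// String summarizes the state of the store.
func (s *store) String() string {
	return fmt.Sprintf("%d keys indexed, %d bytes in log", len(s.index), len(s.log))
}

func main() {
	s := &store{index: make(map[string]int64)}
	s.put("A", "10")
	s.put("B", "7")
	s.put("A", "20") // an update is just another append
	fmt.Println(s)
	fmt.Println("A is at offset", s.index["A"])
}
```

Note that updating A does not touch the first record. The old bytes stay in the log, and only the index entry moves to the new offset.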

GET

Next, GET retrieves data. First, look up the key in the in-memory hash table to find its index information, which contains the location of the value in the disk file. Then read the value from the disk directly at that location.
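
Here is a rough sketch of that lookup against a real file, using `os.File.ReadAt` to read at the indexed offset. The 8-byte header layout and the helpers `appendRecord`, `get`, and `demo` are assumptions made for this example only.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

const headerSize = 8 // assumed: key size (4 bytes) + value size (4 bytes)

// appendRecord writes header, key, and value at the end of f and
// returns the offset at which the record starts.
func appendRecord(f *os.File, key, value string) (int64, error) {
	info, err := f.Stat()
	if err != nil {
		return 0, fmt.Errorf("stat data file: %w", err)
	}
	offset := info.Size()
	buf := make([]byte, headerSize, headerSize+len(key)+len(value))
	binary.BigEndian.PutUint32(buf[0:4], uint32(len(key)))
	binary.BigEndian.PutUint32(buf[4:8], uint32(len(value)))
	buf = append(buf, key...)
	buf = append(buf, value...)
	_, err = f.WriteAt(buf, offset)
	return offset, err
}

// get finds the key in the index, then reads its value from the file.
// The boolean reports whether the key exists.
func get(f *os.File, index map[string]int64, key string) (string, bool, error) {
	offset, ok := index[key]
	if !ok {
		return "", false, nil // unknown key: no disk access at all
	}
	hdr := make([]byte, headerSize)
	if _, err := f.ReadAt(hdr, offset); err != nil {
		return "", false, fmt.Errorf("read header: %w", err)
	}
	keySize := int64(binary.BigEndian.Uint32(hdr[0:4]))
	value := make([]byte, binary.BigEndian.Uint32(hdr[4:8]))
	// The value sits right after the header and the key.
	if _, err := f.ReadAt(value, offset+headerSize+keySize); err != nil {
		return "", false, fmt.Errorf("read value: %w", err)
	}
	return string(value), true, nil
}

// demo writes three records to a temporary file and reads key "A" back.
func demo() (string, bool, error) {
	f, err := os.CreateTemp("", "minidb-get-*.data")
	if err != nil {
		return "", false, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	index := make(map[string]int64)
	for _, kv := range [][2]string{{"A", "10"}, {"B", "7"}, {"A", "20"}} {
		off, err := appendRecord(f, kv[0], kv[1])
		if err != nil {
			return "", false, err
		}
		index[kv[0]] = off
	}
	return get(f, index, "A")
}

func main() {
	value, ok, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(value, ok)
}
```

Because the index always points at the most recent record for a key, the read returns the latest value without scanning the file.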

DEL

Then there is the delete operation. Instead of locating and removing the original record, the delete operation itself is wrapped in an Entry and appended to the disk file. The Entry must be marked with the type delete.

Then the index information for the key is removed from the in-memory hash table, and the delete operation is complete.
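
A small sketch shows that a delete makes the log longer rather than shorter. Here a slice of records stands in for the disk file, a slice position stands in for a file offset, and the numeric values of `PUT` and `DEL` are assumed. All names are hypothetical.

```go
package main

import "fmt"

// Record types (values are assumptions for this sketch).
const (
	PUT uint16 = iota
	DEL
)

type record struct {
	key, value string
	mark       uint16
}

// store keeps an append-only log and an index of live keys.
type store struct {
	log   []record
	index map[string]int // key -> position of its latest record
}

func (s *store) put(key, value string) {
	s.log = append(s.log, record{key, value, PUT})
	s.index[key] = len(s.log) - 1
}

// del appends a delete marker, then drops the key from the index.
func (s *store) del(key string) {
	if _, ok := s.index[key]; !ok {
		return // nothing to delete, so nothing is written
	}
	s.log = append(s.log, record{key: key, mark: DEL})
	delete(s.index, key)
}

// dump renders the log in write order.
func (s *store) dump() string {
	out := ""
	for _, r := range s.log {
		if r.mark == DEL {
			out += fmt.Sprintf("DEL %s; ", r.key)
		} else {
			out += fmt.Sprintf("PUT %s=%s; ", r.key, r.value)
		}
	}
	return out
}

func main() {
	s := &store{index: make(map[string]int)}
	s.put("A", "10")
	s.del("A")
	s.del("missing")
	fmt.Println(s.dump())
	fmt.Println("live keys:", len(s.index))
}
```

The marker matters after a restart: when the index is rebuilt from the file, the delete record is what tells the loader that the earlier PUT for the same key no longer counts.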

As you can see, inserting and deleting each take only two steps: appending a record to the disk file and updating the index in memory. Querying takes an index lookup in memory followed by a read from the disk file. So the write performance of minidb stays very stable regardless of the data size.

Merge

Finally, let's look at a more important operation. As mentioned earlier, records are constantly appended to the disk file, so the file keeps growing. The file may also contain multiple Entry records for the same key (recall that updating or deleting a key also appends a record), which means the data file holds redundant Entry data.

As a simple example, set the value of key A to 10, 20, and 30 in succession. The disk file then contains three records for A, in the order they were written: A = 10, A = 20, and A = 30.

At this point, the latest value of A is 30, so the first two records are no longer valid.

To deal with this, we need to merge the data file periodically to clean up invalid Entry data. This process is generally called merge.

The idea behind merge is also very simple: read all the Entry records from the original data file, rewrite the valid ones into a new temporary file, and finally delete the original data file. The temporary file then becomes the new data file.
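
The filtering rule can be sketched on the A = 10, 20, 30 example. In this simplified model a slice of records stands in for the data file and a slice position stands in for a file offset. The names `record` and `merge` are hypothetical.

```go
package main

import "fmt"

// record is a simplified Entry.
type record struct {
	key, value string
}

// String renders a record as key=value.
func (r record) String() string {
	return fmt.Sprintf("%s=%s", r.key, r.value)
}

// merge copies only the records that the index still points to into a
// new log and returns that log together with an index of new positions.
func merge(entries []record, index map[string]int) ([]record, map[string]int) {
	merged := make([]record, 0, len(index))
	newIndex := make(map[string]int, len(index))
	for pos, r := range entries {
		// A record is valid only if it is the latest one for its key.
		if cur, ok := index[r.key]; ok && cur == pos {
			newIndex[r.key] = len(merged)
			merged = append(merged, r)
		}
	}
	return merged, newIndex
}

func main() {
	// A is written three times; only the last write is still valid.
	entries := []record{{"A", "10"}, {"B", "1"}, {"A", "20"}, {"A", "30"}}
	index := map[string]int{"A": 3, "B": 1}

	merged, newIndex := merge(entries, index)
	fmt.Println(merged, newIndex)
}
```

After the merge, the log holds two records instead of four (B = 1 and A = 30), and the index points at their new positions.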


This is the underlying data storage model of minidb. It is called bitcask, and rosedb uses it too. It is essentially an LSM-like model: the core idea is to use sequential IO to improve write performance, but it is much simpler to implement than LSM.

With the underlying storage model introduced, we can start on the code. The complete implementation is on my GitHub:

https://github.com/roseduan/minidb

Only the key parts of the code are excerpted in this article.

The first step is opening the database. Load the data file, then read the Entry records in it and rebuild the index. The key part of the code is as follows:

func Open(dirPath string) (*MiniDB, error) {
   // Create the database directory if it does not exist
   if _, err := os.Stat(dirPath); os.IsNotExist(err) {
      if err := os.MkdirAll(dirPath, os.ModePerm); err != nil {
         return nil, err
      }
   }

   // Load the data file
   dbFile, err := NewDBFile(dirPath)
   if err != nil {
      return nil, err
   }

   db := &MiniDB{
      dbFile:  dbFile,
      indexes: make(map[string]int64),
      dirPath: dirPath,
   }

   // Rebuild the in-memory index from the data file
   db.loadIndexesFromFile(dbFile)
   return db, nil
}
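
Open delegates index recovery to loadIndexesFromFile, which is not shown in this article. As a rough idea of what such a replay might look like, here is a sketch that scans records from offset 0 and rebuilds the index. The 10-byte header, the values of `PUT` and `DEL`, and the helpers `encode`, `loadIndexes`, and `demo` are assumptions; see the repository for the real implementation.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// Record types (values are assumptions for this sketch).
const (
	PUT uint16 = iota
	DEL
)

// headerSize is key size (4 bytes) + value size (4 bytes) + mark (2 bytes).
const headerSize = 10

// encode builds one record: header, then key, then value.
func encode(key, value string, mark uint16) []byte {
	buf := make([]byte, headerSize, headerSize+len(key)+len(value))
	binary.BigEndian.PutUint32(buf[0:4], uint32(len(key)))
	binary.BigEndian.PutUint32(buf[4:8], uint32(len(value)))
	binary.BigEndian.PutUint16(buf[8:10], mark)
	buf = append(buf, key...)
	return append(buf, value...)
}

// loadIndexes scans the data from offset 0 and returns, for every live
// key, the offset of its most recent record.
func loadIndexes(r io.ReaderAt) (map[string]int64, error) {
	indexes := make(map[string]int64)
	hdr := make([]byte, headerSize)
	var offset int64
	for {
		if _, err := r.ReadAt(hdr, offset); err != nil {
			if err == io.EOF {
				return indexes, nil // no further complete header: end of data
			}
			return nil, fmt.Errorf("read header at offset %d: %w", offset, err)
		}
		keySize := int64(binary.BigEndian.Uint32(hdr[0:4]))
		valueSize := int64(binary.BigEndian.Uint32(hdr[4:8]))
		mark := binary.BigEndian.Uint16(hdr[8:10])

		key := make([]byte, keySize)
		if _, err := r.ReadAt(key, offset+headerSize); err != nil {
			return nil, fmt.Errorf("read key at offset %d: %w", offset, err)
		}
		if mark == DEL {
			delete(indexes, string(key)) // a later delete cancels earlier puts
		} else {
			indexes[string(key)] = offset // later records overwrite earlier ones
		}
		offset += headerSize + keySize + valueSize
	}
}

// demo replays a four-record log: put A, put B, put A again, delete B.
func demo() (map[string]int64, error) {
	var data []byte
	data = append(data, encode("A", "10", PUT)...)
	data = append(data, encode("B", "1", PUT)...)
	data = append(data, encode("A", "20", PUT)...)
	data = append(data, encode("B", "", DEL)...)
	return loadIndexes(bytes.NewReader(data))
}

func main() {
	indexes, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(indexes)
}
```

Replaying in write order is what makes the result correct: the last record seen for a key decides whether the key is live and where its value is.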

Next, the PUT method. The process matches the description above: first update the disk by appending a record, then update the memory:

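func (db *MiniDB) Put(key []byte, value []byte) (err error) {
   // The new Entry will start at the current end of the data file
   offset := db.dbFile.Offset
   // Wrap the key and value in an Entry
   entry := NewEntry(key, value, PUT)
   // Append it to the data file
   if err = db.dbFile.Write(entry); err != nil {
      return
   }

   // Update the in-memory index only after the write succeeds
   db.indexes[string(key)] = offset
   return
}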

The GET method first fetches the index information from memory to check whether the key exists. If the key does not exist, it returns immediately; if it does, it reads the data from disk.

func (db *MiniDB) Get(key []byte) (val []byte, err error) {
   // Look up the index information in memory
   offset, ok := db.indexes[string(key)]
   // The key does not exist
   if !ok {
      return
   }

   // Read the data from disk
   var e *Entry
   e, err = db.dbFile.Read(offset)
   if err != nil && err != io.EOF {
      return
   }
   if e != nil {
      val = e.Value
   }
   return
}

The DEL method is similar to the PUT method, except that the Entry is marked as DEL before it is written to the file:

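func (db *MiniDB) Del(key []byte) (err error) {
   // Look up the index information in memory
   _, ok := db.indexes[string(key)]
   // The key does not exist, so there is nothing to do
   if !ok {
      return
   }

   // Wrap the delete in an Entry and append it to the data file
   e := NewEntry(key, nil, DEL)
   err = db.dbFile.Write(e)
   if err != nil {
      return
   }

   // Remove the key from the in-memory index
   delete(db.indexes, string(key))
   return
}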

Last is the important operation of merging the data file. The process matches the description above. The key code is as follows:

func (db *MiniDB) Merge() error {
   var (
      validEntries []*Entry // entries that the index still points to
      offset       int64    // read position in the original data file
   )

   // Read every Entry in the original data file
   for {
      e, err := db.dbFile.Read(offset)
      if err != nil {
         if err == io.EOF {
            break
         }
         return err
      }
      // The in-memory index is up to date, so an Entry is valid only if
      // the index still points at its offset
      if off, ok := db.indexes[string(e.Key)]; ok && off == offset {
         validEntries = append(validEntries, e)
      }
      offset += e.GetSize()
   }

   if len(validEntries) > 0 {
      // Create a temporary file
      mergeDBFile, err := NewMergeDBFile(db.dirPath)
      if err != nil {
         return err
      }
      defer os.Remove(mergeDBFile.File.Name())

      // Rewrite the valid entries
      for _, entry := range validEntries {
         writeOff := mergeDBFile.Offset
         err := mergeDBFile.Write(entry)
         if err != nil {
            return err
         }

         // Point the index at the entry's new offset
         db.indexes[string(entry.Key)] = writeOff
      }

      // Delete the old data file
      if err := os.Remove(db.dbFile.File.Name()); err != nil {
         return err
      }
      // The temporary file becomes the new data file
      if err := os.Rename(mergeDBFile.File.Name(), db.dirPath+string(os.PathSeparator)+FileName); err != nil {
         return err
      }

      db.dbFile = mergeDBFile
   }
   return nil
}

Excluding the test files, the core code of minidb is only 300 lines. Small as it is, it is complete: it already contains the main ideas of the bitcask storage model, which is also the underlying foundation of rosedb.

Once you understand minidb, you have basically mastered the bitcask storage model. With a little more time, I believe you will find your way around rosedb as well.

Further, if you are interested in kv storage, you can study the related topics in more depth. Although bitcask is concise and easy to understand, it has many problems. rosedb has made some optimizations in practice, but many problems remain.

Some people may wonder whether bitcask, being such a simple model, is just a toy, and whether it is used in real production environments. The answer is that it is indeed used in production.

Bitcask originated as the underlying storage model of the Riak project. Riak is a distributed kv store that also ranks among the top entries in NoSQL rankings.

The distributed kv store used by Douban is in fact also based on the bitcask model, with many optimizations. At present there are not many kv stores based purely on the bitcask model, so you are welcome to read the rosedb code and offer your own opinions and suggestions so that we can improve the project together.

Finally, here are the addresses of the related projects:

minidb: https://github.com/roseduan/minidb

rosedb: https://github.com/roseduan/rosedb

Reference materials:

https://riak.com/assets/bitcask-intro.pdf

https://medium.com/@arpitbhayani/bitcask-a-log-structured-fast-kv-store-c6c728a9536b