LSM-Tree - LevelDb Skiplist - Technical Reading Notes

Introduction to skip lists

A skip list (SkipList) is a data structure proposed by William Pugh. His paper "Skip lists: a probabilistic alternative to balanced trees" describes the structure together with its insertion and deletion operations in detail.

Documentation: Skip list original paper - pugh-skiplists-cacm1990.pdf
Link: http://note.youdao.com/noteshare?id=667aed96f012edcada047baf75aa1769&sub=B427E00C818B428ABD38A813180AF9A8

Background

Among linear data structures, arrays and linked lists come to mind first. Arrays are fast to query but slow to insert into, while linked lists are fast to insert into but slower to query. A skip list is a data structure that mainly optimizes the query speed of a linked list: its multiple levels act as an index over the underlying linked list. This is a typical trade of space for time, and it brings the expected query time of the linked list down to O(log N).

Implementation

Because the structure maintains index nodes, every insertion and deletion must also keep the index up to date. The skip list simplifies this with probabilistic balancing: where a balanced tree uses left and right rotations to stay balanced, the skip list uses something like a coin toss to decide how many levels a node occupies, and updates the index on those levels.

The diagram below (parts a to e) shows how the skip list evolves:

First, a is a plain linked list. A query takes O(n) time, so the longer the list, the slower the query.

b builds on a by giving every second node an extra pointer to the node two positions ahead. With this change a query examines at most ⌈n/2⌉ + 1 nodes.

c, d, and e keep adding extra pointers in the same way, until the top layer is a single run of pointers from the head to the tail.

However, adding pointers in such a rigid, uniform pattern makes the structure very expensive to maintain when nodes are inserted or deleted. Call a node with k forward pointers a level-k node. In the layout shown in the figure, 50% of the nodes are level 1, 25% are level 2, 12.5% are level 3, and so on. If newly inserted nodes follow these proportions on average, lookups stay efficient, and insertion and deletion do not have to reorganize the rest of the list.

Maintaining the auxiliary pointers does add complexity. Within each level, a node's pointer leads to the next node on that same level, so each level is a linked list in its own right.
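The 50% / 25% / 12.5% split can be checked on the idealized layout, where every node has one pointer, every second node has a second pointer, every fourth node a third, and so on. The sketch below is only an illustration: PerfectHeight and CountWithHeight are hypothetical helper names, and positions are assumed to be 1-based.

```cpp
#include <cassert>
#include <cstddef>

// Height of the i-th node (1-based) in the idealized layout: one extra
// level for every factor of two in the node's position.
int PerfectHeight(std::size_t i) {
  assert(i >= 1);
  int height = 1;
  while (i % 2 == 0) {
    ++height;
    i /= 2;
  }
  return height;
}

// Number of nodes among positions 1..n whose height is exactly h.
std::size_t CountWithHeight(std::size_t n, int h) {
  std::size_t count = 0;
  for (std::size_t i = 1; i <= n; ++i) {
    if (PerfectHeight(i) == h) ++count;
  }
  return count;
}
```

For n = 16 this gives 8 nodes of height 1, 4 of height 2, and 2 of height 3. A randomized skip list aims for the same proportions on average without fixing which positions are tall.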

How is the time complexity calculated?

The derivation is as follows:

level k holds n / 2^k nodes;  n / 2^h = 2  =>  h = log2(n) - 1  =>  O(log n)

Here k is the index level and h is the highest index level.

If the original linked list has n elements, the first-level index has n/2 elements, the second-level index has n/4 elements, and the k-th level index has n/(2^k) elements. The highest-level index generally has 2 elements (head and tail), so n/(2^h) = 2 and h = log2(n) - 1. That counts the index levels only; adding the original data layer gives a total skip list height of log2(n).

With this index in place a query visits only a few nodes on each level, so the overall query time is approximately O(log n), similar to binary search.
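Written out step by step, the derivation looks as follows. This assumes the idealized case in which each index level keeps exactly half of the nodes of the level below; c is an assumed constant bound on the number of nodes visited per level and is not a value taken from the text.

```latex
\[
\begin{aligned}
\text{nodes at index level } k &= \frac{n}{2^{k}} \\
\frac{n}{2^{h}} = 2 \;&\Longrightarrow\; h = \log_2 n - 1 \\
\text{total height} &= h + 1 = \log_2 n \\
\text{search cost} &\le c \cdot \log_2 n = O(\log n)
\end{aligned}
\]
```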

How is the space complexity calculated?

The space overhead of each index level shrinks as the levels go up: the first-level index has n/2 nodes, the second n/4, the third n/8, and so on, down to 2 nodes at the highest level. The total is n/2 + n/4 + n/8 + ... + 2, a geometric series that sums to n - 2, so the extra space is O(n).
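The sum can also be checked numerically. The function below is a small sketch, with IndexNodeCount as a hypothetical name; it assumes n is a power of two and at least 4 so that every level halves cleanly.

```cpp
#include <cassert>
#include <cstdint>

// Total number of index nodes above the base list when each index level
// keeps half of the level below and the top level keeps 2 nodes:
// n/2 + n/4 + ... + 2.
std::uint64_t IndexNodeCount(std::uint64_t n) {
  assert(n >= 4 && (n & (n - 1)) == 0);  // power of two, at least 4
  std::uint64_t total = 0;
  for (std::uint64_t level = n / 2; level >= 2; level /= 2) {
    total += level;
  }
  return total;  // equals n - 2
}
```

For n = 16 the index holds 8 + 4 + 2 = 14 nodes, which is n - 2 and therefore linear in n.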

Query

The following walks through adding, deleting, updating, and querying in a skip list. Since both deletion and insertion depend on the query, let's start there.

The diagram below illustrates how a query proceeds. For example, to find node 17 in the middle, the search follows the line in the drawing. In brief, the order is:

Start at the head node on the top index level and look at the next node on that level.
If the search key is greater than the next node's key, move to that node and compare again.
If the next node's key equals the search key, return that node.
If the next node's key is greater than the search key (or there is no next node) and the level is not 0, stay on the current node and continue on the level below. If the level is 0, return the next node, whose key is greater than the search key.

Searching is easy to follow: the index lets the search jump over many linked list nodes at once to reduce the number of comparisons, then drops down level by level to locate the node. Note that an index node usually also carries a pointer to the node below it.
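The search steps above can be sketched as follows. This is a minimal illustration and not code from any particular library: Node and Find are hypothetical names, keys are assumed to be int, and each node stores its forward pointers in a vector instead of a separate pointer to a lower copy. For brevity the equality check is done once at the bottom level instead of on every level.

```cpp
#include <cassert>
#include <vector>

// Hypothetical node type: next[i] is the successor on level i.
struct Node {
  int key;
  std::vector<Node*> next;
  Node(int k, int height) : key(k), next(height, nullptr) {}
};

// Returns the node holding `target`, or nullptr if it is absent.
// `head` is a sentinel whose height equals the height of the list.
const Node* Find(const Node* head, int target) {
  assert(head != nullptr && !head->next.empty());
  const Node* x = head;
  for (int level = static_cast<int>(head->next.size()) - 1; level >= 0;
       --level) {
    // Move right while the next key on this level is still smaller.
    while (x->next[level] != nullptr && x->next[level]->key < target) {
      x = x->next[level];
    }
    // Otherwise fall through to the level below, staying on the same node.
  }
  const Node* candidate = x->next[0];
  return (candidate != nullptr && candidate->key == target) ? candidate
                                                            : nullptr;
}
```

With a two-level list holding 6, 17, and 25, where only 17 is indexed on the upper level, a lookup of 17 jumps from the head straight to 17 without visiting 6.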

Insert

Insertion is the more critical operation, because it decides which nodes become index nodes, and that strongly affects the performance of the skip list. It follows the search steps above, and along the way it must record the predecessor node on every level.

Key point: the crux of insertion is choosing which nodes get a greater height, so that searches keep their binary-search-like efficiency. After the position is found, a coin-toss method is usually used to raise the node's height at random, and the probability of reaching each additional level falls off exponentially. The height is computed from three inputs: a seed (to drive the randomness), the node's current height, and a probability P (other random schemes are also possible). To keep a single node from growing too tall, the final height is usually capped at an upper limit.

By Murphy's Law, however unlikely an extremely tall node is, the height still needs to be capped.

The following Go snippet shows how a LevelDB skip list picks the height:

func (p *DB) randHeight() (h int) {
    // Each extra level is granted with probability 1/branching.
    const branching = 4
    h = 1
    // tMaxHeight caps how tall the skip list can grow.
    for h < tMaxHeight && p.rnd.Int()%branching == 0 {
        h++
    }
    return
}

After the node is placed, its index entries are built level by level from the predecessor nodes recorded during the search.

Linking the new node in is simple. On each level, first point the new node's next pointer at the predecessor's current next node, then point the predecessor's next pointer at the new node.
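The two phases of insertion, recording predecessors and then splicing, can be sketched as below. Node and InsertWithHeight are hypothetical names used only for illustration; the height is passed in by the caller so that the example stays deterministic, and duplicate keys are not handled.

```cpp
#include <cassert>
#include <vector>

struct Node {
  int key;
  std::vector<Node*> next;  // next[i] is the successor on level i
  Node(int k, int height) : key(k), next(height, nullptr) {}
};

// Inserts `key` with the given height into the list rooted at sentinel
// `head`. The height must not exceed the height of `head`. The caller owns
// the returned node and is responsible for deleting it.
Node* InsertWithHeight(Node* head, int key, int height) {
  const int max_height = static_cast<int>(head->next.size());
  assert(height >= 1 && height <= max_height);

  // Search phase: remember the predecessor on every level.
  std::vector<Node*> prev(max_height, head);
  Node* x = head;
  for (int level = max_height - 1; level >= 0; --level) {
    while (x->next[level] != nullptr && x->next[level]->key < key) {
      x = x->next[level];
    }
    prev[level] = x;
  }

  // Splice phase: on each level the new node first takes over the
  // predecessor's successor, then the predecessor is pointed at the new node.
  Node* node = new Node(key, height);
  for (int level = 0; level < height; ++level) {
    node->next[level] = prev[level]->next[level];
    prev[level]->next[level] = node;
  }
  return node;
}
```

Only the levels below the new node's height are touched, which is why the splice costs at most one pointer update pair per level.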

Insert operation time complexity

In a singly linked list, once the position has been found, splicing in a node is always O(1). In a skip list the same splice is repeated on every level the new node occupies, so the worst case, where the index must be updated on all levels, is O(log N). The search that precedes the splice is also O(log N).

Delete

Deletion also depends on the query. Once the query has found the node to delete, the node is unlinked from every level it appears on.

The cost of deletion depends on the number of levels. Each level is a singly linked list, and unlinking a node from a singly linked list whose predecessor is already known is O(1). With N elements the number of index levels is about log N, so at most log N levels need an unlink.

The total time to remove an element is therefore:

time to find the element + time to unlink it on up to log n levels = O(log n) + O(log n) = 2·O(log n). Ignoring the constant factor, the final result is O(log n).
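A sketch of the delete path described above, under the same assumptions as a generic textbook skip list (this is not LevelDB code, and Node and Unlink are hypothetical names). The search records one predecessor per level, and the unlink then costs O(1) on each level the node occupies.

```cpp
#include <cassert>
#include <vector>

struct Node {
  int key;
  std::vector<Node*> next;  // next[i] is the successor on level i
  Node(int k, int height) : key(k), next(height, nullptr) {}
};

// Unlinks the node with `key` from every level it appears on.
// Returns the unlinked node (the caller decides how to free it), or nullptr
// if the key is not present. Keys are assumed to be unique.
Node* Unlink(Node* head, int key) {
  const int max_height = static_cast<int>(head->next.size());
  assert(max_height >= 1);

  // O(log n) expected: find the predecessor on every level.
  std::vector<Node*> prev(max_height, head);
  Node* x = head;
  for (int level = max_height - 1; level >= 0; --level) {
    while (x->next[level] != nullptr && x->next[level]->key < key) {
      x = x->next[level];
    }
    prev[level] = x;
  }

  Node* victim = prev[0]->next[0];
  if (victim == nullptr || victim->key != key) return nullptr;

  // At most one O(1) unlink per level the victim occupies.
  const int height = static_cast<int>(victim->next.size());
  for (int level = 0; level < height; ++level) {
    prev[level]->next[level] = victim->next[level];
  }
  return victim;
}
```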

Use cases

Most LSM-Tree based key-value stores contain a skip list implementation of this kind. Linked lists may be rare in business code, but they are crucial in data structure and database design:

HBase
Redis
LevelDB

Summary

A skip list lets a linked list support binary-search-style lookups.
On insertion, each element's level is chosen at random by a weighted coin toss.
The bottom level is always the original linked list, and the levels above it are index data.
An index node usually carries an extra pointer to the node below it, but not every implementation is designed this way; there are other ways to achieve the same effect (for details, see the zset source code of Redis).
Query, insertion, and deletion in a skip list all take O(log n) time, which is close to a balanced binary tree.

LevelDB skip list implementation

The earlier discussion of compaction showed that keys are merged with a merge sort. Besides the merge sort, the database also relies on a skip list internally to keep keys in order.

Skip lists are also implemented in Redis and Kafka. The SkipList here is much the same and can be read as a C++ example of the structure.

This part skips the author's documentation and goes straight to the source code.

Basic structure

First, what does LevelDB's skip list contain? Near the top of the code, a Node type is declared to represent a list node, and an Iterator type is declared for iterating over the contents. Inside Node, the array std::atomic<Node*> next_[1] holds the forward pointers; it is declared with one element, but its effective length equals the height of the node.

next_[0] is the link on the lowest level, and the higher indices are the links the skip list uses to jump ahead. Level selection relies on Random, a general-purpose random number generator that appears to be the authors' own (it produces random numbers with bit operations).

The LevelDB implementation as a whole is simple and disciplined, and many small helper functions are defined to keep the code from growing complicated.

Important methods

With the skip list basics covered above, the core of the data structure lies in query and insertion. Query is the prerequisite for finding the insertion point, while for insertion the part worth a close look is how the coin-toss level selection is implemented.

Query operation

The query is easy to follow and matches the general skip list search described earlier.

It is essentially identical to the textbook skip list, so the steps repeat the theory:

Start at the head node on the top index level and look at the next node on that level.
If the search key is greater than the next node's key, move to that node and compare again.
If the next node's key equals the search key, return that node.
If the next node's key is greater than the search key (or there is no next node) and the level is not 0, stay on the current node and continue on the level below. If the level is 0, return the next node, whose key is greater than the search key.

// Return the earliest node that comes at or after key. Return nullptr if
// there is no such node. If prev is non-null, fill prev[level] with a pointer
// to the previous node at "level" for every level in [0..max_height_-1].
template <typename Key, class Comparator>
typename SkipList<Key, Comparator>::Node*
SkipList<Key, Comparator>::FindGreaterOrEqual(const Key& key,
                                              Node** prev) const {
    Node* x = head_;
    // Start from the highest level currently in use.
    int level = GetMaxHeight() - 1;
    while (true) {
        Node* next = x->Next(level);
        if (KeyIsAfterNode(key, next)) {
            // The key is after next: keep searching along this level.
            x = next;
        } else {
            if (prev != nullptr) prev[level] = x;
            if (level == 0) {
                return next;
            } else {
                // Drop down one level.
                level--;
            }
        }
    }
}
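The loop above only terminates cleanly because the comparison helper treats a missing successor as larger than any key, so reaching the end of a level behaves like meeting a bigger key. A minimal sketch of such a comparison follows; it assumes int keys and a bare hypothetical Node type, whereas the real helper is a member of the templated SkipList class and goes through the comparator.

```cpp
#include <cassert>

// Hypothetical node type holding only a key.
struct Node {
  int key;
};

// True when `key` belongs strictly after node `n`.
// A null node is treated as "infinity": no key is after it, so the search
// loop stops moving right and drops down a level instead.
bool KeyIsAfterNode(int key, const Node* n) {
  return n != nullptr && n->key < key;
}
```

With this convention the search never needs a separate end-of-list check on each level.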

Insert operation

The code of the insert operation is as follows. Note that LevelDB's skip list requires writes to be synchronized externally, so the caller must hold a lock while inserting.

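template <typename Key, class Comparator>
void SkipList<Key, Comparator>::Insert(const Key& key) {
    // A node has at most kMaxHeight predecessors, so simply size the array
    // for kMaxHeight.
    Node* prev[kMaxHeight];

    // Find the first node at or after key (nullptr if none) and record the
    // predecessor on every level in prev.
    Node* x = FindGreaterOrEqual(key, prev);

    // Duplicate insertion is not allowed (the caller synchronizes writes).
    assert(x == nullptr || !Equal(key, x->key));

    // Level selection: a random function plus the height cap decide, like a
    // coin toss, how many levels the new node gets.
    int height = RandomHeight();

    // Is the random height greater than the current height of the list?
    if (height > GetMaxHeight()) {
        // On the newly added levels the predecessor is the head node.
        for (int i = GetMaxHeight(); i < height; i++) {
            prev[i] = head_;
        }

        /*
        The original comment here is worth reading closely.

        Translation: it is ok to mutate max_height_ without any
        synchronization with concurrent readers. A concurrent reader that
        observes the new value of max_height_ will see either the old value
        of the new level pointers from head_ (nullptr), or a new value set in
        the loop below. In the former case the reader will immediately drop
        to the next level, since nullptr sorts after all keys. In the latter
        case the reader will use the new node.

        In other words: this step needs no lock. If a concurrent reader sees
        the new height before the new node is linked in, it reads nullptr on
        the new levels. LevelDB's comparison treats nullptr as sorting after
        every key, so the reader simply keeps going down a level, and both
        writes and reads behave as expected.
        */
        max_height_.store(height, std::memory_order_relaxed);
    }

    // Create the new node.
    x = NewNode(key, height);

    for (int i = 0; i < height; i++) {
        // NoBarrier_SetNext() suffices since we will add a barrier when we
        // publish a pointer to "x" in prev[i].
        // To keep concurrent reads correct, set the new node's pointer first
        // and only then update prev[i] in the list.
        x->NoBarrier_SetNext(i, prev[i]->NoBarrier_Next(i));
        // SetNext() synchronizes internally.
        prev[i]->SetNext(i, x);
    }
}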

The hard part of implementing a skip list is deciding the number of levels. The hard part specific to LevelDB is making sure that, while a node is being inserted, concurrent readers still read the list correctly.
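The ordering that keeps readers safe can be shown on a single level. The sketch below uses std::atomic with explicit memory orders and assumes one writer at a time; Node, PublishAfter, and Next are hypothetical names for illustration and are not LevelDB's API.

```cpp
#include <atomic>
#include <cassert>

struct Node {
  explicit Node(int k) : key(k), next(nullptr) {}
  int key;
  std::atomic<Node*> next;
};

// Links `node` after `prev` so that a concurrent reader following
// prev->next sees either the old successor or a fully initialized `node`.
void PublishAfter(Node* prev, Node* node) {
  assert(prev != nullptr && node != nullptr);
  // Step 1: fill in the new node. No reader can reach it yet, so a relaxed
  // store is enough here.
  node->next.store(prev->next.load(std::memory_order_relaxed),
                   std::memory_order_relaxed);
  // Step 2: publish. The release store makes step 1 visible to any reader
  // that loads prev->next with acquire ordering.
  prev->next.store(node, std::memory_order_release);
}

// Reader side: the acquire load pairs with the release store above.
Node* Next(const Node* n) {
  return n->next.load(std::memory_order_acquire);
}
```

If the two steps were swapped, a reader could reach the new node before its next pointer was set and would lose the rest of the list.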

RandomHeight(): level selection

The core of level selection in LevelDB is the condition height < kMaxHeight && rnd_.OneIn(kBranching). It keeps the height of the skip list within kMaxHeight levels, and taking a remainder modulo 4 grants each extra level with probability 1/4, so the height follows a geometric distribution that stops with probability P = 3/4 at each step.

In the original skip list a node gains 1 extra level with probability 1/2, 2 with 1/4, 3 with 1/8, and 4 with 1/16. LevelDB caps the height at kMaxHeight = 12 (levels 0 to 11), which limits the number of keys the index is sized for, but a node that reaches the top level is very rare.
With LevelDB's choice, 3/4 of the nodes are level 1 nodes, 3/16 are level 2 nodes, 3/64 are level 3 nodes, and so on.
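The fractions 3/4, 3/16, 3/64 follow directly from the coin toss: a node of height h won h - 1 tosses and then lost one. A small sketch of that calculation, where HeightProbability is a hypothetical helper name and the cap is handled as a special case:

```cpp
#include <cassert>

// Probability that a node ends up with exactly `height` levels when each
// extra level is granted with probability 1/branching and the height is
// capped at max_height.
double HeightProbability(int height, int branching, int max_height) {
  assert(branching >= 2 && height >= 1 && height <= max_height);
  const double grow = 1.0 / branching;
  double p = 1.0;
  for (int i = 1; i < height; ++i) p *= grow;  // height - 1 winning tosses
  // A node at the cap stops because of the limit, not because a toss failed.
  return height == max_height ? p : p * (1.0 - grow);
}
```

With branching = 4 and max_height = 12, the function yields 0.75, 0.1875, and 0.046875 for heights 1, 2, and 3, matching the fractions above.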

Features of level selection:

The number of pointers in a new node is decided by an independent random draw, so the pointer counts across all nodes follow a geometric distribution.
No extra rebalancing is needed on insertion: find the position, then adjust the pointers of the new node and its predecessors.

template <typename Key, class Comparator>
int SkipList<Key, Comparator>::RandomHeight() {
    // Increase height with probability 1 in kBranching.
    static const unsigned int kBranching = 4;
    int height = 1;
    // rnd_.OneIn(kBranching) returns true about 1/n of the time and false
    // otherwise, so each level holds a quarter as many nodes as the one
    // below: every extra level multiplies the probability by 1/4.
    while (height < kMaxHeight && rnd_.OneIn(kBranching)) {
        height++;
    }
    assert(height > 0);
    assert(height <= kMaxHeight);
    return height;
}
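The distribution produced by this loop can also be observed empirically. The sketch below repeats the same loop shape with std::mt19937 standing in for LevelDB's own Random class, which is an assumption made to keep the example self-contained; HeightHistogram is a hypothetical helper.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Same loop shape as RandomHeight(), driven by a standard library generator.
int RandomHeight(std::mt19937& rng, int max_height, unsigned branching) {
  assert(max_height >= 1 && branching >= 2);
  int height = 1;
  while (height < max_height && rng() % branching == 0) {
    ++height;
  }
  return height;
}

// histogram[h] = number of samples that got height h (index 0 is unused).
std::vector<int> HeightHistogram(int samples, unsigned seed) {
  const int kMaxHeight = 12;
  const unsigned kBranching = 4;
  std::mt19937 rng(seed);
  std::vector<int> histogram(kMaxHeight + 1, 0);
  for (int i = 0; i < samples; ++i) {
    ++histogram[RandomHeight(rng, kMaxHeight, kBranching)];
  }
  return histogram;
}
```

Over a large number of samples, roughly three quarters of the draws should land on height 1 and each further height should be about four times rarer than the one before.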

As the code shows, the probability p is 1/4. The advantage of 1/4 over 1/2 is that the levels thin out faster, so the index is sparser. This is a classic trade-off between space and time, and with this setting the list is sized for up to about n = (1/p)^kMaxHeight nodes.

Efficiency is the main consideration for LevelDB's workload, where writes are faster than reads.

How many nodes can a skip list of height 12 index at most? 4^12 gives 16,777,216, roughly 16M. Of course, the probability of a node reaching level 12 is very small.
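The arithmetic behind the 16M figure is just (1/p)^kMaxHeight with 1/p = 4 and kMaxHeight = 12. A compile-time sketch, where Capacity is a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdint>

// branching^max_height: the approximate number of nodes a skip list of that
// height is sized for, where branching = 1/p.
constexpr std::uint64_t Capacity(std::uint64_t branching, int max_height) {
  return max_height == 0 ? 1
                         : branching * Capacity(branching, max_height - 1);
}

static_assert(Capacity(4, 12) == 16777216, "4^12 is about 16M");
```

This is a sizing estimate, not a hard limit: a list with more nodes still works, its top levels just become denser than intended.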

Delete operation

LevelDB's skip list has no notion of deletion, and an update is likewise expressed only as a change to next pointers.

Until the skip list is destroyed, nodes are only ever added and never removed, because the skip list provides no delete interface at all.
A node inserted into the skip list is immutable apart from its next pointers, and only insert operations change the skip list (there is no in-place update).

Traversal operation

As explained in the earlier LevelDB source code analysis, traversal of the skip list is done through Iterator. Keys are kept in sorted order internally, and nullptr is treated as a special value that always sorts after every key.

LevelDB's iterator is fairly rich. Compared with the classic iterator interface of remove(), next(), and hasNext(), it also offers Seek, SeekToFirst, SeekToLast, and Prev for backward traversal.

// Advances to the next position.
// REQUIRES: Valid()
void Next();

// Advances to the previous position.
// REQUIRES: Valid()
void Prev();

// Advance to the first entry with a key >= target
void Seek(const Key& target);

// Position at the first entry in list.
// Final state of iterator is Valid() iff list is not empty.
void SeekToFirst();

// Position at the last entry in list.
// Final state of iterator is Valid() iff list is not empty.
void SeekToLast();

Note that backward traversal is not implemented by adding a prev pointer to each node. Instead, Prev searches again from the head for the last node before the current key, which trades time for space.
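The idea of stepping backward without a prev pointer can be sketched on a single-level list. This is a simplification for brevity: the names mirror the ones in the text, but the code below is an illustration with int keys, while the real FindLessThan descends through the levels in the same way as a search.

```cpp
#include <cassert>

// Forward-only node: there is deliberately no prev pointer.
struct Node {
  int key;
  Node* next;
};

// Returns the last node whose key is < key, or head if there is none.
Node* FindLessThan(Node* head, int key) {
  assert(head != nullptr);
  Node* x = head;
  while (x->next != nullptr && x->next->key < key) {
    x = x->next;
  }
  return x;
}

// Steps back from `current` by searching again from the head.
// Returns nullptr when `current` is already the first entry.
Node* Prev(Node* head, const Node* current) {
  Node* before = FindLessThan(head, current->key);
  return before == head ? nullptr : before;
}
```

Each backward step costs a fresh search, but no node has to store or maintain a second pointer per level.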

Finally, two more frequently used operations are FindLast and FindLessThan. Their comments are simple and clear, so they need no further introduction.

// Return the latest node with a key < key.
// Return head_ if there is no such node.
Node* FindLessThan(const Key& key) const;

// Return the last node in the list.
// Return head_ if list is empty.
Node* FindLast() const;

Summary

The difficulty in the design of LevelDB's skip list lies mainly in supporting concurrent reads alongside writes and in the level selection for nodes. These parts differ noticeably from the textbook skip list, while the rest can be read as a direct implementation of the original design. LevelDB is therefore a highly recommended piece of template code for learning skip lists.

