Redis Core Principles and Practice: The ziplist Structure Behind the List Type

binecy

The Redis list type can store a set of strings sorted by insertion order. It is very flexible, supports inserting and popping data at both ends, and can act as a stack and a queue.

> LPUSH fruit apple
(integer) 1
> RPUSH fruit banana
(integer) 2
> RPOP fruit
"banana"
> LPOP fruit
"apple"

This article discusses the implementation of list types in Redis.

ziplist

A list type can be implemented with either an array or a linked list. Redis uses a linked-list structure. The following is a conventional doubly linked list implementation, from adlist.h:

typedef struct listNode {
    struct listNode *prev;
    struct listNode *next;
    void *value;
} listNode;

typedef struct list {
    listNode *head;
    listNode *tail;
    void *(*dup)(void *ptr);
    void (*free)(void *ptr);
    int (*match)(void *ptr, void *key);
    unsigned long len;
} list;

Redis uses this linked list internally to store operational data, such as the information about all replica (slave) servers attached to a master.
However, Redis does not use this linked list to store user list data, because it is unfriendly to memory management:
(1) Each node of the linked list occupies its own separately allocated piece of memory, which causes excessive memory fragmentation.
(2) The prev and next pointers in every node take up a lot of extra memory.
Which structure solves both problems better? An array. A ziplist is a compact list format that resembles an array: it allocates one contiguous block of memory and stores all of the list's data in that block. This is the design idea behind ziplist.

Definition

The overall layout of ziplist is as follows:

<zlbytes> <zltail> <zllen> <entry> <entry> ... <entry> <zlend>
  • zlbytes: uint32_t, the number of bytes occupied by the entire ziplist, including the 4 bytes of zlbytes itself.
  • zltail: uint32_t, the offset from the start of the ziplist to the last node. It supports popping from the tail and traversing the list in reverse (tail to head).
  • zllen: uint16_t, the number of nodes. If there are more than 2^16-2 nodes, this value is set to 2^16-1, and the entire ziplist must be traversed to obtain the real number of nodes.
  • zlend: uint8_t, a special marker node equal to 255 that marks the end of the ziplist. No other node starts with 255.

The entry is the node saved in the ziplist. The format of entry is as follows:

<prevlen> <encoding> <entry-data>
  • entry-data: the node element, that is, the data stored by the node.
  • prevlen: the length of the predecessor node in bytes. This attribute itself occupies 1 byte or 5 bytes.
    ① If the length of the predecessor node is less than 254, 1 byte stores the length.
    ② Otherwise 5 bytes are used: the first byte is fixed at 254 and the remaining 4 bytes store the length of the predecessor node.
  • encoding: the encoding format of the current node element, covering both the encoding type and the element length. Different nodes in one ziplist can use different encodings. The encoding formats are:
    ① 00pppppp (pppppp stands for the lower 6 bits of encoding, the same below): string encoding, length less than or equal to 63 (2^6-1); the length is stored in the lower 6 bits of encoding.
    ② 01pppppp: string encoding, length less than or equal to 16383 (2^14-1); the length is stored in the lower 6 bits of encoding plus the 1 byte that follows.
    ③ 10000000: string encoding, length greater than 16383 (2^14-1); the length is stored in the 4 bytes that follow encoding.
    ④ 11000000: integer encoding, type int16_t, occupying 2 bytes.
    ⑤ 11010000: integer encoding, type int32_t, occupying 4 bytes.
    ⑥ 11100000: integer encoding, type int64_t, occupying 8 bytes.
    ⑦ 11110000: integer encoding, using 3 bytes to store an integer.
    ⑧ 11111110: integer encoding, using 1 byte to store an integer.
    ⑨ 1111xxxx: the lower 4 bits of encoding store an integer in the range 0-12. The usable range of the lower 4 bits is 0001 to 1101, and the stored value is the lower 4 bits minus 1.
    ⑩ 11111111: 255, the end node of the ziplist.
    Note that formats ② and ③ need extra bytes beyond the first encoding byte to store the element length. Format ⑨ is also special: the element is stored directly in the encoding attribute, an optimization for small integers, so entry-data is empty.
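
To make the rules above concrete, here is a minimal sketch of decoding the <prevlen> and <encoding> attributes of one entry. The entry_header type and the decode_entry_header function are hypothetical names invented for this illustration; they are not the Redis implementation. The sketch reads the multi-byte string lengths of formats ② and ③ most-significant byte first, which is how the ziplist.c comments describe those two fields.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch (Redis has its own zlentry). */
typedef struct {
    size_t   prevlen_size; /* bytes used by <prevlen>: 1 or 5 */
    uint32_t prevlen;      /* length of the predecessor node */
    size_t   enc_size;     /* bytes used by <encoding>: 1, 2 or 5 */
    int      is_string;    /* 1 = string encoding, 0 = integer encoding */
    uint32_t data_len;     /* bytes of <entry-data> after <encoding> */
    int      small_int;    /* value held by format 9, otherwise -1 */
} entry_header;

/* Decode <prevlen> and <encoding> of the entry at p.
 * Returns 0 on success, -1 for the end marker or an unknown encoding. */
static int decode_entry_header(const unsigned char *p, entry_header *h) {
    if (p[0] == 0xFF) return -1;            /* zlend, not an entry */

    if (p[0] < 254) {                       /* 1-byte prevlen */
        h->prevlen_size = 1;
        h->prevlen = p[0];
    } else {                                /* 254 marker + 4 bytes, little-endian */
        h->prevlen_size = 5;
        h->prevlen = (uint32_t)p[1] | ((uint32_t)p[2] << 8) |
                     ((uint32_t)p[3] << 16) | ((uint32_t)p[4] << 24);
    }

    const unsigned char *e = p + h->prevlen_size;
    h->small_int = -1;
    h->enc_size = 1;
    h->is_string = (e[0] & 0xC0) != 0xC0;

    if ((e[0] & 0xC0) == 0x00) {            /* format 1: 6-bit length */
        h->data_len = e[0] & 0x3F;
    } else if ((e[0] & 0xC0) == 0x40) {     /* format 2: 14-bit length */
        h->enc_size = 2;
        h->data_len = ((uint32_t)(e[0] & 0x3F) << 8) | e[1];
    } else if (e[0] == 0x80) {              /* format 3: 32-bit length */
        h->enc_size = 5;
        h->data_len = ((uint32_t)e[1] << 24) | ((uint32_t)e[2] << 16) |
                      ((uint32_t)e[3] << 8) | e[4];
    } else if (e[0] == 0xC0) {              /* format 4: int16_t */
        h->data_len = 2;
    } else if (e[0] == 0xD0) {              /* format 5: int32_t */
        h->data_len = 4;
    } else if (e[0] == 0xE0) {              /* format 6: int64_t */
        h->data_len = 8;
    } else if (e[0] == 0xF0) {              /* format 7: 3-byte integer */
        h->data_len = 3;
    } else if (e[0] == 0xFE) {              /* format 8: 1-byte integer */
        h->data_len = 1;
    } else if (e[0] >= 0xF1 && e[0] <= 0xFD) {
        h->data_len = 0;                    /* format 9: value lives in encoding */
        h->small_int = (e[0] & 0x0F) - 1;
    } else {
        return -1;
    }
    return 0;
}
```

For the two bytes 00 f3, this sketch reports a 1-byte prevlen of 0, an integer encoding, no entry-data, and a stored value of 2.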

Byte order

The encoding attribute can use multiple bytes to store the length of the node element. The order in which the bytes of such multi-byte data are laid out in memory or sent over a network is called byte order (endianness). There are two byte orders: big-endian and little-endian.

  • Big-endian byte order: low-byte data is stored at the high memory location, and high-byte data is stored at the low memory location.
  • Little-endian byte order: low-byte data is stored at the low address of the memory, and high-byte data is stored at the high address of the memory.

Figure 2-1 shows how the value 0x44332211 is stored in big-endian and in little-endian order.
Figure 2-1

CPUs usually process memory in the direction of increasing addresses. With little-endian byte order the CPU reads and handles the low-order byte first, which can make arithmetic that propagates borrows and carries more efficient. Big-endian byte order is closer to the way people read and write numbers.
The ziplist uses little-endian byte order.
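
As a small illustration of the two byte orders, the following sketch writes a 32-bit value into a byte array both ways. The helper names store_le32 and store_be32 are made up for this example and are not Redis functions.

```c
#include <stdint.h>

/* Store v with the least significant byte at the lowest address. */
static void store_le32(unsigned char out[4], uint32_t v) {
    out[0] = (unsigned char)(v & 0xFF);
    out[1] = (unsigned char)((v >> 8) & 0xFF);
    out[2] = (unsigned char)((v >> 16) & 0xFF);
    out[3] = (unsigned char)((v >> 24) & 0xFF);
}

/* Store v with the most significant byte at the lowest address. */
static void store_be32(unsigned char out[4], uint32_t v) {
    out[0] = (unsigned char)((v >> 24) & 0xFF);
    out[1] = (unsigned char)((v >> 16) & 0xFF);
    out[2] = (unsigned char)((v >> 8) & 0xFF);
    out[3] = (unsigned char)(v & 0xFF);
}
```

For 0x44332211, store_le32 produces the bytes 11 22 33 44 and store_be32 produces 44 33 22 11. Because the helpers use shifts rather than pointer casts, the result does not depend on the byte order of the host.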
Here is a simple example provided by Redis, a ziplist that stores the integers 2 and 5:

[0f 00 00 00] [0c 00 00 00] [02 00] [00 f3] [02 f6] [ff]

  • [0f 00 00 00]: zlbytes is 15, so the entire ziplist occupies 15 bytes. Note that this value is stored in little-endian byte order.
  • [0c 00 00 00]: zltail is 12, the offset from the start of the ziplist to the last node ([02 f6]).
  • [02 00]: zllen is 2, so the ziplist contains 2 nodes.
  • [00 f3]: 00 is the length of the previous node (there is none). f3 uses encoding format ⑨, and the stored value is the lower 4 bits of encoding minus 1, that is, 2.
  • [02 f6]: 02 means the previous node is 2 bytes long. f6 uses the same encoding format, and the stored value is 5.
  • [ff]: the end marker node.

The ziplist is one of the more complex data structures in Redis. The attribute descriptions and the example above should help readers understand how data is laid out in a ziplist.
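
The example can also be decoded in code. The sketch below is a simplified reader written for this walkthrough, not part of Redis: load_le32, load_le16, decode_small_ints and the example_zl array are hypothetical names, and the reader only understands nodes with a 1-byte prevlen and encoding format ⑨.

```c
#include <stddef.h>
#include <stdint.h>

/* The 15-byte example ziplist holding the integers 2 and 5. */
static const unsigned char example_zl[] = {
    0x0f, 0x00, 0x00, 0x00,   /* zlbytes = 15 */
    0x0c, 0x00, 0x00, 0x00,   /* zltail  = 12 */
    0x02, 0x00,               /* zllen   = 2  */
    0x00, 0xf3,               /* node 1: prevlen 0, value 2 */
    0x02, 0xf6,               /* node 2: prevlen 2, value 5 */
    0xff                      /* zlend */
};

/* Read little-endian fields byte by byte, independent of host byte order. */
static uint32_t load_le32(const unsigned char *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint16_t load_le16(const unsigned char *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Walk the nodes from head to tail and collect their values into out.
 * Returns the number of values, or -1 if a node uses a layout this
 * simplified sketch does not handle or out is too small. */
static int decode_small_ints(const unsigned char *zl, int *out, int max) {
    uint32_t zlbytes = load_le32(zl);
    const unsigned char *p = zl + 10;        /* header is 4 + 4 + 2 bytes */
    int n = 0;
    while (p < zl + zlbytes && p[0] != 0xFF) {
        if (p[0] >= 254) return -1;          /* 5-byte prevlen not handled */
        unsigned char enc = p[1];
        if (enc < 0xF1 || enc > 0xFD) return -1;
        if (n == max) return -1;
        out[n++] = (enc & 0x0F) - 1;         /* format 9: low 4 bits minus 1 */
        p += 2;                              /* 1 byte prevlen + 1 byte encoding */
    }
    return n;
}
```

Called on example_zl, decode_small_ints yields the two values 2 and 5, and the three header fields read back as 15, 12 and 2.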

Operational analysis

Tip: The following codes in this section are in ziplist.h and ziplist.c unless otherwise specified.

The ziplistFind function is responsible for finding elements in the ziplist:

unsigned char *ziplistFind(unsigned char *p, unsigned char *vstr, unsigned int vlen, unsigned int skip) {
    int skipcnt = 0;
    unsigned char vencoding = 0;
    long long vll = 0;

    while (p[0] != ZIP_END) {
        unsigned int prevlensize, encoding, lensize, len;
        unsigned char *q;
        // [1]
        ZIP_DECODE_PREVLENSIZE(p, prevlensize);
        // [2]
        ZIP_DECODE_LENGTH(p + prevlensize, encoding, lensize, len);
        q = p + prevlensize + lensize;

        if (skipcnt == 0) {
            // [3]
            if (ZIP_IS_STR(encoding)) {
                if (len == vlen && memcmp(q, vstr, vlen) == 0) {
                    return p;
                }
            } else {
                // [4]
                if (vencoding == 0) {
                    if (!zipTryEncoding(vstr, vlen, &vll, &vencoding)) {
                        vencoding = UCHAR_MAX;
                    }
                    assert(vencoding);
                }

                // [5]
                if (vencoding != UCHAR_MAX) {
                    long long ll = zipLoadInteger(q, encoding);
                    if (ll == vll) {
                        return p;
                    }
                }
            }

            // [6]
            skipcnt = skip;
        } else {
            skipcnt--;
        }

        // [7]
        p = q + len;
    }

    return NULL;
}

Parameter description:

  • p: the node in the ziplist at which the search starts.
  • vstr, vlen: the content and length of the element to find.
  • skip: the number of nodes to skip between two element comparisons.

[1] Determine whether the prevlen attribute of the current node occupies 1 byte or 5 bytes, and store the result in the prevlensize variable.
[2] Decode the relevant attributes of the current node and store the results in the following variables:
encoding: the node's encoding format.
lensize: the number of bytes used to store the length of the node element. Formats ② and ③ need extra space to store that length.
len: the length of the node element.
[3] If the current node element uses a string encoding, compare the string contents and return the node if they are equal.
[4] The current node element uses an integer encoding. If the search content vstr has not been encoded yet, try to encode it as an integer (this is done only once) and store the encoded value in the vll variable.
[5] If the previous step succeeded (the search content is also an integer), compare the two integers; otherwise no comparison is needed. The zipLoadInteger function extracts the integer stored in the node, which is then compared with the vll variable obtained in the previous step.
[6] After a comparison, reset skipcnt to skip. While skipcnt is not 0, the node is skipped and skipcnt is decremented by 1; elements are compared again only once skipcnt reaches 0.
[7] Point p to q + len, that is, p + prevlensize + lensize + len (the data length), which is the start of the next node.
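
The skip counter in steps [6] and [7] is easy to misread, so here is a minimal model of just that logic over a plain int array. The function find_with_skip is a hypothetical stand-in for this illustration; it drops everything about node encoding and keeps only the skipcnt bookkeeping.

```c
#include <stddef.h>

/* Compare items[i] with target, then skip the next `skip` items before
 * comparing again. Returns the index of the first match, or -1. */
static long find_with_skip(const int *items, size_t n, int target,
                           unsigned int skip) {
    unsigned int skipcnt = 0;
    for (size_t i = 0; i < n; i++) {
        if (skipcnt == 0) {
            if (items[i] == target) return (long)i;
            skipcnt = skip;      /* a comparison was made: start skipping */
        } else {
            skipcnt--;           /* still skipping: do not compare */
        }
    }
    return -1;
}
```

With skip set to 1, only every second element is compared. That is useful when, for instance, keys and values alternate in one sequence and only the keys should be matched: in {1, 2, 2, 9}, searching for 2 with skip 1 returns index 2, not index 1.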

Tip: Some functions in the source code are too long, so to keep the layout tidy this book splits them into several code segments and uses "// more" to mark a function that continues in another segment. Readers should watch for this mark.

Let's take a look at how to insert a node in the ziplist:

unsigned char *__ziplistInsert(unsigned char *zl, unsigned char *p, unsigned char *s, unsigned int slen) {
    ...
    // [1]
    if (p[0] != ZIP_END) {
        ZIP_DECODE_PREVLEN(p, prevlensize, prevlen);
    } else {
        unsigned char *ptail = ZIPLIST_ENTRY_TAIL(zl);
        if (ptail[0] != ZIP_END) {
            prevlen = zipRawEntryLength(ptail);
        }
    }

    // [2]
    if (zipTryEncoding(s,slen,&value,&encoding)) {
        reqlen = zipIntSize(encoding);
    } else {
        reqlen = slen;
    }

    // [3]
    reqlen += zipStorePrevEntryLength(NULL,prevlen);
    reqlen += zipStoreEntryEncoding(NULL,encoding,slen);

    // [4]
    int forcelarge = 0;
    nextdiff = (p[0] != ZIP_END) ? zipPrevLenByteDiff(p,reqlen) : 0;
    if (nextdiff == -4 && reqlen < 4) {
        nextdiff = 0;
        forcelarge = 1;
    }

    // more
}

Parameter description:

  • zl: the ziplist to insert into.
  • p: points to the trailing node of the insertion position, that is, the node that will follow the inserted node.
  • s, slen: the content and length of the element to insert.

[1] Compute the length of the predecessor node and store it in the prevlen variable.
If p does not point to ZIP_END, the prevlen attribute of node p can be read directly. Otherwise the last node is located through ziplist.zltail (it becomes the predecessor), and its length is computed.
[2] Encode the content of the element to insert and store the length of the content in the reqlen variable.
The zipTryEncoding function tries to encode the element content as an integer. If it succeeds, the function returns 1, value holds the encoded integer, and encoding holds the corresponding encoding format; otherwise it returns 0.
[3] The zipStorePrevEntryLength function computes the size of the prevlen attribute (1 byte or 5 bytes).
The zipStoreEntryEncoding function computes the number of bytes the encoding attribute occupies, including the extra length bytes used by formats ② and ③. Adding the return values of these two functions to reqlen gives the total length of the inserted node.
[4] The zipPrevLenByteDiff function computes by how many bytes the prevlen attribute of the trailing node must be resized, and the result is stored in the nextdiff variable.

Suppose p points to node e2 and, before the insertion, the predecessor of e2 is e1. The prevlen of e2 then stores the length of e1.
After the insertion, the predecessor of e2 is the inserted node, so the prevlen of e2 must store the length of the inserted node and therefore has to be modified. Figure 2-2 shows a simple example.
Figure 2-2

As Figure 2-2 shows, the size of the prevlen attribute of the trailing node e2 changes from 1 byte to 5 bytes, so the nextdiff variable is 4.
If the inserted node is shorter than 4 bytes and the prevlen attribute of the original trailing node e2 occupies 5 bytes (nextdiff is -4), forcelarge is set to 1, which forces the prevlen attribute of e2 to keep its 5-byte size. Readers may want to think about why it is designed this way.
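
Step [4] can be modelled with a few lines of arithmetic. The following sketch uses hypothetical helper names (prevlen_size, plan_nextdiff) and works on plain numbers rather than on a real ziplist; it only mirrors the decision shown in the code segment above.

```c
#include <stddef.h>

/* Bytes needed by a prevlen attribute that stores `len`. */
static int prevlen_size(size_t len) {
    return len < 254 ? 1 : 5;
}

/* cur_size: current size (1 or 5) of the trailing node's prevlen attribute.
 * reqlen:   total length of the node being inserted.
 * Returns the adjusted nextdiff and sets *forcelarge like step [4]. */
static int plan_nextdiff(int cur_size, size_t reqlen, int *forcelarge) {
    int nextdiff = prevlen_size(reqlen) - cur_size;   /* +4, 0 or -4 */
    *forcelarge = 0;
    if (nextdiff == -4 && reqlen < 4) {
        nextdiff = 0;        /* keep the 5-byte prevlen instead of shrinking */
        *forcelarge = 1;
    }
    return nextdiff;
}
```

A trailing node with a 1-byte prevlen and a 300-byte inserted node gives nextdiff 4, as in Figure 2-2. A trailing node with a 5-byte prevlen and a 3-byte inserted node gives nextdiff 0 with forcelarge set to 1.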
Let's continue with the __ziplistInsert function:

unsigned char *__ziplistInsert(unsigned char *zl, unsigned char *p, unsigned char *s, unsigned int slen) {
    ...
    // [5]
    offset = p-zl;
    zl = ziplistResize(zl,curlen+reqlen+nextdiff);
    p = zl+offset;

    if (p[0] != ZIP_END) {
        // [6]
        memmove(p+reqlen,p-nextdiff,curlen-offset-1+nextdiff);

        // [7]
        if (forcelarge)
            zipStorePrevEntryLengthLarge(p+reqlen,reqlen);
        else
            zipStorePrevEntryLength(p+reqlen,reqlen);

        // [8]
        ZIPLIST_TAIL_OFFSET(zl) =
            intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+reqlen);
        // [9]
        zipEntry(p+reqlen, &tail);
        if (p[reqlen+tail.headersize+tail.len] != ZIP_END) {
            ZIPLIST_TAIL_OFFSET(zl) =
                intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+nextdiff);
        }
    } else {
        // [10]
        ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(p-zl);
    }

    // [11]
    if (nextdiff != 0) {
        offset = p-zl;
        zl = __ziplistCascadeUpdate(zl,p+reqlen);
        p = zl+offset;
    }

    // [12]
    p += zipStorePrevEntryLength(p,prevlen);
    p += zipStoreEntryEncoding(p,encoding,slen);
    if (ZIP_IS_STR(encoding)) {
        memcpy(p,s,slen);
    } else {
        zipSaveInteger(p,value,encoding);
    }
    // [13]
    ZIPLIST_INCR_LENGTH(zl,1);
    return zl;
}

[5] Reallocate memory for the ziplist, mainly to obtain space for the inserted node. The new size of the ziplist is curlen+reqlen+nextdiff (the curlen variable is the length of the ziplist before the insertion). p is then reassigned to zl+offset (the offset variable is the offset of the insertion position), because the ziplistResize function may move the ziplist to a new memory address.
Steps [6] to [9] handle the case where a trailing node exists.
[6] Move all nodes after the insertion position back to make room for the inserted node. The source address of the move is p-nextdiff; nextdiff is subtracted because the prevlen attribute of the trailing node must be resized by nextdiff bytes. The number of bytes moved is curlen-offset-1+nextdiff; 1 is subtracted because the end marker node has already been written by the ziplistResize function.
memmove is the memory-move function of the C standard library; it handles overlapping source and destination regions correctly.
[7] Update the prevlen attribute of the trailing node.
[8] Update ziplist.zltail by adding reqlen.
[9] If there is more than one trailing node, nextdiff must also be added to ziplist.zltail.
If there is only one trailing node, nextdiff need not be added: the size of that node changes by nextdiff, but its start position moves only by reqlen.

Tip: The zipEntry function fills a zlentry structure with all the information of a given node. The zlentry structure only holds node information during computation; the data as actually stored does not use this structure. Readers should not be misled by the variable name tail. It refers to the trailing node of the inserted node, which is not necessarily the tail node of the ziplist.

[10] If there is no trailing node, only the offset of the last node, ziplist.zltail, needs to be updated.
[11] Perform the cascade update.
[12] Write the inserted data.
[13] Update the node count ziplist.zllen.
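
Steps [5] and [6] follow a common pattern for inserting into a contiguous buffer: grow the block, then shift the tail with memmove. The sketch below shows only that pattern on a plain byte buffer. The function insert_bytes is a hypothetical helper for this illustration and ignores everything ziplist-specific (header fields, prevlen, the end marker).

```c
#include <stdlib.h>
#include <string.h>

/* Insert n bytes from src at `offset` into a heap buffer of *len bytes.
 * Returns the (possibly moved) buffer and updates *len, or returns NULL
 * and leaves the original buffer untouched if the request is invalid
 * or allocation fails. */
static unsigned char *insert_bytes(unsigned char *buf, size_t *len,
                                   size_t offset,
                                   const unsigned char *src, size_t n) {
    if (offset > *len) return NULL;
    if (n == 0) return buf;

    /* Step 1: grow the block; realloc may return a new address. */
    unsigned char *nb = realloc(buf, *len + n);
    if (nb == NULL) return NULL;

    /* Step 2: shift the tail back; the regions overlap, so memmove. */
    memmove(nb + offset + n, nb + offset, *len - offset);

    /* Step 3: write the new bytes into the gap. */
    memcpy(nb + offset, src, n);
    *len += n;
    return nb;
}
```

Because realloc may move the block, the caller has to continue with the returned pointer, which is the same reason __ziplistInsert recomputes p from zl+offset after calling ziplistResize.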

The following line deserves an explanation:

ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+reqlen);

The intrev32ifbe function does the following: if the host uses little-endian byte order, it does nothing. If the host uses big-endian byte order, it reverses the byte order of the data (the first and fourth bytes are swapped, and the second and third bytes are swapped), which converts big-endian data to little-endian or little-endian data to big-endian.
In the code above, if the host CPU uses little-endian byte order, intrev32ifbe does no processing at all.
If the host CPU uses big-endian byte order, the data fetched from memory is first passed through intrev32ifbe to convert it to big-endian byte order before the calculation. After the calculation, intrev32ifbe is called again to convert the result back to little-endian byte order before it is stored in memory.
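
The read, convert, add, convert, write sequence can be sketched as follows. The names rev32, host_is_big_endian, rev32_if_be and add_to_le32_field are hypothetical and written for this illustration. In particular, detecting the host byte order at run time is an assumption of the sketch, not a description of how Redis selects the conversion.

```c
#include <stdint.h>
#include <string.h>

/* Reverse the four bytes of v. */
static uint32_t rev32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Run-time probe of the host byte order (an assumption of this sketch). */
static int host_is_big_endian(void) {
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 0;
}

/* Identity on little-endian hosts, byte reversal on big-endian hosts. */
static uint32_t rev32_if_be(uint32_t v) {
    return host_is_big_endian() ? rev32(v) : v;
}

/* Add delta to a 32-bit field stored little-endian at p. */
static void add_to_le32_field(unsigned char *p, uint32_t delta) {
    uint32_t raw;
    memcpy(&raw, p, 4);                  /* bytes exactly as stored */
    uint32_t host = rev32_if_be(raw);    /* to host order for arithmetic */
    host += delta;
    raw = rev32_if_be(host);             /* back to little-endian layout */
    memcpy(p, &raw, 4);
}
```

Applied to the stored bytes 0c 00 00 00 with a delta of 2, the field becomes 0e 00 00 00 on hosts of either byte order.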

Cascade update

Example 2-1:
Consider an extreme scenario: a new node ne whose element data is 254 bytes long is inserted before node e2 of the ziplist, as shown in Figure 2-3.
Figure 2-3

The result of the insertion is shown in Figure 2-4.
Figure 2-4

After the insertion, the prevlen attribute of e2 must grow to 5 bytes.
Now look at the prevlen of e3. Before the insertion, e2 is 253 bytes long, so the prevlen attribute of e3 occupies 1 byte. After the insertion, e2 is 257 bytes long, so the prevlen attribute of e3 must grow as well. This is the cascade update. In extreme cases, the nodes after e3 must continue to update their prevlen attributes too.
Let's look at how cascade updates are implemented:

unsigned char *__ziplistCascadeUpdate(unsigned char *zl, unsigned char *p) {
    size_t curlen = intrev32ifbe(ZIPLIST_BYTES(zl)), rawlen, rawlensize;
    size_t offset, noffset, extra;
    unsigned char *np;
    zlentry cur, next;
    // [1]
    while (p[0] != ZIP_END) {
        // [2]
        zipEntry(p, &cur);
        rawlen = cur.headersize + cur.len;
        rawlensize = zipStorePrevEntryLength(NULL,rawlen);
              
        if (p[rawlen] == ZIP_END) break;
        // [3]
        zipEntry(p+rawlen, &next);
        
        if (next.prevrawlen == rawlen) break;
        // [4]
        if (next.prevrawlensize < rawlensize) {
            // [5]
            offset = p-zl;
            extra = rawlensize-next.prevrawlensize;
            zl = ziplistResize(zl,curlen+extra);
            p = zl+offset;

            // [6]
            np = p+rawlen;
            noffset = np-zl;

            if ((zl+intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))) != np) {
                ZIPLIST_TAIL_OFFSET(zl) =
                    intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+extra);
            }

            // [7]
            memmove(np+rawlensize,
                np+next.prevrawlensize,
                curlen-noffset-next.prevrawlensize-1);
            zipStorePrevEntryLength(np,rawlen);

            // [8]
            p += rawlen;
            curlen += extra;
        } else {
            // [9]
            if (next.prevrawlensize > rawlensize) {
                zipStorePrevEntryLengthLarge(p+rawlen,rawlen);
            } else {
                // [10]
                zipStorePrevEntryLength(p+rawlen,rawlen);
            }
            // [11]
            break;
        }
    }
    return zl;
}

Parameter description:

  • p: points to the trailing node of the inserted node. For convenience, the node p points to is called the current node below.

[1] Exit the loop when ZIP_END is reached.
[2] Decode the current node. The rawlen variable is the length of the current node, and the rawlensize variable is the number of bytes needed to store that length.
p[rawlen] is the first byte of the trailing node of p. If the trailing node is ZIP_END, exit.
[3] Decode the trailing node. If the prevlen of the trailing node already equals the length of the current node, exit.
[4] Suppose storing the length of the current node requires actprevlen bytes (1 or 5; this is the rawlensize variable in the code). Three cases must be handled. Case 1: the prevlen attribute of the trailing node is smaller than actprevlen, so it must be expanded, as in the scenario of Example 2-1.
[5] Reallocate memory for the ziplist.
[6] If the trailing node is not the last node of the ziplist, increase ziplist.zltail by extra.
[7] Move the data that follows the prevlen attribute of the trailing node back to make room for the larger prevlen, then write the new prevlen.
[8] Point p to the trailing node and continue with the prevlen of its own trailing node.
[9] Case 2: the prevlen attribute of the trailing node is larger than actprevlen, so it could be shrunk. To keep the cascade update from continuing, the size of the prevlen attribute of the trailing node is forced to remain unchanged.
[10] Case 3: the prevlen attribute of the trailing node is the same size as actprevlen. Only the prevlen value of the trailing node is modified; the size of the ziplist does not change.
[11] In cases 2 and 3 the cascade update does not need to continue, so the loop exits.
This also answers the earlier question of why forcelarge is set to 1 in the __ziplistInsert function: it avoids triggering a cascade update when a small node is inserted, so the prevlen attribute of the trailing node is forced to keep its size.
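
The chain reaction is easier to see on plain numbers than on raw bytes. The following sketch is a simplified model, not the Redis implementation: it tracks only each node's total length and the current size of its prevlen attribute, and the names prevlen_size and cascade_grow are hypothetical.

```c
#include <stddef.h>

/* Bytes needed by a prevlen attribute that stores `len`. */
static size_t prevlen_size(size_t len) {
    return len < 254 ? 1 : 5;
}

/* node_len[i]:  total length of node i in bytes.
 * prev_size[i]: current size (1 or 5) of node i's prevlen attribute.
 * new_prev_len: length of the node just inserted before node 0.
 * Grows prevlen attributes as needed and returns how many nodes grew. */
static size_t cascade_grow(size_t *node_len, size_t *prev_size, size_t n,
                           size_t new_prev_len) {
    size_t grown = 0;
    size_t prev = new_prev_len;
    for (size_t i = 0; i < n; i++) {
        size_t need = prevlen_size(prev);
        if (need <= prev_size[i]) break;      /* cases 2 and 3: stop here */
        node_len[i] += need - prev_size[i];   /* case 1: node grows by 4 */
        prev_size[i] = need;
        grown++;
        prev = node_len[i];                   /* next node sees the new length */
    }
    return grown;
}
```

With three 253-byte nodes that each use a 1-byte prevlen, inserting a 260-byte node in front makes all three grow to 257 bytes. If the second node is only 100 bytes long, it grows to 104 bytes, which still fits in a 1-byte prevlen, so the third node is left alone.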

The analysis above shows that cascade updates perform very poorly and make the code complex. How can this be solved? First, why is the prevlen attribute needed at all? Because during reverse traversal, stepping back over a node requires knowing the length of the previous node.
Suppose instead that each node stores a copy of its own length at its end. When traversing backwards, the length of the previous node can then be read directly from the end of that node. Each node becomes independent, and inserting or deleting a node no longer causes cascade updates. Based on this idea, the Redis author designed another structure, listpack. listpack is intended to replace ziplist, but ziplist is widely used and replacing it is complicated, so at the time of writing listpack is used only in the newly added Stream structure. We will discuss the design of listpack when we analyze Stream. This shows that good design is not achieved overnight.
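
The idea of storing each node's length at its own end can be sketched with a deliberately simplified layout. In the sketch below every entry is its payload followed by a single byte holding the payload length (so payloads are shorter than 256 bytes). This layout and the function entry_start_from_end are assumptions made for illustration; the real listpack encoding is different.

```c
#include <stddef.h>

/* `end` is the offset one past the last byte of an entry. Each entry is
 * <payload bytes><1 byte payload length>. Returns the offset where the
 * entry starts and stores its payload length in *payload_len. */
static size_t entry_start_from_end(const unsigned char *buf, size_t end,
                                   size_t *payload_len) {
    *payload_len = buf[end - 1];        /* trailing length byte */
    return end - 1 - *payload_len;      /* step back over length + payload */
}
```

For the buffer 'a' 'b' 02 'x' 'y' 'z' 03, starting at offset 7 gives an entry that starts at offset 3 with a 3-byte payload, and starting at offset 3 gives an entry that starts at offset 0 with a 2-byte payload. No entry stores anything about its neighbor, so changing one entry never forces another to be rewritten.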
The commonly used functions provided by ziplist are shown in Table 2-1.

Table 2-1

Function        Effect
ziplistNew      Create an empty ziplist
ziplistPush     Add an element to the head or tail of the ziplist
ziplistInsert   Insert an element at the specified position of the ziplist
ziplistFind     Find the given element
ziplistDelete   Delete a given node

Even with the new listpack format, each insertion of a new node may still require two memory copies:
(1) Allocating new memory for the entire list, mainly to create space for the new node.
(2) Moving all trailing nodes of the inserted node back to make room for the inserted node.
If the list is very long, every insertion or deletion copies a large amount of memory. Such performance is unacceptable, so how can this be solved? This is where the quicklist comes in.
Due to space limitations, we will analyze the quicklist in the next article.

This article is excerpted from the author's new book "Redis Core Principles and Practice". The book analyzes in depth the internal mechanisms and implementation of common Redis features. Most of the content comes from analysis of the Redis source code and summarizes the design ideas and implementation principles behind it. By reading the book, readers can quickly and easily understand how Redis works internally.
With the consent of the book's editor, I will continue to publish some chapters on my personal technical public account (binecy) as a preview of the book. You are welcome to check it out, thank you.

