Performance optimization practice for expanding large Redis clusters

1. Background

In the production environment, services that use Redis clusters often need to add nodes as business volume grows.

We previously learned that after operations colleagues expanded some Redis clusters with a relatively large number of nodes, the business side reported degraded cluster performance, specifically a significant increase in access latency.

Some services are sensitive to Redis cluster access latency. For example, a production service may read a model in real time, or a service may read the Redis cluster synchronously, so that higher Redis latency directly increases the latency of its real-time flow. This may be unacceptable to the business side.

To find the root cause, we investigated the performance degradation that followed the migration operation on one such Redis cluster.

1.1 Problem description

In this case, a Redis cluster had been expanded. The business side accesses the cluster through Hiredis-vip and issues MGET operations.

The business side observed that the latency of accessing the Redis cluster had increased.

1.2 Description of the production environment

Most of the Redis instances currently deployed in production run version 3.x or 4.x;
Business services access the Redis cluster through many kinds of clients, mostly Jedis. The service investigated here uses the Hiredis-vip client;
The Redis cluster has a relatively large number of nodes, on the order of 100+;
The cluster had previously been expanded.

1.3 Observed phenomena

Because latency had increased, we investigated the following aspects:

Whether the bandwidth is saturated;
Whether CPU usage is too high;
Whether OPS is too high;

Simple monitoring checks showed that the bandwidth load was not high, but the CPU behaved abnormally:

1.3.1 Compare OPS and CPU load

Looking at the MGET operations reported by the business side together with the CPU load, we found the corresponding monitoring curves.

In terms of timing, MGET and high CPU load are not directly related. The business side reported that MGET latency had generally increased, yet the curves show that the peaks of MGET OPS and CPU load are staggered.

We can tentatively conclude that service requests and CPU load are not directly related, although the curves, plotted on the same time axis, suggest an indirect relationship.

1.3.2 Comparing CLUSTER command OPS and CPU load

Since operations colleagues had reported that the cluster was expanded earlier, slot migration must have taken place.

Business clients generally cache the slot topology of the Redis cluster, so we suspected that the CLUSTER command was related to the CPU load.

We found a connection between them:

It is clear that whenever an instance executes the CLUSTER command, its CPU usage rises significantly.

Based on these observations, we can narrow the focus:

The business side executes MGET, and for some reason the CLUSTER command is executed as well;
For some reason the CLUSTER command causes high CPU usage and affects other operations;
The CLUSTER command is suspected to be the performance bottleneck.

This also raises several questions that need attention:

Why are so many CLUSTER commands executed?

Why is CPU usage relatively high when the CLUSTER command is executed?

Why are clusters with many nodes especially prone to problems during slot migration?

2. Troubleshooting

2.1 Redis hot spot troubleshooting

We used perf top to perform a simple analysis of a Redis instance with high CPU load:

As the figure above shows, the function clusterReplyMultiBulkSlots accounts for 51.84% of CPU usage, which is abnormal.

2.1.1 Implementation principle of clusterReplyMultiBulkSlots

We analyzed the clusterReplyMultiBulkSlots function:

void clusterReplyMultiBulkSlots(client *c) {
    /* Format: 1) 1) start slot
     *            2) end slot
     *            3) 1) master IP
     *               2) master port
     *               3) node ID
     *            4) 1) replica IP
     *               2) replica port
     *               3) node ID
     *           ... continued until done
     */
 
    int num_masters = 0;
    void *slot_replylen = addDeferredMultiBulkLength(c);
 
    dictEntry *de;
    dictIterator *di = dictGetSafeIterator(server.cluster->nodes);
    while((de = dictNext(di)) != NULL) {
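        /* Note: this iterates over every cluster node recorded by the current
         * Redis node; replicas are skipped below, so every master is visited. */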
        clusterNode *node = dictGetVal(de);
        int j = 0, start = -1;
 
        /* Skip slaves (that are iterated when producing the output of their
         * master) and  masters not serving any slot. */
        /* Skip replicas. Replica information is obtained from the master side. */
        if (!nodeIsMaster(node) || node->numslots == 0) continue;
        for (j = 0; j < CLUSTER_SLOTS; j++) {
            /* Note: this iterates over every slot for the current node. */
            int bit, i;
            /* Check whether the current node owns the slot at loop index j. */
            if ((bit = clusterNodeGetSlotBit(node,j)) != 0) {
                if (start == -1) start = j;
            }
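            /* In short: find contiguous ranges. A set bit opens a range, which
             * runs until the first slot this node does not own (or the last
             * slot); at that point the range is added to the reply. Otherwise
             * the scan simply continues with the next slot. */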
            if (start != -1 && (!bit || j == CLUSTER_SLOTS-1)) {
                int nested_elements = 3; /* slots (2) + master addr (1). */
                void *nested_replylen = addDeferredMultiBulkLength(c);
 
                if (bit && j == CLUSTER_SLOTS-1) j++;
 
                /* If slot exists in output map, add to it's list.
                 * else, create a new output map for this slot */
                if (start == j-1) {
                    addReplyLongLong(c, start); /* only one slot; low==high */
                    addReplyLongLong(c, start);
                } else {
                    addReplyLongLong(c, start); /* low */
                    addReplyLongLong(c, j-1);   /* high */
                }
                start = -1;
 
                /* First node reply position is always the master */
                addReplyMultiBulkLen(c, 3);
                addReplyBulkCString(c, node->ip);
                addReplyLongLong(c, node->port);
                addReplyBulkCBuffer(c, node->name, CLUSTER_NAMELEN);
 
                /* Remaining nodes in reply are replicas for slot range */
                for (i = 0; i < node->numslaves; i++) {
                    /* Note: iterate over this master's replicas and add them to the reply. */
                    /* This loop is copy/pasted from clusterGenNodeDescription()
                     * with modifications for per-slot node aggregation */
                    if (nodeFailed(node->slaves[i])) continue;
                    addReplyMultiBulkLen(c, 3);
                    addReplyBulkCString(c, node->slaves[i]->ip);
                    addReplyLongLong(c, node->slaves[i]->port);
                    addReplyBulkCBuffer(c, node->slaves[i]->name, CLUSTER_NAMELEN);
                    nested_elements++;
                }
                setDeferredMultiBulkLength(c, nested_replylen, nested_elements);
                num_masters++;
            }
        }
    }
    dictReleaseIterator(di);
    setDeferredMultiBulkLength(c, slot_replylen, num_masters);
}
 
/* Return the slot bit from the cluster node structure. */
/* Determines whether the given slot belongs to the given clusterNode. */
int clusterNodeGetSlotBit(clusterNode *n, int slot) {
    return bitmapTestBit(n->slots,slot);
}
 
/* Test bit 'pos' in a generic bitmap. Return 1 if the bit is set,
 * otherwise 0. */
/* Checks whether the bit at the given position in the bitmap is 1. */
int bitmapTestBit(unsigned char *bitmap, int pos) {
    off_t byte = pos/8;
    int bit = pos&7;
    return (bitmap[byte] & (1<<bit)) != 0;
}

typedef struct clusterNode {
    ...
    /* A char array of length CLUSTER_SLOTS/8 records the slots currently assigned to this node. */
    unsigned char slots[CLUSTER_SLOTS/8]; /* slots handled by this node */
    ...
} clusterNode;

Each node (clusterNode) uses a bitmap (unsigned char slots[CLUSTER_SLOTS/8]) to store its slot allocation.

A brief look at the logic of bitmapTestBit: clusterNode->slots is an array of length CLUSTER_SLOTS/8, and CLUSTER_SLOTS is fixed at 16384. Each bit in the array represents one slot. The array indices run from 0 to 2047, and slot numbers run from 0 to 16383.

To determine whether the bit at position pos is 1:

off_t byte = pos/8: finds the byte of the bitmap that stores the bit for pos. A byte has 8 bits, so pos/8 gives the index of the byte to look at when the bitmap is treated as a byte array.
int bit = pos&7: finds which bit within that byte represents pos. &7 is equivalent to %8 here: if positions are grouped in eights, the remainder is the index of the bit inside the corresponding byte.
(bitmap[byte] & (1<<bit)): tests whether that bit is set in bitmap[byte].

Take slot 10001 as an example:

10001/8 = 1250 and 10001&7 = 1, so slot 10001 maps to the byte at index 1250, and the bit to check is the bit at index 1.

The corresponding position in clusterNode->slots:

The green square in the figure is bitmap[1250], the byte that stores slot 10001; the red box (bit[1]) is the position selected by 1<<bit. bitmap[byte] & (1<<bit) checks whether the bit in the red box is 1. If so, slot 10001 is marked in the bitmap.
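The slot 10001 example can be checked with a small standalone sketch. It reuses the same byte and bit arithmetic as bitmapTestBit; the helpers slotByteIndex, slotBitIndex, bitmapSetBit and demoProbe are illustrative names introduced here, not functions quoted from the Redis source.

```c
#include <assert.h>
#include <string.h>

#define CLUSTER_SLOTS 16384

/* Index of the byte that holds the bit for 'slot' (8 slots per byte). */
static int slotByteIndex(int slot) { return slot / 8; }

/* Index of the bit inside that byte; slot & 7 equals slot % 8 for slot >= 0. */
static int slotBitIndex(int slot) { return slot & 7; }

/* Illustrative helper: mark 'slot' as owned in the bitmap. */
static void bitmapSetBit(unsigned char *bitmap, int slot) {
    assert(slot >= 0 && slot < CLUSTER_SLOTS);
    bitmap[slotByteIndex(slot)] |= (unsigned char)(1 << slotBitIndex(slot));
}

/* Same test as the bitmapTestBit function shown earlier. */
static int bitmapTestBit(const unsigned char *bitmap, int slot) {
    assert(slot >= 0 && slot < CLUSTER_SLOTS);
    return (bitmap[slotByteIndex(slot)] & (1 << slotBitIndex(slot))) != 0;
}

/* Build an empty bitmap, mark slot 10001, and report whether 'probe' is set. */
static int demoProbe(int probe) {
    unsigned char slots[CLUSTER_SLOTS / 8];
    memset(slots, 0, sizeof(slots));
    bitmapSetBit(slots, 10001);
    return bitmapTestBit(slots, probe);
}
```

With these helpers, slotByteIndex(10001) is 1250 and slotBitIndex(10001) is 1; demoProbe(10001) reports the bit as set, while the neighbouring slot 10000 (bit 0 of the same byte) is not.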

In summary, clusterNodeGetSlotBit determines whether a given slot is assigned to the given node. The logic of clusterReplyMultiBulkSlots can therefore be sketched as follows:

The approximate steps are:

Traverse each node;
For each node, traverse all slots and use clusterNodeGetSlotBit to determine whether each slot is assigned to that node;

From the way the CLUSTER SLOTS reply is built, its time complexity is <number of cluster master nodes> * <total number of slots>, where the total number of slots is fixed at 16384.
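To make this cost concrete, here is a toy model of the loop shape only. It is a sketch under assumptions: toyNode, toyClusterCreate, countBitTests and demoBitTests are hypothetical names, slots are assumed to be spread evenly in contiguous blocks, and no reply is actually built.

```c
#include <assert.h>
#include <stdlib.h>

#define CLUSTER_SLOTS 16384

/* Toy stand-in for clusterNode: only the slot bitmap and the slot count. */
typedef struct toyNode {
    unsigned char slots[CLUSTER_SLOTS / 8];
    int numslots;
} toyNode;

static int toyNodeGetSlotBit(const toyNode *n, int slot) {
    return (n->slots[slot / 8] & (1 << (slot & 7))) != 0;
}

/* Assumption: slots are split evenly, one contiguous block per master. */
static toyNode *toyClusterCreate(int masters) {
    toyNode *nodes = calloc((size_t)masters, sizeof(toyNode));
    if (nodes == NULL) return NULL;
    for (int slot = 0; slot < CLUSTER_SLOTS; slot++) {
        int owner = (int)((long)slot * masters / CLUSTER_SLOTS);
        nodes[owner].slots[slot / 8] |= (unsigned char)(1 << (slot & 7));
        nodes[owner].numslots++;
    }
    return nodes;
}

/* Same nesting as the original function: every master scans every slot.
 * Returns the number of bit tests; stores the number of ranges in *ranges. */
static long countBitTests(const toyNode *nodes, int masters, int *ranges) {
    long tests = 0;
    *ranges = 0;
    for (int m = 0; m < masters; m++) {
        int start = -1;
        if (nodes[m].numslots == 0) continue;
        for (int j = 0; j < CLUSTER_SLOTS; j++) {
            int bit = toyNodeGetSlotBit(&nodes[m], j);
            tests++;
            if (bit && start == -1) start = j;
            if (start != -1 && (!bit || j == CLUSTER_SLOTS - 1)) {
                (*ranges)++;      /* a contiguous range ends here */
                start = -1;
            }
        }
    }
    return tests;
}

/* Convenience wrapper: build a toy cluster, count, and free it. */
static long demoBitTests(int masters, int *ranges) {
    toyNode *nodes = toyClusterCreate(masters);
    long tests;
    assert(nodes != NULL);
    tests = countBitTests(nodes, masters, ranges);
    free(nodes);
    return tests;
}
```

In this model, 100 masters produce 100 ranges but require 100 * 16384 = 1,638,400 bit tests, which is the <masters> * <slots> growth described above.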

2.1.2 Redis hot spot investigation summary

So far we can conclude that the latency of the CLUSTER SLOTS command grows linearly with the number of master nodes in the Redis cluster. The cluster investigated here has a relatively large number of master nodes, which explains the high CLUSTER SLOTS latency observed in production.

2.2 Client troubleshooting

We knew that operations colleagues had performed an expansion. After an expansion, accesses to some keys inevitably return MOVED errors.

We briefly reviewed the Hiredis-vip client code in use; below is a short analysis of how it handles MOVED. Because most other services use Jedis, the corresponding Jedis flow is also analyzed briefly.

2.2.1 How Hiredis-vip handles MOVED

How Hiredis-vip handles MOVED:

The call flow of cluster_update_route:

Here cluster_update_route_by_addr issues CLUSTER SLOTS. As shown, Hiredis-vip refreshes the Redis cluster topology whenever it receives a MOVED error, with the following characteristics:

Nodes are keyed by ip:port and hashed the same way in every client, so when clients hold similar cluster topologies, many of them tend to contact the same node at the same time;
If a node cannot be reached, the client moves to the next node via the iterator. For the reason above, many clients then tend to contact that next node at the same time.

2.2.2 How Jedis handles MOVED

A brief look at the Jedis client code shows that renewSlotCache is called when a MOVED error occurs.

Following the calls made by renewSlotCache confirms that, in cluster mode, Jedis sends the CLUSTER SLOTS command to re-fetch the slot topology of the Redis cluster when it encounters a MOVED error.

2.2.3 Summary of client implementation principles

Jedis is a Redis client for Java and Hiredis-vip is a Redis client for C/C++, so this error-handling mechanism can be regarded as common practice among clients.

In cluster mode, a client handles MOVED roughly as follows:

In general:

1) Access the key using the slot topology cached by the client;

2) The Redis node returns a normal reply:

The access succeeds; continue with subsequent operations.

3) The Redis node returns MOVED:

Send CLUSTER SLOTS to the Redis node to update the topology;
Retry the key using the new topology.
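The decision in step 3 can be sketched as a small parser for the redirection error, which Redis sends in the form "MOVED <slot> <host>:<port>". This is a minimal illustration, not code from Hiredis-vip or Jedis: movedInfo, parseMoved, clientAction and nextAction are hypothetical names, and the actual topology refresh and retry are left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Target of a MOVED redirection (hypothetical helper type). */
typedef struct movedInfo {
    int slot;
    char host[64];
    int port;
} movedInfo;

/* Returns 1 and fills 'out' if 'err' is a well-formed MOVED error, else 0. */
static int parseMoved(const char *err, movedInfo *out) {
    int slot, port;
    char addr[96];
    char *colon;

    if (err == NULL || sscanf(err, "MOVED %d %95s", &slot, addr) != 2) return 0;
    colon = strrchr(addr, ':');          /* last ':' separates host and port */
    if (colon == NULL || colon == addr) return 0;
    if (sscanf(colon + 1, "%d", &port) != 1) return 0;
    *colon = '\0';
    if (strlen(addr) >= sizeof(out->host)) return 0;
    if (slot < 0 || slot >= 16384 || port <= 0 || port > 65535) return 0;

    out->slot = slot;
    strcpy(out->host, addr);
    out->port = port;
    return 1;
}

typedef enum { ACTION_CONTINUE, ACTION_REFRESH_AND_RETRY } clientAction;

/* Step 2 vs. step 3: only a MOVED error triggers a topology refresh
 * (CLUSTER SLOTS) followed by a retry against the new owner. */
static clientAction nextAction(const char *errorReply, movedInfo *target) {
    if (parseMoved(errorReply, target)) return ACTION_REFRESH_AND_RETRY;
    return ACTION_CONTINUE;
}
```

For example, nextAction("MOVED 3999 127.0.0.1:6381", &m) selects the refresh-and-retry path with slot 3999, host 127.0.0.1 and port 6381, while a NULL or unrelated error string leaves the client on the normal path.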

2.2.4 Summary of client troubleshooting

While the Redis cluster is being expanded, some clients inevitably encounter MOVED when accessing the cluster and execute CLUSTER SLOTS to update their topology.

If the migrated keys are accessed frequently, CLUSTER SLOTS is executed more often. As a result, clients keep executing CLUSTER SLOTS against the Redis cluster throughout the migration.

2.3 Summary of investigation

Combining the CLUSTER SLOTS mechanism on the Redis side with the clients' MOVED handling, the earlier questions can now be answered:

Why are so many CLUSTER commands executed?

Because of the migration, accesses to migrated keys return MOVED, and on receiving it the client executes CLUSTER SLOTS to re-fetch the slot topology.

Why is CPU usage relatively high when the CLUSTER command is executed?

The Redis source code shows that the time complexity of CLUSTER SLOTS is proportional to the number of master nodes. The cluster in question has many master nodes, so the command takes a long time and consumes a lot of CPU.

Why are clusters with many nodes especially prone to problems during slot migration?

Migration inevitably causes some clients to receive MOVED when accessing keys;
On receiving MOVED, the client executes CLUSTER SLOTS;
The latency of CLUSTER SLOTS grows with the number of master nodes in the cluster;
During slot migration, business requests are slowed by the latency of CLUSTER SLOTS, and what is perceived externally is higher command latency.

3. Optimization

3.1 Analysis of the current situation

As things stand, it is normal for a client to execute CLUSTER SLOTS when it encounters MOVED, because the cached slot topology must be updated to keep subsequent cluster access efficient.

Besides Jedis and Hiredis-vip, other clients presumably cache slot information in a similar way. There is little room to optimize this flow, since it is dictated by the Redis cluster access mechanism.

We therefore analyzed how Redis records cluster information.

3.1.1 Redis cluster metadata analysis

Every Redis node in the cluster keeps cluster metadata in server.cluster, which contains the following:

typedef struct clusterState {
    ...
    dict *nodes;          /* Hash table of name -> clusterNode structures */
    /* nodes holds every known node, stored in a dict. */
    ...
    clusterNode *slots[CLUSTER_SLOTS]; /* slot array; each entry is a pointer to the owning node */
    ...
} clusterState;

As described in 2.1, the original logic obtains the topology by traversing the slot information of each node.

3.1.2 Analysis of the CLUSTER SLOTS reply

Observe the reply of CLUSTER SLOTS:

/* Format: 1) 1) start slot
 *            2) end slot
 *            3) 1) master IP
 *               2) master port
 *               3) node ID
 *            4) 1) replica IP
 *               2) replica port
 *               3) node ID
 *           ... continued until done
 */

Combined with the cluster information stored in server.cluster, we believe server.cluster->slots can be traversed instead. server.cluster->slots is updated whenever the cluster topology changes, and each entry stores a pointer to the node that owns the slot.

3.2 Optimization plan

The basic optimization idea is as follows:

Traverse the slots and find contiguous blocks of slots that belong to the same node;
If the node of the current slot is the same as the node of the previous slot, the current slot belongs to the same node, that is, it is inside a contiguous slot range of that node;
If the node of the current slot differs from the previous one, the previous contiguous range has ended and can be output, and the current slot becomes the start of the next contiguous range.

Therefore, traversing server.cluster->slots is sufficient. A simple representation is roughly as follows:

The time complexity is thereby reduced to <total number of slots>.
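The single-pass idea can be sketched on a simplified model in which each slot stores an integer owner id instead of a clusterNode pointer. This is only an illustration under that assumption: slotRange, collectRanges and fillEven are hypothetical names, and -1 stands in for an unassigned slot (a NULL pointer in server.cluster->slots).

```c
#include <assert.h>

#define CLUSTER_SLOTS 16384

/* One contiguous run of slots served by the same owner. */
typedef struct slotRange {
    int start;
    int end;
    int owner;
} slotRange;

/* owners[i] is the id of the node serving slot i, or -1 if unassigned.
 * Stores up to 'cap' ranges in 'out' and returns the number of ranges found.
 * The slot array is walked exactly once. */
static int collectRanges(const int *owners, slotRange *out, int cap) {
    int count = 0, start = -1, cur = -1;
    for (int i = 0; i <= CLUSTER_SLOTS; i++) {
        int owner = (i < CLUSTER_SLOTS) ? owners[i] : -1;
        /* Close the open range when the owner changes or the slots run out. */
        if (start != -1 && (i == CLUSTER_SLOTS || owner != cur)) {
            if (count < cap) out[count] = (slotRange){start, i - 1, cur};
            count++;
            start = -1;
        }
        /* Open a new range at the first assigned slot after a gap or change. */
        if (i < CLUSTER_SLOTS && start == -1 && owner != -1) {
            start = i;
            cur = owner;
        }
    }
    return count;
}

/* Assumption for the demo: slots are split evenly between 'masters' owners. */
static void fillEven(int *owners, int masters) {
    for (int i = 0; i < CLUSTER_SLOTS; i++)
        owners[i] = (int)((long)i * masters / CLUSTER_SLOTS);
}
```

With three evenly assigned owners the function returns the ranges 0-5461, 5462-10922 and 10923-16383 after 16385 loop iterations, regardless of how many owners exist; unassigning a single slot in the middle of a range splits that range in two.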

3.3 Implementation

The optimized logic is as follows:

void clusterReplyMultiBulkSlots(client * c) {
    /* Format: 1) 1) start slot
     *            2) end slot
     *            3) 1) master IP
     *               2) master port
     *               3) node ID
     *            4) 1) replica IP
     *               2) replica port
     *               3) node ID
     *           ... continued until done
     */
    clusterNode *n = NULL;
    int num_masters = 0, start = -1;
    void *slot_replylen = addReplyDeferredLen(c);
 
    for (int i = 0; i <= CLUSTER_SLOTS; i++) {
        /* Iterate over all slots. */
        /* Find start node and slot id. */
        if (n == NULL) {
            if (i == CLUSTER_SLOTS) break;
            n = server.cluster->slots[i];
            start = i;
            continue;
        }
 
        /* Add cluster slots info when occur different node with start
         * or end of slot. */
        if (i == CLUSTER_SLOTS || n != server.cluster->slots[i]) {
            /* Iterate over the master's replicas and add the information returned to the client. */
            addNodeReplyForClusterSlot(c, n, start, i-1);
            num_masters++;
            if (i == CLUSTER_SLOTS) break;
            n = server.cluster->slots[i];
            start = i;
        }
    }
    setDeferredArrayLen(c, slot_replylen, num_masters);
}

By traversing server.cluster->slots, the function finds each contiguous slot range belonging to a node. As soon as the next slot breaks the range, it outputs the node information of that range together with its replica information, and then continues searching for the next contiguous range.

4. Comparison of optimization results

A side-by-side comparison of the CLUSTER SLOTS command in the two versions of Redis.

4.1 Test environment & stress test scenario

Operating system: Manjaro 20.2

Hardware configuration:

CPU: AMD Ryzen 7 4800H
DRAM: DDR4 3200MHz 8G*2

Redis cluster information:

1) Persistence configuration

AOF disabled
bgsave disabled

2) Cluster node information:

Number of nodes: 100
All nodes are master nodes

Stress test scenario:

Use the benchmark tool to continuously send CLUSTER SLOTS commands to a single node in the cluster;
After the stress test of one version is complete, tear down the cluster, redeploy it, and run the next round.

4.2 Comparison of CPU resource usage

Flame graphs exported with perf. Original version:

Optimized version:

The share of CPU time has clearly dropped after optimization, basically in line with expectations.

4.3 Time-consuming comparison

To measure the time taken, we embedded timing code:

else if (!strcasecmp(c->argv[1]->ptr,"slots") && c->argc == 2) {
    /* CLUSTER SLOTS */
    long long now = ustime();
    clusterReplyMultiBulkSlots(c);
    serverLog(LL_NOTICE,
        "cluster slots cost time:%lld us", ustime() - now);
}

Compare the log output.

Original version log output:

37351:M 06 Mar 2021 16:11:39.313 * cluster slots cost time:2061 us

Optimized version log output:

35562:M 06 Mar 2021 16:11:27.862 * cluster slots cost time:168 us

The time drops markedly, from 2000+ us to under 200 us; in a cluster of 100 master nodes it is reduced to 8.2% of the original (168/2061). The optimization result is basically in line with expectations.

5. Summary

The work described above can be summarized briefly, and it leads to one conclusion: there is a performance defect.

A brief summary of the investigation and optimization process:

In large Redis clusters, some nodes show noticeably higher access latency because of the CLUSTER command;
Running perf top on the Redis instance shows that the clusterReplyMultiBulkSlots function uses an abnormal amount of CPU;
Analysis of clusterReplyMultiBulkSlots shows an obvious performance problem in the function;
After optimizing clusterReplyMultiBulkSlots, performance improves significantly.

From this investigation and optimization, we conclude that Redis currently has a performance defect in the CLUSTER SLOTS command.

Because of Redis's data sharding mechanism, clients in cluster mode access keys by caching the slot topology, so the only place to optimize is CLUSTER SLOTS itself. Redis clusters usually do not have this many nodes, so the problem is normally not obvious.

The logic of Hiredis-vip also has problems. As described in 2.2.1, Hiredis-vip updates the slot topology by iterating over the nodes and sending CLUSTER SLOTS to them in turn. If the Redis cluster is large and the business side runs many clients, a chain reaction occurs:

1) The larger the Redis cluster, the slower the CLUSTER SLOTS response;

2) If a node does not respond or returns an error, the Hiredis-vip client moves on and requests the next node;

3) All Hiredis-vip clients iterate over the Redis cluster nodes in the same order, because the cluster information is essentially the same in every client. When there are many clients, a Redis node may become blocked, which causes the clients to move on to the next Redis node;

4) A large number of Hiredis-vip clients visit the Redis nodes one after another. If a node cannot withstand the load, the nodes are overwhelmed one by one as the mass of clients traverses them:

Building on point 3, imagine 10,000 clients accessing the Redis cluster. A frequently accessed key is migrated, so all clients need to update their slot topology. Because all clients cache the same cluster node information, they traverse the nodes in the same order, and all 10,000 clients send CLUSTER SLOTS to the nodes in that order. Because CLUSTER SLOTS performs poorly in large clusters, a node easily becomes unreachable under this many requests. Most of the clients (for example, 9,000+) then visit the next node in the traversal order and execute CLUSTER SLOTS there, so the Redis nodes are blocked one after another.

5) In the end, the CPU load of most Redis nodes spikes, and many Hiredis-vip clients remain unable to update their slot topology.

The end result is that, after a large Redis cluster performs slot migration under access from a large number of Hiredis-vip clients, the business side sees higher latency for ordinary commands and the Redis instances show high CPU usage. This logic can be optimized.
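One possible direction for such an optimization, shown here only as a hedged sketch and not as what Hiredis-vip does, is to give each client a different starting point in the node traversal so that refresh requests do not all land on the same node first. The names nodeForAttempt, startOffsetForClient and maxFirstAttemptLoad are hypothetical, and deriving the offset from a client id is an assumption made for illustration; a random offset would serve the same purpose.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 1024

/* Node index tried on the given attempt. A start_offset of 0 models the
 * situation described above, where every client walks the same order. */
static int nodeForAttempt(int attempt, int start_offset, int num_nodes) {
    assert(num_nodes > 0 && attempt >= 0 && start_offset >= 0);
    return (start_offset + attempt) % num_nodes;
}

/* Hypothetical per-client offset derived from a client id. */
static int startOffsetForClient(uint32_t client_id, int num_nodes) {
    return (int)(client_id % (uint32_t)num_nodes);
}

/* Simulate the first refresh attempt of 'num_clients' clients and return the
 * largest number of requests that any single node receives. */
static int maxFirstAttemptLoad(int num_clients, int num_nodes, int spread) {
    int load[MAX_NODES] = {0};
    int max = 0;
    assert(num_nodes > 0 && num_nodes <= MAX_NODES);
    for (int c = 0; c < num_clients; c++) {
        int offset = spread ? startOffsetForClient((uint32_t)c, num_nodes) : 0;
        int node = nodeForAttempt(0, offset, num_nodes);
        if (++load[node] > max) max = load[node];
    }
    return max;
}
```

In this model, 10,000 clients and 100 nodes with identical traversal order send all 10,000 first requests to one node, whereas spreading the starting offsets caps the first-attempt load at 100 requests per node.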

The optimization described in section 3 has been submitted and merged into Redis 6.2.2.

6. Reference materials

1. Hiredis-vip: https://github.com

2. Jedis: https://github.com/redis/jedis

3. Redis: https://github.com/redis/redis

4. Perf: https://perf.wiki.kernel.org

Author: vivo Internet Database Team—Yuan Jianwei