
The Redis problem

Recently, the Redis used by the preferential service has been fluctuating intermittently. Specifically, Redis response time (RT) rises sharply for a short period and the number of RedisCommandTimeoutException errors increases abnormally, as shown in the following figure:

The monitoring panel aggregates statistics by the minute, so the rise in RT does not look very pronounced there.

This is clearly not normal, and it has been happening more and more often recently.

Locating the cause

When you run into this kind of problem, the first thought is that Redis itself is jittering. The symptoms fit: irregular, intermittent, and very short-lived. So we first asked the DBA to confirm whether the Redis instance or the network had a problem at the time, and we also checked the DMS Redis monitoring panel to see whether the runtime metrics were normal. Unfortunately, the answer was that during the periods of service jitter Redis was running normally and the network was fine. On the monitoring panel Redis looked very healthy: CPU load was low, IO load was low, and the RT of the Redis kernel was normal with no obvious fluctuation. (The figures below show one node in the Redis cluster; all 16 nodes look basically the same.)

redis cpu:

redis io:

redis maxRT:

At this point, Redis itself can basically be ruled out as the cause, so the problem must lie in how we use it. The first usage problems to check for are hot keys and big keys. If a big key is also a hot key, a traffic spike can easily produce exactly this symptom.

First, go to the Alibaba Cloud Redis monitoring panel and look at the hot key statistics:

There was no hot key or big key within the past week, so the cached content itself is reasonable. This is a bit of a headache: Redis itself and the cached content are both fine. The only thing left to examine is the code, working backwards from the exceptions it throws.

In Tianyan monitoring we found many RedisCommandTimeoutException errors, so we first took one sample and looked at the context of the request that produced it.

The failing interface is the batch price calculation service for the venue's commodity flow.

This request used Redis mget to fetch multiple keys at once, probably a few dozen, but it timed out: 500ms was not enough.

Now look at another failing interface.

Both interfaces use mget to pull keys in batches. Judging from the key names, they even rely on the same data, but that is not what matters here. We have already seen that the data cached in Redis is fine: there is no big key or hot key, Redis itself is healthy, and the network is normal. That leaves only one possibility: is there a problem with mget itself? How does mget fetch multiple keys at once? With that question in mind, let's follow the source code of mget (the system uses the Lettuce client with a connection pool).

public RedisFuture<List<KeyValue<K, V>>> mget(Iterable<K> keys) {
        // Get the mapping from each slot to the keys that belong to it
        Map<Integer, List<K>> partitioned = SlotHash.partition(codec, keys);
 
        // If fewer than 2 slots are involved, i.e. all keys fall in one slot, fetch them directly
        if (partitioned.size() < 2) {
            return super.mget(keys);
        }
 
        // Mapping from each key to its slot
        Map<K, Integer> slots = SlotHash.getSlots(partitioned);
        Map<Integer, RedisFuture<List<KeyValue<K, V>>>> executions = new HashMap<>();
 
        // Iterate over the slots and send one mget per slot
        for (Map.Entry<Integer, List<K>> entry : partitioned.entrySet()) {
            RedisFuture<List<KeyValue<K, V>>> mget = super.mget(entry.getValue());
            executions.put(entry.getKey(), mget);
        }
 
        // restore order of key
        return new PipelinedRedisFuture<>(executions, objectPipelinedRedisFuture -> {
            List<KeyValue<K, V>> result = new ArrayList<>();
            for (K opKey : keys) {
                int slot = slots.get(opKey);
 
                int position = partitioned.get(slot).indexOf(opKey);
                RedisFuture<List<KeyValue<K, V>>> listRedisFuture = executions.get(slot);
                result.add(MultiNodeExecution.execute(() -> listRedisFuture.get().get(position)));
            }
 
            return result;
        });
}

The whole mget operation can be broken down into the following steps:

  1. Build the mapping from slots to keys, grouping the requested keys by the slot each one belongs to.
  2. Check whether fewer than 2 slots are involved, that is, whether all keys fall in the same slot. If so, issue a single mget command and return its result directly.
  3. If 2 or more slots are involved, iterate over the slots and issue one mget request to Redis per slot.
  4. Wait for all requests to finish, reassemble the results in the same order as the input keys, and return them.

So when mget is used to fetch multiple keys that live in different slots, a single mget call actually issues multiple mget commands to Redis, one per slot involved, and the call can complete only after all of them have finished. It looks like one mget method call, but underneath it corresponds to multiple Redis calls and multiple IO interactions.
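The grouping in step 1 can be illustrated with a small self-contained sketch. It is not Lettuce's SlotHash class; the class and method names are made up for illustration. It assumes the standard Redis Cluster rule (slot = CRC16 of the key modulo 16384) and deliberately ignores hash tags, so it only applies to keys without braces.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: mimics the grouping step, not Lettuce's actual implementation. */
final class SlotPartitionSketch {

    static final int SLOT_COUNT = 16384;

    /** CRC16, XMODEM variant (polynomial 0x1021, initial value 0), as used by Redis Cluster. */
    static int crc16(byte[] bytes) {
        int crc = 0;
        for (byte b : bytes) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    /** Slot of a key. Hash tags ({...}) are ignored in this sketch. */
    static int slot(String key) {
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % SLOT_COUNT;
    }

    /** Groups keys by slot; each entry would become one mget command. */
    static Map<Integer, List<String>> partition(List<String> keys) {
        Map<Integer, List<String>> bySlot = new LinkedHashMap<>();
        for (String key : keys) {
            bySlot.computeIfAbsent(slot(key), s -> new ArrayList<>()).add(key);
        }
        return bySlot;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("foo", "bar", "hello");
        Map<Integer, List<String>> bySlot = partition(keys);
        System.out.println(bySlot);
        System.out.println(keys.size() + " keys -> " + bySlot.size() + " mget commands");
    }
}
```

With these three sample keys, each key lands in its own slot, so one logical mget turns into three commands.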

(Figure a)

This figure shows very intuitively the drawback of mget in Redis cluster mode.

Problem optimization:

Option 1 - hashtag

A hashtag forces the keys onto a single Redis node. This effectively degrades the Redis cluster into a standalone Redis, and greatly reduces the system's high availability and disaster tolerance. You could only try to mitigate that with other distributed setups such as master-slave or sentinel. But cluster mode was chosen precisely because it fits the current system architecture better than the other modes, so this option may cause bigger problems. Not recommended.

Option 2 - concurrent calls
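To make the mechanism concrete, here is a minimal sketch of how a hash tag changes slot selection, assuming the standard Redis Cluster rule: if the key contains a non-empty `{...}` section, only the text inside the first such section is hashed. The class name and the sample keys are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: shows why keys sharing a hash tag end up in the same slot. */
final class HashTagSketch {

    /** CRC16, XMODEM variant (polynomial 0x1021, initial value 0). */
    static int crc16(byte[] bytes) {
        int crc = 0;
        for (byte b : bytes) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    /** Returns the part of the key that is hashed: the hash tag content if present, else the whole key. */
    static String hashedPart(String key) {
        int open = key.indexOf('{');
        if (open >= 0) {
            int close = key.indexOf('}', open + 1);
            if (close > open + 1) { // tag must be non-empty
                return key.substring(open + 1, close);
            }
        }
        return key;
    }

    static int slot(String key) {
        return crc16(hashedPart(key).getBytes(StandardCharsets.UTF_8)) % 16384;
    }

    public static void main(String[] args) {
        // Hypothetical key names: both share the tag "venue42", so both hash to the same slot.
        System.out.println(slot("{venue42}:price:1001") == slot("{venue42}:price:1002"));
        // Without a tag, the full key is hashed and the slots usually differ.
        System.out.println(slot("price:1001") + " vs " + slot("price:1002"));
    }
}
```

Because every key with the same tag maps to one slot, all of that data and traffic is pinned to a single node, which is the trade-off described above.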

From Figure a and the code above, the multiple Redis calls issued one after another in the for loop are what drives up RT. So it is natural to ask whether the underlying serial logic can be replaced with parallel execution: group the mget keys by slot according to the sharding rules, then execute the groups in parallel on multiple threads.

The ideal RT is then that of a single standalone mget: one network IO round trip plus the execution time of one Redis mget command.

It seems like a perfect solution, but in practice it is not. Consider the real scenario. First, the design of the thread pool that submits the concurrent Redis mget tasks is critical, its parameters have to be tuned thoroughly, and load testing it is difficult in itself. Second, in our daily use a single mget carries roughly tens to 100 keys, far fewer than Redis's fixed 16384 slots. So the keys in one request almost always fall in different slots; in other words, splitting the keys by slot most likely produces as many get requests as there are keys, which defeats the purpose of mget.

Both options have advantages and disadvantages. The first is simple, but carries relatively large hidden risks at the architecture level. The second is complicated to implement, but is relatively more reliable. mget has always been both loved and hated; what matters is the usage scenario. The more Redis cluster nodes the keys are spread across, the worse the performance. For small key counts, such as 5 to 20, it is not a big problem.
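A minimal sketch of the idea, under stated assumptions: `groupOf` and `fetchGroup` are hypothetical placeholders (a real version would map a key to its slot or node and issue one mget per group), and an in-memory map stands in for Redis. It is not the system's actual code and does not address thread pool sizing, timeouts, or error handling.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

final class ParallelMgetSketch {

    /** Groups keys, fetches every group in parallel, and returns values in input order (null if missing). */
    static <G> List<String> parallelMget(List<String> keys,
                                         Function<String, G> groupOf,
                                         Function<List<String>, Map<String, String>> fetchGroup,
                                         ExecutorService pool) {
        Map<G, List<String>> groups = new LinkedHashMap<>();
        for (String key : keys) {
            groups.computeIfAbsent(groupOf.apply(key), g -> new ArrayList<>()).add(key);
        }
        // Submit one task per group so the groups are fetched concurrently.
        List<CompletableFuture<Map<String, String>>> futures = new ArrayList<>();
        for (List<String> group : groups.values()) {
            futures.add(CompletableFuture.supplyAsync(() -> fetchGroup.apply(group), pool));
        }
        // Wait for all groups, then merge the partial results.
        Map<String, String> merged = new HashMap<>();
        for (CompletableFuture<Map<String, String>> future : futures) {
            merged.putAll(future.join());
        }
        // Restore the order of the input keys.
        List<String> result = new ArrayList<>();
        for (String key : keys) {
            result.add(merged.get(key));
        }
        return result;
    }

    /** Demo with an in-memory map standing in for Redis and an arbitrary 3-way grouping. */
    static List<String> demo() {
        Map<String, String> fakeRedis = Map.of("price:1", "10", "price:2", "20", "price:3", "30");
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            return parallelMget(
                    List.of("price:3", "price:1", "missing", "price:2"),
                    key -> Math.floorMod(key.hashCode(), 3),
                    group -> {
                        Map<String, String> found = new HashMap<>();
                        for (String k : group) {
                            String v = fakeRedis.get(k);
                            if (v != null) {
                                found.put(k, v);
                            }
                        }
                        return found;
                    },
                    pool);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [30, 10, null, 20]
    }
}
```

In this shape the total wait is bounded by the slowest group rather than by the sum of all groups, provided the pool has enough free threads to run every group at once.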
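The claim about slot spread can be checked with a rough estimate. Assuming keys hash uniformly and independently, n keys spread over m buckets hit m * (1 - (1 - 1/m)^n) distinct buckets on average. The sketch below evaluates this for 16384 slots, and for the 16 nodes mentioned earlier under the additional assumption that each node owns an equal share of the slots.

```java
final class DistinctSlotEstimate {

    /** Expected number of distinct buckets hit by `keys` keys hashed uniformly into `buckets` buckets. */
    static double expectedDistinct(int keys, int buckets) {
        return buckets * (1.0 - Math.pow(1.0 - 1.0 / buckets, keys));
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 50, 100}) {
            System.out.printf("%d keys: about %.1f distinct slots, about %.1f distinct nodes%n",
                    n, expectedDistinct(n, 16384), expectedDistinct(n, 16));
        }
    }
}
```

For 100 keys this gives about 99.7 distinct slots, so grouping by slot leaves almost every key in a group of its own.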

Text/Hulk
Pay attention to Dewu Technology, and work hand in hand to the cloud of technology