
Redis is usually an important component in our business systems, serving as a cache, storing account login information, powering leaderboards, and so on.

Once Redis request latency increases, it may cause an "avalanche" of business systems.

I work at a matchmaking-style internet company, and for Double Eleven we launched a promotion: place an order and we'd match you with a girlfriend.

Who would have thought that after midnight the number of users surged, a technical failure occurred, and users could not place orders. The boss was furious!

After digging in, I found that Redis was reporting Could not get a resource from the pool .

No connection resources could be obtained, and the number of connections on individual Redis nodes in the cluster was high.

A large amount of traffic bypassed the Redis cache and went directly to MySQL, and finally the database went down...

So we tweaked the maximum number of connections and the connection wait time in various ways. Although the frequency of error messages eased, the error kept being reported.

Later, offline testing revealed that the character data stored in Redis was very large, and a single read took about 1 second on average.

Clearly, once Redis latency gets too high, all kinds of problems follow.

Today, "Code Brother" will analyze with you how to determine whether Redis has performance problems, and how to solve them.

[toc]

Is there a problem with Redis performance?

Latency is the time from when the client sends a command to when the client receives the response. Under normal circumstances, Redis processing time is extremely short, at the microsecond level.

When Redis performance fluctuates, for example latency reaching several seconds or more than ten seconds, we can obviously conclude that Redis has slowed down.

On high-end hardware, we might consider 0.6 ms slow; on weaker hardware, it might take 3 ms before we think there is a problem.

So how do we define that Redis is really slow?
We need to measure the Redis baseline performance of the current environment, that is, the basic performance of the system under low pressure and with no interference.

If you find that the runtime latency of Redis is more than twice the baseline performance, you can judge that Redis performance has slowed down.
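As a minimal sketch of this rule of thumb (nothing here beyond the "2x baseline" criterion above; the helper name is made up):

```python
def is_slow(observed_us: float, baseline_us: float) -> bool:
    """Rule of thumb from the text: Redis counts as slow when the
    observed latency exceeds twice the measured baseline."""
    return observed_us > 2 * baseline_us

print(is_slow(7000, 3079))  # 7 ms observed vs. a 3.079 ms baseline -> True
print(is_slow(5000, 3079))  # within 2x the baseline -> False
```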

Latency Baseline Measurement

The redis-cli command provides the --intrinsic-latency option to monitor and record the maximum latency of the host (in microseconds) during the test period, which can be used as the baseline performance of Redis.

To measure the latency observed by a client over the network instead, run:

redis-cli --latency -h `host` -p `port`

For example, execute the following command on the Redis server:

redis-cli --intrinsic-latency 100
Max latency so far: 4 microseconds.
Max latency so far: 18 microseconds.
Max latency so far: 41 microseconds.
Max latency so far: 57 microseconds.
Max latency so far: 78 microseconds.
Max latency so far: 170 microseconds.
Max latency so far: 342 microseconds.
Max latency so far: 3079 microseconds.
45026981 total runs (avg latency: 2.2209 microseconds / 2220.89 nanoseconds per run).
Worst run took 1386x longer than the average latency.

Note: The parameter 100 is the number of seconds the test runs. The longer we run the test, the more likely we are to spot latency spikes.

Running for 100 seconds is usually enough to detect latency issues, but we can run the test several times at different moments to avoid measurement errors.
The maximum latency of "Code Brother's" run is 3079 microseconds, so the baseline performance is 3079 microseconds (about 3 milliseconds).

Note that we run this on the Redis server itself, not on the client. This way, the impact of the network on the measurement is eliminated.

You can connect to the server with -h host -p port . If you want to measure the network's impact on Redis performance, you can use iperf to measure the network latency from the client to the server.

If the network latency is hundreds of milliseconds, there may be other high-traffic programs running on the network causing congestion; you need to ask the operations team to coordinate the network's traffic distribution.

Slow command monitoring

How do we judge whether a command is slow?
Check whether the operation's complexity is O(N). The official documentation describes the complexity of each command; use O(1) and O(log N) commands as much as possible.

Operations involving collections are generally O(N), for example full-collection queries such as HGETALL and SMEMBERS, and aggregation operations such as SORT, LREM, and SUNION.

Is there any monitoring data to observe? The code wasn't all written by me, and I don't know whether anyone has used slow commands.
There are two ways to find out: the slow log and latency monitoring, covered in the next two sections.

In addition, you can use OS tools (top, htop, prstat, etc.) to quickly check the CPU consumption of the Redis main process. High CPU usage with low traffic usually indicates slow commands.

Slow log function

The slowlog command in Redis allows us to quickly locate slow commands that exceed a specified execution time. By default, commands taking more than 10 ms are logged.

The slowlog records only the command execution time, excluding I/O round trips, and it does not record slow responses caused purely by network latency.

We can customize the slow-command standard according to the baseline performance (for example, configure it to a multiple of the baseline) and adjust the threshold that triggers recording of slow commands.

For example, to log commands that take more than 6 milliseconds, enter in redis-cli:

redis-cli CONFIG SET slowlog-log-slower-than 6000

It can also be set in the redis.conf configuration file; the unit is microseconds.

To view all logged slow commands, use the redis-cli tool and enter the SLOWLOG GET command; the third field of each returned entry shows the execution time in microseconds.

If you only need to view the last 2 slow commands, enter slowlog get 2.

Example: get the last 2 slow query commands
127.0.0.1:6381> SLOWLOG get 2
1) 1) (integer) 6
   2) (integer) 1458734263
   3) (integer) 74372
   4) 1) "hgetall"
      2) "max.dsp.blacklist"
2) 1) (integer) 5
   2) (integer) 1458734258
   3) (integer) 5411075
   4) 1) "keys"
      2) "max.dsp.blacklist"

Taking the first HGETALL command as an example, each slowlog entry has 4 fields:

  • Field 1: A unique progressive identifier of the slowlog entry, incremented since server start; currently 6.
  • Field 2: Represents the Unix timestamp when the query was executed.
  • Field 3: Indicates the number of microseconds for query execution, currently 74372 microseconds, about 74ms.
  • Field 4: Indicates the query commands and parameters. If the parameters are many or very large, only part of them will be displayed and the number of parameters will be given.
    The current command is hgetall max.dsp.blacklist .

Latency Monitoring

Redis introduced the Latency Monitoring function in version 2.8.13, which is used to monitor the occurrence frequency of various events with the granularity of seconds.

The first step in enabling the latency monitor is to set the latency threshold (in milliseconds). Only events that take longer than this threshold will be recorded; for example, we set the threshold to 9 ms based on 3 times the baseline performance (3 ms).

It can be set with redis-cli or in redis.conf:

CONFIG SET latency-monitor-threshold 9

Details of related events recorded by the tool can be found in the official documentation: https://redis.io/topics/latency-monitor

For example, to get the latest latency events:

127.0.0.1:6379> debug sleep 2
OK
(2.00s)
127.0.0.1:6379> latency latest
1) 1) "command"
   2) (integer) 1645330616
   3) (integer) 2003
   4) (integer) 2003
  1. the name of the event;
  2. the Unix timestamp of the latest occurrence of the event;
  3. the latest latency in milliseconds;
  4. the maximum latency recorded for this event.

How to solve Redis slow down?

Redis reads and writes data in a single thread. If an operation executed by the main thread takes too long, the main thread is blocked.

Let's analyze which operations will block the main thread, and how can we solve it?

Latency due to network communication

Clients connect to Redis over TCP/IP or a Unix domain socket. The typical latency of a 1 Gbit/s network is about 200 μs.

A Redis client executes a command in four steps:

Send command -> command queuing -> command execution -> return result
The duration of this whole process is called the round-trip time (RTT). MGET and MSET effectively save RTTs, but most commands (there is HGETALL, but no MHGETALL) do not support batch operations and cost N RTTs for N calls. A pipeline is needed to solve this problem.

The Redis pipeline concatenates multiple commands together to reduce the number of network response round trips.

redis-pipeline
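A back-of-the-envelope model of what pipelining saves in network time (illustrative numbers only, not a live Redis call; the 200 μs figure is the typical RTT quoted above):

```python
RTT_US = 200  # typical 1 Gbit/s round trip, in microseconds

def network_time_us(n_commands: int, pipelined: bool) -> int:
    """Network cost only; server-side execution time is ignored."""
    # Without a pipeline, each command pays a full round trip;
    # with one, all commands share a single round trip.
    return RTT_US if pipelined else n_commands * RTT_US

print(network_time_us(100, pipelined=False))  # 100 sequential commands: 20000
print(network_time_us(100, pipelined=True))   # the same 100 in one pipeline: 200
```

With a client library such as redis-py the same idea is expressed by queuing commands on a pipeline object and executing them in one batch.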

Latency due to slow instructions

Using the slow-command monitoring above, locate the slow queries. They can be addressed in the following two ways:

  • In a Cluster deployment, run O(N) operations such as aggregations on a replica, or complete them on the client side.
  • Use efficient commands instead: use incremental iteration to avoid querying a large amount of data at once. For details, see the SCAN , SSCAN , HSCAN , and ZSCAN commands.

In addition, the KEYS command should be disabled in production and used only for debugging, because it traverses all key-value pairs and its latency is high.

Latency caused by fork to generate RDB

To generate an RDB snapshot, Redis must fork a background process. The fork operation itself (which runs in the main thread) causes latency.

Redis uses the operating system's multi-process copy-on-write technology COW (Copy On Write) to achieve snapshot persistence and reduce memory usage.

Copy-on-write ensures data can still be modified during the snapshot

But fork involves copying the process's page tables: a large 24 GB Redis instance requires 24 GB / 4 kB * 8 B = 48 MB of page tables.

This will involve allocating and copying 48 MB of memory when doing bgsave.
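The arithmetic above can be checked directly (all sizes taken from the text):

```python
instance_bytes = 24 * 1024**3   # 24 GB Redis instance
page_bytes     = 4 * 1024       # 4 kB regular memory page
entry_bytes    = 8              # one 8-byte page-table entry per page

# Number of pages times the per-page entry size, expressed in MB.
page_table_mb = instance_bytes // page_bytes * entry_bytes / 1024**2
print(page_table_mb)  # 48.0 -> the 48 MB copied on fork
```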

In addition, a replica cannot serve reads or writes while loading the RDB file, so keep the master's data size at about 2~4 GB so that replicas can load it quickly.

transparent huge pages

Conventional memory pages are allocated in 4 KB units. Since kernel 2.6.38, Linux has supported the huge page mechanism, which allows allocating 2 MB memory pages.

Redis uses fork to generate RDB files for persistence, providing a data-reliability guarantee.

When generating an RDB snapshot, Redis uses copy-on-write so that the main thread can still serve write requests from clients.

That is, when the data is modified, Redis will copy a copy of the data, and then modify it.

With memory huge pages enabled, during RDB generation even if a client modifies only 50 B of data, Redis has to copy a whole 2 MB huge page. When there are many write commands, this causes a large number of copies, slowing performance.

Use the following command to disable transparent huge pages on Linux:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
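To see the write amplification, compare the copy-on-write cost of a 50 B modification under regular pages and huge pages (a quick illustration of the numbers in the text):

```python
modified_bytes = 50
regular_page   = 4 * 1024          # 4 kB regular page
huge_page      = 2 * 1024 * 1024   # 2 MB transparent huge page

# A copy-on-write fault copies the whole page containing the modified bytes.
print(regular_page)               # 4096 bytes copied with regular pages
print(huge_page)                  # 2097152 bytes copied with THP
print(huge_page // regular_page)  # 512x amplification per fault
```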

swap: OS paging

When physical memory (RAM) is not enough, part of the data in memory is moved to swap space, so that the system does not hit an OOM or a more fatal situation due to insufficient memory.

When a process requests memory and the OS finds there is not enough, the OS swaps out temporarily unused data from memory to the swap partition. This process is called swap-out.

When a process needs this data again and the OS finds free physical memory, it swaps the data from the swap partition back into physical memory. This process is called swap-in.

Memory swap is a mechanism for swapping memory data back and forth between memory and disk in the operating system, involving read and write from disk.

What are the situations that trigger swap?
For Redis, there are two common cases:
  • Redis uses more memory than is available;
  • Other processes running on the same machine as Redis are performing a large number of file read and write I/O operations (including RDB files and AOF background threads that generate large files), and file read and write takes up memory, which reduces the memory obtained by Redis and triggers swap .
Code Brother, how can I check whether swap is what's slowing Redis down?
Linux provides great tools to troubleshoot this; when you suspect latency due to swapping, just follow the steps below.

Get Redis instance pid

$ redis-cli info | grep process_id
process_id:13160

Enter the /proc filesystem directory for this process:

cd /proc/13160

There is an smaps file here that describes the memory layout of the Redis process. Run the following command to find the Size and Swap fields:

$ cat smaps | egrep '^(Swap|Size)'
Size:                316 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB
Size:                  8 kB
Swap:                  0 kB
Size:                 40 kB
Swap:                  0 kB
Size:                132 kB
Swap:                  0 kB
Size:             720896 kB
Swap:                 12 kB

Each line of Size indicates the size of a block of memory used by the Redis instance, and the Swap below the Size corresponds to how much data has been swapped out to the disk in the memory area of this size.

If Size == Swap, the data is completely swapped out.

You can see that a memory block of 720896 kB has only 12 kB swapped out to disk, which is fine.

Redis itself uses a lot of memory blocks of different sizes, so you can see that there are many Size rows, some are small, 4KB, and some are very large, such as 720896KB. Different memory blocks are swapped out to disk at different sizes.
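Summing the Swap fields can be automated; here is a small sketch (a hypothetical helper, fed with a fragment of the sample output above rather than a live /proc read):

```python
def total_swap_kb(smaps_text: str) -> int:
    """Sum all 'Swap:' lines, in kB, from /proc/<pid>/smaps output."""
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith("Swap:"):
            total += int(line.split()[1])  # e.g. "Swap:  12 kB" -> 12
    return total

sample = """Size:                316 kB
Swap:                  0 kB
Size:             720896 kB
Swap:                 12 kB"""
print(total_swap_kb(sample))  # 12
```

On a real machine you would pass it `open(f"/proc/{pid}/smaps").read()`.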


If every Swap entry is 0 kB, or there are only sporadic 4 kB entries, everything is working fine.

When swapped sizes of hundreds of MB or even GB appear, the memory pressure on the Redis instance is very high, and it is likely to slow down.

solution

  1. increase machine memory;
  2. Run Redis on a separate machine to avoid running processes that require a lot of memory on the same machine, so as to meet the memory requirements of Redis;
  3. Increase the number of clusters to share the data volume and reduce the memory required for each instance.

Latency due to AOF and disk I/O

To ensure data reliability, Redis uses AOF logs and RDB snapshots to achieve durability and fast recovery.

The appendfsync configuration can be used to configure AOF to perform write or fsync on disk in three different ways (this setting can be modified at runtime with the CONFIG SET command, eg: redis-cli CONFIG SET appendfsync no ).

  • no : Redis does not perform fsync, the only delay comes from the write call, and write only needs to write the log records to the kernel buffer before returning.
  • everysec : Redis performs an fsync every second. The fsync operation is done asynchronously using a background child thread. A maximum of 1s of data is lost.
  • always : every write operation will do an fsync, then reply to the client with an OK code (redis will actually try to aggregate many concurrent commands into a single fsync), no data loss. In this mode, performance is generally very low, and a fast disk and filesystem implementation that can perform fsyncs for short periods of time is strongly recommended.

We usually use Redis as a cache; lost data can be fully recovered from the database, so high data reliability is not required. It is recommended to set it to no or everysec.

In addition, to avoid the AOF file being too large, Redis will rewrite the AOF to generate a reduced AOF file.

The configuration item no-appendfsync-on-rewrite can be set to yes, which means that the fsync operation will not be performed when the AOF is rewritten.

That is to say, after the Redis instance writes the write command to the memory, it returns directly without calling the background thread to perform the fsync operation.
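For a cache-only workload (data recoverable from MySQL, as in the story at the top), a redis.conf sketch of these settings might look like this (the values are an assumption; adjust them to your own durability needs):

```conf
appendonly yes
# fsync once per second from a background thread; at most 1 s of writes lost
appendfsync everysec
# don't fsync while an AOF rewrite is running, avoiding disk-IO contention
no-appendfsync-on-rewrite yes
```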

Eliminating expired data

Redis has two ways to weed out expired data:

  • Lazy deletion: when a request arrives and the key is found to have expired, it is deleted;
  • Periodic deletion: every 100 milliseconds, delete some of the expired keys.

The periodic deletion algorithm is as follows:

  1. Randomly sample ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP keys and delete all the expired ones among them;
  2. If more than 25% of the sampled keys were expired, go back to step 1.

ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP defaults to 20, and the cycle runs 10 times per second; deleting up to 200 keys per second is not a big problem.

If the second condition keeps triggering, Redis will repeatedly delete expired data to release memory, and this deletion is blocking.

Code Brother, what is the trigger condition?
A large number of keys set with the same expiration time: within the same second a large number of keys expire, and many deletion rounds are needed before the expired ratio drops below 25%.

In short: a large number of keys that expire at the same time can cause performance fluctuations.
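A pure-Python simulation of the sampling loop described above (a sketch of the idea, not the actual Redis C code) shows why simultaneous expiration keeps the loop busy:

```python
import random

LOOKUPS_PER_LOOP = 20  # ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP default

def active_expire_cycle(expires: dict, now: float) -> int:
    """expires maps key -> expiry timestamp. Sample 20 keys, delete the
    expired ones, and repeat while more than 25% of the sample expired."""
    deleted = 0
    while expires:
        sample = random.sample(list(expires), min(LOOKUPS_PER_LOOP, len(expires)))
        expired = [k for k in sample if expires[k] <= now]
        for k in expired:
            del expires[k]
        deleted += len(expired)
        if len(expired) * 4 <= len(sample):  # 25% or less expired: stop
            break
    return deleted

# 1000 keys all expiring in the same second force 50 full sampling rounds:
keys = {f"key:{i}": 100 for i in range(1000)}
print(active_expire_cycle(keys, now=101))  # deletes all 1000
```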

solution

If a batch of keys really does expire at the same time, you can add a random number within a certain range to the expiration time parameter of EXPIREAT and EXPIRE, spreading out the deletion pressure caused by simultaneous expiration.
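A minimal sketch of that jitter, assuming a one-hour base TTL and a 5-minute spread (both values are made up for illustration; pass the result to EXPIRE or add it to an EXPIREAT timestamp):

```python
import random

def ttl_with_jitter(base_ttl_s: int, jitter_s: int = 300) -> int:
    """Base TTL plus a random 0..jitter_s seconds, so that keys created
    together do not all expire in the same second."""
    return base_ttl_s + random.randint(0, jitter_s)

ttl = ttl_with_jitter(3600)
print(3600 <= ttl <= 3900)  # True: always inside the jitter window
```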

bigkey

Usually, we call a key with a large value or a large number of members or list elements a bigkey. Below we describe the characteristics of bigkeys with several practical examples:

  • A Key of type STRING, its value is 5MB (data is too large)
  • A Key of type LIST with 10,000 elements (list too long)
  • A Key of type ZSET with 10,000 members (too many members)
  • A key in HASH format, although the number of members is only 1000, the total value of these members is 10MB (the size of the members is too large)

A bigkey brings the following problems:

  1. Redis memory keeps getting bigger and bigger, causing OOM, or reaching maxmemory setting value, causing write blocking or eviction of important keys;
  2. A node in Redis Cluster has far more memory than other nodes, but because the minimum granularity of data migration in Redis Cluster is Key, the memory on the node cannot be balanced;
  3. The read requests for a bigkey occupy too much bandwidth, slowing the server down and affecting other services on it;
  4. Deleting a bigkey will cause the main library to block for a long time and cause synchronization interruption or master-slave switchover;

find bigkey

Use the redis-rdb-tools tool to find bigkeys according to your own customized criteria.

solution

Split on big keys

For example, split a HASH key with tens of thousands of members into multiple HASH keys, ensuring that the number of members of each key stays within a reasonable range. In a Redis Cluster deployment, splitting large keys significantly helps balance memory across nodes.
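One common way to do the split is to hash each field to one of a fixed number of sub-keys. A sketch (the helper name and the shard count of 16 are assumptions for illustration):

```python
import binascii

SHARDS = 16  # number of sub-keys; pick to keep each sub-hash a reasonable size

def shard_key(big_key: str, field: str) -> str:
    """Route a HASH field to one of SHARDS smaller sub-keys, so that
    HSET big_key field v becomes HSET shard_key(big_key, field) field v."""
    n = binascii.crc32(field.encode()) % SHARDS
    return f"{big_key}:{n}"

print(shard_key("user:profile", "uid:42"))  # e.g. user:profile:<0..15>
```

Because the routing is deterministic, reads use the same function to find the sub-key that holds a given field.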

Asynchronously clean up big keys

Redis has provided the UNLINK command since version 4.0, which reclaims the given keys gradually and progressively in a non-blocking manner. With UNLINK, you can safely delete big keys and even extra-large keys.

Summary

The following checklist will help you solve problems efficiently when you encounter slow Redis performance.

  1. Get the current baseline performance of Redis;
  2. Enable slow command monitoring to locate problems caused by slow commands;
  3. Locate slow commands and replace them, for example with SCAN-style incremental iteration;
  4. Control the data size of the instance to 2-4GB to avoid blocking when the master-slave replication loads too large RDB files;
  5. Disable transparent huge pages: with huge pages enabled, even if a client modifies only 50 B of data during RDB generation, Redis has to copy a 2 MB huge page, and heavy writes cause mass copying and slower performance.
  6. Whether the memory used by Redis is too large to cause swap;
  7. Whether the AOF configuration is reasonable, you can set the configuration item no-appendfsync-on-rewrite to yes to avoid AOF rewriting and fsync competing for disk IO resources, resulting in increased Redis latency.
  8. Bigkeys bring a series of problems; we need to split them up in advance and delete them asynchronously via UNLINK.
