Failure Analysis | A case of bgsave causing Redis to freeze periodically

Author: Ren Kun
Ren Kun lives in Zhuhai. He previously worked as a full-time Oracle and MySQL DBA and is now mainly responsible for maintaining MySQL, MongoDB and Redis.
Source of this article: original contribution
*Original content produced by the Aikesheng open source community. It may not be used without authorization. To reprint it, please contact the editor and credit the source.

1. Background

We run a Redis 4.0 master-slave set in production, and the developers complained that it froze periodically.

The application logs show the freeze recurring about every 10 minutes: an ordinary call takes about 1 s to complete, and then latency recovers by itself. Both GET and SET are affected.

2. Diagnosis

Redis QPS and CPU monitoring showed no useful clues.

The evicted_keys metric stays at 0. expired_keys is high but shows no significant fluctuation, so key eviction or expiry is unlikely to be the cause.

A colleague pointed out that latest_fork_usec reports about 1 s and that bgsave is triggered every 15 minutes, which roughly matches the frequency of the slow calls in the application. The preliminary conclusion was that the periodic bgsave of the Redis instance stalls the application.

I had long assumed that bgsave, which forks a child process and relies on copy-on-write, had little impact on Redis itself beyond some disk I/O while the snapshot is written.

The potential bottleneck is the fork() call itself. The Linux fork(2) man page notes:

 Under Linux, fork() is implemented using copy-on-write pages, so the only penalty that it incurs is the time and memory required to duplicate the parent's page tables, and to create a unique task structure for the child.

The larger the parent's page table, the longer fork() takes. Because Redis uses a single-worker process model, every user request is blocked while fork() is executing.
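The effect can be observed in isolation with a small experiment. The sketch below is not part of the original investigation: it touches a configurable amount of memory and then times a single os.fork() call. The function name time_fork and the sizes passed to it are assumptions for illustration, and absolute numbers depend on the kernel, the hardware, and whether huge pages are in use.

```python
import mmap
import os
import time


def time_fork(resident_mib):
    """Touch resident_mib MiB of memory, then return the seconds spent in fork()."""
    buf = bytearray(resident_mib * 1024 * 1024)
    # Write one byte per page so that every page is really mapped and the
    # page table that fork() has to copy is populated.
    for offset in range(0, len(buf), mmap.PAGESIZE):
        buf[offset] = 1

    start = time.perf_counter()
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately without running cleanup handlers
    elapsed = time.perf_counter() - start  # parent: time until fork() returned
    os.waitpid(pid, 0)  # reap the child so it does not linger as a zombie
    return elapsed
```

Calling time_fork(64) and then time_fork(1024) on a Linux machine will typically show the larger process taking longer to fork, although the exact ratio varies from system to system.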

The RSS of the current Redis instance has reached 16G, and its page table is 33M:

 cat /proc/8844/status | grep -i pte
VmPTE: 33840 kB
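The VmPTE figure can be sanity-checked against the RSS. The sketch below assumes x86-64 with 4 KiB pages, 8-byte page-table entries and no huge pages, and it assumes that the "G" in the Redis output means GiB; the function names are made up for this example. It counts only the lowest page-table level, so it gives a lower bound.

```python
def parse_proc_status_kb(status_text):
    """Return the kB-valued fields of /proc/<pid>/status as {name: kB}."""
    fields = {}
    for line in status_text.splitlines():
        name, sep, value = line.partition(":")
        parts = value.split()
        # Keep only lines of the form "Name:   12345 kB".
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def estimated_pte_kb(rss_kb, page_kb=4, pte_bytes=8):
    """Lower bound on page-table size: one entry per resident page."""
    pages = rss_kb / page_kb
    return pages * pte_bytes / 1024


observed_kb = parse_proc_status_kb("VmPTE:\t   33840 kB\n")["VmPTE"]
rss_kb = int(16.1 * 1024 * 1024)  # used_memory_rss_human of 16.1G, taken as GiB
print(f"observed VmPTE: {observed_kb} kB")
print(f"estimated from RSS: {estimated_pte_kb(rss_kb):.0f} kB")
```

Under these assumptions the estimate comes out close to 33 MB, the same order as the observed value, which supports reading VmPTE as growing in proportion to resident memory.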

Use strace to measure how long fork() takes. In glibc, fork() is implemented on top of the lower-level clone system call, so -e trace=clone is specified:

 # strace -p 20324 -e trace=clone -T
...
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f409d771a10) = 30793 <1.013945>
...

The corresponding Redis metric is:

 latest_fork_usec:1014778

The two agree, which confirms that the freezes are caused by the periodic bgsave.

The simplest fix is to disable bgsave, but that is very risky: if the master is killed by mistake and restarted before a master-slave switch takes place, it comes back with an empty dataset, the slaves resynchronize from it, and all Redis data is lost.
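The comparison between the two numbers can be scripted. The following is a minimal sketch that parses the text returned by the Redis INFO command and converts latest_fork_usec to seconds; parse_info and fork_seconds are names invented for this example, and the sample string contains only the one line quoted in this article.

```python
def parse_info(info_text):
    """Parse the 'key:value' lines of Redis INFO output into a dict of strings."""
    stats = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Stats"
        key, sep, value = line.partition(":")
        if sep:
            stats[key] = value
    return stats


def fork_seconds(info_text):
    """Duration of the most recent fork(), in seconds."""
    return int(parse_info(info_text)["latest_fork_usec"]) / 1_000_000


sample = "# Stats\r\nlatest_fork_usec:1014778\r\n"
strace_seconds = 1.013945  # duration strace reported for clone()
redis_seconds = fork_seconds(sample)
print(f"redis: {redis_seconds:.6f} s, strace: {strace_seconds:.6f} s")
print(f"difference: {abs(redis_seconds - strace_seconds) * 1000:.3f} ms")
```

The two measurements differ by less than a millisecond, so both tools are timing the same fork.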

The memory statistics of Redis show severe memory fragmentation:

 used_memory_human 8.7G
......
used_memory_rss_human 16.1G
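Redis reports this relationship itself as mem_fragmentation_ratio, which is used_memory_rss divided by used_memory. As a quick worked check using the two rounded values above (the helper name is an assumption for this example):

```python
def fragmentation_ratio(used_memory, used_memory_rss):
    """RSS divided by the memory Redis has allocated; both in the same unit."""
    return used_memory_rss / used_memory


used_gib = 8.7   # used_memory_human
rss_gib = 16.1   # used_memory_rss_human

ratio = fragmentation_ratio(used_gib, rss_gib)
print(f"fragmentation ratio: {ratio:.2f}")
print(f"resident but unused: {rss_gib - used_gib:.1f}G")
```

A ratio of about 1.85 means that nearly half of the resident memory holds no live data, yet fork() still has to copy the page-table entries for all of it.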

This Redis set was due to be migrated soon anyway. In the new environment the RSS of the instance is only 8.8G and latest_fork_usec has dropped to about 0.25 s, which the developers confirmed meets the application's requirements. After the migration, the periodic stalls in the application were significantly alleviated.

Redis 4.0 introduced active defragmentation, which is controlled by the activedefrag parameter and is disabled by default. After the migration, activedefrag was enabled on the old Redis instance (all other parameters were left at their defaults); used_memory_rss_human eventually settled at about 11G and latest_fork_usec at about 0.76 s. The new environment may also develop severe memory fragmentation in the future. When that happens, either enable activedefrag or restart the instance during a maintenance window. Restarting is clearly the more effective of the two.
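The three measurements in this article can be put side by side by normalizing fork time to resident memory. This is only a rough calculation on the rounded figures quoted in the text, with labels and a helper name chosen for this example:

```python
# (label, RSS in G, latest_fork_usec converted to seconds), as quoted in the article
measurements = [
    ("old instance, before defrag", 16.1, 1.014778),
    ("old instance, activedefrag on", 11.0, 0.76),
    ("new environment", 8.8, 0.25),
]


def fork_ms_per_gib(rss_gib, fork_s):
    """Fork time in milliseconds per G of resident memory."""
    return fork_s * 1000 / rss_gib


for label, rss_gib, fork_s in measurements:
    print(f"{label}: {fork_ms_per_gib(rss_gib, fork_s):.0f} ms per G")
```

On the old instance the cost per G stays roughly the same before and after defragmentation, so the shorter fork there comes from the smaller RSS. The new environment is cheaper per G as well, which suggests that RSS is not the only factor and that latest_fork_usec is worth measuring directly in each environment.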

3. Summary

Both the master and the slaves of our production Redis have bgsave enabled, and we had overlooked the performance fluctuations that bgsave/fork() can cause. The best solution is to cap the memory used by a single Redis instance. If the business side cannot reduce its data volume, consider Redis Cluster or setting maxmemory. When Redis freezes periodically in the future, latest_fork_usec is the first metric to check.
