Failure Analysis | A Case of High CPU Load Caused by a Large Number of Short-Lived Processes

Author: Ren Kun
Currently based in Zhuhai, he has worked as a full-time Oracle and MySQL DBA and is now mainly responsible for maintaining MySQL, MongoDB, and Redis.
Source of this article: original contribution
*This original content is produced by the Aikesheng open source community and may not be used without authorization. To reprint it, please contact the editor and cite the source.

1. Background

In a project's development environment, a mongo cluster was installed on a single virtual machine for testing: 1 mongos + 3 config server nodes + 1 shard with 3 replicas, 7 mongo instances in total. The mongo version is 4.2.19 and the OS is CentOS 7.9.

After the test finished, the CPU load stayed at about 50%, even though the QPS of mongo had already dropped to 0.

Only mongo is installed on this machine. After all mongo instances were shut down, the CPU load returned to normal immediately. After they were started again, the CPU load began to soar again after a while. The problem is reproducible, which confirms that it is related to the mongo instances.

2. Diagnosis

Running top shows that the CPU's usr time has reached 40%, but the %CPU values of the processes at the top of the list come nowhere near adding up to that.

Looking at the QPS of mongos confirms that no user commands are being executed.

Use dstat to see the overall load (the vmstat output is not well formatted; the last few columns are always misaligned).

Apart from the abnormal CPU load, the other indicators are normal. Interrupts and context switches are not high, so these two are unlikely to be the cause.

Run perf record -ag -- sleep 10 && perf report to see what the CPU is executing.

There are indeed many mongo calls, but the function names are not intuitive, and the execution logic behind them cannot be guessed.

So far it is confirmed that the problem is caused by the mongo instances, but mongo has 0 application connections, and the call stacks yield no useful information.

Going back to the start of the diagnosis: the total CPU utilization of the processes listed by top is far less than the overall CPU load. Most likely, frequent short-lived processes are consuming this part of the CPU, and they exit too quickly for top to capture them.
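The gap described above can be checked with numbers instead of by eye. The following is a minimal sketch, not part of the original diagnosis: the helper name sum_top_cpu is made up for illustration, and it assumes the default procps top layout in which the process table starts with a header row containing a %CPU column.

```shell
# Sum the %CPU column of the last process table in `top -b` output.
# sum_top_cpu is a hypothetical helper name, not a standard tool.
sum_top_cpu() {
  awk '
    # Each header row locates the %CPU column and resets the total,
    # so only the final iteration is summed.
    $1 == "PID" {
      for (i = 1; i <= NF; i++) if ($i == "%CPU") col = i
      total = 0
      next
    }
    # Process rows start with a numeric PID.
    col && $1 ~ /^[0-9]+$/ { total += $col }
    END { printf "%.1f\n", total }
  '
}

# Two iterations: the second one reflects the last one-second interval.
if command -v top >/dev/null 2>&1; then
  top -b -n 2 -d 1 | sum_top_cpu
fi
```

By default top reports per-process %CPU relative to a single core, while the summary line covers all cores. On a 4-core machine, 40% usr therefore corresponds to roughly 160 points of per-process %CPU. If the printed sum is far below that, CPU time is going to processes that never show up in the list.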

Use sar -w 1 to check the number of processes created per second. On average, more than 80 new processes are created per second, which is very likely the cause.
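If sar is not installed, a similar figure can be derived by hand. On Linux, /proc/stat contains a "processes" line that counts forks since boot, so sampling it twice gives an average creation rate. This is only a sketch; the function names forks_total and fork_rate and the 2-second interval are assumptions chosen for illustration.

```shell
# Print the cumulative "processes" (forks since boot) counter from a
# /proc/stat-formatted file; the path argument defaults to /proc/stat.
forks_total() {
  awk '$1 == "processes" { print $2 }' "${1:-/proc/stat}"
}

# Integer average of new processes per second between two samples.
# Arguments: counter before, counter after, elapsed seconds.
fork_rate() {
  echo $(( ($2 - $1) / $3 ))
}

# Only sample where /proc/stat exists (Linux).
if [ -r /proc/stat ]; then
  interval=2
  before=$(forks_total)
  sleep "$interval"
  after=$(forks_total)
  echo "new processes per second: $(fork_rate "$before" "$after" "$interval")"
fi
```

A rate that stays in the dozens per second on an otherwise idle machine points in the same direction as the sar output described above.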

To catch applications that frequently create short-lived processes, you can use execsnoop. It monitors the exec() calls of processes in real time through ftrace and outputs basic information about short-lived processes, including the PID/PPID and command-line arguments.

# Download execsnoop
cd /usr/bin
wget https://raw.githubusercontent.com/brendangregg/perf-tools/master/execsnoop
chmod 755 execsnoop

The following is the output. All of the commands are being run by the monitoring system, which constantly connects to mongo and filters the output with grep. Each operation spawns a new process, and more than 400 records were captured in 10 seconds.
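With hundreds of records, it helps to aggregate the trace rather than read it line by line. The sketch below counts exec events per command. It assumes the trace has a "PID PPID ARGS" column layout; the function name count_execs_by_command and the sample lines are invented for illustration and are not the output captured in this case.

```shell
# Count exec events per command name (first word of ARGS).
# Assumes each event line looks like: PID PPID ARGS...
count_execs_by_command() {
  awk '
    # Skip banner and header lines: event rows start with two numbers.
    $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { n[$3]++ }
    END { for (c in n) print n[c], c }
  ' | sort -k1,1nr -k2,2
}

# Made-up sample input, standing in for a saved execsnoop trace.
count_execs_by_command <<'EOF'
Tracing exec()s. Ctrl-C to end.
   PID   PPID ARGS
  5001   4000 mongo --port 27017 --eval db.serverStatus()
  5002   4000 grep uptime
  5003   4000 mongo --port 27018 --eval db.serverStatus()
EOF
```

Replacing `n[$3]++` with `n[$2]++` groups by parent PID instead, which is a quick way to find the process that keeps spawning the children.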

After the zabbix process was shut down, the CPU load immediately returned to normal. The culprit was found.

We also use zabbix monitoring in other environments, but have not run into similar problems there.

This node hosts 7 mongo instances, and zabbix monitors every mongo instance by default, which multiplies the monitoring overhead by 7. In addition, the machine is a virtual machine with only a 4-core CPU.

When these factors come together, the problem appears. Since this is a development environment, zabbix monitoring has been turned off for now. In the future, the monitoring logic should be optimized to minimize the number of connections to the db and the length of the grep call chain.
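One possible direction for that optimization, sketched under assumptions: the actual zabbix check scripts are not shown in this article, so the snapshot text, the metric names, and the helper name metric below are all hypothetical. The idea is to query each instance once, keep the result, and extract each value with a single awk process instead of a chain of grep calls.

```shell
# Hypothetical snapshot, standing in for the saved result of one status
# query per instance (not the real output of any mongo command).
snapshot='connections.current: 12
connections.available: 800
opcounters.query: 0'

# Extract one metric with a single awk process instead of a
# grep | grep | cut chain. metric is a made-up helper name.
metric() {
  awk -F': ' -v key="$1" '$1 == key { print $2 }'
}

# Several metrics are read from the same snapshot, so the database
# is contacted once rather than once per metric.
printf '%s\n' "$snapshot" | metric connections.current
printf '%s\n' "$snapshot" | metric opcounters.query
```

Each metric then costs one short-lived process rather than a new db connection plus several filters, which matters when the checks run against 7 instances on a 4-core machine.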

3. Summary

When a machine's CPU load stays high but top cannot show which processes are responsible, execsnoop can be used to catch short-lived processes. Similar tools include iosnoop and opensnoop.
