Author: Ren Kun

Based in Zhuhai, he has worked as a full-time Oracle and MySQL DBA and is now mainly responsible for maintaining MySQL, MongoDB, and Redis.

Source of this article: original contribution

*This original content is produced by the Aikesheng open source community and may not be used without authorization. To reprint it, please contact the editor and cite the source.


1. Background

In a project's development environment, a mongo cluster was installed on a single virtual machine for testing: 1 mongos + 3 config nodes + 1 shard with 3 replicas, 7 mongo instances in total. The mongo version is 4.2.19 and the OS is CentOS 7.9.

After the test finished, the CPU load stayed at about 50%, even though the QPS of mongo had already dropped to 0.

Only mongo is installed on this machine. When all mongo instances were shut down, the CPU load returned to normal immediately; when they were started again, the CPU load began to climb after a short while. The behavior is reproducible, which confirms that it is related to the mongo instances.

2. Diagnosis

Running top shows that CPU usr has reached 40%, but the %CPU values of the top few processes add up to far less than that.

Checking the QPS on mongos confirms that no user commands are being executed.

Use dstat to look at the overall load (vmstat output is poorly formatted; its last few columns are always misaligned).

Apart from the abnormal CPU load, the other indicators are normal. Interrupts and context switches are not high, so these two are unlikely to be the cause.

Run perf record -ag -- sleep 10 && perf report to see where the CPU time is being spent.

There are indeed many mongo calls, but the function names are not intuitive, and the execution logic behind them cannot be guessed.

So far it is confirmed that the problem is related to the mongo instances, yet mongo has 0 application connections, and the call stacks yield no useful information.

Recall the observation at the start of the diagnosis: the combined CPU usage of the processes listed by top is far below the overall CPU load. Most likely, frequent short-lived processes are consuming this CPU time, and they exit too quickly for top to capture them.

Run sar -w 1 to check the number of processes created per second. On average, more than 80 new processes are created per second, which is most likely the cause.

To catch applications that frequently create short-lived processes, you can use execsnoop. It monitors the exec() calls of processes in real time through ftrace and prints basic information about each short-lived process, including its PID, PPID, and command-line arguments.
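The per-second process creation rate can also be averaged with a small helper instead of reading it by eye. The following is a minimal sketch, not part of the original diagnosis: the function name avg_proc_rate is hypothetical, and it assumes sar prints a header line containing a proc/s column and uses a dot as the decimal separator.

```shell
#!/usr/bin/env bash
# Average the proc/s column of `sar -w` output read from stdin.
# The column position is taken from the header line, so both 12-hour
# (with AM/PM) and 24-hour timestamps are handled.
# Assumption: numbers use "." as the decimal separator.
avg_proc_rate() {
  awk '
    # Until the header is seen, look for the field named proc/s.
    col == 0 {
      for (i = 1; i <= NF; i++) if ($i == "proc/s") col = i
      next
    }
    # Skip the trailing summary line and any repeated header.
    $1 == "Average:" || $col == "proc/s" { next }
    # Data line: accumulate when the field is numeric.
    $col ~ /^[0-9]+(\.[0-9]+)?$/ { sum += $col; n++ }
    END { if (n > 0) printf "%.2f\n", sum / n; else print "no samples" }
  '
}
```

For example, sar -w 1 10 | avg_proc_rate samples for 10 seconds and prints the average number of processes created per second over that window.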

# Download execsnoop
cd /usr/bin
wget https://raw.githubusercontent.com/brendangregg/perf-tools/master/execsnoop
chmod 755 execsnoop

The output shows that every captured command is being executed by the monitoring system, which keeps connecting to mongo and filtering the results through grep. Each of these operations spawns a new process, and more than 400 records were captured in 10 seconds.

After the Zabbix process was shut down, the CPU load immediately returned to normal. The culprit was found.

We also use Zabbix monitoring in other environments, but had not run into a similar problem there.

This node hosts 7 mongo instances, and Zabbix monitors each of them by default, which multiplies the monitoring overhead by 7. On top of that, the machine is a virtual machine with only a 4-core CPU.

When these factors come together, problems can arise. Since this is a development environment, Zabbix monitoring has been turned off for now. In the future, the monitoring logic should be optimized to minimize the number of connections to the database and the length of the grep pipelines.
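A capture with several hundred records is easier to read when grouped by command. The sketch below is an illustration rather than what was run in the original diagnosis: the function name count_execs_by_command is hypothetical, and it assumes execsnoop's default column layout of PID, PPID, ARGS (no timestamp column).

```shell
#!/usr/bin/env bash
# Count exec() events per command name from execsnoop output on stdin.
# Assumes the default column layout "PID PPID ARGS" (no -t timestamps).
count_execs_by_command() {
  awk '
    # Only lines whose first two fields are numeric are exec records;
    # this skips the banner lines and the column header.
    $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && NF >= 3 {
      n = split($3, parts, "/")   # drop any leading directory path
      count[parts[n]]++
    }
    END { for (cmd in count) printf "%d %s\n", count[cmd], cmd }
  ' | sort -k1,1nr -k2,2
}
```

Assuming the perf-tools version of execsnoop, whose usage text lists a -d option for the tracing duration in seconds, a 10-second capture could be summarized with execsnoop -d 10 | count_execs_by_command. Grouping by the PPID column instead would point at the parent process that keeps spawning the commands.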

3. Summary

When a machine's CPU load stays high but top shows no process responsible for it, execsnoop can be used to catch short-lived processes. Similar tools include iosnoop and opensnoop.

