Author: Ren Kun

Based in Zhuhai, he has worked as a full-time Oracle and MySQL DBA and is now mainly responsible for maintaining MySQL, MongoDB, and Redis.

Source of this article: original submission

* Produced by the Aikesheng open source community. Original content may not be used without authorization; for reprinting, please contact the editor and credit the source.


1 Background

On June 5th, at around 22:30, I received several alarm messages: shard2 of a production MongoDB cluster was experiencing master-slave replication delay, which recovered automatically after a few minutes. I was tempted to pretend I hadn't seen it, but a little later the delay alarm fired again. Although it recovered on its own this time as well, I would not feel at ease without taking a look. This environment is a 3.4 sharded cluster with 3 mongos and 4 shards, each shard a replica set of 1 master and 2 slaves. The master node has priority=2 and the remaining nodes priority=1, so that as long as the network is healthy, the master library always stays on node 1.
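For reference, this kind of priority layout is normally set through rs.reconfig() in the mongo shell. The sketch below is only illustrative; the member indexes are assumptions, not taken from the actual cluster.

# Illustrative sketch: make members[0] (assumed to be node 1) the preferred master
/usr/local/mongodb/bin/mongo --port 27017 --eval '
    cfg = rs.conf();
    cfg.members[0].priority = 2;   // node 1, preferred as master
    cfg.members[1].priority = 1;
    cfg.members[2].priority = 1;
    rs.reconfig(cfg);
'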

2 Troubleshooting

Logging in to Grafana to check the master library's monitoring: CPU and memory usage increased significantly during this period.

QPS dropped sharply to 0 at around 22:30.

The number of connections kept skyrocketing.

From this it can be roughly inferred that a connection storm put excessive pressure on the master library, causing the master-slave replication delay. For the details we need to check mongod.log. The strength of MongoDB's log file is that it records very detailed information; the downside is that all that detail makes the file take up a lot of space. For this, MongoDB provides a dedicated logRotate command, but it is only responsible for switching to a new log file, not for cleaning up historical ones, and mongod itself cannot schedule a job to rotate periodically, so a separate script is needed.

Create a mongo account named backup with the hostManager role, then write a shell script that rotates the log and schedule it to run once an hour.
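Creating that account might look like the following (a minimal sketch in the mongo shell; the password placeholder simply mirrors the one used in the script):

> use admin
> db.createUser({
      user: "backup",
      pwd: "*****",
      roles: [ { role: "hostManager", db: "admin" } ]
  })

The rotation script itself: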

[root@ mongod]# more logrotate_mongo.sh
#!/bin/sh

MONGO_CMD=/usr/local/mongodb/bin/mongo
KEEP_DAY=7

#flush mongod log
datadir=/data/mongodb
port=27017
role=mongod
destdir=/data/backup/mongolog/${role}
if [ ! -d "${destdir}" ]; then
  mkdir -p ${destdir}
fi
$MONGO_CMD --authenticationDatabase admin admin -ubackup -p"*****" --eval "db.runCommand({logRotate:1})" --port $port
mv ${datadir}/mongod.log.????-??-??T??-??-?? ${destdir}/
find $destdir -name "mongod.log.*" -mtime +${KEEP_DAY} -exec rm -rf {} \;
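To run the script once an hour, a crontab entry along these lines will do (the script path here is an assumption):

# crontab -e: rotate the mongod log at the top of every hour
0 * * * * /bin/sh /root/scripts/logrotate_mongo.sh >/dev/null 2>&1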

The archived log files are shown below. The volume of logs generated between 22:00 and 23:00 is an order of magnitude larger than usual, which again suggests that mongod was under abnormal pressure during that period.

-rw-r--r-- 1 root root  11647760 Jun 5 18:00 mongod.log.2021-06-05T10-00-02
-rw-r--r-- 1 root root  12869316 Jun 5 19:00 mongod.log.2021-06-05T11-00-01
-rw-r--r-- 1 root root  15625361 Jun 5 20:00 mongod.log.2021-06-05T12-00-03
-rw-r--r-- 1 root root  10993819 Jun 5 21:00 mongod.log.2021-06-05T13-00-02
-rw-r--r-- 1 root root  71500070 Jun 5 22:00 mongod.log.2021-06-05T14-00-40
-rw-r--r-- 1 root root 215216326 Jun 5 23:00 mongod.log.2021-06-05T15-00-02

Querying the log around that point in time, there are a lot of messages about failures to create worker threads:

2021-06-05T22:28:00.821+0800 I -        [thread2] failed to create service entry worker thread for 172.31.0.65:39890
2021-06-05T22:28:00.821+0800 I NETWORK  [thread2] connection accepted from 172.31.0.66:45090 #4051839 (32622 connections now open)
2021-06-05T22:28:00.821+0800 I -        [thread2] pthread_create failed: Resource temporarily unavailable
2021-06-05T22:28:00.821+0800 I -        [thread2] failed to create service entry worker thread for 172.31.0.66:45090
2021-06-05T22:28:00.821+0800 I NETWORK  [thread2] connection accepted from 172.31.0.65:39892 #4051840 (32622 connections now open)
2021-06-05T22:28:00.821+0800 I -        [thread2] pthread_create failed: Resource temporarily unavailable
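pthread_create failing with "Resource temporarily unavailable" generally means mongod has hit an OS-level limit on threads/processes (or is short of memory). A quick way to compare the current thread count against those limits, assuming a single mongod on the host, is:

# Threads currently used by mongod vs. the limits it is subject to
pid=$(pgrep -x mongod | head -n 1)
ls /proc/${pid}/task | wc -l              # threads currently in use
grep "Max processes" /proc/${pid}/limits  # effective process/thread limit (ulimit -u)
cat /proc/sys/kernel/threads-max          # system-wide thread ceiling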

Logs from the slave library during the same time period:

2021-06-05T22:30:07.325+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 172.31.0.183:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
2021-06-05T22:30:07.425+0800 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section

The master library had almost stopped responding, and the connections initiated by the slave libraries' periodic heartbeats could not be created, which triggered a master-slave election. The current slave library was temporarily elected master; after the old master recovered and rejoined the cluster, it regained the master role thanks to its higher priority.
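Replication lag and member states can also be checked directly from the mongo shell; on a 3.4 replica set something like the following works (a sketch, run against any member):

# Lag of each slave relative to the master, plus every member's current state
/usr/local/mongodb/bin/mongo --port 27017 --eval '
    rs.printSlaveReplicationInfo();
    rs.status().members.forEach(function(m) { print(m.name + "  " + m.stateStr); });
'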

Parsing the log with mtools confirms this; the output is shown below. Under normal circumstances, 183 is the master library and 114 is the slave library.

[root@vm172-31-0-183 mongod]# mloginfo mongod.log.2021-06-05T15-00-02 --rsstate
.....
RSSTATE
date               host                  state/message

Jun 05 22:28:05    172.31.0.114:27017    PRIMARY    # 114 was elected master
Jun 05 22:36:41    172.31.0.114:27017    SECONDARY  # 114 was demoted back to slave

[root@vm172-31-0-114 mongod]# mloginfo mongod.log.2021-06-05T14-48-24 --rsstate
......
RSSTATE
date               host                  state/message

Jun 05 22:31:50    172.31.0.183:27017    ROLLBACK   # 183 rejoined the cluster as a slave and went through rollback and recovery
Jun 05 22:34:48    172.31.0.183:27017    RECOVERING
Jun 05 22:35:52    172.31.0.183:27017    SECONDARY  # once 183 became SECONDARY, its higher priority triggered a new round of election
Jun 05 22:36:43    172.31.0.183:27017    PRIMARY    # 183 became master again
The next question is to determine which application server initiated the connection storm. Connections to the shard nodes are initiated by mongos, so we can only check the logs of the mongos instances. Again using mtools, the output below shows the connection information for the 21:00-22:00 time period.

[root@ mongos]# mloginfo mongod.log.2021-06-05T14-00-01 --connections
......

CONNECTIONS
     total opened: 8904
     total closed: 5645
    no unique IPs: 6
 socket exceptions: 649

127.0.0.1    opened: 2683 closed: 2683
172.31.0.21  opened: 1531 closed: 733
172.31.0.28  opened: 1525 closed: 676
172.31.0.81  opened: 1520 closed: 704
172.31.0.78  opened: 1495 closed: 699
172.31.0.134 opened: 150  closed: 150

This is the connection information for the 22:00-23:00 time period; the problem clearly lies in this window.

[root@ mongos]# mloginfo mongod.log.2021-06-05T15-00-01 --connections
......

CONNECTIONS
     total opened: 58261
     total closed: 58868
    no unique IPs: 7
  socket exceptions: 3708
   
172.31.0.78  opened: 14041  closed: 14576
172.31.0.21  opened: 13898  closed: 14284
172.31.0.28  opened: 13835  closed: 13617
172.31.0.81  opened: 13633  closed: 13535
127.0.0.1    opened: 2681   closed: 2680
172.31.0.134 opened: 170    closed: 176
172.31.0.4   opened: 3      closed: 0

This confirms that the incident was caused by a change in connection behavior on the application servers; the remaining step is to hand this information over to the developers for further investigation.

3 Summary

This case was relatively easy to diagnose, but it relied on two tools: a shell script that rotates the logs periodically, and mtools, written by a MongoDB engineer. They help us quickly go through the logs and extract the key information needed for diagnosis, greatly improving efficiency.
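mtools is a Python package; if it is not already present, it can normally be installed with pip, for example:

pip install mtools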

