
Hitch (Shunfengche) Operations R&D Team, Huang Tao

Symptoms
A machine has recently been triggering cpu-idle drop alarms, as shown below:

[screenshot: cpu-idle alarm graph]

Root-cause analysis

Review the monitoring for the incident window (php-fpm-idle, cpu-idle, io-wait, io-write, etc.):
(1) php-fpm-idle dropped sharply twice today, once around 12:00 and once around 16:30; the weekly view shows such sudden drops happening regularly, as shown:

[screenshot: php-fpm-idle dips]

(2) io-wait spiked suddenly at 11:58:

[screenshot: io-wait spike at 11:58]

(3) io-write also shows a large burst of writes at 11:58:

[screenshot: io-write burst at 11:58]

(4) cpu-idle briefly dipped at the time and then rose sharply. Looking at the whole week's curve, however, it has stayed between 70 and 80; the cause of the sharp rise remains to be analyzed:

[screenshot: cpu-idle over the week]

(5) Hypothesis: a large burst of log writes at the time drove io-wait up; php-fpm worker processes were delayed writing to files, overall response slowed, and fpm idle plummeted. When many processes write to the same file in parallel, the writes contend and block one another.

Verifying the hypothesis
(1) The php-fpm slow log for the incident window shows where requests were blocked: almost every trace is stuck in a call to fwrite.

[screenshot: php-fpm slow log, blocked in fwrite]

(2) The size of the application log trace.log over time: the periods when the log file grows fastest line up exactly with the sharpest drops in fpm-idle:

[screenshot: trace.log size over time]

(3) Using the sar command to verify the disk-write activity at the time: during the incident window there was indeed a huge burst of writes, with wr_sec/s growing from a few hundred per second at off-peak to well over a hundred thousand:
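The same check can be scripted: scan `sar -d`-style output and flag samples whose wr_sec/s exceeds a threshold. This is only a sketch under assumptions: the column layout (time, DEV, tps, rd_sec/s, wr_sec/s) follows classic sysstat, and the sample numbers are invented:

```python
# Flag samples in `sar -d` output whose wr_sec/s exceeds a threshold.
# Assumed column order (classic sysstat): time DEV tps rd_sec/s wr_sec/s ...
SAMPLE = """\
11:57:01 dev8-0 42.00 8.00 640.00
11:58:01 dev8-0 5210.00 12.00 131072.00
"""

def write_spikes(sar_output: str, threshold: float = 10_000.0):
    """Return (time, device, wr_sec/s) tuples above the threshold."""
    spikes = []
    for line in sar_output.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip headers / blank lines
        time, dev, wr_sec = fields[0], fields[1], float(fields[4])
        if wr_sec > threshold:
            spikes.append((time, dev, wr_sec))
    return spikes

print(write_spikes(SAMPLE))  # only the 11:58 burst stands out
```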

[screenshot: sar disk-write statistics]

Root cause and recommendations
The huge volume of log writes can have two causes:

(1) Traffic spiked at the time.
(2) Traffic did not spike, but certain requests triggered unreasonably large log writes.
Verifying cause 1: per-second request counts during the incident window
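The counts below pair a per-second request count with the access-log timestamp. A minimal sketch of how such counts can be derived (the log format and field position are assumptions, and the sample lines are made up):

```python
from collections import Counter

# Hypothetical access-log lines; in the real case this would be the
# nginx access log for the incident window.
lines = [
    '10.0.0.1 - - [02/Jul/2018:12:01:50 +0800] "GET /a HTTP/1.1" 200 12',
    '10.0.0.2 - - [02/Jul/2018:12:01:50 +0800] "GET /b HTTP/1.1" 200 34',
    '10.0.0.3 - - [02/Jul/2018:12:01:51 +0800] "GET /a HTTP/1.1" 200 12',
]

# Group by the timestamp field (4th whitespace-separated field in the
# common log format, e.g. "[02/Jul/2018:12:01:50" -- second granularity).
per_second = Counter(line.split()[3] for line in lines)
for ts, n in sorted(per_second.items()):
    print(n, ts)
```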

32 [02/Jul/2018:12:01:12
18 [02/Jul/2018:12:01:13
18 [02/Jul/2018:12:01:14
42 [02/Jul/2018:12:01:15
30 [02/Jul/2018:12:01:16
35 [02/Jul/2018:12:01:17
26 [02/Jul/2018:12:01:18
30 [02/Jul/2018:12:01:19
1 [02/Jul/2018:12:01:22
108 [02/Jul/2018:12:01:24
17 [02/Jul/2018:12:01:25
1 [02/Jul/2018:12:01:27
1 [02/Jul/2018:12:01:29
1 [02/Jul/2018:12:01:30
9 [02/Jul/2018:12:01:33
1 [02/Jul/2018:12:01:31
1 [02/Jul/2018:12:01:32
146 [02/Jul/2018:12:01:33
62 [02/Jul/2018:12:01:34
44 [02/Jul/2018:12:01:35
1 [02/Jul/2018:12:01:37
1 [02/Jul/2018:12:01:38
1 [02/Jul/2018:12:01:41
2 [02/Jul/2018:12:01:44
2 [02/Jul/2018:12:01:50
1 [02/Jul/2018:12:01:45
12 [02/Jul/2018:12:01:50
2 [02/Jul/2018:12:01:45
7 [02/Jul/2018:12:01:50
1 [02/Jul/2018:12:01:46
1 [02/Jul/2018:12:01:50
1 [02/Jul/2018:12:01:46
15 [02/Jul/2018:12:01:50
7 [02/Jul/2018:12:01:48
2 [02/Jul/2018:12:01:50
1 [02/Jul/2018:12:01:48
342 [02/Jul/2018:12:01:50
65 [02/Jul/2018:12:01:51
46 [02/Jul/2018:12:01:52
54 [02/Jul/2018:12:01:53
1 [02/Jul/2018:12:01:55
1 [02/Jul/2018:12:01:56
1 [02/Jul/2018:12:01:57
1 [02/Jul/2018:12:01:59
16 [02/Jul/2018:12:02:03
1 [02/Jul/2018:12:02:01
1 [02/Jul/2018:12:02:02
42 [02/Jul/2018:12:02:03
1 [02/Jul/2018:12:02:02
187 [02/Jul/2018:12:02:03
39 [02/Jul/2018:12:02:04
40 [02/Jul/2018:12:02:05
25 [02/Jul/2018:12:02:06
44 [02/Jul/2018:12:02:07
29 [02/Jul/2018:12:02:08

Normal QPS is around 30; during the incident it swings wildly between very high and very low. For example, 342 requests were logged at 12:01:50, while most of the preceding seconds show single digits. Why? Requests were blocked during those seconds and could not be answered; they piled up and finally completed at 12:01:50. Overall traffic did not spike.

Verifying cause 2: inspect the trace log (by traceId) to see what was being written at the time:

The writes turn out to be enormous: a single line is 119 KB, and 33,555 such lines were written, for a total of 33555 × 119 KB = 3,993,045 KB ≈ 3,899 MB. This log line is almost certainly the culprit.
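As a quick sanity check of that arithmetic:

```python
# 33,555 log lines of ~119 KB each, as measured in the trace log.
lines = 33555
line_kb = 119
total_kb = lines * line_kb
print(total_kb, "KB =", total_kb // 1024, "MB")
```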
[screenshot: the oversized log line]

Recommendation: enforce a length limit on messages in the low-level logging class, so a single log call cannot write hundreds of kilobytes in one line.
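That recommendation could look like the following sketch (Python stands in for the PHP logging class; `MAX_LOG_LINE` and the truncation marker are hypothetical choices):

```python
MAX_LOG_LINE = 4096  # assumed cap, e.g. 4 KB per message


def safe_log_line(message: str, limit: int = MAX_LOG_LINE) -> str:
    """Truncate an oversized log message before it reaches the file write.

    Hypothetical helper mirroring the recommendation: the low-level
    logging class should refuse to emit multi-hundred-KB lines.
    """
    if len(message) <= limit:
        return message
    dropped = len(message) - limit
    return message[:limit] + "...[truncated {} bytes]".format(dropped)


# A 119 KB line like the one in the incident is cut down to roughly 4 KB:
big = "x" * (119 * 1024)
print(len(safe_log_line(big)))
```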

