Facing the ancestral code Debug, I saved a programmer who was going to run away

The style of the exchange group suddenly changed, and instead of the old nonsense, they actually talked about technology.
1623650919857100.jpg
1623650924550111.jpg

After reading all the chat records of everyone, I found that the problem has not been solved. It's rare for so many people in the group to talk about technology, holding on to the principle of continuous question and answer, there must be a response, and take the initiative to hook up and see what the problem is.

I probably understand that the problem is this, the apache child process keeps exiting abnormally

$ sudo tail -f /var/log/httpd/error_log
[Sun Jun 13 19:32:25.660349 2021] [core:notice] [pid 23340] AH00052: child pid 20605 exit signal Bus error (7)
[Sun Jun 13 19:32:25.660413 2021] [core:notice] [pid 23340] AH00052: child pid 20606 exit signal Bus error (7)
[Sun Jun 13 19:35:39.862368 2021] [core:notice] [pid 23340] AH00052: child pid 21332 exit signal Bus error (7)
[Sun Jun 13 19:35:50.872363 2021] [core:notice] [pid 23340] AH00052: child pid 21369 exit signal Bus error (7)
[Sun Jun 13 19:39:43.079650 2021] [core:notice] [pid 23340] AH00052: child pid 18595 exit signal Bus error (7)
[Sun Jun 13 19:42:08.210353 2021] [core:notice] [pid 23340] AH00052: child pid 21348 exit signal Bus error (7)
[Sun Jun 13 19:42:16.221076 2021] [core:notice] [pid 23340] AH00052: child pid 21331 exit signal Bus error (7)
[Sun Jun 13 19:42:16.221177 2021] [core:notice] [pid 23340] AH00052: child pid 23156 exit signal Bus error (7)
[Sun Jun 13 19:44:11.329344 2021] [core:notice] [pid 23340] AH00052: child pid 27824 exit signal Bus error (7)

On-site recurrence

I learned that its architecture is apache + php , and group friends are also temporarily accepted ancestral codes, and the project is not very clear. You can use strace to see what is the problem behind it

$ ps -ef|grep apache

First, I saw that the main process id is 23340, and it needs to follow its child process here, so I executed the following command

$ sudo strace -o strace.log -s 1024 $(pidof "/usr/sbin/httpd" -o 23340|sed 's/\([0-9]*\)/-p \1/g')
strace: Process 26212 attached
strace: Process 26211 attached
strace: Process 26191 attached
strace: Process 25940 attached
...

The parameters are roughly explained

codeParaphrase
-o strace.logOutput strace log to strace.log
-s 1024When a parameter of the system call is a string, the content of the specified length is output at most, the default is 32 bytes
pidof "/usr/sbin/httpd" -o 23340Exclude httpd child processes other than the main process 23340

Then open a new window to monitor the log

$ sudo tail -f strace.log |grep SIGBUS -B 500
28741 open("/xxxx/www/data/cache/xxx.php", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 15
28741 write(15, "<?php defined('DYMall') or exit('Access Invalid!'); return array ( ) ?>", 71) = 71
28741 close(15)
...
28741 open("/xxxx/www/data/cache/xxx.php", O_RDONLY) = 15
28741 fstat(15, {st_mode=S_IFREG|0644, st_size=71, ...}) = 0
28741 fstat(15, {st_mode=S_IFREG|0644, st_size=71, ...}) = 0
28741 fstat(15, {st_mode=S_IFREG|0644, st_size=71, ...}) = 0
28741 mmap(NULL, 71, PROT_READ, MAP_SHARED, 15, 0) = 0x7f584aee8000
28741 --- SIGBUS {si_signo=SIGBUS, si_code=BUS_ADRERR, si_addr=0x7f584aee8000} ---
28741 chdir("/etc/httpd")               = 0
28741 rt_sigaction(SIGBUS, {SIG_DFL, [], SA_RESTORER|SA_INTERRUPT, 0x7f58499bc5e0}, {SIG_DFL, [], SA_RESTORER|SA_RESETHAND, 0x7f58499bc5e0}, 8) = 0
28741 kill(28741, SIGBUS)               = 0
28741 rt_sigreturn({mask=[]})           = 140017190993928
28741 --- SIGBUS {si_signo=SIGBUS, si_code=SI_USER, si_pid=28741, si_uid=48} ---
28741 +++ killed by SIGBUS +++
SIGBUS is the most common scenario in user mode, and it is also the easiest to trigger. Generally speaking, the root cause is that after a process mmap a file, another process truncated the file, causing some memory pages from mmap to exceed the actual size of the file. , SIGBUS is triggered when accessing those memory pages that are exceeded

Verify file size changes

Monitor the size of the file to verify whether the file has been changing

$ while :; do cat /xxxx/www/data/cache/xxx.php|wc -L ;done
$ while :; do cat /xxxx/www/data/cache/xxx.php|wc -L ;done|grep -v 71

found that the file does become 0 bytes in a few cases. According to the path of the file, it is preliminarily estimated that the cache file has been overwritten

Confirm the source of the file write operation

$ sudo yum -y install audit auditd-libs

Monitor the write operation of the topic_goodsclass.php

$ sudo auditctl -w /xxxx/www/data/cache/xxx.php -p w

The result is similar to

time->Mon Jun 14 21:21:39 2021
type=PROCTITLE msg=audit(1623676899.778:1883303): proctitle=2F7573722F7362696E2F6874747064002D44464F524547524F554E44
type=PATH msg=audit(1623676899.778:1883303): item=1 name="/xxxx/www/data/cache/xxx.php" inode=30151674 dev=fd:11 mode=0100644 ouid=48 ogid=48 rdev=00:00 objtype=NORMAL
type=PATH msg=audit(1623676899.778:1883303): item=0 name="/xxxx/www/data/cache/" inode=30151667 dev=fd:11 mode=040755 ouid=48 ogid=48 rdev=00:00 objtype=PARENT
type=CWD msg=audit(1623676899.778:1883303):  cwd="/xxxx/www"
type=SYSCALL msg=audit(1623676899.778:1883303): arch=c000003e syscall=2 success=yes exit=16 a0=55b452b93078 a1=241 a2=1b6 a3=2a items=2 ppid=31318 pid=18269 auid=4294967295 uid=48 gid=48 euid=48 suid=48 fsuid=48 egid=48 sgid=48 fsgid=48 tty=(none) ses=4294967295 comm="httpd" exe="/usr/sbin/httpd" key=(null)

Collect for a period of time, do statistics

$ sudo ausearch -f /xxxx/www/data/cache/xxx.php |grep "exe="|awk '{print $12,$26}'|sort|uniq -c
  44200 ppid=31318 exe="/usr/sbin/httpd"

It is found that they are all from /usr/sbin/httpd , and there is no CLI mode (worry about timing tasks to generate the cache), and clear the monitoring

sudo auditctl -D

The solution of multi-process file read-write conflict

fopen($filename, "w+") is performed on the file, the file has been emptied, so the lock needs to be before the fopen target file, so my method is to add a file with a one-to-one corresponding lock file. as follows:
aa.php

function addFileLock($filename, $lock)
{
    $fp = fopen($filename . ".lock", "w+");
    flock($fp, $lock);
    return $fp;
}

function releaseLock($fp)
{
    flock($fp, LOCK_UN);
}

$filename = "cc.log";


// mock 数据
file_put_contents($filename, "100");

$fp = fopen($filename, "r");

if ($lockfp = addFileLock($filename, LOCK_SH)) {

    $n = 10;
    while ($n > 0) {
        sleep(1);
        printf("%d %d\n", time(), --$n);
    }

    echo fgets($fp);

    releaseLock($lockfp);
}

fclose($fp);

bb.php

function setFileLock($filename, $lock)
{
    $fp = fopen($filename . ".lock", "w+");
    flock($fp, $lock);
    return $fp;
}

function releaseFileLock($fp)
{
    flock($fp, LOCK_UN);
    fclose($fp);
}

$filename = "cc.log";

if ($lockfp = setFileLock($filename, LOCK_EX)) {
    $fp = fopen($filename, "w+");
    echo time();
    fwrite($fp, 200);
    releaseFileLock($lockfp);
}

fclose($fp);

So first execute php aa.php and then execute php bb.php , you can see the effect.

Actual code positioning

According to the string keywords in the system call, the relevant cache operation code was found

28741 write(15, "<?php defined('DYMall') or exit('Access Invalid!'); return array ( ) ?>", 71) = 71

After he found online is still a problem, I had already suspected life, or by calling the system found time to read and I did not go to increase the share lock
WechatIMG536.jpg
It turns out that the cache reading is based on other logic. From the screenshot above, you can see that the cache writing has gone. The modified code (locking logic), and then search for the cache writing method based on personal experience, and find it according to the picture. The read logic of another cache file was modified, and then modified again, the online problem did not reappear.

The problem was solved, Paolo sent me a red envelope, and I accepted it with satisfaction.

IMG_9532.jpg

to sum up

  1. First SIGBUS the scene of strace
  2. It is expected to be a problem caused by concurrent reading and writing of files under the multi-process model
  3. Through auditctl monitor the write operation source of the statistical file are httpd no other process to modify
  4. There are still problems online through read plus shared lock, write plus exclusive lock
  5. According to strace again, I found that the reading place is elsewhere. Based on experience, make up and fix the problem.

Disadvantages: The final positioning relies too much on personal experience and is not systematic enough.

Small idea: You can make an extension and use zend_set_user_opcode_handler to monitor custom functions and methods and system methods and function call chains, and then write logs during the initialization phase of each request, such as

write(fd, "{unique code}", xxx);
write(fd, "request uri", xxx);

Then the method call zend_set_user_opcode_handler , so that it is convenient to troubleshoot some business logic problems of the ancestral code.

If there are other faster and simpler solutions, please leave a message and let me know.


周梦康
金三银四 还不上车补课?学了就是赚了 [链接]

退隐江湖

8.9k 声望
6.7k 粉丝
0 条评论
推荐阅读
什么?JVM 老年代内存不断上涨竟是因为获取 ServletContext 姿势不对
前几日一直在筹备一个比较大的项目,发现一个问题,还好流量不是非常非常大,不然又得提桶跑路了。在线上运行的时候发现当并发过高的情况,会出现老年代内存上涨的情况。

周梦康2阅读 2.7k

怎样用 PHP 来实现枚举?
在数学和计算机科学理论中,一个集的枚举是列出某些有穷序列集的所有成员的程序,或者是一种特定类型对象的计数。这两种类型经常(但不总是)重叠。枚举是一个被命名的整型常数的集合,枚举在日常生活中很常见,...

唯一丶25阅读 6.3k评论 4

Git操作不规范,战友提刀来相见!
年终奖都没了,还要扣我绩效,门都没有,哈哈。这波骚Git操作我也是第一次用,担心闪了腰,所以不仅做了备份,也做了笔记,分享给大家。问题描述小A和我在同时开发一个功能模块,他在优化之前的代码逻辑,我在开...

王中阳Go5阅读 1.8k评论 2

封面图
微信公众号开发:自动回复文本/图片/图文消息/关键词回复/上传素材/自定义菜单
对接流程1、申请微信公众号测试账号URL:[链接]2、登录,配置开发者服务器URL和Token开发者服务器配置代码:config.php {代码...} URL是config.php在你服务器的URLToken是上面代码自己设置的Token搞定之后,就能完...

TANKING2阅读 10k

Hyperf 3.0 发布,PHP 新时代
在过去的一年半时间里,Hyperf 2.2 共发布了 35 个小版本,使 Hyperf 达到了一个前所未有的高度,这里也获得了一些不错的数据反馈。

huangzhhui3阅读 1.1k

封面图
多个著名 Go 开源项目被放弃,做大开源不能用爱发电,更不能只靠自己!
大家好,我是煎鱼。相信关注我的许多同学都有接触 Go 语言的开发,甚至在企业中多有实践。那么你在日常开发中,势必会接触到 gorilla 组织下的各个 Go 开源项目。如下图:gorilla/mux:Star:17.9k。a powerful r...

煎鱼1阅读 2.3k

Go for 循环有时候真的很坑。。。
大家好,我是煎鱼。不知道有多少 Go 的面试题和泄露,都和 for 循环有关。今天我在周末认真一看,发现了 redefining for loop variable semantics 。著名的硬核大佬 Russ Cox 表示他一直在研究这个问题,并表示十...

煎鱼阅读 3.4k

退隐江湖

8.9k 声望
6.7k 粉丝
宣传栏