7

The style of the exchange group suddenly changed, and instead of the old nonsense, they actually talked about technology.
1623650919857100.jpg
1623650924550111.jpg

After reading all the chat records of everyone, I found that the problem has not been solved. It's rare for so many people in the group to talk about technology, holding on to the principle of continuous question and answer, there must be a response, and take the initiative to hook up and see what the problem is.

I probably understand that the problem is this, the apache child process keeps exiting abnormally

$ sudo tail -f /var/log/httpd/error_log
[Sun Jun 13 19:32:25.660349 2021] [core:notice] [pid 23340] AH00052: child pid 20605 exit signal Bus error (7)
[Sun Jun 13 19:32:25.660413 2021] [core:notice] [pid 23340] AH00052: child pid 20606 exit signal Bus error (7)
[Sun Jun 13 19:35:39.862368 2021] [core:notice] [pid 23340] AH00052: child pid 21332 exit signal Bus error (7)
[Sun Jun 13 19:35:50.872363 2021] [core:notice] [pid 23340] AH00052: child pid 21369 exit signal Bus error (7)
[Sun Jun 13 19:39:43.079650 2021] [core:notice] [pid 23340] AH00052: child pid 18595 exit signal Bus error (7)
[Sun Jun 13 19:42:08.210353 2021] [core:notice] [pid 23340] AH00052: child pid 21348 exit signal Bus error (7)
[Sun Jun 13 19:42:16.221076 2021] [core:notice] [pid 23340] AH00052: child pid 21331 exit signal Bus error (7)
[Sun Jun 13 19:42:16.221177 2021] [core:notice] [pid 23340] AH00052: child pid 23156 exit signal Bus error (7)
[Sun Jun 13 19:44:11.329344 2021] [core:notice] [pid 23340] AH00052: child pid 27824 exit signal Bus error (7)

On-site recurrence

I learned that its architecture is apache + php , and group friends are also temporarily accepted ancestral codes, and the project is not very clear. You can use strace to see what is the problem behind it

$ ps -ef|grep apache

First, I saw that the main process id is 23340, and it needs to follow its child process here, so I executed the following command

$ sudo strace -o strace.log -s 1024 $(pidof "/usr/sbin/httpd" -o 23340|sed 's/\([0-9]*\)/-p \1/g')
strace: Process 26212 attached
strace: Process 26211 attached
strace: Process 26191 attached
strace: Process 25940 attached
...

The parameters are roughly explained

codeParaphrase
-o strace.logOutput strace log to strace.log
-s 1024When a parameter of the system call is a string, the content of the specified length is output at most, the default is 32 bytes
pidof "/usr/sbin/httpd" -o 23340Exclude httpd child processes other than the main process 23340

Then open a new window to monitor the log

$ sudo tail -f strace.log |grep SIGBUS -B 500
28741 open("/xxxx/www/data/cache/xxx.php", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 15
28741 write(15, "<?php defined('DYMall') or exit('Access Invalid!'); return array ( ) ?>", 71) = 71
28741 close(15)
...
28741 open("/xxxx/www/data/cache/xxx.php", O_RDONLY) = 15
28741 fstat(15, {st_mode=S_IFREG|0644, st_size=71, ...}) = 0
28741 fstat(15, {st_mode=S_IFREG|0644, st_size=71, ...}) = 0
28741 fstat(15, {st_mode=S_IFREG|0644, st_size=71, ...}) = 0
28741 mmap(NULL, 71, PROT_READ, MAP_SHARED, 15, 0) = 0x7f584aee8000
28741 --- SIGBUS {si_signo=SIGBUS, si_code=BUS_ADRERR, si_addr=0x7f584aee8000} ---
28741 chdir("/etc/httpd")               = 0
28741 rt_sigaction(SIGBUS, {SIG_DFL, [], SA_RESTORER|SA_INTERRUPT, 0x7f58499bc5e0}, {SIG_DFL, [], SA_RESTORER|SA_RESETHAND, 0x7f58499bc5e0}, 8) = 0
28741 kill(28741, SIGBUS)               = 0
28741 rt_sigreturn({mask=[]})           = 140017190993928
28741 --- SIGBUS {si_signo=SIGBUS, si_code=SI_USER, si_pid=28741, si_uid=48} ---
28741 +++ killed by SIGBUS +++
SIGBUS is the most common scenario in user mode, and it is also the easiest to trigger. Generally speaking, the root cause is that after a process mmap a file, another process truncated the file, causing some memory pages from mmap to exceed the actual size of the file. , SIGBUS is triggered when accessing those memory pages that are exceeded

Verify file size changes

Monitor the size of the file to verify whether the file has been changing

$ while :; do cat /xxxx/www/data/cache/xxx.php|wc -L ;done
$ while :; do cat /xxxx/www/data/cache/xxx.php|wc -L ;done|grep -v 71

found that the file does become 0 bytes in a few cases. According to the path of the file, it is preliminarily estimated that the cache file has been overwritten

Confirm the source of the file write operation

$ sudo yum -y install audit auditd-libs

Monitor the write operation of the topic_goodsclass.php

$ sudo auditctl -w /xxxx/www/data/cache/xxx.php -p w

The result is similar to

time->Mon Jun 14 21:21:39 2021
type=PROCTITLE msg=audit(1623676899.778:1883303): proctitle=2F7573722F7362696E2F6874747064002D44464F524547524F554E44
type=PATH msg=audit(1623676899.778:1883303): item=1 name="/xxxx/www/data/cache/xxx.php" inode=30151674 dev=fd:11 mode=0100644 ouid=48 ogid=48 rdev=00:00 objtype=NORMAL
type=PATH msg=audit(1623676899.778:1883303): item=0 name="/xxxx/www/data/cache/" inode=30151667 dev=fd:11 mode=040755 ouid=48 ogid=48 rdev=00:00 objtype=PARENT
type=CWD msg=audit(1623676899.778:1883303):  cwd="/xxxx/www"
type=SYSCALL msg=audit(1623676899.778:1883303): arch=c000003e syscall=2 success=yes exit=16 a0=55b452b93078 a1=241 a2=1b6 a3=2a items=2 ppid=31318 pid=18269 auid=4294967295 uid=48 gid=48 euid=48 suid=48 fsuid=48 egid=48 sgid=48 fsgid=48 tty=(none) ses=4294967295 comm="httpd" exe="/usr/sbin/httpd" key=(null)

Collect for a period of time, do statistics

$ sudo ausearch -f /xxxx/www/data/cache/xxx.php |grep "exe="|awk '{print $12,$26}'|sort|uniq -c
  44200 ppid=31318 exe="/usr/sbin/httpd"

It is found that they are all from /usr/sbin/httpd , and there is no CLI mode (worry about timing tasks to generate the cache), and clear the monitoring

sudo auditctl -D

The solution of multi-process file read-write conflict

fopen($filename, "w+") is performed on the file, the file has been emptied, so the lock needs to be before the fopen target file, so my method is to add a file with a one-to-one corresponding lock file. as follows:
aa.php

function addFileLock($filename, $lock)
{
    $fp = fopen($filename . ".lock", "w+");
    flock($fp, $lock);
    return $fp;
}

function releaseLock($fp)
{
    flock($fp, LOCK_UN);
}

$filename = "cc.log";


// mock 数据
file_put_contents($filename, "100");

$fp = fopen($filename, "r");

if ($lockfp = addFileLock($filename, LOCK_SH)) {

    $n = 10;
    while ($n > 0) {
        sleep(1);
        printf("%d %d\n", time(), --$n);
    }

    echo fgets($fp);

    releaseLock($lockfp);
}

fclose($fp);

bb.php

function setFileLock($filename, $lock)
{
    $fp = fopen($filename . ".lock", "w+");
    flock($fp, $lock);
    return $fp;
}

function releaseFileLock($fp)
{
    flock($fp, LOCK_UN);
    fclose($fp);
}

$filename = "cc.log";

if ($lockfp = setFileLock($filename, LOCK_EX)) {
    $fp = fopen($filename, "w+");
    echo time();
    fwrite($fp, 200);
    releaseFileLock($lockfp);
}

fclose($fp);

So first execute php aa.php and then execute php bb.php , you can see the effect.

Actual code positioning

According to the string keywords in the system call, the relevant cache operation code was found

28741 write(15, "<?php defined('DYMall') or exit('Access Invalid!'); return array ( ) ?>", 71) = 71

After he found online is still a problem, I had already suspected life, or by calling the system found time to read and I did not go to increase the share lock
WechatIMG536.jpg
It turns out that the cache reading is based on other logic. From the screenshot above, you can see that the cache writing has gone. The modified code (locking logic), and then search for the cache writing method based on personal experience, and find it according to the picture. The read logic of another cache file was modified, and then modified again, the online problem did not reappear.

The problem was solved, Paolo sent me a red envelope, and I accepted it with satisfaction.

IMG_9532.jpg

to sum up

  1. First SIGBUS the scene of strace
  2. It is expected to be a problem caused by concurrent reading and writing of files under the multi-process model
  3. Through auditctl monitor the write operation source of the statistical file are httpd no other process to modify
  4. There are still problems online through read plus shared lock, write plus exclusive lock
  5. According to strace again, I found that the reading place is elsewhere. Based on experience, make up and fix the problem.

Disadvantages: The final positioning relies too much on personal experience and is not systematic enough.

Small idea: You can make an extension and use zend_set_user_opcode_handler to monitor custom functions and methods and system methods and function call chains, and then write logs during the initialization phase of each request, such as

write(fd, "{unique code}", xxx);
write(fd, "request uri", xxx);

Then the method call zend_set_user_opcode_handler , so that it is convenient to troubleshoot some business logic problems of the ancestral code.

If there are other faster and simpler solutions, please leave a message and let me know.


周梦康
9k 声望6.7k 粉丝

退隐江湖