Just reading the CPU utilization inside a container: how can that occupy 100% of a single core?

Background: this was reproduced in an environment running CentOS 7 with lxcfs and JDK 11.

Below is how we troubleshot and solved the problem.

1. Failure phenomenon

The OPPO kernel team received a case from Gan on the JVM team. The symptom was that the Java hot spot was consistently the following function, which accounted for a high proportion of CPU time:

at sun.management.OperatingSystemImpl.getProcessCpuLoad(Native Method)

After being granted access, we logged in to the OPPO cloud host and saw the following in top:

   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                       
617248 service   20   0   11.9g   8.6g   9300 R 93.8  2.3   3272:09 java                                                                          
179526 service   20   0   11.9g   8.6g   9300 R 93.8  2.3  77462:17 java     

2. Failure analysis

A Java thread driving CPU usage this high is usually an infinite loop in GC or JNI code. Judging from the hotspots reported by the JVM team, this thread was repeatedly fetching CPU utilization rather than hitting one of the usual suspects. From perf, the thread was continuously issuing read calls.

We then used strace to check the return value of read, which printed the following:

read(360, 0x7f6f9000b000, 4096)         = -1 ENOTCONN (Transport endpoint is not connected)
read(360, 0x7f6f9000b000, 4096)         = -1 ENOTCONN (Transport endpoint is not connected)
read(360, 0x7f6f9000b000, 4096)         = -1 ENOTCONN (Transport endpoint is not connected)
read(360, 0x7f6f9000b000, 4096)         = -1 ENOTCONN (Transport endpoint is not connected)
read(360, 0x7f6f9000b000, 4096)         = -1 ENOTCONN (Transport endpoint is not connected)
read(360, 0x7f6f9000b000, 4096)         = -1 ENOTCONN (Transport endpoint is not connected)

Next, check what fd 360 refers to:

# ll /proc/22345/fd/360
lr-x------ 1 service service 64 Feb  4 11:24 /proc/22345/fd/360 -> /proc/stat

So the thread is reading /proc/stat, which is consistent with the Java hotspot above that reads CPU utilization.

Check the mount points for /proc/stat inside the container:

# cat /proc/mounts  |grep stat
tmpfs /proc/timer_stats tmpfs rw,nosuid,size=65536k,mode=755 0 0
lxcfs /proc/stat fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/stat fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/diskstats fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0

The strange thing is that there are two mount entries for /proc/stat.

Try to read this file manually:

# cat /proc/stat
cpu  12236554 0 0 9808893 0 0 0 0 0 0
cpu0 10915814 0 0 20934 0 0 0 0 0 0
cpu1 1320740 0 0 9787959 0 0 0 0 0 0

So bash has no problem accessing this mount point. Could the mount point accessed by bash and the one accessed by the problematic Java thread be different?

Check the super_block and mount corresponding to the fd opened by the Java process:

crash> files 22345 |grep -w 360
360 ffff914f93efb400 ffff91541a16d440 ffff9154c7e29500 REG  /rootfs/proc/stat/proc/stat
crash> inode.i_sb ffff9154c7e29500
  i_sb = 0xffff9158797bf000

Then check the mount information in the mount namespace that the process belongs to:

crash> mount -n 22345 |grep lxcfs
ffff91518f28b000 ffff9158797bf000 fuse   lxcfs     /rootfs/proc/stat
ffff911fb0209800 ffff914b3668e800 fuse   lxcfs     /rootfs/var/lib/baymax/var/lib/baymax/lxcfs
ffff915535becc00 ffff914b3668e800 fuse   lxcfs     /rootfs/proc/cpuinfo
ffff915876b45380 ffff914b3668e800 fuse   lxcfs     /rootfs/proc/meminfo
ffff912415b56e80 ffff914b3668e800 fuse   lxcfs     /rootfs/proc/uptime
ffff91558260f300 ffff914b3668e800 fuse   lxcfs     /rootfs/proc/stat/proc/stat
ffff911f52a6bc00 ffff914b3668e800 fuse   lxcfs     /rootfs/proc/diskstats
ffff914d24617180 ffff914b3668e800 fuse   lxcfs     /rootfs/proc/swaps
ffff911de87d0180 ffff914b3668e800 fuse   lxcfs     /rootfs/sys/devices/system/cpu/online

Because a cat is too short-lived, we wrote a simple demo that opens the corresponding /proc/stat path, reads it, and does not close it. The super_block of the mount point it accesses turned out to be ffff914b3668e800. The demo itself is trivial; a minimal sketch is given below.
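
A minimal sketch of such a demo (our own reconstruction; the point is simply to hold the fd open long enough to inspect the process with crash's "files <pid>" and "mount -n <pid>" commands):

// Reconstruction sketch: open /proc/stat, read it once, and keep the fd
// open so that crash shows which super_block this process is actually using.
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    FILE *fh = fopen("/proc/stat", "r");

    if (fh == NULL)
        return -1;

    if (fread(buf, 1, sizeof(buf), fh) == 0)  // touch the file once
        perror("fread");

    pause();   // deliberately never fclose(): keep the mount point busy
    return 0;
}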

This further confirms that the problematic Java process is accessing a stale mount point. So why does accessing this mount point return ENOTCONN?

crash> struct file.private_data ffff914f93efb400
  private_data = 0xffff911e57b49b80   <- according to the fuse module code, this is a fuse_file

crash> fuse_file.fc 0xffff911e57b49b80
  fc = 0xffff9158797be000
crash> fuse_conn.connected 0xffff9158797be000
  connected = 0

Sure enough, the connected state of the corresponding fuse_conn is 0. Looking at the code, the error is returned from __fuse_get_req. The call chain is as follows:

sys_read --> vfs_read --> fuse_direct_read --> __fuse_direct_read --> fuse_direct_io
    --> fuse_get_req --> __fuse_get_req

static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
                       bool for_background)
{
...
    err = -ENOTCONN;
    if (!fc->connected)
        goto out;
...
}

The next question is why the problematic Java thread does not check the return value of read and keeps retrying.

Tracing it with gdb:

(gdb) bt
#0  0x00007fec3a56910d in read () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007fec3a4f5c04 in _IO_new_file_underflow (fp=0x7fec0c01ae00) at fileops.c:593
#2  0x00007fec3a4f6dd2 in __GI__IO_default_uflow (fp=0x7fec0c01ae00) at genops.c:414
#3  0x00007fec3a4f163e in _IO_getc (fp=0x7fec0c01ae00) at getc.c:39
#4  0x00007febeacc6738 in get_totalticks.constprop.4 () from /usr/local/paas-agent/HeyTap-Linux-x86_64-11.0.7/lib/libmanagement_ext.so
#5  0x00007febeacc6e47 in Java_com_sun_management_internal_OperatingSystemImpl_getSystemCpuLoad0 ()
   from /usr/local/paas-agent/HeyTap-Linux-x86_64-11.0.7/lib/libmanagement_ext.so

This confirmed that user space was looping inside _IO_getc. Gan from the JVM team then walked through the code and found a very suspicious spot:

In UnixOperatingSystem.c, the get_totalticks function contains the following loop:

static void next_line(FILE *f) {
    while (fgetc(f) != '\n');
}
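
Once the underlying read fails with ENOTCONN, fgetc keeps returning EOF, and since EOF is never equal to '\n', this loop spins forever and burns a full core. A minimal fix sketch (our own illustration, not the upstream JDK patch) would also stop on EOF:

// Fix sketch (illustration only, not the upstream JDK patch):
// stop on EOF as well, so a failing read cannot spin forever.
static void next_line(FILE *f) {
    int c;
    do {
        c = fgetc(f);
    } while (c != '\n' && c != EOF);
}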

Looking at the fuse module code, there are not many places where fc->connected is set to 0; it basically happens on an abort or in fuse_put_super. By correlating the time at which the Java thread's CPU usage rose with the journal log, we confirmed that the fuse mount point had been invalidated: the user-space file system lxcfs exited while the mount point and an inode were still being accessed, so the super_block could not be reclaimed in time. The new mount is set up as follows:

remount_container() {
    export f=/proc/stat    && test -f /var/lib/baymax/lxcfs/$f && (umount $f;  mount --bind /var/lib/baymax/lxcfs/$f $f)

Bind mounting is used here: the file exported by lxcfs under /var/lib/baymax/lxcfs is made visible again at /proc/stat. Refer to the documentation on bind mounts for their specific characteristics.
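
For illustration, here is a minimal sketch of what mount --bind does at the syscall level, using the same paths as the remount_container() script above (this is our own sketch, not the production script):

// Bind-mount sketch (illustration only): make the lxcfs-backed file
// visible at /proc/stat again via mount(2) with MS_BIND.
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    // MS_BIND makes an existing file visible at a second path. File
    // descriptors that were already open keep pointing at the old
    // (now dead) fuse super_block, which is why a stale fd stays broken.
    if (mount("/var/lib/baymax/lxcfs/proc/stat", "/proc/stat",
              NULL, MS_BIND, NULL) != 0) {
        perror("mount --bind");
        return 1;
    }
    return 0;
}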

And in the journal log we found the key line:

: umount: /proc/stat: target is busy

This log line shows that the umount failed at the time because not every access to the mount had been closed.

3. Failure reproduction

To verify the above hypothesis, we wrote a very simple demo:

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
/* the /proc/stat fields below are read into unsigned long, so scan with %lu */
#define DEC_64 "%lu"

static void next_line(FILE *f) {
    while (fgetc(f) != '\n'); // the key point where the error occurs
}

#define SET_VALUE_GDB 10
int main(int argc, char* argv[])
{
  unsigned long i = 1;
  unsigned long j = 0;
  FILE          *fh;
  unsigned long userTicks, niceTicks, systemTicks, idleTicks;
  unsigned long iowTicks = 0, irqTicks = 0, sirqTicks = 0;
  int           n;

  if ((fh = fopen("/proc/stat", "r")) == NULL) {
    return -1;
  }

  n = fscanf(fh, "cpu " DEC_64 " " DEC_64 " " DEC_64 " " DEC_64 " " DEC_64 " "
                 DEC_64 " " DEC_64,
             &userTicks, &niceTicks, &systemTicks, &idleTicks,
             &iowTicks, &irqTicks, &sirqTicks);

  // Move to next line
  while (i != SET_VALUE_GDB)  // while the mount point stays unchanged we spin in this outer loop; a single core sits near 40%
  {
     next_line(fh);           // once the mount point changes, read returns ENOTCONN and we spin inside next_line; a single core hits nearly 100%
     j++;
  }

  fclose(fh);
  return 0;
}

Compile and run the above code; the results are as follows:

#gcc -g -o caq.o caq.c
At first, running ./caq.o on its own, the CPU usage looks like this:
628957 root      20   0    8468    612    484 R  32.5  0.0  18:40.92 caq.o 
The CPU usage is around 32.5%.
At this point the mount information is as follows:
crash> mount -n 628957 |grep lxcfs
ffff88a5a2619800 ffff88a1ab25f800 fuse   lxcfs     /rootfs/proc/stat
ffff88cf53417000 ffff88a4dd622800 fuse   lxcfs     /rootfs/var/lib/baymax/var/lib/baymax/lxcfs
ffff88a272f8c600 ffff88a4dd622800 fuse   lxcfs     /rootfs/proc/cpuinfo
ffff88a257b28900 ffff88a4dd622800 fuse   lxcfs     /rootfs/proc/meminfo
ffff88a5aff40300 ffff88a4dd622800 fuse   lxcfs     /rootfs/proc/uptime
ffff88a3db2bd680 ffff88a4dd622800 fuse   lxcfs     /rootfs/proc/stat/proc/stat
ffff88a2836ba400 ffff88a4dd622800 fuse   lxcfs     /rootfs/proc/diskstats
ffff88bcb361b600 ffff88a4dd622800 fuse   lxcfs     /rootfs/proc/swaps
ffff88776e623480 ffff88a4dd622800 fuse   lxcfs     /rootfs/sys/devices/system/cpu/online

Since the fd for /proc/stat is never closed, the program keeps spinning in the outer loop. Now restart the lxcfs mounts:
#systemctl restart lxcfs
After the restart, the mount information is as follows:
crash> mount -n 628957 |grep lxcfs
ffff88a5a2619800 ffff88a1ab25f800 fuse   lxcfs     /rootfs/proc/stat
ffff88a3db2bd680 ffff88a4dd622800 fuse   lxcfs     /rootfs/proc/stat/proc/stat   <- the fd is still open, so unmounting this mount point is bound to fail; note the super_block is still the pre-restart one
ffff887795a8f600 ffff88a53b6c6800 fuse   lxcfs     /rootfs/var/lib/baymax/var/lib/baymax/lxcfs
ffff88a25472ae80 ffff88a53b6c6800 fuse   lxcfs     /rootfs/proc/cpuinfo
ffff88cf75ff1e00 ffff88a53b6c6800 fuse   lxcfs     /rootfs/proc/meminfo
ffff88a257b2ad00 ffff88a53b6c6800 fuse   lxcfs     /rootfs/proc/uptime
ffff88cf798f0d80 ffff88a53b6c6800 fuse   lxcfs     /rootfs/proc/stat/proc/stat/proc/stat   <- the bind mount generates a new /proc/stat
ffff88cf36ff2880 ffff88a53b6c6800 fuse   lxcfs     /rootfs/proc/diskstats
ffff88cf798f1f80 ffff88a53b6c6800 fuse   lxcfs     /rootfs/proc/swaps
ffff88a53f295980 ffff88a53b6c6800 fuse   lxcfs     /rootfs/sys/devices/system/cpu/online
The CPU usage immediately jumps to nearly 100%:
628957 root      20   0    8468    612    484 R  98.8  0.0  18:40.92 caq.o     

4. Fault avoidance and resolution

Our solution is:

1. Check the return value of fgetc instead of looping on it blindly (see the fix sketch shown earlier).

2. lxcfs 3.1.2 has a segfault bug that causes it to exit; upgrade to version 4.0.0 or later, though this involves adapting to the matching libfuse shared library version.

3. When remounting a fuse-based user-space file system such as lxcfs, if the umount fails, delay and retry; if the retry count is exceeded, raise an alarm through the log alarm system (see the sketch after this list).
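
As an illustration of point 3, here is a minimal sketch of the delay-and-retry idea built on umount2(2); the function name and the alarm log format are our own, not the production baymax tooling:

// Sketch of "retry the umount, alarm if it keeps failing" (illustration only).
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <unistd.h>

static int umount_with_retry(const char *target, int max_tries)
{
    int i;

    for (i = 0; i < max_tries; i++) {
        if (umount2(target, 0) == 0)
            return 0;                  // unmounted cleanly
        if (errno != EBUSY)            // only "target is busy" is worth retrying
            break;
        sleep(1);                      // delay, then retry
    }
    // retries exhausted: emit a line the log alarm system can match on
    fprintf(stderr, "ALARM: umount %s failed: %s\n", target, strerror(errno));
    return -1;
}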
