This article documents a troubleshooting session carried out together with the Essence Securities container cloud technical team.

Problem background

In the customer's Kubernetes cluster, the kubelet process on every worker node had been running with abnormally high CPU usage for a long time; pidstat showed it pegged at close to 100%. This article records how we tracked down the cause.

(Figure: pidstat output showing the kubelet process at ~100% CPU)

Cluster environment

(Figure: cluster environment details)

Investigation process

Use the strace tool to trace the kubelet process

1. Since the kubelet process's CPU usage is abnormal, we can use strace to dynamically trace its system calls. First, run strace -cp <PID> to summarize, over a period of time, the time spent in, the number of calls to, and the errors returned by each system call made by the kubelet process.

(Figure: strace -cp summary for the kubelet process)

As the output above shows, the futex system call returned more than 5,000 errors, which is clearly abnormal, and it accounted for 99% of the time spent in system calls, so we need to look more closely at what the kubelet process is actually calling.

2. Since strace -cp only shows an aggregate view of the process's system calls, we can use strace -tt -p <PID> to print each system call with its timestamp, as follows:

(Figure: strace -tt -p output showing futex calls returning ETIMEDOUT)

Judging from the strace output, the futex-related system calls frequently fail with "Connection timed out", returning -1 with ETIMEDOUT, which is why strace -cp reported so many errors.

futex is a synchronization mechanism that combines user space and kernel space. When the futex word indicates that there is contention, the process issues the futex system call to perform the corresponding operation, such as waiting or waking another waiter. According to the official documentation, the system call takes the following parameters:

futex(uint32_t *uaddr, int futex_op, uint32_t val,
                 const struct timespec *timeout,   /* or: uint32_t val2 */
                 uint32_t *uaddr2, uint32_t val3);

The official document gives an explanation of ETIMEDOUT:

ETIMEDOUT
       The operation in futex_op employed the timeout specified in
       timeout, and the timeout expired before the operation
       completed.

That is, the operation could not be completed within the specified timeout. In the strace output above, the futex_op values correspond to FUTEX_WAIT_PRIVATE and FUTEX_WAKE_PRIVATE, and you can see that the errors essentially all occurred when a FUTEX_WAIT_PRIVATE call's timeout expired.

From the system-call level alone, we can tell that the futex waits are not sleeping and waking up cleanly, but we still do not know what operations are behind these futex calls, so we cannot yet determine why the kubelet's CPU usage is soaring. We need to go one level up and look at kubelet's function calls to find out where execution is getting stuck.

FUTEX_PRIVATE_FLAG: this flag tells the kernel that the futex is private to the process and is not shared with other processes; FUTEX_WAIT_PRIVATE and FUTEX_WAKE_PRIVATE are the FUTEX_WAIT and FUTEX_WAKE operations with this flag set.
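
To make the ETIMEDOUT result concrete, here is a minimal Go sketch (illustrative only; in kubelet these futex calls are issued by the Go runtime's own synchronization primitives, not by application code). It performs a FUTEX_WAIT with FUTEX_PRIVATE_FLAG on a futex word that nothing will ever wake, using a 100 ms timeout, and gets back the same ETIMEDOUT error that strace showed for the kubelet process.

package main

import (
  "fmt"
  "unsafe"

  "golang.org/x/sys/unix"
)

func main() {
  var word uint32 // futex word, currently 0

  // 100 ms relative timeout for the wait.
  ts := unix.Timespec{Nsec: 100 * 1000 * 1000}

  // futex(&word, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 0, &ts, NULL, 0):
  // sleep as long as *word still equals the expected value (0).
  _, _, errno := unix.Syscall6(
    unix.SYS_FUTEX,
    uintptr(unsafe.Pointer(&word)),
    uintptr(unix.FUTEX_WAIT|unix.FUTEX_PRIVATE_FLAG),
    0, // expected value of the futex word
    uintptr(unsafe.Pointer(&ts)),
    0, 0,
  )

  if errno == unix.ETIMEDOUT {
    fmt.Println("futex(FUTEX_WAIT|FUTEX_PRIVATE_FLAG) returned ETIMEDOUT, as seen in the strace output")
  } else {
    fmt.Println("unexpected result:", errno)
  }
}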

futex(7) man page: https://man7.org/linux/man-pages/man7/futex.7.html
futex(2) man page: https://man7.org/linux/man-pages/man2/futex.2.html

Use the go pprof tool to analyze kubelet function calls

In earlier Kubernetes versions, profiling data could be fetched directly through the debug/pprof endpoint, but this endpoint was later disabled for security reasons; see CVE-2019-11248 ( https://github.com/kubernetes/kubernetes/issues/81023 ) for details. We will therefore go through kubectl proxy to collect the relevant data:

1. First use the kubectl proxy command to start the API server proxy

kubectl proxy --address='0.0.0.0'  --accept-hosts='^*$'

Note that if you are using a kubeconfig file copied from the Rancher UI, you need to switch to the context that points directly at the master node's IP. If the cluster was installed with RKE or other tools, you can ignore this.

2. Set up a Golang environment. go tool pprof needs a Go environment to run; if Go is not installed locally, you can quickly spin one up with Docker:

docker run -itd --name golang-env --net host golang bash

3. Use go tool pprof to collect and export the profile. Replace 127.0.0.1 with the IP of the node where kubectl proxy is running (the proxy listens on port 8001 by default); if the Docker container runs on that same node, 127.0.0.1 can be used as-is. Also replace NODENAME with the name of the node to be profiled.

docker exec -it golang-env bash
go tool pprof -seconds=60 -raw -output=kubelet.pprof http://127.0.0.1:8001/api/v1/nodes/${NODENAME}/proxy/debug/pprof/profile

4. The exported pprof file is inconvenient to read directly and needs to be converted into a flame graph. The FlameGraph tool is recommended for generating the SVG:

git clone https://github.com/brendangregg/FlameGraph.git
cd FlameGraph/
./stackcollapse-go.pl kubelet.pprof > kubelet.out
./flamegraph.pl kubelet.out > kubelet.svg

After converting it to a flame graph, you can open the SVG in a browser and see visually which function calls account for how much of the sampled time.

5. Analyze the flame graph

(Figure: kubelet CPU flame graph)

As can be seen from the flame graph, most of the time is spent in k8s.io/kubernetes/vendor/github.com/google/cadvisor/manager.(*containerData).housekeeping. cAdvisor is the metrics collector built into kubelet; it is responsible for real-time monitoring and performance data collection of the resources and containers on each node, including CPU usage, memory usage, network throughput, and file system usage.

Going deeper into the call stack, we find that k8s.io/kubernetes/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs.(*Manager).GetStats accounts for most of the time spent inside housekeeping, which indicates that fetching the containers' cgroup-related state is where the time is going.
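
To understand why one slow call shows up as sustained CPU usage, note that housekeeping is a periodic per-container loop that re-collects cgroup stats on a short interval, so any slowness in GetStats is paid again and again for every container on the node. Below is a simplified, hypothetical sketch of that shape, not the vendored cAdvisor code; readCgroupMemoryStat merely stands in for the memory portion of GetStats.

package main

import (
  "fmt"
  "os"
  "path/filepath"
  "time"
)

// readCgroupMemoryStat stands in for the memory part of GetStats: it reads
// the cgroup's memory.stat file and reports how long that took.
func readCgroupMemoryStat(cgroupPath string) (time.Duration, error) {
  start := time.Now()
  _, err := os.ReadFile(filepath.Join(cgroupPath, "memory.stat"))
  return time.Since(start), err
}

func main() {
  // One entry per container cgroup; the root memory cgroup is used here
  // only as an always-present example path.
  cgroups := []string{"/sys/fs/cgroup/memory"}

  for range time.Tick(10 * time.Second) { // housekeeping-style interval
    for _, cg := range cgroups {
      if d, err := readCgroupMemoryStat(cg); err == nil {
        fmt.Printf("collected stats for %s in %s\n", cg, d)
      }
    }
  }
}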

6. Since this function takes a long time, let's analyze what this function does.

View the source code:
https://github.com/kubernetes/kubernetes/blob/ded8a1e2853aef374fc93300fe1b225f38f19d9d/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs/memory.go#L162

func (s *MemoryGroup) GetStats(path string, stats *cgroups.Stats) error {
  // Set stats from memory.stat.
  statsFile, err := os.Open(filepath.Join(path, "memory.stat"))
  if err != nil {
    if os.IsNotExist(err) {
      return nil
    }
    return err
  }
  defer statsFile.Close()

  sc := bufio.NewScanner(statsFile)
  for sc.Scan() {
    t, v, err := fscommon.GetCgroupParamKeyValue(sc.Text())
    if err != nil {
      return fmt.Errorf("failed to parse memory.stat (%q) - %v", sc.Text(), err)
    }
    stats.MemoryStats.Stats[t] = v
  }
  stats.MemoryStats.Cache = stats.MemoryStats.Stats["cache"]

  memoryUsage, err := getMemoryData(path, "")
  if err != nil {
    return err
  }
  stats.MemoryStats.Usage = memoryUsage
  swapUsage, err := getMemoryData(path, "memsw")
  if err != nil {
    return err
  }
  stats.MemoryStats.SwapUsage = swapUsage
  kernelUsage, err := getMemoryData(path, "kmem")
  if err != nil {
    return err
  }
  stats.MemoryStats.KernelUsage = kernelUsage
  kernelTCPUsage, err := getMemoryData(path, "kmem.tcp")
  if err != nil {
    return err
  }
  stats.MemoryStats.KernelTCPUsage = kernelTCPUsage

  useHierarchy := strings.Join([]string{"memory", "use_hierarchy"}, ".")
  value, err := fscommon.GetCgroupParamUint(path, useHierarchy)
  if err != nil {
    return err
  }
  if value == 1 {
    stats.MemoryStats.UseHierarchy = true
  }

  pagesByNUMA, err := getPageUsageByNUMA(path)
  if err != nil {
    return err
  }
  stats.MemoryStats.PageUsageByNUMA = pagesByNUMA

  return nil
}

As the code shows, this function reads memory.stat, the file that holds the cgroup's memory usage statistics. In other words, the time is being spent reading this file. So what happens if we read the file by hand?

# time cat /sys/fs/cgroup/memory/memory.stat >/dev/null
real 0m9.065s
user 0m0.000s
sys 0m9.064s

Here is the clue: reading this single file took about 9 seconds, which is clearly abnormal.
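
To check whether the slowness is global or tied to particular cgroups, a small ad-hoc Go sketch like the one below (a diagnostic helper for this scenario, not part of kubelet) can walk the cgroup v1 memory hierarchy and print every memory.stat file that takes more than 100 ms to read.

package main

import (
  "fmt"
  "io/fs"
  "os"
  "path/filepath"
  "time"
)

func main() {
  root := "/sys/fs/cgroup/memory" // cgroup v1 memory hierarchy

  filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
    if err != nil || d.IsDir() || d.Name() != "memory.stat" {
      return nil
    }
    start := time.Now()
    if _, err := os.ReadFile(path); err != nil {
      return nil
    }
    if elapsed := time.Since(start); elapsed > 100*time.Millisecond {
      fmt.Printf("%-12s %s\n", elapsed.Round(time.Millisecond), path)
    }
    return nil
  })
}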

Based on these results, we found a related issue on cAdvisor's GitHub ( https://github.com/google/cadvisor/issues/1774 ), which points to the slab memory cache. According to the issue, memory on affected machines is gradually eaten up, and /proc/meminfo shows that the consumer is slab memory, i.e. kernel cache pages, most of which are dentry cache. From this we can conclude that even after all the processes in a cgroup have ended, the cgroup's cached objects remain in slab memory, so the cgroup cannot be released and lingers as a zombie cgroup.

In other words, whenever a memory cgroup is created, the kernel allocates memory for it in kernel space, including the cgroup-related caches (dentry and inode caches, i.e. the caches of directory and file metadata). These caches exist to speed up reads, but when all the processes in the cgroup exit, the corresponding cache in kernel memory is not cleared.

The kernel allocates memory with the buddy algorithm: whenever a process requests memory, it is given at least one page (4 KB), and memory is also freed in whole pages. For requests of only tens or hundreds of bytes, a 4 KB page is far too coarse, so Linux introduces the slab allocator to manage these small allocations. The consequence is that when all the processes in a cgroup exit, this slab-backed memory is not easily reclaimed, and the cached data it holds still has to be walked when the cgroup's stats are read, which hurts read performance.
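
On a suspect node you can confirm this picture from /proc/meminfo (from a shell, grep Slab /proc/meminfo shows the same fields). The short Go helper below is illustrative only: a large SReclaimable value, made up mostly of dentry and inode caches, on a node with many dead cgroups matches the behavior described in the cAdvisor issue.

package main

import (
  "bufio"
  "fmt"
  "os"
  "strings"
)

func main() {
  f, err := os.Open("/proc/meminfo")
  if err != nil {
    panic(err)
  }
  defer f.Close()

  // Slab: total slab memory; SReclaimable: the reclaimable part
  // (mostly dentry and inode caches); SUnreclaim: the rest.
  scanner := bufio.NewScanner(f)
  for scanner.Scan() {
    line := scanner.Text()
    if strings.HasPrefix(line, "Slab:") ||
      strings.HasPrefix(line, "SReclaimable:") ||
      strings.HasPrefix(line, "SUnreclaim:") {
      fmt.Println(line)
    }
  }
}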

Solution

1. Clear the node's caches. This is only a temporary workaround: dropping the kernel's reclaimable slab objects (dentries and inodes, which is what echo 2 does) brings the kubelet CPU usage back down for a while, but once the cache builds up again the CPU usage climbs back up.

echo 2 > /proc/sys/vm/drop_caches

2. Upgrade the kernel version

This is, at its root, a kernel issue; the relevant upstream fix is commit 205b20cc5a99 ( https://github.com/torvalds/linux/commit/205b20cc5a99cdf197c32f4dbee2b09c699477f0 ). Kernels 5.2 and later optimize the performance of cgroup stats queries, so if you want to solve the problem properly, it is recommended to upgrade the kernel to a version appropriate for your operating system and environment.
In addition, Red Hat has backported the cgroup performance optimization into kernel-4.18.0-176 ( https://bugzilla.redhat.com/show_bug.cgi?id=1795049 ), and CentOS 8/RHEL 8 ships a 4.18 kernel by default. If you are currently running RHEL 7/CentOS 7, consider gradually migrating to the newer operating system and running kernel 4.18.0-176 or later; newer kernels generally provide a much better experience for containers.

Kernel related commit:
https://github.com/torvalds/linux/commit/205b20cc5a99cdf197c32f4dbee2b09c699477f0
Red Hat kernel bug fix:
https://bugzilla.redhat.com/show_bug.cgi?id=1795049
