
Over the past few days, some of my Pods kept crashing, and the OS syslog showed that the OOM killer was killing the container processes. I did some research to figure out exactly how this works.

Pod memory limits and cgroup memory settings

Create a Pod with its memory limit set to 128Mi:

kubectl run --restart=Never --rm -it --image=ubuntu --limits='memory=128Mi' sh
If you don't see a command prompt, try pressing enter.
root@sh:/#
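
The --limits flag of kubectl run has been deprecated (and later removed) in newer kubectl releases; a rough equivalent is to apply a manifest directly and then exec into the Pod (a sketch, reusing the Pod name sh from below):

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: sh
spec:
  restartPolicy: Never
  containers:
  - name: sh
    image: ubuntu
    command: ["sleep", "infinity"]
    resources:
      limits:
        memory: 128Mi
EOF
kubectl exec -it sh -- bash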

Open another terminal and get the Pod's uid with:

kubectl get pods sh -o yaml | grep uid
  uid: 98f587f8-8994-4eb4-a7b6-d62be890cc08

Then find the node the Pod is running on:

kubectl get pods -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP             NODE          NOMINATED NODE   READINESS GATES
sh     1/1     Running   0          52s   10.107.1.136   10.67.62.22   <none>           <none>
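
Both values can also be read directly with jsonpath instead of grepping:

kubectl get pod sh -o jsonpath='{.metadata.uid}{"\n"}'
kubectl get pod sh -o jsonpath='{.spec.nodeName}{"\n"}'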

On the node where the Pod runs (10.67.62.22), check the cgroup settings for the Pod's uid.

First, change into the Pod's cgroup directory:

cd /sys/fs/cgroup/memory/kubepods/burstable/pod98f587f8-8994-4eb4-a7b6-d62be890cc08

Running ls shows the following (the two long hash-named entries are the child cgroups of the Pod's two containers, i.e. the pause container and the ubuntu container):

ls
bdc2f6d0b9791f9a8b86c1e877c830e387170c955cdc09866350344b19e08a6e  memory.kmem.failcnt                 memory.kmem.tcp.usage_in_bytes   memory.memsw.usage_in_bytes      memory.swappiness
cgroup.clone_children                                             memory.kmem.limit_in_bytes          memory.kmem.usage_in_bytes       memory.move_charge_at_immigrate  memory.usage_in_bytes
cgroup.event_control                                              memory.kmem.max_usage_in_bytes      memory.limit_in_bytes            memory.numa_stat                 memory.use_hierarchy
cgroup.procs                                                      memory.kmem.slabinfo                memory.max_usage_in_bytes        memory.oom_control               notify_on_release
ffe85090722bdfbb94fab8a7b58ce714191c3456c2b405240afd58756664cc88  memory.kmem.tcp.failcnt             memory.memsw.failcnt             memory.pressure_level            tasks
memory.failcnt                                                    memory.kmem.tcp.limit_in_bytes      memory.memsw.limit_in_bytes      memory.soft_limit_in_bytes
memory.force_empty                                                memory.kmem.tcp.max_usage_in_bytes  memory.memsw.max_usage_in_bytes  memory.stat

Check the limit value:

cat memory.limit_in_bytes
134217728

The number 134217728 is exactly 128Mi (128 * 1024 * 1024), so it is now clear that Kubernetes sets the memory limit through the cgroup. Once the Pod consumes more memory than the limit, the cgroup will start killing container processes.
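
The same directory also shows how the cgroup reacts when the limit is hit. memory.oom_control reports whether the OOM killer is enabled for this cgroup:

cat memory.oom_control

An oom_kill_disable value of 0 (the default) means the kernel kills a task in the cgroup when it cannot reclaim enough memory to stay under the limit, rather than pausing it; under_oom flips to 1 while the cgroup is under OOM.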

Stress test

Let's install the stress tool in the Pod through the open shell session:

root@sh:/# apt update; apt install -y stress

Meanwhile, monitor the syslog on the node by running dmesg -Tw.
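
On nodes running systemd-journald, following the kernel ring buffer with journalctl works just as well:

journalctl -k -f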

First, run the stress tool with 100M, which is within the memory limit:

root@sh:/# stress --vm 1 --vm-bytes 100M &
[1] 253
root@sh:/# stress: info: [253] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
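
While the worker is running, you can watch the cgroup counters climb on the node, from the same cgroup directory as before:

watch -n1 cat memory.usage_in_bytes memory.max_usage_in_bytes memory.failcnt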

Then start a second stress run:

root@sh:/# stress --vm 1 --vm-bytes 100M
stress: info: [256] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [253] (415) <-- worker 254 got signal 9
stress: WARN: [253] (417) now reaping child worker processes
stress: FAIL: [253] (451) failed run completed in 66s

The worker of the first stress process (pid 253 inside the container) was immediately killed with signal 9, and the first run failed.
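
Only the stress worker was killed, not the container: the container's main process is the shell, which is still alive, so the Pod itself is not restarted. You can confirm this from the other terminal (the RESTARTS column should stay at 0 as long as the shell survives):

kubectl get pod sh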

At this point the syslog shows:

[Thu May 21 08:48:41 2020] stress invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=999
[Thu May 21 08:48:41 2020] stress cpuset=ffe85090722bdfbb94fab8a7b58ce714191c3456c2b405240afd58756664cc88 mems_allowed=0-1
[Thu May 21 08:48:41 2020] CPU: 22 PID: 5222 Comm: stress Tainted: G           O   ---- -------   3.10.0-862.14.1.5.h328.eulerosv2r7.x86_64 #1
[Thu May 21 08:48:41 2020] Hardware name: OpenStack Foundation OpenStack Nova, BIOS rel-1.10.2-0-g5f4c7b1-20181220_000000-szxrtosci10000 04/01/2014
[Thu May 21 08:48:41 2020] Call Trace:
[Thu May 21 08:48:41 2020]  [<ffffffffa2d2673f>] dump_stack+0x19/0x1b
[Thu May 21 08:48:41 2020]  [<ffffffffa2d21fe3>] dump_header+0x90/0x229
[Thu May 21 08:48:41 2020]  [<ffffffffa279f6c6>] ? find_lock_task_mm+0x56/0xc0
[Thu May 21 08:48:41 2020]  [<ffffffffa279fbc4>] oom_kill_process+0x254/0x3d0
[Thu May 21 08:48:41 2020]  [<ffffffffa281bad3>] mem_cgroup_oom_synchronize+0x553/0x580
[Thu May 21 08:48:41 2020]  [<ffffffffa281af40>] ? mem_cgroup_charge_common+0xc0/0xc0
[Thu May 21 08:48:41 2020]  [<ffffffffa27a0414>] pagefault_out_of_memory+0x14/0x90
[Thu May 21 08:48:41 2020]  [<ffffffffa2d200e6>] mm_fault_error+0x6a/0x157
[Thu May 21 08:48:41 2020]  [<ffffffffa2d338d6>] __do_page_fault+0x4a6/0x4f0
[Thu May 21 08:48:41 2020]  [<ffffffffa2d33a06>] trace_do_page_fault+0x56/0x150
[Thu May 21 08:48:41 2020]  [<ffffffffa2d32f92>] do_async_page_fault+0x22/0xf0
[Thu May 21 08:48:41 2020]  [<ffffffffa2d2f7a8>] async_page_fault+0x28/0x30
[Thu May 21 08:48:41 2020] Task in /kubepods/burstable/pod98f587f8-8994-4eb4-a7b6-d62be890cc08/ffe85090722bdfbb94fab8a7b58ce714191c3456c2b405240afd58756664cc88 killed as a result of limit of /kubepods/burstable/pod98f587f8-8994-4eb4-a7b6-d62be890cc08
[Thu May 21 08:48:41 2020] memory: usage 131072kB, limit 131072kB, failcnt 6711
[Thu May 21 08:48:41 2020] memory+swap: usage 131072kB, limit 9007199254740988kB, failcnt 0
[Thu May 21 08:48:41 2020] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[Thu May 21 08:48:41 2020] Memory cgroup stats for /kubepods/burstable/pod98f587f8-8994-4eb4-a7b6-d62be890cc08: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[Thu May 21 08:48:41 2020] Memory cgroup stats for /kubepods/burstable/pod98f587f8-8994-4eb4-a7b6-d62be890cc08/bdc2f6d0b9791f9a8b86c1e877c830e387170c955cdc09866350344b19e08a6e: cache:0KB rss:1656KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:1656KB inactive_file:0KB active_file:0KB unevictable:0KB
[Thu May 21 08:48:41 2020] Memory cgroup stats for /kubepods/burstable/pod98f587f8-8994-4eb4-a7b6-d62be890cc08/ffe85090722bdfbb94fab8a7b58ce714191c3456c2b405240afd58756664cc88: cache:100KB rss:129316KB rss_huge:98304KB mapped_file:4KB swap:0KB inactive_anon:0KB active_anon:129268KB inactive_file:0KB active_file:0KB unevictable:0KB
[Thu May 21 08:48:41 2020] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[Thu May 21 08:48:41 2020] [25078]     0 25078     1104      397       7        0          -998 pause
[Thu May 21 08:48:41 2020] [25519]     0 25519     4624      104      14        0           999 bash
[Thu May 21 08:48:41 2020] [ 5221]     0  5221     2057       19       8        0           999 stress
[Thu May 21 08:48:41 2020] [ 5222]     0  5222    27658    17581      42        0           999 stress
[Thu May 21 08:48:41 2020] [ 6772]     0  6772     2057       19       8        0           999 stress
[Thu May 21 08:48:41 2020] [ 6774]     0  6774    27658    14566      36        0           999 stress
[Thu May 21 08:48:41 2020] Memory cgroup out of memory: Kill process 5222 (stress) score 1513 or sacrifice child
[Thu May 21 08:48:41 2020] Killed process 5222 (stress) total-vm:110632kB, anon-rss:70320kB, file-rss:4kB, shmem-rss:0kB

Process 5222 on the host was killed by the OOM killer. Let's look more closely at the last part of the syslog:

[Thu May 21 08:48:41 2020] Memory cgroup out of memory: Kill process 5222 (stress) score 1513 or sacrifice child
[Thu May 21 08:48:41 2020] Killed process 5222 (stress) total-vm:110632kB, anon-rss:70320kB, file-rss:4kB, shmem-rss:0kB

For this Pod, several processes are candidates for the OOM killer. The pause container, which holds the Pod's network namespace, has an oom_score_adj of -998, which guarantees it will not be killed. All other processes in the Pod's containers have an oom_score_adj of 999. We can verify that value with the formula from the Kubernetes documentation:

min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)

Find the node's allocatable memory:

kubectl describe nodes 10.67.62.22 | grep Allocatable -A 7
Allocatable:
  attachable-volumes-csi-disk.csi.everest.io:  58
  cce/eni:                                     15
  cpu:                                         31850m
  ephemeral-storage:                           28411501317
  hugepages-1Gi:                               0
  hugepages-2Mi:                               0
  memory:                                      60280908Ki

If the memory request is not set, it defaults to the same value as the limit, so memoryRequestBytes here is 134217728 (128Mi).

Concretely, plugging the numbers into the formula in bytes (using the allocatable memory of 60280908Ki):

min(max(2, 1000 - (1000 * 134217728) / (60280908 * 1024)), 999) ≈ 998
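
The kubelet performs this calculation with integer arithmetic, and it uses the machine's total memory capacity rather than the Allocatable value shown above. Assuming the node has roughly 64Gi of physical memory (an assumption; only Allocatable is printed here), that is how the processes end up with exactly 999:

echo $(( 1000 - (1000 * 134217728) / (60280908 * 1024) ))         # using Allocatable: 998
echo $(( 1000 - (1000 * 134217728) / (64 * 1024 * 1024 * 1024) )) # using ~64Gi capacity: 999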

Note that all processes in the Pod's containers have the same oom_score_adj. The OOM killer computes an OOM score for each process based on its memory usage and fine-tunes it with oom_score_adj. In the end it killed the worker of the first stress run, the process using the most memory, pid 5222.
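
These values can be inspected directly on the node via /proc, using the pids from the log above (substitute the live pids on your own node):

cat /proc/25078/oom_score_adj   # pause container: -998
cat /proc/25519/oom_score_adj   # bash inside the container: 999
cat /proc/25519/oom_score       # the effective score the OOM killer ranks processes by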

Conclusion

We have only walked through the memory limit behaviour of a Pod with the Burstable QoS class; you can run the same kind of test for the other QoS classes yourself.

Pods of different QoS classes get different oom_score_adj values: Guaranteed Pods are assigned -997, BestEffort Pods 1000, and Burstable Pods a value computed from the formula above.

