Background
An application running in a project's Kubernetes environment was restarting roughly every 20 minutes. The interval was not fixed, and the container events contained no record of the restarts.
Troubleshooting
The first suspicion was that a failing health check was causing the container to restart automatically, so the health check configuration was examined first.
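The probe was configured with a 30-second check interval and an unhealthy threshold of 60; a minimal livenessProbe sketch along those lines (the endpoint path and port are assumptions) looks like this:

```yaml
livenessProbe:
  httpGet:
    path: /health        # assumed health endpoint
    port: 8080           # assumed application port
  periodSeconds: 30      # check interval: 30s
  failureThreshold: 60   # marked unhealthy after 60 consecutive failures
```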
According to this configuration, a check runs every 30s and the container is only marked unhealthy after 60 consecutive failures, i.e. after 30 × 60 = 1800s (30 minutes). However, this guess was quickly ruled out, for the following reasons:
- Calling the health check endpoint manually returned a healthy result
- The health check records in the background showed that every check had passed
- A failed health check would leave a record in the Kubernetes events, and no such record could be found (see the check below)
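To confirm that there really were no probe-failure events, something like the following can be used (the pod name and namespace are placeholders):

```sh
# Liveness probe failures show up as "Unhealthy" events on the pod
kubectl describe pod <pod-name> -n <namespace>

# Or query the events directly
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
```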
Normally, when Kubernetes restarts a pod it records the reason in the events. Since there was no event record here, the restart was probably not initiated by Kubernetes itself. That leads to an obvious possibility: OOM, i.e. the process exceeded its memory limit and was killed by the operating system's OOM killer. The next step was to check the operating system log /var/log/messages.
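One way to search for OOM killer activity is to grep the system log or the kernel ring buffer (the log path differs between distributions; /var/log/messages is what applies here):

```sh
# Look for OOM killer entries in the system log
grep -iE "oom|killed process" /var/log/messages

# Or check the kernel ring buffer with readable timestamps
dmesg -T | grep -iE "oom|killed process"
```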
The following entries were found:
```
May 11 15:01:36 k8s-node-prod-3 kernel: [17094.020710] oom_kill_process.cold.30+0xb/0x1cf
May 11 15:01:36 k8s-node-prod-3 kernel: [17094.020788] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
May 11 15:01:36 k8s-node-prod-3 kernel: [17094.109707] oom_reaper: reaped process 25929 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
May 11 15:29:12 k8s-node-prod-3 kernel: [18750.337581] java invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=969
May 11 15:29:12 k8s-node-prod-3 kernel: [18750.337621] oom_kill_process.cold.30+0xb/0x1cf
May 11 15:29:12 k8s-node-prod-3 kernel: [18750.337709] [ pid ] uid tgid total_
```
Sure enough, there were OOM records, and the timestamps of several kill events lined up with the application restart times. Checking the oom_score of the other Java processes showed that they were also at high risk of being killed. So the culprit was the OOM killer. But why was OOM being triggered at all? The operating system still had plenty of free memory at the time. Note that the heap size configured in the container application's startup parameters is 4096M:
```sh
JAVA_OPTS="$JAVA_OPTS -server -Xms4096M -Xmx4096M -Xss512k ..."
```
The memory limit given to the container was 4500M, which is like running a program that needs 4096M on a machine that only has 4500M. The 4096M heap alone leaves only about 400M of headroom for everything else the JVM needs outside the heap (metaspace, thread stacks at -Xss512k each, code cache, direct buffers), so the process occupies a very large share of the available memory and is naturally an easy target for the OOM killer.
The OOM killer is a Linux kernel feature: it looks for processes that are using too much memory and kills them. The kernel scores each process based on its memory usage; the score is stored in /proc/{process PID}/oom_score, and the higher the score, the more likely the process is to be killed.
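As a quick way to inspect these scores for the Java processes on a node (a minimal sketch, assuming pgrep is available):

```sh
# Print the OOM score and adjustment for every Java process
for pid in $(pgrep java); do
  echo "pid=$pid oom_score=$(cat /proc/$pid/oom_score) oom_score_adj=$(cat /proc/$pid/oom_score_adj)"
done
```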
Solution
Since the memory is already limited by the startup command (-Xms/-Xmx), there is no need to limit it again in Kubernetes, so the memory limit on the Kubernetes side was removed.
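In deployment terms, this amounts to dropping the memory limit from the container's resources block; the sketch below uses the values mentioned above, and keeping a request for scheduling purposes is an assumption:

```yaml
resources:
  requests:
    memory: "4500M"   # kept for scheduling (assumption)
  # limits.memory removed: the JVM's -Xmx now acts as the effective cap
```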
Summary of experience
Although the problem itself was small, its impact was large: the client is a multinational group whose business spans multiple countries, and the transaction volume over a full day is very high. The following lessons can be drawn from this problem:
- If Kubernetes has no event record for a restart, the cause is likely at the operating system level, and that is where troubleshooting should start
- Practice is the only criterion for testing the truth. Before the incident, everyone felt this configuration was reasonable and never expected an OOM; only after something went wrong did the problem become obvious.