Background
An application running in a project's Kubernetes environment was restarting roughly every 20 minutes. The interval was not fixed, and the container events contained no record of the restarts.
Troubleshooting
The first suspicion was that a failing health check was causing the container to restart automatically, so the health check configuration was examined first.
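The probe was configured with a 30-second check interval and an unhealthy threshold of 60; a minimal livenessProbe sketch along those lines (the endpoint path and port are assumptions) looks like this:

```yaml
livenessProbe:
  httpGet:
    path: /health        # assumed health endpoint
    port: 8080           # assumed application port
  periodSeconds: 30      # check interval: 30s
  failureThreshold: 60   # marked unhealthy after 60 consecutive failures
```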
According to this configuration, a check runs every 30s and the container is only marked unhealthy after 60 consecutive failures, i.e. after 30 × 60 = 1800s (30 minutes). However, this guess was quickly ruled out, for the following reasons:
- Calling the health check endpoint manually returned a healthy result
- The health check records in the background showed that every check had passed
- A failed health check would leave a record in the Kubernetes events, and no such record could be found (see the check below)
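To confirm that there really were no probe-failure events, something like the following can be used (the pod name and namespace are placeholders):

```sh
# Liveness probe failures show up as "Unhealthy" events on the pod
kubectl describe pod <pod-name> -n <namespace>

# Or query the events directly
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
```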
Normally, when Kubernetes restarts a pod it records the reason in the events. Since there was no event record here, the restart was probably not initiated by Kubernetes itself. That leads to an obvious possibility: OOM, i.e. the process exceeded its memory limit and was killed by the operating system's OOM killer. The next step was to check the operating system log /var/log/messages.
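One way to search for OOM killer activity is to grep the system log or the kernel ring buffer (the log path differs between distributions; /var/log/messages is what applies here):

```sh
# Look for OOM killer entries in the system log
grep -iE "oom|killed process" /var/log/messages

# Or check the kernel ring buffer with readable timestamps
dmesg -T | grep -iE "oom|killed process"
```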
The following entries were found:
```
May 11 15:01:36 k8s-node-prod-3 kernel: [17094.020710] oom_kill_process.cold.30+0xb/0x1cf
May 11 15:01:36 k8s-node-prod-3 kernel: [17094.020788] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
May 11 15:01:36 k8s-node-prod-3 kernel: [17094.109707] oom_reaper: reaped process 25929 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
May 11 15:29:12 k8s-node-prod-3 kernel: [18750.337581] java invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=969
May 11 15:29:12 k8s-node-prod-3 kernel: [18750.337621] oom_kill_process.cold.30+0xb/0x1cf
May 11 15:29:12 k8s-node-prod-3 kernel: [18750.337709] [ pid ] uid tgid total_
```
Sure enough, there were OOM records, and the timestamps of several kill events lined up with the application restart times. Checking the oom_score of the other Java processes showed that they were also at high risk of being killed. So the culprit was the OOM killer. But why was OOM being triggered at all? The operating system still had plenty of free memory at the time. Note that the heap size configured in the container application's startup parameters is 4096M:
```sh
JAVA_OPTS="$JAVA_OPTS -server -Xms4096M -Xmx4096M -Xss512k ..."
```
The memory limit given to the container was 4500M, which is like running a program that needs 4096M on a machine that only has 4500M. The 4096M heap alone leaves only about 400M of headroom for everything else the JVM needs outside the heap (metaspace, thread stacks at -Xss512k each, code cache, direct buffers), so the process occupies a very large share of the available memory and is naturally an easy target for the OOM killer.
The OOM killer is a Linux kernel feature: it looks for processes that are using too much memory and kills them. The kernel scores each process based on its memory usage; the score is stored in /proc/{process PID}/oom_score, and the higher the score, the more likely the process is to be killed.
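As a quick way to inspect these scores for the Java processes on a node (a minimal sketch, assuming pgrep is available):

```sh
# Print the OOM score and adjustment for every Java process
for pid in $(pgrep java); do
  echo "pid=$pid oom_score=$(cat /proc/$pid/oom_score) oom_score_adj=$(cat /proc/$pid/oom_score_adj)"
done
```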
Solution
Since the memory is already limited by the startup command (-Xms/-Xmx), there is no need to limit it again in Kubernetes, so the memory limit on the Kubernetes side was removed.
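In deployment terms, this amounts to dropping the memory limit from the container's resources block; the sketch below uses the values mentioned above, and keeping a request for scheduling purposes is an assumption:

```yaml
resources:
  requests:
    memory: "4500M"   # kept for scheduling (assumption)
  # limits.memory removed: the JVM's -Xmx now acts as the effective cap
```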
Summary of experience
Although the problem itself was small, its impact was large: the client is a multinational group whose business spans multiple countries, and the transaction volume over a full day is very high. The following lessons can be drawn from this problem:
- If Kubernetes has no event record for a restart, the cause is likely at the operating system level, and that is where troubleshooting should start
- Practice is the only criterion for testing the truth. Before the incident, everyone felt this configuration was reasonable and never expected an OOM; only after something went wrong did the problem become obvious.