Sub-problem: analyzing the cause of a k8s pod OOMKilled
How can we tell whether a Kubernetes OOMKilled happened because the node ran out of memory and killed the pod, or because the pod itself used more memory than the limit declared in its spec?
Can this be determined directly from kubectl describe?
From the output of the describe subcommand below, we can only see OOMKilled; it does not show whether the OOMKilled was caused by the pod itself or by something external to it.
─➤ kb describe -n xxxxx pod image-vector-api-server-prod-5fffcd4884-j9447
Name: image-vector-api-server-prod-5fffcd4884-j9447
Namespace: mediawise
Priority: 0
Service Account: default
Node: cn-hangzhou.xxxxx/xxxx
Start Time: Wed, 01 Nov 2023 17:25:54 +0800
Labels: app=image-vector-api
pod-template-hash=5fffcd4884
Annotations: kubernetes.io/psp: ack.privileged
Status: Running
IP: xxxxx
IPs:
IP: xxxxx
Controlled By: ReplicaSet/image-vector-api-server-prod-5fffcd4884
Containers:
image-vector-api:
Container ID: docker://78dc88a880d769d5cb4a553672d8a4b4a0b69b720fcbf9380096a77d279c5645
Image: registry-vpc.cn-xxxx.xxxx.com/xxx-cn/image-vector:master-xxxxxx
Image ID: docker-pullable://registry-vpc.cn-hangzhou.aliyuncs.com/xxx-cn/image-vector@sha256:058c43265845a975d7cc537911ddcc203fa26f608714fe8b388d5dfd1eb02d92
Port: 9205/TCP
Host Port: 0/TCP
Command:
python
api.py
State: Running
Started: Wed, 01 Nov 2023 18:35:49 +0800
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Wed, 01 Nov 2023 18:25:34 +0800
Finished: Wed, 01 Nov 2023 18:35:47 +0800
Ready: True
Restart Count: 8
Limits:
cpu: 2
memory: 2000Mi
Requests:
cpu: 10m
memory: 1000Mi
Liveness: http-get http://:9205/ delay=60s timeout=1s period=30s #success=1 #failure=3
Readiness: http-get http://:9205/ delay=60s timeout=1s period=30s #success=1 #failure=3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2kwj9 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-2kwj9:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
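The same fields are also available in the raw pod status; this is just the structured form of what describe prints, and it still only says OOMKilled with exit code 137 (the pod name and namespace below are the ones from this example):

kubectl get pod -n mediawise image-vector-api-server-prod-5fffcd4884-j9447 \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'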
I want to analyze this kind of problem without introducing external tools (such as Prometheus).
I asked this question on the Kubernetes forum: https://discuss.kubernetes.io/t/how-can-we-tell-if-the-oomkil...
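Without extra tooling, there are still a couple of node-side traces worth checking. A rough sketch (the event reason and kernel-log wording below are what I would expect on a typical Linux node; exact messages vary with the kubelet and kernel version, and reading the kernel log requires access to the node):

# The kubelet records a SystemOOM event on the node when it observes a kernel-level
# OOM kill on the host; a container that merely exceeded its own memory limit does
# not normally produce this node event.
kubectl get events -A --field-selector reason=SystemOOM
kubectl describe node cn-hangzhou.xxxxx        # node events are also listed here

# On the node itself, the kernel log tells the two cases apart:
#   "Memory cgroup out of memory: Killed process ..."  -> container hit its own limit
#   "Out of memory: Killed process ..."                -> node-wide memory pressure
journalctl -k | grep -iE "out of memory|oom"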
A solution I used in the past: add monitoring to the k8s cluster so that you can see the pod's memory usage curve and the node's total memory usage curve. When a pod dies, check roughly how much memory each was using at the time of the crash, and you basically have your answer. You can also add alerting on top of the monitoring: when memory usage reaches a certain threshold, send an email, or a text message (if you go through an SMS API).
I say "in the past" because these days I no longer manage k8s.
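As a minimal illustration of that approach, a simple polling loop can record pod and node memory and fire an alert near the limit. This is only a sketch: it assumes metrics-server is installed (kubectl top depends on it), that memory is reported in Mi, and the namespace, threshold, log path, and mail address are placeholders.

#!/usr/bin/env bash
# Sketch: poll pod and node memory via kubectl top and alert above a threshold.
# Assumes metrics-server is installed and memory is reported in Mi.
NAMESPACE=mediawise
THRESHOLD_MI=1800   # warn before the 2000Mi container limit is reached

while true; do
  kubectl top pod -n "$NAMESPACE" --no-headers | while read -r pod cpu mem; do
    mem_mi=${mem%Mi}
    if [ "${mem_mi:-0}" -ge "$THRESHOLD_MI" ]; then
      echo "$(date) ${pod} is using ${mem}" | mail -s "pod memory alert" ops@example.com
    fi
  done
  # Node-side curve: append to a log so the numbers at crash time can be compared.
  kubectl top node >> /var/log/node-mem.log
  sleep 60
done

After a crash, comparing the last pod reading against its 2000Mi limit and the last node reading against the node's capacity is usually enough to tell which side ran out.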