Author: Zheng Zengquan
Database engineer in Aikesheng's South District, member of the Aikesheng DBA team, responsible for database-related technical support. Hobbies: billiards, badminton, coffee, and movies.
Source of this article: original submission
* Produced by the Aikesheng open source community. Original content may not be used without authorization; for reprints, please contact the editor and cite the source.
1. Summary of this article and main terms
1.1 Overview
This article is organized around three modules: Pod, Service, and Ingress. For failures that commonly occur in day-to-day Kubernetes operation, it provides concrete troubleshooting steps and attaches related solutions or references.
1.2 Main terms
- Pod: the smallest deployable computing unit created and managed in Kubernetes; a group of one or more containers that share storage and network, together with a specification of how to run those containers (a minimal manifest sketch follows this list).
- Port-forward: maps a local port to a specified port of an application in the cluster through port forwarding.
- Service: a Kubernetes Service is an abstraction that defines a logical set of Pods and a policy for accessing them (sometimes called a microservice).
- Ingress: routes HTTP and HTTPS traffic from outside the cluster to Services inside it; traffic routing is controlled by rules defined on the Ingress resource.
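For orientation, here is a minimal Pod manifest; the name, labels, and image are illustrative examples, not taken from a specific deployment in this article. Apply it with kubectl apply -f pod.yaml.
# pod.yaml -- minimal illustrative Pod: one container sharing the pod's network namespace
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  labels:
    app: myapp
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v2   # example image; replace with your own
    ports:
    - containerPort: 80           # port the container listens on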
2. Fault diagnosis process
2.1 Pods module check
- If a step below succeeds, proceed to the next step; if it fails, jump to the section indicated.
2.1.1 Check if any pod is in PENDING state
- kubectl get pods: if any pod is in the PENDING state, continue below; otherwise go to 2.1.5.
[root@10-186-65-37 ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
myapp-deploy-55b54d55b8-5msx8 0/1 Pending 0 5m
- kubectl describe pod <pod-name>: outputs detailed information about the specified resource(s). From it, judge whether cluster resources are insufficient; if so, expand the cluster, otherwise go to 2.1.2.
2.1.2 Check whether the ResourceQuota limit is triggered
- kubectl describe resourcequota -n <namespace>:
[root@10-186-65-37 ~]# kubectl describe quota compute-resources --namespace=myspace
Name: compute-resources
Namespace: myspace
Resource Used Hard
-------- ---- ----
limits.cpu 0 2
limits.memory 0 2Gi
pods 0 4
requests.cpu 0 1
requests.memory 0 1Gi
- If a limit has been triggered, release or enlarge the corresponding quota (a sketch follows this list); refer to:
https://kubernetes.io/zh/docs/concepts/configuration/manage-resources-containers/#extended-resources
- Otherwise go to 2.1.3
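If a Used value has reached its Hard limit, the quota is exhausted. As a hedged sketch, the compute-resources quota from the output above could be enlarged by raising its hard limits (the new values below are examples) and re-applying the manifest with kubectl apply -f:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
  namespace: myspace
spec:
  hard:
    pods: "8"                # raised from 4; example value
    requests.cpu: "2"        # raised from 1
    requests.memory: 2Gi     # raised from 1Gi
    limits.cpu: "4"          # raised from 2
    limits.memory: 4Gi       # raised from 2Gi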
2.1.3 Check whether any PVC is in PENDING state
- A PersistentVolume (PV) is a piece of storage in the cluster, provisioned by the administrator in advance or dynamically through a StorageClass; a PersistentVolumeClaim (PVC) expresses a user's request for storage.
kubectl describe pvc <pvc-name>:
If STATUS is Pending:
[root@10-186-65-37 k8s-file]# kubectl get pvc
NAME               STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
local-device-pvc   Pending                                      local-device   72s
Then refer to the following link to solve it (a minimal PV/PVC sketch follows this list):
https://kubernetes.io/zh/docs/concepts/storage/persistent-volumes/
- Otherwise go to 2.1.4
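For reference, a minimal sketch of a statically provisioned PV with a matching PVC, assuming a hostPath test volume; the capacity, path, and storageClassName are illustrative. The PVC stays Pending until a PV satisfies its accessModes, storageClassName, and requested capacity:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-device-pv
spec:
  capacity:
    storage: 5Gi                  # must cover the PVC's request
  accessModes:
    - ReadWriteOnce
  storageClassName: local-device  # must match the PVC's storageClassName
  hostPath:
    path: /data/local-device      # example path for a test cluster
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-device-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-device
  resources:
    requests:
      storage: 5Gi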
2.1.4 Check whether the pod is assigned to the node
- kubectl get pods -o wide:
Check the NODE column; in the following output, every pod has been assigned to a node:
[root@10-186-65-37 ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
myapp-deploy-55b54d55b8-5msx8 1/1 Running 0 14d 10.244.4.9 10-186-65-122 <none> <none>
myapp-deploy-55b54d55b8-7ldj4 1/1 Running 0 14d 10.244.2.10 10-186-65-126 <none> <none>
myapp-deploy-55b54d55b8-cwdwt 1/1 Running 0 14d 10.244.3.9 10-186-65-126 <none> <none>
myapp-deploy-55b54d55b8-gvmb9 1/1 Running 0 14d 10.244.4.10 10-186-65-122 <none> <none>
myapp-deploy-55b54d55b8-xbqb6 1/1 Running 0 14d 10.244.5.9 10-186-65-118 <none> <none>
- If the pod has not been assigned to a node, it is a problem with the Scheduler; please refer to the following link to solve it:
https://kubernetes.io/zh/docs/concepts/scheduling-eviction/kube-scheduler/
- Otherwise (the pod is assigned, as in the output above), it is a problem with the kubelet.
2.1.5 Check whether pods are in the RUNNING state
- kubectl get pods -o wide:
If the pods are in the RUNNING state, go to 2.1.10; otherwise go to 2.1.6.
2.1.6 Check pod log
- kubectl logs <pod-name>:
If the log can be obtained correctly, fix the related problems according to the log.
[root@10-186-65-37 ~]# kubectl logs myapp-deploy-55b54d55b8-5msx8
127.0.0.1 - - [30/Sep/2021:06:53:16 +0000] "GET / HTTP/1.1" 200 65 "-" "curl/7.29.0" "-"
127.0.0.1 - - [30/Sep/2021:07:49:44 +0000] "GET / HTTP/1.1" 200 65 "-" "curl/7.29.0" "-"
127.0.0.1 - - [30/Sep/2021:07:51:09 +0000] "GET / HTTP/1.1" 200 65 "-" "curl/7.29.0" "-"
127.0.0.1 - - [30/Sep/2021:07:57:00 +0000] "GET / HTTP/1.1" 200 65 "-" "curl/7.29.0" "-"
127.0.0.1 - - [30/Sep/2021:08:03:56 +0000] "GET / HTTP/1.1" 200 65 "-" "curl/7.29.0" "-"
- If the log cannot be obtained, determine whether the container exits too quickly; if it does, fetch the previous container's log with:
kubectl logs <pod-name> --previous
- If the log still cannot be obtained and the container is not exiting quickly, go to 2.1.7.
2.1.7 Whether the Pod status is ImagePullBackOff
- kubectl describe pod <pod-name>:
Check whether the status is ImagePullBackOff; if it is not, go to 2.1.8.
- Check whether the image name is correct, and correct any error.
- Check whether the image tag exists and has been verified.
- Is the image pulled from a private registry? If so, confirm that the pull credentials are configured correctly (a sketch follows this list).
- If the image is not pulled from a private registry, the problem may lie with the CRI (Container Runtime Interface) or the kubelet.
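For the private-registry case, a hedged sketch: create a docker-registry Secret and reference it from the pod's imagePullSecrets. The registry address, secret name, and credentials below are placeholders:
kubectl create secret docker-registry my-registry-key \
  --docker-server=registry.example.com \
  --docker-username=<user> --docker-password=<password>
Then reference the Secret in the pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: private-image-pod
spec:
  imagePullSecrets:
  - name: my-registry-key                  # Secret created above
  containers:
  - name: myapp
    image: registry.example.com/myapp:v2   # placeholder private image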
2.1.8 Whether the Pod status is CrashLoopBackOff
- kubectl describe pod <pod-name>:
Check whether the status is CrashLoopBackOff; if it is not, go to 2.1.9.
- If so, check the log and fix the application crash.
- Is the CMD instruction missing from the Dockerfile? Check with:
docker history <image-id> (add --no-trunc to display the complete output)
[root@10-186-65-37 ~]# docker history fb4cca6b4e4c
IMAGE CREATED CREATED BY SIZE COMMENT
fb4cca6b4e4c 22 months ago /bin/sh -c #(nop) COPY file:957630e64c05c549… 121MB
<missing> 2 years ago /bin/sh -c #(nop) CMD ["/bin/sh"] 0B
<missing> 2 years ago /bin/sh -c #(nop) ADD file:1d711f09b1bbc7c8d… 42.3MB
- Does the pod restart frequently, switching between Running and CrashLoopBackOff? If so, you need to fix the liveness probe; see the sketch after this list and refer to the following link:
https://kubernetes.io/zh/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
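As a sketch of the fix, a liveness probe that gives the application time to start and tolerates transient failures might look like the following; the path, port, and timings are assumptions to adapt to your application:
apiVersion: v1
kind: Pod
metadata:
  name: myapp-liveness
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v2   # example image
    livenessProbe:
      httpGet:
        path: /                   # health-check endpoint
        port: 80
      initialDelaySeconds: 15     # let the app finish starting before the first probe
      periodSeconds: 10
      failureThreshold: 3         # restart only after 3 consecutive failures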
2.1.9 Whether the Pod state is in RunContainerError
- kubectl describe pod <pod-name>:
Check whether the status is RunContainerError.
- If so, the problem may be caused by mounting a volume; see the sketch after this list and refer to the following link:
https://kubernetes.io/zh/docs/concepts/storage/volumes/
- Otherwise, please ask for help on sites such as StackOverflow.
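A common volume mistake is a volumeMounts entry that references a name not declared under volumes, or a claim that does not exist. A minimal correct pairing for illustration (names and mount path are examples):
apiVersion: v1
kind: Pod
metadata:
  name: volume-demo
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v2          # example image
    volumeMounts:
    - name: data                         # must match a volume name below
      mountPath: /usr/share/nginx/html   # example mount path
  volumes:
  - name: data
    emptyDir: {}                         # simplest volume type for a sanity check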
2.1.10 Check if pods are in READY state
If pods are in the READY state, continue with the port mapping below.
[root@10-186-65-37 ~]# kubectl get pods
NAME                            READY   STATUS    RESTARTS   AGE
myapp-deploy-55b54d55b8-5msx8   1/1     Running   0          14d
myapp-deploy-55b54d55b8-7ldj4   1/1     Running   0          14d
If there are no pods in the READY state, go to 2.1.11.
- kubectl port-forward <pod-name> 8080:<pod-port>
- If the mapping succeeds, go to 2.2. Example:
a) Mapping
[root@10-186-65-37 ~]# kubectl port-forward myapp-deploy-55b54d55b8-5msx8 8080:80
Forwarding from 127.0.0.1:8080 -> 80
Forwarding from [::1]:8080 -> 80
b) Verify that the mapping is successful
[root@10-186-65-37 ~]# curl localhost:8080
Hello MyApp | Version: v2 | <a href="hostname.html">Pod Name</a>
- If it fails, confirm that the application can listen on all addresses; the corresponding port-forward command is:
kubectl port-forward --address 0.0.0.0 <pod-name> 8080:<pod-port>
If the application cannot listen on all addresses, this is an Unknown state.
2.1.11 Check Readiness (ready detector)
- kubectl describe pod <pod-name>
- If the output is normal, fix the corresponding problem according to the events and logs; a readiness probe sketch follows, and refer to the following link:
https://kubernetes.io/zh/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
- If it fails, this is an Unknown state.
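For reference, a readiness probe sketch; the endpoint, port, and timings are assumptions. Unlike the liveness probe in 2.1.8, a failing readiness probe does not restart the container; it only keeps the pod out of Service endpoints:
apiVersion: v1
kind: Pod
metadata:
  name: myapp-readiness
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v2   # example image
    readinessProbe:
      httpGet:
        path: /                   # endpoint that returns 200 only when ready
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5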
2.2 Service module check
2.2.1 Service current status check
- kubectl describe service <service-name>
The successful output is as follows:
[root@10-186-65-37 ~]# kubectl describe service myapp
Name: myapp
Namespace: default
Labels: <none>
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"name":"myapp","namespace":"default"},"spec":{"ports":[{"name":"http","po...
Selector: app=myapp,release=canary
Type: ClusterIP
IP: 10.96.109.76
Port: http 80/TCP
TargetPort: 80/TCP
Endpoints: 10.244.2.10:80,10.244.3.9:80,10.244.4.10:80 + 2 more...
Session Affinity: None
Events: <none>
- Can you see the Endpoints column with normal output? If the output is abnormal, go to 2.2.2. If it is normal, test the mapping:
kubectl port-forward service/<service-name> 8080:<service-port>
The successful output is as follows:
[root@10-186-65-37 ~]# kubectl port-forward service/myapp 8080:80
Forwarding from 127.0.0.1:8080 -> 80
Forwarding from [::1]:8080 -> 80
- If this succeeds, go to 2.3; if it fails, go to 2.2.4.
2.2.2 Selector and Pod label comparison
- View the label information of the pod
kubectl describe pod <pod-name>
[root@10-186-65-37 ~]# kubectl describe pod myapp-deploy-55b54d55b8-5msx8 | grep -i label -A 2
Labels: app=myapp
pod-template-hash=55b54d55b8
release=canary
- View the selector information of the service:
kubectl describe service <service-name>
[root@10-186-65-37 ~]# kubectl describe service myapp | grep -i selector
Selector:          app=myapp,release=canary
- Check whether the two match; if they do not, correct the mismatch, and if they already match, go to 2.2.3. A quick check is shown below.
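A quick way to verify the match is to list pods using the Service's selector directly; an empty result means the selector matches no pod labels (the label values here come from the example above):
kubectl get pods -l app=myapp,release=canary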
2.2.3 Check whether the Pod has been assigned an IP
- View the pod's IP information:
kubectl describe pod <pod-name>
[root@10-186-65-37 ~]# kubectl describe pod myapp-deploy-55b54d55b8-5msx8 | grep -i 'ip'
IP:            10.244.4.9
IPs:
  IP:          10.244.4.9
- If the IP has been allocated correctly, the problem is caused by the kubelet.
- If the IP is not assigned, the problem is caused by the Controller Manager.
2.2.4 Check Service TargetPort and Pod ContainerPort
- View the TargetPort information of the service:
kubectl describe service <service-name>
[root@10-186-65-37 ~]# kubectl describe service myapp | grep -i targetport
TargetPort: 80/TCP
View the ContainerPort information of the pod:
kubectl describe pod <pod-name>
[root@10-186-65-37 ~]# kubectl describe pod myapp-deploy-55b54d55b8-5msx8 | grep -i port
Port:          80/TCP
Host Port:     0/TCP
- If the two are consistent, the problem is caused by kube-proxy; if they are inconsistent, correct the port information. The constraint is sketched below.
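In manifest form, the constraint looks like the following sketch, reusing port 80 from the outputs above (names and image are illustrative): the Service's targetPort must equal the containerPort declared by the pod.
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    release: canary
  ports:
  - name: http
    port: 80            # port the Service exposes
    targetPort: 80      # must match containerPort below
---
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  labels:
    app: myapp
    release: canary
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v2   # example image
    ports:
    - containerPort: 80           # must match targetPort above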
2.3 Ingress module check
2.3.1 Ingress current status check
- kubectl describe ingress <ingress-name>
The successful output is as follows:
[root@10-186-65-37 ~]# kubectl describe ingress ingress-tomcat-tls
Name:             ingress-tomcat-tls
Namespace:        default
Address:
Default backend:  default-http-backend:80 (<none>)
TLS:
  tomcat-ingress-secret terminates tomcat.quan.com
Rules:
  Host             Path  Backends
  ----             ----  --------
  tomcat.quan.com        tomcat:8080 (10.244.2.11:8080,10.244.4.11:8080,10.244.5.10:8080)
Annotations:      kubectl.kubernetes.io/last-applied-configuration:
                    {"apiVersion":"extensions/v1beta1","kind":"Ingress","metadata":{"annotations":{"kubernets.io/ingress.class":"nginx"},"name":"ingress-tomcat-tls","namespace":"default"},"spec":{"rules":[{"host":"tomcat.quan.com","http":{"paths":[{"backend":{"serviceName":"tomcat","servicePort":8080},"path":null}]}}],"tls":[{"hosts":["tomcat.quan.com"],"secretName":"tomcat-ingress-secret"}]}}
                  kubernets.io/ingress.class: nginx
Events:           <none>
- Can you see the Backends column with normal output? If so, go to 2.3.4; otherwise go to 2.3.2.
2.3.2 Check ServiceName and ServicePort
- kubectl describe ingress <ingress-name>
- kubectl describe service <service-name>
[root@10-186-65-37 ~]# kubectl describe ingress ingress-tomcat-tls | grep -E 'serviceName|servicePort'
kubectl.kubernetes.io/last-applied-configuration: {"apiVersion":"extensions/v1beta1","kind":"Ingress","metadata":{"annotations":{"kubernets.io/ingress.class":"nginx"},"name":"ingress-tomcat-tls","namespace":"default"},"spec":{"rules":[{"host":"tomcat.quan.com","http":{"paths":[{"backend":{"serviceName":"tomcat","servicePort":8080},"path":null}]}}],"tls":[{"hosts":["tomcat.quan.com"],"secretName":"tomcat-ingress-secret"}]}}
- Check whether the serviceName and servicePort in the two outputs above match and are written correctly; if they are correct, go to 2.3.3, otherwise correct the errors. A manifest sketch follows.
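For reference, a sketch of the Ingress manifest implied by the output above, using the extensions/v1beta1 schema shown there; on newer clusters the API is networking.k8s.io/v1 and the backend is written as service.name / service.port.number:
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: ingress-tomcat-tls
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  tls:
  - hosts:
    - tomcat.quan.com
    secretName: tomcat-ingress-secret
  rules:
  - host: tomcat.quan.com
    http:
      paths:
      - backend:
          serviceName: tomcat   # must match the Service's name
          servicePort: 8080     # must match the Service's port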
2.3.3 Ingress controller document
- The problem is caused by the Ingress controller, please refer to the documentation for a solution:
https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/
2.3.4 Check port-forward ingress
- kubectl port-forward <ingress-pod-name> 8080:<ingress-port>
Test whether it can be accessed normally: curl localhost:8080
- If access is normal, go to 2.3.5; otherwise go to 2.3.3.
2.3.5 Check whether you can access through Ingress on the external network
- If it can be accessed successfully from the external network, the troubleshooting is over.
- If it cannot be accessed from the external network, the problem lies with the infrastructure or with how the cluster is exposed; troubleshoot accordingly.