Author: scwang18, mainly responsible for technical architecture, with extensive research experience in container cloud.

Foreword

KubeSphere is QingCloud's open-source, Kubernetes-based cloud-native distributed operating system, and it provides a polished Kubernetes cluster management interface. Our team uses KubeSphere as its development platform.

This article records how a network failure in our KubeSphere environment was diagnosed and resolved.

Symptoms

Developers reported that the Harbor registry they had set up kept misbehaving: requests occasionally failed with net/http: TLS handshake timeout, and accessing harbor.xxxx.cn with curl would also hang randomly and frequently. Pinging the host, however, always worked fine.
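The intermittent nature of the failure can be reproduced from a client machine with a simple curl loop; harbor.xxxx.cn is the address from the report above, and /v2/ is just Harbor's registry API endpoint, used here purely for illustration:

 # most requests return quickly, but some hang or time out during the TLS handshake
for i in $(seq 1 20); do
  curl -sk -o /dev/null -m 10 -w "%{http_code} %{time_total}s\n" https://harbor.xxxx.cn/v2/
done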

Cause Analysis

After receiving the report, it took several rounds of analysis to locate the cause: the then-latest Kubernetes 1.23.1 had been used when KubeSphere was installed.

Although ./kk version --show-supported-k8s shows that KubeSphere 3.2.1 supports Kubernetes 1.23.1, that support is in fact only experimental, and there are pitfalls.
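A quick way to check what a given kk binary supports:

 # list the Kubernetes versions this kk release claims to support
./kk version --show-supported-k8s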

The analysis process is as follows:

  1. Since Harbor registry access was failing, the first instinct was that the Harbor deployment itself was broken. However, the Harbor core log contained no error messages corresponding to the failures, and not even any info-level entries.
  2. Attention then turned to the Harbor portal, but its access log showed nothing abnormal either.
  3. Following the access chain, the next stop was kubesphere-router-kubesphere-system, i.e. the NGINX Ingress Controller bundled with KubeSphere. Its logs were also clean.
  4. Accessing Harbor's in-cluster Service address from other Pods in the cluster showed no timeouts at all, so the preliminary judgment was that KubeSphere's bundled Ingress was the problem.
  5. After disabling KubeSphere's bundled Ingress Controller and installing the ingress-nginx-controller version recommended by the Kubernetes community, the fault remained, and the Ingress logs still showed nothing abnormal.
  6. Based on the above, the problem had to lie between the client and the Ingress Controller. My Ingress Controller is exposed outside the cluster through a NodePort, and testing other services exposed through NodePort showed exactly the same fault. At this point a Harbor deployment problem could be ruled out completely, and the issue was narrowed down to the path from the client to the Ingress Controller.
  7. Since external clients reach the Ingress Controller's NodePort via kube-proxy, the kube-proxy logs were examined next (see the log-checking sketch after this list) and contained the following warning:
 can't set sysctl net/ipv4/vs/conn_reuse_mode, kernel version must be at least 4.1
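A minimal sketch of how this warning can be located, assuming kube-proxy runs as a DaemonSet in kube-system labelled k8s-app=kube-proxy (the default in kubeadm-based installs such as this one):

 # dump recent kube-proxy logs from all nodes and grep for the ipvs warning
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=200 | grep -i conn_reuse_mode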

This warning appears because the kernel of my CentOS 7.6 nodes is too old (currently 3.10.0-1160.21.1.el7.x86_64) and is not compatible with the ipvs mode used by newer Kubernetes versions.

It can be solved by upgrading the kernel version of the operating system.
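Before upgrading, both facts are easy to confirm. A minimal check, assuming a kubeadm-style deployment where kube-proxy reads its settings from the kube-proxy ConfigMap:

 # kernel currently running on the node
uname -r
# proxy mode configured for kube-proxy (ipvs in this cluster)
kubectl -n kube-system get configmap kube-proxy -o yaml | grep -w mode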

  8. After upgrading the kernel, Calico failed to start with the following error:

     ipset v7.1: kernel and userspace incompatible: settype hash:ip,port with revision 6 not supported by userspace.

The reason is that the Calico version installed by default with KubeSphere is v3.20.0, which does not support the newest Linux kernels. With the kernel now at 5.18.1-1.el7.elrepo.x86_64, Calico needs to be upgraded to v3.23.0 or later.
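The Calico version actually deployed can be read off the calico-node DaemonSet's image tag, for example:

 # print the image (and thus the version) used by the calico-node DaemonSet
kubectl -n kube-system get ds calico-node -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'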

  9. After upgrading Calico, it kept reporting errors such as:

     user "system:serviceaccount:kube-system:calico-node" cannot list resource "caliconodestatuses" in api group "crd.projectcalico.org"

There were several similar error messages, all caused by insufficient resource permissions in the calico-node ClusterRole; they can be fixed by extending that ClusterRole.
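Whether a particular permission is still missing can be checked with kubectl auth can-i, impersonating Calico's service account; shown here for the resource from the error above:

 # prints "yes" once the ClusterRole grants the permission
kubectl auth can-i list caliconodestatuses.crd.projectcalico.org --as=system:serviceaccount:kube-system:calico-node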

  10. With that, the mysterious network problem was finally solved.

Solution Process

According to the above analysis, the main solutions are as follows:

Upgrade the operating system kernel

  1. Use Alibaba Cloud's yum source
 wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo
yum clean all && yum -y update
  2. Enable the elrepo repository
 rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
  3. Install the latest version of the kernel
 yum --enablerepo=elrepo-kernel install kernel-ml
  4. View all available kernels on the system
 awk -F\' '$1=="menuentry " {print i++ " : " $2}' /etc/grub2.cfg
  5. Set the new kernel as the grub2 default

In the list of available kernels returned in step 4, the first entry (index 0) should be the newly installed kernel, so it is set as the default:

 grub2-set-default 0
  6. Generate the grub configuration file and reboot
 grub2-mkconfig -o /boot/grub2/grub.cfg
reboot now
  7. Verify that the node is running the new kernel (5.18.1-1.el7.elrepo.x86_64 in this environment)
 uname -r

Upgrade Calico

Calico on Kubernetes is typically deployed as a DaemonSet. In my cluster, Calico's DaemonSet is named calico-node.

Export it as a YAML file, change every image version number in the file to the latest version v3.23.1, and recreate the DaemonSet (a sketch of applying the edited manifest follows the YAML below).

  1. Export the DaemonSet as YAML
 kubectl -n kube-system get ds  calico-node -o yaml>calico-node.yaml
  2. calico-node.yaml (all Calico images bumped to v3.23.1):
 apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: calico-node
  name: calico-node
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: calico-node
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: calico-node
    spec:
      containers:
      - env:
        - name: DATASTORE_TYPE
          value: kubernetes
        - name: WAIT_FOR_DATASTORE
          value: "true"
        - name: NODENAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CALICO_NETWORKING_BACKEND
          valueFrom:
            configMapKeyRef:
              key: calico_backend
              name: calico-config
        - name: CLUSTER_TYPE
          value: k8s,bgp
        - name: NODEIP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: IP_AUTODETECTION_METHOD
          value: can-reach=$(NODEIP)
        - name: IP
          value: autodetect
        - name: CALICO_IPV4POOL_IPIP
          value: Always
        - name: CALICO_IPV4POOL_VXLAN
          value: Never
        - name: FELIX_IPINIPMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: FELIX_VXLANMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: FELIX_WIREGUARDMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: CALICO_IPV4POOL_CIDR
          value: 10.233.64.0/18
        - name: CALICO_IPV4POOL_BLOCK_SIZE
          value: "24"
        - name: CALICO_DISABLE_FILE_LOGGING
          value: "true"
        - name: FELIX_DEFAULTENDPOINTTOHOSTACTION
          value: ACCEPT
        - name: FELIX_IPV6SUPPORT
          value: "false"
        - name: FELIX_HEALTHENABLED
          value: "true"
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/node:v3.23.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - /bin/calico-node
            - -felix-live
            - -bird-live
          failureThreshold: 6
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: calico-node
        readinessProbe:
          exec:
            command:
            - /bin/calico-node
            - -felix-ready
            - -bird-ready
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          requests:
            cpu: 250m
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
        - mountPath: /lib/modules
          name: lib-modules
          readOnly: true
        - mountPath: /run/xtables.lock
          name: xtables-lock
        - mountPath: /var/run/calico
          name: var-run-calico
        - mountPath: /var/lib/calico
          name: var-lib-calico
        - mountPath: /var/run/nodeagent
          name: policysync
        - mountPath: /sys/fs/
          mountPropagation: Bidirectional
          name: sysfs
        - mountPath: /var/log/calico/cni
          name: cni-log-dir
          readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      initContainers:
      - command:
        - /opt/cni/bin/calico-ipam
        - -upgrade
        env:
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CALICO_NETWORKING_BACKEND
          valueFrom:
            configMapKeyRef:
              key: calico_backend
              name: calico-config
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/cni:v3.23.1
        imagePullPolicy: IfNotPresent
        name: upgrade-ipam
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/cni/networks
          name: host-local-net-dir
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
      - command:
        - /opt/cni/bin/install
        env:
        - name: CNI_CONF_NAME
          value: 10-calico.conflist
        - name: CNI_NETWORK_CONFIG
          valueFrom:
            configMapKeyRef:
              key: cni_network_config
              name: calico-config
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CNI_MTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: SLEEP
          value: "false"
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/cni:v3.23.1
        imagePullPolicy: IfNotPresent
        name: install-cni
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
      - image: calico/pod2daemon-flexvol:v3.23.1
        imagePullPolicy: IfNotPresent
        name: flexvol-driver
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/driver
          name: flexvol-driver-host
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: calico-node
      serviceAccountName: calico-node
      terminationGracePeriodSeconds: 0
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoExecute
        operator: Exists
      volumes:
      - hostPath:
          path: /lib/modules
          type: ""
        name: lib-modules
      - hostPath:
          path: /var/run/calico
          type: ""
        name: var-run-calico
      - hostPath:
          path: /var/lib/calico
          type: ""
        name: var-lib-calico
      - hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
        name: xtables-lock
      - hostPath:
          path: /sys/fs/
          type: DirectoryOrCreate
        name: sysfs
      - hostPath:
          path: /opt/cni/bin
          type: ""
        name: cni-bin-dir
      - hostPath:
          path: /etc/cni/net.d
          type: ""
        name: cni-net-dir
      - hostPath:
          path: /var/log/calico/cni
          type: ""
        name: cni-log-dir
      - hostPath:
          path: /var/lib/cni/networks
          type: ""
        name: host-local-net-dir
      - hostPath:
          path: /var/run/nodeagent
          type: DirectoryOrCreate
        name: policysync
      - hostPath:
          path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
          type: DirectoryOrCreate
        name: flexvol-driver-host
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
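Since only the image tags change, the edit itself can be done with a simple substitution (v3.20.0 being the version originally installed, as noted above); the manifest is then re-applied and the rollout watched. A minimal sketch:

 # bump every Calico image from the old tag to the new one in the exported manifest
sed -i 's/v3.20.0/v3.23.1/g' calico-node.yaml
# apply the updated DaemonSet and wait for the calico-node pods to roll
kubectl apply -f calico-node.yaml
kubectl -n kube-system rollout status ds/calico-node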

ClusterRole

The calico-node ClusterRole also needs to be updated, otherwise Calico will keep reporting permission errors.

  1. Export the ClusterRole as YAML
 kubectl get clusterrole calico-node -o yaml >calico-node-clusterrole.yaml
  2. calico-node-clusterrole.yaml (with the missing crd.projectcalico.org resources, such as caliconodestatuses, added):
 apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: calico-node
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - namespaces
  verbs:
  - get
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - endpoints
  - services
  verbs:
  - watch
  - list
  - get
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
  - update
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  - serviceaccounts
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - patch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - globalfelixconfigs
  - felixconfigurations
  - bgppeers
  - globalbgpconfigs
  - bgpconfigurations
  - ippools
  - ipamblocks
  - globalnetworkpolicies
  - globalnetworksets
  - networkpolicies
  - networksets
  - clusterinformations
  - hostendpoints
  - blockaffinities
  - caliconodestatuses
  - ipreservations
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ippools
  - felixconfigurations
  - clusterinformations
  verbs:
  - create
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - bgpconfigurations
  - bgppeers
  verbs:
  - create
  - update
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  - ipamblocks
  - ipamhandles
  verbs:
  - get
  - list
  - create
  - update
  - delete
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ipamconfigs
  verbs:
  - get
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  verbs:
  - watch
- apiGroups:
  - apps
  resources:
  - daemonsets
  verbs:
  - get
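Apply the updated ClusterRole and restart the Calico pods so they pick up the new permissions; a minimal sketch:

 kubectl apply -f calico-node-clusterrole.yaml
# restart calico-node so the permission errors clear
kubectl -n kube-system rollout restart ds/calico-node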

Summary

This strange network failure was ultimately caused by a mismatch between the KubeSphere and Kubernetes versions. For a working environment, stability should come first: do not rashly adopt the latest version, or you may spend a lot of time chasing inexplicable problems.
