Author: scwang18, mainly responsible for technical architecture, with extensive research experience in container cloud.

Foreword

KubeSphere is QingCloud's open-source, Kubernetes-based cloud-native distributed operating system, and it provides a polished Kubernetes cluster management interface. Our team uses KubeSphere as its development platform.

This article records how a network failure in our KubeSphere environment was diagnosed and resolved.

Symptoms

Developers reported that the Harbor registry they had set up kept misbehaving: requests occasionally failed with net/http: TLS handshake timeout, and accessing harbor.xxxx.cn with curl would also hang randomly and frequently. Pinging the host, however, always worked fine.
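The intermittent nature of the failure can be reproduced from a client machine with a simple curl loop; harbor.xxxx.cn is the address from the report above, and /v2/ is just Harbor's registry API endpoint, used here purely for illustration:

 # most requests return quickly, but some hang or time out during the TLS handshake
for i in $(seq 1 20); do
  curl -sk -o /dev/null -m 10 -w "%{http_code} %{time_total}s\n" https://harbor.xxxx.cn/v2/
done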

Cause Analysis

After receiving the report, it took several rounds of analysis to locate the cause: the then-latest Kubernetes 1.23.1 had been used when KubeSphere was installed.

Although ./kk version --show-supported-k8s shows that KubeSphere 3.2.1 supports Kubernetes 1.23.1, that support is in fact only experimental, and there are pitfalls.
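A quick way to check what a given kk binary supports:

 # list the Kubernetes versions this kk release claims to support
./kk version --show-supported-k8s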

The analysis process is as follows:

  1. Since Harbor registry access was failing, the first instinct was that the Harbor deployment itself was broken. However, the Harbor core log contained no error messages corresponding to the failures, and not even any info-level entries.
  2. Attention then turned to the Harbor portal, but its access log showed nothing abnormal either.
  3. Following the access chain, the next stop was kubesphere-router-kubesphere-system, i.e. the NGINX Ingress Controller bundled with KubeSphere. Its logs were also clean.
  4. Accessing Harbor's in-cluster Service address from other Pods in the cluster showed no timeouts at all, so the preliminary judgment was that KubeSphere's bundled Ingress was the problem.
  5. After disabling KubeSphere's bundled Ingress Controller and installing the ingress-nginx-controller version recommended by the Kubernetes community, the fault remained, and the Ingress logs still showed nothing abnormal.
  6. Based on the above, the problem had to lie between the client and the Ingress Controller. My Ingress Controller is exposed outside the cluster through a NodePort, and testing other services exposed through NodePort showed exactly the same fault. At this point a Harbor deployment problem could be ruled out completely, and the issue was narrowed down to the path from the client to the Ingress Controller.
  7. Since external clients reach the Ingress Controller's NodePort via kube-proxy, the kube-proxy logs were examined next (see the log-checking sketch after this list) and contained the following warning:
 can't set sysctl net/ipv4/vs/conn_reuse_mode, kernel version must be at least 4.1
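A minimal sketch of how this warning can be located, assuming kube-proxy runs as a DaemonSet in kube-system labelled k8s-app=kube-proxy (the default in kubeadm-based installs such as this one):

 # dump recent kube-proxy logs from all nodes and grep for the ipvs warning
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=200 | grep -i conn_reuse_mode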

This warning appears because the kernel of my CentOS 7.6 nodes is too old (currently 3.10.0-1160.21.1.el7.x86_64) and is not compatible with the ipvs mode used by newer Kubernetes versions.

It can be solved by upgrading the kernel version of the operating system.
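Before upgrading, both facts are easy to confirm. A minimal check, assuming a kubeadm-style deployment where kube-proxy reads its settings from the kube-proxy ConfigMap:

 # kernel currently running on the node
uname -r
# proxy mode configured for kube-proxy (ipvs in this cluster)
kubectl -n kube-system get configmap kube-proxy -o yaml | grep -w mode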

  8. After upgrading the kernel, Calico failed to start with the following error:

     ipset v7.1: kernel and userspace incompatible: settype hash:ip,port with revision 6 not supported by userspace.

The reason is that the Calico version installed by default with KubeSphere is v3.20.0, which does not support the newest Linux kernels. With the kernel now at 5.18.1-1.el7.elrepo.x86_64, Calico needs to be upgraded to v3.23.0 or later.
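The Calico version actually deployed can be read off the calico-node DaemonSet's image tag, for example:

 # print the image (and thus the version) used by the calico-node DaemonSet
kubectl -n kube-system get ds calico-node -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'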

  9. After upgrading Calico, it kept reporting errors such as:

     user "system:serviceaccount:kube-system:calico-node" cannot list resource "caliconodestatuses" in api group "crd.projectcalico.org"

There were several similar error messages, all caused by insufficient resource permissions in the calico-node ClusterRole; they can be fixed by extending that ClusterRole.
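Whether a particular permission is still missing can be checked with kubectl auth can-i, impersonating Calico's service account; shown here for the resource from the error above:

 # prints "yes" once the ClusterRole grants the permission
kubectl auth can-i list caliconodestatuses.crd.projectcalico.org --as=system:serviceaccount:kube-system:calico-node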

  10. With that, the mysterious network problem was finally solved.

Solution Process

According to the above analysis, the main solutions are as follows:

Upgrade the operating system kernel

  1. Use Alibaba Cloud's yum source
 wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo
yum clean all && yum -y update
  2. Enable the elrepo repository
 rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
  3. Install the latest version of the kernel
 yum --enablerepo=elrepo-kernel install kernel-ml
  4. View all available kernels on the system
 awk -F\' '$1=="menuentry " {print i++ " : " $2}' /etc/grub2.cfg
  5. Set the new kernel as the grub2 default

In the list of available kernels returned in step 4, the first entry (index 0) should be the newly installed kernel, so it is set as the default:

 grub2-set-default 0
  6. Generate the grub configuration file and reboot
 grub2-mkconfig -o /boot/grub2/grub.cfg
reboot now
  7. Verify that the node is running the new kernel (5.18.1-1.el7.elrepo.x86_64 in this environment)
 uname -r

Upgrade Calico

Calico on Kubernetes is typically deployed as a DaemonSet. In my cluster, Calico's DaemonSet is named calico-node.

Export it as a YAML file, change every image version number in the file to the latest version v3.23.1, and recreate the DaemonSet (a sketch of applying the edited manifest follows the YAML below).

  1. Export the DaemonSet as YAML
 kubectl -n kube-system get ds  calico-node -o yaml>calico-node.yaml
  2. calico-node.yaml (all Calico images bumped to v3.23.1):
 apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: calico-node
  name: calico-node
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: calico-node
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: calico-node
    spec:
      containers:
      - env:
        - name: DATASTORE_TYPE
          value: kubernetes
        - name: WAIT_FOR_DATASTORE
          value: "true"
        - name: NODENAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CALICO_NETWORKING_BACKEND
          valueFrom:
            configMapKeyRef:
              key: calico_backend
              name: calico-config
        - name: CLUSTER_TYPE
          value: k8s,bgp
        - name: NODEIP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: IP_AUTODETECTION_METHOD
          value: can-reach=$(NODEIP)
        - name: IP
          value: autodetect
        - name: CALICO_IPV4POOL_IPIP
          value: Always
        - name: CALICO_IPV4POOL_VXLAN
          value: Never
        - name: FELIX_IPINIPMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: FELIX_VXLANMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: FELIX_WIREGUARDMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: CALICO_IPV4POOL_CIDR
          value: 10.233.64.0/18
        - name: CALICO_IPV4POOL_BLOCK_SIZE
          value: "24"
        - name: CALICO_DISABLE_FILE_LOGGING
          value: "true"
        - name: FELIX_DEFAULTENDPOINTTOHOSTACTION
          value: ACCEPT
        - name: FELIX_IPV6SUPPORT
          value: "false"
        - name: FELIX_HEALTHENABLED
          value: "true"
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/node:v3.23.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - /bin/calico-node
            - -felix-live
            - -bird-live
          failureThreshold: 6
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: calico-node
        readinessProbe:
          exec:
            command:
            - /bin/calico-node
            - -felix-ready
            - -bird-ready
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          requests:
            cpu: 250m
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
        - mountPath: /lib/modules
          name: lib-modules
          readOnly: true
        - mountPath: /run/xtables.lock
          name: xtables-lock
        - mountPath: /var/run/calico
          name: var-run-calico
        - mountPath: /var/lib/calico
          name: var-lib-calico
        - mountPath: /var/run/nodeagent
          name: policysync
        - mountPath: /sys/fs/
          mountPropagation: Bidirectional
          name: sysfs
        - mountPath: /var/log/calico/cni
          name: cni-log-dir
          readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      initContainers:
      - command:
        - /opt/cni/bin/calico-ipam
        - -upgrade
        env:
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CALICO_NETWORKING_BACKEND
          valueFrom:
            configMapKeyRef:
              key: calico_backend
              name: calico-config
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/cni:v3.23.1
        imagePullPolicy: IfNotPresent
        name: upgrade-ipam
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/cni/networks
          name: host-local-net-dir
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
      - command:
        - /opt/cni/bin/install
        env:
        - name: CNI_CONF_NAME
          value: 10-calico.conflist
        - name: CNI_NETWORK_CONFIG
          valueFrom:
            configMapKeyRef:
              key: cni_network_config
              name: calico-config
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CNI_MTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: SLEEP
          value: "false"
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/cni:v3.23.1
        imagePullPolicy: IfNotPresent
        name: install-cni
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
      - image: calico/pod2daemon-flexvol:v3.23.1
        imagePullPolicy: IfNotPresent
        name: flexvol-driver
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/driver
          name: flexvol-driver-host
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: calico-node
      serviceAccountName: calico-node
      terminationGracePeriodSeconds: 0
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoExecute
        operator: Exists
      volumes:
      - hostPath:
          path: /lib/modules
          type: ""
        name: lib-modules
      - hostPath:
          path: /var/run/calico
          type: ""
        name: var-run-calico
      - hostPath:
          path: /var/lib/calico
          type: ""
        name: var-lib-calico
      - hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
        name: xtables-lock
      - hostPath:
          path: /sys/fs/
          type: DirectoryOrCreate
        name: sysfs
      - hostPath:
          path: /opt/cni/bin
          type: ""
        name: cni-bin-dir
      - hostPath:
          path: /etc/cni/net.d
          type: ""
        name: cni-net-dir
      - hostPath:
          path: /var/log/calico/cni
          type: ""
        name: cni-log-dir
      - hostPath:
          path: /var/lib/cni/networks
          type: ""
        name: host-local-net-dir
      - hostPath:
          path: /var/run/nodeagent
          type: DirectoryOrCreate
        name: policysync
      - hostPath:
          path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
          type: DirectoryOrCreate
        name: flexvol-driver-host
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
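Since only the image tags change, the edit itself can be done with a simple substitution (v3.20.0 being the version originally installed, as noted above); the manifest is then re-applied and the rollout watched. A minimal sketch:

 # bump every Calico image from the old tag to the new one in the exported manifest
sed -i 's/v3.20.0/v3.23.1/g' calico-node.yaml
# apply the updated DaemonSet and wait for the calico-node pods to roll
kubectl apply -f calico-node.yaml
kubectl -n kube-system rollout status ds/calico-node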

ClusterRole

The calico-node ClusterRole also needs to be updated, otherwise Calico will keep reporting permission errors.

  1. Export the ClusterRole as YAML
 kubectl get clusterrole calico-node -o yaml >calico-node-clusterrole.yaml
  2. calico-node-clusterrole.yaml (with the missing crd.projectcalico.org resources, such as caliconodestatuses, added):
 apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: calico-node
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - namespaces
  verbs:
  - get
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - endpoints
  - services
  verbs:
  - watch
  - list
  - get
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
  - update
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  - serviceaccounts
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - patch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - globalfelixconfigs
  - felixconfigurations
  - bgppeers
  - globalbgpconfigs
  - bgpconfigurations
  - ippools
  - ipamblocks
  - globalnetworkpolicies
  - globalnetworksets
  - networkpolicies
  - networksets
  - clusterinformations
  - hostendpoints
  - blockaffinities
  - caliconodestatuses
  - ipreservations
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ippools
  - felixconfigurations
  - clusterinformations
  verbs:
  - create
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - bgpconfigurations
  - bgppeers
  verbs:
  - create
  - update
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  - ipamblocks
  - ipamhandles
  verbs:
  - get
  - list
  - create
  - update
  - delete
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ipamconfigs
  verbs:
  - get
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  verbs:
  - watch
- apiGroups:
  - apps
  resources:
  - daemonsets
  verbs:
  - get
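Apply the updated ClusterRole and restart the Calico pods so they pick up the new permissions; a minimal sketch:

 kubectl apply -f calico-node-clusterrole.yaml
# restart calico-node so the permission errors clear
kubectl -n kube-system rollout restart ds/calico-node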

Summary

This strange network failure was ultimately caused by a mismatch between the KubeSphere and Kubernetes versions. For a working environment, stability should come first: do not rashly adopt the latest version, or you may spend a lot of time chasing inexplicable problems.
