Author: scwang18, mainly responsible for technical architecture, with extensive research in the container cloud space.
Foreword
KubeSphere is QingCloud's open-source, Kubernetes-based cloud-native distributed operating system, and it provides a polished Kubernetes cluster management interface. Our team uses KubeSphere as its development platform.
This article records how a network failure in our KubeSphere environment was diagnosed and resolved.
Symptoms
Developers reported that the Harbor registry they had set up was persistently flaky: operations against it occasionally failed with net/http: TLS handshake timeout, and curl requests to harbor.xxxx.cn would also hang randomly and frequently. Yet ping always looked perfectly normal.
Cause Analysis
After several rounds of analysis following the error reports, the root cause was finally located: the cluster had been installed with the then-latest Kubernetes 1.23.1.
Although ./kk version --show-supported-k8s
shows that KubeSphere 3.2.1 supports Kubernetes 1.23.1, the support is in fact only experimental, and there are pitfalls.
The analysis process is as follows:
- Since the failures showed up on Harbor registry access, my first instinct was that the Harbor deployment itself was broken. However, the Harbor core log contained no corresponding errors around the time of the failures, not even info-level entries.
- I then turned to the Harbor portal and checked its access log, but again found nothing abnormal.
- Following the access chain, I traced further to kubesphere-router-kubesphere-system, i.e. the nginx ingress controller shipped with KubeSphere, and again its logs showed nothing abnormal.
- Accessing Harbor's in-cluster Service address from other Pods in the cluster showed no timeouts at all, so the preliminary suspicion fell on the Ingress that ships with KubeSphere.
- I disabled KubeSphere's built-in Ingress Controller and installed the ingress-nginx-controller version recommended by the Kubernetes community; the fault persisted, and the Ingress logs still showed nothing abnormal.
- Based on the above, the problem had to lie between the client and the Ingress Controller. My Ingress Controller is exposed outside the cluster through a NodePort, so I tested other Services exposed via NodePort and hit the same fault. At this point a Harbor deployment problem could be ruled out completely; the issue was clearly somewhere on the path from the client to the Ingress Controller.
- External clients reach the Ingress Controller through a NodePort, and that path goes through kube-proxy, so I inspected the kube-proxy logs and found the following warning (the commands used for these checks are sketched after this list):
can’t set sysctl net/ipv4/vs/conn_reuse_mode, kernel version must be at least 4.1
This warning appears because the kernel of my CentOS 7.6 nodes is too old (3.10.0-1160.21.1.el7.x86_64) and is incompatible with the IPVS mode used by newer Kubernetes versions.
It can be fixed by upgrading the operating system kernel.
After upgrading the kernel, Calico failed to start and reported the following error:
ipset v7.1: kernel and userspace incompatible: settype hash:ip,port with revision 6 not supported by userspace.
The reason is that KubeSphere installs Calico v3.20.0 by default, and this version does not support the new kernel (after the upgrade the kernel is 5.18.1-1.el7.elrepo.x86_64); Calico needs to be upgraded to v3.23.0 or later.
After upgrading Calico, it continued to report errors such as:
user "system:serviceaccount:kube-system:calico-node" cannot list resource "caliconodestatuses" in api group "crd.projectcalico.org"
There were several similar error messages, all caused by insufficient resource permissions in the calico-node ClusterRole; they can be fixed by modifying the ClusterRole.
- With the kernel, Calico, and ClusterRole fixes in place, the inexplicable network problem was solved.
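For reference, a rough sketch of the commands used during the checks above. The registry hostname, namespaces, service names and pod names here are taken from my environment or are placeholders, so adjust them to yours.

# reproduce the symptom against the NodePort-exposed Ingress
curl -v https://harbor.xxxx.cn

# test Harbor's in-cluster Service from a throwaway Pod (service/namespace names are illustrative)
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sk https://harbor-core.harbor.svc

# inspect the kube-proxy logs on the node that handles the NodePort traffic
kubectl -n kube-system get pods -o wide | grep kube-proxy
kubectl -n kube-system logs <kube-proxy-pod-name> | grep conn_reuse_mode

# check the node kernel version that triggers the warning
uname -r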
Solution Process
Based on the above analysis, the fix consists of the following steps:
Upgrade the operating system kernel
- Use Alibaba Cloud's yum source
wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo
yum clean all && yum -y update
- Enable the ELRepo repository
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
- Install the latest version of the kernel
yum --enablerepo=elrepo-kernel install kernel-ml
- View all available kernels on the system
awk -F\' '$1=="menuentry " {print i++ " : " $2}' /etc/grub2.cfg
- Set the new kernel as the grub2 default
In the list of available kernels returned by the previous step, the newly installed kernel should normally be entry 0 (the first in the list).
grub2-set-default 0
- Regenerate the grub configuration file and reboot
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot
- Verify the running kernel
uname -r
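Besides uname -r, a quick sanity check after the reboot (assuming kube-proxy runs in IPVS mode and the ip_vs kernel module is loaded) is to confirm that the conn_reuse_mode sysctl is now available and that kube-proxy has stopped logging the warning:

# this sysctl only exists on kernels >= 4.1, which is why kube-proxy could not set it before
sysctl net.ipv4.vs.conn_reuse_mode

# the "kernel version must be at least 4.1" warning should no longer appear
kubectl -n kube-system logs <kube-proxy-pod-name> | grep conn_reuse_mode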
Upgrade Calico
Calico on Kubernetes is usually deployed as a DaemonSet; in my cluster the DaemonSet is named calico-node.
Export it to a YAML file, change every image tag in the file to the new version v3.23.1, and re-apply it to re-create the Pods (a sketch of the tag bump and re-apply follows the manifest below).
- Export the current DaemonSet as YAML
kubectl -n kube-system get ds calico-node -o yaml>calico-node.yaml
- calico-node.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: calico-node
  name: calico-node
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: calico-node
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: calico-node
    spec:
      containers:
      - env:
        - name: DATASTORE_TYPE
          value: kubernetes
        - name: WAIT_FOR_DATASTORE
          value: "true"
        - name: NODENAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CALICO_NETWORKING_BACKEND
          valueFrom:
            configMapKeyRef:
              key: calico_backend
              name: calico-config
        - name: CLUSTER_TYPE
          value: k8s,bgp
        - name: NODEIP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: IP_AUTODETECTION_METHOD
          value: can-reach=$(NODEIP)
        - name: IP
          value: autodetect
        - name: CALICO_IPV4POOL_IPIP
          value: Always
        - name: CALICO_IPV4POOL_VXLAN
          value: Never
        - name: FELIX_IPINIPMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: FELIX_VXLANMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: FELIX_WIREGUARDMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: CALICO_IPV4POOL_CIDR
          value: 10.233.64.0/18
        - name: CALICO_IPV4POOL_BLOCK_SIZE
          value: "24"
        - name: CALICO_DISABLE_FILE_LOGGING
          value: "true"
        - name: FELIX_DEFAULTENDPOINTTOHOSTACTION
          value: ACCEPT
        - name: FELIX_IPV6SUPPORT
          value: "false"
        - name: FELIX_HEALTHENABLED
          value: "true"
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/node:v3.23.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - /bin/calico-node
            - -felix-live
            - -bird-live
          failureThreshold: 6
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: calico-node
        readinessProbe:
          exec:
            command:
            - /bin/calico-node
            - -felix-ready
            - -bird-ready
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          requests:
            cpu: 250m
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
        - mountPath: /lib/modules
          name: lib-modules
          readOnly: true
        - mountPath: /run/xtables.lock
          name: xtables-lock
        - mountPath: /var/run/calico
          name: var-run-calico
        - mountPath: /var/lib/calico
          name: var-lib-calico
        - mountPath: /var/run/nodeagent
          name: policysync
        - mountPath: /sys/fs/
          mountPropagation: Bidirectional
          name: sysfs
        - mountPath: /var/log/calico/cni
          name: cni-log-dir
          readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      initContainers:
      - command:
        - /opt/cni/bin/calico-ipam
        - -upgrade
        env:
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CALICO_NETWORKING_BACKEND
          valueFrom:
            configMapKeyRef:
              key: calico_backend
              name: calico-config
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/cni:v3.23.1
        imagePullPolicy: IfNotPresent
        name: upgrade-ipam
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/cni/networks
          name: host-local-net-dir
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
      - command:
        - /opt/cni/bin/install
        env:
        - name: CNI_CONF_NAME
          value: 10-calico.conflist
        - name: CNI_NETWORK_CONFIG
          valueFrom:
            configMapKeyRef:
              key: cni_network_config
              name: calico-config
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CNI_MTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: SLEEP
          value: "false"
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/cni:v3.23.1
        imagePullPolicy: IfNotPresent
        name: install-cni
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
      - image: calico/pod2daemon-flexvol:v3.23.1
        imagePullPolicy: IfNotPresent
        name: flexvol-driver
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/driver
          name: flexvol-driver-host
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: calico-node
      serviceAccountName: calico-node
      terminationGracePeriodSeconds: 0
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoExecute
        operator: Exists
      volumes:
      - hostPath:
          path: /lib/modules
          type: ""
        name: lib-modules
      - hostPath:
          path: /var/run/calico
          type: ""
        name: var-run-calico
      - hostPath:
          path: /var/lib/calico
          type: ""
        name: var-lib-calico
      - hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
        name: xtables-lock
      - hostPath:
          path: /sys/fs/
          type: DirectoryOrCreate
        name: sysfs
      - hostPath:
          path: /opt/cni/bin
          type: ""
        name: cni-bin-dir
      - hostPath:
          path: /etc/cni/net.d
          type: ""
        name: cni-net-dir
      - hostPath:
          path: /var/log/calico/cni
          type: ""
        name: cni-log-dir
      - hostPath:
          path: /var/lib/cni/networks
          type: ""
        name: host-local-net-dir
      - hostPath:
          path: /var/run/nodeagent
          type: DirectoryOrCreate
        name: policysync
      - hostPath:
          path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
          type: DirectoryOrCreate
        name: flexvol-driver-host
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
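After exporting the manifest, one simple way to bump the image tags and roll the DaemonSet is a sed edit followed by kubectl apply. This is a minimal sketch that assumes the old images are all tagged v3.20.0 (the version KubeSphere installed) and that the exported file has been trimmed to the spec shown above:

# bump every Calico image tag from the old default to v3.23.1, then re-apply the DaemonSet
sed -i 's/v3\.20\.0/v3.23.1/g' calico-node.yaml
kubectl apply -f calico-node.yaml

# wait for the rolling update of the calico-node Pods to complete
kubectl -n kube-system rollout status ds/calico-node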
ClusterRole
The calico-node ClusterRole also needs to be modified, otherwise Calico keeps reporting permission errors. Export it, add the missing crd.projectcalico.org resources (caliconodestatuses was the one reported in my case), and re-apply it as sketched after the manifest below.
- Export the current ClusterRole as YAML
kubectl get clusterrole calico-node -o yaml >calico-node-clusterrole.yaml
- calico-node-clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: calico-node
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - namespaces
  verbs:
  - get
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - endpoints
  - services
  verbs:
  - watch
  - list
  - get
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
  - update
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  - serviceaccounts
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - patch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - globalfelixconfigs
  - felixconfigurations
  - bgppeers
  - globalbgpconfigs
  - bgpconfigurations
  - ippools
  - ipamblocks
  - globalnetworkpolicies
  - globalnetworksets
  - networkpolicies
  - networksets
  - clusterinformations
  - hostendpoints
  - blockaffinities
  - caliconodestatuses
  - ipreservations
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ippools
  - felixconfigurations
  - clusterinformations
  verbs:
  - create
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - bgpconfigurations
  - bgppeers
  verbs:
  - create
  - update
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  - ipamblocks
  - ipamhandles
  verbs:
  - get
  - list
  - create
  - update
  - delete
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ipamconfigs
  verbs:
  - get
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  verbs:
  - watch
- apiGroups:
  - apps
  resources:
  - daemonsets
  verbs:
  - get
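A minimal sketch of applying the updated ClusterRole and checking that the permission errors stop; the label selector k8s-app=calico-node matches the DaemonSet above:

# apply the extended RBAC rules
kubectl apply -f calico-node-clusterrole.yaml

# restart calico-node so it retries the previously forbidden list/watch calls
kubectl -n kube-system rollout restart ds/calico-node

# Pods should settle into Running and the "cannot list resource" errors should disappear
kubectl -n kube-system get pods -l k8s-app=calico-node
kubectl -n kube-system logs -l k8s-app=calico-node --tail=50 | grep "cannot list" || echo "no RBAC errors"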
Summary
This strange network failure was ultimately caused by a mismatch between the KubeSphere and Kubernetes versions. For a working environment, favor stability and do not rush to adopt the latest versions; otherwise you may spend a great deal of time chasing inexplicable problems.