1. Background
An active community and a large user base keep Kubernetes on a high-frequency release rhythm of roughly three months. Frequent releases bring new features and timely bug fixes, but production workloads run for long periods, where any change or error can cause huge economic losses. Upgrades are therefore difficult for enterprises, and keeping pace with the community is almost impossible, so the tension between high-frequency releases and production stability forces the container team to weigh trade-offs and make choices.
Since the vivo Internet team built its large-scale Kubernetes clusters, some clusters have been running v1.10 for a long time. As the proportion of containerized business grows, so do the demands for large-scale cluster stability and diverse application releases, and a cluster upgrade has become imminent. Upgrading the clusters will resolve the following issues:
- Higher-version clusters are optimized for large-scale scenarios; upgrading resolves a series of performance bottlenecks.
- Only higher-version clusters can support CNCF projects such as OpenKruise; upgrading resolves version dependency problems.
- New features in higher-version clusters can improve resource utilization, reduce server costs, and improve cluster efficiency.
- The company maintains multiple clusters of different versions; upgrading reduces cluster version fragmentation and further lowers operation and maintenance costs.
This article describes, from 0 to 1, how the clusters supporting the vivo Internet team's online business were upgraded from v1.10 to v1.17 without affecting the normal operation of existing workloads. We chose v1.17 rather than v1.18 or later because the code changes introduced in v1.18 [1] remove support for extensions/v1beta1 and other deprecated resource versions (that code was deleted in v1.18), which would prevent workloads relying on them from continuing to run.
2. Difficulties of a non-destructive upgrade
Container clusters are usually built in one of two ways: deploying binaries managed by systemd, or running the core components as containerized static Pods. In both cases, multiple replicas of the cluster API service sit behind an external load balancer. The two deployment methods do not differ much when upgrading, and binary deployment is more common in older clusters, so this article shares the upgrade of a binary-deployed cluster.
For a binary-deployed cluster, upgrading components mainly means replacing binaries, updating configuration files, and restarting services. From the perspective of production SLO requirements, the upgrade must not restart business containers because of changes in the cluster components' own logic. The difficulty of the upgrade therefore concentrates on the following points:
First, the clusters currently run an old version but host a large number of containers, some of which are still single-replica. To avoid affecting the business, container restarts must be avoided as much as possible; this is undoubtedly the biggest difficulty of the upgrade. Between v1.10 and v1.17, the way the kubelet computes container hash values changed, which means a naive upgrade would inevitably cause the kubelet to restart containers.
Second, the community-recommended approach follows the version skew policy [2]: upgrade a highly available cluster step by step so that version differences between components such as kube-apiserver and kubelet never cause API compatibility errors. This requires that each upgrade step not span two or more minor releases; for example, upgrading directly from v1.11 to v1.13 is not recommended.
Third, new features introduced across the upgraded versions may change API behavior so that old cluster configurations silently stop taking effect, planting hidden stability hazards in the cluster. Before the upgrade, we therefore have to become as familiar as possible with the ChangeLog between the two versions and identify new features that may bring potential hazards.
3. Non-destructive upgrade solution
In view of the aforementioned difficulties, this section proposes specific solutions one by one, and also introduces bugs encountered in the higher version after the upgrade, together with their fixes. We hope the pre-upgrade compatibility screening and the troubleshooting during the upgrade can inspire readers.
3.1 Upgrade method
In the software field, there are two mainstream upgrade methods: in-place upgrade and replacement upgrade. Both are adopted by major Internet companies in the industry; which one to choose depends heavily on the business running on the cluster.
Replacement upgrade
1) A Kubernetes replacement upgrade first prepares a cluster of the higher version, then rotates nodes out of the lower-version cluster one by one: drain the node, delete it from the old cluster, and join it to the new cluster.
2) The advantage of a replacement upgrade is that it is more atomic: each node is upgraded step by step with no intermediate state, which is safer for the business. The disadvantages are the large upgrade workload and the drain operation, which is unfriendly to applications that are highly sensitive to Pod restarts, stateful applications, single-replica applications, and the like.
In-place upgrade
1) A Kubernetes in-place upgrade updates components on the nodes in batches, upgrading kube-controller-manager, kubelet, and so on in a certain order, and manages component versions in batches by node role.
2) The advantages of an in-place upgrade are that it is convenient to automate and that, with appropriate modifications, the continuity of the container lifecycle can be preserved. The disadvantages are that the component upgrade order matters a great deal, intermediate states exist during the upgrade, a failed restart of one component may affect the subsequent upgrade of other components, and atomicity is poor.
Some businesses running on vivo's container clusters have a low tolerance for restarts, so avoiding container restarts as far as possible is the first priority of the upgrade. Once the container restarts caused by the version upgrade are resolved, the upgrade method can be chosen according to local conditions, considering the degree of containerization and the types of business. For binary-deployed clusters we recommend the in-place upgrade: it is fast, simple to operate, and single-replica workloads are not affected.
3.2 Cross-version upgrade
Kubernetes itself is a microservice architecture centered on APIs; internally it also coordinates resource state through API calls and List-Watch on resource objects. Community developers therefore follow the principle of forward and backward compatibility when designing APIs. This compatibility rule likewise follows the community's version skew policy [2]: when an API group is deprecated, an Alpha version can be dropped immediately, while a Beta version continues to be served for three more releases; beyond that, the API resource version becomes incompatible. For example, Kubernetes deprecated the extensions/v1beta1 versions of Deployment and other resources in v1.16 and deleted them at the code level in v1.18. When upgrading across three or more versions, such resources can no longer be recognized, and the corresponding create, delete, update, and get operations can no longer be performed.
If we followed the officially recommended upgrade strategy, going from v1.10 to v1.17 would require at least seven rounds of upgrades, which is hard to accept in a production environment with complex business scenarios and high business risk.
API-breaking changes like these do not appear in every release. The version skew policy recommended by the community is the safest upgrade strategy, but after carefully combing the ChangeLogs and sufficient cross-version testing, we confirmed that between these two versions there are no API compatibility issues that affect business operation or cluster management. For the deprecated API types, the corresponding kube-apiserver parameters can be configured to keep serving them, so that existing workloads continue to run normally.
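As an example of such configuration: in v1.16 and later, deprecated resource versions can be re-enabled by adding a flag of the form `--runtime-config=extensions/v1beta1/deployments=true,extensions/v1beta1/daemonsets=true` to the kube-apiserver startup arguments (the exact group/version keys to enable depend on the target release, so verify them against that release's documentation).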
3.3 Avoid container restart
During preliminary verification of the upgrade plan, we found a large number of containers being rebuilt; the kubelet log after the upgrade gave the restart reason as "Container definition changed". Tracing the error message into the source locates the computePodActions method in pkg/kubelet/kuberuntime/kuberuntime_manager.go. This method computes whether the spec hash of the Pod has changed; if so, it returns true, telling the kubelet's syncPod method to trigger a rebuild of the container or of the whole Pod.
The kubelet's container hash calculation:
func (m *kubeGenericRuntimeManager) computePodActions(pod *v1.Pod, podStatus *kubecontainer.PodStatus) podActions {
restart := shouldRestartOnFailure(pod)
if _, _, changed := containerChanged(&container, containerStatus); changed {
message = fmt.Sprintf("Container %s definition changed", container.Name)
// if the container spec has changed, force a container restart (set the restart flag to true)
restart = true
}
...
if restart {
message = fmt.Sprintf("%s, will be restarted", message)
// add the container that needs restarting to the restart list
changes.ContainersToStart = append(changes.ContainersToStart, idx)
}
}
func containerChanged(container *v1.Container, containerStatus *kubecontainer.ContainerStatus) (uint64, uint64, bool) {
// compute the hash of the container spec
expectedHash := kubecontainer.HashContainer(container)
return expectedHash, containerStatus.Hash, containerStatus.Hash != expectedHash
}
Compared with v1.10, v1.17 computes the container hash from the JSON-serialized container spec, whereas v1.10 hashed the container struct directly. In addition, the container struct in the higher-version kubelet has gained new fields, so the values computed through the go-spew library naturally differ, and the changed result is passed upward until syncPod triggers container rebuilds.
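To make this concrete, below is a minimal, self-contained sketch (not the kubelet source; the real logic lives in the kubelet's hash utilities) of this style of hashing. It shows that hashing the struct and hashing its JSON serialization give different values for the same spec, which is exactly why the upgraded kubelet sees every old container as "changed":

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"

	"github.com/davecgh/go-spew/spew"
)

type container struct {
	Name  string `json:"name"`
	Image string `json:"image"`
}

// deepHash mimics the DeepHashObject pattern: deterministically pretty-print
// the object with go-spew and feed the bytes into an FNV-32a hasher.
func deepHash(obj interface{}) uint32 {
	h := fnv.New32a()
	printer := spew.ConfigState{Indent: " ", SortKeys: true, DisableMethods: true, SpewKeys: true}
	printer.Fprintf(h, "%#v", obj)
	return h.Sum32()
}

func main() {
	c := container{Name: "app", Image: "nginx:1.14"}

	structHash := deepHash(c) // old style: hash the struct itself
	j, _ := json.Marshal(c)
	jsonHash := deepHash(j) // new style: hash the JSON serialization

	// Same user-visible spec, different hash inputs, different results:
	// the upgraded kubelet would report "Container definition changed".
	fmt.Println(structHash, jsonHash, structHash == jsonHash)
}
```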
Could the newly added fields be stripped from the container struct's hash input, for example by adjusting what is fed to go-spew? It is possible, but not elegant: such an intrusion into the core code logic is serious, every future version upgrade would again require custom code, and as more fields are added the maintenance complexity keeps rising. Looking at it from another angle: if Pods created by the old-version kubelet simply skip this check during the upgrade transition, container restarts can be avoided.
After talking with industry peers, we found that a similar idea has already been implemented in the community: create a local configuration file recording the old cluster version and the kubelet start time, and have the kubelet maintain a cache that reads this file. In each syncPod cycle, if the kubelet finds that its own version is higher than the oldVersion recorded in the cache and the container was started earlier than the current kubelet's start time, it skips the container hash comparison. The upgraded cluster then runs a scheduled task that checks whether each Pod's containerSpec matches the hash produced by the new calculation method; once it does, the local configuration file can be deleted, and syncPod's logic becomes fully consistent with the community code again.
The specific plan is given in [3]. The benefit of this implementation is that it intrudes very little on the native kubelet code and does not change the core logic; if a higher-version upgrade is needed later, the code can be reused, and once all Pods in the cluster have been created by the current kubelet version, the community's own logic is restored.
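The sketch below illustrates the shape of such a transition gate. It is a simplified, self-contained illustration, not the actual patch: the upgradeMarker type, the field names, and the naive string version comparison are all stand-ins (a real implementation would use a semantic-version comparison and read the marker from the local file described above):

```go
package main

import (
	"fmt"
	"time"
)

// upgradeMarker mirrors the local configuration file written before the
// upgrade (illustrative names, not the actual patch).
type upgradeMarker struct {
	OldVersion       string    // kubelet version that created the existing containers
	KubeletStartTime time.Time // start time of the upgraded kubelet
}

// shouldSkipHashCheck reports whether the container-hash comparison in
// syncPod should be skipped for a container during the upgrade transition.
func shouldSkipHashCheck(m *upgradeMarker, currentVersion string, containerStartedAt time.Time) bool {
	if m == nil {
		return false // marker file deleted: restore community behavior
	}
	// Skip only when this kubelet is newer than the version that created the
	// container, and the container predates this kubelet's start. (A string
	// compare stands in for a proper semver comparison here.)
	return currentVersion > m.OldVersion && containerStartedAt.Before(m.KubeletStartTime)
}

func main() {
	marker := &upgradeMarker{OldVersion: "v1.10.0", KubeletStartTime: time.Now()}
	startedAt := time.Now().Add(-24 * time.Hour) // container created by the old kubelet
	fmt.Println(shouldSkipHashCheck(marker, "v1.17.0", startedAt)) // true: do not restart
}
```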
3.4 Unexpected Pod eviction
Although Kubernetes has iterated through more than a dozen versions, community activity in each iteration remains very high, with roughly 30 new features for scalability and stability enhancement in every release. One reason to upgrade is precisely to introduce these community-developed features, enriching cluster functionality and improving cluster stability. Feature development also follows the version skew policy, and upgrading across many versions can enable new features while part of their configuration has not been loaded, which brings stability risks to the cluster. It is therefore necessary to review the features that affect the Pod lifecycle, paying special attention to controller-related functionality.
Note in particular the TaintBasedEvictions feature, introduced in v1.13 to manage Pod eviction conditions at a finer granularity. Before this condition-based mechanism, eviction was handled by the NodeController on a uniform timer: Pods on a node were evicted after the node stayed NotReady beyond the default 5 minutes. After TaintBasedEvictions became enabled by default in v1.16, eviction on a NotReady node is handled per Pod according to each Pod's configured tolerationSeconds.
Pods created in the old-version cluster have no tolerationSeconds set by default. Once TaintBasedEvictions is switched on after the upgrade, Pods will be evicted 5 seconds after their node turns NotReady, so transient network fluctuations, kubelet restarts, and the like would shake the stability of business in the cluster.
The controller behind TaintBasedEvictions determines a Pod's eviction time from the tolerationSeconds in the Pod definition, which means that as long as tolerationSeconds is set correctly, unexpected Pod eviction can be avoided.
The DefaultTolerationSeconds admission controller, enabled by default in the community's v1.16, uses the kube-apiserver flags --default-not-ready-toleration-seconds and --default-unreachable-toleration-seconds to give Pods default tolerations for the node.kubernetes.io/not-ready:NoExecute and node.kubernetes.io/unreachable:NoExecute taints.
Newly created Pods pass through the DefaultTolerationSeconds admission controller, which adds the default tolerations to their spec. But how can this logic take effect for Pods that already exist in the cluster? Inspecting the admission controller shows that, besides create operations, update operations also trigger the plugin to set tolerations. We can therefore achieve the goal simply by adding a label to the Pods already running in the cluster, after which they carry tolerations like the following:
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
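A hedged sketch of that relabeling step with client-go follows. The label key is made up for illustration, and the calls use the recent client-go signatures (in release-1.17 the List/Patch methods take no context argument); any no-op update through the apiserver would do, since the point is only to route the Pod back through the admission chain:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // example path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	// Any label works; the update request itself makes the DefaultTolerationSeconds
	// plugin inject the default not-ready/unreachable tolerations.
	patch := []byte(`{"metadata":{"labels":{"upgrade/relabeled":"true"}}}`)
	for _, pod := range pods.Items {
		if _, err := client.CoreV1().Pods(pod.Namespace).Patch(
			context.TODO(), pod.Name, types.StrategicMergePatchType, patch, metav1.PatchOptions{},
		); err != nil {
			panic(err)
		}
	}
}
```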
3.5 Pod MatchNodeSelector
To determine whether Pods were unexpectedly evicted during the upgrade and whether containers restarted in batches, we used scripts that tracked, in real time, the non-Running Pods and the restarted containers on each node.
During the upgrade, dozens of Pods were suddenly marked with the MatchNodeSelector status, and checking the nodes confirmed that the business containers had indeed stopped. The kubelet log showed errors like the following:
predicate.go:132] Predicate failed on Pod: nginx-7dd9db975d-j578s_default(e3b79017-0b15-11ec-9cd4-000c29c4fa15), for reason: Predicate MatchNodeSelector failed
kubelet_pods.go:1125] Killing unwanted pod "nginx-7dd9db975d-j578s"
Analysis showed that a Pod enters the MatchNodeSelector state because, when the kubelet restarts and re-admits the Pods on its node, it cannot find the node labels required by the Pod's nodeSelector. The Pod's phase is then set to Failed with Reason MatchNodeSelector. When kubectl fetches the Pod, the printer converts the status and displays the Reason directly, which is why we saw the Pod status as MatchNodeSelector. By re-adding the missing labels to the node, the Pods can be rescheduled back, after which the Pods stuck in the MatchNodeSelector state can be deleted.
We recommend writing a pre-upgrade check script that verifies, for every Pod on every node, that the nodeSelector attributes in the Pod definition have matching labels on that node.
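A minimal sketch of such a check with client-go (again using the recent client-go signatures; the kubeconfig path is an example):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // example path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Cache node labels so every node is fetched only once.
	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	nodeLabels := make(map[string]map[string]string, len(nodes.Items))
	for _, n := range nodes.Items {
		nodeLabels[n.Name] = n.Labels
	}

	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		labels, ok := nodeLabels[p.Spec.NodeName]
		if p.Spec.NodeName == "" || !ok {
			continue // not scheduled yet, or node unknown
		}
		for k, v := range p.Spec.NodeSelector {
			if labels[k] != v {
				fmt.Printf("pod %s/%s: nodeSelector %s=%s not satisfied on node %s\n",
					p.Namespace, p.Name, k, v, p.Spec.NodeName)
			}
		}
	}
}
```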
3.6 Unable to access kube-apiserver
After the upgraded pre-release cluster had been running v1.17 for a while, a node suddenly raised a NotReady alarm; restarting the kubelet on that node restored it to normal. Digging into the cause, we found a large number of "use of closed network connection" errors in the kubelet log. Searching the community turned up a similar issue [4], in which the developers describe the cause and the fix; the fix was merged in v1.18.
The cause is that the kubelet's connections are HTTP/2 persistent connections by default, and the golang net package used to build the client-server connection had a bug [5]: a broken connection could still be fetched from the HTTP connection pool, leaving the kubelet unable to communicate with kube-apiserver.
The golang community worked around this by adding an HTTP/2 connection health check, but that fix still had bugs; the problem was completely fixed in golang v1.15.11. Internally, we resolved it by backporting the fix to our v1.17 branch and compiling the binaries with golang 1.15.15.
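For illustration, the sketch below shows (in simplified form, not the client-go code) how that health check in golang.org/x/net/http2 is enabled: an idle HTTP/2 connection gets pinged periodically and is dropped from the pool if the ping receives no reply, instead of being reused while silently broken:

```go
package main

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newHealthCheckedTransport wraps a standard transport and turns on the
// HTTP/2 connection health check via ReadIdleTimeout/PingTimeout.
func newHealthCheckedTransport() (*http.Transport, error) {
	t := &http.Transport{}
	h2t, err := http2.ConfigureTransports(t)
	if err != nil {
		return nil, err
	}
	h2t.ReadIdleTimeout = 30 * time.Second // ping after 30s without frames
	h2t.PingTimeout = 15 * time.Second     // drop the connection if the ping times out
	return t, nil
}

func main() {
	t, err := newHealthCheckedTransport()
	if err != nil {
		panic(err)
	}
	client := &http.Client{Transport: t, Timeout: 10 * time.Second}
	_ = client // use as usual; broken connections are now detected and evicted
}
```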
3.7 TCP connection count problem
During the trial run in the pre-release environment, we noticed by chance that the kubelet on every node held nearly 10 long-lived connections to kube-apiserver. This contradicted our understanding that the kubelet reuses a single connection, and checking the v1.10 environment confirmed there was indeed only one. The growth in TCP connections puts pressure on the LB; as the number of nodes grows, an overwhelmed LB would stop kubelet heartbeats from getting through, nodes would turn NotReady, and Pods would then be evicted in large numbers, with disastrous consequences. So besides tuning the LB's own parameters, we had to locate the reason the number of kubelet-to-apiserver connections grew.
A kubeadm cluster built locally at v1.17.1 showed only one long-lived kubelet-to-apiserver connection, indicating that the problem was introduced between v1.17.1 and our upgrade target version. Troubleshooting (see [6][7]) showed that newly added judgment logic stopped the kubelet from fetching the cached long-lived connection when building its client. The transport's main job is precisely to cache long-lived connections for reuse across large numbers of HTTP requests, reducing the time cost of establishing TCP (TLS) connections. The change customized the RoundTripper interface for the transport: once the tlsConfig object carries a Dial or Proxy attribute, a new connection is created rather than taken from the cache.
// client-go logic for fetching a reusable connection from the cache
func tlsConfigKey(c *Config) (tlsCacheKey, bool, error) {
...
if c.TLS.GetCert != nil || c.Dial != nil || c.Proxy != nil {
// cannot determine equality for functions
return tlsCacheKey{}, false, nil
}
...
}
func (c *tlsTransportCache) get(config *Config) (http.RoundTripper, error) {
key, canCache, err := tlsConfigKey(config)
...
if canCache {
// Ensure we only create a single transport for the given TLS options
c.mu.Lock()
defer c.mu.Unlock()
// See if we already have a custom transport for this config
if t, ok := c.transports[key]; ok {
return t, nil
}
}
...
}
// kubelet logic for building the client
func buildKubeletClientConfig(ctx context.Context, s *options.KubeletServer, nodeName types.NodeName) (*restclient.Config, func(), error) {
...
kubeClientConfigOverrides(s, clientConfig)
closeAllConns, err := updateDialer(clientConfig)
...
return clientConfig, closeAllConns, nil
}
// updateDialer sets the Dial attribute on clientConfig, so the kubelet creates a new transport when building the client
func updateDialer(clientConfig *restclient.Config) (func(), error) {
if clientConfig.Transport != nil || clientConfig.Dial != nil {
return nil, fmt.Errorf("there is already a transport or dialer configured")
}
d := connrotation.NewDialer((&net.Dialer{Timeout: 30 * time.Second, KeepAlive: 30 * time.Second}).DialContext)
clientConfig.Dial = d.DialContext
return d.CloseAll, nil
}
The closeAllConns object constructed here closes connections that are already dead but not yet closed. Since the previous problem had already been solved by the golang upgrade, we rolled back this part of the change in our local code branch, which fixed the growth in the number of TCP connections.
Recently, tracking the community, we found that a solution has been merged [8]: TCP connection reuse for custom RESTClients is restored by refactoring the client-go interface.
4. Non-destructive upgrade operation
The biggest risks of a cross-version upgrade are inconsistent object definitions before and after the upgrade, which may leave the upgraded components unable to parse objects stored in the etcd database, and the intermediate state of the upgrade itself: with the control-plane components upgraded but the kubelet not yet, state reporting may go wrong, and in the worst case the Pods on a node get evicted. All of this must be considered and verified by testing before the upgrade.
After repeated testing, none of the above problems appear between v1.10 and v1.17, apart from the partially deprecated API resources, which are handled by the kube-apiserver configuration described earlier. To be able to handle any special case not covered by testing, we strongly recommend backing up the etcd database before the upgrade, and stopping the controller and scheduler during the upgrade to avoid unexpected control logic (strictly speaking, only some of the controllers inside the controller-manager need to be stopped, but that requires code changes and a temporarily compiled controller-manager, which adds steps and management complexity to the upgrade, so we simply stopped the controllers globally).
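For the backup, a snapshot can be taken with the standard etcd v3 tooling, for example `ETCDCTL_API=3 etcdctl snapshot save backup.db`, together with the --endpoints and certificate flags that match your deployment.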
Besides the code changes and process precautions above, before replacing the binaries we diffed the configuration items of the old and new versions of each service to make sure the services would start and run successfully. The comparison showed that the --allow-privileged flag is no longer supported at kubelet startup and must be removed. Note that its removal does not mean higher versions can no longer run privileged containers on nodes: after v1.15, a set of security controls for Pods can be defined through the PodSecurityPolicy resource object, allowing finer-grained security control.
With the binaries compiled from the non-destructive-upgrade code changes discussed above, and each configuration item in the cluster component configuration files updated, the online upgrade can begin. The overall steps are:
- Back up the cluster (binaries, configuration files, etcd database, etc.);
- Upgrade a grayscale batch of nodes to verify the correctness of the binaries and configuration files;
- Distribute the upgraded binaries in advance;
- Stop the controller, scheduler, and alarms;
- Update the control-plane service configuration files and upgrade the control-plane components;
- Update the compute-node service configuration files and upgrade the node components;
- Label the Pods to trigger the addition of the tolerations attribute;
- Restart the controller and scheduler, and re-enable alarms;
- Check the business on the cluster and confirm the cluster is normal.
During the upgrade, keep the node concurrency moderate: large numbers of kubelets restarting and reporting at the same time put pressure on the LB in front of kube-apiserver. In extreme cases, node heartbeats may fail and node status will flap between NotReady and Ready.
5. Summary
Cluster upgrades had troubled the container team for a long time. After a series of investigations and repeated tests, and after solving the key problems described above, we successfully upgraded the clusters from v1.10 to v1.17; a 1000-node cluster is upgraded in batches, and the operation takes about 10 minutes. We will upgrade to a higher version again once the platform interface transformation is complete.
The version upgrade improves cluster stability, increases cluster scalability, enriches cluster capabilities, and makes the clusters compatible with more CNCF projects.
As mentioned at the beginning, frequently upgrading large-scale clusters strictly according to the version skew policy may not be realistic, so cross-version upgrades, despite the higher risk, are widely adopted in the industry. At KubeCon China 2021, Alibaba also shared their zero-downtime cross-version upgrade of Kubernetes clusters, mainly introducing key points such as application migration and traffic switching; both the preparation and the process are relatively complex. Compared with Alibaba's cluster-replacement cross-version upgrade, the in-place method requires a small amount of source code modification, but the upgrade process is simpler and easier to automate.
Because cluster versions differ so much from site to site, the upgrade described in this article is not necessarily widely applicable. The author hopes it gives readers ideas about, and a map of the risk points in, cross-version upgrades of production clusters. The upgrade itself is short, but the preparation and research beforehand are time-consuming and laborious, requiring deep exploration of the features and source code of different Kubernetes versions, together with a thorough understanding of the Kubernetes API compatibility policy and release policy. Only with sufficient pre-upgrade testing can you face the unexpected during the upgrade calmly.
6. References
[2] Kubernetes version skew policy: https://kubernetes.io/version-skew-policy
[3] Specific plan reference: https://github.comstart
[4] Similar issue: https://github.com/kubernetes
[5] golang issue: https://github.com/golang/34978
[6] https://github.com/kubernetes/100376
[7] https://github.com/kubernetes/95427
[8] https://github.com/kubernetes/105490
Author: vivo Internet Server Team - Shu Yingya