
Text | Wang Lianping (nickname: Ye Chuan)

Senior Development Engineer, Ant Group

Responsible for container delivery in Ant Group's Kubernetes clusters, focusing on cluster delivery capability, delivery performance, delivery tracing, and related areas

This article is 12,623 words and takes about 20 minutes to read

—— Paoding Jie Niu (dissecting the ox like a master butcher): making upgrades no longer a headache

PART. 1 Background

Ant Sigma is the core infrastructure of Ant Group. After years of development, its scale is among the leaders in the industry. Large-scale clusters place higher demands on the stability and functionality of Kubernetes. Ant Sigma strives to deliver efficient, stable, lossless, and imperceptible upgrades of the cloud-native operating system in an environment of tens of thousands of nodes, providing users with extremely stable and feature-rich cloud-native services.

Why continue to iteratively upgrade?

The Kubernetes community is very active; many cloud-native enthusiasts contribute to it and drive continuous version updates. Upgrading keeps us in step with the community so that we can adopt the excellent features it has accumulated in a timely manner, bringing greater benefits to the company.

Why is it so hard to upgrade?

Given the scale of Ant Sigma, upgrading is a very difficult task for us, mainly reflected in:

  • In the preparation stage, clients must be pushed to upgrade across the board, and business teams have to assign dedicated people to the work, which is time-consuming and labor-intensive;
  • During the upgrade, to avoid unpredictable consequences of Kubernetes resource operations while versions are rolling, traffic is usually shut off, which makes for a poor business experience;
  • As for the upgrade time window, to give users a better experience the upgrade has to run when business volume is low, which is unfriendly to the platform operation and maintenance staff.

Therefore, our goal is to improve the happiness of users, R&D, and SRE during upgrades. We aim for lossless upgrades to reduce risk, decoupling from users to improve happiness, and efficient iteration to deliver more powerful platform capabilities, ultimately achieving unattended upgrades.

Drawing on the upgrade practice of the Ant Sigma system, this article starts from the goals and challenges of upgrading a Kubernetes system, gradually analyzes the relevant Kubernetes internals, and presents some of Ant Sigma's principles and thinking on these challenges.

【Two different upgrade approaches】

Before introducing the challenges and benefits, let's look at how clusters are upgraded today. A Kubernetes upgrade is similar to an ordinary software upgrade, with two common approaches: replacement upgrade and in-place upgrade.

  • Replacement upgrade: switch the application's running environment to the new version and take the old version offline; once switched, the replacement is complete. For Kubernetes this means creating a new-version cluster before the upgrade, migrating applications to it, and then decommissioning the old cluster. The replacement can be done at different granularities: at the cluster dimension, the whole cluster is switched over; at the node dimension, the control-plane components are upgraded separately, and when kubelet nodes are upgraded, the Pods on each node are migrated to new-version nodes and the old-version nodes are taken offline.
  • In-place upgrade: replace the software package in place, stop the old service process, and restart the service with the new package. In a Kubernetes upgrade, the apiserver and kubelet packages are updated in place and the services restarted. The biggest difference from a replacement upgrade is that the workloads on the nodes do not need to be migrated and applications are not interrupted, preserving business continuity.

Both approaches have their advantages and disadvantages. Ant Sigma adopts the in-place upgrade.

【Methodology - Paoding Jie Niu】

With an in-place upgrade you inevitably run into problems specific to upgrading in place, the most important of which is compatibility. Compatibility issues fall into two main areas: the Kubernetes API and the internal control logic of components.

The Kubernetes API level covers changes to API endpoints, resource schemas, and features. Changes in components' internal control logic are mainly changes in how resources flow and are handled inside Kubernetes.

The former is the most important factor affecting users and cluster stability, and it is the problem we focus on solving.

[Figure]

Changes to API endpoints naturally require client upgrades; in particular, for deprecated and removed APIs, clients can no longer use the old endpoints. Changes to the resource schema mainly mean changes to resource fields: a field adjustment implies a change in API capability, and differences in the fields of the same resource between old and new versions, including changes to fields and their default values, lead to differences in behavior. On the feature side, some features reaching GA means their feature gates are removed, while new features are added.

Facing these core problems, we divide the compatibility issues encountered during an upgrade into three stages: before the upgrade, during the upgrade, and after the upgrade.

  • Before the upgrade, a great deal of effort goes into pushing clients to upgrade. By studying the differences between versions and how multi-version clients coexist, we formulate rules that greatly reduce the number of clients that must be upgraded and improve efficiency.
  • During the upgrade, we face the coexistence of multiple apiserver versions, the conversion of the data storage version, and of course rollback. For these, we use fine-grained traffic control to avoid tampering, pin (suppress) the resource storage versions and GVKs to guarantee rollback, and migrate the data in etcd to newer versions, achieving lossless upgrade and rollback.
  • After the upgrade, for the small number of clients that might cause unacceptable failures, we reduce the risk of tampering by identifying the intent behind resource modification requests.

[Figure]

There is one more important piece: the whole process needs to be automated and visualized. Sufficient traffic grayscale during the upgrade, automatic progression of the upgrade rhythm, and manual controllability in emergency scenarios are all important; these will be covered in detail in another article.

Overall, we improve upgrade efficiency, reliability, and stability by minimizing client upgrades and by rolling, automated upgrades.

PART. 2 Before upgrade

Cluster upgrades inevitably involve API updates and iteration, mainly the addition, evolution, and removal of APIs. In Kubernetes, an API generally evolves through Alpha, Beta, and GA stages, and a resource's API version iterates along this path. A newly added API starts at the Alpha stage, such as "cert-manager.io/v1alpha3"; after several iterations the new feature enters Beta and finally reaches the stable GA version. This process can span several major community releases. Some API versions are deprecated after the GA version has run stably for a while, and a deprecated API version is removed outright after a further period, which imposes a hard requirement on our clients to upgrade.

Before discussing client upgrades, let's first look at typical resource API changes.

Schema changes

There may be differences in the Schema fields of different versions of Kubernetes resources, mainly in the following two aspects:

  • Field addition/deletion/modification
  • Default value adjustment for fields

Field additions and deletions

A Kubernetes resource field can be adjusted by adding, deleting, or modifying it. An "addition" can appear either in a new GV (GroupVersion) or in an old one; "deletion" and "modification" generally appear only in a new GV.

Based on the above, the following conclusions can be drawn about resource field adjustments introduced by a new version of the apiserver:

[Figure]

Field default value changes

A default value change means that the old and new apiservers fill in a field's default differently. Changing a field's default value can cause two problems (the first is illustrated by the sketch after this list):

  • The container hash changes, which causes the container to restart
  • The control behavior of management components is affected
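Whether a container restarts hinges on the hash that kubelet derives from the container spec. The sketch below only illustrates the mechanism and is not kubelet's real implementation: the containerSpec type and its NewField member are hypothetical, standing in for a field whose default changed between versions.

```go
// A self-contained sketch of why a changed field default can restart containers:
// kubelet derives a hash from the container spec, and if the apiserver fills a
// different default into a field that feeds the hash, the hash changes and the
// container is considered changed and rebuilt.
package main

import (
	"fmt"
	"hash/fnv"
)

// containerSpec stands in for the relevant part of v1.Container; NewField is a
// hypothetical field filled by the apiserver with a version-dependent default.
type containerSpec struct {
	Image    string
	NewField string
}

func hashContainer(c containerSpec) uint64 {
	h := fnv.New64a()
	// kubelet uses a deep hash of the object; a fmt-based serialization is
	// enough to make the point here.
	fmt.Fprintf(h, "%#v", c)
	return h.Sum64()
}

func main() {
	oldDefault := containerSpec{Image: "app:v1", NewField: "default_value_A"}
	newDefault := containerSpec{Image: "app:v1", NewField: "default_value_B"}
	fmt.Printf("old hash: %x\nnew hash: %x\nequal: %v\n",
		hashContainer(oldDefault), hashContainer(newDefault),
		hashContainer(oldDefault) == hashContainer(newDefault))
}
```

If such a field feeds the container hash, a mere change in defaulting is enough to make kubelet rebuild the container.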

The impact of field changes mainly shows up in cross access by old and new client versions and in the coexistence of multiple apiserver versions; the specifics are described below.

Client upgrade

Clients are upgraded to be compatible with the new API version, to ensure nothing breaks after the upgrade, and to achieve a precise, differentiated upgrade of operators that improves efficiency. If low-version clients are not upgraded, the following problems arise:

The core issues

Based on the changes in GVKs (GroupVersionKind) between old and new versions, we sorted out the situations a low-version client may run into during the upgrade:

[Figure]

As shown in the figure above, the core problems fall into three categories:

1. A low-version client accesses a GroupVersionKind that has been deprecated or removed

2. A low-version client operates on a resource with newly added fields

3. A low-version client operates on a resource whose field default values have changed

For the first problem, when accessing a GVK that has been deprecated, and especially one that has been removed, the server directly returns a 404-like error, indicating that the REST URL, i.e. the GVK, no longer exists.

The second and third problems both occur with a low-version client's Update operation, so why is there no problem with Patch? Because Update is a full update while Patch is a partial update. In a full update, if the client version is low and its object has neither the new field nor the field whose default value has changed, then when it Updates the resource, the submitted request data simply does not contain that field; the apiserver completes and fills in the field itself, so the final value of that field depends entirely on the logic inside the apiserver.
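The client-go sketch below reproduces this contrast. It assumes a reachable cluster, a client-go release from the 1.18 era, and a pre-existing Ingress named "demo-ingress" in the "default" namespace (all names are illustrative); the old client is simulated by clearing PathType so that the Update body omits the field, while the Patch only carries the fields the client actually cares about.

```go
// Sketch: Update (PUT) is a full replacement, so a field missing from the request
// body gets re-defaulted by the apiserver; Patch only touches the submitted fields.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	ingresses := cs.NetworkingV1beta1().Ingresses("default")

	// A 1.16 client's Ingress struct has no PathType, so its Update body omits the
	// field; we simulate that body by clearing the field before submitting. The
	// 1.18 apiserver then re-defaults pathType, overwriting the value set earlier.
	ing, err := ingresses.Get(context.TODO(), "demo-ingress", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for i := range ing.Spec.Rules {
		if ing.Spec.Rules[i].HTTP == nil {
			continue
		}
		for j := range ing.Spec.Rules[i].HTTP.Paths {
			ing.Spec.Rules[i].HTTP.Paths[j].PathType = nil
		}
	}
	if _, err := ingresses.Update(context.TODO(), ing, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}

	// Patch is a partial update: only the submitted fields are touched, so a client
	// that knows nothing about pathType cannot accidentally reset it.
	patch := []byte(`{"metadata":{"annotations":{"touched-by":"old-client"}}}`)
	if _, err := ingresses.Patch(context.TODO(), "demo-ingress",
		types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}
}
```

After the Update, the apiserver fills pathType back with its default; after the Patch, the value set at creation time is left untouched.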

To examine the effect of field additions, we ran an experiment:

In version 1.18, Ingress has an extra pathType field. We first create an Ingress through a 1.18 client, setting pathType=Prefix, then update it through a 1.16 client, and find that pathType has been reset to its default value, as follows:

[Figure]

Think and Solve

For the first problem, the remedy is clear: the client must be upgraded, because the GVK has already been removed from the new apiserver. According to the Kubernetes community's API deprecation policy (an API version in a given track may not be deprecated until a new API version that is at least as stable is released; and, other than the most recent API version in each track, older API versions must continue to be supported for a certain period after their deprecation is announced), we can explicitly keep some APIs backward compatible to meet the needs of some low-version clients. But this cannot go on indefinitely; the API will eventually be removed in some higher version, so we do not recommend postponing or tolerating client upgrades.

The second and third problems both involve the Update operation, because Update can unintentionally modify fields. If a user's client version is low, but the user does not care about the resource's new fields and does not want to use the new features, the field can be ignored entirely: create/delete/patch are unaffected, and a Patch will not touch the field. So as long as we control the clients that perform Update, we can avoid tampering with newly added fields.

PART. 3 During the upgrade

The upgrade of a Kubernetes cluster mainly includes client upgrades and core component upgrades. The core components include the apiserver, controller-manager, scheduler, and kubelet.

Here "client" is used in a broad sense: both business operators and management-and-control operators are called clients. Client upgrades are done by the clients themselves. The core of a major-version upgrade is the dirty-data problem that may arise while the apiserver is being upgraded.

The dirty data problem mentioned here is mainly reflected in the following two aspects:

  • Multiple apiserver versions operating on the same resource
    Tampering occurs when apiservers of different versions operate on a resource whose schema has changed. In essence this is the same tampering seen when multi-version clients operate on such a resource; here it is the apiserver versions that differ, and it happens whether or not the client versions are the same.
  • Ensuring the data stored in etcd is updated correctly
    This is a problem people rarely pay attention to, because the apiserver upgrade process handles most of it for us, but it is not 100% foolproof; looking into it also gives a deeper understanding of how Kubernetes stores data.

Dirty data is easily associated with rollback during an upgrade. We cannot guarantee that an upgrade is 100% successful, but we must have rollback capability. The community does not recommend rolling back during an upgrade, as it brings more compatibility problems.

These issues will be discussed in detail below.

【Multiple versions of apiserver coexist】

Looking at the upgrade process, traffic control mainly concerns two kinds of traffic:

  • Update/patch traffic
  • All other remaining traffic

The focus is on Update traffic; it is the main object of control, and for the same reason that low-version clients tamper with fields.

On the client side, the problem is that, with a high-version apiserver, tampering occurs when clients of both high and low versions operate on the same resource at the same time; so we push the clients that perform Update to upgrade.

Another problem arises while the apiserver itself is being upgraded. During the upgrade, traffic is sent to both the 1.16 and the 1.18 apiservers at the same time, so even if the client version is high, writing the same resource through different apiserver versions still causes tampering.

Multi-version apiserver cross access

For this problem, we again look at the types of resource change described above.

  • Field changes

Field changes include addition, deletion, and modification. Deletions and modifications appear only in a new GVK, so we only need to consider additions. As shown in the figure below, in version 1.18 the Kubernetes Pod has one more field, "NewField", than in version 1.16. During the upgrade, if the same PodA is accessed through both versions, its stored data keeps flip-flopping: PodA is first created through the 1.18 apiserver; an Update through the 1.16 apiserver then drops the newly added field; and an Update through the 1.18 apiserver fills it back in again.

[Figure]

Regarding this issue, the following conclusions are drawn:

(1) For field additions, there is a risk that the value of the new field will be dropped and re-defaulted when a resource carrying the new field is updated through the old apiserver; field deletions and modifications carry no such risk;

(2) Even if the new field is used to compute the container hash, since the kubelet is still at 1.16 while the apiserver is being upgraded, the hash is still computed according to 1.16, so cross access between apiserver versions does not cause containers to be rebuilt.

  • Field default value changes

A default value change means a resource field's default is filled differently by the old and new apiservers. As shown in the figure below, the default of the Pod field "FieldKey" is "default_value_A" in Kubernetes 1.16 and becomes "default_value_B" in 1.18, so the default value can be tampered with. The conditions for this are fairly strict: normally, before an Update, the current Pod configuration is fetched from the cluster, the fields of interest are changed, and the object is written back, which preserves the values of fields whose defaults have changed. But if the user does not fetch the cluster's Pod configuration and Updates directly, problems arise.

[Figure]

Regarding this issue, the following conclusions are drawn:

  • When a field relies on default filling, its value follows the defaulting logic of whichever apiserver handles the request.
  • If such a field is used to compute the container hash, there is a risk of container rebuilds.

Think and Solve

We have described the problem of cross access through multiple apiserver versions; now, how do we solve it?

The essence of the solution is to control Update/patch traffic. Some may wonder: isn't there also a field-difference problem when reading resources through multiple apiserver versions? If so, wouldn't get/watch traffic need to be controlled as well?

One pre-existing fact needs to be mentioned here. Before the apiserver upgrade there are already clients of multiple versions; some can see the field changes and some cannot, yet the system is in a stable state. High-version clients run stably even when they cannot see the new fields and have no hard dependence on the new features; low-version clients cannot see the new fields at all and do not care about them. Our core goal is to ensure there is no field tampering during the upgrade, avoiding uncontrollable control behavior caused by the same resource's view flipping back and forth during the upgrade.

Ant Sigma has implemented Service Mesh capabilities at the management-and-control layer, so during the upgrade we use the mesh's powerful, fine-grained traffic control to avoid cross access. The dark areas in the upgrade process keep shrinking, and we can be much more at ease.

etcd datastore update

Data storage in Kubernetes has a complete theory of its own. Here we briefly introduce the several conversions a resource goes through from the incoming request to storage in etcd, then describe in detail the problems that may be encountered during an upgrade and our thinking about them.

Kubernetes resource version conversion

  • Resource version in apiserver

A Kubernetes resource has an internal version, because a resource may correspond to multiple versions over the course of its iteration; for example, Deployment has extensions/v1beta1 and apps/v1.

To avoid problems, kube-apiserver must know how to convert between every pair of versions (e.g. v1⇔v1alpha1, v1⇔v1beta1, v1beta1⇔v1alpha1), so it uses a special internal version. The internal version, as a generic version, contains the fields of all versions and has the capabilities of all versions. The decoder first converts the incoming object to the internal version and then converts it to the storage version, which is yet another version, when writing to etcd.

  • apiserver request processing

The flow of a request through the apiserver:

```
http filter chain  =>  http handler

auth => sentinel => apf => conversion => admit => storage
```

The process of reading/storing a resource is as follows:

[Figure]

Data storage compatibility

This section focuses on how the data stored for API resources in Kubernetes remains compatible across an upgrade, mainly answering two questions:

Question 1: What does the API resource stored in etcd look like in Kubernetes?

A Kubernetes resource has an internal version that is used only for in-memory data processing and flow inside the apiserver. The data corresponding to this internal version is a superset of all the GVs the apiserver currently supports, for example the apps/v1beta1 and apps/v1 versions of Deployment supported in 1.16.

But when persisting to etcd, the apiserver first converts the internal version to a storage version. How is the storage version determined?

There are two cases as follows:

  • Core resources
    The storage version is determined when the apiserver is initialized; for a GroupVersion it is determined in two steps:

(1) Determine that resources in this GV use the same storage version as the rest of their group ---> statically defined overrides in StorageFactory

(2) Pick the version with the highest priority in the group as the storage version ---> the priority follows the statically defined order in which versions are registered with the Scheme
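As a side note, discovery exposes a storageVersionHash per resource (the StorageVersionHash feature, enabled by default in recent releases), which lets you observe, rather than guess, which GroupVersions of a resource share a storage version. A small sketch, assuming client-go and a reachable cluster:

```go
// Print the storageVersionHash reported by discovery for every resource; two
// GroupVersions of the same resource that print the same hash are stored in etcd
// under the same storage version.
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}
	_, lists, err := dc.ServerGroupsAndResources()
	if err != nil {
		// Discovery can fail partially (e.g. an unavailable aggregated API);
		// for this sketch we just report it and use whatever was returned.
		fmt.Println("partial discovery error:", err)
	}
	for _, l := range lists {
		for _, r := range l.APIResources {
			if r.StorageVersionHash != "" {
				fmt.Printf("%-40s %-30s %s\n", l.GroupVersion, r.Name, r.StorageVersionHash)
			}
		}
	}
}
```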

Question 2: How is data-storage compatibility maintained across different Kubernetes versions?

The storage version is not static across Kubernetes releases; it keeps being updated. First, a rule for upgrading a storage version in Kubernetes: a given API group's "storage version" may not advance its version number until a Kubernetes release comes out that supports both the old and the new version.

In other words, when a resource's storage version changes in some Kubernetes release, that release must support both the old and the new storage version. Combined with the Scheme's ability to convert between versions, upgrades and downgrades of the storage version become straightforward.

During an upgrade or downgrade, the apiserver can dynamically recognize which version the data currently stored in etcd uses, convert it to the internal version, and then convert it to the post-upgrade storage version when writing back to etcd.

[Figure]

Think and Solve

As can be seen above, the apiserver can convert the storage version dynamically, but only when the old data is actually read and written back, and this ability is not unlimited: a given Kubernetes release can only convert the handful of versions it is compatible with.

Suppose some resource's data in etcd has not been read for a long time, Kubernetes has meanwhile been upgraded several versions, and the apiserver is no longer compatible with the version stored in etcd. Reading that data then returns an error: the data can no longer be accessed, and in severe cases it may even crash the apiserver.

Therefore, we must ensure that the data in etcd is converted to the latest version as part of the upgrade. The natural thought is to implement a converter ourselves; the idea is fine, but it feels like reinventing the wheel, because the apiserver already contains converters for every version. We only need to make good use of this native capability: after each upgrade, use the apiserver's own storage-data conversion to rewrite the data, which easily avoids residual or incompatible data after several version upgrades. Don't underestimate this step; it is easily overlooked. Imagine the apiserver suddenly crashing in the middle of an already tense upgrade, and how you would feel.
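A minimal sketch of this idea follows (similar in spirit to the community's kube-storage-version-migrator, not Ant Sigma's actual tooling): list every object of a resource and write it back unchanged, so the apiserver re-encodes it in the current storage version. The csinodes GVR is just an example of a resource whose storage version moved between 1.16 and 1.18.

```go
// Force a storage-version rewrite after an upgrade by issuing no-op updates.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Example GVR: the storage version of csinodes moved from storage.k8s.io/v1beta1
	// to v1 around the 1.16 -> 1.18 window (see the rollback discussion below).
	gvr := schema.GroupVersionResource{Group: "storage.k8s.io", Version: "v1", Resource: "csinodes"}

	list, err := dyn.Resource(gvr).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for i := range list.Items {
		obj := &list.Items[i]
		// A no-op write: the object content is unchanged, but the apiserver decodes
		// whatever version is stored in etcd and re-encodes it in the current
		// storage version when it persists the update.
		if _, err := dyn.Resource(gvr).Update(context.TODO(), obj, metav1.UpdateOptions{}); err != nil {
			fmt.Printf("rewrite %s failed: %v\n", obj.GetName(), err)
		}
	}
}
```

Production tooling additionally needs conflict retries, pagination, and rate limiting, but the re-encoding itself is entirely the apiserver's native capability.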

Upgrades can be rolled back

Speaking of upgrades, everyone naturally thinks of rollback. Upgrading a Kubernetes cluster is similar to the iterative release of a business application: if something goes wrong along the way, the fastest way to stop the bleeding is to roll back the version.

A single application service is easy to roll back, but rolling back the data center's operating system is not so simple; the problems involved touch many aspects. We consider the following to be the common thorny problems in Kubernetes upgrade rollback:

  • API incompatibilities cause component calls to fail after rollback
  • Datastore incompatibility in etcd

API incompatibility

Several types of API change have been described in detail above; for rollback, the main ones are changes to API endpoints and changes to schema fields.

Changes to API endpoints are not a big problem, because all the lower-version controllers, clients, and apiservers had reached a stable state before the upgrade; the APIs they rely on are available, so there is little trouble after rollback. However, after upgrading to a higher apiserver version, some new GVKs usually appear, for example new Alpha capabilities, or a Beta GV that has become GA.

A real example: the GV v1beta1.discovery.k8s.io was added between 1.16 and 1.18. The lower-version apiserver does not recognize this new GV; although the apiserver starts normally after rollback, operations involving this GV run into trouble. For example, when a namespace is deleted, all resources under it must be deleted and all GVs are traversed in the process; at that point the namespace can never finish deleting.

The other issue is schema changes. Rollback can be viewed as another kind of upgrade, an "upgrade" from a high version to a low one, and the problems encountered are the same as upgrading from low to high: tampering caused by high- and low-version client access, and cross access while multiple apiserver versions coexist. The client side is not a problem during rollback, because the higher-version client is backward compatible; the cross-access problem is again avoided with fine-grained traffic control.

etcd datastore incompatible

The data storage problem encountered during the upgrade also appears during rollback. The crux is that when a resource's storage version has changed in the high-version apiserver, the new storage GV is not recognized by the low-version apiserver, so after rollback, fetching that resource through the old apiserver fails; the error occurs while converting from the storage version to the internal version.

An example: in 1.16 the storage version of csinodes is v1beta1; in 1.18 it is upgraded to v1. If you roll back directly from 1.18 to 1.16, fetching csinode resources fails, because the 1.16 apiserver has no v1 csinode at all.

[Figure]

At this point some may ask: why upgrade across versions at all?

If we upgraded version by version, 1.16 to 1.17 to 1.18, the above problem would not occur. The idea is good, but at the scale of Ant Sigma's Kubernetes, frequent upgrades are rather difficult, which is exactly our motivation for this work: to make upgrades more automated and more efficient. Once that goal is reached, this problem disappears; at the current stage, however, the storage-version incompatibility on rollback remains difficult.

Think and Solve

An upgrade itself introduces many variables. We try to keep the change under control, and the most basic methodology is to control variables. So for API compatibility our core principle is: suppress in advance any new features that are not strictly needed, to ensure that rollback remains possible.

There are two main goals of suppression:

  • New GVKs introduced by the higher-version apiserver
    Make sure they do not appear in the version being upgraded to
  • The storage version of data in etcd
    The storage version is transparent to users, and the suppression must likewise be imperceptible to them; both adjustments can be achieved through compatibility changes in the apiserver code.

For other compatibility problems there is currently no universal solution; we mainly expose them through upgrade-and-rollback e2e tests and make targeted compatibility fixes for whatever incompatibilities are found.

Compatibility suppression exists only during the upgrade and is a temporary state of the process. When suppressing, we must fully consider whether other uncontrollable problems are introduced, which depends on the changes of the GVK itself. Of course, everything must go from theory back to practice, and adequate e2e testing is also required. With the two sharp blades of theory and testing, we believe compatibility problems can be solved smoothly.

These are the three thorny problems encountered during the upgrade and their solutions. Next, we introduce the safeguards that come after the upgrade.

PART. 4 After upgrade

When a major version is upgraded, there is no guarantee that 100% of clients are upgraded to the corresponding new version. Although we push the clients that send Update traffic to upgrade beforehand, full coverage may not be achieved, and more importantly, after the upgrade some users may still access the cluster with lower-version clients. We hope that, through a webhook, low-version clients can be prevented from accidentally tampering with resource fields after the upgrade, achieving a genuinely lossless upgrade.

The principle of field control can be summarized in one sentence: prevent fields whose default value has changed from being reset to the new default by users who Update the resource with a lower-version client.

Field control

Residual problems

The biggest challenge of field control is accurately identifying whether a user tampered with a field unintentionally. Two key pieces of information are needed to decide this:

  • The user's original request content
    This is the key to judging whether tampering is unintentional: if the field appears in the original request, the user clearly intends to modify it, and no control is needed.
  • The user's client version
    Otherwise, it depends on whether the user's client version is lower than the current cluster version; if it is not lower, the user is modifying the field deliberately and no control is needed.

So how do we obtain these two pieces of information? First, the user's original request content: with Kubernetes' built-in capabilities, we cannot easily obtain it through webhooks or other plug-in mechanisms, because by the time the apiserver calls the webhook, the content has already gone through version conversion.

As for the client version, it can be obtained from the apiserver's monitoring data; but to integrate with the management-and-control link, we do not pull monitoring data directly and instead extract the information inside the apiserver.

Think and Solve

The essence of solving this problem is understanding the user's original intent: being able to tell which actions are unintentional tampering and which are genuine needs. This relies on the two pieces of information already mentioned:

  • User original request information
  • User client version information

Obtaining these two pieces of information accurately is one of our follow-up directions; we do not yet have a good solution, because understanding user intent, and doing so in real time, is very hard. The compromise is to define a set of rules that identify whether a user is tampering with certain core fields through a lower-version client. We would rather kill a thousand by mistake than let one bad operation slip through, because unexpected behavior of business containers can easily lead to a P-level incident. A sketch of what such a rule might look like in webhook form follows.
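The handler below is a hypothetical mutating-webhook sketch, not Ant Sigma's production implementation: the fieldKey field and its default values reuse the earlier illustrative example, and the heuristic simply restores the stored value whenever an Update silently resets the protected field to the new default.

```go
// Hypothetical mutating admission webhook that guards one "protected" field.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
)

const newDefault = "default_value_B" // the new default introduced by the upgrade (illustrative)

// partialObject decodes only the field we want to protect.
type partialObject struct {
	Spec struct {
		FieldKey string `json:"fieldKey,omitempty"`
	} `json:"spec"`
}

func mutate(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "bad admission review", http.StatusBadRequest)
		return
	}
	req := review.Request
	resp := &admissionv1.AdmissionResponse{UID: req.UID, Allowed: true}

	var oldObj, newObj partialObject
	_ = json.Unmarshal(req.OldObject.Raw, &oldObj)
	_ = json.Unmarshal(req.Object.Raw, &newObj)

	// Rule: on Update, the stored object had a non-default value but the incoming
	// object carries the new default -> treat it as unintended tampering by a
	// low-version client and patch the old value back.
	if req.Operation == admissionv1.Update &&
		newObj.Spec.FieldKey == newDefault &&
		oldObj.Spec.FieldKey != "" && oldObj.Spec.FieldKey != newDefault {
		patch := fmt.Sprintf(`[{"op":"replace","path":"/spec/fieldKey","value":%q}]`, oldObj.Spec.FieldKey)
		pt := admissionv1.PatchTypeJSONPatch
		resp.Patch = []byte(patch)
		resp.PatchType = &pt
	}

	review.Response = resp
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(review)
}

func main() {
	http.HandleFunc("/mutate", mutate)
	// Real admission webhooks must serve TLS; plain HTTP keeps the sketch short.
	_ = http.ListenAndServe(":8443", nil)
}
```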

[Figure]

PART. 5 Results

The above explorations have been implemented in Ant Sigma, and upgrade efficiency has improved substantially, mainly in the following aspects:

  • The upgrade no longer requires shutting off traffic, and the release window shrinks to zero. The whole process avoids delayed delivery of large numbers of Pods, platform users feel nothing at all, and the upgrade is much smoother;
  • The number of clients pushed to upgrade beforehand is reduced by 80%, overall promotion time drops by about 90%, and the manpower invested by business teams drops by 80%; the whole upgrade is much easier;
  • The upgrade process is automated, with manual intervention available at any time in case something unexpected happens; it frees the hands of Sigma R&D and SRE, who can watch the progress with coffee in hand;
  • The upgrade achieves precise traffic control: grayscale is applied by rule to the traffic of thousands of namespaces in the cluster, and dozens of BVT tests are run against new-version instances.

[Figure]

PART. 6 The road ahead

Overall, the core of upgrading is compatibility; at the same time, the whole process should be more automated and more observable. There are several directions to pursue next:

1. More precise

At present we still lack some of the information needed for control, and traffic control currently works at the namespace dimension; both are not precise enough. As mentioned earlier, we are building mesh capabilities for the control components, which will bring more flexible and fine-grained traffic control and data processing in the future. With the mesh, multi-version traffic grayscale during control-component upgrades becomes precise and controllable.

2. Platformization

The challenges and solutions introduced in this article are only part of the upgrade process. The full process includes the initial minimized client upgrade, the rolling upgrade of core components, and subsequent control; it is cumbersome and error-prone. We want to standardize and platformize this process, integrating tools such as difference comparison, traffic control, and traffic monitoring into the platform to make upgrades more convenient.

3. More efficient

The community iterates very fast, and our current pace cannot keep up with it. Through the smarter, platform-based capabilities above we will speed up infrastructure upgrades. Of course, upgrade speed is also closely tied to the cluster architecture. In the future Ant will move to a federated cluster architecture, under which user-facing APIs can be kept forward compatible and converted, greatly decoupling clients from the apiserver in terms of upgrades.

Upgrading Kubernetes clusters at Ant Sigma's scale is not easy. As Ant's core operating base, we want infrastructure upgrades to be truly imperceptible and lossless through technical means, so that users stop waiting and we stop worrying. Achieving these goals against a behemoth like Kubernetes is challenging, but that does not stop us from exploring. The road is long and hard, but we are on our way. As one of the leaders in large-scale Kubernetes adoption worldwide, Ant Group will continue to contribute more stable and easier-to-use technology to the community, helping cloud native become the core driving force of technology-driven development.

[Figure]

The Ant Sigma team is committed to building a large-scale cloud-native scheduling platform and to providing faster, better, and more stable container resource delivery for the business. Recently we have also achieved remarkable results in cluster stability and high performance. You are welcome to reach out and exchange ideas.

"References"

"Kubernetes API Policy"

"Introduction to Kubernetes Version 1.16"

"Kubernetes cluster correct upgrade posture"

We're hiring:

Ant Group's Kubernetes cluster scheduling system supports the scheduling of millions of container resources for Ant Group's online and real-time businesses, provides standard container services and dynamic resource scheduling capabilities to the upper-level financial businesses, and shoulders the responsibility of optimizing Ant Group's resource costs. We have one of the industry's largest Kubernetes clusters, the most in-depth cloud-native practice, and excellent scheduling technology.

If you are interested in Kubernetes, cloud native, containers, kernel isolation and co-location, scheduling, or cluster management, you are welcome to join us. Positions are available in Beijing, Shanghai, and Hangzhou.

Email: xiaoyun.maoxy@antgroup.com

Recommended reading of the week

Climbing the peak of scale - Ant Group's large-scale Sigma cluster ApiServer optimization practice

Ant Group 10,000-scale k8s cluster etcd High Availability Construction Road

Ant large-scale Sigma cluster Etcd split practice

The Exploration and Practice of Service Mesh in Industrial and Commercial Bank of China



SOFAStack

SOFAStack™ (Scalable Open Financial Architecture Stack) is a set of middleware for quickly building financial-grade distributed architectures, and a collection of best practices honed in financial scenarios.