
Author | Ink Seal
Source | Alibaba Cloud Native Official Account

A week ago, we introduced "How to Find Problems Before Users Do in Large-Scale K8s Clusters".

In this article, we continue by introducing how ASI SRE (ASI: Alibaba Serverless infrastructure, Alibaba's unified infrastructure designed for cloud-native applications) explores building gray-scale change capabilities for ASI's own infrastructure under the Kubernetes system in large-scale cluster scenarios.

What we are facing

ASI was born during Alibaba Group's comprehensive move to the cloud. While carrying a large amount of the Group's infrastructure through full cloud-native adoption, ASI's own architecture and form have been evolving continuously.

ASI adopts a Kube-on-Kube architecture overall. It maintains a core Kubernetes meta-cluster at the bottom and deploys the control-plane components of each tenant cluster in it: apiserver, controller-manager, scheduler, and etcd. In each business cluster, addon components such as controllers and webhooks are deployed to jointly support ASI's various capabilities. At the data-plane level, some ASI components are deployed on nodes as DaemonSets, while others are distributed as RPM packages.

[Figure 1]

At the same time, ASI carries hundreds of clusters and hundreds of thousands of nodes across Group and public-cloud ("sales area") scenarios; even in the early stage of ASI's construction, the nodes under its jurisdiction numbered in the tens of thousands. During the rapid evolution of ASI's architecture, component and online changes were quite frequent: in the early days, ASI component changes could reach hundreds per day. And for ASI's core basic components such as the CNI plug-in, CSI plug-in, etcd, and Pouch, a bad change to any one of them could cause a cluster-wide failure and irreparable loss to the upper-layer business.

[Figure 2]

In short, large cluster scale, numerous components, frequent changes, and complex business forms are the major challenges in building gray-scale capability and a change system for ASI, or for any Kubernetes infrastructure layer. At the time, several change systems already existed within Alibaba for ASI/Sigma, but each had its limitations.

  • Space-based: capable of general node releases, but unaware of ASI metadata such as clusters and node sets.
  • UCP: the early release platform of Sigma 2.0, long unmaintained.
  • sigma-deploy: the Sigma 3.x release platform, which updates Deployments/DaemonSets by patching the image.
  • asi-deploy: ASI's early release platform, which managed ASI's own components. It supported only image patches, adapted only to Aone's CI/CD pipeline, and supported grayscale across multiple environments, but its grayscale granularity was relatively coarse.

Therefore, drawing on the lessons of previous generations of Sigma/ASI release platforms, we set out from the change process itself, putting system capability first and process specifications second, to gradually build ASI's gray-scale system and an operation-and-maintenance change platform for the Kubernetes technology stack, ensuring the stability of thousands of large-scale clusters.

Assumptions and ideas

The evolution of ASI's own architecture and form would greatly affect how its gray-scale system should be built. Therefore, early in ASI's development, we made the following bold assumptions about ASI's future form:

  1. ACK as the base: ACK (Alibaba Cloud Container Service) provides various cloud capabilities; ASI will rely on and reuse these capabilities, and feed the advanced experience accumulated in Alibaba Group back to the cloud.
  2. Large cluster scale: to improve cluster resource utilization, ASI will exist as large clusters, each providing a common resource pool that hosts multiple second-party tenants.
  3. Many clusters: ASI divides clusters not only by the Region dimension, but also into independent clusters by dimensions such as business side.
  4. Many addons: Kubernetes is an open architecture that spawns many operators, and these operators work with ASI core components to provide various capabilities externally.
  5. Complex change scenarios: ASI component changes will not be limited to image releases; Kubernetes' declarative object lifecycle management makes change scenarios inherently complex.

Based on the above assumptions, we can summarize the issues that needed to be resolved in the early stage of ASI construction:

  1. How to build gray-scale change capability within a single large-scale cluster?
  2. How to establish gray-scale change capability across many clusters at scale?
  3. With numerous and varied components, how to manage them and ensure that each component release does not affect the online environment?

[Figure 3]

Let us change perspective, step away from the cluster dimension, and try to address the complexity of change from the component's point of view. The lifecycle of each component can be roughly divided into a requirements-and-design phase, an R&D phase, and a release phase. For each phase, we want to standardize it, account for the characteristics of Kubernetes itself, embed the fixed specifications into the system, and guarantee the gray-scale process through system capability.

Combining ASI's form with the particularities of its change scenarios, we systematically build ASI's gray-scale system along the following lines:

  • Requirements and design phase

    • Solution TechReview
    • Component online change review
  • Component development phase

    • Standardized component development process
  • Component release and change phase

    • Provide a component workbench for large-scale component management
    • Build ASI metadata and refine gray-scale units
    • Build ASI single-cluster and cross-cluster gray-scale capabilities

Building the gray-scale system

1. R&D process standardization

The R&D process of ASI core components can be summarized as follows:

For ASI's own core components, we worked with colleagues from the quality engineering team to build an e2e test process for ASI components. In addition to each component's unit and integration tests, we set up a separate e2e cluster for normalized, end-to-end verification of ASI's overall functionality.

[Figure 4]

From the perspective of a single component: after a new feature is developed and its code review passes, the code is merged into the develop branch and the e2e process is triggered immediately. Once the image is built by the chorus system (a cloud-native testing platform), ASIOps (the ASI operation and management platform) deploys it to the corresponding e2e cluster and runs the standard Kubernetes Conformance suite to verify that functionality within the Kubernetes scope is normal. Only when all test cases pass can the component version be marked as ready for flattening (broad rollout); otherwise subsequent releases are blocked by control restrictions.

However, as mentioned above, Kubernetes' open architecture means a cluster's functionality depends not only on core components such as the control plane and scheduler, but also, to a large extent, on the upper-layer operators. White-box testing within the Kubernetes scope therefore cannot cover all of ASI's scenarios, and changes to underlying components can greatly affect the upper-layer operators. So, on top of the white-box Conformance tests, we added black-box test cases that verify the operators themselves: for example, scale-out and scale-in initiated from the upper-layer PaaS, and verification that the release link's quota-checking capability runs normally in the cluster.

[Figure 5]

2. Component scale management

Given ASI's many components and many clusters, we extended the original asi-deploy functionality, taking components as the entry point, to strengthen component management across multiple clusters and move from image management to YAML management.

[Figure 6]

Based on Helm Template capabilities, we split a component's YAML into three parts: template, image, and configuration, which carry the following information:

  • Template: information fixed across all environments in the YAML, such as apiVersion and kind;
  • Image: image-related information in the YAML, expected to be consistent within a single environment or across all clusters;
  • Configuration: information bound to a single environment and a single cluster in the YAML; its content may vary from cluster to cluster;

A complete YAML is therefore rendered from template, image, and configuration. ASIOps then manages image information and configuration information along the cluster dimension and the time dimension (multiple versions), and computes the distribution of a component's current versions across the many clusters as well as version consistency within a single cluster.

For image versions, we drive convergence to a unified version through the system, to avoid online problems caused by overly old versions; for configuration versions, we manage down their complexity to prevent misconfigurations from reaching any cluster.
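To make the three-part split concrete, here is a minimal sketch (with an illustrative schema, not ASIOps' actual one) of rendering a complete YAML from a fixed template, a per-environment image, and a per-cluster configuration:

```python
from string import Template

# Template part: fixed across environments. Image and configuration are
# filled in at render time. Field names here are purely illustrative.
TEMPLATE = Template("""\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: $name
spec:
  template:
    spec:
      containers:
      - name: $name
        image: $image
        args: ["--log-level=$log_level"]
""")

def render(name, image, config):
    # template is fixed; image varies per environment; config per cluster
    return TEMPLATE.substitute(name=name, image=image, **config)

yaml_text = render("demo-controller",
                   "registry.example.com/demo:v1.2.3",
                   {"log_level": "info"})
print(yaml_text)
```

Separating the parts this way lets the platform version the image and the configuration independently per cluster while the template stays shared.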

[Figure 7]

With this basic component model in place, we want releases to be more than just "replace the image field in the workload". Since we now maintain the entire YAML, including configuration beyond the image, we must support changes other than image changes. We therefore deliver YAML in a way as close as possible to kubectl apply.

We record three kinds of YAML specification information:

  • Cluster Spec: the current state of the specified resource in the cluster;
  • Target Spec: the YAML about to be released into the cluster;
  • DB Spec: the YAML of the last successful deployment, serving the same role as the last-applied-configuration annotation stored by kubectl apply.

[Figure 8]

For a YAML built from image, configuration, and template, we collect the three Specs above and diff them to obtain a resource patch, then filter out dangerous fields that are not allowed to change. Finally, the overall patch is sent to the APIServer as a strategic merge patch or merge patch, triggering the workload to re-enter the reconcile loop and converge its actual state in the cluster.
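As an illustration of this comparison step, the following sketch computes a merge-patch-style diff between the DB Spec (last applied) and the Target Spec, and filters out a deny list of dangerous fields. The deny list and flattened-path representation are assumptions for illustration; the real ASIOps logic is richer:

```python
# Fields that must never be patched (illustrative deny list)
DANGEROUS_FIELDS = {"spec.clusterIP", "metadata.uid"}

def flatten(d, prefix=""):
    """Flatten a nested dict into dotted-path keys."""
    out = {}
    for k, v in d.items():
        path = f"{prefix}.{k}" if prefix else k
        if isinstance(v, dict):
            out.update(flatten(v, path))
        else:
            out[path] = v
    return out

def diff_patch(db_spec, target_spec):
    """Keep only target fields that changed and are safe to patch."""
    db, target = flatten(db_spec), flatten(target_spec)
    return {
        path: value
        for path, value in target.items()
        if db.get(path) != value and path not in DANGEROUS_FIELDS
    }

db = {"spec": {"replicas": 2, "clusterIP": "10.0.0.1"}}
target = {"spec": {"replicas": 3, "clusterIP": "10.9.9.9"}}
print(diff_patch(db, target))  # {'spec.replicas': 3}
```

Note how the replicas change survives while the clusterIP change, though present in the diff, is filtered out as dangerous.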

In addition, because ASI components are strongly interdependent, many scenarios require releasing multiple components together, for example when initializing a cluster or performing a whole-cluster release. We therefore added the concept of an Addon Release on top of single-component deployment: a collection of components represents a release version of ASI as a whole, and the deployment flow is generated automatically from the dependencies among components, guaranteeing that no circular dependency appears in the overall release process.

[Figure 9]

3. Single-cluster gray-scale capability building

In a cloud-native environment, we describe an application's deployment declaratively, in terms of its desired final state, and Kubernetes provides the ability to maintain that state for various workloads: an operator compares a workload's current state against the desired state and reconciles the two. This reconciliation, in other words the process of a workload release or rollback, can be governed by a release strategy defined in the operator, handling the "procedural process inside a declarative, final-state scenario".

Compared with upper-layer application workloads on Kubernetes, underlying infrastructure components care more about the release strategy and the ability to pause grayscale during a release. That is, whatever the component type, it must be possible to stop the release promptly, providing more time for feature verification, decision making, and rollback. Specifically, these capabilities can be summarized as follows:

  • updateStrategy: streaming/rolling upgrade
  • pause/resume: pause and resume capability
  • maxUnavailable: stop the upgrade quickly once the number of unavailable replicas reaches a threshold
  • partition: pause capability for upgrades; upgrade only a fixed number of replicas at a time, keeping a certain number of replicas on the old version

ASI enhances Kubernetes' native workload and node capabilities. Relying on in-cluster operators such as Kruise and KubeNode, together with the upper management platform ASIOps, we implemented the gray-scale capabilities above for Kubernetes infrastructure components. For Deployment / StatefulSet / DaemonSet / data-plane components, the capabilities supported in a single-cluster release are as follows:

[Figure 10]

The following briefly introduces our gray-scale implementation for components of different workload types. For implementation details, please follow our open-source project OpenKruise and the soon-to-be-open-sourced KubeNode.

1) Operator Platform

Most Kubernetes operators are deployed as a Deployment or StatefulSet. During an operator release, once the image field changes, all replicas of the operator are upgraded; if the new version has a problem, the damage can be irreparable.

For this type of operator, we separated controller-runtime from the operator and built a centralized component, operator-manager (controller-mesh in the OpenKruise open-source implementation). An operator-runtime sidecar container is also added to each operator pod, providing the component's main container with core operator capabilities over a gRPC interface.

[Figure 11]

After an operator establishes a Watch connection to the APIServer, observed events are converted into a stream of reconcile tasks for the operator to process (that is, the operator's traffic). operator-manager centrally takes over all operator traffic, shards it according to rules, and distributes it to the different operator-runtimes; the workqueue in each runtime then triggers the actual operator's reconcile tasks.

During grayscale, operator-manager can split operator traffic between the old and new replicas at the namespace level or by hash sharding, so that the workloads handled by the two replicas can be used to verify whether the grayscale release has problems.
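The namespace-hash routing idea can be sketched as follows; the 20% canary ratio and the routing interface are assumptions for illustration, not operator-manager's actual API:

```python
import hashlib

def route(namespace, canary_percent=20):
    """Deterministically route a namespace's reconcile traffic to the
    new (canary) or old operator replica based on a hash bucket."""
    bucket = int(hashlib.md5(namespace.encode()).hexdigest(), 16) % 100
    return "new-replica" if bucket < canary_percent else "old-replica"

for ns in ["kube-system", "team-a", "team-b"]:
    print(ns, "->", route(ns))
```

Because the routing is deterministic per namespace, the same workloads keep hitting the same replica throughout the grayscale, making it possible to attribute problems to the new version.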

2) Advanced DaemonSet

The community's native DaemonSet supports RollingUpdate, but its rolling upgrade supports only maxUnavailable, which is unacceptable for ASI clusters with thousands of nodes: once the image is updated, all DaemonSet Pods are upgraded with no way to pause, protected only by the maxUnavailable strategy. And if a buggy DaemonSet version's process still starts normally, maxUnavailable never takes effect.

The community also provides an OnDelete mode, in which Pods are deleted manually to create new ones, with release order and grayscale controlled centrally by the release platform. This model cannot form a closed loop within a single cluster; all the pressure moves up to the release platform, and having the upper platform evict Pods carries high risk. The better approach is for the workload itself to provide closed-loop component updates. We therefore enhanced DaemonSet in Kruise to support the important gray-scale capabilities described above.

The following is an example of a basic Kruise Advanced DaemonSet:

apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
spec:
  # ...
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 5
      partition: 100
      paused: false

partition means the number of Pod replicas that keep the old image. During a rolling upgrade, once the Pods beyond the partition have been upgraded, no further Pods are updated to the new image. ASIOps controls the partition value from above to drive the DaemonSet's rolling upgrade, cooperating with the other updateStrategy parameters to ensure grayscale progress, and performs targeted verification on the newly created Pods.
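For intuition, here is a small sketch of how an upper-level platform could translate a batch plan into descending partition values; the function and batch sizes are illustrative, not ASIOps' actual scheduler:

```python
def partition_schedule(total, batches):
    """Given the total DaemonSet Pod count and per-batch upgrade counts,
    return the descending partition values (Pods still on the old
    version) that the platform would set batch by batch."""
    upgraded, plan = 0, []
    for batch in batches:
        upgraded = min(total, upgraded + batch)
        plan.append(total - upgraded)  # partition = old-version Pods left
        if upgraded == total:
            break
    return plan

# e.g. 100 Pods upgraded in batches of 1, 5, 10, then the rest
print(partition_schedule(100, [1, 5, 10, 84]))  # [99, 94, 84, 0]
```

Each step lowers partition just enough to release one batch, leaving room to pause and verify before the next decrement.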

[Figure 12]

3) MachineComponentSet

MachineComponentSet is the workload of the KubeNode system. ASI's node components that live outside Kubernetes (components that cannot be released via Kubernetes' own workloads), such as Pouch, containerd, and kubelet, are all released through this workload.

A node component is represented inside Kubernetes by the custom resource MachineComponent, which contains the installation script for a specific version of the component (such as pouch-1.0.0.81), installation environment variables, and other information. A MachineComponentSet is the mapping between a node component and a node set, indicating that this batch of machines should install this version of the node component. The central Machine-Operator reconciles this mapping declaratively, comparing the component version on each node with the target version and attempting to install the specified version.

[Figure 13]

For grayscale release, the design of MachineComponentSet is similar to Advanced DaemonSet's: it provides RollingUpdate features including partition and maxUnavailable. For example, here is a MachineComponentSet:

apiVersion: kubenode.alibabacloud.com/v1
kind: MachineComponentSet
metadata:
  labels:
    alibabacloud.com/akubelet-component-version: 1.18.6.238-20201116190105-cluster-202011241059-d380368.conf
    component: akubelet
  name: akubelet-machine-component-set
spec:
  componentName: akubelet
  selector: {}
  updateStrategy:
    maxUnavailable: 20%
    partition: 55
    pause: false

Similarly, when controlling grayscale upgrades of node components, the upper-level ASIOps interacts with the cluster-side Machine-Operator, modifying the partition and other fields of the specified MachineComponentSet to perform the rolling upgrade.

Compared with the traditional node-component release model, the KubeNode system also closes the node component lifecycle inside the Kubernetes cluster and pushes grayscale control down to the cluster side, reducing the central side's burden of node metadata management.

4. Cross-cluster gray-scale capability building

Alibaba internally formulated the Change Red Line 3.0 for cloud products and basic products, requiring that change operations on control-plane and data-plane components support batched grayscale, intervals between batches, observability, pause, and rollback. However, grayscale with the Region as the change unit does not fit ASI's complex scenarios, so we tried to refine the change-unit types for control-plane and data-plane changes on ASI.

Abstracting upward and downward around the cluster as the basic unit, we obtain the following basic units:

  • Cluster group: clusters that share a business party (the second-party users ASI hosts), network domain (sales area/OXS/Group), and environment (e2e/test/pre-release/canary/small-traffic/production); their monitoring, alerting, inspection, and release configurations therefore have commonality.
  • Cluster: the ASI cluster concept, corresponding to one Kubernetes cluster.
  • Node set: a set of nodes with common characteristics, carrying information such as resource pool and sub-business pool.
  • Namespace: a single Namespace in a single cluster; in ASI, one upper-layer service usually corresponds to one Namespace.
  • Node: a single host, corresponding to one Kubernetes Node.

[Figure 14]

For each release mode (control components, node components), following the principle of minimizing the blast radius, we arrange the corresponding gray-scale units in series, so that the gray-scale process is solidified into the system and component releases must follow it, deploying unit by unit. During the arrangement, we mainly consider the following factors:

  • Business attributes
  • Environment (test, pre-release, small-traffic, production)
  • Network domain (Group V, sales area, OXS)
  • Cluster size (number of Pods/Nodes)
  • User attributes (GC level of the hosted users)
  • Unit / center
  • Component characteristics

At the same time, we score each unit's weight and arrange the dependencies between units. For example, below is the release pipeline of an ASI monitoring component. Since the monitoring component uses the same solution in all ASI scenarios, it is flattened out to all ASI clusters. During the rollout, it is first verified on the pan-e-commerce transaction clusters, then released to second-party tenants in the Group VPC, and finally to the sales-area clusters.
Within each cluster, the component is released in batches of 1/5/10... according to the single-cluster grayscale method discussed in the previous section.
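The unit ordering described above might be sketched like this, with hypothetical scoring fields standing in for ASIOps' real weights:

```python
def order_units(units):
    """Sort gray-scale units so the blast radius grows gradually:
    lower-risk environments first, then smaller clusters before
    larger ones. The scoring fields are illustrative."""
    return sorted(units, key=lambda u: (u["env_rank"], u["size"]))

units = [
    {"name": "prod-trade", "env_rank": 3, "size": 5000},
    {"name": "e2e",        "env_rank": 0, "size": 10},
    {"name": "pre",        "env_rank": 1, "size": 100},
]
print([u["name"] for u in order_units(units)])  # ['e2e', 'pre', 'prod-trade']
```

In practice the ordering also encodes dependencies and per-dimension weights (network domain, user attributes, and so on), not just a single sort key.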

[Figure 15]

After arranging the gray-scale units, we obtain the basic skeleton of a component's flattening pipeline. For each gray-scale unit on the skeleton, we try to enrich its pre-checks and post-checks, so that after each release we can confirm whether the grayscale succeeded and effectively block failed changes. At the same time, we set a quiet period for each batch, leaving enough time for post-checks to run and for component developers to verify. Currently, the pre/post checks of a single batch include:

  • Global risk rules (network blockade, circuit breaking, etc.)
  • Release time window (ASI pilots a rule prohibiting releases on weekends)
  • KubeProbe cluster black-box probing
  • Canary tasks (ASI full-link scale-out/scale-in tasks initiated by Normandy)
  • Core monitoring metrics
  • Component logs (component panic alarms, etc.)
  • Active diagnosis tasks (actively query whether the relevant monitoring data changes significantly during the release)
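A simplified gate loop over such checks could look like the following; the check interface (each check returning True on pass) is a hypothetical simplification:

```python
def run_batch(checks):
    """Run the post-checks after a batch; any failing check blocks
    the pipeline instead of letting the rollout continue."""
    for check in checks:
        if not check():
            return "blocked"
    return "proceed"

ok = lambda: True     # stand-in for a passing check
fail = lambda: False  # stand-in for a failing check
print(run_batch([ok, ok]))    # proceed
print(run_batch([ok, fail]))  # blocked
```

The quiet period mentioned above corresponds to running (and re-running) these checks for a configured duration before the next batch is allowed to start.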

[Figure 16]

Connecting the entire multi-cluster release process together, we obtain the full path a component travels from R&D and testing to online release. The events experienced along the way are as follows:

[Figure 17]

For the implementation of the pipeline orchestration, we surveyed the community's existing Tekton and Argo. However, much of the logic in our release process is unsuitable for running alone in a container, our needs go beyond CI/CD, and neither project was stable in the community when we began the design. We therefore implemented our own orchestration, borrowing Tekton's basic design (task / taskrun / pipeline / pipelinerun) and keeping a common design direction with the community; in the future, we will adjust toward approaches closer to the community and more cloud-native.

Results

After nearly a year and a half of construction, ASIOps now carries nearly a hundred control clusters and nearly a thousand business clusters (including ASI clusters, Virtual Cluster multi-tenant virtual clusters, Sigma 2.0 virtual clusters, etc.), and more than 400 components (including ASI core components, second-party components, etc.). ASIOps also contains more than 30 flattening pipelines, suited to the different release scenarios of ASI itself and of the business parties ASI carries.

There are now nearly 400 component changes per day (including image changes and configuration changes), and more than 7,900 of them have gone through pipelines to date. To improve release efficiency, we enabled automatic grayscale within a single cluster wherever pre/post checks are complete; most ASI data-plane components use this capability today.

The following is an example of a component flattening through ASIOps:

[Figure 18]

Batched grayscale and post-check change blocking on ASIOps have also helped us stop failures that component changes would otherwise have caused. For example, while the Pouch component was being grayed, a version incompatibility made a cluster unavailable. The post-check triggered after the release discovered the problem and blocked the grayscale process.

[Figure 19]

Most components on ASIOps are underlying infrastructure components of ASI/Kubernetes, and for nearly a year and a half there has been no failure caused by a component change. We strive to solidify the specifications through system capability, reducing and eliminating changes that violate the change red lines, so that failures gradually shift rightward: from low-level failures caused by changes to complex failures caused by code bugs.

Outlook

As the scenarios covered by ASI gradually expand, ASIOps, as its management and control platform, must meet the challenges of more complex scenarios and larger numbers of clusters and components.

First, we urgently need to resolve the trade-off between stability and efficiency. When the number of clusters managed by ASIOps reaches a certain level, a single component flattening takes considerable time. We hope that once sufficient pre and post verification capability is in place, we can offer fully managed changes: the platform automatically flattens components within the release scope and blocks changes effectively, truly achieving CI/CD automation at the Kubernetes infrastructure layer.

Second, we currently arrange the grayscale units and determine the grayscale order manually. In the future, we hope to build complete ASI metadata and automatically filter, score, and arrange all units within each release's scope.

Finally, ASIOps currently provides grayscale only for component-related changes, while the changes within ASI's scope go far beyond components. Grayscale should be a general capability, and the grayscale pipeline also needs to empower other scenarios such as resource delivery, operations, and contingency-plan execution.
In addition, the gray-scale capabilities of the whole management and control platform are not tightly coupled to Alibaba; they are built entirely on Kruise / KubeNode and other workloads. In the future, we will explore open-sourcing this entire set of capabilities to the community.

