Author: Zhao Mingshan (Liheng)

Foreword

OpenKruise [1] is Alibaba Cloud's open source suite for automating the management of cloud-native applications, and it is currently a Sandbox project hosted by the Cloud Native Computing Foundation (CNCF). It grew out of Alibaba's years of experience with containerization and cloud-native technology, and it packages the best practices Alibaba applies at scale in its internal production environment as standard extension components for Kubernetes. Beyond its original areas such as workloads and sidecar management, Kruise is now exploring the field of progressive delivery.

What is progressive delivery?

The term "progressive delivery" originated in large, complex industrial projects. It tries to break a complex project into stages and to reduce delivery cost and time through continuous, small, closed-loop iterations. With the spread of Kubernetes and cloud-native ideas, and especially since the emergence of continuous deployment pipelines, progressive delivery offers Internet applications both infrastructure and implementation methods.

In product iteration, the concrete behaviors of progressive delivery can be attached to the pipeline, so the whole delivery pipeline can be seen as one product iteration and one progressive delivery cycle. In practice, progressive delivery is implemented with techniques such as A/B testing and canary (grayscale) releases. Taking Taobao product recommendation as an example, every major feature release goes through a typical progressive delivery process, which improves the stability and efficiency of delivery:

[Figure: a typical progressive delivery process]

Why Kruise Rollout

Kubernetes provides only the Deployment controller for application delivery, plus the Ingress and Service abstractions for traffic. It has no standard definition of how to combine these pieces into an out-of-the-box progressive delivery solution. Argo Rollouts and Flagger are the most popular progressive delivery solutions in the community, but they differ from our vision in some capabilities and concepts. First, they support only Deployment, not StatefulSet or DaemonSet, let alone custom operators. Second, they are not non-intrusive: Argo Rollouts cannot work with the native Kubernetes Deployment, and Flagger copies the Deployment created by the business, which changes its name and causes compatibility problems with GitOps or a self-built PaaS.

In addition, a wide variety of competing approaches is a hallmark of cloud native. The Alibaba Cloud container team is responsible for evolving the cloud-native architecture of the entire container platform, and it has strong demand for progressive application delivery. Drawing on community solutions and on Alibaba's internal scenarios, we set the following goals for the design of Rollout:

  1. Non-intrusive: make no changes to the native workload controllers or to the user's application YAML definitions, so that native resources stay clean and consistent
  2. Extensibility: support native Kubernetes workloads, custom workloads, and traffic scheduling through Nginx, Istio, and others in an extensible way
  3. Ease of use: work out of the box and combine easily with community GitOps tools or a self-built PaaS

Kruise Rollout: a Bypass Progressive Delivery Capability

Kruise Rollout [2] is Kruise's abstract definition model for progressive delivery. A complete Rollout definition covers canary, blue-green, and A/B testing releases in which application traffic matches the instances actually deployed. The release process can be batched and paused automatically based on Prometheus metrics. Rollout attaches as a bypass, so the application does not notice it, and it is compatible with a variety of existing workloads (Deployment, CloneSet, DaemonSet). The architecture is as follows:

[Figure: Kruise Rollout architecture]

Traffic scheduling (canary, A/B test, blue-green) and batch release

Canary and batch releases are the most commonly used release methods in progressive delivery practice. A Rollout for them is defined as follows:

  1. workloadRef selects, as a bypass, the workload that the Rollout acts on (Deployment, CloneSet, DaemonSet).
  2. canary.steps divides the whole Rollout process into five batches. The first batch releases a single canary Pod of the new version, routes 5% of the traffic to it, and waits for manual confirmation before the release continues.
  3. The second batch upgrades 40% of the Pods to the new version and routes 40% of the traffic to them. Once the batch is complete, Rollout waits 10 minutes (600 seconds) and then releases the next batch automatically.
  4. trafficRoutings declares that the business ingress controller is Nginx. It is designed to be extensible, so besides Nginx it can also support other traffic controllers such as Istio and ALB.
apiVersion: rollouts.kruise.io/v1alpha1
kind: Rollout
spec:
  objectRef:
    workloadRef:
      apiVersion: apps/v1
      # Deployment, CloneSet, Advanced DaemonSet, etc.
      kind: Deployment
      name: echoserver
  strategy:
    canary:
      steps:
        # route 5% of the traffic to the new version
      - weight: 5
        # wait for manual confirmation before releasing the later steps
        pause: {}
        # optional: number of Pods released in this step; if unset, 'weight' is used (5% here)
        replicas: 1
      - weight: 40
        # wait 600s, then release the later steps automatically
        pause: {duration: 600}
      - weight: 60
        pause: {duration: 600}
      - weight: 80
        pause: {duration: 600}
        # the last batch needs no configuration
      trafficRoutings:
        # name of the echoserver Service
      - service: echoserver
        # nginx ingress
        type: nginx
        # name of the echoserver Ingress
        ingress:
          name: echoserver
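
To make the batching concrete, here is a minimal Python sketch of how the steps above might translate into a release plan. It is an illustration only, not Kruise Rollout's implementation: the Step class, the plan function, the total of 10 replicas, and the round-up rule are all assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    """Hypothetical mirror of one entry in canary.steps."""
    weight: int                           # percentage of traffic for this batch
    replicas: Optional[int] = None        # explicit Pod count; overrides weight when set
    pause_seconds: Optional[int] = None   # None models `pause: {}` (manual approval)


def plan(total_replicas: int, steps: List[Step]) -> List[Tuple[int, int, str]]:
    """Return (new_pods, traffic_percent, gate) for each batch, including the final one."""
    batches = []
    for step in steps:
        if step.replicas is not None:
            pods = step.replicas
        else:
            # Assumption: round up so that a small weight still yields at least one Pod.
            pods = math.ceil(total_replicas * step.weight / 100)
        if step.pause_seconds is None:
            gate = "manual approval"
        else:
            gate = f"wait {step.pause_seconds}s"
        batches.append((pods, step.weight, gate))
    # The last batch is implicit: every Pod is upgraded and takes all traffic.
    batches.append((total_replicas, 100, "done"))
    return batches


steps = [
    Step(weight=5, replicas=1),
    Step(weight=40, pause_seconds=600),
    Step(weight=60, pause_seconds=600),
    Step(weight=80, pause_seconds=600),
]
for batch in plan(10, steps):
    print(batch)
```

With four configured steps the plan has five batches, which matches the description above: the final, full release is implied rather than written out.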

Automatic batching and pausing based on Metrics

During a Rollout, the business's Prometheus metrics can be analyzed automatically and combined with the steps to decide whether the Rollout should continue or be suspended. In the example below, the HTTP status codes of the business over the past five minutes are analyzed after each batch is released. If the success rate (the share of requests that do not return a 5xx status code) falls below 99.5%, the Rollout is suspended.

apiVersion: rollouts.kruise.io/v1alpha1
kind: Rollout
spec:
  objectRef:
    ...
  strategy:
    canary:
      steps:
      - weight: 5
        ...
      # metrics analysis
      analysis:
        templates:
        - templateName: success-rate
          startingStep: 2 # delay the analysis run until the weight: 40 step
          args:
          - name: service-name
            value: guestbook-svc.default.svc.cluster.local

---
# metrics analysis template
apiVersion: rollouts.kruise.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 5m
    # NOTE: prometheus queries return results in the form of a vector.
    # So it is common to access the index 0 of the returned array to obtain the value
    successCondition: result[0] >= 0.995
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]
          )) / 
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
          ))
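
The decision that the analysis makes can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the controller's code: the function names and the request counts are hypothetical, and the 99.5% threshold is taken from the description above.

```python
from typing import Dict


def success_rate(requests_by_code: Dict[str, float]) -> float:
    """Share of requests whose status code is not 5xx (non-5xx divided by total)."""
    total = sum(requests_by_code.values())
    if total == 0:
        raise ValueError("no requests observed in the window")
    ok = sum(n for code, n in requests_by_code.items() if not code.startswith("5"))
    return ok / total


def should_pause(rate: float, threshold: float = 0.995) -> bool:
    """Suspend the rollout when the success rate drops below the threshold."""
    return rate < threshold


# Hypothetical request counts per status code for the last five minutes.
window = {"200": 9900, "404": 20, "503": 80}
rate = success_rate(window)
print(f"success rate {rate:.4f}, pause rollout: {should_pause(rate)}")
```

Note that a 404 counts as a success here, because the ratio only excludes 5xx responses, just as the Prometheus query does with response_code!~"5.*".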

Canary Release Practice

  1. Suppose the user has deployed the echoserver service on Kubernetes as follows and exposes it externally through an nginx ingress:

[Figure: the echoserver service deployed on Kubernetes and exposed through an nginx ingress]

  2. Define a Kruise Rollout canary release (1 new-version Pod and 5% of the traffic) and apply it to the Kubernetes cluster with kubectl apply -f
apiVersion: rollouts.kruise.io/v1alpha1
kind: Rollout
metadata:
  name: rollouts-demo
spec:
  objectRef:
    ...
  strategy:
    canary:
      steps:
      - weight: 5
        pause: {}
        replicas: 1
      trafficRoutings:
        ...
  3. Upgrade the echoserver image version (1.10.2 -> 1.10.3) and apply the change to the Kubernetes cluster with kubectl apply -f
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echoserver
...
spec:
  ...
  template:
    spec:
      containers:
      - name: echoserver
        image: cilium/echoserver:1.10.3

When Kruise Rollout detects this change, it automatically starts the canary release process. As shown below, a canary Deployment, Service, and Ingress are generated automatically, and 5% of the traffic is routed to the new-version Pods:

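[Figure: the generated canary Deployment, Service, and Ingress, with 5% of the traffic routed to the new version]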
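
For readers unfamiliar with how an nginx ingress splits traffic by weight, the sketch below builds the two annotations that the community ingress-nginx controller uses for weighted canary routing. Whether Kruise Rollout sets exactly these annotations on the canary Ingress is an assumption here, and the helper function name is hypothetical.

```python
from typing import Dict


def canary_ingress_annotations(weight: int) -> Dict[str, str]:
    """Annotations that mark an Ingress as a canary receiving `weight` percent of traffic."""
    if not 0 <= weight <= 100:
        raise ValueError("weight must be between 0 and 100")
    return {
        # Tells ingress-nginx that this Ingress is the canary twin of an existing one.
        "nginx.ingress.kubernetes.io/canary": "true",
        # Percentage of requests sent to the canary backend.
        "nginx.ingress.kubernetes.io/canary-weight": str(weight),
    }


for key, value in canary_ingress_annotations(5).items():
    print(f"{key}: {value}")
```

Annotation values must be strings in Kubernetes, which is why the weight is converted with str().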

  4. After the canary has run for a while and the application developers have confirmed that the new version is healthy, they can release all remaining Pods with the command kubectl-kruise rollout approve rollout/rollouts-demo -n default. Rollout then precisely controls the rest of the process. When the release completes, all canary resources are reclaimed and the cluster returns to the state deployed by the user.


  5. If the new version turns out to be abnormal during the canary, you can set the business image back to the previous version (1.10.2) and apply it to the Kubernetes cluster with kubectl apply -f. Kruise Rollout detects this change and reclaims all canary resources, which gives a fast rollback.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echoserver
...
spec:
  ...
  template:
    spec:
      containers:
      - name: echoserver
        image: cilium/echoserver:1.10.2


Summary

As more and more applications are deployed on Kubernetes, balancing rapid business iteration against application stability is a problem that platform builders must solve. Kruise Rollout is OpenKruise's new exploration in progressive delivery, and it aims to solve traffic scheduling and batch deployment in application delivery. Kruise Rollout has officially released v0.1.0 and has been integrated with the community's OAM project KubeVela, whose users can quickly deploy and use Rollout capabilities through Addons. We also hope community users will join us, and we will keep expanding in the field of application delivery.


Related Links

[1] OpenKruise: https://github.com/openkruise/kruise

[2] Kruise Rollout: https://github.com/openkruise/rollouts/blob/master/docs/getting_started/introduction.md
