Author: Wang Yike (Chu Yue)

Background

Helm is a widely used client-side application packaging and deployment tool in the cloud-native field. Its concise design and ease of use have earned user recognition and grown into an ecosystem of its own: to date, nearly 10,000 applications are packaged as Helm Charts. Helm's design philosophy is simple enough to be summarized in two points:

1. Package and template the complex Kubernetes API, abstract and simplify it into a few parameters.

2. Provide application lifecycle solutions: production, upload (hosting), versioning, distribution (discovery), deployment.

These two design principles keep Helm flexible and simple enough to cover the entire Kubernetes API, which makes it a good solution for one-off delivery of cloud-native applications. However, for enterprises of a certain scale, using Helm for continuous software delivery presents many challenges.

The challenges of Helm continuous delivery

From the beginning, Helm was designed to stay simple and easy to use, and it deliberately gave up complex component orchestration. When an application is deployed, Helm therefore delivers all resources to the Kubernetes cluster at once and relies on the eventual-consistency and self-healing capabilities of Kubernetes to resolve dependency and orchestration problems automatically. Such a design may be fine for a first deployment, but it is too idealistic for an enterprise production environment of a certain scale.

On the one hand, updating all resources at once during an upgrade can easily leave some services temporarily unavailable and interrupt the service as a whole. On the other hand, if the software has a bug, it cannot be rolled back in time and the impact is hard to contain. In more serious scenarios, operations staff may have manually modified some configuration in the production environment. Because a one-shot Helm deployment overwrites all of those modifications, and the previous Helm release may not match what is actually running in production, even a rollback cannot restore the original state, which leads to a wider failure.

As you can see, once a certain scale is reached, the ability to release software gradually (a canary, or grayscale, release) and to roll it back in production is extremely important, and Helm by itself cannot guarantee sufficient stability.

How to do a canary release for Helm?

A rigorous software upgrade usually follows a process like the one below, divided roughly into three stages. The first stage upgrades a small share of instances (for example 20%), switches a small amount of traffic to the new version, and then pauses. After manual confirmation, the second stage upgrades a larger share of instances and traffic (for example 90%) and pauses again for manual confirmation. The final stage upgrades everything to the new version and verifies it, which completes the release. If anything abnormal shows up during the upgrade, including in business metrics, such as abnormally high CPU or memory usage or too many HTTP 500 errors in the request logs, the release can be rolled back quickly.

The above is a typical canary release scenario. So how do we carry out this process for a Helm Chart application? The industry typically takes one of two approaches:

  1. Modify the Helm Chart to split the workload into two copies, each exposed through its own Helm parameters, and keep adjusting the image, instance count, and traffic ratio of the two workloads during a release to achieve a canary release.
  2. Modify the Helm Chart to replace the original basic workload with a custom workload that has the same function plus canary release capabilities, and expose Helm parameters that drive the CRDs controlling the canary release.

Both approaches are complicated and costly to implement. They are not feasible at all when the Helm Chart is a third-party component that you cannot modify, or when you lack the capacity to maintain a Helm Chart yourself. Even if the chart is reworked, it carries considerably more stability risk than the original simple workload model. The root cause is that Helm is positioned as a package management tool and was not designed with canary release or workload management in mind.

In fact, after in-depth conversations with many users in the community, we found that most users' applications are not complicated and consist of classic workload types such as Deployment and StatefulSet. So, using KubeVela's powerful addon mechanism, we worked with the OpenKruise community [1] to build a canary release addon for this limited set of types. The addon lets you canary-release a Helm Chart without any migration or modification. And if your Helm Chart is more complex, you can write a custom addon for your scenario and get the same experience.

Let's walk through the complete process with a practical example, using a Deployment workload.

Canary release with KubeVela

Prepare the environment

  • Install KubeVela
 curl -fsSl https://static.kubevela.net/script/install-velad.sh | bash
velad install

See Document 1 [2] at the end of the article for more installation details.

  • Enable related addons
 vela addon enable fluxcd
vela addon enable ingress-nginx
vela addon enable kruise-rollout
vela addon enable velaux

This step enables the following addons:

1) The fluxcd addon provides the ability to deliver Helm Charts;

2) The ingress-nginx addon provides the traffic management capability needed for the canary release;

3) The kruise-rollout addon provides the canary release capability;

4) The velaux addon provides a UI for operation and visualization.

  • Map the port of the nginx ingress controller to your local machine
 vela port-forward addon-ingress-nginx -n vela-system

First deployment

Run the following command to publish the Helm application for the first time. In this step we deploy with the vela CLI. If you are familiar with Kubernetes, you can also deploy with kubectl apply to the same effect.

 cat <<EOF | vela up -f -
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: canary-demo
  annotations:
    app.oam.dev/publishVersion: v1
spec:
  components:
  - name: canary-demo
    type: helm
    properties:
      repoType: "helm"
      url: "https://wangyikewxgm.github.io/my-charts/"
      chart: "canary-demo"
      version: "1.0.0"
    traits:
    - type: kruise-rollout
      properties:
        canary:
          steps:
          # Batch 1: release 20% of the Pods and route 20% of the traffic to the new version; manual confirmation is required before the release continues.
          - weight: 20
          # Batch 2: release 90% of the Pods and route 90% of the traffic to the new version.
          - weight: 90
          trafficRoutings:
          - type: nginx
EOF

In the example above, we declare an application named canary-demo that contains one helm-type component (KubeVela supports other component types as well). The component's parameters include the chart's repository address and version.

We also attach the kruise-rollout trait to this component, a capability that becomes available once the kruise-rollout addon is installed. The trait specifies the upgrade strategy for the Helm release: the first stage upgrades 20% of the instances and traffic, the second stage upgrades 90% after manual confirmation, and the final stage upgrades everything to the latest version.

Note that to demonstrate the effect intuitively (so that version changes are visible), we prepared a dedicated chart [3]. The chart consists of a Deployment and an Ingress, which is the most common layout for a Helm Chart. If your Helm Chart contains the same resources, you can use this example to do a canary release as well.

After the deployment succeeds, access the gateway address in your cluster with the following command. You should see this result:

 $ curl -H "Host: canary-demo.com" http://localhost:8080/version
Demo: V1

In addition, VelaUX's resource topology page shows that all five V1 instances are ready.

Upgrade the app

Use the YAML below to upgrade the application.

 cat <<EOF | vela up -f -
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: canary-demo
  annotations:
    app.oam.dev/publishVersion: v2
spec:
  components:
  - name: canary-demo
    type: helm
    properties:
      repoType: "helm"
      url: "https://wangyikewxgm.github.io/my-charts/"
      chart: "canary-demo"
      # Upgrade to version 2.0.0
      version: "2.0.0"
    traits:
    - type: kruise-rollout
      properties:
        canary:
          steps:
          # Batch 1: release 20% of the Pods and route 20% of the traffic to the new version; manual confirmation is required before the release continues.
          - weight: 20
          # Batch 2: release 90% of the Pods and route 90% of the traffic to the new version.
          - weight: 90
          trafficRoutings:
          - type: nginx
EOF

Compared with the first deployment, the new application has only two changes:

  1. The app.oam.dev/publishVersion annotation is bumped from v1 to v2, which marks this revision as a new version.
  2. The Helm Chart version is upgraded to 2.0.0. In this version of the chart, the tag of the Deployment image is upgraded to V2.
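
Since only these two values differ between releases, the manifest can be parameterized. The sketch below is not part of the original walkthrough: the function name render_app is hypothetical, while the manifest body is the one used in this article.

```shell
# Hypothetical helper: print the canary-demo Application manifest for a given
# publishVersion annotation ($1) and Helm Chart version ($2).
render_app() {
  publish_version=$1
  chart_version=$2
  cat <<EOF
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: canary-demo
  annotations:
    app.oam.dev/publishVersion: ${publish_version}
spec:
  components:
  - name: canary-demo
    type: helm
    properties:
      repoType: "helm"
      url: "https://wangyikewxgm.github.io/my-charts/"
      chart: "canary-demo"
      version: "${chart_version}"
    traits:
    - type: kruise-rollout
      properties:
        canary:
          steps:
          - weight: 20
          - weight: 90
          trafficRoutings:
          - type: nginx
EOF
}

# First deployment:  render_app v1 1.0.0 | vela up -f -
# Upgrade:           render_app v2 2.0.0 | vela up -f -
```

Keeping the manifest in one place makes it harder to bump the chart version while forgetting the publishVersion annotation.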

After a while, the upgrade stops at the first batch defined above, with only 20% of the instances and traffic upgraded. If you now run the gateway command above several times, Demo: V1 and Demo: V2 appear alternately, with roughly a 20% chance of getting Demo: V2.

 $ curl -H "Host: canary-demo.com" http://localhost:8080/version
Demo: V2
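
One way to check the split is to send a batch of requests and tally the responses. The sketch below is not part of the original walkthrough: the function name tally_versions and the sample size of 50 requests are arbitrary choices, and the usage assumes the port-forward from the environment setup is still running on localhost:8080.

```shell
# Hypothetical helper: read one response body per line on stdin and print how
# many times each distinct body occurred, most frequent first.
# Empty lines are dropped before counting.
tally_versions() {
  grep -v '^$' | sort | uniq -c | sort -rn
}

# Usage against the demo gateway (needs the port-forward to be running):
#   for i in $(seq 1 50); do
#     curl -s -H "Host: canary-demo.com" http://localhost:8080/version
#     echo
#   done | tally_versions
```

With a sample this small the observed share fluctuates, so expect a value near the configured weight rather than exactly equal to it.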

Looking at the application's resource topology again, we can see that the Rollout CR created by the kruise-rollout trait has created a new-version instance for us, while the five old-version instances created by the original workload are unchanged.

Next, we approve the release manually by resuming the upgrade with the vela CLI:

 vela workflow resume canary-demo

After a while, the resource topology shows five new-version instances being created. If we access the gateway again, the probability of getting Demo: V2 rises sharply, to close to 90%.
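
To make the batch sizes concrete, the sketch below computes the number of canary Pods as the replica count times the weight, rounded up. The rounding rule is an assumption rather than something stated in this article, although it agrees with what the demo shows for five replicas (one new instance at 20%, five at 90%). The function name canary_pods is hypothetical.

```shell
# Hypothetical helper: canary Pod count for a replica count ($1) and a
# weight in percent ($2), ASSUMING the result is rounded up.
canary_pods() {
  replicas=$1
  weight=$2
  # Integer ceiling of replicas * weight / 100.
  echo $(( (replicas * weight + 99) / 100 ))
}

for weight in 20 90 100; do
  echo "weight=${weight}% -> $(canary_pods 5 "$weight") of 5 Pods"
done
```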

Fast rollback

In real-world releases, manual review often reveals that the new version of the application is in an abnormal state. The current upgrade then has to be terminated and the application quickly rolled back to the version before the upgrade.

First, pause the current release workflow with the following command:

 $ vela workflow suspend canary-demo
Rollout default/canary-demo in cluster  suspended.
Successfully suspend workflow: canary-demo

Then roll back to the previous version, which is V1:

 $ vela workflow rollback canary-demo
Application spec rollback successfully.
Application status rollback successfully.
Rollout default/canary-demo in cluster  rollback.
Successfully rollback rolloutApplication outdated revision cleaned up.

If we access the gateway again at this point, all responses are back to V1.

 $ curl -H "Host: canary-demo.com" http://localhost:8080/version
Demo: V1

The resource topology now shows that all canary instances have been deleted. From beginning to end, the five V1 instances, as the stable version, never changed.

If you replace the rollback above with a resume, the upgrade continues and completes the full release.

For the complete walkthrough of the demo above, see Document 2 [4] at the end of the article.

If you want to implement the process above directly with native Kubernetes resources, see Document 3 [5] at the end of the article. Besides Deployment, the kruise-rollout addon also supports StatefulSet and OpenKruise's CloneSet. If the workload in your chart is one of these three types, you can implement a canary release by following the example above.

You may have noticed that the example above uses a Layer 7 traffic-splitting scheme based on the nginx ingress controller. We also support the Kubernetes Gateway API [6], which enables more gateway types as well as Layer 4 traffic splitting.

How is the stability of the release process guaranteed?

After the first deployment, the kruise-rollout addon (rollout for short) watches the resources deployed by the Helm Chart, in our case a Deployment, a Service, and an Ingress (StatefulSet and OpenKruise CloneSet are also supported). rollout then takes over subsequent upgrades of this Deployment.

During an upgrade, the new version of the Helm release takes effect first and updates the Deployment's image to v2. At that moment, however, rollout takes control of the Deployment's upgrade process away from the controller-manager, so the Pods under the Deployment are not upgraded. At the same time, rollout makes a canary copy of the Deployment with image tag v2, creates a Service that selects the canary instances, and creates an Ingress that points to this Service. Finally, it sets the corresponding annotations on that Ingress so that it receives the canary share of the traffic (for details, see Document 4 [7] at the end of the article), which achieves the traffic split.
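
For reference, the ingress-nginx canary annotations covered in Document 4 look roughly like the manifest printed below. This is an illustrative sketch, not the exact object that rollout generates: the function name print_canary_ingress, the resource names, the path, and the port are placeholders, and only the two annotations come from the ingress-nginx documentation.

```shell
# Hypothetical helper: print an illustrative canary Ingress for a given
# canary weight in percent ($1). Names, path, and port are placeholders.
print_canary_ingress() {
  canary_weight=$1
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: canary-demo-canary            # placeholder name
  annotations:
    # Marks this Ingress as the canary for the same host and path.
    nginx.ingress.kubernetes.io/canary: "true"
    # Share of requests (in percent) routed to the canary backend.
    nginx.ingress.kubernetes.io/canary-weight: "${canary_weight}"
spec:
  ingressClassName: nginx
  rules:
  - host: canary-demo.com
    http:
      paths:
      - path: /                       # placeholder path
        pathType: Prefix
        backend:
          service:
            name: canary-demo-canary  # placeholder Service name
            port:
              number: 80              # placeholder port
EOF
}

print_canary_ingress 20
```

Because the stable Ingress stays untouched, dropping the canary Ingress is enough to send all traffic back to the stable version.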

After all manual confirmation steps are done and the full release completes, rollout hands control of the stable Deployment's upgrade back to the controller-manager. The stable instances are then upgraded to the new version one after another. Only when all stable instances are ready are the canary Deployment, Service, and Ingress destroyed, one after another. This ensures that request traffic never hits an instance that is not ready and causes failed requests, which makes the canary release lossless.

Going forward, we will keep iterating in the following areas to support more scenarios and deliver a more stable and reliable upgrade experience:

  1. Connect the upgrade process to KubeVela's workflow system. This brings a richer system for extending intermediate steps and enables features such as sending notifications through the workflow during an upgrade. The pause phase of each step could even connect to an external observability system and decide automatically whether to continue or roll back by checking logs or monitoring metrics, enabling unattended releases.
  2. Integrate more addons such as istio to support ServiceMesh-based traffic splitting.
  3. Beyond percentage-based traffic splitting, support header-based and cookie-based traffic-splitting rules, as well as features such as blue-green release.
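
Until such automation exists, the idea behind an unattended gate can be approximated by hand. The sketch below is purely illustrative and is not a KubeVela feature: the function name decide_next_step, its inputs, and the threshold are all hypothetical, and how you obtain the request and error counts from your monitoring system is left open.

```shell
# Hypothetical gate: print "rollback" when the share of HTTP 5xx responses
# exceeds the allowed percentage, otherwise print "resume".
#   $1 = total requests, $2 = 5xx responses, $3 = max allowed error percent
decide_next_step() {
  total=$1
  errors=$2
  max_error_percent=$3
  # With no traffic observed there is no evidence the canary is healthy,
  # so this sketch conservatively chooses to roll back.
  if [ "$total" -eq 0 ] || [ $(( errors * 100 )) -gt $(( total * max_error_percent )) ]; then
    echo rollback
  else
    echo resume
  fi
}

# Usage with the commands from this article (counts are made-up inputs):
#   vela workflow "$(decide_next_step 1000 3 1)" canary-demo
```

The comparison is done in integers to avoid floating-point arithmetic, which plain shell does not provide.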

Summary

As described above, KubeVela's support for canary-releasing Helm applications is implemented entirely through its addon system. The fluxcd addon deploys the Helm Chart and manages its life cycle. The kruise-rollout addon upgrades workload instances and switches traffic during the upgrade. Combining the two addons provides full life cycle management and canary upgrades for a Helm application without any changes to your Helm Chart. You can also write addons [8] of your own to cover more specialized scenarios or processes.

Thanks to KubeVela's powerful extensibility, you can not only combine these addons flexibly but also swap the underlying capabilities dynamically for different platforms or environments without changing the upper-layer application. For example, if you prefer argocd over fluxcd for deploying Helm applications, you can get the same functionality by enabling the argocd addon, and the upper-layer Helm application needs no change or migration.

The KubeVela community now provides dozens of addons that help a platform extend its capabilities in observability, GitOps, FinOps, rollout, and other areas.

If you are interested in addons, you are very welcome to submit your own custom addon to the addon repository [9] and contribute new ecosystem capabilities to the community!

Reference links:

[1] OpenKruise Community: https://openkruise.io/

[2] Document 1: https://kubevela.net/docs/install#1-install-velad

[3] chart: https://github.com/wangyikewxgm/my-charts/tree/main/canary-demo

[4] Document 2: https://kubevela.net/docs/tutorials/helm

[5] Document 3: https://kubevela.net/docs/tutorials/k8s-object-rollout

[6] Kubernetes Gateway API: https://gateway-api.sigs.k8s.io/

[7] Document 4: https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/#canary

[8] Write an addon: https://kubevela.net/docs/platform-engineers/addon/intro

[9] Addon repository: https://github.com/kubevela/catalog

You can learn more about KubeVela and the OAM project through the following resources:

  • Project code base: github.com/oam-dev/kubevela. Stars, watches, and forks are welcome!
  • Official project homepage and documentation: kubevela.io. Since version 1.1, documentation has been available in both Chinese and English, and developers are welcome to translate it into more languages.
  • Project DingTalk group: 23310022; Slack: CNCF #kubevela channel
  • Join the WeChat group: add the maintainer's WeChat account and mention that you want to join the KubeVela user group.
