Author: Zhao Mingshan (Liheng)
Foreword
Kruise Rollout is a progressive delivery framework open sourced by the OpenKruise community. It supports canary, blue-green, and A/B Testing releases with grayscale along both the traffic and instance dimensions; the release process can be automatically batched and paused based on Prometheus metrics; and it provides non-intrusive, bypass-style integration that is compatible with multiple existing workloads (Deployment, CloneSet, DaemonSet).
I recently gave a talk on this topic at the 2022 Open Atom Global Open Source Summit; the following is a summary of the main content.
What is progressive delivery?
A progressive release differs from a full, one-shot release mainly in the following features:
- Batched release process: a release can be divided into multiple batches, and each batch can be started and stopped on demand.
- Grayscale along two dimensions, instances and traffic: for example, the canary, A/B Testing, and blue-green releases common in the community.
- Stage-by-stage verifiability: each released batch can be verified for correctness and conformance to expectations before proceeding.
Let's look at a practical example.
Suppose version X is running online and version Y needs to be released. First, the release is divided into multiple batches (for example, only ten instances are released in the first batch). Then, traffic matching certain grayscale rules is routed to version Y; for example, major releases at Taobao commonly use A/B Testing to route only company employees to the new version. Finally, the health of the new version is verified, and once verification passes, the process is repeated to complete the remaining batches. If any anomaly is found along the way, the release can be quickly rolled back to version X. As this example shows, compared with a full release, a progressive release adds many intermediate verification steps, which greatly improves delivery stability; for large-scale scenarios in particular, progressive release is close to a necessity.
The relationship between progressive release and K8s workloads
All Pods in K8s are managed by workloads, the two most common being Deployment and StatefulSet. Deployment provides two parameters for upgrades, maxUnavailable and maxSurge, but in essence it only supports a continuous, one-shot rolling release; users cannot control batching. StatefulSet does support batching, but it is still far from the progressive release capability we want.
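For reference, these two knobs are essentially the whole of Deployment's built-in release control. A minimal illustrative manifest (the name and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo                # placeholder name
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%   # how many Pods may be unavailable during the update
      maxSurge: 25%         # how many extra Pods may be created above replicas
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: app
        image: nginx:1.25   # placeholder image
```

Once `spec.template` is updated, the controller rolls all replicas through in one continuous pass; there is no native way to stop after, say, the first ten Pods, which is exactly the gap progressive delivery fills.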
Therefore, in terms of capability, progressive release and workload form an inclusion relationship: in addition to the basic Pod release, progressive release should also cover traffic release and progress control. With the capabilities sorted out, let's look at implementation, since how to design and implement the Rollout capability matters a great deal. One question worth considering: from a design point of view, should the two also be an inclusion relationship?
Design philosophy of the Rollout scheme
Before starting, it is worth studying the community's existing solutions to see how others have approached the problem.
Argo Rollouts is a workload launched by the Argo community. Its approach is to redefine a Deployment-like workload that re-implements Deployment's original capabilities and extends them with Rollout features. Its advantages: the Rollout capability is built into the workload, configuration is simple, and the implementation is relatively straightforward. The feature set is also rich, covering various release strategies, traffic grayscale, and metrics analysis; it is a relatively mature project.
However, it also has some problems. Because it is itself a workload, it cannot be applied to the community's native Deployment; companies that already deploy with Deployment would need an online workload migration. Moreover, many community solutions depend on Deployment, and many companies have built Deployment-based container management platforms, all of which would have to be adapted. Argo Rollouts is therefore better suited to companies with strong customization capabilities and no existing Deployment footprint.
Another community project is Flagger, whose approach is completely different from Argo Rollouts: instead of implementing a separate workload, it adds traffic grayscale and batched release on top of the existing Deployment.
Flagger's advantage is that it supports native Deployment and is compatible with community solutions such as Helm and Argo CD. But it has some problems too. The first is doubled resources during a release: Flagger first upgrades the Deployment the user deployed and then upgrades the primary copy, so double the Pod resources must be provisioned during this process. Second, self-built container platforms need extra adaptation work, because Flagger's approach is to copy the user's Deployment resource and change its name and labels. Flagger is therefore better suited to smaller companies that deploy based on community solutions with little customization.
In addition, "a hundred flowers blooming" is a hallmark of cloud native. The Alibaba Cloud container team is responsible for the cloud-native architecture evolution of the entire container platform, where there is also strong demand for progressive application delivery. Drawing on the community solutions and on Alibaba's internal scenarios, we set the following design goals for Rollout:
1. Non-intrusiveness: no modifications to the native workload controllers or to user-defined Application YAML, keeping native resources clean and consistent
2. Extensibility: supports K8s native workloads, custom workloads, and traffic routing via Nginx, Istio, and others in an extensible way
3. Ease of use: works out of the box and combines easily with community GitOps tools or self-built PaaS
Kruise Rollout working mechanism and evolution
The Kruise Rollout API design is very simple, consisting mainly of four parts:
- ObjectRef: indicates the workload that Kruise Rollout acts on, for example a Deployment name
- Strategy: defines the rollout release process; for example, a canary release that first releases 5% of instances, shifts 5% of traffic to the new version, and continues with subsequent batches after manual confirmation
- TrafficRouting: the resources needed for traffic grayscale, for example Service, Ingress, or Gateway API
- Status: displays the progress and status of the Rollout
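Putting the four parts together, a minimal canary Rollout might look like the sketch below. This is illustrative, based on the v1alpha1 API; the workload, Service, and Ingress names are placeholders, and field details may differ between versions:

```yaml
apiVersion: rollouts.kruise.io/v1alpha1
kind: Rollout
metadata:
  name: rollouts-demo
spec:
  objectRef:                     # which workload this Rollout acts on
    workloadRef:
      apiVersion: apps/v1
      kind: Deployment
      name: echoserver           # placeholder workload name
  strategy:
    canary:
      steps:
      - weight: 5                # batch 1: 5% of instances, 5% of traffic
        pause: {}                # wait for manual confirmation before continuing
      - weight: 50               # batch 2: half of instances and traffic
      trafficRoutings:
      - service: echoserver      # stable Service fronting the workload
        ingress:
          name: echoserver       # Ingress whose rules are adjusted for grayscale
```

The `status` field then reports which step the Rollout is on; confirming a paused step lets the controller proceed to the next batch.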
Next, let's introduce the working mechanism of Kruise Rollout.
First, users make a version release based on the container platform (a release is essentially applying K8s resources to the cluster).
- Kruise Rollout includes a webhook component that intercepts the user's update request and pauses the workload controller's work by modifying the workload's strategy.
- Then, according to the user's Rollout definition, it dynamically adjusts workload parameters such as partition to release the workload in batches.
- After a batch is released, it adjusts the Ingress and Service configurations to direct a specific share of traffic to the new version.
- Finally, Kruise Rollout can judge whether the release is healthy from business metrics in Prometheus; for a web-style HTTP service, for example, it can check whether the HTTP status codes are normal.
The above completes the grayscale of the first batch, and subsequent batches proceed similarly. After the full Rollout process completes, Kruise restores the configuration of the workload and related resources. The entire Rollout process thus works in synergy with existing workload capabilities: it reuses them as much as possible and achieves zero intrusion into non-Rollout workflows.
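As an illustration of the "dynamically adjust workload parameters" step, this is roughly what batch control looks like on a CloneSet, whose native `partition` field keeps the given number of Pods on the old revision. The values and names below are illustrative; which fields the controller actually patches is internal to Kruise Rollout:

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  name: echoserver            # placeholder name
spec:
  replicas: 100
  selector:
    matchLabels:
      app: echoserver
  updateStrategy:
    partition: 95             # 95 Pods stay on the old revision -> first batch is 5%
  template:
    metadata:
      labels:
        app: echoserver
    spec:
      containers:
      - name: app
        image: echoserver:v2  # the new version being rolled out
```

Lowering `partition` batch by batch (for example 95, then 50, then 0) advances the release, while raising it back moves Pods to the old revision again, which is what makes fast rollback cheap.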
That concludes the introduction to Kruise Rollout's working mechanism; next is a brief introduction to the OpenKruise community.
Finally
With more and more applications deployed on K8s, balancing rapid business iteration with application stability is a problem every platform builder must solve. Kruise Rollout is OpenKruise's new exploration in the field of progressive delivery, aiming to solve traffic routing and batched deployment in application delivery. Kruise Rollout has officially released v0.2.0 and has been integrated with the community's OAM KubeVela project; KubeVela users can quickly deploy and use the Rollout capability through addons. We also hope more community users will join us to extend the application delivery space further.
- Github: https://github.com/openkruise/rollouts
- Official: https://openkruise.io/
- Slack: https://kruise-workspace.slack.com/
Scan the QR code in the original post to join the community DingTalk exchange group.