OpenKruise v0.10.0 released: elastic application topology management, application availability protection, and more

Author | 酒祝

Background

OpenKruise, Alibaba Cloud's open source suite for automated management of cloud native applications and a CNCF Sandbox project, released v0.10.0 today. This will be the last minor version before OpenKruise v1.0.

This article gives a quick overview of what is new in v0.10.0, including major features such as WorkloadSpread and PodUnavailableBudget. Follow-up articles will cover their design and implementation in detail.

New features overview

1. WorkloadSpread: bypass-style elastic topology management for applications

Application deployment and operations involve many kinds of topology spreading and elasticity requirements. The most common and basic one is to spread Pods across one or more topology levels, for example:

The application must be spread across nodes to avoid stacking Pods on one node (improves disaster tolerance)
The application must be spread across AZs (availability zones) (improves disaster tolerance)

These basic requirements can already be met with the Pod affinity and topology spread constraints that Kubernetes provides natively. In real production environments, however, there are many more complex spreading and elasticity requirements. Here are some practical examples:

When spreading by zone, the share of Pods in each zone must be specified. For example, an application's Pods are deployed to zones a, b, and c in a 1:1:2 ratio (for practical reasons such as unbalanced traffic across the zones).
There are multiple zones or machine types. When the application scales out, Pods go to the first zone or machine type, and only when its resources run out do they go to the next one (and so on). When the application scales in, the reverse order applies: Pods in the last zone or machine type are removed first (and so on).
There are several base node pools and elastic node pools. A fixed number or proportion of Pods must be deployed to the base node pools, and the rest are scaled out to the elastic node pools.

Previously, the usual way to handle these cases was to split one application into multiple workloads (such as Deployments). That solves the basic problems of different proportions, scale-out priorities, resource awareness, and elastic selection across topologies, but it still requires deep customization in the PaaS layer to manage several workloads as one application.

To address these problems, Kruise v0.10.0 adds the WorkloadSpread resource. It currently supports the Deployment, ReplicaSet, and CloneSet workload types and manages the spreading and elastic topology of the Pods they own.

The following is a simplified example:

apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: workloadspread-demo
spec:
  targetRef:
    apiVersion: apps/v1  # use apps.kruise.io/v1alpha1 for a CloneSet
    kind: Deployment  # or CloneSet
    name: workload-xxx
  subsets:
  - name: subset-a
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-a
    maxReplicas: 10  # an absolute number, or a percentage such as 30%
  - name: subset-b
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-b

Once created, the WorkloadSpread is associated with a workload object through targetRef. When that workload scales out, Kruise injects the matching topology rules into the new Pods according to the subsets defined above. This is a bypass style of injection and management: it does not interfere with the workload's own scaling and rollout management of its Pods.
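
As a rough illustration of the subset rules in the example above, the following Python sketch distributes a workload's replicas over subsets in their declared order. It is not Kruise's implementation: the function names are hypothetical, and both the in-order filling and the rounding up of percentages are assumptions of this sketch.

```python
from typing import Dict, List, Optional, Tuple, Union

# None means "no upper limit", as for subset-b in the example above.
MaxReplicas = Optional[Union[int, str]]


def resolve_max_replicas(value: MaxReplicas, total: int) -> Optional[int]:
    """Turn an absolute number or a string such as "30%" into a Pod count."""
    if value is None:
        return None
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        # Integer ceiling; rounding up is an assumption of this sketch.
        return (total * percent + 99) // 100
    return int(value)


def assign_subsets(replicas: int, subsets: List[Tuple[str, MaxReplicas]]) -> Dict[str, int]:
    """Fill subsets in order; each takes Pods until its limit is reached."""
    placed: Dict[str, int] = {}
    remaining = replicas
    for name, max_replicas in subsets:
        limit = resolve_max_replicas(max_replicas, replicas)
        count = remaining if limit is None else min(remaining, limit)
        placed[name] = count
        remaining -= count
    # Pods left over when every subset has a limit are not assigned here.
    return placed


print(assign_subsets(40, [("subset-a", 10), ("subset-b", None)]))
print(assign_subsets(40, [("subset-a", "30%"), ("subset-b", None)]))
```

With 40 replicas, the first call places 10 Pods in subset-a and 30 in subset-b; with a 30% limit, subset-a receives 12 and subset-b the remaining 28.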

Note: WorkloadSpread controls the priority of Pods during scale-in through Pod Deletion Cost:

If the workload is a CloneSet, Pod Deletion Cost is already supported, so scale-in priority works out of the box
If the workload is a Deployment or ReplicaSet, Kubernetes >= 1.21 is required, and on 1.21 the PodDeletionCost feature gate must be enabled on kube-controller-manager
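
To make the role of Pod Deletion Cost more concrete, here is a minimal Python sketch of how a per-subset cost could encode scale-in priority. The annotation key is the one Kubernetes documents for Pod Deletion Cost. The function name and the cost values are hypothetical and chosen only for illustration; they are not the values Kruise writes.

```python
from typing import Dict, List

# Annotation documented by Kubernetes; Pods with a lower cost are deleted first.
POD_DELETION_COST = "controller.kubernetes.io/pod-deletion-cost"


def deletion_cost_annotations(subset_names: List[str], step: int = 100) -> Dict[str, Dict[str, str]]:
    """Give earlier subsets a higher cost so that their Pods are removed last.

    The multiples of `step` are invented for this sketch.
    """
    total = len(subset_names)
    return {
        name: {POD_DELETION_COST: str((total - index) * step)}
        for index, name in enumerate(subset_names)
    }


for subset, annotations in deletion_cost_annotations(["subset-a", "subset-b"]).items():
    print(subset, annotations)
```

Here Pods in subset-b get a lower cost than Pods in subset-a, so a scale-in removes subset-b Pods first, which is the reverse of the scale-out order.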

To use WorkloadSpread, enable the WorkloadSpread feature gate when installing or upgrading to Kruise v0.10.0.

The example above shows only the simplest configuration. For more details, see the official documentation. The implementation will be covered in follow-up articles.

2. PodUnavailableBudget: Application availability protection

In many voluntary disruption scenarios, the Pod Disruption Budget (PDB) that Kubernetes provides natively keeps applications highly available by limiting how many Pods can be disrupted at the same time.

However, in many scenarios business interruption and service degradation still occur even with a PDB in place, for example:

The application owner is rolling out a new version through a Deployment while the cluster administrator is scaling in nodes because machine utilization is low
The middleware team is using SidecarSet to upgrade a sidecar in place across the cluster (for example, the ServiceMesh envoy) while HPA is scaling in the same set of applications
The application owner and the middleware team are both using in-place upgrades, through CloneSet and SidecarSet respectively, on the same set of Pods

The reason is simple: a PDB only guards against Pod eviction triggered through the Eviction API (for example, kubectl drain evicts all Pods on a node). It cannot protect against many other operations, such as Pod deletion and in-place upgrade.

PodUnavailableBudget (PUB), added in Kruise v0.10.0, extends the native PDB. It includes the capabilities of PDB and adds protection against more voluntary disruptions, including but not limited to Pod deletion and in-place upgrade.

apiVersion: apps.kruise.io/v1alpha1
kind: PodUnavailableBudget
metadata:
  name: web-server-pub
  namespace: web
spec:
  targetRef:
    apiVersion: apps/v1  # or apps.kruise.io/v1alpha1
    kind: Deployment  # or CloneSet, StatefulSet, ...
    name: web-server
  # configure either selector or targetRef, not both
  # selector:
  #   matchLabels:
  #     app: web-server
  # maximum number of Pods that may be unavailable
  maxUnavailable: 60%
  # minimum number of Pods that must remain available
  # minAvailable: 40%
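
The budget check implied by maxUnavailable and minAvailable can be sketched in Python as follows. This is only an illustration of the arithmetic, not Kruise's admission logic: the function names are hypothetical, and rounding percentages up is an assumption of this sketch.

```python
from typing import Optional, Union

IntOrPercent = Union[int, str]


def scaled_value(value: IntOrPercent, total: int) -> int:
    """Resolve an absolute number or a string such as "60%" against `total`."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        # Integer ceiling; rounding up is an assumption of this sketch.
        return (total * percent + 99) // 100
    return int(value)


def allowed_disruptions(
    total: int,
    unavailable: int,
    max_unavailable: Optional[IntOrPercent] = None,
    min_available: Optional[IntOrPercent] = None,
) -> int:
    """Return how many more Pods may become unavailable right now."""
    if (max_unavailable is None) == (min_available is None):
        raise ValueError("set exactly one of max_unavailable or min_available")
    if max_unavailable is not None:
        budget = scaled_value(max_unavailable, total)
    else:
        budget = total - scaled_value(min_available, total)
    return max(0, budget - unavailable)


# 10 Pods with maxUnavailable: 60% and 5 Pods already unavailable.
print(allowed_disruptions(10, 5, max_unavailable="60%"))
```

A deletion, eviction, or in-place upgrade would be let through only while the returned value is greater than zero; once 6 of the 10 Pods are unavailable, further disruptions are rejected.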

To use PodUnavailableBudget, enable its feature gates when installing or upgrading to Kruise v0.10.0 (you can enable either one or both):

PodUnavailableBudgetDeleteGate: intercepts and guards Pod deletion, eviction, and similar operations
PodUnavailableBudgetUpdateGate: intercepts and guards Pod in-place upgrades and other update operations

For more details, see the official documentation. The implementation will be covered in follow-up articles.

3. CloneSet supports scale-in based on topology rules

When a CloneSet scales in (its number of replicas is reduced), it chooses which Pods to delete with a fixed set of sorting rules. Pods that rank lower are deleted first:

1. Not scheduled < scheduled
2. PodPending < PodUnknown < PodRunning
3. Not ready < ready
4. Smaller pod-deletion cost < larger pod-deletion cost
5. Larger spread weight < smaller spread weight
6. Ready for a shorter time < ready for a longer time
7. More container restarts < fewer container restarts
8. Created more recently < created earlier

Rule 4 was introduced in Kruise v0.9.0 and lets users specify the deletion order (WorkloadSpread relies on it to implement scale-in priority). Rule 5 is new in v0.10.0: the application's topology is taken into account when sorting Pods for scale-in.

If the application is configured with topology spread constraints, CloneSet selects Pods to delete along the topology dimensions they define (for example, it tries to keep the number of Pods in each zone balanced)
If the application has no topology spread constraints, CloneSet by default selects Pods to delete along the node dimension (to reduce stacking of Pods on the same node)
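
The eight sorting rules above can be expressed as a single sort key. The Python sketch below is an illustration only: the PodInfo type and its fields are hypothetical, and spread_weight simply stands in for whatever value CloneSet derives from the topology.

```python
from dataclasses import dataclass
from typing import List, Tuple

PHASE_RANK = {"Pending": 0, "Unknown": 1, "Running": 2}


@dataclass
class PodInfo:
    """Hypothetical summary of the Pod state the sorting rules look at."""
    name: str
    scheduled: bool
    phase: str
    ready: bool
    deletion_cost: int
    spread_weight: int
    ready_seconds: int
    restarts: int
    age_seconds: int


def deletion_key(pod: PodInfo) -> Tuple[int, ...]:
    """Smaller keys sort first, and Pods that sort first are deleted first."""
    return (
        int(pod.scheduled),      # rule 1: not scheduled before scheduled
        PHASE_RANK[pod.phase],   # rule 2: Pending, then Unknown, then Running
        int(pod.ready),          # rule 3: not ready before ready
        pod.deletion_cost,       # rule 4: smaller cost first
        -pod.spread_weight,      # rule 5: larger spread weight first
        pod.ready_seconds,       # rule 6: ready for a shorter time first
        -pod.restarts,           # rule 7: more restarts first
        pod.age_seconds,         # rule 8: created more recently first
    )


def pick_pods_to_delete(pods: List[PodInfo], count: int) -> List[str]:
    """Return the names of the `count` Pods that would be deleted."""
    return [pod.name for pod in sorted(pods, key=deletion_key)[:count]]


pods = [
    PodInfo("pod-a", True, "Running", True, 0, 2, 600, 0, 900),
    PodInfo("pod-b", True, "Running", True, 0, 1, 600, 0, 900),
    PodInfo("pod-c", True, "Running", False, 0, 0, 0, 3, 900),
]
print(pick_pods_to_delete(pods, 2))
```

In this example pod-c is chosen first because it is not ready, and pod-a is chosen next because its spread weight is larger than that of pod-b.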

4. Advanced StatefulSet supports streaming scale-out

To avoid creating a large number of failing Pods at once when a new Advanced StatefulSet is created, Kruise v0.10.0 introduces maxUnavailable in the scale strategy:

apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
  # ...
  replicas: 100
  scaleStrategy:
    maxUnavailable: 10% # percentage or absolute number

When this field is set, Advanced StatefulSet ensures that the number of unavailable Pods does not exceed this limit as it creates Pods.

For example, the StatefulSet above creates only 10 Pods at first. After that, each time a Pod becomes running and ready, one more Pod is created.

Note: this feature is only allowed for a StatefulSet whose podManagementPolicy is Parallel.
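
The streaming behaviour described above comes down to a small calculation per reconcile round. The following Python sketch shows it under stated assumptions: the function names are hypothetical, percentages are rounded up, and a Pod counts as unavailable until it is running and ready.

```python
from typing import Union


def scaled_limit(max_unavailable: Union[int, str], replicas: int) -> int:
    """Resolve an absolute number or a string such as "10%" against replicas."""
    if isinstance(max_unavailable, str) and max_unavailable.endswith("%"):
        percent = int(max_unavailable[:-1])
        # Integer ceiling; rounding up is an assumption of this sketch.
        return (replicas * percent + 99) // 100
    return int(max_unavailable)


def pods_to_create(replicas: int, existing: int, unavailable: int,
                   max_unavailable: Union[int, str]) -> int:
    """Return how many Pods may be created now without exceeding the limit."""
    limit = scaled_limit(max_unavailable, replicas)
    missing = replicas - existing   # Pods that still have to be created
    room = limit - unavailable      # unavailable Pods we can still afford
    return max(0, min(missing, room))


print(pods_to_create(100, 0, 0, "10%"))    # first round
print(pods_to_create(100, 10, 10, "10%"))  # all created Pods still starting
print(pods_to_create(100, 10, 9, "10%"))   # one Pod became ready
```

For the StatefulSet above this yields 10 Pods in the first round, none while all 10 are still starting, and one more each time a Pod becomes ready.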

5. Others

Beyond the features above, there are other changes, such as:

SidecarSet adds fields such as imagePullSecrets and injectionStrategy.paused, which configure image pull secrets for sidecar images and pause injection
Advanced StatefulSet supports image pre-download in combination with in-place upgrades

See the ChangeLog documentation for details.

Finally

v0.10.0 will be the last minor version before OpenKruise v1.0. Kruise plans to release its first major version, v1.0, before the end of the year, so stay tuned!

In addition, the OpenKruise community is starting regular bi-weekly meetings. The first one takes place this Thursday (September 9) at 19:00 (GMT+8, Asia/Shanghai) and will walk through the new features of v0.10.0 with demos. How to participate:

Zoom meeting link (see the link at the end of the article)
Join the OpenKruise community DingTalk group (search for group number 23330762 in DingTalk), where the meeting will be streamed live

Related links

OpenKruise

https://github.com/openkruise/kruise

topology spread constraints

https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/

Pod Deletion Cost

https://kubernetes.io/docs/reference/labels-annotations-taints/#pod-deletion-cost

Official documentation

https://openkruise.io/zh-cn/docs/workloadspread.html

ChangeLog documentation

https://github.com/openkruise/kruise/blob/v0.10.0/CHANGELOG.md

Zoom meeting link

https://us02web.zoom.us/j/87059136652?pwd=NlI4UThFWXVRZkxIU0dtR1NINncrQT09

Zoom document

https://shimo.im/docs/gXqmeQOYBehZ4vqo
