Author | Wang Siyu (Wine Toast)
Photo Credit @ Wang Siyu (Wine Toast)
Background
OpenKruise is Alibaba Cloud's open-source cloud-native application automation suite, currently hosted as a Sandbox project under the Cloud Native Computing Foundation (CNCF). It distills Alibaba's years of experience with containerization and cloud-native technology: a set of standard Kubernetes extension components proven at scale in Alibaba's internal production environment, built on technical concepts and best practices that closely follow upstream community standards while adapting to Internet-scale scenarios.
OpenKruise released its latest version, v0.9.0 (ChangeLog), on May 20, 2021, adding major features such as Pod container restart and cascading-deletion protection for resources. This article gives an overview of the new version.
Pod container restart/rebuild
"Restart" is a very simple requirement, and a common "recovery method" in daily operations. Native Kubernetes does not provide any operational capability at container granularity; a Pod, as the smallest operational unit, supports only two operations: create and delete.
Some may ask: in the cloud-native era, why do users still care about container restarts? In an ideal serverless mode, the business only needs to care about the service itself, right?
This comes from the difference between cloud-native architecture and traditional infrastructure. In the era of physical machines and virtual machines, multiple application instances were often deployed on one machine, and the life cycles of the machine and the applications were separate; restarting an application instance might be nothing more than a systemctl or supervisor command, with no need to restart the entire machine. In the container and cloud-native mode, however, the application's life cycle is bound to the Pod's containers: under normal circumstances, one container runs only one application process, and one Pod serves only one application instance.
Because of these constraints, native Kubernetes offers no API that exposes container (application) restart capability to upper-layer services. Kruise v0.9.0 provides container restart at single-Pod granularity, compatible with standard Kubernetes clusters of version 1.16 and above. After installing or upgrading Kruise, you only need to create a ContainerRecreateRequest (CRR for short) object to trigger a restart. The simplest YAML is as follows:
apiVersion: apps.kruise.io/v1alpha1
kind: ContainerRecreateRequest
metadata:
  namespace: pod-namespace
  name: xxx
spec:
  podName: pod-name
  containers:
  - name: app
  - name: sidecar
The namespace must be the same as the namespace of the Pod to be operated on, and the name can be chosen freely. In the spec, podName is the name of the Pod, and the containers list can specify one or more container names in that Pod to restart.
In addition to the above mandatory fields, CRR also provides a variety of optional restart strategies:
spec:
  # ...
  strategy:
    failurePolicy: Fail
    orderedRecreate: false
    terminationGracePeriodSeconds: 30
    unreadyGracePeriodSeconds: 3
    minStartedSeconds: 10
    activeDeadlineSeconds: 300
    ttlSecondsAfterFinished: 1800
- failurePolicy: Fail or Ignore; defaults to Fail, which means that as soon as one container fails to stop or recreate, the CRR ends immediately.
- orderedRecreate: defaults to false; true means that when the list contains multiple containers, each container is recreated only after the previous one has finished.
- terminationGracePeriodSeconds: how long to wait for the container to exit gracefully; if unset, the value defined in the Pod is used.
- unreadyGracePeriodSeconds: set the Pod to not-ready first, and wait for this period before starting the recreation.
  - Note: this field requires the KruisePodReadinessGate feature-gate to be enabled, which injects a readinessGate into every Pod at creation. Otherwise, by default a readinessGate is injected only into Pods created by Kruise workloads, meaning only those Pods can use unreadyGracePeriodSeconds during CRR recreation.
- minStartedSeconds: after recreation, the new container must keep running for at least this long before it is considered successfully recreated.
- activeDeadlineSeconds: if the CRR has not finished within this time, it is marked as ended directly (containers not yet finished are marked as failed).
- ttlSecondsAfterFinished: after the CRR ends, it is deleted automatically once this period has elapsed.
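To give a sense of what a finished request reports back, here is a sketch of a completed CRR's status. The field names below follow my reading of the OpenKruise documentation and should be treated as assumptions; verify them against the ContainerRecreateRequest CRD in your Kruise version.

```yaml
# Hypothetical status of a completed CRR; verify field names against
# the ContainerRecreateRequest CRD in your cluster.
status:
  phase: Completed            # overall result of the restart request
  completionTime: "2021-05-20T08:00:00Z"
  containerRecreateStates:
  - name: app                 # per-container restart result
    phase: Succeeded
  - name: sidecar
    phase: Succeeded
```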
Implementation principle: after a user creates a CRR, it is first processed by the kruise-manager center, and is then picked up and executed by the kruise-daemon on the node where the Pod resides. Execution proceeds as follows:
- If the Pod's container defines a preStop hook, kruise-daemon first executes preStop inside the container via the CRI runtime's exec.
- If there is no preStop, or once it has completed, kruise-daemon calls the CRI interface to stop the container.
- When the kubelet detects that the container has exited, it creates a new container with an incremented "serial number" and starts it (executing postStart, if defined).
- kruise-daemon detects that the new container has started successfully and reports the CRR restart as complete.
The container "serial number" mentioned above corresponds to the restartCount reported by the kubelet in the Pod's status, so you will see the Pod's restartCount increase after the container restart. In addition, because the container is rebuilt, files temporarily written to the old container's rootfs are lost, but data in mounted volumes still exists.
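Since the first step above executes the container's preStop hook via CRI exec, a Pod that needs cleanup before each restart can declare one in the standard Kubernetes way. A minimal sketch (the image name and drain script below are placeholders for illustration):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-name
spec:
  containers:
  - name: app
    image: app-image:v1    # hypothetical image name
    lifecycle:
      preStop:
        exec:
          # runs before kruise-daemon stops the container during a CRR restart
          command: ["/bin/sh", "-c", "/app/drain-connections.sh"]
```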
Cascading deletion protection
Kubernetes' end-state-oriented automation is a "double-edged sword": it brings declarative deployment to applications, but it can also amplify the consequences of misoperations. Consider its "cascading deletion" mechanism: under normal circumstances (non-orphan deletion), once a parent resource is deleted, all of its child resources are deleted along with it:
- Delete a CRD, and all its corresponding CRs will be cleared.
- Delete a namespace, and all resources under it, including Pods, will be deleted together.
- Deleting a workload (Deployment/StatefulSet/...) will delete all Pods under it.
We have heard many complaints from K8s users and developers in the community about failures caused by such "cascading deletions". For any company, an accidental deletion of this scale in its production environment is an unbearable pain, and Alibaba is no exception.
Therefore, in Kruise v0.9.0 we brought Alibaba's internal protection against cascading deletion to the community, hoping to bring a stability guarantee to more users. To use this feature in the current version, you need to explicitly enable the ResourcesDeletionProtection feature-gate when installing or upgrading Kruise.
For a resource object that should be protected from deletion, the user can add the label policy.kruise.io/delete-protection to it, with one of two values:
- Always: the object is forbidden to be deleted unless the label is removed.
- Cascading: the object is forbidden to be deleted if it still has available subordinate resources.
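For example, to protect a critical namespace so it can only be deleted once it no longer contains any available resources, you would label it as follows (the namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: prod            # example namespace to protect
  labels:
    # forbid deletion while available resources still exist in it
    policy.kruise.io/delete-protection: Cascading
```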
The currently supported resource types and cascading relationships are as follows:
CloneSet new features
1. Deletion priority
controller.kubernetes.io/pod-deletion-cost is an annotation added in Kubernetes 1.21; ReplicaSet consults this cost value when sorting Pods for scale-in. CloneSet also supports it as of Kruise v0.9.0.
Users can set this annotation on a Pod. Its value is an int representing the "deletion cost" of this Pod relative to other Pods under the same CloneSet; Pods with lower cost have higher deletion priority. Pods without the annotation default to a deletion cost of 0.
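For instance, to make one Pod under a CloneSet the preferred victim at the next scale-in, give it a lower cost than its peers (the Pod name and value below are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod        # example Pod managed by a CloneSet
  annotations:
    # lower than the default cost of 0, so this Pod is deleted first at scale-in
    controller.kubernetes.io/pod-deletion-cost: "-100"
```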
Note that this deletion order is not absolute, because the actual deletion of Pods follows an order similar to the following (items on the left are deleted first):
- unscheduled < scheduled
- PodPending < PodUnknown < PodRunning
- not ready < ready
- smaller pod-deletion-cost < larger pod-deletion-cost
- shorter time in Ready < longer
- more container restarts < fewer
- shorter creation time < longer
2. Image pre-download for in-place upgrade
When CloneSet upgrades an application in place, only the container image is upgraded and the Pod is not rebuilt, which guarantees that the Pod stays on the same node before and after the upgrade. Therefore, if CloneSet pulls the new version of the image onto the nodes of all Pods ahead of time, the in-place upgrades in subsequent release batches become much faster.
To use this feature in the current version, you need to explicitly enable the PreDownloadImageForInPlaceUpdate feature-gate when installing or upgrading Kruise. Once enabled, when a user updates the image in the CloneSet template and the release strategy allows in-place upgrade, CloneSet automatically creates an ImagePullJob object for the new image (the batch image pre-download feature provided by OpenKruise) to pre-pull the new image on the nodes where the Pods reside.
By default, CloneSet sets the ImagePullJob's parallelism to 1, meaning the image is pulled on nodes one at a time. To adjust it, set the image pre-download parallelism via an annotation on the CloneSet:
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  annotations:
    apps.kruise.io/image-predownload-parallelism: "5"
3. Scale out before scaling in when replacing Pods
In previous versions, CloneSet's maxUnavailable and maxSurge strategies only took effect during the application release process. Starting from Kruise v0.9.0, these two policies also apply to specified Pod deletion.
In other words, when a user specifies a Pod for deletion via podsToDelete or the apps.kruise.io/specified-delete: true label (see the official website documentation for details), CloneSet only deletes it when the number of currently unavailable Pods (relative to the total number of replicas) is less than maxUnavailable. Meanwhile, if the user has configured a maxSurge policy, CloneSet may first create a new Pod, wait for it to become ready, and only then delete the specified old Pod.
Which replacement method is used depends on the current maxUnavailable value and the actual number of unavailable Pods. For example:
- For a CloneSet with maxUnavailable=2, maxSurge=1 and one Pod pod-a in an unavailable state, if you specify pod-b for deletion, CloneSet deletes it immediately and then creates a new Pod.
- For a CloneSet with maxUnavailable=1, maxSurge=1 and one Pod pod-a in an unavailable state, if you specify pod-b for deletion, CloneSet first creates a new Pod, waits for it to be ready, and finally deletes pod-b.
- For a CloneSet with maxUnavailable=1, maxSurge=1 and one Pod pod-a in an unavailable state, if you specify pod-a itself for deletion, CloneSet deletes it immediately and then creates a new Pod.
- ...
4. Efficient rollback based on partition final state
Among the native workloads, Deployment itself does not support grayscale release, while StatefulSet provides partition semantics that let users control the number of grayscale-upgraded Pods. Kruise workloads such as CloneSet and Advanced StatefulSet also provide partition to support grayscale release in batches.
For CloneSet, the semantics of partition is the number or percentage of Pods to retain at the old version. For example, for a CloneSet with 100 replicas, changing partition step by step through 80 -> 60 -> 40 -> 20 -> 0 during an image upgrade completes the release in 5 batches.
In the past, however, whether with Deployment, StatefulSet, or CloneSet, rolling back during a release required changing the template information (image) back to the old version. For the latter two, lowering the partition during grayscale triggers upgrades from the old version to the new, but raising the partition again does nothing.
Starting from v0.9.0, CloneSet's partition supports a "final-state rollback" feature. If the CloneSetPartitionRollback feature-gate was enabled when installing or upgrading Kruise, then when a user raises the partition, CloneSet rolls back the corresponding number of new-version Pods to the old version.
The benefit is obvious: during grayscale release, you only need to adjust the partition value back and forth to flexibly control the proportion of new and old versions. Note, however, that the "new and old versions" CloneSet relies on correspond to the updateRevision and currentRevision in its status:
- updateRevision: the template version currently defined by the CloneSet.
- currentRevision: the template version of the CloneSet at the last completed release.
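With this feature, a grayscale release and its rollback are both just edits to one field. A sketch for the 100-replica example above:

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
  replicas: 100
  updateStrategy:
    # 80 old-version Pods retained => 20 Pods upgraded (first batch);
    # with CloneSetPartitionRollback enabled, raising this back to 100
    # rolls those 20 Pods back to the old version.
    partition: 80
```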
5. Short hash
By default, the controller-revision-hash value that CloneSet sets in a Pod's labels is the full name of the ControllerRevision, such as:
apiVersion: v1
kind: Pod
metadata:
  labels:
    controller-revision-hash: demo-cloneset-956df7994
It is the concatenation of the CloneSet name and the ControllerRevision hash. The hash is usually 8 to 10 characters long, and a label value in Kubernetes cannot exceed 63 characters; therefore, a CloneSet name generally cannot exceed 52 characters, and if it does, Pods cannot be created successfully.
Version v0.9.0 introduces a new feature-gate, CloneSetShortHash. When it is enabled, CloneSet sets the controller-revision-hash in Pods to the hash value only, such as 956df7994, so the CloneSet name is no longer subject to any length restriction. (Even with this feature enabled, CloneSet still recognizes and manages existing Pods whose revision label uses the full format.)
SidecarSet
Sidecar hot upgrade function
SidecarSet is a workload provided by Kruise for managing sidecar containers independently. Users can inject and upgrade specified sidecar containers within a certain scope of Pods through a SidecarSet.
By default, the independent in-place upgrade of a sidecar stops the old container first and then creates the new one. This approach suits sidecar containers that do not affect Pod service availability, such as log-collection agents. But for many proxy or runtime sidecar containers, such as Istio Envoy, this upgrade method is problematic: Envoy, as the proxy container in the Pod, proxies all of its traffic, and restarting it directly affects the availability of the Pod's service. Upgrading an envoy sidecar independently would require a complex graceful-termination and coordination mechanism, so we provide a new solution for such sidecar containers: hot upgrade.
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
spec:
  # ...
  containers:
  - name: nginx-sidecar
    image: nginx:1.18
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/bash
          - -c
          - /usr/local/bin/nginx-agent migrate
    upgradeStrategy:
      upgradeType: HotUpgrade
      hotUpgradeEmptyImage: empty:1.0.0
- upgradeType: HotUpgrade marks this sidecar container as the hot-upgrade type, so the hot-upgrade scheme is used when it is upgraded.
- hotUpgradeEmptyImage: for a hot-upgraded sidecar container, the business must provide an empty container for switching during the hot-upgrade process. The empty container has the same configuration as the sidecar container (except for the image address), such as command, lifecycle, and probes, but it does no actual work.
- lifecycle.postStart: state migration. This completes the state migration during the hot-upgrade process; the script must be implemented by the business according to its own characteristics. For example, nginx hot upgrade requires completing Listen FD sharing and traffic reload.
For the specific sidecar injection and hot-upgrade flow, please refer to the official website documentation.
Finally
To learn more about the capabilities above, visit the official website documentation. Anyone interested in OpenKruise is welcome to participate in our community building. If you are already using the OpenKruise project, please register in the issue.
Search group number 23330762 on DingTalk to join the exchange group!