Author: Wang Siyu (Jiuzhu)

OpenKruise, the cloud-native application automation suite and CNCF Sandbox project, has recently released v1.2.

OpenKruise [1] is an extended capability suite for Kubernetes, focusing on the deployment, upgrade, operations, and availability protection of cloud-native applications. All of its features are implemented through standard extension mechanisms such as CRDs, and it can be used on any Kubernetes cluster of version 1.16 or above. Kruise can be installed with a single helm command; no further configuration is required.

What's new in v1.2

In v1.2, OpenKruise provides a new PersistentPodState CRD and controller, new fields in the CloneSet status and lifecycle hooks, and several optimizations for PodUnavailableBudget.

1. New CRD and controller: PersistentPodState

With the development of cloud native, more and more companies are deploying stateful services (such as etcd and MQ) on Kubernetes. The K8s StatefulSet is the workload designed for managing stateful services, and it accounts for their deployment characteristics in many ways. However, StatefulSet only persists a limited set of Pod state, such as ordered and stable Pod names and persistent PVCs; it cannot satisfy other state-keeping needs, such as fixed-IP scheduling or preferentially scheduling Pods back to the Nodes they previously ran on. Typical cases include:

  • Service-discovery middleware is highly sensitive to Pod IPs after deployment, and requires that IPs do not change arbitrarily
  • A database service persists its data to the host disk, so changing the Node a Pod belongs to would cause data loss

For scenarios like these, Kruise introduces the PersistentPodState CRD, which can persist additional Pod state, such as the information needed for fixed-IP scheduling.

An example PersistentPodState resource object looks like this:

apiVersion: apps.kruise.io/v1alpha1
kind: PersistentPodState
metadata:
  name: echoserver
  namespace: echoserver
spec:
  targetRef:
    # native K8s StatefulSet or Kruise Advanced StatefulSet
    # only StatefulSet workloads are supported
    apiVersion: apps.kruise.io/v1beta1
    kind: StatefulSet
    name: echoserver
  # required node affinity: after a Pod is recreated, it must be scheduled into the same zone
  requiredPersistentTopology:
    nodeTopologyKeys:
      - failure-domain.beta.kubernetes.io/zone
      # other node label keys may be appended to this list
  # preferred node affinity: after a Pod is recreated, it will preferably be scheduled back to the same Node
  preferredPersistentTopology:
    - preference:
        nodeTopologyKeys:
          - kubernetes.io/hostname
          # other node label keys may be appended to this list
      # int, [1 - 100]
      weight: 100

"Fixed IP scheduling" should be a relatively common K8s deployment requirement for stateful services. Its meaning is not "specified Pod IP deployment", but requires regular operation and maintenance such as business release or machine eviction after the first deployment of Pod None of the operations will cause the Pod IP to change. To achieve the above effects, first of all, the K8s network component needs to support the ability of Pod IP reservation and keep the IP unchanged as much as possible. In this paper, the Host-local plug-in in the flannel network component has been modified in some codes, so that it can maintain the Pod under the same Node. The effect of IP unchanged, the relevant principle is not stated here, please refer to the code: host-local [ 2] .

"Fixed IP scheduling" seems to be supported by network components. What does this have to do with PersistentPodState? Because, the network components have certain limitations to achieve "Pod IP remains unchanged", for example: flannel can only support keeping Pod IP unchanged with Node. However, the biggest feature of K8s scheduling is "uncertainty", so "how to ensure that the Pod is scheduled to the same Node after reconstruction" is the problem solved by PersistentPodState.

In addition, you can have Kruise automatically create PersistentPodState objects for your StatefulSets by adding the following annotations to a StatefulSet or Advanced StatefulSet, which avoids the burden of creating every PersistentPodState by hand.

apiVersion: apps.kruise.io/v1alpha1
kind: StatefulSet
metadata:
  annotations:
    # automatically generate a PersistentPodState object for this workload
    kruise.io/auto-generate-persistent-pod-state: "true"
    # preferred node affinity: after a Pod is recreated, it will preferably be scheduled back to the same Node
    kruise.io/preferred-persistent-topology: kubernetes.io/hostname[,other node labels]
    # required node affinity: after a Pod is recreated, it must be scheduled into the same zone
    kruise.io/required-persistent-topology: failure-domain.beta.kubernetes.io/zone[,other node labels]

2. CloneSet: new calculation logic for percentage partitions, plus a new status field

In the past, CloneSet computed its partition value by rounding up when it was given as a percentage, which meant that even with a partition below 100%, the CloneSet might not upgrade any Pod to the new version. For example, for a CloneSet object with replicas=8 and partition=90%, the computed partition value is 8 (8 * 90% = 7.2, rounded up), so no Pod would be upgraded for the time being. This could be confusing for users, especially in scenarios that use rollout components such as Kruise Rollout or Argo.

Therefore, starting from v1.2, CloneSet guarantees that at least one Pod is upgraded whenever partition is set to a percentage below 100%, unless replicas <= 1.

However, this makes the calculation logic harder to reason about; at the same time, users need to know how many Pods a given partition is expected to upgrade in order to judge whether the current batch of the rollout is complete.

So we also added an expectedUpdatedReplicas field to the CloneSet status, which directly shows how many Pods are expected to be updated under the current partition value. Users only need to check status.updatedReplicas >= status.expectedUpdatedReplicas, together with updatedReadyReplicas, to determine whether the current release batch has reached the completion state.

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
  replicas: 8
  updateStrategy:
    partition: 90%
status:
  replicas: 8
  # ceil(8 * 90%) = 8 would keep every Pod at the old revision,
  # so CloneSet guarantees that at least 1 Pod is updated
  expectedUpdatedReplicas: 1
  updatedReplicas: 1
  updatedReadyReplicas: 1

3. Marking Pods not-ready during lifecycle hook phases

Kruise introduced the lifecycle hook feature in earlier versions: CloneSet and Advanced StatefulSet both support the PreDelete and InPlaceUpdate hooks, while Advanced DaemonSet currently supports only the PreDelete hook.

In the past, these hooks only paused the current operation, allowing the user to do some custom work (such as removing the Pod from service endpoints) before the Pod was deleted, or before and after an in-place upgrade. However, the Pod may still be in the Ready state during these phases, and removing a Ready Pod from a custom service implementation somewhat contradicts Kubernetes conventions: normally, a Pod is removed from service endpoints only when it is NotReady.

Therefore, this version adds a markPodNotReady field to the lifecycle hook, which controls whether the Pod is forced into the NotReady state while it is in a hook phase.

type LifecycleStateType string

// Lifecycle contains the hooks for Pod lifecycle.
type Lifecycle struct {
    // PreDelete is the hook before Pod to be deleted.
    PreDelete *LifecycleHook `json:"preDelete,omitempty"`
    // InPlaceUpdate is the hook before Pod to update and after Pod has been updated.
    InPlaceUpdate *LifecycleHook `json:"inPlaceUpdate,omitempty"`
}

type LifecycleHook struct {
    LabelsHandler     map[string]string `json:"labelsHandler,omitempty"`
    FinalizersHandler []string          `json:"finalizersHandler,omitempty"`

    /**********************  FEATURE STATE: 1.2.0 ************************/
    // MarkPodNotReady = true means:
    // - Pod will be set to 'NotReady' at preparingDelete/preparingUpdate state.
    // - Pod will be restored to 'Ready' at Updated state if it was set to 'NotReady' at preparingUpdate state.
    // Default to false.
    MarkPodNotReady bool `json:"markPodNotReady,omitempty"`
    /*********************************************************************/
}

For a PreDelete hook configured with markPodNotReady: true, the Pod is set to NotReady during the PreparingDelete phase, and it cannot be restored to the normal Ready state even if the replicas value is changed back so that the Pod no longer needs to be deleted.

For an InPlaceUpdate hook configured with markPodNotReady: true, the Pod is set to NotReady during the PreparingUpdate phase, and the forced NotReady condition is removed once the Pod reaches the Updated phase.
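As an illustration, the following is a minimal sketch of a CloneSet that enables this behavior for in-place upgrades; the hook label key here is hypothetical, and the update is held in PreparingUpdate until that label is removed from the Pod:

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  name: sample
spec:
  # replicas, selector and template omitted
  lifecycle:
    inPlaceUpdate:
      # force the Pod NotReady while it is in the PreparingUpdate phase
      markPodNotReady: true
      # the hook blocks the in-place update as long as this label is present
      labelsHandler:
        example.io/unready-blocker: "true"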

4. PodUnavailableBudget: custom workload support and performance optimization

Kubernetes itself provides PodDisruptionBudget to help users protect highly available applications, but it only guards against a single scenario: eviction. PodUnavailableBudget protects application availability and SLA more comprehensively; it can defend not only against Pod eviction, but also against other operations that make Pods unavailable, such as deletion and in-place upgrades.

In the past, PodUnavailableBudget only supported certain specific workloads, such as CloneSet and Deployment, and could not recognize unknown workloads defined by users themselves.

Starting with v1.2, PodUnavailableBudget can protect the Pods of any custom workload, as long as that workload declares the scale subresource.

In a CRD, the scale subresource is declared as follows:

subresources:
  scale:
    labelSelectorPath: .status.labelSelector
    specReplicasPath: .spec.replicas
    statusReplicasPath: .status.replicas

However, if your project is generated by kubebuilder or operator-sdk, you only need to add one line to your workload definition struct and re-run make manifests:

// +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.replicas,selectorpath=.status.labelSelector
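With the scale subresource in place, a PodUnavailableBudget can target the custom workload just as it targets built-in ones. A minimal sketch, in which the workload group, kind, and name are hypothetical:

apiVersion: policy.kruise.io/v1alpha1
kind: PodUnavailableBudget
metadata:
  name: sample-pub
spec:
  targetRef:
    # a user-defined workload that declares the scale subresource
    apiVersion: apps.example.io/v1
    kind: GameServerSet
    name: sample
  # at most 20% of the selected Pods may be unavailable at any time
  maxUnavailable: 20%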

In addition, PodUnavailableBudget also improves runtime performance in large-scale clusters by disabling the DeepCopy that the client performs by default on List operations.

5. Other changes

You can view more changes, along with their authors and commit history, on the Github release [3] page.

Community Involvement

You are very welcome to join the OpenKruise open source community through Github/Slack/DingTalk/WeChat. Do you already have something to share with our community? You can share it at our biweekly community meeting (https://shimo.im/docs/gXqmeQOYBehZ4vqo) or bring it up in any of these channels.

Reference links:

[1] OpenKruise: https://openkruise.io/
[2] host-local: https://github.com/openkruise/samples
[3] Github release: https://github.com/openkruise/kruise/releases

Check out the OpenKruise project on its Github homepage!

