Text | Du Kewei (internal alias: Su Lin)
Senior Development Engineer, Ant Group
Responsible for the stability of Ant's Kubernetes clusters
Focused on cluster component changes and stability risk assurance
This article is 15,738 words; estimated reading time: 20 minutes
Foreword
To support the iterative upgrade of Ant's business, Ant's infrastructure team launched a comprehensive Gzone cloudification project this year. It requires Gzone to be co-deployed in the same cluster as the already cloud-native Rzone, which means a single Sigma cluster will manage more than 10,000 nodes and carry a far more complex mix of workloads.
We therefore started a performance optimization program for large-scale Sigma clusters, with the goal of keeping request latency in line with community standards and preventing it from degrading as the cluster grows.
As the data store of a Sigma cluster, etcd is the cornerstone of the whole cluster and directly determines its performance ceiling. The storage limit recommended by the community for a single etcd cluster is 8 GB, and the storage of a single etcd cluster behind an Ant Sigma cluster has already exceeded this limit; the Gzone cloudification project will inevitably add to etcd's burden.
First, Ant's workloads mix stream computing, offline computing, and online services, including a large number of Pods whose lifetimes are measured in minutes or even seconds. The number of Pods created per day in a single cluster has grown to hundreds of thousands, all of which etcd has to support.
Second, complex business requirements generate a large number of List (list all, list by namespace, list by label), watch, create, update, and delete requests. Given etcd's storage characteristics, the performance of these requests degrades severely as etcd's storage grows, and can even lead to etcd OOM, request timeouts, and other anomalies.
Finally, the growth in request volume also amplifies the P99 latency spikes that compact and defrag operations cause in etcd, and can even lead to request timeouts, causing key cluster components such as the scheduler and CNI services to intermittently lose service and making the cluster unavailable.
Based on past experience, splitting etcd cluster data horizontally is an effective optimization. A typical split moves important data such as Pods into a dedicated etcd cluster, reducing the storage and request-processing pressure on any single etcd and lowering request latency. However, Pod resource data is special in a Kubernetes cluster and has requirements other resources do not, so extreme care is needed when splitting it out of a K8s cluster that is already serving production at scale.
This article records some of the practical experience and insights Ant Group gained in the process of splitting out Pod resource data.
To borrow the saying about "throwing out a brick to attract jade": this is a modest contribution, and comments and corrections are very welcome!
PART. 1 CHALLENGES
From previous Pod data splitting experience we know that it is a high-risk and complex operation, and the reason lies in the particularity of the Pod data itself.
A Pod is a group of containers, the smallest schedulable unit in a Sigma cluster, and the ultimate carrier of business workloads. Pods are the core resource that a Sigma cluster ultimately delivers.
The core SLOs of a Sigma cluster are also measured on Pod operations such as creation, deletion, and upgrade, so Pod data can fairly be called the most important resource data in the cluster. At the same time, a Sigma cluster is an event-driven system designed around the desired (final) state. Therefore, beyond basic data consistency before and after the split, we also have to consider the impact on other components during the split itself.
The core steps in previous split procedures were data integrity verification and shutting down key service components. As the names suggest, data integrity verification ensures the data is consistent before and after the split, and key components are shut down to avoid unintended consequences during the split, such as Pods being unexpectedly deleted or Pod status being corrupted. But if you copy this process onto the Ant Sigma cluster, the problems become obvious.
Ant Sigma is core infrastructure for Ant Group. After more than two years of development, it has become a cloud base with 80+ clusters, where a single cluster can reach 12,000+ nodes. Millions of Pods run on these clusters, and 200,000+ short-lived Pods are created every day. To meet various business needs, the Sigma team cooperates with multiple cloud-native teams covering storage, networking, PaaS, and more; to date, hundreds of third-party components have been built on Sigma. If the Pod data split required restarting all of these components, it would take a great deal of communication with the business owners and coordination among many people, and a careless or incomplete inventory that missed even a few components could lead to unintended consequences.
Looking at the current state of the Ant Sigma cluster, the problems with the existing Pod data splitting process can be summarized as follows:
- Manually operating a large number of components takes a long time and is error-prone
Dozens of components potentially need to be restarted. The owner of each component has to be contacted to confirm whether it needs a restart, which takes a lot of communication time, and any omission may cause unintended consequences such as leftover resources or dirty data.
- Complete downtime lasts long enough to break the SLO
During the data split, all components are shut down, the cluster is completely unavailable, and the split itself is extremely time-consuming. Based on previous experience it can last 1 to 2 hours, which thoroughly breaks the Sigma cluster's SLO commitments.
- Data integrity verification is weak
The split process uses the open-source etcd tool make-mirror to migrate data. Its implementation is simple: it reads keys from one etcd and rewrites them into another. It does not support resuming from a breakpoint, and rewriting into the new etcd destroys the revision of each original key, which in turn changes the resourceVersion of the Pod data and may cause unintended consequences (revision is explained in detail later). The final verification is merely checking that the number of keys matches; if the data of some intermediate key is damaged, it cannot be detected.
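For illustration, the kind of count-only comparison this verification relies on can be written in a few lines against etcd's clientv3 API (a minimal sketch; the endpoints and prefix are placeholders). It also makes the weakness obvious: the check says nothing about the content of the keys.

```go
package main

import (
    "context"
    "fmt"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

// countKeys returns the number of keys under prefix in one etcd cluster.
func countKeys(endpoints []string, prefix string) (int64, error) {
    cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
    if err != nil {
        return 0, err
    }
    defer cli.Close()

    // WithCountOnly skips the values; only the key count comes back.
    resp, err := cli.Get(context.Background(), prefix, clientv3.WithPrefix(), clientv3.WithCountOnly())
    if err != nil {
        return 0, err
    }
    return resp.Count, nil
}

func main() {
    oldCnt, err := countKeys([]string{"https://old-etcd:2379"}, "/registry/") // placeholder endpoints
    if err != nil {
        panic(err)
    }
    newCnt, err := countKeys([]string{"https://new-etcd:2379"}, "/registry/")
    if err != nil {
        panic(err)
    }
    fmt.Printf("old=%d new=%d equal=%v\n", oldCnt, newCnt, oldCnt == newCnt)
}
```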
PART. 2 Problem Analysis
What we hoped for
Being a bit lazy, I did not want to coordinate restarts with so many component owners: restarting a large number of components makes it easy to miss a step and cause unexpected problems. At the same time, is there a better way to verify data integrity?
If components did not need to be restarted, the whole procedure would evolve into the simplified process below, which we hoped would streamline operations while remaining safe.
To reach this goal, let's go back to the source and review the whole process from scratch.
What is data splitting doing?
As we all know, etcd stores all kinds of resource data in a Kubernetes cluster, such as Pods, Services, ConfigMaps, Deployments, and so on.
By default, kube-apiserver stores all resource data in a single etcd cluster. As the storage grows, that etcd cluster hits performance bottlenecks. Splitting etcd data by resource type to improve kube-apiserver's access performance is a widely accepted optimization in the industry; in essence it reduces both the data volume and the access QPS of any single etcd cluster.
Based on the scale and requirements of the Ant Sigma cluster, we split the data into 4 independent etcd clusters, storing Pods, Leases, events, and all remaining resources respectively.
### Event resources
K8s event resources are not the events delivered on a watch; they generally record things that happened to an associated object, such as a Pod pulling an image or a container starting. On the business side, CI/CD pipelines typically need to display a status timeline and therefore pull event data frequently.
Event data has a TTL (2 hours by default), and apart from observing an object's life-cycle changes through events, there is generally no important business dependency on it. Event data is therefore usually considered discardable, and its consistency does not need to be guaranteed across the split.
Because of these data characteristics, splitting out events is the easiest: you only need to change kube-apiserver's startup configuration and restart it. There is no data migration and no cleanup of old data, and apart from kube-apiserver, no other component needs a restart or configuration change.
### Lease Resources
Lease resources are generally used for Kubelet heartbeat reporting, and Lease is also the resource type the community recommends for leader election among controller components.
Each Kubelet reports its heartbeat through a Lease object, by default every 10s; the more nodes, the more update requests etcd has to handle. The number of node Lease updates per minute is 6 times the number of nodes, so 10,000 nodes produce 60,000 updates per minute, which is considerable. Lease updates are also critical for judging whether a Node is Ready, which is why Leases are split out separately.
Controller components almost universally use the open-source leader-election code package, so their leader-election logic can be treated as consistent and unified; the Kubelet heartbeat code is under our own control. Analysis of the code shows that Lease resources do not require strict data consistency: it is enough to guarantee that the Lease has been updated within a certain period of time, and components that use Leases will keep functioning normally.
For Kubelet heartbeats, the controller-manager decides whether a Node is Ready with a default grace period of 40s: as long as the corresponding Lease has been updated within 40s, the Node will not be marked NotReady, and this 40s window is adjustable. The leader-election Lease duration of controller components is generally 5s–65s and can also be configured.
So although splitting Leases is more complicated than events, it is still relatively simple. The extra step is that during the split, the Lease data in the old etcd must be synchronized to the new etcd cluster; we generally use the etcdctl make-mirror tool for this. While the sync is running, a component updating a Lease object may land the request either in the old etcd or in the new one; updates that land in the old etcd are copied over by make-mirror, and since there are few Lease objects the whole window is short, so this causes no problems. In addition, after the migration is complete, the Lease data in the old etcd should be deleted to release the space it occupies; the space is small, but there is no reason to waste it. As with events, the whole process requires no component restart or configuration change other than kube-apiserver.
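For illustration, the initial sync that etcdctl make-mirror performs amounts to reading the keys under a prefix and rewriting them into the destination. The minimal sketch below (the continuous watch phase of make-mirror is omitted, and the prefix and client setup are assumptions) also shows the side effect that matters later:

```go
package leasesplit

import (
    "context"

    clientv3 "go.etcd.io/etcd/client/v3"
)

// mirrorPrefix copies every key under prefix from src to dst once.
// Note that Put assigns brand-new revisions in dst, so the CreateRevision and
// ModRevision of the original keys are not preserved.
func mirrorPrefix(ctx context.Context, src, dst *clientv3.Client, prefix string) error {
    resp, err := src.Get(ctx, prefix, clientv3.WithPrefix())
    if err != nil {
        return err
    }
    for _, kv := range resp.Kvs {
        if _, err := dst.Put(ctx, string(kv.Key), string(kv.Value)); err != nil {
            return err
        }
    }
    return nil
}
```

For Leases (keys under something like /registry/leases/) the lost revisions are harmless; for Pods, as the analysis below shows, they are anything but. make-mirror additionally keeps watching the source for new updates until it is stopped.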
### Pod resources
Pod resources are probably the resource everyone is most familiar with: all workloads are ultimately carried by Pods, and the core of K8s cluster management lies in scheduling and managing Pod resources. Pod data requires strict consistency, and no watch event generated by any Pod update may be missed, otherwise Pod delivery may be affected. These characteristics are exactly why the traditional Pod data splitting process requires restarting related components on a large scale; the reasons are analyzed below.
The community kube-apiserver already provides the --etcd-servers-overrides flag to configure independent etcd storage per resource type:
--etcd-servers-overrides strings
Per-resource etcd servers overrides, comma separated. The individual override format: group/resource#servers, where servers are URLs, semicolon separated. Note that this applies only to resources compiled into this server binary.
Brief configuration examples for the resources we commonly split:

Events split configuration:
--etcd-servers-overrides=/events#https://etcd1.events.xxx:2xxx;https://etcd2.events.xxx:2xxx;https://etcd3.events.xxx:2xxx

Leases split configuration:
--etcd-servers-overrides=coordination.k8s.io/leases#https://etcd1.leases.xxx:2xxx;https://etcd2.leases.xxx:2xxx;https://etcd3.leases.xxx:2xxx

Pods split configuration:
--etcd-servers-overrides=/pods#https://etcd1.pods.xxx.net:2xxx;https://etcd2.pods.xxx:2xxx;https://etcd3.pods.xxx:2xxx
Is restarting the component necessary?
To understand whether restarting components is really necessary, and what the impact of not restarting them would be, we ran a verification in a test environment. We found that after the split completed, new Pods could not be scheduled, existing Pods could not be deleted, and finalizers could not be removed. Analysis showed that the related components could not perceive Pod creation and deletion events.
So why does this happen? To answer that, we have to walk through everything from the core design concepts of K8s down to the implementation details. Let's dig in.
If K8s were an ordinary business system, splitting Pod data would only change the storage location kube-apiserver uses for Pod resources; if the impact were confined to kube-apiserver, this article would not need to exist.
An ordinary business system has a unified storage access layer; data migrations and splits only affect the configuration of that layer, and the business systems above it never notice.
K8s, however, is a different beast!
A K8s cluster is a complex system composed of many extension components that together provide its various capabilities.
Extension components are designed around the final state. There are two core state concepts: the Desired State and the Current State, and every object in the cluster has both.
- The desired state is, simply put, the final state described by the YAML we submit to the cluster for an object;
- The current state is the object's actual state in the cluster.
Data requests such as create, update, patch, and delete are all actions against the desired state, expressing what we want the final state to be. After such an action, the cluster's current state differs from our desired state; the various Operator (Controller) extension components continuously reconcile the difference between the two and drive each object from its current state toward the final state.
Today's Operator components are almost all built on the same open-source frameworks, so their runtime code logic can be regarded as consistent and unified. Inside an Operator, the desired-state objects are obtained by sending a List request to kube-apiserver, but to reduce kube-apiserver's load this List is executed only once, when the component starts (barring unexpected errors); after that, any change to a desired-state object is actively pushed to the Operator by kube-apiserver as a WatchEvent message.
From this point of view, a K8s cluster can fairly be described as an event-driven, final-state-oriented design.
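As a concrete illustration of this List-once-then-Watch pattern, here is a minimal client-go informer sketch (the kubeconfig path is a placeholder; real Operators typically build on frameworks that wrap exactly this mechanism):

```go
package main

import (
    "fmt"
    "time"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
    if err != nil {
        panic(err)
    }
    clientset := kubernetes.NewForConfigOrDie(cfg)

    // Under the hood the informer performs one List, remembers its resourceVersion,
    // and then Watches from that version; handlers only ever see change events.
    factory := informers.NewSharedInformerFactory(clientset, 30*time.Minute)
    podInformer := factory.Core().V1().Pods().Informer()
    podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    func(obj interface{}) { fmt.Println("pod added:", obj.(*corev1.Pod).Name) },
        UpdateFunc: func(_, obj interface{}) { fmt.Println("pod updated:", obj.(*corev1.Pod).Name) },
        DeleteFunc: func(obj interface{}) { fmt.Println("pod deleted") },
    })

    stopCh := make(chan struct{})
    factory.Start(stopCh)
    cache.WaitForCacheSync(stopCh, podInformer.HasSynced)
    <-stopCh
}
```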
The WatchEvent message stream between an Operator and kube-apiserver must not lose a single event. The YAML returned by the initial List request, plus the subsequent WatchEvent changes, is the desired state the Operator should see, i.e., the user's desired state. The key concept for guaranteeing that no event is lost is resourceVersion.
Every object in the cluster has this field, including resources that users define through CRDs (CustomResourceDefinitions).
The point is that this resourceVersion is closely tied to a characteristic of etcd storage itself, the revision, and this matters especially for the List requests that Operators rely on so heavily. Splitting the data and migrating it to a new etcd cluster directly affects the resourceVersion of resource objects.
So the next questions are: what exactly is an etcd revision, and how does it relate to the resourceVersion of a K8s resource object?
The three Revisions of etcd
etcd has three kinds of revision: Revision, CreateRevision, and ModRevision. Their relationship and characteristics can be summarized as follows:
Revision is the cluster-wide MVCC logical clock of etcd; every write or update of a key-value carries a Revision in its response, and the value is guaranteed to be strictly increasing. CreateRevision is the cluster-wide Revision at the moment a key was created, and ModRevision is the cluster-wide Revision at the moment the key was last modified.
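These relationships are easy to observe directly with etcd's clientv3 API; a minimal sketch (the key name is arbitrary and the client construction is left to the caller):

```go
package revisiondemo

import (
    "context"
    "fmt"

    clientv3 "go.etcd.io/etcd/client/v3"
)

// showRevisions puts an arbitrary key twice and reads it back to expose the
// three revisions and the per-key version counter.
func showRevisions(ctx context.Context, cli *clientv3.Client) error {
    if _, err := cli.Put(ctx, "/demo/key", "v1"); err != nil {
        return err
    }
    if _, err := cli.Put(ctx, "/demo/key", "v2"); err != nil {
        return err
    }
    resp, err := cli.Get(ctx, "/demo/key")
    if err != nil {
        return err
    }
    kv := resp.Kvs[0]
    // Header.Revision: cluster-wide MVCC logical clock at the time of this read.
    // CreateRevision:  cluster-wide Revision at which this key was created.
    // ModRevision:     cluster-wide Revision of this key's last modification.
    // Version:         per-key modification counter (reset by deletion).
    fmt.Printf("header.revision=%d create=%d mod=%d version=%d\n",
        resp.Header.Revision, kv.CreateRevision, kv.ModRevision, kv.Version)
    return nil
}
```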
K8s ResourceVersion and Etcd Revision
Every object returned by kube-apiserver carries a resourceVersion field, which clients use to detect whether the object has changed and to implement optimistic concurrency control.
More detail can be found in the source code comments:
```go
// ObjectMeta is metadata that all persisted resources must have, which includes all objects
// users must create.
type ObjectMeta struct {
    ...// omit code here

    // An opaque value that represents the internal version of this object that can
    // be used by clients to determine when objects have changed. May be used for optimistic
    // concurrency, change detection, and the watch operation on a resource or set of resources.
    // Clients must treat these values as opaque and passed unmodified back to the server.
    // They may only be valid for a particular resource or set of resources.
    //
    // Populated by the system.
    // Read-only.
    // Value must be treated as opaque by clients and .
    // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency
    // +optional
    ResourceVersion string `json:"resourceVersion,omitempty" protobuf:"bytes,6,opt,name=resourceVersion"`

    ...// omit code here
}
```
Among kube-apiserver's request verbs, the write operations create, update, patch, and delete all update the revision in etcd; strictly speaking, they cause the revision to grow.
The correspondence between the resourceVersion field of a K8s resource object and the etcd revisions can be summarized as follows: the resourceVersion of an individual object (as returned by get/create/update and as carried on watch events) corresponds to the key's ModRevision, while the resourceVersion of a List response corresponds to etcd's Header.Revision.
Among all kube-apiserver requests and responses, the List response deserves special attention: its resourceVersion is etcd's Header.Revision, i.e., etcd's MVCC logical clock. A write to any key in etcd monotonically increases that Revision, which in turn changes the resourceVersion returned in List responses.
For example, even if no Pod under test-namespace has been modified, listing the Pods in test-namespace will very likely return a larger resourceVersion each time, simply because other keys in etcd are being written.
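A small client-go sketch of this behavior (the clientset construction is left to the caller and the namespace is arbitrary):

```go
package rvdemo

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// printListRV lists Pods in a namespace and prints the List resourceVersion.
// Called repeatedly, the value can keep growing even when no Pod in the
// namespace changes, because it reflects etcd's Header.Revision, which any
// write to any key advances.
func printListRV(ctx context.Context, cs kubernetes.Interface, ns string) error {
    podList, err := cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
    if err != nil {
        return err
    }
    fmt.Printf("namespace=%s listResourceVersion=%s pods=%d\n", ns, podList.ResourceVersion, len(podList.Items))
    return nil
}
```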
In our no-restart Pod data split, only writes to Pods are frozen; writes to all other data are still allowed. While the updated kube-apiserver configuration rolls out, the Revision of the old etcd therefore inevitably grows far beyond that of the new etcd holding the Pod data, which creates a serious inconsistency in List resourceVersion before and after the split.
The resourceVersion held inside an Operator is the key to not losing events. The etcd data split therefore affects not only kube-apiserver but also many Operator-like components; once change events are lost, problems such as failed Pod delivery and dirty data follow.
So far we know that the List resourceVersion the Operator obtained is inconsistent: the value returned by the old etcd is larger than what the new etcd would return. But what does that have to do with the Operator losing Pod update events?
To answer this, we have to start from ListAndWatch in K8s's component-collaboration design, which inevitably involves both the client side (client-go) and the server side (kube-apiserver).
### ListAndWatch in Client-go
As we all know, Operator components perceive events through the open-source client-go package.
Schematic diagram of how client-go perceives data-object events:
The core is the ListAndWatch method; the resourceVersion that guarantees the client does not lose events is obtained through the List request inside this method.
ListAndWatch first lists all objects, records the resource version, and then watches from that resource version to learn about changes. The resource version is initially set to 0, so list() may be served from a cache that lags behind the contents of etcd; the Reflector then catches up on the lag through watch, bringing the local cache in line with the etcd data.
The key code is as follows:
```go
// Run repeatedly uses the reflector's ListAndWatch to fetch all the
// objects and subsequent deltas.
// Run will exit when stopCh is closed.
func (r *Reflector) Run(stopCh <-chan struct{}) {
    klog.V(2).Infof("Starting reflector %s (%s) from %s", r.expectedTypeName, r.resyncPeriod, r.name)
    wait.BackoffUntil(func() {
        if err := r.ListAndWatch(stopCh); err != nil {
            utilruntime.HandleError(err)
        }
    }, r.backoffManager, true, stopCh)
    klog.V(2).Infof("Stopping reflector %s (%s) from %s", r.expectedTypeName, r.resyncPeriod, r.name)
}

// ListAndWatch first lists all items and get the resource version at the moment of call,
// and then use the resource version to watch.
// It returns error if ListAndWatch didn't even try to initialize watch.
func (r *Reflector) ListAndWatch(stopCh <-chan struct{}) error {
    var resourceVersion string

    // Explicitly set "0" as resource version - it's fine for the List()
    // to be served from cache and potentially be delayed relative to
    // etcd contents. Reflector framework will catch up via Watch() eventually.
    options := metav1.ListOptions{ResourceVersion: "0"}

    if err := func() error {
        var list runtime.Object
        ... // omit code here
        listMetaInterface, err := meta.ListAccessor(list)
        ... // omit code here
        resourceVersion = listMetaInterface.GetResourceVersion()
        ... // omit code here
        r.setLastSyncResourceVersion(resourceVersion)
        ... // omit code here
        return nil
    }(); err != nil {
        return err
    }

    ... // omit code here

    for {
        ... // omit code here
        options = metav1.ListOptions{
            ResourceVersion: resourceVersion,
            ... // omit code here
        }

        w, err := r.listerWatcher.Watch(options)
        ... // omit code here

        if err := r.watchHandler(w, &resourceVersion, resyncerrc, stopCh); err != nil {
            ... // omit code here
            return nil
        }
    }
}
```
Organized into a flow chart to make it clearer:
### Watch handling in kube-apiserver
Having looked at the client-side logic, let's look at the server side; the key is how kube-apiserver handles watch requests. For each watch request, kube-apiserver creates a new watcher and starts a goroutine (a watchServer) dedicated to serving it, pushing resource event messages to the client from that watchServer.
Here is the crux: the watchRV parameter in the client's watch request comes from the List response in client-go, and kube-apiserver only pushes events whose resourceVersion is greater than watchRV. During the split, the client's watchRV can be far larger than the resourceVersion of any event kube-apiserver holds locally, and that is the root cause of the client losing Pod update events.
Seen this way, restarting the Operator components is indeed a way out: a restart triggers client-go's relist, which fetches the latest Pod List resourceVersion, so subsequent Pod update events are not lost.
PART. 3 Cracking the Problem
Solving the restart problem
At this point it seems we cannot escape the fate of restarting components. But in analyzing the problem we have pinned down its cause, and that in itself points to the solution.
The restart problem involves two parties: client-go on the client side and kube-apiserver on the server side, so the breakthrough has to come from one of them.
For client-go, the key is to make ListAndWatch issue a fresh List request so that it picks up kube-apiserver's latest resourceVersion and stops missing subsequent events. If client-go could be made to refresh its local resourceVersion through a new List at just the right moment, the problem would be solved; but changing client-go's code would itself require releasing and restarting every component for the change to take effect. So the question becomes: how do we get client-go to re-issue the List request without modifying its code?
Reviewing the logical flow of ListAndWatch, we can see that whether a new List request is issued hinges on the error returned by the Watch method, and that error is determined by kube-apiserver's response to the watch request. So let's turn our attention to the server side, kube-apiserver.
Different watch request handling
kube-apiserver's watch handling was introduced above. By modifying how kube-apiserver processes watch requests, we can make it cooperate with client-go and achieve our goal.
From the analysis above we know that client-go's watchRV is much larger than the resourceVersion in kube-apiserver's local watch cache. Exploiting this, kube-apiserver can return a specific error (TooLargeResourceVersionError), which triggers client-go's relist. kube-apiserver has to be restarted anyway to pick up the new storage configuration, and that same restart brings our modified logic into effect.
The transformed logic is illustrated below:
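The actual patch lives inside kube-apiserver; the following is only a conceptual sketch of the decision it adds, not the real code path:

```go
package watchsketch

import "fmt"

// Conceptual sketch only. The idea of the modification: if a watch arrives
// whose requested resourceVersion is larger than anything the (new, Pod-only)
// backing store has ever produced, reply with a "resource version too large"
// style error instead of silently waiting for events that can never come.
// client-go's Reflector treats such a watch failure as "my cached
// resourceVersion is unusable" and falls back to a fresh List, which
// re-synchronizes it with the new etcd.
func admitWatch(requestedRV, currentRV uint64) error {
    if requestedRV > currentRV {
        return fmt.Errorf("too large resource version: requested %d, current %d", requestedRV, currentRV)
    }
    // Normal path: register a watcher and stream events newer than requestedRV.
    return nil
}
```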
Guaranteeing data consistency
The previous approach migrated data with the etcd make-mirror tool. Its advantage is that it is simple and convenient: an open-source tool that works out of the box. Its disadvantage is that the implementation is equally simple: it reads keys from one etcd and rewrites them into another, it does not support resuming from a breakpoint, and it is unfriendly to large, time-consuming migrations. Worse, the createRevision information of each key is destroyed along the way, so strict data integrity checks are required after the migration.
Facing these problems, we changed our way of thinking. What we fundamentally need is a data migration that preserves etcd's own storage structure (KeyValue) intact before and after. That led us to etcd's snapshot tool, which was originally designed for disaster recovery: a brand-new etcd instance can be rebuilt from a snapshot of an existing one, and the data restored from a snapshot keeps the original KeyValue fully intact in the new etcd, which is exactly what we want.
```go
// etcd KeyValue data structure
type KeyValue struct {
    // key is the key in bytes. An empty key is not allowed.
    Key []byte `protobuf:"bytes,1,opt,name=key,proto3" json:"key,omitempty"`
    // create_revision is the revision of last creation on this key.
    CreateRevision int64 `protobuf:"varint,2,opt,name=create_revision,json=createRevision,proto3" json:"create_revision,omitempty"`
    // mod_revision is the revision of last modification on this key.
    ModRevision int64 `protobuf:"varint,3,opt,name=mod_revision,json=modRevision,proto3" json:"mod_revision,omitempty"`
    // version is the version of the key. A deletion resets
    // the version to zero and any modification of the key
    // increases its version.
    Version int64 `protobuf:"varint,4,opt,name=version,proto3" json:"version,omitempty"`
    // value is the value held by the key, in bytes.
    Value []byte `protobuf:"bytes,5,opt,name=value,proto3" json:"value,omitempty"`
    // lease is the ID of the lease that attached to key.
    // When the attached lease expires, the key will be deleted.
    // If lease is 0, then no lease is attached to the key.
    Lease int64 `protobuf:"varint,6,opt,name=lease,proto3" json:"lease,omitempty"`
}
```
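For reference, pulling a snapshot programmatically is a thin wrapper over the clientv3 Maintenance API; the minimal sketch below is roughly what etcdctl snapshot save does (the endpoint and file name are placeholders):

```go
package main

import (
    "context"
    "io"
    "os"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"https://old-etcd:2379"}, // placeholder endpoint
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    // Snapshot streams the backend database file of the etcd member.
    rc, err := cli.Snapshot(context.Background())
    if err != nil {
        panic(err)
    }
    defer rc.Close()

    f, err := os.Create("old-etcd.snapshot.db")
    if err != nil {
        panic(err)
    }
    defer f.Close()
    if _, err := io.Copy(f, rc); err != nil {
        panic(err)
    }
}
```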
Clipping the migrated data
Although an etcd snapshot preserves the KeyValue integrity we want, an etcd rebuilt from it contains all of the old etcd's data, which is not what we want. We could of course clean up the redundant data after the new etcd is up, but that is not the best approach.
Instead, we modified the etcd snapshot tool to clip the data during the snapshot process. etcd's storage model contains a list of buckets, a storage concept of etcd that roughly corresponds to a table in a relational database, with each key in a bucket corresponding to a row in that table. The most important bucket is the one named key, which stores all the K8s resource objects. The keys of K8s resource objects follow a fixed format: each resource type and namespace has a fixed prefix, and Pod data, for example, lives under /registry/pods/. During the snapshot we can identify Pod data by this prefix and cut everything else out.
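A minimal sketch of the clipping idea, assuming the snapshot file is etcd's bbolt backend and that K8s objects live in the bucket named key as marshalled mvccpb.KeyValue entries (the bucket's own keys are revision bytes). The real tool also has to handle tombstone records and the meta buckets; only the prefix filter is shown:

```go
package main

import (
    "bytes"

    bolt "go.etcd.io/bbolt"
    "go.etcd.io/etcd/api/v3/mvccpb"
)

// clipSnapshot deletes every KeyValue in the "key" bucket whose key does not
// start with keepPrefix, e.g. []byte("/registry/pods/").
func clipSnapshot(path string, keepPrefix []byte) error {
    db, err := bolt.Open(path, 0600, nil)
    if err != nil {
        return err
    }
    defer db.Close()

    return db.Update(func(tx *bolt.Tx) error {
        b := tx.Bucket([]byte("key"))
        if b == nil {
            return nil
        }
        var toDelete [][]byte // collect first, delete after iterating
        c := b.Cursor()
        for k, v := c.First(); k != nil; k, v = c.Next() {
            var kv mvccpb.KeyValue
            if err := kv.Unmarshal(v); err != nil {
                continue // skip records that are not plain KeyValue entries
            }
            if !bytes.HasPrefix(kv.Key, keepPrefix) {
                toDelete = append(toDelete, append([]byte(nil), k...))
            }
        }
        for _, k := range toDelete {
            if err := b.Delete(k); err != nil {
                return err
            }
        }
        return nil
    })
}

func main() {
    if err := clipSnapshot("old-etcd.snapshot.db", []byte("/registry/pods/")); err != nil {
        panic(err)
    }
}
```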
In addition, by etcd's accounting the size of the snapshot equals the size of etcd's on-disk database file, which is described by two values: db total size and db inuse size. db total size is the size of the file etcd occupies on disk, including a lot of garbage key data that has not yet been cleaned up; db inuse size is the total size of the data actually in use. When etcd defrag is rarely run to compact the storage, the total value is usually much larger than the inuse value.
For our clipping this means that even after the non-Pod data is cut out, the snapshot does not get any smaller; we have to defragment it to release the redundant space.
The diagram below shows how db total changes through this process. The snapshot we finally obtain is only as large as the Pod data itself, which matters a great deal for the data transfer time.
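The space release can be done offline on the clipped snapshot file. etcd's defrag essentially rewrites the backend into a fresh file; the sketch below illustrates the same idea with bbolt's Compact helper (file names are placeholders, and this stands in for however the actual tooling defragments the snapshot):

```go
package main

import bolt "go.etcd.io/bbolt"

// compactSnapshot copies only the live data of src into a fresh bbolt file,
// so the on-disk size of the clipped snapshot shrinks from "db total size"
// towards "db inuse size" - conceptually what etcd's defrag does to its backend.
func compactSnapshot(src, dst string) error {
    srcDB, err := bolt.Open(src, 0400, &bolt.Options{ReadOnly: true})
    if err != nil {
        return err
    }
    defer srcDB.Close()

    dstDB, err := bolt.Open(dst, 0600, nil)
    if err != nil {
        return err
    }
    defer dstDB.Close()

    // A txMaxSize of 0 copies everything in a single transaction.
    return bolt.Compact(dstDB, srcDB, 0)
}

func main() {
    if err := compactSnapshot("clipped.snapshot.db", "pods-only.snapshot.db"); err != nil {
        panic(err)
    }
}
```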
A pitfall in freezing Pod writes
In the splitting process above we mentioned that K8s can block writes to a class of resources through a MutatingWebhook that simply returns a deny result, which is straightforward. Here we record a small pitfall we ran into at the time.
Our initial MutatingWebhookConfiguration was the following, but after applying it we could still receive Pod update events.
```yaml
# First version of the configuration - it has a problem
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: deny-pods-write
webhooks:
- admissionReviewVersions:
  - v1beta1
  clientConfig:
    url: https://extensions.xxx/always-deny
  failurePolicy: Fail
  name: always-deny.extensions.k8s
  namespaceSelector: {}
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - "*"
    resources:
    - pods
    scope: '*'
  sideEffects: NoneOnDryRun
```
Investigation showed that it was the Pod status field that was being updated. Reading the apiserver code, we found that Pod storage involves more than the single pods resource: it also includes the subresources below, and as far as apiserver storage is concerned, pods/status and pods are different resources.
"pods": podStorage.Pod,
"pods/attach": podStorage.Attach,
"pods/status": podStorage.Status,
"pods/log": podStorage.Log,
"pods/exec": podStorage.Exec,
"pods/portforward": podStorage.PortForward,
"pods/proxy": podStorage.Proxy,
"pods/binding": podStorage.Binding,
After adjustment, the following configuration blocks all Pod data updates; note the resources field.
A small pit, recorded here for the record.
```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: deny-pods-write
webhooks:
- admissionReviewVersions:
  - v1beta1
  clientConfig:
    url: https://extensions.xxx/always-deny
  failurePolicy: Fail
  name: always-deny.extensions.k8s
  namespaceSelector: {}
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - "*"
    resources:
    - pods
    - pods/status
    - pods/binding
    scope: '*'
  sideEffects: NoneOnDryRun
```
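For completeness, a minimal sketch of what the handler behind the always-deny URL might look like (the path, port, and certificate file names are placeholders; the v1beta1 AdmissionReview shape matches the admissionReviewVersions above):

```go
package main

import (
    "encoding/json"
    "net/http"

    admissionv1beta1 "k8s.io/api/admission/v1beta1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// alwaysDeny rejects every admission request, freezing writes for the
// resources selected by the MutatingWebhookConfiguration rules.
func alwaysDeny(w http.ResponseWriter, r *http.Request) {
    var review admissionv1beta1.AdmissionReview
    if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    review.Response = &admissionv1beta1.AdmissionResponse{
        UID:     review.Request.UID,
        Allowed: false,
        Result: &metav1.Status{
            Message: "pod writes are frozen during etcd data splitting",
        },
    }
    w.Header().Set("Content-Type", "application/json")
    _ = json.NewEncoder(w).Encode(review)
}

func main() {
    http.HandleFunc("/always-deny", alwaysDeny)
    // Admission webhooks must be served over HTTPS; cert paths are placeholders.
    panic(http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", nil))
}
```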
The final splitting process
With the problems above solved, our final splitting process took shape.
It is illustrated as follows:
During the data split, only Pod writes are blocked; Pods can still be read, and all other resources can be read and written normally. The whole process is automated by a program.
How long Pod writes stay frozen depends on the volume of Pod data; the time is mostly spent copying the Pod data, and the whole process basically finishes within a few minutes.
Apart from kube-apiserver, which unavoidably must be restarted to pick up the new storage configuration, no component needs a restart. This saves a great deal of communication with component owners and avoids the many uncertainties of mass restart operations.
The entire split can be carried out by a single person.
PART. 4 Final summary
Starting from the goal of data splitting, we drew on the experience of our predecessors, but based on our own situation and requirements we went beyond it: through technical improvements we solved the component-restart and data-consistency problems, improving efficiency while keeping the operation technically safe.
This article has walked through the whole thought process and the key points of the implementation.
We did not invent anything new; we simply made some improvements on top of existing logic and tools to reach our goal. Behind those improvements, however, lies an understanding of the underlying details, which is not something you can pick up by drawing a few boxes.
Knowing the why, not just the how, is necessary in most jobs; it takes a lot of our time, but it is worth it.
To end with an old saying:
*The ingenuity of application lies in a single, focused mind.*
Shared with you all.
"References"
(1)【etcd storage limit】:
https://etcd.io/docs/v3.3/dev-guide/limit/
(2)【etcd snapshot】:
https://etcd.io/docs/v3.3/op-guide/recovery/
(3) [Climbing the peak of scale - Ant Group's large-scale Sigma cluster ApiServer optimization practice]: