Text | Tang Bo (alias: Boyi), Technical Expert at Ant Group
Tan Chongkang (alias: Jianyun), Senior Engineer at Ant Group
This article is 10,316 words; estimated reading time: 18 minutes.
Ant Group operates some of the world's largest Kubernetes clusters (known internally as Sigma). While the Kubernetes community treats 5,000 nodes as the de facto ceiling for cluster scale, Ant Group has been running single clusters of more than 10,000 nodes since 2019.
This is not just a difference in the number of nodes per cluster, but also a difference in business scale, diversity and complexity.
To use an analogy: if the cluster scale envisioned by the official Kubernetes project and most of its users is Mount Tai, then Ant Group has built Mount Everest on top of the official solution.
Ant Group's Kubernetes journey has spanned more than three years, from 2018 to the present. Although the 10,000-node scale was already reached in 2019, both the business profile and the cluster hardware have changed enormously since then.
- First, the 10,000-node clusters of that time were built mainly from small servers, whereas today they consist of much larger machines. Although the machine count is still about 10,000, the number of CPUs actually under management has doubled.
- Second, the clusters back then ran almost exclusively long-running online services, with only a few thousand Pods created per day. Today our clusters are dominated by on-demand workloads such as stream computing and offline computing, so Pod churn has multiplied and the number of Pods actually managed now exceeds one million.
- Finally, Serverless workloads are growing rapidly. The lifetime of a Serverless Pod is measured in minutes or even seconds, hundreds of thousands of Pods are created in the cluster every day, and they bring a large volume of Kubernetes list-watch and CRUD requests with them. The cluster apiserver is under several times more pressure than before.
Against this Serverless backdrop, we launched a large-scale Sigma cluster performance optimization program at Ant. Based on business growth trends, we set the goal of supporting clusters of 14,000 nodes while ensuring, through technical optimization, that request latency does not degrade with scale and stays aligned with the community standard: a day-level P99 RT within 1s for create/update/delete requests.
It is easy to imagine how great a challenge this is.
PART. 1 Challenges of large-scale clusters
Undoubtedly, large-scale clusters have brought many challenges:
- As the cluster grows, the blast radius of any failure grows with it. The Sigma clusters carry many of Ant Group's important applications, so ensuring cluster and business stability is the most basic and highest-priority requirement.
- Users issue a large number of list operations, including list all, list by namespace, list by label, and so on, whose cost rises as the cluster grows. These list requests, reasonable or not, can make the apiserver's memory grow rapidly in a short time, causing OOM exceptions and leaving it unable to respond to requests. Moreover, when the apiserver cannot process a request, the business side keeps retrying its list requests, overloading the apiserver so that it cannot recover its service capacity even after a restart, which affects the availability of the entire cluster.
- A large number of list requests that pass through the apiserver straight to etcd also cause etcd instance memory to spike and trigger OOM exceptions.
- As traffic grows, especially with the increase in offline tasks, the number of create/update/delete requests also rises rapidly, driving up the RT of client requests to the apiserver; some controllers, such as the scheduler, have had their leader-election requests time out and lose leadership as a result.
- Traffic growth also aggravates etcd's own performance problems with operations such as compaction, causing etcd's P99 RT to surge and leaving the apiserver unable to respond to requests.
- Cluster controller services, including controllers shipped with the Kubernetes community such as the service controller and the cronjob controller, as well as operator services, have performance problems of their own that are further amplified in a large-scale cluster. These problems then spread to online businesses and cause business damage.
As the old adage of computer science says:
「All problems in computer science can be solved by another level of indirection, except for the problem of too many layers of indirection... and performance problems.」
A large-scale cluster is both a mirror that exposes these performance problems and a touchstone that tests the solutions.
PART. 2 Benefits of large-scale clusters
It is true that building a large-scale Kubernetes cluster also provides many benefits:
- It provides a more convenient infrastructure for running large-scale services and makes it easier to absorb soaring resource demand during business expansion. For example, during Double Eleven and other e-commerce promotions, business growth can be handled by expanding existing clusters rather than building additional small clusters. Cluster administrators also get fewer clusters to manage, which simplifies infrastructure management.
- It provides more resources for the offline computing tasks of big data and machine learning, and more room for scheduling approaches such as time-sharing multiplexing and time-sharing scheduling, so that offline computing tasks can use more resources during the off-peak hours of online business and enjoy extreme elasticity and fast delivery.
- Most importantly, a larger cluster gives richer scheduling and orchestration methods more room to raise overall cluster resource utilization effectively.
PART. 3 SigmaApiServer performance optimization
The Sigma apiserver component is the entry point for all external requests to the Kubernetes cluster and the collaboration hub for all components inside it. The apiserver has the following functions:
- It hides the details of the backend persistent data store (etcd) and introduces a data cache, providing richer access mechanisms on top of the stored data types.
- It provides standard APIs so that external clients can perform CRUD operations on the resources in the cluster.
- It provides the list-watch primitive so that clients can observe resource state changes in real time.
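As a concrete client-side illustration of the list-watch primitive, the minimal client-go sketch below lists Pods once and then watches for subsequent changes from the returned resourceVersion. The kubeconfig path and namespace are assumptions for illustration only.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from a local kubeconfig (path is an assumption).
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// List: fetch the current state and remember its resourceVersion.
	pods, err := clientset.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("listed %d pods at resourceVersion %s\n", len(pods.Items), pods.ResourceVersion)

	// Watch: receive incremental events from that resourceVersion onward.
	w, err := clientset.CoreV1().Pods("default").Watch(context.TODO(), metav1.ListOptions{
		ResourceVersion: pods.ResourceVersion,
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()
	for event := range w.ResultChan() {
		fmt.Printf("event: %s\n", event.Type)
	}
}
```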
We can disassemble the performance of apiserver from two levels, namely the startup phase of apiserver and the operation phase of apiserver.
Optimizing the performance of the apiserver's startup phase helps to:
- reduce the impact of long upgrades/changes and long recovery times, reduce user-perceptible unavailability, and give Sigma end users a high-quality service experience (the business-facing goal is a monthly availability SLO of 99.9% for Sigma, with unavailable time for a single failure of less than 10 minutes);
- reduce situations where the apiserver comes under excessive pressure because clients relist the full set of resources after a restart.
The significance of optimizing the performance of the apiserver's running phase is to:
- stably support larger Kubernetes clusters;
- improve the service capacity of the apiserver per unit of resources in the normal, steady state, i.e. increase the concurrency and QPS of requests it can accept and reduce request RT;
- reduce client timeouts and the various problems they cause, and provide more traffic-carrying capacity with the existing resources.
Overall optimization ideas
Building a large-scale Kubernetes cluster and optimizing its performance is not an easy task, as the Google Kubernetes Engine article on Kubernetes scalability puts it:
「The scale of a Kubernetes cluster is like a multidimensional object composed of all the cluster’s resources—and scalability is an envelope that limits how much you can stretch that cube. The number of pods and containers, the frequency of scheduling events, the number of services and endpoints in each service—these and many others are good indicators of a cluster’s scale.
The control plane must also remain available and workloads must be able to execute their tasks.
What makes operating at a very large scale harder is that there are dependencies between these dimensions. 」
In other words, the scale and performance optimization of a cluster must take into account every dimension of the cluster: the number of pods, nodes, configmaps, services, endpoints and other resources, the frequency of pod creation and scheduling, the rate of change of the various resources in the cluster, and so on. At the same time, we must consider the interdependencies among these dimensions; together they form a multi-dimensional space.
To cope with the complex influence of so many variables on a large-scale cluster, we chose to focus on the unchanging essence of the problem in order to respond to its endless variations. To optimize the apiserver comprehensively and systematically, we divide it into three layers from bottom to top: the storage layer (storage), the cache layer (cache), and the access layer (registry/handler).
- The underlying etcd is the metadata store of Kubernetes and the cornerstone of the apiserver. The storage layer provides the apiserver's access to etcd, including its list-watch of etcd and its CRUD operations.
- The cache layer is essentially a layer of encapsulation over etcd. It provides a data cache for the resource-intensive list-watch requests coming from clients, thereby improving the apiserver's service capacity, and it also offers the ability to search by condition.
- The access layer implements the logic for processing CRUD requests and exposes the various resource operation services to clients.
For the different levels mentioned above, some possible optimization items are as follows:
To measure the apiserver's performance better, we also formulated a detailed SLO for the Kubernetes apiserver, including P99 RT indicators for create/update/delete operations and P99 RT indicators for list operations at different resource scales.
Optimizing the apiserver under the guidance of this SLO allows us to keep providing users with good API service quality even in a much larger Kubernetes cluster.
Cache layer optimization
"List to watchCache"
Because the apiserver fetches a large amount of data from etcd during list operations and then deserializes and filters it, list requests consume a lot of memory. Some clients access the apiserver in irregular ways; for example, some clients issue a list every few seconds without a resourceVersion. These clients put enormous memory pressure on the apiserver and nearly caused cluster failures. To handle this irregular access and reduce the apiserver's CPU/memory consumption, we changed the list operation so that all of these irregular list requests are served from the watchCache. In other words, when a user performs such a list, the request is no longer passed through to the backend etcd service.
In one of our large clusters, apiserver memory used to soar to 400 GB and hit OOM within tens of minutes, during which the RT of the apiserver's access to etcd climbed as high as 100 s, rendering it practically unavailable. After routing all user list operations through the apiserver's watchCache, apiserver memory has stabilized at around 100 GB, a 4x improvement, and RT has stabilized at around 50 ms. Serving list from the watchCache also rests on the eventual consistency of the list-watch primitive: the watch keeps tracking the relevant resources, so data consistency is not affected.
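In upstream Kubernetes, the closest analogue to this behavior is a list request with resourceVersion set to "0", which the apiserver is allowed to serve from its watch cache instead of reading through to etcd. The sketch below is a hypothetical client-side illustration of such a cache-friendly list (the package, function and parameter names are assumptions), not the Sigma server-side change itself:

```go
package listcache

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listPodsFromCache issues a list that the apiserver may answer from its
// watch cache: ResourceVersion "0" means "any version is acceptable", so the
// request does not have to read through to etcd (at the cost of possibly
// slightly stale, but eventually consistent, data).
func listPodsFromCache(ctx context.Context, clientset kubernetes.Interface) error {
	pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		ResourceVersion: "0",
	})
	if err != nil {
		return err
	}
	fmt.Printf("got %d pods from the cache-backed list\n", len(pods.Items))
	return nil
}
```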
Going forward, we are also considering serving get operations from the watchCache, for example by waiting a certain number of milliseconds for the watchCache to synchronize, to further reduce the apiserver's pressure on etcd while still preserving data consistency.
"watchCache size adaptive"
In a cluster with a relatively high resource churn rate, the size of the apiserver's watchCache has a large effect on the overall stability of the apiserver and on the volume of client access.
A watchCache that is too small causes clients' watch operations to hit "too old resource version" errors because the requested resource version can no longer be found in the watchCache, which in turn triggers client relist operations. These relists feed back negatively into apiserver performance and affect the whole cluster; in extreme cases a vicious circle of list -> watch -> too old resource version -> list can form. Conversely, a watchCache that is too large puts pressure on the apiserver's memory usage.
Therefore, dynamically adjusting the watchCache size and choosing a suitable upper bound for it are very important for a large-scale cluster.
We made the watchCache size dynamic: it is adjusted according to the change rate of each resource type (the frequency of create/delete/update operations on pods, nodes, configmaps and so on), and the upper bound of the watchCache size is computed from the cluster's resource change frequency and the time a list operation takes.
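The exact sizing logic is internal to Sigma; a minimal illustrative heuristic along the lines described above might look like the following sketch, where all names, the safety factor and the bounds are assumptions:

```go
package watchcache

import "time"

// watchCacheCapacity is an illustrative heuristic, not the actual Sigma code:
// keep enough events in the cache to cover the window a slow client needs to
// finish a list and start watching, so that its starting resourceVersion is
// still present and no "too old resource version" error is triggered.
func watchCacheCapacity(eventsPerSecond float64, listDuration time.Duration, lowerBound, upperBound int) int {
	// Events that will arrive while a client is still listing, with a 2x margin.
	needed := int(eventsPerSecond*listDuration.Seconds()) * 2

	if needed < lowerBound {
		return lowerBound
	}
	if needed > upperBound {
		return upperBound
	}
	return needed
}
```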
After these optimizations and changes, the client's watch error (too old resource version) almost disappeared.
"Increase watchCache index"
Analysis of Ant Group's business shows that the new computing workloads (real-time and offline big data tasks, offline machine learning tasks) have specific access patterns when listing the various resources; business parties such as Spark and Blink issue a very large number of list-by-label operations, that is, they look up Pods by label.
By analyzing apiserver logs, we extracted each business party's list-by-label operations and added indexes for the relevant labels to the watchCache. For list-by-label operations on resources of the same scale, client RT improves by a factor of 4-5.
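The index added here lives inside the apiserver's watchCache and is not directly programmable from the outside; as a rough client-side analogue, client-go's indexed informers show the same idea: registering a label-based index turns "find pods by label" from a full scan into a map lookup. The label key and index name below are assumptions:

```go
package indexers

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

const sparkAppLabel = "spark-app-selector" // illustrative label key

// newIndexedPodInformer registers a label-based index so that lookups by this
// label do not have to scan every cached pod. AddIndexers must be called
// before the informer is started.
func newIndexedPodInformer(clientset kubernetes.Interface) (cache.SharedIndexInformer, error) {
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Minute)
	informer := factory.Core().V1().Pods().Informer()

	err := informer.AddIndexers(cache.Indexers{
		"by-spark-app": func(obj interface{}) ([]string, error) {
			pod, ok := obj.(*corev1.Pod)
			if !ok {
				return nil, fmt.Errorf("unexpected object type %T", obj)
			}
			if v, ok := pod.Labels[sparkAppLabel]; ok {
				return []string{v}, nil
			}
			return nil, nil
		},
	})
	return informer, err
}

// Usage (after the informer has synced):
//   objs, _ := informer.GetIndexer().ByIndex("by-spark-app", "my-app-id")
```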
The figure below summarizes the watchCache optimizations described above:
Storage layer optimization
When resources are updated very frequently, GuaranteedUpdate performs a large number of retries and puts unnecessary pressure on etcd. Sigma adds an exponential backoff retry strategy to GuaranteedUpdate, which reduces the number of conflicts in update operations and also reduces the update pressure the apiserver places on etcd.
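The change itself is in the apiserver's GuaranteedUpdate path, whose internals are not shown here; the same backoff-on-conflict pattern is available to clients through client-go's retry helper, sketched below (the namespace, pod name and annotation key are assumptions):

```go
package conflictretry

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// updateWithBackoff retries an update on resourceVersion conflicts with an
// exponential backoff, instead of hammering the apiserver (and etcd) with
// immediate retries.
func updateWithBackoff(ctx context.Context, clientset kubernetes.Interface, ns, name string) error {
	return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
		pod, err := clientset.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if pod.Annotations == nil {
			pod.Annotations = map[string]string{}
		}
		pod.Annotations["example.antgroup.com/touched"] = "true" // illustrative mutation
		_, err = clientset.CoreV1().Pods(ns).Update(ctx, pod, metav1.UpdateOptions{})
		return err
	})
}
```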
In a large, high-traffic cluster we also found that some unreasonable log output in the apiserver can cause severe performance jitter. For example, we adjusted the log level used by operations such as GuaranteedUpdate and delete when they hit update or delete conflicts, reducing disk I/O and shortening the response time of client requests to the apiserver. In addition, when the cluster's resource change rate is very high, a large number of "fast watch slow processing" logs appear. They mainly indicate that the rate at which the apiserver builds the watchCache from etcd watch events is lower than the rate at which events arrive from etcd, which cannot be improved for the time being without changing the watchCache data structure; we therefore also lowered the level of the "slow processing" log to reduce log output.
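As a small illustration of the log-level idea (the function, message and verbosity threshold are assumptions, not the actual apiserver change): a hot-path message placed behind a higher klog verbosity level costs almost nothing unless the operator explicitly raises -v.

```go
package logging

import "k8s.io/klog/v2"

// onUpdateConflict logs a conflict only at verbosity 5 and above, so that at
// the default verbosity this hot path produces no disk I/O at all.
func onUpdateConflict(key string) {
	klog.V(5).Infof("GuaranteedUpdate conflict for %s, retrying", key)
}
```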
Access layer optimization
Go profiling has always been a go-to optimization tool for applications written in Go. While profiling the apiserver online, we also found many hot spots and optimized them.
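kube-apiserver exposes Go's pprof endpoints under /debug/pprof when profiling is enabled (the --profiling flag, on by default). The sketch below shows the generic Go pattern behind this, with an assumed local listen address, plus the usual way a CPU profile is pulled:

```go
package main

import (
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Any Go service can expose pprof this way; kube-apiserver already serves
	// these endpoints itself when --profiling=true (the default).
	go http.ListenAndServe("127.0.0.1:6060", nil)

	select {} // stand-in for the real server's work
}

// Collecting a 30s CPU profile from the endpoint above:
//   go tool pprof http://127.0.0.1:6060/debug/pprof/profile?seconds=30
```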
For example:
- Profiling showed that events.GetAttrs/ToSelectableFields consumes a lot of CPU when users list events. We modified ToSelectableFields, improving the CPU usage of this single function by 30% and thereby reducing CPU consumption when listing events.
- Profiling also showed that metrics can consume a great deal of CPU at times; after reducing the volume of apiserver metrics, CPU utilization dropped significantly.
- The Sigma apiserver uses the Node, RBAC and Webhook authorization modes. For node authorization, the apiserver builds a fairly large graph structure in memory, which is used to authorize kubelets' access to the apiserver.
When a large number of resources (pods/secrets/configmaps/PVs/PVCs) are created or changed in the cluster, this graph structure is updated; when the apiserver restarts, it is rebuilt from scratch. In a large-scale cluster we found that while the apiserver is restarting, kubelets are blocked by authorization failures because the node authorizer graph is still being built. After locating the node authorizer problem, we found the community's fix and cherry-picked it back to improve its performance.
etcd has a 1.5 MB size limit for each stored object and returns "etcdserver: request is too large" when a request exceeds it. To prevent the apiserver from writing objects larger than this limit into etcd, the apiserver uses the limitedReadBody function to restrict oversized requests. We improved limitedReadBody so that it reads the Content-Length field from the HTTP header and determines up front whether the request body exceeds etcd's 1.5 MB storage limit for a single resource (pod, node, etc.).
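A simplified sketch of this idea is shown below; it is not the real limitedReadBody code, and the handler name, error text and exact limit handling are assumptions. The point is that a request whose declared Content-Length already exceeds etcd's per-object limit can be rejected before any of the body is buffered:

```go
package bodylimit

import (
	"fmt"
	"io"
	"net/http"
)

const maxObjectBytes = 3 * 1024 * 1024 / 2 // ~1.5 MiB, etcd's default per-object limit

// readBodyWithLimit rejects requests whose declared Content-Length already
// exceeds the etcd object size limit, before any of the body is read, and
// still enforces the limit while reading when no length is declared.
func readBodyWithLimit(w http.ResponseWriter, req *http.Request) ([]byte, error) {
	if req.ContentLength > maxObjectBytes {
		http.Error(w, "request entity too large", http.StatusRequestEntityTooLarge)
		return nil, fmt.Errorf("declared body size %d exceeds limit %d", req.ContentLength, int64(maxObjectBytes))
	}
	// Fall back to a hard cap while reading, for chunked or unknown lengths.
	lr := &io.LimitedReader{R: req.Body, N: maxObjectBytes + 1}
	data, err := io.ReadAll(lr)
	if err != nil {
		return nil, err
	}
	if lr.N <= 0 {
		http.Error(w, "request entity too large", http.StatusRequestEntityTooLarge)
		return nil, fmt.Errorf("body exceeds limit %d", int64(maxObjectBytes))
	}
	return data, nil
}
```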
Of course, not every experiment yields an improvement. For example, we tested alternative encoding schemes and replaced encoding/json with jsoniter: although the apiserver's CPU utilization dropped, its memory usage rose sharply, so we continue to use the default encoding/json.
Optimizations related to etcd splitting
In addition, splitting etcd greatly improves the RT of client requests to the apiserver. In large-scale clusters we split etcd into multiple instances, one of which is dedicated to Pod resources. During the splitting process we found that the resource version of the newly split etcd can be smaller than the resource version the apiserver had before, which causes clients that list-watch the apiserver to hang for a long time without receiving new Pod-related events.
To solve this problem, we modified the apiserver's watch interface and added a timeout to the watch operation. A client's watch now waits at most 3 s; if the resource versions do not match, the apiserver returns an error so that the client relists, thereby avoiding clients hanging on a mismatched resource version during the etcd split.
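A heavily simplified sketch of such a timeout mechanism is below; the types, polling interval and error text are assumptions rather than the real apiserver implementation. The idea is to wait at most 3 s for the cache's resourceVersion to catch up with the one the client requested, and otherwise return an error that makes the client relist:

```go
package watchsplit

import (
	"fmt"
	"time"
)

// rvSource reports the newest resourceVersion known to the (assumed) watch
// cache; its real counterpart lives inside the apiserver.
type rvSource func() uint64

// waitForResourceVersion blocks until the cache has caught up with the
// resourceVersion requested by the watch client, or gives up after timeout so
// the client gets an immediate error and relists instead of hanging forever.
func waitForResourceVersion(requested uint64, current rvSource, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if current() >= requested {
			return nil
		}
		time.Sleep(50 * time.Millisecond)
	}
	return fmt.Errorf("watch cache is at %d, requested %d: please relist", current(), requested)
}

// Usage in the watch handler (illustrative):
//   err := waitForResourceVersion(requestedRV, cacheCurrentRV, 3*time.Second)
```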
Other optimizations
In addition, to keep the apiserver highly available, Ant's Kubernetes apiserver implements tiered, multi-level rate limiting that combines sentinel-go with APF. sentinel-go limits total traffic and applies mixed, multi-dimensional limits such as per-UA and per-verb limits to keep the service from being overwhelmed, while APF ensures that traffic from different business parties is treated fairly. sentinel-go also ships with a periodic memory profile collection feature; disabling it gave us a modest improvement in CPU utilization.
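The actual rules are internal, but a hedged sketch of what per-dimension rate limiting with sentinel-go can look like is shown below; the resource-naming scheme (combining user agent and verb into one key), the threshold and the control behavior are all assumptions:

```go
package main

import (
	"log"

	sentinel "github.com/alibaba/sentinel-golang/api"
	"github.com/alibaba/sentinel-golang/core/base"
	"github.com/alibaba/sentinel-golang/core/flow"
)

func main() {
	if err := sentinel.InitDefault(); err != nil {
		log.Fatal(err)
	}

	// One rule per (user-agent, verb) dimension; the key format is an assumption.
	_, err := flow.LoadRules([]*flow.Rule{
		{
			Resource:               "apiserver|ua=spark-operator|verb=list",
			TokenCalculateStrategy: flow.Direct,
			ControlBehavior:        flow.Reject, // shed load instead of queueing
			Threshold:              100,         // max 100 permits per stat interval
			StatIntervalInMs:       1000,
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	// In the request filter: check the quota before doing real work.
	e, blocked := sentinel.Entry("apiserver|ua=spark-operator|verb=list", sentinel.WithTrafficType(base.Inbound))
	if blocked != nil {
		log.Println("request rejected: over the flow limit") // would map to HTTP 429
		return
	}
	defer e.Exit()
	// ... handle the request ...
}
```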
We are also working with clients to optimize how they access the apiserver. So far, Sigma and its business parties have jointly optimized the apiserver usage of (Flink on K8s), Tekton pipelines and the Spark operator.
Optimization effect
The figure below compares the minute-level traffic of two of our clusters. The business of one of them grew by leaps and bounds because of a business merger, and that cluster now exceeds 10,000 nodes. As the business grew, the pressure on the cluster multiplied: all types of write requests increased significantly, with create and delete requests most noticeable. Create requests rose from about 200 per minute to about 1,000 per minute, and delete requests rose from 2.7K to 5.9K per minute. After our optimizations, as the business migration advanced, the cluster as a whole kept running smoothly despite the steadily increasing scale and load, which largely meets the expectations of the optimization effort.
Basic resources
Even though every type of traffic grew to varying degrees along with the business, the apiserver's CPU utilization dropped by about 7% after optimization. Memory, however, increased by about 20%, because with dynamic sizing enabled the watchCache holds more objects of more resource types (node/pod, etc.) than before.
Caching more resource objects reduces client reconnections and the number of list operations, which in turn indirectly lowers the RT of various client operations and improves the stability of the overall cluster and the businesses running on it. We will of course continue to optimize the apiserver's memory usage.
RT
The RT of write requests is one of the most critical indicators for cluster and business stability. After optimization, the P99, P90 and P50 RT of clients' write requests to the apiserver all dropped significantly, and the values became more stable, showing that the apiserver is moving in an efficient and stable direction.
(Note: the RT comparison includes the effect of the etcd split.)
Watch errors and the number of list operations
An ill-sized watchCache causes clients' watch operations to hit "too old resource version" errors, shown below as watch errors, because the requested resource version can no longer be found in the watchCache; this makes clients relist against the apiserver.
After optimization, the number of watch errors per minute for pods dropped by about 25%, and the number of watch errors for nodes dropped to 0; the corresponding number of list operations also fell by more than 1,000 per minute.
PART. 4 Road to the Future
In general, to improve the overall capabilities of a distributed system, we can start from the following aspects:
1. Improve the system's own architecture, improve stability and performance
2. Manage the traffic of the system access party, optimize the use method and architecture of the system access party
3. Optimize the services that the system depends on
Corresponding to the performance optimization of the apiserver, we will continue to go deeper in the following areas:
- For the apiserver itself, possible optimization points include: shortening the apiserver's total startup time and speeding up watchCache construction; optimizing the threadSafeStore data structure; serving get operations from the cache; compressing the data the apiserver stores in etcd to reduce data size and improve etcd performance; and so on.
- Beyond the apiserver itself, the Ant Sigma team is also optimizing the components upstream and downstream of it, for example etcd sharding across multiple instances, asynchronous processing and other efficiency-oriented approaches, as well as optimizing the end-to-end operator links of the various big data real-time and offline tasks.
- And of course the pull of the SLO is indispensable, along with better quantification of the various indicators. Only when all of these work together as an organic whole can we say we are likely to provide high-quality service for the business parties running on this infrastructure.
The road to building large-scale clusters is long and full of obstacles.
In the future, we will continue to invest further in the various aspects listed above, and provide a better operating environment for more online tasks, offline tasks, and new computing tasks.
We will also further refine our methodology, pursuing the next round of optimization along general directions such as caching, asynchronization, horizontal splitting/scalability, merging operations, and shortening the resource creation path. As cluster scale keeps growing, the importance of performance optimization will only become more prominent. We will keep working toward the goal of building and maintaining large-scale Kubernetes clusters that are efficient, reliable and well-guaranteed for our users and, just as the name Kubernetes implies, act as the helmsman that escorts our applications!
-
「Reference Materials」
- 【Kubernetes Scalability thresholds】
https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md
- 【Kubernetes scalability and performance SLIs/SLOs】
https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md
- 【Watch latency SLI details】
https://github.com/kubernetes/community/blob/master/sig-scalability/slos/watch_latency.md
- 【Bayer Crop Science seeds the future with 15000-node GKE clusters】
- 【Openstack benchmark】
-
thirsty"
Ant Group's Kubernetes cluster scheduling system supports resource scheduling for millions of containers serving Ant Group's online and real-time businesses, provides standard container services and dynamic resource scheduling capabilities to the financial services built on top of it, and carries the responsibility of optimizing Ant Group's resource costs. We have some of the industry's largest Kubernetes clusters, the deepest cloud-native practice, and outstanding scheduling technology. We welcome anyone interested in Kubernetes, cloud native, containers, kernel isolation and colocation, scheduling, or cluster management to join us; Beijing, Shanghai and Hangzhou look forward to your arrival.
Contact email xiaoyun.maoxy@antgroup.com
Recommended reading this week
The road to building high-availability etcd for Ant Group's 10,000-node Kubernetes clusters
The current state and future trends of cloud-native technology