Ant Group operates and maintains what may be the world's largest Kubernetes (K8s) clusters: the K8s community officially cites 5k nodes as the upper limit of a single cluster's scale, while Ant Group runs clusters of 10k nodes in production. To use an image: if the clusters of the official project and of the users who follow it are Mount Tai, Ant Group has built a Mount Everest on top of the official solution, pushing K8s scalability forward.
This difference in magnitude is not merely quantitative; it is a qualitative leap in how K8s is managed and maintained. Behind the ability to operate a cluster of such daunting scale lies an optimization effort at Ant Group that goes far beyond what stock K8s provides.
As the saying goes, tall buildings rise from the ground. This article focuses on the high-availability work Ant Group has done on etcd, the cornerstone of K8s: only when the etcd foundation is stable can the K8s skyscraper stand firm. [The original article included a screenshot of a WeChat Moments post by TiDB's Huang Dongxu, reproduced with Mr. Huang's permission.]
The challenge
First of all, etcd is the KV database of the K8s cluster.
From a database perspective, the roles in the overall K8s cluster architecture are:
- etcd: the cluster's database
- kube-apiserver: etcd's API proxy and data caching layer
- kubelet: producer and consumer of data
- kube-controller-manager: consumer and producer of data
- kube-scheduler: consumer and producer of data
etcd is essentially a KV database. It stores K8s' own resources, user-defined CRDs, and the events of the K8s system, and these data types differ in their consistency and data-safety requirements: event data, for example, needs far weaker durability guarantees than K8s' own resource data and CRD data.
When K8s was first being promoted, its early advocates claimed that one of its advantages over OpenStack was that it used no message queue and therefore had lower latency. That is a misunderstanding: both the watch interface provided by etcd and the informer mechanism in the K8s client package show that K8s does treat etcd as a message queue. K8s messages travel in many forms, K8s events being one of them.
From a message-queue perspective, the roles in the overall K8s cluster architecture are:
- etcd: message router
- kube-apiserver: message broker for etcd producers and message broadcaster [or call it a secondary message router and consumer-side broker]
- kubelet: producer and consumer of messages
- kube-controller-manager: consumer and producer of messages
- kube-scheduler: consumer and producer of messages
etcd is a push-mode message queue. It therefore serves a K8s cluster as both its KV database and its message router, playing the two roles that MySQL and MQ play in an OpenStack cluster. This appears to simplify the cluster's architecture, but not really: in a large-scale K8s cluster, the common practice is to first run a separate etcd cluster just for event data, physically separating part of the MQ data from the KV data and thus partially splitting the KV and MQ roles. Reference document 2, for example, mentions that Meituan "split out an independent event cluster to reduce the pressure on the main etcd cluster."
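To make the push-mode point concrete, here is a minimal sketch of consuming etcd as a message channel through its Watch API, written against the go.etcd.io/etcd/client/v3 package. The endpoint and the /registry/events/ prefix are illustrative assumptions, not Ant's actual configuration.

```go
// A minimal sketch of consuming etcd as a push-mode message channel:
// open a Watch on a key prefix and receive change events as they are
// committed, instead of polling.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Watch all keys under the (hypothetical) events prefix; etcd pushes
	// every PUT/DELETE under the prefix onto this channel.
	ch := cli.Watch(context.Background(), "/registry/events/", clientv3.WithPrefix())
	for resp := range ch {
		for _, ev := range resp.Events {
			fmt.Printf("%s %q rev=%d\n", ev.Type, ev.Kv.Key, resp.Header.Revision)
		}
	}
}
```

kube-apiserver's watch cache and client-go's informers are built on exactly this push mechanism, which is why etcd effectively plays the MQ role in a K8s cluster.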
As a K8s cluster grows, etcd comes under three pressures at once: a sharp increase in KV data, a surge of event messages, and message write amplification.
To show that this is no exaggeration, here are some numbers from our clusters:
- the etcd KV data volume is above 1 million entries;
- the etcd event data volume is above 100 thousand entries;
- peak etcd read traffic is above 300k qpm, of which event reads are above 10k qpm;
- peak etcd write traffic is above 200k qpm, of which event writes are above 15k qpm;
- etcd CPU usage frequently spikes above 900%;
- etcd memory RSS is above 60 GiB;
- etcd disk usage can exceed 100 GiB;
- the number of etcd's own goroutines exceeds 9k;
- the number of user-space threads used by etcd exceeds 1.6k;
- a single etcd GC run can take 15 ms even under normal conditions.
Managing this much data with etcd, which is implemented in Go, is very difficult: its CPU, memory, GC, goroutine count and thread usage are all close to the limits of what the Go runtime can manage, and CPU profiles frequently show the Go runtime and GC consuming more than 50% of resources.
Before the high-availability program described here, once its size exceeded 7,000 nodes Ant's K8s cluster exhibited the following bottlenecks:
- etcd suffered severe read and write latency, at times reaching the minute level;
- kube-apiserver queries for pods / nodes / configmaps / CRDs had very high latency and could drive etcd into OOM;
- a list-all of pods against etcd could take more than 30 minutes;
- in 2020 the etcd cluster collapsed several times under list-all pressure;
- controllers could not perceive data changes in time; watch latency could exceed 30 s.
If an etcd cluster in this state was dancing on a knife's edge, the K8s cluster as a whole was an active volcano: one moment of inattention could trigger a P-level incident. At that time, operating the K8s master components was probably one of the most dangerous jobs in all of Ant Group.
High availability strategy
To improve the high availability of a distributed system, there are roughly three kinds of means:
- improving its own stability and performance;
- fine-grained management of the traffic arriving from upstream;
- guaranteeing the SLO of the services it depends on downstream.
After years of tempering by the community and its many users, etcd itself is stable enough. What Ant's engineers can do is raise the overall resource utilization of the cluster and improve etcd's performance as far as possible through the two technical means of scale-up and scale-out.
etcd is the cornerstone of K8s and has no downstream services of its own; if anything qualifies, it is the physical node environment it runs on. The rest of this article therefore describes our high-availability work at the etcd level from two angles: improving etcd cluster performance and managing the request traffic that reaches it.
File system upgrade
Making a golden phoenix fly out of a mountain hollow is no easy task. Conversely, there is no quicker or more effective way to make etcd run faster than to put it on higher-performance machines.
1. Use NVMe SSD
etcd in operation = the etcd program + its runtime environment. The early etcd servers used SATA disks, and a simple test showed that etcd read from them very slowly, so the boss boldly swapped the hardware, upgrading to f53-spec machines with NVMe SSDs: with boltdb data stored on NVMe SSD, etcd's random write rate rose to more than 70 MiB/s.
Reference document 2 likewise mentions that Meituan, "by deploying on high-spec SSD physical machines, can sustain 5 times the normal daily peak traffic." Clearly, upgrading hardware is the first resort of the big players: if throwing machines at the problem works, don't waste engineering effort.
2. Use tmpfs
NVMe SSDs are good, but their theoretical read/write limits are still an order of magnitude below those of memory. Our tests found that replacing the NVMe SSD with tmpfs [with swap-out not disabled] still improved etcd's performance by as much as 20% under concurrent reads and writes. After examining the characteristics of the various K8s data types, and considering that events demand little in terms of data safety but a lot in terms of freshness, we did not hesitate to run the event etcd cluster on tmpfs, lifting the overall performance of K8s by a notch.
3. Disk file system
After upgrading the storage media, the next thing the storage layer can do is examine the on-disk file system format. etcd currently runs on ext4 with the default 4 KiB block size. A simple parallel write stress test by our team showed that with the file system switched to xfs and a 16 KiB block size [with a total KV size of 10 KiB in the test], etcd's write performance still had considerable room for improvement.
Under concurrent reads and writes, however, the disk's own write queue is under almost no pressure, and since etcd 3.4 implemented concurrent reads from its cache, the read pressure on the disk is close to zero. That means further file-system optimization would buy almost nothing. From this point on, the key to scaling up a single etcd node shifts from disk to memory: optimizing the read/write speed of its in-memory index.
4. Transparent huge pages
Modern operating systems offer two related memory-management technologies, huge pages and transparent huge pages (THP); ordinary users generally rely on THP for dynamic management of memory pages. In etcd's runtime environment, THP should be turned off; otherwise routine monitoring metrics such as RT and QPS frequently show spikes and performance becomes unstable.
etcd tuning
MySQL operations engineers are often jokingly called "parameter-tuning engineers", and RocksDB, another well-known KV database, is no better: the number of tunable parameters in both is frightening. The point is that different storage and runtime environments need different parameter values to extract the full performance of the hardware. etcd is not at that point yet, but it is not far behind, and the number of its tunable parameters can be expected to keep growing.
etcd itself already exposes a number of tuning knobs. Beyond the improvement contributed by Alibaba's K8s team, which changed the freelist from a list to a map organization, the main tunable parameters of etcd today are:
- write batch
- compaction
1. write batch
Like other conventional databases, etcd commits data to disk in periodic batches and writes asynchronously to improve throughput, using an in-memory cache to balance the resulting latency. The tuning knobs are:
- batch write number: the number of KVs committed per batch write, default 10k;
- batch write interval: the interval between batch writes, default 100 ms.
These two defaults are not suitable for large-scale K8s clusters. They need to be adjusted to the specific environment to avoid running out of memory (OOM); the general rule is that the more nodes the cluster has, the smaller both values should be, scaled down roughly in proportion, as sketched below.
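As a hedged illustration only, the sketch below lowers these two knobs via etcd's embed package. The BackendBatchLimit / BackendBatchInterval field names and the corresponding --backend-batch-limit / --backend-batch-interval flags are taken from etcd v3.5 and should be checked against the version actually in use; the concrete values are placeholders, not recommendations.

```go
// Hedged sketch: shrinking the backend batch-write knobs for a large cluster
// via etcd's embed package. Field names assume etcd v3.5; values are
// illustrative only.
package main

import (
	"log"
	"time"

	"go.etcd.io/etcd/server/v3/embed"
)

func main() {
	cfg := embed.NewConfig()
	cfg.Dir = "/var/lib/etcd" // data directory, placeholder

	// Smaller batches: commit the boltdb write transaction after fewer
	// pending operations and at a shorter interval, trading a little
	// throughput for lower memory buildup on a very large cluster.
	cfg.BackendBatchLimit = 5000
	cfg.BackendBatchInterval = 50 * time.Millisecond

	e, err := embed.StartEtcd(cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer e.Close()

	<-e.Server.ReadyNotify()
	log.Println("etcd is ready")
	<-e.Err() // block until the server reports an error
}
```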
2. compaction
Because etcd supports transactions and message notification, it uses MVCC to keep multiple versions of each key and relies on a periodic compaction mechanism to reclaim the obsolete versions. The compaction parameters etcd provides are:
- compaction interval: the period of the compaction task cycle;
- compaction sleep interval: the pause between batches within a single compaction run, default 10 ms;
- compaction batch limit: the number of KVs processed per compaction batch, default 1000.
(1) Compaction period
Compaction of a K8s cluster's etcd can be triggered in two ways:
- etcd provides a compact command and API; on top of that API, kube-apiserver exposes its own compaction period parameter;
- etcd itself also performs compaction periodically;
  - etcd exposes a knob for its own periodic compaction, with a value range of (0, 1 hour);
  - in other words, etcd's built-in compaction can be throttled but not switched off: if the configured period exceeds 1 hour, etcd truncates it to 1 hour.
Based on testing and verification in offline environments, the Ant K8s team's current experience with the compaction period is:
- at the etcd level, stretch the compaction period as far as possible, for example to 1 hour, which effectively turns off etcd-level compaction and hands fine-grained control of the compaction interval to kube-apiserver;
- at the kube-apiserver level, choose the compaction interval according to the size of the online cluster.
The reason for pushing fine-grained control of the compaction interval up to kube-apiserver is that etcd, as a KV database, is inconvenient to restart frequently for experiments, whereas kube-apiserver is only etcd's cache: its data is soft state, it is easy to restart, and therefore easy to re-tune. As for the value itself, one rule of thumb is: the more nodes in the cluster, the larger the compaction interval can be.
This matters because compaction is essentially a write operation. Running compaction frequently in a large cluster adds latency to the cluster's reads and writes, and the larger the cluster, the more visible the effect: kube-apiserver request-latency monitoring shows large spikes recurring frequently and periodically.
Furthermore, if the workload on the platform has clear peak and trough patterns, say 8:30 to 21:45 every day is the business peak and the rest of the day is the trough, the compaction tasks can be arranged as follows:
- set the compaction period at the etcd level to 1 hour;
- set the compaction period at the kube-apiserver level to 30 minutes;
- run a scheduled job on the etcd operations platform that, whenever the current time falls in a business trough, triggers a compaction every 10 minutes.
In essence this hands the etcd compaction task over to the etcd operations platform. On days with no trough at all, such as a major e-commerce promotion, the compaction task can be switched off on the platform in an emergency, minimizing its impact on normal read and write requests.
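Here is a minimal sketch of such a trough-only periodic compaction job, written against the go.etcd.io/etcd/client/v3 API. The endpoint, the inTrough schedule and the 10-minute tick are illustrative assumptions, not the actual Ant operations-platform code.

```go
// Hedged sketch of a "compact every 10 minutes, but only during business
// troughs" job: read the current revision, then ask etcd to compact
// everything older than it.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// inTrough is a placeholder for the platform's own peak/trough calendar.
func inTrough(t time.Time) bool { return t.Hour() < 8 || t.Hour() >= 22 }

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	for range time.Tick(10 * time.Minute) {
		if !inTrough(time.Now()) {
			continue // skip compaction during the business peak
		}
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		// Any read returns the cluster's current revision in its header.
		resp, err := cli.Get(ctx, "compaction-probe")
		if err == nil {
			// Physical() waits until obsolete revisions are actually purged.
			_, err = cli.Compact(ctx, resp.Header.Revision, clientv3.WithCompactPhysical())
		}
		cancel()
		if err != nil {
			log.Printf("compaction failed: %v", err)
		}
	}
}
```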
(2) A single compaction run
Even a single compaction run is executed by etcd in batches, because boltdb, the storage engine etcd uses, follows a multi-reader, single-writer model: many read transactions can run in parallel, but only one write transaction can run at a time.
To keep a compaction run from monopolizing boltdb's read-write lock, after compacting a fixed number of on-disk KVs [compaction batch limit], etcd releases the lock and sleeps for a while [compaction sleep interval].
Before v3.5 the compaction sleep interval was hard-coded at 10 ms; from v3.5 onward etcd exposes it as a parameter, which makes tuning for large-scale K8s clusters easier. As with the batch-write interval and count, the sleep interval and batch limit of a single compaction need different values for different cluster sizes to keep etcd running smoothly and to keep kube-apiserver's read/write RT free of spikes; a hedged configuration sketch follows.
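As a hedged sketch only, the snippet below wires the two single-compaction knobs through etcd's embed package. The field names are assumed to mirror the --experimental-compaction-batch-limit (v3.4+) and --experimental-compaction-sleep-interval (v3.5+) flags and should be verified against the embed package of the etcd version in use; the values are placeholders to be sized per cluster.

```go
// Hedged sketch: loosening the single-compaction knobs via etcd's embed
// package. Field names and values are assumptions to verify per version.
package etcdtuning

import (
	"time"

	"go.etcd.io/etcd/server/v3/embed"
)

func compactionTunedConfig() *embed.Config {
	cfg := embed.NewConfig()
	// KVs compacted per batch before the boltdb write lock is released
	// (mirrors --experimental-compaction-batch-limit, default 1000).
	cfg.ExperimentalCompactionBatchLimit = 500
	// How long the lock stays released between batches
	// (mirrors --experimental-compaction-sleep-interval, v3.5+, default 10 ms).
	cfg.ExperimentalCompactionSleepInterval = 50 * time.Millisecond
	return cfg
}
```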
Operation and maintenance platform
Whether tuning etcd's parameters or upgrading the file system it runs on, all of the above improves etcd by scaling up. Two scale-up approaches have not yet been used:
- collect etcd runtime profiles, from stress tests or from production, analyze the bottlenecks in etcd's code paths, and optimize the code to improve performance;
- reduce the amount of data held by a single etcd node by other means.
Optimizing etcd's code paths can be pursued to whatever extent an etcd user has the engineering capacity for; the longer-term approach is to track the community and harvest the technical dividend of each version upgrade in time. Gaining etcd performance by shrinking the data it holds, however, has to rely on the user's own tooling.
We benchmarked the relationship between a single etcd node's RT/QPS and its KV data volume, and the conclusion is that as the KV data volume grows, RT rises linearly while QPS throughput drops off exponentially. One lesson from this result is to shrink the data on each etcd node as much as possible, guided by analysis of the data composition, the external traffic characteristics and the data access patterns in etcd.
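For reference, here is a rough sketch of such a measurement (not the Ant test harness) using go.etcd.io/etcd/client/v3: preload increasing numbers of keys, then time point reads at each volume. The endpoint, key sizes and counts are assumptions.

```go
// Rough benchmark sketch of how read RT degrades as total KV volume grows.
package main

import (
	"context"
	"fmt"
	"log"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	val := strings.Repeat("x", 1024) // ~1 KiB value, an assumption
	for _, n := range []int{10_000, 100_000, 1_000_000} {
		// Preload up to n keys (idempotent across iterations).
		for i := 0; i < n; i++ {
			if _, err := cli.Put(context.Background(), fmt.Sprintf("/bench/key-%08d", i), val); err != nil {
				log.Fatal(err)
			}
		}
		// Time 1000 spread-out point reads at this data volume.
		start := time.Now()
		for i := 0; i < 1000; i++ {
			if _, err := cli.Get(context.Background(), fmt.Sprintf("/bench/key-%08d", i*(n/1000))); err != nil {
				log.Fatal(err)
			}
		}
		fmt.Printf("n=%d avg read RT=%v\n", n, time.Since(start)/1000)
	}
}
```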
At present, Ant's etcd operations platform provides the following data analysis functions (a minimal sketch of the first one follows the list):
- longest N KV --- the N largest KVs by size
- top N KV --- the N most frequently accessed KVs within a time window
- top N namespace --- the N namespaces holding the most KVs
- verb + resource --- statistics on external access verbs and resources
- connection count --- the number of long-lived connections on each etcd node
- client source statistics --- where each etcd node's external requests come from
- redundant data analysis --- the distribution of KVs in the etcd cluster that have had no external access recently
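A minimal sketch of the longest-N-KV analysis, again against go.etcd.io/etcd/client/v3 rather than the actual Ant platform code: it pages through the keyspace in bounded chunks and keeps the N largest values. The /registry/ prefix, page size and N are assumptions.

```go
// Hedged sketch of a "longest N KV" scan over etcd, paged so the scan itself
// does not become another list-all style load spike.
package main

import (
	"context"
	"fmt"
	"log"
	"sort"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

type kvSize struct {
	key  string
	size int
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	const topN = 10
	var sizes []kvSize // a real tool would keep a bounded top-N heap instead

	key := "/registry/"
	for {
		resp, err := cli.Get(context.Background(), key,
			clientv3.WithRange(clientv3.GetPrefixRangeEnd("/registry/")),
			clientv3.WithSort(clientv3.SortByKey, clientv3.SortAscend),
			clientv3.WithLimit(1000)) // 1000 keys per page
		if err != nil {
			log.Fatal(err)
		}
		for _, kv := range resp.Kvs {
			sizes = append(sizes, kvSize{string(kv.Key), len(kv.Value)})
		}
		if !resp.More || len(resp.Kvs) == 0 {
			break
		}
		// Continue just after the last key of this page.
		key = string(resp.Kvs[len(resp.Kvs)-1].Key) + "\x00"
	}

	sort.Slice(sizes, func(i, j int) bool { return sizes[i].size > sizes[j].size })
	if len(sizes) > topN {
		sizes = sizes[:topN]
	}
	for _, s := range sizes {
		fmt.Printf("%8d bytes  %s\n", s.size, s.key)
	}
}
```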
According to the results of data analysis, the following tasks can be carried out:
- Client rate limiting
- Load balancing
- Cluster split
- Redundant data deletion
- Fine-grained analysis of business traffic
1. Cluster split
As mentioned earlier, a classic way to improve etcd cluster performance is to split event data out into an independent etcd cluster, because events are voluminous, fast-churning and heavily accessed data in a K8s cluster. After the split, both the data volume of the main etcd cluster and the external client traffic on each of its nodes are reduced.
Some empirical and conventional etcd splitting methods are:
- pod/cm
- node/svc
- event, lease
After such a split, the RT and QPS of the K8s cluster will very likely improve significantly, but splitting the data further is still necessary. A detailed analysis of the hot-data volumes [top N KV] and the external clients' access patterns [verb + resource] provided by the data analysis platform can serve as the basis for further etcd cluster splitting.
2. Client data analysis
Client data analysis consists of the longest N KV analysis and the top N namespace analysis.
An obvious fact: the larger the KV touched by a single read or write, the longer etcd's response time. Once the longest N KVs written by clients are identified, we can review with the corresponding platform users whether their way of using the platform is reasonable, reducing both the traffic pressure their business puts on the K8s platform and the storage pressure on etcd itself.
Each namespace of the K8s platform is generally allocated to a single business for its own use. As mentioned earlier, K8s can be overwhelmed by list-all pressure, and most of that pressure arrives as namespace-level list-all requests. With the top N namespaces identified by the platform, we closely monitor the list-all long-lived connections of the businesses with the largest data volumes and apply rate limiting to them at the kube-apiserver level; this is basically enough to keep the K8s cluster from being knocked over by such long connections and thus to preserve the cluster's availability.
3. Redundant data analysis
etcd holds cold data as well as hot data. Cold data generates no external traffic pressure, but it increases the lock granularity of etcd's in-memory index, which in turn raises the RT of every etcd access and lowers overall QPS.
Recently, by analyzing redundant data in the etcd of a large-scale [7k nodes and above] K8s cluster, we found a large volume of business data stored in etcd that had not been accessed even once in a week. Asking the business team, we learned that they were using the K8s cluster's etcd as a cold backup for their CRD data. After communicating with them and migrating the data out of etcd, the number of keys in memory immediately dropped by about 20%, and the P99 RT of most etcd KV operations immediately dropped by 50% to 60%.
4. Load balancing
K8s platform operators generally share this experience: whenever etcd cluster members are restarted, all kube-apiservers should be restarted as soon as possible to rebalance the connections between kube-apiserver and etcd. There are two reasons:
- when kube-apiserver starts, it connects to a more or less random node of the etcd cluster, but after etcd members have been restarted the connection counts between kube-apiservers and etcd nodes become uneven, leaving the client pressure on each etcd node unbalanced;
- when the connections are balanced, about 2/3 of all read and write requests are forwarded to the leader through a follower (in a 3-node cluster), which keeps the overall load on the etcd cluster even; when the connections are unbalanced, the cluster's performance cannot be properly assessed.
With the per-node connection load reported by the etcd operations platform, the balance of connections across the cluster can be observed in real time, and the right moment for operator intervention can be chosen to keep the etcd cluster healthy overall.
In fact, the latest etcd release, v3.5, already provides automatic load balancing between etcd clients and etcd nodes, but it has not been out for long and the latest K8s version does not yet support it. By tracking the K8s community's progress toward supporting v3.5, we can collect this technical dividend as soon as it lands and reduce the operational pressure on the platform.
The road to the future
After more than a year of high-availability work on K8s, covering kube-apiserver and etcd, the K8s cluster has now stabilized; notably, it has gone six months without a P-level failure. But the high-availability work cannot stop there: Ant Group, sitting in the leaders' quadrant of large-scale K8s construction worldwide, is taking on K8s clusters with even more nodes, and that effort will in turn drive further improvement of our etcd cluster capabilities.
Much of the etcd work described above revolves around strengthening its scale-up capability, and that capability still needs to be pushed further:
- follow the latest etcd features closely, and promptly convert the open-source value created by the community's progress into customer value on the Ant K8s platform;
- follow Alibaba's etcd work on compaction algorithm optimization, single-node multi-boltdb architecture, and kube-apiserver server-side data compression [see reference document 1], learning from and feeding back into the work of our sibling teams so that we improve together;
- track etcd's performance bottlenecks on Ant's own K8s platform, and propose our own solutions, raising the technical value of our platform while giving back to open source.
Besides single-node etcd performance, our next step focuses on scaling out with distributed etcd clusters. The cluster-splitting work described above is, in essence, already a distributed etcd cluster improving overall performance, with the data partitioned by K8s business-level data type.
This work can be taken a step further: without regard to the business meaning of each KV, data can be routed at the pure KV level to multiple back-end etcd sub-clusters according to some routing scheme, balancing hot and cold load across the etcd cluster as a whole.
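As a hedged sketch of that idea only (not Ant's design), the snippet below routes each key to one of several etcd sub-clusters by hashing the key bytes, using go.etcd.io/etcd/client/v3; the sub-cluster endpoints and the hash-based routing function are assumptions for illustration.

```go
// Hedged sketch of the proxyless idea: the client (kube-apiserver in
// practice) picks an etcd sub-cluster per key and talks to it directly,
// with no proxy hop in the request path.
package main

import (
	"context"
	"hash/fnv"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

type shardedKV struct {
	subClusters []*clientv3.Client
}

// route picks a sub-cluster purely from the key bytes, with no awareness of
// the K8s-level meaning of the data (pods, nodes, events, ...).
func (s *shardedKV) route(key string) *clientv3.Client {
	h := fnv.New32a()
	h.Write([]byte(key))
	return s.subClusters[int(h.Sum32())%len(s.subClusters)]
}

func (s *shardedKV) Put(ctx context.Context, key, val string) error {
	_, err := s.route(key).Put(ctx, key, val)
	return err
}

func (s *shardedKV) Get(ctx context.Context, key string) (*clientv3.GetResponse, error) {
	return s.route(key).Get(ctx, key)
}

func main() {
	var s shardedKV
	for _, eps := range [][]string{
		{"etcd-a-1:2379", "etcd-a-2:2379", "etcd-a-3:2379"}, // placeholder sub-cluster A
		{"etcd-b-1:2379", "etcd-b-2:2379", "etcd-b-3:2379"}, // placeholder sub-cluster B
	} {
		cli, err := clientv3.New(clientv3.Config{Endpoints: eps, DialTimeout: 5 * time.Second})
		if err != nil {
			log.Fatal(err)
		}
		s.subClusters = append(s.subClusters, cli)
	}
	if err := s.Put(context.Background(), "/registry/pods/default/nginx", "{}"); err != nil {
		log.Fatal(err)
	}
}
```

A real routing layer would have to route by key prefix rather than a whole-key hash, so that range reads and watches over a prefix still land on a single sub-cluster; the sketch only illustrates the proxyless request path.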
There are two ways to implement a distributed etcd cluster: proxy-based and proxyless. In a proxy-based distributed etcd cluster the request path is client [kube-apiserver] -> proxy -> etcd server, whereas in a proxyless distributed etcd cluster the request path is client [kube-apiserver] -> etcd server.
The advantage of the proxy-based approach is that it can be built directly on the etcd proxy provided by the community and later contributed back, unifying its open-source value, technical value and customer value. Our tests, however, showed that sending reads and writes to etcd through the proxy degrades kube-apiserver's RT and QPS by 20% to 25%, so the next step is to develop a proxyless etcd cluster.
The split etcd clusters we run today are essentially already proxy-based distributed clusters, or proxy-based with roughly 67% probability: about two-thirds of kube-apiserver requests land on a follower and are forwarded to the leader, and that follower is in effect a proxy. If every kube-apiserver request were handled over a direct connection to the leader, the RT and QPS of the current K8s cluster should in theory gain roughly 67% * 20% ≈ 13.4%.
The disadvantage of a proxyless etcd distributed cluster is that moving the proxy's routing logic into kube-apiserver raises the cost of kube-apiserver version upgrades. But compared with a gain of at least 20% [and the gains will only grow as the etcd cluster scales out further], an upgrade cost confined to the single kube-apiserver component is well worth paying.
Beyond the multi-etcd-cluster idea, the data middleware team has implemented the etcd V3 API on top of OBKV. This is another promising technical route, quite similar to the etcd V3 API layer on top of TiKV that Huang Dongxu alluded to at the beginning of this article; it can be called an etcd-compatible system, and that work is also in progress.
In short, as our K8s clusters keep growing, the importance of Ant Group's etcd work becomes ever more prominent. If the early road to etcd high availability was a stumble along muddy trails, the road ahead will be a broad highway that grows wider and wider!
Reference documents
- Reference document 1: https://www.kubernetes.org.cn/9284.html
- Reference document 2: https://tech.meituan.com/2020/08/13/openstack-to-kubernetes-in-meituan.html
About the Author
Yu Yu (github @AlexStocks), dubbogo community leader, a programmer with 11 years of front-line work experience in server-side infrastructure and middleware research and development.
He has participated in and improved well-known projects such as Redis, Pika, Pika-Port, etcd, Muduo, Dubbo, dubbo-go and Sentinel-go. He currently works on container orchestration in the large-scale K8s cluster scheduling team of Ant Group's Trusted Native Department, helps maintain one of the world's largest Kubernetes production clusters, and is committed to building large-scale, financial-grade, trusted cloud-native infrastructure.
If you are interested in serverless auto-scaling, adaptive co-location (hybrid deployment), or secure-container technologies such as Kata/Nanovisor, or if you are a graduate of the class of 2022, you are welcome to join us.
Contact email xiaoyun.maoxy@antgroup.com or yuyu.zx@antgroup.com.