Author | Han Tang, Zheyuan, Shenzui
Source | Alibaba Cloud Native official account
Preface
In an interview, the Taiwanese writer Lin Qingxuan looked back on his more-than-30-year writing career: "In the first ten years my talent was on full display, and that flash of brilliance eclipsed everything around me. In the second ten years a gentler 'glow of treasure' finally appeared: I no longer fought for the limelight, but complemented the beauty around me. Entering the third decade, having seen through all the splendor, I reached the stage where a 'mellow light emerges', and only then did I truly appreciate the beauty of that realm."
The night is long, and true water has no fragrance. After riding the excitement of K8s "in the rivers and lakes" and watching its ecosystem blossom, it is time to step back and appreciate the quieter beauty of a highly available system. After all, in the martial-arts world, you only earn your standing if you can take a beating.
There is a well-known problem in the field of K8s high availability: how do you keep guaranteeing the SLO as a single K8s cluster grows in scale? Today, we use the high-availability challenges brought by single-cluster growth as an entry point to give everyone a concrete feel for the problem.
A single ASI cluster has grown well beyond the community's supported scale of 5,000 nodes, which is both interesting and challenging. For anyone who needs to run K8s in production, and even for those who already have K8s production experience, this is bound to be a topic of interest. Looking back at the path of ASI's single-cluster scale from 100 nodes to 10,000, every jump in cluster size driven by business growth and innovation gradually changed the pressures and challenges we faced.
ASI: Alibaba Serverless infrastructure, Alibaba's unified infrastructure designed for cloud-native applications. ASI is the Alibaba Group enterprise edition of ACK, Alibaba Cloud's public Kubernetes service.
As is well known, a community K8s cluster supports at most 5,000 nodes. Beyond that scale, various performance bottlenecks appear, for example:
- etcd read and write latency rises sharply.
- kube-apiserver queries for pods/nodes become very slow, and can even drive etcd out of memory.
- Controllers cannot observe data changes in time, e.g. watch events arrive late.
Take the e-commerce scenario as an example. As we grew from 100 nodes to 4,000, we did a great deal of performance optimization on both the client and server sides of the ASI apiserver in advance. On the apiserver client side, we prioritized reads from the local cache and added client-side load balancing; on the apiserver server side, the main work was watch optimization and cache index optimization; in the etcd kernel, we used concurrent reads to raise the read throughput of a single etcd cluster, introduced a new hashmap-based freelist management algorithm to raise etcd's storage ceiling, and used raft learners to improve replication capacity.
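To make the "read from the local cache first" idea on the client side concrete, here is a minimal sketch using a standard client-go shared informer. It is an illustration of the pattern, not ASI's actual client code; the kubeconfig path and namespace are placeholders.

```go
// Read-from-local-cache pattern: one LIST+WATCH feeds an in-memory cache,
// so hot read paths hit the local lister instead of kube-apiserver.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a standard kubeconfig; an in-cluster component would use in-cluster config.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The factory performs a single LIST+WATCH per resource to fill the cache.
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podLister := factory.Core().V1().Pods().Lister()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// This query is served from local memory and never touches the apiserver or etcd.
	pods, err := podLister.Pods("default").List(labels.Everything())
	if err != nil {
		panic(err)
	}
	fmt.Printf("pods cached locally: %d\n", len(pods))
}
```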
From 4,000 nodes to 8,000, we added QPS rate-limit management and capacity-management optimization, split etcd storage by resource object, and landed full-lifecycle component specifications, using client-side specification constraints to reduce the pressure on the apiserver and, through it, on etcd.
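One concrete form such a client-side specification constraint can take is capping a component's client QPS and burst and declaring a stable user agent. The sketch below uses standard client-go fields; the numbers and the component name are illustrative assumptions, not ASI's real defaults.

```go
// Cap a component's request rate at the client so a single misbehaving
// client cannot overload kube-apiserver.
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

func newRateLimitedConfig(kubeconfig string) (*rest.Config, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	// Hard ceiling for this component; real values would come from the
	// component specification reviewed at admission time (illustrative here).
	cfg.QPS = 20
	cfg.Burst = 40
	// A stable, descriptive UserAgent also enables per-UA server-side
	// rate limiting and auditing.
	cfg.UserAgent = "example-controller/v1.0"
	return cfg, nil
}

func main() {
	cfg, err := newRateLimitedConfig(clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	_ = client // the clientset would be handed to the component's controllers
	fmt.Println("client capped at", cfg.QPS, "QPS")
}
```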
Finally came the growth from 8,000 nodes to tens of thousands. We began optimizing the etcd compaction algorithm in earnest, moved each etcd node to a multi-boltdb architecture, added server-side data compression in the apiserver, and managed components to reduce etcd write amplification. At the same time, we started building a normalized stress-testing service so that we could keep answering for ASI's SLO.
Such examples are commonplace among high-availability challenges, and the capabilities listed above are only a small part of the whole; from them alone it may be hard to see how the capabilities relate to each other or the evolution logic underneath. Of course, much more capability building has been deposited into our systems and mechanisms. This article is a starting point: it gives an overview of several key parts of ASI's global high-availability system, and detailed write-ups of the individual technical points and their evolution will follow. If you have questions or topics you would like to see covered, please leave a comment.
ASI Global High Availability Overview
High availability is a fairly complex proposition. Any routine change, such as a service upgrade, a hardware replacement, a data migration, or a sudden surge in traffic, can degrade the service SLO or even make the service unavailable.
As a container platform, ASI does not exist in isolation; together with the underlying cloud and shared public services it forms a complete ecosystem. Solving ASI's high-availability problem therefore requires looking at the whole picture, finding the best solution at each layer, and connecting them into the best overall solution. The layers involved include:
- Cloud infrastructure management, including availability-zone selection and planning, and hardware asset management
- Node management
- ASI cluster management
- Public services
- Cluster operation and maintenance
- Application R&D
In ASI's scenario especially, the number of business clusters to support is huge, many R&D and operations people are involved, features are released frequently under an iterative development model, and the variety of business types makes the runtime complex and changeable. Compared with other container platforms, ASI therefore faces more high-availability challenges, and the difficulty is self-evident.
ASI global high availability design
As shown in the figure below, the overall strategy for building high-availability capabilities at this stage is built around 1-5-10 (detect a fault within 1 minute, locate it within 5 minutes, stop the loss within 10 minutes), so that any SRE or developer can take on-call duty interchangeably.
Avoiding problems as much as possible, and discovering, locating, and recovering from them as quickly as possible when they do occur, is the key to reaching that goal. To that end, we break the ASI global high-availability system into three parts: first, basic capability building; second, building the emergency-response system; and third, keeping those capabilities fresh and continuously evolving through normalized stress testing and fault drills.
Driven by the rotation of these three parts, the ASI global high-availability system takes shape. At the top sit the SLO system and the 1-5-10 emergency-response system. Behind the emergency-response and data-driven systems, we have built a large number of basic high-availability capabilities, including the defense system, high-availability architecture upgrades, the fault self-healing system, and continuous-improvement mechanisms. We have also built several platforms that provide supporting capabilities for the high-availability system, such as the normalized fault-drill platform, the full-link simulated stress-testing platform, the alarm platform, and the plan (runbook) center.
Global high-availability basic capacity building
Before we built the overall high-availability capability, the system was developing and changing rapidly, accidents and near-misses kept happening, and we had to respond to emergencies every so often. We were always chasing problems after the fact, without effective means to deal with them, and faced several serious challenges:
- How do we improve availability in terms of architecture and capability, and reduce the probability and impact of system failures?
- How do we make breakthroughs in the performance and architecture of the core links, to support such complex and changeable business scenarios and the general demands of business growth?
- How do we stop chasing problems after the fact, do prevention well, and avoid emergencies altogether?
- When an emergency does occur, how do we discover it, diagnose it, and stop the loss quickly?
Analyzing these problems, we summarized the following core causes:
- Insufficient availability capability: in the Group's scenario, components change constantly, increasing the system's pressure and complexity. ASI lacked production-grade availability capabilities such as rate limiting, degradation, and load balancing; components were prone to misuse, causing low-level errors and hurting cluster availability.
- Insufficient system risk control and pod protection: in the event of human error or a system bug, business pods could easily suffer collateral or large-scale damage.
- Capacity risk: there are several hundred clusters and close to a hundred components. In addition, because of the podCIDR configuration and the number of node IPs, most ASI meta-clusters are limited to 128 nodes. With the rapid growth of the business, capacity risk is a major challenge.
- Limited single-cluster scale and insufficient horizontal scalability hold back business growth: the continuous growth of a single cluster, changes in business types, and component changes all affect the maximum scale a single cluster can support and threaten the continued stability of the SLO.
1. Top-level design of high-availability basic capabilities
In response to these problems, we made a top-level design for the basic high-availability capabilities. The overall capability building is divided into several parts:
- Performance optimization and high-availability architecture building: improve the range of business types and the volume of business the cluster can support, mainly through performance optimization and architecture upgrades.
- Full-lifecycle component specification management: enforce specifications across a component's entire life cycle, from its creation and admission to the cluster, through every change, to its retirement, so that components cannot be abused, grow wild, or expand without limit, and stay within the bounds the system can control.
- Building the offense-and-defense system: starting from the ASI system itself, improve the system's security, defense, and risk-control capabilities from an attack-and-defense perspective.
The following sections describe several key capabilities we built around our main pain points.
2. Pain points of K8s single cluster architecture
- Insufficient control over the apiserver and insufficient emergency means: in our own experience, cluster-master anomalies have occurred more than 20 times, with recovery taking longer than an hour.
- The apiserver is effectively a single point of the cluster's control plane, with a large blast radius.
- The single cluster is large and the apiserver's memory level is high; the pressure comes from frequent queries and from writing more and larger resource objects.
- The business layer lacks cross-datacenter disaster tolerance: when ASI is unavailable, it can only rely on ASI's own resilience.
- The continuously growing cluster size and the massive creation and deletion of offline jobs put ever greater pressure on the cluster.
There are two main angles for improving the availability of the cluster architecture: besides architecture optimization and performance breakthroughs within a single cluster, horizontal scaling capabilities such as multi-cluster are needed to support a larger overall scale.
- One is to use multi-cluster capabilities such as federation to solve horizontal scalability beyond a single cluster and cross-cluster disaster recovery within a region.
- The other is that the architecture of a single cluster itself can provide differentiated SLO guarantees through isolation and priority strategies.
3. ASI architecture upgrade landing
1) APIServer multi-channel architecture upgrade
The core of the solution is to group apiservers and apply different priority strategies to each group, thereby providing differentiated SLO guarantees for different services.
Reduce the pressure on the main-link apiserver by diverting traffic (the core requirement)
- P2 and lower-priority components connect to the bypass apiserver, which can be rate-limited as a whole in emergencies (for example, when its own stability is affected).
Use the bypass apiserver together with the main link for blue-green and canary releases (secondary requirement)
- The bypass apiserver can run an independent version, adding a canary dimension for new capabilities, for example an independent rate-limiting strategy or verification of a new feature.
SLB disaster preparedness (secondary requirement)
- The bypass apiserver can continue to serve requests when the main apiserver is abnormal (controllers need to switch the target address themselves; see the sketch below).
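A hedged sketch of that controller-side switch follows; the endpoints and the health-check policy are illustrative assumptions, not ASI's real implementation.

```go
// Probe the main apiserver with a cheap read and fall back to the bypass
// apiserver if the probe fails.
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func buildClient(kubeconfig, host string) (*kubernetes.Clientset, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	cfg.Host = host // override the server address from the kubeconfig
	return kubernetes.NewForConfig(cfg)
}

func main() {
	kubeconfig := clientcmd.RecommendedHomeFile
	mainHost := "https://apiserver-main.example:6443"     // hypothetical endpoint
	bypassHost := "https://apiserver-bypass.example:6443" // hypothetical endpoint

	client, err := buildClient(kubeconfig, mainHost)
	// Use the server-version call as a liveness probe of the main channel.
	if err == nil {
		if _, verr := client.Discovery().ServerVersion(); verr == nil {
			fmt.Println("using main apiserver")
			return
		}
	}
	if client, err = buildClient(kubeconfig, bypassHost); err != nil {
		panic(err)
	}
	if _, err := client.Discovery().ServerVersion(); err != nil {
		panic(err)
	}
	fmt.Println("main apiserver unreachable, switched to bypass apiserver")
}
```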
2) ASI multi-cluster federation architecture upgrade
At present, a single data center in the Zhangbei region already has tens of thousands of nodes. If the problem of managing multiple clusters is not solved, the following issues arise:
- Disaster tolerance: deploying the central units of core trading applications in one cluster is very risky; in the worst case, if that cluster becomes unavailable, the entire application becomes unavailable.
- Performance: some core applications are extremely sensitive at certain moments and set strict per-machine limits and mutually exclusive CPU guarantees; if they are all deployed in one cluster, the limited node scale forces applications to stack up, creating CPU hotspots and failing the performance requirements. For the ASI control-plane master, a single cluster that expands without limit will always hit performance bottlenecks and will one day be unable to cope.
- Operation and maintenance: when an application scales out and finds no resources, SRE has to decide which cluster to add nodes to, which adds to SRE's cluster-management workload.
Therefore, ASI needs a unified multi-cluster management solution that gives upper-layer PaaS platforms, SREs, application developers, and others better multi-cluster management capabilities, shields them from the differences between clusters, and makes it easy to share resources across parties.
ASI chose to build on the community Federation v2 project to meet these needs.
4. The performance challenges that scale growth brings to a K8s cluster
What performance problems does a large-scale K8s cluster run into?
- First, query-related problems. The most important thing in a large cluster is to minimize expensive requests. With millions of objects, querying pods by label or namespace, or listing all nodes, can easily cause etcd and kube-apiserver to OOM, drop packets, or even avalanche (see the list-pagination sketch after this list).
- Second, write-related problems. etcd suits read-heavy, write-light workloads; a large volume of write requests keeps growing the db size, pushes write performance to its bottleneck, triggers throttling, and degrades read performance. For example, large numbers of offline jobs frequently create and delete pods, and the write amplification of pod objects along the ASI link ultimately multiplies the write pressure on etcd by dozens of times.
- Finally, problems with large resource objects. etcd is designed to store small key-value data; with large values, its performance drops off sharply.
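The sketch below shows the client-side half of taming expensive requests, assuming a standard client-go setup: narrow the query with a selector and page the results so no single response is huge. The selector and page size are illustrative.

```go
// Page a potentially huge LIST with Limit/Continue instead of fetching
// everything in one expensive request.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Another common relief valve (not shown here) is listing with
	// ResourceVersion "0", which lets kube-apiserver answer from its watch
	// cache instead of a quorum read against etcd, at the cost of strict
	// consistency.
	opts := metav1.ListOptions{
		LabelSelector: "app=web", // illustrative selector
		Limit:         500,       // page size keeps single responses small
	}
	total := 0
	for {
		pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), opts)
		if err != nil {
			panic(err)
		}
		total += len(pods.Items)
		if pods.Continue == "" {
			break
		}
		opts.Continue = pods.Continue
	}
	fmt.Printf("matched pods: %d\n", total)
}
```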
5. ASI performance bottleneck breakthrough
ASI performance optimization direction
ASI's performance can be optimized from three angles: the apiserver client, the apiserver server, and etcd storage.
- On the client side, cache optimization lets each client read from its local informer cache first; load-balancing optimization is also needed, mainly toward the apiserver and etcd. For all of these client optimizations, component performance specifications can be used to verify compliance when a component is enabled and admitted.
- On the apiserver side, optimization can be done at three levels: the access layer, the cache layer, and the storage layer. At the cache layer, we focused on cache index optimization and watch optimization; at the storage layer, we compressed pod data with the snappy algorithm (a standalone illustration of the idea follows this list); and at the access layer, we focused on building rate-limiting capabilities.
- On the etcd storage side, we have also done a great deal of work, including various algorithm optimizations in the etcd kernel and the ability to split different resources into different etcd clusters for basic horizontal sharding. We also improved scalability with multi-boltdb support at the etcd server layer.
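As a standalone illustration of the compression idea (using github.com/golang/snappy; this is not ASI's actual apiserver storage code, and the payload is synthetic), compressing a large serialized object before it is written means etcd stores a much smaller value:

```go
// Compress a large serialized object before persisting it so the stored
// value is much smaller than the raw bytes.
package main

import (
	"fmt"

	"github.com/golang/snappy"
)

func main() {
	// Pretend this is a large serialized object destined for etcd.
	raw := make([]byte, 0, 64*1024)
	for i := 0; i < 4096; i++ {
		raw = append(raw, []byte("annotation-value-")...)
	}

	compressed := snappy.Encode(nil, raw)
	fmt.Printf("raw=%d bytes, snappy=%d bytes\n", len(raw), len(compressed))

	// Verify the round trip before trusting the compressed form.
	restored, err := snappy.Decode(nil, compressed)
	if err != nil || len(restored) != len(raw) {
		panic("round trip failed")
	}
}
```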
6. K8s clusters have weak preventive capabilities
In K8s, kube-apiserver is the unified entry point, and all controllers and clients work around it. Although our SREs impose specification constraints across the full component life cycle, for example checkpoint approval at the component-enablement and cluster-admission stages, and although the cooperation and rework of each component owner has prevented a large number of low-level errors, some controllers and some behaviors remain uncontrollable.
Besides failures at the infrastructure level, changes in business traffic are another factor that makes K8s unstable: a sudden burst of pod creations and deletions, if unrestricted, can easily bring the apiserver down.
In addition, improper operations or code bugs can harm business pods, for example deleting pods by mistake.
We therefore combine all of these risks in a layered design and carry out risk prevention and control layer by layer.
7. Strengthening the prevention capabilities of a single ASI cluster
1) Multi-dimensional (resource/verb/client) fine-grained rate limiting at the API access layer
The rate limiting available in the early days of the community mainly used max-inflight to control the overall concurrency of reads and writes. We recognized the lack of rate-limiting capability before APF came out: there was no way to limit by the source of a request. APF limits by User (which requires the request to pass the authn filter first), but it has shortcomings: on the one hand, authn is not cheap; on the other hand, APF only apportions the API server's capacity according to configuration and is not in itself a rate-limiting or emergency plan. We urgently needed a rate-limiting capability for emergencies, so we developed the ua limiter ourselves and, on top of its simple configuration, built a rate-limit management capability that can easily be rolled out to hundreds of clusters for default rate limiting and for emergency rate-limit plans.
Below is a comparison of our self-developed ua limiter with other rate-limiting solutions.
ua limiter, APF, and Sentinel focus on different aspects of rate limiting:
- ua limiter provides a simple QPS hard limit keyed by user agent.
- APF focuses more on concurrency control, considering traffic isolation and fairness after isolation.
- Sentinel is comprehensive in features, but its fairness support is not as complete as APF's and its complexity is somewhat too high.
Considering our current needs and scenarios, the ua limiter fits best, because we distinguish components by their user agents. Of course, to achieve finer-grained rate limiting in the future, APF and other solutions can still be brought in as further reinforcement.
How do we manage the rate-limiting strategy? There are hundreds of clusters of different sizes, with different node and pod counts, and nearly a hundred internal components; each component requests about four kinds of resources on average, and each resource has about three different verbs. If every combination were limited, the rules would explode, and even after convergence the maintenance cost would be very high. So we focus on the most critical pieces: the core resources (pod and node), the core verbs (create, delete, and large queries), and the biggest sources of load (daemonset components and PV/PVC resources). Combined with analysis of real online traffic, we distilled about 20 general rate-limiting strategies and folded them into the cluster delivery process to close the loop.
When a new component is onboarded, we also design its rate limits; if it is special, rules are bound and policies are issued automatically during cluster admission and deployment. If heavy rate limiting occurs, alarms are triggered and SREs and developers follow up to optimize and resolve it. A much-simplified sketch of the per-user-agent hard limit follows.
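The sketch below captures only the spirit of a "QPS hard limit keyed by User-Agent": an HTTP filter that rejects over-budget clients with 429. The rule values, the default limit, and the wiring are assumptions for illustration; the real ua limiter, its rule format, and its deployment are not shown here.

```go
// Per-User-Agent QPS hard limit implemented as an HTTP middleware.
package main

import (
	"log"
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

type uaLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	rules    map[string]rate.Limit // user-agent -> allowed QPS (hypothetical rules)
	burst    int
}

func (u *uaLimiter) limiterFor(ua string) *rate.Limiter {
	u.mu.Lock()
	defer u.mu.Unlock()
	if l, ok := u.limiters[ua]; ok {
		return l
	}
	qps, ok := u.rules[ua]
	if !ok {
		qps = 100 // illustrative default for clients with no explicit rule
	}
	l := rate.NewLimiter(qps, u.burst)
	u.limiters[ua] = l
	return l
}

func (u *uaLimiter) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !u.limiterFor(r.UserAgent()).Allow() {
			// Rejections would also be counted and alerted on in practice.
			http.Error(w, "rate limited by user-agent rule", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	lim := &uaLimiter{
		limiters: map[string]*rate.Limiter{},
		rules:    map[string]rate.Limit{"noisy-daemonset/v1": 5}, // hypothetical rule
		burst:    10,
	}
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { w.Write([]byte("ok\n")) })
	log.Fatal(http.ListenAndServe(":8080", lim.wrap(ok)))
}
```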
2) Fine-grained risk control for pod operations
All pod-related operations are connected to the Kube Defender unified risk-control center, which performs flow control at the second, minute, hour, and day level. The global risk-control rate-limiting component is deployed centrally and maintains the rate-limit function called by the interfaces of each scenario.
The defender is a risk-control system that, from the perspective of the whole K8s cluster, protects against (flow control, circuit breaking, validation) and audits risky operations initiated by users or automatically by the system. We built the defender mainly for the following reasons:
- For components like kubelet and the controllers, a cluster contains many processes; no single process sees the global picture, so none can rate-limit accurately.
- From an operations perspective, rate-limit rules scattered across components are hard to configure and audit; when an operation fails because of rate limiting, the investigation chain is long, which hurts the efficiency of problem location.
- K8s is a distributed design oriented toward the desired end state, and every component can make its own decisions, so a centralized service is needed to control the risk of dangerous decisions.
The architecture of the defender is as follows:
- The defender server is a K8s cluster-level service; multiple instances can be deployed, one active and the others standby.
- Users can configure risk-control rules through kubectl.
- Components in K8s such as the controllers, kubelet, and extension controllers connect to the defender through the defender SDK (with minor changes), ask the defender to perform a risk check before executing a dangerous operation, and decide whether to proceed based on the result. As a cluster-level risk-control and protection center, the defender safeguards the overall stability of the K8s cluster. A hedged sketch of this interaction pattern follows.
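To make the interaction pattern concrete, here is a hedged sketch. The defender SDK and its request/response shapes are not public in this article, so every name below (the endpoint, checkRequest, checkResponse, the fail-safe policy) is a hypothetical stand-in; only the shape of the flow, asking the central risk-control service before a dangerous operation and proceeding only if allowed, reflects the text above.

```go
// Pre-check a dangerous operation against a central risk-control service.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type checkRequest struct {
	Operation string `json:"operation"` // e.g. "delete-pods"
	Cluster   string `json:"cluster"`
	Count     int    `json:"count"` // blast radius of this operation
}

type checkResponse struct {
	Allowed bool   `json:"allowed"`
	Reason  string `json:"reason"`
}

// precheck asks the (hypothetical) risk-control endpoint for approval.
func precheck(riskCheckURL string, req checkRequest) (checkResponse, error) {
	body, _ := json.Marshal(req)
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Post(riskCheckURL, "application/json", bytes.NewReader(body))
	if err != nil {
		// Fail-safe policy is itself a design choice: here we block the
		// dangerous operation when risk control is unreachable.
		return checkResponse{Allowed: false, Reason: err.Error()}, err
	}
	defer resp.Body.Close()
	var out checkResponse
	err = json.NewDecoder(resp.Body).Decode(&out)
	return out, err
}

func main() {
	verdict, _ := precheck("http://defender.example/check", checkRequest{
		Operation: "delete-pods", Cluster: "cluster-a", Count: 500,
	})
	if !verdict.Allowed {
		fmt.Println("blocked by risk control:", verdict.Reason)
		return
	}
	fmt.Println("risk control passed, proceeding with the operation")
}
```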
3) Data-driven capacity management
When there were only a few core clusters, capacity could be managed by expert experience. But with the rapid growth of the container business, workloads spanning general trading, middleware, new ecosystems, new computing, and the sales regions have been onboarding to ASI. In just a few years we have grown to hundreds of clusters; will it be tens of thousands in a few more? So many clusters can no longer be managed with traditional manual methods, and the labor cost keeps rising. In particular, when faced with problems like the following, it is easy to end up with low resource utilization and serious waste of machines, and eventually with some clusters running a capacity-shortage risk.
- Components keep changing, and business types and pressure change too; nobody knows the real online capacity (how much QPS can actually be carried). When the business needs to add traffic, do we need to scale up? Is there a point where even horizontal scaling cannot solve the problem?
- Early container resource requests were arbitrary, resulting in serious waste of resource cost. We need to determine how much resource (CPU, memory, and disk) should reasonably be requested with the goal of minimizing container cost. Within the same region and the same meta-cluster's business clusters, waste in one cluster causes shortage in others.
In ASI, component changes are the norm, and adapting component capacity to this change is also a very big challenge; daily operations and diagnosis need accurate capacity data behind them.
We therefore decided to use data to guide components toward reasonable (low-cost, safe) resource requests, to provide the capacity data needed for daily operations, to complete capacity preparation, and to perform emergency scale-out when the production water level is abnormal.
So far we have completed water-level monitoring, full risk reporting, pre-scheduling, and regular capture of profile performance data, and have driven optimization of CPU and memory requests and their ratio through component specifications. Work in progress includes automated specification suggestions, node-resource supplement suggestions, and automated node import; combined with ChatOps, we are building a one-click "capacity preparation" closed loop in DingTalk groups. We are also using full-link stress-test data to obtain baselines for each component and, through risk decisions, issuing checkpoints to keep components safe as they go online. Going forward, we will keep answering for SLO performance in the real environment and accurately predict capacity based on real online changes.
Global high-availability emergency response capacity building
Building the basic high-availability capabilities gives ASI a strong guarantee against risk, keeping our services available as far as possible when hidden risks surface. But how to intervene quickly to eliminate a hazard once a risk materializes, or how to stop the loss in an orderly way after a failure that the high-availability capabilities cannot cover, is an engineering problem of great technical depth and lateral complexity. This makes building ASI's emergency-response capability a very important investment for us.
At the start of building the emergency-response system, our rapid development and change, and the accidents and near-misses that kept happening, clearly exposed several serious problems we faced at the time:
- Why do customers always find problems before us?
- Why does recovery take so long?
- Why do the same problems repeat?
- Why are only a few people able to deal with online issues?
In response, we brainstormed and discussed thoroughly, and summarized the following core causes:
- A single way of finding problems: metrics data was our only basic means of exposing problems.
- Insufficient ability to locate problems: there were only a few monitoring dashboards, and the observability of the core components was uneven.
- Lack of recovery means: fixing online problems relied on ad-hoc commands and scripts, which was inefficient and risky.
- Lack of a standardized emergency process: no linkage with the business side, a strong engineer mindset that did not treat stopping the loss as the first goal, and insufficient awareness of problem severity.
- Lack of follow-up on long-tail problems: hidden risks found online and action items from incident reviews lacked continuous follow-up, so the same pits were stepped in repeatedly.
- Lack of a capability-preservation mechanism: the business changes so fast that after a while some capabilities end up "unused, or nobody dares to use them, with no guarantee they still work".
1. Top-level design of emergency-response capability building
In response to these urgent problems, we also made a top-level design for the emergency-response capability. The architecture diagram is as follows:
The overall emergency-response capability building is divided into several parts:
- The 1-5-10 emergency-response system: for any sudden risk that appears online, provide the underlying capabilities and mechanisms to discover it within one minute, locate it within five minutes, and recover within ten minutes.
- Problem tracking and follow-up: the ability to continuously track and push forward every hidden risk found online, however serious or minor.
- Capability-preservation mechanism: for the 1-5-10 capabilities, which by nature are used infrequently, a mechanism to keep them exercised and available.
2. Building the sub-modules of the emergency-response capability
For each sub-module in the top-level design, we have done some staged work and achieved some results.
1) One-minute discovery: the ability to find problems
To stop customers from finding problems before we do, the most important goal of our work is: let no problem have anywhere to hide, and let the system discover them proactively.
This is a protracted battle: we have to use every possible means to cover one new class of problem after another, taking one city after the next.
Driven by this goal, we distilled a very effective piece of "strategic thinking", namely "1+1 thinking". Its core point is that any means of finding problems can itself fail occasionally, because of external dependencies or defects in its own stability, so there must always be another link that can serve as its mutual backup for fault tolerance.
Guided by this core idea, the team built two core capabilities, the black-box and white-box alarm channels, each with its own characteristics:
- Black-box channel: based on the black-box idea, ASI as a whole is viewed from the customer's perspective as a black box, and direct commands are issued to probe its forward-path functionality; for example, directly scaling out a StatefulSet.
- White-box channel: based on the white-box idea, potential problems are found from abnormal fluctuations in the observability data of every dimension exposed inside the system; for example, the memory of the apiserver rising abnormally.
The concrete product behind the black-box channel is called KubeProbe. It is a new product our team formed by further optimizing and reworking the ideas of the community kuberhealthy project, and it has become an important tool for judging whether a cluster has serious risks.
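The minimal sketch below conveys the black-box probing idea only (it is not the KubeProbe code, and the namespace and object are placeholders): exercise a real forward path end to end and report pass or fail with the observed latency.

```go
// Black-box probe: create / get / delete a throwaway object against the real
// apiserver and report pass or fail with the observed latency.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.TODO()
	name := fmt.Sprintf("probe-%d", time.Now().Unix())

	start := time.Now()
	cm := &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Name: name}}
	if _, err := client.CoreV1().ConfigMaps("default").Create(ctx, cm, metav1.CreateOptions{}); err != nil {
		fmt.Println("probe FAILED at create:", err)
		return
	}
	if _, err := client.CoreV1().ConfigMaps("default").Get(ctx, name, metav1.GetOptions{}); err != nil {
		fmt.Println("probe FAILED at get:", err)
		return
	}
	if err := client.CoreV1().ConfigMaps("default").Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
		fmt.Println("probe FAILED at delete:", err)
		return
	}
	// A real probe would push this result to the alarm / reporting channel.
	fmt.Printf("probe PASSED in %v\n", time.Since(start))
}
```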
Building the white-box channel is comparatively more complicated: it only shows its real power on top of complete observable data. So we first built three data channels on SLS, covering the three dimensions of metrics, logs, and events, unifying all observable data into SLS for management. We also built an alarm center responsible for batch-managing and distributing the alarm rules of the current hundreds of clusters. The result is a white-box alarm system with complete data and broad problem coverage. Recently, we have been migrating our alarm capabilities to SLS Alarm 2.0 for richer alarm functionality.
2) Five-minute location: automatic root-cause location
As online troubleshooting experience accumulates, we find that many problems recur with some frequency, and their investigation and recovery methods have basically solidified. Even when a problem may have several possible causes behind it, rich troubleshooting experience lets us gradually iterate out a roadmap for investigating it. The figure below shows the troubleshooting route designed for unhealthy-etcd-cluster alarms:
If these well-established troubleshooting routines are solidified into the system, decisions can be triggered automatically after a problem occurs, which greatly reduces the time we spend handling online problems. So we have started building capabilities in this direction.
On the black-box side, KubeProbe has built a closed-loop root-cause location system that sinks troubleshooting expertise into the system and achieves fast, automatic problem location. Using an ordinary root-cause analysis tree, plus a machine-learning classification algorithm for failed-probe events and logs (under continuous development and investment), it locates the root cause of each failed KubeProbe probe. A severity-evaluation system implemented uniformly inside KubeProbe (its rules are still fairly simple today) then assesses how serious the alarm is and decides the appropriate follow-up, such as whether to self-heal or whether to page someone.
On the white-box side, using the orchestration capability of the underlying pipeline engine together with the multi-dimensional data in the data platform we had already built, we implemented a general root-cause diagnosis center: the process of investigating a problem's root cause with various observability data is solidified into the system through YAML orchestration as a diagnosis task, which produces a diagnostic conclusion when triggered. Each conclusion is also bound to a corresponding recovery means, such as invoking a plan or self-healing. A simplified illustration of this pattern follows.
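As a simplified illustration of the pattern only (the real system expresses these steps as YAML tasks on a pipeline engine; every name and rule below is invented for the example): ordered, solidified checks run in turn, the first match yields a conclusion, and each conclusion is bound to a recovery means.

```go
// Diagnosis task as an ordered list of checks, each bound to a recovery action.
package main

import "fmt"

type check struct {
	name       string
	matches    func() bool // would inspect observability data in a real system
	conclusion string
	recovery   string // bound stop-loss means: plan, self-healing, ...
}

// diagnose runs the checks in order and returns the first matching conclusion.
func diagnose(checks []check) (string, string) {
	for _, c := range checks {
		if c.matches() {
			return c.conclusion, c.recovery
		}
	}
	return "unknown root cause", "escalate to oncall"
}

func main() {
	etcdUnhealthy := []check{
		{"disk latency high", func() bool { return false }, "slow disk on etcd member", "migrate member / replace disk"},
		{"db size near quota", func() bool { return true }, "etcd db approaching quota", "trigger compaction and defrag plan"},
		{"frequent leader elections", func() bool { return false }, "unstable etcd leadership", "check network, isolate noisy member"},
	}
	conclusion, recovery := diagnose(etcdUnhealthy)
	fmt.Printf("conclusion: %s -> recovery: %s\n", conclusion, recovery)
}
```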
Both channels use DingTalk bots and similar means to achieve a ChatOps-like effect, which speeds up problem handling for on-call staff.
3) Ten-minute recovery: the ability to stop the loss
To recover from runtime failures and stop the loss faster, we put the building of stop-loss and recovery capabilities first. Our two core principles here are:
- Stop-loss capabilities should be systematized, surfaced through a console ("white-screen") rather than ad-hoc commands, and accumulated over time.
- Everything aims at stopping the loss, not at finding the absolute root cause.
Driven by these two principles, we did two things:
- Built a plan (runbook) center: all our stop-loss capabilities are centralized in one system for console-based management, onboarding, and operation. On the one hand, plans previously scattered among individual developers and documents are now managed centrally; on the other hand, the plan center supports entering plans in a YAML format, allowing low-cost onboarding.
- Built a set of general stop-loss means: based on past experience and ASI's own characteristics, we built several general-purpose stop-loss capabilities as important levers in an emergency, including in-place restart of components, rapid scale-out of components, rapid downgrade of controllers/webhooks, and quickly switching the cluster to read-only (a sketch of the scale-out move follows this list).
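For the rapid scale-out item, here is a hedged sketch expressed as a plain client-go scale update; the namespace, deployment name, and replica count are placeholders, and the real plan-center action and its parameters are not shown here.

```go
// Stop-loss move: quickly bump a deployment's replica count.
package main

import (
	"context"
	"fmt"

	autoscalingv1 "k8s.io/api/autoscaling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// scaleOut sets the deployment's replicas to the given count via the scale subresource.
func scaleOut(client kubernetes.Interface, ns, name string, replicas int32) error {
	scale := &autoscalingv1.Scale{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
		Spec:       autoscalingv1.ScaleSpec{Replicas: replicas},
	}
	_, err := client.AppsV1().Deployments(ns).UpdateScale(context.TODO(), name, scale, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	// Hypothetical target: quickly double a pressured component.
	if err := scaleOut(client, "kube-system", "example-webhook", 6); err != nil {
		panic(err)
	}
	fmt.Println("scale-out submitted")
}
```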
4) BugFix SLO
To address the lack of follow-up on problems, we proposed the BugFix SLO mechanism. As the name suggests, we treat every discovered problem as a "bug" to be fixed, and for this kind of bug we have done the following:
- On the one hand, we defined a set of classification methods to ensure every problem can be assigned to a team and a specific owner.
- On the other hand, we defined resolution priorities, i.e. the SLO for resolving the problem, from L1 to L4; different priorities mean different resolution standards, and L1 means it must be followed up quickly and resolved the same day.
Every two weeks, we produce a stability report based on the problems collected over the period, giving an overview of resolution progress and syncing key issues. Also every two weeks, everyone aligns: an owner is confirmed for each new problem and its priority is agreed.
5) Capability acceptance and preservation mechanism
Since stability risks occur at a relatively low frequency, the best way to keep the capabilities fresh is drills. On this basis, we designed, or took part in, two kinds of drills:
- Normalized failure drill mechanism
- Production raid drill mechanism
【Normalized drill mechanism】
The core purpose of the normalized failure drill mechanism is to keep checking and accepting, at a higher frequency, the failure scenarios relevant to the ASI system and our resilience to those failures, so as both to find stability defects in individual components and to verify the effectiveness of each recovery plan.
So, to raise the drill frequency as much as possible, we:
- On the one hand, began building our own library of fault scenarios, storing, classifying, and managing all scenarios to ensure comprehensive coverage.
- On the other hand, worked with the quality-assurance team to make full use of the fault-injection capability of their Chorus platform, implementing our designed scenarios one by one and configuring them to run continuously in the background. We also used the platform's flexible plug-ins and rich capabilities to connect it with our alarm system and plan system via API, so that after a fault scenario is triggered and injected, the injection, verification, and recovery for that scenario are completed automatically by the background run.
Given the high frequency of normalized drills, we usually run the continuous background drills in a dedicated cluster to reduce the stability risk the drills themselves introduce.
【Production Raid Drill Mechanism】
Even with frequent normalized failure drills, we cannot fully guarantee that if the same problem really occurred in a production cluster we could respond in the same way, nor truly confirm that the blast radius of the failure would match our expectations. The fundamental reason for both gaps is that the clusters used in our normalized failure drills are generally test clusters with no production traffic.
Failure simulation in the production environment therefore reflects the real situation online much more faithfully and strengthens our confidence in the correctness of our recovery means. In practice, by actively taking part in the quarterly production raids organized by the cloud-native team, we have put some of our more complex or more important drill scenarios through a second acceptance in the production environment. Our discovery speed and response speed were evaluated at the same time; we not only found some new problems, but also gained a lot of input on how to design test clusters that better match the real online environment.
Closing words
This article is only the beginning of an overall introduction to some of the exploration and the thinking behind the building of ASI's global high-availability system. The team will follow up with in-depth pieces on specific areas such as building ASI's emergency-response system, building ASI's prevention system, fault diagnosis and recovery, building and operating full-link fine-grained SLOs, and breaking through the performance bottlenecks of ASI single-cluster scale. Stay tuned.
As the leading practitioner of cloud native, ASI's high availability and stability influence, and even determine, the business development of Alibaba Group and its cloud products. The ASI SRE team is hiring on an ongoing basis, with plenty of technical challenges and opportunities. Interested candidates are welcome to reach out: en.xuze@alibaba-inc.com, hantang.cj@taobao.com.