Author: Ji Yeon (Zheng Yan), Huan Bi (He Ying)
What is chaos engineering, and what are its characteristics in the cloud-native wave
By using the services provided by cloud computing vendors such as Alibaba Cloud and AWS, modern service providers can deliver rich software services more stably and at lower cost. But is it really that easy? Even within the scope of their SLA commitments, mainstream cloud vendors have experienced historical failures; see this sobering list of post-mortem reports on GitHub [1]. On the other hand, the high-availability capabilities that cloud products offer their users often still need to be configured and used in the right way.
Chaos engineering helps the providers of business systems identify fragile links in their production services by creating disruptive events, observing how systems and people respond, and implementing improvements against the expected SLA goals. Besides pointing out component design problems that need fixing, chaos engineering also helps find blind spots in monitoring and alerting, and exposes gaps in personnel's understanding of the system, emergency-response SOPs, and troubleshooting skills, thereby greatly raising the overall high-availability level of both the business system and its R&D and operations personnel. This is why, after Netflix proposed the concept, major software vendors have practiced it internally and offered it externally as products.
On top of traditional cloud computing, cloud native provides faster, lower-cost elasticity and better integration of software and hardware, and has become the fastest-growing direction in cloud computing. Cloud native helps developers greatly reduce resource and delivery costs so they can win the market faster and better. At the same time, cloud native has thoroughly changed traditional operations and development practices, which means traditional chaos engineering methods need to evolve along with it.
In the cloud-native context, how does implementing chaos engineering for the application services running on it differ from the traditional approach? From our extensive practice with Alibaba e-commerce and middleware on cloud native, we have summarized the main differences.
Given these differences, it is more appropriate to implement chaos engineering with cloud-native means, targeting scenarios rooted in cloud-native applications; this can deliver greater capability improvements.
Stages and Evolution of the Chaos Engineering Implementation Mode
Since chaos engineering brings so many benefits, how should it be implemented for a cloud-native application service or system?
From the perspective of drill tooling and implementation, an organization's failure drills usually develop through several stages: manual drills, automated drills driven by process tooling, normalized unattended drills, and production raid drills.
The implementation difficulty of these stages goes from low to high, and so do the corresponding benefits. An organization (a cloud user) can follow the growth in volume, complexity, and high-availability requirements of its business application services, choose the stage appropriate to its actual situation, and then upgrade and evolve accordingly. Even starting with the simplest manual drills can often lead to fairly significant and long-lasting systematic improvements in high availability.
So what are the characteristics of each stage, and how should one choose?
- Manual drill: generally done in the early stage of high-availability construction, or for one-off acceptance, by injecting faults manually and manually checking whether alarms take effect and whether the system recovers. At this stage only some small fault-injection tools or scripts are needed, which can conveniently be reused later.
- Automated drill: once high-availability capability building reaches a certain stage, there is often a need to regularly check whether those capabilities have degraded, and automated drills are put on the schedule. An automated drill generally includes the steps environment preparation -> fault injection -> inspection -> environment recovery. Configuring a script for each step forms a drill pipeline that can be run with one click the next time (see the sketch after this list).
- Normalized unattended drill: at the next stage we have higher requirements for drills: we want them to execute autonomously and chaotically, in an unattended manner, which brings new challenges to the system's high availability. This requires not only monitoring and alerting to detect faults, but also a corresponding plan module responsible for recovery. To run unattended, the system must judge fault situations more intelligently and accurately and automatically execute the corresponding plan.
- Production raid: unlike drills conducted in a grayscale environment, which do not affect the business, a production raid requires the system to perform fault drills in the production environment while keeping the blast radius under control, in order to find the business-related, scale-related, configuration-related, and emergency-response-related issues that the grayscale environment misses. Drills in the production environment place higher demands on the system, require a set of execution norms, and demand stronger isolation capabilities. Most of the work and capability building is verified in the grayscale environment, but production raids remain an effective and necessary exercise method: more realistic scenarios give R&D hands-on experience, make them actually execute their plans, and train emergency response capabilities, giving them more confidence in and awareness of the system.
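To make the automated-drill stage above more concrete, here is a minimal sketch in Python that chains the four steps (environment preparation, fault injection, inspection, environment recovery) into a one-click pipeline. The step bodies are placeholders and the scenario name is hypothetical; this is not a specific Alibaba tool, just the general shape of such a pipeline.

```python
import sys

def prepare_environment(scenario: str) -> None:
    # Placeholder: e.g. deploy the workload under test into a drill namespace.
    print(f"[prepare] setting up environment for scenario '{scenario}'")

def inject_fault(scenario: str) -> None:
    # Placeholder: e.g. kill a pod or add network latency with a chaos tool.
    print(f"[inject] injecting fault for scenario '{scenario}'")

def inspect(scenario: str) -> bool:
    # Placeholder: check that alarms fired and the system recovered as expected.
    print(f"[inspect] checking alarms and recovery for scenario '{scenario}'")
    return True

def recover_environment(scenario: str) -> None:
    # Placeholder: remove the injected fault and restore the environment.
    print(f"[recover] cleaning up scenario '{scenario}'")

def run_drill(scenario: str) -> bool:
    """Run one automated drill: prepare -> inject -> inspect -> recover."""
    prepare_environment(scenario)
    try:
        inject_fault(scenario)
        passed = inspect(scenario)
    finally:
        # Always clean up, even if inspection fails or raises.
        recover_environment(scenario)
    return passed

if __name__ == "__main__":
    ok = run_drill("pod-kill-on-checkout-service")  # hypothetical scenario name
    sys.exit(0 if ok else 1)
```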
How to Conduct a Complete Failure Drill
When an application uses Kubernetes for deployment and scaling for the first time, the first concern is whether the functionality works; fault drills are a higher-level requirement. Assume the current system has passed preliminary functional acceptance but its behavior under certain fault conditions is still unknown, and let us start the troubleshooting tour from there. As a destructive operation, a fault drill needs to be carried out step by step and follow certain norms and procedures. Below, we describe how an application deployed on Kubernetes for the first time should implement fault drills, covering environment construction, system capability analysis, high-availability capability building, and suggestions for drill implementation.
Step 1: Isolation Environment Construction
Before a fault drill, especially its first execution, one needs to be clear about the environment the fault is injected into: whether it may affect business traffic, and whether it could cause irreparable losses. Inside Alibaba, we rely on sophisticated environment isolation and change control to prevent fault injection from affecting business traffic.
In terms of environment categories, we distinguish the following:
- Business test environment: used for e2e testing and full functional acceptance. This environment is isolated at the network level from the production network that carries business traffic, which prevents erroneous traffic from leaking into other environments.
- Canary environment: can be understood as a full-link grayscale environment. It contains all the components of the current system and is generally used for upstream/downstream joint debugging and for link grayscale inside the system. This environment carries no real business traffic.
- Safe-production grayscale environment: in this environment we introduce about 1% of production traffic and build the ability to switch traffic away in advance. Once a problem occurs here, traffic can be quickly switched back to the production environment. This environment is generally used to run grayscale with real user traffic for a period of time, avoiding the uncontrollable risk of a full release.
- Production environment: the environment carrying real user traffic. Any operations action here requires strict change review and grayscale approval in the preceding environments before it can be applied.
Fault drills are generally introduced in the canary environment: high-availability capabilities can be built, accepted, and routinely exercised in a full-link environment without real traffic. In the grayscale and production environments, real raids are then conducted, with the blast radius under control, as the acceptance of those capabilities.
In general, considering cost and system complexity, a business application may not build all four isolated environments and advance step by step, but we recommend that an application have at least two environments: one that carries user traffic, and at least one grayscale environment isolated from production. The points to watch in environment construction are as follows:
- Isolation: the grayscale environment and the production environment should be isolated as much as possible, including but not limited to network isolation, permission isolation, and data isolation. For disaster tolerance, the two can even be built as Kubernetes clusters in different regions.
- Authenticity: the grayscale environment should be kept as consistent with the production environment as possible, for example in external dependencies and component versions.
Only when the environment construction meets these standards are the entry conditions for a drill satisfied.
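Before granting a drill tool access to an environment, one basic sanity check is to confirm that the grayscale and production contexts really point at different clusters. Below is a minimal sketch using the official Kubernetes Python client; the context names `grayscale` and `production` are assumptions and should be replaced with your own kubeconfig contexts.

```python
from kubernetes import client, config

# Hypothetical kubeconfig context names; replace with your own.
GRAYSCALE_CONTEXT = "grayscale"
PRODUCTION_CONTEXT = "production"

def api_server_of(context_name: str) -> str:
    """Return the API server host that a kubeconfig context points to."""
    api_client = config.new_client_from_config(context=context_name)
    return api_client.configuration.host

def check_isolation() -> None:
    gray_host = api_server_of(GRAYSCALE_CONTEXT)
    prod_host = api_server_of(PRODUCTION_CONTEXT)
    if gray_host == prod_host:
        raise RuntimeError(
            "Grayscale and production contexts point at the same API server; "
            "fault injection here could hit production."
        )
    # Read-only call to confirm the drill credentials can reach the grayscale cluster.
    gray_api = client.CoreV1Api(config.new_client_from_config(context=GRAYSCALE_CONTEXT))
    nodes = gray_api.list_node(limit=1)
    print(f"Grayscale cluster {gray_host} reachable, sample nodes: {len(nodes.items)}")

if __name__ == "__main__":
    check_isolation()
```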
Step 2: Failure Scenario Analysis
When analyzing the high availability of a system, there is rarely a single answer: every system has different weak points and bottlenecks. There are, however, some general ideas for sorting out a system's high availability.
- Historical failures:
Historical faults are usually the textbook for quickly understanding a system's weak points. By analyzing and classifying historical faults, one can quickly conclude which components of the current system are more prone to problems.
For example, if the system needs fast elastic scaling and a scaling failure could affect business traffic, it clearly depends strongly on Kubernetes scale-out and scale-in capability, so the availability of that capability needs to be monitored. Or, if the system reads and writes data frequently and has had data-inconsistency problems in the past, one can consider building stability at the data level and adding backup and rollback capabilities.
- Architecture analysis:
The architecture of a system determines its bottlenecks to a certain extent. By analyzing the system's dependencies, you can better understand its boundaries, and operations and optimization become easier.
For example, if an application is deployed in active-standby mode, what needs checking is whether the active-standby switchover is smooth and whether it affects business traffic. Or, if an application strongly depends on underlying storage and a storage failure would heavily impact the business, then when sorting out high availability one needs to consider whether there is a degradation plan for storage failures and whether storage problems can be warned of in advance.
- Community Experience:
The architectures of many systems are similar, and referring to the experience of the community or peer companies is like seeing the mock exam in advance: there are always unexpected gains. Whenever a notable failure occurs in the industry, we conduct self-examination and re-sort our own risks, and have found our own problems many times this way. Valuable experience, such as network cables being cut or a database being deleted by a rogue operator, is all on our list of regular drill scenarios.
For Alibaba's cloud-native architecture, we have organized the drill model shown below for reference. In this high-availability model, following the system architecture, the overall cluster is divided into control-plane components, meta-cluster components, extension components, data storage, and the node layer, and each module has some common faults worth borrowing from.
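To make the layered model concrete, the sketch below enumerates it as a simple data structure. The specific fault scenarios listed under each layer are illustrative assumptions for common Kubernetes setups, not the exact contents of Alibaba's drill model.

```python
# Illustrative layered fault model; scenario names are assumptions, not Alibaba's exact list.
FAULT_MODEL = {
    "control-plane components": [
        "kube-apiserver unavailable",
        "kube-scheduler slow or crash-looping",
        "controller-manager leader election lost",
    ],
    "meta-cluster components": [
        "meta-cluster etcd degraded",
        "management components evicted",
    ],
    "extension components": [
        "ingress controller outage",
        "custom controller/webhook timeout",
    ],
    "data storage": [
        "etcd high latency",
        "backup job failure",
    ],
    "node layer": [
        "node NotReady",
        "kubelet process killed",
        "disk pressure on a node",
    ],
}

def list_scenarios() -> None:
    """Print the drill scenarios grouped by architectural layer."""
    for layer, scenarios in FAULT_MODEL.items():
        print(f"{layer}:")
        for s in scenarios:
            print(f"  - {s}")

if __name__ == "__main__":
    list_scenarios()
```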
Step 3: Building the System's High Availability
Before we actually inject faults, we need to ask ourselves a few more questions about the high-availability capabilities listed above: when these faults arrive, does the system have agile discovery capability, do the personnel have rapid response capability, and does the system itself have self-healing capability or usable tools to quickly restore it during a failure? Below are some general recommendations covering the two aspects of discovery capability and recovery capability.
- Discovery
Monitoring and alerting are how we find out whether the system is in a steady state and make that clear to the application owner. Alibaba's internal teams have built two kinds of monitoring and alerting. One is white-box alerting, which uses abnormal fluctuations in the observable data of various dimensions exposed inside the system to discover potential problems; the other is black-box alerting, which takes the customer's perspective, treats the system as a black box, and probes the forward-facing functionality (a sketch of such a probe follows this list).
- Resilience
When a fault occurs, the best outcome is that the system remains stable and smooth with no impact at all. That requires an extremely high level of capability building, and reality is usually more complicated. In Alibaba's internal practice, besides building basic process self-healing, traffic switching, migration, and rate limiting into the system itself, we also built a plan center that centrally deposits all of our stop-loss capabilities into one system, with console-based management, access, and operation, forming a set of stop-loss capabilities based on expert experience that serves as an important tool when failures occur.
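As an example of the black-box alerting and plan-driven recovery described above, the sketch below probes a forward-facing HTTP endpoint from the outside and, after several consecutive failures, fires an alert and invokes a pre-registered stop-loss plan. The endpoint URL, thresholds, and plan action are all assumptions; a real plan center is considerably richer than this.

```python
import time
import urllib.request

# Hypothetical forward-facing endpoint of the system under test.
PROBE_URL = "http://my-service.gray.example.com/healthz"
FAILURE_THRESHOLD = 3      # consecutive failures before alerting
PROBE_INTERVAL_SECONDS = 15

def probe_once(url: str, timeout: float = 5.0) -> bool:
    """Black-box probe: succeed only if the endpoint answers 200 in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and socket timeouts are subclasses of OSError.
        return False

def fire_alert(message: str) -> None:
    # Placeholder: push to your alerting channel (pager, IM bot, etc.).
    print(f"[ALERT] {message}")

def run_stop_loss_plan() -> None:
    # Placeholder for a pre-registered plan, e.g. switching traffic away from
    # the grayscale environment or restarting the failing component.
    print("[PLAN] executing pre-registered stop-loss plan")

def main() -> None:
    failures = 0
    while True:
        if probe_once(PROBE_URL):
            failures = 0
        else:
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                fire_alert(f"{PROBE_URL} failed {failures} consecutive probes")
                run_stop_loss_plan()
                failures = 0
        time.sleep(PROBE_INTERVAL_SECONDS)

if __name__ == "__main__":
    main()
```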
Step 4: Drill Implementation
After the above steps are completed, we consider the system to have preliminary high-availability capability, and fault drills can begin.
Under normal circumstances, we select some core scenarios for the first drill and run it in the pre-release or test environment, triggered by semi-automatic scripts or by pipelines that contain only the fault-injection module, with R&D and operations personnel present. Confirm the expectations for each scenario before the test, for example: after the fault is injected, an alarm fires within 1 minute and the system self-heals within 10 minutes, so that behavior can be checked against them at any point during the drill. After execution, the participants manually confirm whether the system behaved as expected, and the fault and environment must be recovered promptly. Scenarios that do not meet expectations need to be fixed and re-drilled repeatedly at this stage; scenarios that meet expectations are marked and can enter the normalized drill stage.
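A minimal sketch of the kind of expectation check described above: inject a pod-kill fault with the Kubernetes Python client, then verify that the workload becomes Ready again within the 10-minute recovery budget. The namespace, label selector, and alarm check are assumptions (the alarm check is left as a placeholder for your own monitoring API), and the recovery check assumes Deployment-managed pods whose replacements get new names.

```python
import time
from kubernetes import client, config

NAMESPACE = "drill-gray"              # hypothetical drill namespace
LABEL_SELECTOR = "app=demo-service"   # hypothetical workload label
RECOVERY_BUDGET_SECONDS = 600         # expectation: self-heal within 10 minutes

def ready_pods(v1: client.CoreV1Api) -> set:
    """Names of pods matching the selector whose Ready condition is True."""
    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR)
    ready = set()
    for pod in pods.items:
        for cond in (pod.status.conditions or []):
            if cond.type == "Ready" and cond.status == "True":
                ready.add(pod.metadata.name)
    return ready

def alarm_fired_within(seconds: int) -> bool:
    # Placeholder: query your monitoring system to confirm the expected alarm
    # fired within `seconds` of the injection.
    return True

def main() -> None:
    config.load_kube_config()          # uses the current (grayscale) context
    v1 = client.CoreV1Api()

    before = ready_pods(v1)
    if not before:
        raise RuntimeError("no ready pods matching selector; nothing to drill")

    victim = sorted(before)[0]
    print(f"[inject] deleting pod {victim}")
    v1.delete_namespaced_pod(victim, NAMESPACE)

    deadline = time.time() + RECOVERY_BUDGET_SECONDS
    while time.time() < deadline:
        current = ready_pods(v1)
        # Assumes a Deployment: the replacement pod has a new name.
        if len(current) >= len(before) and victim not in current:
            print("[pass] workload recovered within the budget")
            print(f"[check] alarm within 1 minute: {alarm_fired_within(60)}")
            return
        time.sleep(10)
    raise RuntimeError("workload did not recover within the recovery budget")

if __name__ == "__main__":
    main()
```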
The key words of the normalized drill stage are chaos and unattended operation. Thanks to its architecture, a Kubernetes cluster has a certain self-healing ability, so it is well suited to unattended drills. We screen the scenarios that have passed the semi-automatic drills and organize them into fault drill pipelines; each pipeline generally includes steps such as fault injection, monitoring inspection, recovery inspection, and fault recovery, closing the loop of a single drill. At the same time, Alibaba uses cloud-native technology to trigger drills chaotically, randomizing the drill object, environment, time, and scenario, so that these scenarios can be executed in a chaotic, normalized, and unattended way. Normalized fault drills help discover occasional system problems and help check the existing high availability during system upgrades.
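Here is a sketch of the chaotic, unattended triggering just described: scenarios that passed the semi-automatic stage are collected, and a scheduler randomly picks the scenario, target environment, and wait time, then runs the four-step pipeline in a closed loop. The step functions are placeholders and the scenario names are hypothetical.

```python
import random
import time

# Scenarios that have already passed the semi-automatic drill stage (hypothetical names).
SCENARIOS = ["pod-kill", "node-notready", "etcd-latency", "apiserver-restart"]
TARGET_ENVIRONMENTS = ["canary", "grayscale"]   # never production in unattended mode

def inject_fault(scenario: str, env: str) -> None:
    print(f"[inject] {scenario} in {env}")               # placeholder

def monitoring_inspection(scenario: str, env: str) -> bool:
    print(f"[monitor] alarms observed for {scenario} in {env}")
    return True                                          # placeholder

def recovery_inspection(scenario: str, env: str) -> bool:
    print(f"[recovery-check] {scenario} in {env} self-healed")
    return True                                          # placeholder

def recover_fault(scenario: str, env: str) -> None:
    print(f"[recover] cleaning up {scenario} in {env}")  # placeholder

def run_pipeline(scenario: str, env: str) -> bool:
    """Closed loop of a single unattended drill."""
    try:
        inject_fault(scenario, env)
        alarms_ok = monitoring_inspection(scenario, env)
        healed_ok = recovery_inspection(scenario, env)
        return alarms_ok and healed_ok
    finally:
        recover_fault(scenario, env)

def main() -> None:
    while True:
        scenario = random.choice(SCENARIOS)       # random scenario
        env = random.choice(TARGET_ENVIRONMENTS)  # random environment
        passed = run_pipeline(scenario, env)
        print(f"[result] {scenario}@{env}: {'pass' if passed else 'FAIL, needs follow-up'}")
        time.sleep(random.randint(600, 3600))     # random interval between drills

if __name__ == "__main__":
    main()
```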
Production raids need to be implemented according to the structure of the system. In Alibaba's internal practice, one way to control risk is to run them during traffic troughs and to prepare a one-click traffic-switching plan in advance, so that if an unrecoverable failure occurs, the loss can be stopped immediately. Other raid-related risk-control designs will be analyzed in detail in subsequent articles in this series.
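One simple way to express blast-radius control in code is to let the injection tool select victims only by an explicit opt-in label and cap how many instances it may touch, defaulting to a dry run. This is a hedged sketch with the Kubernetes Python client; the namespace, label, and cap are assumptions, not Alibaba's actual raid tooling.

```python
from kubernetes import client, config

NAMESPACE = "prod"                      # hypothetical
DRILL_LABEL = "chaos-drill=allowed"     # only explicitly opted-in pods may be hit
MAX_VICTIMS = 1                         # cap the blast radius per raid

def select_victims(v1: client.CoreV1Api) -> list:
    """Pick at most MAX_VICTIMS pods that explicitly opted in to raids."""
    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=DRILL_LABEL)
    names = [p.metadata.name for p in pods.items]
    return names[:MAX_VICTIMS]

def raid(dry_run: bool = True) -> None:
    config.load_kube_config()
    v1 = client.CoreV1Api()
    victims = select_victims(v1)
    if not victims:
        print("no opted-in pods; refusing to inject anything")
        return
    for name in victims:
        if dry_run:
            print(f"[dry-run] would delete pod {name}")
        else:
            print(f"[raid] deleting pod {name}")
            v1.delete_namespaced_pod(name, NAMESPACE)

if __name__ == "__main__":
    # Default to dry-run; flip to False only after the one-click traffic
    # switching plan described above is in place.
    raid(dry_run=True)
```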
Epilogue
In the process of implementing fault drills in Alibaba's internal cloud-native field, we have analyzed more than 200 drill scenarios and run normalized fault drills at a frequency of 1,000+ per month, effectively discovering more than 90 problems and preventing their blast radius from expanding further. Through the construction, verification, and chaotic execution of drill pipelines, the system's alerting and plan-based recovery capabilities are checked regularly, and more than 50 newly introduced high-availability problems have been intercepted before launch. Surprise drills in the production environment are a difficult but powerful step: they have exercised the emergency response capabilities of R&D and operations personnel, tempered the system under real user scenarios, pushed forward improvements to the product system, and improved the stability and competitiveness of the cloud-native base.
Related Links
[1] Post-mortem report list on GitHub: https://github.com/danluu/post-mortems
Click here to go to the Chaos homepage for more details!