── The adoption of chaos engineering is not only the adoption of tools and methods, but also the adoption of a culture and a way of designing systems
This article aims to help you understand chaos engineering and improve the reliability of your business services, through a basic introduction and some of Agora's hands-on experience.
00 Preface
"What is Chaos Engineering? Sounds awesome."
"What's the difference between chaos engineering and our failure drill?"
"Are we going to pass the Chaos Test without problems?"
"We don't encounter this scenario very often, because customers don't do this."
The remarks above are common questions that come up whenever we push chaos engineering forward. So before discussing it further, we need to understand what "chaos engineering" actually is, what problems it is meant to solve, and how it helps ensure the stability of services.
In recent years, the scale and complexity of our software systems have kept growing, and the traditional large monolith can no longer keep up with today's pace of iteration and deployment, so modern software increasingly takes the form of distributed systems. Distributed systems solve the slow iteration, heavy technical debt, and painful deployment of the monolithic architecture, but they also bring new challenges. Google's Accelerate State of DevOps 2021 report [1] shows that more and more teams have moved to the cloud and are paying increasing attention to software delivery and operational performance. How to guarantee stability and high availability while a distributed system iterates rapidly has become both a hot topic and a hard problem in recent years.
The figure above shows a typical microservice system: a service structure with clear service boundaries and loose coupling. Traditional testing methods can indeed guarantee availability between services to some extent. Contract testing, for example, detects many problems early and keeps frequently iterated services compatible with one another, but it is essentially a consistency check between the service consumer and the provider, and its guarantees lie mostly in business logic. It still leaves the defining characteristics of microservice systems uncovered: high availability, service dependencies, and distributed consistency. Existing testing methods struggle to identify the strong and weak dependencies between services, and to verify that a high-availability strategy is complete. Chaos engineering gives us ideas and methods for tackling exactly these problems.
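To make the dependency question concrete, here is a minimal sketch (not Agora's actual tooling) of how a fault-injection experiment can classify a dependency as strong or weak: cut one dependency off, then check whether the caller's core path still succeeds. The endpoint and the block/unblock hooks are illustrative assumptions.

```python
# Hypothetical probe for strong vs. weak dependencies between services.
# Idea: block exactly one dependency, then check whether the caller's
# core path still succeeds (weak dependency) or fails outright (strong).
import urllib.request
import urllib.error

CORE_ENDPOINT = "http://caller.internal/api/checkout"  # illustrative URL

def core_path_ok() -> bool:
    """Return True if the caller's core API still answers successfully."""
    try:
        with urllib.request.urlopen(CORE_ENDPOINT, timeout=3) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def classify_dependency(block, unblock) -> str:
    """block/unblock: callables that cut and restore one dependency,
    e.g. by dropping its traffic with iptables."""
    block()
    try:
        degraded_ok = core_path_ok()
    finally:
        unblock()  # always restore the dependency
    return "weak (degrades gracefully)" if degraded_ok else "strong (hard failure)"
```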
01 Overview
So what is Chaos Engineering?
Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production. Chaos engineering is not testing: it is a practical discipline with clear inputs and outputs, used to observe a system's weak points so they can be improved.
Why do we need chaos engineering?
In the real world, faults are everywhere. We reviewed and tallied some of our application services internally and found results similar to those in the Chaos Engineering Laboratory's China Chaos Engineering Survey Report (2021) [5]. The Laboratory's results are quoted here:
From this we can see that change-induced failures are the main cause of major incidents, and that machines in production are never in a truly stable state. According to Heinrich's Law, behind every serious accident lie 29 minor accidents, 300 near misses, and some 1,000 hidden hazards. Used well, chaos engineering can greatly mitigate these problems. The figure below shows the results of Agora's own practice.
What is the difference between chaos engineering and our fault drills?
A fault drill can be regarded as one concrete practice of chaos engineering: it observes the stability of the system by injecting realistically possible faults into the target system, but the injected scenarios are relatively fixed and known in advance. Chaos engineering adds theoretical guidance on top of this and is also a practice of discovering new problems, for example by randomly restarting services in a given region.
02 How to Implement
Before starting with chaos engineering, we need to make sure the system already has some high-availability capability, i.e., it can keep working properly when parts of it misbehave. Following the basic principles laid out in the Principles of Chaos Engineering [2], we can run experiments as follows:
1. First, define a "steady state" in terms of some measurable output of the system under normal behavior.
2. Second, hypothesize that this steady state will persist in both the control group and the experimental group.
3. Then, introduce variables into the experimental group that reflect real-world events, such as server crashes, hard-disk failures, and severed network connections.
4. Finally, try to refute the steady-state hypothesis by looking for differences between the control group and the experimental group.
If the two groups turn out to be consistent, we can basically conclude that the system tolerates this fault; if they differ, we have found a weak point and can improve the system's stability in a targeted way.
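As a concrete illustration, the comparison might look like the sketch below, assuming the steady-state metric is a per-minute request success rate; the tolerance and the sample values are made-up, not recommended thresholds.

```python
# Sketch of the steady-state hypothesis check: compare the control group
# and the experimental group on one measurable output of the system.
from statistics import mean

def steady_state_holds(control, experiment, tolerance=0.01):
    """Accept the hypothesis if both groups share the same steady state."""
    return abs(mean(control) - mean(experiment)) <= tolerance

control_sli = [0.999, 0.998, 0.999, 0.999]     # success rate, no fault injected
experiment_sli = [0.998, 0.997, 0.999, 0.998]  # success rate while the fault runs

if steady_state_holds(control_sli, experiment_sli):
    print("Hypothesis holds: the system tolerated this fault.")
else:
    print("Hypothesis refuted: weak point found; harden and retest.")
```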
This includes two key points:
1. How to generate a fault
2. How to observe the fault
In articles on chaos-engineering practice, fault generation is what gets discussed most. Popular tools such as ChaosBlade [3] and Chaos Mesh [4] are readily available, and the choice of tool is inseparable from the actual business. According to Google's research [1], more and more companies are deploying in the cloud, and no single off-the-shelf tool covers every need, so meeting one's own business requirements can be a headache for some teams. In Agora's experience, some in-house development is inevitable, and for ease of use it is also worth building a platform that supports the full set of capabilities. Doing beats imagining: implement and verify a scenario first, then run experiments against the business, and optimize afterwards.
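As a flavor of what a small self-built injector can look like (a sketch, not Agora's platform), the following wraps the standard Linux `tc netem` facility to add network latency. It assumes a Linux host, root privileges, and an illustrative interface name.

```python
# Sketch of a minimal network-latency fault injector using Linux tc/netem.
import subprocess

def inject_latency(dev: str = "eth0", delay_ms: int = 100) -> None:
    """Add fixed egress latency on one network interface."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", dev, "root", "netem",
         "delay", f"{delay_ms}ms"],
        check=True,
    )

def clear_latency(dev: str = "eth0") -> None:
    """Remove the netem qdisc, restoring normal networking."""
    subprocess.run(["tc", "qdisc", "del", "dev", dev, "root"], check=True)
```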
A point rarely mentioned in most articles is how to observe the fault once it actually occurs, which in my view is the most important point of all. Nearly every company has a monitoring and alerting platform by now, yet there are still many cases where users discover a fault before our monitoring reacts, i.e., the monitoring system stays silent while the user reports the problem first. This is the biggest obstacle to chaos engineering: not being able to detect problems effectively. To address it, Agora's practice has distilled a few points for reference:
1. Build complete monitoring of all basic resources, and observe whether any of them are affected during an experiment. We once hit a kernel anomaly that caused slab memory leaks, a problem that could only be discovered through basic-resource monitoring.
2. Refine the service's SLIs (Service Level Indicators). Define SLIs for observation according to the characteristics of your business and the things customers care about; Netflix, for example, observes SPS (stream starts per second, i.e., the rate of play-button presses). An indicator need not be constant, and may follow a regular pattern of change, but it must be easy to measure and have a short statistical window. The harder an indicator is to measure, the fewer means we have to describe the state of the business, and the more it needs rethinking; the longer the statistical window, the easier it is to miss problems in between, as the sketch below illustrates.
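As a small illustration of the "short statistical window" point, this sketch computes a success-rate SLI per one-minute window, so a dip during an experiment is not averaged away. The event shape and the 99% target are assumptions for illustration.

```python
# Sketch of a short-window SLI: per-minute success ratios.
from collections import defaultdict

def sli_per_window(events, window_s: int = 60):
    """events: iterable of (timestamp_s, ok: bool). Returns {window: ratio}."""
    ok = defaultdict(int)
    total = defaultdict(int)
    for ts, success in events:
        w = int(ts) // window_s
        total[w] += 1
        ok[w] += int(success)
    return {w: ok[w] / total[w] for w in sorted(total)}

# Alert if any single window dips, even when the long-run average looks fine.
events = [(0, True), (30, True), (65, False), (70, True), (125, True)]
for window, ratio in sli_per_window(events).items():
    if ratio < 0.99:
        print(f"window {window}: SLI {ratio:.2%} below target, investigate")
```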
03 Evolution and Evaluation Criteria
The previous section introduced methods and ideas for chaos engineering; the next questions are how to evaluate the work once it is done, and how to keep moving forward. From a maturity perspective, we see adoption as falling roughly into the following stages:
1. Single-experiment stage: develop and verify failure scenarios on a single node, and run failure experiments against single nodes.
2. Tooling of fault injection: develop fault injection for the business and run experiments against it, with preliminary automation, tooling, and integration into CI/CD (a sketch follows this list).
3. Platform-based fault injection: run automated fault drills, with the scope gradually moving from the test environment to production, and with greatly improved ease of use.
4. Chaos value output: at this stage chaos engineering delivers value outward, for example letting customers use it to improve the stability of their own services, or combining it with AIOps and similar means for anomaly early warning and monitoring, moving continuously toward zero failures.
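For stage 2, a chaos experiment wired into CI can be as simple as an automated test: inject a fault, assert the steady state, always clean up. The sketch below reuses `inject_latency`/`clear_latency` and `steady_state_holds` from the earlier sketches; `collect_sli()` is a hypothetical metric collector, and the durations and thresholds are illustrative.

```python
# Sketch of a CI-run chaos test (discoverable by pytest or any runner).
def test_service_tolerates_100ms_latency():
    control = collect_sli(seconds=300)         # hypothetical baseline collection
    inject_latency(dev="eth0", delay_ms=100)   # start the experiment
    try:
        experiment = collect_sli(seconds=300)  # same metric while the fault runs
    finally:
        clear_latency(dev="eth0")              # always restore the system
    assert steady_state_holds(control, experiment), \
        "steady state refuted under 100 ms latency"
```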
All of the above must ultimately serve one goal: reducing online failures. If we cannot tell what our work is achieving and cannot keep improving it, we never truly close the loop, and in the end the job cannot be done well. Chaos engineering helps us reduce availability problems and uncover hidden business risks, but the number of hidden risks found is a poor measure of the work, because it is full of uncertainty. So we define evaluation criteria that fit our own situation:
Evaluation criteria
• User Scenarios
What chaos engineering ultimately serves is a stable and reliable user experience. While our chaos-engineering maturity is still low (we lack the confidence to experiment in production), the more faithfully online scenarios can be simulated in the test or staging environment, the more confident we are in guaranteeing online availability. User scenarios, moreover, are observable: covering more of them lets us discover more problems and strengthens the business's confidence. We therefore measure coverage as the proportion of online user scenarios that the test scenarios also exercise, the higher the better (see the sketch after this list).
• Chaos Scenarios
This indicator is fairly generic. In chaos engineering we always design controlled experiments, and there are established rules for how to design them. The overall set of chaos scenarios can therefore be estimated from common industry faults, risks specific to the business's characteristics, and the business's historical faults. The more scenarios we support, the more confident the business can be.
• Service Metrics
Service indicators can be drawn from SLOs (Service Level Objectives) and SLIs; at Agora we also use XLAs (eXperience Level Agreements). Chaos engineering needs to define business indicators together with the business, and these same indicators are what both online operations and chaos engineering need to observe. The more complete the indicators, the better our criteria, and the more confidence we have when testing the business.
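For the user-scenario coverage measure above, the computation itself is simple; the scenario names below are illustrative placeholders, not Agora's real taxonomy.

```python
# Coverage = share of observed online user scenarios that the chaos
# test suite also exercises; higher is better.
online_scenarios = {"join_channel", "publish_audio", "subscribe_video",
                    "reconnect", "switch_network"}
tested_scenarios = {"join_channel", "publish_audio", "reconnect"}

coverage = len(tested_scenarios & online_scenarios) / len(online_scenarios)
print(f"user-scenario coverage: {coverage:.0%}")  # 60% in this example
```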
04 Summary
Agora has invested in availability since its founding, and that investment has now grown into an internal standard and system (see the figure below).
Chaos engineering is not a panacea; it must be designed and implemented in light of the company's actual situation. Investing in availability is not only a matter for the testing or operations teams, but also one of process and design. And its adoption is not only the adoption of tools and methods, but also the adoption of a culture and a way of designing systems.
05 References
[1] Accelerate State of DevOps 2021
https://services.google.com/fh/files/misc/state-of-devops-2021.pdf
[2] PRINCIPLES OF CHAOS ENGINEERING
https://principlesofchaos.org/
[3] ChaosBlade
https://github.com/chaosblade-io/chaosblade
[4] Chaos Mesh
https://github.com/chaos-mesh/chaos-mesh
[5] Chaos Engineering Lab: China Chaos Engineering Survey Report (2021)
http://www.caict.ac.cn/kxyj/qwfb/ztbg/202111/P020211115608682270800.pdf
Introduction to the Dev for Dev column
Dev for Dev (Developer for Developer) is an innovation practice initiative for developers jointly launched by Agora and the RTC Developer Community. Through technology sharing, exchange, and collaborative project building from the engineer's perspective, it gathers the power of developers, mines and delivers the most valuable technical content and projects, and fully releases the creativity of technology.