This article first appeared in the Volcano Engine Developer Community.
Author: Shao Yuliang, head of System Governance on the ByteDance Infrastructure team.
This article introduces some of ByteDance's thinking and hands-on experience in building high availability. First, a brief introduction to what the System Governance team does. Within the Infrastructure organization, the System Governance team is responsible for the closed loop of ByteDance's R&D workflow: from service development and joint debugging under a large-scale microservice architecture, to release, and then, after a service goes online, microservice governance, traffic scheduling, and capacity analysis; finally, we use chaos engineering to help businesses improve their high-availability capabilities. Now to the topic. I will start with the background of chaos engineering at ByteDance. As everyone knows, ByteDance has many apps and, behind them, many services. These services can be roughly divided into three types:
- Online services: you can think of these as the backend services that support Douyin, Watermelon Video, and so on. They run on our large self-built PaaS clusters on Kubernetes and form a very large microservice architecture.
- Offline services: recommendation model training, big-data report computation, and similar workloads. They rely on large-scale storage and computing capabilities.
- Infrastructure: carries all of ByteDance's business lines in China and provides a set of PaaS capabilities, including compute and storage, to support a wide range of business scenarios.
Different types of services have different high-availability concerns. Here is a simple analysis:
- Online services: stateless services running in containers on Kubernetes, with their state kept in external MySQL and Redis. Stateless services are very easy to scale out and can tolerate faults fairly well, possibly by degrading.
- Offline services: stateful services that care a great deal about the state of their computation. Big-data computing jobs are characterized by long running times, and model training in particular runs for a very long time. They can tolerate some errors (if a job fails, it can be retried); their state consistency and data integrity depend mostly on the underlying storage systems. The high availability of offline services therefore relies largely on the high-availability capabilities provided by the infrastructure as a whole.
- Infrastructure: the infrastructure itself is stateful. It is a platform for large-scale storage and computing and may run into "gray swan" events such as network failures and disk failures. Here we care most about data consistency.
For these different types of services, the engineers responsible for high availability on the System Governance team proposed different solutions. Let me first introduce how chaos engineering evolved for online (stateless) services.
Chaos Engineering Evolution of Online Services
Chaos Engineering Platform 1.0 Architecture
We do not consider version 1.0 of our chaos engineering platform a true chaos engineering system; it was more of a fault injection system.
The figure above shows the architecture of platform 1.0. The platform gives users a visual interface for fault injection and some simple configuration. We installed an agent on the underlying physical machines; the agent runs on the host and can inject network-related faults between containers. For service steady state, users can register key metrics with the platform before a chaos drill and write a Bosun query against those metrics together with a threshold. The system polls the metrics to judge whether the service is in a stable state: if the threshold is crossed, we roll back the fault; otherwise the drill continues and we check whether the result matches expectations. Why can't this system be called a chaos engineering system? Netflix's Principles of Chaos (http://principlesofchaos.org/) define five principles:
- Build a hypothesis around steady-state behavior
- Vary real-world events
- Run experiments in production
- Automate experiments to run continuously
- Minimize the blast radius
Measured against these five principles, let's see why this platform was only a fault injection system.
- First, the steady-state model was relatively crude (see the sketch after this list).
- A real microservice architecture suffers a wide variety of faults, but this platform implemented only relatively simple fault injections such as latency and network disconnection.
- Drills in the production environment were something we could already do at the time.
- Because the steady-state model was crude, it was hard to truly assess whether the system was stable, so the system could not run experiments automatically.
- The system was not particularly good at declaring the scope of an experiment. In addition, the technical approach at the time was to inject faults on the physical host, which itself carried certain risks, so blast radius control was not particularly good.
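To make the first point concrete, here is a minimal sketch of the kind of threshold-based steady-state check platform 1.0 relied on. It is an illustration only, not the platform's actual code; `query_metric` and `rollback_fault` are hypothetical placeholders for the Bosun query and the fault-recovery call.

```python
import time

STEADY_STATE_THRESHOLD = 500.0   # e.g. p99 latency in ms, chosen by the user
POLL_INTERVAL_SECONDS = 10
DRILL_DURATION_SECONDS = 600


def query_metric() -> float:
    """Hypothetical stand-in for a Bosun query against the user's key metric."""
    raise NotImplementedError


def rollback_fault() -> None:
    """Hypothetical stand-in for the platform's fault-recovery call."""
    raise NotImplementedError


def run_drill() -> bool:
    """Poll one metric against a fixed threshold; abort the drill if it is crossed."""
    deadline = time.time() + DRILL_DURATION_SECONDS
    while time.time() < deadline:
        if query_metric() > STEADY_STATE_THRESHOLD:
            rollback_fault()          # steady state violated: stop the loss
            return False
        time.sleep(POLL_INTERVAL_SECONDS)
    return True                       # drill finished without crossing the threshold
```

A single metric compared against a fixed threshold is exactly why this steady-state definition is "crude": it says nothing about the overall health of the service.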
Chaos Engineering Platform 2.0 Architecture
In 2019 we began to evolve platform 1.0 into the next generation, hoping to build a system that truly conforms to the principles of chaos engineering. The result was platform 2.0, which we consider ByteDance's first chaos engineering system in the true sense.
Key upgrades in Chaos Engineering Platform 2.0:
- Architecture upgrade: Introduced a fault center layer, decoupling business logic and underlying fault injection.
- Fault injection: with the wider adoption of Service Mesh, faults related to network calls are now mostly implemented in the sidecar.
- Stability model: at this stage we also built a steady-state system that computes steady state from key service metrics using machine learning and other algorithms. We care a great deal about this steady-state system, because a truly automated drill must not require manual intervention, so we need a system that can recognize whether the service being drilled is stable. If all the system sees is a pile of metrics, it is hard to judge stability directly, so we aggregate those metrics into a single percentile-style score through specific algorithms; if the score reaches, say, 90, we consider the service stable (a minimal sketch of this idea follows the list). Later I will describe the algorithmic investment we made in this steady-state system.
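The following is a minimal, hypothetical sketch of the aggregation idea: normalize how far each key metric deviates from its baseline and fold the results into a single 0-100 health score. The metric names, baselines, and weights are illustrative assumptions, not the platform's actual model.

```python
from dataclasses import dataclass


@dataclass
class MetricReading:
    name: str
    value: float
    baseline: float       # expected value under normal traffic
    tolerance: float      # deviation that still counts as fully healthy
    weight: float = 1.0


def health_score(readings: list[MetricReading]) -> float:
    """Aggregate per-metric deviations into a single 0-100 steady-state score."""
    total_weight = sum(r.weight for r in readings)
    score = 0.0
    for r in readings:
        deviation = abs(r.value - r.baseline) / max(r.tolerance, 1e-9)
        # 1.0 while within tolerance, decaying toward 0 as the deviation grows
        metric_health = 1.0 / max(deviation, 1.0)
        score += r.weight * metric_health
    return 100.0 * score / total_weight


readings = [
    MetricReading("p99_latency_ms", value=180, baseline=150, tolerance=50, weight=2),
    MetricReading("error_rate", value=0.004, baseline=0.001, tolerance=0.005),
]
stable = health_score(readings) >= 90   # e.g. treat a score of 90 or more as "steady"
```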
Fault Center Architecture
Our fault center borrows its architecture from Kubernetes.
Platform 1.0 had a problem: suppose a latency fault is successfully injected into a pod on Kubernetes through the agent, but Kubernetes itself has flexible scheduling. If the service happens to crash during the drill, Kubernetes will automatically start the pod on another machine. In that case you believe the fault drill succeeded, when in fact it did not; a new, fault-free instance was simply started. The fault center can keep injecting the fault even when the container drifts. To achieve this we expose a set of declarative APIs: instead of declaring which faults to inject, you describe a desired state of the system. For example, if you declare that the network between A and B is disconnected, the fault center must guarantee that A and B stay disconnected no matter what state they are in. Second, the whole system borrows from the Kubernetes architecture and has a rich set of controllers that provide different fault injection capabilities at the bottom. To meet business needs quickly, the controllers can integrate open-source projects such as Chaos Mesh and Chaos Blade, and we have also built some native controllers, such as a service mesh controller, an agent controller, and a service discovery controller.
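A minimal sketch of the reconcile-style, declarative behavior described above follows. The function names (`is_fault_applied`, `apply_fault`) and the loop shape are assumptions used to illustrate the pattern, not the fault center's actual code.

```python
import time
from typing import Callable


def reconcile(desired_fault: dict,
              is_fault_applied: Callable[[dict], bool],
              apply_fault: Callable[[dict], None],
              should_stop: Callable[[], bool],
              interval_seconds: float = 5.0) -> None:
    """Continuously drive the live system toward the declared fault state.

    Like a Kubernetes controller, this loop does not inject the fault once and
    forget it; it re-applies the fault whenever drift (for example, a pod being
    rescheduled onto another node) undoes it, until the drill is stopped.
    """
    while not should_stop():
        if not is_fault_applied(desired_fault):
            apply_fault(desired_fault)   # e.g. re-disconnect A from B on the new pod
        time.sleep(interval_seconds)
```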
Blast radius control
As mentioned earlier, the fault center injects faults through a declarative API, so we need to define the fault injection model.
As shown in the figure:
- Target: the target service into which the fault is injected.
- Scope Filter: an important part of blast radius control is letting the business declare the scope of the drill, which we call the Scope Filter. Through the Scope Filter, the target of fault injection can be narrowed to a data center, a cluster, an availability zone, or even down to the instance level or the traffic level.
- Dependency: the sources of all abnormalities that may affect the service itself, including middleware, a specific downstream service, and the CPU, disk, network, and other resources the service depends on.
- Action: the failure event, that is, what kind of failure occurs: for example, a downstream service returning rejections or dropping packets, a disk write exception, CPU contention, and so on.
Therefore, when declaring a fault to the fault center, all of the above must be described, indicating what fault state the business wants the system to be in (a sketch follows).
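A minimal sketch of what such a declarative fault specification might look like in code. The field names and values are illustrative assumptions based on the model above, not the platform's real API.

```python
from dataclasses import dataclass, field


@dataclass
class ScopeFilter:
    """Limits the blast radius of the declared fault."""
    data_centers: list[str] = field(default_factory=list)
    clusters: list[str] = field(default_factory=list)
    instances: list[str] = field(default_factory=list)
    traffic_percent: float = 100.0      # inject on only a slice of traffic


@dataclass
class FaultSpec:
    target: str          # service into which the fault is injected
    scope: ScopeFilter   # where the fault applies, and for how much traffic
    dependency: str      # source of the abnormality: middleware, downstream, CPU, ...
    action: str          # what happens: reject, packet loss, disk write error, ...


spec = FaultSpec(
    target="order-service",
    scope=ScopeFilter(clusters=["cluster-a"], traffic_percent=5.0),
    dependency="downstream:payment-service",
    action="reject",
)
```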
Steady-state system
The steady-state system involves some algorithmic work, mainly in three scenarios:
- Dynamic analysis of time series: what we call the steady-state algorithm, which tries to determine whether a service is stable. It uses algorithms such as threshold detection, the 3-sigma rule, and sparse rules.
- A/B comparative steady-state analysis: borrowed from the Mann-Whitney U test used by Netflix; related papers and articles are available for further reading.
- Detection mechanism: an index-fluctuation consistency detection algorithm used to analyze strong and weak dependencies.
Through these algorithms (and others), the steady-state system can describe the stability of a system well (a small sketch follows).
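For illustration, here is a minimal sketch of two of the checks mentioned above: a 3-sigma threshold test and a Mann-Whitney U comparison between an experiment group and a control group, using scipy. The data and significance level are made-up examples, not the platform's configuration.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def three_sigma_stable(samples: np.ndarray, latest: float) -> bool:
    """Treat the latest observation as anomalous if it falls outside mean +/- 3*std."""
    mean, std = samples.mean(), samples.std()
    return abs(latest - mean) <= 3 * std


def ab_steady(control: np.ndarray, experiment: np.ndarray, alpha: float = 0.05) -> bool:
    """Mann-Whitney U test: is the experiment group's metric distribution
    significantly different from the control group's?"""
    _, p_value = mannwhitneyu(control, experiment, alternative="two-sided")
    return p_value >= alpha            # not significantly different -> still steady


rng = np.random.default_rng(0)
baseline = rng.normal(150, 10, size=300)   # e.g. p99 latency of the control group
drilled = rng.normal(152, 10, size=300)    # latency while the fault is injected
print(three_sigma_stable(baseline, latest=drilled[-1]), ab_steady(baseline, drilled))
```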
Automated drills
We define an automated drill as one that requires no manual intervention at all: the system injects the fault, analyzes the stability of the service during injection and throughout the drill, and can stop the loss or produce a conclusion at any time. We now have the following prerequisites for automated drills:
- We can clearly define the goal of the scenario being drilled;
- Through the steady-state system, we can automatically judge the steady-state hypothesis;
- The scope of a chaos drill can be controlled through the declarative API and the Scope Filter, so production loss during an experiment is very small.
The main application of automated drills today is strong/weak dependency analysis (a sketch follows this list), including:
- whether the actual strong and weak dependencies are consistent with how the business has labeled them;
- whether a timeout in a weak dependency can bring down the whole call chain.
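Below is a minimal, hypothetical sketch of such an automated strong/weak dependency drill: inject a fault into each declared dependency in turn and compare the observed impact with the label. The helper names (`inject_fault`, `remove_fault`, `measure_health_score`) are placeholders, not real platform APIs.

```python
from dataclasses import dataclass


@dataclass
class Dependency:
    name: str
    labeled_weak: bool           # what the business claims about this dependency


def drill_dependencies(deps: list[Dependency],
                       inject_fault,          # callable: name -> None
                       remove_fault,          # callable: name -> None
                       measure_health_score,  # callable: () -> float in [0, 100]
                       weak_threshold: float = 90.0) -> list[str]:
    """Return the dependencies whose observed strength disagrees with their label."""
    mismatches = []
    for dep in deps:
        inject_fault(dep.name)                   # e.g. reject or time out all calls
        try:
            observed_weak = measure_health_score() >= weak_threshold
        finally:
            remove_fault(dep.name)               # always stop the loss
        if observed_weak != dep.labeled_weak:
            mismatches.append(dep.name)
    return mismatches
```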
Summary
Now let's review why we consider version 2.0 of the platform a chaos engineering system, measured against the five principles above:
- Build a hypothesis around steady-state behavior: steady-state hypotheses are now driven by the steady-state system.
- Vary real-world events: fault layering is now more reasonable, with a large number of middleware and low-level faults added.
- Run experiments in production: this was already possible in the 1.0 era and was extended in 2.0 to support drills in the production, pre-release, and local test environments.
- Automate experiments to run continuously: we provide CSV, SDK, and API capabilities so business lines can integrate drills into whichever stage of their release process they want, and API capabilities to inject faults in whatever environment they need.
- Minimize the blast radius: one of the reasons for providing a declarative API is precisely to control the blast radius.
Infrastructure chaos platform: supporting drills for the underlying systems
As mentioned earlier, offline services rely heavily on the consistency of underlying state, so if the storage and computing in the infrastructure are solid, the business on top can be well supported. We use a separate infrastructure chaos platform to run these internal experiments. For chaos engineering on infrastructure, we have to break some of the standard principles of chaos engineering.
- First, chaos engineering for infrastructure is not suitable for the production environment, because the faults are injected at the lowest layers, the potential impact is very large, and the blast radius is hard to control.
- For automated drills, the teams involved need more flexible capabilities to connect to their own CI/CD, as well as more complex scheduling.
- In the steady-state model, besides stability we pay more attention to consistency.
To support chaos engineering in an offline environment, the infrastructure chaos platform gives us a safe environment in which we can inject more kinds of faults: system resource faults such as CPU, memory, and file system; network faults such as rejection and packet loss; and other faults including clock jumps, process kills, code-level exceptions, and hooked method errors at the file system level. For the scheduling of automated drills, we hope to give users more flexible orchestration capabilities through this platform, for example (see the sketch after this list):
- Serial and parallel task execution
- Pause at any time and resume from a breakpoint
- Identification of infrastructure master and slave nodes
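A minimal sketch of the serial/parallel orchestration idea with a pause point, using asyncio. The task names are hypothetical, and this only illustrates the scheduling capability, not the platform's actual engine.

```python
import asyncio


async def run_step(name: str) -> None:
    print(f"running {name}")
    await asyncio.sleep(0.1)           # stands in for a real fault-injection task


async def run_plan(pause: asyncio.Event) -> None:
    # Stage 1: two injections that are safe to run in parallel.
    await asyncio.gather(run_step("inject-cpu-pressure"), run_step("inject-packet-loss"))
    await pause.wait()                 # pause point: drill resumes when the event is set
    # Stage 2: a step that must run serially after stage 1.
    await run_step("kill-master-node")


async def main() -> None:
    pause = asyncio.Event()
    plan = asyncio.create_task(run_plan(pause))
    await asyncio.sleep(0.5)           # ...operator reviews steady state here...
    pause.set()                        # resume from the "breakpoint"
    await plan


asyncio.run(main())
```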
We also provide plug-in capabilities so that component teams can inject faults more flexibly. Some teams have already buried hooks in their own systems; they want the platform to drive those hooks to inject faults directly, while still reusing our orchestration and platform layers. With the hook mechanism, a team only needs to implement the corresponding hook to inject its specific faults and can then continue to use our entire orchestration system and platform (a sketch follows).
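The plug-in idea might look roughly like the following; the interface name and methods are assumptions for illustration, not the platform's real extension API.

```python
from abc import ABC, abstractmethod


class FaultHook(ABC):
    """Contract a component team implements so the platform can drive its own faults."""

    @abstractmethod
    def inject(self, params: dict) -> None:
        """Trigger the component-specific fault (e.g. flip an internal error switch)."""

    @abstractmethod
    def recover(self) -> None:
        """Undo the fault so the drill can stop the loss at any time."""


class StorageEngineHook(FaultHook):
    """Example: a storage team exposes a hook that makes flush operations fail."""

    def inject(self, params: dict) -> None:
        print(f"enabling flush-failure mode with {params}")

    def recover(self) -> None:
        print("disabling flush-failure mode")
```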
Infrastructure chaos platform architecture diagram
From chaos engineering to system high-availability construction
When we first started with chaos engineering, the team's mission was to land chaos engineering at ByteDance. But when we built capabilities and went looking for business lines to use them, we found the business lines had no demand for them. After much reflection, we adjusted the team's mission: to help businesses advance high-availability construction through chaos engineering and other means. After the adjustment, we went from studying how the industry was developing chaos engineering to understanding high availability from the business's point of view. So how do we help businesses build high availability?
What is high availability
We use the following formula to understand high availability; its terms are:
- MTTR: Mean Time To Repair
- MTBF: Mean Time Between Failures
- N: number of incidents
- S: scope of impact
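The formula itself appears as an image in the original article. A form consistent with the definitions above and with the discussion that follows (a reconstruction and an assumption, not necessarily the exact slide) is:

$$A = 1 - \frac{\sum_{i=1}^{N} \mathrm{MTTR}_i \times S_i}{T_{\text{total}}}, \qquad T_{\text{total}} \approx N \times (\mathrm{MTBF} + \mathrm{MTTR})$$

When every failure affects the whole system ($S_i = 1$), this reduces to the familiar $A = \mathrm{MTBF} / (\mathrm{MTBF} + \mathrm{MTTR})$. Either way, A grows as MTTR, N, and S shrink and as MTBF grows, which is exactly the point of the discussion below.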
The value of this formula is obviously less than 1; this is where the familiar "three nines" or "five nines" comes from. To make the value of A large enough, you need:
- MTTR, N, and S to be small enough, i.e., reduce the time to repair, reduce the number of incidents, and reduce the scope of each failure.
- MTBF to be large, i.e., make the interval between two failures as long as possible.
How to reduce MTTR, N, S?
Reduce the scope of fault impact (S)
When a failure occurs in a production architecture, the scope of its impact can be reduced with some design approaches on the architecture side:
- Unitized design: user request isolation
- Multi-datacenter (multi-room) deployment: system resource isolation
- Independent deployment of core business: business function isolation
- Asynchronous processing
Here, what Chaos Engineering can do is to help the SRE team verify whether these architectural designs meet expectations.
Reduce the number of failures (N)
Here we need to redefine failure. Faults (Failures) are unavoidable; what the architecture of a software system should do is prevent a Failure from turning into an Error. How do we reduce the conversion rate from Failure to Error? The most important thing is to strengthen the fault tolerance of the system, including:
- Deployment: multi-site active-active, flexible traffic scheduling, full service deployment, and contingency plan management;
- Service governance: timeout configuration and circuit breaking to fail fast.
Here, the role of chaos engineering is to help verify the fault tolerance of the system (a sketch of the timeout/circuit-breaking idea follows).
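As an illustration of the timeout-plus-circuit-breaking idea above, here is a minimal sketch; it is not ByteDance's actual governance framework, and the thresholds are arbitrary examples.

```python
import time


class CircuitBreaker:
    """Fail fast once a downstream has produced too many consecutive errors."""

    def __init__(self, failure_threshold: int = 5, reset_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_seconds:
                raise RuntimeError("circuit open: failing fast")   # Error is contained
            self.opened_at = None                                  # half-open: try again
            self.failures = 0
        try:
            result = fn(*args, **kwargs)      # fn should enforce its own timeout
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # trip the breaker
            raise
        self.failures = 0
        return result
```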
Reduce mean time to repair (MTTR)
The figure above shows some of the factors that make up MTTR: failure notification, diagnosis, repair, testing, and the time needed to go back online. To reduce MTTR, design measures can be applied to each of these factors:
- Adequate monitoring and alert coverage; business teams need to be pushed to manage their alerts.
- Ensure the accuracy of alerts while keeping coverage complete.
- Efficient fault localization and stronger troubleshooting capabilities. We are currently cooperating with the internal AIOps team on smarter fault analysis to reduce diagnosis time.
- Fast stop-loss plans. From repair to testing to the final rollout, a contingency plan system is needed: prepare a plan library keyed by the diagnosed fault characteristics, so that an accurate recovery plan can be selected and executed with one click (see the sketch after this list).
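A minimal, hypothetical sketch of the plan-library idea: map diagnosed fault signatures to pre-approved recovery plans so the on-call can execute one with a single action. The signatures and plan names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RecoveryPlan:
    name: str
    execute: Callable[[], None]        # the one-click action


PLAN_LIBRARY: dict[str, RecoveryPlan] = {
    "downstream_timeout": RecoveryPlan("degrade-noncritical-calls",
                                       lambda: print("switching to cached responses")),
    "single_dc_failure": RecoveryPlan("shift-traffic-to-healthy-dc",
                                      lambda: print("rescheduling traffic")),
}


def recover(diagnosed_signature: str) -> None:
    """Pick the matching plan from the library and run it, or escalate if none matches."""
    plan = PLAN_LIBRARY.get(diagnosed_signature)
    if plan is None:
        raise LookupError(f"no plan for {diagnosed_signature}; escalate to on-call")
    print(f"executing plan: {plan.name}")
    plan.execute()


recover("single_dc_failure")
```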
What chaos engineering can do here is run emergency response drills. A drill exercises not only the system but also the emergency response capability of everyone in the organization, so that when an incident occurs the team has a standard workflow to discover, locate, and resolve the problem. That is what the chaos engineering system hopes to achieve through these drills.
Follow-up planning
Finally, here is our follow-up planning for high availability and chaos engineering, in three main areas:
- Refinement of fault injection capabilities
  - Layered construction of faults for different kinds of systems
  - Richer fault capabilities at every layer
- Richer usage scenarios for chaos engineering
  - Continue to explore automation scenarios
  - Lower the cost of onboarding and everyday use, building a lightweight platform
- Expanding the scope of chaos engineering
  - Return to the availability perspective and continue to explore the relationship between chaos engineering and high availability
  - Establish a fault budget mechanism to predict and quantify fault losses, thereby informing decisions about investment in chaos engineering