This article is compiled from a talk given by Wu Zhaojun, a senior engineer at Tencent Interactive Entertainment, at a PingCAP Infra Meetup.

This article first describes the complex technical scenarios that Tencent Interactive Entertainment faces, then introduces the cloud-native chaos engineering platform that its Chaos Engineering team built on top of Chaos Mesh, and finally shares the benefits of Tencent Interactive Entertainment's chaos engineering practice over the past six months.

Tencent Interactive Entertainment's operations activities receive more than 10 billion visits per day, with peak QPS above 1 million. Activity code is released and updated more than 500 times a day, and the data volume exceeds 200 TB. Faced with this many user requests and such a fast release cadence, how can services be kept running both quickly and stably? The answer from the Tencent Interactive Entertainment activity operations team is DevOps and cloud native.

In the past, activity releases were carried out by operations staff. As the number of activities grew rapidly, this became an obvious bottleneck. To solve the problem, Tencent Interactive Entertainment built a pipeline from code to the production environment. Now, once an activity developer pushes code to the repository and the commit triggers the pipeline, the operating platform automatically compiles the code, builds the image, and deploys the image to Tencent Cloud TKE. It takes only 5 minutes to go from finished code to a production release, and developers complete the whole process themselves.

Today, essentially all services for Tencent Interactive Entertainment's operations activities run on Tencent Cloud TKE. Thanks to cloud native technology, elastic scaling of services, both scaling out and scaling in, is very fast: a service can grow from a single replica to a hundred replicas in a few minutes.

To iterate more quickly, development teams split a large, hard-to-maintain service into many small services that run independently. Each small service has little code and relatively simple logic, so the cost of handing it over and learning it is relatively low. Organizing systems as microservices has gradually become the general trend, but as the number of small services grows, the call relationships between them become more and more complicated. This brings a new problem: an abnormality in one small service may bring down the entire call chain and set off a chain reaction.

Different developers handle fault tolerance differently. Some services have very good fault tolerance and degrade gracefully, but others do not. In addition, alerts are sometimes not timely and fault-location tools are incomplete, which makes failures troublesome to handle.

How can this problem be solved? The answer given by the Tencent Interactive Entertainment Chaos Engineering team is to run PingCAP's open source Chaos Mesh on Tencent Cloud TKE, to address the current problems of frequent service failures and difficult quality control.

Chaos engineering is a concept put forward by Netflix 10 years ago. Gartner predicts that by 2023, 40% of organizations will practice chaos engineering as part of DevOps, which will reduce downtime by 20%.

The industry defines chaos engineering as actively injecting faults in order to discover potential problems in advance, iteratively improve the architecture and the operations practices, and ultimately achieve business resilience. Put in more familiar terms, it is a fault drill.

Almost every technical team runs fault drills. Before a service goes online, the team tests whether the active/standby switchover takes effect and whether the disaster tolerance capability is adequate. For example, a Master node is shut down to see whether the service automatically switches to the Slave node. These are in fact early forms of chaos engineering.

Chaos engineering is a kind of fault drill, but a fault drill is not the same as chaos engineering. Chaos engineering is a new discipline that extends fault drills. The difference shows mainly in the emergence of professional chaos engineering tools, such as PingCAP's open source Chaos Mesh, and in the establishment of a related body of theory.

A year ago, Tencent Interactive Entertainment officially launched its chaos engineering project. How should chaos engineering be done in a K8s environment? During tool selection and comparison at that time, the team found that Chaos Mesh had many advantages over other products. It offers clearly more features than the alternatives, its code is open source, and it requires no additional development to adapt. It is also a CNCF project, and it has a clear advantage in how quickly it is updated and iterated.

To make full use of Chaos Mesh, the Tencent Interactive Entertainment Chaos Engineering team integrated it into the existing operating platform. Chaos Mesh is deployed in each TKE cluster, and experiments are created, executed, and destroyed through the dashboard API that Chaos Mesh provides. At the same time, the effect of each experiment is observed using the existing operating platform's monitoring capabilities, and permissions are also integrated with the existing operating platform.

In the architecture, Chaos Mesh is the core engine of Tencent Interactive Entertainment's entire chaos engineering system. It provides the most basic fault injection capabilities, including pod, container, network, and IO fault injection. On top of this, the team built a complete set of chaos engineering capabilities, including red-blue confrontation, fault drills, fault orchestration, and fault observation.
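As an illustration of what a basic pod fault looks like, the following minimal sketch builds a Chaos Mesh `PodChaos` manifest as a Python dictionary. The experiment name, namespaces, and labels are hypothetical placeholders, and how the talk's platform actually submits experiments (it uses the dashboard API) is not shown here.

```python
import json


def pod_kill_manifest(name, chaos_namespace, target_namespace, labels):
    """Build a PodChaos manifest that kills one pod matching the selector.

    All argument values are illustrative; they are not taken from the talk.
    """
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": chaos_namespace},
        "spec": {
            "action": "pod-kill",   # kill the selected pod
            "mode": "one",          # pick a single matching pod
            "selector": {
                "namespaces": [target_namespace],
                "labelSelectors": labels,
            },
        },
    }


if __name__ == "__main__":
    manifest = pod_kill_manifest(
        "demo-pod-kill", "chaos-testing", "activity", {"app": "demo-service"}
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be applied to a cluster with standard Kubernetes tooling; a platform like the one described would generate such objects from its own UI.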

While using Chaos Mesh, the Tencent Interactive Entertainment Chaos Engineering team also interacted a great deal with the community. Wu Zhaojun mentioned that they gave the Chaos Mesh community a lot of feedback and feature requests. Reported bugs were fixed quickly, and many of the requests appeared in the next release, which impressed them. When they first started with chaos engineering, some of the Chaos Mesh documentation was incomplete, and using it sometimes involved guesswork. In the current version the documentation is rich and comprehensive, and they feel the project has made great progress in this area.

The industry has a relatively complete body of theory for implementing chaos engineering, which Wu Zhaojun summarized as five stages. First, there must be a convenient, easy-to-use experiment platform that can orchestrate, issue, and execute experiments. Second, the platform must be able to apply effective risk control to planned experiments. Third, steady-state metrics must be observable in real time while an experiment runs. Fourth, problems found by an experiment need follow-up optimization. Fifth, once a problem has been fixed, the fix must be verified promptly. Together these form a complete iterative closed loop. The Tencent Interactive Entertainment Chaos Engineering team built its chaos experiment platform on this theory.

Here is an example of running an experiment on the Tencent Interactive Entertainment team's internal platform. This is the experiment orchestration stage. Orchestration supports both parallel and serial execution. All the faults that need to be executed can be arranged in a single experiment, and multiple services can run chaos experiments in parallel.
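The serial and parallel orchestration described above can be sketched as a small tree of steps. This is only an assumed model for illustration: the `Fault`, `Serial`, and `Parallel` types and the `schedule` function are hypothetical names, not the platform's actual data structures.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Fault:
    name: str
    duration_s: int


@dataclass
class Serial:
    steps: List["Node"]


@dataclass
class Parallel:
    steps: List["Node"]


Node = Union[Fault, Serial, Parallel]


def schedule(node: Node, start: int = 0) -> Tuple[List[Tuple[int, int, str]], int]:
    """Return (timeline, end_time); timeline holds (start, end, fault name)."""
    if isinstance(node, Fault):
        end = start + node.duration_s
        return [(start, end, node.name)], end
    timeline: List[Tuple[int, int, str]] = []
    if isinstance(node, Serial):
        cursor = start
        for step in node.steps:          # each step begins when the previous ends
            part, cursor = schedule(step, cursor)
            timeline.extend(part)
        return timeline, cursor
    end = start
    for step in node.steps:              # Parallel: all steps begin together
        part, step_end = schedule(step, start)
        timeline.extend(part)
        end = max(end, step_end)
    return timeline, end


if __name__ == "__main__":
    plan = Serial([
        Fault("cpu-stress", 60),
        Parallel([Fault("network-delay", 30), Fault("packet-loss", 45)]),
    ])
    for entry in schedule(plan)[0]:
        print(entry)
```

Here a CPU fault runs first, then a delay fault and a packet loss fault start together; the experiment ends when the longest branch finishes.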

Wu Zhaojun cited an experiment that Tencent Interactive Entertainment runs often: testing service performance under high CPU load. The team first orchestrates the experiment, then issues it for execution and observes how the related services behave. The operating platform immediately shows the service curves, including business metrics such as QPS, latency, and response success rate. After the experiment finishes, the platform can also output an experiment report that judges whether the results met expectations and states the conclusions.
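A report of this kind comes down to comparing business metrics during the experiment with a baseline. The sketch below is one possible way to do that. The function name, the metric keys, and the three thresholds are assumptions chosen for illustration, not values from the talk.

```python
from statistics import mean


def steady_state_report(baseline, experiment,
                        min_qps_ratio=0.9,
                        max_latency_ratio=1.5,
                        min_success_rate=0.99):
    """Compare experiment samples with baseline samples.

    Both inputs are dicts with lists under "qps", "latency_ms",
    and "success_rate". Thresholds are hypothetical defaults.
    """
    qps_ratio = mean(experiment["qps"]) / mean(baseline["qps"])
    latency_ratio = mean(experiment["latency_ms"]) / mean(baseline["latency_ms"])
    success_rate = mean(experiment["success_rate"])
    checks = {
        "qps": qps_ratio >= min_qps_ratio,
        "latency": latency_ratio <= max_latency_ratio,
        "success_rate": success_rate >= min_success_rate,
    }
    return {"checks": checks, "passed": all(checks.values())}


if __name__ == "__main__":
    baseline = {"qps": [1000, 1000], "latency_ms": [20, 20], "success_rate": [1.0, 1.0]}
    under_cpu_load = {"qps": [950, 970], "latency_ms": [24, 28], "success_rate": [1.0, 0.995]}
    print(steady_state_report(baseline, under_cpu_load))
```

In practice the samples would come from the monitoring system, and the acceptable thresholds would be agreed per service before the experiment starts.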

Within Tencent Interactive Entertainment, some business teams asked to be able to run a set of chaos experiments after each version update. The Chaos Engineering team therefore integrated Chaos Mesh into the release pipeline. A chaos drill stage can be inserted when a user orchestrates a release, so the results of the chaos experiments can be seen directly with each release.

Sometimes a chaos drill needs to target a single account, and to control the blast radius effectively, only that account may be affected. Tencent Interactive Entertainment's approach is to intercept traffic at the gateway layer and issue experiments there. A delay fault can be issued for a specific account, and the behavior of that account can then be observed, which gives fine-grained control over the experiment.
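The idea of limiting a delay fault to one account at the gateway can be sketched as a wrapper around a request handler. Everything here is hypothetical: `DelayRule`, `with_account_delay`, and the handler signature are illustrative names, and the real gateway's mechanism is not described in the talk.

```python
import time
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class DelayRule:
    accounts: FrozenSet[str]   # accounts inside the blast radius
    delay_s: float             # delay injected before forwarding


def with_account_delay(handler: Callable, rule: DelayRule,
                       sleep: Callable[[float], None] = time.sleep) -> Callable:
    """Wrap a gateway handler so only targeted accounts are delayed."""
    def wrapped(account_id: str, request: dict):
        if account_id in rule.accounts:
            sleep(rule.delay_s)            # fault applies to this account only
        return handler(account_id, request)
    return wrapped


if __name__ == "__main__":
    def backend(account_id, request):
        return f"ok:{account_id}"

    rule = DelayRule(accounts=frozenset({"test-account"}), delay_s=0.2)
    gateway = with_account_delay(backend, rule)
    print(gateway("test-account", {}))     # delayed by 0.2 seconds
    print(gateway("other-account", {}))    # forwarded immediately
```

Passing `sleep` as a parameter keeps the wrapper easy to test without real waiting.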

In addition, the Chaos Engineering team found that even with a convenient chaos experiment platform, asking developers to attack their own services, like fighting your left hand with your right, is dull and hard to sustain. To make chaos engineering stick, Tencent Interactive Entertainment designed a red-blue confrontation game. Operations engineers frequently pick certain services and launch chaos experiments against them, to test whether the developers' services are fault tolerant, and then publish the drill results. To avoid having weaknesses in their services made public, developers actively run chaos experiments in advance and fix hidden problems early, which creates a virtuous circle.

In a microservices setting, it is very important to sort out the dependencies between services. A non-core service must not be able to bring down the main service. Chaos engineering makes it easy to check strong and weak dependencies: inject a fault into the called service and observe how the calling service behaves, and the strength of the dependency becomes directly visible. For example, actively inject a delay of three to five seconds or more into the called service, then observe the QPS or latency jitter of the calling service. If the calling service jitters, the dependency between them is relatively strong, and optimizations can be made according to the scenario and the specific situation.
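The strong versus weak judgment above can be written as a simple comparison of the calling service's metrics before and during the fault. The function name and the 10% tolerance are assumptions made for this sketch; the talk does not give a numeric threshold.

```python
def classify_dependency(baseline_qps, fault_qps,
                        baseline_latency_ms, fault_latency_ms,
                        tolerance=0.1):
    """Classify the caller's dependency on a callee that had a fault injected.

    Returns "strong" if the caller's QPS drops or its latency rises by more
    than `tolerance` (a hypothetical 10% by default), otherwise "weak".
    """
    qps_drop = (baseline_qps - fault_qps) / baseline_qps
    latency_rise = (fault_latency_ms - baseline_latency_ms) / baseline_latency_ms
    if qps_drop > tolerance or latency_rise > tolerance:
        return "strong"
    return "weak"


if __name__ == "__main__":
    # Caller stalls while the callee is delayed by 3 seconds.
    print(classify_dependency(1000, 400, 20, 3020))
    # Caller degrades gracefully and barely notices the fault.
    print(classify_dependency(1000, 990, 20, 21))
```

A "strong" result on a non-core callee is the signal that a timeout, fallback, or asynchronous call may be worth adding.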

At the same time, Tencent Interactive Entertainment is also using chaos engineering to train fault diagnosis robots. As services become more complex, the probability of failure grows. What Tencent Interactive Entertainment wants to do now is to run large-scale drills through chaos engineering, on the live network or in a specific environment, in order to train a fault diagnosis model that helps locate faults.

It has been about half a year since Tencent Interactive Entertainment began implementing cloud-native chaos engineering, and by now it has been rolled out in almost all teams within Tencent Interactive Entertainment. At present, Tencent Interactive Entertainment conducts an average of more than 150 chaos drills per week, more than 100 problems have been discovered in advance, and the total number of drills initiated every week exceeds 50.

Generally speaking, fault drills require handwritten scripts, for example for a 5% network packet loss drill. Someone familiar with the task may write the script quickly, but someone unfamiliar with it may spend a lot of time debugging. This is where chaos engineering shows its advantage: the user only needs to arrange the faults on the platform with simple drag-and-drop, with no scripts to write or debug, and can then issue the experiment and observe its effects in real time. The barrier to running fault drills is lowered, and efficiency is greatly improved.
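For comparison, the 5% packet loss drill mentioned above can be expressed declaratively as a Chaos Mesh `NetworkChaos` object instead of a handwritten script. The sketch below builds such a manifest in Python. The experiment name, namespace, labels, and the 60 second duration are hypothetical values, and a platform with drag-and-drop orchestration would presumably generate something equivalent behind the scenes.

```python
import json


def packet_loss_manifest(name, namespace, labels, loss_percent=5, duration="60s"):
    """Build a NetworkChaos manifest that drops a percentage of packets.

    Argument values are illustrative; they are not taken from the talk.
    """
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "loss",                  # packet loss fault
            "mode": "all",                     # apply to every matching pod
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": labels,
            },
            "loss": {"loss": str(loss_percent), "correlation": "0"},
            "duration": duration,
        },
    }


if __name__ == "__main__":
    manifest = packet_loss_manifest("demo-loss", "activity", {"app": "demo-service"})
    print(json.dumps(manifest, indent=2))
```

The fault is removed automatically when the duration expires, which is one of the cleanup steps a handwritten script would have to handle itself.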

According to Tencent Interactive Entertainment's statistics, chaos engineering is at least 10 times more efficient than traditional fault drills. This is the biggest benefit of chaos engineering.

