On July 27, the 2021 Trusted Cloud Conference was held in Beijing. At the conference, Alibaba Cloud's fault drill platform was selected as a Trusted Cloud Best Technical Practice and was among the first batch of products to pass the Advanced level, the highest tier of the Trusted Cloud chaos engineering platform capability requirements. At the same time, the establishment of the "Chaos Engineering Laboratory" was announced. The laboratory is led by the China Academy of Information and Communications Technology and was jointly initiated by Alibaba Cloud Computing Co., Ltd. and a number of other companies.
Double certification: Alibaba Cloud's fault drill platform obtains the highest level of Trusted Cloud certification
As enterprises deepen their understanding and practice of cloud computing, distributed architectures built on the cloud have become the preferred choice for more and more enterprise applications. How to use chaos engineering to improve the stability of cloud native systems and ensure business continuity has become a topic of common concern in the industry.
Chaos engineering uses fault injection to find stability problems and other weaknesses in a system before they cause incidents. Its aim is to improve system and organizational resilience, create a resilient architecture, and ensure business continuity. In the Trusted Cloud chaos engineering platform evaluation run by the China Academy of Information and Communications Technology, Alibaba Cloud's fault drill platform received the highest scores in all 8 capability areas: resource support, fault scenarios, scenario management, experiment procedures, experiment protection, experiment measurement, permission management, and security audit. It was also selected as a 2021 Trusted Cloud Best Technical Practice. This double certification once again demonstrates Alibaba Cloud's technical and product strength in the field of chaos engineering.
Fault drills have evolved along with Alibaba's system architecture, from microservices to containerization to cloud native, and Alibaba has nearly 10 years of practical experience in implementing chaos engineering. The Alibaba Cloud fault drill platform delivers this internal experience as a product. It provides a rich set of experiment scenarios, an expert experience library, and domain-based solutions to cover users' fault scenarios. With flexible process orchestration and open integration capabilities, it provides monitoring, reporting, and related functions to close the loop on chaos engineering practice. It controls the risk of fault drills through permission control and drill protection, helping enterprises improve system stability and business continuity as they migrate to the cloud, prepare for the cloud, and adopt cloud native.
Since the theory of chaos engineering was put forward, many companies have explored and practiced it, but each has implemented it in a different form. What sets the Alibaba Cloud fault drill platform apart?
- Flexible process orchestration: The platform defines a standardized drill process, to which users can add the process nodes they need. It also supports running multiple scenarios in different modes.
- Visualized fault drills: The platform integrates with architecture awareness and injects faults based on a visualization of the architecture topology. It can also work with architecture inspection to find system risk points and then verify them with fault drills.
- Diverse expert experience library: Many years of Alibaba's internal fault drill experience are distilled into drill templates. The templates reflect realistic, practical drill scenarios, greatly improve the efficiency of creating drills, and lower the barrier for users getting started with chaos engineering.
- Domain-based solutions: The platform provides productized solutions for verifying the stability of service components, system architectures, and more. It dynamically identifies components and architectures through architecture awareness and dependency analysis, and automatically generates drill plans so that drills are fast, accurate, and complete.
Using the fault drill platform for chaos engineering, teams can measure the fault tolerance of microservices and estimate the red line of fault tolerance for the system as a whole. The platform can also verify whether the container orchestration configuration is reasonable, test whether the PaaS layer is robust, and check whether monitoring alarms fire promptly, which in turn improves the accuracy and timeliness of those alarms. In surprise fault drills, faults are injected into the system at random to examine how well the responsible personnel respond and whether the reporting and handling process is sound, so that people build the ability to locate and solve problems through live practice. In this way, fault injection uncovers stability problems in advance, improving system and organizational resilience, creating a resilient architecture, and ensuring business continuity.
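As a rough illustration of what "measuring fault tolerance through fault injection" can look like, the following Python sketch runs a tiny experiment in three phases: observe the steady state, inject a fault, then roll back and observe again. It is only a sketch under stated assumptions. All names here (`FlakyDependency`, `call_with_retry`, `run_experiment`) are hypothetical and are not part of the Alibaba Cloud fault drill platform or ChaosBlade, and the fault is modeled simply as an assumed random error rate on a simulated dependency.

```python
import random


class FlakyDependency:
    """Hypothetical downstream service whose failure rate can be switched on."""

    def __init__(self, rng):
        self.rng = rng
        self.error_rate = 0.0  # fraction of calls that fail while a fault is injected

    def call(self):
        # Assumed failure model: each call fails independently with error_rate.
        if self.rng.random() < self.error_rate:
            raise ConnectionError("injected fault")
        return "ok"


def call_with_retry(dep, attempts=3):
    """The resilience mechanism under test: retry a failed call a few times."""
    for _ in range(attempts):
        try:
            return dep.call()
        except ConnectionError:
            continue
    return None  # all attempts failed


def success_rate(dep, requests=1000):
    """Measurement step: fraction of requests that succeed end to end."""
    ok = sum(1 for _ in range(requests) if call_with_retry(dep) == "ok")
    return ok / requests


def run_experiment(error_rate, seed=0):
    """Observe steady state, inject the fault, roll back, observe again."""
    dep = FlakyDependency(random.Random(seed))
    baseline = success_rate(dep)      # steady state before injection
    dep.error_rate = error_rate       # inject the fault
    try:
        during = success_rate(dep)    # behavior while the fault is active
    finally:
        dep.error_rate = 0.0          # always roll the fault back
    after = success_rate(dep)         # confirm the system recovered
    return baseline, during, after


if __name__ == "__main__":
    baseline, during, after = run_experiment(error_rate=0.5)
    print(f"baseline={baseline:.3f} during={during:.3f} after={after:.3f}")
```

Under this assumed model, a 50% error rate with three attempts should leave roughly 1 - 0.5³ = 87.5% of requests succeeding. Comparing the measured rate during the fault with the baseline gives a simple, quantitative view of how much fault the retry logic tolerates, and the `finally` block reflects the drill protection idea that a fault must always be rolled back.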
Since its commercialization in 2019, the Alibaba Cloud fault drill platform has offered diverse experiment tools, automated tool deployment, multi-dimensional drills, flexible process orchestration, rich fault scenarios, practical drill templates, and professional solutions, together with secure drill protection and deep integration with cloud products. It now has nearly a thousand enterprise customers, including Huatai Securities, Bixin Technology, and Qinbao, and helps companies build digital resilience in the cloud native era.
Promoting unified standards: the open source ChaosBlade project shortens the path to building chaos engineering
In recent years, more and more companies have begun to pay attention to and explore chaos engineering, which has gradually become an indispensable tool for testing the high availability of systems and building confidence in them. However, the field is still evolving rapidly, and there is no unified standard for best practices or tool frameworks. Implementing chaos engineering can introduce business risk, and a lack of experience and tools further discourages DevOps personnel from adopting it. There are many excellent open source tools in the field, each covering a particular area, but they are used in very different ways. Some are hard to get started with, costly to learn, and limited to a single type of chaos experiment, which turns many people away from chaos engineering.
Alibaba Group has practiced chaos engineering for many years. To help companies adopt chaos engineering more easily, Alibaba open-sourced its chaos engineering project, ChaosBlade, in 2019, and the project became a CNCF Sandbox project this year. "Self-developed technology", "open source projects", and "commercial products" form a unified technical system, and Alibaba Cloud maximizes the value of its technology through the positive cycle among these three.
ChaosBlade is an open source project that follows the principles of chaos engineering. It includes the chaos engineering experiment tool chaosblade and the chaos engineering platform chaosblade-box, and aims to help enterprises solve high-availability problems during cloud native adoption. The experiment tool chaosblade supports three major system platforms and applications written in four programming languages, and covers more than 200 experiment scenarios and more than 3,000 experiment parameters, which allows fine-grained control over the scope of an experiment. ChaosBlade is the foundation of Alibaba Cloud's fault drill platform, which serves many enterprise customers.
In the future, ChaosBlade will continue to provide a cloud native chaos engineering platform and experiment tools that support multiple clusters, environments, and languages. It will also host more chaos experiment tools, offer compatibility with mainstream platforms, and recommend scenarios. It plans to integrate business and system monitoring and to output experiment reports, completing the closed loop of chaos engineering operations while remaining easy to use.
The industry's first chaos engineering laboratory is formally established to promote the adoption of chaos engineering practice
As the digital industry places ever higher demands on the stability and high availability of cloud computing systems, the Chaos Engineering Laboratory was formally established, led by the China Academy of Information and Communications Technology and joined by Alibaba Cloud and other companies. The laboratory will promote the adoption of chaos engineering in typical application scenarios across fields and connect upstream and downstream companies in cloud computing to jointly advance the rapid development of chaos engineering.
Alibaba Cloud has the richest practical experience in chaos engineering in China and is committed to building a chaos engineering standard system for the cloud native era. Through years of running massive Internet services and Double 11 scenarios, Alibaba Cloud has accumulated core high-availability technologies, including full-link stress testing, online traffic control, and fault drills, and has made them available through open source and cloud services. The goal is to help enterprise users and developers benefit from these technologies, improve development efficiency, and shorten the time needed to build their businesses.