Author: Dome | Alibaba Cloud technical expert, founder and maintainer of the ChaosBlade community and its commercial products, and an active chaos engineering evangelist.
On December 7, 2021, the "Chaos Engineering Technology Salon-Financial Industry Excellence Salon", hosted by the Institute of Information and Communications Technology and organized by the Chaos Engineering Laboratory, was held in Beijing. Qionggu, a technical expert from Alibaba Cloud, gave the talk "From Party A to Party B: How to Industrialize Chaos Engineering".
Chaos engineering has gradually become an important means for enterprises to improve stability
As enterprise systems go cloud native, applications are released and iterated faster and faster, but distributed systems have also grown more complex, and failures have become frequent. For example, the interruption of Google Cloud's authentication system caused by an internal storage quota issue, the failure of AWS's streaming data processing service, and a cloud service outage lasting 5 hours all had a significant impact.
As the risk of unpredictable system behavior grows and the stability of complex systems on the cloud becomes harder to guarantee, chaos engineering has gradually become an important way for enterprises to pursue business continuity. Enterprises use chaos engineering to make sure that the distributed systems in their production environments stay resilient under out-of-control conditions.
Meanwhile, the industry keeps strengthening standards in the field. The Institute of Information and Communications Technology has promoted chaos engineering by issuing standards such as "Chaos Engineering Platform Capability Requirements", "Chaos Engineering Maturity Model", and "Chaos Engineering Stability Measurement Model". It has also established chaos engineering laboratories in cooperation with various enterprises, which has accelerated the development of chaos engineering in China.
Supply and demand differ between the two parties, making chaos engineering hard to industrialize
Since the principles of chaos engineering were put forward, a number of chaos engineering platforms and services have emerged from leading Internet companies and cloud vendors in China and abroad. The nature of Internet companies has led these platforms to focus on productization, production environments, experimental exploration, and cloud native. According to the "China Chaos Engineering Survey Report" issued by the Institute of Information and Communications Technology together with various companies, Party A and Party B choose differently when it comes to chaos engineering technology products. Party B (the service supply side) is more inclined to rely on self-developed platforms, while Party A (the service demand side) is more inclined to rely on commercial platforms, and when choosing one pays more attention to product maturity, industry cases, security, and technical controllability. Party A also faces a series of questions: how to sort out scenarios, how to build the implementation environment, how to evaluate stability, how to control the scope of impact, how to formulate the experiment process, how to coordinate the organization, how to demonstrate the value delivered, how to cover industry-specific characteristics, and how to integrate with the operations and maintenance system. Answering them requires not only technology but also services that combine industry characteristics and practice models.
How can this be done? Drawing on the evolution of its internal chaos engineering across internal group use, commercialization, and open source, and on the difficulties industry customers face in implementing chaos engineering, Alibaba provides a mature chaos engineering industrialization solution.
Alibaba Chaos Engineering Industrialization Solution Provides Community Edition and Enterprise Edition
Alibaba's chaos engineering industrialization solution has two parts: platform technology and services. The platform technology part includes the Community Edition and the Enterprise Edition of the chaos engineering platform. The Community Edition is fully open source and is developed and maintained by the community. The Enterprise Edition is available as public cloud SaaS and as a private cloud deployment. Compared with the Community Edition, it is built to meet enterprise needs for large scale, scenario coverage, safety, and controllability.
Chaos Engineering Platform Community Edition
The Chaos Engineering Platform Community Edition is an open source, general-purpose chaos engineering platform that supports multiple clusters, environments, and languages. Its purpose is to help users get started with chaos engineering. Multiple environments can be configured on the platform to isolate resources. Within each environment, the platform supports resource management and fault injection across hosts, clusters, and containers, and it supports applications written in Java, Golang, C++, and other languages running on those resources. It can also host the industry's mainstream chaos engineering experiment tools, such as ChaosBlade, Chaos Mesh, and LitmusChaos, each of which can be deployed on the platform with one click. The experiment interface is unified, so the experiment scenarios these tools provide can be used directly on the platform. Beyond tool hosting, the platform provides scenario management, multi-dimensional drills, process orchestration, steady-state detection, drill protection, drill reports, multi-tenancy, and other capabilities, and it exposes an OpenAPI for external integration. The Community Edition integrates closely with the CNCF ecosystem, including projects such as Prometheus and Helm.
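The combination of process orchestration, steady-state detection, and drill protection described above can be pictured with a small sketch. This is an illustration only and not the platform's actual API: the `Experiment` class, its fields, and the log labels are hypothetical names, and the fault is simulated with a dictionary rather than a real injection tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Experiment:
    """Hypothetical experiment definition, not a ChaosBlade or platform API."""
    name: str
    steady_state: Callable[[], bool]  # returns True when the system looks healthy
    inject: Callable[[], None]        # starts the fault
    recover: Callable[[], None]       # removes the fault
    log: List[str] = field(default_factory=list)

    def run(self) -> str:
        # Step 1: refuse to start if the system is already unhealthy.
        if not self.steady_state():
            self.log.append("precheck_failed")
            return "skipped"
        self.log.append("precheck_ok")
        try:
            # Step 2: inject the fault.
            self.inject()
            self.log.append("injected")
            # Step 3: check whether steady state holds while the fault is active.
            held = self.steady_state()
            self.log.append("steady_state_held" if held else "steady_state_broken")
        finally:
            # Step 4: always remove the fault, even if a step above raised.
            self.recover()
            self.log.append("recovered")
        return "passed" if held else "failed"


# Toy system with no fault tolerance: any fault breaks steady state.
state = {"fault": False}
demo = Experiment(
    name="demo-fault",
    steady_state=lambda: not state["fault"],
    inject=lambda: state.update(fault=True),
    recover=lambda: state.update(fault=False),
)
result = demo.run()
print(result, demo.log)
```

Putting the recovery step in `finally` is the design point: whatever happens during the drill, the fault is removed before the run returns.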
Chaos Engineering Platform Enterprise Edition
The Enterprise Edition of the Chaos Engineering Platform is positioned to provide large-scale, scenario-based, automated, and secure product capabilities covering full-stack IaaS, PaaS, and SaaS scenarios. It builds on the core capabilities of the Community Edition and supports a one-click upgrade from the Community Edition to the Enterprise Edition. It is also adapted to and integrated with the operations and maintenance systems that industry customers already run, such as full-link stress testing systems, environment technology, unitized disaster recovery platforms, contingency plan systems, and observability systems. Integration with these platforms helps address problems such as blast radius control, steady-state evaluation, and automated experiment execution.
The application architecture of industry customers is generally multi-language, multi-platform, multi-vendor, and built on heterogeneous clouds. The Enterprise Edition adapts to and integrates with these environments, which makes it easier to build a unified chaos engineering platform. Beyond adaptation and integration, the Enterprise Edition has richer capabilities than the Community Edition in four respects:
- Rich drill scenarios: The Enterprise Edition supports more than 200 failure scenarios, supports cloud services, and is compatible with the Windows platform. It supports one-stop disaster recovery and network disconnection drills covering pre-check, disconnection, recovery, and review, as well as service-level microservice drills.
- Diverse drill forms: Drill machines and scenarios can be customized. Experiments can be saved to an experience library, or experience templates can be created directly for one-click drills. Advanced drill options are provided and can be configured on demand. Visual drills are supported: a drill can be launched with one click from the architecture topology diagram, which makes it easy to check drill status and keep the blast radius under control.
- Easy-to-use drill platform: The Enterprise Edition requires no changes to business applications. It automatically discovers the architecture, lays out the architecture topology automatically, and visualizes drills. It supports a one-click upgrade from the Community Edition to the Enterprise Edition to meet enterprise-level needs.
- Safe drill guarantees: A variety of drill recovery strategies are provided, such as controlling drill status through configured thresholds on business metrics, which keeps drills safe and controllable. Fine-grained permission management and control is also provided.
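A recovery strategy driven by a business metric threshold could look roughly like the following. It is a minimal sketch under stated assumptions, not the Enterprise Edition's implementation: the function `guard_drill`, the parameter `breaches_needed`, and the sample error rates are all hypothetical, and requiring consecutive breaches is one possible choice for filtering out noise.

```python
from typing import Callable, Iterable, Optional, Tuple


def guard_drill(
    samples: Iterable[float],
    threshold: float,
    stop_drill: Callable[[], None],
    breaches_needed: int = 2,
) -> Tuple[str, Optional[int]]:
    """Stop the drill once the metric exceeds `threshold` for
    `breaches_needed` consecutive samples (hypothetical policy)."""
    consecutive = 0
    for index, value in enumerate(samples):
        # Count consecutive breaches; any healthy sample resets the count.
        consecutive = consecutive + 1 if value > threshold else 0
        if consecutive >= breaches_needed:
            stop_drill()
            return ("aborted", index)
    # No sustained breach: end the drill normally.
    stop_drill()
    return ("completed", None)


# Simulated error-rate samples collected while a fault is active.
stops = []
error_rates = [0.01, 0.06, 0.02, 0.07, 0.08, 0.01]
outcome = guard_drill(error_rates, threshold=0.05,
                      stop_drill=lambda: stops.append("stopped"))
print(outcome)
```

In this sample data the single spike at the second sample is ignored, and the drill is aborted only when two breaches occur in a row.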
General Chaos Engineering Practice Mode
Alibaba's chaos engineering practice modes are a general set of modes abstracted from Alibaba's many years of internal chaos engineering practice, open source community discussions, and multiple enterprise project cases. These modes greatly reduce the effort enterprises spend on topic selection, goal setting, and organizational design when introducing chaos engineering, and they keep the implementation purposeful. There are three types of practice mode: business-oriented, architecture-oriented, and organization-oriented chaos engineering practice:
- Business-oriented chaos engineering practice approaches the work from the business perspective. Pattern templates make it possible to run drills quickly, expose problems in business architecture design, and reduce the impact of sudden failures on the business. The practice modes include the strong and weak dependency mode between services, the asset loss prevention and control mode for financial and capital scenarios, the user experience mode, and the client-side disaster recovery mode. Typical application scenarios are mobile banking and transaction settlement.
- Architecture-oriented chaos engineering practice approaches the work from the infrastructure perspective. Pattern templates help find problems from the viewpoint of infrastructure users and operators, measure stability, and shorten failure recovery time. The practice modes include the observability mode, which verifies the coverage and effectiveness of monitoring; the service level agreement mode, which uses SLI drills to verify the SLA offered externally; the disaster recovery mode, which verifies same-city active-active and multi-site active-active deployments; and the failure recovery mode, which verifies service self-healing. Usage scenarios include the distributed transformation of core architecture and moving core business to the cloud.
- Organization-oriented chaos engineering practice measures and improves stability from a global perspective. Through organization and operations, it strengthens the chaos engineering culture, promotes coordination across teams, and improves the efficiency of emergency response to failures. The practice modes include the planned failure drill mode, the red-blue attack and defense mode in which red and blue teams confront each other at a specific time, and the surprise attack mode launched against the production environment. Typical usage scenarios include assessing the compliance rate for the 1-5-10 failure emergency target and promoting large-scale stability projects.
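As a rough illustration of how a 1-5-10 compliance rate might be computed from drill results, consider the sketch below. The source does not define 1-5-10; the code assumes the common reading of detecting a failure within 1 minute, locating it within 5 minutes, and recovering within 10 minutes, all measured from the start of the failure. The `DrillTiming` type and the sample timings are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DrillTiming:
    """Minutes elapsed since fault start for each milestone (assumed definition)."""
    detect_min: float
    locate_min: float
    recover_min: float


# Assumed targets: detect in 1 minute, locate in 5, recover in 10.
DETECT_TARGET, LOCATE_TARGET, RECOVER_TARGET = 1.0, 5.0, 10.0


def meets_1_5_10(t: DrillTiming) -> bool:
    # A drill complies only if all three milestones are within target.
    return (t.detect_min <= DETECT_TARGET
            and t.locate_min <= LOCATE_TARGET
            and t.recover_min <= RECOVER_TARGET)


def compliance_rate(drills: Sequence[DrillTiming]) -> float:
    # Fraction of drills that met all three targets; 0.0 for an empty list.
    if not drills:
        return 0.0
    return sum(meets_1_5_10(t) for t in drills) / len(drills)


# Made-up sample data for illustration.
drills = [
    DrillTiming(0.5, 3.0, 8.0),   # complies
    DrillTiming(2.0, 4.0, 9.0),   # detection too slow
    DrillTiming(1.0, 5.0, 10.0),  # complies exactly at the limits
    DrillTiming(0.8, 6.0, 9.0),   # location too slow
]
rate = compliance_rate(drills)
print(f"{rate:.0%}")
```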
Three delivery modes
Consulting engagements distill a set of chaos engineering practice modes that focus on the problems of rolling out chaos engineering stage by stage. In actual customer delivery, three delivery modes have gradually emerged according to the customer's stage: Community Edition plus feasibility evaluation, Enterprise Edition plus large-scale rollout, and Enterprise Edition plus in-depth industry co-construction.
- Mode 1: Community Edition + feasibility evaluation (light consultation). Using the open source chaos engineering platform and the experience of chaos engineering experts, chaos engineering is implemented quickly in the enterprise and the feasibility of follow-up chaos engineering work is evaluated.
- Mode 2: Enterprise Edition + large-scale rollout. Through public cloud or private cloud deployment and the enterprise-level features of the platform, chaos engineering is rolled out at scale in the enterprise.
- Mode 3: Enterprise Edition + in-depth industry co-construction. Through proprietary cloud deployment and the platform's ability to integrate and be integrated, the platform is built out jointly with the customer on top of the customer's existing systems to achieve platform integration.
Community Edition + feasibility evaluation (light consultation) delivery mode
With the Chaos Engineering Community Edition and Alibaba's many years of experience inside the group and with customers, this mode helps customers who do not know where to start with chaos engineering. It lets them implement chaos engineering quickly and evaluate its feasibility in the enterprise.
A typical customer case for this mode is shown above. The customer was moving workloads off the mainframe and had transformed its systems to a distributed architecture. Its self-developed distributed framework needed its high availability verified, but the customer had no experience in chaos engineering and wanted to use this project to introduce chaos engineering into the enterprise. The customer's goals were clear: obtain a chaos engineering methodology, have its testers taught how to do scenario analysis, deploy a chaos engineering tool platform, have the testers taught how to extend failure scenarios based on ChaosBlade, and be led through a chaos engineering implementation so that the testers learn the entire process.
Based on the customer's background and goals, the first step was to survey the customer's technical architecture, business architecture, deployment architecture, and stability assurance practices, produce a stability analysis report, and identify stability risk points. Next came chaos engineering training and failure scenario analysis case studies, after which the customer was led through analyzing failure scenarios for its self-developed distributed framework. At the same time, the customer was guided in developing custom failure scenarios based on ChaosBlade and was given an overall technical solution and implementation plan for chaos engineering:
Deploy the Community Edition of the platform on the customer side, provide and review drill plans based on the analyzed failure scenarios, carry out failure drills, produce standardized failure drill reports, organize failure drill reviews, and provide suggestions for follow-up chaos engineering planning.
The whole project was delivered in just 1 month. It guided the customer in sorting out dozens of failure scenarios for the self-developed framework, included two guided failure drills, and was followed by many drills that the customer ran on its own. The drills uncovered many stability problems in areas such as high-availability switchover, fault self-healing, and monitoring and alerting, and the project provided a feasibility assessment for the enterprise's later large-scale adoption of chaos engineering.
Enterprise Edition + large-scale rollout delivery mode
The value that chaos engineering has delivered at major Internet companies is gradually being recognized. More and more financial companies in banking, securities, and insurance have begun to plan and implement chaos engineering. These companies are undergoing distributed architecture transformation, cloud technology upgrades, and financial IT innovation, which together create a complex infrastructure environment, and chaos engineering is a fast and effective technical means of managing the stability of complex systems in that environment. When it comes to implementing chaos engineering quickly and effectively, lack of experience is the biggest challenge these companies face. Because the underlying chaos engineering platform is industry-independent, introducing a mature chaos engineering Enterprise Edition and accompanying services through an innovation project has become the first choice of more and more financial companies.
The Chaos Engineering Enterprise Edition provides a wealth of drill scenarios and one-stop drill capabilities. The platform is easy to use, safe, and controllable, which supports large-scale deployment and rollout in the enterprise, uncovers system stability problems, and improves system resilience. For example, the platform can be used to improve failure response across fault detection, fault location, fault handling, and fault review. It helps customers shorten the time needed to run large-scale drills, improve drill efficiency, and secure the return on their chaos engineering investment, so that they can focus on identifying architectural risks and optimizing their systems.
As its business grew and innovated, a leading securities customer saw the number and complexity of its systems keep increasing. Its production operations faced risks such as functional defects, performance and capacity limits, and single points of failure, all of which threatened safe and stable operation, so it needed to improve system performance and stability at the technical level. Using the mature capabilities of the Alibaba Cloud Chaos Engineering Enterprise Edition to run regular, large-scale drills, the customer tested the production system's vulnerability to a large number of failure scenarios in advance, identifying and eliminating technical risks as early as possible and improving operational reliability. In the one-month stability project, chaos engineering delivered substantial benefits: 23 types of risk and more than 300 problem points were discovered across more than 2,000 drills, with nearly 300 drills on the busiest day, covering 300 core systems, among other results. On top of the Enterprise Edition platform, the customer also carried out chaos engineering organization and operations work, such as double random drills, large-screen drill dashboards, production quality analysis reports, and drill data operations.
Enterprise Edition + in-depth industry co-construction delivery mode
The Chaos Engineering Enterprise Edition can both integrate with other systems and be integrated into them. With this capability, it can connect to the customer's monitoring system, change system, test system, emergency response system, CMDB, and other systems, and it can be built out jointly according to the characteristics of the industry's systems, such as heterogeneous systems, localized systems, and network cloud systems. This enables technological innovation in industrialized chaos engineering, such as quantitative evaluation, scenario exploration, and experiment automation, to the benefit of both sides.
By using the standardized capabilities of the Enterprise Edition to adapt to and integrate with customers' existing systems, a chaos engineering platform with industry-specific attributes can be built that meets industry needs and accelerates customers' adoption of and innovation in chaos engineering.
More
Alibaba is committed to chaos engineering implementation and industrialized solutions, using a variety of practice modes to make implementation more focused and multiple delivery modes to serve companies with different needs. You can try the Alibaba Cloud Chaos Engineering service or ask about the Chaos Engineering solution through the links below.
1) Open source project address:
https://github.com/chaosblade-io/chaosblade
2) Product experience address:
https://developer.aliyun.com/adc/scenario/e9b27357ab9c4785bc7f43fb62f872e3
3) Solution address:
https://www.aliyun.com/solution/cloudnative/chaosengineering
For more discussion of chaos engineering, you are welcome to join the group: search for DingTalk group number 23177705.