Author: Ming Shao

What is Chaos Engineering

System architecture has evolved from stand-alone to distributed, and now to cloud native. Its complexity keeps rising, and so does the difficulty of locating problems. With failures liable to occur at any time, is there a good way out of this dilemma?

Chaos Engineering is a discipline of running experiments on distributed systems. By actively injecting faults, it exposes a system's weak points in advance, drives architectural improvements, and ultimately builds business resilience, so that fewer failures reach the online production environment.

Here we take cloud-native architecture as an example of why chaos engineering can solve architectural problems. The principles of cloud-native architecture map onto the principles of chaos engineering. Consider the service-orientation principle: it is about service governance, that is, determining which dependencies between upstream and downstream services are strong and which are weak. With chaos engineering, you can route a request to a specific machine, then narrow it down to a specific application on that machine, continuously minimizing the blast radius. By injecting faults between applications and checking whether the upstream and downstream services still behave normally, you can determine whether each dependency is strong or weak.
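
The strong/weak dependency check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FakeOrderService`, its dependency names, and the `classify_dependencies` helper are hypothetical and are not part of ChaosBlade. The idea is that a dependency is treated as weak if the upstream service stays healthy while that dependency is faulted, and strong otherwise.

```python
from typing import Callable, Dict, Iterable


def classify_dependencies(
    dependencies: Iterable[str],
    inject_fault: Callable[[str], None],
    recover: Callable[[str], None],
    upstream_healthy: Callable[[], bool],
) -> Dict[str, str]:
    """Fault one dependency at a time and observe the upstream service."""
    results: Dict[str, str] = {}
    for dep in dependencies:
        inject_fault(dep)
        try:
            ok = upstream_healthy()
        finally:
            # Always recover, so that only one fault is active at a time
            # and the blast radius stays as small as possible.
            recover(dep)
        results[dep] = "weak" if ok else "strong"
    return results


class FakeOrderService:
    """Hypothetical upstream service used only to exercise the sketch."""

    REQUIRED = {"inventory"}  # assumption: orders cannot work without it

    def __init__(self) -> None:
        self.broken = set()

    def inject_fault(self, dep: str) -> None:
        self.broken.add(dep)

    def recover(self, dep: str) -> None:
        self.broken.discard(dep)

    def healthy(self) -> bool:
        return not (self.broken & self.REQUIRED)


service = FakeOrderService()
report = classify_dependencies(
    ["inventory", "recommendation"],
    service.inject_fault,
    service.recover,
    service.healthy,
)
print(report)
```

In a real drill, the three callbacks would be backed by a fault injection tool and by the service's own health checks or business metrics.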

The goal of chaos engineering is a resilient architecture, which has two parts: a resilient system and a resilient organization. A resilient system is characterized by redundancy, scalability, immutable infrastructure, stateless applications, protection against cascading failures, and so on. A resilient organization has efficient delivery, failure contingency plans, and emergency response mechanisms. Even a highly resilient system can fail unexpectedly, so a resilient organization compensates for what the system lacks, and chaos engineering is used to build the two into a resilient architecture.

Chaos engineering finds a system's weak points in advance by actively injecting faults, drives architectural improvement, and ultimately achieves business resilience. Its business value differs by role:

  • Architects: verify the fault tolerance of the architecture
  • Development / operations and maintenance: improve the efficiency of emergency response to failures
  • Testing: expose online problems early and reduce the failure recurrence rate
  • Product / design: improve the customer experience

How to implement chaos engineering

How can an enterprise or business team implement chaos engineering? Is there a tool or platform that helps it get adopted quickly?

ChaosBlade is a chaos experiment execution tool that follows the chaos experiment model. It offers a rich set of scenarios and is easy to use. It supports multiple platforms and languages, including the Linux, Kubernetes, and Docker platforms and applications written in Java, NodeJS, C++, and Golang, with more than 200 scenarios and more than 3000 parameters. However, it is a fault injection tool that runs on the device side, so adopting it in a business raises the following questions:

  • How to visualize the fault injection process?
  • How to do fault injection to multiple clusters or hosts at the same time?
  • How to get statistics for the overall drill?
  • ......

Therefore, a platform layer is required on top of ChaosBlade to manage and orchestrate the execution tools of chaos engineering.
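
To make the multi-host question concrete, here is a minimal Python sketch that only builds the command lines a platform layer would have to deliver to each host; it does not execute anything. The `blade create cpu fullload` subcommand with `--cpu-percent` and `--timeout`, and `blade destroy <uid>`, follow the ChaosBlade CLI as commonly documented, but check them against `blade create cpu -h` for your installed version. The host names and helper functions are hypothetical.

```python
from typing import Dict, List


def blade_create_cpu_load(cpu_percent: int, timeout_s: int) -> List[str]:
    """Build the argv for a CPU load experiment (assumed CLI flags)."""
    if not 0 < cpu_percent <= 100:
        raise ValueError("cpu_percent must be in (0, 100]")
    if timeout_s <= 0:
        raise ValueError("timeout_s must be positive")
    return [
        "blade", "create", "cpu", "fullload",
        "--cpu-percent", str(cpu_percent),
        "--timeout", str(timeout_s),  # the experiment ends itself after this
    ]


def blade_destroy(uid: str) -> List[str]:
    """Build the argv that stops a running experiment by its uid."""
    return ["blade", "destroy", uid]


def plan_for_hosts(hosts: List[str], cpu_percent: int, timeout_s: int) -> Dict[str, List[str]]:
    """Return the same experiment command for every target host."""
    cmd = blade_create_cpu_load(cpu_percent, timeout_s)
    return {host: list(cmd) for host in hosts}


# Hypothetical host names, for illustration only.
plan = plan_for_hosts(["host-a", "host-b"], cpu_percent=80, timeout_s=60)
for host, argv in plan.items():
    print(host, " ".join(argv))
```

How each command reaches its host, how the returned experiment uid is tracked, and how results are aggregated are exactly the parts that a plain CLI leaves open and that a platform layer has to provide.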

ChaosBlade-Box is an open source, cloud-native chaos engineering console for multi-cluster, multi-language, and multi-environment use.

The overall architecture of the open source platform and the injection tool consists mainly of the following components:

  • ChaosBlade-Box Console: the user interface for chaos experiments
  • ChaosBlade-Box: the backend server, responsible mainly for orchestration and safety control of drill scenarios, deployment of chaos engineering tools (ChaosBlade, LitmusChaos...), probe management, and multi-dimensional experiments
  • Agent: the probe, which mainly establishes a connection to the ChaosBlade-Box server and maintains a heartbeat, reports K8s-related data, and serves as the channel for delivering drill commands
  • ChaosBlade: the tool deployed on the business host or in the K8s cluster to perform drills on the device side
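
As a rough illustration of the heartbeat relationship between the server and its probes, the following Python sketch tracks when each agent was last seen and treats it as offline once a timeout passes without a heartbeat. It is not the ChaosBlade-Box implementation: the `AgentRegistry` class, the agent IDs, and the 30-second timeout are all assumptions made for the example. Time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AgentRegistry:
    """Hypothetical server-side view of which probes are still alive."""

    timeout_s: float = 30.0  # assumed value, not taken from ChaosBlade-Box
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, agent_id: str, now: float) -> None:
        """Record that a probe reported in at time `now` (in seconds)."""
        self.last_seen[agent_id] = now

    def online(self, now: float) -> List[str]:
        """Probes whose last heartbeat is within the timeout."""
        return sorted(a for a, t in self.last_seen.items() if now - t <= self.timeout_s)

    def offline(self, now: float) -> List[str]:
        """Probes that have missed the timeout and cannot receive commands."""
        return sorted(a for a, t in self.last_seen.items() if now - t > self.timeout_s)


registry = AgentRegistry(timeout_s=30.0)
registry.heartbeat("host-a", now=0.0)
registry.heartbeat("k8s-b", now=20.0)
print(registry.online(now=40.0))
print(registry.offline(now=40.0))
```

A server that knows which probes are offline can refuse to schedule a drill on them, rather than sending a command down a channel that no longer exists.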

The new ChaosBlade-Box platform is a cloud-native chaos engineering platform for multi-cluster, multi-environment, and multi-language use. Its interface is internationalized and can switch between Chinese and English. It supports global namespaces, so the same user can set up different namespaces as needed, such as a test space, a sandbox space, and an online space. Automated tool deployment simplifies installation and improves execution efficiency. The platform supports probe installation and drills in different environments, such as hosts and Kubernetes, and in Kubernetes it supports drills at the Node, Pod, and Container levels. Pod-related data in a Kubernetes cluster is collected automatically and managed centrally under application management, so users no longer need to go into the cluster to look up the Pod name or Container name of the application they want to drill. The platform also supports one-click migration to the enterprise version, synchronizing drill data from the community version to the enterprise version on demand.

The whole drill process on the new ChaosBlade-Box platform supports sequential execution and stage execution. Sequential execution means that multiple drill scenarios take effect one after another; stage execution means that multiple drill scenarios take effect at the same time. Recovery from a drill is guaranteed by several safety policies, such as manual stop and automatic stop. Automatic stop is enabled by setting a timeout parameter when the drill is configured, so that even if the platform loses its connection to the probe (Agent) and the drill cannot be stopped manually, the fault is still recovered automatically when the timeout expires.
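
The difference between the two execution modes, and the role of the timeout, can be modeled with a small Python sketch. It is a simplified reading of the description above rather than the platform's scheduler: the `Scenario` class, the scenario names, and the assumption that in sequential mode each scenario starts exactly when the previous one times out are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Scenario:
    name: str
    timeout_s: int  # the fault recovers by itself after this many seconds


def schedule(scenarios: List[Scenario], mode: str) -> List[Tuple[str, int, int]]:
    """Return (name, start, end) in seconds for each scenario."""
    timeline = []
    clock = 0
    for s in scenarios:
        if mode == "sequential":
            # Assumption: the next scenario starts when the previous one ends.
            timeline.append((s.name, clock, clock + s.timeout_s))
            clock += s.timeout_s
        elif mode == "stage":
            # All scenarios in the stage take effect at the same time.
            timeline.append((s.name, 0, s.timeout_s))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return timeline


def active_at(timeline: List[Tuple[str, int, int]], t: int) -> List[str]:
    """Names of the faults still in effect at time t."""
    return [name for name, start, end in timeline if start <= t < end]


drill = [Scenario("cpu-load", 60), Scenario("network-delay", 30)]
sequential = schedule(drill, "sequential")
stage = schedule(drill, "stage")
print(sequential)
print(stage)
print(active_at(stage, 45))
```

Because every scenario carries its own end time, no fault in this model outlives its timeout, which is the property that makes automatic stop safe even when no stop command can be delivered.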

What are the advantages of the new version

Compared with the old version, the new release unifies the front-end interface with the enterprise version, which lowers the cost of switching between the two, improves Chinese and English internationalization, and supports switching global namespaces. The back end provides smoother drill orchestration, more complete application management, stronger probe control, and one-click migration to the enterprise version. The probe has been enhanced as well: it has a more complete API, can be deployed in multiple environments and serve as the drill channel in each of them, supports automatic installation and uninstallation, and collects and reports data to make drills simpler and smoother.

Related Links

Middleware Developer Conference page (the PDF of the talk can be downloaded there):

https://developer.aliyun.com/topic/middleware/developer/summit

