Introduction: Through years of running large-scale Internet services under today's explosive traffic growth, Haojing Technology has accumulated core high-availability technologies such as full-link load testing, traffic control, dynamic scaling, and fault drills, delivered on the cloud as services, platforms, and tools to help internal product R&D teams and customers improve development efficiency and business stability. To connect fault discovery, fault management, fault drills, emergency response, and other high-availability measures into a complete chain for stability construction, the company formed an IT Blue Army to carry out surprise drills, quality control, and joint training. Built up since 2019, the IT Blue Army focuses on the production environment and practices chaos engineering to drive improvements in code, infrastructure, processes, personnel, and monitoring. This year the drills have become more intensive, regular, and periodic, continuously improving each SRE's individual response capability.
Author introduction: Ye Wenchen, cloud native technology expert at Haojing Technology, contributor to the open source ChaosBlade community, with years of experience in distributed system architecture and stability construction, dedicated to stability assurance (SRE), IT Blue Army building, and the digitalization of operations and maintenance.
Foreword
1. Stability pain points of agile development and DevOps
With the rapid expansion of business scale, agile development, DevOps practices, and cloud-native architecture and governance have greatly improved application delivery capability and shortened the time to market. At the same time, the complexity of microservice governance has grown exponentially, and business agility and technical iteration have become harder and harder, while the business must still remain highly available and stable. Traditional disaster-recovery approaches to failure can no longer keep up with this pace.
The best way to reduce failures is to manage them with an antifragile mindset: treat failure as the norm and repeatedly rehearse abnormal scenarios so that the system's fault tolerance and resilience keep improving. Chaos engineering answers this challenge by proactively injecting faults, uncovering latent problems ahead of time, and iteratively improving the architecture and operating practices, ultimately achieving business resilience.
2. The need for chaos engineering
Chaos engineering is a methodology for actively discovering the weak links of a system by running experiments on distributed systems. First proposed by Netflix and related teams, it aims to nip faults in the bud, that is, to identify them before they cause outages. By deliberately creating faults, it tests how the system behaves under various kinds of stress, so that problems can be identified and fixed before they lead to serious consequences. Netflix open sourced Chaos Monkey in 2012, and today many companies, including Google, Amazon, IBM, and Nike, use some form of chaos engineering to improve the reliability of modern architectures.
Through the practice of massive Internet services and today's explosive traffic growth, Haojing Technology has accumulated core high-availability technologies including full-link load testing, traffic control, dynamic scaling, and fault drills, and delivers them on the cloud as services, platforms, and tools to help internal product R&D departments and customers improve development efficiency and business stability.
To connect fault discovery, fault management, fault drills, emergency response, and other high-availability measures into a complete chain for stability construction, Haojing Technology formed an IT Blue Army to carry out surprise drills, quality control, and joint training. The IT Blue Army team has been built up since 2019, focusing on the production environment and practicing chaos engineering to drive improvements in code, infrastructure, processes, personnel, and monitoring. Starting this year, we have increased the intensity of drills, made them regular and periodic, and continuously improved each SRE's individual response capability.
Fault drill platform
1. Building a fault drill platform
Guided by this thinking, Haojing Technology decided to build a fault drill platform, combining tool-based fault injection with platform-based drill management to run standardized, periodic fault drills and thereby improve product resilience.
Platform goals:
- Provide automated, visualized, orchestrated, and non-intrusive fault injection capabilities;
- Serve as the unified entry point for high-availability drills and fault testing;
- Accumulate high-availability test cases and establish a quantitative stability evaluation system.
Functional goals:
- Cover current failure scenarios such as JVM, C++, containers, and Kubernetes;
- Inject faults automatically, with full fault lifecycle management (see the sketch after this list);
- Keep the blast radius of injected faults controllable;
- Make fault injection types easily extensible.
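To make the lifecycle and blast-radius goals concrete, here is a minimal sketch of how the platform's injection layer might drive the ChaosBlade CLI from Python. `blade create`, `blade status`, and `blade destroy` are standard ChaosBlade sub-commands and `--timeout` auto-recovers an experiment, bounding it in time; the wrapper functions themselves are illustrative, not the platform's actual code.

```python
import json
import subprocess

BLADE = "blade"  # assumes the ChaosBlade CLI is installed on the drill target host


def run_blade(*args: str):
    """Run a blade sub-command and return the parsed JSON 'result' field."""
    out = subprocess.run([BLADE, *args], capture_output=True, text=True)
    resp = json.loads(out.stdout)
    if not resp.get("success"):
        raise RuntimeError(f"blade {' '.join(args)} failed: {resp}")
    return resp["result"]


def inject_cpu_load(cpu_percent: int = 60, timeout_sec: int = 300) -> str:
    """Create a CPU-load experiment; --timeout bounds its blast radius in time."""
    return run_blade("create", "cpu", "fullload",
                     "--cpu-percent", str(cpu_percent),
                     "--timeout", str(timeout_sec))


def experiment_status(uid: str):
    """Query the lifecycle state of a running experiment by its uid."""
    return run_blade("status", uid)


def destroy_experiment(uid: str):
    """Recover explicitly instead of waiting for the timeout to expire."""
    return run_blade("destroy", uid)


if __name__ == "__main__":
    uid = inject_cpu_load(cpu_percent=60, timeout_sec=120)
    print("experiment uid:", uid)
    print("status:", experiment_status(uid))
    destroy_experiment(uid)
```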
2. Fault injection tool selection
The industry today offers a fairly diverse set of fault-simulation tools, each with its own strengths and weaknesses in the capabilities and scenarios it supports. After comparison, ChaosBlade stood out for its relatively rich capabilities and scenarios and its active community. After fully verifying most of its injection functions, we chose it as the core module of our underlying injection layer.
Chaos Engineering open source tool comparison
3. Fault drill steps
Following ChaosBlade's chaos experiment model, we standardized the entire fault injection process into five steps:
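The five steps themselves are laid out in the drill-process figure of the original article and are not reproduced here. What each step injects is described with ChaosBlade's experiment model (target, scope, matcher, action); the sketch below shows one hypothetical way such a description could be rendered into a `blade create` command. The `Experiment` dataclass and its builder are illustrative and not part of ChaosBlade's API, although the network-delay flags in the example are standard ChaosBlade flags.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """A fault described with ChaosBlade's experiment model: target, scope, matcher, action."""
    target: str                                   # component to disturb, e.g. "cpu", "network", "jvm"
    action: str                                   # concrete fault, e.g. "fullload", "delay", "kill"
    scope: str = "host"                           # where it runs: host, docker, k8s, ...
    matchers: dict = field(default_factory=dict)  # flags that narrow the blast radius

    def to_command(self) -> list:
        """Render the experiment as a blade CLI invocation."""
        cmd = ["blade", "create"]
        if self.scope != "host":                  # host-scope experiments omit the scope segment
            cmd.append(self.scope)
        cmd += [self.target, self.action]
        for flag, value in self.matchers.items():
            cmd += [f"--{flag}", str(value)]
        return cmd


# Example: 3 seconds of delay on eth0 traffic to local port 8080, auto-recovered after 120s.
delay = Experiment(target="network", action="delay",
                   matchers={"interface": "eth0", "time": 3000,
                             "local-port": 8080, "timeout": 120})
print(" ".join(delay.to_command()))
# blade create network delay --interface eth0 --time 3000 --local-port 8080 --timeout 120
```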
4. Platform modules
With the fault drill engine as the core component that performs fault injection, the platform's modules are built around business-level service drills.
Fault drill
1. The drill process in detail
An actual fault drill involves a series of steps: environment preparation, fault injection task scheduling, the fault injection itself, post-drill review, and problem remediation.
- Drill plan confirmation
Before running a fault drill, confirm the target service or node for fault injection and make sure it is managed by the fault drill platform. Also confirm the time, location, stakeholders, the service's steady state, drill expectations, observation metrics, and the complete drill execution sequence.
- Use case orchestration for the fault drill
Based on the high-availability drill tool HATT, schedule drill tasks automatically and carry out the entire drill process.
- Drill execution
Use the drill tooling to monitor the entire lifecycle of the drill and collect the drill results. For alarms and monitoring anomalies that occur during the drill, synchronize the stability metrics into the drill execution results to verify the stability expectations.
- Drill wrap-up / review
Produce the drill results and drill report from the fault drill platform, and produce a drill problem review report based on metric analysis.
- Stability improvement
Based on the drill report, define the stability improvement plan and track its implementation, so that the fault can be regression-tested in the next drill.
Fault drill use cases are retained in the fault drill platform as assets of the business being drilled, and common ones can also be reused; a simplified automation sketch follows.
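As a rough illustration of how such a use case can be solidified into a standardized, automated run, the sketch below chains the steps described above: steady-state check, injection, observation, recovery, and a result report. The `steady_state_ok` probe is a placeholder for whatever monitoring queries the platform actually makes, and the flow is an assumption about the process rather than the real interface of HATT or the drill platform.

```python
import json
import subprocess
import time


def steady_state_ok() -> bool:
    """Placeholder: query monitoring and compare key metrics (e.g. TPS, error rate)
    against the agreed steady-state thresholds for the target service."""
    return True


def blade(*args: str) -> dict:
    """Invoke the ChaosBlade CLI and return its JSON response."""
    out = subprocess.run(["blade", *args], capture_output=True, text=True)
    return json.loads(out.stdout)


def run_drill_case(target: str, action: str, *flags: str, observe_sec: int = 60) -> dict:
    """One automated drill case: check steady state, inject, observe, recover, report."""
    report = {"case": f"{target} {action}", "expectation_met": False}

    # 1. Confirm the system is in steady state before touching it.
    if not steady_state_ok():
        report["aborted"] = "system not in steady state"
        return report

    # 2. Inject the fault; blade returns the experiment uid on success.
    created = blade("create", target, action, *flags)
    uid = created.get("result")
    report["uid"] = uid

    # 3. Observe for a fixed window and record whether the steady state still holds.
    time.sleep(observe_sec)
    report["expectation_met"] = steady_state_ok()

    # 4. Recover explicitly so the drill never outlives its window.
    if uid:
        blade("destroy", uid)

    return report


if __name__ == "__main__":
    print(run_drill_case("network", "loss", "--percent", "30",
                         "--interface", "eth0", observe_sec=30))
```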
2. From 1 to 100
Stability construction is never achieved overnight. Chaos engineering aims to build a PDCA cycle for stability, prompting SREs to keep verifying within fast product iteration cycles, to optimize product stability, and to keep pace with the product's DevOps cadence. Faced with a large number of repeated, periodic fault drills, standardization, automated execution, and a solidified drill process become effective levers for efficiency.
Once the drill plan has been designed and integrated, a single IT Blue Army member can complete the entire automated drill process on the platform.
Typical case
Verify service availability when a single message queue node goes down.
- Drill scenario:
A single Broker node in the message queue goes down; verify whether messages are still sent and received normally.
- Stability expectation:
A failure of a single Broker does not affect message sending on the other nodes, and the faulty node is removed from the list of available nodes. After a brief TPS drop, message sending returns to normal TPS.
- Stability anomaly observed during the drill:
After the node went down, TPS dropped to 0, which did not meet expectations.
- Improvements:
1. The client introduces a circuit-breaker mechanism: after message-send retries fail, it stops attempting to send to the faulty node, avoiding prolonged unavailability (see the sketch below);
2. The namesrv routing service actively pushes Broker failure information to clients, shortening the time to recover from the failure.
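For the first improvement, the idea of the client-side circuit breaker can be sketched as follows: stop routing sends to a Broker after several consecutive failures, then probe it again after a cooldown. This is illustrative pseudocode of the pattern, not the messaging client's actual code; names such as `BrokerBreaker` and the thresholds are hypothetical.

```python
import time


class BrokerBreaker:
    """Illustrative circuit breaker: after max_failures consecutive send failures,
    a broker is skipped until cooldown_sec has passed, then probed again."""

    def __init__(self, max_failures: int = 3, cooldown_sec: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_sec = cooldown_sec
        self._failures = {}   # broker -> consecutive failure count
        self._opened_at = {}  # broker -> time the breaker opened

    def available(self, broker: str) -> bool:
        if self._failures.get(broker, 0) < self.max_failures:
            return True
        # Breaker is open: allow a single probe once the cooldown has elapsed.
        return time.time() - self._opened_at.get(broker, 0.0) >= self.cooldown_sec

    def record_success(self, broker: str) -> None:
        self._failures.pop(broker, None)
        self._opened_at.pop(broker, None)

    def record_failure(self, broker: str) -> None:
        count = self._failures.get(broker, 0) + 1
        self._failures[broker] = count
        if count >= self.max_failures:
            self._opened_at[broker] = time.time()


def pick_broker(brokers, breaker: BrokerBreaker):
    """Route to the first broker whose breaker is closed (or due for a probe)."""
    for broker in brokers:
        if breaker.available(broker):
            return broker
    return None  # every broker is broken: fail fast instead of retrying forever
```

Combined with the second improvement, where namesrv proactively pushes Broker failure information, the client no longer has to discover a dead Broker only through retry timeouts, which is what shortens the TPS dip after a node failure.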
Haojing Technology's chaos engineering practice
Through our chaos engineering practice we have come to see that fault drills are only one part of stability construction: improving stability and handling fault emergencies form an interlocking chain, and a weak link anywhere degrades overall stability quality. Establishing a collaborative fault-response chain is still a long-term undertaking.
Currently, we are:
- At the planning level, promoting tiered fault drill capabilities;
- At the platform level, working to connect architecture awareness and operations components so they can act in coordination;
- At the system level, establishing a coordinated fault emergency response chain;
- At the drill execution level, moving fault drills from the test and pre-production environments into production;
- Actively contributing to the open source community; as the underlying injection tool ChaosBlade keeps evolving, we will introduce richer fault types and more flexible injection methods.
Taking Haojing Technology's internal chaos engineering practice as an example, drills of various types have been carried out for 30+ important product lines, accumulating 200+ use cases run as monthly or quarterly periodic fault drills, to ensure the whole product line can cope with extreme business pressure. This comprehensively raises the application service level of the open platform and provides solid support for the continuous optimization of the Haojingyun system architecture and rapid product innovation.