title: Chaos Mesh X SkyWalking: Observable Chaos Engineering
author: Wang Ningxuan
date: 2021-11-29
summary: This article shows how to combine SkyWalking's event monitoring with Chaos Mesh to see, in real time, how chaos experiments affect application service performance.

tags: ['Chaos Mesh']

Chaos Mesh is an open source, cloud-native chaos engineering platform. With Chaos Mesh, users can easily inject faults into services and use Chaos Dashboard to monitor the status of the whole chaos experiment. However, monitoring the experiment itself does not tell us how the performance of the application service changes. From the perspective of system observability, the progress of the chaos experiment alone may not give us the full picture of a fault, which makes it harder to understand and debug both the system and the fault.

Apache SkyWalking is an open source APM (Application Performance Monitoring) system that provides monitoring, tracing, diagnostics, and other functions for cloud-native services. SkyWalking can collect events, so its dashboard shows which events have occurred in a distributed system and makes it easy to see how each event affects service performance. Combined with Chaos Mesh, it can monitor the service impact caused by a chaos experiment.

This tutorial shows how to combine SkyWalking's event monitoring with Chaos Mesh to understand the real-time impact of chaos experiments on the performance of application services.

Preparation

Step 1-Access the SkyWalking cluster

After installing SkyWalking, you can access its UI. Because there is no service to monitor yet, you need to add one and instrument it with the SkyWalking agent. This article uses the lightweight microservice framework Spring Boot as the instrumented service to build a simple demo environment.

You can create the demo-deployment.yaml file by referring to chaos-mesh-on-skywalking, then deploy it with kubectl apply -f demo-deployment.yaml -n skywalking. After a successful deployment, SkyWalking UI shows the service's monitoring data in real time.

Note: Because Spring Boot also listens on port 8080, avoid a conflict with SkyWalking's port when port forwarding. For example, use kubectl port-forward svc/spring-boot-skywalking-demo 8079:8080 -n skywalking.

Step 2-Deploy SkyWalking Kubernetes Event Exporter

SkyWalking Kubernetes Event Exporter watches the events in a Kubernetes cluster, filters them according to the conditions you set, and sends the matching events to the SkyWalking backend. This lets you see in SkyWalking when an event occurred in your Kubernetes cluster and how it affected the service's metrics. To deploy it from the command line, you can refer to this configuration to create a YAML file, set the filters and exporters parameters, and deploy it with kubectl apply.

Step 3-Use JMeter to put load on the service

To make the effect easier to observe, first increase the load on the Spring Boot service. This article uses JMeter, a widely used Java load testing tool, for this purpose. JMeter sends requests to host:8079 with 5 threads that apply continuous load.
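For readers without JMeter at hand, a similar continuous load can be approximated in a few lines of Python. The following is a minimal sketch, not the JMeter test plan used in this article. The run_load and http_get helpers are hypothetical names, and the URL http://localhost:8079/ is an assumption based on the port forwarding shown earlier, because the demo's endpoint path is not given here.

```python
import threading
import time
import urllib.request


def run_load(send, threads=5, duration_s=10.0):
    """Call send() in a loop from `threads` workers for `duration_s` seconds.

    Returns a tuple (ok, failed) with the number of calls that completed
    and the number that raised an exception.
    """
    deadline = time.monotonic() + duration_s
    lock = threading.Lock()
    counts = {"ok": 0, "failed": 0}

    def worker():
        while time.monotonic() < deadline:
            try:
                send()
                key = "ok"
            except Exception:
                key = "failed"
            with lock:  # protect the shared counters
                counts[key] += 1

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counts["ok"], counts["failed"]


def http_get(url="http://localhost:8079/"):
    """Send one GET request; the URL is an assumed demo endpoint."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        resp.read()
```

Calling run_load(http_get, threads=5, duration_s=60) keeps 5 threads busy for one minute, so the sum of the two returned counts is roughly the CPM generated by this client. Passing the request function as an argument keeps the loop independent of any particular HTTP client.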


The SkyWalking dashboard shows that the current request success rate is 100% and the service load is about 5300 CPM (Calls Per Minute).

Step 4-Inject faults with Chaos Mesh and observe the effect

With these preparations in place, you can use Chaos Dashboard to simulate stress scenarios and observe how service performance changes during the experiment.
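This article uses Chaos Dashboard, but a StressChaos experiment can also be written as a Kubernetes manifest. The sketch below builds such a manifest as a Python dict and prints it as JSON, which kubectl accepts as well as YAML. The experiment name, the app label selector, the worker counts, and the 5m duration are assumptions for illustration. Only the skywalking namespace and the 10% CPU and 128 MB memory values come from this article.

```python
import json


def build_stress_chaos(name, cpu_load, memory_size,
                       namespace="skywalking",
                       app_label="spring-boot-skywalking-demo",  # assumed label
                       duration="5m"):                           # assumed duration
    """Return a StressChaos manifest as a plain dict."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "StressChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "mode": "all",  # apply to every Pod matched by the selector
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "stressors": {
                # load is the CPU percentage each worker tries to occupy
                "cpu": {"workers": 1, "load": cpu_load},
                "memory": {"workers": 1, "size": memory_size},
            },
            "duration": duration,
        },
    }


if __name__ == "__main__":
    manifest = build_stress_chaos("cpu-10-mem-128", cpu_load=10, memory_size="128MB")
    print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and passing it to kubectl apply -f would create the experiment. Changing cpu_load to 50 or 100 gives the heavier variants.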

The following experiments use different StressChaos configurations to observe the corresponding changes in service performance:

  • The CPU load is 10%, and the memory load is 128 MB.

The switch on the right toggles the display of the start and end times of the chaos experiment in the graph. Hover over a short line to see whether it marks the experiment being Applied or Recovered. In the period between the two short green lines, the service's throughput dropped to 4929 CPM, and it returned to normal after the experiment ended.

  • When the CPU load is increased to 50%, the service load drops further, to 4307 CPM.

  • In the extreme case, the CPU load reaches 100% and the service load drops to 40% of its level without a chaos experiment.

Because the Linux process scheduler does not let a single process occupy the CPU all the time, the deployed Spring Boot demo can still handle 40% of the requests even in the extreme case of a fully loaded CPU.
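To compare the three experiments on one scale, the reported throughput figures can be expressed as a relative drop from the baseline. The short calculation below is only a worked example on the numbers quoted above. It treats the baseline of about 5300 CPM as exact and derives the 100% CPU figure from the stated 40%, so the results are approximate.

```python
BASELINE_CPM = 5300  # approximate throughput before any fault was injected


def relative_drop(baseline, observed):
    """Fraction of the baseline throughput lost during an experiment."""
    return (baseline - observed) / baseline


# Throughput reported for each experiment; the last value is derived
# from "40% of the chaos-free level", not read from the dashboard.
observed_cpm = {
    "10% CPU, 128 MB memory": 4929,
    "50% CPU": 4307,
    "100% CPU": 0.40 * BASELINE_CPM,
}

for label, cpm in observed_cpm.items():
    drop = relative_drop(BASELINE_CPM, cpm)
    print(f"{label}: {cpm:.0f} CPM, {drop:.1%} below baseline")
```

The drops come out at about 7.0%, 18.7%, and 60.0%, so the throughput loss grows much faster between 50% and 100% CPU load than between 10% and 50%.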

Summary

By combining SkyWalking and Chaos Mesh, we can clearly see when a service is affected by a chaos experiment and how it performs after the fault is injected. This makes it easy to observe how the service behaves under various extreme conditions, which strengthens our confidence in it.

Chaos Mesh has grown a lot in 2021. To learn more about users' experience with chaos engineering and to keep improving its support for them, the community has launched a Chaos Mesh user survey. Click the link to take part. Thank you!
https://www.surveymonkey.com/r/X78WQPC

Everyone is welcome to join the Chaos Mesh community: join the project-chaos-mesh channel in the CNCF Slack (slack.cncf.io) and take part in the discussion and development of the project. If you find bugs or missing features while using Chaos Mesh, you can also open an issue or PR directly at https://github.com/chaos-mesh

