Author: Fuyi, Fengyun
Why stress test?
With the spread of mobile devices and the rapid build-out of 5G, more and more online systems and mini programs have become indispensable tools in people's lives. They all face the same question: how many users can the system serve at the same time, and can it keep running stably through a sudden traffic peak?
To answer this question, the system needs several rounds of stress testing before it goes online. Simulating complex, realistic online traffic in advance verifies the high availability of the whole system, and is a key step in implementing any high availability solution. Stress testing at different stages also supports capacity planning and bottleneck detection, and checks the overall capability of the system, so that you know it can withstand real online pressure before the traffic peak arrives.
In a sense, stress testing is a verifier of system stability.
How to implement an accurate performance stress test
Prepare the stress test environment
The execution environment of the stress test is a frequently discussed topic. Running the stress test directly in the production environment causes two problems:
1. It affects online business and the users who are accessing the system normally.
2. It pollutes online data, because stress test data is written into the production database.
In order to solve these two problems, the following solutions are generally adopted in the industry:
The above schemes have their own advantages and disadvantages, and the applicable scenarios are also different. You can choose the scheme flexibly according to the stage of your own project.
Build a stress test script
Commonly used stress testing tools in the industry include JMeter, Gatling, Locust, k6, Tsung, and Alibaba Cloud PTS. Without exception, these tools require you to orchestrate the APIs of the service under test into a stress test script.
The focus of this step is to make sure that no API is omitted and that the order of the APIs follows the user's actual operations. Take a stress test of a health code service: if the login authentication API is left out of the script, subsequent APIs such as refreshing the health code and viewing nucleic acid reports fail at the permission check and never execute their normal business logic, so the test cannot simulate the real business scenario.
There are two ways to build scripts for these stress testing tools:
1. Write the script manually. This requires the script writer to be very familiar with the business so that no API is missed.
2. Record the script automatically. The open source tools above all provide a proxy for recording requests. After you enable and configure the proxy, you operate the page as a user would, and the requests are recorded and turned into a stress test script. PTS also provides a Chrome recording plug-in [1] that needs no proxy configuration and generates JMeter and PTS stress test scripts with one click, which makes scripting faster and ensures that no API is missed.
To avoid the risk of missing APIs in complex scripts, it is recommended to use the recording function to generate scripts.
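The effect of a missing login step can be illustrated with a small, self-contained sketch. FakeHealthCodeService is a hypothetical in-memory stand-in for the service under test, and the method names and status codes are assumptions made for illustration only; a real script would call the actual APIs through a stress testing tool.

```python
from dataclasses import dataclass, field


@dataclass
class FakeHealthCodeService:
    """Hypothetical in-memory stand-in for the service under test."""
    valid_tokens: set = field(default_factory=set)

    def login(self, user: str) -> str:
        token = f"token-{user}"
        self.valid_tokens.add(token)
        return token

    def refresh_health_code(self, token) -> int:
        # The permission check runs first; business logic runs only if it passes.
        return 200 if token in self.valid_tokens else 401

    def view_nucleic_acid_report(self, token) -> int:
        return 200 if token in self.valid_tokens else 401


def run_scenario(service, user, include_login=True):
    """Call the APIs in the user's order and return each step's status code."""
    token = service.login(user) if include_login else None
    return {
        "refresh_health_code": service.refresh_health_code(token),
        "view_nucleic_acid_report": service.view_nucleic_acid_report(token),
    }


complete = run_scenario(FakeHealthCodeService(), "alice")
missing_login = run_scenario(FakeHealthCodeService(), "alice", include_login=False)
print(complete)
print(missing_login)
```

In the second run every step stops at the permission check, so the test would measure the cost of rejecting requests rather than the real business logic.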
Confirm the pressure model
This step configures the peak pressure simulated in the test, the pressure distribution ratio across APIs, and the pressure increment model. The pressure value is either the number of simulated concurrent users or the number of requests sent per second.
Pressure mode
Before configuring these values, confirm the pressure mode. There are two main pressure modes in the industry:
1. Virtual user (VU) mode. A virtual user can be understood as a thread that simulates a real user. During the stress test the thread runs in a loop, so the simulated user keeps sending requests.
2. Throughput mode. The pressure is set as the number of requests per second (QPS), which directly measures the throughput of the server.
In the project acceptance stage, a very important indicator is the throughput of the system, that is, the QPS it can support. For this scenario, throughput mode is recommended: you can see directly how many requests the load generator sends per second, and that number maps directly to the throughput of the server.
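The two modes are linked by Little's law: in a closed loop, throughput is roughly the number of virtual users divided by the time each loop takes. The sketch below is only a back-of-the-envelope conversion under that steady-state assumption; the function names and numbers are illustrative and do not come from any tool.

```python
import math


def qps_from_virtual_users(virtual_users: int, avg_rt_s: float,
                           think_time_s: float = 0.0) -> float:
    """Approximate QPS produced by a fixed number of looping virtual users."""
    return virtual_users / (avg_rt_s + think_time_s)


def virtual_users_for_qps(target_qps: float, avg_rt_s: float,
                          think_time_s: float = 0.0) -> int:
    """Approximate number of virtual users needed to reach a target QPS."""
    return math.ceil(target_qps * (avg_rt_s + think_time_s))


# The same 200 virtual users produce very different pressure
# once the server slows down from 0.25 s to 0.5 s per request.
fast = qps_from_virtual_users(200, avg_rt_s=0.25)
slow = qps_from_virtual_users(200, avg_rt_s=0.5)
print(fast, slow)
print(virtual_users_for_qps(800, avg_rt_s=0.25))
```

This is why VU mode does not pin down the throughput: when response time rises, the same number of virtual users sends fewer requests per second, whereas throughput mode keeps the request rate as the controlled variable.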
Pressure distribution ratio of each API
After confirming the pressure mode, configure the pressure distribution ratio of the different APIs. For example, in the health code service, 100% of users call the APIs for logging in to the app and obtaining the health code, but not all users call the APIs for querying nucleic acid reports or viewing push notifications. An accurate pressure distribution ratio for each API is therefore indispensable to a successful stress test.
Pressure increment model
Common models are the pulse model, step increment, and uniform increment.
The pulse model simulates a sudden surge in traffic and is often used for scenarios such as flash sales and panic buying.
The incremental models simulate a continuous increase in the number of users over a period of time and are often used for scenarios with a warm-up phase.
In addition to the conventional models, the stress test should ideally support manual adjustment of the pressure while it is running. First, this lets you simulate unconventional traffic growth; second, you can adjust the pressure value repeatedly to reproduce and troubleshoot problems.
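The three models differ only in how the target pressure changes over time. As a minimal sketch, each function below returns the target pressure for every second of the test; the function names and the one-second granularity are assumptions made for illustration.

```python
def pulse(peak: int, duration_s: int) -> list:
    """Pulse model: full pressure from the first second."""
    return [peak] * duration_s


def step_increment(peak: int, steps: int, step_duration_s: int) -> list:
    """Step model: raise the pressure in equal steps, holding each level."""
    levels = []
    for i in range(1, steps + 1):
        levels.extend([peak * i // steps] * step_duration_s)
    return levels


def uniform_increment(peak: int, ramp_s: int) -> list:
    """Uniform model: raise the pressure by the same amount every second."""
    return [peak * (t + 1) // ramp_s for t in range(ramp_s)]


print(pulse(1000, 3))
print(step_increment(1000, steps=4, step_duration_s=2))
print(uniform_increment(1000, ramp_s=5))
```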
Geographical distribution of stress test traffic
After determining the pressure value and the increment model, also determine the geographical distribution of the stress test traffic, and match the real user distribution as closely as possible so that the test results are realistic.
For a regional online business, it is acceptable to place the load generators in a data center in the same region. For a nationwide online business, the load generators should follow the user distribution and be deployed in regions across the country.
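One simple way to follow the user distribution is to split a fixed number of load generators in proportion to each region's share of users. The sketch below uses the largest-remainder method; the region names and shares are made-up examples, and this is not how any particular tool schedules its machines.

```python
def allocate_generators(total: int, user_share: dict) -> dict:
    """Split `total` load generators across regions in proportion to user share."""
    share_sum = sum(user_share.values())
    exact = {region: total * share / share_sum for region, share in user_share.items()}
    allocation = {region: int(value) for region, value in exact.items()}
    # Hand the machines lost to rounding down to the regions
    # with the largest fractional remainders.
    leftover = total - sum(allocation.values())
    by_remainder = sorted(exact, key=lambda r: exact[r] - allocation[r], reverse=True)
    for region in by_remainder[:leftover]:
        allocation[region] += 1
    return allocation


# Hypothetical user shares, in percent.
shares = {"east": 50, "north": 30, "south": 20}
print(allocate_generators(10, shares))
print(allocate_generators(7, shares))
```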
Run the stress test and observe the indicators
Core indicators in the stress test: request success rate, request response time (RT), and system throughput (QPS).
For the request success rate, look not only at the overall success rate but also at the success rate of the core APIs, so that an acceptable overall rate does not hide an insufficient success rate on a core API.
For the request response time, check whether key percentiles such as the 99th, 95th, 90th, and 80th are in line with expectations. The average response time has little reference value, because the stress test needs to guarantee the experience of most users, and when the dispersion of the samples is unknown the average easily leads to misjudgment.
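A small made-up sample shows why the average is a poor guide. In the sketch below, 90 requests take 20 ms and 10 requests take 2000 ms; the percentile function uses the nearest-rank definition, which is one common convention among several.

```python
import statistics


def percentile(samples: list, p: int):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p% of n
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in milliseconds.
rt_ms = [20] * 90 + [2000] * 10

mean_rt = statistics.mean(rt_ms)
p90 = percentile(rt_ms, 90)
p95 = percentile(rt_ms, 95)
p99 = percentile(rt_ms, 99)
print(mean_rt, p90, p95, p99)
```

The average of 218 ms describes no real request: nine out of ten users see 20 ms, while the 95th and 99th percentiles expose the 2-second tail that one in ten users experiences.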
System throughput measures how much traffic the system can withstand, and it is an indispensable criterion in stress testing.
When these three indicators reach an inflection point, the system can be considered to have hit a performance bottleneck. You can then stop the test or reduce the pressure value and start analyzing and locating the performance problem.
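One simple way to spot such a point in step-by-step results is to look for the first pressure level where any of the three indicators leaves its acceptable range. The sketch below is an assumption-laden illustration: the Step fields, the thresholds, and the sample numbers are all hypothetical and should be replaced by your own acceptance criteria.

```python
from typing import NamedTuple, Optional


class Step(NamedTuple):
    target_qps: int      # pressure requested at this step
    actual_qps: float    # throughput actually achieved
    p99_ms: float        # 99th percentile response time
    success_rate: float  # fraction of successful requests


def find_inflection(steps, min_success=0.999, max_p99_ms=500.0,
                    min_achieved=0.95) -> Optional[Step]:
    """Return the first step where any core indicator breaks its threshold."""
    for step in steps:
        if (step.success_rate < min_success
                or step.p99_ms > max_p99_ms
                or step.actual_qps < min_achieved * step.target_qps):
            return step
    return None


# Hypothetical results from a step-increment test.
results = [
    Step(1000, 998.0, 80.0, 1.0),
    Step(2000, 1990.0, 95.0, 1.0),
    Step(3000, 2400.0, 900.0, 0.97),
    Step(4000, 2350.0, 2500.0, 0.81),
]
bottleneck = find_inflection(results)
print(bottleneck)
```

In this sample the system keeps up at 1000 and 2000 QPS, but at a target of 3000 QPS the throughput flattens while response time and errors rise, so that step is reported.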
In addition to these three business indicators, also watch the application, middleware, and hardware monitoring of the system at the same time, including but not limited to:
Server:
- Network throughput
- CPU usage
- Memory usage
- Disk throughput
- ......
Database:
- Number of connections
- SQL throughput
- Number of slow SQL statements
- Index hit rate
- Lock wait time
- Lock wait count
- ......
Middleware:
- JVM GC count
- JVM GC duration
- On-heap and off-heap memory usage
- Number of active threads in the Tomcat thread pool
- ......
For more indicators to watch during stress testing, see stress testing indicators [2].
If the system has met expectations, you can usually keep raising the pressure value in increments of 10-20% to probe the peak and observe the limit of the system, so that you know how much headroom it has.
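As a rough sketch of such a probing plan, the function below compounds a fixed percentage on top of the pressure that already passed. The baseline of 10,000 QPS and the function name are hypothetical.

```python
def probe_levels(baseline: int, step_percent: int = 10, rounds: int = 3) -> list:
    """Pressure levels for probing the limit, each step_percent above the last."""
    levels = []
    level = baseline
    for _ in range(rounds):
        level = level * (100 + step_percent) // 100
        levels.append(level)
    return levels


print(probe_levels(10000, step_percent=10, rounds=3))
print(probe_levels(10000, step_percent=20, rounds=2))
```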
Review and performance optimization
After the stress test is over, if the system fails to meet expectations, use monitoring to troubleshoot, locate, and analyze the performance problems. After the performance optimization is complete, verify it in the next round of stress testing.
Methods for analyzing and tuning problems found in the test are not described here; see test problem analysis and tuning [3].
If the system performance has met expectations, you can use the throughput figures obtained from the stress test to configure flow control, degradation, system protection, or isolation rules to ensure system stability.
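To make the idea concrete, here is a minimal sketch of a flow control rule derived from a measured peak. The fixed-window counter, the 80% safety margin, and the measured value are all assumptions made for illustration; production systems normally rely on a dedicated flow control component rather than hand-written code like this.

```python
class FixedWindowLimiter:
    """Allow at most max_qps requests in each one-second window."""

    def __init__(self, max_qps: int):
        self.max_qps = max_qps
        self.window = None
        self.count = 0

    def allow(self, now_s: float) -> bool:
        window = int(now_s)
        if window != self.window:
            # A new second has started: reset the counter.
            self.window, self.count = window, 0
        if self.count < self.max_qps:
            self.count += 1
            return True
        return False


measured_peak_qps = 5000                     # hypothetical stress test result
threshold = measured_peak_qps * 80 // 100    # keep 20% headroom (assumption)
limiter = FixedWindowLimiter(max_qps=threshold)
print(threshold)
```

The clock is passed in as an argument so the behavior is easy to verify: calls within the same second are counted together, and the counter resets when the second changes.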
Alibaba Cloud PTS - stress test resource packs for a worry-free system
Performance Testing Service (PTS) is a SaaS performance testing tool from Alibaba Cloud. It was created 10 years ago to accurately simulate the Double 11 traffic peak. Every year it supports tens of thousands of stress testing tasks across the group, including Double 11, and it is the "early verifier" of Alibaba's internal Double 11 technical architecture.
Technology Benefit 1 - Self-developed PTS stress testing engine, accurate pressure model, and excellent performance
The fully self-developed stress testing engine of PTS implements its concurrency model with better performance than the traditional thread model. It also supports throughput configuration at the API level, which is finer-grained than open source tools and can accurately simulate the traffic funnel model.
For example, suppose that in the real traffic model 100% of users call the login API, 80% call the refresh health code API, and 20% call the nucleic acid report API. Simulating this requires configuring the throughput (QPS) of each API; with a concurrency model, this scenario cannot be simulated.
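The per-API throughput for this funnel follows directly from the percentages. The sketch below only does the arithmetic; the API names and the 1000 QPS login rate are illustrative assumptions, not PTS configuration syntax.

```python
# Share of users that reach each API, in percent (from the example above).
funnel_percent = {
    "login": 100,
    "refresh_health_code": 80,
    "view_nucleic_acid_report": 20,
}


def per_api_qps(login_qps: int, funnel: dict) -> dict:
    """Target QPS of each API for a given QPS at the top of the funnel."""
    return {api: login_qps * percent // 100 for api, percent in funnel.items()}


targets = per_api_qps(1000, funnel_percent)
print(targets)
```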
Example of funnel model:
PTS also supports traffic recording on various clients, which lets you build stress test scripts quickly, and it supports fully GUI-based operation, which greatly lowers the barrier to building stress test scripts.
Technology Benefit 2 - Fully compatible with JMeter, with a JMeter plug-in available
While remaining fully compatible with JMeter, PTS adds many optimizations for JMeter distributed stress testing:
Optimization point 1: Globally distributed load generators that are available on demand and can support stress tests with millions of concurrent users and tens of millions of QPS;
Optimization point 2: Support for throughput mode, where you set a global target QPS and measure server performance more intuitively;
Optimization point 3: Support for adjusting the pressure during the test, so that you can flexibly tune concurrency or QPS and approach the performance limit step by step;
Optimization point 4: Support for browser plug-in recording with one-click export of JMeter scripts and no proxy configuration, which greatly reduces the work of building scripts;
Optimization point 5: For distributed stress testing, automatic file splitting and globally effective Timer and Controller components, so that distributed stress testing works out of the box;
Optimization point 6: A JMeter PTS plug-in that lets you start cloud distributed stress tests from the JMeter GUI client, connecting script debugging and execution seamlessly (see the JMeter plug-in usage guide [4] for details).
Technology Benefit 3 - VPC Intranet Stress Test
Before a full-scale formal stress test, key microservice applications should routinely be stress tested individually to find their local performance limits.
For services deployed on Alibaba Cloud, an individual microservice application usually has no public network entry point. In this case, the stress testing tool must be able to reach into the VPC intranet.
PTS supports VPC intranet stress testing: during the test it quickly connects the load generators to the user's VPC network so that intranet traffic flows smoothly, and after the test it closes the network channel immediately to ensure network security.
You only need to select the VPC, security group, and vSwitch of the microservice application in the stress test configuration to enable VPC intranet stress testing, so that you can measure the performance of your services without exposing a public network entry point.
An example of operation is as follows:
Technology Benefit 4 - Traffic Regional Customization
Business users are rarely distributed evenly across regions; on the contrary, the distribution is often very uneven. To simulate the real traffic distribution, load generators must be deployed in many locations, must support allocation by quantity and by region, and must support unified real-time scheduling during the test. If all load generators sit in one Region, or even one Availability Zone, they cannot simulate requests from users around the world.
When you use Alibaba Cloud Performance Testing Service (PTS), enable the traffic region customization feature and simply select regions to specify the regional distribution of the load generators. Currently, 22 regions around the world can be selected.
Technology Benefit 5 - Problem Diagnosis Tool
The purpose of stress testing is to find performance problems. In the stress test report, PTS provides statistics on abnormal status codes and sampled request logs, so that you can see the full details of each request and response. For requests with long response times, it also shows visually how much time the request spent in each stage.
For Java applications, PTS provides a problem diagnosis tool based on a Java agent. Simply attach the probe to a Java application to get second-level monitoring at the application, API, and machine dimensions automatically. For a failed request, you can locate the method stack of the failing method directly on the call chain, which saves a great deal of troubleshooting time and makes the tool a powerful "weapon" for locating problems.
An example of locating the method stack of an error is as follows:
Cost Concession 1 - JMeter resource pack launched
PTS has launched a dedicated JMeter resource pack, priced lower than the PTS stress test resource pack.
Cost Concession 2 - Better price for VPC intranet stress testing
PTS has launched a VPC intranet stress test resource pack: 10,000 concurrent users for 20 minutes, starting at only 29 yuan, which lowers the cost of routine intranet stress testing.
Cost Concession 3 - Yearly and monthly packages, 25% off for a limited time
Annual and monthly resource packs are 25% off for a limited time. VUM is not metered during the subscription period, so these packs suit users who run stress tests frequently.
Cost Concession 4 - Customized Resource Pool
For long-running, high-concurrency stress tests, a custom resource pool is recommended. With more than 20 load generators and continuous testing for 1 hour, the bill is equivalent to 40% of the normal stress testing price, so users with long, high-concurrency tests pay less.
To buy as needed, go to the PTS resource pack purchase page [5].
Related Links
[1] Instructions for using the Chrome recording plugin:
https://help.aliyun.com/document_detail/187749.html
[2] Stress testing indicators:
https://help.aliyun.com/document_detail/29338.html
[3] Test problem analysis and tuning:
https://help.aliyun.com/document_detail/29342.html
[4] JMeter plugin usage guide:
https://help.aliyun.com/document_detail/379921.html
[5] PTS product purchase page:
https://common-buy.aliyun.com/?commodityCode=ptsbag