Author: Fuyi
What is observability in performance stress testing
Observability has three dimensions: Metrics, Traces, and Logs. It helps us quickly troubleshoot and locate problems in complex distributed systems, and is an essential O&M capability for them.
In performance stress testing, observability matters even more. Besides helping to locate performance problems, the metrics directly determine whether a stress test passes, which in turn decides whether the system can go live. The details are as follows:
• Metrics (monitoring indicators)
System performance indicators: request success rate, system throughput, and response time.
Resource performance indicators: measure the usage of software and hardware resources and, together with the system performance indicators, show how heavily the system's resources are loaded.
• Logs
Stress engine log: shows whether the stress engine is healthy and whether the stress test script ran into errors.
Sampling log: records sampled API request and response details, so that you can check whether the parameters of failed requests were correct and read the complete error message in the response.
• Traces (distributed tracing)
Used in the performance diagnosis phase. By tracing the call chain of a request through the system, you can identify which system reported the error for a failing API, along with its error stack, and quickly locate the performance problem.
This article describes how to use Prometheus to make stress test metrics observable.
Core indicators of stress test monitoring
System performance indicators
The three most important indicators in stress test monitoring are request success rate, service throughput (TPS), and request response time (RT). If any of the three shows an inflection point as load increases, the system can be considered to have reached a performance bottleneck.
Response time deserves special attention. Judging it by the average is misleading, because response times are not evenly distributed and often have a long tail: some requests take especially long even though the overall average meets expectations. Those slow requests hurt real users, so such a result should not count as passing the test. For this reason, the 99th, 95th, and 90th percentiles are commonly used to judge whether response time meets the standard.
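To make the long-tail point concrete, here is a minimal sketch that compares the average with the percentiles for one set of response times. The sample data is invented for illustration, and the `percentile` helper uses the nearest-rank definition, which is only one of several common percentile definitions.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in milliseconds: 95 fast requests, 5 slow ones.
response_times_ms = [50] * 95 + [3000] * 5

mean_ms = statistics.mean(response_times_ms)
p90_ms = percentile(response_times_ms, 90)
p95_ms = percentile(response_times_ms, 95)
p99_ms = percentile(response_times_ms, 99)

# prints: mean=197.5 ms  p90=50 ms  p95=50 ms  p99=3000 ms
print(f"mean={mean_ms:.1f} ms  p90={p90_ms} ms  p95={p95_ms} ms  p99={p99_ms} ms")
```

With a 200 ms target, the average alone would pass this run, while the 99th percentile shows that some users wait 3 seconds.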
In addition, to see how the response time breaks down, you can add indicators such as connection time (Connect Time) and time spent waiting for the response (Idle Time).
Resource Performance Metrics
During a stress test, it is also important to monitor system hardware, middleware, and database resources, including but not limited to:
• CPU usage
• Memory usage
• Disk throughput
• Network throughput
• Database connections
• Cache hit ratio
…
For details, see the article "Test Indicators" [1].
Stress test machine performance indicators
In a stress test, the performance of the stress test machine (the load generator) itself is easily overlooked. To make sure it is not the bottleneck of the whole test, pay attention to the following indicators:
• Memory usage of the stress testing process
• CPU usage and the Load1 and Load5 load averages of the stress test machine
• For a JVM-based stress testing engine, the number and duration of garbage collections
## Why use Prometheus for stress test monitoring
Open source stress testing tools such as JMeter support simple system performance monitoring on their own, covering indicators such as request success rate, system throughput, and response time. For large-scale distributed stress testing, however, this native monitoring has the following shortcomings:
- The monitoring indicators are not comprehensive enough. They generally cover only basic system performance indicators, which is enough to decide whether a stress test passes. If the test fails and you need to troubleshoot, for example by analyzing the 99th percentile connection establishment time of an API, the native indicators cannot help.
- Timely aggregation is not guaranteed
- Large-scale distributed aggregation of monitoring data is not supported
- Monitoring metrics cannot be reviewed historically along a timeline
In summary, in large-scale distributed stress testing, native monitoring of open source stress testing tools is not recommended.
The following compares two open source monitoring solutions:
Option 1: Zabbix
Zabbix is an early open source distributed monitoring system that stores its data in a relational database such as MySQL or PostgreSQL.
For system performance monitoring, the stress test machines need to report indicators at one-second granularity. Writing this many indicators every second under high concurrency makes the relational database the bottleneck of the monitoring system.
For resource performance monitoring, Zabbix has comprehensive indicators for physical and virtual machines, but its support for containers and elastic computing is limited.
Option 2: Prometheus
Prometheus stores its data in a time series database, whose read and write performance for this kind of data is much better than that of a traditional relational database. It performs well when stress test machines report large volumes of second-level monitoring data.
For resource performance monitoring, Prometheus is better suited to cloud resources, especially Kubernetes and containers.
To sum up, compared with Zabbix, Prometheus is better suited to collecting and aggregating high-concurrency monitoring indicators in stress testing and to monitoring cloud resources, and it is easy to extend.
Mature cloud products are also a good choice, for example the stress testing tool PTS [2] combined with the observability tool ARMS [3]. PTS provides system performance indicators during the stress test, and ARMS provides resource monitoring and overall observability, so together they cover stress test observability in one place.
How to use Prometheus to implement stress monitoring
Adapting open source JMeter
Prometheus uses a pull model, so the stress test engine needs to expose an HTTP service from which Prometheus can scrape the stress test indicators.
JMeter provides a plug-in mechanism, so you can write a custom plug-in that adds Prometheus monitoring. The plug-in extends JMeter's BackendListener to update each stress test indicator, such as the number of successful requests, the number of failed requests, and the request response time, whenever a sampler finishes executing. The indicators are kept in memory and exposed through the HTTP service when Prometheus pulls data. The overall structure is as follows:
The custom JMeter plug-in needs the following pieces:
- Add a metrics registry
- Implement a Prometheus metrics updater
- Implement a custom JMeter BackendListener that calls the Prometheus updater after each sampler execution ends
- Implement an HTTP server, adding authentication logic if security requires it
PTS Stress Test Tool
Performance Testing Service (PTS) is Alibaba Cloud's SaaS performance testing tool. PTS supports both its self-developed stress test engine and open source JMeter, and it can expose the stress test indicators to Prometheus directly. You do not need to develop a custom plug-in or modify the engine; three steps in the console are enough.
Specific steps are as follows:
- In the advanced settings of the PTS stress test, turn on the [Prometheus] switch
- After the stress test starts, copy the Prometheus configuration with one click in [Monitor Export]
- Paste the configuration into your self-hosted Prometheus and hot-reload it so that it takes effect
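For the hot-reload step, Prometheus can reload its configuration without a restart through an HTTP `POST` to `/-/reload`, provided the server was started with `--web.enable-lifecycle` (sending `SIGHUP` to the process is the other option). The sketch below assumes a self-hosted Prometheus at `http://localhost:9090`; that address is a placeholder, not something the PTS console dictates.

```python
import urllib.request


def build_reload_request(base_url):
    """Build the HTTP request that asks Prometheus to reload its config."""
    return urllib.request.Request(base_url.rstrip("/") + "/-/reload", method="POST")


def reload_prometheus(base_url, timeout=5):
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status


# Only build and show the request here; call reload_prometheus() to send it.
request = build_reload_request("http://localhost:9090")
print(request.get_method(), request.full_url)  # POST http://localhost:9090/-/reload
```

If the lifecycle endpoint is not enabled, `reload_prometheus` raises `urllib.error.HTTPError` instead of returning a status code.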
For details, see "How to export the indicator data of PTS stress measurement to Prometheus" [4].
Quickly build a Grafana monitoring dashboard
PTS provides an official Grafana dashboard template [5]. It can be imported with one click and then freely edited and extended to meet your own monitoring needs.
This dashboard provides data such as global request success rate, system throughput (TPS), 99th, 95th, and 90th percentile response time, and the number of error requests aggregated by error status codes.
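The source does not say how the dashboard computes these percentiles. When response times are exported as a Prometheus histogram (an assumption here), percentiles are usually estimated from the cumulative buckets with the PromQL function `histogram_quantile`. The sketch below is a simplified re-implementation of that estimate for illustration, with invented bucket data; it ignores edge cases such as negative bucket bounds.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound and
    ending with the +Inf bucket, mirroring Prometheus "le" buckets.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if index == len(buckets) - 1:
        # The quantile falls into the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[index]
    lower, prev_count = (0.0, 0) if index == 0 else buckets[index - 1]
    if count == prev_count:
        return upper
    # Linear interpolation: assume observations are spread evenly in the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical response-time buckets in seconds: (le, cumulative request count).
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
p99 = histogram_quantile(0.99, buckets)
print(f"p95={p95:.3f}s p99={p99:.3f}s")  # p95=0.778s p99=1.000s
```

Because the estimate interpolates inside a bucket, its accuracy depends on how finely the bucket boundaries are chosen around the response times you care about.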
In the API distribution section, you can compare the monitoring indicators of each API side by side and quickly spot APIs with poor performance.
In the API details section, you can view the detailed metrics of a single API to pinpoint performance bottlenecks.
In addition, the dashboard provides JVM garbage collection indicators for the stress test machine, which help determine whether the stress test machine itself is the performance bottleneck in the test.
The import steps are as follows:
Step 1
On the menu bar, click Import under Dashboard:
Step 2
Enter the ID of the PTS dashboard: 15981
In the Prometheus field, select your existing data source (in this example it is named Prometheus), then click Import.
Step 3
After the import, select the stress test task to monitor under [PTS Stress Test Task] in the upper left corner, and the monitoring dashboard for that task is displayed.
The task name corresponds to the jobname in the Prometheus configuration under Monitor Export in the PTS console.
Summary
This article covered:
- What observability means in performance stress testing
- Why to use Prometheus for monitoring stress test performance indicators
- How to implement Prometheus-based stress test monitoring with open source JMeter and with PTS on the cloud
The PTS feature for exporting stress test monitoring data to Prometheus is currently in free public beta, and you are welcome to try it.
Related Links
[1] Test indicators
https://help.aliyun.com/document_detail/29338.html
[2] Stress testing tool PTS
https://www.aliyun.com/product/pts
[3] Observable tool ARMS
https://www.aliyun.com/product/arms
[4] How to export the indicator data of PTS stress measurement to Prometheus
https://help.aliyun.com/document_detail/416784.html
[5] Official Grafana template
https://grafana.com/grafana/dashboards/15981