How to use Prometheus to implement observability for performance stress testing in the cloud native era: Metrics - Alibaba Cloud Native

Author: Fuyi

What is observability for performance stress testing


Observability has three dimensions: Metrics, Traces, and Logs. It helps us quickly troubleshoot and locate problems in complex distributed systems, and is an essential O&M capability for them.

In performance stress testing, observability matters even more. Besides helping to locate performance problems, the Metrics indicators directly determine whether a stress test passes, which in turn decides whether the system can launch. The details are as follows:

• Metrics: monitoring indicators

System performance indicators, including request success rate, system throughput, and response time.

Resource performance indicators, which measure the usage of system software and hardware resources and are read together with the system performance indicators to observe the resource water level.

• Logs

Load generator engine logs, used to check whether the engine is healthy and whether the stress test script hit errors during execution.

Sampling logs, which sample and record API request and response details. They help check whether the parameters of failed requests were correct during the stress test, and the response details show the complete error message.

• Traces: distributed tracing, used in the performance diagnosis phase. By tracing the call chain of a request through the system, you can identify which system reported the error for a failing API, together with its error stack, and so locate performance problems quickly.

This article describes how to use Prometheus to make the metrics of a performance stress test observable.

Core indicators for stress test monitoring

System performance indicators

The three most important indicators in stress test monitoring are request success rate, service throughput (TPS), and request response time (RT). If any of these three shows an inflection point, the system can be considered to have reached a performance bottleneck.
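As a rough illustration of how these indicators can be derived from raw request records, here is a minimal Python sketch. The `Sample` record shape and the function name are assumptions made for this example; a real load generator will have its own sample format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    """One finished request (hypothetical record shape for this sketch)."""
    end_time: float    # Unix timestamp in seconds when the response arrived
    elapsed_ms: float  # response time in milliseconds
    success: bool      # True if the request was judged successful


def summarize_per_second(samples):
    """Group samples by the second they finished in and summarize each group."""
    groups = defaultdict(list)
    for s in samples:
        groups[int(s.end_time)].append(s)

    summary = {}
    for second, group in sorted(groups.items()):
        ok = sum(1 for s in group if s.success)
        summary[second] = {
            "tps": len(group),                 # requests completed in this second
            "success_rate": ok / len(group),   # fraction of successful requests
            "max_rt_ms": max(s.elapsed_ms for s in group),
        }
    return summary


# Illustrative data only.
samples = [
    Sample(100.1, 20.0, True),
    Sample(100.5, 30.0, True),
    Sample(100.9, 40.0, False),
    Sample(101.2, 25.0, True),
]
summary = summarize_per_second(samples)
print(summary[100])
```

Plotting these per-second values against the applied load is one way to spot the inflection point: the second at which TPS stops rising or the success rate starts to fall.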

Response time deserves a special note. Judging it by the average is very misleading, because a system's response times are not evenly distributed and often have a long tail: some user requests take particularly long while the overall average still meets expectations. Those users' experience is affected, so such a run should not be judged as passing. For this reason, the 99th, 95th, and 90th percentiles are usually used to judge whether response time meets the standard.
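The effect of a long tail is easy to see with a small worked example. The Python sketch below uses made-up data (95 fast requests and 5 very slow ones) and the nearest-rank definition of a percentile; monitoring systems may use other percentile estimators.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Illustrative data only: 95 requests at 50 ms and 5 requests at 2000 ms.
response_times_ms = [50] * 95 + [2000] * 5

average = sum(response_times_ms) / len(response_times_ms)
p90 = percentile(response_times_ms, 90)
p95 = percentile(response_times_ms, 95)
p99 = percentile(response_times_ms, 99)
print(average, p90, p95, p99)
```

With this data the average is 147.5 ms, which could look acceptable, while the 99th percentile is 2000 ms: one request in twenty is forty times slower than the rest, and only the high percentile exposes it.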

In addition, to observe how the request response time breaks down, you can add indicators such as connection time (Connect Time) and time spent waiting for the response (Idle Time).

Resource Performance Metrics

During a stress test, it is also important to monitor system hardware, middleware, and database resources, including but not limited to:

• CPU usage
• Memory usage
• Disk throughput
• Network throughput
• Database connections
• Cache hit ratio
• …

For details, see the article "Test Indicators" [1].

Load generator performance indicators

In a stress test pipeline, the performance of the load generator itself is easily overlooked. To make sure the load generator is not the bottleneck of the whole pipeline, watch the following indicators:

• Memory usage of the stress testing process
• CPU usage and the Load1 and Load5 load averages of the load generator
• For a JVM-based stress testing engine, the number and duration of garbage collections
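A minimal sketch of collecting the first two kinds of indicators on the load generator host is shown below. It is written in Python purely for illustration and assumes a Unix-like system; a JVM-based engine would read the equivalent values, plus garbage collection counts and durations, from the JVM's own management interfaces. The function name and the returned keys are made up for this example.

```python
import os
import resource  # Unix only


def load_generator_health():
    """Snapshot of the host load averages and this process's peak memory."""
    load1, load5, _load15 = os.getloadavg()  # Unix only
    cpus = os.cpu_count() or 1
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return {
        "load1": load1,
        "load5": load5,
        # Rule of thumb, not a hard limit: a sustained value near or above 1.0
        # per CPU suggests the load generator itself is running out of CPU.
        "load1_per_cpu": load1 / cpus,
        "peak_rss": peak_rss,  # kilobytes on Linux, bytes on macOS
    }


health = load_generator_health()
print(health)
```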

## Why use Prometheus for stress monitoring

Open source stress testing tools such as JMeter support simple system performance monitoring on their own, covering indicators such as request success rate, system throughput, and response time. For large-scale distributed stress testing, however, their native monitoring has the following shortcomings:

• The monitoring indicators are not comprehensive enough. They generally cover only basic system performance indicators and can only tell whether the stress test passed. If the test fails and you need to troubleshoot, for example by analyzing the 99th percentile connection establishment time of an API, the native indicators cannot do it.
• Aggregation timeliness is not guaranteed.
• Large-scale distributed aggregation of monitoring data is not supported.
• Monitoring indicators cannot be traced back along the timeline.

In summary, in large-scale distributed stress testing, native monitoring of open source stress testing tools is not recommended.

The following compares two open source monitoring solutions:

Option 1: Zabbix

Zabbix is an early open source distributed monitoring system that stores its data in a relational database such as MySQL or PostgreSQL.

For system performance monitoring, the load generators must provide second-level indicators, and writing monitoring data at high concurrency every second makes the relational database the bottleneck of the monitoring system.

For resource performance monitoring, Zabbix has comprehensive indicators for physical and virtual machines, but its monitoring support for containers and elastic computing is insufficient.

Option 2: Prometheus

Prometheus stores its data in a time series database. Compared with a traditional relational database, read and write performance is much better, so it copes well when load generators report large volumes of second-level monitoring data.

For resource performance monitoring, Prometheus is more suitable for monitoring cloud resources, especially for Kubernetes and containers.

To sum up, compared with Zabbix, Prometheus is better suited to collecting and aggregating high-concurrency monitoring indicators in stress testing, better suited to monitoring cloud resources, and easy to extend.

Of course, mature cloud products are also a good choice, for example the stress testing tool PTS [2] paired with the observability tool ARMS [3]. PTS provides system performance indicators during the stress test, and ARMS provides resource monitoring and overall observability, so together they cover stress test observability in one stop.

How to use Prometheus to implement stress monitoring

Open Source JMeter Retrofit

Prometheus uses a pull model, so the stress test engine needs to expose an HTTP service from which Prometheus fetches the stress test indicators.
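To make the pull model concrete, here is a minimal sketch of such an HTTP service using only the Python standard library. It is an illustration, not the JMeter plug-in itself: the metric names, the port, and the helper functions are assumptions made for this example. The response body follows the Prometheus text exposition format.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory counters updated by the load generator; the names are hypothetical.
_lock = threading.Lock()
_counters = {"demo_requests_success_total": 0, "demo_requests_failed_total": 0}


def record(success):
    """Call once per finished request."""
    key = "demo_requests_success_total" if success else "demo_requests_failed_total"
    with _lock:
        _counters[key] += 1


def render_metrics():
    """Render the counters in the Prometheus text exposition format."""
    with _lock:
        snapshot = dict(_counters)
    lines = []
    for name, value in sorted(snapshot.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9270):
    """Block and serve /metrics; the port number is an arbitrary choice."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()


record(True)
record(True)
record(False)
print(render_metrics(), end="")
```

Calling `serve()` starts the endpoint; Prometheus would then be configured to scrape `/metrics` on that port at its scrape interval. Production code would normally use an official Prometheus client library rather than hand-rendering the format.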

JMeter provides a plug-in mechanism, so you can write a custom plug-in that adds Prometheus monitoring. The plug-in extends JMeter's BackendListener so that each stress test indicator, such as the number of successful requests, the number of failed requests, and the request response time, is updated when a sampler finishes executing. The indicators are kept in memory and exposed through the HTTP service when Prometheus pulls data. The overall structure is as follows:
[Figure: overall structure of the JMeter plug-in and Prometheus]

The custom JMeter plug-in needs the following changes:

• Add a metrics registry
• Extend the Prometheus metrics updater
• Implement a custom JMeter BackendListener that calls the Prometheus updater after each sampler finishes
• Implement an HTTP server, adding authentication logic if security requires it
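The registry, updater, and listener can be sketched as follows. This is a language-neutral illustration written in Python, not JMeter plug-in code: the class names, metric names, and bucket bounds are all assumptions, and `PrometheusListener.handle_sample_results` only loosely stands in for the callback a real JMeter backend listener receives. The HTTP server would simply return the output of `render()`.

```python
import threading


class MetricRegistry:
    """In-memory store for request counters and a response-time histogram."""

    def __init__(self, bounds_ms=(50, 100, 200, 500, 1000)):
        self._lock = threading.Lock()
        self._bounds = tuple(sorted(bounds_ms))  # bucket upper bounds (ms)
        self._requests = {}   # (api, status) -> count
        self._buckets = {}    # api -> cumulative counts, one per bound
        self._rt_sum = {}     # api -> sum of response times (ms)
        self._rt_count = {}   # api -> number of observations

    def observe(self, api, success, elapsed_ms):
        status = "success" if success else "failed"
        with self._lock:
            self._requests[(api, status)] = self._requests.get((api, status), 0) + 1
            counts = self._buckets.setdefault(api, [0] * len(self._bounds))
            for i, bound in enumerate(self._bounds):
                if elapsed_ms <= bound:  # buckets are cumulative
                    counts[i] += 1
            self._rt_sum[api] = self._rt_sum.get(api, 0.0) + elapsed_ms
            self._rt_count[api] = self._rt_count.get(api, 0) + 1

    def render(self):
        """Prometheus text exposition format, to be served over HTTP."""
        with self._lock:
            lines = ["# TYPE demo_requests_total counter"]
            for (api, status), n in sorted(self._requests.items()):
                lines.append(f'demo_requests_total{{api="{api}",status="{status}"}} {n}')
            lines.append("# TYPE demo_response_time_ms histogram")
            for api in sorted(self._buckets):
                for bound, n in zip(self._bounds, self._buckets[api]):
                    lines.append(f'demo_response_time_ms_bucket{{api="{api}",le="{bound}"}} {n}')
                total = self._rt_count[api]
                lines.append(f'demo_response_time_ms_bucket{{api="{api}",le="+Inf"}} {total}')
                lines.append(f'demo_response_time_ms_sum{{api="{api}"}} {self._rt_sum[api]}')
                lines.append(f'demo_response_time_ms_count{{api="{api}"}} {total}')
        return "\n".join(lines) + "\n"


class MetricsUpdater:
    """Translates one sample result into registry updates."""

    def __init__(self, registry):
        self._registry = registry

    def update(self, label, success, elapsed_ms):
        self._registry.observe(label, success, elapsed_ms)


class PrometheusListener:
    """Stand-in for a JMeter backend listener: receives finished samples."""

    def __init__(self, updater):
        self._updater = updater

    def handle_sample_results(self, results):
        for label, success, elapsed_ms in results:
            self._updater.update(label, success, elapsed_ms)


registry = MetricRegistry()
listener = PrometheusListener(MetricsUpdater(registry))
listener.handle_sample_results(
    [("/login", True, 80), ("/login", True, 450), ("/login", False, 1200)]
)
print(registry.render(), end="")
```

Recording response times as a histogram rather than a single average is what later allows percentiles such as the 99th to be estimated on the Prometheus side.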

PTS Stress Test Tool

Performance Testing Service (PTS) is Alibaba Cloud's SaaS performance testing tool. PTS supports its self-developed stress test engine as well as open source JMeter stress tests, and exposes the stress test indicators to Prometheus. You do not need to develop a custom plug-in to retrofit the engine; three steps in the console are enough.
The specific steps are as follows:

• In the advanced settings of the PTS stress test, turn on the [Prometheus] switch
• After the stress test starts, copy the Prometheus configuration with one click in [Monitor Export]
• Paste this configuration into your self-built Prometheus and hot reload it for it to take effect
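For the hot reload step, one option is Prometheus's HTTP reload endpoint. The Python sketch below assumes a self-built Prometheus reachable at `http://localhost:9090` (an assumption for this example) and started with the `--web.enable-lifecycle` flag, which is required for `POST /-/reload` to be accepted. The helper names are made up.

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build the POST request that asks Prometheus to re-read its configuration file."""
    url = f"{base_url.rstrip('/')}/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url="http://localhost:9090", timeout=5):
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status


request = build_reload_request()
print(request.get_method(), request.full_url)
```

Calling `reload_prometheus()` after pasting the copied configuration into the Prometheus configuration file applies it without a restart. Sending `SIGHUP` to the Prometheus process is an alternative that does not need the lifecycle flag.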

For details, see "How to export the indicator data of PTS stress measurement to Prometheus" [4].

Quickly build a Grafana monitoring dashboard

PTS provides the official Grafana dashboard template [5], which supports one-click import of monitoring dashboards, and can be flexibly edited and expanded to meet your customized monitoring needs.

This dashboard provides data such as global request success rate, system throughput (TPS), 99th, 95th, and 90th percentile response time, and the number of error requests aggregated by error status codes.
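The queries behind the official dashboard are not shown in this article. As a rough idea of how a percentile panel can be fed, the Python sketch below builds a PromQL expression with `histogram_quantile` and the corresponding Prometheus instant query URL. It assumes the response time is exported as a histogram; the metric name `demo_response_time_ms` and the server address are hypothetical and are not the names PTS uses.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def quantile_query(metric, quantile, window="1m"):
    """PromQL that estimates a response-time quantile from histogram buckets."""
    return f"histogram_quantile({quantile}, sum(rate({metric}_bucket[{window}])) by (le))"


def instant_query_url(base_url, promql):
    """URL of the Prometheus HTTP API instant query endpoint for this expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


query = quantile_query("demo_response_time_ms", 0.99)
url = instant_query_url("http://localhost:9090", query)
print(query)
print(url)
```

Replacing `0.99` with `0.95` or `0.9` gives the other two percentiles; in Grafana the same expression would be entered directly as the panel query.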

In the API distribution column, you can intuitively compare the monitoring indicators of each API, and quickly locate APIs with poor performance.

In the API details column, you can view the detailed metrics of a single API to pinpoint performance bottlenecks.

In addition, the dashboard also provides the JVM garbage collection indicators of the load generator, which help determine whether the load generator is the performance bottleneck in the stress test pipeline.

The import steps are as follows:

Step 1

On the menu bar, click Import under Dashboard.

Step 2

Enter the ID of the PTS dashboard: 15981

In the Prometheus field, select your existing data source; in this example the data source is named Prometheus. Then click Import.

Step 3

After the import, select the stress test task to monitor under [PTS Stress Test Task] in the upper left corner to see its monitoring dashboard.

The task name corresponds to the jobname in the Prometheus configuration under [Monitor Export] in the PTS console.

Summary

This article describes:

• What observability for performance stress testing is
• Why to use Prometheus for monitoring stress test performance indicators
• How to implement Prometheus-based stress test monitoring with open source JMeter and with PTS on the cloud

The PTS feature for exporting stress test monitoring data to Prometheus is currently in free public beta. You are welcome to try it.

At the same time, a new PTS pricing model is on its way, and the price of the basic version will drop by 50%! One million concurrent transactions cost only 6200! There are also a 0.99 trial version for new users and an exclusive version for VPC stress tests. You are welcome to buy!
