
In this module, I will walk through several commonly used parts of monitoring. As we mentioned earlier, the performance monitoring map contains many points that need to be monitored: operating systems, application servers, middleware, queues, caches, databases, networks, front ends, load balancers, web servers, storage, and code. Obviously, one column cannot cover all of these monitoring points in detail. I can only pick the most commonly used ones, explain the underlying logic, and describe concrete implementations. If you encounter other components, you will need to implement their monitoring in the same way, one by one.

In this article, I mainly want to explain the monitoring logic in the diagram below.

[Image: monitoring architecture diagram]

This should be the most popular monitoring stack right now. Today I will talk about the common way of displaying data with Grafana, Prometheus, InfluxDB, and Exporters. If you have just entered the field of performance testing, this will also give you an intuitive feel for it.

Only with both load-generation tools and monitoring tools in place can we do the subsequent performance analysis and bottleneck locating, so it is necessary to walk through the logic of these tools with you.

Anyone who does performance work should know that no matter how the data is displayed, the most important thing is to understand the source and meaning of the data in order to make correct judgments.

Let me first explain the data display logic from JMeter and node_exporter to Grafana. For other Exporters I will not repeat this logic, and will only cover the monitoring and analysis parts.

Data display logic of JMeter+InfluxDB+Grafana

Under normal circumstances, when we use JMeter for stress testing, we usually view the results in the JMeter console, as shown below:

[Image: JMeter console results]

Or install a plug-in to see the results:

[Image: JMeter listener plug-in results]

Or use JMeter to generate an HTML report:

[Image: JMeter HTML report]

There is nothing wrong with these approaches. We have also emphasized before that for a load-generation tool, we care about at most three curves: TPS (where T is defined by the test target), response time, and error rate. The error rate curve is only there to assist troubleshooting; when there is no problem, just watch TPS and response time.

However, there are several problems with the above three methods:
  • Organizing the results afterwards wastes time.
  • Watching curves through GUI plug-ins is not realistic when you need to drive high concurrency.
  • When a scenario runs for a long time, generating the HTML report consumes too much memory, and the generated report contains many graphs we do not actually care about.
  • Checking saved results afterwards is troublesome; you have to dig them out one by one.
So how do we solve these problems?

Using JMeter's Backend Listener to send data to InfluxDB or Graphite in real time solves these problems.

Graphite Backend Listener support was added in JMeter 2.13, and InfluxDB Backend Listener support was added in JMeter 3.3. Both send data asynchronously for viewing.

In fact, once JMeter sends its data to InfluxDB, we no longer need the HTML report above, yet we can still see the system's performance trends visually.

And data saved this way is easier to go back to and compare after the test is over.

The structure of JMeter+InfluxDB+Grafana is as follows:

[Image: JMeter + InfluxDB + Grafana architecture]

In this structure, while JMeter applies load to the server, it also collects statistics such as TPS, response time, number of threads, and error rate. By default, the results are printed to the console every 30 seconds (controlled by the summariser.interval=30 property in jmeter.properties).
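For reference, the relevant summariser settings in jmeter.properties look roughly like this (they ship commented out; uncomment and change the value if you want a different console output interval):

#summariser.name=summary
# interval between summaries (in seconds), default 30 seconds
#summariser.interval=30
# write messages to log file
#summariser.log=true
# write messages to System.out
#summariser.out=true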

After configuring the Backend Listener, the statistics will be sent to InfluxDB asynchronously. Finally, configure the InfluxDB data source and JMeter display template in Grafana.

Then you can view the test results of JMeter in real time. The data seen here is the same as the data on the console.

But if the article stopped at this simple setup, it would not be worth much. So let's talk about the logic of how the data is transmitted and displayed.

Configuration of Backend Listener in JMeter

Below we will look at the InfluxDB Backend Listener. Its configuration is fairly simple: just add it to the test script.

[Image: InfluxDB Backend Listener configuration]

First we configure the influxdbUrl, application, and other fields. The application field can be treated as the scenario name.
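As a rough sketch, the fields are typically filled in like this (the URL is a placeholder for your own InfluxDB address; 7ddemo is the application name used in the examples later in this article):

influxdbMetricsSender  org.apache.jmeter.visualizers.backend.influxdb.HttpMetricsSender
influxdbUrl            http://influxdb-host:8086/write?db=jmeter
application            7ddemo
measurement            jmeter
summaryOnly            false
samplersRegex          .*
percentiles            90;95;99
testTitle              Test Cycle1
eventTags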

So how does JMeter send the data to InfluxDB? Take a look at the key code in the source, shown below:

private void addMetrics(String transaction, SamplerMetric metric) {
        // FOR ALL STATUS
        addMetric(transaction, metric.getTotal(), metric.getSentBytes(), metric.getReceivedBytes(), TAG_ALL, metric.getAllMean(), metric.getAllMinTime(),
                metric.getAllMaxTime(), allPercentiles.values(), metric::getAllPercentile);
        // FOR OK STATUS
        addMetric(transaction, metric.getSuccesses(), null, null, TAG_OK, metric.getOkMean(), metric.getOkMinTime(),
                metric.getOkMaxTime(), okPercentiles.values(), metric::getOkPercentile);
        // FOR KO STATUS
        addMetric(transaction, metric.getFailures(), null, null, TAG_KO, metric.getKoMean(), metric.getKoMinTime(),
                metric.getKoMaxTime(), koPercentiles.values(), metric::getKoPercentile);


        metric.getErrors().forEach((error, count) -> addErrorMetric(transaction, error.getResponseCode(),
                    error.getResponseMessage(), count));
    }

As this code shows, for the overall statistics, JMeter's run results, such as the transaction's total request count, sent and received bytes, mean, maximum, and minimum, are all added to the metric. The successful (OK) and failed (KO) transaction information is added as well.

There are more steps for adding metrics in the source code; if you are interested, take a look at InfluxdbBackendListenerClient.java in the JMeter source.

After the metrics are collected, InfluxdbMetricsSender sends them to InfluxDB. The key sending code is as follows:

@Override
    public void writeAndSendMetrics() {
 ........
        if (!copyMetrics.isEmpty()) {
            try {
                if(httpRequest == null) {
                    httpRequest = createRequest(url);
                }
                StringBuilder sb = new StringBuilder(copyMetrics.size()*35);
                for (MetricTuple metric : copyMetrics) {
                    // Add TimeStamp in nanosecond from epoch ( default in InfluxDB )
                    sb.append(metric.measurement)
                        .append(metric.tag)
                        .append(" ") //$NON-NLS-1$
                        .append(metric.field)
                        .append(" ")
                        .append(metric.timestamp+"000000") 
                        .append("\n"); //$NON-NLS-1$
                }


                StringEntity entity = new StringEntity(sb.toString(), StandardCharsets.UTF_8);

                httpRequest.setEntity(entity);
                lastRequest = httpClient.execute(httpRequest, new FutureCallback<HttpResponse>() {
                    @Override
                    public void completed(final HttpResponse response) {
                        int code = response.getStatusLine().getStatusCode();
                        /*
                         * HTTP response summary 2xx: If your write request received
                         * HTTP 204 No Content, it was a success! 4xx: InfluxDB
                         * could not understand the request. 5xx: The system is
                         * overloaded or significantly impaired.
                         */
                        if (MetricUtils.isSuccessCode(code)) {
                            if(log.isDebugEnabled()) {
                                log.debug("Success, number of metrics written: {}", copyMetrics.size());
                            } 
                        } else {
                            log.error("Error writing metrics to influxDB Url: {}, responseCode: {}, responseBody: {}", url, code, getBody(response));
                        }
                    }
                    @Override
                    public void failed(final Exception ex) {
                        log.error("failed to send data to influxDB server : {}", ex.getMessage());
                    }
                    @Override
                    public void cancelled() {
                        log.warn("Request to influxDB server was cancelled");
                    }
                });               
 ........
            }
        }
    }

Through writeAndSendMetrics, all saved metrics are sent to InfluxDB.
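Each line assembled by the StringBuilder above is a record in InfluxDB's line protocol, in the form measurement,tags fields timestamp. A sketch of what one such line might look like (the values here are illustrative, not taken from a real run):

jmeter,application=7ddemo,transaction=0_openIndexPage,statut=ok count=17,avg=232.82,min=122,max=849 1575255467824000000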

Storage structure in InfluxDB

Now let's look at how the data is stored in InfluxDB:

> show databases
name: databases
name
----
_internal
jmeter
> use jmeter
Using database jmeter
>
> show MEASUREMENTS
name: measurements
name
----
events
jmeter
> select * from events where application='7ddemo'
name: events
time application text title
---- ----------- ---- -----
1575255462806000000 7ddemo Test Cycle1 started ApacheJMeter
1575256463820000000 7ddemo Test Cycle1 ended ApacheJMeter
..............
> select * from jmeter where application='7ddemo' limit 10
name: jmeter
time application avg count countError endedT hit max maxAT meanAT min minAT pct90.0 pct95.0 pct99.0 rb responseCode responseMessage sb startedT statut transaction
---- ----------- --- ----- ---------- ------ --- --- ----- ------ --- ----- ------- ------- ------- -- ------------ --------------- -- -------- ------ -----------
1575255462821000000 7ddemo 0 0 0 0 0 internal
1575255467818000000 7ddemo 232.82352941176472 17 0 17 849 122 384.9999999999996 849 849 0 0 all all
1575255467824000000 7ddemo 232.82352941176472 17 849 122 384.9999999999996 849 849 0 0 all 0_openIndexPage
1575255467826000000 7ddemo 232.82352941176472 17 849 122 384.9999999999996 849 849 ok 0_openIndexPage
1575255467829000000 7ddemo 0 1 1 1 1 internal
1575255472811000000 7ddemo 205.4418604651163 26 0 26 849 122 252.6 271.4 849 0 0 all all
1575255472812000000 7ddemo 0 1 1 1 1 internal
1575255472812000000 7ddemo 205.4418604651163 26 849 122 252.6 271.4 849 ok 0_openIndexPage
1575255472812000000 7ddemo 205.4418604651163 26 849 122 252.6 271.4 849 0 0 all 0_openIndexPage
1575255477811000000 7ddemo 198.2142857142857 27 0 27 849 117 263.79999999999995 292.3500000000001 849 0 0 all all

This output shows that two measurements are created in InfluxDB: events and jmeter, which store data separately. The testTitle and eventTags that we configured in the interface are stored in the events measurement; the template does not currently use these two values.

In the jmeter measurement, we can see the statistics per application and transaction. These values are consistent with the console output. When Grafana displays them, it queries this measurement and draws curves along the time series.
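If you want to double-check these numbers outside Grafana, you can aggregate them directly in the influx CLI. A sketch based on the schema above (7ddemo and 0_openIndexPage are the application and transaction names from the earlier output):

> select sum("count") from "jmeter" where "application"='7ddemo' and "transaction"='0_openIndexPage' and "statut"='ok'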

Configuration in Grafana

Now that JMeter is sending data to InfluxDB, let's configure the display in Grafana. First, add an InfluxDB data source, as follows:

[Image: Grafana InfluxDB data source configuration]

After configuring the URL, Database, User, and Password here, just click Save.

Then add a JMeter dashboard. The one we commonly use is the template with ID 5496 from Grafana's official dashboard library. After importing it, select the corresponding data source.

[Image: importing the JMeter dashboard template]

Then you will see the dashboard interface.

[Image: JMeter dashboard]

There is no data yet. Later we will run an example to see how JMeter's data maps onto this dashboard. First, let's look at the two important query statements below.

TPS curve:
SELECT last("count") / $send_interval FROM "$measurement_name" WHERE ("transaction" =~ /^$transaction$/ AND "statut" = 'ok') AND $timeFilter GROUP BY time($__interval)

The above is Total TPS, which is called throughput here.

I already explained this concept in the first article. Here I will only remind you again that the whole team should use the concept consistently and not be misled by some traditional material in the industry.

The data here comes from all transactions in the jmeter measurement whose status (statut) is ok, i.e. the successful ones.

Response time curve:
SELECT mean("pct95.0") FROM "$measurement_name" WHERE ("application" =~ /^$application$/) AND $timeFilter GROUP BY "transaction", time($__interval) fill(null)

This draws the 95th percentile (pct95.0) response time curve.
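The third curve we mentioned at the beginning, the error count, can be plotted in the same style. The following is only a sketch based on the countError field we saw in the jmeter measurement (errors per second here), not necessarily the exact query the 5496 template uses:

SELECT sum("countError") / $send_interval FROM "$measurement_name" WHERE ("transaction" =~ /^$transaction$/ AND "statut" = 'all') AND $timeFilter GROUP BY time($__interval)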

The overall display effect is as follows:

[Image: overall dashboard display]

Data comparison

First, let's configure a simple scenario in JMeter: 10 threads, each thread iterating 10 times, with two HTTP requests per iteration.

[Image: JMeter scenario configuration]

In other words, 10 x 10 x 2 = 200 requests will be generated. Let's run it and check the results in JMeter.

[Image: JMeter results]

As you can see, the number of requests matches what we expected. Now let's look at the results shown in Grafana.

[Image: results in Grafana]

There are also statistics for each transaction.

[Image: per-transaction statistics in Grafana]

At this point, the display chain from JMeter to Grafana is complete. From now on, we no longer need to save JMeter's execution results manually, nor wait for JMeter to generate the HTML report.

Data display logic of node_exporter+Prometheus+Grafana

For performance testing, with the commonly used Grafana + Prometheus + Exporter stack, the first thing to look at is operating system resources. So in this article we will take node_exporter as an example to explain how operating system data is extracted, so that we know where the monitoring data comes from. The meaning of the data will be described in subsequent articles.

As usual, let's start with a diagram.

[Image: node_exporter + Prometheus + Grafana architecture]

node_exporter now supports many operating systems. The official list is as follows:

[Image: operating systems supported by node_exporter]

Of course, this does not mean that only these are supported; you can also write your own Exporter to extend it.

Configure node_exporter

The node_exporter directory is as follows:

[root@7dgroup2 node_exporter-0.18.1.linux-amd64]# ll
total 16524
-rw-r--r-- 1 3434 3434 11357 Jun 5 00:50 LICENSE
-rwxr-xr-x 1 3434 3434 16878582 Jun 5 00:41 node_exporter
-rw-r--r-- 1 3434 3434 463 Jun 5 00:50 NOTICE

Start it up:

[root@7dgroup2 node_exporter-0.18.1.linux-amd64]# ./node_exporter --web.listen-address=:9200 &

Quite concise, isn't it? If you want to see more options, check its help output.
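Before wiring it into Prometheus, you can quickly verify that the metrics endpoint is up (port 9200 matches the startup command above, and the grep simply filters the CPU counters we will use later):

[root@7dgroup2 node_exporter-0.18.1.linux-amd64]# curl -s http://localhost:9200/metrics | grep "^node_cpu" | head

You should see node_cpu_seconds_total counters broken down by cpu and mode.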

Configure Prometheus

Download Prometheus first:

[root@7dgroup2 data]# wget -c https://github.com/prometheus/prometheus/releases/download/v2.14.0/prometheus-2.14.0.linux-amd64.tar.gz
..........100%[=============================================================================================>] 58,625,125   465KB/s   in 6m 4s
2019-11-29 15:40:16 (157 KB/s) - ‘prometheus-2.14.0.linux-amd64.tar.gz’ saved [58625125/58625125]
[root@7dgroup2 data]

After decompression, we can see the directory structure is as follows:

[root@7dgroup2 prometheus-2.11.1.linux-amd64]# ll
total 120288
drwxr-xr-x. 2 3434 3434 4096 Jul 10 23:26 console_libraries
drwxr-xr-x. 2 3434 3434 4096 Jul 10 23:26 consoles
drwxr-xr-x. 3 root root 4096 Nov 30 12:55 data
-rw-r--r--. 1 3434 3434 11357 Jul 10 23:26 LICENSE
-rw-r--r--. 1 root root 35 Aug 7 23:19 node.yml
-rw-r--r--. 1 3434 3434 2770 Jul 10 23:26 NOTICE
-rwxr-xr-x. 1 3434 3434 76328852 Jul 10 21:53 prometheus
-rw-r--r-- 1 3434 3434 1864 Sep 21 09:36 prometheus.yml
-rwxr-xr-x. 1 3434 3434 46672881 Jul 10 21:54 promtool

[Image]
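To have Prometheus scrape the node_exporter started above (listening on port 9200), a minimal scrape job in prometheus.yml would look something like this (a sketch, with localhost as a placeholder for the monitored host):

scrape_configs:
  - job_name: 'node'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9200']

Then start Prometheus with ./prometheus --config.file=prometheus.yml & and add it as a Prometheus data source in Grafana.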

Then configure a node_exporter dashboard template. For example, I chose the official template with ID 11074 here, as shown below:

[Image: node_exporter dashboard (template ID 11074)]

Data logic description

Having explained the process above, what matters most to those of us doing performance testing and analysis is knowing the source and meaning of the data.

Take the CPU usage in the figure above as an example (CPU usage is a very important counter, so let's look at it today).

First click edit on the panel title and look at its query statements:

avg(irate(node_cpu_seconds_total{instance=~"$node",mode="system"}[30m])) by (instance)
avg(irate(node_cpu_seconds_total{instance=~"$node",mode="user"}[30m])) by (instance)
avg(irate(node_cpu_seconds_total{instance=~"$node",mode="iowait"}[30m])) by (instance)
1 - avg(irate(node_cpu_seconds_total{instance=~"$node",mode="idle"}[30m])) by (instance)

These are the data taken from Prometheus. The queries read the node_cpu_seconds_total counter from Prometheus, broken down by mode (system, user, iowait), and the last expression subtracts the idle ratio from 1 to get the overall CPU usage.
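If you want to check these values outside Grafana, you can paste a concrete version of one of them into the Prometheus expression browser. Here the dashboard variable $node is replaced with the node_exporter instance started earlier (the address is a placeholder):

1 - avg(irate(node_cpu_seconds_total{instance="localhost:9200",mode="idle"}[30m])) by (instance)

This returns the same overall CPU usage that the Grafana panel plots.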

Let's take a look at the counters exposed by node_exporter:

[Image: counters exposed by node_exporter]

These values are the same as those in top; they all come from the /proc/ directory. The picture below shows the top output, so we can compare them.

[Image: top output]
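On Linux, node_exporter's CPU collector reads these counters from /proc/stat, so you can also look at the raw source directly. The first line aggregates CPU time across all cores, in jiffies, in the order user, nice, system, idle, iowait, irq, softirq, and so on:

[root@7dgroup2 ~]# head -1 /proc/stat

node_exporter turns these jiffies into the node_cpu_seconds_total values shown above, and top computes its percentages from the same file.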

At this point, we understand how the operating system monitoring data gets its values: the values come from the operating system's own counters, are collected and passed to Prometheus, then fetched by the query statements in Grafana, and finally displayed on the Grafana interface.

Summary

Why explain the logic behind the data? Because I have recently run into situations at work where people feel that with a tool combination like Prometheus + Grafana + Exporter, there is basically no need to run commands by hand anymore. But what we need to understand is that a monitoring platform can only fetch data that the monitored target is able to provide. A small collector like node_exporter does not obtain all of the system's performance data, only the common counters. Whether you view these counters with commands or with such a cool tool, the values themselves do not change. So whether it is data seen on a monitoring platform or on the command line, what matters most is knowing what the values mean and how their changes affect the next steps of performance testing and analysis.

Link: cnblogs.com/siguadd/p/14878035.html

