Author introduction: Wang Tianyi
Prometheus + Grafana is a general-purpose monitoring stack that is widely used in all kinds of application environments.
This article mainly discusses whether the Prometheus monitoring that ships with a newly deployed TiDB cluster can be merged into an existing monitoring system.
For users with tight resources and modest high-availability requirements, we suggest separating clusters directly with Prometheus labels to build an "All in One" Prometheus monitoring environment. For users with ample resources and a strong demand for high availability, consider a Prometheus multi-tenant solution.
Grafana is a stateless application. If high availability is required, it can be deployed in a highly available architecture with Keepalived + HAProxy.
From this article, you will learn:
- How to integrate different TiDB cluster Prometheus monitoring platforms into the same Prometheus platform
- How to view metrics in Prometheus through Grafana
- How to inject the cluster tags of different clusters into the Grafana dashboard
- How to import reports into Grafana in batches via Grafana HTTP API
- How to solve Prometheus performance problems caused by a large amount of indicator data
- How to import data from Prometheus into a relational database for query or indicator analysis
- How to achieve high availability and multi-tenancy in Prometheus
The train of thought of this article:
- What I want to do: Integrate the independent Prometheus + Grafana of each cluster into a unified platform, with a single entry for query
- How to integrate Prometheus: Use independent Prometheus to pull metrics from different clusters and distinguish them by label
- How to integrate Grafana: It is necessary to isolate each expr into the label information of the cluster, generate new reports, and import reports in batches using Grafana HTTP API
- Possible risks after integration: Prometheus data volume explosion, slow performance
- What to do: split it up! But why split the library we just merged?
- The goal of the split: horizontal scaling of Prometheus, with the data stored centrally in a remote store
- Data centralized storage solution: use prometheus-postgresql-adapter + TimescaleDB for data storage
- What's wrong with centralized data storage: the dashboards' expr must now read from TimescaleDB, and the original PromQL-based expr can no longer be used
- How to solve the PromQL-to-SQL conversion problem: add another Prometheus layer on top of TimescaleDB to do the conversion
- Prometheus' horizontal expansion and multi-tenant solution: Thanos
Something to say up front:
- As a DBA who has long fought on the front line and still lingers around the second and third lines, the three things I care most about are (in order):
- Keeping my job: correctness, no data may be lost
- Going to bed early: stability, no alerts in the middle of the night
- Retiring in peace: query performance is a product problem, why should a DBA lose sleep over it
- For these reasons, I, an unknown DBA, sorted out my thoughts on a TiDB monitoring integration plan
- This article is a line of thought; I dare not call it a solution
- It records, from start to finish, the iterative process of my coming up with a plan and then overturning it
- Every plan has its own context, so there is no best plan, only the most applicable plan
Experimental cluster environment
Operating system environment introduction
[root@r30 .tiup]# cat /etc/redhat-release
CentOS Stream release 8
[root@r30 .tiup]# uname -r
4.18.0-257.el8.x86_64
Introduction to TiDB cluster environment
As an experimental environment, we deployed two sets of TiDB clusters, tidb-c1-v409 and tidb-c2-v409.
On a separate machine, I deployed a cluster named tidb-monitor through TiUP and kept only the Grafana and Prometheus components, removing all other TiDB components. This tidb-monitor cluster simulates our existing monitoring platform; the goal is to migrate the monitoring of tidb-c1-v409 and tidb-c2-v409 onto tidb-monitor.
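For reference, in a TiUP topology file the monitoring-related components are declared under monitoring_servers, grafana_servers and alertmanager_servers. A rough sketch of that part of a topology follows; the host address and directories are illustrative assumptions, not the exact topology used in this experiment:
## monitoring-related part of a TiUP topology file (illustrative values)
global:
  user: "tidb"
  deploy_dir: "/tidb-deploy"
  data_dir: "/tidb-data"
monitoring_servers:
  - host: 192.168.12.34
grafana_servers:
  - host: 192.168.12.34
alertmanager_servers:
  - host: 192.168.12.34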
Overview of the current TiDB monitoring framework
Application of Prometheus in TiDB
Prometheus is a time series database with a multi-dimensional data model and flexible query statements.
As a popular open source project, Prometheus has an active community and numerous success stories.
Prometheus provides multiple components for users to use. Currently, TiDB uses the following components:
- Prometheus Server: used to collect and store time series data
- Client libraries: used to customize the metrics needed inside a program
- Alertmanager: used to implement the alert mechanism
Application of Grafana in TiDB
Grafana is an open source metric analysis and visualization system.
TiDB uses Grafana to display the related monitoring of each component of the TiDB cluster. The monitoring items are grouped as shown in the figure below:
Prometheus & Grafana problems
As the number of clusters increases, users may run into the following problems:
- Multiple TiDB clusters cannot share one monitoring cluster
- Prometheus itself has no high availability
- As the amount of data grows, Prometheus' query speed decreases
Here, we consider whether it is possible to integrate Prometheus and Grafana of different clusters, so that multiple clusters can share a set of monitoring system.
Prometheus' integrated solution
Introduction to Prometheus
TiDB uses the open source time series database Prometheus to store monitoring and performance metrics, and Grafana as the visualization component for displaying them. In the narrow sense Prometheus refers to the software itself, namely prometheus server; in the broad sense it refers to the ecosystem of tools built around prometheus server as the core. Besides prometheus server and grafana, the commonly used components of the Prometheus ecosystem include alertmanager, pushgateway and a very rich variety of exporters. The prometheus server itself is a time series database. Compared with Zabbix monitoring, which uses MySQL as its underlying storage, it offers very efficient insertion and query performance, and the space taken up by stored data is also very small. If you want the prometheus server to receive pushed data, you need to place pushgateway between the data source and the prometheus server.
The Prometheus monitoring ecosystem is very complete, and the range of objects that can be monitored is very rich. For the targets supported by exporters, please refer to the official exporters list. What Prometheus can monitor goes far beyond the products in that list: some products not on the list support Prometheus natively, such as TiDB; some can be monitored through a standard exporter for their class, such as snmp_exporter; for others you can write a simple script yourself and push the data to pushgateway; and if you have some development ability, you can also write your own exporter. Meanwhile, some products gain native support without an exporter as their versions evolve, such as Ceph. With the continuous adoption of containers and Kubernetes, and more and more software supporting Prometheus natively, I believe Prometheus will soon become the leading product in the monitoring field.
Prometheus architecture introduction
The architecture diagram of Prometheus is as follows:
In the Prometheus ecosystem, the prometheus server is responsible for storing and retrieving monitoring data and for pushing alert messages; it is the core of the ecosystem. Alertmanager receives the alerts pushed by the prometheus server, and after grouping, deduplication and so on, routes them according to the alert labels and sends them to recipients via email, SMS, WeChat Work, DingTalk, webhook, etc. When Prometheus is used for monitoring, most software needs an exporter deployed as an agent to collect data, but some software supports Prometheus natively, such as the TiDB components, whose monitoring data can be collected directly without an exporter. PromQL is the Prometheus query language. Users can write PromQL directly in the browser through the prometheus server's web UI to retrieve monitoring data, solidify PromQL into Grafana dashboards for dynamic display, or build richer custom functionality through the API. Besides scraping static exporters, Prometheus can also monitor dynamic targets through service discovery, such as Kubernetes nodes, pods, and services. In addition to exporters and service discovery, users can write scripts to do custom data collection and push the results to pushgateway; pushgateway is a special exporter, and the prometheus server scrapes it just like any other exporter.
Prometheus' Label Usage Rules
Prometheus distinguishes different Metrics according to Label
Different labels can be attached to scrape targets; they can be used to distinguish metrics with the same name coming from different clusters, or to aggregate different metrics within the same cluster:
## modify prometheus.yml
static_configs:
  - targets: ['localhost:9090']
    labels:
      cluster: cluster_1

## check the prometheus configuration file
./promtool check config prometheus.yml

## reload the prometheus configuration
kill -HUP <prometheus_pid>

## aggregate the cpu usage of cluster_1
sum(process_cpu_seconds_total{cluster='cluster_1'})
Metric collection can also be kept or dropped based on labels:
## stop the metric collection for the jobs with the label cpu_metric_collection_drop
scrape_configs:
  - job_name: 'cpu_metric_collection'
    static_configs:
      - targets: ['localhost:9090']
    relabel_configs:
      - action: drop
        source_labels: ['cpu_metric_collection_drop']

## keep the metric collection for the jobs with the label cpu_metric_collection_keep
scrape_configs:
  - job_name: 'cpu_metric_collection'
    static_configs:
      - targets: ['localhost:9090']
    relabel_configs:
      - action: keep
        source_labels: ['cpu_metric_collection_keep']
Prometheus's relabel operation
In the Prometheus monitoring system, Label is an extremely important parameter. In a centralized, complex monitoring environment we may not be able to control the resources being monitored or their metric data. Redefining the monitoring labels lets us effectively control and manage metrics in such an environment. After Prometheus pulls data from an exporter, it edits the data's labels, and it also allows the user to process labels through the relabel_configs parameter, including modifying, adding, and deleting unnecessary labels.
# The source labels select values from existing labels. Their content is concatenated
# using the configured separator and matched against the configured regular expression
# for the replace, keep, and drop actions.
[ source_labels: '[' <labelname> [, ...] ']' ]
# Separator placed between concatenated source label values.
[ separator: <string> | default = ; ]
# Label to which the resulting value is written in a replace action.
# It is mandatory for replace actions. Regex capture groups are available.
[ target_label: <labelname> ]
# Regular expression against which the extracted value is matched.
[ regex: <regex> | default = (.*) ]
# Modulus to take of the hash of the source label values.
[ modulus: <int> ]
# Replacement value against which a regex replace is performed if the
# regular expression matches. Regex capture groups are available.
[ replacement: <string> | default = $1 ]
# Action to perform based on regex matching.
[ action: <relabel_action> | default = replace ]
In the above definition, <relabel_action> can be one of the following actions (an example follows the list):
- replace: Use the value of replacement to replace the source_label matched by the regex;
- keep: keep the metric of the matched label, and delete the metric that is not matched to the label;
- drop: Delete the metric of the matched label, and keep the metric that is not matched to the label;
- hashmod: set target_label to the modulus (as configured by modulus) of a hash of the source_label values;
- labelmap: Configure the names of all labels matched by the regex as new labels, and configure the value as the value of the new label;
- labeldrop: delete the labels that meet the rules, and keep the unmatched labels;
- labelkeep: Keep the labels that meet the rules and delete the unmatched labels.
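As an illustration of these actions (the job, metric and label names here are hypothetical, not taken from the TiDB configuration), a keep action on metric names can be combined with labeldrop like this:
## keep only node_* metrics and drop a noisy label after scraping (hypothetical names)
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
    metric_relabel_configs:
      - source_labels: [ '__name__' ]
        regex: 'node_.*'
        action: keep
      - action: labeldrop
        regex: 'pod_template_hash'
Note that metric_relabel_configs applies the same actions after the scrape, which is handy when you want to act on metric names rather than on targets.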
Through the above introduction to Prometheus Label, we can consider using the feature of Label to mark and distinguish different TiDB clusters.
Use Label to distinguish different cluster information in TiDB
Several options for modifying Prometheus configuration files
Taking the job tidb as an example, let's complete the most basic configuration. There are two main ideas for modifying tidb job:
- Create a single tidb job and use relabel_configs to tag its targets with a cluster label of either tidb-c1-v409 or tidb-c2-v409
- Create two tidb jobs, job-tidb-c1-v409 and job-tidb-c2-v409
Solution 1: Create a tidb job and distinguish different clusters through relabel_configs
## The first way - create one job for tidb, and distinguish different clusters by a relabel_configs operation
- job_name: "tidb"
  honor_labels: true # don't overwrite job & instance labels
  static_configs:
    - targets:
        - '192.168.12.31:12080'
        - '192.168.12.32:12080'
        - '192.168.12.33:12080'
        - '192.168.12.31:22080'
        - '192.168.12.32:22080'
        - '192.168.12.33:22080'
  relabel_configs:
    - source_labels: [ '__address__' ]
      regex: '(.*):12080'
      target_label: 'cluster'
      replacement: 'tidb-c1-v409'
    - source_labels: [ '__address__' ]
      regex: '(.*):22080'
      target_label: 'cluster'
      replacement: 'tidb-c2-v409'
In the above configuration:
- __address__ is the address of each target selected by targets; in this example six values match: 192.168.12.3{1,2,3}:{1,2}2080;
- regex matches, through a regular expression, the values selected by source_labels above;
- target_label means the matched result is written to a label named cluster;
- replacement is the value written to the cluster label when the regex matches, i.e. tidb-c1-v409 or tidb-c2-v409;
- Reload the prometheus configuration file, and you can see the result under Status -> Targets on the Prometheus web page (a quick command-line check is shown below).
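A rough command-line sanity check of the injected label, assuming Prometheus listens on 192.168.12.34:9090 as in this experiment (this is my own shortcut, not part of the original procedure):
## count targets per cluster label after reloading
curl -s http://192.168.12.34:9090/api/v1/targets | grep -o '"cluster":"tidb-c[12]-v409"' | sort | uniq -c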
Solution 2: Create different jobs to distinguish different clusters
## The second way - create two jobs for different clusters
- job_name: "tidb-c1-v409"
  honor_labels: true # don't overwrite job & instance labels
  static_configs:
    - targets:
        - '192.168.12.31:12080'
        - '192.168.12.32:12080'
        - '192.168.12.33:12080'
      labels:
        cluster: tidb-c1-v409
- job_name: "tidb-c2-v409"
  honor_labels: true # don't overwrite job & instance labels
  static_configs:
    - targets:
        - '192.168.12.31:22080'
        - '192.168.12.32:22080'
        - '192.168.12.33:22080'
      labels:
        cluster: tidb-c2-v409
In the above configuration:
- job_name serves as an identifier to distinguish the two clusters;
- Each job configures only the endpoints of its own cluster under targets;
- The cluster is tagged through the labels field;
- Reload the prometheus configuration file, and you can see the result under Status -> Targets on the Prometheus web page.
It is difficult to compare the pros and cons of the two schemes. The first scheme reduces the number of jobs but increases the number of labels inside the job; the second reduces the number of labels in each job but increases the number of jobs. Much like computing 2 × 3 × 4, it is hard to say whether (2 × 3) × 4 is better than 2 × (3 × 4).
Modification case of Prometheus configuration file
Use the first scheme to distinguish between clusters through relabel_configs.
For blackbox_exporter, since the two clusters are deployed on overlapping machines, in a real production environment, from the perspective of saving resources, one blackbox_exporter per machine is enough.
For the modified experimental environment, you can refer to prometheus-example.
After reloading the Prometheus service, check under Status -> Targets in the Prometheus web GUI that all jobs are in the UP state.
Spot-check a metric such as pd_regions_status; you can see that the cluster label has two values, tidb-c1-v409 and tidb-c2-v409.
Grafana integration solution
View Datasource information in Grafana
Because the metrics of all clusters are now integrated into the same Prometheus, this Prometheus needs to be configured as the data source in Grafana.
View reports in Grafana
Taking the overview dashboard as an example, the display looks a bit abnormal: the information of the two clusters is mixed together and cannot be told apart. In this example, tidb-c1-v409 and tidb-c2-v409 each have three TiDB nodes, but in the overview dashboard the node information is mixed together.
Take overview dashboard -> Service Port Status as an example and open its definition; you can see that its formula is count(probe_success{group="tidb"} == 1)
The cluster information is missing, so add the cluster label by hand:
count(probe_success{cluster="tidb-c1-v409", group="tidb"} == 1)
After modification, the TiDB node information of tidb-c1-v409 can be displayed normally.
Push the cluster information into the dashboard
By manually pushing the cluster information, we can verify that the dashboard can be displayed normally.
Following the logic below, you can write a script to push the cluster information into the dashboards (a quicker way to dump the metric list is sketched after this list):
- Through the curl -s http://192.168.12.34:9090/api/v1/targets command you can view all the target URLs; traverse these URLs to obtain all the metrics
- Traverse all the metric names and push the cluster information into the dashboards one by one; take overview.json as an example
- For a formula without label options, such as "expr": "node_memory_MemAvailable_bytes", push the cluster information in directly: "expr": "node_memory_MemAvailable_bytes{cluster=\"tidb-c1-v409\"}"
- For a formula that already has label options, such as "expr": "count(probe_success{group=\"tidb\"} == 1)", adding the cluster information turns it into "expr": "count(probe_success{cluster=\"tidb-c1-v409\",group=\"tidb\"} == 1)"
You can refer to the script tidb_dashboard_inject_cluster.sh.
Run the tidb_dashboard_inject_cluster.sh script to inject the cluster information. Note that you need to re-copy the original dashboard folder before each run of the script:
[root@r34 ~]# rm -rf dashboards && cp -r dashboards-bak/ dashboards && ./tidb_dashboard_inject_cluster.sh "tidb-c1-v409" "/root/dashboards" "192.168.12.34:9090"
Check the injected script:
[root@r34 dashboards]# cat overview.json | grep expr | grep -v tidb-c1-v409
"expr": "sum(increase(tidb_server_execute_error_total[1m])) by (type)",
"expr": "sum(rate(tikv_channel_full_total[1m])) by (instance, type)",
"expr": "sum(rate(tikv_server_report_failure_msg_total[1m])) by (type,instance,store_id)",
None of these appear in /tmp/tidb-metirc (the metric list dumped by the script), so change them manually if needed. Since these metrics are not collected and Prometheus does not have them, it does not matter much whether they are changed.
Import the redefined report into Grafana
The script import-dashboard.sh can then be used to import the dashboards into Grafana in batches through the Grafana HTTP API.
For detailed process and principles, please refer to [SOP Series 14] How to use multiple TiDB clusters to share one Grafana.
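In essence the batch import boils down to one POST per dashboard against the Grafana HTTP API. A minimal sketch is shown below; the Grafana address and the $GRAFANA_TOKEN API key are assumptions, and the dashboard JSON has to be wrapped in a "dashboard" field:
## import one injected dashboard via the Grafana HTTP API
## (the "id" field inside the JSON should be null or removed when creating a new dashboard)
curl -s -X POST http://192.168.12.34:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${GRAFANA_TOKEN}" \
  -d "{\"dashboard\": $(cat overview.json), \"overwrite\": true}"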
New problems introduced after the integration of indicators
Through the above operations, we have been able to integrate Prometheus and Grafana of different clusters into the same Prometheus + Grafana monitoring platform. What are the risks of doing this:
- New bugs may be introduced during the integration. This is unavoidable: if you want customization, you have to tolerate bugs, and things will only get better through later operation and maintenance
- A large amount of metric data may cause Prometheus performance problems
- Prometheus and Grafana are still not highly available
This is not unique to TiDB monitoring: Kubernetes monitoring likewise tends to use one set of monitoring per cluster to collect metrics. Since Prometheus itself only supports stand-alone deployment, not cluster deployment, it cannot be made highly available or scaled horizontally, and its data storage capacity is limited by the disk capacity of a single machine. In the All in One scenario, Prometheus collects so much data that it consumes a lot of resources and may not deliver its best performance. Splitting Prometheus becomes imperative.
However, in real environments, to save resources and simplify operation and maintenance, many companies do integrate the monitoring of multiple different clusters into one platform as described above. The ideal of "more, faster, better, cheaper" is hard to achieve on a single monitoring platform: with more data it cannot stay fast, and with long retention it cannot stay cheap.
Solving performance problems can be considered from the following aspects:
- Delete low-value metrics that are rarely used but take up a lot of space
- Shorten the retention policy for the history Prometheus stores
- Flow Prometheus data into a data warehouse
- Use federation for data aggregation
Dealing with lower cost-effective indicators
For metrics with low usage but high space consumption, delete them as soon as business requirements allow; such low-value metrics are likely to cause a Prometheus OOM. You can find the metrics that take up a lot of space through the following expression (it can also be used as an alerting rule), and if their usage is low, drop them with the drop action in relabel_config (a sketch follows the expression).
count by (__name__)({__name__=~".+"}) > 10000
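A sketch of dropping such a metric at scrape time is shown below; the metric name is only a placeholder for whatever the query above surfaces:
## drop a high-cardinality metric before it is stored (placeholder metric name)
scrape_configs:
  - job_name: 'tidb'
    static_configs:
      - targets: ['192.168.12.31:12080']
    metric_relabel_configs:
      - source_labels: [ '__name__' ]
        regex: 'some_high_cardinality_metric_bucket'
        action: drop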
Prometheus split problem
What we just did: integrate different cluster information into one monitoring platform. What we want to do: Split the data of one monitoring platform into multiple Prometheus.
The split of Prometheus can be considered from the following dimensions:
- Split from the business dimension
- Fragmentation of very large businesses
Split from the business dimension
Splitting by business dimension is exactly the opposite of our goal. If we split this way, we might as well have done nothing in the first place.
Fragmentation of very large businesses
When the business is extremely complex, or historical data must be kept for a long time, consider splitting that business across multiple Prometheus groups. In such a case we cannot wait to split, and there is no need for data integration in the first place.
Make trade-offs and compromises between demolition and integration
If we integrate, it may cause performance problems; to solve the performance problems, we split Prometheus. To be or not to be, that is the question. Suppose we adopt a hybrid model as a compromise:
Then a new problem may be introduced: queries may come from different datasources, and within one dashboard we cannot run aggregate queries across different datasources. To solve this problem, there are basically two solutions:
Prometheus federated query
- Each business line has its own Prometheus monitoring system, this Prometheus may have to monitor multiple subsystems
- A central Prometheus server aggregates the Prometheus instances of multiple business lines (a configuration sketch follows)
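A rough sketch of such a federation job on the central Prometheus follows; the addresses of the business-line Prometheus instances and the match[] selector are assumptions for illustration:
## the central Prometheus scrapes the /federate endpoint of the business-line Prometheus servers
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"tidb.*"}'
    static_configs:
      - targets:
        - '192.168.12.31:9090'
        - '192.168.12.32:9090'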
Centralized data storage
- Regularly import the data in Prometheus into a data warehouse, and keep only a short period of data in Prometheus
- Treat Prometheus as an adapter only, without local data storage; the collected data is aggregated directly into the database
- Prometheus itself is a time series database, you can use other libraries instead, such as InfluxDB (the open source version does not support high availability) or TimescaleDB
Centrally store Prometheus data
Prometheus + Grafana is a monitoring system that is essentially similar to the data warehouse model: a database plus report display.
Grafana is also a reporting tool that supports multiple data sources. In addition to Prometheus, we can also store the data in relational databases such as PostgreSQL or MySQL.
We have two schemes for importing metrics into the database:
- Extract metrics into the database directly through a program;
- Data is extracted into the database through Prometheus and related adapters: a layer of middleware is added, with more components, but less workload.
Import Prometheus data into PostgreSQL
As an open source time series database based on PostgreSQL, TimescaleDB itself is very similar to Prometheus.
Compared with Prometheus, it offers better query speed, high availability and horizontal scalability; compared with PromQL, SQL statements are friendlier to operations staff. Timescale itself provides the plug-in prometheus-postgresql-adapter, which is stable, efficient and easy to maintain compared with other third-party tools.
For more prometheus-postgresql-adapter installation steps, please refer to prometheus-postgresql-adapter installation
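Once the adapter is running, Prometheus only needs its remote read/write endpoints configured. A sketch follows; the adapter address is an assumption, and the port 9201 with the /write and /read paths reflects the adapter's defaults as far as I recall, so adjust it to your deployment:
## prometheus.yml - point remote storage at prometheus-postgresql-adapter
remote_write:
  - url: "http://192.168.12.34:9201/write"
remote_read:
  - url: "http://192.168.12.34:9201/read"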
Go further on prometheus-postgresql-adapter
We can already store Prometheus metadata in PostgreSQL, so how to display reports through Grafana? We have two options:
- Directly use PostgreSQL as the data source of Grafana - simple structure, but a huge amount of changes;
- Add another layer on top of PostgreSQL and use PromQL to read the data in PostgreSQL - the structure is more complex, but there is basically no change workload
Currently, the prometheus-postgresql-adapter project has been superseded by the Promscale project. Compared with prometheus-postgresql-adapter, Promscale makes it more convenient to use TimescaleDB + PostgreSQL as the remote storage of Prometheus.
Promscale provides us with the following features:
- Query and analyze metrics with both SQL and PromQL engines
- PostgreSQL provides persistent storage capacity and performance, which can be used for historical data analysis
- High availability of data
- ACID properties
- TimescaleDB provides horizontal scalability
Prometheus' multi-tenant and high-availability solution
Both Thanos and Cortex are high-availability and multi-tenant solutions for Prometheus, and both have entered the CNCF incubation stage.
Cortex is built as a scalable, easy-to-use solution for Prometheus monitoring and long-term storage. Its multi-tenant feature can isolate different Prometheus sources within a single cluster, allowing untrusted parties to share one cluster. Thanos is an easy-to-deploy solution that runs alongside the user's existing Prometheus instances and evolves them into a monitoring system with long-term storage capabilities.
Both Thanos and Cortex are good Prometheus multi-tenant and high-availability solutions, but this article uses the Thanos solution:
- All components in Thanos are stateless
- Monitoring data and cluster status are persisted to the object storage OSS
- Support high availability deployment of Prometheus
- Complete documentation, more users than other solutions
Thanos features and components
Sidecar:
- Runs as a sidecar container alongside Prometheus (for example in the same Pod)
- Uploads Prometheus data chunks to object storage (OSS)
- Supports multiple object storage backends (OSS), such as Aliyun, Tencent Cloud, S3, Google Cloud Storage, Azure Storage
- Can be seamlessly deployed through Prometheus Operator (a minimal sidecar launch sketch is given after the component list)
Store:
- Retrieve chunks from Object Storage (OSS) to query long-term monitoring indicators
- Time-based partition query
- Label-based partition query
Compact:
- Creates downsampled chunks of the monitoring data in OSS to speed up queries over long time ranges
Query:
- Serves as the PromQL query entry point, in place of querying Prometheus directly
- Eliminate duplicate data from different data sources (multiple stores)
- Support partial response
Rule:
- A simplified version of Prometheus (mainly uses the rule function, does not capture data, and does not perform PromQL parsing queries)
- Write results to OSS in Prometheus 2.0 storage format
- Mainly used as the storage node of Rule (upload TSDB block to OSS via StoreAPI)
Bucket:
- Inspects the monitoring data stored in the object storage bucket
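To give a feel for how the sidecar attaches to an existing Prometheus, here is a minimal launch sketch; the data path, the bucket configuration file and the ports are assumptions (the docker-compose case below wires all of this up for you):
## start a thanos sidecar next to one Prometheus instance
thanos sidecar \
  --tsdb.path            /prometheus/data \
  --prometheus.url       http://localhost:9090 \
  --objstore.config-file /etc/thanos/bucket.yml \
  --grpc-address         0.0.0.0:10901 \
  --http-address         0.0.0.0:10902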
Thanos's docker-compose case
The following project is the docker-compose implementation of TiDB docking Thanos: tidb-thanos-quick-start
The following images are required to build Thanos's docker-compose environment:
- prom/prometheus:v2.24.1
- quay.io/thanos/thanos:v0.18.0
- minio/minio:RELEASE.2020-05-01T22-19-14Z
- prom/node-exporter:v0.18.1
- prom/alertmanager:v0.21.0
- gcr.io/google_containers/cadvisor:v0.36.0
- grafana/grafana:7.3.7
In this docker-compose setup, two Prometheus instances are created to take over the monitoring of two TiDB clusters. We can deploy two TiDB clusters without their own monitoring and collect their metric information through the Prometheus instances in docker-compose. In this way, we can check the monitoring of both clusters in the Query component at the same time, and even compare them.
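For reference, the object storage configuration read by the sidecar, store and compact components (the bucket.yml mentioned above) might look roughly like this when pointing at the MinIO container in the compose file; the bucket name and credentials are assumptions:
## bucket.yml - S3-compatible object storage settings for MinIO
type: S3
config:
  bucket: "thanos"
  endpoint: "minio:9000"
  access_key: "ACCESS_KEY"
  secret_key: "SECRET_KEY"
  insecure: true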