Author: Hu Chengqing

DBA team member, good at failure analysis and performance optimization, personal blog: 161c2a6907388d https://www.jianshu.com/u/a95ec11f67a8, welcome to discuss.

Source of this article: original submission

* Produced by the Aikesheng open source community, original content is not allowed to be used without authorization, please contact the editor and indicate the source for reprinting.


If you want to use Zabbix to monitor and use TiDB, you need to use an HTTP agent to actively call the TiDB monitoring interface to obtain monitoring data, and then configure data preprocessing: choose to use the Prometheus pattern or Prometheus to JSON method, but these two functions are added in Zabbix4.2 Yes, Zabbix4.0.x does not have this function (even the latest zabbix4.0.35):

Therefore, when it is impossible to upgrade to Zabbix5.4, we can manually create a monitoring template on a version greater than 4.2. The following demonstration environment is Zabbix5.0.5.

TiDB monitoring interface

Before starting, you need to understand the monitoring interface of : 161c2a6907396c https://docs.pingcap.com/zh/tidb/v5.1/tidb-monitoring-api

Example:

curl http://127.0.0.1:10080/metrics > /tmp/tidb_metics

Refer to the first alarm rule in the TiDB official website (

increase(tidb_session_schema_lease_error_total{type="outdated"}[15m]) > 0

This is the syntax of prometheus, we only need to know that tidb_session_schema_lease_error_total is the metrics name, and then we find this metric in the monitoring data (different versions of TiDB may be different, the metric in the example is not available in 4.0.10, in 5.1 have):

[root@localhost tmp]# grep tidb_session_schema_lease_error_total /tmp/tidb_metics 
# HELP tidb_session_schema_lease_error_total Counter of schema lease error 
# TYPE tidb_session_schema_lease_error_total counter tidb_session_schema_lease_error_total{type="outdated"} 2

The data format is:

Description

Start with "# HELP", it is the description of this metric

type

Starting with "# TYPE", it indicates the data type of this metric. There are 4 types in total:

data

What should be noted here is the "type" tag in the above example. Some metrics have multiple tags. We can select several of them based on the tags (for example, the requirement is to calculate the total time consumption of all Select, Update, and Insert commands):

tidb_server_handle_query_duration_seconds_sum{sql_type="Begin"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Commit"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Delete"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Execute"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Insert"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Replace"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Rollback"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Select"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Set"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Show"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Update"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Use"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="general"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="internal"} 12260.158577597258

Next, we will introduce how to manually add TiDB monitoring in Zabbix.

Create items

Item is a monitoring item, so which indicators of TiDB should we monitor? You can refer to the official warning rules, and the monitoring items that need to be alerted must be the highest priority.

  1. Create master item

The parameters are as shown in the figure below, and the red mark is the key point. This item defines the metrics interface of TiDB Server to obtain the data of all monitoring indicators:



Note that the obtained data format is Text, which needs to be defined in "Preprocession" (data preprocessing) and converted into JSON format:

  1. Create ordinary item

Because the master item gets all metrics, you need to create a sub-item to take out a single metrics from the master item:

The key is to define JSONPath in "Preprocession" (data preprocessing), as shown in the figure below:

  1. JSONPath expression: $[?(@.name=="tidb_session_schema_lease_error_total" && @.labels.type == "outdated")].value.first() indicates that the metric name is tidb_session_schema_lease_error_total and its tpye tag is "outdated". Note: This expression in the template of 5.4 is wrong, you can use the test function to check.
  2. Because the type of this metric is Counter (cumulative value), use the "Change per second" method to get its average growth value per second (note: this is an average value). If the type is Gauge (instantaneous value), this step is unnecessary.

Create trigger

Trigger is to define when the value of the specified item meets what conditions, it will trigger its state to become abnormal. First refer to the Prometheus syntax of TiDB official alarm rules:
increase(tidb_session_schema_lease_error_total{type="outdated"}[15m]) > 0

The increase([15m]) function represents the increase value in 15 minutes, the whole expression meaning: the increase in 15 minutes is greater than 0. Since we defined tidb_session_schema_lease_error_total per second growth in item, when the maximum value of average growth per second in a period of time is greater than 0, it means that an error has occurred and an alarm needs to be triggered. The trigger expression is:
{TiDB by HTTP:tidb.session_schema_lease_error.outdate.rate.max(15m)}>0

appendix

Data preprocessing-JSONPath

Sample data:

[
  {
    "name": "tidb_server_handle_query_duration_seconds_sum",
    "value": "100",
    "line_raw": "tidb_server_handle_query_duration_seconds_sum{sql_type=\"Begin\"} 0",
    "labels": {
      "sql_type": "Begin"
    },
    "type": "untyped"
  },
  {
    "name": "tidb_server_handle_query_duration_seconds_sum",
    "value": "50",
    "line_raw": "tidb_server_handle_query_duration_seconds_sum{sql_type=\"Commit\"} 0",
    "labels": {
      "sql_type": "Commit"
    },
    "type": "untyped"
  }
]

Expression: $[?(@.name=="tidb_server_handle_query_duration_seconds_sum")].value.sum()

Meaning: the total time consumed by all commands

Test result: 150

Expression: $[?(@.name=="tidb_server_handle_query_duration_seconds_sum" && @.labels.sql_type=="Commit")].value.first()

Meaning: the total time consumed by all Commit commands

Test result: 50

Expression: $[?(@.name=="tidb_server_handle_query_duration_seconds_sum" && @.labels.type =~ "Begin|Commit")].value.sum()

Meaning: the total time consumption of all Begin and Commit commands (other types are not counted)

Test result: 150

Computable item

Document: https://www.zabbix.com/documentation/5.0/en/manual/config/items/itemtypes/calculated

To create a calculable monitoring item based on other monitoring items, specify the newly created item as "Calculated". The following example first creates two items: query_duration_sum and query_duration_count. Divide the two to get the average response time of Query:

Trigger expression skills

1. Memory usage

expression:
{TiDB by HTTP:tidb.heap_bytes.min(5m)}>{$TIDB.HEAP.USAGE.MAX.WARN}

meaning:

tidb.heap_bytes is the key name, which corresponds to the go_memstats_heap_inuse_bytes indicator in TiDB monitoring, which is a Gauge type (ie instantaneous value);
Within 5 minutes, if the minimum used memory exceeds the specified threshold (that is, the used memory exceeds the threshold within 5 minutes), an alarm is issued

2.99 response time

How to calculate the response time of 99% SQL?

The TiDB monitoring interface provides histogram data, the type is histogram, prometheus can be processed with the histogram_quantile() function:
histogram_quantile(0.99, sum(rate(tidb_server_handle_query_duration_seconds_bucket[1m])) BY (le, instance)) > 1

But zabbix can't handle it, you can calculate the average response time (implemented with the computable item in the appendix).


爱可生开源社区
426 声望211 粉丝

成立于 2017 年,以开源高质量的运维工具、日常分享技术干货内容、持续的全国性的社区活动为社区己任;目前开源的产品有:SQL审核工具 SQLE,分布式中间件 DBLE、数据传输组件DTLE。