Author: Hu Chengqing
Source of this article: original submission
* Produced by the Aikesheng open source community, original content is not allowed to be used without authorization, please contact the editor and indicate the source for reprinting.
If you want to use Zabbix to monitor and use TiDB, you need to use an HTTP agent to actively call the TiDB monitoring interface to obtain monitoring data, and then configure data preprocessing: choose to use the Prometheus pattern or Prometheus to JSON method, but these two functions are added in Zabbix4.2 Yes, Zabbix4.0.x does not have this function (even the latest zabbix4.0.35):
Therefore, when it is impossible to upgrade to Zabbix5.4, we can manually create a monitoring template on a version greater than 4.2. The following demonstration environment is Zabbix5.0.5.
TiDB monitoring interface
Before starting, you need to understand the monitoring interface of : 161c2a6907396c https://docs.pingcap.com/zh/tidb/v5.1/tidb-monitoring-api
Example:
curl http://127.0.0.1:10080/metrics > /tmp/tidb_metics
Refer to the first alarm rule in the TiDB official website (
increase(tidb_session_schema_lease_error_total{type="outdated"}[15m]) > 0
This is the syntax of prometheus, we only need to know that tidb_session_schema_lease_error_total is the metrics name, and then we find this metric in the monitoring data (different versions of TiDB may be different, the metric in the example is not available in 4.0.10, in 5.1 have):
[root@localhost tmp]# grep tidb_session_schema_lease_error_total /tmp/tidb_metics
# HELP tidb_session_schema_lease_error_total Counter of schema lease error
# TYPE tidb_session_schema_lease_error_total counter tidb_session_schema_lease_error_total{type="outdated"} 2
The data format is:
Description
Start with "# HELP", it is the description of this metric
type
Starting with "# TYPE", it indicates the data type of this metric. There are 4 types in total:
data
What should be noted here is the "type" tag in the above example. Some metrics have multiple tags. We can select several of them based on the tags (for example, the requirement is to calculate the total time consumption of all Select, Update, and Insert commands):
tidb_server_handle_query_duration_seconds_sum{sql_type="Begin"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Commit"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Delete"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Execute"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Insert"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Replace"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Rollback"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Select"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Set"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Show"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Update"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="Use"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="general"} 0
tidb_server_handle_query_duration_seconds_sum{sql_type="internal"} 12260.158577597258
Next, we will introduce how to manually add TiDB monitoring in Zabbix.
Create items
Item is a monitoring item, so which indicators of TiDB should we monitor? You can refer to the official warning rules, and the monitoring items that need to be alerted must be the highest priority.
- Create master item
The parameters are as shown in the figure below, and the red mark is the key point. This item defines the metrics interface of TiDB Server to obtain the data of all monitoring indicators:
Note that the obtained data format is Text, which needs to be defined in "Preprocession" (data preprocessing) and converted into JSON format:
- Create ordinary item
Because the master item gets all metrics, you need to create a sub-item to take out a single metrics from the master item:
The key is to define JSONPath in "Preprocession" (data preprocessing), as shown in the figure below:
- JSONPath expression:
$[?(@.name=="tidb_session_schema_lease_error_total" && @.labels.type == "outdated")].value.first()
indicates that the metric name is tidb_session_schema_lease_error_total and its tpye tag is "outdated". Note: This expression in the template of 5.4 is wrong, you can use the test function to check. - Because the type of this metric is Counter (cumulative value), use the "Change per second" method to get its average growth value per second (note: this is an average value). If the type is Gauge (instantaneous value), this step is unnecessary.
Create trigger
Trigger is to define when the value of the specified item meets what conditions, it will trigger its state to become abnormal. First refer to the Prometheus syntax of TiDB official alarm rules:increase(tidb_session_schema_lease_error_total{type="outdated"}[15m]) > 0
The increase([15m]) function represents the increase value in 15 minutes, the whole expression meaning: the increase in 15 minutes is greater than 0. Since we defined tidb_session_schema_lease_error_total per second growth in item, when the maximum value of average growth per second in a period of time is greater than 0, it means that an error has occurred and an alarm needs to be triggered. The trigger expression is:{TiDB by HTTP:tidb.session_schema_lease_error.outdate.rate.max(15m)}>0
appendix
Data preprocessing-JSONPath
Sample data:
[
{
"name": "tidb_server_handle_query_duration_seconds_sum",
"value": "100",
"line_raw": "tidb_server_handle_query_duration_seconds_sum{sql_type=\"Begin\"} 0",
"labels": {
"sql_type": "Begin"
},
"type": "untyped"
},
{
"name": "tidb_server_handle_query_duration_seconds_sum",
"value": "50",
"line_raw": "tidb_server_handle_query_duration_seconds_sum{sql_type=\"Commit\"} 0",
"labels": {
"sql_type": "Commit"
},
"type": "untyped"
}
]
Expression: $[?(@.name=="tidb_server_handle_query_duration_seconds_sum")].value.sum()
Meaning: the total time consumed by all commands
Test result: 150
Expression: $[?(@.name=="tidb_server_handle_query_duration_seconds_sum" && @.labels.sql_type=="Commit")].value.first()
Meaning: the total time consumed by all Commit commands
Test result: 50
Expression: $[?(@.name=="tidb_server_handle_query_duration_seconds_sum" && @.labels.type =~ "Begin|Commit")].value.sum()
Meaning: the total time consumption of all Begin and Commit commands (other types are not counted)
Test result: 150
Computable item
Document: https://www.zabbix.com/documentation/5.0/en/manual/config/items/itemtypes/calculated
To create a calculable monitoring item based on other monitoring items, specify the newly created item as "Calculated". The following example first creates two items: query_duration_sum and query_duration_count. Divide the two to get the average response time of Query:
Trigger expression skills
1. Memory usage
expression:{TiDB by HTTP:tidb.heap_bytes.min(5m)}>{$TIDB.HEAP.USAGE.MAX.WARN}
meaning:
tidb.heap_bytes is the key name, which corresponds to the go_memstats_heap_inuse_bytes indicator in TiDB monitoring, which is a Gauge type (ie instantaneous value);
Within 5 minutes, if the minimum used memory exceeds the specified threshold (that is, the used memory exceeds the threshold within 5 minutes), an alarm is issued
2.99 response time
How to calculate the response time of 99% SQL?
The TiDB monitoring interface provides histogram data, the type is histogram, prometheus can be processed with the histogram_quantile() function:
histogram_quantile(0.99, sum(rate(tidb_server_handle_query_duration_seconds_bucket[1m])) BY (le, instance)) > 1
But zabbix can't handle it, you can calculate the average response time (implemented with the computable item in the appendix).
**粗体** _斜体_ [链接](http://example.com) `代码` - 列表 > 引用
。你还可以使用@
来通知其他用户。