This chapter describes how to use the monitoring and diagnosis tools provided by TiDB to troubleshoot and diagnose queries under a running load. In addition to diagnosing query plan problems with the EXPLAIN statement introduced in the previous chapter, this chapter mainly introduces monitoring and diagnosing slow queries through the TiDB Slow Query memory table and the visual Statements feature of TiDB Dashboard.
The Slow Query memory table
TiDB enables the slow query log by default and saves SQL statements whose execution time exceeds a specified threshold to a log file. The slow query log is commonly used to locate slow statements and to analyze and solve SQL performance problems. You can also view the slow query log of the current TiDB node through the system table `information_schema.slow_query`, whose fields are consistent with the contents of the slow query log file. Since version 4.0, TiDB has added the system table `information_schema.cluster_slow_query`, which can be used to view the slow queries of all TiDB nodes.
This section first briefly introduces the format and field meanings of the slow query log, and then gives some common query examples for the two slow query system tables above.
Slow query log example and field description
The following is a typical slow query log:
# Time: 2019-08-14T09:26:59.487776265+08:00
# Txn_start_ts: 410450924122144769
# User: root@127.0.0.1
# Conn_ID: 3086
# Query_time: 1.527627037
# Parse_time: 0.000054933
# Compile_time: 0.000129729
# Process_time: 0.07 Wait_time: 0.002 Backoff_time: 0.002 Request_count: 1 Total_keys: 131073 Process_keys: 131072 Prewrite_time: 0.335415029 Commit_time: 0.032175429 Get_commit_ts_time: 0.000177098 Local_latch_wait_time: 0.106869448 Write_keys: 131072 Write_size: 3538944 Prewrite_region: 1
# DB: test
# Is_internal: false
# Digest: 50a2e32d2abbd6c1764b1b7f2058d428ef2712b029282b776beb9506a365c0f1
# Stats: t:414652072816803841
# Num_cop_tasks: 1
# Cop_proc_avg: 0.07 Cop_proc_p90: 0.07 Cop_proc_max: 0.07 Cop_proc_addr: 172.16.5.87:20171
# Cop_wait_avg: 0 Cop_wait_p90: 0 Cop_wait_max: 0 Cop_wait_addr: 172.16.5.87:20171
# Mem_max: 525211
# Succ: true
# Plan_digest: e5f9d9746c756438a13c75ba3eedf601eecf555cdb7ad327d7092bdd041a83e7
# Plan: tidb_decode_plan('ZJAwCTMyXzcJMAkyMAlkYXRhOlRhYmxlU2Nhbl82CjEJMTBfNgkxAR0AdAEY1Dp0LCByYW5nZTpbLWluZiwraW5mXSwga2VlcCBvcmRlcjpmYWxzZSwgc3RhdHM6cHNldWRvCg==')
insert into t select * from t;
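To make the entry layout concrete, here is a minimal Python sketch (illustrative only, not TiDB's own tooling; the function name `parse_slow_log_entry` is ours) that splits a slow-log entry of the form shown above into a field dictionary plus the SQL text:

```python
import re

# Abridged slow-log entry copied from the example above.
SLOW_LOG = """\
# Time: 2019-08-14T09:26:59.487776265+08:00
# Txn_start_ts: 410450924122144769
# Conn_ID: 3086
# Query_time: 1.527627037
# Process_time: 0.07 Wait_time: 0.002 Backoff_time: 0.002 Request_count: 1
# Succ: true
insert into t select * from t;
"""

def parse_slow_log_entry(text):
    """Split one entry into a {field: value} dict and the SQL text.

    Comment lines start with "# " and may carry several "Key: value"
    pairs on one line; the remaining lines are the statement itself.
    """
    fields, sql_lines = {}, []
    for line in text.splitlines():
        if line.startswith("# "):
            for key, value in re.findall(r"(\w+): (\S+)", line[2:]):
                fields[key] = value
        elif line.strip():
            sql_lines.append(line.strip())
    return fields, " ".join(sql_lines)

fields, sql = parse_slow_log_entry(SLOW_LOG)
print(fields["Query_time"], sql)  # 1.527627037 insert into t select * from t;
```

A parser like this is handy when the memory table is unavailable and you need to aggregate log files directly.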
The following describes the meaning of each field in the slow query log one by one.
Note: The unit of all time-related fields in the slow query log is seconds.
(1) Slow query basic information:

- `Time`: the log printing time.
- `Query_time`: the time spent executing the statement.
- `Parse_time`: the time spent in the parsing phase.
- `Compile_time`: the time spent in the query optimization phase.
- `Digest`: the SQL fingerprint of the statement.
- `Stats`: the version timestamp of the statistics used for each table. If it is displayed as `pseudo`, statistics based on default assumptions were used.
- `Txn_start_ts`: the start timestamp of the transaction, which is also its unique ID. You can use this value to find other transaction-related entries in the TiDB log.
- `Is_internal`: whether this is an internal TiDB SQL statement. `true` means the statement was executed by the TiDB system; `false` means it was executed by a user.
- `Index_ids`: the IDs of the indexes used by the statement.
- `Succ`: whether the statement executed successfully.
- `Backoff_time`: the time the statement waited before retrying when it hit an error that requires a retry. Common retryable errors include: a lock was encountered, a Region split, and "tikv server is busy".
- `Plan_digest`: the fingerprint of the execution plan.
- `Plan`: the execution plan of the statement; running `select tidb_decode_plan('...')` parses out the concrete plan.
- `Query`: the SQL statement. The field name `Query` is not printed in the slow log itself, but the corresponding column after mapping to the memory table is called `Query`.
(2) Fields related to transaction execution:

- `Prewrite_time`: the time spent in the first stage (`prewrite`) of the transaction's two-phase commit.
- `Commit_time`: the time spent in the second stage (`commit`) of the two-phase commit.
- `Get_commit_ts_time`: the time spent obtaining the commit timestamp in the second stage (`commit`).
- `Local_latch_wait_time`: the time spent waiting for locks on the TiDB side before the second stage (`commit`) is initiated.
- `Write_keys`: the number of keys the transaction wrote to the Write CF of TiKV.
- `Write_size`: the total size of the keys and values written when the transaction committed.
- `Prewrite_region`: the number of TiKV Regions involved in the first stage (`prewrite`). Each Region triggers a remote procedure call.
(3) Fields related to memory usage:

- `Mem_max`: the maximum memory space used by TiDB during execution, in bytes.
(4) Fields related to users:

- `User`: the user name that executed the statement.
- `Conn_ID`: the user's connection ID. You can use a keyword such as `con:3` to find other log entries related to this connection in the TiDB log.
- `DB`: the database in use when the statement was executed.
(5) Fields related to TiKV Coprocessor tasks:

- `Process_time`: the total processing time of the SQL statement on TiKV. Because data is sent to TiKV and executed in parallel, this value may exceed `Query_time`.
- `Wait_time`: the total waiting time of the statement on TiKV. Because the number of Coprocessor threads in TiKV is limited, requests queue when all Coprocessor threads are busy; if some requests in the queue take a long time, the waiting time of subsequent requests increases.
- `Request_count`: the number of Coprocessor requests sent by the statement.
- `Total_keys`: the number of keys scanned by the Coprocessor.
- `Process_keys`: the number of keys processed by the Coprocessor. Unlike `Total_keys`, `Process_keys` does not include old MVCC versions; a large gap between `Process_keys` and `Total_keys` indicates that many old versions exist.
- `Cop_proc_avg`: the average execution time of cop-tasks.
- `Cop_proc_p90`: the P90 execution time of cop-tasks.
- `Cop_proc_max`: the maximum execution time of cop-tasks.
- `Cop_proc_addr`: the address of the cop-task with the longest execution time.
- `Cop_wait_avg`: the average waiting time of cop-tasks.
- `Cop_wait_p90`: the P90 waiting time of cop-tasks.
- `Cop_wait_max`: the maximum waiting time of cop-tasks.
- `Cop_wait_addr`: the address of the cop-task with the longest waiting time.
Troubleshooting with the Slow Query memory table
The following examples show how to inspect TiDB slow queries through SQL.
Retrieve the current node's Top N slow queries
The following SQL retrieves the Top 2 slow queries of the current TiDB node:
> select query_time, query
from information_schema.slow_query -- retrieve slow queries of the current TiDB node
where is_internal = false -- exclude TiDB-internal slow queries
order by query_time desc
limit 2;
+--------------+------------------------------------------------------------------+
| query_time | query |
+--------------+------------------------------------------------------------------+
| 12.77583857 | select * from t_slim, t_wide where t_slim.c0=t_wide.c0; |
| 0.734982725 | select t0.c0, t1.c1 from t_slim t0, t_wide t1 where t0.c0=t1.c0; |
+--------------+------------------------------------------------------------------+
Retrieve the Top N slow queries of a specified user on all nodes
The following SQL retrieves the Top 2 slow queries of the user `test` on all TiDB nodes:
> select query_time, query, user
from information_schema.cluster_slow_query -- retrieve slow queries of all TiDB nodes
where is_internal = false
and user = "test"
order by query_time desc
limit 2;
+-------------+------------------------------------------------------------------+----------------+
| Query_time | query | user |
+-------------+------------------------------------------------------------------+----------------+
| 0.676408014 | select t0.c0, t1.c1 from t_slim t0, t_wide t1 where t0.c0=t1.c1; | test |
+-------------+------------------------------------------------------------------+----------------+
Retrieve similar slow queries
After getting the Top N slow queries, you can retrieve similar slow queries by their SQL fingerprints.
-- first get the Top N slow queries and their SQL fingerprints
> select query_time, query, digest
from information_schema.cluster_slow_query
where is_internal = false
order by query_time desc
limit 1;
+-------------+-----------------------------+------------------------------------------------------------------+
| query_time | query | digest |
+-------------+-----------------------------+------------------------------------------------------------------+
| 0.302558006 | select * from t1 where a=1; | 4751cb6008fda383e22dacb601fde85425dc8f8cf669338d55d944bafb46a6fa |
+-------------+-----------------------------+------------------------------------------------------------------+
-- then retrieve similar slow queries by SQL fingerprint
> select query, query_time
from information_schema.cluster_slow_query
where digest = "4751cb6008fda383e22dacb601fde85425dc8f8cf669338d55d944bafb46a6fa";
+-----------------------------+-------------+
| query | query_time |
+-----------------------------+-------------+
| select * from t1 where a=1; | 0.302558006 |
| select * from t1 where a=2; | 0.401313532 |
+-----------------------------+-------------+
Retrieve slow queries with pseudo statistics
If the statistics in a slow query log entry are marked as `pseudo`, it usually means that the table's statistics are not up to date and you need to run `analyze table` to collect statistics manually. The following SQL finds this type of slow query:
> select query, query_time, stats
from information_schema.cluster_slow_query
where is_internal = false
and stats like '%pseudo%';
+-----------------------------+-------------+---------------------------------+
| query | query_time | stats |
+-----------------------------+-------------+---------------------------------+
| select * from t1 where a=1; | 0.302558006 | t1:pseudo |
| select * from t1 where a=2; | 0.401313532 | t1:pseudo |
| select * from t1 where a>2; | 0.602011247 | t1:pseudo |
| select * from t1 where a>3; | 0.50077719 | t1:pseudo |
| select * from t1 join t2; | 0.931260518 | t1:407872303825682445,t2:pseudo |
+-----------------------------+-------------+---------------------------------+
Slow queries whose execution plan changed
Because of inaccurate statistics, the execution plan of the same class of SQL may change unexpectedly. The following SQL retrieves slow queries that have multiple different execution plans:
> select count(distinct plan_digest) as count, digest,min(query)
from information_schema.cluster_slow_query
group by digest
having count>1
limit 3\G
***************************[ 1. row ]***************************
count | 2
digest | 17b4518fde82e32021877878bec2bb309619d384fca944106fcaf9c93b536e94
min(query) | SELECT DISTINCT c FROM sbtest25 WHERE id BETWEEN ? AND ? ORDER BY c [arguments: (291638, 291737)];
***************************[ 2. row ]***************************
count | 2
digest | 9337865f3e2ee71c1c2e740e773b6dd85f23ad00f8fa1f11a795e62e15fc9b23
min(query) | SELECT DISTINCT c FROM sbtest22 WHERE id BETWEEN ? AND ? ORDER BY c [arguments: (215420, 215519)];
***************************[ 3. row ]***************************
count | 2
digest | db705c89ca2dfc1d39d10e0f30f285cbbadec7e24da4f15af461b148d8ffb020
min(query) | SELECT DISTINCT c FROM sbtest11 WHERE id BETWEEN ? AND ? ORDER BY c [arguments: (303359, 303458)];
-- use the SQL fingerprint to further query the details of each execution plan
> select min(plan),plan_digest
from information_schema.cluster_slow_query
where digest='17b4518fde82e32021877878bec2bb309619d384fca944106fcaf9c93b536e94'
group by plan_digest\G
*************************** 1. row ***************************
min(plan): Sort_6 root 100.00131380758702 sbtest.sbtest25.c:asc
└─HashAgg_10 root 100.00131380758702 group by:sbtest.sbtest25.c, funcs:firstrow(sbtest.sbtest25.c)->sbtest.sbtest25.c
└─TableReader_15 root 100.00131380758702 data:TableRangeScan_14
└─TableScan_14 cop 100.00131380758702 table:sbtest25, range:[502791,502890], keep order:false
plan_digest: 6afbbd21f60ca6c6fdf3d3cd94f7c7a49dd93c00fcf8774646da492e50e204ee
*************************** 2. row ***************************
min(plan): Sort_6 root 1 sbtest.sbtest25.c:asc
└─HashAgg_12 root 1 group by:sbtest.sbtest25.c, funcs:firstrow(sbtest.sbtest25.c)->sbtest.sbtest25.c
└─TableReader_13 root 1 data:HashAgg_8
└─HashAgg_8 cop 1 group by:sbtest.sbtest25.c,
└─TableScan_11 cop 1.2440069558121831 table:sbtest25, range:[472745,472844], keep order:false
Count the number of slow queries on each node
The following SQL counts the number of slow queries that have occurred on each TiDB node within a specified period:
> select instance, count(*)
from information_schema.cluster_slow_query
where time >= "2020-03-06 00:00:00"
and time < now()
group by instance;
+---------------+----------+
| instance | count(*) |
+---------------+----------+
| 0.0.0.0:10081 | 124 |
| 0.0.0.0:10080 | 119771 |
+---------------+----------+
Retrieve slow queries from an abnormal period
Assuming that problems such as a drop in QPS or an increase in query response time are observed between 2020-03-10 13:24:00 and 2020-03-10 13:27:00, the following SQL filters out the slow queries that appear only during the abnormal period:
> select * from
(select /*+ AGG_TO_COP(), HASH_AGG() */ count(*),
min(time),
sum(query_time) AS sum_query_time,
sum(Process_time) AS sum_process_time,
sum(Wait_time) AS sum_wait_time,
sum(Commit_time),
sum(Request_count),
sum(process_keys),
sum(Write_keys),
max(Cop_proc_max),
min(query),min(prev_stmt),
digest
from information_schema.cluster_slow_query
where time >= '2020-03-10 13:24:00'
and time < '2020-03-10 13:27:00'
and Is_internal = false
group by digest) AS t1
where t1.digest not in
(select /*+ AGG_TO_COP(), HASH_AGG() */ digest
from information_schema.cluster_slow_query
where time >= '2020-03-10 13:20:00' -- exclude slow queries from the normal period 2020-03-10 13:20:00 ~ 2020-03-10 13:23:00
and time < '2020-03-10 13:23:00'
group by digest)
order by t1.sum_query_time desc
limit 10\G
***************************[ 1. row ]***************************
count(*) | 200
min(time) | 2020-03-10 13:24:27.216186
sum_query_time | 50.114126194
sum_process_time | 268.351
sum_wait_time | 8.476
sum(Commit_time) | 1.044304306
sum(Request_count) | 6077
sum(process_keys) | 202871950
sum(Write_keys) | 319500
max(Cop_proc_max) | 0.263
min(query) | delete from test.tcs2 limit 5000;
min(prev_stmt) |
digest | 24bd6d8a9b238086c9b8c3d240ad4ef32f79ce94cf5a468c0b8fe1eb5f8d03df
TiDB Dashboard Visual Statements
The previous section introduced using the Slow Query memory table to troubleshoot slow queries, but Slow Query only records SQL that exceeds the slow log threshold and cannot diagnose the running load as a whole. This section introduces using TiDB Dashboard to troubleshoot and locate problem queries. TiDB Dashboard provides the Statements feature to monitor and aggregate SQL statistics: the page provides rich list information, including latency, execution count, scanned rows, and full table scan count, helping users analyze which classes of SQL statements take too long or consume too much memory, and thus locate performance problems.
TiDB already supports a variety of performance troubleshooting tools, but each still has shortcomings in certain application scenarios:
1. Grafana cannot troubleshoot the performance of a single SQL statement.
2. The slow log only records SQL that exceeds the slow log threshold.
3. The general log itself has some impact on performance.
4. `EXPLAIN ANALYZE` can only examine reproducible problems.
5. Profiling can only show the bottleneck of an entire instance.
Therefore, the visual Statements feature was introduced: SQL execution can be observed directly on a page without querying system tables, making it easier for users to locate performance problems.
Using TiDB Dashboard
Starting from version 4.0, TiDB provides the new Dashboard operation and maintenance tool, integrated into the PD component; the default address is http://pd-url:pd_port/dashboard . Unlike Grafana, which monitors from the perspective of database metrics, TiDB Dashboard simplifies the operation and maintenance of TiDB as much as possible from the DBA's perspective: the running status of the entire distributed cluster can be viewed on one interface, including data hotspots, SQL statements, cluster information, log search, and real-time performance analysis.
View Statements as a whole
After logging in, click "SQL Statement Analysis" on the left to enter this function page.
Select the time period to analyze in the time range selector to get execution statistics for the SQL statements of all databases in that period.
If you only care about certain databases, you can select them in the second selector to filter the results; multiple selection is supported.
The results are displayed as a table, which supports sorting by different columns, as shown in the following figure.
1. Select the time period to be analyzed
2. Support filtering by database
3. Support sorting by different indicators
Note: A "SQL statement" here actually refers to a category of SQL statements: syntactically consistent SQL statements are normalized into a single class.
For example:
SELECT * FROM employee WHERE id IN (1, 2, 3);
select * from EMPLOYEE where ID in (4, 5);
normalized to
select * from employee where id in (...);
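As an illustration of this normalization, here is a rough Python sketch (our own simplification; TiDB's real normalization behind `Digest` handles far more syntax) that maps the two example statements above to the same class:

```python
import re

def normalize(sql):
    """Collapse literals so syntactically equivalent statements share
    one normalized form. A rough approximation for illustration only.
    """
    s = re.sub(r"\s+", " ", sql.strip().rstrip(";").lower())
    # Collapse IN (...) argument lists into a placeholder.
    s = re.sub(r"in \([^)]*\)", "in (...)", s)
    # Replace remaining string and numeric literals.
    s = re.sub(r"'[^']*'|\b\d+\b", "?", s)
    return s + ";"

a = normalize("SELECT * FROM employee WHERE id IN (1, 2, 3);")
b = normalize("select * from EMPLOYEE where ID in (4, 5);")
print(a == b, a)  # True select * from employee where id in (...);
```

Statements aggregates metrics such as latency and execution count per normalized class, which is why two statements differing only in literals appear as one row.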
In the SQL category column, click a class of SQL statement to enter its details page, where you can view more detailed information as well as per-node execution statistics for that statement.
The key information on a single Statement details page is shown in the figure below.
1. Total SQL execution time
2. Average number of affected rows (usually for writes)
3. Average number of scanned rows (usually for reads)
4. Execution metrics on each node (to quickly locate a node's performance bottleneck)
Statements parameter configuration
- `tidb_enable_stmt_summary`: the Statements feature is enabled by default, and can also be enabled by setting this system variable, for example:
  set global tidb_enable_stmt_summary = true;
- `tidb_stmt_summary_refresh_interval`: sets the clearing interval of the `performance_schema.events_statements_summary_by_digest` table, in seconds; the default is 1800, for example:
  set global tidb_stmt_summary_refresh_interval = 1800;
- `tidb_stmt_summary_history_size`: sets how many history records the `performance_schema.events_statements_summary_by_digest_history` table keeps for each class of SQL; the default is 24, for example:
  set global tidb_stmt_summary_history_size = 24;
Since the Statements information is stored in memory tables, the number of stored SQL entries and the maximum displayed SQL length must be limited to prevent problems such as memory overflow. These two parameters are configured under the `[stmt-summary]` section of config.toml:

- `max-stmt-count` changes the number of SQL classes saved; the default is 200. When the number of SQL classes exceeds `max-stmt-count`, the least recently used class is evicted.
- `max-sql-length` changes the maximum display length of `DIGEST_TEXT` and `QUERY_SAMPLE_TEXT`; the default is 4096.
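Put together, the corresponding config.toml section might look like the following sketch, using the default values described above (parameter names as given in this chapter; verify them against the documentation for your TiDB version):

```toml
# Sketch of the [stmt-summary] section of config.toml with defaults.
[stmt-summary]
# Maximum number of SQL classes kept; when exceeded, the least
# recently used class is evicted.
max-stmt-count = 200
# Maximum stored length of DIGEST_TEXT and QUERY_SAMPLE_TEXT, in bytes.
max-sql-length = 4096
```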
Note: `tidb_stmt_summary_history_size`, `max-stmt-count`, and `max-sql-length` all affect memory usage; it is recommended to adjust them according to the actual situation rather than setting them too large.
In summary, the visual Statements feature can quickly locate SQL performance problems.
This article is the third in the "TiDB Query Optimization and Tuning" series. The previous articles introduced the basic concepts of the optimizer and TiDB's query plans. In future articles, we will continue to introduce the adjustment and optimization of TiDB's query execution plans, as well as other diagnostic and tuning features of the optimizer.
If you have any suggestions for TiDB, you are welcome to discuss them with us at internals.tidb.io .
Click to see more TiDB query optimization and tuning articles