Like other mainstream commercial databases, TiDB's query optimizer is responsible for optimizing user and system queries, and generating an effective and efficient execution plan is executed by the executor. The quality of the execution plan generated by the optimizer directly affects the execution efficiency and performance of the query. The "TiDB Query Optimization and Tuning" series of articles will introduce the principles and applications related to TiDB query and optimization through some specific cases. This article is the first in a series of articles, and will briefly introduce the optimization process of TiDB's query optimizer.

Common logic optimization rules in TiDB

The optimization process of the optimizer can be simply regarded as a search problem, that is, for a query, the optimal execution plan of the query is found in a huge search space composed of various possible execution plans. Different database query optimizers have different optimization processes according to different architectures. The query optimization process of TiDB is mainly divided into two parts: logical optimization and physical optimization.

In logic optimization, the transformation rules of relational algebra are used to perform equivalent transformation of query statement expressions, and in the process, the possible plan search space (such as different join orders) is continuously increased or pruned, and finally the optimal logic is selected to generate plan tree. In the subsequent physical optimization process, the actual execution physical plan is generated for the operator nodes in the logical plan tree, and the cost of implementing algorithms (such as different join methods) or objects (such as using different indexes) of different physical plans is evaluated. , and choose the physical plan with the least cost.

The logical optimization and physical optimization are briefly introduced below.

Logic optimization is an optimization process for the logic operators in the logic plan. Before introducing the logic optimization rules, let's introduce several main logic operators in TiDB:

  • DataSource: Data source, representing a source table, such as select * from t in t .
  • Selection: represents the corresponding filter condition, select * from t where a = 5 in where a = 5 .
  • Projection: Projection operation, also used for expression calculation, select c, a + b from t c and a+b are projection and expression calculation operations.
  • Join: The join operation of two tables, select t1.b, t2.c from t1 join t2 on t1.a = t2.a in t1 join t2 on t1.a = t2.a is the join operation of two tables t1 and t2 . Join has a variety of connection methods such as inner join, left join, and right join.

Selection, Projection, and Join (SPJ for short) are the three most basic operators.

TiDB's logic optimization is rule-based optimization. By applying optimization rules to the input logic execution plan in sequence, the entire logic execution plan becomes more efficient. These common logic optimization rules include:

1.png

Some examples of logical optimization rules are as follows:

Rule 4: Max/Min Optimization

Max/Min optimization, the Max/Min statement will be rewritten. Such as the following statement:

select min(id) from t;

Change to the following writing, you can achieve a similar effect:

select id from t order by id desc limit 1;

The execution plan generated by the previous statement is a TableScan followed by an Aggregation, which is a full table scan operation. The latter statement, the generated execution plan is TableScan + Sort + Limit. Usually, the id column in the data table is the primary key or there is an index, and the data itself is ordered, so that Sort can be eliminated, and finally becomes TableScan/IndexLookUp + Limit, which avoids the operation of full table scan, and only needs to read the first entry data to return results.

Max-min elimination does this transformation "automatically" by the optimizer.

Rule 5: Outer Join Elimination

Outer join elimination refers to removing the entire join operation from the query. Elimination of outer joins needs to meet certain conditions:

  • Condition 1: The parent operator of LogicalJoin will only use the columns output by the outer plan of LogicalJoin
  • Condition 2:

    • Condition 2.1: The join key in LogicalJoin satisfies uniqueness in the output result of inner plan
    • Condition 2.2: The parent operator of LogicalJoin deduplicates the input records

Condition 1 and Condition 2 must be satisfied at the same time, but only one of Condition 2.1 and Condition 2.2 needs to be satisfied.

An example where condition 1 and condition 2.1 are satisfied:

select t1.a from t1 left join t2 on t1.b = t2.b;

can be rewritten as:

select t1.a from t1;

Common physical optimizations in TiDB

Physical optimization is cost-based optimization. In this phase, the optimizer selects a specific physical implementation for each operator in the logical execution plan to convert the logical execution plan generated in the logical optimization phase into a physical execution plan. Different physical implementations of logical operators have different time complexity, resource consumption and physical properties. In this process, the optimizer estimates the cost of different physical implementations based on the statistics of the data, and selects the physical execution plan with the smallest overall cost.

There are many decisions that need to be made for physical optimization, such as:

  • How to read data: Use index scan or full table scan to read data.
  • If multiple indexes exist, the choice between indexes.
  • The physical implementation of the logical operator, that is, the actual algorithm used.
  • Whether the operator can be pushed down to the storage layer for execution to improve execution efficiency.

TiDB statistics

Statistics are crucial input information for the query optimizer. The optimizer will use the statistics to estimate the selectivity of query predicates, the various cardinality of the query, and the cost of different operators, and use these estimates to make calculations. Some logical optimizations and physical optimizations. If there is a large distortion deviation due to outdated or missing statistics, it will often have a very large impact on the optimization of the optimizer, thereby affecting the generated query plan. Therefore, here, we will introduce statistical information, as well as related collection and maintenance, in a larger space, because this is the cornerstone of the optimizer's query optimization.

The statistics collected by TiDB include table-level and column-level information. Table statistics include the total number of rows and the number of modified rows. Column statistics include the number of distinct values, the number of NULLs, histograms, the most frequently occurring value TOPN on the column, and more.

The statistics collection of TiDB includes two methods: manual collection and automatic update:

  • Manual collection:

Statistics are collected by executing the ANALYZE statement. Taking the person table in the database as an example, the trial statement using analyze is as follows:

 analyze table person;

In the process of collecting statistics, you can query the execution status through the show analyze status statement, which can also filter the output results through the where clause, and the output results are displayed as follows:

 mysql> show analyze status where job_info = 'analyze columns';
+--------------+------------+-----------------+---------------------+----------+
| Table_schema | Table_name | Job_info        | Start_time          | State    |
+--------------+------------+-----------------+---------------------+----------+
| test         | person     | analyze columns | 2020-03-07 06:22:34 | finished |
| test         | customer   | analyze columns | 2020-03-07 06:32:19 | finished |
| test         | person     | analyze columns | 2020-03-07 06:35:27 | finished |
+--------------+------------+-----------------+---------------------+----------+
3 rows in set (0.01 sec)
  • Automatic update:

When executing a DML statement, TiDB will automatically update the total number of rows and the number of modified rows in the table. This information will be automatically persisted on a regular basis, and the default update period is 1 minute (20 * stats-lease)

Note: The default value of stats-lease is 3s, if it is set to 0, the automatic update of statistics information will be turned off.

At present, according to the evolution of statistical information collection and use, TiDB currently supports two versions of statistical information. Version 2 has made more optimizations on the basis of Version 1 to improve the maintenance method and accuracy of statistical information, as well as the collection efficiency. For specific differences, please refer to the TiDB Statistics Introduction document .

After the statistics are collected, you can view the statistics and the health of the table to confirm whether the statistics are significantly distorted.

View the statistics meta information of the table:

 mysql> show stats_meta where table_name = 'person';
+---------+------------+----------------+---------------------+--------------+-----------+
| Db_name | Table_name | Partition_name | Update_time         | Modify_count | Row_count |
+---------+------------+----------------+---------------------+--------------+-----------+
| test    | person     |                | 2020-03-07 07:20:54 |            0 |         4 |
+---------+------------+----------------+---------------------+--------------+-----------+
1 row in set (0.01 sec)

View health information for the table:

 mysql> show stats_healthy where table_name = 'person';
+---------+------------+----------------+---------+
| Db_name | Table_name | Partition_name | Healthy |
+---------+------------+----------------+---------+
| test    | person     |                |     100 |
+---------+------------+----------------+---------+
1 row in set (0.00 sec)

SHOW STATS_HISTOGRAMS to view the number of distinct values of the column and the number of NULL values and other information:

 mysql> show stats_histograms where table_name = 'person';
+---------+------------+----------------+-------------+----------+---------------------+----------------+------------+--------------+-------------+
| Db_name | Table_name | Partition_name | Column_name | Is_index | Update_time         | Distinct_count | Null_count | Avg_col_size | Correlation |
+---------+------------+----------------+-------------+----------+---------------------+----------------+------------+--------------+-------------+
| test    | person     |                | name        |        0 | 2020-03-07 07:20:54 |              4 |          0 |         6.25 |        -0.2 |
+---------+------------+----------------+-------------+----------+---------------------+----------------+------------+--------------+-------------+
1 row in set (0.00 sec)

You can view the information of each bucket of the histogram through SHOW STATS_BUCKETS :

 mysql> show stats_buckets;
+---------+------------+----------------+-------------+----------+-----------+-------+---------+-------------+-------------+
| Db_name | Table_name | Partition_name | Column_name | Is_index | Bucket_id | Count | Repeats | Lower_Bound | Upper_Bound |
+---------+------------+----------------+-------------+----------+-----------+-------+---------+-------------+-------------+
| test    | person     |                | name        |        0 |         0 |     1 |       1 | jack        | jack        |
| test    | person     |                | name        |        0 |         1 |     2 |       1 | peter       | peter       |
| test    | person     |                | name        |        0 |         2 |     3 |       1 | smith       | smith       |
| test    | person     |                | name        |        0 |         3 |     4 |       1 | tom         | tom         |
+---------+------------+----------------+-------------+----------+-----------+-------+---------+-------------+-------------+
4 rows in set (0.01 sec)

Statistics can be deleted by executing the DROP STATS statement. The statement is as follows:

 mysql> DROP STATS person;

The statistical information of TiDB can be imported and exported, which is convenient for backup and for personnel on duty to reproduce and locate related problems.

  • Export: The json-formatted statistics of table ${table_name} in database ${db_name} can be obtained through the following interface:
 http://${tidb-server-ip}:${tidb-server-status-port}/stats/dump/${db_name}/${table_name}

Example: Get the statistics of the person table in the test database on the local machine:

 curl -G "http://127.0.0.1:10080/stats/dump/test/person" > person.json
  • Import: Import the json file obtained by the statistics export interface into the database:
 mysql> LOAD STATS 'file_name';

file_name is the name of the imported statistics file.

This article is the first in a series of articles on "TiDB Query Optimization and Tuning". In the future, we will continue to monitor and troubleshoot TiDB query plans, slow query diagnostics, adjust and optimize query execution plans, and other optimizer development or planning diagnostics and tuning. functions, etc. If you have any suggestions for TiDB products, welcome to https://internals.tidb.io/ to communicate with us.


PingCAP
1.9k 声望4.9k 粉丝

PingCAP 是国内开源的新型分布式数据库公司,秉承开源是基础软件的未来这一理念,PingCAP 持续扩大社区影响力,致力于前沿技术领域的创新实现。其研发的分布式关系型数据库 TiDB 项目,具备「分布式强一致性事务...