Guide to the series of articles
OceanBase is 100% independent research and development, and has stably supported Double 11 for 9 consecutive years . It has innovatively launched a new city-level disaster recovery standard of "three places and five centers". It is the only domestically produced product in the world that has set new world records in both TPC-C and TPC-H tests. The native distributed database will be officially open sourced in June 2021. The query optimizer is the core module of the relational database system, the focus and difficulty of the database kernel development, and the "touchstone" for measuring the maturity of the entire database system. In order to help you better understand the OceanBase query optimizer, we will write a series of articles on query rewriting, so that you can better grasp the essence of query rewriting, become familiar with the equivalence of complex SQL, and write efficient SQL. This article is the sixth in the OceanBase rewrite series. It will focus on predicate derivation. Welcome to discuss~
Columnist Introduction
The OceanBase optimizer team, led by OceanBase senior technical expert Xi Feng and technical expert Shan Wen, is committed to building the world's leading distributed query optimizer.
Contents of the series
This query rewriting series not only includes four modules: subquery optimization, aggregate function optimization, window function optimization, and complex expression optimization. This article will elaborate on the derivation method of predicates, and there are more modules, so stay tuned.
Welcome to the OceanBase open source user group (DingTalk number: 33254054), and join the group to communicate with the OceanBase query optimizer team.
1. Why do we need predicate derivation?
Businesses usually only read part of the data when accessing the database, so some predicates are specified to filter out unwanted data. When implementing a query semantics, we can use many different combinations of predicates.
For example: Q1 and Q2 both read the remaining ticket information of 1024 pieces from the database. These two queries use different sets of predicates to achieve the same query effect. In terms of query performance, Q2's filter predicates are written better. T.play_id = 1024 in Q2 is a base table filter predicate. It can filter out a batch of data in advance, reducing the amount of data involved in the connection. Further, when there is an index of (play_id, sale_date, seat) on the TICKETS table, the query optimizer can determine a very good data scan range on the one hand; on the other hand, it can also use the index order to eliminate the sorting operation generated by ORDER BY . Ultimately, the entire query only needs to read at most 10 rows of data from the T table.
Q1:
SELECT P.show_time, T.ticket_id, T.seat
FROM PLAY P, TICKETS T
WHERE P.play_id = T.play_id AND P.play_id = 1024 AND T.sale_date is NULL
ORDER BY T.seat LIMIT 10;
Q2:
SELECT P.show_time, T.ticket_id, T.seat
FROM PLAY P, TICKETS T
WHERE T.play_id = 1024 and P.play_id = 1024 AND T.sale_date is NULL
ORDER BY T.seat LIMIT 10;
In order to guarantee good query performance, the database kernel needs to be capable of deriving a predicate such as T.play_id = 1024 for Q1 queries. We call this ability "predicate derivation". In OceanBase, we design and implement a variety of predicate derivation strategies for different predicate usage scenarios. These derivation strategies are mainly introduced below.
2. Predicate derivation
Predicate derivation is to derive some new predicates based on multiple predicates. For example, there are two predicates P.play_id = T.play_id and P.play_id = 1024 in Q1, and a new predicate T.play_id = 1024 can be derived. This is a single-table filter predicate on the T table. It can filter out the data on the T table in advance and reduce the amount of data involved in multi-table joins. Deriving new predicates makes sense in many optimization scenarios.
Size comparison derivation
Given multiple size-comparison predicates, we can rank size relationships between multiple expressions. For example, in the following query, there are two predicates T1.C1 > T2.C1 and T1.C1 < 10, then we can arrange the size relationship between them as: T2.C1 < T1.C1 < 10 . Obviously, for this scenario, we can derive a new predicate T2.C1 < 10 . This predicate can filter the data of the T2 table in advance and reduce the amount of data involved in the connection.
SELECT * FROM T1, T2 WHERE T1.C1 > T2.C1 AND T1.C1 < 10;
SELECT * FROM T1, T2 WHERE T1.C1 > T2.C1 AND T1.C1 < 10 AND T2.C1 > 10;
For the Q1 query, we can also derive a new predicate T.play_id = 1024 based on the size relationship given by the predicate (T.play_id = P.play_id = 1024). Further, after deriving a new predicate, we can also eliminate a redundant join predicate P.play_id = T.play_id, and finally get query Q2.
complex predicate derivation
In addition to the predicates for size comparison and equality comparison, some more complex predicates are often used in queries. For example, prefix matching strings with LIKE, etc. Given a complex predicate and some equality comparisons, we can also derive some new predicates. For example, there are two predicates T1.C1 = T2.C1 and T1.C1 LIKE 'ABC%' in the query below. Since there is an equivalence relationship between T1.C1 and T2.C1, T2.C1 LIKE 'ABC%' must also be established. This predicate can also filter the data in the T2 table in advance, reducing the amount of data involved in the connection.
SELECT *
FROM T1, T2 WHERE T1.C1 = T2.C1 AND T1.C1 LIKE 'ABC%';
SELECT *
FROM T1, T2 WHERE T1.C1 = T2.C1 AND T1.C1 LIKE 'ABC%' AND T2.C1 LIKE 'ABC%';
Given an equivalence relationship between two columns, and an arbitrary predicate on one of the columns, we can almost deduce a predicate on the other column. But that doesn't mean that we always have to derive new predicates. The computational cost of some complex predicates may be relatively high, and the filtering performance of the predicates themselves is not good. Deriving new complex predicates will lead to a decrease in query performance . When actually making a decision, you should first judge whether the derived new predicate can filter out a large amount of data .
OR predicate derivation
OR predicates are also common in business queries. In the query below, there is an interesting OR predicate. First of all, this predicate refers to the data of multiple tables, so this predicate can only filter the results after the multi-table join. The interesting thing is that each branch of this OR contains a predicate on the T1 table. We can construct a filter predicate on the T1 table: T1.C2 = 1 OR T1.C2 =2 . This is a single-table filter predicate, which can filter the data of T1 in advance and reduce the number of rows involved in the join.
SELECT * FROM T1, T2
WHERE T1.C1 = T2.C1 AND
((T1.C2 = 1) OR (T1.C2 = 2 AND T2.C2 = 2))
SELECT * FROM T1 ,T2
WHERE T1.C1 = T2.C1 AND
(T1.C2 = 1 OR T1.C2 = 2) AND
((T1.C2 = 1) OR (T1.C2 = 2 AND T2.C2 = 2));
MIN/MAX predicate derivation
The derivation of the above two scenarios is relatively straightforward. Below we introduce a more "obvious" way of predicate derivation.
In the query below, there is a HAVING predicate with MAX(C2) > 10. From this predicate, we can derive a filter predicate with C2 > 10. The rationality here is that the original query only retains the grouping aggregation results with MAX(C2) > 10. If a given row does not satisfy C2 > 10, there are two situations:
1. This row is not the maximum value of C2 in the same group (it has no meaning for group aggregation, it can be filtered)
2. This row is the maximum value of C2 in the same group (will be filtered by the HAVING predicate)
In the above two cases, data that does not satisfy C2 > 10 can be filtered in advance. Therefore, we can derive a new predicate C2 > 10.
SELECT C1, MAX(C2)
FROM T1
GROUP BY C1 HAVING MAX(C2) > 10;
=>
SELECT C1, MAX(C2)
FROM T1
WHERE C2 > 10
GROUP BY C1 HAVING MAX(C2) > 10;
Similarly, given the following query with the MIN aggregate function, we can also derive a new predicate. These predicates can filter out some data in advance, reduce the amount of computation for grouping and aggregation operations, and improve query performance.
SELECT C1, MIN(C2)
FROM T1
GROUP BY C1 HAVING MIN(C2) < 10;
=>
SELECT C1, MIN(C2)
FROM T1
WHERE C2 < 10
GROUP BY C1 HAVING MIN(C2) < 10;
This derivation method has more properties on the form of the query. Readers can consider, if there are other aggregate functions in the query, can the above predicate derivation be done?
Derivation trap
There are also some fallible pitfalls in deriving new predicates. For example: Consider the following query Q3, can we derive a new predicate T2.C_BIN = 'A' from T1.C_CI = 'A' and T1.C_CI = T2.C_BIN ?
This derivation is wrong.
This is because, when the predicates here compare, the way of comparison is not the same. In T1.C_CI = 'A', string comparison is case-insensitive, that is, 'a', 'A', etc. all satisfy the filter conditions. But T1.C_CI = T2.C_BIN compares strings in a case-sensitive manner. Combining these two predicates, it can only be deduced that the value of T2.C_BIN is 'a' or 'A'. But T2.C_BIN = 'A' is a case-sensitive comparison, it will directly filter out the data whose value is 'a'. Therefore, the derivation yielding this new predicate is incorrect.
CREATE TABLE T1 (C_CI VARCHAR(10) UTF8_GENERAL_CI);
CREATE TABLE T2 (C_BIN VARCHAR(10) UTF8_BIN);
Q3: SELECT * FROM T1, T2
WHERE T1.C_CI = 'ABC' AND T1.C_CI = T2.C_BIN;
=>
Q4: SELECT * FROM T1, T2
WHERE T1.C_CI = 'ABC' AND T1.C_CI = T2.C_BIN AND T2.C_BIN = 'ABC';
3. Summary
This paper mainly introduces the derivation of some predicates. Deriving new predicates is very important for query optimization. Based on the new predicates, the query optimizer can choose better indexes and generate better base table access paths. Therefore, predicate deduction is a very important optimization technique. There are many more optimizations related to predicates, and in the next article, we will introduce the techniques of predicate movement. It will adjust the position of the predicate in the query, move the predicate to a more reasonable position, and improve the performance of the whole query.
**粗体** _斜体_ [链接](http://example.com) `代码` - 列表 > 引用
。你还可以使用@
来通知其他用户。