Abstract: If you can convert outer joins into inner joins, then you can simplify the query optimization process.

This article is shared from the HUAWEI cloud community " GaussDB (DWS) external connection to internal connection conversion ", author: wangxiaojuan8.

In the process of query optimization, the connection order between the inner join tables can be exchanged at will, where or on conditions that only involve a single table can be pushed down to the table as the filter condition of the table; for outer joins, The connection order of the tables cannot be exchanged arbitrarily, and the constraint conditions cannot be pushed down arbitrarily. If you can convert outer joins into inner joins, you can simplify the query optimization process.

Conditions that must be met when external connections can be converted to internal connections

For the convenience of description, two nouns are introduced:

  • Non-air side: the side on which all data in the external connection is output. For example: the left table of the left outer join, the right table of the right outer join
  • Nullable side: The side of the outer link that will be filled with null values. For example: the right table of the left outer join, the left table of the right outer join, the left table and the right table of the full outer join

As long as one of the following conditions is met, the outer connection can be converted to an inner connection:

  • There is a "strict" constraint in the Where condition, and the constraint refers to the column on the nullable side of the table. In this way, the predicate can filter out all the null values generated on the nullable side, so that the final result is equivalent to an inner join.

Refer to the "PostgreSQL Technical Insider-Deep Exploration of Query Optimization" for the definition of "strict" as follows:

The precise definition of "strict" is that for a function, operator or expression, if the input parameter is a NULL value, then the output must also be a NULL value. It can be said that the function, operator or expression is strict; but broad Say, for a function, operator or expression, if the input parameter is a NULL value, and the output result is a NULL value or FALSE, then the function or operator is considered strict. If there are such strict operators, functions or expressions in the constraints, because the input is NULL and the output is NULL or FALSE, then tuples containing NULL values will be filtered out.

This article uses the GuassDB (DWS) connection type to introduce the table and data in the article

Example 1: Query the math scores of students whose math scores are not empty

postgres=# select s.id, s.name, ms.score from student s left join math_score ms on (s.id = ms.id) where ms.score is not null;
 id | name  | score 
----+-------+-------
  1 | Tom   |    80
  2 | Lily  |    75
  4 | Perry |    95
(3 rows)
postgres=# explain select s.id, s.name, ms.score from student s left join math_score ms on (s.id = ms.id) where ms.score is not null;
                                 QUERY PLAN                                 
----------------------------------------------------------------------------
  id |               operation                | E-rows | E-width | E-costs 
 ----+----------------------------------------+--------+---------+---------
   1 | ->  Streaming (type: GATHER)           |     30 |     126 | 36.59   
   2 |    ->  Hash Join (3,4)                 |     30 |     126 | 28.59   
   3 |       ->  Seq Scan on student s        |     30 |     122 | 14.14   
   4 |       ->  Hash                         |     29 |       8 | 14.14   
   5 |          ->  Seq Scan on math_score ms |     30 |       8 | 14.14   
 
 Predicate Information (identified by plan id)
 ---------------------------------------------
   2 --Hash Join (3,4)
         Hash Cond: (s.id = ms.id)
   5 --Seq Scan on math_score ms
         Filter: (score IS NOT NULL)
(14 rows)

Example 2: Query the math scores of students whose math scores are higher than 80

postgres=# select s.id, s.name, ms.score from student s left join math_score ms on (s.id = ms.id) where ms.score > 80;
 id | name  | score 
----+-------+-------
  4 | Perry |    95
(1 row)
postgres=# explain select s.id, s.name, ms.score from student s left join math_score ms on (s.id = ms.id) where ms.score > 80;
                                 QUERY PLAN                                 
----------------------------------------------------------------------------
  id |               operation                | E-rows | E-width | E-costs 
 ----+----------------------------------------+--------+---------+---------
   1 | ->  Streaming (type: GATHER)           |     10 |     126 | 36.44   
   2 |    ->  Hash Join (3,4)                 |     10 |     126 | 28.44   
   3 |       ->  Seq Scan on student s        |     30 |     122 | 14.14   
   4 |       ->  Hash                         |     10 |       8 | 14.18   
   5 |          ->  Seq Scan on math_score ms |     10 |       8 | 14.18   
 
 Predicate Information (identified by plan id)
 ---------------------------------------------
   2 --Hash Join (3,4)
         Hash Cond: (s.id = ms.id)
   5 --Seq Scan on math_score ms
         Filter: (score > 80)
(14 rows)

In the above two examples, the conditions where ms.score is not null and where ms.score> 80, if the input score is NULL, the constraint condition returns false, which satisfies the broad "strict" definition. Therefore, the external connection can be eliminated and converted into an internal connection. The query plan above has also been verified. And this kind of outer join elimination can be handled automatically by the query optimizer of the database.

  • In the On connection condition, if the value in the non-nullable side column is a subset of the nullable side column, and the value of the nullable side is not NULL. Typically, the column on the non-null side is the foreign key, and the column on the nullable side is the primary key, and the relationship between the two is the primary and foreign key reference.
CREATE TABLE student(
  id INTEGER primary key,
  name varchar(50)
);

CREATE TABLE math_score(
  id INTEGER, -- 由于GaussDB(DWS)不支持外键,故此处省去了外键定义,但保证该列的值是student表中id列的子集
  score INTEGER
);

INSERT INTO student VALUES(1, 'Tom');
INSERT INTO student VALUES(2, 'Lily');
INSERT INTO student VALUES(3, 'Tina');
INSERT INTO student VALUES(4, 'Perry');

INSERT INTO math_score VALUES(1, 80);
INSERT INTO math_score VALUES(2, 75);
INSERT INTO math_score VALUES(4, 95);

Then the following outer connection has the same result as the inner connection:

postgres=# select ms.id, s.name, ms.score from student s right join math_score ms on (s.id = ms.id);
 id | name  | score
----+-------+-------
  1 | Tom   |    80
  2 | Lily  |    75
  4 | Perry |    95
(3 rows)

postgres=# explain select ms.id, s.name, ms.score from student s right join math_score ms on (s.id = ms.id);
                               QUERY PLAN                               
-------------------------------------------------------------------------
  id |              operation              | E-rows | E-width | E-costs
 ----+-------------------------------------+--------+---------+---------
   1 | ->  Streaming (type: GATHER)        |     30 |     126 | 36.59  
   2 |    ->  Hash Left Join (3, 4)        |     30 |     126 | 28.59  
   3 |       ->  Seq Scan on math_score ms |     30 |       8 | 14.14  
   4 |       ->  Hash                      |     29 |     122 | 14.14  
   5 |          ->  Seq Scan on student s  |     30 |     122 | 14.14  

 Predicate Information (identified by plan id)
 ---------------------------------------------
   2 --Hash Left Join (3, 4)
         Hash Cond: (ms.id = s.id)
(12 rows)

postgres=# select ms.id, s.name, ms.score from student s join math_score ms on (s.id = ms.id);
 id | name  | score
----+-------+-------
  1 | Tom   |    80
  2 | Lily  |    75
  4 | Perry |    95
(3 rows)

postgres=# explain select ms.id, s.name, ms.score from student s join math_score ms on (s.id = ms.id);
                                 QUERY PLAN                                
----------------------------------------------------------------------------
  id |               operation                | E-rows | E-width | E-costs
 ----+----------------------------------------+--------+---------+---------
   1 | ->  Streaming (type: GATHER)           |     30 |     126 | 36.59  
   2 |    ->  Hash Join (3,4)                 |     30 |     126 | 28.59  
   3 |       ->  Seq Scan on student s        |     30 |     122 | 14.14  
   4 |       ->  Hash                         |     29 |       8 | 14.14  
   5 |          ->  Seq Scan on math_score ms |     30 |       8 | 14.14  

 Predicate Information (identified by plan id)
 ---------------------------------------------
   2 --Hash Join (3,4)
         Hash Cond: (s.id = ms.id)
(12 rows)

Since GaussDB (DWS) does not support foreign keys, the conditions that can eliminate the outer joins and convert them into inner joins cannot be recognized by the optimizer and are automatically converted, but it can help developers manually identify and manually eliminate them when writing SQL daily. External connection.

An interesting rewrite example

There is a use case as follows:

Select count(1)
from student s left join math_score ms on (s.id = ms.id)
where s.id = 2
and ms.score > 70;

postgres=# Select count(1)
postgres-#  from student s left join math_score ms on (s.id = ms.id)
postgres-# where s.id = 2
postgres-# and ms.score > 70;
 count 
-------
     0
(1 row)

postgres=# explain Select count(1)
postgres-#  from student s left join math_score ms on (s.id = ms.id)
postgres-# where s.id = 2
postgres-# and ms.score > 70;
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
  id |                            operation                            | E-rows | E-width | E-costs 
 ----+-----------------------------------------------------------------+--------+---------+---------
   1 | ->  Aggregate                                                   |      1 |       8 | 26.51   
   2 |    ->  Streaming (type: GATHER)                                 |      1 |       8 | 26.51   
   3 |       ->  Aggregate                                             |      1 |       8 | 22.51   
   4 |          ->  Nested Loop (5,6)                                  |      3 |       0 | 22.49   
   5 |             ->  Index Only Scan using student_pkey on student s |      1 |       4 | 8.27    
   6 |             ->  Seq Scan on math_score ms                       |      1 |       4 | 14.21   
 
     Predicate Information (identified by plan id)    
 -----------------------------------------------------
   5 --Index Only Scan using student_pkey on student s
         Index Cond: (id = 2)
   6 --Seq Scan on math_score ms
         Filter: ((score > 70) AND (id = 2))
(15 rows)

It can be seen from the above plan that the left outer join in SQL has been optimized as a cross join, because there is no join condition on the 4th operator Nest Loop.

This query is ultimately to calculate the total number of rows in the result of joining the two tables. For cross join, the number of rows connected between two tables is equal to the product of the number of rows in the left table and the number of rows in the right table. So this query can be modified to the following equivalent query:

explain select lcount * rcount as count
from (select count(1) lcount from student where id = 2) s,
     (select count(1) rcount from math_score where score > 70 and id = 2) ms;

postgres=# select lcount * rcount as count
postgres-# from (select count(1) lcount from student where id = 2) s,
postgres-#      (select count(1) rcount from math_score where score > 70 and id = 2) ms;
 count 
-------
     1
(1 row)

postgres=# explain select lcount * rcount as count
postgres-# from (select count(1) lcount from student where id = 2) s,
postgres-#      (select count(1) rcount from math_score where score > 70 and id = 2) ms;
                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
  id |                            operation                             | E-rows | E-width | E-costs 
 ----+------------------------------------------------------------------+--------+---------+---------
   1 | ->  Streaming (type: GATHER)                                     |      1 |      16 | 26.56   
   2 |    ->  Nested Loop (3,7)                                         |      1 |      16 | 22.56   
   3 |       ->  Aggregate                                              |      1 |       8 | 8.29    
   4 |          ->  Streaming(type: BROADCAST)                          |      1 |       8 | 8.29    
   5 |             ->  Aggregate                                        |      1 |       8 | 8.28    
   6 |                ->  Index Only Scan using student_pkey on student |      1 |       0 | 8.27    
   7 |       ->  Materialize                                            |      1 |       8 | 14.25   
   8 |          ->  Aggregate                                           |      1 |       8 | 14.23   
   9 |             ->  Streaming(type: BROADCAST)                       |      1 |       8 | 14.23   
  10 |                ->  Aggregate                                     |      1 |       8 | 14.22   
  11 |                   ->  Seq Scan on math_score                     |      1 |       0 | 14.21   
 
    Predicate Information (identified by plan id)   
 ---------------------------------------------------
   6 --Index Only Scan using student_pkey on student
         Index Cond: (id = 2)
  11 --Seq Scan on math_score
         Filter: ((score > 70) AND (id = 2))
(20 rows)

Through this rewriting, the aggregation operation can be pushed to each subtree of the Nested Loop. When the amount of data in each subtree of the Nested Loop is relatively large, the aggregation can greatly reduce the result set, reduce the amount of data participating in the join, and thus improve performance.

The following example is left for consideration:

Select sum(score)
From student s left join math_score ms on (s.id = ms.id)
Where s.id = 2
  And ms.score > 70;

Click to follow and learn about Huawei Cloud's fresh technology for the first time~


华为云开发者联盟
1.4k 声望1.8k 粉丝

生于云,长于云,让开发者成为决定性力量