Abstract: "Plan before you act, and know where to stop so that you may gain." Before doing anything, you should plan and prepare carefully; only then can the task succeed, and you must not act recklessly. Similarly, GaussDB (DWS) executes a query statement according to a predetermined plan. For a given hardware environment, the execution speed depends on the quality of that plan. So how is the plan for a query statement produced? This article unravels the mysteries of row count estimation and path generation in plan generation.

This article is shared from the HUAWEI CLOUD community post "GaussDB (DWS) Plan Generation Principles Revealed (1)"; original author: Jugg.

The GaussDB (DWS) optimizer has two plan generation methods: one is dynamic programming and the other is a genetic algorithm. The former is used most often and is the focus of this series of articles. Generally speaking, after a SQL statement is parsed into a parse tree (ParseTree) and then into a query tree (QueryTree) with a specific structure, plan generation starts from the QueryTree and enters its core phase. The key steps are:

  1. Set the initial degree of parallelism (Dop)
  2. Query rewrite
  3. Estimate the number of rows in each base table
  4. Estimate the number of rows of each join relation (JoinRel)
  5. Path generation: generate the optimal Path
  6. Create executable Plan nodes from the optimal Path
  7. Adjust the degree of parallelism to its optimal value

This article mainly focuses on steps 3, 4, and 5. These steps have a relatively large impact on the generated plan and mainly involve row count estimation, path selection, and cost estimation. Cost estimation is the basis of path selection; each operator corresponds to its own set of cost models and is a relatively independent topic that will be covered in subsequent articles. Plan Hints can intervene in steps 3, 4, 5 and several others to influence plan generation; for a detailed introduction, readers can refer to the blog post: GaussDB (DWS) Performance Tuning Series Implementation Part 6: Eighteen Martial Arts Plan Hint Application.

First look at a simple query statement:

select count(*) from t1 join t2 on t1.c2 = t2.c2 and t1.c1 > 100 and (t1.c3 is not null or t2.c3 is not null);

The execution plan given by the GaussDB (DWS) optimizer is as follows:

postgres=# explain verbose select count(*) from t1 join t2 on t1.c2 = t2.c2 and t1.c1 > 100 and (t1.c3 is not null or t2.c3 is not null);

                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
  id |                    operation                     | E-rows | E-distinct | E-memory | E-width | E-costs
 ----+--------------------------------------------------+--------+------------+----------+---------+---------
   1 | ->  Aggregate                                    |      1 |            |          |       8 | 111.23
   2 |    ->  Streaming (type: GATHER)                  |      4 |            |          |       8 | 111.23
   3 |       ->  Aggregate                              |      4 |            | 1MB      |       8 | 101.23
   4 |          ->  Hash Join (5,7)                     |   3838 |            | 1MB      |       0 | 98.82
   5 |             ->  Streaming(type: REDISTRIBUTE)    |   1799 | 112        | 2MB      |      10 | 46.38
   6 |                ->  Seq Scan on test.t1           |   1799 |            | 1MB      |      10 | 9.25
   7 |             ->  Hash                             |   1001 | 25         | 16MB     |       8 | 32.95
   8 |                ->  Streaming(type: REDISTRIBUTE) |   1001 |            | 2MB      |       8 | 32.95
   9 |                   ->  Seq Scan on test.t2        |   1001 |            | 1MB      |       8 | 4.50

           Predicate Information (identified by plan id)
 -----------------------------------------------------------------
   4 --Hash Join (5,7)
         Hash Cond: (t1.c2 = t2.c2)
         Join Filter: ((t1.c3 IS NOT NULL) OR (t2.c3 IS NOT NULL))
   6 --Seq Scan on test.t1
         Filter: (t1.c1 > 100)

Usually the plan of a query statement starts from the base tables. In this example, base table t1 has multiple filter conditions. The plan shows that some conditions are pushed down to the base table while others are not, so how is its row count estimated? We start with the estimation of the number of rows in the base table.

1. Estimate the number of rows in the base table

If there is no filter condition on a base table, or the filter conditions cannot be pushed down to the base table, then the estimated number of rows is simply the row count recorded in the statistics, and no special processing is needed. This section considers filter conditions that are pushed down to the base table, divided into the single-column and multi-column cases.

1. Single-column filter condition estimation

Base table row count estimation currently relies mainly on statistics. Statistics are summary information computed from a sample of the table's data, generated by ANALYZE before planning. For example, some of the statistics of table t1 are as follows:

postgres=# select tablename, attname, null_frac, n_distinct, n_dndistinct, avg_width, most_common_vals, most_common_freqs from pg_stats where tablename = 't1';
 tablename | attname | null_frac | n_distinct | n_dndistinct | avg_width | most_common_vals | most_common_freqs
-----------+---------+-----------+------------+--------------+-----------+------------------+-------------------
 t1        | c1      |         0 |        -.5 |          -.5 |         4 |                  |
 t1        | c2      |         0 |       -.25 |     -.431535 |         4 |                  |
 t1        | c3      |        .5 |          1 |            1 |         6 | {gauss}          | {.5}
 t1        | c4      |        .5 |          1 |            1 |         8 | {gaussdb}        | {.5}
(4 rows)

The meaning of each field is as follows:

  • null_frac: the fraction of NULL values
  • n_distinct: the global number of distinct values; a positive number is the distinct count itself, while a negative number means its absolute value is the ratio of the distinct count to the number of rows
  • n_dndistinct: the number of distinct values on DN1; the value follows the same rule as n_distinct
  • avg_width: the average width of the column
  • most_common_vals: the list of most common (high-frequency) values
  • most_common_freqs: the list of frequencies of the high-frequency values, corresponding one-to-one with most_common_vals

The rough data distribution can be inferred from these statistics. Take column t1.c1: the average width is 4, each value repeats about twice on average, there are no NULL values, and no value occurs noticeably more often than the others, i.e. most_common_vals (MCV for short) is empty. This can be understood as the data being essentially uniformly distributed. For such uniformly distributed data, a certain number of buckets are allocated and the data is divided into them with equal heights; the boundary of each bucket is recorded. This is commonly known as a histogram, and it means that each bucket contains an (approximately) equal amount of data.

With this basic information, the number of rows in the base table can be roughly estimated. Take the filter condition "t1.c1 > 100" on table t1, and combine the uniform-distribution property of column t1.c1 with its concrete data distribution:

postgres=# select histogram_bounds from pg_stats where tablename = 't1' and attname = 'c1';
                                                                  histogram_bounds
-------------------------------------------------------------------------------------------------------------------------------------------------------
 {1,10,20,30,40,50,60,70,80,90,100,110,120,130,140,150,160,170,180,190,200,210,220,230,240,250,260,270,280,290,300,310,320,330,340,350,360,370,380,390,400,410,420,430,440,450,460,470,480,490,500,510,520,530,540,550,560,570,580,590,600,610,620,630,640,650,660,670,680,690,700,710,720,730,740,750,760,770,780,790,800,810,820,830,840,850,860,870,880,890,900,910,920,930,940,950,960,970,980,990,1000}
(1 row)

It can be seen that the data of column t1.c1 is distributed between 1 and 1000, and every pair of adjacent boundaries encloses roughly the same amount of data (these boundaries are derived from the statistics sample). First, find the approximate position of 100 in this histogram; here it happens to be a bucket boundary (sometimes it falls inside a bucket). The proportion of data satisfying t1.c1 > 100 is then approximately the proportion of buckets after the boundary 100. This proportion is called the selectivity, i.e. the fraction of data that passes the condition, so the number of rows remaining after filtering by "t1.c1 > 100" can be estimated.
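To make the calculation concrete, here is a minimal sketch of a histogram-based estimate for a range condition. It only illustrates the idea described above and is not GaussDB (DWS) source code; in particular, the linear interpolation inside a bucket is an assumption made for the example.

from bisect import bisect_left

def range_selectivity(bounds, const):
    """Selectivity of 'col > const' estimated from equal-height histogram bounds."""
    nbuckets = len(bounds) - 1            # each pair of adjacent boundaries is one bucket
    if const < bounds[0]:
        return 1.0                        # constant below all data: everything qualifies
    if const >= bounds[-1]:
        return 0.0                        # constant above all data: nothing qualifies
    i = bisect_left(bounds, const)        # index of the first boundary >= const
    full_buckets = nbuckets - i           # buckets lying entirely above const
    if bounds[i] == const:
        part = 0.0                        # const sits exactly on a bucket boundary
    else:
        # const falls inside bucket (bounds[i-1], bounds[i]): interpolate linearly
        part = (bounds[i] - const) / (bounds[i] - bounds[i - 1])
    return (full_buckets + part) / nbuckets

bounds = [1] + list(range(10, 1001, 10))  # histogram_bounds of t1.c1 shown above
sel = range_selectivity(bounds, 100)      # 0.9
est_rows = round(2000 * sel)              # 1800, close to the 1799 shown in the plan above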

The above is the basic idea of base-table row count estimation. In general:

When statistics are available:

1. Equality conditions (a sketch follows the next list)
1) Compare the constant against the MCV list; for every MCV entry that satisfies the condition, accumulate its frequency (most_common_freqs) into the selectivity;
2) For the data covered by the histogram, roughly estimate the selectivity from the number of distinct values;
2. Range conditions
1) Compare against the MCV list; for every MCV entry that satisfies the condition, accumulate its frequency into the selectivity;
2) For the data covered by the histogram, estimate the selectivity from the position of the constant among the bucket boundaries;
3. Not-equal conditions: can be transformed into the equality-condition estimate

When no statistics are available:

  1. Equality conditions: for example, the filter condition is "substr(c3, 1, 5) = 'gauss'". Column c3 has statistics, but substr(c3, 1, 5) does not, so how is the selectivity of this condition estimated? A simple idea is: if the number of distinct values of substr(c3, 1, 5) is known, we can roughly assume that each distinct value repeats equally often, and the selectivity can then be estimated. In GaussDB (DWS), the expression distinct-value estimation feature can be enabled by setting cost_model_version=1;
  2. Range conditions: in this case, knowing only the distinct value of substr(c3, 1, 5) is not enough to derive a selectivity. For expressions that cannot be estimated, the corresponding distinct value can be set through qual_num_distinct;
  3. Not-equal conditions: can be transformed into the equality-condition estimate
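As an illustration of the equality-condition case with statistics, here is a minimal sketch of how MCV entries and the remaining (histogram-covered) data could be combined. It mirrors the rule listed above and is not the actual GaussDB (DWS) implementation; n_distinct is assumed to be already converted to an absolute count.

def eq_selectivity(const, null_frac, n_distinct, mcv_vals, mcv_freqs):
    """Selectivity of 'col = const' from single-column statistics."""
    if const in mcv_vals:
        # the constant is a high-frequency value: use its recorded frequency directly
        return mcv_freqs[mcv_vals.index(const)]
    # otherwise spread the remaining non-null, non-MCV fraction evenly
    # over the remaining distinct values
    rest_frac = 1.0 - null_frac - sum(mcv_freqs)
    rest_distinct = n_distinct - len(mcv_vals)
    return rest_frac / rest_distinct if rest_distinct > 0 else 0.0

# t1.c3 from the statistics above: null_frac = 0.5, n_distinct = 1, MCV {gauss} with frequency 0.5
print(eq_selectivity('gauss', 0.5, 1, ['gauss'], [0.5]))   # 0.5

A "col <> const" condition can then be estimated from the same pieces, roughly as 1 - null_frac - eq_selectivity(...).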

2. Multi-column filter condition estimation

For example, table t1 has two filter conditions: t1.c1 = 100 and t1.c3 = 'gauss'. How is the combined selectivity of the two columns estimated? In GaussDB (DWS), there are two general approaches:

Only single-column statistics are available

In this case, the selectivity of each filter condition is first computed from the single-column statistics, and a method is then chosen to combine these selectivities (a simple sketch follows below). The combination method can be specified by setting cost_param. Why does a combination method need to be chosen? Because in real data there is some correlation between columns; in some scenarios the correlation is strong, in others it is weak, and the strength of the correlation determines the final row count.
For the meaning and usage of this parameter, please refer to: GaussDB (DWS) Performance Tuning Series Practical Part 5: Path Intervention of Eighteen Martial Arts.
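The sketch below shows two simple ways per-column selectivities might be combined: plain multiplication (assuming the columns are independent) and an exponential-backoff style combination (assuming partial correlation). These particular formulas are illustrative assumptions only, not the exact options implemented behind cost_param.

from functools import reduce

def combine_independent(sels):
    """Assume the filtered columns are independent: multiply the selectivities."""
    return reduce(lambda a, b: a * b, sels, 1.0)

def combine_with_backoff(sels):
    """Assume partial correlation: the most selective condition counts fully,
    each further condition is damped (an exponential-backoff style combination)."""
    result, weight = 1.0, 1.0
    for s in sorted(sels):          # most selective condition first
        result *= s ** weight
        weight /= 2.0               # later conditions contribute less and less
    return result

sels = [0.001, 0.5]                 # e.g. selectivities of t1.c1 = 100 and t1.c3 = 'gauss'
print(combine_independent(sels))    # 0.0005
print(combine_with_backoff(sels))   # ~0.0007, lies between the independent and fully correlated extremes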

Multi-column (composite) statistics are available

If multi-column statistics have already been collected for the filtered column combination, the optimizer uses the multi-column statistics first when estimating the row count. The basic idea is the same as for a single column: the column combination is treated as a single "column", and the multi-column statistics are then used for the estimate.

For example, suppose there are two groups of multi-column statistics, ((c1, c2, c4)) and ((c1, c2)), where the double parentheses denote one group of multi-column statistics:

  1. If the conditions are: c1 = 7 and c2 = 3 and c4 = 5, use ((c1, c2, c4))
  2. If the conditions are: c1 = 7 and c2 = 3, use ((c1, c2))
  3. If the conditions are: c1 = 7 and c2 = 3 and c5 = 6, then use ((c1, c2))

The general principle for matching filter-condition columns against multi-column statistics is:

  1. The column set of a group of multi-column statistics must be contained in the set of columns referenced by the filter conditions;
  2. Among all groups of multi-column statistics that satisfy rule 1, choose the one whose column set has the largest intersection with the columns of the filter conditions.

Filter conditions whose columns cannot be matched to any group of multi-column statistics fall back to single-column statistics for estimation. The matching rule is sketched below.
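Here is a minimal sketch of that matching rule, using made-up data structures (sets of column names) purely for illustration.

def choose_multicolumn_stats(filter_cols, stats_groups):
    """Pick the multi-column statistics group to use for a set of filtered columns.

    filter_cols:  set of column names appearing in the filter conditions
    stats_groups: list of frozensets, one per collected multi-column statistics group
    """
    # Rule 1: the statistics group must be fully covered by the filtered columns.
    candidates = [g for g in stats_groups if g <= filter_cols]
    if not candidates:
        return None                      # fall back to single-column statistics
    # Rule 2: among the candidates, take the one overlapping the filter columns most.
    return max(candidates, key=lambda g: len(g & filter_cols))

groups = [frozenset({'c1', 'c2', 'c4'}), frozenset({'c1', 'c2'})]
print(choose_multicolumn_stats({'c1', 'c2', 'c4'}, groups))   # the (c1, c2, c4) group
print(choose_multicolumn_stats({'c1', 'c2', 'c5'}, groups))   # the (c1, c2) group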

3. Points worth noting

  • Currently, range conditions are not supported when multi-column statistics are used; if several groups of multi-column conditions apply, the selectivity of each group is multiplied together to give the overall selectivity.
  • The single-column and multi-column estimation described above both assume that each filter condition references only one column. If a single filter condition references multiple columns, such as "t1.c1 < t1.c2", it generally cannot be estimated from single-column statistics, because single-column statistics are collected independently and cannot tell whether two values come from the same row. The current multi-column statistics mechanism also does not support base-table filter conditions that involve multiple columns.
  • Filter conditions that cannot be pushed down to the base table are not counted in the base-table row estimate, for example the condition above: t1.c3 is not null or t2.c3 is not null. Such a condition is generally called a JoinFilter and is estimated when the JoinRel is created.
  • If no statistics are available at all, a default selectivity is used.

2. Estimating the number of JoinRel rows

After the base-table row counts have been estimated, table joins can be processed. Joining two tables requires information such as the row counts of the inputs, the estimated row count after the join, and a choice of join method (i.e. a choice of Path, see the next section); the least costly of these choices is called the optimal path. Join conditions may also be single or multiple: the optimizer needs to compute the combined selectivity of all Join conditions and JoinFilters and then produce the estimated row count. Let us first look at how the selectivity of a single join condition is estimated.

1. Estimating a single Join condition

Similar to base-table filter estimation, join selectivity is also estimated from statistics. Take the join condition in the SQL example above, t1.c2 = t2.c2, and first look at the statistics of t1.c2:

postgres=# select tablename, attname, null_frac, n_distinct, n_dndistinct, avg_width, most_common_vals, most_common_freqs
from pg_stats where tablename = 't1' and attname = 'c2';
 tablename | attname | null_frac | n_distinct | n_dndistinct | avg_width | most_common_vals | most_common_freqs
-----------+---------+-----------+------------+--------------+-----------+------------------+-------------------
 t1        | c2      |         0 |       -.25 |     -.431535 |         4 |                  |
(1 row)

Column t1.c2 has no MCV values; on average each distinct value repeats about 4 times, and the data is evenly distributed. Because a histogram only keeps the bucket boundaries rather than the actual data (and if statistics are collected again, those boundaries may change), comparing boundary values against t2.c2 is not practical and could introduce a relatively large error. Instead we rely on the assumption that columns which are joined have the same meaning and their values overlap as much as possible. That is, if column t1.c2 has 500 distinct values and column t2.c2 has 100 distinct values, then those 100 all overlap with the 500; the side with the smaller distinct count is assumed to be fully contained in the side with the larger one. Although this assumption is somewhat strong, it matches the actual situation in many cases. Back to this example: the statistics show a negative n_distinct, which represents a ratio, and the estimated row count of table t1 is 2000:

postgres=# select reltuples from pg_class where relname = 't1';
 reltuples
-----------
      2000
(1 row)

Thus, the distinct of t1.c2 is 0.25 * 2000 = 500. Similarly, according to the statistics, the distinct of t2.c2 is 100:

postgres=# select tablename, attname, null_frac, n_distinct, n_dndistinct from pg_stats where tablename = 't2' and attname = 'c2';
 tablename | attname | null_frac | n_distinct | n_dndistinct
-----------+---------+-----------+------------+--------------
 t2        | c2      |         0 |        100 |      -.39834
(1 row)

So, can the distinct value of t1.c2 simply be taken as 500? The answer is no. Because there is a filter condition "t1.c1 > 100" on base table t1, and the join happens after that filter has been applied, the distinct value to estimate is the one after filtering, not the one of the original table. Various hypothetical models can be used here, for example a Poisson model (assuming t1.c1 and t1.c2 are only very weakly correlated) or a fully correlated model (assuming t1.c1 and t1.c2 are completely correlated); different models give different values. In this example, the selectivity of "t1.c1 > 100" is 8.995000e-01, and the distinct values given by the different models are as follows:

  1. Poisson model (weak correlation): 500 * (1.0 - exp(-2000 * 8.995000e-01 / 500)) ≈ 486
  2. Fully correlated model: 500 * 8.995000e-01 ≈ 450
  3. Completely uncorrelated model: 500 * (1.0 - pow(1.0 - 8.995000e-01, 2000 / 500)) ≈ 499.9; this model can be derived by probabilistic methods, and interested readers can try the derivation themselves
  4. The actual distinct count after filtering is 500, i.e. columns c2 and c1 are not correlated:
postgres=# select count(distinct c2) from t1 where c1 > 100;
 count
-------
   500
(1 row)

Once the post-filter distinct value of t1.c2 has been estimated, the selectivity of "t1.c2 = t2.c2" can be estimated as 1 / distinct, where distinct is the larger of the two sides (the smaller side is assumed to be fully contained in the larger one).
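The sketch below puts these pieces together for this example: it adjusts the distinct count of t1.c2 for the base-table filter with the Poisson model and then applies the containment assumption to obtain the join selectivity and the JoinRel row count. The formulas follow the ones listed above; the code itself is only an illustration, not the optimizer's implementation.

from math import exp

def poisson_filtered_distinct(d0, sel, nrows):
    """Distinct values expected to survive a filter of selectivity sel (Poisson model)."""
    return d0 * (1.0 - exp(-nrows * sel / d0))

# t1: 2000 rows, distinct(c2) = 500, filter "c1 > 100" with selectivity 0.8995
d_t1 = poisson_filtered_distinct(500, 0.8995, 2000)   # ~486
rows_t1 = 2000 * 0.8995                               # ~1799 rows reach the join
# t2: roughly 1000 rows, distinct(c2) = 100, no filter
d_t2, rows_t2 = 100, 1000

# Containment assumption: the smaller distinct set lies inside the larger one,
# so the equi-join selectivity is 1 / max(d1, d2).
join_sel = 1.0 / max(d_t1, d_t2)
join_rows = rows_t1 * rows_t2 * join_sel    # ~3700, the same order as the 3838 in the plan above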

The above covers the case where neither side has an MCV list. If t1.c2 and t2.c2 both have MCVs, their MCV lists are compared first, because the MCV values have explicit frequencies and the matches can be accumulated directly; the values in the histograms are matched afterwards.

2. Estimating multiple Join conditions

When a join contains multiple Join conditions, the approach is similar to multi-column base-table filter estimation. There are again two routes: first try to use multi-column statistics to estimate the selectivity; when multi-column statistics cannot be used, compute the selectivity of each Join condition from single-column statistics as described above, and combine them under the control of the parameter cost_param. For details, please refer to GaussDB (DWS) Performance Tuning Series Practical Part 5: Path Intervention of Eighteen Martial Arts.

In addition, a few special cases are estimated as follows:

  • If a Join column is an expression without statistics, the optimizer will try to estimate its distinct value and then estimate the selectivity as in the no-MCV case;
  • Left Join / Right Join needs special handling for the null-filling of one side and the full output of the other side; the model above is adjusted accordingly;
  • If the join condition is a range comparison, such as "t1.c2 < t2.c2", the default selectivity of 1/3 is currently used;

3. Estimating the JoinFilter

When two tables are joined, base-table filter conditions that cannot be pushed down generally become JoinFilters, i.e. they are applied during the Join itself. A JoinFilter therefore affects the row count of the JoinRel but not the number of rows scanned from the base tables. Strictly speaking, if the JoinRel is viewed as an intermediate table, these JoinFilters are filters on that intermediate table; but the JoinRel has not been generated yet and has neither a row count nor statistics, so it cannot be estimated precisely. A simple approximation is to estimate the selectivity of the JoinFilter against the base table anyway and apply it in the final row count estimate of the JoinRel.

3. Path generation

With the row count estimation of the previous two sections as groundwork, we can enter the path generation process. What is path generation? There are many ways to join tables (such as NestLoop and HashJoin), and GaussDB (DWS) tables are distributed across the cluster, so there may be several ways to join two tables. Our goal is to start from the given base tables, apply the required operations (filter conditions, join methods and conditions, aggregation, etc.), combine them step by step, and finally obtain the desired result. This is like searching for an optimal path starting from the base tables so that the result is obtained as fast as possible. This section introduces the generation of Join Paths and Aggregate Paths.

1. Join Path generation

The basic idea of the GaussDB (DWS) optimizer is dynamic programming: starting from an initial state, it solves for the optimal solution of each intermediate state and evolves forward step by step, finally obtaining the globally optimal plan. In dynamic programming there is always a variable that drives the process forward; here that variable is the number of tables. In this section we use the following SQL as the example:

select count(*) from t1, t2 where t1.c2 = t2.c2 and t1.c1 < 800 and exists (select c1 from t3 where t1.c1 = t3.c2 and t3.c1 > 100);

This SQL statement has three base tables t1, t2 and t3; the distribution key of each table is column c1, and there are two join conditions:

  1. t1.c2 = t2.c2, t1 is associated with t2
  2. t1.c1 = t3.c2, t1 is associated with t3

To aid the analysis, we use the optimizer log to show what happens. Set the following parameters and then execute the statement:

set logging_module='on(opt_join)';

set log_min_messages=debug2;

Step 1: how to read the data of t1 and t2

First, how should the data of t1 and t2 be read, e.g. Seq Scan, Index Scan, etc.? Since no index was created in this example, the only choice is Seq Scan. The log snippet shows:

[image: Seq Scan path information from the optimizer log]

Remember these three names: path_list, cheapest_startup_path and cheapest_total_path. The latter two correspond to the local optimal solutions of the dynamic programming (cheapest_total_path is actually a set), collectively called the optimal paths; they form the search space for the next level. path_list stores the set of worthwhile candidate Paths for the current Rel (pruned Paths are not kept here); cheapest_startup_path is the Path in path_list with the lowest startup cost, and cheapest_total_path holds the Path(s) in path_list with the lowest total cost (there may be several optimal Paths along different dimensions). Tables t2 and t3 are handled the same way as t1, and their optimal path is also a Seq Scan. With the optimal Scan paths of all the base tables, the join paths can now be selected.

Step 2: find the optimal path for joining (t1, t2)

The distribution key of both t1 and t2 is column c1, but the Join column on both sides is c2, so the theoretically possible paths are (the table placed on the right acts as the inner table):

  1. Broadcast(t1) join t2
  2. t1 join Broadcast(t2)
  3. Broadcast(t2) join t1
  4. t2 join Broadcast(t1)
  5. Redistribute(t1) join Redistribute(t2)
  6. Redistribute(t2) join Redistribute(t1)

Each of these paths can then be combined with a Join method (NestLoop, HashJoin, MergeJoin), giving 18 candidate join paths in total. The optimizer must pick the optimal one among them, and the basis for the choice is the path cost (Cost). The optimizer assigns a cost to each operator: Seq Scan, Redistribute and HashJoin each have a cost, and the cost depends on the data size, data characteristics, system resources, and so on. How costs are estimated will be analyzed in a later article; here we only look at how paths are chosen given those costs. Since cost is proportional to execution time, the optimizer's goal is to choose the plan with the smallest cost, and path selection follows the same principle. The comparison works roughly as follows: each newly generated Path is compared one by one with the Paths already in path_list; if their total_cost values are very close, the startup costs are compared, and if those are also close the new Path is kept in path_list; if the new Path's total_cost is larger but its startup_cost is much smaller, the Path is also kept. The detailed comparison process is omitted here; the result of the comparison is shown directly:

[image: path comparison result from the optimizer log]

It can be seen that the path with the smallest total cost redistributes both sides and uses t1 as the inner table.
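The following sketch illustrates this kind of path bookkeeping: candidate paths compete on total cost and startup cost, dominated paths are discarded, and the cheapest total-cost and startup-cost paths are tracked. It is a simplified illustration of the idea, with made-up cost numbers, not the optimizer's actual add_path logic.

from dataclasses import dataclass

FUZZ = 1.01                       # costs within ~1% are treated as "similar"

@dataclass
class Path:
    desc: str
    startup_cost: float
    total_cost: float

def add_path(path_list, new):
    """Keep new only if no existing path dominates it; drop paths it dominates."""
    keep, add_new = [], True
    for old in path_list:
        old_dominates = (old.total_cost <= new.total_cost * FUZZ and
                         old.startup_cost <= new.startup_cost * FUZZ)
        new_dominates = (new.total_cost <= old.total_cost * FUZZ and
                         new.startup_cost <= old.startup_cost * FUZZ)
        if old_dominates and not new_dominates:
            add_new = False       # new path is no better in any dimension: prune it
        if not (new_dominates and not old_dominates):
            keep.append(old)      # old path survives unless strictly dominated
    if add_new:
        keep.append(new)
    return keep

paths = []
for p in [Path("Redistribute both, t1 inner, HashJoin", 20.0, 80.0),
          Path("Broadcast t2, NestLoop", 5.0, 300.0),
          Path("Broadcast t1, HashJoin", 40.0, 90.0)]:
    paths = add_path(paths, p)    # the third path is dominated by the first and dropped

cheapest_total = min(paths, key=lambda p: p.total_cost)      # the redistribute/HashJoin path
cheapest_startup = min(paths, key=lambda p: p.startup_cost)  # the low-startup NestLoop path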

Step 3: find the optimal path for joining (t1, t3)

The join condition between t1 and t3 is t1.c1 = t3.c2. Since the Join column of t1 is its distribution key c1, no Redistribute needs to be added on the t1 side. Because the join between t1 and t3 is a Semi Join, the outer table cannot be broadcast, otherwise duplicate results could be produced. There is also the additional option of a Unique Path (i.e. deduplicating the t3 table first). The available candidate paths are therefore roughly:

  1. t1 semi join Redistribute(t3)
  2. Redistribute(t3) right semi join t1
  3. t1 join Unique(Redistribute(t3))
  4. Unique(Redistribute(t3)) join t1

Since only one side needs to be redistributed (and can be), Broadcast is not chosen: with the same amount of data, Broadcast generally costs more than Redistribute, so it is pruned in advance. Taking the Join methods into account, the optimizer makes the final choice:

[image: optimal path for (t1, t3) from the optimizer log]

The optimal plan at this point uses the Unique Path on the inner side, i.e. the t3 table is deduplicated first and the join then proceeds as an Inner Join.

Step 4: find the optimal path for joining (t1, t2, t3)

With the preparation of the previous two steps, joining three tables follows a similar idea: formally, the problem is decomposed into a two-table join that is then joined with the third table. In practice, all two-table JoinRels are built first, and then the remaining table is added to each of them one by one, trying the possible associations. The options are:

  • JoinRel(t1, t2) join t3:
    (t1, t2)->(cheapest_startup_path + cheapest_total_path) join t3->(cheapest_startup_path + cheapest_total_path)
  • JoinRel(t1, t3) join t2:
    (t1, t3)->(cheapest_startup_path + cheapest_total_path) join t2->(cheapest_startup_path + cheapest_total_path)
  • JoinRel(t2, t3) join t1: Since there is no (t2, t3) association, this situation does not exist

Each time a pair of inner and outer Paths is joined, the optimizer again decides whether redistribution is needed, whether deduplication is possible, and which join method to use. For example, when JoinRel(t1, t2) is joined with t3, a deduplicated Path of t3 is still tried, because this join is essentially still a Semi Join. The figure below shows part of the worthwhile candidate paths generated during the selection (only a portion is shown due to space limits):

[image: candidate paths for (t1, t2, t3) from the optimizer log]

Among these paths, the optimizer selected the following optimal path:

[image: the selected optimal path from the optimizer log]

Compared with the actual execution plan, the two are the same (the "E-costs" of the level-4 Hash Join match):

[image: the actual execution plan for comparison]

From this process it can be seen that path_list may grow considerably. If there are too many Paths in path_list, there may be several cheapest_total_paths, and the search space of the next level also becomes very large, which ultimately increases plan generation time. A few remarks on Join Path generation (a sketch of the level-by-level search follows the notes below):

  1. When selecting a Join path, the cost is computed in two stages, initial and final. The initial cost quickly estimates things such as building the hash table, computing hash values and spilling to disk; when the initial cost is already larger than that of some path in path_list, the path is pruned early;
  2. There can be multiple cheapest_total_paths mainly because paths whose costs are very close along different dimensions may each be the best choice for the next level of dynamic programming; keeping only one might not lead to the globally optimal plan;
  3. cheapest_startup_path records the path with the lowest startup cost, which preserves yet another dimension: when the query only needs a few result rows, a Path with a small startup cost but a possibly larger total cost may become the preferred one;
  4. Because of pruning, in some cases a Path may be pruned early, or may not be chosen as cheapest_total_path or cheapest_startup_path, even though it is part of the theoretically optimal plan; the final plan is then not optimal. This scenario is generally unlikely, and if you do encounter it you can try tuning with a Plan Hint;
  5. Path generation is closely related to cluster size, system resources, statistics and cost estimation. For example, the number of DNs in the cluster affects distribution skew and the data volume per DN; system memory affects the spill-to-disk cost; statistics are the first-hand data for row count and distinct-value estimation; and the cost model is the key factor for selection and elimination throughout plan generation. An inaccurate row estimate for any JoinRel may affect the final plan. Therefore the same SQL statement may get different plans in different clusters, or in the same cluster with different statistics. If the paths change, you can analyze the Performance information and the logs to locate the cause; for a detailed explanation of Performance, refer to the blog post: GaussDB (DWS) explain performance in detail.
  6. If Random Plan mode is set, the cheapest_startup_path and cheapest_total_path of each dynamic programming level are picked randomly from path_list to guarantee randomness.
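To summarize the four steps, here is a compact sketch of dynamic programming driven by the number of joined tables. It keeps only one cheapest-total-cost entry per relation set and uses made-up cost numbers; the real optimizer keeps whole path lists, multiple optimal paths, distribution alternatives and the pruning described above.

from itertools import combinations

# Made-up cheapest total costs for scanning each base table.
best = {frozenset({t}): (f"Seq Scan {t}", c)
        for t, c in [("t1", 9.25), ("t2", 4.50), ("t3", 5.00)]}

# Join conditions: only these pairs can be joined directly.
joinable = {frozenset({"t1", "t2"}), frozenset({"t1", "t3"})}

def join_cost(outer, inner):
    """Toy cost model: sum of the input costs plus a constant join overhead."""
    return best[outer][1] + best[inner][1] + 30.0

tables = ["t1", "t2", "t3"]
for level in range(2, len(tables) + 1):                 # grow one table at a time
    for combo in combinations(tables, level):
        rel = frozenset(combo)
        for t in combo:                                 # split rel into (rel - t) join t
            outer, inner = rel - {t}, frozenset({t})
            if outer not in best:
                continue                                # the outer subset was never buildable
            # require a join condition linking the added table to the outer set
            if not any(frozenset({t, o}) in joinable for o in outer):
                continue
            cost = join_cost(outer, inner)
            if rel not in best or cost < best[rel][1]:
                best[rel] = (f"({best[outer][0]}) join {t}", cost)

print(best[frozenset(tables)])    # cheapest way found to join all three tables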

2. Generation of Aggregate Path

Generally speaking, Aggregate Paths are generated after the join paths, in three main steps (the Aggregate inside a Unique Path is completed during Join Path generation, but it goes through the same three steps): first estimate the number of rows of the aggregated result, then select the aggregation strategy, and finally create the optimal Aggregate Path. The first step relies on statistics and the cost model; the later steps depend on that estimate, the cluster size and the system resources. The estimated number of aggregate rows is mainly derived from the distinct values of the aggregation columns. Below we focus on aggregate row count estimation and the choice of the optimal Aggregate Path.

2.1 Aggregate row count estimation

Take the following SQL as an example:

select t1.c2, t2.c2, count(*) cnt from t1, t2 where t1.c2 = t2.c2 and t1.c1 < 500 group by t1.c2, t2.c2;

This statement first joins the two tables, with a filter condition on the base table, and then groups the result by two columns. The aggregation columns are t1.c2 and t2.c2. First look at the raw information in the system catalog:

postgres=# select tablename, attname, null_frac, n_distinct, n_dndistinct from pg_stats where (tablename = 't1' or tablename = 't2') and attname = 'c2';
 tablename | attname | null_frac | n_distinct | n_dndistinct
-----------+---------+-----------+------------+--------------
 t1        | c2      |         0 |       -.25 |     -.431535
 t2        | c2      |         0 |        100 |      -.39834
(2 rows)

The statistics show that the original distinct values of t1.c2 and t2.c2 are -0.25 and 100 respectively; -0.25 corresponds to 0.25 * 2000 = 500. Is their combined distinct value therefore at least 500? The answer is no. Aggregate works on the result of JoinRel(t1, t2), whereas the statistics in the system catalog describe the original, unfiltered data, so both the Join condition and the filter condition must be taken into account. How? First, the filter condition "t1.c1 < 500" filters out part of t1.c2, giving one selectivity (call it FilterRatio); the Join condition "t1.c2 = t2.c2" gives another (call it JoinRatio); both are numbers in [0, 1]. When estimating the distinct value of t1.c2, the influence of both ratios must be considered. If the Poisson model is chosen between different columns and the fully correlated model between identical columns, the distinct value of t1.c2 is approximately:

distinct(t1.c2) = Poisson(d0, ratio1, nRows) * ratio2

Here d0 is the original distinct value in the base table, ratio1 is the ratio applied with the Poisson model, ratio2 is the ratio applied with the fully correlated model, and nRows is the number of rows in the base table. If you need to locate or analyze a problem, these ratios can be found in the log by running the SQL statement after the following settings:

set logging_module='on(opt_card)';

set log_min_messages=debug3;

In this example, the two ratios on table t1 can be seen in the log:

[image: FilterRatio and JoinRatio from the optimizer log]

Now look at t2.c2. Its original distinct value is 100, and the log above shows that all data in t2 is matched (no ratio applied), so the distinct value of t2.c2 after the Join is still 100. At this point t1.c2 and t2.c2 cannot simply be combined by multiplication, because "t1.c2 = t2.c2" implies the two columns have identical values, i.e. they are equivalent, so only Min(distinct(t1.c2), distinct(t2.c2)) needs to be considered. The output below shows the actual and estimated row counts given by Performance:

postgres=# explain performance select t1.c2, t2.c2, count(*) cnt from t1, t2 where t1.c2 = t2.c2 and t1.c1 < 500 group by t1.c2, t2.c2;
                                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
  id |                   operation                   |      A-time      | A-rows | E-rows | E-distinct |  Peak Memory   | E-memory | A-width | E-width | E-costs
 ----+-----------------------------------------------+------------------+--------+--------+------------+----------------+----------+---------+---------+---------
   1 | ->  Streaming (type: GATHER)                  | 48.500           |     99 |    100 |            | 80KB           |          |         |      16 | 89.29
   2 |    ->  HashAggregate                          | [38.286, 40.353] |     99 |    100 |            | [28KB, 31KB]   | 16MB     | [24,24] |      16 | 79.29
   3 |       ->  Hash Join (4,6)                     | [37.793, 39.920] |   1980 |   2132 |            | [6KB, 6KB]     | 1MB      |         |       8 | 75.04
   4 |          ->  Streaming(type: REDISTRIBUTE)    | [0.247, 0.549]   |   1001 |   1001 | 25         | [53KB, 53KB]   | 2MB      |         |       4 | 32.95
   5 |             ->  Seq Scan on test.t2           | [0.157, 0.293]   |   1001 |   1001 |            | [12KB, 12KB]   | 1MB      |         |       4 | 4.50
   6 |          ->  Hash                             | [36.764, 38.997] |    998 |   1000 | 62         | [291KB, 291KB] | 16MB     | [20,20] |       4 | 29.88
   7 |             ->  Streaming(type: REDISTRIBUTE) | [36.220, 38.431] |    998 |    999 |            | [53KB, 61KB]   | 2MB      |         |       4 | 29.88
   8 |                ->  Seq Scan on test.t1        | [0.413, 0.433]   |    998 |    999 |            | [14KB, 14KB]   | 1MB      |         |       4 | 9.25
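As a rough illustration of where the 100-group estimate could come from, the sketch below applies the formula distinct(t1.c2) = Poisson(d0, ratio1, nRows) * ratio2 to t1.c2, does the same for t2.c2 (whose ratios are 1 here), and then takes the minimum of the two equivalent columns. The concrete ratio values used below are assumptions for illustration only; the real values are the FilterRatio/JoinRatio printed in the log.

from math import exp

def poisson_distinct(d0, ratio, nrows):
    """Distinct values surviving a filter of selectivity ratio (Poisson model)."""
    return d0 * (1.0 - exp(-nrows * ratio / d0))

# t1.c2: original distinct 500, 2000 rows; assume FilterRatio ~0.5 for "c1 < 500"
# (Poisson model across different columns) and JoinRatio ~1.0 (fully correlated model).
d_t1c2 = poisson_distinct(500, 0.5, 2000) * 1.0     # ~432
# t2.c2: original distinct 100; all rows are matched, so no ratio applies.
d_t2c2 = 100
# "t1.c2 = t2.c2" makes the two GROUP BY columns equivalent.
groups = min(d_t1c2, d_t2c2)                        # 100, matching the HashAggregate E-rows above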

2.2 Aggregate Path generation

With the estimated number of aggregate rows, the aggregation strategy can be chosen flexibly according to the resource situation. There are three main Aggregate strategies:

  1. Aggregate + Gather (+ Aggregate)
  2. Redistribute + Aggregate (+Gather)
  3. Aggregate + Redistribute + Aggregate (+Gather)

The step in parentheses may be absent, depending on the specific situation. Choosing among these aggregation strategies can be understood like choosing between Redistribute and Broadcast when joining two tables. After the optimizer obtains the final number of aggregated rows, it tries each strategy, computes the corresponding cost, selects the optimal one and generates the path. When there are two Aggregate layers, the last layer produces the final number of aggregate rows, and the number of rows after the first layer is computed with the Poisson model. By default the optimizer picks the Aggregate strategy by cost; the user can also specify it through the parameter best_agg_plan. The general applicability of the three strategies is as follows (a small sketch follows the list):

  • The first: the row count after direct aggregation is not too large; usually the DNs aggregate and the CN gathers, sometimes with a second aggregation on the CN.
  • The second: redistribution is needed and direct aggregation would not reduce the row count significantly.
  • The third: redistribution is needed and direct aggregation reduces the row count significantly, and redistribution can then reduce it again; usually the DNs aggregate, redistribute, then aggregate again, commonly called two-level Aggregate.
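Here is a minimal sketch of the kind of reasoning described above: compare the estimated group count against the input row count and pick one of the three shapes. The threshold and the decision structure are illustrative assumptions only; the optimizer actually costs each alternative, and as noted, best_agg_plan can force a particular choice.

def choose_agg_strategy(input_rows, est_groups, grouped_by_distribution_key):
    """Pick one of the three aggregation shapes from rough row-count reasoning."""
    if grouped_by_distribution_key:
        # Groups are already co-located: aggregate on the DNs, gather on the CN.
        return "Aggregate + Gather (+ Aggregate)"
    reduction = est_groups / float(input_rows)
    if reduction > 0.5:
        # Aggregating first would barely shrink the data: redistribute, then aggregate.
        return "Redistribute + Aggregate (+ Gather)"
    # Aggregating first shrinks the data a lot: aggregate, redistribute, aggregate again.
    return "Aggregate + Redistribute + Aggregate (+ Gather)"

print(choose_agg_strategy(1_000_000, 500, False))       # third shape: two-level Aggregate
print(choose_agg_strategy(1_000_000, 800_000, False))   # second shape: redistribute first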

4. Concluding remarks

This article has walked through the core steps of plan generation, from row count estimation to Join Path generation and then to Aggregate Path generation, introducing the basic principles of the simplest cases. The actual processing is far more complex than described here, with many more situations to consider: how to combine the optimal selectivity of multiple groups of conditions, how to choose the distribution key, how to handle skew, how much memory to use, and so on. Plan generation is full of trade-offs: sometimes something must be given up in order to gain elsewhere, and sometimes a small weakness in a plan can be ignored or compensated by other capabilities. For example, after SMP is enabled, parallel execution can dilute some defects in the plan. All in all, plan generation is complex and meticulous work; producing a globally optimal plan requires continuous discovery and optimization. We will continue to explore the secrets of plan generation in follow-up posts.



