Abstract: As the Art of War says, "Plan before you move; know when to stop and you will gain." Whatever you do, you must plan and prepare first; only then can the undertaking succeed, and you should never act rashly. Similarly, GaussDB(DWS) executes a query statement according to a predetermined plan. Given the same hardware environment, how fast the query runs depends on the quality of that plan. So how is the plan for a query statement produced? This article explains row count estimation and path generation.

This article is shared from the Huawei Cloud community post "GaussDB(DWS) Plan Generation Principles Revealed (1)", original author: Jugg.

The GaussDB(DWS) optimizer has two plan generation methods: dynamic programming and a genetic algorithm. The former is used in the vast majority of cases and is the focus of this series. Generally speaking, after a SQL statement is parsed into a syntax tree (ParseTree) and converted into a query tree (QueryTree) with a specific structure, plan generation starts from the QueryTree and enters its core phase. The key steps are:

  1. Set the initial degree of parallelism (DOP)
  2. Query rewriting
  3. Estimate the number of base table rows
  4. Estimate the number of rows of the associated tables (Joinrel)
  5. Generate paths and select the optimal Path
  6. Create the Plan nodes for execution from the optimal Path
  7. Adjust optimal parallelism

This article focuses on steps 3, 4, and 5, which have the greatest impact on plan generation and mainly involve row count estimation, path selection, and cost estimation. Cost estimation is the basis of path selection; each operator has its own cost model, which is a relatively independent topic and will be explained in subsequent articles. Plan Hints can be used to intervene in steps 3, 4, and 5.

Let’s start with a simple query:

select count(*) from t1 join t2 on t1.c2 = t2.c2 and t1.c1 > 100 and (t1.c3 is not null or t2.c3 is not null);

The GaussDB(DWS) optimizer gives the following execution plan:

postgres=# explain verbose select count(*) from t1 join t2 on t1.c2 = t2.c2 and t1.c1 > 100 and (t1.c3 is not null or t2.c3 is not null);
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 id |                 operation                  | E-rows | E-distinct | E-memory | E-width | E-costs
----+--------------------------------------------+--------+------------+----------+---------+---------
  1 | ->  Aggregate                              |      1 |            |          |       8 | 111.23
  2 |    ->  Streaming (type: GATHER)            |      4 |            |          |       8 | 111.23
  3 |       ->  Aggregate                        |      4 |            | 1MB      |       8 | 101.23
  4 |          ->  Hash Join (5,7)               |   3838 |            | 1MB      |       0 | 98.82
  5 |             ->  Streaming(type: REDISTRIBUTE) | 1799 |           | 2MB      |      10 | 46.38
  6 |                ->  Seq Scan on test.t1     |   1799 |            | 1MB      |      10 | 9.25
  7 |             ->  Hash                       |   1001 |         25 | 16MB     |       8 | 32.95
  8 |                ->  Streaming(type: REDISTRIBUTE) | 1001 |        | 2MB      |       8 | 32.95
  9 |                   ->  Seq Scan on test.t2  |   1001 |            | 1MB      |       8 | 4.50

   Predicate Information (identified by plan id)
 -----------------------------------------------
   4 --Hash Join (5,7)
         Hash Cond: (t1.c2 = t2.c2)
         Join Filter: ((t1.c3 IS NOT NULL) OR (t2.c3 IS NOT NULL))
   6 --Seq Scan on test.t1
         Filter: (t1.c1 > 100)

Usually, the plan of a query statement starts from the base tables. In this example there are several filter conditions involving base table t1. As the plan shows, some of these conditions are pushed down to the base table while others are not. Let's start with the row count estimate of the base table.

I. Estimating the number of rows in the base table

If there are no filter conditions on the base table, or the filter conditions cannot be pushed down to it, then the base table row count estimate is simply the number of rows recorded in the statistics and needs no special processing. In this section, filter conditions pushed down to the base table are considered in two cases: single column and multiple columns.

1. Estimating single-column filter conditions

Base table row count estimation currently relies mainly on statistics, which are aggregate information about sampled data collected by ANALYZE before planning. For example, some statistics of table t1 are as follows:

postgres=# select tablename, attname, null_frac, n_distinct, n_dndistinct, avg_width, most_common_vals, most_common_freqs from pg_stats where tablename = 't1';
 tablename | attname | null_frac | n_distinct | n_dndistinct | avg_width | most_common_vals | most_common_freqs
-----------+---------+-----------+------------+--------------+-----------+------------------+-------------------
 t1        | c1      |         0 |        -.5 |          -.5 |         4 |                  |
 t1        | c2      |         0 |       -.25 |     -.431535 |         4 |                  |
 t1        | c3      |        .5 |          1 |            1 |         6 | {gauss}          | {.5}
 t1        | c4      |        .5 |          1 |            1 |         8 | {gaussdb}        | {.5}
(4 rows)

The meaning of each field is as follows:

  • null_frac: proportion of NULL values in the column
  • n_distinct: global number of distinct values; a positive value is the distinct count itself, while a negative value is the ratio of the distinct count to the number of rows (see the query sketch after this list)
  • n_dndistinct: the distinct value on DN1, with rules similar to n_distinct
  • avg_width: average width of the column, in bytes
  • most_common_vals: list of high-frequency values (MCV)
  • most_common_freqs: list of the proportions of the high-frequency values, corresponding to most_common_vals
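To read n_distinct more conveniently, a negative value can be converted into an absolute distinct count by multiplying it by the table's row count. A minimal query sketch (it simply combines the pg_stats and pg_class catalogs used elsewhere in this article; treat it as illustrative):

-- Convert a negative n_distinct (a ratio) into an absolute distinct count
select s.attname, s.n_distinct, c.reltuples,
       case when s.n_distinct < 0 then -s.n_distinct * c.reltuples
            else s.n_distinct end as est_distinct
from pg_stats s
join pg_class c on c.relname = s.tablename
where s.tablename = 't1' and s.attname = 'c1';

For t1.c1 this gives 0.5 × 2000 = 1000 distinct values, i.e. each value appears about twice on average.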

For example, the average width of column t1.c1 is 4 bytes, each value repeats about twice on average, there are no NULL values, and no value accounts for a significantly higher proportion than the others, so most_common_vals (MCV) is empty. For such evenly distributed data, a number of buckets is allocated, the data is divided among them in an equal-height fashion, and the boundary of each bucket is recorded. This is the histogram: every bucket holds roughly the same amount of data.

With this basic information, the number of rows in the base table can be roughly estimated. Take the filter condition "t1.c1 > 100" on table t1 as an example, combined with the uniform distribution of column t1.c1 and the concrete data distribution:

postgres=# select histogram_bounds from pg_stats where tablename = 't1' and attname = 'c1';
                                    histogram_bounds
------------------------------------------------------------------------------------------
 {1,10,20,30,40,50,60,70,80,90,100,110,120,130,140,150,160,170,180,190,200,210,220,230,240,250,260,270,280,290,300,310,320,330,340,350,360,370,380,390,400,410,420,430,440,450,460,470,480,490,500,510,520,530,540,550,560,570,580,590,600,610,620,630,640,650,660,670,680,690,700,710,720,730,740,750,760,770,780,790,800,810,820,830,840,850,860,870,880,890,900,910,920,930,940,950,960,970,980,990,1000}
(1 row)

It shows that the data in column t1.c1 is distributed between 1 and 1000, and every pair of adjacent boundaries contains roughly the same amount of data (the boundaries are based on sampled statistics). First, locate 100 in this histogram; here it happens to be the boundary of a bucket (sometimes it falls inside a bucket). The proportion of data satisfying t1.c1 > 100 is then approximately the proportion of buckets after the boundary 100. This proportion is called the selectivity, i.e. the fraction of data that remains after applying the condition, and from it the number of rows passing "t1.c1 > 100" can be estimated.
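As a rough check of this idea using the histogram above (about 100 equal-height buckets covering 1 to 1000):

selectivity(t1.c1 > 100) ≈ (buckets after boundary 100) / (total buckets) ≈ 90 / 100 = 0.9
estimated rows ≈ 2000 × 0.9 ≈ 1800

The optimizer's actual value is 8.995e-01 (see Section II), which gives the 1799 E-rows shown for the Seq Scan on t1 in the plan above.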

This is the basic idea of estimating the number of rows in a base table. In general:

When statistics are available:

  1. Equivalence condition: 1) compare against the MCV list; for values that satisfy the filter condition, accumulate the corresponding selectivities (i.e. most_common_freqs); 2) for histogram data, roughly estimate the selectivity from the number of distinct values.
  2. Range condition: 1) compare against the MCV list and accumulate the selectivities of the matching values; 2) for histogram data, estimate the selectivity from the position of the boundary.
  3. Not-equal condition: it can be converted into an equivalence condition for estimation.
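For example, for the equivalence condition t1.c3 = 'gauss', the statistics above show most_common_vals = {gauss} with most_common_freqs = {.5}, so the selectivity is about 0.5 and the estimated row count is about 2000 × 0.5 = 1000.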

When statistics are not available:

  1. Equivalence condition: for example, the filter condition is "substr(c3, 1, 5) = 'gauss'". Column c3 has statistics, but substr(c3, 1, 5) does not. How is the selectivity of this condition estimated? A simple idea: if the number of distinct values of substr(c3, 1, 5) is known, we can roughly assume that each distinct value repeats the same number of times, and the selectivity can then be estimated. In GaussDB(DWS), distinct-value estimation for expressions can be enabled by setting cost_model_version = 1 (see the sketch after this list).
  2. Range condition: in this case, knowing only the distinct values of substr(c3, 1, 5) is not enough to estimate the selectivity. For expressions that cannot be estimated, the distinct count to assume can be specified by setting qual_num_distinct.
  3. Not-equal condition: it can be converted into an equivalence condition for estimation.
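A sketch of the two parameters mentioned above (both are named in this article; the values here are only illustrative):

-- Enable distinct-value estimation for expressions such as substr(c3, 1, 5)
set cost_model_version = 1;
-- For an expression the optimizer still cannot estimate, specify the distinct count to assume
set qual_num_distinct = 200;
select count(*) from t1 where substr(c3, 1, 5) = 'gauss';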

2. Estimating multiple filter conditions

For example, if table t1 has two filter conditions, t1.c1 = 100 and t1.c3 = 'gauss', how is the combined selectivity of the two columns estimated? In GaussDB(DWS), there are two general approaches:

Only single-column statistics are available

In this case, the selectivity of each filter condition is first calculated from the single-column statistics, and then a way of combining these selectivities is chosen; the combination method can be specified with the parameter cost_param. Why does the combination method need to be chosen? Because in real data models there is some correlation between columns: in some scenarios the correlation is strong, in others it is weak, and the strength of the correlation determines the final row count. The meaning and usage of this parameter are described in "GaussDB(DWS) Performance Tuning Series, Practical Part 5: Path Intervention with the Eighteen Weapons".

Multi-column (composite) statistics are available

If composite statistics have been collected for the combination of filtered columns, the optimizer uses them in preference to estimate the row count. The basic idea is the same as for a single column: the multi-column combination is formally treated as one "column", and the multi-column statistics are then used to estimate the number of rows.

For example, suppose the multi-column statistics are ((c1, c2, c4)) and ((c1, c2)), where double parentheses denote one group of multi-column statistics. Then for the following filter conditions:

  1. c1 = 7 and c2 = 3 and c4 = 5: use ((c1, c2, c4))
  2. c1 = 7 and c2 = 3: use ((c1, c2))
  3. c1 = 7 and c2 = 3 and c5 = 6: use ((c1, c2)) for c1 and c2, and single-column statistics for c5

The general principle for matching multi-column conditions to multi-column statistics is:

  1. The column combination of the multi-column statistics must be contained in the column combination of the filter conditions;
  2. Among all multi-column statistics satisfying condition 1, choose the one whose column combination has the largest intersection with the columns of the filter conditions.

For filter conditions whose columns do not match any multi-column statistics, single-column statistics are used for the estimate.
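For reference, a sketch of how the groups of multi-column statistics used above could be declared and collected in GaussDB(DWS) (extended-statistics syntax; verify the exact form against your product version):

-- Declare the multi-column statistics targets, then collect them with ANALYZE
alter table t1 add statistics ((c1, c2));
alter table t1 add statistics ((c1, c2, c4));
analyze t1;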

3. Notable points

  • Range conditions are currently not supported when using multi-column statistics. If there are several groups of conditions, the selectivities of the groups are multiplied together to give the overall selectivity.
  • If a single filter condition involves multiple columns, such as "t1.c1 < t1.c2", it generally cannot be estimated from single-column statistics, because single-column statistics are collected independently of each other and there is no way to tell whether two values come from the same row. The current multi-column statistics mechanism also does not support base table filter conditions that span multiple columns.
  • Filter conditions that cannot be pushed down to the base table, such as "t1.c3 is not null or t2.c3 is not null", are not included in the base table row count estimate. Such a condition is commonly called a JoinFilter and is estimated when the Joinrel is created.
  • If no statistics are available, a default selectivity is used.

II. Estimating the number of Joinrel rows

After the base table row counts have been estimated, we can move on to the table association stage. To join two tables, the optimizer needs information such as the number of rows in each base table and the number of rows after the join, and it must choose among the possible join methods (also called Path selection, see the next section) and pick the cheapest one, i.e. the best Path. Join condition estimation also has single-condition and multi-condition cases: the optimizer computes the combined selectivity of all Join conditions and JoinFilters and from it derives the estimated row count. First, let's see how the selectivity of a single join condition is estimated.

1. Estimating a single Join condition

Similar to base table filter conditions, statistics are used to estimate the row count. Take the join condition in the SQL example above, t1.c2 = t2.c2:

postgres=# select tablename, attname, null_frac, n_distinct, n_dndistinct, avg_width, most_common_vals, most_common_freqs from pg_stats where tablename = 't1' and attname = 'c2';
 tablename | attname | null_frac | n_distinct | n_dndistinct | avg_width | most_common_vals | most_common_freqs
-----------+---------+-----------+------------+--------------+-----------+------------------+-------------------
 t1        | c2      |         0 |       -.25 |     -.431535 |         4 |                  |
(1 row)

Column t1.c2 has no MCV, and each distinct value repeats about 4 times on average and is evenly distributed. Since a histogram retains only bucket boundaries rather than the actual data (and these boundaries may change each time statistics are collected), comparing the boundary values with those of t2.c2 is not practical and could introduce large errors. Instead we adopt one assumption: if t1.c2 has 500 distinct values and t2.c2 has 100 distinct values, then those 100 distinct values all appear among the 500, i.e. the distinct values of the column with the smaller count are contained in those of the column with the larger count. Although this assumption is somewhat strong, it matches reality in many cases. Back to this example: the statistics show a negative n_distinct, which represents a ratio, and the estimated number of rows in table t1 is 2000:

postgres=# select reltuples from pg_class where relname = 't1';

 reltuples

-----------

      2000

(1 row)

So the distinct count of t1.c2 is 0.25 × 2000 = 500. Similarly, the distinct count of t2.c2 is 100 according to the statistics:

postgres=# select tablename, attname, null_frac, n_distinct, n_dndistinct from pg_stats where tablename = 't2' and attname = 'c2';

 tablename | attname | null_frac | n_distinct | n_dndistinct

-----------+---------+-----------+------------+--------------

 t2        | c2      |         0 |        100 |      -.39834

(1 row)

So, can 500 be used as the distinct count of t1.c2? The answer is no. Since there is also a filter "t1.c1 > 100" on base table t1, and the join happens after the base table is filtered, what should be estimated is the distinct count after filtering, not the one in the original table. Various assumption models can be used for this estimate, for example the Poisson model (assuming t1.c1 and t1.c2 are weakly correlated) or the fully correlated model (assuming t1.c1 and t1.c2 are perfectly correlated), and they give different values. Here, the selectivity of "t1.c1 > 100" is 8.995000e-01, and the distinct counts produced by the different models are as follows:

  1. Poisson model (weak correlation model): 500 * (1.0 - exp(-2000 * 8.995000e-01 / 500)) ≈ 486
  2. Fully correlated model: 500 * 8.995000e-01 ≈ 450
  3. Fully uncorrelated model: 500 * (1.0 - pow(1.0 - 8.995000e-01, 2000 / 500)) ≈ 499.9; this model can be derived with a simple probabilistic argument, which interested readers can try themselves
  4. The actual distinct count after filtering is 500, since c2 is not correlated with c1:
postgres=# select count(distinct c2) from t1 where c1 > 100;

 count

-------

   500

(1 row)

Once the distinct count of t1.c2 after filtering has been estimated, the selectivity of "t1.c2 = t2.c2" can be estimated as 1 / distinct.
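Putting the pieces together, a rough sketch of the calculation under the containment assumption above (the exact figure also depends on null fractions, per-DN distinct values, and model details):

selectivity(t1.c2 = t2.c2) ≈ 1 / max(distinct(t1.c2 after filter), distinct(t2.c2))
rows(join) ≈ rows(t1 after filter) × rows(t2) × selectivity ≈ 1799 × 1001 / 486 ≈ 3700

which is of the same order as the 3838 rows estimated for the Hash Join in the plan at the beginning of this article.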

The above covers the case where neither side has an MCV. If both t1.c2 and t2.c2 have MCVs, their MCVs are compared first, because MCV entries carry explicit frequencies and matching entries can be accumulated directly; the histogram values are matched afterwards.

2. Estimating multiple groups of Join conditions

When a table join contains multiple Join conditions, there are, as with base table filter conditions, two approaches. The preferred one is to use multi-column statistics to estimate the selectivity. When multi-column statistics are not available, single-column statistics are used to calculate the selectivity of each Join condition as described above, and the way the selectivities are combined is again controlled by the parameter cost_param. For details, refer to "GaussDB(DWS) Performance Tuning Series, Practical Part 5: Path Intervention with the Eighteen Weapons".

In addition, selectivity is estimated as follows in some special cases:

  • If the Join column is an expression without statistics, the optimizer tries to estimate its distinct values and then proceeds as in the no-MCV case.
  • Left Join / Right Join needs special consideration for null padding on one side and full output on the other; the model above is adjusted accordingly.
  • If the join condition is a range comparison, such as "t1.c2 < t2.c2", a default selectivity of 1/3 is currently used.

3. Estimating the JoinFilter

When two tables are joined, filter conditions on the base tables that cannot be pushed down generally become JoinFilters, i.e. they are applied during the Join itself. A JoinFilter therefore affects the number of rows of the Joinrel, but not the number of rows scanned from the base tables. Strictly speaking, if the Joinrel is viewed as an intermediate table, these JoinFilters are its filter conditions; but the Joinrel has not been generated yet and has neither a row count nor statistics, so an exact estimate is impossible. A simple approximation is to still use the base tables to roughly estimate the selectivity of the JoinFilter and then fold it into the final Joinrel row count estimate.

III. Path generation

With the row count estimates from the previous two sections in place, we can enter the path generation process. What is path generation? Tables can be joined in multiple ways (e.g. NestLoop, HashJoin), and since GaussDB(DWS) tables are distributed across the cluster, two tables may be joined along several different routes. Our goal is to start from the given base tables and, following the required operations (filters, join methods and conditions, aggregation, etc.), combine them step by step until we obtain the desired result. It is like setting out from the base tables and looking for the best route that gets us there fastest; that is what path generation does. This section introduces the generation of Join Paths and Aggregate Paths.

1. Generation of Join Path

The basic approach of the GaussDB(DWS) optimizer is dynamic programming. As the name implies, starting from an initial state, it solves the optimal solutions of intermediate states and moves forward step by step until the globally optimal plan is obtained. In dynamic programming there is always a variable that drives the evolution; here that variable is the number of tables joined so far. This section uses the following SQL as an example:

select count(*) from t1, t2 where t1.c2 = t2.c2 and t1.c1 < 800 and exists (select c1 from t3 where t1.c1 = t3.c2 and t3.c1 > 100);

This SQL statement has three base tables (t1, t2, t3), all distributed by column c1, and two join conditions:

  1. t1.c2 = t2.c2: t1 joins t2
  2. t1.c1 = t3.c2: t1 joins t3

To make the analysis easier to follow, we use the optimizer log: set the following parameters and then execute the statement.

set logging_module='on(opt_join)';

set log_min_messages=debug2;

Step 1: how to obtain the data of each base table

First, consider how the data of each base table is read, e.g. Seq Scan, Index Scan, and so on. Since no index was created in this example, only Seq Scan is available; the log records the candidate paths for each base table.

Keep three sets of paths in mind: path_list, cheapest_startup_path, and cheapest_total_path. The latter two correspond to the local optimal solutions of the dynamic programming; together they are referred to as the optimal paths and form the search space for the next level. path_list holds the worthwhile candidate paths of the current rel (pruned paths are not kept here); cheapest_startup_path is the path in path_list with the lowest startup cost; cheapest_total_path is a set of paths in path_list with the lowest total cost (a set, mainly because near-optimal paths may be kept along several dimensions). t2 and t3 are handled similarly, and their optimal path is also a Seq Scan. With the optimal scan paths of all base tables in hand, the join paths can be chosen.

Step 2: solve the optimal path for the (t1, t2) join

Both t1 and t2 are distributed by column c1, but the join column is c2, so the theoretically possible paths are (the table on the right is the inner table):

  1. Broadcast(t1) join t2
  2. t1 join Broadcast(t2)
  3. Broadcast(t2) join t1
  4. t2 join Broadcast(t1)
  5. Redistribute(t1) join Redistribute(t2)
  6. Redistribute(t2) join Redistribute(t1)

Each of these paths can then be paired with a different Join method (NestLoop, HashJoin, MergeJoin), giving 18 candidate join paths in total. The optimizer has to select the optimal one among them, and the selection is based on each path's Cost. The optimizer assigns a cost to every operator, such as Seq Scan, Redistribute, and Hash Join; the cost is related to data size, data characteristics, system resources, and so on, and how it is estimated will be analyzed in later articles. Here we focus only on how paths are chosen based on these costs. Since cost is meant to be proportional to execution time, the optimizer's goal is to pick the plan with the lowest cost, and path selection works the same way: if the total costs are similar, the startup costs are compared; if a new path has a larger total_cost but a much smaller startup_cost, it is still kept in path_list. The comparison process is omitted here; the result of the path comparison is:

It can be seen that the path with the lowest total cost redistributes both sides and uses t1 as the inner table.

Step 3: solve the optimal path for the (t1, t3) join

The join condition between t1 and t3 is t1.c1 = t3.c2. Since t1's join column is its distribution key c1, t1 does not need to be redistributed. t3 is joined to t1 with a Semi Join, so the outer table cannot be broadcast, otherwise duplicate results could be produced. There is also the Unique Path option (i.e. deduplicating table t3), so the candidate paths are roughly as follows:

  1. t1 semi join Redistribute(t3)
  2. Redistribute(t3) right semi join t1
  3. t1 join Unique(Redistribute(t3))
  4. Unique(Redistribute(t3)) join t1

Since only one side needs to be redistributed and it can be redistributed, Broadcast is not chosen: broadcasting the same amount of data is generally more expensive than redistributing it. Taking the Join methods into account, the optimizer gives its final choice:

Here the optimal plan selects the path whose inner table is the Unique Path, i.e. table t3 is deduplicated first and then joined with an Inner Join.

Step 4: solve the optimal path for the (t1, t2, t3) join

With the preparation of the previous two steps, joining three tables follows a similar idea: formally, two tables are joined first and the third is added afterwards. In practice, the optimizer takes every existing two-table Joinrel and tries to join the remaining table to it:

  • Joinrel (t1, t2) join t3: (t1, t2)->(cheapest_startup_path + cheapest_total_path) join t3->(cheapest_startup_path + cheapest_total_path)
  • Joinrel (t1, t3) join t2: (t1, t3)->(cheapest_startup_path + cheapest_total_path) join t2->(cheapest_startup_path + cheapest_total_path)
  • Joinrel (t2, t3) join t1: there is no join between (t2, t3), so this Joinrel does not exist

For example, when Joinrel (t1, t2) joins t3, the optimizer also tries the deduplicated path of table t3, because the nature of the join is still a Semi Join. The following shows some of the worthwhile candidate paths generated during the selection (only part is reproduced for reasons of space):

Among these paths, the optimizer selects the following optimal path:

This matches the actual execution plan (for example, the E-costs of the level-4 Hash Join are the same):

From this process we can see that path_list may expand. If there are too many paths in path_list, there may be several cheapest_total_paths, the search space of the next level becomes very large, and plan generation time ultimately increases. A few points about Join Path generation:

  1. When selecting a Join path, the cost is calculated in two stages: an initial cost and a final cost. The initial cost quickly estimates things such as building the hash table, computing hash values, and spilling to disk. If the initial cost is already larger than that of some path in the path_list, the path is pruned early.
  2. cheapest_total_path is a set for a reason: under multiple dimensions, paths with similar costs may each turn out to be the best choice at the next level of dynamic programming, and keeping only one might not lead to the globally optimal plan.
  3. cheapest_startup_path records the path with the lowest startup cost, reserved as another dimension. When the query needs only a few results, a path with a low startup cost but possibly high total cost may become the first choice.
  4. Because of pruning, a path may in some cases be pruned early, or never selected as cheapest_total_path or cheapest_startup_path, even though it is part of the theoretically optimal plan, so the final plan may fall short of optimal. Such scenarios are usually unlikely; if one occurs, Plan Hints can be used for tuning (see the sketch after this list).
  5. Path generation is closely tied to cluster size, system resources, statistics, and cost estimation. For example, the number of DNs in the cluster affects the skew after redistribution and the data volume per DN, and system memory affects the cost of spilling to disk. Statistics are the first-hand input for row count and distinct-value estimation, while the cost model runs through the whole of plan generation and is the key factor in keeping or discarding paths; an inaccurate row estimate for any Joinrel may affect the final plan. Therefore the same SQL statement may get different plans on different clusters, or on the same cluster with different statistics. If a path change needs to be analyzed, the Performance information and the log can be used to locate the problem; for details on Performance output, see the blog post "Explain Performance in GaussDB(DWS)".
  6. If random plan mode is set, the cheapest_startup_path and cheapest_total_path of each level of the dynamic programming are chosen randomly from the path_list, which guarantees randomness.
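As a sketch of the Plan Hint escape hatch mentioned in point 4 (GaussDB(DWS) accepts hints in /*+ ... */ comments; the join order and method below are purely illustrative, not a recommendation):

select /*+ leading((t1 t2)) hashjoin(t1 t2) */ count(*)
from t1, t2
where t1.c2 = t2.c2 and t1.c1 < 800
  and exists (select c1 from t3 where t1.c1 = t3.c2 and t3.c1 > 100);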

2. Generation of Aggregate Path

In general, Aggregate Path generation happens after the table join paths have been generated, and it has three main steps (for Unique Paths the aggregation is already done during Join Path generation, but the same three steps apply): first estimate the number of rows of the Aggregate result, then choose the aggregation method, and finally create the optimal Aggregate Path. The first step depends on statistics and the cost estimation model; the later steps depend on that estimate, the cluster size, and system resources. The Aggregate row count is estimated mainly by combining the distinct values of the grouping columns. Below we focus on Aggregate row count estimation and the selection of the optimal Aggregate Path.

2.1 Aggregate row count estimation

Take the following SQL as an example to illustrate:

select t1.c2, t2.c2, count(*) cnt from t1, t2 where t1.c2 = t2.c2 and t1.c1 < 500 group by t1.c2, t2.c2;

This statement first joins the two tables, with a filter condition on base table t1, and then groups the result by two columns. The two grouping columns here are t1.c2 and t2.c2. First look at the raw information in the system catalog:

postgres=# select tablename, attname, null_frac, n_distinct, n_dndistinct from pg_stats where (tablename = 't1' or tablename = 't2') and attname = 'c2';

 tablename | attname | null_frac | n_distinct | n_dndistinct

-----------+---------+-----------+------------+--------------

 t1        | c2      |         0 |       -.25 |     -.431535

 t2        | c2      |         0 |        100 |      -.39834

(2 rows)

The statistics show that the raw distinct values of t1.c2 and t2.c2 are -0.25 and 100 respectively. Converting -0.25 to an absolute value gives 0.25 × 2000 = 500, so should the combined distinct count be at least 500? The answer is no. Because the Aggregate works on the result of Joinrel(t1, t2), while the statistics in the system catalog are raw (unfiltered) information, both the Join condition and the filter condition have to be taken into account. How? The filter condition "t1.c1 < 500" may filter out part of t1.c2, which gives one selectivity (call it FilterRatio); the Join condition "t1.c2 = t2.c2" also has a selectivity (call it JoinRatio); both ratios lie in [0, 1], and their influence must be factored into the distinct estimate of t1.c2. If the Poisson model is used between different columns and the fully correlated model within the same column, the distinct value of t1.c2 looks roughly like this:

distinct(t1.c2) = Poisson(d0, ratio1, nRows) * ratio2

Here d0 is the original distinct count in the base table, ratio1 is the ratio applied through the Poisson model, ratio2 is the ratio applied through the fully correlated model, and nRows is the number of rows in the base table; as in the base table example above, Poisson(d0, ratio1, nRows) = d0 * (1 - exp(-nRows * ratio1 / d0)). If you need to investigate an estimation problem, these ratios can be read from the log after running the SQL statement with the following settings:

set logging_module='on(opt_card)';

set log_min_messages=debug3;

In this example, the two ratios for table t1 can be seen in the log.

However, the distinct values of t1.c2 and t2.c2 cannot simply be combined, because "t1.c2 = t2.c2" implies that the two columns take the same values, i.e. they are equivalent, so it is enough to take Min(distinct(t1.c2), distinct(t2.c2)). The following Explain Performance output shows the actual and estimated rows:

postgres=# explain performance select t1.c2, t2.c2, count(*) cnt from t1, t2 where t1.c2 = t2.c2 and t1.c1 < 500 group by t1.c2, t2.c2;
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 id |                   operation                    |      A-time      | A-rows | E-rows | E-distinct |   Peak Memory    | E-memory | A-width | E-width | E-costs
----+------------------------------------------------+------------------+--------+--------+------------+------------------+----------+---------+---------+---------
  1 | ->  Streaming (type: GATHER)                   | 48.500           |     99 |    100 |            | 80KB             |          |         |      16 | 89.29
  2 |    ->  HashAggregate                           | [38.286, 40.353] |     99 |    100 |            | [28KB, 31KB]     | 16MB     | [24,26] |      16 | 79.29
  3 |       ->  Hash Join (4, 6)                     | [37.793, 39.920] |   1980 |   2132 |            | [6KB, 6KB]       | 1MB      |         |       8 | 75.04
  4 |          ->  Streaming(type: REDISTRIBUTE)     | [0.247, 0.549]   |   1001 |   1001 |         25 | [53KB, 53KB]     | 2MB      |         |       4 | 32.95
  5 |             ->  Seq Scan on test.t2            | [0.157, 0.293]   |   1001 |   1001 |            | [12KB, 12KB]     | 1MB      |         |       4 | 4.50
  6 |          ->  Hash                              | [36.764, 38.997] |    998 |   1000 |         62 | [291KB, 291KB]   | 16MB     | [20,20] |       4 | 29.88
  7 |             ->  Streaming(type: REDISTRIBUTE)  | [36.220, 38.431] |    998 |    999 |            | [53KB, 61KB]     | 2MB      |         |       4 | 29.88
  8 |                ->  Seq Scan on test.t1         | [0.413, 0.433]   |    998 |    999 |            | [14KB, 14KB]     | 1MB      |         |       4 | 9.25
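A quick sanity check against this output: distinct(t2.c2) = 100 from the statistics, while the adjusted distinct(t1.c2) is several hundred, so Min(distinct(t1.c2), distinct(t2.c2)) = 100, which is exactly the E-rows of 100 shown for the HashAggregate (the actual A-rows is 99).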

2.2 Aggregate Path generation

With the aggregate row count in hand, the aggregation method can be chosen flexibly according to the resource situation. There are mainly three kinds of Aggregate methods:

  1. Aggregate + Gather (+ Aggregate)
  2. Redistribute + Aggregate (+Gather)
  3. Aggregate + Redistribute + Aggregate (+Gather)

The parts in parentheses may be absent, depending on the situation. These aggregation methods are analogous to the choice between Redistribute and Broadcast when two tables are joined. After obtaining the final aggregate row count, the optimizer tries each aggregation method, computes the corresponding cost, selects the optimal method, and finally generates the path. When there are two layers of Aggregate, the row count of the last layer is the final aggregate row count, and the row count of the first layer is calculated with the Poisson model. The aggregation method is chosen by the optimizer based on cost by default, but it can also be specified by the user with the parameter best_agg_plan (see the sketch after the list below). The three methods are roughly applicable as follows:

  • The first suits the case where the row count after direct aggregation is not too large: usually the DNs aggregate and the CN gathers, and sometimes the CN needs to aggregate a second time.
  • The second suits the case where redistribution is required and direct aggregation would not significantly reduce the row count.
  • The third suits the case where redistribution is required and direct aggregation significantly reduces the row count, which can be reduced again after redistribution; this is generally DN aggregation, redistribution, and re-aggregation, commonly known as two-level Aggregate.
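A sketch of steering this choice with best_agg_plan (the parameter is mentioned above; check the mapping of values to strategies against the product documentation, the value below is only illustrative):

-- 0 lets the optimizer choose the aggregation strategy by cost; non-zero values force one of the three strategies
set best_agg_plan = 0;
explain select t1.c2, t2.c2, count(*) cnt from t1, t2 where t1.c2 = t2.c2 and t1.c1 < 500 group by t1.c2, t2.c2;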

IV. Conclusion

This article has walked through the core steps of plan generation, from row count estimation to Join Path generation and then Aggregate Path generation, introducing the basic principles of the simplest flow. The actual processing is far more complicated and has to consider many more cases: how to combine the selectivities of multiple groups of conditions optimally, how to choose distribution keys, how to handle data skew, how much memory to use, and so on. Plan generation is full of trade-offs; sometimes something has to be given up in order to gain something else, and sometimes a weakness of a plan can be ignored or compensated for by other capabilities, for example the parallelism brought by SMP can dilute some of a plan's weaknesses. In short, plan generation is complex and detailed work, and producing a globally optimal plan requires continuously discovering and fixing problems. We will continue to explore the secrets of plan generation in future posts.

Follow us to be the first to learn about fresh technologies from Huawei Cloud~