Abstract: If you can transform outer joins into inner joins, you can simplify the query optimization process.

This document is shared in huawei cloud community. Conversion of GaussDB(DWS) from External connections to Internal connections.

In the process of query optimization, the join order between internally joined tables can be exchanged at will, and the conditions involving only one table in the WHERE or on conditions can be pushed down to the table as the filtering conditions of the table. For outer joins, the join order of the table cannot be arbitrarily exchanged, and constraints cannot be arbitrarily pushed down. If you can transform outer joins into inner joins, you can simplify the query optimization process.

Conditions to be met when an external connection can be converted into an internal connection

To facilitate the description, two terms are introduced:

  • Non – empty side: The side of the external connection where all data is output. For example: left outer join left table, right outer join right table

  • Nullable side: the side of the outer connection to be filled with nullable values. For example: left outer join right table, right outer join left table, full outer join left table and right table

An outer join can be converted to an inner join if one of the following conditions is met:

  • The Where condition has a “strict” constraint that references columns in a table with nullable sides. This predicate filters out all nulls generated by the nullable side, making the end result equivalent to the inner join.

See the PostgreSQL Technical Insider – Deep Exploration of Query Optimization for the following definition of “strict” :

A precise definition of “strict” is that a function, operator, or expression is strict if its input argument is NULL and its output is NULL. But broadly speaking, a function, operator, or expression is considered strict if its input argument is NULL and its output is either NULL or FALSE. If you have such a strict operator, function, or expression in a constraint, because the input is NULL and the output is either NULL or FALSE, then tuples containing NULL values are filtered out.

This article uses GuassDB(DWS) connection type introduction of the table and data for example.

Example 1: Select * from student where math score is not null

postgres=# select s.id, s.name, ms.score from student s left join math_score ms on (s.id = ms.id) where ms.score is not null; id | name | score ----+-------+------- 1 | Tom | 80 2 | Lily | 75 4 | Perry | 95 (3 rows) postgres=# explain select s.id, s.name, ms.score from student s left join math_score ms on (s.id = ms.id) where ms.score is not null; QUERY PLAN ---------------------------------------------------------------------------- id | operation | E-rows | E-width | E-costs ----+----------------------------------------+--------+---------+--------- 1 | -> Streaming (type: GATHER) 36.59 2 | | 126 | | 30 - > Hash Join (3, 4) 30 | | 126 | 3 | - > 28.59 Seq Scan on student s 30 | | 122 | | 14.14 4 29 - > Hash | | | | 14.14 5-8 > Seq Scan on math_score 8 | | | 30 ms 14.14 Predicate Information (identified by the plan Id) -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 2 -- Hash Join (3, 4) Hash Cond: (s.id = ms.id) 5 --Seq Scan on math_score ms Filter: (score IS NOT NULL) (14 rows)Copy the code

Example 2: Select * from student where math > 80

postgres=# select s.id, s.name, ms.score from student s left join math_score ms on (s.id = ms.id) where ms.score > 80; id | name | score ----+-------+------- 4 | Perry | 95 (1 row) postgres=# explain select s.id, s.name, ms.score from student s left join math_score ms on (s.id = ms.id) where ms.score > 80; QUERY PLAN ---------------------------------------------------------------------------- id | operation | E-rows | E-width | E-costs ----+----------------------------------------+--------+---------+--------- 1 | -> Streaming (type: GATHER) | | 126 | 2 | - > 36.44 Hash Join (3, 4) 10 | | 126 | 3 | - > 28.44 Seq Scan on student s 30 | | 122 | | 14.14 4 - > Hash 10 8 | | | | - > 5 14.18 Seq Scan on math_score 10 8 | | | 14.18 ms Predicate Information (identified by the plan Id) -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 2 -- Hash Join (3, 4) Hash Cond: (s.id = ms.id) 5 --Seq Scan on math_score ms Filter: (score > 80) (14 rows)Copy the code

If wherems.score is not null and wherems.score > 80, this constraint returns false, satisfying the broad definition of “strict”. So you can eliminate the outer connection and convert it to the inner connection. The query plan from above is also validated. This elimination of external connections can be handled automatically by the database’s query optimizer.

  • In the On connection condition, if the values in the non-empty side column are subsets of the nullable side column, and the values of the nullable side are not NULL. Typically, the non-empty side is listed as a foreign key, while the empty side column is the primary key, and the primary foreign key reference relationship is between the two.

    CREATE TABLE student( id INTEGER primary key, name varchar(50) );

    CREATE TABLE math_score(ID INTEGER, — because GaussDB(DWS) does not support foreign keys, the foreign key definition is left out, but the value of this column is guaranteed to be score INTEGER, a subset of the ID column in the student TABLE);

    INSERT INTO student VALUES(1, ‘Tom’); INSERT INTO student VALUES(2, ‘Lily’); INSERT INTO student VALUES(3, ‘Tina’); INSERT INTO student VALUES(4, ‘Perry’);

    INSERT INTO math_score VALUES(1, 80); INSERT INTO math_score VALUES(2, 75); INSERT INTO math_score VALUES(4, 95);

Then the following outer join is the same as the inner join:

postgres=# select ms.id, s.name, ms.score from student s right join math_score ms on (s.id = ms.id); id | name | score ----+-------+------- 1 | Tom | 80 2 | Lily | 75 4 | Perry | 95 (3 rows) postgres=# explain select ms.id, s.name, ms.score from student s right join math_score ms on (s.id = ms.id); QUERY PLAN ------------------------------------------------------------------------- id | operation | E-rows | E-width |  E-costs ----+-------------------------------------+--------+---------+--------- 1 | -> Streaming (type: GATHER) 30 | | 126 | 2 | - > 36.59 Hash Left Join (3, 4) | | 126 | 3 | - > 28.59 Seq Scan on math_score ms 8 14.14 4 | | | | 30 - > Hash 29 14.14 5 | | 122 | | - > Seq Scan On student s | | 122 | 30 14.14 Predicate Information (identified by the plan id) --------------------------------------------- 2 --Hash Left Join (3, 4) Hash Cond: (ms.id = s.id) (12 rows) postgres=# select ms.id, s.name, ms.score from student s join math_score ms on (s.id = ms.id); id | name | score ----+-------+------- 1 | Tom | 80 2 | Lily | 75 4 | Perry | 95 (3 rows) postgres=# explain select ms.id, s.name, ms.score from student s join math_score ms on (s.id = ms.id); QUERY PLAN ---------------------------------------------------------------------------- id | operation | E-rows | E-width | E-costs ----+----------------------------------------+--------+---------+--------- 1 | -> Streaming (type: GATHER) 36.59 2 | | 126 | | 30 - > Hash Join (3, 4) 30 | | 126 | 3 | - > 28.59 Seq Scan on student s 30 | | 122 | | 14.14 4 29 - > Hash | | | | 14.14 5-8 > Seq Scan on math_score 8 | | | 30 ms 14.14 Predicate Information (identified by the plan Id) -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 2 -- Hash Join (3, 4) Hash Cond: (s.i d = Ms. Id) (12 rows)Copy the code

Because GaussDB(DWS) does not support foreign keys, the condition that can remove external connections and convert them to internal connections cannot be recognized and automatically converted by the optimizer. However, it can help developers manually identify and remove external connections when writing SQL.

An interesting rewriting example

There is a use case like this:

Select count(1) from student s left join math_score ms on (s.id = ms.id) where s.id = 2 and ms.score > 70; postgres=# Select count(1) postgres-# from student s left join math_score ms on (s.id = ms.id) postgres-# where s.id = 2  postgres-# and ms.score > 70; count ------- 0 (1 row) postgres=# explain Select count(1) postgres-# from student s left join math_score ms on (s.id = ms.id) postgres-# where s.id = 2 postgres-# and ms.score > 70; QUERY PLAN ----------------------------------------------------------------------------------------------------- id | operation | E-rows | E-width | E-costs ----+-----------------------------------------------------------------+--------+---------+--------- 1 | -> Aggregate | 1 8 | | 2 | - > 26.51 Streaming (type: GATHER) | 1 | | 3 | - > 26.51 8 Aggregate 22.51 4 | - > | 1 | | 8 Nested Loop (5, 6) 22.49 5 | - > | | 3 | 0 Index Only Scan Using student_pkey on student s 8.27 6 | | 1 | | 4 - > Seq Scan on math_score | 1 | | 4 14.21 ms Predicate Information (identified by plan id) ----------------------------------------------------- 5 --Index Only Scan using student_pkey on student s Index Cond: (id = 2) 6 --Seq Scan on math_score ms Filter: ((score > 70) AND (id = 2)) (15 rows)Copy the code

As you can see from the plan above, the left outer join in SQL has been optimized for cross join because there is no join condition on Nest Loop operator 4.

The final result of this query is to calculate the total number of rows resulting from the join of the two tables. For a cross join, the number of rows joined by two tables is equal to the product of the number of rows on the left and the number of rows on the right. Therefore, this query can be modified to the following equivalent query:

explain select lcount * rcount as count from (select count(1) lcount from student where id = 2) s, (select count(1) rcount from math_score where score > 70 and id = 2) ms; postgres=# select lcount * rcount as count postgres-# from (select count(1) lcount from student where id = 2) s, postgres-# (select count(1) rcount from math_score where score > 70 and id = 2) ms; count ------- 1 (1 row) postgres=# explain select lcount * rcount as count postgres-# from (select count(1) lcount from student where id = 2) s, postgres-# (select count(1) rcount from math_score where score > 70 and id = 2) ms; QUERY PLAN ------------------------------------------------------------------------------------------------------ id | operation | E-rows | E-width | E-costs ----+------------------------------------------------------------------+--------+---------+--------- 1 | -> Streaming (type: GATHER) | 1 | | 2 | 26.56 16 - > Nested Loop (3, 7) | 1 | | 3 | - > 22.56 16 Aggregate 8 | | 1 | | 4 - > 8.29 Streaming(type: BROADCAST) | 1 | | | 8.29 5-8 > Aggregate | 1 | | | 8.28 6-8 > Index Only Scan using student_pkey on student | 1 | | 0 8.27 7 | - > Materialize | 1 | | | 14.25 8 8 - > Aggregate | 1 | | | 9-14.23 8 > Streaming (type: BROADCAST) | 1 | | | 14.23 8 10 - > Aggregate | 1 | | | 14.22 8 11 - > Seq Scan on math_score | 1 | | 0 14.21 Predicate Information (identified by plan id) --------------------------------------------------- 6 --Index Only Scan using student_pkey on student Index Cond: (id = 2) 11 --Seq Scan on math_score Filter: ((score > 70) AND (id = 2)) (20 rows)Copy the code

This rewriting can push the aggregation operation to each subtree of the Nested Loop. When the amount of data in each subtree of the Nested Loop is large, the aggregation can greatly reduce the result set and the amount of data participating in the join, thus improving the performance.

Save the following example for thought:

Select sum(score)
From student s left join math_score ms on (s.id = ms.id)
Where s.id = 2
  And ms.score > 70;
Copy the code

Click to follow, the first time to learn about Huawei cloud fresh technology ~