This is the 13th day of my participation in the November Gwen Challenge. Check out the event details: The last Gwen Challenge 2021

Subquery optimization has always been one of the difficulties in SQL query optimization. The basic execution of associated subqueries is similar to nested-loop, but this execution is often intolerably inefficient. When the amount of data is a little larger, it must be Decoorelation or Unnesting in the optimizer to rewrite it to a more efficient operator like semi-join.

Predecessors have summarized a set of complete methodology, which can theoretically de-associate any query. In this paper, the combination of SQL Server and HyPer several classic papers, from simple to profound to explain this set of disassociation theory system. They both use similar methods, the basic idea is understood.

Introduction to Subqueries

Subqueries are a syntax defined in the SQL standard that can appear almost anywhere in SQL, including SELECT, FROM, WHERE, and so on.

In general, subqueries can be classified into Correlated subqueries and non-correlated subqueries. The latter non-associative subquery is a simple problem, as simple as executing it first, fetching and materializing the result set, and then executing the outer query. Here’s an example:

SELECT c_count, count(*) AS custdist
FROM (
         SELECT c_custkey, count(o_orderkey) AS c_count
         FROM CUSTOMER
                  LEFT OUTER JOIN ORDERS ON c_custkey = o_custkey
             AND o_comment NOT LIKE '%pending%deposits%'
         GROUP BY c_custkey
     ) c_orders
GROUP BY c_count
ORDER BY custdist DESC, c_count DESC;
Copy the code

Non-associative subqueries are beyond the scope of this article and, unless otherwise stated, refer to associative subqueries.

An associated subquery is unique in that it is inherently incomplete: its closure contains some of the parameters provided by the outer query. Obviously, we can’t run the query until we know these parameters, so we can’t treat it like a non-associated subquery.

According to the data generated, sub-queries can be classified into the following types:

Scalar subquery: Outputs a result table with only one row and one column, and this Scalar value is its result. If the result is empty (0 lines), a NULL is printed. Note, however, that results greater than 1 line are not allowed and will result in a runtime exception.

Scale-quantum queries can occur anywhere there are scalars, such as in the SELECT, WHERE, and other clauses. Here’s an example:

SELECT c_custkey
FROM CUSTOMER
WHERE 1000000 < (
    SELECT SUM(o_totalprice)
    FROM ORDERS
    WHERE o_custkey = c_custkey
)
Copy the code

Query 1: A scalar Query that appears in the WHERE clause with the correlation parameters indicated in red

SELECT o_orderkey, (
    SELECT c_name
    FROM CUSTOMER
    WHERE c_custkey = o_custkey
) AS c_name FROM ORDERS
Copy the code

Query 2: A scalar quantum Query that appears in the SELECT clause

An Existential Test subquery, specifically an EXISTS subquery that returns a Boolean value. If it appears in WHERE, this is the familiar semi-join. Of course, it can appear anywhere a Boolean value can be placed.

SELECT c_custkey
FROM CUSTOMER
WHERE c_nationkey = 86 AND EXISTS(
        SELECT * FROM ORDERS
        WHERE o_custkey = c_custkey
    )
Copy the code

Query 3: An example of semi-join

Quantified Comparision subquery: a Quantified Comparision subquery, especially a query IN, SOME, or ANY, that returns a Boolean value, usually IN the following form: X = SOME(Q) (equivalent to x IN Q) or x <> ALL(Q) (equivalent to x NOT IN Q). As above, it can appear anywhere a Boolean value can be placed.

SELECT c_name
FROM CUSTOMER
WHERE c_nationkey <> ALL (SELECT s_nationkey FROM SUPPLIER)
Copy the code

Query 4: a non-associative subquery for a collection comparison

Original execution plan

Let’s use Query 1 as an example to get a sense of why it is necessary to de-associate associated subqueries.

Here is the raw Relation Tree of Query 1 without disassociation. Unlike other query plans, we draw the Expression Tree so that we can clearly see that the subquery is actually hanging below the conditional Expression of the Filter.

Evaluator: Evaluator; Evaluator: Evaluator; Evaluator: Evaluator; Evaluator: Evaluator Since this conditional expression contains a scalar query, Evaluator will call Executor to evaluate the result of the scalar query.

This executor-evaluator-executor interchange is inefficient! Given that there may be millions of rows of data passing through the Filter, the total time of query execution is clearly unacceptable if a subquery is executed for each row.

The Apply operator

The taigt-expression-relation alternate reference mentioned above not only has poor performance, but also is a trouble for the optimizer — our optimization rules are all matching and transforming Taigt-relation. In this case, the subquery is hidden in Expression, which is confusing.

To do this, we introduce the Apply operator before starting the disassociation:

The Apply operator (also known as a Correlated Join) takes input from two relational trees. Unlike ordinary joins, Apply’s Inner input (shown as the right subtree) is a relational tree with parameters.

The meaning of Apply is defined by the set expression in the right part of the following figure: For each data RR in Outer Relation RR, calculate Inner Relation E(r)E(r) and output the result of their Join r⊗E(r)r⊗E(r). The result of Apply is the UNION of ALL of these results (by UNION in this article I mean the UNION under Bag semantics, i.e., UNION ALL).

Apply is a name for SQL Server and is referred to in HyPer’s article as Correlated Join. They are completely equivalent. In consideration of the earlier publication and wider impact of SQL Server articles, the name is used in this article.

Depending on the connection mode (⊗⊗), Apply has four more forms:

  • Cross Apply A×A× : This is the most basic form, the behavior we have just described;
  • CVD: CVD: CVD {CVD} CVD {CVD} CVD {CVD}
  • Semi Apply A∃A∃ : return rr if E(r)E(r) is not empty, otherwise discard;
  • Anti-semi Apply A∄A∄ : Return rr if E(r)E(r) is empty, otherwise discard;

We use the Apply operator we just defined to override the previous example: extracting subqueries from inside Expression. The results are as follows:

In the above example, we can confirm that the Scalar Agg sub-query has only one row of results, so we can Apply directly. In some cases, however, it may not be certain that the subquery will return 0 or 1 rows (for example, imagine Query 2 if C_CUSTkey was not unique). To ensure SQL semantics, add a Max1RowMax1Row operator to the right of Apply:

Max1Row (E) = ⎧ ⎩ ⎨ ⎪ ⎪ Null, E, the error, the if | E | = 0 if | E | = 1 otherwisemax1row (E) = {Null, if | E | = 0 E, if | E | = 1 error, otherwise

Theoretically, we can convert all subqueries into Apply operators. A general method is as follows:

  1. If there is a subquery in the expression of an operator, we extract the subquery under the operator (leaving the result variable of a subquery) to form an ALOJALOJ operator. If there is more than one subquery, multiple ALOJALOJ will be generated. Max1RowMax1Row operator if necessary.
  2. Then apply some other rules to convert ALOJALOJ to A×A×, A∃A∃, A∄A∄. For example, the subquery result XX in the above example is used as the Filter criterion. NULL values are filtered out and can be safely converted to A×A×.

In the following example, the Filter expression contains two subqueries, Q1Q1 and Q2Q2. After transformation, corresponding Apply operators are generated respectively. Q2Q2 can not be sure that only one record will be generated, so Max1RowMax1Row operator is added.

Basic elimination rule

The first set of rules is the most basic. ⊗⊗ in the equation indicates that it does not limit connection types. It can be any of {×,LOJ,∃,∄}{×,LOJ,∃, ∃}.

These two rules are obvious and translate to plain English: If the right side of Apply does not contain arguments from the left, it is equivalent to direct Join.

Here is an example of applying rule (2) to Query 3:

De-correlating Project and Filter

The second set of rules describes how projects and filters in subqueries are handled. The idea can be summed up in one sentence: Push Apply down as much as possible and push the operator below Apply up.

Note that these rules only deal with Cross Apply. The other three variants of Apply can theoretically be converted to Cross Apply, but that’s all we need to know for now.

You may ask: Normally we push Filter and Project down as far as possible, why do we do the opposite here? Here’s the catch: Filter and Project originally contain expressions with associated variables, but when you put them above Apply, the associated variables become normal variables! This is exactly what we want.

We’ll see the huge benefit of this later: When Apply is pushed to the bottom, the first set of rules can be applied to change Apply directly to Join, completing the optimization process of subquery de-association.

Here is an example of applying rule (3) to Query 2. Rule (1) is then applied to complete the de-association process.

Discorrelating Aggregate

The third set of rules describes how to handle Aggregate (Group By) in subqueries. As in the previous set, the guiding idea is to push Apply down as far as possible and lift the operator below Apply up.

In the following equation, GA,FGA,F represent the aggregation with Group By Group (Group Agg), where AA represents the column of the grouping, FF represents the column of the aggregation function; G1FGF1 stands for polymerization without grouping (Scalar Agg).

This set of rules is not as straightforward as before, so let’s look at an example to get a feel for it. Here is the result of applying rule (9) to Query 1:

When rule (9) pushes down Apply, ScalarAgg is also changed into GroupAgg, where the grouping column is the key of R, in this case, the primary key C_CUSTKey of CUSTOMER.

If R does not have a primary or unique key, theoretically, we can generate one at Scan.

Why is it equivalent before and after? Before the transformation, we did a ScalarAgg aggregation calculation for each R row, and then combined the aggregation results. After the transformation, we first prepare all the data to be assembled (this is called augment), and then use GroupAgg to complete all the assembly at one time.

This also explains why we use ALOJALOJ instead of the original A×A× : the original ScalarAgg prints A NULL even if the input is an empty set. If we use ALOJALOJ here, we get exactly the same behavior (*); Conversely, if you use A×A×, you have A problem — customers without corresponding ORDERS disappear from the results!

Rule (8) deals with group g, and the same is true, except that the original group column is kept.

Details in ScalarAgg transformation *

Observant readers will notice that the aggregation function produced on the right side of rule (9) is F ‘F’, with a single quotation mark, suggesting that it may be somewhat different from the original aggregation function FF. So under what circumstances is it different? This topic is a little bit more advanced, so you can skip it if you’re not interested.

First of all, do GroupAgg and ALOJALOJ behave exactly the same as before? It’s not. Here’s a counter example:

SELECT c_custkey, (
    SELECT COUNT(*)
    FROM ORDERS
    WHERE o_custkey = c_custkey
) AS count_orders
FROM CUSTOMER
Copy the code

Imagine that customer Eric does not have any orders, so the query should return a row [‘Eric’, 0]. However, when we apply the transform to rule (9), we get a value of [‘Eric’, 1], and the result is wrong!

Why is this so? After the transformation, LeftOuterJoin is used to prepare the intermediate data (Augment), and then GroupAgg is used to do aggregation. LeftOuterJoin generates a [‘Eric’, NULL, NULL… The line; In later GroupAgg, the aggregate function COUNT(*) thinks the group Eric has 1 row, so it prints [‘Eric’, 1].

Here’s a more complex example with a similar problem:

SELECT c_custkey
FROM CUSTOMER
WHERE 200000 < (
    SELECT MAX(IF_NULL(o_totalprice, 42)) -- o_totalprice may be NULL
    FROM ORDERS
    WHERE o_custkey = c_custkey
)
Copy the code

To summarize, the root of the problem is this: F(∅)≠F({NULL})F(∅)≠F({NULL}).

The transformed GroupAgg cannot tell whether the NULL data it sees is generated by the OuterJoin or exists in the original. Sometimes, these two situations will produce different results in ScalarAgg before the transformation.

Fortunately, the aggregate functions F(col)F(col) defined in the SQL standard are all OK — they all satisfy F(∅)=F({NULL})F(∅)=F({NULL}), and we can solve this problem with just a little transformation of FF.

  • For example one, we willCOUNT(*)Replace it with a Count for a non-empty column (such as a primary key), for example:COUNT(o_orderkey);
  • So for example two, you need to takeMIN(IF_NULL(o_totalprice, 42))Do this in two steps: Define intermediate variablesX, calculate with Project firstX = IF_NULL(o_totalprice, 42)And then the aggregation functionMIN(X)You can de-associate it.

De-correlating set operations

The final set of optimization rules is used to process subqueries with Union (for Union ALL), Subtract (for EXCEPT ALL), and Inner Join operators. Again, the guiding idea is to push Apply down and lift the operator under Apply up as much as possible.

In the following equation, xx stands for Cross Join, ⋈ R.cey ⋈ R.Cey stands for CVD with RR Key: R CVD E1 CVD e2R CVD E1 CVD E2 As before, we assume that RR has a primary key or a unique key. If not, we can add one during Scan.

Notice one significant difference between these rules and the ones we’ve seen before: RR appears twice on the right-hand side of the equation. So either we make a copy of the subtree, or we make a DAG execution plan, which is a lot of trouble.

In fact, this set of rules is rarely useful. As mentioned in [2], it is even difficult to write a meaningful subquery with Union All under TPC-H Schema.

other

A few points that I think are important are listed below in FAQ form.

► Can any associated subquery be de-associated?

It can be said that, with a few qualifications, it can be theoretically proved that any relational subquery can be de-correlated.

The proof methods are mentioned in [1] and [3]. Taking [1] as an example, the idea is roughly as follows:

  1. For any query relation tree, the associated subquery is extracted from the expression and expressed by Apply operator.
  2. Non-basic relation operators are removed step by step. First, Union and Subtract are removed through equivalent transformation.
  3. OuterJoin, ALOJALOJ, A∃, A∃, A∄, A∄ are removed.
  4. Finally, all A×A× are removed, and the remaining relation tree contains only some basic relation operators, that is, the de-association is completed.

On the other hand, the subqueries that users use in the real world are mostly simple, and the rules described in this article probably cover 99% of scenarios. Although in theory any subquery can be handled, in practice, no known DBMS implements all of these transformation rules.

What are the similarities and differences between HyPer and SQL Server?

HyPer’s theory covers more disassociative scenarios. For example, all kinds of operators such as Join are given corresponding equivalent transformation rules in [3] (as an example, the following figure is the transformation of Outer Join). In [1] it is only proved that these cases can be reduced to manageable cases (in fact, it is conceivable that there must be no processing).

Another detail is that there is a rule in HyPer:

Where D= π F(T2)∩A(T1)(T1)D= π F(T2)∩A(T1)(T1) represents A Distinct Project result (the so-called Magic Set) for T1T1. It’s tricky to look at the equation directly, but it’s easy to understand by looking at the following example:

In the figure, before applying, get the set of Distinct values of the columns that need to be applied, Apply these values, and then Join the results of Apply with a plain Join.

The advantage of this is that the number of rows that need to Apply after a Distinct Project is significantly reduced if there are a lot of duplicates of the data being applied. In this way, even if Apply is not optimized later, the cost of iterative execution will be significantly reduced.

► Should these transformation rules be used in RBO or CBO? In other words, is the execution plan after de-correlating necessarily better than before?

The answer is, not necessarily.

Intuitively, if the amount of data on Apply’s left side is small (for example, there is only 1 data), then it is better to go directly to Apply’s right side. Alternatively, there is a suitable index on the right, in which case the cost of applying multiple times is not unacceptable.

So it is more appropriate to put these rules into a CBO optimizer, which selects the best plan based on cost estimates. Even, in some cases, we use these equations from right to left to “add correlation.”

This is the same thing as using HashJoin or NestedLoopJoin. In fact, NestedLoopJoin is a special case of Apply. It is not uncommon for NestedLoopJoin to be more efficient than HashJoin if appropriate indexes exist.