This paper mainly discusses the optimization strategy of in exist subquery in mysql

The overall classification is as follows:

  1. Associated subquery
  2. Non-associated subquery

What’s the difference?

Non-associated subquery: No outer query parameters are involved in the subquery. Pretty straightforward thought:

  1. Execute first, get the result set, loop around it, filter and add the final result set
  2. The inner result set can be optimized ->Physical and chemical

What about associative subqueries? First of all, the query itself is incomplete. The query closure contains some parameters provided by the outer query, which are needed to run the query.

Unassociated IN subquery

SELECT * FROM s1 
    WHERE key1 IN (SELECT common_field FROM s2 WHERE key3 = 'a');
Copy the code
  1. Subqueries are not related to the outer layer
  2. Outer key in (subquery result set)

Intuitive ideas have been said above, and some questions:

  1. Subquery result set memory may not fit
  2. The outer query detects whether a record is in the result set:id in (1.. 10000000).This is actually similar to for

First, it is irrelevant, then it can indeed be written in a result set, and the subject is to optimize the result set:

  1. Write to temporary table where column is the column of the query.
  2. Result sets are de-weighted, as small as possible
  3. Create temporary tables for query accelerationhash index

Since the hash index exists, the query is to determine whether the column value is in the collection.

Of course, the result set is large enough to exceed the memory limit for temporary tables, which will be converted to disk storage, hash index -> b+ index.

This process is called objectification

inner join

Optimization from above:

  1. Materialize subqueries into memory temporary tables
  2. The outer query loop matches each record in the subquery temporary table

The overall view is: the matching operation between two tables, this is not classic join?

SELECT s1.* FROM s1 INNER JOIN <materialized_table> ON key1 = m_val;
Copy the code

So back to the cost of the join itself: first, this is the inner Join

  1. Drive table selection
  2. Select the lowest-cost access method for the driven and driven tables

Connection cost = cost of single access to < driven table > + drive table loop times ✖️ cost of single < driven table > access

The difference in costing is:

  1. s1 S1 num * Materialized_table [mval= XXX] Access cost: The materialization table hashash index
  2. m_table Materialized_table Num * s1[key1] Access cost: s1[key1] can go to the index

So it depends on the difference in cost of access between the two.

semi-join

The last time we talked about costs, whatever was driving the table had to materialize the subquery first and then start joining.

Further optimization: Get rid of the materialization process. Back to the top example SQL:

SELECT * FROM s1 
    WHERE key1 IN (SELECT key2 FROM s2 WHERE key3 = 'a');
Copy the code

SQL’s final result set selects:

Find records in s1 that can be matched in s2 after filtering. The matching condition is s1.key1 = s1.key2

What exactly is semi-join?

Let’s talk about it next.


To be continued…