takeaway

Remember the example SQL in my “Introduction” in “Deep Optimization of Join queries – unknown new Methods”?

This is a Join query that counts the gender distribution of people visiting a user. Here, I’ll repaste it:

SELECT u.sex, COUNT(*) FROM user u LEFT JOIN t_user_view tuv ON u.user_id = tuv.user_id WHERE tuv.viewed_user_id = 10008 GROUP BY u.sex
Copy the code

As programmers, we often find that a lot of business logic is expressed in SQL and can be implemented using either joins or subqueries. For example, if we use subquery to implement SQL, we can write:

SELECT u.sex, COUNT(*) FROM user u WHERE u.user_id IN (SELECT user_id FROM t_user_view WHERE viewed_user_id = 10008) GROUP BY u.sex
Copy the code

Since a business logic can be expressed as both a Join and a subquery, what is the difference between a Join and a subquery, which query performs better, and when do we use a Join and when do we use a subquery?

Today, I’m going to analyze the principle of a query and gradually help you solve the above series of questions.

LooseScan

Let’s take a look at the table structure and data used by the above SQL.

The structure of the user and t_user_view tables is as follows:

  • user

    CREATE TABLE `user` (
      `id` int(11) NOT NULL,
      `user_id` int(8) DEFAULT NULL COMMENT 'user id',
      `user_name` varchar(29) DEFAULT NULL COMMENT 'Username',
      `user_introduction` varchar(498) DEFAULT NULL COMMENT 'User Introduction',
      `sex` tinyint(1) DEFAULT NULL COMMENT 'gender',
      `age` int(3) DEFAULT NULL COMMENT 'age',
      `birthday` date DEFAULT NULL COMMENT 'birthday'.PRIMARY KEY (`id`),
      UNIQUE KEY `index_user_id` (`user_id`),
      KEY `index_un_age_sex` (`user_name`,`age`,`sex`),
      KEY `index_age_sex` (`age`,`sex`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
    Copy the code
  • t_user_view

    CREATE TABLE `t_user_view` (
      `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'on the id',
      `user_id` bigint(20) DEFAULT NULL COMMENT 'user id',
      `viewed_user_id` bigint(20) DEFAULT NULL COMMENT 'User ID being viewed',
      `view_count` bigint(20) DEFAULT NULL COMMENT 'View times',
      `create_time` datetime(3) DEFAULT CURRENT_TIMESTAMP(3),
      `update_time` datetime(3) DEFAULT CURRENT_TIMESTAMP(3) ON UPDATE CURRENT_TIMESTAMP(3),
      PRIMARY KEY (`id`),
      KEY `index_viewed_user_user` (`viewed_user_id`,`user_id`)
    ) ENGINE=InnoDB AUTO_INCREMENT=7 DEFAULT CHARSET=utf8;
    Copy the code

The records of the two tables are as follows:

  • user

  • t_user_view

So, now that we have table structures and data, combined with my article why MySQL can Support fast queries with tens of millions of data? MySQL > query subquery (); MySQL > query subquery ()

Let’s focus on the red line:

  1. According to the condition viewed_user_id=10008, search the index tree index_Viewed_user_user in table T_user_view and locate the leaf node that meets the condition, that is, the green node in the leftmost tree in the figure. The records that meet the conditions are traversed through the node and the result is obtained, that is, the second part from left to right in the figure.

  2. Scan the previous query result record sequentially, the downward arrow in the second part of the figure.

    2.1 according to the record 10008,10001, search for the record with user_id=10001 in index tree index_user_id in table user and locate the leaf node, that is, the orange node at the lower right corner of index tree in the third part of the figure. Obtain the record 10001 in the node where user_id=10001. Similarly, a record 10002 with user_id=10002 can be obtained. They correspond to 10001 and 10002 on the far right of the figure.

    2.2 according to the record 10008,100013, search the index tree index_user_id in table user for the record with user_id=10003 and locate the leaf node, that is, the orange node in the middle of the index tree in the third part of the figure. Obtain the record 10003 in the node where user_id=10003. Similarly, records 10005 with user_id=10005 and 10009 with user_id=10009 can be obtained. They correspond to 10003, 10005 and 10009 on the far right of the figure.

  3. Tell your interviewer I can optimize groupBy, and I know it! In the groupBy principle, query the records that meet the conditions in the previous step by the groupBy and count.

The second step of the scan process is called LooseScan by MySQL. Why is it called loose scanning? Suppose we get duplicate values in the second part of the figure, how does MySQL scan the records? Let’s take a look:

Also focus on the red line:

  1. Sequentially scan records that meet the criteria, the downward arrow in the figure.

    1.1 according to record 10008,10001, search index tree index_user_id in table user for the record with user_id=10001 and locate the leaf node, that is, the orange node at the lower right corner of the index tree in the figure.

    1.2 According to the first record in 10008 and 10003, search the index tree index_user_id in table user for the record with user_id=10003 and locate the leaf node, that is, the orange node in the middle of the index tree in the figure. Ps: Scan only the first record for the same record.

Combined with the above two scanning processes, we can see why MySQL calls this scanning process loose scanning.

LooseScan: Scan only the first record if the same record occurs during index records scanning.

At this point, you might think that the subquery execution process is nothing special. Go index -> scan -> go index! Speaking of this process, have you ever thought that if the sequential scan of step 2 is a lot of records, say 1W, then MySQL is very, very slow to perform this subquery?

This incidentally introduces one of the trigger conditions for executing LooseScan: the inner query in the subquery must be able to hit the index. Otherwise it will be slow!

As a result, MySQL came up with other strategies to optimize subqueries in response to the large number of scan records.

FirstMatch

For example: The following case:

Suppose now that the risk control team wants to find the number of abnormal users on the platform who have visited a single user’s home page more than 10,000 times at some point in time. Then, I would find such a user with the following SQL:

SELECT * FROM user WHERE user_id IN (SELECT user_id FROM t_user_view WHERE view_count > = 10000)
Copy the code

At this point, I create another index as follows:

ALTER TABLE `t_user_view` ADD INDEX `index_vc_user` (`view_count`, `user_id`);
Copy the code

Select * from t_user_view where view_count>=10000; select * from t_user_view where view_count>=10000; This leads to another trigger condition for executing LooseScan:

The inner – layer query in a subquery must be an equivalent query.

Since the LooseScan policy is not available, MySQL tries the following query strategy to execute the case SQL.

Focus on the red line:

  1. Query t_user_VIEW index tree index_vc_user according to inner statement view_count>=10000, locate the leaf node, that is, the green node in the left tree in the figure. At the same time, the first record 1000010002 satisfying the condition in this node is found.

  2. Alter table user;

    2.1 According to the table 1,10001… ,1998-01-02, scan backwards from the first record in the node that meets the condition. When the last record that meets the condition is scanned, it is found that user_id in all records that meet the condition is not equal to 10001 in the table record, that is, record 1,10001 in the figure… ,1998-01-02 indicated by the crossed red arrow.

    2.2 According to the table 2,10002… ,2008-02-03, scan backwards from the first record that meets the condition in the node. The first record that meets the condition is detected, and the user_id in the record 10000and 10002 is found to be equal to 10002 in the table record, that is, 2,10002 in the figure… Green dotted arrow indicated on 2008-02-03. Add user to table 1,10002… Put into the final result set, that is, the gray box in the figure 2,10002… ,2008-02-03 represents one of the records in the final result set.

    2.3 According to the table 3,10009… ,2002-06-07, scan backwards from the first record in the node that meets the condition. The second record that meets the condition is scanned, and it is found that the user_id in the record 15000,10009 is equal to 10009 in the table record 3,10009… ,2002-06-07 indicated by green dotted arrow. Add user to table 3,10009,… Put into the final result set, the gray box in the figure 3,10009… 2002-06-07 means one of the records in the final result set.

MySQL scans the outer table of the subquery and then searches the inner index or inner table one by one.

As can be seen from the above example, MySQL tries to execute the whole statement by scanning the outer table without knowing whether the inner index is large or not. However, it is obvious that scanning the outer table is not necessarily the best way to execute the whole statement without knowing the inner index. MySQL has come up with the following strategy to try to scan the inner table index.

MaterializeScan

Again using the example SQL in FirstMatch, let’s look at this execution strategy:

Focus on the red line:

  1. According to the inner query condition view_count>=10000, search index tree index_vc_user and locate the node that meets the condition, that is, the green node in the left tree in the figure. At the same time, find the first record 1000010002 in the node that meets the condition.

  2. Create a temporary table tmp_table and iterate through the following records from 10000&10002:

    2.1 Insert the value 10002 in records 10000and 10002 into tMP_table

    2.2 Insert the value 10009 in records 1500010009 into tMP_table

    2.3 Insert the value 10005 in records 20000 and 10005 into tMP_table

    MySQL > insert 10005 into tmp_table; MySQL > insert 10005 into Tmp_table; MySQL > insert 10005 into Tmp_table

    Insert 10005 into tmp_table (tmp_table); insert 10005 into tmp_table (tmp_table)

  3. Scanning tmp_table:

    3.1 According to record 10002 in the table, search index tree index_user_id in the outer table user and locate the leaf node where 10002 is located, that is, the orange node in the rightmost tree in the figure. Iterate through the records in this node to find the record 10002.

    3.2 Similarly, according to the record 10009 in the table, search index tree index_user_id in the outer table user and locate the leaf node where 10009 is located, that is, the orange node in the rightmost tree in the figure. Iterate through the records in this node to find the record 10009.

    3.3 Similarly, according to record 10005 in the table, search index tree index_user_id in the outer table user and locate the leaf node where 10005 is located, that is, the orange node in the rightmost tree in the figure. Iterate through the records in this node to find the record 10005.

  4. Go back to the user table to find the user information corresponding to 10002, 10009, and 10005 records found in step 3.

In the above procedure, the newly created Tmp_table contains a unique index, which ensures the uniqueness of its inserted records. The index_vc_user index is de-duplicated, and then the index_user_id index is searched by scanning the TMP_table.

MySQL de-duplicates the newly created temporary table, scans the temporary table (or temporary table index), and uses the temporary table records to match the outer table records one by one. This method is called MaterializeScan.

MySQL has discovered that there is a way not to repeat the search path because each entry in the tmp_table must be searched from the root of the index tree index_user_id. A new strategy was created to optimize the case SQL in FirstMatch.

MaterializeLookup

Let’s look at this strategy:

Focus on the red line:

  1. According to the inner query condition view_count>=10000, search index tree index_vc_user and locate the node that meets the condition, that is, the green node in the left tree in the figure. At the same time, find the first record 1000010002 in the node that meets the condition.

  2. Create a temporary table tmp_table and iterate through the following records from 10000&10002:

    2.1 Insert the value 10002 in records 10000and 10002 into tMP_table

    2.2 Insert the value 10009 in records 1500010009 into tMP_table

    2.3 Insert the value 10005 in records 20000 and 10005 into tMP_table

    MySQL > insert 10005 into tmp_table; MySQL > insert 10005 into Tmp_table; MySQL > insert 10005 into Tmp_table

    Insert 10005 into tmp_table (tmp_table); insert 10005 into tmp_table (tmp_table)

  3. Alter table user;

    3.1 according to the table 1,10001… ,1998-01-02, scan backwards from the first record in tMP_table. User_id = 10001; user_id = 10001; user_id = 10001; ,1998-01-02 indicated by the crossed red arrow.

    3.2 According to the table, 2,10002… ,2008-02-03, scan backwards from the first record in tMP_table. Select user_id from user where user_id = 10002, user_id = 10002, user_id = 10002, user_id = 10002 Green dotted arrow indicated on 2008-02-03. Add user to table 1,10002… Put into the final result set, that is, the gray box in the figure 2,10002… ,2008-02-03 represents one of the records in the final result set.

    3.3 According to the table, 3,10009… ,2002-06-07, scan backwards from the first record in tMP_table. > user_id = 10009 > user_id = 10009 > user_id = 10009 > user_id = 10009 > user_id = 10009 ,2002-06-07 indicated by green dotted arrow. Add user to table 3,10009,… Put into the final result set, the gray box in the figure 3,10009… 2002-06-07 means one of the records in the final result set. Similarly, you can find the record 5,10005… ,2008-02-06 Add to final result set.

    3.4 According to the table 8,10008,… ,2002-06-07, scan backwards from the first record in tMP_table. User_id = 10009; user_id = 10008; user_id = 10009; 2002-06-07 indicated by the crossed red arrow.

  4. At this point, the record for the query statement in the case is retrieved from the User table. , 2008-02-03, 3100, 09,… , the 2002-06-07 and 5100, 05,… , 2008-02-06.

In the above procedure, we found that MySQL directly matches the tMP_table records with the user table records without going through the index lookup. Therefore, the query efficiency is faster than the MaterializeScan method above.

MySQL calls this method of creating a temporary table by scanning the outer table of the subquery and matching the temporary table records one by one.

In the four sub-query execution strategies described above, you will find that they all have something in common: as long as there is an index in the outer table or the inner table, you can use the index to improve the efficiency of the query. How can MySQL perform subqueries if there are no indexes on the inside and outside? A new strategy is introduced here, and I’m going to use the case SQL from FirstMatch as an example.

DuplicatesWeedout

Focus on the red line:

  1. Create a temporary table using row_id of the user table, that is, there is only one and unique field row_id in the table. That’s the leftmost part of the picture.

  2. Scan t_user_View for all tables and find five records that meet the condition view_count>=10000. The gray box in the second part of the figure from left to right, where some records are omitted.

  3. Associate the records obtained in step 2 with the user table using the user_id field. The red line with user_id in the graph.

    3.1 record 3100 02,… Select * from user_id where user = 2,10002,… , 2008-02-03.

    3.2 record 7100 05,… Select * from user_id where user = 5,10005,… , 2008-02-06.

    The result is the associated table record, which is part 4 of the figure.

  4. Insert associated table records into temporary tables.

    4.1 to 2100-02,… 10000 Insert the temporary table successfully because user.row_id=2 does not exist in the temporary table.

    4.2 the 3100-09,… Row_id =3 does not exist in the temporary table. The temporary table is successfully inserted.

    4.3 will 05, 5100… 30000 Insert the temporary table successfully because user.row_id=5 does not exist in the temporary table.

    4.4 will 05, 5100… User. row_id=5 exists in the temporary table. Row_id must be unique, and the insert fails.

  5. The result of the subquery is three records.

Through the analysis of the above five sub-query execution strategies one by one, we found that the subquery statements in the FirstMatch case performed best when MaterializeLookup was used.

At the same time, the analysis of the above five strategies is also the process of MySQL optimization sub-query: analyze the query cost of the five strategies one by one to obtain the optimal solution, and finally, select the optimal query strategy.

Optimizer_trace: optimizer_Trace: optimizer_Trace: Optimizer_trace: Optimizer_trace: Optimizer_trace

SET OPTIMIZER_TRACE="enabled=on";
SET OPTIMIZER_TRACE_MAX_MEM_SIZE=1000000;
SELECT * FROM user WHERE user_id IN (SELECT user_id FROM t_user_view WHERE view_count > = 10000);
SELECT * FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE;
Copy the code

Since the results after implementation are very long, I will intercept the cost results of five optimization strategies:

  • LooseScan

    MySQL does not analyze the cost of this policy because the case statement cannot use it.

  • FirstMatch

  • MaterializeScan

  • MaterializeLookup

  • DuplicatesWeedout

In terms of the execution cost of the five strategies, it is indeed MaterializeLookup that has the lowest cost. Therefore, MySQL chooses MaterializeLookup strategy to optimize the case SQL in FirstMatch.

At this point, there are five strategies for MySQL to optimize subqueries: LooseScan, FirstMatch, MaterializeScan, MaterializeLookup, and DuplicatesWeedout. By comparing the execution costs of these strategies, MySQL decides which strategy to use to execute subqueries.

conclusion

This article explains MySQL’s optimization strategy for subqueries using a real case study (and other subquery structures, of course). The table association mentioned in this article is called SEMIJOIN. MySQL optimizes subqueries with 5 strategies:

Execution strategy The trigger condition Optimization scheme
LooseScan 1. The number of inner layer subqueries in a subquery statement cannot exceed 64

2. The inner subquery in the subquery statement must match indexes

3. The inner layer query in the subquery statement must be an equivalent query
In the process of scanning sub-query index records in the inner table, if the same record exists, only the first record is scanned and then the corresponding records are searched in the outer table one by one
FirstMatch Scan the outer table of the subquery statement, and then look up the index or corresponding record of the inner table of the statement one by one
MaterializeScan Create a temporary table to de-duplicate, then scan the temporary table (or temporary table index) and match the outer table records with the temporary table records one by one
MaterializeLookup Create a temporary table for deduplication, scan the outer table of the subquery statement, and match the temporary table records one by one
DuplicatesWeedout Create a temporary table that stores only the subquery outer or inner table ROW_ID. Use row_ID to retrieve associated table records

Through the above summary, we find that these sub-query optimization strategies of MySQL are to achieve query performance optimization by removing records. Compared with Join query, it is easy to find that Left Join/Right Join query will not be duplicated when associated field values occur. Therefore, in the case of associated table scanning, performance will be greatly affected.

So, we know when to use subqueries and when to use Join queries when expressing the same semantics.

  • If the associated fields of an associated table have duplicate values, you are advised to use subquery to improve query performance.
  • When the associated field value of the associated table is unique, the performance difference between the subquery and Join query is not significant, and both can be used.