How is the statement containing the IN clause executed IN MySQL

Tag: article of public account

For development partners, the MySQL contains IN clause statements must be familiar with the more familiar, almost every day, always use. MySQL InnoDB storage engine IN MySQL 5.7 InnoDB storage engine IN MySQL 5.7 InnoDB storage engine for example

The preparatory work

To smooth the story, let’s create a table:

CREATE TABLE t (
    id INT NOT NULL AUTO_INCREMENT,
    key1 VARCHAR(100),
    common_field VARCHAR(100),
    PRIMARY KEY (id),
    KEY idx_key1 (key1)
) Engine=InnoDB CHARSET=utf8;
Copy the code

Table T contains two indexes:

In order toidColumn cluster index for primary key
forkey1The secondary index created by the column

The table now contains 10,000 entries:

mysql> SELECT COUNT(*) FROM t;
+----------+
| COUNT(*) |
+----------+
|    10000 |
+----------+
1 row in set (0.00 sec)
Copy the code

Locate records from the B+ tree

We now want to execute the following statement:

SELECT * FROM t WHERE 
    key1 >= 'b' AND key1 <= 'c';
Copy the code

Assuming the optimizer chooses to execute the query using a secondary index, the query statement would look like this:

Tip: Forgive me for making a very simplified version of the complex B+ tree structure of the index. In order to highlight the point, we ignore the page structure and show the records of all the leaf nodes together. We want to highlight the key is: in a B + tree leaves node records are sorted by size index column value for clustering index, its corresponding records in a B + tree leaves node is sorted by the id column, for idx_key1 secondary indexes, its corresponding records in a B + tree leaves node is sorted by key1 column.

Select * from key1 where (‘b’, ‘c’);

First, the secondary index record with the value of ‘B’ in key1 column is quickly located to the left by the B+ tree corresponding to the IDx_KEY1 index. The secondary index record contains the corresponding primary key value. According to this primary key value, the complete record is located in the cluster index (this process is called back table) and returned to the server layer. The server layer then sends it to the client.
Records are arranged in a single linked list in ascending order of key values, so we can follow this single linked list and then locate the next secondary index record, and perform a table back operation, delivering the complete record to the Server layer and then sending it to the client.
Continue along the one-way list of records and repeat the above process until the value of key1 column in the secondary index does not meet the condition key1 <= ‘c’, as shown in the figure below. It was found that it did not meet the condition key1 <= ‘c’, so the search was stopped.

The above procedure is the process of finding a record of a key value in a certain range through B+ tree.

Contains the execution of the IN clause

If we want to execute the following statement:

SELECT * FROM t WHERE 
    key1 IN ('b'.'c');
Copy the code

How does the optimizer perform the above statement if it chooses to use a secondary index?

The optimizer treats the conditions IN the IN clause as two range intervals (although each interval contains only one value) :

['b', 'b']
['c', 'c']

B+ tree is used to locate the location of two records during statement execution:

First locate the key values in the range [‘b’, ‘b’] :
- The second index record with key1 column value ‘B’ is quickly located through the B+ tree corresponding to the idx_KEY1 index, and then sent back to the table to the server layer and then sent to the client.
- Then find the secondary index records conforming to KEY1 = B along the single linked list of records, and send them back to the server layer, and then send them to the client.
- Repeat the process until you find a secondary index record whose key1 column does not satisfy the condition that key1 = ‘b’.
Locate key values in range [‘c’, ‘c’] :

The search process is similar, so I won’t go into details.

So the more arguments you write IN, the more times you need to locate records through the B+ tree.

Duplicate parameter values IN the IN clause

Take this statement for example:

SELECT * FROM t WHERE 
    key1 IN ('b'.'b'.'b'.'b'.'b'.'b'.'b'.'b'.'b');
Copy the code

Although the IN clause contains several parameters, MySQL generates only one range for it when parsing: [‘b’, ‘b’].

IN clause argument order problem

Take this statement for example:

SELECT * FROM t WHERE key1 IN ('c'.'b');
Copy the code

Is there any difference between IN (‘c’, ‘b’) and IN (‘b’, ‘c’)? IN (‘c’, ‘b’) clause, will the storage engine look for key1 = ‘c’ and then for key1 = ‘b’? If so, aren’t the following statements likely to be deadlocked:

SELECT * FROM t WHERE key1 IN (SELECT * FROM t WHERE key1 IN ('b'.'c') FOR UPDATE; SELECT * FROM t WHERE key1 IN (SELECT * FROM t WHERE key1 IN ('c'.'b') FOR UPDATE;
Copy the code

IN (‘c’, ‘b’), the optimizer will tell the storage engine to find the key values IN the range [‘b’, ‘b’] and then find the key values IN the range [‘c’, ‘c’].

Effect of system variable eq_range_INDEx_DIve_limit on IN clause

It is important to remember that the MySQL optimizer only decides to use an index to perform queries because the cost of using the index is low enough. That is, even if we have the following statement:

SELECT * FROM t WHERE 
    key1 IN ('b'.'c');
Copy the code

The MySQL optimizer needs to analyze the number of key values in the [‘b’, ‘b’] and [‘c’, ‘c’] ranges if the secondary index idx_KEY1 is used, and then calculate the cost of performing the query in a way that is less expensive than the cost of performing a full table scan.

IN this step of calculating the cost of the query, you need to analyze the number of records IN each range IN turn for a query that contains the IN clause. The MySQL optimizer specifies different policies for how much of a range the IN clause corresponds to:

If clause IN the corresponding range is relatively small, so will be the first to visit the storage engine, how many have a look at the records IN the scope of each interval (if less range of records, the statistical result is accurate, and can adopt a certain means of calculating the value of a fuzzy algorithm is more trouble, of course, we don’t say, The way the optimizer first accesses the index to count the number of index records to scan before the query is actually executed is called index Dive.
If the IN clause corresponds to multiple ranges, then index dive cannot actually access the secondary index idx_KEY1 (because that would take a lot of time). Instead, you need to estimate the number of matched secondary index records using statistics previously generated in the background (obviously counting records based on statistics is much less accurate than index Dive).

When to use index Dive and when to use Index Statistic? This depends on the value of the system variable eq_range_index_dive_limit. Let’s look at the value of the system variable on my machine:

mysql> SHOW VARIABLES LIKE 'eq_range_index_dive_limit';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| eq_range_index_dive_limit | 200   |
+---------------------------+-------+
1 row in set (0.20 sec)
Copy the code

As you can see, the default value is 200, which means that index Dive will be used when the number of ranges is less than 200, otherwise Index Statistic will be used.

Note that the default value of eq_range_index_dive_limit in MySQL 5.7.3 and earlier is 10. So if you’re using 5.7.3 or earlier, it’s easy to use index statistics instead of Index Dive to calculate query costs. When you use an IN query but do not actually use an index, you should consider whether the eq_rangE_INDEx_DIve_limit value is too small.

digression

This article was first published on the public account “We are all little Frogs”.

Writing articles is tiring, and sometimes you feel that the reading is smooth, but it is actually the result of countless revisions behind. If you think it is good, please help to forward it, thanks a million ~ here is my public account “we are all small frog”, there are more technical dry goods, occasionally pull a calf, welcome to pay attention to:

How is the statement containing the IN clause executed IN MySQL

The preparatory work

Locate records from the B+ tree

Contains the execution of the IN clause

Duplicate parameter values IN the IN clause

IN clause argument order problem

Effect of system variable eq_range_INDEx_DIve_limit on IN clause

digression

Related Posts

How to debug in a container?

What do you know about Lambda expressions in Java?

This article addresses memory barriers