MySQL query cost and scan interval

Tag: article of public account

Antecedents to review

We know that for InnoDB storage engine, table data is stored in the so-called B+ tree. Every index we create is equivalent to creating another B+ tree.

For the B+ tree corresponding to the cluster index, a complete user record is stored in the leaf node (the so-called complete user record means that a cluster index record contains all user-defined columns and some built-in columns), and these cluster index records are sorted according to the primary key value from the smallest to the largest.
For the B+ tree corresponding to the secondary index, incomplete user records are stored in the leaf node (incomplete user records mean that a secondary index record contains only the index column and primary key), and these secondary index records are sorted by the value of the index column from smallest to largest.

As many records as we store in the table, there are as many records in the leaf node of each B+ tree (note “every tree”, including the B+ tree corresponding to the cluster index and the B+ tree corresponding to the secondary index).

The sample

Let’s take an example:

CREATE TABLE t (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    key1 INT,
    common_field VARCHAR(100),
    PRIMARY KEY (id),
    KEY idx_key1 (key1)
) Engine=InnoDB CHARSET=utf8;
Copy the code

This table contains two indexes (i.e., two B+ trees) :

Cluster index corresponding to primary key of id column.
The secondary index idx_KEY1 for the KEY1 column.

We insert some records into the table:

INSERT INTO t VALUES
    (1, 30, 'b'),
    (2, 80, 'b'),
    (3, 23, 'b'),
    (4, NULL, 'b'),
    (5, 11, 'b'),
    (6, 53, 'b'),
    (7, 63, 'b'),
    (8, NULL, 'b'),
    (9, 99, 'b'),
    (10, 12, 'b'),
    (11, 66, 'b'),
    (12, NULL, 'b'),
    (13, 66, 'b'),
    (14, 30, 'b'),
    (15, 11, 'b'),
    (16, 90, 'b');
Copy the code

So the clustering index for s1 now looks like this:

The secondary index of s1 looks like this:

As you can see from the diagram, all secondary index records with a value of NULL are placed at the far left of the B+ tree. This is because InnoDB’s designers specified that:

We define the SQL null to be the smallest possible value of a field.

That is, NULL is considered to be the smallest.

Tip: Forgive we put the B + tree structure made a such simplified, we omit the structure of the page, omitted all the nodes (alternative) simply draw a triangle, omitted records between the linked list, because these are not the focus of this article, as if shown in appearance just to highlight the leaf node of the record is according to the sorting of the key value of a given index.

How is the query executed

Let’s say we execute the following query:

SELECT * FROM t WHERE key1 = 53;
Copy the code

The statement execution process is as follows:

Describe the process in words:

First, the secondary index record with the value of 53 in key1 column is quickly located through the B+ tree corresponding to the secondary index IDx_KEY1.
Then through the primary key value on the secondary index record, that is, 6 to perform the table back operation, that is, to find the cluster index record whose ID column value is 6 in the cluster index.

Tip: Records in the leaf node of the B+ tree are sorted in ascending order by key value. It is very fast to locate a record in the leaf node through the B+ tree index.

Like the following query:

SELECT * FROM t WHERE key1 > 20 AND key1 < 50;
Copy the code

It looks like this:

This is what it looks like in words:

First, the first record satisfying key1 > 20 is quickly located through the B+ tree corresponding to the secondary index IDx_KEY1, that is, the record whose key1 value is 23 as shown in the figure. Then, according to the primary key value 3 in the secondary index, the complete user record is obtained and sent to the client.
Then according to the next_record attribute of the secondary index record whose key1 column value is 23 obtained in the previous step, find the next adjacent secondary index record, that is, the record whose KEY1 column value is 30, and then perform back table operation to get the complete user record and send it to the client.
Then it looks for the next record of the secondary index record whose key1 column value is 30 obtained in the previous step, and continues the back-table operation to send the complete user record to the client.
The query is terminated because the value of key1 is 53 and the value of key1 < 50 is not satisfied.

As you can see from the steps above, the more secondary index records that need to be scanned, the more back-table operations need to be performed. If you need to scan the record of all of the secondary index proportion reaches a certain range, the optimizer may choose to use a full table scan way to execute queries (an extreme example is to scan all the secondary index record, that will be implemented for all secondary index record back to the table, obviously it is better to direct a full table scan).

Therefore, the conclusion is that the condition to determine whether a query can use the index is whether the proportion of the secondary index records to be scanned is relatively low in the total records, which means that the cost is low. Therefore, the secondary index can be used to execute the query, otherwise, the full table scan should be used.

Scan interval and boundary conditions

For the following query:

SELECT * FROM t WHERE key1 > 20 AND key1 < 50;
Copy the code

If we use idx_KEY1 to perform the query, then we need to scan all secondary index records in the interval (20, 50) of key1. We call (20, 50) the scan interval of idx_KEY1. Key1 > 20 AND key1 < 50 are called the boundary conditions that form the scan interval.

As long as the index column and constant use =, <=>, IN, NOT IN, IS NULL, IS NOT NULL, >, <, >=, <=, BETWEEN,! The = or LIKE operators are concatenated to produce what is called a scan interval. Here are some of the scan areas that are easy to overlook:

The semantics of the IN operator are the same as those of several equivalent matching operators =, which are connected by OR. They all produce multiple single-point scan intervals. For example, the semantics of the following two statements are the same:
```
SELECT * FROM single_table WHERE key1 IN ('a', 'b');

SELECT * FROM single_table WHERE key1 = 'a' OR key1 = 'b';
Copy the code
```
! The scanning interval generated by = is interesting and easy to be ignored, such as:
```
SELECT * FROM single_table WHERE key1 ! = 'a';Copy the code
```
The corresponding scan interval for idx_key1 is :(-∞, ‘a’) and (‘a’, +∞).
The LIKE operator is special in that it produces an appropriate scan interval only if it matches a full string or a string prefix.

Comparing the size of a string is the same as comparing the size of each character in turn, so:
- The first character of the string is compared, and the string with a smaller first character is smaller.
- If the first character of two strings is the same and the second character is compared, the string with the smaller second character is smaller.
- If the first two characters of two strings are the same, then the third character is compared; And so on.
For an index column, records with the same string prefix must be adjacent. For example, if we have a search condition key1 LIKE ‘a%’, for the secondary index idx_key1, all secondary index records with the string prefix ‘a’ must be adjacent, which means we only need to locate the first key1 with the string prefix ‘a’. You can then scan back along the one-way list of records until a secondary index record does not have a string prefix of ‘a’

Obviously, key1 LIKE ‘a%’ forms a scan interval equivalent to [‘a’, ‘b’].

For any query statement, the optimizer uses the following methods to determine how the query should be executed:

Analyze the scan intervals for queries using different indexes.
Use some means to analyze the following corresponding costs for scanning intervals using different indexes.

Tip: This is a qualitative cost analysis, not a quantitative one, which can be found in the book. You think roughly that the more records you have in the scan interval, the higher the cost.
Compare the cost of executing a query using different indexes with the cost of a full table scan, and choose the least expensive option to execute the query, which is called the execution plan.

Analysis of specific query conditions

WHERE IS NULL IS NOT NULL! = these conditions are how the optimizer makes decisions.

IS NULL

For example, this query:

SELECT * FROM t WHERE key1 IS NULL;
Copy the code

Before executing the query, the optimizer will first make a few visits to the index to find out how many records key1 has in the [NULL, NULL] range:

Tip: [NULL, NULL] Indicates that there is only one NULL value in the interval.

After investigating, the optimizer found that the ratio of secondary index records to be scanned was 3/16 of the total number of records. It thought it was more reasonable to use secondary index to execute the query, so it was shown in the execution plan to use idx_KEY1 to execute the query:

IS NOT NULL

For example, this query:

SELECT * FROM t WHERE key1 IS NOT NULL;
Copy the code

Before executing the query, the optimizer will first make a few visits to the index to see how many records key1 has in the range (NULL, +∞) :

Tip: We’re treating NULL as a minimum here, which you can think of as less than – infinity. Also note that the interval (NULL, +∞) is an open interval, which means NULL values are not included.

The optimizer found that the ratio of secondary index records to be scanned is 13/16, which is obviously too large, so the optimizer decided to use full table scan to execute the query:

How can we make a query that uses the IS NOT NULL condition use a secondary index? Select * from ‘IS NOT NULL’; select * from ‘IS NOT NULL’;

UPDATE t SET key1 = NULL WHERE key1 < 80;
Copy the code

To execute the query:

SELECT * FROM t WHERE key1 IS NOT NULL;
Copy the code

Before executing the query, the optimizer first makes a few visits to the index to see how many entries key1 has in the range (NULL, +∞) :

! The condition of the =

For example, this query:

SELECT * FROM t WHERE key1 ! = 80;Copy the code

Before executing the query, the optimizer will first make a few visits to the index to see how many key1 records are in the interval between (NULL, 80) and (80, +∞) :

After investigating, the optimizer found that the ratio of secondary index records to be scanned was 2/16 of the total number of records. It thought it was more reasonable to use the secondary index to execute the query, so it showed in the execution plan that idx_KEY1 would be used to execute the query:

Wait a minute! Why is the “rows” column for executing the plan set to 3? What the hell is that? There’s only two records that fit the bill. Let’s list the number of records found in each interval that match the criteria:

(NULL, 80) There are 0 records that meet the condition key1! = 80.
Two records in the interval of (80, +∞) meet the condition key1! = 80.

However, the designer of the optimizer has a rule here: when the number of records that meet a given condition in a scan interval is 0, it is broken to 1. That is, the actual optimizer considers 1 record in the scan interval (NULL, 80) to be eligible key1! = 80. That’s why the rows column for executing the plan shows 3.

Tip: Here’s how the designer of the optimizer explains how to break a scan interval into 1 when the number of records that meet a given condition is 0: The MySQL optimizer seems to believe an estimate of 0 rows is always accurate and may return the result ‘Empty set’ based on that. The accuracy is not guaranteed, and even if it were, for a locking read we should anyway perform the search to set the next-key lock. Add 1 to the value to make sure MySQL does not make the assumption!

conclusion

So far, we have analyzed IS NULL, IS NOT NULL,! The key conclusion is that the cost determines the execution plan, not the query criteria. Optimizer will first for possible use to the secondary indexes of several scanning interval, and then investigate the interval with how many records respectively, in the sum of the scanning range of the secondary index record accounted for the proportion of the total number of records to a certain value, the optimizer will abandon the use of secondary indexes to execute queries, and adopt a full table scan.

Tip: IN fact, too many scan intervals can affect the optimizer’s decision. For example, too many parameters IN the IN condition will reduce the probability that the optimizer decides to use the secondary index to execute the query. In addition, the optimizer can investigate the number of index records in a scan interval in two ways: a so-called Index Dive (which is accurate when data is small but slightly biased when data is large), or relying on index statistics. Either way, the optimizer will calculate the number of index records in each scan interval. About these two kinds of investigation methods in “how MySQL is running: From the ribs to understand MySQL” have given a detailed algorithm, of course, have occupied a considerable length of space, written in the public article is a bit to kill chicken with a knife.

This article was first published on the public account “we are all little frog”, long click to pay attention to the little frog, are dry goods oh

Antecedents to review

The sample

How is the query executed

Scan interval and boundary conditions

Analysis of specific query conditions

IS NULL

IS NOT NULL

! The condition of the =

conclusion

Related Posts

“Zero basis to see understand C language introduction tutorial” — (13) socket server prepared

[SpringSecurity]]Controller How to obtain the current login user information

Clear state machine code in C++