How do you generally tune SQL? How does MySQL select indexes during execution?

preface

It’s the second week after the Spring Festival, and I finally have time to summarize the knowledge. Take a look at SQL tuning. This is the kind of question you’ll be asked in an interview, whether you’re in the backend, operations, testing, etc. Like the two questions in the title of this article, I encountered in the actual interview process, so this time mainly around these two questions to sum up.

The focus of this article is in the second part, please sit through!

Explain queries SQL execution plans

When we want to know the execution plan of an SQL statement, we can simulate the optimizer to execute the SQL query through the Explain keyword, and then analyze the SQL statement.

For example, 🌰 :

Create the following table

CREATE TABLE `test_score` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `name` varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL COMMENT 'name',
  `subject` varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL COMMENT 'subjects',
  `score` int(10) DEFAULT NULL,
  `create_time` datetime DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP.PRIMARY KEY (`id`),
  KEY `idx_name_score` (`name`,`score`) USING BTREE COMMENT 'Joint Index of Student Names and grades',
  KEY `idx_create_time` (`create_time`) USING BTREE COMMENT 'Index of creation time'
) ENGINE=InnoDB AUTO_INCREMENT=20 DEFAULT CHARSET=latin1 COMMENT='Student Subject Result Sheet';
Copy the code

In the student subject score table above, I created two indexes, one for name and score, and one for create_time.

Then insert 10W rows into test_score as stored procedure:

DROP PROCEDURE  IF EXISTS  insert_test_score;
CREATE DEFINER = `root` @`%` PROCEDURE `insert_test_score` ( ) 
 BEGIN
	DECLARE
		c_id INTEGER DEFAULT 1;
	WHILE
			c_id < = 100000 DO
			INSERT INTO test_score
		VALUES
			( c_id, concat( 'Li Ming Ming', c_id ), 
			convertSubject(RAND()*10),
			RAND()*100, 
			date_sub( NOW( ), INTERVAL c_id SECOND));SET c_id = c_id + 1;
		
	END WHILE;
END
Copy the code

Custom function: convertSubject contents are as follows:

-- Custom functions
delimiter $$
DROP FUNCTION IF EXISTS convertSubject $$
create function `convertSubject`(xis int) returns VARCHAR(50)
begin
		CASE xis
		WHEN 1 THEN RETURN 'mathematics';
		WHEN 2 THEN RETURN 'Chinese';
		WHEN 3 THEN RETURN"English".WHEN 4 THEN RETURN"Physical";WHEN 5 THEN RETURN"Chemistry";WHEN 6 THEN RETURN"Creatures";WHEN 7 THEN RETURN"Political";WHEN 8 THEN RETURN"History";WHEN 9 THEN RETURN"Geography";ELSE  RETURN"Sports";END CASE ;
end
$$
delimiter ;
Copy the code

Execute stored procedure to insert data:

call insert_test_score();
Copy the code

After the data is inserted successfully, we look at the execution plan of a query SQL statement:

EXPLAIN select * from test_score where name = 'Li Mingming 5';
Copy the code

Running results:Explain the meaning of each field in the execution plan result:

id

Id represents the execution order of SQL statements. There are two cases:

Same id

If there is only one SQL single table query (no intermediate table), then the id is 1, as in:If there is an SQL query in the middle table, there will be multiple execution results, but the id value is the same, this represents the execution order from top to bottom.

EXPLAIN select t2.name,t1.score 
from test_score t1 join subject_score t2 on t1.id = t2.id where t2.id = 1;
Copy the code

Id not the same

When the ID values are different, the higher the value, the higher the priority, that is, the higher the value, the earlier the execution.

EXPLAIN select `name`,score from test_score t1 where id = (SELECT id from test_score t2 where t2.id =1);
Copy the code

select_type

Select_type indicates the query type, which is different from the simple query, joint query, sub-query and other query types. There are mainly the following types:

Simple: Indicates that the SQL does not contain subquery or associated query.
Primary: If the SQL contains sub-queries, the outermost query is marked as primary.
Subquery: Execute SQL that contains subqueries in select or WHERE, and the subquery will be marked as subQuery.
Derived: Subqueries contained in the FROM list are marked as Derived, and MySQL puts the results of those Derived queries into temporary tables before using them for subsequent queries.
Union: If the second select appears after the union, it is marked as union, and if the union is included in the subquery of the form clause, the outer select is marked as derived.
Union result: the result of the union

table

The name of the table referenced by the data entry row (alias is displayed if the table name has an alias).

Type

The connection type is displayed as follows:

System: A special case of the const join type, which has only one row of data in the table (this type is only shown in general system tables).
Const: Found by index once, const is used to compare primary key or unique indexes. Because only one row of data is matched, mysql can convert the query to a constant if the primary key is placed in the WHERE list.
Eq_ref: Unique index scan, for each index key, only one record in the table matches it. This is common with unique indexes or primary key scans.
Ref: non-unique index scan, which returns all rows matching a single value. Multiple rows may be returned. In essence, it can also be classified as a continuous index scan.
Range: Retrieves rows in a given range, using an index to select a range of data. The key column displays which index is used, usually in the WHERE statement, between,in, etc. This range scan index scan is better than a full table scan because it starts at one point in the index and ends at another without a full table scan.
Index: Different from all, index traverses only the index tree. It is usually faster than all because index files are usually much smaller than data files.
All: The matching data is found through full table scanning.

possible_keys

Possible_keys specifies which index MySQL can use to find rows in the table. The actual indexes are usually in the possible_keys index.

key

The name of the index that is actually run during SQL execution. If no index is run, this value is Null.

key_len

Represents the number of bytes used in the index. This column calculates the length of the index used in the query. The shorter the length is, the better, without losing precision. If the key is NULL, the length is NULL. This field is displayed as the maximum possible length of the index field, not the actual length used.

ref

The ref column shows which column or constant is used with the key to select rows from the table, and if the value is const it represents a constant.

rows

Based on table statistics and index selection, approximate the number of rows that need to be read to find the desired record.

filtered

Indicates the number of rows returned as a percentage of the number of rows to be read. The greater the Filtered column value, the better. The value of a Filtered column depends on statistics.

Extra

Additional information that is important but not suitable for display in other columns. The main values are as follows:

Using filesort: indicates that mysql applies an external index sort to data. Instead of reading in index order within the table. In MySQL, the inability to perform a sort using an index is called “file sort”.
Using temporary: A temporary table is used to hold intermediate results. Mysql uses temporary tables when sorting query results. Common in sort order by and group by queries.
Using index: indicates that the corresponding SELECT operation uses an overwrite index to avoid accessing rows of the table. If using WHERE is also present, the table name index is used to perform a lookup of the index key value. If using WHERE is not present at the same time, the index is used to read data rather than perform a query action.
Using WHERE: Indicates that where filtering is used.
Using join buffer: The join buffer is used.
Impossible WHERE: The value of the where clause is always false and cannot be used to retrieve any tuples.
Select Tables Optimized Away: Without the Group by clause, optimize Min/Max operations based on indexes or count (*) for MyISAM storage engine without the need to wait until the execution stage.
Distinct: Optimizes the DISTINCT operation to stop finding the same value once the first matching tuple is found.

By understanding the results of the Explain execution plan, we can analyze what optimization should be done when we write SQL, these operations are some of the skills that must be mastered, there is also to understand which conditions will cause index failure, such as: Query after a field is evaluated or like ‘%***’, field type implicit conversion, etc. Also, try to avoid back table, use overwrite index to complete the query is best, use file sorting, try to avoid filesort and so on.

How does MySQL select indexes during execution?

From the above we understand the contents of each field of the Explain execution plan, we know that MySQL in the execution of SQL, the final execution plan is what kind of.

However, I don’t know if you’ve ever written a SQL that thought it would run an index, but it ended up running a full table scan without any indexes. There is also, write their own SQL, think should go A index, but through Explain check, the result is to go B index.

In these cases, MySQL’s query optimizer analyzes your SQL and ends up using the least expensive execution plan. This means that sometimes MySQL thinks it is cheaper to scan a full table than to run an index.

How does MySQL’s query optimizer optimize SQL? I don’t know how we picked the cheapest strategy. When there are multiple indexes, which index should I go to?

So let’s take a look at that.

The cost of query

MySQL makes an execution plan of possible scenarios before executing a query, and then decides which execution plan to use based on cost.

The cost here refers to: IO cost and CPU cost.

==IO cost refers to that when MySQL reads data, it reads data from disk to memory. The unit of read data is data page, and each page is 16KB. Therefore, the cost constant of read data page is written as 1 (the cost of a page is 1) ==.
==CPU cost refers to the execution cost of the CPU to check whether the data meets the query or sorting conditions. By default, the cost constant of detection logging is recorded as 0.2 (in this case, the cost of detecting each row of data) ==.

SQL > select * from MySQL; select * from MySQL; select * from MySQL;

SHOW TABLE STATUS LIKE 'test_score'
Copy the code

Output result:

From this result, we can see that the total number of rows in the test_score table is 99869. Didn’t we insert 10W earlier? Why are there over 100 lines missing? In fact, the total number of rows is only an estimate of MySQL, but this estimate does not affect the cost of our calculation. As stated above, the CPU cost constant of a single record is 0.2, so 99869*0.2= 19,974.
The total length of data in the table is 6832128 bytes. InnoDB has 16KB data per page and the total length of data is 417 pages. Therefore, the IO cost is 417*1=417.

So the cost of a full table scan is approximately 19,974 +417= 20,391.

Statistical cost

Next, let’s use an SQL example to illustrate the execution cost statistics. Again, take the example of test_Score, the data table we created above.

EXPLAIN select * from test_score 
where name > 'Li Mingming 90017' and create_time < 'the 2021-02-26 18:00:00';
Copy the code

SQL > execute a full table scan:But if we change the create_tieme parameter from 18 to 17, the execution plan display will follow the index of create_time, not the name joint index.

EXPLAIN select * from test_score 
where name > 'Li Mingming 90017' and create_time < 'the 2021-02-26 17:00:00';
Copy the code

Execution Plan:MySQL > select * from ‘where’; select * from ‘where’; select * from ‘where’;

The reason for this result is that MySQL chose the optimal plan to execute SQL based on cost. So how does MySQL make an execution plan and what is the basis for choosing a plan?

MySQL execution plan selection process.

In MySQL5.6 and later, you can see the optimizer generating an execution plan through the Optimizer_Trace function. With this feature, we can understand the cost of each MySQL plan and further optimize the query.

The Optimizer_Trace function is turned off by default. You can open it with the following code, execute the specific SQL, view the execution plan from the Information_schema.optimizer_trace table, and then manually disable OPTIMIZER_TRACE.

SET optimizer_trace="enabled=on";
select * from test_score 
where name > 'Li Mingming 90017' and create_time < 'the 2021-02-26 17:00:00';
SELECT * FROM information_schema.OPTIMIZER_TRACE;
SET optimizer_trace="enabled=off";
Copy the code

The result is a large JSON with a lot of content, so I just truncated the key parts.

Let’s go firstidx_name_scoreExecute plan when indexing:We see. Let’s goidx_name_scoreIndex, to scan data 21474 rows, the cost is: 25770.

Idx_create_time = idx_create_time;I see it here. Let’s goidx_create_time, scanning data 6805 lines, cost is 8167, lower than idx_NAMe_score index cost. So MySQL finally chose to goidx_create_timeThe implementation plan of.

Change create_time to 18, and then look at the full table scan execution time.

SET optimizer_trace="enabled=on";
select * from test_score where name > 'Li Mingming 90017' and create_time < 'the 2021-02-26 18:00:00';
SELECT * FROM information_schema.OPTIMIZER_TRACE;
SET optimizer_trace="enabled=off";
Copy the code

We see that a full table scan retrieves 99,869 rows of data and the cost of execution is20391.this20391Exactly the cost we calculated above when counting full table scans.

MySQL actually compares the cost of these three execution plans, and then selects the one with the least cost and executes it.

With the Optimizer_Trace function, you can see how MySQL selects an execution plan and, therefore, which index to use.

MySQL > select * from ‘SQL’; MySQL > select * from ‘SQL’; MySQL > select * from ‘SQL’;

- Full table scan takes 3.545 seconds
select * from test_score  
where name > 'Li Mingming 90017' and create_time < 'the 2021-02-26 18:00:00';

-- Forcing an index drive takes 3.088 seconds
select * from test_score force index(idx_create_time) 
where name > 'Li Mingming 90017' and create_time < 'the 2021-02-26 18:00:00';
Copy the code

Optimizer_trace: Optimizer_trace: Optimizer_trace: Optimizer_trace: Optimizer_trace: Optimizer_trace: Optimizer_trace

conclusion

Finally, to sum up, I made a table of related information about the values of each field in Explain.

Optimizer_trace optimizer_trace optimizer_trace optimizer_trace optimizer_trace optimizer_trace optimizer_trace

The last

Using the test_score table of the above example, to share a SQL question I met in the interview.Or subject score table, according to this table, please write in SQL, each subject with the highest score of the students. The returned field should have the student’s name, subject and grade.