Essential for backend programmers: mastering ORDER BY

Preface

In daily development we use ORDER BY all the time. But do you know how ORDER BY actually works? How can it be optimized? What pitfalls come with using it? This article works through ORDER BY step by step.

WeChat official account: a boy picking up field snails

A simple example using order by

Assume a table of employees with the following structure:

CREATE TABLE `staff` (
  `id` BIGINT(11) AUTO_INCREMENT COMMENT 'primary key id',
  `id_card` VARCHAR(20) NOT NULL COMMENT 'ID card number',
  `name` VARCHAR(64) NOT NULL COMMENT 'name',
  `age` INT(4) NOT NULL COMMENT 'age',
  `city` VARCHAR(64) NOT NULL COMMENT 'city',
  PRIMARY KEY (`id`),
  INDEX idx_city (`city`)
) ENGINE = INNODB COMMENT 'staff table';

Table data is as follows:

We now have the following requirement: query the name, age, and city of the first 10 employees from Shenzhen, sorted by age from youngest to oldest. The corresponding SQL statement can be written like this:

select name,age,city from staff where city = 'shenzhen' order by age limit 10;

The logic of this statement is clear, but what about the underlying execution flow?

How order by works

Explain execution plan

Let’s start by looking at the execution plan, obtained with the EXPLAIN keyword:

The key field of the execution plan shows that the index idx_city is used.
Using index condition in Extra shows that an index condition is applied.
Using filesort in Extra shows that sorting is performed.

As you can see, this SQL uses indexes and also uses sorting. So how is it sorted?

Full field sort

MySQL allocates a small block of memory to each query thread, called sort_buffer, for sorting. So when are the fields put in for sorting? The matching rows are first located through the idx_city index, and their data is then placed into sort_buffer.

The idx_city index tree is as follows:

The leaf nodes of the idx_city index tree store the primary key id. The primary key id index is the clustered index tree, which holds the row data.

How does our query find the matching data? It first uses the idx_city index tree to find the corresponding primary key id, and then searches the primary key id index tree with that id to find the corresponding row data.

After adding order by, the overall execution process is as follows:

1. MySQL initializes sort_buffer for the thread, ready to hold the name, age, and city fields to be queried;
2. From the index tree idx_city, find the first primary key id that meets the condition city='shenzhen', that is, id=9 in the figure;
3. Go to the primary key id index tree, fetch the row with id=9, take the values of name, age, and city, and store them in sort_buffer;
4. Get the primary key id of the next record from the index tree idx_city, that is, id=13 in the figure;
5. Repeat steps 3 and 4 until the value of city is no longer shenzhen;
6. The previous 5 steps have found all the rows whose city is Shenzhen. Sort all the data in sort_buffer by age;
7. Take the first 10 rows of the sorted result and return them to the client.
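The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: CLUSTERED and IDX_CITY are hypothetical in-memory stand-ins for the two index trees, the sample rows are made up, and the function mirrors the described flow rather than MySQL's real implementation.

```python
from bisect import bisect_left

# Hypothetical stand-in for the clustered index: primary key id -> full row.
CLUSTERED = {
    3: {"name": "Li", "age": 28, "city": "beijing"},
    9: {"name": "Wang", "age": 32, "city": "shenzhen"},
    13: {"name": "Zhao", "age": 25, "city": "shenzhen"},
    15: {"name": "Chen", "age": 41, "city": "shanghai"},
    21: {"name": "Liu", "age": 29, "city": "shenzhen"},
}
# Hypothetical stand-in for idx_city: (city, id) pairs kept in index order.
IDX_CITY = sorted((row["city"], pk) for pk, row in CLUSTERED.items())


def full_field_sort(city, limit):
    """Simulate: select name,age,city ... where city=? order by age limit ?"""
    sort_buffer = []  # step 1: buffer for name, age, city
    # Step 2: locate the first index entry for this city.
    pos = bisect_left(IDX_CITY, (city,))
    while pos < len(IDX_CITY) and IDX_CITY[pos][0] == city:
        pk = IDX_CITY[pos][1]
        row = CLUSTERED[pk]  # step 3: fetch the row by primary key
        sort_buffer.append((row["name"], row["age"], row["city"]))
        pos += 1  # step 4: next index entry
    sort_buffer.sort(key=lambda r: r[1])  # step 6: sort by age
    return sort_buffer[:limit]  # step 7: first `limit` rows


print(full_field_sort("shenzhen", 2))
```

Every queried column travels through the buffer here, which is why this mode needs no second lookup after sorting.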

The execution diagram is as follows:

Reading all the fields required by the query into sort_buffer is what full-field sort means. But sort_buffer is only a block of memory. What happens if the amount of data is too large to fit?

Sorting with the help of disk temporary files

In fact, the size of sort_buffer is controlled by one parameter: sort_buffer_size. If the data to be sorted is smaller than sort_buffer_size, sorting is done in sort_buffer memory; if it is larger than sort_buffer_size, sorting is done with the help of disk files.

How do you determine whether disk files were used for sorting? You can use the following commands:

set optimizer_trace = "enabled=on";
select name,age,city from staff where city = 'shenzhen' order by age limit 10;
select * from information_schema.optimizer_trace;

The number_of_tmp_files value in the trace output shows whether temporary files were used.

number_of_tmp_files is the number of disk temporary files used for sorting. If number_of_tmp_files > 0, disk files were used for sorting.

What about the whole sorting process when disk temporary files are used?

1. Fetch the required data from the primary key id index tree and place it in sort_buffer. When sort_buffer is nearly full, sort the data in it, then write the sorted data to a small temporary file on disk.
2. Go back to the primary key id index tree for more data, load it into sort_buffer, sort it, and write it to another small temporary file on disk.
3. Repeat until all rows that meet the condition have been read. Finally, merge the small sorted temporary files on disk into one large ordered file.

Tips: sorting with small disk temporary files is in fact the merge sort algorithm.
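The spill-and-merge process described above can be sketched as a small external merge sort. This is an illustration only: buffer_rows is a hypothetical stand-in for sort_buffer_size (counted in rows rather than bytes), and the returned file count only loosely corresponds to number_of_tmp_files, whose exact accounting in MySQL may differ.

```python
import heapq
import tempfile


def external_sort(ages, buffer_rows):
    """Sort `ages` with a buffer of at most `buffer_rows` values.

    Returns (sorted_values, number_of_temporary_files_used).
    """
    tmp_files = []
    buffer = []

    def flush():
        # Sort the full buffer and write it out as one sorted run.
        buffer.sort()
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{age}\n" for age in buffer)
        f.seek(0)
        tmp_files.append(f)
        buffer.clear()

    for age in ages:
        if len(buffer) == buffer_rows:  # buffer is full: spill to disk
            flush()
        buffer.append(age)

    if not tmp_files:
        # Everything fit in memory: no temporary files needed.
        return sorted(buffer), 0

    if buffer:
        flush()  # write the last partial run

    # Merge the sorted runs, reading each file lazily line by line.
    runs = [(int(line) for line in f) for f in tmp_files]
    merged = list(heapq.merge(*runs))
    for f in tmp_files:
        f.close()
    return merged, len(tmp_files)


ages = [32, 25, 29, 41, 28, 35, 22]
print(external_sort(ages, 3))   # spills to temporary files
print(external_sort(ages, 10))  # fits in the buffer
```

With a buffer large enough for all rows the function never touches disk, which matches the case where number_of_tmp_files is 0.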

If the data does not fit in sort_buffer, temporary disk files are needed, which hurts sorting efficiency. So why put fields that are irrelevant to sorting (name, city) into sort_buffer at all? Wouldn’t it be better to put in only the age field that the sort actually needs? That is what rowid sort does.

Rowid sort

Rowid sort places only the fields needed for sorting, plus the primary key id, into sort_buffer. So how does MySQL decide between full-field sort and rowid sort?

It is controlled by a parameter, max_length_for_sort_data, which is the row-length threshold MySQL uses when sorting. If the length of a single row exceeds this value, MySQL considers the row too large and switches to rowid sort. We can check the value of this parameter with the following command:

show variables like 'max_length_for_sort_data';

The default value of max_length_for_sort_data is 1024. In this example, the length of name, age, and city is 64 + 4 + 64 = 132 < 1024, so full-field sort is used. Let’s make this parameter smaller:

set max_length_for_sort_data = 32;
select name,age,city from staff where city = 'shenzhen' order by age limit 10;
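The threshold check described above comes down to a small piece of arithmetic. The sketch below is a simplification: COLUMN_BYTES just reuses the declared sizes from the article's estimate (64 + 4 + 64), and choose_sort_mode is a hypothetical helper, not how MySQL actually measures row length.

```python
# Declared column sizes used in the article's estimate (an assumption,
# not the exact byte lengths MySQL would compute).
COLUMN_BYTES = {"name": 64, "age": 4, "city": 64}


def choose_sort_mode(columns, max_length_for_sort_data):
    """Return 'full-field' if the row fits the threshold, else 'rowid'."""
    row_length = sum(COLUMN_BYTES[c] for c in columns)
    if row_length > max_length_for_sort_data:
        return "rowid"
    return "full-field"


print(choose_sort_mode(["name", "age", "city"], 1024))  # 132 <= 1024
print(choose_sort_mode(["name", "age", "city"], 32))    # 132 > 32
```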

What does the overall SQL execution flow look like with rowid sort?

1. MySQL initializes sort_buffer for the thread, ready to hold the age field to be sorted and the primary key id;
2. From the index tree idx_city, find the first primary key id that meets the condition city='shenzhen', that is, id=9 in the figure;
3. Go to the primary key id index tree, fetch the row with id=9, take the values of age and the primary key id, and store them in sort_buffer;
4. Get the primary key id of the next record from the index tree idx_city, that is, id=13 in the figure;
5. Repeat steps 3 and 4 until the value of city is no longer shenzhen;
6. The previous 5 steps have found all the rows whose city is Shenzhen. Sort all the data in sort_buffer by age;
7. Iterate through the sorted result, take the first 10 rows, go back to the original table by id to fetch the city, name, and age fields, and return them to the client.
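The rowid flow above can be sketched the same way. Again this is only an illustration: CLUSTERED and IDX_CITY are hypothetical in-memory stand-ins for the index trees with made-up rows, and the lookup counter is added purely to make the extra trip to the table visible.

```python
from bisect import bisect_left

# Hypothetical stand-in for the clustered index: primary key id -> full row.
CLUSTERED = {
    3: {"name": "Li", "age": 28, "city": "beijing"},
    9: {"name": "Wang", "age": 32, "city": "shenzhen"},
    13: {"name": "Zhao", "age": 25, "city": "shenzhen"},
    15: {"name": "Chen", "age": 41, "city": "shanghai"},
    21: {"name": "Liu", "age": 29, "city": "shenzhen"},
}
# Hypothetical stand-in for idx_city: (city, id) pairs kept in index order.
IDX_CITY = sorted((row["city"], pk) for pk, row in CLUSTERED.items())


def rowid_sort(city, limit):
    """Return (rows, number_of_clustered_index_lookups)."""
    lookups = 0
    sort_buffer = []  # holds only (age, id)
    pos = bisect_left(IDX_CITY, (city,))
    while pos < len(IDX_CITY) and IDX_CITY[pos][0] == city:
        pk = IDX_CITY[pos][1]
        lookups += 1  # first lookup: read age for sorting
        sort_buffer.append((CLUSTERED[pk]["age"], pk))
        pos += 1
    sort_buffer.sort(key=lambda entry: entry[0])  # sort by age

    result = []
    for _age, pk in sort_buffer[:limit]:
        lookups += 1  # second lookup: return to the table for the row
        row = CLUSTERED[pk]
        result.append((row["name"], row["age"], row["city"]))
    return result, lookups


print(rowid_sort("shenzhen", 2))
```

In this sample the three matching rows cost three lookups during the scan plus two more for the returned rows; that second group is the extra return to the table that full-field sort avoids.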

The execution diagram is as follows:

Compared with full-field sort, rowid sort involves one extra return to the table.

What is a return to the table? Taking a primary key and going back to the primary key index tree to fetch the row is called returning to the table.

With optimizer_trace, we can see if rowid sort is used:

set optimizer_trace = "enabled=on";
select name,age,city from staff where city = 'shenzhen' order by age limit 10;
select * from information_schema.optimizer_trace;

Full field sort compared to Rowid sort

Full-field sort: when sort_buffer memory is insufficient, disk temporary files are needed, which causes many disk accesses.
Rowid sort: sort_buffer can hold more rows, but the original table must be visited again to fetch the data, one more return to the table than full-field sort.

In general, for the InnoDB storage engine, full-field sort is preferred. Note that max_length_for_sort_data is set to 1024, which is fairly large. The fields to be sorted usually do not exceed this value, so full-field sort is what normally happens.

Some optimization ideas for order by

How can we optimize an ORDER BY statement?

Sorting is needed because the data is unordered. If the data is already ordered, there is nothing to sort. Index data is ordered by nature, so we can optimize the ORDER BY statement by creating a joint index.
We can also optimize by adjusting parameters such as max_length_for_sort_data.

Joint index optimization

Review the query plan for the sample SQL

explain select name,age,city from staff where city = 'shenzhen' order by age limit 10;

We add a joint index idx_city_age on the query condition column city and the sort column age, then look at the execution plan again:

alter table staff add index idx_city_age(city,age);
explain select name,age,city from staff where city = 'shenzhen' order by age limit 10;

With idx_city_age in place, Using filesort is no longer needed. Why is that? Because the index itself is ordered, as the schematic of the idx_city_age joint index shows:

The entire SQL execution flow becomes the following:

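1. From the index idx_city_age, find the first primary key id that meets the condition city='shenzhen';
2. Go to the primary key id index tree, fetch the row, take the values of name, city, and age, and return them directly as part of the result set;
3. Fetch the primary key id of the next record from the index idx_city_age;
4. Repeat steps 2 and 3 until the 10th record is found or the condition city='shenzhen' is no longer met.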
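The index-ordered scan above needs no sort step at all, which a short sketch makes visible. As before, CLUSTERED and IDX_CITY_AGE are hypothetical in-memory stand-ins with made-up rows, not MySQL internals.

```python
from bisect import bisect_left

# Hypothetical stand-in for the clustered index: primary key id -> full row.
CLUSTERED = {
    3: {"name": "Li", "age": 28, "city": "beijing"},
    9: {"name": "Wang", "age": 32, "city": "shenzhen"},
    13: {"name": "Zhao", "age": 25, "city": "shenzhen"},
    15: {"name": "Chen", "age": 41, "city": "shanghai"},
    21: {"name": "Liu", "age": 29, "city": "shenzhen"},
}
# Hypothetical stand-in for idx_city_age: entries ordered by city, then age.
IDX_CITY_AGE = sorted(
    (row["city"], row["age"], pk) for pk, row in CLUSTERED.items()
)


def scan_idx_city_age(city, limit):
    """Walk the index in order and stop as soon as `limit` rows are found."""
    result = []
    pos = bisect_left(IDX_CITY_AGE, (city,))  # step 1: first matching entry
    while (
        pos < len(IDX_CITY_AGE)
        and IDX_CITY_AGE[pos][0] == city
        and len(result) < limit
    ):
        pk = IDX_CITY_AGE[pos][2]
        row = CLUSTERED[pk]  # step 2: return to the table for name
        result.append((row["name"], row["age"], row["city"]))
        pos += 1  # step 3: next index entry
    return result  # already in age order, no sort performed


print(scan_idx_city_age("shenzhen", 2))
```

Because entries for one city are already stored in age order, the loop can stop early at the limit instead of reading every Shenzhen row.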

The process diagram is as follows:

As the diagram shows, there is still a return to the table. Is there a more efficient solution for this example? Yes, a covering index:

Covering index: the index already contains every column the query needs, so the result can be taken directly from the index without returning to the table. In other words, the index columns used by the SQL cover the columns of the query result.

If we build a joint index on city, age, and name, the query can use a covering index, and even the return to the table is skipped when the SQL executes.
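A covering-index scan can be sketched as follows. The assumption here is a hypothetical joint index on (city, age, name), modelled as the sorted list IDX_CITY_AGE_NAME with made-up entries; since every queried column is in the index entry, the function never touches a clustered index.

```python
from bisect import bisect_left

# Hypothetical covering index on (city, age, name), with the primary key id
# appended as in a secondary index. Sample entries are made up.
IDX_CITY_AGE_NAME = sorted([
    ("beijing", 28, "Li", 3),
    ("shenzhen", 32, "Wang", 9),
    ("shenzhen", 25, "Zhao", 13),
    ("shanghai", 41, "Chen", 15),
    ("shenzhen", 29, "Liu", 21),
])


def covering_scan(city, limit):
    """Answer the query from index entries alone: no table lookup, no sort."""
    result = []
    pos = bisect_left(IDX_CITY_AGE_NAME, (city,))
    while (
        pos < len(IDX_CITY_AGE_NAME)
        and IDX_CITY_AGE_NAME[pos][0] == city
        and len(result) < limit
    ):
        entry_city, age, name, _pk = IDX_CITY_AGE_NAME[pos]
        result.append((name, age, entry_city))
        pos += 1
    return result


print(covering_scan("shenzhen", 2))
```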

Optimizing by tuning parameters

We can also tune parameters to optimize the execution of ORDER BY. For example, sort_buffer_size can be adjusted: if sort_buffer is too small and the data volume is large, sorting falls back to disk temporary files. If the MySQL server is well provisioned, the value can be set somewhat larger.

We can also adjust max_length_for_sort_data. If this value is too small, ORDER BY uses rowid sort, which returns to the table and reduces query performance. So max_length_for_sort_data can be set somewhat larger.

Of course, in many cases we simply use the default values of these MySQL parameters.

Some notes for using order by

With no WHERE condition, does the ORDER BY field need an index?

In daily development we may run into an ORDER BY with no WHERE condition. Should the field after ORDER BY be indexed in that case? Consider the following SQL, where create_time already has an index:

select * from A order by create_time;

An unconditional query will not use the index even if create_time is indexed. This is because the MySQL optimizer considers walking an ordinary secondary index and then returning to the table more expensive than a full table scan. So it chooses the full table scan, and then sorts with full-field sort or rowid sort.

If the query SQL is modified:

select * from A order by create_time limit m;

For an unconditional query with a small m, the index can be used. The MySQL optimizer reckons that by walking the index in order and returning to the table, it can stop as soon as it has m rows, which costs less than a full table scan, so it chooses the secondary index.

What if the paging LIMIT offset is too large, causing a lot of sorting?

Suppose the SQL is as follows:

select * from A order by a limit 100000,10;

You can record the last id of the previous page and include it in the query condition when fetching the next page, for example: where id > (last id of the previous page) limit 10.
You can also cap the number of pages, if the business allows.
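The difference between the two paging styles can be sketched with a list that stands in for rows already ordered by id. This is an illustration under assumptions: ROWS, page_by_offset, and page_by_last_id are hypothetical names, the "rows walked" count is a rough proxy for work done, and the id-based approach applies when pages are ordered by id.

```python
from bisect import bisect_right

# Hypothetical table rows, already ordered by primary key id.
ROWS = [{"id": i, "payload": f"row-{i}"} for i in range(1, 101)]
IDS = [row["id"] for row in ROWS]


def page_by_offset(offset, size):
    """limit offset,size: rows before the offset are walked and discarded."""
    rows_walked = min(len(ROWS), offset + size)
    return ROWS[offset:offset + size], rows_walked


def page_by_last_id(last_id, size):
    """where id > last_id limit size: seek directly past the previous page."""
    start = bisect_right(IDS, last_id)  # index seek
    page = ROWS[start:start + size]
    return page, len(page)


offset_page, offset_cost = page_by_offset(90, 5)
keyset_page, keyset_cost = page_by_last_id(90, 5)
print(offset_page == keyset_page, offset_cost, keyset_cost)
```

Both calls return the same five rows, but the offset version walks 95 rows to get there while the id-based version walks only the 5 it returns.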

The index storage order is inconsistent with ORDER BY: how to optimize?

Suppose there is a joint index idx_age_name, and the requirement changes to this: query the name and age of the first 10 employees, sorted by age from youngest to oldest, and by name in descending order when the ages are equal. The corresponding SQL statement can be written like this:

select name,age from staff  order by age ,name desc limit 10;

If we look at the execution plan, we find that Using filesort is used.

This is because in the idx_age_name index tree, entries are ordered by age from smallest to largest, and for equal ages by name from smallest to largest. The ORDER BY, however, asks for age from smallest to largest and, for equal ages, name from largest to smallest. In other words, the index storage order differs from the ORDER BY order.
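The mismatch is easy to see on a handful of rows. The sketch below uses made-up (name, age) pairs; index_order models how idx_age_name stores its entries and requested_order models what the ORDER BY asks for. Both function names are hypothetical.

```python
# Made-up sample rows as (name, age) pairs.
STAFF = [("Zhao", 25), ("Chen", 25), ("Liu", 30), ("Wang", 30)]


def index_order(rows):
    """Order stored in idx_age_name: age ascending, then name ascending."""
    return sorted(rows, key=lambda r: (r[1], r[0]))


def requested_order(rows):
    """Order requested by: order by age, name desc."""
    # Two stable passes: secondary key first (name descending),
    # then primary key (age ascending).
    by_name_desc = sorted(rows, key=lambda r: r[0], reverse=True)
    return sorted(by_name_desc, key=lambda r: r[1])


print(index_order(STAFF))
print(requested_order(STAFF))
print(index_order(STAFF) == requested_order(STAFF))
```

Within each age group the two orders are reversed, so reading the index front to back cannot produce the requested result without an extra sort.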

How do we optimize? If you are on MySQL 8.0, which supports Descending Indexes, you can define the index like this:

CREATE TABLE `staff` (
  `id` bigint(11) NOT NULL AUTO_INCREMENT COMMENT 'primary key id',
  `id_card` varchar(20) NOT NULL COMMENT 'ID card number',
  `name` varchar(64) NOT NULL COMMENT 'name',
  `age` int(4) NOT NULL COMMENT 'age',
  `city` varchar(64) NOT NULL COMMENT 'city',
  PRIMARY KEY (`id`),
  KEY `idx_age_name` (`age`,`name` desc) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=15 DEFAULT CHARSET=utf8 COMMENT='staff table';

Does the SQL execution involve sorting when the IN condition has multiple values?

If we have a joint index on city and the sort column, such as idx_city_age, executing the following SQL involves no sorting:

explain select * from staff where city in ('shenzhen') order by age limit 10;

However, if the IN condition contains more than one value, sorting is performed:

explain select * from staff where city in ('shenzhen','shanghai') order by age limit 10;

This is because IN has two values. The rows that satisfy Shenzhen are ordered by age among themselves, but once the rows that satisfy Shanghai are added, there is no guarantee that all the ages together are in order. Hence Using filesort is needed.
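A small sketch shows why one city comes out ordered and two do not. IDX_CITY_AGE is a hypothetical stand-in for a joint index on (city, age) with made-up entries; the helper names are illustrative only.

```python
# Hypothetical joint index on (city, age), kept in index order.
IDX_CITY_AGE = sorted([
    ("shenzhen", 25),
    ("shenzhen", 29),
    ("shenzhen", 32),
    ("shanghai", 27),
    ("shanghai", 41),
])


def ages_in_index_order(cities):
    """Ages as read from the index for rows whose city is in `cities`."""
    return [age for city, age in IDX_CITY_AGE if city in cities]


def is_sorted(values):
    return all(a <= b for a, b in zip(values, values[1:]))


one_city = ages_in_index_order({"shenzhen"})
two_cities = ages_in_index_order({"shenzhen", "shanghai"})
print(one_city, is_sorted(one_city))
print(two_cities, is_sorted(two_cities))
```

Each city's run is ordered by age, but reading the two runs one after the other is not, so the combined result still has to be sorted.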

