Recently I came across a MySQL paging-optimization test case. It gave a classic solution without describing the test scenario very precisely. Real situations are rarely fixed, so any general-purpose practice or rule has to be weighed against many scenarios. And whenever a technique does achieve an optimization, we should ask why; when the same technique, in a different scenario, fails to optimize anything, we should ask why as well.

I was skeptical about this scenario and tested it on my own. Sure enough, I found some problems and confirmed some of my expectations.

This article looks at MySQL paging optimization, starting from the simplest situation and working through a simple analysis.

In addition, the test environment here is a cloud server with the lowest available configuration. The hardware is quite limited, but it is at least “fair” to the different ways of writing the statements.

Added 2017-09-16:

Think about it for a moment:
MySQL’s classic pagination “optimization”

There is a classic problem with MySQL pagination: the “further back” the requested data, the slower the query (this depends on the index type on the table; the same thing happens with B-tree indexes in SQL Server).

select * from t order by id limit m,n

That is, as m increases, fetching the same number of rows takes longer and longer. Faced with this problem, a classic approach (or some variation of it) is to first find only the ids in the paging range, then join them back to the base table, and finally fetch the desired rows:

select * from t

inner join (select id from t order by id limit m,n) t1 on t1.id = t.id

Does this rewrite always work? Under what circumstances does it actually optimize anything? Have you ever applied such a rewrite and found it made no difference, or even slowed things down?

At the same time, most queries have filter conditions; with a filter, the SQL statement becomes:

select * from t where *** order by id limit m,n

If you do the same thing, rewrite it like this:

select * from t

inner join (select id from t where *** order by id limit m,n) t1 on t1.id = t.id

In this case, can the rewritten SQL statement achieve the purpose of optimization?

Test environment setup

The test data is fairly simple: a stored procedure writes rows in a loop, and the test table uses the InnoDB engine.
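The table definition and the load procedure appear only in screenshots that are not reproduced here; a minimal sketch of that kind of setup might look like the following (the column list, procedure name, and payload values are assumptions; only the table name test_table1 is taken from the tests below):

create table test_table1
(
    id int not null auto_increment,
    c1 varchar(50) not null default '',
    c2 varchar(50) not null default '',
    primary key (id)
) engine = InnoDB;

delimiter $$
create procedure p_load_test_data(in p_rows int)
begin
    declare i int default 0;
    while i < p_rows do
        insert into test_table1 (c1, c2) values (uuid(), uuid());
        set i = i + 1;
    end while;
end$$
delimiter ;

call p_load_test_data(5000000);   -- 500W rows, matching the tests below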




innodb_flush_log_at_trx_commit was set to 2 for the load; otherwise, with the default value of 1, the 5 million (500W) rows would not finish writing in a day.
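For reference, the setting can be changed at runtime (it can also be placed in my.cnf); with a value of 2 the log is written to the OS cache at every commit and flushed to disk roughly once per second, which is what makes the row-by-row insert bearable:

set global innodb_flush_log_at_trx_commit = 2;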




Why paging queries need optimizing

First, let’s look at the classic problem: the further back the page, the slower the query.

Test 1: querying rows 1-20 takes 0.01 seconds.




Querying 20 rows from much further back, such as rows 4900001-4900020, takes 1.97 seconds.
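The two timings above presumably come from statements of the following form (the exact SQL is only visible in the screenshots, so the table name is an assumption):

select * from test_table1 order by id limit 0,20;          -- rows 1-20
select * from test_table1 order by id limit 4900000,20;    -- rows 4900001-4900020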

It can be seen from the above that, under the same query conditions, the further back the requested data, the lower the query efficiency. Put simply: to fetch the same 20 rows, the further back they sit, the higher the query cost. Why the later rows are more expensive is analyzed step by step below.

The test environment is CentOS 7 with MySQL 5.7, and the test table holds 5 million (500W) rows.




Reproducing the classic paging “optimization”: no improvement when sorting on the clustered index column with no filter condition

Here is a comparison of the two ways of writing the query when the clustered index column is the sort column.

select * from t order by id limit m,n

select * from t

inner join (select id from t order by id limit m,n) t1 on t1.id = t.id

The first way:

select * from test_table1 order by id asc limit 4900000,20;

The test result is shown in the screenshot. The execution time is 8.31 seconds.




The second rewritten form is:

select t1.* from test_table1 t1

inner join (select id from test_table1 order by id limit 4900000,20) t2 on t1.id = t2.id;

The execution time is 8.43 seconds.




It is quite clear here that the classic rewrite does not improve performance at all; if anything it is marginally slower. In repeated runs, the two statements showed no meaningful difference in performance.

Personally, whenever I see conclusions like this I have to test them myself; this is not something to settle by guesswork or luck. If a rewrite improves efficiency, ask why; if it does not, ask why not.

So why didn’t the rewrite improve the performance as it was supposed to?

What is causing the current rewrite to fail to improve performance?

How does the latter improve performance?

First look at the structure of the test table. There is nothing wrong with the indexing: the point is that the only index on the table is the primary key (the clustered index).




Why, when the sort column is the clustered index, does the “optimized” rewrite fail to achieve any “optimization”?

When the sort column is the clustered index column, both forms scan the table in order to find the qualifying rows. Although the rewritten form runs the subquery first and then uses its result to drive the main table, the subquery does not change the underlying approach of “scan the table in order until the qualifying rows are found”, so in this situation the rewrite is simply gilding the lily.

Compare the following two execution plans: the single line of the first plan is essentially the same as the third line (id = 2) of the rewritten SQL’s plan.
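For anyone reproducing the comparison, the two plans can be generated with EXPLAIN (the plan output itself is only in the screenshots):

explain select * from test_table1 order by id asc limit 4900000,20;

explain select t1.* from test_table1 t1
inner join (select id from test_table1 order by id limit 4900000,20) t2 on t1.id = t2.id;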




When there is no filter condition, the so-called paging-query optimization is just gilding the lily.

Both of the methods above are very slow for this query, so what should we do if we really need this data? To answer that, first look at why it is slow, which requires a rough picture of the B-tree structure. In my own rough understanding (see the two sketches below), when the target data is “far back”, the scan simply leans toward one end of the B-tree index. Inside the balanced tree itself there is no such thing as “front” or “back”; forward and backward are relative, a matter of which direction you scan from. Data that is “at the back” seen from one direction is “at the front” seen from the other. Front and back are not absolute.

The following two screenshots are rough sketches of the B-tree index structure. With the target data in a fixed position, the so-called “back” only means back when scanning from left to right;




If you look from right to left, what was previously called the back data is actually the front.




As long as the data sits near the front of the scan, it can still be found efficiently. MySQL should have something similar to SQL Server’s forward and backward scans.

If the data is scanned from the back (in descending order), those rows can be found quickly; the rows found can then be sorted again in ascending order, and the result should be exactly the same.

First, the result is exactly the same as the query above, which takes 0.07 seconds, compared to more than 8 seconds for the previous two methods, a hundredfold difference in efficiency.




As for why, the explanation above should make it clear; the SQL used is attached below.

If you frequently query the so-called later data, such as rows with large ids or rows that are relatively recent in the time dimension, you can page efficiently by scanning the index in reverse (descending) order.

(Note that you have to recompute where the data falls: for the same rows, the starting “page number” differs between ascending and descending order.)

select * from
(
select * from test_table1 order by id desc limit 99980,20
) t order by id;
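To see where 99980 comes from (assuming the table holds exactly 5,000,000 rows, as stated above): the target rows are 4,900,001-4,900,020 in ascending id order, and scanned in descending order those same rows occupy positions 99,981-100,000, because 5,000,000 - 4,900,020 + 1 = 99,981. Skipping the first 99,980 rows in descending order therefore lands on exactly the same 20 rows, which are then re-sorted ascending.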

The rewrite does help when sorting on a non-clustered index column with no filter condition

Here’s the change to test table test_table1:
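The screenshots of this change are not reproduced here; judging from the tests that follow, it was presumably something along these lines (the column type, the way id_2 is populated, and the index name are assumptions):

alter table test_table1 add column id_2 int not null default 0;
update test_table1 set id_2 = id;   -- just one possible way the column was filled
alter table test_table1 add index idx_id_2 (id_2);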
















The tests above sorted by the primary key (the clustered index); now we sort by a non-clustered index, the new column id_2, and run the same two paging methods again.

Let’s start with the first way:

select * from test_table1 order by id_2 asc limit 4900000,20;

The execution time is a little over a minute, let’s say 60 seconds.




The second way is:

select t1.* from test_table1 t1

inner join (select id from test_table1 order by id_2 limit 4900000,20) t2 on t1.id = t2.id;

The execution time is 1.67 seconds.




In this case, that is, when sorting on a non-clustered index column, the rewritten form gives a very significant boost: almost a 40-fold improvement.

So why?

With the first form, to return 20 rows sorted by id_2 at that offset, MySQL has to scan the whole table and sort the rows by id_2. The full table scan is itself a very time-consuming process, and the sort is another very large cost, so performance is very poor.




In the rewritten form, the subquery selects only the primary key, which it can do by walking the secondary index on id_2 (in InnoDB a secondary index entry carries the primary key value). That already gives the ids in order, avoiding the need to read a large amount of row data and then re-sort it (Using filesort).

If you are familiar with SQL Server execution plans: compared with the former, the latter also avoids repeatedly going back to the base table (what SQL Server calls a key lookup or bookmark lookup); the subquery drives the outer query to fetch the 20 qualifying rows in one batch, in a single pass.
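The expectation, consistent with the explanation above, is that EXPLAIN for the first form shows a full scan with Using filesort, while the subquery of the second form can walk the index on id_2 in order:

explain select * from test_table1 order by id_2 asc limit 4900000,20;

explain select t1.* from test_table1 t1
inner join (select id from test_table1 order by id_2 limit 4900000,20) t2 on t1.id = t2.id;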




In fact, the rewritten SQL can improve the efficiency of paging queries only in the current case, that is, when sorting non-clustered index columns.

Even so, there is still a large gap between this “optimized” paging statement and the reverse-scan form shown earlier (repeated below), which returned its 20 rows in 0.07 seconds, roughly 24 times faster than the 1.67 seconds here.

select * from
(
select * from test_table1 order by id desc limit 99980,20
) t order by id;

Another question worth raising: if paging queries frequently sort by some column, why not build the clustered index on that column? In MySQL (InnoDB) the clustered index is created on the primary key, whether that is an auto-increment id or a time column plus other fields that guarantee uniqueness. With a clustered index, “front” and “back” are just relative, logical concepts, and the reverse-scan technique can be used when most queries want the “back”, that is the newer, data.
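As an illustration of that last point (a purely hypothetical schema, not one of the test tables above): a table that is almost always paged by time can include the time column in the primary key, so the clustered index itself is in paging order.

create table t_events
(
    id bigint not null,
    create_time datetime not null,
    payload varchar(100) not null default '',
    primary key (create_time, id)   -- InnoDB clusters rows by the primary key
) engine = InnoDB;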

Optimization of paging queries when filtering conditions exist

On reflection, the situation is too complex to generalize a very representative case, so we won’t test it too much.

select * from t where *** order by id limit m,n
















The more complex the situation, the harder it is to summarize a universal rule or method; everything has to be judged in its specific context, and it is difficult to draw a single conclusion. I will not analyze the filtered queries case by case here, but one thing is certain: detached from the actual scenario, there is no fixed solution.

In addition, when fetching the current page you can use the maximum value from the previous page as a filter condition, which also locates the current page’s rows quickly. That works fine, but it is a different technique and is not covered by this article beyond the brief sketch below.
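A minimal sketch of that approach (sometimes called keyset pagination; the literal value 20 is a placeholder for whatever id the previous page ended with):

select * from test_table1 order by id asc limit 20;                -- first page
select * from test_table1 where id > 20 order by id asc limit 20;  -- next page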

As an additional test under SQL Server: if the sort column has only a single-column non-clustered index, the paging rewrite does not improve efficiency either.

create table TestPaging
(
    id int identity(1,1),
    name varchar(50),
    other varchar(100)
)

declare @i int = 0
while @i < 100000
begin
    insert into TestPaging values (NEWID(), NEWID())
    set @i = @i + 1
end

create index idx on TestPaging(name)



As the execution plan shows, the subquery that fetches the ids does a full table scan.

Unless it is a composite (covering) index, in which case it is more efficient when the table is large (an index scan for the subquery costs less than a full table scan). But then again, if a column is frequently used for paging, why not create the clustered index on that column?
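For example (an assumption for illustration, not something tested in the original), widening the index so that it covers the subquery would let the id list come from an index scan rather than a full table scan:

create index idx_name_id on TestPaging(name, id)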



Conclusion

In fact, for a B-tree index, front and back are relative, logical concepts, and the performance difference comes from the structure of the B-tree index and the direction of the scan. Adding filter conditions makes the situation more complicated. The same principle applies in SQL Server, where I also ran tests; they are not repeated here.

In the current situation, where the sort order, the query conditions, and the data distribution are all uncertain, it is hard for any one specific method to deliver “optimization”; at worst it only gilds the lily. So paging optimization must be analyzed against the concrete scenario, and there is not necessarily only one method. A conclusion divorced from the actual scenario is nonsense. Only by understanding the ins and outs of the problem can we get the better of it.

So my personal conclusion about this kind of “optimization” is that it must be analyzed case by case; it is a real mistake to boil it down to a set of rules (rule 1, 2, 3, 4, 5) for others to “apply”. Since I am not that strong at this myself, I dare not lay down any dogmas.


The original article was published on April 15, 2018

Author: lujun

This article is from the cloud community partner Data & Cloud. For more information, follow Data & Cloud.