
background

In one of our projects, a paging query over exception records on the APP side had always responded slowly. My understanding of MySQL was limited, so I assumed paging simply could not be optimized, until an article on a technology blog aroused my curiosity. I tested it on my own machine, and this blog post records the results.

preface

Before reading this article, you should have some basic knowledge of MySQL indexes, including the B+ tree index structure, covering indexes, back-to-table lookups, and clustered and non-clustered indexes. This article will also give a brief introduction to these topics. The storage engine discussed in this article is InnoDB.

Introduction

This article explains why paging queries in MySQL are slow, verifies the problem with real tests, presents an optimized solution, compares it with the original through testing, and finally gives an analysis.

Relevant concepts

Index storage structure: B+ tree

  • Only leaf nodes store data;
  • There are two head pointers, one to the first and one to the last leaf-node block (the nodes holding the minimum and maximum keys);
  • Leaf-node blocks point to each other through pointers, forming a doubly linked circular list.
  • This makes the B+ tree suitable for two kinds of lookups: range and paging lookups on the primary key, and random lookups starting from the root node (a short SQL sketch follows the figure note below).

(Image found online; the original source is unknown, so it cannot be credited.)
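To make the two lookup patterns concrete, here is a minimal SQL sketch (the table t and its ids are hypothetical, for illustration only):

-- Range / paging lookup: descend the tree once to the first matching leaf,
-- then follow the doubly linked leaf list to the neighboring blocks.
select * from t where id between 100 and 200;

-- Random (point) lookup: descend from the root node to a single leaf.
select * from t where id = 12345;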

Index concepts

This section briefly introduces four concepts that the rest of this article relies on: clustered index, non-clustered index, back-to-table lookup, and covering index.

From the B+ tree structure above, we know that the data is actually stored in the leaf nodes. But a table may have several indexes at the same time; if every index stored the full row data in its leaf nodes, space would be wasted, and every insert or delete could cause leaf pages in multiple indexes to split and merge, leading to frequent I/O. InnoDB is designed to avoid this problem.

Only the leaf nodes of the primary key index store the complete rows. For any other ordinary (secondary) index, the leaf nodes store only the primary key value: to get columns beyond the indexed ones, MySQL first finds the primary key id through the secondary index, then uses that id to fetch the row from the primary key index. This second step is called going "back to the table". If a query can be answered from the secondary index alone, with no back-to-table lookup, the index is said to cover the query (a "covering index"). Because the primary key index stores the index and the complete data together, it is called the "clustered index"; ordinary indexes are "non-clustered indexes".
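A minimal sketch of back-to-table versus covering index (the table t1, its columns, and the index idx_age are hypothetical, not part of this article's test setup):

-- Hypothetical table with a secondary index on age.
CREATE TABLE `t1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `age` int(11) DEFAULT NULL,
  `name` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_age` (`age`)
) ENGINE=InnoDB;

-- Covering index: id and age both live in idx_age's leaf nodes, so no
-- back-to-table lookup is needed (EXPLAIN shows "Using index" in Extra).
select id, age from t1 where age = 30;

-- Back to table: name lives only in the clustered index, so each match
-- found in idx_age must be re-read from the primary key index by its id.
select id, age, name from t1 where age = 30;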

InnoDB allows tables to be created without a primary key, so does that mean such a table has no clustered index? No, every InnoDB table has a clustered index. Here are the rules:

  • If the table defines a primary key, that primary key is the clustered index;
  • If the table has no primary key defined, the first UNIQUE index whose key columns are all declared NOT NULL becomes the clustered index;
  • Otherwise, InnoDB generates a hidden row ID and uses it as the clustered index. (A short sketch of the second rule follows this list.)
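A quick sketch of the second rule (the table t2 below is hypothetical, not part of this article's test setup):

-- No primary key, but code is UNIQUE and NOT NULL, so InnoDB
-- uses idx_code as the clustered index.
CREATE TABLE `t2` (
  `code` int(11) NOT NULL,
  `name` varchar(255) DEFAULT NULL,
  UNIQUE KEY `idx_code` (`code`)
) ENGINE=InnoDB;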

conclusion

That was a quick and simple introduction to the basic MySQL concepts; now let's get down to business.

Paging queries before optimization

Many blogs say that paging is inefficient, but we need to actually test it to see whether that's true.

To prepare data

  • Environment: MySQL 5.7
  • Engine: InnoDB
  • Scripts: create the table, create a stored procedure, and call it:
-- Create table logs1
CREATE TABLE `logs1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `logtype` varchar(255) DEFAULT NULL,
  `logurl` varchar(255) DEFAULT NULL,
  `logip` varchar(255) DEFAULT NULL,
  `logdz` varchar(255) DEFAULT NULL,
  `ladduser` varchar(255) DEFAULT NULL,
  `lfadduser` varchar(255) DEFAULT NULL,
  `laddtime` datetime DEFAULT NULL,
  `htmlname` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB  AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='1000W';


-- Create a stored procedure; temporarily switch the statement delimiter to $$
DELIMITER $$
CREATE PROCEDURE my_insert()
BEGIN
	SET @i=1;
	WHILE @i <= 10000000 DO
		INSERT INTO `logs1`(`logtype`,`logurl`,`logip`,`logdz`,`ladduser`,`lfadduser`,`laddtime`,`htmlname`) VALUES (2, '/index', '0:0:0:0:0:0:0:1', NULL, NULL, 'null', '2018-05-03 14:02:42', 'home');
		SET @i=@i+1;
	END WHILE;
END $$
DELIMITER ;

-- Call the stored procedure
CALL my_insert();

The steps above create the table and fill it with data: 10 million rows were inserted in total.
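To confirm the load completed, a simple count can be run (on 10 million rows even this takes a moment):

select count(*) from logs1;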

Enable profiling

To make it easier to monitor execution times, first enable profiling. Once this feature is on, you can view the execution time of each SQL statement:

set profiling=1;
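Once profiling is on, the recorded timings can be inspected with the standard commands below; the numbers listed by SHOW PROFILES identify the statement to drill into:

show profiles;               -- lists recent statements with their durations
show profile for query 1;    -- stage-by-stage breakdown of statement #1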

To begin testing

Here, the effects of the fetch count and of the offset on performance are tested separately, three runs for each statement, with the times recorded:

  • Test the impact of different fetch counts on speed: each SQL statement is executed three times
select * from logs1 where logtype=2 limit 100000, 1;
select * from logs1 where logtype=2 limit 100000, 10;
select * from logs1 where logtype=2 limit 100000, 100;
select * from logs1 where logtype=2 limit 100000, 1000;
select * from logs1 where logtype=2 limit 100000, 10000;

The recorded times are as follows:

From these results we can see that the larger the fetch count, the slower the query, but performance does not degrade by much.

  • Test the impact of different offsets on query speed: each SQL statement is again executed three times
select * from logs1 where logtype=2 limit 100, 100;
select * from logs1 where logtype=2 limit 1000, 100;
select * from logs1 where logtype=2 limit 10000, 100;
select * from logs1 where logtype=2 limit 100000, 100;
select * from logs1 where logtype=2 limit 1000000, 100;

The recorded times are as follows:

This set of results shows that the larger the offset, the slower the query; performance drops off sharply.

Why is pagination slow

When paging, for example when we need row one million, MySQL has no way to jump straight past the first million rows. It scans from the beginning, and only when it reaches row one million does it start returning rows, discarding everything read before that. So in the second test above, the larger the offset, the more the performance degrades. And if back-to-table lookups are needed, performing them for each of the first million rows makes the cost even larger.
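EXPLAIN makes this visible. The sketch below uses the logs1 table from this article; the exact output varies with MySQL version and table statistics:

-- With no index on logtype this is a full scan (type: ALL), and the rows
-- estimate covers roughly the whole table: MySQL reads and throws away the
-- first 1,000,000 matching rows before returning the 100 requested ones.
explain select * from logs1 where logtype=2 limit 1000000, 100;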

conclusion

The larger the offset and the larger the fetch count, the worse the performance, and the offset affects performance far more than the fetch count.

Optimized paging queries

How to optimize

Few systems can avoid paging queries entirely, so can they be optimized? The answer is yes. Here is one solution, based on a subquery, that fetches 100 records starting from the 100,000th row:

select id
from logs1
where logtype=2 and id >=
        (select id from logs1 where logtype=2 limit 100000, 1)
limit 100;

Testing the optimized query

In ordinary paging queries the offset has the dominant impact on performance, so only the offset is tested here; the fetch-count test is not repeated.

  • Offset test for the optimized query: each SQL statement is again executed three times
select id from logs1 where logtype=2 and id >= (select id from logs1 where logtype=2 limit 100, 1) limit 100;
select id from logs1 where logtype=2 and id >= (select id from logs1 where logtype=2 limit 1000, 1) limit 100;
select id from logs1 where logtype=2 and id >= (select id from logs1 where logtype=2 limit 10000, 1) limit 100;
select id from logs1 where logtype=2 and id >= (select id from logs1 where logtype=2 limit 100000, 1) limit 100;
select id from logs1 where logtype=2 and id >= (select id from logs1 where logtype=2 limit 1000000, 1) limit 100;

The optimized test results are as follows:

For easier comparison, here again are the pre-optimization results for the same offsets:

The comparison shows that the larger the offset, the more obvious the optimization effect: at least a two- to three-fold speedup.

Principles of optimization

The comparison above shows that the subquery really does improve efficiency severalfold, which naturally raises the question: what is the principle behind it?

Here is the optimized query again:

select id
from logs1
where logtype=2 and id >=
        (select id from logs1 where logtype=2 limit 100000, 1)
limit 100;

As summarized above: the larger the offset and the larger the fetch count, the worse the performance, so the optimization can start from these two points by reducing both. At the same time, recall the B+ tree property that lookups on the primary key are fast. With that in mind, let's analyze the optimized SQL:

  • The subquery: it still starts from row 100,000, but it fetches only 1 row, and returns just an id rather than whole rows; this is the first optimization;

  • The id >= condition of the outer WHERE clause: because id is the primary key, the B+ tree lets MySQL locate the starting row with a fast range lookup, so this step is also fast;

  • The limit of the main query: because the subquery and the WHERE clause have already located the starting position of the scan, limit 100 reads only 100 rows; it no longer scans and then discards a huge number of leading rows. Together these bring the big improvement in efficiency, as the EXPLAIN sketch below reflects.
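The same can be checked with EXPLAIN on the optimized statement. This is a sketch under the same assumptions as above, and the exact plan depends on version and statistics:

-- The outer query typically shows a range scan on PRIMARY for id >= ...;
-- only the subquery still walks past the first 100,000 rows, and it
-- returns a single id instead of full rows.
explain select id from logs1 where logtype=2 and id >= (select id from logs1 where logtype=2 limit 100000, 1) limit 100;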

conclusion

In this article, tests on a large data set were used to analyze why paging scenarios are inefficient, and an optimized solution was given, along with comparative tests and an analysis of its principle. Of course, there are many other options for paging optimization, such as a between-and range on the primary key (sketched below) or temporary tables, which are not covered here.
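For example, the between-and variant just mentioned can be as simple as the sketch below; it assumes the ids are continuous with no gaps, which rarely holds once rows have been deleted:

-- Walks the clustered index directly; nothing is scanned and discarded.
select * from logs1 where id between 1000000 and 1000099;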

One last word

If you find anything wrong, please point it out in the comments section or on WeChat and I will fix it.

Thank you for reading. If this article helped you even a little, please give it a like and bookmark it!

Below is my WeChat; technical discussion and exchange are welcome.