Optimization for complex deep paging problems

There is an article table, which stores the basic information of each article (article id, author id, and other attributes), and a content table, which stores the article body; its primary key is article_id.

To import the data, my colleague first queried the qualifying author ids in the project, then started multiple threads; each thread took one author id at a time and performed the query and import work.

The SQL below queries all articles by the author whose id is 1111:

```sql
SELECT a.*, c.*
FROM article a
LEFT JOIN content c ON a.id = c.article_id
WHERE a.author_id = 1111
  AND a.create_time < '2020-04-29 00:00:00'
LIMIT 210000, 100
```

Because the database sits on a mechanical hard disk, the query becomes extremely slow once the offset reaches 200,000.

The operations colleagues received an alarm saying the database was blocked on IO and had failed over from master to slave several times. We opened Navicat and tried to execute the statement ourselves, and it also hung; we then ran the SHOW PROCESSLIST command against the database and found that every query was in the "Writing to net" state.

There was no option but to temporarily take the import job offline and run the KILL command to terminate the current query threads (because the MySQL server keeps executing a query even after the client disconnects).

We then analyzed why this statement executed so slowly:

1. Is a joint index being used?

The current index status is as follows:

```
article: primary key id, normal (single-column) index on author_id
content: primary key article_id
```

With only the single-column index on author_id, MySQL first finds all primary key ids with author_id = 1111 in that index, then goes back to the clustered index for every one of those rows to check whether create_time meets the condition, and finally discards the first 210,000 matching rows to satisfy the offset.

So we changed the single-column author_id index on article to the joint index (author_id, create_time). In the B+ tree of this joint index, entries are sorted first by author_id and then by create_time, so the rows satisfying both author_id = 1111 and create_time < '2020-04-29 00:00:00' can be located directly within the index, without checking create_time row by row back in the clustered index.
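As a quick sanity check of this joint-index behavior, here is a minimal sketch using SQLite (via Python's sqlite3) as a stand-in for MySQL; the table is a hypothetical miniature of article, and SQLite's plan text differs from MySQL's EXPLAIN, but the index mechanics are analogous:

```python
import sqlite3

# Hypothetical miniature of the article table; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article (id INTEGER PRIMARY KEY, author_id INT, create_time TEXT)")
conn.execute("CREATE INDEX idx_author_create ON article (author_id, create_time)")

# The query plan should show the joint index bounding both columns at once.
detail = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM article "
    "WHERE author_id = 1111 AND create_time < '2020-04-29 00:00:00'"
).fetchone()[-1]
print(detail)  # e.g. "... USING COVERING INDEX idx_author_create (author_id=? AND create_time<?)"
```

The plan names idx_author_create and binds both author_id and create_time, i.e. the equality plus range condition is resolved inside the index.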

With LIMIT 210000, 100 the query still returned nothing for several minutes, until Navicat reported a timeout. With the offset lowered to 6000, 100, the index is hit and we can barely get the data back, but it takes 46 seconds, so the bottleneck is not here.

The real reasons are as follows:

Let's start with two typical deep-paging queries, where id is the primary key and val is a plain secondary index.

2. Direct query

```sql
SELECT * FROM test WHERE val = 4 LIMIT 300000, 5;
```

3. Query the primary key ids first, then inner join

```sql
SELECT * FROM test a
INNER JOIN (SELECT id FROM test WHERE val = 4 LIMIT 300000, 5) AS b ON a.id = b.id;
```

The first query scans 300,005 entries in the val index and goes back to the clustered index for each of them (300,005 table lookups), then discards the first 300,000 rows and returns the last 5. A lot of random I/O is generated in the process.

The second query starts by reading only the last five ids under the normal index val, and then goes to the clustered index to read five data pages.
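The two query shapes above can be sketched end to end on a toy table; this uses SQLite via Python's sqlite3 as a stand-in for MySQL (the `test` table and its contents are made up for illustration), and shows that the deferred join returns exactly the same page as the direct offset:

```python
import sqlite3

# Toy table: 2000 rows, even ids have val = 4, odd ids val = 7.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, val INT)")
conn.execute("CREATE INDEX idx_val ON test (val)")
conn.executemany("INSERT INTO test (id, val) VALUES (?, ?)",
                 [(i, 4 if i % 2 == 0 else 7) for i in range(1, 2001)])

# 1) direct deep offset: walks past the first 300 matching rows one by one
direct = conn.execute(
    "SELECT id, val FROM test WHERE val = 4 LIMIT 5 OFFSET 300").fetchall()

# 2) deferred join: apply the offset on the narrow index only, then
#    fetch the full rows for just the 5 surviving ids
deferred = conn.execute(
    "SELECT a.id, a.val FROM test a "
    "INNER JOIN (SELECT id FROM test WHERE val = 4 LIMIT 5 OFFSET 300) b "
    "ON a.id = b.id").fetchall()

print(sorted(direct) == sorted(deferred))  # both return the same page
```

On a tiny in-memory table the timing difference is invisible, of course; the point is only that the rewritten query is result-equivalent while doing its table lookups for 5 rows instead of 300,005.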

Our business query is actually a more complicated case: the SQL not only reads 210,100 rows from the article table, but also joins the content table for each row, and that table has several columns of type TEXT. We used the SHOW TABLE STATUS command to look at the two tables.

Both tables hold more than 2 million rows. The average row length of article is 266 bytes, while that of content is 16,847 bytes.

When InnoDB stores a very long VARCHAR or BLOB in the Compact or Redundant row format, it does not put the whole value in the data page. Instead it stores the first 768 bytes of the column in the data page, followed by a pointer to the overflow pages holding the rest.

So if you read 100 consecutive rows from the content table, you also need to read the overflow pages of each row, which means a large amount of random I/O. Because of the hardware characteristics of mechanical hard disks, random I/O is much slower than sequential I/O. We then ran a test:

If we query only the article table with an offset around 200,000, the statement below returns quickly:

```sql
SELECT a.* FROM article a
WHERE a.author_id = 1111
  AND a.create_time < '2020-04-29 00:00:00'
LIMIT 200100, 100
```

Conversely, if we take 100 article_id values directly and query the content table with them, the first query takes about 3 s (when the content for those ids is not yet cached), and the second takes about 0.04 s, because by then the overflow page data has been loaded into the buffer pool.

```sql
SELECT SQL_NO_CACHE c.* FROM article_content c WHERE c.article_id IN (...)
```

4. Solutions

Therefore, there are two main solutions to this problem:

(1) Query the primary key ids first, then inner join

This applies to non-continuous paging: to view page 100 we do not need to have viewed page 99 first, i.e., pages may be skipped.

The idea is to first page through only the primary key ids of article in a subquery (temp_table), inner join the result back to article to get the article rows, and then left join the content table to fetch each article's content.

The first query is about 1.11s, and each subsequent query is about 0.15s

```sql
SELECT
  a.*, c.*
FROM article a
INNER JOIN(
  SELECT  id FROM  article a
  WHERE  a.author_id = 1111
  AND a.create_time < '2020-04-29 00:00:00'
  LIMIT 210000 ,
  100
) as temp_table ON a.id = temp_table.id
LEFT JOIN content c ON a.id = c.article_id
```

The optimization results

Before optimization, when the offset reaches 200,000, the query time is too long until timeout.

After optimization, the query time is 1.11s when offset reaches the level of 200,000.

(2) Use range query conditions to limit the data taken out

The general idea of this method: suppose we want to read the 100 rows after offset 10,000 in test_table, and assume we already know the id of the 10,000th row, whose value is min_id_value.

Then the query can be written as `select * from test_table where id > min_id_value order by id limit 0, 100`: the range condition on id skips the first 10,000 rows directly, the offset becomes 0, and we just take the next 100 rows.

However, this method has a limitation: you must know the id corresponding to the offset in advance to use it as min_id_value, adding `id > min_id_value` to the filter. For paged browsing, that means knowing the maximum id of the previous page, so you can only move one page at a time and cannot skip pages. But since our business requirement is to process the data in batches of 100, this scenario fits.

For this approach, our business SQL is rewritten as follows:

```
SELECT min(a.id) as min_id, max(a.id) as max_id
FROM article a
WHERE a.author_id = 1111
  AND a.create_time < '2020-04-29 00:00:00'

while (min_id < max_id) {
    SELECT a.*, c.*
    FROM article a
    LEFT JOIN content c ON a.id = c.article_id
    WHERE a.author_id = 1111
      AND a.id > min_id
    LIMIT 100
    -- after each batch, set min_id to the largest id returned in that batch
}
```
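The batch loop above can be sketched as runnable code; this uses SQLite via Python's sqlite3 as a stand-in for MySQL, with a hypothetical 500-row article table. Note one adjustment, labeled in the comments: the cursor starts just below min_id, because a strict `id > min_id` starting at min_id itself would skip the very first row.

```python
import sqlite3

# Hypothetical 500-row article table; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, author_id INT)")
conn.executemany("INSERT INTO article (id, author_id) VALUES (?, 1111)",
                 [(i,) for i in range(1, 501)])

min_id, max_id = conn.execute(
    "SELECT min(id), max(id) FROM article WHERE author_id = 1111").fetchone()

# Start the cursor just below min_id so the first row is included
# (with "id > min_id" starting at min_id itself, the first row is skipped).
cursor = min_id - 1
batches = 0
total = 0
while cursor < max_id:
    rows = conn.execute(
        "SELECT id FROM article WHERE author_id = 1111 AND id > ? "
        "ORDER BY id LIMIT 100", (cursor,)).fetchall()
    if not rows:
        break
    cursor = rows[-1][0]   # advance to the largest id in this batch
    batches += 1
    total += len(rows)

print(batches, total)  # 5 batches covering all 500 rows
```

Each batch is a `LIMIT 0, 100` style query with a range condition on the primary key, so no batch ever pays the deep-offset cost.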

The optimization results

Before optimization, when the offset reaches 200,000, the query time is too long until timeout.

After optimization, when offset reaches the level of 200,000, the query time is 0.34s because the ID of the 200,000th data is known.

Optimization with a joint index

A joint index actually serves three purposes:

1. Make full use of where conditions to narrow the scope

For example, we need to query the following statement:

```sql
SELECT * FROM test WHERE a = 1 AND b = 2
```

If a and b each have only a single-column index, the query can use just one of them, say the index on a: it finds all primary key ids where a = 1, then goes back to the table, reading each of those rows from the clustered index and filtering for b = 2 there to get the result set. Using the index on b is the same process.

If a joint index (a, b) is built on a and b, the query goes directly to the a = 1 entries in the joint index, then narrows down within them to b = 2 to find the qualifying result set, and only then goes back to the table.

2. Avoid going back to the table (this is also called a covering index)

Query a and b as follows:

```sql
SELECT a, b FROM test WHERE a = 1 AND b = 2
```

With single-column indexes on a and b, we would find the set of qualifying primary key ids and then go back to the clustered index for the rows. But if every column the query reads is contained in the joint index, no table lookup is needed at all.
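A covering index is easy to see in a query plan; here is a minimal sketch in SQLite (via Python's sqlite3, standing in for MySQL, with a made-up `test` table): with the composite index on (a, b), the query is answered from the index alone.

```python
import sqlite3

# Made-up table; with idx_ab on (a, b), SELECT a, b needs no table lookup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, a INT, b INT)")
conn.execute("CREATE INDEX idx_ab ON test (a, b)")

detail = conn.execute(
    "EXPLAIN QUERY PLAN SELECT a, b FROM test WHERE a = 1 AND b = 2"
).fetchone()[-1]
print(detail)  # the plan names a COVERING INDEX, i.e. no table lookup
```

SQLite reports this as "USING COVERING INDEX"; MySQL expresses the same thing as "Using index" in the Extra column of EXPLAIN, as described in the Principle analysis section below.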

3. Reduce the number of rows that need a table lookup

This is the case when we query a > 1 and b = 2:

```sql
SELECT * FROM test WHERE a > 1 AND b = 2
```

With a single-column index on a, every primary key id with a > 1 is found in that index, and each of those rows is then looked up in the table.

With the joint index (a, b), by the leftmost-prefix matching rule, the range condition on a (anything other than = or IN is a range search) means only the a part of the index can be used for navigation; b cannot be used to narrow the search further. However, since every leaf entry of the joint index also contains b, the b = 2 condition is filtered while scanning the a > 1 entries. Only the ids satisfying both a > 1 and b = 2 then need a table lookup, so the amount of data going back to the table is much smaller.

Our business SQL was originally more complex and joined other tables, but because the optimization bottleneck was the joint index, it is simplified a bit here. The simplified SQL:

```sql
SELECT a.id AS article_id, a.title AS title, a.author_id AS author_id
FROM article a
WHERE a.create_time BETWEEN '2020-03-29 03:00:00.003' AND '2020-04-29 03:00:00.003'
  AND a.status = 1
```

This statement pulls, from the article table, the articles created within the month whose status is 1.

The table originally had only the single-column index idx_createTime on create_time; we added the joint index idx_createTime_status on (create_time, status) and compared the two:

Force the query to use idx_createTime:

```sql
SELECT a.id AS article_id, a.title AS title, a.author_id AS author_id
FROM article a FORCE INDEX(idx_createTime)
WHERE a.create_time BETWEEN '2020-03-22 03:00:00.003' AND '2020-04-22 03:00:00.003'
  AND a.status = 1
```

Force the query to use idx_createTime_status (the optimizer would pick this index anyway without FORCE):

```sql
SELECT a.id AS article_id, a.title AS title, a.author_id AS author_id
FROM article a FORCE INDEX(idx_createTime_status)
WHERE a.create_time BETWEEN '2020-03-22 03:00:00.003' AND '2020-04-22 03:00:00.003'
  AND a.status = 1
```

Optimization results:

Before optimization, idx_createTime single-column index is used, and the query time is 0.91s

After optimization, with the joint index idx_createTime_status, the query time is 0.21s.

EXPLAIN results are as follows:

| id | type | key | key_len | rows | filtered | Extra |
|----|-------|-----|---------|--------|----------|-------|
| 1 | range | idx_createTime | 4 | 311608 | 25.00 | Using index condition; Using where |
| 2 | range | idx_createTime_status | 6 | 310812 | 100.00 | Using index condition |

4. Principle analysis

Let's first review the meanings of the various values of the Extra column in EXPLAIN.

Using filesort

When a query contains an ORDER BY operation that cannot be satisfied by an index, the MySQL query optimizer has to fall back to a sorting algorithm: it sorts in memory when the data set is small, otherwise on disk. EXPLAIN does not tell the client which kind of sort is used.

Using index

Only the information in the index tree is used to retrieve the column values, without an additional lookup to read the actual row (the data is obtained from a covering secondary index). This strategy applies when the query uses only columns that are part of a single index.

Using temporary

To resolve the query, MySQL needs to create a temporary table to hold the results. This typically happens when the query contains GROUP BY and ORDER BY clauses that list columns differently, or when the result set selected by the WHERE condition is too large and a temporary table is needed to assist processing.

Using where

Indicates that for WHERE conditions on columns without a usable index, the MySQL server layer takes the result set returned by the storage engine (such as InnoDB) and filters it against those conditions.

Using index condition

Using index condition (index condition pushdown) means the condition is filtered inside the index first: all rows matching the index condition are found, and only then are the remaining conditions in the WHERE clause applied to those rows.

So in our case, the first query walks the single-column index idx_createTime: inside the index it filters for create_time between '2020-03-22 03:00:00.003' and '2020-04-22 03:00:00.003', then goes back to the table for the matching primary key ids. That is why Extra shows Using index condition for it.

InnoDB then returns this result set to the MySQL server, which filters it on the status field, keeping the rows with status 1. That is why Extra in the first query's EXPLAIN output also shows Using where.

The filtered column is the optimizer's estimate of the percentage of rows returned by the storage engine that survive the server-layer filter. The estimate here is 25.00 because status takes the values null, 1, 2, 3 and 4.

The difference with idx_createTime_status is that the status = 1 condition is filtered inside the index itself, so far fewer rows are scanned (about 27,000 rows versus 150,000).

InnoDB then hands the MySQL server the roughly 27,000 rows whose status is already 1, with no further filtering needed, which is why the second query is so much faster: about 23% of the time before optimization.

The EXPLAIN row estimate is around 310,000 for both queries because create_time is a range condition (not a single value), so even for idx_createTime_status the estimate reflects only the createTime part of the index, not the status filter.

```sql
SELECT count(*) FROM article a
WHERE a.post_time BETWEEN '2020-03-22 03:00:00.003' AND '2020-04-22 03:00:00.003'
  AND a.audit_status = 1
```

Divergent thinking: What happens if you change the union index (createTime, status) to (status, createTime)?

```sql
WHERE a.create_time BETWEEN '2020-03-22 03:00:00.003' AND '2020-04-22 03:00:00.003'
  AND a.status = 1
```

With (createTime, status), the range condition means the index can only use the createTime part to locate entries.

With (status, createTime), status is an equality condition, so both columns can be used to locate entries, and the number of rows scanned in the index is smaller.
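The effect of column order shows up directly in a query plan. Here is a minimal sketch in SQLite (via Python's sqlite3, standing in for MySQL, with a made-up miniature of article); `INDEXED BY` forces each index in turn so the two plans can be compared:

```python
import sqlite3

# Made-up miniature table with both candidate joint indexes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article (id INTEGER PRIMARY KEY, create_time TEXT, status INT)")
conn.execute("CREATE INDEX idx_ct_status ON article (create_time, status)")
conn.execute("CREATE INDEX idx_status_ct ON article (status, create_time)")

q = ("SELECT id FROM article INDEXED BY {} "
     "WHERE create_time BETWEEN '2020-03-22' AND '2020-04-22' AND status = 1")
plan_ct = conn.execute("EXPLAIN QUERY PLAN " + q.format("idx_ct_status")).fetchone()[-1]
plan_st = conn.execute("EXPLAIN QUERY PLAN " + q.format("idx_status_ct")).fetchone()[-1]
print(plan_ct)  # only the create_time range bounds the search
print(plan_st)  # status=? is bound too, narrowing the scan
```

With (create_time, status), only the create_time range appears as a search bound; with (status, create_time), the equality on status is bound as well, which is the "fewer rows scanned in the index" effect described above.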

However, in the (createTime, status) index the leaf entries (createTime, status, id) are sorted by createTime, so the primary key ids they yield are roughly in insertion order and the subsequent table lookups are closer to sequential I/O, which reads faster; as a result the total query time ends up basically the same.

Here are the results:

First create the joint index idx_status_createTime on (status, createTime):

```sql
SELECT a.id AS article_id, a.title AS title, a.author_id AS author_id
FROM article a FORCE INDEX(idx_status_createTime)
WHERE a.create_time BETWEEN '2020-03-22 03:00:00.003' AND '2020-04-22 03:00:00.003'
  AND a.status = 1
```

The query time is basically the same as with idx_createTime_status, about 0.21s.

Explain results comparison:

The number of scanned rows is indeed lower, because the idx_status_createTime index excludes all rows whose status is not 1 from the start.

Source: Wenyuan network; for learning purposes only. In case of infringement, contact for deletion.
