A group by optimization trip

Recently, I found an interesting problem in optimizing the company’s CRM reporting system. Once the query scope exceeds a certain period of time, the execution time of this SQL will be more than 10 times different from the original. Online numbers are in the millions. A desensitized table structure and stored procedures are given below for simulation. The simulated screenshots are all from my VIRTUAL machine (2 cores, 2 gb memory).

Build table SQL

CREATE TABLE `dt_school` (
  `id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'Increment primary key',
  `tea_reg` int(11) NOT NULL DEFAULT '0' COMMENT 'Teacher Enrolments',
  `stu_reg` int(11) NOT NULL DEFAULT '0' COMMENT 'Student Enrolment',
  `school_id` int(11) NOT NULL COMMENT 'school id',
  `time` int(11) NOT NULL DEFAULT '0' COMMENT 'Update time (in days)',
  PRIMARY KEY (`id`),
  KEY `key_school_id` (`school_id`),
  KEY `index_time_school` (`time`,`school_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Statistics - School Statistics'
Copy the code

Initialize data

delimiter  //
CREATE PROCEDURE proc_init_dt_school() 
BEGIN 
	DECLARE d INT;
	DECLARE sid INT ;
	SET d = 20160101;
	WHILE d < 20190501 DO
		SET sid = 1;
		WHILE sid < 1501 DO
			insert into dt_school(tea_reg,stu_reg,school_id,time) 
 value(floor(rand()*100),floor(rand()*100),sid,d);
			SET sid = sid + 1;
		END WHILE;
		SET d = DATE_FORMAT(date_add(d,INTERVAL 1 day),'%Y%m%d');
	END WHILE;
END //
delimiter ;
call proc_init_dt_school() ;
Copy the code

One more day of data, SQL time multiplied by n times?

The SQL on line looks something like this

select sum(tea_reg),sum(stu_reg) from dt_school 
where time between 20160101 and 20160411 group by school_id\G;
Copy the code

Execution Result:

Check one more day

select sum(tea_reg),sum(stu_reg) from dt_school where time between 20160101 and 20160412 group by school_id\G;
Copy the code

The execution result

JPG, all right. Compare the execution plan:

SQL > select school_id from school_id; SQL > select school_id from school_id; In this case I use force index to force index_time_school

select sum(tea_reg),sum(stu_reg) from dt_school force index(index_time_school)
where time between 20160101 and 20160412 group by school_id\G;
Copy the code

drop index key_school_id on dt_school
Copy the code

Execute the query again

select sum(tea_reg),sum(stu_reg) from dt_school 
where time between 20160101 and 20160412 group by school_id\G;
Copy the code

So, does an index necessarily speed up queries? An inappropriate index is worse than none.

The solution

We found that after key_school_id was deleted. MySQL still does not select the index_time_school index, although we can specify the index explicitly using force index(), but this does not solve the problem fundamentally. Because MySQL does not use this index, at least from the point of view of the MySQL optimizer, the use of this index does not have much optimization significance. We found that once we expanded the scope of the query, it became more and more time-consuming

select sum(tea_reg),sum(stu_reg) from dt_school 
force index(index_time_school) group by school_id\G;
Copy the code

At this time, I suddenly had an idea. Since this business is all about summation and aggregation of data, in fact, we can solve this problem by monthly summarizing of data:

Create dT_school_month exactly like dt_school
The periodic script periodically aggregates dT_school data and inserts dt_school_month into dT_school_month every month
Business layer split, if for a time range 20180121-20180402 query, Alter table 20180121-20180131, 20180401-20180402,20180201-20180331,20180201-20180331, 20180121-20180131, 20180401-20180402

The split SQL query becomes:

SELECT
  sum(tea_reg) as tea_reg,
  sum(stu_reg) as stu_reg,
  school_id
FROM (
  (SELECT
     sum(tea_reg) as tea_reg,
     sum(stu_reg) as stu_reg,
     school_id
   FROM dt_school
   WHERE time BETWEEN 20180121 AND 20180131
   GROUP BY  school_id)
  UNION ALL (SELECT
               sum(tea_reg) as tea_reg,
               sum(stu_reg) as stu_reg,
               school_id
             FROM dt_school_month
             WHERE time BETWEEN 20180201 AND 20180331
             GROUP BY  school_id)
  UNION ALL (SELECT
               sum(tea_reg) as tea_reg,
               sum(stu_reg) as stu_reg,
               school_id
             FROM dt_school
             WHERE time BETWEEN 20180401 AND 20180402
             GROUP BY  school_id))
GROUP BY  school_id;
Copy the code

Dt_school_month =20180228 and 20180331 Dt_school and dt_school_month can be indexed index_time_school, so speed up.

SQL_BIG_RESULT

Just after finishing the monthly table scheme, Meng Kang, the big guy in the technical exchange group, posted a blog post about a group by + Order by performance optimization analysis also mentioned a similar case, which mentioned a keyword SQL_BIG_RESULT. My two eyes shine, give full play to the spirit of a small white apprentice.

I saw a hammer, and I wanted to take it out and hammer it

select SQL_BIG_RESULT sum(tea_reg),sum(stu_reg) from dt_school  
where time between 20190101 and 20190412 group by school_id\G;
Copy the code

show variables like '%sort_buffer_size%';
Copy the code

SET GLOBAL sort_buffer_size = 1024*1024*2;
Copy the code

Why use SQL_BIG_RESULT to speed up queries? Take a look at the execution plan

help SQL_BIG_RESULT
Copy the code

SQL_BIG_RESULT or SQL_SMALL_RESULT can be used with GROUP BY or DISTINCT to tell the optimizer that the result set has many or few rows, respectively. With SQL_BIG_RESULT, MySQL directly uses disk-based temporary tables when creating and prioritises temporary tables with GROUP BY element keys. With SQL_SMALL_RESULT, MySQL uses temporary tables in memory to store the generated tables instead of using sort.

Use SQL_BIG_RESULT to sort and group. Do not use SQL_BIG_RESULT, group first and sort later.

Start with the Group by principle

The reason why SQL_BIG_RESULT can speed up queries is from the principle of group by. Here I’m going to borrow an image from Dinch’s column, along with amway’s MySQL45

The essence of group by is sort

for

select  sum(tea_reg) as t ,sum(stu_reg) as s from dt_school 
where time between 20190101 and 20190412 group by school_id\G;
Copy the code

SQL > create temporary table t, s, school_id; Select school_id, tea_reg, stu_reg, and time from school_id; 3. If the value of time is not in the obtained time range, discard it. Select school_id,tea_reg,stu_reg, school_id, stu_reg, school_id, stu_reg, school_id, stu_reg, school_id, stu_reg, school_id, stu_reg, school_id, stu_reg,stu_reg, school_id, stu_reg,stu_reg, school_id, stu_reg,stu_reg, school_id, stu_reg If the memory table is full when data is inserted into the memory table one by one, the memory table is full. Create a new innoDB engine disk temp table and move data to disk temp table. 5. Sort the school_id of the temporary table and return the result set to the client.

In fact, step 5 is not necessary. When using school_id to iterate, inserts into temporary tables are ordered by default. That’s why the optimizer chooses school_id because it always thinks sorting is time-consuming and doesn’t need to be done with school_id. So for group by A if the index is not a, all operations in mysql5.6 and below are sorted by default. If you don’t need to use order by NULL to speed up the query.

When the temporary table in memory is insufficient, we move the data from the temporary table in memory to the temporary table on disk. With SQL_BIG_RESULT, the optimizer uses disk temporary tables directly

for

select  SQL_BIG_RESULT sum(tea_reg) as t ,sum(stu_reg) as s from dt_school 
where time between 20190101 and 20190412 group by school_id\G;
Copy the code

The execution flow of this statement looks like this:

Select school_id, tea_reg, stu_reg, and time from school_id;
If the value of time is not in the obtained time range, discard it. Otherwise, insert sort_buffer into sort buffer. If sort_buffer is not enough, use disk temporary files to help sort
Sort school_ID from sort buffer
Scan the sort buffer data and return the aggregation result

This is why you can check the execution plan of SQL_BIG_RESULT statements, the value of Extra option, and do not use temporary tables, but need using filesort. Of course, we add school_id index, filesort sorting is not needed.

Increase the size of temporary tables in memory (temp_table_size) first, but do not explain why. So I asked my FELLOW DBA at the company, and he said this

Sort buffer is a session variable. If you set the sort buffer to 1M, it will take up a lot of memory if there are lots of connections sorting at the same time. The database is a whole and the limited resources need to be allocated evenly. The configuration cannot be adjusted because of a certain statement.

The above is the summary of group by in my work. Due to the author’s limited knowledge, the text is inevitably numerous omissions. Readers are welcome to exchange corrections.

Reference reading:

The SELECT Syntax – MySQL 5.6 Reference Manual dev.mysql.com/doc/refman/… Dev.mysql.com/doc/refman/…
Using where using index in extra
A Group by + Order BY Performance Optimization Analysis
When will internal temporary tables be used – Dinky
MySQL optimization — Lao Ye Teahouse

One more day of data, SQL time multiplied by n times?

The solution

SQL_BIG_RESULT

Start with the Group by principle

Reference reading:

Related Posts

How to use Redis to match people and cars?

Taobao commodity reptilian combat notes

Use JDBCTemplate in detail