Welcome to follow github.com/hsfxuebao. I hope it helps you; if you find it useful, please give the repo a Star.

What dimensions are available for database tuning? In short:

  • Index failure or underutilized indexes: create or adjust indexes
  • Too many JOINs in associated queries (a design flaw or an unavoidable requirement): SQL optimization
  • Server tuning and parameter settings (buffers, thread counts, etc.): adjust my.cnf
  • Too much data: split databases and tables (sharding)

The knowledge of database tuning is very fragmented: different DBMSs, different companies, different positions, and different projects all encounter different problems. Here we divide the topic into three chapters for detailed explanation.

Although there are many techniques for SQL query optimization, they can be broadly divided into physical query optimization and logical query optimization.

  • Physical query optimization works through indexes, table join methods, and similar techniques; the key skill to master here is the use of indexes.
  • Logical query optimization improves query efficiency through SQL equivalent transformations; simply put, a different way of writing the same query may execute more efficiently.

1. Data preparation

We insert 500 thousand rows into the student table and 10 thousand rows into the class table.

Step 1: Create a table

# class table
CREATE TABLE `class` (
    `id` INT(11) NOT NULL AUTO_INCREMENT,
    `className` VARCHAR(30) DEFAULT NULL,
    `address` VARCHAR(40) DEFAULT NULL,
    `monitor` INT NULL,
    PRIMARY KEY (`id`)
) ENGINE=INNODB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

# student table
CREATE TABLE `student` (
    `id` INT(11) NOT NULL AUTO_INCREMENT,
    `stuno` INT NOT NULL,
    `name` VARCHAR(20) DEFAULT NULL,
    `age` INT(3) DEFAULT NULL,
    `classId` INT(11) DEFAULT NULL,
    PRIMARY KEY (`id`)
    #CONSTRAINT `fk_class_id` FOREIGN KEY (`classId`) REFERENCES `t_class` (`id`)
) ENGINE=INNODB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

Step 2: Set parameters

  • Enable the setting that allows creating functions:
set global log_bin_trust_function_creators=1;  # without GLOBAL, the setting is valid only for the current session

Step 3: Create the function

This function ensures that every generated record is different.

DELIMITER //
CREATE FUNCTION rand_string(n INT) RETURNS VARCHAR(255)
BEGIN
    DECLARE chars_str VARCHAR(100) DEFAULT 'abcdefghijklmnopqrstuvwxyzABCDEFJHIJKLMNOPQRSTUVWXYZ';
    DECLARE return_str VARCHAR(255) DEFAULT '';
    DECLARE i INT DEFAULT 0;
    WHILE i < n DO
        SET return_str = CONCAT(return_str, SUBSTRING(chars_str, FLOOR(1 + RAND() * 52), 1));
        SET i = i + 1;
    END WHILE;
    RETURN return_str;
END //
DELIMITER ;
# drop function rand_string;

Randomly generate class numbers

DELIMITER //
CREATE FUNCTION rand_num(from_num INT, to_num INT) RETURNS INT(11)
BEGIN
    DECLARE i INT DEFAULT 0;
    SET i = FLOOR(from_num + RAND() * (to_num - from_num + 1));
    RETURN i;
END //
DELIMITER ;
# drop function rand_num;

Step 4: Create the stored procedure

DELIMITER //
CREATE PROCEDURE insert_stu(START INT, max_num INT)
BEGIN
    DECLARE i INT DEFAULT 0;
    SET autocommit = 0;    # disable autocommit
    REPEAT                 # loop
        SET i = i + 1;     # assignment
        INSERT INTO student (stuno, name, age, classId)
        VALUES ((START + i), rand_string(6), rand_num(1, 50), rand_num(1, 1000));
    UNTIL i = max_num END REPEAT;
    COMMIT;                # commit the transaction
END //
DELIMITER ;
# drop PROCEDURE insert_stu;

Create a stored procedure that inserts data into the class table

DELIMITER //
CREATE PROCEDURE `insert_class`(max_num INT)
BEGIN
    DECLARE i INT DEFAULT 0;
    SET autocommit = 0;
    REPEAT
        SET i = i + 1;
        INSERT INTO class (classname, address, monitor)
        VALUES (rand_string(8), rand_string(10), rand_num(1, 100000));
    UNTIL i = max_num END REPEAT;
    COMMIT;
END //
DELIMITER ;
# drop PROCEDURE insert_class;

Step 5: Call the stored procedures. First populate the class table:

CALL insert_class(10000);

Then the student table:

CALL insert_stu(100000, 500000);

Step 6: Create a stored procedure that drops all secondary indexes on a table

DELIMITER //
CREATE PROCEDURE `proc_drop_index`(dbname VARCHAR(200), tablename VARCHAR(200))
BEGIN
    DECLARE done INT DEFAULT 0;
    DECLARE ct INT DEFAULT 0;
    DECLARE _index VARCHAR(200) DEFAULT '';
    DECLARE _cur CURSOR FOR
        SELECT index_name FROM information_schema.STATISTICS
        WHERE table_schema = dbname AND table_name = tablename
          AND seq_in_index = 1 AND index_name <> 'PRIMARY';
    # each cursor needs a handler to control its end: when no more rows are
    # found, execution continues with the variable done set to 2
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 2;
    OPEN _cur;
    FETCH _cur INTO _index;
    WHILE _index <> '' DO
        SET @str = CONCAT("drop index ", _index, " on ", tablename);
        PREPARE sql_str FROM @str;
        EXECUTE sql_str;
        DEALLOCATE PREPARE sql_str;
        SET _index = '';
        FETCH _cur INTO _index;
    END WHILE;
    CLOSE _cur;
END //
DELIMITER ;

Execute the stored procedure:

CALL proc_drop_index("dbname","tablename");

2. Index failure case

One of the most effective ways to improve performance in MySQL is to design proper indexes for data tables. Indexes have a critical impact on query speed because they provide efficient access to data and speed up queries.

  • An index can quickly locate a record in a table, improving query speed and overall database performance.
  • If a query does not use an index, it scans every record in the table; with a large data volume, this is slow.

B+ trees are used (by default) to build indexes in most cases. However, indexes for spatial column types use R-trees, and MEMORY tables also support hash indexes.

Ultimately, the optimizer decides whether to use an index. What does the optimizer base this on? Cost: it is a cost-based optimizer, not rule-based or semantics-based, and it simply picks whatever plan is cheapest. In addition, whether a SQL statement uses an index depends on the database version, the data volume, and the selectivity of the data.

2.1 Full-value matching is my favorite

SQL statements commonly used in the system are as follows:

EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age=30;
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age=30 AND classId=4;
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age=30 AND classId=4 AND name = 'abcd';

Create indexes:

CREATE INDEX idx_age ON student(age);
CREATE INDEX idx_age_classid ON student(age, classId);
CREATE INDEX idx_age_classid_name ON student(age, classId, name);

After creating an index, execute:

mysql> SELECT SQL_NO_CACHE * FROM student WHERE age=30 AND classId=4 AND name = 'abcd';
Empty set, 1 warning (0.01 sec)

As can be seen, the query time before index creation is 0.28 seconds, and the query time after index creation is 0.01 seconds. Indexes greatly improve the query efficiency.

2.2 Optimal left prefix rule

When using a composite index, MySQL follows the optimal leftmost-prefix matching principle: matching starts from the leftmost column of the index and proceeds rightward.

Example 1:

 EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE student.age=30 AND student.name = 'abcd';

Example 2:

 EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE student.classId=1 AND student.name = 'abcd';

Example 3: Can idx_age_classid_name still be used properly?

 EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE classId=4 AND student.age=30 AND student.name = 'abcd';

If you index multiple columns, follow the leftmost-prefix rule: the query must start from the leftmost column of the index and must not skip columns in the index.

mysql> EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE student.age=30 AND student.name = 'abcd';

The index works here, but only partially: only the age column of idx_age_classid_name is used.

mysql> EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE student.classId=1 AND student.name = 'abcd';

Here the index is not used at all, because the leading column age is missing.

Conclusion: MySQL can create indexes on multiple fields; one index can contain up to 16 fields. For a multi-column index, the filter conditions must follow the order in which the index was created: once a field is skipped, the fields after it in the index cannot be used. If the first field of such an index does not appear in the query conditions, the multi-column (composite) index is not used at all.

The Alibaba Java Development Manual notes: index files have the leftmost-prefix matching characteristic of B-trees; if the value of the leftmost column is not determined, the index cannot be used.

2.3 Insertion Sequence of primary Keys

For a table using the InnoDB storage engine, even when we do not explicitly create an index, the table's data is actually stored in the leaf nodes of the clustered index. Records are stored in data pages, and both the data pages and the records within them are ordered by primary key value from small to large. If the primary key values of the records we insert increase monotonically, then each time a data page fills up we simply move on to the next page. But if the primary key values jump up and down, problems arise. Suppose a data page is already full of records whose primary keys lie between 1 and 100: if a record with primary key value 9 now arrives, it must be inserted into that full page.

But this data page is already full, so how can another record be inserted? The current page must be split into two pages, and some of its records moved to the newly created page. What do page splits and record moves mean? Performance loss! To avoid this unnecessary cost, it is best to make the primary key values of inserted records increase monotonically. So instead of inserting primary keys manually, let the table use AUTO_INCREMENT as the primary key, as in the person_info table below.

CREATE TABLE person_info(
	 id INT UNSIGNED NOT NULL AUTO_INCREMENT,
	 name VARCHAR(100) NOT NULL,
	 birthday DATE NOT NULL,
	 phone_number CHAR(11) NOT NULL,
	 country varchar(100) NOT NULL,
	  PRIMARY KEY (id),
	  KEY idx_name_birthday_phone_number (name(10), birthday, phone_number)
);  
Copy the code

Our custom primary key column id has the AUTO_INCREMENT attribute, so the storage engine automatically fills in monotonically increasing primary key values when records are inserted. Such primary keys take up less space, are written sequentially, and reduce page splits.
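For instance, a minimal sketch of inserting into person_info without specifying the id (the values here are made up): the storage engine assigns the next auto-increment value, so each new record is appended at the end of the clustered index and no page split is triggered.

# id is filled in automatically by AUTO_INCREMENT
INSERT INTO person_info (name, birthday, phone_number, country)
VALUES ('Tom', '1990-01-01', '13800000000', 'China');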

2.4 Index invalidation due to calculation, function, type conversion (automatic or manual)

  1. Which of the following two SQL statements is written better?
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE student.name LIKE 'abc%';

EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE LEFT(student.name,3) = 'abc';
  2. Create an index
CREATE INDEX idx_name ON student(NAME);
  3. First statement: index optimization takes effect
mysql> EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE student.name LIKE 'abc%';

mysql> SELECT SQL_NO_CACHE * FROM student WHERE student.name LIKE 'abc%';
+---------+---------+--------+------+---------+
| id      | stuno   | name   | age  | classId |
+---------+---------+--------+------+---------+
| 5301379 | 1233401 | AbCHEa |  164 |     259 |
| 7170042 | 3102064 | ABcHeB |  199 |     161 |
| 1901614 | 1833636 | ABcHeC |  226 |     275 |
| 5195021 | 1127043 | abchEC |  486 |      72 |
| 4047089 | 3810031 | AbCHFd |  268 |     210 |
| 4917074 |  849096 | ABcHfD |  264 |     442 |
| 1540859 |  141979 | abchFF |  119 |     140 |
| 5121801 | 1053823 | AbCHFg |  412 |     327 |
| 2441254 | 2373276 | abchFJ |  170 |     362 |
| 7039146 | 2971168 | ABcHgI |  502 |     465 |
| 1636826 | 1580286 | ABcHgK |   71 |     262 |
|  374344 |  474345 | abchHL |  367 |     212 |
| 1596534 |  169191 | AbCHHl |  102 |     146 |
...
| 5266837 | 1198859 | abclXe |  292 |     298 |
| 8126968 | 4058990 | aBClxE |  316 |     150 |
| 4298305 |  399962 | AbCLXF |   72 |     423 |
| 5813628 | 1745650 | aBClxF |  356 |     323 |
| 6980448 | 2912470 | AbCLXF |  107 |      78 |
| 7881979 | 3814001 | AbCLXF |   89 |     497 |
| 4955576 |  887598 | ABcLxg |  121 |     385 |
| 3653460 | 3585482 | AbCLXJ |  130 |     174 |
| 1231990 | 1283439 | AbCLYH |  189 |     429 |
| 6110615 | 2042637 | ABcLyh |  157 |      40 |
+---------+---------+--------+------+---------+
401 rows in set, 1 warning (0.01 sec)

Here type is range: the index column is used, and the query takes 0.01 seconds.

  4. Second statement: index optimization fails
mysql> EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE LEFT(student.name,3) = 'abc';

mysql> SELECT SQL_NO_CACHE * FROM student WHERE LEFT(student.name,3) = 'abc';
+---------+---------+--------+------+---------+
| id      | stuno   | name   | age  | classId |
+---------+---------+--------+------+---------+
| 5301379 | 1233401 | AbCHEa |  164 |     259 |
| 7170042 | 3102064 | ABcHeB |  199 |     161 |
| 1901614 | 1833636 | ABcHeC |  226 |     275 |
| 5195021 | 1127043 | abchEC |  486 |      72 |
| 4047089 | 3810031 | AbCHFd |  268 |     210 |
| 4917074 |  849096 | ABcHfD |  264 |     442 |
| 1540859 |  141979 | abchFF |  119 |     140 |
| 5121801 | 1053823 | AbCHFg |  412 |     327 |
| 2441254 | 2373276 | abchFJ |  170 |     362 |
| 7039146 | 2971168 | ABcHgI |  502 |     465 |
| 1636826 | 1580286 | ABcHgK |   71 |     262 |
|  374344 |  474345 | abchHL |  367 |     212 |
| 1596534 |  169191 | AbCHHl |  102 |     146 |
...
| 5266837 | 1198859 | abclXe |  292 |     298 |
| 8126968 | 4058990 | aBClxE |  316 |     150 |
| 4298305 |  399962 | AbCLXF |   72 |     423 |
| 5813628 | 1745650 | aBClxF |  356 |     323 |
| 6980448 | 2912470 | AbCLXF |  107 |      78 |
| 7881979 | 3814001 | AbCLXF |   89 |     497 |
| 4955576 |  887598 | ABcLxg |  121 |     385 |
| 3653460 | 3585482 | AbCLXJ |  130 |     174 |
| 1231990 | 1283439 | AbCLYH |  189 |     429 |
| 6110615 | 2042637 | ABcLyh |  157 |      40 |
+---------+---------+--------+------+---------+
401 rows in set, 1 warning (3.62 sec)

Here type is ALL: the index is not used, and the query takes 3.62 seconds. Query efficiency is far lower than before.

For example:

  • The student table has an index on the stuno field, but a calculation on the column defeats it:
CREATE INDEX idx_sno ON student(stuno);
EXPLAIN SELECT SQL_NO_CACHE id, stuno, NAME FROM student WHERE stuno+1 = 900001;

Index optimization takes effect:

EXPLAIN SELECT SQL_NO_CACHE id, stuno, NAME FROM student WHERE stuno = 900000;

For example:

  • Set an index on the name field of the student table; applying a function to the column makes it invalid:

CREATE INDEX idx_name ON student(NAME);
# the index fails: a function is applied to the indexed column
EXPLAIN SELECT id, stuno, name FROM student WHERE SUBSTRING(name, 1, 3) = 'abc';

# the index works: rewrite the condition with LIKE instead of SUBSTRING
EXPLAIN SELECT id, stuno, NAME FROM student WHERE NAME LIKE 'abc%';

2.5 Index Failure Due to Type Conversion

Which of the following SQL statements can use an index? (Assuming an index is set on the name field)

EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE name=123;

The following uses the index:

EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE name='123';

With name=123, an implicit type conversion occurs and the index fails.

Conclusion: when designing entity class attributes, make sure they correspond to the database field types. Otherwise, type conversions will occur and indexes will fail.

2.6 Range Condition The index of the column to the right is invalid

  1. Suppose SQL like the following often appears in the system:
ALTER TABLE student DROP INDEX idx_name;
ALTER TABLE student DROP INDEX idx_age;
ALTER TABLE student DROP INDEX idx_age_classid;

EXPLAIN SELECT SQL_NO_CACHE * FROM student
WHERE student.age=30 AND student.classId>20 AND student.name = 'abc' ;

Here student.classId > 20 is a range condition, so the condition student.name = 'abc' to its right cannot use the index.

  2. Can the index idx_age_classid_name still work fully?
  • No. Columns to the right of the range column cannot use the index. Range operators include <, <=, >, >=, BETWEEN, and so on.
  • If this type of SQL is frequent, you should create instead:
create index idx_age_name_classid on student(age,name,classid);

Place range query criteria at the end of statement:

EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE student.age=30 AND student.name = 'abc' AND student.classId>20 ;

Simply swapping the positions of the conditions in the SQL statement does not help; you need to change the column order of the composite index.

In application development, range queries such as amount ranges and date ranges are common. Place such range conditions at the end of the WHERE clause, and place the corresponding column at the end of the composite index.

2.7 Not equal (!= or <>) invalidates the index

  • Create an index for name
create index idx_name on student(name);
  • Check whether the index is invalid
explain select SQL_NO_CACHE * from student where student.name <> 'abc';

or

explain select SQL_NO_CACHE * from student where student.name != 'abc';

When a SQL statement contains != or <>, the index becomes invalid. Try to rewrite the condition as an equality, or use a covering index, as sketched below.
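A minimal sketch of the covering-index workaround, assuming the idx_name index from above still exists; because only indexed columns are selected, the optimizer can satisfy the query with an index scan instead of a full table scan:

# only `name` (plus the implicit primary key) is read, so idx_name covers the
# query; EXPLAIN typically shows type=range or index with "Using index"
EXPLAIN SELECT name FROM student WHERE student.name <> 'abc';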

2.8 IS NULL can use an index, but IS NOT NULL cannot

  • IS NULL can use indexes
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age IS NULL;

  • IS NOT NULL cannot use the index
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age IS NOT NULL;

Conclusion: when designing tables, it is best to declare columns NOT NULL. Similarly, using NOT LIKE in a query cannot use the index and results in a full table scan. A sketch of the table-design advice follows.
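A sketch of that table-design advice, assuming (hypothetically) that a default of 0 is acceptable for age:

ALTER TABLE student MODIFY age INT(3) NOT NULL DEFAULT 0;
# with the column declared NOT NULL, IS NOT NULL checks become unnecessary and
# can be replaced by ordinary conditions that are able to use the index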

2.9 LIKE starting with the wildcard % invalidates the index

  • Use index
explain select SQL_NO_CACHE * FROM student where name like 'ab%';

  • No index is used
explain select SQL_NO_CACHE * FROM student where name like '%ab%';

Extension: the Alibaba Java Development Manual [mandatory] strictly prohibits left-fuzzy or full-fuzzy matching in page searches; if needed, use a search engine instead.
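When a leading wildcard really is unavoidable, one common mitigation is a covering index; a sketch using the idx_name index from earlier:

# only columns contained in the index are selected (id is the primary key,
# name is indexed), so a full index scan can replace the full table scan;
# EXPLAIN typically shows type=index with "Using where; Using index"
EXPLAIN SELECT id, name FROM student WHERE name LIKE '%ab%';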

2.10 An unindexed column before or after OR invalidates the index

In the WHERE clause, if the condition column before OR is indexed but the condition column after OR is not, the index becomes invalid.

Indexing only one of the condition columns makes no sense: as long as one condition column has no index, a full table scan is required, so the indexes on the other condition columns are also invalidated.

EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age = 10 OR classId = 100;

Because the classId field has no index, the query above does not use any index.

# both age and name are indexed, so the query can use them (index merge)
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age = 10 OR name = 'Abel';

2.11 Use utf8mb4 for database and table character sets

Uniformly use utf8mb4 (supported in 5.5.3 and later), which has better compatibility. A unified character set avoids garbled text caused by character set conversion; comparing values of different character sets requires conversion first, which invalidates indexes.
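A sketch of checking and converting the character set (the table name comes from this article's examples):

# check the server's current character set settings
SHOW VARIABLES LIKE 'character_set_%';
# convert an existing table (and its columns) to utf8mb4
ALTER TABLE student CONVERT TO CHARACTER SET utf8mb4;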

2.12 Exercises and general recommendations

Assume a composite index on (a, b, c).
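For example, given a composite index on (a, b, c), typical cases play out as follows; this is an illustrative sketch, so verify against your own data with EXPLAIN:

WHERE a = 3                              # uses a
WHERE a = 3 AND b = 5                    # uses a, b
WHERE a = 3 AND b = 5 AND c = 4          # uses a, b, c
WHERE b = 3                              # no index: the leading column a is skipped
WHERE a = 3 AND c = 5                    # uses a only; c is skipped because b is missing
WHERE a = 3 AND b > 4 AND c = 5          # uses a, b; c cannot be used after the range on b
WHERE a = 3 AND b LIKE 'kk%' AND c = 4   # uses a, b, c ('kk%' is a right-open prefix)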

General recommendations:

  • For single-column indexes, try to select the index with the best filtering power for the current query.
  • When choosing a composite index, the better a field's filtering power in the current query, the further left it should appear in the index column order.
  • When choosing a composite index, try to pick an index that contains as many of the fields from the current query's WHERE clause as possible.
  • When choosing a composite index, if a field is likely to appear in a range condition, try to place it at the end of the index column order.

In general, try to avoid index invalidation when writing SQL statements.

3. Associated query optimization

3.1 Data Preparation

# type table
CREATE TABLE IF NOT EXISTS `type` (
    `id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
    `card` INT(10) UNSIGNED NOT NULL,
    PRIMARY KEY (`id`)
);

# book table
CREATE TABLE IF NOT EXISTS `book` (
    `bookid` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
    `card` INT(10) UNSIGNED NOT NULL,
    PRIMARY KEY (`bookid`)
);

# insert 20 random rows into each table (each statement is executed 20 times)
INSERT INTO `type`(card) VALUES(FLOOR(1 + (RAND() * 20)));
INSERT INTO book(card) VALUES(FLOOR(1 + (RAND() * 20)));

3.2 Using left outer joins

Let’s start with EXPLAIN analysis:

EXPLAIN SELECT SQL_NO_CACHE * FROM `type` LEFT JOIN book ON type.card = book.card;

Conclusion: type is ALL.

Add index optimization

ALTER TABLE book ADD INDEX Y (card);
EXPLAIN SELECT SQL_NO_CACHE * FROM `type` LEFT JOIN book ON type.card = book.card;

You can see that in the second row type becomes ref and rows drops noticeably; the optimization is obvious. This is determined by the characteristics of the left join: the LEFT JOIN condition determines how to search for rows from the right table, while rows from the left table are always present. So the right table is our key point, and it definitely needs an index.

ALTER TABLE `type` ADD INDEX X (card);
EXPLAIN SELECT SQL_NO_CACHE * FROM `type` LEFT JOIN book ON type.card = book.card;

Then:

DROP INDEX Y ON book;
EXPLAIN SELECT SQL_NO_CACHE * FROM `type` LEFT JOIN book ON type.card = book.card;

3.3 Using inner joins

drop index X on type;
drop index Y on book;
# (skip if already dropped)

For an inner join, MySQL chooses the driver table itself:

EXPLAIN  SELECT SQL_NO_CACHE * FROM type INNER JOIN book ON type.card=book.card; 

Add index optimization

ALTER  TABLE book ADD INDEX Y ( card);

EXPLAIN  SELECT SQL_NO_CACHE * FROM type INNER JOIN book ON type.card=book.card;

ALTER  TABLE type ADD INDEX X (card);
 
EXPLAIN  SELECT SQL_NO_CACHE * FROM type INNER JOIN book ON type.card=book.card;

And then:

DROP INDEX X ON `type`;

EXPLAIN  SELECT SQL_NO_CACHE * FROM TYPE INNER JOIN book ON type.card=book.card;

And then:

ALTER  TABLE `type` ADD INDEX X (card); 

EXPLAIN  SELECT SQL_NO_CACHE * FROM  `type` INNER JOIN book ON type.card=book.card;

And then:

# add 20 more rows to book (the INSERT statement is executed 20 times)
INSERT INTO `book`(card) VALUES(FLOOR(1 + (RAND() * 20)));

# restore the index on book
ALTER TABLE book ADD INDEX Y (card);

EXPLAIN SELECT SQL_NO_CACHE * FROM `type` INNER JOIN book ON type.card = book.card;

We find that MySQL chooses type as the driver table, because the book table now has more data than the type table: the optimizer lets the smaller result set drive the larger one.

3.4 Join overview

3.4.1 The nature of joins

To get the story going, let’s create two simple tables and fill them with a bit of data:

mysql> CREATE TABLE t1 (m1 int, n1 char(1));
Query OK, 0 rows affected (0.02 sec)

mysql> CREATE TABLE t2 (m2 int, n2 char(1));
Query OK, 0 rows affected (0.02 sec)

mysql> INSERT INTO t1 VALUES(1, 'a'), (2, 'b'), (3, 'c');
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> INSERT INTO t2 VALUES(2, 'b'), (3, 'c'), (4, 'd');
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

Table t1 has columns m1 and n1 of types INT and CHAR(1); table t2's columns m2 and n2 are exactly the same:

mysql> SELECT * FROM t1;
+------+------+
| m1   | n1   |
+------+------+
|    1 | a    |
|    2 | b    |
|    3 | c    |
+------+------+
3 rows in set (0.00 sec)

mysql> SELECT * FROM t2;
+------+------+
| m2   | n2   |
+------+------+
|    2 | b    |
|    3 | c    |
|    4 | d    |
+------+------+
3 rows in set (0.00 sec)

The essence of joining is to take records from each join table and add the matching combination to the result set and return it to the user. Therefore, the process of joining t1 and T2 tables is shown in the figure below:

This process appears to concatenate the records of t1 with the records of t2 into larger records, so the query is called a join query. The result set of a join query contains every combination of a record from one table with a record from the other; this is called the Cartesian product. Since t1 has 3 records and t2 has 3 records, the Cartesian product of the two tables has 3 x 3 = 9 rows. In MySQL, joining is easy: just put the tables to be joined after the FROM clause, separated by commas:

mysql> SELECT * FROM t1, t2;
+------+------+------+------+
| m1   | n1   | m2   | n2   |
+------+------+------+------+
|    1 | a    |    2 | b    |
|    2 | b    |    2 | b    |
|    3 | c    |    2 | b    |
|    1 | a    |    3 | c    |
|    2 | b    |    3 | c    |
|    3 | c    |    3 | c    |
|    1 | a    |    4 | d    |
|    2 | b    |    4 | d    |
|    3 | c    |    4 | d    |
+------+------+------+------+
9 rows in set (0.00 sec)

3.4.2 Overview of the join process

We can join as many tables as we like, but the Cartesian product from joining these tables can be huge if there are no constraints. For example, three 100-row tables joined together produce a Cartesian product of 100×100×100= 1,000,000 rows! Therefore, it is necessary to filter out specific record combinations when joining. The filtering criteria in join queries can be divided into two types:

  • Conditions involving single tables

    For example, t1.m1 > 1 is a filter condition for T1 only, and T2.n2 < ‘d’ is a filter condition for T2 only.

  • Conditions involving two tables

    This filter condition we have not seen before, such as T1.m1 = T2.m2, T1.n1 > t2.n2, these conditions involve two tables, we will analyze how this filter condition is used later.

Let’s take a look at the general execution of a join query with filter criteria, such as the following query:

SELECT * FROM t1, t2 WHERE t1.m1 > 1 AND t1.m1 = t2.m2 AND t2.n2 < 'd';

In this query we specify the three filter criteria:

  • t1.m1 > 1
  • t1.m1 = t2.m2
  • t2.n2 < 'd'

The general execution process of the join query is as follows:

  1. First determine the first table to be queried, which is called the driver table. We explained how to execute a query in a single table in the previous chapter. Just select the least expensive method to execute a single table query (i.e., select the least expensive method from const, ref, ref_or_NULL, range, index, and all). Here, it is assumed that T1 is used as the driver table, so it is necessary to find records in T1 that meet t1.m1 > 1. Since there is too little data in the table and we have not established a secondary index on the table, the access method for querying T1 is set as ALL, that is, to perform single-table query in the way of full table scan. We’ll talk about how to improve the performance of join queries later, but let’s get the basics straight. Therefore, the query process is as follows:

    We can see that there are two records in the T1 table that match T1.m1 > 1.

  2. For each record in the result set generated from the driver table in the previous step, it is necessary to look for the matched record in the T2 table. The so-called matched record refers to the record that meets the filtering conditions. The t2 table can also be called a driven table because it is based on the records in t1 to find the records in T2. The previous step obtained two records from the driver table, so you need to query the T2 table twice. Here the filter condition t1.m1 = t2.m2 for columns involving two tables comes in handy:

    • When t1.m1 = 2, the condition t1.m1 = t2.m2 is equivalent to t2.m2 = 2, so at this point t2 effectively has the two filter conditions t2.m2 = 2 and t2.n2 < 'd'.

    • When t1.m1 = 3, the condition is equivalent to t2.m2 = 3, so t2 effectively has the two filter conditions t2.m2 = 3 and t2.n2 < 'd'.

      Therefore, the execution process of the whole join query is as follows:

      In other words, the result of the whole join query is only two records that meet the filtering criteria:

         +------+------+------+------+
        | m1   | n1   | m2   | n2   |
        +------+------+------+------+
        |    2 | b    |    2 | b    |
        |    3 | c    |    3 | c    |
        +------+------+------+------+

From the two steps above, we can see that this two-table join query required 1 query of table t1 and 2 queries of table t2. This is of course under the given filter conditions: if we removed the condition t1.m1 > 1, we would get 3 records from t1 and would need to query t2 3 times. In other words, in a two-table join the driver table is accessed only once, while the driven table may be accessed multiple times.

3.4.3 Inner joins and outer joins

To help you understand, let's create two more realistic tables:

CREATE TABLE student (
    number INT NOT NULL AUTO_INCREMENT COMMENT 'student number',
    name VARCHAR(5) COMMENT 'name',
    major VARCHAR(30) COMMENT 'major',
    PRIMARY KEY (number)
) Engine=InnoDB CHARSET=utf8 COMMENT 'student info table';

CREATE TABLE score (
    number INT COMMENT 'student number',
    subject VARCHAR(30) COMMENT 'subject',
    score TINYINT COMMENT 'score',
    PRIMARY KEY (number, subject)
) Engine=InnoDB CHARSET=utf8 COMMENT 'student score table';

We created a student information table and a student score table, then inserted some data into both (the insertion process is omitted to save space). After insertion, the data looks like this:

mysql> SELECT * FROM student;
+----------+---------------+----------------------------------+
| number   | name          | major                            |
+----------+---------------+----------------------------------+
| 20180101 | Du Ziteng     | Software College                 |
| 20180102 | Fan Tong      | Computer Science and Engineering |
| 20180103 | Shi Zhenxiang | Computer Science and Engineering |
+----------+---------------+----------------------------------+
3 rows in set (0.00 sec)

mysql> SELECT * FROM score;
+----------+------------------------------+-------+
| number   | subject                      | score |
+----------+------------------------------+-------+
| 20180101 | Postpartum care of sows      |    78 |
| 20180101 | On Saddam's war preparedness |    88 |
| 20180102 | On Saddam's war preparedness |    98 |
| 20180102 | Postpartum care of sows      |   100 |
+----------+------------------------------+-------+
4 rows in set (0.00 sec)

Now, if we want to query each student's exam scores, we need to join the two tables (the score table has no name information, so we cannot query it alone). The join condition is that the number in student equals the number in score, written as follows:

mysql> SELECT * FROM student, score WHERE student.number = score.number;
+----------+-----------+----------------------------------+----------+------------------------------+-------+
| number   | name      | major                            | number   | subject                      | score |
+----------+-----------+----------------------------------+----------+------------------------------+-------+
| 20180101 | Du Ziteng | Software College                 | 20180101 | Postpartum care of sows      |    78 |
| 20180101 | Du Ziteng | Software College                 | 20180101 | On Saddam's war preparedness |    88 |
| 20180102 | Fan Tong  | Computer Science and Engineering | 20180102 | On Saddam's war preparedness |    98 |
| 20180102 | Fan Tong  | Computer Science and Engineering | 20180102 | Postpartum care of sows      |   100 |
+----------+-----------+----------------------------------+----------+------------------------------+-------+
4 rows in set (0.00 sec)

If we only need a few fields:

mysql> SELECT s1.number, s1.name, s2.subject, s2.score FROM student AS s1, score AS s2 WHERE s1.number = s2.number;
+----------+-----------+------------------------------+-------+
| number   | name      | subject                      | score |
+----------+-----------+------------------------------+-------+
| 20180101 | Du Ziteng | Postpartum care of sows      |    78 |
| 20180101 | Du Ziteng | On Saddam's war preparedness |    88 |
| 20180102 | Fan Tong  | On Saddam's war preparedness |    98 |
| 20180102 | Fan Tong  | Postpartum care of sows      |   100 |
+----------+-----------+------------------------------+-------+
4 rows in set (0.00 sec)

From the query results we can see that each student's scores have been found. But there is a problem: Shi Zhenxiang, student number 20180103, did not take the exams for some reason, so the score table has no corresponding records for her. If the teacher wants to see every student's exam scores, including students who missed the exam, the join queries introduced so far cannot meet the requirement. The essence of the requirement is that a record from the driver table must be added to the result set even if it has no matching record in the driven table. To solve this, the concepts of inner join and outer join were introduced:

  • For inner joins, a record from the driver table that cannot find a matching record in the driven table is not added to the final result set. All the joins we have seen so far are inner joins.

  • For two externally joined tables, records in the driver table need to be added to the result set even if there are no matching records in the driven table.

    In MySQL, external connections can still be subdivided into two types according to the selected driver table:

    • The left outer join

      Select the table on the left as the driver table.

    • The right outer join

      Select the table on the right as the driver table.

There is still a problem, though: even with outer joins, sometimes we do not want all the records of the driver table added to the final result set. Sometimes a failed match should still join the result set, and sometimes it should not. How do we control this? The problem is solved by splitting the filter conditions into two kinds, which carry different semantics depending on where they are placed:

  • Filter criteria in the WHERE clause

    The filter conditions in the WHERE clause are the same as those we usually see. Records that do not meet the filter conditions in the WHERE clause are not added to the final result set, whether it is an inner join or an outer join.

  • Filter criteria in the ON clause

    If no record matching the filter condition in the ON clause is found in the driven table, the record is still added to the result set, and the fields of the corresponding driven table record are filled with NULL values.

    Note that the ON clause exists specifically to specify, for outer joins, whether a driver-table record should be added to the result set when no matching record is found in the driven table. If an ON clause is put into an inner join, MySQL treats it like a WHERE clause: in an inner join, the WHERE clause and the ON clause are equivalent.

    In general, filter conditions involving only one table are put in the WHERE clause, and filter conditions involving both tables are put in the ON clause; filter conditions in the ON clause are also called join conditions. The sketch after this list illustrates the difference.
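A minimal sketch of that difference, using the t1 and t2 tables from earlier:

# every t1 row is kept; where the ON condition fails, t2's columns are NULL
SELECT * FROM t1 LEFT JOIN t2 ON t1.m1 = t2.m2 AND t2.n2 < 'd';
# the WHERE filter runs after the join and discards the NULL-extended rows too,
# so this behaves like an inner join
SELECT * FROM t1 LEFT JOIN t2 ON t1.m1 = t2.m2 WHERE t2.n2 < 'd';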

Note: 'left outer join' and 'right outer join' are usually shortened to 'left join' and 'right join', so in what follows the word 'outer' appears in brackets to indicate that it is optional.

3.4.3.1 Syntax for left (outer) joins

For example, a left outer join of t1 and t2 can be written like this:

SELECT * FROM t1 LEFT [OUTER] JOIN t2 ON join_conditions [WHERE normal_filter_conditions];

The word OUTER in brackets can be omitted. For a LEFT JOIN, we call the left table the outer table or driver table, and the right table the inner table or driven table. So in the example above, t1 is the outer (driver) table and t2 is the inner (driven) table. Note that for left (outer) joins and right (outer) joins, the join condition must be specified with an ON clause. Back to our requirement: to show all students' scores, even for students with no score records, a left (outer) join does the job:

mysql> SELECT s1.number, s1.name, s2.subject, s2.score FROM student AS s1 LEFT JOIN score AS s2 ON s1.number = s2.number;
+----------+---------------+------------------------------+-------+
| number   | name          | subject                      | score |
+----------+---------------+------------------------------+-------+
| 20180101 | Du Ziteng     | Postpartum care of sows      |    78 |
| 20180101 | Du Ziteng     | On Saddam's war preparedness |    88 |
| 20180102 | Fan Tong      | On Saddam's war preparedness |    98 |
| 20180102 | Fan Tong      | Postpartum care of sows      |   100 |
| 20180103 | Shi Zhenxiang | NULL                         |  NULL |
+----------+---------------+------------------------------+-------+
5 rows in set (0.04 sec)

From the result set we can see that although Shi Zhenxiang has no matching score records, she is still included because the join is a left (outer) join; her score columns are filled with NULL values.

3.4.3.2 Syntax for right (outer) joins

Right (outer) joins work the same as left (outer) joins, except the syntax uses RIGHT instead of LEFT:

SELECT * FROM t1 RIGHT [OUTER] JOIN t2 ON join_conditions [WHERE normal_filter_conditions];

Here the driver table is the right table and the driven table is the left table; we will not go into further detail.

3.4.3.3 Syntax for inner joins

The fundamental difference between an inner join and an outer join is that with an inner join, a driver-table record that does not satisfy the join condition in the ON clause is not added to the final result set. The join queries we saw at the beginning were all inner joins. The simplest inner join syntax, mentioned earlier, puts the tables to be joined directly after the FROM clause. MySQL also provides several other inner join syntaxes; using t1 and t2 as an example:

SELECT * FROM t1 [INNER | CROSS] JOIN t2 [ON join_conditions] [WHERE normal_filter_conditions];

In MySQL, the following inner join forms are equivalent:

  • SELECT * FROM t1 JOIN t2;
  • SELECT * FROM t1 INNER JOIN t2;
  • SELECT * FROM t1 CROSS JOIN t2;

This is equivalent to placing the names of the tables to be joined after the FROM statement and separating them with commas:

 SELECT * FROM t1, t2;

We have now introduced several ways of writing inner joins, but you should settle on one of them. We recommend the INNER JOIN form, because its semantics are very clear and it is easily distinguished from LEFT JOIN and RIGHT JOIN. Note that since the ON clause in an inner join is equivalent to the WHERE clause, an inner join is not required to specify an ON clause.

As we said earlier, the essence of joining is to combine matching records from the joined tables and add the combinations to the result set. The Cartesian product produced by joining two tables is the same regardless of which table is the driver table. For inner joins, records that fail the conditions in the ON clause or WHERE clause are filtered out, which amounts to filtering the Cartesian product, so the driver table and driven table are interchangeable and the final query result does not change. For outer joins, however, driver-table records are kept even when the driven table contains no record satisfying the ON clause, so the roles of driver table and driven table matter: the driver and driven tables of left outer joins and right outer joins cannot be casually swapped.

3.5 Principle of the JOIN Statement

The introduction above is just to remind you of the concepts of join, inner join, and outer join. These basics are the foundation for the topic of this chapter. The real focus is what algorithm MySQL uses to join tables; once you understand this, you will understand why some join queries run as fast as lightning while others run as slow as a snail.

3.5.0 Driver table and driven table

A JOIN connects multiple tables, which is essentially a cyclic matching of data between the tables. Before MySQL 5.5, MySQL supported only one way of associating tables: the nested loop join. If the associated tables hold large amounts of data, a join can take a long time to execute. In versions after MySQL 5.5, MySQL optimized nested execution by introducing the BNLJ (block nested-loop join) algorithm.

The driver table is the main table; the driven table is the secondary table, also called the non-driver table.

  • For inner joins:
SELECT * FROM A JOIN B ON ...

Is A necessarily the driver table? Not necessarily: the optimizer decides, based on your query, which table to query first. The table queried first is the driver table; the other is the driven table. You can check which is which with the EXPLAIN keyword.

  • For outer joins:
SELECT * FROM A LEFT JOIN B ON ...
# or
SELECT * FROM B RIGHT JOIN A ON ...

You would normally assume that A is the driver table and B the driven table, but not necessarily. Test as follows:

CREATE TABLE a(f1 INT, f2 INT, INDEX(f1)) ENGINE=INNODB;
CREATE TABLE b(f1 INT, f2 INT) ENGINE=INNODB;

INSERT INTO a VALUES(1,1),(2,2),(3,3),(4,4),(5,5),(6,6);
INSERT INTO b VALUES(3,3),(4,4),(5,5),(6,6),(7,7),(8,8);

# statement 1
EXPLAIN SELECT * FROM a LEFT JOIN b ON(a.f1=b.f1) WHERE (a.f2=b.f2);
# statement 2
EXPLAIN SELECT * FROM a LEFT JOIN b ON(a.f1=b.f1) AND (a.f2=b.f2);

3.5.1 Simple Nested-Loop Join

As we said earlier, in a two-table join the driver table is accessed only once, while the driven table is accessed many times; the exact number depends on how many records the single-table query on the driver table returns. For an inner join, either table can be the driver table; for an outer join the driver table is fixed: the driver table of a left (outer) join is the left table, and the driver table of a right (outer) join is the right table. We have already outlined the process of the inner join query of t1 and t2; let's review it:

  • Step 1: Select the driver table, use the filter conditions related to the driver table, and select the single table access method with the lowest cost to perform the single table query on the driver table.
  • Step 2: For each record in the result set obtained from querying the driver table in the previous step, search for the matching record in the driven table.

The general process of joining two tables is as follows:

If three tables are joined, the result set from step 2 acts like a new driver table, and the third table becomes the driven table; the process repeats, meaning every record in the step-2 result set needs matching records looked up in t3. In pseudocode the process looks like this:

for each row in t1 {              # scan rows of t1 that satisfy its single-table conditions
    for each row in t2 {          # scan rows of t2 that satisfy its single-table conditions
        for each row in t3 {      # scan rows of t3 that satisfy its single-table conditions
            if the row satisfies the join conditions, send it to the client
        }
    }
}

This process is like a nested loop: the driver table is accessed only once, but the driven table is accessed many times, the exact number depending on how many records the single-table query on the driver table produces. This is called a simple nested-loop join (SNLJ); it is also the crudest of the join algorithms.

3.5.2 Index Nested-loop Join

Use indexes to speed up connections

We know that step 2 of a nested loop join may access the driven table many times, and if each access is a full table scan, that is painfully expensive. But remember: each access to t2 is essentially a single-table query, so we can use indexes to speed it up. Recall the example of joining t1 and t2 from the beginning:

SELECT * FROM t1, t2 WHERE t1.m1 > 1 AND t1.m1 = t2.m2 AND t2.n2 < 'd';

The join is executed as before: first query the driver table, then probe the driven table for each driver-table record.

There are two records in the result set after query driver table T1, and the nested loop join algorithm needs to query the driven table twice:

  • When t1.m1 = 2, querying table t2 is equivalent to:

    SELECT * FROM t2 WHERE t2.m2 = 2 AND t2.n2 < 'd';

  • When t1.m1 = 3, querying table t2 again is equivalent to:

    SELECT * FROM t2 WHERE t2.m2 = 3 AND t2.n2 < 'd';

As you can see, the original condition t1.m1 = t2.m2 involving two tables has become a constant condition by the time t2 is queried, so we only need to optimize the single-table queries on t2. The two queries above use columns m2 and n2, so we can:

  • Since the condition on column m2 is an equality search (t2.m2 = 2, t2.m2 = 3), the ref access method may be used on an index over m2. If ref is used to query t2, we must then go back to the table to check whether the condition t2.n2 < 'd' holds.

    As a special case, suppose m2 is the primary key or a unique secondary index column of t2. Then the cost of fetching records via a condition like t2.m2 = constant is itself constant. In a single-table query, equality search on a primary key or unique secondary index column is called const; in a join query, the MySQL designers call this access method eq_ref.

  • Since the condition on column n2 is t2.n2 < 'd', the range access method may be used on an index over n2. If range is used to query t2, we must go back to the table to check whether the condition on m2 holds.

If indexes exist on both m2 and n2, the less expensive option is chosen for the queries on t2. Of course, an index is not always used: it is used only when the cost of the secondary index plus the back-to-table lookups is lower than a full table scan.

In addition, the query list and filter conditions of a join query may involve only some columns of the driven table, and those columns may all belong to one index. In that case, even when eq_ref, ref, ref_or_null, or range cannot be used, the driven table can still be queried with an index scan (the index access method). So in real work, it is best not to use * as the query list; list only the columns you actually need, as sketched below.
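A sketch of that recommendation, assuming (hypothetically) a composite index on t2(m2, n2):

# both columns read from the driven table are in the index, so t2 can be
# probed through the index alone, with no back-to-table lookups
SELECT t1.m1, t1.n1, t2.n2 FROM t1 JOIN t2 ON t1.m1 = t2.m2 WHERE t2.n2 < 'd';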

3.5.3 Block Nested-loop Join

Scanning a table means loading it from disk into memory and then checking the match conditions in memory. Real tables are not like t1 and t2 with only 3 records: tables with tens of thousands of rows are small, and tables with millions, tens of millions, or even hundreds of millions of rows are everywhere. Memory may not hold all the records of such a table: while the front of the table is being scanned, the rear still sits on disk, and by the time the rear is scanned, memory may be exhausted and the earlier pages must be evicted. As we said, the nested loop join accesses the driven table many times; if the driven table is large and cannot use an index, that effectively means reading the whole table from disk many times, and the I/O cost is huge. So we need a way to reduce the number of accesses to the driven table.

When the driven table is very large, each access loads its records into memory, where each record is matched against just one record of the driver table's result set and is then evicted from memory. Then the next record of the driver result set is taken, and the driven table is loaded all over again: however many records the driver result set has, that is how many times the driven table is loaded from disk. Instead, we could load the driven table's records into memory once and match them against multiple driver-table records at a time, greatly reducing the cost of repeatedly loading the driven table from disk.

Therefore, the designers of MySQL proposed the concept of a join buffer: a fixed-size block of memory allocated before executing the join. Several records from the driver table's result set are loaded into the join buffer first; then the driven table is scanned, and each driven-table record is matched in one pass against all the driver-table records in the join buffer. Since matching happens in memory, this significantly reduces the I/O cost on the driven table. The process of using the join buffer is shown in the figure below:

In the best case, the Join buffer is large enough to hold all the records in the result set of the driven table, so that only one access to the driven table is required to complete the join operation. The folks who designed MySQL call this Nested Loop join algorithm added to the join buffer the Block nested-loop join algorithm.

The size of the join buffer can be configured with the join_buffer_size startup parameter or system variable. The default size is 262144 bytes (256KB), and the minimum is 128 bytes. When indexes cannot be used, and if your machine has ample memory, you can try increasing join_buffer_size to optimize join queries, as sketched below.
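A sketch of inspecting and enlarging the buffer for the current session only (4MB is an arbitrary illustrative value):

SHOW VARIABLES LIKE 'join_buffer_size';
SET SESSION join_buffer_size = 4 * 1024 * 1024;  # 4MB, effective for this session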

In addition, not all columns of the drive table will be added to the Join buffer. Only columns from the query list and filter criteria will be added to the join buffer. Therefore, it is better not to use * as the query list, but only to put the columns we care about in the query list. This can also place more records in the Join buffer.

3.6 summary

  • Overall efficiency comparison: INLJ>BNLJ>SNLJ

  • Always let a small result set drive a large result set (this essentially reduces the amount of data in the outer loop). 'Small' is measured by table row count x row size:

# recommended: with t2.id <= 100, t2 yields the small result set and drives t1
SELECT * FROM t2 STRAIGHT_JOIN t1 ON (t1.b = t2.b) WHERE t2.id <= 100;
# not recommended: the large result set drives the small one
SELECT * FROM t1 STRAIGHT_JOIN t2 ON (t1.b = t2.b) WHERE t2.id <= 100;
  • Add an index on the join column of the driven table (reduces the number of inner-loop matches)
  • Increase join_buffer_size (the more data cached per pass, the fewer scans of the inner table)
  • Reduce unnecessary column reads from the driver table (the fewer the columns, the more records the join buffer can cache)

3.7 Hash join

BNLJ is deprecated as of MySQL 8.0.20, because hash join was added in MySQL 8.0.18 and is used by default in its place.

  • Nested loop: a good choice when only a small subset of the data needs to be joined.

  • Hash Join is a common way to Join large data sets. The optimizer uses the smaller (relatively small) table of two tables to create a Hash table in memory using the Join Key, then scans the larger table and probes the Hash table for rows that match the Hash table.

    • This approach is suitable for smaller tables that can be placed entirely in memory, so that the total cost is the sum of the cost of accessing both tables.
    • If the table is too large to fit in memory, the optimizer splits it into several partitions; partitions that cannot fit in memory are written to temporary segments on disk. Large temporary segments are needed to maximize I/O performance.
    • It performs well in environments with large, unindexed tables and parallel queries, offering the best performance in such cases; it is widely considered the heavy lifter of joins. Hash join applies only to equi-joins (for example WHERE A.col1 = B.col2), which is determined by the nature of hashing. A quick way to confirm the chosen algorithm is sketched after this list.
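To confirm which algorithm the optimizer picked, EXPLAIN FORMAT=TREE (available since MySQL 8.0.16) can be used; a sketch with the unindexed t1 and t2 tables from earlier, where the plan is expected to show an "inner hash join":

EXPLAIN FORMAT=TREE
SELECT * FROM t1 JOIN t2 ON t1.m1 = t2.m2;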

4. Sub-query optimization

Let's see how a subquery can be optimized into a join query.

MySQL supports subqueries since version 4.1. Subqueries can be used to perform nested queries of SELECT statements. That is, the result of one SELECT query is the condition of another SELECT statement. Subqueries can perform many SQL operations at once that logically require multiple steps.

Subqueries are an important MySQL feature: they let one SQL statement implement complex queries. However, subqueries do not execute efficiently.

The reason:

  • To execute a subquery, MySQL creates a temporary table for the results of the inner query, and then the outer query queries the records from the temporary table. After the query is complete, the temporary tables are destroyed. This will consume too much CPU and I/O resources, resulting in a large number of slow queries.

  • Temporary tables stored in the result set of a subquery do not have indexes in either in-memory or disk temporary tables, so the query performance is affected.

  • For subqueries that return large result sets, the impact on query performance is greater.

In MySQL, JOIN queries can be used instead of subqueries. Join queries do not require temporary tables, are faster than subqueries, and perform better if indexes are used in the query.

Conclusion: try NOT to use NOT IN or NOT EXISTS; use LEFT JOIN xxx ON xx WHERE xx IS NULL instead.
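A hedged sketch of that rewrite with the student/class tables: the subquery version materializes a temporary table, while the LEFT JOIN version can use the primary key of class (which, being a primary key, is never NULL, so the two forms are equivalent here):

# subquery version
SELECT s.* FROM student s WHERE s.classId NOT IN (SELECT c.id FROM class c);
# join version
SELECT s.* FROM student s LEFT JOIN class c ON s.classId = c.id WHERE c.id IS NULL;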

5. Sorting optimization

5.1 Sorting Optimization

Question:

We index the WHERE condition fields, but why also index the ORDER BY fields?

Answer:

MySQL supports two sorting methods, FileSort and Index.

  • In Index sorting, the index guarantees the order of the data, so no extra sort is needed and it is more efficient.
  • FileSort sorting is generally done in memory and consumes a lot of CPU. If the query result set is large, temporary file I/O is generated and the sort spills to disk, which is inefficient.

Optimization suggestions:

  • In SQL, indexes can be used in the WHERE clause and the ORDER BY clause: in the WHERE clause to avoid a full table scan, and in the ORDER BY clause to avoid FileSort sorting. Of course, in some cases a full table scan or a FileSort sort is not necessarily slower than using an index, but in general we still try to avoid them to improve query efficiency.
  • Try to complete the ORDER BY sort with an index. If WHERE and ORDER BY use the same column, use a single-column index; if they use different columns, use a composite index covering them.
  • When an index cannot be used, the FileSort approach needs to be tuned.

5.2 test

Delete the indexes previously created on the student and class tables.

DROP INDEX idx_monitor ON class;
DROP INDEX idx_cid ON student;
DROP INDEX idx_age ON student;
DROP INDEX idx_name ON student;
DROP INDEX idx_age_name_classid ON student;
# or: CALL proc_drop_index('atguigudb2','student');

The goal is to observe whether Using filesort appears and whether it can be eliminated by an index.

5.2.1 Procedure 1

EXPLAIN SELECT SQL_NO_CACHE * FROM student ORDER BY age,classid;

EXPLAIN SELECT SQL_NO_CACHE * FROM student ORDER BY age,classid LIMIT 10;

5.2.2 Procedure 2: ORDER BY without LIMIT, the index fails

CREATE INDEX idx_age_classid_name ON student (age,classid,NAME);
# without limit, the index is not used
EXPLAIN SELECT SQL_NO_CACHE * FROM student ORDER BY age,classid;

# with a LIMIT filter, the index is used
EXPLAIN SELECT SQL_NO_CACHE * FROM student ORDER BY age,classid LIMIT 10;

5.2.3 Procedure 3: wrong ORDER BY column order, the index fails

CREATE INDEX idx_age_classid_stuno ON student (age,classid,stuno);

Which of the following queries fail to use the index?

EXPLAIN  SELECT * FROM student ORDER BY classid LIMIT 10;

EXPLAIN  SELECT * FROM student ORDER BY classid,NAME LIMIT 10;  

EXPLAIN  SELECT * FROM student ORDER BY age,classid,stuno LIMIT 10; 

EXPLAIN  SELECT * FROM student ORDER BY age,classid LIMIT 10;

EXPLAIN  SELECT * FROM student ORDER BY age LIMIT 10;

5.2.4 Procedure 4: inconsistent ORDER BY rules cause index failure (wrong column order: no index; mixed sort directions: no index)

EXPLAIN  SELECT * FROM student ORDER BY age DESC, classid ASC LIMIT 10;

EXPLAIN  SELECT * FROM student ORDER BY classid DESC, NAME DESC LIMIT 10;

EXPLAIN  SELECT * FROM student ORDER BY age ASC,classid DESC LIMIT 10; 

EXPLAIN  SELECT * FROM student ORDER BY age DESC, classid DESC LIMIT 10;

In the ORDER BY clause, try to use Index sorting rather than FileSort.

5.2.5 Procedure 5: No filtering, no index

EXPLAIN  SELECT * FROM student WHERE age=45 ORDER BY classid;

EXPLAIN  SELECT * FROM student WHERE  age=45 ORDER BY classid,NAME; 

EXPLAIN  SELECT * FROM student WHERE  classid=45 ORDER BY age;

EXPLAIN  SELECT * FROM student WHERE  classid=45 ORDER BY age LIMIT 10;

5.2.6 summary

INDEX a_b_c(a,b,c)

order by can use the leftmost index prefix:
- ORDER BY a
- ORDER BY a,b
- ORDER BY a,b,c
- ORDER BY a DESC,b DESC,c DESC

If WHERE fixes the leftmost prefix of the index to a constant, order by can use the index:
- WHERE a = const ORDER BY b,c
- WHERE a = const AND b = const ORDER BY c
- WHERE a = const AND b > const ORDER BY b,c

Cannot use the index for sorting:
- ORDER BY a ASC,b DESC,c DESC  /* inconsistent sort directions */
- WHERE g = const ORDER BY b,c  /* index on a is lost */
- WHERE a = const ORDER BY c    /* index on b is lost */
- WHERE a = const ORDER BY a,d  /* d is not part of the index */
- WHERE a in (...) ORDER BY b,c /* for sorting, multiple equality conditions are also range queries */

5.3 Actual Case

ORDER BY clause, try to use Index rather than FileSort.

Delete index from student before executing the case, leaving only primary key:

DROP INDEX idx_age ON student;
DROP INDEX idx_age_classid_stuno ON student;
DROP INDEX idx_age_classid_name ON student;
# or: CALL proc_drop_index('atguigudb2','student');

Scenario: query students aged 30 whose student number (stuno) is less than 101000, ordered by name

EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age = 30 AND stuno <101000 ORDER BY NAME ;

The query results are as follows:

mysql> SELECT SQL_NO_CACHE * FROM student WHERE age = 30 AND stuno <101000 ORDER BY NAME;
+---------+--------+--------+------+---------+
| id      | stuno  | name   | age  | classId |
+---------+--------+--------+------+---------+
|     922 | 100923 | elTLXD |   30 |     249 |
| 3723263 | 100412 | hKcjLb |   30 |      59 |
| 3724152 | 100827 | iHLJmh |   30 |     387 |
| 3724030 | 100776 | LgxWoD |   30 |     253 |
|      30 | 100031 | LZMOIa |   30 |      97 |
| 3722887 | 100237 | QzbJdx |   30 |     440 |
|     609 | 100610 | vbRimN |   30 |     481 |
|     139 | 100140 | ZqFbuR |   30 |     351 |
+---------+--------+--------+------+---------+
8 rows in set, 1 warning (3.16 sec)

Conclusion: type is ALL and Using filesort appears in Extra, both worst-case signals. Optimization is a must.

Optimization idea:

In order to get rid of filesort, we can build the index

CREATE INDEX idx_age_name ON student(age,NAME);

Scheme 2: try to let both the WHERE filter conditions and the sort use the index:

DROP INDEX idx_age_name ON student;
CREATE INDEX idx_age_stuno_name ON student (age,stuno,NAME);
EXPLAIN SELECT SQL_NO_CACHE * FROM student WHERE age = 30 AND stuno <101000 ORDER BY NAME ;
mysql> SELECT SQL_NO_CACHE * FROM student
    -> WHERE age = 30 AND stuno <101000 ORDER BY NAME;
+-----+--------+--------+------+---------+
| id  | stuno  | name   | age  | classId |
+-----+--------+--------+------+---------+
| 167 | 100168 | AClxEF |   30 |     319 |
| 323 | 100324 | bwbTpQ |   30 |     654 |
| 651 | 100652 | DRwIac |   30 |     997 |
| 517 | 100518 | HNSYqJ |   30 |     256 |
| 344 | 100345 | JuepiX |   30 |     329 |
| 905 | 100906 | JuWALd |   30 |     892 |
| 574 | 100575 | kbyqjX |   30 |     260 |
| 703 | 100704 | KJbprS |   30 |     594 |
| 723 | 100724 | OTdJkY |   30 |     236 |
| 656 | 100657 | Pfgqmj |   30 |     600 |
| 982 | 100983 | qywLqw |   30 |     837 |
| 468 | 100469 | sLEKQW |   30 |     346 |
| 988 | 100989 | UBYqJl |   30 |     457 |
| 173 | 100174 | UltkTN |   30 |     830 |
| 332 | 100333 | YjWiZw |   30 |     824 |
+-----+--------+--------+------+---------+
15 rows in set, 1 warning (0.00 sec)

Surprisingly, the SQL that still used filesort ran faster than the SQL where filesort had been optimized away, and much faster: the result appeared almost instantly.

Conclusion:

  1. When both indexes exist, MySQL automatically selects the better plan (in this example, MySQL chose idx_age_stuno_name). However, as the data volume changes, the chosen index may change too.

  2. When you must choose between indexing the range-condition field and the GROUP BY / ORDER BY fields, first observe how many rows the filter condition removes. If it filters out enough data and the remaining rows to sort are few, put the index on the range field first; and vice versa.

Consider: is it possible to use the following index here?

DROP INDEX idx_age_stuno_name ON student;
CREATE INDEX idx_age_stuno ON student(age,stuno);

5.4 Filesort algorithm: Double-path sort and single-path sort

Dual-way sort (slow)

  • Before MySQL 4.1, dual-way sorting was used. Literally, it scans the disk twice to get the final data: first read the row pointers and the ORDER BY columns and sort them, then scan the sorted list and re-read the corresponding rows from disk according to the values in the list.
  • In other words: fetch the sort fields from disk, sort them in the buffer, then fetch the other fields from disk.

Fetching a batch of data requires two disk scans, and I/O is notoriously time-consuming, so after MySQL 4.1 an improved algorithm appeared: single-way sorting.

Single-way sort (fast)

It is faster to read all the columns needed for the query from disk, sort them in buffer by the ORDER BY column, and then scan the sorted list for output, avoiding a second read. And it turns random IO into sequential IO, but it uses more space because it saves each line in memory.

Conclusion and the problems raised

  • Since single-way sort is the newer design, it is overall better than dual-way sort.
  • But single-way sort has its own problems (see the optimization strategies below).

Optimization strategy:

  • Try to increase sort_buffer_size
    • Whichever algorithm is used, increasing this parameter improves efficiency, but adjust it according to the system's capacity, since this parameter is per-connection and is typically tuned between 1MB and 8MB. In MySQL 5.7 with the InnoDB storage engine, the default value is 1048576 bytes (1MB).
    SHOW VARIABLES LIKE '%sort_buffer_size%';

  • Try to increase max_length_for_sort_data
    • Increasing this parameter raises the probability that the improved (single-way) algorithm is used.
    SHOW VARIABLES LIKE '%max_length_for_sort_data%';  # default 1024 bytes
    • However, if it is set too high, the probability that the total data volume exceeds sort_buffer_size increases, which shows up as high disk I/O activity and low processor usage. If the total length of the columns to be returned is greater than max_length_for_sort_data, the dual-way algorithm is used; otherwise the single-way algorithm is used. Adjust it between 1024 and 8192 bytes.
  • SELECT * with ORDER BY is a big no-no; it is best to query only the required fields.
    • When the total size of the queried fields is smaller than max_length_for_sort_data and the sort field is not a TEXT|BLOB type, the improved algorithm (single-way sort) is used; otherwise the old algorithm (dual-way sort) is used.
    • Data for either algorithm may exceed sort_buffer_size, in which case temporary files are created for merge sorting, resulting in multiple I/Os; the risk is greater with single-way sorting, so sort_buffer_size should be increased. A sketch of checking and adjusting these parameters follows this list.
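A minimal sketch of inspecting and (per session) raising the two parameters discussed above; the concrete values are illustrative choices, not recommendations:

SHOW VARIABLES LIKE '%sort_buffer_size%';
SHOW VARIABLES LIKE '%max_length_for_sort_data%';
SET SESSION sort_buffer_size = 4 * 1024 * 1024;   -- 4MB, session scope only
SET SESSION max_length_for_sort_data = 4096;      -- raises the chance of single-way sort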

6. GROUP BY optimization

  • GROUP BY uses indexes under almost the same conditions as ORDER BY; GROUP BY can use an index even without filter conditions (see the sketch after this list).
  • Group by is sorted and then grouped, following the best left prefix rule for indexes
  • When index columns cannot be used, increase the max_length_for_sort_data and sort_buffer_size parameter settings
  • Where is more efficient than HAVING, so don’t use HAVING if you can qualify where
  • Reduce the use of order by, and communicate with the business without sorting, or put sorting on the application side. The Order BY, Group BY, and DISTINCT statements consume CPU, and the CPU resources of the database are extremely valuable.
  • The SQL statement contains the order BY, Group BY, and DISTINCT queries. Keep the result set filtered by the WHERE condition within 1000 lines; otherwise, the SQL will be slow.
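A sketch of the first bullet, assuming the idx_age_classid_name index from section 5.2 is (re)created:

EXPLAIN SELECT age, classid, COUNT(*) FROM student GROUP BY age, classid;
-- the leftmost prefix (age, classid) of the index can be used even with no WHERE clause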

7. Optimize paging queries

In general paging queries, creating covering indexes can improve performance fairly well. A common and very painful problem is LIMIT 2000000,10: MySQL has to sort the first 2,000,010 records but returns only rows 2,000,001 through 2,000,010 and discards the rest, so the cost of the query and sort is very high.

EXPLAIN SELECT * FROM student LIMIT 200000,10;

Optimization idea 1:

Do the sorting and paging on the index, then join back to the original table by primary key to fetch the remaining columns.

EXPLAIN SELECT * FROM student t,
  (SELECT id FROM student ORDER BY id LIMIT 200000,10) a
WHERE t.id = a.id;

Optimization idea two:

This scheme applies to tables with auto-increment primary keys; it converts the LIMIT query into a query starting from a known position.

EXPLAIN SELECT * FROM student WHERE id > 2000000 LIMIT 10;

8. Prefer covering indexes

8.1 What is a covering index?

The query result is obtained directly from the data stored in the secondary index, without going back to the table (no back-to-table lookup).

  • An index is an efficient way to find rows, and a database can also use an index to fetch a column's data directly, so it does not have to read the entire row. After all, index leaf nodes store the data they index; when reading the index alone yields the desired data, there is no need to read the rows. An index that contains the data satisfying the query result is called a covering index.

  • A form of non-clustered composite index that includes all columns used in the SELECT, JOIN, and WHERE clauses of the query (i.e. the fields that are indexed are the ones that are covered by the query condition).

The index column + primary key contains the columns queried between SELECT and FROM.

Example 1:

# drop the previous index
DROP INDEX idx_age_stuno ON student;
CREATE INDEX idx_age_name ON student (age,NAME);
EXPLAIN SELECT * FROM student WHERE age <> 20;

EXPLAIN SELECT age,NAME FROM student WHERE age <> 20;

The query above can use the declared index (the index covers the queried columns), but the following one cannot:

EXPLAIN SELECT id,age,NAME,classid FROM student WHERE age <> 20;

Example 2:

EXPLAIN SELECT * FROM student WHERE NAME LIKE '%abc';

create index idx_age_name on student(age,name);
EXPLAIN SELECT id,age FROM student WHERE NAME LIKE '%abc';

However, once the query list includes classid, which is not in the index, the index fails:

EXPLAIN SELECT id,age,name,classid FROM student WHERE NAME LIKE '%abc';

8.2 Advantages and disadvantages of covering indexes

Benefits:

  • Avoid a second query (back-to-table lookup) on InnoDB tables with indexes
    • InnoDB stores data in clustered-index order. For InnoDB, secondary indexes store the row's primary key value in their leaf nodes. If we query data through a secondary index, after finding the key value we still need a second lookup through the primary key to obtain the row data we actually need.
    • With a covering index, the needed data can be obtained directly from the secondary index's key values, avoiding the second lookup on the primary key, reducing I/O operations, and improving query efficiency.
  • Random I/O can be turned into sequential I/O, improving query efficiency
    • Because a covering index is stored on pages in key order, an I/O-intensive range lookup needs far less I/O than randomly reading every row from disk; a covering index can therefore turn random disk reads into the sequential I/O of an index lookup.
    • Because covering indexes reduce the number of tree searches and significantly improve query performance, using them is a common optimization technique.

Disadvantages:

  • Maintaining indexed fields always has a cost. Therefore, there is a trade-off when creating redundant indexes to support covering indexes. This is the job of a business DBA or business data architect.

9. How to index a string

There is a table of teachers with the following definitions:

create table teacher(
  ID bigint unsigned primary key,
  email varchar(64),
  ...
) engine=innodb;

Instructors use email to log in, so a statement like this must appear in the business code:

mysql> select col1, col2 from teacher where email='xxx';

If there is no index on the email field, then this statement can only do a full table scan.

9.1 Prefix Index

MySQL supports prefix indexes. By default, if the statement you create the index does not specify a prefix length, the index will contain the entire string.

mysql> alter table teacher add index index1(email);
# or
mysql> alter table teacher add index index2(email(6));

What are the differences in data structure and storage between these two definitions? The diagram below shows the two indexes.


If you use index1 (the index structure for the entire string of email), the order of execution looks like this:

  • Find the index value ‘[email protected]’ from the index1 index tree, obtain the value of ID2;
  • Find the row whose primary key value is ID2 on the primary key, judge that the value of email is correct, and add this row to the result set.
  • Take the next entry in the index1 index tree and find that the condition email='[email protected]' is no longer satisfied; the loop ends.

In this process, the primary key index only needs to fetch data once, so the system considers that only one row is scanned.

If index2 (i.e. Email (6) index structure) is used, the order of execution would look like this:

  • Find the first record in the index2 index tree whose index value is 'zhangs' and obtain ID1;
  • Look up the row with primary key ID1 and find that its email value is not '[email protected]'; this row is discarded.
  • Take the next entry in index2 and find it is still 'zhangs'; fetch ID2, look up the whole row by ID in the primary key index, judge that this time the value is correct, and add this row to the result set.
  • Repeat the previous step until the value in index2 is no longer 'zhangs'; the loop ends.

That is, a prefix index with a well-chosen length can save space without adding too much extra query cost. As mentioned earlier, distinctness matters: the higher the distinctness, the better, because higher distinctness means fewer duplicate key values.
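A common way to choose the prefix length, in the spirit of the distinctness point above (a sketch; the acceptable loss of distinctness is up to you):

SELECT COUNT(DISTINCT email) AS full_col,
       COUNT(DISTINCT LEFT(email,4)) AS l4,
       COUNT(DISTINCT LEFT(email,5)) AS l5,
       COUNT(DISTINCT LEFT(email,6)) AS l6
FROM teacher;
-- pick the shortest prefix whose distinct count stays close to full_col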

9.2 Impact of Prefix Indexes on Covering Indexes

Conclusion:

Using a prefix index eliminates the query-performance optimization provided by covering indexes, which is another factor to consider when deciding whether to use a prefix index.
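A minimal illustration of that point, reusing the teacher table (a sketch):

-- with index1(email), this query can be answered from the index alone (covering index);
-- with index2(email(6)), MySQL must still go back to the row to verify the full email,
-- so the covering-index benefit is lost
EXPLAIN SELECT id, email FROM teacher WHERE email = '[email protected]';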

10. Index push down

Index Condition Pushdown(ICP) is a new feature in MySQL 5.6. It is an optimized way to filter data using indexes in the storage engine layer. ICP can reduce the number of times the storage engine accesses the base table and the number of times the MySQL server accesses the storage engine.

10.1 Scanning process before and after use

In the process of not using ICP index scanning:

  • Storage layer: Only the entire row corresponding to the index record that meets the index key condition is retrieved and returned to the Server layer

  • Server layer: The data returned is filtered using the following WHERE criteria until the last row is returned.

Procedure for using ICP scanning:

  • Storage layer:
    • The index record range that meets the index key condition is determined, and the index is filtered using index filter. The index records that meet the Index Filter criteria are retrieved from the table and returned to the server layer. Index records that do not meet the Index Filter criteria are discarded and are not returned to the table or the Server layer.
  • Server layer:
    • For the returned data, the table filter criterion is used for final filtering.

Cost difference before and after use:

Before use, the storage layer returns the entire row of records to be filtered out by index Filter

With ICP, records that do not meet the Index Filter criteria are simply removed, saving them the cost of returning to the table and passing to the Server layer.

The acceleration effect of ICP depends on the proportion of data filtered through ICP in the storage engine.

10.2 Start and close of ICP

  • Index condition pushdown is enabled by default. It can be controlled with the index_condition_pushdown flag of the optimizer_switch system variable:
# turn index condition pushdown off
SET optimizer_switch = 'index_condition_pushdown=off';
# turn index condition pushdown on
SET optimizer_switch = 'index_condition_pushdown=on';
  • When index condition pushdown is used, the Extra column of the EXPLAIN output shows Using index condition.

10.3 ICP Use Cases

Case 1:

SELECT * FROM tuser WHERE NAME LIKE 'zhang%' AND age = 10 AND ismale = 1;

Case 2:

10.4 Enabling and Disabling ICP Performance Comparison

The main purpose of this stored procedure is to insert a large number of rows whose zipcode is '000001', so that the query filters many rows at the storage engine layer; the data pages also stay in the buffer pool (cached data pages, no disk I/O).

DELIMITER //
CREATE PROCEDURE insert_people(max_num INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  SET autocommit = 0;
  REPEAT
    SET i = i + 1;
    -- the non-zipcode literals were garbled in the source; placeholder values are used here
    INSERT INTO people (zipcode, firstname, lastname, address)
    VALUES ('000001', 'san', 'zhang', 'beijing');
    UNTIL i = max_num END REPEAT;
  COMMIT;
END //
DELIMITER ;
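For reference, the people table definition does not appear in the source; a plausible sketch, including the composite secondary index that ICP relies on in this case (all names and lengths are assumptions):

CREATE TABLE people (
  id INT NOT NULL AUTO_INCREMENT,
  zipcode CHAR(6),
  firstname VARCHAR(20),
  lastname VARCHAR(20),
  address VARCHAR(50),
  PRIMARY KEY (id),
  KEY idx_zip_last_first (zipcode, lastname, firstname)  -- assumed index
);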

Calling a stored procedure

 call insert_people(1000000);

First, open profiling.

set profiling=1;

Index push down is enabled by default when the SQL statement is executed.

# the non-zipcode condition was garbled in the source; the classic ICP demo filter
# (a LIKE on the indexed lastname column) is assumed here
SELECT * FROM people WHERE zipcode = '000001' AND lastname LIKE '%zhang%';

Execute the SQL statement again without index push-down

# same query with ICP disabled via an optimizer hint (the original statement was
# garbled; the NO_ICP hint form is assumed)
SELECT /*+ NO_ICP(people) */ * FROM people
WHERE zipcode = '000001' AND lastname LIKE '%zhang%';

View all profiles generated by the current session

show profiles;

The results are as follows.

Comparing multiple test runs, the query optimized with ICP is more efficient; the effect is more obvious with larger data volumes.

10.5 Conditions of use of ICP

ICP conditions of use:

  1. Can only be used for secondary indexes
  2. The type value (join type) in the EXPLAIN execution plan must be range, ref, eq_ref, or ref_or_null.
  3. Not all WHERE conditions can be filtered by ICP. If the condition is not in the index column, the whole table must be read to the server for WHERE filtering.
  4. ICP can be used with the MyISAM and InnoDB storage engines
  5. The ICP function for partitioned tables is not supported in MySQL 5.6, but is supported in MySQL 5.7.
  6. ICP optimization is not supported when the SQL statement uses a covering index.
  7. Conditions for related subqueries cannot use ICP

11. Plain indexes vs. unique indexes

For performance reasons, would you prefer a unique index or a plain index? What is the basis of selection?

Suppose we have a table with a primary key column id and a field k with an index on it, and assume that none of the values of field k are duplicated. The table's creation statement is:

mysql> create table test(
		id int primary key,
		k int not null,
		name varchar(16),
		index (k)
	)engine=InnoDB;

The (id, k) values of rows R1 through R5 in the table are (100,1), (200,2), (300,3), (500,5) and (600,6) respectively.

11.1 Query Process

For example, select ID from test where k=5.

  • For a normal index, after finding the first record that meets the condition (5,500), the next record is searched until the first record that does not meet the condition k=5 is encountered.
  • For unique indexes, since the index defines uniqueness, the search stops after the first record that meets the condition is found.

So how much of a performance difference does this make? The answer is, very little.

11.2 Update Process

To illustrate the impact of normal and unique indexes on update-statement performance, we first need to introduce the change buffer.

When a data page needs to be updated and the page is in memory, it is updated directly. If the page is not yet in memory, InnoDB caches these updates in the change buffer (as long as data consistency is not affected), so the data page does not need to be read from disk first. The next time a query accesses that data page, the page is read into memory and the change-buffer operations related to it are applied. This guarantees the correctness of the data logic.

The process of applying the operations in the Change Buffer to the original data page to get the latest results is called merge. In addition to accessing the data page that triggers the merge, the system has background threads that merge periodically. The merge operation is also performed during the database shutdown process.

If update operations can first be recorded in the change buffer, disk reads are reduced and statement execution speeds up significantly. In addition, since reading a data page into memory occupies the buffer pool, this approach also avoids occupying memory and improves memory utilization. Updates to unique indexes cannot use the change buffer, because the uniqueness check requires reading the data page into memory anyway; in fact, only normal indexes can use it.

What is InnoDB's process for inserting a new record (4,400) into this table? It depends on whether the target page is already in memory:

  • If the page is in memory: a unique index must first check that there is no conflicting value between 3 and 5, then insert; a normal index simply inserts. The cost difference is negligible.
  • If the page is not in memory: a unique index must read the page from disk to perform the uniqueness check before inserting, while a normal index just records the update in the change buffer and returns. Random reads from disk are among the most expensive database operations, so avoiding them gives the normal index a clear performance advantage.

11.3 Usage scenarios of the change Buffer

  • How to choose between normal index and unique index? In fact, there is no difference between the two types of indexes in terms of query capability, but the main consideration is the impact on update performance. Therefore, it is recommended that you choose normal indexes as much as possible.

  • In actual use, the combination of a normal index and the change buffer can significantly speed up updates on tables with large data volumes (see the sketch after this list).

  • If all updates are immediately followed by a query for the record, you should turn off the change Buffer. In other cases, change Buffer can improve update performance.

  • Since unique indexes do not use the optimization mechanism of Change Buffer, it is recommended to give priority to non-unique indexes from the perspective of performance if the business is acceptable. But what if “the business may not be assured”?

    • First, business correctness comes first. We start with the premise that the business code has been guaranteed not to write duplicate data. If the business is not guaranteed, or if the business simply requires the database to do the constraint, then there is no choice but to create a unique index. In this case, the significance of this section is that if you encounter a large number of slow insertion of data, memory hit ratio is low, to provide you with a troubleshooting idea.
    • However, in some “archive repository” scenarios, you can consider using unique indexes. For example, online data only needs to be kept for half a year, and historical data is kept in an archive. At this point, archiving the data ensures that there are no unique key conflicts. To improve archiving efficiency, consider changing the unique index in a table to a normal index.
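The change buffer itself can be inspected and tuned; a sketch using two real InnoDB variables (the comments show their usual defaults):

SHOW VARIABLES LIKE 'innodb_change_buffering';        -- default: all
SHOW VARIABLES LIKE 'innodb_change_buffer_max_size';  -- default: 25 (% of the buffer pool)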

12. Other query optimization strategies

12.1 Distinction between EXISTS and IN

Question: It is unclear in which cases EXISTS should be used and in which cases IN. Is the deciding factor whether the table's index can be used? Answer: indexes are a precondition; the choice actually depends on the sizes of the tables. You can think of the selection criterion as "small table drives large table", which is the most efficient. For example:

SELECT * FROM A WHERE cc IN (SELECT cc FROM B)
SELECT * FROM A WHERE EXISTS (SELECT cc FROM B WHERE B.cc=A.cc)

If A is smaller than B, use EXISTS, because EXISTS is implemented as a loop over the outer table A. The implementation logic is similar to:

for i in A
    for j in B
      if j.cc == i.cc then ...

Use IN when B is less than A, because the implementation logic is similar to:

 for i in B
    for j in A
      if j.cc == i.cc then ...

Whichever table is smaller drives the query: if table A is smaller, use EXISTS; if table B is smaller, use IN.

12.2 COUNT(*) and COUNT(Specific Field) efficiency

Question: In MySQL there are three ways to count the rows of a table: SELECT COUNT(*), SELECT COUNT(1), and SELECT COUNT(specific field). How do these three compare in query efficiency?

Answer: Premise: if you want to count the number of non-NULL values of a specific field, that is a different matter; after all, a performance comparison only makes sense when the results are the same.

  • Point 1: COUNT(*) and COUNT(1) both count all rows and are essentially the same (there may be a slight difference in execution time, but you can still consider them equally efficient). If there is a WHERE clause, all rows matching the filter are counted; without a WHERE clause, the table's total row count is returned.

  • Point 2: With the MyISAM storage engine, counting a table's rows has O(1) complexity, because each MyISAM table stores a row_count value in its metadata, and table-level locking guarantees its consistency. With the InnoDB storage engine, which supports transactions and uses row-level locking and the MVCC mechanism, such a row_count variable cannot be maintained as in MyISAM; counting requires a full table scan, a loop plus counter, with O(n) complexity.

  • Point 3: In the InnoDB engine, if COUNT(specific field) is used to count rows, try to use a secondary index. The primary key uses the clustered index, which contains more information and is clearly larger than a secondary (non-clustered) index. For COUNT(*) and COUNT(1), no specific rows need to be looked up, only counted, so the system automatically uses a smaller secondary index for the count.

If there are multiple secondary indexes, the one with the smaller key_len is used for the scan. The primary key index is used for row counting only when no secondary index exists.
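A sketch of Point 3: when a secondary index exists (for example, one of the indexes created on student earlier), EXPLAIN shows COUNT(*) picking it instead of the primary key:

EXPLAIN SELECT COUNT(*) FROM student;
-- the key column of the plan shows a secondary index rather than PRIMARY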

12.3 about the SELECT (*)

In table query, you are advised to specify fields. Do not use * as the field list. SELECT < field List > is recommended. The reason:

  • In the process of parsing, MySQL will query the data dictionary to convert “*” into all column names in sequence, which will greatly consume resources and time.

  • Covering indexes cannot be used

12.4 Impact of LIMIT 1 on Optimization

For SQL statements that scan the entire table, if you are certain that there is only one result set, then LIMIT 1 will not continue to scan when one result is found, which will speed up the query.

If the table already has a unique index on the field, the query can use the index and will not scan the whole table; in that case LIMIT 1 is unnecessary.

12.5 Use COMMIT

Whenever possible, use COMMIT in your programs as often as appropriate. Performance improves and resource demands decrease thanks to the resources released by COMMIT.

COMMIT Releases resources:

  • Undo information on rollback segments, used to recover data
  • The lock acquired by a program statement
  • Space in the redo/undo log buffer
  • Manage internal costs in the three resources listed above

13. Taobao's database: how to design the primary key?

Talk about a practical problem: Taobao’s database, the main key is how to design?

Some of the wrong answers are still circulating on the Internet year after year, and even become the so-called MySQL catch-all. One of the most obvious errors is the design of MySQL’s primary key.

Most people answer confidently: use an 8-byte BIGINT as the primary key, not INT. Wrong!

Such an answer only considers the database level, without thinking about the primary key from a business perspective. Is the primary key an auto-increment ID? Entering 2022, if you still use an auto-increment ID as the primary key, you may not even get a passing grade in architecture design.

13.1 Problem of adding an ID automatically

An auto-increment ID as the primary key is easy to understand, and almost all databases support auto-increment types, though their implementations differ. Besides being simple, though, auto-increment IDs have the following problems:

  1. Low reliability
     There is an auto-increment ID backtracking problem, which was not fixed until the latest MySQL 8.0 release.

  2. Low security
     IDs exposed at the interface make it easy to guess the corresponding information. For example, with an interface like /User/1/, it is very easy to guess user ID values and the total number of users, and very easy to crawl data through the interface.

  3. Poor performance
     Auto-increment IDs perform poorly and must be generated on the database server.

  4. More interaction
     You need to execute an extra function such as last_insert_id() to learn the value just inserted, which requires one more network round trip. In a high-concurrency system, one more SQL statement means one more overhead.

  5. Local uniqueness
     Most importantly, an auto-increment ID is only locally unique: it is unique only within the current database instance, not globally unique across arbitrary servers. For today's distributed systems, this is a nightmare.

13.2 Main key of Service Fields

To uniquely identify a member’s information, you need to set up a primary key for the member information table. So how do we set the primary key for this table to achieve our desired goal? Here we consider the business field as the primary key.

Table data is as follows:

In this table, which field is appropriate?

Select the card number

A cardno seems appropriate because it cannot be empty and is unique and can be used to identify a membership record.

mysql> CREATE TABLE demo.membermaster
    -> (
    ->   cardno CHAR(8) PRIMARY KEY,
    ->   membername TEXT,
    ->   memberphone TEXT,
    ->   memberpid TEXT,
    ->   memberaddress TEXT,
    ->   sex TEXT,
    ->   birthday DATETIME
    -> );
Query OK, 0 rows affected (0.06 sec)

Different member card numbers correspond to different members, and the field “cardno” uniquely identifies a member. If this is the case, the member card number corresponds to the member one by one, and the system can work normally.

But the reality is that the member card number may be reused. For example, Zhang San moves away from his original address because of a job change and stops shopping at the store (returning his membership card), so he is no longer a member of the store. Not wanting the card number to sit idle, the merchant issues the card number "10000001" to Wang Wu.

From the system-design point of view, this change only modifies the member information whose card number is "10000001" in the member information table and does not affect data consistency. In other words, once the member information for card number "10000001" is modified, all modules of the system obtain the modified information; there is no situation where "some modules get the modified member information while others get the pre-modification information, causing data inconsistency inside the system". So, at the information-system level, it is fine.

But at the business level of using the system, there is a big problem that affects the merchant. For example, we have a sales flow table (trans) that records all sales flows. On December 1, 2020, Zhang San bought a book at the store for 89 yuan. So the system has a flow record of Zhang San's book purchase, as follows:

Next, let’s check the member sales record as of December 1st, 2020:

mysql> SELECT b.membername,c.goodsname,a.quantity,a.salesvalue,a.transdate
    -> FROM demo.trans AS a
    -> JOIN demo.membermaster AS b
    -> JOIN demo.goodsmaster AS c
    -> ON (a.cardno = b.cardno AND a.itemnumber=c.itemnumber);
+------------+-----------+----------+------------+---------------------+
| membername | goodsname | quantity | salesvalue | transdate           |
+------------+-----------+----------+------------+---------------------+
| Zhang San  | book      |    1.000 |      89.00 | 2020-12-01 00:00:00 |
+------------+-----------+----------+------------+---------------------+
1 row in set (0.00 sec)

If membership card "10000001" is then reissued to Wang Wu, we update the member information table. The same query now returns:

mysql> SELECT b.membername,c.goodsname,a.quantity,a.salesvalue,a.transdate
    -> FROM demo.trans AS a
    -> JOIN demo.membermaster AS b
    -> JOIN demo.goodsmaster AS c
    -> ON (a.cardno = b.cardno AND a.itemnumber=c.itemnumber);
+------------+-----------+----------+------------+---------------------+
| membername | goodsname | quantity | salesvalue | transdate           |
+------------+-----------+----------+------------+---------------------+
| Wang Wu    | book      |    1.000 |      89.00 | 2020-12-01 00:00:00 |
+------------+-----------+----------+------------+---------------------+
1 row in set (0.01 sec)

The result says: Wang Wu bought a book on December 1, 2020, spending 89 yuan. Clearly wrong! Conclusion: never use the member card number as the primary key.

Select member phone number or ID number

Can the member's phone number be the primary key? Not good either. In practice, phone numbers are also recycled by carriers and reissued to other people.

What about the ID number? It seems ok. Because the id card will never be repeated, the id number and a person have a one-to-one relationship. But the problem is that the ID number is private and customers may not be willing to give it to you. Forcing members to register their ID numbers would drive many customers away. In fact, the customer phone also has this problem, which is why we allow the id number and phone number to be empty when designing the membership information form.

Therefore, it is recommended to avoid using business-related fields as the primary key. After all, as project design technicians, none of us can predict which business fields will be duplicated, or reused, for the sake of the project’s business requirements, throughout the project’s life cycle.

Lesson: A common mistake many people make when starting out with MySQL is to use the business fields as the primary key, assuming they know what the business needs are, but the reality is often unexpected, and the cost of changing the primary key Settings is very high.

13.3 Primary key design of Taobao

In Taobao's e-commerce business, the order service is a core business. So how is the primary key of Taobao's order table designed? Is it an auto-increment ID?

Open Taobao and look at the order information:

As you can see from the figure above, the order number is not an increment ID! Let’s take a look at the above four order numbers in detail:

1550672064762308113
1481195847180308113
1431156171142308113
1431146631521308113

The order number is 19 digits long, the last 5 digits of all the orders are identical (08113), and the first 14 digits of the order number are monotonically increasing.

Bold guess, taobao order ID design should be:

Order ID = time + deduplication field + the last 6 digits of the user ID

Such a design can be globally unique and extremely friendly to distributed system queries.

13.4 Recommended primary key design

Non-core business: auto-increment IDs are fine as the primary keys of the corresponding tables, such as alarms, logs, and monitoring information.

Core business: The primary key design should be at least globally unique and monotonically increasing. Globally unique is guaranteed to be unique across systems, and monotonically increasing is expected to insert without affecting database performance.

The simplest primary key design is recommended here: UUID.

UUID features: Globally unique, occupying 36 bytes, data is out of order, and insert performance is poor.

Know the UUID:

  • Why are UUID’s globally unique?
  • Why does a UUID take up 36 bytes?
  • Why is the UUID out of order?

The UUID of the MySQL database is as follows:

UUID = time + UUID version (16 bytes) - clock sequence (4 bytes) - MAC address (12 bytes)

Take the UUID e0ea12D4-6473-11EB-943C-00155dbaa39d as an example.

Why are UUID's globally unique?

In the UUID, the time part occupies 60 bits, storing a TIMESTAMP-like value counted in 100ns units since 1582-10-15 00:00:00.00. UUID storage is thus more precise than TIMESTAMP, and the probability of duplication in the time dimension drops to 1/100ns. The clock sequence avoids the possibility of time repetition when the clock is rolled back. The MAC address is globally unique.

Why does a UUID take up 36 bytes? Uuids are stored as strings and are designed with useless “-” strings, so a total of 36 bytes is required.

Why are UUID’s randomly unordered? Because the UUID design puts the time low first, and this part of the data is always changing, and is out of order.

Transform UUID

If the high and low parts of the time are swapped, time becomes monotonically increasing, and the UUID becomes monotonically increasing too. MySQL 8.0 can swap the time-low and time-high storage positions, turning the UUID into an ordered UUID.

MySQL 8.0 also addresses the space footprint of UUID strings by removing meaningless “-” strings from UUID strings and storing them in binary type, reducing the storage space to 16 bytes.

MySQL 8.0 provides the function uuid_to_bin for this conversion, along with the reverse function bin_to_uuid.

Transformation:

SET @uuid = UUID();
SELECT @uuid,uuid_to_bin(@uuid),uuid_to_bin(@uuid,TRUE);

uuid_to_bin(@uuid, TRUE) converts the UUID into an ordered UUID. Globally unique and monotonically increasing: isn't this exactly the primary key we want?
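A sketch of using an ordered UUID as a primary key; the table and column names are illustrative:

CREATE TABLE order_demo (
  id BINARY(16) PRIMARY KEY,   -- 16 bytes instead of the 36-byte string form
  sn VARCHAR(32)
);
INSERT INTO order_demo VALUES (uuid_to_bin(UUID(), TRUE), 'demo-001');
SELECT bin_to_uuid(id, TRUE) AS id, sn FROM order_demo;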

Ordered UUID performance test

How does a 16-byte ordered UUID compare in performance and storage space with the previous 8-byte auto-increment ID? Let's test: insert 100 million rows, each row occupying 500 bytes, with 3 secondary indexes. The final results are as follows:

From the figure above, inserting 100 million ordered UUIDs is the fastest, and in real business use ordered UUIDs can be generated on the application side, which further reduces the number of SQL interactions. In addition, although an ordered UUID is 8 bytes longer than an auto-increment ID, it only increases storage by 3GB, which is acceptable.

In today’s Internet environment, database designs with auto-increment ids as primary keys are highly discouraged. A globally unique implementation like ordered UUID is more recommended.

In a real service system, the primary key can also add service and system attributes, such as the user’s tail number and equipment room information. This primary key design is even more challenging for the architect.

What if it’s not MySQL8.0?

Manually assign the primary key field!

For example, consider designing the primary key of each branch store's membership table: if the data generated independently by each store's machine needs to be merged, duplicate primary keys may occur.

You can add a field to the management information table in the headquarters MySQL database to record the maximum number of current members.

When adding a member, a store first obtains the current maximum value from the headquarters MySQL database, adds 1 to it, and uses the result as the new member's id; at the same time, it updates the current maximum member number in the management information table of the headquarters MySQL database.

In this way, when each store adds members, it will operate the data table field in the same MySQL database of the headquarters, thus solving the problem of member number conflict when each store adds members.
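A sketch of that counter update with assumed table and column names (hq_manage_info, max_member_id), run inside a transaction at headquarters so concurrent stores do not collide:

START TRANSACTION;
SELECT max_member_id FROM hq_manage_info WHERE id = 1 FOR UPDATE;  -- lock the counter row
UPDATE hq_manage_info SET max_member_id = max_member_id + 1 WHERE id = 1;
COMMIT;
-- the new value becomes the member id used by the branch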

Reference articles

  • "MySQL Technology Insider: InnoDB Storage Engine (2nd Edition)", chapters 10-17
  • "Database Index Design and Optimization"