preface

This article mainly refers to chapter 5 of “High Performance MySQL” and chapter 3 of “How to Cultivate MySQL DBAs”, which is a practice and supplement of the original book. Last time we focused on the basic operation of MySQL, this time we will talk about indexes and EXPLAIN.

I. What is an index?

If you want to learn MySQL related technology in depth, rather than just stay in simple CURD, you need to master index technology first. So what is an index?

The easiest way to understand how indexes work in MySQL is to look in the “Indexes” section of a book: if you want to find a specific topic in a book, you will usually look at the “indexes” and find the corresponding page number. Therefore, as the data in the table becomes more and more, it will become slower and slower to search records one by one. We need to build a “directory” like a “dictionary” to help us still quickly find the records we want. This “table of contents” existence is an index. In MySQL, the storage engine uses indexes in a similar way, first finding the corresponding value in the index, and then finding the corresponding row in the database table based on the matching index record.

An index, also known as a “key” in MySQL, is a data structure used by the storage engine to quickly find records. Why data structures? Because the index itself is to solve the search problem, search sorting is often encountered in the algorithm, we usually have some corresponding search algorithm, traversal, binary, binary search tree, red black tree, hash table, etc. (details can see the orange “algorithm” book). Some fast search algorithms have their corresponding data structure to achieve, index is a data structure implemented by the storage engine can be quickly used to find records in the database. We will see later that the data structures can be B-tree, hash index, R-tree, full-text index, etc.

Index optimization is probably the most effective way to optimize query performance. Indexes can easily improve query performance by several orders of magnitude, and an “optimal” index can sometimes perform two orders of magnitude better than a “good” index. Creating a truly “optimal” index often requires rewriting the query.

Multiple indexes can be created for the same table. Just like the index of Xinhua Dictionary, there are not only pinyin, but also strokes and radicals. In addition to allowing different fields to be indexed, combinations of fields can also be indexed, and the order in which they are combined also affects the query speed. MySQL even allows multiple indexes to be created on the same field, but this is not desirable.

II. Index type

For database indexing-related issues, there are many basic computer operating system terms that need to be understood and understood. Here are some concepts that need to be understood in advance in this article:

  • CPU intensive: CPU intensive is also called computing intensive, which means that the performance of the hard disk and memory of the system is much better than that of the CPU. In this case, the system is in a state where the CPU is Loading 100%, and the I/O can be completed in a very short time by the CPU, while the CPU has a lot of operations to process. CPU Loading is high.

  • IO intensive: Indicates that the CPU performance of the system is much better than that of the hard disk or memory. In this case, the CPU is waiting for I/O (hard disk or memory) read/write operations and does not make full use of the CPU capacity.

  • Operating system page: Disk read/write speeds are much slower than main memory, so minimize disk I/O and read/write operations to improve efficiency. To do this, the disk does not read strictly on demand, but prereads each time. Even if only one byte is needed, the disk reads a certain length of data backwards from that location into memory, based on a theory known in computer science as locality. Proofread the length of a multiple of that of the general order page (page), page is computer management logical block of memory, the hardware and operating systems tend to main memory and disk storage area is divided into continuous blocks of equal size, each storage blocks is called a page (in many operating systems, the page size is usually 4 k), main memory and disk page to exchange data for the unit. When the data that the program is trying to read is not in main memory, a page-missing exception will be triggered. At this time, the system will send a disk read signal to the disk. The disk will find the starting location of the data and load one or more pages back into memory successively.

  • The database page: Database file storage is a page storage unit. A page can store N rows of data. The common page types are data pages and index pages. Take The InnoDB engine as an example. Its logical storage structure is shown in the following figure. All data is logically stored in a system calledTable space (tablespace). Table Spaces are separated bySegment (segment)Area (among)Page (page)Composition. A table space is composed of segments. Common segments include data segments, index segments, and rollback segments. InnoDB storage engine tables are organized by index, so data is index and index is data. Then the data segment is the leaf node of the B+ tree (leaf node segment), the index segment is the non-leaf node of B+ tree (non-leaf node segment). Extents are made up of 64 consecutive pages, each 16KB in size, or 1MB in size for each extents. Pages are the smallest unit of InnoDB disk management. InnoDB storage engine is row oriented, which means data is stored row by row. The number of rows per page is also rigidly defined, up to the maximum allowed16KB/2-200Row records, i.e. 7992 row records.

  • Sequential IO: Each time a target block of the disk is accessed, the arm needs to move to the correct track. This time is the addressing time. The disk then needs to rotate to the correct sector, and this time is the rotation delay. Obviously, the total amount of time spent depends on where the head is started, and where the sector is being accessed. If the target block is just below the head, there is no need to wait; If you have just passed the head, you have to wait a cycle time. After the previous target block is found, the next target block is immediately behind the disk block that was accessed. The magnetic head can encounter the I/O immediately without waiting. This is called sequential I/O.

  • Random IO: If the next target block is in another part of the disk, accessing it will have the same seek and rotation delay. We call this type of IO random IO. A lot of database design is also to make full use of sequential I/O, traditional database architecture has little resistance to random I/O, random I/O almost all DBAs fear, MySQL InnoDB uses transaction logs to turn random I/O into sequential I/O.

  • Main memory access process: From an abstract perspective, main memory is a matrix of a series of storage units, each storing a fixed size of data. Each storage unit has a unique memory address. When the system needs to read the main memory, it puts the address signal on the address bus to the main memory. After the main memory reads the address signal, it parses the signal and locates it to the specified storage unit, and then puts the data of the storage unit on the data bus for other components to read. The process of writing the main memory is similar. The system will write the unit address and data on the address bus and data bus respectively. The main memory reads the contents of the two buses and performs corresponding write operations. Here we can see that the time of main memory access is only related to the number of access, and the location of the data access has nothing to do with, because reading and writing data is equivalent to directly locating coordinates according to the address.

  • Disk access process: When data needs to be read from the disk, the system will transmit the logical address of the data to the disk. The control circuit of the disk translates the logical address into the physical address according to the addressing logic, that is, to determine which track and sector the data to be read in. This process takes time including seek time and spin time. Due to the characteristics of the storage medium, the disk itself access is much slower than the main memory, coupled with the cost of mechanical movement, the disk access speed is often a few hundredths of the main memory, so in order to improve efficiency, to minimize disk I/O.

An index is a data structure, so of course an index type is a different type of data structure. There are many types of indexes that can provide better performance for different scenarios. In MySQL, indexing is implemented at the storage engine layer rather than the server layer. As a result, there is no universal index standard: indexes in different storage engines work differently, and not all storage engines support all types of indexes. Even if multiple storage engines support the same type of index, the underlying implementation may differ. We will focus on the index types supported by MySQL.

B-tree indexes

When you talk about indexes, if you don’t specify the type, you’re probably talking about b-tree indexes, which use b-tree data structures to store data. So first let’s look at the B-tree data structure.

① B tree, B- tree, B+ tree

First of all, it should be made clear that a B-tree is a B-(minus) Tree. All b-trees in this paper are B-bar trees, not minus signs. If you want to represent a b-subtraction Tree, you can directly use the Chinese character minus or B Tree to represent the same meaning.

B tree

Binary search trees are not balanced trees, and in extreme cases the search performance may be very low, hence the existence of balanced binary trees such as red-black trees. A B tree is also a balanced tree, and the order of a B tree, or the maximum degree of a node, is not limited to 2 or 3, as opposed to a 2 or 3 balanced tree like a red-black tree. In terms of search efficiency, the general order is greater than or equal to 3, which is represented by M. Suppose a non-empty B-tree that satisfies the following properties:

  • The root node has at least two children;
  • Each intermediate node containsk-1Elements andkA child,kBelong to [The ceil (m / 2), m];
  • Each leaf node containsk-1An element,kBelong to [The ceil (m / 2), m];
  • All leaf nodes are in the same layer (balanced tree);
  • The elements in each node are arranged from smallest to largestk-1Zero is exactly zerokThe range division of the elements contained by the child;
  • Each node containsk-1Elements, andkA child’s pointer.

A standard B-tree looks like this:

We can see that when we look for an element, we need to read h (the height of the b-tree) at most. If a piece of data needs to be inserted, the structure of the B-tree needs to be adjusted to maintain the above properties. If the number of elements in a node after insertion is greater than M, then splitting is required before insertion. Similarly, if a piece of data is deleted and the number of elements on the deleted node is less than CeiL (M /2), the corresponding merge operation is also performed.

Using the data structure of B tree to store data, we can define data and corresponding index information as a combination [key, data], key is the index of data. Then a simple B-tree can be expressed as:

Each node contains K-1 index values, K-1 corresponding data (excluding the index values), and K Pointers to child nodes.

B + tree

B+Tree is actually a variant of B Tree. MySQL generally uses the data structure of B+Tree to achieve indexes, including the main storage engine MyISAM and InnoDB. B+ trees are different from B trees in the following ways:

  • Each intermediate node containskElements andkOne child is equal to the number of Pointersk;
  • Non-leaf nodes do not store data, but only keys.
  • All intermediate node elements coexist in child nodes, which are the largest (or smallest) element in the child node element.

The following is a diagram of a B+ tree that satisfies all three properties.

B+ tree with sequential access Pointers

The B+ tree structure commonly used in database systems or file systems is optimized on the basis of classical B+ trees, adding sequential access Pointers. The B+ tree with sequential access Pointers shown below is the B+ tree we often see.

Compared with the previous figure, the main difference is that each leaf node adds a pointer to adjacent leaf nodes, thus forming a B+ tree with sequential access Pointers. The purpose of this optimization is to improve the performance of interval access. If you want to query all data records whose key is in a certain range, after finding the first data, you can access all data nodes at one time only by traversing the sequence of nodes and Pointers, which greatly improves the efficiency of interval query.

② Why do databases use B+ trees?

The amount of index data in a database is also large, so it is stored on disk rather than in memory. So when you add, delete, change and look up the data, you need to read the index content, which is disk I/O. According to the previous concepts, as few time-consuming disk I/O operations as possible, the number of disk I/O operations can be used to evaluate the performance of an index data structure.

Let’s start with binary search tree and red-black tree. The order of these two trees is fixed, and the number of child nodes of each node is very small. As a result, if there are many indexes, the tree depth is very deep, and the number of comparison times for searching will be very large, and the performance will be seriously affected.

And the B tree, because it has order M, can be set to be large, so that the factor that determines the number of query comparisons – the tree depth can be very shallow. According to the definition of B-tree, a maximum of H nodes need to be accessed in a search. The designers of the database system took advantage of disk prefetch by setting the size of each node of the tree to be equal to one page, so that each node could be fully loaded with only one disk I/O. In order to achieve this, we need to use the following techniques when actually implementing B trees:

  • Each time a new node is created, a page space is requested directly, so that a node is physically stored in a page, and computer storage allocation is page aligned, so that a node only needs one disk I/O;
  • A maximum of one query in the B tree is requiredh-1Because the root node resides in the memory, the progressive complexity is O(h)=O(logdN) O(h)=O(logdN) O(h)=O(log_dN) O(h)=O(logdN). In general practical applications, the degreedIt’s a very large number, usually over 100, sohVery small (usually no more than 3,3 is already 10, 6, 106, 10^6, 10^6, 6 levels of data).

So the index efficiency of using B tree as database is much higher than red black tree and so on.

However, MySQL MyISAM and InnoDB both use B+ trees with sequential access Pointers to implement indexes. Why? Comparing the difference between B+ tree and B tree, except that the leaf node has sequential access Pointers to help the range query, the B+ tree on the non-leaf node only has key, and no additional data. As we have said before, the size of each node in a typical tree is equal to the size of a page. In the case of a fixed capacity, the number of indexes a node can contain is smaller than that of a B+ tree because of the need to store data records. In other words, the upper limit of the degree D of a non-leaf node depends on the size of key and data in the node. The specific formula is as follows:

$d_{Max}= Number of indexes that can fit in a node =floor(pagesize/(keysize+datasize+pointsize)) $

Since data is removed from the non-leaf node of B tree, it can have greater output and better performance.

③ MySQL b-tree index

High-performance MySQL has always used the b-tree index, which is actually a B+ Tree from a technical implementation perspective.

Suppose we now have a data table with three fields: primary key Col1, secondary index Col2, and field Col3.

MyISAM index implementation

From the above description of B+ trees, we can roughly infer the index structure. Let’s look at MyISAM’s schematic for primary key indexes:

It can be seen that in the B+ tree of MyISAM, the non-leaf node only stores the primary key value, while the non-leaf node stores the address of the corresponding record in the database. We can locate every record through the address. We know that each node corresponds to a page, and each node contains multiple rows of database records (2 in the figure). It is important to note that logically adjacent records may not be physically on the same page, such as row 2 and row 3 data in a table, which are on different pages.

Let’s look at the structure of the secondary index Col2. The leaf node of the secondary index contains the key value, and the index row in each leaf node contains a bookmark that tells the storage engine where to find the corresponding data row. The bookmark of the secondary index in MyISAM storage engine is the address, which is no different from the primary key index.

Also a B+ tree, the data field holds the address of the database record. Therefore, the index retrieval algorithm in MyISAM is to search the index according to the B+ tree search algorithm. If the specified Key exists, the value of its data field is extracted, and then the corresponding data record is read with the value of the data field as the address.

InnoDB index implementation

InnoDB also uses B+ trees as index structures, but the implementation is quite different from MyISAM.

Let’s first look at how primary key indexes are implemented.

The most significant difference from MyISAM’s primary key index is the content of the leaf node. The MyISAM index file and the data file are separate, the index file only holds the address of the data record. In InnoDB, the table data file itself is an index structure organized by B+ tree. The data field of the leaf node of this tree stores complete data records. The key of the index is the primary key of the data table, so the data file of InnoDB engine is the primary index file. This structure of data and indexes is called a clustered index, or clustered index. Since InnoDB’s data files themselves are aggregated by primary keys, InnoDB requires tables to have primary keys. If not explicitly specified, the MySQL system automatically selects a column that uniquely identifies the data record as the primary key. If no such column exists, MySQL automatically generates an implied field for the InnoDB table as the primary key. This field is 6 bytes long and of type long integer.

Let’s look at the secondary index implementation structure of InnoDB. We add secondary indexes to the Col3 field of the table.

Unlike the MyISAM index, InnoDB’s secondary index data field stores the value of the primary key of the corresponding record rather than the address. In other words, all secondary indexes in InnoDB refer to the primary key as the data field, because we can also query the entire database record through the primary key. The implementation of clustered indexes makes searching by primary key very efficient, but secondary index queries require a secondary query: first retrieve the secondary index for the primary key, and then use the primary key to retrieve records from the primary index.

(4) Use and optimize indexes correctly

Knowing the way of index implementation is of great help to us to understand the correct use of index and optimization principle. Here are some common strategies for using indexes.

Suppose we now have a table:

CREATE TABLE People (
    last_name varchar(50) not null,
    first_name varchar(50) not null,
    birth date not null,
    gender enum('m', 'f') not null,
    key(last_name, first_name, birth)
);
Copy the code

Four fields are defined in the table, including last name, first name, date of birth and gender, and a composite index is established to contain last name, first name and date of birth. A small part of the index’s B+ tree structure looks like this:

It can be seen that non-leaf nodes store index field information. B-tree organizes and stores index columns in order. Indexes are sorted according to certain sorting rules, such as alphabetical order of name and date. Based on such a structure, writing reasonable SQL statements, we can find the records we need very quickly.

  • Full value matching: Full value matching refers to the matching of all columns in the index, that is, the condition of our query statement is exactly the same as one of the index columns, not only the content, but also the order. It is easy to understand that this is a simple recursive search tree process.
  • Match the leftmost prefix: Of course we want to benefit from the current index tree. The query condition does not need to be a full value match. We can only include the first field in the index column, for example, we find all the people whose last name is Allen. Of course, if the query contains only the name or birthday, the current index tree cannot be utilized.
  • Match column prefixes: Again, our query criteria may not even have complete information about the first column field, such as matching only the beginning of the first column value. For example, find someone whose last name starts with All.
  • Exactly match one column and range match another: from previous experience, we can naturally deduce that we can exactly match the index column of the first part, and the index column of the second part is only the form of the leftmost prefix. For example, we’re looking for someone with a specific last name but only a date of 19xx.
  • Matching range value: Matching range value can benefit from the whole structure of the index tree, B+ tree leaf nodes add sequential access Pointers, making it faster. We’ve talked about this before.
  • Overwrite index: An overwrite index is a query that accesses only the index, not the row. If the index column of the information we want to query is already fully contained, we do not need to go to the leaf node to find the primary key or the address of the record, and then look up the information in the corresponding data record. Such a process is actually called a secondary query.
  • ORDER BY and GROUP BY: The nodes in an index tree are ordered, so indexes can be used in queries in addition to lookups by valueORDER BYGROUP BYOperation.

Of course, there are some limitations on the b-tree index from the previous index implementation:

  • An index cannot be used unless the search starts in the leftmost column of the index. For example, the index in the above example cannot be used to directly look up a person named Bill, or a person of a particular birthday, because neither column is a left-most prefix. Similarly, you can’t find people whose last names end with a certain letter.
  • Columns in an index cannot be skipped. That is, the index described above cannot be used to find someone whose last name is Smith and who was born on a particular date. If no name is specified, MySQL can only use the first column of the index — the last name column.
  • If a column in a query condition is a range query, all columns to the right of it cannot be found using index optimization. If you have a limited number of range query column values, you can replace a range condition by using multiple equal conditions.

The hash index

① What is a hash index?

A hash index is implemented based on a hash table, and only queries that exactly match all columns of the index are valid. For each row of data, the storage engine computes a hash code for all index columns, which is a small value, and evaluates different hash codes for rows with different key values. A hash index stores all hash codes in the index, along with Pointers to each row of data in a hash table.

In MySQL, only the Memory engine explicitly supports hash indexes. This is also the default index type for tables in the Memory engine, which also supports B-tree indexes. It is worth mentioning that the Memory engine supports non-unique hash indexes, which is quite unusual in the database world. If multiple columns have the same hash value, the index stores multiple record Pointers in a linked list into the same hash entry.

The hash index itself only needs to store the corresponding hash value, so the index structure is very compact, which also makes the hash index search very fast. In terms of implementation principles, we can think of this as a huge collection of HashMaps. So hash indexes naturally have their limits:

  • Hash indexes contain only hash values and data row Pointers, but do not store field values, so you cannot use values in the index to avoid reading rows, i.e., there is no overwriting index. However, access to rows in memory is fast, so in most cases this doesn’t have a noticeable effect on performance.
  • Hash index data is not stored in index order and therefore cannot be used for sorting.
  • Hash indexes also do not support partial index column match lookups, that is, full value matches, because hash indexes always use the entire contents of the index column to calculate the hash value. Index columns with a little less hash code are not the same, so partial matching is not possible.
  • Hash indexes only support equivalent comparison queries, including =, IN(), <=>, and do not support any range queries.
  • Accessing hashed indexed data is very fast unless there are a lot of hashing conflicts. When a hash conflict occurs, the storage engine must traverse all the row Pointers in the linked list, comparing them row by row until it finds all the rows that match the criteria.
  • Some index maintenance operations can be expensive if there are many hash conflicts. For example, adding a hash index to a gender column is of little value because there are only two conventional genders, so the hash conflict is very serious.

The InnoDB engine has a special feature called adaptive Hash Index. When InnoDB notices that some index values are being used very frequently, it creates a hash index in memory on top of the B-tree index, which gives the B-tree index some of the benefits of hash indexes, such as fast hash lookups. This is a completely automatic, internal behavior that the user cannot control or configure, although it can be turned off if necessary.

② Use custom hash indexes to improve performance

Create a custom hash index. If the storage engine does not support hash indexes, you can simulate creating hash indexes like InnoDB, which can enjoy some of the benefits of hash indexes, such as creating indexes for very long keys with very small indexes.

The idea is very simple: create a pseudo hash index based on the B-tree. This is not the same thing as a real hash index, because b-tree is still used for lookups, but it uses hash values instead of the keys themselves for index lookups. All you need to do is manually specify the use of hash functions in the WHERE clause of the query.

For example, we often insert links in the database. These links tend to be very long and not very selective. If we use b-tree to store urls, the contents of the store will be very large. Our query statement is as follows:

mysql> SELECT id FROM url WHERE url="http://www.mysql.com";
Copy the code

If you drop the index on the original URL column and add an indexed URL_CRC field to store the URL hash value, you can not only reduce the string size, but also improve performance.

mysql> SELECT id FROM url WHERE url="http://www.mysql.com" AND url_crc=CRC32("http://www.mysql.com");
Copy the code

The MySQL optimizer uses this highly selective and small index based on the URL_CRC column to perform the lookup. Even if there are multiple records with the same index value, lookups are still fast, and it only takes a quick integer comparison of hash values to find index entries and return the corresponding rows. Instead, you index the entire URL string, which is very slow.

The downside of this implementation, of course, is the need to maintain hashes. It can be maintained manually or implemented using triggers. In addition, the advantages and disadvantages of the hash algorithm also need to be noted, because it affects the selectivity of the hash index. Index selectivity refers to the ratio of the number of different values in the index column to the total number of records in the table.

III. Use EXPLAIN

The EXPLAIN tool can verify that the execution plan is good and that the query follows the appropriate index. The MySQL optimizer varies from version to version, some optimization rules may change with the version, and the execution plan of the query may change as the data changes. In this case, we can use the EXPLAIN tool to verify our judgment.

use

The syntax is:

Explain the select...Copy the code

There are also two variants:

Explain Extended Select... Show WarningsCopy the code

Extend decompilates the execution plan into a SELECT statement. Show Warnings is an optimized query statement for MySQL.

Another variation is:

The explain partitions the select...Copy the code

This command is used to EXPLAIN commands for partitioning tables. Partitioning is the partitioning of data into multiple locations, either on the same disk or on different machines. After partitioning, the table is still ostensibly a single table, but the data is hashed to multiple locations. MySQL server automatically organizes partitioned data.

Let’s use the sample database provided in the official MySQL documentationemployeesIn thetitlesAs an example. First look at its full index, you can see that the first three columns make up the primary key index<emp_no, title, from_date>, and creates a separate secondary index<emp_no>. The index information section in the following image has been removed to consider the image width.



We make a query and analyze it with EXPLAIN.



The table tells us which tables MySQL accesses and how it accesses the data. It contains important index usage information to determine if an index needs to be optimized.

Return message interpretation

For the table returned by EXPLAIN above, we did a specific study of the meaning of each column. All of this information can be roughly represented by the following mind map:

1) id

idContains a set of numbers that represent execution in the queryselectThe order of clauses or tables of operations. ifidIf the query is the same, it is a group. The execution sequence is from top to bottom. If the query is subquery,idThe serial number will increase,idA larger value indicates a higher priority and is executed earlier.

(2) select_type

Select_type specifies the type of each select clause in a query.

  • SIMPLE: The query does not contain subqueries orUNION;
  • PRIMARY: If the query contains any complex subparts, the outermost query is marked asPRIMARY;
  • SUBQUERYIn:SELECTWHEREThe list contains subqueries marked asSUBQUERY;
  • DERIVED: is used to indicate inclusion inFROMClause of the subquerySELECTMySQL executes recursively and puts the results into a temporary table. Internally, the server is called a “derived table” because the temporary table is derived from a subquery.
  • UNION: If the secondSELECTAppear in theUNIONAfter that, it is marked asUNION;
  • UNION RESULTFrom:UNIONIn the results tableSELECT

As an example, the following SQL statement is used

select vt1.dept_no
from (select emp_no, dept_no from dept_emp where emp_no < (select emp_no from employees where emp_no = 10010)) vt1
union
(select vt2.emp_no from (select emp_no from dept_manager) vt2 where emp_no < 110300);
Copy the code

Execute the corresponding EXPLAIN statement to view the execution plan. To turn off the optimizer from MySQL5.7 and introduce derived_merge, processing derived tables and views in from statements better avoids unnecessary materialization and results in more efficient execution plans through conditional delegation. Select emp_no from dept_manager (emp_no < 110300, emp_no < 110300, emp_no < 110300, emp_no < 110300, emp_no < 110300) This is actually very slow, in the derived table when the use of the filter emp_NO < 110300, derived table results will be much smaller.

In order to display all queries completely, we turn this optimization off.



As you can see from the figure above, there are six queries, basically including several common onesselect_typeIn order:

  • First execute the query with id=5, looktableThe columns specified are fromdept_manager, can be determined to beselect emp_no from dept_manager, because the subquery is infromIs, soDERIVED;
  • Then execute the query with id=4,tableThe columns specified are fromderived5Where id=5 and id=5unionIn the later part of the query, the outer query looks up the derived table. Due to itsunionAfter, so the query is marked asUNION;
  • Continue with query id=3,tableThe columns specified are fromemployees, can be determined to beselect emp_no from employees where emp_no = 10010, the subquery inwhereCondition, so is the subquery marked asSUBQUERY;
  • Id = 2 queriestableThe columns specified are fromdept_emp, can be determined to beSelect emp_no, dept_no from dept_emp where emp_no = dept_emp where emp_no = dept_emp where emp_no = dept_emp where emp_no = dept_emp, again its result is a derived table, so marked asDERIVED;
  • unionThe previous query is a complex query and is marked asPRIMARY, itstableThe columns specified are fromderived2In the query;
  • The final will beunionThe query results before and after are merged and marked asUNION RESULT.

(3) type

Explain types in MySQL include the following types, from top to bottom, from worst to best.

Type meaning
All Full table scan, MySQL will traverse the full table to find matching rows.
index Index full scan,indexALLThe difference forindexThe type only traverses the index tree.
range The indexThe scope ofScan, a scan of an index that begins at a point and returns rows matching the range of values. The obvious index range scan is withbetweenorwhereClause with <, > query. When MySQL uses indexes to find a list of values, for exampleIN()ORLists will also be displayedrange(Range scan), in which case query performance is often better because there are fewer results.
ref Using a non-unique index scan or prefix scan with a unique index, returns a matchIndividual valuesAll record rows of.
eq_ref similarrefFor each index key value, only one record in the table matches. In simple terms, the association condition is that the primary key or unique index is used in the multi-table join.
const/system These types of access are used when MySQL optimizes part of a query and converts it to a constant. For example, set the primary key towhereList, MySQL can convert the query to a constant. System is a special case of const, and can be used when the table being queried has only one record.
NULL MySQL breaks down statements during optimization without even accessing the table or index. For example, picking the minimum value from an index column can be done by a separate index lookup.

(4) possible_keys and key

Possible_keys specifies which index MySQL can use to find records in the table. Possible_keys will be listed if there is an index on a column involved in the query, but it may not be used by the query.

Key Displays the actual index used by MySQL in the query. If no index is used, the value is NULL. If an overridden index is used in the query, it only appears in the list of keys. For example, the above example demonstrates a query whose type is index. Possible_keys is NULL and key is the auxiliary index EMP_no.

(5) key_len

Key_len represents the number of bytes used in the index, which is used to calculate the length of the index used in the query. Note that the value key_len displays is the maximum possible length of the index field, not the actual length, i.e. key_len is calculated from the table definition, not retrieved from the table.

For example, as shown in the figure below, we usetitlesThe primary key index of a table — a composite index, each of the index columns used in reduced queries,emp_noINTType, 4 bytes;titleVARCHARThe utF-8 character set contains 3 bytes for each character, so 150 bytes for 50 characters, plus 2 bytes for storage, so 152 bytes for 50 characters; The lastfrom_dateDATEThe type takes up three bytes.

6 ref

Represents the join match criteria for the table, that is, which columns or constants are used to find values on indexed columns. Looking at the figure below, the inner and outer joins of the table use filter matching conditions. First look at the outer connection, driven table (a left join bb) use primary key index, drive table as outer loop first execute (id same order from top to bottom), need full table scan without index. For inner joins, MySQL uses the table with fewer data records as the driven table (the inner loop of the Cartesian product), so the last four columns are the same.refWhere do the values used for the index come from.

7 rows

Indicates the number of rows that MySQL estimates to find the desired record based on table statistics and index selection.

End up Extra

Display additional information that is important but not appropriate in other columns. There are four possible types of information, as shown in the table.

Extra information meaning
Using index The value represents the correspondingselectOverwrite indexes are used in the operation.
Using where Indicates that the MySQL server will filter rows after the storage engine retrieves them. manywhereThe condition involves a column in the index, which can be checked by the storage engine when (and if) it reads the index, so not all bandswhere“Will be displayed.”Using where“. Sometimes”Using where“Is a hint that queries can benefit from different indexes.
Using temporary Indicates that MySQL needs to use temporary tables to store result sets, common in sort and group queries.
Using filesort A sort operation in MySQL that cannot be done with an index is called “file sort”.

IV. High-performance index policies

Proper creation and use of indexes are fundamental to high-performance queries. MySQL’s b-tree index has been introduced in detail, so let’s take a look at how it can really take advantage of indexes.

Separate columns

“Independent column” means that the index column cannot be part of an expression or arguments to a function. Here’s an example:

Prefix indexing vs selectivity

Sometimes you need to index very long character columns, which can make the index large and slow. It is usually possible to index the first part of the character, which can greatly save index space and thus improve index efficiency, but in fact sacrifice index selectivity. A more selective index allows MySQL to filter out more rows during lookups. Unique index selectivity is 1, which is the best index selectivity and the best performance.

In general, the selectivity of a column prefix is high enough to meet query performance. For BLOB, TEXT, or very long VARCHAR columns, you must use prefixed indexes because MySQL does not allow indexes to the full length of these columns. The trick is to choose prefixes that are long enough for high selectivity, but not too long (to save space). The prefix should be long enough so that the selectivity of the prefix index is close to indexing the entire column.



At 6 prefix lengths, the selective increase is tiny, approaching 0.0055. Of course, it’s not enough to just look at average selectivity, but there are exceptions, and you need to look at worst-case selectivity. Such as thoughcount(distinct left(last_name, 6))Larger, but not alllast_nameThe number of records is evenly distributed and can be certainlast_nameThere’s a lot of data, so this particularlast_nameThe selectivity of the query is low.

Prefix indexes are an effective way to make indexes smaller and faster, but on the other hand, they have their drawbacks: MySQL cannot use prefix indexes for ORDER BY and GROUP BY, and cannot use prefix indexes for overwrite scans.

Multi-column index and order

Multi-column indexes do not create an index for each column, but create a composite index for multiple columns. Of course, the order of multiple columns is also very important. Creating a single column index on multiple columns does not improve MySQL query performance in most cases. The correct index column order depends on the queries that use the index, and you need to consider how best to meet the sorting and grouping needs.

In a multi-column B-tree index, the order of the index columns means that the index is sorted by the leftmost column first, the second column second, and so on. Therefore, indexes can be scanned in ascending or descending ORDER to meet the query requirements of the ORDER BY, GROUP BY, and DISTINCT clauses that precisely match the column ORDER, so the column ORDER of a multi-column index is critical.

When sorting and grouping are not a concern, it is usually good to put the most selective columns first. The purpose of the index is to optimize the lookup of WHERE conditions. In this case, indexes designed this way do filter out the desired rows fastest, and are more selective for queries that use only the index part prefix column in the WHERE clause. However, performance depends not only on the selectivity of all index columns (the overall cardinality), but also on the specific values of the query criteria, that is, the distribution of values. This is the same as before when choosing the length of the prefix. You may need to adjust the order of the index columns based on those queries that run most frequently, so that the index is most selective in this case.

Clustering index

Clustered index is not a separate index type, but a way of data storage. The details depend on how it is implemented, but InnoDB’s clustered index actually holds the B-tree index and rows in the same structure.

A table can only have one clustered index because it is impossible to store rows in two different places at the same time. MySQL does not allow you to manually specify that index as a clustered index. InnoDB primary key is a clustered index. If no primary key is defined, InnoDB selects a unique non-empty index instead. If there is no such index, InnoDB implicitly defines a primary key as the cluster index.

Clustering primary keys can help with performance, but it can also cause serious performance problems.

The main advantages are as follows:

  • You can keep related data together. The cluster index itself contains data, and it does not need to read the corresponding disk address when obtaining other data to generate disk I/OS.
  • Faster data access. Clustered indexes keep indexes and data in the same B-tree, so getting data from a clustered index is usually faster than looking it up in a non-clustered index.
  • Queries that use an overridden index scan can directly use the primary key value in the page node. That is, an overwrite index contains a primary key field by default, so you can design an overwrite index without adding a primary key field, as it must.

There are drawbacks, of course:

  • Clustered data maximizes performance in I/ O-intensive applications, but if the data is all in memory, the order of access is less important and clustered indexes have little advantage.
  • Insertion speed depends heavily on insertion order. Primary key insertion is the fastest way to load data into An InnoDB table. But if the data isn’t loaded in primary key order, it’s a good idea to use the OPTIMIZE TABLE command to reorganize the TABLE after loading.
  • Updating clustered index columns is expensive because InnoDB is forced to move each updated row to a new location. Tables based on clustered indexes can suffer from “page splitting” when new rows are inserted, or when the primary key is updated so that rows need to be moved. Because each node of the B-tree index occupies one page to hold the index, when the primary key value of a new row of data records requires that the row be inserted into a full page, the storage engine splits the page into two pages to hold the row. This is a page splitting operation. Page splitting causes tables to take up more disk space.
  • Clustered indexes can cause slow full table scans, especially if rows are sparse or the data store is discontinuous due to page splitting.
  • Secondary indexes (non-clustered indexes) can be larger than expected because the leaf nodes in the secondary index contain primary key columns that reference rows. Secondary index access for non-overwritten indexes requires a second query.

In the introduction of MyISAM and InnoDB storage engines, we learned how the two engines organize data. The MyISAM leaf node holds “row Pointers” to specific data record addresses. MyISAM is stored on disk in the order of data insertion, and the corresponding address is recorded by the leaf node. In contrast, InnoDB engine, data itself is recorded in the primary key index of the leaf node, the order of data insertion is completely based on the primary key in the whole B-tree position, if the primary key is out of order, then the data insertion phenomenon will appear on the left side of the Tree, one on the right side, one on this side, one on that side. What are the dangers of doing this? First, it takes a long time to insert, and second, it may take up more space and become more fragmented. Details are as follows:

  • The written target page may have been flushed to disk and removed from the cache, or it may not have been loaded into the cache. InnoDB has to find and read the target page from disk into memory before inserting. This results in a lot of random I/O.
  • Because writes are out of order, InnoDB has to split pages frequently to allocate space for new rows. Page splitting causes large amounts of data to be moved, and at least three pages need to be modified at a time instead of one.
  • Due to frequent page splitting, pages become sparse and irregularly populated, so the data ends up fragmented.

So, how to take advantage of InnoDB primary key sequential insert data feature?

That’s ordered primary keys, like self-growing columns. Since primary key values are sequential, InnoDB stores each record after the previous one. When the maximum fill factor for a page is reached (InnoDB’s default maximum fill factor is 15/16 of the page size, leaving some room for later modification), the next record is written to a new page. Once the data is loaded in this sequential fashion, the primary key page is approximately filled with sequential records, which is exactly what is expected.

Cover index

The benefits of using a covered index are:

  • The number of index fields is usually much smaller than the total number of fields in the data record, so if you only need to read the index, MySQL can greatly reduce the number of data visits. This is important for the load on the cache, because most of the response time in this case is spent copying data. Overwriting indexes is also helpful for I/O intensive applications because indexes are smaller than data and easier to fit into memory altogether (this is especially true for MyISAM, which can compress indexes to become smaller).
  • Overwrite indexes are especially useful for InnoDB tables because of InnoDB’s clustered indexes. InnoDB’s secondary index holds the primary key value of the row in the leaf node, so if the secondary primary key overrides the query, it can avoid a secondary query on the primary key index.

Not all types of indexes can be covered indexes. An overwrite index must store the value of an index column. A hash index, a spatial index, and a full-text index do not store the value of an index column. Therefore, MySQL can only use the B-tree index as an overwrite index.

The leaf nodes of InnoDB’s secondary index all contain primary key values, which means InnoDB’s secondary index can effectively use these “extra” primary key columns to override queries. In other words,The secondary index query primary key is also an overwrite index.

Sort using index scans

We can see that the ORDER BY sort can benefit from the index. The ORDER BY clause has the same restrictions as a lookup query: the left-most prefix of the index needs to be satisfied; Otherwise, MySQL will need to perform a sort operation and cannot use indexes to sort.

Of course, there is one case where the ORDER BY clause does not satisfy the left-most prefix of the index, and that is when the leading column is constant. If you specify constants for these columns in the WHERE or JOIN clause, you can “compensate” for the index.

Here are some examples.

Redundant Redundant indexes that are not used repeatedly

MySQL allows multiple indexes to be created on the same column, intentionally or unintentionally. MySQL needs to maintain duplicate indexes separately, and the optimizer also needs to consider them individually when optimizing queries, which can affect performance.

Duplicate indexes are indexes of the same type that are created in the same order on the same columns. You should avoid creating duplicate indexes in this way and remove them as soon as they are discovered.

There are some differences between redundant and duplicate indexes. If an index (A, B) is created, creating index (A) is redundant because it is just A prefixed index of the previous index. So indexes (A, B) can also be used as indexes (A) (this redundancy is only for b-tree indexes). But if you create index (B, A) again, it is not redundant, and neither is index (B), because B is not the left-most prefix column of index (A, B).

In addition to redundant and duplicate indexes, there may be some indexes that the server never uses. Such an index is completely redundant and is recommended to be removed.

V. summary

In MySQL, b-tree indexes are used in most cases. Other types of indexes are mostly for specific purposes only. If used in the right scenario, indexes can greatly improve query response time. There is no more to cover in this chapter, but it is worth reviewing these features and how to use b-tree indexes in general.

There are three principles to always keep in mind when selecting indexes and writing queries that leverage them:

  • Single-line access is slow. Especially in mechanical hard disk storage (SSD random I/O is much faster, but still true). If the server reads a block of data from storage just to get one row, a lot of work is wasted. It is best to read a block that contains as many rows as you need. Use indexes to create location references for efficiency.
  • Sequential access to range data is fast. There are two reasons for this. First, sequential I/O does not require multiple disk seeks, so it is much faster than random I/O, especially for mechanical hard drives. Second, if the server can read the data in the order it needs it, no additional sorting operations are required, andGROUP BYQueries no longer need to sort and aggregate rows by group.
  • Index coverage queries are fast. If an index contains all the columns needed for a query, the storage engine does not need to go back to the table to look for rows. This avoids a lot of single-line access.

In addition to this, the use of EXPLAIN is critical.

Refer to the reading

  • What are CPU intensive and IO intensive?
  • A brief introduction to algorithms and data structures: B tree of ten balanced search trees
  • The data structure and algorithm principle behind MySQL index
  • MySQL DBA Training
  • High Performance MySQL