Writing in the front

This article is to learn the nuggets of the booklet “how MySQL is running: from the root to understand MySQL” after finishing, the article a lot of use and reference the contents of the booklet. In addition, the booklet is very good, the explanation is very in place, recommended reading.

In our previous article, how Data is stored in InnoDB, we covered the details of MySQL data storage, including row formats and pages. We know that pages are divided into many kinds. This article mainly involves two kinds of data pages and index pages. We know that data pages store data, but what is the index page used for?

Ok, more don’t say, less don’t Lao, let’s start straight away.

Just think, if there is no index, we can only rely on the double-linked list composed of leaf nodes to query the data, we can only go through backwards from the beginning, such efficiency is very low, the emergence of index is to solve this problem.

InnoDB index

For the sake of description, suppose we create a table with three columns, A, B, and C, where A is int, b, C char, and A is the primary key. Then we insert some data into the table.

insert into ... values(1.'b1'.'c1');
insert into ... values(2.'b2'.'c2');
insert into ... values(3.'b3'.'c3');
insert into ... values(4.'b4'.'c4'); And so on...Copy the code

MySQL will save the data to the data page, and also generate the index page:

Note the following figure:

  • The data page holds complete data (all of our insert data),
  • Each record in the index page contains the primary key (the data in red) and the page number (which points to the corresponding data page)
  • Each record in the index page does not contain data from other columns other than the primary key (other columns, such as second column B, third column C)
  • Records and pages are sorted from smallest to largest by primary key

Now with the help of the index page, we can use the index page to quickly find the page where the record is located through binary search, and then find the corresponding data in the page, which is much faster than the way to traverse the linked list described at the beginning of the article.

Although each record in an index page only stores the primary key value and the corresponding page number, it can hold more records than a data page, but in any case, a page is only 16KB in size, so if there is too much data in the table, then an index page is not enough. At this point we need more index pages:

If you know anything about data structures, you know that this is a B+ tree.

Clustering index

The B+ tree is itself a directory, or an index. It has two characteristics:

  1. Sorting records and pages using the size of the record primary key has three implications:

    • The records in the page are arranged in a one-way linked list in order of the size of the primary key.
    • Each page storing user records is also arranged in a bidirectional linked list according to the primary key size of user records in the page.
    • The pages that store directory entry records are divided into different levels. The pages in the same level are also arranged in a bidirectional linked list according to the primary key size of directory entry records in the page.
  2. The leaves of the B+ tree store the complete user record.

    A complete user record is one in which the values of all columns (including hidden columns) are stored.

A B+ tree with these two characteristics is called a clustered index, and all the complete user records are stored at the leaf node of the clustered index. The InnoDB storage engine automatically creates clustered indexes for us without explicitly using INDEX statements in MySQL statements (more on indexes later). Another interesting point is that in the InnoDB storage engine, clustered indexes are how data is stored (all user records are stored in leaf nodes), so called index is data, data is index.

Secondary indexes

The clustering index described above only works if the search criteria are primary keys, because the data in a B+ tree is sorted by primary keys. So what if we want to search for a different column? Now we can create a new B+ tree. Let’s index the second column B and assume that the size of the second column B is b1 > B2 > b2 >… > b12 > … , the newly created B+ tree is shown as follows:

There are a few things to note here,

  • The red column is still the primary key, and the blue column is the second column, B
  • Page-to-page and page-to-page records are sorted by column 2, B
  • The data stored in the leaf node is not the complete data, only the primary key and the second column B
  • No longer in internal nodesPrimary key + page numberAnd becomeSecond column b+ page numberThe collocation.

Select * from XXX where b = ‘b7’ select * from XXX where b = ‘b7’

We’ll be the first to look for in the secondary indexes, using the binary search, down from the root node, know to find the corresponding leaf nodes, namely b = ‘b7’ the record, this time can get the record of the primary key is a = 7, we again according to the primary key to clustering index lookup in the b + tree, this process is also called back to the table, Finally, the complete data of this record can be found by clustering index.

Joint index

If we want to sort the B+ tree by the size of B and C columns, we can sort the B+ tree by the size of B and C columns at the same time.

  • Start by following the individual records and pagesbColumns are sorted.
  • In the record ofbColumn the same case as adoptedcColumn sort

Also known as a federated index, it has the following characteristics:

  • Will be contained in the internal nodeb,c,Page numberThese three parts, each record according to the firstbColumn values are sorted if recordedbIf the columns are the same, follow thecTo sort the values of the columns.
  • The user record at the leaf node is determined byb,cAnd the primary keyaOf the column.

Note that a B+ tree sorted by the size of the C2 and C3 columns is called a joint index and is essentially a secondary index.

MyISAM index scheme in a brief introduction

We know that InnoDB index (data) already contains all the complete user records in the leaves of the B+ tree of clustered indexes, while MyISAM index scheme also uses tree structure, but stores indexes and data separately:

  • Records in a table are stored in a separate file in the order in which they are inserted, called a data file. The file is not divided into several data pages, just as many records are crammed into the file. We can quickly access a record by line number.

  • Tables using the MyISAM storage engine store index information in a separate file called an index file. MyISAM creates a separate index for the primary key of the table, but instead of storing the complete user record in the leaf node of the index, it stores a combination of the primary key value and the row number. That is, first through the index to find the corresponding row number, and then through the row number to find the corresponding record!

How to use indexes

The cost of indexing

As the saying goes, nothing is perfect, and the same can be said of indexes. Before we use indexes, we must understand that indexes can sometimes be a drag:

  • The cost in space

    Each node of a B+ tree is a data page. By default, a page takes up 16KB of storage. A large B+ tree consists of many data pages, which is a lot of storage.

  • The cost in time

    Each B+ tree index needs to be modified every time data in the table is added, deleted, or modified. And as we have said, each level of the B+ tree nodes are sorted in ascending order according to the value of the index column to form a bidirectional linked list. Both the records in the leaf node and the records in the inner node (that is, the user record and the directory entry record) form a one-way linked list in ascending order of the index column values. Add, delete, and modify operations may damage the order of nodes and records, so the storage engine needs extra time to perform some operations such as record shifting, page splitting, and page reclamation to maintain the order of nodes and records. If we build a lot of indexes, and each index’s corresponding B+ tree has to perform maintenance operations, wouldn’t that slow down performance?

So, index is good, but don’t “drink” oh.

So what about the queries we normally write that work with indexes and those that don’t? Let’s first look at the application scenarios of B+ tree indexes,

B+ tree indexes apply

Let’s create a table for simplicity:

CREATE TABLE staff_info(
    id INT NOT NULL auto_increment,
    name VARCHAR(100) NOT NULL,
    birthday DATE NOT NULL,
    phone_number CHAR(11) NOT NULL,
    country varchar(100) NOT NULL,
    PRIMARY KEY (id),
    KEY idx_name_birthday_phone_number (name, birthday, phone_number)
);
Copy the code

There are two things to note about this staff_info table:

  • The primary key in the table isidColumn, which stores an automatically incrementing integer. soInnoDBThe storage engine automaticallyidColumn to build the cluster index.
  • We define an additional secondary indexidx_name_birthday_phone_number, which is a federated index consisting of three columns. So in this index corresponding toB+Only user records stored at leaf nodes of the tree are retainedname,birthday,phone_numberThe values of the three columns and the primary keyidIs not savedcountryThe value of the column.
All values match

When the column in our search criteria matches the column in the index, we call it a full-value match. This is a perfect case for using the index, as in:

SELECT * FROM staff_info WHERE name = 'Ashburn' AND birthday = '1990-09-27' AND phone_number = '15123983239';
Copy the code

Note that indexes can also be written like this:

SELECT * FROM staff_info WHERE birthday = '1990-09-27' AND phone_number = '15123983239' AND name = 'Ashburn';
Copy the code

The query optimizer in MySQL determines the order of the query conditions, so the effect is the same as in SQL above.

Matches the column on the left

Indexes can also be used if the search statement contains only the left column in the federated index, for example:

SELECT * FROM staff_info WHERE name = 'Ashburn';
Copy the code

Or it

SELECT * FROM staff_info WHERE name = 'Ashburn' AND birthday = '1990-09-27';
Copy the code

It is important to note that the columns in the search criteria must be the left-most consecutive columns in the federated index. The following SQL cannot use the index:

SELECT * FROM staff_info WHERE birthday = '1990-09-27';
Copy the code

This is because the data pages and records in the B+ tree are sorted first by the name column value, and then the Birthday column is only sorted if the name column value is the same, which means that the birthday value in records with different name column values may be unordered. Select birthday, birthday, birthday, birthday, birthday, birthday, birthday, birthday, birthday, birthday, birthday Similarly, the following SQL does not work:

SELECT * FROM staff_info WHERE name = 'Ashburn' AND phone_number = '15123983239';
Copy the code
Matching column prefix

As mentioned earlier, the union index idx_name_birthday_phone_number is sorted using the value of the name column, and only using the birthday column if the name column has the same value, Then use the phone_number column for sorting if the birthday column values are the same. Comparing the size of two strings starts with the first character, that is, the first n characters, or prefixes, of these strings in the joint index idx_name_birthday_phone_number are sorted. So you can also write like this using indexes:

SELECT * FROM staff_info WHERE name LIKE 'As%';
Copy the code
Matching range value

Range matching like this can also be done with indexes:

SELECT * FROM staff_info WHERE name > 'Asa' AND name < 'Barlow';
Copy the code

Note that this does not work:

SELECT * FROM staff_info WHERE name > 'Asa' AND name < 'Barlow' AND birthday > '1980-01-01';
Copy the code

The reason for this is that the birthday column is only used if the name column has the same value, which means that the birthday value in records with different name values may be unordered.

Matches exactly one column and ranges exactly the other

If the left column is an exact lookup and the right column is a range lookup, then an index can be used:

SELECT * FROM staff_info WHERE name = 'Ashburn' AND birthday > '1980-01-01' AND birthday < '2000-12-31';
Copy the code
SELECT * FROM staff_info WHERE name = 'Ashburn' AND birthday = '1980-01-01' AND phone_number > '15100000000';
Copy the code
The sorting

Let’s start with a simple query:

SELECT * FROM staff_info ORDER BY name, birthday, phone_number LIMIT 10;
Copy the code

The result set of this query needs to be sorted by the name value, by birthday if the records have the same name value, and phone_number if the records have the same birthday value. If you look back at the idx_NAMe_birthday_phone_number index diagram, since the B+ tree index itself is sorted according to the above rules, it is good to extract data directly from the index and then perform table back operations to retrieve columns that are not included in the index.

ORDER BY phone_number, birthday, name (phone_number, birthday, name), phone_number (phone_number, birthday, name);

grouping

The following grouped query can also use indexes:

SELECT name, birthday, phone_number, COUNT(*) FROM staff_info GROUP BY name, birthday, phone_number
Copy the code

The cost of returning to the table

Let’s talk about the table now. Using the idx_name_birthday_phone_number index as an example, look at the following query:

SELECT * FROM staff_info WHERE name > 'Asa' AND name < 'Barlow';
Copy the code

Using the idx_name_birthday_phone_number index to query can be roughly divided into two steps:

  1. From the indexidx_name_birthday_phone_numberThe correspondingB+In the tree to take out thenameValues inAsa~BarlowBetween user records.
  2. Due to the indexidx_name_birthday_phone_numberThe correspondingB+Tree user records contain onlyname,birthday,phone_number,idThese four fields, whereas the query list is*, which means that all fields in the table are queriedcountryField. For each record retrieved from the previous stepidFields to the corresponding cluster indexB+Tree to find the complete user record, which is what we usually callBack to the tableThe complete user record is then returned to the query user.

Since records in the B+ tree corresponding to the index IDx_name_birthday_phone_number are first sorted by the value of the NAME column, records with values between Asa and Barlow are stored on disk in a contiguic way, distributed in one or more data pages. We can read these attached records from the disk very quickly, which is also called sequential I/O. According to get in step 1 to record the value of the id field may not be connected, and record in the clustering index is based on id (that is, the primary key) of order, so according to these is not a continuous id value to the clustering index access the complete user record may be distributed in different data page, so read the complete user record may want to visit more data page, This type of reading can also be called random I/O. In general, sequential I/O performance is much higher than random I/O performance, so step 1 May be fast and Step 2 slower. So this query using the idx_name_birthday_phone_number index has two characteristics:

  • I’m going to use two of themB+Tree index, a secondary index, a cluster index.
  • Access secondary index usageSequential I/OTo access the cluster indexThe random I/O.

The more records that need to be returned to the table, the lower the performance of using secondary indexes, even leading some queries to use full table scans rather than secondary indexes. For example, the number of user records whose name value ranges from Asa to Barlow accounts for more than 90% of the total number of user records. Using the idx_name_birthday_phone_number index, more than 90% of the ID values must be returned to the table. It is better to scan the clustered index directly (i.e., full table scan).

Cover index

To eliminate the performance penalty associated with back-table operations, we recommend that the query list contain only indexed columns, such as:

SELECT name, birthday, phone_number FROM staff_info WHERE name > 'Asa' AND name < 'Barlow'
Copy the code

Since we only query the name, birthday, and phone_number index columns, we do not need to search the cluster index for the remaining columns of the record, that is, the country column, after obtaining the results from the idx_name_birthday_phone_number index. This eliminates the performance cost associated with back table operations. We call this type of query using only indexes index coverage.

How to optimize indexes

With the matting above, this part of the content can be said to be natural. Take a look.

Create indexes only for columns used for searching, sorting, or grouping

That is, only columns that appear in the WHERE clause, join columns in the join clause, or columns that appear in the ORDER BY or GROUP BY clause are indexed. Columns that appear in the query list do not need to be indexed:

SELECT birthday, country FROM staff_info WHERE name = 'Ashburn';
Copy the code

Columns such as birthday and country in the query list do not need to be indexed; we only need to create indexes for the name column that appears in the WHERE clause.

Consider the cardinality of the column

The cardinality of a column refers to the number of numbers in a column that are not duplicated. For example, if a column contains values 2, 5, 8, 2, 5, 8, 2, 5, 8, even though there are nine records, the cardinality of that column is 3. In other words, in the case of a certain number of recorded rows, the larger the cardinality of the column, the more scattered the values in the column, and the smaller the cardinality of the column, the more concentrated the values in the column. The cardinality of this column is very important and directly affects whether we can use the index effectively. Assume that the base of a column is 1, that is, all the records in the column value is the same, that it is no use for the column index, because all the value will not be able to sort, to quickly find the ~ and if secondary indexes is established in a column of duplicate values much more special, so records also identified by using the secondary indexes can be operated to do back to the table, So the performance loss is even greater. So the conclusion is that it is better to index columns with a large cardinality than columns with a small cardinality.

Index column types should be as small as possible

When we define the table structure, we must explicitly specify the type of the column. For example, the integer types are TINYINT, MEDIUMINT, INT, BIGINT, and so on. The storage space occupied by these types increases successively. If we want to index an integer column, we should keep the index column smaller if the integer range allows. For example, we should not use BIGINT if we can use INT and we should not use INT if we can use MEDIUMINT. This is because:

  • The smaller the data type, the faster the comparison at query time
  • The smaller the data type, the less storage space the index occupies, and the more records you can fit into a data page, thereby reducing the number of disksI/OThe performance penalty means that more data pages can be cached in memory, which speeds up read and write efficiency.

This recommendation is especially true for primary keys of tables, because not only are primary keys stored in clustered indexes, but all secondary index nodes store the primary key of a record. If the primary key is applicable to smaller data types, it means more storage space and more efficient I/O.

The prefix of the index string value

We know that a string is actually composed of several characters. If we use the UTF8 character set in MySQL to store strings, encoding a character takes 1 to 3 bytes. Assuming that our string is long, storing a string takes up a lot of storage space. When we need to index the string column, that means there are two problems in the corresponding B+ tree:

  • B+A record in a tree index needs to store the full string for that column, and the longer the string, the more storage space it takes up in the index.
  • ifB+Index columns in tree indexes store long strings, which can take more time to do string comparisons.

We mentioned earlier that the string prefixes in the index column are also sorted, so the index designer came up with a scheme to index only the first few characters of the string, that is, to keep only the first few characters of the string in the record of the secondary index. In this way, although the location of the record cannot be accurately located, the location of the corresponding prefix can be located. Then, the complete string value can be queried back to the table according to the primary key value of the record with the same prefix, and then the comparison is good. In this way, only the encoding of the first few characters of the string is stored in the B+ tree, which not only saves space, but also reduces the string comparison time, and probably solves the sorting problem.

If the first 10 characters of the name column are placed in the secondary index, the following query can be a bit awkward:

SELECT * FROM staff_info ORDER BY name LIMIT 10;
Copy the code

Because the secondary index does not contain the complete name column information, it cannot sort records whose first ten characters are the same but whose last characters are different, that is, the way of using the index column prefix cannot support the use of index sort.

Let the index column appear separately in the comparison expression

Suppose there is an integer column my_col in the table, and we have indexed this column. The following two WHERE clauses, though semantically identical, differ in efficiency:

  1. WHERE my_col * 2 < 4
  2. WHERE my_col < 4/2

The my_col column in the first WHERE clause does not appear as a single column, but as an expression such as my_col * 2. The storage engine iterates through all the records to determine if the expression is less than 4, so it does not use the B+ tree index for the my_col column. The my_col column in the second WHERE clause does not appear as a separate column, in which case you can use the B+ tree index directly.

So the conclusion is: if the index column appears in a comparison expression not as a single column, but as an expression, or as a function call, the index is not needed.

Primary key insertion order

For a table using InnoDB storage engine, the data in the table is actually stored in the leaf node of the clustered index when we do not explicitly create the index. And records are stored in the data page, page data and records are sorted in accordance with the order of the record the primary key value since the childhood, so if we insert the record’s primary key value is, in turn, increases, that each of us with a data page, changes to the next page data to insert, and if we insert the primary key of the big and small, it is more trouble, Page splitting and record shifting are involved. This problem can be avoided if the primary key has AUTO_INCREMENT and the storage engine automatically fills in the increment for us when the record is inserted.

conclusion

InnoDB storage engine’s B+ tree index is summarized as follows:

  • Each index corresponds to a treeB+The tree,B+The tree is divided into several layers, with the lowest layer being the leaf nodes and the rest being the inner nodes. allUser recordAre stored in theB+The leaves of the tree, all of themDirectory entry recordAre stored on the inner node.
  • InnoDBThe storage engine will automatically set up the primary key (it will automatically add it for us if it doesn’t exist)Clustering indexThe leaf node of the cluster index contains the complete user record.
  • We can create columns that we are interested inSecondary indexes.Secondary indexesThe leaf node contains user records byIndex column + primary keyComposition, so if you want to passSecondary indexesTo find the full user history, you need to go throughBack to the tableOperation, that is, passingSecondary indexesAfter you find the primary keyClustering indexTo find the complete user record.
  • B+All nodes in the tree form a bidirectional linked list according to the order of index column value from small to large, and all the records in each page (whether user record or directory entry record) form a single linked list according to the order of index column value from small to large. If it isJoint indexThen, the page and record according to firstJoint indexIf the value of the preceding column is the same, thenJoint indexThe following columns are sorted.
  • Looking up records by index is fromB+Start at the root of the tree and work your way down. Because each page is created according to the value of the index columnPage Directory(page directory), so lookup in these pages is very fast.

Index use and optimization:

  1. B+ tree indexes are expensive in space and time, so don’t build indexes when you don’t have to.

  2. B+ tree indexes apply to the following situations:

    • All values match
    • Matches the column on the left
    • Matching range value
    • Matches exactly one column and ranges exactly the other
    • For sorting
    • Used for grouping
  3. Here are some things to consider when using indexes:

    • Create indexes only for columns used for searching, sorting, or grouping
    • Create indexes for columns with large cardinality of columns
    • Index column types should be as small as possible
    • You can index only the prefixes of string values
    • An index applies only if the index column appears alone in a comparison expression
    • In order to let as little as possibleClustering indexIn the case of page splitting and record shifting, it is recommended that the primary key be ownedAUTO_INCREMENTProperties.
    • Locate and delete duplicate and redundant indexes in a table
    • Try to useCover indexQuery and avoidBack to the tablePerformance loss.