Crush the MySQL index question in the interview

This is the fourth article illustrating MySQL, and this article will get you

Understand what an index is, thoroughly understand the relationship between B+ tree and index;
Thorough understanding of primary key indexes, normal indexes, and union indexes;
Learn what HASH indexes are and how InnoDB and MyISAM indexes are implemented differently.
Easy to understand the following index usage rules.

1. Preparation

To better explain indexes, let’s start with a table.

CREATE TABLE `user_innodb` (
  `id` int NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL,
  `gender` tinyint(1) DEFAULT NULL,
  `phone` varchar(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Copy the code

I created a table user_innodb with storage engine as InnoDB, which contains primary key id, name field, gender field (0,1 means different gender), phone field, and batch initialized 500W+ data.

Note: All data are generated by simulation, and gender is not strictly differentiated; If the phone numbers are the same, it’s a coincidence

mysql> SELECT COUNT(*) FROM user_innodb; + -- -- -- -- -- -- -- -- -- -- + | COUNT (*) | + -- -- -- -- -- -- -- -- -- -- + | 5283424 | + -- -- -- -- -- -- -- -- -- -- + 1 row in the set (0.31 SEC)Copy the code

Example 1: Before creating an index for name

Mysql > SELECT * FROM user_innodb WHERE name = "innodb "; +---------+-----------+--------+-------------+ | id | name | gender | phone | + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + | 1099999 | cicada mphone | | 13203398311 | 0 + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + 1 row in the set (0.96 SEC)Copy the code

Example 2: After creating an index for name

Mysql > SELECT * FROM user_innodb WHERE name = "innodb "; +---------+-----------+--------+-------------+ | id | name | gender | phone | + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + | 1099999 | cicada mphone | | 13203398311 | 0 + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + 1 row in the set (0.03 SEC)Copy the code

Example 3: Query data by primary key ID

mysql> select * from user_innodb where id = 1099999; +---------+-----------+--------+-------------+ | id | name | gender | phone | + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + | 1099999 | cicada mphone | | 13203398311 | 0 + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + 1 row in the set (0.00 SEC)Copy the code

It can be seen that it takes 0.96 seconds to search the record whose name is mu wind before the index is created, and only 0.03 seconds after the index is created for the name field, indicating the great role of the index.

But we don’t explicitly create indexes for primary keys, so why are primary key queries so fast? I explained why primary key queries are fast in my last article, but I only explained half of it, so let me explain the other half.

Although I hope that each article covers an independent knowledge point, for complex software like MySQL, various details are intertwined, and a deep understanding of one knowledge point often requires the support of other knowledge points, so IT is highly recommended that you take 10 minutes to read this article before continuing.

If you really don’t want to see it, I’ll just summarize what I’ve talked about.

MySQL primary key query why is it so fast

2. Pre-knowledge

As we now know, the InnoDB storage engine provides 4 different row formats for storing data we insert into MySQL, which we collectively call records.

Records are stored in InnoDB pages, and the InnoDB storage engine divides data into several pages, using the page as the smallest unit of interaction between disk and memory. The default page size in InnoDB is 16KB. By default, at least 16KB of data is read from the disk to the memory and at least 16KB of content is flushed to the disk at a time. The page where user records are stored is collectively called a data page, and it is just one of many InnoDB pages that we don’t care about.

Very, very important, in a data page, user records are a one-way linked list of primary keys in ascending order.

But the records in a data page may very much, in order to avoid inefficient traversal, InnoDB engine designer came up with a wonderful search method, all of the records in the data page (including false records) into several groups (and number of each group within the team members to do the rules), each group choose the biggest one record in the group as a “leader”, Then take out the addresses of all the group leaders and catalogue them.

For example, the following image shows all the records in a data page grouped:

All the records in the image above (including pseudo records) are divided into four groups. The “leaders” of each group are individually promoted and compiled into separate “catalogs”, which InnoDB officially calls “slots”. The fact that slots are continuous in physical space means that you can easily find the one before and the one after it through a slot, which is very important.

Slot numbers start from 0. When we search for data, we can first find the corresponding slot, and then traverse in the group. Because the number of records in a group is not large, the performance loss of traverse can be ignored. Moreover, the primary key value of the “group leader” represented by each slot is also arranged in ascending order, so we can use the dichotomy method to find the slot quickly.

The figure contains 4 slots, which are 0, 1, 2 and 3 respectively. Before the binary search, the lowest slot low=0, and the highest slot high=3. Now let’s see what happens when we query the record with ID 7 on the data page.

Using dichotomy, calculate the position of the middle slot,(0 + 3) / 2 = 1To see theSlot 1The primary key value of group leader is4Because the4 < 7, so setlow=1.highStay the same;
Using the dichotomy again, calculate the position of the middle slot,(1 + 3) / 2 = 2To see theChannel 2The primary key value of group leader is8Because the8 > 7, so sethigh=2.lowStay the same;
nowhigh=2.low=1The difference between the two is 1, there is no need to continue the dichotomy, we can be sure that our record atChannel 2And we know itChannel 2The primary key of the corresponding group leader is8But there is a one-way linked list between the records, so we cannot traverse it forward. As mentioned above, we can passChannel 2findSlot 1, and then find its “group leader”, and then walk down the “group leader” until you find the record with primary key 7.

When there are too many user records for one data page, another data page is applied. Each data page is logically connected by bidirectional linked list, so the newly allocated data page numbers are not necessarily arranged in order from smallest to largest, as shown in the following figure:

So, though within a data page can fast query of the primary key, but the InnoDB storage engine don’t know what do you want to record the page number where to find out, that also can only along the two-way chain table from the first page of search, traversal every page, as for in every data page is how to find, you already very clear.

Obviously, the InnoDB engine has a way to quickly locate the data page where you want the primary key data, rather than start from the first page, otherwise it would not be possible to query as quickly as in Example 3.

So how does InnoDB do it?

3. The InnoDB index

3.1 Primary key index debut

For the sake of description, we assume that a data page can hold a maximum of 3 user records, so the first 12 columns of the USER_InnoDB table are stored as follows:

Do these data pages that connect together look like each chapter of a book? Naturally, every record in the data page is every section in the chapter.

So to speed up retrieval, we can add a table of contents to the data page, mimicking the book chapter table of contents.

As shown in the figure above, we create a table of contents for four data pages, with one record for each data page. To distinguish it from a user record, we call it a table of contents record, which is also one-way linked in order of primary key from small to large.

Unlike the user record, which contains complete data, the catalog entry record contains only the minimum primary key of the data page and the corresponding data page number. Since these are all records, InnoDB designers directly use data pages to store entries, so the structure of page 32 is exactly the same as other data pages.

Let’s see how adding a directory can improve the efficiency of our query. For example, to query the record whose primary key id is 8, the steps are as follows:

First find the data page 32 of the directory entry, and quickly locate the corresponding directory entry record through dichotomy, because7 < 8 < 10, so the page where the corresponding record is located should be page 14;
Then you can just look it up on page 14, as we described earlier.

At present, there are not many pages, so the improvement of query efficiency is not very obvious, but once the number of data pages rapid growth, this way by adding directory to bring query advantages will be infinitely magnified! However, there is a problem at the same time, there are many data pages, and the directory entry record in one data page is not enough.

Add a data page. Let’s add two more user records and see what they look like:

Note: Actually, the number of records (user records/directory entry records) that can be stored on a page is very large. For the sake of drawing, I just assume that the data page can hold up to 3 user records and up to 4 directory entry records

Now suppose that to find the record with primary key ID 14, we still need to find the data page where the directory entry is stored. However, there are two such data pages, namely page 32 and page 124. How do I know which directory entry data page to locate? Would you like to traverse from page 32? Are you kidding me? We’re doing this so we don’t want to go through it. Well, let’s regenerate a directory for the data page that stores the directory entry. Let’s get this straight.

As mentioned in the previous example, the data pages that store user records are equivalent to chapters, and user records are equivalent to sections. Generating a table of contents for chapters gives you the data pages (pages 32 and 124) that store catalog entries, which is equivalent to a book, and then creating a table of contents for the book, which is equivalent to a bookshelf.

That corresponds to the storage structure as shown below:

As shown in the figure above, we have added a data page 99 to hold the two directories corresponding to page 32 and page 124. Now to find the record with primary key ID 14, we need to go through the following steps:

From page 99, the corresponding directory item data page 124 can be quickly retrieved.
In page 124, the corresponding data page 27 is quickly retrieved.
On page 27, the record with primary key 14 is quickly retrieved.

So far, you have quietly mastered the B+ tree. Yes, the search structure we derived step by step above is known as B+ tree, but MySQL has given it a better name – index.

The node at the bottom of the B+ tree (corresponding to the data page in the figure that stores user records) is called the leaf node, the other nodes are naturally called non-leaf nodes, and in particular, the node at the top of the B+ tree is called the root node.

One detail worth noting is that the leaves of this B+ tree store our entire user record (that is, all the data we insert into the table), and this is the only way user records are stored in the InnoDB engine. The so-called “index is data, data is index”.

More conveniently, the index for the primary key is automatically generated by the InnoDB storage engine without the need to explicitly write the index creation statement. This index is called a primary key index, also called a clustered index.

Primary key indexes have two features:

Sort user records and data pages according to the size of the primary key. Records are linked by a one-way list, and data pages are linked by a two-way list.
The leaf nodes of the B+ tree hold a complete record of the user.

Now that we’ve explained why primary key queries are so fast, it’s too easy to figure out the primary key indexes, normal indexes and union indexes!

3.2 Common Indexes

The primary key index is used only when the search condition is primary key, but what if I use name=’ muwind ‘as the search condition? A quick search of the name column can be achieved by creating a B+ tree (we call it the Name index) where user records and data pages are sorted by the name field and the leaves of the B+ tree retain the complete user data.

However, the data in the table is fully recorded twice (the primary key leaf node and the name leaf node), and if we create indexes for other fields, the disk space will be limited. So we have to think of another way.

We already know that querying user records by primary key is very fast, so we can find a way to quickly find user records by primary key. Again, this is a B+ tree.

This B+ tree is slightly different from the clustered index B+ tree:

The leaf node no longer holds the complete user record, but only recordsnameColumns and primary key values;
The user records and directory entry records stored in the data page are sorted by primary key instead of by primary keynameColumn sorting;
In addition to storing index columns (name) and page number, but also stores primary key values; (Think about why you want to store primary keys.)

With this B+ tree, you can quickly find the primary key through the name column in the same way that you would find a user record based on the primary key, except that the former finds the primary key and the latter finds a complete user record.

The dichotomy of strings may seem a bit strange to you, and even to uneducated readers it may seem strange to sort strings. We can specify character sets and comparison rules for string fields when creating tables. If you don’t specify them, MySQL will set them by default. Anyway, MySQL will always find a way to sort strings.

Now you have the primary key ID, and then you can look up the complete user record in the primary key index based on the primary key ID. This process is called back table. If there is no uniqueness constraint for the name column, it is possible to find multiple primary key ids that match the criteria and return the table several times.

An index added to a single column such as name is called a normal index, also known as a secondary index.

What would B+ tree storage look like if multiple columns were indexed at the same time? This is the federated index, and understanding the above content and then understanding the federated index is just a matter of course.

3.3 Joint Index

If we create a joint index for the name and phone columns (note the order I described), we will naturally create a B+ tree that is slightly different from the previous one:

The leaf node holdsnameThe column,phoneColumns and primary key values;
In addition to storing index columns (name,phone) and page number, but also stores primary key values; (Think about why you want to store primary keys.)
The user records and directory entry records stored in the data page are sorted by primary key instead of by primary keynameColumn sort ifnameIf the columns are the same, let’s just go tophoneColumn sorting; (ifphoneWhat about the same columns? Do you see why primary key values are stored?

Let’s draw another picture (a little lazy, the data page number is not changed) :

It is still the same as the secondary index, using B+ tree to quickly locate to the data page, and then quickly locate to the record within the page, find the primary key ID in the record, and then return to the table, if multiple records meet the conditions, return to the table several times.

4. InnoDB other indexing methods

The above is the B+ tree index, which is actually one of the many indexes provided by InnoDB storage engine, but it is the most used index and the most frequently asked in interviews. In addition, there are other indexing methods available. For example, my TablePlus tool (MySQL connect tool for Mac) provides 4 indexing methods.

4.1 the HASH

If you’ve ever used a Java HashMap or a Python dictionary, you should be familiar with this concept.

A hash table is a structure that uses key-value pairs to store data. It generates hash codes and Pointers based on index fields. Pointers point to data in the table. Inevitably, multiple index field values will be converted by hash functions to the same value, and one way to handle this is to create a one-way linked list. As shown below, we create a HASH index for the name field:

Hash indexes have three important features:

The query speed is very, very fast, the time complexity is O(1), because the data in the hash index is not stored in order, so it cannot be used for sorting;
When querying data, hash codes are computed based on key values, so it can only support equivalent queries (=,IN), does not support range queries (>,<,> =,< =,BETWEEN,AND);
If the hash conflict, we have to use the method of adding one-way linked list to solve, which will cause efficiency decline.

In addition, there is no way to explicitly create a HASH index in InnoDB, although the HASH index method is provided. The so-called HASH index is actually an adaptive HASH index (AHI), which InnoDB automatically creates for the hot pages in BufferPool. TablePlus can select HASH when creating the index, but the actual display type is still BTREE.

4.2 FULLTEXT

If your table has a large text field and you want to query all entries in that field, you might want to query LIKE ‘% index %’, but the left-most matching rule tells you that this query is inefficient, and then the full-text index appears.

To illustrate, let’s assume that a text field stores this text:

My name is Cicada Mufeng, welcome everyone to follow my wechat public numberCopy the code

To quickly query according to a certain word, the text should be segmented first, and the following word segmentation results are obtained:

I/call/cicada/soak up/wind /, / welcome/everyone/attention/my/wechat/public numberCopy the code

Then establish a relationship between each participle and the user’s record (a document, in search jargon) to generate a matrix of word documents

And then you can Search for a particular word, which is how modern Search engines work, if you’re interested you can Search for an inverted index, and if you’re interested you can look at Elastic Search.

4.3 SPATIAL

It’s an index of spatial data, and I haven’t used it, so I’ll explain that for now.

5. MyISAM index scheme

Different storage engines store data in different ways, and produce different files in different numbers and formats. InnoDB files contain 2, MEMORY files contain 1, and MyISAM files contain 3. Our next focus is on the files in MyISAM.

.MYDThe file, D stands for Data, is MyISAM’s Data file, which holds user records, that is, the table Data we inserted;
.MYIFile, I for Index, is the Index file of MyISAM. An index has a B+ tree, and all B+ trees are stored in this file.

That is, unlike InnoDB’s “index as data” mentality, MyISAM storage engine stores indexes and data separately.

What does a B+ tree look like in MyISAM? In fact, it looks similar to InnoDB, but the difference is that MyISAM’s B+ tree leaves store user records corresponding to the disk address, so from the index file. After finding the corresponding index key (the value of the column to be indexed) in MYI, the Find the corresponding user record in MYD. Let’s take the primary key and let me draw another picture:

See you next time!