Why do we need an index?

The role of the index: It is critical that we know in our hearts what the role of the index is. An index is a data structure that speeds up our queries.

With indexes, we can query the database faster. To illustrate the importance of indexes, let’s look at how queries can be performed in a database without an index model: we know that data stored on disk is composed of data pages. The data page has rows of fields in it, and below that is a data page.

It may store the following data:

key1 = value1

key2 = value2

key2 = value3

.

At this point we have a program that wants to query the value of the field keyx. Select * from t where xx = keyx. And then it took a long, long time to find out. Why is it taking so long? Because we have thousands of pages of data stored on disk! We can only load data pages into buffer and then do binary search. If we do not find our target value, we can only continue to follow the pointer to the data page after it and load into the buffer to look for it. This is no different than a full table scan. A lot of articles talk about index reduction and IO reduction, actually reduce the number of times we load data pages from disk into memory. As we can see, this is too many times without an index!

So how do you use indexes to speed up queries? Let’s not get into the index model directly but I want to share my own thoughts about the index model, using the little things in life to illustrate the idea of the index, right?

The idea of indexing is everywhere in life

This is an example of my imaginative thinking: my friend wants to find a girl who is 165.3cm as his girlfriend. The playground is now crowded with thousands of girls of different heights. How did he quickly locate the girl with the exact height? You can’t just pull it up and measure it one by one. We can use this method to find the girl with the exact height as quickly as possible in just a few eyes. In the meantime, it will be convenient for him to find other girls of height like (172.4cm) in the future.

First we rank all the girls on the playground by height. And then we can group them into a range. For example, the girls between 168.0cm and 169.0cm all stand in a queue in order. There was a sign at the front of the line, group 168-169. Then there would be a lot of lines in the playground, and we would see a long sign that said the height range. Because our brand is really too many, I stand on the rostrum can not see the end of the edge ah.

So we made some new signs and put them in front of the same sign. There are fewer of these new brands. It says 165-170 and then there are five indicators pointing to the five groups in the big 165-170 group, 165-166, 166-167… We can even add a more abstract set of signs on top of the previous set of signs in order to find a more accurate height object more quickly. Let’s say 160 to 170. And finally even a sign greater than 170 and less than 170. Now, if we want to find someone of any height, all we have to do is see if they’re greater than 170, and then we go to the side and we go to the group, and within a couple of moves we can find that person. This is the idea of indexing to speed up queries. Another example is dictionaries, which are arranged in lexicographical order, so you can see for yourself.

Take a macro look at the Mysql index model

In this section I will briefly describe the characteristics of B+ trees and different nodes for storing values. The following diagram shows the Mysql index model. The child nodes of the tree are data pages, which store data from the database. Each page has an ID corresponding to a field value. For example, id1 = “vegetabledog”. There are two things to look out for in the data page below:

  • Each data page is ordered in size from small to large.
  • There are two Pointers to each other between the data page and the data page.

Now let’s look at the index page: if we look at the index page of the second-to-last node, let’s take the left-most node (index page 1) for example, what information is stored in this node? This index page holds the minimum ID of each of its children (data pages). We can see that the minimum Id of data page 1 is indeed 1, and the minimum Id of data page 2 is 4. So what information does an index page hold when its children are not data pages but index pages? They hold the minimum ID value in their child node index page.

This is what a primary key index tree looks like. Let’s see if we were to query an index with a primary key of X, we would first look at the root node of the tree that records the smallest ID on the left and right. If it’s less than the minimum ID on the right, it goes to the left subtree, if it’s more than the minimum ID on the right, it goes to the right. (Of course, the actual tree could not be such a binary tree). We will then use the same method to quickly locate the corresponding data page, and then only need to perform binary search on that data page to find the value we want. Isn’t that much faster than no index.

The growth of the tree

Let’s look at how the tree grows when we insert data into the database.

It starts out as a data page, and we just insert data into it.

The capacity of a data page is also limited. When a data page cannot fit, it will produce a page split, resulting in two pages pointing at each other, while maintaining the size order of page ids. The page ID on the left cannot be greater than the page ID on the right. It also generates an index page that points to its child data pages and records the minimum ID of their page.

As more and more data is inserted our index page will point to more data pages and record more and more of the minimum information of those data pages, but our index page storage size is also limited and what happens when we run out of index pages?

Are you right? We generate a new index page, and the index page and its parent point to our child index page and record the minimum ID of each index page. If you look at the case of finding a girl with a certain height, it seems to be the same way that we speed up our search by adding indexes layer by layer, as if the idea is similar.

Secondary indexes

All we’ve talked about above is creating a primary key index. Sometimes we will create a secondary index tree and index the non-primary key fields of the table. Like a name? Age? Height? When inserting this data, we not only maintain a primary key index tree like the one above, but also maintain a B+ tree based on our secondary index fields. For example, if we sort by name, the a will come first. Each index page stores the minimum name value for its children.

Finally, the primary key corresponding to this name is stored in the data page. So we finally find the primary key ID in the secondary index tree, we need to take the primary key ID to search the primary index to find the last field value. This process is also called “back table”.