An index is arguably an essential skill point for every engineer, and understanding how indexes work is crucial for writing high-quality SQL. In this article we will build up the principle of the index from 0 to 1. By the end, I believe you will understand not only indexes, but also the minimum storage unit of MySQL's InnoDB storage engine, the "page", at a much deeper level.

Start with actual needs

Assume the following user table:

CREATE TABLE `user` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(32) DEFAULT NULL COMMENT 'name',
  `age` tinyint(3) unsigned DEFAULT NULL COMMENT 'age',
  `height` int(11) DEFAULT NULL COMMENT 'height',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='User table';

As you can see, the table uses the InnoDB storage engine. Let's look at some of the most common SQL statements run against this table; after all, technology needs to serve business needs.

1. select * from user where id = xxx
2. select * from user order by id asc/desc
3. select * from user where age = xxx
4. select age from user where age = xxx
5. select age from user order by age asc/desc

Since we want to query, let's insert some data first; after all, with no data there is nothing to query:

insert into user (`name`, `age`, `height`) values ('Joe', 20, 170);
insert into user (`name`, `age`, `height`) values ('Bill', 21, 171);
insert into user (`name`, `age`, `height`) values ('Cathy', 22, 172);
insert into user (`name`, `age`, `height`) values ('Daisy', 23, 173);
insert into user (`name`, `age`, `height`) values ('Evan', 24, 174);

After the inserts, the table contains the following data:

InnoDB assigns an id value to each record, and the id increases by 1 with every inserted record. Why must the id increase? Because the records are joined into a linked list in ascending id order, so that every search for id = xxx starts at id = 1 and walks forward through the list.

Now suppose we execute the following SQL statement. How would MySQL run the query?

select * from user where id = 3

Page

As mentioned above, the search first reads the record with the smallest id (id = 1) and compares its id with the value being queried, then follows the list to the next record, and so on; after reading three records in a row it finds record 3. Note that 'reading' means first loading the record stored on disk into memory and then comparing ids, so every read is one disk-to-memory I/O. Finding id = 3 generates three I/Os, and finding id = 100 would generate 100 I/Os; with only a few records this is fine, but with many records to compare it becomes a serious performance challenge. The root problem is that we read only one record at a time: if we could read 100 or more records in a single I/O, finding id = 100 would generate just one I/O. The idea behind this is the principle of locality: when an item of data is used, adjacent data is likely to be used soon, so we may as well load it together with its neighbors (if we start reading from id = 1, we will probably need the records immediately after it, so we may as well load ids 1 through 100 in one go).

Of course, the more records one I/O reads is not automatically better; we cannot load piles of irrelevant data into memory just to find one record, as that would waste resources. So a compromise is adopted: each I/O reads 16KB of data. Assuming 16KB holds 100 records, a query for the record with id = 100 now generates only one I/O (reading ids 1 to 100), which is 100 times better than the original 100 I/Os.

We call this 16KB unit of records a page.
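To make the savings concrete, here is a minimal Python sketch that simply counts I/Os under the two strategies; the 100-records-per-page figure is the assumption from the text, and none of this is MySQL internals:

# Illustrative model only: counts disk I/Os, does not touch a real disk.
RECORDS_PER_PAGE = 100  # assumption: 100 records fit in one 16KB page

def ios_one_record_per_io(target_id):
    # Walking the linked list from id = 1 loads one record per disk I/O.
    return target_id

def ios_one_page_per_io(target_id):
    # Loading whole pages: one I/O per page until the page holding target_id.
    return (target_id - 1) // RECORDS_PER_PAGE + 1

print(ios_one_record_per_io(100))  # 100 I/Os
print(ios_one_page_per_io(100))    # 1 I/O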

Page directory

Now an I/O reads one page at a time, and the lookup then proceeds in memory, which is much faster than disk. But we are still not satisfied: to find the record with id = 100 within a page we still start at id = 1, then id = 2, ..., up to id = 100, which takes 100 comparisons. Can it be faster?

Think of binary search: first probe id = (1 + 100) / 2 = 50; since 50 < 100, search among records 50 to 100, then among 75 to 100, and so on, so id = 100 can be found in about seven comparisons, a big improvement over the original 100. But there is a catch: that first probe at id = 50 itself required traversing 50 records from id = 1, because the records form a linked list. What we really want is a way to jump straight to the neighborhood of a record without walking the list; even landing at id = 30 or 40 on the first jump would already help.
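As a quick check of the 'seven comparisons' claim, here is a small Python sketch; it assumes the 100 records sat in a random-access array, which is exactly what the linked list does not give us:

# Binary search over ids 1..100, counting comparisons.
def binary_search_comparisons(ids, target):
    lo, hi, comparisons = 0, len(ids) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if ids[mid] == target:
            return comparisons
        if ids[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return comparisons

print(binary_search_comparisons(list(range(1, 101)), 100))  # 7, since log2(100) is about 6.6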

What data structure satisfies this requirement? Remember the skip list: every n elements form a first-level index, every 2n elements form a second-level index, and so on...

As the figure shows, with a first-level index we search the index first, locate the right segment of the list, and only then walk the list. Take element 7 as an example: without the skip-list index we would have to walk the list and make 7 comparisons, whereas with the first-level index only 3 are needed. So we can borrow the skip-list idea to cut down the number of comparisons inside a page. The scheme is as follows: every 4 records form a group and are assigned a slot; each slot records the largest id in its group and how many records the group contains.

Now suppose we want to locate the record with id = 9. The procedure is simple: first determine which slot the record falls in, then traverse the records of that group (see the sketch after the steps below).

  1. Locate the slot: binary-search over the slots by each group's max id. The smallest and largest slots have max ids 4 and 12 respectively, so the midpoint is (4 + 12) / 2 = 8; 8 is less than 9, so we move toward the larger slots, and since the max id of slot 2 is 12 >= 9, the record must be in slot 2.
  2. Traverse the records in slot 2. Here a problem arises: the records form a singly linked list and each slot points to the record with the largest id in its group, so how do we reach the first record of slot 2? Simple: start from the record slot 1 points to; its next record is the first record of slot 2. Walking from there, we find that the very first record of slot 2 is the one we want, with id = 9.
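Here is a minimal Python sketch of this slot-based lookup; the group size of 4 and all structures are illustrative stand-ins, not InnoDB's actual page layout:

from bisect import bisect_left

# The page's records (a singly linked list in ascending id order) are modeled
# as a plain Python list for brevity; slots hold the index of each group's max id.
ids = list(range(1, 13))      # ids 1..12, in groups of 4
slots = [3, 7, 11]            # positions of the group maxima: ids 4, 8, 12

def find_in_page(target):
    # Step 1: binary-search the slots by each group's max id.
    slot = bisect_left([ids[i] for i in slots], target)
    # Step 2: walk the group, starting right after the previous slot's max record.
    start = slots[slot - 1] + 1 if slot > 0 else 0
    for i in range(start, slots[slot] + 1):
        if ids[i] == target:
            return i
    return None

print(find_in_page(9))  # 8: the first record of slot 2 is the one with id = 9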

As you can see, this lets us locate a record within a page quickly. MySQL stipulates that each slot holds 1 to 8 records, so once the slot is located, the remaining handful of comparisons is no problem. Of course, a page can hold only a limited number of records; when a page fills up, a new page has to be opened, and the pages are linked to one another with a doubly linked list, so that both 'order by id asc' and 'order by id desc' can be served by scanning the page list forward or backward. That is why a doubly linked list is used, as sketched below.
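A tiny Python sketch of why a doubly linked list of pages serves both scan directions (purely illustrative):

class Page:
    def __init__(self, ids):
        self.ids = ids          # records in this page, ascending by id
        self.prev = None        # pointer to the previous page
        self.next = None        # pointer to the next page

# Build a small doubly linked list of pages.
p1, p2 = Page([1, 2, 3]), Page([4, 5, 6])
p1.next, p2.prev = p2, p1

def scan(first, last, desc=False):
    page = last if desc else first
    while page:
        yield from (reversed(page.ids) if desc else page.ids)
        page = page.prev if desc else page.next

print(list(scan(p1, p2)))             # order by id asc:  [1, 2, 3, 4, 5, 6]
print(list(scan(p1, p2, desc=True)))  # order by id desc: [6, 5, 4, 3, 2, 1]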

The B+ tree is born

Here is the next problem: when there are many pages, how do we locate a record? If it happens to sit in the first few pages, fine, traversing a few pages is fast; but to find a record like id = 1,000,000 we might have to traverse 10,000 pages (assuming 100 records per page), which is clearly unacceptable. How do we improve? The page directory we built earlier gives us the inspiration: inside a page we built a directory over the records so that we could first locate the slot; across many pages, we can likewise build a directory over the pages so that we first locate the page. Each entry in this directory records the minimum record id of a page together with that page's page number. Naturally, this directory also exists in the form of pages; to distinguish the two, we call the pages holding the directory built over pages directory pages, and the earlier pages storing the complete records data pages.

Voiceover: directory pages and data pages have the same structure, slots included; the slots are simply not drawn above for ease of display. A directory page differs from a data page only in what its records store.

Now finding the record with id = xxx is simple: go to the directory page to locate the data page it starts in, then search within that page. Because directory pages and data pages both have slots, locating the page number inside the directory page and locating the record inside the data page are both very fast.

Of course, as pages multiply, a directory page stores more and more entries and will eventually fill up, at which point another directory page is created. Now a new problem appears: how do we locate which directory page an id falls in? Simple: build yet another directory page over the directory pages, as follows.

What does this structure remind you of? That's right, it's a B+ tree! By now you should understand how the B+ tree evolved and how it works. The tree here has three levels: we call the topmost directory page the root node, and the bottom pages storing the complete records the leaf nodes.

Now let's see how to find the record with id = 55. First we load the root node and find that the record should be on page 30, so we load page 30 and find that the record should be on page 4, so we load page 4 and traverse it to find the record. A page is cached in memory after being read, and a lookup that hits a cached page fetches it straight from memory. Some may ask: if the B+ tree has many levels, won't there be many I/Os? Let's do a simple calculation. Suppose a data page can store 100 records and a directory page can store 1000 entries (a directory page stores only the primary key, not the complete data, so it can hold far more entries). Then:

  • If the B+ tree has only one level, i.e. a single node storing user records, it holds at most 100 records.
  • If the B+ tree has two levels, it holds at most 1000 x 100 = 100,000 records.
  • If the B+ tree has three levels, it holds at most 1000 x 1000 x 100 = 100,000,000 records.
  • If the B+ tree has four levels, it holds at most 1000 x 1000 x 1000 x 100 = 100,000,000,000 records! (The arithmetic is checked in the sketch below.)
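The same arithmetic as a quick Python check (the 100 and 1000 fan-out figures are the assumptions above):

LEAF_RECORDS, DIR_ENTRIES = 100, 1000  # records per data page, entries per directory page

for levels in range(1, 5):
    # levels - 1 directory levels above one leaf level
    print(levels, DIR_ENTRIES ** (levels - 1) * LEAF_RECORDS)
# 1 100
# 2 100000
# 3 100000000
# 4 100000000000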

Therefore a B+ tree of 3 to 4 levels is generally enough for our needs, and since pages are cached in memory after each read (and swapped out again according to certain algorithms), in practice a lookup costs only a handful of I/Os.
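Putting the pieces together, here is a toy B+ tree lookup in Python that mirrors the id = 55 walk above; the page numbers and structures are illustrative, not InnoDB's on-disk format:

# Each directory page maps (smallest id in child, child page number);
# each data page holds (id, row) pairs. A dict stands in for the disk;
# pages not on the id = 55 path are omitted.
pages = {
    1:  {"type": "dir",  "entries": [(1, 10), (50, 30)]},   # root
    30: {"type": "dir",  "entries": [(50, 4), (75, 5)]},
    4:  {"type": "leaf", "records": [(50, "..."), (55, "row 55"), (60, "...")]},
}

def bplus_lookup(page_no, target, loads=0):
    page, loads = pages[page_no], loads + 1   # loading a page = one I/O if uncached
    if page["type"] == "dir":
        # Descend into the child with the largest "smallest id" <= target.
        child = max(e for e in page["entries"] if e[0] <= target)[1]
        return bplus_lookup(child, target, loads)
    row = next((r for r in page["records"] if r[0] == target), None)
    return row, loads

print(bplus_lookup(1, 55))  # ((55, 'row 55'), 3): found after 3 page loads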

Clustered index and non-clustered index

As you may have noticed, the B+ tree in our example is built on the primary key, id. It is not hard to see that the leaf nodes of the primary key index store the complete records; an index whose leaves store the complete records is called a clustered index, and as long as you define a primary key, the primary key index is the clustered index.

An index on a non-primary-key column takes exactly the same form, but its leaf nodes store something slightly different: the indexed column plus the primary key value. For example, if we index the age column, the index tree looks as follows.

Its non-leaf nodes store "age value + page number", while its leaf nodes store "age value + primary key value". Now consider how the following query executes:

select * from user where age = xxx

How is such a statement executed? MySQL first searches the age index tree and finds the primary key ids of the matching records; since the age index's leaf nodes store only age and id, it must then take each id to the clustered (primary key) index to fetch the complete record. This second lookup is what we call going back to the table. If many back-to-table lookups are needed, performance clearly suffers, because the ids may be distributed across different pages: those pages must be read from disk into memory, they are probably not adjacent to each other, and so a large amount of random I/O is generated, which seriously hurts performance. This also answers a common question: why can a query that hits an index still end up as a full table scan? One reason is that although the index is hit, so many back-to-table lookups would follow from the leaf-node matches that the optimizer decides the index path is no faster than scanning the whole table.
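A minimal Python sketch of this two-step lookup; both dicts are illustrative stand-ins for the two index trees, and the rows are made up:

# Secondary (age) index leaves hold only (age, id); the clustered index
# holds the complete rows.
age_index = {22: [3, 1001]}                                   # two users aged 22, ids far apart
clustered = {3: ("Cathy", 22, 172), 1001: ("Zoe", 22, 165)}   # id -> full row

def select_star_where_age(age):
    rows = []
    for pk in age_index.get(age, []):   # step 1: search the age index tree
        rows.append(clustered[pk])      # step 2: back to the table, one clustered-index
                                        # lookup per id, potentially one random I/O each
    return rows

print(select_star_where_age(22))  # [('Cathy', 22, 172), ('Zoe', 22, 165)]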

Some may ask why secondary indexes do not store the complete record. The answer, of course, is to save space: complete records are expensive to store, and if every index kept a full copy, the data would be duplicated many times over.

How can back-to-table lookups be avoided? With a covering index: if a query of the following form meets your needs, it is the recommended form.

select age from user where age = xxx
select age,id from user where age = xxx

It is not hard to see what these queries have in common: the columns being selected (age, and also id) are exactly the columns the index itself stores. So once the matching records are found on the leaf nodes of the age index, the leaves already contain all the requested columns and there is no need to go back to the table, which improves performance.
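Continuing the sketch above, a covering query can stop at the secondary index (again purely illustrative):

age_index = {22: [3, 1001]}   # same illustrative age index as above

def select_age_id_where_age(age):
    # The age index leaves already contain (age, id): no back-to-table step.
    return [(age, pk) for pk in age_index.get(age, [])]

print(select_age_id_where_age(22))  # [(22, 3), (22, 1001)], zero clustered-index lookups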

Disk prefetch

Next let's discuss a question that few articles online address. We know the operating system manages memory in units of pages; on Linux a page defaults to 4KB. This means that whether data is loaded from disk into memory or written from memory back to disk, the operating system operates in whole pages: even if you write a single byte to an empty file, the operating system allocates it a full page (4KB).

As shown, only two bytes were written to disk, yet the operating system still allocated a full page (4KB) for them.

InnoDB also stores and reads data in pages, and an InnoDB page is 16KB by default, so many people online wonder: does reading one InnoDB page take 4 I/Os? No, it takes only one. Why? Answering that requires a little understanding of how a disk reads data.

Disk structure

First let’s look at the physical structure of the disk

A disk's main components are the platters, the spindle, the read/write heads, and the actuator arm that carries the heads; data is written on the platters. Each platter is divided into sectors, and reads and writes happen in units of sectors. In addition, the platter is divided into concentric circles around its center; each such circle, one "ring" of the platter, is called a track.

So how is data read and written? There are three main steps:

  1. Seek: since data lives in sectors, we first need the head to move to the track containing the target sector. The time this takes is called the seek time; average seek times typically run 3-15 ms.
  2. Rotational latency: once the head is over the right track, the sector currently under it is not necessarily the one we want, so we must wait for the platter to spin until the target sector passes under the head. This rotational latency depends on the rotation speed and is usually quoted as half the time of one full revolution: a 7200 rpm disk has an average rotational latency of about 4.17 ms (60 x 1000 / 7200 / 2), and a 15000 rpm disk about 2 ms (see the sketch after this list).
  3. Data transfer: after the two steps above, the head finally reads or writes the data. Interface transfer rates currently reach 133 MB/s for IDE/ATA and 300 MB/s for SATA II, so data transfer time is usually far smaller than the first two components and can be neglected.
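The average rotational latency formula from step 2, as a quick Python check:

def avg_rotational_latency_ms(rpm):
    # Half of one full revolution, in milliseconds: 60 s * 1000 / rpm / 2
    return 60 * 1000 / rpm / 2

print(avg_rotational_latency_ms(7200))   # about 4.17 ms
print(avg_rotational_latency_ms(15000))  # 2.0 ms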

If the sectors being read are contiguous, this is sequential I/O: the seek and rotational wait are paid once and the data then streams off the platter, so sequential disk I/O can be as fast as, or faster than, random access to memory (one important reason Kafka is famous for its performance is its use of sequential disk reads and writes). But if the data is scattered across different sectors, the access becomes random I/O, which pays the seek time and rotational latency over and over, and performance suffers badly. (A typical case is the back-to-table scenario above, where many ids are spread across different pages, generating a lot of random I/O.)

As shown in this performance comparison from ACM Queue, sequential disk I/O can indeed be faster than random memory access.

Now you can surely guess why reading one InnoDB page counts as a single I/O: the page is allocated contiguously on disk, which means its sectors are adjacent, so reading it is sequential I/O.

The operating system manages memory in pages and can load an integer multiple of its page size at a time. InnoDB's page size, 16KB, is exactly 4 times the operating system page size (4KB), so the OS can read 4 consecutive operating system pages, i.e. 16KB, starting from the page's address in one go. This is what we call disk prefetch. It should now be easy to see why reading an InnoDB page is actually one I/O rather than four.
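To see the "one request, not four" point at the system-call level, here is a small Python sketch; the file name dummy.ibd is made up, and the point is only that a 16KB read is a single contiguous request covering four OS pages:

import os

OS_PAGE, INNODB_PAGE = 4 * 1024, 16 * 1024

# Write a dummy one-page "tablespace" file (file name is invented for the demo).
with open("dummy.ibd", "wb") as f:
    f.write(b"\0" * INNODB_PAGE)

fd = os.open("dummy.ibd", os.O_RDONLY)
data = os.read(fd, INNODB_PAGE)   # a single contiguous 16KB request...
os.close(fd)

print(len(data) // OS_PAGE)       # ...covering 4 OS pages: prints 4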

Conclusion

After reading this article, I believe you understand where indexes come from, and have a much better feel for pages and disk prefetch as well. In reality MySQL's page structure differs a little from our derivation, but not in a way that affects the overall understanding. If you are interested in MySQL's actual page structure, I strongly recommend the book "How Does MySQL Run: Understanding MySQL from the Root", which explains it in great detail.

  • Standing on the shoulders of giants:
  1. Those things about disk I/O: tech.meituan.com/2017/05/19/…
  2. MySQL index structure: blog.51cto.com/u_12302929/…
  3. How Does MySQL Run: Understanding MySQL from the Root

Finally, you are welcome to follow my WeChat public account "code sea" to be the first to receive new posts.