
Indexes are the heart of a database. Without indexes, a database loses much of its reason to exist and is little different from a pile of ordinary files. Good index design is therefore critical to a database system. Today we will look at MySQL indexes: the benefits of the B+ tree index, seen from both the low-level details and a practical business point of view, and the things to watch out for when using indexes.

Rational use of indexes

At work, the most straightforward way to decide whether a field needs an index is to check whether it appears frequently in our WHERE conditions. At a macro level this is fine, but in the long run it sometimes requires more nuanced thinking: do we need more than a single-column index on this field? Would a joint index over several fields be better? Take a user table as an example. Its fields might include the user's name, the user's ID number, the user's home address, and so on.
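For concreteness, here is a minimal sketch of the user table assumed in the rest of this article; the column types and the index name are illustrative guesses, not taken from any original schema:

CREATE TABLE user (
    id      BIGINT PRIMARY KEY AUTO_INCREMENT,   -- the default surrogate primary key
    name    VARCHAR(32),
    id_card VARCHAR(18),                         -- the ID number
    address VARCHAR(128),
    INDEX idx_id_card (id_card)                  -- the secondary index discussed in section 1
) ENGINE=InnoDB;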

1. Disadvantages of normal indexes

The obvious first step is to create an index on id_card. Since an ID number must be unique, this index is strictly unique. Now consider the following query:

SELECT name FROM user WHERE id_card=xxx

It should run like this:

  1. Search the id_card index tree and find the primary key id corresponding to the id_card
  2. Search the primary key index by id to find the corresponding name

The result is correct, but in terms of efficiency the query is somewhat expensive, because it has to search two B+ trees. Suppose each tree has a height of 3; the combined height is then 6. Since root nodes are generally resident in memory, about 4 disk IOs are needed. If a random disk IO takes 10ms on average, the query needs roughly 40ms. That is an average figure, and it is not fast.

2. Pitfalls of primary key indexes

Since the problem is searching two trees, the core question is whether we can get away with searching only one. Looking at it from the business side, there is a possible breakthrough: the ID number is unique, so instead of keeping the default auto-increment id as the primary key, we could make the ID number itself the primary key. Then the whole table needs only one index, and everything we need, including the name, can be found directly through the ID number. At first glance this seems reasonable: just set id to the ID number on every insert. But on closer thought there is a problem.

Consider the characteristics of the B+ tree: its data is stored in the leaf nodes, and data is managed in pages, where a page is 16K. What does this mean? Even if we currently have only one row, it still occupies a 16K page; only when a page is full will new data be written to a new page. The new and old pages are not necessarily contiguous on disk, but one thing is critical: even though the pages may be physically discontinuous, the data is logically contiguous.

You might be wondering: what does this have to do with using the ID number as the primary key? This is where you should pay attention to the keyword contiguous. ID numbers are not sequential. What does that mean? When we insert a value out of order, data has to be moved to keep things in order. For example, suppose page A currently holds 1 and 5, and we now insert 3: the 5 has to be moved to make room for the 3. You might say that is not much overhead, but what if page A is already full when the 3 arrives? Then we have to look at the next page, page B: if B has space, the value that overflows from page A becomes the starting value of page B, and the corresponding data has to be moved as well. If page B does not have enough space either, a new page C is requested, some data is moved into C, the link between A and B is broken, and C is inserted between the two, which at the code level is just switching linked-list pointers. This is a page split.

In summary, using discontinuous ID numbers as the primary key causes data movement within pages, random IO, and the overhead of frequently requesting new pages. With an auto-increment primary key, inserts are strictly sequential: there is no data movement and no random IO, and the insertion overhead is much smaller.

There is another reason not to use ID numbers as primary keys: the ID number is too large a number for an int, so a bigint is needed to store it, whereas an auto-increment int is normally more than enough, say, for a school's student table. As we know, a page holds 16K: the more space each index entry occupies, the fewer entries fit on a page, so a bigint key needs more pages, and therefore more storage, than an int key.
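As a rough illustration, assume each entry in a non-leaf node holds the key plus a 6-byte child-page pointer (an approximation for the sake of the arithmetic, not an exact InnoDB figure):

  int key:    16KB / (4B + 6B) ≈ 1638 entries per non-leaf page
  bigint key: 16KB / (8B + 6B) ≈ 1170 entries per non-leaf page

Fewer entries per page means more pages for the same data, and potentially a taller tree with more IO per lookup.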

3. The spear and shield of joint indexes

Two conclusions can be drawn from the above:

  1. Try to avoid going back to the table
  2. ID numbers are not suitable as primary keys

So it is natural to think of a joint index: create a joint index on [ID number + name]. Pay attention to the column order of a joint index; it must comply with the leftmost-prefix principle. Now when we execute the following SQL:

select name from user where id_card=xxx

there is no need to go back to the table, because the name field we need is already in the index. However, the problem that the ID number itself takes up too much space is still unsolved; that is a problem of the business data itself. If we want to address it, we can use a conversion algorithm to turn the large value into a smaller one, for example CRC32:

// from Go's standard library hash/crc32; ChecksumIEEE returns a uint32 (4 bytes)
crc32.ChecksumIEEE([]byte("341124199408203232"))

We can replace the ID number, which needed 8 bytes of storage, with a 4-byte CRC code. The table then needs an extra field, crc_id_card, and the joint index changes from [ID number + name] to [crc32(ID number) + name], so the space occupied by the joint index shrinks. But the switch comes at a cost:

  1. Every CRC computation costs extra CPU resources
  2. The extra field, while shrinking the index, takes up space of its own
  3. CRC codes can collide, so after querying we must filter the results by id_card. The cost of filtering depends on the number of duplicates: the more duplicates, the slower the filtering.

There is a small detail about optimizing joint index storage. Suppose we have two fields A and B, taking 8 bytes and 20 bytes respectively, and we already have a joint index [A, B] but also want to support queries on B alone. Naturally we would create a separate index on B, and the two indexes together occupy 8+20+20=48 bytes per row. Either way, whether we query through A or through B, an index can be used. But if the business allows, could we instead build a [B, A] index plus an index on A? That way we can still use an index when querying through A alone or B alone, while occupying less space: 20+8+8=36 bytes.
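A minimal sketch of the second layout, with hypothetical table and index names:

alter table t add index idx_b_a (b, a);   -- 28 bytes per entry; serves queries on b alone and on b+a
alter table t add index idx_a (a);        -- 8 bytes per entry; serves queries on a alone
-- 36 bytes per row in total, versus 48 for idx_a_b (a, b) plus idx_b (b)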

4. Short and concise prefix index

Sometimes the field we need to index is a string type, and a very long one. We want the index, but we do not want it to take up too much space. In that case we can consider a prefix index: build the index on the first part of the field's characters. That way we enjoy the benefits of an index while saving space. Note, however, that where the prefix is highly repetitive, a prefix index can be noticeably slower than a normal index.

alter table xx add index(name(7));
select * from xx where name="JamesBond"
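A common way to choose the prefix length, sketched here against the hypothetical xx table, is to compare the selectivity of candidate prefixes with that of the full column:

select
  count(distinct left(name, 5)) / count(*) as sel5,
  count(distinct left(name, 7)) / count(*) as sel7,
  count(distinct name) / count(*) as sel_full
from xx;

Pick the shortest prefix whose selectivity comes close to the full column's.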

5. The speed and slowness of unique indexes

Before we talk about unique indexes, let’s look at normal indexes. We know that for a B+ tree, the leaves are ordered.

Suppose we want to query for 2. When the storage engine finds a 2 through the index tree, it does not stop searching, because there may be several 2's. Does that mean it continues along the leaf nodes and stops after finding a second 2? No: the storage engine does not know how many 2's there are, so it keeps going until it hits the first value that is not a 2, in this case a 3, and only then stops. That is how a normal index behaves.

A unique index is different: because of the uniqueness constraint there can be no duplicates, so once the target is found it is returned directly, with no need to look further the way a normal index does. From this angle a unique index is faster than a normal index, but when the matching records all sit within one page the difference is negligible. For inserts, however, the unique index loses out: because of uniqueness it must check whether the value already exists on every insert, while a normal index skips that logic, and, importantly, a unique index cannot use the change buffer (see below).

6. Don't blindly index

At work you may come across the question: does this field need an index? The usual criterion is whether queries use the field: if it often appears in query conditions, we consider indexing it. But if that is your only criterion, you may add the wrong index. Consider an example: suppose we have a user table with about 1,000,000 rows, and it has a gender field split roughly half and half between male and female. We now want to fetch all the males, so we index the gender field and write the following SQL:

select * from user where sex="male"

Barring surprises, InnoDB will not choose the gender index. If it did, every match would require going back to the table, and with data at this scale, what would the consequence be? Recall the IO arithmetic from section 1:

It is simply a huge amount of IO, roughly 4 IOs per row; what about 500,000 rows? The result is predictable. In a case like this, the MySQL optimizer will very likely choose a full table scan, scanning the primary key index directly, because its performance is likely to be higher.
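You can watch the optimizer make this choice; idx_sex below is a hypothetical index name:

explain select * from user where sex="male";
-- likely shows type: ALL, key: NULL, i.e. a full table scan that skips the gender index
select * from user force index (idx_sex) where sex="male";
-- force index overrides the optimizer; with ~500K matches this is usually the slower plan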

7. Index failure

In some cases MySQL does not use an index simply because of our own misuse, and type conversion is an easy way to hit this. You might say: doesn't MySQL already support implicit conversion? For example, suppose user_id is a numeric field with an index on it, and we write:

select xx from user where user_id="1234"

Note that "1234" here is a string. In this case MySQL is smart enough to convert the string "1234" to the number 1234 and happily use the user_id index. But if user_id is a string field with an index, and through carelessness we write:

select xx from user where user_id=1234

You may wonder why MySQL does not simply convert the number 1234 into the string "1234". The conversion rule needs explaining: when a string is compared with a number, remember that MySQL converts the string to a number. You may then ask: why does converting the user_id field to a number prevent the index from being used? This comes back to the structure of the B+ tree index: a B+ tree branches and sorts according to the index values. When a type conversion is applied to the index field, the values change; a value that was originally A may become B after the conversion (int(A)=B). At that point the index tree cannot be used, because it was built and ordered according to A, not B, so the index is abandoned.
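A sketch of how this shows up in practice, assuming user_id is an indexed varchar column:

explain select * from user where user_id=1234;
-- type: ALL — every stored user_id must be cast to a number first, so the index tree is unusable
explain select * from user where user_id="1234";
-- type: ref — the index is used as expected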

Index optimization

1. Change buffer

We know that when updating a row, we first check whether the page containing it is in memory. If it is, we update the in-memory page directly; if not, we have to read the page from disk into memory and then update it. What is the problem with that?

  1. Reading from disk is slow
  2. If many rows are updated at once, a lot of discrete IO may occur

The change buffer exists to solve the speed problem in this scenario. First, do not be misled by the word "buffer": the change buffer not only lives in the buffer pool, it is also persisted to disk. With the change buffer, if during an update we find that the target data page is not in memory, we do not read the page from disk; instead we put the pending change into the change buffer. So when is the change buffer merged back to disk? And what if a read comes along in the meantime? First, a background thread periodically merges change buffer entries to disk; second, if the page is read before that happens, the read itself triggers the merge of the change buffer entries for that page.

It is important to note that not all indexes can use the change buffer. Primary key indexes and unique indexes cannot, because of their uniqueness constraint: to decide whether an update violates uniqueness, the data page must be read into memory anyway if it is not already there. A normal index has no such check, so it does not matter. In theory, the larger the change buffer, the larger the benefit: there is less discrete read IO, and when several changes land on the same data page, only one merge to disk is needed. Of course, not every workload suits the change buffer. If your business reads a row immediately after updating it, the change buffer becomes counterproductive: merges are triggered constantly, the amount of random IO is not reduced, and you pay the overhead of maintaining the change buffer on top.
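For reference, the change buffer can be inspected and tuned through standard system variables; a sketch:

show variables like 'innodb_change_buffering';     -- which operations are buffered; default is "all"
set global innodb_change_buffer_max_size = 25;     -- upper limit as a percentage of the buffer pool (25 is the default)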

2. Index pushdown

Suppose we have a joint index on [A, B]. By the leftmost-prefix principle, it can serve both of the following queries:

select * from table where A="xx"
select * from table where A="xx" AND B="xx"

In fact, the leftmost-prefix principle even applies within the leading column itself, via a prefix match:

select * from table where A like "xx%" and B="xx"

Before MySQL 5.6, the query above worked like this: the storage engine walks the joint index, finds every record whose A matches "xx%", goes back to the table for each one to fetch the full row, and the server layer then filters on B. But why not use the joint index itself, which already contains B, to cut down the number of table lookups? That is exactly what MySQL 5.6 introduced as index condition pushdown (ICP): while scanning "xx%" in the joint index, records whose B does not match are filtered out inside the storage engine, greatly reducing the number of times we go back to the table.
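A sketch of how to observe ICP: "Using index condition" in the Extra column of EXPLAIN is the telltale sign, and the optimizer_switch flag lets you toggle it for comparison:

explain select * from table where A like "xx%" and B="xx";
-- Extra: Using index condition   (ICP is active)
set optimizer_switch = 'index_condition_pushdown=off';   -- disable ICP to compare plans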

3. Flushing adjacent pages

Before talking about adjacent pages, let's talk about dirty pages. As noted above, when we update a row we first check whether its page is in memory; if not, we read the page into memory and then update it there. At that point the page in memory holds the latest data while the page on disk still holds the old data: the in-memory page is a dirty page, and sooner or later it must be flushed to disk to restore consistency. So the questions are: when do we flush, and how many dirty pages per flush? Flushing on every change would hurt performance; waiting too long lets dirty pages pile up, leaving fewer free pages in the buffer pool and affecting normal operation. So MySQL has a cleaner thread that runs periodically, ensuring flushing is neither too fast nor too slow. In addition, when there are too many dirty pages, or the redo log is nearly full, a flush is triggered immediately.

InnoDB has an optimization when cleaning dirty pages: if the neighbors of the page being flushed are also dirty, they are flushed along with it. The benefit is reduced random IO; on a mechanical disk the gain can be substantial. But there is a potential pit: if a neighbor page that was flushed along immediately becomes dirty again because its data changed, the extra flush was redundant, wasting time and IO. Worse, if the neighbors of the neighbor pages are also dirty... the chain reaction can cause a temporary performance problem.
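This behavior is controlled by the innodb_flush_neighbors variable; a sketch:

show variables like 'innodb_flush_neighbors';
set global innodb_flush_neighbors = 0;   -- 1 flushes contiguous dirty neighbors; 0 turns it off, the usual choice on SSDs where random IO is cheap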

4. MRR

In real business we are often told to use covering indexes where possible and avoid going back to the table, because table lookups mean more IO, and discrete IO at that. But sometimes going back to the table is unavoidable, for example:

select * from user where grade between 60 and 70

Assuming grade has an index, the naive flow is: find grade=60 in the grade index, get the primary key id, go back to the primary key index for the full row, return to the grade index, and repeat the same dance over and over... Now suppose grade=60 corresponds to id=1, whose row is on page_no_1; grade=61 corresponds to id=10, whose row is on page_no_2; and grade=62 corresponds to id=2, whose row is also on page_no_1. Then the lookups for id=1 and id=2 can be merged into a single read of page_no_1, which not only saves IO but also avoids random IO. This is MRR (Multi-Range Read). With MRR, the secondary index does not go back to the table immediately; instead the fetched primary key ids are put into a buffer and sorted, and the primary key index is then read in order, greatly reducing discrete IO.
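A sketch of how to turn MRR on and verify it; mrr and mrr_cost_based are standard optimizer_switch flags, and the id buffer is governed by read_rnd_buffer_size:

set optimizer_switch = 'mrr=on,mrr_cost_based=off';   -- force MRR rather than leaving it to the cost model
explain select * from user where grade between 60 and 70;
-- Extra: Using MRR
show variables like 'read_rnd_buffer_size';           -- the buffer where primary key ids are sorted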


Finally, you can find me by searching [pretend to understand programming] on WeChat. If you have any questions, feel free to contact me; if anything in this article is wrong, corrections are welcome; and if you love learning and digging into things, feel free to follow me.