This article has participated in the “Digitalstar Project” and won a creative gift package to challenge the creative incentive money.

MySQL database should be one of the most commonly used database, in a variety of large and small companies can see it figure, how do you master MySQL database? Want to better use it, so we must first understand it, is the so-called work to good things, must first sharpen its tools.

This article will lead you to in-depth analysis of MySQL index knowledge, first to understand what is the index, and index storage model, why the underlying data structure choose B+ tree its reason?

What is the index?

A table with 5 million data, execute a WHERE query on name field with no index:

select * from user_innodb where name ='小马';
Copy the code

What if the name field has an index on it? Create an index on the name field and perform the same query again.

ALTER TABLE user_innodb DROP INDEX idx_name; 
ALTER TABLE user_innodb ADD INDEX idx_name (name);
Copy the code

Queries with indexes are dozens of times more efficient than queries without indexes.

Through this case we should be very intuitive feel, index for data retrieval performance improvement is very large.

So what exactly is an index? Why does it have such a big impact on our query? What happens when you create the index?

Index definition

A database index is a sorted data structure in a database management system (DBMS) to help quickly query and update data in a database table.

Data is stored on disk as a file, and each row of data has its disk address. Without an index, we would have to retrieve a single piece of data from 5 million rows, traversing the table until we found it.

But after we have the index, we only need to retrieve the data in the index, because it is a special data structure for fast retrieval, we find the disk address of the data, we can get the data.

The index type

In InnoDB, there are three index types: normal index, unique index (primary key index is a special unique index), and full text index.

Normal: Also called non-unique index, it is the most common index without any restriction.

Unique: A Unique index requires that the key value must be Unique. It is also important to note that a primary key index is a special kind of unique index, which has an additional constraint that the key value cannot be empty. The primary key index is created with primay key.

Fulltext: For large data, such as message content, there are several KB of data, if you want to solve the problem of low efficiency of like query, you can create a Fulltext index. Full-text indexes can only be created for fields of text type, such as CHAR, vARCHar, and text.

Index is a kind of data structure, so what kind of data structure should it choose to achieve efficient data retrieval?

Index storage model deduction

Binary search

After singles’ Day, my girlfriend played a number guessing game with me. Guess how much MONEY I bought yesterday. Five chances.

10000? Low. 30000? High. How many guesses would you make next? 20000. Why don’t you guess 11,000? Why don’t you guess 29,000?

So this is the idea of binary search, or halved search, where each time we reduce the candidate by half. This is more efficient if the data is already sorted.

So first, we can think about data structures indexed by ordered arrays.

The equivalent query and comparison query of ordered arrays are very efficient, but there is a problem when updating data, which may require moving a large amount of data (change index), so it is only suitable for storing static data.

To support frequent changes, such as inserting data, we need to use linked lists. If it’s a linked list, if it’s a single linked list, it’s not efficient enough.

So, is there a linked list that can use binary lookup?

To solve this problem, the BST (Binary Search Tree), or Binary Search Tree, was born.

Binary Search Tree

All nodes in the left subtree are smaller than the parent, and all nodes in the right subtree are larger than the parent. When it’s projected onto the plane, it’s an ordered linear table.

Binary search tree can realize both fast search and fast insertion.

But there is a problem with binary search trees: the search time is dependent on the depth of the tree, and in the worst case the time complexity degrades to O(n).

What’s the worst case scenario?

It’s the same set of numbers, if we insert the numbers in order, 2, 10, 12, 15, 21, 28

At this point, the BST will become a linked list (” inclined tree “), which cannot achieve the purpose of accelerating the search speed, and the sequential search efficiency is no different.

What causes it to tilt?

Because the depth difference between the left and right subtrees is so great, the left subtree of the tree has no nodes at all — that is, it is not balanced.

So, do we have a more balanced tree that’s not so different in depth from the left and right subtrees?

This is called Balanced binary search trees, or AVL trees.

AVL Tree balanced binary Tree

Definition of balanced binary tree: the absolute value of depth difference between left and right subtrees cannot exceed 1.

What does that mean? For example, if the depth of the left subtree is 2, the depth of the right subtree can only be 1 or 3.

At this point, we will insert 1, 2, 3, 4, 5, 6 in order, so it will not become a “slanted tree”.

So how does the AVL tree balance? How can we ensure that the depth difference between the left and right subtrees is not more than 1? For example, insert 1, 2, and 3.

When we insert 1 and 2, according to the definition of binary search tree, 3 must be to the right of 2. At this time, the depth of the right node of root node 1 will become 2, but the depth of the left node will be 0, because it has no children, so it will violate the definition of balanced binary tree.

So what should we do? And since it’s a right node followed by a right node, right-right, we’re going to put the 2 up here, and that’s called left-handed rotation.

Similarly, if we insert 7, 6, 5, this is going to be left to left, and we’re going to rotate right, and we’re going to bring the 6 up.

So to maintain balance, the AVL tree performs a series of calculations and adjustments as it inserts and updates data.

Now that we’ve solved the balancing problem, how can we query data with a balanced binary tree as an index? In a balanced binary tree, what should a node, whose size is a fixed unit, store as an index?

First: the key value of the index. For example, if I create an index above the id, I will find the key value of the id in the index when I use the condition where ID =1.

Second: the disk address of the data, because the purpose of the index is to find where the data is stored.

And the third one, because it’s a binary tree, it has to have a reference to the left child and the right child, so that we can find the next node. For example, if it’s greater than 26, go to the right, go to the next node in the tree, and continue.

Let’s see what happens if we store data this way.

First of all, the indexed data is stored on the hard disk. View the size of the data and index:

select CONCAT(ROUND(SUM(DATA_LENGTH/1024/1024),2),'MB') AS data_len, 
CONCAT(ROUND(SUM(INDEX_LENGTH/1024/1024),2),'MB') as index_len 
from information_schema.TABLES 
where table_schema='gupao' and table_name='user_innodb';
Copy the code

When we use the tree structure to store the index, we need to compare the data on the Server layer to see if it is needed. If it is not, we need to read the disk again. Accessing a node requires an I/O to the disk. The smallest unit of InnoDB disk manipulation is a page (or disk block) of 16K(16384 bytes).

So, the nodes of a tree are 16K in size. If we only store one key value + data + reference in a node, such as the field of shaping, it may only use a dozen or dozens of bytes, which is far less than 16K capacity, so it wastes a lot of space when accessing a tree node and performing an IO.

So if each node stores too little data, to find the data we need from the index, we have to visit more nodes, which means too many interactions with disk.

In the era of mechanical hard disks, each reading of data from disk takes about 10ms of addressing time, and the more interactions, the more time is consumed.

For example, in the figure above, we have 6 entries in a table. When we query id=37, we need to interact with disk 3 times to query two child nodes. What if we have millions of entries? That time is even harder to estimate.

So what’s our solution?

The first is to make each node store more data.

Second, the more keywords on a node, the more Pointers we have, which means more forks.

Because the more branches there are, the deeper the tree becomes (the root node is 0). In this way, is our tree from the original tall and thin appearance, into a squat and squat appearance?

At this point, our tree is no longer binary, but multi-fork, or multipath.

Multi-path balanced search Tree (B Tree)

Like AVL trees, B trees store key values, data addresses, and node references at the branch and leaf nodes.

It has a feature: the number of forks (ways) is always 1 more than the key word count. For example, the tree we drew, with two keys per node, would have three Pointers to three child nodes.

What is the lookup rule for B Tree?

Let’s say we’re looking for 15 in this table. Because 15 is less than 17. Go left. Because 15 is greater than 12. Go to the right. 15 was found in disk block 7, using only 3 IO.

Is this more efficient than AVL trees? How can B Tree store multiple keywords in one node and still maintain a balance? How is it different from an AVL tree?

For example, when Max Degree is 3, we insert data 1, 2, 3. When we insert 3, it should be the first disk block, but if a node has three keywords, it means that there are four Pointers, the child node will become four. So you have to split (B+Tree). I’m going to pull up the middle number 2, and I’m going to turn 1 and 3 into children of 2.

If the node is deleted, the reverse merge operation is performed.

Notice that there is splitting and merging, which is not the same as left and right rotation of AVL trees.

If we continue to insert 4 and 5, the B Tree will split and merge again.

We can also see that there are a lot of changes to the index structure when updating the index, which explains why we should not build indexes on frequently updated columns, or why we should not update primary keys.

Node splitting and merging is actually splitting and merging InnoDB pages.

B+ Tree B+ Tree

MySQL > modify B Tree to use B+Tree

Overall, this modified version of the B-tree solves more comprehensive problems than the B-tree.

Let’s look at the storage structure of InnoDB’s B+ tree:

MySQL > select * from B+Tree;

  1. The number of keywords is equal to the number of paths;

  2. No data is stored in the root node or branch node of a B+Tree. Only the leaf node stores data. Search keyword will not return directly, will go to the last layer of the leaf node. For example, if I search for ID =28, although I hit it directly in the first layer, all the data is above the leaf node, so I have to continue to search until I reach the leaf node.

  3. Each leaf node of B+Tree adds a pointer to the adjacent leaf node, and its last data points to the first data of the next leaf node, forming an ordered linked list structure.

  4. It retrieves data according to the left closed right open interval [).

B+Tree data search process:

  1. 28, such as we need to find the root node to find the keys, but because it is not the page child nodes, so will continue to search, 28 is [28,66) left closed right away the range of the critical value, so will walk in the middle of the child nodes, and then continue to search, it is [28, 34) left closed right away the range of the critical value, so will walk on the left side of the child nodes, Finally, the required data is found on the leaf node.

  2. Second, if it is a range query, for example, to query data from 22 to 60, when 22 is found, all data nodes can be accessed at one time by traversing the sequence of nodes and Pointers, which greatly improves the interval query efficiency (there is no need to go back to the parent node and repeat the search).

InnoDB B+Tree

  1. It’s a variant of B Tree, and it solves all the problems that B Tree solves. What are the two biggest problems solved by B Tree? (More keywords per node; More ways);

  2. Stronger ability to scan libraries and tables (if we want to scan the whole table, we only need to traverse leaf nodes, not the whole B+Tree to get all the data);

  3. The disk read and write capability of B+Tree is stronger than that of B Tree (root nodes and branches do not store data areas, so one node can store more keywords and load more keywords at a time).

  4. Better sorting ability (because the leaf node has a pointer to the next data area, the data forms a linked list);

  5. The efficiency is more stable (B+Tree always gets data from the leaf node, so the IO count is stable).

Afterword.

MySQL has chosen to use B+ tree as its index data structure model.

In the next article, we will continue to cover the rules for the use of indexes and their creation and use. Like, follow and bookmark articles that have been helpful to you.