This article introduces Mysql InnoDB index related knowledge, from various trees to index principle to storage details.

InnoDB is the default storage engine for Mysql (MyISAM before Mysql5.5.5, documentation). For the purpose of efficient learning, this paper mainly introduces InnoDB, with a small amount of MyISAM as a comparison.

This article was summarized by me in the learning process, and the content is mainly from books and blogs (references will be given). I added some of my own understanding in the process, please kindly point out any inaccurate description.

1 Various tree structures

I wasn’t going to start with binary search trees, because there are already too many articles on the web, but I added this section because it helps to understand the problem clearly and for the sake of completeness.

Let’s take a look at some tree structures:

1 search binary tree: each node has two child nodes, and the increase of data volume will inevitably lead to a rapid increase in height. Obviously, this is not suitable for a large amount of data storage infrastructure.

2 B tree: an m – order B tree is a balanced M – way search tree. The most important property is that each non-root node contains j of chrysene m/2 erres-1 <= j <= m-1; The number of children of a node is 1 more than the number of keywords, so that the keywords become the segmentation flag of children nodes. The key words are usually drawn in the middle of the child nodes in the diagram, which is very vivid and easy to distinguish from the B+ tree behind. Since the data exists in both leaf nodes and non-leaf nodes, it is not easy to traverse the key words in B tree in order, so the middle-order traversal method must be used.

3 B+ tree: An m-order B tree is a balanced M-way search tree. The most important property is that each non-root node contains j of chrysene m/2 erres-1 <= j <= m; The number of subtrees can be up to the number of keywords. The non-leaf node stores the smallest key in the subtree. At the same time, data nodes only exist in leaf nodes, and horizontal Pointers are added between leaf nodes, so it is very easy to traverse all data in sequence.

4 B* tree: An m-order B tree is a balanced M-way search tree. Chrysene m2/3 eserr-1 <= j <= m; 2 horizontal Pointers are added between non-leaf nodes.

B/B+/B* have similar operations, such as retrieving/inserting/deleting nodes. Here we focus only on the insertion of nodes and only analyze their insertion operations when the current node is full, because this action is slightly complex and can fully reflect the differences of several types of trees. In contrast, node retrieval is easier to achieve, while node deletion only needs to complete the opposite process of insertion (deletion is not the complete reverse operation of insertion in actual application, and usually only deletes data and reserves space for subsequent use).

First, look at the splitting of the B-tree. The red value in the figure below is the newly inserted node. Whenever a node after the full, you need to split (divided is a recursive process, refer to the following seven insert led to a split two layers), due to B a leaf node of the tree holds the same key value, so the full node split after the value will be distributed in three places: 1 original node, 2 the node’s parent, 3 new brothers from the original node (insert process reference 5, 7). Splitting may increase the height of the tree (see insertions in 3,7) or may not affect the height of the tree (see insertions in 5,6).

Tree splitting: When a node is full, if its next sibling is not, part of the data is moved to the sibling node, then the key is inserted into the original node, and finally the key of the parent node is changed (because the key range of the sibling node is changed). If the sibling is also full, a new node is added between the original node and the sibling, and 1/3 of the data is copied to the new node. Finally, a pointer to the new node is added to the parent node. You can see B
The tree must ensure that the nodes after splitting are 2/3 full. If the method of B+ tree is adopted, and the nodes that are already full are simply divided in two, each node will be only 1/2 full, which does not satisfy B
The tree takes the strategy of continuing to insert siblings after the local node is full (this is also why B

1 Each leaf node stores 468 rows of data, and each non-leaf node stores about 1200 key values, which is a balanced 1200 path search tree!

2 For a 22.1 GB table, only a 3 height B+ tree is needed to store it, which can probably meet the needs of many applications. If you increase the height to 4, the storage capacity of the B+ tree immediately increases to 25.9 tons!

3 For a 22.1 GB table, the height of the B+ tree is 3, and less than 18.8M of memory is needed to load all the non-leaf nodes into memory (how to arrive at this result? Because for a tree with a height of 2, 1203 leaf nodes only need 18.8M space, while 22.1G congbiao height is 3, and 1204 non-leaf nodes. We also assume that the size of the leaf node is larger than that of the non-leaf node because the leaf node stores rows of data while the non-leaf node has only keys and a small amount of data. With so little memory, it is very efficient to retrieve the required data with only one disk IO operation.

It can be said that the database must have an index, without which the retrieval process becomes a sequential search, O(n) time complexity is almost unbearable. It is easy to imagine a table with a single keyword indexed using a B+ tree, simply by storing the keyword on the node of the tree. When a database contains multiple fields in a single record, a B+ tree can only store primary keys. If non-primary key fields are retrieved, the primary key index is disabled and the lookup becomes sequential again. A second set of indexes should be created on the second column to be retrieved. The index is organized by a separate B+ tree. There are two common methods to solve the problem of multiple B+ trees accessing the same set of table data, called clustered index and non-clustered index. Although both names are called indexes, this is not a separate type of index, but a way of storing data. For clustered index storage, row data is stored together with primary key B+ trees. Secondary key B+ trees store only secondary and primary keys. Primary and non-primary key B+ trees are almost two types of trees. For non-clustered index storage, primary key B+ trees store Pointers to real data rows on leaf nodes, not primary keys.

InnoDB uses clustered indexes to organize primary keys into a B+ tree, and row data is stored on leaf nodes. If you use a condition like “where ID = 14” to search for primary keys, you can follow the B+ tree search algorithm to find corresponding leaf nodes, and then get row data. If a conditional search is performed on the Name column, two steps are required: the first step is to retrieve the Name in the secondary index B+ tree and reach its leaf node to obtain the corresponding primary key. In the second step, the primary key is used to perform another B+ tree retrieval operation in the primary index B+ tree species. Finally, the whole row of data can be obtained when the leaf node is reached.

MyISM uses a non-clustered index, and the two B+ trees with a non-clustered index look the same. The structure of the nodes is exactly the same, but the contents are different. The nodes of the primary key index B+ tree store the primary keys, and the secondary key index B+ tree store the secondary keys. Table data is stored in separate places, and the leaves of both B+ trees use one address to point to real table data. There is no difference between the two keys for table data. Because the index tree is independent, secondary key retrieval does not require access to the index tree of the primary key.

To visualize the difference between the two indexes, imagine a table that stores four rows of data as shown below. Id is the primary index and Name is the secondary index. The diagram clearly shows the differences between clustered and non-clustered indexes.

1 Since the row data and the leaf node are stored together, the primary key and the row data are loaded together. If the leaf node is found, the row data can be returned immediately. If the primary key Id is organized by the primary key Id, the data can be retrieved faster.

2 secondary index using the primary key as “pointer” instead of using the address values as the benefits of the pointer is decreased when a line or auxiliary index maintenance work split data page, use the primary key value as a pointer to the auxiliary index takes up more space, for the benefit of the InnoDB in mobile lines do not need to update the auxiliary index of the “needle”. That is, the position of the rows (which are located by 16K pages in the implementation, and will be covered later) will change as the data in the database changes (B+ tree node splitting and Page splitting above), and the use of clustered indexes ensures that the secondary index tree is not affected by changes in the primary key B+ tree node.

3 Page structure

If the previous content tends to explain the principle, then the implementation begins.

The Page structure is the most basic building block of InnoDB storage and the smallest unit of InnoDB disk management. All database-related content is stored in this Page structure. There are several types of pages, including B-tree Node, Undo Log Page, System Page, Transaction System Page, etc. The size of a single Page is 16K (controlled by the compiler macro UNIV_PAGE_SIZE), and each Page is uniquely identified by a 32-bit int value, which corresponds to the maximum storage capacity of InnoDB of 64TB (16Kib * 2^32 = 64Tib). The basic structure of a Page is shown below:

1 starts traversing an indexed B+ tree through the root node, and finally reaches a Page that stores all the leaf nodes through the layers of non-leaf nodes.

2 Traverses the single-linked list starting at the “Infimum” node within the Page (this traversal is often optimized) and returns successfully if the key is found. If the record reaches “supremum”, there is no appropriate key in the current Page. In this case, use the Next Page pointer of the Page to go to the Next Page and continue the search from “Infimum”.

1 Non-leaf node of primary index tree (green)

1 Min Cluster Key on Child (Min Cluster Key on Child), which is required by B+ trees to locate a specific record in a Page.

2 Child Page Number of the smallest Page, used to locate the Record.

2 Main index tree leaf node (yellow)

1 Cluster Key Fields, required for B+ trees and part of data rows

2 All non-key Fields except the primary Key. This is the collection of all other columns of the data row except the primary Key.

Here, parts 1 and 2 add up to a complete row.

3 Non-leaf node of auxiliary index tree (blue)

The minimum value (Min Secondary-Key on Child) stored in the Child node. This is required by the B+ tree to locate the specific record location in a Page.

2 primary Key (Cluster Key Fields). Why do non-leaf nodes store primary keys? Since secondary indexes may not be unique, but a B+ tree requires that the key value be unique, we combine the secondary key value with the primary key value as the true key value in the B+ tree to ensure uniqueness. But this also results in 4 bytes more nonleaf nodes than leaf nodes in the secondary index B+ tree. (The blue node has 4 bytes more than the red node in the figure below)

3 The minimum value of the Page Number (Child Page Number), function is to locate the Record.

4 Auxiliary index tree leaf node (red)

1 Secondary Key values (Secondary Key Fields), which are required for B+ trees.

2 Cluster Key Fields, used to perform another B+ tree search in the primary index tree to find the entire record.

Principle 1 is the foundation of InnoDB index. Only when we fully understand how InnoDB index works can we use it effectively.

2. Primal knowledge is particularly suitable for using illustrations, which I personally like very much.

3 about InnoDB optimization, there is a more comprehensive introduction in “High Performance Mysql”, students who are interested in Mysql optimization can obtain relevant knowledge by themselves, I have not accumulated enough to share these contents.