Piece | MySQL index principle of B + tree, and built the index of a few big principle

MySQL actually uses different storage engines are also very different, the following friends can understand.

Comparison of storage engines

Note: The above mentioned b-tree index does not indicate b-tree and B+Tree index, but the definition of B-tree and B+Tree is different.

In MySQL, there are four types of indexes: B-tree index, Hash index, Fulltext index, and R-tree index.

B-tree index is the most frequently used index type in MySQL database. All storage engines except Archive support B-tree indexes. The Archive engine did not support indexing until MySQL 5.1, and only supports indexing a single AUTO_INCREMENT column.

Not only in MySQL, in fact, in many other database management systems b-tree index is also the most important index type, mainly because the storage structure of B-tree index has very good performance in database data retrieval.

In general, most of the physical files in the MySQL B-tree index are stored in the structure of the Balance Tree, that is, all the data needed are stored in the Leaf Node of the Tree. And the length of the shortest path to any Leaf Node is exactly the same, so we call it a B-tree index.

Of course, it is possible that various databases (or MySQL’s various storage engines) will modify the storage structure slightly when hosting their own B-tree indexes. For example, the actual storage structure used by Innodb storage engine b-tree index is actually B+Tree, that is, on the basis of b-tree data structure made a small transformation, in each Leaf Node to store the relevant information of index keys, Pointer information pointing to the next LeafNode adjacent to the LeafNode is also stored (sequential access Pointers are added), which is mainly to speed up the efficiency of retrieving multiple adjacent Leaf nodes.

InnoDB is the default storage engine for Mysql (MyISAM before Mysql5.5.5)

So let’s first look at the concept of B- trees, B+ trees. Figure out, why does adding an index speed up queries?

B- tree, B+ tree concept

B tree

Binary search tree:

1. All non-leaf nodes have at most two sons (Left and Right);

2. All nodes store a keyword;

3. The left pointer of a non-leaf node points to a subtree smaller than its keyword, and the right pointer points to a subtree larger than its keyword.

Such as:

b-tree

Is a multipath search tree (not binary) :

1. Define that any non-leaf node has at most M sons; And M > 2;

2. The number of sons of the root node is [2, M];

3. The number of sons of non-leaf nodes except root nodes is [M/2, M];

4. Each node stores at least M/2-1 (rounded) and at most M-1 keywords; (At least 2 keywords)

5. Number of keywords of non-leaf nodes = number of Pointers to sons -1;

6. Keywords of non-leaf nodes: K[1], K[2]… [M, K – 1); And K[I] < K[I +1];

7. Pointers to non-leaf nodes: P[1], P[2]… P [M]; Where P[1] points to the subtree whose keyword is less than K[1], P[M] points to the subtree whose keyword is greater than K[m-1], and other P[I] points to the subtree whose keyword belongs to (K[i-1], K[I]).

8. All leaf nodes are located in the same layer;

M=3

B-tree search, starting from the root node, the keyword (ordered) sequence within the node for binary search, if the match is finished, otherwise enter the search keyword range of the son node; Repeat until the corresponding son pointer is null or already a leaf node;

B-tree features:

1. The keyword set is distributed in the whole tree;

2. Any keyword appears and only appears in one node;

3. The search may end at non-leaf nodes;

4, its search performance is equivalent to a binary search in the whole set of keywords;

5, automatic level control;

As non-leaf nodes except root nodes are limited, they contain at least M/2 sons, ensuring at least utilization of nodes. Therefore, b-tree performance is always equivalent to binary search (independent of M value), so there is no problem of B-tree balance;

Due to the limitation of M/2, when nodes are inserted, if they are full, they need to be split into two nodes, each accounting for M/2. When deleting a node, two sibling nodes less than M/2 should be merged.

B + tree

A B+ tree is a variant of the B- tree and a multipath search tree:

1. Its definition is basically the same as b-tree, except that:

2. The number of subtree Pointers and keywords of non-leaf nodes is the same;

3. The subtree pointer P[I] of non-leaf node points to the subtree whose keyword value belongs to [K[I], K[I +1]) (b-tree is the open interval);

5. Add a chain pointer to all leaf nodes;

6. All keywords appear in leaf nodes;

M=3

The search for B+ is basically the same as that for B-trees, except that a B+ tree hits only when it reaches a leaf

Non-leaf node hit), its performance is equivalent to a binary search in the whole set of keywords;

B+ features:

1. All keywords appear in the linked list of leaf nodes (dense index), and the keywords in the linked list happen to be ordered;

2. Impossible to hit in non-leaf nodes;

3. Non-leaf node is equivalent to the index of leaf node (sparse index), and leaf node is equivalent to the data layer storing (keyword) data;

4, more suitable for file indexing system;

After understanding the concept of B-/B+ tree, we continue to analyze the principle of B+ tree to improve efficiency.

B+ tree index principle

As shown above, it is a b + tree, about the definition of b + tree can see b + tree, said only a few key here, light blue piece of what we call a disk block, you can see every disk block contains several data items (shown in blue) and pointer (shown in yellow), such as disk blocks 1 contains data item 17 and 35, contains a pointer (P1, P2, P3, P1 indicates a disk block smaller than 17, P2 indicates a disk block between 17 and 35, and P3 indicates a disk block larger than 35. Real data exists in leaf nodes 3, 5, 9, 10, 13, 15, 28, 29, 36, 60, 75, 79, 90, 99. Non-leaf nodes only store the real data and only the data items that guide the search direction, such as 17 and 35, which do not really exist in the data table.

B + tree search process

As shown in the figure, if data item 29 is to be searched, disk block 1 will be loaded from disk to memory first. At this time, I/O will occur. In memory, binary search is used to determine that 29 is between 17 and 35. By disk 1 piece of P2 pointer disk address the disk block 3 by disk loaded into memory, in the second IO, 29 between 26 and 30, locking disk blocks of the 3 P2 pointer, pointer loading disk blocks through eight into memory, in the third IO, binary search to find memory do 29 at the same time, the end of the query, a total of three IO. The reality is that a 3-tier B + tree can represent millions of data. If millions of data lookups take only three IO, the performance improvement is huge. If there is no index and each data item takes one IO, the total number of IO will be millions, which is obviously very, very expensive.

B + tree

1. Through the above analysis, we know that the number of I/OS depends on the height of b+ number H. Assuming that the data in the current data table is N and the number of data items in each disk block is M, then h=㏒(m+1)N. The size of a disk block is the size of a data page. The size of a disk block is fixed. If the space occupied by data items is smaller, the number of data items is larger and the height of the tree is lower. That’s why each data item, or index field, should be as small as possible; for example, int is 4 bytes, less than half of bigInt8 bytes. This is why b+ trees require real data to be placed in leaf nodes rather than inner nodes, where the number of data items in a disk block drops dramatically, causing the tree to grow taller. When the data item is equal to 1 it will degenerate into a linear table.

2. When the data items of the b+ tree are compound data structures, such as (name,age,sex), the search tree is built on the order of b+ numbers from left to right. For example, when the data of (zhang 3,20,F) is searched, the b+ tree will compare the name preferentially to determine the direction of the next search. If the name is the same, then age and sex are compared in turn to obtain the retrieved data. However, when data such as (20,F) without name comes, b+ tree does not know which node to search next, because name is the first comparison factor when the search tree is established, and search must be conducted according to name first to know where to search next. For example, when (three,F) data to search, b+ tree can use name to specify the search direction, but the next field age is missing, so we can only find the data whose name is equal to three, and then match the data whose gender is F, this is a very important property, that is, the left-most matching feature of the index.

Slow query optimization

MySQL index principle is rather boring things, you only need to have a perceptual understanding, do not need to understand very thorough and in-depth. Let’s go back to the beginning of the slow query we said, after understanding the principle of indexing, do you have any ideas? To summarize the basic principles of indexing,

Four, index building principles

Mysql > select * from (a,b,c,d); mysql > select * from (b,c,d); mysql > select * from (a,b,c,d); D (a,b,d,c); d (a, B,d,c); d (a, B,d);

2, = and in can be in random order, such as a = 1 and b = 2 and c = 3 to create (a,b,c) index can be in any order, mysql query optimizer will help you optimize the form to be recognized by the index

3. Try to select columns with high distinction as indexes. The formula of distinction is count(DISTINCT Col)/count(*), which indicates the proportion of fields that are not duplicated. So one might ask, is there any empirical value to this ratio? This value is also difficult to determine in different application scenarios. Generally, we require join fields to be above 0.1, that is, 10 records are scanned per item on average

Select froM_unixtime (create_time) = ‘2014-05-29’ from ‘b’; select from_unixtime(create_time) = ‘2014-05-29’ from ‘b’; Obviously the cost is too great. Create_time = unix_timestamp(‘ 2014-05-29 ‘);

5, as far as possible to expand the index, do not create a new index. For example, if you want to add (a,b) to a table that already has an index of A, you only need to modify the original index.

If you want to learn programming, please searchCircle T community, more industry related information and industry related free video tutorials. It’s totally free!

Piece | MySQL index principle of B + tree, and built the index of a few big principle

Comparison of storage engines

B- tree, B+ tree concept

B+ tree index principle

Four, index building principles

Related Posts

RabbitMQ Unfair Distribution of messages and Prevalues | RabbitMQ Series (5)

Cloud computing network security challenges and opportunities

From Understanding to Using RPC Framework Dubbo (PART 1)