MySQL database index select B+ tree before we further analyze why MySQL database index choose to use B+ tree, I believe many friends of the data structure of the tree is still a little fuzzy, so we step by step to explore the evolution process of tree, step by step lead to the B tree and why MySQL database index choose to use B+ tree!

Those of you who have studied data structures usually know the most basic trees, so we’ll start with binary lookup trees, which are more closely related to our topic.

Binary search tree

(1) Introduction to binary tree:

Binary search tree is also called ordered binary search tree, which meets the general properties of binary search tree. It refers to an empty tree with the following properties:

1. If the left subtree of any node is not empty, the value of the left subtree is less than that of the root node;

2. If the right subtree of any node is not empty, the value of the right subtree is greater than that of the root node;

3. The left and right subtrees of any node are binary search trees respectively;

4. No node with the same key value;



The figure above shows a normal binary search tree, with outputs sorted from smallest to largest in a middle-order traversal: 2, 3, 5, 6, 7, 8.

Search the binary tree above, for example, search the record whose key value is 5, first find the root, whose key value is 6, 6 is greater than 5, so search the left subtree of 6, find 3; If 5 is greater than 3, find the right subtree; Three times in total. If you find the same demand 3 times in order of 2, 3, 5, 6, 7, 8. Using the same method to find the record with key 8, this time it takes three lookups, as opposed to six sequential lookups. The average number of searches is calculated as follows: the average number of searches for sequential searches is (1+2+3+4+5+6) /6= 3.3 times, and the average number of searches for binary search trees is (3+3+3+2+2+1) /6=2.3 times. The average search speed of binary search tree is faster than that of sequential search.

(2) Limitations and applications

A binary search tree is randomly composed of n nodes, so for some cases, the binary search tree will degenerate into a linear chain with n nodes. The diagram below:



If you look at the figure above, if we choose the smallest or the largest root node, then the binary search tree is completely reduced to a linear structure. The average number of lookups in the figure above is (1+2+3+4+5+5) /6=3.16, about the same as sequential lookups. Obviously the binary tree query efficiency is very low, so if the structure of the biggest performance a bintree, need the binary tree is balanced (the balance can be seen from one trait that the height of the height of a tree than the last one to lose, in the case of the same node is not balance), and raises a new definition – AVL balanced binary tree.

Second, the AVL tree

(1) Introduction

AVL tree is a binary search tree with balance condition, which is generally judged by the difference of balance factor and balanced by rotation. The height of left and right subtrees is no more than 1. Compared with red-black trees, it is a strict balanced binary tree, and the balance condition must be satisfied (the height difference between left and right subtrees of all nodes is no more than 1). No matter whether we perform insert or delete operations, as long as the above conditions are not met, we need to maintain balance through rotation, which is very time-consuming, so we can know that AVL tree is suitable for the case of less insert and delete times, but more search.



From the above is an ordinary balanced binary tree. It can be seen from this figure that the balance factor difference between the left and right subtrees of any node is not greater than 1.

(2) Limitations

Because the cost of maintaining such a high balance is greater than the efficiency gain from it, there are few practical applications, and more red black trees that pursue local rather than very strict global balance. Of course, AVL is better than red black trees in scenarios where insertion and deletion are not frequent but where lookups are high.

(3) Application

1. Widely used in Windows NT kernel;

Red black tree

(1) Introduction

A binary search tree, but with a memory bit added to each node to indicate the node’s color, which can be red or black. Red-black trees ensure that no path is twice as long as any other by limiting how each node is colored along any path from root to leaf. It is a weakly balanced binary tree (if it is balanced, it can be derived that the height of AVL tree is lower than that of red-black tree under the same node situation). Compared with the STRICT AVL tree, its rotation times are less, so we use red-black tree in the case of many search, insert and delete operations.

(2) Nature

1. Each node is either red or black; 2. The root node is black; Every leaf node (i.e., the NULL pointer at the end of the tree or the NULL node) is black; 4. If a node is red, its two sons are black; 5. For any node, each path to the leaf point tree NULL pointer contains the same number of black nodes; 6. Each path contains the same black node;

(3) Application

1, widely used in C++ STL, Map and Set are implemented by red-black tree; 2. The well-known Linux process Scheduler Completely Fair Scheduler manages the process control block with the red-black tree. The virtual memory area of the process is stored in a red-black tree, and each virtual address area corresponds to a node of the red-black tree, and the left pointer points to the adjacent address virtual storage area. The right pointer points to an adjacent high-address virtual address space; 3. The implementation of IO multiplexing epoll adopts red-black tree to organize and manage SOCKFD to support fast add, delete, change and search; 4. Nginx uses red-black tree to manage timer, because red-black tree is orderly, it can quickly get the smallest distance from the current timer; 5. Realization of TreeMap in Java;

Fourth, B/B + tree

We have not yet understood why MySQL uses B+ trees as an index, but we will first explore what B trees are.

(1) Introduction

Our data in MySQL is generally stored in the disk. When reading data, there must be operations to access the disk. There are two mechanical parts in the disk, namely disk rotation and magnetic arm movement. Disk rotation is the number of revolutions per minute mentioned in our market, and disk movement is in the disk rotation to a specified position, after moving the magnetic arm began to read and write data. Then there is the process of locating the block in the disk, and locating is a relatively time-consuming piece of disk access, after all, mechanical movement takes much more time than electronic movement. Location is obviously a very time consuming process when large-scale data is stored on disk, but we can optimize it by b-tree to improve the efficiency of location when disk is read.

Why can B trees be optimized? We can according to the characteristics of the class B tree, more than an order of class B tree structure, and then in as much as possible on the node to store information, ensure the layer number of less as far as possible, so that later we can find information faster, disk I/O operations are less, and class B tree is balanced tree, the height of each node to the leaf node is the same, This also ensures that each query is stable.

Overall, B/B + tree is designed to disk or other storage device of a balance of multiple search trees (B, relative to the binary tree each node has several branches), compared with the red-black tree, in the case of the same node, the height of a B/B + tree far less than the height of the red-black tree (in the B/B + tree) in the performance analysis. The operation time on the B/B+ tree consists of the disk access time and the CPU computing time. The CPU speed is very fast. Therefore, the operation efficiency of the B tree depends on the number of disk access times.

Notice that a b-tree is just a b-tree, a – is just a symbol.

(2) Properties of B-trees

1. Define that any non-leaf node has at most M sons, and M>2;

2. The number of sons of the root node is [2, M];

3. The number of sons of non-leaf nodes except root nodes is [M/2, M];

4. Each node stores at least M/2-1 (rounded) and at most M-1 keywords; (At least 2 keywords)

5. Number of keywords of non-leaf nodes = number of Pointers to sons -1;

6. Keywords of non-leaf nodes: K[1], K[2]… [M, K – 1); And K[I] < K[I +1];

7. Pointers to non-leaf nodes: P[1], P[2]… P [M]; Where P[1] points to the subtree whose keyword is less than K[1], P[M] points to the subtree whose keyword is greater than K[m-1], and other P[I] points to the subtree whose keyword belongs to (K[i-1], K[I]).

8. All leaf nodes are located in the same layer;



Here is just a simple B-tree. In practice, there are many key words in b-tree nodes. In the figure above, for example, 35 nodes represent a key(index), and the small black block represents the actual storage location of the content pointed to by the key in memory, which is a pointer.

Fifth, B + tree

(1) Introduction

B + tree is a file system and the required deformation of a binary tree tree (primary index file directory level, only the bottom of the leaf node (file) save data) not a leaf node only save index, don’t save the actual data, the data is stored in the leaf node, is this file system file to find?

Let’s take a file search example: there are three folders A, B, and C, where A contains B and B contains C, a file Yang. c, where A, B, and C are indexes (stored on non-leaf nodes), a, B, and C are just the keys to find Yang. c, and the actual data yang.c is stored on leaf nodes.

All non-leaf nodes can be considered index parts!

(2) Properties of B+ tree (the following are different from the properties of B tree)

1. The number of subtree Pointers and keywords of non-leaf nodes is the same; 2. The subtree pointer p[I] of the non-leaf node points to the subtree whose keyword value belongs to [k[I],k[I +1]]. 3. Add a chain pointer to all leaf nodes; 4. All keywords appear in the leaf node (dense index). (And the keywords in the linked list happen to be ordered); 5. Non-leaf nodes are equivalent to the index (sparse index) of leaf nodes, while leaf nodes are equivalent to the data layer storing (keyword) data. 6. More suitable for file systems;

Non-leaf nodes (e.g. 5,28,65) are only key (index). Actual data stored on leaf nodes (5,8,9) are real data or Pointers to real data.

(3) Application

B and B+ trees are used to index file systems and databases such as MySQL.

B/B+ tree performance analysis

The height of the balanced binary tree of n nodes is H(logn), while the height of B/B+ tree of N nodes is logt((n+1)/2)+1. As a lookup table in memory, B trees are not necessarily better than balanced binary trees, especially if M is large. Because the CPU time of the lookup operation on the B-tree is O(mlogtn)=O(LGN (m/ LGT)), and m/ LGT >1; So the operation time of O(mlog TN) is much longer than that of balanced binary tree when M is large. Therefore, to use b-trees in memory, you must take smaller m. (Usually taking the minimum value m=3, the b-tree can have 2 or 3 children per internal node, such third-order B-trees are called 2-3 trees).

Why B+ tree is better than B tree for database index?

1, B+ tree disk read and write cost is lower: B + tree internal nodes and no pointer to keyword specific information, so the internal nodes relative smaller B tree, if you put all the same internal nodes keyword in the same disk block, then disk blocks can hold is, the more the number of keywords, one-time read into memory to find the key words, the more, the relative number of IO, speaking, reading and writing was reduced.

2. The query efficiency of B+ tree is more stable because the non-endpoints are not the nodes that ultimately point to the content of the file, but only the index of the keyword in the leaf node. So any keyword lookup must take a path from root to leaf. The length of all keyword query paths is the same, resulting in the same query efficiency of each data.

3, as a result of B + tree data is stored in the leaf node, branch node for the index, convenient sweep library, you just need to sweep the leaf node again, but because of the B tree branch node stores the same data, we need to find the specific data, need to be in a sequence traversal to sweep in order, so the B + tree is more suitable for range queries, So usually B+ trees are used for database indexes.

PS: I read this on Zhihu, and I think what I said is quite reasonable:

They believe that the main reason why database index adopts B+ tree is that B tree improves IO performance and does not solve the problem of low efficiency of element traversal. It is in order to solve this problem that B+ tree application was born. B+ trees can traverse the entire tree simply by traversing the leaves. Moreover, range-based queries are very frequent in the database, and b-trees do not support such operations or are inefficient.

I read several articles today and summarize them by myself.

The database must use B+ trees to improve search efficiency.

But how to improve search efficiency?

The easiest way to find data is to find it sequentially. But for hundreds of thousands, millions, or even hundreds of millions of databases, queries are slow.

So to find the way to optimize, familiar binary search, binary tree can speed up to O(log(n,2)), the bottleneck of the query is the depth of the tree, the worst case to find the deepest layer of the binary tree, because, each search a deep layer, will access the index file of the deeper layer. In up to gigabytes of index files, this can be a lot of overhead. Therefore, try to design the data structure to be more ‘chunky’ to reduce the number of layers accessed. Among the many solutions, B-/B+ trees fit well. The definition of a B-tree can be seen in detail. In short, the middle node can have more than two children, and the middle element can be a domain. Compared with B- tree, the parent node of B+ tree must also exist in the child node, which is the largest or smallest element. The node of B+ tree only stores the index key value, and the address of the specific information exists in the address of the leaf node. This allows more nodes to be placed in the page-by-page index. Reduce more I/O expenses. Therefore, B+ tree has become an excellent data structure for database. MyIsAM and InnoDB in MySQL both use B+ tree structure. The difference is that the former is a non-clustered index, while the latter is a clustered index. The so-called clustered index is the index stored continuously by the physical address. When fetching the interval, the search speed is very fast, but also, the insertion speed will be affected and reduced. The physical location of the clustered index is stored using linked lists.