The article directories

  • preface
  • Binary search tree (BST)
  • 2. Balanced Binary Tree (AVL)
  • Red black tree
  • Fourth, B tree
  • Fifth, B + tree
  • 6. Advantages of B+ trees
  • Seven,
  • Write at the end, thanks for the likes

preface

In MySQL, both Innodb and MyIsam use B+ tree as index structure (hash and other indexes are not considered here). So why use B+ tree as index structure? Not B tree, equilibrium, red and black? After reading this article, you will understand that B+ is considered optimal in terms of performance and efficiency (stability).

Binary search tree (BST)

Binary Search Tree (BST), also known as Binary sorting Tree, on the basis of Binary Tree, all nodes in the left subtree of any node must have values not greater than the root node, and all nodes in the right subtree of any node must have values not less than the root node.



Storing data in a binary tree is a common choice when quick lookups are needed, since the query time depends on the tree height and the average time complexity is O(LGN). However, in extreme cases, tree skew occurs, as shown in the figure below, where the binary tree degenerates into a linked list,



The time complexity is reduced to O(n).

To solve this problem, a balanced binary tree is introduced.

2. Balanced Binary Tree (AVL)

AVL trees are strictly balanced binary trees, and the height difference between left and right subtrees of all nodes cannot exceed 1. AVL tree lookup, insertion, and deletion are O(LGN) on average and in the worst case. The key to AVL balance is rotation operations: insertion and deletion can upset the balance of the binary tree, requiring one or more tree rotations to rebalance the tree. When inserting data, only 1 rotation is required at most (single or double rotation); However, when deleting data, the tree will be unbalanced, and AVL needs to maintain the balance of all nodes along the path from the deleted node to the root node with an order of O(LGN) of rotation. Due to the time consuming of rotation, AVL tree is very inefficient in deleting data; When there are many deletions, the cost of maintaining balance may outweigh the benefits, so AVL is not widely used.

Red black tree

In contrast to AVL trees, red-black trees are not strictly balanced, but roughly balanced: just ensuring that the longest possible path from root to leaf is no more than twice as long as the shortest possible path. From the perspective of implementation, the biggest characteristic of red-black tree is that each node belongs to one of two colors (red or black), and the division of node color needs to meet specific rules (detailed rules are omitted). The following is an example of a red-black tree:



Compared with AVL tree, the query efficiency of red-black tree will be reduced, because the balance of the tree becomes worse and the height is higher (its longest path is twice that of the shortest path, but the height of the balanced tree is not more than 1, so the natural height will be higher and the query time will be longer). However, the removal efficiency of red-black trees is greatly improved because red-black trees also introduce color. When inserting or deleting data, only O(1) rotations and color changes are required to ensure basic balance, rather than O(LGN) rotations (at the time of deletion) as AVL trees do. In general, red-black trees have better statistical performance than AVL.

Therefore, in practical applications, AVL trees are used relatively little, while red-black trees are widely used. For example, TreeMap in Java uses red-black trees to store sorting key-value pairs; HashMap in Java8 uses linked lists + red-black trees to resolve hash collisions (linked lists when there are few conflicting nodes, red-black trees when there are many).

Red-black trees perform very well when the data is in memory (such as TreeMap and HashMap above). But in the case of data on secondary storage devices such as disks (such as databases such as MySQL), red-black trees are not good because they are still too tall. When data is on disk, disk I/O becomes the biggest performance bottleneck. The design goal should be to minimize the number of I/OS. The higher the tree height is, the more I/OS required for add, delete, modify, and query, which seriously affects the performance.

Fourth, B tree

A B-tree, also known as a B-tree (not a minus sign), is a multi-path balanced search tree designed for auxiliary storage devices such as disks. Compared with a binary tree, each non-leaf node of a tree can have more than one sub-tree. Therefore, when the total number of nodes is the same, the height of B tree is much smaller than AVL tree and red-black tree (B tree is a “dumpy”), and the disk I/O times are greatly reduced.

The most important concept to define B-tree is Order. For an M-order B-tree, the following conditions should be met:

Each node contains a maximum of m child nodes.

If the root node contains child nodes, it contains at least two child nodes. In addition to the root node, each non-leaf node contains at least M /2 child nodes.

A non-leaf node with K child nodes will contain K-1 records.

All leaf nodes are in the same layer.

It can be seen that the definition of B-tree is mainly to limit the number of child nodes and records of non-leaf nodes.

Here is an example of a third-order B-tree:

In addition to the advantages of small tree height, B trees also make use of the principle of access locality. So-called locality principle (mysql can read data one-time read full 16 k, and if you query a data 1 k, the article 15 of the adjacent data are loaded in memory, the next time you find relevant is the system cache), refers to when a data is used, the data on the outskirts of the larger probability are used in a short period of time. B tree stores data with similar keys in the same node. When accessing one of the data, the database will read the whole node into the cache. When adjacent data is immediately accessed, it can be read directly from the cache without disk I/O. In other words, b-trees have a higher cache hit ratio. B tree has some applications in database, such as mongodb index using B tree structure. However, in many database applications, the variant B+ tree is used.

Fifth, B + tree

B+ tree is also a multi-path balanced search tree, which differs from B tree mainly in: Every node in the tree B (including the leaf node and the non-leaf nodes) real data are stored, only leaves of B + tree node storing the real data, the leaf nodes only store the key (so B + tree index of the leaf node can store more, final data on the leaf node, tree height is not high, the query efficiency is stable, Whatever data is queried is ultimately determined by the tree height, because the data is at the page byte point). In MySQL, the actual data may be the entire row (as in Innodb’s clustered index), the primary key of the row (as in Innodb’s secondary index), or the address of the row (as in MyIsam’s non-clustered index. MyIsam’s data and index are separate files. Not like InnoDB clustering index data all in one piece). Whereas a record in a B tree appears only once and does not repeat itself, a key in a B+ tree may repeat itself — either at a leaf or at a non-leaf node. The leaves of a B+ tree are linked by a bidirectional linked list. (Range lookup, while B-tree can only iterate) Non-leaf nodes in B-tree, the number of records is 1 less than the number of child nodes; The number of records in a B+ tree is the same as the number of child nodes. Therefore, compared with B tree, B+ tree has the following advantages: Fewer IO times: The non-leaf nodes of B+ tree only contain keys, not real data, so the number of records stored by each node is much more than that of B number (that is, order M is larger), so the height of B+ tree is lower, and fewer IO times are required for accessing. In addition, because there are more records stored per node, the principle of access locality is better utilized and the cache hit ratio is higher. More suitable for range query: when range query is carried out in B tree, the lower limit to be searched is found first, and then the middle order traversal of the B tree is carried out until the upper limit of the search is found; The range query of B+ tree only needs to traverse the linked list. More stable query efficiency: The query time complexity of A B tree is between 1 and tree height (recorded at the root and leaf nodes respectively), whereas the query complexity of a B+ tree is stable at tree height because all data is located at the leaf node. B+ trees also have a disadvantage: they take up more space because the keys are repeated. However, the spatial disadvantage is often acceptable compared to the performance advantage, so B+ trees are more widely used in databases than B trees.

6. Advantages of B+ trees

As mentioned earlier, the biggest advantage of a B tree /B+ tree over a binary tree like a red-black tree is that the tree is smaller in height. In fact, for Innodb’s B+ index, the height of the tree is usually 2-4 levels. Now let’s do some concrete estimates. The height of a tree is determined by the order, and the higher the order, the shorter the tree; The size of the order depends on how many records each node can store. Each Innodb node uses a page, the size of the page is 16KB, the metadata is only about 128 bytes (including file management header information, page header information, etc.), most of the space is used to store data. For non-leaf nodes, the record contains only the key of the index and a pointer to the node at the next level. Assuming 1000 records are stored per non-leaf page, each record occupies about 16 bytes. This assumption is reasonable when the index is an integer or a short string. By extension, we often hear the advice that the index column length should not be too large for this reason: if the index column is too long and each node contains too few records, the tree will be too tall, the index will be less effective, and the index will waste more space. For leaf nodes, records contain keys and values of indexes (values may be the primary key of a row, a full row of data, and so on, as described above), and the amount of data is larger. It is assumed that 100 records are stored per leaf page (in fact, this number may be less than 100 when the index is clustered; When the index is secondary, this number can be much higher than 100; Can be estimated according to the actual situation). For a 3-tier B+ tree, the first tier (root node) has 1 page and can store 1000 records; The second layer has 1000 pages and can store 1000 * 1000 records; The third layer (leaf node) has 1000 * 1000 pages, and each page can store 100 records, so it can store 1000 * 1000 * 100 records, or 100 million records. For binary trees, storing 100 million records requires about 26 layers.

Seven,

Finally, we summarize the problems solved by all kinds of trees and the new problems: binary search tree (BST) : solve the basic problem of sorting, but because of the balance can not guarantee, may degenerate into a linked list; Balanced binary tree (AVL) : the problem of balance is solved by rotation, but the efficiency of rotation operation is too low. Red-black tree: By abandoning strict balance and introducing red-black nodes, the problem of low AVL rotation efficiency is solved. However, in disk and other scenarios, the tree is still too high and IO times are too many. B tree: the problem of too high tree is solved by changing binary tree to multi-path balanced search tree. B+ tree: On the basis of B tree, non-leaf nodes are transformed into pure index nodes that do not store data, further reducing the height of the tree; In addition, leaf nodes are linked into linked lists using Pointers, which makes range query more efficient.

Write at the end, thanks for the likes

Welcome to follow my wechat official account [Village of the Apes]

Insert a picture description here

To talk about Java interview and my wechat further communication and learning, wechat mobile search [codeYuanzhicunup] can be added if there are related technical problems welcome to leave a message to discuss, the public number is mainly used for technology sharing, including often meet test analysis, as well as source code interpretation, micro service framework, technical hot spots, etc..