preface

When we find that SQL execution is slow, the natural thought is to add indexes. For range queries, the underlying structure of the index is a B+ tree. Today we are going to learn about B+ tree

Public number: a boy picking up snails

  • Tree introduction, tree species

  • B- tree, B+ tree introduction

  • B + tree insert

  • B + tree search

  • B + tree deletion

  • B+ tree classic interview questions

  • Github address, thanks to each star

Github.com/whx123/Java…

An introduction to the tree

An introduction to the tree

A tree, like an array, a linked list, or a stack, is a data structure. It consists of a finite number of nodes, which constitute a set of hierarchical relationships. It got its name because it looked like a tree. An ordinary tree looks like this:

A tree is a finite set containing N (n is an integer greater than 0) nodes and n-1 edges. It has the following characteristics:

  • Each node has either no children or a finite number of children;
  • There is a particular node that has no parent, called the root;
  • Each non-root node has one and only one parent;
  • There’s no loop in the tree

Some tree concepts:

  • Degree of node: the number of children of a node is called the degree of the node;
  • Tree degree: in a tree, the degree of the largest node is called tree degree;
  • Parent: If a node has children, the node is called the parent of its children.
  • Depth: for any node N, the depth of n is the unique path length from the root to n, and the depth of the root is 0;
  • Height: for any node N, the height of n is the longest path from n to a leaf, and the height of all leaves is 0;

The types of trees

According to the order, it can be divided into ordered trees and disordered trees:

  • Unordered tree: A tree in which the children of any node have no sequential relationship
  • Ordered tree: A tree in which the children of any node have a sequential relationship

According to the number of subtrees a node contains, it can be divided into B tree and binary tree. Binary tree can be divided into the following types:

  • Binary tree: A tree with at most two subtrees per node is called a binary tree.
  • Binary search tree: First, it is a binary tree. If the left subtree is not empty, the value of all nodes in the left subtree is less than the value of its root node. If the right subtree is not empty, the value of all nodes in the right subtree is greater than the value of its root node. The left and right subtrees are also binary sorting trees.
  • Full binary tree: a tree with two subtrees in all nodes except leaf nodes is called full binary tree.
  • Complete binary tree: if a binary tree is full except for the nodes at the last level, and the nodes at the last level are distributed from left to right
  • Huffman tree: binary tree with the shortest weighted path.
  • Red-black tree: red-black tree is a special binary search tree, each node is black or red, root node, leaf node is black. If a node is red, its children must be black.
  • Balanced binary tree (AVL) : The absolute value of the difference in height between an empty tree or its left and right subtrees is no more than 1, and both subtrees are a balanced binary tree

B- tree, B+ tree introduction

Introduction to the b-tree

A B-tree, also known as a B-tree, is a balanced multi-fork tree (compare that to a balanced binary search tree) that is better suited for external searches. Take a look at these concepts:

  • Order: The maximum number of child nodes a node has. (Usually denoted by the letter M)
  • Keyword: The value on the node is the keyword
  • Degree: The number of children that a node has.

A B-tree of order M has the following characteristics:

  • The root node has at least two children;
  • The number of keywords contained in each non-root node j meets the following criteria: ⌈m/2⌉ -1 <= j <= m-1.(⌈⌉ indicates an upward rounded value)
  • A nonleaf node with k keywords (in ascending order) has exactly K +1 children.
  • All the leaf nodes are in the same layer.

A simple B-tree looks like this:

Introduction of B + tree

The B+ tree is a variant of the B- tree and is also a multi-path search tree. A B+ tree of order M has these characteristics:

  • Each node has at most m children;
  • Number of key values of non-root nodes: m/2 <= K <= m-1
  • Adjacent leaf nodes are connected by Pointers and sorted by keyword size.

A B+ tree of order 3 looks like this:

The main differences between B+ trees and B- trees are as follows:

  • The internal nodes of the B-tree store data; The internal nodes of a B+ tree do not store data, but only serve as indexes, while its leaf nodes store data.
  • The adjacent leaf nodes of a B+ tree are connected by a linked list pointer, but a B- tree is not.
  • A B-tree ends up finding a specific value, whereas a B+ tree ends up finding data in a leaf node through an index
  • Any keyword in a B-tree can occur in only one node, whereas a B+ tree can occur multiple times.

B+ tree insertion

Remember these steps for B+ tree insertion:

  • 1.B+ tree insertion is always carried out at leaf nodes, that is, before insertion, the leaf node to be inserted needs to be found first.
  • 2. If the number of keywords in the leaf node is less than the order m, insert the keyword directly.
  • 3. If the number of keywords on the leaf node is equal to the order m, the leaf node splits into two new nodes. One node contains ⌊m/2⌋ keywords, and the other keyword contains ⌈m/2⌉ keywords. ⌊m/2⌋ indicates the round down, ⌈m/2⌉ indicates the round up, for example, ⌋ 3/2, and ⌋ 3/2, respectively.
  • 4. After splitting, move the keyword ⌈m/2⌉ up to the parent node. If the number of keywords contained in the parent node is less than M, the insertion operation is complete.
  • 5. After splitting, move the keyword ⌈m/2⌉ up to the parent node. If the number of keywords contained in the parent node is equal to m, the parent node continues to be split.

Take a B+ tree of order 4, for example. At order 4, the key values are at most 3 (m-1). Suppose the following data were inserted 43,48,36,32,37,49,28.

  1. Insert 43 into the empty tree

So the root has a key value, so it’s both the root and the leaf.

  1. Insert 48, then 36

At this point, the node has three keywords, which are already full

  1. Continue to insert 32, find the current node keyword is not less than order 4, so split

⌈4/2⌉=2 (subscripts 0,1,2), that is, 43 is moved up to the parent node.

  1. Insert 37,49, the former node keyword is not full, directly insert as follows:

  1. Finally, when 28 is inserted, it is found that the keyword of the current node is no less than the order 4, so it is split. ⌈4/2⌉=2, that is, 36 is moved up to the parent node. Since the parent node only has 2 key values, which are still less than 4, there is no need to continue splitting and the insertion is complete

You can take a look at the motion picture (a bit long, please bear with me) :

B+ tree lookup

Because the data of B+ tree are all on the leaf node, the internal node is only the function of pointer index, so the search process needs to search on the leaf node. Take this B+ tree as an example:

B+ tree single value query

Let’s say the value we’re looking for is 32.

For the first disk I/O, look for disk block 1, the root node (36,43), since 32 is less than 36, so visit the first child node to the left of the root node

The second disk I/O looks for disk block 2, the first child of the root node, and gets the interval (28,32), which is traversed to get 32.

The dynamic diagram is as follows:

B+ tree range query

Suppose we want to find the value of the interval [32,40].

The first step is to visit the root node. If the left endpoint of the interval 32 is less than 36, the first left subtree (28,32) of the root node is accessed.

The second step is to visit the node (28,32), find 32, and then start traversing the list to find the interval value [32,40], which is where B+ trees are more efficient than B- trees.

B+ tree deletion

B+ tree delete keyword, divided into these cases

  • Find the node containing the key value. If the number of keywords is greater than M /2, delete them directly.
  • If the number of keywords is greater than M /2, the key value is the maximum (small) value of the current node, and the key value exists in the parent node, delete the keyword, and adjust the value of the parent node accordingly.
  • Find the node that contains the keyword. If the number of keywords is less than ⌈m/2⌉ after the keyword is deleted, and its sibling node has redundant keywords, borrow the keywords from its sibling node
  • If the number of keywords is less than ⌈m/2⌉ after the keyword is deleted and its sibling node has no redundant keywords, the node is merged with its sibling node.

If the number of keywords is greater than ⌈m/2⌉, delete them directly.

Let’s say I have a B+ tree of order 5

If the number of keywords is 3 > 5/2=2, delete the keyword directly (⌈⌉ indicates the rounded meaning).

If the number of keywords is greater than ⌈m/2⌉-1 and the deleted keywords exist on the parent node, adjust the values of the parent node accordingly

If 20 is deleted, the number of keywords is 3 > 5/2=2, and 20 is the boundary value of the current node and exists in the parent node. Therefore, the parent node also needs to respond to adjustment after deletion.

If the number of keywords is less than M /2 after the keyword is deleted, the sibling node can borrow the keyword

Here’s a B+ tree of order 5,

If 15 is deleted, the node where the keyword is deleted only has one keyword, less than 5/2=2, which does not meet the characteristics of B+ tree. However, its sibling node has three elements (7,8,9), which can be borrowed from 9, as shown in the figure:

After deleting the keywords, if the number of keywords in the nodes is insufficient and the sibling nodes are not borrowed, the sibling nodes need to be merged

The following B+ tree of order 5:

If keyword 7 is deleted, only one keyword is left in the node where the keyword is deleted, less than 5/2=2, which does not meet the characteristics of B+ tree and brother nodes cannot be borrowed, so the merger occurs as follows:

Main process:

  • After 7 is deleted, there is only one keyword of 8, which does not meet the B+ tree feature (m/2<= keyword <=m-1).
  • And there is no sibling keyword to borrow, so 8 is combined with the previous sibling.
  • The parent node of the deleted keyword node, 7 index is also deleted, only a 9, and its right sibling node (18,20) has only two keywords, also can not borrow, so it is merged here.
  • The parent node of the deleted keyword node is also merged with its sibling node, leaving only one subtree branch, so the root node (16) is also moved down.

Therefore, the result after deleting keyword 7 is as follows:

B+ tree classic interview questions

  • How many rows can InnoDB store in a B+ tree?
  • Why do index structures default to B+ trees instead of Hash, binary, red-black, and B-trees?
  • B- tree and B+ tree

How many rows can InnoDB store in a B+ tree?

The short answer to that question is: about 20 million lines.

  • In a computer, the smallest unit of data stored on a disk is a sector. A sector is 512 bytes in size.
  • In a file system, the smallest unit is a block, which is 4k in size.
  • InnoDB storage engine the smallest storage unit is a page, a page size is 16K.

Because the B+ leaves store data, and the inner nodes store keys + Pointers. The index organization table determines which page the data is in through the binary search method of non-leaf nodes and Pointers, and then finds the needed data in the data page.

Let’s say the height of a B+ tree is 2, so there’s one root and several leaves. The total number of records stored in this B+ tree is = the number of Pointers to the root node * the number of records in a single leaf node.

  • If the data size of a row is 1K, then the number of records that can be stored in a single leaf node = 16K / 1K =16.
  • How many Pointers are stored in non-leaf nodes? We assume that the primary key ID is bigINT and the length is 8 bytes, while the pointer size is set to 6 bytes in the InnoDB source code, so 8+6=14 bytes, 16K /14B =16*1024B/14B = 1170

Thus, a B+ tree of height 2 can hold 1170 * 16=18720 such data records. Similarly, a B+ tree of 3 height can store 1170 *1170 *16 =21902400, that is, it can store about 20 million records. The HEIGHT of a B+ tree is 1-3 layers, which can store tens of millions of data.

Why index structures use B+ trees by default, instead of B-trees, Hash, binary trees, red black trees?

Here’s the simple answer:

  • Hash is only suitable for equivalent query, not range query.
  • A general binary tree may be specialized as a linked list, equivalent to a full table scan.
  • Red black tree is a kind of specialized balanced binary tree. When MySQL has a large amount of data, the index volume is also too large to store and read from disk. If the tree level is too high, the number of reads from disk will be too many.
  • A b-tree is a Tree that stores data on both leaf nodes and non-leaf nodes. A B-tree that stores data on both leaf nodes and non-leaf nodes is a Tree that stores data on the same amount of data.

B- tree and B+ tree

  • The internal nodes of the B-tree store data; The internal nodes of a B+ tree do not store data, but only serve as indexes, while its leaf nodes store data.
  • The adjacent leaf nodes of a B+ tree are connected by a linked list pointer, but a B- tree is not.
  • A B-tree ends up finding a specific value, whereas a B+ tree ends up finding data in a leaf node through an index
  • Any keyword in a B-tree can occur in only one node, whereas a B+ tree can occur multiple times.

Reference and thanks

  • B+ tree this is enough
  • B tree and B+ tree insert, delete graphic details
  • How many rows can InnoDB store in a B+ tree?