B+Tree has been the first choice of various database engines since its invention. Although lSM-Tree database has sprung up like bamboo shoots after a spring spring in recent years, B+Tree is still the first choice of all databases due to its excellent performance and balanced performance in various scenarios.

However, B+ Tree is not without disadvantages. As a balanced multi-tree structure, its advantage is that it can keep the Tree height low even when the amount of data is large, so its cost of query or update is low. However, one of the disadvantages of trees is that updating triggers structural adjustments, which can extend from the leaf to the root node at the longest, which can block concurrent read and write operations, leading to performance degradation. Therefore, how to optimize the concurrent access performance of B+ Tree in this scenario is also an important research direction in academia and industry.

This paper makes a simple introduction to B-Tree and B+Tree, and focuses on a data structure called B-Link Tree proposed on the basis of B+Tree. It optimizes the lock granularity of B+Tree structure by a clever method, improves the concurrency and keeps the performance stable under high concurrency.

B+Tree B+Tree

B-tree is a balanced search Tree designed for external storage devices such as disks. So before we talk about B-tree, let’s learn about disks.

The system reads data from disks to memory in the basic unit of disk blocks. Data in the same disk block is read out at one time, rather than retrieving whatever is needed.

InnoDB storage engine has the concept of Page, which is the smallest unit of disk management. The default size of each page in InnoDB storage engine is 16KB. You can set the page size to 4K, 8K, or 16K by parameter innodb_page_size. In MySQL, you can run the following command to check the page size:

mysql> show variables like 'innodb_page_size';
Copy the code

The storage space of one disk block is often not that large, so InnoDB requests for disk space each time with several contiguously addressed disk blocks to reach the page size of 16KB. InnoDB reads data from disk to disk on a page basis. When querying data, if each piece of data in a page can help locate the data record, this will reduce disk I/O times and improve query efficiency.

B-tree enables the system to efficiently locate the disk block where the data resides. To describe b-tree, first define a record as a binary group [key, data]. Key is the key value of the record, which corresponds to the primary key value in the table, and data is the data in a row of records except for the primary key. For different records, the key value is different.

Each node in a B-tree can contain a large number of keyword information and branches according to the actual situation. A third-order B-tree is shown in the figure below:

Each node occupies the disk space of one disk block. A node has two keywords in ascending order and three Pointers to the sub-root node. The Pointers store the address of the disk block where the sub-node resides. The three range domains divided by two keywords correspond to the range domains of the data of the subtree to which the three Pointers point. For the root node, the keywords are 17 and 35. The data range of the subtree pointed by the pointer P1 is less than 17, the data range of the subtree pointed by the pointer P2 is from 17 to 35, and the data range of the subtree pointed by the pointer P3 is greater than 35.

B+Tree is an optimization based on B-Tree to make it more suitable for external storage index structure. InnoDB storage engine uses B+Tree to implement its index structure.

From the B-Tree structure diagram, you can see that each node contains not only the key value of the data, but also the data value. The storage space of each page is limited. If the data is large, the number of keys that can be stored by each node (that is, a page) will be small. If the amount of data stored is large, the depth of B-Tree will also be large, increasing the number of disk I/O times during the query, and thus affecting the query efficiency. In B+Tree, all data record nodes are stored on leaf nodes of the same layer according to the order of key value, while only key value information is stored on non-leaf nodes. In this way, the number of key values stored in each node can be greatly increased and the height of B+Tree can be reduced.

Since the non-leaf nodes of B+Tree only store key value information, assuming that each disk block can store four key values and pointer information, the structure of B+Tree is shown in the figure below:

There are usually two head Pointers on the B+Tree, one to the root node and the other to the leaf node with the smallest keyword, and there is a chain ring structure between all leaf nodes (data nodes). Therefore, there are two search operations for B+Tree: one is range search and paging search for the primary key, and the other is random search starting from the root node.

The differences between B+Tree and B-tree can be summarized as follows:

  • Non-leaf nodes store only key-value information

  • All leaf nodes have a chain pointer between them

  • Data records are stored in leaf nodes

Ii. Core propositions of B-Link Tree

First, add a link Pointer to the intermediate node to point to the right sibling of the b-link Tree.

Second, add a field high key to each node. If the target value exceeds the high key of this node, the search will continue to follow the Link Pointer to the next node.

A typical B-link Tree is as follows:

As we can see in the figure, each intermediate node also adds a successor pointer to its right sibling. If the value to be queried exceeds the high key in the node, then the successor node needs to be queried as well.

Here is an example of the advantages of the B-Link Tree. Suppose we have the following B+ Tree and the leaves y is full:

If thread A wants to query 15 and thread B wants to insert the value 9 into node y, let’s look at the following execution sequence:

If you don’t do any concurrent protection, you get a split node that accesses the wrong leaf node and then the target value is not a problem, which is obviously not the case.

The simplest solution is to use a global lock on the entire B+ Tree and prevent all concurrent access until the Tree is restructured, which InnoDB did in the early days, but of course this leads to poor performance.

The B-link Tree rejects the global lock. It implements a bottom-up adjustment method, each time only the current adjustment node lock, when the child node adjustment is completed and then back up to adjust the parent node until all adjustments are completed.

The following figure illustrates the process of node splitting caused by the insertion of a new record: first create a new node, copy some of the data from the original node to the new node, establish a connection between the two nodes (this is the most critical step), and finally insert new inodes into the parent node (d and E).

The lock-free design of the B-Link Tree may cause problems with the parent node view (not yet inserted) and the split child node view. How does it solve the above problem?

This is where the Link Pointer becomes valuable. If other threads access the child node after the modification but before the modification of the parent node, they may not find the existing data when they follow the parent node to the old child node. In this case, they need to follow the child node to find the correct access path through its link Pointer.

Again, the example above:

When node split into y y and y + two nodes, and the parent node x has yet to embody this division, this time to find 15, find the node along the x, y failed to find 15 in y, but judge 15 more than the record high key, then follow the pointer to find the following node y +, finally found the value, thus can ensure correctness.

In the b-link Tree split scheme, there is no global lock on the subtree involved in the split, but it can be properly searched through the node’s right join pointer.

Iii. Summary of advantages and Disadvantages of B-Link Tree

The advantage of B-link Tree is that there is no need to lock global or local subtrees when adjusting the Tree structure, which is conducive to performance stability under high concurrency.

The disadvantages of B-link Tree are mainly in the following two aspects:

First, add extra fields, link Pointers and high keys, to each node, but it’s cheap.

Second, additional judgment is required during the query. If the query search exceeds the high key, additional link Pointer is needed to query the subsequent node, which may cause an additional IO in the database application, resulting in a decline in the performance of a single search. However, since tree structure adjustment is an action with a low frequency, In addition, the operation of querying the successor node will only occur between the adjustment of the child node and the adjustment of the parent node. Once the adjustment of the parent node is completed, it can be directly queried through the pointer of the parent node instead of searching through the successor pointer of the child node.

In summary, b-link Tree changes space for time and adds a successor pointer to each intermediate node to avoid performance degradation caused by global locking when the Tree structure is adjusted. This is an excellent scheme. In GreenPlum, B-link Tree is used as the index of its storage engine.