
Overview

The Bw-tree is a data structure proposed in a paper published by Microsoft in 2013. Observing the rise of multi-core machines and SSDs, the authors combined characteristics of two storage-engine designs, the B+-tree and the LSM-tree, into a latch-free, delta-updated, log-structured B-tree: the Bw-tree.

Because that paper omits many implementation details, several CMU authors implemented an in-memory version of the Bw-tree in 2015, then published a paper that filled in implementation details and open-sourced the code on GitHub as OpenBwTree.

To summarize up front, the main features of the Bw-tree are:

  1. It is divided into three layers: a Bw-tree index layer, a cache control layer, and a flash storage layer.
  2. The Bw-tree is a B+-tree overall and borrows from the B-link tree: each node carries a side pointer to its right sibling.
  3. On a single tree node, the Bw-tree uses a log structure similar to the LSM-tree: each logical node consists of a base node plus a chain of delta records.
  4. The core data structure that makes the Bw-tree latch-free is the Mapping Table. Updates are installed via CAS, and modifying a single mapping entry can change multiple logical pointers at once.
  5. The flash layer of the Bw-tree also uses a log-structured (append-only) store to manage the physical storage (base pages and delta records) of logical pages.

Author: qtmuniao, www.qtmuniao.com/2021/10/17/….

Origin

New data structures are often designed to accommodate hardware changes. So how has hardware changed in recent years? On the one hand, with single-core performance squeezed to its limit (the end of Moore's Law scaling), multi-core architectures became the main direction of development, but traditional lock-based concurrent data structures struggle to exploit multiple cores: too many locks lead to frequent waits and context switches. On the other hand, flash is getting cheaper and becoming a dominant storage medium, and it has distinctive read/write characteristics: sequential writes are several times faster than random writes, and random reads support far more concurrency than on traditional disks.

Microsoft designed the Bw-tree in response to these two observations. In memory, the Bw-tree applies incremental updates over a lock-free structure:

  1. A lock-free design reduces context switching and improves parallel throughput.
  2. Incremental updates avoid the cache misses caused by in-place updates.

On external storage, it uses a log-structured store to manage physical data:

  1. Append-only writes fully exploit flash's fast, high-throughput sequential writes.
  2. This may cause more random reads, but flash supports high random-read concurrency.

Architecture

The architecture diagram of BwTree is as follows:

The Bw-tree is divided into three layers: a logical Bw-tree index layer, a physical log-structured store layer, and an intermediate cache layer. The cache layer is built around a core data structure, the Mapping Table, which records the mapping from page IDs to physical pointers and controls the movement of data between memory and flash.

Bw-tree Index Layer

The Bw-tree provides a key-value (record) interface. The figure below shows its overall structure.

Bw-tree nodes are flexible: each consists of a base node plus a chain of delta records. All changes to a node, including inserts, updates, and deletes, are prepended to the head of this linked list as delta records.

There are two kinds of pointers in a Bw-tree: physical links within a node and logical links between nodes. A logical pointer is a page ID and must be used together with the Mapping Table, since the latter records the mapping from page IDs to physical pointers. A physical pointer is a memory address for a node in memory, or a file-system or block-storage address for a node on flash. So the parts of a Bw-tree node are chained by memory pointers while in memory, and by physical addresses once on flash. The logical pointers between Bw-tree nodes, i.e. page IDs, are the key to CAS-based concurrency control; the reason will be explained later.
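As a concrete illustration, here is a minimal, single-threaded sketch of a mapping table. The class and field names are hypothetical, and the CAS is simulated with a dict; a real implementation would use an atomic hardware compare-and-swap.

```python
# Hypothetical sketch of a Mapping Table: logical page IDs map to
# "physical" pointers (here, plain Python object references). All
# inter-node links store page IDs, so swapping one entry redirects
# every logical reference to that page.

class MappingTable:
    def __init__(self):
        self.slots = {}                  # page_id -> physical pointer

    def get(self, page_id):
        return self.slots.get(page_id)

    def cas(self, page_id, expected, new):
        """Compare-and-swap a single entry; returns True on success."""
        if self.slots.get(page_id) is expected:
            self.slots[page_id] = new
            return True
        return False

table = MappingTable()
old_node = {"keys": [1, 2, 3]}
table.slots[7] = old_node                    # logical Page 7
new_node = {"keys": [1, 2, 3, 4]}
assert table.cas(7, old_node, new_node)      # succeeds: entry unchanged
assert not table.cas(7, old_node, new_node)  # fails: entry already swapped
```

Any reader holding page ID 7 now resolves to `new_node` through the table, without any of those readers being updated individually.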

Note: the "delta node" in the figure below is not a precise name; "delta record" is more accurate, since it stores information at a finer granularity than a node, usually a single record.

The delta record is an important data structure in the Bw-tree. There are two main kinds: incremental KV modifications, for leaf nodes; and tree-structure modifications, for inner nodes. A delta record has several important fields: low key, high key, and side pointer.

Let's put ourselves in the designer's shoes: what information should a delta record contain? A brief list:

  1. The necessary incremental information (the KV being changed) to apply to the original node.
  2. Redundant information (low key, high key) to speed up lookups.
  3. Pointers (the side pointer borrowed from the B-link tree) that let concurrent scans keep working while the tree structure is being adjusted.
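Under those three requirements, a leaf-level delta record might look like the following sketch. All field names are assumptions for illustration, not the paper's actual layout.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class DeltaRecord:
    op: str                    # "insert" | "delete" | "update" (requirement 1)
    key: Any                   # the single record key being modified
    value: Optional[Any]       # new value; None for deletes
    low_key: Any               # lookup optimization (requirement 2)
    high_key: Any              # node covers key range [low_key, high_key)
    side_link: int             # page ID of the right sibling (requirement 3)
    next: Optional["DeltaRecord"]  # physical pointer down the chain

def covers(rec: DeltaRecord, key: Any) -> bool:
    """A lookup must walk this chain only if key is in the node's range."""
    return rec.low_key <= key < rec.high_key

d = DeltaRecord("insert", 5, "v", low_key=0, high_key=10, side_link=9, next=None)
assert covers(d, 5) and not covers(d, 10)  # key 10 belongs to the right sibling
```

The redundant `low_key`/`high_key` fields let a lookup reject a whole chain without walking it, and `side_link` lets a scan hop to the right sibling even mid-split.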

Here are some typical scenarios that illustrate this design. There are two common cases. One is a modification of a single node (appending a delta record containing a KV), generally a leaf node. The other is a large-scale modification of the tree structure, generally the splitting and merging of subtrees triggered by too many insertions or deletions, involving both leaf and inner nodes.

Single-node operation

Operations on a single node mainly include updates (insert, delete, modify), queries (point lookups and range scans), and consolidation. Updates are performed by prepending delta records; a point lookup traverses the delta chain from the head down to the base node; a range scan first materializes a KV vector for the node; consolidation is essentially a compaction within the node.

A change to a single node generally occurs only on leaf nodes and is triggered by the modification of a single record (one KV): an insert, delete, or update. The Bw-tree packages the change into a delta record, prepends it to the node's delta chain, and then updates the chain head in the Mapping Table to publish the change.

As shown in figure (a), an incremental modification of Page P takes the following steps:

  1. Allocate a new delta record in memory (△D in the figure).
  2. Fill it in with the incremental information (the operation type and the KV being modified), lookup-optimization information (low key, high key), and a physical pointer to the current head of the delta chain.
  3. Use a CAS to swing the corresponding Mapping Table entry to the new chain head.
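The three steps above can be sketched as follows. The mapping table is a plain dict and the CAS is simulated, so this shows only the control flow, not real lock-freedom; all names are illustrative.

```python
# Delta records form a chain: new deltas are prepended at the head,
# and the mapping table entry is swung over with one CAS.

class Delta:
    def __init__(self, op, key, value, nxt):
        self.op, self.key, self.value = op, key, value
        self.next = nxt              # physical pointer to the rest of the chain

mapping = {}                         # page ID -> chain head

def cas(page_id, expected, new):
    """Install `new` only if the entry still points at `expected`."""
    if mapping.get(page_id) is expected:
        mapping[page_id] = new
        return True
    return False

base = Delta("base", None, {"a": 1}, None)
mapping[3] = base                    # logical Page 3

# Step 1: allocate a new delta record in memory.
# Step 2: fill it in, pointing its physical pointer at the current head.
old_head = mapping[3]
d = Delta("insert", "b", 2, old_head)
# Step 3: CAS the mapping entry over to the new chain head.
assert cas(3, old_head, d)
assert mapping[3].next is base       # chain is now: delta -> base
```

If another thread had prepended a delta between steps 2 and 3, the CAS would fail, and the updater would re-read the head and retry.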

When a single node's delta chain grows too long, a consolidation is performed to reclaim space and speed up reads, as shown in figure (b), with the following steps:

  1. Allocate a new page in memory.
  2. Merge the delta records into the base page and copy the result to the newly allocated page.
  3. Use a CAS to swing the corresponding Mapping Table entry to the new page.

Consolidation is somewhat similar to compaction in an LSM-tree, but at a smaller granularity.
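Consolidation can be sketched like this: walk the chain down to the base page, replay the deltas oldest-first onto a copy of its contents, and produce the merged page to install via CAS. Node layouts here are illustrative dicts, not the paper's format.

```python
# Consolidate a logical page: fold the delta chain into a new base page.

def consolidate(chain_head):
    deltas, node = [], chain_head
    while node["op"] != "base":          # walk the chain to the base page
        deltas.append(node)
        node = node["next"]
    kv = dict(node["kv"])                # copy the base page contents
    for d in reversed(deltas):           # replay the oldest delta first
        if d["op"] == "insert":
            kv[d["key"]] = d["value"]
        elif d["op"] == "delete":
            kv.pop(d["key"], None)
    return {"op": "base", "kv": kv, "next": None}

base = {"op": "base", "kv": {"a": 1, "b": 2}, "next": None}
d1 = {"op": "delete", "key": "a", "next": base}
d2 = {"op": "insert", "key": "c", "value": 3, "next": d1}
new_page = consolidate(d2)
assert new_page["kv"] == {"b": 2, "c": 3}
```

The final step, not shown, is the usual CAS on the mapping entry; if it fails because a new delta arrived meanwhile, the consolidation is simply retried or abandoned.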

Tree structure change

Changes to the tree structure, called SMOs (Structure Modification Operations) in the Microsoft paper, include splits and merges. Since a structure adjustment is hard to complete with a single CAS operation, the Bw-tree breaks it into multiple steps. To keep the intermediate states visible and consistent during the adjustment, the Bw-tree borrows two ideas from the B-link tree:

  1. Each node maintains a side pointer to its right sibling.
  2. A node is only allowed to split to the right.

The effect is that even before the newly split node's index entry has been added to the parent, the node can still be reached through its left neighbor's side pointer.

That is, although a child node has split, the pieces remain connected by pointers.

In addition, unlike the KV deltas on leaf nodes, the deltas on inner nodes are special structure deltas, described in detail below.

Tree structure adjustments include node splits and node merges.

Node split

As shown in the figure above, splitting Page P proceeds in two phases: child split (a, b) and parent update (c). Each phase uses one CAS to make its changes publicly visible:

  1. Create Page Q. It holds the KVs of the right half of Page P, and its side pointer logically points to Page R, Page P's right sibling. At this point Page Q is invisible, i.e. unreachable from the Bw-tree root.
  2. Install the split delta. To make Page Q visible, the Bw-tree introduces a special delta record, the split delta. It contains the split key of the original Page P, used to route lookups between Page P and Page Q, and records Page Q as its side pointer. Finally, a CAS installs the split delta into the Mapping Table.
  3. Install the index entry delta. After Page Q splits off from Page P, an index entry pointing to Page Q must be added to the parent node. The Bw-tree does this with an index entry delta, which contains SplitKey-PQ and a pointer to Page Q, as well as SplitKey-QR, so that a lookup key falling in [SplitKey-PQ, SplitKey-QR) can be routed directly to Page Q.
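The phases above can be sketched as follows. The page layouts, page IDs, and delta field names are all invented for illustration, and the CAS is again simulated over a dict.

```python
# Split sketch: create Q, install the split delta on P, then install
# the index entry delta on the parent. One (simulated) CAS per phase.

mapping = {}

def cas(pid, expected, new):
    if mapping.get(pid) is expected:
        mapping[pid] = new
        return True
    return False

# Phase 0: Page P holds keys 10..40 and a side link to Page R (pid 9).
P, Q, R = 1, 2, 9
page_p = {"kind": "base", "keys": [10, 20, 30, 40], "side": R}
mapping[P] = page_p
parent_pid = 0
mapping[parent_pid] = {"kind": "index", "entries": {10: P}}

# 1. Create Page Q with the right half; its side link takes over P's.
page_q = {"kind": "base", "keys": [30, 40], "side": R}
mapping[Q] = page_q                  # Q exists but is not yet reachable

# 2. Install the split delta on P: records the split key and Q's pid.
old = mapping[P]
split_delta = {"kind": "split", "split_key": 30, "side": Q, "next": old}
assert cas(P, old, split_delta)      # Q now reachable via P's side link

# 3. Install the index entry delta on the parent, adding the route to Q.
old = mapping[parent_pid]
index_delta = {"kind": "index-entry", "sep": 30, "child": Q, "next": old}
assert cas(parent_pid, old, index_delta)
```

Between phases 2 and 3, lookups for keys ≥ 30 reach Q by following P's split delta, which is exactly the B-link property the text describes.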

The pointers in this figure may look dazzling, but there are only a few things to understand about them:

  1. Solid lines are real physical pointers; dashed lines are logical pointers, i.e. page IDs.
  2. Physical pointers inside nodes are set when delta records are created; physical pointers in the Mapping Table are updated via CAS.
  3. After a Mapping Table entry is updated, the corresponding dashed pointer in the figure changes its target.

Node merge

As shown above, merging Page R into Page L proceeds in three phases, each containing one CAS:

  1. Marking for Delete: a remove-node delta is introduced and prepended to Page R, and a CAS updates Page R's entry in the Mapping Table, marking Page R as deleted. Page R is still reachable through Page L's side pointer, so the remove-node delta only blocks access from the parent node.
  2. Merging Children: a node-merge delta is introduced. It records a physical pointer to Page R's contents, and a CAS updates Page L's entry in the Mapping Table, making Page R's contents part of logical Page L.
  3. Parent Update: an index-term-delete delta is introduced and prepended to the parent node, logically deleting the key and pointer that originally pointed to Page R; a CAS updates parent Page P's entry in the Mapping Table.

As figure (a) shows, installing the remove-node delta changes the resolution of two logical pointers at once:

  1. Page L's side pointer to Page R.
  2. Parent Page P's child pointer to Page R.

This is exactly the point of the Mapping Table and logical pointers: one CAS on a mapping entry modifies multiple logical pointers at the same time.
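The three merge phases can be sketched as follows; the node layouts, delta kinds, and page IDs are illustrative, and the CAS is simulated.

```python
# Merge sketch: remove-node delta on R, node-merge delta on L,
# index-term-delete delta on the parent P. One (simulated) CAS each.

mapping = {}

def cas(pid, expected, new):
    if mapping.get(pid) is expected:
        mapping[pid] = new
        return True
    return False

L_PID, R_PID, P_PID = 1, 2, 0
page_l = {"kind": "base", "keys": [10, 20], "side": R_PID}
page_r = {"kind": "base", "keys": [30, 40], "side": None}
parent = {"kind": "index", "entries": {10: L_PID, 30: R_PID}}
mapping.update({L_PID: page_l, R_PID: page_r, P_PID: parent})

# 1. Mark R for deletion; R stays reachable via L's side pointer.
old = mapping[R_PID]
remove_delta = {"kind": "remove-node", "next": old}
assert cas(R_PID, old, remove_delta)

# 2. Merge children: L's chain now physically includes R's contents.
old = mapping[L_PID]
merge_delta = {"kind": "node-merge", "right": page_r, "next": old}
assert cas(L_PID, old, merge_delta)

# 3. Parent update: logically delete the separator that pointed to R.
old = mapping[P_PID]
del_delta = {"kind": "index-term-delete", "sep": 30, "next": old}
assert cas(P_PID, old, del_delta)
```

After phase 2, a lookup for key 30 that reaches logical Page L finds Page R's contents through the node-merge delta's physical pointer, so the intermediate states stay consistent.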

Handling Conflicts

If one thread performs an SMO on a node while another thread performs a single-node update that conflicts with it (for example, both operate on the same page), how is the conflict resolved?

In general, the Bw-tree is embedded in a DBMS as a storage engine, and the DBMS's transaction management module handles conflicts between SMOs, e.g. by serializing them (a personal guess). The Bw-tree itself then handles conflicts between SMOs and single-node updates.

The Bw-tree uses a scheme called the "help-along protocol": any thread that discovers an SMO in progress first completes that SMO, and only then performs its own operation (read or write). That is:

  1. SMOs are given higher priority, which fixes the ordering between the two kinds of updates (SMO vs. single-node update).
  2. Every thread that runs into a conflicting SMO completes it first, whether or not the SMO was started by that thread. One thread always succeeds in completing the SMO and continues, while the others' attempts fail and they retry.
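The help-along control flow might be sketched like this; `finish_split`, `do_update`, and the delta kinds are hypothetical stand-ins for real SMO completion logic, and the CAS is again only simulated.

```python
# Help-along sketch: before doing its own update, a thread inspects the
# chain head; if it finds an unfinished SMO, it completes that SMO
# first, then re-reads the head and applies its own delta.

def finish_split(mapping, pid):
    # Hypothetical SMO completion: mark the split finished on the chain.
    head = mapping[pid]
    mapping[pid] = {"kind": "split-done", "next": head}

def do_update(mapping, pid, my_delta, finish_smo):
    head = mapping[pid]
    if head["kind"] == "split-in-progress":
        finish_smo(mapping, pid)         # complete someone else's SMO first
    head = mapping[pid]                  # re-read after helping
    my_delta["next"] = head
    mapping[pid] = my_delta              # (a CAS in the real structure)
    return my_delta

mapping = {5: {"kind": "split-in-progress", "next": None}}
head = do_update(mapping, 5, {"kind": "insert", "key": "x"}, finish_split)
assert head["kind"] == "insert"
assert head["next"]["kind"] == "split-done"
```

The key property is that the helper never blocks waiting for the SMO's originator: any thread can finish the remaining CAS steps, because all the state needed to finish lives in the deltas already on the chain.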

Cache Management

Main Functions

The main functions of the cache layer are:

  1. Maintaining the Mapping Table, which stores the mapping from logical page IDs to physical addresses. A physical address can be a pointer in memory or an address in the flash file system.
  2. Moving pages between memory and flash, including reading in, swapping out, and flushing.

Mapping Table Updates

All updates to the Mapping Table are done through latch-free CAS, including:

  1. Physical-pointer changes caused by prepending delta records to leaf and inner nodes.
  2. Swaps between memory pointers and file addresses caused by pages moving between flash and memory.

Incremental Flushing

Page flushes can be triggered for various reasons, such as a checkpoint required by an upper-layer transaction, or swapping pages out when memory usage exceeds a threshold. As mentioned earlier, a logical page consists of a base page and a delta chain, and the chain keeps growing; flushing a logical page is therefore not done all at once, but incrementally. For this purpose the Bw-tree introduces another special delta, the flush delta, which records the flush point and is prepended to the logical page. On the next flush, only the delta records added after the last flush point need to be written.
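The flush-delta idea can be sketched as follows: the flush point is the most recent flush delta on the chain, and each flush writes only the deltas above it to the append-only store. Names and layouts are illustrative.

```python
# Incremental flush: write un-flushed deltas to the (append-only)
# storage, then install a flush delta as the new chain head so the
# next flush knows where to stop.

def flush(chain_head, storage):
    node, to_write = chain_head, []
    while node is not None and node["kind"] != "flush":
        to_write.append(node)            # stop at the last flush point
        node = node["next"]
    storage.extend(reversed(to_write))   # append-only write, oldest first
    return {"kind": "flush", "next": chain_head}

storage = []
base = {"kind": "base", "next": None}
d1 = {"kind": "insert", "key": "a", "next": base}
head = flush(d1, storage)                # first flush writes base + d1
assert len(storage) == 2
d2 = {"kind": "insert", "key": "b", "next": head}
head = flush(d2, storage)                # second flush writes only d2
assert len(storage) == 3
```

In the real system the flush delta would also be installed with a CAS, and the storage would be the log-structured store described in the next section.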

Log-structured Store (LSS)

The flash layer mirrors the memory layer: flushes are incremental, so the data of a single logical page is not contiguous; its pieces are chained together by file-system addresses.

Garbage collection is done by rewriting: the scattered pieces of a logical page are copied together, which also reduces read amplification for subsequent reads.

Further Reading

Reading Microsoft's Bw-tree paper alone, the storage and transaction parts are hard to follow, because that paper only describes the Bw-tree index in detail. For storage and transactions, one has to read two more of Microsoft's companion papers; I will write them up when I get the chance:

  1. LLAMA: A Cache/Storage Subsystem for Modern Hardware, www.vldb.org/pvldb/vol6/…
  2. Deuteronomy: Transaction Support for Cloud Data, www.microsoft.com/en-us/resea…

References

  1. Microsoft's 2013 Bw-tree paper: 15721.courses.cs.cmu.edu/spring2017/…
  2. Taobao database kernel monthly report, 2018/11: mysql.taobao.org/monthly/201…

I am Qingteng Muniao, a distributed-systems programmer who likes photography. For more articles, follow my public account "Muniao Zaji".