This article was first published on our official account: “Miscellaneous Notes of Wooden Birds”

Boltdb is one of the few stand-alone KV stores written in pure Go. It is based on Howard Chu’s LMDB project, and its implementation is fairly clean: excluding unit tests and adaptation code, the core is only about 4,000 lines. A simple API and a simple implementation were the author’s explicit goals. Since the author can no longer invest the energy, the original boltdb has been archived and is no longer updated; if you want to make improvements or submit new PRs, head to bbolt, the fork maintained by the etcd community.

For convenience, this series of articles is still based on the original repo, which no longer changes. The project is small but complete: in just over 4,000 lines of code it implements a stand-alone KV engine with a B+ tree index that supports one writer and multiple concurrent readers. The code itself is minimalist and well commented; if you are a Go enthusiast with an interest in KV stores, boltdb is a repo not to be missed.

This series is planned as three articles, analyzing the boltdb source code from three main angles in turn: data organization, index design, and transaction implementation. Since the three are not completely orthogonal and decoupled, some interweaving in the narration is inevitable. If something doesn’t make sense yet, skip it for the time being and come back once you have the complete picture. This article covers boltdb’s index design.

Overview

There are two common index designs in databases: the B+ tree and the LSM-tree. The B+ tree is the classic choice; the traditional single-machine database MySQL, for example, uses B+ tree indexes, which are friendly to fast point reads and range queries. The LSM-tree has become popular in recent years; Bigtable, LevelDB, and RocksDB all use it. As mentioned in the previous article, the LSM-tree uses a WAL and multi-level data organization to trade some read performance for robust random write performance. This too is a classic trade-off.

BoltDB organizes data logically by bucket. A bucket can be regarded as a namespace: a collection of KV pairs, similar to a bucket in object storage. Each bucket corresponds to a B+ tree, and since namespaces can nest, BoltDB also allows buckets to nest. In the implementation, the page id of a child bucket’s root node is stored in a leaf element of the parent bucket.
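For intuition, here is a minimal sketch of bucket nesting through the public API; the file name, bucket names, and keys are made up for illustration:

package main

import (
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	db, err := bolt.Open("app.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.Update(func(tx *bolt.Tx) error {
		// A top-level bucket lives under the root bucket.
		users, err := tx.CreateBucketIfNotExists([]byte("users"))
		if err != nil {
			return err
		}
		// A nested bucket: internally, a leaf element of the parent
		// stores the page id of the child bucket's B+ tree root.
		alice, err := users.CreateBucketIfNotExists([]byte("alice"))
		if err != nil {
			return err
		}
		return alice.Put([]byte("email"), []byte("alice@example.com"))
	})
	if err != nil {
		log.Fatal(err)
	}
}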

Each DB file holds a set of B+ trees that are themselves organized into a tree. Recall that in a B+ tree, branch nodes are used for lookups and leaf nodes store the data.

  1. The top-level B+ tree is special. We call it the root bucket; all its leaf elements hold the page ids of the roots of sub-bucket B+ trees.
  2. The other B+ trees, which we’ll call data buckets, have leaf elements that are either ordinary user data or, again, page ids of sub-bucket B+ tree roots.

Compared with an ordinary B+ tree, BoltDB’s B+ tree has several special features:

  1. A node’s fan-out is not kept within a fixed range; instead it is bounded by the total size of the elements it holds, which must fit within a page.
  2. A branch element stores the smallest key of the subtree it points to.
  3. The leaf nodes are not chained together end to end by a linked list.
  4. There is no guarantee that all leaf nodes are on the same level.

In terms of code organization, the index-related source files in boltDB are as follows:

  1. bucket.go: a high-level wrapper for bucket operations, including KV create/read/update/delete, sub-bucket handling, and B+ tree splitting and merging.
  2. node.go: operations on stored elements and on relationships between nodes; the detailed logic for adding and deleting elements, loading them and flushing them to disk, accessing children and siblings, and splitting and merging.
  3. cursor.go: implements an iterator-like facility that can walk freely across the leaf nodes of the B+ tree.

This article is divided into three parts, revealing BoltDB’s index design step by step, from the parts to the whole. First we dissect node, the basic unit of the tree; then we analyze how a bucket is traversed; finally we analyze how the tree grows and stays balanced.

Author: Greenwood Bird. Source: www.qtmuniao.com/2020/12/14/… Please credit the source when reprinting.

Basic unit — Node

The basic unit of the B+ tree is the node, which corresponds to the page stored in the file system, mentioned in the previous article. There are two kinds of node: branch nodes and leaf nodes. In the implementation, however, they reuse the same structure and are distinguished by the isLeaf flag:

// node represents a deserialized page in memory
type node struct {
	bucket     *Bucket // Pointer to the owning bucket
	isLeaf     bool    // Whether this is a leaf node

	// Used when adjusting and maintaining the B+ tree
	unbalanced bool // Whether a merge is required
	spilled    bool // Whether the node has already been split and written to dirty pages

	key  []byte // Key of the first element the node contains
	pgid pgid   // The corresponding page id

	parent   *node // Pointer to the parent node
	children nodes // Pointers to child nodes (only the children loaded into memory)

	inodes inodes // Meta information of the stored elements: a key+pgid array for branch nodes, a KV array for leaf nodes
}

The node/page conversion

The mapping between pages and nodes is: a group of contiguous physical pages in the file system is loaded into memory as one logical page and then converted into a node. The following figure shows a contiguous run of data occupying two page-size spans in the file system, mmap’d into memory, cast to a page, and then converted into a node via node.read.

node.read converts a page into a node:

// read reads a page out of mmap and converts it into a node
func (n *node) read(p *page) {
	// Initialize the meta information
	n.pgid = p.id
	n.isLeaf = ((p.flags & leafPageFlag) != 0)
	n.inodes = make(inodes, int(p.count))

	// Load the contained elements into inodes
	for i := 0; i < int(p.count); i++ {
		inode := &n.inodes[i]
		if n.isLeaf {
			elem := p.leafPageElement(uint16(i))
			inode.flags = elem.flags
			inode.key = elem.key()
			inode.value = elem.value()
		} else {
			elem := p.branchPageElement(uint16(i))
			inode.pgid = elem.pgid
			inode.key = elem.key()
		}
		_assert(len(inode.key) > 0, "read: zero-length inode key")
	}

	// Use the first element's key as the node's key, so that the parent
	// node can use it as an index for lookup and routing
	if len(n.inodes) > 0 {
		n.key = n.inodes[0].key
		_assert(len(n.key) > 0, "read: zero-length node key")
	} else {
		n.key = nil
	}
}

The node element inode

An inode represents an element stored inside a node. Branch nodes and leaf nodes again share the same structure. For a branch node, an element is a key plus a reference; for a leaf node, an element is a user KV pair.

Notice that the reference to another node has type pgid rather than *node. This is because the element an inode points to has not necessarily been loaded into memory. When boltDB walks the B+ tree, it converts visited pages into nodes on demand and stores their pointers in the parent node’s children field. The loading order and logic are detailed in a later section.

type inode struct {
	// Common fields
	flags uint32
	key   []byte

	// Used by branch nodes
	pgid pgid // Page id of the branch/leaf node pointed to

	// Used by leaf nodes
	value []byte // Data stored in the leaf node
}

Within a node, inodes are kept in key order, so routing through the B+ tree is done by binary search.
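As an illustration of that routing, a branch node could pick the child to descend into roughly as follows. The routeIndex helper is hypothetical (in the source the equivalent logic lives inside the cursor’s search functions), but the sort.Search pattern is the same one used throughout boltDB:

// routeIndex is a hypothetical helper illustrating branch-node routing:
// find the last inode whose key is <= the search key, then descend into
// the page it references.
func (n *node) routeIndex(key []byte) int {
	// sort.Search returns the first index whose key is >= key.
	index := sort.Search(len(n.inodes), func(i int) bool {
		return bytes.Compare(n.inodes[i].key, key) != -1
	})
	// On a miss, step back one slot so we land on the subtree covering key.
	exact := index < len(n.inodes) && bytes.Equal(n.inodes[index].key, key)
	if !exact && index > 0 {
		index--
	}
	// The caller would then load n.inodes[index].pgid and recurse.
	return index
}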

Adding new elements

All new data is added to leaf nodes. If the B+ tree becomes unbalanced after insertions, node.spill splits the oversized nodes during adjustment. The main code is as follows:

// put inserts a key/value.
func (n *node) put(oldKey, newKey, value []byte, pgid pgid, flags uint32) {
	// Find the insertion point
	index := sort.Search(len(n.inodes), func(i int) bool { return bytes.Compare(n.inodes[i].key, oldKey) != -1 })

	// If the key is new rather than a replacement, make room for the element to insert
	exact := (len(n.inodes) > 0 && index < len(n.inodes) && bytes.Equal(n.inodes[index].key, oldKey))
	if !exact {
		n.inodes = append(n.inodes, inode{})
		copy(n.inodes[index+1:], n.inodes[index:])
	}

	// Assign to the element being replaced/inserted
	inode := &n.inodes[index]
	inode.key = newKey
	// ...
}

The main logic of this function is simple: binary-search for the insertion point; if this is an overwrite, overwrite in place; otherwise append a new element and shift the tail right to make room for the insertion. The function signature, however, is interesting: it takes both oldKey and newKey, which feels strange at first. It is called from two places:

  1. When a leaf node inserts user data, oldKey equals newKey, so one of the two parameters is redundant.
  2. During the spill phase of B+ tree adjustment, oldKey may differ from newKey; here both are needed, although from the code alone their purpose is still obscure.

After some discussion with friends, the rough conclusion is: to avoid a chain of node.key updates through the ancestor nodes every time a very small key is inserted at the left edge of a leaf, the update is deferred to the final B+ tree adjustment phase (the spill function). There, the stale key (oldKey) is used to locate the element in the parent, and it is then updated to node.inodes[0].key:

// This snippet is from the node.spill function
if node.parent != nil { // not the root node
	var key = node.key // the leftmost node after a split
	if key == nil {    // a non-leftmost node after a split
		key = node.inodes[0].key
	}

	node.parent.put(key, node.inodes[0].key, nil, node.pgid, 0)
	node.key = node.inodes[0].key
}

Node traversal

Since Golang has no mechanism like Python’s yield, BoltDB uses a stack to save the traversal context and implements an iterator, cursor, for in-order traversal of the tree’s nodes. Logically it can be understood as an iterator over the elements stored in the leaf nodes of a bucket’s B+ tree. As mentioned earlier, BoltDB’s B+ tree does not string its leaf nodes together with a linked list, so the traversal needs some extra logic to handle the details.

This implementation makes traversal more complex but reduces the difficulty of maintaining the B+ tree’s balance invariants, which is another trade-off. Most importantly, it avoids a cascade of copies through all the preceding leaf nodes when implementing transactions via COW.

A cursor is bound to a bucket and implements the following functions. Note that when the cursor lands on a nested bucket, the returned value is nil.

type Cursor struct{ ... }
func (c *Cursor) Bucket() *Bucket // Returns the bucket the cursor is bound to
func (c *Cursor) Delete() error   // Deletes the key the cursor points to
func (c *Cursor) First() (key []byte, value []byte) // Positions at and returns the first KV in the bucket
func (c *Cursor) Last() (key []byte, value []byte)  // Positions at and returns the last KV in the bucket
func (c *Cursor) Next() (key []byte, value []byte)  // Positions at and returns the next KV in the bucket
func (c *Cursor) Prev() (key []byte, value []byte)  // Positions at and returns the previous KV in the bucket
func (c *Cursor) Seek(seek []byte) (key []byte, value []byte) // Positions at and returns the KV for the given key in the bucket
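For reference, typical cursor usage through the public API looks like this, assuming an open db handle as in the earlier sketch and a made-up bucket name:

err := db.View(func(tx *bolt.Tx) error {
	c := tx.Bucket([]byte("users")).Cursor()

	// Full scan in key order.
	for k, v := c.First(); k != nil; k, v = c.Next() {
		fmt.Printf("%s = %s\n", k, v) // v is nil if the element is a nested bucket
	}

	// Range scan over [min, max) using Seek.
	min, max := []byte("a"), []byte("m")
	for k, v := c.Seek(min); k != nil && bytes.Compare(k, max) < 0; k, v = c.Next() {
		fmt.Printf("%s = %s\n", k, v)
	}
	return nil
})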

Since the leaf nodes of BoltDB’s B+ tree are not linked left to right, the traversal path must be recorded so the cursor can backtrack. The structure is as follows:

type Cursor struct {
	bucket *Bucket    // Use this handle to load node
	stack  []elemRef  // Save the path for easy backtracking
}

In an elemRef, page and node correspond one to one: if the page has been loaded into memory (converted into a node), the node is used first; otherwise the page is used. index records which element of the node’s inodes the path passes through.

type elemRef struct {
	page  *page
	node  *node
	index int
}

Auxiliary functions

These APIs share some common logic that can be factored out into reusable building blocks. In the implementation, moving the cursor means adjusting the path represented by the cursor.stack array.

// Terminal of the recursion: find the node/page where key lives and record the path in the cursor
func (c *Cursor) search(key []byte, pgid pgid)

// Use search to locate the value for key
// If key does not exist, return the kv pair at the would-be insertion point:
// 1. ref.index < ref.node.count(): return the first kv greater than the given key
// 2. ref.index == ref.node.count(): return nil
// The exported Seek has to handle the second case.
func (c *Cursor) seek(seek []byte) (key []byte, value []byte, flags uint32)

// Move the cursor to the leftmost leaf element of the subtree rooted at the top of the stack
func (c *Cursor) first()
// Move the cursor to the rightmost leaf element of the subtree rooted at the top of the stack
func (c *Cursor) last()

// Move the cursor to the next leaf element
// 1. If there are elements after the current one in this leaf, return the next one directly
// 2. Otherwise, use the saved path to find the next non-empty leaf node
// 3. Return nil if the cursor already points at the last element of the last leaf
func (c *Cursor) next() (key []byte, value []byte, flags uint32)
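Slightly simplified from the source, the backtracking loop of cursor.next() looks like this:

// next moves to the next leaf element, backtracking up the stack when the
// current page is exhausted (slightly simplified from the source).
func (c *Cursor) next() (key []byte, value []byte, flags uint32) {
	for {
		// Walk up the stack to the first ancestor that still has an
		// unvisited element to its right.
		var i int
		for i = len(c.stack) - 1; i >= 0; i-- {
			elem := &c.stack[i]
			if elem.index < elem.count()-1 {
				elem.index++
				break
			}
		}

		// We ran off the root: the cursor is past the last element.
		if i == -1 {
			return nil, nil, 0
		}

		// Drop the exhausted tail of the path, then descend to the
		// leftmost leaf of the subtree we just stepped into.
		c.stack = c.stack[:i+1]
		c.first()

		// Skip empty leaf pages and keep backtracking.
		if c.stack[len(c.stack)-1].count() == 0 {
			continue
		}

		return c.keyValue()
	}
}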

Composing the traversal

With these basic building blocks in hand, let’s look at how the main API functions are implemented:

// First:
// 1. Push the root node onto the stack
// 2. Call c.first() to locate the first leaf under the root
// 3. If that leaf is empty, call c.next() to find the next non-empty leaf
// 4. Return the first element of that leaf
func (c *Cursor) First() (key []byte, value []byte)

// Last:
// 1. Push the root node onto the stack
// 2. Call c.last() to locate the last leaf under the root
// 3. Return its last element, or nil if there is none
func (c *Cursor) Last() (key []byte, value []byte)

// Next:
// 1. Call c.next()
func (c *Cursor) Next() (key []byte, value []byte)

// Prev:
// 1. Walk up the stack to the first branch node that has a preceding element
// 2. Decrement that node's ref.index
// 3. Call c.last() to locate the last leaf of that subtree
// 4. Return its last element
func (c *Cursor) Prev() (key []byte, value []byte)

// Seek:
// 1. Call c.seek() to find the first element >= key
// 2. If key falls between two leaf nodes, call c.next() to find the next non-empty leaf
func (c *Cursor) Seek(seek []byte) (key []byte, value []byte)

Describing these APIs is somewhat tedious, but they are easy to follow once you grasp the main idea plus a few boundaries and details. The main idea is simple: the cursor’s ultimate purpose is to traverse the elements of all leaf nodes, but the leaf nodes are not connected by a linked list, so a stack array is needed to record the traversal context, the path, which gives fast access to predecessors and successors (since they are likely to share a prefix of the path with the current leaf).

Some additional boundaries and details to note are as follows:

  1. Every move must first locate the node, then locate the element within the node.
  2. If a page has already been converted into a node, access the node first; otherwise access the page through mmap.
  3. The key a branch element remembers is the key of the node it points to, i.e., the minimum key among that node’s elements.
  4. Golang’s sort.Search returns the subscript of the first element not smaller than the given key, which needs some extra processing afterward.
  5. Several boundary checks: whether the node has any elements, whether index has run past the element count, and whether the element is a sub-bucket.
  6. If the key does not exist, seek/search locates the position where the key would be inserted.

The growth of the tree

Let’s unfold the life cycle of BoltDB’s B+ tree at a few points in time:

  1. During database initialization
  2. After the transaction is started
  3. When a transaction commits

Finally, once the state at each of these stages is understood, we can string together the whole growth process.

Initialization time

When the database is initialized, the B+ tree contains only one empty leaf node: the root node of the root bucket.

func (db *DB) init() error {
	// ...
	// Write an empty leaf page at pgid = 3
	p = db.pageInBuffer(buf[:], pgid(3))
	p.id = pgid(3)
	p.flags = leafPageFlag
	p.count = 0
	// ...
}

After that, the B+ tree grows through node additions, deletions, and adjustments made in write transactions. By boltDB’s design, write transactions can only run serially. Boltdb modifies nodes using COW so that concurrent read transactions are unaffected: the pages to be modified are read into memory, modified and adjusted there, and then new pages are allocated and the changed nodes written out.

This approach makes read/write concurrency and transactions easy to implement, but the cost is undeniably high: even modifying or deleting a single key rewrites and flushes every node on the B+ tree path to the corresponding leaf. Therefore, if changes are frequent, it is best to batch them in a single transaction.
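For example, batching many writes into a single transaction pays the page-copy cost once per page rather than once per key. A sketch using the public API, assuming the db handle from earlier and a made-up bucket name:

err := db.Update(func(tx *bolt.Tx) error {
	b, err := tx.CreateBucketIfNotExists([]byte("events"))
	if err != nil {
		return err
	}
	// One commit pays the COW cost once for the whole batch,
	// instead of once per Put in separate transactions.
	for i := 0; i < 1000; i++ {
		k := []byte(fmt.Sprintf("key-%06d", i))
		if err := b.Put(k, []byte("value")); err != nil {
			return err
		}
	}
	return nil
})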

When a bucket is small

When a bucket contains very little data, no new page is allocated for it; instead it is inlined in a leaf node of its parent bucket. The logic for deciding whether a bucket can be inlined is in bucket.inlineable, as sketched after this list:

  1. It contains only one leaf node.
  2. Its data size is no more than 1/4 of a page.
  3. It contains no sub-buckets.
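A condensed sketch of that check, following the three criteria above (simplified from bucket.inlineable):

// inlineable reports whether this bucket is small enough to be embedded
// in a leaf element of its parent bucket (condensed sketch).
func (b *Bucket) inlineable() bool {
	var n = b.rootNode

	// 1. The bucket must consist of a single leaf node.
	if n == nil || !n.isLeaf {
		return false
	}

	// 2 & 3. Its serialized size must stay within the inline threshold
	// (1/4 page) and it must not contain any sub-buckets.
	var size = pageHeaderSize
	for _, inode := range n.inodes {
		size += leafPageElementSize + len(inode.key) + len(inode.value)
		if inode.flags&bucketLeafFlag != 0 {
			return false
		} else if size > b.maxInlineBucketSize() {
			return false
		}
	}
	return true
}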

After the transaction is started

When each transaction is initialized, a handle to the root bucket is copied into memory as the entry point for dynamically loading the nodes on any modified path later.

func (tx *Tx) init(db *DB) {
	// ...

	// Copy and initialize the root bucket
	tx.root = newBucket(tx)
	tx.root.bucket = &bucket{}
	*tx.root.bucket = tx.meta.root

	// ...
}

In a read/write transaction, when the user calls bucket.Put to add or modify data, there are two phases:

  1. Locate: use a cursor to locate the page/node where the given key lives.
  2. Load and insert: load all nodes along the path into memory and insert the KV into the leaf node.
func (b *Bucket) Put(key []byte, value []byte) error {
	// ...

	// 1. Move the cursor to the position of key
	c := b.Cursor()
	k, _, flags := c.seek(key)

	// ...

	// 2. Load the nodes on the path and insert the key value
	key = cloneBytes(key)
	c.node().put(key, key, value, 0, 0)

	return nil
}

Locate. The implementation logic is in cursor.seek, discussed in the previous section. Its main flow is to start from the root node and repeatedly binary-search, visiting the corresponding node/page at each step, until it reaches the leaf node where the kv should be inserted. A key node-access function here is bucket.pageNode, which does not convert a page into a node: it reuses a cached node if one exists, and otherwise reads the relevant page directly from the mmap’d memory.
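Condensed, and with the inline-bucket special case omitted, bucket.pageNode looks roughly like this:

// pageNode returns the in-memory node for the given page id if it has been
// materialized, and otherwise the mmap'd page (condensed sketch).
func (b *Bucket) pageNode(id pgid) (*page, *node) {
	// Prefer a cached node: it may hold uncommitted changes.
	if b.nodes != nil {
		if n := b.nodes[id]; n != nil {
			return nil, n
		}
	}
	// Fall back to reading the page directly from the mmap via the transaction.
	return b.tx.page(id), nil
}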

Load and insert. In this phase, cursor.node() first loads every node along the path recorded in cursor.stack and caches them, avoiding repeated loads later. Then node.put inserts the key value into the leaf node, again positioned by binary search.

In a read-only transaction there is only lookup and positioning, so pages are accessed through mmap only and no page-to-node conversion happens.

bucket.Delete goes through the same two phases, except that load-and-insert becomes load-and-delete. As you can see, all changes at this stage happen in memory; the B+ tree structure formed by the existing pages in the file system is untouched, so read transactions can proceed concurrently.

Before transaction commit

After a write transaction starts, the user may perform a series of insertions and deletions, and a large number of the pages involved are converted into nodes and loaded into memory. The modified B+ tree then consists partly of pages in the file system and partly of nodes in memory, and some nodes may end up with too few elements while others have too many.

Therefore, before the transaction commits, the B+ tree is adjusted according to certain strategies to preserve good lookup properties, and then all changed nodes are serialized into pages and written out incrementally to the file system, forming a new, persistent, balanced B+ tree.

The key functions that keep boltDB’s B+ tree searchable are bucket.rebalance and bucket.spill, which merge and split nodes respectively.

rebalance. This function merges nodes that are too small (too few keys, or too small overall) into neighboring nodes. The main logic lives in node.rebalance; bucket.rebalance is mostly a wrapper around it:

func (b *Bucket) rebalance() {
	// Adjust all cached nodes
	for _, n := range b.nodes {
		n.rebalance()
	}

	// Adjust all sub-buckets
	for _, child := range b.buckets {
		child.rebalance()
	}
}

node.rebalance’s logic is as follows:

  1. Check the n.unbalanced flag to avoid adjusting the same node repeatedly. The flag serves two purposes: adjust only on demand, and adjust at most once.
  2. Check whether the node needs merging at all: if n.size() > pageSize/4 && len(n.inodes) > n.minKeys(), the node is large enough and rebalancing ends (see the excerpt after this list).
  3. If the node is the root and has only one child, merge it with that only child.
  4. If the node holds no elements, delete it.
  5. Otherwise, merge the node into its left sibling; if it is the first child and has no left sibling, merge the right sibling into it.
  6. Since the merge may delete a node, recurse upward to check whether the parent needs further adjustment.
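The size check in step 2, excerpted (slightly simplified) from node.rebalance:

// Ignore the node if it is above the size threshold (1/4 page) and
// still has enough keys: it is already balanced, nothing to merge.
var threshold = n.bucket.tx.db.pageSize / 4
if n.size() > threshold && len(n.inodes) > n.minKeys() {
	return
}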

To be clear, only a call to node.del() marks n.unbalanced, and node.del() is called from just two places:

  1. The user calls bucket.Delete to delete data.
  2. During node.rebalance, a node that has been merged away is deleted.

Since 2 is in turn triggered by 1, the main logic of node.rebalance effectively only runs when the user has deleted data during a write transaction.

spill splits nodes that are too large (bigger than one page) and writes them to dirty pages. As with rebalance, the main logic is in node.spill; unlike rebalance, bucket.spill also carries a fair amount of logic for handling inline buckets. As mentioned earlier, a bucket whose content is small enough is embedded directly in a leaf node of its parent bucket; otherwise the child bucket’s spill is called, and the page id of the child bucket’s root node is written into the parent bucket’s leaf node. As you can see, spill adjusts bottom-up, like a post-order traversal of the tree.

// spill splits nodes larger than one page and writes the adjusted nodes to dirty pages
func (b *Bucket) spill() error {

	// Bottom-up: recursively adjust the sub-buckets first
	for name, child := range b.buckets {
		// If the child bucket is inlineable, serialize all its data and embed it
		// in the corresponding leaf element of the parent bucket
		var value []byte
		if child.inlineable() {
			child.free()
			value = child.write()
		// Otherwise adjust the child bucket first, then write the page id of its
		// root node as the value into the corresponding leaf element of the parent
		} else {
			if err := child.spill(); err != nil {
				return err
			}

			value = make([]byte, unsafe.Sizeof(bucket{}))
			var bucket = (*bucket)(unsafe.Pointer(&value[0]))
			*bucket = *child.bucket
		}

		// If the child bucket has no cached nodes (i.e., no data changed), skip it
		if child.rootNode == nil {
			continue
		}

		// Update the parent bucket's reference to this child bucket
		var c = b.Cursor()
		k, _, flags := c.seek([]byte(name))
		if !bytes.Equal([]byte(name), k) {
			panic(fmt.Sprintf("misplaced bucket header: %x -> %x", []byte(name), k))
		}
		if flags&bucketLeafFlag == 0 {
			panic(fmt.Sprintf("unexpected bucket header flag: %x", flags))
		}
		c.node().put([]byte(name), []byte(name), value, 0, bucketLeafFlag)
	}

	// If this bucket has no cached nodes (i.e., no data changed), stop here
	if b.rootNode == nil {
		return nil
	}

	// Adjust this bucket itself
	if err := b.rootNode.spill(); err != nil {
		return err
	}
	b.rootNode = b.rootNode.root()

	// Since adjustments are written out to new pages, the page id of this
	// bucket's root node changes, so b.root must be updated
	if b.rootNode.pgid >= b.tx.meta.pgid {
		panic(fmt.Sprintf("pgid (%d) above high water mark (%d)", b.rootNode.pgid, b.tx.meta.pgid))
	}
	b.root = b.rootNode.pgid

	return nil
}

The main logic of node.spill is as follows:

  1. Check the n.spilled flag. It defaults to false, meaning every node needs adjusting; if it is set, skip the node.
  2. Since adjustment is bottom-up, recurse into the child nodes first, then handle this node.
  3. Split this node into page-sized chunks (see the sketch after this list).
  4. Allocate appropriately sized new pages for all the resulting nodes, then write the nodes into those pages (not yet flushed to the file system).
  5. If the split produced a new root node, recurse upward to adjust the rest of the tree; to avoid repeated adjustment, this node’s children field is emptied first.
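The split point in step 3 is picked by accumulating element sizes until a fill threshold (the page size scaled by the bucket’s FillPercent) would be exceeded. The selection logic, lightly edited from node.splitIndex:

// splitIndex finds the position at which a node should be split so that the
// first half fills, but does not exceed, the threshold (lightly edited).
func (n *node) splitIndex(threshold int) (index, sz int) {
	sz = pageHeaderSize

	// Always leave at least minKeysPerPage elements for the second half.
	for i := 0; i < len(n.inodes)-minKeysPerPage; i++ {
		index = i
		inode := n.inodes[i]
		elsize := n.pageElementSize() + len(inode.key) + len(inode.value)

		// Stop once we have enough keys and the next element would overflow.
		if i >= minKeysPerPage && sz+elsize > threshold {
			break
		}

		sz += elsize
	}

	return
}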

As you can see, to maintain the B+ tree’s search property, BoltDB does not keep every branch node’s fan-out within a fixed range like the textbook B+ tree; the bound is simply whether a node’s elements fit in one page. This reduces internal page fragmentation and is relatively simple to implement.

Both functions rely on flags set by put/delete so they only adjust what changed, but rebalance merges this bucket’s own nodes before recursing into sub-buckets, while spill adjusts the sub-buckets first and then itself. There is also an ordering requirement between them: rebalance must run first to merge undersized nodes, and spill afterward to split by page size and write the dirty pages.

To summarize: at DB initialization there is only one page, holding the root node of the root bucket. Each later B+ tree is created by CreateBucket, initially embedded in a leaf node of its parent bucket. Read transactions do not change the B+ tree structure. In a write transaction, all changes are first made in memory and then, at commit time, written out incrementally to the file system. As more data is written, a bucket’s B+ tree keeps splitting, grows deeper, and is no longer embedded in its parent bucket.

Summary

Boltdb organizes its database index as a B+-tree-like structure in which all data lives in the leaf nodes and branch nodes are used only for routing lookups. Boltdb supports nesting buckets inside buckets, implemented as nested B+ trees, with parent and child buckets referencing each other by page id.

For simplicity, BoltDB’s B+ tree does not string its leaf nodes together with a linked list. To still support sequential traversal of the data, it implements the extra traversal logic in a cursor, which keeps a traversal stack to make iteration efficient and jumps fast.

Boltdb grows its B+ tree over transaction cycles, and growth happens only in write transactions. When a write transaction starts, the handle to the root bucket is copied; the nodes involved in changes are loaded into memory on demand and modified there. Before the write transaction ends, the B+ tree is adjusted, new pages are allocated for every changed node, and the nodes are written back to the file system, completing one round of growth. Pages freed in the process that are no longer referenced by any read transaction go into the freelist for later reuse.

Annotations

Node: a node of the B+ tree. It appears as a page in the file system or in mmap’d memory, and becomes a node after in-memory conversion.

Path: a path in the tree, i.e., the sequence of nodes from the root node down to the current node.

References

  1. boltdb repo: github.com/boltdb/bolt

I am Greenwood Bird, a storage engineer who loves photography. Welcome to follow my official account: “Miscellaneous Notes of Wooden Birds”.