Red and black tree

An overview of the

Red-black tree is one of the most important data structures of trees. Java containers TreeSet and TreeMap use red-black tree. Red-black trees have also been added to hashMap in JDK1.8. Each node has a color attribute, red or black. In addition to the general requirement for binary lookup trees, the following additional requirements are added for any valid red-black tree:

  1. Nodes are either black or red
  2. The root node must be black
  3. Each leaf node has two black nodes that are NIL
  4. The two children of each red node must be black, so there can be no two consecutive red nodes, and the parent of the red node must be black
  5. All paths from any node to the leaf nodes it can reach contain the same number of black nodes. To achieve black balance. (A balanced binary tree is a perfectly balanced tree, a red-black tree is not a perfectly balanced tree, but a perfectly black balanced binary search tree)

The name of the node

The Parent node, P (Parent)

GrandParent node — G(GrandParent)

Uncle node — U(Uncle)

Current node — C(Current)

Brother node — B(Brother)

The Left child, L (Left)

Right children – R (Right)

B tree

thinking

The forward deletion, change and check of database are the most common and particularly important operations in the development process, especially now the rise of big data, resulting in a sharp increase in data storage, improving the operation efficiency of data has become particularly critical. Most database indexes are stored in the structure of numbers, because trees are relatively efficient to query and remain ordered.

The time complexity of binary search tree is O (logN). In terms of algorithm and logic analysis, the search speed and data comparison times of binary search tree are small. But we have to consider a new problem, the amount of data is much larger than the size of memory, so when we look up the data, we can not load all the data into memory at the same time. Since it cannot be all loaded into memory, it can only load a page in disk step by step, in short, load disk one by one, and load data into memory in chunks for search and comparison.

As shown in the figure, look for 10 in the tree, where each node represents a disk page, which is equivalent to one disk IO per visit to a new node

The search results show that the disk I/O count is related to the tree height. In the worst case, the disk I/O count is equal to the tree height. The disk I/O process is time-consuming and inefficient. Therefore, you need to reduce the tree height when designing the data storage structure. That is, a tall and thin tree becomes short and fat.

When the number of data is the same and the tree height is kept in order, only the key value stored in the nodes is increased to reduce the tree height. That is, each node in the binary search tree has only one data element, while each node in the B tree can have multiple data elements.

define

B trees are also called B-trees. It is a multi-path balanced lookup tree (all leaf nodes have the same height). When describing a B-tree, it is necessary to specify its order, which indicates how many children nodes a node has at most. It is usually represented by the letter M. When M is 2, it is a binary search tree.

To define a b-tree of order m, follow the following five principles:

  1. The root node must have at least one element and at least two child nodes
  2. Each node has at most m-1 elements
  3. The non-root node has at least (m/2) -1 elements. M /2 should be rounded up, for example, m/2=1.5=2
  4. The elements in each node are arranged in descending order, with all elements in the left subtree of each element smaller than it and all elements in the right subtree larger than it
  5. All leaf nodes are in the same layer, which means the root node is the same length to each leaf node

operation

The search of a B-tree is an extension of the search of a binary search tree. The difference with a binary search tree is that there is more than one subtree per node in a B-tree. To find a node in a B-tree, you need to determine which sub-trees the node to find, and then find the target nodes one by one in the nodes. The search process of B tree is relatively simple, similar to binary search tree, so it will not be described here.

insert

The insertion operation of B tree refers to the insertion of a new record in the tree species, namely, the key-value pair of (key, value). If a key-value pair already exists in the B tree, the value to be inserted is used to replace the old value. If the key does not exist in the B tree, it must be inserted into the leaf node.

The insertion process is as follows:

1) According to the value of the key to be inserted, perform a search operation on the B tree to find the current node location of the data to be inserted.

2) Check whether the number of keys on the current node is less than or equal to M-1. If yes, insert data directly.

3) If no, split the node into left and right parts with the key in the middle as the center. Then insert the middle key into the parent node. The left child tree of the key points to the left half after splitting, and the right child tree of the key points to the right half after splitting.

The following uses a fifth-order B-tree as an example to describe the insertion operation of a b-tree. In a fifth-order B-tree, a node has at most four keys and at least two keys. 1: insert 38, which is an empty tree and serves as the root node. Continue to insert 22, 76, and 40, and insert directly in case (2).

2: insert 51, in case (3), perform split.

3: Repeat the same steps to insert 13 and 21

4: insert 39, which conforms to situation (3), resulting in node splitting. Choose the median 22 as the parent node, and move the 22 node up to merge with the 40 node.

5: Insert data with keys 30, 27, 33, 36, 35, 34, 24, and 29 into the tree according to the same insert rule.

6: Continues to insert data whose key is 26. After the data is inserted, split the node.

7: moves the data node whose key is 27 up to the parent node

8: The parent node already has four keys. After inserting the key27 data, split the node and increase the tree height by 1.

9: insert 14,23,28,29,31,32.

delete

The deletion process is as follows:

1) If the current key to be deleted is located in a non-leaf node, the key to be deleted is overwritten by the nearest successor key. Then delete the successor key in the child branch where it is located. In this case, the successor key must be on the leaf node, and the process is similar to the way that nodes are deleted in a binary search tree.

2) After the record is deleted, if the number of keys on the node is greater than or equal to (m/2)-1, the deletion is complete.

3) If no, if the number of keys of the sibling nodes is greater than (m/2)-1, the key of the parent node is moved down to this node, and the key of the sibling node is moved up, and the deletion operation is complete.

4) Otherwise, the key in the parent node is moved down and merged with the key in the current node and its siblings to form a new node. The two child Pointers to key in the original parent node become one child pointer to the new node. Then the pointer to the current node points to the parent node, and repeat Step 2.

Illustrated process:

Performance analysis

B tree is a balanced multi-path search tree, which is designed to reduce the height of the tree by storing more than one key. With the same number of comparison times, the tree height is small to ensure a relatively small number of DISK I/O times and improve query efficiency. It is widely used in file systems and database index scenarios, such as MongoDB

Search performance: There are two types of b-tree search: One is to search for the address of another node from a node, and the disk address needs to be located (search for the address), which costs a lot. The other is to put the ordered keywords in the node into memory for optimized search (can be halved), which is very cheap compared to search. The height of B tree is very small, so in this context, B tree is much more efficient than any binary structure search tree.

Insert performance: Node splitting will occur during b-tree insertion. When an insert operation causes s nodes to split, the number of disk accesses is H (reading nodes on the search path) +2s (writing two new nodes split) +1 (writing back to the new root node or nodes that did not split after insertion). Therefore, the number of disk accesses required is H +2s+1, up to 3H +1. So insertion is expensive.

Deletion performance: Deletion of b-tree will result in node merge operation. The worst case number of disk access is 3 h = (find h time contain deleted elements need read access) + (for the most adjacent to h layer 2 brothers need read access) h – 1 + (layer in to 3 h the merger of the h – 2 write access) + (the root node of the modified and two nodes on the layer 2 to 3 times to write access).