LSM tree and B+ tree are often used as a data structure in the storage system, so there are similarities and differences between them. LSM tree is proposed on the basis of B+ tree, while B+ tree is an extension of B tree (also known as B- tree), so we analyze the relationship according to the progressive order.

1. B tree (B-tree)

A B-tree is a balanced, multipath lookup tree that is useful in file systems for binary sort tree lookups. Each node of the B tree is an ordered table of multi-key codes. When a node is reached, the ordered table is searched first. If it is found, the search succeeds. Otherwise, search in the subtree pointed to by the corresponding pointer information. When the leaf node is reached, it means that there is no corresponding key code in the tree.

A B-tree of order M has the following characteristics:

  1. The root node has at least two children.

  2. Each intermediate node contains K-1 elements and k children, where m/2≤ K ≤mm/2 \leq k \leq mm/2≤ K ≤m

  3. Each leaf node contains K-1 elements, where M /2≤ K ≤mm/2 \leq k \leq mm/2≤ K ≤m

  4. All the leaf nodes are in the same layer.

  5. Elements in each node are arranged from small to large, and k-1 element in the node is exactly the range division of elements contained by K children.

Take a third-order B-tree:

2≤k≤32\leq k\leq32≤k≤3 Intermediate nodes contain 1-2 elements and 2-3 child leaf nodes contain 1-2 elements

2. B + tree

A B+ tree is a variant of a B tree that is required by the file system. Traditional relational databases use B+ trees or some variants as storage structures, which can efficiently search.

An M order B+ tree has the following characteristics:

  1. The root is either a leaf, or at least two subtrees, or at mostmKeZi tree.
  2. All non-terminal nodes except the root node have at least one
    [ m 2 ] [\frac{m}{2}]
    There are subtrees, at mostmKeZi tree.
  3. All the leaves are at the same level.
  4. If a node hasnA subtree, must containnA keyword.
  5. All leaf nodes contain all recorded keyword information and Pointers of these keyword records, and nodes are linked in order of keyword size from small to large.
  6. All non-leaf nodes can be seen as indexed parts, which only contain the maximum (or minimum) keyword of their subtree.

Comparing the two, we can see that the difference is (unique to B+ trees) :

  • There arenThe nodes of the subtree containnA key
  • All leaf nodes contain information of all key codes and Pointers to records containing these key codes, and leaf nodes themselves are linked sequentially according to the size of key codes from small to large
  • All non-leaf nodes can be regarded as index parts, which only contain the maximum (or minimum) key codes of their sub-root nodes

Take a third-order B+ tree:

Intermediate nodes don’t hold data, just indexes, data is all in the leaf node.

3. The LSM tree

Traditional relational databases that use B+ trees or some variation as storage structures are efficient for lookups. But it also has an obvious drawback when stored on disk, that is, it is logically close but physically far apart (the intermediate nodes are indexes), which can cause a lot of random disk reads and writes.

To overcome the weaknesses of B+ Trees, HBase introduces the concept of log-structured merge-trees (LSM).

  • LSM trees are a kind ofStorage policy, == Can take any existing data structure, such as B+ tree ==.
  • When LSM tree inserts records, logs are first written and then inserted into memory. The tree in memory is called C0 tree.
  • Data in memory reaches a certain threshold, or after a period of time, the C0 tree is merged into disk C1.
  • The data in C1 can be further merged into C2.