In addition to B+ trees, you may have heard of B+ trees, B- trees, in fact, B- trees are B trees, the English translation is B-Tree, where the “-” is not relative to the “+” in B+ Tree, but just a conjunction. And the B tree is actually a lower-level version of the B+ tree, or a better version of the B tree.

B+ tree

B+ tree is actually an M-balanced search tree (not a binary tree).

Balanced lookup tree definition: the height difference between the left and right subtrees of any node in the tree cannot be greater than 1

/** * This is the definition of a non-leaf node in a B+ tree. * * assume keywords = [3, 5, 8, 10] * 4 key data can be divided into five ranges: (INF, 3), [3, 5), [5, 8), [8, 10), [10, INF) * 5 interval corresponding: Children [4] * * m = (m-1)*4[keywordss size]+m*8[children size] */
public class BPlusTreeNode {
  / / 5 tree
  public static int m = 5;

  // Key value, used to divide data interval
  public int[] keywords = new int[m-1];

  // Save the child node pointer
  public BPlusTreeNode[] children = new BPlusTreeNode[m];
}
/** * This is the definition of a leaf node in a B+ tree. * * leaf nodes in a B+ tree are different from internal nodes. * leaf nodes store values, not ranges. * In this definition, each leaf node stores three rows of key and address information. * * the k value is calculated in advance by making the size of all information exactly equal to the page size: * PAGE_SIZE = K *4[keyw.. size]+ K *8[dataAd.. size]+8[prev size]+8[next size] */
public class BPlusTreeLeafNode {
  public static int k = 3;

  // The key of the data
  public int[] keywords = new int[k];

  // Data address
  public long[] dataAddress = new long[k];

  // This node is the precursor of the linked list
  public BPlusTreeLeafNode prev;

  // This node is the next node in the list
  public BPlusTreeLeafNode next;
}

Copy the code

In a B+ tree, the nodes in the tree do not store the data themselves, but serve only as indexes. In addition, all recorded nodes are stored in the leaf nodes of the same layer in order of size, and each leaf node is connected by a pointer.

To summarize,B+ trees have the following characteristics

  1. Each node of a B + tree can contain more nodes for two reasons. The first reason is to reduce the height of the tree. (The index tree is not stored in memory. The height of the tree is equal to the number of DISK I/O operations per query. The other option is to change the data range to multiple intervals. The larger the interval, the faster the data retrieval (think skip tables)

  2. Instead of storing one key, each node stores multiple keys

  3. Leaf nodes are used to store data, while other nodes are used for indexes

  4. Leaf nodes are linked to each other by two Pointers for better sequential query performance.

This design also has the following advantages:

  1. The non-leaf nodes of a B + tree store only keys and take up very little space, so each layer of the node can index a much wider range of data. In other words, you can search for more data for each IO operation

  2. Leaf nodes are connected in pairs to meet the disk prefetch feature. For example, leaf nodes store 50 and 55, which have Pointers to leaf nodes 60 and 62. When we read data from disk corresponding to 50 and 55, we will mention 60 and 62 in passing due to the preread nature of disk. Read the corresponding data. This time sequential reading, rather than disk searching, speeds things up.

  3. Supports range query. Local range query is very efficient. Each node can index a larger and more accurate range, which means that the SINGLE-disk I/O information of the B + tree is larger than that of the B tree and the I/O efficiency is higher

  4. Because the data is stored in the leaf node layer and there are Pointers to other leaf nodes, a range query only traverses the leaf node layer, not the entire tree.

Because of the gap between disk access speed and memory, disk I/O should be minimized for efficiency. Disks are usually not read strictly on demand, but are preread every time. After the disk has read the required data, it will read back a certain length of data in memory. The theoretical basis for this is a well-known local principle in computer science:

How about MySQL data index, you can refer to this article: time.geekbang.org/column/arti…

B-Tree

B-Tree is also an M – fork balanced search Tree

  1. All keys are distributed throughout the tree
  2. All keys appear in one node
  3. The search can end at a non-leaf node
  4. In the full keyword search process, the performance is close to binary search.

The difference between B trees and B+ trees

  1. Non-leaf nodes in B + trees do not store data, and all data stored in leaf nodes makes the query time complexity fixed at log N.

  2. The complexity of the b-tree query time is not fixed, it depends on the position of the key in the tree, preferably O (1).

  3. Since the leaf nodes of B+ tree are linked by bidirectional linked list, it supports range query and is more efficient than B tree

  4. The key and data are together at each node of the B tree

Why is MongoDB using b-tree and Mysql using B+Tree?

Non-leaf nodes in B + trees do not store data, and all data stored in leaf nodes makes the query time complexity fixed at log N. B tree query time complexity is not fixed, it depends on the key position in the tree, preferably O(1).

We’ve already said that having as little disk IO as possible is an effective way to improve performance. MongoDB is an aggregated database, and the B-tree happens to be a cluster of key and data fields.

As to why MongoDB uses B trees instead of B + trees, consider it from a design standpoint. MongoDB is not a traditional relational database, but noSQL stored in BSON format (think of it as JSON). The goal is high performance, high availability and ease of expansion.

Mysql is a relational database, and the most commonly used operation is data traversal operation (JOIN), while MongoDB’s data is more aggregated data, unlike Mysql’s strong relationship between tables, so MongoDB is more of a single query.

Because Mysql uses B+ tree, the data is on leaf nodes, and leaf nodes are connected through bidirectional linked list, which is more conducive to data traversal, while MongoDB uses B tree, and all nodes have a data field. As long as the specified index is found, it can be accessed. There is no doubt that MongoDB is faster than Mysql on average for a single query.

A Hash index

In short, a hash index uses some hash algorithm to convert a key value to a new hash value. There is no need to progressively search from the root to the leaf as in a B + tree. You just need a hash algorithm to find the location immediately, very fast. (Think of a HashMap in Java.)

B+ tree index and Hash index

  1. If the query is equivalent, the hash index is clearly superior, because only one algorithm is needed to find the corresponding key value; Of course, if the key values are unique, you have to iterate through the list if there are hash conflicts.

  2. Hash indexes do not support range queries (with this modification, however, you can use LinkedHashMap in Java to store the insertion order of nodes in a linked list, so you can also use a linked list to store the size order of data)

This supports range query, but the time complexity is O(n), and the efficiency is worse than that of hop tables and B+Tree

  1. Hash indexes cannot use index sort and fuzzy matching

  2. Hash indexes also do not support leftmost matching rules for multi-column union indexes.

  3. Hash indexing is inefficient when there are a large number of key – value conflicts