This is the seventh day of my participation in the August More text Challenge. For details, see:August is more challenging

Index improves query efficiency, just like we read a book. If we want to directly turn to a certain chapter, we just need to look at the table of contents and find the number of pages according to the table of contents.

In a computer we need a data structure to store this directory. Common data structures are hash table, binary lookup tree, binary balanced tree (AVL), red-black tree. So why Innodb and MyISAM choose B + tree?

1. The hash table

A hash table is an array plus a linked list, subscript 0,1,2,3….. Represents the location of its data. If you want to store data in a hash table, first use the hash algorithm (the basic operation is modulo), if the array length is 13, modulo 13 is 0 to 12, which corresponds to the index of the data, if the calculated index is the same, will follow the linked list at the index.

Disadvantages:

  1. Using hash storage requires all data files to be added to memory, comparing memory consumption.
  2. The search of hash is equivalent query, which is fast, but there is no scope rule between each data. However, in actual work, more scope query, hash is not suitable.

Mysql does not use hash tables. It is determined by the storage engine. The Memory storage engine uses hash tables

2. Binary search tree

Disadvantages:

  1. In extreme cases, as shown in the figure, there may be a problem of tilting, and eventually a linked list structure.
  2. The tree nodes are too deep, increasing the I/O of the search, which is now the bottleneck of the search
3. Binary balanced tree -AVL

In order to maintain the balance of the tree and avoid data skew, rotation operation is required. The length of the oldest tree and the shortest subtree should not exceed 1 through left or right rotation. If the length exceeds 1, it is not strictly an AVL tree

Disadvantages:

1. When there is a large amount of data, 1-n rotations are required to maintain balance. This rotation wastes performance, and the efficiency of insertion and deletion is very low, while the efficiency of query is very high.

  1. With only two branches, the tree is very deep even when there is a lot of data.
4. A red-black tree

The oldest tree can’t be more than twice the size of the shortest subtree, which balances inserts and queries with color changes and rotations

A red-black tree is a variant of an AVL tree that loses some query performance to improve insert performance.

Disadvantages:

Also, with only two branches, the depth will still be very deep when the amount of data is large

All of the above three binary trees will eventually have too many nodes as the number of data increases, and they have only two branches, so the number of IO is the same.

How do you solve the problem that you only have 2 branches and it’s too deep, so you have B tree, and you add branches

5. B-Tree
  1. First, instead of reading B subtree, read B tree
  2. All the key values are distributed throughout the tree.
  3. The search may end at the non-leaf node, and a search is done in the complete set of keywords, and the performance is close to binary search.
  4. Each node has at most m subtrees.
  5. The root node has at least two subtrees.
  6. A branch node has at least m/2 subtrees (all branches except the root and leaf nodes).
  7. All leaf nodes are on the same layer, and each node can have up to m-1 keys in ascending order

As shown in the figure above :(only part of the picture is drawn, in fact, there is no limit, not just p1,p2,p3)

Each node occupies one disk block. Each node has two keywords in ascending order and three Pointers to the child root node. The Pointers store the address of the disk block where the child node resides. The three range domains divided by two keywords correspond to the range domains of the data of the subtree to which the three Pointers point. For the root node, the keywords are 16 and 34. The data range of the subtree pointed by the pointer p1 is less than 16, the data range of the subtree pointed by the pointer P2 is between 16 and 34, and the data range of the subtree pointed by the pointer p3 is greater than 34.

How to find keyword 28:

  1. Locate disk block 1 based on the root node and read it to the memory. [First disk I/O operation]
  2. Compare keyword 28 in the interval (16,34) and find the pointer p2 to disk block 1.
  3. Locate disk block 3 using the P2 pointer and read the memory. [Second disk I/O operation]
  4. Compare keyword 28 in the interval (25,31) and find the pointer p2 to disk block 3.
  5. Locate disk block 8 according to p2 and read the memory. [Third disk I/O operation]
  6. If keyword 28 is found in the keyword list of disk block 8, the end.

Disadvantages:

  1. Each node has a key and contains data, and the storage space of each page is limited. If the data is large, the number of keys that can be stored in each node will become smaller.
  2. When a large amount of data is stored, the disk depth becomes large, which increases the I/O query times and affects the query performance.
6. B + tree

B+ tree is an optimization made on the basis of B tree. The changes are as follows:

  1. Each node of a B+ tree can contain more nodes for two reasons. The first reason is to reduce the height of the tree. The second reason is to change the data range into multiple intervals.
  2. Non-leaf nodes store only keys, while leaf nodes store keys and data.
  3. Two Pointers of leaf nodes are connected to each other (complying with the disk prefetch feature). Sequential query performance is higher.

Above: in B + tree has two head pointer, a pointer to the root node, another point to the smallest leaf nodes of a keyword, and between all leaf nodes (nodes) and data is a chain ring structure, therefore the B + tree can be two lookup operation: to the extent of the primary key is a kind of search and paging search, another starts from the root node random search.

Index differences between InnoDB and MyISAM

1. InnoDB- Primary key index

Leaf nodes store specific row data

2. InnoDB- Non-primary key index

A leaf node that is not a primary key index stores the primary key value.

3. MyISAM

The leaf node stores the address of the row data, requiring one additional addressing and one more IO

Summary: Why does mysql use B+ trees

Why do mysql’s InnoDB and MyISAM storage engines use B+ tree indexes

  1. Hash table, equivalent query is fast, but does not meet the usual range lookup, there is no relationship between two adjacent values, and hash comparison consumes memory.
  2. Binomial tree/balanced binomial tree/red-black tree have only two branches. The common thing is that the depth of the tree becomes deeper when the amount of data is large, and the number of IO increases.
  3. B trees store data on nodes, reducing the number of keys on a page and increasing the depth of the tree.
  4. The B+ tree removes data from non-leaf nodes, which increases the number of keys in a page. In addition, leaf nodes are connected by a linked list, which is conducive to range lookup and paging.

Thank you for reading, if it is helpful to you, you can click a like to encourage, also have any questions are welcome to leave a message exchange!