preface

Why does a database need an index?

We all know that database data is stored on disk, when our program is started, it is equivalent to a process running in the machine’s memory. So when our program wants to query data, it has to go out of memory to find the data on disk, and then write the data back to memory. However, disk IO efficiency is far less than memory, all data search speed directly affects the efficiency of the program.

The main purpose of database index is to use a suitable data structure, MySQL learning notes + interview question +MySQL index optimization video, can make the query data efficiency becomes higher, reduce the number of DISK I/O, improve the speed of data search, rather than the foolish youth type of global traversal.

Why is the index using B+Tree?

If we think about it simply, if we want to find the data quickly, we feel that the hash table is the fastest, based on the key, hash into a slot, directly to find the exact location of the data, how fast. However, when we do business, we usually need only one data requirement. Most of the requirements are to query a part of the data according to certain conditions. In this case, hash display is not suitable.

If we look at trees, binary trees, balanced binary trees, red black trees, B trees, they’re binary search, they’re fast to find, but whether it’s balanced binary trees or optimized red black trees, at the end of the day, they’re binary trees, and when there’s more nodes, they’re higher. I’m looking for a number. If the root node is not, I will go to the next layer. If the next layer is not available, I will go to the next layer. The result is that I may have to search for a data several times, and each time I perform disk I/O. So should we just make it shorter?

So let’s think about B trees. Firstly, the data structure of B-tree is briefly introduced:

Definition of B-tree:

  • Each node has a maximum of M-1 keywords (key-value pairs that can be stored).
  • The root node can have at least one keyword.
  • The non-root node must have at least the m/2 keyword.
  • The keywords in each node are arranged in descending order, with all the keywords in the left subtree of each keyword being smaller than it and all the keywords in the right subtree being larger than it.
  • All leaf nodes are at the same level, or the length from the root node to each leaf node is the same.
  • Each node has an index and data, i.e. a corresponding key and value.

Therefore, the keyword quantity range of the root node is 1 <= K <= m-1, and the keyword quantity range of the non-root node is m/2 <= k <= m-1.

Here m represents the order, and the order indicates how many children a node can have at most, so you need to specify its order when describing a B-tree.

Let’s take another example to illustrate the above concept, for example, here is a b-tree of order 5, the range of root nodes: 1 <= K <= 4, the range of non-root nodes: 2 <= k <= 4.

Below, we through an example of insert, explain the process of insert B tree, and then explain the process of delete keywords.

1.2 B tree Insertion

When inserting, we need to remember a rule: determine whether the number of keys in the current node is less than or equal to m-1. If so, insert the key directly; if not, divide the node into left and right parts by dividing the middle node into the parent node.

Example: In a 5-order B-tree, a node has at most 4 keys and at least 2 keys (note: the following nodes use the same node for key and value).

  • Insert 18,70,50,40

  • Insert the 22

When 22 is inserted, it is found that the keyword of this node is already greater than 4, so it needs to be split. The rules of splitting have been described above. After splitting, it is as follows.

  • Then insert 23,25,39

I split it and I get the bottom one.

Therefore, the number of nodes in each layer of the B-tree will increase. With the same amount of data, the b-tree will be lower than the binary tree, requiring fewer IO times, which meets our index requirements. So why did MySQL finally choose B+ tree, what is better than B tree?

Let’s first look at how B+ trees differ from B trees:

  • The leaf node of the B+ tree contains all the key values of the tree. The non-leaf node does not store data, but only indexes. The data is stored in the leaf node. B trees, on the other hand, have indexes and data on each node.
  • Each leaf node of the B+ tree has a pointer to the adjacent leaf nodes, and the leaf nodes themselves are linked sequentially according to the size of the key word.

As shown in figure:

First point: when non-leaf nodes only store index keys but not data, the space occupied by non-leaf nodes can be reduced, and nodes with the same capacity can store more indexes. Then, it is also a three-layer B+ tree, and the order will be more, and more data will be stored than the B tree.

The second point: the B + tree leaves node entities adjacent leaf node pointer, want to understand the benefits of the pointer, when we first know disk read data is often not in accordance with the need to read, but every time to proofread, even if only need one byte, disk will begin from this location, order a certain length of data read back into memory. The theory behind this is the famous locality principle in computer science:

  • When a piece of data is used, nearby data is usually used immediately.
  • The data required during the running of a program is usually concentrated.

The prefetch length is an integer multiple of a page. Hardware and operating systems often divide main memory and disk storage areas into contiguous, equal-sized blocks. Each block is called a page (in many operating systems, the page size is usually 4k). Main memory and disk exchange data on a page basis. When the data that the program is trying to read is not in main memory, a page-missing exception will be triggered. At this time, the system will send a disk read signal to the disk. The disk will find the starting location of the data and load one or more pages back into memory successively.

Now look at the pointer to the leaf node of the B+ tree, we can see its use, when the prefetch can ensure that the sequential read data is in order.

Some of you may have mentioned the B* tree, which is based on the B+ tree, adding a linked list pointer to non-leaf nodes. Personally, I don’t think it’s necessary to use b-tree. We don’t store data in non-leaf nodes. Data is all in leaf nodes, and the linked list pointer is useless in non-leaf nodes.

MySQL index optimization video

Clustered index and non-clustered index:

As mentioned above, the leaf node of the B+ tree stores the data data of the index key, but different mysql engines have different choices of data storage. MyISAM stores the index file and the real data file in two different files. The data stored in the index file is the address value of the data corresponding to the index key in the data file. InnoDB stores formal data in leaf nodes. Therefore, clustering and non-clustering are to distinguish whether the data stored in the leaf node is real (can be understood as whether the leaf node is crowded or not?).

Back table: back table is also simple, but we must first understand the primary key index and normal index, above the leaf node store real data, that is only the primary key index is stored in this way, normal index it stores data is the key of the primary key index. So that makes sense to us. Select * from table where name = ‘test’ select * from table where name = ‘test’ select * from table where name = ‘test’ The only way to get the whole row is to take this key and go back to the primary key index tree again. This operation is called a table back.

Select * from (name+age) where name =xx and age =xx; select * from (name+age) where name =xx and age =xx; select * from (name+age) where name =xx and name =xx. This is because MySQL creates a joint index by sorting the first leftmost field of the joint index, and then sorting the second field.