This article is a summary I wrote after consulting articles and videos by various authors at home and abroad. If it contains mistakes or omissions, corrections are welcome.

The main goal of this article is to work out, step by step, why the B+ tree is used to implement indexes in relational databases.

How does a disk store data

A mechanical hard disk contains at least one platter. On each platter there are concentric circles, and the ring between two adjacent circles is called a track, which is the pink part of the image. At the same time, radial lines fanning out from the platter's axis divide each track into arc-shaped segments called sectors; these are the parts of the picture where the pink ring and the blue wedge intersect. If a drive contains multiple platters, the tracks at the same position on different platters make up a cylinder, but that concept is not important for this article.

The key point is this: the sector is the smallest unit of disk reads and writes, and it is a physical unit. A sector is generally 512 B, depending on the manufacturer. But even though the smallest read-write unit of the hard disk itself is a sector, for the operating system the minimum unit for reading and writing the disk is a Block (the Linux term) or Cluster (the Windows term), which is a logical unit.

A block consists of a number of contiguous sectors. The number of sectors per block can be adjusted and may differ between partitions; it is usually 8, which makes a block 4 KiB. The obvious reason for introducing the concept of blocks into the operating system is to decouple the physical design of the hard disk from the logical design of the operating system, and to give the user more choice. For a mechanical hard disk, randomly reading sectors on different tracks is very slow, while reading consecutive sectors is very fast. So the more sectors a block contains, the more content can be read at a time. But because a block is the smallest read-write unit in the operating system, even a nearly empty file occupies a whole block. If your system has many small files and the block size is large, you waste a lot of disk space.

So how does a program read from the hard disk? A program can only read data from memory, and memory also has a smallest read-write unit, the Page, usually 4 KiB in size. When a program reads a file, the operating system reads a whole number of blocks from the disk and fills a whole number of memory pages with them for the program to use. The obvious reason for introducing the concept of pages is to shield programs from the fact that different partitions may use different block sizes.

With the above premise in mind, we know that for a program to read even a single byte from disk through the operating system, at least one block (4 KiB) must be read. A byte on the hard disk can therefore be addressed as a block number plus an in-block offset.
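As a small sketch of this addressing scheme (assuming the 4 KiB block size used throughout this article):

```python
BLOCK_SIZE = 4096  # bytes per block, as assumed in this article

def locate(byte_address: int) -> tuple[int, int]:
    """Split an absolute byte address into (block number, in-block offset)."""
    return divmod(byte_address, BLOCK_SIZE)

print(locate(10000))  # (2, 1808): byte 10000 lives in block 2, at offset 1808
```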

In fact, the InnoDB storage engine reads and writes in units of index pages. This "page" is not the operating-system page but a data structure used by the storage engine, usually set to 16 KiB. You can see the current page size in MySQL with SHOW VARIABLES LIKE 'innodb_page_size';. However, this does not affect the derivation in the rest of the article.

Why indexes are needed

Now let's assume we have a table with two fields, id and name, whose sizes are as follows.

| Field name | Field size (B) |
| ---------- | -------------- |
| id         | 4              |
| name       | 1020           |

Well, the size of the name field is 1020 B purely for ease of calculation. Let's say the table holds 100 rows of data.

| id  | name  |
| --- | ----- |
| 1   | yizdu |
| 2   | udziy |
| …   | …     |
| 100 | izdyu |

So how many blocks do we need to store it? Obviously, (100 × (1020 + 4)) / 4096 = 25, so 25 blocks.
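The same calculation, written out (rounding up because a partially filled block still occupies a whole block):

```python
import math

BLOCK_SIZE = 4096    # 4 KiB per block
ROW_SIZE = 4 + 1020  # id (4 B) + name (1020 B)
ROWS = 100

blocks_needed = math.ceil(ROWS * ROW_SIZE / BLOCK_SIZE)
print(blocks_needed)  # 25
```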

What happens when we read the name for a given id from the hard drive? The sectors within a block are contiguous on the disk, but when a file is split into blocks and stored, the blocks may land anywhere on the disk. There is no way to jump directly to a row on disk by multiplying the row size, so if we want to find a given id we have to scan from the beginning, reading at least 1 and at most 25 blocks.

This situation is like reading a book without a table of contents. You have to traverse it every time to find what you want. To avoid this situation, you can index the table just as you would a book with a table of contents.

Let's now design an index table with an id and a pointer. We'll assume a pointer size of 6 B, which is in fact the pointer size InnoDB uses.

| Field name | Field size (B) |
| ---------- | -------------- |
| id         | 4              |
| pointer    | 6              |

We then build the following table, which is obviously 100 × (4 + 6) = 1000 B in size and fits in a single block.

| id  | pointer               |
| --- | --------------------- |
| 1   | block number + offset |
| 2   | block number + offset |
| …   | …                     |
| 100 | block number + offset |

Once we have this table, we only need to read the index table and then follow the pointer's block number and offset to the target data on disk: just 2 block reads in total.
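A toy model of the two-step lookup: the whole index fits in one block, so any lookup costs one index-block read plus one data-block read. The pointer values here are made up for illustration, not real disk addresses.

```python
# id -> (block number, in-block offset); values are illustrative only.
index_block = {i: (i * 1024 // 4096, i * 1024 % 4096) for i in range(1, 101)}

def lookup(target_id):
    """Return (pointer, number_of_block_reads) for a given id."""
    reads = 1                        # read the single index block
    pointer = index_block[target_id]
    reads += 1                       # read the data block it points to
    return pointer, reads

pointer, reads = lookup(42)
print(reads)  # 2 reads, regardless of which id we ask for
```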

Incidentally, this way of storing data addresses in the index table is called a non-clustered index, and it is the approach used by the MyISAM storage engine.

What if the index table is also large

Suppose our table has 10,000 rows; the index table itself then occupies 25 blocks, and we are back to the original problem. So how do we improve further? We can create another index for the index.

The index of the index can be designed so that each entry maps a range of ids to the disk block number of the corresponding index block.

| Field name | Field size (B) |
| ---------- | -------------- |
| id         | 4              |
| pointer    | 6              |

For example, each record in our index table occupies 10 B, so one block can hold 4096 / 10 = 409 index entries. We can therefore design the index of the index so that ids 1–409 jump to (say) block 111, ids 410–818 jump to block 222, and so on. Then we only need to read the index of the index; given an id we can jump straight to the index block covering its range, without reading the whole index table.
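The per-block capacity and range boundaries follow directly from the sizes above:

```python
BLOCK_SIZE = 4096
ENTRY_SIZE = 4 + 6                            # id (4 B) + pointer (6 B)
ENTRIES_PER_BLOCK = BLOCK_SIZE // ENTRY_SIZE  # 409 index entries fit in one block

# First id covered by each entry of the index-of-the-index: 1, 410, 819, ...
range_starts = [1 + i * ENTRIES_PER_BLOCK for i in range(3)]
print(ENTRIES_PER_BLOCK, range_starts)  # 409 [1, 410, 819]
```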

| id    | pointer   |
| ----- | --------- |
| 1     | block no. |
| 410   | block no. |
| …     | …         |
| 10000 | block no. |

Drawn out, it looks something like the figure above; it is not a great drawing, but it will do. So how much space does this index of the index take? Obviously, ((10000 / 409) × 10) / 4096 ≈ 0.06 blocks, so a single block is enough. In the end we can reach any row in the table with only three block reads, which is far more efficient.

How can building indexes, and indexes of indexes, be automated

If the index of the index is itself large, then obviously we need yet another index on top of it. But how can this process be made smarter and more automated?

In fact, the picture above already hints at the answer. To reach a piece of data, we compare and search from the top-level index downward, finally descending to the lowest-level node that holds the data. This is clearly a tree.

So, what type of tree is a good choice? Common search trees are binary search trees (BST), AVL trees, red-black trees, B trees, and B+ trees.

In fact, from the derivation above we can see that the ultimate purpose of index optimization is to reduce the number of disk blocks read: compared with CPU operations and memory reads, disk IO is far too slow, and it is the bottleneck of the database. However, in an ordinary binary tree each node can only have two children. If a binary tree were used to build the index tree, each block would hold only one index key once stored on disk. This leads to two problems: 1. a block has room for far more than one index key, so the rest of the block is wasted; 2. the index tree would be very tall, leading to many more disk reads. (If you ask why a block can't simply store several binary-tree nodes: it could, but a block packed with binary nodes effectively behaves like a single multi-way node, which is exactly where the next section goes.)

Therefore, to reduce the number of disk reads, the height of the tree must be reduced as much as possible, which calls for a multi-way search tree. Common multi-way search trees are the B tree and the B+ tree.

B tree

Let's talk about B trees first and come back to B+ trees later. A B tree is a multi-way balanced search tree. An order-m B tree is defined by the following rules:

  1. Each node has at most m subtrees
  2. Every non-leaf, non-root node has at least ceil(m/2) subtrees (ceil means rounding up)
  3. If the root node is not also a leaf, it has at least two subtrees
  4. A node with k subtrees has k-1 keys
  5. All the leaves are on the same level
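The structural rules above can be expressed as a validity check for a single node. This is only a sketch of the invariants; a real B tree implementation also has to maintain them during splits and merges.

```python
import math

def node_is_valid(m: int, num_children: int, num_keys: int,
                  is_root: bool, is_leaf: bool) -> bool:
    """Check the order-m B tree rules for one node."""
    if num_children > m:                   # rule 1: at most m subtrees
        return False
    if is_leaf:
        return num_keys >= 1               # leaves hold keys, no children
    if is_root:
        if num_children < 2:               # rule 3: non-leaf root has >= 2 subtrees
            return False
    elif num_children < math.ceil(m / 2):  # rule 2: internal nodes have >= ceil(m/2)
        return False
    return num_keys == num_children - 1    # rule 4: k subtrees -> k-1 keys

print(node_is_valid(3, 2, 1, is_root=True, is_leaf=False))   # True
print(node_is_valid(3, 1, 0, is_root=False, is_leaf=False))  # False: underfull
```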

To see how a B tree handles insertion, deletion, splitting and merging, you can visit the website Data Structure Visualizations. We will not discuss the node-splitting and merging algorithms of the B tree here.

So here is a third-order B tree, and it looks like the figure. We can clearly see that the whole tree gets shorter because it branches more. If each node corresponds to one disk block, the number of blocks that must be read is at least 1 and at most 3. And because all the leaves are on the same level, no query takes more than log_M(N) time, where M is the average number of branches per node in the B tree and N is the amount of data; this bound can be derived the same way as for a binary search tree.
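The log_M(N) bound can be checked numerically; note how large fanouts keep even huge tables shallow:

```python
import math

def max_node_reads(fanout: int, n: int) -> int:
    """Upper bound on node reads: ceil(log base M of N), at least 1."""
    return max(1, math.ceil(math.log(n, fanout)))

print(max_node_reads(3, 20))            # 3: a small third-order tree
print(max_node_reads(409, 10_000_000))  # 3: ten million keys, still 3 levels
```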

B+ tree

The B+ tree is an upgraded version of the B tree. Its definition actually varies in detail from textbook to textbook. Since B trees and their variants (B+ and B* trees) are intended for practical engineering use, they are not only described differently in different textbooks but also implemented differently by different database storage engines; hence the lack of one strict academic definition. An order-m B+ tree is defined by the following rules:

  1. Each node has at most m subtrees
  2. Every non-leaf, non-root node has at least ceil(m/2) subtrees (ceil means rounding up)
  3. If the root node is not also a leaf, it has at least two subtrees
  4. All the leaves are on the same level (so far, identical to the B tree)
  5. All the data is stored in the leaves, which are arranged in order and connected by pointers
  6. All non-leaf nodes serve purely as indexes and store no data

Here’s the more controversial and confusing part:

In most foreign textbooks, a node with k subtrees has k-1 keys. This is very similar to the B tree, so it is easy to understand.

In the InnoDB index tree implementation, and in most domestic Chinese textbooks (such as Yan Weimin's Data Structure), a node with k subtrees has k keys, and each non-leaf node contains the largest (or smallest) key of its subtree. A B+ tree of this design looks like the figure. I don't fully understand the details of how this variant is constructed; the postgraduate entrance exam only requires understanding its basic concepts and properties, and the relevant textbook descriptions are sparse. However, these differences are not significant enough to affect the conclusions below. For convenience of explanation and drawing, I will stick with the foreign-textbook definition of the B+ tree.

So here is a third-order B+ tree, and it looks like the figure. As you can see, since all the data is in the leaf nodes, accessing any piece of data requires log_M(N) reads (M is still the average branching factor and N the amount of data). At the same time, because non-leaf nodes are pure indexes and all data lives in the leaves, there is some data redundancy; as a result, for the same amount of data and the same order, a B+ tree can in some cases be taller than a B tree.

Why choose B+ tree

At first glance the B+ tree seems to perform worse than the B tree, so why is it the common implementation of database indexes?

Let's focus on the biggest differences between B trees and B+ trees, namely points 5 and 6 of the definition:

  5. All the data is stored in the leaves, which are arranged in order and connected by pointers
  6. All non-leaf nodes serve purely as indexes and store no data

Regarding point 5: because all the data is laid out sequentially at the leaf level, range queries and full table scans are easy on a B+ tree index. A B tree would require an in-order traversal, which can cause more random IO and a more complex algorithm implementation.
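As a sketch of why the sibling pointers matter (the Leaf structure here is hypothetical, not InnoDB's actual page layout): once the leaf containing the lower bound has been found, a range query simply walks the chain instead of re-descending the tree. For simplicity the scan below starts at the leftmost leaf.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Leaf:
    keys: List[int]
    next: "Optional[Leaf]" = None  # sibling pointer: the link a B tree lacks

# Three leaves holding sorted keys, chained left to right.
l3 = Leaf([7, 8, 9])
l2 = Leaf([4, 5, 6], next=l3)
l1 = Leaf([1, 2, 3], next=l2)

def range_scan(start_leaf: Leaf, lo: int, hi: int) -> List[int]:
    """Collect keys in [lo, hi] by walking sibling pointers."""
    out, leaf = [], start_leaf
    while leaf is not None:
        out.extend(k for k in leaf.keys if lo <= k <= hi)
        if leaf.keys and leaf.keys[-1] > hi:  # past the range: stop walking
            break
        leaf = leaf.next
    return out

print(range_scan(l1, 2, 8))  # [2, 3, 4, 5, 6, 7, 8]
```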

Regarding point 6: this is the most important point. Because the non-leaf nodes of a B+ tree are pure indexes containing no data, each disk block can hold more keys, which makes the index tree shorter.

Assume both the B tree and the B+ tree use a 4-byte primary key and 6-byte pointers, and for ease of calculation that the number of keys equals the number of subtrees. In a 4 KiB disk block, a B+ tree internal node can hold 4096 / (4 + 6) = 409 index entries. A B tree node, because every node also carries data, needs at least one extra pointer to the data per key, so a block holds only 4096 / (4 + 6 + 6) = 256 entries. As you can see, the B+ tree's fanout is roughly 1.6 times that of the B tree, so its index tree can be much shorter. Even though, in theory, a B+ tree of the same order may be taller than a B tree, in practice the B+ tree achieves a much larger order when stored on disk.
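The fanout comparison, using the sizes assumed above:

```python
BLOCK = 4096
KEY, PTR = 4, 6  # primary key and pointer sizes assumed in this article

bplus_fanout = BLOCK // (KEY + PTR)        # internal B+ node: key + child pointer
btree_fanout = BLOCK // (KEY + PTR + PTR)  # B tree node adds a data pointer per key

print(bplus_fanout, btree_fanout)  # 409 256
```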

In addition, the query complexity of a B+ tree is a stable log_M(N), unlike the B tree's minimum of 1 and maximum of log_M(N). Readers may wonder why this stability matters. I had the same question; what follows combines the views of various bloggers with my own thinking. Personally, I think the biggest benefit lies in the query optimization done by the database management system. Most DBMSs, such as MySQL, implicitly optimize queries, deciding whether to use an index and which one. A single statement may be executable in several ways; MySQL estimates the cost of each and finally chooses the one it considers best. The stable query complexity of the B+ tree makes that cost predictable.

Extra points

In InnoDB, how high is the primary key index tree usually?

In InnoDB, the unit for storing nodes is not the disk block but the "page" designed by InnoDB. Also, InnoDB uses a clustered index, where each leaf node contains the entire row of table data. As mentioned at the beginning of this article, the page size is usually 16 KiB. A common INT primary key takes 4 bytes, and pointers in InnoDB take 6 bytes.

Therefore, with an INT primary key and assuming each row is about 1 KiB, InnoDB can store 16384 / (4 + 6) = 1638 index entries per internal page and 16 rows per leaf page. So, if the tree height is 1, it can store 16 rows; in that case a single page acts as both root and leaf. At height 2 (only a root and leaf nodes), it can store 1638 × 16 rows; at height 3, 1638 × 1638 × 16 rows; at height 4, 1638 × 1638 × 1638 × 16 rows. Of course, these are theoretical values assuming every node is completely full; in reality the numbers are somewhat smaller. But it doesn't matter, because you can see that at height 3 the tree already stores over 40 million rows.
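The capacity per tree height works out as follows (theoretical maximums, assuming every page is full):

```python
PAGE = 16 * 1024               # InnoDB page size (16 KiB)
FANOUT = PAGE // (4 + 6)       # 1638 index entries per internal page
ROWS_PER_LEAF = PAGE // 1024   # 16 rows of ~1 KiB each per leaf page

capacity = {h: FANOUT ** (h - 1) * ROWS_PER_LEAF for h in (1, 2, 3)}
print(capacity)  # {1: 16, 2: 26208, 3: 42928704} -- height 3 holds ~42.9 million rows
```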

How many disk IOs does InnoDB need for each index lookup

The number of nodes that must be read equals the height of the tree. Most blogs say the number of disk IOs equals the tree height, but that is only true when the tree node size equals the disk block size. In InnoDB the data page is 16 KiB, the root page is memory-resident, and a common operating-system block size is 4 KiB. So in theory the true number of disk IOs is (tree height − 1) × (data page size / block size), where the −1 accounts for the memory-resident root page.
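Plugging in the sizes from this article:

```python
PAGE_SIZE = 16 * 1024   # InnoDB data page
BLOCK_SIZE = 4 * 1024   # typical OS block

def disk_ios(tree_height: int) -> int:
    """Block reads per lookup; the memory-resident root page costs no IO."""
    return (tree_height - 1) * (PAGE_SIZE // BLOCK_SIZE)

print(disk_ios(3))  # 8: two non-root pages read, 4 blocks each
```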


  • [1] Data Structure Visualizations
  • [2] 10.2 B Trees and B+ Trees. How they are useful in Databases
  • [3] Why is the stability of B+ tree queries important?
  • [4] B+Tree index structures in InnoDB
  • [5] B+Trees — How SQL Server Indexes are Stored on Disk
  • [6] How Database B-Tree Indexing Works
  • [7] Modify InnoDB data page size to optimize MySQL method
  • [8]MySQL innodb_page_size