Why choose the B+ tree as the storage engine index structure

In the database and storage world, the storage engine has always played a central role. Put simply, a storage engine is responsible for how data is read and written; put more precisely, how to read and write data quickly and efficiently has always been the key problem a storage engine must solve. Most books and articles that introduce storage engines simply state that read-heavy, disk-based storage engines use the B+ tree, but few analyze the thinking and trade-offs behind choosing the B+ tree as the index structure. To answer that question, this article tries to discuss it with you from a fresh perspective:

Why do disk-based storage engines that handle far more reads than writes choose the B+ tree as their index structure? I hope that after reading this article, readers will have a new understanding of this question and their own answer to it. My ability is limited, so if anything is expressed or understood improperly, please point it out and correct me.

This article proceeds in a question-and-answer style, analyzing and solving the problem step by step. The content revolves around the following three questions. Before you start reading, try answering them yourself!

To reduce confusion, before the formal content begins I would like to briefly explain a few key terms:

1. Storage engine: storage engines are a broad category, ranging from page-structured B+ tree storage engines to log-structured (LSM tree) storage engines. Unless otherwise noted, the storage engine discussed in this article refers to a disk-based B+ tree storage engine for read-heavy workloads. This kind of storage engine appears frequently in relational databases; the classic representative is InnoDB in MySQL. BoltDB, written in Go, also falls into the category discussed here.

2. Extended content: sections marked as extended content cover background material. Readers with a solid grasp of the basics can skip them without losing the thread of the core content; readers less familiar with the basics can read them selectively.

Let's start with the first two questions, since they fall into the same broad category: the first is essentially a question of data structure selection, and the second is a matter of distinguishing cause from effect.

1. The background

Where do we start answering that question? Only by first making the overall background clear can we understand what kind of problem the engineers were facing at the time, and only then can we try to explain the choice at its root. Looked at as a whole, the problem they wanted to solve had the following four main characteristics:

1. Handle read-heavy, write-light scenarios
2. Relational data is organized and stored by rows
3. Store data on the order of tens of millions of records
4. Use cost-effective storage

We will explain them one by one, because if these premises do not hold, we are starting from the wrong place and the rest of the reasoning falls apart.

1.1 Handling read-heavy, write-light scenarios

In the early days of the Internet, most systems mainly handled read-heavy workloads: early BBS content, blogs, portal sites, product listings on e-commerce sites, and so on. In these scenarios, data is written to the database a limited number of times and then viewed and read many times. Even today, many user-facing systems still fall into the read-heavy, write-light category. So "more reads than writes" was the biggest backdrop that early storage engines faced when reading and writing data.

1.2 Relational databases are organized by rows

In the early days, the concept of a storage engine was mostly associated with relational databases; most people probably first encounter the term through MySQL, where it is used heavily. In a relational database, data is organized as database -> table (multiple columns) -> row. By the time it reaches the storage engine layer, the data is already organized as rows. Broadly speaking, this is similar to key-value storage: in a relational database, the value that reaches the storage engine layer is a row record laid out flat according to a specified format. The exact format is not elaborated here; you can look it up yourself. The point is simply that in a relational database, data is stored in row format.

For convenience, the rest of this article will discuss the data stored by the storage engine in key-value terms, where the key is the primary key.

1.3 Storing tens of millions of records

With the rapid development of the Internet, the volume of stored data grew day by day and soon reached the level of tens of millions of records. From a requirements point of view this was clearly an iterative process; the early Internet did not face this scale from day one. But in the world of computer software, scalability is a familiar word, so this was a natural consideration when designing a storage engine. We therefore take storing tens of millions of records as the third piece of background.

1.4 Use cost-effective storage

The third piece of background naturally raises the question of where the data should be stored, and answering that question forces us to consider cost. If storage cost were no concern, the problem would be much simpler; but when it comes to storing tens of millions of records, cost becomes a real constraint.

Our goal is to find a relatively inexpensive storage medium that can hold data at the scale of tens of millions of records.

For computers, the media in which data is stored can be divided into two general categories:

1. Volatile storage: typically memory.
2. Non-volatile storage: typically hard disks, especially the mechanical hard disks of the early days.

For a detailed comparison, you can refer to the following figure:

From the comparison above we can roughly confirm that the storage medium we want is the hard disk (mainly the mechanical hard disk), because it fits several of the characteristics we are looking for. But as we all know, while hard drives meet our requirements, they have an inherent limitation: disk access is much slower than memory access.

It is also worth mentioning how a mechanical hard disk is constructed. A brief introduction is given in the extended content below; readers who already know this material can skip the dashed-box section.


Extended content

The disk illustration above is taken from this article.

A typical mechanical hard disk reads and writes data by first moving the head to the corresponding track, then waiting for the platter to rotate until the corresponding sector is under the head, and finally transferring the data.

Put simply, a disk read/write operation consists of three major parts: seek time, rotational latency, and transfer time. Seek time accounts for the majority, because the head is moved by a motor driving the actuator arm; the movement is mechanical and therefore relatively slow.

Disk access is certainly slower than memory access, but there are two kinds of disk access:

1. Random I/O
2. Sequential I/O

Sequential I/O is much faster than random I/O, fundamentally because it reduces how often the head has to move and therefore reduces seek time.

Due to limited space we will not expand further on hard disks, or we would go off topic. This much knowledge is enough to move on; for more details, consult other materials or the article linked above. Let's return to the main content.


Since we have chosen the hard disk as our storage medium, we must find ways to work around its slowness. The figure below compares memory access speed with disk access speed.

Let's briefly summarize and state the conclusions we've drawn so far:

> For the details behind conclusion 1, refer to the extended content above.

1. Disk access time = seek time + rotational latency + transfer time. Roughly: seek time 8ms to 12ms; rotational latency about 4ms (half a revolution at 7200 RPM); transfer time about 0.3ms at roughly 50MB/s.

2. Disk random IO << disk sequential IO ≈ memory random IO << memory sequential IO.

3. Composition of mechanical disks and solid-state disks: a mechanical hard disk uses electromagnetic storage, controlling reads and writes through electromagnetic signal conversion, with the head moved by a mechanical arm; a solid-state disk uses semiconductor storage, an array of solid-state memory chips made up of a control unit and storage units built from flash memory particles, and is considerably faster.
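As a rough sanity check on conclusion 1, the small Go sketch below adds up the three components of one random disk IO using the figures above (taken here as assumed averages, not measurements) and estimates how many such IOs fit in a second:

```go
package main

import "fmt"

func main() {
	// Assumed averages from the conclusions above (illustrative, not measured).
	const (
		seekMs     = 10.0 // average seek time, within the 8ms-12ms range
		rotationMs = 4.0  // half a revolution at 7200 RPM
		transferMs = 0.3  // transferring a few KB at roughly 50MB/s
	)

	perIOMs := seekMs + rotationMs + transferMs
	fmt.Printf("one random IO ≈ %.1f ms\n", perIOMs)           // ≈ 14.3 ms
	fmt.Printf("≈ %.0f random IOs per second\n", 1000/perIOMs) // ≈ 70
}
```

A budget of roughly 70 random IOs per second is exactly why the rest of this article keeps trying to avoid random disk access.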

1.5 Summary

This section explained the four main pieces of background; let's review them:

1. Handle read-heavy, write-light scenarios
2. Relational data is organized and stored by rows
3. Store data on the order of tens of millions of records
4. Use cost-effective storage

In the end, based on the actual scenario, we choose the hard disk as the storage medium, and at the storage engine layer data is read and written at the abstract level of key-value pairs. Now that we know the big picture, it's time to clarify our goal. Without a goal there is no direction, so knowing what we are aiming for matters.

2. The goal

In the first section we introduced four pieces of background and concluded that we will ultimately store data on hard disks. In this section we focus on the goal we want to achieve; only with a clear goal can we break down the task and solve the problem top-down.

Before we get to our goal, let’s take a look at what a typical user request would look like under our disk-based data storage scenario.

2.1 General user request response process

As we know, our data is stored on the hard disk. When a user request (read or write) comes in, it first passes through memory, which is managed by the operating system; after some processing in memory, the CPU issues instructions to read the data from the hard disk into memory, and finally a response is returned to the user (for a read, the data itself; for a write, a success acknowledgement, and so on).

That is the general flow. We can see that the whole process passes through three stages:

Request process: User request -> Memory -> Hard disk

Response process: Response user <- memory <- hard disk

Now that we’ve figured out how to respond to a user’s request, what’s our final goal?

2.2 The goal

Internet applications follow an unwritten rule: be fast and efficient. Storage engines are no exception, so our goal is to handle data read and write requests quickly and efficiently against the background described above.

Here's the problem: "fast" and "efficient" are qualitative descriptions. What counts as fast? What counts as efficient? How do we quantify this requirement?

If you were responsible for the requirement, how would you think about it?

You may remember that data structures and algorithms give us exactly such a metric: time complexity, the key measure of how efficient a data structure or algorithm is. Think about how our data will eventually be stored and organized. The question then becomes: what data structure should we use to organize our data so that it can be read and written as quickly and efficiently as possible, with speed and efficiency judged by time complexity?

2.3 Summary

In this section, we will review the previous two parts by using a block diagram. The question of which data structure to choose will be covered in the next section.

3. Data structure selection

As mentioned in Section 2.2, we ultimately translate the problem of fast, efficient reads and writes into choosing which data structure to use to organize data, and quantitatively determining our goals by their time complexity. Let’s start with data structures.

3.1 Comparison of data structure schemes

On closer analysis, fast and efficient means that the average time complexity of reads and writes should be as low as possible. We have learned many data structures. The typical structure with O(1) average time complexity is the hash table. A hash table with few collisions is very efficient for point lookups and writes, but its native structure handles range queries, sorting, and similar operations poorly, which is its major drawback. The next tier, slightly slower than O(1), is O(log n); such structures include the binary search tree (BST), the balanced binary tree (AVL tree), the red-black tree, the B-tree, the B+ tree, and the skip list, all of which naturally support sorting and range lookups. Slower still are O(n), O(n^2), and so on; the typical O(n) example is the array. We won't expand on the other types here.

The figure below compares average time complexity from fast to slow: **O(1) -> O(log n) -> O(n) -> ...**

We all know that sorting and range queries are very common requirements in Internet applications. When shopping on an e-commerce site, most users instinctively sort items from highest to lowest sales; on a news portal we look at how many times an article was read in the past week, sorted by time; recommendation systems sort items by one or more attributes of the content. There are many, many such examples. So when choosing a data structure we must consider support for range lookups, sorting, and similar operations.
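To make the range-query point concrete, here is a small illustrative Go snippet (the data and bounds are made up): a map gives O(1) point lookups but forces a full scan for a range query, whereas keeping the keys sorted lets us answer the same range query with a binary search plus a sequential walk:

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	data := map[int]string{1: "a", 5: "b", 9: "c", 12: "d"}

	// Point lookup: O(1) on average.
	fmt.Println(data[5])

	// Range query [4, 10] on a map: no choice but to scan every key.
	for k, v := range data {
		if k >= 4 && k <= 10 {
			fmt.Println(k, v)
		}
	}

	// With sorted keys: binary-search the lower bound, then walk forward.
	keys := make([]int, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Ints(keys)
	for i := sort.SearchInts(keys, 4); i < len(keys) && keys[i] <= 10; i++ {
		fmt.Println(keys[i], data[keys[i]])
	}
}
```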

At this point the hash table does not seem to fit our needs, so let's take the next step and look at the O(log n) structures. Which one do we choose? At first glance any of them — **BST, AVL tree, red-black tree, B-tree, B+ tree, skip list** — could satisfy our needs in memory, so intuitively we could pick any one. But let's not forget that our data ultimately lives on the hard disk, and since we can't reach a conclusion from the memory side alone, let's look at it from the disk's perspective: which data structure is easy to maintain on disk?

3.2 Turn your attention to the disk

As described earlier, our data flow has three stages; taking a read as an example: disk -> memory -> user. The intuitive thought is that it would be ideal to maintain the same data structure on disk and in memory: when a request comes in, the data could be loaded straight from disk into memory, processed a little, and returned to the user. If intuition fails us (no such data structure exists), we would have to take an indirect approach: maintain one data structure on disk to store the data and a different one in memory, with a conversion layer in between when data is read from disk into memory. The indirect approach is worse, because converting the intermediate data inevitably introduces some performance loss.

So let's start with the intuitive approach: does such a data structure exist?

3.3 Intuitive Thinking

Let’s think about it for a moment. Since our goal is still fast and efficient reads and writes, how do we get fast and efficient reads and writes to disks?

From the introduction you already know the answer: use sequential disk IO as much as possible. When sequential IO comes up, the first thing that springs to mind is append-only writing, a typical sequential write with very high performance. So assume every write request appends its data. Writing this way is fast and efficient, but how do we read it back?

As described above, data is stored flat as key-value records, and each record has a different length, so to read it back we need to keep some extra information to help serve read requests. That extra data is called an index. What information do we need in this append-only scenario to serve a normal read? In fact, knowing the location (offset) and size of each record on disk is enough to read it. In short, each record corresponds to one such (offset, size) index entry. A simple schematic looks like this:
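In code, such an append-only store could look roughly like the sketch below. It is a minimal illustration under the assumptions above: the index is a plain in-memory map, names like appendStore are made up, and error handling is trimmed.

```go
package main

import "os"

// indexEntry records where one record lives on disk.
type indexEntry struct {
	offset int64
	size   int64
}

// appendStore appends records to a file and keeps an in-memory index.
type appendStore struct {
	file  *os.File
	index map[string]indexEntry // key -> location of its latest record
	end   int64                 // current end of file (next write offset)
}

// Put appends the value sequentially and records its (offset, size).
func (s *appendStore) Put(key string, value []byte) error {
	if _, err := s.file.WriteAt(value, s.end); err != nil {
		return err
	}
	s.index[key] = indexEntry{offset: s.end, size: int64(len(value))}
	s.end += int64(len(value))
	return nil
}

// Get uses the index to read exactly one record back from disk.
func (s *appendStore) Get(key string) ([]byte, error) {
	e, ok := s.index[key]
	if !ok {
		return nil, os.ErrNotExist
	}
	buf := make([]byte, e.size)
	_, err := s.file.ReadAt(buf, e.offset)
	return buf, err
}

func main() {
	f, _ := os.CreateTemp("", "append-*.log")
	defer os.Remove(f.Name())

	s := &appendStore{file: f, index: map[string]indexEntry{}}
	_ = s.Put("user:1", []byte("hello"))
	v, _ := s.Get("user:1")
	println(string(v)) // hello
}
```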

At this point efficient writes are covered, and so are reads of individual records once we maintain the index. But there is one thing we still have to deal with: the sorting and range-lookup operations mentioned earlier.

In this scheme each record is appended directly, so the data itself sits on disk out of order. To sort or do a range lookup we would have to load every record from disk into memory, check them one by one, and finally return the ones that match. That works, but it is obviously inefficient, and the data on disk probably far exceeds the size of memory, so this approach is not acceptable. Is there a way out?

Let's make a small assumption: suppose that when we write to disk we can write sequentially and, at the same time, the data is written in key order. For example, if we write three records and they are already ordered when written, then for a range search we only need to locate the first one; the records that follow are in order and can be read sequentially. If this assumption holds, the sorting and range problems become fundamentally simpler and we can skip all that trouble. So starting from this assumption, what problems remain to be solved?

In this mode, to access an individual record we still need the earlier conclusion: each record keeps an index entry of (offset, size).

We need to store tens of millions of records, and each record needs an index entry, so tens of millions of records require tens of millions of index entries. The question then becomes: how are these tens of millions of index entries organized? With which data structure? And where are they stored?

Let's see whether the problem of tens of millions of index entries has a solution. Intuitive ideas fall into two broad categories:

  1. Can we reduce the number of index entries? If we can, the problem becomes smaller.
  2. Without reducing the number of index entries, can we find a reasonable data structure to organize them? "Reasonable" here means things like space compression and other optimizations.

Let’s start with the first idea.

Q: Why are there tens of millions of index entries?

A: Because each record needs its own index entry, and we need to store tens of millions of records, so we need tens of millions of index entries.

Q: Why do we need to maintain an index entry for every record?

A: Because each record comes from a user request and its length is not fixed when stored flat in row format, so each record must have its own index entry.

Here we get to the heart of the problem.

Here we can summarize the derivation of the above layers with a diagram:

3.4 Index contradictions

The core contradiction of the index: as the analysis above shows, each record is variable-length, so each record needs its own index entry. Variable length is the crux, so can we improve or optimize along the length dimension? Since we cannot control the length of each record, can we instead transform the disk side?

Suppose we divide the disk into consecutive fixed-size blocks and give each block a number no to tell them apart. If the block size is 100 bytes, block 0 covers bytes 0-99, block 1 covers bytes 100-199, and so on. What does this change buy us? Let's break it down.
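The addressing arithmetic behind this is simple; a tiny illustrative sketch in Go, using the 100-byte block size from the example above (real engines use page-sized blocks, as discussed later):

```go
package main

import "fmt"

const blockSize = 100 // illustrative only

// blockNo tells which block a byte offset falls into.
func blockNo(offset int64) int64 { return offset / blockSize }

// blockStart is the byte offset where a given block begins.
func blockStart(no int64) int64 { return no * blockSize }

func main() {
	fmt.Println(blockNo(150))  // 1: byte 150 lives in block 1 (bytes 100-199)
	fmt.Println(blockStart(2)) // 200: block 2 starts at byte 200
}
```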

After the disk is divided into consecutive fixed-size blocks, each block keeps the two properties of the original scheme: data within it is ordered and written sequentially. The result looks roughly like this:

After this transformation, the key question is how reads and writes work.

Assume the block is large enough that a single record always fits in one block.

Normally a block holds multiple records, and the records within a block are stored in order. When we read a record, it must live on some block, so we first have to solve the positioning problem: which block is it on? Once the block is located, we load it from disk into memory; since the data inside the block is ordered, we can find the index entry for the record by binary search and then read the data through it. Writes are similar: although the external interface writes a single record, internally the disk is written in units of blocks.

The question then becomes how do you determine where a record is stored?

To solve this, we need to know the range of records stored on each block. For example, block 0 holds records with ids 0 to 10; block 1 holds records with ids 11 to 23; and so on.

This way, when we query for a record with ID 7, we quickly know that the record is stored on block 0, then look inside block 0 for the index entry for the record with specific ID 7, and finally read the data.

How do we know these ranges? Naturally, alongside the block number no we save two more pieces of information: min, the smallest key stored in the block, and max, the largest key stored in the block.

Each block then corresponds to a triple block -> (no, min, max): block no holds records in the range min to max.

On reflection there is room for improvement. Since each block is written sequentially, the data within a block is ordered, and the data across blocks is also ordered, the range of records in block i is simply [min of block i, min of block i+1). The fundamental reason is that the blocks themselves are kept in order. So we can reduce the triple to a pair block -> (no, min) and still keep the data across blocks logically ordered.
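A small sketch of this lookup path in Go (illustrative names; it assumes the block index is already in memory and sorted by min): binary-search the block index to find the block whose range covers the key, then binary-search inside that block once it is loaded:

```go
package main

import (
	"fmt"
	"sort"
)

// blockMeta is the per-block index entry: (no, min).
type blockMeta struct {
	no  int
	min int // smallest key stored in this block
}

// findBlock returns the block whose range covers key, assuming
// blocks is sorted by min (block i covers [min_i, min_{i+1})).
func findBlock(blocks []blockMeta, key int) blockMeta {
	// First block whose min is greater than key; the answer is the one before it.
	i := sort.Search(len(blocks), func(i int) bool { return blocks[i].min > key })
	if i == 0 {
		return blocks[0] // key is smaller than every min; look in the first block
	}
	return blocks[i-1]
}

// findInBlock binary-searches the sorted keys loaded from one block.
func findInBlock(keys []int, key int) (int, bool) {
	i := sort.SearchInts(keys, key)
	return i, i < len(keys) && keys[i] == key
}

func main() {
	blocks := []blockMeta{{no: 0, min: 0}, {no: 1, min: 11}, {no: 2, min: 24}}
	b := findBlock(blocks, 7)
	fmt.Println("key 7 should be in block", b.no) // block 0

	// Pretend we loaded block 0 from disk; its keys are sorted.
	keys := []int{0, 2, 3, 5, 7, 9, 10}
	if i, ok := findInBlock(keys, 7); ok {
		fmt.Println("found key 7 at slot", i) // slot 4
	}
}
```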

After a lot of long-winded talk, let’s conclude:

  1. The concept of dividing disks into continuous blocks of fixed size was introduced
  2. The data in the block is still stored in an orderly and sequential manner. Two pieces of information are saved for each record in the block: the offset position on the disk where the record is written and the size of the record
  3. Data between blocks is ordered. Each block needs to store two pieces of information: the block number no and the minimum record value min

After introducing blocks, we can see that for range lookups, sorting, and similar operations we read one block at a time, and a block contains multiple records; if all the records involved sit in one block, only a single disk read is needed. In most cases this reduces the number of disk I/Os, so performance improves here as well.

The overall general structure is shown in the figure below:

Meanwhile, we still have two remaining issues to resolve:

1. How big should a block be?

2. Should the block index itself be stored? And if so, how?

For the first question: how big should the block size be? We can look at it dialectically.

The larger the block, the more data it holds, so the same amount of data needs fewer blocks and fewer block index entries to maintain; but reading or writing a single record drags in more extra data (we read and write whole blocks), so per-record read/write efficiency is lower. The smaller the block, the less data it holds, so the same amount of data needs more blocks and more block index entries; but reading or writing a single record drags in less extra data, so per-record read/write efficiency is higher.

So block size is ultimately a trade-off. How do we choose?

Remember that our data lives on disk and our application runs on top of the operating system, so the question becomes: how do we get the operating system to serve our application best? In short, make good use of its characteristics. Every read or write eventually involves disk operations such as reading, writing, and flushing, and before data reaches the disk it is first written to memory. The operating system manages memory in units of pages, and its disk reads, writes, and flushes are closely tied to those pages. Therefore, to make better use of the operating system, we use the OS page as the basic unit for the block size: in the simplest case a block equals one page (4K by default), and larger choices such as 8K, 16K, or 32K are also possible.

Recall that InnoDB's default page size is 16K, while BoltDB's default page size is the operating system page size, typically 4K.
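For reference, Go exposes the operating system's page size directly, so a block/page size could be derived from it along these lines (a sketch only, not BoltDB's actual code):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	pageSize := os.Getpagesize() // typically 4096 on Linux/macOS
	fmt.Println("OS page size:", pageSize)

	// A storage engine might use the OS page size directly,
	// or a small multiple of it (8K, 16K, ...).
	blockSize := pageSize
	fmt.Println("chosen block size:", blockSize)
}
```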

Since the OS page is the basic unit of the block size, we won't keep "block" as a separate concept; from here on we simply call a block a page, which saves introducing yet another term.

This is the end of the first question. Now let’s look at the second problem.

Should the block index itself be stored? And if so, how?

Our answer is yes, store it. Otherwise, every time the application restarts we would have to rebuild the block index; with a very large amount of data that rebuild is quite time-consuming, and the service may be unavailable while it runs. So we store the block index. How, then, do we store it?

There are two approaches. One stores the index and the record data together in the same file; in MySQL this is known as a clustered index.

The other stores the index and the record data separately in different files; in MySQL this is called a non-clustered index.

3.5 B-tree or B+ tree?

At this point, we are almost at the end of our overall derivation. Let’s summarize the above derivation and get the final result as shown in the figure below.

In the figure above each dashed box represents a page; a page holds data (data items or index items), and pointers between pages preserve the logical order across pages. Each page contains multiple data items or index items, stored in order. Now, looking at the pages connected by the black lines, what kind of structure do we have?

The answer: a multi-way tree, in which each page can be seen as a node holding multiple entries, and a node can have multiple children.

Now consider one more question. We can already read and write with this structure. When we read, if the key we are looking for happens to be the minimum record of some internal page, then if that index entry also stores the record's data we could return it directly without descending further; if it stores only the minimum key, we must keep going down. So let's compare the two choices: should an index entry on a page store the record's raw data or not?

From the analysis above: if the page index entries also store the original data, the structure corresponds to a B-tree; if they do not, it corresponds to a B+ tree. Having seen the difference, do we store the data in the index entries or not?

The answer is no. If a page index entry also carries raw data, then on a page of the same size fewer index entries fit, and for the same amount of data the tree becomes taller. In our scenario the height of the tree corresponds directly to the number of disk I/Os, and to be fast and efficient we want to minimize unnecessary disk I/Os. So we do not store data in the index entries, and with that we have settled on our data structure of choice: the B+ tree.

At this point, starting from our intuitive approach, we have found a data structure that is easy to maintain on disk: the B+ tree.

Through this abstraction and refinement we introduced the concept of pages: data on disk is organized in units of pages, data within a page is ordered, and data across pages is ordered. In the on-disk B+ tree, non-leaf nodes hold page index information, namely (page number no, minimum key min of that page), while leaf nodes hold the actual record data.

Since the B+ tree is chosen for storage on disk, it is naturally chosen in memory as well. Let's see how the two map onto each other.

A node of the in-memory B+ tree corresponds to a page on disk, and the multiple entries inside an in-memory node hold the elements of that page (record data or page index items).
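To make this correspondence concrete, here is a rough sketch of what such a node might look like in Go. The field names are invented for illustration; real engines such as BoltDB or InnoDB lay their pages out differently.

```go
package main

// pageID is the on-disk page number of a node.
type pageID uint64

// indexEntry lives in non-leaf (branch) nodes: it maps the minimum
// key of a child page to that child's page number.
type indexEntry struct {
	minKey []byte
	child  pageID
}

// record lives in leaf nodes: the actual key-value data.
type record struct {
	key   []byte
	value []byte
}

// node is the in-memory image of one on-disk page.
type node struct {
	page    pageID       // which disk page this node was read from
	isLeaf  bool         // leaf nodes hold records, branch nodes hold index entries
	entries []indexEntry // sorted by minKey; used when isLeaf == false
	records []record     // sorted by key; used when isLeaf == true
	next    pageID       // leaf-level sibling link for range scans (0 if none)
}

func main() {
	leaf := node{page: 3, isLeaf: true, records: []record{{key: []byte("k1"), value: []byte("v1")}}}
	println(leaf.isLeaf, len(leaf.records)) // true 1
}
```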

To repeat the core point: we choose the B+ tree rather than the B-tree as the index because, for the same amount of data, the B+ tree is shorter and its average disk I/O count is lower. In addition, with a B+ tree the cost of different requests is relatively uniform, because every record's data lives in a leaf node.

3.6 Summary

This concludes our attempt to answer the question of why b+ trees are chosen as the storage engine index structure. Let’s summarize with a picture:

Finally, let's compare the B+ tree as a pure data structure with the B+ tree as it appears across disk and memory.

The following figure shows a B+ tree as a data structure; we won't dwell on its formal properties here.

The next figure shows the corresponding B+ tree spanning disk and memory.

Finally, in the next section, we will attempt to support our choice by answering the third question.

4. Arguing in reverse

Given that we have chosen the B+ tree to store tens of millions of records, how many records can a 3-level or 4-level B+ tree hold for a storage engine with a given page size? By answering this, we check whether the chosen solution really handles the tens-of-millions scale we set out with.

4.1 How much data can 3-level and 4-level B+ trees store (taking InnoDB as an example)?

How do we make a rough estimate of this problem?

InnoDB's default page size is 16K, so we will estimate with 16K pages.

Before estimating, we make two assumptions:

  1. Assume each page index entry held in a non-leaf node is 16 bytes. A 16K page can then hold about 1000 (16K / 16) index entries.

  2. Assume each record stored in a leaf node averages 160 bytes. A 16K page can then hold about 100 (16K / 160) records.

In conclusion:

For a 3-level B+ tree, the amount of data it can store is on the order of 1000 * 1000 * 100, about 10^8 records.

For a 4-level B+ tree, it is on the order of 1000 * 1000 * 1000 * 100, about 10^11 records.
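The same back-of-the-envelope estimate written out as a small Go calculation under the two assumptions above (the exact figures come out slightly different from the rounded 1000 and 100 used in the prose):

```go
package main

import "fmt"

func main() {
	const (
		pageSize     = 16 * 1024 // InnoDB default page size: 16K
		indexEntrySz = 16        // assumed size of one non-leaf index entry
		avgRecordSz  = 160       // assumed average size of one leaf record
	)

	fanout := pageSize / indexEntrySz     // ≈ 1000 children per non-leaf page
	recsPerLeaf := pageSize / avgRecordSz // ≈ 100 records per leaf page

	threeLevel := fanout * fanout * recsPerLeaf         // ≈ 1e8
	fourLevel := fanout * fanout * fanout * recsPerLeaf // ≈ 1e11

	fmt.Println("3-level capacity ≈", threeLevel)
	fmt.Println("4-level capacity ≈", fourLevel)
}
```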

In other words, in this scenario a 3-level B+ tree can already store on the order of 10^8 records, comfortably covering the tens of millions we need. So this solution can indeed solve the problem we originally posed. A brief summary follows.

4.2 Summary

At this point we have answered all three questions, and in doing so walked through the design choices behind the B+ tree storage engine in read-heavy, write-light scenarios. It is worth noting that this article is simply some thinking and exploration from my own learning process; if anything is understood or expressed incorrectly, please point it out.