Our day-to-day development almost always involves a database, so how does a database store data in a way that keeps reads and writes efficient? This article takes a detailed look at how data is stored, read, and written in a database.

The simplest database

Let’s start with one of the simplest possible databases: a key-value store built entirely out of bash functions for reading and writing.

There are two functions. One is a write function, which simply appends key-value pairs. The other is the db_get() function, which returns the most recently written value for a key.

We can use it like this: we write two key-value pairs, say key 123456 with the JSON value {"name": "San Francisco", "attractions": ["Exploratorium"]}, and key 42 with {"name": "San Francisco", "attractions": ["Golden Gate Bridge"]}. Calling the get function with 42 then returns the corresponding value. This implementation is pretty straightforward.
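As a concrete illustration, here is a minimal Python sketch of the same append-only idea (the article’s original is a pair of bash functions; db_get() is named in the article, while db_set and the file name are assumptions made here):

```python
# Append-only key-value store sketch. Each record is one "key,value" line.
def db_set(key, value, path="database"):
    # Writing simply appends the pair to the end of the file.
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{key},{value}\n")

def db_get(key, path="database"):
    # Reading keeps the last matching line, i.e. the most recently written value.
    value = None
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                value = v
    return value

db_set("123456", '{"name":"San Francisco","attractions":["Exploratorium"]}')
db_set("42", '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}')
print(db_get("42"))  # the value stored under key 42
```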

What happens if the same key is set more than once, as shown below?

As you can see, the value for 42 has been overwritten. The simplest implementation is to keep appending lines to the end of the file and, when reading, scan from the end backwards, so the first match encountered is the most recent value.

In fact, real databases use a similar structure: an append-only log file that keeps growing as we write to it. Real implementations are of course much more complicated, since they have to handle things like crashes partway through a write, but the general idea is the same.

What if we want to delete a key? For deletes, we append a special record (a tombstone) to the end of the log; when we later scan the records and see it, we know the key has been deleted.
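A minimal sketch of this, assuming a made-up tombstone marker and the same one-record-per-line format as the sketch above; note that the read path now has to recognize the marker:

```python
# Deletion as an append: a special "tombstone" value marks the key as deleted.
TOMBSTONE = "__deleted__"   # marker value chosen here purely for illustration

def db_delete(key, path="database"):
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{key},{TOMBSTONE}\n")

def db_get(key, path="database"):
    value = None
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                value = v                      # later records win
    return None if value == TOMBSTONE else value
```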

The other problem with this implementation is that the file keeps growing as we write: even if we only ever update the same key, the file still gets longer, which eventually becomes a disk-space problem. The usual solution is compaction, or merging: we keep only the latest record for each key and throw the rest away. When the number of distinct keys is small relative to the number of writes, this can shrink the file dramatically.
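A minimal compaction sketch over the same file format, keeping only the newest record for each key (a real engine would also drop tombstoned keys and run this in the background):

```python
import os

def compact(path="database"):
    latest = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            key, _, value = line.rstrip("\n").partition(",")
            latest[key] = value                 # later occurrences overwrite earlier ones
    tmp = path + ".compacted"
    with open(tmp, "w", encoding="utf-8") as f:
        for key, value in latest.items():
            f.write(f"{key},{value}\n")
    os.replace(tmp, path)                       # swap the smaller file into place
```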

Of course, another question you might be wondering about is why we keep appending to the end instead of updating records in place. There are several good reasons:

  • Sequential writes are much faster and more efficient than random writes. If you have studied how HDDs read and write, the main reason is that the disk head does not have to keep seeking to random positions. Even on SSDs, sequential writes extend the drive’s lifetime compared with random writes.
  • Concurrency and crash recovery are much simpler. You don’t have to worry as much about a crash in the middle of a write: with checksums you can detect which records at the tail of the log are incomplete and simply drop them, instead of worrying about a value that was only half overwritten in place.

Write performance with this implementation is actually quite good, because all we ever do is append to the end of the file. But once the data grows large, reads become a problem: to find the value for a key we have to scan through the records, which is O(n). So is there a good way to speed this up? The answer is the hash index, which we will talk about next.

The hash index

Going back to the database example above, the data lives in a file on disk, and every record has a byte offset within that file. If we keep an in-memory hash table that maps each key to the offset of its latest record, read efficiency improves dramatically, as shown in the following figure:

With this hash table we know that key 42 is at offset 64, so we can read the value directly without scanning backwards. The problem with the hash index is that it lives in memory, which means that after a crash we have to rebuild it by scanning the whole file on disk. A good way around this is to periodically save a copy of the hash index to disk, so that we can also recover quickly.
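A minimal sketch of this idea, keeping a key-to-offset dict in memory alongside the append-only file (class and file names are illustrative):

```python
import os

class HashIndexedLog:
    def __init__(self, path="database"):
        self.path = path
        self.index = {}                          # key -> byte offset of its latest record
        open(self.path, "ab").close()            # make sure the file exists

    def set(self, key, value):
        offset = os.path.getsize(self.path)      # the new record starts at the current end
        with open(self.path, "ab") as f:
            f.write(f"{key},{value}\n".encode("utf-8"))
        self.index[key] = offset

    def get(self, key):
        if key not in self.index:
            return None
        with open(self.path, "rb") as f:
            f.seek(self.index[key])              # jump straight to the record, no scanning
            line = f.readline().decode("utf-8").rstrip("\n")
        return line.partition(",")[2]

db = HashIndexedLog()
db.set("42", '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}')
print(db.get("42"))                              # read directly via the stored offset
```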

At first glance this hash index sounds good, but if you think about it more closely, you realize it has some real limitations:

  • If we have a great many keys, the hash index has to hold an entry for every one of them, and it may no longer fit in memory.
  • Range lookups are inefficient. For example, finding the values for all keys between key001 and key002 requires looking up every key individually, because the hash index keeps no ordering information.

So how do we solve these problems? Let’s look at SSTables and the LSM-tree.

SSTable

Notice that in the database implementation described above, each row is a key-value pair and the order of different keys does not matter: whether the key=42 record comes before or after the key=53 record makes no difference. That means we are free to sort the records by key when we merge. A segment file sorted this way is called a Sorted String Table, or SSTable for short.
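The immediate payoff of sorting is that merging segments becomes the merge step of mergesort. A minimal sketch with made-up segment contents, where the newer segment wins when both contain a key:

```python
import heapq

older = [("apple", 1), ("banana", 2), ("cherry", 3)]
newer = [("banana", 20), ("durian", 40)]

def merge_segments(newer, older):
    merged = {}
    # Tag each entry with its segment priority: 0 (newer) sorts before 1 (older).
    tagged = heapq.merge(
        ((k, 0, v) for k, v in newer),
        ((k, 1, v) for k, v in older),
    )
    for key, _, value in tagged:
        merged.setdefault(key, value)   # the first occurrence per key is the newest
    return sorted(merged.items())

print(merge_segments(newer, older))
# [('apple', 1), ('banana', 20), ('cherry', 3), ('durian', 40)]
```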

Its implementation is also quite simple; the flow is as follows:

  • When a write comes in, it is first added to an in-memory balanced tree structure (such as a red-black tree), called a memtable. We start by writing to memory rather than to disk because it is generally much easier to maintain a sorted structure in memory than to maintain a sorted file on disk.
  • When the memtable grows beyond a certain threshold, it is written out to disk. Since the memtable is already sorted, writing it out in key order is easy. Each such flush produces one segment.
  • To serve a read, check the memtable first, then each segment in turn, from newest to oldest.
  • In the background, a separate thread merges the segments, as shown below (a minimal code sketch of the whole flow also follows this list):
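Here is a minimal in-memory sketch of the flow above. It uses a plain dict as the memtable and keeps “segments” as sorted lists in memory; a real implementation uses a balanced tree and writes each segment to its own file:

```python
import bisect

class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}                 # in-memory, holds the most recent writes
        self.segments = []                 # flushed segments, newest first
        self.memtable_limit = memtable_limit

    def set(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as one sorted segment (newest segment goes first).
        self.segments.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Check the memtable first, then each segment from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for segment in self.segments:
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None

    def compact(self):
        # Background merge: combine all segments, keeping the newest value per key.
        merged = {}
        for segment in reversed(self.segments):   # oldest first, newer overwrites
            merged.update(dict(segment))
        self.segments = [sorted(merged.items())]

db = TinyLSM(memtable_limit=2)
for k, v in [("hand", 1), ("hook", 2), ("handbag", 3), ("hand", 4)]:
    db.set(k, v)
print(db.get("hand"))   # 4: the newest value wins
```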

Once this is in place, our in-memory index does not even need to contain every key: a sparse index covering ranges of keys is enough, as shown in the picture below. If you want to look up handful, that key is not in the index, but you know it must lie between handbag and handsome, so you jump to the offset of handbag and scan forward until the offset of handsome, returning the value if it is there. As long as each range is kept reasonably small, scanning it is still very fast. Storage engines built this way are known as log-structured merge-trees (LSM-trees).
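A minimal sketch of such a sparse-index lookup, using the keys from the example with made-up offsets:

```python
import bisect

sparse_index = [("handbag", 0), ("handsome", 4096), ("hook", 8192)]  # sorted by key

def range_to_scan(key):
    """Return the (start, end) byte offsets of the block that may hold `key`."""
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, key) - 1        # last indexed key <= `key`
    if i < 0:
        return None                               # key sorts before the first entry
    start = sparse_index[i][1]
    end = sparse_index[i + 1][1] if i + 1 < len(sparse_index) else None
    return start, end                             # only this small range is scanned on disk

print(range_to_scan("handful"))                   # (0, 4096): between handbag and handsome
```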

Of course, the LSM-tree still leaves room for optimization. For example, looking up a key that does not exist is expensive: you have to check the memtable and then every segment before concluding it is missing. A common optimization is to keep a structure in memory (typically a Bloom filter) that can tell you a key is definitely absent, so the segments do not need to be scanned at all. Another area worth studying is when and in what order the background merges should run. Broadly there are two families of strategies: size-tiered compaction, which merges newer, smaller segments into older, larger ones, and leveled compaction, which splits the key range across levels of smaller segments.
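A minimal Bloom-filter sketch of that “does this key possibly exist” structure: it can answer “definitely absent” or “maybe present”. The sizes and double-hashing scheme here are illustrative, not tuned:

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)          # one byte per bit, for simplicity

    def _positions(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("handbag")
print(bf.might_contain("handbag"))   # True
print(bf.might_contain("missing"))   # almost certainly False: skip all segments
```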

B-Tree

The LSM-tree described above is nice, but it is not the most widely used index structure in practice; that honor belongs to the B-tree.

About the only thing the B-tree has in common with the LSM-tree is that it also keeps keys in sorted order. The LSM-tree divides data into variable-size segments, while the B-tree divides the database into fixed-size blocks or pages, traditionally 4KB, and reads or writes one page at a time. This design corresponds closely to how disks themselves are laid out.

As shown below, each page has an address, so one page can refer to another, much like a pointer, and the pages form a hierarchy. If we are looking for 251, we first follow the reference for the range 200 to 300, then at the next level the reference for the range 250 to 270, and one level further down we find 251 itself.

The number of child pages referenced from one page is called the branching factor, which in our example is six. In the real world it depends on how much space the page references and key boundaries take up, and is typically several hundred. The bottom-level pages, which hold the actual data, are called leaf pages.
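A minimal sketch of this lookup over a toy in-memory B-tree (real B-trees store fixed-size pages on disk and use a much larger branching factor; the page contents below are made up):

```python
import bisect

class Page:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys                # sorted keys (range boundaries in interior pages)
        self.children = children        # child pages, only set on interior pages
        self.values = values            # values, only set on leaf pages

def btree_get(page, key):
    while page.children is not None:
        # Follow the child whose key range contains `key`.
        i = bisect.bisect_right(page.keys, key)
        page = page.children[i]
    # Leaf page: look the key up among the stored entries.
    i = bisect.bisect_left(page.keys, key)
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None

# Two-level example: the root splits the key space at 200 and 300.
leaf1 = Page(keys=[100, 150], values=["a", "b"])
leaf2 = Page(keys=[251, 270], values=["c", "d"])
leaf3 = Page(keys=[300, 400], values=["e", "f"])
root = Page(keys=[200, 300], children=[leaf1, leaf2, leaf3])
print(btree_get(root, 251))   # "c"
```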

If we need to update a value, it is as simple as finding the page containing the key and updating the value in place; everything else stays the same. If we need to insert a value, however, there may be a problem: if the data no longer fits within the page, we have to split the page into two pages and update the parent to reference both halves, as shown in the following figure:

This structure can meet the requirements of most databases today. For example, with a branching factor of 500, 4KB pages, and a four-level tree, we can store roughly 500^4 × 4KB ≈ 256TB of data.

The problem with this structure is that we write whole pages at a time, and a crash halfway through writing a page leaves it only partially written. To guard against this, we usually also keep an append-only log file, a write-ahead log, which is updated before the B-tree page itself is modified; if something goes wrong while updating the B-tree, we can recover the data from this log.
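A minimal sketch of that write-ahead ordering, where apply_change stands in for the real B-tree page update and the record format is made up:

```python
import os

def write_with_wal(log_path, record, apply_change):
    with open(log_path, "ab") as log:
        log.write(record + b"\n")
        log.flush()
        os.fsync(log.fileno())        # the change is durable in the log first
    apply_change(record)              # only now modify the B-tree page in place

write_with_wal("btree.wal", b"set,42,new-value", apply_change=print)
```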

Of course, the write-ahead log is not the only way to guard against crashes. Another approach is to maintain two copies of a page at the same time: the modified page is written to a new location, and only once it has been written successfully do we switch the parent’s pointer over to it.

B-trees have also been around long enough to accumulate plenty of optimizations. For example, many engines try to lay out pages on disk so that pages with adjacent key ranges sit next to each other, turning reads into largely sequential access and improving efficiency. Another example is adding forward and backward pointers to each leaf page, linking it to the leaf pages immediately before and after it in key order, which makes range queries more efficient, and so on.