This article was first published on my personal public account, TechFlow. Original writing is not easy; your follow is appreciated.


Today is the ninth article on distributed systems.

The log-structured merge-tree (LSM tree) sounds a little daunting, but it is not that hard. In fact, compared to B-trees it is almost child's play.

It is also a very classic data structure with a very wide range of applications in big data systems: many well-known systems are built on an LSM tree under the hood. So today, let's take a closer look at how it works.

Background knowledge

First, let's start with some background. When we talked about B+ trees, the big difference between a B+ tree and a B tree is that all data is placed in the leaf nodes. The core logic behind that optimization is that no matter what the storage medium is, sequential access is far more efficient than random access, and not just by a little. This is a cliché by now; if I remember correctly, this is the third time I have mentioned it in an article.

I recently saw a graph that nicely illustrates the efficiency gap between random and sequential reads. In the graph below, the green part represents the maximum sequential read speed of the hard drive, while the red part represents the random read speed.


If we look at the vertical axis, the two are not even on the same scale: the gap is not just one order of magnitude but at least three, which is striking. Moreover, this gap exists not only on traditional mechanical hard drives but also on today's more advanced solid-state drives (SSDs). In other words, the difference is independent of the medium.

Intuitive optimization

Since random and sequential access differ so much in efficiency, it is tempting to exploit the gap. If we could invent a data structure that takes full advantage of it, our system's write throughput would rise to a whole new level. For many tech companies, especially big data companies, machine cost is a large part of daily expenses; solving this problem well can obviously save a lot of resources.

A naive idea is to make all reads and writes sequential, as logging systems do. A log record is always appended to the end of the file, never inserted into the middle. Since we always append at the end, every disk write is sequential. The trade-off is that with sequential-only access, finding something in the middle of the file requires scanning it from the beginning.
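As a minimal sketch of this idea (in Python, with names of my own invention), an append-only log might look like this: writes always go to the end of the file, and without an index a lookup must scan the whole file, with the last record for a key winning:

```python
class AppendOnlyLog:
    """Toy append-only log: writes always go to the end of the file."""

    def __init__(self, path):
        self.path = path

    def append(self, key, value):
        # Sequential write: open in append mode and add one record at the end.
        with open(self.path, "a") as f:
            f.write(f"{key}\t{value}\n")

    def get(self, key):
        # Without an index we must scan the whole file; the *last*
        # record for a key wins, since later writes override earlier ones.
        result = None
        with open(self.path) as f:
            for line in f:
                k, v = line.rstrip("\n").split("\t", 1)
                if k == key:
                    result = v
        return result
```

This is exactly the write pattern a database log or Kafka partition relies on: appends are cheap, but point lookups without an index are linear.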

This idea is most widely used in two places. One is the database log: when we perform a write or update, the database first logs all changes. The other is messaging middleware such as Kafka.

However, in complex scenarios involving inserts, deletes, lookups, and updates, especially batched reads and writes, a plain sequential file cannot meet the requirements. We could of course introduce a hash table or a B+ tree to implement the functionality, but those structures cannot avoid relatively slow random I/O, and we want to minimize random I/O as much as possible. Based on this principle, the LSM tree (LSMT) was invented. LSMT uses a distinctive mechanism that sacrifices some read performance to preserve write throughput, making almost all operations sequential and nearly eliminating random reads and writes.

Before we introduce the principles of LSMT, let's first introduce its substructure, the SSTable.

SSTable

It is normal to feel confused the first time you see this word. The full name of SSTable is Sorted String Table; it is essentially a file of key-value pairs sorted by key. Take a look at the picture below:


The most basic SSTable is the one on the right in the figure above: key-value pairs sorted by key and stored in a file. When we need to find the value for some key, we read the entire file into memory and search it. Writes work the same way: we apply the insert in memory and then write out a new file that replaces the original, rather than modifying the file in place, because in-place modification would involve moving a lot of data.

If the file is too large, we also build an index file that stores the offset of each key, so that we can jump to the right place quickly when reading. The index file is the left part of the image above.

Note that SSTables are immutable: we only ever replace old SSTables with new ones, never modify them in place, because modification would involve random reads and writes, which we want to avoid.
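To make this concrete, here is a toy in-memory sketch of an SSTable (Python; the class layout is my own assumption, not any real system's format): keys are sorted once at build time, lookups use binary search, and the structure is never modified after construction. On disk, the index would map keys to byte offsets instead of list positions:

```python
import bisect


class SSTable:
    """Toy immutable sorted string table held in memory."""

    def __init__(self, items):
        # items: dict of key -> value; sort once at build time and never mutate.
        self.keys = sorted(items)
        self.values = [items[k] for k in self.keys]

    def get(self, key):
        # Binary search over the sorted keys. In a real on-disk SSTable,
        # the index file would map each key to a file offset instead.
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None
```

Because the table is sorted and immutable, reads are O(log n) and the file on disk is only ever written sequentially, once.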

Basic operations of the LSM tree

Now that you understand SSTable, let’s look at the basic LSM implementation principles.

In fact, the basic principle of the LSM tree is very simple: in essence, it adds a Memtable on top of SSTables. A Memtable, as the name implies, is a table stored in memory, although it does not literally have to be a table; any structure that supports fast insert, delete, and lookup will do, such as a red-black tree or a skip list. Second, we need a log file, just like a database log, to record every change to the data, so that data can be recovered if the system crashes or data is lost. The overall architecture looks like this:
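A minimal write-ahead-log sketch (Python; the class and record format are illustrative assumptions, not any particular system's layout): every change is appended to the log before it is applied to the Memtable, and after a crash the Memtable is rebuilt by replaying the log in order:

```python
import os


class WriteAheadLog:
    """Toy WAL: changes hit the log before the in-memory Memtable,
    so the Memtable can be rebuilt after a crash."""

    def __init__(self, path):
        self.path = path

    def record(self, key, value):
        # Append-only, sequential write, made durable before acknowledging.
        with open(self.path, "a") as f:
            f.write(f"{key}\t{value}\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # Recovery: re-apply the log in order to reconstruct the Memtable.
        memtable = {}
        if os.path.exists(self.path):
            with open(self.path) as f:
                for line in f:
                    k, v = line.rstrip("\n").split("\t", 1)
                    memtable[k] = v
        return memtable
```

Note that the log itself is only ever appended to, so durability costs one sequential write per change.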


Lookup

Let's look at lookups first. When we need to look up an element, we search the Memtable first. If it is not found there, we search the SSTables one by one. Since the data in each SSTable is stored in sorted order, we can use binary search, so each individual lookup is fast. However, the number of SSTables may be large, and since we have to check them in sequence, too many SSTables will slow lookups down. To address this, we can introduce a Bloom filter: we keep one Bloom filter per SSTable to quickly decide whether an element might be in that table. A Bloom filter can state with certainty that an element is absent; when it reports presence there is a small probability of error, but that error rate is controllable, and with reasonable parameters it can be made low enough.

The lookup operation with the Bloom filter added looks like this:


The stars in the image above represent the Bloom filters: before searching an SSTable, we first ask its Bloom filter whether the element could exist there.
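The lookup order described above can be sketched as follows (Python; the Bloom filter here is a deliberately simple, hypothetical implementation, and `lsm_get` is an illustrative name): check the Memtable first, then the SSTables from newest to oldest, skipping any table whose filter says the key is definitely absent:

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter backed by an integer bitset."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        # Derive several bit positions per key from salted SHA-256 digests.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False means "definitely absent"; True means "probably present".
        return all(self.bits >> p & 1 for p in self._positions(key))


def lsm_get(key, memtable, sstables):
    """Lookup order: Memtable first, then SSTables from newest to oldest,
    consulting each table's Bloom filter before actually searching it."""
    if key in memtable:
        return memtable[key]
    for bloom, table in sstables:  # newest first
        if bloom.might_contain(key) and key in table:
            return table[key]
    return None
```

A filter miss lets us skip an SSTable without touching it at all, which is what keeps lookups fast even as tables accumulate.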

Bloom filters were introduced in a previous article; if you have forgotten or are a new reader, you can click the link below to review.

Big data algorithm — Bloom filter

Insertion, deletion, and modification

Everything except lookup happens in the Memtable. For example, when we want to add an element, we add it directly to the Memtable instead of writing it to a file. This is what keeps writes so fast.

If the element you want to modify happens to be in the Memtable, there is nothing to discuss: update it in place. If it is not in the Memtable, looking it up on disk and then modifying it would require disk reads and writes and consume a lot of resources. So instead, we again operate only on the Memtable: we insert the element there and mark it as modified or deleted.

So all three operations, insert, delete, and modify, are really just writes. But this creates a problem: data accumulates in the Memtable quickly, and memory is limited, so it obviously cannot grow forever. To solve this, we periodically write the contents of the Memtable to disk as an SSTable. This is also where SSTables come from: they do not appear out of thin air but are produced when the LSM tree flushes to disk.
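The write path can be sketched like this (Python; `TOMBSTONE`, the size limit, and the class names are assumptions for illustration): every insert, modification, and deletion is just a write into the Memtable, with deletions recorded as tombstone markers, and when the buffer fills up it is flushed as a new sorted run:

```python
TOMBSTONE = object()  # marker recorded for deletions of on-disk data


class Memtable:
    """Toy in-memory write buffer; every mutation lands here, never on disk."""

    def __init__(self, limit=4):
        self.data, self.limit = {}, limit

    def put(self, key, value):
        # Insert and modify are identical: just record the latest value.
        self.data[key] = value

    def delete(self, key):
        # Deletion is also just a write: a tombstone marker.
        self.data[key] = TOMBSTONE

    def full(self):
        return len(self.data) >= self.limit

    def flush(self):
        # Dump the buffer as one sorted run (a new SSTable) and start fresh.
        run = sorted(self.data.items())
        self.data = {}
        return run
```

Each flush is a single sequential write of a sorted file, which is exactly the access pattern the LSM tree is built around.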


Similarly, the number of SSTables keeps growing as we keep flushing. As we analyzed earlier, more SSTables means slower lookups, so we cannot let the count grow without bound. On top of that, we have accumulated many modification and deletion markers that must eventually be applied. To achieve both, we periodically merge all the SSTables, applying the deletions and modifications during the merge. In other words, deletions and modifications are only recorded as markers until merge time.


The merge process itself is not difficult. It is similar to merge sort, except that we must also handle the recorded states: keep the newest version of each key and drop deleted entries.
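The merge can be sketched as merge sort with a tiebreak (Python; `TOMBSTONE` and the run layout are illustrative assumptions): sorted runs are merged in key order, the newest version of each key wins, and tombstoned keys are dropped during the merge:

```python
import heapq

TOMBSTONE = None  # stand-in for a deletion marker


def compact(sstables):
    """Merge sorted runs (given newest first) into one sorted run,
    dropping shadowed old versions and deleted keys."""
    # Tag each entry with its run's age so that for duplicate keys the
    # newest version sorts first; heapq.merge does the k-way merge.
    entries = heapq.merge(*[
        [(key, age, value) for key, value in run]
        for age, run in enumerate(sstables)
    ])
    merged, last_key = [], object()
    for key, _age, value in entries:
        if key == last_key:
            continue  # an older version of a key we already handled
        last_key = key
        if value is not TOMBSTONE:  # apply deletions during the merge
            merged.append((key, value))
    return merged
```

Since every input run is sorted and the output is written front to back, compaction too is purely sequential I/O.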

Conclusion

Let's review the LSM tree as a whole. Although it is called a tree, the tree structure is not obvious. But if we look at the lookup order, we start at the Memtable and then work down through the SSTables in sequence, which is consistent with a tree structure, so in a loose sense we can view it as a tree. If you still feel it does not look like a tree, don't worry about it. Its principle is simple, even crude, yet in practice it works very well and is widely used in systems such as HBase and Kudu.

Let's compare it with B+ trees. A B+ tree uses a multi-way balanced tree for fast reads, so that we can quickly locate the node for a key and only read that node's contents. But because of the balanced-tree structure, writes can reshape the tree, which involves multiple random reads and writes of the file; when data throughput is high, the overhead is huge. The LSM tree, on the other hand, reads less efficiently than a B+ tree but supports high-volume writes much better. Many big data scenarios, such as messaging systems and distributed storage, demand high write throughput, where B+ trees struggle. Conversely, if we need efficient lookups, the LSM tree is not a good fit either. So neither is strictly better than the other; they suit different scenarios.

Finally, there are many variants of the LSM tree. The most famous is LevelDB, written by Jeff Dean, which makes some changes on top of the basic LSM tree to further improve performance. We will discuss it in the next article.

That's all for today's article. If you found it helpful, please feel free to follow or share; your support means a lot to me.