preface

B+ trees are becoming more and more popular with the mysql Innodb engine. While researching Raft algorithm, I came across a data structure similar to B+ trees, LSM (Log-structured merge-tree). It is an interesting file organization data structure mentioned in Google’s Big Table paper, and is now used in many industrial products: HBase, Cassandra, LevelDB, RocksDB, and so on. Today we will investigate the principle of LSM trees.

Define the LSM tree

LSM trees (log-structured merge-trees) are similar to B+ trees in that they are designed to better store data on high-capacity disks. Compared with B+ trees, LSM trees have better random write performance. See this in an ACM report below:

Sequential disk write performance is somewhat counterintuitive; in the above example, sequential disk write throughput can even exceed random memory write throughput. The LSM tree takes advantage of this by increasing the throughput of random writes by several orders of magnitude by converting random writes to sequential ones. So how does it translate? Let’s take a look at the basic idea.

The basic idea of LSM trees

The LSM tree stores all data insertion, modification, and deletion operations in memory. When such operations reach a certain amount of data, they are written to disks in batches. Before data is written to the disk, the data is merged with the previous data. During the merge process, new data is inserted directly instead of being modified at the location of the original data, as is the case with B+ trees, thus avoiding random writes.

LSM tree structure

The STRUCTURE of the LSM tree spans memory and disk, and contains memtable, IMMUTABLE MEMtable, and SSTable.

memtable

As the name implies, memTable is an in-memory data structure used to store the latest update operations. When data is written to memTable, it is backed up to disks using WAL to prevent data loss due to memory power failure.

Write-ahead Logging (WAL) is a set of techniques used in relational database systems to provide atomicity and persistence (two of the ACID properties). On systems that use WAL, all changes are written to a log file before committing.

Memtable can use data structures such as skip tables or search trees to organize data to keep it organized. When the number of memtable data reaches a certain level, memtable becomes IMmutable, and a new MEMtable is created to process the new data.

immutable memtable

As the name implies, IMmutable memtables are data structures that are not modifiable in memory, and are an intermediate state for turning memtables into SstAbles. The goal is not to block write operations during the dump. Write operations can be handled by the new memtable instead of waiting for memTable to be locked.

SSTable

SSTable(Sorted String Table) is a set of ordered key-value pairs. It is the data structure of the LSM tree group on disk. If the SSTable is large, you can also create an index based on the key value to speed up the SSTable query. A simple SSTable structure is shown below:

The data in memTable is eventually converted to SSTable and stored on disk, followed by SSTable log merging operations, which are also the focus of the LSM tree structure.

The final STRUCTURE of the LSM tree can be simply represented as follows:

LSM tree add delete change check

Having introduced the basic ideas and structure of LSM, let’s look at how they work together to complete the most common CRUD flow of a data system.

Write operation

Data is first written to the disk Log through WAL to prevent data loss. Then, data is written to the MEMtable in memory. In this way, the write operation is complete. Compared with the multiple disk random I/O of B+ tree, the efficiency is greatly improved. The memTable data is then merged into the sstAbles on disk in batches, changing the random writes to sequential writes.

Delete operation

When there is a deletion operation, there is no need to find the corresponding data in the disk and then delete it like the B+ tree, but only need to insert a data in the memtable as a flag, such as delKey:1933. When the read operation reads the flag in the memtable, it will know that the key has been deleted. Later in the log merge, the deleted data will be deleted together during the merge.

The update operation

The update operation is similar to the 000 delete operation in that it only operates on memtable, writes a flag, and then the actual update operation is deferred for merge.

Query operation

The query operation is slow compared to the B+ tree. The read operation is memtable, immutable memtable, SSTable0, SSTable1…… . All collections need to be traversed in reverse order, and because of the write order and merge order, the data in a collection with a small number must be newer than the data in a collection with a large number. Therefore, once the data to be read is matched in the reverse order traversal process, it must be the latest data, as long as the data can be returned. But if a piece of data is not in the entire data set, it will be traversed for nothing.

The read operation looks clunky, but you can speed it up with a Bloom filter. When the Bloom filter shows that there is no data to read in the corresponding SSTable, the SSTable is skipped.

A Bloom filter is actually a long binary vector and a series of random mapping functions. Bloom filters can be used to retrieve whether an element is in a collection. Its advantage is that its space efficiency and query time are far more than the general algorithm, but its disadvantage is that it has certain error recognition rate and deletion difficulty.

Index files mentioned above can also speed up read operations.

merge

From the previous add, delete, change, and query operations, the merge operation is the most important operation in the LSM tree.

The merge operation has two main functions:

  1. Merges data from memory to disk.
  2. Merging memory data to disk will generate a large number of small sets, and update and delete operations will generate a large number of redundant data. Therefore, merging operations can reduce redundant data in the set and reduce the time of linear scan during read operations.

There are two types of merger strategy that are widely used, size-tiered strategy and grade strategy

Size – tiered strategy

Size-tiered policy is a merger strategy adopted by HBase, where you combine a collection of something tiered into a larger set when a certain number of collections reached a certain level. For example, if you have five sets of 50 data, merge them into one set of 250 data. One disadvantage of this strategy is that when the collection reaches a certain amount of data, the merge operation becomes very time-consuming.

Leveled strategy

The leveled strategy was a merger strategy adopted by LevelDB and RocksDB, and size-tiered strategy, because it generated a collection of large data, resulted in sudden I/O and CPU resource consumption, so the leveled strategy used a tiered data structure instead of the original big data collection.

The leveled strategy limits the size of the collection to a small range such as 5MB and divides the collection into different levels. The total size of the collection at each level is fixed and increasing. For example, the first layer is 50MB and the second layer is 500MB… . When the data set size of one layer reaches the upper limit, a file is selected from this layer and merged with the next layer, or promoted directly to the next layer. If data conflicts are found during the merge process, the data at the next level is discarded because the data at the lower level is always updated.

And the grade strategy will be limited, except for the first level. The key values for each other layer are not repeated. This is achieved by eliminating redundant data during merging to speed up linear scanning of data in the same layer.

RocksDB Compaction instance

conclusion

LSM trees sacrifice a small amount of read performance for a large amount of write performance, so they are well suited for write-heavy read scenarios, where they are more competent than B+ trees. Through in-depth understanding, it is found that the LSM tree merge strategy greatly affects the performance of the LSM tree. Therefore, you should flexibly choose the corresponding strategy according to the specific scenario.


Invite dance card king old demon’s code memo