LSM is not really a tree

The log-structured merge-tree (LSM tree) is, as the name says, a tree built by merging log-structured data. In fact, it is not a specific data structure at all; it is more of a data-structure design idea. Most NoSQL databases build their cores on LSM ideas, though the concrete implementations differ.

What is an LSM tree

Disk I/O overhead is one of the bottlenecks of database performance, and there are many ways to reduce it. The LSM tree is one solution for workloads with frequent disk reads and writes.

When data is indexed with a multi-way search tree such as a B-tree, insertions trigger rebalancing, which makes inserts relatively slow. Moreover, leaf nodes holding adjacent keys may sit in different subtrees, so values that are logically contiguous end up physically scattered on disk. This causes a large number of random reads and writes and lowers disk I/O efficiency.

LSM offers a design idea that uses both memory and disk to store data: data in memory is flushed to disk sequentially according to certain policies, so that on disk it is ordered both logically and physically.

LSM does not write data to disk immediately. Instead, it keeps data in memory for a while and relies on a write-ahead log (WAL) to prevent data loss, much like the redo log in MySQL. This is what lets LSM absorb heavy write traffic.

The fundamentals of LSM

An LSM tree consists of three parts (taking RocksDB as an example; different storage engines follow the same design but differ in implementation):

  1. MemTable: the in-memory data structure
  2. Immutable MemTable: an intermediate state, a frozen MemTable that is about to become an SSTable
  3. SSTable (Sorted String Table): the on-disk data structure of the LSM tree

New data is first organized in the MemTable in memory. When the MemTable reaches a certain size, it becomes an Immutable MemTable, and a new MemTable is created to record subsequent writes. The Immutable MemTable is then merged with the SSTables and stored on disk in sorted order.
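A minimal sketch of this flow in Python, assuming a toy engine where the MemTable is a plain dict and an SSTable is just a sorted list of key-value pairs (real engines use skip lists and on-disk files); the class and method names here are illustrative, not RocksDB's actual API:

```python
# A toy LSM write path: MemTable -> Immutable MemTable -> SSTable.
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}        # active MemTable (real engines use a skip list)
        self.immutable = None     # frozen MemTable waiting to be flushed
        self.sstables = []        # newest-first list of sorted runs ("SSTables")
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Freeze the full MemTable and start a new one for incoming writes.
            self.immutable = self.memtable
            self.memtable = {}
            self._flush()

    def _flush(self):
        # Persist the frozen MemTable as a sorted run; here "disk" is a list.
        self.sstables.insert(0, sorted(self.immutable.items()))
        self.immutable = None
```

After four put calls the MemTable is frozen and flushed as a sorted run, and a fresh MemTable takes over for new writes.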

When querying data, the system searches the MemTable first; only if the data is not found there does it go to disk. This is why lookups on an LSM tree can be expensive.

The WAL records every data operation, so if in-memory data is lost in an accident, it can be rebuilt by replaying the log. Although the WAL lives on disk, writes to it are sequential appends, so insertion efficiency is barely affected.
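A minimal sketch of such a log, assuming a hypothetical JSON-lines file named wal.log; production engines use binary records with checksums, but the append-then-fsync pattern is the same:

```python
import json
import os

class WAL:
    def __init__(self, path="wal.log"):
        self.path = path
        self.file = open(path, "a", encoding="utf-8")

    def append(self, op, key, value=None):
        # Every operation is appended to the log *before* touching the MemTable.
        record = {"op": op, "key": key, "value": value}
        self.file.write(json.dumps(record) + "\n")
        self.file.flush()
        os.fsync(self.file.fileno())   # make the record durable

    def replay(self):
        # After a crash, re-apply the log to rebuild the lost in-memory data.
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]
```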

Insert and merge

In a real system there are many MemTables in memory and many SSTables on disk. For simplicity, assume there is only one MemTable in memory, called C0, and only one SSTable on disk, called C1.

During a merge, two block structures are involved:

  1. Emptying Block: holds C1's data that has not yet been merged
  2. Filling Block: holds merged data and is appended to disk once it is full

When data is inserted, it is first organized in memory. When the time is right, the data in memory is merged with the data on disk and written to disk sequentially.

The following walks through the insert, merge, and append process step by step with figures.

Insert data 1

The insertion is first recorded in the log

And then we put data 1 into C0

Insert data 2

Insert data 3; here C0 is organized as a B-tree (other data structures would also work)

Then another batch of data is inserted

Suppose a merge is now triggered. C1's data is first loaded into the Emptying Block, but since C1 holds no data yet, the Emptying Block is empty; the data in C0 is simply placed into the Filling Block in order.

Put node 1 into the Filling Block

After node 1 is placed in the Filling Block, node 1 in C0 is removed

After node 2 is likewise moved to the Filling Block and removed from C0, the tree becomes unbalanced and needs rebalancing.

Node 3 takes node 2's place to keep the tree balanced

After node 3 is placed in the Filling Block, the tree is kept balanced with a left rotation (see any introduction to balanced binary trees for how rotations work)

When the Filling Block reaches its capacity limit, its contents are appended to C1

And then we insert 4, 9, 10

When it is time to merge again, C1's data is loaded into the Emptying Block and merged with C0.

Merge the data from the Emptying Block and the data from C0 sequentially into a Filling Block. Once the Emptying Block’s data is merged into a Filling Block, the Emptying Block’s data is removed.

When the Filling Block is full, its contents are appended to C1 and the already-merged original data is deleted

And then we merge again

As the Filling Blocks are appended to C1, a B-tree (or other index structure) is built over them

By continually inserting, merging, and appending in this way, the LSM tree design takes shape.
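The merge above can be summarized in code. This is a minimal sketch under the same toy assumptions as before (sorted lists of pairs standing in for C0 and C1, and an illustrative block_size); Python's heapq.merge plays the role of draining the Emptying Block and C0 in key order, and each full Filling Block is appended to the new C1:

```python
import heapq

def rolling_merge(c0_items, c1_items, block_size=4):
    """Merge sorted (key, value) pairs from C0 (memory) and C1 (disk)."""
    new_c1 = []           # blocks of the rebuilt on-disk SSTable
    filling = []          # the Filling Block: appended to "disk" when full
    last_key = object()   # sentinel: no key seen yet
    # Putting c0_items first means its (newer) entry wins when a key
    # exists in both C0 and C1.
    for key, value in heapq.merge(c0_items, c1_items, key=lambda kv: kv[0]):
        if key == last_key:
            continue              # older duplicate from C1, drop it
        last_key = key
        filling.append((key, value))
        if len(filling) == block_size:
            new_c1.append(filling)   # Filling Block full: append it to C1
            filling = []
    if filling:
        new_c1.append(filling)       # flush the last, partially filled block
    return new_c1
```

For example, rolling_merge([(1, "a"), (4, "d")], [(2, "b"), (3, "c")], block_size=2) produces [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")]].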

Finding data

Lookups search memory first; if the data is not in memory, the search continues on disk.

Looking up 9 hits in memory

Looking up 7 misses in memory, so the search continues on disk
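As a sketch, the read path under the same toy assumptions (a dict for the MemTable, newest-first sorted runs for the SSTables) looks like this:

```python
import bisect

def get(memtable, sstables, key):
    if key in memtable:                      # e.g. looking up 9 hits in memory
        return memtable[key]
    for run in sstables:                     # e.g. looking up 7 falls through to disk
        keys = [k for k, _ in run]
        i = bisect.bisect_left(keys, key)    # binary search within one sorted run
        if i < len(keys) and keys[i] == key:
            return run[i][1]
    return None                              # not found anywhere
```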

Deleting data

When data is deleted, the LSM tree marks the node as “deleted” instead of removing it immediately; the actual deletion happens during the next merge.

If the data is in memory, it is simply marked as “deleted”.

If the data is on disk, finding and deleting it in place is too expensive, so a record with the same key is inserted into memory and marked as “deleted”.

When the next merge encounters these “deleted” nodes, it drops them for good.
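A minimal sketch of deletion with a tombstone marker (the TOMBSTONE value and function names are illustrative; real engines use a dedicated record type):

```python
TOMBSTONE = "__deleted__"   # illustrative marker value

def delete(memtable, key):
    # Never touch the data on disk: just record a tombstone in memory.
    memtable[key] = TOMBSTONE

def drop_tombstones(merged_items):
    # During the next merge, keys whose newest record is a tombstone are
    # simply not written to the new SSTable, which deletes them for real.
    return [(k, v) for k, v in merged_items if v != TOMBSTONE]
```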

Conclusion

LSM improves write performance by exploiting the high-speed I/O of memory. Through its merge strategy, data in memory and on disk is combined, sorted, and persisted, while the WAL protects in-memory data from being lost in a crash or power failure. This combination of memory and disk gives LSM solid read and write performance.

LSM queries memory first and then disk, and there is not just one tree across memory and disk but many tree structures to check. As a result, LSM’s query performance is far below that of traditional relational storage engines based on B+ trees.

The LSM tree is far ahead of the B+ tree for writes but far behind it for reads, so LSM suits workloads that write much more than they read.

Representative databases include NessDB, LevelDB, and HBase.

The core idea is to give up some read performance in exchange for maximum write performance: random disk reads are sacrificed in favor of sequential writes. In the extreme case, the write performance of the LSM-based HBase is an order of magnitude higher than MySQL’s, while its read performance is an order of magnitude lower.