Introduction to different implementations of LSM trees

This needs to be mentioned together with BTree. We all know that BTree is the Balance Tree. In order to maintain the characteristics of Balancing, the Tree needs to be adjusted each time it is read, which is a considerable time loss. In some usage scenarios (such as Google crawling web pages), high-performance writes are required, while reading requirements are not so high. Therefore, LSM tree is developed based on this background, and many optimization measures are proposed for LSM in similar scenarios.

Application scenarios

Some common examples are HBase, LevelDB, and RocksDB

LevelDB and RocksDB are both open source products from Google and FaceBook, respectively. For those who are good at English, you can read the official documents directly to have a deeper understanding of the industrial implementation of LSM.

Why look at different DB implementations? It is important to understand the LSM tree implementations of different companies because different implementations of the same data structure can greatly improve their efficiency through means such as bloom filters.

LSM tree based

LSM Tree, write all Log-Structured Merge Tree. Essentially, the sequential writing of logs greatly improves write performance, while NoSQL’s structural properties make it a natural fit for shards.

Basic operation

1. Write data to the memtable memory (copy the data to the hard disk to prevent data loss during system downtime) 2. When data reaches a certain capacity, data is flushed to the hard disk SStable, and data on hard disks is merged

The complications :Compaction

The number of SstAbles in hard disks increases, causing great read pressure. This is where the unique value of a data structure comes into play. Creating a different data structure using a different Compaction can significantly reduce read performance. Space for time is the real difference between HBase, LevelDB, and RocksDB.

LSM implementation in HBase

Take a look at the HBase implementation, as shown in Figure 1.

Bloom filter

Bloom filters are similar to dynamic caches, but are classic examples of data structures that use space to drastically reduce time usage.

Bloom uses a large bittable and multiple hash functions. For example, in the figure, {x,y,z} is mapped into the table with three hash functions. Then there are n positions marked 1 in the bittable, (3<=n<=9).

If w is in {x,y,x}, then w is definitely not in {x,y,x}.
If all three hash functions of w are mapped to 1, it does not mean that w is in {x,y,z}. It could be something like the 4, 5, and 6 positions above, which are mapped from different values.

A Compaction of HBase

As mentioned above, Compaction is the heart of an LSM tree. HBase has two types:

Minor Compaction
Major Compaction

Both are merged sstables (called HFile in HBase). A Major Compaction deletes expired keys during a merge. It lasts a long time and is triggered manually when operations run low. (You can use hbase.offpeak.start.hour and hbase.offpeak.end. Hour in off-peak hours.)

In addition, different versions of HBase provide different Compaction policies for users to decide:

RatioBasedCompactionPolicy
ExploringCompactionPolicy

Two kinds of concrete introduction to see: http://hbasefly.com/2016/07/13/hbase-compaction-1/

Since I have no actual experience with HBase, I dare not comment on it. You can check out the professional introduction through the link above. However, for a service provider like me, HBase does not provide an index but uses Scanner to query data. This data structure is not “perfect”.

LevelDB and RocksDB

RocksDb is an updated version of LevelDB developed by the Facebook team based on Google and provides many features that LevelDB does not have. (By extension, TiDB is a newSQL database whose underlying is based on RocksDb. There are many companies on the Internet that have put TiDB into production environment, and the data team of my company has also put part of gray data into TiDB. RocksDb is also reliable. LevelDB is poorly documented, so take a look at RocksDB’s implementation.

LevelDB document address: https://github.com/google/leveldb/blob/master/doc/impl.md RocksDB document address: https://github.com/facebook/rocksdb/wiki

Compactions

RocksDB also offers a variety of strategies:

Leveled style compaction
Universal style compaction
FIFO Compaction

This should be a Grade Style compaction that originates from LevelDB.

Leveled style compaction

As shown in the figure above, it can be seen that Grade Style divides SStable into different levels. Except for duplicate keys that may exist in Level 1, there will be no duplicate keys after Level 2. At the same time, every SStable, the key is ordered, do not underestimate the role of order, see lookup you will know its magic.

If the number of operations exceeds the capacity of a level, an SStable is merged with level+1 data. The selected Level +1 SStable has the same key as the current SStable. If the capacity of level+1 exceeds the threshold due to the generated SstAbles, merge further.

Level0 and Level1 merge all merges can be merged with multiple threads, so there is no concurrency problem.

To find the

As mentioned above, orderliness can greatly improve the efficiency of key lookups in two steps:

Binary search all files start and end, find the file that may have a key
Binary search for all contained keys in file

All lookups are done by binary lookups.

conclusion

This paper starts with the introduction of LSM and gives an overview of some engineering implementations of LSM. Because have not yet in-depth understanding of these kinds of DB source, only from the document to find some valuable content for you to see, I hope to be able to throw bricks to welcome jade, let you re experience the wisdom of the data structure. If the content is wrong or want to communicate, please feel free to contact me ~