First, the reason for restricting the throughput of storage media greatly

First, throw out the conclusion. Sequential access to any storage medium (whether mechanical hard disks, SSDS, or memory) is much faster than random access.

Second, the implementation mechanism of traditional database

Traditional databases, such as Mysql, use B + tree indexes, which are read friendly. But easy to cause random write. For example, to insert a new value into the database, first we need to read the B + tree, determine where the new value is placed in the tree, and then write the new value in a specific position, and make a series of adjustments and splits to make it meet the characteristics of the B + tree. This inevitably results in random disk access and slow insertion of large amounts of data. Of course, this is also in line with the historical development trend, the early IT industry, data volume and data growth speed is limited, as long as there is good query performance, can meet the demand.

However, with the improvement of hardware performance and the change of business form, the current Internet system is often faced with a large amount of data, high-speed generation. How to speed up the storage speed, became the key. Hence the LSM Tree.

3. History of LSM Tree

The earliest concept of LSM Tree was born in Google’s “BigTable” paper in 1996. There are some small differences in LSM Tree implementation of various database products. Databases that use LSM Tree as storage structure include LevelDB of Google, RockDB of Facebook (RockDB comes from LevelDB), Cassandra,HBase, etc.

Fourth, the idea of improving write throughput

Since sequential writing is faster than random writing. You have to figure out how to write the data sequentially.

4.1 One way is to drop the data directly after it comes in

This has very high write speed. However, when we want to search for a data, because the stored data itself is disordered (the order of written values cannot be controlled), we cannot use any algorithm to optimize, and can only search one by one, so the reading speed is very slow.

4.2 Another way is to ensure that the data of the falling disk is written in sequence and that the data is in order

The requested data itself is unordered and unpredictable, so how to ensure that the data falling disk is in order? This takes advantage of the fact that memory access is faster than hard disk access. Write requests are cached in the memory and organized in an orderly manner. When a certain amount of data is written to the disk, sequential data is written to the disk. At the same time, it is convenient for the subsequent search based on ordered data (ordered data structure can be quickly queried by binary search algorithm, depending on the ordered structure).

5. LSM Tree structure diagram

The LSM tree takes advantage of the second approach. The specific structure diagram is as follows:

5.1 Why Do I Write a Log first

To prevent written data from being lost during a power outage. Therefore, write a log to the hard disk in order to facilitate data recovery.

5.2 What is MemTable

Memory cache that writes data. MemTable stores ordered data. What is an ordered data structure? Different implementations may differ. LevelDB uses SkipList. Hbase uses B Tree.

5.3 What is ImmutableMemTable

The data in MemTable is increasing at any time, when it increases to a certain amount, it becomes immutable data, ImmutableMemTable. A MemTable is generated for subsequent data writing. Data in ImmutableMemTable will be written to the SSTable on the hard disk.

5.4 What is SSTable

SSTable full name Sorted String Table. It’s actually an ordered storage file that gets written to, so it’s called sorted.

SSTable files have a DataBlock, IndexBlock BitSet (different implementations, could not)

  • DataBlock An SSTable contains multiple blocks of data, organized in the form of keyvalues.
  • IndexBlock records the Offset of the largest Key in each data block
  • Bitsets use Bloom filters to map a Key to bitsets

Ordered organization of data, IndexBlock, BitSet. All of these data structures are designed to speed up data reading. So how does the data get read?

5.5 How Do I Read Data

The general process of reading is as follows Because sstables are created sequentially, the latest sstables contain the latest values. When searching for sstAbles, search for the latest Sstables in sequence.

The query flow of each SSTable is as follows The principle of Bloom expression is to store the possibility of large amount of data with very small data capacity. So it is only a theoretical possibility to query the existence of the Key through BitSet’s Bloom expression, and then to actually query it through IndexBlock. If the Bloem expression is not found in the BitSet, it is not, and can be quickly skipped to the next SSTable lookup. The use of Bloom expression can greatly improve the search efficiency.

5.6 How Do I Delete and Update Data

To ensure that data is written sequentially, none of the SStAbles will change in the original data location due to deletion or update. At update time, only the latest value is inserted to write to the new SSTable. When deleting, a delete flag based on the Key is inserted and written to the latest SSTable. Because looking for a Key is based on time freshness and looking for sstables in reverse order, reading a Key always reads the latest value.

5.7 Merging sstAbles

Over time, the number of files in the SSTable increases, resulting in poor search performance. At the same time due to data update or deletion. This reduces the validity of data in the old SSTable. If too much outdated data occupies the SSTable, the query efficiency will also be reduced. Therefore, a database engine will have an SSTable merge operation on a regular basis. Remove obsolete data and merge several small SstAbles into one large SSTables.

5.8 The Last Read of the SSTable IndexBlock Cache

In the case of large memory, some databases also cache the most recently read SSTable indexes into memory. This further speeds up the search process.

Vi. References

www.benstopford.com/2015/02/14/… www.cnblogs.com/haippy/arch… Blog.csdn.net/u014774781/… En.wikipedia.org/wiki/Log-st… www.igvita.com/2012/02/06/…

Welcome to follow my personal account “North by Northwest UP”, recording code life, industry thinking, technology comments