Original content is not easy to produce; please credit the source when reposting.

Preface

At present I am building my company's internal messaging platform on top of Pulsar, and along the way I have done some research on its underlying storage. Pulsar uses BookKeeper as its storage layer, and BookKeeper in turn uses RocksDB to hold the location index for each entry (the unit of data storage in BookKeeper). RocksDB is a storage engine I have been following for some time: in my earlier research on persistent KV storage, the mainstream open-source options Pika and Kvrocks, as well as the cloud vendor's persistent KV service I eventually chose, are all built on RocksDB. The well-known TiDB also uses RocksDB as its underlying storage engine.

Out of curiosity, I began to study RocksDB. Since RocksDB is usually used for low-level development, it is hard to get hands-on experience with it in everyday work unless you are building data storage middleware, so I decided to focus on the data structure design at the heart of RocksDB: the LSM tree.

This article first introduces RocksDB's implementation of the LSM tree, then summarizes the design ideas behind the LSM tree, compares them with the storage design of Elasticsearch's Lucene, and finally compares the LSM tree with the familiar B+ tree.

LSM tree introduction

LSM tree stands for Log-Structured Merge Tree. At first glance you might assume it is literally a tree, but it is not: the LSM tree is really a comprehensive algorithmic design. The design was popularized by Google's Bigtable paper (which introduced the terms SSTable and memtable).

Storage engines based on the LSM tree algorithm are called LSM storage engines. LevelDB, RocksDB, Cassandra, and HBase all implement their storage engines on top of the LSM tree.

Below, we walk through RocksDB's LSM tree implementation to understand the design ideas behind the LSM tree in detail. If you only want a summary of those design ideas, you can skip straight to the final summary section; I think the summary turned out well.

RocksDB LSM tree implementation

1. Core composition

First, take a look at RocksDB's three core structures: the memtable, the WAL, and the SSTable.

The following figure shows the core components of the RocksDB LSM tree and its key process steps (read, write, flush, and compaction).

1.1 memtable (active & immutable)

The memtable is an in-memory data structure in RocksDB that serves both reads and writes. Writes always go to the active memtable, and queries always check the memtable first, because it holds the most recent data. The default memtable implementation is a skip list, which is well suited to both range queries and inserts.

Memtable life cycle

When the active memtable is full, it becomes read-only and turns into an immutable memtable; a new active memtable is then created to serve writes.

Immutable memtables are kept in memory, waiting for a background thread to flush them. Once the number of immutable memtables reaches min_write_buffer_number_to_merge, a flush merges and compresses them and writes the result to an L0 SST on disk. After the flush, the corresponding memtables are destroyed.

Related parameters (a configuration sketch follows the list):

  • write_buffer_size: the capacity of a single memtable

  • max_write_buffer_number: the maximum number of memtables held in memory

  • min_write_buffer_number_to_merge: the minimum number of immutable memtables to merge before flushing to an SST (if set to 1, no merging is performed during flush)
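To make these parameters concrete, here is a minimal C++ sketch of how they might be set when opening a database; the path and the values are placeholders chosen for illustration, not recommendations:

```cpp
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // Illustrative values only.
  options.write_buffer_size = 64 << 20;          // each memtable holds up to 64 MB
  options.max_write_buffer_number = 4;           // at most 1 active + 3 immutable memtables
  options.min_write_buffer_number_to_merge = 2;  // merge 2 immutable memtables per flush

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_memtable_demo", &db);
  assert(s.ok());

  delete db;
  return 0;
}
```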

1.2 WAL (write-ahead log)

The WAL should look familiar: it is a persistent log file that is written sequentially (append-only). Many storage systems have a similar design (e.g. MySQL's redo/undo logs, ZooKeeper's WAL).

RocksDB writes data to the WAL first and then to the memtable. If a failure occurs, RocksDB replays the WAL to rebuild the in-memory data, ensuring consistency.

What is the benefit of this design? It allows the LSM tree to treat volatile memory as if it were persistent storage and to trust the data held there.

Regarding WAL creation and deletion: a new WAL is created each time a CF (column family, described below) is flushed. This does not mean the old WAL is deleted right away, because other CFs may still have unflushed data covered by it. A WAL is deleted only after every CF has flushed all of the data recorded in it.
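Whether a write goes through the WAL is controlled per write. Below is a small C++ sketch (assuming an already opened rocksdb::DB*, as in the earlier example) showing the default WAL-backed write and the optional disableWAL flag for data you can afford to lose:

```cpp
#include <cassert>

#include "rocksdb/db.h"

// `db` is assumed to be an open rocksdb::DB*.
void wal_examples(rocksdb::DB* db) {
  // Default: the write is appended to the WAL first, then applied to the memtable.
  rocksdb::WriteOptions with_wal;
  assert(db->Put(with_wal, "k1", "v1").ok());

  // Optional: skip the WAL entirely. Faster, but the write is lost if the process
  // crashes before the memtable is flushed.
  rocksdb::WriteOptions no_wal;
  no_wal.disableWAL = true;
  assert(db->Put(no_wal, "k2", "v2").ok());
}
```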

1.3 SSTable (sorted string table)

An SSTable (Sorted String Table) lives on disk. It is a persistent, ordered, immutable map structure whose keys and values are arbitrary byte strings. When a memtable in memory meets the flush conditions, it is flushed into an SSTable.

The file structure of SSTable is as follows:

The file is divided into regions, and each region has its own role:

  • Data Block: stores ordered key-value pairs and is the data body of the SSTable. To save space, each key-value pair does not store the full key; it stores only the part not shared with the previous key, avoiding duplicated key content (this delta-encoding trick is also common in the lower layers of other storage middleware).
  • Meta Block: stores filter information used to speed up queries within the SST. The filter is a Bloom filter, used to check whether the data being queried might be present.
  • Meta Index Block: the index of the Meta Block. It holds a single record whose key is the filter's name and whose value is the offset of the Meta Block.
  • Index Block: stores index information for all data blocks, with one record per data block describing that block's index information.
  • Footer: a fixed-length trailer that points to the location and size of the other regions. When an SSTable file is opened, the fixed number of bytes at the end of the file is read to obtain the Footer; its contents give the locations of the Meta Index Block and Index Block, which in turn locate the Meta Block and the Data Blocks.

As you can see, an SSTable file stores not only the actual data but also index structures and filters that speed up queries within the SST.
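The SST file layout is configured through the block-based table options. Here is a hedged C++ sketch of the knobs mentioned above (data block size and Bloom filter), with illustrative values:

```cpp
#include "rocksdb/filter_policy.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakeSstOptions() {
  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_size = 16 * 1024;  // target size of each data block (example value)

  // Bloom filter stored in the filter (meta) block, ~10 bits per key.
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10, false));

  rocksdb::Options options;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```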

2. A few other concepts

Column Family (CF)

RocksDB 3.0 added the Column Family feature, and every KV write must specify its CF. For compatibility with earlier versions, RocksDB creates a CF named default; if no CF is specified when a KV pair is stored, RocksDB stores it in the default CF.

RocksDB allows users to create multiple column families, which have separate memtables and SST files but share the same WAL file. The advantage is that different column families can be given different configurations according to the application's characteristics, without increasing the number of WAL writes.

Compared with a relational database, a column family can be thought of as a table.
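A short C++ sketch of working with column families; the CF name "orders" and the path are made-up examples:

```cpp
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

void column_family_example() {
  rocksdb::Options options;
  options.create_if_missing = true;

  rocksdb::DB* db = nullptr;
  assert(rocksdb::DB::Open(options, "/tmp/rocksdb_cf_demo", &db).ok());

  // Create a CF with its own options; it gets its own memtables and SST files
  // but shares the WAL with the default CF.
  rocksdb::ColumnFamilyHandle* cf = nullptr;
  assert(db->CreateColumnFamily(rocksdb::ColumnFamilyOptions(), "orders", &cf).ok());

  // Writes without a handle go to the default CF; with a handle, to that CF.
  assert(db->Put(rocksdb::WriteOptions(), "k", "v").ok());
  assert(db->Put(rocksdb::WriteOptions(), cf, "k", "v").ok());

  // Note: reopening a DB that has extra CFs requires passing all of their
  // descriptors to DB::Open.
  db->DestroyColumnFamilyHandle(cf);
  delete db;
}
```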

3. Reading & writing

3.1 read operation

  1. Look in the active memtable;

  2. If the key is not in the active memtable, look in the immutable memtables;

  3. If it is not in the immutable memtables, look in the L0 SSTables (RocksDB searches L0 files by traversing them, and limits the number of L0 files to keep lookups efficient);

  4. If it is still not found, search the remaining SSTables. (For L1 and above, the SSTables within a level do not overlap, so binary search can quickly locate the level and the SSTable that may contain the key.)

Before searching each SSTable, RocksDB first consults its Bloom filter to quickly determine whether the data might exist in that file, avoiding unnecessary I/O.

For frequently accessed data blocks in SSTs, RocksDB maintains a read cache called the block cache and provides two out-of-the-box implementations: LRUCache and ClockCache.
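A hedged C++ sketch of attaching an LRU block cache (the 256 MB size is just an example):

```cpp
#include "rocksdb/cache.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakeReadCachedOptions() {
  rocksdb::BlockBasedTableOptions table_options;
  // 256 MB LRU cache for frequently accessed (uncompressed) data blocks.
  table_options.block_cache = rocksdb::NewLRUCache(256 << 20);

  rocksdb::Options options;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```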

3.2 write operation

  1. The WAL is written first to prevent data loss.
  2. After the WAL write, the data is written to the active memtable in memory (RocksDB implements the memtable with a skip list to keep keys ordered).
  3. When a memtable reaches a certain size, it becomes an immutable memtable, and a new active memtable is created to serve writes.
  4. When the flush conditions are met, immutable memtables are merged and written into SSTs on the hard disk.

By the way, disk writes in RocksDB are asynchronous by default: a write returns once the data reaches the operating system's page cache, and the flush to disk happens asynchronously. Asynchronous writes are roughly a thousand times faster than synchronous ones. Their drawback is that a machine or operating system crash may lose the most recent batch of writes still sitting in the OS cache, although a crash of RocksDB itself does not cause data loss. Since machine and OS crashes are rare, asynchronous writes can be considered safe in most cases.
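The choice is made per write via WriteOptions. A minimal C++ sketch contrasting the default asynchronous write with a synchronous one:

```cpp
#include <cassert>

#include "rocksdb/db.h"

// `db` is assumed to be an open rocksdb::DB*.
void durability_examples(rocksdb::DB* db) {
  // Default (asynchronous): returns once the data reaches the OS page cache.
  // An OS or machine crash can lose the most recent writes; a RocksDB crash cannot.
  rocksdb::WriteOptions async_write;  // sync == false by default
  assert(db->Put(async_write, "k1", "v1").ok());

  // Synchronous: the WAL is fsync'd before the call returns. Much slower,
  // but survives an OS or machine crash.
  rocksdb::WriteOptions sync_write;
  sync_write.sync = true;
  assert(db->Put(sync_write, "k2", "v2").ok());
}
```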

4. Compaction

The LSM tree converts scattered random write requests into batched sequential writes (WAL + memtable) to improve write performance, but this also brings some problems:

  • Read amplification: as described above, a single read may need to access many files.
  • Space amplification: because all writes are append-only rather than in-place updates, stale data is not cleaned up immediately.

So the number of SST files must be maintained and reduced. RocksDB performs compaction according to the configured compaction policy. Compaction removes keys that are expired, marked as deleted, or duplicated, and rewrites the remaining data to improve query efficiency.

4.1 Level Style Compaction (default compaction style)

By default, RocksDB uses the Level Style Compaction policy to compact the LSM tree.

With Level Style Compaction, L0 holds the newest data in RocksDB and Lmax the oldest. Keys may be duplicated across the SSTables in L0, but there are no duplicate keys within any other level. Each compaction task merges a file from level Ln with the overlapping files from the adjacent level Ln+1, removes keys that are expired, marked as deleted, or duplicated, and places the merged files in Ln+1.

Compaction reduces read amplification (fewer SST files to check) and space amplification (invalid data gets cleaned up), but it introduces write amplification: the same data is rewritten to disk several times, so the underlying I/O is larger than the amount of data the user actually wrote.
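For reference, a hedged C++ sketch of the main level-compaction knobs; the values are illustrative, not recommendations:

```cpp
#include "rocksdb/options.h"

rocksdb::Options MakeLevelCompactionOptions() {
  rocksdb::Options options;
  options.compaction_style = rocksdb::kCompactionStyleLevel;  // the default style

  options.level0_file_num_compaction_trigger = 4;   // start compacting L0 at 4 files
  options.max_bytes_for_level_base = 256ull << 20;  // target size of L1 (256 MB)
  options.max_bytes_for_level_multiplier = 10;      // each level ~10x the previous one
  options.num_levels = 7;                           // L0 .. L6
  return options;
}
```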

RocksDB also supports other Compaction strategies.

4.2 Universal Compaction

  • Compaction operates only on the files in L0: they are merged and the result is placed back into L0.
  • It aims for lower write amplification, trading this off against read amplification and space amplification.

The specific algorithm is not covered in this article.

4.3 FIFO Compaction

  • FIFO, as the name implies, is a first-in-first-out policy that periodically deletes old data. In FIFO mode all files live in L0, and when the total size of the SST files exceeds compaction_options_fifo.max_table_files_size, the oldest SST file is deleted. The strategy is very simple: every SST sits in level 0, and once the threshold is exceeded the oldest SST is dropped (see the sketch after this list).
  • This mechanism is well suited to storing sequentially generated (time-series-like) data.
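A hedged C++ sketch of enabling FIFO compaction with an example size limit:

```cpp
#include "rocksdb/options.h"

rocksdb::Options MakeFifoCompactionOptions() {
  rocksdb::Options options;
  options.compaction_style = rocksdb::kCompactionStyleFIFO;

  // Keep all SSTs in L0 and drop the oldest files once the total size
  // exceeds this limit (example: 10 GB).
  options.compaction_options_fifo.max_table_files_size = 10ull << 30;
  return options;
}
```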

⭐ Summary of LSM tree design ideas

The design ideas behind the LSM tree are interesting, so let me summarize them here.

The LSM tree dramatically improves write performance by converting random disk writes into disk-friendly sequential writes (random reads and writes are much slower than sequential ones on both mechanical disks and SSDs).

How is that conversion achieved? The core is an ordered memtable maintained in memory. When the memtable exceeds a threshold, it is flushed to disk in one batch, producing a new SSTable file. Because the memtable already keeps its key-value pairs sorted by key, this step can be done efficiently.

Before data is written to the memtable, it is first recorded in the WAL. When a failure occurs, the WAL can be replayed to rebuild the in-memory data and keep the database consistent.

In this append-only write model, deleting data becomes appending a delete marker, updating data becomes writing a new value, and old and new values of the same key coexist in the database. This effect is called space amplification.

As data is written, the number of underlying SSTable files increases.

In this model, a read request first looks for the key in memory and, if it is not found there, searches the SSTable files on disk from newest to oldest. To optimize reads under this access pattern, storage engines usually apply common read optimizations such as Bloom filters and read caches.

This process of needing multiple reads for a single lookup is known as read amplification, and it clearly hurts the read performance of LSM trees.

To improve read performance (read amplification) and storage usage (space amplification), the LSM tree runs a merge-and-compress process that reduces the number of SSTables and removes values that are no longer valid (deleted or overwritten). This process is known as compaction.

However, compaction means the same data is written to disk multiple times over the lifetime of the database. This effect is known as write amplification. In a write-heavy application, the performance bottleneck may be the rate at which the database can write to disk; in that case write amplification has a direct cost: the more often the storage engine writes to disk, the fewer user writes per second the available disk bandwidth can sustain.

This is one of the drawbacks I see with LSM storage engines: the compaction process can interfere with in-flight read and write requests. Although storage engines try to perform compaction incrementally without affecting concurrent access, disk resources are limited, so a request can easily end up waiting for the disk to finish an expensive compaction. The impact on throughput and average response time is usually small, but at high percentiles (such as P99 latency) queries can occasionally take much longer.

This is my personal summary of the LSM tree. It deliberately stays at the level of abstract concepts (memtable, compaction, and so on) without implementation details; real storage engines differ in the details, but the design philosophy is similar.

In the RocksDB implementation discussed in this article, write amplification, read amplification, and space amplification cannot all be optimized at the same time, much like the CAP theorem. RocksDB therefore exposes many parameters so that users can tune it for a wider range of scenarios, and a large part of that work is striking a good trade-off among the three amplification factors.

Elasticsearch's Lucene: an LSM-like design

The segment design of Lucene, the search library underlying Elasticsearch, is very similar to the LSM tree: it likewise uses a WAL (the translog), an in-memory buffer, and segment merging.

When a document is indexed, it is added to the in-memory buffer and appended to the translog.

When the current shard is refreshed, the documents in the memory buffer are written into a new segment, the segment is opened so that it becomes searchable, and the memory buffer is cleared.

As data keeps being written, a commit operation is eventually triggered: a full commit is performed and the corresponding translog is deleted. (The detailed process is omitted here for space.)

Data that is still only in memory is not searchable until a refresh or commit creates a segment, which is why Lucene is said to provide near-real-time rather than real-time search.

Documents written to a segment cannot be modified, but they can be deleted: instead of changing the segment file, a separate file records the doc IDs of the deleted documents, so the data files stay immutable. A query against an index therefore has to search multiple segments, merge the results, and filter out deleted documents. To optimize queries, Lucene merges segments according to a policy, much like the LSM tree compacts SSTables.

This mechanism avoids random writes: data is written in batches and appended, which yields high throughput. The file system cache and memory are used to speed up writes, and the log protects against data loss.

⭐ LSM tree vs B+ tree

With all that covered, let's briefly compare the LSM tree with the familiar B+ tree.

Different design concepts

Like the LSM tree, the B+ tree maintains key-value pairs sorted by key (which makes point lookups and range queries efficient), but the design philosophy is completely different.

  • The LSM tree breaks the database down into variable-size segments, typically several megabytes or more, and always writes each segment sequentially.
  • The B+ tree, in contrast, breaks the database down into fixed-size blocks or pages, traditionally 4 KB (sometimes larger), and reads or writes one page at a time. This design maps closely to the underlying hardware, since disks are also arranged in fixed-size blocks.

Updating and deleting data

  • A B(+) tree can update and delete in place, which is friendlier for database transaction support, because a key appears in only one page;

  • The LSM tree, however, can only update out of place, and keys can overlap across the L0 SSTables, so its transaction support is weaker; old values are only really removed or overwritten when a compaction occurs.

Performance

  • The strength of the LSM tree is high write throughput (writes can be regarded as O(1)), which matters a great deal in distributed systems. For reads, a plain LSM structure is O(n); with indexes or caches, reads can also reach O(log N).
  • The strength of the B+ tree is efficient reads (a stable O(log N)). Under heavy write loads (each write is also O(log N)) it becomes less efficient, because nodes are constantly split and merged to maintain the tree structure, and the disk access pattern degrades toward random reads and writes.

In general, we say that LSM trees write faster than B trees, while B trees read faster than LSM trees. But do not ignore the impact of LSM tree write amplification; weigh the trade-offs when judging performance.

References

  1. Designing Data-Intensive Applications, Chapter 3: Storage and Retrieval (a highly recommended read!)
  2. github.com/facebook/ro…
  3. www.lxkaka.wang/rocksdb-lsm…
  4. alexstocks.github.io/html/rocksd…
  5. cloud.tencent.com/developer/a…
  6. www.elastic.co/guide/cn/el…