Here we briefly explore the architecture and implementation of LevelDB through the source code of its Go version (goleveldb). Discussion in the comments section is welcome

The overall architecture

log

LevelDB writes are not written directly to disk; they go to memory first. If data written to memory has not yet been persisted and the LevelDB process exits abnormally, or the host machine goes down, the user's written data would be lost. So before writing to memory, LevelDB appends every write operation to a log file (also called the journal); after an abnormal exit, the log can be replayed to recover. One log corresponds to the data of one memdb or frozen memdb; after a frozen memdb is persisted as an sstable, its log is deleted
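As a usage-level illustration, here is a minimal sketch using goleveldb's public API (the path is arbitrary): opening a database replays any leftover journal, and the Sync write option controls whether the journal is fsynced before a write returns

```go
package main

import (
	"log"

	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/opt"
)

func main() {
	// OpenFile replays the journal if the previous shutdown was unclean,
	// restoring writes that never made it into an sstable.
	db, err := leveldb.OpenFile("/tmp/journal-demo", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Sync: true fsyncs the journal before Put returns; without it, an
	// OS/host crash can still lose this write even though it was journaled.
	if err := db.Put([]byte("k"), []byte("v"), &opt.WriteOptions{Sync: true}); err != nil {
		log.Fatal(err)
	}
}
```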

version

This concept is not shown in the figure above, but is an important one

First look at the version structure in the code

```go
type version struct {
	id int64 // unique monotonous increasing version id

	s *session

	levels []tFiles

	// Level that should be compacted next and its compaction score.
	// Score < 1 means compaction is not strictly needed. These fields
	// are initialized by computeCompaction()
	cLevel int
	cScore float64

	cSeek unsafe.Pointer

	closing  bool
	ref      int
	released bool
}

// tFiles hold multiple tFile.
type tFiles []*tFile

// tFile holds basic information about a table.
type tFile struct {
	fd         storage.FileDesc
	seekLeft   int32
	size       int64
	imin, imax internalKey
}
```

The version structure contains a two-dimensional array, levels: the index is the level number, and the value holds all the files at that level

This is how LevelDB got its name

manifest

Each compaction changes the set of sstable files, and each change produces a new version. However, version is only an in-memory structure; if the program restarts, the version information is gone. So the main purpose of the manifest is to persist version information. There are two ways to record it, incremental records and full records, and the manifest uses a mixture of both

Before moving on to the contents of manifest, let’s introduce a data structure

```go
type sessionRecord struct {
	hasRec         int
	comparer       string
	journalNum     int64
	prevJournalNum int64
	nextFileNum    int64
	seqNum         uint64
	compPtrs       []cpRecord
	// Tables added and deleted by a compaction.
	addedTables   []atRecord
	deletedTables []dtRecord

	scratch [binary.MaxVarintLen64]byte
	err     error
}

type atRecord struct {
	level int
	num   int64
	size  int64
	imin  internalKey
	imax  internalKey
}

type dtRecord struct {
	level int
	num   int64
}
```

A sessionRecord records the tables added and deleted during a compaction, along with the level of each table. A version can also be mapped to a sessionRecord. On startup, the application reads the old manifest file and generates a version; that version is mapped to a sessionRecord and written into the manifest file as a full record. Afterwards, each compaction appends an incremental sessionRecord to the manifest
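Putting version and sessionRecord together, here is a sketch of what the manifest holds and how it is replayed on startup (illustrative layout, not the exact on-disk encoding):

```go
// manifest file, logically:
//
//   record 0: full record  - the current version, serialized as a sessionRecord
//   record 1: incremental  - sessionRecord of compaction 1 (addedTables/deletedTables)
//   record 2: incremental  - sessionRecord of compaction 2
//   ...
//
// Replay on startup: start from the full record, then apply each
// incremental record in order (insert addedTables into their levels,
// remove deletedTables) to rebuild the latest version in memory.
```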

current

Points to the manifest file that should currently be used

memdb

After the journal is written, user writes go into the memory table (memdb). When memdb reaches a threshold (which can be set by the user), it can no longer be written to and is converted into a frozen memdb, and subsequent writes go to a newly created memdb. Under the hood, memdb uses a skip list as its data structure

frozen memdb

When a memdb reaches the set threshold, it is converted into a frozen memdb. This is a logical concept; the structure of a frozen memdb is exactly the same as that of a memdb. When a compaction runs, the frozen memdb is converted into file storage

sstable

As data keeps being persisted, the files are logically split into several levels. A memory compaction lands frozen memdb data in level-0 files, a table compaction merges level-0 files into level 1, and so on. Note that the keys of level-0 files may overlap with each other, while the keys of files above level 0 do not overlap

Read and write operations

write

The write operation only touches the journal (log) and memdb

  1. At the beginning of a write, check whether write merge is enabled. Write merge folds the requests of multiple goroutines into a single goroutine's write to improve write efficiency
  2. If write merge is enabled and another goroutine holds the write lock, hand the data over to that goroutine and wait for it to return the write result
  3. If write merge is enabled and this goroutine holds the write lock, likewise check whether other goroutines have writes waiting for a ride
  4. Put the data into a batch structure (so whether a single kv or a batch is written, the write is ultimately performed as a batch), then check the remaining space of the current memdb: if there is not enough room for the data, freeze the current memdb, create a new memdb, and trigger an asynchronous compaction for the frozen memdb
  5. Then write the journal
  6. After the journal is written, write memdb
  7. If write merge is enabled, notify the other goroutines of the merged write results (a usage-level sketch follows)
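To make the write path concrete, here is a minimal sketch using goleveldb's public API (paths and keys are arbitrary): a single Put is wrapped into a batch internally, and write merge can be opted out of per write

```go
package main

import (
	"log"

	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/opt"
)

func main() {
	db, err := leveldb.OpenFile("/tmp/write-demo", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A single Put is wrapped into a batch internally (step 4 above).
	if err := db.Put([]byte("k1"), []byte("v1"), nil); err != nil {
		log.Fatal(err)
	}

	// An explicit batch goes through the same journal-then-memdb path.
	batch := new(leveldb.Batch)
	batch.Put([]byte("k2"), []byte("v2"))
	batch.Delete([]byte("k1"))
	// NoWriteMerge opts this write out of the merging in steps 1-3.
	if err := db.Write(batch, &opt.WriteOptions{NoWriteMerge: true}); err != nil {
		log.Fatal(err)
	}
}
```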

read

Let's first lay out the structure of memdb

```go
// DB is an in-memory key/value database.
type DB struct {
	cmp comparer.BasicComparer
	rnd *rand.Rand

	mu     sync.RWMutex
	kvData []byte
	// Node data:
	// [0]         : KV offset
	// [1]         : Key length
	// [2]         : Value length
	// [3]         : Height
	// [3..height] : Next nodes
	nodeData  []int
	prevNode  [tMaxHeight]int
	maxHeight int
	n         int
	kvSize    int
}
```

The read logic is relatively simple: first search memdb and frozen memdb, then search the files. Some of the details are fairly involved, though; let's mainly look at what an ikey is
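Concretely, an ikey (internal key) is the user key plus an 8-byte trailer packing the sequence number and the key type. A minimal sketch of how it is composed (illustrative helper, not goleveldb's actual function):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// makeIKey sketches how goleveldb builds an internal key: the user key
// followed by 8 little-endian bytes holding (sequence << 8) | keyType,
// where keyType is 1 for a value and 0 for a deletion marker.
func makeIKey(ukey []byte, seq uint64, keyType uint64) []byte {
	ikey := make([]byte, len(ukey)+8)
	copy(ikey, ukey)
	binary.LittleEndian.PutUint64(ikey[len(ukey):], (seq<<8)|keyType)
	return ikey
}

func main() {
	fmt.Printf("% x\n", makeIKey([]byte("k1"), 1, 1))
}
```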

The figure above is an example of kv storage. If k1=v1 and k2=v2 are stored in memory, memdb keeps them in one contiguous byte slice (kvData), roughly as follows

A snapshot of this structure looks as follows

So how can we tell which bytes are a key and which are a value? The DB structure has another field, nodeData, which both acts as the skip list and records the lengths of each kv pair
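Putting kvData and nodeData together for the k1=v1, k2=v2 example, a sketch of the layout (illustrative offsets; the real memdb stores ikeys, whose 8-byte trailers are omitted here for readability):

```go
// kvData (keys and values appended back to back):
//
//   k 1 v 1 k 2 v 2
//   0 1 2 3 4 5 6 7
//
// nodeData, one group per node:
//
//   [kvOffset=0, keyLen=2, valueLen=2, height=h, next nodes...]  // k1=v1
//   [kvOffset=4, keyLen=2, valueLen=2, height=h, next nodes...]  // k2=v2
//
// A lookup walks the skip-list links in nodeData, and each visited node's
// kvOffset/keyLen/valueLen slice the key and value out of kvData.
```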

(Figure: how the skip list over nodeData locates data in kvData in memdb)

In addition to looking in memdb, a read also searches the sstables; more on that later

Log reads and writes

Log structure

The following figure shows how data is stored within a chunk

(Figure: Log read-chunk structure)

A log structure is composed of multiple blocks

(Figure: Log read/write – Log structure)

The data in (Figure: Log read-chunk structure) corresponds to batch1…batchN in the figure above

Checksum: used to verify the integrity of the data

Chunk type: there are four types: first, middle, last, and full, which indicate the completeness of a chunk. Full indicates a complete chunk

If a piece of data can be written completely to a chunk, the chunk type of the chunk is full

If a piece of data is large and needs to span multiple chunks, the first chunk's type is first, the last chunk's type is last, and any chunks in between are middle

Length: the length of the data part of the chunk

Log write

Log writing is relatively simple. Internally, the program encapsulates a singleWrite; one singleWrite is responsible for one write and splits large data across blocks

Before each write, check whether the current block has enough remaining space for at least a chunk header; if not, fill the rest of the block with zeros and start a new block. The chunk type is decided by whether the current block has enough space left for all the data: if it does, write a full chunk; if not, write a first chunk, then middle/last chunks (see the sketch below)
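A minimal sketch of that chunk-splitting logic (the constants match the leveldb log format: 32KiB blocks, 7-byte chunk headers; the function only reports the resulting chunk types, it does not build real chunks):

```go
package main

import "fmt"

const (
	blockSize  = 32 * 1024
	headerSize = 7 // checksum (4B) + length (2B) + chunk type (1B)
)

// chunkTypes returns the sequence of chunk types needed to store a
// record of dataLen bytes, starting at blockOffset within a block.
func chunkTypes(blockOffset, dataLen int) []string {
	var types []string
	first := true
	for {
		avail := blockSize - blockOffset - headerSize
		if avail < 0 {
			// Not even room for a header: pad the rest of the block
			// with zeros and start a new block.
			blockOffset = 0
			continue
		}
		if dataLen <= avail {
			if first {
				return append(types, "full")
			}
			return append(types, "last")
		}
		if first {
			types = append(types, "first")
			first = false
		} else {
			types = append(types, "middle")
		}
		dataLen -= avail
		blockOffset = 0
	}
}

func main() {
	fmt.Println(chunkTypes(0, 100))     // [full]
	fmt.Println(chunkTypes(0, 70*1024)) // [first middle last]
}
```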

Log read

Reading works in reverse: one complete record is read at a time. If the chunk type read is full, the record is complete by itself; if the chunk type is first, the record is not complete yet, and subsequent chunks keep being read until a last chunk is encountered, at which point the record has been read completely

Sstable read

Sstable structure

When frozen memdb data is persisted to a file, it is organized according to a certain structure. LevelDB calls this data storage structure sstable

The data structure of an Sstable is shown below

Data block: used to store KV key-value pairs

Filter block: when the filter is enabled, stores the filter data created from the keys. LevelDB only offers a Bloom filter, and it is disabled by default. If the filter is not enabled, nothing is stored here

Meta Index Block: Stores index information of filter blocks

Index Block: Stores the index information of each data block

Footer: stores the index of the index block and the index of the meta index block. It also stores a magic string: "\x57\xfb\x80\x8b\x24\x75\x47\xdb"

Note: each block stores its compression type and CRC check information, except the filter block, which is neither compressed nor stores CRC check information

footer

The footer is 48 bytes long. It stores the offset and length of the meta index block and of the index block within the whole file, and the last 8 bytes store the magic characters
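A sketch of that 48-byte layout (the handles are varint-encoded, so the unused space is zero-padded):

```go
// footer (48 bytes, at the very end of the sstable):
//
//   meta index block handle: offset + length (varints)
//   index block handle:      offset + length (varints)
//   zero padding up to byte 40
//   magic: "\x57\xfb\x80\x8b\x24\x75\x47\xdb" (last 8 bytes)
```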

meta index block

Meta Index Block records the offset and length of the filter block

index block

Index block: compared with the meta index block, each record additionally stores a max key, recording the maximum key of its data block; this is used to quickly locate which data block a target key may fall in

filter block

The filter block stores the filter data. The filter is disabled by default in LevelDB, in which case the filter block contains no data

The structure of a filter block is shown as follows

base lg: the size one Bloom filter covers. The default value is 11, meaning one filter per 2^11 = 2K of data

filter offset's offset: the position of the first filter offset, used to separate the filter data from the offset array

filter n offset: although filter n offset points to filter data n, a lookup does not actually go through the offset array first and then jump to the data. In practice, since base lg fixes the size each filter data covers, the target filter can be located directly with a bit shift
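A sketch of that shift (baseLg = 11 as in the default; filterIndex maps a data block's offset within the sstable to the index of the filter covering it):

```go
package main

import "fmt"

// With base lg = 11, each filter covers 2^11 = 2K of the sstable, so the
// filter for a data block is found by shifting the block's offset rather
// than walking the offset array.
const baseLg = 11

func filterIndex(blockOffset uint64) uint64 {
	return blockOffset >> baseLg // blockOffset / 2^baseLg
}

func main() {
	fmt.Println(filterIndex(0))    // 0: first filter
	fmt.Println(filterIndex(4096)) // 2: third filter (4096 / 2048)
}
```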

data block

Let’s first look at the structure of the data block

Within this structure there are a couple of new faces: entry and restart pointer

Here we need to first understand the storage structure of KV in the data block

Entry is the storage structure of KV in data block, which can be expressed as

shared key: the prefix shared with the previous key. If the previous key is test1 and the key being stored is test2, then shared key = test and unshared key = 2

In other words, LevelDB does not store kv in the data block the way memdb does; it prefix-compresses the keys to save storage space

Restart pointer: the positions of the entries that store a complete (uncompressed) key

The restart pointer exists to find keys quickly. When looking up a key, the complete keys at the restart points are compared first to narrow down which two restart points the key lies between, and only the entries between those two points are traversed. Without restart pointers, every entry would have to be traversed and every key reconstructed to find a match, so restart pointers improve efficiency

An example 🌰:

```
// set a restart pointer every 2 entries: restart_interval = 2
entry1: key=deck, value=v1
entry2: key=dock, value=v2
entry3: key=duck, value=v3
```

The resulting data layout is
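Roughly, the block is laid out as follows (a sketch; the byte-level varint encoding is omitted):

```go
// entry1 (restart point): shared=0 unshared=4 valueLen=2 | "deck" "v1"
// entry2:                 shared=1 unshared=3 valueLen=2 | "ock"  "v2"  // shares prefix "d" with "deck"
// entry3 (restart point): shared=0 unshared=4 valueLen=2 | "duck" "v3"
//
// restart pointers: [offset(entry1), offset(entry3)]
// restart pointer count: 2
```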

Read operation

Now that you know the data structure above, you can sort out the read flow of sstable

There is a difference between level-0 files and non-level-0 files when searching for a key in the sstables

Level-0 files are allowed to have overlapping keys, while non-level-0 files are not

Files 1..n at level 0 are also not guaranteed to be sorted by key range: file 1 may hold keys 0-20 while file 2 holds keys 10-30. At non-zero levels the files are ordered and disjoint, e.g. file 1 holds keys 0-20 and file 2 holds keys 20-40

These two differences mean that a lookup at level 0 must check every file whose range covers the key, while at non-zero levels a binary search can locate the single file that may contain it

compaction

There are two types of compaction

  1. memory compaction: persists the data of a frozen memdb to a level-0 file
  2. table compaction: merges level-N files into level N+1

memory compaction

Memory compaction is a simple operation: it iterates the data in the frozen memdb and writes it into an sstable in order. Every memory compaction creates a new level-0 file

Memory compaction is latency-sensitive and needs to finish quickly, so it is given a higher priority than table compaction: when a memory compaction is pending, it runs before table compaction work proceeds

table compaction

Table compaction is a little bit more complicated than memory compaction

  1. How to determine which level's files need to be compacted
  2. How to determine which sstables at that level take part in the compaction

For the first question, several trigger conditions are defined up front; when one of them is met, a compaction starts (see the scoring sketch after this list)

  • When the number of level-0 files exceeds the preset upper limit (4 by default)
  • When the total size of the files at level i exceeds 10^i MB
  • When a file has accumulated too many invalid reads
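A sketch of how these triggers can be scored, mirroring the idea of computeCompaction() and the "Score < 1 means compaction is not strictly needed" comment in the version struct (the constants and names here are illustrative):

```go
package main

import "fmt"

// compactionScore returns a score for one level; a score >= 1 means the
// level should be compacted.
func compactionScore(level, numFiles int, totalSize int64) float64 {
	if level == 0 {
		return float64(numFiles) / 4 // default trigger: 4 files
	}
	limitMB := float64(10) // level 1: 10MB, level 2: 100MB, ...
	for i := 1; i < level; i++ {
		limitMB *= 10
	}
	return float64(totalSize) / (limitMB * 1024 * 1024)
}

func main() {
	fmt.Println(compactionScore(0, 5, 0))     // 1.25: compact level 0
	fmt.Println(compactionScore(1, 0, 8<<20)) // 0.8: not strictly needed
}
```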

Once the level to compact is determined, the next step is to determine which sstables at that level take part in the compaction. The main logic is as follows

  • For a level-0 compaction, select the oldest sstable, tFiles[0]
  • If the compaction was triggered by too many invalid reads, select the sstable with the invalid reads
  • Otherwise (neither a level-0 compaction nor one triggered by invalid reads), select the first sstable whose keys are greater than the max key of the previous compaction's sstable at this level

A compaction process

Level 0 compaction

Because level 0 allows keys to overlap, a level-0 compaction is a little special: after an sstable is picked, all sstables at the current level are traversed first, and every sstable whose keys overlap is pulled in, widening the range to a new [minKey, maxKey]; that range is then used to find the overlapping sstables at level 1 (a sketch of the expansion follows)
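A sketch of that expansion over plain string ranges (illustrative types; goleveldb does this over internal keys and tFiles):

```go
package main

import "fmt"

type keyRange struct{ min, max string }

func overlaps(a, b keyRange) bool { return a.min <= b.max && b.min <= a.max }

// expand widens the compaction range until no more level-0 tables overlap it.
func expand(pick keyRange, level0 []keyRange) keyRange {
	for changed := true; changed; {
		changed = false
		for _, t := range level0 {
			if overlaps(pick, t) && (t.min < pick.min || t.max > pick.max) {
				if t.min < pick.min {
					pick.min = t.min
				}
				if t.max > pick.max {
					pick.max = t.max
				}
				changed = true
			}
		}
	}
	return pick
}

func main() {
	l0 := []keyRange{{"a", "k"}, {"j", "r"}, {"t", "z"}}
	// The second table overlaps the picked range, the third does not.
	fmt.Println(expand(keyRange{"a", "k"}, l0)) // {a r}
}
```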

Level N compaction

Level N (N > 0) has no overlapping keys, so there is no need to traverse all sstables at the current level to expand the key range; the range of the chosen sstables is used directly to find the overlapping sstables at level N+1