Here to LevelDB (go version) source code to briefly explore the LevelDB architecture implementation, welcome in the comment area message exchange

The overall architecture


Writes to LevelDB are not written directly to disk, but are first written to memory. If the data written to memory is not persisted, the LevelDB process is abnormal, or the host machine is down, the user may lose the data written to memory. So LevelDB will first write all writes to a log file, which is the log file, before writing to memory. When a process is abnormal, a log can be used to restore the process. A log corresponds to the data of a MEMDB or frozen memDB. After the Frozen MEMDB persists to sstable, the log is deleted

Version (version)

This concept is not shown in the figure above, but it is an important one

First take a look at the version structure in your code

type version struct { id int64 // unique monotonous increasing version id s *session levels []tFiles // Level that should be compacted next and its compaction score. // Score < 1 means compaction is not strictly needed. These fields //  are initialized by computeCompaction() cLevel int cScore float64 cSeek unsafe.Pointer closing bool ref int released bool } // tFiles hold multiple tFile. type tFiles []*tFile // tFile holds basic information about a table. type tFile struct { fd storage.FileDesc seekLeft int32 size int64 imin, imax internalKey }

The version structure is a two-dimensional array. The key of the two-dimensional array represents the level and the value represents all the files below the current level

That’s where the name LevelDB comes from


From the definition of version above, we can see that each SST file (the storage file of LevelDB) is encoded and recorded in memory, which level it belongs to; However, the version is only a structure in memory. If the program restarts, the version information will not exist. Therefore, the main function of the manifest is to record the version information. There are generally two recording methods: incremental recording and full recording. Manifest selects a mixed method of incremental recording and full recording

Before we move on to what’s stored in the MANIFEST, let’s introduce one data structure

Type sessionRecord struct {hasRec int comparer string // journalNum int64 // prevJournalNum int64 NextFileNum INT64 // Number of the next SST file. Add a new SST file, number +1 seqNum Uint64 compPtrs []cpRecord // When this compaction occurs, AddedTables []atRecord // This compaction has a new compaction. DeletedTables []dtRecord scratch [binary.MaxVarintLen64]byte err error} type atRecord struct {level int num int64 size int64 imin internalKey imax internalKey } type dtRecord struct { level int num int64 }

A sessionRecord records every table that was added or deleted from a compaction, and the level of the table that was added. This version map to a sessionRecord when a compaction restarts. It reads the old manifest file, generates a version, maps this version to a sessionRecord in the manifest file, and when this compaction occurs, I’m going to append a sessionRecord


Points to the Manifast file that should be used currently


When compaction reaches a certain threshold, the data is written to a memory table. We convert this table to a frozen memDB, and we run a compaction operation. Memdb uses skip list data structure at the bottom

frozen memdb

When a memDB reaches a certain threshold, it converts to a frozen memDB. This is a concept, but it has the same structure as memDB. When a frozen memDB converts to a stable memDB, it generates a compaction that converts a file into a file


When a compaction compaction occurs, a file that dies when a frozen memdb file lands on layer 0. When a compaction compaction occurs, a file that dies when a frozen memdb file lands on layer 1. In turn; Also, it is important to note that the keys of files at layer 0 can be overlapped, while the keys of files above layer 0 do not overlap

Read and write operations


Write operations only operate on the Journal log and memDB

  1. Before writing, determine whether write merge is enabled. Write merge is to merge multiple goroutine requests into one goroutine to improve write efficiency
  2. If write merge is enabled and another Goroutine has the write lock, it will hand over its data to the other goroutine and wait for its return to write the result
  3. If write merge is enabled and you have a write lock, check to see if there are any other Goroutines that you need to ride with
  4. Put data in batch structure inside, this is a group of operation is to apply for a structure (so, no matter in a single data or write data in batch, the last is to operate in the form of batch write), and then determine the memdb surplus space, not written to the current data enough, enough, will the current memdb frozen, Then apply a new memDB, freeze memDB, and run an asynchronous compaction
  5. Then write to the log
  6. After writing to the log, write to the MEMDB
  7. If merge write is enabled, the goroutine of other merge writes is notified of the result


Let’s put down the structure of memDB

Type DB struct {CMP comparer.BasicComparer RND *rand.Rand mu sync.RWMutex kvData []byte // Node data: // [0] : KV offset // [1] : Key length // [2] : Value length // [3] : Height // [3..height] : Next Nodes // Jump List nodeData []int prevNode [tMaxHeight]int maxHeight int n int kvSize int}

Read the logic is relatively simple, first from the memDB and frozen memDB to find, and then go to the file to find, but some details of processing is more complex, we here mainly look at ikey is what

The figure above is an example of kv storage, so if k1=v1 and K2 =v2 are stored in memory, it is a byte data in MEMDB, roughly as follows

The snapshot structure is as follows

There is also a field called nodeData in the DB data structure, which acts as a skip list and also records the length of kV

In this way, how to find data in memDB according to nodedData and kvData, using skiplist form, is relatively clear

In addition to looking in memDB, you also look in the Sstable, more on that later

Read and write the log

Log structures

The following figure shows the data store of a block

(Figure: Log read-write -chunk structure)

A log structure is composed of multiple blocks

(Figure: Log read/write – Log structure)

The data in the figure above is the batch1 in the corresponding (figure: log read-write -block structure)… N

Checksum: used to verify data accuracy

Chunk type: There are three types: first, middle, last, and full, which indicate the integrity of a chunk. Full indicates a complete chunk

If a chunk of data can be completely written into a chunk, the chunk type of the chunk is full

If a piece of data is large and requires more than three chunks, the first chunk is first, the last chunk is last, and the rest are middle

Length: Indicates the batch quantity

Log write

Log writing is relatively simple. Inside the program, a singleWrite is encapsulated. This singleWrite is responsible for writing a singleWrite and writing large data in blocks

When a log is written, it first determines whether the block has enough space to write a header. If not, it adds 0 to the remaining space of the block. If the block has enough space, it writes a header. Just write first, then middle/last

Log read

If the chunk type is not Full/Last, the data is not the Last block and the data is incomplete. If the chunk type is not Full/Last, the data is not the Last block. Until it’s read complete

Sstable read

Sstable structure

When data in Frozen MEMDB is persisted to a file, it is organized according to a structure, and the structure of this data store in LevelDB is called sstable

The data structure of an Sstable is shown below

Data block: Used to store kV key and value pairs

Filter Block: After the filter is enabled, the filters created by key are stored. LevelDB provides only the Bloom filter, which is disabled by default. If it is not enabled, there is no storage

Meta Index block: Stores the index information of the filter Block

Index block: Stores the index information of each data block

Footer: stores the index information of the index block and the index information of the Meta Index block. It also stores a special character -“\x57\ XFB \x80\x8b\x24\x75\x47\ XDB”

Note: Each block stores the compression type and CRC information. Filter blocks do not compress or store CRC information


The length of the footer is 48 and is constant. It stores the meta Index block, the offset of the index block in the entire file, and the length of the Meta Index, respectively. The last 8 bytes store a magic character

meta index block

Meta Index block records the offset and length of the filter block

index block

Compared with a Meta Index block, an index block stores one more Max key for each record. The maximum key of each data block is used to quickly locate the target data block

filter block

Filter Block is a filter that is turned off in levelDB by default. When it is turned off, the filter block has no data. When it is turned on, you can select a filter

The filter block structure is as follows

Base LG: Size of a Bloom filter. The default value is 11. The corresponding bloom filter size is 2^11 = 2K

Filter offset’s offset: the position of the first filter offset, which is used to split data and filter

Filter N offset: Although the filter n offset refers to the filter data N, it will not first look for the filter offset and then offset the past when it is really looking for the filter data; The actual operation is that baselg records the size of each filter data, and direct displacement operation is required

data block

Let’s first look at the structure of the data block

Inside this structure, there are several new faces –entry, restart Pointer

So the first thing you need to know about kv is the structure of the data block

Entry is the storage structure of kV in the data block, which can be expressed as

Shared key length: the length of the preceding key is the same as the prefix of the preceding key. For example, if the preceding key is test1 and the current stored key is test2, shared key = test and unshared key = 4

In order to save storage space, levelDB does not store the key as in the kV structure of MEMDB. Instead, the key is compressed, thus causing the restart pointer

Restart Pointer: The location where each complete key is stored in the entry

Restart Pointers exist to find keys quickly. When we look for a key, we first find the complete key based on the restart pointer, and we compare it to see if it’s between the two keys that the restart pointer points to. If it’s between, The data between the two keys is traversed; If no restart pointer exists, you need to iterate through entry after entry to find the key, which is much more efficient

An 🌰 :

// restart pointer restart_interval=2 entry1: key=deck,value=v1 entry2: key=dock,value=v2 key=duck,value=v3

Then the stored data structure is

Read operation

With the above data structures in mind, you can sort out the sstable reading process

When searching for a key in an sstable file, there is a slight difference between a layer 0 file and a non-layer 0 file

Layer 0 files are allowed to overlap, but non-layer 0 files are not

Layer 0 files, 1-n files, are not guaranteed to be in order, non-layer 0 files, 1-n files are guaranteed to be in order, so the key of file 1 is 0-20, the key of file 2 is 20-40, and the key of file 1 for layer 0 is 0-20, and the key of file 2 can be 10-30

The above two differences lead to the difference between searching for layer 0 files and non-layer 0 files


There are two types of compaction

  1. Memory compaction: Persists data from the frozen MEMDB to a L0 file
  2. Table compaction: A process that compacts a file at the N+1 layer

memory compaction

When a memory compaction occurs, it reads a memDB file, writes a memory stable to it, and adds auxiliary data to it. Every memory compaction creates a new 0-tier file.

Memory compaction is a time-sensitive process that needs to be completed in a short period of time, so it has to take precedence over a table compaction, which compacts to a memory compaction

table compaction

A table compaction is a little more complex than a memory compaction

  1. How do WE know which layer we want to write
  2. We have a new compaction. We have a compaction level

When a compaction occurs, it creates a new compaction. When a compaction occurs, it generates a compaction

  • When the number of layer 0 files exceeds the predetermined upper limit (default: 4)
  • When the total size of level I files exceeds (10 ^ I) MB
  • Excessive number of invalid reads of a file

When this compaction level is confirmed, the next step is to determine the sstable of a compaction level

  • For level 0 comapction, select the oldest sstable, which is tFiles[0]
  • If this is not a level 0 compaction, select this Sstable if this is a number of invalid reads
  • If this is not a level 0 compaction or too many invalid reads, select the maxKey Sstable that exceeds the comapction Sstable

A compaction process

Level 0 compaction

When a new compaction occurs on a level 0 server, it creates a new compaction. After determining an sstable, the system first iterates through all sstables at the current level, and then finds all sstables with overlapping keys to form a new range of [minKey, maxKey], and then searches for the range at level 1

Level N compaction

Level N (N > 0) does not overlap keys, so level N does not need to traverse all sstables at the current level to extend the range of keys to be searched