Preface

Databases are among the most commonly used components in software systems. Whether it is a huge and complex e-commerce platform or a simple personal blog, almost every application uses a database in some way, be it to store massive amounts of data or just a little state information. We generally divide databases into relational databases and non-relational (NoSQL) databases; MySQL is a typical representative of the former, HBase of the latter. Relational or not, every database provides two basic functions: (1) storing data and (2) querying data. Put simply, when you throw data into a database, it keeps it and gives it back to you later when you ask for it.

Around these two basic functions, databases employ all kinds of techniques for optimization, the best known of which is indexing. An index is a data structure that significantly improves query (read) performance at the cost of a small amount of storage (write) performance. There are many kinds of indexes. The Hash index is the simplest and most efficient, but its limitations keep it from being widely used in database systems. Today the most common index structure is the B/B+ tree, which is used heavily in relational databases and suits read-heavy workloads. With the rise of NoSQL databases, the LSM tree (Log-Structured Merge-Tree), popularized by Google's Bigtable paper, has also become popular. Strictly speaking, the LSM tree is not an index in the traditional sense, but rather a design approach for write-heavy workloads.

In this series of articles, we start by implementing the simplest possible key-value database, then apply the index techniques above to address the bottlenecks we run into along the way, with the goal of gaining a deeper understanding of database indexing.

The simplest database

In Designing Data-Intensive Applications, Martin Kleppmann gives an implementation of the simplest database:

#!/bin/bash

db_set() {
  echo "$1,$2" >> database
}

db_get() {
  grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}


These fewer than ten lines of shell code implement a simple key-value database with two functions: db_set for storing data and db_get for querying it. The database stores data in a plain text file (database), one record per line, each record being a key-value pair with a comma (,) separating key and value.

db_set key value stores a key and its corresponding value in the database; db_get key returns the most recent value stored for that key. For example:

$ db_set 123456 '{"name":"London","attractions":["Big Ben","London Eye"]}'

$ db_set 42 '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}'

$ db_get 42
{"name":"San Francisco","attractions":["Golden Gate Bridge"]}


Looking at db_set, every write simply appends a record to the database file. Thanks to the file system's efficient sequential writes, write performance is high. But appending also means that old records are never overwritten or deleted, so every call to db_get has to scan all records, collect every value written for the key, and return the latest one. As a result, read performance is very low.

The simplest database read and write operation

Let’s rewrite the simplest database in Java:

/**
 * SimpleKvDb.java
 * Append-only writes
 * Reads by scanning the whole file
 */
public class SimpleKvDb implements KvDb {
    // ...
    @Override
    public String get(String key) {
        try (Stream<String> lines = logFile.lines()) {
            // step1: find all records for the key (grep "^$1," database)
            List<String> values = lines
                    .filter(line -> Util.matchKey(key, line))
                    .collect(Collectors.toList());
            if (values.isEmpty()) {
                return "";
            }
            // step2: take the latest value (sed -e "s/^$1,//" | tail -n 1)
            String record = values.get(values.size() - 1);
            return Util.valueOf(record);
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }

    @Override
    public void set(String key, String value) {
        // echo "$1,$2" >> database
        String record = Util.composeRecord(key, value);
        logFile.append(record);
    }
    // ...
}


Benchmarking with JMH gives the following results:

Benchmark                                            Mode  Cnt   Score   Error  Units
SimpleKvDbReadBenchmark.simpleKvDb_get_10000_test    avgt    8   0.631 ± 0.033  ms/op // read time, 10K records
SimpleKvDbReadBenchmark.simpleKvDb_get_100000_test   avgt    8   7.020 ± 0.842  ms/op // read time, 100K records
SimpleKvDbReadBenchmark.simpleKvDb_get_1000000_test  avgt    8  62.562 ± 5.466  ms/op // read time, 1M records
SimpleKvDbWriteBenchmark.simpleKvDb_set_test         avgt    8   0.036 ± 0.005  ms/op // write time


The results show that this implementation has high write performance but very low read performance, and the read time grows linearly with the amount of data.
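For reference, a read benchmark of this kind can be set up with JMH roughly as follows. This is only an illustrative sketch: the class layout, key format, and the hypothetical newSimpleKvDb() factory are assumptions, not the exact harness behind the numbers above.

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class SimpleKvDbReadBenchmark {

    @Param({"10000", "100000", "1000000"})
    private int recordCount;

    private KvDb db;

    @Setup
    public void setup() {
        // Pre-fill the database with recordCount entries before measuring reads
        db = newSimpleKvDb();
        for (int i = 0; i < recordCount; i++) {
            db.set("key" + i, "value" + i);
        }
    }

    @Benchmark
    public String get() {
        // Read one of the pre-loaded keys
        return db.get("key" + (recordCount / 2));
    }

    // Hypothetical factory: the article does not show SimpleKvDb's constructor,
    // so wire up your own instance backed by a temporary database file here.
    private KvDb newSimpleKvDb() {
        throw new UnsupportedOperationException("replace with your SimpleKvDb constructor");
    }
}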

So how do we optimize SimpleKvDb's read performance? By introducing an index.

An index is an additional data structure derived from the data in the database. It does not change the data itself, only the database's read and write performance: for reads, it lets the database locate the target data quickly, greatly improving read performance; for writes, performance drops slightly because the index has to be updated as well.

As mentioned in the introduction, there are many types of indexes, each with its own characteristics. Therefore, whether to use an index or not depends on the actual application scenario.

Next, we’ll start by optimizing SimpleKvDb with the simplest and most efficient Hash index.

Adding a Hash index to the database

Since a key-value database is itself essentially a Hash table, an obvious indexing strategy comes to mind: maintain a Hash table in memory that records the byte offset of each key in the data file (the database file).

For writes, after appending a record to the data file we update the Hash table; for reads, we first look up the key's byte offset in the Hash table, then seek to that offset in the data file and read the value, avoiding an inefficient scan of the whole file.

Read and write operations after adding indexes

Adding a Hash index to SimpleKvDb gives the following implementation:

/**
 * HashIdxKvDb.java
 * Append-only writes
 * Uses a Hash index to improve read performance
 */
public class HashIdxKvDb implements KvDb {
    // Data storage file
    private final LogFile curLog;
    // Index: value is the offset of the key's record in the file
    private final Map<String, Long> idx;
    // ...
    @Override
    public String get(String key) {
        if (!idx.containsKey(key)) {
            return "";
        }
        // step1: look up the offset in the index
        long offset = idx.get(key);
        // step2: read the record at that offset
        String record = curLog.read(offset);
        return Util.valueOf(record);
    }

    @Override
    public void set(String key, String value) {
        String record = Util.composeRecord(key, value);
        long curSize = curLog.size();
        // step1: append the record
        if (curLog.append(record) != 0) {
            // step2: update the index
            idx.put(key, curSize);
        }
    }
    // ...
}


Before implementing HashIdxKvDb, we abstract the data storage file into a LogFile object with two basic methods: append (append a record) and read (read the record at a given offset). The read method uses RandomAccessFile's seek to jump straight to the record's position:

// Append-only log file that stores the database's data
class LogFile {
    // File path
    private Path path;
    // ...
    // Writes a record to the log file as one line, automatically adding a newline character.
    // Returns the number of bytes successfully written.
    long append(String record) {
        try {
            record += System.lineSeparator();
            Files.write(path, record.getBytes(), StandardOpenOption.APPEND);
            return record.getBytes().length;
        } catch (IOException e) {
            e.printStackTrace();
        }
        return 0;
    }

    // Reads the line starting at the given offset
    String read(long offset) {
        try (RandomAccessFile file = new RandomAccessFile(path.toFile(), "r")) {
            file.seek(offset);
            return file.readLine();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return "";
    }
    // ...
}

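Later listings also call a few LogFile helpers that are elided above: create, size, path, renameTo, delete, and lines. Minimal sketches built on java.nio.file might look like the following; they are assumptions, and the original implementation may structure error handling differently.

// Additional LogFile members assumed by later listings (sketches only).
// Requires: java.io.IOException, java.nio.file.*, java.util.stream.Stream

private LogFile(Path path) {
    this.path = path;
}

// Create (or open) the log file at the given path
static LogFile create(String path) throws IOException {
    Path p = Paths.get(path);
    if (!Files.exists(p)) {
        Files.createFile(p);
    }
    return new LogFile(p);
}

// Current file size in bytes; the next record will be appended at this offset
long size() {
    try {
        return Files.size(path);
    } catch (IOException e) {
        e.printStackTrace();
        return 0;
    }
}

// File path as a string
String path() {
    return path.toString();
}

// Rename the underlying file (used when archiving a full segment file)
void renameTo(String newPath) throws IOException {
    this.path = Files.move(path, Paths.get(newPath));
}

// Delete the underlying file (used after compaction)
void delete() {
    try {
        Files.deleteIfExists(path);
    } catch (IOException e) {
        e.printStackTrace();
    }
}

// Lazily stream the records in the file (used by the level2 compact)
Stream<String> lines() throws IOException {
    return Files.lines(path);
}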

Benchmarking HashIdxKvDb with JMH gives the following results:

Benchmark                                              Mode  Cnt  Score   Error  Units
HashIdxKvDbReadBenchmark.hashIdxKvDb_get_10000_test    avgt    8  0.021 ± 0.001  ms/op // read time, 10K records
HashIdxKvDbReadBenchmark.hashIdxKvDb_get_100000_test   avgt    8  0.021 ± 0.001  ms/op // read time, 100K records
HashIdxKvDbReadBenchmark.hashIdxKvDb_get_1000000_test  avgt    8  0.021 ± 0.001  ms/op // read time, 1M records
HashIdxKvDbWriteBenchmark.hashIdxKvDb_set_test         avgt    8  0.038 ± 0.005  ms/op // write time


Compared with SimpleKvDb, HashIdxKvDb's read performance improves dramatically, the read time no longer grows linearly with the amount of data, and write performance shows no significant drop.

The Hash index is simple to implement yet extremely efficient: a read needs only one disk seek (the seek operation) plus one disk I/O (the readLine operation) to load the data. If the data is already in the file system cache, no disk I/O is needed at all.

Data consolidation — Compact

So far, both SimpleKvDb and HashIdxKvDb write by continuously appending data to a single file, a structure usually referred to as an append-only log.

So how do you keep the append-only log from growing endlessly until you run out of disk space?

A notable property of the append-only log is that old records are never overwritten or deleted. Most of this data is useless, though: when reading a key's value, the database only cares about the most recent record. So one way to solve the problem is to get rid of these stale records:

(1) When the current append-only log reaches a certain size, create a new append-only log and append subsequent writes to it. Under this scheme the data is spread across multiple append-only log files, which we call segment files.

(2) Ensure that only the current segment file is readable and writable, while old segment files are read-only.

The segment file system

(3) Periodically compact the old segment files: keep only the latest record for each key and delete the older ones. For example, if a segment file contains the records 42,a then 123,c then 42,d, compaction keeps only 123,c and 42,d.

Single file compact operation

The compact operation usually runs in a background thread. The database writes the merged result to a new compacted segment file, so reads can continue to be served from the old segment files while compaction is in progress. Once compaction finishes, the old segment files are deleted and subsequent reads are served from the compacted segment file.

At this point the keys within a single compacted segment file are unique, but duplicate keys may still exist across different compacted segment files. We can go one step further and compact multiple compacted segment files together, shrinking the data volume again.

Multi-file compact operation

Similar compact operations can be applied layer by layer; for example, level2 compacted segment files can be compacted again into level3 compacted segment files. However, more layers of compaction are not always better; the compaction policy should be designed for the actual application scenario.

Add compact to HashIdxKvDb

Next, let's try adding compaction to HashIdxKvDb. With compaction, the data is spread across multiple segment files, so a single Hash index no longer works; we need to maintain a separate Hash index for each segment file. This is easy to implement: keep a Hash table whose key is the segment file and whose value is that segment file's Hash index.

Multiple segment file Hash index

The corresponding code implementation is as follows:

// Hash indexes over multiple files
class MultiHashIdx {
    // ...
    private Map<LogFile, Map<String, Long>> idxs;

    // Get the offset of a key in the given LogFile
    long idxOf(LogFile file, String key) {
        if (!idxs.containsKey(file) || !idxs.get(file).containsKey(key)) {
            return -1;
        }
        return idxs.get(file).get(key);
    }

    // Record the offset of a key in the given LogFile
    void addIdx(LogFile file, String key, long offset) {
        idxs.putIfAbsent(file, new ConcurrentHashMap<>());
        idxs.get(file).put(key, offset);
    }
    // ...
}


Besides the segment file currently being appended to, CompactionHashIdxKvDb also maintains a toCompact queue of old segment files, a queue of level1 compacted segment files, and a queue of level2 compacted segment files, and uses a ScheduledExecutorService to periodically compact these collections.

/**
 * Append-only writes: when the current segment file reaches a certain size, switch to a new
 * segment file, and periodically compact the old segment files.
 * Maintains a Hash index per segment file to improve read performance.
 * Supports single-threaded writes and multi-threaded reads.
 */
public class CompactionHashIdxKvDb implements KvDb {
    // ...
    // The segment file currently being appended to
    private LogFile curLog;

    // Old segment files waiting to be merged by the periodic level1 compact
    private final Deque<LogFile> toCompact;
    // Level1 compacted segment files, periodically merged by the level2 compact
    private final Deque<LogFile> compactedLevel1;
    // Level2 compacted segment files
    private final Deque<LogFile> compactedLevel2;
    // Hash indexes of the segment files
    private final MultiHashIdx idx;

    // Scheduler for the periodic compact operations
    private final ScheduledExecutorService compactExecutor;
    // ...
}

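The listing above does not show how the two compaction passes get scheduled. One plausible wiring, assuming the fields above and the compactLevel1/compactLevel2 methods introduced later, is to register both passes on the ScheduledExecutorService in the constructor; the intervals here are purely illustrative.

// Hypothetical constructor fragment -- the scheduling details are not shown in the article.
// Requires: java.util.concurrent.Executors, java.util.concurrent.TimeUnit
public CompactionHashIdxKvDb(/* construction parameters omitted */) {
    // ... initialise curLog, the three queues and the MultiHashIdx ...
    this.compactExecutor = Executors.newSingleThreadScheduledExecutor();
    // Periodically merge the old segment files in the toCompact queue (level1 compact)
    compactExecutor.scheduleWithFixedDelay(this::compactLevel1, 10, 10, TimeUnit.SECONDS);
    // Less frequently, merge the level1 compacted segment files into a level2 file
    compactExecutor.scheduleWithFixedDelay(this::compactLevel2, 30, 30, TimeUnit.SECONDS);
}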

Unlike HashIdxKvDb, CompactionHashIdxKvDb has to check whether the current segment file is full before writing new data. If it is, it creates a new LogFile to append to and archives the full LogFile into the toCompact queue.

@Override
public void set(String key, String value) {
    try {
        // If the current LogFile is full, move it to the toCompact queue and create a new LogFile
        if (curLog.size() >= MAX_LOG_SIZE_BYTE) {
            String curPath = curLog.path();
            Map<String, Long> oldIdx = idx.allIdxOf(curLog);
            curLog.renameTo(curPath + "_" + toCompactNum.getAndIncrement());
            toCompact.addLast(curLog);
            // Re-register the index under the renamed file, then create the new file
            idx.addAllIdx(curLog, oldIdx);
            curLog = LogFile.create(curPath);
            idx.cleanIdx(curLog);
        }

        String record = Util.composeRecord(key, value);
        long curSize = curLog.size();
        // If the write succeeds, update the index
        if (curLog.append(record) != 0) {
            idx.addIdx(curLog, key, curSize);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

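The set method above also relies on three MultiHashIdx helpers (allIdxOf, addAllIdx, and cleanIdx) that were elided from the earlier listing. Minimal sketches, assuming the same internal idxs map, might look like this:

// Sketches of the elided MultiHashIdx helpers (assumptions, not the original code)

// All key offsets recorded for the given segment file
Map<String, Long> allIdxOf(LogFile file) {
    return idxs.getOrDefault(file, new ConcurrentHashMap<>());
}

// Re-register an existing index, e.g. under a renamed segment file
void addAllIdx(LogFile file, Map<String, Long> offsets) {
    idxs.put(file, new ConcurrentHashMap<>(offsets));
}

// Reset the index of a (newly created) segment file to empty
void cleanIdx(LogFile file) {
    idxs.put(file, new ConcurrentHashMap<>());
}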

Reads in CompactionHashIdxKvDb are more complicated than in HashIdxKvDb, because the record for a key may live in any of several places. The lookup order is: the currently appended LogFile -> the toCompact queue -> the compactedLevel1 queue -> the compactedLevel2 queue. So in the worst case, when the queried data sits in the compactedLevel2 queue, a read can still be fairly inefficient.

@Override
public String get(String key) {
    // step1: search the current LogFile
    if (idx.idxOf(curLog, key) != -1) {
        long offset = idx.idxOf(curLog, key);
        String record = curLog.read(offset);
        return Util.valueOf(record);
    }
    // step2: search the toCompact queue
    String record = find(key, toCompact);
    if (!record.isEmpty()) {
        return record;
    }
    // step3: search the compactedLevel1 queue
    record = find(key, compactedLevel1);
    if (!record.isEmpty()) {
        return record;
    }
    // step4: search the compactedLevel2 queue
    record = find(key, compactedLevel2);
    if (!record.isEmpty()) {
        return record;
    }
    return "";
}

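The get method above delegates to a find helper that is not shown here. A minimal sketch, assuming segment files are always appended to their queues with addLast (so the newest file sits at the tail), might look like this:

// Hypothetical helper: search a queue of segment files from newest to oldest
// and return the value of the first matching record, or "" if the key is absent.
// Requires: java.util.Iterator
private String find(String key, Deque<LogFile> logFiles) {
    Iterator<LogFile> it = logFiles.descendingIterator();  // tail (newest) to head (oldest)
    while (it.hasNext()) {
        LogFile logFile = it.next();
        long offset = idx.idxOf(logFile, key);
        if (offset != -1) {
            return Util.valueOf(logFile.read(offset));
        }
    }
    return "";
}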

Level1 compaction of the old segment files in the toCompact queue can take advantage of the Hash index: because the index always points at the latest record for each key, we only need to walk the index, read the latest record for every key, and write it into the new level1 compacted segment file.

// Level1 compact: merge the old segment files in the toCompact queue
void compactLevel1() {
    while (!toCompact.isEmpty()) {
        // Create the level1 compacted segment file
        LogFile newLogFile = LogFile.create(curLog.path() + "_level1_" + level1Num.getAndIncrement());
        LogFile logFile = toCompact.getFirst();
        // Keep only the latest record for each key
        idx.allIdxOf(logFile).forEach((key, offset) -> {
            String record = logFile.read(offset);
            long curSize = newLogFile.size();
            if (newLogFile.append(record) != 0) {
                idx.addIdx(newLogFile, key, curSize);
            }
        });
        // Add it to the compactedLevel1 queue and delete the corresponding file in toCompact
        compactedLevel1.addLast(newLogFile);
        toCompact.pollFirst();
        logFile.delete();
    }
}


The level2 compact strategy for compactedLevel1 is to merge all of the Level1 compacted segment files in the current queue into one Level2 compacted segment file. The specific steps are as follows:

1. Take a snapshot of the compactedLevel1 queue, so that level1 compacted segment files added to the queue while the level2 compact is running do not interfere with it.

2. Write the records from the level1 compacted segment files into the new level2 compacted segment file, going from the newest file to the oldest; if a key already exists in the level2 compacted segment file, skip the record.

3. Remove the merged level1 compacted segment files from the compactedLevel1 queue and delete them.

// Level2 compact: merge all files in the compactedLevel1 queue
void compactLevel2() {
    // ...
    // Take a snapshot
    Deque<LogFile> snapshot = new LinkedList<>(compactedLevel1);
    if (snapshot.isEmpty()) {
        return;
    }
    int compactSize = snapshot.size();
    // The level2 file is named filename_level2_num
    LogFile newLogFile = LogFile.create(curLog.path() + "_level2_" + level2Num.getAndIncrement());
    while (!snapshot.isEmpty()) {
        // Start from the newest level1 compacted segment file
        LogFile logFile = snapshot.pollLast();
        logFile.lines().forEach(record -> {
            String key = Util.keyOf(record);
            // Only write keys that do not yet exist in the level2 compacted segment file
            if (idx.idxOf(newLogFile, key) == -1) {
                // Update the index after a successful write
                long offset = newLogFile.size();
                if (newLogFile.append(record) != 0) {
                    idx.addIdx(newLogFile, key, offset);
                }
            }
        });
    }
    compactedLevel2.addLast(newLogFile);
    // After writing, delete the merged files from the compactedLevel1 queue
    while (compactSize > 0) {
        LogFile logFile = compactedLevel1.pollFirst();
        logFile.delete();
        compactSize--;
    }
    // ...
}


Conclusion

This article started from the simplest database given by Martin Kleppmann, re-implemented it in Java, and then optimized it step by step with a Hash index and a compaction mechanism.

SimpleKvDb stores data as an append-only log. The main advantage of an append-only log is high write performance (sequential writes to the file system are much faster than random writes). But appending also means that old records for the same key are never overwritten or deleted, so a query has to scan the whole file, find all records for the key, and pick the latest one. Because of this full-file scan, SimpleKvDb's read performance is very low.

To optimize SimpleKvDb's read performance, we implemented HashIdxKvDb with a Hash index. The Hash index is an in-memory Hash table that stores each key's offset in the file, so a query needs only one disk seek plus one disk I/O, which is very efficient.

To keep the ever-growing append-only log from exhausting disk space, we introduced a compaction mechanism into HashIdxKvDb and implemented CompactionHashIdxKvDb. Compaction cleans up stale records in the database and relieves the pressure on disk space.

Hash indexes are simple and efficient, but have the following two limitations:

1. A Hash index must be kept in memory. Implementing a Hash index on disk would cause a large amount of random disk I/O and a sharp drop in performance. Moreover, with compaction the data is spread across multiple segment files, so we have to maintain a Hash index per segment file, which drives up the index's memory footprint and puts a lot of memory pressure on the system.

2. Range queries are very inefficient. For example, to query all keys in the database between [key0000, key9999], you have to scan every entry in the index to find the qualifying keys.
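To make this concrete, here is an illustrative sketch of a range query given only the in-memory index (the idx map from HashIdxKvDb): since a Hash table keeps no key order, every entry has to be examined.

// Range query over a hash index: no ordering is maintained, so we must scan every entry
List<String> keysInRange = idx.keySet().stream()
        .filter(k -> k.compareTo("key0000") >= 0 && k.compareTo("key9999") <= 0)
        .sorted()
        .collect(Collectors.toList());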

How can you optimize for these two limitations of Hash indexes? In the next article in this series, we will introduce another index that does not have these two limitations — the LSM tree.