LevelDB is a persistent key-value store whose keys and values are arbitrary byte arrays, sorted by a user-specified comparator function.

It was written by Jeff Dean and Sanjay Ghemawat.

Basic introduction

Features

  • Keys and values are arbitrary byte arrays.
  • Data is stored in order of key value.
  • The caller can provide a custom comparison function to override the sort order.
  • Provides the basic Put(key, value), Get(key), and Delete(key) operations.
  • Multiple changes can take effect in a single atomic batch.
  • Users can create a snapshot to get a consistent view of the data.
  • Support forward and backward iteration in data.
  • Data is automatically compressed with the Snappy compression library.
  • Operations that interact with the outside world are abstracted into interfaces (such as file system operations), so users can customize operating system interactions based on the interfaces.

Limitations

  • This is not an SQL database, it has no relational data model, no SQL queries, and no indexes.
  • Only one process (possibly a multithreaded process) can access a particular database at a time.
  • The library does not have built-in client-server support, and users who need it must encapsulate it themselves.

Performance

Here is a performance report from running the db_bench program. The results are noisy, but sufficient for a ballpark performance estimate.

Configuration

We used a database with one million entries; each key is 16 bytes, each value is 100 bytes, and values compress to about half their original size. The benchmark configuration is as follows:

LevelDB:    version 1.1
Date:       Sun May  1 12:11:26 2011
CPU:        4 x Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
CPUCache:   4096 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Raw Size:   110.6 MB (estimated)
File Size:  62.9 MB (estimated)

Specific benchmark source: github.com/google/leve…

Write performance

The "fill" benchmarks create a brand-new database and write to it either sequentially ("fillseq" below) or randomly ("fillrandom" below).

The "fillsync" benchmark flushes data from the operating system to disk after every write; the other write benchmarks leave the data sitting in the operating system for a while.

The Overwrite benchmark does random writes that update existing keys in the database.

fillseq      :   1.765 micros/op;   62.7 MB/s
fillsync     : 268.409 micros/op;    0.4 MB/s (10000 ops)
fillrandom   :   2.460 micros/op;   45.0 MB/s
overwrite    :   2.380 micros/op;   46.5 MB/s

Each "op" corresponds to a write of one key/value pair. That is, the random write benchmark does about 400,000 writes per second (1,000,000 / 2.46).

Each fillsync operation takes less time (about 0.3 ms) than a disk seek (typically about 10 ms). We suspect that this is because the disk itself caches the update in its own memory and acknowledges it before the data has actually reached the platter. This may or may not be safe, depending on whether the drive has enough reserve power to save its cache in the event of a power outage.

Read performance

We list the performance of forward sequential reads, reverse sequential reads, and random lookups. Note that the database created by the benchmark is quite small, so this report describes the scenario where the entire data set fits in memory. If the data is not in the operating system cache, the cost of a read is dominated by the one or two disk seeks needed to find it; write performance is largely unaffected by whether the data set fits in memory.

readrandom  : 16.677 micros/op;  (approximately 60,000 reads per second)
readseq     :  0.476 micros/op;  232.3 MB/s
readreverse :  0.724 micros/op;  152.9 MB/s

LevelDB compacts its underlying storage data in the background to improve read performance. The results listed above were measured immediately after a large number of random writes; after compaction (which is normally triggered automatically) the numbers are better:

readrandom  : 11.602 micros/op;  (approximately 85,000 reads per second)
readseq     :  0.423 micros/op;  261.8 MB/s
readreverse :  0.663 micros/op;  166.9 MB/s

Some of the high cost of reads comes from repeatedly decompressing blocks read from disk. If we give LevelDB a cache large enough to hold the decompressed data in memory, read performance improves further:

readrandom  : 9.775 micros/op;  (approximately 100,000 reads per second before compaction)
readrandom  : 5.215 micros/op;  (approximately 190,000 reads per second after compaction)

Compiling

The project supports CMake out of the box, so compiling is simple:

git clone --recurse-submodules https://github.com/google/leveldb.git
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release .. && cmake --build .

Introduction to header files

LevelDB exposes all of its interfaces through the headers in include/*.h. Users should not rely on headers in any other directory; those internal APIs may change without warning.

  • include/leveldb/db.h: primary DB interface, starting here.
  • include/leveldb/options.h: controls the behavior of the database as a whole, as well as the behavior of individual reads and writes.
  • include/leveldb/comparator.h: the comparator abstraction. If you just want byte-wise comparison of keys, you can use the default comparator; if you want a custom sort order (e.g. to handle different character encodings), you can implement your own comparator.
  • include/leveldb/iterator.h: an interface to iterate over data. You can get an iterator from a DB object.
  • include/leveldb/write_batch.h: Apply multiple operations to the database atomically.
  • include/leveldb/slice.h: similar to string, maintains a pointer to a byte array and its length.
  • include/leveldb/status.h: many public interfaces return a Status, which is used to report success or various errors.
  • include/leveldb/env.h: an abstraction of the operating system environment. The POSIX implementation of this interface resides in util/env_posix.cc.
  • include/leveldb/table.h, include/leveldb/table_builder.h: Low-level modules that most users may not use directly.

Usage

First create a folder app/ in the leveldb directory to keep our exercise files separate, then create a file such as main.cc, and then modify the CMakeLists.txt file by adding a line:

  347   if(NOT BUILD_SHARED_LIBS)
+ 348     leveldb_test("app/main.cc")
  349     leveldb_test("db/autocompact_test.cc")

After writing the code, just go back to the build/ directory and execute:

cmake .. && cmake --build .

To compile the main executable.

Open a database

A LevelDB database has a name that corresponds to a directory on the file system, in which all of the database's content is stored. The following example shows how to open a database, creating it if necessary:

#include <cassert>
#include "leveldb/db.h"

int main() {
    leveldb::DB* db;
    leveldb::Options options;
    options.create_if_missing = true;
    leveldb::Status status = leveldb::DB::Open(options, "/tmp/testdb", &db);
    assert(status.ok());
}

If you want an error to be raised when the database already exists, add the following setting before the leveldb::DB::Open call:

options.error_if_exists = true;

Status

You may have noticed the leveldb::Status return type above. Most methods in LevelDB return this type when they may encounter an error. You can check whether the result is ok and, if not, print the associated error message:

leveldb::Status s = ...;
if (!s.ok()) cerr << s.ToString() << endl;

As an exercise, try triggering such an error message by opening a database that already exists with error_if_exists set.

Closing the database

When the database is no longer in use, simply drop the database object as follows:

delete db;

Isn’t that easy? As we will see later in the detailed source-code analysis, the DB class is implemented in RAII style, so the destructor triggered by delete performs the cleanup automatically.

Database read and write

LevelDB provides the Put, Delete, and Get methods to modify and query the database. The following code moves the value stored under key1 to key2:

std::string value;
leveldb::Status s = db->Get(leveldb::ReadOptions(), key1, &value);
if (s.ok()) s = db->Put(leveldb::WriteOptions(), key2, value);
if (s.ok()) s = db->Delete(leveldb::WriteOptions(), key1);

Atomic updates

Note that in the previous section, if the process dies after putting key2 but before deleting key1, the same value ends up stored under multiple keys. Such problems can be avoided by using WriteBatch to apply a set of operations atomically.

#include "leveldb/write_batch.h"
...
std::string value;
leveldb::Status s = db->Get(leveldb::ReadOptions(), key1, &value);
if (s.ok()) {
  leveldb::WriteBatch batch;
  batch.Delete(key1);
  batch.Put(key2, value);
  s = db->Write(leveldb::WriteOptions(), &batch);
}

WriteBatch holds a list of operations that will be applied to the database in the order they were added. Note that we perform the Delete before the Put, so that we don’t accidentally lose the value if key1 happens to equal key2.

In addition to atomicity, WriteBatch also speeds up the update process because a large number of independent operations can be added to the same batch and executed at once.

Synchronous writes

By default, every write to LevelDB is asynchronous: the process hands the write off to the operating system and returns immediately; the transfer from operating system memory to the underlying persistent storage happens asynchronously.

The sync flag can be turned on for a particular write (write_options.sync = true) to make the write wait until the data is actually recorded in persistent storage before returning. (On POSIX systems, this is implemented by calling fsync(...), fdatasync(...), or msync(..., MS_SYNC) before the write returns.)

leveldb::WriteOptions write_options;
write_options.sync = true;
db->Put(write_options, ...);

**Asynchronous writes are usually more than a thousand times as fast as synchronous writes.** The disadvantage of asynchronous writes is that a machine crash may cause the last few updates to be lost. Note that a crash of the writing process alone (i.e. not a machine reboot) causes no loss, because even when sync is false, each write is handed from process memory to the operating system before it is considered complete.

Asynchronous writes can often be used safely. For example, when loading a large amount of data into the database, you can handle lost updates by restarting the entire load after a crash. If the amount of data is very large, a hybrid scheme is an optimization: every Nth write is synchronous, and after a crash the load restarts from just after the last successful synchronous write. (The synchronous write can also update a marker describing where to restart after a crash.)

WriteBatch can be used as an alternative to asynchronous write operations. Multiple updates can be added to the same WriteBatch and then dropped together using a synchronous write (write_options.sync = true).

Concurrency

A database may only be opened by one process at a time. LevelDB acquires a lock from the operating system to prevent multiple processes from opening the same database concurrently. Within a single process, the same leveldb::DB object may be safely shared by multiple concurrent threads: different threads may write, fetch iterators, or call Get on the same database without any external synchronization primitives (the LevelDB implementation provides the required synchronization). However, other objects, such as an Iterator or a WriteBatch, require external synchronization: if two threads share such an object, they must protect access to it with their own locks. See the corresponding header files for details.

Iterating the database

The following use case shows how to print all (key, value) pairs in a database.

leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
for (it->SeekToFirst(); it->Valid(); it->Next()) {
  cout << it->key().ToString() << ": " << it->value().ToString() << endl;
}
}
assert(it->status().ok());  // Check for any errors found during the scan
delete it;

The following use case shows how to print the data in the range [start, limit):

for (it->Seek(start);
     it->Valid() && it->key().ToString() < limit;
     it->Next()) {
  ...
}

Of course, you can also iterate backwards (note that reverse iteration may be somewhat slower than forward iteration, as shown in the read-performance benchmarks above):

for (it->SeekToLast(); it->Valid(); it->Prev()) {
  ...
}

Snapshots

Snapshots provide consistent read-only views of the entire key-value store. A non-null ReadOptions::snapshot indicates that the read should operate on a particular version of the DB state; if it is null, the read operates on an implicit snapshot of the current state.

Snapshots are created by calling DB::GetSnapshot():

leveldb::ReadOptions options;
options.snapshot = db->GetSnapshot();
... apply some updates to db ...
leveldb::Iterator* iter = db->NewIterator(options);
... read using iter to view the state when the snapshot was created ...
delete iter;
db->ReleaseSnapshot(options.snapshot);

Note that when a snapshot is no longer in use, it should be released through the DB::ReleaseSnapshot interface.

Slice

The it->key() and it->value() calls above return values of type leveldb::Slice (readers familiar with Go will find the idea familiar). A Slice is a simple structure containing a length and a pointer into an external byte array. Returning a Slice is cheaper than returning a std::string because it avoids implicitly copying potentially large keys and values. Also, LevelDB methods do not return null-terminated C-style strings, because LevelDB keys and values are allowed to contain '\0' bytes.

C++ style strings and C style null-terminated strings are easily converted to a Slice:

leveldb::Slice s1 = "hello";

std::string str("world");
leveldb::Slice s2 = str;

A Slice can also be easily converted back to C++ style strings:

std::string str = s1.ToString();
assert(str == std::string("hello"));

Be careful when using a Slice: it is up to the caller to ensure that the external byte array the Slice points to remains valid while the Slice is in use. For example, the following code is buggy:

leveldb::Slice slice;
if (...) {
  std::string str = ...;
  slice = str;
}
Use(slice);

When the if block ends, str is destroyed and the storage that the Slice points to goes away, causing problems for any subsequent use.

Comparators

The previous examples used the default comparator, which orders keys lexicographically byte by byte. You can define a custom ordering by subclassing leveldb::Comparator and passing your comparator in when you open the database. Here’s an example:

class TwoPartComparator : public leveldb::Comparator {
 public:
  // Three-way comparison function:
  // if a < b: negative result
  // if a > b: positive result
  // else: zero result
  int Compare(const leveldb::Slice& a, const leveldb::Slice& b) const {
    int a1, a2, b1, b2;
    ParseKey(a, &a1, &a2);
    ParseKey(b, &b1, &b2);
    if (a1 < b1) return -1;
    if (a1 > b1) return +1;
    if (a2 < b2) return -1;
    if (a2 > b2) return +1;
    return 0;
  }

  // Ignore the following methods for now:
  const char* Name() const { return "TwoPartComparator"; }
  void FindShortestSeparator(std::string*, const leveldb::Slice&) const {}
  void FindShortSuccessor(std::string*) const {}
};

When opening the database, pass in the comparator defined above:

// instantiate the comparator
TwoPartComparator cmp;
leveldb::DB* db;
leveldb::Options options;
options.create_if_missing = true;
// Assign the comparator to options.comparator
options.comparator = &cmp;
// Open the database
leveldb::Status status = leveldb::DB::Open(options, "/tmp/testdb", &db);
...

Backward compatibility

The result of the comparator's Name() method is bound to the database when it is created and checked on every subsequent open. If the name changes, the leveldb::DB::Open call will fail. Therefore, change the comparator name if and only if the new key format and comparison function are incompatible with existing databases and it is acceptable to discard all existing data.

What if you have to change the comparison logic? With some advance planning, you can evolve your key format gradually; note that planning ahead is very important. For example, you can store a version number at the end of each key (a single byte suffices in most scenarios). When you want to switch to a new key format (as in the TwoPartComparator case above), what you need to do is:

  1. Keep the comparator name the same
  2. Increment the version number embedded in new keys
  3. Modify the comparator function so that it uses the version number found in the key to decide how to interpret it

Performance tuning

LevelDB's performance can be tuned by changing the default values of the types defined in include/leveldb/options.h.

Block size

LevelDB organizes adjacent keys into the same block (described in a follow-up article on the sstable file format), which is the basic unit of data transfer between memory and persistent storage. The default uncompressed block size is about 4KB. Applications that frequently scan large amounts of data may want to increase this value, while applications that mostly do point reads may want to decrease it. However, there is no evidence that performance improves below 1KB or above a few megabytes. Also note that compression is more effective with larger block sizes.

Compression

Each block is individually compressed before being written to persistent storage. Compression is on by default because the default compression algorithm is very fast and automatically disables itself for incompressible data; there are very few cases where users would want to turn compression off entirely, unless benchmarks show a significant performance improvement. To disable compression, do the following:

leveldb::Options options;
options.compression = leveldb::kNoCompression;
...
leveldb::DB::Open(options, name, ...);

Caching

The contents of the database are stored in a set of files in the file system. Each file stores a series of compressed blocks. If options.block_cache is not NULL, it is used to cache the contents of frequently used decompressed blocks.

#include "leveldb/cache.h"

leveldb::Options options;
options.block_cache = leveldb::NewLRUCache(100 * 1048576);  // 100MB cache
leveldb::DB* db;
leveldb::DB::Open(options, name, &db);
... use the db ...
delete db;
delete options.block_cache;

Note that the cache holds uncompressed data, so it should be sized according to the data size required by the application. (The caching of compressed data is left to the operating system’s buffer cache or user-defined Env implementation.)

When performing a bulk read, the application may wish to disable caching so that the data processed by the bulk read does not evict most of the current cache contents. A per-iterator option provides this:

leveldb::ReadOptions options;
options.fill_cache = false;
leveldb::Iterator* it = db->NewIterator(options);
for (it->SeekToFirst(); it->Valid(); it->Next()) {
  ...
}

Key layout

Note that both disk transfers and caching operate on blocks, and adjacent keys (in sorted order) are always placed in the same block. Applications can therefore improve performance by placing keys that are accessed together near each other, and putting rarely used keys in a separate region of the key space.

As an example, suppose we are implementing a simple file system based on LevelDB. The types of data we intend to store to the file system are as follows:

filename -> permission-bits, length, list of file_block_ids
file_block_id -> data

We can prefix the filename keys with one character, say '/', and the file_block_id keys with a different one, say '0', so that keys with different purposes occupy their own regions of the key space. Scans over the metadata then do not have to read or cache large chunks of file content data.

Filters

Because of the way LevelDB organizes its data on disk, a single Get() call may involve multiple disk reads. The optional FilterPolicy mechanism can be used to greatly reduce the number of these reads.

leveldb::Options options;
// Set to enable the Bloom filter-based filtering policy
options.filter_policy = NewBloomFilterPolicy(10);
leveldb::DB* db;
// Open the database with this setting
leveldb::DB::Open(options, "/tmp/testdb", &db);
... use the database ...
delete db;
delete options.filter_policy;

The code above associates a Bloom-filter-based filtering policy with the database. Bloom-filter-based filtering relies on keeping some bits of data in memory for each key (10 bits per key in this case, since that is the argument we passed to NewBloomFilterPolicy). This filter reduces the number of unnecessary disk reads needed by Get() calls by a factor of roughly 100. Increasing the bits per key yields a larger reduction at the cost of more memory. We recommend setting a filter policy for applications whose data set does not fit in memory and which do a lot of random reads.

If you are using a custom comparator, make sure the filter policy you use is compatible with it. For example, if a comparator ignores trailing spaces when comparing keys, NewBloomFilterPolicy must not be used with that comparator. Instead, the application should provide a custom filter policy that also ignores trailing spaces, as in the following example:

class CustomFilterPolicy : public leveldb::FilterPolicy {
 private:
  FilterPolicy* builtin_policy_;

 public:
  CustomFilterPolicy() : builtin_policy_(NewBloomFilterPolicy(10)) {}
  ~CustomFilterPolicy() { delete builtin_policy_; }

  const char* Name() const { return "IgnoreTrailingSpacesFilter"; }

  void CreateFilter(const Slice* keys, int n, std::string* dst) const {
    // Use builtin bloom filter code after removing trailing spaces
    std::vector<Slice> trimmed(n);
    for (int i = 0; i < n; i++) {
      trimmed[i] = RemoveTrailingSpaces(keys[i]);
    }
    builtin_policy_->CreateFilter(trimmed.data(), n, dst);
  }
};

You can also provide your own filter policies that are not based on Bloom filters, as shown in Leveldb /filter_policy.h.

Checksums

Leveldb associates checksums with all the data it stores on the file system and has two separate controls over these checksums:

ReadOptions::verify_checksums can be set to true to force checksum verification of all data read from the file system. It is false by default, i.e. no such verification is performed.

Options::paranoid_checks can be set to true before opening the database so that data corruption is detected as soon as possible. Depending on which part of the database is corrupted, the error may be reported when the database is opened or later by some other operation. Paranoid checking is off by default, so the database can still be used even if parts of its persistent storage are corrupted.

If the database is corrupted (for example, it cannot be opened when paranoid checking is on), the leveldb::RepairDB() function can be used to recover as much data as possible.

Approximate space size

The GetApproximateSizes method can be used to obtain the approximate number of bytes of file system space used by one or more key ranges.

leveldb::Range ranges[2];
ranges[0] = leveldb::Range("a", "c");
ranges[1] = leveldb::Range("x", "z");
uint64_t sizes[2];
db->GetApproximateSizes(ranges, 2, sizes);

sizes[0] will hold the approximate number of file system bytes used by the key range [a..c), and sizes[1] the approximate number used by the key range [x..z).

Environment

All file operations and other operating system calls issued by LevelDB are ultimately routed through a leveldb::Env object. Users can supply their own Env implementation for finer control. For example, an application might introduce an artificial delay into LevelDB's file I/O to limit LevelDB's impact on other applications on the same system:

// Customize your own Env
class SlowEnv : public leveldb::Env {
  ... implementation of the Env interface ...
};

SlowEnv env;
leveldb::Options options;
// Open the database with the custom Env
options.env = &env;
Status s = leveldb::DB::Open(options, ...);

Porting

LevelDB can be ported to a new platform by providing platform-specific implementations of the types/methods/functions exported by leveldb/port/port.h. See leveldb/port/port_example.h for more details.

In addition, a new platform may need a new default leveldb::Env implementation. See leveldb/util/env_posix.h for an example.