preface

A new series, “Easy to Understand Data Structures and Algorithms by Looking at Pictures”, is launched, mainly using pictures to describe common data structures and algorithms, easy to read and understand. This series includes dozens of examples of heaps, queues, lists, trees, graphs, sorts, and more.

About the LSM tree

The LSM Tree is a log-structured merge-tree. In fact, it does not belong to a specific data structure, it is more of a data structure design idea. Most NoSQL database core ideas are based on LSM to do, but the specific implementation is different. So I wasn’t going to include it in this series, but a friend left me a few comments asking me to talk about LSM trees, so LET’s talk about LSM trees.

LSM tree birth background

Traditional relational databases use Btree or some variants as storage structures, which can efficiently search. However, it also has an obvious disadvantage when stored on disk, that is, it can be logically close but physically far apart, which can cause a lot of random disk reads and writes. Random reads and writes are much slower than sequential reads and writes. To improve I/O performance, we needed a mechanism that could transform random operations into sequential operations, hence the LSM tree. The LSM tree allows us to write sequentially to the disk, greatly improving the write operation at the expense of some read performance.

About Disk IO

Disk read/write involves searching for data on the disk. The address consists of a cylinder ID, a disk ID, and a block ID. That is to say, the moving arm first moves to the specified cylinder according to the cylinder number, then determines the track of the disk according to the disk number, and finally moves the specified track segment to the magnetic head according to the block number, and then begins to read and write.

The whole process mainly consists of three parts: seek time, latency time, and transmission time. Respectively indicates the time for locating cylinders, moving the track segment with a block number to the head, and transferring data to memory. The most time-consuming part of disk I/O is the search time, so reducing the search time can greatly improve performance.

LSM tree principle

The LSM tree consists of two or more storage structures, such as the simplest two storage structures used in the paper for convenience. A storage structure resides in memory, called a C0 tree, which can be any data structure that facilitates key lookup, such as a red-black tree, a map, or even a hop table. Another storage structure resides on hard disks, called C1 tree, which is similar to B tree. C1 All nodes are 100% full. The node size is the size of a disk block.

Insert the steps

The general idea is as follows: When a new record is inserted, the operation log is first inserted into the log file so that it can be used later. The log is inserted in append form, so the speed is very fast. Insert the index of the new record into C0, which is done in memory without disk IO operation; When C0 reaches a certain threshold or at regular intervals, records in C0 are rolled into disk C1. In the case of multiple storage structures, as C1 becomes larger, it merges into C2, and so on, all the way up to Ck.

Merge step

Two blocks are used during the merge process: an Emptying block and a filling block.

  1. Read the unmerged leaf node from C1 and place it in an Emptying block in memory.
  2. Find the nodes in C0 from small to large, sort the merging with emptying blocks, save the merging results in filling blocks, and delete the corresponding nodes of C0.
  3. By doing step 2, the merging sort results are constantly filled into the Filling block, and when they are full, they are appended to a new location on the disk, not changing the original node. During the merger, if the Emptying block is used up, the unmerged leaf nodes will be read from C1.
  4. All leaf nodes of C0 and C1 will be merged once the above merger is completed.

About optimization Measures

This paper illustrates the basic principle of LSM with diagrams, but there are many optimization strategies in actual projects, and there are many papers for LSM tree optimization. Such as using bloom filters to quickly determine the presence of keys, doing extra indexing to help find records faster, and so on.

The insert

Insert A, E, L, R, U into the LSM tree, the first will be inserted into the memory C0 tree, using AVL tree, insert “A”, first the disk log file to append records, and then insert C0,

Insert “E”, again append log before write memory,

Continue to insert “L”, rotate as follows,

Insert “R” “U”, rotate and end as follows.

If the merge is triggered at this point, then the emptying block is empty and the smallest nodes in the C0 tree will be found, since C1 has no tree yet. The filling block is 4 in length, assuming that the disk block size is 4.

Start finding the smallest nodes and put them in the filling block,

Let’s go to the second node,

And so on, fill the filling block,

Start writing to disk, C1 tree,

I’m going to go ahead and insert B, F, N, and T, and I’m going to log them separately, and then I’m going to insert them into the C0 tree in memory,

If you merge at this point, load the leftmost leaf node of C1 into the Emptying block.

Then merge and sort the C0 tree nodes and emptying blocks. First, “A” enters the filling block.

And then “B”,

The final result of merge sort is,

Append the filling block to the new location of the disk to remove the original node,

Continue merging sort, fill the Filling block again,

When filling blocks are appended to new locations on disks, nodes in the upper layer are also written in disk block size (or multiple disk blocks), avoiding random writes as much as possible. In addition, because the merge process may lead to the update of the upper node, it can be temporarily saved in memory, and later written at the appropriate time.

Find operations

The general idea is to find the C0 tree in memory first, the C1 tree in disk if not, then C2 tree, and so on.

If you want to find “B”, look for the C0 tree.

And then I go to C1 tree, and I start at the root,

Find the “B”.

Delete operation

In order to perform the deletion operation quickly, it is mainly realized by marking the record to be deleted in memory. The corresponding record will be deleted when the asynchronous merge is performed later.

For example, if “U” is to be deleted, the “U” node of the C0 tree becomes,

If there is no record in the C0 tree, a node is generated in the C0 tree and marked with #. When searching for the record, it will be known in memory that the record has been deleted, so there is no need to go to disk to find it. For example, if you want to delete “B”, there is no need to go to disk to delete it, just insert a “B” node into the C0 tree marked with #.

————- Recommended reading ————

Summary of my open Source projects (Machine & Deep Learning, NLP, Network IO, AIML, mysql protocol, Chatbot)

Why to write “Analysis of Tomcat Kernel Design”

My 2017 article summary – Machine learning

My 2017 article summary – Java and Middleware

My 2017 article summary – Deep learning

My 2017 article summary — JDK source code article

My 2017 article summary – Natural Language Processing

My 2017 Article Round-up — Java Concurrent Article


Talk to me, ask me questions:

Welcome to: