This post continues my series of paper-reading notes. The notes are not a full translation; they focus on the problem, the ideas, the key details (which details count as "key" naturally varies from person to person), and the test data. Discussion is welcome.

1. The problem

The advent of new hardware is forcing people to rethink how data management systems should evolve. The two hardware trends the paper addresses are solid-state drives and multi-core CPUs.

Solid-state drives provide extremely high IOPS and low read/write latency, which can greatly improve performance if the upper-layer system can exploit them. At the same time, the asymmetric nature of SSDs (reads are faster than writes, sequential writes are faster than random writes) requires the upper layer to adapt its design. A traditional B+-tree updates pages in place, generating so many random writes that it struggles to benefit from SSDs.

Single-core CPU frequency has approached its limit, and multi-core CPUs are now the norm. Compared with a single core, the biggest obstacle to fully using multiple cores is concurrency, and locking is typically the biggest limiter of concurrency. Cache hit ratio also has a significant impact on performance: the CPU cache is SRAM, which is an order of magnitude faster than DRAM. The paper argues that the traditional B+-tree handles neither problem well. Updating the same page requires mutual exclusion, page splits propagate up toward the root and further increase the chance of contention, and in-place updates destroy CPU cache validity. These are the problems to be solved.

As the problem statement suggests, the primary scenario targeted by the Bw-tree is solid-state storage ("Thus the Bw-tree targets flash storage."), and even fully in-memory operation ("The Bw-tree only blocks when it needs to fetch a page from stable storage (the LSS), which is rare with a large main memory cache"; the final experiments are likewise run entirely in memory). The paper mentions that some of its policies also apply to disk, but on the whole disk is beside the point.

2. The approach

To address the problems above, the authors propose the following ideas.

  1. Use a Mapping Table to make page management more flexible: if we treat the bottom layer of the B+-tree as the data layer and the layers above it as the index layer, this indirection lets the index and the data be managed by two separate subsystems that can be optimized independently. The index subsystem is optimized for high concurrency and a high cache hit ratio; the data subsystem is optimized to exploit fast random reads and sequential writes while avoiding random writes. The mapping table is the connection point between the two subsystems and a key component (a minimal sketch follows this list).
  2. Use delta + base instead of in-place updates: the base can be understood as the node as first written, and a delta as a modification of the base (insert/update/delete), similar to a database redo record. When a Bw-tree node is updated, a delta is prepended in front of the node and pointed at the base, rather than modifying the base directly, which yields two benefits at once: (1) the base is never locked, so multiple threads modifying the same page contend less; (2) the base is never overwritten, so CPU cache lines holding it stay valid (if the base is in cache).
  3. Realize splits and merges through a series of atomic operations: a traditional B+-tree split without latches is unthinkable. In a split, the parent page has to drop the record pointing to the old child and then add two records pointing to the two newly split pages (or modify the existing record and add a new one). Without a latch, if one record has just been deleted and another thread accesses the page at that moment, the result is wrong. The paper applies a unified idea to this kind of multi-step atomicity problem. First, add temporary nodes that build a redundant path to the node being updated, then use CAS to break the old route and connect the new one. Second, any thread that encounters a page affected by an in-progress operation must pick up the baton and carry it through to completion. The second rule is essentially a wait, but in an in-memory system, finishing the work yourself is faster than waiting to be notified that someone else has finished.
  4. Writing only deltas makes storage more efficient: the paper mentions that the log-structured store (LSS) is very efficient because it flushes only deltas, but it gives few details. One detail I found interesting: "The begin and end system transaction records are among the very few objects flushed to the LSS that are not instances of pages", which suggests that user transactions and system transactions are handled as separate kinds of objects.
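To make the mapping-table idea concrete, here is a minimal sketch in C++, assuming a fixed-size table of atomic pointers; the names (MappingTable, Node, translate, install) are mine for illustration, not the paper's.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

struct Node;               // base page or delta record, opaque in this sketch
using PID = std::uint64_t; // logical page ID, the only thing the index stores

class MappingTable {
public:
    explicit MappingTable(std::size_t capacity) : slots_(capacity) {}

    // Translate a logical page ID into the current physical address.
    Node* translate(PID pid) const {
        return slots_[pid].load(std::memory_order_acquire);
    }

    // Atomically swing a slot from `expected` to `desired`. Relocating,
    // consolidating, or prepending a delta to a page all reduce to this one
    // CAS, because the index above never stores anything but the page ID.
    bool install(PID pid, Node* expected, Node* desired) {
        return slots_[pid].compare_exchange_strong(
            expected, desired, std::memory_order_acq_rel);
    }

private:
    std::vector<std::atomic<Node*>> slots_;
};
```

The point of the indirection is that every physical change to a page, however large, becomes a single CAS on a single slot, and the index above never has to know.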

3. Key details


First, the system architecture. The system is divided into three layers. The top is the Bw-tree index layer, whose bottom level holds not data but page IDs (the paper does not say how page IDs are generated). The middle is the cache layer, whose core function is to translate a page ID from the index layer into a memory address (or, if the page is not resident, into a flash offset from which it is read into memory). Below that is the LSS, a storage layer with its own dedicated optimizations. Intuitively, the biggest difference from a traditional B+-tree plus buffer pool is the mapping table. This level of indirection is not the authors' invention, but it does bring several benefits. First, if the location of a page changes, only its mapping table entry has to change, not the Bw-tree index, because the index records page IDs and those never change. Second, the mapping table imposes much looser requirements on page size (there is no need to split or merge on reaching some fixed xK threshold; pages can be large or small as the workload dictates), which reduces the number of splits and merges and hence further reduces index modifications. Third, it decouples the in-memory structure from the on-disk structure so that each can be implemented more appropriately.


Now look at delta updates. The initial state is a page ID pointing to a base page. Instead of modifying the page directly, a change to the page is packaged as a delta; the page ID is then pointed at the delta, which in turn points to the base page. This is a bit like inserting a node into a lock-free linked list, except that the new node is not a peer of the next node but a modifier of it. When reading a key, if the reader thread encounters a delta it first looks for the key in the delta and returns if found; otherwise it continues down the chain until it reaches the base. This approach reduces contention, but it obviously introduces read amplification: instead of searching only the base page, a reader has to search every delta, which hurts read performance to some degree (hotspot updates make it worse). So a consolidation process is needed: the deltas and the base are consolidated into a new page that replaces the original base page, after which readers only need to search the base page again and read performance is restored. This cycle repeats; the pattern appears in many systems and is at bottom a read/write trade-off.
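As a rough illustration of the read path and of prepending a delta, here is a sketch assuming a very simple insert/delete delta format; consolidation and SMOs are ignored, and all type and function names are invented for the example.

```cpp
#include <atomic>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

struct Node {
    enum class Kind { Base, InsertDelta, DeleteDelta } kind;
    Node* next = nullptr;                          // next delta, or the base page
    // Base page payload:
    std::map<std::uint64_t, std::string> records;  // sorted key -> value
    // Delta payload:
    std::uint64_t key = 0;
    std::string value;
};

// Walk the chain: check each delta first, fall through to the base page.
std::optional<std::string> lookup(Node* head, std::uint64_t key) {
    for (Node* n = head; n != nullptr; n = n->next) {
        switch (n->kind) {
        case Node::Kind::InsertDelta:
            if (n->key == key) return n->value;      // newest write wins
            break;
        case Node::Kind::DeleteDelta:
            if (n->key == key) return std::nullopt;  // key was deleted
            break;
        case Node::Kind::Base: {
            auto it = n->records.find(key);
            if (it != n->records.end()) return it->second;
            return std::nullopt;
        }
        }
    }
    return std::nullopt;
}

// Prepend an insert delta with a single CAS on the mapping-table slot,
// leaving the existing chain (and any cached lines of the base) untouched.
void insert(std::atomic<Node*>& slot, std::uint64_t key, std::string value) {
    Node* delta = new Node{Node::Kind::InsertDelta};
    delta->key = key;
    delta->value = std::move(value);
    Node* head = slot.load(std::memory_order_acquire);
    do {
        delta->next = head;                          // point at the current chain
    } while (!slot.compare_exchange_weak(head, delta,
                                         std::memory_order_acq_rel));
}
```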


Finally, the part the paper spends the most ink on: how to perform structure modification operations (SMOs) latch-free. The paper's figure shows how page P is split into P and Q. Step 1: create an entry for Q in the mapping table, with Q pointing to P's right neighbor R. At this point no thread except the one performing the split knows Q exists, so all other accesses proceed unaffected. Step 2: create a shadow proxy for page P. The proxy knows everything: that P is about to split into P and Q, P's split point, P's address, and Q's address. Then use CAS to point P's page ID in the mapping table at the shadow proxy; if several threads attempt this at once, only one succeeds. (The paper says any thread can initiate a split on P, so there is a possibility of contention if P is hot.) From this moment the parent node and the left neighbor also effectively point to the shadow proxy, because both reach P through its logical pointer, the page ID, which now resolves to the proxy. All access to P, whether directly through the mapping table or via a neighbor or the parent, now goes through the shadow proxy, and since the proxy carries all of P's information, every access still behaves correctly. The only problem is that reaching Q from the parent requires going through P's shadow proxy first, which costs performance. That is the next problem to solve: create a shadow proxy for the parent node so that the parent can recognize Q directly. At that point Q has truly joined the family, and cleanup can follow, mainly consolidating each node with its shadow proxy (by then the shadow proxy is the node's only entry point, so there is no contention).
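Here is a rough sketch of step 2, installing the shadow proxy (what the paper calls a split delta) with a single CAS on P's mapping-table slot; the struct layout and helper names are assumptions for illustration.

```cpp
#include <atomic>
#include <cstdint>

struct Node { };                 // base page, data delta, or SMO delta
using PID = std::uint64_t;

// The "shadow proxy" above, i.e. the paper's split delta: it records the
// separator key, P's surviving chain, and the logical ID of the new sibling
// Q, so any reader arriving at P's page ID can still reach every key.
struct SplitDelta : Node {
    std::uint64_t separator;     // keys above the separator now live in Q
    Node*         old_page;      // P's current chain (base plus deltas)
    PID           new_sibling;   // Q's already-published page ID
};

// Attempt to install the split on P's mapping-table slot. Q's entry must
// already exist (step 1 above); nothing references it yet, so publishing it
// early is harmless. If another thread's CAS won, this thread either retries
// or helps finish the winner's SMO, per the "pick up the baton" rule.
bool install_split(std::atomic<Node*>& p_slot, Node* expected_p,
                   SplitDelta* proxy) {
    proxy->old_page = expected_p;
    return p_slot.compare_exchange_strong(expected_p, proxy,
                                          std::memory_order_acq_rel);
}
```

The later step that updates the parent prepends a delta to the parent's own chain with the same single-word CAS, so no new mechanism is needed.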

The process of page merging is similar: create a shadow proxy that takes over all access to the original node, after which the original node can be merged offline. The idea here is clever. Because of the mapping table, the only physical reference to a node comes from a single entry, so as long as you can handle that entry you can handle any change to the page. This essentially turns the many-pointers problem of a tree into a single-pointer problem, which can then be solved with linked-list techniques.
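For symmetry, a sketch of the merge-side proxies under the same assumptions: one delta takes the doomed node offline and a second delta on the left sibling takes over its key range; the type names here are mine, not the paper's.

```cpp
#include <atomic>
#include <cstdint>

struct Node { };
using PID = std::uint64_t;

// Step 1: mark R as logically removed; any thread that encounters this delta
// must redirect its work to R's left sibling L ("pick up the baton").
struct RemoveNodeDelta : Node {
    Node* removed_page;          // R's current chain
};

// Step 2: prepend a merge delta to L so that L's single mapping-table entry
// now answers for both L's and R's key ranges; R's chain stays reachable,
// but only through L.
struct NodeMergeDelta : Node {
    std::uint64_t low_key_of_r;  // keys at or above this used to belong to R
    Node*         merged_chain;  // R's chain
    Node*         next;          // L's original chain
};

// Both deltas, and the parent's later index-entry cleanup, are installed with
// the same single-word CAS as in the split sketch.
bool install(std::atomic<Node*>& slot, Node* expected, Node* delta) {
    return slot.compare_exchange_strong(expected, delta,
                                        std::memory_order_acq_rel);
}
```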

4. Experimental data



After all the hard work, it's time to see the results. Since the paper focuses on lock-free, in-memory optimizations, the comparison is run with all data in memory. The Bw-tree is almost an order of magnitude better than BerkeleyDB (a traditional B+-tree), which is understandable: the test set is small, so the access collision rate is high, and BerkeleyDB uses page-level locking, which suffers more under contention. There is one detail worth noting, however: in the test BerkeleyDB's page size was set to 8K, while [2] points out that BerkeleyDB supports page sizes from 512B to 64K, so would a smaller page size narrow the gap?

Compared to a skiplist, the Bw-tree is 3-4x or more faster; the authors give a reasonable explanation that CPU cache hit ratio is the major factor. A skiplist is a variant of a linked list, and its poor CPU cache hit ratio is well known, which may also suggest that an in-memory database using a skiplist as its primary data structure is a questionable choice.

5. Summary

Although the paper leans toward in-memory databases and spends most of its pages on how to go latch-free and raise the CPU cache hit ratio, its introduction and positioning of the mapping table is the most interesting part: the indirection lets different subsystems be optimized independently, giving the system a better chance of fully exploiting the hardware. Unfortunately, there are too few details about the structure and storage of the mapping table itself.

[1]. www.microsoft.com/en-us/resea.

[2]. web.stanford.edu/class/cs276…