The nature of indexes

1. Write it first

1.1 background

The reason for this article is that I went to ask a student about the usage scenario of LevelDB. The classic dialogue is as follows:

Big guy: What problem do you think LevelDB solved? (PS: Quite suddenly, a rhetorical question, asked me unprepared, don’t you think I think

Me: LevelDB implementation using LSM Tree writes faster compared to K/V stores of the same type

Big guy: Why?

Me: Check the data on the network is LSM Tree will change the random disk write into sequential write, so it is faster

LSM Tree vs. B Tree, why LSM Tree with B Tree?

(B+ Tree) (K + Tree) (B+ Tree) (K + Tree) (B+ Tree) (K + Tree) (B+ Tree) (K + Tree) (B+ Tree) (K + Tree) (B+ Tree) (K + Tree) (B+ Tree) (K + Tree) (B+ Tree) (K + Tree) (B+ Tree) (K + Tree) (B+ Tree) (K + Tree) (B+ Tree

1.2 thinking

Through this question, I once again deeply aware of the gap between me and the big guy, well, there may be a Mount Everest

“Ability to abstract questions” – To abstract away the key points and ask the usage scenarios of LSM Tree data structures? Abstract thinking
- For example, even though the original question is LevelDB usage scenario?
  
  But there are a lot of variations, for example you can ask RocksDB usage scenarios?
  
  It is a better way to solve the problem if we can extract the commonness of the two and analyze the application scenarios of LSM Tree data structure.
  
  Note: Both RocksDB and LevelDB implementations use LSM Tree
“The ability to summarize problems” — to derive a class of problems or more from one problem? Think by analogy
- For example, both LSM trees and B trees can be used as indexes, a concept most of us are familiar with. But what about going deeper?
  - Why index it?
  - What problems can be solved by adding indexes?
  - What data structures are available for the index?
  - …
  We do not know the knowledge of many and can not enumerate, but in the face of those we do not know the knowledge, learn to use the ability to draw inferences to get twice the result with half the effort
  
  Note: Tao begets one, life begets two, two begets three, and three begets all things.

2. What problem does the index solve

2.1 Why index

Before we answer that question, let’s examine the chart below. Old rules, but also in accordance with the “from the bottom up, step by step in-depth” way.

Do you care about indexes when you only have a few pieces of data?

A: No, but it doesn’t mean you can’t, so it means there are still options. A. Direct linked list or array storage, global traversal when needed

B. Use tree storage or ordered array storage, binary search when necessary
How about thousands, tens of thousands, or even 100,000 pieces of data?

A: Not necessarily, it depends on the scenario, depending on the size of the records in the dataset
How about increasing the data volume to hundreds of millions of records with a size of around 4M?

A: Yes, otherwise the data of this volume will be searched globally and you will either be fired by the boss or sent blades by the user

In summary, the purpose of adding indexes is obvious.

When the data increases to a certain volume, adding indexes can help us search the data more quickly.

Question: Is a full table scan necessarily slower than an index run?

2.2 The nature of indexes

The nature of Indexing) I was shocked to learn this

Change IO operations to CPU operations, memory operations to cache operations, and low-speed media to high-speed media to reduce dependence on low-speed media.

Through the caching effect of high-speed media, index is smaller than data, and ordered, can be quickly approximated, the whole operation by CPU as far as possible in memory to avoid IO intervention, so IO -> CPU & MEM, CPU to complete the approximation, MEM to complete data transmission and caching, as far as possible to reduce IO

Note: When I first heard this theory, I always thought it was wrong, but when I listened to it, I felt surprised that I could understand it this way

Two groups of arrays are given, one group is ordered array, the other group is unordered data, and the element whose element value is 9 is found through traversal, as shown in the figure below. The following conclusions are obtained by comparing the two methods:

The number of searches represents the speed of searches
As the amount of data increases, there is a huge difference in search speed between the two methods O(n) vs O(log(n)).

2.2.1 Adding indexes

First of all, in the actual production environment, the generation of data must be disordered and random.

Secondly, in order to quickly query the data, it requires us to store the data in an orderly way.

Then, in order to more quickly abstract out the data record, you can display the selected record field or implicitly generate an ID, as the index represents the whole record, the benefit is to improve the query speed at the same time, compress the data size. For example, you are advised to select an auto-increment ID for an index in mysql

Finally, this is a conceptual summary of indexes.

A picture is worth a thousand words. Feel the importance of adding an index. Ps young do not know index good, the wrong all sweep true love……

3. Comparison of index data structures

All of the data structures listed below can be used for indexing, but note that this is a subset relationship (i.e. there are other data structures besides the one listed below)

Think 🤔 : Don’t take a hammer, see everything like a nail.

LSM Tree and red black Tree

3.1 Faster search for what?

Okay, so far, we’ve seen that indexes speed up the lookup process, but what does it speed up?

Find the equivalent
Find the range
Fuzzy search
Difference to find
The sum of the lookup
…

The search options vary a lot, see what your business needs, and pick a data structure.

3.2 B+ VS LSM tree

Since we’ve been talking about the essence, we’re going to talk about the essence. So the basic concept is not written here, “students with the same curiosity as the author” can be jabged at the following reference.

3.2.1 B + tree

B+ trees are essentially pluggable ordered arrays.

The disadvantage is that the array transforms into a tree structure, which results in insertion requiring search. Therefore, the insertion of B+ tree is weak, because the pre-insertion cost is to find the location
Advantages: Fast query speed

Note: The problem with B+ trees is that inserts need to be found and nodes full cause splitting, which requires holding very large granularity locks

3.2.2 LSM tree

The essence of an LSM tree is an ordered Array that is not pluggable

The disadvantages are: data cannot be inserted and can only be done by merge, which becomes logging. This introduces interlocking problems such as multi-level write amplification, high CPU cost of merge, delete operation is forced to introduce multiple versions, and searching is very complicated.
Advantages: Fast write speed

3.2.3 Bw – tree

“Learn two, get one free. Learn to earn.”

Microsoft’s DocumentDB, now called Azure Cosmos DB, is based on BW-Tree. The essence of bW-tree is append substitution splitting of leaf nodes at the cost of additional metadata management.

LSM trees are for writing
Bw-tree atomizes buffer operations on a node to avoid locking for parallelism.

Note: Prior knowledge, points include segments

Concurrency is a time range that is executed simultaneously

Parallelism is points in time, points in time are executed simultaneously

The deeper one involves the concept of total and partial orders, well, everything is mathematical…

Note: THE author of Bw-tree is not very well understood, the back of the separate write an article to learn good, here only make a record, this part can skip reading

4)

Disclaimer, the above are my own ideas and thinking, there are mistakes welcome to point out.

Fortunately, when there is a lot of depression, there are also a lot of small happiness, because the weather is very good, because I have eaten delicious snacks, because I can quietly sleep in on weekends, and feel warm and happy for many small satisfaction.
Because the less knowledge you have, the more absolute you believe, because you haven’t heard the opposite. Arrogance is the nature of the ignorant and the quarrelsome.

There is no delay, the end of the day at the beginning of the month, you can eat strawberries in the evening

5. Reference materials

System Properties Comparison LevelDB vs. Redis vs. RocksDB
Difference between O(n) and O(log(n)) – which is better and what exactly is O(log(n))?
What does the worst code you’ve ever seen look like
MySQL > select * from database;
Bw-tree technology interpretation
In-depth understanding of Mysql index underlying data structures and algorithms
AVL tree