preface

Index may be familiar to everyone, in the use of relational database, some frequently used as query conditions of the field we will go to build index to improve query efficiency. In relational database, we generally use b-tree index for storage, so b-tree index is also a kind of index data structure that we often contact with. However, in ES, we do not choose to use b-tree index when conducting full-text search, but use inverted index. This article takes a look at how inverted indexes are stored and retrieved in ES.

Why are full-text indexes not stored using B+ trees

Relational databases, such as MySQL, use B+ tree indexes. The following figure is a simple example of a B+ tree:Above said the index value of the blue, white said pointer, bottom leaf nodes in addition to storing the index value will store all data (InnoDB engine), and the root node and end does not store data, so B + tree is designed in order to make the root node and end can store more nodes, because the search from the root node begin to search, Each node query is an I/O operation. Therefore, a node can store more index values and reduce the number of DISK I/OS.

At this point we can ask the question, if the index value itself is large, then B+ tree performance will be dramatically reduced? The answer is yes, because when the index value is large, a node can store much less data (a node defaults to 16KB), the B+ tree becomes deeper, and it takes more I/OS per query. And full-text index is needed to support the large text indexing, from the space B + tree is not suitable for as a full-text index, and B + tree for every search begins with the root node to search, so will follow the left matching principle, and we use full text search, tend not to follow the principle of the left matching, so may cause disabling indexes.

To sum up, B+ trees are not suitable as full-text search indexes for the following two reasons:

  • The text field of a full-text index is usually long, and the index value itself occupies a large space, which increases the depth of the B+ tree and affects the query efficiency.
  • Full-text indexes usually require full-text search and do not follow the left-most matching principle. Using B+ trees may lead to index failure.

The full text retrieval

In the full-text retrieval, we need to cut the word processing of the document, and then cut the word and document association, and index, so at this time we should how to store the corresponding relationship between keywords and documents?

Is row index

As you probably know, full text search (e.g. Elasticsearch) uses an inverted index, so if there is an inverted index, there is a forward index.

A forward index is also called a forward index. Taking a document as an example, the forward indexing can be understood as using the document ID as the key word of the index, while recording the words in the document (processed by the word segmentation), the frequency of occurrence of each word and the position of each word in the document.

But we normally in the search, are all input one word and then to get the document, so it is clear that is row index is not suitable for doing this kind of query, so generally we use the full-text retrieval is inverted index, but inverted index is not suitable for aggregation operation, so in fact aggregation operation used in the es were row index.

Inverted index

Inverted indexes are also known as inverted indexes. In contrast to a straight index, an inverted index uses a word as an index key and also records which documents contain the word.

Here, we take an English document as an example. The reason why we choose to use English document is that English word segmentation is relatively simple and can be divided by space directly, while Chinese word segmentation is relatively complicated.

(Elasticsearch) Elasticsearch (Elasticsearch)

Elasticsearch is the distributed search and analytics engine at the heart of the Elastic Stack.
Elasticsearch provides near real-time search and analytics for all types of data.
Copy the code

Based on the above two statements, suppose we can get an index structure like the following:

Among them:

  • Term Index: Posting List Term Index: Posting List term index: Posting List term index: Posting List Term index: Posting List Term index: Posting List Because in es, for each field index (stored in memory by default), so when we is a large amount of data, you need to quickly locate the corresponding index of the word memory location, so they individually for each term has established the index, the index can generally choose a hash table or B + tree index storage.
  • Term Dictionary: Records all the words in the document that have been de-duplicated (processed by the word splitter).
  • Posting listTF: Logs the document containing the current word and the location (offset) where the current word appears in the document. This information is an array. The table above lists only the documents for simplicity

Id, it’s actually going to store a lot of information.

If you search for Elasticsearch Elastic, you’ll see the following steps:

1. Perform word segmentation on the input keyword to obtain two words: ElasticSearch and Elastic (both uppercase letters are converted to lowercase letters after the word segmentation).Elasticsearch is found in both documents and Elastic is only found in document 1.

3. The final result is that both documents 1 and 2 are returned, but documents are more relevant because they match one or two words, so documents will be ranked higher than documents 2. However, it is important to note that the actual correlation score algorithm is not so simple, but there is a special algorithm to calculate, the hit word does not necessarily appear first.

How does an inverted index store data

Knowing the search process for an inverted index, how is the data stored for an inverted index?

Before we answer that question, let’s look at another question: what is the purpose of indexing? The most direct purpose is to speed up the retrieval speed, and in order to achieve this purpose, without considering other factors, it is necessary to occupy as little space as possible. In order to reduce the occupied space, it may need to be compressed before storage, and after compression involves decompression. Therefore, the compression algorithm used also needs to achieve the purpose of fast compression and decompression.

FOR the compression

FOR compression algorithm is Frame Of Reference. This algorithm is relatively simple, but also has some limitations, because it has certain requirements on the stored document ID.

Suppose there are 100 million documents, and the corresponding document ID is incremented from 1. Assuming that the keyword elasticSearch exists in 1000W files, and 1000W files are from 1 to 1000W, how much space would it take to store the keyword without any compression algorithms?

Int takes up 4 bytes, and 1000W requires 2 to the 24th power. That is to say, if you store it in binary, you need 24 bits without considering the sign bits. Since Posting list TF is an array, Therefore, in order to parse the data, the data of document ID =1 also needs to be stored with 24 bits, which will waste a lot of space.

To solve this problem, we need to use the FOR algorithm, which does not store the document ID directly, but stores the difference, which is one FOR a regular document ID like this, and one can be converted to binary with just one bit, In this way, only 1000W bit space is enough FOR storage. Compared with the case of storing the original document ID directly, the FOR algorithm in this scenario greatly reduces the space.

The above example is an ideal case, but in reality the probability is low. Let’s look at the following set of document ids:

1.9.15.45.68.323.457
Copy the code

This array evaluates the difference to the following array:

8.6.30.23.255.134
Copy the code

At this time, if we still use the ordinary difference algorithm directly, although it can save space, but it is not the optimal solution, then is there a more efficient method for storage at this time?

If we look at the difference array, we see that the array can be further divided into two groups:

  • [8,6,30,23] : the maximum value of this group is 30 and only 5 bits are needed for storage.
  • [255,134] : This group has a maximum of 255 and requires 8 bits to store.

After such splitting, the original data needs 327=224 bits (the original data is stored directly with INT), and the ordinary difference needs 86=48 bits. After grouping difference splitting, only 54+82=36 bits are needed, further reducing the space. This advantage will become more obvious with the increase of data volume.

However, no matter which scheme is used, there is a problem, that is, after the difference or split, how to restore the data, how to know the space occupied by the elements in the difference array when decompressing?

Therefore, for each data, one byte of space is needed to store the number of bits occupied by the elements in the current array. Therefore, grouping is not as fine as possible. If each difference element is stored separately, it will waste more space than not grouping. The impact of this single byte of storage is smaller or negligible.

Matter of compression

None of the differences in the examples above are very different, but what if we calculate the differences and end up with an array where each element is very different? For example, the following array of document ids:

1000.62101.131385.132052.191173.196658
Copy the code

You can calculate the difference of this array. After calculation, you will find that one is large and one is small, and there is a big difference between the two values. Therefore, this method is not suitable FOR FOR compression, so we need to have another compression algorithm to improve efficiency, which is RBM compression. The core idea of RBM compression algorithm is to divide 32-bit unsigned integers into containers by 16 bits, that is, 65536 containers at most. Because 65536 is actually 2 to the power of 16, and an unsigned int type exactly needs 32 bits for storage, divided into high and low bits just 16 bits on both sides, that is, 65536 at most.

Find a container based on the height 16 bits (for example, if the height 16 bits count is 1, find Container_1, find container_2, and so on). If the container does not exist, create a new container. And put the lower 16 bits into the container, if the container exists, directly put the lower 16 bits into the container.

There will be a phenomenon that the maximum number of containers is 65536, and the maximum number of elements in each container is 65536.

The above array is evaluated to produce the following container (container_1 has no elements) : If we feel that the high and low 16 bits is not easy to understand, then we can understand, we put all the elements in the array divided by 65536, to its mold, every get a mold to create a container, and the rest of the number into the corresponding mold of the corresponding container. Because an int is 2 to the 32nd, which is 65536 squared.

So how do we store the elements in the container? You can choose direct storage or other more efficient storage methods. In THE RBM algorithm, there are three types of containers, which use different methods to store elements in containers:

  • ArrayContainer

ArrayContainer uses a short array for storage, because each container has a maximum of 65535 elements, which are stored in 2 bytes. The characteristic of this storage method is that as the number of elements increases, the space required will always increase.

  • BitmapContainer

BitmapContainer is stored in bitmap mode, that is, a fixed 65536 length container is created. Each element in the container is stored in only one bit. If there is an element in a certain location, it is stored as 1, and if there is no element, it is stored as 0. This storage mode is characterized by a fixed space is 65536 bits, that is, the size of the fixed 8KB.

  • RunContainer

RunContainer is a special container that can be used in specific scenarios. For example, if the document ID is sequential from 1 to 100, you can store 1,99, which means that the 1 is followed by 99 consecutive digits. For example,1,2,3,4,5,6,10,11,12,13 can be compressed to 1,5,10,3, meaning that 1 is followed by five consecutive digits and 10 by three consecutive digits.

For example, ArrayContainer takes up less space than BitmapContainer when storing less than 4096 elements. When storing more than 4096 elements, ArrayContainer takes up less space. Using ArrayContainer would require more than 8KB of space, so using BitmapContainer would take up less space.

How is an inverted index stored

Previously, we talked about what compression algorithm is used to compress inverted indexes in ES. Then, how does the data after compression land on disk? What data structures are used?

Tria Tree

Dictionary Tree, also known as Prefix Tree, is a variant of hash Tree. It can be used for auto-completion, spelling check, longest Prefix matching, etc.

Dictionary trees have the following three characteristics:

  1. The root node contains no characters, and each node except the root node contains only one character.
  2. From the root node to a node, concatenate all characters on the path, namely, the string corresponding to the node.
  3. All children of each node contain different characters.

The following is a dictionary tree generated by typing the following words (AFGCC, AFG, ABP, TAGCC) in sequence on a data structure website: In the figure above, it can be seen that the root node has no letters. Except for the root node, the other nodes have two colors: white and green. What is the difference between the nodes with these two colors?

The green node indicates that the current node is a Final node, that is, the current node is the end node of a word. When searching, if the end node is found to be a Final node, the current letter exists; otherwise, it does not exist.

For example, if I search for ABP and look down from the root node, P is a Final node, then the string ABP exists in the tree. If I search for AFGC, I can also find these letters, but C is not a Final node, so the string AFGC does not exist.

For example, the last three characters of the FGCC suffix in the second column and GCC suffix in the third column are the same, but these repeated strings are stored separately and are not reused, so the dictionary tree does not solve the suffix sharing problem. Only prefix sharing is addressed (which is why dictionary trees are also called prefix trees). When the amount of data reaches a certain level, sharing prefix but not suffix will also bring a lot of space waste, so how to solve this problem?

FST

Finite State obtains the same suffix as string prefixes in order to solve the dictionary tree problem, Finite State obtains the same suffix as string prefixes.

FSM

FSM is a Finite State Machine. If you know about state patterns in design patterns, you should know about state machines. The magic of finite state machines is that states can all be enumerated and then flow between different states with different operations.

Here is a simple finite state machine (if a person does all of the following states in a day, then the states can switch between them, and the numbers in the figure below indicate the conditions for the transition) :

The finite state machine has the following two characteristics:

  1. States are finite and can be all enumerated.
  2. States can flow from state to state.

The FST we need to learn today is actually evolved through FSM.

Going back to our dictionary tree, if we now switch to FST, we get the following data structure:How did you get this picture up here? What do the numbers after the letters mean? What difference does it make if some nodes have numbers and some are blank? How does this diagram distinguish between Final nodes? Let’s build an FST step by step.

The key is the key, and the number behind it is the id of the stored document (which will eventually be converted to binary storage). However, the id represented by each number may be incomplete, for reasons explained below.

  1. First, we receive the first key-value pair to store the index, AFGCC/5, as shown below:

In the figure above, red represents the start node, dark gray represents the end node, and bold lines indicate that the node behind it is a Final node. There is a question as to why 5 should be stored on the first line (a line without a number is actually a null value). In fact, I can store it on any of the following lines, because the final search will add up all the numbers on the whole line to get the final value. This is why I said above that the value on each line may be incomplete, because a value may be divided into several numbers and added together, and stored on different lines.

First of all, the reason why this 5 is stored in the first paragraph is actually to increase the reuse rate, because the further forward you go, the more likely you are to reuse it.

2. Select AFG/10 and select AFG/10. At this point, we find that the node after G stores a 5, but the other line segment does not store a number. Why is this? 10=5+5, the first key pair AFGCC/5 will be stored in the Final node corresponding to the current key (there is a property output in the source). In the search, if we pass the value on the Final node that does not belong to us, we will not add the value on the Final node after G when we search the first index value AFGCC.

Alter table ABP/2 alter table ABP/2 At this time, because the string ABP shares A with the previous one, and the corresponding value of A is 5, which is already larger than 2, so as long as the string ABP shares A, it cannot be successfully stored in any way. Therefore, the first node 5 can only be split into 2+3, and 2 can be stored in the original position of A. Then the following 3 follows the previous principle, the higher the probability of storage reuse, so there is a second line, that is, the position corresponding to the character F, then all the conditions are met.

4. Finally, we store the last index key-value pair TAGCC/6, resulting in the following figure: Since GCC is the same suffix as before, and there is no value stored on the line after GCC, we can simply store the 6 on the first segment. Note that if there is a conflict again, we will need to reassign each segment value. At this point we have the same FST generated in the site shown above.

conclusion

FOR and RBM are two compression algorithms used in Elasticsearch. FOR and RBM are two compression algorithms used in Elasticsearch. FOR and RBM are two compression algorithms used in Elasticsearch. Finally, the reason why FST is selected in ES instead of dictionary tree to store index data is obtained.