This section contains:

  • Principle of inverted indexing
  • Inverted index data structures
  • Posting list merging algorithms
  • Posting list compression algorithms

One of the key differences between ES (Elasticsearch) and traditional databases like MySQL is the index: ES uses a data structure called an inverted index.

First look at the forward index:

doc_id   body         download_count
1        mick li      12
2        jetty chen   21
3        steven li    31

Similar to a hash table, a forward index looks up a document by its doc_id.

Inverted index

Now reverse the question: which rows have a body that contains li?

In a traditional database, we might write:

select * from doc where body like '%li%';

With this query, the database will likely fall back to a full table scan to check whether each body contains li, which is inefficient.

But if we implement an index structure like this:

Term      Posting List
mick      [1]
jetty     [2]
steven    [3]
li        [1, 3]
chen      [2]
  • Term: the smallest unit of storage and query in an index.
  • Posting List: a document is usually made up of several words; a posting list records which documents a term appears in, and where. Each record is called a posting. Posting lists store not only document ids but also information such as term frequency.
  • Inverted File: the posting lists of all terms are usually stored in an ordered file on disk. The inverted file is the physical file that stores the inverted index.

When querying for body data containing li, we only need to look the term up in this index structure, fetch its Posting List, and then map the doc ids back to the actual rows.

This structure is called an inverted index.
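The idea above can be sketched in a few lines of Python. This is a minimal illustration, not how Lucene actually stores data; the sample documents are the three rows from the forward-index table:

```python
from collections import defaultdict

# Hypothetical documents, matching the forward-index table above.
docs = {1: "mick li", 2: "jetty chen", 3: "steven li"}

def build_inverted_index(docs):
    """Map each term to the sorted list of doc ids it appears in (its posting list)."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in docs[doc_id].split():
            index[term].append(doc_id)
    return dict(index)

index = build_inverted_index(docs)
print(index["li"])  # -> [1, 3]
```

Looking up li now costs one dictionary access instead of a scan over every row.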

Term Dictionary

But how do we find li efficiently in this index structure? As long as the terms are kept sorted, we can use binary search to locate a term in O(log n).

The process of splitting a text into separate terms is called word segmentation (tokenization).

Terms are generally kept in sorted order. After sorting, a term lookup does not need to scan from the beginning; binary search can be used instead. All the terms together form the Term Dictionary.
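The binary search over a sorted term list can be sketched with Python's standard `bisect` module; the term list here is an assumption built from the example index:

```python
import bisect

# Sorted term dictionary from the example index above (illustration only).
term_dictionary = ["chen", "jetty", "li", "mick", "steven"]

def find_term(terms, target):
    """Binary search: O(log n) lookup in a sorted term list."""
    i = bisect.bisect_left(terms, target)
    if i < len(terms) and terms[i] == target:
        return i
    return -1

print(find_term(term_dictionary, "li"))  # -> 2 (position in the dictionary)
```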

When the amount of text is large, word segmentation produces a huge number of terms, and an inverted index of that size will certainly not fit in memory.

Term Index

Since the entire Term Dictionary cannot be kept in memory, we can build an index over the Term Dictionary and keep that index in memory.

In this way, the Term Dictionary can be queried efficiently, and the Posting List is then reached through the Term Dictionary.

Compared with MySQL's B+ tree, this also saves several disk I/Os.

The Term Index can be stored as a burst-trie, a variant of the prefix tree (trie) that compresses suffixes to reduce the trie's height and improve query performance.

The Term Index is usually cached entirely in memory. During a query, it quickly locates the approximate range in the Term Dictionary; the terms in that range are then read from disk, and the target term is found by binary search. This greatly reduces the amount of disk I/O.
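The two-level lookup can be sketched as follows. This is a toy model, assuming a single-letter prefix index standing in for the real trie, and a Python list standing in for the on-disk term dictionary:

```python
import bisect

# "On-disk" sorted term dictionary (a list here, for illustration).
term_dictionary = ["chen", "jetty", "li", "lily", "mick", "steven"]

# In-memory term index: first letter -> offset of the first term with that letter.
term_index = {}
for offset, term in enumerate(term_dictionary):
    term_index.setdefault(term[0], offset)

def lookup(term):
    """Use the small in-memory index to narrow the range, then binary-search it."""
    start = term_index.get(term[0])
    if start is None:
        return -1
    # End of the block: the first offset belonging to a later letter.
    end = next((o for o in sorted(term_index.values()) if o > start),
               len(term_dictionary))
    i = bisect.bisect_left(term_dictionary, term, start, end)
    if i < end and term_dictionary[i] == term:
        return i
    return -1

print(lookup("li"))  # -> 2
```

Only the block [start, end) would need to be read from disk, which is the I/O saving the text describes.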

Joint index query

For a compound (joint) index query, we need to obtain the Posting List for each condition and then merge them with an intersection operation.

ES merges multiple Posting Lists mainly through skip lists and bitmaps.

  • Skip List: iterate over the Posting Lists being merged at the same time, using skip-list structures to leap ahead and compare entries, producing the intersection.
  • Bitmap: convert each Posting List into a Bitmap, then AND the bitmaps together.

Skip List

Elasticsearch stores a skip-list structure alongside the Posting List data itself. A skip list is an ordered linked list that supports binary-search-like lookups.

The merge works as follows: pick the shortest Posting List, walk through its elements left to right, and binary-search for each element in the other Posting Lists.
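That merge strategy can be sketched like this, with `bisect` standing in for the skip list's ability to leap over non-matching ranges (a simplification, not ES's actual implementation):

```python
import bisect

def intersect(short_list, long_list):
    """Walk the shortest posting list; binary-search each doc id in the longer
    one, mimicking how a skip list lets the merge jump past non-matches."""
    result = []
    for doc_id in short_list:
        i = bisect.bisect_left(long_list, doc_id)
        if i < len(long_list) and long_list[i] == doc_id:
            result.append(doc_id)
    return result

print(intersect([1, 3], [1, 2, 4, 5]))  # -> [1]
```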

For long Posting Lists, Frame Of Reference (FOR) encoding is used to compress the data, reducing disk usage and index size.

Bitmap

When a query filters on two fields, we can optimize the merge with bitmaps.

For example, to query body=li and download_count=12, we retrieve the Posting List for each condition and then intersect the two result sets.

condition            doc_ids
body=li              [1, 3]
download_count=12    [1]

If the data is stored as bitmaps, the intersection can be computed with a bitwise AND.

[1, 3] ⇒ 101
[1] ⇒ 100
101 AND 100 = 100 ⇒ [1]

The resulting doc_ids is [1], so both the merge efficiency and the storage efficiency are naturally higher than before.
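The bitmap intersection above can be demonstrated with plain Python integers as bit vectors (bit i set means doc id i is present; a sketch, not ES's Roaring-based implementation):

```python
def to_bitmap(doc_ids):
    """Encode doc ids as set bits of an integer (bit i <=> doc id i)."""
    bm = 0
    for doc_id in doc_ids:
        bm |= 1 << doc_id
    return bm

def from_bitmap(bm):
    """Decode set bits back into a sorted doc id list."""
    ids, i = [], 0
    while bm:
        if bm & 1:
            ids.append(i)
        bm >>= 1
        i += 1
    return ids

a = to_bitmap([1, 3])      # bits 1 and 3 set
b = to_bitmap([1])         # bit 1 set
print(from_bitmap(a & b))  # -> [1]
```

The whole intersection is a single AND instruction per machine word, which is why it beats walking the lists.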

MySQL has no special optimization for this kind of query; see the companion article in this series: Why MySQL Is Not Suitable for Big-Data Text Retrieval.

Compression algorithm

But there are several problems with Bitmaps:

  1. A BitMap's length grows with the number of documents, and every Posting List must carry a BitMap of the same length. With 40 million docs, each BitMap needs 40 million bits, about 5 MB. If the index has 100,000 terms, that is 100,000 * 5 MB = 500 GB, which disks generally cannot bear.
  2. BitMaps waste space on sparse data. A posting list might contain only 2 of the 40 million docs, yet still pay for the full bitmap. Such a result is clearly unacceptable.

So you have to use a compression algorithm.

The FOR (Frame Of Reference) compression algorithm:

The main idea is to take the difference between adjacent doc ids (delta encoding) and then split the deltas into blocks so that each block's data takes the least total storage.

The first byte of each block records the bit width used by each value in that block.

73  300  302  332  343  372      // six ints, 4 bytes each = 24 bytes
73  227  2    30   11   29       // delta-encoded
block 1: 8 | 73 227              // 1 + 2 * 8 bits / 8 = 3 bytes
block 2: 5 | 2 30 11 29          // 1 + 4 * 5 bits / 8 = 4 bytes (rounded up)
// compressed total: 7 bytes vs. the original 24 bytes
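The byte accounting in the example can be reproduced with a short sketch (it only models the sizes; real FOR encoders such as Lucene's also pack the bits and choose block boundaries differently):

```python
def delta_encode(doc_ids):
    """Store the first id plus the differences between neighbours."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def block_bytes(block):
    """1 header byte for the bit width, then each delta packed at that width."""
    width = max(block).bit_length()
    return 1 + -(-len(block) * width // 8)  # ceiling division

deltas = delta_encode([73, 300, 302, 332, 343, 372])
print(deltas)  # -> [73, 227, 2, 30, 11, 29]
# Block 1 = [73, 227] (8-bit values), block 2 = [2, 30, 11, 29] (5-bit values).
print(block_bytes(deltas[:2]) + block_bytes(deltas[2:]))  # -> 7 (bytes)
```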

For sparse data, FOR loses its advantage once the deltas exceed 65535, the largest value that fits in 2 bytes (2^16 = 65536).

The RBM (Roaring Bitmap) compression algorithm:

The main idea is to divide each sparse doc id by 65536, splitting it into a high-16-bit part and a low-16-bit part: the quotient (high 16 bits) becomes the bucket number, and the remainder (low 16 bits) is stored inside the bucket. Depending on the number of elements in a bucket, a container is chosen to store the data. The two main container implementations are Array and Bitmap:

ArrayContainer: an array of n shorts (16 bits each); its storage space is not fixed and grows with n.

BitmapContainer: an array of 1024 longs (64 bits each); a fixed 8 KB of storage.

A short is 2 bytes, i.e. 16 bits.

When the cardinality n < 4096 = 2^12: n * 2^4 bits < 65536 bits = 2^16 bits, so the ArrayContainer stays under 8 KB.

When the cardinality n ≥ 4096 = 2^12: n * 2^4 bits ≥ 65536 bits = 8 KB, and this size keeps growing with n, so the fixed-size BitmapContainer wins.

A long is 8 bytes, i.e. 64 bits; the BitmapContainer uses an array of 1024 longs underneath.

1024 * 64 = 2^10 * 2^6 = 2^16 = 65536, which exactly covers the range of low-16-bit values a container must hold.

1000  62101  131385  132052  191173  196658
(0, 1000) (0, 62101) (2, 313) (2, 980) (2, 60101) (3, 50)   // split into (high 16 bits, low 16 bits)
bucket 0: 1000 62101     // array container: 2 * 2 = 4 bytes
bucket 1: empty
bucket 2: 313 980 60101  // array container: 3 * 2 = 6 bytes
bucket 3: 50             // array container: 1 * 2 = 2 bytes
// total: 12 bytes
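The bucketing step of the example can be verified with a few lines of Python (a sketch of the split only; a real Roaring Bitmap also implements the bitmap container and run-length containers):

```python
def rbm_split(doc_ids):
    """Split each id into (high 16 bits, low 16 bits); bucket by the high part."""
    buckets = {}
    for doc_id in doc_ids:
        high, low = divmod(doc_id, 65536)
        buckets.setdefault(high, []).append(low)
    return buckets

buckets = rbm_split([1000, 62101, 131385, 132052, 191173, 196658])
print(buckets)  # -> {0: [1000, 62101], 2: [313, 980, 60101], 3: [50]}

# Every bucket here has fewer than 4096 elements, so array containers
# (2 bytes per short) are chosen: 2*2 + 2*3 + 2*1 = 12 bytes in total.
array_bytes = sum(2 * len(v) for v in buckets.values())
print(array_bytes)  # -> 12
```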

References

ElasticSearch index VS MySQL index

Why is ElasticSearch better for full-text indexing than MySQL