Those of you who are familiar with MySQL know that it does not handle complex conditional queries very well: MySQL generally uses at most one index to filter a query, and the remaining conditions can only be filtered in memory while traversing the rows.

Because only one index can be used to filter a complex query condition, a large number of I/O operations are needed to read row data, and CPU time is spent on in-memory filtering, which degrades query performance.

ElasticSearch is a mainstream solution for querying complex conditions and is widely used in order and log query scenarios.

Let’s see why ElasticSearch is good for complex conditional queries.

ElasticSearch overview

Elasticsearch is a real-time distributed search engine that uses Lucene for indexing and searching. It offers near-real-time search, a dynamically adjustable cluster size, and elastic scaling.

Elasticsearch uses Lucene as its full-text search engine for processing plain text data. Lucene, however, is only a library: it provides interfaces for building indexes, running searches, and so on, but no distributed services; that is what Elasticsearch adds on top.

Let’s look at ElasticSearch. For beginners, it helps to start by mapping ElasticSearch concepts to MySQL concepts. Keep in mind, though, that ElasticSearch and MySQL differ in many details, so the two should not be forced into a one-to-one correspondence.

  • The Index in ElasticSearch is similar to the Database in MySQL;
  • The Type in ElasticSearch is similar to the Table in MySQL; note that this concept has been removed entirely in 7.x and was never quite the same thing as a Table;
  • A Document in ElasticSearch is similar to a data Row in MySQL; each Document is composed of multiple Fields, which are similar to Columns in MySQL;
  • The Mapping in ElasticSearch defines the fields of an index and their data types, similar to the table structure (Schema) in a relational database;
  • ElasticSearch uses its own domain-specific language, the Query DSL, for adding, deleting, updating and querying data, while MySQL uses SQL for the same operations; a small sketch of this correspondence follows the list.
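
To make the last point concrete, here is a small sketch (in Python, used only to hold the data) of how the same "author is Tom, score is 2.2" lookup might look in SQL and in the Query DSL. The index name books and the field names are assumptions chosen for illustration, not taken from a real cluster.

```python
# Hypothetical "books" index and field names, purely for illustration.
sql = "SELECT * FROM books WHERE author = 'Tom' AND score = 2.2"

# Roughly the same intent expressed as a Query DSL body (a JSON document,
# written here as a Python dict); it would be sent to GET /books/_search.
query_dsl = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"author": "Tom"}},
                {"term": {"score": 2.2}},
            ]
        }
    }
}
```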

ElasticSearch also has a number of concepts related to its distributed nature; we will leave those aside until we dig into that part of the system.

Inverted index

Unlike the indexes in MySQL, ElasticSearch uses an Inverted Index, which allows complex conditions to be filtered and queried faster than in MySQL. Full-text search also depends on the inverted index. Now, let’s take a look at what an inverted index is.

An inverted index, as described by Wikipedia, is an index data structure that stores a mapping from content, such as words or numbers, to its locations in a document or a set of documents. How exactly that differs from the kind of forward index MySQL builds on a primary key is something I am still trying to figure out, and the answer may have to wait until I understand its concrete implementation.

Again, let’s take book retrieval as an example. Suppose we have the following data, where each row is a Document, and each Document consists of an ID, an ISBN, an author name, and a score.

The inverted indexes built from the above data on the ISBN and Author fields are shown below; an inverted index is created separately for each field, independently of the others. Two concepts are important here: the Term and the Posting List. Each value of a field is a Term, such as N0007, and the list of document IDs associated with a Term is its Posting List, shown in red in the figure.
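
Since the original table only appears as a figure, here is a minimal sketch of the same idea with made-up sample documents; each field gets its own mapping from Term to Posting List.

```python
from collections import defaultdict

# Made-up sample documents standing in for the book table in the figure.
books = [
    {"id": 1, "isbn": "N0007", "author": "Alice", "score": 2.2},
    {"id": 2, "isbn": "N0015", "author": "Tom",   "score": 4.5},
    {"id": 3, "isbn": "N0007", "author": "Tom",   "score": 2.2},
    {"id": 4, "isbn": "N0023", "author": "Bob",   "score": 3.9},
]

def build_inverted_index(docs, field):
    """Map each distinct Term of `field` to a sorted Posting List of document IDs."""
    index = defaultdict(list)
    for doc in sorted(docs, key=lambda d: d["id"]):
        index[doc[field]].append(doc["id"])
    # Terms are kept in sorted order, which is what makes the Term Dictionary
    # binary-searchable (see the next paragraphs).
    return dict(sorted(index.items(), key=lambda kv: str(kv[0])))

print(build_inverted_index(books, "isbn"))    # {'N0007': [1, 3], 'N0015': [2], 'N0023': [4]}
print(build_inverted_index(books, "author"))  # {'Alice': [1], 'Bob': [4], 'Tom': [2, 3]}
```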

Terms are generally kept in sorted order; author names, for example, are sorted alphabetically. Once sorted, finding a Term no longer requires scanning from the beginning: a binary search can be used instead. The sorted series of Terms forms the Term Dictionary.

However, the Term Dictionary is often too large to fit entirely in memory, so to keep queries fast an index is built for it as well, known as the Term Index.

ElasticSearch implements the Term Index with a Burst-Trie structure, a variant of the prefix tree (Trie) that compresses suffixes to reduce the height of the tree and improve query performance.

Unlike a MySQL index, the Term Index does not need to contain every term; it stores only prefixes of the terms. It works like the lookup directory at the front of a dictionary: it quickly narrows down a position in the Term Dictionary, from which you then scan forward.

To sum up, the inverted index for the terms Alice, Alf, Arlan, Bob, Tom and so on looks like the figure below: the green part is the Term Index, the blue part is the Term Dictionary, and the red part is the Posting List.

Generally speaking, the Term Index is cached entirely in memory. During a query, the Term Index is used to quickly locate the approximate range within the Term Dictionary, and only then is the disk read to find the exact Term. This greatly reduces the number of disk I/O operations.
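
As a toy sketch of that two-step lookup, the prefix table below stands in for the Term Index (Lucene's real structure is far more compact); the point is only that the in-memory index narrows the range before the on-disk Term Dictionary is binary-searched.

```python
import bisect

# Sorted Term Dictionary (kept on disk in reality); terms from the example above.
term_dictionary = ["Alice", "Alf", "Arlan", "Bob", "Tom"]

# Toy Term Index: maps a term prefix to the offset of the first term with that prefix.
term_index = {"A": 0, "B": 3, "T": 4}

def lookup(term):
    start = term_index.get(term[0], 0)                          # step 1: in memory
    pos = bisect.bisect_left(term_dictionary, term, lo=start)   # step 2: "on disk"
    if pos < len(term_dictionary) and term_dictionary[pos] == term:
        return pos   # position of the Term, from which its Posting List is reached
    return None

print(lookup("Bob"))    # 3
print(lookup("Carol"))  # None
```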

Joint index query

Now that you know about ElasticSearch’s inverted index, let’s look at how it handles queries that combine multiple conditions. In the book example above, suppose we need to find books with a score of 2.2 whose author is Tom.

In theory, we only need to query the inverted indexes of the Score and Author fields separately, obtain the two Posting Lists, and intersect them.

MySQL does not support this kind of merge operation: it can only query by the index on one field and then filter in memory on the other condition. (As an aside, MySQL’s join capabilities are also rather weak; those who are interested can read this article.)

ElasticSearch can intersect these result sets in two ways, using either Skip Lists or Bitsets:

  • With the Skip List approach, it walks through the Posting Lists returned for Score and Author simultaneously, skipping ahead and comparing elements to compute the intersection.
  • With the Bitset approach, it converts the Posting Lists returned for Score and Author into bitsets and then ANDs the two bitsets together.

Skip list merge strategy

To support this, ElasticSearch stores not only the Posting List itself but also a skip-list structure built over it, a classic example of trading space for time.

Let’s first introduce the basic idea of a skip list: it is an ordered linked list that supports a form of binary search. On top of the original sorted list, a skip list adds several levels of sparser indexes, and searches go through these indexes. Starting at the highest level, you move forward to the last element that is still less than or equal to the search key, then drop down to the next, denser level and continue, until you reach the bottom list; this is how the lookup is sped up.
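
A minimal sketch of that search process, assuming the index levels have already been built (a real skip list builds them probabilistically and links nodes directly downward rather than re-locating positions with index() as done here):

```python
def skip_search(levels, target):
    """levels[0] is the sparsest index level, levels[-1] the full sorted list."""
    pos = 0
    for depth, level in enumerate(levels):
        # Move right while the next element at this level is still <= target.
        while pos + 1 < len(level) and level[pos + 1] <= target:
            pos += 1
        if level[pos] == target:
            return True
        if depth + 1 < len(levels):
            # Descend: continue from the same value's position in the denser level.
            pos = levels[depth + 1].index(level[pos])
    return False

posting = [
    [2, 9],                      # top (sparsest) index level
    [2, 5, 9, 11],               # middle index level
    [2, 3, 4, 5, 7, 9, 10, 11],  # the original sorted Posting List
]
print(skip_search(posting, 7), skip_search(posting, 8))  # True False
```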

For example, suppose the Posting List found for Score is [2, 3, 4, 5, 7, 9, 10, 11] and the one found for Author is [3, 8, 12, 13].

The merge process starts from the shortest Posting List, here the Author result set, taking its smallest ID as the current candidate value. The other Posting Lists then look for the first value greater than or equal to that candidate.

For example, 3 is first looked up in the Score result set and found, so 3 belongs to the intersection of the two. The next round then starts with the next value in the Author result set, 8: looking up 8 in the Score result set shows that the smallest value greater than or equal to 8 is 9, so 8 cannot be a common value; 9 is then looked up in the Author result set, where the smallest value greater than or equal to 9 is 12; finally, the Score result set contains no value greater than or equal to 12, so the search stops. The intersection of the two lists is therefore just [3].

During this process, each Posting List can use its skip list to quickly jump over ID values that cannot match, based on the current candidate ID, which speeds up the merge and intersection.
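
Putting this together, here is a sketch of the intersection using a generic "advance to the first ID greater than or equal to the target" step; bisect stands in for the skip-list jumps, and the lists are the ones from the example above.

```python
import bisect

def advance(posting, target, start=0):
    """Index of the first element >= target, or len(posting) if there is none."""
    return bisect.bisect_left(posting, target, lo=start)

def intersect(short, long):
    result, i, j = [], 0, 0
    while i < len(short) and j < len(long):
        j = advance(long, short[i], j)            # jump instead of scanning one by one
        if j < len(long) and long[j] == short[i]:
            result.append(short[i])               # common ID found
            i, j = i + 1, j + 1
        elif j < len(long):
            i = advance(short, long[j], i)        # leapfrog back to the shorter list
        else:
            break
    return result

score_ids  = [2, 3, 4, 5, 7, 9, 10, 11]   # Posting List for score = 2.2
author_ids = [3, 8, 12, 13]               # Posting List for author = Tom (the shorter one)
print(intersect(author_ids, score_ids))   # [3]
```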

ElasticSearch also uses Frame Of Reference (FOR) encoding to compress long Posting Lists, reducing disk usage and index size. We will discuss the concrete storage structure in more detail later.
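
The core of Frame Of Reference is delta encoding plus per-block bit packing. A rough sketch of the layout computation (leaving out the actual on-disk bit packing) might look like this:

```python
def frame_of_reference(posting, block_size=256):
    """Delta-encode a sorted Posting List and report the bits needed per block."""
    deltas = [posting[0]] + [b - a for a, b in zip(posting, posting[1:])]
    blocks = []
    for i in range(0, len(deltas), block_size):
        block = deltas[i:i + block_size]
        bits = max(d.bit_length() for d in block)   # bits needed per value in this block
        blocks.append((bits, block))
    return blocks

# [2, 3, 4, 5, 7, 9, 10, 11] -> deltas [2, 1, 1, 1, 2, 2, 1, 1], only 2 bits per value
print(frame_of_reference([2, 3, 4, 5, 7, 9, 10, 11], block_size=8))
```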

Bitset merge strategy

Besides merging data on the fly with skip lists, ElasticSearch can also cache the Posting List of a query condition so that it can be reused by later queries.

To reduce the memory consumed by this cache, ElasticSearch stores the cached Posting Lists as Roaring Bitmaps instead of plain arrays or bitsets.

Let’s first talk about why a plain array or bitset is not used. Here is a common interview question:

Given a set of 4 billion non-repeating integers in the interval [0, 2^32 - 1], how can we quickly determine whether a given number is in the set?

If we stored them in an array of 32-bit unsigned integers, it would consume 4 billion * 32 bits = 16 billion bytes, which is approximately 16,000 MB.

If a bitmap (Bitset) is used instead, the bit at the position of each value present in the set is 1 and every other bit stays 0. This consumes only 2^32 bits = 512 MB, about 3.2% of the previous figure.

However, the Bitset has its own drawback: sparse storage. If the set holds not 4 billion values but only 2 or 3, only a few bits in the Bitset are 1 and all the rest are 0, yet it still occupies 512 MB.
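
The numbers above are easy to check:

```python
n = 4_000_000_000
array_bytes  = n * 32 // 8     # 32 bits per value -> 16,000,000,000 bytes (~16,000 MB)
bitset_bytes = 2 ** 32 // 8    # 1 bit per possible value -> 536,870,912 bytes (512 MB)
# A plain bitset still pays those 512 MB even if the set holds only 2 or 3 values,
# which is exactly the sparsity problem RoaringBitmap addresses.
print(array_bytes, bitset_bytes)
```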

RoaringBitmap is designed to solve the problem of sparse storage. The diagram below shows the basic principles of RoaringBitmap.

As shown in the figure above, each 32-bit unsigned integer is first divided by 65536 to obtain a quotient and a remainder; equivalently, the value is split into its high 16 bits and low 16 bits. The high 16 bits select a bucket, called a Container, so there can be at most 2^16 = 65536 containers. To store a value, the Container is located by the value’s high 16 bits (a new Container is created if none exists) and the low 16 bits are placed inside it. In other words, a RoaringBitmap is a collection of Containers.

The concrete storage structure inside each Container is then chosen according to the cardinality of the data it holds (a sketch follows the list below).

  • If the cardinality is less than 2^12 (4096), an ordered array of unsigned short (16-bit) values is used, consuming at most 8 KB.

  • When the cardinality is greater than 4096, an ordinary bitset of 2^16 bits is used, with a fixed cost of 8 KB. Bitsets are sometimes further compressed with run-length encoding (RLE) to reduce the footprint.
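
A condensed sketch of the above: split each value into a high-16-bit container key and a low-16-bit payload, then pick the container layout by cardinality (using the 4096 threshold quoted above).

```python
from collections import defaultdict

def build_roaring(values):
    buckets = defaultdict(set)
    for v in values:
        high, low = v >> 16, v & 0xFFFF           # i.e. divmod(v, 65536)
        buckets[high].add(low)
    containers = {}
    for high, lows in buckets.items():
        if len(lows) < 4096:
            # sparse container: sorted array of 16-bit values (2 bytes each, <= 8 KB)
            containers[high] = ("array", sorted(lows))
        else:
            # dense container: a fixed 2^16-bit bitset (exactly 8 KB)
            bits = bytearray(8192)
            for low in lows:
                bits[low >> 3] |= 1 << (low & 7)
            containers[high] = ("bitset", bits)
    return containers

print(build_roaring([1, 65536, 65537, 131072])[1])  # ('array', [0, 1])
```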

ElasticSearch caches the Posting Lists of different query conditions as Roaring Bitmaps and then combines them to compute the final result set.

Afterword

ElasticSearch handles complex query conditions better than MySQL, but there are trade-offs: because of all the preparation done for queries at write time, ElasticSearch inserts data more slowly than MySQL, and data written to ES is not immediately searchable.

Welcome to follow Li Xiaobing and continue to share the principles and practical experience related to MySQL, Redis and ElasticSearch.

Personal blog, feel free to drop by.
