Abstract:ES has become an all-around data product, which is becoming more and more popular in many fields. This paper aims to analyze the use of ES from the database field.

This article was shared from Elasticsearch on Huawei Cloud by css_blog.

1. Program description

What is the main function of Elasticsearch? Different scenarios have different positioning. In the log scenario, we can use Elk ecology to build the log analysis system. In the field of big data, ES can provide data warehouse with massive logs against HBase; In the database domain, ES can be used as an analytical database for query analysis. ES has become an all-around data product, which is becoming more and more popular in many fields. This paper aims to analyze the use of ES from the database field.

ES is not a relational database, data update using optimistic locking, through version number control, does not support transaction processing, which is the difference between ES and traditional database (MySQL). But ES supports precise query acceleration, multi-condition arbitrary combination query, a variety of aggregate query, query speed is very fast, can replace the database complex condition query scene demand, can even replace the database as a secondary index.

In accelerate the scene database typically produce the goods from the customers order data will be written to Mysql relational database, database to ensure transactional, but as the goods orders data is becoming more and more customers query conditions changing at the same time, not all of the fields are indexed, database query capabilities far cannot satisfy the query demands. We consider using ES to synchronize database data in full volume, and conduct multi-condition aggregate query in ES, so that the query results can be linked search in MySQL, and when searching for details of commodity orders, MySQL data and ES data are not required to be consistent in real time, and MySQL binlog information can be consumed through Canal and synchronized to ES. Realize a write to ensure data consistency. The following databases are illustrated with MySQL as an example.

Second, index principle analysis

The reason why the query ability of ES is far more than that of MySQL relational database is mainly determined by their implementation principle and the difference in the underlying stored data structure. The implementation principle of the two products is compared as follows.

Elasticsearch improves search efficiency by processing all incoming text and putting it in memory by indexing it. In this respect, ES is better than MySQL’s B+ tree structure. MySQL needs to put the index into the disk, and each read needs to read the index from the disk first and then look for the corresponding data node. However, ES can directly find the approximate location of the target document in the memory to maximize the efficiency. Moreover, MySQL has a more obvious disadvantage when it comes to composite queries. It does not support complex composite queries such as aggregation operation. Even if a composite query is needed, an index must be established in advance, but ES can complete such complex operations.

(1) Database index B+ tree

The index in the database is organized by tree, commonly used B tree, B-tree, B+tree, the following introduces the organization structure of B+tree.

Let’s first imagine why we need to create an index. Suppose we have a table book, which stores the information of the books we keep, such as the name, author, publication date, etc. We have 10000 records. If we need to find a book named “database”, then our SQL is as follows:

Select name, author form book where name = 'database';

We need to scan the entire table to make a full comparison. If we set up an index on name, the titles are already sorted in order. When querying, we just need to find the corresponding position and we can get the results quickly.

The essence of indexing is to filter out the desired results by constantly narrowing the range of data we want to obtain, and at the same time to change random events into sequential events. That is to say, with this indexing mechanism, we can always lock data in the same search way.

The database is indexed by B+ Tree:

B+ Tree data is stored only in leaf nodes. In this way, on the basis of B tree, each node stores more keywords, the tree has fewer levels, so the query data is faster, all the keyword Pointers have leaves, so the number of lookups is the same, so the query speed is more stable.

(2) Elasticsearch index principle

The index built by ES is stored in the way of inverted index.

All the input data are indexed, and all the documents are corresponding to each other. When we search for data, we directly look up the dictionary (TERM), find the corresponding document ID, and then find the data. This is similar to the way MySQL uses B+tree to build indexes, but if the dictionary Term is very large, the search of Term will be slow. ES further suggests the dictionary index (FST) to improve the search ability of the dictionary.

Term Index is stored in memory in the form of tree. FST+ compression of public prefix method is used to greatly save memory. I/O times are reduced by searching to the block where Term Dictionary is located through Term Index and then looking for Term on disk.

After sorting by Term Dictionary, the time complexity of retrieval is reduced from original N to logN by dichotomy.

Three, query comparative analysis

The following common scenarios for database search compare ES and database:

  • The full text retrieval

ES supports full-text search, which can divide data into words. Each word can be indexed by using FSP to create a dictionary index, while MySQL relational database does not support this. Imagine if you were searching for several keywords in a field instead of an entire field.

  • Accurate search

If MySQL has indexed this field, there is little difference between using ES search and MySQL search performance. MySQL may be faster, but ES is a distributed system that can support PB level data search, and ES has more obvious advantages for large table search.

  • Multiconditional query

As we know, MySQL needs to index fields to speed up the search process, while ES is fully indexed by default. For multi-conditional queries, MySQL will be triggered to set up a joint index. Otherwise, MySQL will select one field to search for, and the results will be filtered by the second field to get the final results.

ES uses multiple field result sets to merge operations, using Bitmap or Skiplist to speed up the search, compared with MySQL has obvious advantages.

  • Aggregate search

If you have not set up an index, you need to do a full table scan sort. If you have set up an index, you need to do a range query on B+ Tree.

In order to speed up the aggregate search, ES uses Doc value to solve the aggregate search problem. DocValue is just columnar storage.

The storage results are as follows:

DocValue data is sorted by document ID. DocValue converts random reads into sequential reads. In ES, because of sharding, the data is split into multiple pieces and placed on different machines. But the user experience is like a single library. For aggregate queries, the processing is done in two phases:

  • The local Lucene Index of the Shard evaluates the local aggregation results in parallel.
  • Receive the local aggregation result of all shards and aggregate the final aggregation result.

This two-phase aggregation architecture allows each shard not to return the original data, but to return a much smaller aggregate result. This greatly reduces the consumption of network bandwidth.

  • Multiple copy acceleration

We know that ES has the concept of Shard and Replica. On the one hand, duplicates can guarantee the reliability of data; on the other hand, multiple duplicates can speed up the search and improve the search concurrency ability.

Synchronize database to ElasticSearch

Combined with the user’s actual usage style and the size of the data volume, MySQL data can be selected in a variety of different ways to ES.

  • Canal=>Elasticsearch

Use Canal to consume the MySQL binlog log directly to write ES, which could cause Canal’s write blocking if MySQL writes a lot.

  • Canal =>Kafka=>Elasticsearch

Canal data is written to Kafka, and consumption of Kafka data is synchronized to ES using another app

Five, problem summary

1. Index shard problem

Before data is written into ES, we need to plan the number of shards and replicas of the data in advance. Replicas can be dynamically modified, but the number of shards can not be modified after the shards number is created.

As MySQL’s data volume increases, too few shards will cause the problem of too many shards per shard.

If an index is 600G and there are only three shards, each shard is 200G, it will greatly consume the query capacity and is not conducive to data migration.

We can scroll to create indexes by month, associating all indexes with index aliases.

test_data-202101
test_data-202102

2. Query acceleration problem

In a scenario where ES is used for database acceleration, we want the ES query capability to be as fast as possible. We need to tune the ES query when it does not meet the requirements.

Commonly used methods are:

Click on the attention, the first time to understand Huawei cloud fresh technology ~