Abstract: ES has become a versatile data product, more and more popular in many fields, this paper aims to analyze the use of ES from the database field.

This article is shared by CSS_blog, a huawei cloud community guide to Elasticsearch Database Acceleration.

I. Program description

You can use ELK to build log analysis system in log scenarios. ES is the most popular search engine in the search field. In the field of big data, ES provides a data warehouse for massive logs to standard Hbase. In the database field, ES can be used as an analytical database of query analysis. ES has become a versatile data product and is becoming more and more popular in many fields. This paper aims to analyze the use of ES from the database field.

ES is not a relational database, data update adopts optimistic lock, through version number control, does not support transaction processing, which is also different from the traditional database (Mysql); However, ES supports precise query acceleration, arbitrary combination of multiple conditions query, a variety of aggregation query, the query speed is very fast, can replace the database complex condition query scenario requirements, and even can replace the database to do secondary index.

In accelerate the scene database typically produce the goods from the customers order data will be written to Mysql relational database, database to ensure transactional, but as the goods orders data is becoming more and more customers query conditions changing at the same time, not all of the fields are indexed, database query capabilities far cannot satisfy the query demands. We consider using ES to fully synchronize database data and perform multi-conditional aggregation query in ES. The results of the query can be associated search in Mysql. When querying the details of commodity orders, Mysql data and ES data can not be consistent in real time. Mysql binlog log information can be consumed through Canal and synchronized to ES to achieve one write and ensure data consistency. Mysql is used as an example for the following databases.

Analysis of index principle

Why the query capability of ES far exceeds that of Mysql relational database is mainly determined by the difference between their implementation principle and the data structure of the underlying storage. The implementation principle of the two products is compared as follows.

Elasticsearch processes all input text and indexes it into memory to improve search efficiency. In this point, ES is superior to the B+ tree structure of MySQL. MySQL needs to put the index into disk, and each time it reads, it needs to read the index from disk first and then find the corresponding data node. However, ES can directly find the approximate location corresponding to the target document in memory, maximizing efficiency. In addition, MySQL has more obvious disadvantages in combination query. It does not support complex combination query, such as aggregation operation. Even if you want to combine query, you need to build an index in advance.

1. Database index B+ tree

Indexes in a database are organized by trees, such as B tree, B-tree, and B+ Tree. The following describes the organizational structure of B+ Tree.

Let’s imagine why we need to create an index. Suppose we have a table called Book, which contains information about books we hold, such as name, author, date of publication, etc. We have 10,000 records. If we need to find a book named Database, our SQL would be:

Select name, author form book where name = 'database';Copy the code

We need to scan the entire table to make a full comparison. If we index name, the book titles are sorted in order, and we only need to find the corresponding position when querying, we can quickly obtain the results.

The essence of indexing is to filter out the desired results by constantly narrowing the range of data to be retrieved, and at the same time turning random events into sequential events. That is to say, with this indexing mechanism, we can always lock data in the same way of searching.

SQL > select * from B+tree;

B+ Tree data is stored only in leaf nodes. In this way, on the basis of B-tree, each node stores more key words, and the tree has fewer levels, so the query data is faster. All key Pointers have leaf nodes, so the search times are the same, so the query speed is more stable.

2. Elasticsearch index principle

ES index is stored in the way of inverted index.

Index all the input data and match them with documents. When we search for data, we directly search the dictionary (Term) to find the document ID corresponding to Term, and then find the data. This is similar to the way Mysql uses the B+tree tree to build indexes, but if the dictionary Term is large, the search for Term will be slow. ES further recommends the dictionary index (FST) to improve the search ability of the dictionary.

Term Index is stored in the memory in the form of a tree. FST+ compressed common prefix method is used to greatly save memory. Term Index is queried to the block where the Term Dictionary resides and then the number of I/OS is reduced.

After sorting Term Dictionary, the time complexity of retrieval is reduced from N to logN by dichotomy.

Third, query comparative analysis

The following compares ES and databases in common database search scenarios:

  • The full text retrieval

ES supports full-text search and can classify data into words. Each word is indexed in a dictionary through FSP, while Mysql relational database does not support this. Imagine that if the search is not for the whole field but for several keywords in the field, Mysql search must be full table scan.

  • Accurate search

If Mysql has built an index for this field, there is little difference in performance between ES search and Mysql search. Mysql may be faster, but ES is a distributed system that can support pB-level data search, and ES has more obvious advantages for large table search.

  • Multi-condition query

We know that Mysql needs to index fields to speed up the search process, while ES default is full index, for multi-condition query, trigger Mysql to set up a joint index, otherwise multiple field search, Mysql first select a field search, the second field filter results get the final result.

ES uses multiple field result sets to merge and uses Bitmap or Skiplist to speed up the search, which has obvious advantages over Mysql.

  • Aggregate search

SQL > select * from B+tree; SQL > select * from B+tree; SQL > select * from B+tree;

In order to speed up the aggregative search, ES solves the problem of aggregative search through Doc Value. DocValue is column storage.

The storage results are as follows:

Docvalue data is sorted according to document ID. Docvalue changes random reading into sequential reading. In ES, due to the existence of sharding, data is divided into multiple pieces and placed on different machines. But the user experience is as if there is only one library. For aggregated queries, the processing is done in two stages:

  • Shard’s local Lucene Index computes the local aggregation results in parallel.

  • Receive the local aggregation results of all shards and aggregate the final aggregation results.

This two-phase aggregation architecture allows each shard to return a much smaller aggregation instead of the original data. This greatly reduces the consumption of network bandwidth.

  • Multi-copy acceleration

As we know, ES has the concepts of Shard and Replica. On the one hand, duplicate can ensure data reliability; on the other hand, multiple copies can speed up search and improve search concurrency.

Database to Elasticsearch synchronization scheme

Considering the actual usage mode of users and the size of data volume, Mysql data can be selected from ES in various ways.

1. Canal=>Elasticsearch

Use Canal to directly consume Mysql binlog log to write ES. In this way, if Mysql write volume is large, Canal write block will be faced.

2. Canal =>Kafka=>Elasticsearch

Canal data is written to Kafka, and Kafka data is consumed to ES using another app

5. Summary of problems

1. Index SHard problem

Before the data is written to ES, we need to plan the number of shards and Replicas of the data in advance. Replicas can be changed dynamically, but the number of Shards cannot be changed after the data is created.

As the amount of Mysql data increases, too few shards can cause the problem of too much data per shard.

If an index is 600G and there are only three shards, each of which is 200G, the query capability will be greatly reduced and data migration will be adversely affected.

We can scroll to create indexes by month and use index aliases to associate all indexes.

test_data-202101
test_data-202102
Copy the code

2. Query the acceleration problem

In a scenario where databases are accelerated using ES, we want the ES query capability to be as fast as possible. We need to tune the ES query when it does not meet the requirements.

Common methods are:

Click to follow, the first time to learn about Huawei cloud fresh technology ~