Make Elasticsearch fly! A practical guide to performance optimization

0. Preface

The end goal of Elasticsearch performance optimization is user experience: using it should feel good. On what "feeling good" means, the well-known product expert Liang Ning once said: "Being satisfied is what we call pleasure. When people are not satisfied they feel uncomfortable and begin to seek. If what they seek is gratified instantly, that feels good!"

What feels good about Elasticsearch is that it is fast, accurate, and complete.

Alibaba, Tencent, JD.com, Ctrip, Didi, 58, and others have published in-depth summaries of their Elasticsearch performance optimization practice, and all of them are good references. This article discusses performance optimization starting from what makes Elasticsearch feel good to use.

1. Optimization practice of cluster planning

1.1 Plan clusters based on the target data volume

At the start of a project, we are often asked how many nodes the cluster needs, how much memory and CPU to provision, and whether to use SSDs.

The main consideration is how much data you need to store. The number of nodes can be worked back from the target data volume.

1.2 Reserve a capacity buffer

Note: Elasticsearch has three disk watermark levels, at 85%, 90%, and 95% disk usage by default. Each watermark triggers a different emergency response.

Plan disk capacity with this in mind: keeping usage under 85% is reasonable. The thresholds can also be adjusted through configuration.
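As a rough illustration of working back from data volume to node count while staying under the 85% watermark, here is a minimal sketch. The function name, the example figures, and the choice to ignore index overhead are all assumptions made for illustration, not a sizing rule from the article.

```python
import math

def nodes_needed(target_data_gb, replicas, disk_per_node_gb, max_disk_usage=0.85):
    """Estimate how many data nodes keep disk usage under the low watermark.

    Assumption: on-disk size equals raw data size times (1 + replicas);
    real indexes may be larger or smaller depending on mappings and compression.
    """
    total_gb = target_data_gb * (1 + replicas)          # primaries plus replica copies
    usable_per_node_gb = disk_per_node_gb * max_disk_usage
    return math.ceil(total_gb / usable_per_node_gb)

# Hypothetical example: 10 TB of data, 1 replica, 2 TB of disk per node.
print(nodes_needed(10_000, 1, 2_000))  # 12
```

Real planning would also account for growth over the retention period, so treat the result as a lower bound.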

1.3 Do not share the machines of ES cluster nodes with other services

Unless the memory is very large.

For example, on an ordinary server with ES, MySQL, and Redis all installed, problems such as insufficient memory are bound to appear once the business data grows.

1.4 SSD disks are recommended

The official Elasticsearch documentation recommends SSDs; the main reason to hesitate is cost. Decide based on the service scenario: if the business needs high write and search rates, SSDs are recommended.

In Alibaba's business scenario, SSDs were about five times faster than mechanical disks, but the gain varies by business scenario.

1.5 Ensure proper memory configuration

Official recommendation: heap size = min(32GB, machine memory / 2).

Both Medcl and Wood have said explicitly that the heap does not need to be as large as 32/31GB. They suggest 26GB for hot data nodes and 31GB for cold data nodes.
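The heap rule above is simple arithmetic. The sketch below expresses it as a function; the function and parameter names are hypothetical, and the cap is left adjustable so the 26GB and 31GB suggestions can be plugged in.

```python
def recommended_heap_gb(machine_memory_gb, cap_gb=32):
    """Heap size per the rule min(cap, machine memory / 2)."""
    return min(cap_gb, machine_memory_gb / 2)

print(recommended_heap_gb(16))              # 8.0
print(recommended_heap_gb(128, cap_gb=26))  # 26
```

The other half of the machine's memory is left to the operating system, which Elasticsearch relies on for file system caching.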

There are no specific requirements for overall memory size, but the more memory there is, the better the retrieval performance. For reference, in a service scenario with 200GB+ of incremental data per day, each server has at least 64GB of memory. Leave enough memory outside the JVM heap, otherwise OOM errors will occur frequently.

1.6 The number of CPU cores should not be too small

The number of CPU cores determines the size of the ES thread pools, and therefore affects write and search performance. Suggestion: 16 cores or more.

1.7 For super-large business scenarios, cross-cluster search can be considered

Unless the business is very large, such as the PB+ scenarios at Didi and Ctrip, cross-cluster search is not needed.

1.8 The number of nodes in a cluster does not need to be an odd number

ES maintains cluster coordination internally and does not rely on a ZooKeeper-style distributed deployment mechanism, so an odd number of nodes is not required.

However, discovery.zen.minimum_master_nodes must be set to (number of master-eligible nodes / 2) + 1 to effectively avoid split brain.
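The quorum formula uses integer division. A small sketch (the function name is hypothetical) shows the resulting value for a few cluster sizes. Note that this setting belongs to Zen discovery in versions before 7.0; later versions manage the quorum themselves.

```python
def minimum_master_nodes(master_eligible_nodes):
    """Quorum of master-eligible nodes: floor(n / 2) + 1."""
    return master_eligible_nodes // 2 + 1

for n in (3, 4, 5):
    print(n, minimum_master_nodes(n))
# 3 2
# 4 3
# 5 3
```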

1.9 Optimize node role allocation

Clusters with 3 nodes or fewer: set master: true and data: true on all nodes, so that every node is both a master-eligible node and a routing node.

Clusters with more than 3 nodes: depending on the service scenario, gradually split out dedicated master nodes and coordinating/routing nodes.

1.10 It is recommended to separate hot and cold data

Storing hot data on SSDs and ordinary historical data on mechanical disks improves retrieval efficiency at the physical level.

2. Index optimization practice

MySQL and other relational databases need database and table sharding. Elasticsearch needs the same kind of careful planning.

2.1 How many indexes to set?

Organize storage by business scenario.

Data from different channel types is stored in separate indexes. For example, information collected from Zhihu is stored in a zhihu index, and information collected from apps is stored in an app index.

2.2 How many shards to set?

Size shards by data volume. Rule of thumb: keep each shard at or under 30GB.
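Applied to an expected index size, the 30GB rule of thumb gives a minimum primary shard count. The following is a minimal sketch with a hypothetical function name and example sizes.

```python
import math

def primary_shards(expected_index_gb, max_shard_gb=30):
    """Smallest shard count that keeps each shard at or under max_shard_gb."""
    return max(1, math.ceil(expected_index_gb / max_shard_gb))

print(primary_shards(140))  # 5
print(primary_shards(10))   # 1
```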

2.3 How to set the number of shards?

Configure the number of shards based on the number of cluster nodes. For a 5-node cluster, 5 shards is reasonable.

Note: the number of shards cannot be changed after the index is created, except by reindexing.

2.4 How to set the number of replicas?

Unless you have unusually high robustness requirements, as in a banking system, where 2 or more replicas can be considered, 1 replica is sufficient.

Note: the number of replicas can be changed at any time through configuration.

2.5 Do not create multiple types under one index

Even if you are on 5.x, consider future extensibility and version upgrades.

Suggestion: one index corresponds to one type. In 6.x the type defaults to _doc; in 5.x, use doc directly.

2.6 Plan indexes by date

As the business grows, the tension between a single index and rapidly growing data volume becomes obvious. Splitting indexes by date is a natural choice.

Benefit 1: historical data can be deleted in seconds by deleting the historical index. Note: with a single index, deletion requires delete_by_query plus force_merge, which is slow and does not delete the data thoroughly.

Benefit 2: hot and cold data are easy to manage separately. To search the last few days of data, target the physical indexes for those dates directly, which is extremely fast.

Reference approach: index templates plus the Rollover API.
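To illustrate Benefit 2, the sketch below computes which daily indexes a "last N days" search would target. The naming pattern prefix-YYYY.MM.DD and the function names are assumptions for illustration; the article does not prescribe a naming scheme.

```python
from datetime import date, timedelta

def daily_index(prefix, day):
    """Index name for one day, e.g. logs-2019.01.02 (assumed naming pattern)."""
    return f"{prefix}-{day:%Y.%m.%d}"

def recent_indexes(prefix, today, days):
    """Names of the daily indexes covering the most recent `days` days."""
    return [daily_index(prefix, today - timedelta(days=i)) for i in range(days)]

print(recent_indexes("logs", date(2019, 1, 2), 3))
# ['logs-2019.01.02', 'logs-2019.01.01', 'logs-2018.12.31']
```

The resulting names can be joined with commas in the search URL, so that older indexes are never touched by the query.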

2.7 Be sure to use aliases

Unlike MySQL, ES cannot rename an index in place. Using aliases is a relatively flexible alternative.

3. Data model optimization practice

3.1 Do not use the default mapping

With the default mapping, field types are detected automatically by the system. A string is mapped as both text and keyword by default. If the business does not need tokenized search and only needs exact matching, map the field as keyword only.

Select an appropriate type based on business needs to save space and improve precision, for example when choosing among the floating-point types.
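A minimal sketch of an explicit mapping follows. All field names are hypothetical, and how the "properties" object is wrapped (under a type name or not) depends on the Elasticsearch version, so only the field definitions are shown.

```python
import json

# Hypothetical fields; each type is chosen explicitly instead of relying on detection.
mapping_properties = {
    "title": {"type": "text"},            # needs tokenized full-text search
    "status": {"type": "keyword"},        # exact match only, so no text sub-field
    "price": {"type": "scaled_float", "scaling_factor": 100},  # two decimal places
    "publish_time": {"type": "date"},
}

print(json.dumps({"properties": mapping_properties}, indent=2))
```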

3.2 Selection process of each field in Mapping

3.3 Choose a suitable analyzer

Common open source Chinese analyzers include IK, ANSJ, HanLP, jieba ("stutter"), and the "mass" (Hailiang) analyzer. To compare their output, search for the article "ElasticSearch most complete word splitter comparison and usage method".

If IK is selected, ik_max_word is recommended, because its fine-grained segmentation results basically contain the coarse-grained ik_smart results.

3.4 date, long, or keyword?

Decide based on service requirements. If you need analysis along a timeline, you must use the date type. If you only need results returned within seconds, keyword is recommended.

4. Data writing optimization practice

4.1 Do you need second-level real-time response?

Near real time in Elasticsearch means that newly written data becomes searchable after 1s at the earliest.

If refresh_interval is set to 1s, a large number of segments will be generated and retrieval performance will suffer.

Therefore, in scenarios without strict real-time requirements, refresh_interval can be raised to 30s, or even set to -1.

4.2 Reduce the number of replicas to improve write performance

Before a large write, set the number of replicas to 0. After the write completes, restore the original value.
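The two write-time adjustments from 4.1 and 4.2 can be expressed as a pair of index settings bodies, one applied before a large load and one after. This is a sketch only: the function name is hypothetical, and the bodies are meant to be sent to the index settings endpoint (PUT <index>/_settings) with whatever client you use.

```python
def bulk_load_settings(original_replicas, original_refresh="1s"):
    """Return (before, after) settings bodies for a large one-off load."""
    before = {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}
    after = {
        "index": {
            "refresh_interval": original_refresh,
            "number_of_replicas": original_replicas,
        }
    }
    return before, after

before, after = bulk_load_settings(original_replicas=1, original_refresh="30s")
print(before)
print(after)
```

Restoring the values afterwards matters: with refresh disabled, newly written data is not searchable, and with 0 replicas a node failure can lose data.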

4.3 Write in batches rather than one document at a time

The batch write interface is the bulk API. The appropriate bulk size depends on the queue size, the thread pool size, and the number of CPU cores on the machine.
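The bulk API takes newline-delimited JSON: one action line followed by one document line per document, with a trailing newline. The sketch below splits a document list into request bodies of a chosen batch size. The function name and default batch size are assumptions; on 5.x/6.x the type must also be supplied, either in the URL or in the action line.

```python
import json

def bulk_bodies(index, docs, batch_size=1000):
    """Yield newline-delimited bulk request bodies, batch_size documents each."""
    for start in range(0, len(docs), batch_size):
        lines = []
        for doc in docs[start:start + batch_size]:
            lines.append(json.dumps({"index": {"_index": index}}))  # action line
            lines.append(json.dumps(doc))                           # document line
        yield "\n".join(lines) + "\n"  # the bulk body must end with a newline

docs = [{"id": i} for i in range(5)]
bodies = list(bulk_bodies("app", docs, batch_size=2))
print(len(bodies))  # 3
print(bodies[0], end="")
```

A reasonable way to pick batch_size is to start small and increase it until throughput stops improving or bulk rejections appear.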

4.4 Disable swap

On Linux, temporarily disable swapping by running the following command:

sudo swapoff -a

5. Search and aggregation optimization practice

5.1 Disable wildcard fuzzy matching

When the data volume reaches TB+ or higher, wildcard queries combined across multiple fields are likely to hang, and can even bring cluster nodes down.

The consequences are severe.

Alternatives:

Option 1, for high accuracy requirements: index with two analyzers, standard and IK, and search with match_phrase.

Option 2, for lower accuracy requirements: use the IK analyzer and query with match_phrase plus slop.
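As an illustration of the second alternative, here is a minimal sketch that builds a match_phrase query body with a slop value. The function name and the field name in the example are hypothetical.

```python
def phrase_query(field, text, slop=0):
    """Search body for a match_phrase query; slop allows that many position moves."""
    return {"query": {"match_phrase": {field: {"query": text, "slop": slop}}}}

print(phrase_query("content", "performance optimization", slop=2))
```

With slop=0 the terms must appear adjacent and in order; a larger slop trades precision for recall.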

5.2 Use match as rarely as possible

The results of match are noticeably imprecise. In large business scenarios, use match_phrase for phrase matching instead.

Combined with a well-maintained segmentation dictionary and vocabulary, match_phrase makes search results more precise and avoids noisy data.

5.3 Use filters widely in business scenarios

For scenarios that do not need relevance scores, the filter caching mechanism will undoubtedly make retrieval faster.

Example: filtering on a zip code.
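A minimal sketch of the zip code example: the term clause is placed in the filter context of a bool query, so no relevance score is computed and the result is eligible for caching. The field name zip_code and the function name are assumptions.

```python
def zip_code_filter(zip_code):
    """Search body that filters on an exact zip code without scoring."""
    return {"query": {"bool": {"filter": [{"term": {"zip_code": zip_code}}]}}}

print(zip_code_filter("100080"))
```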

5.4 Control returned fields and results

As with MySQL queries, SELECT * is almost never needed in business development.

Likewise, ES does not need to return every field in _source.

Use _source filtering to return only business-relevant fields. Returning large fields such as html_content in bulk may indicate a flaw in the business design.

Obviously, the summary field should be written at index time rather than truncated from the content after querying.
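The sketch below shows one way to restrict returned fields and result count in a search body. The function name and the field names other than html_content are hypothetical.

```python
def search_body(query, fields, size=10):
    """Search body that returns only the listed _source fields."""
    return {"_source": list(fields), "size": size, "query": query}

body = search_body({"match_phrase": {"title": "elasticsearch"}}, ["title", "summary"])
print(body["_source"])  # ['title', 'summary']
```

Here the large html_content field is simply left out of the list, so it is never sent over the network.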

5.5 Deep paging and traversal

For paging, use from + size. For traversal, use scroll. For parallel traversal, use scroll with slice.

Choose among them according to business needs.
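For parallel traversal, each worker opens its own scroll with a slice clause identifying its share of the data. The sketch below builds one search body per slice; the function name and defaults are assumptions, and each body would be sent to the search endpoint with a scroll parameter.

```python
def sliced_scroll_bodies(query, max_slices, size=1000):
    """One scroll search body per slice, for parallel traversal."""
    if max_slices < 2:
        # Elasticsearch requires the slice max to be greater than 1.
        raise ValueError("max_slices must be at least 2")
    return [
        {"slice": {"id": i, "max": max_slices}, "size": size, "query": query}
        for i in range(max_slices)
    ]

for body in sliced_scroll_bodies({"match_all": {}}, 3):
    print(body["slice"])
# {'id': 0, 'max': 3}
# {'id': 1, 'max': 3}
# {'id': 2, 'max': 3}
```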

5.6 Set the aggregation size sensibly

Aggregation results are not exact. Unless you set size to 2^32 - 1, the result is obtained by taking the top size entries from each shard and then sorting them together.

In real business scenarios that need accurate results, pay attention to this. Try not to fetch the full aggregation result. From a business perspective, taking the top N aggregation results makes sense, because the values at the bottom really do not mean much.
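A small simulation can show why per-shard top-N merging is imprecise. This is a simplified model written for illustration, not Elasticsearch's actual implementation (which, for example, asks each shard for more than size entries by default); the shard contents are made up.

```python
from collections import Counter

def approx_top_terms(shards, size):
    """Merge only the top `size` terms of each shard, then take the top `size`."""
    merged = Counter()
    for shard in shards:
        for term, count in Counter(shard).most_common(size):
            merged[term] += count
    return merged.most_common(size)

def exact_top_terms(shards, size):
    """Sum full counts across all shards, then take the top `size`."""
    total = Counter()
    for shard in shards:
        total.update(shard)
    return total.most_common(size)

shards = [{"a": 10, "b": 9, "c": 8}, {"c": 10, "d": 9, "a": 1}]
print(dict(approx_top_terms(shards, 2)))  # {'a': 10, 'c': 10}
print(dict(exact_top_terms(shards, 2)))   # {'c': 18, 'a': 11}
```

Term c is undercounted because its 8 documents on the first shard fall outside that shard's top 2, which is the kind of error a larger size reduces.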

5.7 Implement aggregation paging sensibly

When displaying aggregation results, you inevitably face the problem of paginating after aggregation, which ES does not support for performance reasons.

If you need post-aggregation paging, you have to implement it yourself. Options include but are not limited to:

Option 1: fetch the aggregation result each time and page through it in memory.

Option 2: implement it with scroll and scroll-after style pagination, combined with Redis.
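The first approach above, paging in memory, can be sketched as slicing the bucket list returned by an aggregation. The function name is hypothetical, and the buckets are made-up data shaped like the key/doc_count entries of a terms aggregation response.

```python
def page_buckets(buckets, page, page_size):
    """Return one page (1-based) of an already fetched list of aggregation buckets."""
    start = (page - 1) * page_size
    return buckets[start:start + page_size]

buckets = [{"key": f"term{i}", "doc_count": 100 - i} for i in range(7)]
print([b["key"] for b in page_buckets(buckets, 2, 3)])
# ['term3', 'term4', 'term5']
```

This only works when the aggregation size is large enough to cover every page the user can reach, so the cost grows with the deepest page requested.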

6. Business optimization

Let Elasticsearch do what it is good at, which is clearly search based on inverted indexes.

At the business level, users want to see the results they are after as quickly as possible. They do not care about the field processing, formatting, standardization, and other operations that happen in between.

To make Elasticsearch search more efficient, do the following:

1) Do the groundwork in the ETL stage. Field extraction, sentiment analysis, classification/clustering, and relevance determination should all be finished before writing to ES.

2) Win over the product manager. Product managers may raise all kinds of unreasonable requirements based on all kinds of odd business scenarios.

As a technician, you need to explain to the product manager how search engines work, how Elasticsearch works, what can be done, and what really cannot be done.

7. Summary

In real projects, companies generally want the horse to run fast without feeding it grass.

The same goes for Elasticsearch development: when hardware resources are insufficient (CPU, memory, and disk all maxed out), there is almost no way to improve performance.

Another common pattern is to make Elasticsearch do N other related and unrelated things besides search and aggregation, and then conclude that "Elasticsearch is slow, not as fast as expected".

Does that picture look familiar?

Make sure your Elasticsearch gets to fly!

We’ll meet again some day…

Recommended reading:

1. Ali: https://elasticsearch.cn/article/6171
2. Didi: http://t.cn/EUNLkNU
3. Tencent: http://t.cn/E4y9ylL
4. Ctrip: https://elasticsearch.cn/article/6205
5. Community: https://elasticsearch.cn/article/6202
6. Community: https://elasticsearch.cn/article/708
7. Community: https://elasticsearch.cn/article/6202
8. Blockbuster | Elasticsearch methodology cognition checklist (National Day update edition)

