This article was first published on the vivo Internet Technology WeChat public account: mp.weixin.qq.com/s/qwkZKLb_g… Original: Qbox.io/blog/Elasti… Translator: Yang Zhentao

Contents

  1. Document modeling
  2. Global ordinals and latency
  3. Multigenerational relationships
  4. Allocate memory for the file system cache

This article is the first in a series on the Qbox blog about optimizing query performance, covering document modeling, memory allocation, file system caching, GC, and hardware.

Elasticsearch 5.0.0 is a major release after 2.x and brings many new features. Elasticsearch is now part of the Elastic Stack, and its version number is aligned with the rest of the stack: Kibana, Logstash, Beats, and Elasticsearch are all at version 5.0.

This version of Elasticsearch is the fastest, most secure, most resilient, and easiest to use yet, and it also brings a lot of improvements and new features.

In our series The Authoritative Guide to Elasticsearch Performance Tuning, we covered the basics of performance tuning and explained the key system settings and metrics for each step. The series consists of the following three parts:

  • The Authoritative Guide to Elasticsearch Performance Tuning (Part 1)
  • The Authoritative Guide to Elasticsearch Performance Tuning (Part 2)
  • The Authoritative Guide to Elasticsearch Performance Tuning (Part 3)

Indexing decisions are also important and have a big impact on how data can be searched. If a field is a string, does it need to be analyzed (tokenized) or normalized? If so, how? If it is a numeric field, what precision is required? Many other types, such as date-time, geo-shape, and parent-child relationships, require even more careful consideration.

We also discussed Elasticsearch indexing performance in a series of tutorials that shows common tips and methods for maximizing indexing throughput and reducing the monitoring and administration load. The tutorial is divided into three parts:

  • How to Maximize Elasticsearch Indexing Performance (Part 1)
  • How to Maximize Elasticsearch Indexing Performance (Part 2)
  • How to Maximize Elasticsearch Indexing Performance (Part 3)

This article recommends search tuning techniques, strategies, and features for Elasticsearch 5.0 and above.

1. Document modeling

Arrays of inner objects do not work the way you might expect. Lucene has no concept of inner objects, so Elasticsearch flattens the object hierarchy into a simple list of field names and values. Take the following document for example:

curl -XPUT 'localhost:9200/my_index/my_type/1?pretty' -H 'Content-Type: application/json' -d '{ "group" : "fans", "user" : [ { "first" : "John", "last" : "Smith" }, { "first" : "Alice", "last" : "White" } ] }'


The request is internally converted to the following document form:

{
  "group" :        "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" :  [ "smith", "white" ]
}

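To see why this flattening matters, the following Python sketch mimics it on the document above. It is only an illustration, not Elasticsearch or Lucene code: the helper names `flatten` and `matches_all` are hypothetical, and lowercasing stands in for what an analyzer would do. It suggests that once the array is flattened, a search for first name Alice and last name Smith succeeds even though no single user has that name.

```python
from collections import defaultdict


def flatten(doc, prefix=""):
    """Flatten inner objects and arrays of objects into dotted multi-value fields."""
    fields = defaultdict(set)
    for key, value in doc.items():
        path = f"{prefix}{key}"
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                # Inner object: merge its fields under "path." (the pairing is lost here).
                for sub_path, sub_values in flatten(item, f"{path}.").items():
                    fields[sub_path] |= sub_values
            else:
                # Lowercasing is a stand-in for analysis (an assumption of this sketch).
                fields[path].add(str(item).lower())
    return dict(fields)


def matches_all(flat_doc, terms):
    """Return True if every (field, term) pair is present in the flattened document."""
    return all(term.lower() in flat_doc.get(field, set()) for field, term in terms.items())


doc = {
    "group": "fans",
    "user": [
        {"first": "John", "last": "Smith"},
        {"first": "Alice", "last": "White"},
    ],
}

flat = flatten(doc)
# "Alice Smith" does not exist, yet both terms are found in the flattened fields.
cross_match = matches_all(flat, {"user.first": "Alice", "user.last": "Smith"})
print(sorted(flat["user.first"]), sorted(flat["user.last"]), cross_match)
```

The cross-object match is exactly the kind of false positive that a per-object index structure avoids.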

If you need to index an array of objects and keep each object in the array independent, use the nested data type instead of the object data type. Internally, nested objects index each object in the array as a separate hidden document, so each nested object can be queried independently of the others with the nested query:

# Map the user field as nested.
curl -XPUT 'ES_HOST:ES_PORT/my_index?pretty' -H 'Content-Type: application/json' -d '{ "mappings": { "my_type": { "properties": { "user": { "type": "nested" } } } } }'

# Index the same document as before.
curl -XPUT 'ES_HOST:ES_PORT/my_index/my_type/1?pretty' -H 'Content-Type: application/json' -d '{ "group" : "fans", "user" : [ { "first" : "John", "last" : "Smith" }, { "first" : "Alice", "last" : "White" } ] }'

# No hit: Alice and Smith are not in the same nested object.
curl -XGET 'ES_HOST:ES_PORT/my_index/_search?pretty' -H 'Content-Type: application/json' -d '{ "query": { "nested": { "path": "user", "query": { "bool": { "must": [ { "match": { "user.first": "Alice" }}, { "match": { "user.last": "Smith" }} ] } } } } }'

# One hit: Alice and White are in the same nested object; inner_hits highlights the matching nested document.
curl -XGET 'ES_HOST:ES_PORT/my_index/_search?pretty' -H 'Content-Type: application/json' -d '{ "query": { "nested": { "path": "user", "query": { "bool": { "must": [ { "match": { "user.first": "Alice" }}, { "match": { "user.last": "White" }} ] } }, "inner_hits": { "highlight": { "fields": { "user.first": {}}}}}}}'


Nested objects are useful when there is one main entity, such as a blog post, with a limited number of closely related but less important entities, such as comments. It is useful to be able to find blog posts based on the content of their comments, and the nested query and filter provide fast query-time join functionality.

The nested model has the following disadvantages:

  • To add, change, or delete a nested document, the whole document must be reindexed. The more nested documents there are, the more this costs.
  • A search request returns the whole document, not just the matching nested documents. Returning the best-matching nested documents together with the root document is planned but not yet supported.

Sometimes you need complete separation between the main document and its associated entities. Parent-child relationships provide this separation.

You can establish a parent-child relationship between documents in the same index by making one mapping type the parent of another:

curl -XPUT 'ES_HOST:ES_PORT/my_index?pretty' -H 'Content-Type: application/json' -d '{ "mappings": { "my_parent": {}, "my_child": { "_parent": { "type": "my_parent" } } } }'

curl -XPUT 'ES_HOST:ES_PORT/my_index/my_parent/1?pretty' -H 'Content-Type: application/json' -d '{ "text": "This is a parent document" }'

curl -XPUT 'ES_HOST:ES_PORT/my_index/my_child/2?parent=1&pretty' -H 'Content-Type: application/json' -d '{ "text": "This is a child document" }'

curl -XPUT 'ES_HOST:ES_PORT/my_index/my_child/3?parent=1&refresh=true&pretty' -H 'Content-Type: application/json' -d '{ "text": "This is another child document" }'

curl -XGET 'ES_HOST:ES_PORT/my_index/my_parent/_search?pretty' -H 'Content-Type: application/json' -d '{ "query": { "has_child": { "type": "my_child", "query": { "match": { "text": "child document" } } } } }'


Parent-child joins are a useful way to manage entity relationships when indexing performance matters more than search performance, but they come at a significant cost: parent-child queries can be 5 to 10 times slower than the equivalent nested query.

2. Global ordinals and latency

Parent-child relationships use global ordinals to speed up joins. Regardless of whether the parent-child map uses an in-memory cache or on-disk doc values, global ordinals still need to be rebuilt after any change to the index.

The more parents there are in a shard, the longer global ordinals take to build. Parent-child works best when each parent has many children, rather than when there are many parents with few children.

Global ordinals are built lazily by default: the first parent-child query or aggregation after a refresh triggers the build. This can introduce a noticeable latency spike for users. You can use eager_global_ordinals on the _parent field of the mapping to shift the cost of building global ordinals from query time to refresh time, as follows:

curl -XPUT 'ES_HOST:ES_PORT/company' -H 'Content-Type: application/json' -d '{ "mappings": { "branch": {}, "employee": { "_parent": { "type": "branch", "fielddata": { "loading": "eager_global_ordinals" } } } } }'


Here, global ordinals for the _parent field are built before a new segment becomes visible to search.

With many parents, global ordinals can take several seconds to build. In this case, it makes sense to increase refresh_interval so that refreshes happen less often and global ordinals remain valid for longer. This significantly reduces the CPU cost of rebuilding global ordinals every second.
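As a rough illustration of the refresh_interval change, the sketch below builds the request for the index settings update API without sending it. The helper name `refresh_interval_request` is hypothetical, and the value "30s" is only an example, not a recommendation from the article; a suitable interval depends on how fresh search results need to be.

```python
import json


def refresh_interval_request(index, interval):
    """Build the method, path, and JSON body for an index settings update."""
    body = {"index": {"refresh_interval": interval}}
    return "PUT", f"/{index}/_settings", json.dumps(body)


# "company" matches the index used above; "30s" is an illustrative value only.
method, path, body = refresh_interval_request("company", "30s")
print(method, path, body)
```

The printed method, path, and body can be sent with curl in the same way as the other requests in this article.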

3. Multigenerational relationships

The ability to join multiple generations of data (see Grandparents and Grandchildren) sounds appealing, but consider the costs:

  • The more joins, the worse the performance.
  • Each generation of parents needs to store its string _id fields in memory, which can consume a lot of RAM.

As you consider your relationship schemes and whether parent-child is right for you, keep the following advice in mind:

  • Use parent-child relationships sparingly, and only when there are many more children than parents.
  • Avoid using multiple parent-child joins in a single query.
  • Avoid scoring by using the has_child filter, or the has_child query with score_mode set to none.
  • Keep parent IDs short so that they compress better in doc values and use less memory when transiently loaded.
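The advice about avoiding scoring can be sketched as a request body. The helper `has_child_filter` below is a hypothetical name; it places a has_child query in the filter clause of a bool query and sets score_mode to none, reusing the my_child type and text field from the earlier example. Treat it as one possible shape of such a query rather than the only one.

```python
import json


def has_child_filter(child_type, child_query):
    """Wrap a has_child query so that child matches are not scored."""
    return {
        "query": {
            "bool": {
                # Clauses under "filter" do not contribute to the score.
                "filter": {
                    "has_child": {
                        "type": child_type,
                        "score_mode": "none",
                        "query": child_query,
                    }
                }
            }
        }
    }


body = has_child_filter("my_child", {"match": {"text": "child document"}})
print(json.dumps(body, indent=2))
```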

4. Allocate memory for the file system cache

For a running Elasticsearch node, memory is one of the key resources to monitor closely. Elasticsearch and Lucene use memory in two ways: the JVM heap and the file system cache. Because Elasticsearch runs in a Java Virtual Machine (JVM), the duration and frequency of JVM garbage collection (GC) also need to be monitored.

JVM heap memory

It is important to get a “just right” JVM heap size for Elasticsearch, neither too large nor too small, for the reasons described below. The general rule of thumb is to allocate less than 50% of available RAM to the JVM heap, and never more than 32GB.

Allocating less heap memory to Elasticsearch leaves more memory for Lucene, which relies heavily on the file system cache to serve requests quickly. However, the heap should not be set too small either, because the application may then run into out-of-memory errors or reduced throughput as it faces constant short pauses from frequent GC.
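The rule of thumb can be written down as a small calculation. This is a sketch only: the function name `recommended_heap_gb` is hypothetical, and it returns the upper bound implied by the rule (half of RAM, capped at 32GB), not a tuned value for any particular workload.

```python
def recommended_heap_gb(total_ram_gb, max_heap_gb=32):
    """Return an upper bound for the JVM heap under the rule of thumb above."""
    if total_ram_gb <= 0:
        raise ValueError("total_ram_gb must be positive")
    # At most half of RAM, and never more than the cap.
    return min(total_ram_gb / 2, max_heap_gb)


for ram in (16, 64, 128):
    heap = recommended_heap_gb(ram)
    print(f"{ram} GB RAM: heap up to {heap:g} GB, at least {ram - heap:g} GB left for the file system cache")
```

On a 128GB machine the cap dominates, so most of the memory stays available to the file system cache.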

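Elasticsearch is installed with a default JVM heap size of 1GB, which is too small in most cases. Note that the ES_HEAP_SIZE variable used below is read by releases before 5.0; from 5.0 onward the heap is set with -Xms and -Xmx in config/jvm.options or through the ES_JAVA_OPTS environment variable. You can export the desired heap size as an environment variable and restart Elasticsearch: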

export ES_HEAP_SIZE=10g


The other option is to set the JVM heap size on the command line each time you start Elasticsearch (this sets the minimum and maximum to the same value, which prevents the heap from resizing):

ES_HEAP_SIZE="10g" ./bin/elasticsearch


Both examples set the heap size to 10GB. To verify this, run the following command:

curl -XGET 'http://ES_HOST:9200/_cat/nodes?h=heap.max'


The returned output shows that the maximum heap memory has been updated correctly.

Garbage collection

Elasticsearch relies on garbage collection to free heap memory. Because GC itself consumes resources (in order to free up resources!), keep an eye on its frequency and duration to decide whether the heap size needs adjusting. Setting the heap too large results in long GC pauses. Such long pauses are dangerous because the cluster can mistake the paused node for one that has dropped off the network.
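One way to watch GC frequency and duration is to read the jvm section of the nodes stats API (GET /_nodes/stats/jvm). The sketch below summarizes that section for a single node. The helper name `gc_summary` is hypothetical and the sample numbers are invented for illustration; only the field names (gc.collectors, collection_count, collection_time_in_millis) follow the API response.

```python
def gc_summary(jvm_stats):
    """Summarize GC activity per collector from one node's "jvm" stats section."""
    summary = {}
    for name, collector in jvm_stats["gc"]["collectors"].items():
        count = collector["collection_count"]
        total_ms = collector["collection_time_in_millis"]
        summary[name] = {
            "collections": count,
            # Average over the node's lifetime; guard against division by zero.
            "avg_pause_ms": total_ms / count if count else 0.0,
        }
    return summary


# Hypothetical sample shaped like the "jvm" section of GET /_nodes/stats/jvm.
sample_jvm = {
    "mem": {"heap_used_percent": 71},
    "gc": {
        "collectors": {
            "young": {"collection_count": 1200, "collection_time_in_millis": 36000},
            "old": {"collection_count": 4, "collection_time_in_millis": 3200},
        }
    },
}

for name, stats in gc_summary(sample_jvm).items():
    print(name, stats["collections"], f'{stats["avg_pause_ms"]:.1f} ms')
```

Because these counters accumulate from node start, comparing two samples taken some time apart gives a better picture of current GC behavior than a single reading.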

Elasticsearch also relies heavily on the file system cache to make searches fast. Make sure that at least half of the available memory goes to the file system cache so that Elasticsearch can keep the hot regions of the index in physical memory.

Use faster hardware

If your searches are I/O bound, consider giving more memory to the file system cache (see the previous section) or buying faster drives. In particular, SSDs are known to perform much better than spinning disks. Use local storage whenever possible, avoid remote or network file systems such as NFS or SMB, and be wary of virtualized storage such as Amazon EBS.

Virtualized storage works well with Elasticsearch and is appealing because it is fast and simple to set up, but it is unfortunately also inherently slower on an ongoing basis than dedicated local storage. If you put an index on EBS, be sure to use provisioned IOPS; otherwise operations could quickly be throttled.

If your searches are CPU bound, consider buying faster CPUs.


Note: To reprint this article, please contact our WeChat account: LABs2020.