1. Distributed Features of Elasticsearch

1. Distributed introduction and Cerebro

  • ES supports cluster mode and is a distributed system, which brings two main benefits:
    • Increased system capacity, such as memory and disk, so that an ES cluster can support petabytes of data
    • Increased system availability: even if some nodes stop serving, the cluster as a whole can still provide service normally
  • An ES cluster consists of multiple ES instances
    • Different clusters are distinguished by cluster name, which can be changed via cluster.name; the default value is elasticsearch
    • Each ES instance is essentially a JVM process with its own name, which can be changed via node.name

2. Create a cluster

  • You can start an ES node instance by running the following command
bin/elasticsearch -Ecluster.name=my_cluster -Epath.data=my_cluster_node1 -E node.name=node1 -Ehttp.port=5200 -d
  • ES cluster metadata is called cluster state and records the following information:
    • Node information, such as node names and connection addresses
    • Index information, such as index names and settings
    • …
  • The node that can modify the cluster state is called the master node; a cluster can have only one active master node
  • The cluster state is stored on every node; the master maintains the latest version and synchronizes it to the other nodes
  • The master node is elected from the master-eligible nodes in the cluster. A node is made master-eligible with the following configuration:
    • node.master: true
  • The node that handles client requests is the coordinating node. This is a default role of every node and cannot be disabled
    • It routes requests to the correct node for processing, for example an index-creation request to the master node
  • The node that stores data is the data node. By default every node is a data node. The configuration is as follows:
    • node.data: true
  • To start a second ES node instance and form the cluster, run the following command
bin/elasticsearch -Ecluster.name=my_cluster -Epath.data=my_cluster_node2 -E node.name=node2 -Ehttp.port=5300 -d

3. Replicas and Shards

Improve system availability

  • Service availability
    • With two nodes, the cluster can tolerate one node stopping service
  • Data availability
    • Introduce replication (replicas)
    • Every node then holds a complete copy of the data

Increase system capacity

  • How is data distributed across all nodes?
    • Introduce shards to solve the problem
  • Sharding is the cornerstone of ES support for petabyte-scale data
    • A shard stores part of the data and can be allocated to any node
    • The number of shards is specified when the index is created and cannot be changed afterwards; the default is 5 (see the index-creation example below)
    • Shards are divided into primary shards and replica shards to achieve high data availability
    • Replica shard data is synchronized from the primary shard; multiple replicas can be created to increase read throughput
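For example, the number of primary shards and replicas can be specified in the index settings when the index is created (a minimal sketch; the index name and values are illustrative):

PUT test_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

number_of_replicas can still be changed dynamically afterwards, while number_of_shards cannot.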

4. Two questions

  • Assume test_index has 3 shards distributed across 3 nodes. Does adding a node at this point increase the data capacity of test_index?
    • No. There are only three shards and they are already distributed across the three nodes, so a new node cannot be utilized
  • Does increasing the number of replicas improve the read throughput of test_index?
    • No. The new replicas are still distributed across these three nodes and use the same resources. To increase throughput, new nodes have to be added
  • The number of shards is therefore very important and needs to be planned in advance
    • If the number is too small, adding nodes cannot achieve horizontal scaling
    • If too many shards are placed on a single node, resources are wasted and query performance suffers

5. Cluster status

  • You can use the following API to view the cluster health status:
    • Green: all primary and replica shards are allocated normally
    • Yellow: all primary shards are allocated normally, but some replica shards are not
    • Red: some primary shards are not allocated
GET _cluster/health

{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 21,
  "active_shards" : 21,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 20,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 51.21951219512195
}

6. Failover

  • The cluster consists of three nodes and its status is green
  • What does the cluster do if the host where node1 resides goes down?
    • When node2 and node3 find that node1 has not responded for some time, they initiate a master election; suppose node2 is elected as the new master. The cluster status turns Red because primary shard P0 has gone offline
    • node2 finds that primary shard P0 is unassigned and promotes replica R0 to primary. Since all primary shards are now allocated normally, the cluster status turns Yellow
    • node2 generates new replicas for P0 and P1, and the cluster status turns Green

7. Distributed file storage

  • Documents are ultimately stored on a shard, for example:
    • Document1 ends up stored on shard P1
  • How does Document1 get stored in shard P1? On what basis was P1 chosen?
    • A document-to-shard mapping algorithm is required
    • Purpose: distribute documents evenly across all shards to make full use of resources
    • Algorithm:
      • Random selection or round-robin?
      • Not desirable, because the cost of maintaining a document-to-shard mapping table would be too high
      • Instead, compute the target shard in real time from a value in the document
  • ES computes the shard for a document using the following formula:
    • The hash algorithm ensures that data is distributed evenly across shards
    • routing is the key parameter; it defaults to the document ID but can also be specified explicitly
    • number_of_primary_shards is the number of primary shards
shard = hash(routing) % number_of_primary_shards
  • This algorithm depends on the number of primary shards, which is why the number of shards cannot be changed once it has been set (a routing example follows below)
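For illustration, a custom routing value can be supplied when indexing and retrieving a document; otherwise the document ID is used (a sketch with hypothetical index and field names):

PUT test_index/_doc/1?routing=user1
{
  "title": "hello es"
}

GET test_index/_doc/1?routing=user1

Note that a document indexed with a custom routing value must also be retrieved with that same routing value.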

Document creation process

The process of reading documents

8. Split brain problem

  • Split-brain is a classic network problem in distributed systems
    • In a cluster of three nodes, the network between node1 and the other two nodes is suddenly interrupted
    • node2 and node3 re-elect a master; suppose node2 becomes the new master and updates the cluster state
    • node1, unable to reach the other nodes, forms a cluster by itself and also updates its own cluster state
  • The same cluster then has two masters maintaining different cluster states, and after the network recovers the correct master cannot be determined
  • The solution is to allow a master election only when the number of reachable master-eligible nodes is greater than or equal to quorum
    • quorum = number of master-eligible nodes / 2 + 1; for example, with three master-eligible nodes, quorum is 2
    • Set discovery.zen.minimum_master_nodes to quorum to avoid split-brain, for example:
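A sketch of the setting for a cluster with three master-eligible nodes, placed in elasticsearch.yml on each of them:

# quorum = 3 / 2 + 1 = 2 for three master-eligible nodes
discovery.zen.minimum_master_nodes: 2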

9. Shard explanation

The inverted index cannot be changed

  • Once an inverted index is generated, it cannot be changed
  • Its benefits are as follows:
    • There is no need to worry about concurrent writes to the file, eliminating the performance cost of locking
    • Since the file does not change, the file system cache can be fully utilized: the file only needs to be loaded once, and as long as there is enough memory reads come straight from memory, giving high performance
    • It is convenient for generating cached data
    • It helps compress the files, saving disk and memory space
  • The disadvantage is that when a new document needs to be written, the inverted index file has to be rebuilt, and the new document only becomes searchable after the old file is replaced, so search visibility of new documents is poor

Real-time document search

  • The solution is to write new documents into new inverted index files, query all inverted index files at the same time, and then aggregate the results
  • Lucene adopts this scheme. A single inverted index built by Lucene is called a segment, and the segments together form an index. Note this is different from an index in ES: a shard in ES corresponds to one Lucene index
  • Lucene has a dedicated file that records all segment information, called the commit point

Real-time document search – refresh

  • Writing segments to disk is still time-consuming. Using the file system cache, a segment can first be created and opened for queries in the cache, which further improves search visibility. This process is called refresh in ES
  • Before refresh, documents are stored in a buffer; during refresh, all documents in the buffer are cleared and a segment is generated
  • By default ES refreshes every second, so documents become searchable within one second, which is why ES is called Near Real Time
  • Refresh occurs in the following situations (it can also be triggered manually, as shown below):
    • When the interval is reached; it is set by index.refresh_interval and defaults to 1 second
    • When the index buffer is full; its size is set by indices.memory.index_buffer_size, which defaults to 10% of the JVM heap and is shared by all shards
    • Refresh also occurs when flush happens
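Besides the triggers above, a refresh can also be issued manually through the refresh API (the index name is illustrative):

POST test_index/_refresh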

Real-time document search – translog

  • If the node goes down before an in-memory segment has been written to disk, the documents in it cannot be recovered. How is this problem solved?
    • ES introduces the translog mechanism: when a document is written to the buffer, the operation is also written to the translog
    • The translog file is written to disk immediately (fsync). In 6.x every request is written to disk by default; this can be changed to write every 5 seconds, at the risk of losing up to 5 seconds of data
    • ES checks the translog file on startup and recovers data from it

Real-time document search – flush

  • Flush writes the in-memory segments to disk and mainly does the following:
    • Writes the translog to disk
    • Empties the index buffer, so the documents in it generate a new segment, which is equivalent to a refresh operation
    • Updates the commit point and writes it to disk
    • Runs fsync to write the in-memory segments to disk
    • Deletes the old translog files
  • Flush occurs in the following situations (it can also be triggered manually, as shown below):
    • When the interval is reached; the default is 30 minutes. In earlier versions the interval could be controlled by index.translog.flush_threshold_period
    • When the translog exceeds index.translog.flush_threshold_size; the default is 512mb, and each index has its own translog
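A flush can likewise be triggered manually via the flush API (the index name is illustrative):

POST test_index/_flush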

Real-time document search – delete and update documents

  • Once a segment is generated it cannot be changed, so how do you delete a document?
    • Lucene maintains a .del file that records all deleted documents. Note that the .del file records the internal Lucene IDs of the documents
    • Documents listed in .del are filtered out before query results are returned
  • How do you update a document?
    • Delete the old document first, then create a new one

The overall view

Segment Merging

  • As the number of segments grows, queries become slower because more segments have to be searched in a single query
  • ES periodically merges segments in the background to reduce their number
  • A segment merge can also be forced manually through the force merge API, for example:
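A sketch of forcing the segments of an index down to a single segment (the index name and parameter value are illustrative; this is typically done only on indices that are no longer being written to):

POST test_index/_forcemerge?max_num_segments=1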

2. Cluster Tuning recommendations for Elasticsearch

1. Suggestions for deploying the production environment

  • Follow the official recommendations and set all the system parameters
  • See the documentation section “Setup Elasticsearch -> Important System Configuration”; a sketch of the corresponding settings follows this list:
    • Disable swapping
    • Increase file descriptors
    • Ensure sufficient virtual memory
    • Ensure sufficient threads
    • JVM DNS cache settings
    • Temporary directory not mounted with noexec
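A sketch of the corresponding system settings, assuming a Linux host and an elasticsearch service user (the values follow the official recommendations but should be adapted to your environment):

# elasticsearch.yml – lock the JVM heap in memory to disable swapping
bootstrap.memory_lock: true

# /etc/security/limits.conf – raise file descriptor and thread limits for the ES user
elasticsearch  -  nofile  65535
elasticsearch  -  nproc   4096

# /etc/sysctl.conf – ensure enough virtual memory areas for mmap
vm.max_map_count = 262144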

Keep the ES Settings as simple as possible

  • In elasticsearch.yml, write only the necessary parameters; parameters that can be set dynamically through the API should be set through the API
  • See the documentation section “Setup Elasticsearch -> Important Elasticsearch Configuration”:
    • Path settings
    • Cluster name
    • Node name
    • Network host
    • Discovery settings
    • Heap size
    • Heap dump path
    • GC logging
    • Temp directory
  • As ES versions are upgraded, many configuration parameters circulating online are no longer supported, so do not blindly copy other people's cluster configuration

Basic parameters recommended for elasticsearch.yml

  • cluster.name
  • node.name
  • node.master / node.data / node.ingest
  • network.host: it is recommended to explicitly specify the internal IP address rather than setting it to 0.0.0.0
  • discovery.zen.ping.unicast.hosts: set to the addresses of the other nodes in the cluster
  • discovery.zen.minimum_master_nodes: generally set to 2 (the quorum for three master-eligible nodes)
  • path.data / path.logs
  • In addition to the above parameters, add other static parameters as required; a minimal example follows
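A minimal elasticsearch.yml sketch covering the parameters above (IP addresses, paths, and node roles are illustrative):

cluster.name: my_cluster
node.name: node1
node.master: true
node.data: true
node.ingest: false
network.host: 10.0.1.101
discovery.zen.ping.unicast.hosts: ["10.0.1.101", "10.0.1.102", "10.0.1.103"]
discovery.zen.minimum_master_nodes: 2
path.data: /data/elasticsearch
path.logs: /var/log/elasticsearch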

Basic parameters set dynamically

  • Dynamic settings come in two flavors, transient and persistent. The former is lost after a full cluster restart, while the latter is not. Both override elasticsearch.yml, for example:
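Dynamic cluster settings are applied through the cluster settings API (the specific settings and values here are illustrative):

PUT _cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "100mb"
  },
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}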

About JVM memory Settings

  • Do not set the JVM heap above 31GB
  • Leave half of the machine's memory to the operating system for the file system cache
  • The heap size is estimated from the amount of data the node will store; to ensure performance, a memory-to-data ratio is recommended:
    • For search workloads, the recommended ratio is at most 1:16
    • For logging workloads, the ratio is 1:48 to 1:96
  • Assume the total data volume is 1TB with 3 nodes and 1 replica. Each node then needs to store 2TB / 3 ≈ 666GB, roughly 700GB; with 20% extra space reserved, each node needs to hold about 850GB of data
    • For a search workload, each node would need 850GB / 16 ≈ 53GB of heap, which exceeds 31GB. With a 31GB heap a node can hold at most 31 x 16 = 496GB of data, so at least 5 nodes are required
    • For a logging workload, each node needs 850GB / 48 ≈ 18GB of heap, so 3 nodes are sufficient

2. Optimize write performance

The ES data write process

  • refresh
    • Writing segments to disk is still time-consuming. Using the file system cache, a segment can first be created and opened for queries in the cache, which further improves search visibility. This process is called refresh in ES
    • Before refresh, documents are stored in a buffer; during refresh, all documents in the buffer are cleared and a segment is generated
    • By default ES refreshes every second, so documents become searchable within one second, which is why ES is called Near Real Time
  • translog
    • If the node goes down before an in-memory segment has been written to disk, the documents in it cannot be recovered. How is this problem solved?
    • ES introduces the translog mechanism: when a document is written to the buffer, the operation is also written to the translog
    • The translog file is written to disk immediately (fsync). In 6.x every request is written to disk by default; this can be changed to write every 5 seconds, at the risk of losing up to 5 seconds of data
    • ES checks the translog file on startup and recovers data from it
  • flush
    • Responsible for writing the in-memory segments to disk, mainly doing the following:
    • Writes the translog to disk
    • Empties the index buffer, so the documents in it generate a new segment, which is equivalent to a refresh operation
    • Updates the commit point and writes it to disk
    • Runs fsync to write the in-memory segments to disk
    • Deletes the old translog files

Write performance optimization

  • The goal is to increase write throughput – the higher the EPS (events per second), the better
  • Optimization approach:
    • Client side: multi-threaded writes and bulk writes (see the bulk example below)
    • ES side: on the premise of high-quality data modeling, the tuning mainly involves the refresh, translog, and flush settings
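A sketch of a client-side bulk write using the bulk API (index name, type, and document fields are illustrative; in practice each bulk request typically carries a few MB of data and multiple threads send such requests in parallel):

POST _bulk
{ "index": { "_index": "test_index", "_type": "_doc", "_id": "1" } }
{ "user": "alice", "age": 30 }
{ "index": { "_index": "test_index", "_type": "_doc", "_id": "2" } }
{ "user": "bob", "age": 25 }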

refresh

  • The goal is to reduce the refresh frequency
    • Increase refresh_interval to trade search visibility for more documents processed per refresh. The default is 1s; setting it to -1 disables automatic refresh
    • Increase the index buffer size via indices.memory.index_buffer_size (a static parameter that must be set in elasticsearch.yml); the default is 10% of the JVM heap. Both settings are sketched below
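A sketch of both settings; the refresh interval is a dynamic index setting, while the index buffer size is a static node setting (the index name and the 20% value are illustrative):

PUT test_index/_settings
{
  "index.refresh_interval": "-1"
}

# elasticsearch.yml (static, applies to the whole node)
indices.memory.index_buffer_size: 20%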

translog

  • The goal is to reduce how often the translog is written to disk, improving write efficiency at the cost of disaster-recovery capability
    • Set index.translog.durability to async and set index.translog.sync_interval as needed, e.g. 120s, so the translog is only written to disk every 120 seconds
    • index.translog.flush_threshold_size defaults to 512mb; a flush is triggered when the translog exceeds this size, so it can be raised to flush less often (see the sketch below)
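A sketch of these translog settings (the values are illustrative; depending on the ES version, index.translog.sync_interval may have to be set at index creation rather than updated dynamically):

PUT test_index/_settings
{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "120s",
  "index.translog.flush_threshold_size": "1gb"
}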

flush

  • The goal is to reduce the number of flushes. In 6.x there is little left to tune here, as most of it is handled automatically by ES

other

  • Set the number of replicas to 0 during the initial bulk load and add replicas back after writing is complete
  • Design a reasonable number of shards and ensure shards are evenly distributed across all nodes, so that all node resources are fully utilized
    • index.routing.allocation.total_shards_per_node limits the total number of shards of an index that can be allocated on a single node
    • With 5 nodes and an index with 10 primary shards and 1 replica, what should this value be?
      • (10 + 10) / 5 = 4
      • Set it to 5 instead, to leave headroom so that shards can still be relocated if a node goes offline (see the example below)
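A sketch of the corresponding index settings for the example above (the index name is illustrative):

# while bulk loading: no replicas, at most 5 shards of this index per node
PUT test_index/_settings
{
  "index.number_of_replicas": 0,
  "index.routing.allocation.total_shards_per_node": 5
}

# after the initial load completes, add the replicas back
PUT test_index/_settings
{
  "index.number_of_replicas": 1
}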

Write performance optimization

  • The optimizations above are mainly index-level settings

3. Optimize the read performance

  • Read performance is mainly affected by the following aspects:
    • Does the data model match the business model?
    • Is the data volume too large?
    • Is the index configuration optimized?
    • Is the query statement optimized?

Data modeling

  • High-quality data modeling is the foundation of optimization
    • Pre-compute values that would otherwise be calculated dynamically with scripts and store them as fields in the document
    • Keep the data model as close to the business model as possible

The data size

  • Set different SLAs for different data volumes
    • The performance of tens of thousands of documents and tens of millions of documents is necessarily different

Index configuration tuning

  • Index configuration optimization mainly includes the following:
    • Set a reasonable number of primary shards based on the data volume; the most suitable number can be found by testing
    • Set a reasonable number of replicas; more is not always better

Query statement tuning

  • Common approaches to query statement tuning:
    • Use filter context as much as possible to reduce scoring; filters have a caching mechanism, which greatly improves query performance (see the query sketch below)
    • Avoid using scripts for field calculation or score-based sorting
    • Use the Profile and Explain APIs to analyze the root cause of slow queries, then optimize the data model
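A sketch of a query in filter context (index and field names are hypothetical): the two conditions only filter documents, without computing relevance scores, so the results can be cached.

GET test_index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "status": "published" } },
        { "range": { "publish_date": { "gte": "2020-01-01" } } }
      ]
    }
  }
}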

4. How do I set the shard number

  • ES performance scales roughly linearly, so you only need to measure the performance of a single shard and then derive the required number of shards from the actual performance requirements. For example, if one shard sustains a write EPS of 10,000 and the production requirement is 50,000 EPS, then 5 shards are needed (replicas also have to be considered in practice)
  • The process for benchmarking a single shard is as follows:
    • Set up a single-node cluster with the same configuration as the production environment
    • Create an index with one shard and zero replicas
    • Write real production data to measure the write performance
    • Query the data to measure the read performance
  • esrally can be used as the load-testing tool

5. X-Pack monitoring feature

  • X-Pack provides an official, free cluster monitoring feature

Finally

You are welcome to follow my WeChat official account, where we can learn and make progress together.