The author of this technology column (Qin Kaixin) focuses on demystifying core big data and container cloud technologies, has 5 years of experience building industrial IoT big data cloud platforms, and can provide full-stack big data + cloud native platform consulting solutions. Please continue to follow this blog series. QQ email address: [email protected]. For academic exchange, please feel free to get in touch.

1. Optimize Disk-based Shard Allocation Parameters

  • During shard allocation, ES takes the available disk space of each node fully into account.
  • cluster.routing.allocation.disk.threshold_enabled: defaults to true; if set to false, disk-based allocation decisions are disabled.
  • cluster.routing.allocation.disk.watermark.low: controls the low watermark for disk usage, 85% by default; if a node's disk usage exceeds 85%, no more shards will be allocated to that node (see the example after this list).
  • cluster.routing.allocation.disk.watermark.high: controls the high watermark for disk usage, 90% by default; if a node's disk usage exceeds 90%, some of the shards on that node will be moved away.
  • cluster.info.update.interval: the interval at which ES checks the disk usage of each node in the cluster, 30s by default.
  • cluster.routing.allocation.disk.include_relocations: defaults to true, meaning that when ES computes a node's disk usage, it also counts the shards currently being relocated to that node.
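These thresholds are dynamic cluster settings, so they can be adjusted at runtime through the cluster settings API. Below is a minimal sketch; the values shown are illustrative, not recommendations:

    PUT _cluster/settings
    {
      "transient": {
        "cluster.routing.allocation.disk.watermark.low": "80%",
        "cluster.routing.allocation.disk.watermark.high": "90%",
        "cluster.info.update.interval": "1m"
      }
    }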

2. Optimize Shard Allocation Awareness

2.1 Basic Configuration

  • If one physical machine runs multiple virtual machines, and multiple ES nodes run on those virtual machines, or if nodes are spread across multiple racks or machine rooms, then several ES nodes may end up on the same physical machine, on the same rack, or in the same machine room, and those nodes may crash together if that physical machine, rack, or machine room fails.
  • If ES can sense the physical layout of the hardware, it can ensure that a primary shard and its replica shards are allocated to different physical machines, racks, or machine rooms, thus minimizing the risk that a physical machine, rack, or machine room crash takes out both copies.
  • Shard allocation awareness tells ES about our hardware architecture. A rack ID can be passed on the command line when starting a node:

    ./bin/elasticsearch -Enode.attr.rack_id=rack_one

    It can also be set in elasticsearch.yml:

    cluster.routing.allocation.awareness.attributes: rack_id
    node.attr.rack_id: rack_one

  • In the two lines above, the first sets the name of the rack ID attribute, and the second sets this node's specific rack ID under that attribute name. If two nodes are started on the same rack and an index with 5 primary shards and 5 replica shards is created, the shards are allocated across those two nodes.
  • If two more nodes are started and assigned to another rack, ES will move shards to the new nodes to ensure that a primary shard and its replica shard are not on the same rack. However, if rack 2 fails, shards will still be allocated on rack 1 during recovery in order to restore the cluster.
  • Prefer-local-shard mechanism: with shard awareness enabled, search and get requests are executed using local shards, that is, shards in the same awareness group, so that requests stay within one rack or machine room as far as possible instead of crossing racks or machine rooms.
  • You can specify multiple awareness attributes, such as rack ID and zone, similar to the following (a fuller per-node sketch follows this list):

    cluster.routing.allocation.awareness.attributes: rack_id,zone
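Putting these pieces together, a minimal elasticsearch.yml sketch for one node might look like the following; the attribute values rack_one and zone1 are hypothetical:

    # declare which node attributes allocation awareness should use
    cluster.routing.allocation.awareness.attributes: rack_id,zone
    # this node's own values for those attributes (hypothetical)
    node.attr.rack_id: rack_one
    node.attr.zone: zone1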

2.2 Forced Awareness

  • Now suppose we have two machine rooms, with enough hardware resources in total to hold all the shards, but each machine room alone can only hold half of the shards, not all of them. Using plain awareness alone, if one machine room fails, ES will allocate all the shards that need to be recovered to the remaining machine room, whose hardware resources are not enough to hold all of them.
  • The forced awareness feature solves this problem, because it will never allow all of the shards to be allocated within a single machine room.
  • For example, suppose the awareness attribute is called zone and there are two machine rooms, zone1 and zone2. Look at the following configuration:

    cluster.routing.allocation.awareness.attributes: zone
    cluster.routing.allocation.awareness.force.zone.values: zone1,zone2

    Now, if 2 nodes are started in the zone1 machine room and an index with 5 primary shards and 5 replica shards is created, only the 5 primary shards are allocated, in zone1. The replica shards will only be allocated once we start a batch of nodes in the zone2 machine room.
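To observe this behavior, one option (a sketch; index_1 is a hypothetical index name) is to list the index's shards with the cat shards API:

    GET _cat/shards/index_1?v

Each row shows the index, the shard number, whether the copy is a primary (p) or a replica (r), its state, and the node it lives on; under forced awareness the replicas stay in the UNASSIGNED state until nodes in zone2 join.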

2.4 Shard Allocation Filtering

  • Shard allocation filtering lets you control which nodes an index's shards are, or are not, allocated to. Typically, when you want to take a node offline, this feature prevents shards from being allocated to the node that is going offline, and ES will also move the shards already on that node to other nodes.
  • You can use the following command to decommission a node, since shards will no longer be allocated to it:
    PUT _cluster/settings
    {
      "transient" : {
        "cluster.routing.allocation.exclude._ip" : "10.0.0.1"
      }
    }

2.5 Delayed Shard Allocation When a Node Goes Offline

  • If a node is removed from the cluster, the master does the following. These operations protect the cluster against data loss by ensuring that every shard keeps enough replica shards:

    (1) If a primary shard was on that node, the master promotes a replica shard on another node to primary.
    (2) New replica shards are allocated to bring the replica count back up.
    (3) Shards are rebalanced across the remaining nodes to keep the load balanced.
  • However, this process may put heavy load on the cluster, including network load and disk I/O load. If the node went offline only temporarily and will rejoin soon (for example, after a transient fault), this immediately executed shard recovery process is unnecessary. Consider the following scenario:

    (1) A node loses its network connection to the cluster.
    (2) The master promotes the replica shards corresponding to the primary shards on that node to primary.
    (3) The master allocates new replica shards on other nodes.
    (4) Each new replica shard copies a complete replica of its primary shard over the network.
    (5) Many shards are moved to other nodes to rebalance the cluster.
    (6) After a few minutes, the node that lost its network connection reconnects to the cluster, and the master rebalances again because some shards now need to be allocated back to that node.
  • If the master had simply waited a few minutes, the lost node would have come back and the missing shards would have recovered automatically. Because the data is stored locally on that node, nothing needs to be re-copied or transmitted over the network, so this process is very fast.
  • index.unassigned.node_left.delayed_timeout: this parameter sets how long to wait before replicating and allocating new replica shards after a node goes offline. The default is 1m. You can change it with the following command:
    PUT _all/_settings
    {
      "settings": {
        "index.unassigned.node_left.delayed_timeout": "5m"
      }
    }
  • With delayed allocation enabled, you will see the following scenario instead:

    (1) A node loses its network connection to the cluster.
    (2) The master promotes the replica shards corresponding to the primary shards on that node to primary.
    (3) The master logs a message saying that allocation of the missing replica shards has been delayed, and for how long (1m by default).
    (4) The cluster stays yellow because there are not enough replica shards.
    (5) The lost node reconnects to the cluster within a few minutes, before the timeout expires.
    (6) The missing replica shards are allocated directly back to that node, using its local data.
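While the delay timer is running, the number of shards whose allocation has been postponed is visible through the cluster health API, in the delayed_unassigned_shards field of the response:

    GET _cluster/health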
  • If you are certain that a node will not return to the cluster, you can set the delay to zero manually so that ES does not wait for it:
    PUT _all/_settings
    {
      "settings": {
        "index.unassigned.node_left.delayed_timeout": "0"
      }
    }

2.6 Priority of Index Recovery

  • Unassigned shards are recovered in priority order, determined by: index.priority (higher values first), then index creation date (more recent first), then index name. For example:
    PUT index_3
    {
      "settings": {
        "index.priority": 10
      }
    }

2.7 Number of Shards per Node

cluster.routing.allocation.total_shards_per_node sets an upper limit on the number of shards that can be allocated to a single node; by default there is no limit (see the sketch below).
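A minimal sketch of setting this limit dynamically; the value 4 is illustrative, and setting it to -1 restores the unlimited default:

    PUT _cluster/settings
    {
      "transient": {
        "cluster.routing.allocation.total_shards_per_node": 4
      }
    }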

3 Summary

There is still a lot of work to do for production deployments. This article starts from the basic ideas and pulls the common problems together.


Qin Kaixin

Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please indicate the source.