
Preface

Elasticsearch clusters need to be scaled out or in as the business changes.

Disks fail or power goes out, and machines have to be shut down for maintenance and restarted.

Thanks to Elasticsearch's decentralized design, all of this can be transparent to the business: the cluster keeps serving reads and writes normally through scale-out, scale-in, and maintenance downtime.

To achieve this, the cluster needs at least three nodes (an odd number such as three or five). Either three of the nodes are dedicated masters and the rest are data nodes, or there are three nodes that act as both master and data nodes.
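
If you are unsure which roles your nodes currently play, the cat nodes API shows them at a glance (node.role and master are standard _cat columns):

GET _cat/nodes?v&h=name,node.role,master

In the node.role column, m marks a master-eligible node and d a data node; the master column flags the elected master with *.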

Parameters

Data redundancy in ES depends on a single index setting: "number_of_replicas": "1"

With a 3-node cluster and this setting at 1, each shard has one primary and one replica, i.e. two copies of the data in total.

If one node fails, cluster reads and writes are unaffected. If two nodes fail, some indexes can no longer be written, and some indexes return only partial data on reads.
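
As a sketch, the replica count is a dynamic index setting and can be changed at any time; my-index here is a hypothetical index name:

PUT my-index/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}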

Cluster states

An ES cluster reports one of three states:

  • Green: all primary and replica shards are allocated. The cluster is 100% available.
  • Yellow: all primary shards are allocated, but at least one replica is missing. No data is lost and search results are still complete, but high availability is weakened. The cluster can still accept writes.
  • Red: at least one primary shard (and all of its replicas) is missing. Searches return only partial data, and write requests routed to that shard return an exception.
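
To see at a glance which state the cluster is in, the health response can be filtered down to just the status field (filter_path is a generic response-filtering parameter):

GET _cluster/health?filter_path=status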

Downtime maintenance

A maintenance operation is similar to a cluster upgrade: ideally the node's data stays where it is, so that after the node is shut down and restarted it can return to service immediately.

However, because of ES's decentralized design, as soon as a node goes down its data is automatically rebuilt on the remaining healthy nodes to keep the service available. If your data volume is large, this takes a long time and drives cluster I/O very high, so allocation must be disabled before maintenance.

Official documentation: www.elastic.co/guide/en/el…

Shard documentation: www.elastic.co/guide/en/el…

The following commands, like all the others in this article, are executed in Kibana Dev Tools.

1. Disable allocation

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

2. Perform a synced flush (optional)

POST _flush/synced

3. Shut down for maintenance

Stop the ES service, power off the machine, and then replace disks or perform whatever other maintenance is needed.

After stopping the node, the cluster status should be Yellow. If it is Red, something is wrong: check whether every index has its replica count set to at least 1.
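
A quick way to check is to list each index's replica count and health with the cat indices API; any index showing rep 0 will turn Red if the node holding its primary stops:

GET _cat/indices?v&h=index,rep,health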

Once the node is maintained, start the machine and the ES service, then check the cluster status.

systemctl stop elasticsearch.service
systemctl start elasticsearch.service

GET _cat/health
GET _cat/nodes

4. Re-enable allocation

Once allocation is re-enabled, you can watch the recovery progress.

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}

GET _cat/recovery

Scaling out and in

Scaling out

When a new node is added, it discovers the cluster automatically, the cluster rebalances, and data is redistributed evenly across the nodes.
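
After the new node starts, a simple sanity check is to list the cluster nodes and watch the per-node shard counts even out (both are standard cat APIs):

GET _cat/nodes?v
GET _cat/allocation?v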

Official documentation: www.elastic.co/guide/en/el…

Scaling in

Before scaling in, make sure the number of remaining nodes is still greater than the number of replicas and is odd.

Also ensure that every index has a replica count of at least 1. If an index has zero replicas and its shards sit on the node being removed, that index's data will be lost.

Use cluster-level shard allocation filtering to exclude the node: its shards are migrated off first, and once the migration completes the node can be stopped.

PUT _cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : "10.0.0.1"
  }
}

GET _cat/health
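
Once the node has been stopped and removed, remember to clear the filter, otherwise the exclusion stays in the cluster settings; this resets the same transient setting used above:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._ip": null
  }
}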

Official documentation: www.elastic.co/guide/en/el…

Troubleshooting

Failures show up when a node crashes abnormally, hangs and stays down for a long time, or is taken offline without allocation being disabled first.

UNASSIGNED

Most failures are allocation problems: shards left unassigned, the allocation retry limit reached, a nearly full disk blocking allocation, and so on.

  • INDEX_CREATED: Unassigned as a result of the API that created the index
  • CLUSTER_RECOVERED: Unassigned as a result of a full cluster recovery
  • INDEX_REOPENED: Unassigned as a result of opening a closed index
  • DANGLING_INDEX_IMPORTED: Unassigned as a result of importing a dangling index
  • NEW_INDEX_RESTORED: Unassigned as a result of restoring into a new index
  • EXISTING_INDEX_RESTORED: Unassigned as a result of restoring into a closed index
  • REPLICA_ADDED: Unassigned because a replica was explicitly added
  • ALLOCATION_FAILED: Unassigned because the shard allocation failed
  • NODE_LEFT: Unassigned because the node hosting the shard left the cluster
  • REINITIALIZED: Unassigned because the shard moved from started back to initializing (for example, with shadow replicas)
  • REROUTE_CANCELLED: Unassigned as a result of an explicit cancel reroute command
  • REALLOCATED_REPLICA: A better replica location was identified, so the existing replica allocation was cancelled

Debugging commands

View the cluster status

GET _cluster/health

View the status of shards

GET /_cluster/health?level=shards

Explain why a shard is unassigned

GET _cluster/allocation/explain?pretty

The following request returns an unassigned.reason column, which shows why each shard is unassigned.

GET _cat/shards?h=index,shard,prirep,state,unassigned.reason

View the shards being recovered

curl 'http://127.0.0.1:9200/_cat/recovery?active_only=true'

View the replica count of all indexes

curl 'http://127.0.0.1:9200/_cluster/health?level=indices' | jq .indices > test2222.txt
cat test2222.txt | jq '.[].number_of_replicas' | grep 0

View the details of all shards

curl 'http://127.0.0.1:9200/_cat/shards?v' > shared111.txt

View the maximum number of open file descriptors on each node

GET _nodes/stats/process?filter_path=**.max_file_descriptors

View node details

GET _nodes/process

Allocation retry limit reached

If a node is down for too long, shard allocation can fail repeatedly until the retry limit is reached, after which the shards are no longer allocated automatically even once the machine is back.
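
The limit comes from the dynamic index setting index.allocation.max_retries, which defaults to 5. As a sketch (my-index is a hypothetical index name), it can be raised per index:

PUT my-index/_settings
{
  "index.allocation.max_retries": 10
}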

Viewing the error message

Click the gray box above a shard in the head plugin to see the error message, or check the ES logs.

The solution

Official documentation: www.elastic.co/guide/en/el…

Then retry allocation of the shards that were blocked because too many consecutive allocation attempts failed:

POST _cluster/reroute?retry_failed=true

Shard recovery limit reached

reached the limit of incoming shard recoveries [2]

This error occurs when allocation was not disabled during the shutdown: while the machine was restarting, some shard replicas had already been copied to other nodes, so three copies of the data exist at that point.

No manual intervention is needed. Wait for the cluster to rebalance and it will delete the redundant copy on its own (generally the stalest one, i.e. the copy on the node that was stopped).
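
The limit in the message corresponds to the dynamic cluster setting cluster.routing.allocation.node_concurrent_incoming_recoveries, which defaults to 2. If recovery genuinely needs to go faster it can be raised, at the cost of higher I/O; treat this as a sketch, not a blanket recommendation:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": 4
  }
}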

Allocation disabled

replica allocations are forbidden due to cluster setting [cluster.routing.allocation.enable=primaries]

This error means allocation is disabled, usually by hand: most likely a previous maintenance forgot to re-enable it.

Re-enable allocation:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}

Primary and replica shards cannot be on the same node

the shard cannot be allocated to the same node on which a copy of the shard already exists

A primary shard and its replica cannot live on the same node. If this error appears, wait for the cluster to adjust itself; if it persists for a long time, manually allocate the shard to a specified node.

Official documentation: www.elastic.co/guide/en/el…

Do not run reroute unless the cluster fails to recover on its own; executing it carries a risk of data loss, so be very careful in production.

POST /_cluster/reroute
{
  "commands": [
    {
      "allocate_replica": {
        "index": "<index name>",
        "shard": 4,
        "node": "<node id>"
      }
    }
  ]
}