You can run the _cluster/health command to query the status of the Elasticsearch cluster:

GET _cluster/health

Under normal conditions the cluster reports a healthy status, which is green. For a description of the status colors, please refer to my previous article “Some important concepts of Elasticsearch: Cluster, node, index, Document, Shards and Replica”. When the cluster has unassigned shards or missing data, its status will be yellow or red.

  • Red: at least one primary shard is not assigned in the cluster
  • Yellow: all primary shards are assigned, but at least one replica shard is not assigned
  • Green: all shards are assigned
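
The three rules above can be written as a small function. This is only a sketch to make the rules concrete: the `primary` and `assigned` keys are a hypothetical representation chosen for illustration, not fields returned by Elasticsearch.

```python
def cluster_status(shards):
    """Return 'red', 'yellow', or 'green' for a list of shard copies.

    Each shard copy is a dict with two hypothetical keys:
    'primary' (bool) and 'assigned' (bool).
    """
    # Any unassigned primary makes the cluster red.
    if any(s["primary"] and not s["assigned"] for s in shards):
        return "red"
    # All primaries are assigned; any unassigned replica makes it yellow.
    if any(not s["assigned"] for s in shards):
        return "yellow"
    # Every shard copy is assigned.
    return "green"


shards = [
    {"primary": True, "assigned": True},
    {"primary": False, "assigned": False},
]
print(cluster_status(shards))  # yellow
```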

The _cluster/health command returns the following result:

{
  "cluster_name" : "my_cluster",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 104,
  "active_shards" : 104,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 60,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 63.41463414634146
}
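
As a quick sanity check, the active_shards_percent_as_number value can be reproduced from the shard counts in the same response. The sketch below assumes the percentage is the number of active shards divided by the sum of active, initializing, and unassigned shards. That assumption matches the numbers here, but it is not taken from the Elasticsearch source.

```python
def active_shards_percent(health):
    """Percentage of shards that are active, computed from a health response.

    Assumption: total = active + initializing + unassigned shards.
    """
    total = (
        health["active_shards"]
        + health["initializing_shards"]
        + health["unassigned_shards"]
    )
    if total == 0:
        # No shards at all: treat the cluster as fully active.
        return 100.0
    return health["active_shards"] / total * 100


# Shard counts copied from the response above.
health = {"active_shards": 104, "initializing_shards": 0, "unassigned_shards": 60}
print(round(active_shards_percent(health), 2))  # 63.41
```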

The health response above reports a status of red, which means that at least one primary shard is unassigned and some data is unavailable. So how do we find out which index and which shard is the problem?

We can use the following command to view the health of each index:

GET _cluster/health?level=indices

The command above reports health per index, so we can determine which index or indices are the problem. In its result, the restored_logs_4 index has a status of red.
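
With many indices, scanning the index-level response by eye gets tedious. Here is a minimal sketch that filters a parsed level=indices response down to the indices that are not green. The function name and the `other_index` entry are hypothetical, and the sample is trimmed to the fields the function reads.

```python
def unhealthy_indices(health):
    """Map index name -> status for every index that is not green."""
    return {
        name: info["status"]
        for name, info in health.get("indices", {}).items()
        if info["status"] != "green"
    }


# Trimmed sample response; "other_index" is made up for illustration.
health = {
    "status": "red",
    "indices": {
        "restored_logs_4": {"status": "red"},
        "other_index": {"status": "green"},
    },
}
print(unhealthy_indices(health))  # {'restored_logs_4': 'red'}
```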

We can also query health at the shard level:

GET _cluster/health?level=shards

In the result of the command above, shard 0 of restored_logs_4 is in the red state. This happens if the shard has never been assigned, or if it was assigned but the node holding it has since lost the shard for some reason.

We can even query a single index directly to get the state of all its shards:

GET _cluster/health/restored_logs_4?level=shards

The command above returns:

{
  "cluster_name" : "my_cluster",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 2,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : ...,
  "indices" : {
    "restored_logs_4" : {
      "status" : "red",
      "number_of_shards" : 1,
      "number_of_replicas" : 1,
      "active_primary_shards" : 0,
      "active_shards" : 0,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 2,
      "shards" : {
        "0" : {
          "status" : "red",
          "primary_active" : false,
          "active_shards" : 0,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 2
        }
      }
    }
  }
}
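
The shard-level response nests shards under each index, so a short loop can pull out the shards whose primary is not active. The sketch below assumes the response has already been parsed into a Python dict. The function name is hypothetical, and the sample keeps only the fields that the function reads.

```python
def inactive_primaries(health):
    """Return (index, shard_id) pairs whose primary shard is not active."""
    found = []
    for index_name, index in health.get("indices", {}).items():
        for shard_id, shard in index.get("shards", {}).items():
            if not shard["primary_active"]:
                # Shard ids are JSON object keys, so they arrive as strings.
                found.append((index_name, int(shard_id)))
    return found


# Trimmed copy of the shard-level response above.
health = {
    "indices": {
        "restored_logs_4": {
            "status": "red",
            "shards": {"0": {"status": "red", "primary_active": False}},
        }
    }
}
print(inactive_primaries(health))  # [('restored_logs_4', 0)]
```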

To find out the cause, we can use the cluster allocation explain API:

GET _cluster/allocation/explain

In practice, we pass a request body that names the index, the shard number, and whether we mean the primary, to get the allocation explanation for one specific shard, such as:

GET _cluster/allocation/explain
{
  "index": "restored_logs_4",
  "shard": 0,
  "primary": true
}

The command above shows the result:

{
  "index" : "restored_logs_4",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2020-10-05T08:08:54.241Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
  "node_allocation_decisions" : [
    {
      "node_id" : "Ohi9yhffThGZ5X8gq4AXLw",
      "node_name" : "node1",
      "transport_address" : "127.0.0.1:9300",
      "node_attributes" : {
        "ml.machine_memory" : "34359738368",
        "xpack.installed" : "true",
        "transform.node" : "true",
        "ml.max_open_jobs" : "20",
        "my_rack" : "rack1"
      },
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    }
  ]
}

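From the response above, we can see why the shard assignment was unsuccessful: the allocate_explanation field says that a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster, and the only node, node1, reports "found" : false for its store.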
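
The explain response is verbose, so it can help to reduce it to the few fields that matter for diagnosis. Below is a minimal sketch that works on the parsed response. The function name and the shape of the summary dict are hypothetical choices for illustration, and the sample is a trimmed copy of the response above.

```python
def summarize_explain(explain):
    """Condense an allocation explain response into a small dict."""
    # Nodes whose store reports no on-disk copy of the shard.
    # A decision without a "store" section is not counted.
    nodes_without_copy = [
        decision["node_name"]
        for decision in explain.get("node_allocation_decisions", [])
        if not decision.get("store", {}).get("found", True)
    ]
    return {
        "shard": f'{explain["index"]}[{explain["shard"]}]',
        "reason": explain["unassigned_info"]["reason"],
        "can_allocate": explain["can_allocate"],
        "nodes_without_copy": nodes_without_copy,
    }


# Trimmed copy of the explain response above.
explain = {
    "index": "restored_logs_4",
    "shard": 0,
    "primary": True,
    "unassigned_info": {"reason": "CLUSTER_RECOVERED"},
    "can_allocate": "no_valid_shard_copy",
    "node_allocation_decisions": [
        {"node_name": "node1", "node_decision": "no", "store": {"found": False}}
    ],
}
summary = summarize_explain(explain)
print(summary["shard"], summary["reason"])  # restored_logs_4[0] CLUSTER_RECOVERED
print(summary["nodes_without_copy"])        # ['node1']
```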

In practical use, we can also wait for the cluster to reach a given status, for example:

GET _cluster/health?wait_for_status=yellow

The above call blocks until the cluster status becomes yellow or better. If that does not happen before the timeout expires (30 seconds by default), the call returns with timed_out set to true.
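
Outside of Kibana, the same wait can be scripted. The sketch below uses only the Python standard library and assumes an unsecured cluster reachable at a base URL such as http://localhost:9200. The helper names `health_url` and `wait_until` are hypothetical. Some Elasticsearch versions answer a timed-out wait with HTTP 408, so the sketch reads the body in that case as well.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen


def health_url(base_url, wait_for_status=None, timeout=None, index=None):
    """Build a _cluster/health URL with optional wait parameters."""
    path = "/_cluster/health"
    if index:
        path += "/" + index
    params = {}
    if wait_for_status:
        params["wait_for_status"] = wait_for_status
    if timeout:
        params["timeout"] = timeout
    query = "?" + urlencode(params) if params else ""
    return base_url.rstrip("/") + path + query


def wait_until(base_url, status, timeout="30s"):
    """Block until the cluster reaches `status` or the server-side wait ends.

    Returns (reached, current_status).
    """
    url = health_url(base_url, wait_for_status=status, timeout=timeout)
    try:
        # Client-side socket timeout, chosen longer than the default wait.
        with urlopen(url, timeout=60) as response:
            body = json.load(response)
    except HTTPError as err:
        if err.code != 408:
            raise
        # A timed-out wait may still carry the health body.
        body = json.load(err)
    return (not body["timed_out"], body["status"])


print(health_url("http://localhost:9200", wait_for_status="yellow", timeout="50s"))
# http://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=50s
```

Calling `wait_until("http://localhost:9200", "yellow")` would then return a pair such as `(False, "red")` when the wait ends before the status improves.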