ElasticSearch is designed as a distributed search engine built on top of Lucene. The core idea is to run multiple ES process instances on multiple machines so that they form a single ES cluster, while ES hides the complexity of the underlying distributed mechanisms from the user. Below I analyze the distributed principles of ES.

Once an ElasticSearch node starts, it discovers and connects to the other nodes in the cluster (older versions use multicast discovery; newer versions use unicast seed hosts). Within a cluster, one node is elected as the master node. The master is responsible for managing the cluster state and for reallocating index shards to the appropriate nodes when the cluster topology changes.
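
To see which node currently holds the master role, you can ask the cluster directly. A minimal sketch, assuming any node of the cluster is reachable at http://localhost:9200 (the endpoint is an assumption for illustration):

```python
# Minimal sketch: ask the cluster which node is the elected master.
# Assumes a node is reachable at http://localhost:9200.
import requests

ES = "http://localhost:9200"  # assumed endpoint

# _cat/master prints the id, host, ip and name of the elected master node
print(requests.get(f"{ES}/_cat/master?v").text)

# _cat/nodes lists every node; the elected master is marked with '*'
print(requests.get(f"{ES}/_cat/nodes?v&h=name,node.role,master").text)
```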

Data sharding, node expansion and fault tolerance

Figure 1. Empty cluster

Figure 2. Primary shards

Figure 3. Primary and replica shards

In Figure 1, the ES cluster has only one master node, which is an empty node holding no shard data. In Figure 2, there is still only one node, and the index is divided into three primary shards; since there is only one node, all of the shards can only live on Node1. If the index is configured with three primary shards and three replicas, none of the three replicas will be active, because a primary and its replica cannot be placed on the same node and there is no other node to hold them. Figure 3 has two nodes: the three primary shards are on Node1 and the three replicas are on Node2. Replica shards provide fault tolerance and carry part of the read load.
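
The single-node situation in Figure 2 can be reproduced directly. Below is a minimal sketch, assuming a one-node cluster at http://localhost:9200 and an example index name my_index: it creates an index with three primary shards and one replica each, and the cluster health comes back yellow because the replicas cannot be allocated.

```python
# Minimal sketch: create an index with 3 primaries + 1 replica on a single node.
# "my_index" and the endpoint are example assumptions.
import requests

ES = "http://localhost:9200"

requests.put(f"{ES}/my_index", json={
    "settings": {
        "number_of_shards": 3,     # fixed once the index is created
        "number_of_replicas": 1    # can be changed later
    }
})

health = requests.get(f"{ES}/_cluster/health/my_index").json()
print(health["status"])             # "yellow": primaries active, replicas unassigned
print(health["unassigned_shards"])  # replica shards waiting for a second node
```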

The master node

Figure 4. Node expansion

If a new node, Node3, is added to the ES cluster, the shards automatically rebalance across the nodes. Figure 4 shows that when Node3 joins, P0 (primary shard 0) and R2 (replica shard 2) are automatically relocated to Node3.
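
The effect of rebalancing can be observed with the _cat/shards API. A minimal sketch, reusing the assumed endpoint and example index from above:

```python
# Minimal sketch: list every shard of the example index with its type
# (p = primary, r = replica), state and hosting node, to see where
# shards ended up after a node joined the cluster.
import requests

ES = "http://localhost:9200"  # assumed endpoint

print(requests.get(f"{ES}/_cat/shards/my_index?v&h=shard,prirep,state,node").text)
```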

After expansion, each node holds fewer shards, which means each shard gets a larger share of that node's resources (IO, CPU and memory), and the system as a whole performs better. If scaling out hits the limit set by the shard count, for example the 6 shards of Figure 4 spread across 9 nodes so that each shard sits on its own node while 3 nodes remain idle, we can increase the number of replica shards (the number of primary shards is fixed when the index is created, but the number of replicas can be changed afterwards). Raising the replica count to six gives nine shards in total, so each machine holds exactly one shard and has exclusive use of that machine's resources.
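
Changing the replica count on a live index goes through the index settings API. A minimal sketch, again assuming the example index my_index from above:

```python
# Minimal sketch: the primary shard count is fixed at creation time,
# but the replica count can be updated dynamically via _settings.
import requests

ES = "http://localhost:9200"  # assumed endpoint

requests.put(f"{ES}/my_index/_settings", json={
    # 2 replicas per primary = 6 replica shards in total for a 3-primary index
    "index": {"number_of_replicas": 2}
})
```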

Figure 5. Scaling out with replica shards

Figure 7. Fault tolerance (fault recovery)

In Figure 7, after Node1 fails:

  • The cluster automatically elects another node as the new master.
  • The new master promotes a replica shard of each lost primary shard to primary. The cluster status then becomes yellow: all primary shards are active again, but some replica shards are missing, so not all replica shards are active.
  • When the failed node is restarted, the new master copies the missing replica shards back to it; the node reuses the shard data it already holds and only synchronizes the changes made during the outage. Once every primary and replica shard is complete, the cluster status returns to green. The recovery can be watched with the sketch below.
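
To watch the recovery described above, the cluster health API can block until the desired status is reached. A minimal sketch, assuming the same endpoint as before:

```python
# Minimal sketch: wait (up to a timeout) for the cluster to return to "green",
# i.e. until every primary and replica shard is active again after recovery.
import requests

ES = "http://localhost:9200"  # assumed endpoint

health = requests.get(
    f"{ES}/_cluster/health",
    params={"wait_for_status": "green", "timeout": "60s"},
).json()
print(health["status"], health["unassigned_shards"])
```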

Setting the number of shards properly

1. It depends on the actual business scenario. For indices at or below the ten-million document level, the default of five shards is sufficient. At the hundreds-of-millions level, load testing is required; only increase the shard count when the business finds the query latency unacceptable and the current number of shards is the bottleneck.

2. A rule of thumb is that each shard should not exceed roughly 30 GB. If shard size becomes a performance bottleneck, consider splitting the index.
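
A back-of-the-envelope sketch of how the 30 GB rule of thumb translates into a shard count (the expected data volume below is a hypothetical example):

```python
# Minimal sketch: derive a primary shard count from an estimated index size
# so that no shard grows far beyond the ~30 GB rule of thumb.
import math

MAX_SHARD_SIZE_GB = 30           # rule-of-thumb upper bound per shard
expected_index_size_gb = 200     # hypothetical estimate

primary_shards = math.ceil(expected_index_size_gb / MAX_SHARD_SIZE_GB)
print(primary_shards)            # -> 7 primary shards for ~200 GB of data
```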