In the previous part, we analyzed how ES cluster chooses its master. A brief review is that Bully algorithm is optimized to try to avoid data inconsistency. So today, we will continue to analyze how data is stored in the cluster. We know that ES data copies are based on the master-slave pattern, that is, a data shard contains one master shard and multiple replica shards, so how does the master-slave pattern work? Keep reading.

Several terms

Before introducing the MASTER-slave pattern of ES, we need to understand a few key terms:

  • Replica Group: A collection of data that are copies of each other. Only one copy of the Secondary group is the Primary data, and the other copies are Secondary data.
  • Configuration: The Configuration information describes the replicas of a copy group, who the Primary is, and which node it is on.
  • Configuration Version: Indicates the Version of the Configuration information, which increases with each change.
  • Serial Number: indicates the sequence of write operations. Each write operation increases and is maintained by the primary copy. Serial Number for short is SN.
  • Prepared List: Prepared List of write operations. Store a List of external requests and sort the requests by SN. The SN of the write operations added to the List must be greater than the maximum SN
  • Committed List: Committed List of write operations.

Data writing process

The write process is summarized as all requests are routed to the master shard and then copy synchronization is performed. The detailed steps are as follows:

  1. The client write request arrives at the coordination node, and then passes the preliminary verification, and routes the request to the master shard.
  2. Once the master shard is reached, it is validated again, and then the write request is executed on the master shard;
  3. After the operation succeeds, the master shard synchronizes the request to all in-SYN shards concurrently.
  4. The write operation of all copy fragments such as the master copy succeeds and replies, and then returns to the coordination node.
  5. The coordination node replies the client after receiving the reply.

This is the normal write process, so think about it, what should happen if there is a failure in the process?

  • Master Shard Exception If the Master shard is abnormal, the Master shard sends a write request to the Master, and the Master rerouts the request to the new Master shard.
  • The Master removes the abnormal shard from the in-sync-Replica and informs the Master shard. Then the Master guides the Master to create new replica shards and synchronizes data.

Data synchronization between master shard and replica shard

Here is a brief introduction to the data copy policy in active/standby mode

  1. After the write request reaches the primary node, an SN is generated, and an UpdateRequest is constructed using the SN, which is stored in the local Prepare List
  2. After receiving the update from the slave node, the slave node inserts the update into the Prepare List and sends an ACK to the master node.
  3. After receiving all the replies from the slave nodes, the master sends the UpdateRequest to the Committed List, which is moved one bit later.
  4. The master sends a COMMIT request to all slave nodes, telling them where the committed List is, and then the slave nodes synchronize it

After the above four steps, the data consistency between the primary and secondary data can be basically guaranteed

Abnormal detection between active and standby nodes

Lease mechanism: The master copy periodically obtains the possible exceptions of the lease from the standby copy and the solutions:

  1. If the Master node does not receive a lease reply from a replica node within a certain lease period, it tells the Master to remove it and demote itself from being the Master node.
  2. If the replica node does not receive a lease request from the master node within a grace period, the configuration manager is told to remove the master node and promote itself as the master node. If multiple slave nodes compete for promotion, the slave node will be promoted as the master node

Theoretically, in the absence of clock drift, as long as the grace period >= lease period, the master node is guaranteed to perceive the lease expiration first. Therefore, when the new master node is generated, the old master node is already invalid and the brain split will not occur.

These are reflected in ES:

The active and standby nodes are the Master nodes in ES. Prepare List corresponds to local checkpoint Committed list corresponds to global checkpoint Each copy in ES has a local operation sequence number and a global operation sequence number. The master shard is responsible for advancing the global operation sequence number, and the local maintains the local operation sequence numberCopy the code

The last

The above is the simple analysis of ES data model, and the text and pictures will be supplemented later.