I am participating in the creation activity of the Nuggets newcomer to start the road of writing together.

ZooKeeper ensures data consistency in distributed systems through the ZAB protocol.

ZAB agreement

ZooKeeper ensures data consistency through ZAB message broadcast, crash recovery, and data synchronization.

News broadcast

  1. After a transaction request (Write) comes in, the Leader wraps the Write request into a Proposal transaction and adds a globally unique 64-bit incrementing transaction ID, known as the Zxid (messages are ordered by comparing zxids).
  2. The Leader node broadcasts the Proposal transaction to other nodes in the cluster. The Leader node and the followers node are decoupled. The communication will pass through a FIFO message queue, and the Leader will allocate a separate FIFO queue to each Follower node. The Proposal is then sent to the queue;
  3. After receiving the Proposal, the Follower node persists it to the disk and sends an ACK to the Leader.
  4. When the Leader node receives more than half of the Follower nodes’ ACKS (Quorum mechanism), it will commit the transaction on the local machine and broadcast the COMMIT at the same time. After the Follower nodes receive the COMMIT, they will complete their own transaction commit.

The transaction request processing process of ZAB protocol is similar to a two-stage commit process. The first stage is broadcast transaction operation, and the second stage is broadcast commit operation. However, under this two-stage commit model, data inconsistency caused by the breakdown of the Leader node cannot be handled, such as the following two situations:

  1. No Follower Proposal1 crashes when the Leader (Server1) initiates a transaction Proposal1.
  2. When the Leader initiates a Proposal2 and receives an ACK from more than half of the followers, it crashes before it can send a Commit message to the Follower node.

ZAB protocol introduces crash recovery mode to solve the problem of Leader failure and data inconsistency caused by Leader failure. Crash recovery mode must address the following issues:

  1. When Server1 recovers and joins the cluster again, it must ensure that a Proposal1 is discarded, that the discarded message does not reappear.
  2. The new Leader elected must have the Proposal with the largest Zxid of all the machines in the cluster, that is, ensure that the messages that have been processed cannot be lost.

Crash recovery

When the Zookeeper cluster enters the crash recovery phase:

  1. The Leader node is elected during the crash recovery phase when the cluster service is started.
  2. If the Leader node breaks down suddenly or loses contact with more than half of the followers due to network problems, the cluster also enters the crash recovery mode.
Electing the Leader node
  1. All nodes are in the Looking state
  2. After the Leader is down, the remaining Follower nodes change their status to Looking (note that the Observer does not participate in the election) and then start the Leader election process.
  3. Each Server node issues a vote to participate in the election
  4. In the first vote, all servers vote for themselves, and each sends the vote to all the machines in the cluster.
  5. The cluster receives the votes from the various servers and begins to process the votes and elections
  6. The process of voting is the process of comparing zxids. Assuming that Server3 has the largest Zxid, Server1 determines that Server3 can be the Leader, and Server1 votes for Server3 based on the following criteria: If the epochs are equal, select the epoch with the largest ZXID. If both epochs and ZXIDS are equal, select the server with the largest ID.
  7. In the election process, if a node obtains more than half of the votes, it will become the Leader node; otherwise, it will vote again.
  8. If the election succeeds, the status of each node is Leading and Following.

The node in Zab has three states: following (the current node is the Follower node), leading (the current node is the Leader node), looking/election (the current node is in the election state). The transition between the Zab protocol message broadcast and crash recovery phases is accompanied by a node state transition.

Data synchronization

After the collapse recovery is completed, the next work is data synchronization. During the election process, the Leader node is confirmed to be the node with the largest Zxid by voting. In the synchronization stage, all copies in the cluster are synchronized with the latest Proposal history obtained by the Leader in the previous stage.

ZAB protocol is a typical implementation of CP in CAP theory. In its crash recovery stage, the Leader node election process and data synchronization process after the completion of the election are not provided for external services, which is to ensure the strong consistency of data at the cost of availability.