Use a flow chart to understand the ZAB protocol in ZooKeeper

I want to know some distributed consistency protocol for a long time. I have been writing some drafts before, and then I want to update more content slowly. Non-data handling, knowledge level is limited, if there is any fallacy, welcome criticism.

Introduction to the

ZAB, which stands for ZooKeeper Atomic Broadcast, is the core algorithm used by ZooKeeper to maintain data consistency. Based on this protocol, ZK implements the system architecture of active/standby mode to keep the consistency of all copies in the cluster. For write requests that change the state of the service, synchronization is handled through a consistency protocol. Read requests can be returned on the local copy.

Basic concept

It is necessary to understand some concepts before familiarizing yourself with the process.

Concepts and Terminology

Leader: receives client requests, converts a client transaction request into a Proposal, and distributes the Proposal to all followers in the cluster. After that, the Leader waits for the feedback from all the followers. When half of the followers servers give correct feedback, also called Quorum, the Leader will send a Commit message to all the followers again and submit the Proposal.
Follower: follows the master, forwards write requests to the master, and is responsible for reading requests. Therefore, while cluster expansion increases read request performance, it decreases write performance because more machines need to be synchronized.
Proposal Proposal:<v, z>, v represents the value and z represents the zxID.
Commit: Commit a transaction.
Quorum arbitration: Generally refers to a half-body mechanism.
Oberver Observer: A form of membership, but does not participate in elections. It was introduced only for system scalability.
Epoch Epoch: The term of each Leader.

ZXID

Zookeeper Transaction Id (ZXID) is the ZAB protocol Transaction Id. It is a 64-bit integer during the entire process.

The lower 32 bits are a monotonically increasing counter, incrementing by 1 each time the Leader server generates a new transaction Proposal.
The high 32 bits represent the epoch number of the Leader, which is similar to Raft’s term. Whenever a new Leader is elected, the new Leader will fetch the zxID of the largest transaction Proposal from the local log, parse out the epoch and add 1. And will lower 32 position 0 to start the new ZXID.

Node status

In the ZAB protocol, each process can be in one of three states. There is also an Oberserving state, the state of the observer, which can be ignored.

Each node holds the following data,

The term

CEpoch: the Follower sends the epoch value of the Proposal for the last transaction it processed.
NewEpoch: the Leader generates a NewEpoch value based on receiving the epoch of the Follower.
Ack-e: Follower confirms to receive the Leader’s new epoch.
NewLeader: establishes the leader and sends the NewLeader message to other followers.
Ack-ld: Follower confirms to receive the NewLeader message from the Leader.
Commit-ld: submit the proposal of the new Leader.
Veto: the Leader starts a new transaction.
Ack: Follower confirms receiving the Proposal from the Leader.
Commit: The Leader sends a Proposal to the followers, asking all the followers to submit a transaction Proposal.

process

ZAB protocol mainly includes two processes: message broadcast and crash recovery. It can be divided into three stages: Discovery, Synchronization, and Broadcast. Below is an overview of the ZAB protocol. We break each phase into 3 sections and explain what is done in each section. The color arrows indicate the direction of communication between nodes, and the black arrows indicate the process of entering each phase.

Crash recovery

The ZAB Agreement goes into recovery mode and elects a new Leader when the following occurs.

Service Framework Restart
Network interruption
Crash out

The ZAB agreement goes into recovery mode and elects a new Leader. After a new Leader is elected and more than half of the machines in the cluster have completed data synchronization with the new Leader, the ZAB protocol exits the recovery mode.

The election process

There are three machines in the sample figure: Server 1, Server 2, and Server 3. The parentheses (3, 6) represent the vote, which contains sid (myID in the configuration file) and zxID. ZooKeeper uses a fast election algorithm by default. For details, see the source code and the above flow chart.

After a crash or restart, each node changes to the Looking state and the election process begins.
Each node votes for itself and then broadcasts the results to all nodes. That is to say,server 1Vote for yourself(3, 6)Broadcast to all nodes.
If the majority of nodes agree, then you are considered the Leader. All the nodes in the graph have been selected(3, 6)It then broadcasts to all nodes to determine that the Leader is Server 1.

found

Phase 1 CEpoch: the Follower sends the epoch value of the transaction Proposal it finally processed to the Leader, a messageCEpoch(F.p).F.pYou can extract the ZXID.
Phase 2 NewEpoch: after the Leader receives more than half of the CEpoch messages from the followers, the Leader generates the messageNewEpoch(e')To more than half of these followers,e'Is larger than any epoch value received from CEpoch messages, after all, the dynasty is changing.
Stage 3 ACK-E:
1. Once the Follower is received from the LeaderNewEpoch(e')Message, ifF.p < e'.
2. Once the Leader receives confirmation messages from more than half of the followers. It picks an F from the half of the followers and uses it as the set S’ (for synchronizing the cluster data) to initialize the transaction before ending the discovery phase.

How to select this Follower? Because if you want to select the set of transactions to be synchronized, you must choose the most complete transaction. Therefore, the epoch must be the largest and the ZXID the largest.

synchronous

After the discovery process is complete (i.e., the transaction set S’ of the data source Follower F is determined), the synchronization phase begins.

Phase 1 NewLeader: the Leader sends the new epoch and S’ to all the followers of the Quorum in the form of NewLeader(e’, S’) messages. In the previous stage, L. Story = F. story, so S’ is L. Story in the flowchart.
Phase 2 ACK-LD: After the Follower receives the NewLeader(e’, S’) message,
1. If the Follower epoch is equal toe'That is, to make sure that they are not the people of the Lord, because the previous stage has stored the lateste'. Followers will perform the transaction application operation and will receiveS'In all transaction proposals, note just receive.
2. If the Follower epoch is not equal toe'That is, not this round of followers go directly into the next generation cycle.
3. After receiving more than half of the followers’ ACK-LD messages, the Leader sends a Commit message to all the followers. The next stage is Broadcast.
Phase 3 commit-ld: After receiving the Commit message from the Leader, abDeliver (

) is called in sequence to process each transaction of S’, and then this phase is completed.
,>

News broadcast

After master/slave data synchronization is complete, the cluster is available for external services. The Leader is responsible for the write requests (if any fall on the followers, they are forwarded to the Leader).

Phase 1 Propose: After receiving a new transaction request from the client, the Leader will generate a corresponding transaction Proposal and send it to all followers according to the order of zxids (increasing)P<e', <v, z>>, includingepoch(z) == e'.
Phase 2 Ack: The Follower processes these proposals according to the order in which the messages are received, appends them to H, and then feeds them back to the Leader.
Stage 3 Commit: Once the Follower receives the message from the LeaderCommit(e', <v, z>)Message will be calledabdeliver(<v, z>)Commit the transaction<v, z>. Note that at this point the Follower must have committedz' < zPrevious transactions.

Then the cluster goes out to broadcast the message and normal external service status until the next election.

Broadcast process

After entering the message broadcast stage, the Leader will allocate a FIFO queue for each Follower to communicate, ensuring that one Follower can only keep in sync with one Leader at a time. The Leader and Follower are aware of each other through heartbeat detection.

There are two situations in which the Leader terminates the Leader of the current cycle,

If the Leader is offline, the Follower’s heartbeat packet times out and the Follower enters the Looking state.
Since the Leader has no more than half of the followers, the Leader himself enters the Looking state, and the followers who follow him also turn into the Looking state.

When an exception occurs, a new crash recovery process begins.

Broadcast process,

The Leader server receives the request from the Client and generates a Proposal for it.
The Proposal is then sent to all followers (from). The master assigns a separate queue to each slave to keep the messages in order.
The master waits for Ack responses from all slave servers, and when more than half of the slave servers have Ack responses, the master commits the local transaction. The Commit is then broadcast to all slaves, and the slave completes the Commit after receiving the Commit.

reference

ZAB paper
Zookeeper distributed collaboration technology
Principle and practice of distributed technology from Paxos to ZooKeeper
zookeeper wiki
ZooKeeper’s atomic broadcast protocol: Theory and practice
Stop referring to databases as CP or AP