Zab

Zab is short for Zookeeper Atomic Broadcast. Zab guarantees the eventual consistency of distributed transactions.

Zab is an atomic broadcast protocol designed specifically for Zookeeper, with support for crash recovery. Zookeeper relies on Zab to achieve data consistency: based on this protocol, Zookeeper implements a master/slave (Leader/Follower) architecture that keeps the replicas in a cluster consistent.

Zab protocol core

There is only one Leader in a Zookeeper cluster, and only the Leader can handle transaction (write) requests from external clients. The Leader converts each such request into a transaction Proposal and then synchronizes the Proposal to all Follower servers (data broadcast/data replication).

The core guarantee Zookeeper gets from the Zab protocol is that once a Proposal has been committed on one server, it must be committed correctly on all servers. This reflects the consistency requirements of CAP/BASE.

Zab mode

Zab protocol has two modes: message broadcast mode and crash recovery mode.

Message broadcast mode

Data replication within a Zookeeper cluster uses the message broadcast mode. It is similar to, but different from, two-phase commit (2PC): 2PC requires the coordinator to wait for ACK messages from all participants before sending the commit message, so all participants must either succeed or fail together, which creates serious blocking problems.

In Zookeeper, when the Leader waits for ACK feedback from the Followers, it only needs successful feedback from more than half of them, rather than from all Followers.


Zookeeper's message broadcast procedure (a condensed sketch in Java follows the list):

  • The client initiates a write request
  • After receiving the request, the Leader server converts it into a Proposal and assigns it a globally unique, monotonically increasing ID, the ZXID
  • The Leader maintains a message queue for each Follower and sends the Proposal to every Follower through its queue
  • After a Follower takes the Proposal from its queue and writes it to its local transaction log, it sends an ACK back to the Leader
  • Once the Leader has received ACKs from more than half of the Followers, it considers the Proposal ready to commit
  • The Leader sends a Commit message to all Follower servers
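
The flow above can be condensed into a short sketch. The class and method names below are illustrative only; the real logic lives in Zookeeper's Leader and LearnerHandler classes and also handles persistence, epochs and failure cases.

import java.util.List;

// Simplified sketch of the Leader side of Zab message broadcast.
// FollowerChannel is a hypothetical stand-in for Zookeeper's per-Follower handler.
class BroadcastSketch {

    static class Proposal {
        final long zxid;           // globally unique, increasing ID (epoch part omitted here)
        final byte[] data;         // the serialized write operation
        int ackCount = 1;          // the Leader counts its own implicit ACK
        boolean committed = false;
        Proposal(long zxid, byte[] data) { this.zxid = zxid; this.data = data; }
    }

    interface FollowerChannel {    // one message queue per Follower
        void send(Proposal p);
        void sendCommit(long zxid);
    }

    private final List<FollowerChannel> followers;
    private final int clusterSize; // total number of voting servers, Leader included
    private long lastZxid = 0;

    BroadcastSketch(List<FollowerChannel> followers, int clusterSize) {
        this.followers = followers;
        this.clusterSize = clusterSize;
    }

    // Steps 2-3: turn a client write request into a Proposal and queue it to every Follower.
    Proposal propose(byte[] request) {
        Proposal p = new Proposal(++lastZxid, request);
        for (FollowerChannel f : followers) f.send(p);
        return p;
    }

    // Steps 5-6: on each Follower ACK, commit once more than half of the servers have acknowledged.
    void onAck(Proposal p) {
        p.ackCount++;
        if (!p.committed && p.ackCount > clusterSize / 2) {
            p.committed = true;
            for (FollowerChannel f : followers) f.sendCommit(p.zxid);
        }
    }
}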

Crash recovery mode

Once the Leader server crashes, or can no longer maintain normal communication with more than half of the Followers because of network problems, Zab enters crash recovery mode.

To ensure that all transactions in the Zookeeper cluster are executed in the same order, only the Leader server accepts write requests; the other servers forward any write requests they receive from clients to the Leader server for processing.

Zab protocol crash recovery must meet the following two requirements:

  • Proposals that have already been committed on the Leader must eventually be committed by all Follower servers
  • Proposals that were only proposed by the Leader but never committed must be discarded

In other words, the newly elected Leader must not contain any uncommitted Proposal; it must be chosen from the Follower nodes that hold the committed Proposals, i.e. the node with the highest ZXID. Therefore, during Leader election, the ZXID reported by each Follower is used as the basis for voting. The advantage is that the newly elected Leader is spared the work of checking which Proposals have been committed and which must be discarded.

Leader election algorithm

The algorithm used to elect the Zookeeper Leader is selected with the electionAlg configuration item, which accepts the following values:

  • 0: UDP-based LeaderElection
  • 1: FastLeaderElection based on UDP
  • 2: FastLeaderElection based on UDP and authentication
  • 3: FastLeaderElection based on TCP

As of version 3.4.10, the default value is 3 and the other three algorithms have been deprecated. The following sections focus on the TCP-based FastLeaderElection.

FastLeaderElection principle

myid

Each Zookeeper server must create a file named myid in its data directory. The file contains an ID that is unique within the Zookeeper cluster. For example, suppose a cluster contains three servers whose hostnames are zoo1, zoo2 and zoo3 and whose myid values are 1, 2 and 3 respectively. In the configuration file, each ID must correspond to its hostname: the number that follows server. is the myid.

server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888

ZXID

Similar to a transaction ID in an RDBMS, a ZXID identifies a Proposal. To preserve ordering, ZXIDs must increase monotonically, so Zookeeper represents them with 64-bit numbers: the high 32 bits hold the Leader's epoch, which starts at 1 and is incremented by 1 each time a new Leader is elected; the low 32 bits are a serial number within that epoch and are reset to 0 whenever the epoch changes. Together this guarantees that ZXIDs increase globally.
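
The epoch/counter split is just bit packing; below is a minimal sketch of the arithmetic (Zookeeper itself has a ZxidUtils helper for this):

// Sketch of the ZXID layout: the high 32 bits hold the Leader epoch,
// the low 32 bits hold the per-epoch counter.
public class ZxidSketch {
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xFFFFFFFFL);
    }
    static long epochOf(long zxid)   { return zxid >>> 32; }
    static long counterOf(long zxid) { return zxid & 0xFFFFFFFFL; }

    public static void main(String[] args) {
        long zxid = makeZxid(3, 7);                 // epoch 3, 7th proposal of that epoch
        System.out.println(epochOf(zxid));          // 3
        System.out.println(counterOf(zxid));        // 7
        // Electing a new Leader bumps the epoch and resets the counter,
        // so ZXIDs keep increasing across Leader changes.
        System.out.println(makeZxid(4, 0) > zxid);  // true
    }
}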

Server status

A Zookeeper server is in one of the following four states (sketched as an enum below):

  • Looking: the Leader is not yet determined. A server in this state considers that the cluster has no Leader and initiates a Leader election
  • Following: the current server's role is Follower and it knows which server is the Leader
  • Leading: the current server's role is Leader and it maintains heartbeat communication with the Followers
  • Observing: the current server's role is Observer. The only difference from a Follower is that it participates neither in Leader elections nor in the "more than half" voting on cluster write operations
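
These four states map naturally onto an enum; the sketch below mirrors how Zookeeper's QuorumPeer.ServerState models them:

// Rough equivalent of the four server states.
enum ServerState {
    LOOKING,    // no Leader known; the server is participating in an election
    FOLLOWING,  // acting as a Follower of a known Leader
    LEADING,    // acting as the Leader, heartbeating the Followers
    OBSERVING   // Observer: replicates state but never votes
}
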
Vote data structure

During Leader election, each server sends a vote containing the following key information (modeled as a small class in the sketch after the list):

  • logicClock: an incrementing integer maintained by each server, indicating how many rounds of voting the server has initiated
  • state: the current state of the server
  • self_id: the myid of the current server
  • self_zxid: the maximum ZXID of the data stored on the current server
  • vote_id: the myid of the server being proposed as Leader
  • vote_zxid: the maximum ZXID of the data stored on the proposed server
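
Put together, a vote can be modeled as a small value class. The field names follow the list above; the class itself is only a sketch (Zookeeper's own Vote and Notification classes are organized slightly differently) and reuses the ServerState enum sketched earlier.

// Sketch of the information carried by a single election vote.
class VoteSketch {
    long logicClock;    // election round this vote belongs to
    ServerState state;  // sender's state, see the enum sketched above
    long selfId;        // self_id: myid of the sender
    long selfZxid;      // self_zxid: largest ZXID stored on the sender
    long voteId;        // vote_id: myid of the proposed Leader
    long voteZxid;      // vote_zxid: largest ZXID stored on the proposed Leader
}
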
Increase the election round

Zookeeper requires that all valid votes be cast in the same round. Whenever a server starts a new round of voting, it first increments the logicClock it maintains.

Clear the ballot box

Each server empties its own ballot box, which records the votes it has received, before broadcasting its own vote.

Send the initial vote

Each server initially broadcasts a vote for itself.

Receiving external votes

Each server tries to obtain votes from the other servers and records them in its own ballot box. If it cannot obtain any external votes, it checks whether it has valid connections to the other servers in the cluster: if so, it sends its own vote again; if not, it immediately establishes the connections.

Judge the election cycle

After receiving an external vote, the server first handles it according to the logicClock carried in the vote (a sketch follows the list):

  • If the external vote's logicClock is larger than its own, the server's election round lags behind. It immediately empties its ballot box, updates its own logicClock to the received value, compares its previous vote with the received vote to decide whether it needs to change its vote, and finally broadcasts its vote again
  • If the external vote's logicClock is smaller than its own, the server simply ignores that vote and continues processing the next one
  • If the external vote's logicClock equals its own, the vote PK described below is performed
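
A sketch of this branching, reusing the VoteSketch class from above; externalVoteWins, voteTallyAndPk and broadcast are placeholders for the Vote PK, vote counting and network steps described in the following sections.

import java.util.HashMap;
import java.util.Map;

// Sketch of the logicClock check on an incoming vote. The ballot box maps
// sender myid -> latest vote received from that sender.
class ElectionRoundSketch {
    final Map<Long, VoteSketch> ballotBox = new HashMap<>();
    VoteSketch myVote;

    void onExternalVote(long senderId, VoteSketch external) {
        if (external.logicClock > myVote.logicClock) {
            // Our round is stale: clear the old votes, adopt the newer round,
            // re-check whether the external candidate beats ours, then rebroadcast.
            ballotBox.clear();
            myVote.logicClock = external.logicClock;
            if (externalVoteWins(external, myVote)) {
                myVote.voteId = external.voteId;
                myVote.voteZxid = external.voteZxid;
            }
            broadcast(myVote);
        } else if (external.logicClock < myVote.logicClock) {
            // The sender's round is stale: ignore it and wait for the next vote.
        } else {
            // Same round: run the Vote PK described in the next section.
            voteTallyAndPk(senderId, external);
        }
    }

    boolean externalVoteWins(VoteSketch ext, VoteSketch mine) { return false; }  // see the Vote PK sketch below
    void voteTallyAndPk(long senderId, VoteSketch ext) { }                       // see the Vote PK sketch below
    void broadcast(VoteSketch v) { }                                             // network send, omitted here
}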

Vote PK

The vote PK is based on (self_id, self_zxid) and (vote_id, vote_zxid), as in the sketch after this list:

  • If the external vote's logicClock is greater than its own, the server updates its own logicClock and the logicClock of its vote to the received value
  • If the logicClocks are equal, the vote_zxid of the two votes are compared. If the external vote's vote_zxid is larger, the server updates the vote_zxid and vote_myid of its own vote to those of the received vote and broadcasts it. It also puts both the received vote and its updated vote into its own ballot box; if a vote from the same sender (self_myid, self_zxid) already exists in the ballot box, it is simply overwritten
  • If the vote_zxid of the two votes are also equal, their vote_myid are compared. If the external vote's vote_myid is larger, the server updates the vote_myid of its own vote to that of the received vote and broadcasts it, and likewise puts both the received vote and its updated vote into its own ballot box
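
These rules boil down to a lexicographic comparison on (logicClock, vote_zxid, vote_myid); Zookeeper's FastLeaderElection performs this comparison in its totalOrderPredicate method, roughly as in this sketch:

// Sketch of the Vote PK ordering: a vote wins if it carries a newer round,
// or the same round and a larger vote_zxid, or the same vote_zxid and a
// larger vote_myid.
class VotePkSketch {
    static boolean externalVoteWins(long extLogicClock, long extVoteZxid, long extVoteId,
                                    long myLogicClock,  long myVoteZxid,  long myVoteId) {
        if (extLogicClock != myLogicClock) return extLogicClock > myLogicClock;
        if (extVoteZxid   != myVoteZxid)   return extVoteZxid   > myVoteZxid;
        return extVoteId > myVoteId;
    }
}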

Counting the votes

If the server determines that more than half of the servers have accepted its vote (possibly an updated vote), it terminates the voting. Otherwise it continues to receive votes from the other servers.
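
The check itself is a simple quorum count over the ballot box. This is only a sketch reusing VoteSketch from above; the real FastLeaderElection additionally waits a short while for a possibly better vote before finalizing.

import java.util.Map;

// Sketch of the termination check: the election can stop once more than half
// of the servers are voting for the same candidate as this server.
class QuorumCheckSketch {
    static boolean hasQuorum(Map<Long, VoteSketch> ballotBox, long myCandidateId, int clusterSize) {
        long support = ballotBox.values().stream()
                .filter(v -> v.voteId == myCandidateId)
                .count();                    // the ballot box also holds this server's own vote
        return support > clusterSize / 2;
    }
}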

Updating the Server status

After voting stops, the server updates its own state: if more than half of the votes are for itself, it switches its state to LEADING; otherwise it switches to FOLLOWING.