This section describes the ZAB protocol.

The basic concept

ZAB stands for ZooKeeper Atomic Broadcast. The ZAB protocol is a crash-recoverable atomic broadcast protocol designed specifically for the distributed coordination service ZooKeeper.

ZooKeeper relies on the ZAB protocol to achieve distributed data consistency. Based on this protocol, ZooKeeper implements an active/standby system architecture to keep the data consistent across the replicas in the cluster.

Specifically, ZooKeeper uses a single main process to receive and process all transaction requests from clients, and uses the ZAB atomic broadcast protocol to broadcast server state changes to all replica processes in the form of transaction Proposals.

Since the main process may crash, exit, or restart at any time, the ZAB protocol must ensure that the service keeps working normally when such failures occur; this is the crash recovery part of the protocol.

General process of ZAB:

  • All transaction requests must be coordinated by a single, globally unique server, called the Leader server.
  • The remaining servers become Follower servers.
  • The Leader server converts each client transaction request into a transaction Proposal and distributes that Proposal to all Follower servers in the cluster.
  • The Leader server then waits for feedback from the Follower servers; it needs more than half of them to reply with a correct Ack.
  • The Leader then sends a Commit message to all Follower servers, asking them to commit the previous Proposal (a Follower-side sketch follows this list).
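To make the flow concrete, here is a minimal Follower-side sketch of this exchange. All names (FollowerSketch, onProposal, onCommit) are illustrative stand-ins, not ZooKeeper's actual classes:

```java
import java.util.ArrayList;
import java.util.List;

// Follower-side view of the broadcast flow above: log the Proposal,
// reply with an Ack, and apply the change only after the Commit arrives.
final class FollowerSketch {
    private final List<Long> loggedZxids = new ArrayList<>();    // stand-in for the on-disk transaction log
    private final List<Long> committedZxids = new ArrayList<>(); // stand-in for the in-memory data tree

    /** Phase 1: persist the Proposal, then acknowledge it to the Leader. */
    boolean onProposal(long zxid) {
        loggedZxids.add(zxid); // a real Follower writes the transaction log to disk here
        return true;           // this Ack is sent back to the Leader
    }

    /** Phase 2: the Leader saw a majority of Acks and broadcast a Commit. */
    void onCommit(long zxid) {
        if (loggedZxids.contains(zxid)) {
            committedZxids.add(zxid); // the change now becomes visible to reads
        }
    }
}
```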

Keywords: master-slave cluster mode, distributed transactions, ZAB, two-phase commit

The ZAB protocol includes two basic modes: message broadcast and crash recovery.

Message broadcast

The process above already describes the flow of the message broadcast mode, which is similar to a two-phase commit.

The two-phase commit used in the ZAB protocol differs slightly from the classic version: the interrupt (abort) logic is removed. All Follower servers either acknowledge the Leader's transaction Proposal normally or abandon the Leader server. Because the interrupt logic is gone, the Leader can commit a transaction Proposal as soon as more than half of the Follower servers have responded with an Ack, rather than waiting for every Follower in the cluster to respond.

For example, a ZAB service made up of three machines usually consists of one Leader and two Follower servers. If one of the Follower servers dies at some point, the Commit is still broadcast and the ZAB cluster is not interrupted, because the Leader server is still supported by more than half of the machines (including the Leader itself).
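As a minimal sketch of that quorum rule (the class and method names here are made up for illustration), the check is just an integer comparison:

```java
// Minimal majority-quorum check for a fixed-size ensemble.
// QuorumCheck/hasQuorum are illustrative names, not ZooKeeper's API.
public final class QuorumCheck {
    private final int clusterSize;

    public QuorumCheck(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    /** True once strictly more than half of the ensemble has acked. */
    public boolean hasQuorum(int ackCount) {
        return ackCount > clusterSize / 2;
    }

    public static void main(String[] args) {
        QuorumCheck q = new QuorumCheck(3);
        System.out.println(q.hasQuorum(2)); // true: the Leader plus one Follower
        System.out.println(q.hasQuorum(1)); // false: the Leader alone is not a majority
    }
}
```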

Throughout the message broadcast process, the Leader server generates a corresponding Proposal for every transaction request it broadcasts. Before broadcasting a transaction Proposal, the Leader server first assigns it a globally unique, monotonically increasing ID, called the transaction ID (ZXID). This increasing property guarantees that messages are processed in sequential order, which underlies ZooKeeper's FIFO ordering guarantee.

Incidentally, if any other machine in the cluster receives a transaction request from a client, that non-Leader server first forwards the request to the Leader server.

The specific implementation

Specifically, during message broadcast the Leader server assigns a separate queue to each Follower server, puts the transaction Proposals to be broadcast into these queues one by one, and sends the messages according to a FIFO policy. After receiving a Proposal, each Follower server writes it to its local disk as a transaction log entry and sends an Ack response back to the Leader server. When the Leader server has received Ack responses from more than half of the Followers, it broadcasts a Commit message to all Followers telling them to commit the transaction, and it commits the transaction itself at the same time. Each Follower server commits the transaction once it receives the Commit message.
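A compact sketch of that Leader-side bookkeeping, under the assumption of one in-memory FIFO queue per Follower and an Ack counter per Proposal (all names are illustrative; the real broadcast machinery lives in ZooKeeper's server internals):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Leader-side sketch: one FIFO queue per Follower plus an Ack count per
// Proposal. The epoch part of the ZXID is ignored here for brevity.
final class BroadcastSketch {
    record Proposal(long zxid, String txn) {}

    private final int ensembleSize; // Followers plus the Leader itself
    private final Map<String, LinkedBlockingQueue<Proposal>> queues = new ConcurrentHashMap<>();
    private final Map<Long, Integer> acks = new ConcurrentHashMap<>();
    private final AtomicLong zxidCounter = new AtomicLong();

    BroadcastSketch(List<String> followers) {
        this.ensembleSize = followers.size() + 1;
        for (String f : followers) {
            queues.put(f, new LinkedBlockingQueue<>());
        }
    }

    /** Assign a ZXID and enqueue the Proposal for every Follower (FIFO order). */
    long propose(String txn) {
        Proposal p = new Proposal(zxidCounter.incrementAndGet(), txn);
        acks.put(p.zxid(), 1); // the Leader counts as one implicit Ack
        queues.values().forEach(q -> q.add(p));
        return p.zxid();
    }

    /** Called when a Follower reports the Proposal is in its transaction log.
     *  True once the Ack count passes half the ensemble, i.e. Commit may go out. */
    boolean onAck(long zxid) {
        return acks.merge(zxid, 1, Integer::sum) > ensembleSize / 2;
    }
}
```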

Crash recovery

When the whole service framework is starting up, or when the Leader server suffers a network interruption, crashes, exits, or restarts, the ZAB protocol enters recovery mode and elects a new Leader server. Once a new Leader has been elected and more than half of the machines in the cluster have completed state synchronization with it, the ZAB protocol exits recovery mode. Here, state synchronization means data synchronization: it ensures that more than half of the machines in the cluster hold a data state consistent with the Leader server's.

Basic property: the ZAB protocol requires that if a transaction Proposal is successfully processed on one machine, it must be successfully processed on all machines. How is this achieved?

After the Leader election completes, and before the Leader starts its official work (receiving transaction requests from clients and putting forward new Proposals), the Leader server first confirms that every Proposal in its transaction log has been committed by more than half of the machines in the cluster, that is, that data synchronization has completed. Let's look at the data synchronization process.

Data synchronization

The Leader server prepares a queue for each Follower server and, for each transaction that a Follower has not yet synchronized, sends the Proposals to that Follower one by one, with each Proposal immediately followed by a Commit message. Once a Follower server has synchronized all of its missing transaction Proposals from the Leader and successfully applied them to its local database, the Leader server adds that Follower to its list of actually available Followers. How does the Leader determine which transactions a Follower still needs to synchronize? That depends on the structure of the ZXID.
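A rough sketch of that catch-up loop, assuming the Leader knows each Follower's last applied ZXID (how the ZXID makes this comparison possible is explained next); SyncSketch and its members are hypothetical names:

```java
import java.util.ArrayDeque;
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;

// For every committed transaction the Follower has not yet applied,
// enqueue the Proposal immediately followed by its Commit.
final class SyncSketch {
    enum Kind { PROPOSAL, COMMIT }
    record Msg(Kind kind, long zxid, String payload) {}

    /** committedLog maps ZXID -> transaction payload, ordered by ZXID. */
    static Queue<Msg> buildSyncQueue(TreeMap<Long, String> committedLog, long followerLastZxid) {
        Queue<Msg> queue = new ArrayDeque<>();
        // tailMap(key, false) keeps only entries strictly above the
        // Follower's last applied ZXID, i.e. the transactions it is missing.
        for (Map.Entry<Long, String> e : committedLog.tailMap(followerLastZxid, false).entrySet()) {
            queue.add(new Msg(Kind.PROPOSAL, e.getKey(), e.getValue()));
            queue.add(new Msg(Kind.COMMIT, e.getKey(), null)); // Commit carries no payload
        }
        return queue;
    }
}
```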

The ZAB transaction ID (ZXID) is a 64-bit number. The lower 32 bits are a counter that increases monotonically by one for each transaction. The higher 32 bits encode the number of the Leader cycle, the epoch: whenever a new Leader server is elected, it takes the ZXID of the largest transaction Proposal in its local log, extracts the epoch value from it, increments it by 1, and then restarts the lower 32 bits from 0. By comparing the high 32 bits (epoch) and the low 32 bits (counter) of ZXIDs, the Leader can work out exactly which transactions each prospective Follower still needs in order to keep transactions consistent.
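The bit layout is easy to express in code. This is a hedged sketch with illustrative helper names (ZooKeeper ships similar utilities in its ZxidUtils class):

```java
// The 64-bit ZXID: high 32 bits hold the Leader epoch, low 32 bits a counter.
final class ZxidSketch {
    /** Pack an epoch (high 32 bits) and a counter (low 32 bits) into one ZXID. */
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    static long epochOf(long zxid) {
        return zxid >>> 32;        // high 32 bits: Leader cycle
    }

    static long counterOf(long zxid) {
        return zxid & 0xFFFFFFFFL; // low 32 bits: per-epoch transaction counter
    }

    /** What a freshly elected Leader does: bump the epoch, restart the counter at 0. */
    static long firstZxidOfNewEpoch(long largestLoggedZxid) {
        return makeZxid(epochOf(largestLoggedZxid) + 1, 0);
    }

    public static void main(String[] args) {
        long zxid = makeZxid(5, 42);              // epoch 5, 42nd transaction of that epoch
        System.out.println(epochOf(zxid));        // 5
        System.out.println(counterOf(zxid));      // 42
        System.out.println(epochOf(firstZxidOfNewEpoch(zxid))); // 6
    }
}
```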

With ZAB's broadcast and data recovery covered, what remains is the Leader election process, so let's continue with the details of the ZooKeeper election.

Leader election

The election happens in two main situations: at server startup, and during server running when the Leader fails.

Election during server startup

  1. Each Server issues a vote

    Since this is the initial situation, every server votes for itself as the Leader. The basic elements of each vote are the myid and the ZXID of the proposed server, expressed as (myid, ZXID). Because this is the initialization phase, both Server1 and Server2 vote for themselves, that is, Server1 votes (1, 0) and Server2 votes (2, 0), and each sends its vote to all the other machines in the cluster.

  2. Receive votes from the other servers

    Each server receives the votes from the other servers. After a server in the cluster receives a vote, it first checks the vote's validity, including whether the vote belongs to the current round and whether it came from a server in the LOOKING state.

  3. Process the votes

    After receiving a vote from another server, the server compares each received vote against its own. The comparison rules are as follows (a code sketch follows this list):

    • Compare the ZXIDs first. The server with the larger ZXID takes precedence as the Leader, since it holds the more up-to-date data.
    • If the ZXIDs are the same, compare the myids. The server with the larger myid serves as the Leader server.
  4. Vote statistics

After each round of voting, the server counts all the votes it has received and determines whether more than half of the machines agree on the same vote. Here "more than half" means strictly greater than half the number of machines in the cluster, i.e. at least (n/2 + 1) machines (integer division); in a three-machine cluster, two identical votes are enough.

  5. Change the server state

    Once the Leader has been determined, each server updates its state: a Follower switches to FOLLOWING and the Leader switches to LEADING.
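The comparison rule from step 3 and the tally from step 4 fit in a few lines. This is an illustrative sketch only; ZooKeeper's real election (FastLeaderElection) also tracks election epochs and more:

```java
import java.util.HashMap;
import java.util.Map;

// The (myid, ZXID) vote, the pairwise comparison, and the quorum tally.
final class ElectionSketch {
    record Vote(long myid, long zxid) {}

    /** Step 3: the larger ZXID wins; on a tie, the larger myid wins. */
    static Vote preferred(Vote a, Vote b) {
        if (a.zxid() != b.zxid()) {
            return a.zxid() > b.zxid() ? a : b;
        }
        return a.myid() > b.myid() ? a : b;
    }

    /** Step 4: true once strictly more than half the cluster agrees on `vote`. */
    static boolean hasQuorum(Map<Long, Vote> received, Vote vote, int clusterSize) {
        long count = received.values().stream().filter(vote::equals).count();
        return count > clusterSize / 2;
    }

    public static void main(String[] args) {
        Vote v1 = new Vote(1, 0);
        Vote v2 = new Vote(2, 0);
        // Startup case: equal ZXIDs, so the larger myid (Server2) wins.
        System.out.println(preferred(v1, v2)); // Vote[myid=2, zxid=0]

        Map<Long, Vote> ballots = new HashMap<>();
        ballots.put(1L, v2); // Server1 switched to Server2 after comparing
        ballots.put(2L, v2); // Server2 keeps voting for itself
        System.out.println(hasQuorum(ballots, v2, 3)); // true: 2 of 3 machines agree
    }
}
```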

Leader election during server running

When the Leader goes down, the whole cluster is temporarily unavailable while a new round of Leader election takes place. The basic process of the Leader election during server running is the same as the one at startup.

  1. Change the server state.

    When the Leader fails, the remaining non-Observer servers change their server state to LOOKING and start the Leader election process.

  2. Each server issues a vote. During this process, the vote information (myid, ZXID) is generated. Since this is runtime, the ZXID on each server may differ; assume Server1's ZXID is 123 and Server3's ZXID is 122. In the first round, both Server1 and Server3 vote for themselves, generating the votes (1, 123) and (3, 122) respectively, which are then sent to all the machines in the cluster.

  3. The remaining steps are the same as in the Leader election at startup time.
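Plugging the runtime votes above into the hypothetical ElectionSketch from the startup section: preferred(new Vote(1, 123), new Vote(3, 122)) returns Server1's vote, because the ZXID comparison (123 > 122) takes precedence over the larger myid. Server1 therefore becomes the new Leader once more than half of the cluster converges on that vote.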