The main contents of this paper are as follows:

  • The Paxos protocol and its application in Zookeeper.
  • Zookeeper principles, including the cluster structure and features.
  • The design of Zookeeper, including the QuorumPeer model, leader election, the Lead flow, the followLeader flow, and the broadcast mode that processes Zookeeper write requests.

1 Paxos

Zookeeper uses the Paxos protocol in both leader election and broadcast mode. To be exact, it uses a variant of the Paxos protocol. Therefore, let's first learn about Paxos.

1.1 Data consistency in distributed systems

In a distributed system based on message passing, participants may be slow, may crash or restart, and the network may be unstable, so messages may be delayed, lost, or duplicated. Paxos is a protocol that keeps a distributed system consistent in the presence of any of the above anomalies. It is important to note that the protocol assumes there are no Byzantine failures in message delivery, that is, messages may be delayed or lost but are never corrupted.

1.2 the story

Lamport introduced Paxos through a story, which goes as follows: legislators on the Greek island of Paxos voted to pass laws in the chamber and exchanged information by means of notes carried by waiters, and each legislator recorded the laws passed in his own ledger. The problem is that neither the legislators nor the waiters are reliable: they may leave the chamber at any time for various reasons, and new legislators may enter at any time to vote. How can the voting process still proceed normally, and laws be passed without conflict?

1.3 Semantic Definition

1.3.1 role

The participants in the algorithm are mainly divided into three roles, and each participant may play more than one role.

  • Proposer: submits proposals. A proposal consists of a proposal number and a value.
  • Acceptor: receives proposals and may accept them.
  • Learner: only "learns" approved proposals, that is, obtains the values that have been chosen.

1.3.2 Basic semantics

A value can be chosen (approved) only if it was submitted by a Proposer; a value that has been submitted but not yet approved is called a proposal. In a single execution instance of the Paxos algorithm, only one value is chosen. Learners can only learn values that have been chosen.

1.3.3 Resolution process

The adoption of a decision is divided into two stages: the prepare stage and the accept stage.

1.3.3.1 prepare stage

When a Proposer wishes to propose value V1, it chooses a proposal number SN1 and sends a prepare request carrying SN1 to a majority of the Acceptors. When an Acceptor receives the prepare request, it compares SN1 with the number SN2 of the prepare request it responded to last time:

  • If SN2 > SN1, the request is ignored and this round of the approval process ends.
  • Otherwise, the Acceptor replies with the proposal it accepted most recently (its number and value); if it has not accepted any proposal before, it simply replies with a bare promise. In either case it will no longer accept proposals numbered less than SN1.
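To make the acceptor's bookkeeping in the prepare stage concrete, here is a minimal sketch in Java of how an acceptor might answer a prepare request. The class and field names (PaxosAcceptor, promisedSn, acceptedSn) are invented for illustration; this is not ZooKeeper source code, just a sketch of the rule described above.

// Minimal, illustrative acceptor for the prepare stage of single-decree Paxos.
// All names are invented for this sketch; this is not ZooKeeper source code.
class PaxosAcceptor {

    private long promisedSn = -1;     // number of the prepare request answered last time (SN2)
    private Long acceptedSn = null;   // number of the last accepted proposal, if any
    private Object acceptedValue = null;

    // Handle a prepare request carrying number sn (SN1).
    synchronized String onPrepare(long sn) {
        if (promisedSn > sn) {
            // SN2 > SN1: ignore the request; this approval round ends here.
            return "IGNORED";
        }
        // Promise not to accept any proposal numbered below sn from now on.
        promisedSn = sn;
        if (acceptedSn == null) {
            // Nothing accepted before: simply reply with a bare promise.
            return "PROMISE";
        }
        // Otherwise report the proposal accepted last time, so the proposer
        // can adopt its value in the accept stage.
        return "PROMISE <" + acceptedSn + ", " + acceptedValue + ">";
    }
}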
1.3.3.2 accept stage

After the Proposer has received replies from more than half of the Acceptors, it examines them: if any reply carries a previously accepted proposal, the Proposer adopts the value of the reply with the largest number as V; otherwise it keeps its own value V1. It then sends an accept request carrying SN1 and that value. If the number of responses does not reach a majority, the Proposer retries the prepare stage with a larger number, for example SN1 + 1. An Acceptor accepts and responds to an accept request as long as doing so does not breach the commitments it has made to other Proposers, that is, as long as it has not already responded to a prepare request with a number greater than SN1.
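A matching sketch of the proposer's side of the accept stage: after collecting replies from more than half of the acceptors it either keeps its own value or adopts the value of the highest-numbered proposal reported in the replies. Again, the names (PaxosProposer, Promise, chooseValue) are invented for illustration only.

import java.util.List;

// Illustrative proposer logic for the accept stage of single-decree Paxos.
// Names are invented for this sketch; this is not ZooKeeper source code.
class PaxosProposer {

    // A reply received in the prepare stage.
    static class Promise {
        final Long acceptedSn;      // null if the acceptor had accepted nothing yet
        final Object acceptedValue;
        Promise(Long sn, Object value) { acceptedSn = sn; acceptedValue = value; }
    }

    // Decide which value goes into the accept request together with SN1.
    static Object chooseValue(List<Promise> replies, Object ownValue, int clusterSize) {
        if (replies.size() <= clusterSize / 2) {
            // No majority: retry the prepare stage with a larger number, e.g. SN1 + 1.
            throw new IllegalStateException("no majority, retry with a higher number");
        }
        Object value = ownValue;
        long highestSn = Long.MIN_VALUE;
        for (Promise p : replies) {
            if (p.acceptedSn != null && p.acceptedSn > highestSn) {
                highestSn = p.acceptedSn;   // adopt the value of the highest-numbered accepted proposal
                value = p.acceptedValue;
            }
        }
        return value;
    }

    public static void main(String[] args) {
        // Three of five acceptors replied; one had already accepted <3, "B">,
        // so the proposer must propose "B" even though it wanted "A".
        List<Promise> replies = List.of(
                new Promise(null, null), new Promise(3L, "B"), new Promise(null, null));
        System.out.println(chooseValue(replies, "A", 5)); // prints B
    }
}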

1.3.4 constraints

According to the three semantics above, four constraints can be derived (note: P1 refers to the prepare stage; P2a/P2b/P2c refer to the accept stage).

  • P1: an Acceptor must accept the first proposal it receives.
  • P2a: once a proposal with value v is chosen, any higher-numbered proposal accepted by any Acceptor must also have value v.
  • P2b: once a proposal with value v is chosen, any higher-numbered proposal issued by any Proposer must also have value v.
  • P2c: for a proposal with number n and value v to be issued, there must be a majority of Acceptors such that either none of them has accepted any proposal numbered less than n, or v is the value of the highest-numbered proposal among all proposals numbered less than n that they have accepted.

1.4 Disadvantages and simplification

1.4.1 shortcomings

The Paxos algorithm converges slowly when proposers compete, and livelock may even occur. For example, when three or more proposers send prepare requests concurrently, it is hard for any single proposer to receive responses from more than half of the acceptors, so phase one of the protocol keeps being re-executed.

1.4.2 simplified

In order to avoid competition and accelerate convergence, a Leader role is introduced into the algorithm. Under normal circumstances only one participant plays the Leader role, the other participants play the Acceptor role, and all of them play the Learner role. Multi-Paxos is a simplified version of the classic Paxos protocol; the biggest difference between the two is that Multi-Paxos has a leader node while basic Paxos does not. Middleware such as Chubby, ZooKeeper, Megastore, and Spanner all use Multi-Paxos-style protocols. Multi-Paxos first runs a complete round of the basic Paxos algorithm to elect a leader; afterwards the leader handles all write requests and can omit the prepare phase, submitting values to the Acceptors for voting on its own.

2 Zookeeper principle

2.1 the role

| Role | Description |
| --- | --- |
| Leader | Initiates and decides votes; updates the system state |
| Learner-Follower | Participates in voting; forwards write requests to the leader; accepts client connections |
| Learner-Observer | Does not participate in voting; forwards write requests to the leader; accepts client connections; improves system scalability and read performance |
| Client | Initiates requests |

The Leader sends state changes to the followers and observers. Write throughput depends on the quorum size: the larger the quorum, the lower the write throughput. One of the main reasons for introducing the Observer was to improve the scalability of read requests: the Observer does not take part in voting, so it adds no overhead to write operations, and because it stays in sync with the Leader it can serve additional read requests.
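As an aside, an observer is added to a cluster declaratively. The sketch below shows what a zoo.cfg might look like (host names, ports, and paths are placeholders); the ":observer" suffix on the server line and the peerType=observer setting on the observer itself mark the node as a non-voting member.

# zoo.cfg (excerpt, placeholder values)
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181

# voting members (the leader is elected among these)
server.1=host1:2888:3888
server.2=host2:2888:3888
server.3=host3:2888:3888

# non-voting member: receives state changes but never votes
server.4=host4:2888:3888:observer

# additionally, in the configuration of server 4 itself:
peerType=observer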

2.2 Cluster Structure

[Image: Zookeeper cluster structure. Note: this image is from the network.]

2.3 features

  • Eventual consistency: the client sees the same view no matter which server it connects to. This is the most important feature of ZooKeeper.
  • Reliability: simple, robust, and performs well. If a message m is accepted by one server, it will eventually be accepted by all servers.
  • Real-time: Zookeeper guarantees that clients obtain server updates or failure information within a bounded interval. However, due to network delay and other reasons, Zookeeper cannot guarantee that two clients obtain newly updated data at the same time. If the latest data is needed, call the sync() interface before reading the data (see the sketch after this list).
  • Wait-free: a slow or failed client cannot interfere with the requests of a fast client, so every client's requests make progress.
  • Atomicity: updates either succeed or fail; there are no intermediate states.
  • Ordering: includes global order and partial order. Global order means that if message A is published before message B on one server, message A is published before message B on every server. Partial order means that if message B is published by the same sender after message A, then A is ordered before B.
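The sync-before-read pattern mentioned above can be expressed with the standard ZooKeeper Java client roughly as follows. The connection string and znode path are placeholders, and error handling is omitted; this is a minimal sketch, not production code.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Sketch: force a sync with the leader before reading, so this server's view
// is at least as fresh as the leader's committed state at the time of the call.
public class SyncThenRead {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("host1:2181,host2:2181,host3:2181", 30000, event -> { });

        CountDownLatch synced = new CountDownLatch(1);
        // sync() is asynchronous; the callback fires once the connected server
        // has caught up with the leader for the given path.
        zk.sync("/app/config", (rc, path, ctx) -> synced.countDown(), null);
        synced.await();

        Stat stat = new Stat();
        byte[] data = zk.getData("/app/config", false, stat);
        System.out.println(new String(data) + " (mzxid " + stat.getMzxid() + ")");
        zk.close();
    }
}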

2.4 Atomic Broadcast Protocol (ZAB)

Zookeeper uses ZAB to synchronize data among multiple servers. The ZAB protocol has two modes: recovery mode (leader election) and broadcast mode (synchronization). When the service starts or the leader crashes, ZAB enters recovery mode. Once a leader has been elected and a majority of the servers have completed synchronization with the leader, recovery mode ends and ZAB enters broadcast mode.

3 Zookeeper design

This section provides a brief introduction to the implementation principles of Zookeeper. See other blogs for details.

3.1 QuorumPeer model

QuorumPeer is the core of the ZooKeeper server and is responsible for the following:

  • The initial state is LOOKING, so leader election takes place first; this is effectively the action of joining the cluster.
  • After the leader has been elected, each server determines whether it is a follower or the leader, and the different roles execute different flows.
  • When the followLeader or Lead flow exits, the state is set back to LOOKING and a new cycle begins (a simplified sketch of this state loop follows the list).
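The three bullets above can be summarized as a state machine. The following is a highly simplified sketch of that loop; the method names lookForLeader/followLeader/lead are placeholders standing in for the real flows described in the next sections, not the actual ZooKeeper source.

// Highly simplified sketch of the QuorumPeer main loop described above.
enum ServerState { LOOKING, FOLLOWING, LEADING }

class QuorumPeerSketch {
    private volatile ServerState state = ServerState.LOOKING;
    private volatile boolean running = true;

    void run() {
        while (running) {
            switch (state) {
                case LOOKING:
                    // Joining the cluster: run leader election, then switch roles.
                    state = lookForLeader() ? ServerState.LEADING : ServerState.FOLLOWING;
                    break;
                case FOLLOWING:
                    followLeader();                // returns when the leader is lost
                    state = ServerState.LOOKING;   // start a new cycle
                    break;
                case LEADING:
                    lead();                        // returns when quorum support is lost
                    state = ServerState.LOOKING;
                    break;
            }
        }
    }

    // Placeholders for the flows discussed in the following sections.
    boolean lookForLeader() { return false; }
    void followLeader()     { }
    void lead()             { }
}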

3.2 leader election

The diagram shows a simplified election process intended only to illustrate the main flow; see the blog for details. The election process is essentially an implementation of the Paxos protocol. The main steps are as follows:

  • Increment the logicalClock (also called electionEpoch) and update the proposal (initially proposing itself as the leader).
  • Send the vote information to the other acceptors in the cluster.
  • Collect and count the acceptors' replies. This step first resolves conflicts among electionEpoch, zxid, and the proposed leader, in principle choosing the larger one as the information for the next proposal (see the comparison sketch after this list).
  • Determine from the tally whether a leader has been chosen. If no leader has been elected, or a leader was elected but is found to be invalid when its validity is checked, the "send the (updated) vote to the other acceptors" step is repeated. Otherwise the leader has been elected; the server updates its own state and exits the election process.
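The "choose the larger one" rule used when counting the replies can be sketched as a three-level comparison: electionEpoch first, then zxid, then server id. The method below only illustrates that ordering; it is not the actual election source code.

// Illustration of the vote-conflict resolution described above (invented names).
class VoteComparison {
    static boolean newVoteWins(long newEpoch, long newZxid, long newId,
                               long curEpoch, long curZxid, long curId) {
        if (newEpoch != curEpoch) return newEpoch > curEpoch;  // larger election epoch wins
        if (newZxid != curZxid)   return newZxid > curZxid;    // then the larger zxid wins
        return newId > curId;                                  // finally the larger server id wins
    }

    public static void main(String[] args) {
        // Same epoch and zxid: the vote for the server with the larger id wins,
        // so the proposal is updated and re-sent to the other acceptors.
        System.out.println(newVoteWins(1, 100, 3, 1, 100, 2)); // true
    }
}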

3.3 Lead process

As shown in the figure, the leader is responsible for the following:

  • Create a LearnerHandler thread for each follower/observer to handle all interaction with that learner.
  • Ensure that acceptedEpoch values are received from more than half of the servers so that the new epoch of the cluster can be computed; otherwise, exit the Lead flow.
  • Ensure that more than half of the servers have acknowledged the epoch update, so that the epoch of the cluster majority is consistent; otherwise, exit the Lead flow.
  • Ensure that more than half of the servers have synchronized the committed data from the leader, so that the majority of the cluster is in a consistent state; otherwise, exit the Lead flow.
  • Enter broadcast mode, receive requests, issue proposals, count and process proposals.

3.4 followLeader process

As shown in the figure, followers are responsible for the following:

  • Connect to the leader.
  • Inform the leader of its own acceptedEpoch so that the leader can compute the new cluster epoch.
  • Inform the leader that it has updated its epoch so that the leader can confirm that the epoch of the cluster majority has been unified.
  • Synchronize submitted proposals from the leader to keep yourself in sync with the leader.
  • Enter the broadcast mode, forward write requests to the leader, receive proposals, reply to proposals, and submit proposals.

3.5 Leader Broadcast Mode

As shown in the figure above, the responsibilities of the leader in broadcast mode mainly include:

  • Receive write requests and send them to the acceptors (in this case the followers) as proposals.
  • Receive the followers' responses (ACKs) to a proposal, count them, and once a quorum is reached send COMMIT to the followers and INFORM to the observers so the proposal is committed (see the tally sketch after this list).
  • Receive heartbeat messages to keep the sessions with the followers alive.
  • Verify session validity.
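The vote counting in the second bullet boils down to a quorum check per proposal. Here is a minimal sketch with invented names; it is not ZooKeeper code.

import java.util.HashSet;
import java.util.Set;

// The leader records each follower's ACK for a proposal (identified by its zxid)
// and commits the proposal once more than half of the voting members have acknowledged it.
class ProposalTally {
    private final int votingMembers;          // followers + leader (observers excluded)
    private final Set<Long> ackedServerIds = new HashSet<>();

    ProposalTally(int votingMembers) { this.votingMembers = votingMembers; }

    // Register an ACK; returns true once a quorum has acknowledged the proposal.
    synchronized boolean onAck(long serverId) {
        ackedServerIds.add(serverId);
        // In the real protocol, reaching the quorum triggers COMMIT to the followers
        // and INFORM to the observers.
        return ackedServerIds.size() > votingMembers / 2;
    }
}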

3.6 Follower broadcast Mode

As shown in the figure above, the main responsibilities of followers in broadcast mode include:

  • Receive proposal messages from the leader and reply (ACK) to the leader.
  • Receive COMMIT messages for proposals from the leader.
  • Receive heartbeat messages and send heartbeats to the leader.
  • Verify session validity.
  • Receive sync messages and synchronize from the leader so that the information obtained by the client is up to date.

4 Processing Zookeeper requests

The main abstraction Zookeeper uses in its implementation is the request processor (RequestProcessor), which models the different stages of request handling; each type of server registers a different chain of request processors.

4.1 Request Model

This figure is from the official Zookeeper documentation and shows the request flow. All write requests go through the ZAB protocol in order to keep the cluster data consistent.

4.2 leader

4.2.1 Registering the RequestProcessor chain sample code

4.2.1.1 LeaderZooKeeperServer#setupRequestProcessors
protected void setupRequestProcessors() {
    RequestProcessor finalProcessor = new FinalRequestProcessor(this);
    RequestProcessor toBeAppliedProcessor = new Leader.ToBeAppliedRequestProcessor(
            finalProcessor, getLeader().toBeApplied);
    commitProcessor = new CommitProcessor(toBeAppliedProcessor,
            Long.toString(getServerId()), false,
            getZooKeeperServerListener());
    commitProcessor.start();
    ProposalRequestProcessor proposalProcessor = new ProposalRequestProcessor(this,
            commitProcessor);
    proposalProcessor.initialize();
    firstProcessor = new PrepRequestProcessor(this, proposalProcessor);
    ((PrepRequestProcessor)firstProcessor).start();
}
4.2.1.2 ZooKeeperServer#setupRequestProcessors
    protected void setupRequestProcessors() {
        RequestProcessor finalProcessor = new FinalRequestProcessor(this);
        RequestProcessor syncProcessor = new SyncRequestProcessor(this,
                finalProcessor);
        ((SyncRequestProcessor)syncProcessor).start();
        firstProcessor = new PrepRequestProcessor(this, syncProcessor);
        ((PrepRequestProcessor)firstProcessor).start();
    }

4.2.2 model

4.2.3 analysis

  • The PrepRequestProcessor accepts the request from the client and preprocesses it.
  • ProposalRequestProcessor: if the request is a LearnerSyncRequest, the leader is acting as the server of a client that issued a sync command and handles the sync directly. Otherwise the request is considered to require a vote: it is passed to the next processor and proposed to all followers for voting. Note: the request may be a write request forwarded to the leader by a follower, or a write request the leader received directly from a client; in either case it is also persisted to the local disk through the SyncRequestProcessor. Sample code is as follows:
public void processRequest(Request request) throws RequestProcessorException {
    /* In the following if-then-else block, we process syncs on the leader.
     * If the sync is coming from a follower, then the follower
     * handler adds it to syncHandler. Otherwise, if it is a client of
     * the leader that issued the sync command, then syncHandler won't
     * contain the handler. In this case, we add it to syncHandler, and
     * call processRequest on the next processor. */
    if (request instanceof LearnerSyncRequest) {
        zks.getLeader().processSync((LearnerSyncRequest) request);
    } else {
        nextProcessor.processRequest(request);
        if (request.hdr != null) {
            // We need to sync and get consensus on any transactions
            try {
                zks.getLeader().propose(request);
            } catch (XidRolloverException e) {
                throw new RequestProcessorException(e.getMessage(), e);
            }
            syncProcessor.processRequest(request);
        }
    }
}
  • CommitProcessor: on the leader, once enough ACKs have been collected the commit is issued to its own CommitProcessor; on a follower, the CommitProcessor performs the commit when the COMMIT message sent by the leader arrives.
  • Leader.ToBeAppliedRequestProcessor records the requests that are currently being committed to the ZKDatabase, ensuring that newly connected followers can see this "about to be committed" data in memory.
  • FinalRequestProcessor is the last processor in the chain. For write requests, each server applies the data to its own in-memory database here; for read requests, it returns the result to the client.
  • The SyncRequestProcessor is responsible for persisting write requests to the local disk. To improve disk write efficiency it uses buffered writes, but it calls flush periodically (every 1000 requests); after flush, the requests are guaranteed to have been written to disk. The request is then passed to the AckRequestProcessor for further processing.
  • The AckRequestProcessor is responsible for sending ACK feedback to the voting collector of the Proposal after the SyncRequestProcessor has completed transaction logging to inform the voting collector that the current server has completed transaction logging of the Proposal.

4.3 followers

4.3.1 Registering the RequestProcessor chain sample code

FollowerZooKeeperServer#setupRequestProcessors as follows:

protected void setupRequestProcessors() {
    RequestProcessor finalProcessor = new FinalRequestProcessor(this);
    commitProcessor = new CommitProcessor(finalProcessor,
            Long.toString(getServerId()), true,
            getZooKeeperServerListener());
    commitProcessor.start();
    firstProcessor = new FollowerRequestProcessor(this, commitProcessor);
    ((FollowerRequestProcessor) firstProcessor).start();
    syncProcessor = new SyncRequestProcessor(this,
            new SendAckRequestProcessor((Learner)getFollower()));
    syncProcessor.start();
}

4.3.2 model

4.3.3 analysis

  • FollowerRequestProcessor accepts requests from the client; write operations (such as create, delete, setData, setAcl, etc.) are forwarded to the leader, and the request is also passed on to the CommitProcessor.
  • CommitProcessor see leader
  • FinalRequestProcessor see Leader
  • SyncRequestProcessor is described in Leader
  • SendAckRequestProcessor sends an ACK for the proposal to the leader

5 blog

My related blogs: ZAB protocol recovery mode - data synchronization

ZAB protocol recovery mode -Leader election

Zookeeper principle

References for this article: Paxos algorithm and Zookeeper analysis (blog.csdn.net/xhh198781/a…); Paxos algorithm (Baike.baidu.com/item/Paxos%…).
