This is the 18th day of my participation in the First Challenge 2022

Basic concepts

  • Zookeeper is deployed in cluster mode to ensure high availability. In a cluster, the entire Zookeeper service is available as long as most of the machines in the cluster are available
  • The Zookeeper cluster is a high availability cluster based on primary/secondary replication. Generally, a Zookeeper cluster must have at least three servers
  • Zookeeper cluster pattern: the Master/Slave mode
    • Master server: The Master server provides the write service
    • Slave server: obtains the latest data from the master server through asynchronous replication and provides read services

  • Each Server represents a Server on which the Zookeeper service is installed
  • The servers that make up the Zookeeper service maintain the current server state in memory and communicate with each other
  • The Zookeeper Atomic Broadcast (ZAB) protocol is used to ensure data consistency between Zookeeper clusters

Zookeeper cluster roles

  • Zookeeper does not have the traditional Master-Slave roles; instead, it includes the following three roles:
    • Leader: provides read and write services for clients, is responsible for initiating and resolving votes, and updates the system state
      • A Zookeeper cluster has only one working Leader at a time, who initiates and maintains the heartbeat between followers and observers
      • All write operations must be done by the Leader, who then broadcasts the write operations to the remaining servers
    • Follower: provides the read service for clients, forwards write requests to the Leader, and votes during the election process
      • A Zookeeper cluster may have multiple followers who respond to the heartbeat messages sent by the Leader
      • Followers directly process and return read requests from the Client, forward write requests from the Client to the Leader, and vote on the requests when the Leader processes them
    • Observer:
      • Provides the read service for clients; write requests are forwarded to the Leader
      • Does not participate in voting during the election process and does not participate in the half-write success strategy
      • Improves the read performance of the cluster without affecting the write performance
      • The Observer role was introduced after Zookeeper 3.3

Leader

  • The Leader's main functions:
    • Restore data
    • Maintains the heartbeat with the Followers, receives Follower requests, and determines the type of each Follower request message
    • Performs different processing according to the Follower's message type
      • Follower message types:
        • PING message: indicates the heartbeat message of the followers
        • REQUEST message: indicates a proposal message sent by followers, including write requests and synchronization requests
        • ACK message: a response message from a Follower to a proposal. If more than half of the Followers acknowledge a proposal, the Leader commits it
        • REVALIDATE message: used to prolong the Session validity period
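The four message types above can be sketched as a simple dispatcher. This is an illustrative Python sketch, not Zookeeper's actual implementation; the function name and return values are invented for clarity:

```python
def handle_follower_message(msg_type, acks=0, total_servers=3):
    """Dispatch a Follower message on the Leader (illustrative sketch)."""
    if msg_type == "PING":
        return "heartbeat"           # record the Follower's heartbeat
    if msg_type == "REQUEST":
        return "queue proposal"      # queue the write/sync request as a proposal
    if msg_type == "ACK":
        # half-write rule: commit once more than half of the servers have acked
        return "commit" if acks > total_servers // 2 else "wait"
    if msg_type == "REVALIDATE":
        return "extend session"      # prolong the Session validity period
    raise ValueError(f"unknown message type: {msg_type}")
```

With three servers, for example, two ACKs form a majority and trigger a commit, while a single ACK keeps the proposal waiting.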

Follower

  • The Follower's main functions:
    • Sends request messages to the Leader, including PING, REQUEST, ACK, and REVALIDATE messages
    • Receives and processes messages from the Leader
    • Receives requests from clients. If it is a write request, it is forwarded to the Leader for processing and a vote is taken
    • Returns the processing result of the Client request
  • The Follower receives and processes the following messages from the Leader:
    • PING message: heartbeat message
    • PROPOSAL message: a PROPOSAL message initiated by the Leader, which requires followers to vote
    • COMMIT message: informs the Follower to commit the server's latest proposal
    • UPTODATE message: indicates that the synchronization is complete
    • REVALIDATE message: based on the Leader's REVALIDATE result, decides whether to close the Session being revalidated or to allow that Session to continue receiving messages
    • SYNC message: returns the SYNC result to the Client. This message is originally initiated by the Client to force itself to get the latest updates

Zookeeper cluster election

  • The election process of the Zookeeper cluster: when the Leader server suffers an exception such as a network interruption, crash, or restart, Zookeeper enters crash-recovery mode. Recovery mode requires electing a new Leader and returning all servers to a correct state; the cluster then enters the Leader election process, which produces the new Leader server
    • Leader Election:
      • Nodes are initially in the election phase
      • As long as one node gets more than half of the votes of the nodes, it becomes a quasi-leader
    • Discovery stage:
      • The followers and the would-be Leader communicate to synchronize the latest transaction proposals received by the followers
    • Synchronization stage:
      • Synchronize all replicas in the cluster using the latest proposal history obtained by the would-be Leader during the discovery phase
      • After the synchronization is complete, the would-be Leader becomes the real Leader
    • Broadcast stage:
      • The Zookeeper cluster can formally provide transaction services and the Leader can broadcast messages
      • If a new node is added, it needs to be synchronized

Zookeeper Cluster election algorithm

  • The cluster election algorithm of Zookeeper has the following options:
    • 0 – UDP based LeaderElection
    • 1 – FastLeaderElection based on UDP
    • 2 – FastLeaderElection based on UDP and authentication
    • 3 – Tcp-based FastLeaderElection
  • The default cluster election algorithm used by Zookeeper is option 3, the TCP-based FastLeaderElection algorithm. The other algorithms have been deprecated
  • FastLeaderElection:
    • The FastLeaderElection algorithm is a standard implementation of the Fast Paxos algorithm and solves the slow-convergence problem of the LeaderElection algorithm
    • Paxos is a consensus algorithm for solving distributed agreement problems: how to agree on a proposed value in a distributed system

Basic Paxos algorithm

  • The election thread is the thread that initiates the election on the current Server. Its main function is to count the voting results and elect the Leader server
  • The election thread first issues a query to all servers, including itself
  • After receiving a reply, the election thread first verifies that the logicClock belongs to the query it initiated, then checks whether the last committed transaction ID (the maximum ZXID) is consistent, obtains the other server's myID, and finally obtains the information of the Leader proposed by the other server (myID, ZXID), storing this information in the voting records for the current election
  • After receiving replies from all servers, the thread calculates the Server with the largest ZXID and sets the information about this Server as the Server to vote next time
  • The thread sets the Server with the largest ZXID as the Leader recommended by the current Server. If the recommended Server receives more than half of the votes, it is set as the Leader and each server sets its own status according to the winning Server's information; otherwise, the process continues until the Leader is elected
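The two steps above — recommend the server with the largest ZXID, then check for a majority — can be sketched like this (function names are hypothetical; a real implementation also tracks the logicClock and election rounds):

```python
def pick_candidate(replies):
    """replies: (myid, zxid) pairs collected by the election thread.
    Recommend the server with the largest ZXID, breaking ties by larger myID."""
    best = max(replies, key=lambda r: (r[1], r[0]))
    return best[0]

def has_majority(votes, candidate):
    """True once the candidate holds more than half of all votes."""
    return votes.count(candidate) > len(votes) // 2
```
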

Zookeeper cluster election vote data structure

  • When voting for the Leader, each server sends a vote containing the following fields:
    • LogicClock: indicates the number of votes initiated by the server. Each server maintains a value that is an incremented integer
    • State: indicates the status of the current server
    • Self_id: specifies the myID of the current server
    • Self_zxid: indicates the maximum ZXID of the data stored on the current server
    • Vote_id: indicates the myID of the recommended server
    • Vote_zxid: indicates the maximum ZXID of the data stored on the recommended server
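The vote structure can be written down as a small Python dataclass. The field names follow the list above; this is a sketch for illustration, not Zookeeper's actual Java `Vote` class:

```python
from dataclasses import dataclass

@dataclass
class Vote:
    logic_clock: int  # number of the election round this vote belongs to
    state: str        # status of the voting server: LOOKING / LEADING / FOLLOWING / OBSERVING
    self_id: int      # myID of the voting server
    self_zxid: int    # maximum ZXID stored on the voting server
    vote_id: int      # myID of the recommended server
    vote_zxid: int    # maximum ZXID stored on the recommended server
```
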

Voting process of the Zookeeper cluster election

  • Incrementing the election round:
    • Zookeeper states that all valid votes must be in the same round
    • When each server starts a new round of voting, it first increments the value of the logicClock it maintains
  • Initial ballot:
    • Each server empties the ballot boxes that record the votes received before broadcasting its own
    • Example:
      • Server B votes for server C, server C votes for server A, and server A votes for itself. The ballot box of server A is then (B,C), (C,A), (A,A)
      • Only the last vote of each voter will be recorded in the ballot box. If the voter updates his vote, other servers will update the vote of the voter server in their own ballot box after receiving the new vote
  • Send initial ballot:
    • Each server initially broadcasts a vote for itself
  • Receiving external votes:
    • The server attempts to fetch votes from the other servers and record them in its own ballot box
    • If it cannot get any external votes, it verifies whether it still holds a valid connection to the rest of the cluster:
      • If a valid connection remains, it sends its own vote again
      • If not, it immediately re-establishes the connection
  • Judging the election round:
    • After receiving an external vote, the server handles it differently according to the logicClock carried in the vote:
      • The external vote's logicClock is greater than its own logicClock:
        • Indicates that the election round on the current server is behind those on the other servers
        • The server immediately empties its own ballot box and updates its own logicClock to the logicClock of the external vote. It then compares its previous vote with the received external vote to determine whether it needs to change its vote, and finally broadcasts its vote
      • The external vote's logicClock is less than its own logicClock:
        • Indicates that the election rounds on the other servers are behind the one on the current server
        • The current server simply ignores the external vote and continues to process the remaining votes
      • The external vote's logicClock is equal to its own logicClock:
        • Indicates that the other server's election round is the same as the current server's
        • A vote PK is performed
  • Vote PK: the vote PK compares (self_id, self_zxid) with (vote_id, vote_zxid):
    • If the two logicClock values are the same, compare the two vote_zxid values and choose the larger vote_zxid:
      • If the external vote's vote_zxid is larger, the server updates the vote_zxid and vote_id in its own vote with those of the received external vote and broadcasts it
      • It then puts the received vote and its updated vote into its own ballot box
      • If the same vote already exists in the ballot box, that is, the self_id and self_zxid values are equal, it is overwritten directly
    • If the two vote_zxid values are the same, compare the two vote_id values and choose the larger vote_id:
      • If the external vote's vote_id is larger, the server updates the vote_id in its own vote to that of the received external vote and broadcasts it
      • It then puts the received vote and its updated vote into its own ballot box
  • Counting the votes:
    • If more than half of the servers have accepted its vote, the server terminates voting
    • If there is not a majority, it continues to receive votes from the remaining servers
  • Updating the server status:
    • After voting terminates, the server updates its own status
    • If more than half of the votes are for itself, it updates its status to LEADING
    • Otherwise, it updates its status to FOLLOWING
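The round-judging and vote-PK rules above can be combined into one sketch. Votes are plain dicts here, and the function name is invented for illustration; this is not Zookeeper's actual code:

```python
def on_external_vote(my, ext):
    """Process one external vote. `my` and `ext` are dicts with
    logic_clock, vote_id and vote_zxid. Returns the vote this server
    will now hold and broadcast, or None if the external vote is stale."""
    if ext["logic_clock"] > my["logic_clock"]:
        # our round is behind: empty the ballot box (not modeled here),
        # adopt the newer round, then fall through to the vote PK
        my = dict(my, logic_clock=ext["logic_clock"])
    elif ext["logic_clock"] < my["logic_clock"]:
        return None  # the other server's round is behind: ignore its vote
    # same round: PK on (vote_zxid, vote_id); the larger pair wins
    if (ext["vote_zxid"], ext["vote_id"]) > (my["vote_zxid"], my["vote_id"]):
        my = dict(my, vote_id=ext["vote_id"], vote_zxid=ext["vote_zxid"])
    return my
```

Note how the ZXID dominates: a vote recommending a server with a smaller ZXID loses even if its myID is larger.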

Zookeeper cluster Leader election at startup

  • Suppose the cluster has three servers: 1, 2, and 3. When the cluster starts, the logicClock value of all servers is 1 and the ZXID value is 0
  • Initial vote for oneself:
    • After each server is initialized, it votes for itself and saves its own vote into its own ballot box
      • Vote (1, 1, 0):
        • The first number, 1, represents the voting round (logicClock) of the server that cast the vote
        • The second number, 1, represents the myID of the recommended server
        • The third number, 0, represents the maximum ZXID of the recommended server
          • This is the first round of voting, so the first number is 1
          • All votes are cast for themselves, so the second number is the server's own myID and the third number is the server's own ZXID
          • At this point, each server's ballot box holds only one vote, for itself
  • Update ballot:
    • After receiving external votes, the server performs the vote PK, updates and broadcasts its own vote, and then stores the appropriate votes in its own ballot box
      • Server 1 receives the votes (1, 2, 0) from server 2 and (1, 3, 0) from server 3. Because all logicClock values are equal and all ZXID values are equal, server 1 selects the external vote of the server with the largest myID, updates its own vote to (1, 3, 0), empties its ballot box, puts server 3's vote and its own new vote into its ballot box, and finally broadcasts its updated vote. The ballot box on server 1 is (1,3), (3,3)
      • After server 2 receives the vote from server 3, similar to server 1, it updates its own vote to (1, 3, 0), saves it to its own ballot box, and finally broadcasts the updated vote. The votes in the ballot box on server 2 are (2,3), (3,3)
      • By the same rules, server 3 does not need to update its vote, and its ballot box is still (3,3). After the updated votes from server 1 and server 2 arrive, the latest votes on all three servers are the same, and the ballot boxes on all three servers each contain three votes for server 3
  • Determine the role according to the vote:
    • Based on the vote results, determine the role of each server and update the status of the server
      • According to the above voting results, the three servers agree that server 3 is the Leader
      • At this point, server 3 changes its state to LEADING, and server 1 and server 2 change their states to FOLLOWING
      • Finally, the Leader initiates and maintains the heartbeat between the followers and the Leader
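The startup walk-through above can be reproduced with a tiny simulation. All ZXIDs are 0, so the PK falls through to myID and every server converges on server 3; the function name and vote exchange are illustrative, not Zookeeper's actual message flow:

```python
def startup_election(server_ids, zxids=None):
    """Simulate the startup election: each server votes for itself, then
    repeatedly adopts any received vote that wins the (zxid, myid) PK."""
    zxids = zxids or {s: 0 for s in server_ids}
    # initial votes: (logicClock, recommended myID, recommended ZXID)
    votes = {s: (1, s, zxids[s]) for s in server_ids}
    changed = True
    while changed:  # keep exchanging votes until no server updates its vote
        changed = False
        for s in server_ids:
            for v in list(votes.values()):
                if (v[2], v[1]) > (votes[s][2], votes[s][1]):
                    votes[s] = v
                    changed = True
    return votes
```

With servers 1, 2, 3 and all ZXIDs equal to 0, every server's final vote is (1, 3, 0), matching the walk-through above.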

Follower restart

  • If a Follower cannot find the Leader after a restart or a network partition, it enters the LOOKING state and initiates a new round of voting:
    • After receiving the vote from server 1, server 3 returns its LEADING status and the vote to server 1
    • After receiving the vote from server 1, server 2 returns its FOLLOWING status and the vote to server 1
    • At this point, server 1 knows that server 3 is the Leader, and the votes of server 2 and server 3 confirm that server 3 has more than half of the votes. Server 1 changes its state to FOLLOWING

Leader restart

  • Followers initiate a re-vote:
    • When the Leader is down, the followers find that the Leader is not working, they enter the LOOKING state and launch a new round of voting, casting their votes for themselves
  • Vote PK and broadcast:
    • Server 1 and server 2 determine whether to update their own votes according to the external votes they receive. There are two cases:
      • The maximum ZXID of server 1 and server 2 is the same:
        • For example, before Leader server 3 went down, server 1 and server 2 were fully synchronized with server 3
        • In this case, updating the vote depends on the server ID, that is, the myID value. Votes are cast for the server with the larger myID
      • The maximum zxID of server 1 and server 2 is different:
        • A write operation led by the Leader takes effect only after the confirmation of more than half servers, not all servers
        • In this case, on server 1 and server 2, one server may have the same ZXID as the Leader, and the other server may have a smaller ZXID than the Leader
        • In this case, votes are cast for the server with the larger ZXID
  • Elect a new Leader:
    • After the vote PK, a new Leader can be elected
      • Suppose both server 1 and server 2 cast their votes for server 1. Then server 2 becomes a Follower, and server 1 becomes the new Leader and maintains a heartbeat with server 2
  • The old Leader changes to the FOLLOWING state after being restored:
    • After a successful restart, the old Leader goes from the LOOKING state to the FOLLOWING state
      • After the old Leader, server 3, is restored and started, it enters the LOOKING state, initiates a new round of Leader election, and votes for itself
      • Server 1 then returns its LEADING status and the vote result (3, 1, ZXID) to server 3, and server 2 returns its FOLLOWING status and the vote result (3, 1, ZXID) to server 3
      • Server 3 thereby confirms that the Leader is server 1 and that, according to the vote results, server 1 has received more than half of the servers' votes, so server 3 enters the FOLLOWING state

Zookeeper cluster election summary

  • Paxos algorithms:
    • Fast Paxos algorithm: when the election begins, each Server proposes itself as the Leader
    • Basic Paxos algorithm: each Server queries the others and computes the appropriate Server to be the Leader
  • Zookeeper cluster election process:
    • For the Leader to be supported by a majority of Servers, the total number of Servers must be an odd number 2n+1, and the number of surviving Servers cannot be less than n+1
    • After each Server restarts, the Zookeeper cluster election process is repeated. In recovery mode, a Server that has just recovered from a crash or has just started recovers data and session information from disk snapshots. Zookeeper records transaction logs and takes periodic snapshots to facilitate state recovery during recovery
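The 2n+1 / n+1 relation above can be checked with two one-liners (the helper names are invented for illustration):

```python
def majority(total):
    """Smallest quorum: more than half of `total` servers."""
    return total // 2 + 1

def cluster_available(total, alive):
    """The cluster stays usable only while a quorum of servers survives."""
    return alive >= majority(total)
```

For a cluster of 2n+1 servers the quorum is n+1: with n = 2 (5 servers), at least 3 must survive.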

Unfair Leader election

  • The unfair model is simple to implement, and the method of voting is exactly the same in each round
  • Efficiency is high with a small number of competitors:
    • The time that each Follower node senses that the Leader node is deleted through watch is not exactly the same
    • As long as one Follower node is notified that the Leader node has been deleted, the election of a new Leader can begin
  • The unfair mode puts a heavy load on the Zookeeper cluster and scales poorly:
    • If a large number of clients are running, a large number of write requests are sent to Zookeeper at the same time
    • Zookeeper has a single-point write problem, resulting in low performance
    • When the Leader gives up the leadership, Zookeeper needs to notify a large number of Followers at the same time, resulting in a heavy load

Leader election process

  • Suppose three Zookeeper clients, Client1, Client2, and Client3, compete for the leadership at the same time:
    • All three clients simultaneously try to create an Ephemeral (temporary), non-Sequence (unordered) node at the path /zkroot/leader in the Zookeeper cluster
    • Since the node is non-sequential, only one of the three clients can create it successfully; the other creations fail
    • Suppose Client1 creates the node successfully and becomes the Leader, while Client2 and Client3 become Followers
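The race above can be mimicked with an in-memory stand-in for the Zookeeper namespace. `FakeZk` and `unfair_election` are invented for illustration; a real client would issue a create call with the ephemeral flag against an actual Zookeeper ensemble:

```python
class FakeZk:
    """A tiny in-memory stand-in for a Zookeeper namespace (not a real client)."""
    def __init__(self):
        self.nodes = set()

    def create_ephemeral(self, path):
        """Non-sequential create: only the first caller for a path succeeds."""
        if path in self.nodes:
            return False
        self.nodes.add(path)
        return True

def unfair_election(zk, clients, path="/zkroot/leader"):
    """Every client races to create the same node; the single winner leads."""
    return {c: ("Leader" if zk.create_ephemeral(path) else "Follower")
            for c in clients}
```
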

Relinquishing leadership

  • To relinquish leadership, the Leader node deletes the /zkroot/leader node
  • If the Leader process goes down unexpectedly, its Session with Zookeeper also ends; because the node is of the Ephemeral (temporary) type, it is deleted automatically
    • In either case, /zkroot/leader no longer exists. For the remaining clients, this means the old Leader has given up the leadership

Detecting the relinquishment of leadership

  • A client that fails to create the node becomes a Follower and registers a watch on /zkroot/leader. When the Leader relinquishes leadership, that is, when the Leader node is deleted, all Followers are notified

A new round of election

  • After sensing that the old Leader has relinquished leadership, all Followers can initiate a new Leader election:
    • The new round uses exactly the same method as the previous round: each Follower initiates a node-creation request; the client that creates the node successfully becomes the Leader, and the others become Followers and watch the Leader node
    • The result of the new round of Leader election is unpredictable and has nothing to do with the order in the previous round. That is why it is called an unfair election

Fair Leader Election

  • The fairness model is relatively complex to implement
  • Good scalability:
    • Each client watches only one node and notifies only one client each time a node is deleted
  • When the old Leader gives up the leadership, the other clients become the new Leader in turn, in the order determined by the election, that is, by node sequence number. That is why it is called a fair election
  • The delay of the fair mode is higher than that of the unfair mode, because the fair mode must wait for a specific node to be notified before the election of a new Leader can begin

Leader election process

  • In a fair Leader election, each client creates a node under /zkroot/leader of type Ephemeral (temporary) and Sequence (ordered)
    • Because the nodes are Sequence nodes, all three clients can create them successfully, but the node sequence numbers differ
    • Each client then checks whether the sequence number of the node it created is the smallest
      • If its node number is the smallest, it becomes the Leader node
      • If its node number is not the smallest, it becomes a Follower node
  • Suppose the node created by Client1 has sequence number 1, the node created by Client2 has sequence number 2, and the node created by Client3 has sequence number 3:
    • Because the smallest sequence number is 1 and the node with sequence number 1 was created by Client1, Client1 becomes the Leader. The remaining clients, Client2 and Client3, become Followers
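The fair variant can be sketched the same way: each client gets its own Sequence node, the smallest number leads, and every Follower watches the node just below its own. Class and function names are illustrative stand-ins, not a real Zookeeper client:

```python
class FakeSeqZk:
    """In-memory stand-in that issues Sequence node names (not a real client)."""
    def __init__(self):
        self.counter = 0
        self.nodes = {}   # path -> owning client

    def create_ephemeral_sequence(self, prefix, owner):
        self.counter += 1
        path = f"{prefix}{self.counter:010d}"   # real ZK pads sequence numbers to 10 digits
        self.nodes[path] = owner
        return path

def fair_election(zk, clients, prefix="/zkroot/leader"):
    """Smallest sequence number becomes Leader; each Follower watches
    the path that immediately precedes its own."""
    paths = {c: zk.create_ephemeral_sequence(prefix, c) for c in clients}
    ordered = sorted(paths.values())
    leader = zk.nodes[ordered[0]]
    watches = {zk.nodes[ordered[i]]: ordered[i - 1]
               for i in range(1, len(ordered))}
    return leader, watches
```
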

Relinquishing leadership

  • In a fair Leader election, every Follower watches the node whose sequence number is just smaller than its own:
    • For example, with nodes 1, 2, and 3, Client2 watches Client1's node /zkroot/leader1, and Client3 watches Client2's node /zkroot/leader2
    • Note: in an actual project, the sequence number is padded to 10 digits

Detecting the relinquishment of leadership

  • If the Leader goes down and /zkroot/leader1 is deleted, only Client2, which watches the Leader Client1's node, is notified. Client3 is not notified, because it watches Client2's node /zkroot/leader2

A new round of election

  • After receiving notification that Client1's node /zkroot/leader1 was deleted, Client2 does not immediately become the new Leader. Instead, Client2 first determines whether its own node number is the smallest; only after confirming that its sequence number is currently the smallest does Client2 become the new Leader
  • Note:
    • If Client2, which watches the Leader Client1's node, goes down before the old Leader Client1 relinquishes leadership, Client3 is notified
    • In this case, Client3 does not immediately become the new Leader; it first determines whether its own node number is the smallest
    • Since the node /zkroot/leader1 created by Client1 has not been deleted, Client3 does not become the new Leader; instead it watches the node ahead of Client2's, the node with number 1 created by Client1

Zookeeper Cluster server status

  • LOOKING: searching for the Leader. The current Zookeeper server does not know which server is the Leader
  • LEADING: the Leader state. The node is the Leader, that is, the current role of the server is Leader
  • FOLLOWING: the Follower state. The corresponding node is a Follower; the Leader has been elected and the current server is synchronized with the Leader
  • OBSERVING: the Observer state. The corresponding node is an Observer. In most cases its behavior is exactly the same as a Follower's, but the node does not participate in Leader elections and voting; it only receives the results of the election and voting
  • Half-write success strategy:
    • After the Leader node receives a write request, the Leader broadcasts the write request to the remaining cluster servers
    • The remaining cluster servers queue the write request and send a success message to the Leader node
    • The write operation can be performed once the Leader node receives success messages from more than half of the cluster's servers
    • The Leader node then sends the commit message to the remaining cluster servers, and those servers receive the message and carry out the write operation
  • Only the Leader node can perform write operations. The followers and Observers only provide data read operations. If a write request is received, it is forwarded to the Leader node for processing
  • Observer nodes do not participate in the election and voting process, nor do they participate in the half-write success strategy of a write operation
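The half-write success strategy above boils down to one quorum check on the Leader. A minimal sketch (the function name is invented; `ack_responses` stands in for the success messages from the other servers, and the Leader counts itself):

```python
def half_write_decision(total_servers, ack_responses):
    """Return COMMIT once more than half of the cluster (Leader included)
    has acknowledged the broadcast write; otherwise keep waiting."""
    acks = sum(1 for ok in ack_responses if ok) + 1  # +1 for the Leader itself
    return "COMMIT" if acks > total_servers // 2 else "WAIT"
```
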

Number of Zookeeper cluster servers

  • The Zookeeper cluster can run normally as long as more than half of its nodes are alive
  • The number of servers in a Zookeeper cluster is recommended to be odd
    • If some Zookeeper servers fail, the Zookeeper cluster can still be used as long as the number of surviving servers is greater than the number of failed servers
    • In other words, if the number of Zookeeper servers in the cluster is n, the number of surviving servers must be greater than n/2
    • So 2n−1 servers and 2n servers have the same fault tolerance, namely n−1
    • Therefore, an odd number of Zookeeper servers and the next even number have the same effect, and there is no need to add the extra Zookeeper server
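The equal-tolerance claim for 2n−1 and 2n servers is easy to verify numerically (the helper name is invented for illustration):

```python
def fault_tolerance(total):
    """Maximum number of failed servers the cluster survives:
    the survivors must still be more than half of `total`."""
    return (total - 1) // 2
```

For n = 3, both 5 (= 2n−1) and 6 (= 2n) servers tolerate exactly 2 (= n−1) failures, so the extra even server buys nothing.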