To ensure the consistency of the ZooKeeper cluster, ZAB supports a crash-recovery mode. The most critical part of crash recovery is the leader election process, which we will examine in detail below.

ZooKeeper source with Chinese annotations: github.com/chentianmin…

Leader election principles

Leader election happens at two times: once when the servers start up, and once when the leader node fails while the cluster is running. Before we begin to analyze how elections work, there are a few important parameters to understand:

  1. Server ID (myid): for example, three servers numbered 1, 2, and 3. The larger the number, the greater the weight in the election algorithm.
  2. Transaction ID (zxid): the larger the value, the newer the data, and the greater the weight in the election algorithm. The high 32 bits hold the epoch, and the low 32 bits hold an auto-incrementing counter (see the sketch after this list).
  3. Logical clock (epoch): the number of the voting round. Within the same round of voting, the logical clock value is the same on every server. It is incremented after each round, and a server compares it against the value carried by votes received from other servers to decide how to handle them.
  4. Election states:
    1. LOOKING: campaign state; looking for a leader.
    2. FOLLOWING: follower state; synchronizes with the leader and participates in voting.
    3. OBSERVING: observer state; synchronizes with the leader but does not vote.
    4. LEADING: leader state.
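
To make the zxid structure concrete, here is a minimal sketch of how an epoch and a counter pack into one 64-bit zxid. It mirrors what ZooKeeper's ZxidUtils helper does; the class and method names here are illustrative, not the actual source:

```java
// Illustrative helper: how a 64-bit zxid splits into epoch and counter.
public final class ZxidSketch {

    // High 32 bits: the leader epoch.
    public static long epochOf(long zxid) {
        return zxid >> 32;
    }

    // Low 32 bits: the auto-incrementing counter within that epoch.
    public static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    // Pack an epoch and a counter back into a zxid.
    public static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }
}
```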

Leader election at server startup

Each node starts in the LOOKING state, initially waiting and observing, and then enters the main election process.

A leader election requires at least two machines; this section uses a cluster of three machines as an example. In the cluster initialization phase, when the first server, Server1, starts, it cannot conduct and complete the leader election on its own. When the second server, Server2, starts, the two machines can communicate with each other, each tries to find the leader, and the election process begins. The process is as follows:

  1. Each server sends out a vote. In this initial case, Server1 and Server2 each consider themselves the leader. Every vote contains the proposed server's myid, zxid, and epoch, written as (myid, zxid, epoch). Server1's vote is (1, 0, 0) and Server2's vote is (2, 0, 0); each then sends its vote to the other machines in the cluster.
  2. Votes are received from the other servers. After each server in the cluster receives a vote, it first checks the vote's validity, such as whether it belongs to the current round (epoch) and whether it came from a server in the LOOKING state.
  3. Votes are processed. For each incoming vote, the server compares ("PKs") it against its own vote. The PK rules are as follows (sketched in code after the example below):
    1. The epoch is compared first; the vote with the larger epoch has higher priority.
    2. Next the zxid is compared; the server with the larger zxid is preferred as leader.
    3. If the zxids are equal, the myid is compared; the server with the larger myid becomes the leader.

For Server1, its vote is (1, 0, 0) and it receives Server2's vote (2, 0, 0). It first compares the two epochs and zxids, which are all 0, then compares the myids. Server2's myid is larger, so Server1 updates its vote to (2, 0, 0) and votes again. Server2 does not need to update its vote; it simply sends its previous vote to all the machines in the cluster once more.
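
The PK rules above boil down to a single predicate, modeled here on FastLeaderElection.totalOrderPredicate in the ZooKeeper source (weight handling and logging omitted; treat this as a simplified reconstruction):

```java
// Returns true if the incoming vote (newId, newZxid, newEpoch) should win
// over our current vote (curId, curZxid, curEpoch).
boolean totalOrderPredicate(long newId, long newZxid, long newEpoch,
                            long curId, long curZxid, long curEpoch) {
    // 1. A larger epoch wins outright.
    // 2. With equal epochs, a larger zxid wins.
    // 3. With equal zxids, a larger myid wins.
    return (newEpoch > curEpoch)
            || (newEpoch == curEpoch
                && (newZxid > curZxid
                    || (newZxid == curZxid && newId > curId)));
}
```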

  4. Votes are counted. After each round of voting, a server tallies the vote information and checks whether more than half of the machines have accepted the same vote. For Server1 and Server2, two machines in the cluster have accepted the vote (2, 0, 0), so Server2 is considered elected as leader.
  5. The server states change. Once the leader is determined, each server updates its own state: a follower changes to FOLLOWING, and the leader changes to LEADING.

Leader election during operation

When the leader server in a cluster crashes or becomes unavailable, the whole cluster stops serving external requests and enters a new round of leader election. The basic process of a leader election while the cluster is running is the same as the one at startup.

  1. The states change. After the leader goes down, every remaining non-Observer server changes its state to LOOKING and enters the leader election process.
  2. Each server issues a vote. During runtime, the zxid may differ across servers; suppose Server1's zxid is 123 and Server3's zxid is 122. In the first round of voting, Server1 and Server3 each vote for themselves, producing the votes (1, 123, 0) and (3, 122, 0), and each sends its vote to all machines in the cluster. Votes from the other servers are then received in the same way as at startup.
  3. Votes are processed. The process is the same as at startup; here Server1 becomes the leader because its zxid is larger.
  4. Votes are counted. Same process as at startup.
  5. The server states change. Same process as at startup.

Source code analysis of leader election

For source analysis, the most critical step is finding the entry point. ZooKeeper's leader election is not triggered by a client; an election is triggered at startup. So we can go straight to the command run by the startup script zkServer.sh, where ZOOMAIN is QuorumPeerMain. Let's examine the code starting from this entry:

QuorumPeerMain.main()

The main() method calls initializeAndRun(args) to initialize and run the server.
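
A simplified sketch of what this entry point looks like (error handling trimmed to the essentials):

```java
// Entry point: delegate everything to initializeAndRun().
public static void main(String[] args) {
    QuorumPeerMain main = new QuorumPeerMain();
    try {
        main.initializeAndRun(args);
    } catch (Exception e) {
        // In the real source, specific exceptions are logged and mapped
        // to distinct exit codes.
        System.err.println("Unexpected exception, exiting abnormally: " + e);
        System.exit(1);
    }
}
```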

QuorumPeerMain.initializeAndRun()

This mainly loads the zoo.cfg configuration file and then runs the server according to that configuration.

QuorumPeerMain.runFromConfig()

As the name suggests, it launches the server based on the configuration file. Most of the method is parsing and setting parameters, and since they are not used yet, there is no need to examine them now. Look directly at the core call, quorumPeer.start(), which starts a thread; from this you can see that QuorumPeer actually extends Thread, so it must have a run() method.
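
A sketch of the core of runFromConfig, with the configuration plumbing elided (field and setter details are simplified from the actual source):

```java
// Build the client-connection factory and the QuorumPeer, then start it.
ServerCnxnFactory cnxnFactory = ServerCnxnFactory.createFactory();
cnxnFactory.configure(config.getClientPortAddress(), config.getMaxClientCnxns());

QuorumPeer quorumPeer = new QuorumPeer();
quorumPeer.setCnxnFactory(cnxnFactory);
// ... myid, data/log dirs, election type, tick settings, etc. ...

quorumPeer.start();  // QuorumPeer extends Thread, so this leads to run()
quorumPeer.join();   // block the main thread while the peer runs
```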

QuorumPeer.start()

The QuorumPeer.start() method overrides Thread's start() method. Before the thread itself starts, the following is done (sketched after the list):

  1. Restores the snapshot data via loadDataBase().
  2. Starts the zkServer via cnxnFactory.start(), which means clients can communicate through port 2181. We will come back to this later; for now, let's stay on the main thread of the leader election.
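
A simplified sketch of the overridden start(), matching the steps above (some version-specific details omitted):

```java
// QuorumPeer.start(), simplified: prepare state, then start the thread.
@Override
public synchronized void start() {
    loadDataBase();         // 1. restore in-memory state from snapshots and logs
    cnxnFactory.start();    // 2. begin accepting client connections (port 2181)
    startLeaderElection();  // 3. build the initial vote and election algorithm
    super.start();          // 4. start the thread; run() drives the state machine
}
```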

QuorumPeer.startLeaderElection()

What this method does to set up the election:

  1. Builds the current vote (the node votes for itself; see the sketch below).
  2. Gets the address corresponding to the current zkServer's myid.
  3. Creates the election algorithm.
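
A simplified sketch of startLeaderElection(), reconstructed from the steps above (exception handling omitted; names follow the ZooKeeper source but details may vary by version):

```java
synchronized public void startLeaderElection() {
    // 1. Build the current vote: initially every node votes for itself.
    currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());

    // 2. Find the election address corresponding to our own myid.
    for (QuorumServer p : getView().values()) {
        if (p.id == myid) {
            myQuorumAddr = p.addr;
            break;
        }
    }

    // 3. Create the election algorithm (FastLeaderElection by default).
    this.electionAlg = createElectionAlgorithm(electionType);
}
```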

QuorumPeer.createElectionAlgorithm()

Creates the election algorithm corresponding to the configured election type.
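
A sketch of the default branch (election type 3, the only one still supported in recent versions; simplified from the source):

```java
protected Election createElectionAlgorithm(int electionAlgorithm) {
    Election le = null;
    switch (electionAlgorithm) {
    case 3:
        // QuorumCnxManager manages the TCP connections used for the election.
        QuorumCnxManager qcm = createCnxnManager();
        QuorumCnxManager.Listener listener = qcm.listener;
        listener.start();                        // listen for election connections
        le = new FastLeaderElection(this, qcm);  // the default algorithm
        break;
    default:
        // Older algorithms (0, 1, 2) are deprecated/removed.
        break;
    }
    return le;
}
```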

FastLeaderElection

Initializes FastLeaderElection. QuorumCnxManager is a core object that implements network connection management for the leader election; it will be used later.

FastLeaderElection.starter()

The starter() method sets some member properties, builds two blocking queues, sendqueue and recvqueue, and then instantiates a Messenger.
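
A sketch of starter(), simplified from the source:

```java
private void starter(QuorumPeer self, QuorumCnxManager manager) {
    this.self = self;
    proposedLeader = -1;
    proposedZxid = -1;

    sendqueue = new LinkedBlockingQueue<ToSend>();       // votes we want to send
    recvqueue = new LinkedBlockingQueue<Notification>(); // votes we have received
    this.messenger = new Messenger(manager);             // starts the worker threads
}
```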

Messenger

There are two threads built into Messenger, a WorkerSender and a WorkerReceiver. These two threads send and receive messages, respectively. We will not analyze their details for the time being.
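
A sketch of the Messenger constructor, which wires the two threads to the QuorumCnxManager (simplified from the source):

```java
protected class Messenger {
    WorkerSender ws;   // drains sendqueue and hands votes to QuorumCnxManager
    WorkerReceiver wr; // pulls incoming messages and fills recvqueue

    Messenger(QuorumCnxManager manager) {
        this.ws = new WorkerSender(manager);
        Thread t = new Thread(this.ws, "WorkerSender[myid=" + self.getId() + "]");
        t.setDaemon(true);
        t.start();

        this.wr = new WorkerReceiver(manager);
        t = new Thread(this.wr, "WorkerReceiver[myid=" + self.getId() + "]");
        t.setDaemon(true);
        t.start();
    }
}
```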

Summary

With this analysis done, let's make a simple summary, connecting the previous parts through a flow chart.

ZkServer service startup logic

The cnxnFactory.start() method is used to start the ZK service. Let's break it down:

In runFromConfig, a ServerCnxnFactory is built.

ServerCnxnFactory.createFactory()

This method decides, based on the ZOOKEEPER_SERVER_CNXN_FACTORY system property, whether to create the NIO server or the Netty server. By default, it creates a NIOServerCnxnFactory.
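
A sketch of the factory selection logic, simplified from the source:

```java
public static ServerCnxnFactory createFactory() throws IOException {
    // ZOOKEEPER_SERVER_CNXN_FACTORY names the "zookeeper.serverCnxnFactory" property.
    String factoryName = System.getProperty(ZOOKEEPER_SERVER_CNXN_FACTORY);
    if (factoryName == null) {
        factoryName = NIOServerCnxnFactory.class.getName();  // the default
    }
    try {
        // Instantiate NIOServerCnxnFactory, NettyServerCnxnFactory, etc. by name.
        return (ServerCnxnFactory) Class.forName(factoryName)
                .getDeclaredConstructor().newInstance();
    } catch (Exception e) {
        throw new IOException("Couldn't instantiate " + factoryName, e);
    }
}
```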

QuorumPeer.start()

Going back to the QuorumPeer.start() method: cnxnFactory.start() calls into the NIOServerCnxnFactory class to start a thread.

NIOServerCnxnFactory.start()

Here a thread is started with thread.start(). But what object is this thread? It is set up in configure():

NIOServerCnxnFactory.configure()

configure() builds a ZooKeeperThread whose Runnable argument is this, which means the current NIOServerCnxnFactory itself serves as the thread body, so it must implement the run() method.
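
A sketch of the relevant parts of configure(), simplified from the source (field initialization and tuning options omitted):

```java
public void configure(InetSocketAddress addr, int maxcc) throws IOException {
    // The factory passes itself as the Runnable: its run() becomes the thread body.
    thread = new ZooKeeperThread(this, "NIOServerCxn.Factory:" + addr);
    thread.setDaemon(true);
    maxClientCnxns = maxcc;

    this.ss = ServerSocketChannel.open();
    ss.socket().setReuseAddress(true);
    ss.socket().bind(addr);                        // bind the client port (e.g. 2181)
    ss.configureBlocking(false);
    ss.register(selector, SelectionKey.OP_ACCEPT); // accept incoming connections
}
```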

At this point, the NIOServer initialization and startup process is complete, and it is listening on port 2181. Once a request arrives, it is processed accordingly. This will be explored in detail later when we analyze data synchronization.

Analysis of election process

So far, I haven’t done a formal analysis yetleaderThe core process of the election, after the preliminary work is done, then the formal analysis beginsleaderThe electoral process.

Recall that super.start() implies the current class QuorumPeer extends Thread and must override the run() method, so we can find a run() method in QuorumPeer.

QuorumPeer.run()

PeerState has several states:

  1. LOOKING: campaign state.
  2. FOLLOWING: follower state; synchronizes with the leader and participates in voting.
  3. OBSERVING: observer state; synchronizes with the leader but does not vote.
  4. LEADING: leader state.

For elections, the default state is LOOKING, and only a server in the LOOKING state executes the election algorithm. Each server votes for itself at startup, then sends out its vote, looping until a leader is elected, as the sketch below shows.
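
A sketch of the main loop in QuorumPeer.run(), simplified from the source (exception handling and shutdown logic omitted):

```java
while (running) {
    switch (getPeerState()) {
    case LOOKING:
        // Run the election; the winning vote decides the next state.
        setCurrentVote(makeLEStrategy().lookForLeader());
        break;
    case OBSERVING:
        observer = makeObserver(logFactory);
        observer.observeLeader();
        break;
    case FOLLOWING:
        follower = makeFollower(logFactory);
        follower.followLeader();
        break;
    case LEADING:
        leader = makeLeader(logFactory);
        leader.lead();   // returns when leadership is lost
        setLeader(null);
        break;
    }
}
```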

FastLeaderElection.lookForLeader()

Leader election core code.
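
A heavily simplified skeleton of lookForLeader(), keeping only the backbone described in this article (timeout backoff, epoch bookkeeping, and several vote paths are omitted; treat every detail as approximate):

```java
public Vote lookForLeader() throws InterruptedException {
    logicalclock.incrementAndGet();  // start a new election round
    updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
    sendNotifications();             // broadcast a vote for ourselves

    while (self.getPeerState() == ServerState.LOOKING) {
        Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);
        if (n == null) {
            sendNotifications();     // nothing received: resend / reconnect
            continue;
        }
        if (totalOrderPredicate(n.leader, n.zxid, n.electionEpoch,
                proposedLeader, proposedZxid, logicalclock.get())) {
            // The incoming vote wins the PK: adopt it and rebroadcast.
            updateProposal(n.leader, n.zxid, n.electionEpoch);
            sendNotifications();
        }
        // Record the sender's vote, then check whether a quorum
        // has settled on our current proposal.
        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch));
        if (termPredicate(recvset, new Vote(proposedLeader, proposedZxid))) {
            self.setPeerState(proposedLeader == self.getId()
                    ? ServerState.LEADING : learningState());
            return new Vote(proposedLeader, proposedZxid);
        }
    }
    return null;
}
```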

Flowchart of voting processing

FastLeaderElection.termPredicate()

QuorumMaj.containsQuorum()

By default, QuorumMaj is used to determine whether the votes for the current proposal exceed half of the voting members.
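
The check itself is tiny; a sketch modeled on QuorumMaj (the constructor argument here is illustrative):

```java
import java.util.Set;

// Simple-majority quorum check: a proposal passes once strictly more
// than half of the voting members have acknowledged it.
public class QuorumMajSketch {
    private final int half;  // votingMembers / 2

    public QuorumMajSketch(int votingMembers) {
        this.half = votingMembers / 2;
    }

    public boolean containsQuorum(Set<Long> ackSet) {
        return ackSet.size() > half;  // e.g. a 3-node cluster needs 2 acks
    }
}
```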

The network communication process of voting

Communication process source code analysis

A socket listener is created after each ZK service starts.

The run() method creates the socket listener.

FastLeaderElection.lookForLeader()

This method, analyzed earlier, calls sendNotifications to send the vote request:

FastLeaderElection.sendqueue

Data in the sendqueue is consumed asynchronously by the WorkerSender, and this WorkerSender thread is started when FastLeaderElection is built.
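
A sketch of the WorkerSender loop, simplified from the source:

```java
public void run() {
    while (!stop) {
        try {
            // Block briefly waiting for a vote to send.
            ToSend m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
            if (m == null) {
                continue;
            }
            // Serialize the vote and hand it to QuorumCnxManager.toSend().
            process(m);
        } catch (InterruptedException e) {
            break;
        }
    }
}
```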

QuorumCnxManager.toSend()

QuorumCnxManager.startConnection()

SendWorker listens on the blocking queue corresponding to its sid. If the queue is empty at startup, it resends the last message, in case that message was not processed successfully because the peer exited abnormally. RecvWorker listens on the socket's input stream, reads messages, and puts them into the message receiving queue. Once the message is in that queue, QuorumCnxManager's part of the process is finished.
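
A sketch of the resend safeguard at the start of SendWorker.run(), simplified from the source:

```java
// If the outbound queue for this peer (sid) is empty when the worker
// starts, resend the last message we sent it: the peer may have crashed
// before processing it. Duplicate notifications are harmless.
ArrayBlockingQueue<ByteBuffer> bq = queueSendMap.get(sid);
if (bq == null || bq.isEmpty()) {
    ByteBuffer b = lastMessageSent.get(sid);
    if (b != null) {
        send(b);
    }
}
```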

Processing logic after the leader election is complete

After the election through the lookForLeader method completes, the PeerState of the current node is set to LEADING, FOLLOWING, or OBSERVING. A leader has now been chosen, but the QuorumPeer.run() method has not finished executing; let's go back and look at the next step.

QuorumPeer.run()

Case FOLLOWING and LEADING:

follower.followLeader()

makeLeader()

Initializes a Leader object and builds a LeaderZooKeeperServer, the request-processing server representing the leader node.
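
A sketch of makeLeader(), simplified from the source (constructor arguments vary by version):

```java
protected Leader makeLeader(FileTxnSnapLog logFactory) throws IOException {
    // The Leader wraps a LeaderZooKeeperServer, which owns the
    // leader-side request-processing chain.
    return new Leader(this, new LeaderZooKeeperServer(logFactory, this, this.zkDb));
}
```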

leader.lead()

On the Leader, lead() is used to handle interactions with followers.