Summary of the cluster

Zookeper is usually deployed in a cluster in a production environment to ensure high availability. The following is a cluster deployment structure diagram provided by Zookeeper’s official website:

As can be seen from the figure above, each node of Zookeeper Server communicates with the primary node, and each node stores backup data and logs. The cluster is available only when most nodes are available.

Background note:

Based on ZooKeeper 3.8.0, this paper analyzes the ZooKeeper election process through the dimension of source code.

For Zookeeper source code compilation suggestions refer to: Compile and run the Zookeeper source code

Cluster Node Status

Before analyzing the leader election, let’s take a look at the status of nodes in the Zookeeper cluster. Cluster node states are defined in the enumeration of QuorumPeer#ServerState, mainly including LOOKING, FOLLOWING, LEADING and OBSERVING. The definition code is as follows:

public enum ServerState {
    // Find the leader state. When the server is in this state, it considers that there is no leader in the before-as-cluster and therefore needs to enter the Leader election state.
    LOOKING,
    // Follower state. The current server role is follower.
    FOLLOWING, 
    // Leader status. Indicates that the current server role is Leader.
    LEADING,
    // Observer status. Indicates that the current server role is observer.
    OBSERVING
}
Copy the code

Leader Election Process

Start the source

QuorumPeerMain is the startup class of Zookeeper. It is started using the main method. The startup class reads and initializes the Zookeper configuration. 2. Start the Zookeeper service

// Do not display non-core code

public static void main(String[] args) {
    QuorumPeerMain main = new QuorumPeerMain();
    main.initializeAndRun(args);
}


protected void initializeAndRun(String[] args) throws ConfigException, IOException, AdminServerException {
    // Start in cluster mode
    if (args.length == 1 && config.isDistributed()) {
        runFromConfig(config);
    } else{}}public void runFromConfig(QuorumPeerConfig config) throws IOException, AdminServerException {
    / / quorumPeer startup
	quorumPeer.start();
}
Copy the code

QuorumPeer is a thread instance class. When the start method is called, the QuorumPeer#run() method is executed to determine whether the current node is electing or synchronizing cluster node data. The following is the core code:

@Override
public void run(a) {
    try {
        while (running) {
            switch (getPeerState()) {
                case LOOKING:
                    // Vote for yourself
                    setCurrentVote(makeLEStrategy().lookForLeader());
                    break;
                case OBSERVING:
                    // Mark it as OBSERVING
                    setObserver(makeObserver(logFactory));
                    observer.observeLeader();
                    break;
                case FOLLOWING:
                    // mark as FOLLOWING
                    setFollower(makeFollower(logFactory));
                    follower.followLeader();
                    break;
                case LEADING:
                    setLeader(makeLeader(logFactory));
                    leader.lead();
                    setLeader(null);
                    break; }}}finally{}}Copy the code

The election of source code

FastLeaderElection is the core class for elections, and in this class there is a process for voting and voting

public Vote lookForLeader(a) throws InterruptedException {
    // Create a ballot box for the current election cycle
    Map<Long, Vote> recvset = new HashMap<Long, Vote>();

    // Create a ballot box. This ballot box is not the same as recvset.
    // Store the Leader vote if it already exists in the current cluster
    Map<Long, Vote> outofelection = new HashMap<Long, Vote>();

    int notTimeout = minNotificationInterval;

    synchronized (this) {
        // Increments the local election cycle
        logicalclock.incrementAndGet();
        // Vote for yourself
        updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
    }

    // Broadcast vote
    sendNotifications();

    SyncedLearnerTracker voteSet = null;


    // If the current server state is Looking, and the stop parameters are false, then the election takes place
    while((self.getPeerState() == ServerState.LOOKING) && (! stop)) {if (n.electionEpoch > logicalclock.get()) {
            logicalclock.set(n.electionEpoch);
            recvset.clear();
            // totalOrderPredicate votes PK
            if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                updateProposal(n.leader, n.zxid, n.peerEpoch);
            } else {
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }
            sendNotifications();
        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
            updateProposal(n.leader, n.zxid, n.peerEpoch);
            sendNotifications();
        }

        // Listen for votes received by the communication layer
        Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);
        // Put in the ballot box
        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
        // Half logic
        voteSet = getVoteTracker(recvset, newVote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch)); }}Copy the code

TotalOrderPredicate is mainly the logic of the vote PK, so let’s look at the code again:

protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {

    if (self.getQuorumVerifier().getWeight(newId) == 0) {
        return false;
    }

    return ((newEpoch > curEpoch)
            || ((newEpoch == curEpoch)
                && ((newZxid > curZxid)
                    || ((newZxid == curZxid)
                        && (newId > curId)))));
}
Copy the code

Here’s what the election process looks like, and there’s actually an official note:

  1. First compare the number of terms of election, the higher the number of terms means that the latest term, the winner
  2. And then compare the zxIDS, which means who has the latest data, and the latest wins
  3. Finally, compare the serverID, which is specified in the configuration file, and the one with the largest node ID wins

SendNotifications () after the election is complete; Notify other nodes.

Process to summarize

Earlier, I briefly explained Zookeeper from the startup process, Leader election, and election result notification. In general, ZooKeeper, as a distributed coordination middleware with high performance and reliability, is very excellent in terms of design idea. Let’s summarize the voting process and multi-layer network architecture again.

The voting process

Generally speaking, in the process of voting, the larger the ZXID is, the more likely it is to become the leader. This is mainly because the larger the ZXID is, the more the data of the node is. In the process of data synchronization, the comparison between node transaction cancellation and log file synchronization is avoided to improve the performance. The following is the election process of five ZooKeeper nodes.

Note: (sid, zxID), the current scenario is server1. Server2 is faulty. The zxID of server3 is 9, and the zxID of server4 and server5 is 8.

After two rounds of election, Sever3 is finally elected as the leader node

Multi-layer network architecture

In the previous analysis, I omitted the NIO operation for communication between Zookeeper nodes. In this part, Zookeeper divides them into transport layer and business layer. SendWorker and RecvWorker process data packets at the network layer, while WorkerSender and WorkerReceiver process data at the business layer.

Here will involve multi-threaded operation, ZooKeeper in the source code also gives a lot of log information, for beginners have some difficulty, we can refer to the following ZooKeeper election source code process this part of the flow chart to assist analysis.

Leader election source code process

I combined the source code of Zookeeper to make a detailed combing of the startup and election process, as shown in the figure below. You can read Zookeeper source code.

Reference documentation

  1. Apache Zookeeper’s official website
  2. Zookeeper leader election mechanism
  3. Understand the Zookeeper Leader election process