The previous article, ZooKeeper Basic Principles & Application Scenarios, covered the basics of ZooKeeper and its application scenarios, including the underlying storage principle and how to implement a distributed lock with ZooKeeper. But I think that only scratches the surface. So this article takes a detailed look at ZooKeeper’s core underlying principles. If you’re not familiar with ZooKeeper, take a look back at that article first.


The ZNode is the basis of ZooKeeper and its smallest unit of data storage. ZooKeeper abstracts a file system-like storage structure into a tree, and each node in the tree is called a ZNode. Each ZNode maintains a data structure that records version numbers for changes to its data and to its ACL (Access Control List).

With the data version number and the update timestamp, ZooKeeper can verify that a client’s cached copy is still valid and can coordinate updates.

Also, when a ZooKeeper client performs an update or delete, it must include the version number of the data being modified. If ZooKeeper finds that this version number does not match the node’s current version, the operation is not performed. If it matches, the data in the ZNode is updated and the version number is incremented.
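The version check can be illustrated with a small sketch. This is a toy in-memory model, not ZooKeeper’s actual implementation; the names VersionedStore, set_data and BadVersionError are made up for illustration. It only shows the compare-then-increment idea described above.

```python
class BadVersionError(Exception):
    """Raised when the caller's expected version does not match the node's."""


class VersionedStore:
    """Toy in-memory store (hypothetical); not the real ZooKeeper data tree."""

    def __init__(self):
        self._nodes = {}  # path -> (data, version)

    def create(self, path, data):
        self._nodes[path] = (data, 0)

    def get(self, path):
        return self._nodes[path]

    def set_data(self, path, data, expected_version):
        _, current = self._nodes[path]
        # -1 means "skip the check", the same convention the real client uses.
        if expected_version != -1 and expected_version != current:
            raise BadVersionError(f"expected {expected_version}, found {current}")
        self._nodes[path] = (data, current + 1)
        return current + 1


store = VersionedStore()
store.create("/config", b"v1")
_, version = store.get("/config")

new_version = store.set_data("/config", b"v2", version)  # version matches

try:
    store.set_data("/config", b"v3", version)  # same version again: now stale
    stale_rejected = False
except BadVersionError:
    stale_rejected = True
```

The second write is rejected because another update (here, our own first write) already moved the version forward. This is the usual optimistic-locking pattern.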

This version number logic is used by many frameworks, such as RocketMQ, where the Broker registers with a NameServer with a version number called

Let’s take a closer look at the data structure that maintains these version numbers. It’s called the Stat structure, and its fields are:

field           description
czxid           The zxid of the change that created this node
mzxid           The zxid of the change that last modified this node
pzxid           The zxid of the change that last modified this node’s children
ctime           The time this node was created, in milliseconds since the epoch
mtime           The time this node was last modified, in milliseconds since the epoch
version         The number of changes to this node’s data (that is, the version number)
cversion        The number of changes to this node’s children
aversion        The number of changes to this node’s ACL
ephemeralOwner  The sessionID of the owner if this is a temporary node (0 if not temporary)
dataLength      The length of this node’s data
numChildren     The number of children of this node

For example, the stat command allows you to view the specific values of the stat Structure in a Znode.

The epoch and the ZXID mentioned here are related to the ZooKeeper cluster, which will be covered in more detail later.


The ACL (Access Control List) controls permissions on a ZNode, much like file permissions on Linux. Linux has three types of permission: read, write, and execute, abbreviated r, w, and x. Permission granularity is also divided into three levels: owner, group, and others. For example:

drwxr-xr-x 3 USERNAME GROUP 1.0K 3 15 18:19 dir_name

What is granularity? Granularity refers to how the subjects that permissions apply to are classified. In other words, the three levels above divide permissions among the owning user, the group the user belongs to, and everyone else. This can be considered a standard form of access control, a typical three-part scheme.

ZooKeeper also uses a three-part form, but the granularity differs. ZooKeeper’s three parts are Scheme, ID, and Permissions, meaning the permission mechanism, the users allowed access, and the specific permissions, respectively.

Scheme represents a permission schema with five types:

  • world: in this scheme, the ID can only be anyone, which means the node is accessible to everyone
  • auth: represents any authenticated user
  • digest: uses username + password for validation
  • ip: only allows certain IPs to access the ZNode
  • x509: authenticates with the client’s certificate

At the same time, there are five types of permissions:

  • CREATE can create child nodes
  • READ can get the node’s data and list its children
  • WRITE can set the node’s data
  • DELETE can delete child nodes
  • ADMIN can set permissions

As in Linux, these permissions are abbreviated, for example:

The getACL method shows the permissions of a ZNode, as shown in the figure. The output has three parts:

  • Scheme is world
  • ID is anyone, which means all users have permissions
  • Permissions is cdrwa, the abbreviation of CREATE, DELETE, READ, WRITE, and ADMIN
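To make the three-part form concrete, here is a minimal sketch that splits an ACL written as scheme:id:permissions into its parts. The function name parse_acl and the sample digest value are assumptions for illustration; this is not part of any ZooKeeper client library.

```python
# Single-letter abbreviations for the five permissions.
PERMISSIONS = {
    "c": "CREATE",
    "d": "DELETE",
    "r": "READ",
    "w": "WRITE",
    "a": "ADMIN",
}


def parse_acl(acl):
    """Split 'scheme:id:perms' into (scheme, id, [permission names])."""
    scheme, rest = acl.split(":", 1)
    # The ID itself may contain ':' (for example user:hash in the digest
    # scheme), so the permission letters are taken from the right.
    ident, letters = rest.rsplit(":", 1)
    unknown = set(letters) - set(PERMISSIONS)
    if unknown:
        raise ValueError(f"unknown permission letters: {sorted(unknown)}")
    return scheme, ident, [PERMISSIONS[ch] for ch in letters]


world_acl = parse_acl("world:anyone:cdrwa")
digest_acl = parse_acl("digest:alice:placeholder-hash:rw")  # made-up ID
```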

The Session mechanism

Now that we have learned about the Version mechanism of ZooKeeper, we can move on to explore the Session mechanism of ZooKeeper.

We know that there are four types of nodes in ZooKeeper: persistent nodes, persistent sequential nodes, temporary (ephemeral) nodes, and temporary sequential nodes.

As we discussed in the previous article, temporary nodes are tied to the Session of the client that created them. If that client disconnects and its Session then expires, all temporary nodes it created are deleted.

So how does ZooKeeper know which temporary nodes are created by the current client?

The answer is the ephemeralOwner field (the owner of the temporary node) in the Stat structure.

As mentioned above, for a temporary node (sequential or not), ephemeralOwner stores the sessionID of the client that created it. The server matches this against the client’s sessionID, and when the Session expires it deletes all temporary nodes created by that client.
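The cleanup step can be sketched as a lookup by owner. This is a toy model under the assumption that each node stores only its ephemeralOwner; the class Tree and its methods are hypothetical and far simpler than the real server.

```python
class Tree:
    """Toy node table: path -> ephemeralOwner (0 means a persistent node)."""

    def __init__(self):
        self.nodes = {}

    def create(self, path, session_id=0):
        self.nodes[path] = session_id

    def expire_session(self, session_id):
        """Delete every temporary node owned by the expired session."""
        if session_id == 0:
            raise ValueError("0 marks persistent nodes, not a session")
        doomed = [p for p, owner in self.nodes.items() if owner == session_id]
        for path in doomed:
            del self.nodes[path]
        return sorted(doomed)


tree = Tree()
tree.create("/app")                          # persistent
tree.create("/app/lock-1", session_id=0x10)  # temporary, owned by session 0x10
tree.create("/app/lock-2", session_id=0x20)  # temporary, owned by session 0x20
tree.create("/app/member", session_id=0x10)

removed = tree.expire_session(0x10)
```

After the call, only the persistent node and the node owned by session 0x20 remain.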

When creating a connection, the client must provide a connection string that lists every server as host:port, separated by commas.

When the ZooKeeper client receives this string, it randomly selects one of the servers to connect to. If the connection later breaks, the client picks the next server from the string and keeps trying until a connection succeeds.
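A rough sketch of that selection strategy, assuming the client shuffles the list once and then walks it in a loop. The function server_attempts and the host names are hypothetical; the real client has more logic around back-off and timeouts.

```python
import itertools
import random


def server_attempts(servers, rng=None):
    """Yield servers in a random starting order, cycling forever."""
    rng = rng or random.Random()
    order = list(servers)
    rng.shuffle(order)  # random first choice
    return itertools.cycle(order)  # on failure, move to the next one


servers = ["zk1:2181", "zk2:2181", "zk3:2181"]  # made-up host names
attempts = server_attempts(servers, random.Random(0))
first_four = [next(attempts) for _ in range(4)]
```

The first three attempts cover every server once, and the fourth wraps around to the first one tried.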

In addition to the basic IP + port form, since version 3.2.0 ZooKeeper also supports a path suffix in the connection string, for example 127.0.0.1:2888,127.0.0.1:3888/app/a

In this way, /app/a is treated as the root directory for the current client, and every node path it creates is prefixed with /app/a. For example, if I create a node /node_name, its full path will be /app/a/node_name. This feature is particularly useful in multi-tenant environments, where each tenant sees its own subtree as the top-level root directory /.
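The following sketch shows how such a string might be split into a server list and a root path, and how a client-side path maps to its full path. The helper names and the sample addresses are assumptions, and the sketch assumes every server entry includes a port.

```python
def parse_connect_string(connect_string):
    """Return ([(host, port), ...], root_path) for 'h1:p1,h2:p2/root'."""
    slash = connect_string.find("/")
    if slash == -1:
        hosts_part, root = connect_string, ""
    else:
        hosts_part, root = connect_string[:slash], connect_string[slash:]
    servers = []
    for item in hosts_part.split(","):
        host, _, port = item.strip().rpartition(":")
        servers.append((host, int(port)))
    return servers, root.rstrip("/")


def full_path(root, path):
    """Map a path as the client sees it to the real path on the server."""
    return root + path


servers, root = parse_connect_string("127.0.0.1:2181,127.0.0.1:2182/app/a")
resolved = full_path(root, "/node_name")
```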

Once the ZooKeeper client and server are connected, the client is given a 64-bit sessionID and a password. What’s this password for? ZooKeeper can be deployed as multiple instances, and if a client disconnects and then connects to another ZooKeeper server, it presents this password when re-establishing the connection. The password is a security measure that any ZooKeeper node can verify. In this way, a Session stays valid even when the client moves to another ZooKeeper node.

There are two conditions for Session expiration:

  • The specified expiration time has passed
  • The client did not send a heartbeat within the specified time

In the first case, the expiration time is passed to the server when the ZooKeeper client establishes a connection. Currently this value must fall between 2 × tickTime and 20 × tickTime.

tickTime is a configuration item of the ZooKeeper server. It is the basic time unit that drives heartbeats and bounds the Session timeout. The default is tickTime=2000, and the unit is milliseconds.
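The range rule can be written as a simple clamp. This is a minimal sketch assuming the default bounds of 2 × tickTime and 20 × tickTime; the function name negotiate_session_timeout is made up for illustration.

```python
def negotiate_session_timeout(requested_ms, tick_time_ms=2000):
    """Clamp the client's requested timeout into [2*tickTime, 20*tickTime]."""
    min_timeout = 2 * tick_time_ms
    max_timeout = 20 * tick_time_ms
    return max(min_timeout, min(requested_ms, max_timeout))


too_small = negotiate_session_timeout(1_000)   # raised to the lower bound
in_range = negotiate_session_timeout(30_000)   # kept as requested
too_large = negotiate_session_timeout(60_000)  # lowered to the upper bound
```

With the default tickTime of 2000 ms, the allowed range is 4 to 40 seconds.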

The Session expiration logic is maintained by the server of ZooKeeper. Once the Session expires, the server will immediately delete all temporary nodes created by the Client, and then notify all clients that are listening on these nodes for relevant changes.

In the second case, the heartbeat in ZooKeeper is implemented with PING requests. Every so often, the client sends a PING request to the server, and that is all a heartbeat is. The heartbeat tells the server that the client is alive, and the response tells the client that its connection to the server is still valid. The timing is based on tickTime, which by default is 2 seconds.

Watch mechanism

Now that we know about ZNode and Session, we can move on to the next key feature, Watch. The word “Watch” has come up several times above. Let me start by summarizing what it does in one sentence:

Register a listener on a node, and the listener receives a Watch Event whenever the node changes (for example, on update or delete).

Just as there are multiple types of ZNode, there are two types of Watch: one-time Watches and permanent Watches.

  • After a one-time Watch is triggered, the Watch will be removed
  • A permanent Watch remains after being triggered and can continue to listen for changes on ZNode. This is a new feature of ZooKeeper 3.6.0

A one-time Watch is set through a parameter when calling methods such as getData(), getChildren(), and exists(), while a permanent Watch is registered by calling addWatch().

In addition, a one-time Watch has a problem: there is a gap between the Watch event arriving at the client and the client registering a new Watch. If changes occur during this gap, the client will not be notified of them.
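The difference between the two kinds of Watch, and the gap that a one-time Watch leaves, can be simulated in a few lines. This is a toy model: WatchedNode is hypothetical, and for simplicity the callback receives the new data, whereas a real Watch Event only tells the client what kind of change happened and where.

```python
class WatchedNode:
    """Toy node with one-time and permanent watchers (not the real API)."""

    def __init__(self, data):
        self.data = data
        self._one_time = []
        self._permanent = []

    def get_data(self, watcher=None):
        if watcher is not None:
            self._one_time.append(watcher)  # removed after it fires once
        return self.data

    def add_watch(self, watcher):
        self._permanent.append(watcher)  # stays registered

    def set_data(self, data):
        self.data = data
        fired, self._one_time = self._one_time, []
        for watcher in fired + self._permanent:
            watcher(data)


one_time_seen, permanent_seen = [], []
node = WatchedNode("a")
node.get_data(watcher=one_time_seen.append)
node.add_watch(permanent_seen.append)

node.set_data("b")  # both watchers fire
node.set_data("c")  # the one-time watch is gone, so this change is missed
node.get_data(watcher=one_time_seen.append)  # client re-registers
node.set_data("d")
```

The one-time watcher never hears about "c", because that change landed before the client registered again.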

ZooKeeper cluster architecture

ZAB protocol

Now that we’ve laid the groundwork, we can take a closer look at ZooKeeper from an overall architecture perspective. To ensure its high availability, ZooKeeper uses a master-slave based read-write separation architecture.

We know that in a comparable Redis cluster, nodes communicate with each other using the Gossip protocol. So what is the communication protocol in ZooKeeper?

The answer is the ZAB (ZooKeeper Atomic Broadcast) protocol.

The ZAB protocol is an atomic broadcast protocol that supports crash recovery. It is used to pass messages between ZooKeeper nodes and keep all of them in sync. ZAB is also high-performance, highly available, easy to use, easy to maintain, and supports automatic failure recovery.

The ZAB protocol divides the nodes in a ZooKeeper cluster into three roles, namely Leader, Follower, and Observer, as shown in the following figure:

In general, this architecture is similar to the Redis or MySQL master-slave architecture, both of which were covered in earlier articles:

  • Redis master-slave
  • MySQL master-slave

The difference is that there are usually two roles in the master-slave structure, namely Leader and Follower (or Master and Slave), but ZooKeeper adds an Observer.

What is the difference between an Observer and a Follower?

Essentially they do the same job: both give ZooKeeper the ability to scale horizontally so that it can handle more concurrent requests. The difference is that an Observer does not vote when the Leader is elected.

Sequential consistency

As mentioned above, reading and writing are separated in the ZooKeeper cluster. Only the Leader node can handle write requests. If the Follower node receives a write request, it will forward the request to the Leader node for processing, and the Follower node itself will not handle write requests.

After the Leader node receives requests, it processes them one by one in strict order of arrival. This is a key feature of ZooKeeper, which ensures the sequential consistency of messages.

For example, if message A arrives before message B, then on all ZooKeeper nodes message A is applied before message B. ZooKeeper guarantees the global order of messages.


So how does ZooKeeper guarantee the order of messages? The answer is through ZXID.

ZXID can be understood simply as the unique ID of a message in ZooKeeper. Nodes communicate and synchronize data by sending Proposals (transaction proposals), and each Proposal carries a ZXID together with the actual data (the message). A ZXID is made up of two parts:

  • Epoch can be understood as the era of a dynasty, or the generation of the Leader; every Leader has a different epoch
  • Counter is a counter that increases by one every time a message comes in

This is also the basis of the algorithm that generates unique ZXIDs. Since each Leader uses a unique epoch, and different messages within the same epoch have different counter values, every Proposal has a unique ZXID across the ZooKeeper cluster.
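The two parts can be packed into one number so that comparing ZXIDs compares the epoch first and the counter second. The sketch below assumes the usual layout of a 64-bit ZXID, with the epoch in the high 32 bits and the counter in the low 32 bits; the helper names are made up.

```python
COUNTER_MASK = 0xFFFFFFFF  # low 32 bits


def make_zxid(epoch, counter):
    """Pack epoch (high 32 bits) and counter (low 32 bits) into one integer."""
    return (epoch << 32) | (counter & COUNTER_MASK)


def epoch_of(zxid):
    return zxid >> 32


def counter_of(zxid):
    return zxid & COUNTER_MASK


zxid = make_zxid(epoch=1, counter=5)
next_in_epoch = make_zxid(epoch=1, counter=6)
first_of_new_leader = make_zxid(epoch=2, counter=0)
```

Any ZXID from a newer epoch is larger than every ZXID from an older epoch, no matter how high the old counter had climbed.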

Recovery mode

A properly running ZooKeeper cluster is in broadcast mode. Conversely, if the Leader fails or more than half of the nodes are down, the cluster goes into recovery mode.

What is Recovery Mode?

In the ZooKeeper cluster, there are two modes:

  • Recovery mode
  • Broadcasting mode

When the ZooKeeper cluster fails, it enters recovery mode, also known as Leader Activation; as the name implies, the Leader is elected during this stage. The nodes compare their ZXIDs and Proposals, and then vote for one another. The vote follows two main principles:

  • The elected Leader must have the largest ZXID among all the nodes
  • More than half of the Followers must have returned an ACK, indicating approval of the elected Leader
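The two principles can be sketched as a single function. This is a heavily simplified model, not ZooKeeper’s election algorithm: it assumes every reachable node already knows everyone’s last ZXID, and it breaks ties by the larger server ID. The function name elect_leader and the sample values are made up.

```python
def elect_leader(last_zxids, ensemble_size):
    """Pick a leader from reachable voting nodes.

    last_zxids maps server ID -> last ZXID seen by that server.
    Returns the winning server ID, or None if no majority is reachable.
    """
    quorum = ensemble_size // 2 + 1
    if len(last_zxids) < quorum:
        return None  # not enough nodes to approve any leader
    # Largest ZXID wins; the larger server ID breaks ties (simplification).
    return max(last_zxids, key=lambda sid: (last_zxids[sid], sid))


reachable = {1: 0x100000003, 2: 0x100000005, 3: 0x100000005}
winner = elect_leader(reachable, ensemble_size=5)
no_majority = elect_leader({1: 0x100000003, 2: 0x100000009}, ensemble_size=5)
```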

If an exception occurs during the election, ZooKeeper simply starts a new election. If all goes well, the Leader is elected, but the cluster is not yet available to the outside world because the new Leader and the Followers have not yet synchronized critical data.

After that, the Leader waits for the rest of the followers to connect and then sends the missing data to all the followers via the Proposal.

As for how to tell which data is missing: Proposals are themselves recorded in a log, so a diff can be computed from the low 32-bit counter of the ZXID in each Proposal.

Of course, there is an optimization here. If too much data is missing, sending Proposals one by one is inefficient. So if the Leader finds that a Follower is missing too much data, it takes a snapshot of the current data, packages it, and sends it directly to that Follower.
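The choice between replaying Proposals and sending a snapshot can be sketched as follows. The function plan_sync and the threshold of 3 are assumptions chosen only to keep the example small; the real server decides based on what is still available in its transaction log.

```python
def plan_sync(leader_log, follower_last_zxid, snapshot_threshold=3):
    """Decide how the Leader brings one Follower up to date.

    leader_log is the Leader's list of committed ZXIDs in ascending order.
    Returns ("DIFF", missing_zxids) or ("SNAP", []).
    """
    missing = [zxid for zxid in leader_log if zxid > follower_last_zxid]
    if len(missing) > snapshot_threshold:
        return ("SNAP", [])  # too far behind: ship a full snapshot
    return ("DIFF", missing)  # close enough: replay the missing Proposals


leader_log = [101, 102, 103, 104, 105, 106]
slightly_behind = plan_sync(leader_log, follower_last_zxid=104)
far_behind = plan_sync(leader_log, follower_last_zxid=100)
```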

The newly elected Leader’s Epoch will be +1 from its original value, and the Counter will be reset to 0.

Do you think this is the end of it? In fact, the cluster still cannot serve requests at this point.

After data synchronization completes, the Leader sends a NEW_LEADER Proposal to the Followers. If and only if more than half of the Followers ACK this Proposal does the Leader commit it. Only then can the cluster work normally.

At this point, the recovery mode ends and the cluster enters broadcast mode.

Broadcasting mode

In broadcast mode, the Leader receives a message and sends a Proposal (transaction proposal) to all Followers. After receiving the Proposal, each Follower sends an ACK back to the Leader. When the Leader has received ACKs from a quorum, the current Proposal is committed and applied to the nodes’ memory. What is a quorum?

The official ZooKeeper recommendation is that more than half of the ZooKeeper nodes must return an ACK. Assuming there are N ZooKeeper nodes, the formula is N/2 + 1, with the division rounded down.

This may not be very intuitive, but in plain English: once more than half of the nodes have returned an ACK, the Proposal can be committed and applied to the in-memory ZNode.
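The quorum formula is easy to check with a few numbers. In this sketch, quorum_size and can_commit are made-up helper names, and the Leader is assumed to count its own ACK alongside those of the Followers.

```python
def quorum_size(n):
    """Smallest number of nodes that is more than half of n."""
    return n // 2 + 1


def can_commit(acks, ensemble_size):
    """True once ACKs from distinct nodes reach the quorum."""
    return len(set(acks)) >= quorum_size(ensemble_size)


sizes = {n: quorum_size(n) for n in (3, 4, 5)}

only_leader = can_commit(["leader"], ensemble_size=3)
leader_plus_one = can_commit(["leader", "follower-1"], ensemble_size=3)
```

Note that 4 nodes need 3 ACKs, the same as 5 nodes, which is one reason clusters are usually deployed with an odd number of nodes.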

ZooKeeper uses a 2PC-style (two-phase commit) process to keep data consistent between nodes (as shown in the figure above). However, if the Leader had to wait for every Follower, the communication overhead would grow and ZooKeeper’s performance would drop. Therefore, to improve performance, the Leader waits for ACKs from more than half of the Followers rather than from all of them.

Well, that’s all for this post. Search WeChat for “SH full stack notes” and reply “queue” to get MQ learning materials, including an analysis of basic concepts and a detailed RocketMQ source code walkthrough, which are updated continuously.
