ZooKeeper coordinates processes in distributed systems and serves as a foundation for building centralized services such as group messaging, shared registers, and distributed locks.

Coordination services that distributed systems need include configuration, group membership, leader election, and locking. ZooKeeper does not implement these services directly. Instead, it exposes an API that lets developers build their own primitives, relying on the fact that stronger primitives can be used to implement weaker ones. The API manipulates wait-free data objects organized hierarchically, as in a file system, and guarantees FIFO execution of each client's operations and linearizable writes. ZooKeeper uses a pipelined architecture to achieve high throughput and low latency. Zab, an atomic broadcast protocol, provides linearizability for update operations. Reads are served locally by the server the client is connected to and are not ordered through Zab. The watch mechanism notifies a client when data it has read is updated, so the client can fetch the latest data promptly.

The ZooKeeper service

ZooKeeper exposes its API to clients through a client library, which also manages the connection between the client and the ZooKeeper servers. Data nodes in ZooKeeper are called znodes and are organized in a hierarchical namespace. After connecting to a server, the client establishes a session and issues requests through the session handle.

The service overview

ZooKeeper provides the client with a data object abstraction (ZNode).

Znodes come in two types:

  • Regular: Clients create and delete these znodes explicitly.
  • Ephemeral: Clients create these znodes and either delete them explicitly or let the system remove them automatically when the session that created them terminates.

If the SEQUENTIAL flag is set when a znode is created, a monotonically increasing counter is appended to its name. ZooKeeper also implements watches, which notify a client when a znode it has read changes. A watch is a one-time trigger: it fires once and must be set again to receive further notifications.
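
The two behaviors described above can be illustrated with a toy in-memory model. This is only a sketch: `TinyStore` and its methods are hypothetical names, not the ZooKeeper client API, and the zero-padded counter width is an assumption of the example.

```python
# Toy in-memory model of SEQUENTIAL naming and one-shot watches.
# TinyStore is an illustrative stand-in, not the real ZooKeeper client.

class TinyStore:
    def __init__(self):
        self.data = {}       # path -> bytes
        self.counters = {}   # parent path -> next sequence number
        self.watches = {}    # path -> list of callbacks

    def create(self, path, value=b"", sequential=False):
        if sequential:
            parent = path.rsplit("/", 1)[0]
            seq = self.counters.get(parent, 0)
            self.counters[parent] = seq + 1
            path = f"{path}{seq:010d}"  # counter appended to the requested name
        self.data[path] = value
        return path

    def get_data(self, path, watch=None):
        if watch is not None:
            self.watches.setdefault(path, []).append(watch)
        return self.data[path]

    def set_data(self, path, value):
        self.data[path] = value
        # Watches fire once and are then discarded.
        for callback in self.watches.pop(path, []):
            callback(path)


store = TinyStore()
first = store.create("/app/task-", sequential=True)
second = store.create("/app/task-", sequential=True)

events = []
store.create("/app/config", b"v1")
store.get_data("/app/config", watch=events.append)
store.set_data("/app/config", b"v2")  # triggers the watch
store.set_data("/app/config", b"v3")  # no watch left, so no notification
```

After the second update, `events` still holds a single entry, which is why a client that wants continuous notifications has to set the watch again on each read.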

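Data model: ZooKeeper's data model is essentially a file system with a simplified API that supports only full reads and writes of a znode's data. Znodes are not meant for general data storage. They map to abstractions of the client application and typically hold configuration and metadata used for coordination.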

Session: The client connects to ZooKeeper and establishes a session. The session identifies the client.

The client API

  • create(path, data, flags): Creates a znode with path name path, stores data[] in it, and returns the name of the new znode. flags selects the znode type (regular or ephemeral) and sets the SEQUENTIAL flag.
  • delete(path, version): Deletes the znode at path if its version matches.
  • exists(path, watch): Returns true if the znode at path exists, and false otherwise. The watch flag lets the client set a watch on the znode.
  • getData(path, watch): Returns the data and metadata of the znode. The watch flag works as it does for exists.
  • setData(path, data, version): Writes data[] to the znode at path if the version matches.
  • getChildren(path, watch): Returns the set of names of the children of the znode.
  • sync(path): Waits for all updates pending at the start of the operation to reach the server the client is connected to. The path argument is currently ignored.

All of the above methods have both a synchronous and an asynchronous version. Methods that take a version number skip the version check when -1 is passed.
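
The version rule for conditional updates can be sketched with a small model. `VersionedNode` and `BadVersion` are hypothetical names used only for this illustration; the real client reports version mismatches through its own error types.

```python
# Toy model of version-conditional updates; names are illustrative only.

class BadVersion(Exception):
    pass


class VersionedNode:
    def __init__(self, data):
        self.data = data
        self.version = 0

    def set_data(self, data, expected_version):
        # Passing -1 skips the version check, as described above.
        if expected_version != -1 and expected_version != self.version:
            raise BadVersion(f"expected {expected_version}, found {self.version}")
        self.data = data
        self.version += 1
        return self.version


node = VersionedNode(b"a")
node.set_data(b"b", expected_version=0)      # matches, version becomes 1
try:
    node.set_data(b"c", expected_version=0)  # stale version, rejected
    rejected = False
except BadVersion:
    rejected = True
node.set_data(b"d", expected_version=-1)     # unconditional, version becomes 2
```

The version check gives clients a compare-and-set style update: a writer that read version 0 cannot silently overwrite a change made by someone else in the meantime.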

ZooKeeper guarantees

ZooKeeper has two basic ordering guarantees:

  • Linearizable writes: All requests that update the ZooKeeper state are serializable and respect precedence.
  • FIFO client order: All requests from a given client are executed in the order in which the client sent them.

An example illustrates how these two guarantees interact. Suppose a system elects a leader to manage the other processes. The new leader must update a number of configuration parameters and then notify the other processes, subject to two requirements:

  • While the leader is changing the configuration, other processes must not use the configuration that is being changed.
  • If the leader crashes before the update is complete, other processes must not use the partial configuration.

A ready znode solves this problem. The leader deletes ready before changing the configuration and recreates it once the change is complete. Other processes use the configuration only when they see that ready exists.

One problem remains: a process may see ready and start reading the configuration just before the leader deletes ready and begins changing it, so the process reads a configuration that is being modified. The watch mechanism solves this. A process that set a watch on ready is notified of the deletion before it can read any of the new configuration.
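
The order of operations in the ready protocol can be sketched in a single process. A plain dict stands in for the znode tree, and `update_config`, `read_config`, and the `/app/...` paths are hypothetical helpers and names chosen for the example. The sketch does not model concurrency or watches; it only shows which step comes first.

```python
# Single-process sketch of the "ready" znode protocol.
# A dict stands in for the znode tree; helper names and paths are assumptions.

def update_config(tree, new_config):
    """Leader side: hide the config, change it, then publish it again."""
    tree.pop("/app/ready", None)            # delete ready first
    for key, value in new_config.items():
        tree[f"/app/config/{key}"] = value  # updates applied in issue order
    tree["/app/ready"] = b""                # recreate ready last


def read_config(tree):
    """Worker side: read only when ready exists, otherwise return None."""
    if "/app/ready" not in tree:
        return None
    prefix = "/app/config/"
    return {p[len(prefix):]: v for p, v in tree.items() if p.startswith(prefix)}


tree = {}
before = read_config(tree)  # no ready znode yet
update_config(tree, {"host": b"10.0.0.1", "port": b"2181"})
after = read_config(tree)
```

Because a client's requests are executed in FIFO order, the leader can issue all of these updates asynchronously and still be sure that ready is created only after every configuration znode has been written.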

ZooKeeper also has two liveness and durability guarantees:

  • If a majority of the servers are active and communicating, the service is available.
  • If ZooKeeper responds successfully to a change request, the change persists across any number of failures as long as a quorum of servers is eventually able to recover.

Primitive examples

  • Configuration management: The configuration is stored in a znode. Each process reads the znode with the watch flag set and is notified when the configuration is updated.
  • Rendezvous: Many distributed systems contain a master and workers that are started by a scheduler, so the master's address is not known in advance. The master writes its information to a znode, where the workers can find it.
  • Group membership: When a member process starts, it creates an ephemeral znode under the znode that represents the group. When the process exits or fails, its ephemeral znode is removed as well. The group membership can therefore be obtained by listing the children of the group znode.
  • Simple locks: A lock is represented by a znode. A client acquires the lock by creating the znode. If the create succeeds, the client holds the lock. If the znode already exists, the client waits until the lock is released (the znode is deleted) and then tries to acquire it (create the znode) again.
  • Simple locks without herd effect: With simple locks, many clients compete for the lock every time it is released. To avoid this, lock requests are ordered and the lock is granted in request order.
Lock
1   n = create(l + "/lock-", EPHEMERAL|SEQUENTIAL)
2   C = getChildren(l, false)
3   if n is lowest znode in C, exit
4   p = znode in C ordered just before n
5   if exists(p, true) wait for watch event
6   goto 2

Unlock
1   delete(n)
  • Read/write locks: A write lock works like the lock above and excludes all other locks.

Write Lock
1   n = create(l + "/write-", EPHEMERAL|SEQUENTIAL)
2   C = getChildren(l, false)
3   if n is lowest znode in C, exit
4   p = znode in C ordered just before n
5   if exists(p, true) wait for event
6   goto 2

Read locks are compatible with each other and conflict only with write locks, so a reader waits only for write znodes ordered before its own.

Read Lock
1   n = create(l + "/read-", EPHEMERAL|SEQUENTIAL)
2   C = getChildren(l, false)
3   if no write znodes lower than n in C, exit
4   p = write znode in C ordered just before n
5   if exists(p, true) wait for event
6   goto 3
  • Double barrier: A double barrier lets clients synchronize the start and the end of a computation. Each client creates a child znode under the barrier znode before it starts computing and deletes that child when it finishes. A client may begin computing only once the number of children of the barrier znode reaches a given threshold. To wait, clients can watch for a special ready znode, which is created when the number of children reaches the threshold. Before leaving, a client waits until all child znodes have been deleted, which can likewise be signaled by deleting ready.
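
The decision step of the lock recipes in the list above (lines 3 and 4 of each pseudocode listing) can be sketched as pure functions over the list of children. The helper names are hypothetical, and the child names assume a `prefix-counter` format with a zero-padded counter.

```python
# Helpers that decide which znode a waiting client should watch.
# Names and the "prefix-counter" child format are assumptions of this sketch.

def sequence_of(name):
    """Extract the sequence counter from a name such as 'lock-0000000002'."""
    return int(name.rsplit("-", 1)[1])


def lock_predecessor(children, mine):
    """Return None if `mine` holds the lock, else the znode just before it."""
    ordered = sorted(children, key=sequence_of)
    index = ordered.index(mine)
    return None if index == 0 else ordered[index - 1]


def read_lock_predecessor(children, mine):
    """Readers wait only for the closest write znode ordered before them."""
    my_seq = sequence_of(mine)
    writers = [c for c in children
               if c.startswith("write-") and sequence_of(c) < my_seq]
    return max(writers, key=sequence_of) if writers else None


children = ["lock-0000000002", "lock-0000000000", "lock-0000000001"]
holder = lock_predecessor(children, "lock-0000000000")
waiter = lock_predecessor(children, "lock-0000000002")

rw_children = ["read-0000000000", "write-0000000001",
               "read-0000000002", "read-0000000003"]
reader_free = read_lock_predecessor(rw_children, "read-0000000000")
reader_blocked = read_lock_predecessor(rw_children, "read-0000000003")
```

Each waiting client watches exactly one znode, so releasing a lock wakes only the next client in line rather than every waiter.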

ZooKeeper applications

  • Fetching Service: In the Fetching Service of the Yahoo! crawler, the master must provide the fetcher processes with the system configuration, and the fetchers must report their own status. The service therefore uses ZooKeeper to manage configuration and to elect masters. The following figure shows the read and write operations of the system. Read operations account for the majority.

  • Katta: Katta is a distributed indexer. The master assigns shards to slaves and tracks their progress. ZooKeeper is mainly used for group membership, leader election, and configuration management.
  • Yahoo! Message Broker: Yahoo! Message Broker is a publish-subscribe system that manages a large number of topics spread across multiple servers, each replicated in a primary-backup scheme. The znode structure of the system is similar to the following figure: shutdown and migration_prohibited hold the system configuration, nodes holds information about the servers that are group members, and topics records the primary and backup servers responsible for each topic. Leader election is needed when a primary crashes.

ZooKeeper implementation

The following figure shows the components of ZooKeeper. Each server keeps a replica of the ZooKeeper data in an in-memory database. Write requests are committed to the database through an agreement protocol, while read requests are served directly from the local database of the server. Updates are logged to disk before they are applied to the database. After a failure, a server recovers from a snapshot and the log entries recorded after it. As part of the agreement protocol, write requests are forwarded to the leader.

Request processor

When the leader receives a write request, the request processor converts it into an idempotent transaction. It computes the new data, version number, and timestamps from the request and the current state, and the resulting transaction then waits to be applied to the database.
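
The conversion from a conditional request to an idempotent transaction can be sketched as follows. The dict-based transaction format and the names `to_transaction` and `apply_txn` are assumptions made for illustration, not ZooKeeper's internal representation.

```python
# Sketch of turning a conditional setData request into an idempotent
# transaction. The transaction format here is an assumption.
import time


def to_transaction(state, path, data, expected_version, now=None):
    """Leader side: compute the resulting state and capture it in a transaction."""
    current = state[path]
    if expected_version != -1 and expected_version != current["version"]:
        return {"type": "error", "path": path}
    return {
        "type": "setData",
        "path": path,
        "data": data,
        "version": current["version"] + 1,  # absolute value, not "increment"
        "mtime": now if now is not None else time.time(),
    }


def apply_txn(state, txn):
    """Applying the same transaction twice leaves the same state."""
    if txn["type"] == "setData":
        state[txn["path"]] = {
            "data": txn["data"],
            "version": txn["version"],
            "mtime": txn["mtime"],
        }


state = {"/cfg": {"data": b"old", "version": 3, "mtime": 0.0}}
txn = to_transaction(state, "/cfg", b"new", expected_version=3, now=100.0)
apply_txn(state, txn)
once = dict(state["/cfg"])
apply_txn(state, txn)  # replaying the transaction is harmless
twice = dict(state["/cfg"])
stale = to_transaction(state, "/cfg", b"x", expected_version=3, now=101.0)
```

The transaction records the resulting version rather than an instruction to increment it, which is what makes applying it more than once safe.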

Atomic broadcast

ZooKeeper uses Zab as its atomic broadcast protocol. Zab relies on simple majority quorums to reach agreement. It guarantees that changes are delivered in the order in which they were broadcast, and that a new leader has received all changes broadcast by previous leaders before it broadcasts its own.
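
The arithmetic behind simple majority quorums is shown below as a minimal sketch. The function names are hypothetical, and the code covers only the counting rule, not the Zab protocol itself.

```python
# Majority-quorum arithmetic only; this does not implement Zab.

def quorum_size(n_servers):
    """Smallest number of servers that forms a strict majority."""
    return n_servers // 2 + 1


def has_quorum(acks, n_servers):
    """True once enough servers have acknowledged a proposal."""
    return acks >= quorum_size(n_servers)


def tolerated_failures(n_servers):
    """Servers that may fail while a majority is still available."""
    return n_servers - quorum_size(n_servers)


five = (quorum_size(5), tolerated_failures(5))
four = (quorum_size(4), tolerated_failures(4))
```

An ensemble of 2f + 1 servers tolerates f failures, so adding a fourth server to a three-server ensemble raises the quorum size without raising the number of failures tolerated.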

Replicated database

When a server recovers from a crash, it restores its state from a periodic snapshot and the log entries written after the snapshot was started. ZooKeeper does not lock its state while taking a snapshot. This is safe because transactions are idempotent: reapplying changes that are already reflected in the snapshot has no effect.
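
A small worked example shows why a snapshot taken without locking still recovers correctly. The tuple-based transaction format, the znode names, and `apply_txn` are assumptions of this sketch.

```python
# Recovery from a snapshot taken without locking, followed by log replay.
# Transactions are (path, data, version) tuples; the format is an assumption.

def apply_txn(state, txn):
    path, data, version = txn
    state[path] = (data, version)  # sets absolute values, so replay is safe


log = [("/foo", "f2", 2), ("/goo", "g2", 2), ("/foo", "f3", 3)]

# State the live server reaches by applying the log in order.
live = {"/foo": ("f1", 1), "/goo": ("g1", 1)}
for txn in log:
    apply_txn(live, txn)

# A snapshot taken while the log was being applied: /goo was copied before
# its update, /foo after both of its updates. No live state ever looked
# like this.
fuzzy_snapshot = {"/foo": ("f3", 3), "/goo": ("g1", 1)}

recovered = dict(fuzzy_snapshot)
for txn in log:  # replay everything logged since the snapshot started
    apply_txn(recovered, txn)
```

Replaying the log over the inconsistent snapshot converges to the same state as the live server, because each transaction overwrites the znode with absolute values.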

Client-server interactions

When a server processes a write, it notifies the clients watching the affected data and clears those watches. Each server is responsible only for notifying the clients connected to it. Every read response is tagged with a zxid, the ID of the last write transaction seen by the server. Because reads are served locally, ZooKeeper provides the sync operation, which ensures that reads issued after sync observe the results of all writes issued before it. The client obtains the latest zxid from the server with each response. The zxid also ensures that when a client switches servers, the new server's view is not behind the view the client has already seen: the server's zxid must not be lower than the client's. To detect client failures, each session has a timeout, and the client must send heartbeats during periods of inactivity to keep the session from timing out.
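
The zxid comparison performed when a client switches servers can be sketched as follows. `can_serve` and `pick_server` are hypothetical helpers, and zxids are modeled as plain integers.

```python
# Sketch of the zxid check made when a client reconnects to another server.
# Helper names are assumptions; zxids are modeled as plain integers.

def can_serve(server_last_zxid, client_last_zxid):
    """A server may take over a session only if it is not behind the client."""
    return server_last_zxid >= client_last_zxid


def pick_server(servers, client_last_zxid):
    """Return the first server whose state is at least as recent as the client's view."""
    for name, zxid in servers:
        if can_serve(zxid, client_last_zxid):
            return name
    return None


servers = [("s1", 40), ("s2", 57), ("s3", 60)]
chosen = pick_server(servers, client_last_zxid=55)
none_ready = pick_server(servers, client_last_zxid=99)
```

In this sketch a lagging server is simply skipped. The property being illustrated is that a client never observes a state older than one it has already seen.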


  1. Hunt, Patrick, et al. “ZooKeeper: Wait-free Coordination for Internet-scale Systems.” USENIX annual technical conference. Vol. 8. No. 9. 2010.