Google's three papers affected many, many people and many, many systems, and have become classics in the distributed systems field. From MapReduce came Hadoop; from GFS came HDFS; from BigTable came HBase. And behind all three papers was Google's lock service, Chubby, which inspired Zookeeper.

With the popularity of big data, the H-named projects (Hadoop, HDFS, HBase) have become familiar names, and as a developer today, not knowing these terms can feel embarrassing. Yet Zookeeper is actually a more fundamental service than those H-projects, even for those of us who are not big data developers. The pity is that it has quietly stayed in the background, never as dazzling as its H-named siblings. So what exactly is Zookeeper? What can Zookeeper be used for? How do we use Zookeeper? How is Zookeeper implemented?

What is Zookeeper

The Zookeeper website says: "ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services."

Zookeeper is a distributed service coordination framework that provides synchronization, configuration maintenance, and naming services. It is a high-performance solution for distributed data consistency.


Relationship between Zookeeper and CAP

For a distributed system, partition tolerance is a key consideration: giving it up amounts to giving up scalability. Network failures occur frequently in distributed systems, and it is unacceptable for the whole system to become unavailable whenever one occurs. As a result, most distributed systems keep partition tolerance and make a trade-off between consistency and availability.

CAP relationship

ZooKeeper is CP (consistent and partition tolerant): every access request to ZooKeeper gets a consistent result, and the system tolerates network partitions. It does not, however, guarantee the availability of every request: in extreme situations ZooKeeper may drop some requests, and the client application may need to retry to get a result.

ZooKeeper is a distributed coordination service that keeps data synchronized and consistent across all servers under its jurisdiction, so it is not hard to see why it is designed as CP rather than AP. Moreover, Zab, ZooKeeper's core algorithm, exists precisely to solve the problem of keeping multiple servers synchronized in a distributed system.

Features and attributes of Zookeeper nodes

Zookeeper provides data storage based on a directory tree of nodes, similar to a file system, but Zookeeper is not meant for general-purpose data storage. Instead, it maintains and monitors the status of the data you store; by watching changes in that status, data-based cluster management can be achieved.

The data model

Unlike a Linux file system, which distinguishes directories from files, Zookeeper has a single kind of data node, the ZNode. A ZNode is the smallest unit of data in Zookeeper: each ZNode can both store data and mount child nodes, and together they form a hierarchical namespace, the ZNode tree.
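To make the "store data and mount children" idea concrete, here is a toy model of the hierarchical namespace. This is an illustration only, not the real ZooKeeper API; the class names are invented for the sketch.

```python
# Toy model of Zookeeper's ZNode tree: unlike a file system, EVERY node
# can hold a payload AND have children at the same time.

class ZNode:
    def __init__(self, data=b""):
        self.data = data        # payload stored at this node
        self.children = {}      # name -> ZNode, mounted child nodes

class ZNodeTree:
    def __init__(self):
        self.root = ZNode()

    def create(self, path, data=b""):
        # Walk /a/b/c from the root, creating intermediate nodes as needed.
        node = self.root
        for name in path.strip("/").split("/"):
            node = node.children.setdefault(name, ZNode())
        node.data = data
        return path

    def get(self, path):
        node = self.root
        for name in path.strip("/").split("/"):
            node = node.children[name]
        return node.data

tree = ZNodeTree()
tree.create("/app/config", b"max_conn=100")
print(tree.get("/app/config"))  # b'max_conn=100'
```

Note how `/app` exists as a node in its own right after the call, and could itself store data: there is no directory/file split.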

Znode tree structure

The type of a ZNode is specified when it is created. Zookeeper supports four types:

PERSISTENT: a persistent ZNode. Once created, the data stored in it does not disappear unless the client actively deletes it.

EPHEMERAL: a temporary ZNode. When a client connects to the Zookeeper service, a session is created over that connection, and ephemeral ZNodes are tied to the session. Once the client closes the connection, the server clears the session, and all ZNodes created within it disappear from the namespace. In short, the life cycle of this type of ZNode matches that of the client's connection.

PERSISTENT_SEQUENTIAL: a persistent, automatically numbered ZNode. Zookeeper appends a monotonically increasing sequence number to the node name, and the node does not disappear when the session breaks.

EPHEMERAL_SEQUENTIAL: a temporary, automatically numbered ZNode. The sequence number is assigned the same way, but the node disappears when the session ends.
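The sequential naming can be sketched as follows. This is a simplified model, assuming (as in real ZooKeeper) that the parent node keeps a per-parent counter padded to ten digits; the class here is invented for illustration.

```python
# Sketch of how a parent assigns names to SEQUENTIAL children:
# the parent keeps a monotonically increasing counter and appends it,
# zero-padded to 10 digits, to the requested name prefix.

class SequentialParent:
    def __init__(self):
        self.counter = 0
        self.children = []

    def create_sequential(self, prefix):
        name = f"{prefix}{self.counter:010d}"   # e.g. "lock-0000000000"
        self.counter += 1
        self.children.append(name)
        return name

parent = SequentialParent()
print(parent.create_sequential("lock-"))  # lock-0000000000
print(parent.create_sequential("lock-"))  # lock-0000000001
```

This numbering is what makes sequential nodes useful for ordering clients, e.g. in lock queues or elections.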

Watcher data change notification

Zookeeper uses the Watcher mechanism to implement distributed publish/subscribe of data changes.

Watcher mechanism

The Watcher mechanism of Zookeeper involves three parts: the client thread, the client-side WatcherManager, and the Zookeeper server. When the client registers a Watcher with the server, the Watcher object is stored in the client's WatcherManager. When the server triggers a Watcher event, it sends a notification to the client, and the client thread retrieves the corresponding Watcher object from the WatcherManager and executes its callback logic.
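The client-side half of this flow can be sketched in a few lines. This is a minimal illustration, not the real client library; it also models the fact that ZooKeeper watches are one-shot, i.e. a watcher fires once and must be re-registered.

```python
# Minimal sketch of the client-side Watcher flow: callbacks live in a
# WatcherManager keyed by path; when the server reports a change, the
# matching callbacks are pulled out and run. Removing them before
# invoking models ZooKeeper's one-shot watch semantics.

class WatcherManager:
    def __init__(self):
        self.watchers = {}   # path -> list of callbacks

    def register(self, path, callback):
        self.watchers.setdefault(path, []).append(callback)

    def fire(self, path, event):
        # One-shot: remove the watchers before invoking them.
        for cb in self.watchers.pop(path, []):
            cb(path, event)

events = []
mgr = WatcherManager()
mgr.register("/app/config", lambda p, e: events.append((p, e)))
mgr.fire("/app/config", "NodeDataChanged")
mgr.fire("/app/config", "NodeDataChanged")  # no watcher left: nothing happens
print(events)  # [('/app/config', 'NodeDataChanged')]
```

The second `fire` doing nothing is the point: a real client that wants continuous notifications re-registers the watch inside the callback.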

ACLs ensure data security

Zookeeper stores metadata about the running state of a distributed system, and this metadata directly affects the systems built on top of it. Keeping that data secure, and preventing the faults that arbitrary or accidental changes would cause, is therefore critical. Zookeeper provides a comprehensive ACL permission mechanism to guarantee data security.

We can understand the ACL mechanism from three aspects: the permission scheme (Scheme), the authorized object (ID), and the granted permission (Permission). A valid ACL entry is usually written as "scheme:id:permission".
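A small parser makes the "scheme:id:permission" shape concrete. The schemes `world`, `ip`, and `digest` and the permission letters c/d/r/w/a (create/delete/read/write/admin) are real ZooKeeper conventions; the parser itself is an illustrative sketch. Note that a digest ID itself contains a colon, so we split off the scheme from the left and the permission from the right.

```python
# Parse a "scheme:id:permission" ACL entry. The ID may contain colons
# (e.g. digest IDs are "user:hash"), so split scheme off the left and
# permission off the right, leaving the middle as the ID.

def parse_acl(entry):
    scheme, rest = entry.split(":", 1)
    id_, perms = rest.rsplit(":", 1)
    return {"scheme": scheme, "id": id_, "permission": perms}

print(parse_acl("world:anyone:cdrwa"))
# {'scheme': 'world', 'id': 'anyone', 'permission': 'cdrwa'}
print(parse_acl("digest:alice:a1b2c3hash=:rw"))
# {'scheme': 'digest', 'id': 'alice:a1b2c3hash=', 'permission': 'rw'}
```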

In-memory data

Zookeeper’s data model is a tree structure. The contents of the whole tree, including all node paths, node data, and ACL information, are stored in the in-memory database. Zookeeper periodically stores this data to disks.

DataTree: DataTree is the core of memory data storage. It is a tree structure that represents a complete piece of data in memory. DataTree does not contain any business logic related to the network, client connection, and request processing, and is a stand-alone component.

DataNode: DataNode is the smallest unit of data storage. In addition to storing the data content, ACL list and node state of the node, it also records the reference of the parent node and the child node list. It also provides the interface for operating the child node list.

ZKDatabase: Zookeeper’s in-memory database that manages all Zookeeper sessions, DataTree storage, and transaction logs. ZKDatabase periodically dumps snapshot data to the disk, and when Zookeeper is started, it recovers a complete in-memory database from the transaction logs and snapshot files on the disk.

Analysis of the implementation principles of Zookeeper

1. Network structure of the Zookeeper Service

A Zookeeper cluster has two kinds of nodes: a single Leader, and the rest, which are Followers. The Leader is chosen through an internal election.

Zookeeper architecture diagram

The Leader and Followers communicate with each other. Zookeeper's data lives in memory, with a copy backed up on disk.

If the Leader fails, the Zookeeper cluster elects another Leader at the millisecond level.

The Zookeeper Service is unavailable only when more than half of the Zookeeper nodes are down.

2. Zookeeper reads and writes data

Zookeeper data read/write process

Write data. When a client sends a write request to a Follower, the Follower forwards the request to the Leader. The Leader broadcasts the request using the internal Zab protocol until a majority of Zookeeper nodes have successfully written the data (memory synchronization plus disk update). The Zookeeper service then sends a response back to the client.

Read data. Since every Zookeeper node in the cluster presents the same namespace view, and the write path above guarantees that all nodes converge on the same data, a read can be served from any Zookeeper node.

3. Working principles of Zookeeper

The Zab protocol

At the core of Zookeeper is a broadcast mechanism that keeps the servers synchronized. The protocol that implements this mechanism is called the Zab protocol.

ZooKeeper Atomic Broadcast (Zab) is the core algorithm for data consistency: an atomic message broadcast protocol with crash recovery, designed specifically for ZooKeeper.

The core of Zab protocol is as follows:

All transaction requests must be coordinated by a single, globally unique server, the Leader; the remaining servers in the cluster are Followers. The Leader converts a client request into a transaction Proposal and distributes it to all Followers in the cluster, then waits for their feedback. Once more than half of the Followers have responded correctly, the Leader sends a Commit message to all Followers, instructing them to commit the preceding Proposal.
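The Leader's commit decision above can be sketched as follows. This is a deliberately simplified model, not real Zab: Followers are plain callables, whereas real Zab runs over TCP with ordered proposal queues and ZXID bookkeeping.

```python
# Sketch of the Zab commit decision: the Leader broadcasts a proposal,
# collects Acks, and commits once a majority of the ensemble (counting
# the Leader's own implicit Ack) has acknowledged it.

def zab_broadcast(proposal, followers, ensemble_size):
    acks = 1  # the Leader counts as having acked its own proposal
    for follower_ack in followers:
        if follower_ack(proposal):   # each follower Acks or discards
            acks += 1
    quorum = ensemble_size // 2 + 1
    return "COMMIT" if acks >= quorum else "PENDING"

# 5-server ensemble: Leader + 4 Followers, two of which fail to ack.
followers = [lambda p: True, lambda p: True, lambda p: False, lambda p: False]
print(zab_broadcast({"zxid": 1, "op": "setData"}, followers, 5))  # COMMIT
```

With 3 of 5 acks (Leader plus two Followers) the quorum of 3 is met, so the Proposal commits even though two Followers never responded.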

Zab mode

The Zab protocol consists of two basic modes: crash recovery and message broadcast.

When the whole service framework starts, or when the Leader server is interrupted, crashes, or exits and restarts, the Zab protocol enters recovery mode and elects a new Leader server.

When a new Leader server has been elected and more than half of the machines in the cluster have completed state synchronization with it, the Zab protocol exits recovery mode. State synchronization here means data synchronization: it ensures that more than half of the machines in the cluster hold the same data state as the Leader server.

When more than half of the Follower servers in the cluster are in sync with the Leader server, the entire service framework can enter broadcast mode.

When a Zab-compliant server starts and joins the cluster while a Leader is already broadcasting messages, the newcomer automatically enters data recovery mode: it finds the server where the Leader is, synchronizes data with it, and then joins the message broadcast process.

Zookeeper allows only the Leader server to process transaction requests. On receiving a client's transaction request, the Leader generates the corresponding transaction Proposal and initiates a round of the broadcast protocol. If any other machine in the cluster receives a transaction request, that non-Leader server first forwards the request to the Leader.

Message broadcast

The Zab protocol's message broadcast process uses an atomic broadcast protocol, similar to a 2PC commit. Specifically:

ZooKeeper uses a single Leader process to process all client transaction requests and uses the atomic broadcast protocol of Zab to broadcast server data state changes to followers in the form of transaction proposals. Therefore, ZooKeeper can handle a large number of concurrent requests from clients.

On the other hand, because transactions may depend on one another, the Zab protocol guarantees that the Leader's changes are broadcast and applied in order, since a state change may rely on state changes generated before it.

Finally, since the Leader process may crash or exit abnormally at any time, the Zab protocol also requires that a new Leader can be elected and data integrity preserved when that happens. On receiving a Proposal, a Follower writes it to disk and returns an Ack. Once the Leader receives a majority of Acks, it broadcasts the Commit message and commits the transaction itself; on receiving the Commit, the Followers commit as well.

Zab protocol simplifies 2PC transaction commit:

The abort logic is removed: a Follower either Acks the Leader's Proposal or discards it.

The Leader does not need every Follower to respond successfully; a majority of Acks is enough.

Crash recovery

As described above, Zab broadcasts messages during normal operation. Once the Leader server crashes or loses contact with more than half of the Followers, the protocol enters crash recovery mode.

The recovery mode requires electing a new Leader to restore all servers to the correct state.

Zookeeper in practice: shared locks and Leader election

Distributed locks control synchronized access to shared resources across distributed systems, guaranteeing consistent access to one resource, or a group of resources, by different systems. They mainly come in two kinds: exclusive locks and shared locks.

An exclusive lock is also called a write lock. If transaction T1 places an exclusive lock on data object O1, then only T1 may read or update O1 for the entire duration of the lock; no other transaction may perform any operation on O1 until T1 releases the lock.

An exclusive lock

A shared lock is also called a read lock. If transaction T1 places a shared lock on O1, then T1 may only read O1, and other transactions may only place further shared locks on O1, until all shared locks on the object are released.
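The compatibility rules of the two lock kinds fit in one function. This is the standard readers/writers compatibility check, written out for illustration; it is not Zookeeper's lock recipe itself (which uses ephemeral sequential nodes).

```python
# Lock compatibility: shared (read) locks coexist with other shared locks;
# an exclusive (write) lock coexists with nothing.

def can_acquire(requested, held):
    """requested: 'shared' or 'exclusive'; held: locks already granted."""
    if not held:
        return True   # nothing held, anything may be acquired
    return requested == "shared" and all(h == "shared" for h in held)

print(can_acquire("shared", ["shared", "shared"]))  # True:  readers coexist
print(can_acquire("exclusive", ["shared"]))         # False: writer must wait
print(can_acquire("shared", ["exclusive"]))         # False: reader blocked by writer
```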

A shared lock

Leader election

Leader election is the key to ensuring distributed data consistency. A server in a Zookeeper cluster triggers a Leader election when either of the following occurs:

The server is initially started.

A running server loses its connection with the Leader.

Since version 3.4.0, Zookeeper keeps only the TCP-based FastLeaderElection algorithm. When a machine enters a Leader election, the cluster is in one of two states:

A Leader exists in the cluster.

No Leader exists in the cluster.

If a Leader already exists in the cluster, it usually means this machine started late, after the cluster was already running normally. In that case, when the machine tries to start an election, it is told about the current Leader; it then simply establishes a connection to the Leader machine and synchronizes state.

When there is no Leader in the cluster, however, things are more involved. The steps are as follows:

(1) First vote. Whatever triggered the election, all machines in the cluster are now trying to elect a Leader, i.e. they are in the LOOKING state, and each LOOKING machine sends a message, called a vote, to all the others. A vote contains the server's SID (its unique identifier) and ZXID (its latest transaction ID), written as (SID, ZXID). Suppose a Zookeeper cluster consists of five machines with SIDs 1 through 5 and ZXIDs 9, 9, 9, 8, 8 respectively, and the machine with SID 2 is the Leader. At some point machines 1 and 2 fail, so a Leader election begins. In the first vote each machine nominates itself, so machines 3, 4, and 5 vote (3, 9), (4, 8), and (5, 8) respectively.

(2) Change of vote. After sending its vote, each machine also receives the votes of the other machines. Each machine processes the received votes according to certain rules and decides whether to change its own vote. This rule is the core of the entire Leader election algorithm. The terms are defined as follows:

vote_sid: the SID of the server chosen as Leader in a received vote.

vote_zxid: the ZXID of the server chosen as Leader in a received vote.

self_sid: the current server's own SID.

self_zxid: the current server's own ZXID.

Each received vote is processed by comparing (vote_sid, vote_zxid) with (self_sid, self_zxid).

Rule 1: if vote_zxid is greater than self_zxid, adopt the received vote and broadcast it again.

Rule 2: if vote_zxid is less than self_zxid, stick with your own vote and change nothing.

Rule 3: if vote_zxid equals self_zxid, compare the SIDs; if vote_sid is greater than self_sid, adopt the received vote and broadcast it again.

Rule 4: if vote_zxid equals self_zxid and vote_sid is less than self_sid, stick with your own vote and change nothing.

Combined with the above rules, the following cluster change process is given.

Leader election

(3) Determine the Leader. After the second round of voting, each machine in the cluster again receives the other machines' votes and counts them. If a machine receives the same vote from more than half of the machines, the server with that SID becomes the Leader. In this example, Server 3 becomes the Leader.

As these rules show, the newer a server's data (the larger its ZXID), the more likely it is to become the Leader, which also better guarantees data recovery. If the ZXIDs are equal, the server with the larger SID has the greater chance.
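The election rules and the counting step can be put together in a short simulation. This is a toy model, assuming a single round of vote exchange is enough to converge (true for this example); real FastLeaderElection also tracks election epochs and runs over TCP.

```python
# Toy FastLeaderElection: each server votes (sid, zxid); on receiving a
# vote it keeps the "larger" one (zxid compared first, then sid, exactly
# rules 1-4), and a Leader is chosen once a majority agrees.

from collections import Counter

def better(vote, own):
    # Higher zxid wins; on a zxid tie, higher sid wins.
    return vote if (vote[1], vote[0]) > (own[1], own[0]) else own

def elect(servers):
    """servers: dict of sid -> zxid for the machines taking part."""
    votes = {sid: (sid, zxid) for sid, zxid in servers.items()}
    first_round = list(votes.values())        # everyone's initial self-vote
    for sid in votes:
        for received in first_round:          # process each received vote
            votes[sid] = better(received, votes[sid])
    tally = Counter(votes.values())
    winner, count = tally.most_common(1)[0]
    # Majority of the participating servers must agree on the same vote.
    return winner[0] if count > len(servers) // 2 else None

# The example from the text: surviving servers 3, 4, 5 with ZXIDs 9, 8, 8.
print(elect({3: 9, 4: 8, 5: 8}))  # 3
```

Server 3 wins because its ZXID of 9 beats 8; with equal ZXIDs, the largest SID would win instead, matching the closing observation above.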