1 Introduction to ZooKeeper

ZooKeeper is an open source distributed coordination framework. It is positioned to provide consistency services for distributed applications and acts as the administrator of the entire big data ecosystem. ZooKeeper encapsulates complex and error-prone key services and exposes them to users as efficient, stable, and easy-to-use services.

If the official description above is hard to grasp, think of it this way: ZooKeeper = file system + listener (watch) notification mechanism.

1.1 File System

ZooKeeper maintains a tree-shaped data structure similar to a file system. It is not designed to store large amounts of data: each node can store at most 1 MB. Each directory entry, such as NameService, is called a ZNode. As in a file system, we can freely add and remove ZNodes, and add or remove children under a ZNode; the difference is that a ZNode can also store data. There are four types of ZNode (a creation sketch follows the list below):

  1. PERSISTENT: a persistent directory node. The node remains after the client disconnects from ZooKeeper.
  2. PERSISTENT_SEQUENTIAL: a persistent, sequentially numbered directory node. The node remains after the client disconnects, and ZooKeeper appends a monotonically increasing sequence number to its name.
  3. EPHEMERAL: a temporary directory node. The node is deleted after the client disconnects from ZooKeeper.
  4. EPHEMERAL_SEQUENTIAL: a temporary, sequentially numbered directory node. The node is deleted after the client disconnects, and ZooKeeper appends a sequence number to its name.
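
As a minimal sketch of these four types (assuming a connected ZooKeeper handle, an existing /app parent node, and placeholder paths and data), they map to the CreateMode values of the native Java client:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZNodeTypesDemo {
    /** Creates one node of each type under an (assumed existing) /app parent. */
    static void createExamples(ZooKeeper zk) throws Exception {
        // PERSISTENT: survives after the creating client disconnects.
        zk.create("/app/config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // PERSISTENT_SEQUENTIAL: persists and gets a number appended, e.g. /app/task-0000000003.
        zk.create("/app/task-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        // EPHEMERAL: deleted automatically when this client's session ends.
        zk.create("/app/worker", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        // EPHEMERAL_SEQUENTIAL: deleted on session end, name gets a sequence number.
        zk.create("/app/lock-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    }
}
```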

1.2 Monitoring notification mechanism

The Watcher mechanism is a very important feature of ZooKeeper. We can register watch events on ZNodes, such as node data changes, node deletion, and child node changes. Through this event mechanism, ZooKeeper supports features such as distributed locks and cluster management.

Watcher features:

When the watched data changes, ZooKeeper generates a Watcher event and sends it to the client, but only once. If the node changes again later, the client that set the Watcher is not notified again: a Watcher is a one-shot trigger. Permanent listening can be achieved by re-registering the Watcher in a loop, as sketched below.
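
A hedged sketch of such loop listening with the native Java client (the path and printing are placeholders; re-registration happens inside the callback):

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class LoopWatcher implements Watcher {
    private final ZooKeeper zk;
    private final String path;

    public LoopWatcher(ZooKeeper zk, String path) {
        this.zk = zk;
        this.path = path;
    }

    /** Reads the node and registers this object as a one-shot watch. */
    public void watch() throws Exception {
        byte[] data = zk.getData(path, this, new Stat());
        System.out.println("current value: " + new String(data));
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                // Re-register inside the callback so the next change is also observed.
                watch();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
```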

In general, ZooKeeper’s Watcher mechanism involves three steps:

  1. The client registers a Watcher, in one of three ways: getData, exists, or getChildren.

  2. The server processes the Watcher.

  3. The client executes the Watcher callback.

Monitoring process:

  1. The application starts with a main() thread.

  2. When the ZooKeeper client is created in the main thread, two additional threads are created: one for network communication (connect) and one for listening (listener).

  3. Registered watch events are sent to ZooKeeper through the connect thread.

  4. ZooKeeper adds the registered watch events to its list of registered listeners.

  5. When ZooKeeper detects a change in data or path, it sends a message to the client’s listener thread.

  6. The listener thread invokes the process() method internally.

1.3 ZooKeeper characteristics

  1. Cluster: ZooKeeper is a cluster consisting of one Leader and multiple Followers.
  2. High availability: the cluster works properly as long as more than half of its nodes are alive.
  3. Global data consistency: every Server stores the same copy of the data, so a Client sees the same data no matter which Server it connects to.
  4. Ordered updates: update requests from the same Client are executed in the order in which they are sent.
  5. Atomic updates: a data update either succeeds completely or fails completely.
  6. Real-time: a Client can read the latest data within a certain period of time.
  7. From the perspective of design patterns, ZK is a framework based on the observer pattern: it manages and stores the data everyone cares about, accepts registrations from observers, and notifies the registered observers when that data changes so they can react.
  8. ZooKeeper is a distributed coordination system that satisfies CP, unlike Eureka in Spring Cloud, which satisfies AP.

Distributed coordination system: the Leader synchronizes data to the Followers, and clients can read data from the Followers, so there is no single point of failure; as long as the synchronization delay is short enough, this works well as a distributed coordination system.

The CAP principle, also known as the CAP theorem, refers to Consistency, Availability, and Partition tolerance in a distributed system. It states that at most two of the three can be achieved at the same time, never all three at once.

2 Functions of Zookeeper

By combining ZooKeeper’s rich data node types with the Watcher event notification mechanism, it is easy to build a series of core functions required by distributed applications, such as data publish/subscribe, load balancing, naming services, distributed coordination/notification, cluster management, Master election, distributed locks, and distributed queues.

1. Data publishing/subscription

When data is shared by several machines, changes frequently, and is small in size, it is suitable to store it in ZK.

  • Data storage: store the data in a data node on ZooKeeper.
  • Data acquisition: at startup, the application reads the data from the ZooKeeper node and registers a data-change Watcher on it.
  • Data change: when the data is changed, the ZooKeeper node is updated and ZooKeeper sends a data-change notification to each client. After receiving the notification, a client can read the changed data again.

2. Distributed lock

Distributed locks have already been covered in the Redis article, and the locks provided by Redis generally perform better than ZooKeeper’s. ZooKeeper-based distributed locks come in two types.

  1. Exclusive lock

Core idea: there is a unique temporary (ephemeral) node in ZK, and only the client that successfully creates it may operate on the data; the clients that fail to create it must wait. Drawback: this can cause a herd effect, for example with 1000 concurrent clients, the other 999 immediately request the lock from ZK as soon as the first one finishes.

  2. Sequential control

To avoid the herd effect, a parent lock node already exists, and every client that wants the lock creates an ephemeral sequential node under it. The client holding the smallest sequence number gets the lock and deletes its node when finished; the others wait for their turn in order. A Curator-based sketch follows.
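
As an illustration, Curator's InterProcessMutex recipe implements this ephemeral-sequential pattern; the connect string and lock path below are placeholders:

```java
import java.util.concurrent.TimeUnit;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZkLockDemo {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Each competing client creates an ephemeral sequential node under /locks/order;
        // the smallest sequence number holds the lock, the rest wait in order.
        InterProcessMutex lock = new InterProcessMutex(client, "/locks/order");
        if (lock.acquire(5, TimeUnit.SECONDS)) {
            try {
                System.out.println("got the lock, doing work...");
            } finally {
                lock.release();   // deletes this client's sequential node
            }
        }
        client.close();
    }
}
```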

3. Load balancing

When the same service (the same jar) is deployed on multiple servers, load balancing can be configured on the server side with Nginx, or on the client side with ZooKeeper (see the sketch after the list below).

  1. Multiple service instances register themselves.
  2. The client retrieves the set of service addresses.
  3. The client selects one service at random from the set to perform the task.
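
A minimal client-side sketch of steps 2 and 3, assuming providers register their addresses as children of a service path (path and address format are placeholders):

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.zookeeper.ZooKeeper;

public class RandomPicker {
    /** Picks one provider address at random from the children of a service path. */
    static String pickProvider(ZooKeeper zk, String servicePath) throws Exception {
        List<String> providers = zk.getChildren(servicePath, true); // watch for membership changes
        if (providers.isEmpty()) {
            throw new IllegalStateException("no live providers under " + servicePath);
        }
        String chosen = providers.get(ThreadLocalRandom.current().nextInt(providers.size()));
        byte[] address = zk.getData(servicePath + "/" + chosen, false, null);
        return new String(address); // e.g. "192.168.1.10:8080"
    }
}
```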

Differences between ZooKeeper and Nginx load balancers:

ZooKeeper has no single point of failure: the ZAB mechanism guarantees that a new Leader can be elected. It is responsible only for service registration and discovery, not for forwarding traffic, which saves one hop of data exchange (consumers and providers communicate directly). The downside is that the load balancing algorithm has to be implemented by the client itself.

Nginx has a single point of failure, and the single point carries a high load and a large amount of traffic, so high availability has to be achieved with Keepalived + LVS standby machines. Every request passes through Nginx as a forwarding middleman, which adds network load (consumers and providers communicate indirectly), but the load balancing algorithms come built in.

4. Naming service

A naming service obtains the address of a resource or service from a given name. ZK can be used to create a globally unique path; this path can serve as a name pointing to a cluster, the address of a service provider, a remote object, and so on.

5. Distributed coordination/notification

  1. System scheduling: if an operator changes the value of a node in ZK, ZooKeeper pushes the change to all clients that registered a Watcher on that node.
  2. Progress reporting: each worker process creates a temporary node under a given directory and records its work progress there, so a summarizing process can watch the directory’s child nodes and get a real-time global view of the work progress.

6. Cluster management

Most cluster services in the big data ecosystem are managed with the help of ZooKeeper. The management mainly concerns the dynamic online/offline of machines and Leader election.

  1. Dynamic online/offline:

For example, there is a ZNode called /Configuration on the ZooKeeper server, and every machine in the cluster creates an EPHEMERAL node under it when it starts: server1 creates **/Configuration/server1**, server2 creates **/Configuration/server2**, and both then watch the /Configuration parent node. Any change of data or child nodes under that parent is pushed to the clients watching it, so the rest of the cluster learns immediately when a machine goes online or offline.

  2. Leader election:

The strong consistency of ZooKeeper guarantees that node creation is globally unique even in distributed, highly concurrent scenarios: when multiple clients request to create the /Master node at the same time, only one of them can succeed. With this feature you can easily run cluster-wide elections in a distributed environment.

Dynamic Master election takes advantage of the EPHEMERAL_SEQUENTIAL node type, where every node is automatically numbered: all create requests succeed, but in a sequence, and the machine holding the smallest sequence number is selected as the Master. A Curator-based sketch follows.
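
As a hedged illustration, Curator's LeaderLatch recipe follows this pattern (ephemeral sequential nodes under a latch path, smallest number wins); the connect string, path, and participant id are placeholders:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class MasterElectionDemo {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Every participant creates an ephemeral sequential node under /election/master;
        // the one holding the smallest sequence number becomes Master.
        LeaderLatch latch = new LeaderLatch(client, "/election/master", "server-1");
        latch.start();
        latch.await();                       // blocks until this instance is elected
        if (latch.hasLeadership()) {
            System.out.println("server-1 is now the Master");
        }

        latch.close();                       // releases leadership, next in line takes over
        client.close();
    }
}
```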

3 Leader election

The number of nodes in a ZooKeeper cluster should be odd; three or five nodes are typical. To avoid a leaderless cluster, a Leader must be elected. The election process is a frequent interview topic.

3.1 Preliminary Knowledge

3.1.1 Four server states

  1. LOOKING: looking for a Leader. A server in this state considers that the cluster currently has no Leader and therefore needs to enter Leader election.
  2. FOLLOWING: the Follower state. It processes non-transactional client requests, forwards transactional requests to the Leader server, votes on transaction Proposals, and votes in Leader elections.
  3. LEADING: the Leader state. It is the sole scheduler and processor of transactional requests, guaranteeing that cluster transactions are processed in order, and it acts as the coordinator of the other servers in the cluster (managing Followers and data synchronization).
  4. OBSERVING: the Observer state. A server role introduced after version 3.0 that increases the cluster’s non-transactional (read) throughput without affecting its transaction processing capability. It processes non-transactional client requests, forwards transactional requests to the Leader server, and does not participate in any form of voting.

3.1.2 Server ID

When a ZK cluster is set up, each node is assigned a unique number in its myid file. The larger the number, the greater its weight in the Leader election algorithm; for example, during initial startup the comparison falls back to the Server ID.

3.1.3 ZXID

ZooKeeper uses a globally increasing transaction id, the ZXID, to identify all proposals; the ZXID is attached to every proposal when it is put forward. The ZXID is a 64-bit long value and is the key to guaranteeing the ordering consistency of transactions. Its high 32 bits represent the epoch and its low 32 bits represent the transaction counter xid; in general, the larger the ZXID, the newer the data (a small sketch of splitting a ZXID follows the list below).

  1. Each Leader has its own epoch value, which stands for an era/dynasty and identifies the Leader cycle. Whenever a new election starts, a new epoch is generated; once the new Leader takes office, the epoch is incremented and written into the ZXID and epoch of every zkServer.
  2. The xid is an ascending transaction counter; the larger the value, the newer the data. Every proposal carries a ZXID, and the transaction execution request is then sent to the other servers following a two-phase process similar to a database’s 2PC. If more than half of the machines can execute it successfully, the transaction begins to execute.
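
As a small illustration of the 64-bit layout described above, the following snippet splits an example ZXID into its epoch and counter parts using plain bit arithmetic (not a ZooKeeper API):

```java
public class ZxidDecoder {
    public static void main(String[] args) {
        long zxid = 0x0000000500000007L;          // example value: epoch 5, counter 7

        long epoch   = zxid >>> 32;               // high 32 bits: Leader epoch
        long counter = zxid & 0xFFFFFFFFL;        // low 32 bits: per-epoch transaction counter

        System.out.printf("zxid=0x%016x epoch=%d counter=%d%n", zxid, epoch, counter);
    }
}
```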

3.2 Leader election

The election of the Leader is generally divided into the election at startup and the run-time election after the Leader fails.

3.2.1 Leader election during startup

Take a cluster of five machines as an example: the cluster can work properly only if more than half of them, i.e. at least three servers, are started.

  1. Server 1 starts and initiates an election.

Server 1 votes for itself. At this point server 1 has one vote, which is less than a majority (3 votes), so the election cannot be completed and server 1 stays in the LOOKING state.

  2. Server 2 starts and another election is held.

Servers 1 and 2 each vote for themselves and exchange votes. Server 1 finds that server 2’s ID is larger than its own and changes its vote to server 2. At this point server 1 has 0 votes and server 2 has 2 votes, still less than a majority (3 votes), so the election cannot be completed. Servers 1 and 2 stay in the LOOKING state.

  3. Server 3 starts and initiates an election.

As in the process above, servers 1 and 2 first vote for themselves and then change their votes to server 3, because server 3 has the largest ID. The result of this round: 0 votes for server 1, 0 votes for server 2, 3 votes for server 3. Server 3 now holds more than half of the votes (3 votes) and is elected Leader. Servers 1 and 2 change their state to FOLLOWING, and server 3 changes to LEADING.

  4. Server 4 starts and initiates an election.

At this point servers 1, 2, and 3 are no longer in the LOOKING state, so they do not change their ballots when exchanging voting information. The result: three votes for server 3 and one vote for server 4. Server 4 obeys the majority, changes its vote to server 3, and changes its state to FOLLOWING.

  5. Server 5 starts and initiates an election.

Like server 4, server 5 changes its vote to server 3. Server 3 now has 5 votes and server 5 has 0 votes. Server 5 changes its state to FOLLOWING.

  6. Final result

Server 3 is the Leader and is in the LEADING state; the other servers are Followers and are in the FOLLOWING state.

3.2.2 Run-time Leader election

During operation, if the Leader node crashes, the cluster enters recovery mode, and external service is suspended until a new Leader is elected. The process can be roughly divided into four stages: election, discovery, synchronization, and broadcast.

  1. Each server issues a vote; the first vote is for itself, where a vote = (myid, ZXID).
  2. Each server collects the votes from the other servers.
  3. It processes the votes and re-votes; the processing logic is: compare the ZXID first, then compare the myid (a comparator sketch follows this list).
  4. As soon as more than half of the machines have received the same voting information, the Leader can be determined; note the increment and synchronization of the epoch.
  5. Each server changes its state from LOOKING to FOLLOWING or LEADING.
  6. After the Followers connect to the Leader, the Leader compares its own latest committed ZXID with the ZXID on each Follower; depending on the result, the Follower either rolls back or synchronizes from the Leader, which guarantees that transactions are consistent across all nodes in the cluster.
  7. The cluster returns to broadcast mode and starts to accept write requests from clients.
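
A hedged sketch of the comparison rule in step 3 (not ZooKeeper's actual FastLeaderElection code): a server switches its vote when the incoming vote carries a larger ZXID, or an equal ZXID with a larger server id.

```java
/** A (zxid, myid) vote, compared as step 3 above describes: higher zxid wins, then higher myid. */
record Vote(long zxid, long myid) {

    /** Returns true if the incoming vote should replace the vote we currently hold. */
    boolean shouldSwitchTo(Vote other) {
        if (other.zxid != this.zxid) {
            return other.zxid > this.zxid;   // newer data wins
        }
        return other.myid > this.myid;       // tie-break on server id
    }
}
```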

3.3 Split brain

Split brain is an issue that must be considered in any cluster deployment, for example Hadoop or Spark clusters. ZAB avoids it by requiring 2N+1 nodes in the cluster: when the network partitions, at most one partition can hold more than half of the nodes, while the other holds fewer than N+1. Since electing a Leader requires the consent of more than half of the nodes, we can draw the following conclusion:

With the majority (quorum) mechanism, a ZooKeeper cluster either has no Leader or has exactly one Leader, which avoids the split-brain problem.

4 The ZAB consistency protocol

It is suggested to read about 2PC, 3PC, Paxos, Raft, and ZAB first; otherwise this section may be difficult to follow.

4.1 Introduction to ZAB

ZAB (ZooKeeper Atomic Broadcast) is an atomic broadcast protocol supporting crash recovery, designed specifically for the distributed coordination service ZooKeeper. Based on this protocol, ZooKeeper implements a master-slave architecture to keep the data replicas in the cluster consistent.

In this master-slave system, the Leader is responsible for write requests from external clients, while the Follower servers handle reads and synchronization. Two problems need to be solved.

  1. How the Leader server updates data to all followers.
  2. The Leader server suddenly fails. What about the cluster?

Therefore, ZAB protocol designs two working modes to solve the above two problems. Zookeeper switches between these two modes:

  1. Atomic broadcast mode: updates data to all followers.
  2. Crash recovery mode: How to recover when the Leader crashes.

4.2 Atomic broadcast mode

You can think of the message broadcast mechanism as a simplified version of the 2PC protocol, which ensures sequential transaction consistency through the following mechanism.

  1. After receiving a write request from a client, the Leader generates a new transaction and assigns it a unique ZXID.
  2. The Leader wraps the message in a Proposal carrying the ZXID and places it on the FIFO queue it maintains for each Follower.
  3. Each FIFO queue delivers the Proposal at its head to the corresponding Follower node.
  4. After receiving the Proposal, the Follower writes it to disk and, once the write succeeds, sends an ACK back to the Leader.
  5. The ACK is returned to the Leader through the FIFO queue.
  6. When the Leader has received ACKs from more than half of the Followers, it commits locally and then sends a COMMIT message through the FIFO queues.
  7. When a Follower receives a COMMIT, it checks whether any transaction with a smaller ZXID in its history queue has not yet been committed; if so, it waits for that smaller transaction’s COMMIT first, otherwise it commits.

4.3 Crash Recovery

Can data remain consistent if the Leader crashes in the middle of a message broadcast? When the Leader crashes, the cluster enters crash recovery mode, which mainly has to deal with the following two cases.

  1. The Leader crashes right after replicating data to all Followers.
  2. The Leader crashes after receiving the ACKs, committing locally, and sending the COMMIT to only some of the Followers.

To address this problem, ZAB defines two principles:

  1. The ZAB protocol ensures that transactions that have been committed by the Leader are eventually committed by all servers.
  2. The ZAB protocol ensures that transactions that are only proposed/replicated by the Leader, but not committed, are discarded.

How does ZAB ensure that transactions already committed by the Leader are eventually committed, while transactions that should be discarded are indeed discarded? The key lies in the ZXID described above.

4.4 ZAB features

  1. Reliable delivery (consistency assurance)

If a transaction A is committed by one server, it will eventually be committed by all servers.

  2. Total order

If there are two transactions A and B and one server executes A before B, then A is executed before B on all servers.

  3. Causal order

If the sender broadcasts B after transaction A has committed, then B must be executed after A.

  4. High availability

As long as a majority (a quorum) of nodes are up, the system runs normally.

  5. Recoverability

When a node restarts after going offline, it must be able to resume the transactions that were in progress.

4.5 Comparison between ZAB and Paxos

Similarities:

  • Both have a Leader role that coordinates the execution of multiple Follower processes.

  • The Leader waits for correct feedback from more than half of the Followers before committing a proposal.

  • In the ZAB protocol, each Proposal carries an epoch value representing the current Leader cycle; in Paxos the equivalent is called a Ballot.

Difference:

ZAB is used to build a highly available distributed master/slave data system (ZooKeeper), whereas Paxos is used to build a distributed, consistent state-machine system.

5 Miscellaneous ZooKeeper knowledge

5.1 Common Commands

Zookeeper can be deployed in three modes:

Single-machine deployment: the service runs on one host.

Cluster deployment: the instances run on multiple machines.

Pseudo-cluster deployment: multiple ZooKeeper instances run on one machine.

Common commands after the deployment are as follows:
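
For example, typical commands used through the zkCli.sh client include:

  • ls /path — list the children of a node
  • create /path data — create a node (-e for ephemeral, -s for sequential)
  • get /path — read a node’s data
  • set /path data — update a node’s data
  • delete /path — delete a node
  • stat /path — show a node’s metadata (version, ZXID, and so on)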

5.2 Zookeeper Client

5.2.1 Native ZooKeeper client

The native ZooKeeper client is asynchronous: you need to introduce a CountDownLatch to make sure the connection is established before doing anything else (see the sketch after the list below). The native API also does not support recursive path creation and deletion. Its main drawbacks are:

  • The session connection is asynchronous, so callbacks must be used.

  • Watches must be registered repeatedly: a watch fires once and then has to be registered again.

  • Session reconnection must be handled by the application: when a session is disconnected, it has to be re-established manually.

  • High development complexity: development work is relatively tedious.
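
A minimal connection sketch with the native client, using a CountDownLatch to block until the SyncConnected event arrives (the connect string is a placeholder):

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class NativeClientConnect {
    public static ZooKeeper connect(String connectString, int sessionTimeoutMs)
            throws IOException, InterruptedException {
        CountDownLatch connected = new CountDownLatch(1);
        // The constructor returns immediately; the session is established asynchronously.
        ZooKeeper zk = new ZooKeeper(connectString, sessionTimeoutMs, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();      // release the latch once the session is up
            }
        });
        connected.await();                  // block until SyncConnected arrives
        return zk;
    }

    public static void main(String[] args) throws Exception {
        ZooKeeper zk = connect("localhost:2181", 15000);  // placeholder connect string
        System.out.println("session id: 0x" + Long.toHexString(zk.getSessionId()));
        zk.close();
    }
}
```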

5.2.2 ZkClient

An open source ZK client wrapped on top of the native API; it is easier to use and makes the following optimizations.

Optimization 1: it automatically creates a new ZooKeeper instance to reconnect on session loss and session expiry. Optimization 2: it wraps the one-shot Watcher into a persistent Watcher.

5.2.3 Curator

An open source ZK client built on top of the native API and now an Apache top-level project. Curator is a client framework originally open-sourced by Netflix. Anyone who has worked with ZooKeeper’s native API knows how complex it is; Curator encapsulates it and handles many development details for us, such as connection retries, repeated Watcher registration, and NodeExistsException. It is currently one of the most popular ZooKeeper clients. A short connection sketch follows.
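
A minimal Curator sketch (connect string and path are placeholders), showing the retry policy and the recursive path creation that the native API lacks:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorDemo {
    public static void main(String[] args) throws Exception {
        // Retry policy: start with 1s between retries, back off exponentially, give up after 3 tries.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));  // placeholder connect string
        client.start();

        // Unlike the native API, Curator can create the whole path in one call.
        client.create().creatingParentsIfNeeded()
              .forPath("/demo/app/config", "v1".getBytes());

        byte[] data = client.getData().forPath("/demo/app/config");
        System.out.println(new String(data));

        client.close();
    }
}
```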

5.2.4 ZooKeeper graphical client tool

The tool is called ZooInspector; installation tutorials are easy to find online.

5.3 ACL Permission Control Mechanism

An Access Control List (ACL) is used to control access permissions to resources. ZooKeeper uses ACL policies to control access to nodes, such as reading and writing node data, creating and deleting nodes, reading the list of child nodes, and setting node permissions.

5.4 Precautions for Using Zookeeper

  1. More machines in the cluster is not always better: a write operation needs ACKs from more than half of the nodes, so the more nodes a cluster has, the more failed nodes it can tolerate (more reliable), but the lower its write throughput. The cluster size should be odd.
  2. ZK performs reads and writes in memory and broadcasts messages, so it is not recommended to read or write large data on nodes.
  3. The dataDir and dataLogDir directories grow over time and can fill the disk. It is advisable to configure automatic cleanup or write your own script to keep only the latest N files.
  4. The default maximum number of connections is 60. Configure the maxClientCnxns parameter to set the maximum number of connections allowed from a single client machine.
