Introduction

ZooKeeper is an open source distributed coordination service created by Yahoo and an open source implementation of Google Chubby. Distributed applications can implement functions such as data publish/subscribe, load balancing, naming services, distributed coordination/notification, cluster management, Master elections, distributed locks, and distributed queues based on ZooKeeper.

Basic concepts

This section introduces the core concepts of ZooKeeper. They will come up again and again later, so it is important to understand them in advance.

Cluster roles

In ZooKeeper, there are three roles:

  • Leader
  • Follower
  • Observer

A ZooKeeper cluster has only one Leader at a time, and the others are followers or observers.

The ZooKeeper configuration is simple: every node uses the same configuration file (zoo.cfg); the only per-node difference is the myid file. The value in myid must be the {value} part of the corresponding server.{value} entry in zoo.cfg.

Example zoo.cfg file:

maxClientCnxns=0
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/var/lib/zookeeper/data
# the port at which the clients will connect
clientPort=2181
# the directory where the transaction logs are stored.
dataLogDir=/var/lib/zookeeper/logs
server.1=192.168.20.101:2888:3888
server.2=192.168.20.102:2888:3888
server.3=192.168.20.103:2888:3888
server.4=192.168.20.104:2888:3888
server.5=192.168.20.105:2888:3888
minSessionTimeout=4000
maxSessionTimeout=100000

Run the zookeeper-server status command on the terminal of the ZooKeeper host to check whether the ZooKeeper role of the current node is Leader or Follower.

[root@node-20-103 ~]# zookeeper-server status
JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
Mode: follower
[root@node-20-104 ~]# zookeeper-server status
JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
Mode: leader

As shown above, node-20-104 is the Leader and node-20-103 is a Follower.

By default, ZooKeeper has only Leader and Follower roles, but no Observer role.

To use the Observer mode, add peerType=observer to the configuration file of every node that should become an Observer, and append :observer to that server's entry in the configuration files of all servers, for example: server.1=localhost:2888:3888:observer

All machines in a ZooKeeper cluster select a Leader through a Leader election process. The Leader server provides read and write services for clients.

Followers and Observers provide read services but not write services. The only difference between the two is that the Observer does not participate in the Leader election process, nor in the "write succeeds once more than half of the servers acknowledge it" policy for write operations. Therefore, Observers can improve the read performance of the cluster without affecting write performance.

Session

Session refers to a client session, so before talking about client sessions, let's talk about client connections. In ZooKeeper, a client connection is a long-lived TCP connection between the client and a ZooKeeper server. ZooKeeper serves clients on port 2181 by default. When a ZooKeeper client starts, it establishes a TCP connection with a server; from the moment this first connection is established, the client session's life cycle begins. Through this connection the client can send requests to the ZooKeeper server and receive responses, and it can also receive Watch event notifications from the server. The SessionTimeout value sets the timeout period of a client session. If the connection is broken because of heavy server load, a network failure, or the client actively disconnecting, the previously created session remains valid as long as the client can reconnect to any server in the cluster within the time specified by SessionTimeout.
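
To make the session life cycle above concrete, here is a minimal sketch using the official Java client: it opens a connection, waits for the SyncConnected event before using the session, and then closes it explicitly. The connect string 192.168.20.101:2181 and the 15-second session timeout are illustrative values only.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooKeeper;

public class SessionDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // The watcher passed to the constructor receives connection state changes.
        ZooKeeper zk = new ZooKeeper("192.168.20.101:2181", 15000, event -> {
            if (event.getState() == KeeperState.SyncConnected) {
                connected.countDown();       // the session is now established
            }
        });
        connected.await();                   // block until the TCP connection and session are ready
        System.out.println("session id: 0x" + Long.toHexString(zk.getSessionId()));
        zk.close();                          // closing the client ends the session explicitly
    }
}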

Data Node (ZNode)

When talking about distributed systems, "node" generally refers to each machine that makes up a cluster. The data node in ZooKeeper, however, is a unit in its data model, called a ZNode. ZooKeeper stores all data in memory, and the data model is a tree of ZNodes (ZNode Tree); every path separated by slashes (/) is a ZNode, for example /hbase/master. Each ZNode stores its own data content along with a series of attributes.

Note: a ZNode behaves both like a file and like a directory in Unix. Each ZNode can hold data itself (like a file) and can also have child nodes under it (like a directory).

In ZooKeeper, ZNodes are classified into persistent nodes and temporary nodes.

Persistent node

A persistent node means that once a ZNode is created, it will remain on ZooKeeper until the ZNode is removed.

Temporary node

The life cycle of a temporary node is tied to a client session, and any temporary nodes created by the client are removed once the client session fails.

In addition, ZooKeeper allows users to add a special property for each node: SEQUENTIAL. Once a node is marked with this property, when the node is created, ZooKeeper automatically appends the node with an integer number, which is an increment maintained by the parent node.
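
The sketch below shows how the node types described above are selected through the CreateMode argument of the Java client; it assumes a ZooKeeper server reachable at localhost:2181, and the paths /demo, /demo/worker, and /demo/job- are purely illustrative.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class NodeTypesDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });
        // Persistent node: stays on ZooKeeper until it is explicitly deleted.
        try {
            zk.create("/demo", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException ignored) { }
        // Temporary (ephemeral) node: removed automatically when this client session ends.
        zk.create("/demo/worker", "192.168.20.101".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        // SEQUENTIAL node: the parent appends a monotonically increasing counter to the name.
        String seq = zk.create("/demo/job-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("created " + seq);    // e.g. /demo/job-0000000003
        zk.close();                               // the two ephemeral nodes disappear here
    }
}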

Version

ZooKeeper stores data on each ZNode. For every ZNode, ZooKeeper maintains a data structure called Stat, which records three versions of the ZNode: version (the data version of the current ZNode), cversion (the version of the current ZNode's children), and aversion (the ACL version of the current ZNode).

State information

In addition to storing data content, each ZNode also stores some state information about the ZNode itself. You can use the get command to obtain both the content and status information of a ZNode. As follows:

[zk: localhost:2181(CONNECTED) 23] get /yarn-leader-election/appcluster-yarn/ActiveBreadCrumb
appcluster-yarnrm1
cZxid = 0x1B00133dc0          // transaction ID with which the ZNode was created
ctime = Tue Jan 03 15:44:42 CST 2017          // time when the ZNode was created
mZxid = ...          // transaction ID of the last update to the ZNode
mtime = Fri Jan 06 08:44:25 CST 2017          // time when the ZNode was last updated
pZxid = 0x1B00133dc0          // transaction ID of the last change to the child node list; pZxid changes only when the list of children changes, not when their content changes
cversion = 0          // version number of the child node list
dataVersion = 11          // version number of the node data
aclVersion = 0          // version number of the node's ACL
ephemeralOwner = 0x0          // sessionID of the session that created the node; 0 if the node is persistent
dataLength = 22          // length of the data content
numChildren = 0          // number of children

In ZooKeeper, the version attribute is used to implement the "write check" of the optimistic locking mechanism, which guarantees the atomicity of distributed data updates.
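
As a sketch of how the version field supports this write check (assuming an already connected client and an existing node whose path is passed in), the helper below reads the data together with its Stat and sends the observed dataVersion back with setData; if another client updated the node in the meantime, the call fails with a BadVersionException instead of silently overwriting.

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class OptimisticUpdate {
    // Compare-and-set style update: returns true only if nobody changed the node in between.
    static boolean casUpdate(ZooKeeper zk, String path, byte[] newData) throws Exception {
        Stat stat = new Stat();
        zk.getData(path, false, stat);                     // stat is filled with the current versions
        try {
            zk.setData(path, newData, stat.getVersion());  // the expected dataVersion travels with the write
            return true;
        } catch (KeeperException.BadVersionException e) {
            return false;                                  // concurrent update detected; the caller may retry
        }
    }
}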

Transaction operations

In ZooKeeper, an operation that changes the state of the ZooKeeper server is called a transaction. Such operations include creating and deleting data nodes, updating data content, and creating and expiring client sessions. For each transaction request, ZooKeeper assigns a globally unique transaction ID, represented by a ZXID, which is typically a 64-bit number. Each ZXID corresponds to one update operation, and from these ZXIDs you can indirectly determine the global order in which ZooKeeper processed the transaction requests.

Watcher

Watcher (event listener) is an important feature of ZooKeeper. ZooKeeper allows users to register Watchers on a specified node, and when certain events are triggered, the ZooKeeper server notifies the interested clients. This mechanism is an important feature of ZooKeeper for implementing distributed coordination services.
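
A minimal sketch of registering a Watcher with the Java client is shown below; the path is whatever node the caller wants to observe. Note that ZooKeeper watchers are one-shot, so the sketch re-registers the watch after every notification.

import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;

public class DataWatchDemo {
    // Reads the node and arms a watch; ZooKeeper watchers fire once and must be set again.
    static void watchData(ZooKeeper zk, String path) throws Exception {
        zk.getData(path, event -> {
            if (event.getType() == EventType.NodeDataChanged) {
                System.out.println(path + " changed, fetching the new value...");
            }
            try {
                watchData(zk, path);   // read again and re-arm the watch
            } catch (Exception ignored) { }
        }, null);
    }
}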

ACL

ZooKeeper uses Access Control Lists (ACLs) to control permissions. ZooKeeper defines the following five permissions.

  • CREATE: permission to create child nodes.
  • READ: permission to read the node's data and list its children.
  • WRITE: permission to update the node's data.
  • DELETE: permission to delete child nodes.
  • ADMIN: permission to set the node's ACL.

Note: CREATE and DELETE are permission controls for child nodes.
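
The sketch below shows how these permissions can be combined into an ACL when a node is created: everyone may read the node, while the other four permissions are reserved for the creator, who authenticates with the digest scheme. The credential admin:secret and the helper name are illustrative.

import java.util.Arrays;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.ACL;
import org.apache.zookeeper.data.Id;

public class AclDemo {
    static void createProtectedNode(ZooKeeper zk, String path, byte[] data) throws Exception {
        // Authenticate this session; "auth"-scheme ACL entries then bind to these credentials.
        zk.addAuthInfo("digest", "admin:secret".getBytes());
        List<ACL> acl = Arrays.asList(
                // Anyone may READ the node...
                new ACL(ZooDefs.Perms.READ, new Id("world", "anyone")),
                // ...but CREATE, DELETE, WRITE and ADMIN are reserved for the authenticated creator.
                new ACL(ZooDefs.Perms.ALL, new Id("auth", "")));
        zk.create(path, data, acl, CreateMode.PERSISTENT);
    }
}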

Typical ZooKeeper application scenarios

ZooKeeper is a highly available framework for distributed data management and coordination. Based on its implementation of the ZAB protocol, it guarantees data consistency in a distributed environment, and it is this property that makes ZooKeeper a powerful tool for solving distributed consistency problems.

Data publish/subscribe (configuration center)

As the name implies, publishers publish data to ZooKeeper nodes for subscribers to consume. In this way data can be obtained dynamically, and centralized management and dynamic updating of configuration information can be achieved.

In our common application system development, we often encounter such requirements: the system needs to use some general configuration information, such as machine list information, database configuration information. These global configurations usually have the following three features.

  • The amount of data is usually small.
  • Data content changes dynamically at run time.
  • All machines in the cluster share the same configuration.

Such global configuration information can be published to ZooKeeper, and clients (clustered machines) can subscribe to the message.

Publish/subscribe systems generally have two design patterns, namely Push and Pull patterns.

  • Push: The server actively sends data updates to all subscribing clients.
  • Pull: The client initiates a request to obtain the latest data. Usually, the client uses scheduled polling pull mode.

ZooKeeper uses a combination of push and pull. As follows:

Once the data of the node is changed, the server will send the Watcher event notification to the corresponding client. After receiving the notification, the client needs to take the initiative to obtain the latest data from the server (push-pull combination).
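
A sketch of this push-pull cycle with the Java client is shown below: the client pulls the configuration value and registers a Watcher at the same time; when the server pushes a NodeDataChanged notification (which carries no data), the client pulls the new value and re-arms the watch. The class name and the configuration path are illustrative.

import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;

public class ConfigSubscriber {
    private final ZooKeeper zk;
    private final String path;   // e.g. a configuration node such as /configs/db.url

    ConfigSubscriber(ZooKeeper zk, String path) { this.zk = zk; this.path = path; }

    // Pull the current value and arm a watch so the server can push change notifications.
    String readAndWatch() throws Exception {
        byte[] data = zk.getData(path, event -> {
            if (event.getType() == EventType.NodeDataChanged) {
                try {
                    // The push notification carries no data, so pull the new value and re-arm the watch.
                    System.out.println("config updated: " + readAndWatch());
                } catch (Exception ignored) { }
            }
        }, null);
        return new String(data);
    }
}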

Naming Service

Naming services are another common scenario in distributed systems. In a distributed system, a naming service lets a client application obtain the address, provider, and other information of a resource or service by its name. Named entities can be machines in a cluster, services provided, remote objects, and so on, all of which we can collectively call names. A common example is the list of service addresses in distributed service frameworks such as RPC and RMI. By creating sequential nodes in ZooKeeper, it is easy to create a globally unique path, and that path can be used as a name.

A ZooKeeper-based naming service can also be used to generate globally unique IDs.
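
For example, a globally unique ID can be generated with a persistent sequential node, as in the sketch below; it assumes a connected client and an existing parent node /naming, both of which are illustrative.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class IdGenerator {
    // Each call creates /naming/id-NNNNNNNNNN; the numeric suffix maintained by the
    // parent node is unique across the whole cluster and can serve as a global ID.
    static long nextId(ZooKeeper zk) throws Exception {
        String created = zk.create("/naming/id-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        return Long.parseLong(created.substring(created.lastIndexOf('-') + 1));
    }
}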

Distributed coordination/notification

ZooKeeper has a unique Watcher registration and asynchronous notification mechanism, which can effectively implement notification and coordination between different machines, and even different systems, in a distributed environment, enabling real-time processing of data changes. In this pattern, different clients register Watchers on the same ZNode and listen for changes to that ZNode (including its own content and its children); when the ZNode changes, all subscribed clients receive the corresponding Watcher notification and act accordingly.

ZK distributed coordination/notification is a universal communication method between machines in distributed systems.

Heartbeat detection

In a distributed environment, different machines (or processes) need to check whether each other is running properly. For example, machine A needs to know whether machine B is running normally. In traditional development, we usually check whether the hosts can PING each other directly, or, in a more elaborate setup, establish a long-lived connection between the machines and rely on TCP's built-in heartbeat mechanism to detect the health of the machine at the other end. These are very common heartbeat detection methods.

Let’s see how ZK can be used to implement heartbeat detection between distributed machines (processes).

Based on ZooKeeper's temporary node feature, different processes can create temporary child nodes under a designated ZooKeeper node, and each process can determine whether a corresponding process is alive simply by checking whether its temporary child node still exists. In this way, the detecting and the detected systems do not need to be associated directly; they are connected only through a node on ZooKeeper, which greatly reduces system coupling.
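
A sketch of this pattern with the Java client: every monitored process registers an ephemeral child under a shared parent such as /cluster/alive (an illustrative path that is assumed to exist), and the monitoring side lists the children with a child watch so that it is notified whenever a process joins or dies.

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class HeartbeatDemo {
    // Called once by every monitored process: the node vanishes automatically if its session dies.
    static void register(ZooKeeper zk, String name) throws Exception {
        zk.create("/cluster/alive/" + name, new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    // Called by the monitoring side: lists live processes and re-arms the child watch each time.
    static List<String> watchAlive(ZooKeeper zk) throws Exception {
        return zk.getChildren("/cluster/alive", event -> {
            if (event.getType() == EventType.NodeChildrenChanged) {
                try {
                    List<String> alive = watchAlive(zk);   // some process joined or died
                    System.out.println("currently alive: " + alive);
                } catch (Exception ignored) { }
            }
        });
    }
}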

Work progress report

In a common task distribution system, after tasks are distributed to different machines, their execution progress needs to be reported back to the distribution system in real time. This can be done through ZooKeeper: select a node on ZooKeeper and have each task client create a temporary child node under it, which enables two functions (see the sketch after this list):

  • Determine whether the task machine is alive by determining whether the temporary node exists.
  • Each task machine will write its task execution progress to this temporary node in real time, so that the central system can obtain the task execution progress in real time.
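
A minimal sketch of such a task client is shown below; the parent path /tasks/progress and the class name are illustrative, and the parent node is assumed to exist. The node's existence covers the first function, and setData covers the second.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ProgressReporter {
    private final ZooKeeper zk;
    private final String node;   // e.g. /tasks/progress/worker-01

    ProgressReporter(ZooKeeper zk, String workerName) throws Exception {
        this.zk = zk;
        // Ephemeral node: its mere existence tells the central system the worker is alive.
        this.node = zk.create("/tasks/progress/" + workerName, "0".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    // The worker overwrites the node content as the task advances; the central system
    // reads the children of /tasks/progress to obtain every worker's progress in real time.
    void report(int percent) throws Exception {
        zk.setData(node, String.valueOf(percent).getBytes(), -1);   // -1 skips the version check
    }
}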

Master election

The Master election is the most typical application scenario of ZooKeeper. For example, Active NameNode election in HDFS, Active ResourceManager election in YARN, and Active HMaster election in HBase.

In general, we could rely on the primary key constraint of a common relational database: each machine that wants to become Master inserts a record with the same primary key ID into a table, and the database enforces primary key uniqueness for us, so only one machine's insert succeeds. We then consider the client whose insert succeeded to be the Master.

Relying on the primary key feature of a relational database does a good job of ensuring that a single Master is elected in the cluster. But what if the currently elected Master dies? Who’s going to tell me that Master died? Obviously, the relational database cannot notify us of this event. But ZooKeeper can do it!

Thanks to ZooKeeper's strong consistency, node creation is guaranteed to be globally unique even under high distributed concurrency; that is, ZooKeeper ensures that a client cannot create a ZNode that already exists. So if multiple clients request to create the same temporary node at the same time, only one of them succeeds. With this feature, it is easy to perform Master elections in a distributed environment.

The machine on which the client successfully created the node becomes the Master. Meanwhile, any client that fails to create this node will register a Watcher of child node changes on this node to monitor whether the current Master machine is alive. If the current Master machine is found to have died, the other clients will re-elect the Master.

This implements the dynamic election of the Master.
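
The sketch below illustrates the election just described with the Java client; the path /election/master is illustrative and its parent is assumed to exist. Instead of watching for child-node changes, this variant watches the Master node itself and retries the election when it is deleted, which is one common way to implement the re-election step.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class MasterElection {
    // Returns true if this client became Master; otherwise it leaves a watch behind
    // and runs again when the current Master's ephemeral node disappears.
    static boolean tryToBecomeMaster(ZooKeeper zk, String myId) throws Exception {
        try {
            zk.create("/election/master", myId.getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;                                   // only one create call can succeed
        } catch (KeeperException.NodeExistsException e) {
            // Lost the election: watch the winner's node and re-elect once it is gone.
            zk.exists("/election/master", event -> {
                if (event.getType() == EventType.NodeDeleted) {
                    try {
                        tryToBecomeMaster(zk, myId);
                    } catch (Exception ignored) { }
                }
            });
            return false;
        }
    }
}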

Distributed locks

Distributed locking is a way to control synchronous access to shared resources between distributed systems.

Distributed locks are divided into exclusive locks and shared locks.

Exclusive lock

Exclusive locks are also known as write locks or exclusive (X) locks.

If transaction T1 holds an exclusive lock on data object O1, then only transaction T1 is allowed to read and update O1 during the entire lock period. No other transaction is allowed to perform any type of operation on this data object (no further lock on this object) until T1 releases the exclusive lock.

As you can see, the core of exclusive locking is how to ensure that only one transaction currently has the lock, and that all transactions waiting to acquire the lock can be notified after the lock is released.

How can ZooKeeper be used to implement an exclusive lock?

Define the lock

A ZNode on ZooKeeper can represent a lock. For example, the /exclusive_lock/lock node can be defined as a lock.

Acquire the lock

As mentioned above, a ZNode on ZooKeeper is regarded as a lock, and the lock is obtained by creating a ZNode. All clients go to the /exclusive_lock node to create a temporary child node /exclusive_lock/lock. ZooKeeper ensures that only one client can be successfully created, and the client is considered to have acquired the lock. At the same time, all clients that have not obtained the lock need to register a Watcher on the /exclusive_lock node to listen for changes on the lock node in real time.

Release the lock

Because /exclusive_lock/lock is a temporary node, it is possible to release the lock in either of the following cases.

  • If the client machine that currently holds the lock goes down or restarts, the temporary node is deleted and the lock is released.
  • After the service logic is executed, the client automatically deletes the created temporary node to release the lock.

In any case, ZooKeeper notifies all clients that have registered node changes on the /exclusive_lock node that Watcher listens for. Upon receiving the notification, these clients re-initiate distributed lock acquisition, i.e., repeat the lock acquisition process.
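
Putting the three steps together, the sketch below acquires and releases such an exclusive lock with the Java client; it assumes the parent node /exclusive_lock already exists. A waiting client blocks on a latch until it is notified that the lock node has been deleted, then tries to create it again.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ExclusiveLock {
    private static final String LOCK = "/exclusive_lock/lock";

    // Blocks until this client owns the lock.
    static void acquire(ZooKeeper zk) throws Exception {
        while (true) {
            try {
                zk.create(LOCK, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return;                                    // node created: lock acquired
            } catch (KeeperException.NodeExistsException e) {
                CountDownLatch released = new CountDownLatch(1);
                // Watch the lock node; a NodeDeleted event means the holder released it (or crashed).
                Stat holder = zk.exists(LOCK, event -> {
                    if (event.getType() == EventType.NodeDeleted) {
                        released.countDown();
                    }
                });
                if (holder != null) {
                    released.await();                      // wait for the release notification
                }
                // Loop and try to create the lock node again.
            }
        }
    }

    // Deleting the ephemeral node releases the lock; version -1 skips the version check.
    static void release(ZooKeeper zk) throws Exception {
        zk.delete(LOCK, -1);
    }
}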

Shared locks

Shared locks are also known as read locks. If transaction T1 holds a shared lock on data object O1, then T1 can only read O1. Other transactions can also hold shared locks (but not exclusive locks) on O1 at the same time, until all shared locks on O1 are released.

Summary: It is possible for multiple transactions to acquire a shared lock on an object at the same time.

ZooKeeper application in large-scale distributed systems

The typical application scenarios of ZooKeeper have been described above. This section uses the common big data products Hadoop and HBase as examples to describe how ZooKeeper is applied in them, to help you better understand ZooKeeper's distributed application scenarios.

ZooKeeper application in Hadoop

In Hadoop, ZooKeeper is used to implement High Availability (HA), including the HA of the HDFS NameNode and of the YARN ResourceManager. In addition, ZooKeeper is used in YARN to store the running state of applications. The HDFS NameNode and the YARN ResourceManager implement HA with ZooKeeper in the same way, so this section uses YARN as an example.

YARN consists mainly of ResourceManager (RM), NodeManager (NM), ApplicationMaster (AM), and Container, of which the ResourceManager is the core.

The ResourceManager is responsible for the unified management and allocation of all resources in the cluster. It receives resource reports from the NodeManagers and allocates these resources to the various applications (their ApplicationMasters) according to certain policies. Internally it maintains the ApplicationMaster information, NodeManager information, and resource usage information of each application.

To implement HA, multiple ResourceManagers must coexist: only one ResourceManager is in the Active state and the others are in Standby. If the Active node stops working properly (for example, the machine breaks down or restarts), a new Active node is chosen from the Standby nodes through a competitive election.

Active/standby switchover

The following describes how active/standby switchover is implemented among multiple ResourceManagers.

  1. Create the lock node. ZooKeeper has a lock node /yarn-leader-election/appcluster-yarn. When they start, all ResourceManagers compete to create a temporary child node under it: /yarn-leader-election/appcluster-yarn/ActiveBreadCrumb. ZooKeeper ensures that only one ResourceManager can create it successfully; the one that succeeds switches to the Active state, and the ones that fail switch to the Standby state.

    [zk: localhost:2181(CONNECTED) 16] get /yarn-leader-election/appcluster-yarn/ActiveBreadCrumb
    appcluster-yarnrm2
    cZxid = 0x1b00133dc0
    ctime = Tue Jan 03 15:44:42 CST 2017
    mZxid = 0x1f00000540
    mtime = Sat Jan 07 00:50:20 CST 2017
    pZxid = 0x1b00133dc0
    cversion = 0
    dataVersion = 28
    aclVersion = 0
    ephemeralOwner = 0x0
    dataLength = 22
    numChildren = 0

You can see that ResourceManager2 in the cluster is Active.

  2. Register a Watcher. All ResourceManagers in the Standby state register a Watcher for node changes on the /yarn-leader-election/appcluster-yarn/ActiveBreadCrumb node. Thanks to the temporary node feature, they can quickly sense the running status of the Active ResourceManager.

  3. Active/standby switchover. If the Active ResourceManager breaks down or restarts, its client session with ZooKeeper becomes invalid, so the /yarn-leader-election/appcluster-yarn/ActiveBreadCrumb node is deleted. All the other ResourceManagers in the Standby state then receive the Watcher event notification from the ZooKeeper server and repeat step 1.

In this way, ZooKeeper implements active/standby switchover between ResourceManagers, and thus ResourceManager HA.

The HA of the NameNode in HDFS works in the same way as the HA of the ResourceManager in YARN; its lock node is /hadoop-ha/mycluster/ActiveBreadCrumb.

ResourceManager status storage

In ResourceManager, RMStateStore stores some RM internal status information. This includes applications and their Attempts Information, Delegation Token, and Version Information. Note that most of the state information in RMStateStore does not need to be persisted because it can be easily reconstructed from context information, such as resource usage. In the storage design, three possible implementations are provided, as follows.

  • Memory-based implementations are typically used for daily development testing.
  • File system-based implementation, such as HDFS.
  • Implemented based on ZooKeeper.

Since the volume of this state information is not very large, Hadoop officially recommends using ZooKeeper to store it. On ZooKeeper, the ResourceManager state information is stored under the root node /rmstore.

[zk: localhost:2181(CONNECTED) 28] ls /rmstore/ZKRMStateRoot
[RMAppRoot, AMRMTokenSecretManagerRoot, EpochNode, RMDTSecretManagerRoot, RMVersionNode]

The RMAppRoot node stores information related to each Application, and RMDTSecretManagerRoot stores security-related information such as tokens. During initialization, the Active ResourceManager reads this state information from ZooKeeper and processes it accordingly.

Summary:

ZooKeeper is used in Hadoop as follows:

  1. HA of the NameNode in HDFS and of the ResourceManager in YARN.
  2. Storage of ResourceManager state information (RMStateStore).

ZooKeeper application in HBase

ZooKeeper is used in HBase to implement HMaster election, active/standby switchover, system fault tolerance, RootRegion management, Region status management, and distributed SplitWAL task management.

HMaster election and active/standby switchover

The principle of HMaster election and active/standby switchover is the same as the HA principle of NameNode in HDFS and ResourceManager in YARN.

System fault tolerance

When HBase starts, each RegionServer creates an information node (called the RS status node) under the /hbase/rs node of ZooKeeper, for example /hbase/rs/[Hostname]. The HMaster registers a listener on this node. When a RegionServer goes down, ZooKeeper deletes the RS status node corresponding to that RegionServer because no heartbeat is received for a period of time (that is, the session expires). At the same time, the HMaster receives the NodeDelete notification from ZooKeeper, senses that a node has gone offline, and immediately starts its fault-tolerance work.

Why not let the HMaster monitor the RegionServers directly? If the HMaster managed RegionServer status through its own heartbeat mechanism, the management burden on the HMaster would grow with the cluster. Moreover, the HMaster itself may fail, so the state information needs to be persisted elsewhere. In this case, ZooKeeper is the ideal choice.

RootRegion management

For HBase clusters, data store locations are recorded in metadata regions, that is, RootRegion. Each time a client sends a new request and needs to know the location of data, it queries the RootRegion. The RootRegion location is recorded on ZooKeeper (by default, the location is recorded on the /hbase/meta-region-server node of ZooKeeper). When RootRegion changes, such as manual Region movement, load balancing, or RootRegion server failure, ZooKeeper detects the changes and takes measures for disaster recovery. This ensures that the client always gets the correct RootRegion information.

Region management

Regions in HBase are frequently changed due to system faults, load balancing, configuration modification, and Region splitting or merging. Once a Region is moved, it goes offline and online again.

Data cannot be accessed during the offline period, and the state change of a Region must be known globally; otherwise, transactional exceptions may occur. For large HBase clusters, the number of regions may be as many as 100,000 or more. ZooKeeper is also a good choice to manage Region status.

Distributed SplitWAL task management

When a RegionServer goes down, some newly written data has not yet been persisted to HFiles, so when migrating that RegionServer's service, an important task is to recover the in-memory data from the WAL. SplitWAL is the key step in this process: the HMaster traverses the RegionServer's WAL, splits it by Region and moves it to a new location, and the logs are then replayed.

A single RegionServer can have a large amount of logs (thousands of Regions and gigabytes of logs), and users expect the system to recover quickly, so a feasible solution is to spread the WAL processing work across several RegionServers. This in turn requires a persistent component to help the HMaster distribute the tasks. Currently, the HMaster creates a SplitWAL node on ZooKeeper (by default /hbase/SplitWAL) and stores on it, in the form of a list, information such as which RegionServer processes which Region. Each RegionServer fetches its tasks from this node and updates the node information after a task succeeds or fails, so that the HMaster is notified and can proceed to the next step. Here ZooKeeper plays the roles of mutual notification and information persistence in the distributed cluster.

Summary:

These are typical scenarios in which ZooKeeper is used to implement distributed coordination in HBase. In fact, HBase relies on ZooKeeper even more broadly: for example, the HMaster relies on ZooKeeper to record the enable/disable status of tables, and almost all of HBase's metadata is stored on ZooKeeper.

Thanks to ZooKeeper's excellent distributed coordination capability and notification mechanism, more and more ZooKeeper application scenarios have been added to HBase as it has evolved, and the coupling between HBase and ZooKeeper keeps growing. All ZooKeeper operations in HBase are encapsulated in the org.apache.hadoop.hbase.zookeeper package; interested readers can study it on their own.
