1. System model

1.1. Data Model

Zookeeper's view structure is a tree. Each node in the tree is called a data node (ZNode). Each ZNode can both store data and have child nodes mounted under it. The root node of the tree is "/".

1.2. Node type

In Zookeeper, each data node has a life cycle, which depends on its node type. Zookeeper provides the following node types:

| Node type | Description |
| --- | --- |
| PERSISTENT | Once created, the node remains on the Zookeeper server until it is explicitly deleted. |
| PERSISTENT_SEQUENTIAL | Same as a persistent node, with an added sequential property. Each parent node maintains a counter for its first-level children that records the order in which they are created. When you create a child node with this flag, Zookeeper automatically appends a numeric suffix to the given node name to form the complete node name. Note that the suffix is capped at the maximum value of a signed integer. |
| EPHEMERAL | The node's life cycle is tied to the client session: if the session ends, the node is automatically removed. In addition, Zookeeper stipulates that ephemeral nodes cannot have children; they can only be leaf nodes. |
| EPHEMERAL_SEQUENTIAL | Same as an ephemeral node, with an added sequential property. |
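The suffixing behavior of sequential nodes can be sketched in a few lines. This is a minimal model of the naming rule (the server formats the parent's per-child counter as a 10-digit, zero-padded suffix); the path and prefix are made up for illustration:

```python
def sequential_name(requested_name: str, parent_counter: int) -> str:
    """Model how Zookeeper forms a sequential node name: the parent's
    per-child counter is appended as a 10-digit, zero-padded suffix."""
    return "%s%010d" % (requested_name, parent_counter)

# First and twelfth sequential children created under /app with prefix "lock-"
print(sequential_name("/app/lock-", 0))   # /app/lock-0000000000
print(sequential_name("/app/lock-", 11))  # /app/lock-0000000011
```

Because the counter only ever increases, comparing the suffixes of sibling nodes reveals their creation order, which is what recipes such as distributed locks rely on.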

1.3. Status information

In addition to its data content, each data node also stores some state information about the node itself.

| State property | Description |
| --- | --- |
| cZxid | Created ZXID: the transaction ID with which the data node was created. |
| ctime | Created time: the time at which the data node was created. |
| mZxid | Modified ZXID: the transaction ID of the node's last update. |
| mtime | Modified time: the time at which the data node was last updated. |
| pZxid | The transaction ID of the last change to this node's list of children. Note that pZxid changes only when the list of children changes; changes to the children's content do not affect it. |
| cversion | The version number of the child node list. |
| dataVersion | The version number of the node's data. |
| aclVersion | The version number of the node's ACL. |
| ephemeralOwner | The session ID of the session that created the node, if it is ephemeral. For a persistent node this is 0. |
| dataLength | The length of the node's data content. |
| numChildren | The number of children of the node. |

1.4. ZXID

In Zookeeper, a transaction is an operation that changes the state of the Zookeeper server, also called a transaction operation or update operation; this includes creating and deleting data nodes, updating a node's data, and creating and expiring client sessions. For each transaction request, Zookeeper assigns a globally unique transaction ID, called a ZXID, which is a 64-bit number. Each ZXID corresponds to one update operation, and from the ZXIDs you can infer the global order in which Zookeeper processed the update requests.

A ZXID is a 64-bit number. The lower 32 bits are a simple monotonically increasing counter: the Leader server increments it by 1 each time it generates a new transaction Proposal for a client request. The higher 32 bits hold the Leader epoch number. Whenever a new Leader is elected, it takes the ZXID of the largest transaction Proposal in its local log, parses the epoch value out of it, and adds 1; that number becomes the new epoch, and the lower 32 bits are reset to 0 when generating new ZXIDs.
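The epoch/counter layout can be sketched as bit arithmetic. This is a model of the numbering scheme, not server code:

```python
def zxid_parts(zxid: int) -> tuple[int, int]:
    """Split a 64-bit ZXID into (epoch, counter): the high 32 bits are the
    Leader epoch, the low 32 bits are the per-epoch transaction counter."""
    return zxid >> 32, zxid & 0xFFFFFFFF

def first_zxid_of_next_epoch(last_zxid: int) -> int:
    """Model what a newly elected Leader does: take the epoch from the
    largest ZXID in its log, add 1, and reset the counter to 0."""
    epoch, _counter = zxid_parts(last_zxid)
    return (epoch + 1) << 32

last = (1 << 32) | 5                                  # epoch 1, counter 5
assert zxid_parts(last) == (1, 5)
assert zxid_parts(first_zxid_of_next_epoch(last)) == (2, 0)
```

This layout is why a ZXID from a later epoch always compares greater than any ZXID from an earlier epoch, even if the earlier epoch processed more transactions.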

1.5. Version

Zookeeper introduces the concept of a version for data nodes. Each data node carries three version numbers (described in the status information above), and any update to the node causes the corresponding version number to change. Take dataVersion as an example: right after a data node is created, its dataVersion is 0, meaning "this node has been updated 0 times since creation". If you now update the node's data, dataVersion becomes 1. In other words, dataVersion records the number of changes to the node's data content.

Versions are used to implement the "write check" of optimistic locking. For example, to modify a node's data you can pass a version number along with the request: if it matches the node's current version, the modification is applied; otherwise the modification fails.
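The write check amounts to a compare-and-set on dataVersion. The following is an in-memory sketch of the idea (the ZNodeModel class is invented for illustration; it is not a real client):

```python
class ZNodeModel:
    """In-memory model of a ZNode's data + dataVersion, illustrating the
    version-based "write check" of optimistic locking."""
    def __init__(self, data: bytes):
        self.data = data
        self.data_version = 0  # 0 right after creation

    def set_data(self, new_data: bytes, expected_version: int) -> bool:
        # -1 means "skip the version check", mirroring the real API convention;
        # real Zookeeper raises BadVersionException instead of returning False
        if expected_version != -1 and expected_version != self.data_version:
            return False
        self.data = new_data
        self.data_version += 1
        return True

node = ZNodeModel(b"v0")
assert node.set_data(b"v1", expected_version=0)      # version matches: applied
assert node.data_version == 1
assert not node.set_data(b"v2", expected_version=0)  # stale version: rejected
```

Two clients racing to update the same node will thus each read the version, and only the first writer's expected version still matches; the loser must re-read and retry.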

1.6. Watcher

1.6.1. Overview

Zookeeper provides the publish/subscribe function for distributed data. A typical publish/subscribe model system defines a one-to-many subscription relationship that allows multiple subscribers to listen to a topic object at the same time and notify all subscribers when the topic object itself changes in state so that they can act accordingly. In Zookeeper, Watcher mechanism is introduced to implement this distributed notification function. Zookeeper allows clients to register a Watcher listener with the server. When the Watcher is triggered by some specified event on the server, an event notification is sent to the specified client to implement distributed notification.

The Watcher mechanism of Zookeeper involves three parts: the client thread, the client-side WatchManager, and the Zookeeper server. In terms of workflow, the client registers a Watcher with the Zookeeper server and stores the Watcher object in its own WatchManager. When the server triggers the Watcher event, it sends a notification to the client; the client thread then retrieves the Watcher object from the WatchManager and runs its callback logic.

1.6.2. Watcher features
  • One-shot: once a Watcher is triggered, Zookeeper removes it from the corresponding storage on both the server and the client. Developers should therefore remember that a Watcher must be re-registered if they want to keep receiving notifications.
  • Serial execution on the client: Watcher callbacks run serially and synchronously on the client. This preserves ordering, but developers must take care that the processing logic of one Watcher does not block the entire client callback queue.
  • Lightweight: the WatchedEvent is the smallest notification unit in Zookeeper's Watcher mechanism. It contains only three pieces of information: the notification state, the event type, and the node path. In other words, a Watcher notification only tells the client that an event occurred on a node; it does not carry the changed content itself.
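The one-shot behavior can be modeled in a few lines. This toy OneShotWatchManager (an invented name, not the real client class) removes every watcher for a path before firing it, so a second event on the same path notifies nobody until someone re-registers:

```python
from collections import defaultdict

class OneShotWatchManager:
    """Toy model of one-off Watcher semantics: a watcher fires at most
    once per registration and must be re-registered for further events."""
    def __init__(self):
        self._watchers = defaultdict(list)

    def register(self, path: str, callback) -> None:
        self._watchers[path].append(callback)

    def trigger(self, path: str, event_type: str) -> None:
        # pop() removes the watchers *before* calling them: one-shot
        for cb in self._watchers.pop(path, []):
            cb(event_type, path)

fired = []
wm = OneShotWatchManager()
wm.register("/node1", lambda ev, p: fired.append((ev, p)))
wm.trigger("/node1", "NodeDataChanged")
wm.trigger("/node1", "NodeDataChanged")  # no watcher left: nothing fires
assert fired == [("NodeDataChanged", "/node1")]
```

A typical client pattern is therefore to re-register the watch inside the callback itself, usually by calling the read API (getData, exists, getChildren) again with a new watch.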
1.6.3. Watcher interface design

Watcher is an interface; any class that implements it is a Watcher. The interface's nested Event class contains two enums: KeeperState and EventType.

  • Watcher notification state (KeeperState)

    KeeperState describes the notification type when the connection state between client and server changes. Its full path is org.apache.zookeeper.Watcher.Event.KeeperState, and it is an enum with the following values:

| Enum value | Description |
| --- | --- |
| SyncConnected | The client is connected to the server normally. |
| Disconnected | The client has been disconnected from the server. |
| Expired | The session has expired. |
| AuthFailed | Authentication has failed. |
  • Watcher EventType (EventType)

    EventType describes the notification type when a data node (ZNode) changes. When an EventType event fires, KeeperState is always SyncConnected; when KeeperState changes, EventType is always None. Its full path is org.apache.zookeeper.Watcher.Event.EventType, and it is an enum with the following values:

| Enum value | Description |
| --- | --- |
| None | No node event occurred (used for connection-state notifications). |
| NodeCreated | The watched data node was created. |
| NodeDeleted | The watched data node was deleted. |
| NodeDataChanged | The watched data node's content was updated (fires on any update, even if the new value equals the old one, since the version still changes). |
| NodeChildrenChanged | The list of children of the watched data node changed. |

Note: the event notification received by the client contains only the state and type of the event, not the node's content before or after the change. If the application needs the pre-change data, it must cache it itself; the post-change data can be fetched with methods such as get.

1.6.4. Capture corresponding events

The sections above described the ZooKeeper client connection states and the ZNode event types that can be watched. In ZooKeeper, you register a watch on a ZNode with zk.getChildren(path, watch), zk.exists(path, watch), or zk.getData(path, watcher, stat).

The following table uses the Node-x node as an example to illustrate the relationship between the registration methods called and the listener events:

| Registration method | Created | ChildrenChanged | Changed | Deleted |
| --- | --- | --- | --- | --- |
| zk.exists("/node-x", watcher) | ✓ | | ✓ | ✓ |
| zk.getData("/node-x", watcher) | | | ✓ | ✓ |
| zk.getChildren("/node-x", watcher) | | ✓ | | ✓ |

1.7. ACL

Zookeeper provides a comprehensive Access Control List (ACL) permission Control mechanism to ensure data security.

1.7.1. Overview

An ACL consists of three parts: the permission scheme (Scheme), the authorization object (ID), and the permission (Permission), usually written as scheme:id:permission. Each part is introduced below:

  1. Permission Mode (Scheme)

    | Scheme | Description |
    | --- | --- |
    | world | Has a single ID, anyone, representing any user logged into Zookeeper (the default). |
    | ip | Authenticates clients by IP address. |
    | auth | Authenticates using users that have already been added with addauth. |
    | digest | Authenticates with username:password. |
  2. Authorization Object (ID)

    The authorization object ID is the entity to which the permission is granted, for example, an IP address or a user.

  3. Permission

    | Permission | Shorthand | Description |
    | --- | --- | --- |
    | create | c | Can create child nodes. |
    | delete | d | Can delete child nodes (direct children only). |
    | read | r | Can read the node's data and list its children. |
    | write | w | Can update the node's data. |
    | admin | a | Can set the node's ACL. |
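Internally each permission is a bit flag, and a shorthand string such as cdrwa is just the union of those bits. A small sketch, with the bit values taken from `org.apache.zookeeper.ZooDefs.Perms`:

```python
# Bit values match org.apache.zookeeper.ZooDefs.Perms
PERMS = {"r": 1, "w": 2, "c": 4, "d": 8, "a": 16}

def acl_bits(shorthand: str) -> int:
    """Convert a shorthand string like 'cdrwa' into the permission bitmask."""
    return sum(PERMS[ch] for ch in shorthand)

assert acl_bits("cdrwa") == 31  # all five permissions
assert acl_bits("r") == 1       # read only
assert acl_bits("rw") == 3      # read + write
```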
1.7.2. Features
  • ZooKeeper permission control is based on each ZNode. You need to set the permission for each node.
  • Each ZNode supports multiple permission control schemes and multiple permissions.
  • Child nodes do not inherit their parent's permissions: a client that cannot access a node may still be able to access its children.
1.7.3. Examples
  • world authorization mode

    Command:

    setAcl <path> world:anyone:<acl>

    Example:

    [zk: localhost:2181(CONNECTED) 0] create /node1 "node1"
    Created /node1
    [zk: localhost:2181(CONNECTED) 1] getAcl /node1
    'world,'anyone
    : cdrwa
    [zk: localhost:2181(CONNECTED) 2] setAcl /node1 world:anyone:crwa
    cZxid = 0x100000004
    ctime = Fri May 29 14:31:54 CST 2020
    mZxid = 0x100000004
    mtime = Fri May 29 14:31:54 CST 2020
    pZxid = 0x100000004
    cversion = 0
    dataVersion = 0
    aclVersion = 1
    ephemeralOwner = 0x0
    dataLength = 5
    numChildren = 0
  • ip authorization mode

    Command:

    setAcl <path> ip:<ip>:<acl>

    Example:

    Note: use ./zkCli.sh -server <ip> to log in to ZooKeeper remotely.

    [zk: localhost:2181(CONNECTED) 18] create /node2 "node2"
    Created /node2
    [zk: localhost:2181(CONNECTED) 23] setAcl /node2 ip:192.168.150.101:cdrwa
    cZxid = 0xe
    ctime = Fri Dec 13 22:30:29 CST 2019
    mZxid = 0x10
    mtime = Fri Dec 13 22:33:36 CST 2019
    pZxid = 0xe
    cversion = 0
    dataVersion = 2
    aclVersion = 1
    ephemeralOwner = 0x0
    dataLength = 5
    numChildren = 0
    [zk: localhost:2181(CONNECTED) 25] getAcl /node2
    'ip,'192.168.150.101
    : cdrwa
    [zk: localhost:2181(CONNECTED) 0] get /node2
    Authentication is not valid : /node2    # no permission
  • auth authorization mode

    Command:

    setAcl <path> auth:<user>:<acl>

    Example:

    [zk: localhost:2181(CONNECTED) 6] create /node3 "node3"
    Created /node3
    
    #Adding an Authentication User
    [zk: localhost:2181(CONNECTED) 7] addauth digest ld:123456
    
    [zk: localhost:2181(CONNECTED) 8] setAcl /node3 auth:ld:cdrwa
    cZxid = 0x10000000c
    ctime = Fri May 29 14:47:13 CST 2020
    mZxid = 0x10000000c
    mtime = Fri May 29 14:47:13 CST 2020
    pZxid = 0x10000000c
    cversion = 0
    dataVersion = 0
    aclVersion = 1
    ephemeralOwner = 0x0
    dataLength = 5
    numChildren = 0
    
    [zk: localhost:2181(CONNECTED) 9] getAcl /node3
    'digest,'ld:kesl2p6Yx58a+/mP+TKSFZkzkZ0=
    : cdrwa
    
    #After an authentication user is added, the user can access it
    [zk: localhost:2181(CONNECTED) 10] get /node3
    node3
    cZxid = 0x10000000c
    ctime = Fri May 29 14:47:13 CST 2020
    mZxid = 0x10000000c
    mtime = Fri May 29 14:47:13 CST 2020
    pZxid = 0x10000000c
    cversion = 0
    dataVersion = 0
    aclVersion = 1
    ephemeralOwner = 0x0
    dataLength = 5
    numChildren = 0
  • digest authorization mode

    Command:

    setAcl <path> digest:<user>:<password>:<acl>

    The password is a ciphertext produced by SHA-1 hashing followed by BASE64 encoding. You can compute it in a shell with:

    echo -n <user>:<password> | openssl dgst -binary -sha1 | openssl base64

    For example, to compute the ciphertext for monkey:123456:

    echo -n monkey:123456 | openssl dgst -binary -sha1 | openssl base64
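As a cross-check of the openssl pipeline above, the same ciphertext can be computed in a few lines of Python (a sketch, reproducing BASE64(SHA1(user:password)); the zk_digest function name is made up here):

```python
import base64
import hashlib

def zk_digest(user: str, password: str) -> str:
    """Compute the digest-scheme ciphertext: BASE64(SHA1(user:password)),
    equivalent to the openssl pipeline shown above."""
    sha1 = hashlib.sha1(f"{user}:{password}".encode()).digest()
    return base64.b64encode(sha1).decode()

# Matches the ciphertexts used in the examples in this section
print(zk_digest("monkey", "123456"))  # Rk6u/zJJdOYrTZ6+J0p4/4gTILg=
print(zk_digest("super", "admin"))    # xQJmxLMiHGwaqBvst5y6rkB6HQs=
```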

    Example:

    [zk: localhost:2181(CONNECTED) 12] create /node4 "node4"
    Created /node4
    [zk: localhost:2181(CONNECTED) 13] setAcl /node4 digest:monkey:Rk6u/zJJdOYrTZ6+J0p4/4gTILg=:cdrwa
    cZxid = 0x10000000e
    ctime = Fri May 29 14:52:50 CST 2020
    mZxid = 0x10000000e
    mtime = Fri May 29 14:52:50 CST 2020
    pZxid = 0x10000000e
    cversion = 0
    dataVersion = 0
    aclVersion = 1
    ephemeralOwner = 0x0
    dataLength = 5
    numChildren = 0
    #Cannot read without permission
    [zk: localhost:2181(CONNECTED) 14] getAcl /node4
    Authentication is not valid : /node4
    
    #Adding an Authentication User
    [zk: localhost:2181(CONNECTED) 15] addauth digest monkey:123456
    
    [zk: localhost:2181(CONNECTED) 16] getAcl /node4               
    'digest,'monkey:Rk6u/zJJdOYrTZ6+J0p4/4gTILg=
    : cdrwa
    
    [zk: localhost:2181(CONNECTED) 17] get /node4
    node4
    cZxid = 0x10000000e
    ctime = Fri May 29 14:52:50 CST 2020
    mZxid = 0x10000000e
    mtime = Fri May 29 14:52:50 CST 2020
    pZxid = 0x10000000e
    cversion = 0
    dataVersion = 0
    aclVersion = 1
    ephemeralOwner = 0x0
    dataLength = 5
    numChildren = 0
  • Multi-mode authorization

    The same node can be authorized with multiple schemes at the same time:

    [zk: localhost:2181(CONNECTED) 18] create /node5 "node5"
    Created /node5
    [zk: localhost:2181(CONNECTED) 19] addauth digest ld:123456
    [zk: localhost:2181(CONNECTED) 20] setAcl /node5 ip:192.168.150.101:cdrwa,auth:ld:cdrwa
    cZxid = 0x100000010
    ctime = Fri May 29 14:56:38 CST 2020
    mZxid = 0x100000010
    mtime = Fri May 29 14:56:38 CST 2020
    pZxid = 0x100000010
    cversion = 0
    dataVersion = 0
    aclVersion = 1
    ephemeralOwner = 0x0
    dataLength = 5
    numChildren = 0
1.7.4. ACL super administrator

Zookeeper has a permission scheme called super, which gives a super administrator convenient access to any node regardless of its permissions.

Assume the super administrator account is super:admin. Generate its ciphertext:

echo -n super:admin | openssl dgst -binary -sha1 | openssl base64

Open the bin/zkServer.sh script in the ZooKeeper directory and find the following line:

nohup $JAVA "-Dzookeeper.log.dir=${ZOO_LOG_DIR}" "-Dzookeeper.root.logger=${ZOO_LOG4J_PROP}"

This is the command the script uses to start ZooKeeper. By default it has only the two configuration items above; we need to add a super administrator configuration item:

"-Dzookeeper.DigestAuthenticationProvider.superDigest=super:xQJmxLMiHGwaqBvst5y6rkB6HQs="

So the complete command is changed to

nohup $JAVA "-Dzookeeper.log.dir=${ZOO_LOG_DIR}" "-Dzookeeper.root.logger=${ZOO_LOG4J_PROP}" "-Dzookeeper.DigestAuthenticationProvider.superDigest=super:xQJmxLMiHGwaqBvst5y6rkB6HQs="\
    -cp "$CLASSPATH" $JVMFLAGS $ZOOMAIN "$ZOOCFG" > "$_ZOO_DAEMON_OUT" 2>&1 < /dev/null &

Then start ZooKeeper and run the following command to authenticate as the super administrator:

addauth digest super:admin    # authenticate as the super administrator

2. Leader election

2.1. Server status

  • Looking: searching for a Leader. A server in this state believes the cluster currently has no Leader and must enter the Leader election process.
  • Leading: the current server's role is Leader.
  • Following: the current server's role is Follower.
  • Observing: the current server's role is Observer.

2.2 Leader election during server startup

Consider a newly started cluster of three machines. When the first server, server1, starts, it cannot complete a Leader election on its own. When the second machine, server2, also starts, the two servers can communicate with each other; each tries to find a Leader and enters the Leader election process.

  1. Each server issues a vote. Since this is the initial startup, server1 and server2 each vote for themselves as Leader. Each vote contains the myid and zxid of the proposed server, written as (myid, zxid). At this point server1's vote is (1, 0) and server2's vote is (2, 0), and each sends its vote to the other machines in the cluster.

  2. Each server in the cluster receives votes from the various servers in the cluster.

  3. Process the votes. For each incoming vote, the server compares ("PKs") it against its own vote. The PK rules are as follows:

    • Compare the zxids first: the vote with the larger zxid wins, and its server is preferred as Leader.
    • If the zxids are equal, compare the myids: the vote with the larger myid wins.

    For server1, its own vote is (1, 0) and the vote it receives is (2, 0). The zxids are both 0, so the myids are compared; server2's myid is larger, so server1 updates its vote to (2, 0) and broadcasts it again. Server2 does not need to update its vote; it simply sends its previous vote to all the machines in the cluster once more.

  4. Count the votes. After each round, the server tallies the voting information to determine whether more than half of the machines have accepted the same vote. For server1 and server2, both machines have accepted the vote (2, 0), which is more than half of the three-machine cluster, so the Leader is considered elected.

  5. Change the server state. Once the leader is determined, each server updates its state, changing it to following if it is follower or leading if it is leader.
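The PK rule in step 3 reduces to a tuple comparison on (zxid, myid). A sketch of the rule (a model only, not the real FastLeaderElection code; votes are written as (myid, zxid) pairs as in the text):

```python
def pk(my_vote: tuple, received: tuple) -> tuple:
    """Election "PK" rule: prefer the vote with the larger zxid;
    on a zxid tie, prefer the larger myid."""
    my_id, my_zxid = my_vote
    other_id, other_zxid = received
    # Compare zxid first, then myid, exactly as the two rules state
    if (other_zxid, other_id) > (my_zxid, my_id):
        return received  # adopt the received vote
    return my_vote       # keep our current vote

# Startup example from the text: zxids are equal, so the larger myid wins
assert pk((1, 0), (2, 0)) == (2, 0)
# Runtime example below: server1's zxid 123 beats server3's zxid 122
assert pk((3, 122), (1, 123)) == (1, 123)
```

Preferring the larger zxid ensures the elected Leader holds the most up-to-date transaction history; myid is only a deterministic tie-breaker.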

2.3 Leader election during server operation

While ZooKeeper is running, the Leader and non-Leader servers each perform their roles. A non-Leader server crashing, or a new server joining, does not affect the Leader. However, once the Leader server goes down, the whole cluster suspends external service and enters a new round of Leader election, whose process is basically the same as the election during startup.

Assume that three servers, server1, server2, and server3, are running. The current leader is Server2. If the leader fails at a certain moment, the leader election starts. The election process is as follows:

  1. Change the status. After the Leader hangs, the remaining non-Observer servers change their server state to Looking and start the leader election process.
  2. Each server issues a vote. During operation, the zxid on each server may differ. Assume server1's zxid is 123 and server3's zxid is 122. In the first round of voting, server1 and server3 each vote for themselves, producing the votes (1, 123) and (3, 122), and each sends its vote to all the machines in the cluster.
  3. Receives votes from various servers.
  4. Process the vote. The processing of the vote is the same as the processing rules mentioned above during server startup. In this example, since Server1’s ZXID is 123 and Server3’s ZXID is 122, it is obvious that Server1 will be the Leader.
  5. Count the votes.
  6. Change the server state.

2.4 Observer Roles and Settings

Observer Role:

  1. Does not participate in the cluster leader election
  2. Does not participate in ack feedback when data is written to the cluster

To use the Observer role, add the following to the configuration file of any server you want to run as an Observer:

peerType=observer

In addition, append :observer to that server's entry in the server list of every server's configuration file, for example:

server.3=192.168.60.130:2289:3389:observer