Preface

In the last chapter, we mainly covered Zookeeper's two startup modes and how to set them up. This chapter focuses on the principles behind clusters. Chapter 1 can be regarded as the basics of Zookeeper principles, while this chapter is the advanced part, analyzing the read/write mechanism of a Zookeeper cluster and the ZAB protocol.

This chapter mainly covers the following points:

  • Zookeeper cluster architecture
  • Zookeeper read/write mechanism
  • ZAB protocol
  • Some additional discussions about the Zookeeper cluster
    • Zookeeper (read performance) scalability and Observer nodes
    • Zookeeper and CAP theory
    • Limitations of Zookeeper as a service registry

1. Zookeeper cluster architecture

Let’s talk about the cluster architecture of Zookeeper.

Roles in the Zookeeper cluster

As mentioned in Chapter 1, operations that change the state of a Zookeeper server are called transactions. These include creating and deleting data nodes, updating data content, and creating and invalidating client sessions.

  • Leader: The Leader node is responsible for initiating and deciding votes in the Zookeeper cluster (for transaction operations) and for updating system state. It can also receive and respond to requests sent by Clients.
  • Learner
    • Follower: Follower nodes receive and respond to Client requests. If a request is a transaction operation, the Follower forwards it to the Leader node, which initiates a vote, and the Follower participates in the cluster's internal voting.
    • Observer: Observer nodes have the same functions as Followers except that they do not participate in voting; they only synchronize state from the Leader node.
  • Client: the client that issues requests to the cluster.

Zookeeper uses replication to achieve high availability. In the replicated mode mentioned in the previous chapter, every change to Zookeeper's ZNode tree is replicated to the other Server nodes, with the Leader node as the authoritative source.

The above is only a conceptual overview; these roles will make more sense after we look at the read/write mechanism and the two modes of the ZAB protocol.

2. Read/write mechanism of Zookeeper

Read and write process

The following figure shows the process of a Zookeeper Server node providing read/write services in cluster mode.

As shown in the figure above, in addition to a request processor for handling requests, each Zookeeper Server node has a ReplicatedDatabase that holds its data. The ReplicatedDatabase contains the entire Data Tree.

Read requests from the Client are served directly by the local replica of the Server the client is connected to.

For write requests from the Client, Zookeeper must keep the local replica of every Server consistent (a single system image), which is handled by a consensus protocol (the ZAB protocol discussed later). Successfully processed write requests (data updates) are first serialized to the local disk of each Server node (for data recovery on restart) and then applied to the in-memory database.
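
To make this concrete, below is a minimal sketch using the official ZooKeeper Java client (the connection string and path are placeholders for illustration): create is a transaction that must go through the Leader and the ZAB protocol, while getData is answered from the local replica of whichever server the session is connected to.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ReadWriteDemo {
    public static void main(String[] args) throws Exception {
        // Connect to any server in the ensemble; the session lands on one of them.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> {});

        // Write (a transaction): routed through the Leader and broadcast
        // to the ensemble by ZAB before the call returns successfully.
        zk.create("/demo", "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.PERSISTENT);

        // Read: answered directly from the connected server's local replica,
        // with no cluster-wide coordination.
        Stat stat = new Stat();
        byte[] data = zk.getData("/demo", false, stat);
        System.out.println(new String(data) + " (version " + stat.getVersion() + ")");

        zk.close();
    }
}
```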

In cluster mode, Zookeeper uses a simple synchronization policy to ensure data consistency based on the following three basic requirements:

  • Globally serialize all write operations

    Serialization converts variables, including objects, into a contiguous sequence of bytes; the serialized data can be stored in a file or transferred over the network, and later deserialized to restore the original data. Note that in this requirement, "serialize" primarily means that all write operations are applied in one global order.

  • Ensure that instructions from the same client are executed in FIFO order (and message notifications are delivered in FIFO order)

FIFO – First in, first out

  • Custom atomic message protocol

    To put it simply, all write requests are forwarded to the Leader node for processing. The Leader votes on the update by sending a proposal message to the other nodes in the cluster. When more than half of the Followers have persisted the change, the Leader considers the write request successfully processed and commits the transaction.

Optimistic locking

The core idea of Zookeeper is to provide a wait-free core service for distributed system synchronization, without a locking mechanism. Its core file and data read/write services do not provide exclusive locks.

However, every Zookeeper update operation increments the Version of the ZNode (see Chapter 1 for details), which means that a client can implement locking logic based on version comparison when updating data. Take the following figure as an example.

For example, when we update a database row, we can add a version field and implement optimistic locking by comparing the version before and after the update.
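
As a minimal sketch (the path and the integer-counter encoding are made-up assumptions), this is how the version-based pattern looks with the ZooKeeper Java client: setData takes the version we read, and a BadVersionException tells us another client updated the node first.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class OptimisticUpdate {
    // Read-modify-write with a version check; retries on conflict.
    static void increment(ZooKeeper zk, String path) throws Exception {
        while (true) {
            Stat stat = new Stat();
            byte[] raw = zk.getData(path, false, stat);
            int counter = Integer.parseInt(new String(raw));
            byte[] next = Integer.toString(counter + 1).getBytes();
            try {
                // Succeeds only if the znode version is still the one we read.
                zk.setData(path, next, stat.getVersion());
                return;
            } catch (KeeperException.BadVersionException e) {
                // Concurrent update detected: re-read and try again.
            }
        }
    }
}
```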

3. ZAB protocol

Finally, the ZAB protocol. After this introduction you will have a deeper understanding of some of Zookeeper's features, and the rest of the article will be easier to follow.

The ZAB protocol is a crash-recovery consistency protocol designed specifically for the distributed coordination service ZooKeeper; it keeps the servers synchronized. Its full name is the Zookeeper Atomic Broadcast protocol.

Two modes

The ZAB protocol has two modes: recovery mode and broadcast mode.

Broadcast mode

Broadcast mode is similar to a two-phase commit in distributed transactions: since a Zookeeper write is treated as a transaction, it is essentially the same idea, except that ZAB only needs a majority of ACKs and has no abort phase.

In broadcast mode, a write request goes through the following steps (a simplified sketch in code follows the list):

  1. A ZooKeeper Server receives the write request from the Client.
  2. The write request is forwarded to the Leader node.
  3. The Leader node first persists the update locally.
  4. The Leader node proposes the update to the Followers and begins collecting votes.
  5. Each Follower receives the proposal, persists the change locally, and on success sends an ACK to the Leader.
  6. When the Leader has received ACKs from more than half of the nodes, it broadcasts a COMMIT message and delivers the update locally.
  7. When a Follower receives the COMMIT message, it also delivers the update.
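
The following sketch illustrates the quorum logic of steps 3 through 7. It is not ZooKeeper's actual implementation (real ZAB pipelines proposals asynchronously over the TCP channels); all class and method names here are invented for illustration.

```java
import java.util.List;

// Illustrative shape of the ZAB broadcast quorum logic; not real ZooKeeper code.
class LeaderSketch {
    private final List<FollowerConn> followers;
    private final int ensembleSize;

    LeaderSketch(List<FollowerConn> followers) {
        this.followers = followers;
        this.ensembleSize = followers.size() + 1; // followers plus the Leader itself
    }

    boolean broadcast(byte[] proposal) {
        persistLocally(proposal);                  // step 3: Leader logs first
        int acks = 1;                              // the Leader counts as one ack
        for (FollowerConn f : followers) {
            if (f.proposeAndAwaitAck(proposal)) {  // steps 4-5: propose, collect ACKs
                acks++;
            }
        }
        if (acks * 2 > ensembleSize) {             // step 6: strict majority required
            for (FollowerConn f : followers) {
                f.sendCommit();                    // step 7: Followers deliver on COMMIT
            }
            deliverLocally(proposal);              // step 6: Leader delivers locally
            return true;
        }
        return false;
    }

    private void persistLocally(byte[] p) { /* append to the transaction log */ }
    private void deliverLocally(byte[] p) { /* apply to the in-memory Data Tree */ }

    interface FollowerConn {
        boolean proposeAndAwaitAck(byte[] proposal);
        void sendCommit();
    }
}
```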

The broadcast protocol uses TCP FIFO channels for all communication; with such channels it is very easy to maintain ordering. Messages are delivered in order through the FIFO channel, so the order in which they were received is preserved when they are processed.

In this mode, if the Leader fails, the Zookeeper cluster cannot provide write services. This leads to the recovery mode described next.

Recovery mode

To put it simply, when the Leader in the cluster fails or the service starts up, ZAB enters recovery mode, which includes Leader election and state synchronization between the other servers and the new Leader.

NOTE: Leader election is the most important and most complex process in the ZAB protocol. It involves many concepts that would get in the way of this chapter, so I plan to cover it separately in the next chapter; read it as you see fit.

4. Some additional discussions about the Zookeeper cluster

1. Zookeeper (read performance) scalability and Observer nodes

The scalability of a cluster means that more nodes can be added to improve some aspect of performance. Zookeeper provides read and write services. Initially, Zookeeper added Follower nodes to improve read performance. However, given the read/write mechanism and the ZAB protocol described above, introducing new Follower nodes degrades the write service: a vote initiated by the Leader succeeds only when more than half of the Followers respond, so more Followers means more pressure on the voting phase of the protocol, which can slow down the overall voting response. As a result, the write throughput of the cluster decreases as the number of Followers increases.

To address this, Zookeeper introduced the Observer role into the cluster architecture in version 3.3.0. The only difference between Observers and Followers is that Observers do not participate in voting. This improves read performance without affecting write performance.
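
For reference, this is roughly how an Observer is declared per the official Observers guide (host names and ports here are placeholders): the Observer itself sets peerType=observer in its zoo.cfg, and its entry in every server's configuration is tagged with :observer.

```
# zoo.cfg on the Observer itself
peerType=observer

# In every server's configuration, tag the Observer's entry with :observer
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
server.4=zk4.example.com:2888:3888:observer
```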

In addition, Zookeeper's write performance cannot be scaled out, which is one of the reasons it is not a good fit as the registry at the center of service discovery: in service discovery and health monitoring scenarios, as the number of services grows, both the registration write requests from frequently launched services and the write requests that refresh service health status at millisecond intervals put heavy write pressure on Zookeeper, precisely because its write performance does not scale. This is explained in more detail in the article cited below.

2. Zookeeper and CAP theory

In the distributed domain there is the CAP theorem:

  • C: Consistency. All data changes are synchronized, so every node sees the same data.

  • A: Availability. The system has good response performance.

  • P: Partition tolerance. In practical terms, a partition is a time-bound requirement on communication: if the system cannot achieve data consistency within the time limit, a partition has occurred, and it must choose between C and A for the current operation. In other words, the system must keep operating despite arbitrary message loss.

The meaning of CAP Theorem — Ruan Yifeng

It has been proved that any distributed system can satisfy at most two of these three properties at once. It is therefore futile to try to design a perfect system that achieves all three; the trade-off should be made according to the application scenario.

From the read/write mechanism and the ZAB protocol, Zookeeper is essentially a distributed system biased towards CP, because the broadcast protocol sacrifices the system's response performance (availability). The following characteristics, listed at the end of Chapter 1, can also be derived from it.

(1) Sequential consistency: transaction requests initiated from the same client are applied to ZooKeeper in strict accordance with the order in which they were initiated.

(2) Atomicity: transaction requests are applied consistently on all machines in the cluster. That is, a transaction is either successfully applied on every machine in the cluster or not applied at all; there is no case where a transaction is applied on some machines and not on others.

(3) Single system image: no matter which ZooKeeper server a client connects to, it sees the same data model.

(4) Reliability: once a server has successfully applied a transaction and responded to the client, the server state changes caused by that transaction are retained until another transaction changes them.

3. Limitations of Zookeeper as a service registry

Rather than explaining it myself, I will directly quote an article from Alibaba's middleware team, which does a better job than I could. Note that in practice most companies never reach the microservice scale of the big players, and Zookeeper is fully capable of meeting their service registry needs.

Why doesn’t Alibaba use ZooKeeper for service discovery?

Conclusion

This chapter introduced the cluster architecture of Zookeeper, described its roles and components, explained Zookeeper's read/write mechanism and the core ZAB protocol, and finally discussed a few miscellaneous points.

This chapter carries quite a lot of information, and sorting it out gave me a deeper understanding of how a Zookeeper cluster works. It is best read in combination with the basic concepts from Chapter 1. I hope it helps you understand Zookeeper.

In the next chapter, I will detail the Zookeeper Leader Election process not covered in this chapter.

References

[1] ZooKeeper official documentation: zookeeper.apache.org/doc/r3.4.13…

[2] Why does Alibaba not use ZooKeeper for service discovery?

[3] www.cnblogs.com/sunddenly/p…

[4] The Meaning of CAP Theorem — Ruan Yifeng