This article focuses on the ZooKeeper split-brain problem and how it is handled. ZooKeeper is a service for coordinating (synchronizing) distributed processes. It provides a simple, high-performance coordination kernel on which users can build more complex distributed coordination functions. Split brain usually occurs in cluster environments such as Elasticsearch and ZooKeeper, because these clusters have a single “brain”: Elasticsearch has a Master node and ZooKeeper has a Leader node.


Why are ZooKeeper cluster nodes deployed in odd numbers?

ZooKeeper fault tolerance means that when some ZooKeeper node servers go down, the cluster can continue to be used as long as the number of surviving servers is greater than N/2, that is, more than half of the total of N nodes. For example, if at most two of five ZooKeeper node machines go down, the cluster can still be used, because the remaining three machines are more than 5/2. Why is it best to have an odd number of nodes? Because an odd number achieves the maximum fault tolerance with the fewest servers. For example, to tolerate two failures you need five servers with an odd count but six with an even count: with six ZooKeeper servers, still at most two can be down. Therefore, from the perspective of saving resources, it is unnecessary to deploy six (an even number of) ZooKeeper service nodes.

The ZooKeeper cluster has this property: the whole cluster is available only if more than half of its servers are running properly. In other words, with two ZooKeeper nodes, the service becomes unusable as soon as one node dies, because one node is not more than half of two, so the failure tolerance of two ZooKeeper nodes is 0. Similarly, with three ZooKeeper nodes, if one dies there are still two healthy ones, which is more than half, so the tolerance of three ZooKeeper nodes is 1. A few more examples: 2 -> 0, 3 -> 1, 4 -> 1, 5 -> 2, 6 -> 2. The pattern is that 2N and 2N-1 nodes have the same tolerance, namely N-1, so why add a ZooKeeper node that contributes nothing? Therefore, from the perspective of saving resources, it is best to deploy an odd number of nodes in a ZooKeeper cluster.
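This rule is easy to verify with a few lines of code. Below is a minimal Java sketch (the class and method names are made up for illustration) that computes the failure tolerance for different ensemble sizes:

```java
// Minimal sketch of the tolerance rule: a quorum needs strictly more than
// half of the ensemble, so n servers tolerate n - (n/2 + 1) failures.
public class QuorumTolerance {

    // Number of servers that may fail while a majority is still alive.
    static int tolerance(int ensembleSize) {
        int quorum = ensembleSize / 2 + 1;   // smallest strict majority
        return ensembleSize - quorum;
    }

    public static void main(String[] args) {
        for (int n = 2; n <= 6; n++) {
            System.out.println(n + " servers -> tolerance " + tolerance(n));
        }
        // Prints 2->0, 3->1, 4->1, 5->2, 6->2: an ensemble of 2N servers
        // tolerates the same number of failures as one of 2N-1 servers,
        // so the extra even node buys nothing.
    }
}
```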

Split-brain scenarios in a ZooKeeper cluster

To improve availability, a cluster is usually deployed across multiple machine rooms. For example, suppose a cluster of six zkServers is deployed across two machine rooms:

Normally there is only one Leader in the cluster. If the network between the machine rooms is disconnected, the zkServers in the two machine rooms can no longer communicate with each other. If the half (majority) mechanism is not considered, each machine room will elect its own Leader.

This is equivalent to splitting the original cluster into two clusters with two “brains”; this is the so-called “split-brain” phenomenon. In this case, what was supposed to be one cluster providing a unified service to the outside world has become two clusters serving at the same time. If the network connection suddenly comes back after a while, there is a problem: both clusters have been serving clients, so how should their data be merged, and how should data conflicts be resolved? The premise of the split-brain scenario just described is that the half mechanism is not considered; in fact, split brain does not easily occur in a ZooKeeper cluster, precisely because of the half mechanism.

ZooKeeper’s half (majority) mechanism: during Leader election, a zkServer becomes the Leader only if it obtains more than half of the votes. A simple example: if there are five zkServers in the cluster, then half = 5/2 = 2, which means that during Leader election at least three zkServers must vote for the same zkServer to satisfy the half mechanism and elect a Leader.

So why must the ZooKeeper election process include this half-mechanism check?

Because this way a Leader can be elected without waiting for all zkServers to vote for the same zkServer, which is relatively fast; that is also why it is called the fast Leader election algorithm.

Why does the ZooKeeper half mechanism use “greater than” rather than “greater than or equal to”? This is closely related to the split-brain problem. Go back to the scenario where split brain occurred (the six-server, two-machine-room deployment above): when the network between the machine rooms breaks, the three servers in machine room 1 try to elect a Leader, but the half-mechanism condition is “the number of nodes > 3”, meaning at least four zkServers are needed to elect a Leader. Machine room 1 therefore cannot elect a Leader, and by the same reasoning neither can machine room 2. In this case, when the network between the machine rooms is disconnected, the whole cluster has no Leader. If the condition were “the number of nodes >= 3” instead, a Leader would be elected in both machine room 1 and machine room 2, resulting in split brain. This explains why the half mechanism uses “greater than” rather than “greater than or equal to”: to prevent split brain.

Now suppose we have only five machines, also deployed across two machine rooms:

At this point the half-mechanism condition is “the number of nodes > 2”, that is, at least three servers are needed to elect a Leader. Now the disconnection of the network between the machine rooms has no impact on machine room 1: its Leader is still the Leader. Machine room 2, however, cannot elect a Leader, so there is still only one Leader in the cluster. The conclusion is that with the half mechanism, a ZooKeeper cluster either has no Leader or has exactly one Leader, which is how ZooKeeper avoids the split-brain problem.
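The two machine-room scenarios can be illustrated with a small Java sketch, assuming the strict “> n/2” rule; the class and method names are invented for the example:

```java
// A partition can elect (or keep) a Leader only if it holds a strict
// majority of the full ensemble.
public class PartitionDemo {

    static boolean canElectLeader(int partitionSize, int ensembleSize) {
        return partitionSize > ensembleSize / 2;   // strictly greater than half
    }

    public static void main(String[] args) {
        // 6 servers split 3/3: neither machine room has more than 3 nodes,
        // so no Leader can be elected anywhere.
        System.out.println(canElectLeader(3, 6)); // false for both rooms

        // 5 servers split 3/2: only the 3-node room keeps (or elects) a Leader.
        System.out.println(canElectLeader(3, 5)); // true  -> machine room 1
        System.out.println(canElectLeader(2, 5)); // false -> machine room 2

        // If the rule were ">=" instead of ">", a 3/3 split would produce a
        // Leader on both sides -- exactly the split-brain case described above.
    }
}
```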

“Split brain” in the ZooKeeper cluster

What is split brain?

To put it simply, split brain happens when, say, there are two nodes in a cluster and both know that a master must be elected for the cluster. When communication between them works, they reach a consensus and elect one of them as master. But if communication between them fails, each node decides that there is no master and elects itself as master, and the cluster ends up with two masters.

For ZooKeeper there is a very important question: by what criteria can we judge that a node has died? In a distributed system such decisions are made by a monitor, but it is hard for the monitor to know the true state of the other nodes. The only reasonably reliable way is the heartbeat, which is also what ZooKeeper uses to determine whether a client is still alive.

Using ZooKeeper for Leader HA follows basically the same pattern: each node tries to create an ephemeral znode that represents the leader, and the nodes that fail to create it become followers and place a watch on the ephemeral node created by the leader. ZooKeeper determines the leader’s status through its internal heartbeat (session) mechanism: once something happens to the Leader, ZooKeeper quickly learns about it, notifies the other followers, and the followers react accordingly, completing a switchover. This is a fairly common pattern and most implementations work this way. However, it has a serious problem: if you are not careful, the system can be split-brained for a short period of time. A heartbeat timeout may mean the Leader has really died, but it may also be caused by a network problem between the ZooKeeper nodes, leaving the Leader only apparently dead. ZooKeeper nevertheless considers the Leader dead and tells the other nodes to switch over, so one of the followers becomes the new Leader while the old Leader is still alive. The clients also receive the Leader-switch notification, but with some delay, because ZooKeeper has to communicate with and notify them one by one. At this point the whole system is in a confused state: if two clients need to update the same data on the Leader at the same time, and they happen to be connected to the old and the new Leader respectively, a serious problem arises.
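Below is a minimal Java sketch of this ephemeral-node pattern using the standard ZooKeeper client API. The znode path /leader, the node id, the session timeout and the error handling are simplifying assumptions for illustration, not a production-ready recipe:

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch of ephemeral-node leader election: whoever creates /leader is leader;
// everyone else watches the node and retries when it disappears.
public class LeaderElectionSketch implements Watcher {

    private final ZooKeeper zk;
    private final String nodeId;

    LeaderElectionSketch(String connectString, String nodeId) throws Exception {
        // The session timeout drives the heartbeat-based failure detection.
        this.zk = new ZooKeeper(connectString, 15000, this);
        this.nodeId = nodeId;
    }

    void tryToLead() throws KeeperException, InterruptedException {
        try {
            // Only one client can create the ephemeral node; it becomes Leader.
            zk.create("/leader", nodeId.getBytes(StandardCharsets.UTF_8),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println(nodeId + " is now the leader");
        } catch (KeeperException.NodeExistsException e) {
            // Someone else is leader: become a follower and watch the node.
            zk.exists("/leader", this);
            System.out.println(nodeId + " is a follower");
        }
    }

    @Override
    public void process(WatchedEvent event) {
        // When the leader's session dies, ZooKeeper deletes the ephemeral
        // node, the watch fires, and followers race to take over.
        if (event.getType() == Event.EventType.NodeDeleted) {
            try {
                tryToLead();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
```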

Here’s a quick summary:

  • Fake death (suspended animation): the Leader is presumed dead because of a heartbeat timeout (a network cause), but the Leader is actually still alive.

  • Split brain: the fake death triggers a new Leader election and a new Leader is elected, but the old Leader’s network then recovers, resulting in two Leaders. Some clients are connected to the old Leader while others are connected to the new one.

What is the cause of ZooKeeper split brain?

The main reason is that the ZooKeeper cluster and the ZooKeeper clients cannot perceive the timeout at exactly the same moment: the ZooKeeper cluster may detect the timeout before the clients do, and after the detection and the switchover, notifying each client also takes time. The scenario additionally requires that the Leader node is cut off from the ZooKeeper cluster’s network while the network between the Leader node and the other cluster roles still works. Only when all of these conditions are met does the problem occur, but once the Leader node is disconnected from the ZooKeeper cluster’s network in this way, the consequences are serious.

How does ZooKeeper solve the “split brain” problem?

To solve the split-brain problem, there are generally several methods:

  • Quorums (quorum mechanism): for a 3-node cluster, Quorums = 2, which means the cluster can tolerate the failure of one node and still elect a Leader, so the cluster remains available. For a 4-node cluster, Quorums = 3, so the tolerance is still 1: if two nodes fail, the whole cluster becomes unavailable. This is the default method ZooKeeper uses to prevent “split brain”.

  • Redundant communications: use multiple communication channels within the cluster so that nodes can still communicate with each other when one channel fails.

  • Fencing (shared resources): for example, a node that can see (and lock) the shared resource is considered part of the cluster and may act as Leader; a node that cannot reach the shared resource is not part of the cluster.

  • Arbitration mechanism.

  • Enable the disk lock mode.

There is also a simple way to avoid a ZooKeeper “split brain”: do not switch to a follower immediately after detecting a problem with the old Leader node, but switch only after sleeping for a sufficient period of time. This ensures that the old Leader has received the change notification and has done its shutdown cleanup before the new node registers as master. The sleep time is usually set to ZooKeeper’s defined session timeout; the system may be unavailable during this window, but compared with the consequences of inconsistent data the sacrifice is worth it.
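A rough sketch of this delayed takeover, with the timeout value and method names assumed purely for illustration:

```java
// Sketch: wait out the ZooKeeper session timeout before claiming leadership,
// so a merely "fake dead" old leader has time to notice and step down.
public class DelayedFailover {

    // Should be at least the session timeout negotiated by the old leader.
    private static final long ZK_SESSION_TIMEOUT_MS = 15_000;

    // Called when a follower is notified that the leader node has disappeared.
    void onLeaderLost(Runnable becomeLeader) throws InterruptedException {
        // Give the old leader time to learn its session expired and to stop
        // serving writes / clean up.
        Thread.sleep(ZK_SESSION_TIMEOUT_MS);
        // Only now is it safe to register as the new master.
        becomeLeader.run();
    }
}
```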

ZooKeeper uses Quorums by default to prevent split brain: a Leader can be elected only if more than half of the nodes in the cluster vote for it. This guarantees the uniqueness of the Leader: either a unique Leader is elected or the election fails. Quorums play the following roles in ZooKeeper:

  • The minimum number of nodes the cluster must have in order to elect a Leader, which guarantees the availability of the cluster.

  • The minimum number of nodes that must have saved a piece of data before the client is notified that the data has been safely saved. Once that many nodes have saved the data, the client is told it is safely stored and can continue with other tasks; the remaining nodes in the cluster will eventually save the data as well (a sketch of this acknowledgment rule follows this list).
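The acknowledgment rule can be sketched roughly as follows; this is an illustrative simplification, not ZooKeeper’s actual ZAB code:

```java
// A write is reported as safely saved only once a strict majority of the
// ensemble (the leader's own copy included) has persisted it.
public class WriteQuorumSketch {

    static boolean isCommitted(int acks, int ensembleSize) {
        return acks > ensembleSize / 2;
    }

    public static void main(String[] args) {
        int ensembleSize = 5;
        // Leader plus two followers have the data: 3 of 5 -> tell the client.
        System.out.println(isCommitted(3, ensembleSize)); // true
        // Leader plus one follower: 2 of 5 -> keep waiting.
        System.out.println(isCommitted(2, ensembleSize)); // false
    }
}
```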

Suppose a Leader experiences a fake death and the remaining followers elect a new Leader. If the old Leader comes back and still thinks it is the Leader, its requests to the other followers will be rejected. The reason is that every time a new Leader is elected an epoch number is generated (marking the current Leader’s “term of office”), and the epoch is incremented each time. Followers that have acknowledged the new Leader know the new epoch and reject all requests whose epoch is smaller than the current Leader’s epoch. Are there followers that do not know a new Leader exists? Perhaps, but certainly not a majority of them, otherwise the new Leader could not have been elected. ZooKeeper writes also follow the quorum mechanism, so a write without majority support is invalid: even if the old Leader still considers itself the Leader, it has no effect.
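A simplified sketch of this epoch-based rejection (the field and method names are illustrative, not ZooKeeper’s internal code):

```java
// Followers remember the epoch of the leader they currently acknowledge and
// reject requests stamped with any older epoch.
public class EpochFencing {

    private long currentEpoch;   // epoch of the most recently acknowledged leader

    synchronized boolean acceptLeaderRequest(long requestEpoch) {
        if (requestEpoch < currentEpoch) {
            // The request comes from a revived old leader: ignore it.
            return false;
        }
        // A higher epoch means a newer leader has been elected.
        currentEpoch = Math.max(currentEpoch, requestEpoch);
        return true;
    }
}
```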

In addition to the default Quorums mechanism for avoiding split brain, a ZooKeeper deployment can also take the following preventive measures:

  • Add redundant heartbeat lines, for example doubled lines, to minimize the chance of “split brain”.

  • Enable a disk lock: the serving node locks the shared disk, so that when split brain occurs the other side cannot grab the shared disk resource. There is a small problem with disk locks, however: if the side holding the lock does not “unlock” it, the other side can never obtain the shared disk. In practice, if the serving node suddenly crashes, it cannot execute the unlock command, and the backup node cannot take over the shared resources and application services. For this reason some HA designs use a “smart” lock: the serving side enables the disk lock only when it finds that all heartbeat lines are disconnected (it can no longer see the peer), and does not lock the disk in normal operation.

  • Set up an arbitration mechanism, for example a reference IP address (such as the gateway IP). When the heartbeat is completely disconnected, both nodes ping the reference IP. If a node cannot reach it, it means the network link on its own side is broken: not only the “heartbeat” but also its external “service” link is down, so starting (or continuing) the application service would be useless. That node voluntarily gives up the competition and lets the node that can ping the reference IP start the service. To be even safer, the node that cannot ping the reference IP simply restarts itself to completely release any shared resources it may still hold (a sketch of this ping check follows this list).
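The ping-based arbitration check could look roughly like this; the reference IP address and timeout are assumptions for illustration:

```java
import java.net.InetAddress;

// When all heartbeat links are down, each node pings a reference IP (for
// example the gateway); only the node that can still reach it keeps serving.
public class ArbitrationPing {

    static boolean shouldKeepServing(String referenceIp) {
        try {
            // isReachable uses ICMP or a TCP echo request, depending on privileges.
            return InetAddress.getByName(referenceIp).isReachable(2000);
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (shouldKeepServing("192.168.1.1")) {
            System.out.println("Reference IP reachable: keep providing service");
        } else {
            System.out.println("Reference IP unreachable: give up, restart and release shared resources");
        }
    }
}
```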
