High availability

We already know that every read or write of a file goes through the NameNode. If the NameNode is down, clients have no way to know where to read or write data, and the whole cluster becomes unusable.



Hadoop’s high availability is achieved through ZooKeeper. The earlier article Hadoop – Cluster installation (high availability) also mentioned the role of ZooKeeper; in fact, many big data components rely on ZooKeeper for their high availability. Let’s take a look at how ZooKeeper makes Hadoop highly available. For how master election is implemented with ZooKeeper, see the previous article ZooKeeper’s master election; the details are not repeated here.

High availability is usually achieved through redundancy: multiple nodes are deployed, one of them serves as the active node, and the remaining nodes stay in standby. When the active node becomes unavailable, one of the standby nodes takes its place.

In highly available Hadoop there are two NameNodes, and each NameNode runs a ZKFC (ZKFailoverController) that communicates with ZooKeeper.



When one of the NameNodes successfully creates a node in ZooKeeper, it becomes the Active node and the other becomes the Standby node. The Standby node watches that ZooKeeper node; when it detects that the Active node is no longer working, it creates the node itself and becomes the new Active node.
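As a minimal sketch of this election mechanism (not the actual ZKFC implementation), the Java snippet below uses the standard ZooKeeper client to try to create an ephemeral node: whichever NameNode succeeds becomes Active, and the other watches the node and retries when it disappears. The lock path, connection string, and error handling are simplified assumptions.

```java
import org.apache.zookeeper.*;

// Minimal sketch of an ephemeral-node election, similar in spirit to what ZKFC does.
// The lock path and connection string are placeholders for illustration.
public class ElectionSketch implements Watcher {
    private static final String LOCK_PATH = "/hadoop-ha/mycluster/ActiveStandbyElectorLock";
    private final ZooKeeper zk;
    private final String myId;

    public ElectionSketch(String zkQuorum, String myId) throws Exception {
        this.zk = new ZooKeeper(zkQuorum, 5000, this);
        this.myId = myId;
    }

    // Try to become active by creating an ephemeral node;
    // if it already exists, stay standby and watch it.
    public void tryToBecomeActive() throws Exception {
        try {
            zk.create(LOCK_PATH, myId.getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println(myId + " is now ACTIVE");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println(myId + " is STANDBY, watching the lock node");
            zk.exists(LOCK_PATH, true); // register a watch on the lock node
        }
    }

    // When the active node's session dies, its ephemeral node is deleted
    // and this watch fires, so the standby tries to take over.
    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDeleted) {
            try {
                tryToBecomeActive();
            } catch (Exception ignored) {
            }
        }
    }
}
```

Because the lock node is ephemeral, ZooKeeper deletes it automatically when the active NameNode's session expires, which is what triggers the failover.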



A DataNode sends a heartbeat to the NameNode at regular intervals, and the NameNode records in memory the time each DataNode last sent a heartbeat, using it to decide whether the DataNode is still alive. Now there are two NameNodes: if heartbeats were sent only to the active node, the standby node would know nothing about the DataNodes when it took over. So when a DataNode sends heartbeats, it obtains the addresses and ports of both NameNodes and sends its heartbeat information to both of them.
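The gist of that behaviour can be sketched roughly as follows; sendHeartbeat() here is a hypothetical stand-in for Hadoop’s internal heartbeat RPC, not the real DataNode code.

```java
import java.net.InetSocketAddress;
import java.util.List;

// Rough sketch: a DataNode reports to every configured NameNode, not just the active one.
public class HeartbeatSketch {
    private final List<InetSocketAddress> nameNodes;

    public HeartbeatSketch(List<InetSocketAddress> nameNodes) {
        this.nameNodes = nameNodes;
    }

    public void run() throws InterruptedException {
        while (true) {
            for (InetSocketAddress nn : nameNodes) {
                sendHeartbeat(nn); // both active and standby receive the same report
            }
            Thread.sleep(3000);    // HDFS's default heartbeat interval is 3 seconds
        }
    }

    // Hypothetical helper; in real HDFS this is an RPC carrying storage reports, capacity, etc.
    private void sendHeartbeat(InetSocketAddress nn) {
        System.out.println("heartbeat -> " + nn);
    }
}
```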



In addition to keeping track of DataNode heartbeats, the NameNode also stores metadata. To keep the two NameNodes in sync, the active node writes every metadata change to the JournalNode cluster, and the standby node periodically reads those changes back from the JournalNode cluster and applies them.
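To give an idea of how these pieces are wired together, here is a hedged sketch using Hadoop’s Configuration API with real HA configuration keys but made-up nameservice, NameNode ids, and host names (in practice these settings live in hdfs-site.xml and core-site.xml):

```java
import org.apache.hadoop.conf.Configuration;

// Sketch of HA-related settings, assuming a nameservice "mycluster" with NameNodes nn1/nn2
// and three JournalNode / ZooKeeper hosts; the host names are placeholders.
public class HaConfigSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "node1:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "node2:8020");
        // The active NameNode writes metadata changes (edits) to this JournalNode quorum;
        // the standby reads them back from the same quorum.
        conf.set("dfs.namenode.shared.edits.dir",
                 "qjournal://node1:8485;node2:8485;node3:8485/mycluster");
        // ZKFC uses this ZooKeeper quorum for automatic failover.
        conf.set("ha.zookeeper.quorum", "node1:2181,node2:2181,node3:2181");
        conf.set("dfs.ha.automatic-failover.enabled", "true");
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return conf;
    }
}
```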



To make the NameNode highly available, HDFS introduces ZooKeeper and the JournalNode cluster. If either of these is not highly available, the NameNode’s high availability suffers as well, so both of them must also be deployed in a highly available way.

The federated cluster

Even a highly available NameNode has its own bottlenecks. First, all requests still go to the active NameNode, so high availability does nothing to reduce the pressure on the active node. Second, the metadata keeps growing and will eventually exhaust the NameNode’s memory; simply adding more memory means more metadata has to be loaded at startup, which makes startup slow and full GC pauses long. To solve these two problems, HDFS introduced the federated cluster.

Since a single NameNode is the bottleneck, several more groups of NameNodes are added to share the load. As shown in the figure below, there are three groups of NameNodes, each with an active node and a standby node. Their relationship with ZooKeeper and the JournalNodes is the same as in the high availability setup above.
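A minimal sketch of the federation side of the configuration, again with made-up nameservice and host names (in a real cluster each nameservice would additionally carry its own HA settings like those shown earlier):

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: three independent nameservices (groups of NameNodes) sharing the same DataNodes.
// Nameservice and host names are placeholders.
public class FederationConfigSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "ns1,ns2,ns3");
        conf.set("dfs.namenode.rpc-address.ns1", "node1:8020");
        conf.set("dfs.namenode.rpc-address.ns2", "node2:8020");
        conf.set("dfs.namenode.rpc-address.ns3", "node3:8020");
        return conf;
    }
}
```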



How do these NameNodes manage the metadata and the DataNodes? Federated clusters introduce the concept of block pools. For example, with three groups of NameNodes there are three block pools, and each DataNode is logically divided into one part per block pool.

For example, in the figure below, each DataNode is divided into three logical parts, each belonging to one block pool. The first group of NameNodes reads and writes data in block pool 1 and stores the corresponding metadata; the second group reads and writes block pool 2 and stores its metadata; the third group reads and writes block pool 3 and stores its metadata. Assuming the metadata originally totalled 600 gigabytes, each group of NameNodes now manages only 200 gigabytes of it. In addition, each group handles only a third of the reads and writes, which greatly reduces the pressure on each NameNode.
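To make the split concrete, a client chooses a group of NameNodes (and therefore a block pool) through the nameservice in the URI. The nameservices and paths below are placeholders, reusing the FederationConfigSketch from the previous sketch.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: in a federated cluster each nameservice owns its own namespace and block pool,
// so the client picks the group of NameNodes through the URI. Names and paths are placeholders.
public class FederatedClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = FederationConfigSketch.build();

        // Writes under hdfs://ns1 go to the first group of NameNodes and block pool 1 ...
        FileSystem fs1 = FileSystem.get(URI.create("hdfs://ns1/"), conf);
        fs1.mkdirs(new Path("/logs"));

        // ... while hdfs://ns2 is served by the second group and block pool 2.
        FileSystem fs2 = FileSystem.get(URI.create("hdfs://ns2/"), conf);
        fs2.mkdirs(new Path("/warehouse"));
    }
}
```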