1. Introduction to High Availability

Hadoop high availability (HA) consists of HDFS HA and YARN HA. The two implementations are similar, but the HDFS NameNode has stricter requirements on data storage and consistency than the YARN ResourceManager, so its HA implementation is more complex; it is analyzed first below.

1.1 Overall Architecture of HA

The HDFS HA architecture is as follows:

The image is from: www.edureka.co/blog/how-to…

The HDFS high availability architecture consists of the following components:

  • Active NameNode and Standby NameNode: One NameNode is in the Active state and the other is in the Standby state. Only the Active NameNode can provide read and write services.

  • Active/standby switchover controller ZKFailoverController: The ZKFailoverController runs as an independent process and controls the active/standby switchover of the NameNodes. It detects the health status of the NameNode in time and, with the help of ZooKeeper, performs automatic active/standby election and switchover when the active NameNode fails. However, the NameNode also supports manual active/standby switchover that does not depend on ZooKeeper (see the haadmin sketch after this list).

  • ZooKeeper cluster: Provides the coordination service used by the switchover controllers for the active/standby election.

  • Shared storage system: The shared storage system is the most critical part of NameNode high availability. It stores the HDFS metadata generated while the NameNode is running. The active and standby NameNodes synchronize metadata through the shared storage system. During an active/standby switchover, the new active NameNode can continue to provide services only after the metadata is fully synchronized.

  • DataNode: In addition to sharing HDFS metadata through the shared storage system, the active and standby NameNodes also need to share the mapping between HDFS data blocks and DataNodes. The DataNodes report the locations of data blocks to both the active and standby NameNodes.
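
If automatic failover is not in use, a manual switchover can be driven with the hdfs haadmin tool. A minimal sketch, assuming the NameNode IDs nn1 and nn2 that are configured later in this article (when automatic failover is enabled, the state-changing commands are refused unless forced):

# Check which NameNode is currently active
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
# Manually fail over from nn1 to nn2
hdfs haadmin -failover nn1 nn2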

1.2 Data Synchronization Mechanism of the QJM-Based Shared Storage System

Currently, Hadoop supports Quorum Journal Manager (QJM) or Network File System (NFS) as the shared storage system. The following uses a QJM cluster as an example: the active NameNode submits edit logs to the JournalNode cluster, and the standby NameNode synchronizes edit logs from it. When the active NameNode goes down, the standby NameNode can provide services once it confirms that the metadata is fully synchronized.

It is important to note that writing edit logs to the JournalNode cluster follows a majority-write ("more than half succeed") policy, so at least three JournalNodes are required. You can increase the number of JournalNodes, but keep the total odd. With 2N+1 JournalNodes, up to N JournalNode failures can be tolerated. For example, with three JournalNodes an edit is committed once two of them acknowledge it, so the loss of one JournalNode can be tolerated.

1.3 NameNode Active/Standby Switchover

The following figure shows the process of NameNode active/standby switchover:

  1. After HealthMonitor is initialized, it starts internal threads to periodically invoke the method of the HAServiceProtocol RPC interface of the corresponding NameNode to check the health status of the NameNode.

  2. If HealthMonitor detects that the health status of NameNode has changed, it calls back to the corresponding method registered with ZKFailoverController for processing.

  3. If ZKFailoverController determines that an active/standby switchover is required, it first uses ActiveStandbyElector to perform an automatic active/standby election.

  4. ActiveStandbyElector interacts with ZooKeeper to perform the automatic active/standby election (see the zkCli sketch after this list).

  5. After the election completes, ActiveStandbyElector calls back the corresponding method of ZKFailoverController to notify the current NameNode to become the active or standby NameNode.

  6. ZKFailoverController calls the method of the HAServiceProtocol RPC interface of the corresponding NameNode to change the NameNode to Active or Standby.
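
Under the hood, the election relies on an ephemeral lock znode in ZooKeeper: the NameNode whose ZKFC holds the lock is the active one. Once the cluster is running, this can be observed with zkCli; a sketch assuming the default ha.zookeeper.parent-znode of /hadoop-ha and the nameservice mycluster configured later in this article, where the output might look like this:

# Connect to any ZooKeeper node and inspect the HA znodes
zkCli.sh -server hadoop001:2181
[zk: hadoop001:2181(CONNECTED) 0] ls /hadoop-ha/mycluster
[ActiveBreadCrumb, ActiveStandbyElectorLock]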

1.4 YARN High Availability

The high availability of YARN ResourceManager is similar to that of HDFS NameNode. However, unlike the NameNode, the ResourceManager has much less metadata to maintain, so its state information can be written directly to ZooKeeper, and it relies on ZooKeeper for the active/standby election.

2. Cluster Planning

To meet the HA design goals, ensure that there are at least two NameNodes (one active and one standby) and two ResourceManagers (one active and one standby), as well as at least three JournalNodes so that a majority of writes can succeed. Three hosts are used to build the cluster. The cluster planning is as follows:
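
A planning layout consistent with the processes shown in Section 6.1 is:

hadoop001: NameNode, ZKFC (DFSZKFailoverController), DataNode, NodeManager, JournalNode, ZooKeeper
hadoop002: NameNode, ZKFC (DFSZKFailoverController), ResourceManager, DataNode, NodeManager, JournalNode, ZooKeeper
hadoop003: ResourceManager, DataNode, NodeManager, JournalNode, ZooKeeper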

3. Preconditions

  • JDK is installed on all servers. For details about how to install JDK, see installing JDK in Linux.
  • Set up the ZooKeeper cluster. For details, see Setting up a ZooKeeper Single-node Environment and Cluster Environment.
  • SSH password-free login has been configured for all servers.

4. Cluster Configuration

4.1 Downloading and Decompressing the Package

Download Hadoop. Here I downloaded the CDH version of Hadoop; the download address is: archive.cloudera.com/cdh5/cdh/5/

# tar -zvxf hadoop-server-cdh5.15.2.tar.gz

4.2 Configuring Environment Variables

Edit profile file:

# vim /etc/profile

Added the following configuration:

export HADOOP_HOME=/usr/app/hadoop-server-cdh5.15.2
export PATH=${HADOOP_HOME}/bin:$PATH

Run the source command for the configuration to take effect immediately:

# source /etc/profile

4.3 Modifying the Configuration

Go to ${HADOOP_HOME}/etc/hadoop and modify the configuration files. The content of each configuration file is as follows:

1. hadoop-env.sh

# Specify where the JDK is installed
export JAVA_HOME=/usr/java/jdk1.8.0_201/

2. core-site.xml

<configuration>
    <property>
        <!-- Communication address of the NameNode's HDFS protocol file system -->
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop001:8020</value>
    </property>
    <property>
        <!-- Directory where the Hadoop cluster stores temporary files -->
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/tmp</value>
    </property>
    <property>
        <!-- ZooKeeper cluster address -->
        <name>ha.zookeeper.quorum</name>
        <value>hadoop001:2181,hadoop002:2181,hadoop003:2181</value>
    </property>
    <property>
        <!-- Timeout for the ZKFC connection to ZooKeeper -->
        <name>ha.zookeeper.session-timeout.ms</name>
        <value>10000</value>
    </property>
</configuration>

3. hdfs-site.xml

<configuration>
    <property>
        <!-- Number of HDFS replicas -->
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <!-- Storage location of NameNode data (metadata); multiple comma-separated directories can be specified for fault tolerance -->
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/namenode/data</value>
    </property>
    <property>
        <!-- Storage location of DataNode data (data blocks) -->
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop/datanode/data</value>
    </property>
    <property>
        <!-- Logical name of the cluster service -->
        <name>dfs.nameservices</name>
        <value>mycluster</value>
    </property>
    <property>
        <!-- List of NameNode IDs -->
        <name>dfs.ha.namenodes.mycluster</name>
        <value>nn1,nn2</value>
    </property>
    <property>
        <!-- RPC address of nn1 -->
        <name>dfs.namenode.rpc-address.mycluster.nn1</name>
        <value>hadoop001:8020</value>
    </property>
    <property>
        <!-- RPC address of nn2 -->
        <name>dfs.namenode.rpc-address.mycluster.nn2</name>
        <value>hadoop002:8020</value>
    </property>
    <property>
        <!-- HTTP address of nn1 -->
        <name>dfs.namenode.http-address.mycluster.nn1</name>
        <value>hadoop001:50070</value>
    </property>
    <property>
        <!-- HTTP address of nn2 -->
        <name>dfs.namenode.http-address.mycluster.nn2</name>
        <value>hadoop002:50070</value>
    </property>
    <property>
        <!-- Shared storage directory for NameNode metadata on the JournalNodes -->
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://hadoop001:8485;hadoop002:8485;hadoop003:8485/mycluster</value>
    </property>
    <property>
        <!-- Directory where the JournalNodes store edit files -->
        <name>dfs.journalnode.edits.dir</name>
        <value>/home/hadoop/journalnode/data</value>
    </property>
    <property>
        <!-- Fencing method, to ensure that only one NameNode is active at any given time -->
        <name>dfs.ha.fencing.methods</name>
        <value>sshfence</value>
    </property>
    <property>
        <!-- sshfence requires SSH password-free login -->
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/root/.ssh/id_rsa</value>
    </property>
    <property>
        <!-- SSH timeout -->
        <name>dfs.ha.fencing.ssh.connect-timeout</name>
        <value>30000</value>
    </property>
    <property>
        <!-- Proxy class used by clients to determine the Active NameNode -->
        <name>dfs.client.failover.proxy.provider.mycluster</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
    <property>
        <!-- Enable automatic failover -->
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
    </property>
</configuration>

4. yarn-site.xml

<configuration>
    <property>
        <!-- Auxiliary services running on the NodeManager; mapreduce_shuffle must be configured to run MapReduce programs on YARN -->
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <!-- Enable log aggregation (optional) -->
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <property>
        <!-- Retention time of aggregated logs (optional) -->
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>86400</value>
    </property>
    <property>
        <!-- Enable RM HA -->
        <name>yarn.resourcemanager.ha.enabled</name>
        <value>true</value>
    </property>
    <property>
        <!-- RM cluster id -->
        <name>yarn.resourcemanager.cluster-id</name>
        <value>my-yarn-cluster</value>
    </property>
    <property>
        <!-- List of RM logical IDs -->
        <name>yarn.resourcemanager.ha.rm-ids</name>
        <value>rm1,rm2</value>
    </property>
    <property>
        <!-- Service address of rm1 -->
        <name>yarn.resourcemanager.hostname.rm1</name>
        <value>hadoop002</value>
    </property>
    <property>
        <!-- Service address of rm2 -->
        <name>yarn.resourcemanager.hostname.rm2</name>
        <value>hadoop003</value>
    </property>
    <property>
        <!-- Web application address of rm1 -->
        <name>yarn.resourcemanager.webapp.address.rm1</name>
        <value>hadoop002:8088</value>
    </property>
    <property>
        <!-- Web application address of rm2 -->
        <name>yarn.resourcemanager.webapp.address.rm2</name>
        <value>hadoop003:8088</value>
    </property>
    <property>
        <!-- ZooKeeper cluster address -->
        <name>yarn.resourcemanager.zk-address</name>
        <value>hadoop001:2181,hadoop002:2181,hadoop003:2181</value>
    </property>
    <property>
        <!-- Enable automatic recovery -->
        <name>yarn.resourcemanager.recovery.enabled</name>
        <value>true</value>
    </property>
    <property>
        <!-- Class used for persistent state storage -->
        <name>yarn.resourcemanager.store.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
    </property>
</configuration>

5. mapred-site.xml

<configuration>
    <property>
        <!-- Specify that MapReduce jobs run on YARN -->
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

6. slaves

Configure the host name or IP address of each slave node, one per line. The DataNode and NodeManager services will be started on all slave nodes.

hadoop001
hadoop002
hadoop003

4.4 Distributing the Program

Distribute the Hadoop installation package to the other two servers. After distribution, you are advised to configure the Hadoop environment variables on those two servers as well.

# Distribute the installation package to hadoop002
scp -r /usr/app/hadoop-server-cdh5.15.2/ hadoop002:/usr/app/
# Distribute the installation package to hadoop003
scp -r /usr/app/hadoop-server-cdh5.15.2/ hadoop003:/usr/app/

5. Starting the Cluster

5.1 Starting ZooKeeper

Start the ZooKeeper service on all three servers:

 zkServer.sh start
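
Optionally, verify each node's role; in a healthy three-node ensemble, one node reports leader and the other two report follower:

 zkServer.sh status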

5.2 Starting JournalNode

Go to ${HADOOP_HOME}/sbin on each server and start the JournalNode service:

hadoop-daemon.sh start journalnode

5.3 Initializing NameNode

Execute the NameNode initialization command on hadoop001:

hdfs namenode -format

After the initialization command has been executed, copy the contents of the NameNode metadata directory to the other, unformatted NameNode. The metadata storage directory is the directory specified by the dfs.namenode.name.dir property in hdfs-site.xml. Here we need to copy it to hadoop002:

 scp -r /home/hadoop/namenode/data hadoop002:/home/hadoop/namenode/
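
Alternatively, instead of copying the directory by hand, the unformatted NameNode can usually be bootstrapped with the command below; run it on hadoop002 (it pulls the formatted metadata from the other NameNode, which therefore typically needs to be running):

hdfs namenode -bootstrapStandby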

5.4 Initializing the HA Status

Run the following command on any NameNode to initialize the HA state in ZooKeeper:

hdfs zkfc -formatZK
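
The command creates a znode for the nameservice under the default parent path /hadoop-ha. A quick check from any ZooKeeper node might look like this:

zkCli.sh -server hadoop001:2181
[zk: hadoop001:2181(CONNECTED) 0] ls /hadoop-ha
[mycluster]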

5.5 Starting HDFS

Go to ${HADOOP_HOME}/sbin on hadoop001 and start HDFS. The NameNode services on hadoop001 and hadoop002 and the DataNode services on all three servers will be started:

start-dfs.sh
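
Once HDFS is up, the state of each NameNode can also be checked from the command line; one should report active and the other standby (nn1 and nn2 are the IDs configured in hdfs-site.xml):

hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2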

5.6 Starting YARN

Go to ${HADOOP_HOME}/sbin on hadoop002 and start YARN. The ResourceManager service on hadoop002 and the NodeManager services on all three servers will be started:

start-yarn.sh

Note that the ResourceManager service on hadoop003 is not started at this point. You need to start it manually:

yarn-daemon.sh start resourcemanager
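
Similarly, the state of each ResourceManager can be checked with yarn rmadmin (rm1 and rm2 are the IDs configured in yarn-site.xml):

yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2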

6. Viewing the Cluster

6.1 Viewing Processes

After a successful startup, the processes on each server should look like this:

[root@hadoop001 sbin]# jps
4512 DFSZKFailoverController
3714 JournalNode
4114 NameNode
3668 QuorumPeerMain
5012 DataNode
4639 NodeManager

[root@hadoop002 sbin]# jps
4499 ResourceManager
4595 NodeManager
3465 QuorumPeerMain
3705 NameNode
3915 DFSZKFailoverController
5211 DataNode
3533 JournalNode

[root@hadoop003 sbin]# jps
3491 JournalNode
3942 NodeManager
4102 ResourceManager
4201 DataNode
3435 QuorumPeerMain

6.2 Viewing the Web UI

The Web UI ports of HDFS and YARN are 50070 and 8088 respectively. The interfaces should look as follows:

The NameNode on hadoop001 is active:

The NameNode on hadoop002 is in the standby state:

The ResourceManager on hadoop002 is active:

The ResourceManager on hadoop003 is in the standby state:

At the same time, the NameNode interface shows information about the Journal Manager:

7. Restarting the Cluster

The initial startup of the cluster involves some necessary one-time initialization, so the process is a bit tedious. Once the cluster has been set up, however, starting it again is straightforward. The steps are as follows (first make sure the ZooKeeper cluster has been started):

Start HDFS on hadoop001; all HDFS high-availability related services, including NameNode, DataNode, and JournalNode, will be started:

start-dfs.sh

Start YARN on hadoop002:

start-yarn.sh

At this point, the ResourceManager service on hadoop003 is still not started; you need to start it manually:

yarn-daemon.sh start resourcemanager

References

The above steps are mainly based on the official documentation:

  • HDFS High Availability Using the Quorum Journal Manager
  • ResourceManager High Availability

For a detailed analysis of Hadoop’s high availability principles, read:

Hadoop NameNode High Availability analysis