This article is published by the NetEase Cloud Community with the authorization of its author, Zhu Xiaoxiao.




The previous article introduced how to build a Hadoop pseudo-distributed cluster. This article introduces how to build a Hadoop distributed cluster. The content is basic, but it provides a reference for beginners who, like me, are just getting to know the Hadoop environment.

Environment:

System environment: CentOS 7.3.1611, 64-bit

Java version: OpenJDK 1.8.0

Use two nodes as the cluster environment: one as the Master and the other as the Slave

Cluster building process:

The Hadoop cluster installation and configuration process is as follows:

(1) Select a machine as Master;

(2) Configure the Hadoop user, Java environment, and SSH Server on the Master node.

(3) Install Hadoop on the Master node and complete the configuration.

(4) Other machines serve as Slave nodes. Similarly, configure the hadoop user and the Java environment, and install the SSH server on each Slave node.

(5) Copy the Hadoop installation directory on the Master node to other Slave nodes;

(6) Start Hadoop on the Master node.

Among the above steps, configuring the hadoop user and the Java environment, installing the SSH server, and installing Hadoop are described in detail in the article “Hadoop Stand-alone/Pseudo-Distributed Cluster Construction (Introduction for Beginners)”, which you can refer to: http://ks.netease.com/blog?id=8333.

All nodes in a cluster must reside on the same LAN. Therefore, perform the following network configuration after completing the first four steps to implement node interconnection.

Network configuration

The systems used in this article run in VMware virtual machines, so node interconnection can be achieved simply by changing the network connection mode to Bridged mode, as shown in the figure:

After the configuration is complete, you can view the IP address of each node by running the ifconfig command in Linux. The IP address of the Master node here is 10.240.193.67.

After configuring the network, you can configure the Hadoop distributed cluster.

To distinguish the Master from the Slave, you can change the host name of each node. In CentOS 7, we run the following command on the Master node to change its host name:

  • sudo vim /etc/hostname

Change the host name to Master:

Similarly, change the host name of the Slave node to Slave.

Then, run the following command on both Master and Slave nodes to modify the IP mapping and add the IP mapping for all nodes:

  • sudo vim /etc/hosts

The mapping between the two node names and their corresponding IP addresses is as follows:
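For example, the /etc/hosts entries on both nodes would look like the following. The Slave address (10.240.193.68) is only a placeholder; use the actual address reported by ifconfig on the Slave node:

  10.240.193.67   Master
  10.240.193.68   Slave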

After the modification, restart the nodes for the changes to take effect. Then, from each node, ping the host name of the other node; if the ping succeeds, the configuration is successful.

Passwordless SSH login to the nodes

To build a distributed cluster, the Master node must be able to log in to each Slave node over SSH without a password. Therefore, we need to generate the Master node’s public key and add it to each Slave node’s authorized keys. The steps are as follows:

(1) Generate the public key of the Master node and run the following command:

  • cd ~/.ssh — If the directory does not exist, run ssh localhost first

  • rm ./id_rsa* — Delete the existing keys if they have already been generated

  • ssh-keygen -t rsa — Keep pressing Enter at every prompt

(2) The Master node must first be able to log in to itself over SSH without a password. Run the following commands on the Master node:

  • cat ./id_rsa.pub >> ./authorized_keys

  • chmod 600 ./authorized_keys

If you do not change the permissions, the following problem may occur: you still cannot log in to the local machine over SSH without a password.

After this is done, run the ssh Master command to check whether you can log in to the local machine without a password. On the first login, enter yes, and then exit.

(3) To transfer the generated public key from the Master node to the Slave node, run the following command:

  • scp ~/.ssh/id_rsa.pub hadoop@Slave:/home/hadoop/

The scp command above remotely copies the Master node’s public key to the /home/hadoop directory on the Slave node. During execution, the password of the hadoop user on the Slave node is required. After the password is entered, the transfer progress is displayed:

(4) On the Slave node, add the SSH public key of the Master node to authorization and run the following command:

  • ssh localhost — Run this command to generate the ~/.ssh directory if it does not exist yet, or create it with mkdir ~/.ssh

  • cat ~/id_rsa.pub >> ~/.ssh/authorized_keys

  • chmod 600 ~/.ssh/authorized_keys

(5) The configuration of passwordless SSH login from the Master node to the Slave node is now complete. Run ssh Slave on the Master node to verify it; a successful login is shown in the following figure:

Configure the Hadoop distributed environment

To configure the Hadoop distributed environment, you need to modify five Hadoop configuration files: slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. For details about all configuration items and their functions, see the official website. This section only describes the configuration items required to start the distributed cluster.

1. slaves

Write the host names of the nodes that act as DataNodes into this file, one per line. The file contains only localhost by default, which is why in the pseudo-distributed configuration the node acts as both NameNode and DataNode; localhost can either be kept or deleted. In this article, the Master node is used only as the NameNode and the Slave node as the DataNode, so delete localhost and add Slave.
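In this setup, the slaves file therefore contains just one line:

  Slave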

2. core-site.xml
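A minimal sketch of core-site.xml for this two-node setup is shown below. It assumes that the NameNode listens on Master at port 9000 and that Hadoop stores its temporary data under /usr/local/hadoop/tmp; adjust these values to your own environment:

  <configuration>
      <property>
          <name>fs.defaultFS</name>
          <value>hdfs://Master:9000</value>
      </property>
      <property>
          <name>hadoop.tmp.dir</name>
          <value>file:/usr/local/hadoop/tmp</value>
      </property>
  </configuration>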

3. hdfs-site.xml

Since there is only one Slave node, the value of dfs.replication is set to 1.
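A sketch of hdfs-site.xml under the same assumptions (the SecondaryNameNode runs on Master at the default port 50090, and the NameNode and DataNode data are kept under the hadoop.tmp.dir chosen above):

  <configuration>
      <property>
          <name>dfs.namenode.secondary.http-address</name>
          <value>Master:50090</value>
      </property>
      <property>
          <name>dfs.replication</name>
          <value>1</value>
      </property>
      <property>
          <name>dfs.namenode.name.dir</name>
          <value>file:/usr/local/hadoop/tmp/dfs/name</value>
      </property>
      <property>
          <name>dfs.datanode.data.dir</name>
          <value>file:/usr/local/hadoop/tmp/dfs/data</value>
      </property>
  </configuration>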

4. mapred-site.xml

The default file is named mapred-site.xml.template. Rename it to mapred-site.xml and then modify the configuration.
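A minimal sketch, assuming MapReduce runs on YARN and the JobHistory server runs on the Master node at its default ports:

  <configuration>
      <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
      </property>
      <property>
          <name>mapreduce.jobhistory.address</name>
          <value>Master:10020</value>
      </property>
      <property>
          <name>mapreduce.jobhistory.webapp.address</name>
          <value>Master:19888</value>
      </property>
  </configuration>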

5. yarn-site.xml
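A minimal sketch, assuming the ResourceManager runs on the Master node:

  <configuration>
      <property>
          <name>yarn.resourcemanager.hostname</name>
          <value>Master</value>
      </property>
      <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
      </property>
  </configuration>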

After the configuration is complete, copy the Hadoop folder on the Master node to the same directory (/usr/local/) on the Slave node, and change the owner of the folder to the hadoop user, for example with the commands sketched below.
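One way to do this, assuming Hadoop is installed in /usr/local/hadoop and the hadoop user exists on both nodes (the archive name hadoop.master.tar.gz is only illustrative):

  • cd /usr/local && sudo tar -zcf ~/hadoop.master.tar.gz ./hadoop — On the Master node, pack the installation directory

  • scp ~/hadoop.master.tar.gz hadoop@Slave:/home/hadoop — Copy the archive to the Slave node

  • sudo tar -zxf ~/hadoop.master.tar.gz -C /usr/local — On the Slave node, unpack the archive into /usr/local

  • sudo chown -R hadoop /usr/local/hadoop — On the Slave node, change the owner of the folder to hadoop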

Then you can start the distributed Hadoop cluster. Before starting it, disable the firewall on every node in the cluster; otherwise the DataNode processes will start, but the number of Live DataNodes will be 0.
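On CentOS 7 the firewall is managed by firewalld, so disabling it on each node can be done as follows:

  • sudo systemctl stop firewalld — Stop the firewall service

  • sudo systemctl disable firewalld — Keep it from starting again at boot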

Starting the distributed cluster

To format the NameNode on the Master node for the first startup, run the following command:

  • hdfs namenode -format

Then start Hadoop. Hadoop is started and stopped on the Master node.

To start Hadoop, run the startup scripts in the Hadoop sbin directory:

  • cd /usr/local/hadoop/sbin

  • ./start-dfs.sh

  • ./start-yarn.sh

  • ./mr-jobhistory-daemon.sh start historyserver

After the startup, you can run the jps command to view the processes running on each node. If the startup is correct, the NameNode, SecondaryNameNode, ResourceManager, and JobHistoryServer processes can be seen on the Master node, as shown in the figure.

On the Slave node, the NodeManager and DataNode processes can be seen, as shown in the figure.

The absence of any of these processes indicates a startup error. In addition, run the hdfs dfsadmin -report command on the Master node to check whether the DataNodes started normally. If the number of Live DataNodes is 0, the startup has failed. Since only one DataNode is configured here, Live DataNodes is displayed as 1, as shown in the figure.

You can also check the status of DataNode and NameNode on the Web page: http://master:50070.

Then we can run a distributed MapReduce job (a sketch is given below). Since the mr-jobhistory service has been started, the job information can be viewed through the web page http://master:8088/cluster by clicking History in the Tracking UI column.
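As a sketch, the grep example that ships with Hadoop can be used as the distributed instance. This assumes the commands are run as the hadoop user on the Master node, that the Hadoop bin directory is on the PATH, and that the input and output directory names are only illustrative:

  • hdfs dfs -mkdir -p /user/hadoop — Create the hadoop user’s home directory on HDFS

  • hdfs dfs -mkdir input

  • hdfs dfs -put /usr/local/hadoop/etc/hadoop/*.xml input — Use the Hadoop configuration files as input data

  • hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+' — Run the grep example; the result is written to the output directory on HDFS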

Shutting down the Hadoop Cluster

Execute the following scripts on the Master node:

  • stop-yarn.sh

  • stop-dfs.sh

  • mr-jobhistory-daemon.sh stop historyserver





