Storing and processing large amounts of data consumes considerable disk and compute resources, so a single machine is clearly not enough to meet our requirements. In this section we will get to know Hadoop's cluster mode and practice distributing processing tasks and sharing storage across the cluster.

Note:

Hadoop's official website is hadoop.apache.org, and the author's blog is w-blog.cn.

1. Preparation

Installation Package List

The packages are stored temporarily in **/app/install** on hadoop-1 and copied to the slave nodes with scp after the configuration is done.

jdk-8u101-linux-x64.tar.gz
hadoop-2.7.3.tar.gz

Server environment

The servers run 64-bit CentOS 7.x.

# hadoop-1      192.168.1.101
# hadoop-2      192.168.1.102
# hadoop-3      192.168.1.103

Create the install directory to hold the packages and use OneInstack to update the base components

> mkdir -p /app/install && cd /app/install
> wget http://mirrors.linuxeye.com/oneinstack-full.tar.gz
> tar -zxvf oneinstack-full.tar.gz
> cd oneinstack && ./install.sh

Disable the firewall on each node (otherwise Hadoop's port communication will be blocked), or add the required ports to the firewall whitelist (a sketch of whitelisting follows the commands below)

> systemctl stop firewalld.service      # stop the firewall
> systemctl disable firewalld.service   # keep it disabled after reboot
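
If you prefer to keep firewalld running, the ports used later in this tutorial can be whitelisted instead. This is only a sketch based on the ports configured below; depending on your setup, the DataNode and YARN ports (for example 50010, 50075, 8088 and 8030-8033) may also need to be opened:

> sudo firewall-cmd --permanent --add-port=9000/tcp    # NameNode RPC (fs.defaultFS)
> sudo firewall-cmd --permanent --add-port=50070/tcp   # NameNode web UI
> sudo firewall-cmd --permanent --add-port=50090/tcp   # SecondaryNameNode
> sudo firewall-cmd --permanent --add-port=10020/tcp   # JobHistory server
> sudo firewall-cmd --permanent --add-port=19888/tcp   # JobHistory web UI
> sudo firewall-cmd --reload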

Change the host name of each server

> hostnamectl set-hostname hadoop-1    # on hadoop-1
> hostnamectl set-hostname hadoop-2    # on hadoop-2
> hostnamectl set-hostname hadoop-3    # on hadoop-3

Edit the hosts file on each server (so that the nodes can reach each other by hostname)

> vim /etc/hosts
192.168.1.101 hadoop-1
192.168.1.102 hadoop-2
192.168.1.103 hadoop-3

Reboot the servers for the change to take effect; after the reboot the new hostnames will be in place. Then use ping to check that each host can be reached by name

> ping hadoop-1
> ping hadoop-2
> ping hadoop-3

Create a hadoop user on every cluster node (use a strong password to avoid attacks if the nodes are exposed on a public IP address)

> useradd -m hadoop -s /bin/bash
> passwd hadoop

Grant the hadoop user administrator privileges so that it can use sudo to run commands as root; add the following entry to /etc/sudoers (for example with visudo), below the existing root line:

root    ALL=(ALL)       ALL
hadoop  ALL=(ALL)       ALL

Set up passwordless SSH login on hadoop-1 (hadoop-1 is our master host)

> su hadoop
> ssh-keygen -t rsa
> cd ~/.ssh
> cat id_rsa.pub >> authorized_keys    # add the key to the authorized list
> chmod 600 ./authorized_keys          # fix the file permissions
> ssh localhost                        # the first connection will ask for confirmation

To let hadoop-1 log in to hadoop-2 and hadoop-3 without a password, first run the following on nodes 2 and 3:

> su hadoop
> ssh-keygen -t rsa 

Then run the following on hadoop-1:

> scp ~/.ssh/authorized_keys hadoop@hadoop-2:/home/hadoop/.ssh/
> scp ~/.ssh/authorized_keys hadoop@hadoop-3:/home/hadoop/.ssh/
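
Note that copying authorized_keys this way overwrites any keys that already exist on the slave nodes. An alternative sketch, using ssh-copy-id to append the master's key instead (assuming the hadoop user created above):

> ssh-copy-id hadoop@hadoop-2
> ssh-copy-id hadoop@hadoop-3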

You should now be able to log in to nodes 2 and 3 from hadoop-1 without a password:

> ssh hadoop-2
> ssh hadoop-3

2. Configure the cluster

Java environment

The first step is to configure the Java environment on each server

> cd /app/install
> sudo tar -zxvf jdk-8u101-linux-x64.tar.gz
> sudo mv jdk1.8.0_101/ /usr/local/jdk1.8
> scp /app/install/jdk-8u101-linux-x64.tar.gz hadoop@hadoop-2:~
> scp /app/install/jdk-8u101-linux-x64.tar.gz hadoop@hadoop-3:~

Execute as root on hadoop-2 and hadoop-3

> sudo mv ~/jdk-8u101-linux-x64.tar.gz /app/install
> cd /app/install
> sudo tar -zxvf jdk-8u101-linux-x64.tar.gz
> sudo mv jdk1.8.0_101/ /usr/local/jdk1.8

Add the following to the environment variables on all nodes

> sudo vim /etc/profile

# Java
export JAVA_HOME=/usr/local/jdk1.8
export JRE_HOME=/usr/local/jdk1.8/jre
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$PATH

# enable the environment variables
> source /etc/profile

The following result indicates that the installation is successful

> java -version
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)

The Hadoop environment

Start by preparing the Hadoop environment on Hadoop-1

> cd /app/install
> sudo tar -zxvf hadoop-2.7.3.tar.gz
> sudo mv hadoop-2.7.3 /usr/local/
> sudo chown -R hadoop:hadoop /usr/local/hadoop-2.7.3

Add the following to the environment variables on all nodes

> sudo vim /etc/profile

# Hadoop
export HADOOP_HOME=/usr/local/hadoop-2.7.3
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

# enable the environment variables
> source /etc/profile
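
A quick way to confirm that the variables are picked up is to run hadoop version, which should report 2.7.3:

> hadoop version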

Configuring a Hadoop Cluster

In cluster/distributed mode, six configuration files in /usr/local/hadoop-2.7.3/etc/hadoop need to be modified: hadoop-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. Only the items required for a normal startup are configured here; see the official documentation for the full list of configuration options.

1. Modify the JAVA environment variables in hadoop-env.sh

> vim /usr/local/hadoop-2.7.3/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/local/jdk1.8

2. File slaves. The slaves file lists the hosts that will act as DataNodes, one per line; the default content is localhost, which is why in the pseudo-distributed configuration the single node acts as both NameNode and DataNode. In a distributed configuration you can keep localhost or delete it so that the hadoop-1 node serves only as a NameNode. In this tutorial the Master node is used only as a NameNode, so remove the original localhost from the file and add only the following lines.

> vim /usr/local/hadoop-2.7.3/etc/hadoop/slaves
hadoop-2
hadoop-3

3. File core-site.xml:

> vim /usr/local/hadoop-2.7.3/etc/hadoop/core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop-1:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop-2.7.3/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
</configuration>

4. File hdfs-site.xml: dfs.replication is usually set to 3, but we only have two slave nodes, so dfs.replication is set to 2 here.

NameNode: manages the file system metadata. Every read goes through the NameNode first to find out which DataNodes hold the requested data.

DataNode: the nodes where the data is actually stored; the block-to-node mapping is kept by the NameNode.

Another important HDFS feature is replication: if a disk is damaged, no HDFS data is lost. It can be understood as a redundancy/backup mechanism.

Unlike single-machine mode, you need to configure the NameNode address that the DataNodes will connect to.

> vim /usr/local/hadoop-2.7.3/etc/hadoop/hdfs-site.xml

<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop-1:50090</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop-2.7.3/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop-2.7.3/tmp/dfs/data</value>
    </property>
</configuration>

5. File mapred-site.xml (the default file name is mapred-site.xml.template, so rename it first):

> mv /usr/local/hadoop-2.7.3/etc/hadoop/mapred-site.xml.template /usr/local/hadoop-2.7.3/etc/hadoop/mapred-site.xml
> vim /usr/local/hadoop-2.7.3/etc/hadoop/mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>hadoop-1:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>hadoop-1:19888</value>
    </property>
</configuration>

6. File yarn-site.xml:

> vim /usr/local/hadoop-2.7.3/etc/hadoop/yarn-site.xml

<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop-1</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

After the configuration is complete, copy the /usr/local/hadoop-2.7.3 folder from the Master to the other nodes. If you have run pseudo-distributed mode on this machine before, it is recommended to delete the temporary files before switching to cluster mode. Execute on the Master node:

> scp -r /usr/local/hadoop-2.7.3 hadoop-2:/home/hadoop
> scp -r /usr/local/hadoop-2.7.3 hadoop-3:/home/hadoop

Execute on nodes 2 and 3:

> sudo mv ~/hadoop-2.7.3 /usr/local
> sudo chown -R hadoop /usr/local/hadoop-2.7.3

Format the NameNode on the Master node (hadoop-1):

> hdfs namenode -format

Hadoop can now be started on the Master node:

> start-dfs.sh
> start-yarn.sh
> mr-jobhistory-daemon.sh start historyserver
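
Once the daemons are up, a simple sanity check is to run jps (shipped with the JDK) on each node. With this configuration the Master should list NameNode, SecondaryNameNode, ResourceManager and JobHistoryServer, while hadoop-2 and hadoop-3 should list DataNode and NodeManager:

> jps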

The following warning may appear when you run start-dfs.sh:

WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform...

This warning can be eliminated by modifying HADOOP_OPTS in hadoop-env.sh:

export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib:$HADOOP_PREFIX/lib/native"

View cluster status (regarding disk usage, server status, etc.)

When the whole cluster is running, you can open hadoop-1:50070 in a browser to check the cluster status; it shows the same information as hdfs dfsadmin -report.

> hdfs dfsadmin -report

To stop the cluster, run the following command

> stop-yarn.sh
> stop-dfs.sh
> mr-jobhistory-daemon.sh stop historyserver

Because **$HADOOP_HOME/sbin** is in the PATH environment variable, you can also start or shut down the entire Hadoop cluster directly with the quick start and quick stop scripts:

start-all.sh
stop-all.sh

3. Run the test program in cluster mode

Running a job on the cluster works the same way as in pseudo-distributed mode. First create a user directory on HDFS:

> hdfs dfs -mkdir -p /user/hadoop

Copy the configuration files in /usr/local/hadoop-2.7.3/etc/hadoop to the distributed file system as the input files:

> hdfs dfs -mkdir input
> hdfs dfs -put /usr/local/hadoop-2.7.3/etc/hadoop/*.xml input
> hdfs dfs -ls /user/hadoop/input
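
Since dfs.replication was set to 2 earlier, you can optionally confirm that each uploaded block has two replicas with hdfs fsck (just a sanity check, not required for the tutorial):

> hdfs fsck /user/hadoop/input -files -blocks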

Then you can run the MapReduce job:

> hadoop jar /usr/local/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'

The output information of the runtime is similar to that of the pseudo-distributed one, showing the progress of the Job.

It may run a little slowly, but if you see no progress for five minutes, restart Hadoop and try again. If that still does not help, memory is probably insufficient; increase the VM memory or adjust the YARN memory configuration to resolve the problem.

You need to add the following to yarn-site.xml on all nodes in the cluster

> vim /usr/local/hadoop-2.7.3/etc/hadoop/yarn-site.xml

<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>20480</value>
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>2048</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.1</value>
</property>
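
The new memory values only take effect after YARN has been restarted, so after editing the file on every node, restart YARN from the Master:

> stop-yarn.sh
> start-yarn.sh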

If you are using Alibaba Cloud servers, there is a big pitfall: the default /etc/hosts written by Alibaba Cloud contains entries that prevent Hadoop from running normally.

If a "failed on connection exception" error still occurs, comment out the following entries in /etc/hosts:

#127.0.0.1 izbp1cvz54m4x8i9l5clyiz
#127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1       localhost localhost.localdomain localhost6 localhost6.localdomain6

The task progress can also be viewed in the web interface: click the History link in the "Tracking UI" column to see the run details of the task.
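
The web interface referred to here is the YARN ResourceManager UI, which listens on port 8088 by default (http://hadoop-1:8088 with this setup). If you prefer the command line, running applications can also be listed with the standard YARN CLI:

> yarn application -list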

Viewing the Processing result

> hdfs dfs -cat output/*

4. Summary

In this section we successfully set up a cluster that distributes task processing and shares data storage across nodes. In the next section we will cover some topics related to Hadoop maintenance. Finally, thank you for your support, and you are welcome to get in touch and discuss!

Note: my knowledge is limited, so if anything here is wrong I hope you will point it out; I also look forward to exchanging ideas!