Recently, someone asked me to post some content about big data. No problem! Today we start with the installation environment and build our own learning environment.

There are three ways to build and use Hadoop:

  • Stand-alone mode, suitable for development and debugging;
  • Pseudo-distributed mode, suitable for simulating a cluster while learning;
  • Fully distributed mode, for production environments.

This document describes how to build a fully distributed Hadoop cluster with one master node and two data nodes.

Prerequisites

  1. Prepare three servers

Virtual machines, physical machines, or cloud instances can all be used. Here, three instances in an OpenStack private cloud are used for installation and deployment.

  2. Operating system and software versions

Server  System              Memory  IP            Role    JDK            Hadoop
node1   Ubuntu 18.04.2 LTS  8G      10.101.18.21  master  JDK 1.8.0_222  Hadoop 3.2.1
node2   Ubuntu 18.04.2 LTS  8G      10.101.18.8   slave1  JDK 1.8.0_222  Hadoop 3.2.1
node3   Ubuntu 18.04.2 LTS  8G      10.101.18.24  slave2  JDK 1.8.0_222  Hadoop 3.2.1
  3. Install the JDK on all three machines

Since Hadoop is written in Java, a Java environment needs to be installed on each machine. I used JDK 1.8.0_222 here (Sun JDK is recommended).

Installation command:

sudo apt install openjdk-8-jdk-headless

To configure the Java environment variables, add the following to the bottom of the .profile file in the current user's home directory:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

Use the source command to make the changes take effect immediately:

source .profile
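To confirm that the JDK is installed and visible on the PATH, you can print its version; the output should mention 1.8.0_222, though the exact build string may differ:

java -version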
  4. Host configuration

Modify the hosts files of the three servers

vim /etc/hosts

Add the following entries, adjusted for your own server IPs:

10.101.18.21 master
10.101.18.8 slave1
10.101.18.24 slave2
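To verify that the hostnames resolve correctly, you can ping each node by name from any of the servers:

ping -c 1 master
ping -c 1 slave1
ping -c 1 slave2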

Passwordless SSH login configuration

  1. Generate the key pair
ssh-keygen -t rsa
  2. Allow the Master to log in to the slaves without a password
ssh-copy-id -i ~/.ssh/id_rsa.pub master
ssh-copy-id -i ~/.ssh/id_rsa.pub slave1
ssh-copy-id -i ~/.ssh/id_rsa.pub slave2
  3. Test passwordless login
ssh master 
ssh slave1
ssh slave2

Building Hadoop

We download the Hadoop package on the Master node, modify the configuration, and then copy it to the Slave nodes for minor modifications.

  1. Download the installation package and create a Hadoop directory
# Download
wget http://apache.claz.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
# Unzip to /usr/local
sudo tar -xzvf hadoop-3.2.1.tar.gz -C /usr/local
# Change the owner of the Hadoop files
sudo chown -R ubuntu:ubuntu /usr/local/hadoop-3.2.1
# Rename the folder
sudo mv /usr/local/hadoop-3.2.1 /usr/local/hadoop
  2. Configure Hadoop environment variables for the Master node

As with JDK environment variables, edit the.profile file in the user directory to add Hadoop environment variables:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin 
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

Execute source .profile for the changes to take effect immediately.
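To confirm that the Hadoop binaries are now on the PATH, you can print the version; it should report Hadoop 3.2.1:

hadoop version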

  3. Configuring the Master node

All Hadoop components are configured with XML files, which live in the /usr/local/hadoop/etc/hadoop directory:

  • core-site.xml: configures common properties, such as I/O settings used by HDFS and MapReduce
  • hdfs-site.xml: Hadoop daemon configuration, including the NameNode, secondary NameNode, and DataNodes
  • mapred-site.xml: MapReduce daemon configuration
  • yarn-site.xml: resource scheduling configuration

A. Modify core-site.xml as follows:

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
</configuration>

Parameter Description:

  • fs.defaultFS: the default file system URI; HDFS clients need this to access HDFS
  • hadoop.tmp.dir: the base temporary directory for Hadoop data storage; other directories are derived from it

If hadoop.tmp.dir is not configured, the system uses the default temporary directory /tmp/hadoop-hadoop. That directory is wiped on every reboot, so you would have to re-run the format step; configure it to avoid such errors.
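As a quick illustration of fs.defaultFS, an unqualified HDFS path and the fully qualified URI refer to the same location, so the two commands below list the same directory (run them only after the cluster is started):

hadoop fs -ls /
hadoop fs -ls hdfs://master:9000/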

B. Edit hdfs-site.xml as follows:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/usr/local/hadoop/hdfs/name</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/usr/local/hadoop/hdfs/data</value>
    </property>
</configuration>

Parameter Description:

  • dfs.replication: the number of replicas of each data block
  • dfs.name.dir: the storage directory for NameNode files
  • dfs.data.dir: the storage directory for DataNode files
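If you prefer to create these storage directories up front (the format step and the DataNodes will otherwise create them on first start), a minimal sketch assuming the ubuntu user already owns /usr/local/hadoop:

# the name dir is used on the master and the data dir on the slaves; creating both everywhere is harmless
mkdir -p /usr/local/hadoop/hdfs/name /usr/local/hadoop/hdfs/data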

C. Edit mapred-site.xml as follows:

<configuration>
  <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>

D. Edit yarn-site.xml and make the following changes:

<configuration>
<!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
            <name>yarn.resourcemanager.hostname</name>
            <value>master</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME</value>
    </property>
</configuration>

E. Edit workers and modify as follows:

slave1
slave2

Configuring the Worker nodes

  1. Configuring the Slave Node

Package the Hadoop configured on the Master node and send it to the other two nodes:

# Package Hadoop
tar -czf hadoop.tar.gz /usr/local/hadoop
# Copy to the other two nodes
scp hadoop.tar.gz ubuntu@slave1:~
scp hadoop.tar.gz ubuntu@slave2:~

On the other nodes, unpack the Hadoop package to /usr/local:

sudo tar -xzvf hadoop.tar.gz -C /usr/local/

Configure Hadoop environment variables on Slave1 and Slave2:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin 
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

Start the cluster

  1. Format the HDFS file system

Go to the Hadoop directory on the Master node and perform the following operations:

bin/hadoop namenode -format

This formats the NameNode. It only needs to be done once, before the service is started for the first time; it does not need to be repeated later.

Part of the log is shown below (line 5 indicates that formatting succeeded):

2019-11-11 13:34:18.960 INFO util.GSet: VM type       = 64-bit
2019-11-11 13:34:18.960 INFO util.GSet: 0.029999999329447746% max memory 1.7 GB = 544.5 KB
2019-11-11 13:34:18.961 INFO util.GSet: capacity      = 2^16 = 65536 entries
2019-11-11 13:34:18.994 INFO namenode.FSImage: Allocated new BlockPoolId: BP-2017092058-10.101.18.21-1573450458983
2019-11-11 13:34:19.010 INFO common.Storage: Storage directory /usr/local/hadoop/hdfs/name has been successfully formatted.
2019-11-11 13:34:19.051 INFO namenode.FSImageFormatProtobuf: Saving image file /usr/local/hadoop/hdfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2019-11-11 13:34:19.186 INFO namenode.FSImageFormatProtobuf: Image file /usr/local/hadoop/hdfs/name/current/fsimage.ckpt_0000000000000000000 of size 401 bytes saved in 0 seconds .
2019-11-11 13:34:19.207 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2019-11-11 13:34:19.214 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
  2. Start the Hadoop cluster
sbin/start-all.sh

Problems and solutions during startup:

A. Error: master: rcmd: socket: Permission denied

Solution:

Run echo "ssh" > /etc/pdsh/rcmd_default

B. Error: JAVA_HOME is not set and could not be found.

Solution:

Modify hadoop-env.sh on all three nodes and add the following Java environment variable:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
  3. Run the jps command to check the running status

Output on the Master node:

19557 ResourceManager
19914 Jps
19291 SecondaryNameNode
18959 NameNode

Output on the Slave nodes:

18580 NodeManager
18366 DataNode
18703 Jps
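Optionally, you can verify that MapReduce jobs actually run on YARN by submitting the pi example that ships with Hadoop (a smoke test; the jar path below assumes the installation layout from the steps above):

# estimates pi with 2 map tasks and 10 samples per task
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 2 10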
  4. View the Hadoop cluster status
hadoop dfsadmin -report

View the results:

Configured Capacity: 41258442752 (38.42 GB)
Present Capacity: 5170511872 (4.82 GB)
DFS Remaining: 5170454528 (4.82 GB)
DFS Used: 57344 (56 KB)
DFS Used%: 0.00%
Replicated Blocks:
	Under replicated blocks: 0
	Blocks with corrupt replicas: 0
	Missing blocks: 0
	Missing blocks (with replication factor 1): 0
	Low redundancy blocks with highest priority to recover: 0
	Pending deletion blocks: 0
Erasure Coded Block Groups: 
	Low redundancy block groups: 0
	Block groups with corrupt internal blocks: 0
	Missing block groups: 0
	Low redundancy blocks with highest priority to recover: 0
	Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (2):

Name: 10.101.18.24:9866 (slave2)
Hostname: slave2
Decommission Status : Normal
Configured Capacity: 20629221376 (19.21 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 16919797760 (15.76 GB)
DFS Remaining: 3692617728 (3.44 GB)
DFS Used%: 0.00%
DFS Remaining%: 17.90%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon Nov 11 15:00:27 CST 2019
Last Block Report: Mon Nov 11 14:05:48 CST 2019
Num of Blocks: 0


Name: 10.101.18.8:9866 (slave1)
Hostname: slave1
Decommission Status : Normal
Configured Capacity: 20629221376 (19.21 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 19134578688 (17.82 GB)
DFS Remaining: 1477836800 (1.38 GB)
DFS Used%: 0.00%
DFS Remaining%: 7.16%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon Nov 11 15:00:24 CST 2019
Last Block Report: Mon Nov 11 13:53:57 CST 2019
Num of Blocks: 0
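Before shutting the cluster down, you can also run a quick functional check of HDFS by creating a directory, uploading a small file, and listing it (a sketch; any local file will do in place of ~/.profile):

hadoop fs -mkdir -p /test
hadoop fs -put ~/.profile /test
hadoop fs -ls /test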
  5. Shut down Hadoop
sbin/stop-all.sh

Viewing the Hadoop cluster status in a web browser

Enter http://10.101.18.21:9870 in your browser to open the HDFS (NameNode) web UI.

Enter http://10.101.18.21:8088 in your browser to open the YARN (ResourceManager) web UI.
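If you prefer the command line, you can confirm that both web UIs are listening with curl; receiving any HTTP response means the service is up:

curl -sI http://10.101.18.21:9870 | head -n 1
curl -sI http://10.101.18.21:8088 | head -n 1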
