Hadoop in action (1)

The author | WenasWei

Preface

After introducing the overall architecture of Hadoop, the offline batch-processing technology, we now move on to installing, configuring, and using Hadoop. The following topics are covered:

  • Configuring and installing Hadoop for a Linux environment
  • Introduction to the three installation modes of Hadoop
  • Local mode installation
  • Pseudo cluster mode installation

1. Linux environment configuration and Hadoop installation

Before Hadoop can be used on Linux, some basic configuration is required: adding a Hadoop user group and user, setting up password-free (SSH) login, and installing the JDK.

1.1 Ubuntu network configuration in VMware

You can follow this shared blog post to configure the system. CSDN link: Ubuntu Network Configuration in VMware.

It includes the following important operation steps (a command sketch follows the list):

  • Ubuntu system information and hostname change
  • Windows: set up VMware’s NAT network
  • Linux: gateway setup and static IP configuration
  • Linux: modify the hosts file
  • Linux: password-free login
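
The referenced post covers the details; as a minimal command sketch (assuming Ubuntu 18.04, the hostname hadoop1, and the illustrative static IP 192.168.254.130 that also appears later in this article), the hostname, hosts-file, and password-free login steps look roughly like this:

$ hostnamectl set-hostname hadoop1                  # change the hostname (value is illustrative)
$ echo "192.168.254.130 hadoop1" >> /etc/hosts      # map the static IP to the hostname
$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa          # generate an SSH key pair
$ ssh-copy-id hadoop1                               # authorize the key for password-free login
$ ssh hadoop1                                       # should now log in without a password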

1.2 Hadoop user groups and user additions

1.2.1 Add Hadoop User Groups and Users

Log in to the Ubuntu 18.04 virtual machine as root and execute the commands:

$ groupadd hadoop
$ useradd -r -g hadoop hadoop
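
To confirm that the group and user were created, you can check with the id command (the numeric IDs shown are illustrative):

$ id hadoop
uid=999(hadoop) gid=999(hadoop) groups=999(hadoop)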
1.2.2 Give Hadoop user directory permissions

Give the hadoop user ownership of the /usr/local, /tmp, and /home directories as follows:

$ chown -R hadoop.hadoop /usr/local/
$ chown -R hadoop.hadoop /tmp/
$ chown -R hadoop.hadoop /home/
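
To verify the ownership change, list the directories; the owner and group should now read hadoop hadoop (link counts and dates are illustrative):

$ ls -ld /usr/local /tmp /home
drwxr-xr-x 10 hadoop hadoop 4096 Jun  1 10:00 /usr/local
drwxrwxrwt 12 hadoop hadoop 4096 Jun  1 10:00 /tmp
drwxr-xr-x  3 hadoop hadoop 4096 Jun  1 10:00 /home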
1.2.3 Give the hadoop user sudo permissions

Edit the /etc/sudoers file (preferably with visudo) and add the line hadoop ALL=(ALL:ALL) ALL under root ALL=(ALL:ALL) ALL:

$ vi /etc/sudoers

Defaults        env_reset
Defaults        mail_badpass
Defaults        secure_path="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin"
root    ALL=(ALL:ALL) ALL
hadoop  ALL=(ALL:ALL) ALL
%admin ALL=(ALL) ALL
%sudo   ALL=(ALL:ALL) ALL
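
To verify the sudo permission, you can switch to the hadoop user and run a command through sudo (a minimal check; output is illustrative):

$ su - hadoop
$ sudo whoami
root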
1.2.4 Set a login password for the hadoop user
$ passwd hadoop
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully

1.3 JDK installation

To install the JDK on Linux, refer to the shared blog post “Logstash - Data Flow Engine” (Section 3.2: Installing the JDK on Linux) and configure the installation on each host. CSDN link: Logstash - Data Flow Engine
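
The referenced post has the full procedure; as a minimal sketch (assuming an Oracle JDK 8 tarball named jdk-8u152-linux-x64.tar.gz, which is illustrative), the JDK is unpacked to /usr/local/java so that it matches the JAVA_HOME used later in this article:

$ mkdir -p /usr/local/java
$ tar -zxvf jdk-8u152-linux-x64.tar.gz -C /usr/local/java    # unpacks to /usr/local/java/jdk1.8.0_152
$ /usr/local/java/jdk1.8.0_152/bin/java -version             # verify the unpacked JDK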

1.4 Download Hadoop from the official website

Download page: https://hadoop.apache.org/rel… (choose the Binary download)

  • Use wget to download (the download directory is the current directory):

For example, version 3.3.0: https://mirrors.bfsu.edu.cn/a…

$ wget https://mirrors.bfsu.edu.cn/apache/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
  • Move the archive to the target folder /usr/local and extract it:

    $ mv ./hadoop-3.3.0.tar.gz /usr/local
    $ cd /usr/local
    $ tar -zxvf hadoop-3.3.0.tar.gz

1.5 Configure the Hadoop environment

  • Add the following to /etc/profile:

    export JAVA_HOME=/usr/local/java/jdk1.8.0_152
    export JRE_HOME=/usr/local/java/jdk1.8.0_152/jre
    export CLASSPATH=$CLASSPATH:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
    export HADOOP_HOME=/usr/local/hadoop-3.3.0
    export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH:$HOME/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
  • Make the configuration take effect:

    $ source /etc/profile 
  • Check whether the Hadoop configuration was successful:

    $ hadoop version
    Hadoop 3.3.0
    Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r aa96f1871bfd858f9bac59cf2a81ec470da649af
    Compiled by brahma on 2020-07-06T18:44Z
    Compiled with protoc 3.7.1
    From source with checksum 5dc29b802d6ccd77b262ef9d04d19c4
    This command was run using /usr/local/hadoop-3.3.0/share/hadoop/common/hadoop-common-3.3.0.jar

    As the output shows, the Hadoop version is 3.3.0, which indicates that the Hadoop environment has been installed and configured successfully.

2. Introduction to the three installation modes of Hadoop

Hadoop provides three different installation modes, which are stand-alone mode, pseudo-cluster mode, and cluster mode.

2.1 Stand-alone mode

Stand-alone mode (local mode) is Hadoop’s default, non-distributed mode. It can run non-distributed jobs as a single Java process without any extra configuration, which makes it convenient for debugging, tracing, and troubleshooting; you only need to configure JAVA_HOME in Hadoop’s hadoop-env.sh file.

In stand-alone mode, Hadoop programs are run with the hadoop jar command and the results are written directly to the local disk.

2.2 Pseudo cluster mode

In pseudo-cluster mode, Hadoop runs in a pseudo-distributed manner on a single node (a single point of failure). The Hadoop daemons run as separate Java processes, and the node acts as both the NameNode and a DataNode while reading files from HDFS. Logically it provides the same running environment as cluster mode, but physically the pseudo-cluster can be deployed on a single server, whereas cluster mode must be deployed on multiple servers to achieve a fully distributed physical cluster.

In pseudo-cluster mode, besides configuring JAVA_HOME in Hadoop’s hadoop-env.sh file, you also need to configure the file system used by Hadoop, the number of HDFS replicas, the YARN address, and SSH password-free login on the server. In pseudo-cluster mode, Hadoop programs are run with the hadoop jar command and the results are output to HDFS.

2.3 Cluster mode

Cluster mode, also called fully distributed mode, is essentially different from pseudo-cluster mode: it is a completely distributed cluster deployed on multiple physical servers, while pseudo-cluster mode is only logically a cluster and is deployed on a single physical server.

A production environment requires high reliability and high availability of the Hadoop environment; otherwise a single node failure can make the whole cluster unavailable. At the same time, the data in a production environment must be reliable and recoverable after a data node fails or data is lost. This is why a production environment must deploy Hadoop in cluster mode to meet these requirements.

Deployment in cluster mode is the most complex of the three installation modes. It requires multiple physical servers and planning the server environment in advance. In addition to configuring the file system used by Hadoop, the number of HDFS replicas, and the YARN addresses, you also need to configure SSH password-free login between servers, RPC communication between Hadoop nodes, the NameNode failover mechanism, HA high availability, and so on, as well as install and configure ZooKeeper, a distributed coordination service.

In cluster mode, Hadoop programs are run with the hadoop jar command and the results are output to HDFS.

3. Stand-alone mode installation

3.1 Modify the Hadoop configuration file

In stand-alone mode, modify the Hadoop configuration file hadoop-env.sh and add the Java environment path:

$ vi /usr/local/hadoop-3.3.0/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/local/java/jdk1.8.0_152

3.2 Create test data files

  • Create directory /home/hadoop/input:

    $mkdir -p /home/hadoop/input
  • Create test data file data.input:

    $ cd /home/hadoop/input/
    $ vi data.input

    hadoop mapreduce hive flume hbase spark storm flume sqoop hadoop hive kafka spark hadoop storm

3.3 Running Hadoop test cases

Run the MapReduce sample program that comes with Hadoop and count the number of words in the specified file.

  • Run the MapReduce program command with Hadoop:

    $ hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount /home/hadoop/input/data.input /home/hadoop/output
  • The generic format is described as follows:

    • hadoop jar: runs a MapReduce program from the Hadoop command line;
    • /usr/local/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar: the full path of the JAR that contains Hadoop’s built-in MapReduce example programs;
    • wordcount: selects the word-count program, because hadoop-mapreduce-examples-3.3.0.jar contains multiple MapReduce programs.
  • The parameters are described below.

    • /home/hadoop/input/data.input: the full local path of the input data file;
    • /home/hadoop/output: the output directory. It must not be created in advance; the Hadoop program creates it.
  • Successful execution results:

    2021-06-02 01:08:40,374 INFO mapreduce.Job: Job job_local794874982_0001 completed successfully
  • View file results

View the /home/hadoop/output folder and generated files as follows:

$ cd /home/hadoop/output
$ ll
total 20
drwxr-xr-x 2 root root 4096 Jun  2 01:08 ./
drwxr-xr-x 4 root root 4096 Jun  2 01:08 ../
-rw-r--r-- 1 root root   76 Jun  2 01:08 part-r-00000
-rw-r--r-- 1 root root   12 Jun  2 01:08 .part-r-00000.crc
-rw-r--r-- 1 root root    0 Jun  2 01:08 _SUCCESS
-rw-r--r-- 1 root root    8 Jun  2 01:08 ._SUCCESS.crc

View the statistics file part-r-00000:

flume    2
hadoop    3
hbase    1
hive    2
kafka    1
mapreduce    1
spark    2
sqoop    1
storm    2

4. Pseudo-cluster mode installation

In pseudo-cluster mode, Hadoop runs in a pseudo-distributed way on a single node. The Hadoop daemons run as separate Java processes, and the node acts as both the NameNode and a DataNode while reading files from HDFS. This mode is configured through core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml; each configuration item is declared as a property name and value.

4.1 Pseudo cluster file configuration

For the configuration of the Hadoop pseudo-cluster mode, in addition to the hadoop-env.sh file, the following four files need to be configured: core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. Each file is in the same directory as hadoop-env.sh. The functions of each file are as follows:

4.1.1 core-site.xml

Specifies the location of the NameNode. hadoop.tmp.dir is the base directory that the Hadoop file system depends on, and many paths are derived from it. If hdfs-site.xml does not configure the storage locations of the NameNode and DataNode, they are placed under this path by default.

  • The core-site.xml configuration file is as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>hadoop.tmp.dir</name>
            <value>/usr/local/hadoop-3.3.0/tmp</value>
            <description>Abase for other temporary directories.</description>
        </property>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://hadoop1:9000</value>
        </property>
    </configuration>

    Note: hadoop1 is the configured hostname.

4.1.2 hdfs-site.xml

Configure the paths where the NameNode and DataNode store their files, and the number of replicas.

  • The hdfs-site.xml configuration file is as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>/usr/local/hadoop-3.3.0/tmp/dfs/name</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>/usr/local/hadoop-3.3.0/tmp/dfs/data</value>
        </property>
    </configuration>

    Note: pseudo-distributed mode has only one node, so dfs.replication must be set to 1 (cluster mode uses at least 3 nodes). In addition, the storage locations of the NameNode and DataNode are configured here.

4.1.3 mapred-site.xml

In earlier Hadoop versions this file does not exist and has to be created by renaming mapred-site.xml.template. It configures whether MapReduce jobs are submitted to the YARN cluster or executed locally with the local job runner, and it points the MapReduce processes at ${HADOOP_HOME} via HADOOP_MAPRED_HOME.

  • The mapred-site.xml configuration file is as follows:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
        <property>
            <name>yarn.app.mapreduce.am.env</name>
            <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
        </property>
        <property>
            <name>mapreduce.map.env</name>
            <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
        </property>
        <property>
            <name>mapreduce.reduce.env</name>
            <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
        </property>
    </configuration>
4.1.4 yarn-site.xml

Configure the hostname of the node where the ResourceManager runs, and the list of auxiliary services run by the NodeManager.

  • The yarn-site.xml configuration file is as follows:

    <?xml version="1.0"?>
    <configuration>
        <!-- Site specific YARN configuration properties -->
        <property>
            <name>yarn.resourcemanager.hostname</name>
            <value>hadoop1</value>
        </property>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
    </configuration>

4.2 Formatting NameNode and Starting Hadoop

4.2.1 Allow the scripts to run as the root account

The script directory is /usr/local/hadoop-3.3.0/sbin. The following scripts need to be allowed to run as root: start-dfs.sh, start-yarn.sh, stop-dfs.sh, and stop-yarn.sh.

  • (1) start-dfs.sh and stop-dfs.sh start and stop the HDFS processes, respectively. Add the following at the top of both scripts:

    HDFS_DATANODE_USER=root
    HADOOP_SECURE_DN_USER=hdfs
    HDFS_NAMENODE_USER=root
    HDFS_SECONDARYNAMENODE_USER=root
  • (2) start-yarn.sh and stop-yarn.sh start and stop the YARN processes, respectively. Add the following at the top of both scripts:

    YARN_RESOURCEMANAGER_USER=root
    HADOOP_SECURE_DN_USER=yarn
    YARN_NODEMANAGER_USER=root
4.2.2 Format the NameNode
  • To format the NameNode command:

    $ hdfs namenode -format 

The NameNode was formatted successfully when the following message appears in the output:

INFO common.Storage: Storage directory /usr/local/hadoop-3.3.0/tmp/dfs/name has been successfully formatted.
4.2.3 Start Hadoop
  • (1) Start HDFS

Execute a script from the command line to start HDFS:

$ sh start-dfs.sh

View the processes with jps:

$ jps 
13800 Jps
9489 NameNode
9961 SecondaryNameNode
9707 DataNode
  • (2) Start YARN

Start YARN by executing a script from the command line:

$ sh start-yarn.sh

View the processes with jps:

$ jps 
5152 ResourceManager
5525 NodeManager
13821 Jps
4.2.4 View Hadoop node information

There are two ways to verify the successful startup of the Hadoop pseudo-cluster mode: one is to check whether the NameNode state is “active” in the browser through the interface exposed by Hadoop, and the other is to execute the MapReduce program to verify that the installation and startup is successful.

Enter the address in the browser to access:

http://192.168.254.130:9870/

The web interface is shown in the figure below; the node is in the “active” state:
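
Besides the web interface, you can also confirm from the command line that HDFS is up and the DataNode has registered; a minimal check (output abbreviated and illustrative):

$ hdfs dfsadmin -report
...
Live datanodes (1):
...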

4.3 Run the MapReduce program to verify the environment setup

Running the MapReduce program to verify the environment setup involves the following steps:

  • Create the input file directory on HDFS
  • Upload the data file to HDFS
  • Execute the MapReduce program
4.3.1 Create the input file directory on HDFS

Create a new /data/input directory on HDFS, as follows:

$ hadoop fs -mkdir /data
$ hadoop fs -mkdir /data/input
$ hadoop fs -ls /data/

Found 1 items
drwxr-xr-x   - root supergroup          0 2021-06-05 11:11 /data/input
4.3.2 Upload data files to HDFS

Upload the local data file data.input to the /data/input directory in HDFS:

$ hadoop fs -put /home/hadoop/input/data.input /data/input
$ hadoop fs -ls /data/input
Found 1 items
-rw-r--r--   1 root supergroup        101 2021-06-05 11:11 /data/input/data.input
$ hadoop fs -cat /data/input/data.input
hadoop mapreduce hive flume hbase spark storm flume sqoop hadoop hive kafka spark hadoop storm
4.3.3 Execute the MapReduce program
  • Run the WordCount program that comes with Hadoop, and the specific commands are as follows:

    $ hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount /data/input/data.input /data/output
  • View the output results in the /data/output directory:

    $ hadoop fs -ls /data/output
    
    Found 2 items
    -rw-r--r--   1 root supergroup          0 2021-06-05 11:19 /data/output/_SUCCESS
    -rw-r--r--   1 root supergroup         76 2021-06-05 11:19 /data/output/part-r-00000
    
    $ hadoop fs -cat /data/output/part-r-00000
    
    flume    2
    hadoop    3
    hbase    1
    hive    2
    kafka    1
    mapreduce    1
    spark    2
    sqoop    1
    storm    2

    Each word and its count in the test data file are output correctly in the part-r-00000 file, which shows that Hadoop in pseudo-cluster mode correctly writes the MapReduce results to HDFS.

END

This article mainly covers the groundwork for the subsequent deployment of Hadoop and other big data components, including the most important operations: setting a static network IP, changing the hostname, and configuring password-free login. The next article will introduce the Hadoop cluster mode installation. Welcome to follow the WeChat public account: attack dream clear. I am a migrant worker in the tide of the Internet, and I hope to learn and make progress with you. I believe that the more you know, the more you realize you don’t know.

Reference documents:

  • [1] RongT. Blog Garden: https://www.cnblogs.com/tanro… , 2019-04-02.
  • [2] Hadoop official website: https://hadoop.apache.org/
  • [3] Bing’an. Massive Data Processing and Big Data Technology in Practice [M]. 1st edition. Beijing: Peking University Press, 2020-09.