Preparatory work

Dependencies

Hadoop depends on the Java environment, and different Hadoop versions correspond to different Java versions. Details are as follows

Hadoop version    Java version
2.7.x – 2.x       Java 7, Java 8
3.0 – 3.2         Java 8
3.3               Java 8, Java 11

The Apache Hadoop community uses OpenJDK for its build/test/release environment; other JDKs/JVMs should work fine, but OpenJDK is the safest choice.

Hadoop version selection

As of this writing, the latest version of Hadoop is 3.3.0, but as the announcement notes, 3.1.3 is the latest stable release. Choosing a stable version is recommended to avoid pitfalls.

This article also uses version 3.1.3 to build the Hadoop environment!

Installing Java

Since Hadoop 3.1.3 is selected, the corresponding OpenJDK should be Java 8

How to install and configure the Java environment for Linux? See Installing and configuring the Java environment for Linux

Installing Other Components

Install PDSH

sudo yum install -y pdsh

Installation and deployment

Download

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.1.3/hadoop-3.1.3.tar.gz

Unpack

tar -zxvf hadoop-3.1.3.tar.gz

Hadoop installation and deployment mode

Hadoop can be installed and deployed in a variety of modes

  • Single-machine mode: In this mode, no distributed file system exists and data is accessed from the local file system without any daemons. This mode is mainly used for developing and debugging the application logic of MapReduce programs.

  • Pseudo-distributed mode: Hadoop is installed on a node with a distributed file system. It is the same as fully distributed but has poor performance. It is usually used for personal testing.

  • Fully distributed mode: a Hadoop cluster consisting of multiple machines in a master-slave architecture. This mode is rarely used in production because the single NameNode is a single point of failure.

  • High-availability mode: solves the single point of failure of fully distributed mode. There are multiple NameNodes, but only one is active; the others are in standby. When the active NameNode fails, the cluster automatically fails over to a standby NameNode. This mode is often used in production, but it still has a drawback: only one NameNode serves requests at a time, so as data grows and the cluster expands, the pressure on that NameNode keeps increasing.

  • Federated mode: applies to very large clusters. Multiple NameNodes serve requests simultaneously, and each NameNode maintains the metadata for only part of the namespace.

Local mode

Local mode is mainly used for running and debugging during local development. After downloading and decompressing Hadoop, no configuration is needed; local mode is the default. In local mode all modules run in a single JVM process and use the local file system rather than HDFS.

To verify that local mode works correctly, we can test it with Hadoop's built-in word-count program

  1. Start by preparing a text file to be analyzed; add any content

vi ~/test.txt

2. Run Hadoop's MapReduce demo

~/hadoop-3.1.3/bin/hadoop jar ~/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount ~/test.txt ~/test

~/test is the output directory for the analysis results. Do not create it in advance; the job will fail if it already exists
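For intuition, what the wordcount job computes is roughly equivalent to this local shell pipeline (a sketch only, not how Hadoop actually runs it; the /tmp/wc-demo.txt file is a made-up example input):

```shell
# Map phase analogue: split input into one word per line;
# reduce phase analogue: count occurrences of each distinct word
printf 'hello world\nhello hadoop\n' > /tmp/wc-demo.txt
tr -s '[:space:]' '\n' < /tmp/wc-demo.txt | sort | uniq -c | sort -rn
```

Hadoop does the same tokenize-then-sum-per-key work, just distributed across mappers and reducers.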

If you don’t have enough memory on your machine, you may also get an OOM error

Solution: Lower the Hadoop maximum heap memory

vi ~/hadoop-3.1.3/etc/hadoop/hadoop-env.sh

Find `export HADOOP_HEAPSIZE=` and append below it `export HADOOP_HEAPSIZE=512` (my machine is small, so only 512 MB is allocated)
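The same edit can also be scripted. A sketch (the conf path is the one used throughout this guide; appending works because a later assignment in hadoop-env.sh overrides an earlier one when the script is sourced):

```shell
# Append a heap-size override to hadoop-env.sh
HADOOP_CONF="${HADOOP_CONF:-$HOME/hadoop-3.1.3/etc/hadoop}"
mkdir -p "$HADOOP_CONF"   # ensure the directory exists (no-op on a real install)
echo 'export HADOOP_HEAPSIZE=512' >> "$HADOOP_CONF/hadoop-env.sh"
tail -n 1 "$HADOOP_CONF/hadoop-env.sh"
```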

Run the preceding command again. If the job ID contains "local", the job ran in local mode.

View the output file

In local mode, MapReduce output is written to the local file system. If the _SUCCESS file is in the output directory, the job ran successfully; part-r-00000 is the output result file.
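The check can be scripted. A small sketch (the helper name is made up for illustration, not a Hadoop tool):

```shell
# Succeeds and prints the results only if the output directory contains the
# _SUCCESS marker that MapReduce writes on successful completion
check_job_output() {
    dir="$1"
    [ -f "$dir/_SUCCESS" ] || { echo "job failed or still running" >&2; return 1; }
    cat "$dir"/part-r-*
}
# Usage: check_job_output ~/test
```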

Pseudo-distributed mode

Pseudo-distributed mode simulates a distributed environment on a single machine and has all the functions of Hadoop

Five files need to be configured

  • ~/hadoop-3.1.3/etc/hadoop/hadoop-env.sh
  • ~/hadoop-3.1.3/etc/hadoop/core-site.xml
  • ~/hadoop-3.1.3/etc/hadoop/hdfs-site.xml
  • ~/hadoop-3.1.3/etc/hadoop/mapred-site.xml
  • ~/hadoop-3.1.3/etc/hadoop/yarn-site.xml

Configure hadoop-env.sh

First, look up the JAVA_HOME path

echo $JAVA_HOME

Configure the JAVA_HOME path

vi ~/hadoop-3.1.3/etc/hadoop/hadoop-env.sh

Find `export JAVA_HOME` and add below it

export JAVA_HOME=<your JAVA_HOME path>

Configure core-site.xml

vi ~/hadoop-3.1.3/etc/hadoop/core-site.xml

Configure the default HDFS access port and temporary data directory

<configuration>
    <property>
        <!-- Temporary directory where data is stored. Note that the path must be absolute; do not use the ~ symbol -->
        <name>hadoop.tmp.dir</name>
        <value>/root/hadoop-3.1.3/hdfs/tmp</value>
    </property>
    <property>
        <!-- HDFS access port -->
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

hadoop.tmp.dir is Hadoop's temporary directory; the NameNode data of HDFS is stored there. By default, hadoop.tmp.dir points to /tmp/hadoop-${user.name}, and if the operating system restarts, the system may clear all files in /tmp

Note that the Hadoop temporary directory ~/hadoop-3.1.3/hdfs/tmp in the example needs to be created in advance

mkdir -p ~/hadoop-3.1.3/hdfs/tmp
tree -C -fp ~/hadoop-3.1.3/hdfs/

Configure hdfs-site.xml

vi ~/hadoop-3.1.3/etc/hadoop/hdfs-site.xml

Configure the replication factor

<configuration>
    <property>
        <!-- Set the replication factor to 1, that is, no replication is performed -->
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Formatting HDFS

Formatting initializes the distributed file system: it generates the initial metadata for HDFS, and that metadata is stored on the NameNode.

When formatting, pay attention to the permissions of the hadoop.tmp.dir directory

~/hadoop-3.1.3/bin/hdfs namenode -format

Formatting succeeded if "has been successfully formatted" is displayed

After formatting succeeds, check whether the dfs directory exists in the directory specified by hadoop.tmp.dir in core-site.xml

tree -C -pf ~/hadoop-3.1.3/hdfs/tmp/

  • fsimage_* is a local file that persists the NameNode metadata checkpoint.
  • fsimage_*.md5 is a checksum file used to verify the integrity of fsimage_*.
  • seen_txid records the latest transaction ID.
  • VERSION file contents:
    • namespaceID: the unique ID of the NameNode
    • clusterID: the cluster ID; the NameNode and DataNode cluster IDs should be the same.

cat /root/hadoop-3.1.3/hdfs/tmp/dfs/name/current/VERSION
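For reference, the VERSION file looks roughly like this (the values below are illustrative examples only; your IDs and timestamps will differ):

```properties
#Mon Aug 10 10:00:00 CST 2020
namespaceID=1444245351
clusterID=CID-7f9c62a2-0000-0000-0000-example
cTime=1596938000000
storageType=NAME_NODE
blockpoolID=BP-123456789-127.0.0.1-1596938000000
layoutVersion=-64
```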

Start the cluster.

We have configured hadoop-env.sh, core-site.xml, and hdfs-site.xml and initialized HDFS. Now we are ready to start HDFS.

A complete HDFS system has the following components

  • A Namenode node
  • A SecondaryNameNode (runs alongside the NameNode above; note that it performs checkpointing and is not a hot standby)
  • N Datanodes

Start the namenode

~/hadoop-3.1.3/bin/hdfs --daemon start namenode

Start the secondarynamenode

~/hadoop-3.1.3/bin/hdfs --daemon start secondarynamenode

Start the datanode

~/hadoop-3.1.3/bin/hdfs --daemon start datanode

View the startup status

jps
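jps should now list NameNode, SecondaryNameNode, and DataNode. A hedged helper sketch for scripting this check (the function name is made up for illustration):

```shell
# Check that every required daemon name appears in the given `jps` output;
# prints the first missing one and fails, otherwise reports success
check_daemons() {
    out="$1"; shift
    for d in "$@"; do
        echo "$out" | grep -q "$d" || { echo "missing: $d"; return 1; }
    done
    echo "all daemons running"
}
# Usage (requires a JDK): check_daemons "$(jps)" NameNode SecondaryNameNode DataNode
```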

Create directories, upload files, and download files on HDFS

Create a directory on the HDFS

~/hadoop-3.1.3/bin/hdfs dfs -mkdir /demo

Upload local files to HDFS

~/hadoop-3.1.3/bin/hdfs dfs -put /root/test.txt /demo

/root/test.txt is the file to be uploaded, and /demo is the directory created in HDFS

View files in HDFS

~/hadoop-3.1.3/bin/hdfs dfs -cat /demo/test.txt

Download the HDFS file to the local PC

~/hadoop-3.1.3/bin/hdfs dfs -get /demo/test.txt /root/test

/demo/test.txt is the file to download from HDFS, and /root/test is a local directory that needs to be created in advance

Configure mapred-site.xml

If there is no mapred-site.xml file, check whether there is a configuration template file mapred-site.xml.template; copy this template to generate mapred-site.xml

vi ~/hadoop-3.1.3/etc/hadoop/mapred-site.xml

Set the framework used by MapReduce. Yarn is used in this example

<configuration>
    <property>
        <!-- Set MapReduce to use the YARN framework -->
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Configure yarn-site.xml

vi ~/hadoop-3.1.3/etc/hadoop/yarn-site.xml

Set the YARN auxiliary service to MapReduce's default shuffle handler

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

Start YARN

We have configured hadoop-env.sh, core-site.xml, mapred-site.xml, and yarn-site.xml. Now we only need to start YARN

The YARN system consists of ResourceManager and NodeManager

Start the resourcemanager

~/hadoop-3.1.3/bin/yarn --daemon start resourcemanager

In YARN, the ResourceManager manages and allocates all cluster resources centrally. It receives resource reports from each node's NodeManager and allocates resources to applications (via their ApplicationMasters) according to configured policies.

Start the nodemanager

~/hadoop-3.1.3/bin/yarn --daemon start nodemanager

View the startup status

jps

The YARN web UI listens on port 8088; you can view it at http://localhost:8088.

Running a MapReduce job

Hadoop's share directory ships with a jar containing small MapReduce examples, located at hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar. You can run these examples to try out the newly built Hadoop platform. Here we run the classic WordCount example to test it

Check the startup status of all services. Ensure that all services are started

~/hadoop-3.1.3/bin/yarn jar ~/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /demo/test.txt /demo/output

/demo/test.txt is a file in HDFS; if it does not exist, prepare it in advance. /demo/output is the output directory in HDFS; if it already exists, delete it first

hadoop fs -rm -r /demo/output

If you see "Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster", you need to configure the Hadoop classpath in the yarn-site.xml file

View the Hadoop classpath

~/hadoop-3.1.3/bin/hadoop classpath

Configure hadoop classpath

vi ~/hadoop-3.1.3/etc/hadoop/yarn-site.xml
<configuration>
    <property>
        <name>yarn.application.classpath</name>
        <value>output of the hadoop classpath command</value>
    </property>
</configuration>

Try executing the WordCount instance again

~/hadoop-3.1.3/bin/yarn jar ~/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /demo/test.txt /demo/output

No error this time; check the output directory

~/hadoop-3.1.3/bin/hdfs dfs -ls /demo/output

Fully distributed mode

Pseudo-distributed mode can be set up on a single VPS, while fully distributed mode requires multiple VPSes (at least three). Please refer to my other article on building a fully distributed Hadoop system for details