
Apache version

Open the Hadoop official documentation for your version (here hadoop.apache.org/docs/r2.10….). It is best to work from the documentation that matches your version to avoid compatibility problems.
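
The rest of this article assumes that the Hadoop 2.10.0 binary package has already been downloaded and unpacked; a minimal sketch (the archive URL and target directory are assumptions, adjust to your environment):

$ cd /home/justin/env
$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz
$ tar -xzf hadoop-2.10.0.tar.gz    # produces /home/justin/env/hadoop-2.10.0
$ cd hadoop-2.10.0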

Installation and configuration

  1. Several Linux machines (three CentOS 7.6 hosts in this article); a combined setup sketch for steps 1–4 appears after this list.

    1. Cluster time synchronization with NTP (you can get by without it, but you really should set it up).

    2. Configure the JDK.

    3. Configure the hosts file (map each hostname to its IP on every machine).

    4. Install the required software: SSH (required for cluster deployment) and rsync (required for cluster configuration synchronization).

    5. Configure passwordless SSH. If ssh localhost prompts for a password, set up key-based authentication:

      $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
      $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
      $ chmod 0600 ~/.ssh/authorized_keys
  2. Hadoop package (this article uses hadoop-2.10.0.tar.gz)

    1. Configure etc/hadoop/hadoop-env.sh: change export JAVA_HOME=${JAVA_HOME} to point at your actual JDK path.

    2. Configure HADOOP_HOME (optional, but convenient):

      export HADOOP_HOME=/home/justin/env/hadoop-2.10.0
      export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
  3. The Hadoop configuration is complete (but not started yet). You can use the Hadoop command to test:

[justin@hadoop01]$ hadoop
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
  CLASSNAME            run the class named CLASSNAME
 or
  where COMMAND is one of:
  fs                   run a generic filesystem user client
....
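
A rough sketch of steps 1–4 from the machine preparation list above, assuming CentOS 7 with yum (the package names, JDK path, NTP server, and IPs are assumptions; adjust to your machines):

# Time synchronization and required tools
$ sudo yum install -y ntp ntpdate rsync openssh-clients
$ sudo ntpdate pool.ntp.org
# JDK: point JAVA_HOME at your installed JDK (the path below is only an example)
$ echo 'export JAVA_HOME=/usr/java/jdk1.8.0_202' >> ~/.bashrc
$ echo 'export PATH=$PATH:$JAVA_HOME/bin' >> ~/.bashrc
$ source ~/.bashrc
# Hosts file: every machine should resolve every hostname (the IPs are examples)
$ sudo tee -a /etc/hosts <<'EOF'
192.168.1.101 hadoop01
192.168.1.102 hadoop02
192.168.1.103 hadoop03
EOF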

Starting Hadoop

Hadoop has three startup modes:

  • Single-process mode (Local/Standalone Mode)
  • Pseudo-distributed mode
  • Fully distributed mode

Single-process mode

Everything runs as a single process on one node. This deployment mode is generally used for debugging programs before deploying them to a cluster.

This mode requires no additional configuration; it is simply the environment needed to run a JAR you have written. We can use the official example to check whether the environment is OK.

validation
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar grep input output 'dfs[a-z.]+'
$ cat output/*    # the basic environment is OK

Pseudo distributed

The Hadoop daemons all run on a single machine, each in a separate process.

Configure cluster.
  • Configure etc/hadoop/core-site.xml

    • <!-- Configure the HDFS NameNode address -->
      <property>
      	<name>fs.defaultFS</name>
      	<value>hdfs://hadoop01:9000</value>
      </property>
      <!-- By default Hadoop keeps its working files under the system tmp directory, so they disappear after a reboot; point this at a persistent folder -->
      <property>
      	<name>hadoop.tmp.dir</name>
      	<value>/data/my/tmp</value>
      </property>
    • For details, see core-default.xml

  • Configure etc/hadoop/hdfs-site.xml

    • <!-- Specify the number of HDFS replicas -->
      <property>
      	<name>dfs.replication</name>
      	<value>1</value>
      </property>
    • For details, check hdfs-default.xml

  • Format the NameNode (only before the first startup; do not format again afterwards)

    • $ bin/hdfs namenode -format
    • Formatting the NameNode generates a new cluster ID. If the NameNode and DataNodes end up with inconsistent cluster IDs, the cluster cannot find its old data. So if you ever need to format the NameNode again, delete the data and log directories first (a cleanup sketch follows this list).

  • Start the NameNode and DataNode

    • $ sbin/start-dfs.sh
      # They can also be started separately:
      $ sbin/hadoop-daemon.sh start namenode
      $ sbin/hadoop-daemon.sh start datanode
  • stop

    • $ sbin/stop-dfs.sh
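
The cleanup mentioned above (before re-formatting the NameNode), as a sketch; the paths follow this article's hadoop.tmp.dir and the default logs directory, and note that this wipes the existing HDFS data:

$ sbin/stop-dfs.sh
$ rm -rf /data/my/tmp/* logs/*
$ bin/hdfs namenode -format
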
Verify the HDFS

If the NameNode or DataNode fails to start, check the log files in the logs folder.
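
For example, to inspect the NameNode log (the exact file name contains your user name and hostname, so it will differ):

$ ls logs/
$ tail -n 100 logs/hadoop-justin-namenode-hadoop01.log    # file name is an example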

Two ways:

  1. [justin@hadoop01]$ jps
    4899 DataNode
    5316 Jps
    5096 SecondaryNameNode
    4783 NameNode
  2. Browser visit http://hadoop01:50070/

  1. Run the MapReduce example test directly using HDFS
# Upload the test files into an input folder on HDFS
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/justin        # directories can only be created one level at a time
$ bin/hdfs dfs -mkdir /user/justin/input
$ bin/hdfs dfs -put etc/hadoop/*.xml /user/justin/input
# Run the official example MR program; the output directory must not already exist
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar grep /user/justin/input /user/justin/output 'dfs[a-z.]+'
# Check the result
$ bin/hdfs dfs -cat /user/justin/output/*
# Delete the result
$ hdfs dfs -rm -r /user/justin/output
Configuration of YARN
  1. Configure etc/hadoop/mapred-site.xml (copy it from mapred-site.xml.template).

    <!-- Specify that MR programs run on YARN -->
    <property>
    	<name>mapreduce.framework.name</name>
    	<value>yarn</value>
    </property>
    • For details, see mapred-default.xml
  2. Configure etc/hadoop/yarn-site.xml

    <!-- How the Reducer obtains data (shuffle) -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Specify the address of the YARN ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop01</value>
    </property>
    • For details, see yarn-default.xml
  3. Start ResourceManager and NodeManager (make sure NameNode and DataNode are already running)

    • $ sbin/start-yarn.sh
      # Or start them separately:
      $ sbin/yarn-daemon.sh start resourcemanager
      $ sbin/yarn-daemon.sh start nodemanager
  4. stop

    • $ sbin/stop-yarn.sh
Verify the YARN
  1. [justin@hadoop01]$ jps
    4899 DataNode
    6150 Jps
    5096 SecondaryNameNode
    5897 ResourceManager
    4783 NameNode
    5999 NodeManager
  2. http://hadoop01:8088/cluster

  1. Use YARN to run the MapReduce example test (perform the same example as MR above)

  1. The result can be viewed on the command line or in a browser.

Configure JobHistory

Now that everything is ready, there is one problem:

The History link is not available yet, because the JobHistory service still needs to be configured.

  1. Configure etc/hadoop/mapred-site.xml

    Any host can be specified here (preferably a relatively idle machine), but the JobHistory service must later be started on the very machine you specify, otherwise it will fail to start!

    <!-- JobHistory service address -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>hadoop01:10020</value>
    </property>
    <!-- JobHistory web UI address -->
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>hadoop01:19888</value>
    </property>
    <!-- Where finished job history is stored on HDFS -->
    <property>
        <name>mapreduce.jobhistory.done-dir</name>
        <value>/history/done</value>
    </property>
    <!-- Path for intermediate files of running MR jobs -->
    <property>
        <name>mapreduce.jobhistory.intermediate-done-dir</name>
        <value>/history/done_intermediate</value>
    </property>
  2. Restart YARN

$ sbin/stop-yarn.sh
$ sbin/start-yarn.sh
  1. Start the JobHistory service
$ sbin/mr-jobhistory-daemon.sh start historyserver
  1. Look at http://hadoop01:19888/jobhistory
  2. Execute a MapReduce job and click History to verify (if the page cannot be opened, check whether the URL's domain name is correct and whether the hostname is set in your hosts file).
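
If the page does not open, the machine running the browser usually cannot resolve hadoop01; add a hosts entry there (the IP below is an example):

# Linux/macOS: /etc/hosts    Windows: C:\Windows\System32\drivers\etc\hosts
192.168.1.101 hadoop01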

Configure log aggregation

When the log server configuration is complete, an error occurs when you click on the History panel of a Job to view logs:

Aggregation is not enabled. Try the nodemanager at hadoop01:27695
Or see application log at http://hadoop01:27695/node/application/application_1595579336183_0002

This requires us to enable log aggregation, so that we can see the details of a program run, which makes development and debugging easier. (A command-line way to fetch aggregated logs is sketched after the configuration steps below.)

  1. Configure yarn-site.xml

    <!-- Enable log aggregation -->
    <property>
    	<name>yarn.log-aggregation-enable</name>
    	<value>true</value>
    </property>
    <!-- Keep aggregated logs for 7 days -->
    <property>
    	<name>yarn.log-aggregation.retain-seconds</name>
    	<value>604800</value>
    </property>
  2. Restart the YARN and History services.
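
Once aggregation is enabled and the services are restarted, a job's logs can also be pulled from the command line with yarn logs (the application ID below is the one from the error message above):

$ bin/yarn logs -applicationId application_1595579336183_0002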

Fully distributed

For a real cluster, we will deploy Hadoop’s NameNode and DataNode on different machines.

Before configuration, you need to plan how to distribute these service nodes. In this example:

        hadoop01              hadoop02                        hadoop03
HDFS    NameNode, DataNode    DataNode                        SecondaryNameNode, DataNode
YARN    NodeManager           ResourceManager, NodeManager    NodeManager

Try to distribute the services evenly rather than stacking them on one machine. By default, a DataNode and a NodeManager run on every machine, managing storage and compute resources respectively.

Ensure that hadoop01 can log in to the other machines over SSH without a password (it is where start-dfs.sh is run).

If prompted "Are you sure you want to continue connecting (yes/no)?", answer yes.

If ResourceManager is configured on hadoop02, hadoop02 must likewise be able to SSH into the other two machines.

For convenience, just set up passwordless SSH between all three machines; a sketch follows.
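
A sketch of wiring up passwordless SSH between the three machines: after generating a key pair on a machine (as shown earlier), push its public key to every host (assumes the justin account exists on all of them):

$ for host in hadoop01 hadoop02 hadoop03; do ssh-copy-id justin@$host; done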

  1. First configure the basic environment on hadoop01; for the details, refer to the pseudo-distributed section above.

    1. Set JAVA_HOME in hadoop-env.sh. In my tests, the other xxxx-env.sh files worked fine without setting JAVA_HOME.

    2. core-site.xml

      <!-- Specify the HDFS NameNode address (it runs on hadoop01) -->
      <property>
      	<name>fs.defaultFS</name>
      	<value>hdfs://hadoop01:9000</value>
      </property>
      <!-- Optional: hdfs-site.xml below also sets the name and data directories, so this is not strictly needed -->
      <property>
      	<name>hadoop.tmp.dir</name>
      	<value>/data/my/tmp</value>
      </property>
    3. hdfs-site.xml

      <!-- Set the SecondaryNameNode HTTP address -->
      <property>
      	<name>dfs.namenode.secondary.http-address</name>
      	<value>hadoop03:50090</value>
      </property>
      <!-- Where the NameNode stores its data -->
      <property>
      	<name>dfs.namenode.name.dir</name>
      	<value>/home/justin/env/hadoop-2.10.0/tmp_dfs/name</value>
      </property>
      <!-- Where the DataNode stores its data -->
      <property>
      	<name>dfs.datanode.data.dir</name>
      	<value>/home/justin/env/hadoop-2.10.0/tmp_dfs/data</value>
      </property>
      <!-- Optional: number of HDFS replicas (the default is 3) -->
      <property>
      	<name>dfs.replication</name>
      	<value>3</value>
      </property>
    4. yarn-site.xml

      <!-- How the Reducer obtains data (mapreduce_shuffle) -->
      <property>
      	<name>yarn.nodemanager.aux-services</name>
      	<value>mapreduce_shuffle</value>
      </property>
      <!-- Specify the machine on which ResourceManager runs -->
      <property>
      	<name>yarn.resourcemanager.hostname</name>
      	<value>hadoop02</value>
      </property>
      <!-- Enable log aggregation -->
      <property>
      	<name>yarn.log-aggregation-enable</name>
      	<value>true</value>
      </property>
      <!-- Keep aggregated logs for 7 days -->
      <property>
      	<name>yarn.log-aggregation.retain-seconds</name>
      	<value>604800</value>
      </property>
    5. mapred-site.xml

      <!-- Just configure MapReduce to run on YARN -->
      <property>
      	<name>mapreduce.framework.name</name>
      	<value>yarn</value>
      </property>
      <!-- JobHistory service address -->
      <property>
          <name>mapreduce.jobhistory.address</name>
          <value>hadoop01:10020</value>
      </property>
      <!-- JobHistory web UI address -->
      <property>
          <name>mapreduce.jobhistory.webapp.address</name>
          <value>hadoop01:19888</value>
      </property>
      <!-- Where finished job history is stored on HDFS -->
      <property>
          <name>mapreduce.jobhistory.done-dir</name>
          <value>/history/done</value>
      </property>
      <!-- Path for intermediate files of running MR jobs -->
      <property>
          <name>mapreduce.jobhistory.intermediate-done-dir</name>
          <value>/history/done_intermediate</value>
      </property>
    6. Slaves: vim etc/hadoop/slaves. The slaves file tells the cluster start-up scripts which nodes should run worker daemons.

      hadoop01
      hadoop02
      hadoop03
  2. Then copy the configured directory to the other machines.

    scp -r /home/justin/env  justin@hadoop02:/home/justin/
    scp -r /home/justin/env  justin@hadoop03:/home/justin/
    # Or use an xsync-style helper for synchronization (see the sketch after this list):
    rsync -av /home/justin/env/hadoop-2.10.0/ hadoop0X:/home/justin/env/hadoop-2.10.0/
  3. If the cluster is being started for the first time, format the NameNode (before formatting, empty the tmp and logs directories).

    bin/hdfs namenode -format
  4. Start the cluster

    Note: If NameNode and ResourceManager are not on the same machine, do not start YARN on the NameNode machine; start it on the machine where ResourceManager resides.

    # NameNode is on hadoop01 and ResourceManager is on hadoop02, so:
    hadoop01: start-dfs.sh
    hadoop02: start-yarn.sh
    # If both NameNode and ResourceManager were on hadoop01, start-all.sh could be used; otherwise ResourceManager fails to start
    start-all.sh
  5. Start the JobHistory service on the corresponding machine.

    $ sbin/mr-jobhistory-daemon.sh start historyserver
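
The rsync step above mentions an xsync helper; that is not part of Hadoop, just a small wrapper script around rsync. A minimal sketch (the worker hostnames and user are assumptions):

#!/bin/bash
# xsync: copy the given files/directories to the other cluster nodes with rsync
if [ $# -lt 1 ]; then
  echo "Usage: xsync <file-or-dir> [...]"
  exit 1
fi
for host in hadoop02 hadoop03; do
  echo "==== syncing to $host ===="
  for path in "$@"; do
    dir=$(dirname "$(readlink -f "$path")")   # parent directory of the source
    rsync -av "$path" "justin@$host:$dir/"    # the same parent directory must exist on the target
  done
done
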
validation

The running processes match the planned distribution:

[justin@hadoop01]$ jps
32176 Jps
32033 NodeManager
31064 NameNode
31208 DataNode

[justin@hadoop02]$ jps
10899 ResourceManager
11012 NodeManager
8581 DataNode
11421 Jps

[justin@hadoop03]$ jps
24960 DataNode
25953 NodeManager
25082 SecondaryNameNode
26124 Jps

Browser access:

http://hadoop01:50070/ (use NameNode IP)

http://hadoop02:8088/cluster (use the ResourceManager's address)

http://hadoop03:50090/status.html (see SecondaryNameNode)

If the page is empty, modify line 61 of share/hadoop/hdfs/webapps/static/dfs-dust.js as follows, rsync the modified file to all machines, and restart HDFS. Then clear the browser cache and refresh.

'date_tostring' : function (v) {
    //return moment(Number(v)).format('ddd MMM DD HH:mm:ss ZZ YYYY');
    return new Date(Number(v)).toLocaleString();
},