Preface

Hadoop itself is a distributed application, but for simple testing there is usually no need to build a cluster. Pseudo-distributed mode is essentially a single-machine configuration of Hadoop

Hadoop does not tolerate IP address changes, which means the IP used during project development must be the same as the one in the final running environment. If it changes, the entire configuration has to be redone. All operations below are performed as root

Mapping configuration

For the mapping to take effect, it is recommended to restart Linux by typing reboot
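The mapping referred to here is the hostname entry in /etc/hosts. A minimal sketch, assuming the IP 192.168.1.108 and the host name hadoopm used later in this article:

vim /etc/hosts

192.168.1.108   hadoopm     (fixed IP of this machine mapped to the Hadoop host name)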

Configure SSH login exemption

For details about how to configure SSH password-free login, see Step 5 in Hadoop installation and Setup
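For reference, a minimal sketch of the usual password-free login setup (the referenced article is authoritative; the key type and file names below are standard OpenSSH defaults):

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa      (generate a key pair without a passphrase)

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

chmod 600 ~/.ssh/authorized_keys

ssh hadoopm      (should now log in without asking for a password)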

Hadoop configuration

After SSH is configured, you can move on to the Hadoop-related configuration. All of the configuration files that need to be edited are located in /usr/local/hadoop/etc/hadoop

  • core-site.xml:
    Defines the core information of Hadoop, including the temporary directory and access addresses
  • hdfs-site.xml:
    Determines the number of file replicas (backups) and the path of the data folders
  • yarn-site.xml:
    Can be understood simply as the configuration for processing related jobs

Configure the core-site.xml file

cd /usr/local/

cd hadoop/etc/hadoop/

vim core-site.xml

  • Locate the
    <configuration>  </configuration>
  • Add the following code in the middle of the tag
    <property>

    <name>hadoop.tmp.dir</name>

    <value>/home/root/hadoop_tmp</value>

    <description>A base for other temporary directories.</description> <!-- optional; this line can be omitted -->

    </property>

    <property>

    <name>fs.defaultFS</name>

    <value>hdfs://hadoopm:9000</value>

    </property>


The hdfs://hadoopm:9000 address configured in this file describes the access path used by the management page opened later


<value>/home/root/hadoop_tmp</value>

The line above is the most important and easy to overlook. It configures the path for temporary files. If it is not configured, a tmp folder is generated inside the Hadoop directory ("/usr/local/hadoop/tmp"), its contents are cleared on every reboot, and your Hadoop environment becomes invalid.

  • To ensure correct operation, create the /home/root/hadoop_tmp directory directly
    mkdir -p /home/root/hadoop_tmp    (create it at the exact path configured above)

Note:

  • The environment used here is the Hadoop 2.x development line; the default port is 9000
  • If your environment is the Hadoop 1.x development line, the default port is 8020
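If you are unsure which address is actually in effect, you can query it once the configuration file is saved; a minimal check using the getconf sub-command available in Hadoop 2.x:

cd /usr/local/hadoop/bin

./hdfs getconf -confKey fs.defaultFS      (should print hdfs://hadoopm:9000)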

Configure the hdfs-site.xml file

HDFS is the most critical part of Hadoop

cd /usr/local/hadoop/etc/hadoop/

vim hdfs-site.xml

  • Locate the
    <configuration>  </configuration>
  • Add the following code in the middle of the tag

    <property>

    <name>dfs.replication</name>

    <value>1</value>

    </property>

    <property>

    <name>dfs.namenode.name.dir</name>

    <value>file:///usr/local/hadoop/dfs/name</value>

    </property>

    <property>

    <name>dfs.datanode.data.dir</name>

    <value>file:///usr/local/hadoop/dfs/data</value>

    </property>

    <property>

    <name>dfs.namenode.http-address</name>

    <value>hadoopm:50070</value>

    </property>

    <property>

    <name>dfs.namenode.secondary.http-address</name>

    <value>hadoopm:50090</value>

    </property>

    <property>

    <name>dfs.permissions.enabled</name>

    <value>false</value>

    </property>

  • dfs.replication:
    The number of replicas kept for each file; in production, three replicas are usually kept

  • dfs.namenode.name.dir:
    Defines the path of the NameNode data

  • dfs.datanode.data.dir:
    Defines the path of the DataNode (data file) storage

  • dfs.namenode.http-address:
    The HTTP address of the name service

  • dfs.namenode.secondary.http-address:
    The secondary NameNode (not very useful here, but required in a distributed cluster)

  • dfs.permissions.enabled:
    Permission checking; if it is left as true, file access may be blocked later
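Once the configuration file is saved, the same getconf sub-command can confirm that these values took effect; a minimal sketch:

cd /usr/local/hadoop/bin

./hdfs getconf -confKey dfs.replication              (expect 1)

./hdfs getconf -confKey dfs.namenode.http-address    (expect hadoopm:50070)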

Configure the yarn-site.xml file

Configure some corresponding node information

cd /usr/local/hadoop/etc/hadoop/

vim yarn-site.xml

  • Locate the
    <configuration>  </configuration>
  • Add the following code in the middle of the tag
    <property>

    <name>yarn.resourcemanager.admin.address</name>

    <value>hadoopm:8033</value>

    </property>

    <property>

    <name>yarn.nodemanager.aux-services</name>

    <value>mapreduce_shuffle</value>

    </property>

    <property>

    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>

    <value>org.apache.hadoop.mapred.ShuffleHandler</value>

    </property>

    <property>

    <name>yarn.resourcemanager.resource-tracker.address</name>

    <value>hadoopm:8025</value>

    </property>

    <property>

    <name>yarn.resourcemanager.scheduler.address</name>

    <value>hadoopm:8030</value>

    </property>

    <property>

    <name>yarn.resourcemanager.address</name>

    <value>hadoopm:8050</value>

    </property>

    <property>

    <name>yarn.resourcemanager.webapp.address</name>

    <value>hadoopm:8088</value>

    </property>

    <property>

    <name>yarn.resourcemanager.webapp.https.address</name>

    <value>hadoopm:8090</value>

    </property>


At this point, the core Hadoop configuration files have been set up
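Once the services have been started (see "Open the service" below), a minimal sanity check of the YARN side, using the addresses configured above:

cd /usr/local/hadoop/bin

./yarn node -list                 (should report one running NodeManager)

curl http://hadoopm:8088/         (ResourceManager web interface configured above)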

Open the service

Because Hadoop is a distributed development environment, and with a future cluster in mind, it is recommended to create a masters file in the /usr/local/hadoop/etc/hadoop directory and write the host name into it; the content is hadoopm (the host name defined earlier in the hosts file). In a standalone environment the file can be left out, but it is better to create it now so the future cluster configuration is easier to remember

cd /usr/local/hadoop/etc/hadoop

touch masters

vim masters

Just write hadoopm on the first line

Save the configuration and exit :wq!

Copy the code
  • Modify the slave-node file "slaves" to add hadoopm

    vim slaves

    This file may have localhost in the first line, do not delete it, just add hadoopm to it

    Save and exit with :wq!

  • The NameNode and DataNode data will all be stored under the Hadoop directory

    cd ..

    cd .. (back up from etc/hadoop to /usr/local/hadoop)

    Next, create the directories yourself

    mkdir dfs dfs/name dfs/data

Special note: if Hadoop runs into problems and needs to be reconfigured, make sure these two folders are completely wiped out; the new configuration will not take effect until they are cleaned
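A minimal clean-up sketch, under the paths used in this article (stop the services first):

cd /usr/local/hadoop/sbin && ./stop-all.sh

rm -rf /usr/local/hadoop/dfs/name/* /usr/local/hadoop/dfs/data/*

rm -rf /home/root/hadoop_tmp/*        (the temporary directory configured in core-site.xml)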

  • Formatting the file system
    source /etc/profile

    cd /usr/local/hadoop/bin

    ./hdfs namenode -format


If the formatting succeeds, a message similar to "INFO util.ExitUtil: Exiting with status 0" will appear.

  • Hadoop can then be started in the simplest way

    cd /usr/local/hadoop/sbin

    ./start-all.sh

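Note that in the Hadoop 2.x series start-all.sh is marked as deprecated and simply calls the two scripts below, so starting HDFS and YARN separately is equivalent:

cd /usr/local/hadoop/sbin

./start-dfs.sh       (NameNode, DataNode, SecondaryNameNode)

./start-yarn.sh      (ResourceManager, NodeManager)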
  • You can then use the jps command provided by the JDK to view all Java processes; if the following six processes are listed

    2848 DataNode

    2721 NameNode

    3266 ResourceManager

    3445 NodeManager

    3115 SecondaryNameNode

    3773 Jps


There are actually 5 Hadoop processes; Jps is the jps command's own process. If all six are present, the configuration is successful
  • You can test next, but for now you can only test whether HDFS is working properly
    You can access it directly by IP address: open a browser and enter

    http://192.168.1.108:50070/

    The IP here is the Ubuntu machine's own IP

    You can also open the following address to view the same information

    http://hadoopm:50070/
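Besides the web page, HDFS can be smoke-tested from the command line; a minimal sketch (the /test directory and uploaded file are arbitrary examples):

cd /usr/local/hadoop/bin

./hdfs dfs -mkdir -p /test                (create a directory in HDFS)

./hdfs dfs -put /etc/hostname /test/      (upload a small local file)

./hdfs dfs -ls /test                      (the uploaded file should be listed)

./hdfs dfs -cat /test/hostname            (print its contents back)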



  • Extension

    Start the services: start-all.sh

    Stop the services: stop-all.sh

    To execute these two commands, go to the following directory
    /usr/local/hadoop/sbin

    The executables in the Hadoop 2 series are split into two directories: commands under bin can be run by ordinary users, while commands under sbin should only be run by the superuser.

    If you want to use the hadoopm name from outside the virtual machine (for example from Windows), you need to modify the local hosts file and add the same mapping

    Path: C:\Windows\System32\drivers\etc

    To be able to edit the hosts file, grant the Users group full control over it

    Append the following line:

    192.168.1.108 hadoopm

    Save the file

    The downside of this approach is that if the virtual machine's IP address changes, hadoopm can no longer be resolved from outside the virtual machine, and the hosts file has to be modified again


Missing processes when checking with jps

  • Fault 1: no DataNode process is found when viewing processes with jps

    Cause: ./hdfs namenode -format generates a new namespaceID for the NameNode, but the data under /usr/local/hadoop/dfs still carries the old namespaceID, and the DataNode cannot start because the two IDs are inconsistent

    Solution:

    Delete the contents of the "name" and "data" folders under /usr/local/hadoop/dfs

    In the sbin directory, run stop-all.sh to stop all services

    Then go to the bin directory and format the file system again

    ./hdfs namenode -format

    Go back to the sbin directory and start all services with start-all.sh

    Now check the processes with jps again and the DataNode process will be back

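To confirm the mismatch before deleting anything, you can compare the IDs recorded on disk; a minimal sketch (depending on the exact 2.x release, the relevant field in the VERSION files is namespaceID and/or clusterID):

cat /usr/local/hadoop/dfs/name/current/VERSION

cat /usr/local/hadoop/dfs/data/current/VERSION

The DataNode refuses to start when these IDs do not match.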
  • How to prevent problem 1 above

    When re-formatting the file system, enter N when the following prompt is displayed

    Re-format filesystem in Storage Directory /usr/local/hadoop/dfs/name ? (Y or N) Invalid input: 

    Re-format filesystem in Storage Directory /usr/local/hadoop/dfs/name ? (Y or N)


  • Fault 2: when viewing processes with jps, the NameNode and ResourceManager processes are not found


    Cause: the IP address in front of hadoopm in the first line of the /etc/hosts file does not match the system's current IP address

    Solution (for virtual machines):

    Copy the IP address from the first line of the /etc/hosts file

    Then change the current IP address of the system

    ifconfig eth0 (followed by the IP address just copied)

    Run start-all.sh again to start the services

    Now check the processes with jps again, and the NameNode and ResourceManager processes should be there

    If not, replace the IP address in the /etc/hosts file with the system's current IP address instead

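A minimal sketch for checking whether the mapping and the current system IP are still consistent (the interface name eth0 matches the one used above; it may differ on your system):

grep hadoopm /etc/hosts           (IP recorded in the mapping)

ifconfig eth0 | grep inet         (current IP of the system)

If the two differ, fix one of them as described above and start the services again.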