Preface

Hadoop itself is a distributed application, but for simple testing there is usually no need to build a cluster. Pseudo-distributed mode is essentially a single-machine configuration of Hadoop

Hadoop does not tolerate IP address changes, which means the IP used during project development must be the same as the one in the final running environment. If it changes, the entire configuration has to be redone. All operations below are performed as root

Mapping configuration

For the mapping to take effect, it is recommended to restart Linux by typing reboot
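The mapping referred to here is the hostname entry in /etc/hosts. A minimal sketch, assuming the IP 192.168.1.108 and the host name hadoopm used later in this article:

vim /etc/hosts

192.168.1.108   hadoopm     (fixed IP of this machine mapped to the Hadoop host name)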

Configure SSH login exemption

For details about how to configure SSH password-free login, see Step 5 in Hadoop installation and Setup
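For reference, a minimal sketch of the usual password-free login setup (the referenced article is authoritative; the key type and file names below are standard OpenSSH defaults):

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa      (generate a key pair without a passphrase)

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

chmod 600 ~/.ssh/authorized_keys

ssh hadoopm      (should now log in without asking for a password)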

Hadoop configuration

After SSH is configured, you can move on to the Hadoop-related configuration. All of the configuration files that need to be edited are located in /usr/local/hadoop/etc/hadoop

  • core-site.xml:
    Defines the core information of Hadoop, including the temporary directory and access addresses
  • hdfs-site.xml:
    Determines the number of file replicas (backups) and the path of the data folders
  • yarn-site.xml:
    Can be understood simply as the configuration for processing related jobs

Configure the core-site.xml file

cd /usr/local/

cd hadoop/etc/hadoop/

vim core-site.xml

  • Locate the
    <configuration>  </configuration>
  • Add the following code in the middle of the tag
    <property>

    <name>hadoop.tmp.dir</name>

    <value>/home/root/hadoop_tmp</value>

    <description>A base for other temporary directories.</description> <!-- optional; this line can be omitted -->

    </property>

    <property>

    <name>fs.defaultFS</name>

    <value>hdfs://hadoopm:9000</value>

    </property>


The hdfs://hadoopm:9000 address configured in this file describes the access path used by the management page opened later


<value>/home/root/hadoop_tmp</value>

The line above is the most important and easy to overlook. It configures the path for temporary files. If it is not configured, a tmp folder is generated inside the Hadoop directory ("/usr/local/hadoop/tmp"), its contents are cleared on every reboot, and your Hadoop environment becomes invalid.

  • To ensure correct operation, create the /home/root/hadoop_tmp directory directly
    mkdir -p /home/root/hadoop_tmp    (create it at the exact path configured above)

Note:

  • The environment used here is the Hadoop 2.x development line; the default port is 9000
  • If your environment is the Hadoop 1.x development line, the default port is 8020
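If you are unsure which address is actually in effect, you can query it once the configuration file is saved; a minimal check using the getconf sub-command available in Hadoop 2.x:

cd /usr/local/hadoop/bin

./hdfs getconf -confKey fs.defaultFS      (should print hdfs://hadoopm:9000)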

Configure the hdfs-site.xml file

HDFS is the most critical part of Hadoop

cd /usr/local/hadoop/etc/hadoop/

vim hdfs-site.xml

  • Locate the
    <configuration>  </configuration>
  • Add the following code in the middle of the tag

    <property>

    <name>dfs.replication</name>

    <value>1</value>

    </property>

    <property>

    <name>dfs.namenode.name.dir</name>

    <value>file:///usr/local/hadoop/dfs/name</value>

    </property>

    <property>

    <name>dfs.datanode.data.dir</name>

    <value>file:///usr/local/hadoop/dfs/data</value>

    </property>

    <property>

    <name>dfs.namenode.http-address</name>

    <value>hadoopm:50070</value>

    </property>

    <property>

    <name>dfs.namenode.secondary.http-address</name>

    <value>hadoopm:50090</value>

    </property>

    <property>

    <name>dfs.permissions.enabled</name>

    <value>false</value>

    </property>

  • dfs.replication:
    The number of replicas kept for each file; in production, three replicas are usually kept

  • dfs.namenode.name.dir:
    Defines the path of the NameNode data

  • dfs.datanode.data.dir:
    Defines the path of the DataNode (data file) storage

  • dfs.namenode.http-address:
    The HTTP address of the name service

  • dfs.namenode.secondary.http-address:
    The secondary NameNode (not very useful here, but required in a distributed cluster)

  • dfs.permissions.enabled:
    Permission checking; if it is left as true, file access may be blocked later
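Once the configuration file is saved, the same getconf sub-command can confirm that these values took effect; a minimal sketch:

cd /usr/local/hadoop/bin

./hdfs getconf -confKey dfs.replication              (expect 1)

./hdfs getconf -confKey dfs.namenode.http-address    (expect hadoopm:50070)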

Configure the yarn-site.xml file

Configure some corresponding node information

cd /usr/local/hadoop/etc/hadoop/

vim yarn-site.xml

  • Locate the
    <configuration>  </configuration>
  • Add the following code in the middle of the tag
    <property>

    <name>yarn.resourcemanager.admin.address</name>

    <value>hadoopm:8033</value>

    </property>

    <property>

    <name>yarn.nodemanager.aux-services</name>

    <value>mapreduce_shuffle</value>

    </property>

    <property>

    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>

    <value>org.apache.hadoop.mapred.ShuffleHandler</value>

    </property>

    <property>

    <name>yarn.resourcemanager.resource-tracker.address</name>

    <value>hadoopm:8025</value>

    </property>

    <property>

    <name>yarn.resourcemanager.scheduler.address</name>

    <value>hadoopm:8030</value>

    </property>

    <property>

    <name>yarn.resourcemanager.address</name>

    <value>hadoopm:8050</value>

    </property>

    <property>

    <name>yarn.resourcemanager.webapp.address</name>

    <value>hadoopm:8088</value>

    </property>

    <property>

    <name>yarn.resourcemanager.webapp.https.address</name>

    <value>hadoopm:8090</value>

    </property>


At this point, the core Hadoop configuration files have been set up
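Once the services have been started (see "Open the service" below), a minimal sanity check of the YARN side, using the addresses configured above:

cd /usr/local/hadoop/bin

./yarn node -list                 (should report one running NodeManager)

curl http://hadoopm:8088/         (ResourceManager web interface configured above)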

Open the service

Because Hadoop is a distributed development environment, and with a future cluster in mind, it is recommended to create a masters file in the /usr/local/hadoop/etc/hadoop directory and write the host name into it; the content is hadoopm (the host name defined earlier in the hosts file). In a standalone environment the file can be left out, but it is better to create it now so the future cluster configuration is easier to remember

cd /usr/local/hadoop/etc/hadoop

touch masters

vim masters

Just write hadoopm on the first line

Save the configuration and exit :wq!

Copy the code
  • Modify the slave-node file "slaves" to add hadoopm

    vim slaves

    This file may have localhost in the first line, do not delete it, just add hadoopm to it

    Save and exit with :wq!

  • The NameNode and DataNode data will all be stored under the Hadoop directory

    cd ..

    cd .. (back up from etc/hadoop to /usr/local/hadoop)

    Next, create the directories yourself

    mkdir dfs dfs/name dfs/data

Special note: if Hadoop runs into problems and needs to be reconfigured, make sure these two folders are completely wiped out; the new configuration will not take effect until they are cleaned
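A minimal clean-up sketch, under the paths used in this article (stop the services first):

cd /usr/local/hadoop/sbin && ./stop-all.sh

rm -rf /usr/local/hadoop/dfs/name/* /usr/local/hadoop/dfs/data/*

rm -rf /home/root/hadoop_tmp/*        (the temporary directory configured in core-site.xml)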

  • Formatting the file system
    source /etc/profile

    cd /usr/local/hadoop/bin

    ./hdfs namenode -format


If the formatting succeeds, a message similar to "INFO util.ExitUtil: Exiting with status 0" will appear.

  • Hadoop can then be started in the simplest way

    cd /usr/local/hadoop/sbin

    ./start-all.sh

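Note that in the Hadoop 2.x series start-all.sh is marked as deprecated and simply calls the two scripts below, so starting HDFS and YARN separately is equivalent:

cd /usr/local/hadoop/sbin

./start-dfs.sh       (NameNode, DataNode, SecondaryNameNode)

./start-yarn.sh      (ResourceManager, NodeManager)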
  • You can then use the jps command provided by the JDK to view all Java processes; if the following six processes are listed

    2848 DataNode

    2721 NameNode

    3266 ResourceManager

    3445 NodeManager

    3115 SecondaryNameNode

    3773 Jps


There are actually 5 Hadoop processes; Jps is the jps command's own process. If all six are present, the configuration is successful
  • You can test next, but for now you can only test whether HDFS is working properly
    You can access it directly by IP address: open a browser and enter

    http://192.168.1.108:50070/

    The IP here is the Ubuntu machine's own IP

    You can also open the following address to view the same information

    http://hadoopm:50070/
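Besides the web page, HDFS can be smoke-tested from the command line; a minimal sketch (the /test directory and uploaded file are arbitrary examples):

cd /usr/local/hadoop/bin

./hdfs dfs -mkdir -p /test                (create a directory in HDFS)

./hdfs dfs -put /etc/hostname /test/      (upload a small local file)

./hdfs dfs -ls /test                      (the uploaded file should be listed)

./hdfs dfs -cat /test/hostname            (print its contents back)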



  • Extension

    Start the services: start-all.sh

    Stop the services: stop-all.sh

    To execute these two commands, go to the following directory
    /usr/local/hadoop/sbin

    The executables in the Hadoop 2 series are split into two directories: commands under bin can be run by ordinary users, while commands under sbin should only be run by the superuser.

    If you want to use the hadoopm name from outside the virtual machine (for example from Windows), you need to modify the local hosts file and add the same mapping

    Path: C:\Windows\System32\drivers\etc

    To be able to edit the hosts file, grant the Users group full control over it

    Append the following line:

    192.168.1.108 hadoopm

    Save the file

    The downside of this approach is that if the virtual machine's IP address changes, hadoopm can no longer be resolved from outside the virtual machine, and the hosts file has to be modified again


Missing processes when checking with jps

  • Fault 1: no DataNode process is found when viewing processes with jps

    Cause: ./hdfs namenode -format generates a new namespaceID for the NameNode, but the data under /usr/local/hadoop/dfs still carries the old namespaceID, and the DataNode cannot start because the two IDs are inconsistent

    Solution:

    Delete the contents of the "name" and "data" folders under /usr/local/hadoop/dfs

    In the sbin directory, run stop-all.sh to stop all services

    Then go to the bin directory and format the file system again

    ./hdfs namenode -format

    Go back to the sbin directory and start all services with start-all.sh

    Now check the processes with jps again and the DataNode process will be back

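To confirm the mismatch before deleting anything, you can compare the IDs recorded on disk; a minimal sketch (depending on the exact 2.x release, the relevant field in the VERSION files is namespaceID and/or clusterID):

cat /usr/local/hadoop/dfs/name/current/VERSION

cat /usr/local/hadoop/dfs/data/current/VERSION

The DataNode refuses to start when these IDs do not match.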
  • How to prevent problem 1 above

    When re-formatting the file system, enter N when the following prompt is displayed

    Re-format filesystem in Storage Directory /usr/local/hadoop/dfs/name ? (Y or N) Invalid input: 

    Re-format filesystem in Storage Directory /usr/local/hadoop/dfs/name ? (Y or N)


  • Fault 2: when viewing processes with jps, the NameNode and ResourceManager processes are not found


    Cause: the IP address in front of hadoopm in the first line of the /etc/hosts file does not match the system's current IP address

    Solution (for virtual machines):

    Copy the IP address from the first line of the /etc/hosts file

    Then change the current IP address of the system

    ifconfig eth0 (followed by the IP address just copied)

    Run start-all.sh again to start the services

    Now check the processes with jps again, and the NameNode and ResourceManager processes should be there

    If not, replace the IP address in the /etc/hosts file with the system's current IP address instead

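A minimal sketch for checking whether the mapping and the current system IP are still consistent (the interface name eth0 matches the one used above; it may differ on your system):

grep hadoopm /etc/hosts           (IP recorded in the mapping)

ifconfig eth0 | grep inet         (current IP of the system)

If the two differ, fix one of them as described above and start the services again.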