Preface:

I'm back again. There's been a lot going on at school these past two days, so updates slowed down. Yesterday I saw a few friends leaving messages urging me to update faster; honestly, being nagged for an update for the first time felt pretty good. In the last article, Hadoop running environment construction, we covered the basic Hadoop environment setup. It's simple stuff, but very important: without the basic environment, none of the code that follows will run. Hadoop may be a distributed big-data framework, but not everyone can stand up a real distributed environment. Some machines are low-spec; running a single virtual machine already pushes them to the limit, and running a whole cluster would melt the CPU, yet their owners still want to use Hadoop. So Hadoop supports another mode of operation: pseudo-distributed mode, which is not actually distributed but looks like it. Today we start with the pseudo-distributed configuration and walk through the whole process in detail, modifying the configuration files step by step.

Enough talk, let's get to the good stuff.

Configure and start HDFS

The first thing we need to do is configure and start our Hadoop distributed file system. Find the Hadoop environment configuration file, hadoop-env.sh, which lives in:

/opt/module/hadoop-2.7.2/etc/hadoop

Then we need the JAVA_HOME path we configured earlier; the command to check it is as follows:

[hanshu@hadoop100 hadoop]$ echo $JAVA_HOME
/usr/lib/jvm/java-1.8.0-openjdk/jre/

Use vim to open hadoop-env.sh; I'll only list the relevant part here:

# The java implementation to use.
export JAVA_HOME=${JAVA_HOME}

# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol.  Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
#export JSVC_HOME=${JSVC_HOME}

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}


Change the value of JAVA_HOME to the Java installation directory we obtained above:

# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk/jre/

Save, configure, all in one go.

Then we need to configure our NameNode and DataNode in the core-site.xml configuration file, which is still in our etc/hadoop directory.

We use Vim to open it as follows:

<?xml version="1.0" encoding="UTF-8"? >
<?xml-stylesheet type="text/xsl" href="configuration.xsl"? >
<! -- Save a bunch of comments -->
<configuration>
</configuration>


We can see it's empty, like a patch of uncultivated wasteland. Being the kind souls we are, let's put it to good use and add the following:

<?xml version="1.0" encoding="UTF-8"? >
<?xml-stylesheet type="text/xsl" href="configuration.xsl"? >
<! -- Save a bunch of comments -->
<configuration>


<! -- specify HDFS NameNode address -->
<property>
<name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>

<! -- Specify the directory where files are generated while Hadoop is running -->
<property>
	<name>hadoop.tmp.dir</name>
	<value>The/opt/module/hadoop - 2.7.2 / data/TMP</value>
</property>

</configuration>

That's not all. As mentioned earlier, HDFS stores three copies of each file by default. We now have a fake distributed setup: only one computer, only one place to put things, so three copies are out of the question. The configuration file for this is hdfs-site.xml, still in the same directory; open it with vim and take a look:

<?xml version="1.0" encoding="UTF-8"? >
<?xml-stylesheet type="text/xsl" href="configuration.xsl"? >
<! -- Empty.
<configuration>

</configuration>


Nothing there. No matter, our configuration just isn't in it yet; add it as follows:

<?xml version="1.0" encoding="UTF-8"? >
<?xml-stylesheet type="text/xsl" href="configuration.xsl"? >
<! -- Empty.
<configuration>
<! -- Specify the number of HDFS copies -->
<property>
	<name>dfs.replication</name>
	<value>1</value>
</property>
</configuration>

Can we start our HDFS at this point? In theory, yes, but since this is the first start we need to format the NameNode first. This step is a lot like buying a new USB drive: you format it once before using it. It is not recommended to format the NameNode again after that first time, because formatting generates a new cluster ID; the NameNode and DataNode cluster IDs then no longer match, and the cluster cannot find its old data. So if you ever do need to reformat the NameNode, remember to delete the data and logs directories first, otherwise the cluster may fail to start.
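If you ever do need that reformat, here is a rough sketch of the cleanup (it assumes the data directory we set in hadoop.tmp.dir above and the default logs directory; adjust the paths to your own setup):

[hanshu@hadoop100 hadoop-2.7.2]$ sbin/hadoop-daemon.sh stop datanode   # stop the daemons first
[hanshu@hadoop100 hadoop-2.7.2]$ sbin/hadoop-daemon.sh stop namenode
[hanshu@hadoop100 hadoop-2.7.2]$ rm -rf data/ logs/                    # wipe the old data and logs
[hanshu@hadoop100 hadoop-2.7.2]$ bin/hdfs namenode -format             # now it is safe to format again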

Format the NameNode:

[hanshu@hadoop100 hadoop-2.7.2]$ bin/hdfs namenode -format

There is a lot of output; if I copied it all here, this article would hit ten thousand words in a minute. As long as there is no error at the end, the format succeeded. It's like formatting a USB drive: failures are rare.

Start the NameNode:

[hanshu@hadoop100 hadoop-2.7.2]$ sbin/hadoop-daemon.sh start namenode
logging to /opt/module/hadoop-2.7.2/logs/hadoop-hanshu-namenode-hadoop100.out

Tip:

All of these commands use relative paths, so pay attention to my working directory. Look at my directory. Look at my directory. Look at my directory.

Start the DataNode:

[hanshu@hadoop100 hadoop-2.7.2]$ sbin/hadoop-daemon.sh start datanode
logging to /opt/module/hadoop-2.7.2/logs/hadoop-hanshu-datanode-hadoop100.out

Once that's done, the jps tool that ships with the JDK (mentioned earlier) comes in handy, as shown below:

[hanshu@hadoop100 hadoop-2.7.2]$ jps
4243 Jps
4038 NameNode
4157 DataNode

At this point our HDFS configuration is complete. Of course, that alone isn't enough: you call this a distributed file system, and there isn't even a file browser I can look at? There is. Hadoop provides a web UI that lets us view the HDFS file system intuitively. The address is:

http://localhost:50070/dfshealth.html#tab-overview
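If the page doesn't load, a quick sanity check from the terminal never hurts; this is just a plain HTTP request against the default NameNode web port (50070), nothing Hadoop-specific:

[hanshu@hadoop100 hadoop-2.7.2]$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/dfshealth.html

A 200 here means the web UI is answering; anything else, go look at the logs (see the tip below).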

This is what it looks like when you open it

That said, we rarely use the home page; the place where you actually browse files is elsewhere. Rather than typing it all out, just look at the picture:

Extracurricular tips:

Hadoop log files are stored in the /opt/module/hadoop-2.7.2/logs directory. If the cluster fails to start or an error occurs, you can view the error information here.
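For example, a rough way to peek at the NameNode log from earlier (the file name follows the hadoop-&lt;user&gt;-namenode-&lt;host&gt; pattern you saw above; the .log file carries the detailed messages, while the .out file only holds the startup output):

[hanshu@hadoop100 hadoop-2.7.2]$ tail -n 50 logs/hadoop-hanshu-namenode-hadoop100.log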

Since this is a file system, it obviously has to support the usual operations like creating folders and creating files, otherwise it would be too embarrassed to call itself a file system. Today's focus is configuration rather than usage, so I'll just list a few commands for now; later, when I write the HDFS article, I'll go through all the fancy file operation commands in detail.

Create a folder:

[hanshu@hadoop100 hadoop-2.7.2]$ bin/hdfs dfs -mkdir -p /user/hanshu/input

I've already created a file called hanshuzuishuai.txt; let's upload it to the folder we just created:

[hanshu@hadoop100 hadoop-2.7.2]$ bin/hdfs dfs -put hanshuzuishuai.txt /user/hanshu/input/

Check whether the file was uploaded correctly:

[hanshu@hadoop100 hadoop-2.7.2]$ bin/hdfs dfs -ls /user/hanshu/input/
Found 1 items
-rw-r--r--   1 hanshu supergroup         13 2019-12-20 15:07 /user/hanshu/input/hanshuzuishuai.txt

View file contents:

[hanshu@hadoop100 hadoop-2.7.2]$ bin/hdfs dfs -cat /user/hanshu/input/hanshuzuishuai.txt
Hanshu is the most handsome

See? This virtual machine just blurts out great truths and can't help itself. Let's forgive it and not argue with a little virtual machine (kidding).

Check on the web UI to see whether our file is there:

Download file contents:

[hanshu@hadoop100 hadoop-2.7.2]$ hdfs dfs -get /user/hanshu/input/hanshuzuishuai.txt

Delete file contents:

[hanshu@hadoop100 hadoop-2.7.2]$ hdfs dfs -rm -r /user/hanshu/input/hanshuzuishuai.txt

And with that, HDFS is done, at least the short version. I'm exhausted; I've been writing from two o'clock to past three. Had I known, I would have split this into a trilogy and written a bit each day. Oh well, we've come this far, so let's finish it today and keep going:

Configure and start Yarn

HDFS is configured; what's left is Yarn. Honestly, if you still don't know what Yarn is, I suggest going back to my previous article first. Configuring Yarn isn't complicated either; again, we just modify a few configuration files.

The first is yarn-env.sh, still under etc/hadoop. Open it with vim:

# User for YARN daemons
export HADOOP_YARN_USER=${HADOOP_YARN_USER:-yarn}

# resolve links - $0 may be a softlink
export YARN_CONF_DIR="${YARN_CONF_DIR:-$HADOOP_YARN_HOME/conf}"

# some Java parameters
# export JAVA_HOME=/home/y/libexec/jdk1.6.0/


Configure JAVA_HOME:

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk/jre/

The second unlucky one is yarn-site.xml. Open it with vim:

<?xml version="1.0"? >
<configuration>

<! -- Still empty -->

</configuration>

Don't worry, it won't stay empty for long. Add the following:

<?xml version="1.0"? >
<configuration>
<! -- Reducer obtain data -->
<property>
 		<name>yarn.nodemanager.aux-services</name>
 		<value>mapreduce_shuffle</value>
</property>

<! -- Specify the ADDRESS of ResourceManager in YARN.
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
</configuration>

The third is our mapred-env.sh configuration file. Open it with vim.

Spot the JAVA_HOME in there? OMG, that's the one, go for it! (the Li Jiaqi meme)

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk/jre/

Finally, we modify mapred-site.xml. At this point someone is bound to ask: I can't find it! It's over, I don't have this file, I've gone off the rails of Hanshu's tutorial, how am I supposed to keep learning? Forget it, I'll go back to selling Hami melons and roasted sweet potatoes.

Don't panic: Hadoop only ships a template named mapred-site.xml.template. Rename it to mapred-site.xml and you're back on track:

[hanshu@hadoop100 hadoop]$ mv mapred-site.xml.template mapred-site.xml

Open it with vim and guess what's inside. You guessed it: empty. Modify the configuration file as follows:

<?xml version="1.0"? >
<?xml-stylesheet type="text/xsl" href="configuration.xsl"? >

<configuration>
<! -- Set MR to run on YARN -->
<property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
</property>
</configuration>


At this point Yarn is configured too. Can we start it now?

Before starting Yarn, make sure the NameNode and DataNode are already running. Otherwise Yarn will start, see that there are no resources for it to schedule, and decide it might as well go home and sleep.
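A quick way to double-check is to filter the jps output we used earlier (the grep is just my own habit, not anything Hadoop-specific):

[hanshu@hadoop100 hadoop-2.7.2]$ jps | grep -E 'NameNode|DataNode'
4038 NameNode
4157 DataNode

If both show up, carry on.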

Start the ResourceManager:

[hanshu@hadoop100 hadoop-2.7.2]$ sbin/yarn-daemon.sh start resourcemanager
logging to /opt/module/hadoop-2.7.2/logs/yarn-hanshu-resourcemanager-hadoop100.out

Start the NodeManager:

[hanshu@hadoop100 hadoop-2.7.2]$ sbin/yarn-daemon.sh start nodemanager
logging to /opt/module/hadoop-2.7.2/logs/yarn-hanshu-nodemanager-hadoop100.out

Use the jps command to see whether the startup succeeded:

[hanshu@hadoop100 hadoop-2.7.2]$ jps
5986 Jps
5620 ResourceManager
4038 NameNode
4157 DataNode
5870 NodeManager

At this point some of you are surely thinking: how are you so practiced at this? Hhh, the reason is simple: you don't think I'd leave the pits I fell into along the way in the article, do you? (smug)

Yarn also provides a web UI to help us view and manage it:

http://localhost:8088/cluster

At this point, Yarn is configured.

Configure and start the history server

To view the history of past applications, you need to configure the history server. This is easy: open our mapred-site.xml configuration file again and add the following:

<!-- History server address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>localhost:10020</value>
</property>
<!-- History server web UI address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>localhost:19888</value>
</property>

Start the history server:

[hanshu@hadoop100 hadoop-2.7.2]$ sbin/mr-jobhistory-daemon.sh start historyserver
logging to /opt/module/hadoop-2.7.2/logs/mapred-hanshu-historyserver-hadoop100.out

Use jps to check whether the startup succeeded:

[hanshu@hadoop100 hadoop-2.7.2]$ jps
5620 ResourceManager
6565 Jps
4038 NameNode
4157 DataNode
5870 NodeManager
6479 JobHistoryServer

Needless to say, there's a web UI for this too:

http://localhost:19888/jobhistory

Is that it? Feels like there's still more to go, doesn't it? I think so too.

Configure and enable log aggregation

What is log aggregation? Think of the class monitor collecting homework: we gather all the logs and put them in one place, so that whenever the program has a problem, debugging is convenient.

Note: for log aggregation to take effect, the NodeManager, ResourceManager, and JobHistoryServer must be restarted.

We also need to modify a configuration file here. I feel like I've picked up yet another programming paradigm:

Configuration-oriented programming.

Configure yarn-site.xml

Add the following configuration:

<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>

<!-- Set log retention time to 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>

Stop the ResourceManager:

[hanshu@hadoop100 hadoop-2.7.2]$ sbin/yarn-daemon.sh stop resourcemanager
stopping resourcemanager

Stop the NodeManager:

[hanshu@hadoop100 hadoop-2.7.2]$ sbin/yarn-daemon.sh stop nodemanager
stopping nodemanager

Stop the JobHistoryServer:

[hanshu@hadoop100 hadoop-2.7.2]$ sbin/mr-jobhistory-daemon.sh stop historyserver
stopping historyserver

Start the NodeManager:

[hanshu@hadoop100 hadoop-2.7.2]$ sbin/yarn-daemon.sh start nodemanager
logging to /opt/module/hadoop-2.7.2/logs/yarn-hanshu-nodemanager-hadoop100.out

Start the ResourceManager:

[hanshu@hadoop100 hadoop-2.7.2]$ sbin/yarn-daemon.sh start resourcemanager
logging to /opt/module/hadoop-2.7.2/logs/yarn-hanshu-resourcemanager-hadoop100.out

Start the JobHistoryServer:

[hanshu@hadoop100 hadoop-2.7.2]$ sbin/mr-jobhistory-daemon.sh start historyserver
logging to /opt/module/hadoop-2.7.2/logs/mapred-hanshu-historyserver-hadoop100.out
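Once everything is back up, aggregated logs can also be pulled from the command line with the standard yarn logs command. The application ID below is only a placeholder; grab a real one from the ResourceManager web UI mentioned above:

[hanshu@hadoop100 hadoop-2.7.2]$ bin/yarn logs -applicationId application_1576823123456_0001   # placeholder ID

Keep in mind that yarn logs only returns anything after a job has finished and log aggregation is enabled, which is exactly what we just set up.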

At this point our pseudo-distributed runtime environment is really configured.

Really, I kid you not.

First, the usual technical summary:

Today we mainly covered the configuration of a Hadoop pseudo-distributed environment; I don't think it could be much more detailed. Of course there will be a follow-up chapter: HDFS and Yarn are configured, but we haven't actually run MapReduce yet. So next time, building on this setup, we'll run the official WordCount example, take a look at our log aggregation, and see how Yarn does its scheduling. Finally:

Thank you very much for reading this, your support and attention is my motivation to continue high-quality sharing.

Relevant code has been uploaded to my Github. Make sure you hit “STAR” ahhhhhhh

The Long March is a long road; how about a star to keep me going?

Hanshu development notes

Welcome to like, follow me, have good fruit to eat (funny)