Introduction to Hadoop

Hadoop is a distributed system infrastructure developed by the Apache Foundation. It lets users develop distributed applications without understanding the underlying details of distribution, making full use of the power of a cluster for high-speed computing and storage. Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant, is designed to be deployed on low-cost hardware, and provides high-throughput access to application data, making it suitable for applications with large data sets. HDFS relaxes some POSIX requirements to allow streaming access to file system data. The core of the Hadoop framework is HDFS and MapReduce: HDFS provides storage for massive amounts of data, and MapReduce provides computation over massive amounts of data.

Environment:

CentOS 7.

Installation guide: Installing CentOS 7 and Networking with VMware

1. Create a Hadoop user

1. Log in to the VM as user root and create the user hadoop:

$ useradd -m hadoop -s /bin/bash    # create a new user hadoop
$ passwd hadoop                     # set its password

 

2. Add administrator rights to the hadoop user:

$ visudo

The sudoers file opens for editing.

Press the ESC key and type :98 (colon first, then 98) to jump to line 98, then press i to enter insert mode.

Below the existing root entry, add a line granting the hadoop user the same rights; the whitespace between the fields is a TAB (see the sketch below).
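A minimal sketch of the entry, assuming the default CentOS 7 sudoers layout:

## Allow root to run any commands anywhere
root    ALL=(ALL)       ALL
## new entry for user hadoop (separate the fields with TABs)
hadoop  ALL=(ALL)       ALL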

Press ESC and enter :wq to save and exit.

Run exit to leave the root session and log in again as user hadoop.

2. Install SSH and configure password-free SSH login

SSH is a reliable protocol designed to provide security for remote login sessions and other network services. An added benefit of SSH is that the data being transmitted is compressed, which speeds up transmission.

CentOS usually installs openssh-clients and openssh-server by default.

You can use shell commands to check the SSH installation on your machine.

$ rpm -qa | grep ssh

If the output includes openssh-clients and openssh-server, SSH is already installed.
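The output should look roughly like the following (the package versions shown here are only illustrative):

openssh-7.4p1-21.el7.x86_64
openssh-clients-7.4p1-21.el7.x86_64
openssh-server-7.4p1-21.el7.x86_64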

If not, you can install it via yum:

$ sudo yum install openssh-clients
$ sudo yum install openssh-server

Run the following command to check whether SSH is available:

$ ssh localhost

You will be prompted when you log in for the first time:
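The prompt typically looks like this (the key type and fingerprint will differ):

The authenticity of host 'localhost (::1)' can't be established.
ECDSA key fingerprint is SHA256:...
Are you sure you want to continue connecting (yes/no)?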

Type yes and enter your password as prompted to log in to the machine. However, a password is required for every connection; it is more convenient to set up passwordless login.

Use ssh-keygen to generate a key and add it to the authorized keys list:

$ cd ~/.ssh/                              # if this directory does not exist, run ssh localhost once first
$ ssh-keygen -t rsa                       # press Enter at every prompt
$ cat ./id_rsa.pub >> ./authorized_keys   # add the key to the authorized list
$ chmod 600 ./authorized_keys             # restrict the file permissions

Press Enter at every prompt.

Now run ssh localhost again and you can log in without entering a password.

3. Install the Java environment

Download the jdk-8u51-linux-x64.tar.gz package to the /home/hadoop/download folder, and decompress it to /usr/lib/jvm:

$ sudo mkdir -p /usr/lib/jvm
$ sudo tar -zxf ~/download/jdk-8u51-linux-x64.tar.gz -C /usr/lib/jvm

Edit environment variables:

$ vi ~/.bashrc

Add JAVA_HOME and PATH:

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_51
export PATH=$PATH:$JAVA_HOME/bin

Make the environment variables take effect:

$ source ~/.bashrc

View the Java version:

$ java -version

 

Verify that the environment variables are correct:

$ java -version
$ $JAVA_HOME/bin/java -version    # the two commands should print the same version

 

4. Install Hadoop 2

Mirror sites:

mirrors.cnnic.cn/apache/hado…

mirror.bit.edu.cn/apache/hado…

Download the binary "*.tar.gz" package; the file ending in "-src" is the Hadoop source code.

Put the downloaded tar.gz file into /home/hadoop/download (~/download).

Run the following command to decompress the Hadoop file:

$ sudo tar -zxf ~/download/hadoop-2.7.7.tar.gz -C /usr/local
$ cd /usr/local/
$ sudo mv ./hadoop-2.7.7/ ./hadoop            # rename the folder to hadoop
$ sudo chown -R hadoop:hadoop ./hadoop        # change the owner to user hadoop

View hadoop version information:

$ cd /usr/local/hadoop
$ ./bin/hadoop version

5. Standalone Hadoop Configuration (Non-distributed)

Hadoop runs in non-distributed mode by default and requires no additional configuration. In non-distributed mode it runs as a single Java process, which makes debugging easy.

Now we can run an example to get a feel for Hadoop in action. Hadoop ships with many examples (run /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar to list them all), including wordcount, terasort, join, grep, and so on.

Here we run the grep example: it takes all files in the input folder as input, filters out the words that match the regular expression dfs[a-z.]+, counts their occurrences, and writes the result to the output folder.

$ cd /usr/local/hadoop
$ mkdir ./input
$ cp ./etc/hadoop/*.xml ./input    # use the configuration files as input
$ ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar grep ./input ./output 'dfs[a-z.]+'
$ cat ./output/*                   # view the result

 

Note: Hadoop does not overwrite result files by default, so running the example again will fail because ./output already exists.

Delete the output folder first:

$ rm -r ./output

6. Hadoop pseudo-distributed configuration

Hadoop can run in pseudo-distributed fashion on a single node: the Hadoop daemons run as separate Java processes, the node acts as both NameNode and DataNode, and files are read from HDFS.

Before setting up the pseudo-distributed configuration, we also need to set the Hadoop environment variables. Run the following command to edit ~/.bashrc:

$ vi ~/.bashrc

Add the following at the end of the file:

# Hadoop Environment Variables
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Make the changes take effect:

$ source ~/.bashrc

The Hadoop configuration files are located in /usr/local/hadoop/etc/hadoop. Pseudo-distributed mode requires modifying two configuration files: core-site.xml and hdfs-site.xml.

Hadoop's configuration files are in XML format; each configuration item is declared through the name and value of a property.
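For reference, each entry follows this generic pattern (the name and value here are placeholders, not real Hadoop settings):

<property>
    <name>some.property.name</name>
    <value>some-value</value>
</property>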

Modify the core-site.xml file:

Replace with the following:

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

Modify the hdfs-site.xml file:

Replace with the following:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
</configuration>

After the configuration is complete, perform NameNode formatting:

$ /usr/local/hadoop/bin/hdfs namenode -format

On success, the output contains "successfully formatted" and "Exiting with status 0"; on failure it ends with "Exiting with status 1".

Start NameNode and DataNode daemons:

$ /usr/local/hadoop/sbin/start-dfs.sh

If "Are you sure you want to continue connecting" is displayed, type yes to continue.

After startup completes, run the jps command to check whether the daemons started successfully. If they did, the list includes NameNode, DataNode and SecondaryNameNode (if SecondaryNameNode did not start, run sbin/stop-dfs.sh to stop the processes and try again).
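Typical jps output looks roughly like this (the process IDs are illustrative):

$ jps
13586 NameNode
13690 DataNode
13882 SecondaryNameNode
14021 Jps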

If NameNode or DataNode is missing, the configuration failed. Check the previous steps or look at the startup logs to find the cause.

You can view startup logs to analyze startup failure causes

Sometimes Hadoop fails to start properly, for example, the NameNode process does not start smoothly. In this case, you can view the startup log to locate the cause. Note the following points:

  • A prompt such as "localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hadoop-namenode-localhost.out" appears, where localhost corresponds to your host name; however, the startup messages are actually recorded in /usr/local/hadoop/logs/hadoop-hadoop-namenode-localhost.log, so look at the file with the .log suffix;
  • Each startup appends to the log file, so scroll to the end to find the most recent records;
  • The error message is usually at the very end, marked Fatal, Error or a Java Exception;
  • You can search the web for the error message to see whether a solution exists.

After the startup is successful, you can visit http://localhost:50070 to view NameNode and DataNode information and browse files in HDFS online.

This example does not use the CentOS graphical interface, so you need to access the VM's Hadoop web interface from a browser on Windows.

This requires opening the relevant ports in CentOS 7; for details, see Open Ports in CentOS 7.
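For example, assuming the default firewalld setup, the web UI ports used in this tutorial (50070 for HDFS, 8088 for YARN) can be opened like this:

$ sudo firewall-cmd --zone=public --add-port=50070/tcp --permanent
$ sudo firewall-cmd --zone=public --add-port=8088/tcp --permanent
$ sudo firewall-cmd --reload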

7. Run a Hadoop pseudo-distributed instance

The example above ran in single-machine mode and read local files. A pseudo-distributed instance reads its data from HDFS. To use HDFS, first create a user directory in HDFS:

$ /usr/local/hadoop/bin/hdfs dfs -mkdir -p /user/hadoop

Next, copy all the XML files under /usr/local/hadoop/etc/hadoop/ to /user/hadoop/input in HDFS.

Since we are working as the hadoop user and have created the corresponding user directory /user/hadoop, we can use relative paths such as input in the commands; the corresponding absolute path is /user/hadoop/input.

/usr/local/hadoop/bin/hdfs dfs -mkdir input
/usr/local/hadoop/bin/hdfs dfs -put /usr/local/hadoop/etc/hadoop/*.xml input

After the replication is complete, you can run the following command to view the file list in HDFS:

$ /usr/local/hadoop/bin/hdfs dfs -ls input

Running a MapReduce job:

$ /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'

View the run result:

$ /usr/local/hadoop/bin/hdfs dfs -cat output/*

Fetch the HDFS file back to the local directory:

$ rm -r /usr/local/hadoop/output    # remove the local output folder first (if it exists)
$ /usr/local/hadoop/bin/hdfs dfs -get output /usr/local/hadoop/output    # copy the output folder from HDFS to the local machine
$ cat /usr/local/hadoop/output/*

When Hadoop runs, the output directory must not already exist; otherwise an org.apache.hadoop.mapred.FileAlreadyExistsException is thrown. So delete the output directory before running the job again:

$ /usr/local/hadoop/bin/hdfs dfs -rm -r output    # delete the output folder

Stop Hadoop:

$ /usr/local/hadoop/sbin/stop-dfs.sh

Note: the next time you start Hadoop, you do not need to format the NameNode again; just run /usr/local/hadoop/sbin/start-dfs.sh.

8. Start YARN

Newer versions of Hadoop use the new MapReduce framework (MapReduce V2, also known as YARN, Yet Another Resource Negotiator). YARN was split out of MapReduce and is responsible for resource management and task scheduling; MapReduce runs on top of YARN, which provides high availability and scalability.

In pseudo-distributed mode, the program can be executed without starting YARN.

The command /usr/local/hadoop/sbin/start-dfs.sh used above only starts the MapReduce environment. We can also start YARN and let YARN handle resource management and task scheduling.

Modify the configuration file mapred-site.xml:

$ cd /usr/local/hadoop/etc/hadoop                     # enter the configuration folder
$ mv ./mapred-site.xml.template ./mapred-site.xml     # rename the template
$ vi ./mapred-site.xml                                # edit the file

Modify the configuration as follows:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Modify the configuration file yarn-site.xml:

$ vi ./yarn-site.xml

Modify the configuration as follows:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

Start YARN:

$ /usr/local/hadoop/sbin/start-yarn.sh                                   # start YARN
$ /usr/local/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver     # start the history server so task status can be viewed on the web UI

Running jps now also shows the ResourceManager and NodeManager processes.

After startup, you can view task status through the web interface at http://localhost:8088/cluster.

YARN provides better resource management and task scheduling for clusters, but on a single machine it brings no benefit and actually slows programs down. So on a single machine, whether to enable YARN depends on the actual situation.

Note: if YARN is not started, mapred-site.xml needs to be renamed back.

If you do not want to start YARN, rename the configuration file mapred-site.xml back to mapred-site.xml.template, and change it back again when needed. Otherwise, if the configuration file exists but YARN is not started, programs will fail with the error "Retrying connect to server: 0.0.0.0/0.0.0.0:8032". This is also why the file is named mapred-site.xml.template by default.
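For example, to reverse the rename done earlier:

$ cd /usr/local/hadoop/etc/hadoop
$ mv ./mapred-site.xml ./mapred-site.xml.template    # YARN will no longer be used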

Stop YARN:

$ /usr/local/hadoop/sbin/stop-yarn.sh
$ /usr/local/hadoop/sbin/mr-jobhistory-daemon.sh stop historyserver