Hadoop has two main components: the distributed file system HDFS and the MapReduce computing model. Below I walk through the process of building a Hadoop environment.

Hadoop Test Environment

Four test machines in total: 1 namenode and 3 datanodes.

OS version: RHEL 5.5 x86_64
Hadoop: 0.20.203.0
JDK: 1.7.0

namenode   192.168.57.75
datanode1  192.168.57.76
datanode2  192.168.57.78
datanode3  192.168.57.79

Preparations before deploying Hadoop

1 Hadoop depends on Java and SSH. Java 1.5.x or above must be installed, and sshd must always be running so that the Hadoop scripts can manage the remote Hadoop daemons.

2 Create a common hadoop account. All nodes must have the same user name. Run the following commands to add the user:

useradd hadoop
passwd hadoop

3 Configure the host names:

tail -n 4 /etc/hosts
192.168.57.75 namenode
192.168.57.76 datanode1
192.168.57.78 datanode2
192.168.57.79 datanode3

4 All of the above must be configured identically on every node (namenode and datanodes); a combined sketch follows this list.
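Taken together, steps 2 and 3 can be applied to each machine in one pass. A minimal sketch, assuming you are root on each node; passwd --stdin is RHEL-specific, and CHANGE_ME is a placeholder you should replace with a real password:

# Run as root on every node (namenode and all datanodes).
useradd hadoop
echo 'CHANGE_ME' | passwd --stdin hadoop    # RHEL-specific; set a real password
cat >> /etc/hosts <<'EOF'
192.168.57.75 namenode
192.168.57.76 datanode1
192.168.57.78 datanode2
192.168.57.79 datanode3
EOF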

SSH configuration

1 Generate the private key id_rsa and public key id_rsa.pub

[hadoop@hadoop1 ~]$ ssh-keygen -t rsa 

Generating public/private rsa key pair. 

Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): 

Enter passphrase (empty for no passphrase): 

Enter same passphrase again: 

Your identification has been saved in /home/hadoop/.ssh/id_rsa. 

Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub. 

The key fingerprint is: 

d6:63:76:43:e2:5b:8e:85:ab:67:a2:7c:a6:8f:23:f9 hadoop@hadoop1.test.com 
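If you prefer to skip the prompts, the same key pair can be generated non-interactively; a one-line sketch (-N "" sets an empty passphrase, -f the output file):

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa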

 

2 Confirm that the private key id_rsa and public key id_rsa.pub were generated

[hadoop@hadoop1 ~]$ ls .ssh/ 

authorized_keys  id_rsa  id_rsa.pub  known_hosts 

 

3 Upload the public key file to each datanode server

[hadoop@hadoop1 ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@datanode1 

hadoop@datanode1's password: 

Now try logging into the machine, with "ssh 'hadoop@datanode1'", and check in: 

  .ssh/authorized_keys 

to make sure we haven't added extra keys that you weren't expecting. 

 

[hadoop@hadoop1 ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@datanode2 

hadoop@datanode2's password: 

Now try logging into the machine, with "ssh 'hadoop@datanode2'", and check in: 

  .ssh/authorized_keys 

to make sure we haven't added extra keys that you weren't expecting. 

 

[hadoop@hadoop1 ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@datanode3 

hadoop@datanode3's password: 

Now try logging into the machine, with "ssh 'hadoop@datanode3'", and check in: 

  .ssh/authorized_keys 

to make sure we haven't added extra keys that you weren't expecting. 

 

[hadoop@hadoop1 ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@localhost 

hadoop@localhost's password: 

Now try logging into the machine, with "ssh 'hadoop@localhost'", and check in: 

  .ssh/authorized_keys 

to make sure we haven't added extra keys that you weren't expecting. 
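The four copies above can also be written as a loop; a sketch that prompts for the hadoop password once per node:

for h in datanode1 datanode2 datanode3 localhost; do
    ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@$h
done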

 

 

4 Verify passwordless login

[hadoop@hadoop1 ~]$ ssh datanode1 

Last login: Thu Feb  2 09:01:16 2012 from 192.168.57.71 

[hadoop@hadoop2 ~]$ exit 

logout 

 

[hadoop@hadoop1 ~]$ ssh datanode2 

Last login: Thu Feb  2 09:01:18 2012 from 192.168.57.71 

[hadoop@hadoop3 ~]$ exit 

logout 

 

[hadoop@hadoop1 ~]$ ssh datanode3 

Last login: Thu Feb  2 09:01:20 2012 from 192.168.57.71 

[hadoop@hadoop4 ~]$ exit 

logout 

 

[hadoop@hadoop1 ~]$ ssh localhost 

Last login: Thu Feb  2 09:01:24 2012 from 192.168.57.71 

[hadoop@hadoop1 ~]$ exit 

logout 
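On a larger cluster a loop makes the same check less tedious; a sketch using BatchMode so that ssh fails instead of prompting if a key is missing:

for h in datanode1 datanode2 datanode3 localhost; do
    ssh -o BatchMode=yes hadoop@$h 'echo "$(hostname): ok"'
done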

Configure the Java environment

1 Download the appropriate JDK

// This file is the RPM package for 64-bit Linux

wget download.oracle.com/otn-pub/jav…

 

2 Install the JDK

rpm -ivh jdk-7-linux-x64.rpm 

 

3 Verify the Java installation

[root@hadoop1 ~]# java -version 

java version "1.7.0"

Java(TM) SE Runtime Environment (build 1.7.0-b147) 

Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode) 

[root@hadoop1 ~]# ls /usr/java/ 

default  jdk1.7.0  latest

 

4 Configure Java environment variables

# vim /etc/profile // Add the following lines to the end of the file:

 

#add for hadoop 
export JAVA_HOME=/usr/java/jdk1.7.0
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/
export PATH=$PATH:$JAVA_HOME/bin

 

// Make the environment variables take effect

source /etc/profile 
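A quick sanity check that the variables are now visible in the current shell (a sketch):

echo $JAVA_HOME        # expect /usr/java/jdk1.7.0
java -version          # expect the 1.7.0 banner shown above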

 

5 Copy the /etc/profile file to each datanode

[root@hadoop1 src]# scp /etc/profile root@datanode1:/etc/ 

The authenticity of host 'datanode1 (192.168.57.86)' can't be established. 

RSA key fingerprint is b5:00:d1:df:73:4c:94:f1:ea:1f:b5:cd:ed:3a:cc:e1. 

Are you sure you want to continue connecting (yes/no)? yes 

Warning: Permanently added 'datanode1,192.168.57.86' (RSA) to the list of known hosts. 

root@datanode1's password: 

profile                                       100% 1624     1.6KB/s   00:00

[root@hadoop1 src]# scp /etc/profile root@datanode2:/etc/ 

The authenticity of host 'datanode2 (192.168.57.87)' can't be established. 

RSA key fingerprint is 57:cf:96:15:78:a3:94:93:30:16:8e:66:47:cd:f9:cd. 

Are you sure you want to continue connecting (yes/no)? yes 

Warning: Permanently added 'datanode2,192.168.57.87' (RSA) to the list of known hosts. 

root@datanode2's password: 

profile                                       100% 1624     1.6KB/s   00:00

[root@hadoop1 src]# scp /etc/profile root@datanode3:/etc/ 

The authenticity of host 'datanode3 (192.168.57.88)' can't be established. 

RSA key fingerprint is 31:73:e8:3c:20:0c:1e:b2:59:5c:d1:01:4b:26:41:70. 

Are you sure you want to continue connecting (yes/no)? yes 

Warning: Permanently added 'datanode3,192.168.57.88' (RSA) to the list of known hosts. 

root@datanode3's password: 

profile                                       100% 1624     1.6KB/s   00:00

  

6 Copy the JDK installation package to each datanode and install it there

[root@hadoop1 ~]# scp -r /home/hadoop/src/ hadoop@datanode1:/home/hadoop/ 

hadoop@datanode1's password: 

hadoop-0.20.203.0rc1.tar.gz                   100%   58MB  57.8MB/s   00:01

jdk-7-linux-x64.rpm                           100%   78MB  77.9MB/s   00:01

[root@hadoop1 ~]# scp -r /home/hadoop/src/ hadoop@datanode2:/home/hadoop/ 

hadoop@datanode2's password: 

hadoop-0.20.203.0rc1.tar.gz                   100%   58MB  57.8MB/s   00:01

jdk-7-linux-x64.rpm                           100%   78MB  77.9MB/s   00:01

[root@hadoop1 ~]# scp -r /home/hadoop/src/ hadoop@datanode3:/home/hadoop/ 

hadoop@datanode3's password: 

hadoop-0.20.203.0rc1.tar.gz                   100%   58MB  57.8MB/s   00:01

jdk-7-linux-x64.rpm                           100%   78MB  77.9MB/s   00:01
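The copied RPM still has to be installed on every datanode. A sketch, assuming root logins over SSH are permitted (you will be prompted for each root password):

for h in datanode1 datanode2 datanode3; do
    ssh root@$h 'rpm -ivh /home/hadoop/src/jdk-7-linux-x64.rpm'
done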

Hadoop Configuration // Perform all of the following operations as the hadoop user

1 Configuration Directory

[hadoop@hadoop1 ~]$ pwd 

/home/hadoop 

[hadoop@hadoop1 ~]$ ll 

total 59220 

lrwxrwxrwx  1 hadoop hadoop       17 Feb  1 16:59 hadoop -> hadoop-0.20.203.0
drwxr-xr-x 12 hadoop hadoop     4096 Feb  1 17:31 hadoop-0.20.203.0
-rw-r--r--  1 hadoop hadoop 60569605 Feb  1 14:24 hadoop-0.20.203.0rc1.tar.gz 

 

 

2 Edit hadoop-env.sh to specify the Java location

vim hadoop/conf/hadoop-env.sh 

export JAVA_HOME=/usr/java/jdk1.7.0

 

3 Configure core-site.xml // Specifies the namenode of the file system

 

[hadoop@hadoop1 ~]$ cat hadoop/conf/core-site.xml 

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode:9000</value>
  </property>
</configuration>

4 Configure mapred-site.xml // Specifies the jobtracker node

 

[hadoop@hadoop1 ~]$ cat hadoop/conf/mapred-site.xml 

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>namenode:9001</value>
  </property>
</configuration>

5 Configure hdfs-site.xml // Sets the number of HDFS block replicas

  

[hadoop@hadoop1 ~]$ cat hadoop/conf/hdfs-site.xml 

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
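With dfs.replication set to 2 and three datanodes available, each block is stored on two of the three nodes (the replication factor also appears as the "2" column in the fs -lsr listings later). Once the cluster is running (step 9 below), you can confirm replication per file with fsck; a usage sketch:

bin/hadoop fsck / -files | head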

 

 

 

 

6 Configure the masters and slaves files

[hadoop@hadoop1 ~]$ cat hadoop/conf/masters 

namenode 

[hadoop@hadoop1 ~]$ cat hadoop/conf/slaves 

datanode1 

datanode2 

datanode3 

 

7 Copy the Hadoop directory to all datanodes

[hadoop@hadoop1 ~]$ scp -r hadoop hadoop@datanode1:/home/hadoop/ 

[hadoop@hadoop1 ~]$ scp -r hadoop hadoop@datanode2:/home/hadoop/ 

[hadoop@hadoop1 ~]$ scp -r hadoop hadoop@datanode3:/home/hadoop 

 

8 Format HDFS

[hadoop@hadoop1 hadoop]$ bin/hadoop namenode -format 

12/02/02 11:31:15 INFO namenode.NameNode: STARTUP_MSG: 

/************************************************************

STARTUP_MSG: Starting NameNode 

STARTUP_MSG:   host = hadoop1.test.com/127.0.0.1 

STARTUP_MSG:   args = [-format] 

STARTUP_MSG:   version = 0.20.203.0 

STARTUP_MSG: build = svn.apache.org/repos/asf/h…

************************************************************/

Re-format filesystem in /tmp/hadoop-hadoop/dfs/name ? (Y or N) Y // Enter Y here

12/02/02 11:31:17 INFO util.GSet: VM type       = 64-bit 

12/02/02 11:31:17 INFO util.GSet: 2% max memory = 19.33375 MB 

12/02/02 11:31:17 INFO util.GSet: capacity      = 2^21 = 2097152 entries 

12/02/02 11:31:17 INFO util.GSet: recommended=2097152, actual=2097152 

12/02/02 11:31:17 INFO namenode.FSNamesystem: fsOwner=hadoop 

12/02/02 11:31:18 INFO namenode.FSNamesystem: supergroup=supergroup 

12/02/02 11:31:18 INFO namenode.FSNamesystem: isPermissionEnabled=true 

12/02/02 11:31:18 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100 

12/02/02 11:31:18 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s) 

12/02/02 11:31:18 INFO namenode.NameNode: Caching file names occuring more than 10 times 

12/02/02 11:31:18 INFO common.Storage: Image file of size 112 saved in 0 seconds. 

12/02/02 11:31:18 INFO common.Storage: Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted. 

12/02/02 11:31:18 INFO namenode.NameNode: SHUTDOWN_MSG: 

/************************************************************

SHUTDOWN_MSG: Shutting down the NameNode at hadoop1.test.com/127.0.0.1

************************************************************/

[hadoop@hadoop1 hadoop]$ 
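Note from the log above that the image was written under /tmp, because no storage directories were configured. On many systems /tmp is cleared at reboot, which would wipe the namenode metadata. For anything longer-lived than a test, point hadoop.tmp.dir at a persistent directory in core-site.xml; a sketch, assuming /home/hadoop/tmp exists on every node:

<property>
  <name>hadoop.tmp.dir</name>
  <!-- assumed path; any directory that survives reboots will do -->
  <value>/home/hadoop/tmp</value>
</property>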

 

9 Start the Hadoop daemons

[hadoop@hadoop1 hadoop]$ bin/start-all.sh 

starting namenode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-namenode-hadoop1.test.com.out

datanode1: starting datanode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-datanode-hadoop2.test.com.out

datanode2: starting datanode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-datanode-hadoop3.test.com.out

datanode3: starting datanode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-datanode-hadoop4.test.com.out

starting jobtracker, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-jobtracker-hadoop1.test.com.out

datanode1: starting tasktracker, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-tasktracker-hadoop2.test.com.out

datanode2: starting tasktracker, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-tasktracker-hadoop3.test.com.out

datanode3: starting tasktracker, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-tasktracker-hadoop4.test.com.out
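If one of the daemons does not show up in jps below, its log file is the first place to look; a sketch using the namenode log path implied by the output above (each .out file has a matching .log):

tail -n 50 /home/hadoop/hadoop/logs/hadoop-hadoop-namenode-hadoop1.test.com.log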

 

10 Verification

//namenode 

[hadoop@hadoop1 logs]$ jps 

2883 JobTracker 

3002 Jps 

2769 NameNode 

 

//datanode 

[hadoop@hadoop2 ~]$ jps 

2743 TaskTracker 

2670 DataNode 

2857 Jps 

 

[hadoop@hadoop3 ~]$ jps 

2742 TaskTracker 

2856 Jps 

2669 DataNode 

 

[hadoop@hadoop4 ~]$ jps 

2742 TaskTracker 

2852 Jps 

2659 DataNode 

 

Hadoop monitoring web page

http://192.168.57.75:50070/dfshealth.jsp 
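The same health information is available from the command line; a sketch using dfsadmin, run as the hadoop user from the hadoop directory:

bin/hadoop dfsadmin -report | head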

Simple HDFS verification

The hadoop file command format is as follows:

hadoop fs -cmd <args>

// Create a directory

[hadoop@hadoop1 hadoop]$ bin/hadoop fs -mkdir /test-hadoop 

// Check the directory

[hadoop@hadoop1 hadoop]$ bin/hadoop fs -ls / 

Found 2 items 

drwxr-xr-x   - hadoop supergroup          0 2012-02-02 13:32 /test-hadoop 

drwxr-xr-x   - hadoop supergroup          0 2012-02-02 11:32 /tmp 

// View directories including subdirectories

[hadoop@hadoop1 hadoop]$ bin/hadoop fs -lsr / 

drwxr-xr-x   - hadoop supergroup          0 2012-02-02 13:32 /test-hadoop 

drwxr-xr-x   - hadoop supergroup          0 2012-02-02 11:32 /tmp 

drwxr-xr-x   - hadoop supergroup          0 2012-02-02 11:32 /tmp/hadoop-hadoop 

drwxr-xr-x   - hadoop supergroup          0 2012-02-02 11:32 /tmp/hadoop-hadoop/mapred 

drwx------   - hadoop supergroup          0 2012-02-02 11:32 /tmp/hadoop-hadoop/mapred/system 

-rw-------   2 hadoop supergroup          4 2012-02-02 11:32 /tmp/hadoop-hadoop/mapred/system/jobtracker.info 

// Add a file

[hadoop@hadoop1 hadoop]$ bin/hadoop fs -put /home/hadoop/hadoop-0.20.203.0rc1.tar.gz /test-hadoop

[hadoop@hadoop1 hadoop]$ bin/hadoop fs -lsr / 

drwxr-xr-x   - hadoop supergroup          0 2012-02-02 13:34 /test-hadoop 

-rw-r--r--   2 hadoop supergroup   60569605 2012-02-02 13:34 /test-hadoop/hadoop-0.20.203.0rc1.tar.gz

drwxr-xr-x   - hadoop supergroup          0 2012-02-02 11:32 /tmp 

drwxr-xr-x   - hadoop supergroup          0 2012-02-02 11:32 /tmp/hadoop-hadoop 

drwxr-xr-x   - hadoop supergroup          0 2012-02-02 11:32 /tmp/hadoop-hadoop/mapred 

drwx------   - hadoop supergroup          0 2012-02-02 11:32 /tmp/hadoop-hadoop/mapred/system 

-rw-------   2 hadoop supergroup          4 2012-02-02 11:32 /tmp/hadoop-hadoop/mapred/system/jobtracker.info 

// Get the file

[hadoop@hadoop1 hadoop]$ bin/hadoop fs -get /test-hadoop/hadoop-0.20.203.0rc1.tar.gz /tmp/

[hadoop@hadoop1 hadoop]$ ls /tmp/*.tar.gz 

/tmp/1.tar.gz  /tmp/hadoop-0.20.203.0rc1.tar.gz

// Delete files

[hadoop@hadoop1 hadoop]$ bin/hadoop fs -rm /test-hadoop/hadoop-0.20.203.0rc1.tar.gz

Deleted hdfs://namenode:9000/test-hadoop/hadoop-0.20.203.0rc1.tar.gz

[hadoop@hadoop1 hadoop]$ bin/hadoop fs -lsr / 

drwxr-xr-x   - hadoop supergroup          0 2012-02-02 13:57 /test-hadoop 

drwxr-xr-x   - hadoop supergroup          0 2012-02-02 11:32 /tmp 

drwxr-xr-x   - hadoop supergroup          0 2012-02-02 11:32 /tmp/hadoop-hadoop 

drwxr-xr-x   - hadoop supergroup          0 2012-02-02 11:32 /tmp/hadoop-hadoop/mapred 

drwx------   - hadoop supergroup          0 2012-02-02 11:32 /tmp/hadoop-hadoop/mapred/system 

-rw-------   2 hadoop supergroup          4 2012-02-02 11:32 /tmp/hadoop-hadoop/mapred/system/jobtracker.info 

drwxr-xr-x   - hadoop supergroup          0 2012-02-02 13:36 /user 

-rw-r--r--   2 hadoop supergroup        321 2012-02-02 13:36 /user/hadoop 

// Delete the directory

[hadoop@hadoop1 hadoop]$ bin/hadoop fs -rmr /test-hadoop 

Deleted hdfs://namenode:9000/test-hadoop 

[hadoop@hadoop1 hadoop]$ bin/hadoop fs -lsr / 

drwxr-xr-x   - hadoop supergroup          0 2012-02-02 11:32 /tmp 

drwxr-xr-x   - hadoop supergroup          0 2012-02-02 11:32 /tmp/hadoop-hadoop 

drwxr-xr-x   - hadoop supergroup          0 2012-02-02 11:32 /tmp/hadoop-hadoop/mapred 

drwx------   - hadoop supergroup          0 2012-02-02 11:32 /tmp/hadoop-hadoop/mapred/system 

-rw-------   2 hadoop supergroup          4 2012-02-02 11:32 /tmp/hadoop-hadoop/mapred/system/jobtracker.info 

drwxr-xr-x   - hadoop supergroup          0 2012-02-02 13:36 /user 

-rw-r--r--   2 hadoop supergroup        321 2012-02-02 13:36 /user/hadoop 

 

// hadoop fs help (partial output)

[hadoop@hadoop1 hadoop]$ bin/hadoop fs -help 

hadoop fs is the command to execute fs commands. The full syntax is: 

 

hadoop fs [-fs <local | file system URI>] [-conf <configuration file>] 

    [-D <property=value>] [-ls <path>] [-lsr <path>] [-du <path>] 

    [-dus <path>] [-mv <src> <dst>] [-cp <src> <dst>] [-rm [-skipTrash] <src>] 

    [-rmr [-skipTrash] <src>] [-put <localsrc> ... <dst>] [-copyFromLocal <localsrc> ... <dst>] 

    [-moveFromLocal <localsrc> ... <dst>] [-get [-ignoreCrc] [-crc] <src> <localdst>] 

    [-getmerge <src> <localdst> [addnl]] [-cat <src>] 

    [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>] [-moveToLocal <src> <localdst>] 

    [-mkdir <path>] [-report] [-setrep [-R] [-w] <rep> <path/file>] 

    [-touchz <path>] [-test -[ezd] <path>] [-stat [format] <path>] 

    [-tail [-f] <path>] [-text <path>] 

    [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...] 

    [-chown [-R] [OWNER][:[GROUP]] PATH...] 

    [-chgrp [-R] GROUP PATH...] 

    [-count[-q] <path>] 

    [-help [cmd]] 

Setting up a Hadoop environment is a tedious procedure that requires some working knowledge of Linux. Note that the environment built through the steps above is only enough to give you a general understanding of Hadoop; if you want to use HDFS for online services, you will need to tune the Hadoop configuration files further. Follow-up documents will continue to be published as blog posts, so stay tuned.