This post is by the author Atsushi Minutsu on GitChat. Due to its length, it is divided into two parts.

Preface

Hadoop plays a very important role in the big data technology stack. Hadoop is the foundation of big data technology, and a solid grasp of Hadoop fundamentals will determine how far you can go on the road of big data technology.

This is an introductory article. There are many ways to learn Hadoop, and many roadmaps are available online. The approach of this article is to use the installation and deployment of Apache Hadoop 2.x as the main thread, and along the way introduce the Hadoop 2.x architecture, how its modules work together, and the relevant technical details. Installation is not the goal; getting to know Hadoop through installation is the goal.

This article is divided into five parts, thirteen sections, and forty-nine steps.

Part ONE: Linux environment installation

Hadoop runs on Linux. Although it can also run on Windows with the help of tools, running on Linux is recommended. The first part covers installing and configuring the Linux environment and installing the Java JDK.

Part two: Hadoop local mode installation

Hadoop local mode is only used for local development and debugging, or for a quick installation to get a feel for Hadoop; it is briefly introduced in this part.

Part three: Hadoop pseudo-distributed mode installation

Learning Hadoop is generally done in pseudo-distributed mode. In this mode, each Hadoop module runs in its own process on a single machine. It is called pseudo-distributed because, although the modules run in separate processes, they all run on the same operating system, so it is not truly distributed.

Part four: Fully distributed installation

The fully distributed mode is the mode adopted in production environments, where Hadoop runs on a cluster of servers. In production environments, HA is generally also implemented to achieve high availability.

Part five: Hadoop HA Installation

HA stands for high availability. To eliminate Hadoop's single point of failure, HA is usually deployed in production environments. This part describes how to configure high availability for Hadoop 2.x and briefly explains how HA works. Throughout the installation process, brief introductions to the relevant background knowledge are interspersed. I hope this helps you.

Part ONE: Linux environment installation

Step 1: Configure the VMware NAT network

1. Introduction to VMware network modes

Reference: http://blog.csdn.net/collection4u/article/details/14127671

2. Configure the NAT mode

NAT stands for network address translation. It adds an address translation service between the host and the VMs, forwarding and translating IP addresses between the VMs and the external network.

We are deploying a Hadoop cluster, so NAT mode is selected here: each VM accesses the Internet through the host's IP address via NAT.

To ensure that every VM in the cluster has a fixed IP address and can access the Internet, perform the following operations:

1. After VMware is installed, the default NAT settings are as follows:

2. By default, the DHCP service is enabled and NAT automatically assigns IP addresses to the VMs, but we need each machine's IP address to be fixed.

3. Set a subnet segment for the machines. The default network segment is 192.168.136.

4. Click the NAT Settings button to open the dialog box for modifying the gateway address and DNS address. Here we specify the DNS address for NAT.

5. The gateway address is the .2 address in the current network segment. It appears to be fixed, so we will not change it.

Step 2: Install the Linux operating system

3. Install Linux on VMware

1. Choose Create a New Virtual Machine from the File menu.

2. Select the Typical installation, and go to the next step.

3. Select Install the OS later, and go to the next step.

4. Select Linux and CentOS 64-bit.

5. Name the VM; the name will be displayed on the left side of VMware. Select a directory on the host in which to save the VM. Each VM must be saved in its own directory; multiple VMs cannot share the same directory.

6. Specify the disk capacity to allocate to the Linux VM. The default is 20 GB.

7. Click Customize Hardware to view and modify the hardware configuration of the VM.

8. After you click Finish, the VM is created. However, it is still an empty shell without an operating system.

9. Click Edit VM Settings, locate the DVD, and specify the location of the OS ISO file.

10. Start the VM, select the first option, and press Enter to install the OS.

11. Set the root password.

12. Select Desktop so that an X Window desktop environment will be installed.

13. Do not add a normal user; use the defaults for everything else. The Linux installation is then complete.

4. Set up the network

Because the VMware NAT settings disabled DHCP's automatic IP assignment, Linux has no IP address yet, so we need to set the network parameters manually.

1. Log in to the X Window desktop as root, right-click the network connection icon in the upper-right corner, and choose Modify Connection.

2. The network connection list lists all the current network adapters in Linux. Here is only one network adapter, System eth0.

3. Configure the IP address, subnet mask, gateway (the same as the NAT gateway), and DNS. Because the NAT network segment was set to 192.168.100.*, you can set the IP address to 192.168.100.10.

4. Use ping to check whether you can connect to the Internet, as shown in the following figure.
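For reference, the same settings can also be made directly in the network interface configuration file instead of through the GUI. This is a sketch assuming CentOS 6 with a single eth0 adapter and the addresses used above; the DNS value is an assumption, so use whatever you configured in the NAT settings.

```
# /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.100.10
NETMASK=255.255.255.0
GATEWAY=192.168.100.2
DNS1=192.168.100.2
```

After editing, run `service network restart` to apply the changes.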

5. Modify Hostname

1. Temporarily change the hostname

[root@localhost Desktop]# hostname bigdata-senior01.chybinmy.com

This change becomes invalid after the system restarts.

2. Permanently change the hostname

To make this permanent, you should modify the /etc/sysconfig/network configuration file.

Command: [root@bigdata-senior01 ~] vim /etc/sysconfig/network

After opening the file, set:

HOSTNAME=bigdata-senior01.chybinmy.com

6. Configure Host

[root@bigdata-senior01 ~] vim /etc/hosts

Add to hosts:

192.168.100.10 bigdata-senior01.chybinmy.com

7. Close the firewall

In a learning environment, the firewall can simply be turned off.

(1) Log in as the root user and check the firewall status.

[root@bigdata-senior01 hadoop]# service iptables status

(2) Disable the firewall.

[root@bigdata-senior01 hadoop-2.5.0]# service iptables stop
iptables: Setting chains to ACCEPT: filter [ OK ]
iptables: Flushing firewall rules: [ OK ]
iptables: Unloading modules: [ OK ]

(3) To permanently disable the firewall:

[root@bigdata-senior01 hadoop]# chkconfig iptables off

This requires a reboot to take effect.

8. Close SELinux

SELinux is a Linux security subsystem; in a learning environment it can be disabled.
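The post does not show the commands. As a sketch, disabling SELinux permanently means setting `SELINUX=disabled` in /etc/selinux/config; the edit below is applied to a local sample copy so it is easy to inspect. On the real machine, edit /etc/selinux/config itself and reboot, or run `setenforce 0` for an immediate but non-persistent effect.

```shell
# Create a sample copy mirroring a stock /etc/selinux/config, then flip the flag.
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > selinux.conf.sample
# Set SELINUX=disabled regardless of its previous value (enforcing/permissive).
# The pattern requires '=' right after SELINUX, so SELINUXTYPE is untouched.
sed -i 's/^SELINUX=.*/SELINUX=disabled/' selinux.conf.sample
grep '^SELINUX=' selinux.conf.sample
```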

Step 3: Install the JDK

9. Install the Java JDK

1. Check whether the Java JDK has been installed.

[root@bigdata-senior01 Desktop]# java -version

Note: the JDK on your Hadoop machines should be Oracle's Java JDK; otherwise there may be issues, such as the jps command being missing. If another version of the JDK is already installed, uninstall it first.
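As a hypothetical example of that uninstall step on CentOS (package names vary, so run the query first and substitute the names it reports):

```
[root@bigdata-senior01 Desktop]# rpm -qa | grep -i java
[root@bigdata-senior01 Desktop]# rpm -e --nodeps <package-name-from-the-query>
```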

2. Install the Java JDK

(1) Download the JDK installation package: jdk-7u67-linux-x64.tar.gz

(2) Decompress jdk-7u67-linux-x64.tar.gz to /opt/modules.

[root@bigdata-senior01 /]# tar -zxvf jdk-7u67-linux-x64.tar.gz -C /opt/modules

(3) Add environment variables

Set the JDK environment variable JAVA_HOME by modifying the /etc/profile file and adding:

export JAVA_HOME="/opt/modules/jdk1.7.0_67"
export PATH=$JAVA_HOME/bin:$PATH

After the modification, run source /etc/profile to apply it.

(4) After the installation, run java -version again and you can see that the installation is complete.

Part two: Hadoop local mode installation

Step 4: Hadoop deployment mode

Hadoop can be deployed in local, pseudo-distributed, fully distributed, and fully distributed HA modes.

The modes differ in how many JVM processes modules such as NameNode, DataNode, ResourceManager, and NodeManager occupy, and on how many machines they run:

| Mode | JVM processes occupied by the modules | Machines the modules run on |
| --- | --- | --- |
| Local mode | 1 | 1 |
| Pseudo-distributed mode | N | 1 |
| Fully distributed mode | N | N |
| HA fully distributed mode | N | N |

Step 5: Local mode deployment

10. Local mode introduction

Local mode is the simplest mode. All modules run in the same JVM process, using a local file system instead of HDFS. Local mode is mainly used for running debugging during local development. After downloading the Hadoop installation package, you don’t need to set anything. The default is local mode.

Hadoop can be used directly after decompression.

11. Install Hadoop in local mode

1. Create a directory for storing hadoop in local mode

[hadoop@bigdata-senior01 modules]$ mkdir /opt/modules/hadoopstandalone

2. Decompress the Hadoop file

3. Ensure that the JAVA_HOME environment variable has been configured

[hadoop@bigdata-senior01 modules]$ echo ${JAVA_HOME}
/opt/modules/jdk1.7.0_67
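Step 2 above does not show the command. A hypothetical session, assuming the hadoop-2.5.0 release tarball (the version that appears in later prompts) has been downloaded to /opt/modules:

```
[hadoop@bigdata-senior01 modules]$ tar -zxf hadoop-2.5.0.tar.gz -C /opt/modules/hadoopstandalone
```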

12. Run the MapReduce program and verify

Here we use the WordCount example that comes with Hadoop to test MapReduce in local mode.

1. Prepare the MapReduce input file wc.input

[hadoop@bigdata-senior01 modules]$ cat /opt/data/wc.input
hadoop mapreduce hive
hbase spark storm
sqoop hadoop hive
spark hadoop

2. Run the MapReduce demo that ships with Hadoop
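The command itself is not shown in the text. A typical invocation of the bundled examples jar, assuming Hadoop 2.5.0 decompressed as above and run from the Hadoop directory (the input and output paths here are illustrative):

```
[hadoop@bigdata-senior01 hadoop-2.5.0]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount /opt/data/wc.input /opt/data/output
```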

You can see that the job ID contains the word "local", indicating that the job is running in local mode.

3. View the output file

In local mode, MapReduce output is written to the local file system.

If the _SUCCESS file is present in the output directory, the job ran successfully. part-r-00000 is the output result file.
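As a quick sanity check of what part-r-00000 should contain, the same word counts can be reproduced locally with standard tools; this is just an illustration, not part of the installation.

```shell
# Recreate the sample input, then split into words and count them,
# key-sorted the way WordCount's reducer output is.
printf 'hadoop mapreduce hive\nhbase spark storm\nsqoop hadoop hive\nspark hadoop\n' > wc.input
tr ' ' '\n' < wc.input | sort | uniq -c | awk '{print $2 "\t" $1}'
# prints (tab-separated): hadoop 3, hbase 1, hive 2, mapreduce 1, spark 2, sqoop 1, storm 1
```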

[Unfinished; please wait for tomorrow's article]

Part three: Hadoop pseudo-distributed mode installation