What is Hadoop?

Hadoop has three core components.

Distributed file system: HDFS — distributed storage of files across many servers.

HDFS uses a master/slave architecture and is mainly composed of four parts: the HDFS Client, the NameNode, the DataNode, and the Secondary NameNode. Let's take a look at each of these four components.

Client: the HDFS client.

1. File segmentation: when uploading a file to HDFS, the Client splits the file into blocks before storing it.
2. Interact with the NameNode to obtain file location information.
3. Interact with DataNodes to read or write data.
4. Provide commands to manage HDFS, such as starting or stopping HDFS.
5. Provide commands to access HDFS.
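
For instance, a minimal sketch of typical client interactions from the command line (the paths and file names are hypothetical):

hadoop fs -put ./a.txt /data/    # upload: the client splits the file into blocks and writes them to DataNodes
hadoop fs -cat /data/a.txt       # read: the client asks the NameNode for block locations, then reads from DataNodes
hdfs dfsadmin -report            # management command: print a status report of the cluster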

NameNode: It’s a master, it’s a director, it’s a manager.

1. Manage the HDFS namespace.
2. Manage block mapping information.
3. Configure replica policies.
4. Process read and write requests from clients.

DataNode: the slave. The NameNode issues commands, and the DataNode performs the actual operations.

1. Store the actual data blocks.
2. Perform read/write operations on data blocks.

Secondary NameNode: not a hot standby for the NameNode. When the NameNode dies, it cannot immediately take over and provide services.

1. Assist the NameNode and share part of its workload.
2. Periodically merge the fsimage and edits files and push the result to the NameNode.
3. In an emergency, it can assist in recovering the NameNode.
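
For reference, the interval that controls how often this merge (checkpoint) happens can be tuned in hdfs-site.xml; 3600 seconds is the default, and the value below is only an illustration:

<property>
<name>dfs.namenode.checkpoint.period</name>
<value>3600</value>
</property>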

Distributed computing programming framework: MapReduce — implements distributed parallel computing across many machines

Major components of MapReduce programming

The InputFormat class splits the input data into a number of splits and parses each split into <key,value> pairs for the Mapper.

Mapper class: produces intermediate results for each pair of <key,value> input.

Combiner class: Merges the same key on the Map end.

Partitioner class: during the shuffle, the intermediate results are divided into R parts based on the key; each part is handled by one Reducer.

Reducer class: Merge all intermediate map results.

The OutputFormat class is responsible for formatting and writing the final output.
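
As a quick illustration of these pieces working together, the wordcount example shipped with Hadoop can be submitted from the shell. The input/output paths are hypothetical, and the jar path assumes the /root/apps/hadoop-2.8.0 installation directory used later in this article:

hadoop jar /root/apps/hadoop-2.8.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar wordcount /wordcount/input /wordcount/output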

Distributed resource scheduling platform: YARN — helps users schedule a large number of MapReduce programs and allocate computing resources appropriately
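
This article does not walk through installing YARN, but as a rough sketch, enabling it typically involves setting the two properties below in yarn-site.xml (plus mapreduce.framework.name=yarn in mapred-site.xml) and running start-yarn.sh on the ResourceManager host; using hdp-01 as that host is only an assumption:

<property>
<name>yarn.resourcemanager.hostname</name>
<value>hdp-01</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>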

II. HDFS Cluster Structure:

3. Procedure for installing the HDFS cluster:

First, you need to prepare N Linux servers

Prepare four VMs: one NameNode and three DataNodes

2. Change the host name and IP address of each host

Host name: hdp-01 IP address: 192.168.33.61

Host name: hdp-02 IP address: 192.168.33.62

Host name: hdp-03 IP address: 192.168.33.63

Host name: hdp-04 IP address: 192.168.33.64
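
On CentOS 7 (which the firewalld commands further down imply), a minimal sketch of how the first machine could be configured; the network interface name ens33 is only an assumption:

hostnamectl set-hostname hdp-01
vi /etc/sysconfig/network-scripts/ifcfg-ens33   # set IPADDR=192.168.33.61, BOOTPROTO=static, ONBOOT=yes
systemctl restart network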

In Windows, add the host name of each Linux machine to the local domain name mapping file (C:\Windows\System32\drivers\etc\hosts):

192.168.33.61 hdp-01

192.168.33.62 hdp-02

192.168.33.63 hdp-03

192.168.33.64 hdp-04

Configure the basic software environment of the Linux server

Turn off the firewall:

# Temporarily stop the firewall
systemctl stop firewalld

# Prevent the firewall from starting at boot
systemctl disable firewalld
Removed symlink /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
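
An optional check to confirm the firewall really is down:

systemctl status firewalld    # should now report inactive (dead)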

Install the JDK (all software in the Hadoop ecosystem is developed in Java):

1) Use Alt + P to open the SFTP window and drag the JDK package to the SFTP window

2) Decompress the JDK package to /root/apps in Linux

3) Configure the JAVA_HOME and PATH environment variables

vi /etc/profile and add at the end of the file:

export JAVA_HOME=/root/apps/jdk1.8.0_60

export PATH=$PATH:$JAVA_HOME/bin

4) After the modification, remember to run source /etc/profile for the configuration to take effect

5) Check: run java -version in any directory to verify that the command executes successfully

6) Copy the installed JDK directory to the other machines using the scp command

7) Copy the /etc/profile configuration file to the other machines using the scp command and run source /etc/profile on each of them
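
A rough sketch for steps 6) and 7), taking hdp-02 as an example (repeat for hdp-03 and hdp-04):

scp -r /root/apps/jdk1.8.0_60 hdp-02:/root/apps/
scp /etc/profile hdp-02:/etc/profile
# then log in to hdp-02 and run: source /etc/profile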

Configure domain name mapping for hosts in the cluster

On hdp-01, run the vi /etc/hosts command and edit the file as follows:

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4

::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.33.61 hdp-01

192.168.33.62 hdp-02

192.168.33.63 hdp-03

192.168.33.64 hdp-04

Then, copy the hosts file to all the other machines in the cluster
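
For example, pushing it to each node with scp:

scp /etc/hosts hdp-02:/etc/hosts
scp /etc/hosts hdp-03:/etc/hosts
scp /etc/hosts hdp-04:/etc/hosts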

4. Install the HDFS cluster

1) Modify hadoop-env.sh

export JAVA_HOME=/root/apps/jdk1.8.0_60

2) Modify core-site.xml

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hdp-01:9000</value>
</property>
</configuration>

3) Modify hdfs-site.xml

<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/root/dfs/name</value>
</property>

<property>
<name>dfs.datanode.data.dir</name>
<value>/root/dfs/data</value>
</property>
</configuration>
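
Optionally, the block replication factor can also be set in hdfs-site.xml; 3 is already the default, so the snippet below is only an illustration:

<property>
<name>dfs.replication</name>
<value>3</value>
</property>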

4) Copy the entire Hadoop installation directory to the other machines

scp -r /root/apps/hadoop-2.8.0 hdp-02:/root/apps/
scp -r /root/apps/hadoop-2.8.0 hdp-03:/root/apps/
scp -r /root/apps/hadoop-2.8.0 hdp-04:/root/apps/

5) Start HDFS

Note: To run hadoop commands, you need to configure the HADOOP_HOME and PATH environment variables in the Linux environment

vi /etc/profile

export JAVA_HOME=/root/apps/jdk1.8.0_60
export HADOOP_HOME=/root/apps/hadoop-2.8.0
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

First, initialize the metadata directory for namenode

To initialize the namenode metadata storage directory, run the following hadoop command on hdp-01:

hadoop namenode -format

1. Create a new metadata storage directory

2. Generate the fsimage file that records metadata

3. Generate a cluster ID (clusterID)

Then, start the Namenode process (on HDP-01)

hadoop-daemon.sh start namenode

After starting, first use the jps command to check whether the NameNode process exists

Then, in Windows, use a browser to access the Web port provided by Namenode: 50070

http://hdp-01:50070 (if the page cannot be opened, check the host name mapping configured earlier in the Windows hosts file)

Then, start the datanodes (on each datanode machine)

hadoop-daemon.sh start datanode
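
For example, assuming the PATH settings above so that hadoop-daemon.sh from $HADOOP_HOME/sbin is found:

# run on each of hdp-02, hdp-03 and hdp-04 (and on hdp-01 too if it should also store data)
hadoop-daemon.sh start datanode
jps    # a DataNode process should now be listed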

6) Start HDFS with the automated batch startup scripts

1) Configure hdp-01 so that it can log in to every machine in the cluster (including itself) without a password; a sketch of the key setup is shown after step 2)

2) After the configuration is complete, run ssh 0.0.0.0 to verify (and accept the host key for) password-free login to the local machine
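
A minimal sketch of the key setup on hdp-01 (press Enter to accept the defaults when generating the key):

ssh-keygen -t rsa
ssh-copy-id hdp-01
ssh-copy-id hdp-02
ssh-copy-id hdp-03
ssh-copy-id hdp-04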

3) Modify etc/hadoop/slaves in the Hadoop installation directory (list the datanode hosts):

hdp-01
hdp-02
hdp-03
hdp-04

4) Run the start-dfs.sh script on HDP-01 to automatically start the cluster

5) If you want to stop, use the script: stop-dfs.sh

Namenode responsibilities:

  • 1. Maintain metadata information.
  • 2. Maintain the HDFS directory tree.
  • 3. Respond to client requests.

5. Shell operations in HDFS

Command format: hadoop fs -ls <file path>

Common shell operation commands in HDFS:
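
For example (the file and directory names below are hypothetical):

hadoop fs -mkdir -p /wordcount/input        # create a directory
hadoop fs -put ./a.txt /wordcount/input     # upload a local file
hadoop fs -ls /wordcount/input              # list a directory
hadoop fs -cat /wordcount/input/a.txt       # print a file's content
hadoop fs -get /wordcount/input/a.txt ./    # download a file
hadoop fs -rm -r /wordcount                 # delete a directory recursively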