Preface

Hadoop plays an important role in the big data technology stack; it is the foundation on which much of that stack is built.

This article documents my own learning. There are many ways to learn Hadoop, and many learning roadmaps can be found on the web.

Introduction to Hadoop

Hadoop is a system architecture for the distributed processing of massive amounts of data. I use Hadoop 2.8.0, which mainly contains three components:

  • HDFS: Hadoop Distributed File System, the distributed storage layer
  • MapReduce: the distributed computing framework, the distributed computing layer
  • YARN: Yet Another Resource Negotiator, the cluster resource management layer

The core of Hadoop is HDFS and MapReduce.

Distributed storage system HDFS

HDFS (Hadoop Distributed File System) is a highly fault-tolerant system suitable for deployment on inexpensive machines. It provides high-throughput data access and suits applications with large data sets. HDFS is responsible for storing big data: by dividing large files into blocks and storing them in a distributed manner, it overcomes the size limit of a single server's disks and solves the problem that a single machine cannot store very large files. HDFS is an independent module that provides storage services for YARN and for other components such as HBase.

Core components

  • NameNode
  • DataNode
  • SecondaryNameNode (NameNode snapshot)

HDFS has a master-slave architecture. An HDFS cluster consists of one NameNode and multiple DataNodes.

Advantages of HDFS (Design idea)

1. High fault tolerance. HDFS places copies of each file block on several different hosts to guard against machine failure. If one host fails, HDFS quickly reads the block from another copy. Data is automatically replicated to multiple nodes, and lost replicas are automatically restored.

2. Massive data storage. HDFS is well suited to storing TB-scale files or large collections of big data files.

3. Block storage. HDFS splits a large file into fixed-size blocks (128 MB by default in Hadoop 2.x) and stores them on different machines, so different blocks of the same file can be read from multiple hosts at the same time.

4. Moving computation to the data. Calculations are performed where the data is stored rather than pulling the data to where it is computed, which reduces cost and improves performance.

5. Streaming access. Data is written once and read many times, in parallel. HDFS does not support modifying existing file content: a file is written once, and afterwards content can only be appended at the end.

6. Runs on cheap machines. Reliability is achieved through multiple replicas, which provide fault tolerance and recovery. HDFS can run on ordinary PCs, allowing companies to build big data clusters out of dozens of inexpensive computers.

NameNode

Role

  • Manages the file system namespace
  • Coordinates client access to files
  • Records the block locations and replica information of each file on the DataNodes

File parsing

  • VERSION: a properties file that stores the version information of HDFS
  • EditLog: records every operation performed on the file system data
  • fsimage / fsimage.md5: a persistent checkpoint of the file system metadata, including the block-to-file mapping, file attributes, and so on
  • seen_txid: very important; the file that holds the latest transaction ID

What are FsImage and EditLog

FsImage and EditLog are the core data structures of HDFS. Corruption of these files can cause the entire cluster to fail, so the NameNode can be configured to keep multiple copies of FsImage and EditLog; any update to FsImage or EditLog is synchronized to every copy.

SecondaryNameNode

Role

  • A snapshot of the NameNode; periodically backs up the NameNode
  • Records the metadata and other data held by the NameNode
  • Can be used to restore the NameNode, but does not replace it!

Execution process

  1. SecondaryNameNode periodically communicates with the NameNode, asking it to stop writing to the current EditLog and temporarily redirect new write operations to a new file, edit.new. This switch is instantaneous. (How often this checkpoint happens is configurable; see the sketch after this list.)
  2. SecondaryNameNode obtains the FsImage and EditLog files from the NameNode via HTTP GET and downloads them to a local directory.
  3. It loads FsImage and EditLog into memory and merges them; this merge of FsImage and EditLog is the checkpoint.
  4. After the merge succeeds, the new FsImage file is sent back to the NameNode via HTTP POST.
  5. The NameNode replaces the old FsImage with the newly received one, and replaces the EditLog with edit.new, so that the EditLog stays small.
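
A small hdfs-site.xml sketch for tuning the checkpoint interval, assuming the Hadoop 2.x property names dfs.namenode.checkpoint.period (seconds between checkpoints) and dfs.namenode.checkpoint.txns (a transaction count that also triggers a checkpoint); the values shown are the usual defaults:

<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>
</property>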

DataNode

Role

  • Stores and manages the actual data blocks
  • Write once, read many times (no modification)
  • Files are made up of data blocks; in Hadoop 2.x the block size is 128 MB by default
  • Data blocks are spread across nodes as evenly as possible

File parsing

  • blk_<id>: an HDFS data block, storing the binary data
  • blk_<id>.meta: attribute information about a data block, including version and type information

You can set the number of replicas by modifying the dfs.replication property in hdfs-site.xml. The default is 3.
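
The replication factor of data already stored in HDFS can also be changed per path from the command line; a small sketch using the standard hdfs dfs -setrep command:

hdfs dfs -setrep -w 2 {HDFS file path}   # set the replication factor to 2 and wait until re-replication finishes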

Basic operation
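
A few basic HDFS shell commands, as a quick sketch (the paths are just examples):

hdfs dfs -ls /                  # list the HDFS root directory
hdfs dfs -mkdir -p /demo        # create a directory
hdfs dfs -put local.txt /demo   # upload a local file
hdfs dfs -cat /demo/local.txt   # print a file's contents
hdfs dfs -get /demo/local.txt . # download a file to the current directory
hdfs dfs -rm -r /demo           # remove a directory recursively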

MapReduce

MapReduce is the distributed processing framework that provides computing over massive amounts of data. It is a computing framework offering a data processing method in which data flows through the Map and Reduce phases and is processed in a distributed, streaming fashion. It is only suitable for offline processing of big data, not for applications that require real-time results.

Introduction

  • MapReduce is a distributed computing model proposed by Google, mainly used in the search field to solve the problem of computing over massive amounts of data.
  • An MR job consists of two phases, Map and Reduce. Users only need to implement the map() and reduce() functions to perform distributed computing.

MapReduce execution process

The MapReduce principle (diagram)

Procedure for executing MapReduce

  1. Read files from HDFS. Each line is parsed into a <k,v> pair, where the key is the byte offset of the line and the value is the line's content; the map function is called once for each pair. – <0,hello you> <10,hello me>
  2. Override map(), receive the <k,v> pair produced in step 1, process it, and emit new <k,v> pairs. – <hello,1> <you,1> <hello,1> <me,1>
  3. Partition the <k,v> output of step 2. By default there is a single partition.
  4. Within each partition, sort the data by key and group it: all values for the same key are collected into one set. – <hello,{1,1}> <me,{1}> <you,{1}>
  5. (Optional) Combine (locally reduce) the grouped data.
  6. The output of each Map task is copied over the network to the corresponding Reduce nodes according to its partition (shuffle).
  7. Merge and sort the outputs of the different maps. Override the reduce function, receive the grouped data, apply your own business logic, and produce new <k,v> output. – <hello,2> <me,1> <you,1>
  8. Write the <k,v> output of reduce to HDFS.
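
To make the word-count flow above concrete, the same data movement can be sketched with ordinary shell commands: tr plays the role of map (emit one word per line), sort plays the role of the sort/group step, and uniq -c plays the role of reduce. This is only an illustration of the data flow, not how Hadoop actually executes a job:

# create a tiny sample input (the two lines from the example above)
printf 'hello you\nhello me\n' > wc.input

# "map" -> "sort/group" -> "reduce"
tr ' ' '\n' < wc.input | sort | uniq -c
#   2 hello
#   1 me
#   1 you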

Summary of MapReduce principles

A primitive typically consists of a few instructions that implement a particular operation as an indivisible, uninterruptible program resident in memory. If the cost of executing such a primitive exceeds what one machine can afford, it simply cannot run on that machine, and the traditional remedy was to keep upgrading the machine, with the cost roughly doubling at each step up in configuration. MapReduce, built on a distributed file system, takes a different route: it splits the operation into two steps, map() and reduce(), and these two phases can be broken into modules that execute on different machines.

Getting started

  1. Upload hadoop.tar.gz to the server and decompress it

  2. Configure Hadoop environment variables
vi /etc/profile

Add the configuration

export HADOOP_HOME=/opt/modules/hadoop-2.8.0
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

source /etc/profile

Verify the configuration by checking the variable:

echo ${HADOOP_HOME}

  3. Configure the hadoop-env.sh, mapred-env.sh, and yarn-env.sh files

In each of these files, change JAVA_HOME to the Java installation path.

How to view the Java installation path:
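
One way to find it, assuming java is already on the PATH (the JDK path below is only an example):

which java                 # where the java command lives
readlink -f $(which java)  # resolve symlinks to the real path; JAVA_HOME is the directory above bin/
echo $JAVA_HOME            # if JAVA_HOME is already set in the environment, reuse it

# then, in etc/hadoop/hadoop-env.sh (and the other two files), set for example:
# export JAVA_HOME=/opt/modules/jdk1.8.0_121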

4. Configure core-site.xml

The following operations are performed in the directory where Hadoop was decompressed.

vi etc/hadoop/core-site.xml

Where:

  • fs.defaultFS – specifies the address of HDFS, i.e. the name of the default file system: a URI whose scheme and authority determine the FileSystem implementation. The URI's scheme determines the config property naming the FileSystem implementation class (fs.SCHEME.impl), and the URI's authority determines the host, port, and so on of the file system. In local (pseudo-distributed) mode, the configuration can be as follows:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

In full distributed mode, the configuration can be as follows:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/home/hadoop/tmp</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hduser.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hduser.groups</name>
    <value>*</value>
  </property>
</configuration>
  • hadoop.tmp.dir – specifies Hadoop's temporary directory; for example, the NameNode data of HDFS is stored there by default. The default value is /tmp/hadoop-${user.name}, which means the NameNode would keep HDFS metadata under /tmp. If the OS restarts, /tmp is cleared, and losing NameNode metadata is a very serious problem, so this path should be changed.

More configuration properties can be found here

5. Configure hdfs-site.xml

vi etc/hadoop/hdfs-site.xml

dfs.replication – the number of replicas HDFS keeps for each block. Set it to 1 here because this is a pseudo-distributed environment with only one node.
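
A minimal hdfs-site.xml for this single-node setup might therefore look like the following sketch (only dfs.replication is changed):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>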

  6. Format HDFS
hdfs namenode -format

Formatting lays out the block storage for the DataNodes of HDFS, the distributed file system; the initial metadata generated by this partitioning is stored in the NameNode. After formatting, check whether a dfs directory exists under the hadoop.tmp.dir specified in core-site.xml (/opt/data in this example). If it does, formatting succeeded.

ll /opt/data/tmp

Note:

  • When formatting, pay attention to the permissions on the hadoop.tmp.dir directory: the ordinary hadoop user should have read and write permission. You can change the owner of /opt/data to the hadoop user:
sudo chown -R hadoop:hadoop /opt/data

View the NameNode directory created by formatting:

ll /opt/data/tmp/dfs/name/current

  • fsimage – a persistent checkpoint file of the file system metadata that the NameNode holds in memory.
  • fsimage*.md5 – a checksum file used to verify the integrity of fsimage.
  • seen_txid – holds the ID of the latest transaction.
  • VERSION – holds namespaceID (the unique ID of the NameNode) and clusterID (the cluster ID); the clusterIDs of the NameNode and the DataNodes should be the same.

  7. Start NameNode, DataNode, and SecondaryNameNode, for example with the commands sketched below.
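
A sketch using the daemon scripts shipped in Hadoop 2.x's sbin directory (already on the PATH from the environment configuration above):

hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode
hadoop-daemon.sh start secondarynamenode
jps   # NameNode, DataNode, and SecondaryNameNode should appear in the output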

  8. Create directories, upload, and download files on HDFS
  • Create a directory
hdfs dfs -mkdir /demol

Directories created in HDFS cannot be viewed with the local ls command. To view them, run:

hdfs dfs -ls /

  • Upload local files to HDFS
hdfs dfs -put {local file path} /demol

When I uploaded the file, I was told that the DataNode was not running. Checking with the jps command showed that the DataNode had shut itself down. The cause was that the clusterID of the DataNode did not match that of the NameNode.

First check hdfs-site.xml and core-site.xml for mistakes. If they are correct, compare the clusterID in the VERSION file under tmp/dfs/name with the one under tmp/dfs/data. If they differ, delete the DataNode's current directory and restart, or edit the clusterID so that the two match.
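
A quick way to compare the two IDs, assuming hadoop.tmp.dir points at /opt/data/tmp as in this article:

grep clusterID /opt/data/tmp/dfs/name/current/VERSION   # NameNode's clusterID
grep clusterID /opt/data/tmp/dfs/data/current/VERSION   # DataNode's clusterID; the two must match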

  • Download the file to the local PC
hdfs dfs -get {HDFS file path}

  9. Configure mapred-site.xml and yarn-site.xml

There is no mapred-site.xml file by default, only a mapred-site.xml.template configuration template. Copy the template to generate mapred-site.xml:

cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml 

Add the following configuration:

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

Specifies that MapReduce runs on the YARN framework.

Configure yarn-site.xml and add the following configuration:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>0.0.0.0</value>
</property>
  • yarn.nodemanager.aux-services – specifies the auxiliary service run by the NodeManager; setting it to mapreduce_shuffle enables MapReduce's default shuffle service.
  • yarn.resourcemanager.hostname – specifies which node the ResourceManager runs on.

View more Properties

  10. Start ResourceManager and NodeManager
yarn-daemon.sh start resourcemanager
yarn-daemon.sh start nodemanager

Check with the jps command whether both started successfully.

  11. Run a MapReduce job

Hadoop's share directory ships with some jars that contain small MapReduce examples, located at share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar. The classic word-count example is run here.

Create the input file for the test
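
For example (the content just mirrors the two sample lines used in the MapReduce walkthrough above, and /opt/data is only the directory used elsewhere in this article):

printf 'hello you\nhello me\n' > /opt/data/wc.input
cat /opt/data/wc.input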

Create an input directory and upload the files to it

hdfs dfs -mkdir -p /wordcount/input
hdfs dfs -put /opt/data/wc.input /wordcount/input

Run the wordcount MapReduce job:

yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar wordcount /wordcount/input /wordcount/output

View the resulting directory

hdfs dfs -ls /wordcount/output

  • There are two files in the output directory. The _SUCCESS file is empty; if it exists, the job completed successfully.

  • The part-r-00000 file is the result file. The -r- indicates that it was produced by the Reduce phase. A MapReduce program may have no Reduce phase, but it must have a Map phase; if there is no Reduce phase, the file name contains -m- instead.

  • Each reduce task produces one file whose name starts with part-r-.

View the result file
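
For example, with hdfs dfs -cat:

hdfs dfs -cat /wordcount/output/part-r-00000   # prints each word and its count, separated by a tab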

Conclusion

  • HDFS – stores big data files. Files are saved in Hadoop's binary block format and cannot be viewed directly, which helps with security; the directories are essentially virtual directories.
  • YARN – a resource management and task scheduling framework on which MapReduce runs; it can also run Storm, Spark, and so on.
  • MapReduce – a computing framework that provides a data processing scheme and uses YARN to run the program across multiple machines. If each machine is regarded as a CPU, the program must be parallelizable without corrupting the data.