Preface

Hadoop plays an important role in the big data technology stack; it is the foundation on which much of that stack is built.

This article documents my own learning. There are many ways to learn Hadoop, and many learning roadmaps can be found on the web.

Introduction to Hadoop

Hadoop is a system architecture for the distributed processing of massive amounts of data. I use Hadoop 2.8.0, which mainly contains three components:

  • HDFS: Hadoop Distributed File System, the distributed storage layer
  • MapReduce: the distributed computing framework, the distributed computing layer
  • YARN: Yet Another Resource Negotiator, the cluster resource management layer

The core of Hadoop is HDFS and MapReduce.

Distributed storage system HDFS

HDFS (Hadoop Distributed File System) is a highly fault-tolerant system suitable for deployment on inexpensive machines. It provides high-throughput data access and suits applications with large data sets. HDFS is responsible for storing big data: by dividing large files into blocks and storing them in a distributed manner, it overcomes the size limit of a single server's disks and solves the problem that a single machine cannot store very large files. HDFS is an independent module that provides storage services for YARN and for other components such as HBase.

Core components

  • NameNode
  • DataNode
  • SecondaryNameNode (NameNode snapshot)

HDFS has a master-slave architecture. An HDFS cluster consists of one NameNode and multiple DataNodes.

Advantages of HDFS (Design idea)

1. High fault tolerance. HDFS places copies of each file block on several different hosts to guard against machine failure. If one host fails, HDFS quickly reads the block from another copy. Data is automatically replicated to multiple nodes, and lost replicas are automatically restored.

2. Massive data storage. HDFS is well suited to storing TB-scale files or large collections of big data files.

3. Block storage. HDFS splits a large file into fixed-size blocks (128 MB by default in Hadoop 2.x) and stores them on different machines, so different blocks of the same file can be read from multiple hosts at the same time.

4. Moving computation to the data. Calculations are performed where the data is stored rather than pulling the data to where it is computed, which reduces cost and improves performance.

5. Streaming access. Data is written once and read many times, in parallel. HDFS does not support modifying existing file content: a file is written once, and afterwards content can only be appended at the end.

6. Runs on cheap machines. Reliability is achieved through multiple replicas, which provide fault tolerance and recovery. HDFS can run on ordinary PCs, allowing companies to build big data clusters out of dozens of inexpensive computers.

NameNode

Role

  • Manages the file system namespace
  • Coordinates client access to files
  • Records the block locations and replica information of each file on the DataNodes

File parsing

  • VERSION: a properties file that stores the version information of HDFS
  • EditLog: records every operation performed on the file system data
  • fsimage / fsimage.md5: a persistent checkpoint of the file system metadata, including the block-to-file mapping, file attributes, and so on
  • seen_txid: very important; the file that holds the latest transaction ID

What are FsImage and EditLog

FsImage and EditLog are the core data structures of HDFS. Corruption of these files can cause the entire cluster to fail, so the NameNode can be configured to keep multiple copies of FsImage and EditLog; any update to FsImage or EditLog is synchronized to every copy.

SecondaryNameNode

Role

  • A snapshot of the NameNode; periodically backs up the NameNode
  • Records the metadata and other data held by the NameNode
  • Can be used to restore the NameNode, but does not replace it!

Execution process

  1. SecondaryNameNode periodically communicates with the NameNode, asking it to stop writing to the current EditLog and temporarily redirect new write operations to a new file, edit.new. This switch is instantaneous. (How often this checkpoint happens is configurable; see the sketch after this list.)
  2. SecondaryNameNode obtains the FsImage and EditLog files from the NameNode via HTTP GET and downloads them to a local directory.
  3. It loads FsImage and EditLog into memory and merges them; this merge of FsImage and EditLog is the checkpoint.
  4. After the merge succeeds, the new FsImage file is sent back to the NameNode via HTTP POST.
  5. The NameNode replaces the old FsImage with the newly received one, and replaces the EditLog with edit.new, so that the EditLog stays small.
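
A small hdfs-site.xml sketch for tuning the checkpoint interval, assuming the Hadoop 2.x property names dfs.namenode.checkpoint.period (seconds between checkpoints) and dfs.namenode.checkpoint.txns (a transaction count that also triggers a checkpoint); the values shown are the usual defaults:

<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>
</property>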

DataNode

Role

  • Stores and manages the actual data blocks
  • Write once, read many times (no modification)
  • Files are made up of data blocks; in Hadoop 2.x the block size is 128 MB by default
  • Data blocks are spread across nodes as evenly as possible

File parsing

  • blk_<id>: an HDFS data block, storing the binary data
  • blk_<id>.meta: attribute information about a data block, including version and type information

You can set the number of replicas by modifying the dfs.replication property in hdfs-site.xml. The default is 3.
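
The replication factor of data already stored in HDFS can also be changed per path from the command line; a small sketch using the standard hdfs dfs -setrep command:

hdfs dfs -setrep -w 2 {HDFS file path}   # set the replication factor to 2 and wait until re-replication finishes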

Basic operation
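
A few basic HDFS shell commands, as a quick sketch (the paths are just examples):

hdfs dfs -ls /                  # list the HDFS root directory
hdfs dfs -mkdir -p /demo        # create a directory
hdfs dfs -put local.txt /demo   # upload a local file
hdfs dfs -cat /demo/local.txt   # print a file's contents
hdfs dfs -get /demo/local.txt . # download a file to the current directory
hdfs dfs -rm -r /demo           # remove a directory recursively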

MapReduce

MapReduce is the distributed processing framework that provides computing over massive amounts of data. It is a computing framework offering a data processing method in which data flows through the Map and Reduce phases and is processed in a distributed, streaming fashion. It is only suitable for offline processing of big data, not for applications that require real-time results.

Introduction

  • MapReduce is a distributed computing model proposed by Google, mainly used in the search field to solve the problem of computing over massive amounts of data.
  • An MR job consists of two phases, Map and Reduce. Users only need to implement the map() and reduce() functions to perform distributed computing.

MapReduce execution process

The MapReduce principle (diagram)

Procedure for executing MapReduce

  1. Read files from HDFS. Each line is parsed into a <k,v> pair, where the key is the byte offset of the line and the value is the line's content; the map function is called once for each pair. – <0,hello you> <10,hello me>
  2. Override map(), receive the <k,v> pair produced in step 1, process it, and emit new <k,v> pairs. – <hello,1> <you,1> <hello,1> <me,1>
  3. Partition the <k,v> output of step 2. By default there is a single partition.
  4. Within each partition, sort the data by key and group it: all values for the same key are collected into one set. – <hello,{1,1}> <me,{1}> <you,{1}>
  5. (Optional) Combine (locally reduce) the grouped data.
  6. The output of each Map task is copied over the network to the corresponding Reduce nodes according to its partition (shuffle).
  7. Merge and sort the outputs of the different maps. Override the reduce function, receive the grouped data, apply your own business logic, and produce new <k,v> output. – <hello,2> <me,1> <you,1>
  8. Write the <k,v> output of reduce to HDFS.
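
To make the word-count flow above concrete, the same data movement can be sketched with ordinary shell commands: tr plays the role of map (emit one word per line), sort plays the role of the sort/group step, and uniq -c plays the role of reduce. This is only an illustration of the data flow, not how Hadoop actually executes a job:

# create a tiny sample input (the two lines from the example above)
printf 'hello you\nhello me\n' > wc.input

# "map" -> "sort/group" -> "reduce"
tr ' ' '\n' < wc.input | sort | uniq -c
#   2 hello
#   1 me
#   1 you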

Summary of MapReduce principles

A primitive typically consists of a few instructions that implement a particular operation as an indivisible, uninterruptible program resident in memory. If the cost of executing such a primitive exceeds what one machine can afford, it simply cannot run on that machine, and the traditional remedy was to keep upgrading the machine, with the cost roughly doubling at each step up in configuration. MapReduce, built on a distributed file system, takes a different route: it splits the operation into two steps, map() and reduce(), and these two phases can be broken into modules that execute on different machines.

Getting started

  1. Upload hadoop.tar.gz to the server and decompress it

  2. Configure Hadoop environment variables
vi /etc/profile

Add the configuration

export HADOOP_HOME=/opt/modules/hadoop-2.8.0
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

source /etc/profile

Verify the configuration by checking the variable:

echo ${HADOOP_HOME}

  3. Configure the hadoop-env.sh, mapred-env.sh, and yarn-env.sh files

In each of these files, change JAVA_HOME to the Java installation path.

How to view the Java installation path:
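
One way to find it, assuming java is already on the PATH (the JDK path below is only an example):

which java                 # where the java command lives
readlink -f $(which java)  # resolve symlinks to the real path; JAVA_HOME is the directory above bin/
echo $JAVA_HOME            # if JAVA_HOME is already set in the environment, reuse it

# then, in etc/hadoop/hadoop-env.sh (and the other two files), set for example:
# export JAVA_HOME=/opt/modules/jdk1.8.0_121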

4. Configure core-site.xml

The following operations are performed in the directory where Hadoop was decompressed.

vi etc/hadoop/core-site.xml

Where:

  • fs.defaultFS – specifies the address of HDFS, i.e. the name of the default file system: a URI whose scheme and authority determine the FileSystem implementation. The URI's scheme determines the config property naming the FileSystem implementation class (fs.SCHEME.impl), and the URI's authority determines the host, port, and so on of the file system. In local (pseudo-distributed) mode, the configuration can be as follows:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

In full distributed mode, the configuration can be as follows:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/home/hadoop/tmp</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hduser.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hduser.groups</name>
    <value>*</value>
  </property>
</configuration>
  • hadoop.tmp.dir – specifies Hadoop's temporary directory; for example, the NameNode data of HDFS is stored there by default. The default value is /tmp/hadoop-${user.name}, which means the NameNode would keep HDFS metadata under /tmp. If the OS restarts, /tmp is cleared, and losing NameNode metadata is a very serious problem, so this path should be changed.

More configuration properties can be found here

5. Configure hdfs-site.xml

vi etc/hadoop/hdfs-site.xml

dfs.replication – the number of replicas HDFS keeps for each block. Set it to 1 here because this is a pseudo-distributed environment with only one node.
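
A minimal hdfs-site.xml for this single-node setup might therefore look like the following sketch (only dfs.replication is changed):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>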

  6. Format HDFS
hdfs namenode -format

Formatting lays out the block storage for the DataNodes of HDFS, the distributed file system; the initial metadata generated by this partitioning is stored in the NameNode. After formatting, check whether a dfs directory exists under the hadoop.tmp.dir specified in core-site.xml (/opt/data in this example). If it does, formatting succeeded.

ll /opt/data/tmp

Note:

  • When formatting, pay attention to the permissions on the hadoop.tmp.dir directory: the ordinary hadoop user should have read and write permission. You can change the owner of /opt/data to the hadoop user:
sudo chown -R hadoop:hadoop /opt/data

View the NameNode directory created by formatting:

ll /opt/data/tmp/dfs/name/current

  • fsimage – a persistent checkpoint file of the file system metadata that the NameNode holds in memory.
  • fsimage*.md5 – a checksum file used to verify the integrity of fsimage.
  • seen_txid – holds the ID of the latest transaction.
  • VERSION – holds namespaceID (the unique ID of the NameNode) and clusterID (the cluster ID); the clusterIDs of the NameNode and the DataNodes should be the same.

  7. Start NameNode, DataNode, and SecondaryNameNode, for example with the commands sketched below.
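
A sketch using the daemon scripts shipped in Hadoop 2.x's sbin directory (already on the PATH from the environment configuration above):

hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode
hadoop-daemon.sh start secondarynamenode
jps   # NameNode, DataNode, and SecondaryNameNode should appear in the output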

  8. Create directories, upload, and download files on HDFS
  • Create a directory
hdfs dfs -mkdir /demol

Directories created in HDFS cannot be viewed with the local ls command. To view them, run:

hdfs dfs -ls /

  • Upload local files to HDFS
hdfs dfs -put {local file path} /demol

When I uploaded the file, I was told that the DataNode was not running. Checking with the jps command showed that the DataNode had shut itself down. The cause was that the clusterID of the DataNode did not match that of the NameNode.

First check hdfs-site.xml and core-site.xml for mistakes. If they are correct, compare the clusterID in the VERSION file under tmp/dfs/name with the one under tmp/dfs/data. If they differ, delete the DataNode's current directory and restart, or edit the clusterID so that the two match.
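
A quick way to compare the two IDs, assuming hadoop.tmp.dir points at /opt/data/tmp as in this article:

grep clusterID /opt/data/tmp/dfs/name/current/VERSION   # NameNode's clusterID
grep clusterID /opt/data/tmp/dfs/data/current/VERSION   # DataNode's clusterID; the two must match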

  • Download the file to the local PC
hdfs dfs -get {HDFS file path}

  9. Configure mapred-site.xml and yarn-site.xml

There is no mapred-site.xml file by default, only a mapred-site.xml.template configuration template. Copy the template to generate mapred-site.xml:

cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml 

Add the following configuration:

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

Specifies that MapReduce runs on the YARN framework.

Configure yarn-site.xml and add the following configuration:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>0.0.0.0</value>
</property>
  • yarn.nodemanager.aux-services – specifies the auxiliary service run by the NodeManager; setting it to mapreduce_shuffle enables MapReduce's default shuffle service.
  • yarn.resourcemanager.hostname – specifies which node the ResourceManager runs on.

View more Properties

  10. Start ResourceManager and NodeManager
yarn-daemon.sh start resourcemanager
yarn-daemon.sh start nodemanager

Check with the jps command whether both started successfully.

  11. Run a MapReduce job

Hadoop's share directory ships with some jars that contain small MapReduce examples, located at share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar. The classic word-count example is run here.

Create the input file for the test
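
For example (the content just mirrors the two sample lines used in the MapReduce walkthrough above, and /opt/data is only the directory used elsewhere in this article):

printf 'hello you\nhello me\n' > /opt/data/wc.input
cat /opt/data/wc.input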

Create an input directory and upload the files to it

hdfs dfs -mkdir -p /wordcount/input
hdfs dfs -put /opt/data/wc.input /wordcount/input

Run the wordcount MapReduce job:

yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar wordcount /wordcount/input /wordcount/output

View the resulting directory

hdfs dfs -ls /wordcount/output

  • There are two files in the output directory. The _SUCCESS file is empty; if it exists, the job completed successfully.

  • The part-r-00000 file is the result file. The -r- indicates that it was produced by the Reduce phase. A MapReduce program may have no Reduce phase, but it must have a Map phase; if there is no Reduce phase, the file name contains -m- instead.

  • Each reduce task produces one file whose name starts with part-r-.

View the result file
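
For example, with hdfs dfs -cat:

hdfs dfs -cat /wordcount/output/part-r-00000   # prints each word and its count, separated by a tab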

Conclusion

  • HDFS – stores big data files. Files are saved in Hadoop's binary block format and cannot be viewed directly, which helps with security; the directories are essentially virtual directories.
  • YARN – a resource management and task scheduling framework on which MapReduce runs; it can also run Storm, Spark, and so on.
  • MapReduce – a computing framework that provides a data processing scheme and uses YARN to run the program across multiple machines. If each machine is regarded as a CPU, the program must be parallelizable without corrupting the data.