Introduction to Hadoop

Hadoop is the Adam and Eve of the open source big data world. Its core consists of the HDFS distributed storage system and the MapReduce distributed computing framework.

HDFS

HDFS works by splitting large files into blocks.



Each block is replicated three times and placed on three cheap machines, so that the three copies serve as backups of one another at all times. When the data is read, only one of the copies needs to be reachable for the block to be available.



The nodes that store the data are called DataNodes, and the node that manages the DataNodes is called the NameNode.
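Once the container described later in this article is running, you can see this structure with HDFS's fsck command, which lists a file's blocks and where each replica lives. A quick sketch (the path below is only an example, and the single-node image keeps one replica per block rather than three):

# run inside the Hadoop container, from /usr/local/hadoop
bin/hdfs fsck /user/root/input -files -blocks -locations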

MapReduce

The principle is that a large task is first split into pieces that are processed independently (Map), and the partial results are then aggregated (Reduce). Both the splitting and the aggregation run in parallel across many servers, which is where the power of the cluster shows. The difficulty lies in breaking a task down into steps that fit the MapReduce model, and in deciding what the intermediate <key, value> inputs and outputs should be.
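To get a feel for the model before touching Hadoop, here is the classic word count sketched as a plain Unix pipeline. This is only an analogy of mine, not Hadoop code, and input.txt is a made-up file name:

# map: emit one <word, 1> pair per word in the (hypothetical) input.txt
tr -s '[:space:]' '\n' < input.txt | awk 'NF {print $1 "\t1"}' > mapped.txt
# shuffle: bring identical keys together
sort mapped.txt > shuffled.txt
# reduce: sum the values for each key
awk -F'\t' '{count[$1] += $2} END {for (w in count) print w "\t" count[w]}' shuffled.txt

Hadoop performs the same three steps, but spreads the map and reduce work across many machines and handles the shuffle for you.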

A stand-alone version of Hadoop

If you want to learn the principles of Hadoop or develop against it, you need a working Hadoop system, but:

  • Configuring the system is a headache, and many people give up partway through.
  • You may not have a server available to use.

This article introduces a configuration-free, stand-alone way to install and use Hadoop, so that you can run Hadoop examples to support learning, development and testing. All it requires is a Linux virtual machine on your laptop with Docker installed in it.
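Before pulling anything, it is worth a quick check that Docker is actually usable on the VM; a minimal sketch, assuming a systemd-based distribution:

docker version           # client and daemon both report a version when Docker is healthy
systemctl start docker   # start the daemon if it is not running (assumes systemd)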

Installation

Use Docker to pull the sequenceiq/hadoop-docker:2.7.0 image and then run it.

[root@bogon ~]# docker pull sequenceiq/hadoop-docker:2.7.0
2.7.0: Pulling from sequenceiq/hadoop-docker
860d0823bcab: Pulling fs layer
e592c61b2522: Pulling fs layer

Output on a successful download:

Digest: sha256:a40761746eca036fee6aafdf9fdbd6878ac3dd9a7cd83c0f3f5d8a0e6350c76a
Status: Downloaded newer image for sequenceiq/hadoop-docker:2.7.0

Start the container

[root@bogon ~]# docker run -it --privileged=true sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash
Starting sshd: [ OK ]
Starting namenodes on [b7a42f79339c]
b7a42f79339c: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-b7a42f79339c.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-b7a42f79339c.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-b7a42f79339c.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-b7a42f79339c.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-b7a42f79339c.out

After a successful start, the shell drops you straight into the container environment, so there is no need to run docker exec. Inside the container, go to /usr/local/hadoop/sbin and execute ./start-all.sh and ./mr-jobhistory-daemon.sh start historyserver, as shown below.

bash-4.1# cd /usr/local/hadoop/sbin
bash-4.1# ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [b7a42f79339c]
b7a42f79339c: namenode running as process 128. Stop it first.
localhost: datanode running as process 219. Stop it first.
Starting secondary namenodes [0.0.0.0]
0.0.0.0: secondarynamenode running as process 402. Stop it first.
starting yarn daemons
resourcemanager running as process 547. Stop it first.
localhost: nodemanager running as process 641. Stop it first.
bash-4.1# ./mr-jobhistory-daemon.sh start historyserver
chown: missing operand after `/usr/local/hadoop/logs'
Try `chown --help' for more information.
starting historyserver, logging to /usr/local/hadoop/logs/mapred--historyserver-b7a42f79339c.out

Hadoop startup is now complete, and it is as simple as that.
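If you want to double-check which daemons are up, jps (bundled with the JDK inside the container) lists the running Java processes; after the steps above you should see names such as NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager and JobHistoryServer.

# list the Java processes; each Hadoop daemon appears under its class name
jps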

How troublesome is a distributed deployment? Just count the configuration files. I have personally watched a Hadoop veteran lose an entire morning because a new server's hostname contained a hyphen "-" and the environment simply would not come up.

Run the built-in example

Go back to your Hadoop home directory and run the sample program

bash-4.1# cd /usr/local/hadoop
bash-4.1# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
20/07/05 22:34:41 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
20/07/05 22:34:43 INFO input.FileInputFormat: Total input paths to process : 31
20/07/05 22:34:43 INFO mapreduce.JobSubmitter: number of splits:31
20/07/05 22:34:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1594002714328_0001
20/07/05 22:34:44 INFO impl.YarnClientImpl: Submitted application application_1594002714328_0001
20/07/05 22:34:45 INFO mapreduce.Job: The url to track the job: http://b7a42f79339c:8088/proxy/application_1594002714328_0001/
20/07/05 22:34:45 INFO mapreduce.Job: Running job: job_1594002714328_0001
20/07/05 22:35:04 INFO mapreduce.Job: Job job_1594002714328_0001 running in uber mode : false
20/07/05 22:35:04 INFO mapreduce.Job:  map 0% reduce 0%
20/07/05 22:37:59 INFO mapreduce.Job:  map 11% reduce 0%
20/07/05 22:38:05 INFO mapreduce.Job:  map 12% reduce 0%

The MapReduce job completes with the following output

20/07/05 22:55:26 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=291
        FILE: Number of bytes written=230541
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=569
        HDFS: Number of bytes written=197
        HDFS: Number of read operations=7
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=5929
        Total time spent by all reduces in occupied slots (ms)=8545
        Total time spent by all map tasks (ms)=5929
        Total time spent by all reduce tasks (ms)=8545
        Total vcore-seconds taken by all map tasks=5929
        Total vcore-seconds taken by all reduce tasks=8545
        Total megabyte-seconds taken by all map tasks=6071296
        Total megabyte-seconds taken by all reduce tasks=8750080
    Map-Reduce Framework
        Map input records=11
        Map output records=11
        Map output bytes=263
        Map output materialized bytes=291
        Input split bytes=132
        Combine input records=0
        Combine output records=0
        Reduce input groups=5
        Reduce shuffle bytes=291
        Reduce input records=11
        Reduce output records=11
        Spilled Records=22
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=159
        CPU time spent (ms)=1280
        Physical memory (bytes) snapshot=303452160
        Virtual memory (bytes) snapshot=1291390976
        Total committed heap usage (bytes)=136450048
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=437
    File Output Format Counters
        Bytes Written=197

Use the HDFS dfs command to view the output:

bash-4.1# bin/hdfs dfs -cat output/*
6	dfs.audit.logger
4	dfs.class
3	dfs.server.namenode.
2	dfs.period
2	dfs.audit.log.maxfilesize
2	dfs.audit.log.maxbackupindex
1	dfsmetrics.log
1	dfsadmin
1	dfs.servers
1	dfs.replication
1	dfs.file

About the example

grep here is a MapReduce program that counts regular expression matches in the input: it extracts every string that matches the pattern, along with the number of times it occurs.

The shell's grep prints the entire matching line; this program prints only the part of the line that matched.

grep input output 'dfs[a-z.]+'   

The regular expression dfs[a-z.]+ matches strings that begin with dfs, followed by one or more lowercase letters or dots. The input is all the files in the input directory:

bash-4.1# ls -lrt
total 48
-rw-r--r--. 1 root root  690 May 16  2015 yarn-site.xml
-rw-r--r--. 1 root root 5511 May 16  2015 kms-site.xml
-rw-r--r--. 1 root root 3518 May 16  2015 kms-acls.xml
-rw-r--r--. 1 root root  620 May 16  2015 httpfs-site.xml
-rw-r--r--. 1 root root  775 May 16  2015 hdfs-site.xml
-rw-r--r--. 1 root root 9683 May 16  2015 hadoop-policy.xml
-rw-r--r--. 1 root root  774 May 16  2015 core-site.xml
-rw-r--r--. 1 root root 4436 May 16  2015 capacity-scheduler.xml

The result is output.
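For comparison, this is roughly what the ordinary shell grep does with the same pattern inside the container; -o switches it from printing whole lines to printing only the matched strings (the file path here is just one illustrative choice):

# whole matching lines, the usual shell behaviour
grep -E 'dfs[a-z.]+' $HADOOP_PREFIX/etc/hadoop/hdfs-site.xml
# only the matched strings, which is what the MapReduce grep example counts
grep -oE 'dfs[a-z.]+' $HADOOP_PREFIX/etc/hadoop/hdfs-site.xml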

The calculation process is as follows



One slight difference is that there are two Reduce stages: the second Reduce sorts the results by number of occurrences. Developers can chain Map and Reduce stages however they like, as long as the output of each stage lines up with the input of the next.
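Sticking with the toy pipeline from the MapReduce section, the whole grep example can be approximated in the shell as follows; the trailing sort plays the part of that second Reduce (an analogy only, since the real job reads its input from HDFS):

# map: emit every match; reduce 1: count each distinct match; reduce 2: order by count
grep -ohE 'dfs[a-z.]+' $HADOOP_PREFIX/etc/hadoop/*.xml | sort | uniq -c | sort -rn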

Introduction to Management System

Hadoop provides a set of web-based management interfaces on the following ports:

Port   Purpose
50070  Hadoop NameNode UI
50075  Hadoop DataNode UI
50090  Hadoop SecondaryNameNode
50030  JobTracker monitoring
50060  TaskTracker
8088   YARN task monitoring
60010  HBase HMaster UI
60030  HBase HRegionServer
8080   Spark monitoring UI
4040   Spark task UI

Add command parameters

To reach these web UIs from the host browser, the docker run command needs port-mapping parameters. A minimal form that publishes the three ports used below and keeps --privileged=true might look like this:

docker run -it -p 50070:50070 -p 50075:50075 -p 8088:8088 --privileged=true sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash

After running this command you can open the UIs in a browser on the host, or in a browser inside the Linux VM if it has one. My Linux VM has no graphical interface, so I view them from the host.

50070 Hadoop NameNode UI port

50075 Hadoop DataNode UI port

8088 YARN task monitoring port



Completed and running MapReduce jobs can be viewed on port 8088; the grep and wordcount jobs are shown above.
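If you prefer the command line to the browser, the same job list is available inside the container via the yarn command:

# list all applications YARN knows about, including finished ones (run from /usr/local/hadoop)
bin/yarn application -list -appStates ALL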

Some problems you may hit

1. ./sbin/mr-jobhistory-daemon.sh start historyserver must be executed, otherwise the running job reports the following error:

20/06/29 21:18:49 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
java.io.IOException: java.net.ConnectException: Call From 87a4217b9f8a/172.17.0.1 to 0.0.0.0:10020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

2. ./start-all.sh must be executed, otherwise the job fails with an error such as: Unknown Job job_1592960164748_0001

3. docker run must include --privileged=true, otherwise you get java.io.IOException: Job status not available

4. Hadoop does not overwrite result files by default, so running the example above a second time reports an error; delete ./output first (see the command below), or write to a different directory such as output01.
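In this setup the output directory lives on HDFS, so it is removed with the HDFS shell rather than a plain rm. From the Hadoop home directory:

# remove the previous result so the example can be re-run
bin/hdfs dfs -rm -r output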

Conclusion

The approach in this article gets Hadoop installed and configured at very low cost, which helps with learning and understanding as well as with development and testing. If you develop your own Hadoop program, package it as a jar, upload it to the share/hadoop/mapreduce directory, and execute

bin/hadoop jar share/hadoop/mapreduce/yourtest.jar

to run the program and check the result.
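If you build the jar on the host rather than inside the container, one way to get it in is docker cp; the jar name, container id, main class and input/output paths below are all placeholders:

# on the host: copy the jar into the running container (replace <container-id>)
docker cp yourtest.jar <container-id>:/usr/local/hadoop/share/hadoop/mapreduce/
# inside the container: run it, naming the main class unless the jar's manifest already sets one
bin/hadoop jar share/hadoop/mapreduce/yourtest.jar MainClass input output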