This post mainly records how to install Hadoop in pseudo-distributed mode on a Mac and how to understand the principles of MapReduce through WordCount.

Summary of learning materials

  • Hadoop’s official website

Resources

  • How to learn Hadoop from scratch? | Zhihu
  • Hadoop family learning roadmap | Blog
  • Configuring a Hadoop playground environment on the Mac | Blog

Companion materials for Hadoop: The Definitive Guide

  • Companion source code
  • Accompanying Dataset: Full Dataset

Hadoop Basics

Hadoop and Spark

Hadoop and Spark are two different big data processing frameworks as shown in the following figure.

  • The blue part in the figure above is the Hadoop ecosystem component, and the yellow part is the Spark ecosystem component.
  • Although they are two different big data processing frameworks, they are not mutually exclusive. Spark and MapReduce in Hadoop are symbiotic.
  • Hadoop provides many features that Spark does not, such as distributed file systems, while Spark provides real-time in-memory computing that is very fast.

Hadoop typically consists of two parts: storage and processing. The storage part is the Hadoop Distributed File System (HDFS), and the processing part is MapReduce (MR).

Hadoop installation and configuration

  • Ref 1: Configuring a Hadoop playground environment on the Mac | Blog
  • Ref 2: Guide to building a Hadoop development environment on Mac OS X | Zhihu
  • Ref 3: Hadoop installation and configuration on the Mac | SegmentFault

Hadoop Installation Mode

Hadoop installation modes are divided into three types: single-machine mode, pseudo-distributed mode, and fully distributed mode. The default installation is single-machine mode. You can switch from the default single-machine mode to pseudo-distributed mode through the configuration file core-site.xml.

For details about the three Hadoop installation modes and how to use VMs for a distributed installation, see Chapter 2, Hadoop Installation, in Hadoop Application Development Technology Explained.

The running mode of Hadoop is determined by its configuration files. Therefore, to switch from pseudo-distributed mode back to non-distributed (single-machine) mode, delete the corresponding configuration items in core-site.xml.

The following briefly describes how to set up a pseudo-distributed Hadoop environment on a Mac by modifying the configuration file.

Hadoop Installation Procedure

The Hadoop installation and configuration steps are as follows (refer to the reference links above for details).

  1. Install Java.
  2. In Mac System Preferences, open the Sharing pane and enable Remote Login, then verify with ssh localhost.
  3. Download Hadoop 2.10.0 (the .tar.gz package) from Hadoop's official website, unzip it, and place it in the /Library/hadoop-2.10.0 directory.
  4. Set the Hadoop environment variables

(1) Open the configuration file

vim ~/.bash_profile

(2) Set environment variables

HADOOP_HOME=/Library/hadoop-2.10.0
PATH=$PATH:${HADOOP_HOME}/bin
HADOOP_CONF_DIR=/Library/hadoop-2.10.0/etc/hadoop
HADOOP_COMMON_LIB_NATIVE_DIR=/Library/hadoop-2.10.0/lib/native

export HADOOP_HOME
export PATH
export HADOOP_CONF_DIR
export HADOOP_COMMON_LIB_NATIVE_DIR

(3) Make the configuration file take effect and verify the Hadoop version number

source ~/.bash_profile

hadoop version  
  5. Modify the Hadoop configuration files

The Hadoop configuration files to be modified are stored in the etc/hadoop directory and include

  • hadoop-env.sh
  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml
  • yarn-site.xml

The following changes are made step by step

(1) Modify the hadoop-env.sh file

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_231.jdk/Contents/Home
export HADOOP_HOME=/Library/hadoop-2.10.0
export HADOOP_CONF_DIR=/Library/hadoop-2.10.0/etc/hadoop

(2) Modify the core-site.xml file

Set a temporary directory and the default file system for Hadoop. localhost:9000 refers to the local host. If a remote host is used, use its IP address, or enter its domain name and add the corresponding DNS mapping in the /etc/hosts file.

<configuration>
    <!-- File system address: localhost:9000 -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <!-- Temporary directory for Hadoop -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/Users/LBS/devfiles/hadoop/hadoop-2.10.0/tmp</value>
        <description>Directories for software development and saving temporary files.</description>
    </property>
</configuration>

(3) Modify the hdfs-site.xml file

hdfs-site.xml specifies default HDFS parameters, such as the number of block replicas. Since Hadoop runs on only one node in pseudo-distributed mode, the replication factor is set to 1.

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <!-- Disable permission checking -->
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
    <!-- Local folder for NameNode data -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/Users/LBS/devfiles/hadoop/hadoop-2.10.0/tmp/dfs/name</value>
    </property>
    <!-- Local folder for DataNode data -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/Users/LBS/devfiles/hadoop/hadoop-2.10.0/tmp/dfs/data</value>
    </property>
</configuration>

(4) Modify the mapred-site.xml file

Copy the mapred-site.xml.template file and rename it mapred-site.xml. Then set YARN as the data processing framework, and set the host name and port number of the JobTracker.

<configuration>
    <!-- Specify that MapReduce runs on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

(5) Modify the yarn-site.xml file

Configure YARN, the data processing framework.

<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>localhost:9000</value>
    </property>
</configuration>

Start Hadoop

(1) If Hadoop is started for the first time, the NameNode needs to be formatted. This step is not required for subsequent startups.

hadoop namenode -format

(2) Start HDFS: go to the sbin directory in the Hadoop installation directory and start HDFS (you need to have enabled Remote Login on the Mac, and you will be prompted for your password three times).

Tip: during the initial installation and startup, run the ./start-all.sh command to perform the necessary initialization.

cd /Library/hadoop-2.10.0/sbin
./start-dfs.sh

If the following information is displayed, the startup is successful

lbsMacBook-Pro:sbin lbs$ ./start-dfs.sh
Starting namenodes on [localhost]
Password:
localhost: namenode running as process 12993. Stop it first.
Password:
localhost: datanode running as process 32400. Stop it first.
Starting secondary namenodes [0.0.0.0]
Password:
0.0.0.0: Connection closed by 127.0.0.1 port 22

Note that the following warning is displayed in the log

WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

This warning is about the Hadoop native libraries. Native libraries are used for efficiency, or for functional components that cannot be implemented in Java. For details, see Hadoop Using Native Libraries to Improve Efficiency under Mac OS X.

You can stop Hadoop as follows

cd /Library/hadoop-2.10.0/sbin
./stop-dfs.sh

(3) Run jps in the terminal. If the following information is displayed, Hadoop has started successfully. The presence of DataNode, NameNode, and SecondaryNameNode indicates that pseudo-distributed Hadoop is running.

lbsMacBook-Pro:sbin lbs$ jps
32400 DataNode
12993 NameNode
30065 BootLanguagServerBootApp
13266 SecondaryNameNode
30039 org.eclipse.equinox.launcher_1.5.700.v20200207-2156.jar
35199 ResourceManager
32926 NodeManager
35117 RunJar
35019 Jps

You can also visit http://localhost:50070/dfshealth.html#tab-overview to check Hadoop's status. If the Live Nodes parameter is displayed, pseudo-distributed Hadoop has started successfully.

(4) Start YARN: Go to the sbin directory in the Hadoop installation directory and start YARN

cd /Library/hadoop-2.10.0/sbin
./start-yarn.sh

Now you're done installing, configuring, and starting Hadoop! Next, you can use shell commands to manipulate files on HDFS, for example:

hadoop fs -ls /            # View files and folders in the root directory
hadoop fs -mkdir /testData # Create a testData directory in the root directory
hadoop fs -rm /.../...     # Remove a file
hadoop fs -rmr /...        # Remove an empty folder
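The same operations can also be performed programmatically through the HDFS Java API. Below is a minimal sketch, assuming the pseudo-distributed setup above (hdfs://localhost:9000); /testData is only an example path that is safe to create and delete.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // same address as core-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of "hadoop fs -ls /"
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }

        // Equivalent of "hadoop fs -mkdir /testData"
        fs.mkdirs(new Path("/testData"));

        // Equivalent of "hadoop fs -rmr /testData" (recursive delete)
        fs.delete(new Path("/testData"), true);

        fs.close();
    }
}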

FAQ

Unable to load native-hadoop library for your platform

If the following message is displayed when HDFS is started

./start-dfs.sh
lbsMacBook-Pro:~ lbs$ cd sbin
lbsMacBook-Pro:sbin lbs$ ./start-dfs.sh
20/03/23 08:46:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
Password:
localhost: namenode running as process 93155. Stop it first.
Password:
localhost: datanode running as process 93262. Stop it first.
Starting secondary namenodes [0.0.0.0]
Password:
0.0.0.0: secondarynamenode running as process 93404. Stop it first.

This warning is about the Hadoop native libraries. Native libraries are used for efficiency, or for functional components that cannot be implemented in Java. For details, see Hadoop Using Native Libraries to Improve Efficiency under Mac OS X.

Study Notes on Hadoop Application Development Technology Explained

MapReduce Quick Start - WordCount

  • Building a Hadoop development environment in IntelliJ - WordCount | Jianshu
  • Write your first MapReduce program using IDEA
  • Running and debugging MapReduce programs locally with IntelliJ and Maven
  • Learn Hadoop together - the first MapReduce program

Project Creation

  1. Use IDEA to create a Maven-based project named WordCount.
  2. In pom.xml, add the following dependencies.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.lbs0912</groupId>
    <artifactId>wordcount</artifactId>
    <version>1.0-SNAPSHOT</version>

    <!-- Repository for dependencies -->
    <repositories>
        <repository>
            <id>apache</id>
            <url>http://maven.apache.org</url>
        </repository>
    </repositories>

    <!-- Hadoop dependencies -->
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>1.2.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.2</version>
        </dependency>
    </dependencies>
</project>
  3. Create the WordMapper class.
package wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<Object, Text, Text, IntWritable> {
    IntWritable one = new IntWritable(1);
    Text word = new Text();


    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
  4. Create the WordReducer class.
package wordcount;


import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
    private IntWritable result = new IntWritable();

    public void reduce(Text	key, Iterable<IntWritable> values, Context context) throws IOException,InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
  5. Create the WordMain driver class.
package wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordMain {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        // There must be an input path and an output path
        if (otherArgs.length != 2) {
            System.err.println("Usage: WordCount <in> <out>");
            System.exit(2);
        }

        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordMain.class);                             // main class
        job.setMapperClass(WordMapper.class);                          // Mapper
        job.setCombinerClass(WordReducer.class);                       // Combiner
        job.setReducerClass(WordReducer.class);                        // Reducer
        job.setOutputKeyClass(Text.class);                             // key class of the job output
        job.setOutputValueClass(IntWritable.class);                    // value class of the job output
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));     // input path
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));   // output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);              // wait for completion and exit
    }
}

Run the program directly in IDEA

  • Building a Hadoop development environment in IntelliJ - WordCount | Jianshu

Select Run -> Edit Configurations and enter input output in the Program arguments field, as shown in the figure below.

Add the wordcount test file wordcount1.txt to the input directory

Hello, I love coding
are you ok?
Hello, i love hadoop
are you ok?

Run the program again and you will see the following output directory structure

- input
- output
    | - ._SUCCESS.crc
    | - .part-r-00000.crc
    | - _SUCCESS
    | - part-r-00000

Open the file part-r-00000 to see the word-count results

Hello,	2
I	1
are	2
coding	1
hadoop	1
i	1
love	2
ok?	2
you	2

It should be noted that, due to Hadoop's output semantics, the output directory must be deleted before the program is run again; a small helper for doing this programmatically is sketched below.
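If you prefer to automate the cleanup, the driver can delete the output directory before the job is submitted. The helper below is only a sketch of that idea and is not part of the original WordMain; it uses the standard FileSystem API.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputCleaner {
    // Deletes the given output directory if it exists, so the job can be re-run.
    public static void deleteIfExists(Configuration conf, String dir) throws IOException {
        Path outputPath = new Path(dir);
        FileSystem fs = outputPath.getFileSystem(conf);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true); // true = delete recursively
        }
    }
}

Calling OutputCleaner.deleteIfExists(conf, otherArgs[1]) in WordMain, just before the job is submitted, removes the need for manual deletion.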

Export a JAR package and run the program

  1. In File -> Project Structure, add an Artifact for the project and choose WordMain as the main class.

  2. Choose Build -> Build Artifacts... to generate the .jar file.

  3. Go to the HDFS working directory (not another file system directory) and run the following command.
hadoop jar WordCount.jar input/ out/

HDFS Distributed File System

Getting to Know HDFS

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It is fault tolerant, provides high throughput, and is suitable for applications with large data sets; data in the file system is accessed as streams.

Applications running on HDFS access their data sets as streams, which is not typical of ordinary applications running on ordinary file systems. HDFS is designed for batch processing rather than user interaction, with an emphasis on data throughput rather than data access latency.

HDFS stores each file as a sequence of blocks. All blocks except the last block are of the same size.
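As a concrete illustration of streaming access, the sketch below copies a file out of HDFS to standard output. It assumes the pseudo-distributed setup described above; the path /input/wordcount1.txt is only a hypothetical example.

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsStreamRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // same address as core-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Open the file as a stream; HDFS serves it block by block from the DataNodes.
        try (InputStream in = fs.open(new Path("/input/wordcount1.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false); // 4 KB buffer, do not close System.out
        }
    }
}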

HDFS architecture

HDFS provides high-performance, reliable, and scalable storage for the Hadoop distributed computing framework. HDFS has a typical master-slave architecture: an HDFS cluster consists of a master node (NameNode) and a number of slave nodes (DataNodes).

  • The NameNode is a central server that manages the file system namespace and client access to files. It also determines the mapping between blocks and DataNodes.
    • Provides the name lookup service; it is a Jetty server
    • Saves metadata, including file ownership and permissions, which blocks each file contains, and which DataNode each block is stored on
    • The NameNode loads the metadata into memory after startup
  • There is usually one DataNode per node; it manages the storage on the node where it resides. DataNodes are usually organized into racks, with all machines in a rack connected through a single switch. The functions of a DataNode include
    • Saving blocks, each with a corresponding metadata information file
    • Reporting block information to the NameNode when the DataNode thread starts
    • Keeping in touch with the NameNode by sending it a heartbeat every 3 seconds

  • Rack: the three copies of a Block are stored in two or more racks for disaster recovery and fault tolerance
  • The data block (Block) is the basic storage unit of HDFS. The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x. Files on HDFS are divided into block-sized chunks, each stored as an independent unit. Unlike in other file systems, a file smaller than a block does not occupy the entire block's space. Using the block abstraction instead of entire files as the storage unit greatly simplifies the design of the storage subsystem.
  • The secondary metadata node (SecondaryNameNode) is responsible for backing up the namespace image and periodically merging the edit log into it.

Use hadoop fsck / -files -blocks to display block information.

Reasons for the large Block size

  1. It reduces file addressing time.
  2. It reduces the overhead of managing block metadata, because each block needs a corresponding record on the NameNode.
  3. Reading and writing data in large blocks reduces the cost of establishing network connections.

Block Backup Principles

A Block is the smallest storage unit of the HDFS file system, uniquely identified by a long integer. Each block has multiple replicas, three by default. To balance data safety and efficiency, Hadoop's default placement policy for the three replicas is shown in the following figure

  • Replica 1: stored in the HDFS directory on the local machine
  • Replica 2: stored on a DataNode in a different rack
  • Replica 3: stored on another machine in the same rack as the second replica

This policy ensures that access to the file a block belongs to is preferentially served within the same rack, and if an entire rack fails, a replica can still be found in another rack. It is efficient while remaining fault tolerant.

Hadoop’s RPC mechanism

The Remote Procedure Call (RPC) mechanism must solve two problems

  1. Object invocation
  2. Serialization/deserialization mechanism

The RPC architecture is shown in the following figure. Hadoop implements its own simple RPC component, which relies on support for the Hadoop Writable type.

The Hadoop Writable interface requires that each implementation class ensure its objects can be properly serialized (write) and deserialized (readFields). Therefore, Hadoop RPC uses Java dynamic proxies and reflection to implement object invocation, while serialization and deserialization of the data exchanged between client and server are implemented by the Hadoop framework or by users themselves, that is, the data assembly is customized.

Hadoop RPC = Dynamic proxy + custom binary stream
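As a concrete illustration of the serialization half of this formula, the sketch below shows a custom Writable. PersonWritable is a hypothetical type invented for this example, not a class from the Hadoop source; what matters is the write()/readFields() contract.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Hypothetical record type illustrating the Writable contract:
// write() serializes the fields to a binary stream, readFields() restores them.
public class PersonWritable implements Writable {
    private String name;
    private int age;

    public PersonWritable() { } // no-arg constructor required for deserialization

    public PersonWritable(String name, int age) {
        this.name = name;
        this.age = age;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeInt(age);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        age = in.readInt();
    }
}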

Open Source Database HBase

Overview

  • HBase is a scalable, distributed, column-oriented, open-source database suitable for storing unstructured data. Note that HBase is column-based rather than row-based.
  • The HBase technology can be used to build large-scale structured storage clusters on inexpensive PC servers.
  • HBase is an open-source implementation of Google Bigtable. Just as Bigtable uses GFS as its file storage system, HBase uses HDFS; just as Google uses MapReduce to process the massive data in Bigtable, HBase uses Hadoop MapReduce; and just as Bigtable uses Chubby as its coordination service, HBase uses ZooKeeper.

HBase has the following features; a brief client-side sketch follows the list.

  1. Large: a table can have hundreds of millions of rows and millions of columns
  2. Column-oriented: storage and permission control are organized by column (family), and column families are retrieved independently
  3. Sparse: null columns take up no storage space, so tables can be designed to be very sparse
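The sketch below illustrates the column-family model from the client side. It assumes a recent HBase client library on the classpath and an already-created table named users with a column family info; both names are made up for this example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseColumnDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row "row1", column family "info", qualifier "name"
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the same cell back
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}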

Hadoop Practical Demos

As the saying goes, "more data beats better algorithms": for some applications (such as recommending movies and music based on past preferences), no matter how good the algorithm is, recommendations based on a small amount of data are often worse than recommendations based on a large amount of available data. (Hadoop: The Definitive Guide)

  • Building a movie recommendation system with Hadoop | Blog
  • Building a job recommendation engine with Mahout | Blog