
One, MapReduce overview

1. Basic concepts

MapReduce is one of the core components of Hadoop: a distributed computing framework and a programming model for parallel processing of large-scale data sets, built around two operations, Map and Reduce.

MapReduce is both a programming model and a computing component. Processing is divided into two phases: the Map phase decomposes a task into several small tasks, and the Reduce phase aggregates the results of those small tasks. The Map stage takes key-value pairs as input and emits key-value pairs as output; values with the same key are then merged into key-value sets. These sets are passed to the Reduce stage, which computes and outputs the final key-value result set.
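To make the key-value flow concrete, here is a trace of a word count job over a hypothetical input line:

```
input split:    "hadoop spark hadoop"
map output:     (hadoop, 1) (spark, 1) (hadoop, 1)
merge by key:   (hadoop, [1, 1]) (spark, [1])
reduce output:  (hadoop, 2) (spark, 1)
```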

2. Features

MapReduce can run concurrently across thousands of servers, providing massive data processing capacity. If a single server fails, its computing task is automatically migrated to another node, giving high fault tolerance. However, MapReduce is not suitable for real-time or streaming computation; the data it processes is static.

Two, operation cases

1. Process description

The data file is in a CSV-like format, with the fields of each row separated by spaces; the characteristics of the data content need to be considered when processing it.

Files are divided into input splits and processed concurrently by different MapTask tasks.

After the MapTask tasks complete, the ReduceTask tasks execute, relying on the data output by the Map stage.

After the ReduceTask tasks finish, the results are written to the output file.

2. Basic configuration

```yaml
hadoop:
  # source file to read
  inputPath: hdfs://hop01:9000/hopdir/javaNew.txt
  # output path; must not exist before the program runs
  outputPath: /wordOut
```
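The driver below injects a MapReduceConfig bean that exposes these two values. The original post does not show that class, so here is a minimal sketch, assuming standard Spring Boot property binding (the field names follow the YAML keys above):

```java
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

// Minimal configuration holder; binds the two YAML properties above.
@Component
public class MapReduceConfig {

    @Value("${hadoop.inputPath}")
    private String inputPath;

    @Value("${hadoop.outputPath}")
    private String outputPath;

    public String getInputPath() { return inputPath; }

    public String getOutputPath() { return outputPath; }
}
```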

3. Mapper program

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    Text mapKey = new Text();
    IntWritable mapValue = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 1. Read one line
        String line = value.toString();
        // 2. Split the line into words
        String[] words = line.split(" ");
        // 3. Write out each word as <word, 1>
        for (String word : words) {
            mapKey.set(word);
            context.write(mapKey, mapValue);
        }
    }
}
```
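Note one Hadoop idiom here: mapKey and mapValue are created once as fields and reused on every map() call, which avoids allocating new Writable objects for each input record.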

4. Reducer program

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    int sum;
    IntWritable value = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // 1. Accumulate the counts for the same key
        sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        // 2. Write out <word, total count>
        value.set(sum);
        context.write(key, value);
    }
}
```

5. Execution program

```java
import java.io.IOException;
import javax.annotation.Resource;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class WordWeb {

    @Resource
    private MapReduceConfig mapReduceConfig;

    @GetMapping("/getWord")
    public String getWord() throws IOException, ClassNotFoundException, InterruptedException {
        // Declare the configuration
        Configuration hadoopConfig = new Configuration();
        hadoopConfig.set("fs.hdfs.impl",
                org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        hadoopConfig.set("fs.file.impl",
                org.apache.hadoop.fs.LocalFileSystem.class.getName());
        Job job = Job.getInstance(hadoopConfig);
        // Input path for the Job
        FileInputFormat.addInputPath(job, new Path(mapReduceConfig.getInputPath()));
        // Output path for the Job
        FileOutputFormat.setOutputPath(job, new Path(mapReduceConfig.getOutputPath()));
        // Custom Mapper and Reducer classes for the two processing stages
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(WordReducer.class);
        // Output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Block until the Job completes
        job.waitForCompletion(true);
        return "success";
    }
}
```
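A note on the two fs.*.impl settings: they are typically set explicitly when the job is packaged into a single executable JAR, where the merged META-INF/services files from hadoop-common and hadoop-hdfs can otherwise lose the filesystem registrations; whether that applies here depends on how the application is packaged.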

6. View execution results

Package the application and execute it on the hop01 server:

```bash
java -jar map-reduce-case01.jar
```
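Once the job finishes, the result can be inspected on HDFS; assuming the default output naming of a single reducer, something like:

```bash
hadoop fs -cat /wordOut/part-r-00000
```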

Three, case analysis

1. Data types

Java data types and their corresponding Hadoop serializable (Writable) types:

| Java type | Writable type | Java type | Writable type |
| --- | --- | --- | --- |
| String | Text | float | FloatWritable |
| int | IntWritable | long | LongWritable |
| boolean | BooleanWritable | double | DoubleWritable |
| byte | ByteWritable | array | ArrayWritable |
| map | MapWritable | | |

2. Core module

Mapper module: processes the input data; the business logic is implemented in the map() method, and the output data is also in key-value (KV) format.

Reducer module: processes the KV data output by the Mapper; the business logic is implemented in the reduce() method.

Driver module: submits the program to YARN for scheduling, via a Job object that encapsulates the running parameters.

Four, serialization

1. Introduction to serialization

Serialization: To convert an object in memory into a binary sequence of bytes that can be stored persistently via an output stream or transmitted over a network;

Deserialization: The process of loading an object into memory by receiving an input stream of bytes or reading persistent data from disk;

Hadoop serialization-related interfaces: Writable implements the serialization mechanism, and Comparable manages the ordering of keys;
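As an illustration of the two interfaces, here is a minimal sketch of a hypothetical key type: Writable supplies the (de)serialization, while compareTo decides how keys are sorted during the shuffle:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type: WritableComparable combines Writable
// (serialization) with Comparable (key ordering in the shuffle).
public class SortKey implements WritableComparable<SortKey> {

    private long num;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(num);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.num = in.readLong();
    }

    // Sort keys in descending numeric order
    @Override
    public int compareTo(SortKey other) {
        return Long.compare(other.num, this.num);
    }
}
```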

2. Case realization

Case description: read a file and, for lines that share the same key, accumulate the values and output the computed result. This case is executed locally rather than uploaded as a JAR to the Hadoop server; the driver configuration is consistent with the previous case.
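For reference, a hypothetical input file in the format the Mapper below expects (key, two addends, comma-separated), and the corresponding output, assuming AddEntity prints its three fields:

```
# input
line01,10,20
line02,30,40
line02,50,60

# output: key, then the two accumulated addends and their sum
line01    10    20    30
line02    80    100   180
```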

Entity object properties

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class AddEntity implements Writable {

    private long addNum01;
    private long addNum02;
    private long resNum;

    public AddEntity() {
        super();
    }

    public AddEntity(long addNum01, long addNum02) {
        super();
        this.addNum01 = addNum01;
        this.addNum02 = addNum02;
        this.resNum = addNum01 + addNum02;
    }

    // Serialize: write the fields to the output stream
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeLong(addNum01);
        dataOutput.writeLong(addNum02);
        dataOutput.writeLong(resNum);
    }

    // Deserialize: read the fields in the same order they were written
    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.addNum01 = dataInput.readLong();
        this.addNum02 = dataInput.readLong();
        this.resNum = dataInput.readLong();
    }

    // Getters and setters omitted
}
```

Mapper mechanism

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AddMapper extends Mapper<LongWritable, Text, Text, AddEntity> {

    Text myKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 1. Read one line
        String line = value.toString();
        // 2. Split the line: key, addNum01, addNum02
        String[] lineArr = line.split(",");
        String lineNum = lineArr[0];
        long addNum01 = Long.parseLong(lineArr[1]);
        long addNum02 = Long.parseLong(lineArr[2]);
        myKey.set(lineNum);
        AddEntity myValue = new AddEntity(addNum01, addNum02);
        // 3. Write out <key, AddEntity>
        context.write(myKey, myValue);
    }
}
```

Reducer mechanism

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AddReducer extends Reducer<Text, AddEntity, Text, AddEntity> {

    @Override
    protected void reduce(Text key, Iterable<AddEntity> values, Context context)
            throws IOException, InterruptedException {
        long addNum01Sum = 0;
        long addNum02Sum = 0;
        // Accumulate both fields for the same key
        for (AddEntity addEntity : values) {
            addNum01Sum += addEntity.getAddNum01();
            addNum02Sum += addEntity.getAddNum02();
        }
        AddEntity addRes = new AddEntity(addNum01Sum, addNum02Sum);
        context.write(key, addRes);
    }
}
```
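One caveat worth knowing: Hadoop reuses the same object instance while iterating over the values of a key, so each AddEntity seen in the loop is the same object refilled with new field contents. Reading primitive fields, as done here, is safe; holding references to the objects across iterations is not.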

Final result of the case:

Five, source code address

GitHub address: https://github.com/cicadasmile/big-data-parent

GitEE address: https://gitee.com/cicadasmile/big-data-parent
