
One, MapReduce overview

1. Basic concepts

MapReduce is one of the core components of Hadoop: a distributed computing framework and a programming model for parallel processing of large-scale data sets, built around two operations, Map and Reduce.

MapReduce is both a programming model and a computing component. Processing is divided into two phases: the Map phase decomposes a task into several small tasks, and the Reduce phase aggregates the results of those small tasks. The Map stage takes key-value pairs as input and emits key-value pairs as output; values with the same key are then merged into key-value sets. These sets are passed to the Reduce stage, which computes and outputs the final key-value result set.
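To make the key-value flow concrete, here is a trace of a word count job over a hypothetical input line:

```
input split:    "hadoop spark hadoop"
map output:     (hadoop, 1) (spark, 1) (hadoop, 1)
merge by key:   (hadoop, [1, 1]) (spark, [1])
reduce output:  (hadoop, 2) (spark, 1)
```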

2. Features

MapReduce can run concurrently across thousands of servers, providing massive data processing capacity. If a single server fails, its computing task is automatically migrated to another node, giving high fault tolerance. However, MapReduce is not suitable for real-time or streaming computation; the data it processes is static.

Two, operation cases

1. Process description

The data file is in a CSV-like format, with the fields of each row separated by spaces; the characteristics of the data content need to be considered when processing it.

Files are divided into input splits and processed concurrently by different MapTask tasks.

After the MapTask tasks complete, the ReduceTask tasks execute, relying on the data output by the Map stage.

After the ReduceTask tasks finish, the results are written to the output file.

2. Basic configuration

```yaml
hadoop:
  # source file to read
  inputPath: hdfs://hop01:9000/hopdir/javaNew.txt
  # output path; must not exist before the program runs
  outputPath: /wordOut
```
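The driver below injects a MapReduceConfig bean that exposes these two values. The original post does not show that class, so here is a minimal sketch, assuming standard Spring Boot property binding (the field names follow the YAML keys above):

```java
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

// Minimal configuration holder; binds the two YAML properties above.
@Component
public class MapReduceConfig {

    @Value("${hadoop.inputPath}")
    private String inputPath;

    @Value("${hadoop.outputPath}")
    private String outputPath;

    public String getInputPath() { return inputPath; }

    public String getOutputPath() { return outputPath; }
}
```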

3. Mapper program

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    Text mapKey = new Text();
    IntWritable mapValue = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 1. Read one line
        String line = value.toString();
        // 2. Split the line into words
        String[] words = line.split(" ");
        // 3. Write out each word as <word, 1>
        for (String word : words) {
            mapKey.set(word);
            context.write(mapKey, mapValue);
        }
    }
}
```
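Note one Hadoop idiom here: mapKey and mapValue are created once as fields and reused on every map() call, which avoids allocating new Writable objects for each input record.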

4. Reducer program

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    int sum;
    IntWritable value = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // 1. Accumulate the counts for the same key
        sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        // 2. Write out <word, total count>
        value.set(sum);
        context.write(key, value);
    }
}
```

5. Execution program

```java
import java.io.IOException;
import javax.annotation.Resource;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class WordWeb {

    @Resource
    private MapReduceConfig mapReduceConfig;

    @GetMapping("/getWord")
    public String getWord() throws IOException, ClassNotFoundException, InterruptedException {
        // Declare the configuration
        Configuration hadoopConfig = new Configuration();
        hadoopConfig.set("fs.hdfs.impl",
                org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        hadoopConfig.set("fs.file.impl",
                org.apache.hadoop.fs.LocalFileSystem.class.getName());
        Job job = Job.getInstance(hadoopConfig);
        // Input path for the Job
        FileInputFormat.addInputPath(job, new Path(mapReduceConfig.getInputPath()));
        // Output path for the Job
        FileOutputFormat.setOutputPath(job, new Path(mapReduceConfig.getOutputPath()));
        // Custom Mapper and Reducer classes for the two processing stages
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(WordReducer.class);
        // Output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Block until the Job completes
        job.waitForCompletion(true);
        return "success";
    }
}
```
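A note on the two fs.*.impl settings: they are typically set explicitly when the job is packaged into a single executable JAR, where the merged META-INF/services files from hadoop-common and hadoop-hdfs can otherwise lose the filesystem registrations; whether that applies here depends on how the application is packaged.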

6. View execution results

Package the application and execute it on the hop01 server:

```bash
java -jar map-reduce-case01.jar
```
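Once the job finishes, the result can be inspected on HDFS; assuming the default output naming of a single reducer, something like:

```bash
hadoop fs -cat /wordOut/part-r-00000
```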

Three, case analysis

1. Data types

Java data types and their corresponding Hadoop serializable (Writable) types:

| Java type | Writable type | Java type | Writable type |
| --- | --- | --- | --- |
| String | Text | float | FloatWritable |
| int | IntWritable | long | LongWritable |
| boolean | BooleanWritable | double | DoubleWritable |
| byte | ByteWritable | array | ArrayWritable |
| map | MapWritable | | |

2. Core module

Mapper module: processes the input data; the business logic is implemented in the map() method, and the output data is also in key-value (KV) format.

Reducer module: processes the KV data output by the Mapper; the business logic is implemented in the reduce() method.

Driver module: submits the program to YARN for scheduling, via a Job object that encapsulates the running parameters.

Four, serialization

1. Introduction to serialization

Serialization: To convert an object in memory into a binary sequence of bytes that can be stored persistently via an output stream or transmitted over a network;

Deserialization: The process of loading an object into memory by receiving an input stream of bytes or reading persistent data from disk;

Hadoop serialization-related interfaces: Writable implements the serialization mechanism, and Comparable manages the ordering of keys;
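As an illustration of the two interfaces, here is a minimal sketch of a hypothetical key type: Writable supplies the (de)serialization, while compareTo decides how keys are sorted during the shuffle:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type: WritableComparable combines Writable
// (serialization) with Comparable (key ordering in the shuffle).
public class SortKey implements WritableComparable<SortKey> {

    private long num;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(num);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.num = in.readLong();
    }

    // Sort keys in descending numeric order
    @Override
    public int compareTo(SortKey other) {
        return Long.compare(other.num, this.num);
    }
}
```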

2. Case realization

Case description: read a file and, for lines that share the same key, accumulate the values and output the computed result. This case is executed locally rather than uploaded as a JAR to the Hadoop server; the driver configuration is consistent with the previous case.
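For reference, a hypothetical input file in the format the Mapper below expects (key, two addends, comma-separated), and the corresponding output, assuming AddEntity prints its three fields:

```
# input
line01,10,20
line02,30,40
line02,50,60

# output: key, then the two accumulated addends and their sum
line01    10    20    30
line02    80    100   180
```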

Entity object properties

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class AddEntity implements Writable {

    private long addNum01;
    private long addNum02;
    private long resNum;

    public AddEntity() {
        super();
    }

    public AddEntity(long addNum01, long addNum02) {
        super();
        this.addNum01 = addNum01;
        this.addNum02 = addNum02;
        this.resNum = addNum01 + addNum02;
    }

    // Serialize: write the fields to the output stream
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeLong(addNum01);
        dataOutput.writeLong(addNum02);
        dataOutput.writeLong(resNum);
    }

    // Deserialize: read the fields in the same order they were written
    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.addNum01 = dataInput.readLong();
        this.addNum02 = dataInput.readLong();
        this.resNum = dataInput.readLong();
    }

    // Getters and setters omitted
}
```

Mapper mechanism

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AddMapper extends Mapper<LongWritable, Text, Text, AddEntity> {

    Text myKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 1. Read one line
        String line = value.toString();
        // 2. Split the line: key, addNum01, addNum02
        String[] lineArr = line.split(",");
        String lineNum = lineArr[0];
        long addNum01 = Long.parseLong(lineArr[1]);
        long addNum02 = Long.parseLong(lineArr[2]);
        myKey.set(lineNum);
        AddEntity myValue = new AddEntity(addNum01, addNum02);
        // 3. Write out <key, AddEntity>
        context.write(myKey, myValue);
    }
}
```

Reducer mechanism

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AddReducer extends Reducer<Text, AddEntity, Text, AddEntity> {

    @Override
    protected void reduce(Text key, Iterable<AddEntity> values, Context context)
            throws IOException, InterruptedException {
        long addNum01Sum = 0;
        long addNum02Sum = 0;
        // Accumulate both fields for the same key
        for (AddEntity addEntity : values) {
            addNum01Sum += addEntity.getAddNum01();
            addNum02Sum += addEntity.getAddNum02();
        }
        AddEntity addRes = new AddEntity(addNum01Sum, addNum02Sum);
        context.write(key, addRes);
    }
}
```
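One caveat worth knowing: Hadoop reuses the same object instance while iterating over the values of a key, so each AddEntity seen in the loop is the same object refilled with new field contents. Reading primitive fields, as done here, is safe; holding references to the objects across iterations is not.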

Final result of the case:

Five, source code address

GitHub address: https://github.com/cicadasmile/big-data-parent

GitEE address: https://gitee.com/cicadasmile/big-data-parent
