1. Understanding of MapReduce

What it is: the distributed computing framework that ships with Hadoop by default

What it does: provides a set of interfaces (core classes: InputFormat, OutputFormat, Mapper, Reducer, Driver) that let users plug custom business logic into distributed computing tasks
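
For example, here is a minimal sketch of a Driver for a hypothetical WordCount-style job (the class names WordCountDriver/WordCountMapper/WordCountReducer and the args-based paths are placeholders; the Mapper and Reducer themselves are sketched in section 2 below):

    // Driver: wires the core classes together and submits the job (illustrative sketch)
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCountMapper.class);    // user-defined Mapper (placeholder)
            job.setReducerClass(WordCountReducer.class);  // user-defined Reducer (placeholder)

            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // InputFormat/OutputFormat default to TextInputFormat/TextOutputFormat
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }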

[Advantages]:

  • High scalability: when computing resources run short, just add more nodes. The individual machines don't have to be top quality, as long as there are enough of them
  • High fault tolerance: if a node fails, its task automatically moves to another idle node
  • Suitable for big data processing: thanks to that scalability, given enough nodes it can compute over data at the terabyte level

[Disadvantages]:

  • No real-time computation: too slow!! Too slow!!
  • Not good at streaming: MR's input data is generally static; it cannot handle a dynamically growing input
  • Not good at iterative computation: a complete MR pass is generally mapper -> reducer -> write out the results. Frequent iteration means starting several MR jobs and chaining them in series; each job's results are written out as files, which the next job then has to read back in, which is far too tedious (see the sketch below)
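
To make the "chained jobs" point concrete, here is a rough fragment showing one iteration done as two dependent MR jobs (paths and job names are made up for illustration; the usual Driver boilerplate, imports and Mapper/Reducer settings are omitted). Every hand-off has to go through files on HDFS:

    // Illustrative fragment only: step2 cannot start from step1's in-memory results,
    // it has to re-read the files that step1 wrote out
    Path input   = new Path("/data/in");          // hypothetical paths
    Path interim = new Path("/data/tmp/step1");
    Path output  = new Path("/data/out");

    Job step1 = Job.getInstance(conf, "step1");
    // ... set Mapper/Reducer/output types for step1 ...
    FileInputFormat.setInputPaths(step1, input);
    FileOutputFormat.setOutputPath(step1, interim);   // results written out to HDFS
    step1.waitForCompletion(true);

    Job step2 = Job.getInstance(conf, "step2");
    // ... set Mapper/Reducer/output types for step2 ...
    FileInputFormat.setInputPaths(step2, interim);    // read the same files back in again
    FileOutputFormat.setOutputPath(step2, output);
    step2.waitForCompletion(true);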

[Personal summary] :

In my opinion, the weakness of an MR program comes down to one point: there is too much IO. First, the Map side has to read its input from HDFS (already slow). After the map business logic finishes, the map output is written to local disk (one IO). The Reduce side then has to pull the result files written by the Map nodes (possibly network IO), and if the data pulled to the reduce node is too large it spills to disk again (another IO). Finally the reduce output is written back to HDFS…… A single MR job can therefore involve many rounds of IO on the same node and across nodes. If it is used for complex iterative computation such as machine learning, the IO may take longer than the core business logic itself. So MR is best suited to one-shot, single-pass computations over large amounts of data.

2. Things to watch out for when writing a Mapper & Reducer

  • The map() method is called once for each <key, value> pair (see the sketch after this list)
  • The reduce() method is called once for each key; a key may correspond to multiple values, which arrive wrapped in an iterable object
  • Don't import the wrong package!! Don't import the wrong package!! Don't import the wrong package!!! A pit of blood and tears: mapred (the old API, don't use it) vs. mapreduce (use this one!)
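
As a concrete illustration, here is a WordCount-style Mapper and Reducer sketch (class names are placeholders matching the Driver in section 1; in a real project each class lives in its own source file). The important detail is that everything is imported from org.apache.hadoop.mapreduce, not the old org.apache.hadoop.mapred package:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;   // the new API -- NOT org.apache.hadoop.mapred.*
    import org.apache.hadoop.mapreduce.Reducer;

    // map() is invoked once per <key, value> pair (with TextInputFormat: once per input line)
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Text word = new Text();
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;   // skip leading-whitespace artifacts
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // reduce() is invoked once per key; all values for that key arrive wrapped in one Iterable
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }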

3. MR slicing

What is a slice?

MR needs input, and MR is a distributed computing model, so a computation over a large amount of data launches multiple Mappers and Reducers (the number of Reducers is generally far smaller than the number of Mappers). If, as the first step, every Mapper received the complete data set, the distribution would be completely pointless. So before the data enters the Mappers it is first split logically: the whole data set is divided into several parts, and each part is handed to one Mapper to process. Therefore, number of slices = number of Mappers (MapTasks)

How is the data sliced?

The first type: slice by file size (the default slicing method), with source code analysis below

  • The splitSize, i.e. the target size of a slice, is computed from configuration parameters; see the source code analysis below
  • Files take priority over slices: if a single file is smaller than splitSize, it simply becomes one slice by itself; the input files are never treated as one combined whole
  • To be clear: the slices here are only logical slices!! It is the equivalent of putting "marks" on a file at fixed intervals, rather than physically cutting the data file into pieces. The slicing done on the Client side is just part of submitting the Job: the split information (those marks) is sent to MR, and in the Map phase each Mapper reads its assigned range of data according to those marks
    protected long getFormatMinSplitSize() { return 1L; }

    public static long getMinSplitSize(JobContext job) {
        return job.getConfiguration().getLong("mapreduce.input.fileinputformat.split.minsize", 1L);
    }

    public static long getMaxSplitSize(JobContext context) {
        return context.getConfiguration().getLong("mapreduce.input.fileinputformat.split.maxsize", 9223372036854775807L);
    }

    // ...
    if (this.isSplitable(job, path)) {
        long blockSize = file.getBlockSize();
        long splitSize = this.computeSplitSize(blockSize, minSize, maxSize);
        long bytesRemaining;
        int blkIndex;
        // While the remaining data is still larger than 1.1 * splitSize, keep slicing;
        // otherwise whatever is left becomes the final slice
        for (bytesRemaining = length; (double) bytesRemaining / (double) splitSize > 1.1D; bytesRemaining -= splitSize) {
            blkIndex = this.getBlockIndex(blkLocations, length - bytesRemaining);
            splits.add(this.makeSplit(path, length - bytesRemaining, splitSize,
                    blkLocations[blkIndex].getHosts(), blkLocations[blkIndex].getCachedHosts()));
        }
        // ... (10,000 lines of code omitted here)
    }
    // blockSize is the HDFS block size of the file -- 128 MB by default
    protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }
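
A quick worked example with the defaults (minSize = 1, maxSize = Long.MAX_VALUE, blockSize = 128 MB, so splitSize = 128 MB): a 300 MB file becomes three slices of 128 MB, 128 MB and 44 MB, while a 130 MB file stays as a single slice, because 130 / 128 ≈ 1.02 is below the 1.1 threshold.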

The second type: slice by the number of lines in the file

  • The equivalent of changing the criterion from file size to number of lines: you set how many lines go into one slice (see the example below)
  • It likewise never slices across files
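
For example (purely illustrative numbers): with NLineInputFormat.setNumLinesPerSplit(job, 3), an 11-line file produces 4 slices of 3 + 3 + 3 + 2 lines, and therefore 4 MapTasks.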

The third type: the small-file-combining slicing method (CombineTextInputFormat)

When there are a large number of small files, the two mechanisms above produce a large number of slices and launch a large number of Mappers, with very low resource utilization. A slicing method therefore exists specifically for the too-many-small-files scenario.

Important parameter: CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4 MB

  • Step1: Virtual storage. See pseudocode for the mechanism
    if (file A.size() <= maxInputSplitSize) {
        A becomes one virtual file;
    } else if (file A.size() <= maxInputSplitSize * 2) {
        A is split into two virtual files of equal size [A1, A2];
    } else {
        A cuts off a chunk of size maxInputSplitSize, and the remainder goes through the same rules again;
    }
  • Step 2: Re-slicing (merging the virtual files)
    if (virtual file B1 >= maxInputSplitSize) {
        B1 becomes one slice;
    } else {
        B1 waits for the next virtual file to merge with;
        if (merged size >= maxInputSplitSize) {
            the merged files become one slice;
        } else {
            keep waiting to merge;
        }
    }
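
A worked example (illustrative file sizes) with maxInputSplitSize = 4 MB and four small files of 1.7 MB, 5.1 MB, 3.4 MB and 6.8 MB: virtual storage produces chunks of 1.7, 2.55, 2.55, 3.4, 3.4 and 3.4 MB; re-slicing then merges them in order into 3 slices of (1.7 + 2.55) MB, (2.55 + 3.4) MB and (3.4 + 3.4) MB, so only 3 MapTasks are launched instead of one per file.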

4. Input types commonly used by MR

All of them inherit from the FileInputFormat abstract class. A custom input type also needs to inherit from this abstract class and override createRecordReader() (returning the RecordReader that implements the custom reading logic).

InputFormat             | How it slices                | Key type                                                         | Value type
TextInputFormat         | By file size (default)       | Offset of the line <LongWritable>                                | Content of the line <Text>
NLineInputFormat        | By a fixed number of lines   | Offset of the line <LongWritable>                                | Content of the line <Text>
CombineTextInputFormat  | Small-file-combining slicing | Offset of the line <LongWritable>                                | Content of the line <Text>
KeyValueTextInputFormat | By file size (default)       | field[0] of the line, i.e. the part before the separator <Text>  | The part after the separator <Text>

All four of these common input types read data line by line; each line forms one KV pair and triggers one call to the map() method.
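
For example (illustrative data), with KeyValueTextInputFormat and the separator set to a tab, the input line "tom\t18 beijing" is turned into key = Text("tom") and value = Text("18 beijing"), and map() is called once with that pair.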

Don’t forget!!

In addition to the default input format, the other three input formats require parameters to be set in the Driver class:

    // Set the input format in the Driver
    job.setInputFormatClass(......);

    // For NLineInputFormat: set how many lines make up one slice
    NLineInputFormat.setNumLinesPerSplit(job, 3);

    // For CombineTextInputFormat: set the maximum slice size
    CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4 MB

    // For KeyValueTextInputFormat: specify the delimiter between the Key and the Value of each line
    // Note that this parameter is set on the Configuration!!
    conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, "\t"); // split on tab

5. Output types commonly used by MR

Similarly, the output types of MR (the file formats the Reducer writes) all inherit from the FileOutputFormat class. Custom output formats are also supported: inherit from FileOutputFormat and override getRecordWriter() (returning the RecordWriter that does the actual writing).

OutputFormat             | Description
TextOutputFormat         | Writes the Reducer's results directly as lines of text (string format)
SequenceFileOutputFormat | Writes the Reducer's results as serialized KV key-value pairs in a SequenceFile

SequenceFileOutputFormat writes out the key-value pairs produced by the Reducer in serialized form, which makes them convenient to compress. In addition, the result of one MR job written with SequenceFileOutputFormat can easily serve as the input of the next MR job: since the KV pairs are already serialized and packaged, the next job only needs to deserialize them as it reads them in.
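
A short sketch of that hand-off (the job variables and the path are illustrative; the rest of the two Drivers is omitted):

    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    // Job 1 writes its <K, V> results as a serialized SequenceFile
    job1.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileOutputFormat.setOutputPath(job1, new Path("/data/tmp/step1"));   // hypothetical path

    // Job 2 reads those KV pairs back directly -- no text parsing, only deserialization
    job2.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(job2, new Path("/data/tmp/step1"));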

PS

When will sifl support custom font colors!! I really want to take colorful, flashy notes…… o(╥﹏╥)o