Author: Unreal good

Source: Hang Seng LIGHT Cloud Community

The basic concept

MapReduce is a programming framework for distributed computing: programs written against it are submitted by users and run concurrently on a Hadoop cluster.

The core function of MapReduce is to combine user-written business logic code with the framework's built-in default components into a complete distributed computing program.

MapReduce is a programming model divided into two phases, Map and Reduce: the input data is split into blocks, each block is processed by Map, and the Map output is handed to Reduce.

It can be thought of as a process of first breaking the data apart and then summarizing the pieces.

The core algorithm

A MapReduce computation usually consists of three steps (a minimal sketch follows the list):

  • Map: the job of the map (or mapper) step is to process the input data. Each worker node applies the map function to its local data and writes the output to temporary storage.
  • Shuffle: worker nodes redistribute the data based on the output keys, sorting, grouping, and copying it so that all records belonging to the same key end up on the same worker node.
  • Reduce: the worker nodes then process each key's group of data in parallel and store the results in HDFS.
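
To make the three steps concrete, here is a minimal in-memory sketch in plain Java (ordinary collections and streams, not the Hadoop API): the map step emits (word, 1) pairs, the shuffle step groups the values by key, and the reduce step sums each group.

```java
import java.util.*;
import java.util.stream.*;

// A toy single-process simulation of map -> shuffle -> reduce (word count).
public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> lines = List.of("hello world", "hello mapreduce");

        // Map: turn every line into (word, 1) pairs.
        List<Map.Entry<String, Integer>> mapped = lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .map(word -> Map.entry(word, 1))
                .collect(Collectors.toList());

        // Shuffle: group all values that share the same key.
        Map<String, List<Integer>> shuffled = mapped.stream()
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce: fold each key's value list into a single count.
        Map<String, Integer> reduced = new TreeMap<>();
        shuffled.forEach((word, ones) ->
                reduced.put(word, ones.stream().mapToInt(Integer::intValue).sum()));

        System.out.println(reduced); // {hello=2, mapreduce=1, world=1}
    }
}
```

In a real cluster each of these steps runs distributed across worker nodes; the sketch only shows the data flow.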

The execution process

Taking word counting as an example, a MapReduce run consists of the following steps:

  • Input: reads the text file from the file system;
  • Splitting: the file is split by line into (K1, V1) pairs, where K1 is the line number and V1 is the text content of that line;
  • Mapping: each line is split on spaces into a List(K2, V2), where K2 is a word and V2 is 1, representing one occurrence;
  • Shuffling: because the mapping operations may run in parallel on different machines, a shuffling step distributes all data with the same key to the same node for merging; the result is pairs of K2 (the key) and List(V2) (an iterable collection of the V2 values emitted during mapping);
  • Reducing: each worker node applies the reduce() operation to the key-value pairs of each K2 and emits the final output.

In the MapReduce programming model, the splitting and shuffling operations are implemented by the framework; only mapping and reducing need to be implemented by our own code, which is where the name MapReduce comes from.
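
To show this division of labor, below is essentially the classic WordCount program from the Apache Hadoop MapReduce tutorial: the user supplies only the Mapper (mapping) and the Reducer (reducing), while input splitting and shuffling are done by the framework.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapping: split each line into words and emit (word, 1) -- the (K2, V2) pairs.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducing: the framework's shuffle delivers (K2, List(V2)); sum the list.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);   // user-written map
        job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);    // user-written reduce
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths are supplied on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Everything else (splitting the input into records, shipping map output across the network, sorting and grouping by key) happens inside Hadoop, which is exactly the point of the model.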

The advantages and disadvantages

Advantages

  • Large-scale data can be split across multiple nodes and computed in parallel, reducing the time needed for the computation;
  • Computation is moved to where the data resides, reducing network costs;

Disadvantages

  • It is suitable only for offline (batch) computing, not for streaming or real-time computing;
  • Intermediate results are saved to disk, increasing the I/O load;

Conclusion

The core of MapReduce is that user-developed programs can be submitted to Hadoop for execution, where the tasks are processed in parallel across the cluster, reducing computation time and yielding the final result.