0. Preface

To learn something, we need to answer three questions: What is it? What can it do? How does it work? This article gives the answers to these questions. The answers are, of course, theoretical knowledge; understanding them lays the foundation for the hands-on practice that follows.

1. Overview

1.1 Definition

MapReduce is a computing model, framework, and platform for parallel processing of big data. Its core function is to integrate user-written business logic code with built-in default components into a complete distributed computing program that runs concurrently on a Hadoop cluster.

1.2 Development History

MapReduce was originally designed by Google to process its search engine's massive web data in parallel; Google used it to rewrite the web document indexing system behind its search engine.

In 2003 and 2004, Google published papers on the distributed file system GFS and on MapReduce at international conferences, disclosing their basic principles and main design ideas.

In 2004, Doug Cutting, founder of the open-source projects Lucene (a search indexing library) and Nutch (a search engine), recognized MapReduce as a key technology for large-scale web data processing and designed and developed the Hadoop MapReduce parallel computing framework in Java. Since then, Hadoop has become an important project under the Apache open-source organization.

The launch of MapReduce had a truly revolutionary impact on the parallel processing of big data: it became an industry standard for big data processing and changed the way we do large-scale computing.

2. Main functions of MapReduce

1. Data partitioning and computing task scheduling

The framework automatically divides the data to be processed by a Job into multiple data blocks, matches each data block with a computing Task, and automatically schedules compute nodes to process the corresponding blocks. These nodes are the Map nodes and Reduce nodes.

2. Data/code localization

To reduce data traffic, a basic principle is data locality: a compute node processes, as far as possible, the data stored on its own local disk. Only when local processing is impossible does the framework look for the nearest available node and migrate the data to it.

3. System optimization

In order to reduce the cost of data communication, intermediate data is merged before it is passed to the Reduce nodes, and the output of the Map nodes is partitioned according to a certain policy so that all related data is sent to the same Reduce node.
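
In Hadoop this map-side merging is done by a Combiner, which a job can register in addition to its Mapper and Reducer. The sketch below assumes a hypothetical word-count style job (Text keys, IntWritable counts); the class name and types are illustrative, not taken from this article.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// A Combiner is written like a Reducer; the framework may run it on the Map
// side to pre-merge intermediate (key, value) pairs before they cross the network.
public class LocalSumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();   // merge all local counts for this key
        }
        context.write(key, new IntWritable(sum));
    }
}

It is enabled with job.setCombinerClass(LocalSumCombiner.class); the framework treats it as an optimization and may invoke it zero or more times.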

4. Error detection and recovery

MapReduce detects and isolates faulty nodes and schedules new nodes to take over their computing tasks. It also keeps multiple replicas of the data to improve storage reliability.

3. Core ideas of MapReduce

In MapReduce, how the data processed in the Map phase is transferred to the Reduce phase is the most critical part of the framework. This process is called Shuffle, and its core mechanisms include data partitioning, sorting, and caching.

Map is a mapping operation: it filters and distributes the data, converting the raw input into key-value pairs.
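
As a concrete sketch of the Map step, the classic word-count example on the Hadoop Java API (an assumed example, not one given in this article) turns each input line into (word, 1) key-value pairs:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map: convert each raw input line into intermediate (word, 1) key-value pairs.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit one pair per word
        }
    }
}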

Reduce is a merging operation: values that share the same key are processed together, and a new key-value pair is output as the final result. To allow Reduce to process Map results in parallel, the Map outputs must be sorted and partitioned before they are handed to Reduce; this process of sorting the Map outputs and delivering them to Reduce is the Shuffle.
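
The matching Reduce step of the same hypothetical word-count job receives, for each key, all values grouped together by the Shuffle and merges them into one final pair:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce: merge all values that share the same key into a single result.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);   // e.g. ("hadoop", 42)
    }
}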

The Map and Reduce operations require developers to define their own Mapper and Reducer classes, which implement the mapping, simplifying, and merging work the job needs.
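
A driver class then wires the user-defined classes together with the framework's default components into a runnable distributed job. Roughly, reusing the hypothetical classes sketched above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: combine the user-defined Mapper/Reducer with default components
// (input/output formats, partitioner, scheduling) into one distributed program.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(LocalSumCombiner.class);   // optional map-side merge
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}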

The Shuffle process consists of a Map-side Shuffle and a Reduce-side Shuffle. On the Map side, Shuffle partitions and sorts the Map results, then merges the outputs that belong to the same partition.
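
The partitioning step of the Map-side Shuffle is pluggable. As a rough sketch of the policy idea (this mirrors what Hadoop's default HashPartitioner does; the class name here is illustrative), a partitioner maps each key to the Reduce task that will receive it:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partitioning policy: decide which Reduce task receives each key, so that
// all records with the same key end up on the same Reduce node.
public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Hash the key, clear the sign bit, take it modulo the number of reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

A custom policy like this can be installed with job.setPartitionerClass(HashLikePartitioner.class).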