1. MapReduce overview

  • Distributed computing framework
    • Map: input data is mapped into key/value pairs according to user-defined rules
    • Reduce: results are summarized per key (similar to GROUP BY in SQL)
    • Shuffle: data is transferred across the network by key so that records with the same key are grouped together (a local, single-process analogy follows this list)
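To make the Map/Shuffle/Reduce split concrete, here is a rough local analogy in plain Java (not the Hadoop API): groupingBy plays the role of the shuffle, and counting plays the role of the reduce step.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupByAnalogy {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "b", "a");

        // "Map": treat each word as a key; "Shuffle": group identical keys
        // together; "Reduce": count each group -- like GROUP BY in SQL.
        Map<String, Long> counts = words.stream()
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        System.out.println(counts); // {a=3, b=2, c=1}
    }
}
```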

2. MapReduce process

  • Description (a client-side submission sketch follows the notes below):
    • 1. The Client submits the application to the RM; the submission contains the Application Master program and its startup command.
    • 2. The Applications Manager allocates the first Container for the application and uses it to run the Application Master main program.
    • 3. The Application Master registers with the Applications Manager, after which the job's running status can be viewed on the YARN web interface.
    • 4. The main program applies for and receives resources (which machine, how much memory, how many CPU vcores) from the Resource Scheduler over an RPC protocol, polling until they are granted.
    • 5. On receiving the resource list, the Application Master communicates with the corresponding NM processes to start Containers and run the tasks.
    • 6. The NM sets up the environment (the container) for the task, writes the task startup command into a script, and launches the task with that script.
    • 7. The tasks (Map Tasks and Reduce Tasks) in each Container report progress and status to the Application Master over an RPC protocol, so the Application Master always knows each task's running state; if a task fails, its Container is restarted.
    • 8. When all tasks are complete, the Application Master asks the Applications Manager to deregister and close the job; whether the job completed successfully can then be checked on the web interface.
  • Conclusion:
    • Start the Application Master (the main program) and acquire resources;
    • Run the tasks until they complete.
  • Note:
    • The main program runs on an NM node.
    • The main program itself must apply for a Container.
    • The first Container of a job runs the main program.
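To make steps 1 and 2 concrete, here is a minimal sketch of a client submitting an application through Hadoop's YarnClient API; the application name, AM class, launch command, queue, and resource sizes are placeholder assumptions for illustration, not values from this document.

```java
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // Step 1: the client asks the RM for a new application.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app"); // placeholder name

        // The startup command for the Application Master (hypothetical AM class).
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
                "$JAVA_HOME/bin/java com.example.MyApplicationMaster"
                        + " 1>/tmp/am.stdout 2>/tmp/am.stderr"));
        ctx.setAMContainerSpec(amContainer);

        // Resources for the first Container (step 2), which runs the AM.
        ctx.setResource(Resource.newInstance(512 /* MB */, 1 /* vcores */));
        ctx.setQueue("default");

        // Hand the application off to the Applications Manager.
        ApplicationId appId = yarnClient.submitApplication(ctx);
        System.out.println("Submitted application " + appId);
    }
}
```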

MapTask count determination mechanism

The number of Map Tasks is determined by the number of InputSplits produced by the InputFormat (by default, roughly one split per HDFS block of the input).

MapReduce process – Dimension 2

Shuffle mechanism

The WordCount process
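Since the figure for this heading is not reproduced here, below is a minimal version of the classic Hadoop WordCount job (mapper, reducer, and driver, closely following the standard example) so the end-to-end process can be traced in code; input and output paths come from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: split each input line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // collected, partitioned, and spilled by the framework
            }
        }
    }

    // Reduce: after the shuffle groups pairs by word, sum the counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // written to the HDFS output directory
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional map-side pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```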

The overall flow of the Map Task:

  • 1) Read: the Map Task parses keys/values from the input InputSplit through the user-written RecordReader.
  • 2) Map: the parsed key/value pairs are handed to the user-written map() function for processing, which generates a new series of key/value pairs.
  • 3) Collect: when processing in the user-written map() function is complete, OutputCollector.collect() is usually called to output the results. Internally, it partitions the generated key/value pairs (via the Partitioner; see the sketch after this list) and writes them into a ring memory buffer.
  • 4) Spill: when the ring buffer fills up, MapReduce writes its data to the local disk as a temporary file. Before the data is written to disk, it is sorted locally and, if necessary, combined and compressed.
  • 5) Merge: once all data has been processed, the Map Task merges all the temporary files in one pass, ensuring that only a single data file is produced.
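As referenced in step 3, the Partitioner decides which Reduce Task each key/value pair is routed to. Below is a minimal sketch mirroring the behavior of Hadoop's default HashPartitioner (the class name here is illustrative):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Each pair emitted by collect() is assigned a partition number; all pairs
// in the same partition end up at the same Reduce Task after the shuffle.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the result is non-negative, then hash by key.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

A custom Partitioner is registered on the job with job.setPartitionerClass(WordPartitioner.class); the default one behaves exactly like this sketch.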

The overall flow of the Reduce Task:

  • 1) Shuffle: also called the Copy stage. The Reduce Task remotely copies its share of data from every Map Task. If a piece of copied data exceeds a certain threshold, the Reduce Task writes it to disk; otherwise it keeps the data in memory.
  • 2) Merge: while copying remotely, the Reduce Task runs two background threads that merge the files in memory and on disk, preventing excessive memory usage and an excessive number of files on disk.
  • 3) Sort: by MapReduce semantics, the input to the user-written reduce() function is a group of data aggregated by key. To cluster data with the same key, Hadoop uses a sort-based strategy. Since each Map Task has already partially sorted its own results, the Reduce Task only needs to merge-sort all the data once.
  • 4) Reduce: the Reduce Task hands each group of data, in turn, to the user-written reduce() function for processing (see the reducer sketch after this list).
  • 5) Write: reduce() writes its calculation results to HDFS.
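To illustrate steps 3 to 5, here is a hypothetical reducer (a per-key average, not an example from the original text): by the time reduce() is called, the shuffle, merge, and sort stages have already grouped all values sharing a key, and whatever reduce() emits is persisted to HDFS by the job's OutputFormat.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
    private final DoubleWritable avg = new DoubleWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sort/Merge guarantee: every value for this key arrives in one call.
        long sum = 0;
        long count = 0;
        for (IntWritable v : values) {
            sum += v.get();
            count++;
        }
        avg.set((double) sum / count); // Reduce: one result per key
        context.write(key, avg);       // Write: stored in HDFS via the OutputFormat
    }
}
```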