MapReduce application development

The MapReduce programming process: first write the map and reduce functions, using unit tests to make sure they work as expected; then write a driver program to run the job, which can be tested against a small dataset in a local IDE; finally, once the program passes these tests, run it on a cluster.
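As a concrete sketch of this workflow (class names are illustrative, not from the original text), here is a minimal word-count job: a map function, a reduce function, and a Tool-based driver that can first be run against a small local dataset and then, unchanged, be submitted to a cluster with hadoop jar.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {

  // Map function: emit (word, 1) for each word in the input line.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce function: sum the counts for each word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: configures and runs the job; args are input and output paths.
  @Override
  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "word count");
    job.setJarByClass(WordCountDriver.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new WordCountDriver(), args));
  }
}
```

Using ToolRunner means the same driver picks up command-line configuration overrides, so it can be exercised with the local job runner first and then submitted to a cluster without code changes.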

Configuration files:

The Configuration class reads properties from XML resource files: core-default.xml holds the defaults, and core-site.xml holds site-specific overrides.
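A minimal sketch of how these files are consumed: constructing a Configuration loads core-default.xml and then core-site.xml from the classpath, with site values overriding the defaults. The property read below, fs.defaultFS, is just a standard example.

```java
import org.apache.hadoop.conf.Configuration;

public class ConfExample {
  public static void main(String[] args) {
    // core-default.xml and core-site.xml are loaded automatically
    // from the classpath when a Configuration is created.
    Configuration conf = new Configuration();
    // Values in core-site.xml override the defaults in core-default.xml.
    System.out.println(conf.get("fs.defaultFS"));
  }
}
```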

MapReduce workflows:

An instance of JobControl represents a graph of jobs to be run. Job configurations are added to it along with the dependencies between jobs, and JobControl executes the jobs in dependency order, in a thread. If a job fails, JobControl does not run the jobs that depend on it. (JobControl runs on the client and submits the jobs from there.)
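As an illustrative sketch (the two Job instances are hypothetical and assumed to be fully configured, with the second consuming the output of the first), chaining two dependent jobs with JobControl might look like this:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class WorkflowRunner {
  // first and second are hypothetical, fully configured Job instances.
  public static void runChain(Job first, Job second) throws Exception {
    ControlledJob step1 = new ControlledJob(first.getConfiguration());
    step1.setJob(first);
    ControlledJob step2 = new ControlledJob(second.getConfiguration());
    step2.setJob(second);
    step2.addDependingJob(step1); // step2 runs only after step1 succeeds

    JobControl control = new JobControl("two-step-workflow");
    control.addJob(step1);
    control.addJob(step2);

    // JobControl implements Runnable; it runs on the client in its own
    // thread, submitting jobs as their dependencies are satisfied.
    Thread t = new Thread(control);
    t.setDaemon(true);
    t.start();
    while (!control.allFinished()) {
      Thread.sleep(1000);
    }
    control.stop();
  }
}
```

Because JobControl runs client-side, the client process must stay alive (as in the polling loop above) until allFinished() reports that the whole graph is done.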

Apache Oozie: Oozie runs as a server, and a client submits a workflow definition to the server for immediate or later execution. An Oozie workflow is a directed acyclic graph of action nodes and control-flow nodes. Action nodes perform workflow tasks, while control-flow nodes govern execution between the actions with conditional logic or parallel branches. When the workflow finishes, Oozie can notify the client of the workflow status by sending an HTTP callback.
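For contrast with JobControl, submitting a workflow from a Java client to an Oozie server might look roughly like the sketch below, using Oozie's Java client API; the server URL and HDFS application path are placeholder assumptions for illustration.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSubmit {
  public static void main(String[] args) throws Exception {
    // The server URL and HDFS paths are assumptions for illustration;
    // substitute your own deployment's values.
    OozieClient client = new OozieClient("http://localhost:11000/oozie");
    Properties conf = client.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH,
        "hdfs://localhost:8020/user/me/my-workflow");
    conf.setProperty("nameNode", "hdfs://localhost:8020");
    // run() hands the workflow to the server and returns a job id;
    // the server executes the workflow, not the client.
    String jobId = client.run(conf);
    System.out.println("Submitted workflow " + jobId);
  }
}
```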

How MapReduce works

A MapReduce job can be run with a single method call: submit() on a Job object. You can also call waitForCompletion(), which submits the job if it has not been submitted already and then waits for it to finish. The submit() call hides a great deal of processing detail behind the scenes.
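A small sketch contrasting the two calls (the Job instance is assumed to be fully configured):

```java
import org.apache.hadoop.mapreduce.Job;

public class SubmitExample {
  // job is a hypothetical, fully configured Job instance.
  static void launch(Job job) throws Exception {
    // Option 1: submit() returns as soon as the job is handed off.
    // job.submit();

    // Option 2: waitForCompletion() submits the job if it has not been
    // submitted yet, then blocks until it finishes; passing 'true' makes
    // it print progress to the console as the job runs.
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}
```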

Hadoop 2.0 introduced a new execution mechanism built on a system called YARN. The execution framework is selected by setting the mapreduce.framework.name property: local selects the local job runner, classic selects the classic MapReduce framework (MapReduce 1, which uses one jobtracker and multiple tasktrackers), and yarn selects the new framework. Each setting represents a different way of running MapReduce programs.
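For example, the property can be set programmatically on a Configuration (or, equivalently, in mapred-site.xml); a minimal sketch:

```java
import org.apache.hadoop.conf.Configuration;

public class FrameworkSelection {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // "local" = local job runner, "classic" = MapReduce 1
    // (jobtracker/tasktrackers), "yarn" = MapReduce on YARN.
    conf.set("mapreduce.framework.name", "local");
    System.out.println(conf.get("mapreduce.framework.name"));
  }
}
```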

A job run in classic MapReduce (MapReduce 1) involves four independent entities:

- Client: submits the MapReduce job.

- Jobtracker: coordinates the job run.

- Tasktrackers: run the tasks that the job has been split into.

- Distributed filesystem (normally HDFS): used for sharing job files between the other entities.


YARN (MapReduce 2):

YARN remedies the scalability bottleneck of "classic" MapReduce by splitting the jobtracker's responsibilities across multiple independent entities. The jobtracker takes care of both job scheduling and task progress monitoring; YARN separates these two roles into two independent daemons: a resource manager, which manages the use of resources across the cluster, and an application master, which manages the lifecycle of an application running on the cluster. The basic idea is that the application master negotiates with the resource manager for the cluster's computing resources, described in terms of containers, in which application-specific processes run. Containers are monitored by node managers running on the cluster's nodes.

MapReduce on YARN includes more entities than classic MapReduce:

- Client: submits the MapReduce job.

- YARN resource manager: coordinates the allocation of compute resources on the cluster.

- YARN node managers: launch and monitor the compute containers on machines in the cluster.

- MapReduce application master: coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.

- Distributed filesystem (normally HDFS): used for sharing job files between the other entities.

(Figure: the process of running a job on YARN.)

