1. Overview of YARN

YARN is a resource scheduling platform that provides server computing resources to computing programs. It is analogous to a distributed operating system, and MapReduce is analogous to an application running on that operating system.

2. Key concepts of YARN

1) YARN knows nothing about the internal workings of user-submitted programs.

2) YARN only provides computing resource scheduling: a user program applies to YARN for resources, and YARN allocates them.

3) The supervising role in YARN is the ResourceManager.

4) The role that provides computing resources in YARN is the NodeManager.

5) In this way, YARN is completely decoupled from the user programs it runs, which means all kinds of distributed computing frameworks (MapReduce is just one of them) can run on YARN, such as MapReduce, Storm, Spark…

6) Therefore, computing frameworks such as Spark and Storm can be integrated and run on YARN as long as they implement a resource-request mechanism that conforms to the YARN specification.

7) With YARN as a universal resource scheduling platform, all of an enterprise's existing computing clusters can be consolidated onto one physical cluster, improving resource utilization and facilitating data sharing.


3. YARN resource scheduling process (core)

Detailed explanation of the working mechanism:

(0) The MapReduce (MR) program is submitted on the node where the client is located.

(1) YarnRunner applies to the ResourceManager (RM) for an application.

(2) The RM returns the application's resource submission path to YarnRunner.

(3) The program submits the resources required for running to that path on HDFS.

(4) After the resources are submitted, the client applies to run an MRAppMaster.

(5) The RM initializes the user's request into a Task.

(6) One of the NodeManagers picks up the Task.

(7) That NodeManager creates a Container and launches the MRAppMaster in it.

(8) The Container copies the resources from HDFS to the local node.

(9) The MRAppMaster applies to the RM for containers to run MapTasks.

(10) The RM assigns the MapTask work to two other NodeManagers; each picks up the task and creates a container.

(11) The MRAppMaster sends the program startup script to the two NodeManagers that accepted the tasks; each starts a MapTask, which partitions and sorts the data.

(12) The MRAppMaster applies to the RM for two containers to run ReduceTasks.

(13) Each ReduceTask fetches the data of its corresponding partition from the MapTasks.

(14) After the program finishes running, the MRAppMaster deregisters itself with the RM.
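From the client's point of view, all of the steps above happen behind a single call to job.waitForCompletion(true) in a driver program. Below is a minimal driver sketch, assuming a word-count-style job (the class name and input/output paths are illustrative, the Mapper/Reducer are left out, and the Hadoop client libraries must be on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // Tells Hadoop which jar to ship to HDFS (step 3 above).
        job.setJarByClass(WordCountDriver.class);

        // job.setMapperClass(...); job.setReducerClass(...);  // your own classes
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // This call triggers steps (1) through (14) and blocks until the job ends.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Whether this driver talks to a real YARN cluster or to a LocalJobRunner depends only on the configuration, which is the subject of the next section.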

4. MapReduce running modes

Local running mode

– The MapReduce job is submitted to a LocalJobRunner and runs locally as a single process (a local simulator of YARN).

– The input data and output results can be in the local file system or in HDFS.

config.set("fs.defaultFS", "hdfs://hadoop102:8020"); // this points at the cluster's HDFS

config.set("mapreduce.framework.name", "local"); // "local" is the default, so this line can be omitted

---

– How do you get a local run? Write the program without bringing in the cluster configuration files (essentially, it depends on whether the conf of your MR program carries the mapreduce.framework.name=local and yarn.resourcemanager.hostname parameters).

– Local mode makes it very easy to debug business logic by setting breakpoints in Eclipse.

---

If you want to run local mode on Windows to test the program logic, you need to configure environment variables on Windows:

HADOOP_HOME = d:/hadoop-2.6.1

PATH = %PATH%;%HADOOP_HOME%\bin

and replace the lib and bin directories under d:/hadoop-2.6.1 with versions compiled for the Windows platform (that is, install a Windows build of Hadoop).

---

Cluster running mode

– Submit the MapReduce program to the ResourceManager of the YARN cluster, which distributes it to multiple nodes for concurrent execution.

– The input data and output results are in the HDFS file system.

– Steps to submit to the cluster:

A. Package the program as a JAR and start it on any node of the cluster with the hadoop command:

     $ hadoop jar wordcount.jar cn.itcast.bigdata.mrsimple.WordCountDriver inputpath outputpath

B. Run the main method directly in Eclipse.

Method 1: set mapreduce.framework.name=yarn plus the two other basic YARN parameters in the configuration. (yarn-site.xml is already configured on most clusters, so if you are not connecting remotely, no extra configuration is required.)

config.set("fs.defaultFS", "hdfs://hadoop102:8020");

config.set("mapreduce.framework.name", "yarn");

config.set("yarn.resourcemanager.hostname", "hadoop103");

Method 2: you can also copy configuration files such as core-site.xml directly into the project.

C. If you want to submit a job to the cluster from Eclipse on Windows, you also need to modify the YarnRunner class.

Summary: hadoop jar, yarn jar, and java -jar all essentially just start the main method of a jar package.
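To make that summary concrete, here is a trivial, self-contained class (JarEntryPoint is a made-up example name, not part of Hadoop). Whichever launcher you use, this main method is what ultimately runs; the launchers differ only in the classpath and environment they set up around it:

```java
public class JarEntryPoint {

    // The logic the jar exists to run, kept in a method so it is easy to test.
    public static String greet(String who) {
        return "Hello from main, " + who;
    }

    // "hadoop jar", "yarn jar", and "java -jar" all end up invoking this method.
    public static void main(String[] args) {
        String who = args.length > 0 ? args[0] : "YARN";
        System.out.println(greet(who));
    }
}
```

Packaged as app.jar, all three of `hadoop jar app.jar JarEntryPoint`, `yarn jar app.jar JarEntryPoint`, and `java -cp app.jar JarEntryPoint` print the same greeting.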