
1. First introduction to YARN

Apache Hadoop YARN is the resource management and job scheduling technology in the Hadoop distributed processing framework. As one of the core components of Apache Hadoop, YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster and for scheduling the tasks to be executed on the different cluster nodes.

The basic idea of YARN is to split the resource management and job scheduling/monitoring functions into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.

The ResourceManager and the NodeManagers form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent responsible for Containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager.

The per-application ApplicationMaster is in effect a framework-specific library whose job is to negotiate resources from the ResourceManager and work with the NodeManagers to execute and monitor the tasks.

In the YARN architecture, the ResourceManager runs as a daemon. As the global master of the architecture, it usually runs on a dedicated machine and arbitrates the available cluster resources among competing applications. The ResourceManager tracks the number of active nodes and the resources available on the cluster, and coordinates which resources the applications submitted by users should acquire and when. Because the ResourceManager is the single process that holds all of this information, it can make scheduling decisions in a shared, secure, and multi-tenant manner (for example, based on application priority, queue capacity, ACLs, data locality, etc.).

When a user submits an application, a lightweight process instance named ApplicationMaster is launched to coordinate the execution of all tasks in that application. This includes monitoring tasks, restarting failed tasks, speculatively running slow tasks, and aggregating the totals of the application's counters. The ApplicationMaster and the tasks belonging to its application run in resource containers controlled by the NodeManagers.

A NodeManager hosts a number of dynamically created resource containers. The size of a container is defined by the amount of resources it contains, such as memory, CPU, disk, and network IO; currently, only memory and CPU are supported. The number of containers on a node is determined by configuration parameters and the node's total resources (such as total CPU and total memory), excluding what is reserved for the daemons and the OS.

The ApplicationMaster can run any type of task inside a container. For example, the MapReduce ApplicationMaster asks a container to start a Map or Reduce task, while the Giraph ApplicationMaster asks a container to run a Giraph task. You can also implement a custom ApplicationMaster that runs specific tasks.

In YARN, MapReduce has simply been relegated to the role of one distributed application among others (though still a very popular and useful one), now known as MRv2.

In addition, YARN supports the notion of resource reservation via the ReservationSystem, which lets users specify a profile of resources over time and temporal constraints (for example, deadlines), and reserve resources to ensure the predictable execution of important jobs. The ReservationSystem tracks resources over time, performs admission control for reservations, and dynamically instructs the underlying scheduler to ensure that each reservation is fulfilled.

2. YARN basic service components

YARN has a master/slave structure: in this resource management framework, the ResourceManager is the Master and the NodeManagers are the Slaves.

YARN consists of ResourceManager, NodeManager, ApplicationMaster, and Container.

  • ResourceManager is an independent process running on the Master node. It performs unified management, scheduling, and allocation of the cluster's resources.
  • NodeManager is an independent process running on each Slave node; it reports the node's status.
  • ApplicationMaster acts as the guardian and manager of an Application. It monitors and manages the running of all the Application's attempts on the nodes of the cluster, and applies for resources from (and returns them to) the YARN ResourceManager.
  • Container is the unit in which YARN allocates resources, encapsulating memory and CPU; YARN hands out resources as Containers.

The ResourceManager centrally manages and schedules the resources on every NodeManager. When a user submits an application, an ApplicationMaster must be provided to trace and manage that application; it applies for resources from the ResourceManager and asks the NodeManagers to start tasks that occupy those resources. Because the different ApplicationMasters are distributed across different nodes, they do not affect each other.

Every application submitted by a Client to the ResourceManager must have an ApplicationMaster. After the ResourceManager allocates resources, the ApplicationMaster runs in a Container on a Slave node, and the tasks that do the actual work also run in Containers on Slave nodes.

2.1 ResourceManager

RM is the global resource manager; there is only one in a cluster. It manages and allocates resources for the entire system, including handling client requests, starting and monitoring ApplicationMasters, monitoring NodeManagers, and allocating and scheduling resources. It consists mainly of two components: the Scheduler and the Applications Manager (ASM).

(1) Scheduler

The Scheduler allocates the resources in the system to the running applications subject to capacity and queue constraints (such as the resources allocated to each queue or the maximum number of jobs executed). It is important to note that this Scheduler is a "pure scheduler": it is not responsible for any application-specific work, such as monitoring or tracking application execution status, nor for restarting tasks that fail due to application errors or hardware faults. That work is done by the application-specific ApplicationMaster.

The Scheduler allocates resources based only on each application's resource requirements, which are expressed through an abstraction called the Resource Container (Container for short). A Container is a dynamic resource allocation unit: it encapsulates resources such as memory, CPU, disk, and network, thereby limiting the amount of resources each task can use.
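As a rough illustration, here is a minimal sketch against the YARN records API (the class name ContainerSizeSketch and the figures are my own, arbitrary examples): the per-Container resource envelope the Scheduler reasons about boils down to a Resource object holding memory and virtual cores.

import org.apache.hadoop.yarn.api.records.Resource;

public class ContainerSizeSketch {
    // One Container's resource envelope: memory in MB and virtual cores.
    // Only memory and CPU are described here, as noted above.
    static Resource oneContainerWorth() {
        return Resource.newInstance(1024, 2); // 1024 MB, 2 vcores (arbitrary values)
    }
}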

(2) Applications Manager

The Applications Manager is responsible for managing all applications in the system: accepting job submissions, negotiating with the Scheduler for the first Container in which to run the application's ApplicationMaster, monitoring the ApplicationMaster's health, and restarting it if it fails.

2.2 ApplicationMaster

The ApplicationMaster manages each instance of an application running in YARN. Job or application management is handled by the ApplicationMaster process, and YARN allows us to develop an ApplicationMaster for our own applications.

Function:

  • Data splitting;
  • Applying for resources for the application and further assigning them to its internal tasks;
  • Task monitoring and fault tolerance;
  • Negotiating resources from the ResourceManager and monitoring task execution and resource usage through the NodeManager.

The communication between the ApplicationMaster and the ResourceManager is the core of a YARN application's life cycle, from submission to completion, and it is the fundamental mechanism by which YARN dynamically manages cluster resources. The ApplicationMasters of multiple applications communicate with the ResourceManager continuously and dynamically, repeatedly applying for, releasing, and re-applying for resources.
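To make this request/release cycle concrete, here is a minimal, hedged sketch of the ApplicationMaster's dialogue with the ResourceManager using Hadoop's AMRMClient library. The class name, host, memory/vcore figures, and messages are placeholders, and error handling is omitted; this is an illustration of the protocol, not a production ApplicationMaster.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class AmRmSketch {
    public static void run() throws Exception {
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(new Configuration());
        rm.start();

        // 1. Register with the ResourceManager (host/port/tracking URL are placeholders).
        rm.registerApplicationMaster("am-host", 0, "");

        // 2. Ask for Containers: 1024 MB and 1 vcore each, no locality preference.
        Resource capability = Resource.newInstance(1024, 1);
        Priority priority = Priority.newInstance(0);
        rm.addContainerRequest(new ContainerRequest(capability, null, null, priority));

        // 3. Heartbeat/allocate: the RM replies with the Containers granted so far.
        AllocateResponse response = rm.allocate(0.1f);
        List<Container> granted = response.getAllocatedContainers();
        // ... launch tasks in the granted Containers via the NodeManagers ...

        // 4. Deregister once the application has finished.
        rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
        rm.stop();
    }
}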

2.3 NodeManager

There is one NodeManager on each node in the cluster; it is responsible for the resources on that node and their usage.

NodeManager is a slave service: it receives resource allocation requests from the ResourceManager and assigns specific Containers to applications. It also monitors Container usage and reports it to the ResourceManager. Working together with the ResourceManager, the NodeManagers allocate resources across the Hadoop cluster.

In short, the NodeManager tracks the resource usage on its node (such as CPU and memory) and the running status of its Containers. Its main responsibilities are:

  • Receives and processes command requests from the ResourceManager and allocates Containers to application tasks;
  • Periodically reports its own health status to the ResourceManager, which aggregates these reports to track the health of the entire cluster;
  • Handles requests from ApplicationMasters;
  • Manages the life cycle of each Container on the node;
  • Manages the logs on the node;
  • Runs additional services used by applications on YARN, such as the Shuffle service of MapReduce.

When a node starts, it registers with the ResourceManager and tells it how many resources are available. While running, the NodeManager and ResourceManager work together to keep this information up to date and keep the cluster in an optimal state.

NodeManager only manages its own Containers and knows nothing about the applications running in them; the component responsible for application-level information is the ApplicationMaster.

2.4 Container

Container is a resource abstraction in YARN. It encapsulates multi-dimensional resources on a node, such as memory, CPU, disk, and network resources. When AM applies for resources from RM, the resources returned by RM are represented by Container. YARN assigns a Container to each task, and the task can use only the resources described in the Container.

The relationship between Containers and cluster nodes is this: a node can run multiple Containers, but a Container never spans nodes. Any job or application must run in one or more Containers. In the YARN framework, the ResourceManager only tells the ApplicationMaster which Containers are available; the ApplicationMaster still has to go to the NodeManager to launch each specific Container.

Note that a Container is a dynamic resource division unit, generated dynamically according to the application's requirements. At present YARN supports only CPU and memory resources, and uses the lightweight Cgroups mechanism for resource isolation.

Function:

  • An abstraction of the task environment;
  • A description of a set of information;
  • The collection of resources a task runs with (CPU, memory, IO, etc.);
  • The task's runtime environment.

3. YARN application submission process

The Application execution process in Yarn can be summarized as follows:

  1. Application submission
  2. Start the ApplicationMaster instance of the application
  3. The ApplicationMaster instance manages the execution of the application

The specific submission process is:

  1. The client program submits an application to ResourceManager and requests an ApplicationMaster instance.
  2. ResourceManager finds a NodeManager that can run a Container and starts the ApplicationMaster instance in this Container.
  3. The ApplicationMaster registers with the ResourceManager. After registration, the client can query the ResourceManager for the ApplicationMaster's details and then interact with the ApplicationMaster directly (the client communicates with the ApplicationMaster, sending it requests that match its needs);
  4. During normal operation, the ApplicationMaster sends resource-request messages to the ResourceManager according to the resource-request protocol;
  5. When a Container is allocated successfully, the ApplicationMaster sends container-launch-specification information to the NodeManager to start the Container; this launch specification contains the information the Container needs to communicate with the ApplicationMaster;
  6. The application code runs as tasks inside the started Containers and reports its progress and status to the ApplicationMaster via an application-specific protocol;
  7. While the application is running, the client that submitted it communicates directly with the ApplicationMaster to obtain status, progress updates, and so on, also via an application-specific protocol;
  8. Once the application has finished and all related work is complete, the ApplicationMaster deregisters from the ResourceManager and shuts down, and all the Containers it used are returned to the system.

Abridged version:

  • Step 1: Submit applications to ResourceManager.
  • Step 2: ResourceManager applies for resources for ApplicationMaster and communicates with a NodeManager to start the first Container to start ApplicationMaster.
  • Step 3: ApplicationMaster registers and communicates with ResourceManager to apply for resources for internal tasks. Once resources are obtained, ApplicationMaster communicates with NodeManager to start tasks.
  • Step 4: After all tasks are complete, ApplicationMaster logs out of ResourceManager. The entire application is complete.
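For reference, the abridged Step 1 and Step 2 correspond roughly to the client-side sketch below, which assumes the standard YarnClient API. The class name, application name, launch command, and resource figures are placeholders, and error handling is omitted; this is an illustrative sketch, not a complete submitter.

import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.util.Records;

public class SubmitSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new Configuration());
        yarn.start();

        // Step 1: ask the ResourceManager for a new application.
        YarnClientApplication app = yarn.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app");

        // Step 2: describe the Container that will run the ApplicationMaster.
        // The command below is a placeholder for a real AM command line.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList("echo placeholder-AM-command"));
        ctx.setAMContainerSpec(amContainer);
        ctx.setResource(Resource.newInstance(1024, 1));

        // Submit; the RM then finds a NodeManager and launches the AM there.
        ApplicationId appId = yarn.submitApplication(ctx);

        // The client can poll the RM for the application's status afterwards.
        ApplicationReport report = yarn.getApplicationReport(appId);
        System.out.println("State: " + report.getYarnApplicationState());
    }
}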

4. Resource Request and Container

Yarn is designed to allow our applications to use the entire cluster in a shared, secure, multi-tenant manner. Yarn must be aware of the entire cluster topology to ensure efficient cluster resource scheduling and data access.

To achieve these goals, the ResourceManager Scheduler defines some flexible protocols for application Resource requests to better schedule applications in the cluster. Therefore, Resource Request and Container are born.

An application sends a request describing its resource needs to the ApplicationMaster, and the ApplicationMaster forwards it to the ResourceManager's Scheduler as a resource-request. In response to the original resource-request, the Scheduler returns a Container describing the allocated resources.

Each ResourceRequest can be thought of as a serializable Java object containing the following fields:

<resource-name, priority, resource-requirement, number-of-containers>

  • resource-name: the name of the resource; currently the host or rack on which the resource is located. VMs or more complex network topologies may be supported in the future.
  • priority: the priority of this request within the application (not across applications).
  • resource-requirement: the required resources; currently the amount of memory and CPU.
  • number-of-containers: the number of Containers that satisfy the above requirements.
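Mapping the tuple above onto the YARN records API, a single ResourceRequest might be built roughly as follows. The class name and figures are illustrative only; ResourceRequest.ANY ("*") stands for "any host".

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class ResourceRequestSketch {
    // <resource-name, priority, resource-requirement, number-of-containers>
    static ResourceRequest example() {
        return ResourceRequest.newInstance(
                Priority.newInstance(1),        // priority within this application
                ResourceRequest.ANY,            // resource-name: "*" = any host; could also be a host or rack
                Resource.newInstance(2048, 1),  // resource-requirement: 2048 MB, 1 vcore
                3);                             // number-of-containers
    }
}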

Once it has the Containers in hand, the ApplicationMaster must also interact with the NodeManagers of the machines where the Containers were allocated in order to start the Containers and run the related tasks. Of course, Container allocation is authenticated, so an ApplicationMaster cannot claim cluster resources it has not been allocated.
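A hedged sketch of that AM-to-NodeManager step, using Hadoop's NMClient library, is shown below. The class name and the launch command are placeholders, error handling is omitted, and the Container passed in is assumed to have come from a prior allocation by the ResourceManager.

import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.util.Records;

public class LaunchSketch {
    // 'granted' is a Container previously returned by the ResourceManager
    // (e.g. from AllocateResponse.getAllocatedContainers()).
    static void launchTask(Container granted) throws Exception {
        NMClient nm = NMClient.createNMClient();
        nm.init(new Configuration());
        nm.start();

        // The launch specification: what command to run inside the Container.
        ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
        ctx.setCommands(Collections.singletonList("echo placeholder-task-command"));

        // Ask the NodeManager that owns this Container to start it.
        nm.startContainer(granted, ctx);
    }
}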

5. Configure YARN

A). Modify the YARN configuration file

etc/hadoop/mapred-site.xml:

<configuration>
	<property>
		<name>mapreduce.framework.name</name>
		<value>yarn</value>
	</property>
</configuration>

etc/hadoop/yarn-site.xml:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

B). Start ResourceManager and NodeManager with sbin/start-yarn.sh; stop them with sbin/stop-yarn.sh.

C). Verification: run the jps command to check whether YARN has started.

If the jps output lists the ResourceManager and NodeManager processes, YARN has started successfully.

D). Submit a job to YARN as a jar package. Assuming the jar package is example.jar, the command format is:

hadoop jar <jar package> <application name> <input path> <output path>

For example:

hadoop jar example.jar wordcount /input/hello.txt /output/helloCount.txt
