Ideally, an application’s requests for YARN resources would be met immediately. In reality, resources are often limited, especially in a busy cluster, so an application’s resource request often has to wait before the corresponding resources become available. In YARN, the Scheduler is responsible for allocating resources to applications. Scheduling is itself a hard problem, and there is no single perfect policy for all application scenarios; for this reason YARN provides several schedulers and configurable policies to choose from. The YARN architecture is as follows:

  • ResourceManager (RM) : manages and schedules the resources on all NodeManagers in a unified manner, allocating idle Containers in response to ApplicationMaster resource requests and monitoring their running status. It consists mainly of two components: the Scheduler and the Applications Manager.
  • Scheduler: allocates the system's resources to running applications subject to capacity and queue constraints, such as how much capacity each queue receives and how many jobs it may run at most. The Scheduler allocates resources as Containers based on each application's resource requirements, thereby limiting the amount of resources each task may use. It is not responsible for monitoring or tracking application state, nor for restarting tasks that fail for any reason (the ApplicationMaster handles that). In short, the Scheduler assigns resources, encapsulated in Containers, to applications according to their requirements and the resources available on the cluster's machines. Schedulers are pluggable, e.g. CapacityScheduler and FairScheduler. (In practice, only simple configuration is required.)
  • Applications Manager: manages all applications in the entire system, including accepting application submissions, negotiating with the Scheduler for the resources to start the ApplicationMaster (AM), monitoring the AM and restarting it on failure, and tracking the progress and status of the assigned Containers. The ApplicationMaster itself belongs to the application framework: it negotiates resources with the ResourceManager and works with NodeManagers to execute and monitor tasks. MapReduce is a natively supported framework that can run MapReduce jobs on YARN, and many distributed applications, such as Spark and Storm, have developed their own application frameworks to run tasks on YARN. If necessary, we can also write a YARN application that complies with the specification.
  • NodeManager (NM) : the resource and task manager on each node. It periodically reports the node's resource usage and the running status of its Containers to the RM, and receives and processes Container start/stop requests from AMs.
  • ApplicationMaster (AM) : every application submitted by a user contains an AM, which is responsible for requesting resources for the application, tracking its execution status, and restarting failed tasks.
  • Container: the resource abstraction in YARN. It encapsulates multi-dimensional resources on a node, such as memory, CPU, disk, and network. When an AM requests resources from the RM, the resources the RM returns are represented as Containers. YARN assigns a Container to each task, and the task can use only the resources described in that Container.
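For reference, a few standard YARN CLI commands can be used to observe these components at work (the application-attempt ID below is a placeholder):

yarn node -list -all
yarn application -list
yarn container -list appattempt_1408815111200_0001_000001

The first command shows the NodeManagers registered with the RM and their status, the second lists the applications the RM knows about along with their queue and state, and the third lists the Containers of one application attempt.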

1. Yarn scheduler

1.1. FIFO Scheduler

The FIFO Scheduler arranges applications into a first-in, first-out queue according to submission order. When allocating resources, the application at the head of the queue is served first; the next application is served only after the head application's demand is satisfied, and so on. The FIFO Scheduler is the simplest and easiest scheduler to understand and requires no configuration, but it is not suitable for shared clusters: a large application can consume all cluster resources and block everything else. In a shared cluster, the Capacity or Fair Scheduler is preferred, since both allow large and small tasks to obtain system resources at the same time. The Yarn scheduler comparison diagram below shows the differences between these schedulers; as it shows, small tasks are blocked by large tasks under the FIFO scheduler.

1.2. Capacity Scheduler

The Capacity Scheduler is the resource scheduler configured by default in yarn-site.xml. It provides a dedicated queue for running small tasks; however, setting up a dedicated queue reserves cluster resources in advance, so the execution time of large tasks lags behind what the FIFO scheduler would achieve. With this resource scheduler you can configure YARN resource queues, which is described later.

1.3. Fair Scheduler

The design goal of the Fair Scheduler is fair resource allocation for all applications (the definition of “fair” can be set through parameters). The Yarn scheduler comparison diagram above shows fair scheduling of two applications in one queue; fair scheduling also works across multiple queues. For example, suppose two users A and B each have their own queue. If A starts a job while B has none, A's job obtains all cluster resources. When B then starts a job, A's job keeps running, and after a while each of the two jobs obtains half of the cluster's resources. If B starts a second job while the other jobs are still running, B's two jobs share B's queue, so each of them uses a quarter of the cluster while A's job still uses half; the end result is that resources are shared fairly between the two users. With the Fair Scheduler we do not need to reserve system resources up front; the scheduler dynamically rebalances resources among all running jobs. When the first (large) job is submitted and nothing else is running, it receives all cluster resources; when a second (small) job is submitted, the scheduler hands half of the resources to it, so the two jobs share the cluster equally. In summary:

a) the Fair Scheduler can share the resources of the whole cluster;
b) no resources are reserved in advance; every job shares the cluster;
c) the first job submitted occupies all resources; when another job is submitted, the first one releases part of its resources to it, and likewise for subsequent jobs, so every arriving job has a chance to obtain resources.

1.4. How the Fair Scheduler differs from the Capacity Scheduler

  • Fair sharing of resources: In each queue, Fair Scheduler can choose to allocate resources to the application according to FIFO, Fair, or DRF policies. The Fair policy is equal allocation. By default, each queue allocates resources in this way
  • Support for resource preemption: when a queue has spare resources, the scheduler shares them with other queues; when a new application is submitted to the queue, the scheduler reclaims resources for it. To minimize wasted computation, the scheduler adopts a wait-then-forcibly-reclaim strategy: if resources have not been returned after a waiting period, it preempts them, freeing resources by killing some tasks in queues that overuse resources
  • Load balancing: Fair Scheduler provides a load balancing mechanism based on the number of tasks, which distributes the tasks in the system as evenly as possible among the nodes. In addition, users can also design load balancing mechanisms according to their own requirements
  • Flexible scheduling policies: Fair Scheduler allows the administrator to set a scheduling policy for each queue (FIFO, Fair, or DRF are supported).
  • Improved small-application response time: thanks to the max-min fairness algorithm, small jobs can quickly acquire resources and complete

2. Configure the Yarn scheduler

The YARN resource scheduler is configured in yarn-site.xml.

2.1. FairScheduler

The Fair Scheduler is configured in two files: yarn-site.xml, which holds scheduler-level parameters, and a custom allocation file (fair-scheduler.xml by default), which configures the resource quantity and weight of each queue.

2.1.1. yarn-site.xml

The relevant yarn-site.xml settings are as follows:

<!-- scheduler start -->
<property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    <description>Configure the scheduler plug-in class used by Yarn. For the Fair Scheduler this is org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</description>
</property>
<property>
    <name>yarn.scheduler.fair.allocation.file</name>
    <value>/etc/hadoop/conf/fair-scheduler.xml</value>
    <description>XML file path for configuring resource pools and their attribute quotas (local path)</description>
</property>
<property>
    <name>yarn.scheduler.fair.preemption</name>
    <value>true</value>
    <description>Enable resource preemption. Default is false</description>
</property>
<property>
    <name>yarn.scheduler.fair.user-as-default-queue</name>
    <value>true</value>
    <description>If set to true and no resource pool is specified, the user name is used as the resource-pool name, enabling automatic assignment of resource pools by user name. Default is true</description>
</property>
<property>
    <name>yarn.scheduler.fair.allow-undeclared-pools</name>
    <value>false</value>
    <description>Whether to allow the creation of undefined resource pools. If true, YARN automatically creates any undefined resource pool specified by a task. If false, an undefined resource pool specified by a task is ignored and the task is assigned to the default resource pool. Default is true</description>
</property>
<!-- scheduler end -->

2.1.2. fair-scheduler.xml

In our production YARN environment, four types of users need the cluster: production, spark, default, and streaming. So that submitted tasks do not interfere with one another, we plan four resource pools on YARN, named production, spark, default, and streaming, and assign resources and priorities to each pool according to the actual services; default is used for development and testing. The fair-scheduler.xml on ResourceManager is configured as follows:

<allocations>
    <queue name="root">
        <aclSubmitApps></aclSubmitApps>
        <aclAdministerApps></aclAdministerApps>
        <queue name="production">
            <minResources>8192mb,8vcores</minResources>
            <maxResources>419840mb,125vcores</maxResources>
            <maxRunningApps>60</maxRunningApps>
            <schedulingMode>fair</schedulingMode>
            <weight>7.5</weight>
            <aclSubmitApps>*</aclSubmitApps>
            <aclAdministerApps>production</aclAdministerApps>
        </queue>
        <queue name="spark">
            <minResources>8192mb,8vcores</minResources>
            <maxResources>376480mb,110vcores</maxResources>
            <maxRunningApps>50</maxRunningApps>
            <schedulingMode>fair</schedulingMode>
            <weight>1</weight>
            <aclSubmitApps>*</aclSubmitApps>
            <aclAdministerApps>spark</aclAdministerApps>
        </queue>
        <queue name="default">
            <minResources>8192mb,8vcores</minResources>
            <maxResources>202400mb,20vcores</maxResources>
            <maxRunningApps>20</maxRunningApps>
            <schedulingMode>FIFO</schedulingMode>
            <weight>0.5</weight>
            <aclSubmitApps>*</aclSubmitApps>
            <aclAdministerApps>*</aclAdministerApps>
        </queue>
        <queue name="streaming">
            <minResources>8192mb,8vcores</minResources>
            <maxResources>69120mb,16vcores</maxResources>
            <maxRunningApps>20</maxRunningApps>
            <schedulingMode>fair</schedulingMode>
            <aclSubmitApps>*</aclSubmitApps>
            <weight>1</weight>
            <aclAdministerApps>streaming</aclAdministerApps>
        </queue>
    </queue>
    <user name="production">
        <!-- Per-user configuration: maximum number of apps the production user may run simultaneously -->
        <maxRunningApps>100</maxRunningApps>
    </user>
    <user name="default">
        <!-- Maximum number of apps the default user may run simultaneously -->
        <maxRunningApps>10</maxRunningApps>
    </user>

    <!-- Default maximum number of apps a user may run simultaneously (when not set per user) -->
    <userMaxAppsDefault>50</userMaxAppsDefault>

    <queuePlacementPolicy>
        <rule name="specified"/> 
        <rule name="primaryGroup" create="false" />
        <rule name="secondaryGroupExistingQueue" create="false" />
        <rule name="default" queue="default"/>
    </queuePlacementPolicy>
</allocations>

Parameter Description:

  • MinResources: the minimum guaranteed resources, in the format “X mb, Y vcores”. If a queue's minimum guaranteed resources are not met, it takes precedence over other queues of the same level in obtaining resources. The minimum guarantee means different things under different scheduling policies (more on this later): under the Fair policy only memory is considered, i.e. a queue whose memory usage exceeds its minimum is considered satisfied; under the DRF policy the dominant resource is considered, i.e. a queue whose dominant-resource usage exceeds its minimum is considered satisfied.
  • MaxResources: The maximum number of resources that can be used. Fair Scheduler ensures that the number of resources used by each queue does not exceed the maximum number of resources that can be used by the queue.
  • MaxRunningApps: Maximum number of applications running at the same time. By limiting this number, you can prevent the disk from bursting with intermediate output results generated when too many Map tasks are running simultaneously.
  • Weight: the weight of the resource pool. The larger the weight, the more resources the pool obtains. For example, when a pool has 20 GB of memory it does not currently need, it can share that memory with other pools, each pool's share being determined by its weight
  • AclSubmitApps: the list of Linux users or groups allowed to submit applications to the queue. The default is “*”, meaning any user may submit. Note that this property is inherited: a subqueue's list inherits its parent queue's list. When configuring it, users are separated by commas and the user list is separated from the group list by a space, e.g. “user1,user2 group1,group2”.

  • AclAdministerApps: the users and groups allowed to administer the queue. A queue's administrator can manage its resources and applications, for example killing any application.

  • MinSharePreemptionTimeout: the minimum-share preemption timeout. If a resource pool's usage stays below its minimum resources for this long, the scheduler begins preempting resources.
  • SchedulingMode/schedulingPolicy: the queue's scheduling policy; can be fifo, fair, or drf.

Administrators can also add the maxRunningJobs attribute to a single user to limit the maximum number of applications that can run simultaneously. In addition, the administrator can set the default values of the above properties by using the following parameters:

  • UserMaxJobsDefault: the default value of the user’s maxRunningJobs property.
  • DefaultMinSharePreemptionTimeout: the default value of each queue's minSharePreemptionTimeout attribute.
  • DefaultPoolSchedulingMode: the default value of each queue's schedulingMode attribute.
  • FairSharePreemptionTimeout: the fair-share preemption timeout. If a resource pool's usage stays below half of its fair share for this long, resource preemption starts.
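As a minimal sketch, these defaults map onto top-level elements of fair-scheduler.xml roughly as follows (element names as accepted by the Fair Scheduler allocation file, where userMaxAppsDefault and defaultQueueSchedulingPolicy are the spellings used by current Hadoop releases; values are illustrative, timeouts in seconds):

<allocations>
    <userMaxAppsDefault>50</userMaxAppsDefault>
    <defaultMinSharePreemptionTimeout>600</defaultMinSharePreemptionTimeout>
    <defaultFairSharePreemptionTimeout>600</defaultFairSharePreemptionTimeout>
    <defaultQueueSchedulingPolicy>fair</defaultQueueSchedulingPolicy>
</allocations>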

In this way, tasks submitted by users in each user group go to the corresponding resource pool without affecting other services. The queue hierarchy is expressed by nesting <queue> elements; all queues are children of the root queue, including applications that were not matched to any configured queue. Queues in the Fair Scheduler carry a weight attribute (this is the definition of “fair”), and the scheduler uses it as the basis for allocation. In this example, when the cluster is contended, the scheduler divides resources among production, spark, streaming, and default in the ratio 7.5 : 1 : 1 : 0.5, so with all four queues active production receives 7.5 / (7.5 + 1 + 1 + 0.5) = 75% of the cluster; note that weights are ratios, not percentages. Queues created automatically for users not covered by the configuration file still have a weight, with value 1. Each queue may also use a different scheduling policy internally. The default policy for queues can be configured with the top-level <defaultQueueSchedulingPolicy> element; if it is not set, fair scheduling is used. Despite its name, the Fair Scheduler still supports FIFO scheduling at the queue level, and each queue's policy can be overridden by its own <schedulingMode> (or <schedulingPolicy>) element. In the example above, the default queue is set to FIFO, so tasks submitted to it execute in first-in, first-out order, while scheduling among production, spark, streaming, and default remains fair. Each queue can also be configured with maximum and minimum resources and a maximum number of running applications.

The Fair scheduler uses a rule-based system to decide which queue an application is placed in. In the example above, the <queuePlacementPolicy> element defines a list of rules, each tried in order until one matches. For example, the first rule, specified, places the application in the queue it requested; if the application specifies no queue name or the named queue does not exist, the rule does not match and the next rule is tried. The primaryGroup rule tries to place the application in a queue named after the user's Unix primary group; with create="false", the queue is not created if it does not exist, and the next rule is tried instead. If no earlier rule matches, the default rule fires and the application is placed in the default queue. Of course, we can also leave queuePlacementPolicy unconfigured, in which case the scheduler defaults to the following rules:

<queuePlacementPolicy>
      <rule name="specified" />
      <rule name="user" />
</queuePlacementPolicy>

These rules mean: unless a queue is explicitly specified and exists, the queue named after the user is used, created if necessary. There is also a simple policy that places all applications in the same queue (default), so that applications share the cluster equally as a whole rather than per user. That configuration is defined as follows:

<queuePlacementPolicy>
     <rule name="default" />
</queuePlacementPolicy>

We can also set yarn.scheduler.fair.user-as-default-queue=false instead of using the configuration file; applications are then placed in the default queue rather than in per-user queues. In addition, we can set yarn.scheduler.fair.allow-undeclared-pools=false so that users cannot create queues on the fly.

When a job is submitted to an empty queue in a busy cluster, it does not execute immediately; it blocks until running jobs release system resources. To make the execution time of submitted jobs more predictable (by setting a timeout to wait), the Fair scheduler supports preemption. Preemption allows the scheduler to kill containers in queues that occupy more than their fair share, so those resources can be given to the queues entitled to them. Note that preemption reduces cluster execution efficiency, because the killed containers must be re-executed. Preemption is enabled with the global parameter yarn.scheduler.fair.preemption=true. In addition, two parameters control the preemption timeouts (neither is configured by default; at least one must be set to allow container preemption):

minSharePreemptionTimeout
fairSharePreemptionTimeout

If a queue does not receive its minimum resource guarantee within the time given by the minimum share preemption timeout, the scheduler preempts containers. A default for all queues can be set with the top-level <defaultMinSharePreemptionTimeout> element in the configuration file, and a per-queue value with a <minSharePreemptionTimeout> element inside the queue's <queue> element.

Similarly, if a queue does not obtain half of its fair share (a configurable ratio) within the time given by the fair share preemption timeout, the scheduler preempts containers. This timeout can be set for all queues with the top-level <defaultFairSharePreemptionTimeout> element and per queue with <fairSharePreemptionTimeout>. The ratio itself can be configured with <defaultFairSharePreemptionThreshold> (all queues) and <fairSharePreemptionThreshold> (a single queue); the default is 0.5.
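As a sketch of how these per-queue settings fit together inside <allocations> in fair-scheduler.xml (element names from the allocation-file format of recent Hadoop 2.x releases; values are illustrative, timeouts in seconds):

<queue name="production">
    <minSharePreemptionTimeout>120</minSharePreemptionTimeout>
    <fairSharePreemptionTimeout>300</fairSharePreemptionTimeout>
    <fairSharePreemptionThreshold>0.5</fairSharePreemptionThreshold>
</queue>
<defaultFairSharePreemptionThreshold>0.5</defaultFairSharePreemptionThreshold>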

Note that the user-to-group mapping used for resource-pool assignment is resolved on the ResourceManager, so the users (and their groups) that submit tasks from clients must also be maintained on the ResourceManager host. Otherwise the task is assigned to the default resource pool, and a warning like “UserGroupInformation: No groups available for user” appears in the log. The group membership of the user on the client machine is irrelevant.
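To check which groups the ResourceManager will resolve for a given user, one can run the following on the ResourceManager host (the user name alice is a placeholder):

id -Gn alice
hdfs groups alice

The first command shows the Linux groups as this host sees them; the second shows the groups as resolved through Hadoop's own group mapping.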

Each time a new user is added on ResourceManager or a resource-pool quota is adjusted, run the following commands for the change to take effect:

yarn rmadmin -refreshQueues
yarn rmadmin -refreshUserToGroupsMappings

Dynamic refresh supports only modifying resource-pool quotas; adding or removing resource pools requires restarting the YARN cluster.

The Fair Scheduler configuration and resource-pool usage can be viewed on the ResourceManager web UI: http://ResourceManagerHost:8088/cluster/scheduler

2.2. Capacity Scheduler Configuration (default configuration)

Hadoop 2.7 uses the Capacity Scheduler by default. It is configured in yarn-site.xml:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

The Capacity scheduler allows multiple organizations to share an entire cluster, with each organization given access to a portion of the cluster's computing power. By assigning a dedicated queue to each organization and then allocating cluster resources to each queue, the cluster can serve multiple organizations through multiple queues. Queues can in turn be subdivided so that multiple members of an organization share the queue's resources. Within a queue, resources are scheduled with a first-in, first-out (FIFO) policy.

A single job will not use more than its queue's capacity. If multiple jobs run in a queue and the queue's resources are enough for them, those resources are allocated to the jobs. But what if the queue's resources are not enough? The Capacity scheduler may still allocate spare resources from other queues to this queue; this is the concept of “queue elasticity”.

In normal operation, the Capacity scheduler does not force a queue to release Containers; when a queue lacks resources, it can only obtain those released by other queues. Of course, we can set a maximum resource usage for a queue so that it does not occupy so many idle resources that other queues cannot reclaim them. That is the tradeoff of “elastic queues”.

Suppose we have the following hierarchy of queues:

root
├── prod
└── dev
    ├── eng
    └── science

Here is a simple Capacity scheduler configuration file, capacity-scheduler.xml. It defines two sub-queues under the root queue, prod and dev, with 40% and 60% of the capacity respectively. Note that a queue is configured through properties of the form yarn.scheduler.capacity.<queue-path>.<property>, where <queue-path> is the queue's path in the inheritance tree, such as root.prod; the most commonly set properties are capacity and maximum-capacity.

<configuration>
	<property>
		<name>yarn.scheduler.capacity.root.queues</name>
		<value>prod,dev</value>
	</property>
	<property>
		<name>yarn.scheduler.capacity.root.dev.queues</name>
		<value>eng,science</value>
	</property>
	<property>
		<name>yarn.scheduler.capacity.root.prod.capacity</name>
		<value>40</value>
	</property>
	<property>
		<name>yarn.scheduler.capacity.root.dev.capacity</name>
		<value>60</value>
	</property>
	<property>
		<name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
		<value>75</value>
	</property>
	<property>
		<name>yarn.scheduler.capacity.root.dev.eng.capacity</name>
		<value>50</value>
	</property>
	<property>
		<name>yarn.scheduler.capacity.root.dev.science.capacity</name>
		<value>50</value>
	</property>
</configuration>

As you can see, the dev queue is split into eng and science sub-queues of equal size. dev's maximum-capacity is set to 75%, so even if the prod queue is completely idle, dev will not consume all cluster resources; in other words, 25% of the cluster remains available to prod for emergencies. Note that eng and science have no maximum-capacity set, so jobs in either queue may use all of the dev queue's resources (up to 75% of the cluster). Similarly, prod may occupy all cluster resources because its maximum-capacity is not set. Beyond configuring queues and their capacity, you can also configure the maximum resources a single user or application may use, the number of applications that may run concurrently, and the queue's ACLs.
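For illustration, a few of these additional per-queue settings might look like this in capacity-scheduler.xml (property names follow the yarn.scheduler.capacity.<queue-path>.* pattern; the values are assumptions, not recommendations):

<property>
	<name>yarn.scheduler.capacity.root.dev.user-limit-factor</name>
	<value>1</value>
</property>
<property>
	<name>yarn.scheduler.capacity.root.dev.maximum-applications</name>
	<value>100</value>
</property>
<property>
	<name>yarn.scheduler.capacity.root.dev.acl_submit_applications</name>
	<value>dev_user1,dev_user2 devgroup</value>
</property>

Here user-limit-factor controls how far a single user may go beyond the queue's configured capacity, maximum-applications caps the number of active and pending applications in the queue, and acl_submit_applications lists users, then a space, then groups, in the same format as the Fair Scheduler ACLs above.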

How queues are used depends on the application. For example, in MapReduce we can specify the queue with the mapreduce.job.queuename property. If the queue does not exist, we get an error at submission time. If we do not define any queues, all applications are placed in a default queue.
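For example, a MapReduce job could be directed at the eng queue like this (the example jar and the input/output paths are placeholders):

hadoop jar hadoop-mapreduce-examples.jar wordcount \
    -D mapreduce.job.queuename=eng \
    /user/alice/input /user/alice/output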

Note: for the Capacity scheduler, the queue name must be the last part of the queue path; full queue paths are not recognized. For example, with the configuration above, prod and eng are usable queue names, but root.dev.eng or dev.eng will not work.

2.3. FIFO Scheduler

yarn-site.xml:

<property>
	<name>yarn.resourcemanager.scheduler.class</name>
	<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler</value>
</property>
