Introduction | In the traditional Hadoop ecosystem, YARN manages and schedules computing resources, and resource usage follows an obvious daily cycle: the real-time computing cluster consumes resources mainly during the daytime, while data reporting jobs run on the offline computing cluster. The main problems of deploying these clusters separately from online business are low resource utilization and high cost. With business growth and unexpected demand for report calculation, and to avoid permanently reserving resources for the offline cluster, the Tencent Cloud EMR team and the container team jointly launched Hadoop Yarn on Kubernetes Pod, which improves container resource utilization, reduces resource cost, and increases the CPU utilization of idle container clusters several times over. This article focuses on the optimization and practice of the Hadoop resource scheduler YARN in a container environment.

First, Hadoop Yarn on Kubernetes POD hybrid deployment mode

Hadoop Yarn on Kubernetes POD provides two capabilities: elastic scaling and offline-online hybrid deployment. Elastic scaling focuses on how to use cloud-native resources to rapidly expand capacity and supplement computing power. The hybrid deployment mode aims to make full use of the idle resources of the online cluster and to reduce, as far as possible, the need to reserve idle resources for the offline cluster.

The EMR elastic scaling module (yarn-autoscaler) provides two kinds of scaling rules: by load and by time. For load-based scaling, users can set thresholds on YARN queue metrics such as Available vCore, Pending vCore, Available Mem, and Pending Mem to trigger scaling. Time-based rules can be specified by day, by week, or by month.
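The thresholds above refer to standard queue and cluster metrics exposed by the ResourceManager. As a minimal sketch of how to inspect them, the ResourceManager REST API can be polled; the host placeholder and the jq filtering below are illustrative assumptions, not part of the yarn-autoscaler configuration itself:

# Query cluster-level metrics from the ResourceManager REST API (default port 8088)
curl -s http://<resourcemanager-host>:8088/ws/v1/cluster/metrics \
  | jq '.clusterMetrics | {availableVirtualCores, availableMB, containersPending}'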

When a scaling rule is triggered, the hybrid deployment module obtains the specification and quantity of idle computing power currently available in the online TKE cluster and invokes the Kubernetes API to create the corresponding amount of POD resources. The Ex-Scheduler extension scheduler ensures that each POD is created on the node with the most remaining resources. These PODs are responsible for starting YARN's services.
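For reference, a NodeManager POD created by this module might look roughly like the sketch below. This is a hypothetical manifest for illustration only: the image name, the ex-scheduler scheduler name, and the resource sizes are assumptions rather than the exact objects the product creates.

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: yarn-nodemanager-0
  labels:
    app: yarn-nodemanager
spec:
  schedulerName: ex-scheduler              # extension scheduler that prefers nodes with more free resources
  containers:
  - name: nodemanager
    image: example.registry/hadoop-nodemanager:3.1.2   # hypothetical image that starts the NodeManager process
    resources:
      requests:
        cpu: "4"
        memory: 8Gi
      limits:
        cpu: "4"
        memory: 8Gi
EOF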

With this scheme, YARN's NodeManager service can be quickly deployed into PODs. However, YARN's native scheduling does not take such heterogeneous resources into account, which raises two problems:

1. The AM's POD is evicted, causing the Application to fail

When a node runs short of resources, kubelet actively evicts PODs in order to keep the node stable. If an AM is running on such a POD, the entire Application is considered failed and the ResourceManager reallocates the AM. For computation-heavy tasks, the cost of rerunning the Application is prohibitive.
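For context, kubelet eviction is driven by configurable thresholds; the flag below is standard kubelet usage shown purely for illustration, and the actual values used on TKE nodes are not specified in this article:

kubelet --eviction-hard='memory.available<500Mi,nodefs.available<10%' ...   # PODs are evicted once a node crosses these thresholds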

2. Resource-sharing limitations of YARN's native non-exclusive partitions

YARN's label partitioning feature supports both exclusive and non-exclusive partitions.

  • Exclusive partition: for example, if an Application specifies exclusive partition x, its containers will only be allocated within that partition.
  • Non-exclusive partition: for example, with non-exclusive partition x, the resources of partition x can be shared with the default partition.

Only when the default partition is specified can an Application running in the default partition use the resources of partition x.
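For reference, whether a partition is exclusive or non-exclusive is declared when its node label is created; the commands below are standard YARN CLI usage, with the label name x taken from the example above:

# create partition x as exclusive (only jobs that request x can use it) ...
yarn rmadmin -addToClusterNodeLabels "x(exclusive=true)"
# ... or as non-exclusive, so that its idle resources can be shared with the default partition
yarn rmadmin -addToClusterNodeLabels "x(exclusive=false)"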

In real usage scenarios, however, users allocate an exclusive partition to each business department, while the default partition is shared by all departments, and default partition resources are ample. Business departments want to use their own exclusive partition while also making full use of default partition resources; only when both the exclusive and default partitions run short should elastic scaling be triggered to add resources to their own exclusive partition.

Second, challenges to YARN transformation

The difficulty of developing the above features lies not only in the technology itself; it is also necessary to minimize the impact on the stability of users' existing clusters and to keep the transformation cost on the business side low.

  • Cluster stability: Hadoop YARN is an underlying scheduling component of the big data system; the more it is changed, the higher the risk of failure. At the same time, the introduced features inevitably require upgrading Hadoop YARN on existing clusters. The upgrade must be imperceptible to the existing business cluster and must not affect that day's business.
  • Business-side cost: the new features must conform to native YARN usage habits, so that business-side users can understand them easily and only minimal business-side code changes are needed.

1. Letting the AM choose its resource media independently

Community YARN currently does not take into account mixed deployment with heterogeneous cloud resources. In an online TKE cluster, containers are evicted when resources become tight. To avoid wasting resources on Application recomputation, the AM needs a way to specify whether it may be allocated onto POD resources.

For independent selection of resource media, the NodeManager reports via RPC a configured flag that indicates whether its resources may host an AM, and the ResourceManager uses this reported information to place the Application's AM onto stable resource media (a configuration sketch follows the list below). The benefits of having NodeManager report this information are obvious:

  • Decentralization: it reduces the ResourceManager's processing logic. Otherwise, resource information would have to flow into the ResourceManager via RPC or configuration whenever resources are scaled. Following the principle of not adding entities unless necessary, the ResourceManager transformation stays lightweight.
  • Cluster stability: after existing business clusters upgrade YARN, only the ResourceManager needs to be restarted, not the NodeManagers; YARN's high availability ensures the upgrade has no impact on the business. NodeManagers do not need to be restarted because, by default, a NodeManager treats its native resources as allocatable for AM.
  • Ease of use: through configuration, users can freely decide which resources are allowed to host an AM; this is not limited to POD container resources.
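As a rough sketch, the flag could be set on the NodeManagers that run inside PODs; the property name is the one used later in this article, while placing it in yarn-site.xml is an assumption about how it is loaded:

<property>
  <name>yarn.nodemanager.am-alloc-disabled</name>
  <value>true</value>
  <!-- this NodeManager's resources will not be used to host ApplicationMasters -->
</property>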

2. Multi-label dynamic allocation of resources

In YARN's native label design, the label expression of a submitted task can contain only one label. To use resources from more than one partition at the same time and improve utilization, the non-default partition must be set to non-exclusive. An improved label expression scheme therefore has to solve three problems:

  • Resource isolation: after a partition is set to non-exclusive, resources occupied by APPs from other partitions cannot be returned to the partition's own APPs in a timely manner.
  • Free sharing of resources: only the default partition is eligible to apply for non-exclusive partition resources.
  • Dynamic selection of partition resources: when multiple partitions share resources, the target partition cannot be chosen according to how many resources each partition has left, which hurts task execution efficiency.

By extending the expression syntax to support logical-operator expressions, the Tencent Cloud EMR team enables an APP to apply for resources from multiple partitions. A resource statistics module was also developed to dynamically count the available resources in each partition and allocate the most suitable partition to the APP.

Third, a hands-on walkthrough

Test environment: NodeManagers 172.17.48.28 and 172.17.48.17 are assigned to the default partition, and NodeManagers 172.17.48.29 and 172.17.48.26 to partition x.
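Assigning these NodeManagers to partition x can be done with the standard node-label command; the sketch below assumes the label x has already been added to the cluster and omits the optional NodeManager ports:

yarn rmadmin -replaceLabelsOnNode "172.17.48.29=x 172.17.48.26=x"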

Queue Settings:

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>a,b</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.accessible-node-labels.x.capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.accessible-node-labels.y.capacity</name>
    <value>100</value>
  </property>
  <!-- configuration of queue-a -->
  <property>
    <name>yarn.scheduler.capacity.root.a.accessible-node-labels</name>
    <value>x</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a.accessible-node-labels.x.capacity</name>
    <value>100</value>
  </property>
  <!-- configuration of queue-b -->
  <property>
    <name>yarn.scheduler.capacity.root.b.accessible-node-labels</name>
    <value>y</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.b.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.b.accessible-node-labels.y.capacity</name>
    <value>100</value>
  </property>
</configuration>
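After editing capacity-scheduler.xml with the settings above, the queue configuration can be reloaded without restarting the ResourceManager (standard YARN administration; the configuration file lives under the cluster's Hadoop configuration directory):

yarn rmadmin -refreshQueues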

1. Restrict AM allocation to node 172.17.48.28 only

Set the following configuration item on the NodeManagers of the other three nodes:

yarn.nodemanager.am-alloc-disabled = true

After configuration, the AM of the submitted Application can only be started on node 172.17.48.28.
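One way to verify the placement is to check the application report, whose AM Host field should show 172.17.48.28; the application id below is a placeholder:

yarn application -status application_1600000000000_0001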

2. Use composite labels

The label expression is specified through mapreduce.job.node-label-expression; "x||" means that the x and default partitions are used at the same time.

hadoop jar /usr/local/service/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar pi -D mapreduce.job.queuename="a" -D mapreduce.job.node-label-expression="x||" 10 10

After submitting the job with this command, observe that the Application's containers are allocated in both the x and default partitions.

Fourth, best practices for Hadoop Yarn on Kubernetes POD

The customer's big data applications and storage run in a big data cluster managed by YARN, and this setup faces a number of problems in production, mainly insufficient big data computing power and wasted resources during online business troughs. For example, when offline computing power is insufficient, data timeliness cannot be guaranteed; in particular, when an urgent ad-hoc big data query arrives and no computing resources are available, existing computing tasks must either be stopped or be allowed to finish first. Either way, the overall efficiency of task execution drops sharply.

Based on the Hadoop Yarn on Kubernetes POD scheme, offline tasks are automatically scaled out to the cloud and co-located with the TKE online business cluster. This makes full use of idle cloud resources during business troughs to increase offline computing power, and leverages the cloud's ability to rapidly expand resources to supplement offline computing power in a timely manner.

After the customer's online TKE cluster resource usage was optimized with the Hadoop Yarn on Kubernetes POD solution, the cluster's idle-time CPU utilization increased by 500%.

Fifth, summary

This article has presented the optimization and practice of YARN-based cloud-native containerization, which greatly improves the stability and efficiency of running tasks, effectively raises cluster resource utilization, and saves hardware cost in a co-located cloud-native environment. In the future, we will explore more cloud-native big data scenarios to bring more practical value to enterprise customers.

About the author

Zhang He is a senior engineer at Tencent Cloud. He is currently responsible for the management and control modules of Tencent Cloud's big data product Elastic MapReduce and for the development of its important component Hive. He has contributed code to the Apache Hive and Apache Calcite open source projects and is a graduate of UESTC.