Deploy Apache Flink Natively on YARN/Kubernetes

Author: Ren Chunde

As a next-generation big data computing engine, Apache Flink is developing rapidly, and its internal architecture is continually being optimized and refactored to support more runtime environments and larger computing scale. Flink Improvement Proposal 6 (FLIP-6) redesigned the resource scheduling architecture so that it is unified across cluster management systems (Standalone, YARN, Kubernetes, and so on). This article introduces the evolution of resource scheduling in Flink, its clearly layered architecture, and other design features; the implementation of the Per-Job and Session modes on YARN; and the detailed design, still under discussion, of cloud-native integration with Kubernetes (K8S).

The content of this article is as follows:

Apache Flink Standalone Cluster
Native fusion of Apache Flink and YARN
Apache Flink native fusion with K8S
Summary

Apache Flink Standalone Cluster

Figure 1 shows Flink's Standalone cluster deployment, a master-slave architecture. The master, the JobManager (JM for short), schedules the computing units (Tasks) of each Job; the TaskManagers (TM for short) report to the JobManager and execute Tasks in threads inside their own processes. This deployment mode has the following problems:

Isolation: When multiple jobs run in one cluster, tasks of different jobs may execute on the same TM. The resources (CPU/memory) used by their threads cannot be controlled, so the tasks affect each other; a single Task that drives the TM out of memory takes down every job on that TM. Likewise, multiple jobs are scheduled in the same JM, so a faulty job can affect the others.
Multi-tenant resource usage (quota) management: The total resources used by each user's jobs cannot be limited, and there is no coordination or management among tenants.
Cluster availability: Although the JM can be deployed with a standby for high availability, the JM and TM processes themselves are not supervised. The isolation problems above can therefore crash many processes and leave the whole cluster unavailable.
Cluster O&M: Operations such as version upgrades and capacity expansion require complex manual work.

To solve these problems, Flink needs to run on popular, mature resource scheduling systems such as YARN, Kubernetes, and Mesos. How can this be achieved?

Native fusion of Flink and YARN

Apache Flink Standalone Cluster on YARN

A simple and effective approach is to deploy a Flink Standalone cluster on YARN using features that YARN already supports, as shown in Figure 2 (Apache Flink Standalone Cluster on YARN).

Multiple jobs can each run in their own YARN Application (App). Each App is an independent Standalone cluster, and the isolation mechanisms that YARN supports, such as Cgroups, keep jobs from interfering with each other.
Apps of different users can run in different YARN scheduling queues, so queue quota management solves the multi-tenant problem.
YARN can restart App processes and reschedule them, which makes the Flink Standalone cluster highly available.
Upgrades and capacity expansion only require changing parameters and configuration files and distributing the Flink jars through the YARN distributed cache.

Although this solves the problems above, giving each Job (or small group of Jobs) its own Standalone cluster makes efficient resource utilization difficult, because:

The cluster size (number of TMs) is specified statically when the YARN App starts. Because Flink optimizes a job at compile time, its resource demand is hard to estimate before it runs, so a reasonable number of TMs is hard to choose: too many waste resources, and too few slow the Job down or even prevent it from running.
The resource size of each TM is also specified statically, and the actual need is just as hard to estimate. TMs of different sizes cannot be requested dynamically to match the resource requirements of different Tasks; all TMs have the same size. A whole number of Tasks therefore rarely fills a TM exactly, and the leftover resources are wasted.
Starting the App (1. Submit YARN App) and submitting the Flink Job (7. Submit Job) are two separate stages. This lowers the submission efficiency of each job and also reduces the resource turnover rate of the cluster.
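
The cost of uniform, statically sized TMs can be illustrated with a small calculation. The sketch below is not Flink code: the per-task memory figures and the TM size are made-up numbers, and first-fit packing is an assumption chosen only for simplicity.

```python
def first_fit(task_mem_mb, tm_mem_mb):
    """Pack tasks into fixed-size TMs with first-fit; return free memory per TM."""
    free = []  # remaining memory (MB) of each TM started so far
    for need in task_mem_mb:
        if need > tm_mem_mb:
            raise ValueError(f"a task needing {need} MB cannot fit a {tm_mem_mb} MB TM")
        for i, left in enumerate(free):
            if left >= need:
                free[i] = left - need
                break
        else:
            free.append(tm_mem_mb - need)  # no TM had room: start a new one
    return free


def wasted_fraction(task_mem_mb, tm_mem_mb):
    """Share of the allocated TM memory that no task uses."""
    free = first_fit(task_mem_mb, tm_mem_mb)
    if not free:
        return 0.0
    return sum(free) / (len(free) * tm_mem_mb)


# Hypothetical job: per-task memory needs in MB, every TM fixed at 4096 MB.
tasks = [3000, 3000, 1500, 1500, 500]
tm_count = len(first_fit(tasks, 4096))
waste = wasted_fraction(tasks, 4096)
print(tm_count, round(waste, 3))
```

With these illustrative numbers the job needs three TMs, and a little over a fifth of the allocated memory sits idle. Requesting containers sized to each task's actual requirement would avoid that leftover.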

The more Flink jobs run in a large YARN cluster, the more resources are wasted and the higher the cost. Standalone clusters on other resource scheduling systems have the same problems as on YARN. Therefore, the Alibaba real-time computing team, drawing on its production experience with YARN, took the lead in improving Flink's resource utilization model, and then worked with the community to design and implement a general architecture suitable for different resource scheduling systems.

FLIP-6 – Deployment and Process Model

FLIP-6 fully documents the refactoring of the deployment architecture; the new modules are shown in Figure 3. Much like the upgrade from MapReduce-1 to YARN + MapReduce-2, resource scheduling and the scheduling of a Job's logical computing units (Tasks) are split into two layers, so that the ResourceManager (RM) and the JobManager (JM) each play their own role. This reduces coupling with the underlying resource scheduling system (only a different pluggable ResourceManager has to be implemented), lowers the logical complexity and the difficulty of development and maintenance, and lets the JM request resources according to what its Tasks actually need, which solves the low resource utilization of Standalone on YARN/K8S. It also makes it easier to scale clusters and jobs.

Dispatcher: communicates with clients, receives Job submissions, and spawns JobManagers. Its life cycle can span multiple Jobs.
ResourceManager: connects to the underlying resource scheduling system to schedule resources (requesting and releasing them) and manages Containers and TaskManagers. Its life cycle can also span multiple Jobs.
JobManager: one instance per Job; schedules and executes the computation logic of that Job.
TaskManager: registers with the RM and reports its resource status, receives Tasks to execute from the JM, and reports Task status.
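
The division of labor among these components can be sketched as a toy model. This is an illustration of the layering only, not Flink's actual classes or RPC interfaces: the method names, the one-slot-per-container rule, and the synchronous `run` are all assumptions made for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Owns containers and slots; the only layer that talks to the cluster backend."""
    backend: str                 # e.g. "yarn" or "k8s"; just a label in this sketch
    free_slots: int = 0
    containers_started: int = 0

    def request_slot(self):
        # Reuse an idle slot if there is one; otherwise ask the backend for a
        # new container (assumed here to provide exactly one slot).
        if self.free_slots == 0:
            self.containers_started += 1
            self.free_slots += 1
        self.free_slots -= 1

    def release_slot(self):
        self.free_slots += 1


@dataclass
class JobManager:
    """One instance per job; schedules that job's tasks onto slots."""
    job_name: str
    parallelism: int
    rm: ResourceManager

    def run(self):
        for _ in range(self.parallelism):
            self.rm.request_slot()
        # ... the tasks would execute on TaskManagers here ...
        for _ in range(self.parallelism):
            self.rm.release_slot()
        return f"{self.job_name} finished"


@dataclass
class Dispatcher:
    """Receives job submissions and spawns one JobManager per job."""
    rm: ResourceManager
    finished: list = field(default_factory=list)

    def submit(self, job_name, parallelism):
        jm = JobManager(job_name, parallelism, self.rm)
        self.finished.append(jm.run())


rm = ResourceManager(backend="yarn")
dispatcher = Dispatcher(rm)
dispatcher.submit("job-a", parallelism=3)
dispatcher.submit("job-b", parallelism=2)
print(rm.containers_started, dispatcher.finished)
```

In this toy run the second job reuses slots freed by the first, so only three containers are ever started. Swapping the backend would only touch the ResourceManager, which is the decoupling that the layering is meant to provide.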

Native fusion of Apache Flink and YARN

Based on this architecture, Flink on YARN supports two deployment modes, Per-Job and Session (for usage, see the Flink on YARN user documentation).

Per-Job

In Per-Job mode, a Flink Job is bound to the life cycle of its YARN Application (App). When the YARN App is submitted, the files and jars of the Flink Job are distributed through the YARN distributed cache at the same time. The JM requests slots from the RM based on the actual resource requirements of the Tasks generated from the JobGraph, and the Flink RM dynamically requests and releases YARN containers accordingly. This seems to solve all the previous problems: YARN provides the isolation, and resources are used efficiently.

Session

Is Per-Job perfect? Not quite. When a YARN App is submitted, requesting resources and starting TMs takes a long time (on the order of seconds). In scenarios such as interactive analysis and short queries, the Job's computation itself is short, so App startup accounts for a large share of the total time and seriously hurts the end-to-end user experience; the fast Job submission of Standalone mode is lost. The FLIP-6 architecture handles this case easily. As shown in Figure 5, a Flink Session runs in a pre-started YARN App (the Master and multiple TMs are already running, similar to Standalone, and can run multiple Jobs), and Jobs are then submitted to it and executed. These Jobs can use the existing resources immediately. The Blink branch and the Master branch differ slightly in implementation (whether TMs are pre-started); the two will be merged later, and work will continue on resource elasticity for Sessions, that is, automatically scaling the number of TMs on demand, which Standalone cannot do.
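
The trade-off between the two modes comes down to simple arithmetic. The sketch below uses a deliberately crude latency model with made-up timings, and it assumes a Session whose TMs are already running.

```python
def end_to_end_seconds(run_s, startup_s, session):
    """Latency from job submission to result in a deliberately simple model.

    In Per-Job mode the YARN App and its TMs start after submission, so every
    job pays the startup cost. In Session mode (assumed here to have
    pre-started TMs) a job only pays its own run time.
    """
    return run_s if session else startup_s + run_s


# Hypothetical short query: 2 s of computation, 8 s to start the App and TMs.
per_job = end_to_end_seconds(run_s=2.0, startup_s=8.0, session=False)
in_session = end_to_end_seconds(run_s=2.0, startup_s=8.0, session=True)
startup_share = (per_job - in_session) / per_job
print(per_job, in_session, startup_share)
```

With these illustrative numbers, startup makes up most of the Per-Job latency, which is why short interactive queries benefit from a Session while long-running jobs barely notice the difference.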

Resource Profile

A Resource Profile describes the amount of CPU and memory that a single Operator uses. Based on these resource requests, the RM requests Containers from the underlying resource management system to run TMs. For details, see Task Slots and Resources.
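
As a rough illustration of how per-operator profiles could turn into a container request, consider the sketch below. It is not Flink's ResourceProfile class: the field names are hypothetical, and summing the profiles of operators that share a slot is an assumption made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceProfile:
    """CPU and memory needed by one operator (hypothetical field names)."""
    cpu_cores: float
    memory_mb: int

    def __add__(self, other):
        return ResourceProfile(self.cpu_cores + other.cpu_cores,
                               self.memory_mb + other.memory_mb)

    def fits_in(self, container):
        """True if a container of the given size can host this profile."""
        return (self.cpu_cores <= container.cpu_cores
                and self.memory_mb <= container.memory_mb)


def slot_request(operator_profiles):
    """Sum the profiles of operators assumed to run in the same slot."""
    total = ResourceProfile(0.0, 0)
    for profile in operator_profiles:
        total = total + profile
    return total


# Hypothetical operators of one pipeline: source, map, sink.
operators = [
    ResourceProfile(0.5, 256),
    ResourceProfile(1.0, 512),
    ResourceProfile(0.5, 256),
]
request = slot_request(operators)
print(request, request.fits_in(ResourceProfile(2.0, 2048)))
```

The RM would then ask the underlying system for a container at least as large as `request`, rather than for a TM of a fixed, pre-configured size.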

A native fusion of Flink and Kubernetes

In recent years, Kubernetes has developed rapidly and has become the native operating system of the cloud era. Can deploying Apache Flink, a next-generation big data computing engine, on Kubernetes open up new territory for big data computing?

Apache Flink Standalone Cluster on Kubernetes

Relying on K8S's own support for deploying Services, a Flink Standalone cluster can easily be deployed to a K8S cluster with simple K8S Deployment and Service resources or with a Flink Helm Chart. However, this has the same low resource utilization as Standalone on YARN, so “native fusion” is still needed.

Native fusion of Apache Flink and Kubernetes

The “native fusion” of Flink and K8S mainly consists of implementing, within the FLIP-6 architecture, a K8SResourceManager that speaks the Kubernetes resource scheduling protocol. The current implementation on the Blink branch has the architecture shown in the figure below; for usage, see the Flink on K8S user documentation. Work to merge it into the Master branch is in progress.
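
To give a feel for what such a ResourceManager has to produce, the sketch below turns one TM resource request into a Kubernetes Pod manifest. This is not how the Blink K8SResourceManager is implemented: the function, the labels, and the image name are hypothetical, and only the manifest fields themselves (apiVersion, kind, metadata, spec.containers, resources.requests/limits) are standard Kubernetes.

```python
def taskmanager_pod(name, cpu_cores, memory_mb, image="flink:latest"):
    """Build a Pod manifest (as a dict) for one TaskManager.

    The image name and labels are placeholders for illustration.
    """
    resources = {
        "cpu": f"{int(cpu_cores * 1000)}m",  # Kubernetes millicores
        "memory": f"{memory_mb}Mi",
    }
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "labels": {"app": "flink", "component": "taskmanager"},
        },
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "taskmanager",
                "image": image,
                # Equal requests and limits give the pod a fixed resource size.
                "resources": {
                    "requests": dict(resources),
                    "limits": dict(resources),
                },
            }],
        },
    }


pod = taskmanager_pod("flink-taskmanager-1", cpu_cores=2.0, memory_mb=1024)
print(pod["spec"]["containers"][0]["resources"]["requests"])
```

A real integration would submit such a manifest through the Kubernetes API and watch the pod's status so that the TM can register with the RM once it is running.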

Summary

Deployment management and resource scheduling are cornerstones of a big data processing system. Through the abstraction, layering, and refactoring of FLIP-6, Apache Flink has built a solid foundation: it can run “natively” on the major resource scheduling systems (YARN, Kubernetes, Mesos), support larger-scale and higher-concurrency computing, and use cluster resources efficiently, which provides a reliable basis for further development. Related features are still being optimized and improved. For example, the difficulty of configuring Resource Profiles intimidates some developers and seriously reduces Flink's usability. We are working on features such as automatic configuration and scaling (Auto Config/Scaling) of resources and parallelism to solve such problems. “Serverless” architectures are developing rapidly, and we expect the fusion of Flink and Kubernetes to become a powerful cloud-native computing engine (similar to FaaS), saving resources for users and delivering greater value.

For more information, please visit the Apache Flink Chinese community website