Buses are everywhere in our daily lives. Buses on different routes depart in an orderly manner according to their own schedules; when one arrives at a stop, passengers on the platform board and the bus moves on to the next stop. Departures are frequent during the morning rush hour and intervals grow long in the middle of the night. All of this is orchestrated by the bus dispatch center.

In a big data platform there are likewise many tasks that must run at certain intervals and in a certain order, and the scheduling engine is responsible for managing all of them. It must not only execute tasks on time, but also handle complex scenarios such as:

- If task A runs every 10 minutes but one run takes 11 minutes, should the next cycle still start on schedule?
- If task B must run only after task A completes, how is that dependency honored across cycles?
- When a new day begins and 100,000 tasks are submitted at the same time, in what order should they be executed?

There are many such problems. Without a robust and intelligent scheduling engine, a big data platform cannot run as smoothly as an orderly bus system.

There are many scheduling frameworks on the market, such as Quartz, Elastic-Job, and XXL-Job, but they only support timed delivery of tasks, like a bus with a fixed schedule that arrives at its stop on time yet struggles with the morning and evening rush hours. Such a single scheduling method falls far short of real, complex, ever-changing business scenarios. This is where DAGScheduleX, a distributed scheduling engine developed by DTStack that handles millions of task instances, comes into play. Beyond timed scheduling, it ships with rich built-in policies for different scenarios, such as resource limiting, fail-fast, dynamic priority adjustment, fast expiration, and upstream/downstream scheduling-state dependencies.

The data stack supports both basic scheduled scheduling and complex cross-cycle dependency strategies.

In the overall data stack architecture, DAGScheduleX is the link between the platform's applications and the underlying big data cluster, acting as the bridge between the two. Within the limits of cluster resources, DAGScheduleX coordinates the allocation of task resources and orchestrates task submission, execution, and periodic scheduling.

1. DAGScheduleX main workflow

2. Multi-cluster configuration and multi-tenant isolation

In actual data development we may have multiple environments, such as development and testing. To submit tasks to the corresponding cluster, we only need to configure the different cluster environments in the data stack console and bind them to different tenants; task submission then achieves cluster isolation by tenant.

1. The console can bind clusters of different types, e.g. production environment A on Hadoop, production environment B on LibrA.
2. Multiple tenants can be bound to one cluster.
3. When a task is submitted, tenantId identifies the target cluster.
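The routing idea in the steps above can be sketched as follows. This is a minimal illustration of tenant-to-cluster lookup; the class and method names are hypothetical, not the actual console schema.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative tenant -> cluster routing. The real console persists these
// bindings in its own metadata; this only shows the lookup-by-tenantId idea.
class ClusterRouter {
    private final Map<Long, String> tenantToCluster = new HashMap<>();

    // Bind a tenant to a cluster (many tenants may share one cluster).
    void bind(long tenantId, String clusterName) {
        tenantToCluster.put(tenantId, clusterName);
    }

    // On submit, tenantId alone decides which cluster receives the task.
    String resolve(long tenantId) {
        String cluster = tenantToCluster.get(tenantId);
        if (cluster == null) {
            throw new IllegalStateException("tenant not bound to any cluster: " + tenantId);
        }
        return cluster;
    }
}
```

Because the binding is resolved at submission time, moving a tenant to another environment is just a console re-bind, with no change to the task itself.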

3. Instance generation and submission

DAGScheduleX currently supports a variety of computing components, such as Flink, Spark, TensorFlow, Python, Shell, Hadoop MR, Kylin, Odps, RDBMS (multiple relational databases), and more. Upper-layer applications can submit tasks simply by specifying the corresponding plug-in type.

DAGScheduleX supports custom task types, and extending a new plug-in is easy: define the typeName of the plug-in and implement the interface methods defined in IClient. The interface methods are as follows:

- init (initialization)
- judgeSlots (resource check)
- submitJob (submit a task)
- getJobStatus (get task status)
- getJobLog (get task execution logs)
- cancelJob (cancel a task)
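As a rough sketch of this extension point, a custom plug-in might look like the following. The method names come from the list above, but the exact signatures and types are assumptions, not the real IClient API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the IClient extension point; the actual
// signatures in DAGScheduleX may differ.
interface IClient {
    void init(Map<String, String> conf) throws Exception; // load plug-in config
    boolean judgeSlots(String jobConfig);  // enough cluster resources to run?
    String submitJob(String jobConfig);    // submit, return an engine-side job id
    String getJobStatus(String jobId);     // poll job status
    String getJobLog(String jobId);        // fetch execution logs
    void cancelJob(String jobId);          // stop a running job
}

// A toy "shell"-style plug-in: every submitted job finishes immediately.
class DemoShellClient implements IClient {
    private final Map<String, String> status = new ConcurrentHashMap<>();

    public void init(Map<String, String> conf) { /* read plug-in config here */ }

    public boolean judgeSlots(String jobConfig) { return true; } // no limit in the demo

    public String submitJob(String jobConfig) {
        String jobId = "job-" + status.size();
        status.put(jobId, "FINISHED");
        return jobId;
    }

    public String getJobStatus(String jobId) { return status.getOrDefault(jobId, "NOT_FOUND"); }

    public String getJobLog(String jobId) { return "log for " + jobId; }

    public void cancelJob(String jobId) { status.put(jobId, "CANCELED"); }
}
```

The engine only ever talks to a plug-in through these six methods, which is what makes new computing components cheap to add.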

When a task is submitted to DAGScheduleX, the next day's job instances are generated one day in advance. On the day of execution they run at their scheduled times and report their results; data backfill and run-immediately executions are also supported. DAGScheduleX additionally supports cross-tenant upstream/downstream task dependencies, task self-dependency, task priority adjustment, console task-queue management, and task monitoring in the operations center.
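The "generate tomorrow's instances one day ahead" idea can be sketched for a simple interval-based schedule. Real tasks use cron expressions and dependency graphs; this illustrative class only expands one periodic task into concrete run times for a day.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

// Illustrative instance generation: expand a periodic task into the
// concrete job instances it should run on a given day.
class InstanceGenerator {
    // One instance every `intervalMinutes`, starting at midnight of `day`.
    static List<LocalDateTime> instancesFor(LocalDate day, int intervalMinutes) {
        List<LocalDateTime> runs = new ArrayList<>();
        LocalDateTime t = day.atStartOfDay();
        LocalDateTime end = day.plusDays(1).atStartOfDay();
        while (t.isBefore(end)) {
            runs.add(t);
            t = t.plusMinutes(intervalMinutes);
        }
        return runs;
    }
}
```

Generating instances ahead of time is what lets the engine resolve dependencies, priorities, and queueing before the actual trigger moment arrives.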

4. Task alarms

When the upstream/downstream dependency chain is long, the failure of one upstream job instance can cause data problems downstream. DAGScheduleX therefore supports monitoring alarms for a variety of scenarios:

- The task fails to execute
- The task does not finish within the specified period
- The task is not running or has been stopped

The console also supports custom alarm channels, configured as follows:

1. Introduce the DAGScheduleX alarm SDK
2. Implement the custom alarm logic in ICustomizeChannel
3. Package the application as a JAR and upload it to the console alarm channel
4. Configure the alarm scenarios
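A custom channel from the steps above might be sketched like this. The interface name ICustomizeChannel comes from the text, but its method signature here is an assumption; a real channel would typically call a webhook or mail gateway instead of printing.

```java
// Hypothetical sketch of a custom alarm channel; the actual method
// signature of ICustomizeChannel in the alarm SDK may differ.
interface ICustomizeChannel {
    void sendAlarm(String title, String message);
}

// Example channel that formats and prints the alarm.
class ConsoleAlarmChannel implements ICustomizeChannel {
    String lastAlarm; // kept only so the demo is easy to inspect

    public void sendAlarm(String title, String message) {
        lastAlarm = "[ALARM] " + title + ": " + message;
        System.out.println(lastAlarm);
    }
}
```

Once the JAR containing such an implementation is uploaded, the console can route any configured alarm scenario through it.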

5. Summary

DAGScheduleX is a distributed task scheduling engine covering instance generation, scheduling, submission, operations and maintenance, and alarming. The data stack's offline computing, streaming computing, algorithm development, and other suites all rely on this scheduling engine to execute tasks, making it a vital hub.

This article was published in: Stack research Club

The data stack is a cloud-native, one-stop data middle-platform PaaS, and we have an interesting open-source project on GitHub called FlinkX.

FlinkX is a Flink-based data synchronization tool. It can collect static data, such as MySQL and HDFS, as well as real-time changing data, such as MySQL binlog and Kafka. It is a data synchronization engine that unifies whole-domain, heterogeneous, and batch/streaming data. Come find us on GitHub and say hi!