Introduction: This article explains, from the perspective of architecture and technical implementation, how TCT (Tencent Cloud Task) achieves accurate, real-time, stable, and efficient task scheduling, along with task sharding and orchestration. (Editor: Middleware Q sister)

Background

Causes and consequences

First, let’s consider a few business scenarios:

  • XX Credit Card Center must generate the monthly statement for every user between 1:00 a.m. and 3:00 a.m. on the 28th of each month.
  • XX Clothing needs to start sending birthday messages to members at 9:00 a.m. every day.
  • On the XX game platform, each newly registered user needs a scheduled task that settles that user's virtual-currency exchange commission at the end of the month.
  • XX company needs to periodically run Python scripts that clear invalid TMP files from a file service system.
  • XX Insurance Company needs to count the previous day's new policies at 2:00 a.m. every day, trigger the report generation task, and send a copy by email once it completes.

Batch processing of massive numbers of scheduled tasks, as in the scenarios above, has become common as enterprises evolve from monolithic architectures to microservice and cloud service architectures. Conventional scheduling frameworks based on Quartz cannot cope with the demands of such distributed scenarios: they cannot deliver accurate, real-time, stable, and efficient task scheduling, and they do not support task sharding, task orchestration, or compensation for failed runs. Enterprises therefore urgently need a one-stop distributed task scheduling solution that manages complicated, chaotic scheduled tasks in a uniform way, strengthens the service capability of the enterprise microservice platform, and supports the move to cloud services.

Existing open source solutions

Stones from other hills can be used to polish jade…

Earlier work has left us many excellent solutions, each with its own advantages and disadvantages. Common open source products include Quartz, XXL-JOB, ElasticJob, Antares, and SIA-TASK.

  • Quartz: the most widely used framework, implemented entirely in Java. Quartz offers very fine-grained control over a single task.
  • XXL-JOB: a lightweight distributed task scheduling platform whose core design goals are rapid development, ease of learning, light weight, and easy extension. XXL-JOB supports sharding, simple task dependencies, and sub-task dependencies, but not cross-platform execution.
  • ElasticJob: supports job sharding (with sharding consistency), but not task orchestration or cross-platform execution.
  • SIA-TASK: offers cross-platform support, task orchestration, high availability, non-intrusiveness, consistency, asynchronous parallel execution, dynamic scaling, real-time monitoring, and more.

Looking at the logical architecture and technical implementation of these open source solutions, their shortcomings are also easy to see:

  • Architecture: the scheduler's responsibilities are not clearly divided, and the system does not scale well. In large-scale virtualized environments and complex networks, simple remote calls are not sufficient.
  • Performance: as the number of tasks and high-frequency events grows, the ZooKeeper cluster becomes a performance bottleneck. Simple remote calls, task pulling, and similar schemes cannot meet high-volume, high-frequency business demands.
  • Functionality: there is no complete authentication design, so security cannot be guaranteed. Operations and maintenance (O&M) features such as task intervention and alarm monitoring are weak.

Introduction of TCT

To solve these problems, we explored the space in depth and designed TCT (Tencent Cloud Task), an enterprise-level distributed task scheduling system. TCT provides a one-stop distributed task scheduling solution that supports random and broadcast execution, task sharding, and task orchestration, together with a complete monitoring and alarm system. Drawing on users' real business scenarios and on past experience, we focused on a few core problems: task triggering, task scheduling, task access, and task execution.

These core elements place different requirements on the system, summarized below for reference:

Core element | Responsibility | System characteristic
Task triggering | Analyzes and computes each task's trigger points and generates trigger events | CPU-intensive
Task scheduling | Applies the task distribution rules and manages the task run lifecycle | I/O-intensive
Task access | Handles the access and collection of task events and manages the access channel for task execution information | Network I/O-bound
Task execution | The task execution unit, which runs the actual business logic | Depends on the business scenario

Technical architecture

Architecture always evolves with requirements…

Here we explain the functional modules in the architecture diagram:

Function module | Description
Console (Admin) | The user console; provides task management and intervention interfaces and configures task O&M metrics
Trigger | Parses tasks and generates trigger events
Scheduler | Assigns tasks and manages the task run lifecycle
Access Gateway (AGW) | Authentication, callback management, and transparent transmission of task information
SDK / Agent | The task execution unit; receives tasks and executes the task logic

Functional architecture

Advantage 1: Modular microservice architecture design, clear responsibilities

The trigger

  • Based on each task's execution rules, the trigger computes the trigger events due at each point in time. Events are delivered reliably through MQ (a later article will explain how reliable delivery is achieved), which smooths load peaks into the valleys, avoids I/O spikes, and improves throughput.
  • A sensible sharding strategy and disaster recovery strategy replace the traditional approach, in which multiple nodes compete for a lock and poll storage to load and parse tasks, and so reduce the pressure on storage.
  • Hot and cold data are loaded separately, which further reduces storage pressure and system overhead. For tasks with a high-frequency execution policy, a preloading policy with a dynamically adjusted preloading algorithm addresses the high load caused by high-frequency triggering.
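
The sharding idea in the second bullet can be sketched roughly as follows. This is only an illustration under stated assumptions, not TCT's actual strategy: the hash-modulo rule, the function names owner_of and tasks_for, and the node names are all hypothetical. The point is that every trigger node can compute the same assignment on its own, so nodes do not need to compete for a lock before loading tasks.

```python
import hashlib

def owner_of(task_id: str, nodes: list[str]) -> str:
    """Map a task to exactly one trigger node (hypothetical hash-modulo rule)."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    # Sort so that every node derives the same order regardless of input order.
    return sorted(nodes)[index]

def tasks_for(node: str, task_ids: list[str], nodes: list[str]) -> list[str]:
    """Return the tasks this node should load and parse.

    No lock is needed: each node evaluates the same pure function.
    """
    return [t for t in task_ids if owner_of(t, nodes) == node]

# Assumed sample data for the sketch.
nodes = ["trigger-a", "trigger-b", "trigger-c"]
task_ids = [f"task-{i}" for i in range(12)]
assignment = {n: tasks_for(n, task_ids, nodes) for n in nodes}
```

If a node drops out, recomputing the assignment with the remaining node list hands its tasks to the survivors. A plain modulo rule moves many tasks when the node count changes, so a production system would likely use something more stable.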

The scheduler

  • The scheduler contains the most complex control logic in the whole task scheduling system and is an I/O-intensive component.
  • It subscribes to MQ message events, which decouples it from the trigger and effectively improves system throughput. It focuses on the control logic of task scheduling, such as task execution scheduling, load balancing, fault tolerance, rate limiting, and billing.
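
The decoupling between trigger and scheduler can be pictured with a small producer/consumer sketch. It assumes an in-process queue.Queue as a stand-in for the real MQ, and the function names and event shape are hypothetical; it shows only that the trigger publishes events without waiting for dispatch, while the scheduler consumes them at its own pace.

```python
import queue
import threading

def trigger(events: queue.Queue, task_ids: list[str]) -> None:
    """Producer: publish one trigger event per task, then a stop marker."""
    for task_id in task_ids:
        events.put({"task_id": task_id})
    events.put(None)  # sentinel: no more events

def scheduler(events: queue.Queue, dispatched: list[str]) -> None:
    """Consumer: take events from the queue and dispatch them."""
    while True:
        event = events.get()
        if event is None:
            break
        dispatched.append(event["task_id"])  # stand-in for real dispatch logic

# A bounded queue absorbs bursts from the trigger (assumed size).
events: queue.Queue = queue.Queue(maxsize=100)
dispatched: list[str] = []

consumer = threading.Thread(target=scheduler, args=(events, dispatched))
consumer.start()
trigger(events, ["t1", "t2", "t3"])
consumer.join()
```

A real message queue adds persistence and delivery guarantees that an in-memory queue does not have; the sketch covers only the decoupling.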

Access gateway

  • It independently implements client authentication and authorization and provides effective permission verification policies.
  • It manages callbacks on the upstream and downstream channels, decoupled from complex service logic.
  • It automatically detects client nodes and service nodes, which makes session management effective.
  • It passes data through transparently and routes it, keeping this work in a closed loop inside the component.
  • Combined with the SDK/Agent design, it avoids both the connection-count bottleneck of a single node and the burst of concurrent TCP connection setups when service nodes start cold.

Advantage 2: Stateless design, simple horizontal expansion

The trigger

With an effective sharding strategy, the trigger avoids concentrating trigger pressure on any one node, scales out in an approximately stateless way, and can quickly complete elastic scale-out and scale-in.

The scheduler

The design is completely stateless: there is no need to consider the task back-to-source problem, so the scheduler scales out horizontally without any state to move.

Access gateway

The design is completely stateless and scales out horizontally, so in theory there is no upper limit on the number of TCP connections.

Advantage 3: Complete functionality

Flexible trigger rules

  • Cron expressions, for example * 0/5 * * * ? and similar, are supported.
  • Fixed-frequency trigger rules are supported, for example an interval of 36 minutes.
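
The difference between the two rule types is easy to see with a 36-minute interval. The sketch below is an illustration with hypothetical function names and an assumed start time, not TCT code: a fixed interval keeps every gap at 36 minutes, whereas a cron minute field with a step of 36 restarts at minute 0 of each hour.

```python
from datetime import datetime, timedelta

def interval_fire_times(start: datetime, every: timedelta, count: int) -> list[datetime]:
    """Fire times spaced exactly `every` apart, starting at `start`."""
    return [start + i * every for i in range(count)]

def cron_minute_step_fire_times(start: datetime, step: int, count: int) -> list[datetime]:
    """Fire times of a cron minute field `0/step`.

    Within each hour the field matches minutes 0, step, 2*step, ...
    and it starts again from minute 0 when the hour rolls over.
    """
    times = []
    current = start.replace(second=0, microsecond=0)
    while len(times) < count:
        if current.minute % step == 0:
            times.append(current)
        current += timedelta(minutes=1)
    return times

start = datetime(2024, 1, 1, 9, 0)  # assumed start time for the example
fixed = interval_fire_times(start, timedelta(minutes=36), 4)
cron_like = cron_minute_step_fire_times(start, 36, 4)
```

With the assumed start of 09:00, the fixed interval fires at 09:00, 09:36, 10:12, and 10:48, while the cron-style step fires at 09:00, 09:36, 10:00, and 10:36.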

Convenient management capability: Provides various management and control capabilities such as pause, resume, stop, and retry.

Three execution modes are supported

  • Random node execution: one available execution node in the cluster is selected to run the scheduled task. Application scenario: scheduled reconciliation.
  • Broadcast execution: the scheduled task is dispatched to, and executed on, every execution node in the cluster. Application scenario: batch O&M.
  • Sharded execution: the task is split according to user-defined sharding logic, and the shards are distributed to different nodes in the cluster for parallel execution, improving resource utilization. Application scenario: statistics over massive logs.
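
The three execution modes differ only in how work is mapped onto nodes, which a short sketch can make concrete. The dispatch function, the mode names, and the round-robin shard placement below are assumptions made for illustration; the source does not say how TCT places shards.

```python
import random
from typing import Optional

def dispatch(mode: str, nodes: list[str],
             shards: Optional[list[str]] = None) -> dict[str, list[str]]:
    """Return {node: [work items]} for one scheduling decision."""
    if not nodes:
        raise ValueError("no available nodes")
    if mode == "random":
        # Exactly one available node runs the task.
        return {random.choice(nodes): ["task"]}
    if mode == "broadcast":
        # Every node runs the same task.
        return {node: ["task"] for node in nodes}
    if mode == "shard":
        # User-defined shards are spread over the nodes (assumed round-robin).
        plan: dict[str, list[str]] = {node: [] for node in nodes}
        for i, shard in enumerate(shards or []):
            plan[nodes[i % len(nodes)]].append(shard)
        return plan
    raise ValueError(f"unknown mode: {mode}")
```

For example, dispatch("shard", ["n1", "n2"], ["s0", "s1", "s2"]) gives n1 the shards s0 and s2 and gives n2 the shard s1, so the three shards run in parallel on two nodes.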

Three trigger modes are supported

  • Manual triggering: the user selects a specific task from the task management list and runs it manually. The scheduler immediately distributes the task and generates an execution batch. Application scenario: making up for a periodic run.
  • Periodic triggering: the execution time is set by configuring the interval between triggers. Periods that cron expressions cannot express are supported. Application scenario: scheduled backup.
  • Workflow triggering: a workflow is a set of tasks whose upstream and downstream logical dependencies are orchestrated to trigger the tasks. Application scenarios: massive data processing, such as orchestrating data collection, data filtering, data cleaning, and data aggregation.

Log tracing capability

The log service lets users query task execution logs. It records the execution batch details of every task; users can stop a batch that is currently executing and re-run a batch that has been terminated. Click a batch ID to open that batch's execution details, click a task ID to open the task's execution batch list, and click a deployment group to open the resource details list.

Complex task orchestration

Task workflows for a wide range of scenarios can be built. Complex orchestration logic is expressed by defining the upstream and downstream dependencies between scheduled tasks. This applies to big data processing pipelines, task execution work orders, batch O&M process orchestration, and similar scenarios.
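
Upstream and downstream dependencies of this kind form a directed acyclic graph, and a topological sort yields a valid run order. The sketch below uses Python's standard graphlib module; the workflow itself (with filtering and cleaning both depending on collection) and the function name execution_waves are hypothetical and chosen only to show tasks that may run in parallel.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of upstream tasks it depends on (assumed example).
workflow = {
    "collect": set(),
    "filter": {"collect"},
    "clean": {"collect"},
    "aggregate": {"filter", "clean"},
}

def execution_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves; tasks in the same wave have no unmet
    dependencies and could be triggered in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies loop
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock downstream tasks
    return waves

waves = execution_waves(workflow)
```

For this sample workflow the waves are collect first, then clean and filter together, then aggregate. A real scheduler would mark a task done only after its execution batch succeeds, rather than immediately as the sketch does.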

Conclusion

Building a platform system involves many challenges, from product functionality to technical architecture; a mature product only takes shape through layers of abstraction and gradual optimization. In the era of big data, with massive data volumes and user bases, any architecture design faces many problems, such as network response, fault tolerance, idempotence, and data reliability and consistency.

For the platform, task reliability is the first priority, followed by the timeliness of task execution. That means separating functions into sensible modules, designing a scaling scheme for each scenario, raising overall system throughput while still meeting the SLA, providing reliable and effective access, and handling high-frequency, high-volume business scenarios.

For users, what matters is a variety of management methods, multi-dimensional queries over operational metrics, and comprehensive link monitoring. Only when users are freed from complex, chaotic scheduled-task scenarios can they focus on developing their services.

