The author | Draveness

It took the author about 2 months to write this article. The full text is about 2W words. It is suggested to read it after bookkeeping or by computer.

Scheduling is a very broad concept and the term is used in many fields. In computer science, scheduling is a method of assigning tasks to resources. Tasks can be virtual computing tasks, such as threads, processes, or data flows, that are scheduled to be executed on hardware resources, such as processors and cpus.

Figure 1 – Scheduling system design essentials

This paper will introduce the common scenarios of the scheduling system and some key issues in the design process. The design of the scheduler will ultimately come down to one problem — how to efficiently allocate and schedule resources to achieve our goals, which may include rational use of resources, minimize costs, and quickly match supply and demand.

Figure 2 – Text context and content

In addition to the introduction will be confronted with the common problems of scheduling system design, this paper also analyzes several common evolution of design, and implementation principle of the scheduler, including the operating system process scheduler, the language of the run-time scheduler and Kubernetes workload scheduler, help us to understand the core principle of the scheduler design.

Design principle

Scheduling system is actually a Scheduler, which can be seen in many systems. As we said above, schedulers exist not only in operating systems, but also in programming languages, container choreography and many business systems.

The core function of these scheduling modules is to allocate the limited resources to maximize the utilization of resources or reduce the tail delay of the system. The scheduling system is faced with the imbalance between the demand and supply of resources.

Figure 3 – Tasks and resources for the scheduler

In this section, we will introduce the key issues that need to be considered in the design of scheduling system from many aspects, including the demand research, scheduling principle and architecture design of scheduling system.

1. Demand research

Before starting to build the scheduling system, the primary work is to carry out detailed demand research and analysis, in this process, the following two things need to be completed:

  • Investigate the application scenarios of the scheduling system, and deeply study the characteristics of the tasks to be performed in the scenarios and the resources that can be used to perform the tasks;
  • The purpose of analyzing the scheduling system may be cost first, quality first, maximizing the utilization of resources, etc. The scheduling purpose is generally dynamic and will change with the change of demand.

Application scenarios

The scheduling system application scenario is the first problem we need to consider, which is crucial to the analysis of the application scenario. We need to have an in-depth understanding of the characteristics of the tasks to be executed and the resources that can be used to execute the tasks in the current scenario. We need to analyze the following characteristics of the task to be performed:

  • Whether the task has a deadline and must be completed by a certain point;
  • Whether tasks support preemption and what are the specific rules of preemption;
  • Whether the task contains pre-dependent conditions;
  • Whether a task can only run on a specified resource;
  • .

Resources used to perform tasks may also be unbalanced, with different resources processing tasks at different speeds.

The diversity of resource and task characteristics determines the design of scheduling system. Here we give a few simple examples to help readers understand the process of scheduling system requirement analysis.

Figure 4-Linux operating system

In the process scheduler of the operating system, the tasks to be scheduled are threads. These tasks are usually in the executing or not executing state (waiting or terminated). The CPU used to process these tasks is usually non-separable. A CPU can only perform one task at a time, which is a physical limitation. To summarize, the tasks and resources of the operating system scheduler have the following features:

  • Task — Thread. Simple state: it can only be in the executing or not executing state. Different priorities: The tasks to be executed may have different priorities. Ensure the fairness of different tasks.
  • Resources – CPU time. Resources are non-divisible: Only one task can be run at a time.

In the above scenario, the task to be performed is the basic unit of operating system scheduling, the thread, and the resource to be allocated is CPU time. The Go language scheduler faces almost the same scenario as the operating system scheduler, where the tasks are Goroutine and the resources that can be allocated are threads running on the CPU.

Figure 5 – Container choreography system Kubernetes

In addition to the operating system and programming language this relatively bottom scheduler, container and computing task scheduling is also very common today, Kubernetes as a container scheduling system will be responsible for the container in the cluster, it has a little understanding of the people know that the basic unit of Kubernetes scheduling is Pod, These pods are scheduled to execute on Node:

  • Task — Pod. Different priorities: A system Pod with a higher priority can preempt resources of a system Pod with a lower priority. Stateful: PODS can be classified as stateless and stateful. Stateful PODS rely on persistent storage volumes.

  • Resource — Node. Different resource types: Resources on different nodes have different types, including CPUS, Gpus, and memory. These resources can be split but belong to the current node. Unstable: The node may become unavailable due to a sudden reason, for example, the network connection is down or the disk is damaged.

Scheduling system is very common in life and work. In addition to the above two scenarios, other scenarios requiring scheduling system include CDN resource scheduling, order scheduling and offline task scheduling system. In different scenarios, we need to think deeply about the characteristics of tasks and resources that guide the design of the system.

Scheduling purpose

After an in-depth analysis of the scheduling scenario, we need to understand the purpose of scheduling. We can understand the scheduling purpose as Cost function in machine learning. To determine the scheduling purpose is to determine the definition of Cost function. The common scheduling purpose has been introduced in the book scheduling Theory, including the following contents:

  • Makesapan – the amount of time that a task completes scheduling from the first to the last;
  • Maximum Lateness – tasks with the longest deadlines;
  • Total weighted completion Time — The Total weight multiplied by the completion time;
  • .

These are the objectives of partial theoretical scheduling. The objectives of most business scheduling systems are to optimize the indicators closely related to business — cost and quality. How to strike a balance between cost and quality requires careful thinking and design. Due to space limitations and the complexity of business scenarios, this paper will not analyze how to balance cost and quality, which often need to combine business considerations and do not have enough similarity.

2. Scheduling principle

A scheduler with excellent performance is the premise to achieve a specific scheduling purpose. When we discuss scheduling scenarios and objectives, we often ignore the extra overhead of scheduling. However, indexes such as the delay and throughput during scheduler execution cannot be ignored when scheduling is heavy. This section looks at some important concepts related to the scheduler implementation that can help us achieve a high-performance scheduler:

  • Cooperative scheduling and preemptive scheduling;
  • Monotonic scheduler and multi-scheduler;
  • Task sharing and task stealing;

Collaborative versus preemptive

Cooperative and Preemptive scheduling are common multi-task operation strategies in operating systems. The definitions of these two scheduling methods are completely different:

  • Collaborative scheduling allows tasks to execute for any length of time until the task actively notifies the scheduler to relinquish resources.
  • Preemptive scheduling allows tasks to be suspended during execution by the scheduler, which then decides which task to run next.

Figure 6 – Cooperative versus preemptive scheduling

The execution time of the task and the additional overhead of task context switching determine which scheduling method results in better performance. As shown in the figure below, figure 7 shows a collaborative call the process of scheduling task scheduler, the scheduler once for a certain task allocation resources, it will wait for the task release resources actively, in figure 4 although task execution time is different, but they will be released after completion of the task execution resources, the entire process takes only 4 times context switching.

Figure 7 – Collaborative scheduling

Figure 8 shows the process of preemptive scheduling. Since the scheduler does not know the execution time of all tasks, it assigns a slice of time to each task. Due to the short execution time of task 1 and task 4, they completed the tasks when they were first scheduled. However, task 2 and task 3 take a long time to execute and exceed the upper limit allocated by the scheduler. Therefore, to ensure fairness, preemption will be triggered and other tasks in the waiting queue will obtain resources. During the entire scheduling process, a total of six context switches occurred.

Figure 8 – Preemptive scheduling

If some tasks take a long time to execute, collaborative task scheduling will starve other tasks to death. However, if the execution time of the tasks to be executed is short and nearly identical, the use of collaborative task scheduling can reduce the extra overhead caused by task interruption, resulting in better scheduling performance.

In most cases, the execution time of tasks is uncertain. In cooperative scheduling, if tasks do not proactively surrender resources, other tasks will wait and block. Therefore, the scheduling system generally takes preemptive task scheduling as the main task, and supports cooperative task scheduling.

Monotonic scheduler and multi-scheduler

Using a single scheduler or multiple scheduler also needs careful consideration when scheduling system is designed, multiple scheduler doesn’t necessarily mean that multiple processes, it is possible that multiple scheduling threads in a process, they can choose on multi-core parallel scheduling, concurrent scheduling on a single core, also can improve performance at the same time the use of parallel and concurrent.

Figure 9 – Monotonic scheduler schedules tasks and resources

However, for the scheduling system, the decision it makes will change the state of resources and the context of the system, and then affect the subsequent scheduling decisions, so the serial scheduling of monotonic scheduler is the only way to accurately schedule resources. A single scheduler uses different channels to collect the context needed for scheduling, and after receiving the scheduling request, it will make the optimal decision according to the task and resource situation.

With the continuous evolution of the scheduler, the performance and throughput of the monotony scheduler may be limited. We still need to introduce parallel or concurrent scheduling to solve the performance bottleneck. At this time, we need to divide the resources to be scheduled and let multiple schedulers be responsible for scheduling resources in different areas.

Figure 10 – Multiple scheduler and resource zones

Concurrent scheduling of multiple schedulers can greatly improve the overall performance of a scheduler, such as the Go language scheduler. When run by Go, multiple cpus are assigned to different processors for scheduling, so the performance of the scheduler can be improved through parallel scheduling.

The two scheduling methods introduced above are established on the premise of precise scheduling. Each scheduler in multiple schedulers will face irrelevant resources, so the scheduling is serial for resources in the same partition.

Figure 11 – Multi-scheduler coarse-grained scheduling

Scheduling multiple resources simultaneously with multiple schedulers is possible, but scheduling accuracy may be sacrificed — different schedulers may receive state updates at different times, leading to different decisions by different schedulers. Load balancing can be seen as multithreading and process scheduler, because the tasks and resources control information is limited, this kind of coarse-grained scheduling results are likely to be different load of the machine will have bigger difference, so whether it is a small cluster or mass cluster is likely to result in some instances the load is too high.

Job sharing and job stealing

This section continues with two scheduling paradigms for redistributing tasks between multiple schedulers — Work Sharing and Work Stealing. A single scheduler can handle all tasks and resources at the same time, so it does not suffer from the imbalance of tasks and resources of multiple schedulers. In most of the scheduling in the scene, the task execution time is uncertain, assuming that multiple scheduler respectively the same resource scheduling, as a result of task execution time uncertainty, multiple waiting for scheduling task queue in the scheduler will eventually happen differences – part queue contains a large number of tasks, while others queue does not include the task, then you need to introduce the task allocation strategy again.

Job sharing and job stealing are two very different redistribution strategies. In work sharing, when the scheduler creates a new task, it will assign part of the task to other schedulers. In job theft, when the scheduler’s resources are not fully utilized, it steals some tasks to be assigned from other schedulers, as shown in the following figure:

Figure 12 – Job steal scheduler

Both of these task redistribution strategies add additional overhead to the system. Compared to work sharing, work theft only triggers when the current scheduler’s resources are not fully utilized, so work theft introduces less additional overhead. Work theft is more common in production environments, with Linux operating systems and Go languages choosing work theft policies.

3. Architecture design

This section will analyze the architecture design of the scheduler from both internal and external perspectives. The former analyzes the relationship between multiple internal components and the process of making scheduling decisions. The latter analyzes how multiple schedulers should collaborate and whether there are other external services that can assist the scheduler in making more rational scheduling decisions.

Inside the scheduler

When the scheduler receives a task to be scheduled, it will make a reasonable scheduling decision based on the collected status and the Spec of the task to be scheduled. We can learn the internal logic of common scheduling systems from the following figure.

Figure 13 – The scheduler makes scheduling decisions

A common scheduler generally consists of two parts – a state module for collecting states and a decision module for making decisions.

  • State module

The status module collects as much information as possible from different sources to provide a rich context for scheduling, including resource attributes, utilization, and availability. Depending on the scenario, the context may need to be stored in persistent storage such as MySQL, or a copy may be cached in memory to reduce the overhead of the scheduler accessing the context.

  • Decision module

The decision module will make scheduling decisions according to the context and task specifications collected by the status module. It should be noted that the scheduling decisions made are only valid at the moment. At a certain point in the future, the change of the state may lead to the previous decisions not meeting the requirements of the task, for example: When we use the Kubernetes scheduler to schedule a workload to some nodes, these nodes may suddenly become unavailable due to network problems, and the workload on that node will not work properly, i.e. the scheduling decision fails.

When scheduling, the scheduler uses the following three steps to schedule appropriate resources for the task:

  1. The scheduling sequence of different tasks can be determined by priority, task creation time and other information.
  2. Select appropriate resources for the task through filtering and scoring;
  3. If no resource meets the condition, the sacrificial preempt object is selected.

Figure 14 – Scheduling framework

The figure above shows several steps performed by the common scheduler decision module, such as prioritizing, scoring idle resources, and determining the victims of resource preemption. The last of the above three steps is often optional, and some scheduling systems do not need to support preemptive scheduling.

Outside the scheduler

If we look at the scheduler as a whole, we get a completely different perspective on architectural design from the outside of the scheduler — how to leverage the external system to enhance the functionality of the scheduler. Here we will introduce the external design of two kinds of scheduler, namely multi-scheduler and Descheduler.

  • How the scheduler

The serial scheduling and parallel scheduling section has analyzed the design of multiple scheduler. We can partition the resources to be scheduled and let multiple scheduler threads or processes be responsible for the scheduling of resources in each area, so as to make full use of the parallel ability of multiple and CPU.

  • Strong-arm scheduler

A counterscheduler is an interesting concept. It can remove scheduling decisions that are no longer correct, reduce entropy in the system, and allow the scheduler to re-make decisions based on the current state.

Figure 15 – Scheduler and counter scheduler

The introduction of counterscheduler makes the whole scheduling system more robust. The scheduler is responsible for making the correct scheduling decision based on the current state, while the counter-scheduler removes the wrong scheduling decision based on the current state. Their functions seem to be opposite, but their purpose is to schedule more appropriate resources for the task.

Counterscheduler is not so widely used, the actual application scenarios are also relatively limited. The author first discovered this concept in the Descheduler project incubated by Kubernetes, but Kubernetes will only be used in certain scenarios because the counterscheduler removing scheduling relationships may affect running online services.

The operating system

The scheduler is an important component in an operating system. The operating system includes the process scheduler, network scheduler, and I/O scheduler. This section describes the process scheduler in an operating system.

Some readers may wonder why the smallest unit of operating system scheduling is not thread, and why process scheduling is used here. In Linux, the scheduler doesn’t schedule processes or threads. Instead, it schedules task_struct structures, which can represent both threads and processes. The scheduler treats both processes and threads as tasks. We will use the term process scheduler, but it is important to note that the Linux scheduler does not distinguish between threads and processes.

Linux incorporates process and thread scheduling by treating them as one in the same. A process can be viewed as a single thread, but a process can contain multiple threads that share some number of resources (code and/or data).

Next, this section looks at the types of scheduling systems in the operating system and the evolution of the Linux process scheduler.

1. Scheduling system type

The operating system classifies process schedulers into three different types: long-term, intermediate, and short-term. Each of these three different types of schedulers provides different capabilities, and we’ll look at each of them in turn in this section.

Long term scheduler

The long-term Scheduler, also known as a Job Scheduler, determines which tasks go to the Scheduler’s ready queue. When we try to execute a new program, the long-term scheduler authorizes or delays the execution of the program. The purpose of the long-term scheduler is to balance the number of tasks for I/O intensive or CPU intensive processes running at the same time:

  • If there are too many I/O intensive tasks, there are no tasks to be scheduled in the ready queue, and the short-term scheduler does not need to perform scheduling, and THE CPU resources will be idle.
  • If there are too many CPU-intensive tasks, there are no tasks to be scheduled in the I/O waiting queue, and the I/O device is idle.

The long-term scheduler balances both I/O intensive and CPU intensive tasks that are running at the same time, maximizing the I/O and CPU resources of the operating system.

Intermediate scheduler

The mid-term scheduler removes inactive, low-priority, page-error-prone, or memory-hogging processes from memory, freeing resources for other processes.

Figure 16 – Intermediate scheduler

When a running process is stuck in an I/O operation, the process consumes only computing resources, in which case the intermediate scheduler removes it from memory and waits for the I/O operation to complete before the process rejoins the ready queue and waits for the short-run scheduler to schedule it.

Short-term scheduler

The short-run scheduler, which is probably the most familiar scheduler, selects a process from the ready queue to execute. The process is selected using a specific scheduling algorithm, which takes into account the priority of the process, queuing time and other characteristics. Because of the limited execution time available to each process, short-term schedulers execute frequently.

2. Design and evolution

This section focuses on the Linux CPU scheduler, also known as the short-term scheduler. The Linux CPU scheduler was not designed to be as complex as it is today. For a long period of time (V0.01 ~ V2.4), The Linux process scheduler was responsible for dozens of simple functions. Let’s take a look at the history of the different versions of the scheduler:

  • Initial scheduler · V0.01 ~ V2.4. Implemented by dozens of lines of code, the function is very rudimentary; Process up to 64 tasks at a time;

  • Scheduler · V2.4 ~ V2.6. All tasks need to be traversed during scheduling. When there are many tasks to be executed, the interval between two executions of the same task is long, resulting in serious hunger problems.

  • Scheduler · V2.6.0 ~ V2.6.22. Time complexity by introducing runqueues and priority arrays; Use local run queues instead of global run queues to enhance scalability in symmetric multiprocessors; Introducing job stealing to ensure the balance of tasks in multiple run queues;

  • Fully fair scheduler · V2.6.23 ~ present. Red black tree and running time are introduced to ensure fairness of scheduling. Scheduling classes are introduced to implement different scheduling strategies for different task types.

The evolution from the original Scheduler to today’s complex Completely Fair Scheduler (CFS) is detailed here.

Initial scheduler

The original Linux process scheduler consisted of only two files, sched.h and sched.c. It’s hard to imagine that earlier versions of Linux used the schedule function, which was only a few dozen lines long, to schedule operating system processes:

void schedule(void) {
    int i,next,c;
    struct task_struct ** p;
    for(p = &LAST_TASK ; p > &FIRST_TASK ; --p) {
       ...
    }
    while (1) {
        c = -1;
        next = 0;
        i = NR_TASKS;
        p = &task[NR_TASKS];
        while (--i) {
            if(! *--p)continue;
            if ((*p)->state == TASK_RUNNING && (*p)->counter > c)
                c = (*p)->counter, next = i;
        }
        if (c) break;
        for(p = &LAST_TASK ; p > &FIRST_TASK ; --p)
            if (*p)
                (*p)->counter = ((*p)->counter >> 1) + (*p)->priority;
    }
    switch_to(next);
}
Copy the code

Both processes and threads are known as task_struct structures in Linux, and all schedulers are stored in an array of up to 64 processes that the scheduler can handle.

Figure 17 – Initial process scheduler

The above function wakes up the interruptible process that received the signal, and then finds the maximum executable process from the queue in reverse order. Counter is the number of time slices that the process can occupy. The function performs different logic depending on the value of the time slice:

  • If the maximum counter time slice is greater than 0, the assembly language implementation of switch_to is called to switch the process;
  • If the maximum counter time slice is equal to 0, meaning that all processes have 0 execution time, then all processes get a new time slice;

The Linux timer triggers the do_timer every 10ms to decrease the counter of the currently running process by one, and the current process will be re-scheduled when its counter reaches zero.

O (n) scheduler

The scheduler is a scheduler used by Linux from V2.4 to V2.6. Since the scheduler iterates through all tasks in the worst case, the time complexity of scheduling tasks is. The Linux scheduling algorithm splits CPU time into epochs, or slices of time that each task can use.

The source code for the scheduler can be found in the sched.h and sched.c files. The scheduler implementation is much more complex than the previous version of the scheduler. The scheduler iterates through all the tasks in the run queue in the Schedule function and calls the goodness function to calculate their weights separately to get the next running process:

asmlinkage void schedule(void){
    ...
still_running_back:
    list_for_each(tmp, &runqueue_head) {
        p = list_entry(tmp, struct task_struct, run_list);
        if (can_schedule(p, this_cpu)) {
            int weight = goodness(p, this_cpu, prev->active_mm);
            if(weight > c) c = weight, next = p; }}... }Copy the code

At the beginning of each period, the code above computes time slices for all tasks, and the scheduler is called a scheduler because it needs to be executed n times. By default, each task is allocated a time slice of about 200ms per cycle, however this scheduling and allocation is the scheduler’s biggest problem:

  • After the completion of each round of scheduling, there will be no tasks to be scheduled, and the scenarios that need to improve the interactive performance will be seriously affected. For example, dragging the mouse on the desktop will cause obvious lag.
  • Each search for the task with the highest weight needs to traverse all the tasks in the number group;
  • The average time slice size allocated by the scheduler is 210ms. When the program contains 100 processes, the interval between the same process being run twice is 21s, which seriously affects the availability of the operating system.

Because of this problem with the scheduler, the Linux kernel replaced the implementation with a new scheduler two releases later.

The O (1) scheduler

The scheduler has been used for four years in the Linux kernel from V2.6.0 to V2.6.22. It can complete process scheduling in constant time. You can view the source code of the scheduler in Sched. h and sched.c. As the number of lines of code increased from 2100 to 5000 due to increased implementation and functional complexity, the scheduler improved on the scheduler as follows:

  • The scheduler supports scheduling with time complexity;
  • The scheduler supports the scalability of Symmetric Multiprocessing (SMP).
  • The scheduler optimizes the affinity of symmetric multiprocessing.

The data structure

The scheduler implements time complexity by running two important data structures, runqueue and prio_array. Each runqueue holds two priority arrays, one for active and one for expired processes:

struct runqueue { ... prio_array_t *active, *expired, arrays[2]; . } struct prio_array { unsignedint nr_active; unsignedlong bitmap[BITMAP_SIZE]; struct list_head queue[MAX_PRIO]; };Copy the code

Nr_active in the priority array represents the number of active processes, while bitmap and LIST_head together form the data structure shown below:

Figure 18 – Priority array

The priority array’s bitmap contains a total of 140 bits, each of which indicates whether the process with the corresponding priority exists. The priority array in Figure 17 contains three processes of priority 2 and one process of priority 5. The flag bit for each priority corresponds to a linked list in the LIST_head array. The scheduler uses the above data structure for scheduling as follows:

  • Calls sched_find_first_bit to allocate CPU resources by priority;
  • Call Schedule to select process execution from the linked list header;
  • This function schedules processes of the same priority through schedule rotation. After the process to be executed is selected each time, the process is added to the end of the queue to ensure that processes of the same priority will be executed in sequence (round-robin).
  • The timer fires the scheduler_tick function every 1ms and moves the current process into an expired array if it runs out of time;
  • When no process is running in the active queue, schedule swaps the active and expired priority arrays.

In addition to these rules, which are the main rules that the scheduler operates by, the scheduler also needs to support preemption, CPU affinity, etc., but I won’t go into them here.

Local run queue

The global run queue is the main reason why schedulers are difficult to scale on symmetric multiprocessor architectures. In order to ensure the consistency of the runqueue, the scheduler needs to obtain the global lock of the runqueue during scheduling. With the increase of the number of processors, more lock competition will occur when multiple processors are scheduling, which seriously affects the scheduling performance. The scheduler solves this problem by introducing a local run queue where different cpus can get the run queue bound to the current CPU via this_Rq, reducing the granularity of locks and the possibility of collisions.

#define this_rq() (&__get_cpu_var(runqueues))
Copy the code

Figure 19 – Global and local runqueues

Since multiple processors no longer need to share a global runqueue, the scalability of the symmetric versus processor architecture is enhanced. When we add new processors, we only need to add new runqueues without introducing more lock conflicts.

Priority and time slices

The scheduler contains two different priority calculation methods, one is static task priority, the other is dynamic task priority. By default, tasks have a static task priority of 0, but we can change the task’s priority by calling nice; The scheduler rewards I/O intensive tasks and penalizes CPU intensive tasks. It achieves dynamic priority adjustment by changing the static priority of tasks, because processes interacting with the user are I/O intensive processes that improve their own priority due to the scheduler’s dynamic policy, thereby improving the user experience.

Completely fair scheduler

The Completely Fair Scheduler (CFS) is a Scheduler that was incorporated into the kernel in V2.6.23 and is the kernel’s default process Scheduler to maximize CPU utilization and interaction performance. CFS in Linux kernel version V2.6.23 consists of the following files:

  • include/linux/sched.h
  • kernel/sched_stats.h
  • kernel/sched.c
  • kernel/sched_fair.c
  • kernel/sched_idletask.c
  • kernel/sched_rt.c

As you can see from the CFS name, the scheduler provides complete fairness for different processes. Once some processes are treated unfairly, the scheduler runs them to maintain a fair run time for all processes. This way of ensuring fairness is similar to “adding water when there is more water, adding water when there is more water” :

  • The scheduler looks for the most unfairly treated process in the run queue and allocates computing resources to the process. The allocated computing resources are the difference between the running time of other resources and the minimum unit of time that the process can run.
  • After the process runs, it finds that there are other processes in the runqueue that are treated most unfairly, and the scheduler will run a new process.
  • .

The scheduler algorithm continuously calculates the running time of each process and schedules the most unfairly treated process in the queue in order to ensure that the running time difference of each process is not greater than the minimum running time unit.

The data structure

The term runqueue is still used, but CFS no longer uses queues internally to store processes. Cfs_rq is a new structure for managing runnable processes that uses red-black trees instead of lists:

struct cfs_rq {
    struct load_weight load;
    unsignedlong nr_running;
    s64 fair_clock;
    u64 exec_clock;
    s64 wait_runtime;
    u64 sleeper_bonus;
    unsignedlong wait_runtime_overruns, wait_runtime_underruns;
    struct rb_root tasks_timeline;
    struct rb_node *rb_leftmost;
    struct rb_node *rb_load_balance_curr;
    struct sched_entity *curr;
    struct rq *rq;
    struct list_head leaf_cfs_rq_list;
};
Copy the code

Red-black tree is a balanced binary search tree. The worst time complexity of the add, delete, change and search operation of red-black tree is, that is, the height of the tree. The left-most node rb_leftmost in the tree runs the shortest time and is also the next process to be run.

Note: In the latest version of THE CFS implementation, the kernel uses the virtual runtime vRuntime instead of the wait time, but the basic scheduling principles and sorting methods haven’t changed much.

Scheduling process

The scheduling process of CFS is completed by the schedule function, which can be divided into the following steps:

  • Disable the preemption function of the current CPU.
  • If no task exists in the current CPU’s run queue, call IDle_balance and execute it from another CPU’s run queue.
  • Call pick_next_task to select the highest priority task in the red-black tree;
  • Call context_switch to switch the running context, including register state and stack;
  • The preemption function of the current CPU is enabled.

The scheduling process of CFS is very similar to that of the scheduler, except that the current scheduler adds an optional work-stealing mechanism and changes the underlying data structure.

Scheduling classes

One of the more interesting concepts in CFS is the scheduling class, which determines the scheduling strategy of a process. Each scheduling class contains a set of functions responsible for scheduling, and the scheduling class is represented by the sched_class structure shown below:

struct sched_class {
    struct sched_class *next;
    void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
    void (*yield_task) (struct rq *rq, struct task_struct *p);
    void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
    struct task_struct * (*pick_next_task) (struct rq *rq);
    void (*put_prev_task) (struct rq *rq, struct task_struct *p);
    unsigned long (*load_balance) (struct rq *this_rq, int this_cpu,
            struct rq *busiest,
            unsigned long max_nr_move, unsigned long max_load_move,
            struct sched_domain *sd, enum cpu_idle_type idle,
            int *all_pinned, int *this_best_prio);
    void (*set_curr_task) (struct rq *rq);
    void (*task_tick) (struct rq *rq, struct task_struct *p);
    void (*task_new) (struct rq *rq, struct task_struct *p);
};
Copy the code

The scheduling class contains functions for initializing, enqueuing, and unqueuing tasks, and the design here is slightly similar to the object-oriented design. The kernel contains SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE, SCHED_FIFO, and SCHED_RR scheduling classes that implement functions in sched_Class to provide different scheduling behaviors.

3. Summary

This section introduces the design principles and evolution history of the operating system scheduler. It has been a long time since the CFS was incorporated in 2007, the current scheduler has become more complex, and the community is constantly improving the process scheduler.

We can see the change of mainstream system architecture from the evolution of Linux scheduler. In the beginning, a scheduler with dozens of lines of code could complete basic scheduling functions, but now tens of thousands of lines of code are used to complete complex scheduling, ensuring low latency and high throughput of the system.

Due to the lack of space, it is difficult to carry out an exhaustive analysis of the operating system scheduler. You can find the Linux source code used by the author here and analyze the different versions of the process scheduler yourself.

4. Read on

  • Understanding the Linux 2.6.8.1 CPU Scheduler

  • CFS Scheduler

  • Inside the Linux 2.6 Completely Fair Scheduler

  • The Linux desktop may soon be a lot faster

  • Modular Scheduler Core and Completely Fair Scheduler

  • The Linux Scheduler: A Decade of Wasted Cores

The Go

Go language is a programming language born in 2009, I believe many people are impressed by the Go language is simple syntax, can support high concurrency services. Syntactic simplicity is the top-level design philosophy of a programming language, and high concurrency support for a language relies on the run-time scheduler, which is what this section examines.

3. If you know something about Go, you will know that Communicating sequential processes (CSP) affect the concurrency model of Go, in which Goroutine and Channel represent entities and media for communication, respectively.

Figure 20 – Concurrency model of Go and Erlang

“Don’t communicate by sharing memory, we should communicate by sharing memory” is not just a design philosophy encouraged by Go. The older Erlang language actually follows the same design, but Erlang chose to use the Actor model. We will not introduce the difference and connection between CSP and Actor here, interested readers can find relevant resources in recommended reading and should reference.

1. Design and evolution

Today’s Go scheduler has excellent performance, but if we Go back to the V0.x version of the Go language scheduler, the original scheduler was very rudimentary and could not support high concurrency services. The whole scheduler went through several iterations of large versions to achieve today’s excellent performance.

  • Single thread scheduler · 0.x – source code. It contains just over 40 lines of code; Only single thread scheduling, consisting of G-M model;
  • Multithreaded scheduler · 1.0 – source code. Multithreaded scheduling was introduced; Global locking leads to serious competition.
  • Mission Steal Scheduler · 1.1 – Source code. The processor P is introduced to form the current G-M-P model. A scheduler based on job stealing is implemented on the basis of processor P. In some cases, Goroutine does not give up threads causing hunger problems; An excessively long stop-the-world (STW) program will not work.
  • Preemptive scheduler · 1.2 ~ present – source code. Realize true preemptive scheduling based on signal; When garbage collection scans the stack, preemption scheduling is triggered. Preemption time is not enough to cover all edge cases; The compiler inserts check instructions when the function is called to realize cooperative preemptive scheduling. GC and loop can cause the Goroutine to hold up resources for too long, causing the program to pause; Cooperative preemptive schedulers – 1.2 to 1.13; Preemptive scheduler – 1.14 ~ present;
  • Nonuniform storage access scheduler · Proposal. Partitioning the various resources in the runtime; Implementation is complex and not on the agenda to this day;

In addition to multithreading, task stealing, and preemptive schedulers, the Go language community currently has a proposal for a Non-Uniform Memory Access (NUMA) scheduler that may one day be implemented by the Go language. In this section, we take a look at the implementation of the different versions of the scheduler and the possible future scheduler proposals in turn.

Single-threaded scheduler

The Go language contains only G for Goroutine and M for thread in the 0.x scheduler, and there is only one thread globally. The source code for the single-threaded scheduler can be found in the Clean Up Scheduler commit. At this point, the Go scheduler was implemented in C, and the schedule function contained only 40 + lines of code:

static void scheduler(void) {
    G* gp;
    lock(&sched);
    if(gosave(&m->sched)){
        lock(&sched);
        gp = m->curg;
        switch(gp->status){
        case Grunnable:
        case Grunning:
            gp->status = Grunnable;
            gput(gp);
            break; . } notewakeup(&gp->stopped); } gp = nextgandunlock(); noteclear(&gp->stopped); gp->status = Grunning; m->curg = gp; g = gp; gogo(&gp->sched);
}
Copy the code

This function is executed as follows:

  • Get the global lock of the scheduler.
  • Call goSave to save stack registers and program counters;
  • Call NextgAndUnlock to get the Goroutine that the next thread M needs to run andunlock the scheduler;
  • Modify the Goroutine to execute on the global thread M;
  • Call the gogo function to run the latest Goroutine.

The only good thing about this single-threaded scheduler is that it can run, but from this commit we can see that there are two important data structures, G and M, that form the framework for the Go language scheduler.

Multithreaded scheduler

Version 1.0 of the Go language supported a multithreaded scheduler when it was officially released, and the Go language team went from unusable to usable at this stage compared to the previous version, which was completely unavailable. We can find version 1.0.1 of the scheduler in proc.c, and the multithreaded version of the scheduler function Schedule contains over 70 lines of code, the core logic of which we have retained here:

static void schedule(G *gp) {
    schedlock();
    if(gp ! = nil) { gp->m = nil; Uint32 V = runtime·xadd(&runtime·sched.atomic, -1<< McPuShift);if(atomic_mcpu (v) > maxgomaxprocs) runtime, throw ("negative mcpu in scheduler");
        switch(gp->status){
        case Grunning:
            gp->status = Grunnable;
            gput(gp);
            break;
        case. :}}else{... } gp = nextgandunlock(); gp->status = Grunning; m->curg = gp; gp->m = m; The runtime, gogo (& gp - >sched, 0);
}
Copy the code

The overall logic is not much different from that of a single-threaded scheduler, which introduces the GOMAXPROCS variable to help us control the maximum number of threads in our program so that we can have multiple active threads in our program at the same time.

The main problem of multithreaded Scheduler is lock competition during scheduling. Performance tests conducted on the Scheduler in Scalable Go Scheduler Design Doc found that 14% of the time was spent on runtime.futex. The current Scheduler implementation has the following problems to solve:

  • Globally unique scheduler and global lock, all scheduling states are centrally stored, resulting in lock competition;
  • Threads often need to pass runnable Goroutines to each other, introducing a lot of latency and extra overhead;
  • Each thread needs to handle the memory cache, resulting in a large memory footprint and affecting Data locality;
  • System calls frequently block and unblock running threads, adding additional overhead.

The global locking problem here is similar to that encountered in the early days of the Linux operating system scheduler, and the solution is much the same.

Mission steal scheduler

In 2012, Dmitry Vyukov, an engineer at Google, pointed out the problems of the existing multithreaded Scheduler in Scalable Go Scheduler Design Doc and proposed two ways to improve the multithreaded Scheduler:

  • The processor P is introduced into the current G-M model.
  • The scheduler based on job stealing is implemented on the basis of processor P.

The Go scheduler based on task stealing uses the same G-M-P model that is still used today. We can find the source code of the task stealing scheduler in the Runtime: Improved Scheduler submission, and the scheduler’s schedule function is now simpler:

static void schedule(void) {
    G *gp;
 top:
    if(the runtime gcwaiting) {gcstopm (); goto top; } gp = runqget(m->p);if(gp == nil) gp = findrunnable(); . execute(gp); }Copy the code
  • If the current runtime is waiting for garbage collection, call the gcStopm function;
  • Call runqget and Findrunnable to get the Goroutine to execute from the local or global runqueue.
  • The execute function is called to run the Goroutine on the current thread M.

When the current processor’s local run queue does not contain goroutines, calling findrunnable triggers a job steal that randomly fetches goroutines from other processors’ queues.

The processor P introduced in the runtime G-M-P model is the middle layer between thread M and Goroutine. We can see the relationship between P and M and G from its structure:

struct P {
    Lock;
    uint32  status;  // one of Pidle/Prunning/...
    P*  link;
    uint32  tick;   // incremented on every scheduler or system call
    M*  m;  // back-link to associated M (nil if idle)
    MCache* mcache;
    G** runq;
    int32   runqhead;
    int32   runqtail;
    int32   runqsize;
    G*  gfree;
    int32   gfreecnt;
};
Copy the code

Processor P holds a runqueue RUNq, an array of runnable Goroutines, which also holds a pointer to thread M in reverse. When scheduling, the scheduler selects the Goroutine of the queue header from the processor’s queue to execute on thread M. The following picture shows the relationship between thread M, processor P, and Goroutine in the Go language.

Figure 21 – G-M-P model

Based on work stealing multithreading scheduler will each thread binding to the CPU of independent and through different processor management respectively, through the work in different processors to steal redistribution of tasks, promoted the scheduler and the overall performance of the Go language program, today all Go language service of high-performance benefit from this change.

Preemptive scheduler

Changes to Go’s concurrency model have improved the scheduler’s performance, but in version 1.1 the scheduler still does not support preemptive scheduling, and programs have to rely on Goroutine to voluntarily relinquish CPU resources. The Go scheduler introduced collaborative based preemptive scheduling in version 1.2 to address the following issues:

  • A single Goroutine can run on a thread without switching to another Goroutine, causing starvation;
  • Garbage collection pauses the entire program (stop-the-world (STW), which can take several minutes without preemption, causing the entire program to fail to work.

However, the preemptive scheduling implemented in version 1.2 is based on collaboration. In a long period of time, the scheduler of Go language contains some edge cases that cannot be preempted. It was not until 1.14 that the true preemptive scheduling based on signal was realized to solve some problems.

Cooperative based preemptive scheduling

The scheduler implementation with the introduction of preemptive scheduling can be found in the proc.c file. Go implements preemptive scheduling on top of the current segmented stack mechanism, and all Goroutines have a chance to check at runtime for preemption at function call time. Collaborative preemption is implemented through multiple commits:

  • runtime: mark runtime.goexit as nosplit
  • Runtime: Add StackGuard0 to G. Introduce a stackGuard0 field for Goroutine. When this field is set to StackPreempt, Goroutine is preempted.
  • Runtime: introduce preemption function (not used for now). Preemption functions preemptone and preemptall were introduced, which set the StackPreempt of Goroutine. Introduce preempt request StackPreempt;
  • Runtime: Preempt Goroutines for GC. Call preemptall in runtime· StopTheWorld to set StackPreempt for Goroutine on all processors; Add preemption code in runtime· newStack function, when stackGuard0 equals StackPreempt trigger scheduler preemption;
  • Runtime: Preempt long-running goroutines In system monitoring, if a Goroutine takes more than 10ms to run, retake and preemptone are called;
  • Runtime: More reliable preemption. Fixed an issue where Goroutine could not be preempted for periodically executing non-blocking CGO or system calls.

From the above series of commits, we can see that the Go language runtime StackPreempt preempt requests when garbage collection pauses and system monitoring detects Goroutine running more than 10ms. Because the compiler inserts runtime.newstack into the function call, runtime.newstack checks whether the Goroutine’s StackGuard0 is StackPreempt and triggers preemption to free the current thread.

This approach does not incur much extra overhead at runtime and is relatively simple to implement, but it does add runtime complexity and is generally a successful implementation. Because the preemption is done by the compiler inserting a function at a specific time, or requires a function call as an entry point to trigger the preemption, this is a cooperative preemptive scheduling.

Signal – based preemptive scheduling

The collaborative preemptive scheduling implementation, while clever, leaves a lot of edge cases, and we can find some legacy issues in the Runtime: Non-cooperative Goroutine preemption:

  • runtime: tight loops should be preemptible #10958
  • An empty for{} will block large slice allocation in another goroutine, even with GOMAXPROCS > 1 ? # 17174
  • runtime: tight loop hangs process completely after some time #15442
  • .

Go implemented non-cooperative preemptive scheduling in version 1.14. In the process of implementation, we reconstructed the existing logic and added new states and fields to Goroutine to support preemption. The Go team implemented this feature with the following submission, which we can follow in the order of the submission:

  • Runtime: Add general suspendG/resumeG. Pendg (Runtime.suspendg) and Runtime.resumeg (Runtime.resumeg) are used to reconstruct the process of stack scanning. Running Goroutine preemptStop is marked true when the Runtime. suspendG function is called; Call runtime.preemptPark to suspend the current Goroutine, update its state to _Gpreempted, and trigger a rescheduling of the scheduler.
  • Runtime: asynchronous preemption function for x86. – Added asynchronous preemption functions Runtime. asyncPreempt and Runtime. asyncPreempt2 on x86 architecture;
  • Runtime: Use signals to preempt Gs for suspendG. A Goroutine that supports pausing by signaling the thread; Sighandler register SIGURG signal handler runtime.doSigPreempt in runtime.sighandler The runtime.preemptM function sends preemption requests to threads;
  • Runtime: Implement Async Scheduler Preemption Modify the implementation of runtime.preemptone to add asynchronous preemption logic.

The current preemptive scheduling is also triggered only during garbage collection scan tasks. We can comb through the process of triggering preemptive scheduling:

  • When the program starts, the SIGURG signal handler runtime.doSigPreempt is registered in the runtime.sighandler function.
  • The Runtime. suspendG function is called to suspend the Goroutine when a stack scan for garbage collection is triggered. Mark _Grunning Goroutine as preemptable, with preemptStop set to true; Call runtime.preemptM to trigger preemption;
  • Runtime. preemptM calls Runtime. signalM to signal SIGURG to the thread;
  • The operating system interrupts the running thread and executes a pre-registered signal handler called Runtime.dosigpreempt.
  • The runtime.doSigPreempt function handles the preemption signal by getting the current SP and PC registers and calling Runtime. sigctxt.pushCall;
  • Sigctxt. pushCall modifies the register and executes from Runtime. asyncPreempt when the program returns to user state.
  • The assembly directive Runtime. asyncPreempt calls the runtime function Runtime. asyncPreempt2;
  • Runtime. asyncPreempt2 calls runtime.preemptPark;
  • Runtime. preemptPark changes the state of the current Goroutine to _Gpreempted and calls Runtime. schedule to put the current function to sleep and release the thread. The scheduler selects another Goroutine to continue execution.

The above 9 steps show the execution of signal-based preemptive scheduling. We also need to discuss the signal selection in this process. SIGURG is selected for four reasons:

  • The signal needs to be passed through by the debugger;
  • The signal will not be used and intercepted by the internal LIBC library;
  • The signal can appear at will without triggering any consequences;
  • We need to deal with different signals on multiple platforms.

The current preemption scheduling does not solve all the potential problems, because STW and stack scanning are more likely to cause problems, and they are also safe-points for preemption. Therefore, we will add preemption function here first, and more preemption time points may be added in the future.

Non-uniform memory access scheduler

The Non-Uniform Memory Access (NUMA) scheduler is currently only a proposal for the Go language, because the proposal is too complex and the performance of the current scheduler is good enough that it is not implemented. The idea behind this proposal is to reduce lock contention and increase data locality by splitting global resources so that each processor can obtain local resources nearby.

In the current runtime, threads, processors, network trainers, run queues, global memory allocator state, memory allocation cache, and garbage collector are all global resources. There is no guarantee of localization at runtime, and the topology of the system is not clear. Partial structures can provide some locality, but there is no guarantee at the global level.

Figure 22-GO NUMA scheduler

As shown in the figure above, the stack, global run queue, and thread pool are partitioned according to NUMA nodes, and the network trainer and timer are held by separate processors. This approach can take advantage of locality to improve the performance of the scheduler, but the implementation itself is too complex, so the Go language team has yet to implement this proposal.

2. Summary

The Go language’s scheduler iterated rapidly in the first few releases, but the scheduler hasn’t changed much since version 1.2, until version 1.14 introduced true preemptive scheduling that solved problems that had existed since 1.2. In the foreseeable future, the Go language scheduler will evolve further, increasing the time point of preemptive scheduling to reduce the existence of edge cases.

This section selects the implementation principle of Go language scheduler from the book “Go Language Design and Implementation”. You can click the link to learn more about the design and implementation principle of Go language.

3. Read more

  • How Erlang does scheduling

  • Analysis of the Go runtime scheduler

  • Go’s work-stealing scheduler

  • runtime: clean up async preemption loose ends

  • The Go netpoller

  • System Calls Make the World Go Round

  • Linux Syscall Reference

Kubernetes

Kubernetes is a production-level container scheduling and management system. In the past period of time, Kubernetes has rapidly gained market share and become the implementation standard in the container choreography field.

Figure 23 – Container Choreography system evolution

Kubernetes, which means “helmsman” in Greek, was founded by several Software engineers at Google. It was heavily influenced by the Borg and Omega projects within the company. Many of its designs were borrowed from Borg, while also improving on its shortcomings. Kubernetes is currently a project of the Cloud Native Computing Foundation (CNCF), a solution for many companies to manage distributed systems.

The scheduler is the core component of Kubernetes. Its main function is to bind the running Node to the workload Pod to be run. Unlike other scheduling scenarios, resource utilization is also important in Kubernetes, but it is only one factor that Kubernetes focuses on. It needs to support many and complex business requirements in the container choreography scenario, besides considering whether CPU and memory are sufficient. Other domain-specific scenarios need to be considered, such as two services not occupying the same port on the same machine, several services running on the same machine, resource scheduling based on node type, and so on.

These complex business scenarios and scheduling requirements make the internal design of the Kubernetes scheduler completely different from other schedulers, but as a scheduler at the user application layer, we can learn many useful patterns and designs from it. Next, this section introduces the design and evolution of the scheduler in Kubernetes.

1. Design and evolution

The evolution process of Kubernetes scheduler is relatively simple. We can divide its evolution process into the following two stages:

  • Predicates and priority-based scheduler · v1.0.0 to v1.14.0
  • Scheduler based on scheduling framework · V1.15.0 ~ present

Kubernetes was released from v1.0.0 to V1.14.0, a total of 15 versions used predicates and priorities to manage different scheduling algorithms until V1.15.0 introduced the scheduling framework (Alpha feature) to refactor existing schedulers. Here we analyze the evolution of the scheduler with v1.14.0 predicates and priorities and v1.17.0 scheduling framework.

Predicate and priority algorithms

Predicates and Priorities schedulers have existed since the release of Kubernetes V1.0.0, and the final implementation of V1.14.0 is not much different from the original design. However, a number of improvements were introduced from V1.0.0 to V1.14.0:

  • V1.2.0 – Scheduler Extension Change the scheduler’s decision by calling an external scheduler extension;

  • · V1.5.0 – Mapreduce-like scheduler priority functions The scheduler priority algorithm supports map-Reduce calculation, and the parallel Map stage is introduced to optimize the performance of the scheduler.

  • · V1.10.0-Move Scheduler code out of plugin directory Move from Plugin/PKG /scheduler to PKG /scheduler; Kube-scheduler becomes a directly available executable;

Predicate and priority are both abstractions provided by Kubernetes in the scheduling system. The predicate algorithm uses the FitPredicate type, and the priority algorithm uses the PriorityMapFunction and PriorityReduceFunction:

type FitPredicate func(pod *v1.Pod, meta PredicateMetadata, nodeInfo *schedulernodeinfo.NodeInfo) (bool, []PredicateFailureReason, error)
type PriorityMapFunction func(pod *v1.Pod, meta interface{}, nodeInfo *schedulernodeinfo.NodeInfo) (schedulerapi.HostPriority, error)
type PriorityReduceFunction func(pod *v1.Pod, meta interface{}, nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo, result schedulerapi.HostPriorityList) error
Copy the code

Because V1.14.0 is also the first version of Kubernetes that the author just participated in, the design at that time is also very impressive. V1.14.0 Kubernetes scheduler will use PriorityMapFunction and PriorityReduceFunction to calculate the score of all nodes in a map-reduce manner and select the node with the highest score from among them. The following figure shows the scheduler execution in v1.14.0:

Figure 24 – Predicate and priority algorithms

As shown in the figure above, we assume that there is a predicate algorithm and a Map-reduce priority algorithm in the scheduler. When we select the most suitable one among the six nodes for a Pod, the six nodes will be filtered by the predicate first, and the predicate algorithm in the figure will filter out half of the nodes. The remaining 3 nodes are scored 5, 10 and 5 points respectively through Map and Reduce processes. Finally, the scheduler will select node 4 with the highest score.

GenericScheduler. The Schedule is for Pod Kubernetes choice node method, we omitted the method used to check that the boundary conditions and some code:

func (g *genericScheduler) Schedule(pod *v1.Pod, nodeLister algorithm.NodeLister) (result ScheduleResult, err error) {
    nodes, err := nodeLister.List()
    iferr ! = nil {return result, err
    }
    iflen(nodes) == 0 {
        return result, ErrNoNodesAvailable
    }
    filteredNodes, failedPredicateMap, err := g.findNodesThatFit(pod, nodes)
    iferr ! = nil {returnresult, err } ... priorityList, err := PrioritizeNodes(pod, g.nodeInfoSnapshot.NodeInfoMap, ... , g.prioritizers, filteredNodes, g.extenders)iferr ! = nil {return result, err
    }
    host, err := g.selectHost(priorityList)
    return ScheduleResult{
        SuggestedHost:  host,
        EvaluatedNodes: len(filteredNodes) + len(failedPredicateMap),
        FeasibleNodes:  len(filteredNodes),
    }, err
}
Copy the code
  • Get all nodes existing in the current system from NodeLister;
  • Call genericScheduler. FindNodesThatFit filtering algorithm, all predicates parallel execution algorithm nodes. Predicate algorithm will filter nodes according to the incoming Pod and Node, then will filter out the port number conflict, resource insufficient nodes. Call the Filter method of all scheduler extensions to assist filtering;
  • Call the PrioritizeNodes function to score all nodes. Execute PriorityMapFunction concurrently with Pod and Node at the same priority; Pod and the Node to score mapping returned by priority call the PriorityReduceFunction function for parameters; Call the Prioritize method of all scheduler extensions; Add all scores by weight and return the mapping from Node to score;
  • Call genericScheduler. SelectHost method selected node with the highest scores.

This is the scheduling process using predicates and priority algorithms, where we omit sorting in the scheduler’s priority queue, preemption in case of scheduling errors, and Pod persistence of volume binding to Node, preserving only the core scheduling logic.

Scheduling framework

The Kubernetes scheduling framework is the latest scheduler design proposed by Babak Salamat and Jonathan Basseri in 2018. This proposal clarifies the scheduling phases in Kubernetes and provides a well-designed plug-in based interface. According to the Scheduling framework, there are two cycles of Scheduling and Binding in Kubernetes:

  • The scheduling loop selects the most appropriate Node for the Pod among multiple nodes;
  • Binding loops apply scheduling decisions to clusters, including binding pods and nodes, binding persistent storage, and so on.

In addition to two major cycles, The scheduling framework also includes QueueSort, PreFilter, Filter, PostFilter, Score, Reserve, Permit, PreBind, Bind, PostBind and Unreserve Point), these extension points will be triggered during the scheduling process, and their running order is as follows:

Figure 25 – Kubernetes scheduling framework

Scheduler. ScheduleOne method in Scheduler can be used as an entry point to analyze Scheduler implementation based on scheduling framework. Every time this method is called, the whole process of scheduling node for Pod will be completed.

func (sched *Scheduler) scheduleOne(ctx context.Context) {
    fwk := sched.Framework
    podInfo := sched.NextPod()
    pod := podInfo.Pod
    state := framework.NewCycleState()
    scheduleResult, _ := sched.Algorithm.Schedule(schedulingCycleCtx, state, pod)
    assumedPod := podInfo.DeepCopy().Pod
    allBound, _ := sched.VolumeBinder.Binder.AssumePodVolumes(assumedPod, scheduleResult.SuggestedHost)
    iferr ! = nil {return
    }
    ifsts := fwk.RunReservePlugins(schedulingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost); ! sts.IsSuccess() {
        return
    }
    iferr := sched.assume(assumedPod, scheduleResult.SuggestedHost); err ! = nil { fwk.RunUnreservePlugins(schedulingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)return}... }Copy the code
  • The function returned by MakeNextPodFunc, which calls the internal priority queue, gets the next waiting Pod from the queue. The queue that maintains waiting PODS executes the QueueSort plug-in;
  • Call genericScheduler. Schedule function choice node, the process will perform PreFilter, Filter, PostFilter, Score four plug-in extension point;
  • Call framework. RunReservePlugins function Reserve plug-in for preserving resources and stepped into the stage of binding (binding phase running for a long time, to avoid resource be preempted). If failure to perform, call framework. RunUnreservePlugins function Unreserve plug-in.

Because each scheduling decision changes context, Kubernetes needs to execute sequentially at this stage. The binding phase is the implementation of the scheduling process, we will create a new Goroutine parallel execution of the binding loop:

func (sched *Scheduler) scheduleOne(ctx context.Context) {
    ...
    gofunc() {
        bindingCycleCtx, cancel := context.WithCancel(ctx)
        defer cancel()
        fwk.RunPermitPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
        if! allBound { sched.bindVolumes(assumedPod) } fwk.RunPreBindPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)iferr := sched.bind(bindingCycleCtx, assumedPod, scheduleResult.SuggestedHost, state); err ! = nil { fwk.RunUnreservePlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost) }else {
            fwk.RunPostBindPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
        }
    }()
}
Copy the code
  • Start a Goroutine and call framework. RunPermitPlugin asynchronous operation Permit plug-in, this stage can be used to achieve a batch scheduler;
  • Call Scheduler. BindVolumes to bind the volume to Node.
  • The scheduler. bind function is called to bind Pod to Node to complete the scheduling process, which executes the plug-ins of the PreBind, bind and PostBind extension points.

The current scheduling framework is still in the Alpha stage in Kubernetes V1.17.0. Many features are not yet clear, and many improvements may be made in the next few versions to support more and richer scenarios, but the scheduling framework will be the core of the scheduler for a long time.

2. Summary

This section introduces the different designs of Kubernetes scheduler from V1.0.0 to the latest version. There are two different designs in Kubernetes scheduler, one is based on predicate and priority algorithm, and the other is based on scheduling framework.

Many business schedulers also need to select the optimal choice from multiple options, no matter it is the lowest cost or the best quality. We can consider dividing the scheduling process into two stages, filtering and scoring, to establish an appropriate abstraction for the scheduler. The filtering stage will filter out the options that do not meet the requirements according to the requirements. Many of these problems can be solved by following a design approach in which multiple options may be ranked by quality, cost, and weight.

The current Kubernetes scheduling framework has detailed support for the multi-phase extension approach, which is almost the final form of the scheduler’s internal implementation. However, with the increasing complexity of scheduler functions, more complex scheduling scenarios may be encountered in the future, such as multi-tenant scheduling resource isolation, multi-scheduler functions, and Kubernetes community has been working hard to build a high-performance scheduler.

3. Read more

  • Scheduling Framework #624

  • Create a custom Kubernetes scheduler

  • Scheduler extender Document

conclusion

From operating system, programming language to application program, we analyze the design and implementation principle of Linux, Go language and Kubernetes scheduler in this article. These three different schedulers actually depend on each other:

Figure 26 – Three-tier scheduler

As shown in the figure above, Kubernetes’ scheduler relies on the Go language’s runtime scheduler, which in turn relies on Linux’s process scheduler, moving further and further away from the user from top to bottom and focusing more and more on the specific business from bottom to top. Finally, we analyze the similarities and differences of these schedulers through two comparisons:

  • Linux process scheduler and Go language scheduler;
  • System level scheduler (Linux and Go) versus business scheduler (Kubernetes).

These are two different levels of comparison, I believe that through the comparison of different angles can let us have a deeper understanding of the design of the scheduler.

1. Linux and Go

The first is the Linux and Go language scheduler, both of which are very similar in their scenarios. They both end up taking full advantage of the CPU resources on the machine, so there are many similarities in their implementation and evolution:

  • The initial versions of the scheduler were very simple, even rudimentary, supporting only collaborative scheduling;
  • Partitioning by run queue, balancing run queues on different cpus or threads by way of work stealing;
  • Finally, signal-based preemptive scheduling is realized in some ways, but the implementation of Go language is not perfect.

Because the scenarios are very similar, their purposes are very similar, but the granularity of the tasks they schedule will be different. The minimum scheduling unit of the Linux process scheduler is thread, while the Go language is Goroutine. Compared with the Linux process scheduler, the Go language establishes a new model at the user level. Another scheduler is implemented to provide users with lightweight scheduling units to enhance the performance of the program, but it also introduces many components to handle thread-related operations such as system calls, network rotation, etc. At the same time, the combination of multiple tasks of different granularity leads to relatively complex implementation.

The final design of The Linux scheduler introduced the concept of scheduling class, allowing different task types to enjoy different scheduling strategies to reconcile the scheduling dilemma of low latency and real time.

The Go language scheduler has just introduced signal-based preemptive scheduling, leaving many features incomplete. In addition to preemptive scheduling, the complex NUMA scheduler proposal may also be the future direction of the Go language.

2. System and service

If we compare the system scheduler with the business scheduler, we will see that the two are very different in design, because they are at different levels of the system. The system scheduler is concerned with extreme performance, so it separates resources such as run queues by partitioning and reduces system latency by reducing the granularity of locks. The business scheduler focuses on the perfect scheduling function. Although the performance of scheduling is very important, it must be built on the basis of meeting specific scheduling requirements. Since the scheduling requirements of business are often complex, trade-offs and trade-offs can only be made.

Because of the different requirements, we find that the evolution of different schedulers is completely different. System scheduler will make full use of resources, reduce the delay of the system and then to consider joining when performance can’t optimize scheduling classes, and other functions meet the scheduling of different scenarios, and Kubernetes scheduler is more focused on the internal organization of different scheduling algorithm, how to maintain multiple complex scheduling algorithm, at the same time when design a good abstract, It will consider more complex multi-scheduler, multi-tenant, and so on scenarios.

3. The last

The study of historical change is very different, when we discover the cause of the code changes also will feel happy, it let us stand in today again witnessed in the history of the decision, in this paper, the corresponding chapter already contains links to the corresponding source code, readers can read the corresponding content, also sincerely hope that readers can have the harvest.

read

  • Cooperative vs. Preemptive: a quest to maximize concurrency power
  • Randomized Work Stealing versus Sharing in Large-scale Systems with Non-exponential Job Sizes
  • Scalable work stealing

Cloud Native Webinar invites you to attend

“Alibaba Cloud originators pay close attention to technical fields such as microservice, Serverless, container and Service Mesh, focus on cloud native popular technology trends and large-scale implementation of cloud native, and become the technical circle that knows most about cloud native developers.”