preface

All references to the Source code of the Linux operating system in this article are based on Linux version 4.20, and the source code listed in this article is mainly the core attributes or functions, with omissions.

The basic concept

Linux Scheduling unit

The minimum unit of scheduling in Linux is a thread, or task-based.

Each process or thread is a Task managed by a task_struct data structure. When a Linux process is created, it also contains a main thread.

Linux Task Types

Linux divides tasks into two main categories:

  1. Real-time tasks: have a high priority to complete the delivery as soon as possible;
  2. Ordinary tasks: no certain real-time requirements, as soon as possible to complete the delivery.

When real-time tasks exist in the system, the scheduler preferentially selects real-time tasks to execute. Ordinary tasks are not executed. In addition, different types of tasks have different scheduling policies.

Scheduling policies supported by Linux

Task_struct contains a field that identifies the scheduling policy for a task:

unsigned int policy;
Copy the code

The optional values are as follows:

// kernel/sched/sched.h

#define SCHED_NORMAL		0
#define SCHED_FIFO		1
#define SCHED_RR		2
#define SCHED_BATCH		3
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE		5
#define SCHED_DEADLINE		6
Copy the code

Real-time task scheduling policy:

  1. SCHED_FIFO, in order of tasks so that a high-priority task can preempt a low-priority task;
  2. SCHED_RR, which allocates a time slice of a certain size to each task, and when the slice runs out, the task moves to the end of the queue, allowing a high-priority task to preempt a low-priority task;
  3. SCHED_DEADLINE, which gives priority to tasks whose current time is close to the task deadline.

Common task scheduling policy:

  1. SCHED_NORMAL, normal task scheduling policy;
  2. SCHED_BATCH, background task;
  3. SCHED_IDLE, only when idle.

Linux scheduling priority

Linux uses two main priority ranges:

  1. The nice value used by common tasks can be set by using the Nice library function. The smaller the nice value, the greater the chance of scheduling execution. The value ranges from -20 to 19
  2. Real-time priority of a real-time task. The value ranges from 0 to 99

Task_struct (task_struct);

// The value of prio is the value of the priority used by the scheduler. A smaller prio indicates a higher priority. The value ranges from 0 to 139. 100 to 139 indicates the priority of a common task. (For nice values, -20 to 19) int Prio. // static prio does not change over time, the kernel does not actively change it, only nice can be used to change the static prio // the valid range is 100 to 139, the default is 120,0 to 99 is meaningless int static prio; Int normal_prio; int normal_prio; int normal_prio; Unsigned int rt_priority; // Real-time priority. The value ranges from 0 to 99 and is valid only for real-time tasks.Copy the code

Linux scheduling classes

The scheduling class is the logic responsible for actually executing the scheduling policy. Inside task_struct there is a pointer to the scheduling class structure:

const struct sched_class *sched_class;
Copy the code

Sched_class is just an abstract structure that internally defines some scheduling priorities, as follows:

Struct sched_class {// Scheduling classes are a linked list in order of priority, next executes the next scheduling class const struct sched_class *next; Void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags); Void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags); Void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags); Struct task_struct * (*pick_next_task)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf); void (*put_prev_task)(struct rq *rq, struct task_struct *p); void (*set_curr_task)(struct rq *rq); Void (*task_tick)(struct rq *rq, struct task_struct *p, int queued); /* * The switched_from() call is allowed to drop rq->lock, therefore we * cannot assume the switched_from/switched_to pair is serliazed by * rq->lock. They are however serialized by p->pi_lock. */ void (*switched_from)(struct rq *this_rq, struct task_struct *task); void (*switched_to) (struct rq *this_rq, struct task_struct *task); . };Copy the code

Scheduling classes have the following specific implementation:

// Stops all other threads without being interrupted by other tasks. The highest-priority task uses extern const struct sched_class stop_sched_class; // Execute the deadline scheduling policy extern const struct sched_class dl_sched_class; // The policy field specifies extern const struct sched_class rt_sched_class; // The policy field specifies extern const struct sched_class; // An extern const struct sched_class fair_sched_class uses the CFS scheduling algorithm internally; // Execute the IDLE scheduling policy extern const struct sched_class IDle_sched_class;Copy the code

Each scheduling class has its own implementation mechanism for the above interface functions. At the same time, the scheduling class is actually a linked list data structure, which is in the order of priority:

Stop_sched_class → dl_sched_class → rt_sched_class → fair_sched_class → IDle_sched_class

Source reference:

const struct sched_class stop_sched_class = { .next = &dl_sched_class, .. } const struct sched_class dl_sched_class = { .next = &rt_sched_class, ... } const struct sched_class rt_sched_class = { .next = &fair_sched_class ... }...Copy the code

Linux Scheduling entity

Scheduling entities are used to maintain the meta-information related to task scheduling. There are several types of scheduling entities:

// Schedule entity struct sched_entity se for regular tasks; // Scheduling entity struct sched_rt_entity RT for real-time tasks; // Deadline scheduling class struct sched_DL_entity DL;Copy the code

Since most of the tasks we dealt with were normal tasks, we focused on the details of sched_entity:

Struct sched_entity {// the weight value of the current task. Linux uses nice functions to set priorities for tasks. Each nice value has a weight value; unsigned long runnable_weight; struct rb_node run_node; struct list_head group_node; unsigned int on_rq; u64 exec_start; u64 sum_exec_runtime; // The virtual runtime of the current task, the CFS scheduling algorithm concept, based on the actual running time of the task and the value calculated by the priority weight u64 vRuntime; u64 prev_sum_exec_runtime; } struct load_weight { unsigned long weight; u32 inv_weight; };Copy the code

The idea and implementation of CFS scheduling algorithm

The idea of CFS (Complete Fair Schedule) scheduling algorithm is to maintain a quasi vRuntime for each task,

The scheduler preferentially executes the task with the lowest vRuntime value. The concept of VRuntime was introduced to support priority scheduling.

Vruntime is the value calculated based on the actual running time and priority weight of the task. The calculation formula is as follows:

vruntime += delta_exec * NICE_0_LOAD / weight

  • Vruntime: Virtual runtime

  • Delta_exec: actual execution time

  • NICE_0_LOAD: indicates the weight of the process priority if the nice value is 0

  • Weight: indicates the weight of the nice value of a task. As mentioned earlier, the value of the nice value ranges from -20 to 19. Each nice value has a corresponding weight in Linux memory programs.

    const int sched_prio_to_weight[40] = { /* -20 */ 88761, 71755, 56483, 46273, 36291, /* -15 */ 29154, 23254, 18705, 14949, 11916, /* -10 */ 9548, 7620, 6100, 4904, 3906, /* -5 */ 3121, 2501, 1991, 1586, 1277, /* 0 */ 1024, 820, 655, 526, 423, /* 5 */ 335, 272, 215, 172, 137, /* 10 */ 110, 87, 70, 56, 45, /* 15 */ 36, 29, 23, 18, 15,};Copy the code

As we can see, the smaller the NICE value, the greater the weight value, the smaller the vruntime will be, and the greater the chance of getting scheduled execution, and the smaller the chance of getting scheduled execution.

In order to obtain the time complexity of the minimum vRuntime task first, the LinuxCFS algorithm adopts the data structure of red-black tree, and its concrete implementation is cfs_rq in the run queue.

Linux run queue

For each CPU, there is an RQ structure that maintains all the tasks that need to be run. This structure is called a running queue. Each rQ has several subqueues, including:

Struct CFS struct cfs_rq CFS; // Struct rt_rq rt; // Struct dl_rq dl; struct task_struct *curr; }Copy the code

The cfS_Rq run queue is used for ordinary tasks, and the internal implementation of the CFS run queue is actually a red-black tree.

// CFS task queue, qi struct cfs_rq {struct load_weight load; unsigned long runnable_weight; unsigned int nr_running; unsigned int h_nr_running; u64 exec_clock; u64 min_vruntime; struct rb_root_cached tasks_timeline; /* * 'curr' points to currently running entity on this cfs_rq. * It is set to NULL otherwise (i.e when none are */ // points to the struct sched_entity *curr that is running on cfs_rq; struct sched_entity *next; struct sched_entity *last; struct sched_entity *skip; }; /* * Leftmost-cached rbtrees. * * We do not cache the rightmost node based on footprint * size vs number of potential users that could benefit * from O(1) rb_last(). Just not worth it, users that want * this feature can always implement the logic explicitly. * Furthermore, users that want to cache both pointers may * find it a bit asymmetric, but that's ok. */ struct rb_root_cached { struct rb_root rb_root; struct rb_node *rb_leftmost; };Copy the code

Scheduling execution

Implementation of the fair_sched_class scheduling class

Since most of the tasks are normal tasks, the fair_sched_class is used, so let’s take a look at the source code implementation:


const struct sched_class fair_sched_class = {
    .next			= &idle_sched_class,
    .check_preempt_curr	= check_preempt_wakeup,
    .pick_next_task		= pick_next_task_fair,
    .set_curr_task    = set_curr_task_fair,
    .task_tick		= task_tick_fair,
};
Copy the code

The implementation of specific interface functions is as follows:

Get the next task to execute pick_next_task implementation — pick_next_task_fair

  1. Retrieves the task_struct of the currently running task
  2. Call update_curr() to update the total execution time, vRuntime, and so on for the currently running task
  3. Gets the scheduling entity for the next task to run
  4. Put the task scheduling entity back on the CFS queue
  5. Returns the next task to run
static struct task_struct * pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { struct cfs_rq *cfs_rq = &rq->cfs; struct sched_entity *se; struct task_struct *p; int new_tasks; again: if (! cfs_rq->nr_running) goto idle; #ifdef CONFIG_FAIR_GROUP_SCHED if (prev->sched_class ! = &fair_sched_class) goto simple; do { struct sched_entity *curr = cfs_rq->curr; If (curr) {if (curr->on_rq) update_curr(cfs_rq); else curr = NULL; if (unlikely(check_cfs_rq_runtime(cfs_rq))) { cfs_rq = &rq->cfs; if (! cfs_rq->nr_running) goto idle; goto simple; } } se = pick_next_entity(cfs_rq, curr); cfs_rq = group_cfs_rq(se); } while (cfs_rq); p = task_of(se); if (prev ! = p) { struct sched_entity *pse = &prev->se; while (! (cfs_rq = is_same_group(se, pse))) { int se_depth = se->depth; int pse_depth = pse->depth; if (se_depth <= pse_depth) { put_prev_entity(cfs_rq_of(pse), pse); pse = parent_entity(pse); } if (se_depth >= pse_depth) { set_next_entity(cfs_rq_of(se), se); se = parent_entity(se); }} put_prev_entity(cfs_rq, pse); // Put the task back in the CFS queue after the time is changed. set_next_entity(cfs_rq, se); } goto done; Simple: #endif // Put the task back in rq put_prev_task(rq, prev); do { se = pick_next_entity(cfs_rq, NULL); set_next_entity(cfs_rq, se); cfs_rq = group_cfs_rq(se); } while (cfs_rq); p = task_of(se); . }Copy the code

Update the current task running time statistics update_curr function

  1. Update the task execution time sum_exec_Runtime
  2. Update the VRuntime of task
static void update_curr(struct cfs_rq *cfs_rq) { struct sched_entity *curr = cfs_rq->curr; u64 now = rq_clock_task(rq_of(cfs_rq)); u64 delta_exec; delta_exec = now - curr->exec_start; curr->sum_exec_runtime += delta_exec; Curr ->vruntime += calc_delta_fair(delta_exec, curr); update_min_vruntime(cfs_rq); . . }Copy the code

Clock interrupt handling task_TICK function implementation — task_tick_fair

Clock interrupt handling:

  1. Find the scheduling entity SE for the currently running task
  2. The entity_tick function is executed. The entity_tick internal call update_curr updates the total time and vRuntime of the current running task and determines whether the current task should be preempted
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued) { struct cfs_rq *cfs_rq; struct sched_entity *se = &curr->se; for_each_sched_entity(se) { cfs_rq = cfs_rq_of(se); entity_tick(cfs_rq, se, queued); } if (static_branch_unlikely(&sched_numa_balancing)) task_tick_numa(rq, curr); update_misfit_status(curr, rq); } static void entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued) {// update current task runtime statistics, including vRuntime, Update_curr (cfs_rq); Check_preempt_tick (cfs_rq, curr) if (cfs_rq->nr_running > 1) check_preempt_tick(cfs_rq, curr); }Copy the code

Check whether the current task should be preempted — check_preempt_tick

  1. Calculate the ideal_Runtime of the task in this cycle
  2. Calculate the actual execution time delta_exec of the task
  3. Verify that it should be preempted
/* kernel/sched/fair.c */ static void check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr) { unsigned long ideal_runtime, delta_exec; struct sched_entity *se; s64 delta; Ideal_runtime = sched_slice(cfs_rq, curr); Delta_exec = execr ->sum_exec_runtime - curr->prev_sum_exec_runtime; If (delta_exec > ideal_runtime) {/* tell rq to resched_curr(rq_of(cfs_rq)); clear_buddies(cfs_rq, curr); return; Granularity < sysctl_sched_MIN_granularity) return;} if (delta_exec < sysctl_SCHED_MIN_granularity) return; Vruntime se = __pick_first_entity(cfs_rq); delta = curr->vruntime - se->vruntime; if (delta < 0) return; if (delta > ideal_runtime) resched_curr(rq_of(cfs_rq)); } // If it should be preempted, then resched_curr(rq_of(cfs_rq)), and set_tsk_need_resched, Static inline void set_tSK_need_resched (struct task_struct * TSK) { set_tsk_thread_flag(tsk,TIF_NEED_RESCHED); }Copy the code

How do REAL-TIME tasks take precedence over ordinary tasks

The scheduler’s first job in executing the core scheduling logic is to get the next task that should be executed. As we said earlier, scheduling is actually a list of priorities. The scheduler goes through each scheduling class in turn and calls pick_next_task() for each scheduling class. The task is first obtained from RT_Rq, and if not, the fair_sched_class gets the task from CFs_Rq, enabling real-time tasks to take precedence over normal tasks. Because of this, when real-time tasks exist in the system, ordinary tasks will not get the chance to execute. Its specific implementation is as follows:

#ifdef CONFIG_SMP #define sched_class_highest (&stop_sched_class) #else #define sched_class_highest (&dl_sched_class) #endif #define for_each_class(class) \ for (class = sched_class_highest; class; class = class->next) /* * Pick up the highest-prio task: */ static inline struct task_struct * pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { const struct sched_class *class; struct task_struct *p; /* * Optimization: we know that if all tasks are in the fair class we can * call that function directly, but only if the @prev task wasn't of a * higher scheduling class, Because otherwise those loose the * opportunity to pull in more work from other CPUs. If all tasks in the current RQ are normal tasks, If (likely((prev->sched_class ==)){// we can start with the fair_sched_class scheduling class directly instead of starting from scratch with each scheduling class as for_each_class(class){} below &idle_sched_class || prev->sched_class == &fair_sched_class) && rq->nr_running == rq->cfs.h_nr_running)) { p = fair_sched_class.pick_next_task(rq, prev, rf); if (unlikely(p == RETRY_TASK)) goto again; /* Assumes fair_sched_class->next == idle_sched_class */ if (unlikely(! p)) p = idle_sched_class.pick_next_task(rq, prev, rf); return p; } // Iterate over each scheduling class to get the next task to execute, For_each_class (class) {p = class->pick_next_task(rq, prev, rf); if (p) { if (unlikely(p == RETRY_TASK)) goto again; return p; }}}Copy the code

Active scheduling and preemptive scheduling

Generally, we divide scheduling types into two types:

  1. Active scheduling: The task executes the schedule library function (the schedule system call) and actively allocates CPU resources. For example, when a task is waiting for IO resources and the task is in the state of being unable to advance, it can call the schedule to allocate CPU resources to other tasks

  2. There are two main types of preemptive scheduling:

    1. The process is taking too long to execute, and the clock interrupt handler triggers the preemption of the currently executing task.
    2. When a task is wakened, preemption is triggered if the priority of the wakened task is higher than that of the currently running task.

The execution process of active scheduling

As we have seen before, active scheduling is triggered by the user executing the schedule() function. Usually, when an I/O blocking operation is performed, the user can choose to execute the schedule function to release CPU resources.

An implementation of the schedule() function calls a __schedule() function that encapsulates the core scheduling logic:

static void __sched notrace __schedule(bool preempt) { struct task_struct *prev, *next; unsigned long *switch_count; struct rq_flags rf; struct rq *rq; int cpu; cpu = smp_processor_id(); Rq = cpu_rq(CPU); Prev = rq->curr; prev = rq->curr; . Next = pick_next_task(rq, prev, &rf); next = pick_next_task(rq, prev, &rf); . // If the next task to be executed is different from prev, the process context switch is required, and the next task is loaded to run. = next)) { rq->nr_switches++; rq->curr = next; ++*switch_count; trace_sched_switch(preempt, prev, next); rq = context_switch(rq, prev, next, &rf); } else { rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP); rq_unlock_irq(rq, &rf); }}Copy the code

The specific implementation of context switch in the program will not be studied in detail due to many details, and will be improved later.

The execution of a preemptive schedule

Earlier we said that there are two main times to trigger preemptive scheduling:

  1. Clock interrupt
  2. Process of awakening

Clock interrupts trigger preemption

The scheduler_tick() function is called by the clock interrupt handler:

  1. Retrieves the task_struct of the currently running task

  2. Calls the implementation of the schedule class’s clock interrupt function task_tick

    1. For normal tasks, task_tick of fair_sched_class is called
    2. Internally, it updates the running time of the current task, the VRuntime, and determines whether the current task should be preempted
    3. If it should be preempted, the task is marked with TIF_NEED_RESCHED and waits until a specific time to execute the actual scheduling logic
    4. Refer to the above implementation of fair_sched_class for the logic
/* * This function gets called by the timer code, with HZ frequency. * We call it with interrupts disabled. */ void scheduler_tick(void) { int cpu = smp_processor_id(); Struct rq *rq = cpu_rq(CPU); Task_struct task_struct *curr = rq->curr; update_rq_clock(rq); // Call the task_tick function curr->sched_class->task_tick(rq, curr, 0); . . }Copy the code

Waking up the task triggers preemption

When we talk about active scheduling, we said that while waiting for IO, we can call the schedule function to actively abandon the CPU and enter the blocking state. When the IO is complete, the task needs to be woken up, and if the priority of the woken task is higher than that of the currently running task, preemption is triggered.

  1. Call the try_to_wake_up function to wake up the task

  2. The tTWu_DO_wakeup logic will eventually be executed

    1. Try_to_wake_up will first add the wake task to the TTWu_queued queue and finally activate the task by calling tTWu_do_activate
    2. Ttwu_do_activate calls tTWu_do_wakeup internally to wakeup
  3. Ttwu_do_wakeup internally calls check_preempt_curr to determine whether the current task should be preempted

  4. If it should preempt, the task will be marked with TIF_NEED_RESCHED and the actual scheduling logic will be executed at a specific time

The specific implementation logic is as follows:

static int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) { unsigned long flags; cpu = task_cpu(p); // Add wake task to queue ttwu_queue(p, CPU, wake_flags); return success; } static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags) { struct rq *rq = cpu_rq(cpu); struct rq_flags rf; rq_lock(rq, &rf); update_rq_clock(rq); Ttwu_do_activate (rq, p, wake_flags, &rf); rq_unlock(rq, &rf); } static void ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, struct rq_flags *rf) { int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK; lockdep_assert_held(&rq->lock); . . ttwu_activate(rq, p, en_flags); Ttwu_do_wakeup (rq, p, wake_flags, rf); } static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags, Struct rq_flags *rf) {// check_preempt_curr(rq, p, wake_flags); p->state = TASK_RUNNING; trace_sched_wakeup(p); } void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) { const struct sched_class *class; If (p->sched_class == rq->curr->sched_class) {// Call the implementation of the scheduling class for the currently running task to determine whether to preempt rq->curr->sched_class->check_preempt_curr(rq,  p, flags); } else { for_each_class(class) { if (class == rq->curr->sched_class) break; if (class == p->sched_class) { resched_curr(rq); break; }}}}Copy the code

The time to actually execute preemption logic

The actual preemption time is divided into user mode and kernel mode:

User mode:

  1. When the system call returns to user mode

    static void exit_to_usermode_loop(struct pt_regs *regs, U32 cached_flags) {while (true) {// Determine whether scheduling should be performed if (cached_flags & _TIF_NEED_RESCHED) schedule(); If (cached_flags & _TIF_SIGPENDING) do_signal(regs); if (cached_flags & _TIF_SIGPENDING) do_signal(regs); }}Copy the code
  2. When the interrupt returns to user mode

    Some interrupt functions need to return to usermode after processing, and will eventually be called to exit_to_usermode_loop

Kernel mode:

  1. When preemption is enabled, that is, when preempt_enable() is called. The kernel performs certain operations that do not allow interrupts, so it temporarily calls preempt_disable() to turn off preemption when those operations are performed, and then turns it on when the operation is complete.
  2. When the interrupt returns to kernel mode

Whether active or preemptive, the __schedule() core scheduling logic is ultimately executed.

conclusion

CFS scheduling algorithm for Linux task scheduling

The idea of CFS scheduling algorithm is to maintain a virtual running time vRuntime for each task, and save the scheduling information of each task based on the red-black tree data structure. The red-black tree is sorted according to the virtual running time, and the scheduler preferentially selects the task with the smallest VRuntime to execute, that is, the leftmost node of the red-black tree.

The reason why it is called virtual running time is that it is calculated based on the actual running time and weight ratio. The specific formula is:

Virtual running time = Actual running time * NICE_0_LOAD (weight when nice is 0)/weight

The nice value of each task at the user program level ranges from -20 to 19. Each NICE value corresponds to a weight value in the kernel program design. The lower the NICE value is, the higher the weight value is, and the smaller the calculated Vruntime is.