How Linux process scheduling works, first published at: blog.ihypo.net/15279557709…

Linux processes were one of the topics I cared most about in interviews, because many languages do not implement their own process management and scheduling but borrow the operating system's capabilities directly. This is especially true in the Python community, where there is no need to reinvent the wheel and a thin wrapper around system calls suffices. So if you don't know much about Linux processes, it is hard to really understand what happens in the concurrent code you write.

The scheduler is the kernel subsystem that ensures processes get work done efficiently: it decides which process runs, when, and for how long. Sensible scheduling is what lets the system make the most of its resources.

From the first Linux release in 1991 through the 2.4 kernels, the Linux scheduler was fairly crude and primitive in design. The 2.5 development series introduced the so-called O(1) scheduler (named for the O(1) time complexity of its scheduling algorithm). However, the O(1) scheduler performed poorly for latency-sensitive, interactive programs (perhaps one reason the early Linux desktop was slow to advance), so during 2.6 development new algorithms optimized for interactive workloads were explored, the best known being the Rotating Staircase Deadline (RSDL) scheduler. In kernel 2.6.23, the algorithm that grew out of this work completely replaced the O(1) scheduler: the Completely Fair Scheduler, or CFS. This article focuses on the O(1) and CFS algorithms and briefly explains how CFS is implemented.

Basic concepts of multitasking and scheduling

Linux is a multitasking operating system that can execute multiple processes concurrently and interactively. On a multi-core processor, a multitasking operating system lets multiple processes execute in true parallel on different processors, whereas on a single-core processor it creates the illusion of multiple processes executing simultaneously. In either case, the operating system can block or put to sleep many processes, handing the processor only to those that are ready to run.

Multitasking systems are generally divided into non-preemptive and preemptive multitasking. Linux falls into the latter category: the scheduler decides when to stop one process so that others get a chance to execute. The act of forcibly suspending a process is called preemption, and the amount of time a process can run before it is preempted is usually predetermined and has a special name: the timeslice.

A timeslice is the amount of processor time actually allocated to each runnable process. Many operating systems compute timeslices dynamically: the absolute amount of time given to a process is derived from the machine's load rather than fixed once and never changed. The Linux scheduler, however, does not allocate absolute timeslices at all, which is how it achieves fair scheduling.

The essence of a scheduler is its scheduling algorithm: the policy determines when the scheduler runs and which process it picks. So let's start with a few practical questions to understand what scheduling algorithms actually do.

I/O-intensive and CPU-intensive processes

Processes can be classified as I/O-intensive or CPU-intensive. The former spend most of their time submitting or waiting for I/O requests; such processes are runnable often, but each run is short because they soon block waiting for the next I/O request. CPU-intensive processes, on the other hand, spend most of their time executing code, and unless preempted they keep executing, because they issue few I/O requests and rarely block. For the sake of response time, the scheduler does not run them very often.

Because the system must stay responsive, the scheduler mainly has to make sure CPU-intensive processes do not run too long at a stretch. But some programs do not fit neatly into either category. Office software such as Word, for example, spends most of its time waiting for the keyboard (waiting on I/O), yet at any moment it may plunge into spell checking or frantic macro computation.

How does a scheduling policy balance two conflicting goals, fast response and maximum system utilization? Different systems solve this differently, usually with a complex set of algorithms, but most of them cannot guarantee that low-priority processes are treated fairly. Linux's CFS basically solves this problem (as long as the number of processes is not enormous).

Process priority issues

One of the most basic types of scheduling algorithm is priority-based scheduling: the idea of ranking processes by their value and their need for processor time. There are generally two variants. In one, higher-priority processes run first and lower-priority processes run later, with processes of equal priority scheduled round-robin. In the other, higher-priority processes get longer timeslices, and the scheduler always picks the highest-priority process that still has timeslice remaining.

Linux uses two separate priority ranges. The first is the nice value, which ranges from -20 to +19 and defaults to 0; the higher the nice value, the lower the priority. A high-priority (low-nice) process gets more processor time than a low-priority (high-nice) process, but how that time is allocated varies by system. macOS, for example, uses the nice value to scale absolute timeslices, while Linux uses it to scale the proportion of processor time.

The second range is the real-time priority, which by default runs from 0 to 99. Contrary to the nice value, a higher real-time priority means a higher priority. Real-time priorities are how Linux favors time-critical processes.

Time slice

A timeslice is a number indicating how long a process may run before being preempted. Scheduling policies always specify a default timeslice, but choosing it is not easy. If the timeslice is too long, process switches become infrequent and the system feels sluggish in interactive use; if it is too short, the processor time burned on process switching rises significantly.

So either extreme leads to poor performance, and many operating systems take this very seriously, settling on a moderate default such as 10ms. CFS in Linux, rather than using an absolute length, assigns each process a ratio of CPU time, so the processor time a process receives is closely tied to the machine's load.

In Linux, the proportion a process receives is further weighted by its nice value, which is how priority takes effect. With these two factors, the kernel can make preemption decisions on a usage-ratio basis: if a newly runnable process has consumed a smaller proportion than the currently executing one, it preempts the current process and runs immediately.

Processes that require user interaction

As mentioned earlier, processes that interact with the user have stronger real-time demands and therefore need special handling, so let's walk through a practical example of how Linux deals with them.

Suppose you have a very CPU-intensive process, such as the compilation of a huge C program, and a strongly interactive one, such as a Notepad-like word processor. For the compilation we don't much care exactly when it runs; half a second earlier or later makes no difference, though finishing as fast as possible would be great. For Notepad, we want feedback to appear the instant we hit a key.

One way to meet these requirements would be for the system to give Notepad a higher priority and a larger timeslice (which might require the system to recognize automatically that Notepad needs them). Linux instead takes a very simple approach that doesn't even require different priorities. If the two processes have the same nice value, each is entitled to 50% of the CPU time (a timeslice under Linux is a ratio). Because Notepad spends its time waiting for user input, the compiler inevitably does most of the running, so Notepad uses far less than its 50% share and the compiler far more. When the user types, the kernel sees that the newly woken Notepad has used a smaller proportion than the compiling process and lets it preempt immediately, giving a fast response. When Notepad finishes its burst of work and goes back to waiting for input, it suspends, and the processor returns to the compilation.

Scheduling algorithm

After describing what the scheduler does with a few simple concepts, let’s begin the formal discussion of scheduling algorithms.

The scheduler class

Before we get into scheduling algorithms, we need to explain scheduler classes. The Linux scheduler is modular, allowing different types of processes to be handled by different scheduling algorithms. Each module that provides a scheduling algorithm is a scheduler class. CFS, for example, is the scheduler class for normal processes (defined in kernel/sched_fair.c), known on Linux as SCHED_NORMAL (SCHED_OTHER in POSIX).

Each scheduler class has a priority; the kernel selects the highest-priority class with a runnable process, and that class then decides which process to run.

O(1) scheduling algorithm

Before we get to CFS, let's look at its predecessor in Linux, which followed the traditional Unix design: the O(1) scheduler. Modern process schedulers rest on two general concepts, process priority and the timeslice. The timeslice is how long a process may run; a process is given a timeslice when it is created.

Besides letting higher-priority processes preempt wherever possible, the O(1) scheduler also weights timeslices by priority. But as mentioned above, a traditional scheduler hands out absolute lengths of time, and that causes problems. Suppose two processes have nice values 0 and 1: their weighted timeslices come out to 100ms and 95ms, almost indistinguishable. Change the nice values to 18 and 19, and the timeslices become 10ms and 5ms: the former is now twice the latter. The same difference of 1 in nice value produces wildly different results depending on where in the range it falls.

Therefore, CFS dispenses with the absolute allocation of timeslices entirely and instead allocates each process a weighted share of the processor.

CFS scheduling algorithm

The starting point for CFS is a simple idea: process scheduling should work as if the system had an ideal multitasking processor. On such a system, each of n processes would get 1/n of the processor time. For example, with two runnable processes, run one for 5ms and then the other for 5ms; if process switching is fast enough, it appears that both processes ran simultaneously over the 10ms, each using half of the processor's power.

Of course, this isn't realistic. One processor can't literally run multiple tasks at once, and switching between processes has a cost, so switches can't be infinitely fast. CFS therefore compromises: instead of assigning a timeslice to each process, it lets each process run for a while, rotates, and always selects the process that has run the least as the next to run. CFS calculates how long a process should run from the total number of runnable processes rather than converting nice values into timeslices (nice values affect only the weight, never an absolute amount of time).

Each process runs for a "timeslice" proportional to its weight divided by the total weight of all runnable processes. To compute this accurately, CFS sets a target for its approximation of the infinitesimal scheduling period of perfect multitasking, called the target latency. A smaller scheduling period means better interactivity and a closer approximation of perfect multitasking (at the cost of more switching overhead). For example, with a target latency of 20ms, two processes of the same priority each run for 10ms before being preempted; with five such tasks, each runs for 4ms.

However, if the number of processes grows large, say beyond 20, each process's share drops below 1ms, possibly below the cost of a process switch itself. Linux avoids this by setting a floor called the minimum granularity (by default generally 1ms). So CFS is only truly fair when the number of processes is not enormous (a typical system runs a few hundred processes, which is still very fair). For processes of different priorities, CFS also performs well. Say the target latency is still 20ms, and one of two processes has a nice value of 0 and the other a nice value of 5, so the latter's weight is about 1/3 of the former's. The two processes then receive roughly 15ms (3/4 of 20ms) and 5ms (1/4 of 20ms) of processor time respectively. If instead the nice values are 10 and 15, the weight ratio is unchanged, so the processes still get 15ms and 5ms. Absolute nice values thus do not affect scheduling decisions; only relative nice values affect how processor time is apportioned.

Under CFS, the amount of processor time any process gets is determined by the difference between its own nice value and those of all other runnable processes. The nice value's effect changes from an arithmetic (additive) adjustment to a geometric weighting, and absolute timeslices become usage ratios, which yields lower scheduling latency in multi-process environments.

The realization of CFS

Having covered the concepts, we finally arrive at the heart of the matter: how CFS is implemented. The relevant code is in kernel/sched_fair.c, and we'll focus on four areas:

  • Record running time
  • Select the process to run
  • Scheduler selection
  • Sleep and wake up

Record running time

As mentioned above, the core of CFS is the CPU usage ratio, so keeping track of each process's running time is essential. On most Unix systems, a process is assigned a timeslice of absolute time, and each time a system clock tick occurs, the timeslice is decremented by one tick period. Whenever a process's timeslice reaches 0, it is preempted by a runnable process whose timeslice has not.

CFS has no absolute timeslices, but it still must keep an account of each process's running time to ensure each runs only within its fairly allocated share of the processor. This bookkeeping lives in the scheduler entity structure, reachable through the se pointer in the process's task_struct (see the previous article, "What you need to know about Linux process management").

The important member variable in that structure is vruntime, which records the process's total running time (the sum of time spent actually executing), weighted by factors such as priority and machine load. This virtual runtime is measured in nanoseconds, so vruntime is decoupled from the system timer tick.

The kernel updates a process's vruntime by periodically calling update_curr() (defined in kernel/sched_fair.c), which measures the current process's execution time and calls __update_curr() to weight it by machine load (the total number of runnable processes); the weighted value is then added to the old vruntime to produce the new vruntime.

Select the process to run

If a "perfect multitasking processor" as described above existed, the vruntime of every runnable process would be identical. No such processor exists, but CFS keeps pushing in that direction: each time it selects a process, it picks the one with the smallest vruntime.

CFS uses a red-black tree to organize the run queue. An rbtree is a self-balancing binary search tree in which every node carries a key that allows data on the node to be retrieved quickly. (I'll write a separate article on the data structure later; database indexes are also generally built on balanced search trees.)

CFS maintains a red-black tree (in kernel/sched_fair.c) containing all runnable processes, keyed by virtual runtime. When the CFS scheduler needs to pick the next process to run, it simply calls __pick_next_entity() to find the process with the smallest virtual runtime, which is the leftmost leaf of the tree (the kernel caches it in a variable for quick access), and runs it.

When a blocked process becomes runnable again, or a new process is created by fork(), it is inserted into the red-black tree by enqueue_entity(). To ensure a quick response, the newly inserted node is compared with the cached leftmost leaf; if the new node becomes the leftmost leaf, the scheduler puts the new process into action immediately.

Conversely, when a process blocks (or otherwise becomes unrunnable), the scheduler removes it from the red-black tree via dequeue_entity(); if the leftmost node is the one removed, rb_next() is called to find the new leftmost node.

Scheduler selection

As mentioned earlier, the kernel supports multiple scheduler classes, and CFS is just one of them. The main entry point for process scheduling is the function schedule(), defined in kernel/sched.c. All it does is select a process and put it on the processor, and its logic is very simple:

  1. Call pick_next_task() to get a task
  2. Put that task into execution

The real scheduling logic is in pick_next_task(), which does the following:

  1. Iterate over the scheduler classes in priority order
  2. As soon as a class has a runnable process, return that process immediately

pick_next_task() includes a simple optimization for CFS: if all runnable tasks belong to the CFS scheduler class, it skips the iteration over the other classes and takes tasks straight from the CFS scheduler.

Sleep and wake up

A sleeping (blocked) process is in a special non-executable state. A process can block for many reasons: waiting for a signal, waiting for keyboard input from the user, and so on. Whatever the reason, the kernel's handling is the same: the process marks itself as sleeping, removes itself from the red-black tree of runnable processes, puts itself on a wait queue, and then calls schedule() to select and execute another process.

As mentioned in the previous article, there are two states for sleeping processes: TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE. In either state, the sleeping process is on the same wait queue.

A wait queue is a simple linked list of processes waiting for some event to occur. When the event associated with the queue occurs, the processes on it are woken up. To avoid race conditions, the sleep and wakeup paths cannot be implemented naively: a careless implementation can let a process go to sleep after its wait condition has already become true, in which case it would sleep forever. So the process enqueues itself as follows:

  1. Call the macro DEFINE_WAIT() to create an entry for the wait queue
  2. Call add_wait_queue() to put itself on the queue
  3. Call prepare_to_wait() to change its state to TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE
  4. If the state is TASK_INTERRUPTIBLE, a signal may wake the process (a spurious wakeup) so that it can check and handle the signal
  5. On waking, check whether the wait condition is true; if so, break out of the loop, otherwise call schedule() again and repeat
  6. Once out of the loop (the condition is satisfied), set itself to TASK_RUNNING and call finish_wait() to remove itself from the wait queue

Wakeup is done by wake_up(), which wakes all processes on the specified wait queue. Its main logic is in its call to try_to_wake_up():

  1. Set the process to TASK_RUNNING
  2. Call enqueue_task() to put the process back into the red-black tree
  3. If the woken process has higher priority than the currently executing process, preempt immediately
  4. Set the need_resched flag (the flag that triggers rescheduling)

However, as noted above, because spurious wakeups exist, a process being woken does not always mean its wait condition has been fulfilled.

Preemption and context switching

Context switching is the switch from one runnable process to another. To close the article, let's talk about context switching and process preemption.

The kernel handles it with the function context_switch(), defined in kernel/sched.c, which basically does two things:

  1. Call switch_mm() to switch the virtual memory mapping from the previous process to the new one
  2. Call switch_to() to switch the processor state from the previous process to the new one (stack, registers, and so on)

User preemption happens when the kernel is about to return to user space with need_resched set, causing schedule() to be called. A process returning from the kernel to user space, whether from a system call or from an interrupt handler, knows it is safe either to keep running or to let a new process run, so at these points it checks need_resched to decide whether schedule() needs to be called. In summary, user preemption occurs when:

  • Returning to user space from a system call
  • Returning to user space from an interrupt handler

Because Linux fully supports kernel preemption, the kernel can preempt the task currently executing at any point where rescheduling is safe, that is, where no locks are held. Linux tracks this by counting: the preempt_count field in thread_info records the locks held, and when it is 0, the kernel is safe to preempt.

If a task in the kernel blocks, or explicitly calls schedule(), explicit kernel preemption occurs. This form has always been supported, because a function that explicitly calls schedule() knows it is safe to be preempted.

Kernel preemption typically occurs when:

  • When an interrupt handler exits, before returning to kernel space
  • When kernel code becomes preemptible again
  • When a task in the kernel explicitly calls schedule()
  • When a task in the kernel blocks (which in turn calls schedule())