process

In the early days of single-tasking computers, a user could submit only one job at a time; that job monopolized all of the system's resources and did only one thing at a time. However, CPU and I/O speeds are very different: a job spends very little time on the CPU and most of its time waiting for I/O.

To make better use of CPU resources, memory is divided into multiple blocks, and different programs use their own memory space without interfering with each other. Here, a single running program is a process, and the CPU can switch among multiple processes as it executes, so CPU utilization becomes much higher.

For the CPU to switch between processes, the process context (program counter, stack, kernel data structures, and so on) must be saved so that execution can resume the next time the process is switched back in. A scheduling algorithm is also required; Linux uses the Completely Fair Scheduler (CFS), based on time slices and priorities.

thread

Multiple processes were introduced to solve CPU utilization problems, so why threads? The answer is to reduce the overhead of context switching.

A process may relinquish the CPU, triggering a switch, at two points in time:

  • The process blocks, e.g. on network I/O, code-level blocking (locks, sleeps, etc.), or system calls
  • The process's time slice is used up and it must release the CPU

Switching the CPU from one process to another requires two steps:

  • Switch the page directory to use the new address space
  • Switch kernel stack and hardware context

In Linux there is no essential difference between a process and a thread. The main difference is that a process has its own memory space, while the threads within a process share a memory space.

A process switch must also switch the memory address space, while a thread switch does not, so a thread switch is cheaper than a process switch.

Why is switching the address space so slow? In Linux, each process's address space is virtual. Translating a virtual address to a physical address requires a page-table lookup, which is slow, so a cache called the TLB (Translation Lookaside Buffer) is used to speed it up. Switching address spaces invalidates these cached translations, so lookups miss the cache until it warms up again.

In summary, threads are designed to reduce overhead during process switching.

coroutines

When an application is I/O intensive (e.g., a web server or gateway), there are two ways to pursue high throughput:

  1. Reduce the overhead of thread creation by using a thread pool. In theory, the larger the pool, the higher the throughput, but a larger pool also means more CPU-switching overhead (see the pool sketch after this list)

Creating and destroying a thread requires system calls, so creating one thread per request is very expensive under high concurrency. A thread's memory footprint is also on the MB level, so not many can exist at once.

Why do more threads mean more CPU switches? More precisely, more runnable threads mean more switches: the operating system's scheduler guarantees fairness, so every runnable thread must get its turn, and the number of switches grows with the number of runnable threads.

  2. Use an asynchronous, non-blocking development model: a single process or thread receives requests and uses I/O multiplexing to keep itself from blocking, eliminating the overhead of context switching
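
As a rough sketch of scheme 1, here is a minimal fixed-size worker pool in Go (Go is used only because the later sections are about Golang; the pool size, task type, and handleRequest function are illustrative assumptions, not from the original text):

```go
package main

import (
	"fmt"
	"sync"
)

// handleRequest is a hypothetical stand-in for real request processing.
func handleRequest(id int) string {
	return fmt.Sprintf("handled request %d", id)
}

func main() {
	const poolSize = 4 // fixed pool: creation cost is paid once, not per request
	tasks := make(chan int)
	var wg sync.WaitGroup

	// Start a fixed set of workers that all consume from one task channel.
	for w := 0; w < poolSize; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range tasks {
				fmt.Println(handleRequest(id))
			}
		}()
	}

	// Submit requests to the pool instead of spawning a worker per request.
	for id := 0; id < 10; id++ {
		tasks <- id
	}
	close(tasks)
	wg.Wait()
}
```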

The advantages and disadvantages of the two schemes are obvious: scheme 1 is simple to implement but performs poorly; scheme 2 performs very well but is complicated to implement. Is there something in between, both simple and fast? Coroutines solve this problem.

A coroutine is an abstraction that exists only from the user's perspective; the operating system has no such concept. The main idea is to implement the scheduling algorithm in user space and complete a large number of tasks with a small number of threads.

Coroutines need to solve several problems encountered by threads:

  • The memory footprint should be small and the creation overhead should be small
  • Reduce the overhead of context switching

The first is easy to achieve: a user-space coroutine is just a data structure, requires no system calls, and can be designed to be small, down to the KB level.
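
Go's goroutines illustrate this: each starts with a KB-level stack, so spawning a very large number is cheap. A minimal sketch (the count of 100,000 is an arbitrary illustration):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup

	// 100,000 OS threads would need on the order of 100 GB of MB-level
	// stacks; 100,000 goroutines need only KB-level stacks each.
	for i := 0; i < 100000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
		}()
	}
	wg.Wait()
	fmt.Println("spawned and finished 100000 goroutines")
}
```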

The second point can only be addressed by reducing the number of context switches, because coroutines ultimately run on threads and the cost of a single switch cannot be reduced from user space. The overall cost can only be lowered by switching less often, which can be done in two ways:

  1. Keep the number of runnable threads as low as possible, so the number of switches is necessarily low
  2. Keep threads running as much as possible, rather than blocking and giving up their time slices

Goroutine

Goroutines are Golang's implementation of coroutines, supported at the language level and very convenient to use. The core of the goroutine runtime is the MPG scheduling model:

  • M: a kernel thread
  • P: a processor used to execute goroutines; it maintains a local run queue of runnable Gs
  • G: a goroutine, its code and data structures
  • S: the scheduler, which maintains information about Ms and Ps

There is also a global run queue.
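
These pieces are visible from standard runtime calls; a small sketch:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) reports (without changing) the number of Ps,
	// i.e. how many goroutines may execute in parallel.
	fmt.Println("Ps:", runtime.GOMAXPROCS(0))
	// Since Go 1.5 the default number of Ps equals the number of CPUs.
	fmt.Println("CPUs:", runtime.NumCPU())
	// NumGoroutine reports the number of live Gs.
	fmt.Println("Gs:", runtime.NumGoroutine())
}
```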

  1. Starting a goroutine in Golang with the go keyword appends it to a P's run queue, where it waits to be scheduled
  2. When the running G0 in M0 blocks (for example on a system call), M0 sleeps and gives up its mounted P0, so that P0 can be taken over by another M
  3. When M0's system call ends, it tries to "steal" a P; if that fails, M0 places G0 on the global run queue
  4. Each P periodically checks the global run queue, so it still has Gs to run once it has digested its local queue, and it also steals Gs from other Ps
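
The scheduler's view of the Ms, Ps, and run queues can be observed with the Go runtime's built-in trace flag (the output line below is a hand-written approximation, not captured output):

```
$ GODEBUG=schedtrace=1000 ./server
SCHED 1004ms: gomaxprocs=8 idleprocs=6 threads=12 spinningthreads=0 idlethreads=5 runqueue=0 [0 1 0 0 0 0 0 0]
```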

From the above, the MPG model seems only to limit the number of threads running at the same time. Since context switching happens only among runnable threads, this already helps, but it is only part of the story.

Golang also intercepts thread blocking at the runtime level and optimizes it, in two categories:

  • Network I/O, channel operations, and locks: only the G blocks; the M and P remain available, i.e. the thread does not yield its time slice
  • System calls: the M blocks and the P must be handed off, so the thread yields its time slice
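
A minimal sketch of the first category (the count and sleep duration are arbitrary): goroutines blocked on a channel are parked by the runtime without blocking any OS thread.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	block := make(chan struct{})

	// 10,000 Gs block on a channel receive: the runtime parks each G,
	// but no M blocks, so the threads keep running other Gs.
	for i := 0; i < 10000; i++ {
		go func() {
			<-block
		}()
	}

	time.Sleep(100 * time.Millisecond) // let the goroutines park
	fmt.Println("parked Gs:", runtime.NumGoroutine()-1)
	// The OS thread count stays small; on Linux it can be checked
	// with `ps -o nlwp -p <pid>` or GODEBUG=schedtrace.
	close(block)
}
```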

So, all in all, switching goroutines is cheaper than switching threads.

conclusion

From a single process to multiple processes, CPU utilization improved; from processes to threads, context-switching overhead fell; from threads to coroutines, context-switching overhead fell further, allowing high-concurrency services to be written with simple code. Each step of the technology was developed to solve a practical problem.