Multithreading is now everywhere: client, server, and Web development all rely on it. A thread is the smallest unit of execution that the operating system can schedule, and all threads within the same process share that process's system resources.

Thread

Three basic concepts

  • Kernel thread: a thread implemented in kernel space and managed directly by the kernel
  • User thread: a thread implemented in user space; the kernel knows nothing about it, and it is managed entirely in user mode
  • Lightweight process (LWP): the kernel's support structure for user threads; it is the middle layer between user threads and kernel threads, a high-level abstraction over kernel threads

Threading models

  1. One-to-one model (kernel-level threading model): the process creates LWPs, and since each LWP corresponds to one kernel thread, creating a user thread effectively creates a kernel thread, which the kernel then manages. Most of the languages we are familiar with use this model, such as C#, Java, and C. (These languages have ways to implement user-mode threading as well, but the languages' own threads are one-to-one.)
  2. Many-to-one model (user-level threading model)



    The user process creates multiple user-level threads that all map to the same LWP, so essentially only one kernel thread runs in kernel mode. Thread scheduling therefore happens in user mode, through a thread library in user space.

    The advantage of this model is that scheduling happens in user mode, so there is no need to frequently switch between kernel mode and user mode, which greatly improves threading efficiency. Python's coroutines are implemented on this model, but the model cannot achieve true parallelism.

  3. Many-to-many model (hybrid user-level and kernel-level threading model)



    The many-to-many model combines the advantages of the two models above. Multiple user-mode threads map onto multiple LWPs, but not in one-to-one correspondence; the bindings between user threads and LWPs are managed and scheduled in user mode.

    A many-to-many model can therefore run a large number of user-mode threads on relatively few kernel threads. These user-mode threads are what we usually call coroutines. Go's coroutines use the many-to-many model, scheduled in user mode by the Go runtime, which is the reason for Go's high concurrency.

Coroutine (Goroutine)

As mentioned above, coroutines are user-mode threads whose scheduling is implemented entirely in user mode. The Go language is probably the most representative example here (after all, high concurrency is its selling point), so we take Go's coroutine, the goroutine, as our object of study.
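For reference, a minimal goroutine example; each go statement packages a function call into a G that the runtime later schedules onto some worker thread:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		// Each `go` statement packages this call into a G; the runtime
		// decides which thread (M) eventually runs it.
		go func(n int) {
			defer wg.Done()
			fmt.Println("hello from goroutine", n)
		}(i)
	}
	wg.Wait() // block until all three Gs have run
}
```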

The GM model

In the early days of the Go language design, the Go team implemented coroutines using a simple GM model.

G: the goroutine, i.e. the coroutine. It normally starts with 2KB of stack memory, which expands automatically if it runs out of space. A G can be thought of as a packaged snippet of code; it is not itself a unit of execution and must be picked up by a thread to run.

M: the machine, a worker thread that actually executes the code. Its lifecycle is managed by Go's runtime.

In the early days of Go, a single global queue held all Gs. When the user created a new G, the runtime appended the packaged G to the tail of the global queue and updated the tail pointer. When an M finished executing a G, it fetched the next G from the head of the global queue and updated the head pointer.

There is a serious problem here: if any thread may add a G and any M may take a G at will, you have a data race. The solution is as simple as locking, and that is exactly what the Go team did. But the lock means every access goes through a single global mutex, degrading the efficiency of the whole scheduler. Beyond the lock contention, worker threads were also frequently blocked and unblocked around system calls, among other issues [2]. The official team therefore reworked the scheduling model and introduced the concept of P.
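A toy sketch of that single locked global run queue, as an illustration of the description above rather than the runtime's actual code; every enqueue and every dequeue from every M contends on the same mutex:

```go
package main

import (
	"fmt"
	"sync"
)

type G struct{ id int }

// globalRunQueue models the GM era's single shared queue.
type globalRunQueue struct {
	mu sync.Mutex
	gs []*G
}

// put appends a new G at the tail, as the runtime did on `go f()`.
func (q *globalRunQueue) put(g *G) {
	q.mu.Lock() // every producer takes the same lock
	defer q.mu.Unlock()
	q.gs = append(q.gs, g)
}

// get pops from the head, as an M did when it needed the next G.
func (q *globalRunQueue) get() *G {
	q.mu.Lock() // every M takes the same lock to fetch work
	defer q.mu.Unlock()
	if len(q.gs) == 0 {
		return nil
	}
	g := q.gs[0]
	q.gs = q.gs[1:]
	return g
}

func main() {
	q := &globalRunQueue{}
	q.put(&G{id: 1})
	fmt.Println("fetched G", q.get().id)
}
```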

The GMP model

The new model introduces the concept of P; G and M are largely unchanged.

P: the processor, which represents the context M needs to run; it can also be understood as a token that M must hold in order to run. When a P has tasks, it needs to create or wake an M to execute the tasks in its queue, so the number of Ps determines how many tasks can execute in parallel. This can be set with runtime.GOMAXPROCS, which in current Go versions defaults to the number of CPU cores.
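A quick demonstration using the real runtime.GOMAXPROCS API, which queries the current value when given 0 and sets a new value otherwise:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) queries the current value without changing it;
	// by default it equals the number of CPU cores.
	fmt.Println("P count:", runtime.GOMAXPROCS(0))
	fmt.Println("CPU cores:", runtime.NumCPU())

	// Setting it caps how many Ms can execute Go code in parallel.
	runtime.GOMAXPROCS(4)
}
```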

The diagram above is a simplified redrawing of Cao Da's GMP model diagram.

P structure: each P is bound to one M, and the P structure contains a runnext field and a local queue. runnext stores the next G for M to execute, and the local queue is a fixed-size array of 256 slots that requires no lock. This effectively solves the problem of frequent access to the global queue: runnext and the local queue are private to the P-M pair, no other execution unit competes for them, so no locking is needed.
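For reference, an abridged sketch of the P structure; the field names follow Go's runtime/runtime2.go, though the real struct carries many more fields:

```go
package main

// The runtime stores G and M pointers as integers; these stand-ins
// keep the sketch self-contained.
type guintptr uintptr
type muintptr uintptr

// p is heavily abridged from runtime/runtime2.go.
type p struct {
	m        muintptr      // back-link to the M currently bound to this P
	runnext  guintptr      // next G to run, checked before the local queue
	runqhead uint32        // consumer end of the ring buffer
	runqtail uint32        // producer end of the ring buffer
	runq     [256]guintptr // fixed-size, lock-free local run queue
}

func main() { _ = p{} }
```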

M structure:

There are two special Ms in the runtime:

① The main thread, dedicated to handling the main logic;

② The monitor thread sysmon, which needs no P to run. Inside it is an endless loop that continually checks for Gs that are blocked or have been executing too long, and preempts them once found.

Ordinary M:

The global scheduler holds many Ms. Idle Ms are placed on the global scheduler's idle M list, and when an M is needed it is taken from this list first. Each M structure references three Gs: gsignal, the G with which M handles runtime signals (including parts of the wakeup mechanism); g0, the G whose stack M switches to when executing the runtime's scheduling code; and the G carrying the user logic that M is currently executing.
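For orientation, an abridged sketch of the M structure; the field names g0, gsignal, and curg follow Go's runtime/runtime2.go, but nearly everything else is omitted:

```go
package main

// Placeholder; the real g holds the stack, status, and much more.
type g struct{}

// m is heavily abridged from runtime/runtime2.go.
type m struct {
	g0      *g // G whose stack M switches to when running scheduling code
	gsignal *g // G used by M to handle runtime signals
	curg    *g // the user G that M is currently executing
}

func main() { _ = m{} }
```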

M has several states:

  • Spinning: M has not yet found a goroutine to execute but is still looking. Together with the nmspinning counter, this state helps the runtime decide whether additional threads need to be started to execute goroutines.
  • Executing Go code: M is executing Go code and holds a P;
  • Executing native code: M is executing native code or a blocking syscall and does not hold a P;
  • Sleeping: M found no G to run, goes to sleep, and is added to the idle M linked list; it holds no P at this time.

Scheduling

The whole scheduling process is one of production and consumption: the user, as producer, creates a goroutine, which is packaged into a G; after a series of scheduling steps by the runtime, that G is consumed (executed) by an M.

The producer side

Production is essentially the process of placing a packaged G into a queue: the new G goes into the current P's runnext slot or local queue, and if the local queue is full, half of it is flushed to the global queue in one batch.
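A compact, illustrative sketch of this producer path, with a tiny queue capacity so the overflow is easy to see; the name runqput mirrors the runtime's, but the code is a toy, not the real implementation:

```go
package main

import "fmt"

type G struct{ id int }

// P here carries only the producer-side state; the slice stands in
// for the runtime's fixed 256-slot ring buffer.
type P struct {
	runnext *G
	runq    []*G
}

var globalQ []*G

const localCap = 4 // tiny capacity so the overflow path triggers

// runqput: the new G takes runnext; the displaced G goes to the local
// queue; when the local queue is full, half of it (plus the displaced
// G) is flushed to the global queue in one batch.
func runqput(p *P, g *G) {
	old := p.runnext
	p.runnext = g
	if old == nil {
		return
	}
	if len(p.runq) < localCap {
		p.runq = append(p.runq, old)
		return
	}
	half := len(p.runq) / 2
	batch := append([]*G{}, p.runq[:half]...)
	batch = append(batch, old)
	p.runq = append(p.runq[:0], p.runq[half:]...)
	globalQ = append(globalQ, batch...)
}

func main() {
	p := &P{}
	for i := 1; i <= 8; i++ {
		runqput(p, &G{id: i})
	}
	fmt.Println("local:", len(p.runq), "global:", len(globalQ))
}
```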

The consumer side

Compared with production, consumption is much more complex. It is mainly about how P and M are scheduled; the final execution is carried out by M.

As mentioned earlier, each M has a special G0 used for scheduling. When M finishes executing a G, it switches to g0's stack, calls the schedule function to find the next G, then switches to that G's stack and executes it.

Note that on every pass a counter keeps incrementing: this schedtick is bumped each time schedule dispatches a G. Let's look at how a G is found on each call to schedule.

The search for an executable G proceeds in stages; as soon as a G is found at any stage, schedule ends the search and returns that G for execution.

  1. In the schedule function, when schedtick % 61 == 0, i.e. every 61st dispatch (why 61? It appears to be a magic number of the author's own choosing), go to the global queue once, which guarantees that Gs in the global queue eventually get executed. Otherwise, go to the local queue of the P bound to the current M, executing the runnext G first.
  2. When no G is available in the local queue, enter findrunnable, a function dedicated to finding a G.
  3. The first step inside findrunnable is again to check the runnext slot and local queue of the currently bound P.
  4. Then take from the global queue: the number fetched is global queue length / gomaxprocs + 1, capped at 128; the batch goes into the local queue and one G is taken out to execute.
  5. Take a ready G from the netpoll network poller (which monitors time-consuming operations such as network I/O and file I/O) and resume its execution.
  6. Work stealing: steal Gs from the local queue of another P (picked by a random algorithm), taking half of that queue.
  7. Try the global queue again.
  8. Check the queues of all idle Ps; if any has a runnable G, bind to that P and execute.
  9. Try netpoll one more time.

    Finally, if the entire findrunnable pass turns up no G to execute, the current M goes to sleep and is placed on the global idle M list, waiting to be woken. A condensed sketch of this search order follows.
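Below is a condensed, runnable toy of the consumer-side search order; the real logic lives in the runtime's schedule and findrunnable in proc.go and is far more involved, and netpoll and work stealing are omitted here:

```go
package main

import "fmt"

type G struct{ id int }

// P carries only what this toy scheduler needs.
type P struct {
	schedtick uint32
	runnext   *G
	runq      []*G
}

var globalQ []*G

// takeLocal checks runnext first, then the local queue.
func (p *P) takeLocal() *G {
	if g := p.runnext; g != nil {
		p.runnext = nil
		return g
	}
	if len(p.runq) > 0 {
		g := p.runq[0]
		p.runq = p.runq[1:]
		return g
	}
	return nil
}

func takeGlobal() *G {
	if len(globalQ) == 0 {
		return nil
	}
	g := globalQ[0]
	globalQ = globalQ[1:]
	return g
}

// schedule picks the next G following the order in the list above.
func schedule(p *P) *G {
	p.schedtick++
	// step 1: every 61st tick, check the global queue first so it
	// never starves
	if p.schedtick%61 == 0 {
		if g := takeGlobal(); g != nil {
			return g
		}
	}
	// runnext, then the local queue (steps 1 and 3)
	if g := p.takeLocal(); g != nil {
		return g
	}
	// step 4: the global queue (netpoll and stealing, steps 5-6, are
	// omitted in this toy)
	if g := takeGlobal(); g != nil {
		return g
	}
	return nil // steps 7-9 failed: the M would park itself here
}

func main() {
	p := &P{runnext: &G{id: 1}, runq: []*G{{id: 2}}}
	globalQ = []*G{{id: 3}}
	for g := schedule(p); g != nil; g = schedule(p) {
		fmt.Println("run G", g.id)
	}
}
```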

This article ignores GC-related handling and many other details; the real scheduling process is very complicated. The Go source code is open and carries detailed comments.

References

[1] "Can't tell processes, threads, and coroutines apart? One article to sort them out for you", https://mp.weixin.qq.com/s/Nb…
[2] "Scalable Go Scheduler Design Doc", https://docs.google.com/docum…