One of Golang's distinctive features is the goroutine, which is key to its support for high concurrency. Golang can create thousands of goroutines to handle tasks, and it allocates, loads, and schedules them onto processors using the G-M-P model.

What is a Goroutine

Goroutine = Golang + Coroutine. A goroutine is a coroutine implemented by Golang and is a user-level thread. Goroutines have the following features:

  • It is cheap to start compared with a thread, beginning with a small stack (about 2 KB)
  • Its stack can grow and shrink dynamically, up to about 1 GB
  • It runs in user mode, so switching between goroutines is cheap
  • Its relation to threads is M:N, that is, M goroutines are multiplexed and scheduled onto N system threads

Processes, threads, and goroutines

In process-only operating systems, the process is the basic unit of both resource ownership and independent scheduling. In operating systems that introduce threads, the thread is the basic unit of independent scheduling and the process is the basic unit of resource ownership. A thread switch within the same process does not cause a process switch; switching between threads of different processes, such as from a thread in one process to a thread in another, does.

The way threads are created, managed, and scheduled is called the thread model. Thread models are generally divided into the following three types:

  • The Kernel Level Thread model
  • The User Level Thread model
  • The two-level threading model, also known as the hybrid threading model

The biggest difference between the three threading models lies in the correspondence between user-level threads and Kernel Scheduling Entities (KSEs). A KSE is an entity that can be scheduled by the operating-system kernel's scheduler; it is the kernel's minimum scheduling unit and can be simply understood as a kernel-level thread.

User-level threads, also known as coroutines, are created and managed by the application and must be bound to kernel-level threads before they can execute. CPU scheduling of threads is preemptive, while user-mode scheduling of coroutines is cooperative: the next coroutine runs only after the current one cedes the CPU.

User-level threads (ULT) vs. kernel-level threads (KLT):

| Feature | User-level thread | Kernel-level thread |
| --- | --- | --- |
| Creator | The application | The kernel |
| Is the OS aware of it | No | Yes |
| Overhead | Low creation and context-switch cost; context switching needs no hardware support | High creation and context-switch cost; context switching requires hardware support |
| If a thread blocks | The entire process blocks, so multiple processors cannot be exploited for concurrency | Other threads can continue to execute without blocking |
| Examples | Java threads, POSIX threads | Windows, Solaris |

Kernel-level threading model

In the kernel-level threading model, user threads and kernel threads have a one-to-one (1:1) relationship. Thread creation, destruction, and switching are all done by the kernel; the application does not participate in thread management and can only call the kernel-level thread programming interface (a system call is made whenever the application creates or destroys a thread). Each user thread is bound to one kernel thread for its lifetime; once the user thread terminates, both threads leave the system.

The operating system scheduler manages, schedules, and dispatches these threads. The runtime library requests a kernel-level thread for each user-level thread. The memory management and scheduling subsystem of an operating system must account for a large number of user-level threads. The operating system creates a context for each thread. Each thread of a process can be assigned to the processor kernel when resources are available.

The kernel-level threading model has the following advantages:

  • In multiprocessor systems, the kernel can execute multiple threads within the same process in parallel
  • If one thread in a process is blocked, other threads are not blocked, and other threads in the same process can be switched to continue execution
  • When a thread blocks, the kernel can choose to run a thread from another process, whereas in a user-space implementation the runtime system only ever runs threads from its own process

Disadvantages:

  • Creating and deleting threads requires kernel involvement via system calls, which is expensive

User-level threading model

In the user-level threading model, user threads have a many-to-one (N:1) relationship with kernel threads (KSEs). Thread creation and destruction, as well as coordination and synchronization between threads, are all done in user mode, specifically by the application's thread library. The kernel is unaware of this, and the kernel's scheduling is process-based. At the macro level, only one thread per process can be running at any time, and only one processor core is assigned to that process.

As the figure above shows, the library scheduler selects a thread from the process's threads and associates it with the kernel thread allotted to the process. Kernel threads are then assigned to processor cores by the operating-system scheduler. User-level threading is thus a many-to-one thread mapping.

User-level threads have the following advantages:

  • Thread management operations such as creation, destruction, and switching cost much less than with kernel threads, because saving and restoring thread state involves only local procedures in user space
  • Threads can utilize more table space and stack space than kernel-level threads

Disadvantages:

  • When a thread makes a blocking system call due to I/O or a page fault, the kernel, unaware that multiple threads exist, blocks the entire process and thus all its threads. Consequently, only one thread in a process can actually be running at any time
  • Resources are scheduled per process, so on a multiprocessor the threads of one process can only be multiplexed onto a single processor

Two-level threading model

In the two-level threading model, user threads and kernel threads have a many-to-many (N:M) relationship. This model absorbs the advantages of the two models above and avoids their disadvantages as far as possible. Threads are created in user space, and thread scheduling and synchronization also take place in the application. The application's user-level threads are bound to a number of kernel-level threads that is less than or equal to the number of user-level threads.

Golang’s threading model

Golang implements the hybrid (two-level) threading model underneath. M is a system thread, created via a system call; each M is associated with one KSE, i.e. the kernel thread of the two-level model. G stands for goroutine, the application-level thread of the two-level model. The relation between G and M is N:M.

Overview of the G-M-P model

G, M, and P stand for:

  • G (goroutine): the Go coroutine, the smallest unit of scheduling and execution
  • M (machine): a system-level thread
  • P (processor): a logical processor. The local run queue (also called the LRQ) associated with each P holds runnable G's and can store up to 256 of them.

The GMP scheduling process is roughly as follows:

  • To run tasks, a thread M must first acquire a P and become associated with it.
  • M then gets a G from P's local run queue (LRQ).
  • If there is no runnable G in the LRQ, M tries to take a batch of G's from the global run queue (GRQ) and put them into P's local queue.
  • If no runnable G is found in the global queue either, M randomly steals half of the G's from another P's local queue and puts them into its own P's local queue.
  • After obtaining a runnable G, M runs it; when that G finishes, M takes the next G from P, and so on.

The scheduling life cycle

  • M0 is the main thread, numbered 0, started with the program. Its instance lives in a global runtime variable, so M0 does not need to be allocated on the heap. M0 performs initialization and starts the first G; after that, M0 behaves like any other M
  • G0 is the first goroutine created whenever an M starts. G0 is used only for scheduling; it does not point to any executable user function, and each M has its own G0. G0's stack is used during scheduling and system calls, and the global variable g0 is the g0 of M0

The life cycle process described above:

  • The runtime creates the initial thread m0 and goroutine g0 and associates them (g0.m = m0)
  • Scheduler initialization: set the maximum number of M's, the number of P's, stack and memory limits, and create GOMAXPROCS P's
  • The main function in user code is main.main; the runtime also has its own main function, runtime.main. At startup the runtime creates a goroutine for runtime.main (call it the main goroutine) and adds it to P's local queue.
  • M0 starts, binds to a P, and fetches the main goroutine from P's local queue.
  • G owns a stack; M sets up the running environment according to the stack and scheduling information in G
  • M runs G
  • G exits and M fetches another runnable G, and so on, until main.main exits; runtime.main then runs deferred functions and panic handling, and finally calls runtime.exit to terminate the program.

The number of G's, M's, and P's

Quantity of G:

In theory there is no upper limit. Use runtime.NumGoroutine() to check the current number of G's.

Quantity of P:

Determined at startup by the environment variable $GOMAXPROCS or by calling runtime.GOMAXPROCS(). This means at most GOMAXPROCS goroutines are running at any point during program execution.

Quantity of M:

A limit set by the Go runtime itself: when a Go program starts, it sets a maximum number of M's, 10000 by default. The kernel can rarely support that many threads, so this limit can usually be ignored. runtime/debug.SetMaxThreads() changes the maximum. If an M blocks, a new M may be created; there is no fixed relation between the number of M's and the number of P's. If one M blocks, its P will create or switch to another M, so even with a single P, many M's may be created.

The process state of the schedule

From the picture above, we can see:

  • Each P has a local queue holding the goroutines to execute (flow 2). When the local queue of the P bound to M is full, new goroutines are put into the global queue (flow 2-1).
  • Each P is bound to an M; M is the entity that actually executes the goroutines in P (flow 3), retrieving G's from the local queue of the P it is bound to
  • When the local queue of the P bound to M is empty, M fetches G's from the global queue into the local queue and executes them (flow 3.1). When no runnable G can be fetched from the global queue either, M steals G's from another P's local queue (flow 3.2); this is called work stealing
  • When G blocks on a system call (syscall), M blocks too. P then unbinds from M (hand off) and looks for an idle M; if there is none, a new M is created (flow 5.1).
  • When G blocks on a channel or network I/O, M does not block: M looks for another runnable G. When the blocked G recovers, it becomes runnable again and re-enters a P's queue for execution (flow 5.3).

Blocking in scheduling

Blocking of the GMP model can occur in one of the following ways:

  • blocking on I/O or select
  • blocking on a system call (syscall)
  • blocking on a channel operation
  • waiting for a lock
  • calling runtime.Gosched()

User-mode blocking

When a goroutine blocks on a channel operation or network I/O (in fact Golang implements goroutine network I/O with the netpoller, so only the G blocks, not the M), the G is placed on a wait queue (such as a channel's waitq) and its status changes from _Grunning to _Gwaiting. M skips that G and tries to get and execute the next G. If there is no runnable G for M to run, M unbinds from P and goes to sleep. When the blocked G is woken by the other end (for example, a channel's readable/writable notification), it is marked runnable and tries to join P's runnext slot, then P's local queue, then the global queue.

System call blocking

When G blocks in a system call, G enters the _Gsyscall state and M is blocked on the syscall as well. At this point M can be preempted and rescheduled: the M executing G is unbound from P, and P tries to bind to another idle M to continue executing other G's. If there is no idle M but P's local queue still has G's to execute, a new M is created. When the system call completes, G tries to acquire an idle P and enter its local queue to resume execution; if there is no idle P, G is marked runnable and added to the global queue.

G-M-P internal structures

The internal structure of G

The important fields of G's internal structure are shown below; see the source code for the complete struct:

type g struct {
    stack        stack    // the g's own stack
    m            *m       // the M this g belongs to
    sched        gobuf    // saved execution context, used to resume when the goroutine is switched back in
    atomicstatus uint32   // the running status of g
    goid         int64
    schedlink    guintptr // next g; g linked list
    preempt      bool     // preemption flag
    lockedm      muintptr // locked M; this g must resume on that specific M
    gopc         uintptr  // address of the go statement that created this goroutine
    startpc      uintptr  // address of the goroutine's function
}

G has 9 states; see the source code:

| State | Value | Meaning |
| --- | --- | --- |
| _Gidle | 0 | Just allocated, not yet initialized. |
| _Grunnable | 1 | In a run queue; user code is not currently executing. |
| _Grunning | 2 | Not in a run queue; user code is executing, and an M and a P are assigned. |
| _Gsyscall | 3 | Executing a system call; an M is assigned. |
| _Gwaiting | 4 | Blocked at runtime; not executing user code and not in a run queue, waiting somewhere. The reason for the wait is recorded (see waitReason in the source code). |
| _Gmoribund_unused | 5 | Currently unused, but hard-coded in gdb scripts. |
| _Gdead | 6 | Currently unused; the g may have just exited or just been initialized, and is not executing user code. |
| _Genqueue_unused | 7 | Currently unused. |
| _Gcopystack | 8 | Its stack is being copied; not executing user code and not in a run queue. |

The internal structure of M

The important fields of M's internal structure are shown below; see the source code for the complete struct:

type m struct {
    g0      *g     // each M has its own g0, the scheduling goroutine

    curg          *g       // the g currently running on this M
    p             puintptr // the P this M is attached to
    nextp         puintptr // the P to attach when this M is woken up
    id            int64
    spinning      bool     // whether this M is spinning, looking for work

    park          note
    alllink       *m       // on allm
    schedlink     muintptr // next m; m linked list
    mcache        *mcache  // memory allocation cache
    lockedg       guintptr // the counterpart of g.lockedm
    freelink      *m       // on sched.freem
}

The internal structure of P

The important fields of P's internal structure are shown below; see the source code for the complete struct:

type p struct {
    id          int32
    status      uint32 // the state of P
    link        puintptr // Next P, P linked list
    m           muintptr // the M that owns this P
    mcache      *mcache  

    // P Local runnable G queue, lockless access
    runqhead uint32
    runqtail uint32
    runq     [256]guintptr
    
    runnext guintptr // A runnable G with a higher priority than runq

    // Free G's in the dead state; new goroutines are obtained from here
    gFree struct {
        gList
        n int32
    }

    gcBgMarkWorker       guintptr // (atomic)
    gcw gcWork

}

P has the following states; see the source code:

| State | Value | Meaning |
| --- | --- | --- |
| _Pidle | 0 | Just allocated, not yet initialized. |
| _Prunning | 1 | Bound to an M; set when M calls acquirep to bind the P. |
| _Psyscall | 2 | A system call is in progress. |
| _Pgcstop | 3 | Stopped for GC; stays in this state until GC completes. |
| _Pdead | 4 | No longer in use. |

The internal structure of the scheduler

The important fields of the scheduler's internal structure are shown below; see the source code for the complete struct:

type schedt struct {

    lock mutex

    midle        muintptr // Free M list
    nmidle       int32    // The number of idle M's
    nmidlelocked int32    // The number of M locked
    mnext        int64    // The number of M created and the ID of the next M
    maxmcount    int32    // The maximum number of M can be created
    nmsys        int32    // The number of system M's, not counted for deadlock detection
    nmfreed      int64    // The total number of M released

    pidle      puintptr // Free P list
    npidle     uint32   // The number of idle P's

    runq     gQueue // Global runnable G queue
    runqsize int32  // The size of the global runnable G queue

    // Global cache of dead G's.
    gFree struct {
        lock    mutex
        stack   gList // Gs with stacks
        noStack gList // Gs without stacks
        n       int32
    }

    // freem is the list of m's waiting to be freed when their
    // m.exited is set. Linked through m.freelink.
    freem *m
}

Observe the scheduling process

The GODEBUG trace method

The GODEBUG variable controls debugging within the runtime; its value is a comma-separated list of name=val pairs. The following two parameters can be used to observe GMP:

  • schedtrace: setting schedtrace=X causes the runtime to emit a one-line scheduler summary to standard error every X milliseconds.

  • scheddetail: setting schedtrace=X,scheddetail=1 causes the runtime to emit detailed, multi-line information about the state of the scheduler, processors, OS threads, and goroutines every X milliseconds.

package main

import (
    "sync"
    "time"
)

func main() {
	var wg sync.WaitGroup

	for i := 0; i < 2000; i++ {
		wg.Add(1)
		go func() {
			a := 0

			for i := 0; i < 1e6; i++ {
				a += 1
			}

			wg.Done()
		}()
		time.Sleep(100 * time.Millisecond)
	}

	wg.Wait()
}

Execute the following command:

GODEBUG=schedtrace=1000 go run ./test.go

The output is as follows:

SCHED 0ms: gomaxprocs=1 idleprocs=1 threads=4 spinningthreads=0 idlethreads=1 runqueue=0 [0]
SCHED 1001ms: gomaxprocs=1 idleprocs=1 threads=4 spinningthreads=0 idlethreads=2 runqueue=0 [0]
SCHED 2002ms: gomaxprocs=1 idleprocs=1 threads=4 spinningthreads=0 idlethreads=2 runqueue=0 [0]
SCHED 3002ms: gomaxprocs=1 idleprocs=1 threads=4 spinningthreads=0 idlethreads=2 runqueue=0 [0]
SCHED 4003ms: gomaxprocs=1 idleprocs=1 threads=4 spinningthreads=0 idlethreads=2 runqueue=0 [0]

Description of the output:

  • SCHED XXms: SCHED is the tag of scheduler trace lines; XXms is the time elapsed since program start
  • gomaxprocs: the number of P's, equal to the number of CPU cores or the value of the GOMAXPROCS environment variable
  • idleprocs: the number of idle P's; the difference from gomaxprocs is the number of running P's
  • threads: the number of threads, i.e. the number of M's
  • spinningthreads: the number of spinning threads; when an M finds no goroutine to schedule, it is not destroyed but enters a spin state
  • idlethreads: the number of idle threads
  • runqueue: the number of G's in the global queue
  • [0]: the number of G's in each P's local queue; the brackets contain one number per P

Go Tool Trace

package main

import (
	"fmt"
	"os"
	"runtime/trace"
)

func main() {
	// Create a trace file
	f, err := os.Create("trace.out")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Start tracing
	err = trace.Start(f)
	if err != nil {
		panic(err)
	}
	defer trace.Stop()

	// main logic
	fmt.Println("Hello trace")
}

Run the following command to generate the trace file trace.out:

go run test.go

Run the following command; it opens the trace console in a browser:

go tool trace trace.out

Conclusion

  1. Golang adopts the hybrid threading model; the relation between threads and coroutines is N:M.
  2. Golang's hybrid threading model is scheduled with the GMP model: G is the goroutine, the coroutine implemented by Golang; M is an OS thread; P is a logical processor.
  3. Each M must be bound to a P. P holds a local queue of runnable G's, and M is the unit that executes G. M first gets a runnable G from its P's local queue; failing that, from the global queue; failing that, it steals from another P's local queue (work stealing). If it still finds nothing, M enters a spinning state rather than being destroyed.
  4. When G blocks at user level, for example on a channel, M is not blocked; M continues looking for other runnable G's, and when the blocked G recovers it re-enters a P's queue to wait for execution. If G blocks in a system call, M blocks too; P is then unbound from M (hand off) and looks for a free M, creating a new one if none is free.
  5. Work stealing and hand off ensure efficient use of threads.

G-M-P's efficiency is guaranteed by the following strategies:

  • M is reusable; it does not need to be repeatedly created and destroyed, and it spins waiting to be woken when there is no runnable goroutine
  • The work stealing and hand off strategies ensure M's are used efficiently
  • The memory allocation cache (mcache) lives on P, so G can be scheduled across M's without the poor locality that cross-M scheduling would otherwise cause
  • M obtains G's from its associated P without taking a lock; the path is lock-free

References

  • Golang scheduler GMP principle and scheduling full analysis
  • Go: Work-Stealing in Go Scheduler
  • There are three ways to implement threads – kernel-level threads, user-level threads, and hybrid threads