This article introduces the implementation architecture of the Go scheduler (the G-P-M model), which gives Go programs their extremely high concurrency performance, and explains how the scheduler handles thread-blocking scenarios in order to maximize the utilization of computing resources.

How do we make our system faster

With the rapid development of information technology, the processing capacity of a single server has grown ever stronger, pushing programming models to evolve from serial to concurrent.

Concurrency models include I/O multiplexing, multi-process, and multi-threading, each with its own strengths and weaknesses. Modern complex high-concurrency architectures use several models together, applying different models in different scenarios to get the most out of the server.

Multithreading, thanks to its light weight and ease of use, has become the most frequently used model in concurrent programming, and later derivatives such as coroutines are also built on top of it.

Concurrency ≠ parallelism

Concurrency is different from parallelism.

On a single CPU core, threads can switch between tasks via time slices or by yielding control, so that multiple tasks appear to run “at the same time.” This is concurrency. Only one task is actually executing at any given instant; the others are queued by some scheduling algorithm.

Multi-core CPUs allow multiple threads within the same process to run simultaneously in a true sense; this is parallelism.

Process, thread, coroutine

Process: A process is the basic unit of system resource allocation and has independent memory space.

Threads: Threads are the basic unit of CPU scheduling and dispatching. Threads belong to processes, and all threads within a process share its resources.

Coroutines: Coroutines are lightweight threads in user mode. Coroutine scheduling is completely controlled by the user, and switching between coroutines only requires saving the task’s context, with no kernel overhead.

Thread context switch

Interrupt handling, multitasking, user/kernel mode switching, and so on all cause the CPU to switch from one thread to another. The switch requires saving the state of the current thread and restoring the state of the next.

Context switching is expensive because the kernel spends significant time swapping threads. The latency of a context switch is between 50 and 100 nanoseconds, depending on various factors. Given that hardware executes, on average, 12 instructions per nanosecond per core, a context switch costs anywhere from 600 to 1,200 instructions of latency. In practice, context switching can eat up a significant share of a program’s execution time.

If the context switch crosses cores, it may also invalidate the CPU cache (fetching data from cache costs roughly 3 to 40 clock cycles, while fetching it from main memory costs roughly 100 to 300 clock cycles), which makes the switch even more expensive.

Golang is built for concurrency

Since its official release in 2009, Golang has rapidly gained market share thanks to its execution speed and development efficiency. Golang supports concurrency at the language level through a lightweight coroutine called the Goroutine.

Goroutine is very lightweight for two main reasons:

Small context-switch cost: a Goroutine context switch only involves modifying the values of three registers (PC/SP/DX). By contrast, a thread context switch involves a mode switch (from user mode to kernel mode) and refreshing 16 general-purpose registers plus PC, SP, and other registers.

Low memory footprint: a thread’s stack space is typically 2 MB, while a Goroutine’s stack starts at just 2 KB.

A Go program can easily support 100,000 Goroutines running, whereas at 1,000 threads the stack memory footprint already reaches 2 GB.
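
As a minimal sketch of that claim (the count and the final print are just for demonstration), the following program spawns 100,000 Goroutines and waits for them all to finish:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 100000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each Goroutine starts with a small (~2 KB) stack that
			// grows on demand, which is what keeps this affordable.
		}()
	}
	wg.Wait()
	fmt.Println("spawned and finished 100000 goroutines;",
		"still alive:", runtime.NumGoroutine())
}
```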

Go scheduler implementation mechanism

A Go program relies on the scheduler to run Goroutines on kernel threads, but Goroutines are not bound directly to the OS threads (M, Machine) that execute them. Instead, the P (logical Processor) in the Goroutine scheduler acts as an intermediary through which Goroutines gain access to kernel-thread resources.

The Go scheduler model is usually called the G-P-M model. It includes four important structures: G, P, M, and Sched.

G: Goroutine. Each Goroutine corresponds to a G structure, which stores the Goroutine’s run stack, state, and task function; G structures are reusable.

A G is not an execution body; each G must be bound to a P before it can be scheduled for execution.

P: Processor, the logical processor. To a G, a P is the equivalent of a CPU core: a G can be scheduled only after being bound to a P. To an M, a P provides the execution context it needs, such as the mcache and the local task queue.

The number of Ps determines the maximum number of Gs that can run in parallel in the system (provided the number of physical CPU cores >= the number of Ps).

The number of Ps is determined by the user-set GOMAXPROCS; however, no matter how large GOMAXPROCS is, the maximum number of Ps is 256.
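
A quick, minimal way to inspect and adjust this setting with the standard runtime package:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) only queries the current value without changing it.
	fmt.Println("NumCPU:     ", runtime.NumCPU())
	fmt.Println("GOMAXPROCS: ", runtime.GOMAXPROCS(0))

	// Lowering it caps how many Ps (and therefore how many Gs) can run
	// in parallel, regardless of how many cores the machine has.
	prev := runtime.GOMAXPROCS(2)
	fmt.Println("was:", prev, "now:", runtime.GOMAXPROCS(0))
}
```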

M: Machine, an abstraction of an OS kernel thread; it represents the resource that actually performs computation. After binding a valid P, an M enters a scheduling loop, in which it roughly obtains Gs from the global queue, the P’s local queue, and the wait queue.

The number of Ms is variable and adjusted by the Go runtime. To prevent system scheduling from degrading because too many OS threads were created, the default maximum number of Ms is 10000.

An M does not retain G state; this is what allows a G to be scheduled across different Ms.

Sched: the Go scheduler. It maintains the queues that store Ms and Gs, along with some scheduler state information.

The scheduling loop roughly works like this: M takes a G from one of the queues (for example, P’s local queue), switches to G’s execution stack and runs G’s function, calls goexit to clean up, returns to M, and repeats.
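
To make that loop concrete, here is a deliberately simplified toy model of the fetch-execute cycle. It is not the runtime’s actual code (that lives in runtime/proc.go); it only illustrates the queue preference described above:

```go
package main

import "fmt"

// G stands in for a runnable goroutine in this toy model.
type G struct{ id int }

func main() {
	local := []G{{1}, {2}}       // P's local run queue
	global := []G{{3}, {4}, {5}} // global run queue

	// The "M" loop: pick a G, run it, clean up, repeat.
	for len(local)+len(global) > 0 {
		var g G
		if len(local) > 0 { // prefer the local queue
			g, local = local[0], local[1:]
		} else { // fall back to the global queue
			g, global = global[0], global[1:]
		}
		fmt.Printf("M executes G%d and returns to the loop\n", g.id)
	}
}
```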

To understand the relationship among M, P, and G, we can explain it through the classic analogy of gophers carting bricks:



The gophers’ task is this: there are a number of bricks on the site, and a gopher carries the bricks to the fire with the help of a cart. M is the gopher in the picture, P is the cart, and G is the bricks in the cart.

Now that the relationship is clear, let’s focus on how gophers move bricks.

The Processor (P):

A batch of carts (P) is created based on the GOMAXPROCS value set by the user.

Goroutine (G):

The go keyword creates a Goroutine, which is equivalent to making a brick (G) and placing it in the current cart (P).
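
In code, making a brick is a one-liner (a trivial sketch; the Sleep is just a crude way to let the new Goroutine run before main exits):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// The go keyword creates a new G and places it on the current P's
	// local run queue -- a new brick dropped into the cart.
	go fmt.Println("hello from a new goroutine")
	time.Sleep(10 * time.Millisecond)
}
```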

The Machine (M):

Gophers (M) cannot be created by the user directly. Only when there are too many bricks (G) and too few gophers (M) to keep up, while there happens to be a spare cart (P) going unused, are more gophers (M) borrowed from elsewhere until all the carts (P) are in use.

Borrowing a gopher (M) when there are not enough of them corresponds to creating a new kernel thread (M).

Note that a gopher (M) cannot carry bricks without a cart (P). The number of carts (P) determines the number of gophers (M) that can work, which corresponds to the number of active threads in a Go program.

In a Go program, the G-P-M model looks like this:



P stands for the logical processors that can run “in parallel”: each P is assigned to a system thread M, and G is a Go coroutine (Goroutine).

There are two different run queues in the Go scheduler: the global run queue (GRQ) and the local run queue (LRQ).

Each P has an LRQ that manages the Goroutines assigned to execute in that P’s context; these Goroutines are in turn context-switched on and off the M bound to that P. The GRQ holds Goroutines that have not yet been assigned to a P.

As the figure above shows, the number of Gs can be much larger than the number of Ms. In other words, a Go program can use a small number of kernel-level threads to support a large number of concurrent Goroutines. Multiple Goroutines share the computing resources of a kernel thread M through user-level context switches, without the operating-system-level cost of thread context switches.

To make full use of the threads’ computing resources, the Go scheduler adopts the following scheduling strategies:

Work stealing

In reality, some Goroutines run fast and others slow, which inevitably creates an imbalance: some Ps are run off their feet while others sit idle. Go certainly does not allow a P to slack off; it is determined to make full use of computing resources.

To improve Go’s parallel processing capability and overall efficiency, when the G workload is unbalanced across Ps, the scheduler allows a P to obtain runnable Gs from the GRQ, or to steal them from another P’s LRQ, as sketched below.
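
Here is a deliberately tiny toy model of the idea (not the runtime’s actual lock-free algorithm): two workers each own a “local queue”; a worker prefers its own queue and steals from the other only when its own is empty. Output order varies from run to run:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// Two toy "local run queues": queue 0 is overloaded, queue 1 is empty.
	queues := [2]chan int{make(chan int, 8), make(chan int, 8)}
	for task := 1; task <= 8; task++ {
		queues[0] <- task
	}

	var wg sync.WaitGroup
	for id := 0; id < 2; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			other := 1 - id
			for {
				select {
				case task := <-queues[id]: // run work from our own queue first
					fmt.Printf("worker %d ran local task %d\n", id, task)
				default:
					select {
					case task := <-queues[other]: // our queue is empty: steal
						fmt.Printf("worker %d stole task %d\n", id, task)
					default:
						return // nothing local, nothing to steal: stop
					}
				}
			}
		}(id)
	}
	wg.Wait()
}
```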

Reducing blocking

What if an executing Goroutine blocks thread M? Will the Goroutines in that P’s LRQ fail to get scheduled?

Blocking in Go can be divided into the following four scenarios:

Scenario 1: A Goroutine blocks on an atomic, mutex, or channel operation. The scheduler switches out the blocked Goroutine and schedules another Goroutine from the LRQ.
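
A minimal sketch of scenario 1: the first Goroutine below parks on a channel receive, yet the second keeps running on the same thread resources:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ch := make(chan int)

	// This Goroutine parks on a channel receive; the scheduler swaps it
	// out, and the underlying thread M stays free for other Goroutines.
	go func() {
		v := <-ch
		fmt.Println("unblocked, received", v)
	}()

	// A second Goroutine keeps making progress while the first is parked.
	go func() {
		fmt.Println("still running")
		ch <- 42
	}()

	time.Sleep(100 * time.Millisecond) // crude wait for both to finish
}
```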

Scenario 2: A Goroutine blocks on network requests and I/O operations. What do G and M do in this case?

The Go runtime provides a network poller (NetPoller) to handle network requests and I/O operations. Under the hood, I/O multiplexing is implemented via kqueue (macOS), epoll (Linux), or IOCP (Windows).

By routing network system calls through the NetPoller, the scheduler prevents Goroutines from blocking M while those calls are in flight. This lets M execute the other Goroutines in P’s LRQ without creating new Ms, which helps reduce the scheduling load on the operating system.

The diagram below shows how it works: G1 is executing on M, and three more Goroutines are waiting in the LRQ. The network poller is idle, doing nothing.



Next, G1 wants to make a network system call, so it is moved to the network poller, which handles the asynchronous network call. M can then execute another Goroutine from the LRQ; at this point, G2 is context-switched onto M.


Finally, the asynchronous network call completes in the network poller, and G1 is moved back into the LRQ of P. Once G1 can be context-switched onto M, its Go code can execute again. The big advantage here is that no extra Ms are needed to perform network system calls: the network poller uses a system thread that runs an efficient event loop at all times.


This may seem complicated, but thankfully the Go language hides this “complexity” in the runtime: Go developers don’t have to care whether sockets are non-blocking, and they don’t have to register file-descriptor callbacks themselves; they simply treat the socket as blocking I/O in the Goroutine that owns each connection. This enables the simple Goroutine-per-Connection network programming pattern (although a huge number of Goroutines also introduces problems of its own, such as increased stack memory and scheduler load).

The “blocking socket” that the user layer sees inside a Goroutine is actually “simulated” by the Go runtime’s NetPoller using non-blocking sockets plus I/O multiplexing. Go’s net library is implemented this way.
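
For instance, a Goroutine-per-Connection TCP echo server looks entirely blocking to the programmer, while the netpoller does the multiplexing underneath (the listen address here is arbitrary):

```go
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:9000")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		// One Goroutine per connection: the handler reads and writes as
		// if the socket were blocking, while the runtime's netpoller
		// turns the waits into non-blocking I/O + epoll/kqueue/IOCP.
		go func(c net.Conn) {
			defer c.Close()
			io.Copy(c, c) // echo bytes back until the client closes
		}(conn)
	}
}
```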

Scenario 3: Some system calls block when made. In this case, the NetPoller cannot help, and the Goroutine making the system call will block the current M.

Let’s look at a case where a synchronous system call (such as file I/O) blocks M: G1 makes a synchronous system call that blocks M1.



The scheduler intervenes: recognizing that G1 has blocked M1, it detaches M1 from P, taking G1 with it. The scheduler then brings in a new M2 to serve P; G2 can then be selected from the LRQ and context-switched onto M2.


Once the blocking system call completes, G1 is moved back to the LRQ and can be executed by P again, while M1 is set aside for future reuse.
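
One rough way to observe this from user code is the pprof “threadcreate” profile, which counts the threads the runtime has created. The sketch below fires off Goroutines that perform synchronous file reads (a blocking system call that bypasses the netpoller); the path /etc/hosts is a Unix-specific assumption (substitute any local file), and whether the count actually rises depends on timing, the OS, and file-cache state, so treat it as a probe rather than a guaranteed demonstration:

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
	"sync"
)

func main() {
	threads := pprof.Lookup("threadcreate")
	fmt.Println("threads created before:", threads.Count())

	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// A synchronous file read cannot go through the netpoller,
			// so each call may pin an M for its duration.
			_, _ = os.ReadFile("/etc/hosts")
		}()
	}
	wg.Wait()

	fmt.Println("threads created after: ", threads.Count())
}
```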



Scenario 4: If a Goroutine performs a sleep operation, M is blocked.

The Go program has a background monitoring thread, sysmon, which watches for long-running G tasks and marks them as preemptible, so that other Goroutines get a chance to run first.

The next time that Goroutine makes a function call, it is preempted: its context is saved, and it is put back into P’s local queue to await its next turn.
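
A small, hedged experiment (behavior depends on the Go version): with a single P, a busy-looping Goroutine would monopolize the thread. Under the cooperative scheme described above it could only be preempted at a function call, while Go 1.14+ adds sysmon-driven asynchronous preemption, so on modern Go this program still prints and exits:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1) // a single P makes the contention visible

	go func() {
		for {
			// A tight loop with no function calls. Cooperative
			// preemption could only stop it at a function call;
			// Go 1.14+ async preemption interrupts it anyway.
		}
	}()

	time.Sleep(50 * time.Millisecond)
	fmt.Println("main was still scheduled despite the busy goroutine")
}
```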

Summary

This article has introduced the G-P-M model at the architecture level of the Go scheduler, through which a small number of kernel threads can support a large number of concurrently running Goroutines. It has also shown how mechanisms such as the NetPoller and sysmon help Go programs reduce thread blocking and make full use of available computing resources, maximizing the efficiency of Go programs.

The above is my own understanding of the topic; corrections are welcome, and a follow is appreciated. If you have thoughts, feel free to leave a comment or message me.


Author: joellwang


Source: Tencent Technology Engineering