Concurrency is a core concept in programming, and Go supports concurrency natively at the language level.

Concurrency and parallelism

Concurrency: multiple tasks are in progress over the same period of time, taking turns on the processor; they do not necessarily execute at the same instant.

Parallelism: multiple tasks execute at literally the same instant, with their execution times overlapping; this requires multiple processor cores.
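To make the distinction concrete, here is a minimal, illustrative Go sketch (the worker names and loop counts are arbitrary): with runtime.GOMAXPROCS(1) the two workers are concurrent, taking turns on a single core, but can never run in parallel; with a larger value they may truly execute at the same instant on different cores.

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func worker(name string, wg *sync.WaitGroup) {
	defer wg.Done()
	for i := 0; i < 3; i++ {
		fmt.Printf("worker %s: step %d\n", name, i)
	}
}

func main() {
	// One P: the workers are concurrent (interleaved over time)
	// but not parallel. Set this higher to allow parallelism.
	runtime.GOMAXPROCS(1)

	var wg sync.WaitGroup
	wg.Add(2)
	go worker("A", &wg)
	go worker("B", &wg)
	wg.Wait()
}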

Process, thread, coroutine


A process is the dynamic execution of a program over a data set. It can be thought of simply as a “program in progress”, and it is the independent unit of CPU resource allocation and scheduling. A process generally consists of three parts: the program, the data set, and the process control block. The program describes what the process should do and how to do it; the data set comprises the resources the program uses during execution; and the process control block records the process’s external characteristics and describes how its execution changes over time. The system uses the control block to control and manage the process, and it is the only structure by which the system knows the process exists. The limitation of processes is that creating, destroying, and switching between them is expensive.

Thread: a concept that came after the process. A thread, also called a lightweight process, is the basic unit of CPU execution and the smallest unit of execution within a running program; it consists of a thread ID, a program counter, a register set, and a stack. One process can contain multiple threads. The advantage of threads is that they reduce the cost of concurrent execution and improve the operating system’s concurrency; the disadvantage is that a thread owns almost no system resources of its own, only what is essential at run time, while all threads share the resources owned by their process. If a process is likened to a workshop, threads are the workers inside it. Note, however, that exclusive resources are protected by locking mechanisms, and improper handling can cause “deadlock”.

Coroutine: a user-mode lightweight thread, also known as a microthread, whose scheduling is controlled entirely by the user program. Coroutines are usually understood by comparison with subroutines (functions): a subroutine call has one entry point and one return, and exiting completes its execution, whereas a coroutine can suspend partway through and later resume from where it left off.

The biggest advantage of coroutines over traditional system-level threads and processes is that they are “lightweight”: millions can easily be created without depleting system resources, whereas threads and processes are usually limited to on the order of ten thousand. This is why coroutines are also called lightweight threads.

Note: compared with multithreading, coroutines execute more efficiently. Because switching between coroutines is controlled by the program itself rather than being a thread switch, there is no thread-switching overhead, and the more threads a workload would otherwise need, the greater the performance advantage of coroutines over multithreading.
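As a rough illustration of how lightweight goroutines are (a sketch, not a benchmark; the count of 100,000 is arbitrary), the program below creates far more goroutines than any OS could sustain as real threads, parks them all, and counts them:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	const n = 100000 // far beyond what OS threads could sustain
	block := make(chan struct{})

	for i := 0; i < n; i++ {
		go func() {
			<-block // park: each G costs only ~2KB of initial stack
		}()
	}
	fmt.Println("goroutines alive:", runtime.NumGoroutine())
	close(block) // release them all
}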

Concurrency in Go is implemented through the goroutine. Goroutines are similar to threads, but they are user-mode threads: we can create thousands of goroutines as needed and have them work concurrently. Goroutines are scheduled by the Go runtime, whereas threads are scheduled by the operating system.

Threading model

In modern operating systems, the thread is the basic unit of processor scheduling and dispatching, and the process is the basic unit of resource ownership. Each process has a private virtual address space, code, data, and various other system resources. A thread is a unit of execution within a process. Each process has at least one main thread, which is created automatically by the system rather than by the user; the application creates additional threads as needed, and multiple threads run concurrently within the same process.

Let’s start with threads. Whatever concurrency model a language offers, at the operating-system level it must ultimately exist in the form of threads. The operating system divides its address space into user space and kernel space according to access permissions. Kernel space is where CPU, I/O, memory, and other hardware resources are accessed, providing the most fundamental resources to the applications above; user space is where applications run, and it cannot access those resources directly. Resources provided by kernel space must be requested through system calls, possibly wrapped in library functions or shell commands.

A programming language can, narrowly speaking, be viewed as a piece of “software” in its own right. The “threads” it exposes are often user-mode threads, which may or may not correspond directly to the operating system’s own kernel-mode threads (kernel scheduling entities, or KSEs for short).

The Go concurrent programming model is underpinned by the thread library provided by the operating system, so it’s important to start with the thread implementation model.

A thread can be thought of as a flow of control within a process. A process contains at least one thread, because at least one control flow must be running continuously. Thus the first thread of a process is created when the process starts, and it is called the process’s main thread. A process can, of course, contain multiple threads; these are created by existing threads in the process through a system call (on Linux, typically wrapped by the pthread_create function). A process with multiple threads can execute several tasks concurrently, and even if one or some of them block, the others proceed normally, which can greatly improve a program’s response time and throughput. A thread, on the other hand, cannot exist independently of a process, and its life cycle cannot exceed that of the process it belongs to.

There are three main thread implementation models: the user-level threading model, the kernel-level threading model, and the two-level (hybrid) threading model. They differ chiefly in the correspondence between threads and kernel scheduling entities (KSEs). As the name implies, a kernel scheduling entity is an object that the kernel’s scheduler can schedule; in many publications and books it is also called a kernel-level thread, the smallest scheduling unit in the operating-system kernel.

Kernel-level threading model

Here user threads have a one-to-one (1:1) relationship with KSEs. The thread libraries of most programming languages (pthreads on Linux, java.lang.Thread in Java, std::thread in C++11, and so on) are wrappers around operating-system threads (kernel-level threads): each thread created is statically associated with its own KSE, so scheduling is done entirely by the OS scheduler. This approach is simple to implement and takes direct advantage of the threading capability the OS provides, and different user threads generally do not interfere with one another. However, creation, destruction, and context switching between threads are all performed at the OS level, which has a significant performance impact in scenarios that require a large number of threads.

Each thread is scheduled independently by the kernel scheduler, so if one thread blocks, no other threads are affected.

Advantages: with the support of multi-core hardware, the kernel-level threading model supports true parallelism; when one thread blocks, another can continue executing, so concurrency is strong.

Disadvantages: each user-level thread requires a corresponding kernel-level thread, which is expensive to create and hurts application performance.

User-level threading model

Here user threads and KSEs have a many-to-one (M:1) relationship. Creation, destruction, and coordination between threads are all handled by the user-level thread library and are transparent to the OS kernel: all threads created within a process are dynamically associated with the same single KSE at run time. Many languages implement coroutines this way. Compared with kernel-level threads this approach is lightweight, consumes far fewer system resources, and makes thread creation and context switches much cheaper. But the model has a fatal flaw: if a user thread makes a blocking system call (for example a network read in blocking mode), then as soon as the KSE is descheduled from the CPU by the kernel, all the remaining user threads mapped onto it block as well (the whole process hangs). For this reason, the language’s coroutine library typically wraps blocking operations in fully non-blocking forms; at what would have been a blocking point, the coroutine voluntarily yields, and some mechanism later notifies or wakes another user thread to run on the KSE. This prevents the kernel scheduler from context-switching the KSE away due to blocking, so the whole process never stalls.

Advantages: thread context switches happen in user space, avoiding kernel/user mode switches, which has a positive impact on performance.

Disadvantages: all threads sit on a single kernel scheduling entity (one kernel thread), which means only one processor can ever be used; this is unacceptable in a multi-processor environment. In essence, user-level threads solve the concurrency problem but not the parallelism problem. And if a thread is stuck in the kernel waiting on I/O data, every thread in the process blocks with it. Non-blocking I/O can be used in user space, but it cannot fully avoid the resulting performance and complexity problems.

Two-level threading model

Here user threads and KSEs have a many-to-many (M:N) relationship. This implementation combines the advantages of the previous two models: a process creates multiple KSEs, and threads can be dynamically associated with different KSEs at run time. When one KSE is descheduled from the CPU by the kernel because a thread working on it blocked, the remaining user threads associated with it can be re-associated with other KSEs. Of course, implementing this dynamic association mechanism is very complicated and must be done by the user (or the language runtime), which is one of its drawbacks. Concurrency in Go is implemented this way: Go ships its own runtime scheduler, which is responsible for the dynamic association between Go’s “threads” and KSEs. This model is sometimes called the hybrid threading model: the user scheduler “schedules” user threads onto KSEs, and the kernel scheduler schedules KSEs onto CPUs.

Go concurrent scheduling: the GMP model

Go builds a unique two-level threading model on top of the kernel threads provided by the operating system. The goroutine mechanism implements the M:N threading model and is an implementation of coroutines. Go’s built-in scheduler allows every core of a multi-core CPU to execute goroutines.

How does the scheduler work

Creating a new “thread” (called a goroutine in Go):

go func() {
    // do something in a new goroutine
}()

Functionally equivalent to this Java 8 code:

new java.lang.Thread(() -> { 
    // do something in one new thread
}).start();

The key to understanding how goroutines work is to understand the implementation of the Go scheduler.

There are four main structures supporting the whole scheduler implementation in Go: M, G, P, and Sched. The first three were defined in runtime.h and Sched in proc.c (these file names date from when the runtime was still written in C; in modern Go the corresponding definitions live in Go files under runtime, such as runtime2.go and proc.go).

  • The Sched structure is the scheduler; it maintains the queues that store Ms and Gs, along with some scheduler state.
  • The M structure is the Machine: a system thread, managed by the operating system, on which goroutines run. M is a large structure that maintains a small-object memory cache (mcache), the currently executing goroutine, a random-number generator, and much more.
  • The P structure is the Processor. Its main purpose is to execute goroutines, and it maintains a goroutine queue, the runqueue. The Processor is the key element that takes us from N:1 to M:N scheduling.
  • G is the core structure implementing a goroutine. It contains the stack, the instruction pointer, and other information important for scheduling the goroutine, such as the channel it is blocked on.

Note: the number of processors is set at startup from the environment variable GOMAXPROCS, or at run time by calling runtime.GOMAXPROCS(). A fixed number of processors means that at any moment only GOMAXPROCS threads are running Go code.
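A small sketch of both ways the note mentions: reading and setting the value at run time (runtime.GOMAXPROCS with an argument of 0 queries the current setting without changing it).

package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Println("CPU cores:        ", runtime.NumCPU())
	// An argument of 0 only queries the current value.
	fmt.Println("GOMAXPROCS (now): ", runtime.GOMAXPROCS(0))
	// A positive argument sets the number of Ps and returns the old value.
	old := runtime.GOMAXPROCS(2)
	fmt.Println("previous setting: ", old)
}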

[Figure: the triangle, rectangle, and circle represent Machine (M), Processor (P), and Goroutine (G), respectively.]

In the single-processor case, all goroutines run on the same system thread M. Each M maintains one P; at any moment one P runs exactly one G, while the other Gs wait in the runqueue. When a G uses up its time slice, it relinquishes the context and returns to the runqueue. In the multi-core case, each system thread M holds its own P so that goroutines can run on every core.

Under normal circumstances the scheduler follows the process above, but threads can block, among other things. Let’s look at how goroutines handle thread blocking.

Thread blocking

When a running goroutine blocks, for example in a system call, a new system thread (M1) is created; the current M gives up its P, and the P moves over to the new thread to keep running.

The runqueue is exhausted

When one processor’s runqueue is empty and it has no goroutine to schedule, it steals half the goroutines from another context (another P).

Note: the G, P, and M in the figure are concepts and data-structure objects abstracted by the Go runtime system (which includes the memory allocator, concurrent scheduler, garbage collector, and other components; think of it as Go’s counterpart to the JVM in Java). G: short for goroutine. The go keyword plus a function call creates a G object, which encapsulates a task to be executed concurrently; it is effectively a user-mode thread. It is a user-level resource, transparent to the OS, lightweight, cheap to context-switch, and can be created in large numbers. M: short for machine. On Linux it is created through the clone system call and is essentially the same as a thread created by the pthread library: an OS thread entity created via system calls. M’s purpose is to execute the concurrent tasks wrapped in Gs, and the main responsibility of the scheduler in the Go runtime is to arrange Gs fairly and reasonably across multiple Ms. M is an OS resource, the number that can be created is limited by the OS, and there are generally more Gs than active Ms. P: short for processor, a logical processor. It manages G objects (each P has its own G queue) and provides localized resources for Gs to run on an M.
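Some of these abstractions can be observed from a running program. A small sketch using standard runtime APIs (there is no public counter for Ms, but runtime/debug.SetMaxThreads exposes the cap on their creation, 10000 by default):

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	for i := 0; i < 10; i++ {
		go func() { select {} }() // park a few extra Gs forever
	}
	fmt.Println("G (live goroutines):", runtime.NumGoroutine())
	fmt.Println("P (GOMAXPROCS):     ", runtime.GOMAXPROCS(0))
	// SetMaxThreads returns the previous limit; passing the default
	// value back in leaves the behavior unchanged.
	fmt.Println("M (thread limit):   ", debug.SetMaxThreads(10000))
}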

From the perspective of the two-level threading model, it might seem that P is unnecessary and G plus M should suffice, so why add P? In fact, the early Go runtime (Go 1.0) had no concept of P: its scheduler assigned Gs directly to run on suitable Ms. But this brought many problems. For example, Gs running concurrently on different Ms all needed system resources such as heap memory; because those resources were global, contention for them cost a great deal of performance. To solve such problems, the later runtime (Go 1.1) introduced P and made P the manager of G objects: an M must bind to a P before it can run the Gs managed by that P.

The benefit is that G can reserve some system resources (local resources) ahead of time in its P object. When G needs them, it first asks its own local P, with no lock protection required; only if the local P runs short does it go to the global pool, taking a bit extra for efficient use later. It is like dealing with government offices today: we first go to the local office to see whether the matter can be handled there, and only go to the central office if it cannot, which improves efficiency.

Moreover, because P decouples G from M, even if the G running on an M blocks that M, the other Gs associated with the M can migrate, along with the P, to another active M and keep running. A G can therefore always find an M and run itself in time, which improves the concurrency of the system.

By constructing the G-P-M object model, the Go runtime implements a user-mode concurrent scheduling system that manages and schedules its own concurrent tasks; this is why Go can be said to support concurrency natively. Go’s self-implemented scheduler is responsible for assigning concurrent tasks to different kernel threads, and the kernel scheduler then takes over the execution and scheduling of those kernel threads on the CPU.

As you can see, concurrency in Go is very simple to use: a bit of syntactic sugar wraps the complexity of the internal implementation. The internals can be summarized by the following diagram: [diagram not reproduced]

The complete scheduling system of the Go runtime is very complex and difficult to describe clearly in a single article; here we can only introduce it from a macro perspective to build an overall understanding.

// Goroutine1
func task1() {
    go task2()
    go task3()
}

If a G (Goroutine1) has been assigned to an M by a P, and two more Gs are created while Goroutine1 executes, those two Gs are immediately placed in the local G queue of the same P as Goroutine1, waiting for the M bound to that P to execute them. This is the most basic structure and easy to understand. The key questions are:

1. On a multi-core system, how are Gs distributed across multiple Ms as evenly and reasonably as possible, so that all cores are fully used and concurrency improves?

Even if one goroutine uses the go keyword to create a large number of Gs, they are all placed in the same local queue at first. But if there are idle Ps (the number of Ps defaults to the number of CPU cores, GOMAXPROCS), the Go runtime guarantees that at least one (and usually exactly one) active M is bound to each idle P, searching the various G queues for runnable Gs; such an M is called a spinning M. The general search order is: the queue of the P it is bound to, then the global queue, then other Ps’ queues. If the M finds a G in its own P’s queue, it takes it out and runs it; otherwise it goes to the global queue. Because the global queue needs lock protection, when it holds many tasks the M transfers a whole batch to its local P queue, to avoid contending for the lock on every lookup. If the global queue is also empty, the M steals tasks from another P’s queue (taking half of them back). This ensures that, while runnable G tasks remain, an equal number of M+P combinations are always executing Gs or on the way to executing them (that is, looking for Gs).
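The search order just described, condensed into a toy model (illustrative only; the real runtime’s findrunnable routine is lock-free, far more involved, and also consults timers and the network poller):

package main

import "fmt"

type G struct{ id int }
type P struct{ runq []*G }

var globalQueue []*G

// findRunnable mirrors the order described above: local queue,
// then global queue, then stealing half of another P's queue.
func (p *P) findRunnable(others []*P) *G {
	if len(p.runq) > 0 { // 1. own local queue
		g := p.runq[0]
		p.runq = p.runq[1:]
		return g
	}
	if len(globalQueue) > 0 { // 2. global queue (lock-protected in reality)
		g := globalQueue[0]
		globalQueue = globalQueue[1:]
		return g
	}
	for _, victim := range others { // 3. steal half from another P
		if n := len(victim.runq); n > 0 {
			stolen := victim.runq[n/2:]
			victim.runq = victim.runq[:n/2]
			p.runq = append(p.runq, stolen[1:]...)
			return stolen[0]
		}
	}
	return nil // nothing runnable anywhere
}

func main() {
	idle := &P{}
	busy := &P{runq: []*G{{1}, {2}, {3}, {4}}}
	g := idle.findRunnable([]*P{busy})
	fmt.Printf("stole G%d; idle P now holds %d G(s)\n", g.id, len(idle.runq))
}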

2. What happens if an M is blocked by a system call made by the G it is executing?

In this case the M is descheduled from the CPU by the kernel and sits in a blocked state, and the other Gs associated with that M have no way to proceed. However, the Go runtime has a monitoring thread, sysmon, which detects such Ms. It detaches the P from the blocked M and finds another idle M, or creates a new one, to take over the P and continue running the remaining Gs. When the original M recovers from the blocking call, it must find a new idle P to continue executing its original G; if no P is idle at that moment, the G is put on the global queue, waiting for some other M+P combination to discover and execute it.

3. If a G runs on an M for too long, is there preemptive scheduling, so that the other Gs on that M get some running time and the scheduling system stays fair?

The Linux kernel scheduler schedules mainly by time slice and priority. For threads at the same priority, the kernel tries to ensure each gets a fair amount of execution time; to prevent some threads from starving, it performs preemptive scheduling, interrupting a long-running thread to free the CPU for others. The Go runtime scheduler has a similar preemption mechanism, but it cannot guarantee that preemption succeeds, because the Go runtime does not have the kernel scheduler’s power to interrupt: it can only set a preemption flag on a G that has run too long and gracefully ask the running G to voluntarily surrender its M.

A related point worth mentioning here: a goroutine can grow its stack dynamically while running. OS threads usually have a fixed stack (typically 2MB), whereas a goroutine starts life with a very small stack, typically 2KB, which grows and shrinks as needed, up to a maximum of 1GB on 64-bit systems (although a stack anywhere near that size is rare). This is part of why it is possible to create 100,000 goroutines at a time in Go. Because of this, before each function call the required stack space is computed and the stack is expanded as needed (exceeding the maximum causes a runtime error).

Go’s preemption mechanism hooks into this stack check: at the point where the runtime decides whether to grow the stack, it also checks the preemption flag and decides whether to continue or to yield. The monitoring thread sets the preemption flag on a G that has been running too long; the G then checks the flag at its next function call and, if it is set, puts itself on the global queue, so that the other Gs associated with the M get a chance to execute. However, if the running G is performing a time-consuming operation with no function calls at all (for example, pure computation in a for loop), then even with the preemption flag set, the G will hog the current M until it completes its task.
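The last point is easy to demonstrate. In the sketch below, a tight loop with no function calls offers the scheduler no preemption point; note that this hogging behavior applies to Go versions before 1.14, which added signal-based asynchronous preemption, so on modern Go the program does finish.

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1) // a single P, shared by main and the busy G

	go func() {
		for { // no function calls inside: no stack-check preemption point
		}
	}()

	time.Sleep(10 * time.Millisecond) // hand the only P to the busy G

	// Before Go 1.14 this line was never reached: the busy G hogged the
	// M/P and cooperative preemption could not interrupt it. Go 1.14+
	// preempts asynchronously, so the message prints there.
	fmt.Println("main got the P back")
}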

A brief summary of the GMP model:

GMP is implemented at the Go runtime level: a scheduling system built by Go itself, distinct from the operating system’s scheduling of OS threads.

  • G is the goroutine itself; it holds information about the goroutine as well as its binding to a P.
  • P manages a queue of goroutines and stores the context in which the current goroutine runs (function pointer, stack address, and address boundary). P performs some scheduling over its own goroutine queue (for example, pausing a goroutine that has held the CPU too long so that subsequent goroutines can run). When its own queue is exhausted it fetches from the global queue, and when the global queue is also exhausted it steals tasks from other Ps’ queues.
  • M (machine) is the Go runtime’s virtualization of an operating-system kernel thread; M usually maps one-to-one with a kernel thread. A goroutine is ultimately placed on an M to execute.

P and M usually correspond one to one. Their relationship is: P manages a group of Gs and mounts them on an M to run. When a G blocks on an M for a long time, the runtime creates a new M, and the P of the blocked G mounts its other Gs onto the new M; the old M is reclaimed when its blocked G completes or is considered dead.

The number of Ps is set by runtime.GOMAXPROCS (historically capped at 256); since Go 1.5 it defaults to the number of CPU cores. Additional Ms are created when concurrency is high, but not too many, because switching too often is not worth the cost.

In terms of thread scheduling alone, Go has an advantage over other languages: OS threads are scheduled by the OS kernel, whereas goroutines are scheduled by the Go runtime’s own scheduler, which uses a technique called M:N scheduling (multiplexing/scheduling M goroutines onto N OS threads). Its defining characteristic is that goroutine scheduling is done in user mode and involves no frequent switching between kernel mode and user mode. Even memory allocation and release go through a large user-mode memory pool rather than direct calls to the system’s malloc (unless the pool itself needs to change size), so the cost is much lower than scheduling OS threads. At the same time, the runtime makes full use of multi-core hardware by distributing goroutines across a number of physical threads, and the extreme lightness of goroutines themselves guarantees the performance of Go scheduling.
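A rough way to put a number on this low cost (a sketch, not a rigorous benchmark; figures vary by machine and Go version): time how long it takes to spawn and join a large batch of goroutines.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const n = 100000
	var wg sync.WaitGroup

	start := time.Now()
	wg.Add(n)
	for i := 0; i < n; i++ {
		go wg.Done() // create a goroutine whose whole job is to finish
	}
	wg.Wait()
	elapsed := time.Since(start)

	// Typically tens of milliseconds in total, i.e. well under a
	// microsecond per goroutine; creating OS threads at this scale
	// would be orders of magnitude slower, if it succeeded at all.
	fmt.Printf("%d goroutines in %v (%v each)\n", n, elapsed, elapsed/n)
}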

Source code analysis