One of Golang's features is the goroutine, which makes concurrent programming easier and is well suited to server programming. As a backend developer, it is important to understand common concurrency scenarios and their solutions. How do you write high-concurrency programs in general, and what are the classic models?

It all starts with C10k, which stands for Client 10,000: a single server serving 10,000 clients simultaneously. Today, of course, businesses face C100k and C1000k. Early servers were built on the process/thread model, assigning a process (or thread) to handle each new connection. Each process (thread) occupies a certain amount of operating-system resources, so hardware limits how many can be created. Context switching also has a cost: every time the scheduler dispatches a thread, the operating system must save all of the thread's necessary state, such as the program counter, stack, registers, and status.

CPUs compute far faster than I/O completes. Common Internet applications such as the Web are generally I/O-intensive rather than compute-intensive: the CPU spends most of its time waiting for data to come in and go out rather than calculating, so most of the computing resource sits wasted.

Obviously, simply spawning a process/thread per connection is not enough. To achieve high concurrency, you have to think about your I/O strategy: under the same hardware conditions, different designs produce very different results. Before discussing the I/O models, let's introduce the concepts of synchronous/asynchronous and blocking/non-blocking, along with some operating-system background.

Reference:

The C10K problem

Synchronous/asynchronous? Blocking/non-blocking? In a synchronous call, the caller actively checks the status of the call; in an asynchronous call, the callee notifies the caller when it is done. For example, in a Web application, a back end that renders templates and sends whole pages to the front end is synchronous: the front end is the caller, and every data request reloads the entire page. When the front end instead uses jQuery Ajax to request data from the server, it is asynchronous: each request refreshes only part of the page instead of reloading all of it.

The difference between blocking and non-blocking is whether the call returns immediately. If A calls B and waits at the call site, not executing the rest of its code until B returns, the call is blocking. In a non-blocking call, B returns immediately and A continues executing the following code without being held up. When a method blocks, the operating-system scheduler suspends the thread it runs on and places it on a blocking queue until the event it is waiting for occurs.
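To make the distinction concrete, here is a minimal Go sketch (Go is used for the examples throughout, since the article centers on goroutines): a plain channel receive blocks the calling goroutine, while a `select` with a `default` case returns immediately.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ch := make(chan int)

	// Producer: sends a value after a short delay.
	go func() {
		time.Sleep(100 * time.Millisecond)
		ch <- 42
	}()

	// Non-blocking attempt: select with a default case returns immediately.
	select {
	case v := <-ch:
		fmt.Println("got", v)
	default:
		fmt.Println("no value yet, doing other work") // printed first
	}

	// Blocking receive: this goroutine is suspended until a value arrives.
	v := <-ch
	fmt.Println("blocked until value arrived:", v)
}
```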

The I/O model under Unix also involves the synchronous/asynchronous and blocking/non-blocking concepts; see my notes on the I/O model in Unix.

A process is an independent unit of resource allocation in the system. These resources include the user's address space, mechanisms for synchronization and communication between processes (threads), opened files and requested I/O devices, and the address mapping table maintained by the kernel. The kernel tracks processes through process control blocks (PCBs).

A thread is the basic unit of scheduling and dispatch. The kernel tracks threads through thread control blocks (TCBs).

A thread itself does not own system resources, only the few resources essential to running independently, such as a TCB, a program counter, local variables, status parameters, a return address, and other registers and a stack. All threads of a process share the same address space, and a thread can access the resources owned by its process. Multiple threads can execute concurrently; a process contains several relatively independent threads, but always at least one.

Threads are implemented in different ways: kernel-supported threads (KST) and user-supported, or user-level, threads (UST). The TCBs of kernel-level threads live in kernel space, and their creation, blocking, destruction, and switching are implemented in kernel space as well. User-level threads are independent of the kernel: they are implemented in user space, and the kernel does not perceive their existence. The scheduling algorithm for user threads can be tailored per process rather than decided by the kernel, but user threads cannot exploit parallel execution across multiple processors, and if one thread in a process blocks, all of that process's threads block. Kernel-thread switching requires a transition into kernel space that user-thread switching avoids, so the former is more expensive. User threads still need kernel support, however, typically connecting to kernel threads through the runtime system or kernel control threads, with 1:1, 1:N, and M:N implementations.

In a time-sharing operating system, processor scheduling is generally round-robin (RR) over time slices, with the ready threads queuing up to receive time slices in turn. To guarantee interactivity and real-time behavior, threads acquire the processor preemptively, and preemption carries a relatively high overhead. In non-preemptive mode, a running thread is not rescheduled or suspended until it finishes executing, blocks on a system call (such as an I/O request), or voluntarily relinquishes the processor.

Coroutines are built on non-preemptive scheduling. Processes and threads are operating-system concepts, while coroutines live at the compiler/language level and are now supported in many programming languages such as Erlang, Lua, Python, and Golang. To be precise, a coroutine is a user-mode, lightweight thread: it runs in user space, is not scheduled by the system, and has its own scheduling algorithm. During a context switch, a coroutine switches in user space instead of trapping into the kernel the way a thread switch does, reducing the overhead. Simply put, the language's runtime system (not the kernel) provides the scheduling and the context save/restore, re-implementing a "concurrency" mechanism. System-level concurrency is the rotation of time slices: a single processor alternates among different execution streams, creating the feeling that different threads execute simultaneously. Coroutine concurrency is the rotation of control within a single thread. Compared with preemptive scheduling, coroutines cooperate by voluntarily yielding control. The advantage of coroutines is that asynchronous code becomes far more readable than callbacks. The downside is that, being user-level threads, they cannot exploit parallel execution on multi-core machines.
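To illustrate voluntary yielding in Go (only a sketch of the cooperative idea: modern Go schedulers can also preempt goroutines), `runtime.Gosched()` explicitly gives up the processor so other goroutines can run:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for _, name := range []string{"A", "B"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			for i := 0; i < 3; i++ {
				fmt.Println(name, i)
				runtime.Gosched() // voluntarily yield control to other goroutines
			}
		}(name)
	}
	wg.Wait()
}
```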

Threads were created to separate two functions of the process, resource allocation and scheduling, letting finer-grained, lighter threads carry the scheduling and reducing its overhead. Threads are still not lightweight enough, however, because scheduling happens in kernel space: every thread switch has to drop into the kernel, a significant cost. Coroutines implement the scheduling logic in user space, with the compiler's runtime system (or the programmer) simulating the transfer of control, achieving even finer-grained control.

Reference:

Computer Operating System

Concurrency model

  1. A single process with a single thread (there is no real difference, since a process has at least one thread). Looping over requests is the most rudimentary approach: when a large number of requests arrive and a single thread processes them one by one, a backlog builds up and requests go unanswered. This is not a concurrent approach at all.

  2. Multi-process: the main process listens for and manages connections. When a client requests a connection, a child process is forked to handle it, and the parent continues waiting for requests from other clients. However, processes occupy a lot of server resources, so the server load climbs quickly.

Apache is a multi-process server. There are two modes:

Prefork MPM: uses multiple child processes, each containing a single thread and handling one connection at a time. It is as fast as worker MPM on many systems but requires more memory. This thread-free design is superior to worker MPM in some cases: it works with third-party modules that are not thread-safe (such as PHP 3/4/5), it is easy to debug on platforms with poor thread-debugging support, and it is more stable than worker MPM.

Worker MPM: uses multiple child processes with multiple threads in each. With one thread per request, this MPM is usually a good choice for high-traffic servers, because it requires less memory and scales better than prefork MPM.

The biggest benefit of this architecture is isolation: if a child process crashes, the parent process is not affected. The downside is the load it places on the system.

Reference:

Architecture and Principle of Web server Apache

  3. Multithreading is similar to multi-processing, but with threads instead of processes. The main thread listens and accept()s connections, while child (worker) threads handle the business logic and the I/O. A worker thread that blocks does not block the other threads in the same process.

Its disadvantages:

Threads are frequently created and destroyed, a significant overhead for the system. This can be solved with a thread pool: a set of threads is created in advance, and a pool manager dispatches work onto them, so threads are reused and the cost of repeated creation is avoided (see the Go sketch after this list).

Synchronization must be handled: when multiple threads touch the same resource, something like a lock is needed to keep them safe. Mishandled synchronization harms both data safety and performance.

The crash of one thread brings down the entire process.
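Here is a minimal worker-pool sketch in Go, standing in for a thread pool: a fixed set of worker goroutines pulls jobs from a channel, and the shared counter guarded by `sync.Mutex` marks the kind of synchronization point described above (the job body is a hypothetical stand-in for real business logic).

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	jobs := make(chan int)
	var (
		wg        sync.WaitGroup
		mu        sync.Mutex
		processed int // shared state: must be protected by the mutex
	)

	// Fixed pool of 4 workers, created once and reused for every job.
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range jobs {
				_ = job * job // stand-in for real business logic
				mu.Lock()
				processed++
				mu.Unlock()
			}
		}()
	}

	for i := 0; i < 100; i++ {
		jobs <- i
	}
	close(jobs) // workers exit once the channel is drained

	wg.Wait()
	fmt.Println("processed:", processed) // 100
}
```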

Multithreading can be used to improve response speed by overlapping I/O and computation, reducing latency. Although it cannot raise absolute performance, it improves average response time.

This approach is relatively easy to think of, especially for students who have just learned about multithreading and operating systems, and it is sufficient when request volume is modest; how many connections can be served depends on the server's hardware. But high concurrency cannot be achieved by linearly stacking hardware or threads. 100 threads may handle 1,000 concurrent connections, yet at 10,000 concurrency, multiplying the thread count by 10 may not work: thread-scheduling overhead grows and synchronization becomes a bottleneck.

  4. Nginx uses a multi-process (single-threaded per process) model combined with I/O multiplexing:

After Nginx starts, there will be a master process and multiple independent worker processes.

The master receives signals from the outside world and relays them to each worker process; any of the workers may end up handling a given connection.

The master process monitors the running status of the workers; when a worker exits abnormally, a new worker process is started automatically.

The master first creates a socket file descriptor with socket() for listening, then forks the worker processes, which inherit the parent's sockfd. After accept(), a child obtains a connected descriptor and communicates with the client through it.
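The shape of this pattern can be approximated in Go: one listening socket shared by several worker goroutines, each blocking in `Accept()` (a sketch only; Nginx's real workers are forked processes inheriting the listening fd, not threads in one process).

```go
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	// One listening socket, analogous to the master's sockfd.
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}

	// Several "workers" share the listener and race to accept connections.
	for w := 0; w < 4; w++ {
		go func(id int) {
			for {
				conn, err := ln.Accept()
				if err != nil {
					log.Println(err)
					return
				}
				log.Printf("worker %d accepted %s", id, conn.RemoteAddr())
				io.Copy(conn, conn) // echo, stand-in for real handling
				conn.Close()
			}
		}(w)
	}
	select {} // block forever while the workers serve
}
```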

This produces a thundering-herd problem: when a connection arrives, all child processes are woken and "race" to establish the connection.

Nginx adds a mutex around accept() to deal with the thundering herd.

Within each worker process, Nginx uses the kernel's epoll interface for I/O multiplexing.
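For a feel of what such a worker does internally, here is a minimal, Linux-only Go sketch of the epoll pattern using the raw `syscall` package (real servers, including Go's own runtime netpoller, wrap this far more carefully): register a descriptor, then block in `EpollWait` until a registered descriptor becomes readable.

```go
//go:build linux

package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Create an epoll instance.
	epfd, err := syscall.EpollCreate1(0)
	if err != nil {
		panic(err)
	}
	defer syscall.Close(epfd)

	// Register stdin (fd 0) for readability; a server would register its
	// listening socket and every connected socket the same way.
	event := syscall.EpollEvent{Events: syscall.EPOLLIN, Fd: 0}
	if err := syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, 0, &event); err != nil {
		panic(err)
	}

	events := make([]syscall.EpollEvent, 8)
	for {
		// One blocking call watches all registered descriptors at once.
		n, err := syscall.EpollWait(epfd, events, -1)
		if err != nil {
			panic(err)
		}
		for i := 0; i < n; i++ {
			fmt.Printf("fd %d is readable\n", events[i].Fd)
			buf := make([]byte, 1024)
			syscall.Read(int(events[i].Fd), buf) // drain so the event clears
		}
	}
}
```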

Reference:

Explore the principle of high concurrency in Nginx

Node.js is also a single-threaded model. All logic in Node.js lives in event callbacks, so Node.js is always inside an event loop, and the program's entry point is the callback of the first event in that loop. Inside an event callback, the program may issue an I/O request or emit an event directly, then return to the event loop. The event loop keeps checking the event queue for unprocessed events until the program terminates. Node.js's event loop is invisible to the developer; it is implemented by libuv (early versions used libev), which keeps checking for active event listeners and, when none remain, exits the event loop, ending the program.

Node.js's single thread is non-blocking because, underneath, other threads poll the event queue. Upper-level developers deal only with the single thread: they cannot open new threads, and they do not need to worry about thread synchronization.

The downside of this mechanism is deeply nested callback functions and hard-to-read code. And because there is no multithreading, there is no way to execute in parallel on a multi-core machine.

Reference:

Basic understanding of node.js mechanism and principle

Use libevent and libev to improve network application performance

  5. Coroutines are scheduled in user space; the scheduling algorithm is supplied by the compiler and the developer, which is more flexible and controllable than multithreading or event callbacks. How coroutines are scheduled varies by language: Python explicitly yields in code, while Golang starts a goroutine with the go statement, and the actual scheduling is performed by the language-level runtime.

A goroutine's stack is small, typically a few KB, and can grow dynamically, whereas a thread's default stack is on the order of megabytes (8 MB on Linux). This is why a single machine can support tens of thousands of concurrent goroutines: they are cheap.

From a stack perspective: processes have their own independent heap and stack, sharing neither, and are scheduled by the operating system. Threads have their own independent stacks but share the heap, and are likewise scheduled by the operating system (as kernel threads). Coroutines, like threads, share the heap but not the stack, and are scheduled by the programmer in the coroutine's code.

A goroutine can be used like a lightweight thread. As with multi-process and multi-threaded designs, the main goroutine listens and opens multiple worker goroutines to process connections; the advantage is that you can open far more goroutines to handle connections than you could threads.
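A minimal sketch of that pattern: the main goroutine accepts, and each connection gets its own goroutine, which is affordable precisely because goroutine stacks start small.

```go
package main

import (
	"bufio"
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Println(err)
			continue
		}
		// One cheap goroutine per connection.
		go func(c net.Conn) {
			defer c.Close()
			scanner := bufio.NewScanner(c)
			for scanner.Scan() {
				c.Write(append(scanner.Bytes(), '\n')) // echo each line back
			}
		}(conn)
	}
}
```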

The key to the goroutine's underlying implementation lies in three basic objects: G (goroutine), M (machine), and P (processor). M is bound to a kernel thread and represents it. P represents the resources M needs in order to run G; it can be thought of as a local scheduler maintaining a goroutine queue. G represents a goroutine, with its own stack. The mapping of M to G is analogous to the M:N model of operating-system kernel threads to user threads. By controlling the number of P, you control the degree of parallelism.
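In code, the number of P is exposed via `runtime.GOMAXPROCS`; this sketch just reports the defaults.

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// P defaults to the number of logical CPUs; each running G needs an M bound to a P.
	fmt.Println("logical CPUs:", runtime.NumCPU())
	fmt.Println("GOMAXPROCS (number of P):", runtime.GOMAXPROCS(0)) // 0 queries without changing
	fmt.Println("current goroutines (G):", runtime.NumGoroutine())
}
```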

Reference:

What do you make of Golang's "don't communicate by sharing memory; share memory by communicating"?

Golang source code exploration (two) coroutine implementation principle

Why can Goroutine handle large concurrency?

coroutines

Actor and CSP models. Traditional multithreaded programming synchronizes through shared memory, but as the degree of parallelism rises, uncertainty increases and mechanisms such as locks are needed for correctness; used badly, locks drag performance down. Multithreaded programming is also harder, out of line with how people naturally think, error-prone, and susceptible to deadlock. Hence new programming models for high concurrency arose: message passing instead of shared memory and locks.

Then came the idea "Don't communicate by sharing memory; share memory by communicating." Actor and CSP are two concurrent programming models built on this idea and described in many academic papers; that is, there is mathematical backing for it, and understanding the two models offers many useful lessons for developing high-concurrency servers. As engineers we need not innovate on the theory, but we should learn to apply its results in our own projects.

"While the Actor model focuses on the entities that participate in communication, the CSP model focuses on the channels used for communication." Java/Scala has the Akka library, an implementation of the Actor model; Golang's goroutine mechanism follows the CSP model.

"The Actor model espouses a philosophy of 'everything is an actor,' similar to the 'everything is an object' philosophy of object-oriented programming." "Actor model = data + behavior + messages. The state inside an actor is maintained by its own behavior: external threads cannot invoke an object's behavior directly, only through messages, which guarantees that an actor's internal data can be modified only by the actor itself."

My understanding: inside the model, data is always processed by a single thread, so there is no thread-safety concern and no need for locks. The outside may be multi-threaded, but to manipulate the data it must send messages to the internal thread, which processes only one message at a time; each message represents one data-processing action. Internal and external threads pass messages asynchronously through the mailbox.

CSP is similar to Actor: a process (a goroutine in Go) corresponds to an actor, the entity that sends messages, and a channel corresponds to the mailbox, the carrier of messages. The differences: an actor has exactly one mailbox, and actor and mailbox are coupled, whereas channels exist independently as first-class values (obvious in Golang) and are anonymous. A mailbox is asynchronous, while channels are generally synchronous (in Golang, channels are synchronous by default, and setting a buffer size makes them asynchronous).
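The contrast fits in a few lines of Go: a single goroutine owns the counter and is the only code that touches it, so no lock is needed; the channel is the first-class, anonymous carrier (unbuffered for synchronous hand-off, buffered for asynchronous, mailbox-like behavior).

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// Buffered channel ≈ asynchronous mailbox; make(chan int) would be synchronous.
	inc := make(chan int, 16)
	done := make(chan int)

	// The "actor": the only goroutine that ever touches count, so no lock is needed.
	go func() {
		count := 0
		for delta := range inc {
			count += delta
		}
		done <- count
	}()

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			inc <- 1 // communicate instead of sharing memory
		}()
	}
	wg.Wait()
	close(inc)
	fmt.Println("count:", <-done) // 10
}
```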

Reference:

The Actor concurrency model vs. the shared-memory threading model

Why is the Actor model the ultimate solution for highly concurrent transactions?

How to explain the CSP concurrency model simply?

Concurrent programming: Actors model and CSP model

To sum up, the key to high concurrency is asynchronous, non-blocking design that uses the CPU more efficiently. Multithreading achieves non-blocking behavior, but occupies many resources and pays heavily for switching. Coroutines use dynamically growing stacks and user-mode scheduling to avoid both of those problems. Event-driven designs use a single thread to avoid consuming too many system resources and sidestep thread safety, but cannot exploit multiple cores. Which model to use depends on the requirements; models and technologies are just tools. All roads lead to Rome.

The CSP and Actor models are more elegant: they fit the way people think and avoid locks. In my opinion, locks and multithreading are easily abused; they embody a micro-level, linear way of thinking that is not strategic enough, and message passing yields looser coupling.

High-concurrency programming is necessary. On one hand, many applications require high-concurrency support: with ever more users on the network and ever more complex business scenarios, stable and efficient servers are required. On the other hand, modern computers are quite powerful, but poorly designed software cannot bring that performance into full play, which is a waste.

While writing this article I found a lot of interesting open-source projects that deserve further research and reading, but I don't have time to go into them yet. To read and write about later:

libtask, a C coroutine library implemented by Russ Cox, one of the authors of Golang, on which goroutines draw: swtch.com/libtask/

Event-driven programming framework libev: software.schmorp.de/pkg/libev.h…

Akka, a Scala implementation of the Actor model: akka.io/

It took me nearly a week to write this article, and I consulted a lot of material. Writing it was much harder than I expected, and it is not written as well as I would like: I quoted many blogs and could not always find the right language to explain the problems. I may have misunderstood some points, which I will correct later. But at least I finished it, and at least I now have a working understanding of mainstream concurrent programming, which is an accounting to myself.

Update 2018.01.01: functional programming

Recently I learned that functional programming is also a model for solving concurrency problems.

Imperative and functional languages have different abstractions.

Imperative programming is an abstraction over computer hardware, concerned with the steps to solve a problem. Functional programming is an abstraction over mathematics, transforming problems into mathematical expressions.

Functional languages have two key characteristics: data is immutable, with operations independent of saved or retrieved state, and functions have no side effects: a function called with the same input always returns the same value. Concurrent programming can therefore be done without relying on locks.
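A tiny Go illustration of the idea: a pure function depends only on its input, so many goroutines can call it in parallel with no locks; the writes below are safe only because each goroutine owns a distinct slice index.

```go
package main

import (
	"fmt"
	"sync"
)

// square is pure: same input, same output, no shared state touched.
func square(x int) int { return x * x }

func main() {
	in := []int{1, 2, 3, 4, 5}
	out := make([]int, len(in))

	var wg sync.WaitGroup
	for i, v := range in {
		wg.Add(1)
		go func(i, v int) {
			defer wg.Done()
			out[i] = square(v) // each goroutine writes a distinct index: no lock needed
		}(i, v)
	}
	wg.Wait()
	fmt.Println(out) // [1 4 9 16 25]
}
```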

I haven't learned a functional language, so I don't yet understand deeply how functional programming achieves concurrency, but I sense it is an area worth exploring.

There is a saying that "software's primary technical imperative is managing complexity" (Code Complete). So many abstractions exist partly to solve problems efficiently and partly to reduce the mental burden on programmers. A programming model is the way a programmer looks at a problem; to solve the same problem, it is naturally better to choose a model that is friendly to program in and fits how people think. "Programs must be written for people to read, and only incidentally for machines to execute" (SICP). Machines can execute them either way, but the ultimate goal is to free people to spend most of their energy on cutting-edge, creative work.