This article is based on my talk at the Gopher Beijing meetup on February 27th, with some additions and adjustments. First published on the "High Availability Architecture" public account.

Before we get to this topic, let’s take a look at two concepts that come first in almost every article on concurrency:

  • Concurrency is about task decomposition. For example, suppose you are the CEO of a startup and its only employee, so you wear many hats: one moment you plan the product, the next you write code, then you meet a customer. You cannot meet a customer and write code at the same time, but because you split the work into tasks and allocate time slices, the tasks appear to make progress together.
  • Parallelism is about simultaneous execution. Continuing the example: you find you cannot juggle your time, so you hire an engineer, a product manager, and a marketing director, and multiple tasks are now truly done at the same time.

To sum up: concurrency does not require parallelism; it can be simulated by time slicing, as in a multi-tasking system on a single-core CPU, but it does require that tasks be split into independently executable segments. Parallelism is about simultaneous execution and requires multiple CPUs (or cores), and a parallel system necessarily supports concurrency. For the most part this article will not strictly distinguish the two concepts; by default, "concurrency" means concurrency under a parallel mechanism.
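To make the distinction concrete, here is a small Go sketch (Go being the language of this talk, and the task names borrowed from the CEO example): with GOMAXPROCS set to 1 the goroutines are concurrent but not parallel, interleaving on a single core; raise the value and they may also run in parallel.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	runtime.GOMAXPROCS(1) // pretend we have a single-core CPU

	var wg sync.WaitGroup
	for _, task := range []string{"plan product", "write code", "meet customer"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			for i := 0; i < 3; i++ {
				fmt.Println(name, "step", i)
				runtime.Gosched() // yield the core so the tasks interleave
			}
		}(task)
	}
	wg.Wait()
}
```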

Why are concurrent programs so difficult?


We believe that writing correct concurrent, fault-tolerant and scalable applications is too hard. Most of the time it's because we are using the wrong tools and the wrong level of abstraction. – Akka

Akka's official documentation opens with a good line: writing correct concurrent, fault-tolerant, and scalable programs is so hard because we use the wrong tools and the wrong level of abstraction. (Of course the document's original intent is that Akka is the right tool, but we can consider the statement on its own.)

So let's start with how programs are abstracted. In the beginning our programs were procedural: data structures plus functions. Then came object orientation: an object combines data structures and functions, and we tried to abstract out objects, state, and behavior in a way that mimics the real world. But both procedural functions and object-oriented methods are essentially organizational units for blocks of code; neither carries any definition of a concurrency strategy for those blocks. To address the need for concurrency, the concept of threads was introduced.

Threads

  1. A kernel-level construct, more lightweight than a process
  2. Scheduling is done by the system kernel
  3. Multiple threads of the same process can share resources

The advent of threads solved two problems. One was the urgent need, once GUIs appeared, for a concurrency mechanism to keep the user interface responsive. The other was the multi-user demand brought by the growth of the Internet. The earliest CGI programs were very simple: a script wrapped a stand-alone program, and one process was started per user. This obviously did not scale to many users, and if resources had to be shared, the processes had to communicate with each other. The introduction of threads alleviated these problems.

Threads are relatively simple to use: if you think a piece of code needs concurrency, you put it on a separate thread and the system handles the scheduling. When to use a thread, and how many, is decided by the caller, yet the definer of the code has no idea how callers will use it, and many concurrency bugs are caused by such misuse. For example, Go's map and Java's HashMap are not concurrency safe, and misusing them in a multithreaded environment causes problems. Threads also add complexity of their own:

  1. Race conditions. Threading is simple if every task is independent and shares nothing. But the world is complex, and there are always resources to share. As in the earlier example, when the developer and the marketer both need to discuss a plan with the CEO at the same time, the CEO becomes the contended resource.
  2. Dependencies and execution order. If tasks on different threads depend on each other, wait and notify mechanisms are needed to coordinate them. In the earlier example, if the product manager's discussion with the CEO depends on the marketer's discussion having finished, a coordination mechanism is needed to guarantee the order.

To handle these problems, we introduced a number of complex mechanisms:

  • Mutex (lock) (the sync package in Go, the concurrent package in Java): protect shared data with a mutex, at the cost of significantly reduced concurrency.
  • Semaphore: control the degree of concurrency, or use it as a signal between threads.
  • Volatile: Java introduced the volatile keyword specifically to reduce the use of locks in read-only scenarios.
  • Compare-and-swap: guarantee atomicity through hardware-provided CAS, another mechanism for lowering the cost of locks (a short sketch of the first and last of these follows this list).
If these mechanisms only added complexity, we could cope to some extent with careful study, rigorous code review, and thorough concurrency testing (for example, running Go unit tests with the -race flag). (Even this is debatable. There is a paper arguing that most concurrent programs are not actually problem-free, merely not concurrent enough to expose their bugs; as CPU core counts keep rising and programs run longer, it is hard to guarantee no problems will surface.) But the biggest headache is the following:

How many threads are needed in the system?

Let’s start with hardware resources and consider the cost of threads:

  • Memory cost. Each thread needs stack space to hold its call stack. On a 64-bit JVM the default stack size is 1024K, so counting stack space alone, launching 1024 threads requires 1 GB of memory. You can tune this with the -Xss flag, but since threads are essentially processes, the system assumes they are meant to run for a long time, and too small a stack makes moderately deep recursion (such as matching a complex regular expression) overflow it. So adjusting the parameter treats the symptom, not the cause.
  • Scheduling cost. In a non-rigorous test I ran on a PC, simulating two threads that alternately wake each other and suspend, a thread switch cost about 6,000 nanoseconds, without even accounting for the effect of stack size. A paper that analyzes the cost of thread switching in detail (see the references) concludes that switching cost is directly related to stack space usage.

  • CPU utilization

    One of our primary goals with concurrency is to exploit multiple cores: we want to maximize CPU utilization and squeeze the hardware. From that perspective, how many threads should we use?

    Suppose we have a 4-core CPU and each task spends 15 ms waiting on the network and 5 ms on the CPU. Then by the formula 100/(15+5)*4 = 20, 20 threads would be optimal. But on the one hand network time is not fixed, and on the other hand, what about other bottleneck resources, such as locks or database connection pools? Then the calculation gets far more complicated.

As the father of a one-year-old, I find this question as hard as writing a program to feed my child while wondering, "How much food is right for him?" The question admits the following answers and strategies:

  • Feed until the child refuses to eat (but the child is naughty; he may refuse because he wants to play).
  • Feed until the child is full (nonsense: how do you know he is full? He can't talk yet).
  • Increase the amount gradually, observe for a while, and compute an average (this may be our usual strategy for tuning thread counts, but how big should each increment be?).
  • If the child vomits, stop feeding (with the incremental, externally observed model, this boundary condition may well be reached: if system performance regresses when threads are added, stop adding threads).
  • Fail to control the boundary and overfeed the child until he is sick (a terrifying father indeed; but carelessly tuning threads can just as easily crash a system).

As this example shows, it is very hard to get the number right by observing from outside the system or by calculating from experience. So the conclusion is:

Teach the child to talk, to say when he is full, and eventually to feed himself. Self-management is the best plan.

Sounds nice, but it does not help: computers cannot talk. How can they manage themselves?

Still, we can draw two conclusions from the discussion so far:

  • Threads are expensive (memory, scheduling) and cannot be created at scale
  • The problem should be solved dynamically by the language or framework

Thread pool scheme


After Java 1.5, Doug Lea's Executor family was included in the default JDK; it is the canonical thread pool solution.

A thread pool controls the number of threads to some extent, reuses threads, and lowers the cost of using them. But the quantity problem remains unsolved: when the pool is initialized you still have to set minimum and maximum thread counts and a task queue length, and self-management is only dynamic adjustment within that preset range. Moreover, different kinds of tasks may have different concurrency needs, so multiple pools are required to keep them from interfering with each other, which is why Java systems end up littered with thread pools. The same dilemma appears whatever the implementation, as the sketch below shows.
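For illustration, here is a minimal fixed-size worker pool in Go (the workers and queueLen names are only illustrative). The moral is the same as with Java's Executor: the worker count and queue length still have to be chosen up front by a human.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const workers = 4   // how many? the core question remains
	const queueLen = 64 // bounded task queue

	tasks := make(chan int, queueLen)
	var wg sync.WaitGroup

	// Start a fixed set of reusable workers.
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for t := range tasks {
				fmt.Printf("worker %d handled task %d\n", id, t)
			}
		}(w)
	}

	// Submit tasks; submission blocks once the queue is full.
	for i := 0; i < 10; i++ {
		tasks <- i
	}
	close(tasks)
	wg.Wait()
}
```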

A new line of thought


As the analysis above shows, if threads were always in the running state, we would only need as many threads as CPU cores, which would maximize CPU utilization while minimizing switching cost and memory use. But how can that be achieved?

Alas, plain kernel threads cannot manage this by themselves.

The goal is that a piece of code occupies a thread only while it is actually working; the moment it stops working (waiting, blocking, and so on), it is taken off the thread. Put crudely: don't hog the toilet. If nothing is coming and you need time to brew, step out first, because the stall is a scarce resource.

There are two ways to do this:

  1. Asynchronous callbacks, as in Node.js. On a blocking event such as a network call, you register a callback method (along with some context data) with the I/O scheduler (libev on Linux; the scheduler runs on a different thread), and the current thread is released to do other work. When the data is ready, the scheduler passes the result to the callback and runs it, not on the thread that issued the request, though the user is unaware of this. The problem with this approach is that it easily degenerates into callback hell: every blocking operation must be asynchronous, or the whole system stalls. And asynchrony runs somewhat against human intuition, which is used to thinking sequentially.

  2. GreenThread/Coroutine/Fiber. This solution is essentially not very different from the one above; the key difference lies in how the callback's context is kept and how execution resumes. To escape callback hell, the idea is to let code be written sequentially, but on a blocking call such as I/O, suspend the current code segment, save its context, and release the thread. When the I/O event completes, another thread is found and the segment resumes where it left off. The code reads as if it were synchronous, as if everything ran on one thread, while the system may in fact have switched threads underneath without the program noticing (contrast the two styles in the sketch after this list).
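A sketch contrasting the two styles with simulated I/O (the fetchAsync and fetchSync names are made up for illustration): the callback version hands a continuation to whoever completes the I/O, while the "synchronous" version reads sequentially even though, with real network I/O, the Go runtime may park and migrate the goroutine underneath.

```go
package main

import (
	"fmt"
	"time"
)

// fetchAsync simulates a non-blocking call: it returns immediately and
// invokes the callback on another goroutine once the "I/O" completes.
func fetchAsync(url string, cb func(string, error)) {
	go func() {
		time.Sleep(100 * time.Millisecond) // simulated network latency
		cb("response from "+url, nil)
	}()
}

// fetchSync looks blocking to the caller; with real I/O in Go, the
// runtime would park this goroutine and free the OS thread meanwhile.
func fetchSync(url string) (string, error) {
	time.Sleep(100 * time.Millisecond)
	return "response from " + url, nil
}

func main() {
	done := make(chan struct{})
	fetchAsync("a.example", func(body string, err error) {
		fmt.Println("callback:", body)
		close(done)
	})
	<-done

	body, _ := fetchSync("b.example")
	fmt.Println("sequential:", body)
}
```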

GreenThread

  • Runs in user space, avoiding the cost of switching between kernel and user mode
  • Scheduled by the language or framework layer
  • A smaller stack makes it possible to create instances in huge numbers (millions)

Several concepts

  • Continuation: a concept that may be unfamiliar outside FP circles, but it can simply be understood as a mechanism that lets a program pause and, on the next call to continue, pick up where it left off. It effectively gives a program call an extra entry point.
  • Coroutine: an implementation of continuations, usually surfacing as a language-level construct or a library. It mainly provides the yield and resume primitives.
  • Fiber: two sides of the same coin as a coroutine, described from the system's point of view. A coroutine, once running, can be thought of as a fiber (a minimal yield/resume emulation follows this list).
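Go has no yield keyword, but a channel-based generator can emulate the pause-and-continue mechanism described above. A minimal sketch (the counter and resume names are illustrative):

```go
package main

import "fmt"

// counter runs as a goroutine that "yields" values: each send parks it
// until the consumer calls resume, mimicking yield/resume semantics.
func counter(limit int) (resume func() (int, bool)) {
	ch := make(chan int) // unbuffered: the producer pauses at every yield
	go func() {
		defer close(ch)
		for i := 1; i <= limit; i++ {
			ch <- i // yield i; execution stops here until resumed
		}
	}()
	return func() (int, bool) {
		v, ok := <-ch // resume the producer and fetch the next value
		return v, ok
	}
}

func main() {
	next := counter(3)
	for v, ok := next(); ok; v, ok = next() {
		fmt.Println(v) // 1, 2, 3: picks up exactly where it left off
	}
}
```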

Goroutine


Goroutines are essentially an evolution and implementation of the GreenThread family of solutions described above.

  • First, Go builds in a coroutine mechanism, since user-mode scheduling requires a way to pause and resume code segments.
  • Second, it builds in a scheduler that runs coroutines in parallel across multiple threads, while its wrapping of the network and other standard libraries hides the scheduling details from the user.
  • Finally, it provides the channel mechanism for communication between goroutines, realizing the CSP (Communicating Sequential Processes) model. Because channels are provided through language syntax, many details are hidden from the user. A Go channel works like Java's SynchronousQueue; give it a buffer and it behaves like an ArrayBlockingQueue (see the sketch below).
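A short sketch of the two channel flavors: an unbuffered channel is a rendezvous (the SynchronousQueue analogy), while a buffered one decouples sender and receiver up to its capacity (the ArrayBlockingQueue analogy).

```go
package main

import "fmt"

func main() {
	// Unbuffered channel: a send blocks until a receiver is ready
	// (rendezvous semantics, comparable to Java's SynchronousQueue).
	unbuffered := make(chan int)
	go func() { unbuffered <- 42 }()
	fmt.Println(<-unbuffered)

	// Buffered channel: up to cap(ch) sends proceed without a receiver,
	// similar to an ArrayBlockingQueue of the same capacity.
	buf := make(chan string, 2)
	buf <- "a"
	buf <- "b" // still no receiver needed
	fmt.Println(<-buf, <-buf)
}
```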

Goroutine scheduler

This diagram comes up wherever the goroutine scheduler is discussed; see the original blog post in the references for details. Just a few points:

  1. M stands for a system thread, P for a processor (a scheduling context, one per core), and G for a goroutine. Go implements M:N scheduling, meaning threads and goroutines are in a many-to-many relationship. Many GreenThread/coroutine schedulers never achieved this. For example, before Java 1.1, Java threads were actually green threads (the term originates with Java), but because many-to-many scheduling was not implemented, there was no real parallelism and multi-core machines could not be fully exploited, so Java later switched to threads backed by the system kernel.
  2. If a system thread blocks, the goroutines queued on it are migrated to other threads. There are other mechanisms too: if an M is idle and the global queue has no tasks, it may steal tasks from other Ms, a rebalancing mechanism. The details are beyond this article; see the dedicated analyses if interested.
  3. The concrete strategy resembles the mechanism we analyzed earlier. At startup, the runtime launches a dedicated background thread (outside the pool of threads that schedule goroutines) to run netpoll. When a goroutine makes a network request, the network library associates the fd (file descriptor) with a pollDesc (a structure describing the netpoll state, including which goroutines are blocked reading or writing the fd) and calls runtime.gopark to suspend the current goroutine. When the background netpoll receives an epoll event (on Linux), it extracts the pollDesc from the event, finds the associated blocked goroutine, and resumes it.

Is Goroutine a silver bullet?

Goroutines have dramatically lowered the cost of concurrency. So can we simply go func everything that needs concurrency?

Go solves the CPU utilization problem with goroutine scheduling. But what about other bottleneck resources, for example shared resources guarded by locks, such as database connections? In an online Internet application, if every request is thrown into its own goroutine, then when such a resource becomes the bottleneck, huge numbers of goroutines pile up blocked and user requests eventually time out. We then reach for a goroutine pool for flow control, and the old question reappears: how many goroutines should the pool hold? (A minimal flow-control sketch follows.)
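One common flow-control idiom, as a minimal sketch: a buffered channel used as a counting semaphore in front of the bottleneck resource. Note that it answers nothing about the core question; the limit is still picked by hand.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const limit = 10 // e.g. the size of a database connection pool
	sem := make(chan struct{}, limit)

	var wg sync.WaitGroup
	for req := 0; req < 100; req++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire: blocks once limit is reached
			defer func() { <-sem }() // release

			time.Sleep(10 * time.Millisecond) // simulated query
			fmt.Println("request", id, "served")
		}(req)
	}
	wg.Wait()
}
```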

So the problem is still not solved from a more fundamental point of view.

The Actor model


For those who have not met the concept, an actor is an abstraction not unlike an OO object. Object-oriented programming abstracts reality as object = attributes + methods, but calling a method consumes the caller's CPU slice, and whether anything runs concurrently is decided by the caller. That abstraction differs from the real world, which is closer to the actor model: entities communicating through asynchronous messages. If you say hi to a pretty woman, whether and how she responds is up to her; the response runs in her own brain and does not occupy the sender's.

So Actor has the following characteristics:

  • Processing – an actor can compute without consuming the caller's CPU slice, and its concurrency strategy is its own business
  • Storage – an actor can hold state
  • Communication – actors communicate with each other by sending messages

Actors obey the following rules (illustrated by the Go sketch after this list):

  • Send messages to other actors
  • Create other actors
  • Receive and process messages, modifying their own state
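A minimal actor sketch in Go (names such as spawnActor are made up for illustration): a goroutine owns its state exclusively and is reachable only through its mailbox channel; the type switch stands in for pattern matching on messages.

```go
package main

import "fmt"

type greet struct{ name string }
type stop struct{ done chan struct{} }

// spawnActor starts a goroutine that owns its state exclusively and
// processes messages from its mailbox one at a time.
func spawnActor() chan<- interface{} {
	mailbox := make(chan interface{}, 16) // asynchronous mailbox
	go func() {
		count := 0 // private state: only this goroutine touches it
		for msg := range mailbox {
			switch m := msg.(type) { // stands in for pattern matching
			case greet:
				count++
				fmt.Printf("hi %s (message #%d)\n", m.name, count)
			case stop:
				close(m.done)
				return
			}
		}
	}()
	return mailbox
}

func main() {
	actor := spawnActor()
	actor <- greet{name: "gopher"}
	actor <- greet{name: "reader"}

	done := make(chan struct{})
	actor <- stop{done: done}
	<-done // wait for the actor to finish processing
}
```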

Actor goals:

  • Hot upgrades – actors can be upgraded independently while running. Because actors are not directly coupled and are relatively independent entities, hot upgrades are possible.
  • Seamless bridging of local and remote calls – because actors interact through message-based communication, talking to a local actor and a remote one look the same, which bridges the gap between local and remote invocation.
  • Fault tolerance – communication between actors is asynchronous; the sender just sends and does not concern itself with timeouts or errors, which are taken over by the framework layer and a separate error-handling mechanism.
  • Easy scaling, naturally distributed – because the communication mechanism bridges local and remote calls, when the local actors cannot cope, you can start actors on remote nodes and forward messages there.

Implementations of the Actor model:

  • Erlang/OTP – the benchmark of the Actor model; other implementations build on the Erlang model to some degree. Supports hot upgrades and distribution.
  • Akka (Scala, Java) – implemented on threads and asynchronous callbacks. Since the JVM has no fibers, it is thread-based, and to keep threads from blocking, all blocking operations in Akka must be made asynchronous: either through the asynchronous frameworks Akka provides, or by conversion to callback form via the future/callback mechanism. Distributed, but no hot upgrades yet.
  • Quasar (Java) – to solve Akka's blocking/callback problem, Quasar implements coroutines/fibers on the JVM via bytecode enhancement, and achieves hot upgrades through a ClassLoader mechanism. The drawback is that the bytecode enhancement must be applied through a JavaAgent at system startup.

Golang CSP vs. Actor


The motto for both is:

Don’t communicate by sharing memory, share memory by communicating

Both use message passing to avoid race conditions, but they differ in their concrete abstractions and implementations.

  • In the CSP model, the messages and the channels are the principals, and the processing parties are anonymous. The sender must care about its message type and which channel to write to, but not about who consumes it or how many consumers there are. Channels are generally type-bound: one channel carries messages of a single type, which is why CSP needs an alt/select mechanism for listening on several channels at once. A channel is synchronous (Go's channels support buffering, allowing a degree of asynchrony). The logic behind this is that the sender cares a great deal about whether its message is handled: CSP must guarantee every message is processed properly, blocking the sender until it is (see the select sketch after this list).
  • In the Actor model, the actors are the principals, and the mailbox (analogous to a CSP channel) is transparent. The sender cares about who the message goes to, not about the message type or the channel. A mailbox is therefore asynchronous, and the sender may not assume that a sent message is received and processed. The Actor model must support a strong pattern-matching mechanism, because messages of any type arrive through the same channel and must be dispatched by matching on them. The logic behind this is that the real world is inherently asynchronous and non-deterministic, so programs must adapt to programming under uncertainty. Since the advent of parallelism, the old deterministic programming mindset has been under challenge, and Actor builds this directly into the model.
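A tiny sketch of the CSP side of this contrast: typed channels are the principals, and select (the alt mechanism mentioned above) lets one receiver listen on several of them at once.

```go
package main

import "fmt"

func main() {
	// Type-bound channels: the sender cares about the channel,
	// not about who consumes the message.
	ints := make(chan int)
	strs := make(chan string)

	go func() { ints <- 1 }()
	go func() { strs <- "hello" }()

	// select waits on several channels simultaneously.
	for i := 0; i < 2; i++ {
		select {
		case n := <-ints:
			fmt.Println("int:", n)
		case s := <-strs:
			fmt.Println("string:", s)
		}
	}
}
```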

From this point of view, the CSP model is better suited to boss-worker style task distribution. It is less invasive: you can solve one specific problem with CSP inside an existing system. It makes no attempt to solve timeout or fault-tolerance problems in communication, which remain the initiator's responsibility. And because channels are explicit, although remote channels could be implemented (Go's original netchan mechanism was abandoned for its complexity, and a new netchan has been discussed), it is hard to make them transparent to users. Actor, by contrast, is a new abstraction; adopting it means rethinking the entire application architecture and way of thinking, and it tries to solve broader problems such as fault tolerance and distribution. The trouble with Actor is that at current scheduling efficiency, even with a goroutine-like mechanism, it is hard to match the efficiency of a direct method call. Building an "everything is an actor" language, in the spirit of OO's "everything is an object", is for now clearly inefficient. The practical compromise is to stay OO-based and abstract one layer of the system's components as actors.

A word about Rust


Rust's approach to concurrency starts by acknowledging that resources in the real world are always limited and that sharing can never be avoided entirely. In Rust's view, the problem with concurrency is not resource sharing itself but the incorrect use of shared resources. As mentioned earlier, in most languages nothing constrains how a caller uses a type you define; at best a document or an annotation (@ThreadSafe / @NotThreadSafe in Java) declares its concurrency safety, but that is only a hint and cannot prevent misuse. Go's -race mechanism can detect race conditions when unit tests are run with that flag, but if your tests are not concurrent enough, the races are simply never covered and go undetected. So Rust's solution is:

  • When defining a type, explicitly declare whether it is concurrency safe
  • Introduce the concept of ownership of variables. A non-concurrency-safe data structure passed between threads does not necessarily cause problems; what causes problems is multiple threads operating on it at the same time, which is to say, unclear ownership. With ownership, a variable can only be manipulated by the code in its owning scope, and passing the variable transfers ownership, ruling out race conditions at the language level.

With this mechanism, Rust checks for and rules out race conditions at compile time rather than at run time. It raises the mental cost at development time, but lowering the mental cost for callers and the cost of troubleshooting concurrency bugs makes it a distinctive solution.
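For contrast with Rust's compile-time checks, a minimal sketch of the run-time alternative mentioned above: run this with `go run -race` and the detector flags the unsynchronized increments, but only because the code actually exercises the race.

```go
package main

import "fmt"

func main() {
	counter := 0
	done := make(chan struct{})
	for i := 0; i < 2; i++ {
		go func() {
			for j := 0; j < 1000; j++ {
				counter++ // data race: two goroutines, no synchronization
			}
			done <- struct{}{}
		}()
	}
	<-done
	<-done
	fmt.Println(counter) // often not 2000; -race reports the race
}
```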

Conclusion

The revolution has not yet succeeded; comrades, we must still work hard.

This article has walked through the pain of concurrency and the various remedies. Each has its own advantages and applicable scenarios, but the problems of concurrency are far from solved. So keep working at it; there is still opportunity here.

One last idea, a brick thrown out to attract jade: implement actors on top of goroutines?

  • Distribution – having solved the single-machine efficiency problem, can we try to solve the distributed efficiency problem the same way?
  • Auto scaling – current auto-scaling schemes basically work by monitoring servers or the LoadBalancer and setting thresholds. Like the feeding example earlier, this is a solution based on external experience; combining signals from inside and outside the cluster could make it finer-grained and smarter.
  • Self-management – the ultimate goal of the two points above is a self-managing system. Anyone who has done system operations knows that we look after systems the way we look after children: constantly monitoring their state, accepting their alarms, then troubleshooting and firefighting. Children eventually grow up; can we let systems grow up and manage themselves? That goal is still distant, but I think it is within sight.

References and further reading


  1. Video of this talk
  2. Slides of this talk (PDF)
  3. The CSP model paper
  4. The Actor model paper
  5. Quantifying the Cost of Context Switch
  6. JCSP, a Java library implementing the CSP model
  7. Overview of Modern Concurrency and Parallelism Concepts
  8. Golang netchan discussion
  9. Quasar vs. Akka
  10. Concurrency is Not Parallelism, on the official Golang blog
  11. Go Scheduler, the source of the scheduler diagram referenced in this article
  12. Handling 1 Million Requests per Minute with Golang, a flow-control practice using goroutines

FAQ:


Q (from "Chuang", a reader of the High Availability Architecture public account): You said 1024 threads need 1 GB of stack space. But a thread's (and a process's) address space is virtual, and physical pages are only mapped in when the virtual addresses are actually touched. So if a thread's calls never go very deep, most of its stack is never backed by physical memory, which means 1024 threads wouldn't actually consume that much memory, right?

A: You're right: the Java heap and stack live in virtual memory, and starting a thread does not immediately consume that much physical memory. But threads are long-lived, and as the stack grows, its space is not reclaimed, up to the -Xss limit. The figure was meant to illustrate the cost of threads. Besides, even with empty threads (ones that sleep right after starting), in my test a server with 1 core and 1 GB of RAM hung after starting a bit over 30,000 threads (after raising the system's maximum thread count in /proc/sys/kernel/threads-max), which is still far from the ideal of millions.