
Preface

Today, WeChat has 800 million monthly active users, and there is no denying that its back end has formidable concurrency capability. But Rome was not built in a day, and WeChat's technology was once far less mature. WeChat launched in January 2011 and reached roughly 100 million users that year. By November 2013 it had 355 million monthly active users, making it the mobile messaging app with the largest user base in Asia. Faced with growth of that magnitude, the WeChat back end ran into a thorny dilemma; admirably, its engineers produced an elegant response in time. What is the technical story behind this? Right now, how are the requests you send from the WeChat mobile client digested and processed by the back end? This article focuses on the WeChat back-end solution built around the coroutine library libco (see the libco introduction and source download). The project keeps the synchronous style of agile back-end development, improves the system's concurrency capability, and saves a great deal of server cost; it has run stably on tens of thousands of machines at WeChat since 2013. This article is a further technical excavation and consolidation of "Open source libco library: the foundation of the back-end framework supporting 800 million WeChat users with tens of millions of connections on a single machine."

What problems did the original WeChat back end encounter?

In its early days, most WeChat back-end modules adopted a semi-synchronous, semi-asynchronous model, driven by complex, fast-changing business requirements and rapid product iteration. The access layer was asynchronous, while the business logic layer was a synchronous multi-process or multi-thread model whose concurrency was only in the tens to hundreds. As the business grew, by mid-2013 the WeChat back end had surpassed 10,000 machines and hundreds of back-end modules, handling billions of RPC calls per minute. At such a large and complex scale, every module was vulnerable to back-end service or network jitter, so an asynchronous transformation of the WeChat back end became urgent.

Considering schemes for asynchronous transformation of the WeChat back end

At the time, the WeChat technical team had two choices:

  • Plan A, threaded asynchronization: transform all services to an asynchronous model, which amounts to a complete overhaul from the framework down to the business logic code.
  • Plan B, coroutine asynchronization: a non-invasive asynchronous transformation of the business logic, modifying only a small amount of framework code.

The difference in workload and risk is obvious. Although plan A's server-side multithreaded asynchronous processing is common practice and serves the original goal of improving concurrency well, for a system as complex as the WeChat back end it would have been too time-consuming and too risky. The two plans share one technical detail: both must preserve the state of the current request across an asynchronous wait. In plan A, when a request needs to execute asynchronously, its relevant data must be saved explicitly while it waits for the state machine's next scheduled step. In plan B, the asynchronous state is saved and restored automatically: a resumed coroutine continues exactly from the context where it last yielded. The coroutine scheme therefore needs no explicit maintenance of asynchronous state: on the one hand, the code is simpler and more direct; on the other, a coroutine switch needs to save only a few registers, so coroutine-based services may even outperform purely asynchronous models in complex systems. Based on these considerations, WeChat finally chose plan B and carried out the asynchronous transformation of hundreds of back-end modules via coroutines.

How to take over legacy synchronous style apis

Once the plan was finalized, the next step was to implement asynchrony while changing as little code as possible. In general, a network back-end service needs connect, write, read, and similar steps. If these are called through synchronous-style APIs, the whole service thread hangs waiting for the network interaction, wasting resources while it waits. This clearly hurts the system's concurrency, yet the synchronous style was originally chosen for its unique advantages: the code is logical, easy to write, and supports rapid, agile iteration of the business. The WeChat team's revamp needed to eliminate the drawbacks of synchronous-style APIs while preserving the benefits of synchronous programming. In the end, the libco framework innovatively takes over (hooks) the network call interfaces without modifying any existing business logic code. The yield and resumption of a coroutine are registered as events and callbacks in the asynchronous network I/O: when a service process hits a synchronous network request, the libco layer registers it as an asynchronous event and the current coroutine gives up the CPU to another coroutine; libco automatically resumes the coroutine when the network event fires or the timeout expires.

Libco architecture


The libco architecture has remained essentially unchanged since its initial design; the most recent major update on GitHub was a functional one. Note: libco is an open source project: www.52im.net/thread-623-…



The libco framework has three layers: the coroutine interface layer, the system function hook layer, and the event-driven layer.

The coroutine interface layer implements the basic coroutine primitives. Simple interfaces such as co_create and co_resume are responsible for creating and resuming coroutines, while the co_cond_signal family of interfaces provides a coroutine-level condition/semaphore mechanism that can be used for synchronous communication between coroutines.



The system function hook layer is responsible for converting synchronous API calls into asynchronous execution. For common synchronous network interfaces, the hook layer registers the network request as an asynchronous event and waits for the event-driven layer to wake it up.



The event-driven layer implements a simple, efficient asynchronous network framework containing the event and timeout callbacks the framework requires. For requests coming down from the system function hook layer, event registration and callback are, in essence, the yielding and resumption of coroutine execution.

What does libco's choice of coroutines over threads mean?

Coroutines are less familiar to many people than threads. What do the two have in common? We can simply think of a coroutine as a user-mode thread: like a thread, it has its own register context and run stack, and the most intuitive effect for programmers is that code runs in a coroutine just as it does in a thread. But there are differences, and the ones to focus on are the run-stack management model and the coroutine scheduling strategy; the concrete implementation of both is discussed later in this article. What is the difference in practice? Creating and switching coroutines is far lighter-weight than for threads, and communication and synchronization between coroutines can be lock-free, because within a thread only one coroutine can be operating on resources at any moment. WeChat's plan was to use coroutines, but that meant facing the following challenges:

  • The industry had no experience with large-scale use of coroutines in C/C++;
  • How to handle synchronous-style API calls, such as sockets, mysqlClient, etc.;
  • How to control coroutine scheduling;
  • How to handle existing global variables and thread-private variables.

Let's look at how these four challenges were overcome.

Challenge 1: Unprecedented large-scale use of C/C++ coroutines

In fact, the concept of coroutines has been around for a long time, but it became widely known only in recent years thanks to its use in certain languages (Lua, Go, etc.); examples of coroutines actually used in C/C++ in mass production are rare. In libco, apart from the assembly code that saves and restores registers during a coroutine switch, everything is implemented in C/C++. So why C/C++? Most WeChat back-end services are based on C++, because the earliest WeChat back-end team was inherited from the Mailbox team, which had always used C++ as its mainstream back-end language, and C++ meets the performance and stability requirements of the WeChat back end. After coroutine support was added to WeChat's C++ back-end service framework, the tension between high concurrency and fast development was resolved: for the most part, developers only need to focus on concurrency configuration, not on the coroutines themselves. WeChat also tries other languages in some tools, but for the back end as a whole, C++ remains the team's mainstream language for the foreseeable future.

Challenge 2: Preserving a synchronous style API

We hook most synchronous-style APIs, and libco schedules coroutine resumption at the right moment. How do we keep the coroutine scheduler itself from being blocked? libco's hook layer handles the transition from synchronous APIs to asynchronous execution, but it currently covers only the main synchronous network interfaces, whose calls are executed asynchronously without blocking the system thread. A small number of synchronous interfaces remain unhooked, and calling them can leave the coroutine scheduler blocked waiting. Just as thread-safe functions are required when manipulating data across threads, a coroutine environment imposes coroutine-safe code constraints. In the WeChat back end we restrict functions that would block a coroutine, such as pthread_mutex and sleep-class functions (poll(NULL, 0, timeout) can be used instead). When transforming an existing system, the existing code must be audited for conformance to this coroutine-safety specification.

Challenge 3: Scheduling ten-million-level coroutines


For scheduling policy we can look to Linux process scheduling, which went through a very complex evolution from the early O(1) scheduler to today's CFS (Completely Fair Scheduler). Coroutine scheduling could, in principle, borrow those methods, but doing so would make switching costly. At the process/thread level, back-end services usually already do enough to exploit multicore resources, so coroutines should be positioned to maximize performance on top of that foundation.



libco's coroutine scheduling strategy is very simple: a coroutine is confined to a single fixed thread; it is switched out only while blocked waiting on network I/O, and switched back in when that network I/O event fires. At this level, a coroutine can be thought of as a finite state machine working inside an event-driven thread, a pattern back-end developers will recognize at a glance.



So how do you reach the ten-million level?



libco defaults to a separate run stack for each coroutine, allocating a fixed amount of heap memory as the run stack when the coroutine is created. If we use one coroutine per front-end access connection, then for a massive-access service, the concurrency limit of the service is easily bounded by memory.



So the question of magnitude becomes a question of how to use memory efficiently.



To solve this problem, libco introduces the shared stack model (the traditional stack-management models are stackfull and stackless). In simple terms, several coroutines share the same run stack.



When switching between coroutines on the same shared stack, the current running stack contents must be copied out to the outgoing coroutine's private memory. To reduce the number of such copies, a shared-stack copy happens only when the switch is between different coroutines; if the occupant of the shared stack has not changed, no copy of the running stack is needed.

More specifically: libco's default mode (stackfull) suits most business scenarios, but each coroutine occupies 128 KB of stack space, so 1 GB of memory supports only on the order of 10,000 coroutines. The shared stack is a newer libco feature that can support ten million coroutines on a single machine. In principle, shared-stack mode is a small innovation between the traditional stackfull and stackless modes: users can allocate several shared stacks of configurable size and specify which shared stack a coroutine uses when it is created.



How do coroutines switch among themselves, and how does a running coroutine voluntarily yield? Multiple coroutines sharing the same block of stack memory form a coroutine group. Switching between different coroutines in the group requires copying stack memory out to the outgoing coroutine's private space, while the yield and resumption of the same coroutine requires no copy at all, so the shared stack's memory can be thought of as "copy-on-write".



co_yield and co_resume are the APIs for yielding and resuming coroutines on the shared stack; libco copies the stack only on demand.

Challenge 4: Global vs. private variables

In stackfull mode, the address of a local variable stays the same for the coroutine's lifetime; in stackless mode, the address of a local variable becomes invalid as soon as the coroutine is switched out, which developers need to be aware of. By default, libco gives each coroutine its own run stack; in this mode, developers need to watch stack memory usage and avoid things like char buf[128 * 1024], which can overflow the stack and crash with a core dump. In shared-stack mode, although a coroutine can be mapped to a relatively large stack at creation, the storage cost of copying the used stack is paid every time the coroutine yields to another, so large local variables should still be minimized. Moreover, because multiple coroutines share the same stack space in this mode, the address of a local stack variable must never be passed across coroutines.

Coroutine-private variables are used much like thread-private variables: the variable is globally visible, and each coroutine keeps its own copy of it. Developers declare coroutine-private variables through our API macro, with no special considerations in use. When a single-process program is converted to a multithreaded one, global variables can be quickly adapted with __thread; for the coroutine environment we created the coroutine-private-variable macro ROUTINE_VAR, which greatly simplifies the transformation. Why is this necessary? Since coroutines execute serially within a thread, a thread-private variable can become a reentrancy hazard: a variable declared __thread, intended to be exclusive to one execution flow, may be manipulated by several coroutines once the execution environment migrates to coroutines, so the coroutines trample each other's values. For this reason, during the libco asynchronization we changed most thread-private variables into coroutine-level private variables. A coroutine-private variable has the property that when code runs in a multithreaded, non-coroutine environment it behaves as thread-private, and when code runs in a coroutine environment it is coroutine-private; the underlying implementation determines the runtime environment automatically and returns the correct value. Coroutine-private variables played an important role in the synchronous-to-asynchronous transformation of the existing codebase, and declaring one takes just a single line of code. In short, the libco library works as an event-driven finite state machine, with the upper processes/threads responsible for exploiting multicore resources.

The end result: success

The WeChat team once converted a purely asynchronous, state-machine-driven proxy service into a libco-based coroutine service and gained 10% to 20% in performance; moreover, batch requests were easy to implement under the coroutine-based synchronous model. As planned, libco was then used to transform hundreds of WeChat back-end modules to coroutine-based asynchrony. Throughout the transformation the business logic code was essentially unchanged, with modifications confined to the framework layer: business logic that used to execute in threads was moved into coroutines, and the work consisted mainly of auditing the use of thread-private variables, global variables, and thread locks to ensure no data corruption or reentrancy problems on coroutine switches. Today, most WeChat back-end services run a multi-process or multi-thread coroutine model, with concurrency capability qualitatively improved over before, and libco, born of this process, has become the cornerstone of the WeChat back-end framework. libco is an open source project and its source is kept up to date; see "Open source libco library: the foundation of the back-end framework supporting 800 million WeChat users" (original link: mp.weixin.qq.com/s/5UMIXIUvL…).