In this article, we summarize the work done by Baidu's C++ engineers on concurrency optimization.


I. Background

To recap, a program's performance is roughly determined by three components: algorithm complexity, IO overhead, and concurrency. Because of the complexity of modern computer architecture, engineers' performance work often focuses on the two directions other than algorithm complexity, namely IO and concurrency. In the previous article, Baidu C++ Engineers' Extreme Optimization (Memory), we introduced some optimization examples from the memory IO perspective.

This time we will talk about the other aspect of performance optimization: concurrency. Like IO, concurrency is not a strange concept to engineers with solid engineering experience, but everyone understands it a little differently. So let us go back to a more fundamental question: what is concurrency optimization actually about?

Yes, the question may seem like a bit of a leap, but before we naturally move on to how to handle various concurrency issues, we do need to pause and think about why we need concurrency in the first place.

The first concept that comes to mind is probably scale: we have to design large-scale Internet applications, large-scale machine learning systems. But on reflection, no matter what level of concurrency is used, a system of such scale is backed by hundreds or thousands of instances. That is, if a design (especially a stateless computing service) can already support a small business, then scaling up usually does not mean increasing the processing power of a single business unit, but adding more business units and solving the distributed problems that come with them.

What really makes concurrent programming valuable is the case where the processing capacity of a single business unit cannot meet demand, for example because a single request takes too long to process, or because business refinement keeps piling up complexity. So why has insufficient processing capacity of individual business units become a more prominent trend in recent years?

Perhaps the following statistic is illustrative:

(https://www.karlrupp.net/2015…)

The chart above shows long-term trends in CPU core metrics. The transistor count suggests that Moore's Law is still holding up, even if it is getting harder to sustain. However, in the last decade or so, mainly to control power consumption, CPU frequency growth has essentially stagnated, and the ever-increasing number of transistors has been used to build more cores instead.

From a CPU vendor's perspective, the performance delivered by a single chip has kept improving, but single-thread performance growth has slowed significantly. From an engineer's perspective, the biggest change is that the hardware dividend no longer translates transparently into performance improvements. Meanwhile, more accurate algorithms and more complex computing demands keep pushing up the required computing performance. In earlier years much of this growth could be absorbed simply by upgrading processors, but with clock frequency stagnating, a program that cannot exploit multiple cores will see its processing performance stagnate along with the frequency. Therefore, in recent years the ability to make full use of multiple cores has increasingly become the hallmark of a high-performance program; only by fully exploiting multi-core capability can a program keep showing exponential performance improvements as new hardware evolves. With the popularity of multi-core and multi-threaded programming, handling concurrency has gradually become a necessary skill for engineers.

The diagram above describes the basic principle of concurrent acceleration. First, the single execution block of the original algorithm is split into multiple sub-tasks that can run at the same time, and the cooperation between the sub-tasks is designed. After that, the capabilities of the underlying parallel execution components are used to actually overlap the sub-tasks in time, so as to genuinely improve processing speed.

Note that there is also a reverse arrow from the bottom up, which expresses that in order to use the parallel execution components correctly and efficiently, the concurrency design at the upper level often has to adapt in turn, for example through correct data alignment and reasonable critical section implementation. While the acceleration may seem to come entirely from the underlying parallel execution components, with the programmer only needing to split sub-tasks, at this stage the execution components are not transparent to the upper layer, so this reverse dependency remains critical for final correctness and performance. Understanding both the algorithm and the underlying design, and combining them into a reasonable concurrent transformation, is an important skill for engineers.
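As a minimal illustration of this split / cooperate / overlap idea (a sketch with made-up names, not code from the article), a large reduction can be cut into independent sub-tasks, overlapped in time with std::async, and then merged:

#include <algorithm>
#include <future>
#include <numeric>
#include <vector>

// Split one big reduction into sub-tasks, run them in parallel, then merge.
long long parallel_sum(const std::vector<int>& data, size_t parts) {
  std::vector<std::future<long long>> futures;
  size_t chunk = (data.size() + parts - 1) / parts;
  for (size_t begin = 0; begin < data.size(); begin += chunk) {
    size_t end = std::min(begin + chunk, data.size());
    // Each sub-task is independent, so the hardware can overlap them in time.
    futures.push_back(std::async(std::launch::async, [&data, begin, end] {
      return std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
    }));
  }
  long long total = 0;
  for (auto& f : futures) {
    total += f.get();  // the "cooperation" step: merge sub-task results
  }
  return total;
}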

III. Parallel execution in a single thread

When you think of parallel execution components, multi-core and multi-threading technologies probably come to mind first. But before we get to multithreading, let us look at some of the parallelism we need to care about even in single-threaded programming. Looking back at the processor trend chart above, you can see that although the clock frequency has not increased in recent years, and has even declined somewhat, single-thread processing performance has still improved slightly. This means that the computing power a single core delivers per clock cycle is still improving, and much of that improvement comes from fine-grained parallel execution within a single core and a single thread.

3.1 SIMD

One of the most important fine-grained parallel capabilities is SIMD (Single Instruction Multiple Data), a pattern in which multiple execution units apply the same instruction to multiple pieces of data at the same time. In the classical taxonomy, a single-core CPU is classified as SISD (Single Instruction Single Data), a multi-core CPU as MIMD (Multiple Instruction Multiple Data), and GPUs fall under SIMD. Modern CPUs, however, in addition to the multi-core MIMD base model, also come with fine-grained SIMD computing capability.

The diagram above is an Intel illustration of SIMD instructions: by widening registers, multiple data elements of smaller bit width can be stored in a single register, and with dedicated operation instructions the same operation can be applied to each of those narrow elements in one batch, such as the 8-way addition shown in the figure. In this case, the CPU only needs to widen the registers, the memory transfer paths, and the arithmetic units, and for this particular workload computing performance improves roughly eightfold. Compared with increasing the core count eightfold, the cost is very low: the instruction pipeline and the cache system are all reused.
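As a rough sketch of the same idea (assuming an AVX2-capable x86 CPU, and using 32-bit lanes rather than the exact instruction in the figure), one 256-bit instruction can perform eight additions at once:

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Add two int32 arrays; the main loop handles 8 elements per instruction.
void add_arrays_avx2(const int32_t* a, const int32_t* b, int32_t* out, size_t n) {
  size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
    __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
    __m256i vsum = _mm256_add_epi32(va, vb);  // 8 additions in one instruction
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i), vsum);
  }
  for (; i < n; ++i) {  // scalar tail for the remaining elements
    out[i] = a[i] + b[i];
  }
}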

From the perspective of CPU evolution, strengthening MIMD by adding cores is the most intuitive way to process more data per unit time. However, adding a core means adding a complete set of instruction-handling, pipeline and cache components, and in practice the extra cost of scattering data to the cores, gathering results back, and synchronizing between cores must also be considered. On one hand, the heavy component requirements make full core expansion expensive; on the other hand, for workloads with small data sets the inter-core transfer and synchronization overhead is too high, which further limits the applicable scope. To make the most of a limited transistor budget, modern CPUs therefore not only build more cores but also expand the processing and computing bit width of a single core along another dimension, raising theoretical computing performance (core count x data width).

When it comes to SIMD support on CPUs, one topic that cannot be avoided is the comparison with GPUs. The early SIMD instruction sets on the CPU (MMX) were born with a functional positioning very similar to that of GPUs, focusing on accelerating image-related algorithms; in recent years, with the rise of neural network computing, CPU SIMD has likewise turned toward general matrix computation acceleration. But because GPUs are designed for dense, repetitive, highly parallel workloads, their instruction, pipeline, and cache components can be much simpler than a CPU's, and are therefore much easier to scale by orders of magnitude. As a result, when the computation density is large enough and the data transfer and synchronization overhead is sufficiently diluted (which is exactly the characteristic of typical neural network computation), it is simpler and more efficient to let the CPU act only as the control-flow commander while the data is handed to the GPU in batches for execution.

Intel's own marketing for its SIMD instruction sets also focuses on neural network computing. However, in current engineering practice, mainstream dense computation is mostly done on GPUs. This has fueled the argument that SIMD instruction sets on CPUs are useless, and Intel's frequency downclocking on early AVX512 models in recent years has further reinforced the view that "the CPU should just do what a CPU is supposed to do". However, understanding SIMD on the CPU only from this angle is somewhat one-sided, and it is easy to overlook the application scenarios where SIMD on the CPU is genuinely meaningful.

For a program, define the pure computational work per unit of data read as its computational density, and the degree to which the algorithm executes the same computation flow over different data units as its pattern repetition; a program can then be placed into one of four quadrants. For high-density, highly repetitive workloads (typical heavy neural network computation) and for clearly low-density, non-repetitive workloads (such as HTML tree parsing), the industry actually has a fairly clear "best answer" in the choice between CPU and GPU. In the transition zone, however, where the repetition is not that strong or the density is not that high, the weaknesses of both sides are magnified. Even for a well-structured, repetitive workload, transfer and start-up costs become significant as the amount of computation shrinks; conversely, even for an irregular, less repetitive workload, the limited number of CPU cores gradually becomes a bottleneck as the amount of computation grows. At that point, the choice between a CPU with SIMD and a GPU with SIMT, and how to use them, becomes a much less clear-cut trade-off with differing opinions.

Even excluding heavy neural networks, repetition at some scale is a common characteristic of general programs. Conceptually, for example, every loop in a program represents some batch of repeated computational work. This fine-grained repetition at a certain scale is exactly where SIMD on the CPU has unique value, even though branches make it less "pure". For example, the most common SIMD optimization is actually memcpy: modern memcpy implementations detect the SIMD bit width supported by the CPU and use it to accelerate memory transfers. Likewise, modern compilers use SIMD instructions to optimize object copies and perform simple loop vectorization. Optimizations of this kind are largely "automatic and transparent", and they are an important, quiet driver of the slight performance gains we still see even though clock frequency has not changed.
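The kind of loop a compiler can vectorize on its own looks like this (a trivial sketch; with gcc or clang at -O3 the body is typically turned into SIMD loads and stores without any source change):

#include <cstddef>

// Independent per-element work with no loop-carried dependency: vectorizable.
// __restrict is a common compiler extension telling the optimizer the two
// pointers do not alias, which makes auto-vectorization easier.
void scale(float* __restrict dst, const float* __restrict src, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    dst[i] = src[i] * 2.0f;
  }
}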

Unfortunately, there is a limit to what simple automatic optimization can achieve. To make full use of SIMD acceleration on the CPU, programs still very much depend on active algorithm adaptation at the source level and purposeful use of SIMD; in other words, on actively carrying out this kind of concurrent transformation within a single thread. An example that cannot be optimized automatically is the string splitting optimization mentioned in the Memory article: with today's compiler analysis it is still difficult to extract a data-parallel pattern from a loop plus a conditional branch and convert it into a SIMD match & mask action. More remarkably, in recent years a number of algorithms have been redesigned around SIMD instructions, such as the Swiss Table hash table, the simdjson parsing library, and Base64 codec libraries, each bringing improvements of several times in its field. Adaptations of this kind are completely out of reach of automatic, transparent optimization. It is foreseeable that, especially as the AVX512 downclocking problem is gradually resolved on newer process nodes, more SIMD rewrites of traditional basic algorithms will (and will need to) emerge, and proficiency in SIMD instruction optimization will become a necessary skill for C++ engineers.
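To make the "match & mask" pattern concrete, here is a hedged sketch (assuming AVX2, and not the actual implementation from the Memory article): 32 bytes are compared against the delimiter in one instruction and the comparison result becomes a bitmask, instead of a per-character branch:

#include <immintrin.h>
#include <cstdint>

// Return a 32-bit mask with one bit set per delimiter byte in data[0..31].
inline uint32_t delimiter_mask(const char* data, char delim) {
  __m256i chunk = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data));
  __m256i pattern = _mm256_set1_epi8(delim);
  __m256i eq = _mm256_cmpeq_epi8(chunk, pattern);  // 0xFF where bytes match
  return static_cast<uint32_t>(_mm256_movemask_epi8(eq));
}
// A splitter can then scan 32 bytes at a time and peel off set bits with
// count-trailing-zeros, instead of branching on every character.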

3.2 OoOE


Another important intra-thread parallel capability is Out-of-Order Execution (OoOE). The classic textbook description of the CPU pipeline is the five-stage RISC pipeline shown below.

An instruction is simplified into fetch / decode / execute / memory access / write back stages. When the execution stage runs into data dependencies, cache misses and similar situations, the whole pipeline stalls. Among these, the memory access stage has a particularly large influence, mainly because deeper cache hierarchies and multi-core sharing make the number of cycles required for a single access increasingly uneven. With a multi-level cache, the pipeline in the figure above looks more like this:

To mitigate the effect of stalls, modern performance-oriented CPUs generally combine out-of-order execution with superscalar techniques. On one hand, key execution components, such as arithmetic units and memory access units, are duplicated to support parallelism. On the other hand, a buffer pool / queue mechanism is introduced in front of the execution units, and longer speculative execution is used to keep every unit as busy as possible. The design eventually moves from a pipeline to something more like "multithreading":

In an out-of-order execution system, a long instruction sequence is maintained through prediction and used to build an instruction pool. By resolving the dependencies among instructions in the pool, a DAG (directed acyclic graph) is formed. Based on this DAG, instructions whose dependencies are ready can enter the execution state and be submitted to the actual execution units. The execution units are similar to worker threads in the multithreading model, and are divided into computation units and memory access units according to their characteristics. Computation units generally have relatively fixed and predictable execution latencies, while memory access units, whose latencies vary widely, adopt an asynchronous callback model and support launching dozens of outstanding memory operations at the same time through the Load/Store Buffer.

The main difference between out-of-order execution and the traditional pipeline is that when a memory access instruction cannot complete immediately because of a cache miss, subsequent instructions with no dependency on it can "jump the queue" (similar to the multithreading model, where the OS suspends a blocked thread and dispatches other threads). Computation instructions that jump the queue can fill the idle window and make full use of the computation capacity, while memory access instructions that jump the queue can start their transfers earlier, so that memory stall periods overlap as much as possible and overall stalling is reduced. The efficiency of an out-of-order execution system is therefore largely determined by how "flat" the instruction DAG within the window is: a DAG with shallower dependency depth provides higher instruction-level parallelism, and thus higher utilization of the execution units and fewer stall cycles. On the other hand, since the Load/Store Buffer also has a capacity limit, when processing memory accesses over a larger region, the accesses likely to miss deeply should be arranged as close together as possible, so that their stalls overlap and the total stall time is effectively reduced.

Although the theory is clear, in practice it is often difficult to locate hot spots inside the out-of-order execution system when performance is observed only through external metrics. The most straightforward one, CPU utilization, only expresses how long the thread was not blocked, and says nothing about how efficiently the CPU's internal components were used. IPC (Instructions Per Cycle) is a step better and reflects utilization somewhat more deeply, but the factors affecting IPC are varied. Is instruction-level parallelism insufficient? Is there a heavy load of long-latency ALU operations? Are memory stalls too long? Or is the branch misprediction rate too high? In real programs these problems usually coexist and are hard to weigh against each other with a single statistic. For example, given 10 memory stalls, 20 unfinished ALU operations and 30 page table walks, where is the bottleneck? A single indicator can rarely answer that.

3.3 TMAM


TMAM (Top-Down Microarchitecture Analysis Method) is a method that uses the CPU's internal PMU (Performance Monitoring Unit) counters to decompose and locate component bottlenecks from the top down. At the top level, the maximum instruction completion rate is first taken as the baseline (for example, 4 micro-ops per cycle on Skylake). If this baseline is not reached, the bottleneck is attributed to components not being fully utilized. Further subdivision takes the instruction pool as the dividing line. If the instruction pool is not full and the fetch/decode components cannot deliver micro-ops at full rate, the bottleneck is in the "front end". Another way to miss the maximum instruction rate is when the front end delivers instructions into the pool but they produce no useful result because of mispredictions; such losses are classified as "bad speculation". The remaining case is back-pressure stalls caused by insufficient scheduling capacity out of the instruction pool, which is classified as a "back end" bottleneck. Each category can then be decomposed further; for example, a "back end" bottleneck can be broken down according to whether the stall is accompanied by under-utilized ALUs, a full Load/Store Buffer, and so on, forming a complete analysis methodology. On Intel platforms this process can be automated with pmu-tools, which is often very helpful in guiding fine-grained bottleneck analysis and optimization.

int array[1024];
for (size_t i = 0; i < 1024; i += 2) {
  int a = array[i];
  int b = array[i + 1];
  for (size_t j = 0; j < 1024; ++j) {
    a = a + b;
    b = a + b;
  }
  array[i] = a;
  array[i + 1] = b;
}

This is a demonstration of computing multiple rounds of a Fibonacci-like sequence. Because the inner loop has strong instruction dependencies and its length far exceeds the instruction pool depth of a typical out-of-order engine, there is a large computation dependency bottleneck, which the tool analysis confirms.

The program's IPC is only 1, and the internal breakdown also shows that execution-unit utilization is concentrated in the "back end" (only one port is used most of the time); out-of-order execution cannot help here.

int array[1024];
for (size_t i = 0; i < 1024; i += 4) {
  int a = array[i];
  int b = array[i + 1];
  int c = array[i + 2];
  int d = array[i + 3];
  for (size_t j = 0; j < 1024; ++j) {
    a = a + b;
    b = a + b;
    c = c + d;
    d = c + d;
  }
  array[i] = a;
  array[i + 1] = b;
  array[i + 2] = c;
  array[i + 3] = d;
}

This demonstrates the typical loop unrolling technique: by carrying out two independent computation chains simultaneously within the instruction window, instruction-level parallelism is improved. The effect can again be confirmed with the tool analysis.

In practice, however, computations that can iterate repeatedly within registers are not common. In most cases we see relatively light computation combined with more memory accesses, as in the following example:

struct Line {
  char data[64];
};
Line* lines[1024];
// ...
for (size_t i = 0; i < 1024; ++i) {
  Line* line = lines[i];
  for (size_t j = 0; j < 64; ++j) {
    line->data[j] += j;
  }
}

This is an example of accumulation over non-contiguous memory: the outer loop jumps across cache lines, while the inner loop performs independent computation and memory access within one contiguous cache line.

As you can see, this time the bottleneck is the memory access latency after the caches are missed all the way through, yet at the same time the memory bandwidth is not fully utilized. This is because, although the parallelism within the instruction window is not low, due to the cache hierarchy the multiple access instructions in the inner loop are actually all waiting for the same line to be loaded from memory into the cache. The miss pressure actually generated underneath is not enough to saturate the transfer bandwidth, yet the program shows large stalls.

for (size_t i = 0; i < 1024; i += 2) {
  Line* line1 = lines[i];
  Line* line2 = lines[i + 1];
  // ...
  for (size_t j = 0; j < 64; ++j) {
    line1->data[j] += j;
    line2->data[j] += j;
    // ...
  }
}

By adjusting the loop structure so that each inner iteration processes multiple lines at once, more lines can be transferred from the memory system simultaneously within the same stalled instruction window. The statistics show that the bottleneck gradually shifts from the latency of missing accesses to memory bandwidth, and the actual cache transfer component, the Fill Buffer, also starts to run at full load.

3.4 Summary of single-thread concurrency


When modern CPUs hit the frequency wall, they not only increase the number of cores but also gradually strengthen the parallel capabilities within a single core. If the popularity of multi-process and multi-threading technology has made exploiting multiple cores less rare and less difficult, then the parallel acceleration techniques within a single core, being more of a black box (multi-level caches and out-of-order execution) and less standardized (SIMD), are comparatively less well known and less utilized. Although exposing more hardware details to the application layer makes programming harder, difficulty and opportunity usually arrive together. Since this increase in complexity has become objectively unavoidable, making good use of it becomes a sharp tool for performance optimization. As architectures become more complex, the ability to use architectural principles and tools for optimization will inevitably be an important skill for server-side engineers for some time to come.

IV. Critical section protection in multithreaded concurrency

Multithreaded concurrency should be more familiar to engineers than single-threaded concurrency. Today the techniques for splitting computation across multiple threads are relatively mature, and every server engineer has a queue-plus-thread-pool toolkit of their own. Simply splitting a large task, submitting it to a thread pool and collecting the results may take only a few lines of code with little extra thought. The real difficulty often lies not in the "splitting" but in the "combining", that is, in the unavoidable operations on shared data after the task is split. At a higher, distributed level, the Share Nothing idea can be applied as much as possible: before the computation happens, resources are isolated as thoroughly as possible through task partitioning. But inside a specific compute node, with fine-grained splitting and acceleration, sharing is usually hard to avoid completely. Handling these unavoidable sharing problems correctly and efficiently involves an important technique in concurrent programming: critical section protection.

4.1 What is a critical section


When an algorithm is transformed for concurrency, two kinds of code segments are generally produced. One kind can be executed independently by multiple threads without interaction; as cores are added, this part scales out smoothly. The other kind must operate on shared data to complete its work; to execute correctly, it cannot be run by multiple cores at the same time and threads can only pass through it one by one. As a result, the code in the critical section cannot scale with the number of cores and often becomes the bottleneck of a multithreaded program. Because of this, the efficiency of the critical section is very important, and critical section protection is the technique of ensuring that every thread passes through the critical section safely.

4.1.1 Mutual Exclusion

The most basic protection method for a critical section is mutual exclusion. This is a typical pessimistic locking scheme: the critical section is assumed to be contended with high probability, so an underlying mechanism is used for arbitration, and only after successfully acquiring ownership does a thread enter the critical section. This mutual exclusion scheme has a typical global blocking problem: when the thread inside the critical section is blocked or swapped out by the operating system, a global idle window appears, during which not only can the owner make no progress, but the threads that have not acquired the lock can only wait along with it, so the blocking is amplified. For the parallel segments, by contrast, the blocking of a single thread affects only itself, as shown by the second segment in the figure above.

Because blocking inside the critical section is often unexpected, for example a page fault, or a complex memory reclamation triggered while allocating memory, the blocking diffusion problem gets worse. Another thread could have entered the critical section first and finished faster and more smoothly, but instead all the concurrent participants must wait for the current holder to finish something that is not so "critical". Because of this possibility of global blocking, algorithms that protect critical sections with mutual exclusion have the lowest blocking tolerance, and are generally cited as the typical negative example in the field of non-blocking algorithms.

4.1.2 Lock Free

To solve the blocking problem of mutual exclusion, an improved critical section protection scheme is lock-free technology. Although it is called lock-free, the name is mainly taken from a level in the hierarchy of non-blocking algorithms; in essence it is an optimistic locking scheme. That is, the critical section is executed up front on the assumption that there is no contention, but through careful design this speculative execution is conflict-detectable and can be rolled back. In the end, a synchronized commit operation is designed, usually based on an atomic CAS (Compare And Swap) on a variable, or on version checking and similar mechanisms. If a conflict occurs during the commit phase, the parties arbitrated as losers must roll back their speculative execution of the critical section and start a new attempt.

The biggest difference between lock-free and mutual exclusion is that the core execution of the critical section can proceed independently like the parallel segments; unlike true parallel segments, however, of the critical sections executed simultaneously only one is ultimately valid, while the rest are arbitrated as invalid and rolled back. But with this redundant execution introduced, when the critical section is blocked it no longer propagates among the participating threads as with mutual exclusion; instead a second-best thread commits successfully. Although from the perspective of an individual thread there are segments that perform no "substantially useful" computation, each such wasted segment necessarily corresponds to another thread performing a useful one. So at the level of the whole algorithm, it is guaranteed that there is no global stall and that some useful computation is always making progress.

4.1.3 Wait-Free

Lock-free technology mainly solves the problem of blocking propagation in the critical section, but in essence multiple threads still pass through the critical section in sequence. Figuratively speaking, it is somewhat like a road merge: whether mutual exclusion or lock-free, the two lanes are eventually merged into one, and the difference is only whether the traffic control is clever enough to keep the flow from breaking. Either way, throughput in the critical section degrades to serial execution, and that is a common drawback.

The key difference between the Wait-Free level and the lock-free level lies exactly in this throughput question. Besides the absence of global stalls, Wait-Free further guarantees that any participating thread completes within a bounded number of steps. This differs from lock-free: not only must the algorithm as a whole always make useful progress, but every individual thread must keep making useful progress as well. This requires that multiple threads not be serialized in a fine-grained way inside the critical section, but be able to perform useful computation simultaneously. Going back to the merge example above: at the Wait-Free level, the road after the merge still needs multiple lanes, so that everyone can make progress at the same time.

Although a number of algorithms reach the Wait-Free level in theory, most of them are conceptual explorations of no industrial use. This is mainly because Wait-Free only requires simultaneous progress, without saying how fast that progress is. To measure practical usefulness, a subclass called Wait-Free Population Oblivious was proposed, which additionally requires each participating thread to complete within a deterministic number of steps that can be given in advance and that is independent of the number of participating threads. This explicitly rules out schemes resembling inter-thread cooperation (which tends to cause heavy cache contention) and designs whose bounded number of steps is extremely large.

The figure above illustrates a typical Wait-Free Population Oblivious idea. Before performing the critical section operation, each participating thread is assigned a separate ticket through one cooperative step; afterwards each thread uses its ticket as an identifier to operate on a separate, non-interfering workspace and completes the operation there. Industrially usable Wait-Free algorithms are generally hard to design; for instance, the ticket mechanism requires the workspace assignment to be completed in one atomic cooperative step, while many data structures are not easily partitioned this way. Industrially usable Wait-Free algorithms for various data structures remain an area of ongoing exploration to this day.
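As a minimal sketch of the ticket idea (illustrative names and sizes, not an industrial algorithm), claiming a ticket is a single fetch_add, after which each thread writes into its own cell without touching anyone else's:

#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr size_t kSlots = 1024;          // assumed capacity of one collection round
std::atomic<uint32_t> next_ticket{0};    // the single cooperative step
uint64_t workspace[kSlots];              // one private cell per ticket

void record(uint64_t value) {
  // One atomic step whose cost does not depend on how many threads participate.
  uint32_t ticket = next_ticket.fetch_add(1, std::memory_order_relaxed);
  // Private workspace: no other thread writes this cell in the same round
  // (assuming fewer than kSlots operations per round).
  workspace[ticket % kSlots] = value;
}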

4.2 Lock free is not a panacea

From the perspective of non-blocking programming, there is a clear partial order among the critical section schemes above: Wait-Free > Lock-Free > Mutual Exclusion. But this is primarily a measure of blocking tolerance, and in principle it does not correspond directly to the performance dimension. Still, it easily leaves engineers with the general impression that "locks are evil, and implementing the algorithm without locks will significantly improve performance". Combined with the widespread belief that locking is expensive, many engineers tend to avoid locks at all costs in practice. Is this guiding principle actually correct?

Let’s start with a set of experiments:

uint64_t calc(uint64_t* sequence, size_t size) {
  size_t i;
  for (i = 0; i < size; ++i) {
    sequence[(i + 1) & 7] += sequence[i & 7];
  }
  return sequence[i & 7];
}

{ // Mutual Exclusion
  ::std::lock_guard<::std::mutex> lock(mutex);
  sum += calc(sequence, workload);
}

{ // Lock Free / Atomic CAS
  auto current = atomic_sum.load(::std::memory_order_relaxed);
  auto next = current;
  do {
    next = current + calc(sequence, workload);
  } while (!atomic_sum.compare_exchange_weak(
      current, next, ::std::memory_order_relaxed));
}

{ // Wait Free / Atomic Modify
  atomic_sum.fetch_add(calc(sequence, workload), ::std::memory_order_relaxed);
}

Here multi-threaded accumulation is used as the test case, and three methods are used to protect the critical section around the global accumulation result: accumulate under a lock, accumulate then commit with CAS, and accumulate then commit with FAA (Fetch And Add). For different levels of concurrency and different critical section workloads, the following three-dimensional charts can be drawn.

The Latency metric is normalized by dividing by the critical section size, which makes it easy to visualize how the protection overhead trends as the critical section workload changes; it is therefore not horizontally comparable across different workload levels. The Cycles metric is the total number of CPU cycles used to complete the same number of accumulations with multi-thread cooperation, and it naturally grows slightly with the critical section workload. The two cross-sections at workloads 100 and 1600 overlay the three algorithms for easier visual comparison.

Some conclusions can be drawn from the data above:

1. The FAA-based Wait-Free scheme is significantly better than the other approaches in every respect;

2. The lock-free algorithm has some advantage over the mutex in average throughput, but not by an order of magnitude;

3. As contention increases (larger critical sections, or more threads), the CPU consumption of the lock-free algorithm rises significantly.

Analyzing this information reveals something that contradicts the conventional wisdom about lock performance mentioned earlier: the performance divide is not between the lock-based mutex algorithm and the lock-free algorithm, but between Lock-Free and Wait-Free, both of which are "lockless". Moreover, in terms of CPU consumption, for complex critical sections under high contention even Lock-Free causes excessive consumption through too much wasted speculative execution. This shows that the locking operation itself, while slightly more expensive than an atomic operation, is not a terrible thing; what really hurts performance is the loss of parallelism when critical sections are forced to execute serially.

Therefore, when we face a critical section protection problem, we can first consider whether the protection can be done in a Wait-Free way; if so, the performance impact of the critical section can be nearly completely eliminated. In most cases, though, we end up protecting the critical section with a mutex or with Lock-Free. At that point serialization of the critical section is unavoidable, so the first common task is to shrink the proportion of time spent in the critical section as much as possible. Only when the critical section is already clearly short, and will not cause too much wasted speculation, is it worth discussing whether to further adopt Lock-Free to reduce the protection overhead. In addition, because a Lock-Free algorithm generally needs a two-phase commit of the critical section in order to support rollback, it is often more complex and may have worse locality than the corresponding mutex-protected algorithm (for example, in some scenarios a linked list has to be introduced to replace an array). In general, if Wait-Free is not achievable, there is no need to be overly attached to Lock-Free; a mutex combined with a well-trimmed critical section is often enough to provide performance comparable to Lock-Free.

4.3 Concurrent counter optimization case


The experiments on the various critical section protection methods above reveal another phenomenon: as the critical section gets smaller, the tendency of the protection overhead to grow with the number of threads becomes significant, even at the Wait-Free level, which by design should be independent of the number of participating threads. For concurrent counter scenarios, whose critical sections are tiny, this is still a significant problem. So let us first look at where these losses come from, starting with how locks and atomic operations are implemented.

First, here is a typical lock implementation. On the left is the lock's fast path: if the outer atomic variable operation detects no contention, locking and unlocking really only cost one set of atomic operations. Only when the fast path detects a possible conflict does the thread enter the kernel and try to queue. The fast path greatly improves lock performance in low-contention scenarios, and to reduce the memory cost of locks, modern operating system kernels provide "wait on address" functionality: to support the queuing mechanism, each lock normally needs only an integer's worth of storage, and the auxiliary structures are created, and paid for, only when a thread actually needs to wait.
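A simplified sketch of this structure, in the spirit of the classic futex-based mutex (roughly following Drepper's "Futexes Are Tricky", Linux only; the class name and details are illustrative, not any particular library's implementation):

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// state_: 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters.
class FutexLock {
 public:
  void lock() {
    int c = 0;
    // Fast path: an uncontended acquire is a single CAS, no syscall at all.
    if (state_.compare_exchange_strong(c, 1, std::memory_order_acquire)) {
      return;
    }
    for (;;) {
      // Mark the lock as contended before sleeping in the kernel.
      int expected = 1;
      if (c == 2 ||
          state_.compare_exchange_strong(expected, 2,
                                         std::memory_order_relaxed) ||
          expected != 0) {
        futex(FUTEX_WAIT, 2);  // wait on address while the value is still 2
      }
      c = 0;
      if (state_.compare_exchange_strong(c, 2, std::memory_order_acquire)) {
        return;  // acquired as a (formerly) contended holder
      }
    }
  }

  void unlock() {
    // Fast path: if the state was 1 nobody is queued, resetting is enough.
    if (state_.exchange(0, std::memory_order_release) == 2) {
      futex(FUTEX_WAKE, 1);  // someone may be queued in the kernel: wake one
    }
  }

 private:
  long futex(int op, int val) {
    return syscall(SYS_futex, &state_, op, val, nullptr, nullptr, 0);
  }
  std::atomic<int> state_{0};  // the only per-lock storage on the fast path
};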

So in a practical design, as long as locks are fine-grained enough to break up contention, they can be created in large numbers, or even profligately. The most typical example is the design of the counter framework bvar in brpc.

This is the design of the basic statistics framework in bvar: both the thread-local counting and the global aggregation are protected by a per-TLS lock. Because the collection cycle is very long, contention is negligible, so even though a large number of locks is used by default (number of statistics x number of threads), the memory cost is not significant, and the runtime overhead is low enough to support arbitrary aggregation operations. This example further illustrates that the cost of the lock itself is not significant; the serialization, in software or in hardware, caused by contention is the core of the cost.
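The pattern can be sketched like this (a simplified illustration of the idea, not the bvar code; for brevity the thread-local slot here supports only a single counter instance):

#include <cstdint>
#include <mutex>
#include <vector>

// Each thread accumulates into its own slot under a private lock; a sampling
// thread periodically sums the slots. The per-slot lock is almost never
// contended, so its cost is essentially one uncontended atomic pair.
class ShardedAdder {
 public:
  void add(int64_t value) {
    Slot& slot = local_slot();
    std::lock_guard<std::mutex> guard(slot.mutex);  // owner thread + rare collector
    slot.value += value;
  }

  int64_t collect() {  // called infrequently by the sampling thread
    std::lock_guard<std::mutex> guard(registry_mutex_);
    int64_t sum = 0;
    for (Slot* slot : slots_) {
      std::lock_guard<std::mutex> slot_guard(slot->mutex);
      sum += slot->value;
    }
    return sum;
  }

 private:
  struct Slot {
    std::mutex mutex;
    int64_t value = 0;
  };

  Slot& local_slot() {
    // Note: thread_local here is shared across ShardedAdder instances, so this
    // sketch supports one counter; a real framework maps (thread, counter) pairs.
    thread_local Slot* slot = [this] {
      auto* s = new Slot;  // leaked on thread exit; ignored in this sketch
      std::lock_guard<std::mutex> guard(registry_mutex_);
      slots_.push_back(s);
      return s;
    }();
    return *slot;
  }

  std::mutex registry_mutex_;
  std::vector<Slot*> slots_;
};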

However, even when contention is low, a lock is still implemented with a set of atomic operations, and an atomic operation itself is built from atomic instructions backed by a cache lock; such an instruction also acts as a memory barrier in out-of-order execution and reduces the opportunity for overlap. Therefore, for the very common case of simple counters, our internal work at Baidu goes a step further and removes even the thread-local lock, to cut the statistics overhead further.

For example, IntRecorder needs to record both a count and a sum; because that requires two 64-bit additions, it used to rely on a lock to guarantee atomic updates. However, with the spread of newer hardware, 128-bit atomic load/store is available on recent x86 and ARM server models, so with the corresponding wide instructions and correct alignment the lock can be removed.
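The idea can be sketched roughly like this (not the actual IntRecorder; whether the 16-byte operations are truly lock-free depends on the platform and build flags, e.g. cmpxchg16b with -mcx16 on x86-64, so check is_lock_free() before relying on it):

#include <atomic>
#include <cstdint>

// Pack count and sum into one 16-byte value, so the single writing thread can
// publish both with one wide store and the collector can read a consistent
// pair with one wide load -- no lock and no torn (count, sum).
struct alignas(16) CountSum {
  int64_t count;
  int64_t sum;
};

std::atomic<CountSum> record{CountSum{0, 0}};

void add_sample(int64_t value) {           // called only by the owner thread
  CountSum cur = record.load(std::memory_order_relaxed);
  record.store(CountSum{cur.count + 1, cur.sum + value},
               std::memory_order_release);
}

CountSum read_sample() {                   // called by the collector thread
  return record.load(std::memory_order_acquire);
}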

Another example is the Percentile quantile statistics. Since the sample set is a multi-element container and the quantile computation requires periodic clearing and recomputation, it also used to be protected with mutual exclusion. However, by introducing a version number mechanism and handing the flush operation over to the counting thread itself, the reading and writing of the sample region are completely separated; on that basis it is relatively easy to make the structure thread-safe without even introducing atomic modifications. Strictly speaking, asynchronous clearing may lose a few samples at the boundary, but since the core reservoir sampling algorithm is itself randomized, the accuracy is sufficient for monitoring statistics.

In addition to the runtime operations, the organization of the thread-local variables, previously managed as a linked list protected by a lock, was replaced with segmented storage indexed by thread id to achieve spatial contiguity. As a result, the overall performance of the counters improved further.

4.4 Concurrent queue optimization case


Another common data structure in multithreaded programming is the queue. To make concurrent enqueue and dequeue operations safe, the most basic algorithm is to protect the whole queue with a single lock.

The drawback of this approach is obvious: queues tend to act as the data hub of a multithreaded program, and under heavy contention the serialization of queue operations easily limits the parallelism of the overall computation. A natural improvement is therefore to protect the head and the tail separately, decoupling producers from consumers first and adding only the synchronization needed to keep dequeues from overtaking enqueues. This is also the approach used by LinkedBlockingQueue in Java.
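A sketch of this head/tail split (in the spirit of the Michael & Scott two-lock queue, simplified and non-blocking when empty; not LinkedBlockingQueue itself): producers only ever take tail_mutex_ and consumers only take head_mutex_, so they no longer serialize each other.

#include <atomic>
#include <mutex>
#include <utility>

template <typename T>
class TwoLockQueue {
 public:
  TwoLockQueue() : head_(new Node), tail_(head_) {}  // start with a dummy node

  void push(T value) {
    Node* node = new Node;
    node->value = std::move(value);
    std::lock_guard<std::mutex> guard(tail_mutex_);
    // Publish the new node; the release pairs with the consumer's acquire.
    tail_->next.store(node, std::memory_order_release);
    tail_ = node;
  }

  bool try_pop(T& out) {
    std::lock_guard<std::mutex> guard(head_mutex_);
    Node* first = head_->next.load(std::memory_order_acquire);
    if (first == nullptr) {
      return false;  // empty
    }
    out = std::move(first->value);
    delete head_;    // old dummy: producers only ever write the tail_ pointer
    head_ = first;   // the popped node becomes the new dummy
    return true;
  }

 private:
  struct Node {
    T value{};
    std::atomic<Node*> next{nullptr};
  };
  std::mutex head_mutex_;
  std::mutex tail_mutex_;
  Node* head_;  // dummy node
  Node* tail_;
};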

With head and tail separated, further optimization proceeded in two directions. First, single-node operations can be made Lock-Free, which led to the classic Michael & Scott lock-free queue algorithm. Typical industrial implementations include ConcurrentLinkedQueue in Java and boost::lockfree::queue in Boost.

The other direction is queue sharding: the queue is split into multiple sub-queues, a token is used to pick a sub-queue, and a traditional queue algorithm is used inside each sub-queue. tbb::concurrent_queue is a typical implementation of a sharded queue.

Comparing the two approaches shows that under strong contention, the sharded queue actually performs significantly better than the purely lock-free one, which again reflects the earlier analysis of what lock-free technology really buys.

Beyond these general-purpose queues, there is a queue specialized for contended publishing with serial consumption, bthread::ExecutionQueue, which is used in brpc to solve the problem of multiple threads contending to write the same FD. With a few interesting tricks, it achieves the Wait-Free level on the multi-producer side.

The whole queue keeps only a tail pointer, no head pointer. On the producer side, the first step is to atomically exchange the new node with the current tail pointer, and the second step is to link the new node to the previous tail. Since enqueueing always completes in these two fixed steps regardless of contention, the enqueue algorithm is Wait-Free. On the consumer side, however, the consumer also starts with an atomic exchange, replacing the tail with nullptr, and then traverses the singly linked list it obtained. But since production is done in two steps, the consumer may find that some nodes are still in a "broken chain" state: it has no way of knowing the successor nodes and can only spin, waiting for the producer to finish its second step. So strictly speaking the production/consumption algorithm as a whole is not even Lock-Free, because if a producer is switched out between its two steps, the consumer is hit by blocking propagation and consumption as a whole can only stall. However, in the scenario of queued FD writes it makes sense to optimize specifically for production concurrency, and this achieves better overall efficiency.
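The producer-side trick can be sketched as follows (a simplification of the pattern described above, with links pointing from newer to older nodes; the real bthread::ExecutionQueue differs in the details):

#include <atomic>

struct Task {
  int payload;
  std::atomic<Task*> older{nullptr};  // link to the task queued just before us
};

std::atomic<Task*> queue_tail{nullptr};

// Producer: always exactly two fixed steps, regardless of contention,
// which is what makes the enqueue side Wait Free.
void publish(Task* task) {
  Task* prev = queue_tail.exchange(task, std::memory_order_acq_rel);  // step 1
  task->older.store(prev, std::memory_order_release);                 // step 2
}

// Consumer: grab the whole pending chain in one exchange, then walk the
// `older` links. If some producer has finished step 1 but not step 2, the
// link is still null even though older tasks exist, so the consumer can only
// spin until it appears -- which is why the overall algorithm is not even
// Lock Free. (In this sketch nullptr is ambiguous between "first task of a
// batch" and "link not written yet"; the real implementation carries extra
// state to tell these apart.)
Task* grab_all() {
  return queue_tail.exchange(nullptr, std::memory_order_acq_rel);
}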

However, to make the algorithm work with plain atomic operations, bthread::ExecutionQueue has to organize its data as a linked list, and linked lists naturally suffer from scattered memory accesses. Can we use arrays instead, and still keep Wait-Free concurrency for producers, or even for consumers?

This is the problem that babylon::ConcurrentBoundedQueue tries to solve.

Before introducing the concurrency principle of this queue, let me insert an erratum. This queue was briefly mentioned at the end of the Memory article, where a hasty review suggested that at the acquire-release level, performance could be guaranteed without cache line isolation. After the article was published, I received good feedback from the industry, and the discussion revealed that the test case at the time happened to hit Intel's Write Combining optimization: when there is only one cache line waiting to be loaded, a write-only operation can complete ahead of time without blocking, and is committed in one go once the cache line actually arrives. However, because of memory ordering constraints, once a second cache line to be loaded is touched, the write combining on the first cache line cannot continue and must wait until the write to that second cache line has completed. In principle, write combining does relieve false sharing in write-only scenarios, but the restriction to a single outstanding cache line is quite limiting in real use. For example, in the typical queue scenario, data and completion flags are operated on together, and both are likely to miss, so write combining cannot apply; besides, only a meaningless benchmark would do that much neighbouring writing within one cache line load cycle. So, in conclusion, cache line isolation between threads is still generally necessary.

Back to the design idea of babylon::ConcurrentBoundedQueue: it takes sub-queue splitting to the extreme and reduces the synchronization granularity to each data slot. Each enqueue and dequeue request first obtains an increasing sequence number through an atomic increment, and is then mapped to a specific slot of a circular array. Depending on whether the operation is an enqueue or a dequeue, and on how many times the sequence has wrapped around the circular array, each slot sees a continuous sequence of versions. For example, enqueue 1 and a later enqueue that wraps around to slot 1 (say enqueue 5 with a capacity of 4) expect the version transitions 0 to 1 and 2 to 3 respectively. In this way the enqueues and dequeues of the same slot form a continuous version sequence; once an operation has its sequence number, it only needs to check the version explicitly to know whether it can proceed, and then publish its own version change to synchronize with the subsequent operation.

By assigning synchronization to each element, apart from the initial sequence-number acquisition, enqueue and dequeue operations only need element-level atomic synchronization and can otherwise proceed in parallel without interfering with each other. The more contiguous data organization also solves the scattered-access problem of linked-list storage. Concurrency on both the producer and the consumer side also gives it broader applicability; in practice it performs well in both MPMC (Multiple Producer Multiple Consumer) and MPSC (Multiple Producer Single Consumer) scenarios, especially where some small-batch processing is involved.
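The mechanism can be sketched like this (a minimal illustration with busy-waiting instead of real blocking, one possible version scheme, and only per-slot alignment for isolation; not the real babylon::ConcurrentBoundedQueue):

#include <atomic>
#include <cstddef>
#include <utility>
#include <vector>

template <typename T>
class BoundedQueueSketch {
 public:
  explicit BoundedQueueSketch(size_t capacity)
      : capacity_(capacity), slots_(capacity) {}

  void push(T value) {
    // One atomic increment decides both the slot and the versions this push
    // must wait for (even: empty for this round) and publish (odd: filled).
    size_t seq = push_seq_.fetch_add(1, std::memory_order_relaxed);
    Slot& slot = slots_[seq % capacity_];
    size_t expected = (seq / capacity_) * 2;
    while (slot.version.load(std::memory_order_acquire) != expected) {
      // the pop of the previous round has not released this slot yet
    }
    slot.value = std::move(value);
    slot.version.store(expected + 1, std::memory_order_release);
  }

  T pop() {
    size_t seq = pop_seq_.fetch_add(1, std::memory_order_relaxed);
    Slot& slot = slots_[seq % capacity_];
    size_t expected = (seq / capacity_) * 2 + 1;
    while (slot.version.load(std::memory_order_acquire) != expected) {
      // the matching push has not filled this slot yet
    }
    T value = std::move(slot.value);
    slot.version.store(expected + 1, std::memory_order_release);  // next round
    return value;
  }

 private:
  struct alignas(64) Slot {          // isolate slots on separate cache lines
    std::atomic<size_t> version{0};
    T value{};
  };
  const size_t capacity_;
  std::vector<Slot> slots_;
  alignas(64) std::atomic<size_t> push_seq_{0};
  alignas(64) std::atomic<size_t> pop_seq_{0};
};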


Recommended reading

Baidu C++ Engineers' Extreme Optimization (Memory)

Baidu's large-scale Service Mesh implementation practice

A system and method based on real-time quantile computation
