Behind Baidu's seemingly simple interface is a vast array of C++ services running in data centers across the country. How to improve performance and reduce latency and cost has become a required course for Baidu's C++ engineers. As optimization work has deepened, a body of performance-optimization theory and schemes has taken shape, along with a number of practical techniques that are both popular and refined. From the perspective of memory access, this article collects and summarizes some typical cases of general significance and shares them for learning and exchange.

1 Background

Behind Baidu's seemingly simple interface is a vast array of C++ services running in data centers across the country. Heavy use of C++ is a double-edged sword for Baidu: the learning curve is steep, and pointer errors are hard to locate and spread widely, which makes many developers flinch. On the other hand, the language layer introduces little overhead and gives strong control over low-level capabilities, providing an excellent practice ground for the pursuit of ultimate performance.

Mastering the underlying features and using them to guide application performance optimization has therefore become an essential skill for Baidu's C++ engineers. Over time, the pursuit of ultimate performance has gradually settled into a habit for Baidu engineers, and even formed a kind of technical belief. Let's take stock of and share some of the theory and practice accumulated by Baidu's C++ engineers on this journey of performance optimization, as well as the "strange techniques" discovered in the pursuit of perfection.

2 Reviewing performance optimization

As programmers, we all deal with performance to some degree, especially back-end service engineers working in C++, but each engineer's detailed understanding of what performance optimization means varies greatly. Below, several optimization cases are used to build an intuitive feel for performance optimization, and then, from the perspective of principles, to describe the angle of attack and methodological basis of the performance optimization discussed in this article.

2.1 Start with string processing

2.1.1 String as a buffer

To call low-level interfaces and integrate capabilities from third-party libraries, C++ strings often need to interact with C-style string APIs at the interface layer, typically like this:

size_t some_c_style_api(char* buffer, size_t size);
void some_cxx_style_function(std::string& result) {
    // First expand to a sufficient size
    result.resize(estimate_size);
    auto actual_size = some_c_style_api(result.data(), result.size());
    result.resize(actual_size);
}

One problem with this approach is that on the first resize, string zero-initializes the entire storage area of estimate_size bytes. However, in this scenario the actually valid data is written inside some_c_style_api, so the initialization done by resize is redundant. In scenarios where the interaction buffer is large, such as typical encoding conversion and compression operations, this redundant initialization introduces considerable overhead.

To address this, there have been ongoing attempts over roughly the last three years to push an improvement into the standard.

www.open-std.org/jtc1/sc22/w…

Note: those who use clang + libc++ are in luck on this issue; newer versions of libc++ already ship resize_default_init as a non-standard extension, so you can start using it now.
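As a rough illustration, a minimal sketch of what using it might look like, assuming libc++'s non-standard extension (the member is spelled __resize_default_init in the libc++ versions I am aware of; treat the exact name as an implementation detail):

std::string result;
// Grow without zero-filling: the new bytes are deliberately left uninitialized (libc++ extension).
result.__resize_default_init(estimate_size);
auto actual_size = some_c_style_api(result.data(), result.size());
result.resize(actual_size);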

Before the standard change lands, in order to use this ahead of time inside Baidu (where GCC 8 and GCC 10 are currently the widely used compilers), we built a resize_uninitialized specifically for GCC. Similar to the function above, inside Baidu it can be written as follows:

size_t some_c_style_api(char* buffer, size_t size);
void some_cxx_style_function(std::string& result) {
    auto* buffer = babylon::resize_uninitialized(result, estimate_size);
    auto actual_size = some_c_style_api(buffer, result.size());
    result.resize(actual_size);
}

2.1.2 Splitting strings

Another typical scenario in practice is parsing lightly schema'd data: text separated by some standard delimiter, typically '_' or '\t', is simply split into individual pieces (this is particularly common in rough processing of data such as logs). Because the scenario is so simple, the optimization space at the algorithm level is generally considered small, yet in real implementations code like the following is widespread:

std::vector<std::string> tokens;
// boost::split
boost::split(tokens, str, [] (char c) { return c == '\t'; });
// absl::StrSplit
for (std::string_view sv : absl::StrSplit(str, '\t')) {
    tokens.emplace_back(sv);
}
// absl::StrSplit without copy
for (std::string_view sv : absl::StrSplit(str, '\t')) {
    direct_work_on_segment(sv);
}

The boost version appears widely in code from newer engineers; its interface is flexible and well known, but it is not very efficient in actual business use: compared with Google's carefully optimized absl, there is a gap of several times. If the engineer additionally misses the single-character optimization (for example using is_any_of as in the official sample), the gap can reach an order of magnitude. Furthermore, thinking about the shape of the business, the typical post-split processing can be made zero-copy, which further removes redundant copies and the creation of large numbers of temporary objects.

Furthermore, given that Baidu's internal hardware environment mixes multiple generations of CPU models, explicitly rewriting split with SIMD and adapting it to the vector instruction sets of several CPU generations yields further gains. In particular, BMI-instruction acceleration of consecutive-delimiter detection within one SIMD stride, for example in dense short-string scenarios, can even deliver an order-of-magnitude improvement.

Finally, at Baidu, we can code it like this:

babylon::split([] (std::string_view sv) {
    direct_work_on_segment(sv);
}, str, '\t');

2.1.3 The magic of protobuf

With the wide adoption of brpc inside Baidu, protobuf has become the mainstream way of exchanging data internally, and code that parses, rewrites and assembles protobuf accounts for a certain share of almost every service. In recent years especially, as the trend toward microservices has added up, this data-exchange boundary has become even more prominent.

In some scenarios, such as passing data through while adding one field, or merging and returning data from multiple back-end stores, the cost of deserializing, modifying and reserializing with the standard C++ API can be significant relative to the actual business work being performed.

For example, let’s define a message like this:

message Field {
    bytes column = 1;
    bytes value = 2;
};
message Record {
    bytes key = 1;
    repeated Field field = 2;
};
message Response {
    repeated Record record = 1;
    bytes error_message = 2;
};

Let’s imagine a scenario where a logical record is distributed across multiple sub-systems. We need to introduce a proxy layer to merge multiple partial records. In general, the merge operation looks like this:

one_sub_service.query(&one_controller, &request, &one_sub_response, nullptr);
another_sub_service.query(&another_controller, &request, &another_sub_response, nullptr);
...
for (size_t i = 0; i < record_size; ++i) {
    final_response.mutable_record(i)->MergeFrom(one_sub_response.record(i));
    final_response.mutable_record(i)->MergeFrom(another_sub_response.record(i));
    ...
}

For a lightweight proxy, the cost of this repeated parsing, merging and reserializing of back-end responses becomes noticeable. To eliminate it further, we need to start from the nature of protobuf.

Protobuf is built on the wire format, an open standard, with SDKs on top that support construction and parsing in multiple languages. It is therefore feasible and effective to bypass the C++ API, go down to the wire format layer, and try to reduce the parsing and merging work there. So let's look at a few features of the wire format.

A serialized message is simply the concatenation of the serialization results of the fields it contains: there are no dividers between fields, and no framing information around the message as a whole. Thanks to this property, some merge and modify operations can be performed cheaply and safely directly on the serialized bytes. In addition, a message-typed field is laid out exactly like a string field, which means that defining a field as a message, or defining it as a string that stores that message's serialized bytes, is equivalent (both properties are officially guaranteed).

Developers.google.com/protocol-bu…
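As a quick illustration of that property (a minimal sketch reusing the Record message defined above; the code that fills in the messages is omitted), concatenating serialized messages and parsing the result is equivalent to merging them:

Record a;         // ... filled by one sub-service
Record b;         // ... filled by another sub-service

Record merged;
merged.CopyFrom(a);
merged.MergeFrom(b);

Record reparsed;
reparsed.ParseFromString(a.SerializeAsString() + b.SerializeAsString());
// merged and reparsed now hold equivalent content: repeated fields are
// appended, and for singular fields the later occurrence wins, just like MergeFrom.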

Combined with these features, the previous merge operation is transformed into:

// Modify the message definition on the proxy side so that record no longer needs a deep parse
message ProxyResponse {
    repeated string record = 1;
    bytes error_message = 2;
};

one_sub_service.query(&one_controller, &request, &one_sub_response, nullptr);
another_sub_service.query(&another_controller, &request, &another_sub_response, nullptr);
...
for (size_t i = 0; i < record_size; ++i) {
    final_response.mutable_record(i)->append(one_sub_response.record(i));
    final_response.mutable_record(i)->append(another_sub_response.record(i));
    ...
}

In a microservice context, an operation like this keeps the growth of the extra cost well under control.

2.2 Back to performance optimization

Generally speaking, a program's performance has roughly three components: algorithm complexity, IO overhead, and concurrency capability.

The primary factor is the algorithm complexity we are all familiar with. A core algorithm optimization or replacement can affect application performance by orders of magnitude: for example, LSM trees improved NoSQL throughput, and event triggering improved epoll's concurrency scale. Precisely because it receives so much attention, however, both the probability of getting it wrong and the room left for optimization in real engineering work are greatly reduced. Engineers who are not on the research frontier rarely encounter, when doing performance optimization, a situation that calls for improving the algorithm itself; analyzing the problem and choosing an appropriate algorithm does happen in some proportion, but its possible scope is also limited.

More often, the performance problems engineers have to solve are better described, to borrow a phrase from algorithm contests, as "getting stuck on the constant". That is, the correct and appropriate algorithm is used, but because the better implementation for the architecture was not adopted, too large a constant term is attached to the O(X), leaving overall performance short. In algorithm contests, constant traps and constant-factor tuning are interference that neither problem setters nor solvers want to see in quantity (the core algorithm is, after all, the point), but in a real engineering context, constant-factor optimization is often an important tool for performance work.

Now let's go back and review the three optimizations that opened this article. As you can see, none of them involves any improvement to the algorithm logic itself, yet deep use of low-level features (of the dependency libraries or the instruction set) still achieves improvements of several times or even orders of magnitude. This has a lot to do with how much more complex architectures have become in recent years, and the typical arenas for those changes are IO and concurrency. Concurrency optimization should be familiar to students with engineering experience, but IO, and "memory IO" in particular, may need special explanation.

Unlike device IO such as read/write/socket, which is written out explicitly in code, the IO of the storage hierarchy is easy to overlook because it happens transparently behind ordinary CPU instructions. Let's start from a page of numbers in a classic 2009 Jeff Dean lecture.

www.cs.cornell.edu/projects/la…

Although the data is more than a decade old, there is still an order-of-magnitude difference between the fastest L1 cache hit and a main memory access. These operations are not explicitly controlled in code; the CPU completes them transparently, which simplifies programming but also introduces a problem: seemingly identical algorithms can differ by orders of magnitude in their constant factors if they are not well adapted to the characteristics of the architecture. And because this difference is subtle, it is one of the things new engineers overlook most easily. Below, let's look at some of the "constant optimizations" done by Baidu's C++ engineers around the theme of memory access.

3 Start with memory allocation

To use memory you first have to allocate it. In the C++ era, with lifetimes managed automatically and all kinds of convenient class-based container wrappers, runtime allocation and release of memory have become more frequent. But because the address space is a resource shared by the whole process, contention has to be considered on multi-core systems. Experienced engineers will quickly think of two well-known memory allocators, tcmalloc and jemalloc, from Google and Facebook respectively. Let's look at how they work.

3.1 A first look at tcmalloc and jemalloc

To deal with multi-thread contention, tcmalloc and jemalloc both introduce a thread-cache mechanism: a large chunk of memory is fetched from the back end at once and placed in a cache that serves many subsequent requests from the thread, reducing the actual contention intensity on the back end. The typical difference is that when the thread cache is missed, tcmalloc falls back to a single global page heap (simplified; the transfer cache and central cache are also globally unique), whereas jemalloc maintains multiple arenas (by default more than the number of server cores). Consistent with the mainstream benchmark conclusions circulating online, with few threads or low allocation/release intensity, the relatively simpler tcmalloc performs slightly better than jemalloc; with many cores and high allocation/release intensity, jemalloc shows a strong advantage because its lock contention intensity is far lower than tcmalloc's.

Typical benchmarks end there, but we can take one more step: if we are willing to spend more memory on the cache layer to reduce back-end contention, can tcmalloc regain the lead? From the above analysis it seems we should be able to draw that inference, and since the bottleneck of modern server programs is often not memory footprint, this looks like a feasible approach.

After actually making the adjustment, however, engineers find that in most cases it does not achieve the expected effect. Even when perf analysis shows that the cost of contention and of the allocation/release actions themselves has become very small, overall program performance is still not satisfactory.

This actually reveals the performance impact of memory allocation continuity: the real accelerating factor in the thread cache is the batching of requests, not merely reducing back-end pressure. If the cache is made too large, long runs of repeated allocations and releases are all absorbed by the cache, and the address space of the memory blocks held there becomes more and more scattered, producing a shuffling effect.

Hardware cache optimization is generally built on locality, that is, memory near a program's recent accesses is likely to be the hot spot of subsequent accesses. If the program keeps requesting and accessing memory at jumping addresses, the underlying cache optimizations cannot do their job well. In terms of program performance, although the allocation and release actions themselves become cheaper, the program as a whole is not better off, because the constant term of the memory accesses made by the actually running algorithm has grown.

3.2 What is the ideal malloc model?

From the previous analysis, we have roughly identified two core elements of malloc: contention and continuity. So is jemalloc as good as it gets? To answer that, let's start by analyzing an actual memory usage model.

Consider a very typical program: at its core is a pool of continuously running threads; when a task arrives, each thread does its part and the task is completed. From malloc's point of view, this looks like multiple long-lived threads issuing malloc and free requests at random points in time. On that view, malloc cannot make any assumptions beyond basic locality, namely allocating address space that is as contiguous as possible to several adjacent mallocs, while using the thread as the unit of memory partitioning to reduce contention. And that is exactly what tcmalloc and jemalloc do.

But is this really where the boundary between memory allocation and the program has to lie? Without more input from the business layer, can malloc do better? So let's look at memory allocation from the business perspective instead.

Microservices, streaming computation, caching: these business models cover almost all mainstream back-end service scenarios, and an important feature they share is clearly delimited lifecycles. Going back to the early days, a typical server design started one thread per request and tore everything down afterwards; even with thread pools, one request being received and followed through to completion by one thread remained the mature design for quite a long time.

For this early business model, malloc could in fact exploit an extra piece of business information: memory dynamically requested by a thread is very likely to be released by that same thread at some point in the future. Looking at tcmalloc's thread-cache adjustment algorithm, one can see that it specifically optimizes for this extra information.

However, with the widespread adoption of newer sub-task-level thread-pool concurrency, where a request is subdivided into multiple sub-tasks to fully exploit multi-core parallelism, this business characteristic has almost disappeared at the interface malloc can see. All malloc observes is continuously running threads issuing random mallocs and frees, with large amounts of memory allocated in one thread, freed in another, and drifting among threads.

So in this job-centric setting, what allocation and release strategy works best from the perspectives of contention and locality? Here are two schemes.

3.2.1 Job arena

The first is the basic job arena scheme: each job has an independent memory allocator, and the dynamic memory used within the job is taken from the job's own arena. Because the job's lifecycle is clear, dynamic memory released in the middle of a job does not need to be reclaimed immediately, and not reclaiming it does not meaningfully increase memory usage. With reclamation out of the picture, allocation can be essentially contiguous within each thread, and when the job ends the whole memory block is released at once, greatly reducing actual contention.
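A minimal sketch of the idea (illustrative only, not Baidu's babylon implementation): allocations within one job are bump-pointer served out of large blocks, per-object frees are no-ops, and everything is returned at once when the job ends.

#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

class JobArena {
public:
    void* allocate(size_t bytes, size_t align = alignof(std::max_align_t)) {
        size_t offset = (cursor_ + align - 1) & ~(align - 1);
        if (blocks_.empty() || offset + bytes > kBlockSize) {
            // Fetch a fresh large block from the backend allocator
            // (oversized requests simply get a block of their own).
            blocks_.emplace_back(new char[std::max(bytes, kBlockSize)]);
            offset = 0;
        }
        cursor_ = offset + bytes;
        return blocks_.back().get() + offset;
    }

    // Per-object deallocation is deliberately a no-op: the job lifetime
    // bounds every allocation, so nothing needs individual reclamation.
    void deallocate(void*, size_t) noexcept {}

    // Destruction releases all blocks in one shot, so contention on the
    // backend happens once per job instead of once per allocation.

private:
    static constexpr size_t kBlockSize = 256 * 1024;
    std::vector<std::unique_ptr<char[]>> blocks_;
    size_t cursor_ = 0;
};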

Obviously, because the business lifecycle has to be perceived, the malloc interface alone cannot carry this information; instead, the containers used at runtime must expose their memory allocation interface separately. Fortunately, mainstream container libraries, led by the STL, generally implement some notion of an allocator concept, even if the details are not uniform.

For example protobuf, one of the most heavily used containers, introduced the Arena concept in protobuf 3.x, allowing a message's internal structures to be allocated from an arena. Unfortunately, arena allocation of string fields is not officially supported until the latest 3.15 series.

Github.com/protocolbuf…

However, string/bytes is such a widely used type in business code that leaving it out of the arena would compromise the continuity gains. So at Baidu we maintain an internal ArenaString patch that reproduces the approach described in that issue and its comments: a structure that "looks like" a string is allocated on the arena. For read interfaces, it can be exposed directly as const std::string&, since its memory layout is consistent with std::string's. For mutable interfaces, a substitute ArenaString wrapper type is returned, and with "auto" the switch is almost seamless.

The other heavily used family of containers is the STL series, including the STL implementations themselves and the boost/tbb/absl implementations of the same interfaces. Starting with C++17, the STL tried to separate the two duties that had been mixed together in the allocator, memory allocation and instance construction, producing the polymorphic memory resource (PMR) concept. With construction decoupled from allocation, a program no longer has to modify the type in a template parameter to plug in its own memory allocation scheme. PMR itself ships a ready-made allocate-contiguously, release-in-one-shot memory resource, monotonic_buffer_resource, but it is not thread-safe.
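For illustration, here is a minimal sketch using that standard building block (C++17 <memory_resource>; the buffer size and contents are arbitrary assumptions):

#include <memory_resource>
#include <string>
#include <vector>

void handle_one_job() {
    char buffer[64 * 1024];
    // Bump allocation out of a local buffer; nothing is freed until the
    // resource itself is destroyed, matching the job arena model above.
    std::pmr::monotonic_buffer_resource resource(buffer, sizeof(buffer));

    // Only the resource is injected; the vector propagates it to its
    // pmr::string elements automatically through uses-allocator construction.
    std::pmr::vector<std::pmr::string> tokens(&resource);
    tokens.emplace_back("allocated from the job-local buffer");
}   // all memory is reclaimed in one shot here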

Combining the ideas behind these two memory allocators, we implemented SwissMemoryResource, which uses a thread cache plus a lock-free circular queue to reduce contention, and whole-page acquisition with piecemeal hand-out to improve continuity, with interface adaptation to present a unified allocator to both the STL and protobuf. Finally, it is wired into brpc through a protocol plugin. Inside Baidu, we can use it like this:

babylon::ReusableRPCProtocol::register_protocol();
::baidu::rpc::ServerOptions options;
options.enabled_protocols = "baidu_std_reuse";

class SomeServiceImpl : public SomeService {
public:
    // The request and response themselves are allocated from the request-level memory resource
    virtual void some_method(google::protobuf::RpcController* controller,
                             const SomeRequest* request,
                             SomeResponse* response,
                             google::protobuf::Closure* done) {
        baidu::rpc::ClosureGuard guard(done);
        // Obtain the MemoryResource by casting to the concrete implementation
        auto* closure = static_cast<babylon::ReusableRPCProtocol::Closure*>(done);
        auto& resource = closure->memory_resource();
        // Do some request-level dynamic container work
        std::pmr::vector<std::pmr::string> tmp_vector(&resource);
        google::protobuf::Arena::CreateMessage<SomeOtherMessage>(&(Arena&)resource);
        ...
        // done->Run() is invoked by the ClosureGuard
    }
};

3.2.2 Job reserve

The more involved scheme is job reserve. Building on the job arena, it additionally refrains from destructing intermediate structures or releasing their memory when a job completes, and instead performs periodic compaction and rebuilding. This further requires the intermediate structures to support resetting while keeping memory, extracting their capacity schema, and rebuilding to that capacity. Let's take vector as an example:

The main difference from a typical vector is that after operations such as clear or pop_back shrink the size, the element objects are not actually destructed; they are merely cleared and reset.

So the next time the slot is used, the already-constructed element can be taken directly, still holding the memory within its own capacity. Reusing the same instance over and over, both the container's memory and the capacity of each individual element tend toward saturation, and repeated allocation and construction are largely eliminated. Engineers familiar with protobuf's implementation will recognize the same trick: a protobuf message's Clear also preserves the instance for reuse.
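A minimal sketch of this "clear without destruct" idea (illustrative only; a real container needs much more of the vector interface, and the element type must be default-constructible):

#include <cstddef>
#include <vector>

template <typename T>
class ReusableVector {
public:
    // Reuse an already-constructed slot when one is available; grow only
    // when the saturated capacity has not been reached yet.
    T& emplace_back() {
        if (used_ < storage_.size()) {
            return storage_[used_++];
        }
        ++used_;
        return storage_.emplace_back();
    }

    // Logical clear: elements beyond the new size stay constructed, so their
    // own buffers (e.g. std::string capacity) survive into the next round.
    void clear() noexcept { used_ = 0; }

    size_t size() const noexcept { return used_; }
    T& operator[](size_t i) { return storage_[i]; }

private:
    std::vector<T> storage_;
    size_t used_ = 0;
};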

However, engineers who care about locality will notice that although allocation demand is reduced, the final saturated memory layout is still not ideal in terms of continuity, because the on-demand dynamic growth happened without any regard for locality. So the container needs to support one more action: rebuild.

That is, after repeated reuse and many incremental allocations, the current capacity schema is extracted and an undisturbed reconstruction is performed in a new contiguous space, restoring the continuity of the memory blocks.

3.3 Summarizing memory allocation

Guided by the analysis of malloc's performance principles, these two finer-grained memory allocation and management schemes achieve better memory continuity with less contention.

In real measurements, simply applying the job arena usually captures most of the gains, typically a several-fold improvement, and the savings are observable at the whole-service level. Job reserve can push further, but for non-protobuf containers it requires implementing custom schema extraction and rebuild interfaces, and capacity saturation also increases memory usage; the cost of adopting it is therefore much higher, so in practice it is used only where performance is truly critical.

4 Let's look at memory access

With memory allocated, it is time to actually access it. Access requirements can generally be broken down along two dimensions: sequential access by a single thread, and shared access by multiple threads. Let's split these into two parts and discuss the optimization methods for each dimension.

4.1 Sequential access optimization

Generally speaking, when we perform large batches of memory accesses, efficiency improves if the addresses are contiguous. For example, in container traversal, data organized as an array generally has a clear performance advantage over data organized as a linked list. In fact, the continuity of allocated address space we pursued in the memory allocation section (together with the basically sequential access that follows) serves exactly this purpose.

So let’s take a look at how continuous access makes a difference in performance.

This section uses Intel CPU as an example to describe the prefetch process. See:

www.intel.com/content/dam…

When the hardware detects a contiguous-address access pattern, it activates prefetchers at multiple levels, loading the predicted data into an appropriate cache level based on the current load and other factors. When the subsequent access actually arrives, the data can then be fetched from a closer cache level, speeding up the access. Because L1 capacity is limited, L1's hardware prefetch stride is short and mainly aims to promote data from L2 to L1, while the L2 and LLC prefetch strides are longer and serve to promote data from main memory into the cache.

Here, the concept of locality acts as a kind of contract between software and hardware. Because programs' natural access patterns always exhibit some locality, hardware vendors predict the locality of program designs and do their best to accelerate accesses of that pattern, striving for a universal performance improvement; software designers, in turn, try to make their designs exhibit more locality so as to activate the optimization paths the hardware vendors built, further improving the performance of specific programs. In a sense, this can be seen as a mutually reinforcing cycle.

Here is an example demonstrating what it means to respect locality and how it affects the program.

std::vector<float> large_memory_buffer;
std::vector<float*> vecs;

std::shuffle(vecs.begin(), vecs.end(), random_engine);

for (size_t i = 0; i < vecs.size(); ++i) {
    __builtin_prefetch(vecs[i + step]);
    dot_product(vecs[i], input);
}

This is a common inner-product scoring scenario in recommendation/search systems: large-scale scoring through vector computation. With the same code, depending on whether the shuffle and the prefetch are present, the results look roughly like this:

  1. Shuffle & no prefetch: 44ms

  2. Shuffle & prefetch: 36ms

  3. No shuffle & no prefetch: 13ms

  4. No shuffle & prefetch: 12ms

From the difference between 1 and 3, the performance gap between identical instructions under different access orders can reach several times. The difference between 1 and 2 shows that manually adding prefetch improves things somewhat; in principle one could give even finer guidance on the prefetch stride and the target L1/L2 level, but matching the instruction execution cycle to the hardware's efficiency is difficult, so manual prefetch is generally reserved for cases that cannot be transformed into physical contiguity, and tuning its parameters is often something of a dark art. Finally, 3 and 4 show that even with contiguous access, prefetch still brings a small benefit, presumably because hardware prefetch cannot cross page boundaries and suffers repeated prediction cold starts, though the effect is relatively weak.

Perhaps the most instructive case is precisely a scenario like this inner-product scoring. To save space, engineers sometimes design the program to take vector pointers into scattered spaces and organize them into an array or linked list for management. This naturally saves memory by keeping every reference pointing at a single copy of the data. But if some redundancy is accepted and the needed vector data is copied together into one contiguous space, the traversal computation at retrieval time becomes significantly faster.

4.2 Concurrent access optimization

When it comes to concurrent access, we first need to introduce a concept: the cache line.

To avoid frequent interaction with main memory, the cache system in fact works a little like malloc: it defines a minimum management unit, the cache line (64B on mainstream CPUs), and all operations between memory and cache are carried out in units of whole cache lines.

For sequential access, for example, accessing the first byte brings all 64B of the line into L1, and accesses to the remaining 63 bytes are then served directly from L1. The first issue for concurrent access is therefore cache line isolation: data in two different cache lines can be loaded, evicted and transferred independently, because the cache line is the minimum unit that flows between cache levels and cores.

A typical problem is the well-known false sharing phenomenon: two pieces of data with no real contention are inadvertently placed in the same cache line, and for architectural reasons a "nonexistent race" is introduced. This is well documented online, and design documents such as brpc's and Disruptor's cover it in great detail, so no further elaboration is needed here.
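For completeness, a minimal sketch of the usual remedy (64 is assumed as the cache line size of mainstream CPUs, as noted above):

#include <atomic>
#include <cstdint>

// Without the alignment, two counters updated by different threads could land
// in one cache line and ping-pong it between cores despite having no logical
// relationship; alignas(64) gives each counter a cache line of its own.
struct alignas(64) PaddedCounter {
    std::atomic<uint64_t> value{0};
};

PaddedCounter per_thread_counters[2];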

4.3 Let’s talk about cache consistency

With false sharing ruled out, what remains is true sharing: data that genuinely has to sit in the same cache line (often it is the very same data) and be modified by multiple cores. Because a multi-core system holds multiple cached copies, consistency among those copies must be maintained, and this consistency is generally guaranteed by a family of state-machine protocols (MESI and its variants).

Roughly speaking, when competing writes occur, ownership of the line must be contended for; a core that does not own the line can only wait until the latest modification has been synchronized over before proceeding with its own modification. One thing worth addressing here is the widespread claim that the cache system introduces inconsistency, and that this is what causes the various multithreaded visibility problems.

In fact, MESI is essentially a "consistent" protocol: a cache system that complies with it presents sequential consistency to the CPU cores above it. One can think of the handling of cache contention as behaving exactly as if only main memory existed.

The choke point has merely moved from contending for writes to a single physical main-memory unit to contending, through the protocol, for exclusivity among multiple physical cache units. Because the blocking caused by contention is not actually relieved, the cache is paired with another component: the store buffer.

Note: the introduction of the store buffer is in fact also partly driven by out-of-order execution; this will be discussed again in the concurrency installment.

It is the introduction of the store buffer that really begins to bring visibility problems.

Taking x86 as an example: under write contention among multiple cores, a write that does not own the line cannot take effect in the cache layer, and waits in the store buffer instead. In general, the CPU can avoid stalling and continue executing subsequent instructions. That is, although the program issues a write and then a read, at the cache level the read is served first, while the write stays blocked until the cache is resynchronized. Note that if we focus on the cache interface, the overall order is still consistent, but at the instruction interface the order is inverted: a Store followed by a Load appears as Load then Store. This is the classic StoreLoad reordering, the only reordering scenario on x86. Typical RISC systems (ARM/POWER), however, do not even guarantee FIFO ordering of the store buffer, in order to improve pipeline parallelism: when one write is stuck in the buffer, later writes may be processed first, additionally producing StoreStore reordering.
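A minimal standalone litmus test of the StoreLoad reordering just described (a hypothetical sketch, not from the original article): with relaxed atomics on x86, the outcome r1 == 0 && r2 == 0 is observable, because each store may still be sitting in its core's store buffer when the other core's load executes.

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread1() {
    x.store(1, std::memory_order_relaxed);   // may linger in the store buffer
    r1 = y.load(std::memory_order_relaxed);  // can complete before the store is visible
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}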

The introduction of the store buffer lets contention avoid immediately blocking the instruction stream, tolerating it until the buffer fills up. However, because completing a write requires knowing that all L1 invalidations have been acknowledged, as the number of cores grows, some long-tail invalidations end up blocking the store buffer. Some RISC systems therefore introduce a further buffering mechanism.

This further buffer is commonly called the invalidation queue: a write is considered complete as soon as the invalidation message has been delivered to each core's invalidation queue, so invalidation long tails no longer stall writes. This change does partially break cache consistency: stale data may be read unless a core explicitly waits for its pending invalidation messages to drain.

As you can see, modern architectures have introduced several consistency-affecting mechanisms in order to optimize the bulk of ordinary operations. But to build correct cross-thread synchronization, consistency at certain key points is indispensable.

For example, mfence on x86 instructs subsequent loads to wait until the store buffer has drained, and ARMv8's LDA family ensures that subsequent loads wait until invalidations have taken effect. From here on, everything is strongly tied to the specific model and instruction design, and different instruction combinations produce a variety of memory visibility effects. This significantly raises the cost for engineers of writing correct synchronization and makes cross-platform portability hard to guarantee. This is where standardization comes in, and the standardized specification in the area of memory consistency is the memory order.

4.4 Let's talk more about memory order

As with other protocol-like mechanisms, the main function of memory order is to explicitly define an interface layer. Architecture experts abstracted several levels of logical consistency out of the physical-level optimization techniques, so that they can be described and expressed. With this abstraction as a common boundary, hardware and software developers can each optimize independently and still arrive at cross-platform common solutions.

For hardware developers, as long as specific instructions or instruction combinations can ultimately implement these memory order specifications, any design extension is in principle feasible without risking software compatibility. Likewise, software developers can write cross-platform multithreaded programs without attending to the underlying platform details, as long as they understand consistency at the standard logical level and use the correct memory order.

The official definition of the memory orders is long and extensive; to make them easier to understand, the following presents, from a developer's point of view, some concise, distilled notions (not a complete theory) to aid memory and understanding.

First, let’s look at what’s going on behind the scenes.

int payload = 0;
int flag = 0;
void normal_writer(int i) {
  payload = flag + i;
  flag = 1;
}
int normal_reader() {
  while (flag == 0) {
  }
  return payload;
}

As this example shows, at the compilation level the compiler by default reorders instructions it deems unrelated to some degree (as long as single-threaded correctness is not affected). On the other hand, the compiler may by default assume freedom from interference by other threads, so multiple consecutive memory accesses to the same data may be elided (here, flag could be loaded once and the loop turned into an infinite spin).

Let’s look at the most basic memory order level, relaxed.

int payload = 0;
std::atomic<int> flag {0};
void relaxed_writer(int i) {
    payload = flag.load(std::memory_order_relaxed) + i;
    flag.store(1, std::memory_order_relaxed);
}
int relaxed_reader() {
    while (flag.load(std::memory_order_relaxed) == 0) {
    }
    return payload;
}

After adopting the most basic memory order level, relaxed, the compiler no longer assumes that flag is unaffected by other threads, and reloads it on every loop iteration. One can also observe that the reordering of flag and payload is gone in the generated code, although in principle relaxed does not guarantee this order; that is, the ordering is not a promise made by the compiler. In general, the relaxed level differs little from plain reads and writes, except that the corresponding memory access can no longer be elided.

The next level up in the memory order is consume-release, but there is currently no corresponding implementation case; it is generally promoted by default to the next level, which is the first genuinely meaningful one: acquire-release. In principle it can be understood in the simplified form of a requirement and a promise, namely:

  • Requirement: a release write A and an acquire read B operate on the same variable M, and B reads the value written by A.

  • Promise: all other writes before A are visible to reads after B.

  • Practical impact:

  1. The operations involved are not reordered across A/B;

  2. x86: no additional instructions;

  3. ARMv8: A drains the store buffer, B drains the invalidation queue, and the order of A/B is preserved;

  4. ARMv7 & POWER: full barrier before A, full barrier after B.

int payload = 0;
std::atomic<int> flag {0};
void release_writer(int i) {
    payload = flag.load(std::memory_order_relaxed) + i;
    flag.store(1, std::memory_order_release);
}
int acquire_reader() {
    while (flag.load(std::memory_order_acquire) == 0) {
    }
    return payload;
}

Since x86's default memory order is already no weaker than acquire-release, ARMv8 assembly is used to demonstrate the effect: the corresponding instructions change from ST/LD to STL/LDA, so the ARMv8 architecture's instructions are used to implement the corresponding memory order semantics.

Going further, the strongest memory order level is sequentially-consistent, which is in fact the same level of commitment that MESI makes. It can likewise be understood in the simplified form of a requirement and a promise, namely:

  • Requirement: sequentially-consistent writes Am and An are made to two variables M and N, and in some thread Am is observed to precede An via sequentially-consistent reads.

  • Promise: in any other thread, Am will also be observed to precede An via sequentially-consistent reads B.

  • Practical impact:

  1. x86: Am and An drain the store buffer; the read operation B needs no additional instructions;

  2. ARMv8: Am and An drain the store buffer, B drains the invalidation queue, and the order is preserved;

  3. ARMv7: full barriers before and after Am and An, full barrier after B;

  4. POWER: full barrier before Am and An, full barriers before and after B.

It is worth noting that ARMv8 deliberately optimized the sequentially-consistent level, saving the cost of full barriers. Presumably because sequential consistency is the default level of the std::atomic implementation, it was specifically optimized to improve performance in the common case.

4.5 How does understanding memory order help us

To wrap up, here is a set of comparative data from a basic benchmark:

  1. Multiple threads competing to write adjacent addresses, sequentially-consistent: 0.71 unit time

  2. Multiple threads competing to write adjacent addresses, release: 0.006 unit time

  3. Multiple threads competing to write cache-line-isolated addresses, sequentially-consistent: 0.38 unit time

  4. Multiple threads competing to write cache-line-isolated addresses, release: 0.02 unit time

As can be seen, for sequentially-consistent writes, cache line isolation brings some benefit, while for release writes it is actually counterproductive. This is because under release, with no strong memory barrier, the store buffer itself acts as a buffer for the contention; once contention is sufficiently relieved, cache line isolation instead means more cache lines have to be transferred for the same throughput.

Guided by this information, we implemented a lock-free queue using the pattern of a circular array plus per-slot version numbers. Because queue operations only require the acquire-release level, the slot version numbers do not need cache line isolation, which yields relatively high concurrent performance.
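A minimal sketch of that circular-array-plus-slot-version pattern (essentially the well-known Vyukov bounded MPMC queue rather than Baidu's actual implementation; capacity is assumed to be a power of two, and the slots are deliberately left unpadded, in line with the measurements above):

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(size_t capacity)  // capacity must be a power of two
        : slots_(capacity), mask_(capacity - 1) {
        for (size_t i = 0; i < capacity; ++i) {
            slots_[i].version.store(i, std::memory_order_relaxed);
        }
    }

    bool try_push(const T& value) {
        size_t pos = tail_.load(std::memory_order_relaxed);
        for (;;) {
            Slot& slot = slots_[pos & mask_];
            // The slot version tells whether this slot is ready for this lap's write.
            size_t ver = slot.version.load(std::memory_order_acquire);
            intptr_t diff = (intptr_t)ver - (intptr_t)pos;
            if (diff == 0) {
                if (tail_.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed)) {
                    slot.value = value;
                    // Publish the data: release here pairs with the consumer's acquire.
                    slot.version.store(pos + 1, std::memory_order_release);
                    return true;
                }
            } else if (diff < 0) {
                return false;  // queue full
            } else {
                pos = tail_.load(std::memory_order_relaxed);
            }
        }
    }

    bool try_pop(T& value) {
        size_t pos = head_.load(std::memory_order_relaxed);
        for (;;) {
            Slot& slot = slots_[pos & mask_];
            size_t ver = slot.version.load(std::memory_order_acquire);
            intptr_t diff = (intptr_t)ver - (intptr_t)(pos + 1);
            if (diff == 0) {
                if (head_.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed)) {
                    value = slot.value;
                    // Hand the slot back to producers for the next lap around the ring.
                    slot.version.store(pos + mask_ + 1, std::memory_order_release);
                    return true;
                }
            } else if (diff < 0) {
                return false;  // queue empty
            } else {
                pos = head_.load(std::memory_order_relaxed);
            }
        }
    }

private:
    struct Slot {
        // T must be default-constructible and copy-assignable for this sketch.
        std::atomic<size_t> version{0};
        T value;
    };
    std::vector<Slot> slots_;
    size_t mask_;
    std::atomic<size_t> head_{0};
    std::atomic<size_t> tail_{0};
};

Every synchronization point here needs only an acquire-release pairing on the slot version, so no full barrier is required on x86 or ARMv8, which is where the measured gap relative to the sequentially-consistent default comes from.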

Read the original mp.weixin.qq.com/s/wF4M2pqlV…
