Preface

ByteDance deploys Service Mesh at large scale, providing traffic proxying for multiple protocols such as RPC and HTTP, along with a variety of service governance features. The Service Mesh architecture consists of a data plane and a control plane. Our data plane is developed on top of the open source Envoy project, with the core traffic proxying and service governance logic rewritten. The project is written in C++.

While optimizing the data plane, we explored C++ devirtualization and compilation optimizations based on the LLVM toolchain. By applying Link Time Optimization (LTO), Profile Guided Optimization (PGO), C++ devirtualization, and other compiler optimization techniques, we achieved a significant performance gain of 25%. This article shares our work on compiling and optimizing the ByteDance Service Mesh data plane.

Background

To provide better abstraction and scalability, C++ virtual functions are used heavily in the codebase; they make programs much easier to write. However, the machine instructions generated after compilation contain a large number of indirect calls, each of which requires a dynamic jump. Excessive indirect calls can cause the following problems:

  • Indirect jump overhead: since the actual function (or interface) address is resolved dynamically at runtime, the compiler cannot do much optimization and must emit an indirect call instruction, which is unfriendly to cache locality, instruction prefetching, and branch prediction.
  • No inline optimization: because a virtual function is polymorphic, the compiler cannot determine which implementation will execute at runtime, so no inlining can be performed. Moreover, in many cases a function is called only for part of its return value or for a side effect, while the implementation performs additional computation that inlining could have eliminated. Since inlining is impossible, an indirect call ends up performing more wasted computation.
  • Impediment to further optimization: an indirect call acts like a barrier in the instruction stream. Because it is resolved at runtime, it invalidates various compile-time control-flow analyses and code expansions, limiting the scope of further compilation and link-time optimization.

Despite their significant performance penalty, virtual functions are necessary: first, many modules inherently require dynamic subclass implementations; second, declaring functional modules as virtual interfaces is friendlier for testing and makes it easy to provide mock implementations; third, C++ has mature support for virtual functions and interfaces, and the resulting code structure is simple and clear. Even for statically polymorphic interfaces, supporting them with templates instead of virtual functions (e.g. CRTP) makes the code structure far more complicated, raises the learning curve, and is hard to maintain.
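
For comparison, a minimal sketch of that CRTP alternative (all names here are hypothetical): the virtual dispatch disappears, but the interface and every caller must become templates.

template <typename Derived>
struct StaticFoo {
    // statically dispatches to the derived class, no vtable involved
    void do_something() {
        static_cast<Derived *>(this)->do_something_impl();
    }
};

struct StaticFooImpl : StaticFoo<StaticFooImpl> {
    void do_something_impl() { /* ... */ }
};

// every user of the interface is now forced to be a template as well
template <typename T>
void use(StaticFoo<T> &foo) {
    foo.do_something();
}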

Considering the advantages of virtual functions and the cost of restructuring the code, we decided to keep the virtual function structure at the source level and reduce its performance overhead through compiler optimization.

Research

Optimizations targeting virtual functions (devirtualization, or indirect call promotion) fall into three general categories: Link Time Optimization (LTO), Whole Program Devirtualization (WPD), and Speculative Devirtualization. They work as follows:

  • Link Time Optimization (LTO): link-time optimization. At the compile stage, intermediate objects that preserve meta-information are generated instead of traditional binary objects; at the final link stage, all intermediate objects are combined, cross-module optimization is performed globally, and binary code is generated. LTO comes in two flavors, Full LTO and ThinLTO. Full LTO is mostly serial and makes linking time-consuming; ThinLTO trades a small optimization loss for a concurrent execution model, greatly speeding up linking. Because LTO has a global view at link time, it can perform type inference across modules and carry out some devirtualization.
  • Whole Program Devirtualization (WPD): obtains all subclass implementations of a virtual function by analyzing the class inheritance hierarchy of the whole program, and devirtualizes based on that result. This optimization only works in combination with LTO, and in our practice the results were not ideal (elaborated later).
  • Speculative Devirtualization: this optimization “speculatively” assumes that the runtime implementation at a virtual call site is one or a few specific subclasses; on a hit, the corresponding implementation is invoked directly, otherwise execution falls back to the normal indirect call. This optimization works best in combination with PGO.

This article focuses on the principles and practice of Speculative Devirtualization and PGO, and does not cover the principles of LTO and WPD in depth.

Speculative Devirtualization principles

We use an example to explain the principles of Speculative Devirtualization. Suppose we write an interface Foo and a concrete implementation FooImpl, as follows:

struct Foo {
    virtual ~Foo() = default;
    virtual void do_something() = 0;
};

struct FooImpl : public Foo {
    void do_something() override { ... }
};

Next, the Foo interface is used in other modules as follows:

void bar(Foo &foo) {
    foo.do_something();
}

After compilation, the machine instruction pseudocode for the bar function is roughly as follows:

addr = vtable.do_something.addr@foo
call *addr

The pseudocode above first loads the actual address of the do_something implementation for the argument foo, then executes a call instruction on that address. This is the basic mechanism of dynamic polymorphic dispatch.

For the preceding example, Speculative Devirtualization assumes that foo is most likely a FooImpl object in practice, and emits instructions that test this assumption speculatively: if it holds, FooImpl::do_something() is called directly; otherwise, execution falls back to the normal indirect call, as in the following pseudocode:

addr = vtable.do_something.addr@foo
if (addr == FooImpl::do_something)
    FooImpl::do_something()
else
    call *addr

As the pseudocode shows, after retrieving the actual function address, instead of immediately executing an indirect call, the code can call FooImpl::do_something() directly. In this example there is only one subclass; if there were several, there would be one if judgment per subclass, and only after all of them fail does execution finally fall back to the indirect call.

At first glance this increases the number of instructions, which seems counterproductive. However, assuming foo is of type FooImpl in most calls, this really just adds one address-comparison instruction, and because of the sequential execution nature of CPU instructions, there is no branch misprediction overhead (despite the if). Further, the direct call to FooImpl::do_something() and the call *addr in the else branch may look the same in a high-level language, but from the compiler’s point of view they are completely different: FooImpl::do_something() is statically known, so it can be inlined, which not only eliminates the function-call overhead but also eliminates unnecessary computation inside the implementation. Consider an extreme scenario where FooImpl::do_something() is an empty function: after inlining, the whole process shrinks from an indirect call to a single function-address comparison, a huge performance difference.
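
To make the extreme case concrete, here is a conceptual C++-style sketch (not real compiler output) of bar() after speculative devirtualization plus inlining, assuming FooImpl::do_something() has an empty body:

void bar(Foo &foo) {
    // load the do_something slot from foo's vtable (conceptual)
    auto addr = /* vtable.do_something.addr@foo */;
    if (addr == /* address of FooImpl::do_something */) {
        // FooImpl::do_something() is inlined here; the empty body leaves
        // nothing to execute, so the hot path is just this comparison
    } else {
        foo.do_something();  // cold path: ordinary indirect call
    }
}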

Of course, intuitive as this optimization is, if foo frequently has a type other than FooImpl, it becomes a negative optimization. For this reason, the optimization rarely takes effect by default and is mainly triggered during PGO optimization. Under PGO, the compiler has runtime profile information for the program, including the probability distribution of each indirect call over its target implementations, so it can enable the optimization for high-probability implementations based on that data.

PGO optimization practice

Profile Guided Optimization (PGO), also known as FDO (Feedback Directed Optimization), is an optimization technique that recompiles a program using profile data collected while the program runs. The principle is that a program behaves similarly on inputs with similar characteristics, so we can first collect profile data during a representative run and then use it to guide the compiler’s optimizations.

PGO optimization relies on profile data collected while the program runs. Profile data can be collected in two ways: one is compile-time instrumentation (for example, clang’s -fprofile-instr-generate compilation parameter); the other is using the Linux perf tool to collect perf data and then converting it into a profile format that LLVM can recognize. The second approach is commonly known as AutoFDO. The overall AutoFDO workflow is: run the optimized binary and collect perf data, convert the perf data into LLVM’s profile format, and recompile with that profile.

Our practice takes the second approach: runtime collection with perf. If the instrumentation approach were adopted, we could only collect profiles for specific benchmarks, not for real production traffic, since it is impossible to run an instrumented build in the production environment. The successful practice of PGO contributed greatly to devirtualization and, together with the other optimization mechanisms it enables, achieved a 15% performance gain. The following focuses on our PGO optimization work.

PGO optimization principles based on profile data

The profile data collected while the program runs records its hot functions and instructions. Without expanding too much here, two simple examples illustrate how this data guides the compiler during PGO.

Example of virtual function PGO optimization

The first example continues with the Foo interface from above. Suppose an application uses the BarImpl subclass and several others in addition to FooImpl. The profile data records, for the call site that fetches the actual function address before the call instruction, how many of the sampled calls went to FooImpl, BarImpl, and each other subclass implementation. For example, if that call site was sampled 10,000 times and 9,000 of those calls went to FooImpl, the compiler can enable Speculative Devirtualization for FooImpl and optimize 90% of the calls. As this shows, the optimization is especially effective for virtual functions with a single implementation, making their performance indistinguishable from ordinary direct calls while preserving the extensibility of virtual functions.
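
A conceptual sketch of the multi-target case, in the same pseudocode style as before; the ordering of the comparisons and the choice of hot targets come from the profile:

addr = vtable.do_something.addr@foo
if (addr == FooImpl::do_something)       ; ~90% of sampled calls
    FooImpl::do_something()              ; direct call, inlinable
else if (addr == BarImpl::do_something)  ; next-hottest target
    BarImpl::do_something()
else
    call *addr                           ; cold fallback: indirect call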

Example of branch judgment PGO optimization

The second example is an optimization for branch judgment. Suppose we have the following code snippet, which checks whether argument a is true: if so, the a_staff logic is executed; otherwise, the b_staff logic is executed.

if (a)
    // do a_staff...
else
    // do b_staff...
return

At compile time, since the compiler cannot assume the probability that a is true or false at runtime, it usually emits machine instructions in the original block order, as in the pseudo-assembly below. The test instruction checks whether a is true; if so, execution falls through to .if and runs the a_staff logic. Otherwise, it jumps to .else and runs the b_staff logic.

test a, a
je   .else  ; jump if a is false
.if:
; do a staff...
ret
.else:
; do b staff...
ret

During actual CPU execution, because of sequential instruction execution and pipeline prefetching, the instructions immediately following the current one are executed preferentially. The layout above favors a_staff: if a is true, the whole pipeline proceeds smoothly with no jump overhead. Conversely, the layout is bad for b_staff: if a is false, the a_staff computation already pre-executed in the pipeline is invalidated, and the instructions must be reloaded from .else to execute b_staff, which significantly hurts performance.

From this analysis, if a is usually true in actual operation, the code fragment is efficient, and vice versa. By collecting profile data while the program runs, we obtain the actual hit counts of the if and else branches at this judgment. With this statistic, if the else branch has the higher probability, the compiler can, in PGO compilation, reorder the emitted machine instructions to favor b_staff, similar to the pseudo-assembly below.

test a, a
jne  .if  ; jump if a is true
.else:
; do b staff...
ret
.if:
; do a staff...
ret
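
As an aside, PGO derives branch probabilities from the profile automatically; for comparison, the same layout preference can be stated by hand with the C++20 [[likely]]/[[unlikely]] attributes. A minimal sketch, purely for illustration of what the profile tells the compiler:

if (a) [[unlikely]] {
    // do a_staff...
} else [[likely]] {
    // do b_staff...  the compiler now lays this branch out first
}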

Profile data collection and conversion

To collect profile data for the Mesh Proxy at runtime, a normal optimized build must be produced first. To avoid ambiguous binary symbols for identically named static functions, to distinguish multiple function calls on the same line of C++ code, and to improve the effectiveness of PGO, we added the -funique-internal-linkage-names, -fdebug-info-for-profiling, and -Wl,--no-rosegment compilation parameters. Without them, the perf data collected by Linux perf cannot be converted by the AutoFDO tool into the format LLVM requires.
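
For illustration, a first-pass build command might look like the following; the compiler driver, optimization level, and output name are placeholders, and -Wl,--no-rosegment assumes linking with lld:

clang++ -O2 -g \
    -funique-internal-linkage-names \
    -fdebug-info-for-profiling \
    -fuse-ld=lld -Wl,--no-rosegment \
    -o mesh_proxy <sources...>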

After compiling, select an appropriate benchmark or real traffic to run the program, and use the Linux perf tool to collect perf data. Our verification showed that collecting with the Last Branch Record (LBR) feature enabled yields better optimization results. We run the following command to collect perf data for the Mesh Proxy process.

perf record -p <pid> -e cycles:up -j any,u -a -- sleep 60

After collecting the perf data, use the AutoFDO tool (github.com/google/auto…) to convert it into the LLVM profile format.

create_llvm_prof --profile perf.data --binary <binary> --out=llvm.prof

Optimized compilation with PGO

After obtaining the profile data, the final step, recompiling with PGO optimization, can be carried out. Note that the source code being compiled must be exactly the same as the source used for profile collection; otherwise the optimization effect is diluted. To enable PGO, simply add the -fprofile-sample-use=llvm.prof clang compilation parameter so that the profile data in llvm.prof guides the PGO compilation.
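
A sketch of the corresponding PGO build, identical to the first-pass command above except for the added profile parameter:

clang++ -O2 -g \
    -funique-internal-linkage-names \
    -fdebug-info-for-profiling \
    -fuse-ld=lld -Wl,--no-rosegment \
    -fprofile-sample-use=llvm.prof \
    -o mesh_proxy <sources...>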

After PGO compilation and optimization, the total number of indirect calls in the Mesh Proxy binary dropped by 80%, essentially fulfilling the goal of C++ devirtualization. In addition, PGO performs further inlining of the hot functions recorded in the profile, rearranges hot instructions and data layout, and strengthens other conventional optimizations, bringing significant performance benefits.

Other compilation optimization work

Full static linking and LTO practices

After the ByteDance Mesh Proxy reached a certain scale in production, we encountered some problems with dynamic linking, including machines possibly running older glibc versions, and the overhead of dynamically linked function calls themselves.

Considering that the Mesh Proxy runs as a standalone sidecar and does not need to serve as a library for other programs, we proposed building a fully statically linked binary. The benefits include avoiding glibc version issues, eliminating the overhead of dynamically linked function calls, and enabling further compilation optimizations that full static linking makes possible.

Since the binary then has no external library dependencies, we added further compiler optimizations, including changing the thread-local storage model to local-exec and enabling ThinLTO (Link Time Optimization). ThinLTO delivered a performance improvement of nearly 8%.
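
The relevant clang flags, sketched on one command line (actual build-system integration differs, and names are placeholders): -static produces the fully statically linked binary, -ftls-model=local-exec sets the TLS model, and -flto=thin enables ThinLTO.

clang++ -O2 -static -ftls-model=local-exec -flto=thin -fuse-ld=lld \
    -o mesh_proxy <sources...>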

WPD attempt

To achieve devirtualization, we also tried Whole Program Devirtualization, but the actual effect was not ideal: only a small portion of indirect calls were optimized. By studying the corresponding LLVM module implementation, we learned that the current WPD optimization is effective only for virtual functions with a single implementation, so it cannot bring significant performance gains at this stage.
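
For reference, a minimal sketch of the one case WPD does handle, assuming a build with LTO plus clang's -fwhole-program-vtables (names hypothetical): the whole program contains exactly one implementation of the interface, so the compiler can prove the target of every call.

// built with: clang++ -flto=thin -fwhole-program-vtables ...
struct Codec {
    virtual ~Codec() = default;
    virtual int decode(int v) = 0;
};

// the only implementation visible to the whole program
struct DefaultCodec final : Codec {
    int decode(int v) override { return v + 1; }
};

int run(Codec &c) {
    return c.decode(42);  // WPD can devirtualize (and then inline) this call
}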

BOLT post-link optimization

On top of the LTO and PGO compilation optimizations, we further explored post-link optimization techniques such as BOLT and obtained roughly another 8% performance gain. For stability reasons, this optimization is still being evaluated and tested and has not been deployed yet.

Afterword

We hope this is useful to the community, and we plan to contribute our work in the Service Mesh field back to the Envoy open source community.
