I. Background

Large C++ engineering projects face the problem of long compilation times. Compilation happens everywhere: during development and debugging iterations, in acceptance testing, and in continuous integration. Reducing compilation time is therefore of great importance to development efficiency.

The Meituan Search and NLP Department provides the company's basic search platform services. For performance reasons, the underlying basic services are implemented in C++, and among them the Deep Query Understanding (DQU) service we are responsible for also suffers from long compilation times: before optimization, compiling the entire service took about 20 minutes (parallel compilation on a 32-core machine), which affected the team's development iteration efficiency. Against this background, we carried out a dedicated optimization of the DQU service's compilation, and along the way accumulated some optimization knowledge and experience, which we share here.

II. Compilation principles and analysis

2.1 Introduction to compilation principles

To better understand the compilation optimization schemes, we briefly review the principles of compilation before introducing them. Usually, when developing in C++, the compilation process consists of the following four steps:

Preprocessor: performs macro replacement, header file expansion, conditional-compilation expansion, and comment removal.

  • The gcc -E option yields the preprocessed result, a file with a .i or .ii extension.
  • C/C++ preprocessing performs no syntax checking, not only because it lacks the capability but because preprocessing directives are not C/C++ statements (which is why macros are defined without semicolons); syntax checking is the compiler's job.
  • What you get after preprocessing is nothing but source code.

Compiler: generates assembly code, translating the high-level language into an assembly-language program in which each statement describes exactly one low-level machine instruction in a standard text format.

  • The gcc -S option yields the compiled assembly code, a file with a .s extension.
  • Assembly language provides a common output language for the compilers of different high-level languages.

Assembler: Generates object files.

  • The gcc -c option yields the assembled result, a file with a .o extension.
  • A .o file is a binary-encoded object file.

Linker: Generates an executable or library file.

  • Static library: at link time, all the code from the library file is added to the executable, so the resulting file is relatively large, but the library file is no longer needed at runtime. The suffix is usually “.a”.
  • Dynamic library: the library's code is not added to the executable at link time; instead, the library is loaded by the runtime linker when the program runs, so the executable file is small. Dynamic libraries usually carry the suffix “.so”.
  • Executable: links all the binary files, whether object files or library files, into one executable file.

2.2 C++ compilation features

(1) Each source file is compiled independently

The compilation model of C/C++ differs greatly from that of other high-level languages. In other high-level languages, the compilation unit is an entire module, that is, all the source code under the module, processed in a single compilation task. In C/C++, the compilation unit is a single file: each .c/.cc/.cxx/.cpp source file is an independent compilation unit, so the compiler can only optimize based on the contents of that one file and has difficulty performing optimizations across compilation units.

(2) Each compilation unit independently parses all the header files it includes

If N source files include the same header file, that header must be parsed N times (a nightmare for headers like Thrift's or Boost's, which can run to thousands of lines).

If a header contains templates (STL/Boost), they are instantiated once per .cpp file: std::vector<int> appearing in N source files is instantiated N times.

(3) Template function instantiation

In C++98, for every template instantiation in the source code, the compiler must perform the instantiation work, and at link time the linker must remove the duplicated instantiation code. The compiler thus repeats the instantiation and compilation work every time it encounters the template's definition; if this repeated instantiation can be avoided, the compiler becomes much more efficient. External templates, a new language feature in the C++11 standard (formerly C++0x), solve this problem.

C++98 already had a language feature called explicit instantiation, whose purpose is to instruct the compiler to instantiate a template immediately (i.e. forced instantiation). The external template syntax extends explicit instantiation by adding the extern prefix before the explicit instantiation directive.

// Explicit instantiation: instantiate the template here and now
template class std::vector<int>;

// External template declaration: do not instantiate it in this unit
extern template class std::vector<int>;

Once an external template declaration appears in a compilation unit, the compiler skips template instantiations that match the declaration when compiling that unit.

(4) Virtual functions

The compiler implements virtual functions by adding a hidden pointer to each object, holding the address of the class's virtual function table, which stores the addresses of the class's virtual functions (including those inherited from the base class). If a derived class overrides a virtual function with a new definition, its table stores the new function's address; if it does not redefine the virtual function, the table keeps the address of the original version; if it defines a new virtual function, that function's address is appended to the table.

When a virtual function is called, the program reads the virtual table address stored in the object, goes to the corresponding table, looks up the function address at the slot determined by the class declaration, and executes that function.

Changes after using virtual functions:

① Each object gains extra space to store the table address (4 bytes on a 32-bit system, 8 bytes on a 64-bit system). ② For each class, the compiler creates a virtual function address table. ③ Each virtual function call adds a lookup in the table.

(5) Compiler optimization

GCC provides nearly a hundred optimization options to meet users' needs for different degrees of optimization, making different trade-offs along the three dimensions of compilation time, object file size, and execution efficiency. The optimization methods are varied and fall broadly into the following categories:

① Streamline the operation instructions. ② Try to keep the CPU pipeline busy. ③ Predict program behavior and adjust the order in which code executes. ④ Make full use of registers. ⑤ Expand simple calls, and so on.

Targeting code with these individual options would be complex work. Fortunately, GCC offers several optimization levels, O0 through O3 plus Os, to choose from; each bundles most of the effective optimization options, and on that basis individual options can be disabled or added, greatly reducing the difficulty of use.

  • O0: no optimization; this is the default compilation option.
  • O and O1: partial compilation optimization. The compiler tries to reduce the size of the generated code and shorten execution time, but does not perform optimizations that require large amounts of compile time.
  • O2: a more advanced option than O1 that enables more optimizations. GCC performs nearly all supported optimizations that do not involve a space-speed trade-off. At O2 the compiler does not perform loop unrolling or function inlining. Compared with O1, O2 improves the execution efficiency of the generated code at the cost of longer compilation time.
  • O3: further optimizations on top of O2, such as use of pseudo-register networks, inlining of ordinary functions, and more loop optimizations.
  • Os: optimizes mainly for code size. Note that optimizations of all kinds tend to disturb program structure, making debugging hard to follow, and may change execution order; programs that depend on the order of memory operations need extra measures to stay correct.

Possible problems with compilation optimization:

① Debugging problems: as mentioned above, any optimization level changes the code structure. Branch merging and elimination, common subexpression elimination, replacing and moving load/store operations inside loops, and so on distort the execution order of the object code, leaving debug information severely incomplete.

② Changed memory-operation order: after O2 optimization, the compiler may affect the order of memory operations. For example, -fschedule-insns allows other instructions to complete before data processing finishes, and -fforce-mem may cause inconsistency between memory and registers, such as dirty data. Logic that depends on the order of memory operations needs rigorous handling before it can be optimized, for example using the volatile keyword to restrict operations on a variable, or using a barrier to force the CPU to execute strictly in instruction order.

(6) C/C++ cross-compilation-unit optimization can only be left to the linker

When linking, the linker first determines the position of each object file within the final executable. It then visits each object file's address-relocation table and relocates the recorded addresses (adding an offset, namely the start address of that compilation unit within the executable). Next it walks each object file's unresolved-symbol table, finds the matching symbols among all the exported symbol tables, and fills the implementation addresses into the positions recorded in the unresolved-symbol table. Finally it writes each object file's contents at its assigned position, producing the executable. The details of linking are complex, and the link stage is a single process that cannot be parallelized, so linking large projects is very slow.

III. Service problem analysis

DQU is the query understanding platform of Meituan Search. It contains many models and lexicons, more than 20 Thrift files, and a large amount of Boost processing code; it also pulls in the SF framework, the SDKs of the company's third-party components, and three word-segmentation submodules. Modules are compiled and loaded as dynamic libraries, and data is passed between modules over a message bus. The bus message is one big Event class that contains the data-type definitions needed by the various modules, so every module includes the Event header; because of this unreasonable dependency, a change to that file forces nearly all modules to recompile.

The compilation problems of every service have their own characteristics, but the essential causes are similar. Combining the compilation process and principles above, we analyzed the DQU service's compilation problems from three angles: precompilation expansion, header file dependencies, and per-stage compilation time.

3.1 Compilation expansion analysis

To see the size of files after compilation expansion, you can keep the intermediate compilation files by specifying the compile option -save-temps in CMake.

set(CMAKE_CXX_FLAGS "-std=c++11 ${CMAKE_CXX_FLAGS} -ggdb -Og -fPIC -w -Wl,--export-dynamic -Wno-deprecated -fpermissive -save-temps")

The most direct cause of long compilation is that the files are large after preprocessor expansion. Analyzing the size and content of the expanded files shows that some exceed 400,000 lines after expansion, and that the files containing large numbers of Boost library references and header references are big and compile slowly. This also reveals what the slow-compiling files have in common. Below is a screenshot of file sizes after compilation expansion.

3.2 Header file dependency analysis

Header dependency analysis examines, from the perspective of the number of referenced header files, whether the code's dependencies are reasonable. We implemented a script that counts header file dependencies; analyzing the reference counts it outputs helps judge whether the header dependencies are reasonable.

(1) Overall header file reference statistics

The tool counts the total number of header files on which each compilation source file depends, directly and indirectly, which gives a quantitative, overall view of the problem.

(2) Single header file dependency statistics

By analyzing header file dependencies and generating a dependency topology graph, we can see at a glance where the dependencies are unreasonable.

The figure shows the reference hierarchy and the number of referenced header files.

3.3 Per-stage compilation time statistics

Per-stage time statistics show, from the results, each file's compilation time and the time spent in each stage of compilation. This is an intuitive measurement; under normal circumstances, compilation time is positively correlated with header and file size and with the number of references. CMake can print the time consumed by the compile and link stages via the properties below, and from this data the time consumption can be analyzed directly.

set_property(GLOBAL PROPERTY RULE_LAUNCH_COMPILE "${CMAKE_COMMAND} -E time")
set_property(GLOBAL PROPERTY RULE_LAUNCH_LINK "${CMAKE_COMMAND} -E time")

Compilation time output:

3.4 Analysis tool construction

Through the above tool analysis, we can obtain several kinds of compilation data:

① Header file dependencies and their counts. ② Preprocessor-expanded size and content. ③ Compilation time of each file. ④ Overall link time. ⑤ The achievable compilation parallelism, computed from the above.

Feeding in these data, we concluded that we could build an automated analysis tool to find optimization points and display them in an interface. To this end we built a whole-process automated analysis tool that automatically analyzes common time-consuming problems and the TopN most time-consuming files. The tool's processing flow is shown in the figure below:

(1) Overall statistical analysis effect

Specific field description:

① cost_time: compilation time, in seconds. ② file_compile_size: size of the compiled intermediate file, in MB. ③ file_name: file name. ④ include_h_nums: number of header files included. ⑤ top_h_files_info: the TopN header files that introduce the most headers.

(2) Top10 slowest-compiling file statistics

Shows the TopN files that take the longest to compile; N can be customized.

(3) Top10 compiled intermediate file size statistics

Counts and displays the size of each compiled intermediate file, used to judge whether it matches expectations; these correspond one-to-one with the compilation times.

(4) Top10 header files that introduce the most header files

(5) Top10 most repeatedly included header files

At present, these tools support one-click generation of compile-time analysis results. Several of the small tools, such as the dependency-analysis tools, have been integrated into the company's integration testing process, so that the impact of code changes on compilation time is checked by automated tooling. The tools are still being iterated and optimized, and will later be integrated into the company's MCD platform, so that problems of long compilation time can be located automatically and other departments' compile-time problems can be solved as well.

IV. Optimization scheme and practice

Using the tools above, we can find what the Top10 slowest-compiling files have in common: for example, they all depend on the message bus file platform_query_analysis_enent.h, which directly and indirectly introduces more than 2,000 header files. We focused our optimization on such files, identifying common problems such as Boost usage, template class expansion, and Thrift header expansion, and optimized specifically for each of them. In addition, we also applied some of the industry's common compilation optimization schemes, with good results. The following describes in detail the various optimization schemes we adopted.

4.1 Common compilation acceleration scheme

The industry has a number of general compilation acceleration tools that can speed up compilation without intruding into the code; they are well worth trying.

(1) Parallel compilation

You can add the -j parameter to the make command to raise compilation parallelism; for example, make -j 4 starts four compile tasks. In practice we do not hard-code this parameter, but obtain the number of CPU cores of the build machine dynamically via $(nproc), i.e. make -j$(nproc), so as to exploit the performance advantage of multiple cores to the fullest.

(2) Distributed compilation

Distributed compilation uses tools such as Distcc and Dmucs to build a large-scale, distributed C++ compilation environment. On Linux, distributed compilation over a network cluster must take network latency and stability into account. It suits very large projects, for example ones whose single-machine build takes hours or even days. Given the DQU service's code size and single-machine compilation time, we do not need distributed acceleration for now. For details, see the Distcc official documentation.

(3) Precompiled header file

PCH (precompiled headers) saves the compilation result of commonly used header files in advance, so that when the compiler later encounters the corresponding #include it can use the saved result directly, speeding up the entire build. PCH is a very commonly used compilation acceleration method in the industry, with very good reported results. In our project, however, because the build produces many shared libraries, and shared libraries cannot share a PCH with one another, it did not achieve the desired effect.
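For reference, CMake 3.16 and later supports PCH natively through target_precompile_headers; a minimal fragment (the target name here is hypothetical) might look like:

```cmake
# Compile these heavyweight headers once and reuse the result for every
# source file of the target (requires CMake 3.16+).
target_precompile_headers(dqu_service PRIVATE
    <vector>
    <map>
    <boost/algorithm/string.hpp>
)
```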

(4) CCache

ccache (compiler cache) is a compilation caching tool. Its principle is to save the compilation result of each .cpp in a file cache and, on the next build, to fetch the result directly from the cache if the corresponding file has not changed. Make itself has some caching ability: if an object file has already been built (and its dependencies are unchanged), it is not rebuilt as long as the source file's timestamp is unchanged. ccache, however, caches by file content, and multiple projects on the same machine can share its cache, so it is more widely applicable.

(5) Module compilation

If your project is developed in C++20, module compilation is also a way to optimize compilation speed. Before C++20, each .cpp is treated as a compilation unit, with the problem that included headers are parsed and compiled many times over. Modules need no header files (a single module file suffices; there is no need for separate declaration and implementation files): the compiler compiles the module file (.ixx or .cppm) directly and automatically generates a binary interface file. Unlike #include preprocessing, a compiled module is not compiled again the next time it is imported, which can greatly improve the compiler's efficiency.
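A minimal sketch of the module syntax (the file, module, and function names are our own, and building this requires a modules-capable C++20 toolchain, so treat it as illustrative only):

```cpp
// math.cppm -- the module interface, compiled once into a binary interface file
export module math;

export int add(int a, int b) { return a + b; }

// main.cpp -- importing reuses the compiled interface; no header text is re-parsed
import math;

int main() { return add(2, 3) == 5 ? 0 : 1; }
```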

(6) Automatic dependency analysis

Google has also open-sourced Include-What-You-Use (IWYU), a Clang-based tool for checking redundant headers in C/C++ projects. IWYU depends on the Clang compiler suite; scanning files with it exposes header dependency problems, and the tool also provides scripts to fix them automatically. We tried building and using this analysis tool, but because our code dependencies are complex (dynamic libraries, static libraries, multiple repositories, and so on), the automatic optimization it provides could not be applied directly. Teams with a simpler code structure may consider using it for analysis and optimization. It generates result files like the following, indicating which header files should be removed.

>>> Fixing #includes in '/opt/meituan/zhoulei/query_analysis/src/common/qa/record/brand_record.h'
@@ -1,9 +1,10 @@
 #ifndef _MTINTENTION_DATA_BRAND_RECORD_H_
 #define _MTINTENTION_DATA_BRAND_RECORD_H_
-#include "qa/data/record.h"
-#include "qa/data/template_map.hpp"
-#include "qa/data/template_vector.hpp"
-#include <boost/serialization/version.hpp>
+#include <boost/serialization/version.hpp>  // for BOOST_CLASS_VERSION
+#include <string>                           // for string
+#include <vector>                           // for vector
+
+#include "qa/data/file_buffer.h"            // for REG_TEMPLATE_FILE_HANDLER

4.2 Code optimization scheme and practice

(1) Forward type declarations

After analyzing the header reference statistics, we found that the bus type Event was the most-referenced item in the project, and that this type holds the members needed by every business module. An example:

#include "a.h"
#include "b.h"

class Event {
    // Business A, B, C...
    A1 a1;
    A2 a2;
    // ...
    B1 b1;
    B2 b2;
    // ...
};

As a result, Event pulled in a huge number of header files and, when expanded, reached 15 MB in size. Every business module needs to use Event, which seriously dragged down compilation performance.

We solved this problem with forward type declarations: instead of including each type's header, we only declare the types and use pointers to them inside Event, as follows:

class A1;  // forward declarations only; no headers included
class A2;
// ...

class Event {
    // Business A, B, C...
    shared_ptr<A1> a1;
    shared_ptr<A2> a2;
    // ...
    shared_ptr<B1> b1;
    shared_ptr<B2> b2;
    // ...
};

Only where a member variable is actually used does the corresponding header file need to be included; this achieves importing header files genuinely on demand.

(2) External templates

Because templates are instantiated when used, the same instantiation can appear in multiple object files. The compiler instantiates each of them, and the linker removes the duplicated instantiation code. In projects that use templates heavily, the compiler produces a lot of redundant code, which greatly increases compilation time and link time. This can be avoided with external templates, introduced in the C++11 standard.

// util.h
template <typename T> 
void max(T) { ... }
// A.cpp
#include "util.h"
template void max<int>(int);  // explicitly instantiate

void test1() { max(1); }

When compiling A.cpp, the instantiation max<int>(int) is generated.

// B.cpp
#include "util.h"
extern template void max<int>(int);  // external template declaration

void test2() { max(2); }

When compiling B.cpp, the max<int>(int) instantiation code is no longer generated, saving the instantiation, compilation, and link time described earlier.

(3) Replacing templates with polymorphism

Our project makes heavy use of dictionary-related operations, such as loading dictionaries, parsing dictionaries, and matching dictionaries (various fancy matches), all of which support different dictionary types through template extension. According to our statistics, there are more than 150 dictionary types, which bloats the template-expanded code accordingly.

template <class R>
class Dict {
public:
    bool match(const string &key, const string &condition, R &record);  // one of the fancy matches
private:
    map<string, R> dict;
};

Fortunately, most of the operations in our dictionaries can be abstracted into a few interfaces, so we need only implement the operations against a base class:

class Record {
public:
    virtual bool match(const string &condition);  // derived classes implement this
};

class Dict {
public:
    shared_ptr<Record> match(const string &key, const string &condition);
private:
    map<string, shared_ptr<Record>> dict;
};

Through inheritance and polymorphism, we effectively avoid massive template expansion. Note that using pointers as the map's values increases the pressure on memory allocation; it is advisable to use tcmalloc or jemalloc instead of the default ptmalloc to optimize memory allocation.

(4) Replace the Boost library

Boost is a widely used base library covering a large amount of commonly used functionality; it is convenient and easy to use, but has some flaws. One notable drawback is that its implementation is header-only (.hpp), with declaration and implementation in the same header file, which makes the precompiled expansion very large.

// String manipulation is a common need; merely including this header
// expands the preprocessed file by more than 4 MB
#include <boost/algorithm/string.hpp>

// By contrast, including several STL headers expands far less
#include <vector>
#include <map>

Our project mainly uses no more than 20 Boost functions. Some can be replaced with STL equivalents, and others we reimplemented by hand, which moved the project from depending heavily on Boost to being mostly Boost-free and greatly lightened the compilation burden.

(5) Precompilation

Some code in the repository changes relatively rarely but still affects compilation time, such as the files generated by Thrift, the model library files, and the files under the Common directory. We precompile these into dynamic libraries, which shortens the compilation of subsequent files and also removes some compilation dependencies.
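In CMake terms the idea looks roughly like this (the target and variable names are hypothetical):

```cmake
# Build the rarely-changing generated/common code once as a shared library;
# day-to-day builds then recompile only the business code that changed.
add_library(dqu_common SHARED ${THRIFT_GEN_SRCS} ${COMMON_SRCS})
target_link_libraries(dqu_service PRIVATE dqu_common)
```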

(6) Resolve compilation dependencies to raise compilation parallelism

Our project contains a large number of module-level dynamic library files to compile, and the compilation dependencies specified in the CMake files limit, to some extent, how much of the compilation can run in parallel.

For example, in the following scenario, compilation parallelism can be improved by setting library file dependencies properly.

4.3 Optimization effect

We compared compilation times before and after optimization on machines with 32 cores and 64 GB of memory; the statistics are as follows:

4.4 Maintaining the optimization results

Compilation optimization is a matter of rowing upstream. Developers tend to keep adding new features, libraries, and even frameworks, while removing old code, libraries, and frameworks is difficult (as front-line developers surely know). Therefore, how to preserve earlier optimization results is also crucial. From practice we have drawn the following lessons:

  • Code review alone is hard to rely on (changes that increase compilation time are usually not visible in review).
  • Tools and process are what you can depend on.
  • The key is to control the increment.

We found that a .cpp file's compilation time is, in most cases, positively correlated with the size of its preprocessor-expanded file (.ii). For each released version, we record the expanded sizes of all its .cpp files to form its CF (compile fingerprint). By comparing the CFs of two adjacent versions, we can know fairly accurately which changes in the new version account for the added compilation time, and further analyze whether the increase is reasonable and whether there is room for optimization.

We turned this approach into a script tool and introduced it into the release pipeline, which lets us see the compilation-performance impact of every code release and helps us preserve our earlier optimization results.

V. Summary

The DQU project is an important link in Meituan's search business. The system connects to more than 20 RPC services and dozens of models, loads over 300 dictionaries, uses tens of gigabytes of memory, and answers more than 2 billion requests per day: a large C++ service. Under rapid business iteration, the lengthy compilation time troubled the developers considerably and constrained development efficiency to some extent. Ultimately, by building compilation optimization analysis tools and applying both general compilation acceleration schemes and code-level optimization, DQU's compilation time was reduced by 70%, and by introducing CCache and other means, local development builds can complete within 100 s, saving the development team a great deal of time.

After achieving initial results, we summarized the whole problem-solving process and distilled some analysis methods, tools, and process norms. These tools can quickly and effectively detect compile-time changes caused by new code during subsequent development iterations, and have become part of our ongoing process checks. This differs greatly from the one-off, targeted compilation optimizations we used to do. After all, maintaining code is a lasting process; to solve this problem systematically, we need not only effective methods and convenient tools, but also a standardized release process to preserve the results. We hope this article is helpful to you.

References

  • [1] Compilation Principles Perspective: Illustrated Compilation Principles
  • [2] CCache
  • [3] Distributed compilation
  • [4] Precompiled header files
  • [5] Precompiled header files
  • [6] C++ Templates
  • [7] Include-What-You-Use

About the authors

Zhou Lei, Shi Han, Zhu Chao, Wang Xin, Liu Liang, Chang Shu, Li Chao, Yun Sen, and Yong Chao are all from the Search and NLP Department of Meituan AI Platform.
