We have a dream that every R&D engineer will have a “super” computer.

Bytedance Terminal Technology — Sun Xiong

Efficiency bottlenecks of large projects

In recent years, development processes based on the DevOps pipeline have become the industry standard for software development. The speed of the pipeline largely determines a team's R&D efficiency. For large projects, compile-and-build steps usually account for the bulk of pipeline time; some projects take more than 30 minutes or even several hours to compile, which is a serious drag on productivity.

At ByteDance, the build time of most large iOS projects is kept within 5 minutes. This is largely thanks to an internal compile-acceleration solution that combines distributed compilation with a distributed compilation cache, and this article explains how it works in detail. Before we get there, let's look at the compilation bottlenecks of large projects and the available remedies.

In short, insufficient machine performance and repeated work are the two biggest factors limiting compilation efficiency, so distributed compilation plus a compilation cache can be adopted to improve overall performance.

Distributed compilation

The compilation of a project can usually be broken down into parallel compilation subtasks. In the C family of languages (C, C++, ObjC), for example, a project often contains thousands or even tens of thousands of source files (files with .c, .cc, or .m extensions). Each compilation subtask compiles one source file into an object file (a file with a .o extension), and all the object files are then linked into the final executable.

These compilation subtasks can be executed in parallel, as shown below:
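As a rough illustration of this parallelism (a minimal sketch; compileOne is a hypothetical stand-in for invoking the real compiler), independent subtasks can be fanned out and the link step performed once they all finish:

```cpp
#include <future>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-in for one compile subtask: turn a source file into an object file.
// A real build would invoke something like "clang -c <source> -o <object>" here.
std::string compileOne(const std::string& source) {
    return source.substr(0, source.rfind('.')) + ".o";
}

int main() {
    std::vector<std::string> sources = {"main.m", "Car.m", "Engine.m"};

    // The compile subtasks are independent, so they can run in parallel.
    std::vector<std::future<std::string>> tasks;
    for (const auto& src : sources)
        tasks.push_back(std::async(std::launch::async, compileOne, src));

    // The link step has to wait for every object file before combining them.
    std::cout << "link";
    for (auto& t : tasks) std::cout << " " << t.get();
    std::cout << " -o app\n";
}
```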

The number of CPU cores determines the maximum parallelism of compilation. Personal computers usually have 4 to 12 cores, and dedicated servers 24 to 96. For large projects with tens of thousands of files, that is still far from enough. This is where distributed compilation comes in: it effectively gives you a "supercomputer."

Compilation cache

A full build of a large project has to process thousands or even tens of thousands of compilation subtasks. Most of these subtasks have been compiled before, however, and if we could somehow fetch the existing compiled artifacts directly, we would save a great deal of time.

So we create a central repository for the artifacts of compiled subtasks, indexed by a digest of each task. Whenever we encounter a new task, we first query the central repository with its digest; if the query succeeds, we simply download the compiled artifact and skip the redundant compilation.
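A minimal sketch of the idea, with std::hash standing in for a real content digest and an in-memory map standing in for the central artifact repository (both are assumptions made only for illustration):

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>

// Stand-in for the central artifact repository: task digest -> compiled artifact.
std::unordered_map<std::string, std::string> centralRepo;

// Hypothetical task digest: a real system would hash the compile command plus the
// contents of every input file (source and headers), e.g. with SHA-256.
std::string taskDigest(const std::string& command, const std::string& inputs) {
    return std::to_string(std::hash<std::string>{}(command + "|" + inputs));
}

// Stand-in for actually running the compiler.
std::string compileLocally(const std::string& command) {
    return "object-code-for(" + command + ")";
}

// Query the repository by digest first; compile only when nothing cached is found.
std::string buildSubtask(const std::string& command, const std::string& inputs) {
    const std::string key = taskDigest(command, inputs);
    if (auto it = centralRepo.find(key); it != centralRepo.end())
        return it->second;                            // hit: download artifact, skip compiling
    std::string artifact = compileLocally(command);   // miss: compile as usual
    centralRepo[key] = artifact;                      // publish the artifact for future builds
    return artifact;
}

int main() {
    buildSubtask("clang -c foo.m", "contents-of-foo.m-and-headers");   // compiles
    buildSubtask("clang -c foo.m", "contents-of-foo.m-and-headers");   // served from the cache
    std::cout << centralRepo.size() << " cached artifact(s)\n";        // prints 1
}
```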

Distributed compilation and the compilation cache are the two key levers for improving the compilation efficiency of large projects. This article focuses on ByteDance's distributed compilation solution.

“Super” computers

With the help of cloud computing, we can assemble a “super” computer, as shown in the picture below:

The "super" computer consists of a central node and several worker nodes. The central node generates and schedules subtasks and, in execution order, dispatches them to idle worker nodes. The parallel processing capacity of the whole system is therefore the sum of the CPUs of all worker nodes, several times or even tens of times that of a single machine.

A scheme that distributes tasks to worker nodes in this way is known as distributed compilation. It is not a new concept: the open-source tool distcc already provided a distributed compilation solution back in 2008, and the Remote Execution API proposed by Google in 2017 standardizes distributed compilation and compilation caching at the protocol level.

Let’s first look at the core idea of distributed compilation.

The core idea

The core idea is very simple. The local machine works out which files a compile command needs to read, sends that file list together with the compile command to a remote machine, and has the remote machine execute the command. Once compilation finishes, the local machine pulls back the compiled artifact.

The key question is how to find the required files.
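A minimal sketch of that flow follows. Every helper here is a hypothetical stand-in for the real dependency analysis, file transfer, and remote execution layers:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-ins; a real implementation would do dependency analysis and RPC.
std::vector<std::string> findInputFiles(const std::string& command) {
    return {"main.m", "Car.h", "Engine.h"};          // files the command needs to read
}
void uploadFiles(const std::vector<std::string>& files) {
    std::cout << "upload " << files.size() << " files\n";
}
std::string executeRemotely(const std::string& command) {
    std::cout << "remote exec: " << command << "\n";
    return "worker:/tmp/main.o";                     // where the artifact lives remotely
}
void downloadArtifact(const std::string& remotePath, const std::string& localPath) {
    std::cout << "download " << remotePath << " -> " << localPath << "\n";
}

// The core flow: find the inputs, ship them with the command, run remotely, pull the result.
void compileRemotely(const std::string& command, const std::string& outputPath) {
    uploadFiles(findInputFiles(command));
    downloadArtifact(executeRemotely(command), outputPath);
}

int main() { compileRemotely("clang -c main.m -o main.o", "main.o"); }
```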

Background — preprocessing

Before diving in, we need a little background on how compilation works.

A source file declares a dependency on a header file with an #include or #import directive, e.g. #include "xx.h" or #import "xx.h".

The first stage in which the compiler processes a compile command is called "preprocessing," and an important part of this stage is header expansion. If the entry file main.m contains #import "Car.h", the compiler iterates through all search paths to find Car.h; the search paths are given by parameters such as -I and -isystem in the compile command.

Next, if the Car.h file itself contains #import directives, the compiler repeats the process: it finds each dependent file, reads its contents, and substitutes them in, until all #import directives have been expanded.
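A tiny illustration (the file contents below are hypothetical): after preprocessing, compiling main.m requires reading Car.h and, transitively, Engine.h, so the dependency set of the compile command is {Car.h, Engine.h}:

```objc
// main.m -- entry file
#import "Car.h"       // the compiler searches the -I / -isystem paths for Car.h

// Car.h
#import "Engine.h"    // found while expanding Car.h, so Engine.h is also a dependency

// Engine.h
// (no further #import directives, so the expansion stops here)
```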

So we can simulate this preprocessing step ourselves: find all the headers a compile command depends on, then send them together with the task to the remote end for execution.

An important engine

From the compilation principle above, we can see that dependency analysis is a prerequisite for distributing work. It is also a key determinant of performance.

Dependency analysis can only be done locally, where computing resources are limited, so its speed determines how smoothly tasks can be distributed. If dependency analysis is too slow, large numbers of worker nodes sit starved and task distribution becomes the bottleneck.

You can think of dependency analysis as an important engine for distributed compilation.

Implementing dependency analysis is not complicated; the compiler itself provides parameters for it. With Clang, -M yields the full dependency list (user and system headers), while -MM yields only the user-defined dependencies.

-M, --dependencies

Like -MD, but also implies -E and writes to stdout by default

-MD, --write-dependencies

Write a depfile containing user and system headers

-MM, --user-dependencies

Like -MMD, but also implies -E and writes to stdout by default

-MMD, --write-user-dependencies

Write a depfile containing user headers
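For example (a minimal sketch, assuming a POSIX environment with clang on the PATH), a build tool can spawn clang -MM and parse its make-rule output, e.g. "foo.o: foo.m common.h", to obtain the user-header dependencies of a source file:

```cpp
#include <cstdio>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Run "clang -MM <source>" and collect everything it writes to stdout (POSIX popen).
std::string runClangMM(const std::string& source) {
    std::string cmd = "clang -MM " + source;
    std::string output;
    if (FILE* pipe = popen(cmd.c_str(), "r")) {
        char buf[4096];
        while (fgets(buf, sizeof(buf), pipe)) output += buf;
        pclose(pipe);
    }
    return output;
}

// Parse the make-rule output ("foo.o: foo.m common.h \ ...") into a list of files.
std::vector<std::string> parseDepfile(const std::string& depfile) {
    std::vector<std::string> deps;
    std::istringstream in(depfile);
    std::string token;
    while (in >> token) {
        if (token == "\\" || token.back() == ':') continue;  // skip target and continuations
        deps.push_back(token);
    }
    return deps;
}

int main() {
    for (const auto& dep : parseDepfile(runClangMM("foo.m")))
        std::cout << dep << "\n";
}
```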

The open-source framework RECC makes direct use of this compiler capability.

The advantage of this approach is that it is easy to implement and safe, but it has a performance bottleneck. In our early tests with the Toutiao project, fetching dependencies through the compiler took around 200 milliseconds on average, while compiling a single file usually takes 500 to 3000 milliseconds. Dependency analysis therefore consumes too large a share of the time, and task distribution is not efficient enough.

Fetching dependencies this way is slow for two reasons. First, each compiler invocation runs in a separate process, so no cache can be reused across compilation tasks. Second, the compiler's -M option implies -E, i.e. preprocessing, and the preprocessing stage does a lot of work beyond dependency analysis that we could optimize away.

Google's Goma uses its own dependency analysis module and has achieved excellent results on two very large projects, Chromium and Android. Its implementation relies heavily on caching and indexing techniques, and its resident-process architecture lets it reuse intermediate data across requests.

While using Goma to accelerate our internal iOS projects, we found that performance suffered badly when a compilation task depended on too many frameworks or on overly large hmap files. We therefore optimized Goma for the characteristics of large iOS projects; in the end, dependency resolution for a compile task completes in about 50 ms on average.

Let's look at some of the techniques Goma uses in its design, along with the optimizations we made for iOS projects. For reasons of space, this article covers only the more representative parts.

Rapid dependency analysis

Goma combines a dependency cache with dependency analysis. If a compilation has already been performed in the working directory, the dependency cache can be used the next time; dependency analysis is performed only on a cache miss.

Dependency cache

The core principle of the dependency cache is to look up the dependency list recorded the last time the same compilation parameters were used; if none of the dependent files have changed, that list is reused.

Its process is shown in the figure below:
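A minimal sketch of that check; fileDigest is a hypothetical change detector (a real implementation might instead compare modification times or stronger content hashes):

```cpp
#include <fstream>
#include <functional>
#include <optional>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical change detector: digest a file's current contents.
std::string fileDigest(const std::string& path) {
    std::ifstream in(path);
    std::stringstream ss;
    ss << in.rdbuf();
    return std::to_string(std::hash<std::string>{}(ss.str()));
}

struct CachedDeps {
    std::vector<std::string> files;                        // last recorded dependency list
    std::unordered_map<std::string, std::string> digests;  // file -> digest at record time
};

// Cache keyed by the full compile command ("the same compilation parameters").
std::unordered_map<std::string, CachedDeps> depCache;

// After a real dependency analysis, record the list and a digest of each file in it.
void recordDeps(const std::string& command, const std::vector<std::string>& files) {
    CachedDeps entry{files, {}};
    for (const auto& f : files) entry.digests[f] = fileDigest(f);
    depCache[command] = std::move(entry);
}

// Reuse the previous dependency list only if none of the files in it has changed.
std::optional<std::vector<std::string>> lookupDeps(const std::string& command) {
    auto it = depCache.find(command);
    if (it == depCache.end()) return std::nullopt;         // never analyzed before
    for (const auto& f : it->second.files)
        if (fileDigest(f) != it->second.digests.at(f))
            return std::nullopt;                           // a dependency changed: miss
    return it->second.files;                               // everything unchanged: hit
}
```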

One might wonder: why is checking the last dependency list enough? If a change introduces a new file that is not in the list, how can we tell that the dependencies have changed?

In fact, introducing a new file requires adding a new #import directive, which inevitably changes one of the files already in the old dependency list, so the check is reasonably safe.

In some special cases, however, the dependency cache can produce wrong results. For example, a header file newly created during a code change may share its name with an old file and be the one actually intended at compile time; if the cache is reused, the old file can still produce a hit, which is not what the developer expects. In addition, directives such as __has_include can invalidate the cache. If correctness requirements are strict, the dependency cache can be turned off.

A dependency-cache hit returns the dependency list of a compile command in under 5 milliseconds, which is excellent performance.

In practice, however, dependencies mostly stay the same even when a file is modified, for example when the value of a variable is changed or a class member is added. If we can exploit this property, we can greatly increase the cache hit ratio.

Ignore irrelevant lines

Some code changes affect dependencies and some don't; if we only react to the changes that affect dependencies, we can filter out a lot of noise. The two examples below show the difference between a relevant change and an irrelevant one.

  • Relevant change (invalidates the dependency analysis cache)
- #include <foo.h>
+ #include <bar.h>
  • Irrelevant change (does not affect the dependency analysis cache)
- int a = 2;
+ int a = 3;

In addition to the #include and #import directives mentioned above, other directives can also cause cache invalidation: #if, #else, #define, #ifdef, #ifndef, #include_next.

What they have in common is that they start with # and are handled by the compiler during the preprocessing phase. They are collectively referred to as directives, so we simply cache each file's directive list. When a file changes, we re-extract its directive list and compare it with the previously cached one; if the list is unchanged, we can assume the change does not affect dependencies.
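A minimal sketch of this filter, using a deliberately simplified notion of a directive (any line whose first non-blank character is #):

```cpp
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

// Extract the file's directive list: in this simplified sketch, every line whose
// first non-blank character is '#' (e.g. #include, #import, #if, #define, ...).
std::vector<std::string> extractDirectives(const std::string& path) {
    std::vector<std::string> directives;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        auto pos = line.find_first_not_of(" \t");
        if (pos != std::string::npos && line[pos] == '#')
            directives.push_back(line.substr(pos));
    }
    return directives;
}

// Cached directive list per file, captured when the dependency cache was filled.
std::unordered_map<std::string, std::vector<std::string>> directiveCache;

// A modified file only invalidates the dependency cache if its directives changed.
bool changeAffectsDependencies(const std::string& path) {
    auto current = extractDirectives(path);
    auto it = directiveCache.find(path);
    bool changed = (it == directiveCache.end()) || (it->second != current);
    directiveCache[path] = std::move(current);
    return changed;
}
```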

Dependency analysis

Depth-first analysis

If the dependency cache misses or has been turned off, we enter the dependency analysis phase.

Dependency analysis uses depth-first search to find every file referenced by #include or #import in the code. Note that conditional compilation directives such as #if and #else also have to be evaluated during this phase.

Depth-first traversal is implemented with a file stack plus a per-file directive pointer, as shown in the figure:

The purple part of the figure is the file stack, where each element holds information about one file. Each file has a list of directives and maintains a pointer to the current directive.

The process starts by pushing the entry file onto the stack and traversing its directives. When an #include or #import directive is read, the dependent file is located and pushed onto the stack.

At this point the entry file has not been fully parsed, but by the depth-first rule the newly pushed file must be parsed first. The entry file's current position is therefore recorded in its directive pointer, so that parsing can resume from where it left off.
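A minimal sketch of the traversal; the toy in-memory table below stands in for the per-file directive lists and for the header lookup described in the next section:

```cpp
#include <iostream>
#include <map>
#include <set>
#include <stack>
#include <string>
#include <vector>

// Toy in-memory table: file name -> the headers its directives refer to.
// In the real analyzer these come from cached directive lists and path resolution.
std::map<std::string, std::vector<std::string>> kIncludes = {
    {"main.m", {"Car.h"}}, {"Car.h", {"Engine.h"}}, {"Engine.h", {}}};

// One stack element: a file plus a pointer (index) to its current directive.
struct Frame { std::string file; size_t next = 0; };

std::set<std::string> findDependencies(const std::string& entryFile) {
    std::set<std::string> deps;
    std::stack<Frame> files;
    files.push({entryFile});
    while (!files.empty()) {
        Frame& top = files.top();
        const auto& directives = kIncludes.at(top.file);
        if (top.next == directives.size()) { files.pop(); continue; }  // file fully parsed
        const std::string& target = directives[top.next++];  // advance pointer, keep position
        if (deps.insert(target).second)
            files.push({target});            // depth-first: parse the newly found file first
        // A full implementation would also evaluate #if/#ifdef/#else and search paths here.
    }
    return deps;
}

int main() {
    for (const auto& d : findDependencies("main.m")) std::cout << d << "\n";  // Car.h, Engine.h
}
```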

Optimization techniques

Dependency analysis involves a lot of repetitive operations that can be optimized with various tricks. This article introduces two representative ones.

Inverted index

The most common operation in dependency analysis is looking up a file by name among a set of candidate directories.

Suppose we need to find the A.h file referenced by an #import directive, and the command line contains 10 -I arguments, each pointing to a different directory (-IFoo, -IBar, …). The simplest approach is to traverse the 10 directories in turn, concatenating paths and checking whether A.h exists.

This works, but it is inefficient. For large projects, a single compile command can involve over 5,000 #import directives and over 50 header search paths, which means at least 5,000 × 50 = 250,000 file system lookups, a very expensive amount of time.

Building an inverted index can greatly speed this up. The idea is to traverse the search directories (directory) in advance, collect the files and subdirectories under each one (collectively called entry), and then build a mapping from each entry to the directories that contain it, as shown in the figure below:

Back to the problem above: when resolving an #import that refers to the subdirectory A, we first need to find which of the three directories foo, bar, and taz contains A. With the inverted index we can locate the bar directory immediately, instead of probing each directory from scratch.
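A minimal sketch of such an index, assuming the entries of each search directory have already been listed (in practice they come from scanning the directories and, as noted next, from parsed .hmap files):

```cpp
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// entry (file or subdirectory directly under a search directory) -> directories containing it.
using InvertedIndex = std::unordered_map<std::string, std::vector<std::string>>;

// Build the index once, up front, from a pre-scanned listing of each search directory.
InvertedIndex buildIndex(
    const std::vector<std::pair<std::string, std::vector<std::string>>>& dirEntries) {
    InvertedIndex index;
    for (const auto& [dir, entries] : dirEntries)
        for (const auto& entry : entries)
            index[entry].push_back(dir);
    return index;
}

int main() {
    // Hypothetical scan results for three search directories.
    InvertedIndex index = buildIndex({
        {"foo", {"X.h", "Y.h"}},
        {"bar", {"A", "Z.h"}},      // "A" is a subdirectory under bar
        {"taz", {"W.h"}}});

    // Resolving an entry named "A" now touches only the directories that contain it,
    // instead of probing the file system once per search directory.
    for (const auto& dir : index["A"]) std::cout << dir << "\n";   // prints: bar
}
```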

It is worth noting that ObjC projects generally use HeaderMap files (the .hmap files Xcode generates automatically) to speed up header lookup at compile time. A HeaderMap is essentially an index table that maps a header name, as written in the directive, directly to a path. When building the inverted index, we therefore need to parse the contents of the .hmap files and merge them into the index.

Cross-task caching (optimized for iOS projects)

Different compilation tasks may depend on the same files. For example, foo.m and bar.m might both depend on common.h. If foo.m has already been compiled and common.h has been located, can bar.m reuse that lookup result directly?

Unfortunately, in most cases a fresh lookup is required, because the search criteria usually differ between commands. Many parameters affect the search criteria, such as -I, -isystem, and -F.

However, iOS projects can often reuse previous lookups.

iOS projects usually follow the Xcode + CocoaPods development model, and the compile commands for source files within the same Pod have essentially the same header search paths. Exploiting this, we built a cross-task cache acceleration scheme.

We hash the search path list as a whole. If two commands have identical search paths, then looking up the same header name must yield the same result. The scheme is as follows:

  1. Before resolving the dependencies of a single command, compute a digest of its header search path list.

  2. Header lookups first consult a cache keyed by this digest and the header name; if no entry is found, the directories are searched as usual and the result is written to the cache.

Here’s a concrete example:

Compile task 1: clang -c foo.m -IFoo -IBar -FCar

Compile task 2: clang -c bar.m -IFoo -IBar -FCar

Both foo.m and bar.m contain the line #import "common.h".

Assuming compile task 1 executes first, the steps are as follows (a code sketch of this cache follows the list):

  1. Extract the search directory list: -IFoo -IBar -FCar.
  2. Compute a digest of the list with SHA-256; the resulting search digest is 598cf1e… (only the first 8 characters are shown).
  3. Perform dependency analysis: foo.m depends on common.h, so traverse the search directories and find common.h, say under the directory Bar.
  4. Write the cache (implemented as a hash table): the key is <598cf1e…, common.h> and the value is Bar.
  5. Execute compile task 2, which again needs to look up common.h.
  6. Read directly from the cache that common.h is in the Bar directory.
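A minimal sketch of this cross-task cache, using std::hash as a stand-in for SHA-256 and a hypothetical searchDirectories helper for the real directory traversal:

```cpp
#include <functional>
#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Digest of the whole search-path list (std::hash stands in for SHA-256 in this sketch).
std::string searchPathDigest(const std::vector<std::string>& searchPaths) {
    std::string joined;
    for (const auto& p : searchPaths) joined += p + ";";
    return std::to_string(std::hash<std::string>{}(joined));
}

// Hypothetical slow path: walk the search directories looking for the header.
std::optional<std::string> searchDirectories(const std::vector<std::string>& searchPaths,
                                             const std::string& header) {
    (void)searchPaths;                       // a real lookup would probe each directory
    if (header == "common.h") return std::string("Bar");
    return std::nullopt;
}

// Cross-task cache: <search-path digest, header name> -> directory containing the header.
std::unordered_map<std::string, std::string> crossTaskCache;

std::optional<std::string> findHeader(const std::vector<std::string>& searchPaths,
                                      const std::string& header) {
    const std::string key = searchPathDigest(searchPaths) + "|" + header;
    if (auto it = crossTaskCache.find(key); it != crossTaskCache.end())
        return it->second;                                   // task 2 is served from here
    auto dir = searchDirectories(searchPaths, header);       // task 1 does the real lookup
    if (dir) crossTaskCache[key] = *dir;
    return dir;
}

int main() {
    std::vector<std::string> paths = {"-IFoo", "-IBar", "-FCar"};
    findHeader(paths, "common.h");                     // compile task 1: real search, cached
    std::cout << *findHeader(paths, "common.h") << "\n";  // compile task 2: cache hit (Bar)
}
```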

Index caching (optimized for iOS projects)

Building an index reduces the number of directory traversals needed to find header files, which is a very effective optimization. However, when there are many header search directories or the hmap files are large, building the index itself can take tens of milliseconds, which is still on the long side for performance-critical dependency resolution.

So we asked: can we cache the index itself?

Following the cross-task caching approach, the index itself can also be cached: two tasks can share a single index as long as their header search paths and hmap contents are identical.

The exact scheme is similar to cross-task caching and is not covered in detail here. By caching the index, we shaved roughly another 20 milliseconds off dependency analysis.

Conclusion

Distributed compilation and the compilation cache are the two key levers for improving the compilation efficiency of large projects. This article has focused on ByteDance's distributed compilation solution.

The core of the scheme builds on the open-source Goma code base, with additional optimizations for the characteristics of iOS projects.

The core idea of distributed compilation is to trade space for time: bring in extra machines to increase the number of CPUs available to each build. Its effectiveness depends on how fast the central node can distribute tasks, which in turn depends on how efficiently dependencies can be resolved.

The traditional scheme resolves dependencies with the compiler's preprocessor. This works, but every resolution forks a separate process, intermediate data is hard to reuse, and performance becomes a bottleneck. We therefore built on the open-source Goma code and optimized it for the characteristics of iOS projects.

This article has introduced four dependency-resolution techniques, covering three angles: eliminating noise, indexing, and caching. Compilation optimization still has a long way to go. Thanks to the Goma team for the many excellent design ideas and techniques; we will keep working in this direction and share our findings whenever we can.

About the ByteDance Terminal Technology team

ByteDance Client Infrastructure is a global R&D team for big front-end infrastructure technology (with R&D teams in Beijing, Shanghai, Hangzhou, Shenzhen, Guangzhou, Singapore, and Mountain View), responsible for building ByteDance's entire big front-end infrastructure and improving the performance, stability, and engineering efficiency of the company's whole product line. The products it supports include, but are not limited to, Douyin, Toutiao, Xigua Video, Feishu, and Tomato Novel, and the team works in depth on mobile, Web, desktop, and other platforms.

We are now hiring globally for client, front-end, server, on-device intelligence algorithm, and test development roles! Let's change the world with technology. If you are interested, please contact [email protected], with the email subject: RESUME – Name – Position – Preferred city – Phone number.


Mars-talk 04 is here!

On the evening of February 24th, MARS TALK invites engineers from Volcano Engine APMPlus and Beauty to share "An APMPlus Java OOM attribution scheme based on Hprof files," "Optimization practice based on the Mars-APMPlus performance monitoring tool," and other technical content online. Join the event group now for a chance to win the latest VR all-in-one headset, the Pico Neo3!

⏰ Live: 20:00-21:30, Thursday, February 24

💡 Event format: live online

🙋 registration: scan code into the group registration

For the first MARS TALK of the year, we have big prizes for you. In addition to the Pico Neo3, there are Logitech M720 Bluetooth mice, massage guns, and ByteDance branded gifts to be won. Don't miss it!

👉 Click here to learn about APMPlus