Background introduction by Cen Yu

With the slowdown of Moore’s Law in the CPU industry in recent years, many vendors are looking for alternatives at the instruction set architecture level. In consumer products, Apple launched the Apple Silicon M1, based on the ARM instruction set, to great acclaim. In the cloud service industry, Huawei Cloud and Amazon developed and launched ARM-based servers several years ago and have achieved good results in both cost and performance.

In the domestic CPU industry, apart from Beida Zhongzhi, Hygon (Haiguang), Zhaoxin and a few other companies that hold an x86_64 instruction set license, manufacturers basically focus on non-x86_64 instruction sets. For example, Huawei and Phytium (Feiteng) are developing ARM CPUs, while Loongson focuses on MIPS CPUs. RISC-V has also attracted the attention of many manufacturers in recent years.

For this variety of non-x86_64 CPUs, migrating and adapting industry software mainly involves four kinds of targets: embedded, mobile, desktop and server. On embedded devices, power constraints keep the logic relatively simple, so porting and adapting the code is not very complex. Mobile devices are generally Android on ARM, which does not involve many adaptation problems.

There are three scenarios on the desktop:

If the application is browser-based, all functions can be satisfied. Domestic operating systems generally ship with the Firefox browser, so the application only needs to be adapted to Firefox.

If the application is a light desktop application, consider using Electron. Electron (formerly Atom Shell) is an open source framework developed by GitHub that uses Node.js (as the back end) and Chromium’s rendering engine (as the front end) to build cross-platform desktop GUI applications. In this case, first check whether the software repository of the domestic system provides the corresponding Electron dependency (it generally does). If not, you need to compile it yourself.

If the application is a heavy native application, the code needs to be compiled for the corresponding instruction set and system dependencies, which is a heavy workload.

Servers also fall into three categories:

If you are using a virtual-machine-based language, such as Java or the various JVM-based languages (Kotlin, Scala, etc.), no special adaptation is required for the service. Generally, OpenJDK is already available in the software repository of the domestic system. If not, you can usually find an open source OpenJDK build for the instruction set and install it yourself.

In recent years, some languages without a strong dependence on the C library have appeared, such as Go. Its compiler was designed from the beginning with a variety of target systems and instruction set architectures in mind, so you only need to specify the target system and architecture at compile time, for example GOOS=linux GOARCH=arm64 go build. If CGO is used, you also need to specify the C/C++ compiler.

If the service uses a native language such as C/C++ and depends strongly on the system C library, the code needs to be compiled for the corresponding instruction set and system dependencies, which requires a large amount of work.
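As a rough illustration of the Go case, the sketch below prepares the environment for a cross-compiling go build from a Python build script. The helper name go_build_env and the compiler table are hypothetical; the compiler names assume the Debian/Ubuntu cross-toolchain packages and may differ on other systems.

```python
import os

# Hypothetical mapping from GOARCH to a C cross-compiler. The names assume
# the Debian/Ubuntu cross-toolchain packages; adjust them for your system.
CGO_COMPILERS = {
    "arm64": "aarch64-linux-gnu-gcc",
    "mips64le": "mips64el-linux-gnuabi64-gcc",
}


def go_build_env(goarch, goos="linux", use_cgo=False, base_env=None):
    """Return the environment variables for a cross-compiling `go build`."""
    env = dict(os.environ if base_env is None else base_env)
    env["GOOS"] = goos
    env["GOARCH"] = goarch
    env["CGO_ENABLED"] = "1" if use_cgo else "0"
    if use_cgo:
        # With CGO, the Go toolchain also needs a C compiler for the target.
        env["CC"] = CGO_COMPILERS[goarch]
    return env
```

With the Go toolchain installed, the result could be passed as subprocess.run(["go", "build", "./..."], env=go_build_env("arm64"), check=True).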

As can be seen from the above, the server and the desktop are similar when it comes to adapting native C/C++ code, but the server has more stringent performance requirements. This article mainly covers how to adapt server-side native C/C++ code to a variety of CPU instruction sets, and especially how to improve engineering efficiency when the code base is large. Most of the content also applies to the desktop.

Notes on compiling and running

Since we need to adapt native C/C++ programs to a variety of CPU instruction sets, we first need to understand how a program is compiled and run, so that we can use the right tools at each stage to make adaptation more efficient.

As you probably know from your computer classes, C/C++ source code is preprocessed, compiled, and linked to generate an executable file. The operating system then loads the program from disk into memory and runs it. Many details are hidden in this process, so let’s look at them one by one.

During compilation, the source code first passes through the compiler front end, which performs lexical analysis, syntax analysis, type checking and intermediate code generation, producing an intermediate representation that is independent of the target platform. It is then handed to the compiler back end for code optimization, target code generation and target code optimization, which produces the .o object file for the corresponding instruction set.

GCC handles the front end and back end together, while Clang and LLVM handle the front end and back end respectively. This also shows how cross-compilation is commonly implemented: the compiler back end is paired with different instruction sets and architectures.

In theory, every C/C++ program should compile for every target platform through native or cross-compilation toolchains. In a real project, however, you also need to check whether the build tools in use, such as make, CMake, Bazel and Ninja, already support each case. For example, at the time of writing, Chromium and WebRTC could not be built for their native architecture on Mac ARM64 because of the Ninja and GN toolchains.

The linker then links the .o object files and the various libraries they depend on to produce an executable file.

During linking, the corresponding library files are searched for according to environment variables. The ldd command lists the dynamic libraries that an executable depends on. When adapting to different systems that share the same instruction set, consider copying all library dependencies together with the binary executable as the build output.
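To collect the libraries to copy alongside a binary, the text output of ldd can be parsed. The following is a minimal sketch that assumes the usual glibc ldd line format ("name => path (address)"); the function name parse_ldd is hypothetical.

```python
def parse_ldd(output):
    """Map each library name in `ldd` output to its resolved path.

    Libraries that could not be resolved map to None. Lines without "=>"
    (the vDSO and the dynamic loader itself) are skipped in this sketch.
    """
    libs = {}
    for line in output.splitlines():
        line = line.strip()
        if "=>" not in line:
            continue
        name, _, target = line.partition("=>")
        name, target = name.strip(), target.strip()
        if target.startswith("("):
            continue  # only an address, no file on disk to copy
        if target.startswith("not found"):
            libs[name] = None
        else:
            # Drop the trailing load address, e.g. " (0x00007f...)".
            libs[name] = target.split(" (")[0]
    return libs
```

The output string could come from subprocess.run(["ldd", path], capture_output=True, text=True).stdout. Any entry that maps to None is a dependency that is missing on the current system.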

The resulting executable file, on both Windows and Linux, is a variant of the Common Object File Format (COFF): Portable Executable (PE) on Windows and ELF (Executable and Linkable Format) on Linux.

In fact, besides executable files, dynamic link libraries (DLLs) and static libraries are also stored in the executable file format. On Windows they are stored in PE-COFF format, and on Linux they are stored in ELF format. Only the file name suffixes differ.
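When working with several instruction sets, it helps to check which format and architecture a binary or library was actually built for. The sketch below reads only the first bytes of a file. The e_machine values come from the ELF specification; the function name describe_binary is hypothetical and the table covers only a few architectures.

```python
import struct

# A few e_machine values from the ELF specification.
ELF_MACHINES = {
    3: "x86",
    8: "mips",
    40: "arm",
    62: "x86_64",
    183: "aarch64",
    243: "riscv",
    258: "loongarch",
}


def describe_binary(header):
    """Classify a file from its first 20 or more bytes."""
    if header[:4] == b"\x7fELF":
        # e_ident[EI_DATA] (byte 5): 1 = little-endian, 2 = big-endian.
        endian = "<" if header[5] == 1 else ">"
        # e_machine is a 16-bit field at offset 18 of the ELF header.
        (machine,) = struct.unpack_from(endian + "H", header, 18)
        return "ELF", ELF_MACHINES.get(machine, "unknown(%d)" % machine)
    if header[:2] == b"MZ":
        # DOS stub magic found at the start of PE files.
        return "PE", None
    return "unknown", None
```

For a file on disk, the header could be read with open(path, "rb").read(20).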

Finally, when the binary executable is started, the system loads it into a new address space. That is, the system reads the header information from the executable file, maps the program segments into the address space, and uses the dynamic linker and loader to load the libraries and relocate addresses. It then sets up the environment information and program arguments for the process, and finally the program runs, executing its machine instructions one by one.

At this stage, the LD_LIBRARY_PATH environment variable can be used to specify the directories from which libraries are loaded, or Docker can be used to pin down the entire runtime environment.

While machine instructions are being read and executed, they can also be emulated or translated by a virtual machine. For example, QEMU supports a variety of instruction sets, and Rosetta 2 on the Mac efficiently translates x86_64 into ARM64 and executes it.

Adaptation and engineering efficiency

By analyzing the entire process of compiling and running, we can find many tools in the industry to improve the efficiency of adaptation.

To get fast CI/CD builds that do not depend on the host system, we compile inside Docker.

By installing all tools and dependent libraries from scratch in the Dockerfile, you can strictly guarantee that the environment is the same for every build.

At compile time, if the dependencies are clear, you can cross-compile the program for the target directly on an x86_64 machine.

If the system library dependencies are complex but the amount of code is relatively small, you can also use QEMU to emulate the target instruction set and compile natively inside the emulation. In effect, QEMU translates the instructions of GCC/Clang itself, so the build environment does not need to be modified. Docker’s Buildx is based on this idea.

Note, however, that QEMU works by instruction set translation, which is not efficient. For a large code base this approach is basically not an option. Docker Buildx is also not very stable: more than once, a buildx build has brought down the Docker service for me.

If the code base is large and deeply tied to its build tools, adapting it to GCC/Clang cross-compilation may be difficult. In that case, you can compile natively on a machine with the target instruction set.

Which approach to use depends on engineering practice. When the code repository is huge and hard to restructure, you can even cross-compile some modules, build others under emulation or natively on the target machine, and link the results together, as long as overall engineering efficiency is maximized.
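The rules of thumb in this section can be summarized as a small decision function. This is only a sketch of the reasoning above, not a fixed policy; the function name, its parameters and the module names are all hypothetical.

```python
def choose_build_strategy(deps_are_clear, code_is_large, cross_is_feasible=True):
    """Pick a build approach for one module, following the rules of thumb above."""
    if deps_are_clear and cross_is_feasible:
        return "cross-compile"    # build on the x86_64 host for the target
    if not code_is_large:
        return "qemu-emulation"   # slow, but needs no changes to the build
    return "native-on-target"     # large code that is hard to cross-compile


# Hypothetical modules of one repository, each built in a different way
# and linked together afterwards.
modules = {
    "codec": choose_build_strategy(deps_are_clear=True, code_is_large=True),
    "plugins": choose_build_strategy(deps_are_clear=False, code_is_large=False),
    "core": choose_build_strategy(deps_are_clear=False, code_is_large=True),
}
```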

CPU-specific efficiency optimization

Different CPUs, even within the same architecture, support different machine instructions, such as long instructions, and this affects execution efficiency. The normal process is for each CPU vendor to contribute support for its features to GCC/Clang/LLVM so that developers can use them at compile time. However, this takes time and imposes requirements on the compiler version, so CPU vendors also state in their documentation which GCC version is needed and, in some cases, which special options to pass to GCC.

Our RTC service uses Kubernetes for service orchestration, so the build output is actually a set of Docker images. Choosing the base image requires extra care when dealing with multiple instruction set architectures.

Docker base images are usually chosen from scratch, Alpine, Debian, Debian-slim, Ubuntu and CentOS.

Unless there is a specific requirement, you generally do not want to start from scratch, which is an empty image.

Alpine is only about 5 MB, which is nice, but its system C library is musl rather than the glibc commonly found on desktop systems and servers. For heavy C/C++ applications, try not to use it, or the adaptation workload may become excessive.

Compared with Debian, Debian-slim mainly removes some rarely used files and documentation. General services can choose the slim variant.

Neither Ubuntu nor CentOS officially supports the MIPS architecture. If you need to support Loongson and other MIPS CPUs, consider Debian-slim.

Ubuntu takes most of its open source packages from Debian unstable or testing. If the C library version you depend on is inconsistent with the one in Debian unstable or testing, pay attention to the difference.

After the CI build, QEMU + Docker can be used to start the service and perform simple validation for multiple instruction sets on a single architecture, without relying on machines and environments of each specific architecture.

Docker supports aggregating images for various architectures under a single tag, so that docker pull on different machines fetches the image matching the instruction set and architecture of the current system. With this design, however, generating and storing multiple architectures on one system, and using and validating a specific architecture, becomes more cumbersome. Therefore, in engineering practice, we identify the architecture directly in the image tag, which makes generating, fetching and verifying images simple and direct.
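One way to put the architecture in the image tag is sketched below. The "version-arch" tag scheme, the repository name and the function name image_ref are hypothetical and are not the naming actually used in our pipeline. The mapping of mips64 to mips64le assumes a little-endian system such as Loongson.

```python
import platform

# Machine names as reported by uname, mapped to Docker architecture names.
# "mips64" is assumed to be little-endian here (as on Loongson systems).
DOCKER_ARCH = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
    "mips64": "mips64le",
    "riscv64": "riscv64",
}


def image_ref(repo, version, machine=None):
    """Build an image reference whose tag names the CPU architecture."""
    machine = (machine or platform.machine()).lower()
    return "%s:%s-%s" % (repo, version, DOCKER_ARCH[machine])
```

For example, image_ref("example/rtc-server", "1.2.0", "aarch64") gives "example/rtc-server:1.2.0-arm64", and omitting the machine argument selects the tag for the machine the script runs on.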

If the final program needs to run natively rather than in Docker, you can specify the dynamic library loading path for different system dependencies by setting the LD_LIBRARY_PATH environment variable for the current process.
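A minimal sketch of setting LD_LIBRARY_PATH for a single process from a launcher script follows. The helper name env_with_library_path and the directory in the usage note are hypothetical.

```python
import os


def env_with_library_path(lib_dirs, base_env=None):
    """Return a copy of the environment with lib_dirs searched first."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    parts = list(lib_dirs) + ([existing] if existing else [])
    # LD_LIBRARY_PATH is a colon-separated list of directories on Linux.
    env["LD_LIBRARY_PATH"] = ":".join(parts)
    return env
```

The result only affects the child process it is passed to, for example subprocess.run(["./server"], env=env_with_library_path(["/opt/app/lib"])), and leaves the rest of the system untouched.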

In some cases, an inconsistent version of the system’s base C library can still cause the binary to fail at dynamic link time. In this case, consider using patchelf to modify the ELF file so that it uses only a specified C library and dynamic linker, isolating it from the environment’s dependencies.
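The patchelf step can be sketched as a command line assembled in a build script. The --set-interpreter and --set-rpath options are patchelf options; the helper name and the paths in the usage note are hypothetical, and the bundled loader and C library must match the target instruction set.

```python
def patchelf_command(binary, interpreter, rpath_dirs):
    """Build the argv that points a binary at a bundled loader and libraries."""
    return [
        "patchelf",
        "--set-interpreter", interpreter,    # dynamic linker the binary will use
        "--set-rpath", ":".join(rpath_dirs), # where that linker looks for libraries
        binary,
    ]
```

With patchelf installed, a call such as subprocess.run(patchelf_command("./server", "/opt/app/lib/ld-linux-aarch64.so.1", ["/opt/app/lib"]), check=True) rewrites the binary in place, so it is safer to run it on a copy.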

Conclusion

Rongyun has long focused on IM and RTC, and we have felt the market demand for multiple CPU instruction set architectures in both the public cloud and private cloud markets. At present, we have completed full-function adaptation and optimization for the ARM CPUs of the public clouds AWS and Huawei Cloud and for all ARM/MIPS CPUs in the Xinchuang market, as well as targeted adaptation for the various operating systems, databases and middleware in the Xinchuang market. This article has described the techniques and tools used in the compilation and adaptation work. Further discussion is welcome.

Reference links

QEMU: www.qemu.org/
Docker Buildx: docs.docker.com/buildx/work…
patchelf: github.com/NixOS/patch…