Background

As Moore’s Law has broken down in the CPU industry in recent years, many vendors have been looking for alternatives at the instruction set architecture level. In consumer products, Apple introduced the ARM-based Apple Silicon M1 to great acclaim. In the cloud industry, Huawei Cloud and Amazon developed and launched their own ARM CPU servers several years ago, with strong results in both cost and performance.

In the domestic CPU industry, apart from PKU Zhongzhi, Hygon, Zhaoxin, and a few other vendors that hold x86_64 instruction set licenses, manufacturers are basically focused on non-x86_64 instruction sets. For example, Huawei and Phytium are developing ARM CPUs, while Loongson has long focused on MIPS CPUs. RISC-V has also attracted the attention of many manufacturers in recent years.

For the various non-x86_64 CPUs, porting and adapting industry software mainly involves the embedded, mobile, desktop, and server sides. Because of power constraints, embedded software generally has relatively simple logic, so the complexity of porting and adaptation is low. The mobile side is generally Android on ARM, which does not require much adaptation.

There are three cases on the desktop side:

1. If the application is browser-based, all of its functions can generally be covered. Domestic operating systems usually ship with the Firefox browser, so the application can be adapted to Firefox.

2. If the application is a light desktop application, consider Electron. Electron (formerly known as Atom Shell) is an open-source framework developed by GitHub. It uses Node.js (as the back end) and Chromium’s rendering engine (as the front end) to build cross-platform desktop GUI applications. In this case, first check whether the software repository of the domestic OS already provides the required Electron packages (it generally does); if not, you will need to compile them yourself.

3. If the application is a heavy native application, the code has to be compiled against the corresponding instruction set and system dependencies, which means a large amount of work.

The server side is likewise divided into three cases:

1. If the service uses a virtual-machine-oriented language, such as Java or one of the various JVM-based languages (Kotlin, Scala, and so on), no special adaptation is required. The software repository of a domestic OS generally ships an OpenJDK build for that platform; if not, an open-source OpenJDK implementation for the instruction set can usually be found and installed manually.

2. In recent years, some languages that do not depend strongly on the C library have appeared, such as Go. Their build systems were designed from the start with multiple target systems and instruction set architectures in mind, so you only need to specify the target at compile time, for example GOOS=linux GOARCH=arm64 go build. If CGO is used, the C/C++ cross-compiler must also be specified.

3. If the service uses a native language such as C/C++ and depends strongly on the system C library, the code must be compiled against the corresponding instruction set and system dependencies, which again means a large amount of work.
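To make the Go case concrete, here is a minimal sketch of building one service for several Linux targets. The goarch_for helper, the target list, and the ./cmd/server package path are illustrative, not from the original text; the build loop is guarded so it only runs where a Go toolchain and that package actually exist.

```shell
set -e

# Map a uname-style machine name to Go's GOARCH value.
goarch_for() {
  case "$1" in
    x86_64)  echo amd64 ;;
    aarch64) echo arm64 ;;
    *)       echo "unsupported: $1" >&2; return 1 ;;
  esac
}

# Pure-Go code needs no C toolchain; CGO_ENABLED=0 keeps the binary
# free of libc dependencies. With CGO, a cross C compiler must also
# be named via the CC environment variable.
if command -v go >/dev/null 2>&1 && [ -d ./cmd/server ]; then
  for m in x86_64 aarch64; do
    GOOS=linux GOARCH="$(goarch_for "$m")" CGO_ENABLED=0 \
      go build -o "server-linux-$(goarch_for "$m")" ./cmd/server
  done
fi
```

The same mapping idea extends to any other architecture the Go toolchain supports.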

As can be seen above, the server and desktop sides adapt native C/C++ in similar ways, but the server side has stricter performance requirements. This article is mainly about how to adapt server-side native C/C++ to CPUs with multiple instruction sets, and in particular how to keep engineering efficient when the code base is huge. Most of it also applies to the desktop side.

A ramble on compiling and running

Since what we need to handle is the adaptation of native C/C++ programs to a variety of instruction set CPUs, we first need to understand how a program is compiled and run, so that we can use tools at each step to improve adaptation efficiency.

As you probably learned in a computer course, C/C++ source code is preprocessed, compiled, and linked to produce an executable file. The computer then loads the program from disk into memory and runs it. A lot of detail hides in these steps, so let’s walk through them.

First, during compilation the source code passes through the compiler front end for lexical analysis, syntax analysis, type checking, and intermediate code generation, producing a target-independent intermediate representation. This is then handed to the compiler back end for code optimization, object code generation, and object code optimization, yielding a .o object file for the corresponding instruction set.

GCC handles the front end and back end together in this process, while Clang/LLVM separates them. This also shows how common cross-compilation is implemented: the compiler back end is wired to different instruction sets and architectures.
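The front-end/back-end split can be observed directly with Clang: the same source file can be lowered to target-independent LLVM IR, or to assembly for a different instruction set. A small sketch, assuming a clang binary is on PATH; the aarch64 target may not be built into every clang, so that step is allowed to fail.

```shell
cat > hello.c <<'EOF'
#include <stdio.h>
int main(void) { puts("hello"); return 0; }
EOF

if command -v clang >/dev/null 2>&1; then
  # Front end only: emit target-independent LLVM IR.
  clang -S -emit-llvm hello.c -o hello.ll
  # Same front end, different back end: ARM64 assembly from one source.
  clang --target=aarch64-linux-gnu -S hello.c -o hello_arm64.s || true
fi
```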

Theoretically, every C/C++ program should be compilable for every target platform using native and cross-compilation toolchains. In actual engineering, however, you also need to consider whether the build tools in use, such as make, cmake, bazel, and ninja, already support each situation. For example, at the time of writing, Chromium and WebRTC could not be built natively on Mac ARM64 because of issues in the Ninja and GN toolchains.

The linker then links the .o object files and the various dependent libraries together to produce an executable file.

During linking, the corresponding library files are located according to environment variables and search paths. With the ldd command you can see the list of shared libraries an executable depends on. When adapting to different systems with the same instruction set, consider copying all library dependencies along with the binary executable as the compiled output.
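The "copy all library dependencies along with the binary" idea can be sketched as a small script. Here /bin/ls stands in for the real service binary, and ./deps is an illustrative output directory:

```shell
BIN=/bin/ls    # placeholder for the real service binary
DEST=./deps    # directory shipped alongside the binary
mkdir -p "$DEST"

if command -v ldd >/dev/null 2>&1; then
  # ldd prints one "name => /path (addr)" line per resolved dependency;
  # keep only the absolute paths and copy each library next to the binary.
  ldd "$BIN" | awk '/=>/ && $3 ~ /^\// { print $3 }' | while read -r lib; do
    cp -n "$lib" "$DEST/" 2>/dev/null || true
  done
fi
```

At deploy time, LD_LIBRARY_PATH can then point at that directory.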

The final executable file, on both Windows and Linux, is a variant of COFF (Common Object File Format): PE (Portable Executable) on Windows, and ELF (Executable and Linkable Format) on Linux.

In fact, besides executables, dynamic link libraries (DLLs) and static libraries are stored in the same executable formats: PE/COFF on Windows and ELF on Linux, differing only in the file name suffix.

Finally, when the binary executable is started, it is loaded into a new address space. The system reads the header information from the object file, maps the program into address space segments, loads dependent libraries via the linker and loader, and performs address translation. It then sets up the process’s environment and program arguments, and the program finally runs, executing its machine instructions one by one.

The libraries and dependencies of each system environment differ. You can specify which library directories are searched by setting the environment variable LD_LIBRARY_PATH, or pin down a complete runtime environment with a scheme such as Docker.

While reading and executing machine instructions, the computer can also translate them in software by means of a virtual machine. For example, QEMU supports a variety of instruction sets, and Apple’s Rosetta 2 on the Mac efficiently translates x86_64 to ARM64 for execution.
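As a sketch of user-mode emulation, the commands below cross-compile a static ARM64 binary and run it on an x86_64 host through qemu-aarch64. Both the aarch64-linux-gnu-gcc toolchain and qemu-user are assumptions about the host, so the whole thing is skipped where they are missing:

```shell
if command -v aarch64-linux-gnu-gcc >/dev/null 2>&1 \
   && command -v qemu-aarch64 >/dev/null 2>&1; then
  cat > hello_qemu.c <<'EOF'
#include <stdio.h>
int main(void) { puts("hello from arm64"); return 0; }
EOF
  # Static linking avoids needing an ARM64 sysroot at run time.
  aarch64-linux-gnu-gcc -static hello_qemu.c -o hello-arm64
  # QEMU translates the ARM64 instructions to host instructions on the fly.
  qemu-aarch64 ./hello-arm64
fi
```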

Adaptation and engineering efficiency

By analyzing the entire process of compilation and running, we can find many tools in the industry to improve the efficiency of adaptation.

Because we pursue fast CI/CD builds with no dependence on the host system, we compile inside Docker.

By installing all tools and dependent libraries from scratch in a Dockerfile, you can strictly guarantee a consistent environment for every build.

At the compile stage, if the dependencies are clear, you can cross-compile the program directly on an x86_64 machine.

If the system library dependencies are relatively complex but the amount of code is small, you can also consider using QEMU to emulate the corresponding instruction set and compile "natively" under emulation: QEMU directly translates the instructions of GCC/Clang, and the build environment needs no modification. Docker’s buildx is implemented along these lines.
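A hedged sketch of that approach with docker buildx: QEMU binfmt handlers are registered first so foreign-arch build steps run under emulation, then one command builds for several platforms. The image name and platform list are illustrative, and every docker call is allowed to fail where Docker is absent or unconfigured:

```shell
PLATFORMS="linux/amd64,linux/arm64"

if command -v docker >/dev/null 2>&1; then
  # One-time setup: register QEMU interpreters for foreign architectures.
  docker run --privileged --rm tonistiigi/binfmt --install arm64 || true
  # Create and select a builder instance that supports multi-platform output.
  docker buildx create --use --name multiarch || true
  # Build images for all listed platforms from the current Dockerfile.
  docker buildx build --platform "$PLATFORMS" -t example.com/app:latest . || true
fi
```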

However, note that QEMU works by instruction set translation, which is not efficient; with a large amount of code, this scheme is basically not worth considering. Docker buildx is also not very stable: it has failed on me more than once while building service images.

When the code base is large and deeply dependent on its build tools, GCC/Clang cross-compilation may be hard to retrofit; in that case, compile natively on a machine with the corresponding instruction set.

The right choice depends on the engineering practice at hand. When the code repository is huge and hard to transform, different modules can even be built differently, some cross-compiled, some built under emulation, some built natively on the target machine, and finally linked together, as long as overall engineering efficiency is maximized.

Specific CPU efficiency optimization

Different CPUs, even within the same architecture, support different specific machine instructions, which affects performance, for example whether certain long instructions can be used. The normal path is for each CPU vendor to push its features into GCC/Clang/LLVM and make them available to developers at compile time. However, this takes time and depends on the compiler version, so each CPU vendor documents which GCC version may be required at compile time, and even which special flags to add to the GCC command line.
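As a concrete, illustrative example of such flags with GCC: -march selects the instruction-set baseline the compiler may use, and -march=native targets the build machine’s own CPU. The source file and flag values below are examples, not vendor documentation; the tuned build is allowed to fail on compilers without that support.

```shell
cat > sum.c <<'EOF'
/* A loop the compiler can vectorize when wider instructions are allowed. */
float sum(const float *a, int n) {
    float s = 0;
    for (int i = 0; i < n; i++) s += a[i];
    return s;
}
EOF

if command -v gcc >/dev/null 2>&1; then
  # Baseline build, runnable on any CPU of the target architecture.
  gcc -O2 -c sum.c -o sum_generic.o
  # Build tuned to this machine's CPU features (may use newer instructions).
  gcc -O2 -march=native -c sum.c -o sum_native.o || true
fi
```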

Our RTC service uses Kubernetes for service orchestration, so the build output is actually Docker images. Facing multiple instruction set architectures, the base image must be chosen carefully.

Docker base images are usually chosen from scratch, Alpine, Debian, Debian-slim, Ubuntu, and CentOS.

Unless there is a specific requirement, you do not want to build everything up from the empty scratch image.

Alpine is only about 5 MB and looks attractive, but its system C library is musl rather than the common glibc. For heavy C/C++ applications on desktop systems or servers, try to avoid it, as it may cause a lot of extra work.

Compared to Debian, Debian-slim mainly removes rarely used files and documentation. Ordinary services can simply choose the slim variant.

Both Ubuntu and CentOS lack official support for the MIPS architecture, so if your work must cover a MIPS CPU such as the Loongson, consider Debian-slim.

Another thing to note is that a lot of open-source software is compiled on Ubuntu. Keep in mind at compile time that Ubuntu is based on Debian’s unstable or testing branch, so the C library version it ships may not match Debian’s, and a binary compiled on Ubuntu may not run on a Debian-based system.

After CI finishes compiling, QEMU + Docker can be used to start the service, so simple verification across multiple instruction sets can be done on a single architecture, without needing a machine and environment for each specific architecture.
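For example, assuming QEMU binfmt handlers are registered on the host, a foreign-arch image can be smoke-tested directly; uname -m inside the container should then report the emulated architecture. The image name is illustrative and the command is allowed to fail where Docker is absent:

```shell
PLATFORM=linux/arm64

if command -v docker >/dev/null 2>&1; then
  # Should print "aarch64" even on an x86_64 host (via QEMU emulation).
  docker run --platform "$PLATFORM" --rm alpine uname -m || true
fi
```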

Docker supports aggregating images for several architectures under a single tag: on different machines, docker pull then fetches the image matching the current system’s instruction set and architecture. With that design, however, generating and storing images for multiple architectures on a single system, and pinning a particular architecture for use and validation, are cumbersome. So in engineering practice we encode the architecture directly in the image tag, which keeps generating, fetching, and verifying images simple and direct.
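The two schemes can be sketched side by side. The manifest-list scheme uses the docker manifest subcommand to aggregate per-arch images under one tag; the scheme the text adopts simply encodes the architecture in the tag. Image name and version are illustrative:

```shell
IMAGE=example.com/app
VER=1.2.3

# Scheme used in the text: the architecture lives in the tag itself.
tag_for() { echo "$IMAGE:$VER-$1"; }   # e.g. example.com/app:1.2.3-arm64

# Alternative: one tag backed by a manifest list of per-arch images
# (shown commented; requires the per-arch images to be pushed first).
# docker manifest create "$IMAGE:$VER" "$(tag_for amd64)" "$(tag_for arm64)"
# docker manifest push "$IMAGE:$VER"
```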

If the final program must run natively rather than in a Docker environment, differing system dependencies can be handled by setting the current process’s LD_LIBRARY_PATH environment variable to choose the dynamic-library load path.

When you compile an executable binary, you can copy all the dependent libraries found via the ldd command and use LD_LIBRARY_PATH to point at them, cutting the dependence on system libraries. In some cases, because the base C library version differs, the executable may still fail to load. At that point, consider patchelf to modify the ELF so that the binary uses only a specified C library and dynamic linker, isolating it from all environment dependencies.
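A sketch of that last step with patchelf. The binary path and the ./runtime directory holding a bundled dynamic linker and libraries are assumptions; the linker file name varies by architecture (ld-linux-aarch64.so.1 here is the ARM64 one):

```shell
BIN=./myserver        # placeholder for the real service binary
RUNTIME=./runtime     # bundled ld-linux, libc, and other dependencies

if command -v patchelf >/dev/null 2>&1 && [ -f "$BIN" ]; then
  # Use the bundled dynamic linker instead of the system one...
  patchelf --set-interpreter "$RUNTIME/ld-linux-aarch64.so.1" "$BIN"
  # ...and search the bundled library directory first at load time.
  patchelf --set-rpath "$RUNTIME" "$BIN"
  # Show the resulting rpath for verification.
  patchelf --print-rpath "$BIN"
fi
```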

Conclusion

Rongyun has long focused on IM and RTC, and we see demand for multiple CPU instruction set architectures in both the public cloud and private cloud markets. So far we have completed full-function adaptation and optimization for ARM CPUs on public clouds (AWS, Huawei Cloud) and for the ARM/MIPS CPUs in the Xinchuang market, along with targeted adaptation to the various operating systems, databases, and middleware there. This article has reviewed the techniques and tools used in this compilation and adaptation work. You are welcome to share your experience with us.

Reference links

qemu: https://www.qemu.org/

docker buildx:

https://docs.docker.com/build…

patchelf: https://github.com/NixOS/patc…