Background introduction
As Moore's Law in the CPU industry has slowed in recent years, many manufacturers have been looking for alternatives at the instruction set architecture level. In consumer products, Apple launched the ARM-based Apple Silicon M1, which has been well received; in the cloud service industry, Huawei Cloud and Amazon have independently developed and launched ARM CPU servers in the past few years, achieving good results in both cost and performance.
In the domestic CPU industry, apart from a few companies such as PKU Zhongzhi, Haiguang (Hygon), and Zhaoxin that hold x86_64 instruction set licenses, most manufacturers focus on non-x86_64 instruction sets. For example, Huawei and Phytium (Feiteng) are developing ARM CPUs, and Loongson has long focused on MIPS CPUs; the rise of RISC-V in recent years has also attracted the attention of many manufacturers.
For these non-x86_64 CPUs, software migration and adaptation mainly involves embedded devices, mobile phones, desktops, and servers. Embedded software, constrained by power consumption, usually has relatively simple logic, so migration and adaptation are not very complex. Mobile is generally Android on ARM, which does not involve many adaptation issues.
There are three situations on the desktop:
If the application is browser-based and all of its functionality works in a browser, adaptation is simple: domestic operating systems generally ship with a built-in Firefox browser, and the application only needs to be adapted to Firefox.
If the application is a light desktop application, consider the Electron approach. Electron (formerly known as Atom Shell) is an open-source framework developed by GitHub that uses Node.js (as the back end) and Chromium's rendering engine (as the front end) to build cross-platform desktop GUI applications. In this case, first check whether the software repository of the domestic operating system already provides the corresponding Electron packages (it usually does); if not, you need to compile Electron yourself.
If the application is a heavy native application, the code needs to be compiled against the corresponding instruction set and system dependencies, which requires a lot of work.
Servers are also divided into three situations:
If the service is written in a VM-based language, such as Java or other JVM languages (Kotlin, Scala, etc.), it does not require special adaptation. The software repository of the domestic operating system usually ships an OpenJDK implementation; if not, you can generally find an open-source OpenJDK build for the target instruction set and install it yourself.
In recent years, languages such as Go have emerged that have no strong dependence on the C library and whose build systems were designed from the start with multiple target systems and instruction set architectures in mind. You only need to specify the target system and architecture when compiling, for example GOOS=linux GOARCH=arm64 go build. If CGO is used, you also need to specify the C/C++ cross compiler, as in the sketch after this list.
If the service uses native languages such as C/C++ and depends heavily on the system C library, the code needs to be compiled against the corresponding instruction set and system dependencies, which requires a lot of work.
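As an illustration of the Go case, here is a minimal sketch; the module path ./cmd/server and the Debian/Ubuntu-style cross compiler package gcc-aarch64-linux-gnu are assumptions for the example.

```bash
# Cross-compile a pure Go service for linux/arm64 (no CGO needed)
GOOS=linux GOARCH=arm64 go build -o server-arm64 ./cmd/server

# With CGO enabled, also point CGO at a cross C compiler;
# the toolchain name assumes a "gcc-aarch64-linux-gnu" style package
CGO_ENABLED=1 GOOS=linux GOARCH=arm64 CC=aarch64-linux-gnu-gcc \
    go build -o server-arm64 ./cmd/server
```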
As can be seen from the above, server and desktop adaptation of native C/C++ is similar, with the server having more stringent performance requirements. This article mainly discusses how server-side native C/C++ is adapted to CPUs with different instruction sets, especially how to improve engineering efficiency when the codebase is large; most of the content also applies to the desktop.
A look at compiling and running
Since we are adapting native C/C++ programs to CPUs with multiple instruction sets, we need to understand how a program is compiled and run before we can apply various tools at each stage to improve adaptation efficiency.
From computer science courses, we generally learn that C/C++ source code is preprocessed, compiled, and linked into an executable file, which the computer then loads from disk into memory and runs. There are actually many details hidden in between; let us look at them one by one.
First, during compilation, the source code goes through the compiler front end for lexical analysis, syntax analysis, type checking, and intermediate code generation, producing intermediate code that is independent of the target platform. It is then handed to the compiler back end for code optimization, target code generation, and target code optimization, producing .o object files for the corresponding instruction set.
GCC handles the front end and back end together in this process, while Clang and LLVM correspond to the front end and back end respectively. From this we can also see how cross-compilation is commonly implemented: the compiler back end targets different instruction sets and architectures.
In theory, any C/C++ program should be compilable for any target platform through native or cross-compilation toolchains. In real projects, however, you also need to consider whether the build tools in use, such as make, cmake, bazel, and ninja, already support each scenario. For example, at the time this article was published, Chromium and WebRTC could not be compiled natively on Mac ARM64 due to issues in the ninja and gn toolchain.
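As a sketch of what cross-compilation looks like in practice, here is one way to invoke a GNU cross toolchain directly and one way to drive a CMake project with cross-compiler settings; the aarch64-linux-gnu-* tool names are an assumption tied to common Linux distributions.

```bash
# Compile a single C file for arm64 with a GNU cross toolchain
aarch64-linux-gnu-gcc -O2 -o hello-arm64 hello.c
file hello-arm64   # should report an "ELF 64-bit ... ARM aarch64" binary

# For a CMake project, point the build at the cross compilers instead
cmake -B build-arm64 \
      -DCMAKE_SYSTEM_NAME=Linux \
      -DCMAKE_SYSTEM_PROCESSOR=aarch64 \
      -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
      -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++
cmake --build build-arm64
```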
Next, the linker links the .o object files and the various dependent libraries together to generate an executable file.
During linking, the corresponding library files are searched for according to environment variables. You can list an executable's dependent libraries with the ldd command. When adapting to different system environments with the same instruction set, consider copying all library dependencies together with the binary executable as the build output.
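A minimal sketch of this "bundle the libraries" approach; the binary name my_service and the dist/lib layout are illustrative.

```bash
# List the shared libraries the executable depends on
ldd ./my_service

# Copy the resolved libraries next to the binary so they ship together
mkdir -p dist/lib
cp ./my_service dist/
ldd ./my_service | awk '/=> \//{print $3}' | xargs -I{} cp {} dist/lib/
```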
The final executable file, on both Windows and Linux, is a variant of the COFF (Common Object File Format) family: PE (Portable Executable) on Windows and ELF (Executable and Linkable Format) on Linux.
In fact, besides executable files, dynamic link libraries (DLLs) and static link libraries are also stored in executable file formats: all in PE-COFF format on Windows and all in ELF format on Linux, only with different file name suffixes.
Finally, when the binary executable is started, the system loads it into a new address space: it reads the header information from the object file, maps the program into the address space segments, and uses the dynamic linker and loader to load libraries and perform address relocation. It then sets up the process's environment and program arguments, and finally runs the program, executing its machine instructions one by one.
The libraries and dependencies of each system environment differ. You can specify which library directories are searched by setting the LD_LIBRARY_PATH environment variable, or you can provide a complete runtime environment through solutions such as Docker.
When the computer reads and executes each machine instruction, the instructions can also be translated and emulated by a virtual machine. For example, qemu supports multiple instruction sets, and Rosetta 2 on the Mac can efficiently translate x86_64 into arm64 and execute it.
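For example, qemu's user-mode emulation can run a foreign-architecture binary directly; the hello-arm64 binary and the /usr/aarch64-linux-gnu sysroot path below are illustrative assumptions.

```bash
# Run an arm64 ELF binary on an x86_64 host via qemu user-mode emulation
qemu-aarch64 -L /usr/aarch64-linux-gnu ./hello-arm64

# Or register qemu with binfmt_misc so foreign binaries run transparently,
# e.g. via the multiarch/qemu-user-static image
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
```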
Adaptation and engineering efficiency
By analyzing the whole process of compilation and execution, we can find many tools in the industry to improve adaptation efficiency.
To achieve fast CI/CD builds and avoid depending on the host system, we compile inside Docker.
By installing all tools and dependent libraries from scratch in the Dockerfile, you can strictly guarantee that the environment is identical for every build.
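A sketch of this flow; the image name rtc-build-env and the file name Dockerfile.build are illustrative assumptions.

```bash
# Build a dedicated compile image from the Dockerfile, then run the build inside it,
# so every CI run compiles in an identical environment
docker build -t rtc-build-env:debian-slim -f Dockerfile.build .
docker run --rm -v "$PWD:/src" -w /src rtc-build-env:debian-slim make -j"$(nproc)"
```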
In the compilation stage, if the dependencies are relatively clear, you can use cross-compilation to build the program for the target architecture directly on an x86_64 machine.
If the system library dependencies are complex but the codebase is relatively small, you can also consider using qemu to emulate the corresponding instruction set and compile "natively" inside the emulation. In fact, qemu can directly translate the gcc/clang binaries' instructions without changing the environment; Docker buildx is implemented based on this idea.
However, note that qemu executes through instruction translation, which is not efficient; with a large codebase this scheme is basically not worth considering. Docker buildx is also not entirely stable; more than once, compiling with buildx has hung the Docker service for me.
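For cases where the buildx route is still acceptable, a minimal sketch; the builder name, registry, and image tag are illustrative.

```bash
# Build an image for another architecture via qemu translation
docker buildx create --name multiarch --use
docker buildx build --platform linux/arm64 \
    -t registry.example.com/rtc-server:arm64 --load .
```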
When the codebase is large and heavily depends on its build tooling, converting it to gcc/clang cross-compilation may not be easy, and you can instead compile natively on a machine with the corresponding instruction set.
The specific choice depends on engineering practice. When the code repository is huge and hard to convert, you can even mix approaches per module, using cross-compilation, emulation, or native compilation on the target machine, and finally link the results together, whatever yields the highest engineering efficiency.
Efficiency optimization for specific CPUs
Different CPUs, even within the same architecture, support different specific machine instructions, which affects execution efficiency, for example whether certain extended instructions can be used. The usual optimization path is that each CPU manufacturer upstreams its features to gcc/clang/LLVM, so developers can use them at compile time. But this process takes time and imposes requirements on the compiler version, so CPU manufacturers also explain in their documentation which gcc version to use and which extra parameters to pass to the gcc command.
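As an example, a hedged sketch of such flags for an aarch64-targeted gcc (native or cross); the exact -march/-mtune values are assumptions (tsv110 is the Kunpeng 920 core and requires a sufficiently new gcc), so always follow the vendor's documentation.

```bash
# Generic arm64 build that runs on any ARMv8-A CPU
gcc -O2 -march=armv8-a -c codec.c

# Tuned build for a specific core, e.g. Kunpeng 920 (core name tsv110)
gcc -O2 -march=armv8.2-a+crc -mtune=tsv110 -c codec.c
```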
Our RTC service uses Kubernetes for orchestration, so the build output is actually Docker images. When facing multiple instruction set architectures, you need to be more careful when choosing the base image.
Common choices for the Docker base image are scratch, alpine, debian, debian-slim, ubuntu, and centos.
Unless there are special requirements, hardly anyone chooses the empty scratch image and builds everything from nothing.
alpine is only about 5 MB, which looks attractive, but its C library is musl rather than the glibc common on desktop and server systems. For heavy C/C++ applications, try not to use it, or the workload may increase substantially.
Compared with debian, debian-slim mainly removes infrequently used files and documentation; for ordinary services, slim is a fine choice.
Both ubuntu and centos lack official support for the MIPS architecture. If your work needs to cover MIPS CPUs such as Loongson, consider debian-slim.
Another point to note: many open-source projects use ubuntu for their build verification. Keep in mind that ubuntu is based on debian's unstable or testing branch, so the C library version it uses differs from debian's.
After CI builds, you can use qemu + Docker to start the service and perform simple verification of multiple instruction sets on a single architecture, without needing machines and environments of each specific architecture.
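For example, assuming qemu/binfmt has been registered on the host as shown earlier, and with an illustrative image tag:

```bash
# On an x86_64 CI machine, run the arm64 image through qemu for a smoke test
docker run --rm --platform linux/arm64 \
    registry.example.com/rtc-server:arm64 ./server --version
```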
Docker supports aggregating images of multiple architectures under one tag, so that docker pull on different machines fetches the image matching the current system's instruction set and architecture. However, with this design, building and storing multiple architectures on one system, and having to specify a particular architecture when using and verifying an image, becomes cumbersome. Therefore, in our engineering practice we encode the architecture directly in the image tag, which keeps image building, pulling, and verification simple and straightforward.
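A sketch of both options, with illustrative registry and tag names:

```bash
# Option 1: encode the architecture directly in the tag (the approach we use)
docker push registry.example.com/rtc-server:1.2.0-amd64
docker push registry.example.com/rtc-server:1.2.0-arm64

# Option 2: aggregate them under one multi-arch tag via a manifest list,
# so `docker pull registry.example.com/rtc-server:1.2.0` resolves by host architecture
docker manifest create registry.example.com/rtc-server:1.2.0 \
    registry.example.com/rtc-server:1.2.0-amd64 \
    registry.example.com/rtc-server:1.2.0-arm64
docker manifest push registry.example.com/rtc-server:1.2.0
```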
If the final program needs to run natively rather than in Docker and faces differing system dependencies, you can specify the dynamic library search path by setting the LD_LIBRARY_PATH environment variable of the process.
When building the executable binary, you can use the ldd command to find and copy all dependent libraries, and point LD_LIBRARY_PATH at them to isolate the dependency on system libraries. In some cases, mismatched base C library versions can still cause problems when loading the executable; then you can use patchelf to modify the ELF so that it uses only the specified C library and dynamic linker, isolating the various environment dependencies.
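A minimal sketch of this isolation, with /opt/my_service used as an illustrative install prefix:

```bash
# Ship the binary with its own libraries and point the loader at them
export LD_LIBRARY_PATH=/opt/my_service/lib:$LD_LIBRARY_PATH
/opt/my_service/bin/my_service

# When the host glibc/loader differs, patchelf can rewrite the ELF to use
# a bundled interpreter and rpath instead
patchelf --set-interpreter /opt/my_service/lib/ld-linux-aarch64.so.1 \
         --set-rpath /opt/my_service/lib \
         /opt/my_service/bin/my_service
```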
Concluding remarks
Rongyun has always focused on the IM and RTC fields, and in both the public cloud and private cloud markets we have felt the demand for multiple CPU instruction set architectures. At present we have completed full-featured adaptation and optimization for ARM CPUs on the AWS and Huawei public clouds and for all ARM/MIPS CPUs in the Xinchuang (domestic IT innovation) market, along with targeted adaptations for the various operating systems, databases, and middleware in that market. This article has analyzed the technologies and tools involved in this compilation and adaptation work; we welcome further exchanges.
Reference links
qemu: https://www.qemu.org/
docker buildx: https://docs.docker.com/buildx/working-with-buildx/
patchelf: https://github.com/NixOS/patchelf