**Abstract:** This talk discusses the principles of porting software to the Kunpeng platform and the corresponding software engineering process.

When porting software across platforms in a Linux environment, developers have to read the code, modify it by hand, and compile and debug it repeatedly. The porting cycle is long and the efficiency is low. So how can we shorten the cycle and raise the efficiency?

To address this, Rutao Zhang, an intelligent computing expert at Huawei, gave a talk titled “How to automatically port 90% of the code to the Kunpeng platform”. It focuses on efficient porting of C/C++ software with the Kunpeng development kit, which helps developers speed up cross-platform porting. The following is a transcript of the talk:

Today’s topic is software migration, which is an old topic. Whenever the platform is switched, the CPU architecture changes, or even a language version is upgraded, we may run into software migration issues. Today we will discuss the principles behind software migration and the corresponding software engineering process.

How does Kunpeng help developers improve efficiency during porting, and how do we pass Huawei’s own software development experience on to developers so that they can speed up development and reduce cost? We launched the Kunpeng development kit to help users with software porting and performance acceleration on the Kunpeng platform. Today’s talk covers three main parts.

When it comes to software porting, I don’t know how many of you have written relatively low-level software. For low-level software you may use low-level languages such as assembly, C, and C++. Such languages depend heavily on the hardware architecture of the machine, so when you switch from one platform to another, code written in them has to be ported again. How much work that is depends on the programming language we use and the platform we are porting to.

When we use assembly code or a compiled language, we face some migration problems and challenges. Some of them can be solved by the compiler. Others, especially in low-level code, require us to consult the manuals by hand and convert the code to the instructions of the new platform.

Here is an example of the instruction differences between our Kunpeng processor and an x86 processor: a simple program that adds two ints. After compiling it with GCC, we disassemble it with objdump and can see the instruction encodings and the corresponding assembly code. x86 is a CISC (complex instruction set) architecture. Kunpeng is a chip developed by Huawei that is fully compatible with the Arm64 architecture, so its instruction set is fully compatible with the Arm64 reduced instruction set (RISC).
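The two-integer addition mentioned above can be sketched as follows. The function name `add_ints` and the file name `add.c` are illustrative choices, not taken from the talk. Compiling with `gcc -O0 -c add.c` on each platform and running `objdump -d add.o` should show the differing instructions.

```c
/* add.c: a two-integer addition used only to compare compiler output
 * across platforms. The name add_ints is hypothetical. */
int add_ints(int a, int b)
{
    /* At -O0 the operands are typically spilled to the stack and reloaded,
     * which makes the load, add, and store instructions easy to spot. */
    int sum = a + b;
    return sum;
}
```

The C source stays identical on both platforms; only the disassembly differs.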

The distinction between reduced and complex instruction sets dates from the 1970s. IBM studied how to make a CPU run efficiently and found that, in the code that was common at the time, a small set of instructions was used very frequently while the rest were rarely used. IC design and manufacturing processes were also far less advanced than they are now. So the idea was to keep the CPU hardware design simpler and gain efficiency through software, and the concept of the reduced instruction set was proposed.

Its most significant characteristic is that all instructions have the same length, that is, the bit width of every instruction is equal, so the number of cycles each instruction takes to execute is almost the same. It makes complicated things as simple as possible and then uses many simple operations to complete a complex task.

By contrast, look at the complex instruction set on x86. Its instructions have different lengths. For example, the mov and add instructions listed here have different machine codes and different lengths, which inevitably complicates the decoder in the chip. The pipeline stages that actually perform the operations also differ, so the number of cycles per instruction differs as well. But there is a benefit: a single instruction can do a more complicated thing. The instruction may become very long, but because one instruction completes a more complicated task, it may be relatively easier for upper-level programmers to understand.

That is a brief background on reduced and complex instruction sets. From the disassembled x86 code and Kunpeng code, we can see that the instructions are completely different and the register names are completely different. In x86-64 mode there are 16 general-purpose registers, plus floating-point registers. Depending on which of MMX, SSE, or AVX is supported, an x86 platform can have up to 32 floating-point registers.

Because the Kunpeng platform is compatible with the Arm64 instruction set, its registers follow the Arm64 definition. The Kunpeng platform has 31 general-purpose registers, plus some status registers and other special registers. For floating point it has 32 Advanced SIMD (single instruction, multiple data) registers, each 128 bits wide. This differs from x86-64: if a current x86-64 processor supports AVX-512, its vector registers are 512 bits wide. From this point of view, the difference in hardware is very obvious.

Then, looking at the disassembly, you may have noticed the mov instruction on the x86 platform. In the first line, a mov loads data from an address based on the rbp register into a register such as edx, that is, it loads a variable of the function from memory. On the Kunpeng processor the same operation becomes an ldr instruction that loads into a register, followed by the addition; there is of course also an add further down. For the store, x86 again uses a mov from a register to memory, whereas the new platform uses an str instruction. This reflects a characteristic of RISC, let’s call it the second characteristic: it is a load/store architecture. On the Kunpeng processor, instructions cannot operate directly from memory to memory; a register must act as a relay.
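A small sketch can make the load/store point concrete. The function name `accumulate` is hypothetical, and the instruction sequences in the comments describe what GCC typically emits with optimization enabled, not a guarantee.

```c
/* Adds value to the integer stored at *total.
 * On x86-64 a compiler may emit a single add with a memory operand.
 * On AArch64 the same statement needs three instructions:
 * a load (ldr), an add on registers, and a store (str). */
void accumulate(int *total, int value)
{
    *total += value;
}
```

Compiling this with `gcc -O2 -S` on both platforms is one way to see the register acting as a relay on Arm64.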

This is another way in which the reduced instruction set differs from x86. In addition, x86 has very rich memory addressing modes, while those of the Kunpeng platform are not as rich. This program is just a quick example: from the CPU’s point of view, for the same piece of C code, the two CPUs execute different instructions, go through different cycles and operations, and then output the final result. From the program’s point of view, however, there is no difference between the two platforms: apart from the instructions, the execution result does not change.

This does show, though, that because the instruction sets are different, platform differences must be considered when switching platforms, even with C and C++, which are high-level languages but sit close to the hardware. Even when developing software on a single platform, considering platform compatibility is a good programming habit.

Cross-platform porting faces many problems, because software porting is itself an engineering problem. Generally speaking, as a first step, if we decide to migrate from x86 to Kunpeng, we need to judge whether the migration is worthwhile and how difficult it is. The common practice today is to take the software package for the x86 platform and look at its dependencies. What does this mean? If the software runs on x86, which third-party components does it rely on, and do these components exist on the target platform?

Making this judgment usually means installing and running the software on the new platform repeatedly and eliminating the errors the system reports one by one. This is done by hand and is quite laborious, as anyone with porting experience will know. Some things are complicated and tedious, and if you make a careless mistake, you may not be able to find the cause.

After you solve the first step and get through compilation, you may run into runtime errors, so that functions fail on the new platform, which we really hate. There are many possible causes: the software’s own logic may have problems; the APIs of third-party components may have cross-platform compatibility problems; or the system itself may lack support. Many factors are involved. After porting, engineers have to locate the corresponding problems. Locating them places relatively high demands on the engineers’ technical skills, and it also involves repeated compilation, repeated adjustment, and repeated verification, so the cost of this process is very high.

Once functional verification is complete and basic testing is done, the software seems ready for use on the new platform. But you may face a performance problem when you use it in a production environment, because in production everyone wants the most performance from the least hardware, that is, the best price-performance ratio. That places requirements on software performance. We then have to use some methods, such as commercial software or open source tools, to analyze the bottleneck of the software and see whether the problem lies in the system configuration parameters or in the logic of the software itself.

These three steps are the important ones we have identified over many years of software development at Huawei. They have a decisive impact on the quality of our software and of the port. At the same time, none of these three steps is an easy obstacle for anyone to overcome.

For software porting, we usually say that compiled software faces these difficulties, while interpreted software is relatively easy. Why? What does software written in Java, Python, or even Go depend on? It depends on the runtime environment provided by the language, such as the Java virtual machine (JVM) for Java. We just need to install a JVM for the corresponding platform, and it masks all the underlying differences.

Software that only depends on such a runtime environment usually runs without problems. For languages such as C, C++, and Go, which are compiled, or for software that calls C/C++ components, the C/C++ code needs to be ported. This can be divided into several cases.

The first is open source software. Here we usually work with the community to get it to support the Kunpeng platform, or rather the Arm64 platform, so that the problem is solved once and for all. Then there is self-developed software and closed-source software developed by some vendors, who cannot open their code. Here we need commercial cooperation to guide customers in porting to our Kunpeng platform.

For commercial closed-source software, such as Microsoft’s software or Oracle’s database, we cannot obtain the source code, and it is not easy to push these vendors to cooperate with our Chinese software community. You either cooperate or find an alternative, right? If the user’s business really cannot be replaced or modified, we may have to adopt a mixed deployment of Kunpeng and x86, which is a software deployment strategy.

There is also software developed for the Windows platform, which we use often. Windows may have said more than a year ago that it would support the Arm64 architecture, but it has not actually announced this so far. Commercial considerations or other factors may weigh more heavily, especially for such a large company. For the Windows platform, we provide limited support through the open source ecosystem. Take Microsoft’s C#, for example: .NET Core 3.0 is open source and already available on the Arm platform. In other words, we can support C# on the Kunpeng platform based on .NET Core 3.0.

The Kunpeng software porting process can be decomposed into several steps. The most important of those listed are dependency scanning, source code porting, and performance analysis, and for each of these steps we now provide tools that help developers analyze and port their software.

For binary dependency scanning, we provide scans of the dependency libraries used when the software is installed and when it runs. Over a long period we have accumulated a compatibility list, which covers most of the popular and common operating systems on the market with their versions and the corresponding versions of GCC.

For the second phase of porting, modifying the C/C++ source code, we also provide a tool for C/C++ source code analysis. The analysis focuses on assembly code, compile options, and macro definitions, as well as the built-in functions and attributes provided by compilers, and it also examines the user’s Makefile and CMakeLists.txt. If the software is built with make or CMake, we can help find, identify, and recommend the changes needed for the port.

When the migration is complete, we provide a performance analysis tool to help users check whether the software meets the required standard, that is, to check its performance indicators. We conduct system-level performance analysis and software-level hotspot analysis. On this basis, we give users optimization methods that Huawei has accumulated and found effective, such as suggestions for adjusting the system or modifying the software. These are the three tools we want to introduce today. With them, C/C++ code can be migrated from a non-Kunpeng platform to the Kunpeng platform more conveniently and efficiently.

When porting C/C++ software, we should focus on three kinds of problems. The first is differences in the build files. Here are two examples. In a project built for x86 we may see a compile option called -m64, which means we want to generate code in 64-bit mode; it determines the ABI of the compiled object code. On the Kunpeng platform we can replace it with -mabi=lp64. To be safe, we can also add -fpic to generate position-independent code, which shields some underlying dependencies. In this way the -m64 compile option is replaced.
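Both `-m64` on x86-64 Linux and `-mabi=lp64` on Arm64 select the LP64 data model, so one way to sanity-check a build after swapping the flags is a small size check like the sketch below. The helper name `is_lp64` is an illustrative assumption and not part of any Kunpeng tool.

```c
#include <limits.h>

/* Returns 1 when the compiler targets the LP64 data model
 * (32-bit int, 64-bit long, 64-bit pointers), and 0 otherwise. */
int is_lp64(void)
{
    return sizeof(int) * CHAR_BIT == 32 &&
           sizeof(long) * CHAR_BIT == 64 &&
           sizeof(void *) * CHAR_BIT == 64;
}
```

If this returns 0 after the flag change, code that assumes a 64-bit `long` or pointer needs a closer look before it is ported further.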

The other example is replacing the instruction set architecture (ISA) options. We commonly see the -march parameter. The x86 platform offers 20 or 30 architecture choices, from Intel to AMD. The Arm platform is relatively simple: we only need to choose the Arm architecture version supported by the CPU. For Kunpeng 920, we specify the armv8.2-a architecture. If the GCC version is relatively new, say 9.1 or above, we can also use -mtune=tsv110. This is our TaiShan v110 microarchitecture model, which has been added to GCC. We contributed architecture-specific tuning to the compiler, which gives relatively good performance, said to be a 5% to 10% improvement.

The second part is porting the C/C++ source code. Here are two examples. The first concerns basic data types. Although the Kunpeng platform and the x86 platform both follow the LP64 data model, the definitions still differ in the details. For example, char is always 8 bits wide, but on x86 plain char is a signed type, while on the Kunpeng platform it is unsigned. We can modify the Makefile and add the option -fsigned-char to make the default unsigned char signed. This ensures that the C code’s logic for char operations is unambiguous.
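The char difference can be illustrated with a short sketch. All function names here are hypothetical. The conversion of out-of-range values to a signed type is implementation-defined in C, and the comments describe the wrap-around behavior GCC documents.

```c
#include <limits.h>

/* 1 if plain char is signed on this target (typical for x86 Linux),
 * 0 if it is unsigned (the default for Arm64 Linux). */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Widens a byte through plain char: the result depends on the platform
 * default unless the code is built with -fsigned-char or -funsigned-char. */
int widen_plain(unsigned char byte)
{
    char c = (char)byte;
    return (int)c;
}

/* Widens a byte through an explicitly signed char: with GCC the result is
 * the same on both platforms, so 0xFF always becomes -1. */
int widen_signed(unsigned char byte)
{
    signed char c = (signed char)byte;
    return (int)c;
}
```

Spelling out `signed char` or `unsigned char` in the source is an alternative to the compiler option when only a few variables are affected.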

The second example is that the compiler provides hundreds of predefined macros that C and C++ software can read. For example, if we use GCC, we can use these macros directly in the C and C++ source files. The compiler sets such a macro to the correct value at compile time according to the environment, which here means the host environment: I am talking about compiling and running on the same machine, not about cross-compilation where host and target differ. For such software we need to distinguish by macro. Take __x86_64__, for example: it clearly indicates x86 support and cannot be defined on the Kunpeng platform. We then advise users to modify their code so that the software logic is isolated by preprocessor conditionals. For the Kunpeng platform we usually use the __aarch64__ (Arm64) macro to guard the logic. Other architectures likewise have their own architecture macros.
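A minimal sketch of isolating platform-specific logic with the predefined macros looks like this. `__x86_64__` and `__aarch64__` are real GCC predefined macros; the function name `target_arch` is an illustrative assumption.

```c
/* Returns a short name for the architecture the code was compiled for.
 * Each branch is where platform-specific logic would be placed. */
const char *target_arch(void)
{
#if defined(__x86_64__)
    return "x86_64";
#elif defined(__aarch64__)
    return "aarch64";   /* Kunpeng is Arm64-compatible */
#else
    return "unknown";   /* fallback keeps the code building elsewhere */
#endif
}
```

Keeping a generic fallback branch means an unported platform fails softly instead of breaking the build.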

The third kind of problem is porting assembly code, which is also the most troublesome. The x86 platform has a little under 2,100 assembly instructions, and the Kunpeng platform, being compatible with Arm64, has more than 1,000 and fewer than 1,100, so together there are more than 3,000 instructions. Telling them all apart is very painful. Intel’s instruction set manual has more than 4,000 pages, and the Arm manual has more than 7,000 pages. Reading these English-only documents is overwhelming, so porting assembly code is a real difficulty.

Assembly code appears in our software in several forms. The first is assembly written directly with the asm keyword. The second is built-in functions; for these we can find corresponding instructions on the Kunpeng platform as replacements. For example, for the prefetch instructions used on x86, we can find a built-in function on the Kunpeng platform to replace them. The third form is intrinsics. Intrinsics are functions provided by GCC that map to assembly instructions and can be used just like C functions. The intrinsics available on x86 and on Arm64 are very different.
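The first two forms can be sketched as follows, assuming GCC or a compatible compiler. The names `cpu_relax` and `sum_with_prefetch` and the prefetch distance of 16 elements are illustrative assumptions. The inline assembly needs one variant per architecture, while `__builtin_prefetch` is a GCC built-in that the compiler lowers to the appropriate instruction on each platform.

```c
#include <stddef.h>

/* Form 1: inline assembly. A spin-wait hint whose instruction differs
 * per platform, so each architecture gets its own branch. */
static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("pause" ::: "memory");
#elif defined(__aarch64__)
    __asm__ __volatile__("yield" ::: "memory");
#else
    __asm__ __volatile__("" ::: "memory");   /* compiler barrier only */
#endif
}

/* Form 2: a compiler built-in. Sums an array while prefetching ahead;
 * no per-platform branch is needed in the source. */
long sum_with_prefetch(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16], 0, 1);  /* read, low temporal locality */
        total += data[i];
    }
    return total;
}
```

Where a built-in exists, preferring it over hand-written assembly reduces the amount of code that has to be ported at all.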

On the x86 platform the total number of intrinsics is close to 7,000, while on Kunpeng there are far fewer. Why? Because x86 supports a larger instruction set that has evolved over the last 20 or 30 years. It has the MMX instruction set, the SSE instruction set, and AVX, and AVX itself comes in 128-bit, 256-bit, and 512-bit variants. Each family has a very large number of intrinsics, so the porting workload is very large. Among them we can find correspondences and substitutions, for example for 128-bit operations.
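As a sketch of such a correspondence for a 128-bit operation, the SSE intrinsic `_mm_add_ps` and the Arm NEON intrinsic `vaddq_f32` both add four floats at once. The wrapper name `add4f` is hypothetical, and the scalar branch is only a fallback for targets with neither header.

```c
#if defined(__aarch64__)
#include <arm_neon.h>
#elif defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Adds four floats element-wise: out[i] = a[i] + b[i] for i in 0..3. */
void add4f(const float *a, const float *b, float *out)
{
#if defined(__aarch64__)
    float32x4_t va = vld1q_f32(a);            /* load 4 floats into a 128-bit register */
    float32x4_t vb = vld1q_f32(b);
    vst1q_f32(out, vaddq_f32(va, vb));        /* NEON counterpart of _mm_add_ps */
#elif defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);              /* unaligned 128-bit load */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
#else
    for (int i = 0; i < 4; i++)               /* portable scalar fallback */
        out[i] = a[i] + b[i];
#endif
}
```

Hiding the intrinsic behind a small wrapper like this keeps the callers unchanged when the platform switches.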

For the C/C++ problems mentioned above, we provide several tools. The first is the analysis and scanning tool, which identifies the dependencies of the software to be migrated and helps users do compatibility checks. The second is the code migration tool, which scans the build files of the source project as well as the C/C++ source code and assembly code and provides migration guidance. The third is the performance optimization tool. After porting software to the Kunpeng platform, we use it to analyze performance and find hot spots. We also provide acceleration libraries based on the Kunpeng platform, which let software and hardware work together to accelerate user applications.

For example, we have optimized the glibc base runtime, compression, encryption and decryption, and some mathematical calculations in open source and third-party components, and we have optimized some IPP signal processing routines. Combining software and hardware has greatly improved their performance.

During analysis and scanning, the user uploads the software to our tool environment, which analyzes its x86 installation packages, such as RPM packages, JAR files and Java programs, and compressed archives. We scan the contents of the package and the software installation path, including the bundled .so files and binaries, and check whether each is supported on the new platform and on the different operating systems. The user then gets a compatibility analysis report that says, file by file, which .so files are compatible and how to handle the incompatible ones. For those, the report provides a link to the source code or a link to the migration guide.
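One basic check behind such a scan can be sketched by reading the `e_machine` field of an ELF header to see which architecture a .so file or binary was built for. This is only an illustration under stated assumptions, not how the Huawei tool is implemented. The function names are hypothetical; the field offsets and the values 62 (x86-64) and 183 (AArch64) come from the ELF specification.

```c
#include <stddef.h>

enum { ELF_EM_X86_64 = 62, ELF_EM_AARCH64 = 183 };  /* e_machine values */

/* Returns the e_machine field of an ELF header held in buf, or -1 if buf
 * is shorter than 20 bytes, lacks the ELF magic, or has an unknown byte order. */
int elf_machine(const unsigned char *buf, size_t len)
{
    if (len < 20)
        return -1;
    if (buf[0] != 0x7f || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
        return -1;
    /* Byte 5 (EI_DATA) gives the byte order: 1 = little endian, 2 = big endian.
     * e_machine is the 16-bit field at offset 18. */
    if (buf[5] == 1)
        return buf[18] | (buf[19] << 8);
    if (buf[5] == 2)
        return (buf[18] << 8) | buf[19];
    return -1;
}

/* 1 if the header describes an AArch64 binary, the kind Kunpeng runs. */
int elf_is_aarch64(const unsigned char *buf, size_t len)
{
    return elf_machine(buf, len) == ELF_EM_AARCH64;
}
```

A caller would read the first 20 bytes of each file found in the package and flag every x86-64 binary as needing an Arm64 replacement.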

The tool can be used in two ways: through the command line, with parameters passed as input, or through the web interface. After the package dependency analysis and the source code scan, it produces a migration analysis report in CSV or HTML format that users can download. The report lists in detail which dependent libraries and binary files need to be migrated, which C/C++ and assembly code needs to be migrated, and how much of it there is. The user is also given a migration workload estimate, for example in person-months.

The user can enter the calculation criteria. For example, if you have strong coding skills, you might port 800 lines of C/C++ code or 600 lines of assembly code in a month, right? If your porting and coding skills are limited, you can set it to, say, 300 lines of C/C++ code and 100 lines of assembly code a month. The tool calculates your porting workload according to these criteria, which gives you the information you need for the first step of the engineering work.
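The workload arithmetic described above can be sketched as below. The talk does not give the tool’s exact formula, so the simple sum of lines divided by monthly throughput is an assumption, and the function name `estimate_person_months` is hypothetical.

```c
/* Estimated effort in person-months, given the number of lines to port and
 * a developer's monthly throughput for C/C++ and for assembly.
 * Returns -1.0 if either rate is not positive. */
double estimate_person_months(long c_lines, long asm_lines,
                              double c_lines_per_month,
                              double asm_lines_per_month)
{
    if (c_lines_per_month <= 0.0 || asm_lines_per_month <= 0.0)
        return -1.0;
    return (double)c_lines / c_lines_per_month +
           (double)asm_lines / asm_lines_per_month;
}
```

For instance, 1,600 lines of C/C++ and 600 lines of assembly at rates of 800 and 600 lines per month come to 2 + 1 = 3 person-months under this assumed formula.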

To summarize the main functions I have described: checking .so files, checking the build project, checking source files, evaluating compatibility, and estimating the workload, all available in two ways, through the web interface and through the command line.

With this tool we get first-hand data on the amount of work a migration involves and can then decide whether to migrate. Once we have decided to migrate, we can use the code migration tool for further analysis. It mainly analyzes the user’s source code. As before, it focuses on the Makefile and the C/C++ source code, including compiler-provided and user-defined macros, built-in functions, intrinsics, and assembly code. After the analysis it provides a detailed migration guide: how to modify the Makefile, how to change the C/C++ code, and how to modify the assembly code.

Here we only give advice; we do not modify the user’s original code. Users can refer to the report, compare the suggestions with their code, and then make the changes outside the tool with a third-party tool, such as another editor.

This page shows the general workflow of the code migration tool. Again there are two modes to choose from, the web interface and the command line. We analyze the user’s source build project, the build configuration files, and the C/C++ or assembly source code, and then give porting suggestions. The suggested source changes are displayed side by side: the first column on the left is the list of files to modify, the second shows what the original file looks like, and the third shows what we propose it should look like after modification.

That is what our code migration tool can do. So far it gives recommendations only for the compiled languages C and C++, and it needs the source code. Without source code, there is no migration.

So far we have covered software dependency analysis with the Huawei development kit and C/C++ porting. After porting, we run the software in the production environment and may need to do performance analysis. For this we provide a performance analysis tool, which helps users locate performance problems, such as bottlenecks or places they want to optimize further. With it, users can analyze processor-related indicators and scheduling information, as well as peripheral information covering the CPU, disk, and network card and other runtime data, to analyze the performance of C, C++, or Java programs.

For Java programs, we do not just treat the JVM as a single process; we look inside the JVM, which is more useful. We collect all the data and analyze it with a mathematical model we defined ourselves to find the performance bottlenecks in the user’s software, such as resource contention, scheduling problems, or even bugs that cause busy loops. We present the results in several ways. For example, with the commonly used flame graph we provide an intuitive visualization that helps users see whether their software has a problem.

Here is a list of the performance indicators our current performance analysis tool provides. At the hardware level we can see the CPU, memory, disk, and network card. At the system level we can see scheduling of processes and threads, context switches between them, and contention for resources such as locks on key variables. We also provide flame-graph-based in-depth inspection of the code logic, which shows where the real overhead in the user’s code is and maps it back to the corresponding source code.

By these means we help developers quickly locate the bottlenecks in their compiled software. Once the bottlenecks are located, we offer some additional capabilities. For example, we provide an accelerator library, which combines software and hardware to help users optimize their code. How is this possible? Mainly because our chip is an SoC, a system on a chip.

In addition to the TaiShan cores, of which there are up to 48 or 64, the chip provides additional acceleration engines. These engines support, for example, LZ77 compression and encryption, both asymmetric and symmetric, including commonly used algorithms such as DH.

We also support algorithms commonly used by storage software, which we have built into the accelerator. Using the accelerator is very simple, just like using a peripheral: we only need to get the corresponding hardware driver from Huawei’s website and install it, and then we can use it like a normal peripheral.

Of course, to use the APIs we provide, you have to follow the user manual, and you may have to modify your own source code. For example, the software may originally call its own functions or third-party APIs; to switch to the accelerator, the code logic has to be changed according to our API, but the change is confined to the API level.

For example, we have an acceleration engine for RSA, which is used for asymmetric encryption. We support key lengths from 1024 to 4096 bits: 1024, 2048, 3072, and 4096. In our accelerator engine we use a user-mode library as an isolation layer between users, such as open source third-party software, and the hardware. Take OpenSSL, whose API is shown here: we plug in under the OpenSSL API, and we can also expose our own API. The driver for the engine sits just below this library. Users do not need to know the implementation details underneath. By correctly calling the user-mode library provided for Kunpeng RSA, they can use the hardware computing capability of our accelerator and greatly speed up RSA calculations.

We also know that RSA computations are quite laborious and time-consuming on a CPU. For example, a mid-to-high-end x86 core might only perform about 720 RSA-2048 operations per second. If you use the RSA engine provided by Kunpeng 920, throughput improves greatly, and the CPU that was used to calculate RSA is freed to run the business. Completing such services within one chip gives users another option. They do not need to buy PCIe cards; they improve performance directly through software, which is a relatively simple way. This is an example of what our porting tools do: they let developers use these capabilities directly as part of software porting.

This is our release strategy for the tools. So far we are at the middle column. We have completed adaptation for multiple operating systems: for example, we support CentOS 7.4, 7.5, 7.6, 7.7, and 7.8, as well as NeoKylin, SUSE, and others, and we try to cover as many commonly used operating systems as possible. We also support multiple versions of GCC, from 4.8.x up to at least 8.3 so far, and we will support later 9.x versions. We support the make and CMake build tools, and in the future we will support checks for build tools such as Automake.

We support porting C/C++ code and also recognizing assembly code. As I just said, given the number of assembly instructions and the number of intrinsics, the volume is very large and replacing assembly is technically challenging, so we will improve this part gradually. For acceleration, we provide replacements for some intrinsics, such as AVX or SSE ones.

We have also optimized some commonly used third-party components for acceleration, such as zlib, and operations such as character scanning, which we optimized with new instructions. The performance gains are considerable: 50%, two times, three times, or more, in one case four times, so the acceleration effect is quite significant. This lets users run faster and better applications on the Kunpeng platform.

You can download these tools from Huawei’s official website or from the Huawei Kunpeng community; links are provided there.

For our acceleration library software, we mainly follow an open source strategy. For components such as glibc and third-party components, including compression algorithms and compression engines, we push the corresponding patches to the community. The hardware acceleration engine can be downloaded directly from Huawei’s Kunpeng community and then installed and used, which is relatively convenient.

The Kunpeng community will become a bridge for communication and interaction between Huawei and developers. In this community you can download hundreds of software migration guides and related tuning guides, and you can interact with other developers for further technical discussion. Many new technical materials and documents, including white papers and product design proposals, will be published in the community, so different developers can each find the information they need.

Listed here are the tools and how to get them from the Kunpeng developer community. The tools are now online. The first version was released on September 30, and since then we have released a new version every month. This rhythm will continue into 2020, which means this is not a short-term effort. We will always start from developers’ needs and make the tools more practical and convenient, to help users migrate 90% of their C/C++ code with tools.

As for the Kunpeng development platform: Kunpeng is an Arm platform, so it is a new thing compared with x86, right? As the Kunpeng computing platform expands, there are more and more applications, which requires a large number of developers to take part in building the platform’s ecosystem. Huawei has therefore launched a series of skill-building programs, including online courses, cloud-based labs, online certification, and offline training. We hope everyone will take an active part in building the Huawei Kunpeng software ecosystem.

One thing worth mentioning is Huawei’s certified development engineer program, the HCIA certification, which is of great value to both Huawei and developers. After you pass the certification, it can to some extent become a fast track to joining Huawei to work on software development.

So keep an eye on the relevant training and certification information and find a direction that suits you. Then, on a larger stage, we can build the Huawei Kunpeng software ecosystem together and make it better and better.
