Migrating a C++ project to wasm

For business reasons, we needed to compile our existing C++ code to WebAssembly (wasm). This article documents some of the problems we ran into during compilation and how we solved them. Many languages can be compiled to wasm, and the official website lists some of the languages that support it.

Among them, C/C++, Rust, and C# are relatively mature, as are their toolchains. For a write-up of compiling Rust to wasm, see zhuanlan.zhihu.com/p/38661879. Because our existing project is written in C++, this article focuses on compiling C++ to wasm.

Project introduction

This project is a cross-platform project similar to RN or Weex, whose upper-level DSL is mini programs. A compilation tool (a Node CLI) encodes the code written for a mini program into a binary (containing the vDOM and style information for the first screen, plus the business JS code) and delivers it dynamically to the client. The client decodes and renders the delivered binary and dynamically binds it to the business JS code.

Because encode and decode share a lot of code (including an embedded mini JS engine), both are implemented in C++. The client SDKs (iOS and Android) depend on the C++ source directly. The Node CLI needs the C++ code compiled into a dynamic library (.so on Linux, .dylib on macOS, .dll on Windows), which the Node layer then calls across languages through ffi (github.com/node-ffi/no…). Although ffi let us call the C++ code from Node successfully, some problems remained:

Early versions of node-ffi were limited in the Node versions they supported and did not work above Node 12 (we later switched to ref-napi to resolve the compatibility issue).
A crash in the C++ code crashes the whole Node process, which hurts the stability of Node on the server side.
There are basically two ways to publish a Node CLI that works on multiple platforms, and both have drawbacks:
Compile the dynamic library on the user's machine with node-gyp. Because the C++ code has requirements on the standard library and language version, this places requirements on the user's environment.
Compile the dynamic library for every platform when the CLI is released. Every release then has to build on three platforms, which in practice means running our own gitlab-runner on three systems. The company's internal gitlab-runner only supports Linux by default, so we would have to build a mature multi-platform GitLab CI/CD pipeline ourselves, which is not easy, and it still would not cover developers who need to publish versions locally.

Dynamic libraries work well on Node, Android, and iOS, but there is no way to load and execute a dynamic library on the Web, which prevents us from moving the build process to the Web.
Even when we ship prebuilt dynamic libraries so that users do not compile them locally, calling them still relies on ref-napi for the C++-to-JS binding. That library has to be compiled on the user's machine (it relies on node-gyp, which in turn relies on Xcode). wasm does not need a C++ environment such as Xcode, so users no longer depend on a C++ toolchain.

Because of these limitations, we tried compiling the C++ code to wasm. Besides good performance, wasm is highly portable, which fits our needs well:

wasm is independent of the operating system and of the Node version, so we compile once and run on Linux, macOS, Windows, and other operating systems without building a dynamic library for each. Node 8 and later support wasm, so there is no need to worry about Node version compatibility.

wasm can also run on the Web, which lets us explore Web-based compilation later.
wasm runs in a sandbox, so an error during execution does not crash the host process.

So we tried migrating the C++ module from a dynamic library to wasm.

Wasm compilation and running process

Consider the following simple C program

// hello.c
#include <stdio.h>

int main() {
    printf("hello world\n");
    return 0;
}

Compile to executable and execute

$ clang hello.c -o hello  
$ ./hello

Unfortunately, the binary compiled above only runs on the OS and CPU instruction set it was built for. A binary built for 64-bit Linux will not run on 32-bit Linux, let alone on macOS or Windows.

One of the first questions we faced when compiling to wasm was how to handle system calls. A main reason the compiled output above is hard to run across platforms is that each operating system implements system calls differently, so different code must be generated for each one. A natural solution is to compile the system calls against a runtime that is already cross-platform. Fortunately, several such runtimes exist:

Browser
Node.js
WASI

Take the browser as an example: JS's console.log is effectively a cross-platform system call that works on every operating system. So all we need to do is compile the C code above into wasm plus JS glue code, where the glue code adapts system calls to the JS APIs provided by the browser. The process is shown in the figure below.

Node.js is handled like the browser, except that the JS glue code adapts to the APIs provided by Node rather than the browser. Here is how Emscripten compiles the code above:

$ emcc hello.c -o hello.js
$ node hello.js
hello world

The generated code can also be run in the browser.

Both uses share a drawback: the generated wasm relies on APIs injected by the JS glue code, so it cannot run without that glue code. If a third-party environment wants to run the generated wasm without the glue, it has to emulate the APIs that the glue injects. Those APIs are not standardized and change often, which makes it hard for the generated wasm to run reliably in other environments.
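To make this dependency concrete, here is a minimal sketch using the plain WebAssembly JS API. The module bytes are hand-assembled for illustration and are not Emscripten output; the import name env.log and the export name run are hypothetical. The module only works when the host supplies the function it imports, which is the role the glue code plays.

```javascript
// Hand-assembled wasm module (illustrative, not produced by emcc). In text form:
//   (module
//     (import "env" "log" (func $log (param i32)))
//     (func (export "run") (call $log (i32.const 42))))
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // magic + version
  0x01, 0x08, 0x02, 0x60, 0x01, 0x7f, 0x00, 0x60, 0x00, 0x00, // types: (i32)->(), ()->()
  0x02, 0x0b, 0x01, 0x03, 0x65, 0x6e, 0x76,                   // import module "env"
  0x03, 0x6c, 0x6f, 0x67, 0x00, 0x00,                         //   field "log", func, type 0
  0x03, 0x02, 0x01, 0x01,                                     // one local function, type 1
  0x07, 0x07, 0x01, 0x03, 0x72, 0x75, 0x6e, 0x00, 0x01,       // export "run" = function 1
  0x0a, 0x08, 0x01, 0x06, 0x00, 0x41, 0x2a, 0x10, 0x00, 0x0b, // body: i32.const 42; call 0
]);
const wasmModule = new WebAssembly.Module(bytes);

// "Glue": the host supplies the function that the module imports.
const logged = [];
const glue = { env: { log: (value) => logged.push(value) } };
const instance = new WebAssembly.Instance(wasmModule, glue);
instance.exports.run(); // the module calls back into the host with 42

// Without the import, instantiation fails before any code runs.
let linkError = null;
try {
  new WebAssembly.Instance(wasmModule, { env: {} });
} catch (err) {
  linkError = err;
}

console.log(logged, linkError instanceof WebAssembly.LinkError);
```

A real Emscripten build imports many more functions than this, so a host that wants to run it without the generated glue has to reproduce all of them.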

Standardized WASI

To solve this problem, a standard API for wasm was defined (WASI). wasm built against it does not need JS glue code, and any runtime that implements WASI can load it. wasm is then essentially independent of JS: it can run in a standalone sandbox and reach system APIs through WASI. This has driven the development of wasm runtimes. Not only can many languages compile to wasm, wasm runtimes themselves can be implemented in many languages, so wasm can run outside the browser and Node and can even be embedded in mobile SDKs. Runtimes that currently support WASI include:

Wasmtime, Mozilla's WebAssembly runtime
Lucet, Fastly's WebAssembly runtime
A browser polyfill
node@14 with the --experimental-wasi-unstable-preview1 flag

emcc can currently generate WASI-compatible output, so this time we compile the hello-world code above with WASI support:

 $ emcc hello.c -o hello.js -s STANDALONE_WASM

The generated wasm does not depend on the generated JS glue code, so we can execute it with any WASI-enabled runtime. Here we use wasmtime to execute it:

We can also execute the above code through Node’s WASI functionality

const fs = require('fs');
const { WASI } = require('wasi');

const wasi = new WASI({
  args: process.argv,
  env: process.env,
});
const importObject = { wasi_snapshot_preview1: wasi.wasiImport };

(async () => {
  const wasm = await WebAssembly.compile(fs.readFileSync('./hello.wasm'));
  const instance = await WebAssembly.instantiate(wasm, importObject);
  wasi.start(instance);
})();

The result is as follows

We found that the code above did not need to handle any system call bindings, all thanks to WASI support.

If the code is not compiled in STANDALONE_WASM mode and we run it with a WASI runtime, we get an error:

This is because the generated wasm relies on APIs injected by the JS glue code.

The migration process

The build process

The process of compiling C++ to wasm is shown in the figure below. Basically it is: C++ -> LLVM bitcode -> wasm + JS (glue) or standalone wasm.

For simple C++ projects we can call emcc directly to compile C++ to wasm, but larger projects use build tools such as CMake. Fortunately, Emscripten integrates well with CMake, so we only need to make the following substitutions:

cmake => emcmake cmake
make => emmake make

Then build the project the same way as before with CMake. The LLVM bitcode produced by the build can be further compiled to wasm with emcc. A complete compilation looks like this:

cd build && emcmake cmake ..
emmake make
emcc lib.a -o lib.js   # generates lib.wasm and lib.js

Here are some of the details that need to be addressed during compilation.

Ensuring functions are exported

C++ name mangling

To support function overloading, C++ mangles function names by default (even for functions that are not overloaded). Just as with the traditional approach of compiling C++ into a dynamic library and calling its exported functions from JS through ffi, using C++ functions from JS in Emscripten requires exporting them. Because of mangling, the function names we write differ from the names that are actually exported. The following code is an example:

#include <stdio.h>

int myadd(int a, int b) {
    int res = a + b;
    res = res + 2;
    return res;
}

int main() {
    int res = myadd(1, 2);
    printf("res: %d\n", res);
}

When we compile with clang++ and inspect the exported symbol names with nm, we see the following:

The myadd function is now named __Z5myaddii, which is very unfriendly to JS callers, so we need to turn off C++ name mangling. extern "C" prevents C++'s default name mangling:

#include <stdio.h>

extern "C" {

int myadd(int a, int b) {
    int res = a + b;
    res = res + 2;
    return res;
}

int main() {
    int res = myadd(1, 2);
    printf("res: %d\n", res);
}

}

So when we look at the symbol table again, myadd becomes _myadd, so that the JS side can refer to myadd by _myadd.
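To see where a name like _Z5myaddii comes from, here is a deliberately simplified sketch of the Itanium C++ ABI mangling rule for free functions with builtin parameter types. The helper name mangleSimple is hypothetical, and the sketch ignores namespaces, templates, pointers, and substitutions, so it is not a replacement for a real demangler. The extra leading underscore seen in the nm output above is added by the platform's symbol naming, not by the mangling itself.

```javascript
// Type codes for a few builtin types in the Itanium C++ ABI.
const TYPE_CODES = new Map([
  ['void', 'v'], ['bool', 'b'], ['char', 'c'], ['int', 'i'],
  ['long', 'l'], ['float', 'f'], ['double', 'd'],
]);

// Hypothetical helper: mangles `name(paramTypes...)` for the simple cases only.
function mangleSimple(name, paramTypes) {
  const params = paramTypes.length === 0
    ? 'v' // an empty parameter list is encoded as void
    : paramTypes.map((type) => {
        if (!TYPE_CODES.has(type)) throw new Error(`unsupported type: ${type}`);
        return TYPE_CODES.get(type);
      }).join('');
  // _Z prefix, then the length-prefixed name, then the parameter codes.
  return `_Z${name.length}${name}${params}`;
}

console.log(mangleSimple('myadd', ['int', 'int']));       // _Z5myaddii
console.log(mangleSimple('myadd', ['double', 'double'])); // _Z5myadddd
```

Because the parameter types are part of the name, two overloads of myadd get different symbols, which is exactly what extern "C" gives up in exchange for a predictable name.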

emcc applies various optimizations to the C++ code to reduce the size of the generated wasm. Some of them, including DCE and function inlining, can make exported C++ functions unreachable from JS.

DCE

Emscripten uses dead code elimination (DCE) to keep the generated wasm as small as possible. To make sure a function is not discarded by DCE, the compiler must be told not to delete it. emcc uses EXPORTED_FUNCTIONS for this:

emcc -s "EXPORTED_FUNCTIONS=['_main', '_my_func']" ...

The default value of EXPORTED_FUNCTIONS is _main, which is why our main was not removed. main is essentially the same as any other function, so if we set EXPORTED_FUNCTIONS ourselves and want to keep main, _main has to be listed as well.

Function inlining

To reduce function call overhead at runtime, Emscripten may, in addition to DCE, inline some functions. Inlining can also prevent a function from being exported properly. EMSCRIPTEN_KEEPALIVE can be used to make sure a function is not inlined away:

inline void EMSCRIPTEN_KEEPALIVE yourCfunc() { .. }

C++ and Javascript interoperability

JavaScript and C++ have completely different type systems, and Number is their only intersection. When JavaScript and C++ call each other, what is exchanged is therefore always a Number. To pass any other type, we first have to convert it to a Number. Fortunately, Emscripten wraps some helpers to simplify passing arguments between C++ and JavaScript: allocateUTF8 converts a JS string into a number (a pointer into wasm memory), and UTF8ToString converts such a number back into a JS string, as shown below.

const s1 = 'hello';
const s2 = 'world';
const res = Module._concat_str(Module.allocateUTF8(s1), Module.allocateUTF8(s2));
console.log('res:', Module.UTF8ToString(res)); // 'hello world'
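The idea behind such string helpers can be sketched without Emscripten: the string is encoded as UTF-8 bytes into linear memory, and only its address (a number) crosses the boundary. The sketch below is an illustration under assumptions, not Emscripten's implementation: writeCString, readCString, and the bump allocator are hypothetical names, and a standalone WebAssembly.Memory stands in for the module's heap.

```javascript
// Stand-in for the module's linear memory: one 64 KiB page.
const memory = new WebAssembly.Memory({ initial: 1 });
const heap = new Uint8Array(memory.buffer);

// Hypothetical bump allocator; real code would call the module's malloc.
let nextFree = 8; // keep address 0 unused so it can mean NULL

// Copies a JS string into memory as NUL-terminated UTF-8 and returns its address.
function writeCString(str) {
  const encoded = new TextEncoder().encode(str);
  const ptr = nextFree;
  if (ptr + encoded.length + 1 > heap.length) throw new RangeError('out of memory');
  heap.set(encoded, ptr);
  heap[ptr + encoded.length] = 0; // NUL terminator
  nextFree = ptr + encoded.length + 1;
  return ptr;
}

// Reads a NUL-terminated UTF-8 string starting at the given address.
function readCString(ptr) {
  let end = ptr;
  while (heap[end] !== 0) end += 1;
  return new TextDecoder().decode(heap.subarray(ptr, end));
}

const ptr = writeCString('hello');
console.log(typeof ptr, readCString(ptr));
```

Since the string is copied into memory owned by the module, strings allocated this way in real code also have to be freed on the C++ side or through the module's free.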

Emscripten takes it a step further by encapsulating two functions for parameter type conversions, cwrap and ccall

Thus the above code can be simplified as

const s1 = 'hello';
const s2 = 'world';
const res = Module.ccall('concat_string', 'string', ['string', 'string'], [s1, s2]);
console.log('res:', res);

If the function needs to be called multiple times, we can use cwrap to wrap the function once, which can be called multiple times

const concat = Module.cwrap('concat_string', 'string', ['string', 'string']);
const r1 = concat(s1, s2); // 'hello world'
const r2 = concat(s2, s1); // 'world hello'

Note that these Emscripten runtime functions are not exported by default. To use them, export them at compile time through EXTRA_EXPORTED_RUNTIME_METHODS:

emcc -s "EXTRA_EXPORTED_RUNTIME_METHODS=['cwrap','ccall']" hello.c -o hello.js   # export cwrap and ccall

Modularization

Emscripten's default execution environment is the browser, so what it exports is a global Module object, and loading is asynchronous. The complete Module is only available inside the onRuntimeInitialized callback, which is where the exported methods can safely be called.

In the browser

<!doctype html>
<html>
<head>
  <meta charset="utf-8">
  <title>Emscripten:Export1</title>
</head>
<body>
  <script>
    Module = {};
    Module.onRuntimeInitialized = function () {
      // At this point the Module object is complete
      console.log(Module._show_me_the_answer());
      console.log(Module._add(12, 1.0));
    }
  </script>
  <script src="export1.js"></script>
</body>
</html>

In Node.js

Emscripten also provides a modular export mode, which exports a function that returns a promise:

emcc -s MODULARIZE=1 hello.cc -o hello.js   # export a function that returns a promise

With this, using Module becomes straightforward:

const _loadWasm = require('./hello.js');

async function main() {
  const Module = await _loadWasm();
  return Module._add(1, 2);
}

File system processing

The C++ code in the project uses many system APIs, mostly file IO. We originally thought wasm could not support file IO, but Emscripten wraps it well: it provides a virtual file system that adapts file IO to the different environments.

At the lowest level, Emscripten provides three file systems:

MEMFS: data is stored entirely in memory. This is similar to what webpack does: a set of file system operations is emulated in memory, and files written at run time are not persisted locally.
NODEFS: the Node.js file system. It can access the local file system and persist files, but only works in a Node.js environment.
IDBFS: the IndexedDB file system. It is based on the browser's IndexedDB and can persist data, but only works in a browser environment.

We first tried using NODEFS to handle this, but the early NODEFS had a big limitation: before using the local file system, you had to mount the local folder that you wanted to operate on.

#include <stdio.h>
#include <emscripten.h>

void setup_nodefs() {
    // Mount the current folder to /data
    EM_ASM(
        FS.mkdir('/data');
        FS.mount(NODEFS, {root: '.'}, '/data');
    );
}

int main() {
    setup_nodefs();

    FILE *fp = fopen("/data/nodefs_data.txt", "r+t");
    if (fp == NULL) fp = fopen("/data/nodefs_data.txt", "w+t");
    int count = 0;
    if (fp) {
        fscanf(fp, "%d", &count);
        count++;
        fseek(fp, 0, SEEK_SET);
        fprintf(fp, "%d", count);
        fclose(fp);
        printf("count:%d\n", count);
    } else {
        printf("fopen failed.\n");
    }
    return 0;
}

This works, but it would have required major changes to our existing code. Emscripten addresses this with NODERAWFS=1, which operates on the Node.js file APIs directly without mounting a file system, so the original code does not need to change:

 emcc -s NODERAWFS=1 hello.c -o hello.js


Handling memory OOM

When we first ran the wasm compiled from our C++ code, we found a serious memory leak. Investigation showed that the JS glue code generated by Emscripten includes some exception handling code by default.

Each time the JS glue code runs, it binds an event handler whose closure captures the allocated buffer, so more and more buffers stay captured and memory leaks. Emscripten has a similar issue: github.com/emscripten-…

Fortunately, Emscripten also provides a way to disable this behavior: github.com/emscripten-…

emcc build/liblepus.a -s NODEJS_CATCH_EXIT=0 -s NODEJS_CATCH_REJECTION=0   # disable Node.js exception catching

This avoids re-binding the uncaughtException and unhandledRejection handlers on every execution, but other events may still be bound repeatedly. So we also need to make sure the JS glue code is executed only once:

const _loadWasm = require('./js_glue.js');

let task = null;

// Run _loadWasm only once, even when called concurrently
function loadWasm() {
  if (!task) {
    task = _loadWasm();
  }
  return task;
}

async function encode() {
  const wasm = await loadWasm();
  return wasm.encode('src', 'dist');
}

module.exports = { encode };
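The key detail of this pattern is that the promise itself is cached, not the resolved module, so callers that arrive while initialization is still in flight share the same load. The following self-contained sketch checks that behavior. fakeLoadWasm is a hypothetical stand-in for the generated factory, and its encode method is invented for the example.

```javascript
// Hypothetical stand-in for the Emscripten-generated factory function.
let factoryCalls = 0;
function fakeLoadWasm() {
  factoryCalls += 1;
  return Promise.resolve({ encode: (src, dist) => `${src} -> ${dist}` });
}

let task = null;
function loadWasmOnce() {
  // Cache the promise, not the resolved module, so that concurrent callers
  // arriving before initialization finishes still share a single load.
  if (!task) {
    task = fakeLoadWasm();
  }
  return task;
}

async function encode(src, dist) {
  const wasm = await loadWasmOnce();
  return wasm.encode(src, dist);
}

// Two "concurrent" calls: both start before the first load has resolved.
const pending = Promise.all([encode('a', 'b'), encode('c', 'd')]);
pending.then((results) => console.log(results, 'factory calls:', factoryCalls));
```

Caching only the resolved module would not be enough, because a second caller could start another load before the first one finishes.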

By default Emscripten allocates 16 MB of memory for wasm. Sometimes this is not enough, which can also cause OOM. There are two solutions:

Use -s INITIAL_MEMORY=X to start with a larger memory.
Use -s ALLOW_MEMORY_GROWTH=1 to let wasm grow its memory dynamically as needed.
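At the level of the WebAssembly JS API, the two options correspond to choosing a larger initial size and to growing the memory at run time. The sketch below uses a standalone WebAssembly.Memory rather than an Emscripten build, and the page counts are illustrative assumptions (256 pages of 64 KiB equal the 16 MB default mentioned above).

```javascript
const PAGE_SIZE = 64 * 1024; // wasm memory is sized in 64 KiB pages

// 256 pages = 16 MiB initially, allowed to grow up to 512 pages = 32 MiB.
const memory = new WebAssembly.Memory({ initial: 256, maximum: 512 });
const before = memory.buffer;
const initialBytes = before.byteLength;

// grow() takes a number of pages and returns the previous size in pages.
const previousPages = memory.grow(256);
const grownBytes = memory.buffer.byteLength;

// Growing detaches the old ArrayBuffer, so typed-array views must be re-created.
const oldBufferDetached = before.byteLength === 0;

// Growing past the declared maximum fails instead of allocating more.
let growError = null;
try {
  memory.grow(1);
} catch (err) {
  growError = err;
}

console.log(initialBytes / PAGE_SIZE, previousPages, grownBytes / PAGE_SIZE,
  oldBufferDetached, growError instanceof RangeError);
```

The detached buffer is the practical cost of growth: any JS code holding a view on the old buffer has to refresh it after memory grows.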

Debugging

Wasm debugging

The latest versions of Chrome and Firefox already support debugging wasm itself.

Although breakpoints can be set in wasm, this assembly-level debugging is not enough for complex applications. We would rather debug at the source level.

Sourcemap debugging

Fortunately, Emscripten already supports source map debugging, so that while code executes we can locate the corresponding source position.

$ emcc -g4 hello.cc --source-map-base / -o index

As the figure below shows, we successfully hit a breakpoint in the C++ source.

This approach still has limitations, however. A source map only maps lines of code; it does not map C++ variables to wasm register variables, so for complex applications source map debugging is still inadequate.

DWARF debugging

Besides source maps, which map between source code and compiled code, there is DWARF, a more general debugging data format that is widely used by systems programming languages such as C and C++. It provides code location mapping, variable name mapping, and other information needed for debugging. Emscripten can already emit DWARF information for the generated wasm.

$ lldb -- wasmtime -g hello.wasm

We can clearly see that wasm is mapped to the C++ code and that variables are mapped to C++ variables.

LLDB's DWARF debugging for wasm relies on the LLVM JIT feature, which is turned off by default in LLDB on macOS (it is turned on by default in LLDB on Linux, and GDB's JIT feature is also enabled on macOS). We therefore need to enable it manually on macOS by adding settings set plugin.jit-loader.gdb.enable on to .lldbinit.

We can go further and debug wasm programs with the CodeLLDB plugin for VS Code, which also requires the JIT setting. Simply configure the LLDB initCommands in settings.json.

The debugging effect is as follows

We can debug wasm applications with LLDB, but LLDB cannot be used inside a browser. Fortunately, browsers are starting to support DWARF debugging for wasm, and the latest Chrome lets you turn on DWARF debugging as an experimental feature.

I tried it briefly, but it still seems to be buggy: variable mappings are not handled and register variables are still displayed.

Node's debugging support for wasm is still limited, and breakpoints do not take effect.

Conclusion

The actual migration was much easier than expected. Emscripten's toolchain is very complete and had a solution for most of the problems we hit. In fact, we made no changes to the C++ code during the migration, only some changes to the build tooling. This greatly expands what the front end can do: our third-party libraries are no longer limited to npm, and we can compile many C++ libraries to wasm for our own use.
