LLVM Infrastructure and Rust


LLVM is a compiler back-end engine used by many programming languages, including C, C++, Rust, Go, and Swift. This blog post is about LLVM, and I will explore the following topics:

  • What is LLVM infrastructure
  • How does LLVM work
  • LLVM program structure
  • LLVM and Rustc

What is LLVM infrastructure

The LLVM infrastructure is a collection of modular and reusable compiler and toolchain technologies. LLVM contains several subprojects, such as Clang (a C/C++/Objective-C compiler), the LLDB debugger, libc++, and others. Decades of research and refinement have produced amazing tools for building languages and compilers, and many modern technologies take advantage of them.

The LLVM project was started in 2000 by Vikram Adve and Chris Lattner at the University of Illinois at Urbana-Champaign. The name LLVM was originally an acronym for Low Level Virtual Machine. Although the modern LLVM infrastructure has become virtual-machine agnostic, the name remains the same: the meaning behind the abbreviation was dropped, and LLVM became the full name of the project.

The LLVM project is released under an open source license. Lattner was later hired by Apple, and his team was assigned to work on LLVM to meet the needs of Apple products. LLVM then expanded its capabilities and grew into a comprehensive project combining LLVM IR, the LLVM debugger, and the LLVM C++ API library.

The LLVM compiler framework is modular and reusable, using the LLVM infrastructure to provide end-to-end code compilation. It is used to build, optimize, sanitize, and generate intermediate representation (IR) or binary machine code (BC). LLVM allows programming language code to be converted to IR, which can then be converted to machine code for any given hardware architecture. The IR language is independent of both the source and target languages.

LLVM has many uses. Its primary use is as a compiler framework: we provide what is called the front end, which generates IR, and the back end, which translates that IR into the target platform's assembly language, from which binary code is produced by assembling and linking.

LLVM’s intermediate representation (IR) is a low-level programming language similar to assembly. Unlike assembly, however, it provides better type annotations and a more user-friendly syntax. IR also uses an infinite set of virtual registers instead of a predefined number. We will explore the inner workings of IR in a later section.
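To make those two points concrete, here is a tiny hand-written sketch of an IR function (an illustration, not actual compiler output): every value carries a type, and the %-prefixed virtual registers can be numbered without limit.

```llvm
; hand-written sketch, not real compiler output
define i32 @add3(i32 %a, i32 %b, i32 %c) {
start:
  %0 = add i32 %a, %b   ; typed add into virtual register %0
  %1 = add i32 %0, %c   ; fresh register %1 -- no fixed register file
  ret i32 %1
}
```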

How does LLVM work

As mentioned earlier, LLVM is a modular and reusable compiler framework that supports multiple front ends and back ends. The program life cycle involves writing source code and then compiling it into binary code for execution. Our interest is in the compiler stage. When source code is compiled into a binary, it goes through several steps in sequence.

The compilation process can be divided into three parts: front end, optimization, and back end (Figure 1-a). The compilation process starts at the front end. First, the preprocessor organizes the source code. This is the stage where external libraries are expanded into the target source code. In Rust, this is expressed with the use declaration; the similar instruction in C and C++ is the #include directive.

Second comes parsing. At this stage, the code is parsed to find syntax errors and to build an abstract syntax tree (AST). The third step of front-end compilation is IR generation, where the compiler converts the AST into intermediate representation (IR) and outputs the result.

During the optimization phase, the compiler performs various transformations and cleanups on the program. This improves application performance, makes the application more secure by reducing the number of bugs, and does some helper work for later stages. Later we’ll explore IR and look at some of the optimizations that come with the compiler.

After optimization, the program enters the back-end stage. Here, the compiler back end converts the IR to target-specific assembly code. The assembler then converts the target-specific assembly code into target-specific machine code. Finally, the linker combines multiple machine-code files into a single file, which we call an executable.

We already have several language-specific front ends: Clang for C and C++, Gollvm for Go, and Rust’s compiler, rustc. Similarly, after the LLVM IR is optimized, the code is converted by architecture-specific back ends such as x86, ARM, and MIPS.

The part between the front end and back end, called the optimizer, is where the magic of code optimization and cleanup happens. This is done through a series of so-called passes. Passes run one after another, so you can have N passes. Passes can be divided into two groups: analysis and transformation. An analysis pass inspects the IR to derive program properties, while a transformation pass rewrites the IR to optimize the program.

When the source code is converted to LLVM IR, it can take one of three formats (Figure 1-b). The in-memory format is a binary representation used inside the front end during compilation. The other two formats are bitcode and assembly. Bitcode is a binary on-disk storage format, suitable for fast loading because of its higher memory efficiency. Assembly is a text format that is easy for humans to read.

You may be wondering: why use IR/BC instead of native assembly and native binaries? The reason is that native assembly is bound one-to-one to the platform’s machine code; depending on the target architecture, assembly for x86 differs from assembly for ARM, for example. Native assembly is converted to a native binary by an assembler, which LLVM also includes. LLVM ships with a list of passes out of the box, but that’s not all: you can implement your own pass to clean up or optimize the source code.

LLVM program structure

LLVM IR, like any other system, has its own program structure (Figure 2-a). The top-level container is the module, which corresponds to one translation unit of the front-end compiler. A module can contain one or more functions. Each function has one or more basic blocks containing instructions. Each instruction occupies a single line, and IR has a lot of instructions.

We can get the .ll (IR assembly) file from Rust source code by running the following command in the terminal:

 $ rustc someprogram.rs --emit=llvm-ir -o somename.ll 

Here we tell the rustc compiler to generate LLVM IR for someprogram.rs (as specified by the option --emit=llvm-ir) and output it (using the option -o) as somename.ll. We can also use the --emit=llvm-bc flag to generate bitcode instead.

The actual program structure in LLVM IR consists of layered containers. At the top level we have modules, each corresponding to one translation unit of the front-end compiler. Modules can be combined by the LLVM linker, which resolves function definitions during linking. Functions are the next-level container, comprising a function signature and its basic blocks. A basic block is a set of instructions executed sequentially; it has one entrance and one exit. The first block is called the entry block. Finally, IR has instructions: single executable lines (Figure 2-a).

In IR files you will encounter two kinds of variables: local variables, prefixed with the % sign, and global variables, prefixed with the @ sign. LLVM IR can use an unlimited number of temporary registers, rather than the predefined number of registers that native assembly has. IR registers are identified by integers, such as 1, 2, 3, ... N. For example, %2 = load i32, i32* %x means: load the value stored at the address of the local variable x into temporary register 2.
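To make the two prefixes concrete, here is a small hand-written fragment (an illustrative sketch, not output from rustc):

```llvm
@counter = global i32 0                 ; @ prefix: a global variable

define i32 @bump(i32 %by) {
start:
  %old = load i32, i32* @counter        ; % prefix: local values / registers
  %new = add i32 %old, %by
  store i32 %new, i32* @counter
  ret i32 %new
}
```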

Another advantage of LLVM IR is that it uses so-called static single assignment (SSA) form, which enables various optimizations. Unlike a regular variable, which can be reassigned many times, an SSA variable is assigned exactly once. This lets the compiler allocate registers more efficiently, because it is easy to count the live variables at any given point. It can also detect unused variables and keep programs from making unnecessary pointer assignments.

For example, the code x = x * x in SSA form would be x_2 := x_1 * x_1. If we enter a loop where a variable keeps getting new values, SSA uses what is called a phi node. Phi merges the incoming variables into the final value that comes out of the branch or loop. We’ll look at concrete examples and review IR syntax in the next section of this blog post.
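As a sketch of how a phi node merges loop values (hand-written, hypothetical IR rather than rustc output):

```llvm
define i32 @sum_to(i32 %n) {
entry:
  br label %loop

loop:
  ; phi picks 0 when we arrive from %entry, and the updated
  ; value when we come back around from %loop itself
  %i   = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  %sum = phi i32 [ 0, %entry ], [ %sum.next, %loop ]
  %i.next   = add i32 %i, 1
  %sum.next = add i32 %sum, %i.next
  %done = icmp uge i32 %i.next, %n
  br i1 %done, label %exit, label %loop

exit:
  ret i32 %sum.next
}
```

Every name is still assigned exactly once; the phi nodes are what let %i and %sum take a different value on each trip around the loop without breaking the SSA rule.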

LLVM and Rustc

Rustc’s official guide lists all the reasons they use LLVM, which I’ll quote directly:

  • We don’t have to write the entire compiler back end. This reduces the implementation and maintenance burden.
  • We benefit from a number of advanced optimizations that the LLVM project has been collecting.
  • We can automatically compile Rust to any platform supported by LLVM. For example, once LLVM added support for WASM, voilà! Rustc, Clang, and many other languages could compile to WASM! (OK, there are a few extra things to do, but we were 90% of the way there anyway.)
  • We benefit from each other’s compiler projects. For example, when the Spectre and Meltdown security vulnerabilities were found, only LLVM needed to be patched.

Now that we’re familiar with what LLVM IR is, what it does, and why so many compilers build on it, it’s time to get our hands dirty and explore the inner structure of IR using a simple Rust program called simple1.rs:


//simple1.rs

fn square(x: u32) -> u32 {
    x * x
}

fn factorial(mut n: u32) -> u32 {
    let mut acc = 1u32;
    while n > 0 {
        acc *= n;
        n = n - 1;
    }
     acc
}
    
fn main() {
    let s = square(4);
    let f = factorial(s);
}
    

Our program has three functions: main, the entry point of our program; square, which returns the square of a given number; and factorial, which computes the factorial of an integer. We emit the LLVM IR file by calling the following command in the terminal:

 
 $ rustc simple1.rs --emit=llvm-ir 
 

We should now have the simple1.ll file in the same folder as the source code. Let’s open the .ll file in a text editor and start exploring. At the beginning of the file you’ll find a bunch of metadata and other program-related information and directives, but our interest is in how our Rust functions are converted to IR. We’ll start by looking at the square function.

Search for the line with the comment simple1::square; next to it you’ll find the LLVM IR container for that function. It looks something like this:

; simple1::square
; Function Attrs: nonlazybind uwtable
define internal i32 @_ZN7simple16square17h74fdb775aec225a0E(i32 %x) unnamed_addr #1 {
start:
  %0 = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 %x, i32 %x)
  %_4.0 = extractvalue { i32, i1 } %0, 0
  %_4.1 = extractvalue { i32, i1 } %0, 1
  %1 = call i1 @llvm.expect.i1(i1 %_4.1, i1 false)
  br i1 %1, label %panic, label %bb1

bb1:                                              ; preds = %start
  ret i32 %_4.0

panic:                                            ; preds = %start
; call core::panicking::panic
  call void @_ZN4core9panicking5panic17h50b51d19800453c0E([0 x i8]* ...
}

The IR begins by defining a function with a unique mangled name (which includes the name of the original function) that returns an i32 and takes an i32 as an argument. start is the label of the function’s entry point. In the first instruction, IR calls an intrinsic function, @llvm.umul.with.overflow.i32. This function takes two i32 arguments (both x in this case), performs an unsigned multiplication, and produces a tuple-like result whose first element is an i32 and whose second is an i1. The result is assigned to the unnamed temporary %0.

At this point, you may have a few questions. First, why are the square function’s argument and return value i32 instead of u32 as declared in the source code? LLVM IR does not provide separate data types for signed and unsigned values. Instead, it performs operations specific to the value’s signedness: signed operations for signed values, unsigned operations for unsigned values. If our square function took an i32 as a parameter and returned an i32, the IR would call @llvm.smul.with.overflow.i32, the signed multiplication intrinsic.

The next question you might ask is: why does multiplying an integer by an integer return a tuple? Look at the type of the output tuple: the second element is an i1 (a single-bit value), commonly used for binary/Boolean values. The name of the called function indicates that it performs unsigned multiplication with overflow. If an overflow is detected, that is, the product exceeds the maximum unsigned value 32 bits can hold (since we perform unsigned multiplication), it sets the i1 to 1, i.e. true.
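Rust’s standard library exposes the same pattern directly: u32::overflowing_mul returns a (value, overflowed) pair, mirroring the { i32, i1 } tuple the intrinsic produces. A small sketch:

```rust
// u32::overflowing_mul mirrors @llvm.umul.with.overflow.i32:
// it returns the wrapped product plus a bool overflow flag,
// just like the { i32, i1 } tuple in the IR.
fn square_checked(x: u32) -> (u32, bool) {
    x.overflowing_mul(x)
}

fn main() {
    // 4 * 4 fits into 32 bits, so the flag stays false.
    assert_eq!(square_checked(4), (16, false));
    // 65_536^2 = 2^32 wraps to 0 and raises the flag --
    // the case where the IR would branch to %panic.
    assert_eq!(square_checked(65_536), (0, true));
    println!("ok");
}
```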

In the following two instructions, the IR unpacks the result. The unnamed temporary %_4.0 takes the product, and the Boolean result is assigned to %_4.1. Next, IR calls another intrinsic, @llvm.expect.i1(i1 %_4.1, i1 false), which says we expect the value in %_4.1 to be false. The program assigns the result to another unnamed temporary, %1. An unnamed temporary is an unsigned numeric value created by IR with a % or @ prefix (for example, %1). Named values are string identifiers carried over from the source code, with the same prefix rules (for example, %acc).

After that, the IR branches. The br instruction checks whether the value placed in the temporary %1 is true: if so, it jumps to %panic, otherwise to %bb1. The two names are labels of basic blocks. %panic calls Rust’s panic handler (you should know what panic does if you’ve programmed in Rust), and bb1 returns the value stored in the temporary %_4.0 from the function square.

Now let’s look at our factorial function, where something more interesting is going on:

; simple1::factorial
; Function Attrs: nonlazybind uwtable
define internal i32 @_ZN7simple19factorial17hceebae2a8a73b808E(i32 %0) unnamed_addr #1 {
start:
  %acc = alloca i32, align 4
  %n = alloca i32, align 4
  store i32 %0, i32* %n, align 4
  store i32 1, i32* %acc, align 4
  br label %bb1

bb1:                                              ; preds = %bb4, %start
  %_3 = load i32, i32* %n, align 4
  %_2 = icmp ugt i32 %_3, 0
  br i1 %_2, label %bb2, label %bb5

bb5:                                              ; preds = %bb1
  %1 = load i32, i32* %acc, align 4
  ret i32 %1

bb2:                                              ; preds = %bb1
  %_4 = load i32, i32* %n, align 4
  %2 = load i32, i32* %acc, align 4
  %3 = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 %2, i32 %_4)
  %_5.0 = extractvalue { i32, i1 } %3, 0
  %_5.1 = extractvalue { i32, i1 } %3, 1
  %4 = call i1 @llvm.expect.i1(i1 %_5.1, i1 false)
  br i1 %4, label %panic, label %bb3

bb3:                                              ; preds = %bb2
  store i32 %_5.0, i32* %acc, align 4
  %_6 = load i32, i32* %n, align 4
  %5 = call { i32, i1 } @llvm.usub.with.overflow.i32(i32 %_6, i32 1)
  %_7.0 = extractvalue { i32, i1 } %5, 0
  %_7.1 = extractvalue { i32, i1 } %5, 1
  %6 = call i1 @llvm.expect.i1(i1 %_7.1, i1 false)
  br i1 %6, label %panic1, label %bb4

panic:                                            ; preds = %bb2
; call core::panicking::panic
  call void @_ZN4core9panicking5panic17h50b51d19800453c0E ...

bb4:                                              ; preds = %bb3
  store i32 %_7.0, i32* %n, align 4
  br label %bb1

panic1:                                           ; preds = %bb3
; call core::panicking::panic
  call void @_ZN4core9panicking5panic17h50b51d19800453c0E ...

Recall that factorial takes a mutable u32 value as an argument, which in IR is defined as the unnamed temporary %0. The square function, by contrast, was defined with the named value %x as its argument, just as in the source code. This is because square never modifies its argument, whereas factorial requires it to be mutable, so a copy of the argument is made in the function’s scope. The two named variables %n and %acc are defined at the beginning of the function. Both are allocated (using alloca) 32 bits of memory, aligned to 4 bytes. You can read more about memory alignment here.

After allocating memory, the store instruction stores the contents of %0 at the address of %n, and the constant value 1 at the address of %acc. Finally, it jumps to the entry point of the while loop. The bb1 label marks the while condition. IR loads the value stored at the %n address into the temporary %_3 and compares it with the constant 0. icmp checks whether the value in %_3 is unsigned-greater-than (ugt) 0 and assigns the result to another temporary, %_2. After that, it checks the contents of %_2: if true, it branches to label %bb2; if false, to label %bb5.

The bb5 label marks the exit block of the while loop. Here, it loads the value at the %acc address into the temporary %1 and returns it from the function. The label bb2 is the entry to the body of the while loop, which is divided into two halves. The first (the bb2 block) is responsible for the multiplication, and the second for decrementing n, i.e. the subtraction. The operation n = n - 1 is done by calling another intrinsic for unsigned subtraction, @llvm.usub.with.overflow.i32. The rest of the operations are the same as we saw in the square function (Figure 4-a).
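The basic-block structure above can be sketched back in Rust using the standard library’s checked arithmetic (a hypothetical rewrite for illustration, not what rustc actually generates):

```rust
// A Rust sketch of the control flow the IR encodes: each multiply and
// subtract goes through a checked operation, and the expect() calls
// play the role of the %panic / %panic1 blocks.
fn factorial(mut n: u32) -> u32 {
    let mut acc: u32 = 1;                 // store i32 1, i32* %acc
    while n > 0 {                         // bb1: icmp ugt %n, 0
        acc = acc                         // bb2: umul.with.overflow
            .checked_mul(n)
            .expect("attempt to multiply with overflow");
        n = n                             // bb3: usub.with.overflow
            .checked_sub(1)
            .expect("attempt to subtract with overflow");
    }
    acc                                   // bb5: ret
}

fn main() {
    assert_eq!(factorial(0), 1);
    assert_eq!(factorial(4), 24);
    println!("ok");
}
```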

You should have noticed SSA coming into play here. For each assignment, IR introduces a new unnamed temporary rather than reassigning values to n and acc as the source code does. The numbering of unnamed temporaries increments from 0 within each function as instances are created.

LLVM IR represents integers in two’s-complement binary form. It has integer data types such as i8, i16, i32, and i64, as well as floating-point types. If you see an asterisk after an integer type, it means we are dealing with a pointer (for example, i32*). A value can also be a constant. Each instruction also has its own kind, such as arithmetic operations, binary comparisons, stores, and loads. There are many more operations in IR; you can find the full list at the link at the end of this section.

The LLVM class hierarchy is a bit complex. The base object is called a Value, which represents an SSA register; all other classes inherit from it. Then comes the User class, followed by Instruction, Operator, Constant, and so on (Figure 4-b). This is the complete class inheritance tree in LLVM.

The LLVM language has many high-level constructs, types, constants, functions, code generation instructions, and so on. If you are interested in more information, you can browse through the reference manual.

Conclusion

As we saw in this article, LLVM IR has a number of use cases that let us analyze and optimize source code through passes. Knowing the IR language itself helps us write our own passes and build projects around them for debugging, testing, and tuning. Currently, LLVM has no official Rust API; it is primarily used as a C++ library. However, some authors publish crates on crates.io that bind to LLVM. There is llvm-sys, a Rust binding to the LLVM C API, and two other crates that also use LLVM but provide more idiomatic Rust APIs: Inkwell and llvm-ir. Finally, if you want to learn how to write an LLVM pass, you should start here.