Use Rust to write LLVM’s toy compiler

Translator: Iamazy/Post editing: Zhang Handong

译 文 : A Toy front-end for LLVM, written in Rust


The body of the

My current sideline is writing a compiler in Rust that converts code to LLVM IR. The LLVM API is a bit daunting for beginners, and there aren’t many tutorials (the limited ones are mostly C++ based, and it’s not always clear how to do the same with Rust). I want someone to hold my hand and teach me when I’m ready to do something. That’s why I want to write this article.

For Rust, the best choice for interacting with LLVM’s interfaces is to use LLVM-sys. Some warm-hearted people on the Internet have hosted some documents on LLVM-sys here. Of course, you should also check out the official LLVM guide, as it will help you understand how LLVM “thinks”. This article is basically a Rust translation of the official LLVM guide.

You can get the final code here.

Setting up the development environment

For beginners, there is a reusable way to develop with LLVM:

# `curl` is just so we can next install RustSudo apt - get - y install clang curl LLVM - 3.8 - dev curl https://sh.rustup.rs - sSf | sh
# The `llvm-sys` crate expects something called `llvm-config` on your PATH.Sudo ln -s /usr/bin/llvm-config-3.8 /usr/bin/llvm-configCopy the code

If you are running the above statement on Ubuntu (you may need to run apt-get update), there is no problem. If not, you need to run the above statement in the Vagrant Box using the following Vagrantfile file.

Vagrant. Configure (" 2 ") do | config | config. The vm. The box = "bento/ubuntu 16.04" endCopy the code

You can start by executing cargo init llvm-example –bin and write the following code (copied from llvm-sys) to SRC /main.rs:

/ /! Construct a function that does nothing in LLVM IR.

extern crate llvm_sys as llvm;

use std::ptr;

fn main() {
    unsafe {
        // Set up a context, module and builder in that context.
        let context = llvm::core::LLVMContextCreate();
        let module = llvm::core::LLVMModuleCreateWithName(b"nop\0".as_ptr() as *const _);
        let builder = llvm::core::LLVMCreateBuilderInContext(context);

        // Get the type signature for void nop(void);
        // Then create it in our module.
        let void = llvm::core::LLVMVoidTypeInContext(context);
        let function_type = llvm::core::LLVMFunctionType(void, ptr::null_mut(), 0.0);
        let function = llvm::core::LLVMAddFunction(module, b"nop\0".as_ptr() as *const _, function_type);

        // Create a basic block in the function and set our builder to generate
        // code in it.
        let bb = llvm::core::LLVMAppendBasicBlockInContext(context, function,b"entry\0".as_ptr() as *const _);
        llvm::core::LLVMPositionBuilderAtEnd(builder, bb);

        // Emit a `ret void` into the function
        llvm::core::LLVMBuildRetVoid(builder);

        // Dump the module as IR to stdout.
        llvm::core::LLVMDumpModule(module);

        // Clean up. Values created in the context mostly get cleaned up there.llvm::core::LLVMDisposeBuilder(builder); llvm::core::LLVMDisposeModule(module); llvm::core::LLVMContextDispose(context); }}Copy the code

And in your Cargo. Toml file:

[package]
name = "llvm-example"
version = "0.1.0 from"
authors = ["Ulysse Carion <[email protected]>"]

[[bin]]
name = "main"

[dependencies]
llvm-sys = "0.2"
Copy the code

You can get:

vagrant@vagrant:/vagrant$cargo run llvm-example v0.1.0 (file:///vagrant) Running 'target/debug/main'; ModuleID = 'nop' define void @nop() { entry: ret void }Copy the code

Perfect! Now we can start writing our own stuff.

A very unusual procedure

First, let’s compile a program that simply sets a return code by returning an integer from main.

Here’s how I used it (we sometimes need a parser, so I added the PEG library first) :

#! [feature(plugin)]
#! [plugin(peg_syntax_ext)]

extern crate llvm_sys as llvm;

use std::ffi::CString;
use std::fs::File;
use std::io::Read;
use std::ptr;

fn main() {
    let mut input = String::new();
    let mut f = File::open("in.ex").unwrap();
    f.read_to_string(&mut input).unwrap();

    let parsed_input = parser::program(&input).unwrap();

    unsafe {
        codegen(parsed_input);
    }
}

peg! parser(r#" #[pub] program -> String = i:int_literal "\n" { i } int_literal -> String = [0-9]+ { match_str.to_owned() } "#);

unsafe fn codegen(input: String) {
    let context = llvm::core::LLVMContextCreate();
    let module = llvm::core::LLVMModuleCreateWithName(b"example_module\0".as_ptr() as *const _);
    let builder = llvm::core::LLVMCreateBuilderInContext(context);

    // In LLVM, you get your types from functions.
    let int_type = llvm::core::LLVMInt64TypeInContext(context);
    let function_type = llvm::core::LLVMFunctionType(int_type, ptr::null_mut(), 0.0);
    let function = llvm::core::LLVMAddFunction(module, b"main\0".as_ptr() as *const _, function_type);

    let entry_name = CString::new("entry").unwrap();
    let bb = llvm::core::LLVMAppendBasicBlockInContext(context, function, entry_name.as_ptr());
    llvm::core::LLVMPositionBuilderAtEnd(builder, bb);

    // The juicy part: construct a `LLVMValue` from a Rust value:
    let int_value: u64 = input.parse().unwrap();
    let int_value = llvm::core::LLVMConstInt(int_type, int_value, 0);

    llvm::core::LLVMBuildRet(builder, int_value);

    // Instead of dumping to stdout, let's write out the IR to `out.ll`
    let out_file = CString::new("out.ll").unwrap();
    llvm::core::LLVMPrintModuleToFile(module, out_file.as_ptr(), ptr::null_mut());

    llvm::core::LLVMDisposeBuilder(builder);
    llvm::core::LLVMDisposeModule(module);
    llvm::core::LLVMContextDispose(context);
}
Copy the code

It worked! Test it out:

vagrant@vagrant:/vagrant$ cat in.ex 42 vagrant@vagrant:/vagrant$ cargo run Running `target/debug/main` Vagrant @ vagrant: / vagrant $lli - 3.8 out. Ll; echo $? 42Copy the code

Kind of cool! By the way, here’s the contents of the out.ll file:

; ModuleID = 'example_module'

define i64 @main() {
entry:
  ret i64 42
}
Copy the code

The arithmetic

Let’s add support for adding, subtracting, multiplying and dividing numbers. To do this, we need to extend our syntax. We introduce an enumeration to represent the AST (abstract syntax tree).

pub enum Expr {
    Add(Box<Expr>, Box<Expr>),
    Sub(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    Div(Box<Expr>, Box<Expr>),
    Literal(String),}Copy the code

And extend the syntax:

// `product` and `sum` are that way to get operator precedence
peg! parser(r#" use super::Expr; #[pub] program -> Expr = e:expression "\n" { e } expression -> Expr = sum sum -> Expr = a:product _ "+" _ b:sum { Expr::Add(Box::new(a), Box::new(b)) } / a:product _ "-" _ b:sum { Expr::Sub(Box::new(a), Box::new(b)) } / product product -> Expr = a:int_literal _ "*" _ b:product { Expr::Mul(Box::new(a), Box::new(b)) } / a:int_literal _ "/" _ b:product { Expr::Div(Box::new(a), Box::new(b)) } / int_literal int_literal -> Expr = [0-9]+ { Expr::Literal(match_str.to_owned()) } _ = " "* "#);
Copy the code

Next, you can submit the code. You can specify strings such as “addtmp” that will be used as part of the corresponding “register” name in IR.

// When you write out instructions in LLVM, you get back `LLVMValueRef`s. You
// can then use these references in other instructions.
unsafe fn codegen_expr(context: LLVMContextRef, builder: LLVMBuilderRef, expr: Expr) -> LLVMValueRef {
    match expr {
        Expr::Literal(int_literal) => {
            let int_type = llvm::core::LLVMInt64TypeInContext(context);
            llvm::core::LLVMConstInt(int_type, int_literal.parse().unwrap(), 0)
        },

        Expr::Add(lhs, rhs) => {
            let lhs = codegen_expr(context, builder, *lhs);
            let rhs = codegen_expr(context, builder, *rhs);

            let name = CString::new("addtmp").unwrap();
            llvm::core::LLVMBuildAdd(builder, lhs, rhs, name.as_ptr())
        },

        Expr::Sub(lhs, rhs) => {
            let lhs = codegen_expr(context, builder, *lhs);
            let rhs = codegen_expr(context, builder, *rhs);

            let name = CString::new("subtmp").unwrap();
            llvm::core::LLVMBuildSub(builder, lhs, rhs, name.as_ptr())
        },

        Expr::Mul(lhs, rhs) => {
            let lhs = codegen_expr(context, builder, *lhs);
            let rhs = codegen_expr(context, builder, *rhs);

            let name = CString::new("multmp").unwrap();
            llvm::core::LLVMBuildMul(builder, lhs, rhs, name.as_ptr())
        },

        Expr::Div(lhs, rhs) => {
            let lhs = codegen_expr(context, builder, *lhs);
            let rhs = codegen_expr(context, builder, *rhs);

            let name = CString::new("divtmp").unwrap();
            llvm::core::LLVMBuildUDiv(builder, lhs, rhs, name.as_ptr())
        },
    }
}
Copy the code

Now, you can execute programs like 10 * 4 + 20/2-8! Pretty cool, if you ask me.

variable

We’ll keep it simple and assume that the program doesn’t do anything annoying, such as referencing undefined variables. We only store variables in registers and store them in HashMap

, which is useful because there is only one way to run the program.
,>

We extend the language and parser:

pub enum Expr {
    Literal(String),
    Ref(String),
    Assign(String.Box<Expr>),
    Add(Box<Expr>, Box<Expr>),
    Sub(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    Div(Box<Expr>, Box<Expr>),
}

peg! parser(r#" use super::Expr; #[pub] program -> Vec
      
        = e:(expression ** "\n") "\n" { e } expression -> Expr = i:identifier _ "=" _ s:sum { Expr::Assign(i, Box::new(s)) } / sum sum -> Expr = a:product _ "+" _ b:sum { Expr::Add(Box::new(a), Box::new(b)) } / a:product _ "-" _ b:sum { Expr::Sub(Box::new(a), Box::new(b)) } / product product -> Expr = a:ref_or_literal _ "*" _ b:product { Expr::Mul(Box::new(a), Box::new(b)) } / a:ref_or_literal _ "/" _ b:product { Expr::Div(Box::new(a), Box::new(b)) } / ref_or_literal ref_or_literal -> Expr = i:identifier { Expr::Ref(i) } / int_literal identifier -> String = [a-zA-Z]+ { match_str.to_owned() } int_literal -> Expr = [0-9]+ { Expr::Literal(match_str.to_owned()) } _ = " "*" #
      );
Copy the code

Then add support for these two new expressions:

unsafe fn codegen_expr(context: LLVMContextRef, builder: LLVMBuilderRef, names: &mut HashMap<String, LLVMValueRef>, expr: Expr) -> LLVMValueRef {
    match expr {
        // ...

        Expr::Ref(name) => {
            *names.get(&name).unwrap()
        },

        Expr::Assign(name, expr) => {
            let new_value = codegen_expr(context, builder, names, *expr);
            names.insert(name, new_value);
            new_value
        },
    }
}
Copy the code

And quickly update the codegen function:

let zero = llvm::core::LLVMConstInt(int_type, 0.0);

let mut names = HashMap::new();
let mut return_value = zero; // return value on empty program
for expr in input {
    return_value = codegen_expr(context, builder, &mut names, expr);
}
llvm::core::LLVMBuildRet(builder, return_value);
Copy the code

Now let’s find out:

vagrant@vagrant:/vagrant$ cat in.ex
a = 3
b = 76
a + b
vagrant@vagrant:/vagrant$ cargo run
     Running `target/debug/main`
vagrant@vagrant:/vagrant$ cat out.ll
; ModuleID = 'example_module'

define i64 @main() {
entry:
  ret i64 79
}
Copy the code

If

I had some trouble using the if keyword. The easiest way to make if work is to store all variables on the stack. And let LLVM do some optimizations. In LLVM, you can create a stack variable with the alloca directive and use load/ Store to read and write.

To do this, we have once again extended the language and syntax by adding new parsing rules.

expression -> Expr
    = if_expression
    / i:identifier _ "=" _ s:expression { Expr::Assign(i, Box::new(s)) }
    / sum

if_expression -> Expr
    = "if" _ e:expression _ "{\n" _ then_body:statements _ "}" _ "else" _ "{\n" _ else_body:statements _ "}" {
        Expr::If(Box::new(e), then_body, else_body)
    }
Copy the code

And added a new type to the AST node:

pub enum Expr {
    Literal(String),
    Ref(String),
    Assign(String.Box<Expr>),
    Add(Box<Expr>, Box<Expr>),
    Sub(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    Div(Box<Expr>, Box<Expr>),
    If(Box<Expr>, Vec<Expr>, Vec<Expr>),
}
Copy the code

Finally, complete the code for the if expression:

unsafe fn codegen_expr(context: LLVMContextRef, builder: LLVMBuilderRef, func: LLVMValueRef, names: &mut HashMap<String, LLVMValueRef>, expr: Expr) -> LLVMValueRef {
    match expr {
        // ...

        Expr::If(condition, then_body, else_body) => {
            let condition_value = codegen_expr(context, builder, func, names, *condition);
            let int_type = llvm::core::LLVMInt64TypeInContext(context);
            let zero = llvm::core::LLVMConstInt(int_type, 0.0);

            // `is_nonzero` is a 1-bit integer
            let name = CString::new("is_nonzero").unwrap();
            let is_nonzero = llvm::core::LLVMBuildICmp(builder, llvm::LLVMIntPredicate::LLVMIntNE, condition_value, zero, name.as_ptr());

            // It's fine to create blocks first, and then fill them in later.
            let entry_name = CString::new("entry").unwrap();
            let then_block = llvm::core::LLVMAppendBasicBlockInContext(context, func, entry_name.as_ptr());
            let else_block = llvm::core::LLVMAppendBasicBlockInContext(context, func, entry_name.as_ptr());
            let merge_block = llvm::core::LLVMAppendBasicBlockInContext(context, func, entry_name.as_ptr());

            llvm::core::LLVMBuildCondBr(builder, is_nonzero, then_block, else_block);

            llvm::core::LLVMPositionBuilderAtEnd(builder, then_block);
            let mut then_return = zero;
            for expr in then_body {
                then_return = codegen_expr(context, builder, func, names, expr);
            }
            llvm::core::LLVMBuildBr(builder, merge_block);

            llvm::core::LLVMPositionBuilderAtEnd(builder, else_block);
            let mut else_return = zero;
            for expr in else_body {
                else_return = codegen_expr(context, builder, func, names, expr);
            }
            llvm::core::LLVMBuildBr(builder, merge_block);

            // Position the builder so that it's ready to work on the next
            // expression.
            llvm::core::LLVMPositionBuilderAtEnd(builder, merge_block);
            zero
        }
    }
}
Copy the code

It’s a lot of code, but it does what you expect. Now, you can run the program like this:

a = 1
if a {
    a = 42
} else {
    a = 13
}
a
Copy the code

The IR corresponding to the above code is as follows:

; ModuleID = 'example_module'

define i64 @main() {
entry:
  %a = alloca i64
  store i64 1, i64* %a
  %a1 = load i64, i64* %a
  %is_nonzero = icmp ne i64 %a1, 0
  br i1 %is_nonzero, label %entry2, label %entry3

entry2:                                           ; preds = %entry
  store i64 42, i64* %a
  br label %entry4

entry3:                                           ; preds = %entry
  store i64 13, i64* %a
  br label %entry4

entry4:                                           ; preds = %entry3, %entry2
  %a5 = load i64, i64* %a
  ret i64 %a5
}
Copy the code

However, we are not finished yet. Currently, our “if” expression always returns zero (see codegen_expr for the return value of the if branch above). What we want is the exact opposite: if we execute the “then” path, then if evaluates to then_return, otherwise returns else_return.

How do you use LLVM to track which branches it executes? By using the “phi” node. You give phi a (block, value) pair, and the phi node will return the value corresponding to the previously executed block.

We can end if like this. Note that we have to update then_block and else_block, because this is the last block we need in the “then/else” branch, and the preceding then_block is the first block in the “THEN /else” branch.

// This is mostly the same code as before, just note the new calls to
// `LLVMGetInsertBlock`.

llvm::core::LLVMPositionBuilderAtEnd(builder, then_block);
let mut then_return = zero;
for expr in then_body {
    then_return = codegen_expr(context, builder, func, names, expr);
}
llvm::core::LLVMBuildBr(builder, merge_block);
let then_block = llvm::core::LLVMGetInsertBlock(builder);

llvm::core::LLVMPositionBuilderAtEnd(builder, else_block);
let mut else_return = zero;
for expr in else_body {
    else_return = codegen_expr(context, builder, func, names, expr);
}
llvm::core::LLVMBuildBr(builder, merge_block);
let else_block = llvm::core::LLVMGetInsertBlock(builder);

// Insert the phi node
llvm::core::LLVMPositionBuilderAtEnd(builder, merge_block);
let phi_name = CString::new("iftmp").unwrap();
let phi = llvm::core::LLVMBuildPhi(builder, int_type, phi_name.as_ptr());

let mut values = vec![then_return, else_return];
let mut blocks = vec![then_block, else_block];

llvm::core::LLVMAddIncoming(phi, values.as_mut_ptr(), blocks.as_mut_ptr(), 2);
phi
Copy the code

Then you get an amazing compiler:

vagrant@vagrant:/vagrant$ cat in.ex a = 1 b = 0 c = if a { if b { 11 } else { 40 } } else { if b { 10 } else { 20 } } c + 2 vagrant@vagrant:/vagrant$cargo run Running 'target/debug/main' vagrant@vagrant:/vagrant$lli-3.8 out.ll; echo $? 42Copy the code

Cool! Here is the IR for our sample input program:

; ModuleID = 'example_module'

define i64 @main() {
entry:
  %a = alloca i64
  %b = alloca i64
  %c = alloca i64
  store i64 1, i64* %a
  store i64 0, i64* %b
  %a1 = load i64, i64* %a
  %is_nonzero = icmp ne i64 %a1, 0
  br i1 %is_nonzero, label %entry2, label %entry3

entry2:                                           ; preds = %entry
  %b5 = load i64, i64* %b
  %is_nonzero6 = icmp ne i64 %b5, 0
  br i1 %is_nonzero6, label %entry7, label %entry8

entry3:                                           ; preds = %entry
  %b10 = load i64, i64* %b
  %is_nonzero11 = icmp ne i64 %b10, 0
  br i1 %is_nonzero11, label %entry12, label %entry13

entry4:                                           ; preds = %entry14, %entry9
  %iftmp16 = phi i64 [ %iftmp, %entry9 ], [ %iftmp15, %entry14 ]
  store i64 %iftmp16, i64* %c
  %c17 = load i64, i64* %c
  %addtmp = add i64 %c17, 2
  ret i64 %addtmp

entry7:                                           ; preds = %entry2
  br label %entry9

entry8:                                           ; preds = %entry2
  br label %entry9

entry9:                                           ; preds = %entry8, %entry7
  %iftmp = phi i64 [ 11, %entry7 ], [ 40, %entry8 ]
  br label %entry4

entry12:                                          ; preds = %entry3
  br label %entry14

entry13:                                          ; preds = %entry3
  br label %entry14

entry14:                                          ; preds = %entry13, %entry12
  %iftmp15 = phi i64 [ 10, %entry12 ], [ 20, %entry13 ]
  br label %entry4
}

Copy the code

Note that these blocks have the following pattern: without the first entry, they are grouped as three, first with the “THEN” branch, then with the “else” branch, and finally with the “merge” block (with the recognizable PHI instruction). Each time we encounter an “if” expression, we append three new blocks to main. Because triples are queried recursively in the AST, the triples of blocks are ordered.

That’s all I want to share! Hopefully at this point you will be strong enough to decide your fate.

Contents: March issue of Rust Chinese (Rust_Magazine)