Original address: mukulrathi.co.uk/create-your…

The original article is written by mukulrathy.co.uk /

Published: May 25, 2020 -6 minutes reading

Series. Create the BOLT compiler

  • Part 1: Why Write your own programming language? Why write your own programming language?
  • Part 2: So how do you structure a compiler project?
  • Part 3: Write a Lexer and parser using OCamllex and Menhir.
  • Part 4: a straightforward introduction to type theory and implementing type checkers.
  • Part 5: A tutorial on liveliness and alias data flow analysis.
  • Part 6: Desugaring– Simplify our high-level language!
  • Part 7: Protobuf tutorial for OCaml and C++.
  • Part 8: The programming language creator’s complete guide to LLVM.
  • Part 9: Implementing concurrency and our runtime
  • Part 10: Inheritance and method overrides in Bolt.
  • Part 11: Generality — Add polymorphism to Bolt.

The forthcoming


Writing a compiler, like any other software engineering project, involves many key design decisions: what language do you use, how do you organize files in the repo, which tools should you use? Most compiler tutorials focus on a toy example, choosing to ignore these practical issues.

With Bolt, I wanted to highlight a larger compiler and the design decisions I made. If you’re reading this and you’re working on an industrial compiler for a more mature language, please contact me on Twitter! I’d love to hear about your design decisions!

Use the right language at work, not just the one you’re most familiar with.

The “Write a compiler in Y” (insert your favorite language) tutorial is a scratch. Writing a compiler in a language you know may seem easier at first because there’s one less thing to learn, but it’s a short-term gain. Choosing the right language is like learning touch input: Sure, it’ll be slow at first, but think how much faster you’ll be once you master it.

JavaScript is a great language for web applications and is easy for beginners to learn. But would I write a compiler with it? Frankly, no. It’s not that I hate JavaScript (I use it on this site), it just doesn’t fit our goals.

What do we care about for the compiler?

  • Coverage — we need to consider all possible Bolt expressions and make sure we handle all cases — it’s not good if our compiler crashes on a Bolt program we forgot to consider. Can our language help us keep track of this?

  • Data representation — How do we represent and manipulate Bolt expressions in the compiler?

  • Tools – Does our language have libraries that we can use for the compiler? There is a balance between learning by doing and unnecessarily reinventing the wheel.

  • Speed — there are two different aspects. First, how fast is the compiled Bolt code? Second, how fast is the compiler (how long does it take to compile Bolt code)? There is a tradeoff — in order to get faster compiled code, you need to add more optimization steps to the compiler, making it slower.

There is no magic bullet: every compiler design has its own tradeoffs. I chose to write my compiler mainly in OCaml.

Why OCaml?

OCaml is a functional programming language with a powerful type system. You might have two questions: why functional programming, and what does a powerful type system mean?

In a large compiler, with lots of moving parts, keeping track of states makes our lives much more difficult. Functional programming is easier to reason with: if you pass the same input to a function, it will always return the same output. With functional programming, we don’t have to worry about side effects or state, and it lets us focus on high-level design.

Another option is to write the compiler in Rust for performance reasons. While you might have a faster compiler, I don’t think speed justificates the extra low-level details like the memory management Rust requires you to track. Personally, since I don’t write thousands of lines of Bolt code, I don’t care much about the time it takes the Bolt compiler to compile the program.

A type lets you pair a program with a compiler

If you come from a dynamically typed language like JS or Python, OCaml’s rich typology may seem strange to you and may feel cumbersome. My view of the types in OCaml is that they give the OCaml compiler more information about your program — the more you tell it, the more it helps you!

Going back to our override, we want to say that the Bolt expression is either an integer, an if-else expression, a method call, a while loop, etc. Usually to represent something like this, you would use an enum and a switch statement. In OCaml, we bake this “enumeration” into our type system using variant types. We can code the structure of each expression in the type! For example, to access a variable, all you need to know is its name, x. To access the fields of an object, you need to know its name, X, and the field, F, that you want to access.

type identifier =
| Variable of Var_name.t
| ObjField of Var_name.t * Field_name.t

let do_something id = match id with
| Var(x) -> ...
| ObjField(x,f) -> ...
Copy the code

I haven’t mentioned the best part yet. Because we’ve already encoded the Bolt expression structure in a type, the OCaml compiler checks to see if we’ve covered all the cases! One thing that’s missing for us is tracking. That’s one less thing for us to keep track of!

Twitter.com/mukulrathi_…

So OCaml is responsible for the override, and we decided to encode the Bolt expression as a variant type, so this is the sort of data representation. OCaml also provides good tools for the lexer and parser phases of the compiler (discussed in the next article), which eliminates another of our standards.

Target performance with LLVM

We mentioned the fact that we don’t really care about the performance of the compiler itself. However, we do want our compiled Bolt code to be fast (it’s in the name!). . As mentioned in the last article, we don’t have to reinvent the wheel. By targeting LLVM IR, we can connect to the C/C++ toolchain and get our optimizations for free.

LLVM provides language authors with an API to generate the LLVM IR — the native API is in C++. LLVM provides OCaml bindings, but they map only part of the C++ API. Implementation of memory barriers (machine instructions required to properly implement locks) was not supported at implementation time, so I implemented this part of the compiler in C++. C++ also feels more natural to apis: while you can write imperative code in OCaml, it’s more suitable for functional programming.

A natural question you might ask is: since you’re going to use the LLVM C++ API, why not write everything in C++? I decided to choose the best language for each part of the compiler. The tradeoff is that we now have to pass data between the OCaml compiler front end and the C++ compiler back end. More powerful, but the compiler is more complex.

If the OCaml LLVM binding is sufficient for your language as you read this series, I encourage you to use it instead. The API is almost identical — OCaml functions instead of C++ method calls.

Software engineering Methodology

The repository contains more information about the compiler structure, in repo_overview.md. Most of these are general software engineering techniques, so I’ll cover them briefly. If you want to know more details, please contact me! .

First, the structure of the repository is modular. Each phase of the compiler has its own library, and you can view the documentation. Functions are grouped into modules (.ml files provide the module’s implementation, and.mli files provide the module’s interface). You can think of each module as performing a role in a phase, such as type_expr.ml for type-checking expressions, and pprint_parser_tokens. Ml for nicely printing parser tokens. This allows each module to be more focused and avoids the difficulty of reading monolithic files that are hundreds of lines long.

To build these files, I used the Dune build system for OCaml (which I have a blog post explaining) and the Bazel build system for C++. For large repositories, it’s almost impossible to manually compile every file (for example, by running clang++ foo.cpp) and link to the files that depend on each other — the build system does this for us automatically (running a make build command compiles all the files in the repository). One of Bazel’s main benefits is that dependencies are self-sustaining, so they can work across machines.

For testing, my main library is Jane Street’s Expect test library. This library is really easy to write tests because it automatically generates the expected output. I have a blog post about testing with OCaml that addresses this issue. The article also explained how I set up continuous integration (where tests are run and documents are generated each time they are committed to the repository).

Twitter.com/mukulrathi_…

I also use makefiles and scripts to automate common tasks. One tool I highly recommend is the auto-formatting tool — I use ocamlformat for OCaml code and clang-format for C++ code — which will format your code for you so you can get beautiful code for free. You can do this automatically using the Git pre-commit hook in the Repo (which filters and formats your code every time you commit) or through the IDE format-on-save integration. One final tip: Use VSCode’s OCaml IDE extension or equivalent — every time you hover over a function, it displays its type signature and any documentation comments associated with it.

conclusion

In previous articles, I’ve explained Bolt’s place in the programming language spectrum, as well as compiler design and software engineering decisions. In the next article, we’ll actually start building this program. Before we do that, I have a few action items for you.

  • Real World OCaml is a great free resource.
  • Fork the repo. Do a quick, advanced scan, but don’t worry about the details just yet. We’ll break down each stage of the compiler in our own article, and introduce more compilers throughout the article.

www.deepl.com translation