Original address: mukulrathi.co.uk/create-your…

Original article written by Mukul Rathi (mukulrathi.co.uk)

Published: May 10, 2020 · 6 minute read

Series: Creating the Bolt compiler

  • Part 1: Why write your own programming language?
  • Part 2: So how do you structure a compiler project?
  • Part 3: Writing a lexer and parser using OCamllex and Menhir.
  • Part 4: An accessible introduction to type theory and implementing a type checker.
  • Part 5: A tutorial on liveness and alias dataflow analysis.
  • Part 6: Desugaring: simplifying our high-level language!
  • Part 7: A Protobuf tutorial for OCaml and C++.
  • Part 8: A complete guide to LLVM for programming language creators.
  • Part 9: Implementing concurrency and our runtime.
  • Part 10: Inheritance and method overriding in Bolt.
  • Part 11: Generics: adding polymorphism to Bolt.

Above is the compiler pipeline for the language we’re going to build: Bolt. What do these phases mean? Do I have to learn OCaml and C++? Wait, I’ve never even heard of OCaml…

Don’t worry. When I started this project six months ago, I had never built a compiler or used OCaml or C++ in any serious project. I’ll explain everything in due course. The real question we should be asking is: why design our own language? The likely answers are:

  1. It’s very interesting.
  2. It’s cool to have your own programming language.
  3. It’s a nice side project.

Mental models

While all three of these (or none!) may be true, there’s a bigger motive: having the right mental models. You see, when you learn your first programming language, you see programming through the eyes of that language. Fast-forward to your second language and it seems hard: you have to relearn the syntax and do things differently. The more programming languages you use, though, the more you notice that these languages share common themes. Java and Python have objects; Python and JavaScript don’t require you to write types; and the list goes on. Dig deeper into programming language theory and you find that these language constructs have names: Java and Python are object-oriented programming languages, and Python and JavaScript are dynamically typed.

The programming languages you’ve been using are actually based on ideas from older languages you may never have heard of. Simula and Smalltalk introduced object-oriented programming. Lisp introduced dynamic typing. And new research languages appear all the time, introducing new concepts. For a more mainstream example, Rust brings memory safety to low-level systems programming languages.

Building your own language (especially when you add new ideas) forces you to think about language design more rigorously, so learning the next new language becomes much easier. For example, I had never used the Hack programming language before I interned at Facebook last summer, but knowing these programming language concepts made it much easier to pick up.

What is a compiler?

So you’ve designed your fancy new language that’s going to revolutionize the world, but there’s just one problem: how do you run it? That’s the compiler’s job. To explain how compilers work, let’s first flash back to the 19th century, to the telegraph. We have this fancy new telegraph, but how do we send messages? Same problem, different field. The operator had to listen to speech, convert it into Morse code, and tap it out. The first thing the operator does is make sense of the sound: they break it up into words (lexing) and then work out how those words are used in the sentence (parsing), whether they’re part of a noun phrase, part of a clause, and so on. They check the meaning by categorizing the words by part of speech (adjective, noun, verb) and checking that the sentence makes grammatical sense (we can’t use “run” to describe a noun, because it’s a verb, not an adjective). Finally, they translate (compile) each word into dots and dashes (Morse code), which are then transmitted along the wire.

This might seem laboured, because humans do most of these steps automatically. Compilers work the same way, except that we have to explicitly program the computer to do each step. The example above describes a simple compiler with four phases: lexing, parsing, type checking, and translation into machine instructions. The operator also needs some extra equipment to actually tap out the Morse code; for programming languages, this is the runtime environment.
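
To make those four phases concrete, here’s a minimal sketch in OCaml (the language we’ll use for Bolt’s front end) of an entire compiler for a toy arithmetic language, written in reverse Polish notation so the parser stays tiny. All the names and types here are illustrative placeholders, not Bolt’s actual code:

```ocaml
(* A toy end-to-end compiler for arithmetic in reverse Polish notation.
   Everything here is an illustrative placeholder, not Bolt's actual API. *)

type token = TNum of int | TPlus | TTimes
type expr  = Num of int | Add of expr * expr | Mul of expr * expr
type instr = Push of int | IAdd | IMul   (* toy stack-machine "machine code" *)

(* Phase 1: lexing - characters to tokens, e.g. "1 2 + 3 *". *)
let lex (src : string) : token list =
  String.split_on_char ' ' src
  |> List.filter (fun s -> s <> "")
  |> List.map (function
       | "+" -> TPlus
       | "*" -> TTimes
       | n   -> TNum (int_of_string n))

(* Phase 2: parsing - tokens to a syntax tree, via a stack. *)
let parse (tokens : token list) : expr =
  let step stack tok =
    match tok, stack with
    | TNum n, _            -> Num n :: stack
    | TPlus,  r :: l :: tl -> Add (l, r) :: tl
    | TTimes, r :: l :: tl -> Mul (l, r) :: tl
    | _                    -> failwith "parse error"
  in
  match List.fold_left step [] tokens with
  | [ast] -> ast
  | _     -> failwith "parse error"

(* Phase 3: type checking - trivial here since every value is an int,
   but this is where a richer language would reject ill-typed trees. *)
let rec type_check (e : expr) : unit =
  match e with
  | Num _ -> ()
  | Add (l, r) | Mul (l, r) -> type_check l; type_check r

(* Phase 4: code generation - translate the tree into machine instructions. *)
let rec codegen (e : expr) : instr list =
  match e with
  | Num n      -> [Push n]
  | Add (l, r) -> codegen l @ codegen r @ [IAdd]
  | Mul (l, r) -> codegen l @ codegen r @ [IMul]

(* The compiler is just the four phases composed in order. *)
let compile (src : string) : instr list =
  let ast = parse (lex src) in
  type_check ast;
  codegen ast

(* compile "1 2 + 3 *" = [Push 1; Push 2; IAdd; Push 3; IMul] *)
```

Each phase is just a function, and the compiler is their composition; real compilers have the same shape, only with much richer types at each stage.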

In practice, the operator probably developed some shorthand symbols that they knew how to convert into Morse code. Rather than converting speech directly into Morse code, they first convert the speech into their shorthand, and then convert the shorthand into Morse code. Likewise, most practical languages don’t go directly from source code to machine code. They have desugaring, or lowering, phases where you progressively remove language constructs (for example, rewriting a for loop as a simpler while loop) until only a small set of core constructs is left. Lowering makes the later stages easier because they operate on a simpler representation. The compiler phases are grouped into a front end, middle end, and back end: the front end does most of the parsing and type checking, while the middle and back ends simplify and optimize the code.
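
As a concrete (hypothetical) illustration of lowering, here’s an OCaml sketch of a desugaring pass over a toy statement AST: it rewrites every for loop into an initializing assignment plus a while loop, so everything after this phase only ever has to handle while. This is a generic illustration of the idea, not Bolt’s actual desugaring pass:

```ocaml
(* A toy statement AST: the surface language has both for- and while-loops. *)
type expr =
  | Var of string
  | Int of int
  | Lt  of expr * expr
  | Add of expr * expr

type stmt =
  | Assign of string * expr
  | While  of expr * stmt list
  | For    of string * expr * expr * stmt list  (* for i = lo to hi { body } *)
  | Seq    of stmt list

(* Desugar: rewrite every For into an initializing Assign plus a While,
   so phases after this one only ever see While. *)
let rec desugar (s : stmt) : stmt =
  match s with
  | For (i, lo, hi, body) ->
      Seq
        [ Assign (i, lo);
          While
            ( Lt (Var i, hi),
              List.map desugar body @ [ Assign (i, Add (Var i, Int 1)) ] ) ]
  | While (cond, body) -> While (cond, List.map desugar body)
  | Seq stmts          -> Seq (List.map desugar stmts)
  | Assign _           -> s
```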

Compiler design choices

In fact, we can use this telegraph analogy to frame the design choices of many languages and compilers.

Does the operator translate the message into Morse code as it is being transmitted, or convert the whole message into Morse code before transmitting it? Interpreted languages like Python do the former; ahead-of-time compiled languages like C (and Bolt) do the latter. Java sits somewhere in between: it does some of the work beforehand, translating programs into bytecode, and then uses a just-in-time compiler to turn the bytecode into machine code at runtime.

Now suppose a new code comes out, Lorse code, as an alternative to Morse code. If the operator is taught how to convert the shorthand into Lorse code, then speakers don’t need to learn anything new; they get it for free. Likewise, speakers of other languages only need to teach the operator how to translate their speech into the shorthand, and then they get translation into both Lorse code and Morse code! This is how LLVM works. LLVM IR (intermediate representation) acts as the stepping stone between programs and machine code. C, C++, Rust, and a host of other languages (including Bolt) all target LLVM IR, which is then compiled down to the various machine architectures.
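
The payoff of a shared intermediate representation is arithmetic: M source languages and N targets need only M + N translations instead of M × N. Here’s a toy OCaml sketch of that shape, with two “front ends” compiling to a shared stack-machine IR and two “back ends” (call them Morse and Lorse) emitting from it; none of this is LLVM’s actual interface:

```ocaml
(* The shared IR: a tiny stack machine that every front end targets. *)
type ir = Push of int | AddOp | MulOp

(* Front end 1: a tree-shaped expression language. *)
type expr = Num of int | Add of expr * expr | Mul of expr * expr

let rec from_expr (e : expr) : ir list =
  match e with
  | Num n      -> [Push n]
  | Add (l, r) -> from_expr l @ from_expr r @ [AddOp]
  | Mul (l, r) -> from_expr l @ from_expr r @ [MulOp]

(* Front end 2: a postfix calculator language, e.g. "1 2 +". *)
let from_postfix (src : string) : ir list =
  String.split_on_char ' ' src
  |> List.filter (fun s -> s <> "")
  |> List.map (function
       | "+" -> AddOp
       | "*" -> MulOp
       | n   -> Push (int_of_string n))

(* Back end 1 ("Morse") and back end 2 ("Lorse"): each only has to
   understand the IR, so adding one instantly works for every front end. *)
let to_morse (prog : ir list) : string =
  prog
  |> List.map (function
       | Push n -> string_of_int n
       | AddOp  -> "ADD"
       | MulOp  -> "MUL")
  |> String.concat "; "

let to_lorse (prog : ir list) : string =
  prog
  |> List.map (function
       | Push n -> "push " ^ string_of_int n
       | AddOp  -> "add"
       | MulOp  -> "mul")
  |> String.concat "\n"

(* Two front ends x two back ends, but only four translation functions:
   to_lorse (from_postfix "1 2 +") and to_morse (from_expr (Add (Num 1, Num 2)))
   both work with no extra code. *)
```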

Static typing versus dynamic typing? In the first case, the operator checks that the words make grammatical sense before they start tapping. In the second, they don’t, and halfway through they think, “oh, that doesn’t make sense,” and stop. Dynamic typing can feel better for quick experiments (e.g. Python, JS), but when you send a message, you don’t know whether the operator will stop partway through (crash).

I’ve explained this in terms of an imaginary telegraph operator, but any analogy will do. Building this intuition goes a long way toward understanding which features are right for your language: if you’re experimenting, then maybe dynamic typing is better because you can move faster. If you’re working in a large codebase, it’s harder to check everything by hand and easier to make mistakes, so you should probably lean toward static typing to avoid breaking things.

Types

The most interesting part of the compiler (in my opinion) is the type checker. In our analogy, the operator categorizes words into parts of speech (adjectives, nouns, verbs) and then checks that they are used correctly. Types work the same way: we classify program values based on the behavior we want them to have. For example, int represents numbers that can be multiplied, and String represents streams of characters that can be concatenated. The purpose of the type checker is to prevent bad behavior, such as concatenating ints or multiplying Strings, which makes no sense and should not be allowed. With type checking, the programmer annotates values with types and the compiler checks that they are correct. With type inference, the compiler both infers and checks the types. We call the rules for checking types typing judgments, and the set of these rules (along with the types themselves) forms a type system.
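
To make typing judgments concrete, here’s a minimal type checker sketch in OCaml for a toy language with exactly the two behaviors above; each match arm implements one judgment. (This is illustrative, not Bolt’s actual checker.)

```ocaml
(* A toy type checker: each match arm implements one typing judgment. *)
type ty = TInt | TString

type expr =
  | IntLit    of int
  | StringLit of string
  | Mult      of expr * expr   (* only ints may be multiplied *)
  | Concat    of expr * expr   (* only Strings may be concatenated *)

exception Type_error of string

let rec type_of (e : expr) : ty =
  match e with
  | IntLit _    -> TInt     (* judgment: an int literal has type int *)
  | StringLit _ -> TString
  | Mult (l, r) ->          (* judgment: if l : int and r : int, then l * r : int *)
      (match (type_of l, type_of r) with
       | TInt, TInt -> TInt
       | _          -> raise (Type_error "can only multiply ints"))
  | Concat (l, r) ->
      (match (type_of l, type_of r) with
       | TString, TString -> TString
       | _                -> raise (Type_error "can only concatenate Strings"))

(* type_of (Mult (IntLit 2, IntLit 3))             = TInt
   type_of (Concat (IntLit 2, StringLit "three"))  raises Type_error *)
```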

As it turns out, you can do a lot more: type systems do more than just check that ints and Strings are used correctly. Richer type systems can prove stronger invariants about programs: that they terminate, that they access memory safely, or that they contain no data races. For example, Rust’s type system guarantees memory safety and data-race freedom, on top of checking traditional types like ints and Strings.

What’s Bolt’s advantage?

Programming languages still haven’t cracked the problem of writing safe concurrent code. Bolt, like Rust, prevents data races (explained in the Rust documentation), but takes a more fine-grained approach to concurrency. Before the keyboard warriors take to Twitter to bash me: I think Rust has done an excellent job of getting the conversation going on this issue, and while Bolt may never hit the mainstream, it demonstrates an alternative approach.

If you look at the pipeline above, you can see that Bolt has lexing, parsing, and desugaring/lowering phases. It also contains a couple of Protobuf serialization and deserialization steps: these exist purely to pass data between the OCaml and C++ parts of the compiler. It targets LLVM IR, then links in a couple of runtime libraries (pthreads and libc), and finally outputs an object file, a binary containing the machine code.

Unlike most compilers, however, Bolt has not one type-checking phase, but two! Bolt has traditional types and also capabilities; informally, these are another set of types used to check for data races. I’ve written a paper that discusses this more formally: if you’re interested in the theory, check it out, and if not, you can skip the data-race checking posts in this series. We type-check the traditional types, simplify the language by desugaring, and then do the data-race type checking.

What about this series?

There are two ways to think about this series. First, it’s about language design: we’ll compare Bolt to languages like Java and C++ along the way. Second, it’s a practical step-by-step tutorial for building your own compiler. Unlike many build-your-own-compiler tutorials, which show you how to build a toy language, this series explores the topics that form the basis of concurrent object-oriented languages like Java: how classes are implemented, how inheritance works, generic classes, and even how concurrency is implemented under the hood.

Bolt also doesn’t emit toy instructions; it targets LLVM IR. In practice, this means Bolt can hook into the amazing optimizations available in C/C++ compilers. The LLVM API is powerful, but its documentation is hard to navigate. I’ve spent many long nights reverse-engineering C++ programs; hopefully this series will save at least one person from that pain.

In the next post, we’ll look at the practicalities of building a compiler project: I’ll introduce the Bolt repository and explain why, of all languages, we use OCaml for the front end.


If you enjoyed this article, please consider sharing it with your network. If you have any questions, tweet at me and I’ll answer 🙂 I’ll also tweet out new articles when they’re published.

PS: I’ll also share helpful tips and links along the way, so you can get them before they even make it into the articles.

