A super miniature compiler

primers

When I read The Babel document, I came across The Super Tiny Compiler, and The annotation in it was quite easy to understand. Please translate and record it.

Origin
My GitHub

Why care about compilers

Most people don’t actually have to think about compilers in their daily work, and it’s normal not to pay attention to compilers. However, compilers are very common around you, and many of the tools you use are borrowed from the compiler concept.

Compilers are scary

Compilers are really scary. But it’s our fault (the people who wrote the compiler) that we took away the simplicity and the reasonableness and made it so horribly complex that most people thought it was completely inaccessible and only nerds could understand it.

Where should I start?

Start by writing the simplest possible compiler. The compiler is very small, only about 200 lines of code if you remove all the comments.

We are going to write a compiler that will convert some Lisp method calls into the C language method calls.

If you are not familiar with the language, I will briefly introduce it.

If we have two methods add and subtract, they will write like this:

example	LISP	C
2 + 2	(add 2 2)	add(2, 2)
4-2	(subtract 4 2)	subtract(4, 2)
2 plus 4 minus 2.	(add 2 (subtract 4 2))	add(2, subtract(4, 2))

Easy, right? Good, that’s exactly what we’re going to compile. While this is not the full Lisp or C syntax, the syntax is sufficient to illustrate the main parts of most modern compilers.

Three phases of compilation

Most compilations can be broken down into three main phases: Parsing, Transformation, and Code Generation.

Parsing: Take each line of code and turn it into more abstract code.
Transformation: Get the abstract code and do what the compiler intended.
Code generation: Take the translated representation of code and transform it into new code.

Parsing

Parsing is usually divided into two stages: Lexical Analysis and Syntactic Analysis.

Lexical analysis: Take the code and break it down into individual tokens using the tokenizer. A token is an array of objects that describe the various parts of the syntax. They may be numbers, text, punctuation marks, operators, and so on.
Syntactic analysis: Take those tokens and convert them into another representation that describes its own syntax and interrelationships. This is called an intermediate representation or Abstract Syntax Tree or AST. An abstract syntax tree is a deeply nested object that represents a way your code works, and it gives you a lot of information.

For example:

(add 2 (subtract 4 2))

The notation might look something like this:

[
  { type: 'paren',  value: '('        },
  { type: 'name',   value: 'add'      },
  { type: 'number', value: '2'        },
  { type: 'paren',  value: '('        },
  { type: 'name',   value: 'subtract' },
  { type: 'number', value: '4'        },
  { type: 'number', value: '2'        },
  { type: 'paren',  value: ')'        },
  { type: 'paren',  value: ')'        },
]

An abstract syntax tree might look like this:

{
  type: 'Program',
  body: [{
    type: 'CallExpression',
    name: 'add',
    params: [{
      type: 'NumberLiteral',
      value: '2',
    }, {
      type: 'CallExpression',
      name: 'subtract',
      params: [{
        type: 'NumberLiteral',
        value: '4',
      }, {
        type: 'NumberLiteral',
        value: '2',
      }]
    }]
  }]
}

Transformation

The next stage of compilation is transformation. This step simply takes the AST from the previous step and changes it again. You can manipulate the AST in the same language, or you can convert the AST to a completely new language.

Let’s see if we can transform an AST.

You might notice that some of the elements in our AST are very similar. There are several objects that have a type attribute, and each such object is called an AST Node. These nodes define the attributes of each individual part of the tree.

We have a NumberLiteral node:

{
  type: 'NumberLiteral',
  value: '2',
}

Or perhaps a CallExpression node:

{
  type: 'CallExpression',
  name: 'subtract',
  params: [...nested nodes go here...],
}

When transforming an AST, we can add/remove/replace node properties, we can add new nodes, remove nodes, or create a completely new AST based on an existing AST.

Since our target is a new language, we will create a completely new AST for the new language.

In order to find all the nodes, we need to Traversal them. The traversal reaches each node of the AST.

{
  type: 'Program',
  body: [{
    type: 'CallExpression',
    name: 'add',
    params: [{
      type: 'NumberLiteral',
      value: '2'
    }, {
      type: 'CallExpression',
      name: 'subtract',
      params: [{
        type: 'NumberLiteral',
        value: '4'
      }, {
        type: 'NumberLiteral',
        value: '2'
      }]
    }]
  }]
}

The AST above, we will traverse like this:

Program – starts at the top level of the AST
CallExpression (add) – Move to the first element in the Program’s body
NumberLiteral (2) – moves to the first element of the params of CallExpression(Add)
CallExpression(subtract) – Move to the second element of the params in callExpression (add)
NumberLiteral (4) – Move to the first element of params in CallExpression(subtract)
NumberLiteral (2) – Move to the second element of params in CallExpression(subtract)

If you were working directly with the AST, you might want to introduce various abstractions here. But what we’re trying to do, access to every node in the tree is enough.

The basic idea here for Visitors is to create a “visitor” object that has methods that can accept different types of nodes.

var visitor = {
  NumberLiteral() {},
  CallExpression() {},
};

As we traversal the AST, we will call this Visitor method as soon as we “enter” to a node of the matching type.

To make this idea work, we pass in a reference to a node and its parent node.

var visitor = {
  NumberLiteral(node, parent) {},
  CallExpression(node, parent) {},
};

Then there is the possibility of an “exit”. Imagine our tree structure like this:

- Program
  - CallExpression
    - NumberLiteral
    - CallExpression
      - NumberLiteral
      - NumberLiteral

Eventually, as we traverse, we reach a dead end. So when we finish traversing each branch of the tree, we “exit.” So we walk down the tree, “enter” to the tree node, and when we return, we “exit.”

-> Program (enter)
  -> CallExpression (enter)
    -> Number Literal (enter)
    <- Number Literal (exit)
    -> Call Expression (enter)
      -> Number Literal (enter)
      <- Number Literal (exit)
      -> Number Literal (enter)
      <- Number Literal (exit)
    <- CallExpression (exit)
  <- CallExpression (exit)
<- Program (exit)

To support this functionality, our final visitor will look like this:

var visitor = {
  NumberLiteral: {
    enter(node, parent) {},
    exit(node, parent) {},
  }
};

Code Generation

The final stage of compilation is code generation. Sometimes the compiler does something that overlaps with the conversion at this stage, but most of the code generation just means taking the AST and converting it to string code.

There are several different ways code generation works, some compilers reuse tokens, some create a single code representation so they can print nodes linearly, but from what I’ve heard, most will use the AST we just created, which is what we’ll be focusing on.

Our code generator will effectively know how to “print” all the different node types of the AST, and it will recursively call itself to print the nested nodes until it prints everything into a long string of code.

In this way! These are all the different parts of the compiler. Not every compiler is like the one I’ve described here. Compilers are used for different purposes, and they may require more steps than I have described. But by now you should have a better idea of what most compilers look like.

Now, I’ve explained so much, you should be able to write your own compiler pretty well, right? Just kidding. That’s what I’m here to help. So let’s get started!

Compiler example

See the compiler – js.

The resources

the-super-tiny-compiler
Let’s Build A Simple Interpreter