This article is a translation of the -super-tiny-Compiler repository. The original article (code) : github.com/jamiebuilds…

Today we’re going to start writing a compiler, but not a compiler as we normally call it, but a super, super small compiler, so small that if you took out all the comments in this file, the real code would only be 200 + lines.

We’re going to compile lisp-style function calls into C-style function calls, so if you’re not familiar with these two, let me give you a quick overview.

If we had two functions: Add and subtract, they would look something like this:

               LISP                      C

2 + 2          (add 2 2)                 add(2, 2)
4 - 2          (subtract 4 2)            subtract(4, 2)
2 + (4 - 2)    (add 2 (subtract 4 2))    add(2, subtract(4, 2))
Copy the code

Isn’t that easy?

Good, that’s what we’re compiling. This isn’t a complete LISP or C syntax, but it’s enough to show us the main parts of a modern compiler.

Most compilers fall into three main stages: Parsing, Transformation, and Code Generation.

Parsing converts source code to a more abstract representation of code;

Transformation does whatever it wants to do to this abstract code representation.

3.Code Generation abstracts the finished Code to generate new Code;

Parsing

Parsing is usually divided into two stages: lexical analysis and grammatical analysis.

1. Lexical analysis uses something called a tokenizer to cut source code into pieces called tokens;

Tokens are an array where each item is the smallest object that describes a single block of syntax. They can be numbers, labels, punctuation, operators, and so on.

2. Parsing recombines markup to describe each part of a grammar and to establish the relationships between them. This is often referred to as an “abstract syntax tree.”

An abstract syntax tree (AST for short) is a deeply nested object that represents code in a way that is simple and tells us a lot.

For the following syntax:

(add 2 (subtract 4 2))

The token list looks like this:

[{type: 'paren'.value: '('        },
     { type: 'name'.value: 'add'      },
     { type: 'number'.value: '2'        },
     { type: 'paren'.value: '('        },
     { type: 'name'.value: 'subtract' },
     { type: 'number'.value: '4'        },
     { type: 'number'.value: '2'        },
     { type: 'paren'.value: ') '        },
     { type: 'paren'.value: ') '},]Copy the code

The AST looks like this:

{
     type: 'Program'.body: [{
       type: 'CallExpression'.name: 'add'.params: [{
         type: 'NumberLiteral'.value: '2'}, {type: 'CallExpression'.name: 'subtract'.params: [{
           type: 'NumberLiteral'.value: '4'}, {type: 'NumberLiteral'.value: '2',}]}]}]}Copy the code

Transformation

The next stage of the compiler is transformation. Again, this stage simply takes the AST generated in the previous stage and makes some modifications, either keeping the original language or translating it into a completely new language.

Let’s look at how to transform an AST.

You may have noticed that the elements in our AST look very similar. These objects all have a Type attribute. Each node is called an AST node, and each node has properties defined to describe a part of the tree.

We can create a node for NumberLiteral:

{
     type: 'NumberLiteral'.value: '2',}Copy the code

Or create a node for CallExpression:

 {
     type: 'CallExpression'.name: 'subtract'.params: [... nested nodes...] ,}Copy the code

When transforming an AST, we can manipulate nodes by adding, removing, and replacing properties, we can add new nodes, or we can create a new AST directly from the existing AST, regardless of the existing AST.

Because our target is a new language, we will create a brand new AST based on the target language.

Traversal

In order to walk through all the nodes, we need to be able to traverse them, reaching each node in a depth-first fashion.

{
     type: 'Program'.body: [{
       type: 'CallExpression'.name: 'add'.params: [{
         type: 'NumberLiteral'.value: '2'
       }, {
         type: 'CallExpression'.name: 'subtract'.params: [{
           type: 'NumberLiteral'.value: '4'
         }, {
           type: 'NumberLiteral'.value: '2'}}}}]]]Copy the code

For the AST above, we will visit in turn:

1.Program – Starts at the top level of the AST

2.CallExpression (add) – Moves to the first element of the Program’s body list

3.NumberLiteral (2) – Moves to the first element of the params list of CallExpression

4.CallExpression (Subtract) – Move to the second element of the list of params for CallExpression

5.NumberLiteral (4) – Move to the first element of the list of params in CallExpression (subtract)

6.NumberLiteral (2) – Move to the second element of the list of params in CallExpression (subtract)

If we had manipulated the AST directly, rather than recreating one, we might have introduced various abstractions here. But actually visiting each node of the tree is enough for us.

I use the word “visiting” because there is a pattern of how to represent operations on elements of an object’s structure.

Visitors

The basic idea is to create a visitor object that provides methods to accept different node types.

var visitor = {
    NumberLiteral() {},
    CallExpression(){}};Copy the code

As we traverse the AST, each time we encounter a matching node, we call the method on that accessor that corresponds to the node type.

To make these methods more useful, we pass in two parameters, the node currently traversed, and its parent node.

var visitor = {
    NumberLiteral(node, parent) {},
    CallExpression(node, parent){}};Copy the code

However, there is also the possibility of needing access when exiting. Imagine our previous tree structure as a list:

   - Program
     - CallExpression
       - NumberLiteral
       - CallExpression
         - NumberLiteral
         - NumberLiteral
Copy the code

When we walk down, it’s easy to end up on a branch, and when we walk through a branch we exit it, so we “enter” each node on the way down and “exit” each node on the way up.

   -> Program (enter)
     -> CallExpression (enter)
       -> Number Literal (enter)
       <- Number Literal (exit)
       -> Call Expression (enter)
          -> Number Literal (enter)
          <- Number Literal (exit)
          -> Number Literal (enter)
          <- Number Literal (exit)
       <- CallExpression (exit)
     <- CallExpression (exit)
   <- Program (exit)
Copy the code

To support this, the final accessor looks like this:

var visitor = {
    NumberLiteral: {
        enter(node, parent) {},
        exit(node, parent){},}};Copy the code

Code Generation

The last stage of the compiler is generating code. Sometimes the compiler does something that overlaps with the transformation, but most of the time generating code just means converting the AST back into a code string.

Code generators work in several different ways, some compilers will reuse previous tokens, others will create a separate code representation that will print nodes linearly, but as far as I know, most will just use the AST we just created, and we’ll do the same.

Our code generator actually knows how to print all the different types of nodes on the AST, and it recursively calls itself to print all the nested nodes until everything is printed into one long code string.

The subtotal

This is the compiler we are going to make, and it contains all the parts of a real compiler.

This does not mean that all compilers are the same as I described above, and each compiler may serve a different purpose, so they may have more steps than what I mentioned above.

But you should now have an overall basic understanding of most compilers.

Now that I’ve covered the compiler, can you write one yourself?

Just kidding, let me help you finish it.

Here we go…

Code implementation

Word segmentation is

We will start with the first stage of parsing, using a word divider for lexical analysis.

All we need to do is split the code string into a token array:

(add 2 (subtract 4 2))   =>   [{ type: 'paren', value: '(' }, ...]
Copy the code

The function takes a code string as an input parameter, and we do two things:

function tokenizer(input) {
    // The 'current' variable is like a cursor that tracks our current position in the code
    let current = 0;
    // 'tokens' array is used to store generated tokens
    let tokens = [];
    // We start by creating a while loop that updates current with the desired increment
    // This is done because it is possible to update current several times within a loop, since the length of a token is arbitrary
    while (current < input.length) {
        // The character of the current position
        let char = input[current];
        // The first thing to check is the left parenthesis' (', which is used after 'CallExpression', but for now we only care about characters
        // Check if it is left parenthesis:
        if (char === '(') {
            // If a match is found, add a token of type 'paren' and set its value to '('
            tokens.push({
                type: 'paren'.value: '('});/ / increment ` current `
            current++;
            // Skip the current loop and go to the next loop
            continue;
        }
        // Next check for close parenthesis') 'as before: match to close parenthesis, add a new token, increment current, and finally skip the current loop to the next loop
        if (char === ') ') {
            tokens.push({
                type: 'paren'.value: ') '}); current++;continue;
        }
        // Continue, the next thing we need to check is the whitespace character. Whitespace is used to separate characters, but it is not really important, so it is not added as a token
        // So we just check if whitespace is matched, and then skip
        let WHITESPACE = /\s/;
        if (WHITESPACE.test(char)) {
            current++;
            continue;
        }
        // The next token type is number. This is different from the previous ones, because numbers can have arbitrary lengths, we need to add the whole number as a token
        //
        // (add 123 456)
        / / ^ ^ ^ ^ ^ ^
        // There are six characters, but only two separate tokens
        //
        // When we encounter the first number in the sequence, we begin...
        let NUMBERS = / [0-9];
        if (NUMBERS.test(char)) {
            // Create a value variable to hold the entire number
            let value = ' ';
            // Then iterate through each character after that until a non-numeric character is encountered
            while (NUMBERS.test(char)) {
                // Concatenate the current number
                value += char;
                // Update current, move to next character
                char = input[++current];
            }
            // Then we add a token of type number
            tokens.push({ type: 'number', value });
            / / to continue
            continue;
        }
        // We also want to add support for strings, that is, any character wrapped in double quotes (")
        //
        // (concat "foo" "bar")
        // ^^^ ^^^ The token is a string
        //
        // Check the opening quotation mark (") :
        if (char === '"') {
            // Create a value variable to hold the token value
            let value = ' ';
            // Skip the opening double quotes
            char = input[++current];
            // Iterate over each subsequent character until the closing double quote is encountered
            while(char ! = ='"') {
                / / update the value
                value += char;
                // Move to the next character
                char = input[++current];
            }
            // Skip the closing double quotes
            char = input[++current];
            // 添加一个string类型的token
            tokens.push({ type: 'string', value });
            continue;
        }
        // There is one last token of type 'name', which is an alphabetic character, not a number, for function names in our Lisp syntax
        //
        // (add 2 4)
        //    ^^^
        // Name token
        //
        let LETTERS = /[a-z]/i;
        if (LETTERS.test(char)) {
            let value = ' ';
            // Again, loop through all the following characters
            while (LETTERS.test(char)) {
                value += char;
                char = input[++current];
            }
            // Add a token of type 'name' and continue to the next loop
            tokens.push({ type: 'name', value });
            continue;
        }
        // Finally, if there are any characters that we haven't matched, that's a syntax error, we can't figure it out, so we just throw it wrong and abort the loop
        throw new TypeError('I dont know what this character is: ' + char);
    }
    // Finally, our tokenizer just returns the token list
    return tokens;
}
Copy the code

The parser

What the parser needs to do is convert the token list to an AST:

[{ type: 'paren', value: '(' }, ...]  => { type: 'Program', body: [...] }Copy the code

Define a parser function that takes a list of tokens as arguments:

function parser(tokens) {
    // Again, we maintain a 'current' variable as a cursor
    let current = 0;
    // But here we will use recursion instead of the while loop to define a recursive function
    function walk() {
        // Get and save the token of the current location
        let token = tokens[current];
        // We will divide each type of token into different code paths, starting with tokens of type 'number'
        //
        // Check whether it is a token of type 'number'
        if (token.type === 'number') {
            // If so, increment current
            current++;
            // Returns a new AST node of type 'NumberLiteral' whose value is the token's value
            return {
                type: 'NumberLiteral'.value: token.value,
            };
        }
        // The string type is the same as the number type, creating a StringLiteral node and returning it
        if (token.type === 'string') {
            current++;
            return {
                type: 'StringLiteral'.value: token.value,
            };
        }
        // Next, we are looking for 'CallExpressions', which starts when we encounter the left parenthesis
        if (
            token.type === 'paren' &&
            token.value === '('
        ) {
            // increment current, skip the open parenthesis because it is not needed in the AST
            token = tokens[++current];
            // Create a base 'CallExpression' node and then set the value to the value of the current token, since the function name is immediately to the right of the left parenthesis
            let node = {
                type: 'CallExpression'.name: token.value,
                params: [],};// increment current skips the function name token
            token = tokens[++current];
            // the following node is then iterated as the parameter 'params' to the CallExpression' CallExpression ', until the close parenthesis is encountered
            //
            // This is where recursion comes in. We will rely on recursion to resolve a set of nodes that may be nested indefinitely
            //
            // To explain this, let's look at the Lisp code again. You can see that the 'add' method has a numeric parameter and a nested 'CallExpression', which also has two numeric parameters:
            //
            // (add 2 (subtract 4 2))
            //
            // You'll also notice multiple close parentheses in the token list:
            //
            / / /
            // { type: 'paren', value: '(' },
            // { type: 'name', value: 'add' },
            // { type: 'number', value: '2' },
            // { type: 'paren', value: '(' },
            // { type: 'name', value: 'subtract' },
            // { type: 'number', value: '4' },
            // { type: 'number', value: '2' },
            // {type: 'paren', value: ')'}, <<< close parenthesis
            // {type: 'paren', value: ')'}, <<< close parenthesis
            / /]
            //
            // We will rely on the nested 'walk' function to increment 'current' until after all the 'CallExpression'

            // So we create a 'while' loop that recursively calls' walk 'until we encounter a close parenthesis
            // In this case, you can use recursion if a task can be broken down into smaller subtasks, and the logic of the subtask is the same as that of the larger task. In this case, the parameter of add can be any type, it can be a number, it can be a string, it can be another function, The other function has the same problem as add, so it goes directly to the recursive function. For Add, you just return the AST node.
            while( (token.type ! = ='paren') ||
                (token.type === 'paren'&& token.value ! = =') ')) {// Call the recursive function, which returns an AST node to add to the current 'params' list
                node.params.push(walk());
                token = tokens[current];
            }
			// increment current to skip the closing parenthesis
            current++;
            // returns the node
            return node;
        }
		// Similarly, if we encounter a token that we cannot recognize, we will throw the wrong token
        throw new TypeError(token.type);
    }
    // Create a root node of 'Program'
    let ast = {
        type: 'Program'.body: [],};// Start a loop to add nodes to the 'ast.body' array
    
    // Loops are used here because there can be multiple juxtaposed 'CallExpression'
    //
    // (add 2 2)
    // (subtract 4 2)
    //
    while (current < tokens.length) {
        ast.body.push(walk());
    }
    // Return ast
    return ast;
}
Copy the code

traverse

Now that we have an AST, we want to be able to access different types of nodes through accessors. We need to be able to call methods on accessors when we encounter nodes of a matching type.

traverse(ast, {
     Program: {
       enter(node, parent) {
         // ...
       },
       exit(node, parent) {
         // ...}},CallExpression: {
       enter(node, parent) {
         // ...
       },
       exit(node, parent) {
         // ...}},NumberLiteral: {
       enter(node, parent) {
         // ...
       },
       exit(node, parent) {
         // ...,}}});Copy the code

So we define a traverser function that takes an AST and an accessor, and then two more functions inside…

function traverser(ast, visitor) {
    // The 'traverseArray' function is used to traverse the array and the 'traverseNode' function defined below is called
    function traverseArray(array, parent) {
        array.forEach(child= > {
            traverseNode(child, parent);
        });
    }
    // 'traverseNode' takes a node and its parent
    function traverseNode(node, parent) {
        // First check whether the accessor has a method for the matched 'type'
        let methods = visitor[node.type];
        // If the 'Enter' method exists, it is called, passing in the current node and the parent node
        if (methods && methods.enter) {
            methods.enter(node, parent);
        }
        // Then we will handle it according to the type type
        switch (node.type) {
                // Start with the top-level node 'Program'. Since the attribute 'body' of the Program node is an array type, call the 'traverseArray' method to traverse
                // (Remember that the 'traverseArray' method calls the 'traverseNode' in turn, so it traverses the tree recursively)
            case 'Program':
                traverseArray(node.body, node);
                break;
                // The CallExpression 'type is the same, except that it iterates through its' params' attribute
            case 'CallExpression':
                traverseArray(node.params, node);
                break;
                // NumberLiteral and StringLiteral nodes have no children, so they are skipped directly
            case 'NumberLiteral':
            case 'StringLiteral':
                break;
                // Again, if there is a node that we cannot recognize, we will throw an error
            default:
                throw new TypeError(node.type);
        }
        // If there is an 'exit' method, call it here, passing in 'node' and its' parent '
        if(methods && methods.exit) { methods.exit(node, parent); }}// Finally we call the 'traverseNode' to enable the traversal, pass the AST, pass null because the top-level node has no 'parent'
    traverseNode(ast, null);
}
Copy the code

This method is essentially a depth-first traversal of the tree, followed by a call to the visitor’s Enter method at the previous traversal position and a call to the visitor’s exit method at the subsequent traversal position.

conversion

Next, the Transformer, which passes the AST we built along with a traverser visitor to the traverser function and returns a new AST.

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - original AST | AST after conversion ---------------------------------------------------------------------------- { | { type: 'Program', | type: 'Program', body: [{ | body: [{ type: 'CallExpression', | type: 'ExpressionStatement', name: 'add', | expression: { params: [{ | type: 'CallExpression', type: 'NumberLiteral', | callee: { value: '2' | type: 'Identifier', }, { | name: 'add' type: 'CallExpression', | }, name: 'subtract', | arguments: [{ params: [{ | type: 'NumberLiteral', type: 'NumberLiteral', | value: '2' value: '4' | }, { }, { | type: 'CallExpression', type: 'NumberLiteral', | callee: { value: '2' | type: 'Identifier', }] | name: 'subtract' }] | }, }] | arguments: [{ } | type: 'NumberLiteral', | value: '4' ---------------------------------- | }, { | type: 'NumberLiteral', | value: '2' |}] (sorry, the right of the longer) | | | |}}}]} -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -Copy the code

So our Transformer function will take a Lisp AST as an argument:

(Translator’s Note: Back to the above comparison, the CallExpression node has the same type as before, but has changed the name attribute to callee, and the parameter list has changed from params to arguments. Finally, if the parent of the CallExpression node is not the CallExpression node, an ExpressionStatement node will be created to wrap it, so the transformation process is like this. We first create a new AST root node, but we iterate over the old AST. So how can you add a node to a new AST? You can reference the list properties of the new AST by creating an attribute on the old AST node, so that you can add nodes to the list of the new tree while traversing the old tree.

function transformer(ast) {
    // The new AST, like the previous AST, also has a Program node
    let newAst = {
        type: 'Program'.body: [],};// Next I'll make a small change, add a 'context' property to the parent node, and then add each node to its parent's 'context'. Normally you'll have a better abstraction, but for our purposes it's simpler
    //
    Note that the context property in the old AST is just a reference to the new AST property
    ast._context = newAst.body;

    // Next call the traverser method, passing in the AST and an accessor object
    traverser(ast, {
        // The first visitor receives a node of type 'NumberLiteral
        NumberLiteral: {
            / / to enter
            enter(node, parent) {
                // Create a new 'NumberLiteral' node and add it to the parent node's context
                parent._context.push({
                    type: 'NumberLiteral'.value: node.value, }); }},// Next comes' StringLiteral '
        StringLiteral: {
            enter(node, parent) {
                parent._context.push({
                    type: 'StringLiteral'.value: node.value, }); }},// Then 'CallExpression'
        CallExpression: {
            enter(node, parent) {
                // Creates a new node 'CallExpression', which contains a nested 'Identifier' node
                let expression = {
                    type: 'CallExpression'.callee: {
                        type: 'Identifier'.name: node.name,
                    },
                    arguments: [],};// Next we define a new context attribute for the original 'CallExpression' node, referencing the arguments attribute of the newly created node, so that we can add parameters to the new node when iterating through the arguments of the old node
                node._context = expression.arguments;

                // Next check if the parent is a CallExpression node
                // 如果不是的话...
                if(parent.type ! = ='CallExpression') {
                    // create a bit of an 'ExpressionStatement' node to wrap around the 'CallExpression' node. This is done because the top layer of 'CallExpression' is actually a statement in JavaScript
                    expression = {
                        type: 'ExpressionStatement'.expression: expression,
                    };
                }

                // Finally, the (possibly wrapped) 'CallExpression' node is added to the parent's 'context'parent._context.push(expression); }}});// At the end of the function returns the newly created AST
    return newAst;
}
Copy the code

The generated code

Now let’s look at the final stage: generating code.

Our code generator recursively calls itself, printing each node in the tree into a giant character.

function codeGenerator(node) {
    // We will process each node type separately
    switch (node.type) {
            // If it is a 'Program' node, traverse its' body 'list, call the codeGenerator method for each node, and concatenate them with a newline
        case 'Program':
            return node.body.map(codeGenerator)
                .join('\n');

            // For the 'ExpressionStatement' node, call its expression node, call the codeGenerator method method for each node, and then add a semicolon...
        case 'ExpressionStatement':
            return (
                codeGenerator(node.expression) +
                '; ' / / < < (... It is standard to add a semicolon to the end of a statement.
            );

            // For the 'CallExpression' node, what we print is' callee ', then concatenate an open parenthesis, then loop through each node of the parameter 'arguments', call the codeGenerator method to convert them into strings, concatenate them with commas, and finally add a close parenthesis
        case 'CallExpression':
            return (
                codeGenerator(node.callee) +
                '(' +
                node.arguments.map(codeGenerator)
                .join(', ') +
                ') '
            );

            // For the 'Identifier' node, simply return the value of the name attribute
        case 'Identifier':
            return node.name;

            // For the 'NumberLiteral' node, return the value of its value attribute
        case 'NumberLiteral':
            return node.value;

            // For the 'StringLiteral' node, double quotation marks are required around its value
        case 'StringLiteral':
            return '"' + node.value + '"';

            // If an unrecognized node is encountered, an error is thrown
        default:
            throw new TypeError(node.type); }}Copy the code

The final compiler ~

Finally, let’s create a Compiler function that strings together all the above processes:

1. input  => tokenizer   => tokens
2. tokens => parser      => ast
3. ast    => transformer => newAst
4. newAst => generator   => output
Copy the code
function compiler(input) {
  let tokens = tokenizer(input);
  let ast    = parser(tokens);
  let newAst = transformer(ast);
  let output = codeGenerator(newAst);

  // Return the code generated result
  return output;
}

Copy the code

You’re done

Now, let’s export all of the above functions:

module.exports = {
    tokenizer,
    parser,
    traverser,
    transformer,
    codeGenerator,
    compiler,
};
Copy the code

conclusion

Too many comments may affect the reading of the code, you can read the exclusive version of github.com/wanglin2/th… .