1, an overview of the

Today we are going to learn how to develop a compiler, but this compiler does not say that everything can be compiled, it is just a super small compiler, mainly used to explain some basic principles of the compiler.

Our compiler can compile lisp-like function calls into C-like function calls. If you’re not familiar with lisp or C, it doesn’t matter which language it is, but I’ll give you a quick introduction.

Subtract If we have two functions called add and subtract, we can use them to calculate the following expression:

2 + 2
4 - 2
2 + (4 - 2)Copy the code

So in Lisp it might look like this:

(add 2 2) / / 2 + 2
(subtract 4 2) / / 4-2
(add 2 (subtract 4 2)) // 2 + (4-2)
Copy the code

In C it looks like this:

add(2.2)
subtract(4.2)
add(2, subtract(4.2))
Copy the code

Pretty simple, right?

Well, that’s because that’s just the case our compiler needs to handle. This is not the complete syntax of list, nor is it the complete syntax of C. But this syntax is enough to illustrate most of what modern compilers do.

Most of what compilers do can be broken down into three main steps: parsing, transformation, and code generation.

  1. Parsing. Parsing is the translation of raw code into an abstract representation of code.
  2. Conversion. The transformation takes this abstract representation and does whatever the compiler wants.
  3. Code generation. Code generation is the transformation of an abstract representation into new code.

2,

Parsing is usually divided into two stages: lexical analysis and syntactic analysis.

  1. Lexical analysis. Lexical analysis usually uses a marker (or lexical analyzer) to break the raw code into things called tags. Tags are arrays of tiny objects that describe isolated snippets of syntax, such as numbers, labels, punctuation marks, operators, and so on.
  2. Grammatical analysis. Parsing reformats the resulting markup from lexical analysis into a representation that describes each part of the grammar and its relationships to each other. This is called an intermediate representation or abstract syntax tree (AST). An abstract syntax tree (AST for short) is a deeply nested object for expressing code in a form that is both user-friendly and informative.

For the following syntax:

(add 2 (subtract 4 2))
Copy the code

The tag might look like this:

[{type: 'paren'.value: '('        },
      { type: 'name'.value: 'add'      },
      { type: 'number'.value: '2'        },
      { type: 'paren'.value: '('        },
      { type: 'name'.value: 'subtract' },
      { type: 'number'.value: '4'        },
      { type: 'number'.value: '2'        },
      { type: 'paren'.value: ') '        },
      { type: 'paren'.value: ') '},]Copy the code

Then its corresponding abstract syntax tree (AST) might look something like this:

{
  type: 'Program'.body: [{
    type: 'CallExpression'.name: 'add'.params: [{
      type: 'NumberLiteral'.value: '2'}, {type: 'CallExpression'.name: 'subtract'.params: [{
        type: 'NumberLiteral'.value: '4'}, {type: 'NumberLiteral'.value: '2',}]}]}]}Copy the code

3, conversion,

After parsing, the compiler’s next step is the transformation. Again, this is just taking the abstract Syntax tree (AST) from the last step and making some changes to it. These changes can vary from being made in the same language to simply converting an abstract syntax tree into a completely different new language.

Let’s look at how we will transform an abstract syntax tree (AST).

You may have noticed that our abstract syntax tree has some very similar elements. These element objects have a type attribute. Each of these object elements is called an AST node. Properties defined on these nodes are used to describe a separate part of the AST tree.

We can create a node for numberLiterals:

{
  type: 'NumberLiteral',
  value: '2',}Copy the code

Or to create a node for calling expressions (CallExpression) :

{
  type: 'CallExpression',
  name: 'subtract',
  params: [...nested nodes go here...],
}
Copy the code

When converting an AST tree, we may need to add, remove, replace, and so on. We can add new nodes, delete nodes, or we can just throw the AST tree away and create a whole new AST based on it.

Since the goal of our compiler is to convert Lisp to C, we will focus on creating a new AST specifically for the target language (in this case C).

3.1 traversal

To navigate through all of these nodes, we need to be able to walk through them. This traversal is depth-first access to each node of the AST.

{
  type: 'Program'.body: [{
    type: 'CallExpression'.name: 'add'.params: [{
      type: 'NumberLiteral'.value: '2'
    }, {
      type: 'CallExpression'.name: 'subtract'.params: [{
        type: 'NumberLiteral'.value: '4'
      }, {
        type: 'NumberLiteral'.value: '2'}}}}]]]Copy the code

So for the AST above, we need to go like this:

  1. Program – Starts at the top of the AST tree.
  2. CallExpression (add) – Moves to the first element of the body attribute of the Program.
  3. NumberLiteral (2) – Moves to the first parameter of CallExpression(Add).
  4. CallExpression(Subtract) – Moves to the second parameter of CallExpression(add).
  5. NumberLiteral (4) – Moves the first parameter to CallExpression(subtract).
  6. NumberLiteral (2) – Moves to the second parameter of CallExpression(subtract).

If we were to manipulate this AST directly rather than create a separate AST, we might need to introduce various abstractions here. But what we’re trying to do is just access every node in the tree.

The reason to use the word “access” is because it is a good way to describe how elements are manipulated on an object structure.

3.2 visitors

The basic idea here is that we create a visitor object that has methods that can accept different node types.

Like this:

var visitor = {
  NumberLiteral() {},
  CallExpression() {},
};
Copy the code

As we traverse the AST, whenever we encounter a node that matches the specified type, we call the method on the visitor object.

To make this function useful, we pass it the node and its parent:

var visitor = {
  NumberLiteral(node, parent) {},
  CallExpression(node, parent) {}};Copy the code

However, there is also the possibility of calling something on exit. Imagine the tree structure we mentioned earlier:

    - Program
      - CallExpression
        - NumberLiteral
        - CallExpression
          - NumberLiteral
          - NumberLiteral
Copy the code

As we go down, we come to the final branch. We exit when we have accessed all branches. So if we go down the tree, we enter the node, and then when we go back up we exit the node.

 -> Program (enter)
  -> CallExpression (enter)
    -> Number Literal (enter)
    <- Number Literal (exit)
    -> Call Expression (enter)
       -> Number Literal (enter)
       <- Number Literal (exit)
       -> Number Literal (enter)
       <- Number Literal (exit)
    <- CallExpression (exit)
  <- CallExpression (exit)
 <- Program (exit)
Copy the code

To support this approach, our visitor object needs to look like this:

var visitor = {
  NumberLiteral: {
    enter(node, parent) {},
    exit(node, parent) {},
  }
};
Copy the code

4. Code generation

The last step in the compiler is code generation. Sometimes the compiler repeats some of the things that you can come to the conclusion of the transformation step. But for code generation, most of what it does is stringify our AST tree, which translates into string output.

Code generation works in a variety of ways, some compilers reuse previously generated tags, others create separate representations of the code to print the nodes linearly, but as far as I can tell, the strategy of most compilers is to use the AST we just created, and that’s what we’ll focus on.

In fact, our code generator will know how to print all the different node types of the AST, and it will recursively call itself to print nested nodes until it prints everything in one long string of code.

And that’s it! That’s all the difference in the compiler.

Now it’s not as if every compiler looks exactly like what I’ve described here. Compilers have many different uses, and they may require more detailed steps than I did.

But now you should have an overall high-level idea of what most compilers look like.

Now that I’ve explained all this, you should be able to write your own compiler, right?

Just kidding, I’m here to help, so here we go!

5. Compiler code implementation

As mentioned earlier, the entire compiler can be roughly divided into three steps: parsing, transformation, and code generation. Parsing can be divided into two steps: lexical parsing and syntactic parsing. So it takes a total of four functions to do that. Let’s take a look at each:

5.1. Implementation of parsing

5.1.1 Tokenizer implementation of lexical parsing

We’ll start with the compiler’s first step, parsing, using the Tokenizer function for lexical analysis.

We’ll break up the string code into arrays of tags:

(add 2 (subtract 4 2))   =>   [{ type: 'paren', value: '(' }, ...]
Copy the code

Our Tokenizer takes a code string and then does two things:

function tokenizer(input) {

  // A current variable, like a cursor, keeps track of where we are in the code string
  let current = 0;

  // And an array of tockens to store the tags we decomposed
  let tokens = [];

  // We create a while loop where we set our current variable, which increases as the loop progresses
  //
  // This is done because tockens can be of any length
  while (current < input.length) {

    // We will also store the character of the current variable
    let char = input[current];

    // The first thing we need to check is the left bracket, this will be used for CallExpression later, but for now we only care about the left bracket character
    //
    // Let's check for left parentheses:
    if (char === '(') {

      // Create an object whose type property is Paren and value is left bracket. Then add this object to the tokens array
      tokens.push({
        type: 'paren'.value: '('});// Then we add the current variable, i.e. move the cursor
      current++;

      // Then proceed to the next loop.
      continue;
    }

    Check the right parenthesis, add a mark, add current, and go through the next loop
    if (char === ') ') {
      tokens.push({
        type: 'paren'.value: ') '}); current++;continue;
    }

    // Next, we check the whitespace. This is interesting because we care about whitespace because it separates the string, but we don't need to store whitespace as a tag, we
    // We can just throw it away, so here we just check to see if the blank space exists, and if it does we go to the next loop

    let WHITESPACE = /\s/;
    if (WHITESPACE.test(char)) {
      current++;
      continue;
    }

    The next type of tag is a number, which is different from what we saw earlier, because a number can be any number of characters, and we need to capture the entire character sequence as a tag
    //
    // (add 123 456)
    / / ^ ^ ^ ^ ^ ^
    // For example, there are only two independent number tags
    //
    // So when we hit the first number in the sequence, we start processing further.
    let NUMBERS = / [0-9];
    if (NUMBERS.test(char)) {

      // We create a value character in this to concatenate numeric characters
      let value = ' ';

      // Then we iterate through each character until we encounter a non-numeric character, concatenate the character with the preceding value variable, and change the current cursor
      while (NUMBERS.test(char)) {
        value += char;
        char = input[++current];
      }

      // After that we will create digital tokens and add the tokens array
      tokens.push({ type: 'number', value });

      // Then we continue
      continue;
    }

    // We also support strings, which are a piece of text wrapped in double quotation marks ("), e.g
    //
    // (concat "foo" "bar")
    // ^^^ ^^^ string token
    //
    // Check the left double quotation mark:
    if (char === '"') {
      // Create a value variable to hold the string.
      let value = ' ';

      // We will ignore the double quotation marks because we care about the text wrapped in double quotation marks.
      char = input[++current];

      // Then we iterate through the following string until we hit the right double quotation mark
      while(char ! = ='"') {
        value += char;
        char = input[++current];
      }

      // Ignore the close double quotation mark, again, because we care about the text enclosed by double quotation marks.
      char = input[++current];

      // Create a tag of type string and put it into the tockens array
      tokens.push({ type: 'string', value });

      continue;
    }

    // The last type of tag is the name tag, which is a string of characters instead of numbers and is the name of a function in Lisp syntax
    //
    // (add 2 4)
    //    ^^^
    / / the name tag
    //
    let LETTERS = /[a-z]/i;
    if (LETTERS.test(char)) {
      let value = ' ';

      // Similarly, we iterate over and splice them together
      while (LETTERS.test(char)) {
        value += char;
        char = input[++current];
      }

      // And create a token of type name to store in the tokens array
      tokens.push({ type: 'name', value });

      continue;
    }

    // Finally, if we haven't matched a character by this point, we'll throw an error and exit
    throw new TypeError('I dont know what this character is: ' + char);
  }

   // At the end of tokenizer we return the tokens array
  return tokens;
}
Copy the code

For example, tokenizer results in lisp code (add 123 456) as follows:

5.1.2 Parser implementation of syntax parsing

The goal of parsing is to convert the tokens array into an AST. This is the following process:

[{ type: 'paren'.value: '('},... ] = > {type: 'Program'.body:[...]. }Copy the code

So we define a parse function that takes our array of tokens as an argument:

function parser(tokens) {

  // Again we maintain a current variable as a cursor
  let current = 0;

  // But this time we are using recursion instead of the while loop, so we define the walk function
  function walk() {

    // Inside the walk function, we first get the tokens stored in the current index of the tokens array
    let token = tokens[current];

    // We will store each type of markup in another structural relation to reflect the syntactic relation
    // Start with digital tokens
    //
    // We check to see if there are any digital tokens
    if (token.type === 'number') {

      // If so, we move the cursor
      current++;

      // And we'll return a new AST node called "NumberLiteral" and set its value property to the value property of our tag object
      return {
        type: 'NumberLiteral',
        value: token.value,
      };
    }

    // If we had a string token, we would create an AST node called "StringLiteral", just like a number
    if (token.type= = = 'string') {
      // Move cursor as well
      current++;

      return {
        type: 'StringLiteral',
        value: token.value,
      };
    }

    // Next we look for CallExpressions. We start the process with a left bracket
    if (
      token.type === 'paren' &&
      token.value === '('
    ) {

      // We will ignore the left parenthesis because in AST, AST is syntactic, so we don't care about the left parenthesis
      token = tokens[++current];

      // We create a base node called CallExpression and set the name of the node to the value attribute of the current tag,
      // Because the next tag in the left bracket tag is the function name
      let node = {
        type: 'CallExpression',
        name: token.value,
        params: [],
      };

      // We move the cursor, ignoring the name tag because the function name is already stored in CallExpression
      token = tokens[++current];

      // Then now we go through each label, finding the parameters of CallExpression until we hit the right bracket
      //
      // Now, this is where recursion comes in. To avoid getting bogged down in infinite nested node parsing, we do this recursively
      //
      // To better illustrate this, let's use our Lisp code as an example. You can see that add takes a number as well as a nested CallExpression,
      // This nested function call contains its own numeric arguments
      //
      // (add 2 (subtract 4 2))
      //
      // You can see from its tokens array that it has many right parentheses
      //
      / / /
      // { type: 'paren', value: '(' },
      // { type: 'name', value: 'add' },
      // { type: 'number', value: '2' },
      // { type: 'paren', value: '(' },
      // { type: 'name', value: 'subtract' },
      // { type: 'number', value: '4' },
      // { type: 'number', value: '2' },
      // {type: 'paren', value: ')'}, <<< right parenthesis
      // {type: 'paren', value: ')'}, <<< right parenthesis
      / /]
      //
      // We will rely on the nested walk function to increment our cursor

      // So we create a while loop that continues until we encounter a tag of type Paren whose value is a right parenthesis
      while((token.type! == 'paren') || (token.type === 'paren' && token.value ! = = ') ')) {// We will call the walk function, which will return a node, and we will put the returned node into the params of the current node
        // The nested relationships are reflected in the AST
        node.params.push(walk());
        token = tokens[current];
      }

      // Finally, we need to move the cursor one last time to ignore the right bracket
      current++;

      // And returns the node
      return node;
    }

    // Also, if we do not recognize the type of tag, we will throw an error
    throw new TypeError(token.type);
  }

  // Now that the walk function is defined, we need to define our AST tree, which has a "Program" root node:
  let ast = {
    type: 'Program',
    body: [],
  };

  // Then we need to start our walk function and put the AST node into the body array of the root node
  //
  // We do this inside the loop because we might have multiple function calls attached, such as this:
  //
  // (add 2 2)
  // (subtract 4 2)
  / / start to walk
  while (current < tokens.length) {
    ast.body.push(walk());
  }

  // At the end of the parsing function, we return the generated AST.
  return ast;
}
Copy the code

Despite the previous examples, the AST obtained after parsing is as follows:

5.2 Implementation of transformation

Now that we have our AST, we want a visitor to be able to access different nodes and call the methods on the visitor whenever a matching node type is found. So we define a traveler function that takes two arguments, the first an AST tree and the second a visitor. This visitor needs to implement some of the methods that different types of AST nodes need to call:

traverse(ast, {
  Program: {
    enter(node, parent) {
      //. },exit(node, parent) {
      //. }, }, CallExpression: { enter(node, parent) {//. },exit(node, parent) {
      //. }, }, NumberLiteral: { enter(node, parent) {//. },exit(node, parent) {
      //. ,}}});Copy the code
5.2.1 Traverser function implementation

Therefore, our traveler function is implemented as follows, which takes an AST and a visitor as arguments, and defines two methods in it:

function traverser(ast, visitor) {

  //Define a traverseArray function where we iterate over an array and then invoke the traverseNode function we define laterfunction traverseArray(array, parent) {
    array.forEach(child => {
      traverseNode(child, parent);
    });
  }

  //The traverseNode function takes an AST node and its parent, so it can also be passed to our visitor functionfunction traverseNode(node, parent) {

    //Let methods = visitor[node.type];//If the AST node type has an Enter method, we call it with the current node and its parent as argumentsif (methods && methods.enter) {
      methods.enter(node, parent);
    }

    //Switch (node.type) {switch (node.type) {//We start with the top-level node Program, which has a property called body, which is an array of AST nodes//We'll call the traverseArray function to recurse it//
      //Remember that traverseArray in turn calls the traverseNode function, so we make this AST accessed recursively'Program':
        traverseArray(node.body, node);
        break;

      //Next we do the same for the CallExpression nodes and access their arguments case'CallExpression':
        traverseArray(node.params, node);
        break;

      //For number nodes and string nodes, they don't have any child nodes, so we justbreak.
      case 'NumberLiteral':
      case 'StringLiteral':
        break;

      //And again, if the corresponding node type is not recognized, an error is thrown default: throw new TypeError(Node.type); }//If the visitor has oneexitMethod, which we will call with that node and its parent as argumentsexitmethodsif (methods && methods.exit) {
      methods.exit(node, parent); }}//Finally, we enable the traverser, which is performed by calling the traverseNode with the second argument null because the grading node itself has no parent. }Copy the code
5.2.2 Transformer function implementation

We have already written the traverser function, and the traverser function performs the primary traverser operation on a node through its second argument, the traverser. There is no implementation of the traverser, only the Enter and exit interfaces. What these two interfaces actually do is what the transition step really does. To do this, we define transformer functions.

The Transformer function receives the AST, passes it to the traverser function, and provides visitors to the traverser function within the Transformer function. The final Transformer function returns a newly created AST.

For example, in the previous example, the resulting AST and the transformed AST are as follows:

----------------------------------------------------------------------------
Original AST                     |   Transformed AST
----------------------------------------------------------------------------{| {type: 'Program'. |type: 'Program'.body: [{|body: [{
    type: 'CallExpression'. |type: 'ExpressionStatement'.name: 'add'. |expression: {
    params: [{|type: 'CallExpression'.type: 'NumberLiteral'. |callee: {
      value: '2'                 |           type: 'Identifier'}, {|name: 'add'
      type: 'CallExpression'|},name: 'subtract'. |arguments: [{
      params: [{|type: 'NumberLiteral'.type: 'NumberLiteral'. |value: '2'
        value: '4'|}, {}, {|type: 'CallExpression'.type: 'NumberLiteral'. |callee: {
        value: '2'               |             type: 'Identifier'|,}]name: 'subtract'| |}}],}]arguments: [{} |type: 'NumberLiteral'. |value: '4'-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - |}, {|type: 'NumberLiteral'. |value: '2'| | | |}}}}]] |}----------------------------------------------------------------------------
Copy the code

So the concrete implementation of our Transformer function is as follows:

function transformer(ast) {

  // we will create a newAst (i.e. NewAst) that is similar to our original AST with a Program root node
  let newAst = {
    type: 'Program',
    body: [],
  };

  // Next, we'll do some trickery. We define a \_context property on the parent node,
  // We put the node into the parent's \_context property
  // Usually you have better abstractions (maybe more complex), but here we are making things relatively simple
  //
  // You just need to remember that context is a reference from the old AST to the new AST
  ast._context = newAst.body;

  // We call the traverser function with the old AST and a visitor as arguments
  traverser(ast, {

    // The first visitor attribute is used to process NumberLiteral
    NumberLiteral: {
      // The node is accessed in the Enter method.
      enter(node, parent) {
        // Inside this we will create a new AST node, again named NumberLiteral
        // We will put this node into the \_context property of its parent
        parent._context.push({
          type: 'NumberLiteral', value: node.value, }); }},// StringLiteral follows
    StringLiteral: {
      enter(node, parent) {
        parent._context.push({
          type: 'StringLiteral', value: node.value, }); }},// Next is CallExpression
    CallExpression: {
      enter(node, parent) {

        // We create a new node, CallExpression, which has a nested identifier
        let expression = {
          type: 'CallExpression',
          callee: {
            type: 'Identifier',
            name: node.name,
          },
          arguments: [],
        };

        // Next, we define a new context on the original CallExpression node for reference
        // expression the arguments attribute on the variable
        // So we can add parameters
        node._context = expression.arguments;

        // Then we check whether the parent is a CallExpression node
        // If not
        if (parent.type ! = ='CallExpression') {

          // we will wrap the CallExpression node with an ExpressionStatement node
          // This is done because the top-level CallExpression nodes are actually statements
          That is, if the parent of a CallExpression node is not a CallExpression node
          // Then the CallExpression node should be a function declaration
          expression = {
            type: 'ExpressionStatement'.expression: expression}; }// finally we add this new CallExpression(possibly wrapped by ExpressionStatement)
        / / in the parent. _context
        parent._context.push(expression); }}});At the end of the Transformer function, we return the new AST we just created
  return newAst;
}
Copy the code

Let’s use the previous example to see what the newly created AST looks like:

5.3 implementation of code generation

Now let’s come to our final step: code generation. Our code generation function recursively calls itself to print its node into a large string. To complete the process from newAST to code:

newAst => generator   => output
Copy the code
5.3.1 Implementation of codeGenerator
function codeGenerator(node) {

  // We separate things according to the type of the node
  switch (node.type) {

    // If we have a Program node, we will traverse each node in the body and recursively call the codeGenerator for each node
    // function and concatenate their results with a newline character
    case 'Program':
      return node.body.map(codeGenerator)
        .join('\n');

    // For the ExpressionStatement node, we will call it on the expression node of the node
    // codeGenerator function, then we add a semicolon (i.e.;)
    case 'ExpressionStatement':
      return (
        codeGenerator(node.expression) +
        '; ' / / < < (... because we like to code the *correct* way)
      );

    // For CallExpression nodes, we print callee and start a parenthesis
    // We will iterate over the arguments property of the node and call the codeGenerator method for each property,
    // Separate their results with commas, followed by a right parenthesis
    case 'CallExpression':
      return (
        codeGenerator(node.callee) +
        '(' +
        node.arguments.map(codeGenerator)
          .join(', ') +
        ') '
      );

    For identifiers, we return the name of the node
    case 'Identifier':
      return node.name;

    // For the NumberLiteral node, we return its value attribute
    case 'NumberLiteral':
      return node.value;

    // For the StringLiteral node, we enclose the value of its value attribute in quotes
    case 'StringLiteral':
      return '"' + node.value + '"';

    // If no node is identified, we will throw an error
    default:
      throw newTypeError(node.type); }}Copy the code

Again using the above example, its output is shown in the following figure:

6. Implementation of compiler

Now that the code for the three major points of the compiler is in place, we can start implementing the compiler by linking the three points together, which can be described as follows:

1. input  => tokenizer   => tokens
2. tokens => parser      => ast
3. ast    => transformer => newAst
4. newAst => generator   => output
Copy the code

Our compiler code looks like this:

function compiler(input) {
  let tokens = tokenizer(input);
  let ast    = parser(tokens);
  let newAst = transformer(ast);
  let output = codeGenerator(newAst);

  // and simply return the output!
  return output;
}

Copy the code

Finally, as a module, we want others to use it, because each of our functions is a relatively independent function module, so we export each function in this module:

module.exports = {
  tokenizer,
  parser,
  traverser,
  transformer,
  codeGenerator,
  compiler,
};Copy the code

The original link: segmentfault.com/a/119000001…