Introduction to the-super-tiny-compiler

Two days ago, I happened upon a project on GitHub: the-super-tiny-compiler, which, according to its own introduction, is probably the simplest compiler there is. As it happens, I am taking the "Compilation Principles" course this semester, so my interest was immediately piqued. After a quick look: this project is a compiler that converts a Lisp-style expression into a C-style expression, covering lexical analysis, syntax analysis, AST traversal and transformation, and the final code output. Now let me walk you through a simple implementation.

Lexical analysis

Lexical analysis is also called tokenization (or scanning). It is the first step every compiler needs to perform: in simple terms, the source code to be transformed is taken apart into small units called tokens. Here is an example of breaking up a JavaScript statement:

let name = "touryung";

The statement above breaks down into [let, name, =, "touryung", ;], which makes the next steps easier. Because we are implementing the simplest possible compiler, the tokenizer only needs to recognize parentheses, whitespace, numbers, strings, and variable names.
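In token form, that statement could be represented roughly as follows (a sketch only: token types such as operator and punctuation are illustrative, and the toy tokenizer below does not actually handle let, = or ;):

// Hypothetical token objects for `let name = "touryung";` (illustration only)
const exampleTokens = [
  { type: "name", value: "let" },
  { type: "name", value: "name" },
  { type: "operator", value: "=" },      // not a type our toy tokenizer produces
  { type: "string", value: "touryung" },
  { type: "punctuation", value: ";" },   // not a type our toy tokenizer produces
];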

The overall framework

To implement a lexical analyzer (tokenizer), we first need to build a framework. The whole idea of lexical analysis is to iterate over the input string, identify the different tokens, and store them in a token array.

The framework is as follows, with the meaning of each part marked in the comments:

const WHITESPACE = /\s/; // whitespace
const NUMBERS = /[0-9]/; // digits
const LETTERS = /[a-z]/i; // letters (for variable names)

function tokenizer(input) {
  let current = 0; // The currently identified subscript
  let tokens = []; // The token array

  while (current < input.length) { // Traverse the input
    let char = input[current]; // The character currently traversed

    // Different token recognition operations

    throw new TypeError(`I dont know what this character is: ${char}`);
  }

  return tokens;
}

The next step is to identify different tokens.

Identify the brackets

Recognizing parentheses is simple: when the character we are currently looking at is a left or right parenthesis, we push an object describing the current token into the token array.

// Recognize the left parenthesis
if (char === "(") {
  tokens.push({ type: "paren".value: "(" }); // Press in the object that describes the current token
  current++;
  continue;
}
// Recognize the close parenthesis
if (char === ")") {
  tokens.push({ type: "paren".value: ")" });
  current++;
  continue;
}

Identify the blank space

This case is interesting: whitespace is virtually irrelevant to the syntax of the programming language, which is why JavaScript code can be minified and still work. So when we recognize a space, we do not need to put it into the token array for the next step.

In fact, tokens like whitespace, comments, and line breaks that do not affect the program's syntax are not passed on from lexical analysis to the next stage.

So, when we recognize a space, we just keep iterating:

// space is not processed
if (WHITESPACE.test(char)) {
  current++;
  continue;
}

Identify numbers/variables/strings

Why cover these three kinds of tokens together? Because their matching logic is similar. Start with numbers: since the matched token may consist of more than a single character, an inner loop has to keep going until the whole token has been consumed.

// Number: loop to collect all the digits
if (NUMBERS.test(char)) {
  let value = ""; // Accumulates the matched digits

  while (NUMBERS.test(char)) { // Keep going while the current character is still a digit
    value += char;
    char = input[++current];
  }

  tokens.push({ type: "number", value }); // Push an object describing the current token
  continue;
}

// Variable, similar to number
if (LETTERS.test(char)) {
  let value = "";
  
  while (LETTERS.test(char)) {
    value += char;
    char = input[++current];
  }

  tokens.push({ type: "name", value });
  continue;
}

// String: the quotes around it need to be skipped
if (char === '"') {
  let value = "";
  char = input[++current]; // Skip the opening quote

  while (char !== '"') { // Stop once the closing quote is matched
    value += char;
    char = input[++current];
  }

  char = input[++current]; // Skip the closing quote
  tokens.push({ type: "string", value });
  continue;
}

It should be noted that recognizing a string is similar to the two cases above, but with two differences:

  1. When identifying a string, the quotation marks are skipped and only the value between them is matched.
  2. The end condition of the inner loop is matching the closing quotation mark.

One might ask: after the quotes are skipped, how do we later know that this token was a string? That is where the token description object pushed into the array comes in: its type attribute records what kind of token it is.

A small summary

At this point, the lexical analysis work is done, and it is actually fairly easy to follow. Can we directly observe what the token array produced by lexical analysis looks like? Sure, just write a quick test, for example:

let source = "(add 2 (subtract 4 2))"; // The source code
let tokens = tokenizer(source);
console.log(tokens);

This is a Lisp statement that evaluates 2 + (4 - 2). Using it as input, we get the following token array:

[
  { type: 'paren', value: '(' },
  { type: 'name', value: 'add' },
  { type: 'number', value: '2' },
  { type: 'paren', value: '(' },
  { type: 'name', value: 'subtract' },
  { type: 'number', value: '4' },
  { type: 'number', value: '2' },
  { type: 'paren', value: ')' },
  { type: 'paren', value: ')' }
]

This is the perfect way to unpack the source code we started with.

Syntax analysis

The next step is syntax analysis. The function of syntax analysis is to convert the token array output in the previous step into the corresponding AST (Abstract syntax tree) according to the specific programming language syntax. Since the tree structure is involved, this step naturally involves recursive operations.

The overall framework

As usual, the syntax analysis part needs a framework first. The overall idea is to traverse the token array and recursively build the AST. The framework is as follows:

function parser(tokens) {
  let current = 0;

  function walk() {
    let token = tokens[current];

    // Convert different tokens to AST nodes

    throw new TypeError(token.type);
  }

  let ast = { // This is the outermost structure of the AST and is fixed
    type: "Program",
    body: [],
  };

  while (current < tokens.length) { // Iterate over the token array to build the tree structure
    ast.body.push(walk());
  }

  return ast;
}

Build number and string nodes

The construction of these two types of nodes is relatively simple, just return the object describing the node:

// Build integer nodes
if (token.type === "number") {
  current++;
  return {
    type: "NumberLiteral".value: token.value,
  };
}
// Build a string node
if (token.type === "string") {
  current++;
  return {
    type: "StringLiteral".value: token.value,
  };
}

Build the function call node

Those of you who know Lisp know that parentheses are its essence, for example a function call of the form (add 1 2). Therefore, we use the parentheses to recognize function calls. The specific code is as follows:

if (token.type === "paren" && token.value === "(") { // Open parenthesis
  token = tokens[++current]; // Skip the open parenthesis
  let node = { // The function call node
    type: "CallExpression",
    name: token.value,
    params: [],
  };
  token = tokens[++current]; // Skip the name

  // Collect parameter nodes recursively as long as they are not close parentheses
  while (!(token.type === "paren" && token.value === ")")) {
    node.params.push(walk()); // Add parameters
    token = tokens[current];
  }

  current++; // Skip the closing parenthesis
  return node;
}

The important thing to note here is that an argument can also be the result of a function call, so you need to recursively call the walk function when parsing the argument.

It is also worth noting that we have used this code structure many times:

if (value === "(") {
  // ...
  while(! value ===")") {
		// ...}}Copy the code

This structure is clearly suited to traversing a delimited range, so we reach for it whenever we parse paired elements such as string quotes and parentheses.

A small summary

After these few simple steps, the token array is converted into an AST structure, which still feels a little magical. At this point, the output for our example program is as follows:

{
  "type": "Program",
  "body": [
    {
      "type": "CallExpression",
      "name": "add",
      "params": [
        {
          "type": "NumberLiteral",
          "value": "2"
        },
        {
          "type": "CallExpression",
          "name": "subtract",
          "params": [
            {
              "type": "NumberLiteral",
              "value": "4"
            },
            {
              "type": "NumberLiteral",
              "value": "2"
            }
          ]
        }
      ]
    }
  ]
}

Traverse and transform the AST tree

At this point, we have an AST tree, and the compiler’s ability to convert source code into object code can actually be seen as converting the source AST tree into the target AST tree. To achieve this transformation, we need to traverse the tree and then operate on the corresponding nodes.

Traverse the tree

As we can see above, the body attribute and the params of a function call in the AST are both arrays, so we need to define a method for traversing arrays. It is simple: just traverse each element of the array in turn.

  // Traverse an array of nodes
  function traverseArray(array, parent) {
    array.forEach((child) => traverseNode(child, parent));
  }

When traversing a specific node, we call the enter method registered for that node's type to perform the visit (that is, the AST transformation). Different node types have different enter methods.

function traverseNode(node, parent) {
  let method = visitor[node.type]; // Look up the visitor method for the current node type

  if (method && method.enter) { // Execute the corresponding Enter method
    method.enter(node, parent);
  }

  switch (node.type) { // Perform different traversal operations for different types of nodes
    case "Program":
      traverseArray(node.body, node);
      break;
    case "CallExpression":
      traverseArray(node.params, node);
      break;
    case "NumberLiteral":
    case "StringLiteral":
      break;
    default:
      throw new TypeError(node.type);
  }
}

Again, you might ask: why does the parent node need to be passed as the second parameter when enter is called? This is tied to the logic of the transformation step, which will be explained shortly.
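For completeness, here is a minimal sketch of how these two helpers can be wrapped into the traverser function that the transformer below calls as traverser(ast, visitor): nesting them lets them close over the visitor, and the walk starts from the root node, which has no parent.

function traverser(ast, visitor) {
  // Traverse an array of child nodes
  function traverseArray(array, parent) {
    array.forEach((child) => traverseNode(child, parent));
  }

  // Visit a node, then recurse into its children by type
  function traverseNode(node, parent) {
    let method = visitor[node.type];
    if (method && method.enter) {
      method.enter(node, parent);
    }

    switch (node.type) {
      case "Program":
        traverseArray(node.body, node);
        break;
      case "CallExpression":
        traverseArray(node.params, node);
        break;
      case "NumberLiteral":
      case "StringLiteral":
        break;
      default:
        throw new TypeError(node.type);
    }
  }

  traverseNode(ast, null); // The root has no parent
}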

Convert the AST tree

The overall framework

Again, we can start by building a general framework and fill in the visit (transformation) methods for each node type afterwards. The transformation idea here is the important part: while traversing the old AST, how do we add the transformed nodes to the new AST?

The implementation idea here is generally divided into the following steps:

  1. Add a _context property to each node of the old AST that holds children, pointing at the corresponding array in the new AST.
  2. While traversing the children of such an old AST node, push each converted child into that _context property.
  3. Because of JavaScript's reference semantics, the converted nodes thereby land directly inside the new AST (see the sketch below).
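Step 3 relies on nothing more than how object references work in JavaScript. A minimal standalone sketch (not part of the compiler itself):

// _context and newAst.body refer to the same array object,
// so pushing through _context fills the new tree directly.
let newAst = { type: "Program", body: [] };
let oldAstNode = { type: "Program" };

oldAstNode._context = newAst.body;
oldAstNode._context.push({ type: "NumberLiteral", value: "2" });

console.log(newAst.body); // [ { type: 'NumberLiteral', value: '2' } ]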


I believe this also answers the earlier question of why the parent needs to be passed as the second argument when the enter method is called.

function transformer(ast) {
  let newAst = { // The outermost structure of the new AST
    type: "Program",
    body: [],
  };

  // _context is used to push into the new AST while traversing the old node's children
  ast._context = newAst.body;

  let visitor = {
    // Different types of node access (transformation) methods
  };

  traverser(ast, visitor); // Start traversing the old AST tree
  return newAst;
}

Conversion of numbers and strings

{
  NumberLiteral: {
    enter(node, parent) {
      parent._context.push({ // Push into the new AST
        type: "NumberLiteral",
        value: node.value,
      });
    },
  }, 
  StringLiteral: {
    enter(node, parent) {
      parent._context.push({
        type: "StringLiteral", value: node.value, }); }}},Copy the code

Conversion of function call nodes

The function call node is special because its arguments can be treated as children, so you need to point the _context property of the current node to the corresponding parameter array of the new AST tree.

If the current function call is not nested inside another function call, an ExpressionStatement is added to indicate that the current node is a complete statement on its own. For example, the inner call in (add 2 (subtract 4 2)) cannot be considered a complete statement, because it exists as an argument of another function call.

{
  CallExpression: {
    enter(node, parent) {
      let expression = { // Function call node of the new AST tree
        type: "CallExpression",
        callee: {
          type: "Identifier",
          name: node.name,
        },
        arguments: [],
      };

      node._context = expression.arguments; // Point _context at the new node's arguments array

      // If the current function call is not nested inside another function call
      if (parent.type !== "CallExpression") {
        expression = {
          type: "ExpressionStatement",
          expression: expression,
        };
      }

      parent._context.push(expression);
    },
  },
}

A small summary

So far, we have completed the AST tree traversal and transformation. This is not a small amount of work, but it is the best part of the compilation. If everything goes well, we can now get the following new AST tree:

{
  "type": "Program",
  "body": [
    {
      "type": "ExpressionStatement",
      "expression": {
        "type": "CallExpression",
        "callee": {
          "type": "Identifier",
          "name": "add"
        },
        "arguments": [
          {
            "type": "NumberLiteral",
            "value": "2"
          },
          {
            "type": "CallExpression",
            "callee": {
              "type": "Identifier",
              "name": "subtract"
            },
            "arguments": [
              {
                "type": "NumberLiteral",
                "value": "4"
              },
              {
                "type": "NumberLiteral",
                "value": "2"
              }
            ]
          }
        ]
      }
    }
  ]
}

This is the structure of the AST tree corresponding to the C code, and comparing it to the AST tree of Lisp before, you can see many differences.

Code generation

Finally, the most exciting moment: generating the target code! This step is relatively easy: take the AST produced in the previous step, recurse through it, and emit the final code:

function codeGenerator(node) {
  switch (node.type) {
    case "Program":
      return node.body.map(codeGenerator).join("\n");
    case "ExpressionStatement":
      return `${codeGenerator(node.expression)};`;
    case "CallExpression": // Generate a function call
      return `${codeGenerator(node.callee)}(${node.arguments
        .map(codeGenerator)
        .join(",")}) `;
    case "Identifier": // Generate the variable name
      return node.name;
    case "NumberLiteral":
      return node.value; // Generate a number
    case "StringLiteral":
      return `"${node.value}"`; // Generate string (don't forget quotes)
    default:
      throw new TypeError(node.type);
  }
}

Eventually, we complete the conversion from the Lisp example code (add 2 (subtract 4 2)) to the C-style code add(2, subtract(4, 2));.
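As a final sketch, assuming the four functions above (tokenizer, parser, transformer, and codeGenerator) are all in scope, the whole pipeline can be wired together like this:

function compiler(input) {
  let tokens = tokenizer(input);        // 1. Lexical analysis
  let ast = parser(tokens);             // 2. Syntax analysis
  let newAst = transformer(ast);        // 3. AST traversal and transformation
  let output = codeGenerator(newAst);   // 4. Code generation
  return output;
}

console.log(compiler("(add 2 (subtract 4 2))")); // add(2, subtract(4, 2));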

A big summary

This article has taken you from zero to implementing the most basic functions of a compiler, covering lexical analysis, syntax analysis, AST traversal and transformation, and code generation. The principles of compilation sound esoteric (and they are), but the basic parts, lexical analysis and syntax analysis, ultimately come down to handling strings.

My own area is the front end, and people may think it does not usually touch on compiler principles. In fact, once you look into it, Babel's work of turning ES6+ code into ES5 is done by a compiler, and so is the recently popular ESBuild. Anything that involves converting code will involve compilation; even Vue has a compiler built in for template compilation.

Having said all this, my original intention is simply to encourage everyone, in their everyday learning, to wade into new areas of knowledge and expand their skills, thereby broadening their technical horizons and raising their ceiling.

(The end)