preface

After the introduction and practice of a series of preliminary code compilation articles [public number: Front-end xiaocainiao 001], we should have a certain understanding and understanding of the implementation and execution process of compiler.

Let’s take a closer look at how compilers convert between languages.

(Language A => language B and does not affect the correct execution of the program)

The body of the

To get A clearer picture of how the compiler converts language A to language B, let’s use an overview diagram.

(For the convenience of this article, language A is abbreviated as L-A, and language B is abbreviated as L-B)

  • 1. L-A / L-B

First of all, let’s introduce the differences between the original language A and the target language B:

// L-A
(add 2 (subtract 4 2))
// L-B
add(2, subtract(4, 2))
Copy the code

Its corresponding AST syntax tree structure:

Smart people should be able to see that in the process of realizing code conversion to different languages, the core work is not only the common lexical analysis, syntax analysis, but also the transformation of AST and the target code generation of the new AST. Each step is described below.

  • 2. Tokenizer

The process of lexical analysis has been introduced and practiced many times in previous articles,

We won’t discuss it in detail here.

Tokens are as follows:

const input = '(add 2 (subtract 4 2))';

const tokens = [
  { type: 'paren'.value: '('        },
  { type: 'name'.value: 'add'      },
  { type: 'number'.value: '2'        },
  { type: 'paren'.value: '('        },
  { type: 'name'.value: 'subtract' },
  { type: 'number'.value: '4'        },
  { type: 'number'.value: '2'        },
  { type: 'paren'.value: ') '        },
  { type: 'paren'.value: ') '}];function tokenizer(input) {
  let tokens = []
  let current = 0;
  while(current < input.length) {
    let char = input[current]
    // '(')' parentheses
    if(char === '(' || char === ') ') {
      tokens.push({
        type: 'para'.value: char
      })
      current++;
      continue;
    }
    / / space
    let WHITESPACE = /\s/;
    if(WHITESPACE.test(char)) {
      current++;
      continue;
    }
    / / 0-9 Numbers
    let NUMBERS = / [0-9];
    if(NUMBERS.test(char)) {
      / /...
      let value = ' ';
      while(NUMBERS.test(char)) {
        value += char;
        char = input[++current]
      }
      tokens.push({
        type: 'number',
        value
      })
      continue;
    }
    // string "xx" the character string
    if(char === '"') {
      // ...
    }
    // name
    let LETTERS = /[a-z]/i;
    if(LETTERS) {
      // ...
    }
    throw new TypeErroe(char)
  }
  return tokens
}
Copy the code
  • 3. Parser

The Tokens obtained by lexical analysis and the l-A AST syntax structure described above can be programmed to convert tokens into the AST syntax tree corresponding to L-A.

function parser(tokens) {
 const current = 0;
 // Define the AST root
 let ast = {
    type: 'Program'.body: [],};while(current < tokens.length) {
   ast.body.push(walk());
 }
 // Define a recursively called 'walk' function
 function walk() {
    // Get the token currently being processed
    let token = tokens[current];
    // Returns an AST node if the current value is number or string
    if (token.type === 'number') {current++;return {
        type: 'NumberLiteral'.value: token.value,
      };
    }
    if (token.type === 'string') {
       current++;
       return {
         type: 'StringLiteral'.value: token.value,
       };
    } 
    // If the parentheses are '(' then a 'CallExpression' ast node is created
    if (token.type === 'paren' && token.value === '(') {
      // Get the token after the '(' parentheses
      token = tokens[++current];
      // Creates a 'CallExpression' ast node with the current token set to its name
      let node = {
        type: 'CallExpression'.name: token.value,
        params: [],};// Get the token after name
      token = tokens[++current];   
      // Processes descendants of the CallExpression node
      while((token.type ! = ='paren') ||
         (token.type === 'paren'&& token.value ! = =') ')) {// Recursively call 'walk' to attach the node it returns to Node.params
          node.params.push(walk());
          token = tokens[current];
       }
       // Skip the ')' parentheses
       current++;
       returnnode; }}return ast;
}

// Final AST result
const ast = {
  type: 'Program'.body: [{
    type: 'CallExpression'.name: 'add'.params: [{
      type: 'NumberLiteral'.value: '2'
    }, {
      type: 'CallExpression'.name: 'subtract'.params: [{
        type: 'NumberLiteral'.value: '4'
      }, {
        type: 'NumberLiteral'.value: '2'}}}}]]];Copy the code

At this point, we have the AST of L-A, which will be followed by the all-important Transform.

  • 4. Translation (transformer)

Transformation phase, which takes the AST from the last step above and makes changes to it. It can manipulate the AST as if it were the same language, or it can translate it into an entirely new language.

Let’s look at how to convert the AST.

You may have noticed that the AST contains elements that all look similar. These objects all have a Type attribute, and each belongs to an AST node on which the attribute defined nodes describe a separate part of the tree.

When transforming an AST, nodes can be manipulated by adding/removing/replacing properties, adding new nodes, deleting nodes, or discarding an existing AST and creating a brand new AST based on it.

Because the goal here is to convert to a new language, the focus is on creating a brand new AST for the target language.

In order to access all nodes, a depth-first algorithm is used to traverse the entire AST.

If we were to manually manipulate this AST(add, subtract, modify) instead of creating a separate AST, we might introduce all sorts of abstractions, but for what we’re going to do here, we just need to visit each node in the tree one by one.

Visitor is defined as follows:

Var visitor = {NumberLiteral(node, parent) {}, CallExpression(node, parent) {}, }; // AST structure-program-callexpression - NumberLiteral - CallExpression - NumberLiteral - NumberLiteral // AST depth first algorithm processing // Backtrace -> Program (enter) -> CallExpression (enter) -> Number Literal (enter) < -number Literal (exit) -> CallExpression (enter) -> Number Literal (enter) <- Number Literal (exit) -> Number Literal (enter) <- Number Literal (exit) <- CallExpression (exit) < -callexpression (exit) < -program (exit) // To support the Enter /exit processing, Var visitor = {NumberLiteral: {enter(node, parent) {}, exit(node, parent) {},} //... };Copy the code

Now, with that visitor we can process different AST Node-types,

Let’s define a traverser function that accepts an AST and visitors.

function traverser(ast, visitor) {
  // Support array type traversal
  function traverseArray(array, parent) {
    array.forEach(child= > {
      traverseNode(child, parent);
    });
  }
  //
  function traverseNode(node, parent) {
    let methods = visitor[node.type];
    // The node enters the hook
    if (methods && methods.enter) {
      methods.enter(node, parent);
    }
    switch (node.type) {
      case 'Program':
        traverseArray(node.body, node);
        break;
      // 
      case 'CallExpression':
        traverseArray(node.params, node);
        break;

      // NumberLiteral and StringLiteral types do not do any node processing
      case 'NumberLiteral':
      case 'StringLiteral':
        break;
    } 
    // The hook for node exit
    if(methods && methods.exit) { methods.exit(node, parent); }}// Iterate over the AST node
  traverseNode(ast, null);  
}
Copy the code

Next, implement a Transformer and build a new AST

function transformer(ast) {
  // Create a new AST root
  let newAst = {
    type: 'Program'.body: [],};// For convenience, copy the body reference to ast._context
  ast._context = newAst.body;
  / / traverse the ast
  traverser(ast, {
    NumberLiteral: {
      enter(node, parent) {
        parent._context.push({
          type: 'NumberLiteral'.value: node.value, }); }},StringLiteral: {
      enter(node, parent) {
        parent._context.push({
          type: 'StringLiteral'.value: node.value, }); }},CallExpression: {
      enter(node, parent) {
        // Create a 'CallExpression'
        let expression = {
          type: 'CallExpression'.callee: {
            type: 'Identifier'.name: node.name,
          },
          arguments: [],};// Define a new reference on the 'CallExpression' node
        node._context = expression.arguments;
        // 
        if(parent.type ! = ='CallExpression') {
          // wrap the 'CallExpression' node with 'ExpressionStatement'.
          expression = {
            type: 'ExpressionStatement'.expression: expression, }; } parent._context.push(expression); }}})return newAst;
}

const newAst = {
  type: 'Program'.body: [{
    type: 'ExpressionStatement'.expression: {
      type: 'CallExpression'.callee: {
        type: 'Identifier'.name: 'add'
      },
      arguments: [{
        type: 'NumberLiteral'.value: '2'
      }, {
        type: 'CallExpression'.callee: {
          type: 'Identifier'.name: 'subtract'
        },
        arguments: [{
          type: 'NumberLiteral'.value: '4'
        }, {
          type: 'NumberLiteral'.value: '2'}]}]}}]};Copy the code
  • 5. Code-generator

The final stage is code generation, where the code generator calls itself recursively to output each node in the tree as a long string.

function codeGenerator(node) {
  switch (node.type) {
    // If it is 'Program', the map will traverse its body property and insert a newline character
    case 'Program':
      return node.body.map(codeGenerator)
        .join('\n');
    // If 'ExpressionStatement' is wrapped in parentheses and nested, call code generator expressions with a semicolon
    case 'ExpressionStatement':
      return (
        codeGenerator(node.expression) +
        '; '
      );
    // For 'CallExpression', it prints' callee 'and adds an open parenthesis,
    // I will then iterate over each node in the 'arguments' array and run them through the code generator,
    // Then connect them with a comma, and add a close parenthesis. Wrap it in parentheses
    case 'CallExpression':
      return (
        codeGenerator(node.callee) +
        '(' +
        node.arguments.map(codeGenerator)
          .join(', ') +
        ') '
      );
    
    case 'Identifier':
      return node.name;
    // For 'NumberLiteral' returns directly
    case 'NumberLiteral':
      return node.value;
    // For 'StringLiteral', wrap it with ""
    case 'StringLiteral':
      return '"' + node.value + '"'; }}const output = "add(2, subtract(4, 2))";
Copy the code

At this point, the compiler is fully equipped to convert L-A to L-B. From input to output, the compiler’s code conversion is finally completed through four stages.

1. input  => tokenizer   => tokens
2. tokens => parser      => ast
3. ast    => transformer => newAst
4. newAst => generator   => output
Copy the code

If you’re already here, congratulations, you’ve surpassed 80% of your readers

conclusion

So, this is the end of the preliminary code compilation series, I think if you can understand all four articles and implement the code involved by hand, it will give you a new understanding of daily front-end development, whether it is engineering using Babel, ESLint, etc., the underlying principles are very similar.

References:

  1. github.com/babel/babel

Ps: If there are any deficiencies welcome to correct


A front end little novice | eager | dream and love do not live up to