AST

The front end keeps getting more complex: frameworks of all kinds, Babel, TypeScript, webpack, Rollup, and so on. All of these tools rely on compilation, and JavaScript compilation is built on the AST. Frameworks, scaffolding tools, and languages use the AST to compile source code into JS that meets their needs and can run in the browser. The compilation process is:

1 tokenizer: split the entire source code into tokens;
2 parse: analyze the syntax and turn the tokens into a syntax tree;
3 transform: replace/modify the syntax tree according to custom rules;
4 generator: convert the syntax tree back into ordinary JS code.

This article walks through the whole JS compilation process, guided by Recast and the-super-tiny-compiler, to understand how the AST is applied on the front end.
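The four stages can be sketched as a pipeline. The stand-in functions below are toy illustrations of my own, not a real compiler; the transform rule (rename identifier a to x) is just an arbitrary demo:

```javascript
// Minimal illustration of the four-stage pipeline.
// Each stage here is a toy stand-in, not a full implementation.
function tokenize(src) {
  // split source into the smallest meaningful units
  return src.match(/[a-zA-Z_$][\w$]*|\d+|[^\s]/g) || []
}

function parse(tokens) {
  // wrap tokens in a trivial "Program" node
  return { type: 'Program', body: tokens.map(t => ({ type: 'Token', value: t })) }
}

function transform(ast) {
  // example rule: rename every identifier `a` to `x`
  ast.body.forEach(node => {
    if (node.value === 'a') node.value = 'x'
  })
  return ast
}

function generate(ast) {
  return ast.body.map(n => n.value).join(' ')
}

const compile = src => generate(transform(parse(tokenize(src))))
console.log(compile('a + b')) // "x + b"
```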

Note: This article only implements the compilation process for a simple sum function. Repository: github.com/binvb/simpl…

AST results

Before implementing it ourselves, let's look at what an existing compilation tool produces. Compile the source code to an AST with Recast:

import * as recast from 'recast'

let sourceCode = `
    function sum(a, b) {
        return a + b
    }
`
console.log(JSON.stringify(recast.parse(sourceCode)))

Formatting the compiled JSON string yields nearly 3,000 lines! But the key data is in program.body, whose structure looks like this:

"type": "FunctionDeclaration", // function identifier
"id": { ... },                 // function name, etc.
"params": [ ... ],             // the function's parameters
"body": { ... },               // the contents of the function block

There are many other things in this data structure, including start and end positions (rows, columns), indentation, etc., but these are not the main content, so they are omitted.

tokenizer

In the tokenizing stage, we first extract the keywords we need and work out the splitting rule for each kind of token:

1 function: whitespace before and after;
2 function name: preceded by function, followed by a paren;
3 paren: the character '(' or ')';
4 comma: the character ',';
5 bracket: the character '{' or '}';
6 return: whitespace before and after;
7 plus: a special symbol;
8 variate: a string variable name;
9 blank: whitespace.

Then cut the source code into its smallest units and match them against these rules, as follows.

function tokenizer(input) {
    let current = 0
    let tokens = []

    while(current < input.length) {
        let _char = input[current]

        // match single-character splitting rules first
        if(_char === '(' || _char === ')') {
            tokens.push({
                type: 'paren',
                value: _char
            })
            current++
            continue
        }
        ...
    }

    return tokens
}
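A fuller sketch of a tokenizer covering just the tokens the sum example needs. The token type names mirror those used later in this article, but the bodies are my own illustrative reconstruction, not the repo's code:

```javascript
// A minimal tokenizer for `function sum(a, b) { return a + b }`.
// Token type names follow the article; the implementation is a sketch.
function tokenizer(input) {
  const tokens = []
  let current = 0

  while (current < input.length) {
    const char = input[current]

    if (/\s/.test(char)) { current++; continue } // skip whitespace
    if (char === '(' || char === ')') { tokens.push({ type: 'paren', value: char }); current++; continue }
    if (char === '{' || char === '}') { tokens.push({ type: 'bracket', value: char }); current++; continue }
    if (char === ',') { tokens.push({ type: 'comma', value: char }); current++; continue }
    if (char === '+') { tokens.push({ type: 'plus', value: char }); current++; continue }

    // consume a run of letters: keyword or identifier
    if (/[a-zA-Z]/.test(char)) {
      let word = ''
      while (current < input.length && /[a-zA-Z]/.test(input[current])) {
        word += input[current++]
      }
      if (word === 'function') tokens.push({ type: 'FunctionDeclaration', value: word })
      else if (word === 'return') tokens.push({ type: 'ReturnExpressionStatement', value: word })
      else tokens.push({ type: 'variate', value: word })
      continue
    }

    throw new TypeError('Unknown character: ' + char)
  }
  return tokens
}

const tokens = tokenizer('function sum(a, b) { return a + b }')
console.log(tokens.length) // 13
```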

parse

In the parsing phase, we need to load the tokens into our syntax tree one by one:

// from
[
    { type: 'FunctionDeclaration', value: 'function' },
    ...
]
// to
{
    program: {
        body: [
            {
                type: 'FunctionDeclaration',
                params: [...],
                id: { name: 'sum' },
                body: [
                    { type: 'blockStament', body: [...] }
                ]
            }
        ]
    }
}

Since the tokens are a flat array, we need to insert the data into the syntax tree level by level; we can build the tree recursively:

function parse(tree, startIndex=0) {
    let root = []
    let treeLength = tree.length

    for(let i = startIndex; i < treeLength; i++) {
        // function
        if(tree[i].type === 'FunctionDeclaration') {
            let blockRangeEnd = getBlockStatementList(tree, i, '}').index

            root.push({
                type: 'FunctionDeclaration',
                params: [
                    ...getFunctionParams(tree, i + 1, ')')
                ],
                id: {
                    name: getNextToken(tree, i + 1)
                },
                body: parse(tree.slice(i + 1, blockRangeEnd))
            })
            i = blockRangeEnd
        }
        // return statement
        if(tree[i].type === 'ReturnExpressionStatement') {
            let blockRangeEnd = getBlockStatementList(tree, i, '}').index
            
            root.push({
                type: 'ReturnExpressionStatement',
                body: getExpressionStatement(tree.slice(i + 1, blockRangeEnd))
            })
            i = blockRangeEnd
        }
        ...
    }

    return root
}
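parse leans on several helpers (getNextToken, getFunctionParams, getBlockStatementList, getExpressionStatement) that live in the repo. The names below follow the parse code above, but the bodies are hypothetical minimal sketches of my own:

```javascript
// Hypothetical helpers assumed by parse(); names follow the article's
// parse() code, but these implementations are illustrative sketches.

// Return the value of the token at position `i`.
function getNextToken(tokens, i) {
  return tokens[i] ? tokens[i].value : undefined
}

// Collect parameter names: skip ahead to '(', then gather `variate`
// tokens until the closing token value `end` (e.g. ')').
function getFunctionParams(tokens, start, end) {
  const params = []
  let inParen = false
  for (let i = start; i < tokens.length; i++) {
    if (tokens[i].value === '(') { inParen = true; continue }
    if (tokens[i].value === end) break
    if (inParen && tokens[i].type === 'variate') params.push({ name: tokens[i].value })
  }
  return params
}

const tokens = [
  { type: 'FunctionDeclaration', value: 'function' },
  { type: 'variate', value: 'sum' },
  { type: 'paren', value: '(' },
  { type: 'variate', value: 'a' },
  { type: 'comma', value: ',' },
  { type: 'variate', value: 'b' },
  { type: 'paren', value: ')' },
]
console.log(getNextToken(tokens, 1))           // "sum"
console.log(getFunctionParams(tokens, 1, ')')) // [ { name: 'a' }, { name: 'b' } ]
```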

transform

At this stage, a framework or language typically converts code that cannot run directly in V8 (or needs a compatibility build) into code that can (e.g. TypeScript). For simplicity, we only do a simple variable substitution, as follows.

function transform(ast) {
    for(let i = 0; i < ast.length; i++) {
        if(ast[i].type === 'FunctionDeclaration') {
            transform(ast[i].body)
        }
        if(ast[i].type === 'ReturnExpressionStatement') {
            transform(ast[i].body)
        }
        if(ast[i].type === 'variate' && ast[i].value === 'a') {
            ast[i].value = 1
        }
    }

    return ast
}

generator

In this step, we need to turn the transformed AST back into tokens, then generate a code string from those tokens (see the repository for the full code). The final output:

function sum(a,b){return 1 + b}
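The repo has the real generator; the following is a hedged sketch of the idea, walking the AST depth-first and emitting source text directly. The node shapes match the parse output described earlier, but the switch cases are my own reconstruction:

```javascript
// Illustrative generator: walk the AST depth-first and emit source text.
// Node shapes follow this article's parse() output; this is a sketch,
// not the repo's code.
function generator(nodes) {
  return nodes.map(node => {
    switch (node.type) {
      case 'FunctionDeclaration':
        return 'function ' + node.id.name +
          '(' + node.params.map(p => p.name).join(',') + ')' +
          '{' + generator(node.body) + '}'
      case 'ReturnExpressionStatement':
        return 'return ' + generator(node.body)
      case 'variate':
        return String(node.value)
      case 'plus':
        return ' + '
      default:
        return ''
    }
  }).join('')
}

const ast = [{
  type: 'FunctionDeclaration',
  id: { name: 'sum' },
  params: [{ name: 'a' }, { name: 'b' }],
  body: [{
    type: 'ReturnExpressionStatement',
    body: [
      { type: 'variate', value: 1 },
      { type: 'plus' },
      { type: 'variate', value: 'b' },
    ],
  }],
}]
console.log(generator(ast)) // "function sum(a,b){return 1 + b}"
```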

Wrapping up

Issues encountered

1 Tokenizing stage – tokens must be scanned in order from the start of the document, and there are many splitting rules, so how do we split? Cut everything down to the smallest unit: every character and every whitespace becomes a single unit. Then start matching with the splitting rules at that smallest-unit granularity.

2 Tokenizing stage – Do we need to distinguish between function names, parameter names, and variables, and if so, how, given that splitting matches characters consecutively? Per the spec they must be distinguished, since later stages treat each type differently. This requires adding an extra dimension to the matching rules:

1 For function names: a consecutive string preceded by whitespace, with 'function' before that whitespace;
2 For parameter names: surrounded by whitespace/comma/parens;
3 For variables: whitespace before and after.
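One way to express that extra dimension is to look at the preceding token when classifying a bare name. This is an illustrative sketch, not the repo's exact rule:

```javascript
// Classify a bare name token using the token that precedes it.
// Type names other than 'variate' are hypothetical.
function classifyName(prevToken, name) {
  if (prevToken && prevToken.value === 'function') {
    return { type: 'functionName', value: name }
  }
  if (prevToken && (prevToken.value === '(' || prevToken.value === ',')) {
    return { type: 'paramName', value: name }
  }
  return { type: 'variate', value: name }
}

console.log(classifyName({ value: 'function' }, 'sum').type) // "functionName"
console.log(classifyName({ value: '(' }, 'a').type)          // "paramName"
console.log(classifyName({ value: '+' }, 'b').type)          // "variate"
```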

3 Parse stage – Do special characters, such as the commas between function parameters, need to be stored in the syntax tree? If so, where? The spec's structure does store them, mainly to record positions, but that is not important for this project and would greatly increase the complexity, so they are not stored for now.

4 Parse stage – How do we extract the sibling structure of the current blockStatement from the tokens? e.g.

function test() {
    function a() {
        // ...
    }
    function b() {
        // ...
    }
}

Extract sibling nodes from the blockStatement start symbol '{' to its end symbol '}' (in reality this is much more complicated, e.g. nested {{...{}...}}, but we won't worry about that here).
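A depth counter is the usual way to find the matching '}' while skipping inner blocks. This is a hedged sketch; the repo's getBlockStatementList may differ:

```javascript
// Find the index of the '}' that closes the '{' at (or after) `start`,
// counting nesting depth so inner blocks are skipped.
function findBlockEnd(tokens, start) {
  let depth = 0
  for (let i = start; i < tokens.length; i++) {
    if (tokens[i].value === '{') depth++
    if (tokens[i].value === '}') {
      depth--
      if (depth === 0) return i
    }
  }
  return -1 // unbalanced
}

// tokens for: { { } { } }
const tokens = ['{', '{', '}', '{', '}', '}'].map(v => ({ type: 'bracket', value: v }))
console.log(findBlockEnd(tokens, 0)) // 5  (the outermost closing brace)
console.log(findBlockEnd(tokens, 1)) // 2  (closes the first inner block)
```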

5 Generator stage – How do we convert the AST back into tokens? Because the AST was generated in order, all the tokens can be extracted with a depth-first traversal.

specification

While many tools implement ASTs, it would be confusing if each tool/framework had its own structure, so a shared specification keeps the AST consistent across tools. See the specification at github.com/estree/estr…

Related tools

1 recast github.com/benjamn/rec…

recast.parse(sourceCode)

Reference documentation

4 the-super-tiny-compiler github.com/jamiebuilds…
5 The AST abstract syntax tree segmentfault.com/a/119000001…