Abstract syntax tree AST

Author: Shiwei Lee

As a front-end programmer, do you use webpack, rollup, Babel, esLint a lot? They are packaging tools, code compilation tools, syntax checking tools. How did they do it? This article introduces the abstract syntax tree, which is the technique they use. Should you know more about it?

There are no esoteric theories, and no large chunks of code. We start from scratch, and there are no obstacles for young people to read. Through this article, you will understand the basics of AST and how to use it.

preface

What is an abstract syntax tree?

AST (Abstract Syntax Tree) is a Tree representation of the Abstract Syntax structure of source code. The following diagram illustrates the representation of an abstract syntax tree for a piece of JavaScript code.

What is the use of abstract syntax trees?

IDE error prompts, code formatting, code highlighting, code completion, etc

JSLint, JSHint, ESLint checks for code errors or styles, etc

Webpack, rollup for code packaging, etc

Babel converts ES6 to ES5 syntax

Injected code counts unit test coverage

1. The AST parser

1.1 JS Parser

How is AST generated?

A tool that converts JavaScript source code into abstract syntax trees (AST) is called the JS Parser.

The parsing process of JS Parser consists of two parts

Lexical Analysis: Partitioned the entire code string into a minimal array of syntax units

Syntax Analysis: Establishes and analyzes the relationships between grammatical units on the basis of word segmentation

Common AST Parser

In the early days there were Uglifyjs and Esprima

Espree, based on esprima, for ESLint

Acorn, which boasts better performance and is smaller than Esprima

Babylon, from Acorn, for Babel

Babel-eslint, maintained by the Babel team for use with ESLint

1.2 Lexical Analysis

A grammatical unit is the smallest unit of parsed grammar that has practical meaning and is simply understood as a word in a natural language.

Syntax units in Javascript code include the following:

Keywords: such as var, let, const, etc

Identifier: A contiguous character that is not enclosed in quotes. It may be a variable, or keywords such as if or else, or built-in constants such as true or false

Operators: +, -, *, /, etc

Numbers: hexadecimal, decimal, octal and scientific expressions

String: Because the contents of a string are evaluated or displayed by the computer

Space: successive Spaces, line feeds, indentation, etc

Comment: A line comment or a block comment is the smallest syntax unit that cannot be split

Others: braces, parentheses, semicolons, colons, etc

1.3 Syntax Analysis

Combining the results of word segmentation, determine the relationship between words, determine the final expression meaning of words, generate abstract grammar tree.

1.4 the sample

Use esprima to parse the assignment statement as an example:

var a = 1;
Copy the code

As you can see, the result of word segmentation is an array, where each element is a minimal syntactic unit:

[
    {
        "type": "Keyword",
        "value": "var"
    },
    {
        "type": "Identifier",
        "value": "a"
    },
    {
        "type": "Punctuator",
        "value": "="
    },
    {
        "type": "Numeric",
        "value": "1"
    },
    {
        "type": "Punctuator",
        "value": ";"
    }
]
Copy the code

The results of grammar analysis are as follows, and the results of word segmentation are formed into a tree structure according to their mutual relations:

{
    "type": "Program",
    "body": [
        {
            "type": "VariableDeclaration",
            "declarations": [
                {
                    "type": "VariableDeclarator",
                    "id": {
                        "type": "Identifier",
                        "name": "a"
                    },
                    "init": {
                        "type": "Literal",
                        "value": 1,
                        "raw": "1"
                    }
                }
            ],
            "kind": "var"
        }
    ],
    "sourceType": "script"
}
Copy the code

1.5 Tool Website

esprima/parser

Classic JavaScript abstract syntax tree parser, the site provides very rich functionality

You can view segmentation and abstract syntax trees online

Syntax displays abstract Syntax trees, Tokens displays participles

Performance comparisons of various Parses are also provided, and it appears that Acorn performs a little better.

AST Explorer

A visual tool site for AST that enables AST transformations of code using various parses

AST Parsing Specification (The Estree Spec)

The same JavaScript code will yield the same AST from each parser because they all refer to the same AST parsing specification

The Estree Spec is a JavaScript AST specification for The SpiderMonkey engine output by Mozilla engineers. You can also refer to SpiderMonkey in MDN

2.AST in Babel

Having covered the AST, let’s take a look at how Babel uses the AST.

How Babel works

Babel works in three stages: Parse, Transform and generate

The Parse phase transforms the source code into an AST

Transform phase, using a variety of plug-ins for code conversion

In the generator stage, the AST is converted into code using code generation tools

The Parse – Parse

Babel parses the code using @babel/ Parser, and the input JS code string generates the AST according to the ESTree specification

The parser Babel uses is Babylon

The Transform – conversion

Receive the AST and iterate over it, adding, updating, and removing nodes during this process. This is part of the Babel plug-in.

Babel provides @babel/traverse methods to maintain the overall state of the AST tree. The methods take the original AST and custom conversion rules as parameters, and return the converted AST.

Generated by the Generator –

The code generation step converts the final AST (after a series of transformations) into string code and creates source maps.

Iterate through the AST and build a string that represents the transformed code.

Babel uses @babel/ Generator to convert the modified AST into code, which can be configured to compress and remove comments, and supports sourceMap.

3.Demo with esprima

Now that we know how Babel works, we will write a demo according to Babel’s three steps to deepen our understanding of AST.

We are going to use Esprima to simulate two code-conversion capabilities:

Change == to congruent ===

Change parseInt(a) to parseInt(a,10)

Before conversion code, before.js:

function fun1(opt) { if (opt.status == 1) { console.log('1'); } } function fun2(age) { if (parseInt(age) >= 18) { console.log('2'); }}Copy the code

Expected converted code, after.js:

The function fun1 (opt) {if (opt) status = = = 1) {/ / = = = = = become the console. The log (' 1 '); }} function fun2(age) {if (parseInt(age, 10) >= 18) {//parseInt(a,10) console.log('2'); }}Copy the code

Start with the tool kit

Const esprima = require('esprima'); //JS syntax tree module const estraverse = require('estraverse'); Const escodeGen = require('escodegen'); //JS syntax tree decompress module const fs = require('fs'); // Read and write filesCopy the code

Use esprima Parse to convert source code to AST. How about that? It’s easy. It’s a code.

const before = fs.readFileSync('./before.js', 'utf8');
const ast = esprima.parseScript(before);
Copy the code

Walk through the AST to find the code that matches the transformation rules to transform

estraverse.traverse(ast, { enter: (node) => { toEqual(node); // change == to congruent === setParseInt(node); // change parseInt(a) to parseInt(a,10)}});Copy the code

Let’s look at the implementation of toEqual and setParseInt

function toEqual(node) { if (node.operator === '==') { node.operator = '==='; }} function setParseInt(node) { If (node.type === 'CallExpression' && node.callee.name === 'parseInt' && Node.arguments. Length === 1) {if (node.type === 'CallExpression' && node.callee.name === 'parseInt' && Node.arguments. Node.arguments. Push ({// add arguments, literally "type": "Literal", "value": 10, "raw": "10"}); }}Copy the code

Finally, the transformed AST generated string code is written to the file.

// Generate target code const code = escodegen. Generate (ast); / / write file fs. ExistsSync (')/after) js') && fs. UnlinkSync (')/after) js'); fs.writeFileSync('./after.js', code, 'utf8');Copy the code

Ok, open the after.js file and see if it has been converted successfully. Is it what we expect? Is there a Babel feeling? Yes, Babel does the same thing, but its conversion rule functions are quite complex and require a lot of work to consider various JavaScript syntactic situations, which is at the heart of Babel.

If we go back to the demo we wrote, we followed Babel’s three steps exactly. The first and third steps, parse and generate, are very simple, in a sentence, not much to say. The focus is on the implementation of the Transform rule function, and one might ask, how do you know that the toEqual and setParseInt conversion functions are written this way?

Ok, to answer this question, let’s look at the AST before and after the code conversion of these two rules.

Change == to congruent ===

AST of a== B is as follows:

{
    "type": "Program",
    "body": [
        {
            "type": "ExpressionStatement",
            "expression": {
                "type": "BinaryExpression",
                "operator": "==",
                "left": {
                    "type": "Identifier",
                    "name": "a"
                },
                "right": {
                    "type": "Identifier",
                    "name": "b"
                }
            }
        }
    ],
    "sourceType": "script"
}
Copy the code

The AST of a=== B is as follows:

{
    "type": "Program",
    "body": [
        {
            "type": "ExpressionStatement",
            "expression": {
                "type": "BinaryExpression",
                "operator": "===",
                "left": {
                    "type": "Identifier",
                    "name": "a"
                },
                "right": {
                    "type": "Identifier",
                    "name": "b"
                }
            }
        }
    ],
    "sourceType": "script"
}
Copy the code

Compare the two AST columns to see if only one “operator” field is different: one is == and the other is ===.

Let’s look at the toEqual function. Is that clear? Simply modify the value of node.operator to complete the conversion.

function toEqual(node) { if (node.operator === '==') { node.operator = '==='; }}Copy the code

Change parseInt(a) to parseInt(a,10)

The AST of parseInt(a) is as follows:

{
    "type": "Program",
    "body": [
        {
            "type": "ExpressionStatement",
            "expression": {
                "type": "CallExpression",
                "callee": {
                    "type": "Identifier",
                    "name": "parseInt"
                },
                "arguments": [
                    {
                        "type": "Identifier",
                        "name": "a"
                    }
                ]
            }
        }
    ],
    "sourceType": "script"
}
Copy the code

ParseInt (a, 10);

{
    "type": "Program",
    "body": [
        {
            "type": "ExpressionStatement",
            "expression": {
                "type": "CallExpression",
                "callee": {
                    "type": "Identifier",
                    "name": "parseInt"
                },
                "arguments": [
                    {
                        "type": "Identifier",
                        "name": "a"
                    },
                    {
                        "type": "Literal",
                        "value": 10,
                        "raw": "10"
                    }
                ]
            }
        }
    ],
    "sourceType": "script"
}
Copy the code

Compare the two AST’s, see? The arguments array is overloaded with the following element.

{
    "type": "Literal",
    "value": 10,
    "raw": "10"
}
Copy the code

So in the transformation rule function, we add this element to the transformation. Is it very simple?

Function setParseInt(node) {function setParseInt(node) { If (node.type === 'CallExpression' && node.callee.name === 'parseInt' && Node.arguments. Length === 1) {if (node.type === 'CallExpression' && node.callee.name === 'parseInt' && Node.arguments. Node.arguments. Push ({// add arguments, literally "type": "Literal", "value": 10, "raw": "10"}); }}Copy the code

Okay, so that’s it. I think the Demo makes sense.

4. There is

Now you know how the AST works and how to use it. So let’s look at a problem and see what we’ve learned.

If a is an object, var a = {b: 1}, then which is more efficient, a.b or a[‘b’]?

A.b [‘b’] a.b [‘b’] a.b [‘b’] In fact, tests have been done and the performance difference between the two is close, with A.B performing slightly better than a[‘ B ‘]. So why is A.B slightly better than a[‘b’]?

In my opinion, a.b can directly resolve b as an attribute of a, while a[‘b’] may have an additional judgment process, because the contents of [] may be a variable or a constant.

This may seem plausible, but isn’t it? Is there any evidence to support this claim?

Well, the only way to explain this is to start with the V8 engine.

Javascript code can run on CPU, mainly due to the JS engine, V8 engine is developed by Google, used in Chrome and NodeJS, is a classic JS engine. Parser (AST) ->Ignition (Bytecode) ->TurboFan (Machine Code)

Parser: Converts JavaScript source code to Abstract Syntax Tree (AST)

Ignition: Interpreter converts the AST to Bytecode, interprets and executes the Bytecode; Also gather information needed for TurboFan optimization compilation, such as the types of function arguments

TurboFan: Compiler, the compiler, uses the type information gathered by Ignitio to convert Bytecode into optimized assembly code

Parser – AST Parser

You should be familiar with the AST parser we introduced today

Ignition – interpreter

The AST is parsed into a kind of assembly language that looks similar to assembly language, called Bytecode. This language is CPU-independent, and the Bytecodes generated on machines with different cpus are the same.

TurboFan – the compiler

As you know, each CPU has a different architecture and instruction set, and the corresponding assembly language will be different. In this step, V8 parses Bytecode into assembly language suitable for different cpus. V8 supports assembly languages for more than a dozen cpus.

Now let’s compare a[‘b’] with a[‘b’] in V8

The test code of a.b is as follows:

function test001() {
    var a = { b: 1 };
    console.log(a.b)
}
test001();
Copy the code

The test code for a[‘b’] is as follows:

function test002() {
    var a = { b: 1 };
    console.log(a['b'])
}
test002();
Copy the code

Take a look at the Bytecode they generated

The Bytecode code of a.b is as follows:

[generated bytecode for function: test001]
Parameter count 1
Frame size 32
   16 E> 000001F6C03D7192 @    0 : a0                StackCheck 
   33 S> 000001F6C03D7193 @    1 : 79 00 00 29 fa    CreateObjectLiteral [0], [0], #41, r1
         000001F6C03D7198 @    6 : 27 fa fb          Mov r1, r0
   46 S> 000001F6C03D719B @    9 : 13 01 01          LdaGlobal [1], [1]
         000001F6C03D719E @   12 : 26 f9             Star r2
   54 E> 000001F6C03D71A0 @   14 : 28 f9 02 03       LdaNamedProperty r2, [2], [3]
         000001F6C03D71A4 @   18 : 26 fa             Star r1
   60 E> 000001F6C03D71A6 @   20 : 28 fb 03 05       LdaNamedProperty r0, [3], [5]
         000001F6C03D71AA @   24 : 26 f8             Star r3
   54 E> 000001F6C03D71AC @   26 : 57 fa f9 f8 07    CallProperty1 r1, r2, r3, [7]
         000001F6C03D71B1 @   31 : 0d                LdaUndefined 
   63 S> 000001F6C03D71B2 @   32 : a4                Return 
Constant pool (size = 4)
Handler Table (size = 0)
Copy the code

The Bytecode code for a[‘b’] is as follows:

[generated bytecode for function: test002]
Parameter count 1
Frame size 32
   16 E> 0000022E1C7D6DC2 @    0 : a0                StackCheck 
   33 S> 0000022E1C7D6DC3 @    1 : 79 00 00 29 fa    CreateObjectLiteral [0], [0], #41, r1
         0000022E1C7D6DC8 @    6 : 27 fa fb          Mov r1, r0
   46 S> 0000022E1C7D6DCB @    9 : 13 01 01          LdaGlobal [1], [1]
         0000022E1C7D6DCE @   12 : 26 f9             Star r2
   54 E> 0000022E1C7D6DD0 @   14 : 28 f9 02 03       LdaNamedProperty r2, [2], [3]
         0000022E1C7D6DD4 @   18 : 26 fa             Star r1
   59 E> 0000022E1C7D6DD6 @   20 : 28 fb 03 05       LdaNamedProperty r0, [3], [5]
         0000022E1C7D6DDA @   24 : 26 f8             Star r3
   54 E> 0000022E1C7D6DDC @   26 : 57 fa f9 f8 07    CallProperty1 r1, r2, r3, [7]
         0000022E1C7D6DE1 @   31 : 0d                LdaUndefined 
   66 S> 0000022E1C7D6DE2 @   32 : a4                Return 
Constant pool (size = 4)
Handler Table (size = 0)
Copy the code

Compare the bytecodes of the two and you will see that they are identical, which means that there is no performance difference between the two implementations at the Bytecode layer and below. In fact, they’re different, so you have to look up, and there’s only the Parser stage up there. So let’s see what the difference is in their AST.

The AST code of a.b is as follows:

{
    "type": "Program",
    "body": [
        {
            "type": "ExpressionStatement",
            "expression": {
                "type": "MemberExpression",
                "computed": false,
                "object": {
                    "type": "Identifier",
                    "name": "a"
                },
                "property": {
                    "type": "Identifier",
                    "name": "b"
                }
            }
        }
    ],
    "sourceType": "script"
}
Copy the code

The AST code for a[‘b’] is as follows:

{
    "type": "Program",
    "body": [
        {
            "type": "ExpressionStatement",
            "expression": {
                "type": "MemberExpression",
                "computed": true,
                "object": {
                    "type": "Identifier",
                    "name": "a"
                },
                "property": {
                    "type": "Literal",
                    "value": "b",
                    "raw": "'b'"
                }
            }
        }
    ],
    "sourceType": "script"
}
Copy the code

The only difference we found is that a.b is false and a[‘b’] is true for the “computed” attribute, which means that a[‘b’] has one more computation than A.B when it is parsed into AST. Here, we conclude, is the slight difference. All right, we got the proof, so there should be no doubt now.

finishing

Not only have you learned about AST, but you have also learned how the V8 engine parses JS code. If you find this article useful to you, please give it a thumbs up, thank you very much (90 degree bow).

mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.

preface

directory

1. The AST parser

1.1 JS Parser

1.2 Lexical Analysis

1.3 Syntax Analysis

1.4 the sample

1.5 Tool Website

2.AST in Babel

How Babel works

3.Demo with esprima

We are going to use Esprima to simulate two code-conversion capabilities:

4. There is

finishing

For more shared articles from skFeTeam, click herehere, thank you ~

Abstract syntax tree AST

preface

directory

1. The AST parser

1.1 JS Parser

1.2 Lexical Analysis

1.3 Syntax Analysis

1.4 the sample

1.5 Tool Website

2.AST in Babel

How Babel works

3.Demo with esprima

We are going to use Esprima to simulate two code-conversion capabilities:

4. There is

finishing

For more shared articles from skFeTeam, click herehere, thank you ~

Related Posts

React from 0: React Router Hooks

Dual-threaded front-end framework: voe.js

Key points of vuecli3 project packaging optimization configuration