Computer fundamentals of front-end relevance (4) : Compilation principles

The following diagram of the computer system hierarchy, which we have seen many times before (M0,M1 and M3 are discussed in the computer composition principle, M2 is discussed in the operating system), is the M4 layer where we usually program. In this article, we will discuss how the high-level languages we use are converted to machine language for execution (perhaps first compiled into assembly language), and how to convert from one high-level language to another (such as TS to JS, higher version js to lower version JS). This process is collectively referred to as compile.

The main contents include

Foundation of compilation theory: refer to the introduction to the Dragon Book
Compilation practice: Refer to The Super Tiny Compiler
Js compiler Babel compilation part of the detailed introduction, and write a Babel plug-in

1. Theoretical Basis

A programming language is a symbol that describes the computing process to people and computers. Before a computer runs a programming language, it needs to be translated into a machine language that can be executed by the machine. This process is called compilation, and the software to complete compilation is called compiler.

1.1 Language Processor

A compiler inputs a program written in one language (known as the source language) and outputs an equivalent program written in another language (the target language). If the output is a machine language program, it can be executed directly on the computer.

Another type of language processor is the interpreter, which does not involve the translation process but interprets and executes programs in the source language. ,

1.2 Structure of a compiler

A complete compilation process is divided into two parts, the Analysis and synthesis parts, also known as the front end and back end.

The front end includes lexical analysis, syntax analysis, semantic analysis, intermediate code generation belongs to the front end, code generation and optimization belong to the back end. Lexical analysis decomposes the source program to generate lexical units (tokens), syntax analysis adds syntax structures on top of these elements to form syntax trees (ast), and then obtains a symbol table that holds program information and intermediate code (such as) for back-end processing through error checking and intermediate code generation of semantic analysis. The backend uses this information to generate a machine language that the target machine can run and optimizes accordingly.

When the intermediate representation and symbol table are independent of the high-level language, the combination of different front-end and back-end modules can be implemented, such as the implementation of GCC, which parses various high-level languages into syntax trees, which further translates into the intermediate language RTL, and which further generates Assembler code and various machine languages.

The following uses the assignment statement position=initial+rate*60 compiled into assembly language as an example to familiarise with the compilation process.

1.2.1 Lexical analysis

Lexical analysis, also known as scanning, is the lexical flow that reads into the source code and organizes into meaningful lexeme sequences. In this case, morphemes are

position
=
initial
+
rate
*
60

The lexical analyzer uses morphemes to generate lexical units (tokens) in the format of

, where token-name refers to abstract symbols used in parsing (such as identifier, or ID for short). Attribute-value is the corresponding information, such as the name and type. Token in this case the result is < id: 1 > < = > < id, 2 > < + > < id, 3 > < * > < > 60, the = sign, such as the token – a name for itself, the second to ignore.
,attribute-value>

1.2.2 Grammar analysis

Syntax analysis, also known as parsing, uses the first component of each token to create an intermediate representation of the tree, one of which is a syntax tree, in which each internal node of the tree represents an operation, and its children represent components of that operation, as in this case

1.2.3 Semantic analysis

Semantic Analyzer uses the information in the syntax tree and symbol table to check whether the source program is consistent with the semantics defined by the language, collects type information, and stores this information in the syntax tree or symbol table for later use in the generation of intermediate code. Another part of the semantic analysis is type checking, and possible type conversions, in which morpheme 60 is converted to a floating point number

1.2.4 Intermediate code generation

The preceding syntax tree can be regarded as a kind of intermediate code. After parsing and semantic analysis, it generates an intermediate code representation that can be easily translated into machine language, such as the three Address code in this example. This intermediate representation consists of a set of instructions similar to assembly language

1.2.5 Code generation

Output the final target language, prior to output the necessary optimizations, such as directly represented by a floating point value of 60.0, and remove the unnecessary variable t3, in this case optimized as

Output the final assembly language

1.3 Development history of programming language

The first generation of electronic computers appeared in the 1940s, using a sequence of 0 and 1 programming, the sequence clearly tells the computer which operations to perform in what order, that is, to move data from one location to another location, to calculate the values of different registers and so on, this is the first generation of programming language, machine language. To make programming more user-friendly, the second generation of programming language, assembly language, was introduced in the 1950s. At first, assembly language instructions were just mnemonic representations of machine language, and later macro instructions were added to simplify the programming process.

Later, higher-level languages such as C made it easier for programmers to develop programs, the third generation of programming languages

Fourth-generation programming languages are languages designed for specific applications, such as SQL for database queries

The fifth generation is based on logic and constraints, such as Prolog

The development of programming languages also puts forward new requirements for compiler designers.

2 Compilation Practice

After understanding the theoretical basis of compilation, we use JS to implement a simple compiler in which we simplify compilation into three steps

Parsing, the Parsing of source code into an abstract syntax tree, in which the abstract syntax tree serves as an intermediate representation
Transformation transforms the abstract syntax tree into the representation required for the purpose the current compiler wants to achieve
Code Generation transforms the Code to generate the destination Code

In this process, the abstract syntax tree is a nested object that contains various information about the code.

Compiler Babel

3.1 Basic Introduction

Babel is a JS compiler that follows the three steps described in the previous chapter and can be used to do the following

Compile the latest version of the JS syntax into the syntax supported by the current browser
Do the necessary build work in React, such as compiling the JSX syntax
Remove type comments and associated syntax conversions for TS, flow, etc
Additional custom compilation using plug-ins

Babel itself exists to make it easier to use the next generation OF JS syntax, but it also provides other features, such as polyfill, in addition to compiling syntax. Please refer to my article for details on how to use Babel and other features. This article only covers compilation.

Babel itself is a toolchain, and compilation is done by plug-ins. A Babel without plug-ins does something similar to const Babel =code=>code, that is, does nothing and prints the original code.

Plugins are called Syntax Plugins, such as JSX, to guide the parsing of a particular Syntax, and Transform Plugins, called Transform Plugins, to guide the parsing of intermediate representation code. For example, convert the syntax of arrow functions to es5 syntax.

To achieve this, Babel provides a series of packages, including four core packages

@babel/core relies on the last three and provides a encapsulated transform interface for us to call directly
@babel/ Parser parses the source code. Babel uses an improved ESTree rule, called ast-spec, to parse JS source code into AST. If you need to parse another syntax, such as JSX, you need a syntax plug-in.
Use @babel/traverse to traverse the AST for necessary operations, such as adding, deleting, or changing ast nodes
@babel/ Generator generates object code from the transformed AST

There are also auxiliary packages

@babel/types is used to validate, create, and modify AST nodes

3.2 Preparation before writing plug-ins

This part and the next part refer to Babel Plugin Handbook, here only introduces the introduction of plug-in knowledge, read the links for more complex operations.

3.2.1 ASTs

An AST is an abstract syntax tree, which is a tree-type data structure. Each node is a node type defined in the standard, and each data type implements an interface

interface Node {
  type: string;
  loc: SourceLocation | null;
}
Copy the code

The node types covered in this article include

Identifier, inherited from Expression, Pattern

interface Identifier <: Expression, Pattern {
  type: "Identifier";
  name: string;
}
Copy the code

FunctionDeclaration; FunctionDeclaration; FunctionDeclaration

interface FunctionDeclaration <: Function, Declaration {
  type: "FunctionDeclaration";
  id: Identifier;
}
Copy the code

A BlockStatement is a block of statements enclosed in curly braces

interface BlockStatement <: Statement {
  type: "BlockStatement";
  body: [ Statement ];
  directives: [ Directive ];
}
Copy the code

ReturnStatement return statement

interface ReturnStatement <: Statement {
  type: "ReturnStatement";
  argument: Expression | null;
}
Copy the code

BinaryExpression BinaryExpression

interface BinaryExpression <: Expression {
  type: "BinaryExpression";
  operator: BinaryOperator;
  left: Expression;
  right: Expression;
}
Copy the code

When we want to know the AST of a piece of JS code, we can view it directly through the AST Explorer (we can also look at non-JS ast).

3.2.2 the transform of Babel

The most complex of Babel’s three steps of parsing, transformation, and generation is the transformation step, which is where the plug-in’s main role is, by walking through the AST, adding, removing, or updating nodes as you go. When we process a node, we call it a visit, and the object that performs the visit is called a visitor(refer to the Visitor pattern in the design pattern).

visitor

Visitor is an object that contains key-value pairs of nodes of type key that operate on value, either a function or an object with enter and exit fields. Such as

const MyVisitor = { Identifier() { console.log("Called!" ); }}; // const MyVisitor = {Identifier: {enter() {console.log("Entered!") ); }, exit() { console.log("Exited!" ); }}};Copy the code

The handler function takes one parameter, Path, which is a responsive object representing the connection between two nodes. It contains information about the node and Operations on the node, including removing, adding, moving, and updating. For details, see Transformation Operations

Path. findParent Returns the parent path

types

Type is used to implement PATH for detailed operations on nodes, including type validation, creation, and modification of nodes, such as

T.i dentifier() creates an identifier
T.isidentifier () specifies whether the node is an identifier
T.isallexpression () is not callExpression

3.3 write a plug-in

Now that we know the basic implementation logic of Babel, we can use our current tools to implement some custom plugins that do exactly what we want, namely write Transform plugins, as referenced in this article.

A plug-in is a function that accesses nodes mainly in the visitor section in the format of

Export default function(Babel) {return {// necessary, with traverse using visitor object visitor: Var inherits {}, // Select * from JSX; OtherPlugin, // Optional, before the plugin is executed, initialize the state, such as cache pre(state) {}, // Optional, after the plugin is executed, end the cleanup task post(state) {}}}Copy the code

Let’s say we want to implement a plug-in that removes console-related methods

export default function(babel) { return { visitor: { Identifier(path){ if (path.node.name === 'console') { path.findParent(p => p.isCallExpression()).remove();; }},}}}Copy the code

Validation can be run directly in the AST Explorer or configured as a plug-in in a project

End and spend