Preface

JavaScript is the language most front-end developers know best. It is powerful, but some of its features are hard to understand: concepts such as closures and the this binding often confuse beginners. There are plenty of articles on the Internet, with titles like “If you still don’t understand this binding, come and chop me down”, that try to help people out. In my opinion, however, most of these articles are superficial; you can read a lot of them and still get stumped by an interviewer. So how can you understand these language characteristics thoroughly enough to stand out in an interview? In my opinion, the best way to really understand something is to implement it yourself, which is what Western programmers like to call learning by implementing. For example, if you want to understand React better, the best way is to implement a React yourself. To better understand the language features of JavaScript, I implemented a simple JavaScript interpreter called Simple. It is written in TypeScript and implements a subset of JavaScript syntax, including the following features:

  • Basic data types
  • Complex data types Object, Array, and Function
  • Variable definitions
  • Mathematical operations
  • Logical operations
  • if conditionals
  • while and for loops
  • Functional programming
  • Closures
  • this binding

This series of articles is an overview of my implementation of the Simple language interpreter. It will include the following sections:

  • Project introduction and lexical analysis (this article)
  • Syntax analysis
  • Executing JavaScript code

While the implementation of Simple has nothing to do with the V8 engine (or any other JavaScript engine), and this series will not help you read those engines’ source code, here is what you should take away after reading it:

  • A deeper understanding of the JavaScript language (this, closures, etc.)
  • A grasp of the basics of compiler theory
  • Knowing what a DSL is and how to implement an internal DSL to improve development efficiency (Simple’s parsing is based on an internal DSL)

The source code of the Simple interpreter is open source on GitHub at https://github.com/XiaocongDong/simple. I have also built a Simple code editor for everyone to play with at https://superseany.com/opensource/simple/build/; in the editor you can write and run JavaScript code and also inspect the tokens and abstract syntax tree (AST) generated from it.

Let’s move on to the first part of this series, project introduction and lexical analysis.

Project introduction

Compiler vs interpreter

Before we dive into how Simple works, let’s look at two basic concepts from compiler theory: the compiler and the interpreter.

The compiler

A compiler can be thought of as a translator for a language: it converts source files from one form of code to another. It is only responsible for converting the code and does not actually execute its logic. Webpack, the bundler we use when developing front-end projects, is essentially a JavaScript compiler: it only bundles our code and does not execute it.

The interpreter

An interpreter, as the name implies, interprets and executes our code. Unlike a compiler, it does not convert the source code into another form (at least it does not output intermediate files); instead it executes the logic of the source code as it interprets it.

Simple interpreter

Because Simple does not perform any intermediate conversion of the JavaScript code you write, and only interprets and executes the code’s logic, it is a genuine JavaScript interpreter.

Simple’s architectural design

The code we write is just a string of text stored on a computer’s hard drive, and the essence of implementing a language interpreter is teaching the computer how to understand and execute this text. So how can a computer understand what we write? Given that most programming languages are written in English, let’s look at how people understand an English sentence and see whether we can borrow some ideas.

The process of understanding English sentences

Put a pencil on the table. I’m sure you all know what this sentence means, but have you ever thought about how you understand it? Or, going a step further, can you break your process of understanding this sentence into separate steps?

I believe that most people go through these stages of understanding the above sentence:

  • Split the sentence into words and understand the meaning of each word: sentences are made up of words, and we need to know the meaning of each word before we can understand the sentence. The words in “Put a pencil on the table” mean:

    • Put: verb, to place.
    • A: indefinite article, one.
    • Pencil: noun, a pencil.
    • On: preposition, on top of.
    • The: definite article, this one.
    • Table: noun, a table.
  • Divide the sentence structure according to English grammar rules: after we understand the meaning of each word, we divide the sentence according to the rules of English grammar. For the sentence above, the division goes roughly like this:

    • The sentence begins with the verb put, and the verb is followed by a noun modified by an indefinite article, so this should be an imperative sentence of the form verb + noun. The first half of the sentence therefore tells someone to put a pencil somewhere.
    • With the first half explained, let’s look at the second half. “On the table” is a prepositional phrase that tells us where the pencil should be placed: on the table.
    • Putting the two halves together, the whole sentence tells someone to put a pencil on the table.

How does a computer understand code

Now that we know how we understand an English sentence, let’s think about how to get a computer to understand our code. A lot of computer science is about modelling the real world. For example, the familiar Queue data structure corresponds to the queues we line up in in daily life, while design patterns such as Visitor and Listener also model real-life situations. In computer science, the field that studies how programming languages are processed is compiler theory, so how do some of its basic concepts map onto the steps we just used to understand a sentence?

As mentioned above, the first step in understanding a sentence is splitting it into words and understanding the meaning of each word. This step corresponds to lexical analysis in compiler theory. Lexical analysis, as the name implies, interprets code at the word level by dividing the code string into individual tokens.

After understanding the meaning of each word, we divide the sentence structure according to English grammar rules. The corresponding concept in compiler theory is syntax analysis (parsing). Syntax analysis takes the token stream produced by lexical analysis and, following the defined grammar rules, generates an abstract syntax tree (AST). The resulting AST is then executed by some runtime.

To sum up, the software architecture of a language interpreter looks like this:
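Expressed as code, the pipeline boils down to three stages that each consume the previous stage’s output. Here is a minimal sketch in TypeScript; the names and types are illustrative, not Simple’s actual API:

// A minimal sketch of the interpreter pipeline (illustrative names, not Simple's actual API)
type SketchToken = { type: string; value: string }
type SketchAST = { type: string; body: unknown[] }

function interpret(
  source: string,
  lexer: (src: string) => SketchToken[],        // lexical analysis: source text -> tokens
  parser: (tokens: SketchToken[]) => SketchAST, // syntax analysis: tokens -> abstract syntax tree
  runtime: (ast: SketchAST) => void             // execution: walk the AST and run its logic
): void {
  runtime(parser(lexer(source)))
}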



That is Simple’s overall software architecture. Now let’s look at how lexical analysis is implemented.

Lexical analysis

As mentioned earlier, lexical analysis splits the source code into individual tokens. Note that tokens in compiler theory are not the same as words in English: strings made of letters such as let, for and while are tokens, but strings made of non-letter characters such as =, ==, && and + are also legal tokens. Here are some of the legal tokens of the Simple interpreter:

  • Keywords: let, const, break, continue, if, else, while, function, true, false, for, undefined, null, new, return
  • Identifiers: These are mainly the names of some developer-defined variables, such as arr, server, result, etc
  • Literals: Literals include number literals and string literals. The Simple interpreter only supports single-quoted strings, such as ‘this is a string literal’.
  • Arithmetic and logical operators: +, -, ++, --, *, /, &&, ||, >, >=, <, <=, ==
  • Assignment operators: =, +=, -=
  • Special symbols: [, ], {, }, ., :, (, )

Note that the lexing phase does not keep every character in the source code: unwanted information such as spaces, newlines and comments is removed at this stage. Here is an illustration of the lexical analysis:



For lexical analysis, there are roughly the following two implementations:

Regular expressions

This approach is probably the one most developers would think of first. Since the Simple interpreter does not use it, here is only a brief overview of the process, which in general consists of the following steps:

  • Define a regular expression for each token type. For example, the regular expression for numeric literals (ignoring the floating-point case) is /[0-9][0-9]*/, the regular expression for the simple assignment operator is /=/, and the regular expression for the equality operator is /==/.
  • Match the code string against each token type’s regular expression in order of lexical priority. When a token type’s regular expression hits, extract the matched substring as a token, then continue matching from the position just after it, and repeat this loop until the whole string has been consumed. The important point here is that different token types have different priorities: the equality operator == has a higher priority than the assignment operator =, because if the developer writes two equals signs in a row they mean an equality check, not two assignments. A small sketch of this approach follows the list.
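As a rough illustration of this regex-based approach (not the method Simple actually uses), a priority-ordered rule list could be driven like this; the token names are made up for the example:

// A sketch of regex-based lexing, assuming a priority-ordered rule list (illustrative only)
type RegexRule = { type: string; pattern: RegExp }

const rules: RegexRule[] = [
  { type: 'EQUAL', pattern: /^==/ },             // '==' must be tried before '='
  { type: 'ASSIGN', pattern: /^=/ },
  { type: 'NUMBER_LITERAL', pattern: /^[0-9][0-9]*/ },
  { type: 'WHITESPACE', pattern: /^\s+/ },
]

function tokenizeWithRegex(code: string): { type: string; value: string }[] {
  const tokens: { type: string; value: string }[] = []
  let rest = code
  while (rest.length > 0) {
    // Try the rules in priority order against the start of the remaining input
    const rule = rules.find(r => r.pattern.test(rest))
    if (!rule) throw new Error(`Unexpected character: ${rest[0]}`)
    const value = rest.match(rule.pattern)![0]
    if (rule.type !== 'WHITESPACE') tokens.push({ type: rule.type, value })
    // Continue matching right after the substring that was just consumed
    rest = rest.slice(value.length)
  }
  return tokens
}

// tokenizeWithRegex('1 == 2') -> NUMBER_LITERAL '1', EQUAL '==', NUMBER_LITERAL '2'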

Based on finite state machines

Since all regular expressions can be converted to their corresponding finite-state machines, lexical analysis can also be implemented using finite-state machines. So what is a finite state machine?

A Finite State Machine (FSM) has the following characteristics:

  • It has a finite set of states
  • It is in exactly one state at any time, the current state
  • After receiving input from the outside world, the state machine computes the next state based on the current state and the input, and transitions to that state

The familiar traffic light is an example of a finite state machine. A traffic light can only show three colours, red, green and yellow, so its state set is finite. Since a traffic light can only show one colour at a time (imagine red and green at the same time :), its current state is unique. Finally, the traffic light transitions to the next state according to the current state (its colour) and the input (how much time has passed); for example, a red light turns green after 60 seconds rather than turning yellow.

From the above definition we know that the most important elements of a finite state machine are the following:

  • The set of states
  • The current state
  • How it transitions from one state to another
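To make these three elements concrete, here is a toy version of the traffic light above, assuming the common red → green → yellow → red cycle (illustrative only):

// A toy finite state machine for the traffic light (assumes a red -> green -> yellow -> red cycle)
type LightState = 'RED' | 'GREEN' | 'YELLOW'   // the finite state set

// The transition rules: given the current state and the input "timer elapsed",
// the machine computes the next state
const nextState: Record<LightState, LightState> = {
  RED: 'GREEN',
  GREEN: 'YELLOW',
  YELLOW: 'RED',
}

let current: LightState = 'RED'   // the machine is in exactly one state at a time
current = nextState[current]      // input arrives (the timer elapsed) -> current is now 'GREEN'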

Now that we know what a finite state machine is and what its three elements are, let’s look at an example of using a simple finite state machine for lexical analysis. We are going to design a finite state machine that can recognize the following types of tokens:

  • Identifier
  • Number (number literal, not including floating point)
  • String (string literal, enclosed in single quotes)
  • Plus operator (+)
  • Plus-assignment operator (+=)

Let’s start by defining the three elements of a finite state machine:

  • State set: the state set should contain every state the machine can be in after receiving any input. For the token types above, the possible states are:

    • Initial: the initial state
    • Number: the state the machine is in while it is recognizing a numeric literal
    • Start string literal: the machine enters this state when it receives the first single quote and stays in it until the second single quote arrives
    • String literal: the state the machine is in once it has recognized a complete string literal
    • Identifier: the state the machine is in while it is recognizing an identifier
    • Plus: the state the machine is in when it has recognized the plus sign
    • Plus assign: the state the machine is in when it has recognized the plus-assignment operator
  • Current state: the current state of this finite state machine can be any of the states defined above
  • How to transition between states: when the state machine is in a given state, it can only transition to certain specific states. For example, if the machine is in the start string literal state, it can only stay in that state or transition to the string literal state. When the current input does not allow any transition, there are two cases. The first case is that the current state is a terminable state, meaning the machine already has all the information it needs to produce a token; it then emits the token type represented by the current state, resets to the initial state, and processes the current input again. The second case is that the current state is a non-terminating state; the machine does not have enough information to emit a token, so it reports an error. In this example, the terminable states are number, string literal, identifier, plus and plus assign, and the non-terminating state is start string literal. Here is the state transition diagram for this state machine:

An important detail is that, in addition to the current state, the state machine also has to hold the characters that have not yet been emitted as a token. In other words, it needs a buffer variable to store the input characters seen so far. For example, when a + is encountered, the buffer becomes +; if a = follows, the buffer becomes +=. After += is emitted as a token, the buffer is reset to the empty string "".
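Since the transition diagram is easier to follow in text form here, the transitions just described could be written down roughly like this (an illustrative sketch, not Simple’s actual configuration format; the character classes for identifiers are an assumption):

// Illustrative transition table for the toy lexer state machine described above
const toyTransitions: Record<string, { on: string | RegExp; to: string }[]> = {
  'initial': [
    { on: /[a-zA-Z]/, to: 'identifier' },
    { on: /[0-9]/, to: 'number' },
    { on: "'", to: 'start string literal' },
    { on: '+', to: 'plus' },
  ],
  'identifier': [{ on: /[a-zA-Z0-9]/, to: 'identifier' }],
  'number': [{ on: /[0-9]/, to: 'number' }],
  'start string literal': [
    { on: /[^']/, to: 'start string literal' },
    { on: "'", to: 'string literal' },
  ],
  'string literal': [],   // terminable, no further transitions
  'plus': [{ on: '=', to: 'plus assign' }],
  'plus assign': [],      // terminable, no further transitions
}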

Once the three elements of the state machine are defined, we can use the state machine above to process the string a+='Simple':

  1. The state machine starts in the initial state, then receives each character of the code one by one, performing the corresponding state transitions and emitting tokens
  2. The state machine receives the character a. From the state transition diagram defined above, this character causes the machine to transition to the identifier state, and the character is saved in the buffer variable
  3. The state machine receives the character +. The identifier state cannot transition on the + character, but it is a terminable state, so the machine first emits the previously recorded a as a token and resets the state to initial. After resetting, it reprocesses the + character, transitions to the plus state and records the character; the buffer becomes +
  4. The state machine receives the character =. As the transition diagram shows, the machine can transition to the plus assign state, so it performs the transition and records the character; the buffer becomes +=
  5. The state machine receives the character '. Since plus assign cannot transition on the ' character and plus assign is a terminable state, the machine emits the current buffer contents += as a token and resets the state to initial. After resetting, it reprocesses the ' character and transitions to the start string literal state
  6. The state machine receives S, i, m, p, l and e in turn. Since none of them is a single quote, it stays in the start string literal state and appends these characters to the buffer, which finally becomes Simple
  7. The state machine receives the character ' and transitions to the string literal state, which means it has recognized a complete string token
  8. Finally, when the state machine determines that there is no more input, it checks whether the current state is terminable. Since string literal is a terminable state, it emits the current token. If instead the machine runs out of input while in a non-terminating state, it throws an Unexpected EOF error

This is a simple example of using a finite state machine to implement a lexer, and the lexer in the Simple interpreter works in the same way. In the Simple interpreter, I decoupled the core logic of the state machine (tracking the current state and performing transitions) from its configuration (the definition of the state set and the transitions between states), so that the lexical rules of the Simple language are easy to modify and extend later, and the same core can even be used with a different configuration to perform lexical analysis for another language.

The core logic of the state machine lives in lib/lexer/Tokenizer.ts, and the configuration of the state machine lives in lib/config/Tokenizer.ts.

State machine configuration definition

Simple’s state machine configuration is defined in lib/config/Tokenizer.ts. Here is a simplified version of it:

// lib/config/Tokenizer.ts

// State enumerates all the possible states of the Simple language state machine
enum State {
  INITIAL = 'INITIAL',
  NUMBER_LITERAL = 'NUMBER_LITERAL',
  IDENTIFER = 'IDENTIFER',
  START_STRING_LITERAL = 'START_STRING_LITERAL',
  STRING_LITERAL = 'STRING_LITERAL',
  ...
}

const config: IConfig = {
  initialState: State.INITIAL,
  states: {
    [State.INITIAL]: {
      isEnd: false, // indicates whether this state is terminable
      transitions: [ // all the transitions out of this state
        {
          state: State.NUMBER_LITERAL,
          checker: /[0-9]/
        },
        {
          state: State.START_STRING_LITERAL,
          checker: "'"
        }
      ]
    },
    [State.NUMBER_LITERAL]: {
      isEnd: true,
      TokenType: TOKEN_TYPE.NUMBER_LITERAL,
      transitions: [
        {
          state: State.NUMBER_LITERAL,
          checker: /[0-9\.]/
        }
      ]
    },
    [State.START_STRING_LITERAL]: {
      isEnd: false,
      transitions: [
        {
          state: State.START_STRING_LITERAL,
          checker: /[^']/
        },
        {
          state: State.STRING_LITERAL,
          checker: "'"
        }
      ]
    },
    [State.STRING_LITERAL]: {
      isEnd: true,
      TokenType: TOKEN_TYPE.STRING_LITERAL
    },
    ...
  }
}

The configuration file above defines a config object that is passed as a parameter to the finite state machine class Tokenizer in lib/lexer/Tokenizer.ts. The config object has two fields: the initial state value (initialState) and the configuration of all the states of the machine (states). The initial state value is the state the machine starts in and resets to whenever it recognizes a new token. states is an object whose keys are state values and whose values are the configurations of those states. A state’s configuration consists of the following (a rough sketch of the corresponding types follows the list):

  • isEnd: boolean, indicating whether this state is terminable
  • TokenType: the token type that corresponds to this state. Only a terminable state can have a corresponding token type; if TokenType is not specified, no token is generated even when the state terminates.
  • transitions: stores all possible transitions out of this state, and each transition has the following properties:

    • state: the state to transition to
    • checker: the condition for the transition, which can be a string, a regular expression, or a function returning a boolean; the transition happens when the input satisfies the checker’s condition
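Putting the fields above together, the configuration types might look roughly like this (a sketch inferred from the example config; the actual type definitions in the repository may differ):

// A sketch of the configuration types implied by the fields above (may differ from the repository)
type Checker = string | RegExp | ((ch: string) => boolean)

interface ITransition {
  state: State          // the state to transition to
  checker: Checker      // the transition fires when the input satisfies this condition
}

interface IStateConfig {
  isEnd: boolean                // whether this state is terminable
  TokenType?: TOKEN_TYPE        // the token type emitted when a token ends in this state
  transitions?: ITransition[]   // all possible transitions out of this state
}

interface IConfig {
  initialState: State
  states: Record<State, IStateConfig>
}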

State machine core logic implementation

Having seen the configuration of the Simple state machine, let’s look at the core state machine code that consumes this configuration: lib/lexer/Tokenizer.ts. To implement Tokenizer I designed two helper classes. One is the LocationKeeper class, which tracks the current position, i.e. the line and column of the character being processed in the source file. The other is the TokenBuffer class: all tokens recognized by the state machine are stored in an instance of this class, so it needs to provide methods to read and write tokens. This class is introduced after the Tokenizer class.
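The repository has its own implementation of LocationKeeper; as a rough idea of what such a helper does, a minimal sketch might look like this (the method names follow the calls used in consume below, but the details are assumptions, not the repository’s code):

// A minimal sketch of a LocationKeeper-style helper (details are assumptions)
class LocationKeeperSketch {
  private line = 1
  private column = 0
  private markedLocation = { line: 1, column: 1 }

  // Advance the position for every consumed character
  consume(ch: string) {
    if (ch === '\n') {
      this.line += 1
      this.column = 0
    } else {
      this.column += 1
    }
  }

  // Remember where the current token starts
  markLocation() {
    this.markedLocation = { line: this.line, column: this.column }
  }

  getCurrentLocation() {
    return { line: this.line, column: this.column }
  }
}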

Let’s take a look at the Tokenizer class’s core logic, the consume(ch: string) function that handles each input character:

// lib/lexer/Tokenizer.ts
class Tokenizer {
  ...
  consume(ch: string) {
    // If the input character is a space or a newline and the current state is the initial state,
    // only update the current location information
    if ((ch === SPACE || ch === NEW_LINE) && this.state === this.initialState) {
      this.locationKeeper.consume(ch)
      return
    }

    // this.state holds the current state of the state machine
    const currentStateConfig: IStateConfig = this.statesConfig[this.state]
    if (!currentStateConfig) {
      // The developer forgot to configure this state, so report an error too, developer-friendly :)
      throw new Error(`Missing state config for ${this.state}`)
    }

    const transitions = currentStateConfig.transitions
    if (!transitions) {
      if (currentStateConfig.isEnd) {
        // Generate a token and store it in the tokenBuffer
        this.addToken(currentStateConfig.TokenType)
        // Reset the current state
        this.reset()
        // Consume the current input character again
        this.consume(ch)
        return
      }
      // The current state cannot transition and is not terminable, so report an error
      throw new SyntaxError(`Unexpected character ${ch}`, this.locationKeeper.getCurrentLocation())
    }

    // Compare the input character with each checker to determine whether a state transition is needed
    const targetTransition = transitions.find(({ checker }) => {
      if (typeof checker === 'string') {
        return ch === checker
      }
      if (checker instanceof RegExp) {
        return checker.test(ch)
      }
      return checker(ch)
    })

    if (!targetTransition) {
      // No transition is possible; check whether the current state is terminable
      if (currentStateConfig.isEnd) {
        if (currentStateConfig.TokenType) {
          // Add this token to the tokenBuffer instance
          this.addToken(currentStateConfig.TokenType)
        }
        // Reset the state
        this.reset()
        // Consume the input character again
        this.consume(ch)
        return
      }
      // No transition is possible and the current state is non-terminating, so report an error
      this.locationKeeper.consume(ch)
      throw new SyntaxError('Invalid or unexpected token', this.locationKeeper.getCurrentLocation())
    }

    // The logic below runs after a successful state transition
    // Update the currently recorded location, i.e. the line and column in the code
    this.locationKeeper.consume(ch)

    // Record the starting position of the current token
    if (this.state === this.initialState && targetTransition.state !== this.initialState) {
      this.locationKeeper.markLocation()
    }

    // Transition the current state to the target state
    this.state = targetTransition.state
    // Store the input character in the buffer
    this.buffer += ch
  }
}

Let’s look at the source code for the TokenBuffer class used to store recognized words:

// lib/lexer/TokenBuffer.ts
import { IToken } from './types/token'

class TokenBuffer {
  // Stores all recognized tokens
  private tokens: Array<IToken> = []
  // Stores the position of the token currently being read
  private cursor: number = 0

  // Return the current token without moving the cursor
  peek() {
    return this.tokens[this.cursor]
  }

  // Unlike peek, read returns the current token and moves the cursor forward
  read() {
    const currentToken = this.tokens[this.cursor]
    const nextCursor = this.cursor < this.tokens.length ? ++this.cursor : this.tokens.length
    this.cursor = nextCursor
    return currentToken
  }

  // Move the cursor back one position and return that token
  unread() {
    const lastCursor = --this.cursor
    this.cursor = lastCursor
    return this.tokens[lastCursor]
  }

  // Write a new token
  write(token: IToken) {
    this.tokens.push(token)
  }

  // Get the current cursor position
  getCursor() {
    return this.cursor
  }

  // Set the cursor position directly
  setCursor(cursor: number) {
    this.cursor = cursor
  }

  // Output the current tokens
  toJSON(): Array<IToken> {
    return this.tokens
  }

  // Whether all tokens have been read
  isEmpty(): boolean {
    return this.cursor === this.tokens.length
  }
}

Attentive readers will notice that every read in the TokenBuffer above only moves the cursor and never actually removes tokens from the array. The advantage is that during the syntax analysis phase, when some grammar rule fails to match, the parser can conveniently move back to the previously read tokens and try another grammar rule.
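As a sketch of how that backtracking could be used during syntax analysis (the parser itself is the topic of the next article, so the rule shape here is hypothetical):

// Hypothetical helper showing how a parser could rewind the TokenBuffer when a rule fails
function tryRule<T>(tokens: TokenBuffer, rule: (tokens: TokenBuffer) => T | null): T | null {
  const saved = tokens.getCursor()   // remember where this attempt started
  const result = rule(tokens)        // the rule reads tokens (moving the cursor) as it matches
  if (result === null) {
    tokens.setCursor(saved)          // the rule failed: rewind so another rule can try
  }
  return result
}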

The token stream

Finally, let’s look at the token stream recognized by the finite state machine. Here is the input code:

let a = 'HelloWorld';

After processing by the finite state machine, the output Token string is:

[
  {
    "type": "LET",
    "value": "let",
    "range": {
      "start": {
        "line": 1,
        "column": 1
      },
      "end": {
        "line": 1,
        "column": 3
      }
    }
  },
  {
    "type": "IDENTIFIER",
    "value": "a",
    "range": {
      "start": {
        "line": 1,
        "column": 5
      },
      "end": {
        "line": 1,
        "column": 5
      }
    }
  },
  {
    "type": "ASSIGN",
    "value": "=",
    "range": {
      "start": {
        "line": 1,
        "column": 7
      },
      "end": {
        "line": 1,
        "column": 7
      }
    }
  },
  {
    "type": "STRING_LITERAL",
    "value": "'HelloWorld'",
    "range": {
      "start": {
        "line": 1,
        "column": 9
      },
      "end": {
        "line": 1,
        "column": 20
      }
    }
  },
  {
    "type": "SEMICOLON",
    "value": ";",
    "range": {
      "start": {
        "line": 1,
        "column": 21
      },
      "end": {
        "line": 1,
        "column": 21
      }
    }
  }
]

As you can see from the output above, each token has the following attributes:

  • type: the type of the token, i.e. the TokenType defined on the corresponding terminable state
  • value: the literal text of the token
  • range: the start and end positions of the token, given as line and column numbers. This location information helps the developer locate errors when the code reports them

Summary

In this article, I introduced the background and scope of the Simple project, covered some basic compiler theory, and then explained in detail how to implement lexical analysis with a finite state machine, walking through the relevant Simple source code.

In the next article I’ll go through some basics of syntax analysis, introduce some core concepts of domain-specific languages (DSLs), and finally show how I implemented parsing in Simple using a flexible internal DSL.

Personal technology updates

This article was first published on my personal blog.

You are welcome to follow my WeChat public account to learn and grow together.