In fact, there is far too much of the AST module to cover — well over a hundred classes — so I will translate a few simple pieces into JavaScript to explain them (V8's internal complexity always exceeds my expectations; by now I can read thousands of lines of its source without flinching). If you want the C++ source itself, see my earlier notes.

Code address :github.com/pflhm2005/V…

First, a few conclusions.

  • The abstract syntax tree has a strict class hierarchy — Statement, Expression, Declaration, and so on all inherit from AstNode. When a token's type is determined, a factory class generates the corresponding description class.
  • V8 keeps an internal hash map called string_table_ that caches all strings. While building the abstract syntax tree, every string encountered is hashed from its characteristics and inserted into the map. If a later string produces the same hash, the cached entry is fetched first and compared; if the contents match, no new string object is created.
  • Declaration (let a = 1), Statement (if (true) {}), Expression ('a' + 'b'). One very special syntax type is the label — JavaScript's version of goto. All I can say is: try not to use it; V8 has a dedicated and very complicated parsing path for it.
  • Each large type (such as Statement) has very detailed subtypes — if, while, return, and so on. When the token being parsed does not match an expected subtype, the parser falls back to a more general one.
  • When a string is cached, it is handled as one of three kinds: a single character of length 1; a numeric string of length 2–10 whose value is less than 2^32 − 2; and everything else. The kind only affects how the hash value is generated.
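The three caching categories can be sketched as a small classification function. This is only a sketch built from the description above — the name classifyForHash and the leading-zero handling are my assumptions, not V8's actual code:

```javascript
// Classify a string the way the cache distinguishes them (sketch).
// The bucket only changes how the hash is computed, not how lookup works.
function classifyForHash(str) {
  if (str.length === 1) return 'one_character';
  // 2-10 digit numeric strings whose value is below 2^32 - 2
  if (str.length >= 2 && str.length <= 10 && /^[0-9]+$/.test(str)) {
    const value = Number(str);
    if (value < 2 ** 32 - 2) return 'array_index';
  }
  return 'regular_string';
}

console.log(classifyForHash('a'));     // one_character
console.log(classifyForHash('42'));    // array_index
console.log(classifyForHash('Hello')); // regular_string
```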
In our case, the single token 'Hello' is a primitive string, managed by the AstRawString class. In the whole source "'Hello' + 'World'", the spaces around the plus sign are ignored and the input parses into three segments: string, plus sign, string. Because the code starts with a string it is first judged to be a literal; once the parser then finds a plus sign followed by another string, the whole thing is judged to be an ordinary binary operation expression. The tokens involved are therefore a literal, a binary operator, and another literal.
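To make the three segments concrete, here is a naive, throwaway tokenizer — not the article's Scanner, just a sketch showing the token stream we expect it to produce (the name naiveTokenize and the structure are mine):

```javascript
// A naive sketch of how the expression splits into tokens.
// Real scanning is done by the Scanner class built later in the article.
function naiveTokenize(src) {
  const tokens = [];
  let i = 0;
  while (i < src.length) {
    const c = src[i];
    if (c === ' ') { i++; continue; }                       // whitespace is skipped
    if (c === '+') { tokens.push({ token: 'Token::ADD' }); i++; continue; }
    if (c === "'" || c === '"') {                           // scan until the matching quote
      let j = i + 1;
      while (j < src.length && src[j] !== c) j++;
      tokens.push({ token: 'Token::STRING', literal: src.slice(i + 1, j) });
      i = j + 1; continue;
    }
    i++;
  }
  return tokens;
}

console.log(naiveTokenize("'Hello' + 'World'"));
// [ { token: 'Token::STRING', literal: 'Hello' },
//   { token: 'Token::ADD' },
//   { token: 'Token::STRING', literal: 'World' } ]
```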

Below is a JavaScript simulation of parsing the string "'Hello' + 'World'"; the complete parsing flow can be revisited afterwards. Naming and logic follow the C++ source as closely as possible. Classes with multiple layers of inheritance are flattened, enumerations are replaced with arrays, and some syntax and calls may look odd — pointers and template metaprogramming simply cannot be reproduced.

First we need two mapping tables, as follows.

const kMaxAscii = 127;
const UnicodeToAsciiMapping = [];
for (let i = 0; i < kMaxAscii; i++) {
  UnicodeToAsciiMapping.push(String.fromCharCode(i));
}
/** Maps a single character to its token type */
const TokenToAsciiMapping = (c) => {
  return c === '(' ? 'Token::LPAREN' :
    c === ')' ? 'Token::RPAREN' :
    // ... many, many more
    c === '"' ? 'Token::STRING' :
    c === '\'' ? 'Token::STRING' :
    // ... many, many more
    'Token::ILLEGAL';
};
const UnicodeToToken = UnicodeToAsciiMapping.map(v => TokenToAsciiMapping(v));
One table maps Unicode code points to ASCII characters, and the other maps them to token types; V8 uses the array subscript to locate a character's type directly.
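The point of the array layout is that a character's code point doubles as the table index, so classification is a single array read. A self-contained miniature of the two tables (the variable names chars, toToken, and table are mine):

```javascript
// Rebuild the two tables in miniature to show the lookup pattern.
const kMaxAscii = 127;
const chars = [];
for (let i = 0; i < kMaxAscii; i++) chars.push(String.fromCharCode(i));
const toToken = (c) =>
  c === '(' ? 'Token::LPAREN' :
  c === ')' ? 'Token::RPAREN' :
  c === '"' || c === '\'' ? 'Token::STRING' :
  'Token::ILLEGAL';
const table = chars.map(toToken);

// The code point is the index, so the lookup is O(1).
console.log(table['('.charCodeAt(0)]); // Token::LPAREN
console.log(table["'".charCodeAt(0)]); // Token::STRING
console.log(table['a'.charCodeAt(0)]); // Token::ILLEGAL (identifier cases elided)
```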

V8 parses the source string character by character, so we need a Stream class to manage the process. Here is the implementation.

class Stream {
  constructor(source_string) {
    /**
     * In V8, buffer_ is not initialized in the constructor;
     * the source string is kept here temporarily to simulate that
     */
    this.source_string = source_string;
    /** Container that stores the characters */
    this.buffer_ = [];
    /** Three pointers that track the current parsing progress */
    this.buffer_start_ = 0;
    this.buffer_cursor_ = 0;
    this.buffer_end_ = 0;
  }
  ReadBlockChecked() {
    return this.ReadBlock();
  }
  ReadBlock() {
    this.buffer_ = this.source_string.split('').map(v => UnicodeToAsciiMapping.indexOf(v));
    this.buffer_end_ = this.buffer_.length;
    /** The return value differs from the source; there is no GC or buffer growth here */
    return this.buffer_.length;
  }
  /** Returns the current character and moves the cursor forward one slot */
  Advance() {
    let tmp = this.peek();
    this.buffer_cursor_++;
    return tmp;
  }
  /** Returns the current character, lazily initializing the buffer */
  peek() {
    if (this.buffer_cursor_ < this.buffer_end_) {
      return this.buffer_[this.buffer_cursor_];
    } else if (this.ReadBlockChecked()) {
      return this.buffer_[this.buffer_cursor_];
    } else {
      return null;
    }
  }
}
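Before wiring the Stream into a scanner, the peek/Advance cursor semantics can be checked in isolation. A minimal model with plain variables rather than the class (everything here is my illustration):

```javascript
// Minimal model of the cursor walk: peek reads, advance reads then moves.
const buffer = "'Hello'".split('').map(c => c.charCodeAt(0));
let cursor = 0;
const end = buffer.length;
const peek = () => (cursor < end ? buffer[cursor] : null);
const advance = () => { const t = peek(); cursor++; return t; };

console.log(String.fromCharCode(advance())); // '
console.log(String.fromCharCode(peek()));    // H
console.log(cursor);                         // 1
```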
With this class we can walk the string character by character, but we still need a machine called the scanner to drive the walk. Before implementing the scanner, we also need a lexical class — a way to describe an individual token. In V8 this class is called TokenDesc, and it is the most basic unit of the AST.

class TokenDesc {
  constructor() {
    /**
     * In the source this is a struct; besides marking the start
     * and end positions it also has several methods
     */
    this.location = {
      beg_pos: 0,
      end_pos: 0,
    };
    /**
     * Manages the string; a property of the same type named
     * raw_literal_chars stores the raw source string
     */
    this.literal_chars = new LiteralBuffer();
    /** Token type */
    this.token = null;
    /** Handles small integers */
    this.smi_value = 0;
    this.after_line_terminator = false;
  }
}
The attributes basically match the V8 source. location is simplified here, and literal_chars is responsible for handling strings; its implementation is given further below.

token marks the token's type. The second mapping table above handles the type judgment, and different cases are processed according to the type.

smi_value manages tokens of small-integer (Smi) type; I won't expand on Smi here.
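For context: a Smi ("small integer") is an integer V8 stores directly inside a tagged pointer rather than as a heap-allocated number. A rough sketch of the range check, assuming the common 31-bit signed layout (the actual width is platform-dependent, so treat the constants as assumptions):

```javascript
// Rough Smi range check: 31-bit signed integers (platform-dependent in real V8).
const kSmiMinValue = -(2 ** 30);
const kSmiMaxValue = 2 ** 30 - 1;
function isSmi(n) {
  return Number.isInteger(n) && n >= kSmiMinValue && n <= kSmiMaxValue;
}

console.log(isSmi(42));      // true
console.log(isSmi(2 ** 31)); // false — too large, becomes a heap number
console.log(isSmi(1.5));     // false — not an integer
```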

With the lexical class, let’s implement scanner.

class Scanner {
  constructor(source_string) {
    this.source_ = new Stream(source_string);
    /**
     * Unicode value of the current character;
     * null means parsing is complete
     */
    this.c0_ = null;
    /**
     * V8 keeps three token descriptors (current_, next_, next_next_);
     * token_storage_ is an array containing those three.
     * Only one descriptor is simulated here.
     */
    this.TokenDesc = new TokenDesc();
    this.token_storage_ = [];
  }
  /** In the source this returns next_; simplified to the single descriptor */
  next() {
    return this.TokenDesc;
  }
  Initialize() {
    this.Init();
    this.next().after_line_terminator = true;
    this.Scan();
  }
  Init() {
    this.Advance();
    // The source also maps the descriptors into token_storage_; skipped here
  }
  Advance() {
    this.c0_ = this.source_.Advance();
  }
  /** Scans one token and fills in its descriptor */
  Scan(next = this.TokenDesc) {
    next.token = this.ScanSingleToken();
    next.location.end_pos = this.source_.buffer_cursor_ - 1;
  }
  /** Parses a single token */
  ScanSingleToken() {
    let token = null;
    do {
      this.next().location.beg_pos = this.source_.buffer_cursor_;
      if (this.c0_ < kMaxAscii) {
        token = UnicodeToToken[this.c0_];
        switch (token) {
          case 'Token::LPAREN':
            /**
             * There are many other cases; since we are only dealing
             * with strings, this method is not implemented here
             */
            return this.Select(token);
          case 'Token::STRING':
            return this.ScanString();
          // ...
        }
      }
      /** The source handles some special cases here; not expanded */
    } while (token === 'Token::WHITESPACE');
    return token;
  }
}
The simplification here is heavy, but the core is still the parsing itself. In the source, calling the scanner's Initialize parses the first token, as my rewritten logic does, and for a string the final processing method is ScanString.

I haven't given the implementation of ScanString yet, mainly because it depends on another class — the literal_chars member of the TokenDesc class above.

So let’s start with the implementation of the class that manages strings, and then look at the final parsing of strings.

const Latin1_kMaxChar = 255;
// constexpr int kOneByteSize = kCharSize = sizeof(char);
const kOneByteSize = 1;
class LiteralBuffer {
  constructor() {
    /** In the source this is a Vector with its own growth algorithm */
    this.backing_store_ = [];
    this.position_ = 0;
    /** A character whose Unicode value is greater than 255 makes the string two-byte */
    this.is_one_byte_ = null;
  }
  /** A string is assumed to be one-byte when scanning starts */
  Start() {
    this.position_ = 0;
    this.is_one_byte_ = true;
  }
  /** We only care about one-byte characters, so the two-byte methods are not implemented */
  AddChar(code_unit) {
    if (this.is_one_byte_) {
      if (code_unit <= Latin1_kMaxChar) {
        return this.AddOneByteChar(code_unit);
      }
      this.ConvertToTwoByte();
    }
    this.AddTwoByteChar(code_unit);
  }
  AddOneByteChar(one_byte_char) {
    /**
     * V8's growth starts from a base of 64 and multiplies by 4 each time;
     * once the requirement exceeds (1024 * 1024) / 3 the capacity is
     * clamped to 2 * 1024 * 1024
     */
    if (this.position_ >= this.backing_store_.length) this.ExpandBuffer();
    this.backing_store_[this.position_] = one_byte_char;
    this.position_ += kOneByteSize;
  }
  /** Simplified to a stub: a JS array grows by itself */
  ExpandBuffer() {}
}
The class itself is actually fairly simple: a container holds the characters and grows when necessary; the one-byte versus two-byte distinction doesn't matter for our example.
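The growth numbers quoted in the AddOneByteChar comment can be sketched as a capacity function. This is an assumption based on that comment, not a verbatim port of V8's NewCapacity():

```javascript
// Growth sketch: start at 64, grow 4x, clamp very large buffers.
const kInitialCapacity = 64;
const kGrowthFactor = 4;
const kMaxGrowth = (1024 * 1024) / 3;
const kClampedCapacity = 2 * 1024 * 1024;
function newCapacity(current) {
  if (current === 0) return kInitialCapacity;
  return current < kMaxGrowth ? current * kGrowthFactor : kClampedCapacity;
}

console.log(newCapacity(0));           // 64
console.log(newCapacity(64));          // 256
console.log(newCapacity(1024 * 1024)); // 2097152 (clamped)
```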

With this class, you can fully parse the string, implementing the ScanString method of the scanner class.

class Scanner {
  // ...
  ScanString() {
    // Save the quote that opened the string: ' or "
    let quote = this.c0_;
    this.next().literal_chars.Start();
    while (true) {
      this.AdvanceUntil();
      /** Escape characters simply advance one slot (kept for fidelity; simplified here) */
      while (UnicodeToAsciiMapping[this.c0_] === '\\') {
        this.Advance();
      }
      /** Hitting the matching quote means parsing is done */
      if (this.c0_ === quote) {
        this.Advance();
        return 'Token::STRING';
      }
      this.AddLiteralChar(this.c0_);
    }
  }
  AddLiteralChar(c) {
    this.next().literal_chars.AddChar(c);
  }
}
As you can see, apart from the AdvanceUntil method this is an ordinary character-by-character walk: encountering the same quote that opened the string marks the end of the string token.

The AdvanceUntil method, though, is genuinely interesting. In short, it tries to find the end of the string quickly, scanning as it goes, and ideally finishes most of ScanString's work by itself. Its argument is a function that checks whether the current character could terminate the string. The C++ source uses a lambda, which reads awkwardly, so here it is rewritten in JS.

class Scanner {
  // ...
  /**
   * Quickly locates a character that might end the string, adding
   * ordinary characters along the way and leaving c0_ on the candidate
   */
  AdvanceUntil() {
    /**
     * Simulates a method from the std library; the real one takes
     * two iterators as its first arguments. Returns the absolute
     * index of the first match, or end if there is none.
     */
    const find_if = (arr, start, end, callback) => {
      let tarArr = arr.slice(start, end);
      let tarIdx = tarArr.findIndex(v => callback(v));
      return tarIdx === -1 ? end : start + tarIdx;
    };
    const callback = (c0) => {
      /**
       * The source consults a flag table here:
       * uint8_t char_flags = character_scan_flags[c0];
       * if (MayTerminateString(char_flags)) return true;
       */
      if (['\'', '"'].includes(UnicodeToAsciiMapping[c0])) return true;
      this.AddLiteralChar(c0);
      return false;
    };
    /** Searches for the position of the first possible terminator, such as ' or " */
    let next_cursor_pos = find_if(this.source_.buffer_, this.source_.buffer_cursor_, this.source_.buffer_end_, callback);
    if (next_cursor_pos === this.source_.buffer_end_) {
      this.source_.buffer_cursor_ = this.source_.buffer_end_;
      this.c0_ = null;
    } else {
      this.source_.buffer_cursor_ = next_cursor_pos + 1;
      this.c0_ = this.source_.buffer_[next_cursor_pos];
    }
  }
}
The string is effectively traversed here, but only with a crude scan. Usually this one pass finishes the whole string, but there are special cases such as "ab'c'd" or "abc\"d", where candidate terminators appear mid-string; in those cases adding the character has to be handed back to the outer loop in ScanString.

There is also a mapping table called character_scan_flags, which classifies every possible character. For example, looking up the character z yields kCannotBeKeyword, meaning a token containing it cannot be a keyword, which lets some checks be skipped quickly. Likewise, the ' and " characters are flagged as possible string terminators, which is what is used here. The table's construction is quite involved, and I haven't fully worked it out.
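character_scan_flags can be pictured as one more code-point-indexed table, this time of bit flags. A toy version — the flag names are borrowed from the text, the keyword list is just a sample, and the real table encodes many more properties:

```javascript
// Toy bit-flag table indexed by code point, in the spirit of character_scan_flags.
const kCannotBeKeyword = 1 << 0;
const kMayTerminateString = 1 << 1;

const sampleKeywords = ['break', 'case', 'const', 'do', 'else', 'for',
                        'function', 'if', 'let', 'new', 'return', 'var', 'while'];
const keywordChars = new Set(sampleKeywords.join(''));

const flags = new Uint8Array(128);
for (let i = 0; i < 128; i++) {
  const c = String.fromCharCode(i);
  // A character appearing in no keyword can never be part of one
  if (!keywordChars.has(c)) flags[i] |= kCannotBeKeyword;
  // Quotes, backslash and newline may end (or interrupt) a string scan
  if (c === '\'' || c === '"' || c === '\\' || c === '\n') flags[i] |= kMayTerminateString;
}

const mayTerminateString = (code) => (flags[code] & kMayTerminateString) !== 0;

console.log(mayTerminateString("'".charCodeAt(0)));               // true
console.log((flags['z'.charCodeAt(0)] & kCannotBeKeyword) !== 0); // true, as in the text
```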

At this point the string token is fully parsed and a Token::STRING is returned as its type description. Of course this single token means nothing on its own and would be ignored by itself; combined with the plus operator and another string, though, it evolves into a binary expression — but that's a story for later.

Finally, a test run; the unimplemented methods should be commented out before executing.

let source_code = "'Hello' + 'World'";
let scanner = new Scanner(source_code);
scanner.Initialize();
console.log(scanner);
The result is shown here.

TokenDesc is later wrapped into higher-level classes and finally goes into the abstract syntax tree, but that's another story — as are how strings are stored, hash tables, and the rest.