Through this article, you will be able to grasp the principles of Parsing and Recursive Descent algorithms, and get a rudimentation of context-free Grammar (CFG).

The content of this article is borrowed from the geek time compiler principle course, if you are interested, you can buy the course. The main purpose of this article is to hope that more people can understand the compiler, and through learning to use JS to achieve the formula calculator. The source code is at the bottom.

The realized formula calculator supports simple arithmetic operations of addition, subtraction, multiplication and division, and can support such operations as 2 + 3 * 5. This expression looks simple, but you can learn a lot about the principles of parsing, such as left recursion, precedence, and associativity.

To implement the above expression, we must be able to parse its syntax. But before I do that, I want to parse the syntax of variable declarations so that I can get up to speed on parsing.

Parsing variable declaration statements: Understand the meaning of “drop”

The result of parsing is an AST. Algorithms are divided into top-down and bottom-up algorithms, among which recursive descent algorithm is a common top-down algorithm.

This is illustrated in the figure belowint age = 45(The following diagrams and explanations are based on Java int types, and the actual code, including the code in future articles, is using js let)

We first of all the variable declaration statement rules, with a formal way to express. To its left is a non-terminal. On the right is its Production Rule. In parsing, the left side is replaced by the right. If the substitution is followed by non-terminals, the substitution process continues until all terminals, also known as tokens, are present. Only terminal runes can be leaf nodes of an AST. This process, also known as Derivation, is:

intDeclaration : Int Identifier ('='additiveExpression)? ;Copy the code

As you can see, the declaration of a variable of type int requires an int Token, a variable identifier, and an optional assignment expression. We translate the above grammar into program statements with pseudocode as follows:

/ / pseudo code
MatchIntDeclare(){MatchToken (Int);// Matches the Int keyword
  MatchIdentifier();       // Match the identifier
  MatchToken(equal);       // Match the equals sign
  MatchExpression();       // Match the expression
}
Copy the code

Let’s explain some of the rankings we just mentioned:

  • Nonterminals are symbols used to represent grammatical elements, sometimes called “grammatical variables”. A word enclosed in <> that appears at the left of a rule and represents a grammatical concept. For example, < sentence >, < noun phrase >, < verb phrase >, < noun >.

  • Production describes the general form of a method production that combines terminals and non-terminals into strings. α→βα→β, read :α is defined as β. For example, < sentence > ->< noun phrase >< verb short >

This article has a detailed explanation of the basic knowledge of the principle of compilation

The actual code snippet is shown below

intDeclare(tokens) {
    let node = null;
    / / to proofread
    let token = tokens.peek();
    if(token ! = =null && token.getType() === TokenType.Int) {
        / / match Int
        token = tokens.read(); // consume int
        token = tokens.peek(); // consume int
        if (token.getType() === TokenType.Identifier) {
            token = tokens.read(); // Consume the identifier
            node = new SimpleASTNode(ASTNodeType.IntDeclaration, token.getText());
            token = tokens.peek(); / / to proofread
            if(token ! = =null && token.getType() === TokenType.Assignment) {
                tokens.read(); // use up the equals sign
                const child = this.additive(tokens); // Matches an expression
                if (child === null) {
                    throw new Error(
                        "invalide variable initialization, expecting an expression"
                    );
                } else{ node.addChildren(child); }}}else {
            throw new Error("variable name expected");
        }
        if(node ! = =null) {
            token = tokens.peek();
            if(token ! = =null && token.getType() === TokenType.SemiColon) {
                tokens.read();
            } else {
                throw new Error("invalid statement, expecting semicolon"); }}}return node;
}
Copy the code

A brief description of the above algorithm:

When parsing variable declarations, I first check to see if the first Token is an int. If so, I create an AST node, write down the variable name after int, and then see if it’s followed by the initialization part, which is equal plus an expression. Let’s check if we have an equal sign, and if we do, we’ll match an expression.

We typically create one child node for each part of a production, such as variable declarations, which create four children: the int keyword, identifier, equal sign, and expression. This is how the later tools rigorously generate the AST. But I’ve simplified it here, and I’ve only generated one child node, which is the expression node. The variable names are recorded in the ASTNode text value, and the other two child nodes are discarded without providing additional information.

Also, as we can see from the above code, the program is read sequentially from a Token stream. The peek() method in the code is preread (preread with the peek() method, which is actually an optimization of the code) and simply reads the next Token, but does not remove it from the Token stream. In code, we use the peek() method to look ahead to see if the next Token is an equal sign to see if the following Token is an expression. The read() method is removed from the Token stream and the next Token becomes the current Token.

We write the algorithms for parsing variable declarations and expressions as functions. During parsing, these functions are called to pattern match the following Token string. If a match is made, an AST node is returned, otherwise NULL is returned. If it is found to be inconsistent with the syntax rules, a compilation error is reported.

In this process, the superior grammar nested the inferior grammar, and the superior algorithm called the inferior algorithm. In AST generation, the upper-level algorithm generates the upper-level node, and the lower-level algorithm generates the lower-level node. That’s what “going down” means.

Analyzing the pseudocode and program statements above, you can see that the program structure is essentially isomorphic to the grammar rules. That’s the beauty of the recursive descent algorithm, it’s very intuitive.

Let’s run the sample program and print the AST:

Programm Calculator
    IntDeclaration age
        AssignmentExp =
            IntLiteral 45

Copy the code

The previous grammar and algorithm are simple, and this level of grammar does not exceed the regular grammar. In other words, not beyond the grammar that we use for lexical analysis.

Ok, now that we’ve parsed the variable declarations and given you an understanding of what falling means, let’s look at how to describe an arithmetic expression using a context-free grammar. .

Use context-free grammar to describe arithmetic expressions

When parsing arithmetic expressions, we encounter more complex situations where regular grammar is not enough and we must use context-free grammar. So: “Why can’t regular grammars represent arithmetic expressions?” Let’s look at the syntax of arithmetic expressions.

Arithmetic expressions contain both addition and multiplication (for simplicity, we treat subtraction as addition and division as multiplication), and addition and multiplication have different priorities. Our rule should match every possible arithmetic expression:

  • 2 + 3 * 5
  • 2 * 3 + 5
  • 2 * 3

We divide the rules into two levels: the first level is the addition rule, and the second level is the multiplication rule. The multiplication rule is regarded as the subrule of the addition rule, so that when the AST is parsed, the multiplication node must be the child node of the addition node, and thus be computed first.

additiveExpression
    :   multiplicativeExpression
    |   additiveExpression Plus multiplicativeExpression
    ;
 
multiplicativeExpression
    :   IntLiteral
    |   multiplicativeExpression Star IntLiteral
    ;
Copy the code

What about the above DSL(domain specific Language)?

This is actually the grammar rule, expressed in BNF. Take Addtive, which has two productions. production1: a multiplication expression generation expression2: an addition + multiplication expression. By the combination of the two productions above, especially the production2To deduce all arithmetic expressions of addition, subtraction and multiplication. For example, for2*3In this expression, we use production1. for2+3*5, using the production form2. The syntax rules I used above are actually written in the same way as the Antlr tool I will use later. So you can also write, is the general teaching materials on writing: A - > M | A + M M - > int | M * int each non-terminal us only the representatives of A capital letter, more concise. I use longer words in my manuscript to make them easier to understand. The vertical line is one of the choices. You can also detach it into the simplest form4Rule: A -> M A -> A + M M -> int M -> M * int You need to be able to read and switch between different writing stylesCopy the code

Support for priority can be achieved through the nesting of grammar. So when we parse the arithmetic expression “2 + 3 * 5”, we form an AST similar to the following:

If you want to evaluate the expression, you only need to evaluate the root node. To complete the evaluation of the root node, we need to evaluate the lower nodes recursively, so we first complete “3 * 5 = 15”, and then calculate “2 + 15 = 17”. When parsing arithmetic expressions, you can match them with the rule of addition. In addition rules, the multiplication rules are nested. The priority of computation is realized through the nesting of grammars.

It should be noted that addition rules are recursively referenced in addition rules. Through this recursive definition, we can expand and form all possible arithmetical expressions. For example, “2+3*5” derivation process:

-->additiveExpression + multiplicativeExpression
-->multiplicativeExpression + multiplicativeExpression
-->IntLiteral + multiplicativeExpression
-->IntLiteral + multiplicativeExpression * IntLiteral 
-->IntLiteral + IntLiteral * IntLiteral
Copy the code

This grammar can no longer be rewritten into regular grammar, but is more expressive than regular grammar and is called “context-free grammar”. Regular grammars are a subset of context-free grammars. The difference is that context-free grammars allow recursive calls while regular grammars do not.

Context-free means that the rules for deriving grammar are the same in all cases. For example, you might use an arithmetic expression to initialize a variable in a variable declaration statement, and you might use an arithmetic expression elsewhere. The syntax of arithmetic expressions is the same everywhere, allowing addition and multiplication, and the priority of computation remains the same. Fortunately, most computer languages we see can express their syntax using context-free grammars.

There are contextual situations that need to be addressed, but they are not handled in the parsing phase, they are handled in the semantic phase.

Parsing arithmetic expressions: Understand what “recursion” means

Our previous algorithm only used “descent”, did not involve “recursion”, now, let’s see how to use recursive algorithm to translate recursive grammar.

First, translate the grammar visually into an algorithm as described above. But you’re going to have an infinite number of calls. For simplicity, we use the following simplified grammar, removing the layers of multiplication:

additiveExpression
    :   IntLiteral
    |   additiveExpression Plus IntLiteral
    ;
Copy the code

When parsing the simplest addition expression “2 + 3”, we intuitively translate it into an algorithm, and the result appears as follows:

  • First matches whether an integer literal is, and finds that it is not;
  • And then the match is not an addition expression, in this case it’s a recursive call;
  • I repeat the above two steps, endlessly.

The specific process is as follows:

  • additive -> IntLiteral | additive Intliteral ;
  • First time: additive->IntLiteral, but since there are still tokens left, this derivation will fail and will be reversed. Each time a match fails, the read token is returned and another generation is tried. This is backtracking.
  • Second time: additive->additive->IntLiteral, same failure.
  • Third time: additive->additive->additive->IntLiteral.
  • Fourth time:….

The first part of the grammar rule “additiveExpression Plus multiplicativeExpression” recursively refers to itself, which is called left recursion. Through the above analysis, we know that left recursion cannot be handled by recursive descent algorithm, which is the biggest problem of recursive descent algorithm.

How do you solve it? How about “additiveExpression” after the plus sign? Let’s give it a try.

additiveExpression
    :   multiplicativeExpression
    |   multiplicativeExpression Plus additiveExpression
    ;
Copy the code

Then rewrite it to an algorithm that really doesn’t have the problem of infinite calls:

additive(tokens) {
    let child1 = this.multiplicative(tokens);
    let node = child1;

    let token = tokens.peek();
    if(child1 ! = =null&& token ! =null) {
        if (
            token.getType() === TokenType.Plus ||
            token.getType() === TokenType.Minus
        ) {
            token = tokens.read();
            const child2 = this.additive(tokens);
            if (child2) {
                node = new SimpleASTNode(ASTNodeType.Additive, token.getText());
                node.addChildren(child1);
                node.addChildren(child2);
            } else {
                throw new Error(
                    "invalid additive expression, expecting the right part."); }}}return node;
}
Copy the code

Read the algorithm above:

Try to see if you can match the multiplication expression. If not, the node is definitely not an addition node, because both producers of the addition expression must first match the multiplication expression. In this case, null is returned and the caller failed. If the multiplication expression matches, then try again to match the part to the right of the plus sign, which is to recursively match the addition expression. If the match is successful, an additive ASTNode is constructed and returned.

Similarly, the grammatical rules for multiplication can be similarly rewritten:

multiplicativeExpression
    :   IntLiteral
    |   IntLiteral Star multiplicativeExpression
    ;
Copy the code

Now that we seem to have solved the left recursion problem, run the algorithm to parse “2+3*5” and get the following AST:

Programm Calculator
    AdditiveExp +
        IntLiteral 2
        MulticativeExp *
            IntLiteral 3
            IntLiteral 5
Copy the code

Does everything seem normal? But what if the program were to parse “2+3+4”?

Programm Calculator
    AdditiveExp +
        IntLiteral 2
        AdditiveExp +
            IntLiteral 3
            IntLiteral 4
Copy the code

What’s the problem? The calculation order is out of order. It is the associativity rule of addition that continuous expressions are evaluated from left to right. But in the AST that we’ve generated, it goes from right to left, so we calculate 3 plus 4, and then we add to 2. That won’t do!

Why do these problems arise? It is because we have changed the grammar by switching the parts to the left and right of the plus sign. What are the effects? You can derive the process for “2+3+4” :

  • The first call to multiplicative() matches the function multiplicative(), which returns a literal node 2.
  • And let’s see if we can recursively match the addition expression on the right-hand side.
  • The result of the match actually returns an addition expression of “3+4”, which becomes the second child node. That’s where the mistake lies. In such a matching order, “3+4” will definitely become the child nodes, which will be computed first in the evaluation.

So, our previous approach doesn’t solve left recursion perfectly, because it changes the associativity rules for addition. So, can we solve the left recursion problem without getting the order of the computation wrong? The answer is yes. But I’ll deal with that in the next lecture.

Implement expression evaluation

Now that you understand what “recursion” means, I’m going to take you through evaluating expressions. In fact, to evaluate an expression, you only need to evaluate it based on the AST. The calculation process is relatively simple, just need to do depth first traversal of the tree.

Depth-first traversal is also a recursive algorithm. Take the AST of “2 + 3 * 5” as an example.

  • Evaluating an expression is equivalent to evaluating the AST root node.
  • So let’s take the child on the left hand side, which is 2.
  • Then we evaluate the right-hand child, and we recursively compute the next level. When we’re done, we return 15 times 3 times 5.
  • Add the left and right nodes and calculate the value 17 for the root node.

Take “2+3*5” as an example. The output of the evaluation process is as follows, and you can see that the evaluation traverses the entire tree:

Calculating: AdditiveExp          // Compute the root node
        Calculating: IntLiteral      // Compute the first child node
        Result: 2                     // The result is 2
        Calculating: MulticativeExp   // Compute the second child recursively
            Calculating: IntLiteral
            Result: 3
            Calculating: IntLiteral
            Result: 5
        Result: 15                // Ignore the details of the recursion, the result is 15
    Result: 17                    // The root value is 17
Copy the code

To summarize the main points of this article:

  1. Have a preliminary understanding of context-free grammar, know that it can express the mainstream computer language, and the difference with regular grammar.
  2. Understand “descent” and “recursion” in recursive descent algorithms. It is basically isomorphic with grammar rules, and algorithms can be written by grammar.
  3. Gain a deeper understanding of the execution mechanism of computer programs by iterating the AST to evaluate expressions.

Recursive algorithm is a very good top-down problem solving method, is a core way of thinking in the field of computer. Having this way of thinking is an advantage for programmers over non-programmers.

Code implementation

This article uses JS to achieve a version of the formula calculator, source address: SimpleCalculator

Execute the following test code and results:

let str = 'let a = b + 3; ';
console.log(Parse variable declaration statements:, str);
const lexer = new SimpleLexer();
let tokens =lexer.tokenize(str)
dump(tokens);
try {
    tokens = lexer.tokenize(str);
    const node = this.intDeclare(tokens);
    console.log('Let a = b + 3 AST:')
    this.dumpAST(node);
} catch (error) {
    console.log(error);
}
// Test the expression
str = "2 + 3 * 5";
console.log("\n calculate:" + str + "The AST");
this.evaluate(str);
Copy the code

Question and answer

What are the differences and connections between grammar and grammar?

Grammar is a term for Formal Language. So we have the expression Formal Grammar. The grammar here has clearly defined rules. For example, our lexical rules, grammar rules, and attribute rules are defined using formal grammar. Regular Grammar, context-free Grammar, etc., are used to describe morphology and Grammar.

Syntax is used to describe how words form sentences. The grammatical rules of a language, usually referred to as this Syntax.

The problem is, the word Grammar is also called Grammar in many contexts of Chinese. This is where confusion can arise. We just have to be careful when we use it. For example, if I make a rules file with a bunch of Lexer Grammar, I say, this is a Lexer file, or a Lexer Grammar file. At this point, calling it a syntax rules file is a bit ambiguous. Because there are no Syntax rules.

Babel, Node.js compilation mechanism

Babel, just language translation, just front-end technology. Translate into AST, complete semantic analysis, and then into another version of JS. Node.js is based on V8 and not only does front-end work, but more importantly optimizes the back-end runtime.

Resources: the latest compilation principle to learn the whole walkthrough