The previous article, "Compiler principles in practice: hand-writing a formula calculator," explained how to build a simple formula calculator, gave a first-hand feel for writing a parser, and provided an initial understanding of syntax analysis. This article answers the questions left open there: how to eliminate left recursion, and how to ensure correct precedence and associativity.

But before we get started, let’s briefly review what left recursion, precedence, and associativity are.

In the grammar for binary expressions, if the first element of a production is the non-terminal being defined itself, a naive recursive-descent parser recurses indefinitely. This situation is called left recursion. For example, the production for an addition expression, “additive expression + multiplicative expression”, is left recursive. Precedence and associativity are core concepts for expressions in computer languages, and both come down to how the grammar rules are designed.

Writing grammar rules and deriving with them

Grammar rules are expressed as a context-free grammar, which consists of a set of substitution rules (also known as productions). For example, the grammar rules for arithmetic expressions can be written in the following form:

add -> mul | add + mul
mul -> pri | mul * pri
pri -> Id | Num | (add) 

Using the productions above, add can be replaced by mul, or by add + mul. This process of substitution is also called “derivation”. Take the two arithmetic expressions “2 + 3 * 5” and “2 + 3 + 4” as examples; their derivations are shown in the figure below:

You can see clearly from the derivations above how these two expressions are generated. The tree formed during the analysis is in fact the AST. However, a handwritten parser usually simplifies AST generation by omitting unnecessary intermediate nodes. For example, the “add-add-mul-pri-num” branch can be simplified to “add-num”. Simplifying the AST like this is itself a way of optimizing the compilation process; without the simplification, the result would look like the trees above.

So what are the leaf nodes of the two trees in the figure above? Num, +, and * are all terminals, and terminals are the tokens produced by lexical analysis. The non-leaf nodes are non-terminals. Grammar derivation is the process of repeatedly replacing non-terminals until the final result contains no non-terminals, only terminals.
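For example, the terminals of “2 + 3 * 5” are the tokens produced by lexical analysis. A sketch of what that token list might look like (the token shape here is hypothetical; the actual structure depends on your lexer):

```javascript
// Hypothetical token objects for "2 + 3 * 5", as a lexer might produce them.
// These five tokens are the terminals, i.e. the leaf nodes of the parse tree.
const tokens = [
    { type: 'IntLiteral', text: '2' },
    { type: 'Plus',       text: '+' },
    { type: 'IntLiteral', text: '3' },
    { type: 'Star',       text: '*' },
    { type: 'IntLiteral', text: '5' },
];
```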

In practice, grammar rules are often written in the following form:

add ::= mul | add + mul
mul ::= pri | mul * pri
pri ::= Id | Num | (add) 

This notation is called Backus-Naur Form, or BNF. Antlr and Yacc grammars are both written this way.

There is also a term you sometimes hear: Extended Backus-Naur Form (EBNF). The main difference from plain BNF is that EBNF rules can be written like regular expressions. For example, the following rule uses the * sign to indicate that the parenthesized part may repeat zero or more times:

add -> mul (+ mul)*

In fact, this is equivalent to the standard BNF notation, just more concise. Why is it equivalent? Because repeating a term many times is equivalent to deriving it recursively. A further corollary is that context-free grammars subsume regular grammars and can express more than regular grammars can.

How to ensure correct precedence

Earlier I showed how the addition rule derives to the multiplication rule, which ensures that multiplication nodes in the AST always sit below addition nodes, and therefore that multiplication takes precedence over addition.

Let’s look at the syntax rules for common expressions:

exp -> or | or = exp
or -> and | or || and
and -> equal | and && equal
equal -> rel | equal == rel | equal != rel
rel -> add | rel > add | rel < add | rel >= add | rel <= add
add -> mul | add + mul | add - mul
mul -> pri | mul * pri | mul / pri

The precedence expressed here, from lowest to highest, is: assignment (exp), logical OR (or), logical AND (and), equality comparison (equal), relational comparison (rel), addition (add), multiplication (mul), and primary expression (pri). Real languages have many more precedence levels, such as bitwise operations. Precedence can also be changed: we usually use parentheses to change the order of evaluation. But how is that expressed in the grammar?

In fact, at the lowest level, in the highest-precedence primary expression (pri), an expression wrapped in parentheses is referred to recursively. That way, whenever the parser sees parentheses while parsing an expression, it knows that this part binds tightest, and the precedence change is achieved:

pri -> Id | Literal | (exp)
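To see how this recursion changes precedence in practice, here is a minimal sketch of the idea (a toy evaluator assuming single-character tokens, not the article’s parser): when pri meets “(”, it recurses back to the lowest-precedence rule, so the parenthesized part is finished before the surrounding operators apply.

```javascript
// Toy evaluator for +, * and parentheses over single-digit numbers.
// pri -> Num | (exp): on '(' the parser recurses to the lowest-priority rule.
function evaluate(expr) {
    let pos = 0;
    const peek = () => expr[pos];

    function exp() {              // lowest precedence: addition
        let value = mul();
        while (peek() === '+') { pos++; value += mul(); }
        return value;
    }
    function mul() {              // higher precedence: multiplication
        let value = pri();
        while (peek() === '*') { pos++; value *= pri(); }
        return value;
    }
    function pri() {              // highest precedence: number or (exp)
        if (peek() === '(') {
            pos++;                // consume '('
            const value = exp();  // recurse back to the lowest-precedence rule
            pos++;                // consume ')'
            return value;
        }
        return Number(expr[pos++]);  // single-digit number literal
    }
    return exp();
}

// evaluate('2+3*5') → 17, but evaluate('(2+3)*5') → 25:
// the parentheses changed the effective precedence.
```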

So far, you are able to write a complete set of expression rules and have your formula calculator support them. As a bonus, when using a language, if you are unsure of the exact precedence of its operators, you now have a new skill besides consulting the usual references: read the language’s grammar rules, which may be written in BNF or EBNF style.

How to ensure correct associativity

In the previous article, our parser evaluated the arithmetic expression “2+3+4” by computing “3+4” first and then adding 2 — from right to left instead of from left to right.

That is operator associativity. What is associativity? Whether operators of the same precedence are evaluated from left to right or from right to left is called associativity. Common arithmetic operations such as addition, subtraction, multiplication, and division are left-associative. The dot operator for member access is also left-associative.

For example, “rectangle.center.x” first obtains the center of a rectangle and then the x coordinate of that point; the order of evaluation is left to right. Are there examples of right associativity? Of course. Assignment is a typical right-associative operation, as in “x = y = 10”.

Let’s review why the evaluation order of “2+3+4” was wrong. Parsing this expression with the incorrect, right-recursive grammar yields the simplified AST below:

This AST (a structured representation of a program in a computer language, and the basis for all subsequent work such as semantic analysis and translation to object code) gives the wrong order of evaluation. But if we write the recursive term on the left-hand side instead, we don’t get this associativity error. So we have a rule: for left-associative operators, put the recursive term on the left; for right-associative operators, put the recursive term on the right.

So when you write the rule for adding expressions, it looks like this:

add -> mul | add + mul   
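To see why this rule breaks a naive recursive-descent parser, consider a direct (hypothetical) translation of the rule into code: the function calls itself before consuming any token, so it never terminates.

```javascript
// Hypothetical direct translation of "add -> add + mul": the first step of
// matching "add + mul" is to match an "add", so the function calls itself
// before reading a single token and recurses until the stack overflows.
function additive(tokens) {
    let left = additive(tokens);  // infinite recursion: no token consumed yet
    // ...the code that would match '+' and a multiplicative expression
    // is never reached
    return left;
}
```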

This is the lesson learned from that mistake. But here is the problem: most binary operations are left-associative, so doesn’t that force left recursion everywhere? Don’t worry — we can solve this by rewriting the left-recursive grammar.

Eliminate left recursion

Left recursion was mentioned above, along with the fact that the recursive-descent algorithm cannot handle it. It should be added that not every algorithm fails on left recursion; others, such as LR parsing algorithms, can handle it.

There is a standard method for rewriting a left-recursive grammar into a non-left-recursive one. Take the addition rule “add -> add + mul” as an example; after rewriting, it becomes:

add -> mul add'
add' -> + mul add' | ε
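This is an instance of the general textbook rewrite: a left-recursive rule of the form A -> A α | β can always be transformed into an equivalent right-recursive pair:

```
A  -> β A'
A' -> α A' | ε
```

For the addition grammar, A is add, α is “+ mul”, and β is “mul”.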

In grammars, ε (pronounced “epsilon”) denotes the empty string. Next, we derive the expression “2+3+4” again using the rewritten rules, obtaining the result on the left side of the figure below:

The parse tree on the left is the result of this derivation. The problem is that because the add' rule is right-recursive, a standard recursive-descent implementation produces the same associativity error we saw in the last article. The AST we want is the one on the right, which is correctly left-associative. Is there a solution?

The answer is yes. Look closely at the derivation under the grammar rules above: only the first step uses the add rule; every step after that uses the add' rule, all the way to the end.

If expressed in EBNF, which allows the * and + signs to indicate repetition, the two rules above can be merged into one:

add -> mul (+ mul)* 

The nice thing about writing it this way is that it suggests a better implementation: the (+ mul)* part can be written as a loop rather than as recursive calls. The pseudocode is as follows:

mul();
while(next token is +){
  mul()
  createAddNode
}

In the study of recursive functions there is a concept called tail recursion: a tail-recursive function is one whose last action is the recursive call to itself.

Compilers usually turn tail recursion into a loop, using the same principle as the pseudocode above. Loops consume fewer system resources than recursive calls, so converting tail recursion into a loop is itself a compiler-optimization technique.
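As an illustration (a generic example, not from the article), here is a tail-recursive function and its loop equivalent; the transformation is the same one we applied to the add' rule:

```javascript
// Tail-recursive sum of 1..n: the recursive call is the function's last action,
// and all pending work is carried in the accumulator.
function sumRecursive(n, acc = 0) {
    if (n === 0) return acc;
    return sumRecursive(n - 1, acc + n);  // tail call
}

// The loop equivalent: each iteration plays the role of one tail call.
function sumLoop(n) {
    let acc = 0;
    while (n > 0) {
        acc += n;
        n -= 1;
    }
    return acc;
}

// sumRecursive(10) and sumLoop(10) both return 55.
```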

Let’s return to left recursion. We now know how to implement the algorithm for this left-recursive rule; it looks something like this:

additive(tokens) {
    let child1 = this.multiplicative(tokens);  // Parse the first mul
    let node = child1;

    if (child1 !== null) {
        while (true) {  // Apply the add' rule in a loop
            let token = tokens.peek();
            if (token !== null && token.getType() === TokenType.Plus) {
                token = tokens.read();  // Consume the plus sign
                let child2 = this.multiplicative(tokens);  // Parse the next mul
                node = new SimpleASTNode(ASTNodeType.Additive, token.getText());
                node.addChildren(child1);  // The new node sits on top, ensuring left associativity
                node.addChildren(child2);
                child1 = node;  // The combined node becomes the left child of the next '+'
            } else {
                break;
            }
        }
    }
    return node;
}
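To check that the loop really produces a left-associative tree, here is a self-contained sketch of the same idea, with a simplified token reader and plain-object AST nodes (hypothetical stand-ins for the article’s TokenReader and SimpleASTNode):

```javascript
// Minimal token stream over an array of token texts.
function makeTokens(texts) {
    let pos = 0;
    return {
        peek: () => (pos < texts.length ? texts[pos] : null),
        read: () => texts[pos++],
    };
}

// Simplified: treat every non-plus token as an integer literal.
function multiplicative(tokens) {
    return { type: 'IntLiteral', text: tokens.read(), children: [] };
}

function additive(tokens) {
    let child1 = multiplicative(tokens);
    let node = child1;
    while (tokens.peek() === '+') {   // loop instead of right recursion
        tokens.read();                // consume '+'
        const child2 = multiplicative(tokens);
        node = { type: 'Additive', text: '+', children: [child1, child2] };
        child1 = node;                // the new node is the left child of the next '+'
    }
    return node;
}

const ast = additive(makeTokens(['2', '+', '3', '+', '4']));
// The root's left child is the subtree for "2+3", so the
// evaluation order is (2+3)+4 — correctly left-associative.
```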

After that, run the parser again to parse “2+3+4” and get the correct AST:

After modification, re-execute the code and the result is as follows:

That solves the left-recursion problem. Left recursion is one of the biggest “obstacles” to writing parsers with the recursive-descent algorithm; once you clear it, the road ahead becomes much smoother.

Key knowledge of this article:

  • Precedence is determined by the level of a rule in the grammar: the lower an operator’s precedence, the earlier its rule is tried during derivation.
  • Associativity is related to left and right recursion: left recursion yields left associativity, right recursion yields right associativity.
  • Left recursion can be avoided by rewriting the grammar rules; the rewritten rules can be expressed concisely in EBNF, which in turn inspires us to use a loop instead of right recursion.

Related questions and answers

1. What are the following substitution rules for?

add -> mul | add + mul
mul -> pri | mul * pri
pri -> Id | Num | (add)

Substitution is derivation: keep substituting and you keep deriving. If a sentence can be derived from a grammar, we say the sentence conforms to that grammar. For example, 2 + 3 * 5 conforms to the grammar above. What we call parsing is actually the reverse of derivation: working out how the sentence was derived.

2. Why are the substitution rules designed this way?

That is a question of grammar design, which follows the characteristics of the problem domain. For example, if you were to design a grammar for Chinese, you would know a sentence has three parts: subject, predicate, and object. Expressions divide into addition, subtraction, and multiplication operations with different precedence, so we use the rules above to express them. The way to verify the design is to check whether it can derive all possible expressions.

Compiler principles in practice, part 4: implementing a simple scripting language with JS

This article is based on Geek Time’s course on compiler principles.