ANTLR: Play parsing in the browser

Zqlu Ant Financial · Data Experience Technology team

One, foreword

In the front-end development, usually mentioned syntax parsing and other functions, which are responsible for the back-end interface, front-end call. Can the front-end handle the parsing itself and run it in the browser? The answer is yes. This article describes a simplified language called Expr and does error validation, execution, and translation of input Expr code in the browser.

The simplified Expr language

First of all, let’s introduce our Expr language. Expr language is a simplified C-like language assumed in this paper. The code snippet is as follows:

a = 2;
b = 3
c = a * (b + 2);
d = a / (c - 1);

a; // It should print 2
b; // It should print 3
c;
d;
Copy the code

The first four lines act as familiar assignment expressions, while the last four act as print statements that print the values of each variable in sequence, including the familiar comments.

The question is, can we parse Expr syntax code and interpret it on the front end? If there are errors in the code snippet, can we give an accurate error message? More interesting question, our Expr language uses infix expressions, can we translate the input code into prefix expression code? Can we format the input snippet? Can we translate Expr code into source code for other languages?

First of all, let’s look at the final Demo. We can input the code of Expr language in the code box of Demo page, click the execute button, that is, we can see the result after execution:

These functions are in the browser using ANTLR V4 to complete the syntax, and finally realize the syntax validation, interpretation and execution functions.

3. First acquaintance with ANTLR V4

The introduction of ANTLR

Antlr4 is a powerful parser generation Tool that can be used to read, process, execute and translate structured text or binary files.

Antlr4 generates a parser that contains both a lexical parser and a syntax parser, which, yes, is what you’ll find in the principles of Compiling course. After years of front-end writing, we only need to know that a lexical parser is a program that converts a sequence of input code characters into a sequence of tokens, whereas a parser is a program that converts a sequence of tokens into a syntax tree. The good news is that Antlr4, with its syntactic definition, can generate parser source for us, not just Java source, but JavaScript and TypeScript source for our front-end convenience. Yes, in this article, we will use antlr4-generated TypeScript parser to parse our Expr code.

ANTLR V4 is installed for use

To install and use ANTLR V4, you can refer to Getting Started with ANTLR V4 on Github. In a nutshell, using ANTLR V4 is generally divided into three steps:

Write the syntax definition file of the language to be parsed according to ANTLR V4. The ANTLR V4 syntax definition of the mainstream language can be found in the ANTLR/grammars-V4 warehouse. Generally, g4 is the suffix of the syntax definition file
Run the ANTLR tool to generate the parser source for the specified target language
Use the generated parser to complete code parsing, etc

Expr language syntax definition

According to the above introduction, in order to explain and execute our Expr language, the first step is to define Expr language syntax definition file expr.g4 according to ANTLR V4 specification. Here is a brief introduction to The syntax definition of ANTLR V4. For more details, please refer to ANTLR author’s The Definitive ANTLR 4 Reference.

Grammar rules

The syntax definition file of ANTLR V4 takes the form of a syntax declaration statement and a set of syntax rules, roughly structured as follows:

grammar Name; Declare the syntax named Name

# Define grammar rules once
rule1
rule2
...
ruleN
Copy the code

The structure of each grammar rule is as follows:

ruleName: alternative1 | alternative2 | alternative3 ;

This grammar rule that a named ruleName rules, including | watch called branches, namely the change rule can match any of the three branches.

Finally, ANTLR V4’s syntax rules are divided into Lexer and Parser rules: The lexical rules define how to convert a sequence of code strings into a sequence of tokens; Syntax rules define how to convert a sequence of tags into a syntax tree. In general, the rule name of a lexical rule is named with a capital letter, while the rule name of a grammatical rule begins with a lowercase letter.

Expr grammar

The syntax Expr. G4 is defined as follows:

grammar Expr;

prog
    : stat+ ;

stat
    : exprStat
    | assignStat
    ;

exprStat
    : expr SEMI
    ;

assignStat
    : ID EQ expr SEMI
    ;

expr
    : expr op = (MUL | DIV ) expr   # MulDivExpr
    | expr op = ( ADD | SUB ) expr   # AddSubExpr
    | INT                       # IntExpr
    | ID                        # IdExpr
    | LPAREN expr RPAREN        # ParenExpr
    ;

MUL     : '*' ;
DIV     : '/' ;
ADD     : '+' ;
SUB     : '-' ;
LPAREN  : '(' ;
RPAREN  : ')' ;

ID      : LETTER (LETTER | DIGIT)*  ;
INT     : [0-9]+ ;
EQ      : '=' ;
SEMI    : ';' ;
COMMENT : '//' ~[\r\n]* '\r'? '\n'? -> channel(HIDDEN);
WS      : [ \r\n\t]+ -> channel(HIDDEN);

fragment
LETTER  : [a-zA-Z] ;
fragment
DIGIT   : [0-9] ;
Copy the code

As you can see, the syntax definition is designed in a top-down way. Our Expr code takes the rule PROg as the root rule, and prog consists of multiple statements stat. The stat statement can be either the expression exprState or the assignment assignState. The expression can be an integer, INT, or variable ID. Note that expR rules use recursive expressions, that is, expr is referenced in the definition of expR rules, which is also a feature of ANTLR V4. Finally, the lexical rules defined here include rules for addition, subtraction, multiplication and division, parentheses, variables, integers, assignments, comments, and whitespace. Note the channel(HIDDEN) defined by the COMMENT (COMMENT) and whitespace (WS) rules, which flags that our parsing needs to ignore comments and whitespace.

With the syntax Expr. Ge, we can generate the parser source we need, using antlr4ts and adding script to package.json:

"scripts": { "antlr4ts": "antlr4ts -visitor Expr.g4 -o src/parser" }, "dependencies": { "antlr4ts": "^ 0.4.1 - alpha. 0"} "devDependencies" : {" antlr4ts - cli ":" ^ 0.4.0 - alpha. 4 "and" typescript ":" ^ 2.5.3 ",},Copy the code

Execute NPM run antlr4ts to see the TypeScript source code for the generated Expr parser in the SRC/Parser directory.

Expr language interpreter

With the Expr language parser, we can use the parser to implement our Expr language interpreter, the specific need to achieve the purpose that is, input Expr language code, and finally print out the execution results.

Code and syntax trees

How to use ANTLR to interpret the Expr code that executes the input, let’s first look at ANTLR’s generated Token sequence and syntax tree for the following input code.

a = 1;
b = a + 1;
b;
Copy the code

The Token sequence obtained by lexical parsing is shown in the figure below, which is resolved into 22 tokens. Each Token contains the serial number, text and type of the Token. For example, tokens with serial number 0, text ‘a’ and type ‘ID’ match the lexical rule ID above in Expr. G4.

The syntax tree structure is shown in the figure below. The nodes in the tree correspond to the syntax rules or lexical rules defined in Expr. G4. It is important to note that all leaf nodes in the syntax tree correspond to lexical rules or character constants, which is the same as the top-down design in Expr.

As you can see, the following node is a prog rule node, and its children are three stat nodes. The first two children are assignStat nodes, and the last child is statExpr. As defined in Part 1, for this code, we need to identify the expression statement in the code and print the value of that expression. In this example, only one expression statement is used in the input code, and the expression is variable B. In order to print the value of B, we need to calculate the value of B by interpreting the first two statements (given here, the reference to the variable must follow the definition of the variable). So, the whole idea is that we need to interpret each statement in order, and remember the variables and their values that appear during the statement interpretation. In subsequent statements, if we encounter a reference to a variable, we need to look up the value of that variable.

Use the Visitor to access the syntax tree

To implement the above interpretation process, we need to traversal the syntax tree parsed by the access parser. ANTLR provides two mechanisms to access the generated syntax tree: Listener and Visitor. When the Listener mode is used to access the syntax tree, ANTLR’s internal ParserTreeWalker calls different methods of the provided Listener when it encounters different nodes in the syntax tree. When using the Visitor pattern, the Visitor needs to specify itself if a particular type of node is visited. ANTLR generates a parser source that contains the default Visitor base class/interface exprVisitor.ts, and implements the visit method on nodes of interest. For example, if we need to access the exprStat node, we just need to implement the following interface:

export interface ExprVisitor<Result> extends ParseTreeVisitor<Result> {
  ...

  /** * Visit a parse tree produced by `ExprParser.exprStat`. * @param ctx the parse tree * @return the visitor result */visitExprStat? :(ctx: ExprStatContext) = >Result; . }Copy the code

After explaining how to use a Visitor to access nodes in the syntax tree, let’s implement the Visitor required by the Expr interpreter: the ExprEvalVisitor.

We used Visitor internal variables to store the intermediate values when we accessed the syntax tree:

class ExprEvalVisitor extends AbstractParseTreeVisitor<number>
  implements ExprVisitor<number> {
  
  // Save the execution results
  private buffers: string[] = [];
  
  // Save the variable
  private memory: { [id: string] :number } = {};
  
}
Copy the code

Which nodes in the syntax tree do we need to access? First, access to the expression statement exprState is the most important for the final result. We access the expression in the expression statement to get the value of the expression and print the value to the execution result. Since an expression statement consists of an expression followed by a semicolon, we need to continue to access the expression to get the value of the statement, while semicolons are ignored:

class ExprEvalVisitor extends AbstractParseTreeVisitor<number>
  implements ExprVisitor<number> {
  
  // Save the execution results
  private buffers: string[] = [];
  
  // Save the variable
  private memory: { [id: string] :number } = {};
  
  // Access the expression statement
  visitExprStat(ctx: ExprStatContext) {
    const val = this.visit(ctx.expr());
    this.buffers.push(`${val}`);
    returnval; }}Copy the code

What is the access method for the expression phase of an expression statement? Going back to our syntactic definition expr.g4, the expression is made up of five branches, and different branches are treated differently, so we use different access methods for different branches. We added different comments at the end of different branches. These comments generated a parser that could be used to distinguish between different types of nodes. In the generated Visitor, different interfaces were seen:

export interface ExprVisitor<Result> extends ParseTreeVisitor<Result> {
  ...
  
  /** * Visit a parse tree produced by the `MulDivExpr` * labeled alternative in `ExprParser.expr`. * @param ctx the parse  tree * @return the visitor result */visitMulDivExpr? :(ctx: MulDivExprContext) = > Result;
	
	/** * Visit a parse tree produced by the `IdExpr` * labeled alternative in `ExprParser.expr`. * @param ctx the parse tree * @return the visitor result */visitIdExpr? :(ctx: IdExprContext) = > Result;

	/** * Visit a parse tree produced by the `IntExpr` * labeled alternative in `ExprParser.expr`. * @param ctx the parse tree * @return the visitor result */visitIntExpr? :(ctx: IntExprContext) = > Result;

	/** * Visit a parse tree produced by the `ParenExpr` * labeled alternative in `ExprParser.expr`. * @param ctx the parse tree * @return the visitor result */visitParenExpr? :(ctx: ParenExprContext) = > Result;

	/** * Visit a parse tree produced by the `AddSubExpr` * labeled alternative in `ExprParser.expr`. * @param ctx the parse  tree * @return the visitor result */visitAddSubExpr? :(ctx: AddSubExprContext) = >Result; . }Copy the code

So, in our ExprEvalVisitor, we implement different interfaces to access different expression branches. For the AddSubExpr branch, we implement the following access:

visitAddSubExpr(ctx: AddSubExprContext) {
  const left = this.visit(ctx.expr(0));
  const right = this.visit(ctx.expr(1));
  const op = ctx._op;

  if (op.type === ExprParser.ADD) {
    return left + right;
  }
  return left - right;
}
Copy the code

For MulDivExpr, the access method is the same. For the IntExpr branch, since its children only have INT nodes, we just need to parse the integers:

visitIntExpr(ctx: IntExprContext) {
  return parseInt(ctx.INT().text, 10);
}
Copy the code

For the IdExpr branch, the child node only has the variable ID, so we need to look up the variable in our saved variable and fetch its value:

visitIdExpr(ctx: IdExprContext) {
  const id = ctx.ID().text;
  if (this.memory[id] ! = =undefined) {
    return this.memory[id];
  }
  return 0;
}
Copy the code

The last branch, ParenExpr, is easily accessed by accessing the expression in parentheses:

visitParenExpr(ctx: ParenExprContext) {
  return this.visit(ctx.expr());
}
Copy the code

As you can see from the above access methods, we only read variables from memory, but do not write variables to memory. This requires us to access the assignment expression assignExpr node: For assignment expressions, we need to identify the variable name to the left of the equal sign and the expression to the right of the equal sign, and finally save the variable name and the value of the expression to memory:

visitAssignStat(ctx: AssignStatContext) {
  const id = ctx.ID().text;
  const val = this.visit(ctx.expr());
  this.memory[id] = val;
  return val;
}
Copy the code

Explain the implementation of Expr language

At this point, our VisitorExprEvalVisitor is ready, and we just need to use the visitor to access the parse syntax tree for the specified input code to implement the interpretation of the Expr code:

// Expr code interprets the execution function
/ / input code
// Returns the execution result
function execute(code: string) :string {
  const input = new ANTLRInputStream(code);
  const lexer = new ExprLexer(input);
  const tokens = new CommonTokenStream(lexer);
  const parser = new ExprParser(tokens);
  const visitor = new ExprEvalVisitor();

  const prog = parser.prog();
  visitor.visit(prog);

  return visitor.print();
}
Copy the code

Expr code prefix expression translator

From the previous introduction, we have explained executing Expr code by using ANTLR. ANTLR is used to read, process, execute, and translate structured text. Can we use ANTLR to translate the input Expr code? In the Expr language, expressions are common infix expressions. Can we translate them into prefix expressions? Remember from the data structure course how to convert infix expressions into prefix expressions using out of stack and on stack? Remember, with the parser generated by ANTLR, we can also simply convert to transformations.

For example, for the following Expr code:

a = 2;
b = 3;
c = a * (b + 2);
c;
Copy the code

The result of our conversion is as follows: we do the conversion for the branch expression, but we do not do the conversion for the assignment expression, that is, the expression in the code will be converted to:

a = 2;
b = 3;
c = * a + b 2;
c;
Copy the code

Prefix Translation Visitor

Again, here we use the Visitor pattern to access the syntax tree. This time, we directly visit the root node prog and return the translated code:

class ExprTranVisitor extends AbstractParseTreeVisitor<string>
  implements ExprVisitor<string> {
  defaultResult() {
    return ' ';
  }

  visitProg(ctx: ProgContext) {
    let val = ' ';
    for (let i = 0; i < ctx.childCount; i++) {
      val += this.visit(ctx.stat(i));
    }
    returnval; }... }Copy the code

This assumes that our visitor already returned the translated code when the visitor statement stat was used, so visitProg simply concatenates the translated code for each statement. For statements, as mentioned earlier, we don’t translate them, so their visit is easy: for expression statements, just print the translated expression with a semicolon; For assignment statements, we simply translate the expression to the right of the equals sign:

VisitExprStat (CTX: ExprStatContext) {const val = this.visit(ctx.expr());return `${val}; \n`; } visitAssignStat(ctx: AssignStatContext) { const id = ctx.ID().text; const val = this.visit(ctx.expr());return `${id} = ${val}; \n`; }Copy the code

Let’s see how to translate various expressions. For AddSubExpr and MulDivExpr translation, it is the logic of the entire translator, which prefixes the operator:

visitAddSubExpr(ctx: AddSubExprContext) {
  const left = this.visit(ctx.expr(0));
  const right = this.visit(ctx.expr(1));
  const op = ctx._op;

  if (op.type === ExprParser.ADD) {
    return` +${left} ${right}`;
  }
  return` -${left} ${right}`;
}

visitMulDivExpr(ctx: MulDivExprContext) {
  const left = this.visit(ctx.expr(0));
  const right = this.visit(ctx.expr(1));
  const op = ctx._op;

  if (op.type === ExprParser.MUL) {
    return` *${left} ${right}`;
  }
  return` /${left} ${right}`;
}
Copy the code

Since parentheses are not required in prefix expressions, the access to ParenExpr needs only to remove the parentheses:

visitParenExpr(ctx: ParenExprContext) {
  const val = this.visit(ctx.expr());
  return val;
}
Copy the code

For other nodes, no more processing is required, just return the text of the node’s tag:

visitIdExpr(ctx: IdExprContext) {
  const parent = ctx.parent;
  const id = ctx.ID().text;
  return id;
}

visitIntExpr(ctx: IntExprContext) {
  const parent = ctx.parent;
  const val = ctx.INT().text;
  return val;
}
Copy the code

Perform prefix translation of code

The ExprTranVisitor is used to query the prog root node of the input code to return the translated code:

function execute(code: string) :string {
  const input = new ANTLRInputStream(code);
  const lexer = new ExprLexer(input);
  const tokens = new CommonTokenStream(lexer);
  const parser = new ExprParser(tokens);
  const visitor = new ExprTranVisitor();

  const prog = parser.prog();
  const result = visitor.visit(prog);

  return result;
}
Copy the code

For input code:

A * B + C / D ; A * (B + C) / D ; A * (B + C / D) ; (5-6) * 7;Copy the code

The output is:

+ * A B / C D; / * A + B C D; * A + B / C D; * -5 6 7;Copy the code

Seven,

With the Expr language executor above, you can see that with ANTLR V4, front-end engineers can also do a lot of parse-related things in the browser.

Want to know how we can use ANTLR to parse complex SQL code, do syntactic validation of SQL code, and format SQL scripts with ANTLR? Can focus on column or send your resume to ‘tao.qit####alibaba-inc.com’. The replace (‘ # # # # ‘, ‘@’), welcome people with lofty ideals to join ~

Original address: github.com/ProtoTeam/b…