The preface

In the previous tutorial, We introduced the basic concepts of the compiler principle and introduced how to write a recursive downward parser by hand, using the example of an expression calculator.

However, in the production practice of the front end, it is rarely necessary to spend time and effort implementing a manual parser. In general, on the front end, the two most common scenarios that require compilation techniques are:

  1. Front-end engineering or business requirements, the need to customize the source code of the programming language analysis.
  2. Business requirements require semantic parsing of string input sources with specified syntax rules.

The first scenario is easy to understand. For example, arco-Design component documents, is from TSX source files, direct analysis of the extracted interface structure and annotations and automatically generated API documents. This kind of scenario, because the need to analyze is ready-made programming language, basically can find the corresponding language parsing tools. For example, TSX can use TSC tools to analyze and obtain TSX source code corresponding AST structure, and JS can use Acorn, and so on. With the AST structure, the business code traverses the standard tree structure itself for demand-driven processing.

The second scenario, for example, involves parsing and converting a bunch of single-line string logs from different sources into a JSON structure. Single-line string logging is understood in terms of nginx logging, where we need to extract attribute information (such as time, request URL, etc.) from a single-line string. However, the logs here may come from different systems, and the format of each log line is not uniform. In this case, using regular expressions for analysis will seem inadequate, and it is too tiring to analyze logic completely by hand and difficult to deal with demand changes.

Another example that is better understood is that a big data analysis product independently designs a query language similar to SQL but simpler and more user-friendly than SQL. In the front end, it needs to realize the syntax highlighting and intelligent prompt functions of the query language editor.

In general, the first thing to consider for scenario 2 is to use the [Lexical Parser] auto-generation tool. That is, we use a tool to automatically generate parsers. Classic autogeneration tools include The long-established lex, YYAC and subsequent derivatives, but these tools are in The traditional C/C ++ language domain tools, The front-end domain is basically unavailable, interested students can check out The lex & YACC Page for details. And the front end can be used, it is the protagonist of this share, Antlr tool.

Introduction to the

“Twitter Search uses Antlr for parsing and handles over 2 billion queries per day; Antlr is used for Hive, Pig, data warehouse, and analytics in the Hadoop ecosystem; Lex Machina uses Antlr to analyze legal texts; Oracle uses Antlr in its SQL developer IDE and migration tools; The Hibernate Object-relational Mapping Framework (ORM) uses Antlr to handle the HQL language.” — ANTLR 4 Authoritative Guide (Douban)

ANTLR (Full name: ANother Tool for Language Recognition (ANother Tool for Language Recognition) is a powerful automatic parser generation Tool written in Java Language. It was introduced in 1989 by Dr. Terence Parr et al from university of San Francisco. The iteration is now the fourth generation. So it’s called Antlr4. The tool itself is a Java language tool, but the resulting parser can be in mainstream programming languages including JS and TS, so Antlr4 is basically the most widely used automatic parser generation tool.

Antlr4 receives the G4 grammar as input, and the output is the parser source code of the corresponding target language that conforms to the constraints of this method. More precisely, the output parser framework code automatically parses the input text at run time and generates the abstract syntax tree (AST) we mentioned last time, but the business project still needs to complement the logic of the business requirements on top of the framework code.

In the introduction, we mentioned that programming languages such as TS often have ready-made parsers available. However, you can still use Antlr4 to build compilers for programming languages, because Antlr4 itself is powerful enough to support the writing of compilers for even high-level programming languages. Antlr4 officially provides G4 grammar rules for major languages to learn and use, such as javascript G4 grammar, you can try to compare with the official EMAC grammar specification we mentioned in the last lecture. Note that the latter is the official specification, and the former can be understood as a G4 language implementation of the official specification (yes, G4 itself is a programming language used to describe grammars).

Next, we will look at the concept and use of Antlr using the expression calculator example from last lecture. First, we need to set up the environment of Antlr4.

The installation

Java environment

Antlr4 is itself a Java tool, so the Java environment needs to be installed first. For MAC systems, go directly to java.com/en/download… Download and install. After the installation, run the Java –version command in Terminal to check whether the installation is successful.

IDE plug-ins

To better edit “G4” files, you can install IDE plug-ins to support including code highlighting and smart hints. Take vscode as an example, search for antlr keyword in the market, and install the plug-in after finding it.

In addition to the basic functions of highlighting, intelligent hints and automatic formatting of G4 grammar specifications, the plug-in also provides advanced functions such as grammar rule visualization and grammar rule debugging.

ANTLR V4-Intellij IDEs Plugin also has similar functions under Webstorm. Interested students can explore and use it independently.

NPM install

Once you have the Java environment and IDE plug-in installed, you need to add dependencies to projects that need to use ANTLR. Here we use the typescript language project as an example, which needs to be installed in sequence:

npm i antlr4ts

npm i -D antlr4ts-cli
Copy the code

Antlr4ts is Antlr’S typescript Runtime Library, which includes various base classes and utility functions for object-oriented design. The library itself is all TS/JS code, like React, LoDash, etc., that will eventually be packaged into business product builds.

Antlr4ts-cli is a CLI tool, or compiler for the G4 language itself, which can compile source code for the G4 grammar rules to the grammar parser. This tool simply calls the antlr.jar tool from the JS code, so the Java environment is already installed on the dependent system using ANTlr4TS-CLI.

For your convenience, after you have installed the Java environment and IDE plug-in, you can download the sample project for the rest of this article directly from GitLab.

Learn git clone https://github.com/cloudfun-team/learn-antlr4 CD - antlr4 NPM install # # for students can't wait, can continue to run the following command under the experience. NPM run TSC # compile ts source code, which will generate to dist directory Node dist/ Calculator-visitor # Start demo, type for example 3*(1+1) and press EnterCopy the code

use

G4 grammar

G4 grammars are antLR supported grammars for defining grammar rules, which can be understood simply as the “context-free grammars” we mentioned in the last lecture. The G4 grammar supports the expression of natural and understandable rules, and does not require the designer of the grammar to consider the avoidance of “left recursion” or the design of “LL(k)” mentioned in the last lecture. For example, we still have the expression calculator mentioned in the last lecture, its G4 grammar source code is as follows:

(Screenshots are used instead of flybook code blocks for better code highlighting. Students can view the file directly in the learn-Antlr4 repository.)

In the code above, we mark 1, 2, 3, and 4, which correspond to the four basic parts of the G4 grammar. Respectively is:

  1. Lexer Tokens

In the last lecture, we learned that the compiler needs to perform lexical analysis to generate the token stream first. In G4 grammar, lexical rules need to be defined first. This rule is similar to the definition of regular expressions and won’t be repeated here. For our expression calculator, the morphology is mainly concerned with numbers and addition, subtraction, multiplication and division, while whitespace is ignored.

  1. Grammar Rules

Lines 11 through 18 define the complete grammar rules. Start defines the entry to the grammar; Expression defines a concrete computed expression that uses the concept of “recursion” mentioned in the previous lecture.

Expression can be a direct number (lines 13) or two numbers plus, minus, multiply and divide (lines 15-17). Two expressions can be combined by addition, subtraction, multiplication and division to form an expression (lines 15-17), or by wrapping expression combinations in parentheses (lines 14). In this recursive way, the grammar rules of calculator expressions are defined and constrained strictly and completely.

  1. Rule Name

As mentioned briefly in the introduction section, ANTLR takes the rules defined by G4 and generates a parser skeleton code that automatically parses and generates AST data structures, but that’s about it. After the business layer gets the AST, it still needs to traverse the AST to process the business logic and implement the business requirements.

But even the AST, when a grammar rule grows in size (imagine using G4 to define the javascript language directly), is opaque and complex. Thus, the framework code generated by Antrl provides two more comfortable ways to handle the AST and implement business logic, and the rule name is the entry name for the business code to handle the AST. More on that later in this article.

It is important to note that the rule name is defined with a # header at the end of the grammar line and is not a comment in the Bash language.

  1. Context Label

Grammar rules can be assigned context tags that can be used to more clearly distinguish between rules in the current context when business code processes the AST.

Taking 17 actions as an example, when the business code handles the rule of AdditionOrSubtraction, it is obvious that the expression on both sides of the + or – sign should be handled first, and the left and right expression values should be obtained before AdditionOrSubtraction. Since the left and right rules are all named recursive expression, we assign left tags to the left expression rules by way of left= and right= respectively. The expression on the right allocates the right flag. This way, when the business code handles the AdditionOrSubtraction line rule, it can clearly identify the left and right expressions in the current context. More on this later in this article.

Build parser

Antlr4 provides two different parser generation modes, Visitor and Listener modes. Instead of discussing the differences, let’s go straight to the most common Visitor pattern for the generated code.

The Visitor pattern

In the clone project, execute NPX antlr4ts-visitor calculator-visitor/antlr/calc.g4 to generate the following files in the same directory as calc.g4:

The TS file is the parser framework code automatically generated according to G4. Calclexer. ts is a lexical analyzer that generates tokens flows. Calcparser. ts is a parser used to generate an AST. Calcvisitor.ts is the interface of the visitor used to traverse the AST, which the business code implements to implement the specific business requirement logic.

Our business requirement here is to implement the evaluation of the expression. In the index.ts file, we have the following code:

As you can see from the code, the logic that the business code needs to implement is the processing of the corresponding rule in the G4 grammar. incalcVisitor.tsThis interface defines a series ofvisitAre prefixed functions that correspond to each of the g4 grammarsRule Name. Such asvisitNumberThe correspondingcalc.g4Defined on line 13 in#NumberThe rules,visitPowerCorresponding to 15 rows#PowerThe rules.

Each of these access functions takes a context parameter. Context parameters vary for different rules. The most important difference is that different Context labels can be used. For example, the #Parentheses rule (line 14 in calc.g4). We defined the recursive expression rule as inner. In the business code, the recursive rule can be accessed through ctx._inner (line 29 in index.ts). Once you have ctx._inner, you can use the super.visit traversal function to recursively iterate over the deep traversal expression.

The super.visit function is the core function for traversing the AST. Inside this function, it will enter the specific visitXXX function according to the specific rules of VISIT. In the specific visitXXX function, it is controlled by the business code to recursively call the super.visit function to achieve depth-first traversal. At the same time, the specific visitXXX function can return data according to the scenarios of business requirements, and the returned data will also be returned through the super.visit layer upon layer. In our case, the visit function returns the evaluated value of the expression. For visitNumber on line 25, it calls the leaf node, which is the specific Number. Use the Number function to convert the string to the Number and return it. And 34 lines of visitAdditionOrSubtraction function, for example, we first left node of the tree traversal visit CTX. _left, got the value of the expression on the left. Then access the right node to get the expression on the right. Finally, the lvalue and rvalue are added or subtracted according to whether the ctx._operator is plus or minus, and the calculated result is returned upward.

Once you have implemented the traversal processing logic, the CaculatorVisitor, you can use this parser to process the actual business input.

As you can see from the code above, the parser generated by the Antlr framework, like the common flow of the compiler we mentioned last time, performs a standardized compile process of source text -> lexical token stream -> parsing AST -> iterating the AST to execute the business logic.

The Listener pattern

In the Visitor mode mentioned above, the business logic CaculatorVisitor fully controls the AST traversal through the super.visit access function. If you do not write visit to the child node in the visitXXX function, the child node will not be visited.

Corresponding to this is the Listener pattern provided by Antlr. In this mode, depth-first traversal of the AST is automatic and does not need or be controlled by business code. Business code can listen for the entry and exit events of this traversal on each AST node.

This mode, because of its “audit” approach, does not control the return value of the traversal process at the node, or the listener method has no return value. Therefore, an additional data structure (usually such as a Map or Stack) is required to store intermediate results.

The learn-antlr4 repository also provides a demo of an expression calculator in Listener mode. Interested students can compare the demo of the two modes to learn and feel the difference between the two modes. In production, Vistor mode is generally used in the majority.

Business example

To further illustrate that Antlr can be used in real scenarios in the front end, not just in demo, the following projects are introduced here:

jsx-loader

A small tool made in 2018, when TSX was not as popular as it is now, and the project was still JSX. Esbuild hasn’t shown up yet, Babel’s slow batch. Simply use ANTLR4 to write a Webpack JSX-Loader exercise and self-use.

Github.com/YuhangGe/js…

jinge-material

A small but decent LIBRARY of UI components, based on an MVVM framework developed by the author with interest and learning, which uses ANTLR to analyze templates. For example, a template references a component. One way to do this is by writing an import statement in a comment, such as:

<! -- import { ComponentA } from './a'; import { ComponentB } from '~root/component'; --> <div><ComponentA /><ComponentB /></div>Copy the code

This area involves extending the standard HTML/XML syntax, so Antlr with Acorn is used to handle the syntax of custom rules.

Jinge Material

Github.com/jinge-desig…

HQL: HanSight Query Language

Ace-editor and Antlr can be used together to highlight and intelligentify the code of the EDITOR of sqL-like big data query analysis language. Antlr can also shine in real front-end business products.