This paper included in the column “magic” of the IDE, hope that the reader can through this series of articles, the IDE the realization of the related technologies have certain cognitive, at the same time, through the study of the static analysis of language, can from the Angle of the compiler, look at the language features, help in the understanding of the IDE at the same time, can also be a deeper understanding of the language itself.

There are so many techniques involved in IDE analysis that you can’t expect to uncover the magic in a short time, but as a starting point, we can start with a few simple scenarios and experience them with practice. The first thing we need to do is to implement a formula syntax and implement auto-completion and error inference for the formula. The specific effect is the same as the formula in Excel, as shown below:

Formula syntax is very simple, it’s just expression plus function call syntax, with no variables and no scope, which allows us to focus on auto-completion and error handling when we do analysis. However, it is not that simple. To realize the functions of completion and error prompt in the figure, a lot of work needs to be done, and a lot of relevant knowledge needs to be introduced step by step. This is not enough to be covered in a single article, but we will divide it into several chapters, step by step, and finally achieve the effect in the figure.

When we want to design a language, we do not need to write from lexical analysis and grammatical analysis step by step. The community provides a variety of grammatical analysis generation tools. Being able to use them skillfully and flexibly will make us get twice the result with half the effort when developing a language. ANTLR is one such ecologically sound tool that can turn your grammar into a parser for many different languages. Subsequently, ANTLR was used to generate parser for excel’s formula syntax, and we completed static analysis for formula syntax, such as automatic completion, error monitoring and inference.

Initialize an ANTLR project

ANTLR is implemented in Java, so installing ANTLR requires Java 1.7 or later. After installing Java, run the following command to install ANTLR:

cd /usr/localhttps://www.antlr.org/download/antlr-4.9-complete.jar/lib curl - OIt is recommended to write the following command in.bash_profile
export CLASSPATH=". : / usr/local/lib/antlr - 4.9 - complete. Jar:$CLASSPATH"
alias antlr4='Java -xmx500m -cp "/usr/local/lib/antlr-4.9-complete.jar:$CLASSPATH" org.antlr.v4.4.tool'
alias grun='Java -xmx500m -cp "/usr/local/lib/antlr-4.9-complete.jar:$CLASSPATH" org.antlr.v4.4.gui.testrig'
Copy the code

Thanks to ANTLR’s ecology, ANTLR supports a wide range of generation targets, and we expect to be able to generate Typescript parser so that we can use our new syntax in the front end or in vscode plug-ins. Now we initialize our project, create a new hello (or whatever you want to call it) directory, go into that directory, create a new file called hello.g4 and write the following

G4 - ANTLR defines a grammar called Hello grammar Hello; r : 'hello' ID ; // match keyword hello followed by an identifier ID : [a-z]+ ; // match lower-case identifiers WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlinesCopy the code

Then initialize NPM and install dependencies, execute:

Initialize NPM
npm init
# install dependencies
npm i antlr4ts typescript
npm i -save-dev ts-node @types/node antlr4ts-cli
Copy the code

After installing typescript, configure tsconfig.json and add the following script to package.json:

 "scripts" : {
    "build" : "antlr4ts -visitor Hello.g4"."start" : "ts-node index.ts"
}
Copy the code

Then, execute NPM run build and create our own entry file index.ts, which will look like this:

import { ANTLRInputStream, CommonTokenStream } from 'antlr4ts';
import { HelloLexer } from './HelloLexer'
import { HelloParser } from './HelloParser'

// Create the lexer and parser
let inputStream = new ANTLRInputStream("hello world");
let lexer = new HelloLexer(inputStream);
let tokenStream = new CommonTokenStream(lexer);
let parser = new HelloParser(tokenStream);
let tree = parser.r();
console.log(tree.toStringTree());
Copy the code

Finally, execute NPM run start. If we see helloWorld helloworld, the project has been initialized successfully. ToStringTree is a tree in LISP format. The format is (root child1… ChildN), which outputs a syntax tree with helloWorld as the root node and two children of Hello and world.

How it works

We have covered the various stages of the compiler workflow in previous articles. First, ANTLR is a syntax generation tool. We write syntax rules (ANTLR meta-language), and ANTLR generates LL(*) interpreters that can analyze our syntax. The interpreter transforms the input stream of characters into a parse tree. In this process, we do not need to write the lexical analysis and grammar analysis again. After modifying the grammar or lexical rules, we only need to generate parser again with ANTLR. Instead of manually modifying the parser code again, we can focus more on the grammar itself.

While ANTLR does a lot of automating for us, it’s worth taking a quick look at how ANTLR does it. Using ANTLR entirely as a black box is also detrimental to ANTLR generated Parser. Let’s start with a simple grammar rule:

// The syntax rule (non-terminal) assign: ID '=' expr '; '; ID: [a-za-z_]+;Copy the code

ANTLR generates a recursively descending parser, which is actually a collection of recursive methods, with one rule for each method. The process of parsing is to start from the entry and parse to the leaf node (terminal). For example, the syntax rules above might generate the following methods:

function assign() {
    match(ID); / / match INT
    match('='); / / match +
    expr();     // Call expr to match an expression
    match('; ');
}
Copy the code

Each call to assign, expr, and match matches a node in the parse tree. The match method matches a leaf node. If we successfully match all the input characters recursively from the entry down, we have the parse tree corresponding to the input characters. However, the above grammar rules are simple. In reality, many grammar rules have multiple branches. Consider the following syntax rules:

stat: assign
    | ifstat
    | whilestat
    ...
Copy the code

The corresponding method might be a switch statement:

function stat() {
    switch (nextToken) {
        case ID: assign(); break;
        case IF: ifstat(); break;
        case WHILE: whilestate(); break; .default:
            // Throw an optional branch exception
            throw Error()}}Copy the code

Stat () needs to determine which alternative branch is correct based on the next token, usually called the lookahead token, A lookahead token is strictly any lexical symbol known by the parser and used for processing before matching and consumption. In STAT, each alternative branch starts with a different token, so it only needs a lookahead token to make a decision. But in more complex cases, we may need to sniff many tokens forward before deciding, sometimes even scanning to the end of the input character stream.

We start at a starting point and at every fork in the road, we try to enter one of the branches, find that the branch doesn’t match (the lookahead token doesn’t match), and then we go back to the other branches, and so on and so on. Until we reach the end of the maze or we find that all roads are dead ends, we are done with the parsing process.

If we find that both paths lead to the end, then our syntax is ambiguous, in which case ANTLR will choose the first of all possible branches that match. In general, however, we should avoid syntactic ambiguities, which are generally considered syntax bugs, because the same input stream can generate multiple syntax trees with different semantics, which is unacceptable for a programming language.

Finally, as an additional point, if we are careful, we can see that the parsing can go into an infinite loop if certain syntax can be matched by itself. For example:

Expression: expression + expression | INT.Copy the code

This kind of left recursive syntax generates expression recursively in LL Parser and calls expression methods, which leads to wireless recursion. This is also a limitation of LL syntax. Your syntax cannot have left recursion. However, in ANTLR4, you can already write the syntax in the example above, and ANTLR converts the left recursive syntax into its equivalent non-left recursive syntax. However, you still can’t use indirect left recursion of the form expression left recursion to number, number left recursion to expression, which in ANTLR4 also causes an infinite loop.

Listener and Visitor

After we have successfully parsed and obtained the Parse tree, all that remains is how we need to access the tree. ANTLR provides two modes to choose from: Listener and Visitor.

The Listener mode provides entry and exit events for all nodes, and ANTLR generates a ParseTreeListener class for each grammar file. In this class, each grammar rule has an enter and exit method. When the traverser accesses the Assign node, it calls the enterAssign method and passes in the corresponding tree node AssignContext. After accessing all the children of the Assign node, it calls the exitAssign method. This is a relatively simple mode with limited functionality. If the processing is not complicated, for example if we want to do Lint, then we only need to worry about non-compliant writing methods and not the whole tree, Listener is a very suitable mode. The dotted lines below indicate the depth-first search of the Parse tree and the timing of the enter and exit calls.

The Visitor pattern is more complex, but it gives us the ability to fully control the parse Tree traversal. To use a Visitor, pass in an additional parameter – Visitor option for the CLI when generating code. ANTLR generates a Visitor interface for the syntax, Each rule in the syntax corresponds to a visit method in the interface, such as visitAssign, visitExpr, visitID, and visitTerminalNode for terminators. To start the walk, ANTLR’s internal support code for the Visitor will call the visitAssign method at the root node, and then our implementation visiAssign will call visit and pass all the child nodes as arguments to continue the walk. Of course, instead of traversing all the children of assign, we could just call visitExpr to traverse the EXPr nodes.

Const visitor = new MyVisitor() // Parse tree visitor. Visit (tree)Copy the code

Both modes perform depth-first search on the Parse tree. Depth-first search has a feature that all children of the node are already traversed in the backtrace phase. For example, in the Listener diagram above, when exitID is triggered, the STR node below the ID node must be finished processing. It is helpful to understand that no matter what you are doing to the AST, the “context” of the node is already processed in the backtrace phase, which is useful for handling part of the context-specific semantics.

Write in the last

This article provides a brief introduction to ANTLR initialization and related principles. Understanding its mechanisms will help you to better use ANTLR and understand its limitations. The initial project section can be directly referenced in ANTLR’s official documentation geting-Started, which you can see in more detail on ANTLR’s project home page. In addition, ANTLR official also provides many detailed examples, you can refer to github.com/antlr/gramm… Parser, the official example, is a good place to start. Even if you want to implement a completely new language, these examples are a great reference point. After all, there are a lot of programming languages out there, but there aren’t a lot of basic language patterns, because languages tend to be designed to look like natural languages, and most of the time it’s good to follow those patterns.

Refer to the link

  • Github.com/antlr/antlr…
  • Github.com/tunnelvisio…