From language recognition to general SQL parsing

One, foreword

In the construction process, the warehouse gradually draws lessons from the platform development experience and often introduces the CONCEPT of CI/CD, requiring static code scanning function to identify dangerous operations such as deleting databases, deleting tables and changing tables in advance. In addition, computing engines such as Hive can dynamically obtain the consents through Hook. However, for MySQL and Vertica, the consents cannot be dynamically obtained and need to be statically resolved during task scheduling. All of this requires a common set of SQL parsing tools. However, there are many big data computing engines, such as Hive, Spark, Impala, Vertica, and so on. Each computing engine has its own dialect, and each computing engine uses different development languages. If we want to achieve universal SQL parsing, we must shield the differences of development languages and start from syntax rules to achieve universal SQL parsing. The following will respectively from language recognition, ANTLR use, and Hive SQL as an example to introduce the general SQL parsing implementation.

Second, language recognizer

SQL language is composed of grammar rules and lexical symbols, just like our ordinary language. The parsing of SQL is actually a process of dividing a series of characters into lexical symbols Token and generating a syntax tree according to the grammar rules. The analysis of SQL is realized by traversing the syntax tree.

An assignment statement sp = 100; , the process of dividing the input character stream into non-separable lexical symbols is lexical analysis, and the process of converting the lexical symbol stream into a parsing tree is grammatical analysis.

2.1 Lexical analysis

Lexical analysis is the process of dividing input characters into non-separable lexical symbols according to lexical rules. In SQL, for example, lexical symbols include the following types:

Identifier. Such as letter (letter | digit) *.
Constants.
The keyword. Such as select, from, etc.
Delimiter. Such as,,, And so on.
Operator. Such as <, <=, >, >=, =, +, -, *, etc.

The lexical analyzer reads one character at a time, completing a lexical symbol recognition when the current character is not in the same classification as the previous character. For example, when ‘SELECT’ is read, the first character is ‘S’, which satisfies the rules for keywords and identifiers, as well as the second character, and so on, until the seventh character is a space, thus completing a lexical recognition. In this case, ‘SELECT’ can be regarded as either a keyword or an identifier, and the analyzer needs to determine the value according to its priority.

SELECT name, age FROM student WHERE age > 10;
Copy the code

For example, after lexical analysis of the SQL above, the following lexical symbols can be obtained:

Keyword: SELECT, FROM, WHERE
Identifiers: name, age, student
Constants: 10
Delimiter:;
Operator: >

As can be seen from the analysis process described above, lexical analysis is actually a process of continuous state transition based on entered characters and lexical rules until all characters are scanned. In the implementation of lexical analysis, the normal expression RE is converted into uncertain finite automaton NFA by Thompson construction method, then the NFA is converted into deterministic finite state machine DFA by subset construction method, then the equivalent state is merged by DFA minimization, and finally the lexical symbols are obtained by character scanning through DFA.

2.2 Grammar Analysis

Through lexical analysis, the input characters can be divided into morphological symbols. How these morphological symbols are combined according to certain rules to form meaningful words and sentences is the process of grammatical analysis. The difficulty of parsing lies in rule processing and branch selection, as well as recursive calls and complex computational expressions. In terms of implementation, there are mainly two algorithms, top-down analysis and bottom-up analysis, which will be introduced in detail below.

2.2.1 Context-free grammar

The context-free method is a rule used to specify the grammatical rules of a language. It consists of four parts:

A terminal set;
A set of non-terminal symbols;
A production list;
A start symbol.

For example, an arithmetic expression grammar:

S -> E
E -> E + E
E -> E - E
E -> (E) | num
Copy the code

To the left of ‘->’ is called the production head (left), and to the right of ‘->’ is called the production body (right). All production headers are a non-terminal symbol. A nonterminal symbol describes an abstract rule, while a terminal symbol describes a concrete thing. In the example above, S and E are non-terminal symbols, and the others are terminal symbols.

Assume the following grammar:

S - > AB to aA | A, B, BCopy the code

Derive the string aab using the above syntax as follows:

S → AB → aAB → aAB → aAB → aABCopy the code

2.2.2 Derivation and specification

derivation: Deduces the given string using grammar, starting with the opening symbol, by replacing the left part of the production with the right part of the production. As shown above:

S → AB → aAB → aAB → aAB → aABCopy the code

reductionThe: specification is the reverse of the derivation, which is to change a string into a non-terminal, then a non-terminal into a non-terminal, and so on until you reach the root node. Take the same string as above:

Aab → aab → Ab → Ab → SCopy the code

The derivation of always choosing the leftmost non-terminal string for substitution is the leftmost derivation, and the derivation of always choosing the rightmost non-terminal string for substitution is the rightmost derivation, also known as the canonical derivation. A derivation must correspond to a syntax tree, but the derivation may not be unique and the corresponding syntax tree may not be unique.

As the reverse process of derivation, the reverse process of the rightmost derivation is called the leftmost specification, namely the specification. The leftmost derivation of the inverse process is called the right-most specification.

Use the above syntax and strings as examples to illustrate the derivation and specification process:

2.2.3 LL analysis and LR analysis

LL analysis: Reads from the leftmost symbol of the statement, from left to right, using the leftmost derivation until the terminator is reached or an error exit is reported. The first L means reading characters from left to right, and the second L means using left-most derivation. Representation is the parsing process from grammar rules to statement strings, i.eThe top-downThe processing flow of.
LR analysis: Reads from the leftmost symbol of the statement, from left to right, using the right-most derivation (left-most specification) analysis method. The first L means to read characters from left to right, and the second R means to use right-most derivation. Representation is the parsing process from statement strings to grammar rules, i.eFrom the bottom upThe processing flow of.

In LL analysis, there are usually predictions/outputs and matches for the parser to choose from, where:

Forecast/output: According to the leftmost non-terminal and a series of forward-looking word symbols, determine the production with the highest probability of matching the current input, and output or perform other actions.
matching: Matches the production predicted in the previous stage according to the unmatched symbol at the far left of the input.

Taking the above aAB as an example, the analysis process using LL(1) is as follows, where 1 means that one character is read forward each time,

Production Input Action --------------------------------------------------------- S aab Predict S -> AB AB aab Predict A  -> aA aAB aab Match a AB ab Predict A -> a aB ab Match a B b Predict B -> b b b Match b AcceptCopy the code

In LR analysis, there are usually two actions for the parser to choose from: migration and specification:

move: Puts the currently pointed lexical symbol into a buffer (usually a stack).
statute: Converts a production string to a non-terminal symbol in the corresponding production by reverse-matching the production with one or more symbols in the buffer.

Also taking the above AAB as an example, the analysis process using LR(1) is as follows:

Buffer Input Action --------------------------------------------------------- aab Shift a ab Shift aa b Reduce A -> a aA  b Reduce A -> aA A b Shift Ab Reduce B -> b AB Reduce S -> AB S AcceptCopy the code

As can be seen from the above processing flow differences, LL and LR have the following differences:

LLIt’s a top-down analysis process, starting from the rules of grammar, deriving a given string of symbols from a production formula, using derivation.LRIs the bottom-up process of analysis, from a given string of symbols to the notation of the grammar, using the specification.
LRthanLLMore efficient, no left recursion and ambiguity, such asE -> E + EThis rule, applyLLThere will be recursion.
LR generates code that is too arcane to understand compared to LL. JavaCC is LL(1) algorithm, ANTLR is LL(n, n>=1) algorithm.

Comparison of common SQL parsers

Currently, there are two main types of SQL parsing tools:

By writing Parser manually, typical examples are SQLParser (a module in Druid), JSQLParser, etc.
The custom syntax type Parser of Parser is generated by the syntax parsing tool, such as ANTLR, JavaCC, etc.

The differences in generation between the two parsers determine their differences in usage:

Performance: A hand-written Parser can be optimized in a variety of ways, and its performance is much higher than the tool-generated Parser. For example, SQL Parser performs 10 or even 100 times faster than the Parser generated by ANTLR and JavaCC tools.
Syntactic support: Both support multiple syntax, but the tool-generated Parser implementation is easier and has more support. For example, SQL Parser supports only common DML and DDL for Oracle, Hive, and DB2.
Readability and maintainability: ANTLR, for example, decouples its syntax from its code, making it more readable. When adding new syntax, it simply needs to modify the syntax file. SQL Parser, on the other hand, couples syntax rules with code, making it less readable and requiring a lot of code to change a few syntax changes.
Syntax tree traversal: SQL Parser uses Visitor mode to completely encapsulate the abstract syntax tree, and peripheral programs cannot directly access the abstract syntax tree. When the tree is not completely traversed, the code is cumbersome. ANTLR supports visitor and Listenor access to control the traversal of the syntax tree.

Of course, for custom syntactic type parsers, ANTLR and JavaCC are functionally similar, but ANLTR is a bit richer and cross-language, while JavaCC is only available in Java.

Obviously, ANTLR is the best choice for syntax support, readability, and maintainability, and syntax tree traversal, as discussed below.

Fourth, the ANTLR

ANTLR is short for Another Tool for Language Recognition, a recognizer Tool written in the Java Language. It can generate parsers automatically and directly generate parsers for target languages from user-written ANTLR grammar rules. It can generate parser clients for Java, Go, C, and other languages.

The parser client generated by ANTLR generates an abstract syntax tree from the input text and provides an interface to traverse the tree to access various parts of the text. The implementation of ANTLR is consistent with the lexical and grammatical analysis described above. The lexical analyzer divides lexical units according to lexical rules. The parser makes semantic analysis of lexical units, optimizes rules and eliminates left recursion.

Installation and use of ANTLR can be found on the official website.

4.1 Grammar and lexical rules

4.1.1 Syntax File structure

In ANTLR, the syntax file starts with.g4At the end, if the grammar rules and grammar rules are in a file, the syntax file Name for NameName.g4If the syntax file and lexical file are placed separately, the syntax file must be named asNameParser.g4, the lexical file must be namedNameLexer.g4. A basic syntax file structure is as follows:

Grammar: Specifies the syntax name. Pure syntax file declaration is usedparser grammar Name;, pure lexical file declaration usedlexer grammar Name;.
Options: Reserved functions.
Tokens: Declare lexical symbols. Tokens may not be defined in syntax files but must be used in syntax files.
ActionName: Use actions outside of syntax rules, for the target language, for JAVA currently has headers and members, if you expect to use them only in a lexical analyzer, use@lexer::name, if you expect to use it only in parsers@parser::name.
- Header: Defines class file headers, such as package and import declarations embedded in Java.
- Members: Defines the contents of a class file, such as class members and methods.

4.1.2 Example Syntax files

To distinguish grammar and lexical rules, lowercase letters are grammatical rules, and uppercase letters are lexical rules. Take the josn syntax file as an example. The syntax is JSON and the file is json.g4. The details are as follows:

// Specify the grammar JSON; // a syntax rule json: value; // A syntax rule with multiple alternative branches // For 'true', an implicitly defined lexical symbol // Alternative branches where # represents a tag can generate more precise listener events, // Alternative branches in a rule can either be all tagged or all untagged value: STRING # ValueString | NUMBER # ValueNumber | obj # ValueObj | arr # ValueArr | 'true' # ValueTrue | 'false' # ValueFalse | 'null' # ValueNull ; obj : '{' pair (',' pair)* '}' | '{' '}' ; pair : STRING ':' value ; arr : '[' value (',' value)* ']' | '[' ']' ; / / / / a lexical rules and ordinary regular expressions, you can use wildcards, or | said, * 0 or more times / /? Indicates 0 or 1 occurrence, + indicates 1 or more occurrence, ~ indicates the opposite,? Also support wildcards greed and the greed pattern STRING: the '"' (ESC | SAFECODEPOINT) * '"'; / / using fragments of lexical rules, the logo says the lexical rules cannot be applied to grammar rules alone, only as a piece of lexical fragments ESC lexical rules: ([" '\ \' \ \ | / BFNRT] UNICODE); fragment UNICODE : 'u' HEX HEX HEX HEX ; fragment HEX : [0-9a-fA-F] ; fragment SAFECODEPOINT : ~ ["\\\u0000-\u001F] ; NUMBER : '-'? INT ('.' [0-9] +)? EXP? ; fragment INT : '0' | [1-9] [0-9]* ; fragment EXP : [Ee] [+\-]? INT ; // Use ANLTR API to get WS: [\t\n\r] + -> skip;Copy the code

Take the following josn text for example:

{" string ", "string", "num" : 2, "obj" : {" arr ": [" English", "Chinese", 123, 12.45, 23.64 e+3, true, "{db} ABC def,"] {}}}Copy the code

A lexical analyzer converts input characters into a lexical symbol stream:

[= '{' @ 0, 0, 0, < >,' {' 1-0] [@ 1, 6, 13 = '" string "', < string >. 2:4] [@ 2, 14:14 = ':' < > ':', "] [@ 3, 15, 19 = '" string ", < string >, and] = ', '[@ 4, 20, 20, <', '>, and [@ 5, 26:30 =' "num" ', < STRING >. 3:4] [@ 6, 31:31 = ':' < > ':', "] [@ 7, 32, 32 = '2', < NUMBER >, they = ', '[@ 8 behavior: 33, <', '>, 3:11] [@ 9 33:6 =' "obj" '43, < STRING >, Deborah] [@ 1:44 =' : '< >' : ', for] [= '{' @ 11 do: 45, < >, 4:10]' {' [@ 12 zhongguo kuangye daxue: 59 = '" arr "', < STRING >. 5:8] [13 '@ : 60 =' : '< >' : ', "] [@ 14, 21:61 = '[' <' [' >, "] 15,75: [@ 83 = '" English "', < STRING >, and [@ 16,84:84 = ', ', < ', '>, the supplications] [@ 17 16/32-bit: 101 =' "Chinese" ', < STRING >, concerning [@ = 18102-102 ', '<' and '>, 7:16] [@ =' 123 ', 19116-118 < NUMBER >, kingdom [@ = 20119-119 ', ', < ', '>, head] [@ = '12.45' 21133-138, < NUMBER >, but [@ = 22139-139 ', ', < ', '>, therefore] [@ =' 23.64 e+3 '23153-160, < NUMBER >, and when [@ = 24161-161 ', '<' and '>, 10:20] [@ =' true ', 25163-166 < > 'true', 11:0] [@ = 26167-167 ', ', < ', '>, 4] [@ 27181-192 = '{db} "ABC def"', < STRING >, 12:] [@ = 28193-193 ', ', < ', '>, division] [=' {' @ 29211-211, < >, unto '{' [@ = '} ', 30226-226 < > '} ', then] [@ = '] '31236-236, the < >'] ', were] [@ = '} ', 32242-242 < > '} ', but [@ = '} ', 33244-244 < > '} ', 22:0] [@ 34246-245 = '< EOF >' < EOF > 23:0]Copy the code

With a parser, you can convert lexical symbols into syntax trees:

4.1.3 Common lexical rules

Matching priority

When matching, if the input string can be matched by more than one lexical rule, the rule declared earlier takes effect first.

Lexical pattern

Lexical mode allows lexical rules to be grouped by context, and lexical parsers start in default mode and remain in default mode unless specified using the mode directive. For example, in the analysis of XML, multiple attributes need to be cut out in the tag body, and in the tag body, the whole text acts as a tag body.

<<rules in default mode>>
...
mode MODE1;
<<rules in MODE1>>
...
mode MODE2;
<<rules in MODE2>>
...
mode MODEN;
<<rules in MODEN>>
Copy the code

Take the XMLlexer.g4 fragment as an example:

OPEN: '<' -> pushMode(INSIDE); // INSIDE mode vocabulary rules define mode INSIDE; CLOSE: '>' -> popMode; SLASH : '/' ; EQUALS : '=' ; STRING : '"' ~[<"]* '"' | '\'' ~[<']* '\'' ;Copy the code

Lexical rules action

After matching a lexical rule, the lexical analyzer will generate a lexical symbol object. If you want to change the lexical symbol type during the matching process, you can achieve this through the lexical rule action.

ENUM : 'enum' {if (! enumIsKeyword) setType(Identifier); };Copy the code

Semantic judgment

In the process of lexical analysis, it is often necessary to dynamically open and close lexical symbols, which can be realized by semantic judgment.

ENUM : 'enum' {java5}? ;
ID : [a-zA-Z]+
Copy the code

For example, before Java version 1.5, enum was only an identifier that could be used to define variables. After Java version 1.5, enum was used as a keyword. If you use the syntax rules after Java version 1.5 to compile the code before Java version 1.5, the compilation will fail. When java5 is true, the lexical rule is turned on, otherwise it is turned off. Note that since both ENUM and ID can match the ENUM input string, as mentioned earlier, ENUM must be placed first for the lexical parser to match first.

4.1.4 Common Grammar Rules

Alternative branch label

ANTLR based on grammar files generated for syntax tree analysis of the listener, every grammar rules to create a method, but for a rule has multiple alternative branches, use more inconvenience, can add branch to branch of each alternative label, so in the generated listener, each branch will generate an alternate method. Take the value rule in the JSON syntax file above for example:

Public class JSONBaseListener implements JSONListener {@override public void implements JSONListener Jsonparser.jsoncontext CTX) {} @override public void exitJson(jsonParser.jsonContext CTX) {} Instead of rule generation methods, Method to generate the label @ Override public void enterValueString (JSONParser. ValueStringContext CTX) {} @ Override public void exitValueString(JSONParser.ValueStringContext ctx) { } @Override public void enterValueNumber(JSONParser.ValueNumberContext ctx) { } @Override public void exitValueNumber(JSONParser.ValueNumberContext ctx) { } ... @Override public void visitTerminal(TerminalNode node) { } @Override public void visitErrorNode(ErrorNode node) { } } // Public class JSONBaseListener implements JSONListener {@override public void implements JSONListener enterJson(JSONParser.JsonContext ctx) { } @Override public void exitJson(JSONParser.JsonContext ctx) { } ... // If alternative branch tags are not used, @override public void enterValue(jsonParser. ValueContext CTX) {} @override public void exitValue(JSONParser.ValueContext ctx) { } ... @Override public void visitTerminal(TerminalNode node) { } @Override public void visitErrorNode(ErrorNode node) { } }Copy the code

Syntax rule action, semantic judgment, matching priority

In grammar rules, similar lexical rules are supported for action and semantic judgment, and the matching priority is the same as that of lexical rules.

associativity

In addition, subtraction, multiplication and division, we combine from left to right, but in exponents, we combine from right to left, and the ASSOC is required to manually specify the associativity. 2^3^4 is recognized as 2^(3^4) with the following syntax:

expr : <assoc=right> expr '^' xpr
     | INT
     ;
Copy the code

4.2 Error Reporting

By default, ANTLR sends all error messages to standard error output. ANTLR also provides ANTLRErrorListener to change the target output and style of these messages. The interface has a syntaxError() method for both lexers and parsers. ANTLRErrorListener interface has many methods, ANTLR provides the BaseErrorListener class as its base class implementation, when using, just rewrite the interface and modify ANTLR’s error listener.

ANTLR:

public class ConsoleErrorListener extends BaseErrorListener { public static final ConsoleErrorListener INSTANCE = new ConsoleErrorListener(); @Override public void syntaxError(Recognizer<? ,? > recognizer, Object offendingSymbol, int line, int charPositionInLine, String msg, RecognitionException e) {// Output error message to console system.err. Println ("line "+ line + ":" + charPositionInLine +" "+ MSG); } } public abstract class Recognizer<Symbol, ATNInterpreter extends ATNSimulator> { ... Private List<ANTLRErrorListener> _Listeners = new CopyOnWriteArrayList<ANTLRErrorListener>() {{ add(ConsoleErrorListener.INSTANCE); }}; . }Copy the code

When used, you can override error reporting methods to better display error messages, such as the SyntaxErrorListener below. In addition, in the case of lexical or grammatical errors, ANTLR has some means of repair, parsing can continue, but in some cases, such as SQL statements, or Shell script, grammatical errors, should not be implemented by all the subsequent, therefore, is listening to the grammar and lexical errors, can terminate the parsing process by throwing an exception.

public class SyntaxErrorListener extends BaseErrorListener {

  @Override
  public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line,
      int charPositionInLine, String msg, RecognitionException e) {
    List<String> stack = ((Parser) recognizer).getRuleInvocationStack();
    Collections.reverse(stack);
    SyntaxException exception = new SyntaxException("line " + line + ":" + charPositionInLine + " at " + offendingSymbol + ": " + msg, e);
    exception.setLine(line);
    exception.setCol(charPositionInLine);
    exception.setSymbol(String.valueOf(offendingSymbol));
    
    throw exception;
  }

}
Copy the code

4.3 Syntax tree Traversal

ANTLR provides two traversal tree mechanisms, namely listeners and accessors.

This listener

Listeners are similar to SAX document objects generated by XML parsers, and SAX listeners receive event notifications like startDocument() and endDocument(). A listener’s method is essentially a callback function that ANTLR will depth-first traverse the syntax tree and fire when entering or leaving a node. Take a simple assignment syntax:

grammar Stat; stat : assign; assign : 'sp' '=' expr '; '; expr : Expr; Expr : [1-9][0-9]*; WS : [ \t\n\r] + -> skip;Copy the code

ANTLR generates the following listener UML diagram:

The StatListener interface provides an abstract method to call back when all syntax rules enter (Enter) and exit (exit). The StatBaseListener class implements the default implementation of all interfaces. When using the StatBaseListener class, you only need to inherit the StatBaseListener class and rewrite the concerned method interface.

In order to sp = 100; For example, ANTLR traversal sequence is as follows:

4.3.2 accessor

Accessors also use depth-first traversal to traverse the syntax tree. Unlike listeners, accessors use display calls to access nodes, so the traversal process can be controlled.

The visitor UML diagram is as follows:

The StatVisitor interface provides an abstract method of all the grammar rules, and if you want to access a particular grammar rule, you just call the corresponding interface. Of course, ANTLR also provides the default implementation class StatBaseVisitor, which can be used only by inheriting it. In addition, it can also be seen from the method definition that every method has a return value, but the return value type is fixed and the constraint is larger.

For sp = 100; , the traversal sequence of accessors is as follows:

4.3.3 Data transfer mechanism

In syntax tree traversal, we often need to pass data, and in event methods, there are currently three ways to share information.

Use the method return value

From the principle of accessor and listener implementation, it can be seen that the listener uses the callback mode, so the return value is void, cannot pass parameters. Accessors have return values of fixed types and can be used to share data, but their use is limited because of their fixed types.

Class members share data in event methods

Both accessors and listeners use depth-first traversal to access the syntax tree, often using stacks to store intermediate data. Let’s take the traversal of the JSON syntax tree as an example to introduce the use of the following class members in the event method, as well as the specific use of accessors and listeners.

Sometimes, for convenience, configuration information is stored in JSON format, which needs to be converted into properties files for use. Using the json.g4 syntax mentioned above, we will convert the following JSON data into a standard properties file. For generality, we will remove alternative branch tags for the syntax rules in the syntax file.

{ "spring":{ "datasource":{ "driver-class-name":"com.mysql.cj.jdbc.Driver", "JDBC - the url" : "JDBC: mysql: / / 127.0.0.1:3306 / db", "username" : "root", "password" : "password", "type":"com.zaxxer.hikari.HikariDataSource", "hikari":{ "pool-name":"HikariCP", "minimum-idle":5, "Maximum-pool-size ":50, "idle-timeout":600000, "max-lifetime":1800000}}, "redis":{"database":0, "host":"127.0.0.1", "port":6379, "password":"123456" } } }Copy the code

The converted properties file will look like this:

Database = 0 spring.redis.password = 123456 spring.redis.host = 127.0.0.1 spring.redis.port = 6379 spring.datasource.hikari.pool-name = HikariCP spring.datasource.password = password spring.datasource.driver-class-name = com.mysql.cj.jdbc.Driver spring.datasource.hikari.idle-timeout = 600000 spring.datasource.username = root spring.datasource.hikari.maximum-pool-size = 50 spring.datasource.hikari.max-lifetime = 1800000 Spring. The datasource. JDBC - url = JDBC: mysql: / / 127.0.0.1:3306 / db spring. The datasource. Type = com. Zaxxer. Hikari. HikariDataSource spring.datasource.hikari.minimum-idle = 5Copy the code

The listener implementation is as follows:

Public class MyListener extends JSONBaseListener {/** ** private Stack<String> keys = new Stack<>(); /** * attributes */ @getter Private Map<String, String> prop = new HashMap<>(); /** * attributes */ @getter Private Map<String, String> prop = new HashMap<>(); @Override public void enterValue(ValueContext ctx) { if (ctx.arr() ! = null || ctx.obj() ! = null) { return; } if (ctx.STRING() ! = null) { addProp(ctx.getText().substring(1, ctx.getText().length() - 1)); } else { addProp(ctx.getText()); } } @Override public void enterPair(PairContext ctx) { String text = ctx.STRING().getText(); keys.push(text.substring(1, text.length() - 1)); } @Override public void exitPair(PairContext ctx) { keys.pop(); } private String getKey() { StringBuilder sb = new StringBuilder(); for (String key : keys) { if (StringUtils.isNotBlank(key)) { sb.append(key.trim()).append("."); } } if (sb.length() == 0) { return ""; } sb.deleteCharAt(sb.length() - 1); return sb.toString(); } private void addProp(String value) { String key = getKey(); Prop. Put (key, value) is overwritten with the last read. } } public static void transformByListener(String json) { CharStream input = CharStreams.fromString(json); JSONLexer lexer = new JSONLexer(input); CommonTokenStream tokens = new CommonTokenStream(lexer); JSONParser parser = new JSONParser(tokens); ParserRuleContext context = parser.json(); ParseTreeWalker = new ParseTreeWalker(); MyListener listener = new MyListener(); walker.walk(listener, context); Map<String, String> prop = listener.getProp(); for (String key : prop.keySet()) { System.out.println(key + " = " + prop.get(key) ); }}Copy the code

Accessors are implemented as follows:

Public Class MyVisitor extends JSONBaseVisitor<Void> {private Stack<String> keys = new Stack<>(); /** * attributes */ @getter Private Map<String, String> prop = new HashMap<>(); /** * attributes */ @getter Private Map<String, String> prop = new HashMap<>(); @Override public Void visitValue(ValueContext ctx) { if (ctx.obj() ! = null) { visitObj(ctx.obj()); } else if (ctx.arr() ! = null) { visitArr(ctx.arr()); } else if (ctx.STRING() ! = null) { addProp(ctx.getText().substring(1, ctx.getText().length() - 1)); } else { addProp(ctx.getText()); } return null; } @Override public Void visitObj(ObjContext ctx) { List<PairContext> pairs = ctx.pair(); if (pairs ! = null && pairs.size() > 0) { pairs.forEach(this::visitPair); } return null; } @Override public Void visitPair(PairContext ctx) { String text = ctx.STRING().getText(); keys.push(text.substring(1, text.length() - 1)); visitValue(ctx.value()); keys.pop(); return null; } private String getKey() { StringBuilder sb = new StringBuilder(); for (String key : keys) { if (StringUtils.isNotBlank(key)) { sb.append(key.trim()).append("."); } } if (sb.length() == 0) { return ""; } sb.deleteCharAt(sb.length() - 1); return sb.toString(); } private void addProp(String value) { String key = getKey(); Prop. Put (key, value) is overwritten with the last read. } } public static void transformByVisitor(String json) { CharStream input = CharStreams.fromString(json); JSONLexer lexer = new JSONLexer(input); CommonTokenStream tokens = new CommonTokenStream(lexer); JSONParser parser = new JSONParser(tokens); ParserRuleContext context = parser.json(); MyVisitor = new MyVisitor(); visitor.visit(context); Map<String, String> prop = visitor.getProp(); for (String key : prop.keySet()) { System.out.println(key + " = " + prop.get(key)); }}Copy the code

Annotate the nodes of the parse tree to store relevant data

As you can see from ANTLR generated code, a context class is generated for each syntax rule, so that data can be shared through such objects. Such as:

e returns [int value] : e '*' e | e '+' e | INT ; public static class EContext extends ParserRuleContext { public int value; . }Copy the code

This approach loses flexibility by binding the syntax to a particular programming language. ANTLR also provides a ParseTreeProperty class for JAVA, which is used to maintain the relationship between nodes and values. How to use this class will be introduced in detail in the following SQL syntax tree traversal.

Five, SQL parsing implementation

Back to the SQL parsing mentioned at the beginning of this article, no matter which dialect, as long as you find the syntax file, customize the syntax file according to the needs, and realize the syntax tree traversal logic, you can achieve the input and output table parsing, blood parsing and other functions.

Hive SQL 2.x is used as an example. The syntax files of Hive 2.x are implemented in the Hive source code using ANTLR 3.x. The syntax files and code files are highly coupled and need to be modified using the 4.x version rules. For the modified syntax files, see Hive SQL 2.x syntax files.

The syntax file also has the following problems:

Reserved words are not supported, which are supported in lower version syntax, such as Hive 2.1.1.
Does not support the set parameters, add the jar command, although it also doesn’t support in native Hive, but was in a SQL parsing, the incoming are often multiple SQL, may have the above command, after support can avoid filtering, also can realize more rich functions, such as parse SQL, tell you about the input and output table column number in the document, etc.

5.1 Syntax File Modification

Supported reserved word

IdentifiersParser.g4Add reserved words at the bottom of the document

// The following SQL2011 reserved keywords are used as identifiers in many q tests, they may be added back due to backward compatibility. // We are planning to remove the following whole list after several releases. // Thus, please do not change the following list unless you know what to do. sql11ReservedKeywordsUsedAsIdentifier : KW_ALL | KW_ALTER | KW_ARRAY | KW_AS | KW_AUTHORIZATION | KW_BETWEEN | KW_BIGINT | KW_BINARY | KW_BOOLEAN | KW_BOTH | KW_BY | KW_CREATE | KW_CUBE | KW_CURRENT_DATE | KW_CURRENT_TIMESTAMP | KW_CURSOR | KW_DATE | KW_DECIMAL | KW_DELETE | KW_DESCRIBE | KW_DOUBLE | KW_DROP | KW_EXISTS | KW_EXTERNAL | KW_FALSE | KW_FETCH | KW_FLOAT | KW_FOR | KW_FULL | KW_GRANT | KW_GROUP | KW_GROUPING | KW_IMPORT | KW_IN | KW_INNER | KW_INSERT | KW_INT | KW_INTERSECT | KW_INTO | KW_IS |  KW_LATERAL | KW_LEFT | KW_LIKE | KW_LOCAL | KW_NONE | KW_NULL | KW_OF | KW_ORDER | KW_OUT | KW_OUTER | KW_PARTITION | KW_PERCENT | KW_PROCEDURE | KW_RANGE | KW_READS | KW_REVOKE | KW_RIGHT | KW_ROLLUP | KW_ROW | KW_ROWS | KW_SET | KW_SMALLINT | KW_TABLE | KW_TIMESTAMP | KW_TO | KW_TRIGGER | KW_TRUE | KW_TRUNCATE | KW_UNION | KW_UPDATE | KW_USER | KW_USING | KW_VALUES | KW_WITH // The following two keywords come from MySQL. Although they are not keywords in SQL2011,  they are reserved keywords in MySQL. | KW_REGEXP | KW_RLIKE | KW_PRIMARY | KW_FOREIGN | KW_CONSTRAINT | KW_REFERENCES ;Copy the code

IdentifiersParser.g4File identifiers support SQL reserved words

identifier : Identifier | nonReserved / / new Hive 2.1.1 version supports reserved word as an Identifier, the current 2.3.8 subsequent versions is not support, so you need to add | sql11ReservedKeywordsUsedAsIdentifier;Copy the code

For set parameters, add JAR modification

The style of set parameter value is very rich, and the existing lexical rules are not satisfied. If it is supported in the form of grammar rules, the lexical rules need to be greatly modified. We adopt an opportunistic way and use channels to realize the filtering of SET and Add JAR syntax. Add the following rule to hivelexer.g4:

// Add action to specify header@lexer ::header {import java.util.Iterator; import java.util.LinkedList; @lexer::members {public static int CHANNEL_SET_PARAM = 1; @lexer::members {public static int CHANNEL_SET_PARAM = 2; public static int CHANNEL_USE_JAR = 3; private LinkedList<Token> selfTokens = new LinkedList<>(); @Override public void emit(Token token) { this._token = token; if (token ! = null) { selfTokens.add(token); } } @Override public void reset() { super.reset(); this.selfTokens.clear(); } public boolean isStartCmd() { Iterator<Token> it = this.selfTokens.descendingIterator(); while (it.hasNext()) { Token previous = it.next(); if (previous.getType() == HiveLexer.WS || previous.getType() == HiveLexer.LINE_COMMENT || previous.getType() == HiveLexer.SHOW_HINT || previous.getType() == HiveLexer.HIDDEN_HINT || previous.getType() == HiveLexer.QUERY_HINT) { continue; } return previous.getType() == HiveLexer.SEMICOLON; } return true; SET_PARAM: {isStartCmd()}? KW_SET (~('='|'; (~ (')) + '=' '; '))+ -> channel(2) ; ADD_JAR: {isStartCmd()}? KW_ADD (~('; '))+ -> channel(3) ;Copy the code

5.2 Syntax tree traversal implementation

public class HiveTableVisitor extends HiveParserBaseVisitor<Void> { @Setter private String curDb; /** * Private ParseTreeProperty<List<Entity>> curProp = new ParseTreeProperty<>(); // Other parts omit... @Override public Void visitStatement(StatementContext ctx) { if (ctx.execStatement() == null) { return null; } visitExecStatement(ctx.execStatement()); addProp(ctx, curProp.get(ctx.execStatement())); return null; } / / @ Override switch database public Void visitSwitchDatabaseStatement SwitchDatabaseStatementContext (CTX) {String db = ctx.identifier().getText(); this.curDb = trimQuota(db); return null; } / / delete table operation @ Override public Void visitDropTableStatement DropTableStatementContext (CTX) {TableNameContext fullCtx = ctx.tableName(); Opt opt = new Opt(OptType.DROP, ctx.getStart().getLine(), ctx.getStart().getCharPositionInLine()); addProp(ctx, buildTbl(fullCtx, opt)); return null; } / / query operation @ Override public Void visitAtomSelectStatement AtomSelectStatementContext (CTX) {if (CTX) fromClause ()! = null) { visitFromClause(ctx.fromClause()); Opt opt = new Opt(OptType.SELECT, ctx.getStart().getLine(), ctx.getStart().getCharPositionInLine()); fillOpt(opt, curProp.get(ctx.fromClause())); addProp(ctx, curProp.get(ctx.fromClause())); } else if (ctx.selectStatement() ! = null) { visitSelectStatement(ctx.selectStatement()); addProp(ctx, curProp.get(ctx.selectStatement())); } return null; } @Override public Void visitTableSource(TableSourceContext ctx) { TableNameContext fullCtx = ctx.tableName(); addProp(ctx, buildTbl(fullCtx)); return null; } private Entity buildTbl(TableNameContext fullCtx, Opt opt) { Entity entity = buildTbl(fullCtx); entity.setOpt(opt); return entity; } private Entity buildTbl(TableNameContext fullCtx) { Tbl tbl; if (fullCtx.DOT() ! = null) { IdentifierContext dbCtx = fullCtx.identifier().get(0); IdentifierContext tblCtx = fullCtx.identifier().get(1); tbl = new Tbl( Db.buildDb( curDb, trimQuota(dbCtx.getText()), dbCtx.getStart().getLine(), dbCtx.getStart().getCharPositionInLine() ), trimQuota(tblCtx.getText()), tblCtx.getStart().getLine(), tblCtx.getStart().getCharPositionInLine() ); } else { IdentifierContext tblCtx = fullCtx.identifier().get(0); Integer line = tblCtx.getStart().getLine(); Integer col = tblCtx.getStart().getCharPositionInLine(); tbl = new Tbl( Db.buildDb(curDb, null, line, col), trimQuota(tblCtx.getText()), line, col ); } return new Entity(Type.TBL).setTbl(tbl); } private void fillOpt(Opt opt, Entity entity) { if (entity == null || entity.getOpt() ! = null) { return; } entity.setOpt(opt); } private void fillOpt(Opt opt, List<Entity> entities) { if (entities == null || entities.size() == 0) { return; } for (Entity entity : entities) { if (entity.getOpt() ! = null) { continue; } entity.setOpt(opt); } } private String trimQuota(String name) { if (name == null || name.length() <= 2) { return name; } char start = name.charAt(0); char end = name.charAt(name.length() - 1); if (start == '`' && end == '`') { name = name.substring(1, name.length() - 1).replaceAll("``", "`"); } return name; } // Other parts omitted... }Copy the code

Vi. References

antlr
SQL Parser
Compilation principle: In-depth understanding of regular expressions and NFA and DFA state machines
Druid VS Antlr4
LL and LR Parsing Demystified