
Abstract

The purpose of this article is to give a brief introduction to building a parser [1] for an LL(1) grammar in Go, in this case to parse SQL queries. Only minimal programming skills (functions, structs, conditionals, and for loops) are required.

If you want to skip the article and get straight to the results, here’s the finished parser repository:

https://github.com/marianogappa/sqlparser

What we drop for the sake of simplicity

We will not implement some of SQL's more complex features, including subqueries, functions, complex nested expressions, and so on. However, these features can be added quickly with the strategy we will use.

A brief theoretical introduction

A parser consists of two parts:

  1. The lexical analyzer, also known as the “tokenizer [2]”.
  2. The syntax analyzer (the parser proper), which builds the abstract syntax tree [3].

Lexical analyzer

Let’s define this by example. “Tokenizing” the following query:

```sql
SELECT id, name FROM 'users.csv'
```

means extracting all of the tokens it contains. The tokenizer’s output would look something like this:

```go
[]string{"SELECT", "id", ",", "name", "FROM", "'users.csv'"}
```
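As a rough illustration (this is not the lexer we will actually build, which arrives later in the article), a naive toy tokenizer for this particular query could look like this:

```go
package main

import (
	"fmt"
	"strings"
)

// naiveTokenize is a toy sketch: it splits on whitespace and peels a
// trailing comma off into its own token. The real lexer built later in
// this article is smarter (quoted strings, multi-word keywords, ...).
func naiveTokenize(sql string) []string {
	var tokens []string
	for _, word := range strings.Fields(sql) {
		if len(word) > 1 && strings.HasSuffix(word, ",") {
			tokens = append(tokens, strings.TrimSuffix(word, ","), ",")
		} else {
			tokens = append(tokens, word)
		}
	}
	return tokens
}

func main() {
	fmt.Println(naiveTokenize("SELECT id, name FROM 'users.csv'"))
	// Output: [SELECT id , name FROM 'users.csv']
}
```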

Parser

In this step, we take those tokens, check that the sequence makes sense, and interpret them into a struct. Applications can then use this struct, for example to execute the query or to add syntax highlighting; it represents the query in a form that is easy for an application to work with. After this step, we have something like:

```go
query{
	Type:      "Select",
	TableName: "users.csv",
	Fields:    []string{"id", "name"},
}
```
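The “query struct” above can be modeled as a plain Go struct. Here is a minimal, self-contained sketch (flattened into one package rather than the separate `query` package the article's code imports; the struct in the finished repository also carries fields for WHERE conditions, UPDATE assignments, and so on):

```go
package main

import "fmt"

// Type enumerates the supported query kinds.
type Type int

const (
	UnknownType Type = iota
	Select
	Update
	Delete
	Insert
)

// Query is a minimal sketch of the parsed representation of a SQL query.
// The finished repository adds fields for conditions, updates, etc.
type Query struct {
	Type      Type
	TableName string
	Fields    []string
}

func main() {
	q := Query{Type: Select, TableName: "users.csv", Fields: []string{"id", "name"}}
	fmt.Printf("%+v\n", q)
}
```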

A lot can go wrong during parsing, so for convenience the two steps (lexical analysis and syntax analysis) are usually performed together, and parsing stops as soon as something goes wrong.

Strategy

We define the parser as follows.

```go
type parser struct {
	sql   string      // The query to parse
	i     int         // Where we are in the query string
	query query.Query // The "query struct" we'll build
	step  step        // What's this? Read on...
}

// Main function that returns the "query struct" or an error
func (p *parser) Parse() (query.Query, error) {}

// A "look-ahead" that returns the token at the current position
func (p *parser) peek() string {}

// Same as peek(), but advances past the token afterwards
func (p *parser) pop() string {}
```
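The `step` field is simply an enumeration of where we are in the grammar. A hedged sketch of what that type could look like (only a few of these constant names appear in this article; the rest are illustrative guesses at what a fuller list might contain):

```go
package main

import "fmt"

// step tracks the parser's position within the grammar. stepType and the
// stepUpdate* names follow the article's examples; the others are
// assumptions about what a fuller list could contain.
type step int

const (
	stepType step = iota
	stepSelectField
	stepSelectComma
	stepSelectFrom
	stepSelectFromTable
	stepUpdateTable
	stepUpdateSet
	stepUpdateField
	stepUpdateEquals
	stepUpdateValue
	stepUpdateComma
)

func main() {
	// The numeric values come from iota and are otherwise arbitrary.
	fmt.Println(stepType, stepUpdateTable)
	// Output: 0 5
}
```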

Intuitively, the first thing we do during parsing is “peek” at the first token. In basic SQL grammar, only a few tokens are valid there: SELECT, UPDATE, DELETE, etc. Anything else is already an error. The code looks like this:

```go
switch strings.ToUpper(parser.peek()) {

case "SELECT":
	parser.query.Type = "SELECT" // Start building the "query struct"
	parser.pop()
	// TODO continue parsing the SELECT query

// TODO cases for the other query types

default:
	return parser.query, fmt.Errorf("invalid query type")
}
```

All that’s left to do is basically fill in those TODOs and the error cases. However, as the diligent reader will soon discover, the code quickly becomes messy once we parse an entire SELECT query this way, not to mention the many other query types to parse. So we need some structure to help us.

Finite-state automata

Finite-state automata [4] are a fascinating topic, but this is not the place to earn a computer science degree, so we’ll focus only on what we need.

As we parse, only a few tokens are valid at each point. After finding one of these legal tokens, we arrive at a new node where a different set of tokens is legal, and so on until the entire query is parsed. We can visualize these node relationships as a directed graph:

(figure: directed graph of parser states — sql_parser_graph.png)

Transitions between nodes can be represented by a simpler table:

(figure: node transition table — sql_parser_table)

We can translate this table directly into a giant switch statement, and finally put the mysterious `parser.step` field to use:

```go
func (p *parser) Parse() (query.Query, error) {
	parser.step = stepType // Initial step

	for parser.i < len(parser.sql) {
		nextToken := parser.peek()

		switch parser.step {
		case stepType:
			switch nextToken {
			case UPDATE:
				parser.query.Type = "UPDATE"
				parser.step = stepUpdateTable
			// TODO cases for the other query types
			}
		case stepUpdateSet:
			// ...
		case stepUpdateField:
			// ...
		case stepUpdateComma:
			// ...
		}
		parser.pop()
	}
	return parser.query, nil
}
```

That’s it! Note that some steps loop back to a previous step under certain conditions, for example the commas in a SELECT field list. This strategy scales well for a basic parser. However, as the grammar grows more complex, the number of states increases dramatically and writing this code becomes tedious. I therefore recommend testing as you write, rather than testing after you’ve finished.
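To make the “looping back” concrete, here is a self-contained sketch of the field/comma loop in a SELECT statement. The step names and structure mirror the article’s pattern, but this exact function is my illustration, not code from the repository:

```go
package main

import "fmt"

// parseSelectFields bounces between a "field" step and a "comma or FROM"
// step until FROM is reached, collecting field names along the way.
func parseSelectFields(tokens []string) ([]string, error) {
	const (
		stepField = iota
		stepCommaOrFrom
	)
	step := stepField
	var fields []string
	for _, tok := range tokens {
		switch step {
		case stepField:
			if tok == "," || tok == "FROM" {
				return nil, fmt.Errorf("expected field name, got %q", tok)
			}
			fields = append(fields, tok)
			step = stepCommaOrFrom
		case stepCommaOrFrom:
			if tok == "FROM" {
				return fields, nil // done with the field list
			}
			if tok != "," {
				return nil, fmt.Errorf("expected , or FROM, got %q", tok)
			}
			step = stepField // loop back for the next field
		}
	}
	return nil, fmt.Errorf("unexpected end of query")
}

func main() {
	fields, err := parseSelectFields([]string{"id", ",", "name", "FROM"})
	fmt.Println(fields, err)
	// Output: [id name] <nil>
}
```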

Implementing peek()

Recall that we need to implement peek() and pop(). Since they do essentially the same job, we use a helper function to avoid repeating ourselves [5]. In addition, pop() advances the index further to skip over whitespace.

```go
func (p *parser) peek() string {
	peeked, _ := p.peekWithLength()
	return peeked
}

func (p *parser) pop() string {
	peeked, len := p.peekWithLength()
	p.i += len
	p.popWhitespace()
	return peeked
}

func (p *parser) popWhitespace() {
	for ; p.i < len(p.sql) && p.sql[p.i] == ' '; p.i++ {
	}
}
```

Here is a list of possible tokens:

```go
var reservedWords = []string{
	"(", ")", ">=", "<=", "!=", ",", "=", ">", "<",
	"SELECT", "INSERT INTO", "VALUES", "UPDATE",
	"DELETE FROM", "WHERE", "FROM", "SET",
}
```

If none of those match, we must be peeking at either a quoted string or a plain identifier (such as a field name). Here is the complete implementation of peekWithLength():

```go
func (p *parser) peekWithLength() (string, int) {
	if p.i >= len(p.sql) {
		return "", 0
	}
	for _, rWord := range reservedWords {
		token := p.sql[p.i:min(len(p.sql), p.i+len(rWord))]
		upToken := strings.ToUpper(token)
		if upToken == rWord {
			return upToken, len(upToken)
		}
	}
	if p.sql[p.i] == '\'' { // Quoted string
		return p.peekQuotedStringWithLength()
	}
	return p.peekIdentifierWithLength()
}
```
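A small note on the `min` used above: Go only gained a built-in `min` in version 1.21, so on earlier versions (including when this article was written) a little helper like this is needed:

```go
package main

import "fmt"

// min returns the smaller of two ints. Needed as a plain helper on Go
// versions before 1.21, which introduced a built-in min.
func min(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	fmt.Println(min(3, 7))
	// Output: 3
}
```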

The remaining functions are quite simple and are left as an exercise for the reader. If you’re curious, the repository linked in the abstract contains the full implementation; I won’t repeat the link here.
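For readers who would rather not jump to the repository right away, here is a hedged sketch of what the two remaining helpers could look like. The function names come from the article; the bodies below are my assumptions, not the repository’s exact code:

```go
package main

import "fmt"

type parser struct {
	sql string
	i   int
}

// peekQuotedStringWithLength returns a token like 'users.csv', quotes
// included, plus its length. Sketch only: a real implementation should
// also report unterminated strings as an error.
func (p *parser) peekQuotedStringWithLength() (string, int) {
	if p.i >= len(p.sql) || p.sql[p.i] != '\'' {
		return "", 0
	}
	for j := p.i + 1; j < len(p.sql); j++ {
		if p.sql[j] == '\'' { // found the closing quote
			return p.sql[p.i : j+1], j + 1 - p.i
		}
	}
	return "", 0 // unterminated string
}

// peekIdentifierWithLength consumes letters, digits, and a few extra
// characters that can appear in identifiers like users.csv.
func (p *parser) peekIdentifierWithLength() (string, int) {
	j := p.i
	for ; j < len(p.sql); j++ {
		c := p.sql[j]
		isIdent := c == '_' || c == '.' || c == '*' ||
			(c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
			(c >= '0' && c <= '9')
		if !isIdent {
			break
		}
	}
	return p.sql[p.i:j], j - p.i
}

func main() {
	p := &parser{sql: "'users.csv' WHERE id"}
	fmt.Println(p.peekQuotedStringWithLength())
	// Output: 'users.csv' 11
}
```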

Final validation

The parser may run into the end of the string before completing a whole query. So it’s a good idea to implement a parser.validate() function that checks the finished “query struct” and returns an error if the query is incomplete or otherwise wrong.
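A sketch of what such a validation could check. The method name follows the article; the exact checks and error messages below are assumptions (the real parser validates more, such as WHERE conditions):

```go
package main

import "fmt"

// Query is a pared-down version of the "query struct" for this sketch.
type Query struct {
	Type      string
	TableName string
	Fields    []string
}

type parser struct {
	query Query
}

// validate checks that the finished query struct is complete.
// Illustrative checks only.
func (p *parser) validate() error {
	if p.query.Type == "" {
		return fmt.Errorf("query type cannot be empty")
	}
	if p.query.TableName == "" {
		return fmt.Errorf("table name cannot be empty")
	}
	if p.query.Type == "Select" && len(p.query.Fields) == 0 {
		return fmt.Errorf("at least one field must be selected")
	}
	return nil
}

func main() {
	// A SELECT that ended before listing any fields fails validation.
	p := &parser{query: Query{Type: "Select", TableName: "users.csv"}}
	fmt.Println(p.validate())
}
```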

Testing

Go’s table-driven testing pattern works very nicely in our case:

```go
type testCase struct {
	Name     string      // description of the test
	SQL      string      // the input SQL
	Expected query.Query // the expected "query struct"
	Err      error       // the expected error
}
```

Test examples:

```go
ts := []testCase{
	{
		Name:     "empty query fails",
		SQL:      "",
		Expected: query.Query{},
		Err:      fmt.Errorf("query type cannot be empty"),
	},
	{
		Name:     "SELECT without FROM fails",
		SQL:      "SELECT",
		Expected: query.Query{Type: query.Select},
		Err:      fmt.Errorf("table name cannot be empty"),
	},
	...
```

Run the tests like this:

```go
for _, tc := range ts {
	t.Run(tc.Name, func(t *testing.T) {
		actual, err := Parse(tc.SQL)
		if tc.Err != nil && err == nil {
			t.Errorf("Error should have been %v", tc.Err)
		}
		if tc.Err == nil && err != nil {
			t.Errorf("Error should have been nil but was %v", err)
		}
		if tc.Err != nil && err != nil {
			require.Equal(t, tc.Err, err, "Unexpected error")
		}
		if len(actual) > 0 {
			require.Equal(t, tc.Expected, actual[0], "Query didn't match expectation")
		}
	})
}
```

I used the testify package [6] because it prints a diff when the query structs don’t match.

Going further

The approach in this article is good for:

  • Learning the LL(1) parser algorithm.
  • Parsing custom, simple grammars without any dependencies.

However, this approach is tedious and has its limits. Think about how you would parse arbitrarily complex, composable expressions (such as sqrt(a) = (1 * (2 + 3))).

For a more powerful parsing model, look into parser combinators [7]. goyacc [8] is a popular parser generator for Go.

I also recommend Rob Pike’s interesting talk on lexical scanning [9].

Recursive descent parsing [10] is another approach worth knowing.

Why I wrote this

Recently I decided to centralize my data into a CSV repository. This gave me a chance to learn React by building a user interface for CRUD [11] operations on the data. When I wanted to design an interface for passing those CRUD operations between the front end and the back end, I found SQL to be the natural language for it (and one I already know well).

Although there are many libraries for querying CSV with SQL, there does not seem to be much support for write operations (especially data definition statements [12]). A colleague suggested that I load the file into an in-memory SQLite [13] database, run the SQL, and export back to CSV. That would have been a fine solution, since I don’t really care about efficiency. But in the end I asked myself: haven’t you always wanted to write a SQL parser? How hard could it be?

It turns out that writing a (basic) parser is quite simple. But without a tutorial that keeps things as simple as possible, it can still be daunting.

I hope this tutorial takes away that fear. KISS [14]!


via: https://marianogappa.github.io/software/2019/06/05/lets-build-a-sql-parser-in-go/

Author: Mariano Gappa [15] Translator: plus7wist [16] Proofreader: polaris1119 [17]

This article was translated by GCTT [18] and published by the Go Chinese website [19].


LL(1) parser: https://en.wikipedia.org/wiki/LL_parser

Tokenization: https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization

Abstract syntax tree: https://en.wikipedia.org/wiki/Abstract_syntax_tree

Finite-state machine: https://en.wikipedia.org/wiki/Finite-state_machine

Don't repeat yourself: https://en.wikipedia.org/wiki/Don%27t_repeat_yourself

testify: https://github.com/stretchr/testify

Parser combinator: https://en.wikipedia.org/wiki/Parser_combinator

goyacc: https://godoc.org/golang.org/x/tools/cmd/goyacc

Rob Pike's talk on lexical scanning: https://www.youtube.com/watch?v=HxaD_trXwRE

Recursive descent parsing: http://www.cs.binghamton.edu/~zdu/parsdemo/recintro.html

CRUD: https://en.wikipedia.org/wiki/Create,_read,_update_and_delete

Data definition language: https://en.wikipedia.org/wiki/Data_definition_language

SQLite: https://www.sqlite.org/index.html

KISS: https://en.wikipedia.org/wiki/KISS_principle

Mariano Gappa: http://marianogappa.github.io/

plus7wist: https://github.com/plus7wist

polaris1119: https://github.com/polaris1119

GCTT: https://github.com/studygolang/GCTT

Go Chinese website: https://studygolang.com/