• Lexical analysis
  • Syntax analysis
    • Grammar
    • Analysis method
    • Go
  • Conclusion
  • Related articles
  • References

Source code is just a bunch of strings written in an agreed-upon format. A trained software engineer can mentally "compile" the source code of a language and run the target program, because they can group otherwise meaningless characters and interpret them according to the language's syntax conventions. If engineers can read and understand Go source code in this way, how can we build a program that analyzes programming language code the same way a human reads it?

This section introduces lexical analysis and syntax analysis, two stages that are essential to the compilation process. Their job is to turn a source file, which to a machine is just a disorderly sequence of characters, into an abstract syntax tree that a machine can easily understand and analyze. We will then look at what a parser actually sees when it reads Go code.

Lexical analysis

To the compiler, a source file is just a mess: a string of characters with no inherent meaning, in which every character looks the same. To understand these characters, the first thing we need to do is group them, which reduces the cost of understanding the raw string and simplifies the analysis of the source code.

make(chan int)

As someone who doesn’t understand any programming language, my first instinct is to break the above string into parts — make, chan, int, and parentheses. This intuitive process of breaking up the text is essentially lexical analysis.

Lexical analysis is the process of converting character sequences into token sequences in computer science.

Lex generates code that splits characters in a file into a series of tokens. Earlier versions of Go also used Lex for lexical analysis. In addition to looking at lex and how it works, we’ll also look at how the Go language performs lexical analysis.

lex

Lex is a code generator that uses a C-like syntax. It scans the incoming character stream using regular-expression matching. Suppose we have the following file:

%{
#include <stdio.h>
%}

%%
package      printf("PACKAGE ");
import       printf("IMPORT ");
\.           printf("DOT ");
\{           printf("LBRACE ");
\}           printf("RBRACE ");
\(           printf("LPAREN ");
\)           printf("RPAREN ");
\"           printf("QUOTE ");
\n           printf("\n");
[0-9]+       printf("NUMBER ");
[a-zA-Z_]+   printf("IDENT ");
%%

This definition file can recognize the package and import keywords, common special characters, numbers, and identifiers. The rules here may be rudimentary and imperfect, but they are enough to parse code like the following:

package main

import (
	"fmt"
)

func main() {
	fmt.Println("Hello")
}

The lex code in the .l file cannot be run directly, so we first need to "expand" it into C code. We can generate the C file and print its contents by running the following commands:

$ lex simplego.l
$ cat lex.yy.c
// ...
int yylex (void) {
	register yy_state_type yy_current_state;
	register char *yy_cp, *yy_bp;
	register int yy_act;
	// ...
	while ( 1 ) {
		yy_cp = (yy_c_buf_p);
		*yy_cp = (yy_hold_char);
		yy_bp = yy_cp;
		yy_current_state = (yy_start);
yy_match:
		do {
			register YY_CHAR yy_c = yy_ec[YY_SC_TO_UI(*yy_cp)];
			if ( yy_accept[yy_current_state] ) {
				(yy_last_accepting_state) = yy_current_state;
				(yy_last_accepting_cpos) = yy_cp;
			}
			while ( yy_chk[yy_base[yy_current_state] + yy_c] != yy_current_state ) {
				yy_current_state = (int) yy_def[yy_current_state];
				if ( yy_current_state >= 30 )
					yy_c = yy_meta[(unsigned int) yy_c];
			}
			yy_current_state = yy_nxt[yy_base[yy_current_state] + (unsigned int) yy_c];
			++yy_cp;
		} while ( yy_base[yy_current_state] != 37 );
		// ...
do_action:
		switch ( yy_act ) {
		case 0:
			*yy_cp = (yy_hold_char);
			yy_cp = (yy_last_accepting_cpos);
			yy_current_state = (yy_last_accepting_state);
			goto yy_find_action;
		case 1:
			YY_RULE_SETUP
			printf("PACKAGE ");
			YY_BREAK
		// ...
		}
	}
}

Since most of the code in this file is generated automatically, its blank lines and indentation are messy; the listing above omits parts of the file and re-formats the rest for readability.

The first 600 or so lines of lex.yy.c are mostly macro and function declarations and definitions. Most of the generated code belongs to yylex, a function that analyzes the input character stream using a deterministic finite automaton (DFA). If you look through the generated file you will notice that it does not define a main function; that is provided by the lex library, so we need to add the -ll option when compiling:

$ cc lex.yy.c -o simplego -ll
$ cat main.go | ./simplego

Once we have compiled the C code into a binary, we can pipe the Go program shown earlier into the generated lexer, which yields the following output:

PACKAGE IDENT

IMPORT LPAREN
QUOTE IDENT QUOTE
RPAREN

IDENT IDENT LPAREN RPAREN LBRACE
IDENT DOT IDENT LPAREN QUOTE IDENT QUOTE RPAREN
RBRACE

In this output we can still see the outline of the original Go source: the lexer generated by lex used regular-expression matching to turn a string that is hard for a machine to understand into a stream of tokens, which is much easier for later stages to process.

Go

Lexical analysis in Go is implemented by the scanner structure in cmd/compile/internal/syntax. This structure holds the source currently being scanned, the scanning mode that is enabled, and the Token that was most recently scanned:

type scanner struct {
	source
	mode   uint
	nlsemi bool

	// current token, valid after calling next()
	line, col uint
	tok       token
	lit       string
	kind      LitKind
	op        Operator
	prec      int
}

All Token types supported by the Go language are defined in the token.go file; the token type itself is just a small integer constant:

const (
	_    token = iota
	_EOF       // EOF

	// names and literals
	_Name    // name
	_Literal // literal

	// operators and operations
	_Operator // op
	// ...

	// delimiters
	_Lparen    // (
	_Lbrack    // [
	// ...

	// keywords
	_Break       // break
	// ...
	_Type        // type
	_Var         // var

	tokenCount //
)
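
The token type above is internal to the compiler, but the standard library's exported go/token package models tokens the same way: small integer constants with a readable string form. A quick, self-contained check using only the public go/token API (not the internal package):

package main

import (
	"fmt"
	"go/token"
)

func main() {
	// Tokens in go/token are integer constants with a String form.
	for _, tok := range []token.Token{token.PACKAGE, token.IMPORT, token.LPAREN, token.IDENT} {
		fmt.Printf("%d -> %s\n", tok, tok)
	}
	fmt.Println(token.VAR.IsKeyword()) // true
	fmt.Println(token.INT.IsLiteral()) // true
}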

From the Token types defined in the Go language, we can divide the elements in the language into several different categories, namely names and literals, operators, delimiters, and keywords. Lexical analysis is primarily driven by the next method in the scanner structure, which is a 250-line switch/case structure:

func (s *scanner) next() {
	c := s.getr()
	for c == ' ' || c == '\t' || c == '\n' || c == '\r' {
		c = s.getr()
	}

	s.line, s.col = s.source.line0, s.source.col0

	if isLetter(c) || c >= utf8.RuneSelf && s.isIdentRune(c, true) {
		s.ident()
		return
	}

	switch c {
	case -1:
		s.tok = _EOF

	case '0', '1', '2', '3', '4', '5', '6', '7', '8', '9':
		s.number(c)

	// ...
	}
}

Each time it is called, the scanner uses the getr function to fetch the next unparsed character in the file and executes a different case depending on that character. Spaces, tabs, and newlines are skipped as whitespace; if the current character is a digit between 0 and 9, the scanner calls the number method to try to match a numeric literal.

func (s *scanner) number(c rune) {
	s.startLit()

	if c != '.' {
		s.kind = IntLit
		for isDigit(c) {
			c = s.getr()
		}
	}

	s.ungetr()
	s.nlsemi = true
	s.lit = string(s.stopLit())
	s.tok = _Literal
}

The method shown above omits a lot of the real implementation, including how floating-point numbers, exponents, and complex numbers are matched; here we only sketch the logic of lexical matching. startLit is called to record the starting position of the Token before matching begins, and the for loop keeps fetching characters until one is no longer a digit. The method then puts the most recently read non-matching (non-digit) character back via ungetr and stores the literal and Token type on the scanner before returning, at which point one call to next is complete.

The scanner in this package only exposes the next method to the layer above it. Lexical analysis is performed lazily: next is called only when the parser above needs a new Token.

The lexical elements of Go are relatively simple, so parsing them with one large switch/case is both convenient and efficient. Early versions of Go used lex to generate the lexer, but the compiler eventually switched to a lexer written in Go itself; Go now uses its own hand-written lexical analyzer to parse Go source code.
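
The internal scanner is not importable, but the standard library ships go/scanner and go/token, which expose the same pull-style token stream to ordinary programs. Here is a minimal sketch that tokenizes a small snippet with the public API; the file name and output format are only for illustration:

package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

func main() {
	src := []byte("package main\n\nfunc main() { fmt.Println(\"Hello\") }\n")

	fset := token.NewFileSet()
	file := fset.AddFile("main.go", fset.Base(), len(src))

	var s scanner.Scanner
	s.Init(file, src, nil /* error handler */, 0 /* mode */)

	// Pull tokens one at a time, just like the compiler's parser calls next().
	for {
		pos, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		fmt.Printf("%-12s %-10s %q\n", fset.Position(pos), tok, lit)
	}
}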

Syntax analysis

Syntax analysis follows lexical analysis in the compilation process: it analyzes an input made up of a sequence of Tokens according to a specific formal grammar and determines its grammatical structure. In other words, the Token sequence produced by the lexer is the input to the parser.

During syntax analysis we use either top-down or bottom-up derivation. Before looking at how Go implements syntax analysis, we first introduce grammars and the analysis methods used in parsing.

Grammar

A context-free grammar is a tool for formally and precisely describing a programming language. We define the syntax of a language through a grammar, which mainly consists of a series of production rules for rewriting strings; each production rule in a context-free grammar replaces the non-terminal on its left-hand side with the string on its right-hand side.

Terminals are symbols in the grammar that cannot be expanded any further, such as identifiers or literals like "id" and "123"; non-terminals, by contrast, can still be expanded through production rules. A grammar is made up of four parts:

  • $N$ is the set of non-terminals;
  • $\Sigma$ is the set of terminals;
  • $P$ is the set of production rules;
  • $S$ is the unique start symbol, which belongs to the set of non-terminals;

A grammar is formally defined as the tuple $(N, \Sigma, P, S)$, whose components are exactly the four parts listed above; the most important of them is the set of production rules. For example, consider the following two rules:

  1. $S \rightarrow aSb$
  2. $S \rightarrow ab$
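
As a quick worked example, the string aabb can be derived from these two rules as follows:

  1. $S$ (start symbol)
  2. $aSb$ (Rule 1)
  3. $aabb$ (Rule 2)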

The grammar formed by the above rules can generate strings such as ab, aabb, aaabbb, and so on. A programming language's grammar is expressed by just such a collection of production rules; here we can showcase some of Go's grammar rules, taken from parser.go in the cmd/compile/internal/syntax package:

SourceFile = PackageClause ";" { ImportDecl ";" } { TopLevelDecl ";" } .
PackageClause  = "package" PackageName .
PackageName    = identifier .

ImportDecl       = "import" ( ImportSpec | "(" { ImportSpec ";" } ")" ) .
ImportSpec       = [ "." | PackageName ] ImportPath .
ImportPath       = string_lit .

TopLevelDecl  = Declaration | FunctionDecl | MethodDecl .
Declaration   = ConstDecl | TypeDecl | VarDecl .

A more detailed grammar for Go can be found in the Language Specification, which contains not only the grammar of the Language, but also information about the lexical element, built-in functions, and so on.

Because each Go source file is eventually parsed into its own abstract syntax tree, the top-level structure, or start symbol, of the syntax tree is SourceFile. As the grammar shows, each file consists of a package clause followed by optional import declarations and other top-level declarations:

type File struct {
	PkgName  *Name
	DeclList []Decl
	Lines    uint
	node
}

There are five kinds of top-level declarations: constants, types, variables, functions, and methods, all of which appear at the outermost level of a file.

ConstDecl = "const" ( ConstSpec | "(" { ConstSpec ";" } ")" ) .
ConstSpec = IdentifierList [ [ Type ] "=" ExpressionList ] .

TypeDecl  = "type" ( TypeSpec | "(" { TypeSpec ";" } ")" ) .
TypeSpec  = AliasDecl | TypeDef .
AliasDecl = identifier "=" Type .
TypeDef   = identifier Type .

VarDecl   = "var" ( VarSpec | "(" { VarSpec ";" } ")" ) .
VarSpec   = IdentifierList ( Type [ "=" ExpressionList ] | "=" ExpressionList ) .

The grammar above defines the syntax of the three common kinds of declarations in Go: constants, types, and variables, and in it we can recognize the familiar keywords const, type, and var. A quick look back at everyday Go code confirms that the grammar matches what we write.
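
As a small illustration, every declaration in the following snippet corresponds directly to one of the productions above; the names are arbitrary:

package main

const (
	answer     = 42      // ConstSpec = IdentifierList "=" ExpressionList
	pi float64 = 3.14159 // ConstSpec = IdentifierList Type "=" ExpressionList
)

type (
	ID      = int64 // AliasDecl = identifier "=" Type
	Celsius float64 // TypeDef   = identifier Type
)

var (
	x, y int        // VarSpec = IdentifierList Type
	name = "gopher" // VarSpec = IdentifierList "=" ExpressionList
)

func main() {
	var id ID = answer // the untyped constant 42 fits in an int64
	_, _, _ = id, Celsius(x+y), name
	_ = pi
}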

Compared with these three relatively simple structures, the definitions of functions and methods are more complex. As the grammar below shows, a Statement can expand into 15 different constructs, including the switch, if/else, for, and select statements we use every day:

FunctionDecl = "func" FunctionName Signature [ FunctionBody ] .
FunctionName = identifier .
FunctionBody = Block .

MethodDecl = "func" Receiver MethodName Signature [ FunctionBody ] .
Receiver   = Parameters .

Block = "{" StatementList "}" .
StatementList = { Statement ";" } .

Statement =
	Declaration | LabeledStmt | SimpleStmt |
	GoStmt | ReturnStmt | BreakStmt | ContinueStmt | GotoStmt |
	FallthroughStmt | Block | IfStmt | SwitchStmt | SelectStmt | ForStmt |
	DeferStmt .

SimpleStmt = EmptyStmt | ExpressionStmt | SendStmt | IncDecStmt | Assignment | ShortVarDecl .

Together, these productions define the statements and expressions that can be used in Go. This article will not go into the details of Statement; interested readers can consult the Go language specification or find the answers directly in the source code.

Analysis method

Parsing methods generally fall into two categories, top-down and bottom-up, which derive the input Token sequence in different ways:

  • Top-down analysis: this can be seen as the process of finding a leftmost derivation of the input stream. Based on the current input symbol, the parser picks a production rule and replaces the corresponding non-terminal with the right-hand side of that rule, deriving downward;
  • Bottom-up analysis: starting from the input stream itself, the parser repeatedly rewrites (reduces) the rightmost symbols. In other words, it starts from the simplest symbols and merges them until it reaches the start symbol at the end of the parse.

Don't worry if these definitions are hard to follow; the rest of this section walks through both analysis methods and their detailed analysis processes.

Top-down

LL grammar is a grammar that uses the top-down analysis method. Here is a common LL grammar:

  1. $S \rightarrow aS_1$
  2. $S_1 \rightarrow bS_1$
  3. $S_1 \rightarrow \epsilon$

Assume we have the production rules above and the input stream abb. With the top-down approach, at each step the parser looks at the next input character to decide which rule to use to expand the current string:

  1. $S$ (start symbol)
  2. $aS_1$ (Rule 1)
  3. $abS_1$ (Rule 2)
  4. $abbS_1$ (Rule 2)
  5. $abb$ (Rule 3)

This analysis method always starts from the start symbol and uses the next input symbol to decide how to expand the remaining non-terminal ($S$ or $S_1$) in the current string; the parse is complete only when no non-terminals remain.
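
A hand-written recursive-descent recognizer makes this concrete: each non-terminal becomes a function, and the next input character decides which rule that function applies. The following is a minimal sketch for the toy grammar above; the type and function names are made up for illustration:

package main

import "fmt"

// llParser recognizes the toy LL grammar:
//   S  -> a S1
//   S1 -> b S1 | ε
type llParser struct {
	input string
	pos   int
}

func (p *llParser) peek() byte {
	if p.pos < len(p.input) {
		return p.input[p.pos]
	}
	return 0 // end of input
}

// s implements the production S -> a S1.
func (p *llParser) s() bool {
	if p.peek() != 'a' {
		return false
	}
	p.pos++ // consume 'a'
	return p.s1()
}

// s1 implements S1 -> b S1 | ε, choosing the rule by looking at the next character.
func (p *llParser) s1() bool {
	if p.peek() == 'b' {
		p.pos++       // consume 'b'
		return p.s1() // S1 -> b S1
	}
	return true // S1 -> ε
}

func recognize(input string) bool {
	p := &llParser{input: input}
	return p.s() && p.pos == len(p.input)
}

func main() {
	fmt.Println(recognize("abb")) // true
	fmt.Println(recognize("ba"))  // false
}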

Bottom-up

If we instead analyze the input stream bottom-up, the process is completely different. The common LR(0), SLR, LR(1), and LALR(1) grammars are all handled bottom-up. We can write a simple LR(0) grammar that describes the same language as the one in the previous section:

  1. $S \rightarrow S_1$
  2. $S_1 \rightarrow S_1b$
  3. $S_1 \rightarrow a$

Using this equivalent grammar to process the same input stream abb leads to a completely different derivation process:

  1. $a$ (pushed)
  2. $S_1$ (Rule 3)
  3. $S_1b$ (pushed)
  4. $S_1$ (Rule 2)
  5. $S_1b$ (pushed)
  6. $S_1$ (Rule 2)
  7. $S$ (Rule 1)

Bottom-up analysis maintains a stack of symbols that have not yet been reduced and performs two kinds of operations throughout the process: shift, which pushes the next input symbol onto the stack, and reduce, which replaces the rightmost symbols on the stack according to a production rule.

The above analysis process is completely different from the top-down analysis method, and the two different analysis methods actually represent two different ideas in computer science — from abstract to concrete and from concrete to abstract.
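
To make the shift and reduce operations concrete, here is a minimal, naive shift-reduce recognizer for the toy LR(0) grammar above; it is purely illustrative and is not how a real LR parser is generated:

package main

import "fmt"

// recognizeLR checks the input against the toy grammar
//   S -> S1 ; S1 -> S1 b ; S1 -> a
// using a naive shift-reduce loop over a symbol stack.
func recognizeLR(input string) bool {
	var stack []string
	i := 0
	for {
		switch {
		// reduce: S1 -> a
		case len(stack) >= 1 && stack[len(stack)-1] == "a":
			stack[len(stack)-1] = "S1"
		// reduce: S1 -> S1 b
		case len(stack) >= 2 && stack[len(stack)-2] == "S1" && stack[len(stack)-1] == "b":
			stack = stack[:len(stack)-1]
		default:
			if i < len(input) {
				// shift: push the next input symbol onto the stack
				stack = append(stack, string(input[i]))
				i++
			} else {
				// no input left: accept if the stack reduces to S (via S -> S1)
				return len(stack) == 1 && stack[0] == "S1"
			}
		}
	}
}

func main() {
	fmt.Println(recognizeLR("abb")) // true
	fmt.Println(recognizeLR("ba"))  // false
}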

Lookahead

Besides the two families of parsing methods, LL and LR, another very important concept in parsing is lookahead. When different production rules conflict, the parser needs to read ahead some Tokens to decide which rule should be used to expand or reduce the input stream. In an LALR(1) grammar, for example, one Token of lookahead is enough to resolve such conflicts correctly.

Go

The Go parser uses an LALR(1) grammar to parse the Token sequence produced by lexical analysis; rightmost derivation plus lookahead forms the basis of the Go parser, and this is also the choice made by most programming languages.

As for exactly which class of grammar Go uses, the author has not found a definitive answer. The most reliable reference is a 2011 Google Groups discussion, what type of grammar GO programming language?, but whether its conclusions still hold is unclear. Go's grammar is simple compared with other languages, yet analyzing it precisely would still take considerable time; if you have a definite answer, feel free to leave a comment below the article.

The compiler's Main function was introduced in the article on Go's compilation process; it calls parseFiles, which does two main things: it calls the Parse method of the syntax package to parse the source code in each file, and it calls the node method to add the nodes of each abstract syntax tree to the global xtop variable:

func parseFiles(filenames []string) uint {
	var noders []*noder

	for _, filename := range filenames {
		p := &noder{
			basemap: make(map[*syntax.PosBase]*src.PosBase),
			err:     make(chan syntax.Error),
		}
		noders = append(noders, p)

		go func(filename string) {
			base := syntax.NewFileBase(filename)

			f, _ := os.Open(filename)
			defer f.Close()

			p.file, _ = syntax.Parse(base, f, p.error, p.pragma, syntax.CheckBranches)
		}(filename)
	}

	var lines uint
	for _, p := range noders {
		p.node()
		lines += p.file.Lines
	}

	return lines
}

The Parse function initializes a new parser structure and turns on lexical and syntactic parsing of the current file using the fileOrNil method:

func Parse(base *PosBase, src io.Reader, errh ErrorHandler, pragh PragmaHandler, mode Mode) (_ *File, first error) {
	var p parser
	p.init(base, src, errh, pragh, mode)
	p.next()
	return p.fileOrNil(), p.first
}

The fileOrNil method is an implementation of the Go grammar described above. It first parses the package definition at the beginning of the file:

// SourceFile = PackageClause ";" { ImportDecl ";" } { TopLevelDecl ";" } .
func (p *parser) fileOrNil() *File {
	f := new(File)
	f.pos = p.pos()

	if !p.got(_Package) {
		p.syntaxError("package statement must be first")
		return nil
	}
	f.PkgName = p.name()
	p.want(_Semi)

If the current Token is not the package keyword, the parser reports a syntax error immediately; otherwise it calls name to match the package name and stores the result in the File structure that will be returned.

	for p.got(_Import) {
		f.DeclList = p.appendGroup(f.DeclList, p.importDecl)
		p.want(_Semi)
	}

Once the package name of the current file is determined, the optional import statements are parsed. Each import is seen by the parser as a declaration statement, which is added to the file’s DeclList list.

After that, a switch statement branches on the keyword the parser has just read and calls appendGroup with constDecl, typeDecl, and similar functions to handle declarations of the corresponding kind:

	for p.tok != _EOF {
		switch p.tok {
		case _Const:
			p.next()
			f.DeclList = p.appendGroup(f.DeclList, p.constDecl)

		case _Type:
			p.next()
			f.DeclList = p.appendGroup(f.DeclList, p.typeDecl)

		case _Var:
			p.next()
			f.DeclList = p.appendGroup(f.DeclList, p.varDecl)

		case _Func:
			p.next()
			if d := p.funcDeclOrNil(); d != nil {
				f.DeclList = append(f.DeclList, d)
			}
		}
	}

	f.Lines = p.source.line

	return f
}

At the end of fileOrNil, the File structure created at the start of the method is returned. Along the way, the method relies on many sub-methods to parse the input file.

If you’re reading this, you might wonder why you don’t see the parsing code. The parsing process is actually embedded in the parsing process, partly because the scanner is built into the Parser structure. So the p.ext () method actually calls scanner’s next method, which directly fetches the next Token in the file. From this we can see that lexical analysis and parsing are actually done together.

fileOrNil and the child methods it calls form a tree: the root is the fileOrNil method, and its children are methods such as importDecl and constDecl.

Each of these methods, fileOrNil, constDecl, and so on, corresponds to one production rule in the Go grammar; fileOrNil, for example, implements:

SourceFile = PackageClause ";" { ImportDecl ";" } { TopLevelDecl ";" } .

From this we can see the principle behind the parser: it maps each production rule of the programming language to a corresponding method, and the tree of calls among these methods builds up and returns the abstract syntax tree.

Since most of the methods are implemented in a very similar way, only fileOrNil is shown here. To see how the other methods work, read parser.go, which contains all the methods used in the parsing phase.
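
The internal parser cannot be imported, but the standard library's go/parser and go/ast packages expose the same file-level entry point and a comparable syntax tree, so we can observe the result of parsing from ordinary Go code. A small sketch, where the source string is arbitrary:

package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

func main() {
	src := `package main

import "fmt"

const answer = 42

func main() { fmt.Println("Hello", answer) }
`
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "main.go", src, 0)
	if err != nil {
		panic(err)
	}

	// The root of the tree is an *ast.File: a package name plus a list of
	// declarations, much like the File structure in cmd/compile/internal/syntax.
	fmt.Println("package:", f.Name.Name)
	for _, decl := range f.Decls {
		fmt.Printf("top-level declaration: %T\n", decl)
	}

	// Walk the whole tree and count identifiers.
	idents := 0
	ast.Inspect(f, func(n ast.Node) bool {
		if _, ok := n.(*ast.Ident); ok {
			idents++
		}
		return true
	})
	fmt.Println("identifiers:", idents)
}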

Helper methods

While we won’t go into other similar methods here, there are a few helper methods in the parser execution that we will briefly explain. The first are the common got and want methods:

func (p *parser) got(tok token) bool {
	if p.tok == tok {
		p.next()
		return true
	}
	return false
}

func (p *parser) want(tok token) {
	if !p.got(tok) {
		p.syntaxError("expecting " + tokstring(tok))
		p.advance()
	}
}

got is used to quickly check for particular Tokens, such as keywords, in a statement: if the parser's current Token is the one passed in, it skips over that Token and returns true. want is a thin wrapper around got: if the current Token is not the one we expect, it immediately reports a syntax error.

Introducing these two methods removes a lot of keyword-checking logic and redundant code from the higher-level parsing methods, making the syntax analysis code clearer and more concise.

The other helper, appendGroup, is a little more complex; its main job is to parse a batch of definitions that belong to one group. Consider a simple example:

var (
   a int
   b int 
)

These two variables belong to the same Group, and the parsing functions for top-level structures such as ConstDecl and VarDecl are passed to appendGroup together with an extra Group parameter:

func (p *parser) appendGroup(list []Decl, f func(*Group) Decl) []Decl {
	if p.tok == _Lparen {
		g := new(Group)
		p.list(_Lparen, _Semi, _Rparen, func() bool {
			list = append(list, f(g))
			return false
		})
	} else {
		list = append(list, f(nil))
	}

	return list
}

If we use the grouped (parenthesized) syntax when declaring variables or constants, the resulting declarations will have a non-nil Group field.

The appendGroup method calls the passed-in function f to match the input stream and appends the results to the DeclList slice in the File structure. Import, const, var, and type declarations are all parsed through appendGroup, while function declarations are appended to DeclList directly.

Node

The parser ultimately builds the nodes of the abstract syntax tree out of different structures. The root node is the File structure we have already seen, which holds the package name of the current file, the list of all declarations, and the number of lines in the file:

type File struct {
	PkgName  *Name
	DeclList []Decl
	Lines    uint
	node
}

The other node structures are defined in the nodes.go file, which contains all of the declaration node types. Here we briefly look at the structure of a function declaration:

type (
	Decl interface {
		Node
		aDecl()
	}

	FuncDecl struct {
		Attr   map[string]bool
		Recv   *Field
		Name   *Name
		Type   *FuncType
		Body   *BlockStmt
		Pragma Pragma
		decl
	}
)

As this definition shows, a function is syntactically composed of a receiver, a function name, a function type, and a function body. The function body, a BlockStmt, is made up of a series of statements, which together form the body of the function.

The body of a function is a slice of Stmt. Stmt itself is an interface with 14 different implementations.

These different kinds of Stmt cover all of the imperative code we can write in Go, and among them we find many familiar control structures, such as if, for, switch, and select, which are common in other programming languages as well.
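
The internal node types mirror those in the exported go/ast package, where a FuncDecl likewise carries a receiver, a name, a type, and a BlockStmt body. Here is a short sketch that prints these parts for a parsed function; the sample source is made up:

package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

func main() {
	src := `package main

type Celsius float64

func (c Celsius) Describe(name string) string {
	if c > 100 {
		return name + " is boiling"
	}
	for i := 0; i < 3; i++ {
		_ = i
	}
	return name
}
`
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "sample.go", src, 0)
	if err != nil {
		panic(err)
	}

	for _, decl := range f.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok {
			continue
		}
		// Receiver, name, type, and body: the four parts named in the text above.
		fmt.Println("has receiver:", fn.Recv != nil)
		fmt.Println("name:        ", fn.Name.Name)
		fmt.Println("parameters:  ", fn.Type.Params.NumFields())
		for _, stmt := range fn.Body.List {
			fmt.Printf("body statement: %T\n", stmt) // *ast.IfStmt, *ast.ForStmt, *ast.ReturnStmt
		}
	}
}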

Conclusion

This section walked through lexical analysis and syntax analysis in Go: we introduced the principles of both phases and then examined in detail how the Go compiler implements lexical and syntax analysis, working directly from the compiler's source code.

Reading the scanner and parser gives us a clear picture of how source code is processed by the parser, and recognizing familiar keywords and syntax structures in Go's grammar and parser deepens our understanding of the language.

Related articles

  • Overview of Compilation Principles
  • Lexical and parser
  • Type checking
  • Intermediate code generation
  • Machine code generation

References

  • Lexical Analysis · Wikipedia
  • Lex command
  • what type of grammar GO programming language?
