Preface

In the previous article, I introduced the core technique behind lexical analysis, finite automata (DFA), along with the usage and working principles of two common lexical analyzers. With that background, reading the Go lexical analysis source code will be much easier.

This article mainly contains the following contents:

  1. Where the Go compiler's entry file is, and what the entry file does
  2. Where lexical analysis sits in the Go compilation process, and what its detailed flow looks like
  3. Writing a test Go source file, running lexical analysis on it, and looking at the results

Source code analysis

Go compiler entry

To understand more clearly how the Go compilation process reaches the lexical analysis step, let's first look at where the Go compiler's entry file is and roughly what it does.

The Go compiler's entry file is src/cmd/compile/main.go -> gc.Main(archInit)

Enter the gc.Main(archInit) function. It is a long function, and most of what it does is read the arguments passed in from the command line and update the compilation options and configuration. Further down you will see this line of code:

lines := parseFiles(flag.Args())

This is the entry point for lexical analysis and syntax analysis. It performs lexical and syntax analysis on the given files, producing a syntax tree, which is then built into an abstract syntax tree and type-checked; those later steps will be covered in upcoming articles.

Open the parseFiles(flag.Args()) function and you can see the following (I omitted the latter part of the code to focus on the lexical analysis):

func parseFiles(filenames []string) uint {
	noders := make([]*noder, 0, len(filenames))
	// Limit the number of simultaneously open files.
	sem := make(chan struct{}, runtime.GOMAXPROCS(0)+10)

	for _, filename := range filenames {
		p := &noder{
			basemap: make(map[*syntax.PosBase]*src.PosBase),
			err:     make(chan syntax.Error),
		}
		noders = append(noders, p)

		go func(filename string) {
			sem <- struct{}{}
			defer func() { <-sem }()
			defer close(p.err)
			base := syntax.NewFileBase(filename)

			f, err := os.Open(filename)
			if err != nil {
				p.error(syntax.Error{Msg: err.Error()})
				return
			}
			defer f.Close()

			p.file, _ = syntax.Parse(base, f, p.error, p.pragma, syntax.CheckBranches) // errors are tracked via p.error
		}(filename)
	}
	......
}

As we know, during Go compilation each source file is eventually parsed into a syntax tree. As the first few lines of the code above show, parseFiles creates multiple goroutines to compile the source files, but it limits how many source files can be open at the same time:

sem := make(chan struct{}, runtime.GOMAXPROCS(0)+10)
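This is the common Go idiom of using a buffered channel as a counting semaphore to bound concurrency. Here is a minimal, self-contained sketch of the same pattern outside the compiler (the file names are made up for illustration; this is not the compiler's code):

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	files := []string{"a.go", "b.go", "c.go"} // hypothetical file names
	// A buffered channel used as a counting semaphore: at most cap(sem)
	// goroutines can hold a slot (i.e. have a file open) at the same time.
	sem := make(chan struct{}, runtime.GOMAXPROCS(0)+10)

	var wg sync.WaitGroup
	for _, name := range files {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			fmt.Println("parsing", name)
		}(name)
	}
	wg.Wait()
}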

The source files are then traversed, and a number of goroutines perform lexical and syntax analysis on them; this is the for loop and the go func above. Inside the go func, the information of the source file is initialized first, mainly recording its name, line, and column, so that if an error is encountered during lexical or syntax analysis its location can be reported. This involves the following structures:

type PosBase struct {
	pos       Pos
	filename  string
	line, col uint32
}

type Pos struct {
	base      *PosBase
	line, col uint32
}

The go func then opens the source file and initializes the syntax parser. The reason the syntax parser is initialized here is that, in the Go compiler, lexical analysis and syntax analysis are put together, and initializing the syntax parser also initializes the lexical analyzer. We can look at the syntax.Parse function called in the go func:

func Parse(base *PosBase, src io.Reader, errh ErrorHandler, pragh PragmaHandler, mode Mode) (_ *File, first error) {
	defer func() {
		if p := recover(); p != nil {
			if err, ok := p.(Error); ok {
				first = err
				return
			}
			panic(p)
		}
	}()

	var p parser
	p.init(base, src, errh, pragh, mode) // Initialize the operation
	p.next() // The lexical analyzer parses the source file and converts it into tokens
	return p.fileOrNil(), p.first // The syntax parser parses the tokens above
}
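Note that the syntax package lives under cmd/compile/internal and cannot be imported from outside the compiler. If you want to try the same scan-then-parse flow yourself, the exported go/parser package offers an equivalent entry point; a minimal sketch (the source string is made up):

package main

import (
	"fmt"
	"go/parser"
	"go/token"
)

func main() {
	src := "package demo\nfunc main() {}\n" // hypothetical source
	fset := token.NewFileSet()
	// ParseFile runs the scanner and the parser together, much like syntax.Parse above.
	f, err := parser.ParseFile(fset, "demo.go", src, 0)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("package name:", f.Name.Name) // prints: package name: demo
}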

You can see that parsing initialization is invoked:

p.init(base, src, errh, pragh, mode)

When we go into p.init, we see a line of code that initializes the lexical analyzer:

p.scanner.init(...) // the parameters to initialize the lexical analyzer are passed here

You can see that the syntax parser calls the lexical analyzer's init method through p.scanner. The scanner structure is embedded in the parser structure. This article focuses on lexical analysis, so the meaning of the parser's fields is not covered here; the parser structure will be introduced in detail in the article on syntax analysis.

// The parser structure
type parser struct {
	file  *PosBase 
	errh  ErrorHandler
	mode  Mode
	pragh PragmaHandler
	scanner // Embedded lexical analyzer

	base   *PosBase 
	first  error 
	errcnt int  
	pragma Pragma  

	fnest  int
	xnest  int 
	indent []byte
}
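The key point is that scanner is an embedded (anonymous) field, so the scanner's fields and methods are promoted to the parser. A minimal sketch of this embedding mechanism, with made-up types rather than the compiler's real ones:

package main

import "fmt"

// A scanner-like type with its own state and methods.
type scanner struct {
	tok string
}

func (s *scanner) init(src string) { s.tok = "" }
func (s *scanner) next()           { s.tok = "_Name" } // pretend one token was matched

// parser embeds scanner, so scanner's fields and methods are promoted.
type parser struct {
	scanner
}

func main() {
	var p parser
	p.scanner.init("package main") // explicit form, like p.scanner.init(...) above
	p.next()                       // promoted method: calls (&p.scanner).next()
	fmt.Println(p.tok)             // promoted field: reads p.scanner.tok
}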

Now that the relationship between syntax analysis and lexical analysis is clear, let's look at the specific process of lexical analysis.

Lexical analysis process

The code location for lexical analysis is:

src/cmd/compile/internal/syntax/scanner.go

The lexical analyzer is implemented with a structure that looks like this:

type scanner struct {
	source      // source is also a structure; it records information about the source file being analyzed, such as the byte array of its contents and the currently scanned character and its position
	mode   uint // Controls whether comments are parsed
	nlsemi bool // if set '\n' and EOF translate to ';'

	// current token, valid after calling next()
	line, col uint     // The position where the currently matched token starts
	blank     bool     // line is blank up to col
	tok       token    // The token matched for the current string (all token types supported by Go are recorded as constants)
	lit       string   // The source text of the currently matched token. For example, if "if" is scanned from the source file, tok is _If and lit is "if"
	bad       bool     // If there is a syntax error, the lit obtained may be incorrect
	kind      LitKind  // If the token is a literal, this identifies which kind of literal it is, such as IntLit, FloatLit, RuneLit, etc.
	op        Operator // Similar to kind; identifies the operator token recognized (if any)
	prec      int      // valid if tok is _Operator, _AssignOp, or _IncOp
}

type source struct {
	in   io.Reader
	errh func(line, col uint, msg string)

	buf       []byte // A byte array holding the source file contents
	ioerr     error  // File read error, if any
	b, r, e   int    // buffer indices (see comment above)
	line, col uint   // The position of the currently scanned character
	ch        rune   // The character currently scanned
	chw       int    // width of ch
}

Now that you know what each field in the lexical parser structure means, let’s look at what types of tokens are available in Go

Token

A token is the smallest lexical unit with independent meaning in a programming language. Tokens mainly include keywords, user-defined identifiers, operators, delimiters, comments, and so on, all of which can be found in src/cmd/compile/internal/syntax/tokens.go. Part of the list is shown below (these tokens are defined as constants):

const (
	_    token = iota
	_EOF // EOF

	// names and literals
	_Name    // name
	_Literal // literal

	// operators and operations
	// _Operator is excluding '*' (_Star)
	_Operator // op
	_AssignOp // op=
	_IncOp    // opop
	_Assign   // =
	......

	// delimiters
	_Lparen // (
	_Lbrack // [
	_Lbrace // {
	_Rparen // )
	_Rbrack // ]
	_Rbrace // }
	......

	// keywords
	_Break    // break
	_Case     // case
	_Chan     // chan
	_Const    // const
	_Continue // continue
	_Default  // default
	_Defer    // defer
	......

	// empty line comment to exclude it from .String
	tokenCount //
)

The three most important attributes of each token are its token type, the text of the token in the source code, and the position where the token appears. Comments and semicolons are two special kinds of tokens: ordinary comments generally do not affect the semantics of a program, so they can often be ignored.

All tokens are divided into four categories (a quick check with the exported go/token package follows this list):

  1. Tokens of a special type, such as _EOF
  2. Tokens for basic literal values, such as IntLit, FloatLit, ImagLit
  3. Operators, such as Add (+), Sub (-), Mul (*)
  4. Keywords, such as _Break (break), _Case (case)
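The compiler-internal token constants above are not exported, but the go/token package in the standard library groups its tokens into the same categories and lets you check them; a small sketch:

package main

import (
	"fmt"
	"go/token"
)

func main() {
	// Four tokens, one from each category listed above.
	for _, tok := range []token.Token{token.EOF, token.INT, token.ADD, token.BREAK} {
		fmt.Printf("%-6s literal=%-5v operator=%-5v keyword=%v\n",
			tok, tok.IsLiteral(), tok.IsOperator(), tok.IsKeyword())
	}
}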

Lexical analysis implementation

In the lexical analysis section, there are two core methods, nextch() and next()

As we know, lexical analysis is a character-by-character reading of the source file, and nextch() continuously reads the contents of the source file character-by-character from left to right

The following is part of the nextch() function, which gets the next unprocessed character and updates the scanned position

func (s *source) nextch() {
redo:
	s.col += uint(s.chw)
	if s.ch == '\n' {
		s.line++
		s.col = 0
	}

	// fast common case: at least one ASCII character
	if s.ch = rune(s.buf[s.r]); s.ch < sentinel {
		s.r++
		s.chw = 1
		if s.ch == 0 {
			s.error("invalid NUL character")
			goto redo
		}
		return
	}

	// slower general case: add more bytes to buffer if we don't have a full rune
	for s.e-s.r < utf8.UTFMax && !utf8.FullRune(s.buf[s.r:s.e]) && s.ioerr == nil {
		s.fill()
	}

	// EOF
	if s.r == s.e {
		if s.ioerr != io.EOF {
			// ensure we never start with a '/' (e.g., rooted path) in the error message
			s.error("I/O error: " + s.ioerr.Error())
			s.ioerr = nil
		}
		s.ch = -1
		s.chw = 0
		return
	}
	......
}
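If you want to experiment with the same character-by-character idea outside the compiler, bufio.Reader.ReadRune reads one UTF-8 character at a time and reports its byte width, which plays roughly the same role as ch and chw here; a minimal sketch with a made-up input string:

package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

func main() {
	r := bufio.NewReader(strings.NewReader("if a == 1 {\n}"))
	line, col := 1, 1
	for {
		ch, width, err := r.ReadRune() // roughly the roles of s.ch and s.chw
		if err == io.EOF {
			break
		}
		fmt.Printf("line %d, col %d: %q (width %d)\n", line, col, ch, width)
		if ch == '\n' {
			line, col = line+1, 1
		} else {
			col += width
		}
	}
}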

The next() function, based on the scanned characters, uses the idea of a deterministic finite automaton introduced in the previous article to split out strings and match the corresponding tokens. Part of the core code of the next() function is as follows:

func (s *scanner) next() {
	nlsemi := s.nlsemi
	s.nlsemi = false

redo:
	// skip white space
	s.stop()
	startLine, startCol := s.pos()
	for s.ch == ' ' || s.ch == '\t' || s.ch == '\n' && !nlsemi || s.ch == '\r' {
		s.nextch()
	}

	// token start
	s.line, s.col = s.pos()
	s.blank = s.line > startLine || startCol == colbase
	s.start()
	if isLetter(s.ch) || s.ch >= utf8.RuneSelf && s.atIdentChar(true) {
		s.nextch()
		s.ident()
		return
	}

	switch s.ch {
	case -1:
		if nlsemi {
			s.lit = "EOF"
			s.tok = _Semi
			break
		}
		s.tok = _EOF

	case '\n':
		s.nextch()
		s.lit = "newline"
		s.tok = _Semi

	case '0', '1', '2', '3', '4', '5', '6', '7', '8', '9':
		s.number(false)

	case '"':
		s.stdString()
	......
}

A complete description of what these two methods do is:

  1. The lexical analyzer obtains the latest unprocessed character each time through the nextch() function
  2. Based on the character scanned, the next() function decides what kind of token it may be. For example, if the character 'a' is scanned, it tries to match an identifier via the s.ident() method called inside next() (which also checks whether the identifier is a keyword)
  3. If the scanned character is a digit, it tries to match a basic literal type (such as IntLit, FloatLit, ImagLit)
  4. next() recognizes one token and hands it to the syntax parser, which then fetches the next token by calling the lexical analyzer's next() function again. So the lexical analyzer does not translate the whole source file into tokens at once and then hand them to the syntax parser; instead, the syntax parser pulls tokens from next() one at a time, as it needs them (see the sketch after this list)
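To make this pull model concrete, here is a toy scanner with made-up names and deliberately simplified rules (identifiers, integers, and single-character operators only). It is not the compiler's code, but each call to next() consumes just enough characters to produce one token, in the same way the syntax parser pulls tokens from scanner.next():

package main

import (
	"fmt"
	"unicode"
)

// toyScanner is a simplified sketch of the pull model.
type toyScanner struct {
	src      []rune
	pos      int
	tok, lit string
}

func (s *toyScanner) next() {
	// skip spaces, tabs and newlines, as the real next() does
	for s.pos < len(s.src) && (s.src[s.pos] == ' ' || s.src[s.pos] == '\t' || s.src[s.pos] == '\n') {
		s.pos++
	}
	if s.pos == len(s.src) {
		s.tok, s.lit = "EOF", ""
		return
	}
	start, ch := s.pos, s.src[s.pos]
	switch {
	case unicode.IsLetter(ch) || ch == '_': // identifier (or keyword)
		for s.pos < len(s.src) && (unicode.IsLetter(s.src[s.pos]) || unicode.IsDigit(s.src[s.pos]) || s.src[s.pos] == '_') {
			s.pos++
		}
		s.tok, s.lit = "name", string(s.src[start:s.pos])
	case unicode.IsDigit(ch): // integer literal
		for s.pos < len(s.src) && unicode.IsDigit(s.src[s.pos]) {
			s.pos++
		}
		s.tok, s.lit = "literal", string(s.src[start:s.pos])
	default: // anything else: a single-character operator or delimiter
		s.pos++
		s.tok, s.lit = "op", string(ch)
	}
}

func main() {
	s := &toyScanner{src: []rune("count = count + 42")}
	for {
		s.next() // the "parser" pulls one token at a time
		if s.tok == "EOF" {
			break
		}
		fmt.Printf("%-8s %q\n", s.tok, s.lit)
	}
}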

We can see this line in the next() function

for s.ch == ' ' || s.ch == '\t' || s.ch == '\n' && !nlsemi || s.ch == '\r' {
		s.nextch()
	}

It filters out spaces, tabs, newlines, and so on in the source file

To see how it recognizes identifiers, number literals, and strings, you can look at the internal implementations of ident(), number(), and stdString()

Let me start with the Go compiler entry and draw a flow chart of the lexical analysis section to tie things together

If you still don't have a clear picture of the lexical analyzer after reading the source code, the following sections use the test file and the standard library provided by Go to actually run the lexical analyzer and see how it works.

Test the lexical analysis process

There are two ways to test lexical analysis: you can directly compile and execute the lexical analyzer test file provided by Go, or use the lexer standard library provided by Go

Lexical analyzer test file: src/cmd/compile/internal/syntax/scanner_test.go
Lexical analyzer standard library: src/go/scanner/scanner.go

Below I will write a source file and pass it to the lexical analyzer to see how it parses and what the results are

Test the lexical analyzer with the test file

We can directly compile and run the TestScanner method in src/cmd/compile/internal/syntax/scanner_test.go. The source code of this method (with comments) is as follows:

func TestScanner(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping test in short mode")
	}

	filename := *src_ // can be changed via -src flag
	// Here you can choose an absolute path to the source file you want to parse yourself
	src, err := os.Open("/Users/shulv/studySpace/GolangProject/src/data_structure_algorithm/SourceCode/Token/aa.go")
	if err != nil {
		t.Fatal(err)
	}
	defer src.Close()

	var s scanner
	s.init(src, errh, 0) // Initialize the lexical parser
	for {
		s.next() // Retrieve token (nextch() is called in the next function to retrieve the next character until a token is matched)
		if s.tok == _EOF {
			break
		}
		if !testing.Verbose() {
			continue
		}
		switch s.tok { // The obtained token value
		case _Name, _Literal: // An identifier or base value
			// Prints the file name, row, column, token, and the text in the token corresponding source file
			fmt.Printf("%s:%d:%d: %s => %s\n", filename, s.line, s.col, s.tok, s.lit)
		case _Operator:
			fmt.Printf("%s:%d:%d: %s => %s (prec = %d)\n", filename, s.line, s.col, s.tok, s.op, s.prec)
		default:
			fmt.Printf("%s:%d:%d: %s\n", filename, s.line, s.col, s.tok)
		}
	}
}

The test function first opens your source file and passes the contents of the source file to the lexical analyzer’s initialization function. The token is then obtained by calling next() repeatedly through an infinite loop until the _EOF terminator is reached

The file I want the lexical analyzer to parse looks like this:

package Token

import "fmt"

func testScanner() {
	a := Awesome!
	if a == Awesome! {
		fmt.Println("Learning Scanner")
	}
}

Then run the test method with the following commands (you can print more of the scanner structure's fields if you want to see more information):

# cd /usr/local/go/src/cmd/compile/internal/syntax
# go test -v -run="TestScanner"

Command output:

=== RUN   TestScanner
parser.go:1:1: package
parser.go:1:9: name => Token
parser.go:1:14: ;
parser.go:3:1: import
parser.go:3:8: literal => "fmt"
parser.go:3:13: ;
parser.go:5:1: func
parser.go:5:6: name => testScanner
parser.go:5:17: (
parser.go:5:18: )
parser.go:5:21: {
parser.go:6:2: name => a
parser.go:6:4: :=
parser.go:6:7: literal => Awesome!
parser.go:6:10: ;
parser.go:7:2: if
parser.go:7:5: name => a
parser.go:7:7: op => == (prec = 3)
parser.go:7:10: literal => Awesome!
parser.go:7:14: {
parser.go:8:3: name => fmt
parser.go:8:6: .
parser.go:8:7: name => Println
parser.go:8:14: (
parser.go:8:15: literal => "Learning Scanner"
parser.go:8:33: )
parser.go:8:34: ;
parser.go:9:2: }
parser.go:9:3: ;
parser.go:10:1: }
parser.go:10:2: ;
--- PASS: TestScanner (0.00s)
PASS
ok  	cmd/compile/internal/syntax	0.007s

Test the lexical analyzer against the standard library

Another way to test is through the standard library provided by Go. Below I show how to test the lexical analyzer using the methods in the standard library.

You need to write code that calls a method in the standard library to implement a lexical analysis procedure, as shown in the following example:

package Token

import (
	"fmt"
	"go/scanner"
	"go/token"
)

func TestScanner1() {
	src := []byte("cos(x)+2i*sin(x) //Comment") // What I want to parse (of course you can also use a byte array of file contents)
	// Initialize scanner
	var s scanner.Scanner
	fset := token.NewFileSet() // Initialize a file set (I'll explain this below)
	file := fset.AddFile("", fset.Base(), len(src)) // Add a file to the file set
	s.Init(file, src, nil, scanner.ScanComments) // The third parameter is mode. I'm passing ScanComments, which means I need to parse comments. Normally, I don't need to parse comments
	// scan
	for  {
		pos, tok, lit := s.Scan() // is equivalent to the next() function
		if tok == token.EOF {
			break
		}
		fmt.Printf("%s\t%s\t%q\n", fset.Position(pos), tok, lit) // fset.Position(pos): get the position information
	}
}

Execute the above code and get the following result:

1:1     IDENT   "cos"
1:4     (       ""
1:5     IDENT   "x"
1:6     )       ""
1:7     +       ""
1:8     IMAG    "2i"
1:10    *       ""
1:11    IDENT   "sin"
1:14    (       ""
1:15    IDENT   "x"
1:16    )       ""
1:18    ;       "\n"
1:18    COMMENT "//Comment"

You will find that the methods used in the standard library are completely different from those used in the test file. This is because the standard library implements a separate lexical analyzer rather than reusing the Go compiler's lexer code. My understanding is that the code in the Go compiler cannot be exposed as public methods for external use; for safety, the methods in it must remain private.

If you look at the lexical analyzer implementation in the standard library, it is not quite the same as the one in the Go compiler, but the core ideas are the same (character scanning, token recognition). The difference lies in how the files to be parsed are handled. As we know, in Go multiple files make up a package, and packages are then linked into an executable file, so the multiple files belonging to one package can be regarded as Go's basic compilation unit. Therefore, the lexical analyzer provided by Go also defines FileSet and File objects to describe file sets and files:

type FileSet struct {
	mutex sync.RWMutex // protects the file set
	base  int          // base offset for the next file
	files []*File      // list of files in the order added to the set
	last  *File        // cache of last file looked up
}

type File struct {
	set  *FileSet
	name string // file name as provided to AddFile
	base int    // Pos value range for this file is [base...base+size]
	size int    // file size as provided to AddFile

	// lines and infos are protected by mutex
	mutex sync.Mutex
	lines []int // lines contains the offset of the first character for each line (the first entry is always 0)
	infos []lineInfo
}

The File here plays a role similar to the source structure embedded in the scanner structure of the Go compiler's lexical analyzer. The difference is that the Go compiler creates multiple goroutines to compile multiple files concurrently, while the standard library stores the files to be parsed in a file set: you will notice that the FileSet structure holds a one-dimensional array of the files to be parsed.

Next, let's look at the FileSet and File structures, and how token position information is calculated from them

The FileSet and File

The mapping between FileSet and File is shown in the figure below:

Image credit: Go-ast-book

The Pos type in the figure represents the subscript position of the array. Each File element in FileSet corresponds to an interval of the underlying array. There is no intersection between different files, and there may be fill space between adjacent files

Each File mainly consists of the file name, base and size. base corresponds to the Pos index position of the File in the FileSet, so base and base+size define the start and end positions of the File in the FileSet array. Inside each File a position can be located with an offset, and offset + File.base converts an in-file offset to a global Pos. Because Pos is a global offset within the FileSet, you can also go the other way and use a Pos to find the corresponding File and the offset inside it.

The position information of each token produced by lexical analysis is defined in terms of Pos, and the corresponding File can easily be found through Pos and the corresponding FileSet. The line and column numbers are then calculated from the source file corresponding to that File and the offset (in the implementation, File only stores the offset of the start of each line and does not contain the original source data). The underlying type of Pos is int, with semantics similar to a pointer, so 0 is defined as NoPos, indicating an invalid Pos.
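A small sketch with the exported go/token package shows this base/offset arithmetic in action (the file names and sizes are made up):

package main

import (
	"fmt"
	"go/token"
)

func main() {
	fset := token.NewFileSet()
	// Each file gets its own [base, base+size] interval inside the set.
	f1 := fset.AddFile("a.go", fset.Base(), 20)
	f2 := fset.AddFile("b.go", fset.Base(), 30)
	fmt.Println("bases:", f1.Base(), f2.Base()) // with these sizes: bases: 1 22

	// A file-local offset becomes a set-wide Pos, and a Pos resolves back
	// to a file name, line and column.
	p := f2.Pos(5)                                // offset 5 inside b.go
	fmt.Println("pos:", int(p), fset.Position(p)) // pos: 27 b.go:1:6
}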

Source: go-ast-book

From the relationship between FileSet and File, it can be seen that the lexical analyzer in Go standard library parses multiple source files in this way of File sets

Conclusion

This article started from the Go compiler's entry file and gradually introduced the implementation of the lexical analysis part of the Go compiler's source code, and then actually tested and used the lexical analyzer through the test file and the lexer standard library provided by Go. I hope that after reading it you have a clear picture of Go's lexical analysis process.

The lexical analysis part is relatively simple and involves little core content. The really difficult parts are syntax analysis and the abstract syntax tree; if you are interested, please stay tuned.