The previous post in this Go notes series finished setting up the development environment. The original plan was to move on to Go syntax next, but it stalled: work was busy at the time and my energy was scattered, so I put it aside for a while.

Recently, I decided to pick the plan back up.

The first step, of course, is to understand Go's basic syntax. I had planned to write about the basics of Go coding, but a dry walk through keywords, identifiers, literals, and operators felt a bit boring.

Then it occurred to me that I had never studied lexical analysis carefully, so let's start from that angle: by dismantling the process step by step, we can see how tokens are categorized.

Overview

We know that the source code for compiled languages like Go has to be compiled and linked before it can be turned into a program that a computer can execute. The first step in this process is lexical analysis.

What is lexical analysis?

It is the process of converting source code into pre-defined tokens. To make it easier to understand, we break it down into two stages.

In the first stage, the source string is scanned and, according to predefined token rules, divided into the smallest character sequences that carry grammatical meaning, called lexemes, each of which is then classified as some type of token. At this stage, some characters may be filtered out, such as whitespace and comments.

In the second stage, each scanned lexeme is processed by an evaluator, which determines its literal value and generates the final token.

Is it a little hard to understand?

If you've never touched it before, it might not be intuitive. It looks complicated, but it's actually very simple.

A simple example

Let’s start with the classic Hello World code:

package main

import "fmt"

func main() {
    fmt.Println("Hello World")
}

We can use the source code of this example to break down the whole process of lexical analysis step by step.

What is a lexeme

I won't dwell on the theoretical concept; let's look at the effect directly.

First, take this sample code through the first phase of lexical analysis, and we’ll get something like this:

package
main
\n
import
"fmt"
\n
func
main
(
)
{
\n
fmt
.
Println
(
"Hello World"
)
\n
}

Each of the character sequences in this output is a lexeme.

How lexemes are segmented depends on the grammatical rules of the language. Besides the visible characters in the output above, newlines also carry syntactic meaning here: unlike C/C++, where statements must be terminated by semicolons, Go also allows statements to be delimited by newlines.

The process of dividing source code into lexemes is rule-based and language-dependent. Although the details differ, the rules generally work in two ways: segmentation by characters whose only role is to separate (spaces, tabs, etc.), and segmentation by lexemes that act as separators themselves, such as parentheses and dots.
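To make these two rules concrete, here is a toy splitter. It is emphatically not Go's real scanner: it ignores string literals, comments, and the syntactic role of newlines, and only knows a handful of hypothetical delimiter characters. It just demonstrates the two segmentation rules above.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitLexemes is a toy lexeme splitter illustrating the two rules:
// whitespace separates lexemes and is discarded, while delimiter
// characters like '(' both separate lexemes and are lexemes themselves.
func splitLexemes(src string) []string {
	var out []string
	var cur strings.Builder
	flush := func() {
		if cur.Len() > 0 {
			out = append(out, cur.String())
			cur.Reset()
		}
	}
	for _, r := range src {
		switch {
		case unicode.IsSpace(r):
			flush() // rule 1: whitespace splits, then disappears
		case strings.ContainsRune("(){}.", r):
			flush() // rule 2: a delimiter splits and is kept as a lexeme
			out = append(out, string(r))
		default:
			cur.WriteRune(r)
		}
	}
	flush()
	return out
}

func main() {
	fmt.Println(splitLexemes("fmt.Println(42)"))
	// → [fmt . Println ( 42 )]
}
```

A real scanner does much more, but the splitting idea is the same.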

What is a token

A token, also known as a lexical unit, consists of a name and a literal value. There is a fixed correspondence from lexeme to token, and not all tokens have literal values.

Converting the Hello World source into tokens, we get the following correspondence table.

lexeme          name       value
package         PACKAGE    "package"
main            IDENT      "main"
\n              SEMICOLON  "\n"
import          IMPORT     "import"
"fmt"           STRING     "\"fmt\""
\n              SEMICOLON  "\n"
func            FUNC       "func"
main            IDENT      "main"
(               LPAREN     ""
)               RPAREN     ""
{               LBRACE     ""
fmt             IDENT      "fmt"
.               PERIOD     ""
Println         IDENT      "Println"
(               LPAREN     ""
"Hello World"   STRING     "\"Hello World\""
)               RPAREN     ""
\n              SEMICOLON  "\n"
}               RBRACE     ""
\n              SEMICOLON  "\n"

It's a bit long because I didn't omit anything. The first column is the original lexeme, the second is the token name, and the last is the token's literal value.

As the table shows, some tokens have no literal value: for parentheses, dots, and the like, the name itself already indicates the content.
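A table like this can be generated with Go's own standard library: the go/scanner package does the scanning, and go/token defines the token types. A minimal sketch (the file name "hello.go" is arbitrary):

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// scanAll runs Go's standard scanner over src and returns each token's
// textual form and literal value, mirroring the table above.
func scanAll(src []byte) [][2]string {
	fset := token.NewFileSet()
	file := fset.AddFile("hello.go", fset.Base(), len(src))
	var s scanner.Scanner
	s.Init(file, src, nil, 0) // nil error handler, default mode (skip comments)
	var out [][2]string
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		out = append(out, [2]string{tok.String(), lit})
	}
	return out
}

func main() {
	src := []byte(`package main

import "fmt"

func main() {
	fmt.Println("Hello World")
}
`)
	for _, t := range scanAll(src) {
		fmt.Printf("%-10q %q\n", t[0], t[1])
	}
}
```

Note that Token's String method prints keywords and operators as their textual form ("package", "("), while the uppercase names in the table (PACKAGE, LPAREN) are the constant identifiers in go/token. Also note the scanner emits SEMICOLON tokens for the newlines, which is Go's automatic semicolon insertion at work.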

Token classification

Tokens can be classified into four categories: keywords, identifiers, literals, and operators. In fact, this classification is very obvious in the Go source code.

Open the source file src/go/token/token.go and you will find the following methods on the Token type.

// reports whether the token is a literal constant
func (tok Token) IsLiteral() bool { return literal_beg < tok && tok < literal_end }
// reports whether the token is an operator
func (tok Token) IsOperator() bool { return operator_beg < tok && tok < operator_end }
// reports whether the token is a keyword
func (tok Token) IsKeyword() bool { return keyword_beg < tok && tok < keyword_end }

The code is simple enough to determine the Token’s type by comparing whether it is in the specified range. The above three methods correspond to determining whether tokens are literal constants, operators, or keywords, respectively.

Er? Why is there no identifier?

Of course there is, but it is not a method on Token; it is a standalone function. As follows:

func IsIdentifier(name string) bool {
	for i, c := range name {
		if !unicode.IsLetter(c) && c != '_' && (i == 0 || !unicode.IsDigit(c)) {
			return false
		}
	}
	return name != "" && !IsKeyword(name)
}

In other words, the name of a variable, constant, function, or method cannot be a keyword, may contain only letters, underscores (_), and digits, and cannot start with a digit.

At this point the article is mostly done. But let's dig into one of these categories for a moment.

Keywords

Take keywords, for example. What are the keywords in Go?

Let's keep reading the source. Recall from the previous section how to determine whether a token is a keyword. As follows:

func (tok Token) IsKeyword() bool {
	return keyword_beg < tok && tok < keyword_end
}

Any token greater than keyword_beg and less than keyword_end is a keyword. So which tokens sit between keyword_beg and keyword_end? The code is as follows:

const (
	...
	keyword_beg // keywords
	BREAK
	CASE
	CHAN
	CONST
	CONTINUE

	...

	SELECT
	STRUCT
	SWITCH
	TYPE
	VAR
	keyword_end
	...
)

Combing through that list gives a total of 25 keywords. As follows:

break       case        chan    const       continue
default     defer       else    fallthrough for
func        go          goto    if          import
interface   map         package range       return
select      struct      switch  type        var

That's not many keywords. It is evident that…

Huh? !

If you guessed what I was going to say: yes, the Go language is simple, with very few keywords. Java has 53 keywords, two of which are merely reserved. Go doesn't even bother with reserved words; that's how confident it is.

Now that you've guessed it, I'd rather not spell it out.

Others

I won't walk through operators and literal constants one by one; the idea is exactly the same.

There are 47 operators in Go, covering assignment, bitwise, arithmetic, comparison, and more. Trust me, this all comes straight from the source code, without consulting any other material. [Insert a facepalm emoji here.]

What about literal constants?

There are five types: INT, FLOAT, IMAG, CHAR, and STRING.
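You can watch all five literal types come out of the standard scanner by feeding it one literal of each kind (the file name "lit.go" is arbitrary):

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// literalTokens scans src with Go's standard scanner and returns the
// literal-constant tokens as "NAME literal" strings.
func literalTokens(src []byte) []string {
	fset := token.NewFileSet()
	file := fset.AddFile("lit.go", fset.Base(), len(src))
	var s scanner.Scanner
	s.Init(file, src, nil, 0)
	var out []string
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		// IDENT also sits in the literal range in token.go, but identifiers
		// are handled separately, so filter it out here.
		if tok.IsLiteral() && tok != token.IDENT {
			out = append(out, tok.String()+" "+lit)
		}
	}
	return out
}

func main() {
	for _, l := range literalTokens([]byte(`42 3.14 2i 'a' "hi"` + "\n")) {
		fmt.Println(l)
	}
	// → INT 42, FLOAT 3.14, IMAG 2i, CHAR 'a', STRING "hi"
}
```

Note the scanner only requires lexically valid input, not a syntactically valid Go program, which makes it handy for small experiments like this.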

Conclusion

This article mainly shows where to find the keywords, identifiers, operators, and literals used in Go syntax; how they are ultimately used is not explained much.

Just for fun? Of course not. Because… I won’t reveal the plot to avoid embarrassment.

Further reading

How does the Go program work

Go-lexer lexical analysis

Lexical analysis

Lexical analysis