This article first appeared on my blog. If you find it useful, feel free to like and bookmark it so more people can see it.

Author: Adam Presley | Original: adampresley.github.io/2015/05/12/…

Translator's preface

This article is a detailed walkthrough of the lexer implementation. If you have difficulty following along, I recommend reading the source code side by side with the code snippets that illustrate the ideas here. Parsing will be covered in the next article.

I recently took a quick look at the Go source code; the src/go directory contains several packages, of which token, scanner, and parser are probably the core of Go's own lexing and parsing implementation.

I have been juggling a lot of work recently, so updates may not come as fast as I would like. Besides this series, I have posted links to other related articles below; if your English is good, you can read them on your own.

A look at Go lexer/scanner packages

Rob Pike’s Functional Way

Handwritten Parser & Lexers In Go

The translation is as follows:


In the first article of this series, I introduced the basic concepts of lexical analysis and parsing, and the basic structure of an INI file. We then created some related structs and constants to help implement the INI text parser that follows.

This article will actually dive into the details of lexical analysis.

Lexing refers to the process of converting input text into a series of tokens. Tokens are smaller units than text, and it is possible to combine them to produce meaningful content, such as programs, configuration files, and so on.

In the INI files in this series, the tokens include the open bracket, close bracket, SectionName, Key, Value, and equal sign. Combine them in the right order and you have an INI file. The lexer is responsible for reading the contents of the INI file, analyzing it to create tokens, and sending those tokens to the parser through a channel.
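For example, a tiny (made-up) INI file like this:

[settings]
name=adam

would be broken into roughly this token stream, using the token names defined in the previous article:

TOKEN_LEFT_BRACKET  "["
TOKEN_SECTION       "settings"
TOKEN_RIGHT_BRACKET "]"
TOKEN_KEY           "name"
TOKEN_EQUAL_SIGN    "="
TOKEN_VALUE         "adam"
TOKEN_EOF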

Lexical analyzer

To implement the text-to-token conversion, we need to track information such as the input text, the position we are currently analyzing, and the start and end positions of the Token currently being built.

Once a Token is complete, we send it to the parser, which we can do through a channel.

We also need a function type to track the state of the lexer. Rob Pike has talked about using functions to track the current and expected state of a lexer. Simply put, a state function processes a Token and returns the next state function, which will produce the next expected Token. So I will translate the term as "state function".
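To make this concrete, here is a minimal sketch of how such state functions could be driven. This Run helper is my own illustration, not code from the original article; the Lexer and LexFn types it uses are defined just below.

/*
A minimal sketch, not from the original article: keep calling the
current state function until one returns nil, then close the channel.
*/
func (this *Lexer) Run() {
    for state := this.State; state != nil; {
        state = state(this)
    }
    close(this.Tokens)
}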

Let's take an example!

A Section in an INI file consists of three parts: an open bracket, a SectionName, and a close bracket. The first state function emits an open-bracket Token and returns the state function for the SectionName; that function handles the SectionName and returns the state function for the close bracket. The general order is open bracket -> section name -> close bracket.

A picture is worth a thousand words, so let's look at the structure of the lexer:

Lexer.go

type Lexer struct {
  Name   string
  Input  string                // The input text being lexed
  Tokens chan lexertoken.Token // The channel used to send tokens to the parser
  State  LexFn                 // The state function mentioned above

  Start int // Start position of the current Token; the end position is Start + len(token)
  Pos   int // Position the lexer has processed up to; when a Token is confirmed, this is its end position
  Width int // Width of the last rune read
}

LexFn.go

type LexFn func(*Lexer) LexFn // The lexer state-function type: it processes the current Token and returns the state function for the next expected Token

In the previous article, we defined the Token structure. LexFn is the state-function type used to process tokens.
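For reference, the Token sent over the channel looks roughly like this, reconstructed from the Emit method below, which sets a Type and a Value:

// Sketch of the Token struct from the previous article, in package lexertoken.
type Token struct {
    Type  TokenType
    Value string
}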

Now let's add some capabilities to our lexer. The Lexer works on text, and to get the next Token we give it methods for reading runes, skipping whitespace, and other useful operations. They are basically simple ways of manipulating text.

/*
Puts a token onto the token channel. The value of this token is
read from the input based on the current lexer position.
*/
func (this *Lexer) Emit(tokenType lexertoken.TokenType) {
    this.Tokens <- lexertoken.Token{Type: tokenType, Value: this.Input[this.Start:this.Pos]}
    this.Start = this.Pos
}

/*
Increment the position
*/
func (this *Lexer) Inc() {
    this.Pos++
    if this.Pos >= utf8.RuneCountInString(this.Input) {
        this.Emit(lexertoken.TOKEN_EOF)
    }
}

/*
Return a slice of the input from the current lexer position
to the end of the input string.
*/
func (this *Lexer) InputToEnd() string {
    return this.Input[this.Pos:]
}

/*
Skips whitespace until we get something meaningful
*/
func (this *Lexer) SkipWhitespace() {
    for {
        ch := this.Next()

        if !unicode.IsSpace(ch) {
            this.Dec()
            break
        }

        if ch == lexertoken.EOF {
            this.Emit(lexertoken.TOKEN_EOF)
            break
        }
    }
}
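SkipWhitespace calls two helpers, Next and Dec, that are not shown in the article. Here is a plausible sketch of them, assuming lexertoken.EOF is a rune constant:

/*
Sketches, assumed rather than taken from the article: Next reads the
next rune and advances Pos; Dec steps back by the width of the last
rune read.
*/
func (this *Lexer) Next() rune {
    if this.Pos >= len(this.Input) {
        this.Width = 0
        return lexertoken.EOF
    }

    result, width := utf8.DecodeRuneInString(this.Input[this.Pos:])
    this.Pos += width
    this.Width = width
    return result
}

func (this *Lexer) Dec() {
    this.Pos -= this.Width
}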

The important thing to understand is how tokens are read and sent. It mainly involves the following steps:

First, characters are read until a complete Token can be confirmed. For example, in the SectionName state function, the Token cannot be confirmed until the closing bracket is read. Next, the Token and its type are sent to the parser through the channel. Finally, the next expected state function is determined and returned.

Let's define a start function. It is also the entry point used by the parser (covered in the next article). It initializes a Lexer and gives it its first state function.

What might be the first desired Token? A special symbol or a keyword?

In our example, the first state function has the generic name LexBegin, because an INI file can begin with a section, or with no section at all and go straight to Key/Value pairs. LexBegin takes care of this logic.

/*
Start a new lexer with a given input string. This returns the
instance of the lexer and a channel of tokens. Reading this stream
is the way to parse a given input and perform processing.
*/
func BeginLexing(name, input string) *lexer.Lexer {
    l := &lexer.Lexer{
        Name:   name,
        Input:  input,
        State:  lexer.LexBegin,
        Tokens: make(chan lexertoken.Token, 3),
    }

    return l
}
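A hypothetical usage sketch, combining BeginLexing with the Run sketch from earlier (the real driver appears in the next article):

l := BeginLexing("example.ini", "[settings]\nname=adam\n")

// Drive the state functions concurrently (Run is my sketch from
// above, not code from the article)...
go l.Run()

// ...and consume tokens as they arrive on the channel.
for token := range l.Tokens {
    if token.Type == lexertoken.TOKEN_EOF {
        break
    }
    fmt.Printf("type=%v value=%q\n", token.Type, token.Value)
}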

Start

The first state function is LexBegin.

/*
This lexer function starts everything off. It determines if we are
beginning with a key/value assignment or a section.
*/
func LexBegin(lexer *Lexer) LexFn {
    lexer.SkipWhitespace()
    if strings.HasPrefix(lexer.InputToEnd(), lexertoken.LEFT_BRACKET) {
        return LexLeftBracket
    } else {
        return LexKey
    }
}

As you can see, the first step is to skip all whitespace; in INI files, whitespace is meaningless. Next, we check whether the remaining text begins with an open bracket. If so, we return LexLeftBracket; if not, we return LexKey.

Section

Section processing logic.

The SectionName in an .ini file is surrounded by brackets, and Key/Value pairs can be organized under a Section. In LexBegin, the LexLeftBracket function is returned if an open bracket is found.

LexLeftBracket has the following code:

/*
This lexer function emits a TOKEN_LEFT_BRACKET then returns
the lexer for a section header.
*/
func LexLeftBracket(lexer *Lexer) LexFn {
    lexer.Pos += len(lexertoken.LEFT_BRACKET)
    lexer.Emit(lexertoken.TOKEN_LEFT_BRACKET)
    return LexSection
}

The code is simple! Advance the position by the length of the bracket (1), then send TOKEN_LEFT_BRACKET to the channel.

In this case, the Token's content doesn't matter much. When Emit completes, the start position is set to the current position, ready for the next Token. Finally, it returns the state function for handling the SectionName: LexSection.
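A quick trace with a made-up input shows how Start and Pos move:

// Input: "[settings]"
// LexLeftBracket: Pos += len("[")           -> Start=0, Pos=1
// Emit(TOKEN_LEFT_BRACKET): sends Input[0:1] == "[", then sets Start=1
// LexSection: Inc() until "]" is the prefix -> Start=1, Pos=9
// Emit(TOKEN_SECTION): sends Input[1:9] == "settings", then sets Start=9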

/*
This lexer function emits a TOKEN_SECTION with the name of an
INI file section header.
*/
func LexSection(lexer *Lexer) LexFn {
    for {
        if lexer.IsEOF() {
            return lexer.Errorf(errors.LEXER_ERROR_MISSING_RIGHT_BRACKET)
        }

        if strings.HasPrefix(lexer.InputToEnd(), lexertoken.RIGHT_BRACKET) {
            lexer.Emit(lexertoken.TOKEN_SECTION)
            return LexRightBracket
        }

        lexer.Inc()
    }
}

The logic is a little more complicated, but the basic logic is the same.

The function iterates through the characters until a RIGHT_BRACKET is encountered, which confirms the end position of the SectionName. If EOF is encountered first, the INI file is malformed, so we emit an error message and send it to the parser via the channel. Otherwise, the loop runs until the closing bracket is found, and then TOKEN_SECTION is sent with the corresponding text.
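LexSection relies on two helpers the article doesn't show, IsEOF and Errorf. A plausible sketch, where TOKEN_ERROR is an assumed token type:

/*
Sketches, assumed rather than taken from the article. Errorf sends the
error text to the parser as a token and returns nil to stop lexing.
*/
func (this *Lexer) IsEOF() bool {
    return this.Pos >= len(this.Input)
}

func (this *Lexer) Errorf(format string, args ...interface{}) LexFn {
    this.Tokens <- lexertoken.Token{
        Type:  lexertoken.TOKEN_ERROR, // assumed constant
        Value: fmt.Sprintf(format, args...),
    }
    return nil
}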

The state function returned by LexSection is LexRightBracket. Its logic is similar to LexLeftBracket's, except that the state function it returns is LexBegin, because a section can be empty, or it can contain Key/Value pairs.

/*
This lexer function emits a TOKEN_RIGHT_BRACKET then returns
the lexer for a begin.
*/
func LexRightBracket(lexer *Lexer) LexFn {
    lexer.Pos += len(lexertoken.RIGHT_BRACKET)
    lexer.Emit(lexertoken.TOKEN_RIGHT_BRACKET)
    return LexBegin
}

Key/Value

Let's move on to Key/Value processing. The form is very simple: Key = Value.

First comes Key processing, which is similar to LexSection: it loops until it hits an equal sign, which confirms a complete Key. Then it calls Emit to send the Key and returns the state function LexEqualSign.

/*
This lexer function emits a TOKEN_KEY with the name of a
key that will be assigned a value
*/
func LexKey(lexer *Lexer) LexFn {
    for {
        if strings.HasPrefix(lexer.InputToEnd(), lexertoken.EQUAL_SIGN) {
            lexer.Emit(lexertoken.TOKEN_KEY)
            return LexEqualSign
        }

        lexer.Inc()
        if lexer.IsEOF() {
            return lexer.Errorf(errors.LEXER_ERROR_UNEXPECTED_EOF)
        }
    }
}

The equal sign is very simple, similar to the brackets: send a TOKEN_EQUAL_SIGN directly to the parser and return LexValue.

/*
This lexer function emits a TOKEN_EQUAL_SIGN then returns
the lexer for value.
*/
func LexEqualSign(lexer *Lexer) LexFn {
    lexer.Pos += len(lexertoken.EQUAL_SIGN)
    lexer.Emit(lexertoken.TOKEN_EQUAL_SIGN)

    return LexValue
}

The final state function is LexValue, which handles the Value part of a Key/Value pair. A Value is confirmed complete when a newline character is encountered. It then returns the state function LexBegin to continue the next round of analysis.

/*
This lexer function emits a TOKEN_VALUE with the value to be assigned
to a key.
*/
func LexValue(lexer *Lexer) LexFn {
    for {
        if strings.HasPrefix(lexer.InputToEnd(), lexertoken.NEWLINE) {
            lexer.Emit(lexertoken.TOKEN_VALUE)
            return LexBegin
        }

        lexer.Inc()

        if lexer.IsEOF() {
            return lexer.Errorf(errors.LEXER_ERROR_UNEXPECTED_EOF)
        }
    }
}

What's next

In Part 3, the final article in this series, we'll create a basic parser that processes the tokens received from the lexer into the structured data we expect.