Abstract: This paper introduces a word segmentation technology based on state machine, designs a state machine that can recognize the word segmentation in HTML string, and makes a detailed analysis of the running process of the state machine.

Welcome to “Algorithms and the Beauty of Programming” ↑ pay attention to us!

This article was first published on the wechat official account “Beauty of Algorithms and Programming”. Welcome to follow and learn more about this series of blogs in time.

In the previous my.oschina.net/gschen/blog… We have introduced the basic concepts, design ideas and implementation of state machines based on Java language. We hope you can understand the relevant knowledge points. The state machine is the key technology that we will use to generate DOM trees, and the next few blogs will be based on state machines. So please be sure to master, if there are still unclear places can read the previous article, you are also very welcome to pay attention to the wechat public number, timely understanding of the latest article.

First, let’s review our task. Given an HTML file, generate a DOM tree for that file. An HTML file is a string written using HTML rules, such as “< HTML ><body>hello</body></ HTML >”.

So how do you generate a DOM tree from such a string?

One word

We call the first stage word segmentation. To eventually generate a DOM tree, we do not process the HTML string character by character, but first, then word by word.

We define the following types of participles and call each one a Token.

StartTag: the StartTag, such as < HTML >.

EndTag: indicates the EndTag such as </ HTML >.

Character: Content between StartTag and EndTag such as Hello.

EOF: Indicates the end flag.

For the string “< HTML ><body>hello</body></ HTML >”, we get the following participles:

StartTag:<html>

StartTag:<body>

StartTag:

Character:hello

EndTag:

EndTag:</body>

EndTag:</html>

Now that we know what we want to achieve, how do we convert HTML strings into participles?

Use state machine to realize word segmentation

In our last blog, we introduced the basic knowledge of state machine. In this section, we will introduce a word segmentation technology based on state machine.

The above picture is a state machine designed for the segmentation of the first section. This state machine eliminates many error handling processes in order to make our description more concise and focus on the core problem. Let’s take a look at how the state machine is designed.

DATA: indicates the start status.

When the input character ‘-1’ indicates that the end of the string has been reached, a participle EOF is recognized.

When the input character is ‘<‘, the TagOpen state is entered.

If the input Character is another Character, a participle Character is identified at this time.

TagOpen:

Enter the EndTagOpen state when the input character is ‘/’.

Checks if the current character is a letter, if so, enter TagName.

EndTagOpen:

Checks if the current character is a letter, if so, enter TagName.

TagName:

This state will read all letters at once until a ‘/’ or ‘>’ is encountered.

The next input character is ‘>’, an HTML tag has been recognized and DATA will be entered.

We will use an example to illustrate the state machine in action.