Abstract: This paper introduces a word segmentation technology based on state machine, designs a state machine that can recognize the word segmentation in HTML string, and makes a detailed analysis of the running process of the state machine.

Welcome to “Algorithms and the Beauty of Programming” ↑ pay attention to us!

This article was first published on the wechat official account “Beauty of Algorithms and Programming”. Welcome to follow and learn more about this series of blogs in time.

In the previous my.oschina.net/gschen/blog… We have introduced the basic concepts, design ideas and implementation of state machines based on Java language. We hope you can understand the relevant knowledge points. The state machine is the key technology that we will use to generate DOM trees, and the next few blogs will be based on state machines. So please be sure to master, if there are still unclear places can read the previous article, you are also very welcome to pay attention to the wechat public number, timely understanding of the latest article.

First, let’s review our task. Given an HTML file, generate a DOM tree for that file. An HTML file is a string written using HTML rules, such as “< HTML ><body><p>hello</p></body></ HTML >”.

So how do you generate a DOM tree from such a string?

One word

We call the first stage word segmentation. To eventually generate a DOM tree, we do not process the HTML string character by character, but first, then word by word.

We define the following types of participles and call each one a Token.

StartTag: the StartTag, such as < HTML >.

EndTag: indicates the EndTag such as </ HTML >.

Character: Content between StartTag and EndTag such as Hello.

EOF: Indicates the end flag.

For the string “< HTML ><body><p>hello</p></body></ HTML >”, we get the following participles:

StartTag:<html>

StartTag:<body>

StartTag:<p>

Character:hello

EndTag:</p>

EndTag:</body>

EndTag:</html>

Now that we know what we want to achieve, how do we convert HTML strings into participles?

Use state machine to realize word segmentation

In our last blog, we introduced the basic knowledge of state machine. In this section, we will introduce a word segmentation technology based on state machine.

 

The above picture is a state machine designed for the segmentation of the first section. This state machine eliminates many error handling processes in order to make our description more concise and focus on the core problem. Let’s take a look at how the state machine is designed.

DATA: indicates the start status.

When the input character ‘-1’ indicates that the end of the string has been reached, a participle EOF is recognized.

When the input character is ‘<‘, the TagOpen state is entered.

If the input Character is another Character, a participle Character is identified at this time.

TagOpen:

Enter the EndTagOpen state when the input character is ‘/’.

Checks if the current character is a letter, if so, enter TagName.

EndTagOpen:

Checks if the current character is a letter, if so, enter TagName.

TagName:

This state will read all letters at once until a ‘/’ or ‘>’ is encountered.

The next input character is ‘>’, an HTML tag has been recognized and DATA will be entered.

 

We will use an example to illustrate the state machine in action.

<html><body><p>hello</p></body></html>

The current input is ‘<‘, which will enter TagOpen.

 <html><body><p>hello</p></body></html>

TagOpen detects that the next character is the letter ‘h’ and enters TagName.

TagName reads all the letters “HTML” at once, then identifies a StartTag and completes the token identification. The next input is ‘>’ to enter DATA.

This is the process for identifying a StartTag. Body is similar to P, so I won’t repeat it.

 <html><body><p>hello</p></body></html>

The above process has identified three StartTag participles, which are < HTML >, <body>, and < p>.

The current state is DATA and the next input character is ‘h’, all the letters ‘hello’ are read at once. Identify a word segmentation Character and mark the end of the word segmentation.

 <html><body><p>hello</p></body></html>

The current state is DATA and the next input character is ‘<‘, enter TagOpen.

 <html><body><p>hello</p></body></html>

The current status is TagOpen and the next input character is ‘/’. The current tag type is EndTagOpen.

TagName is entered because the next character is a letter.

 <html><body><p>hello</p></body></html>

The TagName state keeps reading letters. When the character ‘>’ is encountered, mark it as a participle and enter DATA.

3 summary

This paper introduces a word segmentation technology based on state machine, designs a state machine that can recognize the word segmentation in HTML string, and makes a detailed analysis of the running process of the state machine.

Subsequent DOM tree generation will be based on the above segmentation. In the next lecture, we’ll show you how to implement such a word segmentation technique using the Java language. If you have any questions in the process of reading this blog, please leave a comment below.

All of the code in this article can be found in the day02 module of the following Git library with the git address:

Gitee.com/gschen/sctu…

Those of you who are interested can read the code in advance.