Welcome to “Algorithms and the Beauty of Programming” ↑ pay attention to us!

\

This article was first published on the wechat official account “Beauty of Algorithms and Programming”. Welcome to follow and learn more about this series of blogs in time.

\

DOM(Document Object Model) is a very important data structure with a wide range of uses.

\

For the browser’s rendering engine, the HTML string needs to be converted into a DOM tree, which is then converted into a render tree, and finally rendered. For data collection, it is often necessary to parse HTML documents that have been downloaded, and the premise of such parsing is to generate the DOM tree of the HTML document before parsing.

\

This series of blogs will introduce you to a state machine-based DOM tree generation technique, from the initial HTML document to the final DOM tree.

\

In this lecture we will introduce you to the basics of state machines.

1 Introduction to the state machine

 

Figure 1-1 Shows an example of a state machine

Let’s start with a simple example of what a state machine is, with four circles representing four states: state 0, state 1, state 2, and state 3. State 0 indicates the start state, and states 2 and 3 indicate the end state. The start state indicates where the task starts, and the end state will no longer accept any input.

State 0: When the user enters the character ‘a’, it enters state 1 and other characters enter state 3.

State 1: The user can continuously enter character ‘b’, but will remain in state 1. When the user enters character ‘a’, it will enter state 2, and other characters will enter state 3.

State 2: Terminates. No more input is accepted.

State 3: Terminates. No more input is accepted.

 

 

The above is a state machine. In a state machine, there are multiple states. In general, there is a starting state, multiple intermediate states and termination states. You can take different inputs for each state, either stay in the current state or move to the next state.

 

This state machine can accept any abBBB * A mode string such as: ABA, ABBA, abbbA, etc.

 

The use of state machine is very wide, not only in regular expression, compiler and many other fields play an important role.

 

2 state machine implementation

With that in mind, this section shows you how to implement a state machine in the Java language.

 

To implement the above state machine, we have the following three tasks to complete:

 

(1) Control of string input.

We need to define a class that can easily control the entire input string, moving it one character at a time.

This class is very simple and looks like this:

public class CharacterReader { private Stringcontent; private int pos; public CharacterReader(String content) { this.content= content; pos =0; } public char consume() { return content.charAt(pos++); }}

 

For any given string such as “abbba”, we store it in the content property and define a POS property that marks the character position being processed. After a character is processed, the position is moved one bit, starting at 0.

 

CharacterReaderreader =newCharacterReader(“abbba”);

 

(2) Definition of various states.

In this case, we have four states, so we need to define four state classes StateZero, StateOne, StateTwo, and StateThree.

The processing logic of each state class is similar. The basic business processing logic is as follows:

Accepts a character, and then determines whether to make a state transition.

 

So we can abstract a State base class, State, in which we can define a common abstract method, as follows:

public enum State {

abstract void read(StateControllercontroller, CharacterReader reader);

}

 

We use an enumeration type to define the base class State, and we define an abstract method read that takes two arguments:

The first is the state control logic controller, through which we can quickly transfer the state;

The second is an encapsulation of the input string, and the consume method is a quick way to get the current character.

 

Let’s look at the implementation of StateZero:

StateZero { @Override void read(StateController controller,CharacterReader reader) { char ch = reader.consume(); switch (ch) { case ‘a’: controller.transition(StateOne); break; default: controller.transition(StateThree); break; }}}

 

The code above is very easy to understand. From the state transition diagram in section 1, we know that at state 0, the input character ‘A’ is moved to state 1 and the other characters are moved to state 3. So you can see that this code is pretty straightforward.

 

StateOne is similar to StateZero, and I won’t repeat it. The implementation of the termination state is described next.

 

StateTwo {

@Override

void read(StateController controller, CharacterReaderreader) {



controller.setTerminated(true);

System.out.println(“Match!”);

}

},

StateThree{

@Override

void read(StateController controller,CharacterReader reader) {



controller.setTerminated(true);

System.out.println(“UnMatch!”);

}

};

 

States 2 and 3 are terminated and no longer accept any input, so just set the controller terminated to true.

 

 

(3) State control logic.

This task is the heart and soul of our state machine implementation. We control all processes by defining a state control logic class, StateController.

public class StateController {



private CharacterReaderreader;

private Statestate= State.StateZero;


private boolean isTerminated=false;



public StateController(CharacterReader reader) {

this.reader= reader;

}

\

Public void Transition (State State){this. State = State; } public void run() { while (! isTerminated) { state.read(this,reader); } } public void setTerminated(booleanterminated) { isTerminated = terminated; }}

The state property holds the state of the current state machine. The initial state is StateZero. The isTerminated property indicates whether the current state machine has entered the terminated state.

 

The core code is:

while (! isTerminated) { state.read(this,reader);

}

\

Continuously checks whether the current state of the state machine is terminated, and if not, continuously accepts input characters from reader to enter the state machine.

 

Finally, the test code:

public class StateTest {

public static void main(String[] args) {



CharacterReader reader = new CharacterReader(“abbba”);



StateController controller = new StateController(reader);

controller.run();

}

}

 

 

To sum up, the design of a state machine has been completed, which can be summarized into three aspects: input control, various state class definitions, and state control logic.

 

3 summary

 

This paper introduces the basic knowledge of the state machine through a simple case, and gives the basic idea of designing the state machine and relevant codes, laying a good foundation for the subsequent explanation of DOM tree generation based on the state machine.

\

\

All of the code in this article can be found in the day01 module of the following Git library with the git address:

\

Gitee.com/gschen/sctu…

\

\

Please continue to follow the “Beauty of Algorithms and Programming” wechat official account to learn more.

\