Hello, I am the smiling snail, 🐌.

This article mainly describes the HTML parsing process, the implementation of a small HTML parser.

HTML rules

HTML contains a series of tags, either individually or nested. At the same time, the tag can also carry attribute data.

Such as:

<html>
    <body>
        <h1>Title</h1>
        <div id="main" class="test"> <p>Hello <em>world</em>! </p> </div> </body> </html>Copy the code
  • <h1>Is a single tag,<div>Contains child tags<p>.
  • id="main", it is<div>Attributes in the tag.
  • h1Tags within the"Title"Belongs to the text.

If you want to fully implement HTML parsing, there are many scenarios to consider. For simplicity, we only implement the following basic functions:

  • Full matching tag parsing, for example<div></div>, but<div/>This is not supported.
  • Parsing of attributes within a tag.
  • Parsing of text nodes.

Others, such as comments, character encodings, docType declarations, and so on, are not supported.

HTML is essentially text, and our goal is to parse tags and attributes out of that text in structured data output.

The data structure

First, you need to define an appropriate data structure that clearly describes the output.

According to the minimal function description above, we only need to finish parsing tags, attributes, and text. Note that tags can be nested, and only tags have attributes.

Here, we abstract the label/text as a node, which can contain a series of data, such as child nodes and types.

Thus, the data structure of the node can be defined:

/ / the node
public struct Node {
    / / child nodes
    public let children: [Node]
    
    // Node type
    public let nodeType: NodeType
}
Copy the code

Node type, divided into element node and text node, using enumeration type and associated data, defined as follows:

// Node type
public enum NodeType {
    case Element(ElementData)
    case Text(String)
}
Copy the code

ElementData The data associated with an element type, including tag names and attributes, is defined as follows:

public typealias AttrMap = [String:String]

// Tag element structure
public struct ElementData {
    / / tag name
    public let tagName: String
    
    / / property
    public let attributes: AttrMap
}
Copy the code

If you have trouble looking at the code, you can also look at the data structure definition in the figure below.

With the main data structure defined, let’s begin parsing the label.

But don’t worry. Before parsing, you need to do some preparatory work to implement a text scanning utility class, since CSS parsing uses the same scanning code.

Text scan utility class

The SourceHelper class is a simple utility class designed to assist with text scanning. It includes the following functions:

  • Read the next character
  • Whether to start with xx
  • Whether to the end of the string
  • Consuming a single character
  • Circularly consumes characters if the character meets the specified criteria
  • Skip whitespace

Its structure is defined as follows, including cursor and source string fields.

public struct SourceHelper {
    // Position cursor
    var pos: String.Index
    
    // Source string
    var input: String
}
Copy the code

Now, let’s take a quick look at the function implementation.

  1. Read the next character

Only reads, but the cursor does not move.

// Returns the next character, the cursor does not move
func nextCharacter() -> Character {
    return input[pos]
}
Copy the code
  1. Check whether the string starts with xx
// Whether to start with s
func startsWith(s: String) -> Bool {
    returninput[pos...] .starts(with: s) }Copy the code
  1. Whether the cursor moves to the end
// Is it the end
func eof() -> Bool {
    return pos >= input.endIndex
}
Copy the code
  1. Read a single character and move the cursor
// Consume character, cursor +1
mutating func consumeCharacter() -> Character {
    let result = input[pos]
    pos = input.index(after: pos)
    return result
}
Copy the code
  1. Circularly consumes characters when they meet the specified condition
// If the test condition is met, the loop consumes characters and returns a string that meets the condition
mutating func consumeWhile(test: (Character) -> Bool) -> String {
    var result = ""
    while !self.eof() && test(nextCharacter()) {
        result.append(consumeCharacter())
    }
    return result
}
Copy the code
  1. Skip whitespace
// Skip whitespace characters
mutating func consumeWhitespace() {
    _ = consumeWhile { (char) -> Bool in
        return char.isWhitespace
    }
}
Copy the code

The node analytical

Nodes are divided into label nodes and text nodes. The signature of the tag node is obvious, beginning with <.

Therefore, we can use it as a marker. When a < is encountered while scanning text, it is considered a parsing tag; The rest of it is just parsing text.

Then, we can construct a rough way of parsing:

// Parse the node
mutating func parseNode() -> Node {
    switch sourceHelper.nextCharacter() {
    // Parse the tag
    case "<":
        return parseElement()
    default:
        // Parses text by default
        return parseText()
    }
}
Copy the code

Label parsing

First, consider the simplest case, parsing a single tag, such as

.

Tag information includes tag names and attributes. Here attribute resolution is put aside for the moment and only tag names are extracted.

To extract the tag name, just extract the string between <>. However, pay attention to tag closing pairs.

The analysis process is divided into the following steps:

  • Ensure that labels are<Start.
  • Extract the label name until encountered>The end.
  • Ensure that labels are</Closed.
  • Extract the trailing label name, making sure it matches the preceding label.<div></p>This is not a valid label becausepdivDon’t match.
  • Finally, make sure that the trailing character is>The end.
// Parse the tag
// <div></div>
mutating func parseElement() -> Node {
    // Make sure to start with <
    assert(self.sourceHelper.consumeCharacter() == "<")
    
    // Parse the tag, tagName = div
    let tagName = parseTagName()
    
    // Make sure the tag ends with >
    assert(self.sourceHelper.consumeCharacter() == ">")
    
    // Make sure the tag is closed </
    assert(self.sourceHelper.consumeCharacter() == "<")
    assert(self.sourceHelper.consumeCharacter() == "/")

    // Make sure that the name of the closed label is the same as that of the previous label
    let tailTagName = parseTagName()
    assert(tagName.elementsEqual(tailTagName))
    
    // Make sure to end with >
    assert(self.sourceHelper.consumeCharacter() == ">")... }Copy the code

The complete process of element parsing is shown in the figure below:

Tag name resolution

A valid label name, a combination of letters and numbers.

This is where the consumeWhile method in the assistive tool comes in handy. If a character is a letter or digit, the system traverses the character backward until the character does not meet the requirements.

// Parse the tag name
mutating func parseTagName() -> String {
    // Tag name, a-z, A-z,0-9 combination
    return self.sourceHelper.consumeWhile { (char) -> Bool in
        char.isLetter || char.isNumber
    }
}
Copy the code

Attribute resolution

There may also be attribute data in the tag, such as

, id, class are attribute information.

Property definitions have their own rules: separated by =, property names on the left and property values on the right.

As you can see from the chestnut, the definition of the attribute is located after the tag name. Therefore, after extracting tags, attributes can be parsed and stored as maps.

{“id”: “p1”, “class”: “c1”};

  1. First, let’s look at parsing individual attributes.
  • Attribute names are resolved, and rules for valid attribute names apply to label names.
  • Make sure you have an equal sign in the middle.
  • Parses the property value with the value""or' 'The parcel.
// Parse a single attribute
mutating func parseAttribute() -> (String, String) {
    / / the property name
    let name = parseTagName()
    
    // Middle equals
    assert(self.sourceHelper.consumeCharacter() == "=")
    
    / / property values
    let value = parseAttrValue()
    
    return (name, value)
}
Copy the code

For attribute values, make sure they are properly formatted and that quotes are paired.

  • Make sure characters are in double quotes"Or single quotation marks'At the beginning
  • Retrieves the string between quotes when close quotes are encountered
  • Judge quote pairs
// Parse the attribute value, and end with "or '
mutating func parseAttrValue() -> String {
    let openQuote = self.sourceHelper.consumeCharacter()
    
    // Start with single or double quotation marks
    assert(openQuote == "\" " || openQuote == "'")
    
    // Retrieve the characters between quotes
    let value = self.sourceHelper.consumeWhile { (char) -> Bool in
        char! = openQuote }// Pair quotes
    assert(self.sourceHelper.consumeCharacter() == openQuote)
    
    return value
}
Copy the code
  1. Multiple attribute resolution, using a loop can be encountered>Exit the loop.
// Parse the attributes to return the map
mutating func parseAttributes() -> AttrMap {
    var map = AttrMap()
    
    while true {
        self.sourceHelper.consumeWhitespace()
        
        // If you reach the end of the opening tag, end
        if self.sourceHelper.nextCharacter() == ">" {
            break
        }
        
        // Parse attributes
        let (name, value) = parseAttribute()
        map[name] = value
    }

    return map
}
Copy the code

Parallel label parsing

How do you parse when there are multiple side tags?

<div></div>
<div></div>
Copy the code

Same routine, loop parsing can return node array.

// loop through the parse node
mutating func parseNodes() -> [Node] {
    var nodes: [Node] = []
    while true {
        self.sourceHelper.consumeWhitespace()
        
        // "
      
      , after parsing < HTML >, parseNodes is called again to parse child tags
             // The string is 
      .
        if self.sourceHelper.eof() || self.sourceHelper.startsWith(s: "< /") {
            break
        }
        
				 // Parse a single node
        let node = self.parseNode()
        nodes.append(node)
    }
    
    return nodes
}
Copy the code

Subtag parsing

In the case of nested tags, this is the case:

 <div><p></p></div>
Copy the code

How to parse the child tag

?

Take a look at the position of the

tag, after

. To be exact, after >. Therefore, we can parse the child tag at this time.

The child tag

is the same as normal tag parsing. After the tag is resolved, it needs to be added to the children of the parent tag, so the recursive method is adopted.

This way, we can add a new round of node resolution code after the > judgment:

// Make sure the tag ends with >
assert(self.sourceHelper.consumeCharacter() == ">")

// Resolve nested child nodes
let children = parseNodes()
Copy the code

The process is as follows:

Finally, synthesize the label name, attribute and child node list obtained previously to generate the parent node:

// Generate element nodes
let node = Node(tagName: tagName, attributes: attributes, children: children)
return node
Copy the code

And child tags can contain child tags, parsing will be a problem?

The answer is no.

Because every time a label is parsed, sublabel parsing is done, and sublabel parsing is a recursive call to label parsing. No matter how many layers are nested, it will be resolved.

Text parsing

Text parsing is relatively simple, just extract the characters in the middle of the label. For example,

hello

extracts Hello.

As mentioned above, after the > judgment, a new round of child node parsing begins.

In the case of

hello

, it scans from character h. Finally, the method resolved by a single node is called.

mutating func parseNode() -> Node {
    switch sourceHelper.nextCharacter() {
    // Parse the tag
    case "<":
        return parseElement()
    default:
        // Parses text by default
        return parseText()
    }
}
Copy the code

Eventually, no doubt, it will find its way into parsing text methods.

So, for Hello

, you simply start with the character H and traverse the character backwards until you hit the closing tag <. That’s how you extract Hello.

// Parse the text
mutating func parseText() -> Node {
    

HHH

let text = self.sourceHelper.consumeWhile(test: { (char) -> Bool in char! ="<" }) return Node(data: text) } Copy the code

Because starting with H, it’s a new round of node resolution. So, the text node Hello exists as a child of the P tag.

The output of the dom

After the node is parsed, the returned data is an array, but we only need the root node.

If there is only one element, return it directly; If there are multiple elements, wrap one layer in HTML and return, doing some compatibility here.

mutating public func parse(input: String) -> Node {
    
    // Set the input source
    sourceHelper.updateInput(input: input)
    
    // Parse the node
    let nodes = self.parseNodes()
    
    // If there is only one element, return it directly
    if nodes.count == 1 {
        return nodes[0]}// Wrap one layer with HTML tags and return
    return Node(tagName: "html", attributes: [:], children: nodes)
}
Copy the code

The full code can be viewed at Github.

Test output

let input = """   

Title

main" class="test">

Hello world!

"
"" var htmlParser = HTMLParser() let dom = htmlParser.parse(input: input) print(dom) Copy the code

Now we can test the HTML parser and see the output.

conclusion

This article has completed a minimalist HTML parser, mainly about the HTML tag, child tag, attribute, text node parsing, and output DOM structure.

Stay tuned for the next article, which will cover CSS parsing

The resources

  • Limpet.net/mbrubeck/20…
  • www.screaming.org/blog/2014/0…
  • Github.com/silan-liu/t…