We will develop a small but complete Swift library for processing and serializing JSON data.

Project source: github.com/swiftdo/jso…

In the previous article, Writing a JSON Parser in Swift (2), we successfully parsed JSON strings into JSON values. In this article, the third in the series, we will optimize that parser to make the parsing process clearer.

I have been studying compiler principles recently, so I will approach the process from a compilation perspective. A compiler front end usually consists of three steps:

  • Lexical analysis: splitting the program into tokens, which can be implemented by constructing a finite automaton.
  • Syntax analysis (parsing): recognizing the structure of the program and building an abstract syntax tree (AST) that a computer can easily process. This can be done by recursive descent.
  • Semantic analysis: eliminating semantic ambiguity and generating attribute information, so that the computer can generate object code from that information.
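For this series, only the first two stages matter: the tokenizer plays the role of lexical analysis and the parser the role of syntax analysis. As a rough, purely illustrative sketch (all type names here are placeholders, not from the project), the three stages can be viewed as function signatures:

```swift
// Purely illustrative placeholder types for the three stages.
struct Token {}
struct AST {}
struct AnnotatedAST {}

typealias Lexer            = (String) -> [Token]    // characters -> tokens
typealias Parser           = ([Token]) -> AST       // tokens -> abstract syntax tree
typealias SemanticAnalyzer = (AST) -> AnnotatedAST  // AST -> attributed/checked AST
```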

Implementation summary

A JSON parser is essentially a state machine derived from the JSON grammar rules, with JSON strings as input and JSON values as output. In Writing a JSON Parser in Swift (2) we started from the first character and converted the input into JSON directly according to the JSON grammar. In this article, however, the process is divided into two stages: lexical analysis and syntax analysis.

In the lexical analysis phase, the goal is to turn the string into a sequence of tokens. For example:

{
    "name": "oldbirds", "level": 8
}

Lexical analysis produces the following token sequence (note that the comma between the two key-value pairs is itself a token):

{   name   :   oldbirds   ,   level   :   8   }

The syntax analysis stage then checks whether the token sequence forms legal JSON. Because JSON is easy to parse, at this step we can build the resulting JSON value directly instead of an intermediate syntax tree.

Implementing lexical analysis

The goal of lexical analysis is to split a JSON string into a token stream according to the lexical rules.

We first need to define the token type. When the lexical analyzer reads a word that matches one of the data types specified by JSON, it generates a token for that word. From the definition of JSON we can therefore derive the following token types:

// Equatable so tokens can be compared with == / != during parsing
enum JsonToken: Equatable {
    case objBegin         // {
    case objEnd           // }
    case arrBegin         // [
    case arrEnd           // ]
    case null             // null
    case number(String)   // 1 | 2 | -3.0
    case string(String)   // "a"
    case bool(String)     // true | false
    case sepColon         // :
    case sepComma         // ,
}
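Since number, string, and bool carry associated values, individual tokens are typically inspected with if case / case let pattern matching. A quick illustration using a hand-written token sequence for the fragment "level": 8:

```swift
// Hand-written token stream for the fragment: "level": 8
// (illustrative only; the tokenizer below produces this automatically)
let tokens: [JsonToken] = [.string("level"), .sepColon, .number("8")]

// Extract the associated value of a token with `case let`:
if case let .string(key) = tokens[0] {
    print(key) // prints "level"
}
if case let .number(raw) = tokens[2] {
    print(raw) // prints "8"
}
```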

With the token type defined, we next implement the lexical analyzer, which turns a JSON string into a token stream:

/// Tokenizer: splits a JSON string into tokens
struct JsonTokenizer {
    
    private var input: String
    
    private var currentIndex: String.Index
    
    init(string: String) {
        self.input = string
        self.currentIndex = string.startIndex
    }
    
    /// The current character
    private var current: Character? {
        guard currentIndex < input.endIndex else { return nil }
        return input[currentIndex]
    }
    
    /// Move the index forward by one character
    private mutating func advance() {
        currentIndex = input.index(after: currentIndex)
    }
    
    /// Move the index back to the previous character
    private mutating func back() {
        currentIndex = input.index(before: currentIndex)
    }
    
    /// Produce the next token of the stream
    mutating func nextToken() throws -> JsonToken? {
        // Skip whitespace
        scanSpaces()
        guard let ch = current else { return nil }
        switch ch {
        case "{":
            advance()
            return JsonToken.objBegin
        case "}":
            advance()
            return JsonToken.objEnd
        case "[":
            advance()
            return JsonToken.arrBegin
        case "]":
            advance()
            return JsonToken.arrEnd
        case ",":
            advance()
            return JsonToken.sepComma
        case ":":
            advance()
            return JsonToken.sepColon
        case "n":
            _ = try scanMatch(string: "null")
            return JsonToken.null
        case "t":
            let str = try scanMatch(string: "true")
            return JsonToken.bool(str)
        case "f":
            let str = try scanMatch(string: "false")
            return JsonToken.bool(str)
        case "\"":
            let str = try scanString()
            advance() // skip the closing quote
            return JsonToken.string(str)
        case _ where isNumber(c: ch):
            let str = try scanNumbers()
            return JsonToken.number(str)
        default:
            throw JsonParserError(msg: "Unparsed character: \(ch) - \(currentIndex)")
        }
    }
    
    private mutating func peekNext() -> Character? {
        advance()
        return current
    }
    
    mutating func scanString() throws -> String {
        var ret: [Character] = []
        
        repeat {
            guard let ch = peekNext() else {
                throw JsonParserError(msg: "scanString error at \(currentIndex)")
            }
            switch ch {
            case "\\": // Handle escape sequences
                guard let cn = peekNext(), isEscape(c: cn) else {
                    throw JsonParserError(msg: "Invalid escape character")
                }
                ret.append("\\")
                ret.append(cn)
                // Handle a Unicode escape (\uXXXX)
                if cn == "u" {
                    try ret.append(contentsOf: scanUnicode())
                }
            case "\"": // A closing quote ends the string
                return String(ret)
            case "\r", "\n": // Raw line breaks are not allowed inside a JSON string
                throw JsonParserError(msg: "Invalid character \(ch)")
            default:
                ret.append(ch)
            }
        } while true
    }
    
    mutating func scanUnicode() throws -> [Character] {
        var ret: [Character] = []
        for _ in 0..<4 {
            if let ch = peekNext(), isHex(c: ch) {
                ret.append(ch)
            } else {
                throw JsonParserError(msg: "Malformed Unicode escape at \(currentIndex)")
            }
        }
        return ret
    }
    
    mutating func scanNumbers() throws -> String {
        let ind = currentIndex
        while let c = current, isNumber(c: c) {
            advance()
        }
        if currentIndex != ind {
            return String(input[ind..<currentIndex])
        }
        throw JsonParserError(msg: "scanNumbers error: \(ind)")
    }
    
    /// Skip spaces
    mutating func scanSpaces() {
        var ch = current
        while ch != nil && ch == " " {
            ch = peekNext()
        }
    }
    
    mutating func scanMatch(string: String) throws -> String {
        return try scanMatch(characters: string.map { $0 })
    }
    
    mutating func scanMatch(characters: [Character]) throws -> String {
        let ind = currentIndex
        var isMatch = true
        for index in 0..<characters.count {
            if characters[index] != current {
                isMatch = false
                break
            }
            advance()
        }
        if isMatch {
            return String(input[ind..<currentIndex])
        }
        throw JsonParserError(msg: "scanMatch failed to match \(characters)")
    }
    
    func isEscape(c: Character) -> Bool {
        // \" \\ \u \r \n \b \t \f
        return ["\"", "\\", "u", "r", "n", "b", "t", "f"].contains(c)
    }
    
    /// Check whether a character can appear in a number
    func isNumber(c: Character) -> Bool {
        let chars: [Character: Bool] = ["-": true, "+": true, "e": true, "E": true, ".": true]
        if let b = chars[c], b {
            return true
        }
        
        if c >= "0" && c <= "9" {
            return true
        }
        
        return false
    }

    /// Check whether a character is a hexadecimal digit
    func isHex(c: Character) -> Bool {
        return c >= "0" && c <= "9" || c >= "a" && c <= "f" || c >= "A" && c <= "F"
    }
}
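With the tokenizer in place, it can be driven directly by calling nextToken() in a loop until it returns nil. A small sketch (the printed descriptions are simply the enum cases' default debug output):

```swift
// Tokenize a small input by pulling tokens until nil is returned.
var tokenizer = JsonTokenizer(string: "{\"level\": 8}")
do {
    while let token = try tokenizer.nextToken() {
        print(token)
    }
} catch {
    print("Tokenizer error: \(error)")
}
// Expected stream: objBegin, string("level"), sepColon, number("8"), objEnd
```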

The processing logic above is the same as in Writing a JSON Parser in Swift (2), so we won't go over it again here.

Implementing syntax analysis

Syntax analysis takes the token sequence produced by the lexical stage as input and outputs a JSON object or JSON array.

// Syntax parser
struct JsonParser {
    private var tokenizer: JsonTokenizer
    
    private init(text: String) {
        tokenizer = JsonTokenizer(string: text)
    }
    
    static func parse(text: String) throws -> JSON? {
        var parser = JsonParser(text: text)
        return try parser.parse()
    }

    private mutating func parse() throws -> JSON? {
        guard let token = try tokenizer.nextToken() else {
            return nil
        }
        switch token {
        // Starts with [
        case .arrBegin:
            return try JSON(parserArr())

        // Starts with {
        case .objBegin:
            return try JSON(parserObj())

        default:
            return nil
        }
    }
}

The core of the parser implementation is parserArr and parserObj.

In parserArr, elements are separated by .sepComma tokens; when .arrEnd is encountered, the array is complete and the result is returned. Each element is itself a JSON value, so elements can be handled recursively.

private mutating func parserArr() throws -> [JSON] {
    var arr: [JSON] = []
    repeat {
        guard let ele = try parseElement() else {
            throw ParserError(msg: "parserArr parsing failed")
        }
        // Append the element
        arr.append(ele)
        
        guard let next = try tokenizer.nextToken() else {
            throw ParserError(msg: "parserArr parsing failed")
        }
        // A ] means the array is complete
        if case JsonToken.arrEnd = next {
            break
        }
        
        // If the next token is not a comma, the input does not match the JSON array grammar
        if JsonToken.sepComma != next {
            throw ParserError(msg: "parserArr parsing failed")
        }
    } while true

    return arr
}
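Since parserArr is private, arrays are parsed through the public entry point. A quick sketch (assuming the JSON type and its initializers from the earlier parts of this series):

```swift
// Parse a standalone array through the public entry point.
do {
    if let result = try JsonParser.parse(text: "[1, 2.5, true]") {
        // result wraps the elements .int(1), .double(2.5), .bool(true)
        print(result)
    }
} catch {
    print("parse error: \(error)")
}
```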

The parserObj method also uses .sepComma as a separator and completes the object when .objEnd is read. The difference from parserArr is that each element is a key-value pair.

private mutating func parserObj() throws -> [String: JSON] {
    var obj: [String: JSON] = [:]

    repeat {
        guard let next = try tokenizer.nextToken(), case let .string(key) = next else {
            throw ParserError(msg: "parserObj error, key not found")
        }
        if obj.keys.contains(key) {
            throw ParserError(msg: "parserObj error, duplicate key: \(key)")
        }
        guard let colon = try tokenizer.nextToken(), case JsonToken.sepColon = colon else {
            throw ParserError(msg: "parserObj error, missing :")
        }
        guard let value = try parseElement() else {
            throw ParserError(msg: "parserObj error, value not found")
        }
        
        obj[key] = value
        
        guard let nex = try tokenizer.nextToken() else {
            throw ParserError(msg: "parserObj error, next token does not exist")
        }
        if case JsonToken.objEnd = nex {
            break
        }
        
        if JsonToken.sepComma != nex {
            throw ParserError(msg: "parserObj error, missing ,")
        }
    } while true

    return obj
}

Finally, parseElement converts one or more tokens into the corresponding JSON value, based on the JSON definition.

private mutating func parseElement() throws -> JSON? {
    guard let nextToken = try tokenizer.nextToken() else {
        return nil
    }
    
    switch nextToken {
    case .arrBegin:
        return try JSON(parserArr())
    case .objBegin:
        return try JSON(parserObj())
    case .bool(let b):
        return .bool(b == "true")
    case .null:
        return .null
    case .string(let str):
        return .string(str)
    case .number(let n):
        if n.contains("."), let v = Double(n) {
            return .double(v)
        } else if let v = Int(n) {
            return .int(v)
        } else {
            throw ParserError(msg: "Number conversion failed")
        }
    default:
        throw ParserError(msg: "Unexpected token: \(nextToken)")
    }
}
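The number rule above can be stated on its own: a token containing "." becomes a Double, otherwise an Int. A standalone sketch (classify is a hypothetical helper for illustration, not part of the parser) also shows one limitation: exponent forms such as "1e3" contain no "." and Int("1e3") fails, so they are rejected.

```swift
// Standalone sketch of the number rule used in parseElement above.
func classify(_ n: String) -> Any? {
    if n.contains("."), let v = Double(n) {
        return v
    }
    return Int(n)
}

print(classify("10.2") as Any)  // Optional(10.2)
print(classify("8") as Any)     // Optional(8)
print(classify("1e3") as Any)   // nil
```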

Test

Finally, run the parser against a complete example:

let str = "{  \"a\": [8, 9, 10], \"c\": {\"temp\":true,\"say\":\"hello\",\"name\":\"world\"},   \"b\": 10.2}"

/// Parse with Parser 2.0
do {
    if let result2 = try JsonParser.parse(text: str) {
        print("\n\n✅ Parser 2.0 result:")
        print(prettyJson(json: result2))
    } else {
        print("\n\n❎ Parser 2.0 parsed empty")
    }
} catch {
    print("\n\n❎ Parser 2.0 error: \(error)")
}

The output:

✅ Parser 2.0 result:
{
    "a": [8, 9, 10],
    "c": {
        "name": "world",
        "say": "hello",
        "temp": true
    },
    "b": 10.2
}

Conclusion

By dividing the work into lexical analysis and syntax analysis, the parsing task is split into two stages, which greatly improves the clarity of the code. The lexical stage hides index movement and filters out characters that carry no meaning, handing the parser only meaningful tokens. The syntax stage then deals purely with recognizing JSON structure, never with cursor movement.

Compared with the approach in the previous article, the method here is undoubtedly more methodical. Although performance is somewhat lower, the clear, well-separated process is worth it, and the same approach can be applied to similar parsing problems.