• Learning Parser Combinators With Rust
  • Original author: Bodil
  • The Nuggets translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: suhanyujie

Learning resolver Combinators through Rust – Part 3

If you haven’t seen the other articles in this series, I suggest you read them in order:

  • Learning parser combinators through Rust – Part 1
  • Learning resolver Combinators through Rust — Part 2
  • Learning resolver Combinators through Rust – Part 3
  • Learning resolver Combinators through Rust — Part 4

Decision combinator

Now that we have the building block, we need to use it to parse Spaces with one_OR_MORE and attribute pairs with zero_OR_MORE.

Actually, it’ll have to wait. We don’t want to parse Spaces first and then properties. If you consider that Spaces are optional in the absence of attributes, and we might immediately encounter > or />. But if you have an attribute, there must be a space at the beginning. Fortunately, there must also be Spaces between each attribute, and if there are more than one, let’s look at zero or more sequences that follow the attribute with one or more Spaces.

First, we need a parser for a single space. Here we can choose one of three ways.

First, we can use the match_literal parser as simply as possible, which takes a string containing only one space. Doesn’t that seem silly? Because whitespace is also the equivalent of newlines, tabs, and a lot of weird Unicode characters, which are rendered as whitespace. We’ll have to rely on Rust’s library again, and of course char has an is_whitespace method, which is also similar to its is_Alphabetic and is_Alphanumeric methods.

Second, we can write a parser that determines the resolution of any number of Spaces through is_whitespace, just like the identifier we wrote earlier.

Third, we can be smarter, and we do like to be smarter. We can write a parser any_char that returns a single char as long as there is a space in the input, and then a pred combinator that takes a parser and a decision function and combines them like this: Mr Pred (any_char, | | c c.i s_whitespace ()). This has the advantage of making our final parser easier to write: attribute values use reference strings.

Any_char can be thought of as a very simple parser, but we must remember to beware of utF-8 traps.

fn any_char(input: &str) -> ParseResult<char> {
    match input.chars().next() { Some(next) => Ok((&input[next.len_utf8()..] , next)), _ => Err(input), } }Copy the code

To our now experienced eyes, the PreD combinator came as no surprise. We call the parser, and then we call the decision function on the return value when the parser succeeds. We only really return success if the function returns true, otherwise we return as many errors as the parsing failed.

fn pred<'a, P, A, F>(parser: P, predicate: F) -> impl Parser<'a, A>
where
    P: Parser<'a, A>, F: Fn(&A) -> bool, { move |input| { if let Ok((next_input, value)) = parser.parse(input) { if predicate(&value) { return Ok((next_input, value)); } } Err(input) } }Copy the code

Write a quick test case to make sure everything is in order:

#[test]
fn predicate_combinator() {
    let parser = pred(any_char, |c| *c == 'o'); assert_eq! (Ok(("mg".'o')), parser.parse("omg")); assert_eq! (Err("lol"), parser.parse("lol"));
}
Copy the code

For both of these, we can write our Whitespace_char parser in a quick one-line code:

fn whitespace_char<'a>() -> impl Parser<'a, char> {
    pred(any_char, |c| c.is_whitespace())
}
Copy the code

Now, with whitespace_char, what we’re doing is getting closer to our idea of one or more Spaces, and similarly, zero or more Spaces. Let’s simplify this by naming them space1 and space0, respectively.

fn space1<'a>() -> impl Parser<'a, Vec<char>> {
    one_or_more(whitespace_char())
}

fn space0<'a>() -> impl Parser<'a, Vec<char>> {
    zero_or_more(whitespace_char())
}
Copy the code

String reference

With that done, can we finally now parse these properties? Yes, we just need to make sure we write a separate parser for the property component. We’ve got the identifier for the attribute name (although it’s easy to rewrite it using any_char and pred plus the *_or_more combinator). = match_literal(“=”). However, we only need a reference to the string parser, so we’ll build it. Fortunately, we have implemented the combinator we need.

fn quoted_string<'a>() -> impl Parser<'a, String> {
    map(
        right(
            match_literal("\" "), left( zero_or_more(pred(any_char, |c| *c ! ='"')),
                match_literal("\" "),
            ),
        ),
        |chars| chars.into_iter().collect(),
    )
}
Copy the code

The nesting of combinators is a bit annoying here, but we’re not going to refactor it for now and focus on what we do next.

The outermost combinator is a map, because nesting is annoying, it gets bad from here and we’re going to live with that and understand that, we’re going to try to find the place to start execution: the first quote character. In the map, there is a right, and the first part of the right is what we are looking for: match_literal(“\””). So that’s what we’re going to start with.

The second part of right deals with the rest of the string. It’s inside left, and we’ll quickly notice that the left argument on the right is the one we want to ignore, which is another match_literal(“\””) — closing quotes. So the left argument is the string we refer to.

We use the new pred and any_char here to get a parser that accepts any character except another quote, which we put in zero_or_more, so we’re also talking about the following:

  • A quotation mark
  • This is followed by zero or more characters other than the closing quotation mark
  • This is followed by the closing quote

And, between right and left, we discard the quotes in the result value and get the string between the quotes.

Wait, that’s not a string. Remember what zero_OR_more returned? A value of type Vec, where the value of type A is returned by the internal parser. For any_char, the char type is returned. So instead of a string, we get a value of type Vec

. This is where map stands: We use it to convert Vec

to String. Based on such a case, you can build an Iterator that produces String

, We call it vec_of_chars.into_iter().collect(), and thanks to the power of type derivation we have strings.


Before we continue, let’s write a quick test case to make sure it’s right, because if we need so many words to explain it, then it’s probably not something we as programmers should believe in.

#[test]
fn quoted_string_parser() { assert_eq! ( Ok((""."Hello Joe!".to_string())),
        quoted_string().parse("\"Hello Joe! \ "")); }Copy the code

Now, I swear, it’s really time to parse these properties.

Finally, the attributes are parsed

We can now parse Spaces, identifiers, = symbols, and quoted strings. In the end, that’s all you need to parse the properties.

First, we write a parser for the attribute pairs. We’re going to store attributes as Vec<(String, String)>, which you probably remember, so it feels like we might need a parser for (String, String) to provide us with a reliable Zero_or_more combinator. Let’s see if we can build one.

fn attribute_pair<'a>() -> impl Parser<'a, (String, String)> {
    pair(identifier, right(match_literal("="), quoted_string()))
}
Copy the code

It was so easy, I didn’t sweat a drop! To sum up: We already have a handy combinator for parsing the value of a tuple, pair, which we can use as the identifier parser to iterate over a String, and a right parser with =, whose return value we don’t want to save, And the quoted_string parser that we just wrote will return us a String value.

Now let’s combine zero_OR_more to build a vector — but don’t forget the Spaces between them.

fn attributes<'a>() -> impl Parser<'a, Vec<(String, String)>> {
    zero_or_more(right(space1(), attribute_pair()))
}
Copy the code

Zero or more occurrences of one or more whitespace characters followed by an attribute pair. We discard whitespace with right and preserve attribute pairs.

Let’s test it out.

#[test]
fn attribute_parser() { assert_eq! ( Ok(("",
            vec![
                ("one".to_string(), "1".to_string()),
                ("two".to_string(), "2".to_string())
            ]
        )),
        attributes().parse(" one=\"1\" two=\"2\"")); }Copy the code

The test passed! Don’t count your chickens before they hatch!

In fact, there are some problems, in this case, where my RUSTC compiler has given a hint that my type is too complex and I need to increase the range of types allowed to get compilation going. Given that we have encountered similar errors at the same point, it is advantageous that if you are in this situation, you need to know how to deal with it. Fortunately, ruSTC usually gives good advice in these cases, so when it tells you to add #! At the top of the file. [type_length_limit = “… some big number “] When annotating, just do it. In the real case, add #! [type_LENGTH_limit = “16777216”], which takes us further into the stratosphere of complex types. Full speed ahead. We’re going up.

Now we are very close to the answer

At this point, it looks like these things are coming together, which is kind of a relief, because our type is fast approaching NP-complete theory. We only need to deal with two types of element tags: the single element and the parent element with child elements, but we’re pretty confident that once we have these, we’ll just have to use Zero_OR_more to parse the child elements, right?

So let’s do the single element case first, and let’s leave the child element problem behind. Or, further, we could write a parser based on the commonality of the two elements: the initial <, the element name, and then the attribute. Let’s see if we can get a result of type (String, Vec<(String, String)>) from several combinators.

fn element_start<'a>() -> impl Parser<'a, (String, Vec<(String, String)>)> {
    right(match_literal("<"), pair(identifier, attributes()))
}
Copy the code

With this in mind, we can quickly write code to create a parser for a single element.

fn single_element<'a>() -> impl Parser<'a, Element> {
    map(
        left(element_start(), match_literal("/ >")), |(name, attributes)| Element { name, attributes, children: vec! [],}},)Copy the code

Hooray, it feels like we’re close to our goal — we’re actually building an Element!

Let’s test the wonders of modern technology.

#[test]
fn single_element_parser() { assert_eq! ( Ok(("",
            Element {
                name: "div".to_string(), attributes: vec! [("class".to_string(), "float".to_string())], children: vec! [] } )), single_element().parse("<div class=\"float\"/>")); }Copy the code

… I think we’ve escaped the stratosphere.

The type returned by single_Element is so complex that the compiler cannot successfully compile it unless we provide a large amount of memory in advance, or even require a larger type. Obviously, we can’t ignore this anymore, because it seems unreasonable that it’s a very simple parser that takes minutes to compile — which can lead to the final product taking hours to compile.

Before proceeding, you’d better comment out the two functions and test cases so we can fix them…

Dealing with infinitely large problems

If you’ve ever tried to write something recursively typed in Rust, you probably already know the solution to this problem.

A simple example of a recursive type is a single linked list. In principle, you can write it as an enumeration like this:

enum List<A> {
    Cons(A, List<A>),
    Nil,
}
Copy the code

Obviously, the RUSTC compiler gives an error message for the recursive type List, indicating that it has an infinite size, because inside each List::: Cons there can be another List, which means that List can go on to infinity. As far as the RUSTC compiler is concerned, we need an infinite list, and we require it to be able to assign an infinite list.

In many languages, an infinite list is not a problem in principle for type systems, and it is not a problem for Rust either. The problem is, as mentioned earlier, in Rust we need to be able to allocate it, or, more precisely, we need to be able to determine the size of a type before constructing it. When a type is infinite, that means the size must also be infinite.

The solution is to take an indirect approach. Instead of changing List::Cons to an element of A and another List of A, we use an element of A and A pointer to the List of A. We know the size of the pointer, and no matter what it points to, it’s the same size, so our List::Cons is now fixed-size and predictable, regardless of the size of the List. Turning an existing piece of data into a way to store the data on the heap with a pointer to the heap’s memory, in Rust, is to process it with Box.

enum List<A> {
    Cons(A, Box<List<A>>),
    Nil,
}
Copy the code

Another interesting feature of Box is that its types can be abstracted. This means that we can have the type checker handle a very compact Box

> instead of the current very complex Parser function types.

That sounds great. Are there any defects? Well, we might lose a loop or two because of the way we use Pointers, and the compiler might lose some opportunity to optimize the parser. But remember Knuth’s warning about premature optimization: everything will be fine. It’s worth losing these cycles. You are here to learn about parser combinators, not to learn to hand-write professional SIMD parsers (although they can be exciting in themselves)

So, instead of the simple functions we’re using so far, let’s continue implementing Parser based on the Parser functions we’re about to complete.

struct BoxedParser<'a, Output> { parser: Boxa, Output> + 'a>, } impl<'a, Output> BoxedParser<'a, Output> { fn new

(parser: P) -> Self where P: Parser<'

a, Output> + 'a, { BoxedParser { parser: Box::new(parser), } } } impl<'a, Output> Parser<'a, Output> for BoxedParser<'a, Output> { fn parse(&self, input: &'a str) -> ParseResult<'a, Output> { self.parser.parse(input) } } Copy the code

To better implement this, we created a new type, BoxedParser, to hold box-related data. We create new BoxedParser using other parsers (including another BoxedParser, although this doesn’t work very well), and we provide a new function BoxedParser:: New (Parser), which simply puts the parser in a new type of Box. Finally, we implement Parser for it so that it can be used interchangeably as a Parser.

This gives us the ability to put the Parser in a Box, and BoxedParser will perform some logic for the Parser in the role of a function. As mentioned earlier, this means moving the box-wrapped parser into the heap and having to remove Pointers to the heap region, which can take a few nanoseconds longer, so we might actually want to wrap all the data without Box. Just wrap some of the more active combinator data through Box.

  • Learning parser combinators through Rust – Part 1
  • Learning resolver Combinators through Rust — Part 2
  • Learning resolver Combinators through Rust – Part 3
  • Learning resolver Combinators through Rust — Part 4

license

This work is copyrighted by Bodil Stokke and licensed under the terms of the Creative Commons Attribution – Non-commercial – Share alike 4.0 agreement. To view the license, please visit the creativecommons.org/licenses/by…

footnotes

1: He’s not your real uncle 2: Please don’t be that person at the party.

If you find any mistakes in your translation or other areas that need to be improved, you are welcome to the Nuggets Translation Program to revise and PR your translation, and you can also get the corresponding reward points. The permanent link to this article at the beginning of this article is the MarkDown link to this article on GitHub.


The Nuggets Translation Project is a community that translates quality Internet technical articles from English sharing articles on nuggets. The content covers Android, iOS, front-end, back-end, blockchain, products, design, artificial intelligence and other fields. If you want to see more high-quality translation, please continue to pay attention to the Translation plan of Digging Gold, the official Weibo, Zhihu column.