Original | The origins of pgen

Author | Guido van Rossum (father of Python)

Translator | The Cat Under the Pea Flowers

Original link | python-history.blogspot.com/2018/05/the…

David Beazley’s talk at US PyCon 2018, about parser generators, reminded me that I should write about its history. This is a brief brain dump (maybe I’ll expand on it later).

(Translator’s note: I’ll venture an interpretation of “brain dump”: turning personal memories, along with details of Python’s history, into written words, so that the knowledge is solidified and easy to pass on. My job as translator is to spread this documented wealth to more Python enthusiasts.)

There are actually two pgens: the original, written in C, and a rewrite in Python, under lib2to3/pgen2.

I wrote both of them. The original was actually the very first code I wrote for Python. Strictly speaking, though, I had to write the lexer first (pgen and Python share their lexer; pgen just has no use for most of the tokens).

The reason I wrote my own parser generator was that the landscape of parser generators (that I was familiar with) was pretty sparse at the time. Basically there was Yacc (the GNU rewrite called Bison existed, but I’m not sure I knew of it then); or you could write a parser by hand (which is what most people did).

I had used Yacc in college and was familiar with how it worked from the Dragon Book, but for some reason I didn’t like it much; IIRC I found the limitations of LALR(1) grammars hard to explain.

(Translator’s note: the Dragon Book is Compilers: Principles, Techniques, and Tools, a classic textbook and the bible of compiler theory. It is often mentioned together with two other classics, nicknamed the “Tiger Book” and the “Whale Book”. IIRC stands for “If I Remember Correctly”.)

Collect all three books and you can summon the dragon?

I was also familiar with LL(1) parsers and had carefully written some recursive-descent LL(1) parsers by hand. I liked them, and I was also familiar with the technique for generating LL(1) parsers (again thanks to the Dragon Book), so I had an improvement I wanted to try out: using regular expressions (sort of) instead of the standard BNF format.

The Dragon Book had also taught me how to convert regular expressions into a DFA, so I put all those things together and pgen was born. Update: see below for a slightly different version of this motivation.

I was unfamiliar with more advanced techniques, or thought they were too inefficient. (I think that was the prevailing opinion among most people working on parsers at the time.)

As for the lexer, I decided not to use a generator. I had a much lower opinion of Lex than of Yacc, since the version of Lex I was familiar with would segfault (really!) when trying to scan a token longer than 255 bytes. Also, I figured that the indentation format would be hard to teach to a lexer generator.

(Translator’s notes: 1. The generators here are not Python’s generators, but tools used to generate analyzers. Lex, short for “LEXical compiler”, is used to generate lexical analyzers; Yacc, short for “Yet Another Compiler Compiler”, is used to generate parsers. 2. A segfault is a segmentation fault, caused by out-of-bounds access to memory.)
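To see why indentation is awkward for a pattern-per-token tool like Lex, here is a small illustration (my addition, not part of the original post) using Python’s own standard-library tokenize module, which emits explicit INDENT and DEDENT tokens by tracking a stack of indentation levels:

```python
# Illustration only: Python's tokenizer emits INDENT/DEDENT tokens by keeping
# a stack of indentation levels -- stateful behaviour that a regex-per-token
# generator like Lex has no natural way to express.
import io
import tokenize

source = "if x:\n    y = 1\nz = 2\n"

for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))

# The output includes an INDENT token before "y = 1" and a DEDENT token
# before "z = 2", even though no single character in the source marks them.
```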

pgen2 is a different story.

I was employed by a startup in San Mateo (Elemental Security, which folded in 2007, after I had left to join Google), where I was tasked with designing a custom language (with the goal of making security assertions about system configuration) and was given considerable autonomy.

I decided to design something vaguely Python-like, implemented in Python, and decided to reuse pgen, but with a Python backend, using tokenize.py as the lexer. So I rewrote pgen’s algorithms in Python and went on to build the rest.

Management decided it made sense to open-source the tool, so it was quickly approved, and somewhat later (I had probably moved on to Google by then?) it also made sense to use the tool for 2to3. (Since the input format was identical to the original pgen’s, generating a Python parser with it was easy: I just had to feed the grammar file into the tool. :-)
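For the curious, here is a hedged sketch of what “feeding the grammar file into the tool” looks like with the pgen2 that ships inside lib2to3. This assumes an interpreter that still includes lib2to3 (it is deprecated and has been removed from recent Python releases):

```python
# Hedged sketch, assuming an interpreter that still ships lib2to3 (it is
# deprecated and removed in newer releases). load_grammar() runs pgen2's
# EBNF -> NFA -> DFA construction on the grammar file and returns the tables.
import os
import lib2to3
from lib2to3 import pytree
from lib2to3.pgen2 import driver

grammar_file = os.path.join(os.path.dirname(lib2to3.__file__), "Grammar.txt")
grammar = driver.load_grammar(grammar_file)

# The Driver couples the parse tables to tokenize.py and builds a parse tree.
d = driver.Driver(grammar, convert=pytree.convert)
tree = d.parse_string("x = 1 + 2\n")
print(type(tree))  # a pytree.Node
print(tree)        # str() of the tree reproduces the source text
```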

Update: why pgen was created, and more of the story

I don’t quite remember why I did it this way, but I just peeked at https://en.wikipedia.org/wiki/LL_parser#Conflicts, and I may have seen this as a nice (new to me) way of resolving the various conflicts that can occur, without adding helper rules.

For example, take what that page calls left factoring (replacing A -> X | X Y Z with A -> X B; B -> Y Z | <empty>): I would rewrite it as A -> X [Y Z].

If I remember correctly, the parsing engine (the syntacticAnalysis function earlier on that page) can still work with parsing tables derived from such rules via the “regular expression -> NFA -> DFA” transformation; I think the requirement of having no empty productions was needed here. (Translator’s note: “empty productions” refers to the <empty> above, i.e., productions that derive nothing.)
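To make that concrete, here is a minimal sketch (not pgen’s actual tables): the right-hand side of A -> X [Y Z] is the regular expression X (Y Z)?, which compiles to a tiny DFA that a table-driven engine can walk:

```python
# Minimal sketch (not pgen's real data structures): the EBNF rule
#     A -> X [Y Z]
# has the right-hand side X (Y Z)?, which compiles to a small DFA:
#     0 --X--> 1 --Y--> 2 --Z--> 3     with states 1 and 3 accepting.
TRANSITIONS = {
    (0, "X"): 1,
    (1, "Y"): 2,
    (2, "Z"): 3,
}
ACCEPTING = {1, 3}

def is_valid_expansion_of_A(symbols):
    """Walk the DFA; return True if the symbols are a valid expansion of A."""
    state = 0
    for sym in symbols:
        if (state, sym) not in TRANSITIONS:
            return False
        state = TRANSITIONS[(state, sym)]
    return state in ACCEPTING

print(is_valid_expansion_of_A(["X"]))            # True: the [Y Z] part omitted
print(is_valid_expansion_of_A(["X", "Y", "Z"]))  # True
print(is_valid_expansion_of_A(["X", "Y"]))       # False: Y without Z
```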

I also recall that a parse tree node produced by the parsing engine may have a variable number of children. For example, for the rule A -> X [Y Z] above, a node for A may have either 1 child (X) or 3 children (X Y Z). The code generator needed a simple check to determine which case it was dealing with. (This proved to be a double-edged sword; we later added a “parse tree -> AST” step, driven by a separate generator, to simplify the bytecode generator.)
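Here is a hypothetical sketch of that “simple check” (the names are made up, not the CPython compiler’s): the consumer of an A node just branches on the child count:

```python
# Hypothetical sketch (not the CPython compiler's code): for A -> X [Y Z],
# an A node has either 1 child (X) or 3 children (X Y Z); the code generator
# simply branches on the child count.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    type: str
    children: List["Node"] = field(default_factory=list)

def emit(node: Node) -> None:
    print("emit code for", node.type)

def gen_code_for_A(node: Node) -> None:
    emit(node.children[0])           # X is always present
    if len(node.children) == 3:      # the optional [Y Z] part was taken
        emit(node.children[1])       # Y
        emit(node.children[2])       # Z

gen_code_for_A(Node("A", [Node("X")]))
gen_code_for_A(Node("A", [Node("X"), Node("Y"), Node("Z")]))
```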

So the reason I used regular expressions was probably to make the grammar easier to read: after applying the rewrites needed to resolve conflicts, I found the grammar not that easy to read (insert Zen of Python reference here :-). Regular expressions fit better with how I think about a typical language’s syntax: no helper rules with awkward names, and the [optional] parts and repeated* parts say it directly.

Regular expressions did not increase the power of LL(1), nor did they reduce it. By “regular expressions”, of course, I mean EBNF. I’m not sure whether “EBNF” was a well-defined notation at the time; it might have just meant any arbitrary extension of BNF.

Converting EBNF to BNF and using that would lead to awkward extra parse tree nodes, so I don’t think it would have been an improvement.

If I had to do it all over again, I might opt for a more powerful parsing engine, perhaps some version of LALR(1) (e.g. Yacc/Bison). There are areas of the grammar where LALR(1) is more powerful than LL(1) in useful ways, such as keyword arguments.

In LL(1), the rule “arg: [NAME =] expr” is invalid, because NAME appears in the FIRST set of expr, and the LL(1) algorithm cannot handle that.

If I remember correctly, LALR(1) can handle it fine. However, keyword arguments didn’t appear until many years after I wrote the first version of pgen, and by then I didn’t want to redo the parser.
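Here is a toy sketch of the conflict (hypothetical token shapes and helpers, not CPython’s parser): both alternatives of arg can begin with a NAME token, so a single token of lookahead cannot choose between them; peeking at the second token, which LL(1) is not allowed to do, settles it:

```python
# Toy sketch of the LL(1) problem with  arg: [NAME =] expr.
# Both "x=1" and plain "x" start with a NAME token, so looking at one token
# cannot decide the alternative; it takes a second token of lookahead.
def parse_arg(tokens, pos):
    tok = tokens[pos]
    # A strict LL(1) parser may only inspect tokens[pos] here, and a NAME can
    # start either alternative.  Peeking one token further resolves it:
    if tok[0] == "NAME" and pos + 1 < len(tokens) and tokens[pos + 1] == ("OP", "="):
        expr, pos = parse_expr(tokens, pos + 2)
        return ("kwarg", tok[1], expr), pos
    expr, pos = parse_expr(tokens, pos)
    return ("posarg", expr), pos

def parse_expr(tokens, pos):
    # Toy "expr": a single NAME or NUMBER token.
    kind, value = tokens[pos]
    assert kind in ("NAME", "NUMBER")
    return ("expr", value), pos + 1

print(parse_arg([("NAME", "x"), ("OP", "="), ("NUMBER", "1")], 0))
print(parse_arg([("NAME", "x")], 0))
```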

Update March 2019: Python 3.8 will drop the C version of pgen in favor of the rewritten pgen2 version. See github.com/python/cpyt…

Guido is currently working on a PEG parser as an alternative to pgen.
