Please welcome Guido van Rossum. [Guido:] Thank you. Okay. So the real title of this talk is "Writing a PEG Parser for Fun and Profit," because I'm a languages guy. I'm a languages nut. I'm really mostly just a parsing nut. Well, actually, James Bennett already explained what a parser is: it's part of a compiler. The compiler takes the source code and turns it into, in our case, bytecode, and it goes through stages. There's a tokenizer. The tokenizer analyzes the source code and turns it into useful things: there's "print", there's a parenthesis, there's a string literal, there's another parenthesis. Then there's something that you didn't really see in the source code, a NEWLINE token, but it's very much a token. There are even INDENT and DEDENT tokens. Then there's a parser, and almost all the rest of the talk is about the parser, which produces, from that sequence of tokens, an abstract syntax tree. I got an idea from James to just show what that tree looks like. I'm terrible at drawing on computers, so apparently I have to use text, but that's it: an abstract syntax tree, for me. And then you get bytecode, which kind of looks like that, and you don't really want to know what that does, except that it does print "hello world". So parsers and grammars are closely related, but of course they're different. The parser is code; it has a lot of code, in fact. A typical parser has so much code that I really don't want to show you the code; it's way more than that bit of bytecode there. A grammar, on the other hand, is a relatively simple thing. Here is, for example, a grammar that describes very simple arithmetic expressions: you can add and subtract and multiply and divide, and you can even put things in parentheses. And it's all four lines.
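The slide itself isn't in the transcript, but a four-line grammar along these lines, in pgen-style EBNF, matches the description (the exact rule names and layout are a guess):

```
start:  expr NEWLINE
expr:   term (('+' | '-') term)*
term:   factor (('*' | '/') factor)*
factor: '(' expr ')' | NAME | NUMBER
```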
This corresponds to, I don't know, between 50 and 100 lines of code or so. Fortunately, we don't have to write all that code to parse that language, because we can take something called a parser generator, which takes the grammar as input and produces a parser as output. So now we see that sort of von Neumann thing where data is also code: the output of the parser generator is, from the parser generator's perspective, data, but from the compiler's perspective it's code; it's the code of the parser. That's the thing that turns the tokens into a parse tree. Now, parser generators are a dime a dozen, to the point that even in the early '70s, when a parser generator named yacc was created, it was already called Yet Another Compiler-Compiler. They're fun to write, which is why, as Chris said, in 1989 literally the first thing I did when I thought "I'm gonna implement my own language" was: okay, let's shave this yak first; I'm gonna build my own parser generator. And I had some decent reasons for it, but mostly it was just for sport. I originally wrote it in C; then in 2004 or so I rewrote it in Python, and that eventually made it into the standard library. Now the original pgen is completely dead, replaced by a copy of the Python version that does the same thing, and I feel it's essentially a dead end. It's an LL(1) parser that gets generated; actually, it's mostly tables that get generated, which drive a little pushdown automaton. The key thing is that it has a single token of lookahead: it looks at one token. I had this great image in my mind and I couldn't find the right video support for it. There's a thing in railroading called a handcar, which has like three or four wheels
with a seat on it, or maybe two, and I was thinking: the parser is this little handcar. In front of the handcar is a guy who looks at the code, which is written between the railroad ties, and he picks up one token at a time. He has this huge rule book next to him, and he maybe flips to a different page in the rule book, and he throws the thing over his back, where there's a whole stack of things, and every once in a while he grabs a few things from his stack, wraps them in a package, and pushes the package back onto the stack. Anyway, that's what the pushdown automaton does, and that's how a yacc- or bison-generated parser nowadays works as well. The one thing that sort of set pgen apart, at least 30 years ago when I wanted my own parser generator, my excuse was that I could use EBNF, which is Extended Backus-Naur Form. So you can use plus for "repeat x one or more times," or star for "repeat x perhaps many times, or perhaps zero times," or question mark for "maybe there was an x here or not," and a bunch of other things. The other thing that pgen did, which at the time I was very proud of, is that it automatically generates a parse tree for you. So if you had a grammar like that little expression thing, you would automatically get a parse tree that gives you an expression node; the expression node has a term child and a plus child and another term child, the term child has a factor child, the factor child has a name child, and so on. Over the years, eventually, we also started recognizing the downsides of pgen. But I should add one more positive thing about pgen: because it was a fairly restricted paradigm, especially in the early years it sort of kept us honest when we were proposing new features for Python, because we always said, well, if we can't write the grammar for it using pgen, then we don't want to implement the feature, because then
apparently the syntax you're proposing is too hard to parse, and it's probably also going to be too hard for humans. But we actually violated our own rule there for various very useful things, like assignments. So: pgen does not support left recursion. Instead of an expression node that can have, as its left child, another expression node, which is how a + b + c really ought to be interpreted, you have to say: well, it can be a term followed by any number of additional things that are each a plus or minus operator followed by another term. And so now, if you have a + b + c, it's term plus term plus term. But now the code generator has to essentially parse that parse tree node and see: okay, we have an expression node; let's see how many children it has; does it have 2n+1 children? Then there are interesting things happening, or something like that. So the idea that it generates the parse tree automatically is true, but it's not a very useful parse tree. And literally "2 + 2 = 5" is correct as far as Python's formal grammar is concerned; we then have a separate pass that says: wait a second, on the left of that assignment operator we're looking for something that's an assignment target. Many things that are expressions are also assignment targets, like "x, y, z" inside square brackets is an assignment target, but 2 + 2 is not an assignment target.
And so that's what that second pass has to figure out. It turns out that the parse tree changed every time we had to refactor the grammar, which happened regularly when we added some feature; something like adding await to the grammar changed the parse tree produced by pgen dramatically. We didn't want to have to modify the bytecode generator, which is a really complicated, kind of sensitive piece of code, every time we refactored the grammar, because often, after the refactoring, the parse tree looks completely different, but the meaning is still the same: the code you want to generate for it, in most cases, is still the same. So we added a distinction between a concrete syntax tree, which is what pgen produces, and an abstract syntax tree, which is what you would really like to have, and which is what the code generator now consumes as its input. And so now there is a translation program that goes from the concrete to the abstract syntax tree, and when we refactor the grammar, we only have to change that part. When we refactor the grammar and also add a feature, of course, we still have to update the bytecode compiler as well; you have to compile the code somehow. But it turns out that that translation was non-trivial: I looked, and it really was almost 6,000 lines of C code, and it is still pretty sensitive. I mean, adding the walrus operator was a major operation just because we had to update that thing a lot. Also, now we have two trees that represent the entire program that have to be in memory at the same time. But: a new shiny object! As I said, I like parsers.
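An aside not on the slides: Python already exposes that abstract syntax tree through the standard-library ast module, so you can look at the kind of tree the bytecode generator consumes:

```python
import ast

# Parse a tiny program into the abstract syntax tree the compiler consumes.
tree = ast.parse("print('hello world')")

# The dump shows a Module containing an Expr node
# that wraps a Call to the name 'print'.
print(ast.dump(tree))
```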
Well, I didn't immediately jump at the occasion when PEG was invented, but people pointed it out to me a few times, and the last time was earlier this year, and I finally thought: hmm, PEG actually looks kind of interesting, and I sort of have a few things I want to do to Python that might be much easier if we had a different grammar than the stupid 30-year-old pgen one. PEG has a completely different approach to parsing. It actually uses infinite lookahead, which means that you essentially have all the tokens for the entire program in memory. But memory is cheap; this is still the representation of a program, which is a very small amount of data compared to, like, one photo that you just took of the screen. It turns out that there is a little add-on where you can make PEG support left recursion easily. In general, PEG has more power because of the infinite lookahead, so you can make grammars that actually look more natural and that sort of describe the feature. Essentially, we don't have to put so many lies in the documentation: the documentation claims that it shows the grammar; well, the grammar that the documentation shows is very good for understanding what it means, but that grammar is completely different from what pgen actually consumes. PEG also has some other features that I'm not sure how useful they will be, lookahead assertions especially; I think those are mostly used to rule out ambiguities in certain grammars that Python probably will never have. And then, since I'm starting over, I might as well add actions, which are more or less the things in curly braces that you know from, say, yacc or bison. Using those actions, we can generate the AST directly, which at least saves us all the complexity of translating the CST to the AST, and it saves a little bit of memory, which we then of course use to keep all that infinite lookahead
for the tokens. So again, there are many PEG parser generators already on the market, but none of them did exactly what I wanted, or at least I couldn't figure out how to make them do exactly what I wanted, so I thought I'd explore this as my hobby project, because I was already sort of exploring retirement. So, what I've been doing — I forget when I started; I think the first blog post came out late in the summer — oh hey, there's music — I actually wrote several PEG generators. The first version I wrote generates Python code, and it works, and we eventually got it to the point where there's also a metagrammar that describes the input to the parser generator. Then I said I wanted to blog about it. That sort of got out of hand; I've got nine episodes by now, and maybe the end is near. For the blog posts I started from scratch with a much cleaner version that was easier to explain. And then at some point I also took version one and refactored it; well, actually, with the help of Pablo, who's one of the Python core devs (he's based in London), we refactored the original generator so that it can also generate C code, which was my ultimate goal. In September, during the core development sprint, I hacked on a grammar that can actually parse all of Python, at least as far as I can tell. Emily, another core dev, wrote a test program that takes this parser and parses a whole directory full of files with it, and tells you which files succeed and which files fail to parse. And then, once we had that test program,
I just mostly iterated on small subsections of the standard library. Every time I found programs that couldn't be parsed yet, I added some tweak to the grammar, until I could parse everything. And then I found out that there are about nine files in the standard library that have syntax errors in them. Well, fortunately, it turns out upon further inspection that those are all test cases for lib2to3 and a few other parts, so it's not so bad. But the good news is that, without having done any particular optimization, this thing parses about as fast as the original pgen-generated parser, even though the technology involved is much more advanced. At least we didn't lose any speed. We lost a bit of memory, but there's still hope that I can optimize that out. So now let's look at how PEG actually works. There is, of course, the generator that produces the parser, and there's the runtime. The generator just takes the grammar, and it actually doesn't generate parsing tables: it generates a recursive descent parser. It generates code for every rule in the grammar.
It generates a function or a method that can recognize that part of the grammar, and whenever a rule references another rule, the function just calls another function. We have high hopes that the C stack is big enough to parse reasonably complex programs, and if not, I know there's a way to sort of turn all this inside out and generate parsing tables again, for some kind of pushdown automaton, if we need to. Oh yeah, the generator also takes the actions, which are actually C expressions or Python expressions, depending on which language the parser is written in. So if you generated Python code, you have actions that are Python expressions, and so on for C. Those are inserted in the right spots in the generated functions. And then at runtime we have additional support for memoization, which I'll explain in a bit, and for left recursion, and when a rule is recognized, it executes the action to produce a bit of the AST. Okay, let's start with a truly trivial grammar: an expression is either a term plus another term, or just a single term. So you can do 2 + 2, but you can't do 1 + 2 + 3; you can also do x + y. So: NAME and NUMBER — anything in all caps is actually a token type, not a rule. One thing of note is that this is all tied fairly closely to Python's tokenizer, and Python's tokenizer recognizes certain multi-character symbols: plus-equals is recognized as a single token; on the other hand, plus-plus is not a special Python token, so that's just two plus tokens. Let's see what we do with this simple grammar. Here is a very naive version of a parser that corresponds to that grammar. So the grammar is two lines: an expression is either a term plus a term, or just a term; and a term is a NAME or a NUMBER. And here's a function; the function just returns true or false, and it moves the input pointer.
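The slide code isn't in the transcript, but a naive parser matching that description might look like this sketch (the (type, string) token tuples, the global pos pointer, and the helper names are illustrative):

```python
# (type, string) token pairs, as a hypothetical tokenizer might produce them.
tokens = []
pos = 0  # the input pointer: an index into the token list

def mark():
    return pos

def reset(p):
    global pos
    pos = p

def expect(arg):
    # Consume the next token if arg matches its type or its string;
    # otherwise return False and leave the input pointer where it is.
    global pos
    if pos < len(tokens) and arg in tokens[pos]:
        pos += 1
        return True
    return False

def term():
    # term: NAME | NUMBER
    return expect("NAME") or expect("NUMBER")

def expr():
    # expr: term '+' term | term
    p = mark()
    if term() and expect("+") and term():
        return True
    reset(p)
    if term():
        return True
    reset(p)
    return False
```

Feeding it the tokens for abc + 1 succeeds and leaves the pointer past the last token; on failure the pointer ends up back where it started.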
So it's hardly a parser, but after all is said and done, it either returns true, and then the input pointer has been moved past the last token of the expression it recognized, or it returns false, and then the input pointer is still where it was. We can also explicitly manipulate the input pointer, which is just a counter into the array of tokens. mark() gives us the current input pointer; reset() sets the input pointer to something. So we say pos = mark() — and I don't know if I can point; no, I cannot point, but you can read "pos = mark()" there — and later we call reset(pos), which just rewinds to the start of the expression. expect() is something that either consumes a token and returns true, or, if that particular token is not there, returns false and leaves the input where it is. So an expression is a term, followed by a plus, followed by a term, as you can still sort of see here. Then, if we didn't recognize something with the plus, we reset the position and go again, to see if there's just a simple term. Turns out there is, maybe; then we return true. If not — technically we don't need to reset the position at this point, because the term function itself also has this property that it either succeeds and moves the input, or fails and doesn't move the input — but just in case, we reset the position, and then we return false, because we didn't find an expression, sadly. The term function does a similar thing, except that because it is so simple, we can just say: return name or number. Now let's show the same grammar with a very simple action added to it. The action is: we create an Add node; say that's some predefined class. The action needs to be able to reference sub-expressions of the alternative it's attached to, of course, so we can name the terms: the first term is x and the second term is y. The plus doesn't get a name, because we
don't actually reference it directly. The simple alternative, where there's just a term, has no action; there's supposed to be a default action that says: if an alternative has only one item, the default action just returns whatever that item was. term also doesn't need any actions, because it will just return a NAME or a NUMBER node. Let's see; I think we have everything there. So, what the simplest form of the parser generator actually produces for this — it's not exactly this, but it's pretty close — does use the walrus operator. This is all very forward-looking, so it only works with Python 3.8, or probably with 3.9 as well, on master. There's a ToyParser class, which inherits from a Parser base class; I'm not showing that, but it just defines the mark(), expect(), and reset() functions. So again, we save the current input position by calling mark(), except now mark() is a method. We call term(), then expect('+'), and another term() call. If any of those fail, the whole if fails; if they all succeed, we have x and y assigned, and then we return that Add node. Otherwise we reset, and the default action — well, it has to have some variable that it can return, so it will say: return that term. The term() method is pretty simple. A little later I have a little demo where I actually show this in action, if I can get it working. There's one problem with this example: even this very simple example demonstrates an issue, which is that the first and the second alternative for expr both start with term. Now suppose that term is a complicated thing, and we have an input that is just a term. Because of the ordering of the alternatives — which, in general, is actually important for PEG — you try the alternatives in order until one succeeds.
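Pieced together from that description, the generated code might look roughly like this sketch (Parser, Tok, and Add are illustrative stand-ins, not the actual generated code; note the walrus operator, so this needs Python 3.8+):

```python
from dataclasses import dataclass

@dataclass
class Tok:
    type: str
    string: str

@dataclass
class Add:
    left: object
    right: object

class Parser:
    # Base class providing mark/reset/expect over a token list.
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def mark(self):
        return self.pos

    def reset(self, pos):
        self.pos = pos

    def expect(self, arg):
        # Match by token type or exact token string; consume on success.
        if self.pos < len(self.tokens):
            tok = self.tokens[self.pos]
            if arg in (tok.type, tok.string):
                self.pos += 1
                return tok
        return None

class ToyParser(Parser):
    def expr(self):
        pos = self.mark()
        # expr: term '+' term  { Add(x, y) }
        if (x := self.term()) and self.expect("+") and (y := self.term()):
            return Add(x, y)
        self.reset(pos)
        # expr: term  -- the default action returns the item itself
        if t := self.term():
            return t
        self.reset(pos)
        return None

    def term(self):
        # term: NAME | NUMBER
        return self.expect("NAME") or self.expect("NUMBER")
```

For the tokens of abc + 1, ToyParser(...).expr() returns an Add node with the two term tokens as its children.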
So we try term '+' term, and we get as far as the end of the first term; then we look ahead and see: is there a plus? And there is not a plus. Then we reset, and we try again, looking for just a term. What happens at this point is that everything works, but we parse that sub-expression, term, twice. Now, everything is completely deterministic here when you're parsing the input — I mean, we're caching the tokens in memory, so there's no chance that, if the file changed, the second time you try to parse that sub-expression it'll look different. So we can memoize this: basically, if you call the same function at the same input position, we know that its effect is going to be the same. It either succeeds and moves the input position forward a number of tokens, or it fails and doesn't move the input. And we cache that — we cache both the success and the failure of a function call — and then, if somehow the parser, in any context, calls that same function again with the same input, we can save all the effort of parsing that sub-expression: we just return the cached AST node and move the input pointer forward. And there's a theory result that says that this way you actually parse in linear time; it's called packrat parsing. I think it was invented by the same guy who invented PEG, because they really go hand in hand: PEG without packrat parsing is hopeless. I guess packrat parsing is useful for other types of parsers too, though. At least in the Python version of the code generator, this can be done with a @memoize decorator. Left recursion is the other thing that makes PEG attractive from my perspective. Let's first show it without actions. Now we have a slightly more complex grammar.
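A minimal sketch of such a @memoize decorator (an illustration, not CPython's actual implementation): it caches (rule, position) → (result, end position), so re-parsing the same rule at the same position is free. The call counter here is just instrumentation to demonstrate the effect.

```python
def memoize(method):
    # Cache (rule name, position) -> (result, end position) per parser.
    def wrapper(self):
        key = (method.__name__, self.mark())
        if key in self.memo:
            res, endpos = self.memo[key]
            self.reset(endpos)          # replay the cached input movement
            return res
        res = method(self)
        self.memo[key] = (res, self.mark())
        return res
    return wrapper

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.memo = {}
        self.calls = 0                  # instrumentation: uncached rule calls

    def mark(self):
        return self.pos

    def reset(self, pos):
        self.pos = pos

    def expect(self, tok):
        if self.pos < len(self.tokens) and self.tokens[self.pos] == tok:
            self.pos += 1
            return tok
        return None

class ToyParser(Parser):
    @memoize
    def term(self):
        self.calls += 1
        return self.expect("x")

    def expr(self):
        # expr: term '+' term | term
        pos = self.mark()
        if self.term() and self.expect("+") and self.term():
            return True
        self.reset(pos)
        return bool(self.term())
```

With the single-token input ["x"], the first alternative parses term, fails on '+', resets, and the retry of term at position 0 comes straight from the cache: term's body runs only once.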
It has the same number of rules, but the first rule is actually: an expression is an expression followed by '+' and a term, or just a term. So that's left-recursive. In old pgen that was statically an error; pgen says: you can't do that. Classic PEG without left-recursion support can also tell, and says: you shouldn't do that. But there is a hack, and I'll show how that works. With actions, it looks the same as the previous one. Okay, actually — I wrote that line a few days ago, but I knew where I was in the schedule, and yes — really, I was pointed to a wiki page that had the description, and I tried to read it, and it was complicated, and it pointed to a paper, and I tried to read the paper, and the paper was hopeless; I can't read theoretical papers. I thought about it, went back to the wiki page, and then I peeked at some code that actually implemented the whole thing, and finally the penny dropped and I realized how it works. You set a recursion limit: you say, expr is left-recursive, but this particular call to expr, at this position, can only recurse n times, and when it tries to recurse for the (n+1)th time, we just artificially say: there is no expr here. Then, because the expr function is still generated exactly the same way, it will just try the second alternative if the first, left-recursive alternative fails. So you start with the smallest possible recursion limit — like zero or one, I forget which one it is — and regardless of whether you get a success or a failure, you bump the recursion limit by one, and then you try again. If the second time you get a result that is a success, and that is longer than the first time, you keep doing this: you bump the recursion limit one more
and you try again. And you keep doing this until you either get a result that is a failure, or you get a result that is not longer than the previous one. And again, there is theory that says that this works. I tried to understand it, and for a very simple case I think I sort of followed the proof on paper; in general, it's again above my level of understanding, but I trust that it works, and in practice it works beautifully, because that grammar for all of Python that I created is full of left recursion, even a few mutually recursive rules. It turns out that the memoization cache was a really good idea, because the memoization cache is essentially abused to implement the recursion limit: you prime the cache with a failure result, then you call the function that the decorator is wrapping once, and if it returns something that you want to keep, you stuff that back into the cache, and then you go again. It's beautiful, but it's also pretty tough to follow if you haven't spent half a day trying to understand it. There's one more thing: there's a little bit of graph analysis you have to do to even know whether something is left-recursive, and there are some strange edge cases, but graph analysis is easy. Okay, I had some demos planned, and I think the demos are actually the best part. So: PEG grammars can get pretty complicated, because there is this thing where you have to put the longest alternative first, especially when two alternatives might start with the same sub-expression, because PEG, as soon as an alternative succeeds, is gonna pick that one, and because of the memoization and various other things,
it's never ever going to try any following alternatives. So every once in a while you write your grammar and you test it, and sometimes it doesn't recognize what you want, or it doesn't reject what you expected it to reject, and then you end up debugging it. In the first version I put in a whole bunch of logging, and I spent a lot of time making the logging nice, but it was still really hard, and then, for the blog posts, I decided to actually create a visualizer. Now, it turns out that visualizers for parsing algorithms are also a dime a dozen; I don't know if I've seen another one that was written using curses, though. Let me see, where is my other window? Okay, I think here's my first demo. So this is the grammar. Ignore that first line; "start" just means that's where we start the parser generator. And then it's this non-left-recursive version. So this is the parser generator generating stuff; this is the generated code — I wasn't kidding that the generated code is longer than the grammar — and this is the input: we're parsing a very simple thing, abc + 1. Okay, and so here's the visualization. The screen is divided into three sections; the middle section is where the input will appear. And there's one thing I didn't explain, which is that we actually parse the input incrementally: we only ask for the next token from the input file when the grammar really, really wants to know what the next token is. On top, we build up the stack of calls: so start calls expr, expr calls term, term calls expect, and so on. At the bottom, we'll see things appear that end up being memoized. So let's start: we start with the start function being called. The start function has only one alternative, and it has only one item,
so we're gonna try that. Okay, so now we're calling expr, and the underline shows us that that's where we are. expr is term '+' term, or just term, so we're gonna try the first item of the first alternative, which is a term. So we're calling term; term is NAME or NUMBER, and we still haven't looked at the input; we're still just in the recursive calls of the parser. So now we're gonna expect a NAME — and now, yeah, now it's do or die. We look at the input, and of course there's a token, abc, so that is in fact a NAME. So this succeeds: expect returns abc, then term succeeds, and I guess I lied about the default action; it actually produces some kind of default node that says: we recognized a term, and the only thing in the term was abc. Okay. So now expr is going to look ahead: now it's going to expect a plus. The things we already recognized sort of sink down to the memoization cache. We're calling expect('+'); by the way, the indentation here corresponds to the input position. That's important, because we know we're now expecting a plus at the input position after the first token. So we look: is there a plus? Yes, there's a plus. So now we're gonna recognize the third item, the second term. Here below you also see that expect('+') is cached at the input position where there was a plus. So now we're looking for a term again; term is a NAME or a NUMBER. We're expecting a NAME; no, it's not a NAME. Now we're gonna try a NUMBER: we call expect for a NUMBER. Notice that the expect for a NAME is cached also; there was no NAME here; negative caching is important. Because we have a 1: is that a NUMBER? Yes, it is a NUMBER. So now we have a term, so now we have the whole first alternative for expr, and that returns some successful thing, and now we've rewound the call stack back to the start function, which is also ready, and this is the output. So we recognized the start,
we recognized an expression, we recognized the term — that's all there is. Okay, it blinks to tell me that there's no more. We can also run the visualization backwards, if you wanted to: if you came across something interesting and you thought, "oh, what happened there?" Yeah, so now we're going forward again. That was fairly nasty to implement, the going-backward part, because I hadn't planned for that. Okay, the second demo, quickly, because we're technically out of time. It's the same grammar, so not so interesting; this version adds the memoization code, though. I think that's the case — maybe — oh, it's still demo one, ha; hence. Okay, this is more complex. Oh yeah, this is the left-recursive one; that actually may be a little complicated for this late, but let's walk through it; it's not much. Okay: the star means it's the recursive rule. So you see that expr calls expr — but they're, going back — yeah, so we did the setup for the left-recursive expr call: it primes the memoization cache with a None, and so expr calls itself recursively, asks "is there an expr here?", and that call goes through the cache, and the cache says there is no expr there. So we immediately proceed to the second alternative. Is there a left paren? There's not. Is there an atom? An atom is a NAME or a NUMBER in this grammar. Yep, yep, yep. Okay, so that was the recursive call, and now we do another recursive call — with a different — okay, going back a little bit again. So once we go back here,
the cached version for expr changes: now we have this whole node with ham in it. And now we go recursive again: check if there's a plus; term is not very interesting; yeah, that's a NAME. Okay, so we recognized everything, so we put that in the cache, because it's longer than what we got before, and then we go again. We find the same thing in the cache; we see if there's another plus — expect('+') at this further-indented position — but no, there's a NEWLINE there. Actually, this is the first time that we see the NEWLINE token. So we go back to the second alternative, which immediately gives us this result, and now, going back again a little bit: at this point the left-recursion logic detects that the new result was not longer than the old result. So it says: okay, we got the longest we can get, so we're done here. And that's how left recursion works. I had a third demo, but it's not all that interesting, so I'm gonna go back to my slides, if I can get there. And this is all I have left: what's next? So now that we have the full grammar for Python, we have to develop actions for it; turns out some guy in Berlin is interested in helping with that.
I imagine it's also easily distributable: multiple people can work on developing actions for different parts of the grammar. So if people want to help, please drop me a note. Once we have enough actions, we have to test everything, and that includes what we currently don't have: generating tests, somehow, for things that don't parse, because we also need to be confident that the new parser rejects the same collection of inputs as the old parser. Then we have to benchmark it, and I'm fully expecting that the benchmark will show that either the memory usage is out of hand, or it's still slower than the old parser, and that doesn't make a very good case for the rest of the core devs to allow me to do all this stuff. And then there's going to be this sort of gut-wrenching review: is that grammar really more readable, after you added all those actions and everything? Can we do things with it that we couldn't do with pgen, or that would at least be really hard with pgen, or where pgen would say, sort of, "yeah, do whatever," and then the second pass has to sort it all out? And do we actually want to do those things? The one thing I have in mind, which is definitely going to come in a release after we potentially introduce this in CPython, is a match statement. So if you're interested in match statements: hang in there. Anyway, I think we have about six months before beta one, which is the feature freeze. So wish me luck, and thanks.