So, hello. This talk is about some work I did with my previous employer. When I left them, they asked me to remove the logo from the slides and not to mention their name, so I'm here with no business affiliation. Unlike probably most talks at this conference, this is not about writing stuff with Python; this is about Python itself, and more specifically about CPython, which is the reference implementation, and even more specifically about the parser, which is part of that implementation. So what is the Python parser? When I put "Python parser" into Google, this is the picture that came up. But actually, the parser is the first stage of the interpreter: it converts the source code into a parse tree, which the compiler then compiles into bytecode, and the bytecode is what the interpreter actually runs. Now, what is the Python parser like? Same as any mature programming language, Python has a formal grammar, which looks something like this, if this link opens. It has formally defined productions, with three start symbols for three different kinds of input. Every line defines a new non-terminal symbol, and the notation here is a bit more complicated than the BNF notation that grammars are usually written in; it's something more like regular expressions, so we have plus modifiers and star modifiers for repetition, as well as parentheses for grouping. The other ingredient is the header file with the definitions of all the tokens. Python has 57 types of tokens, numbered 0 to 56, and the numbers from 256 upward are reserved for the non-terminals, which are the ones defined in the grammar. Then we have a parser generator that, based on the formal grammar and the token definitions, produces a file which looks like this. Basically, it's just a definition of static data.
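The token inventory described here can be inspected from Python itself. This is a small sketch using the standard `token` and `tokenize` modules (the exact token count and numbering vary between Python versions, so treat the specific numbers as illustrative):

```python
import io
import token
import tokenize

# Tokenize the sample program and show each token's symbolic name.
# token.tok_name maps the numeric token IDs (from Include/token.h) to names.
src = 'if 42: print("42")\n'
toks = [(token.tok_name[t.type], t.string)
        for t in tokenize.generate_tokens(io.StringIO(src).readline)]
for name, text in toks:
    print(name, repr(text))
```

Running this prints the token stream for the one-line sample program, starting with `NAME 'if'` and ending with `ENDMARKER`.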
It doesn't have any code in it; all it does is define the DFAs, one DFA for every non-terminal that the grammar defines. Every DFA is defined as an array of states, every state is defined as an array of arcs, and every arc has a label and a destination state. Oh, sorry, I thought my screen was duplicated, but it's not. Did you see the previous files I was trying to show? Okay, then I'll have to scroll back. This is what the grammar looks like, with the productions and right-hand sides that look like regular expressions, and this is what the header file looks like, with the token definitions. I don't know how to make it duplicate again; let me try. Right, now at least I can see what I'm showing myself. So, to go back: each DFA is defined as an array of states, and each state is defined as an array of the arcs going out of it. The first item of each arc is the arc label, which is more or less the same as the symbol number, and the second item is the destination state. In fact, we have a special table which converts symbol IDs into arc labels, because symbol number 1 corresponds to all keywords and all identifiers, so we also have to match which specific keyword it is, so that each of those gets a different arc label number even though they all share the same symbol ID. Near the very end of this file we have the definitions of the DFAs themselves, and for each DFA we also have the start set, which shows which tokens that specific non-terminal can start with. This is all sophisticated enough that I prepared a short demonstration of how it works, for a very simple Python program which looks like this.
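To make the state/arc encoding concrete, here is a toy sketch in Python. These are not CPython's actual tables: labels are spelled out as strings instead of numeric IDs, but the shape is the same, and arc label 0 marks an accepting state, as in the generated file:

```python
def matches(dfa, tokens):
    """Run a single DFA, encoded as a list of states, each state being a
    list of (label, destination) arcs, over a token sequence."""
    state = 0
    for tok in tokens:
        for label, dest in dfa[state]:
            if label == tok:
                state = dest
                break
        else:
            return False          # no arc matched: syntax error
    # A state is accepting if it has the special "empty" arc with label 0.
    return any(label == 0 for label, _ in dfa[state])

# Toy DFA for the production:  NAME (',' NAME)*
dfa = [
    [('NAME', 1)],                # state 0: expect a NAME
    [(',', 0), (0, 1)],           # state 1: loop on ',' or accept
]
print(matches(dfa, ['NAME', ',', 'NAME']))   # True
print(matches(dfa, ['NAME', ',']))           # False
```

The second call fails because the input ends in state 0, which has no label-0 arc and is therefore not accepting.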
We parse it as single_input, so that's the production we are parsing it according to, and our parse tree starts with just the root node, single_input. The DFA which parses single_input is DFA number zero, which looks like this. In the initial state, state number zero, we have three arcs going out of it: the first arc, with label number 2, corresponds to NEWLINE; the next arc to simple_stmt; the third arc to compound_stmt. We are looking at the `if` token in the input, and according to the start sets we know that a simple statement cannot begin with `if`, and `if` is not a newline, but a compound statement can begin with `if`. So we transition along the last arc, add a compound_stmt node to the parse tree, and before we actually transition to state number two, we switch to the new DFA, which parses compound statements. For compound_stmt we have nine options, so in the initial state there are nine arcs going out of it; only the first arc, which is for if_stmt, can start with an `if` token. So we add an if_stmt node to the parse tree, transition along this arc, and now we switch to the DFA which parses if statements. It's sophisticated enough that it didn't fit on this slide; you can see that its definition is also sophisticated, with parentheses and repetitions and optional parts. In the initial state we have only one arc, with label 97, which corresponds to the `if` token. Because it's a terminal symbol, we just add it to the parse tree, consume it from the input, switch to the next state, and stay in the same DFA. The next state once again has only one arc going out of it, but this one is for the `test` non-terminal, and the start set for `test` says that it can indeed begin with a number such as 42.
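The start sets consulted at each of these steps can themselves be derived from the DFAs. Here is a minimal sketch of that derivation, for a toy grammar rather than CPython's real one, and without guarding against left-recursive cycles:

```python
# Toy DFA tables (string labels; label 0 means "accepting state").
DFAS = {
    'compound_stmt': [[('if_stmt', 1), ('while_stmt', 1)], [(0, 1)]],
    'if_stmt':       [[('if', 1)], [(0, 1)]],       # truncated toy rules
    'while_stmt':    [[('while', 1)], [(0, 1)]],
}

def first_set(nonterminal, dfas):
    """Terminals that can begin this non-terminal: follow the arcs out of
    state 0, expanding non-terminal labels recursively."""
    result = set()
    for label, _ in dfas[nonterminal][0]:
        if label in dfas:                 # non-terminal arc: expand it
            result |= first_set(label, dfas)
        elif label != 0:                  # terminal arc: it's a starter
            result.add(label)
    return result

print(first_set('compound_stmt', DFAS))   # {'if', 'while'}
```

This is the table the parser consults to pick an arc without backtracking: looking at a single token, it knows which non-terminal arc can possibly match.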
So we add the test node, switch to the DFA which parses test, and so on; it goes on for an awfully long time before we actually get back into the same DFA. Now we are in state number two, which has one arc that consumes the colon token from the input and transitions to the next state. Once again we have only one arc, corresponding to `suite`, and the start set confirms that a suite can begin with a `print` token. So we add a suite node to the parse tree and switch to the DFA for suite; once again, it takes an awfully long time before we get back here. The interesting bit is that now we have three arcs going out of state number four: the arc with label 98 corresponds to the terminal `elif`, 99 to the terminal `else`, and the next token in the input is a newline, so it's neither of those. But an arc label of zero means that this is a final state, so if we didn't match anything, we are just done with this DFA and go back to the previous one. So we go back to compound_stmt, ending up in a state that has only the zero arc, which means that no matter what the next token in the input is, we are done and go back to the previous one. We go back to the root DFA that we started from; there is only one arc, with label number 2, which is for NEWLINE. So we consume the newline from the input, add a node to the parse tree, and transition to state number one, which is final. We are at the end of the input, and we're in the final state of the root DFA, which means that the parsing succeeded. We are done. So what has just happened?
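The machinery of this whole walkthrough, a stack of DFA frames where terminal arcs consume a token, non-terminal arcs push a sub-DFA, and final states pop back to the caller, can be sketched in miniature. This uses a hypothetical two-rule grammar, not CPython's; the real parser uses numeric labels and also builds tree nodes along the way:

```python
# Hypothetical grammar:   stmt: 'if' expr ':'       expr: NUMBER
DFAS = {
    'stmt': [[('if', 1)], [('expr', 2)], [(':', 3)], [(0, 3)]],
    'expr': [[('NUMBER', 1)], [(0, 1)]],
}
FIRST = {'expr': {'NUMBER'}}      # start sets for the non-terminal labels

def parse(start, tokens):
    """Pushdown automaton over DFAs: terminal arcs consume a token,
    non-terminal arcs push the sub-DFA, final states pop back."""
    stack = [(start, 0)]          # stack of (dfa_name, state) frames
    it = iter(tokens)
    tok = next(it, None)
    while stack:
        dfa, state = stack.pop()
        arcs = DFAS[dfa][state]
        for label, dest in arcs:
            if label == tok:                              # terminal arc
                stack.append((dfa, dest))
                tok = next(it, None)
                break
            if label in DFAS and tok in FIRST[label]:     # non-terminal arc
                stack.append((dfa, dest))                 # resume here later
                stack.append((label, 0))                  # descend into sub-DFA
                break
        else:
            if any(label == 0 for label, _ in arcs):      # final state: pop
                continue
            return False                                  # syntax error
    return tok is None            # success only if all input was consumed

print(parse('stmt', ['if', 'NUMBER', ':']))   # True
```

The "go back to the previous DFA" step from the walkthrough is the `continue` in the final-state branch: the current frame is simply dropped and the caller's frame resumes.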
We had a program which was 27 bytes long, tokenized into seven tokens, and this is the parse tree we produced for it: 64 nodes, taking a hundred times as much memory as the source code itself. And it's not so much a tree as a bamboo stalk, because it almost doesn't branch; we have a huge series of a dozen nodes with a single child each. All this because Python has a rather primitive parser, which requires the grammar to be defined like this. It helps to actually see this, instead of repeating all the steps I have just shown. That's the more important point: the parse tree can take a hundred times as much memory as the source code, so if I have a moderately big Python source file of, say, 30 megabytes, then the parser would not even be able to load it into memory, because the syntax tree for it would be too big. That was my motivation for this work: I had a 30-megabyte Python file, and I couldn't possibly run it through Python. To show these CSTs (CST standing for concrete syntax tree), there is a standard module called `parser`, and it can show you the CST serialized into a structure of nested lists. Every list in the structure has two elements: the first one is the symbol ID, so 0 to 56 for terminals and something above 256 for non-terminals, and the second item is the token string for terminals, or a subtree in a nested list for non-terminals. The documentation for the `parser` module actually says that people shouldn't use it, because there is a better module called `ast`, standing for abstract syntax tree, which is the second kind of syntax tree that Python uses, and this example shows what the AST
looks like for the same example Python code. You see that it's much more compact; it's almost human-readable. The point is that unlike the CST, which reflects how the program is represented in the source code, the AST reflects what the program actually intends to do, and it's actually the AST that the compiler runs on to generate the bytecode. This second kind of syntax tree also has a formal grammar, which looks like this, in a different format: now we have each kind of AST node defining what its children should be, how many, and what their types are. The funny bit is that there is no parser generator which takes this file as input; this file is just formal documentation, and the parser is handwritten. I hope it opens. Yeah, so this handwritten parser is actually 5,200 lines long. It's a bit of very old and dusty code which dates from 2005 to 2007; nobody touches it, nobody remembers how it works, and the general attitude is "if it ain't broke, don't fix it". It has been working for ten years, and time has proven that it works okay. Once again, the AST for our sample code is a very simple structure of nine nodes, which I'm showing here. So this is what the second parser actually looks like. As an example, here is the function that parses expression-type nodes. It has a loop organized with a label and a goto into the middle of the function, something that Dijkstra would not approve of, and what it actually does is that for each node which has only one child, it knows that there is nothing interesting about this node, so there is nothing to do, and it just skips to the next node. We saw that two-thirds of the nodes in the CST have one child; we don't do anything with them, we just skip to the next node. What I thought was: why do we have all these useless nodes
occupying the memory? If we can eliminate them at the time when we generate the CST, then we save two-thirds of the memory, and at the same time the second parser becomes simpler, because it no longer has to be concerned with skipping all these useless nodes. This is what an example of a compressed CST looks like; it's actually small enough that it fits on one slide without scrolling. And at the same time we have this modified function, which no longer has the awkward loop; actually, it doesn't have any loops at all. In each branch of the switch it can assert that the node has more than one child, meaning we can do something meaningful with this node and not just skip it. Now, this shows the serialized CSTs before and after my fix, and you can see that the CST has become so small that it almost becomes human-readable. The catch here is that the only purpose of the CST is to be parsed by the second parser into the AST, and then it's discarded from memory; we never use it again. So if we have a long-running script, such as a web service, for example, that loads in a split second and then runs for hours and days, then it really doesn't matter how much memory the CST takes, because it's only there for the first split second. On the other hand, if we have a system utility which runs for a split second and finishes immediately, then the performance of this utility would be dominated by the performance of the parser. And that was my case: my script was just looking up a value in a 30-megabyte literal dictionary, and indeed, with my fix to the Python parser, the memory consumption of my script was reduced by a factor of three. I was curious how my patch would affect other standard benchmarks, and these are the results. One of them showed an improvement of a third, a few of them showed around a dozen percent improvement, and nothing got worse. But actually, this benchmark package which I was using was deprecated in October last year, on the grounds that the
results it was showing were inconsistent and meaningless for some purposes. I'm not a performance-analysis person, so I don't know what it did and what it did wrong, but it has been superseded by the pyperformance package, and these are the results: basically, something got a little bit better, something got a little bit worse, but the big picture is that nothing changed. My intuition is that pyperformance intentionally runs a few warm-up loops before starting the timer, and this way it intentionally avoids timing the loading and parsing of the source code. I didn't actually look inside pyperformance; it's big and sophisticated enough. But nevertheless, even these results confirm that my patch has no unintended side effects and doesn't break anything, and at the same time it enables a use case which may not be relevant to most Python programmers; but at least I bumped into it, so maybe other people would bump into it as well. So then my desire was to share this patch with the community and get it included into Python. The way somebody does this is by opening a bug in the bug tracker, because that's where all the discussions happen and where all the reviews happen, and there is also a mailing list where you're allowed to send pings once a month if your bug is not getting enough attention. So I opened two issues in the bug tracker, and one of them, the second one, actually got accepted and committed; it took me three months of waiting for a review. What it does is this: the second, handwritten parser knows that the input it parses always comes from the first parser, so it's guaranteed to conform to the first formal grammar, and it doesn't have to validate that it conforms to it. It knows for sure that, for example, for an if_stmt node the first child is always the `if` token and the second child is always a test. But the `parser` module can take an arbitrary serialized CST as input and run it through this second parser, and then, if it doesn't conform to the
formal grammar, the Python binary would just crash on an invalid memory access. So they had to implement the validation for the CST themselves, and they implemented it like this. I'm showing just the validation for three different types of nodes, but it's all more or less the same: they check that the number of children in the node is correct, that the types of the children are correct, and then they recursively descend into each child node by calling the validation function for that child. When I was looking at this code, which has to be updated with every change to the grammar to match the updated grammar, I thought: maybe we can auto-generate it from the grammar; maybe we can write some DFAs which would traverse the CST to see if it's valid. And then it occurred to me that we already have these DFAs: they are the DFAs which the parser runs. We can just run them on the CST instead of on the incoming token stream. This way I could eliminate 2,400 lines of handwritten code with just 70 lines of code for running the DFAs, which we already have in the Python parser, and which are guaranteed to stay in sync with the parser, because they are what the parser runs. But after three months in review, my other patch was rejected, by Guido van Rossum himself, and that's because it's so big and sophisticated, and nobody had volunteered to review it over three months, that I shouldn't hold my breath for anybody to jump up and review it. And if nobody cares as much as I do about running 30-megabyte Python scripts through the interpreter, then tough luck for me. Basically, this is my motivation for being here: maybe you can enter this bpo-26415 bug ticket, take a weekend to review it, and if you find it's okay, get it committed. Yeah, basically, I have explained what my patch does, but that's not all. If the only purpose of the CST is to get parsed by the second parser and discarded, why do we have to store it in memory at all?
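The idea of reusing the parser's DFAs for validation can be sketched like this: represent the CST the way the `parser` module serializes it, as nested lists of a symbol ID followed by children (or a token string), and run each node's DFA over the labels of its children. The symbol IDs below are illustrative, not CPython's real ones:

```python
# Illustrative symbol IDs (CPython's real ones live in the generated headers).
IF, COLON, NAME, IF_STMT, TEST, SUITE = 97, 11, 1, 295, 305, 300

# DFA for:  if_stmt: 'if' test ':' suite      (elif/else arms omitted)
DFAS = {
    IF_STMT: [[(IF, 1)], [(TEST, 2)], [(COLON, 3)], [(SUITE, 4)], [(0, 4)]],
}

def validate(node):
    """Check one CST node against its DFA, then recurse into the children,
    instead of hand-writing per-node-type checks."""
    sym, children = node[0], node[1:]
    if sym not in DFAS:               # terminal, or a node we don't model here
        return True
    state = 0
    for child in children:
        for label, dest in DFAS[sym][state]:
            if label == child[0]:
                state = dest
                break
        else:
            return False              # child sequence not allowed by grammar
        if not validate(child):
            return False
    return any(label == 0 for label, _ in DFAS[sym][state])

good = [IF_STMT, [IF, 'if'], [TEST, '42'], [COLON, ':'], [SUITE, '...']]
bad  = [IF_STMT, [IF, 'if'], [COLON, ':']]          # missing the test
print(validate(good), validate(bad))   # True False
```

Because the DFAs are the same tables the parser itself runs on, a validator built this way cannot drift out of sync with the grammar.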
Maybe we can generate one CST node at a time, handle it in the AST parser, and discard it immediately. That way, instead of improving its memory consumption by a factor of three, we're improving it infinitely: it will take zero memory. What the current parser actually does is this: it has two entry points, one for string objects and one for file objects. It passes the input to the parsetok function, which runs a loop inside it, extracting one token at a time from the input and passing the extracted token to the PyParser_AddToken function. This AddToken function, a kind of consumer function, maintains its state in a parser-state structure, which most importantly holds the stack of the DFAs, keeping track of the current state of each DFA, and the pointer to the CST root node, so that it appends new CST nodes to this tree. After parsetok is done, at the end of the input, it returns the CST root to the caller, and the caller passes it to the second function, called PyAST_FromNodeObject, which is essentially the second parser. It recursively traverses this tree, and for each subtree of the CST it generates the corresponding subtree of the AST. The important bit here is that the whole of the CST is available by the time the second parser runs, so it can analyze nodes out of order, it can analyze one node repeatedly, and so on. My suggestion is to replace PyAST_FromNodeObject with a different function, something like PyAST_FromTokenizer, which would do more or less the same, but to get each CST node it would call this next_node function, which would take the parser state and all the parameters for the tokenizer (which it would just pass on to the tokenizer), and generate one node of output, meaning only the node type and node string but no children, as well as return an enum value indicating the position of the new node relative to the previous node: whether it's the first child of the parent node, the next sibling of the previous node, or whether the previous node was the last child and we're going back to the parent. The difficult
question here is: what are we going to do with the `parser` module? Because we no longer have the CST in memory, there is nothing that it could serialize into the structure of nested lists. I'm not sure anybody is actually using the `parser` module, especially since it's documented as deprecated; all source-code analyzers that I have seen use the AST, because it's much easier to use. But with some effort, we can make the `parser` module itself call next_node recursively to materialize the CST, serialize it into these nested lists, and then discard the whole thing, just for compatibility with existing consumers, if there are any. And there are some difficulties here. The minor difficulty is that parsetok can no longer have a loop running inside it; we have to change it into a generator function and extract the loop state out of it into the parser-state structure. But the main difficulty is that we have to rewrite all 5,200 lines of the handwritten parser. This is an example of rewriting one function in this way; this one parses slice nodes (slices are when you have square brackets and one, two, or three numbers separated by colons inside them). You see that the old parser does all kinds of interesting things, whereas the rewritten parser can only check each node once, and it has to go over the nodes in order, which actually simplifies the code. The question is, it would obviously take some time to rewrite the 5,200 lines this way, maybe several weekends, maybe many weekends. Do we want to do this? And if, for example, I take several weekends to do this, will there be somebody who wants to review the patch, including the rewriting of 5,200 lines of code? That's actually all I have, and when I put "Python questions" into Google, this is the image that came up. So pythons are actually cute. Thank you very much. Any questions?
When you had the problem tokenizing your 30-megabyte source file, what was the limit that you hit? Did you simply run out of memory, or was there some internal structure that had a static size and overflowed?

Yes, I was doing it on my laptop, which only has two gigabytes of memory, so if I had bought more memory, or had more swap, then it wouldn't have been a problem; and in the future, when computers are even larger than they are now, it's not going to be a problem for hardly anybody. If you want to run Python on your phone, then it will always be a problem. Oh, but we already know Python doesn't run on phones.

Rather than rewrite all the conversion from the concrete parse tree to the abstract syntax tree, is it not possible to add some equivalent of a grammar definition for that and generate the code? Because that code does look awfully repetitive.

I agree that it does look repetitive, and I can actually show this: most grammars, for yacc for example, include bits of code with each production, which the parser generator inserts into the parser. Maybe something similar could be done to this ASDL file, to help generate the parser, but this would change the syntax of the file, which is something formal, and somebody outside of the CPython tree may rely on its exact structure. So, while keeping the same input structure of the grammar, it's not possible; if we give ourselves the liberty to rewrite the grammar, then it would be possible, but we might be breaking somebody else's code.

Hi. I, for one, think this is awesome, and if the entire problem is finding a co-maintainer, I think you could become one; you are clearly qualified.

Thank you.

For what it's worth, the CPython tree moved to git four or five months ago, so all of your hg links are now five months out of date.

Sorry, can you repeat that? This web page is at hg.python.org.

Python is now hosted on GitHub, and has been for five months.
So we're now looking at five-month-old source code.

Yeah, my slide deck is actually older than five months, because this work has been going on since a year ago. But I'm sorry about using outdated links; they still work. Any other questions?

Okay, so thanks again for a very interesting talk.