Hello, I am delighted to announce our next talk in the NEA channel, which is about parsers and compilers. Kishou from Google will be talking about "Learn from LL(1) to PEG parser the hard way". I believe this is a pre-recorded talk, so I will just introduce it and the video should start playing. Take it away.

Hi, welcome to my talk. The topic is "Learn from LL(1) to PEG parser the hard way". My name is Kerr, and before starting, I want to briefly go through this photo. It comes from a famous show in Japan called Sasuke, which has become popular around the world as Ninja Warrior. Let's see how a player plays this stage. As you can see, the player needs to jump from one bar to the other and eventually reach the goal, and the player needs to make the jump twice. This is one of the most difficult stages in the game. It's called Cliffhanger, and it has knocked out tons of great players. Why am I bringing this up? Because I want to convey how I felt while preparing this presentation, and I hope you can experience that feeling too.

All right, let me talk about myself. I'm currently working for Google, and I used to work for Amazon. I was born in Taiwan and now live in Tokyo, Japan, and I've presented at PyCon in Taiwan and Japan since 2017. You can find more about me on my GitHub page, or check out the source code related to today's talk in my project called Parser Learning. As you can see, I put up a photo of a movement similar to the one in Sasuke; this is how I tried to experience what the show's competitors go through and how difficult it is.

Now the agenda. Today we will cover the points listed here: first my motivation for this talk, then what parsing is in CPython, then a deep dive into parser technologies from 101 to 102, and finally CPython's PEG parser and the takeaways. Let's talk about my motivation.
My motivation came from the release article "What's New In Python 3.9". In that article, there's one line that says: PEP 617, CPython now uses a new parser based on PEG. When I read this line, my reaction was: if I remember correctly, I took a compiler class in school, so I'm supposed to know this. However, when I checked what I learned in school, I found that school only taught us the broad concepts of a compiler's front end and back end, and the school's parser assignments mostly used Bison plus Yacc. I immediately realized that knowledge was long gone from my brain. So here's a list of questions I asked myself at that time, which are also today's talk objectives. What is a PEG parser? Why did Python use an LL(1) parser before? Why did Guido choose a PEG parser? What other parsers do we have? What's the difference between those parsers? And finally, how do we implement those parsers?

So the first thing we need to learn is: what is the parser in CPython? You can find this in the CPython Dev Guide tutorial "Design of CPython's compiler". Here are Python's compilation steps. Initially we have the source code, and usually we just run the python command and get the result. What happens under the hood? Here are the steps. The first step is the lexer. The lexer splits your source code into a bunch of tokens; at this point, the tokens have no meaning. Then the parser, today's main character, parses those tokens, runs syntactic analysis, and eventually generates a meaningful syntax tree, which we call the AST. I will explain this later. Now we have our tree. The tree goes through another compile step and outputs bytecode. The bytecode can then be executed by the VM, the virtual machine, to produce the result. That's what you can see on the screen. And if the virtual machine detects an imported module, that module goes through the same process to generate its own bytecode and result. Let's take a deeper look at each step.
For the first step, the lexer, you can use a module called tokenize to demonstrate it. tokenize takes your source code as input and generates a bunch of tokens. Remember, the tokens here have no meaning. Then we have another module called ast, serving the purpose of the parser. AST stands for abstract syntax tree, and the module has two relevant functions, one called parse and the other called dump. parse parses your source code and generates a tree, and the tree can be displayed with the dump function, producing output like this. This tree carries meaning. Now that you have a tree, how do you generate the final output? There are two built-in functions to help you: one is compile, the other is eval. compile turns your tree into bytecode, and with eval on the bytecode you get the final result. Also, if you are interested in the details of the bytecode, you can use a module called dis: its dis function takes your bytecode as input and prints the step-by-step execution details.

Today's focus will just be the parser. I described the entire compilation pipeline, but our focus is the parser; if you are interested in the other steps, they are not the focus of today's talk.

We've learned where CPython's parser sits, so we can start to talk about parser technologies. In Parser 101, we will cover the content of a school lecture. If you are a computer science student, you may already know it, so it's a good chance to review. If you are not, that means you can learn it just like a computer science student would. The first essential item is the CFG. CFG stands for context-free grammar. A grammar can be represented by this gray block, and inside the grammar we usually have a lot of rules, each represented by a yellow block. In each rule we have an arrow, and the arrow means derivation. Some papers may draw it differently, but the meaning is the same.
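The pipeline just described can be sketched end to end with the standard library. This is a minimal illustration, and the source string is just an assumed toy example, not one from the talk:

```python
import ast
import dis
import io
import tokenize

source = "x = 1 + 2 * 3"

# Step 1: the lexer turns source text into meaningless tokens.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
print([tok.string for tok in tokens if tok.string.strip()])

# Step 2: the parser builds an abstract syntax tree.
tree = ast.parse(source)
print(ast.dump(tree))

# Step 3: compile the tree to bytecode, then execute it with eval.
code = compile(tree, filename="<example>", mode="exec")
namespace = {}
eval(code, namespace)
print(namespace["x"])  # 7

# Optional: inspect the bytecode with dis.
dis.dis(code)
```

Note that `compile` accepts either a source string or an already-parsed AST, which is why the tree can be fed to it directly.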
A rule usually has two sides, the left-hand side and the right-hand side. Uppercase means a non-terminal and lowercase means a terminal. A non-terminal can derive to something else, while a terminal is a final token. We also have the bar here; in a CFG, the bar means "or". So a CFG supports ambiguous syntax, because in this case we can interpret the grammar as saying that both uppercase B and lowercase a can be derived from uppercase A. The reason we talk about CFGs here is that in the rest of this talk, and in the papers, you will often see a lot of grammars, so please make sure you are familiar with the meaning of each block.

And what does context-free mean? It can be explained by the following two examples. Simply speaking, context-free disallows any context on the left-hand side: the left-hand side of every rule can only contain a single non-terminal. So the second case is an invalid CFG, because we put some context before and after the non-terminal and say that whole pattern can derive to another pattern. That is not a valid case. In a valid case, we can only say that a single non-terminal derives to another pattern.

Now we've learned the context-free grammar; what is it used for? A context-free grammar usually goes through a process called syntactic analysis. That process takes the grammar plus the source code as input and generates the ultimate output, a parse tree. Parse trees come in two types: one is concrete, the other abstract. The upper part shows a concrete parse tree. A concrete tree is usually pretty complicated and raw, and not very human readable; it contains many non-terminals and all the intermediate steps of going through your grammar rules. On the other side, the abstract syntax tree is much more human readable. For this example, if you have an elementary education background, presumably you know that the multiply operation binds tighter than the plus operation.
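To make the context-free restriction concrete, here is a minimal sketch of one way to write a grammar down in Python. The dict encoding and the uppercase/lowercase convention are illustrative assumptions of mine, not any standard format:

```python
# Grammar: A -> a B | a,  B -> b
# Keys are non-terminal left-hand sides; values are lists of
# alternatives, and each alternative is a list of symbols.
grammar = {
    "A": [["a", "B"], ["a"]],
    "B": [["b"]],
}

def is_context_free(g):
    # Context-free: every left-hand side is a single non-terminal.
    return all(len(lhs) == 1 and lhs.isupper() for lhs in g)

# A left-hand side like "aAb" carries context around the
# non-terminal, so it is not a valid CFG rule.
not_cfg = {"aAb": [["a", "c", "b"]]}

print(is_context_free(grammar))   # True
print(is_context_free(not_cfg))   # False
```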
Depending on the parser implementation, we may need some treatment of the CFG, and that's called CFG simplification. There are three types of simplification we should talk about: the first is ambiguity, the second is non-determinism, and the third is recursion.

An ambiguous grammar is one that contains rules which can generate more than one tree. In the following example, as you can see, it generates two kinds of trees if you try to put parentheses in. This is problematic because, as anyone with an elementary education should know, the right-hand tree is the correct one and the left-hand one is wrong. To resolve this, we need to rewrite the grammar. After the rewrite, as you can see, we no longer have just one non-terminal; we use another two non-terminals so that the grammar can be represented by one unique tree.

The second simplification is non-determinism, meaning the grammar contains rules that share a common prefix. In the following example, the non-terminal A can derive to the terminals ab and ac. Because we have the common prefix a, we can rewrite the grammar with another non-terminal and eventually get a grammar like this.

The third one, which is also the most challenging one for parser implementation, is called left recursion. This is pretty complicated to explain if you have never run into it before. Say you have a grammar like this. It's very human readable, right? E can derive to E plus T, or to T; T can derive to T multiply F, or to F; and F can derive to a number. This is just a way to describe a calculator supporting plus and multiply. However, the problem shows up in the implementation: what if you always have to deal with E first?
You will hit the problem that E derives to another E, and that E derives to yet another E, and your program recursively expands E until it reaches the memory limit, unable to parse anything. That is left recursion. How do we deal with it? Yet again, we need to rewrite the grammar. After the rewrite, as you can see, it's pretty ugly and not very human readable, but the meaning is the same. I'm not going to explain the detailed correspondence between the left-hand side and the right-hand side; I just want you to know that if you want to implement a parser in a certain way, then you need to remove the left recursion, which means your grammar gets rewritten into a form that is not very human readable.

Let's have a recap. We have three types of CFG simplification; here's the before, and here's the after. We know CFGs, we know that different parser implementations may require certain simplifications of your CFG, and we know the output is a parse tree. So we can start to talk about traditional parser implementations.

Before that, we need to know that there are actually two types of parser: top-down and bottom-up. A top-down parser starts from the root and works down to the leaves to build the tree. A bottom-up parser, on the other side, starts from the leaves and goes up until it can't go up anymore, eventually building the tree from the leaves to the root. Based on this classification, there are two families of parsers: the first is called LL, the second LR. The first L in both LL and LR stands for left-to-right: they both scan the input stream from left to right. The second letter indicates the derivation: L for leftmost, R for rightmost. So an LL parser does leftmost derivation. In the first example, on the left-hand side, you can see the top-down approach is the LL approach: with it, you meet the plus operator first, then the multiply operator.
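The before/after of the left recursion rewrite can be sketched like this. The dict encoding and the names Ep and Tp (standing in for E' and T') are illustrative assumptions of mine:

```python
# Detect direct left recursion: some alternative of a rule starts
# with the rule's own left-hand side.
def has_direct_left_recursion(grammar):
    return any(
        alt and alt[0] == lhs
        for lhs, alts in grammar.items()
        for alt in alts
    )

# The readable calculator grammar: E -> E + T | T, and so on.
left_recursive = {
    "E": [["E", "+", "T"], ["T"]],
    "T": [["T", "*", "F"], ["F"]],
    "F": [["number"]],
}

# The rewritten, uglier version: E -> T E', E' -> + T E' | ε, etc.
rewritten = {
    "E": [["T", "Ep"]],
    "Ep": [["+", "T", "Ep"], []],   # [] is the empty production
    "T": [["F", "Tp"]],
    "Tp": [["*", "F", "Tp"], []],
    "F": [["number"]],
}

print(has_direct_left_recursion(left_recursive))  # True
print(has_direct_left_recursion(rewritten))       # False
```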
On the other side, rightmost derivation is what the LR parser does. On the right-hand side, you can see it starts from the leaves and goes up to the root, so it meets the operators at different times: the multiply operator first, then the plus operator.

Finally, the k. k means tokens of lookahead. Here's an example. Starting from 2, if you implement a parser, you basically read a token that is a number. If I perform one token of lookahead and meet a plus token, then what do I do next? That is the meaning of lookahead.

We've learned the definitions of LL(k) and LR(k); now we can look at the detailed implementations. The first one is the top-down recursive descent parser, which is also the simplest and most straightforward. In this case, there are three steps in total. The first step is to write a function for each non-terminal in your grammar; in this case, we have five non-terminals. And remember, this grammar can't contain left recursion, so it's pretty ugly. Now that we have the corresponding functions, we can start to parse the string from left to right. We always need to start from the root, which means calling parse_e. From parse_e, based on the lookahead, we decide which derivation to proceed with, and eventually we reach the end of the input string.

Look at the example code: here is the LL(1) example. For the grammar we have five non-terminals, and here are two example functions, parse_e_prime and parse_t. parse_t is pretty simple: there is only one possible derivation, so nothing special. But in parse_e_prime there are two possibilities, so we need the k-token lookahead approach: we look at the next symbol, and if it's a plus, we take one derivation.
And if we don't find the plus symbol next, we take the other derivation. That is the k-lookahead approach in this case, with k equal to one.

Another way to implement a top-down parser is called a non-recursive descent parser; Python's old LL(1) parser was also implemented with this approach. Here, instead, we need a pre-generated parsing table based on your grammar, and the pre-generation requires two steps. The first step is to compute the FIRST and FOLLOW sets for each non-terminal, based on which symbols can begin it and which can come after it. Once we collect that information, we can generate the parsing table. The parsing table basically tells you: when I'm at a certain symbol, what action should I take according to the grammar? The table is then used in the following procedure. We have a stack here, and again the first arrow means we scan left to right, and the second means we run leftmost derivation, from the top down. As you can see, in the beginning we have E as the first non-terminal on the stack. Then we repeatedly expand according to the table until we reach a terminal and can match it against the input, and we keep running this until there are no more symbols on the stack. At that point we accept: the input is valid and we have built the tree. The implementation is also pretty simple. We need a parsing table, a stack, and a pointer to the current position, and based on the symbol we meet, we either match a terminal or expand a rule. When it's done, we return the root of the tree.

Let's look at the final traditional parsing implementation, the LR parser. The LR parser is in fact more popular than the LL parser, mainly because it accepts left-recursive grammars. Let's look at the details.
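The recursive descent approach with one token of lookahead can be sketched for the rewritten calculator grammar. The function names and the choice to return computed values instead of tree nodes are my own assumptions to keep the sketch short, not the speaker's exact code:

```python
# Recursive descent: one function per non-terminal, k = 1 lookahead.
# Grammar (left recursion already removed):
#   E  -> T E'      E' -> "+" T E' | ε
#   T  -> F T'      T' -> "*" F T' | ε
#   F  -> NUMBER

def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse_e():
        return parse_e_prime(parse_t())

    def parse_e_prime(left):
        nonlocal pos
        if peek() == "+":          # the lookahead decides the derivation
            pos += 1
            return parse_e_prime(left + parse_t())
        return left                # ε alternative

    def parse_t():
        return parse_t_prime(parse_f())

    def parse_t_prime(left):
        nonlocal pos
        if peek() == "*":
            pos += 1
            return parse_t_prime(left * parse_f())
        return left

    def parse_f():
        nonlocal pos
        tok = peek()
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"expected number, got {tok!r}")
        pos += 1
        return int(tok)

    return parse_e()

print(parse(["1", "+", "2", "*", "3"]))  # 7, so * binds tighter than +
```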
For the LR parser, like the table-driven LL parser, we also need to pre-generate a parsing table for the step-by-step actions. The difference is that we don't use FIRST/FOLLOW sets; we use something called a DFA, a deterministic finite automaton, and generate the parsing table from its states. In each state we have items with a kind of dot position, and in the beginning, every dot is at the start. We look at the following symbol and perform a shift into a different state, and if the dot reaches the end of a rule, we reduce by that rule. So in this case we would say, oh, this is rule five, and reduce by rule five. And if we are at the beginning and meet E, we shift to state two. It's like a kind of workflow: we keep shifting until we reach the end of a rule, then reduce by that rule. The parsing table looks similar: you have the states here, and on meeting a given symbol you shift to a given state; when you reach the end of a rule, you run the reduce action for that grammar rule. The step-by-step execution goes like this: we scan the input left to right, and, driven by the parsing table, the derivation is built from the right instead. This is the bottom-up approach. The example code is very similar to the previous non-recursive one: we also use a stack, and based on the parsing table, when we are in a certain state we do the corresponding action, either shift or reduce, and finally construct the tree.

Great, we've covered the textbook content; let's start on something outside the textbook. In Parser 102, we will first talk about PEG and the PEG parser. PEG stands for parsing expression grammar. It was initially introduced in the 2002 paper on packrat parsing. There are two differences between PEG and CFG.
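The shift/reduce cycle just described can be sketched with a hand-built table for the tiny left-recursive grammar E -> E "+" n | n, which a plain LL(1) parser cannot accept. The state numbering and the table entries are my own worked-out assumptions for illustration, not generated by any tool:

```python
# Hand-built SLR(1) table for:  E -> E + n | n
ACTIONS = {
    (0, "n"): ("shift", 2),
    (1, "+"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "+"): ("reduce", ("E", 1)),   # E -> n
    (2, "$"): ("reduce", ("E", 1)),
    (3, "n"): ("shift", 4),
    (4, "+"): ("reduce", ("E", 3)),   # E -> E + n
    (4, "$"): ("reduce", ("E", 3)),
}
GOTO = {(0, "E"): 1}

def parse(tokens):
    tokens = list(tokens) + ["$"]     # end-of-input marker
    stack = [0]                       # state stack
    trees = []                        # parallel stack of partial trees
    pos = 0
    while True:
        action, arg = ACTIONS.get((stack[-1], tokens[pos]), (None, None))
        if action == "shift":
            trees.append(tokens[pos])
            stack.append(arg)
            pos += 1
        elif action == "reduce":
            lhs, size = arg
            children = trees[-size:]  # pop the rule body, push its tree
            del trees[-size:]
            del stack[-size:]
            trees.append((lhs, children))
            stack.append(GOTO[(stack[-1], lhs)])
        elif action == "accept":
            return trees[-1]
        else:
            raise SyntaxError(f"unexpected {tokens[pos]!r}")

print(parse(["n", "+", "n", "+", "n"]))
```

Because the reductions happen bottom-up, the resulting tree is left-associative, which is exactly what the left-recursive rule expresses.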
First, PEG rules support regular-expression-like syntax. Second, and most important in PEG, is this bar: the bar now means ordered choice. That means PEG no longer supports ambiguous grammars: if the first alternative succeeds, it will not try the second. Let's look at the example. For grammars one and two, a CFG sees no difference: the CFG says A can derive to lowercase ab or lowercase a. But for PEG, it's different. For grammar one, PEG first tries deriving A to lowercase ab, and if that succeeds, it commits; it will not try the second alternative. Only if it fails will it try the second one. So under PEG, grammars one and two now mean different things.

And the PEG parser: a PEG parser is a parser generated from a PEG. A PEG parser can be a packrat parser, or another traditional parser with a k-lookahead limitation; however, when we say PEG parser, we mostly mean a packrat parser. What is a packrat parser? Let's take a look. Recall one question I mentioned in the beginning: why did Guido move from the LL(1) implementation to a packrat implementation? The answer will gradually show up. First, the packrat parser is also a top-down parser, so the implementation is similar. There are only two differences between the LL(1) recursive descent implementation and the packrat implementation. The first is that the rules are now PEG rules instead of CFG rules. The second is that when we recursively parse the input string, we perform infinite lookahead instead of k lookahead: we look at every possibility. This is allowed because, with PEG rules, we no longer accept ambiguous syntax. And how do we make that affordable? We use an approach called memoization; the implementation looks like this. Another important reason we shifted to packrat is the big benefit we get: our grammar accepts left recursion now, like the LR parser.
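Ordered choice can be shown with a tiny sketch. The rule pair and the helper name are my own assumptions; the point is only that swapping the order of alternatives changes the result under PEG, while a CFG would treat both orders the same:

```python
# PEG ordered choice: try alternatives in order, commit to the
# first one that matches, never revisit the later ones.
def match_choice(alternatives, text):
    for alt in alternatives:
        if text.startswith(alt):
            return alt          # committed: no backtracking to others
    return None

grammar_1 = ["ab", "a"]   # A <- "ab" / "a"
grammar_2 = ["a", "ab"]   # A <- "a" / "ab"

print(match_choice(grammar_1, "ab"))  # 'ab'
print(match_choice(grammar_2, "ab"))  # 'a'  -- order changed the meaning
```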
The implementation is also similar: we parse recursively, but we add a decorator for memoization, and then still run the derivation with the recursive logic here.

To talk about memoization, we need to shift slightly to Algorithms 101. One of the iconic problems is the Fibonacci number; you can also find it on LeetCode. It can be implemented in exponential time in a recursive way: in this recursive implementation, you can see we pass a number into the function, and the function calls itself with smaller numbers until the number is smaller than two, at which point it returns the result. This approach is very slow, because when the given number n is super large, the generated call tree below becomes huge, and the same numbers get recalculated all the time because you don't record them anywhere. The time complexity in this case is exponential. How do we resolve this? In Python, we use lru_cache, just two lines of code. It records the function's results somewhere in memory, so you no longer need to recompute numbers you have already computed. The time complexity goes from exponential to linear, and the space complexity goes from O(1) to O(n), because you need to store all the computed results.

Right, now you understand memoization. What about left recursion in a packrat parser? Even with the memoization approach, we still need to handle it. This technique wasn't introduced in the original packrat parsing paper; it was developed later, in follow-up papers. Guido's posts talk about this specifically, so I will follow Guido's posts to explain his idea in CPython: he used the second approach, which grows the result in the memo. As the previous example showed, we write a parse function for each non-terminal.
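The two Fibonacci versions described above look like this; lru_cache really is just the two extra lines:

```python
from functools import lru_cache

# The slow version: exponential time, because the same
# subproblems are recomputed over and over.
def fib_slow(n):
    if n < 2:
        return n
    return fib_slow(n - 1) + fib_slow(n - 2)

# The memoized version: linear time, O(n) space for the cache.
@lru_cache(maxsize=None)
def fib_fast(n):
    if n < 2:
        return n
    return fib_fast(n - 1) + fib_fast(n - 2)

print(fib_fast(90))  # instant, while fib_slow(90) would take far too long
```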
If we hit this case, the first non-terminal needs to parse the same non-terminal, so it would recursively call itself and never stop. To resolve this, we have a magic function, call it oracle parse_e. In parse_e, we don't directly call parse_e itself; instead we call this magic function. The magic function fills the computed results into the memo, and once all the filling is done, it returns the data. Then, when parse_e meets the same input again, it retrieves the result from the memo instead of calling the oracle again. That is the approach. Normal memoization is pretty simple: if the result is in the memo, return it; if we haven't computed it yet, compute it. But for the left-recursion case, we do one extra step: we put a loop inside, and the loop tries to go through all the possibilities, from the current index toward the end. After we have filled all the computed results into the memo, we can break. That is how left recursion is handled in CPython.

Now we have knowledge of both the traditional parsers and the packrat parser; let's compare them. Both the traditional parsers and the packrat parser scan from left to right, and the packrat parser additionally runs its memoization from right to left. For left recursion handling, packrat and the LR parser support it. On the ambiguity side, the packrat parser disallows it. And there are trade-offs between the packrat parser and the traditional parsers. On space complexity, the traditional parsers are better in general, because the packrat parser's space depends on the input size, while a traditional parser only depends on the depth of the tree. But on time complexity, the packrat parser is better, because it memoizes results in memory.
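The extra loop described above can be sketched for the assumed PEG rule E <- E "+" "n" / "n". The seed-and-grow structure follows the idea in the talk (seed the memo with a failure, re-parse, keep the longest result until it stops growing), but the code is my own illustration, not CPython's implementation:

```python
# Each parse function returns (tree, end_position); tree is None
# on failure. The memo maps (rule, position) to that pair.

def parse_E(tokens, pos, memo):
    key = ("E", pos)
    if key in memo:
        return memo[key]
    memo[key] = (None, pos)          # seed: the recursive call fails fast
    while True:
        result = _parse_E_body(tokens, pos, memo)
        # Stop once the parse no longer consumes more input.
        if result[1] <= memo[key][1] and memo[key][0] is not None:
            break
        if result[0] is None:
            break
        memo[key] = result           # grow the seed and try again
    return memo[key]

def _parse_E_body(tokens, pos, memo):
    # Alternative 1: E "+" "n"  (the left-recursive one, tried first)
    tree, end = parse_E(tokens, pos, memo)
    if tree is not None and end + 1 < len(tokens) \
            and tokens[end] == "+" and tokens[end + 1] == "n":
        return (("E", tree, "+", "n"), end + 2)
    # Alternative 2: "n"
    if pos < len(tokens) and tokens[pos] == "n":
        return (("E", "n"), pos + 1)
    return (None, pos)

tree, end = parse_E(["n", "+", "n", "+", "n"], 0, {})
print(tree, end)
```

Each pass through the loop re-parses the rule body with the previous, shorter result already in the memo, so the match grows one "+ n" at a time and the recursion bottoms out instead of looping forever.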
So it runs in linear time, depending on the memoized state, while a traditional recursive descent parser with backtracking can take exponential time, also depending on k. On the capability side, because the packrat parser supports infinite lookahead, it supposedly has more potential than the traditional parsers. Thanks to that, we got several new pieces of syntax in Python 3.10 and later based on PEG, and you can presumably find a corresponding talk somewhere this year, so I'm not going to cover the details here.

Let's take a look at CPython's PEG parser. Again, PEG parser here means packrat parser. Up to 3.8, CPython used the LL(1) parser written by Guido 30 years ago, and that parser required a further step producing a CST, so you needed to convert the CST to the AST. From CPython 3.9, it starts to use the PEG parser. There are three main reasons why Guido chose PEG. First, the infinite lookahead: it may support more grammar in the future. Second, and most important, Guido complained multiple times in his posts about how horrible it is to write a grammar without left recursion; PEG rules now support left recursion, which makes the grammar much more human readable. The final benefit is that there is no more CST-to-AST step, so in fact it somewhat saves memory. And from CPython 3.10, the LL(1) parser support was dropped.

The best way to learn CPython's parser is to play with it. You can find it in the Tools folder; there's a thing called the PEG generator, and it is exactly what we've talked about today: it consumes a grammar and generates the parser. It supports both a Python-based packrat parser and a C-based packrat parser, and the inputs are slightly different: the Python one only requires the metagrammar, but the C one also needs a tokens file, which the Python one doesn't need for itself. Here's an example of the metagrammar. You can see there's a subheader.
It will be added at the top of the generated parser file, along with the rules. Each rule contains, as you can see, a non-terminal, the return type of that rule, the PEG rule divider, which means "or" here, and the PEG rule itself. And because this is syntax-directed translation, you need to provide an action, and the action is written in Python code. As for the output: after you feed this to the PEG generator, it generates the parser. As you can see, this is fully generated code, and it's well written; it looks pretty similar to some of my previous examples, but it's auto-generated, and it serves the purpose we mentioned before.

Let's review the benefits and look at the performance results. The benefits of the PEG parser: first, obviously, it's more flexible because of the infinite lookahead. Second, hardware is cheaper now, so memory is cheaper, and we don't need to worry as much about packrat's memory consumption. And finally, the CST construction step was removed, so the newer implementation simplifies the workflow a bit. For the performance results, PEP 617 mentions the difference was within 10% before and after: they took tons of popular open-source programs and compared speed and memory. Overall, we can say the benefits of the PEG parser matter more, and the performance is acceptable, so they decided to do the transition.

Finally, we can talk about the takeaways. Let's recap first. We learned what the parser is in Python. We learned Parser 101, including CFGs and traditional parsers; then Parser 102, including PEG and the packrat parser; and finally how CPython implements and uses its PEG parser. People may ask me: after this talk, what does it mean for me? I would say: I verified my understanding. So I hope that if you are interested in the same topic, you will try to get your hands dirty.
The way I did this was to finish all kinds of parser implementations against the LeetCode question called Basic Calculator. If you want to learn this the same way, I would recommend implementing them as I did, and if you need answers, you can check out my Parser Learning repository: it contains all my implementations for the LeetCode question. With this approach, you won't just learn academically by reading papers; you will get your hands dirty and produce a real result for a real-world problem.

Finally, let's look at the appendix. Here are a few related articles, including, of course, Guido's PEG parsing series overview. It covers all the topics we talked about today, such as the initiative, the pain points of the original parser, and the detailed implementation of the new parser, including the left recursion handling. The second important person is Bryan Ford; he is the author of packrat parsing and PEG, so if you want to read the mathematical proofs, his publications are the thing worth reading. For related talks, if you want to know more about Guido's posts, he personally gave a talk in a couple of places titled "Writing a PEG parser for fun and profit"; that would be the thing worth watching. If you want to know more about PEP 617, its other two authors were actually interviewed in a Podcast.__init__ episode titled "The Journey to Replace Python's Parser and What It Means for the Future"; the content is rich and covers all sorts of details in PEP 617. And remember, I talked about where the parser sits in CPython, but I only focused on the parser today. If you are interested in the other steps of that compilation diagram, here are two talks I recommend: first, Emily Morehouse's "The AST and Me", and second, Alex Gaynor's "So you want to write an interpreter?". They both cover the entire picture. All right, I think that's the end of today's talk. Hope you enjoyed the content.
Thank you so much for that, Kier. Are you available to do a live Q&A, or would you prefer to do it in the Matrix chat room? Let me do it in the chat room maybe, because I didn't see any questions there. Yes, I don't think we've had any so far, but they do often come in right at the end. But if your internet connection means you'd prefer to do it via text, then of course we'll do that. Yes, let's wait for a while maybe, in case anyone has a question. Yeah. Well, for me, I'd like to say thank you so much for such a comprehensive talk. It's an area I know very little about, so I'm really glad to have seen you talk so eloquently about it. Yeah, it's also my first time presenting outside of Asia, so I'm really happy. Congratulations. Well, I think you knocked it out of the park. So, someone's asking: can you share your slides? There will be a way for you to do that via the conference site. I'm seeing mostly just applause in the Matrix chat room, no questions at all yet. I think above the slide, on the talk page, there's a button to download the slides; it links to SlideShare, and I already uploaded them there. Yes, someone's just shared the link. Thank you.