Awesome. So thank you very much for attending. In this talk, the idea is that we're going to do a deep dive on what makes Python Python, right? Because we have several pieces that people normally identify as Python. We have the interpreter, but as you know, there are multiple versions of it: we have CPython, PyPy, Stackless Python, IronPython. So it seems like the interpreter is usually implementing something else, right? Languages usually have what is called a spec. For example, C++ has the C++ standard, which is a book you could kill someone with. But Python is a very interesting case, because CPython is not only one of the implementations of the interpreter, it's also the reference implementation. So the question is: what is Python, really? When people think about Python the language, the text that you write and that is then executed by something, what exactly is that? So, some introductions first. I'm Pablo, I'm a Python core developer. I work at Bloomberg in the Python infrastructure team. Before that, I used to do black hole physics, so if someone is interested in talking about that, I'm always happy to. I'm one of the authors and the implementer of PEP 570, and I spend my free time just fixing bugs in CPython. Okay, so let's start with a cool question. Who thinks this is valid Python code? Raise your hand if you think it's valid. Raise your hand if you think it's invalid. That doesn't add up to everyone. So, this is actually valid Python code. If you look, this x with the colon is an annotation, property is just property, this thing that looks like a C++ template is actually a chained comparison saying property less-than t greater-than x, that's an unpacking, that's an ellipsis, that's object, and that's a class. This is actually the AST of that expression.
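As a simpler variant of the same trick (the exact snippet from the slide isn't reproduced here), here's a sketch showing that an annotation inside a function body is parsed but never evaluated — every name inside the annotation below is made up:

```python
def spam():
    # This "annotation" references names that exist nowhere. The parser
    # accepts it, and because annotations on local variables are never
    # evaluated (PEP 526), calling spam() works fine.
    x: totally_undefined[template_like, ...] < whatever > object = 42
    return x

print(spam())  # 42
```

The same mechanism is what makes the slide's much weirder example both parseable and runnable.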
Not only that — I first thought, oh, nice, the parser parses this thing, but if you actually type this in and call spam, it actually returns, because the annotation is never executed. So it's not only that the parser accepts it: you can actually run it. It's not a problem. So the idea here is that what people normally think of as Python is really a small subset of what Python is, or at least of what is described in the official Python grammar. And we're going to explore exactly how big the difference is between that and what people think Python normally is. For that, we need to introduce some topics regarding grammars, so let's start there. The Python grammar is described in the Python source code in a file called Grammar, in the Grammar folder. But first we need to introduce how this file is written. In this file you will find things like this: you will find rules. A rule is basically written as a name, a colon, and then a rule description; we are going to see what that is. Then you can have many alternatives for the rule, right? You can have the same rule name with two different descriptions. These descriptions are usually called productions, because the grammar is written in a way where the straightforward thing to do, given this form, is to generate actual programs, not to recognize them. Usually, when you have two alternatives for a rule, people write it like this — this rule name has this alternative or this other alternative — but it's the same thing. Then you will find these kinds of constructions, which mean that this part of the rule appears one or more times; it's very similar to regular expressions, as you can imagine. The asterisk means zero or more. And then you can also find square brackets, which mean that this chunk here is optional, right?
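The notation maps almost one-to-one onto regular expression quantifiers, which is an easy way to remember it — a quick sketch:

```python
import re

# Grammar notation      regex analogue
# rule: thing+     ->   one or more occurrences
# rule: thing*     ->   zero or more occurrences
# rule: [thing]    ->   optional (zero or one)
assert re.fullmatch(r"(ab)+", "abab")   # '+': one or more
assert re.fullmatch(r"(ab)*", "")       # '*': zero or more matches nothing too
assert re.fullmatch(r"(ab)?", "ab")     # '[...]': optional
assert not re.fullmatch(r"(ab)+", "")   # '+' needs at least one occurrence
```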
So this is the complete metagrammar. The metagrammar is the grammar that describes the grammar itself, right? These are the rules for the file we just looked at. The top rule is the one you start with, and then it recurses down. So this is just for reading the grammar file. Using these expressions, we can look at the actual Python grammar, which looks like this. For example, we recognize file_input — this is one of the start rules. It says: a file written in Python will be a newline, or zero or more statements, and an end marker. And you ask, okay, what is a statement? A statement is a simple statement or a compound statement. And what is a simple statement? And then you keep going down, right? If you go far enough, you start seeing things you can actually recognize. For example, in compound statement, which is the bottom one, you will find the if statement, the while statement, the for statement — and you can probably imagine what those are. Okay, so let's keep looking in this file. For example, if you look at the if statement, you will find that an if statement is the string 'if'. And we will see what the difference means, but the difference between the strings in yellow and the ones in white is that the yellow ones cannot expand any more: the string 'if' is literally the string 'if', it's not another rule. And you can more or less recognize the structure here, right? An if is 'if', then something that is probably a condition, then a colon, then suite, which is basically a block, and then 'elif' with the same structure. So it's very simple, but one of the things we need to understand is how the parser can actually recognize this and decide whether a program is valid or not. Okay, so more definitions.
So, as I said, the yellow ones are called terminals, because when you are expanding a rule, they cannot be expanded any more — they are just strings. The white ones are called non-terminals, and those refer to other rules in the file. This is going to be very important in the following slides. Okay, so let's talk about LL(1) grammars. It turns out that Python has a sort of weird feature, which is that its grammar falls into a category called LL(1). There are very few languages with this property, and you will see how it percolates through the language. One thing we are going to understand is how something that should be purely technical — the fact that the grammar is LL(1) seems like an unimportant implementation detail — actually percolates through the language and has a very deep impact. So what is an LL(1) grammar? Technically, the definition is a bit boring: an LL(1) grammar is a grammar that can be parsed by an LL(1) parser. Right — thanks, now I understand everything. So let's see what an LL(1) parser is. The idea is that you consume the tokens you receive left to right — that's the first L. You do leftmost derivation, so you expand rules from the left — that's the second L. And then, and this is the key point, when you are parsing you can only look at the next token — that's the 1. Which means, for example, that if you are parsing an if statement and you need to decide which rule follows, you can only look at the next token in the stream, not more. And this is going to be super, super important.
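A hand-written LL(1) parser makes this concrete: at every decision point it peeks at exactly one token and dispatches on it. A toy sketch (the rule names here are invented, not Python's):

```python
def parse_statement(tokens, pos=0):
    """Decide which rule to use by peeking at exactly ONE token."""
    lookahead = tokens[pos]
    if lookahead == "if":
        return "if_stmt"        # dispatch on the single lookahead token
    elif lookahead == "while":
        return "while_stmt"
    raise SyntaxError(f"unexpected token {lookahead!r}")

print(parse_statement(["if", "x", ":"]))     # if_stmt
print(parse_statement(["while", "x", ":"]))  # while_stmt
```

The whole talk is about what happens when one token is not enough to make that choice.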
It turns out that Python has two hidden extra rules. In an old checkout of the source code there was a comment explaining this, but it has been lost — it was lost even before the port to Mercurial. The extra rules are these two. First, it turns out that Python's grammar doesn't have empty productions, usually called epsilon productions. An empty production is a rule that has the empty string as one of its possibilities — the empty string satisfies the rule. It turns out that empty productions make everything much more complicated, but Python doesn't have them. When you create an LL(1) parser, you need to build these things called the follow sets and the first sets, but if you don't have empty productions, you don't need the follow sets. We are going to explain what these are, so don't worry. And there is another rule, which is not strictly enforced, but we will see how it manifests: at some levels of the grammar, only the last alternative can start with a non-terminal. Let's see an example. This is more of the grammar, and I want you to focus on this flow statement here. You will see that a flow statement is a break, a continue, a return, a raise, or a yield. And look: the break alternative starts with a terminal, which is the word 'break'. Continue starts with a terminal, 'continue'. Return the same. Only the yield statement, here, starts with something that expands again. This is what I mean: only the last option can start with something that expands. This simplifies the parser a lot, but it's actually not enforced — it's just a simplification at some levels, and it makes things very fast. Okay, so let's see what other pieces we need: the first sets. Because imagine this, right?
When you are parsing, you can only look at the next token to decide which rule to use. If you think about it, it's very important to know, given a rule — say, the if statement — which tokens, which terminal strings, that rule can start with. This matters because you can only use one token of lookahead, and if the rule you are trying to parse cannot start with that token, you know the input is invalid: you can only look at one token, this rule doesn't start with that token, so it's invalid. So the first sets are basically all the tokens a rule can start with. We can see, for example, that a simple statement can start with a lot of tokens, like the asterisk or a lambda; that the print statement can only start with 'print', which makes sense; that the return statement can only start with 'return'; but other ones, like yield argument, can start with a lot of things. And this is going to be key in a few slides. Okay, so let's see how CPython actually works. It turns out that the parser of CPython is not actually written by hand: what we have is a parser generator. We have a generator that automatically receives the grammar and generates a parser for you, and this makes everything much, much easier. Let's see how it works — it's actually relatively simple. The global idea is that we take the grammar — that's the EBNF, extended Backus-Naur form, this text file with the rules. From it we construct something called non-deterministic finite automata, which sounds very scary but is actually pretty simple. And then we simplify those into what are called deterministic finite automata. Let's see what these are. A finite automaton is defined by a set of states. It's basically a flowchart, right?
You follow arrows and you end up in different states. You have a set of states, which are the places where you can be. You have transition functions, which tell you whether you can move from one state to another. And then you have a start state, and you can have one or more final states. Let's see an example. They look like this. You start at A — let's say that's the start state. And then you have transitions labeled with lowercase b and lowercase c. These labels are basically the words in the Python programs that you write, so they will be things like 'if', 'for', 'while'. Depending on whether you see a b or a c, you move to state B or state C. Pretty easy. And then you can have a final state. So if this automaton is recognizing a language and I end in state D, the program is valid. If I am, for example, in C and the next token is, I don't know, h, the program is invalid, because I can only move to the final state if I have a d. That's basically what the parser does: if the parser can reach the final state, your program is good and it continues; if the parser sees that it needs a d and you don't have one, it raises a syntax error. As easy as that. But it turns out that we have two kinds of automata: non-deterministic ones and deterministic ones. So what is this non-determinism? It turns out there are two kinds of non-determinism here. One of them: imagine that from one of these states — one of these uppercase A's — you have two transitions with the same letter. This is one source of non-determinism, because which one do you choose? You could actually choose both; you don't know. The other kind of non-determinism is if you have this scenario.
So this epsilon — the empty string again — is a transition that you can take even if you don't have a token. In this case, if I am in A, even without consuming any token, or even if I have one and choose not to consume it, I can move to C for free. That's the other kind of non-determinism, because the state can be A, or actually it can be C if you want — you don't know. Cool. So these things are actually pretty easy to construct and pretty easy to reason about, but when you try to run them, it's very difficult to implement a runner, because you can end up in cycles. So the idea is to simplify this into something that is easier to execute. For example, look at this one: you start in A, and you have some epsilon transition; C can transition to itself on an a, but on an a from A you can also go to B. It's a simple automaton — it only has three nodes — but it has a lot of non-determinism. The idea for constructing a deterministic one is actually pretty simple: if I can transition to two nodes on the same token, it's as if the automaton is in C and B at the same time, right? If you are in A, you can imagine a parallel universe in which you go to C and a parallel universe in which you go to B — you can consider the whole multiverse, if you want. So really it's as if you have a new node called BC, which is a set of states, and now you have only one transition, from A to BC on an a. And if you look at epsilon transitions: because epsilon is free, it's as if the automaton is in A and C at the same time. So the idea is that this thing becomes a single node AC, right?
And then AC keeps whatever transitions those states had, plus the transition to B on its token. So the idea is to apply these two rules: when you have the first situation, you collapse it the first way; when you have the second, you collapse it the second way; and you apply this iteratively. The problem is that if you do that for this automaton, you end up with this one, which has no non-determinism but is much more complicated. That's the price you pay: it has more nodes, sometimes it has empty nodes, and it has these power states — all the power-set combinations. It can have up to two to the power of N nodes. So, summarizing the difference: a non-deterministic automaton can have empty-string transitions; a deterministic one cannot, so it's easier to reason about. The non-deterministic one is simple, as we saw, with fewer nodes; the deterministic one has more nodes. One technical thing: with a non-deterministic automaton it's very difficult to implement backtracking, which you sometimes want in your parser; with a deterministic one it's trivial. And the key point is that the non-deterministic automaton is super simple to construct from the grammar — you basically just read off the rules — while for the deterministic one you need the process we just showed. Okay, so lots of theory, but let's see how it works on the actual Python grammar. This is one of the rules that you have in Python: the rule for factor. It covers the unary plus, minus, and tilde, and then it refers to more rules. This is its non-deterministic automaton. As you can see, if you start from the top, this state zero here has transitions without any token, right? These arrows don't have a token, like this one.
For example, from state four you go to state eight if you find a minus sign, but from state zero to state one there is an epsilon transition — they are here. And if you are in state zero, you don't have any clue whether to go right or left; you don't know. If you apply the process I showed you, you end up with this other automaton, and now you have interesting things: there are no epsilon transitions any more. In state zero, you transition to state one if you find one of these three tokens, so you can always decide which arrow to choose. But the process leaves some nodes behind — for example, this state five is there but it's not connected. So there is an extra step the parser generator does: it takes this automaton, which is already deterministic, and simplifies it into this. Basically, it removes nodes that are never used and may also collapse some of them. For example, these states two and three on top of factor are also eliminated, because you can never reach them. But this is the process: you construct the non-deterministic automaton, you make it deterministic, and then you simplify. Let's see some other examples. This one is for comparisons — comparing two things. Here is the non-deterministic one, the deterministic one, and, as you can see, the final one is actually very simple — but this is also a very simple rule. This, for example, is at its actual size. The point is not that you need to read and understand these; I just want to give you a feel for how the parser simplifies these structures. You can just count the arrows if you want — it's not a good measurement, but you can see more or less how it works. State zero will always transition depending on the sign.
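The NFA-to-DFA conversion described above fits in a few lines. This is a sketch over a made-up three-state NFA like the earlier A/B/C example (A has an epsilon edge to C, and 'a' edges lead to B and back into C):

```python
from collections import deque

# Toy NFA: (state, token) -> set of next states; "" marks an epsilon edge.
nfa = {
    ("A", ""): {"C"},   # epsilon: being in A means also being in C "for free"
    ("A", "a"): {"B"},
    ("C", "a"): {"C"},
}

def eps_closure(states):
    """All states reachable through epsilon edges alone (the 'multiverse')."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, ""), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def to_dfa(start):
    """Subset construction: each DFA state is a *set* of NFA states."""
    tokens = {tok for (_, tok) in nfa if tok}
    start_set = eps_closure({start})
    dfa, todo, seen = {}, deque([start_set]), {start_set}
    while todo:
        current = todo.popleft()
        for token in tokens:
            nxt = set()
            for s in current:
                nxt |= nfa.get((s, token), set())
            nxt = eps_closure(nxt)
            if not nxt:
                continue
            dfa[(current, token)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return start_set, dfa

start_set, dfa = to_dfa("A")
print(start_set)              # frozenset({'A', 'C'}): in A and C at once
print(dfa[(start_set, "a")])  # frozenset({'B', 'C'}): the merged "BC" state
```

This is exactly the blow-up the talk mentions: the DFA's states range over subsets of the NFA's states, which is where the two-to-the-power-of-N bound comes from.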
For example, you can actually see in this graph how the grammar works even without reading the grammar itself. This is the decorated rule, and you can see in the final product that you can only decorate three things in the Python grammar: class definitions, function definitions, and async function definitions. If you want to create your own Python dialect and decorate something else, you just add something here and it will just work — we will see an example later in this talk. This one is for multiplication and division and everything with the same precedence as multiplication. This one is for the print statement. This one is for the if statement. So you can see how they simplify. I want to show you my favorite one, which is the most complicated one in the grammar: the one for parsing function definitions — the arguments in a function definition, the full parameter list, keyword arguments and all. It looks like this. But don't worry, because this is the non-deterministic one; the deterministic one, after all this complicated processing, is so much simpler. Well — oops. This actually includes everything, right? Okay, so it looks very complicated, but — whoa, nice — you can see things you will recognize, like commas; this part is for the new type comments; here you recognize the two stars. You get these attractor-like nodes, nodes with a lot of arrows going into them. And the reason this looks so complicated is not that the rule itself is that complicated, because it really isn't — it's actually a bunch of rules combined, which is something we're going to see now. The reason it looks so complicated is that the grammar being LL(1) makes the parser super fast — very, very fast — and super simple to implement, at least the parser generator.
But writing things that you may think are very simple turns out to be extremely complicated. The fact that we can write this rule at all is technically cheating, because this rule is written fully left-factored. If you tried to write it as the separate rules it corresponds to, you would run into something we will see in a moment — an ambiguity. You cannot write it that way. So you need to literally expand the alternatives, and the rule has to go almost token by token, which is something you normally don't need to do with other kinds of parsers. We will see a lot of examples of this. And this makes other things much more complicated. You will see something that a lot of people have requested for the Python grammar, even for 3.8 or 3.9, but that is actually impossible with the current parser. And if you use Black, this is a feature Black needs for some kinds of formatting, and it turns out it's impossible to implement. We will see which one. Okay, so let's talk about LL(1) limitations. LL(1) is sort of good, right? You get a very fast parser and it's very easy to implement — sort of. And a lot of people argue that the grammar being LL(1) makes the grammar simpler to read. Humans are also parsers, and when you read the grammar, you don't need context — that's a property of LL(1) grammars. You don't need to remember a bunch of things you have seen before, like in C or C++. In C, when you read a name, you cannot tell a variable from a type without knowing what you read before; here you just need the next token. So yes, sure, it makes the grammar sort of simple, and that's what people recognize as Python: Python is simple to read and simple to reason about, and that's probably very tied to the fact that it started out being LL(1) — but now it's just dragging us down.
Like, it's very difficult to implement anything new. So let's see the limitations. Imagine we have this rule, with two possible productions. The rule starts with the word 'do' — it's like a do-while loop. So it starts with the word 'do', then goes into a rule called A — we don't care what that does — then the word 'while' and then an expression. And we have another production with the same terminals, but with B instead of A. Two different productions; what they mean doesn't matter. For A and B we have the first sets here: rule A can start with lowercase a or b, and rule B can start with lowercase a or c. Now imagine someone writes this program: do, b, something, while, c, something. We need to know which of the two productions this corresponds to. But it's very simple, right? You go: do — okay, both productions start with do, that's fine. Then it's A or B. We see a lowercase b, and if we look at the first sets, lowercase b only appears in the first set of A. So we know it's the left production. Makes sense? This lowercase b only appears in the first set of the first one, so we know it's the left production. Great: the program is valid, and it's the left one. But what happens if someone writes this instead? It turns out that lowercase a is in both first sets, so you don't know which of the two productions to use. This is the ambiguity. It seems very stupid, but it's a real problem, because you don't know which one to choose, and you actually have to reject this grammar — or the parser dies. Most of the time the parser generator will tell you that it cannot produce the parser, but sometimes it actually produces a broken one. It's written in C, right? So let's see some cases where this actually happens in Python. For example, this is what happens when you call functions.
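Before looking at the real rules: the do/while conflict above can be detected mechanically by checking whether a token lands in more than one first set. A sketch using the made-up first sets from the example:

```python
# First sets of the two toy rules from the do/while example.
FIRST = {
    "A": {"a", "b"},
    "B": {"a", "c"},
}

def lookahead_choice(token):
    """Pick a production using one token of lookahead; fail on ambiguity."""
    candidates = [rule for rule, first in FIRST.items() if token in first]
    if len(candidates) > 1:
        # This is exactly the LL(1) conflict: one token isn't enough.
        raise ValueError(f"ambiguous: {token!r} is in the first sets of {candidates}")
    if not candidates:
        raise SyntaxError(f"unexpected token {token!r}")
    return candidates[0]

print(lookahead_choice("b"))  # 'A' — only rule A can start with 'b'
try:
    lookahead_choice("a")     # 'a' is in both first sets
except ValueError as e:
    print(e)
```

A real parser generator runs essentially this check over the whole grammar and refuses to emit a parser when any pair of alternatives overlaps.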
This is the rule for calling functions. You see that when you are writing an argument, the argument can be a test — a test is more or less anything that evaluates to an object. You can write a comprehension. You can write a keyword argument, which is some variable name — like, I don't know, reject equals true or something — where the value is something that evaluates to an object. You can do unpacking with dictionaries and unpacking with lists. Well, it turns out that this is actually not the rule. This is false, because written this way, it's ambiguous. Imagine you are parsing this: you are here and you can only look at the next token. Let me check the first set of test — the first set of test includes NAME, but the keyword-argument alternative also starts with NAME. So you see a NAME: which of the two do you choose? You cannot distinguish them, so this is ambiguous. But you obviously can call functions with keyword arguments, so what is the actual rule? The actual rule is this one, which is very weird, because it says that instead of a plain name, like process equals true, you can literally put an arbitrary Python expression on the left of the equals sign. This rule allows things like list comprehension equals list comprehension, or yield equals yield, things like that. But nobody is going to write that, right? So how does it work? The grammar allows it — the parser will parse it perfectly happily. But it turns out that much later, when we are constructing the abstract syntax tree, we check whether that test on the left is actually just a name, and if it's something else, we reject it. So we are sort of cheating, right?
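You can observe this late rejection from Python itself: an arbitrary expression on the left of the equals sign is rejected with a SyntaxError, even though the LL(1) grammar rule accepts the shape. (In newer CPython versions the rejection happens at a different stage and with a different message, so only the exception type is checked here.)

```python
import ast

# "test '=' test": an arbitrary expression to the left of '=' in a call.
try:
    ast.parse("f(x + 1 = 2)")
except SyntaxError as exc:
    print("rejected:", exc.msg)
```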
Because if someone, I don't know, sends the grammar file to outer space, and aliens find it and implement their own Python from it, they will allow some weird shit. And it shows how coupled CPython is with the language itself, right? Because this grammar is supposedly the spec, and if you implement just this spec, you think you're implementing Python, but you're wrong — what you accept is not valid Python. And the AST construction is a piece of CPython: PyPy does it a different way, and other implementations can do whatever. So we patch over these kinds of limitations much later, as a kind of hack. And this is actually everywhere. For example, here is your favorite operator. It turns out it has the same problem, because this is what is called namedexpr_test: it can be a test — an expression that evaluates to an object — optionally followed by the walrus, colon-equals, and another test. But the walrus part is optional, so if you have a token — say a NAME — and you need to decide whether you are in the optional walrus part or just in a plain test, it's the same problem. As we saw before, test includes NAME in its first set, so this is again ambiguous: given a NAME, you don't know whether you are going to use the walrus operator or just parse a test. So the actual rule is, again, test, colon-equals, test. Once more, you can put a list comprehension, walrus, list comprehension, and someone will complain much later. Awesome. So what about this? This is something that makes us very sad. In Python 3, you can have multiple context managers in the same with statement. This is something that Python 2 didn't have.
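For reference, this is the feature in question: several context managers in a single with statement, no nesting. `nullcontext` is just a stand-in context manager from the standard library:

```python
from contextlib import nullcontext

# Two context managers in one with statement, on one line.
with nullcontext(1) as a, nullcontext(2) as b:
    print(a + b)  # 3
```

The whole problem discussed next is how to let this list of context managers wrap across lines inside parentheses.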
Because in Python 2, if you had six of them, you needed to nest a with statement and indent one more level every time. In Python 3, you can write them on one line. But if you have so many of them that they don't fit on one line, how do you do it? Suddenly you need to write this: a backslash continuation line. Which is very weird, because formatters don't know how to deal with it — that line allows arbitrary indentation on the next line, because for the parser it is literally the same line — and it's always very odd to write. [Audience points out something missing on the slide.] Yeah, I put it there — someone complained about the same thing the last time I gave this talk, and I remembered to add it, but apparently not on all the slides. It's actually a different font. Cool. So what do you want to write instead? You want to write this — the same as with imports, right? With imports, you open a parenthesis, you put a bunch of them, and you close the parenthesis. Formatters know how to format these things because they are delimited. So you want to write that. Nice, right? You cannot. You could say: the rule the grammar has is the top one — a with_item, and then zero or more commas each followed by another with_item. So let's just add another alternative: the same thing, or an open parenthesis, the same rule, and a closed parenthesis. But it turns out that the first set of with_item includes the open parenthesis — the very parenthesis you just added. So you can do something very weird, which is bad: you can exploit a quirk of the parser generator where, if you write this rule in a specific way, it says everything looks fine and always chooses the first production. And then you can actually write these things.
It turns out the problem is that this allows one very weird thing: writing with, open parenthesis, yield, closed parenthesis — like a generator that admits sending values in. And the Python test suite has exactly one instance of that construct. So when I was playing with this, it failed one test, totally unrelated to any grammar test — it was in some async code. So the sad story is that we cannot implement this with the current parser. It's impossible. I have spent hours writing extremely weird rules for this — like, I don't know, suppressing the indentation tokens so the parser thinks you're writing multiple lines when you're not; it doesn't work. Instead of saying open parenthesis, closed parenthesis, I have tried saying optionally open parenthesis and optionally closed parenthesis, and then fixing it up in the AST. It turns out you cannot — or at least not easily. So sadly, you probably won't have this rule in 3.9, or even later, if we don't change the parser. Sorry. But cool — the takeaway is this. Yes, you will see a lot of people saying the Python grammar is great because it's one of the few LL(1) grammars, and that's an extremely nice property. We should be thankful, because it sits at the very core and it keeps things very simple — which is sort of true, don't get me wrong; I think everybody sort of agrees with that. But it has side effects that are not desirable. One of them is that this should be a purely technical thing: it just happens to be an LL(1) parser, and LL(k) parsers can also parse LL(1) grammars, more or less, so you could maybe choose a different grammar formalism. The problem is that what should be a technical decision percolates over everything and has impacts like this one, right?
Like, this is the end of the pipeline, right? Something that is a very technical decision has the consequence that you cannot add this rule, which is actually desirable. So it's interesting, because it makes you reflect on what exactly Python is, what is actually written down, and all these alien things. So let's continue with the idea, but this is one of the core topics of this talk that I want you to think about, especially when you see new grammar rules appear in the language, and when you hear people talk about what is or isn't "pure Python". As you saw with the first example, what Python really is, is not what you think it is, or at least it allows more than you think it allows. Okay, let's continue. So what does the parser produce? We saw how the grammar is described, so what is the end result of the parser? The end result of the parser is a parse tree, also called a concrete syntax tree. So for example, you have x plus y, and the parser generates this thing. These numbers are linked to #defines in a C header, so if you substitute in the names, you get this. This is basically a tree written out flat, and the idea is that the expression goes rule by rule: it starts with a valid input, then testlist, then it goes down, down, down the grammar until it finds one of the tokens it can actually recognize, in particular at the term rule there, and it keeps going deeper until it finds the x; then it backtracks a bit, goes back up the stack, finds the plus sign, and does it again. And the point is that this is an intermediate thing, because it is still very concrete to your parser: if I write different rules, this tree will be different.
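The first stage of that pipeline, turning `x + y` into tokens before any tree exists, can be observed with the stdlib `tokenize` module:

```python
import io
import tokenize

tokens = [(tokenize.tok_name[t.type], t.string)
          for t in tokenize.generate_tokens(io.StringIO("x + y").readline)]
# The stream contains the two names and the operator, plus bookkeeping
# tokens (NEWLINE, ENDMARKER) that the parser also consumes.
assert ("NAME", "x") in tokens
assert ("OP", "+") in tokens
assert ("NAME", "y") in tokens
```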
And the idea is that there is an extra step, which we are not going to talk about here, that converts this thing into the AST, the abstract syntax tree, which is abstracted away from the parser: it is linked to the language itself rather than to how the rules happen to be written. The abstract syntax tree here is basically just an expression with three nodes: two for the operands and one for the plus. But the concrete tree is still useful. It is actually what formatters usually use; Black in particular uses these things. And the reason it is useful is that at this level you still have information about comments (the AST doesn't have comments, because they are not part of what you can execute) and about line breaks. So you can take this thing and reconstruct your original program, and this is what formatters do, because they don't want to lose your comments or whatever special indentation you made. So even if this is an intermediate thing, it is still something useful. Okay, so let's see how you make a new grammar rule. Let's see if I can do this quickly. Awesome. Let's make this bigger. So if you start an interpreter here (this is Python 3.9, from the future), it turns out that if you try to use a decorator and you put a lambda there, say lambda something, maybe 32, it says: wait, this is invalid syntax, because you are not allowed to write lambda decorators, or basically arbitrary expressions. So let's actually change that, right? How do you do this? It's actually pretty simple. This is the CPython source tree. You go to this grammar file, which lives at Grammar/Grammar. You find the rule for the decorator, which is this one. It says: an at-sign and something called dotted_name. So let's change that to testlist, which is basically everything.
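To see the AST for `x + y` described a moment ago, you can ask the `ast` module directly:

```python
import ast

tree = ast.parse("x + y", mode="eval")
node = tree.body
# One BinOp node: an Add operator with two Name operands.
assert isinstance(node, ast.BinOp)
assert isinstance(node.op, ast.Add)
assert node.left.id == "x" and node.right.id == "y"
```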
And then you say, okay, there is an extra step, and I'm going to defy all the gods of live coding and do live C coding. Let's go. It's actually pretty simple. You see, this is the code parsing the nodes; it's very boring code. It says ast_for_dotted_name, so let's change that to ast_for_testlist. Oh, it will fail, because we have not regenerated the parser. So we run make regen-grammar, which regenerates the parser, and then we compile. Awesome, it's finished, right? Let's see what happens. At-sign, lambda, 32. Good. Define a function that returns something. And what is this function now? It's 32. Awesome. Much better, right? Don't do that at home. Okay. Awesome. So let's see another one, right? This one was actually pretty simple, and the reason is that it is just a substitution rule: you are making the grammar less restricted, which is always easy to implement. So let's see how we implement a more complicated one. It's not the most complicated one you will ever see, but let's implement this: a new operator. It will be an arrow, and it has to be a right arrow, because a left arrow is just a less-than sign followed by minus-something, which is ambiguous. So let's implement this. This is going to be something where you put an object on the left, say a, you write the arrow, and then you put an object on the right. And this will invoke some magic method; let's call it __right_arrow__ or something like that, right? So how do you do this? You go to the grammar file, and let's say the arrow has the same precedence as the multiplication sign. So you add the arrow here, next to multiply, matrix-multiply, divide, and all these. Okay. Then you regenerate the grammar, as you saw just now.
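You no longer need a patched interpreter to try the end result of the lambda-decorator demo: PEP 614 (Python 3.9) relaxed the decorator grammar to allow arbitrary expressions. The snippet below compiles the source at run time, so on pre-3.9 interpreters it fails cleanly with a SyntaxError instead of refusing to import:

```python
# PEP 614 (Python >= 3.9): any expression can be a decorator.
src = "@(lambda f: 32)\ndef spam():\n    pass\n"
ns = {}
exec(compile(src, "<demo>", "exec"), ns)
# The decorator replaces the function with the lambda's return value.
assert ns["spam"] == 32
```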
It turns out that because this is a new token, you need to go to this file and say: okay, there is a new token, it's called right arrow, and it has this arbitrary number here. And then you need to teach the tokenizer to recognize it. It looks like, you know, C code or whatever, but it's actually pretty simple, because the way it works is a switch here that is processing one character at a time. And you say: you see a minus, good; you then see a greater-than sign, good; so emit the right-arrow token. It's pretty simple, right? Okay, good. So now with these changes, you run the tokenizer and, nice, it recognizes the arrow as an operator. Cool. So we have already taught the tokenizer, and we can move forward. It turns out that now we need to teach the parser to emit this thing, but that is already implemented, because the rule for multiplication and division is agnostic of the actual sign you put in the middle. So you can keep that. And now what we need is to teach Python to emit bytecode for the new thing, so you need to go to the compiler. The compiler receives the AST and emits the bytecode. So you go to the compiler code for binary operations; this is like a plus b. You'll see this is not in grammar terms anymore, it is already abstracted. And then you say: okay, if the actual operator is one of these arrow operations, emit this new opcode that I'm going to call the right-arrow operator. Then you need to go to the file defining the opcodes and say: there is a new bytecode target, which is this one; it's just adding the name. What you are saying here is basically: if you see this token, emit this node for the AST to recognize. Actually, you don't need to understand all of it; you just need to see that it's actually very simple, even for C code. So at this point, if you check the bytecode, you will have an opcode like right-arrow something.
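As a point of reference, stock CPython's tokenizer already knows a right arrow: the RARROW token used for return annotations. You can watch it being emitted with the `tokenize` module:

```python
import io
import tokenize

tokens = list(tokenize.generate_tokens(
    io.StringIO("def f() -> int: pass").readline))
arrows = [t for t in tokens if t.exact_type == tokenize.RARROW]
# Exactly one '->' operator token appears on this line.
assert len(arrows) == 1 and arrows[0].string == "->"
```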
And now the thing that is left is teaching Python to call something on the class. And that's very simple, right? So this is the eval loop, in ceval.c; it's one of the core files. And what it says is basically: if we have the target, the right-arrow opcode, then grab the object on the left and grab the object on the right. It's a stack machine, so it just pops them from the stack. And then what you are going to call is a function, number-right-arrow-operator, that I'm going to implement. What is left here is some cleanup; the only important part is these first three lines, which give me the two operands and call this function. And the function is going to be super simple, because all of the machinery is already implemented. So I just need to go to the file defining the type object, so, every class, and say: okay, there is a new magic method, plus the reflected one, because it has to fall back to the other operand if the first one doesn't have it. It is called __right_arrow__, and it has this signature here. This is just registering it. And the function that I said I needed to implement basically calls binary_op, which is already implemented, with this thing here. Again, you don't need to know exactly what to change, but if you do, it's actually just one line. And just with that, literally, you compile and you can write this code. So you create this arrow morphism to make the Haskell people happy. It receives a function, which it stores, and then you implement this __right_arrow__, which is basically a map: it maps the stored function over whatever is on the right. And now you can write f, arrow, this list, and it will square every element. And it works. I mean, you have to believe me on that. Awesome. So it's that simple.
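Without rebuilding CPython you cannot get a real `->` operator, but the semantics of the demo class can be sketched in plain Python by borrowing an existing operator: here `>>` via `__rshift__` stands in for the hypothetical `__right_arrow__`.

```python
class Arrow:
    """Stores a function; 'arrowing' it at a sequence maps the function."""
    def __init__(self, func):
        self.func = func

    def __rshift__(self, seq):  # stand-in for the talk's __right_arrow__
        return [self.func(x) for x in seq]

f = Arrow(lambda x: x ** 2)
# f >> [1, 2, 3] squares every element, like the f -> [1, 2, 3] demo.
assert (f >> [1, 2, 3]) == [1, 4, 9]
```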
It's actually not that simple. The two PEPs that were implemented in 3.8, the walrus operator (PEP 572) and PEP 570, positional-only arguments: those PRs change something like 2,000 lines of code. With the walrus in particular, the problem is the scoping; there are a lot of scoping rules and it's very difficult. Emily did a really excellent job there. And it turns out that the slash is actually very difficult to add, because, do you remember that enormous, gigantic automaton for the rule? You need to fit something in there, which means playing this game with the parser states. Not cool. And you need to keep it more or less fast. Okay. So let's wrap up. What did we learn? We learned what the Python grammar is, basically how it's defined, why it's defined that way, and how it works. We also learned how code is parsed, at least in CPython, right? This is tied to the interpreter; other interpreter implementations will do different things. You have also learned how we generate the parser, and that it's actually a very fast parser. I think someone will maybe be able to confirm this, but I think PyPy also has an LL(1) parser. This is an interesting thing to think about, because it turns out that a lot of packages and other implementations of Python are somehow already relying on the grammar being LL(1), right? Like formatters, packages that give you better parse trees, implementations of the language, et cetera. So imagine that we want to put these parentheses around the with statement and we reimplement the parser; we have actual discussions about this, and there is testing and experimentation on the implementation. Imagine that we change this into something more general, like PEG or LL(k) or something like that. That would basically make life for everyone much harder, right?
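For reference, the two 3.8 features mentioned above look like this on a current interpreter (the source is compiled at run time, so the snippet fails cleanly rather than at import time on pre-3.8 Pythons):

```python
# PEP 572 (walrus) and PEP 570 (positional-only parameters), both in 3.8.
src = (
    "result = (n := 10) > 5\n"   # walrus: bind and test in one expression
    "def div(a, b, /):\n"        # a and b can only be passed positionally
    "    return a / b\n"
)
ns = {}
exec(compile(src, "<demo>", "exec"), ns)
assert ns["n"] == 10 and ns["result"] is True
assert ns["div"](10, 2) == 5.0
```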
Because now all the LL(1) parsers out there would not be able to parse these new rules, or implement this parenthesized expression, with their parsers. So it's something to think about, right? Because CPython is sometimes seen as just, you know, one implementation of the language, the main one or one of the most used, but it's actually extremely coupled with the language itself. So it's something to consider: every time you change something, it has effects everywhere. And then we learned how all of these things work together. We also learned how, using the pieces we saw before, like the parser and all that, we can change the language in different ways. The more complicated changes are genuinely more difficult, but the simple ones are great, and that is actually a sign of good design: the fact that this has been so easy is because there is a strong design behind all these pieces. It's an old implementation, but it's very solid. And you have also learned something that I think is very important for this talk, which is the limitations of the grammar. Even if you think a rule is very, very simple, right? Like, you have parentheses, and you may think you can put parentheses everywhere. It's super easy to fall into one of these traps. Everywhere, literally everywhere. It's very difficult to implement new grammar rules while keeping everything consistent, which is also very important. Even if you manage to implement a new grammar rule, the problem is that you can be limiting everything else. For example, the yield expression also constrains a lot of things, because it forces certain paths in the grammar just to be able to work the way it works. So the point here is that even if you manage to implement something, you may be limiting other things in the future. It's something to think about if you want to maintain an LL(1) grammar.
And this is everything I have. So I don't know if there is any time for questions, but I hope you enjoyed it. Thank you, Pablo. We have a minute for questions. Question: with the with statement, adding the parentheses for the parenthesized with list, is that ambiguous because there is another thing that you can do with parentheses there? In the with statement? Yeah. So you can write, for example, this is valid, let me change this for a moment: a yield, or anything that has parentheses. So for example, let's write with, open parenthesis, yield, close parenthesis. You can write that. And that parenthesis, well, if you write this, which is not valid, right? If you write this now, when the parser sees the open parenthesis, it goes into the node that parses the yield, and when it sees the right parenthesis, it says this is unbalanced, because I already parsed it. Technically, you can also do this, right? You can write, like here, an arbitrary number of parentheses, which is a bit silly, but it's valid as well. And the problem is that this parenthesis belongs to the test rule, particularly this one here. Actually, I can search for this chunk; it's the important one here. You see the atom rule here: it includes the open parenthesis, and it goes into a yield expression or a testlist_comp, which, as you already know, is everything. So you can basically write open parenthesis, whatever, and it recurses. So that's one of the problems. Thank you very much. Unfortunately, there's no more time for other questions. Thank you very much.
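The arbitrary-parentheses point from the Q&A is easy to check: redundant parentheses around a single context manager all collapse in the AST, because they come from the atom rule, not from the with statement itself.

```python
import ast

tree = ast.parse("with (((a))): pass")
item = tree.body[0].items[0]
# All the parentheses collapse; the context manager is just the name.
assert isinstance(item.context_expr, ast.Name)
assert item.context_expr.id == "a"
```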