So I have the honor to introduce Pablo, who is another member of our Python infrastructure team. I remember Pablo joining as a pre-FSD many years ago now, which is very scary. So, with great pleasure, here is Pablo talking about the soul of the beast.

Awesome. Thank you very much for coming. In this talk I would like to go deep into an interesting question, which is: what makes Python Python? We have two main parts that people usually recognize as Python. One is the interpreter, which is normally what you get when you type `python`: probably CPython, the one that appears, which is the default implementation. But the interpreter can be written in many languages as well; we have other implementations, for example in Java, like Jython, or IronPython. So CPython is not technically what makes Python Python. There's another part that people usually recognize as Python, which is the grammar: the kind of programs you write and the expressions you use. So in this particular talk we are going to center on all the work surrounding that particular aspect of the language, and on how tiny technical details of how it is made up have a big impact on the code you write.

I will start with an interesting question: is this valid Python code? Ten seconds. Who thinks it's valid Python code? Raise your hand if you think it's invalid. Well, this doesn't add up, right? Okay, so actually this is valid Python code, and this is the AST of this thing. You don't need to understand it; it just means that Python is able to parse this code. There's some clever stuff in there, like this thing that is actually a minus sign followed by a greater-than sign, so it resembles something horrible, right? What I want to say with this is that the code people normally recognize as Python is actually a tiny subset of what the language allows, and there are some very twisted ways of writing valid Python programs.
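You can check for yourself that a strange-looking snippet parses by asking the `ast` module. The snippet below is a hypothetical stand-in for the one on the slide (the slide's exact code is not reproduced in the transcript), abusing stacked unary minuses and chained comparisons:

```python
import ast

# Hypothetical stand-in for the slide's example: stacked unary minuses and
# chained comparisons make this look broken, but it is perfectly valid Python.
weird = "a = 5; b = - - -a > - 1 < (not not True)"

tree = ast.parse(weird)        # would raise SyntaxError if it were invalid
print(type(tree).__name__)     # Module
print(len(tree.body))          # 2 statements
```

Printing `ast.dump(tree)` shows the same kind of tree the slide displays: the parser happily produces `UnaryOp` and `Compare` nodes for it.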
And in this talk we're going to try to understand how these rules are actually laid out and what impact they have. So let's start with some grammar basics, because we are talking about the grammar. The grammar is basically a document that explains what is valid Python. So let's start with how this document is described. We are going to have rules, and a rule is described as a name for the rule (this is also called a production) and then some description of what the rule is. You have to bear in mind that what the grammar describes is actually how to generate Python programs. So what the parser is going to do is grab the grammar and sort of reverse-engineer it, solving the opposite question: given a particular text, is this a Python program that the grammar can generate? But bear in mind that the grammar is laid out in a way that makes it very easy to say, okay, give me a random Python program. So we have rule names on one side and rule descriptions on the other, and we can have many descriptions for the same rule, which means that the rule can produce any of those things. Usually, instead of writing it that way, we use the or operator, `|`. So this means that this rule can be this chunk or this other chunk. We also have the plus sign, and the plus sign means one or more: this rule is one letter `a`, or more than one.
It's very similar to regular expressions, although it's actually much more powerful than them; but you can get familiar with how to read this if you keep regular expressions in mind. In the same way we have the asterisk, which means zero or more. Then we have these square brackets, which mean optional. So we can have something like: this rule is an `a`, optionally followed by this particular other rule.

So, now that we have this more or less in mind, this is a chunk of the actual grammar of Python. Let's see how we interpret it. We start with the first rule, which is `file_input`, and we say that what you can write in a file is a newline, then zero or more statements, and then the end of the file. So you ask: okay, what is a statement? Then you go here and see that a statement is a simple statement or a compound statement. And then you ask: okay, what is a simple statement? It's a small statement, et cetera, et cetera. And in the compound statement we see things that we more or less recognize already, like the if statement and the while statement. So all of these are things that are possible, and the idea is that the whole document (this is only a subset) describes what is possible in the grammar.

Then we have to distinguish two kinds of symbols here. One is what we call a terminal. A terminal here is a word surrounded by quotes; it means literally this word. So for example, for the while statement, we expect the word `while`. And then we have these non-terminals. If you imagine this grammar as a tree, the terminals are the leaves and the non-terminals are the intermediate nodes.
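Since the speaker leans on the regular-expression analogy, here is a small sketch mapping the grammar notation to regex equivalents (the rule names in the comments are made up for illustration):

```python
import re

# Grammar notation vs. its regex analogue (rule names are hypothetical):
#   rule_plus: 'a'+        -> one or more 'a'
#   rule_star: 'b'*        -> zero or more 'b'
#   rule_opt:  'a' ['b']   -> an 'a' optionally followed by a 'b'
#   rule_or:   'a' | 'b'   -> an 'a' or a 'b'
assert re.fullmatch(r"a+", "aaa")
assert not re.fullmatch(r"a+", "")        # '+' needs at least one
assert re.fullmatch(r"b*", "")            # '*' accepts zero occurrences
assert re.fullmatch(r"ab?", "a") and re.fullmatch(r"ab?", "ab")
assert re.fullmatch(r"a|b", "b")
print("all notation examples hold")
```

The analogy only goes so far: context-free grammars can nest rules recursively, which regular expressions cannot, and that is exactly what the tree of terminals and non-terminals captures.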
So a non-terminal is basically something that you need to expand; it refers to another rule. One thing, which is not super important here, is that when the grammar is written in this substitution style, where you have a rule and its full description always applies regardless of context, it's called a context-free grammar. There are more formal definitions of that, but the idea is that it simplifies a lot of things. We'll see later how this plays out, but it's very important to distinguish these two terms, because the tokens that can start a rule will impact a lot how certain structures are allowed in CPython.

I'll skip this slide. This is another part of the grammar, so we see the statements we mentioned before. We saw that a compound statement can be any of these, and then we start seeing, for example, that the if statement is the word `if`, followed by something that evaluates to a condition (an object that can be true or false), then the colon, and then this "suite", which is basically a block, what people usually recognize as a block; it's called "suite" here in the grammar. And then you have the elif and all that stuff. So the idea is that we have this document, with these kinds of rules, which describes what is possible to write and what is not. So let's analyze the impact this has, and the particular restrictions that CPython imposes on top of this. In particular, the CPython grammar is an LL(1) grammar.
So let's see what that is. An LL(1) grammar is a grammar that is parsed left to right, doing leftmost derivation, which means that you analyze first the token on the left and then start expanding from that. And this is the most important rule: when the parser is analyzing a particular program, comparing it with the grammar, it can only look one token ahead; one token of lookahead. Which means that if I need to distinguish between alternatives, imagine a rule which has two possibilities, two productions, and I need to decide which of the two is the correct one, for doing that I can only look at the next token in the input. So for example, if you are writing something like `for x in range(...)` and I find the word `for`, and I need to distinguish two different productions, I can only look one token ahead, which would be the `x` in the example I gave. It turns out that this is a very well-known problem, and solving it this way actually leads to very simple parsers.

But the Python grammar has two more particularities that make the parser even simpler; for those who know more grammar theory, this is a parser generated with a parse table, using what are known as the first sets and the follow sets; more on this later. In particular, we have two more things here. One is that the Python grammar doesn't allow empty productions. An empty production is a rule that can expand to the empty string, and such rules make the parser more difficult, because if you have rules that can be empty and you can only look one token ahead, then when the rule is empty you actually need to look further; that is what leads to the follow sets. Because we don't allow them, I don't need to explain those in this talk. And then we have another kind of convention which is a bit more loose. This is not actually enforced, but basically the
grammar is laid out in such a way that, in some rules, only the last alternative of a rule can start with a non-terminal. For example, let's see this one: the flow statement. We have break, continue, return, raise and yield, and if you check all of them except yield, they start with a terminal. The break statement starts with the terminal `break`, the continue statement with `continue`, etc. The only one that actually starts with a non-terminal is the yield statement, which here is a non-terminal called yield expression. Laying things out this way simplifies a bit how the first sets turn out, and it makes it easier to produce the parser.

Okay, so this is more or less the grammar. Let's describe what these first sets are. The first set of a rule, for example, let's pick simple statement, is all the tokens the rule can start with. And this is very important because, as we saw before, if we are only allowed to look at the next token and we need to see if a particular rule applies, it is very useful to know all the possible tokens that rule can start with. Because if we have a token, and the token is one of the ones that are allowed, then we know we are good: it's possible that the rule starts with that token. But if we have a particular token and we are trying to parse a rule that cannot start with that token, we know the input is invalid, so we can report a syntax error. This is what is called the first sets, and it's going to be very important later. So this is the first set of some of the rules: for example, a print statement can only start with `print`, a raise statement can only start with `raise`, but a simple statement can start with all of these things.

Okay, so let's see how the parser works. CPython actually doesn't have a handwritten parser; it has a parser generator.
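The first-set idea can be made concrete with a toy computation over a made-up grammar fragment. All rule and token names below are hypothetical, and, like the old CPython grammar, the fragment has no empty productions, which keeps the computation trivial:

```python
# Toy grammar: rule name -> list of alternatives; each alternative is a
# sequence of symbols. UPPERCASE names are terminals (all hypothetical).
grammar = {
    "simple_stmt": [["print_stmt"], ["raise_stmt"], ["NAME", "EQUALS", "expr"]],
    "print_stmt": [["PRINT", "expr"]],
    "raise_stmt": [["RAISE", "expr"]],
    "expr": [["NAME"], ["NUMBER"]],
}

def first_set(symbol):
    """All terminals an expansion of `symbol` can start with. With no empty
    productions, only the first symbol of each alternative matters."""
    if symbol not in grammar:                 # terminal: starts with itself
        return {symbol}
    result = set()
    for alternative in grammar[symbol]:
        result |= first_set(alternative[0])
    return result

print(sorted(first_set("print_stmt")))    # ['PRINT']
print(sorted(first_set("simple_stmt")))   # ['NAME', 'PRINT', 'RAISE']
```

This mirrors the slide: the print statement can only start with `PRINT`, but the simple statement inherits the first sets of everything it can expand to.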
It's a tool that generates the parser for us from the grammar file, which makes things much less error-prone, because you don't need to manually modify the parser every time you add grammar rules. So let's see how it works; it's actually a very simple pipeline. We start with the grammar in Extended Backus-Naur Form, which is the document that we saw before, and from this the generator produces non-deterministic finite automata (NFAs). We'll see what these are; basically, they are just control-flow graphs, the kind of drawing where you have nodes and arrows telling you which options you have. We'll see some examples. The thing is that these initial automata are very simple to produce, but they are very complicated to work with: full of redundancy and complexity. So the next step grabs one of these and converts it into what is called a deterministic finite automaton (DFA), which basically simplifies it and makes it easier to follow.

So let's see one example. Say we start with this rule, which is the rule for factors: unary expressions like `+a`, `-a`, `~a`. The first thing the parser generator produces is the non-deterministic finite automaton, which is this control graph.
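Jumping ahead to the punchline of this pipeline: the final deterministic automaton behaves like a transition table, and you can sketch one for a factor-like rule as nested Python dicts (token names simplified; here `NUMBER` stands in for the whole `power` sub-rule, so this is an illustrative sketch, not CPython's actual table):

```python
# A DFA for `factor: ('+' | '-' | '~') factor | power`, as nested dicts.
# State 2 is the accepting state; missing keys mean "no transition".
dfa = {
    0: {"+": 1, "-": 1, "~": 1, "NUMBER": 2},   # NUMBER stands in for `power`
    1: {"+": 1, "-": 1, "~": 1, "NUMBER": 2},
    2: {},                                       # accepting state
}

def matches_factor(tokens):
    state = 0
    for tok in tokens:
        if tok not in dfa[state]:
            return False          # no transition: syntax error
        state = dfa[state][tok]
    return state == 2

print(matches_factor(["-", "-", "NUMBER"]))  # True
print(matches_factor(["*", "NUMBER"]))       # False: '*' can't start a factor
```

Each step is one dictionary lookup on the next token, which is exactly why this style of parser is so fast.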
So we start in this state and go down as we consume tokens. The idea is that you start here, read some particular program that is trying to follow this rule, and if you are able to go from the start state to the end, the input matches the rule. If you are not able to get from the start to the end, the input is invalid. Imagine that instead of plus, minus or tilde you have an asterisk, which is not part of this rule in particular. Then you cannot proceed, because there is no edge labelled with the asterisk. (It's actually hidden in `power`, so it is a valid Python program, but if you ignore `power` for now, you can see how you cannot follow this.) Now, this is complicated, because in theory you can only follow an arrow if one of its tokens matches. For example, going from state 4 to state 8 is very easy to decide: if you find the minus sign you can go ahead, and if you don't, you cannot follow it. The problem is that here, in the non-deterministic automaton, we have some transitions that we don't know whether we can follow or not. For example, for going from state 0 to state 1 or 2,
we don't have any particular token to check. That is a problem, because we don't know whether we are allowed to transition to those states or not. That's the complicated part of the non-deterministic finite automaton. So the next step is to produce a deterministic one, which is this one. In this case, as you can see, every transition between two states is labelled with a particular token. So, for example, in state 0 we just need to look at the next token in the input, either plus, minus or tilde; if it's not any of these, we say the input is invalid, and if it is one of these, we follow it, go to state 1, and state 1 goes to another rule called factor. As you can see, in this particular rule's DFA there are no transitions with unknown labels. But you can still see something bad: for example, this node here, isolated. This node is rubbish. It doesn't mean anything; it just means that when we transformed the NFA into the DFA we produced some things that are either extremely convoluted or simply garbage, with no transitions. So the next step is simplifying the whole thing into what is called the final DFA: removing complexity and removing these particular nodes.

So this is basically how it goes: the generator produces these control-flow diagrams; with a particular algorithm you transform this into this; and from this you run a simplification, called a minimization algorithm, and you produce these here. These final diagrams are what the parser follows. If you think about it, this is more or less like a hash table or a dictionary: you are in a particular state, you look at the possible keys, here plus, minus or tilde, and if the next token is one of the keys in the dictionary, then you transition to whatever
value you have, and if you don't, you say the input is invalid. That is what makes the parser so fast: it's basically a hash-table lookup again and again and again, nesting these rules. For example, when you get to `power`, the square box for power has another of these diagrams describing what that particular sub-rule is. So it's nested diagrams inside nested diagrams.

Let's look at some examples. Here we can see the rule for comparison. In these examples you will always have the complicated one, the NFA, on the left, then the DFA, and then the simplified one. So, even if you don't know the exact algorithm, which is not super complicated but would take a lot of time to explain, you can more or less get a feel for the kind of simplification the parser generator does. This one, for example, is for comparing things; and this is the comparison rule with the actual tokens, so you will see things you recognize as comparison operators: not-equals, less-than, greater-than. You can see how, when the rule is a bit more complicated, the NFA is a real monster, and how well it gets simplified; I didn't put the intermediate one because it doesn't fit on the screen.

Some more: for example, this is the decorator rule, and it's a very interesting example because you can see, from a very high point of view, what you can decorate. In the rule for decorated things you can see that you can decorate async functions, class definitions and functions, and you cannot decorate anything else, because the rule doesn't allow it. At the end we will see a way of adding things to this so you can play with it later. But you can see how, even if you don't look at the rules and you only see these automata, which are
something lower level, you can still more or less recognize what is valid and what each one is describing. More of these: here you have the factor rule that we saw before, so you can see how in the NFA we still have these unlabelled transitions between states, while in the final one we only have transitions labelled with actual tokens, and you can see how much simpler they are.

Now I want to show you my favorite one, which is this one: the diagram that parses function definitions, which is this monster. But don't worry, because this is the NFA; the simplified version is much clearer to read. This is the most complicated rule, and the reason it is so complicated is one of the things we are going to see at the end: the way we need to describe this rule is actually fighting the LL(1) limitations. A lot of people see LL(1) as a blessing, because it's such a simple kind of grammar, with such strong, strict conditions, that it's very difficult to end up with very complicated rules. Which is good, because in the end you want Python to be simple and readable, not like certain other languages. But there is a problem: sometimes what the user perceives as a very simple construct is actually not that simple, or at least you cannot describe it as an LL(1) rule. In those cases you need a hack, and you will see many hacks in the grammar. One of them is full left-expansion, which is literally grabbing the whole sub-tree and expanding it inline, because otherwise these limitations bite; by expanding everything into one rule you avoid the conflict between first sets. And the reason this particular rule is so complicated is because it is one of them: this is
the biggest, most beastly example.

Okay, so let's continue, and let's see what the limitations of an LL(1) grammar are. Let's look at one particular example, this rule here. We have a rule that starts with the word `do`, then some sub-rule called A, then `while` and an expression; and then we have another alternative for the rule, which is the same shape except that it has another sub-rule, called B, instead. Then we go to the first sets, which, remember, are the tokens a rule can start with, and we see that A can start with the letter a or the letter b, and B can start with the letter a or the letter c. Now let's say we are trying to parse this input: the user writes that, and we want to check whether it's a correct program. In this case we know it is the first choice, the first alternative, because if we go to the first sets, this b can only be in the first set of capital A, so we know it's this one; the other alternative doesn't have b in its first set. But now let's say the user writes this other code. Now we have a problem, because this lowercase a appears in both first sets. So is it this one or is it this one?
We don't know, and this makes the rule ambiguous: you don't know which of the two to take, because the two alternatives share one of the tokens they can start with. So when you find this a, you don't know which alternative to choose, and that's a problem: this rule is invalid in LL(1). And this is the limitation; this is what makes a grammar ambiguous for LL(1). If you have a more powerful parser, you can say: okay, imagine I'm able to read more tokens, or do backtracking; then you can distinguish this one from this one. But because we are only allowed to look at the next token, we cannot. We were parsing `do`, we look at the next token, we find a, and we don't know which alternative to follow. That's the problem.

And this happens in CPython; actually, it happens a lot. For example, this is the rule that describes how to parse function calls, and in particular I want you to look at this `argument` rule. Let's see what the argument rule is. Okay, so we have `test`, which is basically any expression, so `3`, `4`, `x`, whatever; then we have keyword arguments, a particular name, an equals sign and an expression; and then we have `**` for dictionary unpacking and `*` for iterable unpacking. Now, this rule as written would not be valid, because if we look at the first set of `test`, it actually contains NAME. So these two alternatives are ambiguous: if you are here, choosing between this one and this other one, and you find a NAME token, you don't know which to choose, because `test` also starts with NAME. So this rule is not possible as written; and yet you can still call functions with keyword arguments in Python, right?
So how is it done? This is one of the first hacks we have: the actual rule is `test '=' test`, and this rule allows things like a list comprehension equal to a list comprehension, or a dictionary comprehension equal to a dictionary comprehension. And then you say: wait, wait, but I cannot write that thing in Python, right? The reason is that we allow it in the parser, so the parser is perfectly happy with it; but when it comes to the AST, which is much later in the compilation pipeline, we check: by the way, is this `test` actually anything other than just a name? If it is, we say syntax error. The syntax error just doesn't happen in the parser, but you never notice. So: little lies that we tell.

Another interesting one: do you know the walrus operator? The walrus operator also suffers from this. This is the rule for it: a particular name, the walrus operator and an expression. And this has the same problem, because the first set of `test` also includes NAME. You say: ah, but I don't see the `|` here, so it's not a choice between this and this. But if you think about it, it's the same case, because this part is optional. Either you parse just the first part, or you also have the optional part; and because you can only look one token ahead, and both can start with a NAME token, which is any Python identifier, you don't know which to choose. So again you have the same problem, and the actual rule, again, is this horrible thing, and then we enforce the restriction much later. As you can imagine, this is not very maintainable.

And then we have one that Łukasz here knows very well: with statements.
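You can observe this "little lie" from Python itself: the parser layer accepts an arbitrary expression to the left of `=` in a call, and the syntax error is produced later in the pipeline. The exact error message varies between Python versions, so the sketch below only checks the exception type:

```python
# The grammar rule is roughly `test '=' test`, so the grammar itself permits
# `f(a + b = 1)`; a later stage rejects the non-name keyword target.
caught = None
try:
    compile("f(a + b = 1)", "<demo>", "exec")
except SyntaxError as err:
    caught = err

print(type(caught).__name__)   # SyntaxError
print(caught.msg)              # message text differs across versions
```

On older versions this reads "keyword can't be an expression"; newer ones suggest you may have meant `==`. Either way, the user-visible result is an ordinary `SyntaxError`.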
And this is valid Python 3: you can have multiple context managers in the same with statement. So you can say `with a as x, b as y`, and if you have a lot of these, you say: well, I don't want to put all of them on the same line. So if you want to continue on the next line, you need to use the backslash line-continuation character. Which is sort of okay, except that it really messes with indentation, and auto-formatter tools like Black cannot really format it well. You could fix this if you allowed putting parentheses around them: you open a parenthesis, you put all your context managers, and you close the parenthesis. This particular construct is very easy for auto-formatters, because they know exactly what is delimited. Except that this is invalid. And you will say: why? I mean, I have seen this thing before, right? I've seen it in strings; I've seen it in imports: you can write `import` and then open a parenthesis and a bunch of things. Well, it turns out that if you analyze the rule, which is this one, and you want to allow the parentheses, you say: okay, I'm going to write the same rule, `with`, and then two possibilities: the old rule, or open parenthesis, the old rule, close parenthesis. So it's either the rule, or the rule between parentheses. And again, the point is that this is ambiguous, because this guy, the with item, can also start with an open parenthesis. So when you see `with (`, you don't know whether the parenthesis belongs to a parenthesized expression inside a single context manager, or whether you are wrapping the whole rule in parentheses.
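For reference, this is the backslash workaround the speaker describes, which works on every Python 3 version; the parenthesized form in the comment is the one the LL(1) parser could not support (it only became legal after CPython switched to the PEG parser, around 3.9/3.10):

```python
from contextlib import nullcontext  # stdlib do-nothing context manager (3.7+)

# Backslash continuation: awkward for auto-formatters, but always valid.
with nullcontext(1) as a, \
     nullcontext(2) as b:
    total = a + b
print(total)  # 3

# The delimited form below was a SyntaxError under the LL(1) parser:
# with (nullcontext(1) as a, nullcontext(2) as b):
#     ...
```

The delimited form is exactly why tools like Black wanted the parentheses: the opening and closing delimiters tell the formatter unambiguously where the item list begins and ends.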
And this is sad, so we could not implement that. I really tried.

Okay, so you see, we have this dual thing in how we describe the grammar. We have a very powerful and simple way to describe it, because of the LL(1) restriction, and this makes the parser not only very fast but also very simple. But then we have all these weird things, because we want to add rules that we know are simple enough for users, and all these technical restrictions limit what we can do. And I want you to pause here and think about this, because the fact that the grammar is LL(1) and all these things are very technical implementation details. This is something the user doesn't need to know, but this thing, which is at the core, really really deep down, is percolating outside. Even if you are not working on CPython, and you're writing your own interpreter, and you want to implement the Python grammar, even if you have the most powerful parser, if you want to accept exactly valid Python, you cannot allow this. Which is weird, right? It's like the implementation details are poisoning something as important as the grammar, and it is sort of dangerous and worrying and beautiful at the same time, how these tiny technical decisions can affect so much what counts as a valid Python program.

So, at the end, just to conclude this particular step: what the parser produces is called a parse tree.
If you write `x + y` and follow all these DFAs, basically the automata that we saw, the parser produces this particular structure; and if you substitute those numbers with the actual rule names, you get this. So those numbers are rule identifiers, and this output is basically a flattened-out version of the diagrams we showed before. This is what is fed to the AST stage, and we are going to stop here, because we are not going to talk about ASTs; this is where the parser ends.

So what I'm going to show you now, as an application, is this: imagine that you want to create a new grammar rule in CPython, because you want to extend Python to play with it; or, put differently, how we core developers actually add new grammar rules. Let's say you want to implement an arrow operator between two objects, which is something the user can hook into: imagine the user has a class, and the class implements `__arrow__`, and you want to allow writing `a --> b` so that it calls that particular function, in the same way that writing `a + b` ends up calling `__add__`.

The first thing you need to do is add this to the grammar. So you go to the grammar file, you find the rule describing multiplication, division and all that, and you add the arrow there: we have the star, the slash, all the things that are allowed, and now also the arrow. When you add this and rerun the parser generator, you also need to declare the token as a valid token, so you go to this particular file, which is in C (so we are going to see a lot of C), and you just say: okay, there is a new token, which is going to be called ARROW. We are not tokenizing anything at this point; we just declare the existence of the token.
Ooh, is that the end? Okay, I'm going to go fast, don't worry.

So then you go to the tokenizer and describe how to recognize the thing: you find a minus sign followed by a greater-than sign, you say, okay, that's the arrow, and you return the ARROW token. If you run the tokenizer now, you will see that it is already able to tokenize the arrow. It's actually very simple to do, even if you don't understand exactly what I'm doing here: these three lines are the only thing you need for the tokenizer to recognize it. You don't need to follow every detail; the point is that, if you know what you're doing, it's actually very simple.

Then, finally, the only thing left is to implement the actual operator. This is the bytecode for it, which means: when you find `a --> b`, pop a, pop b, call the function defined in `__arrow__`, and continue with the interpreter. Which is, again, very simple if you know what you're doing. Lastly, you just need to declare that there is something called `__arrow__` that objects can define, which is again a few lines plus the header file. And with that you can write this code: you can create a class that implements `__arrow__` (in this case I'm going to map the operator to `map`), and then you can write `f --> some_list`, and it basically maps the function over the list.

So, as you can see, and I'm ending right now, all these pieces make implementing a new rule like the one we just saw extremely simple; but they also carry all this danger and all this impact on the language that you see and appreciate. So thank you very much, and I'm sorry for running over.
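A related curiosity you can try at home: CPython's tokenizer already knows a minus-plus-greater-than arrow, `->` (the RARROW token used in return annotations), so the tokenization step succeeds on `a -> b` even though the parser rejects it as an expression. This illustrates how the token layer and the grammar layer really are separate stages:

```python
import io
import tokenize

# Tokenizing does not require the input to be valid Python syntax:
tokens = list(tokenize.generate_tokens(io.StringIO("a -> b\n").readline))
names = [(tokenize.tok_name[t.exact_type], t.string) for t in tokens]
print(names)

# `->` comes out as a single RARROW token, even though `a -> b` would
# fail later, at the parsing stage.
assert ("RARROW", "->") in names
```

So, in the talk's demo, only the grammar, the bytecode and the `__arrow__` hook are genuinely new work; a two-character operator token is exactly the kind of thing the tokenizer layer was built to absorb.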