Okay, so for the next talk, we have Formalizing a Language by Jeremiah Page. He is a software engineer based in the United States, working for ActiveState. Hello, Jeremiah. Hi, how are you? Thank you very much. We're very happy to have you here. Thanks. Okay, welcome everyone to my talk, Formalizing a Language. Let's get right to it. Python is a language that has been designed since the very beginning to be very readable. It was observed that people spend more time reading their code than they do writing it, so it was desired to build a language that could be easily read. Humans read Python code and they parse it. They consume the information in a way that best fits in their own head, and that can be different per person. Each of us might read a piece of Python code differently depending on the situation or our background. For instance, this line of code I would probably read out loud as "comp equals requests dot get, https ep2021 dot europython dot eu". When I read that line of code out, most of you probably understand exactly what I'm trying to do. Which is pretty great, because I left out some very crucial pieces of information that a computer might want to know. I didn't tell you all the characters that were in the string. I didn't tell you that there was a string. I didn't tell you where the string started or ended, or that it was an argument, or that I was even doing a method call. You all just kind of inferred that. Which is great, but the computer doesn't infer that. The computer wants to know precisely what I'm doing. So as much as humans read code more than they write it, the computer reads our code even more. Again, it wants to understand the meaning I'm trying to convey when I write this line of code. So it also parses this code, but it parses it into something called an abstract syntax tree, or an AST. This is the output of the ast module that comes with Python if I put in this line of code.
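[The AST the speaker shows can be reproduced with the standard-library ast module. A minimal sketch; the exact URL from the slide was garbled in the recording, so a stand-in string is assumed:]

```python
import ast

# Stand-in for the line from the talk; the exact URL is an assumption.
source = 'comp = requests.get("https://example.org")'

# ast.parse runs the lexer and parser and returns the tree;
# ast.dump prints the nested node structure shown on the slide.
tree = ast.parse(source)
print(ast.dump(tree, indent=2))
```

The output contains an Assign node whose value is a Call on an Attribute: the nested objects the speaker describes squinting at.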
It may not look quite like a tree, but if you squint a little bit, you can see that these nested levels of object calls look kind of like a tree. There's a root, there are some children, and there are some sibling nodes. And our flat line of code has been given a little bit of structure by the computer. Then the computer uses this AST to get the output it really wants: a stream of 1s and 0s. So this last line here is the bytecode of the same line of Python code. First Python makes an AST, then it makes the bytecode. So how does it know what kind of structure to give all of these strings that I wrote out earlier? Well, let's look at how Python compiles code into bytecode. Yes, we're going to talk about compiling Python code today, and no, I won't have time to get into interpreting Python at all. Compiling Python starts with a step called lexing. This is where the computer goes through the string I wrote out and figures out where all the word boundaries are. Words to Python are things that you and I might call words, like comp and get, but also things like significant whitespace, punctuation, and strings, whatever may be contained within the quotes. Things that don't matter to the computer, like insignificant whitespace or comments, are thrown out at this point. Then that stream of tokens is passed into the next step, which is parsing. Parsing takes the tokens and generates the AST that we saw earlier. This AST tells the computer a lot about the atoms of this statement. It gives it the relationships between the different pieces that I wrote out, and a precedence between all the different components. The computer needs to know which parts are more fundamental than others. For instance, the computer can't assign a value to the comp variable unless it has a value to assign, so it has to get something back from the get method call. But it can't call the get method unless it knows where the get method lives in memory, so it has to go and look it up.
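[The lexing step described here can be watched from Python itself with the tokenize module. A small sketch, using the same assumed stand-in line:]

```python
import io
import tokenize

source = 'comp = requests.get("https://example.org")'

# generate_tokens lexes a string into a stream of tokens;
# insignificant whitespace and comments never appear in the output.
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
```

Note that the string literal arrives as a single STRING token, quotes and all: the lexer doesn't look inside it, and it doesn't care what any of it means.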
So this tree structure relates directly to the order in which these operations have to happen. Next, and finally, is assembling. Python takes the AST that it built and flattens it back out into a one-dimensional representation, a stream of 1s and 0s. But you'll notice, if you look here, that the order, even though it reads left to right again instead of as a tree, is not the same as the order we wrote it in. Computers don't actually always appreciate the way that we write code. They don't follow our ordering at all. They need to do the more fundamental actions before the more complicated ones. And so the ordering of the bytecode actually more closely follows the AST, if you were to read it from the bottom up, than it does the way that you and I write code left to right. So how did the computer apply all this structure to our code? It did it through formal grammar, and specifically grammar rules. So what is a formal grammar? Well, a formal grammar is a set of rules that describe a language. I mean "set" here in a pretty mathematical sense, in that it defines the entire space of all combinations of words that make valid Python programs. And by defining the entire space of valid Python programs, it simultaneously defines the space of invalid Python programs. Take, for instance, this line of code that we saw earlier. We saw that it was lexically valid when we passed it through the tokenizer, or the lexer. We got a series of tokens and they were all correct. Then we passed those tokens to the parser and the parser produced an AST, so it was syntactically valid. And if we continued down that route, this code would have actually run. Now compare that to the next line of code, which I've changed a little bit. If you didn't look too closely, you might think that this next line of code was also valid Python. But I've made some changes, and it is no longer syntactically valid. It is still lexically valid.
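[The bottom-up ordering of the bytecode can be seen with the dis module. A sketch with the same assumed stand-in line; exact opcode names vary between Python versions:]

```python
import dis

code = compile('comp = requests.get("https://example.org")', "<demo>", "exec")

# The name lookup for `requests` comes first, then the attribute lookup
# for `get`, then the call, and only at the very end STORE_NAME for comp,
# even though `comp` is the first thing we wrote.
dis.dis(code)
```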
If I were to pass this line of code to the lexer, it would produce a stream of tokens, no problem. After all, I've only used valid Python characters. The lexer doesn't care what the meaning is; it doesn't know what the meaning is. It passes the tokens on to the parser, but the parser can't find any resolution. It can't parse this line of code, because it's not part of the Python language. Up to and including Python version 3.9, Python used a specially defined subset of formal grammar called deterministic context-free grammar. That's just a very formal definition. Deterministic means that this grammar has only one solution whenever it reads a program, which is good: we want our programming language to always produce the same result whenever we write the same line of Python. And context-free describes how the rules are able to be written in the metagrammar. Speaking of a metagrammar, we have to have some way to write down these rules so that we can all agree on what they are. The older version of the Python grammar used EBNF notation, which is just a very formal way of writing grammar rules that's used for various different languages. It was actually a special notation designed just for Python, but it follows other EBNF notations out in the wild. It's sometimes called a metagrammar because it's a way to write down the rules for a language that tell you how to write another language. And all this was parsed using an LL(1) parser. The parser is the engine that takes in the grammar rules, takes in the token stream, and produces the AST as a result. LL(1) is just shorthand: the first L is for left-to-right, which is how it reads the code; the second L is for leftmost derivation, which is the way it decides which rules to follow in the grammar; and the 1 represents the size of its lookahead buffer. So as the parser is going along the token stream, it has exactly one peek it can use to try to resolve its different grammar rules.
And starting in version 3.9, Python got a parsing expression grammar, usually called a PEG. We got a new notation to represent this PEG, but we'll see later that it was heavily inspired by the original grammar, while still borrowing from PEGs used in other languages. And it got a new parser, a packrat parser, which was specially designed for PEGs. The most significant difference between this and the previous parser, for our purposes, is that the packrat parser has infinite lookahead. It can continue looking down the stream of tokens until it resolves any ambiguity in the rules, and then make a decision. If we go back to our original example: compiling Python code starts with text, that text is lexed into tokens, those tokens are parsed into an AST, and that AST is finally assembled into bytecode. The only part of this whole process that changed in Python 3.9 is the step going from tokens to the AST: the parsing. The tokens that represent your Python programs in 3.9 and the AST that results from parsing them haven't changed at all. It's just this little process in the middle. Now we'll look at some of these rules that we've been talking about for so long. These are the older EBNF grammar rules for the import statement. Each of these lines is a grammar rule, and they start with a name, then a colon, then a definition. A grammar rule definition can include other rule names, string literals, or token literals, which are the ones in all caps, and it can also include some metagrammar syntax. So the import statement is defined in the grammar as matching if the import_name rule matches or the import_from rule matches. To understand this further, we have to descend into another rule. In this case, we'll just descend the left-hand branch so we can understand it a bit better. The next line down is the import_name rule, which is the 'import' string followed by the dotted_as_names rule.
And dotted_as_names is defined below as a dotted_as_name, followed by zero or more comma-separated dotted_as_names. You'll see that star at the end, the Kleene star, represents zero or more repetitions, just like in our regular expressions. So we descend again to dotted_as_name, which is a dotted_name followed by an optional 'as' string and a NAME token. Square brackets indicate the optional presence of some part of the rule. And NAME, again in all caps, means it's a token. So, the last rule on the list: dotted_name is a NAME token followed by zero or more dot-NAME pairs. You can put these rules together to see that together they define the import statement as you and I know it and use it in Python every day. So now we'll see how the PEG defines the same import statement. It looks pretty similar. In fact, three of these rules haven't really changed at all. The import statement is still matched if either the import_name rule or the import_from rule matches. import_name is the same; dotted_as_names is the first one to change. It's just using a new meta-syntax that was not possible in the old EBNF. This pattern, a string, then a dot, then some other rule name, is just shorthand for saying that when there's more than one match of the rule name, they're separated by commas. It works just like a str.join in Python code. And at the end of this we use the Kleene plus, like we would in a regular expression, to indicate one or more. dotted_as_name hasn't changed, but dotted_name has changed. It now matches if another dotted_name rule matches, followed by a dot and a NAME token, or just a NAME token. So in this case we actually have a recursive rule. It keeps matching itself until it finds the terminal branch, which is just the end of a dotted name sequence, a lone NAME. And when it reaches that point, it follows its recursion back up, and the outermost dotted_name contains all the inner ones. So now let's look at a slightly more complicated example, to see where PEG really shines.
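[The rules above can be seen in action by parsing an import statement and looking at the node the parser builds. A small sketch:]

```python
import ast

# dotted_as_names / dotted_as_name / dotted_name all collapse into
# alias entries on the resulting Import statement node.
tree = ast.parse("import os.path as p, sys")
stmt = tree.body[0]
print(type(stmt).__name__)  # Import
for alias in stmt.names:
    print(alias.name, alias.asname)  # os.path p / sys None
```

The recursion in dotted_name is invisible by the time the AST exists: the dotted path is flattened into a single string like "os.path".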
This is the older EBNF definition of an argument list, which is what you can pass to a Python callable. An arglist is one or maybe more arguments separated by commas, and then there's an optional trailing comma, which is for our readability benefit. The really interesting part is in the argument rule. An argument, if we look at it, is the test rule followed by an optional comp_for. We don't define those here, but for the purposes of these rules you can think of test as basically any Python expression. It's actually slightly more inclusive than any Python expression, but it's almost the same. And a comp_for is a comprehension for-loop, so a generator. An argument is one of those, or it is a named expression, or a keyword argument, or an unpacking, either star or double-star. All right, these are the flavors of arguments that you're allowed to pass when you call a Python callable, and an arglist is one or more of them. So how does the PEG define this? Well, the PEG takes a lot more rules, and longer rules, to define essentially the same thing. Remember, both of these grammars are in 3.9, and they're both recognizing the same language. The PEG is taking a little more time to do it. A little more space, anyway. So the PEG arguments rule is really just matching when it matches args; that trailing comma and closing parenthesis are just matching its place in the larger program overall. args is defined, using this new PEG syntax again, as a comma-separated sequence of starred expressions or named expressions, optionally followed by keyword arguments. Or it's just keyword arguments. Okay, so the kwargs rule is in both branches of the args rule, so we'll descend into that. kwargs is defined as one or more kwarg_or_starred followed by one or more kwarg_or_double_starred, or it's just kwarg_or_starred, or it's just kwarg_or_double_starred. These are all defined below, but they're pretty self-explanatory, I think. So why is the PEG repeating itself so much here, and taking so much longer to define the same thing?
Well, these rules are enforcing a very specific order. kwargs says: yes, you can have your keyword arguments, you can have starred expressions and double-starred expressions, you can have both, but if you have both, you have to put all your starred expressions before any of your double-starred ones. And we take that rule and plug it back into the args rule above, saying: yes, you can also have named expressions and/or starred expressions and/or double-starred expressions, but all three must appear in that order. You have to be finished with your named expressions before you start any of your unpacking operators. This is the actual rule of Python. We've never been allowed to call something starting with a double-starred expression, then following up with some arguments, then some keyword arguments, then some named expressions. But the old grammar didn't really recognize this; it was just looking for any sequence of arguments. It would accept it, it would build a parse tree, and it wouldn't think anything of it. Your program would actually parse. The reason it wouldn't execute is because there was post-processing done on the parse tree from the LL(1) parser. After it was done with your program, the tree would be walked again, and certain conditions, like the args, would be checked against rules that didn't quite fit in the grammar. And if your arguments were misordered, then you'd get a syntax error there, instead of from the parser. But with the PEG, it just won't parse. If you try to input your arguments in the wrong order, you won't match a complete set of grammar rules, and your program will fail to parse. You'll never get the tree. You'll get a syntax error straight from the PEG parser. So this more clearly captures the actual language of Python, by describing the exact scenario you have to have. So what can PEG bring us, besides some more verbose grammar to describe a language that we already have? Well, it brings new things too.
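[The ordering rule is easy to check directly: the misordered call never produces a tree. A sketch:]

```python
import ast

def parses(src):
    """Return True if src makes it through Python's parser."""
    try:
        ast.parse(src)
        return True
    except SyntaxError:
        return False

# Starred unpacking before double-starred: fine.
print(parses("f(a, b=1, *rest, **extra)"))   # True
# Double-starred before starred: a SyntaxError, no tree.
print(parses("f(**extra, *rest)"))           # False
```

On 3.9+ with the PEG parser the second failure comes straight from the grammar; on older versions it came from the post-parse tree walk described above, but either way the user sees a SyntaxError.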
First up is the feature I am most looking forward to in Python 3.10. I'm very excited for this. You might not know, but coming in 3.10 later this year, you'll be able to logically join multiple lines into one with statement using parentheses. We do this with all our other statements: our long ifs, our long fors, our long import statements, sometimes automatically for us using linters, they get wrapped in parentheses. But until 3.10, this was not possible with with statements. You might wonder why, and it has to do with the new PEG grammar. So this is the rule in 3.10 for with. It's either the with keyword followed by an open parenthesis, one or more with items, and a closing parenthesis, or it's the older way of specifying it, which is just the keyword and one or more with items. So that's the new shiny way. What was the old way? The old way looks pretty similar. It's still the with keyword and one or more with items. We just don't have that second branch. So why didn't we add it in 3.9 instead? Well, we could have written it; it would have been legal in the grammar. We could have said that another option for the with statement is the with word, then the open parenthesis literal, one or more with items, and the closing parenthesis. But remember that the EBNF grammar is parsed with an LL(1) parser, and that 1 is the size of its lookahead buffer. So when it's looking at a with word in the token stream, it has just one peek and can look one token ahead. So it does, and it finds an opening parenthesis, and what should it do? It has two choices for this rule and it has to pick one. Well, one of those rules has a literal opening parenthesis in it, so you'd think it would pick that. But if you look at the other rule, what comes immediately after the with word is a with item. If we look at with item, what comes first there is a test. And remember, a test is essentially an arbitrary expression, and Python expressions can be enclosed in parentheses.
We can do this for all kinds of reasons: for precedence, or to join the logical lines of the expression itself. And so there's ambiguity here. We could be following either one of these rule branches, and it just wouldn't parse very well. But the PEG parser has infinite lookahead, and it will just keep going past the first with and the first opening parenthesis until there is some resolution to the ambiguity, and then it will pick one. Another new feature from PEG is soft keywords. Python's always had keywords, words that you're not allowed to use for your own purposes, and it's always had reserved names, ones that you could reassign if you wanted to, though you probably shouldn't. But now, starting with Python 3.10, we're going to have two new soft keywords. These are words that, if used in the right place, will start a Python statement, but if used in any other place, they can be used just like a name. You can bind them, make them your function name, or whatever you'd like. Now, some of you might be saying: wait, we've had soft keywords before. async and await were both soft keywords. You would actually be right. Back when they were introduced in Python 3.5, async and await were both soft keywords, even though we didn't have a PEG parser at all. But I told you you have to have a PEG parser for soft keywords. So what happened? Well, the parser in Python 3.5 conscripted the tokenizer to do quite a lot of extra work for it. When they were introduced as soft keywords, the tokenizer was changed so that if it found the string async or the string await, it would have a look around at its current lexing context and ask: is this a place where async has to be a keyword? If it was, it would replace the token in the stream, or insert into the stream, the ASYNC token. If it was in a place where it didn't have to be a keyword, it would leave it alone as a NAME token, just like all our other strings and names.
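[To tie off the with-statement discussion above: whether a given interpreter accepts the parenthesized form can be probed with compile. A sketch; officially this form only parses on 3.10+, though it may already be accepted by 3.9's PEG parser:]

```python
import sys

source = (
    "with (open('a.txt') as f,\n"
    "      open('b.txt') as g):\n"
    "    pass\n"
)

# compile only parses the source; no files are actually opened.
try:
    compile(source, "<demo>", "exec")
    accepted = True
except SyntaxError:
    accepted = False

print(sys.version_info[:2], accepted)
```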
Then when the parser was actually doing its thing, it would just check: did I receive the ASYNC token? If so, this is a keyword; no other checking is necessary. However, if it got a NAME token, and that NAME token happened to contain the string async or await, that would be fine too, and that would match. And this way, async and await were both allowed to be soft keywords. This is not the realm of the tokenizer, normally. You can see here more grammar from Python 3.10. The definition for a function def is the 'def' string followed by a NAME token, or it's the ASYNC token, then the 'def' string and a NAME token. So this is kind of strange. def is a keyword we've definitely had since the beginning of Python, but it doesn't get a special token; it just matches a string in the grammar. Also, async is not even a soft keyword anymore. It's a hard keyword, and we don't have an LL(1) parser anymore. But our tokenizer is still tokenizing this word specially, because that was what it took to make soft keywords back then. Going forward, with match and other words, the PEG parser doesn't need this conscripting of the tokenizer. With its infinite lookahead, it can decide whether the place it's seeing the match word is appropriate for a statement or for a normal name. So this is the match statement rule, just below the function def rule. You can see it just uses a normal string, like it does for def and every other Python keyword. It's just async and await that are going to continue to get this special treatment, because of a feature that was introduced back in Python 3.5. So what has PEG brought to the language? What have we gained by completely switching out the grammar of a very popular programming language? Well, our PEG grammar actually accurately defines the entire space of Python programs now, so that's good. That's what a grammar is officially supposed to do. It also means fewer steps between the tokens and the AST than the old LL(1) parsing step.
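[The soft-keyword behavior described here is easy to demonstrate: match stays a perfectly ordinary name, while the same word at the start of a statement begins a match statement on 3.10+. A sketch:]

```python
import ast
import sys

# The soft keyword used as a plain name: still allowed.
match = ["a", "b", "c"]
print(len(match))  # 3

# The same word starting a statement parses as a Match node on 3.10+.
source = (
    "match command:\n"
    "    case 'go':\n"
    "        print('going')\n"
)
if sys.version_info >= (3, 10):
    print(type(ast.parse(source).body[0]).__name__)  # Match
```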
With the old parser we had to walk the tree after we made it, to do additional checks and also in some cases to reshape the tree, so there were more steps getting from one to the other; now we've simplified the parsing step. We've added new soft keywords. So far we have match and case, but there's no reason why we can't keep adding soft keywords to the language as we continue to accept new keywords, and we won't break existing code. This is how we get a match statement in 3.10 and keep re.match as a function in the standard library at the same time. And I don't think they'll have to become hard keywords the way async and await did. And PEG just has the ability to express more elaborate ideas than an LL(1) parser could. So I'm looking forward to more changes from PEG. Thanks to EuroPython for putting on this conference, thank you for attending, and thank you ActiveState for sending me. Thank you very much for your talk. Some questions have come in from the audience, so I'll just ask them. Does PEG have any performance or runtime improvements? It's actually slower, I believe. So both grammars are in Python 3.9, mostly to kind of prove that it works. There's lots and lots of testing, but we want to be sure. In 3.10 there is no more LL(1) parser, but both of them exist in 3.9. If you didn't realize this, it probably means that you were using the PEG parser the whole time, because that was the default. So there is a performance impact. I believe it's around 10%. At least it was in the very first release of Python 3.9. So it takes a little bit longer, but a lot of improvements have been made, so that it takes about the same amount of memory. But it was decided that computers have come far enough in the 30 years since Python first came around that it was worth it for the extra power. Great. And then there was another question. What are the main reasons why they moved from LL to PEG?
So the main reason, I think the primary reason it was chosen to finally move parsers, is that by the time we got to Python 3.8 we had already accepted rules into the Python language that couldn't be expressed in a grammar that could be purely parsed by an LL(1) parser. So we were already writing rules that absolutely had to have post-processing done on the tree. It wasn't just inconvenient to write them out that way; we were going beyond the formal definition of LL(1). And with PEG we're not. We can again contain the entire definition of the language in the grammar itself. So there are examples, like the with statement: we would never have gotten around to having a parenthesized with statement if we'd stuck with LL(1) forever. Thanks a lot. And then I just wanted to let you know that the audience really enjoyed your talk, and that if you liked it, you can answer questions in the breakout room or in the conference room. Yeah, I'll jump into the conference room now. Thank you very much. I'm not caught up yet. Thank you.