I'm going to try to explain how a Python interpreter works in about 30 minutes. I'll do that by showing you how to build a really simple Python interpreter (in technical terms, it's called a tree-walk interpreter) from scratch, in Python. But the first thing that should come to your mind is: why? Why would you want to do that? The point of this exercise is twofold. The first point is that I want to show that creating a programming language is not some crazy black-magic wizardry. It's just like writing any other piece of code. The second point is that Python is a 32-year-old language that has been contributed to by thousands of people, so I think a lot of people assume it would be too hard to contribute to. I wanted to show that, at its core, the Python interpreter is still fairly simple, and that you still can, and should, try to contribute to it.
So let's start with a bit of a story. A long time ago, I wanted to build a JSON parser. It would be something like this. (Give me one second. Can I get the screen to just mirror what I see? Yeah, that's better.) So I wanted to build a JSON parser, just like the one that's built into the standard library. I tried to write a parse function that takes in a string and makes whatever information is inside the string directly available to you: the fact that the results have two objects inside, which hold the information of two people, and that John here is a vegan, but Rob is not.
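
For reference, this is roughly what that looks like with the standard library's `json.loads`. The slide itself is not part of the transcript, so the input string below is a made-up stand-in with the same shape (two people, one vegan and one not).

```python
import json

# Hypothetical input: the field names are illustrative, not taken from the slide.
text = '{"results": [{"name": "John", "vegan": true}, {"name": "Rob", "vegan": false}]}'

data = json.loads(text)

# The string is now ordinary Python data that can be inspected directly.
people = data["results"]
print(len(people))                            # how many people were in the string
print(people[0]["name"], people[0]["vegan"])
print(people[1]["name"], people[1]["vegan"])
```
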
The thing is, I'm sure all of you have used a parse function like this, but it's often overlooked what exactly it is doing. This might look like structured data to you, but to the computer, it's just a string of characters. So how do we make the computer understand that there are two keys here, and that the values are strings as opposed to, say, floats and numbers? The first step is to differentiate the different parts of the string that you have: the fact that we have an opening and a closing curly brace here, colored in yellow, and a colon, in red, surrounded by these two strings in purple. This first level of segregating the string input is called the tokenizer.
The structure of the data comes from the JSON spec, and the spec is really important, because without it there would be no definitive way to figure out what exactly it means to be a JSON string. Thankfully, the JSON.org website has an exact spec that defines what it means to be a JSON object. It goes something like this. In JSON, an object is defined as starting with an opening curly brace token. Then there can be many items, each followed by a comma, and then it ends with one more item with no comma at the end. Sadly, JSON doesn't have trailing commas. Then it closes with a closing curly brace. And what exactly are these items? An item is defined as a key, then a colon, then a value. For now, you can assume that the keys and values in this example are strings. With that grammar properly defined, we are able to tell that this is supposed to be a JSON object, that there are two items in the object, and that each item has a key and a value.
That's pretty much all there is to parsing. Once we have the tokens and we have a spec, we can structure the input and very easily extract the information out of it. I think that's enough talking in the air. Let's actually look at some code.
We start with a tokenize function, which takes a JSON string and turns it into a queue of tokens. What we do is fairly simple. We start with index zero and keep going till the end of the string, taking out one character at a time. First, we check whether it's whitespace. If so, we jump over it by doing index plus-equals one and continue on to the next character. If it's a character like a curly brace or a bracket, we do the same, but we also create a token object and append it to the queue of tokens before moving on to the next character. There's more of the same, but there are three things left here: a string, a number, and everything else, which is true, false, and null. Those are the only possible values that you can have in JSON. For those, since we don't know how long the token is going to be, we pass the current index to a function, and we get back the new index after the thing has been parsed.
Let's take a look at the extract string function. What it does is actually pretty similar to the top-level tokenize function. We run a while loop till the end of the string, but instead of starting from zero, we start from the current index. We skip over the starting quote and keep going until we see the end quote, at which point we go past the quote, take the entire quoted string out, append the token to our queue of tokens, and return the index just after the string that has been parsed.
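
The tokenizer loop described above might look something like the following. This is a minimal sketch, not the code from the talk: the `Token` type, the token kind names, and `extract_number` are assumptions made for illustration, and the number scanning is deliberately loose. The string scanner also steps over backslash escapes, which a JSON string needs.

```python
from collections import deque
from typing import NamedTuple


class Token(NamedTuple):
    kind: str  # for example "LBRACE", "STRING", "NUMBER", "TRUE"
    text: str  # the exact characters taken from the source


# Hypothetical token names; the talk does not show its own.
PUNCTUATION = {"{": "LBRACE", "}": "RBRACE", "[": "LBRACKET",
               "]": "RBRACKET", ":": "COLON", ",": "COMMA"}
KEYWORDS = {"true": "TRUE", "false": "FALSE", "null": "NULL"}


def extract_string(source, index, tokens):
    """Scan a quoted string starting at `index`; return the index after it."""
    start = index
    index += 1  # skip the opening quote
    while index < len(source):
        char = source[index]
        if char == "\\":
            index += 2  # skip the backslash and the character it escapes
        elif char == '"':
            index += 1  # step past the closing quote
            tokens.append(Token("STRING", source[start:index]))
            return index
        else:
            index += 1
    raise ValueError("unterminated string")


def extract_number(source, index, tokens):
    """Loose number scan: collects number-like characters without validating."""
    start = index
    while index < len(source) and source[index] in "+-0123456789.eE":
        index += 1
    tokens.append(Token("NUMBER", source[start:index]))
    return index


def tokenize(source):
    tokens = deque()
    index = 0
    while index < len(source):
        char = source[index]
        if char.isspace():
            index += 1  # whitespace carries no meaning in JSON
        elif char in PUNCTUATION:
            tokens.append(Token(PUNCTUATION[char], char))
            index += 1
        elif char == '"':
            index = extract_string(source, index, tokens)
        elif char in "-0123456789":
            index = extract_number(source, index, tokens)
        else:
            for word, kind in KEYWORDS.items():
                if source.startswith(word, index):
                    tokens.append(Token(kind, word))
                    index += len(word)
                    break
            else:
                raise ValueError(f"unexpected character {char!r} at {index}")
    return tokens


for token in tokenize('{"name": "John", "vegan": true}'):
    print(token.kind, token.text)
```
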
The only exception, while scanning a string, is when we see a backslash, in which case we jump over two characters. This allows you to escape things like double quotes if you want to have them in your JSON string. And that's pretty much all there is to tokenizing. We just collect tokens until we run out of the JSON string, and then we return the tokens that we have collected.
Let's take a look at the parse phase next. The parser takes out the first token from the tokens queue and then, based on its properties, decides what to do. If it's an opening square bracket, we call parse array. If it's an opening curly brace, we call parse object. If it's a string or a number, we can just parse that directly. And for true, false, and null, we just return the value True, False, or None. But the real magic happens inside the parse array and parse object functions, so let's take a look at parse object. To parse a JSON object, we start with an empty dictionary. There is one special case here: if we see a closing curly brace immediately, we just return the empty object. Apart from that, we take a look at the first token. We expect it to be a key, so we parse it as a string. Then we expect to see a colon token, which we just discard. And then, for the value, we just call parse again.
But hey, the parse function is calling parse object, and now parse object is calling parse once again. How exactly is that supposed to work? Well, the thing is, the parse function only looks at the tokens that it needs to look at. If we pass it the queue of tokens, it will take a look at the first one, and if that happens to be a string, it will take just that one token and then return. The benefit of that is that we can just keep calling it.
This may be a bit hard to visualize, so let's try to see what's going on. Say that we have called the top-level parse function with these tokens: an opening curly brace, a string, a colon, a string, and a closing curly brace. Parse looks at the first token. It happens to be an opening curly brace, so it calls parse object. Parse object expects the first token to be a key, then it expects a colon, which it gets, and then it calls parse again for the value. Parse looks at the first token. Is it an opening bracket? No. Is it a curly brace? No. Is its type string? Yes, so we call parse string. Parse string parses that token into a Python string. Then our parse object function expects either a comma or a closing curly brace, and since it finds a closing curly brace, we break out and return the object. Since this technique relies on recursion a lot, where parse eventually ends up calling parse again, it is called recursive descent parsing.
With that implementation, our JSON parser is effectively complete. This is how you would use something like that: by defining a tokenize function, which turns the string into tokens, and then calling parse over those tokens, we basically have a fully functional JSON parser. Let's look at a demo of that. It's about 300 lines of code, and if we call the parse function with this JSON string, we are effectively able to parse JSON. It works. And the good thing about JSON is that the spec is so simple, yet JSON is so widely used, that even something as complex as this API call, which came from, I think, randomuser.me or something, should be parsable in its entirety.
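
The recursive descent idea walked through above can be sketched in a few dozen lines. This is an illustration under assumptions, not the 300-line implementation from the demo: tokens are plain `(kind, text)` tuples with made-up kind names, the helper `expect` is hypothetical, and `parse_string` only strips the quotes without decoding escape sequences.

```python
from collections import deque


def expect(tokens, kind):
    """Pop the next token and check that it has the expected kind."""
    found, text = tokens.popleft()
    if found != kind:
        raise ValueError(f"expected {kind}, found {found}")
    return text


def parse_string(text):
    # Simplification: drop the surrounding quotes, leave escapes undecoded.
    return text[1:-1]


def parse(tokens):
    """Consume exactly one JSON value from the front of the queue."""
    kind, text = tokens.popleft()
    if kind == "LBRACKET":
        return parse_array(tokens)
    if kind == "LBRACE":
        return parse_object(tokens)
    if kind == "STRING":
        return parse_string(text)
    if kind == "NUMBER":
        return float(text) if any(c in text for c in ".eE") else int(text)
    if kind in ("TRUE", "FALSE", "NULL"):
        return {"TRUE": True, "FALSE": False, "NULL": None}[kind]
    raise ValueError(f"unexpected token {kind}")


def parse_object(tokens):
    result = {}
    if tokens[0][0] == "RBRACE":  # special case: empty object
        tokens.popleft()
        return result
    while True:
        key = parse_string(expect(tokens, "STRING"))
        expect(tokens, "COLON")  # the colon itself is discarded
        result[key] = parse(tokens)  # recursion: a value can be anything
        kind = tokens.popleft()[0]
        if kind == "RBRACE":
            return result
        if kind != "COMMA":
            raise ValueError(f"expected COMMA or RBRACE, found {kind}")


def parse_array(tokens):
    result = []
    if tokens[0][0] == "RBRACKET":  # special case: empty array
        tokens.popleft()
        return result
    while True:
        result.append(parse(tokens))
        kind = tokens.popleft()[0]
        if kind == "RBRACKET":
            return result
        if kind != "COMMA":
            raise ValueError(f"expected COMMA or RBRACKET, found {kind}")


# The five tokens from the walkthrough: { "name" : "John" }
tokens = deque([
    ("LBRACE", "{"), ("STRING", '"name"'), ("COLON", ":"),
    ("STRING", '"John"'), ("RBRACE", "}"),
])
print(parse(tokens))
```

Because `parse` only ever consumes the tokens that belong to one value, `parse_object` and `parse_array` can call it repeatedly on the same queue without any bookkeeping.
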
So we have a fully featured JSON parser in, I suppose, 300 lines of Python code. But that's not what the talk is about. The talk is about writing a Python interpreter. So now we are going to build a Python interpreter for the JSON parser that we just wrote. We're going to build towards a package called interpreted, which is a Python interpreter, and that Python interpreter is going to run the JSON parser. Let's try to do that.
First things first: what exactly is the thing we call an interpreter? I think it's actually three things. The first part is the tokenizer, which turns your Python code into tokens. If you were to have a Hello World program, the tokens would be print, then an opening bracket, the string Hello World, the closing bracket, and then the new line. The new line is important because Python doesn't use semicolons or anything like that to end statements. The only way for the parser to tell that a statement has finished is for there to be a new line token. Then there's the parser phase, which takes in these tokens and turns them into a nested data structure. And then there's the interpreter, which runs this data structure from top to bottom, and we get our program execution. Yes, the third phase is called the interpreter even though it's part of the interpreter, I know.
Let's look at the tokenizer first. The Python tokenizer is actually pretty much the same as the tokenizer that we wrote for the JSON module, with just a few additions. There are two major changes. The first one is that new lines are important. How exactly do we build that into the Python tokenizer? We do something like this. Say we have a method to scan one token. It reads one character.
If that character happens to be a new line, and we are not inside some bracketed construct, like a list or a multi-line statement, then we add a new line token. If we have any adjacent new lines, those can just be ignored, because even though whitespace exists in a code base, for the runtime it is pretty much meaningless. The same holds true for all the other whitespace that exists. It can also be skipped. Then the tokenizer goes on to tokenize all the operators and variables and strings and brackets and everything else, just like the JSON tokenizer.
The second big change is that indentation matters in Python, as opposed to pretty much every other language that people write. To support that, whenever we add a new line token, we make sure to detect the indentation on that new line. Let's see how we're going to do that. The thing with indentation is that it's kind of like a stack: if a line is indented by two levels, you need to know what both those levels are, because things like this can happen. Say that we start a statement at one level of indentation, say four spaces. Then there's a bunch of empty lines (comments are essentially empty lines for the runtime). Then we see another statement with the same level of indentation, so we need to be able to detect that there has been no change in the indentation. Then we go to two levels of indentation, so now we have an indentation of eight spaces. When we go back, we need to be able to tell that we have gone back to indentation level one. To track that, we need a stack. Similarly, right here, we have gone from two levels of indentation to zero, and it is really important to be able to tell that we haven't gone from here to here; we have gone completely outside.
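
A stack of indentation levels can be sketched like this. It is a simplified illustration rather than the talk's tokenizer: it works on whole lines instead of individual characters, and the function name `indentation_tokens` and the `LINE`, `INDENT`, and `DEDENT` labels are assumptions.

```python
def indentation_tokens(lines):
    """Yield (kind, text) pairs, adding INDENT and DEDENT around LINE entries."""
    stack = [""]  # whitespace prefix of every open level; "" is the top level
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments never change indentation
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        current = stack[-1]
        if indent == current:
            pass  # same level: nothing to emit
        elif indent.startswith(current):
            stack.append(indent)  # one level deeper
            yield ("INDENT", indent)
        elif indent in stack:
            # Pop every level above the one we returned to.
            dedent_count = len(stack) - (stack.index(indent) + 1)
            for _ in range(dedent_count):
                stack.pop()
                yield ("DEDENT", "")
        else:
            # Covers mixed tabs and spaces, and unindents to an unknown level.
            raise IndentationError("inconsistent indentation")
        yield ("LINE", stripped)
    while len(stack) > 1:  # close any blocks still open at end of input
        stack.pop()
        yield ("DEDENT", "")


source = [
    "if a:",
    "    b",
    "",
    "    # a comment",
    "    while c:",
    "        d",
    "e",
]
for kind, text in indentation_tokens(source):
    print(kind, text)
```

Going from the line `d` straight back to `e` produces two `DEDENT` entries, which is exactly the "gone completely outside" case: the stack is what makes it possible to count how many levels were closed.
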
To track that, we can do something like this. We take out the indentation of the new line by looking at characters for as long as they are either a space or a tab, and we add those to our level of indentation. The current indentation is at the top of the stack. If there's a mismatch in the sequence of spaces and tabs in the indentation, we get the classic error of inconsistent use of tabs and spaces. If the current level of indentation is the same as the indentation on the new line, we do nothing. If the new indentation is greater than the current level of indentation, we add one indent token, and the latest level of indentation is added to the stack. But if we have fewer indent characters than the current indentation, then we have to detect how many dedents we have to do. Say that the previous statement was five indents in, and we have now gone to two levels of indent. The matching indent will be at index one because of zero indexing, so we do a plus one to get to two. The dedent count will be the current length, five, minus two, which is three. So we add three dedent tokens and take those three indentations off the stack. That's how we keep track of the current level of indentation.
On to the parser. The parser is actually pretty similar to the JSON parser. The only difference is that instead of making a dictionary, we make a nested Python object. That's it. We take in a Python program, tokenize it, and turn it into a structure. For Hello World, we know that there's a Python module, and inside its body there is a thing called an expression statement.
That expression statement holds a function call to the function print with one argument, which is the constant Hello World. This structure is called the parse tree, or the abstract syntax tree.
The parsing bit is best explained live, so I'm just going to go through the code of the parser. Let's see. This is the current grammar of the parser that we're implementing. Just like the JSON spec, we have a spec for Python as well, and at first glance it's fairly straightforward. What is a Python module? It's just a bunch of statements. A statement can be a single-line statement or a multi-line statement. What's a multi-line statement? That's a function definition, an if statement, a while loop, a for loop, and so on. A function definition starts with the def token, then we see a name, then an opening bracket, then some parameters if there are any, then a closing bracket, a colon, and a block. We'll come to block in a second, but the parameters are just a bunch of names: a name, then a comma and a name, a comma and a name, and so on forever.
That's pretty much how the parser is implemented in practice. We just take the grammar and turn it into Python code. How? When the parser starts, we simply call parseModule. Let's go to the parse function. The parse function creates a list of statements, and until the whole module has been parsed, it keeps calling parseStatement. We get back one statement, append it to the list of statements that we have parsed, and at the end we return the module. That is basically what the grammar is saying. For a statement, let's look at parseStatement. parseStatement just calls either parseMultilineStatement or parseSingleLineStatement.
parseMultilineStatement will eventually go on to call parseBlock. Let's take a look at parsing an if statement. For the if statement, we call parseExpression, we expect a colon after it, and then we call parseBlock. parseBlock is kind of interesting. Let's look at its grammar. A block is defined as a new line, then an indent, a bunch of statements, and then a dedent. And that makes sense. That's a block of code: it happens to be on a new line, there's a new level of indentation, we have some statements, and then we dedent back. Does that make sense? We can implement blocks something like this. parseBlock expects a new line, then an indent, and it makes a body for that block. As long as the input is not fully parsed and we don't see a dedent, we keep calling parseStatement and appending the results.
But parseStatement can call parseMultilineStatement, and multi-line statements have blocks inside them. So what's going on there? Let's try to visualize that as well. Say we have a function here, and we find a block, and that's the block that we're currently parsing. We find one level of indent, and then we find our first statement, which happens to be a multi-line statement: a while loop. The while loop itself goes on to call parseBlock. We find a new line, we find one indent, and then we go on to parse the statements. That's a single-line statement. That's a single-line statement. Then we find one multi-line statement, so we call parseBlock again. We find a new line, one indent, a single-line statement, and then, finally, one dedent. At that point, the parseBlock in yellow returns.
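
The way parseBlock and parseStatement call each other can be sketched as follows. This is a toy illustration, not the talk's parser: tokens are hypothetical `(kind, text)` tuples, conditions and single-line statements are kept as raw text instead of being parsed as expressions, and the tuple-shaped nodes are made up for brevity.

```python
from collections import deque


def expect(tokens, kind):
    """Pop the next token and check that it has the expected kind."""
    found, text = tokens.popleft()
    if found != kind:
        raise SyntaxError(f"expected {kind}, found {found}")
    return text


def parse_module(tokens):
    body = []
    while tokens:  # keep going until the whole module has been parsed
        body.append(parse_statement(tokens))
    return ("Module", body)


def parse_statement(tokens):
    if tokens[0][0] == "KEYWORD":  # "if" or "while" in this sketch
        return parse_multi_line_statement(tokens)
    return parse_single_line_statement(tokens)


def parse_single_line_statement(tokens):
    parts = []
    while tokens and tokens[0][0] != "NEWLINE":
        parts.append(tokens.popleft()[1])
    if tokens:
        tokens.popleft()  # consume the NEWLINE that ends the statement
    return ("Simple", " ".join(parts))


def parse_multi_line_statement(tokens):
    keyword = expect(tokens, "KEYWORD")
    condition = []
    while tokens[0][0] != "COLON":
        condition.append(tokens.popleft()[1])
    expect(tokens, "COLON")
    return (keyword.capitalize(), " ".join(condition), parse_block(tokens))


def parse_block(tokens):
    # block: NEWLINE INDENT statement+ DEDENT
    expect(tokens, "NEWLINE")
    expect(tokens, "INDENT")
    body = []
    while tokens and tokens[0][0] != "DEDENT":
        body.append(parse_statement(tokens))  # may recurse into parse_block
    if tokens:
        tokens.popleft()  # consume the DEDENT that closes this block
    return body


# Tokens for:
#   while x:
#       a
#       if y:
#           b
#   c
tokens = deque([
    ("KEYWORD", "while"), ("NAME", "x"), ("COLON", ":"), ("NEWLINE", ""),
    ("INDENT", ""), ("NAME", "a"), ("NEWLINE", ""),
    ("KEYWORD", "if"), ("NAME", "y"), ("COLON", ":"), ("NEWLINE", ""),
    ("INDENT", ""), ("NAME", "b"), ("NEWLINE", ""),
    ("DEDENT", ""), ("DEDENT", ""),
    ("NAME", "c"), ("NEWLINE", ""),
])
print(parse_module(tokens))
```

Each `parse_block` call stops at the first `DEDENT` it sees and consumes only that one, so the inner block returns first and the outer block picks up exactly where it left off.
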
Right after the yellow parseBlock returns, the third statement in the green block has finally finished parsing. Then the green parseBlock finds a dedent and returns. Then the blue one has finally parsed its first statement, which is the while statement. It parses the second statement, finds that the file has ended, and returns. That's how the parsing phase goes.
From there, we can move on to the interpreter. How does code run? That's the question. Say we have a parse tree like the one that we saw. How do we make it run? Well, it's just how you might have been taught that Python programs run: top to bottom, left to right. It actually is that simple. A Python module is a bunch of statements. How do you run those statements? One by one, top to bottom. Say that's the program. The problem comes from the deeply nested, recursive nature of the AST. We have a module, but inside its body there can be any kind of node, really, and inside that node there can be any kind of node. So how do you deal with that? How many if statements would you like to write? Not many, I hope. To deal with that, there's this thing called the visitor pattern. That's what will help us do the entire interpretation.
Let's actually take a look at the interpreter. We have an interpreter class, and it has a global scope. Scopes are really important. We have a bunch of built-in functions that exist in the global scope. Then we set the current scope, which currently happens to be the global scope as well. And then we have the top-level visit method. Visit is what will be called with the module.
Visit is supposed to interpret the entire parse tree. What we do is fairly simple. We look at the kind of node it is. Right now that will be Module, and we try to look for a function with the same name. So we look for visit module and then we just run that. And what is visit module doing? It goes over all the statements in the body and just calls self.visit on each statement. Notice the use of self.visit. The thing that's very interesting about the visitor pattern is that we can just keep visiting the children in the sequence that we want them to be run. So you can think of self.visit as just self.run. We just run that statement. Then how does that get run? Say that it happens to be a print statement. That will be a function call, so its type will be Call. It will look for visit call, and visit call will be called with that node.
Let's go through a simple example of this. I think that would be better. Say that we are doing something like this: python -m interpreted.parser. Let's do x equals two plus two, then print x. So that's what the parse tree of that looks like. If you want to make it look a bit more like a tree, we can just run black on it. So that's the transformed output. The module body has an Assign node, which is the first statement. The second statement is called an Expr statement, because the thing inside it is an expression. Function calls can return values, but since we don't care about the value in this case, we just say: treat this expression as a statement, and don't worry about the return value. Let's try to walk through this. When we visit the first statement, we go through visit assign, and visit assign will try to visit the value.
And what's the value? The value is a binary operation between the constant two and the constant two, and the operation there is plus. So let's look at visit bin op. We run the left and the right side of the binary operation by just calling self.visit. Since these happen to be constants, let's look at visit constant. That just takes the value out, which is two, wraps it in a value object, and returns that. If we go back up to bin op, we get a value object with the value two and another value object with the value two. We have defined all the operations here, and since the operator is plus, we return a new value with the value two plus two, which is four.
Now we have to go back up to visit assign. We had only visited the value part of it, which gives us a value object with the value four. Doing an assignment to a name is fairly simple, and we happen to be assigning to the name x. In that case, we just set the value for that name in the current scope. Fairly simple. For assigning to a dictionary key or to a list index, it gets much more complicated, but we're not doing that, so I'm not going to worry about it. Now we have set the value four to the name x in the current scope.
So what happens next? The next thing is a call to the name print with one argument, which is the name x. So we do visit expr statement, and that's going to be fairly anticlimactic. We just visit the expression inside, and we don't return the value, so we return None. Okay, let's look at visit call then. We are visiting a call, and the call is to the function print. Print is a name, so we go to visit name. What does that do?
We take the name out and check whether it's in the current scope. If it is not set in the current scope, we try to look for it in the global scope. If it's not set there either, we say that the name has not been defined. In our case, both scopes happen to be the same, and we have defined, I believe, self.globals. Yes, we have defined print to be the print function. So what do we do when we actually find it? Where is it called? We found the function, which happens to be an object that we created, called print. Since it is of the type function, we collect all the arguments by visiting every single arg. The argument is also a name, so it will be taken out of the current scope, and if it's not there, it will be looked up in the global scope. So we take the value of x out of the scope, which is four, and that becomes the value four in these arguments. Then we take this function object and call it with the arguments. Let's look at the print function. When we call it, we just convert all the arguments to strings and print them. So that's how this program should run. If I were to move it out of the parsing phase and just run it (oh yeah, the pretty part is not there as well), that should print four.
Now, one glaring thing that's still left to be discussed is: what about functions? Sure, there's a global scope, but what about function scopes and things like that? Well, interpreting functions is also not very tricky. We have this thing here called a user-defined function, or a user function. Whenever we see a function definition, we just create a user function with the definition, which is the body, all the statements, and everything. We wrap that inside this user function object and store it in the current scope.
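
Going back to the `x = 2 + 2` walkthrough, the visitor dispatch can be sketched in a self-contained way. Note the assumptions: this borrows the standard library's `ast` module to produce the tree instead of the talk's own tokenizer and parser, it works on plain Python values rather than wrapped value objects, it reuses the built-in `print` instead of a custom function object, and only addition is supported.

```python
import ast


class Interpreter:
    def __init__(self):
        self.globals = {"print": print}  # built-ins live in the global scope
        self.scope = self.globals        # current scope starts out as global

    def visit(self, node):
        # Look for a method named after the node's class, e.g. visit_Module.
        method = getattr(self, "visit_" + type(node).__name__, None)
        if method is None:
            raise NotImplementedError(type(node).__name__)
        return method(node)

    def visit_Module(self, node):
        for statement in node.body:  # run statements top to bottom
            self.visit(statement)

    def visit_Assign(self, node):
        value = self.visit(node.value)
        target = node.targets[0]
        if not isinstance(target, ast.Name):
            raise NotImplementedError("only assignment to a name is supported")
        self.scope[target.id] = value

    def visit_Expr(self, node):
        self.visit(node.value)  # evaluate, then discard the result

    def visit_BinOp(self, node):
        left = self.visit(node.left)
        right = self.visit(node.right)
        if isinstance(node.op, ast.Add):
            return left + right
        raise NotImplementedError(type(node.op).__name__)

    def visit_Constant(self, node):
        return node.value

    def visit_Name(self, node):
        if node.id in self.scope:
            return self.scope[node.id]
        if node.id in self.globals:
            return self.globals[node.id]
        raise NameError(f"name {node.id!r} is not defined")

    def visit_Call(self, node):
        function = self.visit(node.func)
        arguments = [self.visit(argument) for argument in node.args]
        return function(*arguments)


interpreter = Interpreter()
interpreter.visit(ast.parse("x = 2 + 2\nprint(x)"))  # prints 4
```

The single `getattr` lookup in `visit` replaces what would otherwise be a long chain of if statements over node types.
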
You could define the function at the global scope, or inside a local scope, or something like that. When we run the function, we get the user function object from the scope, and when we call it, we are calling this part of the user function. When we do a call, we store the current scope of the interpreter and create a new scope for the function. We set that to be the interpreter's scope for a bit. Then, for every argument that was passed to the function call, we take out the argument values, and we take the function parameters from the function definition that we stored when the user function object was created. In the new scope, we assign these arguments to the parameters. Then we visit every statement that was defined in the function. If we encounter a return statement somewhere in there, we catch that as an exception and return it as the value of the function call. And finally, we set the scope of the interpreter back to the parent scope when the function exits.
With this, we should have a working interpreter. It does take implementing a visit function for every single type of node that you have in your parser, but the implementation of most of them is fairly simple. For break and continue, we just raise an exception. My favorite one is visit pass, where we just pass. With around 600 or 700 lines of interpreter code, we should be able to run this code inside our interpreter. And yeah, it works. I think that's it.
But one thing that a few of you might have in mind is: wait, that's a toy implementation that doesn't really do anything. Where's my favorite Python feature in this implementation? There's a bunch of things that we haven't even talked about.
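
Returning to function calls for a moment, the scope swap and the return-as-exception trick can be sketched as below. As before, this is an illustration under assumptions: the tree comes from the standard library's `ast` module rather than the talk's parser, the class names `UserFunction` and `ReturnSignal` are made up, and only positional parameters and addition are handled.

```python
import ast


class ReturnSignal(Exception):
    """Raised by a return statement to unwind out of the function body."""

    def __init__(self, value):
        self.value = value


class UserFunction:
    def __init__(self, definition):
        self.definition = definition  # the FunctionDef node, body included

    def call(self, interpreter, arguments):
        parent_scope = interpreter.scope
        interpreter.scope = {}  # fresh scope for the duration of the call
        try:
            parameters = self.definition.args.args
            for parameter, argument in zip(parameters, arguments):
                interpreter.scope[parameter.arg] = argument
            for statement in self.definition.body:
                interpreter.visit(statement)
        except ReturnSignal as signal:
            return signal.value
        finally:
            interpreter.scope = parent_scope  # restore on every exit path
        return None  # the body finished without a return statement


class Interpreter:
    def __init__(self):
        self.globals = {}
        self.scope = self.globals

    def visit(self, node):
        return getattr(self, "visit_" + type(node).__name__)(node)

    def visit_Module(self, node):
        for statement in node.body:
            self.visit(statement)

    def visit_FunctionDef(self, node):
        self.scope[node.name] = UserFunction(node)  # store in current scope

    def visit_Return(self, node):
        value = self.visit(node.value) if node.value is not None else None
        raise ReturnSignal(value)

    def visit_Assign(self, node):
        self.scope[node.targets[0].id] = self.visit(node.value)

    def visit_BinOp(self, node):
        if not isinstance(node.op, ast.Add):
            raise NotImplementedError("only + is supported in this sketch")
        return self.visit(node.left) + self.visit(node.right)

    def visit_Constant(self, node):
        return node.value

    def visit_Name(self, node):
        if node.id in self.scope:
            return self.scope[node.id]
        return self.globals[node.id]

    def visit_Call(self, node):
        function = self.visit(node.func)
        arguments = [self.visit(argument) for argument in node.args]
        return function.call(self, arguments)


source = """
def add(a, b):
    total = a + b
    return total

result = add(2, 3)
"""
interpreter = Interpreter()
interpreter.visit(ast.parse(source))
print(interpreter.globals["result"])
```

Using an exception for `return` means a return nested arbitrarily deep inside the function body unwinds straight to `call` without every visit method having to check for it.
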
Things we haven't covered include the bytecode that gets generated in the actual implementation. Things like classes are not implemented. Threads, I haven't even looked at that code, to be honest. But the thing is, most of these can be implemented fairly simply. The framework that you have, with this tokenizer, parser, and interpreter, can be expanded upon to add all of these one by one. And you know the best part? There are all these other libraries that are written purely in Python. asyncio, for example, is a giant blob of Python. It's something like 15,000 lines of code, and you don't have to re-implement that to be able to re-implement Python. It's Python code; you can just run it. The same holds true for all of the separate libraries as well. CPython as a whole is, I think, one-third C code and two-thirds Python code.
That's basically everything I have. If you want to look at some resources: the interpreter that I have created takes a large amount of influence from the book Crafting Interpreters. And if you want to learn about the exact internals of CPython, there's a book by Anthony Shaw called CPython Internals. It's a really good book. Thanks for your time.
So we have around four minutes for questions. If anyone has any questions, please sign up.
Hi. When you were talking about parsing blocks, you said that the parser looks for the dedent and then sort of returns. Does that mean that if you add a bunch of empty lines with just indentation, for example, the parser would actually take a bit longer to parse?
Well, the parser won't, but the tokenizer will. At the tokenization phase, we essentially ignore, or de-duplicate, the new lines that we see. So if there's just a bunch of whitespace, the tokenizer will skip over it and leave only one new line token to look at.
So, not really, but sort of.
OK. This may be an unfair question, but taken to its natural conclusion, are there any benefits to writing your own Python interpreter? By that I mean, can you maybe implement a specialized subset of Python that is optimized for memory, optimized for performance, something along those lines?
Yeah, most definitely. If you look at the implementations of Stackless Python or MicroPython, that's exactly what they've done. They have taken the Python interpreter and specialized it for the use case that they want.
OK, thank you.
Hi. I don't know if you're familiar with it, but I've been doing some work recently with Lark, which I think is a Python library that lets you write custom grammars and parsers. I was curious whether you've used anything like that, and what the trade-offs might be for using something like that versus doing it from scratch, like you've done in the talk.
I'm sorry, what does the tool do?
It adds custom grammars to Python. It lets you basically define your own grammar, and then it gives you a parser for that grammar. So you could write your own language with it quite quickly.
Yeah. If you were to write a non-toy implementation of something like this, you should probably use these things called parser generators. I didn't use them because I wanted to do the entire talk in 30 minutes, but there are a bunch of these tools that can take in your grammar and give you a parser out of it. But there are also huge languages, Ruby for example, that for the longest time had a handwritten parser, and I don't think that Ruby is a toy language. Both approaches exist. They have their own trade-offs, but both are valid ways to do the same thing.
Cool. Great. Thank you so much, Tushar, for the talk. We're going to have the keynote happening here next, really soon, so stick around for that.