Hello, everyone. So I'm Peter. I work for Red Hat on Python. Great job, by the way. And today I will be talking about bytecodes and stacks. You might not have seen the talk on the board downstairs, because it's a late addition to the conference. Also, you might have been to a talk that is a bit more advanced than this one already, so I apologize for that. Hopefully, this will make some of the basics a bit clearer. Not very good timing at the end of the conference, but bear with me. So I will be talking about how Python actually takes the code you've written and manages to do what the code says. We'll be looking at the compiler, the abstract syntax tree, tokenization, and at the execution model, which is frames, and so on. This is for Python version 3.7. There are always little tweaks to all of this between different Python versions, so if you're using a different version, not everything will work exactly the same. Also, I've uploaded the notebooks I will be using to the EuroPython website, so if you have some more time, then you can just play along. Right. So I have this little Python module. It does very simple stuff: it assigns some variables, prints something, and it defines a function. We'll be using this for examples throughout the talk. Now, this is just a piece of text, just some bytes or some Unicode characters sitting on your disk. Then you tell Python to run it. So what actually happens? Python opens the file, reads the text, and the first thing it needs to do is tokenize, which means cut the text up into very small pieces. Actually, how many of you have taken a compilers class in school? Almost half, I guess. So I guess you'll be a bit bored. How many have taken a compilers class on Python? None. Okay. There are just a few differences from the traditional model. So tokenization: cutting the source text into little pieces that we can then use. There's a module for it. You can run the tokenize module, and it will give you something like this.
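The tokenization step described here can be sketched with the standard tokenize module. The two-line source string below is a stand-in for the speaker's example module, which is not reproduced in the transcript:

```python
# A minimal sketch of the tokenization step, using a tiny stand-in
# source; tokenize.tokenize wants a readline callable yielding bytes.
import io
import tokenize

source = b"a = 3\nprint(a * 2)\n"

for tok in tokenize.tokenize(io.BytesIO(source).readline):
    # Each token records its type, text, and (row, column) positions.
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)
```

The `=` on the first line comes out as an OP token spanning columns two to three, which is exactly the kind of record the talk shows.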
So we have all the information from the original file: let's say this equal sign is an operator, and it occurs on line one between characters two and three. Like this, the text is cut up into little pieces called tokens that the rest of the process then uses. Now, a little difference from the traditional model that you will not find in other languages: if you have indentation, traditional languages would have braces at the beginning and end, which then get converted into brace tokens. In Python, we have INDENT and DEDENT tokens. So the indentation is actually detected in the tokenization phase. If you have some text between parentheses, where indentation doesn't matter, then the tokenizer doesn't emit these INDENT and DEDENT tokens. So there's a bit of cleverness already built into the tokenization step. If you'd like to do this from Python, you just open the file, you call tokenize.tokenize, and you get all the information to play with. This is really nice if you're, for example, writing a Python-like language that has all the same tokens. You can just use the tokenization for free and then build some kind of parser on top. So speaking of parsers, that's actually the next step that happens. Now you have a stream of tokens, and you need to turn it into an abstract syntax tree. Now, there was a talk about abstract syntax trees before. Is there anyone who does not know what an AST is? Okay. Maybe I should have written my abstract a little bit clearer. So an AST, as I guess you all know, is a representation of what will actually happen in the program. There's an ast module which can parse the text, so that's the tokenization and the parsing in one step. It gives you a Module object which has some fields. In this case, it has a body, and inside that body we have my assignments, an expression which is a function call, and a function definition. Python itself has an ast.dump function, which is not really useful because it just puts everything in a single string.
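The parsing step can be sketched with the ast module. The source string here is again a stand-in mimicking the speaker's example module:

```python
# A sketch of the parsing step: ast.parse does tokenization and
# parsing in one go, producing a Module node.
import ast

source = "a = 3\nprint(a * 2)\ndef func(x):\n    return x * 2\n"
tree = ast.parse(source)

# The Module's body holds one node per top-level statement: here an
# assignment, an expression (the function call), and a function definition.
for node in tree.body:
    print(type(node).__name__)

# ast.dump packs the whole structure into one long string.
print(ast.dump(tree.body[0]))
```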
There are lots of modules on PyPI that make this a little bit nicer to work with. astpretty is something I just found; previous versions of this talk actually had the code for it in there. It's not hard to find. So here's the structure of my code: an assignment with a target and a value, another assignment with a target and a value, and an expression which is a call. I assume all of you have seen things like this. So now we have an AST. The next step is to compile this AST into bytecode. Again, I'll ask: is there anyone who does not know about compiling to bytecode? A few people. Okay. So let's run through this. Again, I have this module, I have my AST, and I can call the built-in compile function, which will take this AST and package it up into a code object. What this takes is the AST, a file name, so it can tag the code object with where it's coming from; tracebacks can then use that information. And there's a mode; I'm not getting into that in this talk. So this will give us a code object. A code object is something Python can run. So if I do exec of the code object, it will actually run all the code contained in there. Now, what is inside a code object? There's a little function called code_info that gives me most of the things inside there, formatted pretty nicely. So it has some meta information, like the file name. Code objects are also used for functions, so it has lots of stuff about arguments, what kind of information goes in. Some internals: the number of local variables, the stack size, which is an implementation detail. It has flags to say what kind of features this code object has, for example, if there are any free variables or nonlocals, so it needs to reach out. Or, in Python 2, there were two kinds of functions, and one was faster. So the type of the code object is encoded in these flags. We also have a tuple of constants. Whatever constants are present are put into a tuple, and then they're referred to by an index.
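The compile-and-run step can be sketched like this, assuming the same kind of stand-in source as before:

```python
# A sketch of compiling an AST into a code object and running it.
import ast
import dis

source = "a = 3\nb = a * 2\n"
tree = ast.parse(source)

# compile() takes the AST, a file name used to tag the code object
# (tracebacks show it), and a mode ("exec" for a module).
code = compile(tree, "<example>", "exec")

# exec actually runs the instructions in the code object.
namespace = {}
exec(code, namespace)
print(namespace["a"], namespace["b"])

# dis.code_info pretty-prints the metadata: names, constants, flags...
print(dis.code_info(code))
```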
We also have names, because inside the code object Python usually refers to everything by numerical index. When it needs to print a traceback or generate the mapping of variables to values, it will look here and get the names of the actual variables. If I write my own function... now, I guess, don't spend too much time reading this. This just prints out all the attributes of my code object. There's a bit more. There's lots of the same information we had before, but there's something that this module hides from us. One of those is co_code, which we couldn't read anyway, but these are the actual instructions. This is what Python will then execute as your code runs. Another thing we have is co_lnotab, which is a mapping of the bytecode to line numbers. Again, when a stack trace is generated, Python doesn't keep track of lines, but of where in the bytecode it is, and then it uses this line number table to map back to individual lines. That's mostly it. Whenever I have a function in my module, as I had before, the code object for that function is compiled separately and put in as a constant of the enclosing module. Then what happens is, when that function is created, when the def statement is actually executed, this code object gets packaged into a function. That's where it is taken from. If you want to look for information on that, you can just get it out of the constants. And flags are a little bit different for functions than they are for modules, for example. I talked a bit about the bytecode; let's go a bit more in depth. If I have my AST again, my bytecode looks a bit like this. It's bytes. It's just numbers from zero to 255. If I print them out as actual numbers, they become a bit more readable. You can see a bit of a pattern here: it's always some big number and some small number, big number and small number, right? Actually, the bytecodes come in pairs of two bytes, so it's not really bytecode, it's wordcode.
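Poking at these attributes directly can be sketched like this; co_lnotab is the Python 3.7 name for the line number table, and the nested function ends up as a constant of the enclosing module's code object:

```python
# A sketch of inspecting a code object's raw attributes directly.
import types

source = "a = 3\ndef func(x):\n    return x * 2\n"
code = compile(source, "<example>", "exec")

# co_code is the raw instruction stream: just bytes, 0 to 255.
print(list(code.co_code))

# co_lnotab maps bytecode offsets back to source line numbers (3.7).
print(list(code.co_lnotab))

# The function body is a separate code object stored in co_consts;
# the def statement later wraps it into a function object.
nested = [c for c in code.co_consts if isinstance(c, types.CodeType)]
print(nested[0].co_name)
```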
It has units of two bytes: always an instruction and an argument. The dis module can tell us what all these numbers mean. First, we have the 100 and the zero. The 100 means load a constant, and the zero says which constant this will load. This was for the statement a = 3. What this needs to do is load the number three, which puts it onto a stack, and then it stores that value into the variable a. So this is two instructions: load a number, store the number. It's similar for the second line, for calling a function: it gets all the arguments of the function, it gets the function itself. In this case I was multiplying something, so it calls multiply, then it calls the function, and because we're not using the return value, it gets rid of the return value, right? So bytecode is just a set of instructions that Python follows linearly as it goes along. You can also run the dis module from the command line and get the exact same result. You can also use it programmatically, so if you call get_instructions, you get nice little objects that tell you what each individual bytecode means. This is pretty similar output to what we had before. So now that we have our bytecode, our information about what the program should do, what Python does is save that, because all this parsing is quite an expensive operation. Python will save this, cache this, so that next time the same module is loaded, it doesn't have to parse it again. So, serialization. It's a bit like, I guess at this level most of you know it, but it's a bit like JSON. You get some values, you can write them into a file. This doesn't use JSON; it uses a module called marshal. marshal has pretty much the same interface as JSON: you call dumps, you get some bytes corresponding to whatever you put in. marshal is specially made for Python code objects, so it can only serialize simple stuff like tuples, numbers, and code objects themselves. Anything mutable will not appear in this dump.
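Both pieces here, the programmatic dis interface and the marshal round-trip, can be sketched in a few lines:

```python
# A sketch of dis.get_instructions and a marshal round-trip.
import dis
import marshal

code = compile("a = 3", "<example>", "exec")

# Each Instruction object names the opcode and decodes its argument.
for ins in dis.get_instructions(code):
    print(ins.offset, ins.opname, ins.argval)

# marshal serializes the code object the same way a .pyc file does.
blob = marshal.dumps(code)
restored = marshal.loads(blob)
print(restored.co_names)
```

Exact opcode sequences differ between CPython versions, but a LOAD_CONST of 3 followed by a STORE_NAME of a is the shape the talk walks through.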
So if your program creates a list, there's no constant for that list in the bytecode; there's just an instruction to create a new list. For immutable stuff like tuples and numbers, the constants are already in there. So, I skipped a bit: where does Python put all this? It puts it in a PYC file. Is anybody not familiar with PYC files? Everybody is? Great. So this is what a PYC file looks like inside. It's, you know, again, just a bunch of bytes. If I list what the serialized version of the code object is, that's also a list of bytes, and if I compare them, the PYC file has a header at the start, and the rest is just the marshalled, serialized code object, right? So we can verify: the first 16 bytes are a header, and the rest is just the code object. What is this header? The header is mostly there for cache invalidation. When you change the source code, Python has to check if the old cache is still valid. If you change it, it's not valid, and the header has a bit of information to check if we're still fresh. So the first four bytes are what's called a magic number. It's different for every Python version. It's available as importlib.util.MAGIC_NUMBER, and this just tells Python: this is a PYC file for this particular version of the bytecode. If you run a PYC file with a different version of Python, it won't load. The next four bytes are, in Python 3.7, usually all zeros. These are flags, because there are a few different formats of PYC files that do a different kind of cache invalidation. Usually what you get would be the zeros. The next four bytes are a number of seconds since the Unix epoch, so it's a timestamp. Whenever you change the source code, that's the timestamp of when you changed it, and Python will know that something changed. Of course, if you do many changes in the same second, you might run into trouble, which is why we have the last thing in the header, and that is the size of the original source code.
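Decoding that header can be sketched like this; the 16-byte layout (magic number, flags, timestamp, source size) is the Python 3.7+ format, and the file names here are made up for the example:

```python
# A sketch of reading a .pyc header, assuming the 3.7+ 16-byte layout.
import importlib.util
import os
import py_compile
import struct
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "mod.py")
    with open(src, "w") as f:
        f.write("a = 3\n")
    # py_compile writes the .pyc for us.
    pyc = py_compile.compile(src, cfile=os.path.join(tmp, "mod.pyc"))
    with open(pyc, "rb") as f:
        header = f.read(16)

magic = header[:4]
flags, mtime, size = struct.unpack("<III", header[4:16])
print(magic == importlib.util.MAGIC_NUMBER)  # magic matches this interpreter
print(flags, mtime, size)                    # size is len("a = 3\n"), i.e. 6
```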
So if you add something, the size should hopefully change, and the cache will be invalidated. This is not perfect, but it works in most cases. Has anybody had problems with PYC cache invalidation? Wow, that's many more than I expected. So for those of you who raised their hands, there is now, in Python 3.7, an option to use actual hashes of the source code. For this, you set some flags; the compileall module has an option to use hashing instead, or to not check anything at all. So if this is not okay for you, there are options now. All right. So that was caching. How much time do I have? About five minutes. All right. With that out of the way, I will talk a little bit about the execution model. So we have the code object, which has all the instructions, and the code object is immutable, which of course is different from functions. If you have a Python function, you know, I have this func in my module, I can put custom attributes on it and it will remember them. I can do mostly anything with it. I can't do the same thing with the code object. That's immutable, and it's actually shared between all functions that are defined on the same line. So if I have a loop and define functions in that loop, there will be different function objects: they might have different defaults for their arguments, they might even have different names, and the attributes we set on them might be different, but they are sharing the same code object. So they're different; if I call them, they can do different things. But the code, the actual instructions, are the same, because for each function definition, the enclosing module will just have one code object that will be shared. Now, so I have an immutable code object. I have a mutable function wrapper on top of it that supplies the name and the default arguments. And every time I call a function, there's another object on top, which is called a frame.
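The sharing described above can be sketched directly; `bump` is just an illustrative default argument, not from the talk:

```python
# A sketch of code-object sharing: functions defined in a loop are
# distinct function objects but share one immutable code object.
funcs = []
for i in range(3):
    def f(x, bump=i):
        return x + bump
    funcs.append(f)

# Distinct function objects with different default arguments...
print(funcs[0] is funcs[1])
print(funcs[0].__defaults__, funcs[2].__defaults__)

# ...all wrapping the very same code object.
print(funcs[0].__code__ is funcs[1].__code__ is funcs[2].__code__)
```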
Now, the frame will remember what function is executing, where in the function Python currently is, and where to go when that function returns. So let's see. The best way to look at this stuff is with the inspect module. I have a function that will get the current frame and print some information from it. So what we got is a frame object. It has some code. It has the last instruction, so that's the position in the bytecode where this frame is. It has a line number, which is actually not stored in the frame; I think it's computed on the fly from the position when I ask for it. It has some local variables, and it has the frame of the function that called this one. In this case, that's the frame for my Jupyter Notebook cell. Now, the interesting thing about frames is that they are mutable. As the function is executed, the current instruction number changes. So whenever I printed this line number, it corresponded to this line; if I do this multiple times, I will get different line numbers. So as the function is executed, the frame gets updated. Also, the locals might get updated: if I set a variable, I will get that in the dict of the locals. All right. So all this is packaged up in my frame object. As for f_back, that is the frame of the function calling this one, so in this case it will give me my notebook cell. That's the place to return to when my function is done. This is, of course, used for printing tracebacks: when you get an exception, this chain of frames is walked back and you get information from all the frames visited. And I think I don't have too much time left. So if you have any kind of questions... I'm aware that... okay. So I'm aware that this was a bit quick. Originally, I gave this as a three-hour talk and I could go into a lot more detail. I had hoped to do this at the start of the conference, to give some kind of background for everything, but that didn't work out.
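The frame introspection described here can be sketched with inspect.currentframe; the helper name `peek` is made up, standing in for the speaker's notebook function:

```python
# A sketch of peeking at the current frame, as shown in the talk.
import inspect

def peek():
    frame = inspect.currentframe()
    # f_lasti is the bytecode offset; f_lineno is derived from it.
    print(frame.f_code.co_name, frame.f_lasti, frame.f_lineno)
    # f_locals holds the frame's local variables as a dict.
    print(sorted(frame.f_locals))
    # f_back is the caller's frame, the place to return to.
    print("caller:", frame.f_back.f_code.co_name)
    return frame.f_code.co_name, frame.f_back.f_code.co_name

peek()
```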
So if you have any questions, if you want to go into more detail, if there's something that sparked your interest, then either ask a question or talk to me later. Yes, of course, but let's first thank the speaker. Thank you, Peter. Okay. Who's going to ask the first question? Who's going to be first? Okay. So thank you for the talk. Actually, I would probably hate a person asking such a question, but I will still try. Do you think the things you've shown could be useful in regular Python programming, for optimizing or debugging, or are they just interesting? Yes, they will definitely be useful. When I asked how many people had trouble with PYC cache invalidation, I was really surprised. And obviously, when you're debugging that, it helps to know what the PYC actually is and why it's giving me stale information. Also for debugging: obviously it's a lot more useful if you are writing a debugger than if you're using one, but it still helps to know how the things are executed. What happens when you define a function? There's something that's compiled ahead of time, then it's wrapped up in a function, and then you get a frame whenever the function is called. I think this is good knowledge, to have an idea how the computer works. It definitely helps in debugging if you have a better knowledge of the internals. Are any of the modules that you've imported to give us the talk and to show us all those amazing things actually written in Python, or are they all written in C? It depends. A surprising amount of them are Python wrappers on C libraries. One of them I've shown you is inspect. And inspect, I believe, has... okay, if I call it from the command line, python -m inspect, is it inspect? Yeah. Then it'll actually give you the source code of Python modules, right? So this is how you could check. But yeah, most of them are Python wrappers. My question is regarding the interpreter.
The information that you presented regarding bytecodes and so on: is it specific to the normal Python interpreter, or is it shared by PyPy, or does it have nothing to do with the interpreter at all? All right. This is something I actually forgot to say at the beginning. All of this is specific to CPython. Specifically, it's for CPython 3.7. It should stay the same for 3.7.1, 3.7.2, but it can change dramatically across Python versions. For example, I think in Python 3.6 we got the wordcode, where each instruction is two bytes long, which improved speed a bit. Before that, the bytecode sizes were variable, according to whether the instruction took an argument or not. So that just threw off all the tools that had to deal with bytecode. All of this is specific to CPython and also usually subject to change. So, anyone else? So, you talked about how it gets to bytecode from the original code. What happens after that, to get to the machine code that actually runs? Oh. Was that too big a question? Okay. So what happens, if I'm in the frame, if Python is actually executing this? There is no machine code involved. It stays bytecode, and inside the Python interpreter there is a giant loop with a C switch statement. So it's just a bunch of ifs, and it just looks at the current instruction and calls a C function, or calls some C code that evaluates the instruction. Nothing gets compiled to machine code in CPython. So we have time for one more question, I think. Okay. Do you have the three-hour talk somewhere? Is it a video or something? I don't think I do. Okay. But if you organize a meet-up, then let me know and I can maybe come over. It's really interesting stuff, and I always get interesting questions from people who have a bit more time to look into this more deeply. So I would be interested in maybe getting that recorded. One short question, maybe. Very short. Do you have any links or places we could go to look up more information about this topic? Yes.
If you found this interesting, then it's uploaded on the EuroPython site. The notebooks are there, and I have a little summary that sometimes has a link, for example, to the full grammar specification. Also, most of this you can find in the Python documentation; that's supposed to be very informative. If it's not, then file a bug maybe, or contribute. Or if you have a question, I'm always happy to discuss things, either here at the conference tomorrow maybe, or by email. Thank you. Thank you very much. Let's make some noise.