Thank you. So for those of you who attended the previous talk, you saw how to build a JIT, a just-in-time compiler, from the ground up. In this talk, I guess we're going in the other direction: we'll start from a finished product, a complicated piece of software, PyPy, and try to figure out how the JIT inside it works. About myself: I'm Ronan Lamy, and I've been a PyPy core developer for about seven years now. Before preparing this talk, I actually didn't know that much about the JIT; to me it was a magic black box. By preparing this talk I started to open it up, and I want to show you how you can open it too.

So let's talk about PyPy first, and the way I like to introduce it is with a quote from someone whose name you might know: "If you want your code to magically run faster, you should probably just use PyPy." And that's the goal of PyPy: to run Python code faster. Nowadays, PyPy is a full and fully compliant implementation of Python. We have had, for several years now, complete support for Python 2.7, and we will continue to support 2.7 for an indefinite period. We've also released a version that supports Python 3.5, and we have beta support for 3.6. It's not fully complete yet, there are still a few things in the standard library to implement, but we should have a full release of 3.6 in the not too distant future. I don't want to give a date, but this year, definitely.

Of course, the main draw of PyPy is that it can be much faster than CPython when running pure Python code. I don't like to give exact numbers because it depends a lot on the kind of code you're running. Some code can't really be improved over what CPython does, but for other sorts of code you can get huge speedups; it can be a game changer, a difference of orders of magnitude. You go from Python speed to more or less C speed. All of that is due to the JIT, which we'll talk about in a few minutes. But I just want to remind you that pure Python code is not the only thing we care about when we use Python. PyPy also has a good story for C code, and for all the programming languages that expose the same interfaces as C. The best way to talk to a compiled language from PyPy is to use CFFI. It's convenient in general for the whole Python world, also on CPython, but on PyPy, CFFI is particularly well optimized and works well with the JIT. And of course, in the Python world we can't get by without supporting all the C extensions, things like NumPy, scikit-learn, Cython. So PyPy has an emulation layer to support these extensions. It's annoying for us, because we could run the code faster if it were written in Python instead of C, but anyway, we have the compatibility. There's a site where you can get binary wheels; they are not yet on PyPI, so that's why you have to use that repo for now. With that, I think PyPy offers very good compatibility for everything you'd like to do in Python.

But that's not the main point of my talk. I'd like to show you the internals of PyPy, and specifically of the JIT. Let's talk about the internals of PyPy first, and I'll start by comparing it with CPython, which most of you are probably more familiar with. CPython is written in C, as the name indicates, and once you compile the sources, you get an executable that has two main components: first, the compiler, which takes your Python code and transforms it into bytecode, and then the interpreter proper, which actually runs the bytecode.
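As a quick aside (my own snippet, not from the talk), you can see both halves from the Python REPL:

    # The compiler half: compile() turns source code into a code object
    # containing bytecode, without running anything.
    code = compile("x = 1 + 2", "<example>", "exec")

    # The interpreter half executes that code object.
    namespace = {}
    exec(code, namespace)
    print(namespace["x"])  # 3

    # dis shows the bytecode the interpreter actually sees.
    import dis
    dis.dis(code)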
In PyPy, it's quite similar. The first difference is the implementation language: PyPy is implemented in something called RPython, which is a subset of Python 2, and the RPython toolchain, something built by the PyPy project, transforms that source code into the PyPy executable. So thanks to this toolchain, you get the PyPy executable, which has the same bytecode compiler as CPython and a bytecode interpreter quite similar to CPython's. But it also has the JIT, which is the part on the right of this diagram. What the JIT does is that, at runtime, it uses runtime information to optimize the code the interpreter is currently running, and it produces machine code, which is a lot more efficient than interpreting bytecodes one by one. It can switch back and forth between the interpreted mode and the assembler mode.

Let's talk a bit about the toolchain. The starting point is RPython code. That code is imported at the Python level by the toolchain, which then analyzes this Python 2 code, does type inference and a variety of other things, and reaches a stage where the whole code of the interpreter is represented as control flow graphs: a graph of all the operations that happen inside the interpreter. From that representation, the toolchain adds the garbage collector, adds the JIT, and then transforms everything into C code, which is compiled, and at the end you get the PyPy executable.

I tend to explain this quite often, and the next part of the conversation usually goes a bit like this: "What about the JIT?" "Well, it's just complicated. It's magic." But someone who's actually a PyPy core developer recently said that if you spend enough time with it, any magic is just careful and clever putting together of bits. So let's just spend some time with the JIT, and it won't seem as scary anymore.

To do that, we have to run some code, and I have a somewhat stupid example. Well, it has to fit on a slide, so it can only be so complicated. The idea is that you have a little library for working with physical quantities: a quantity has a value and a unit, and when you do operations on these quantities, you have to look at the units and then do the actual operation. For simplicity, we only implement addition here, and we just check that it's the same unit, because you don't want to add meters and seconds, and we don't want to bother with weird things like feet and yards. If we have the same unit, we just return a new object that represents the sum of the two. And since we're interested in performance and in what the JIT can do with such code, we have a simple and rather stupid benchmark where we just add these quantity objects repeatedly, 500 million times (reconstructed as a sketch below).

That's quite a lot of operations, but let's see what PyPy can do with it, and what the JIT can do with it. So here I'm on PyPy 3; I just run the file under time, and it takes less than a second. That's a very crude benchmark, so let's run it twice to see if it's stable. Hardly any difference. And let's compare with CPython to get a feel for it. It shouldn't be too long now... all right, 12 seconds, okay. So, well, 12 times faster: that looks decent, but actually I cheated. In reality, the code has a slight addition to what I just showed you; on CPython, you can see it down at the bottom: I'm running 100 times fewer iterations on CPython, because I didn't want to wait the whole talk for it to return.
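The slide's exact code wasn't captured, but reconstructed from the description, the example looks roughly like this (the names are my guesses):

    class Quantity(object):
        def __init__(self, value, unit):
            self.value = value
            self.unit = unit

        def __add__(self, other):
            # You don't want to add meters to seconds: require equal units.
            if self.unit != other.unit:
                raise ValueError("incompatible units")
            return Quantity(self.value + other.value, self.unit)

    def compute(n):
        # The "somewhat stupid" benchmark: add quantity objects repeatedly.
        total = Quantity(0.0, "m")
        step = Quantity(1.0, "m")
        for i in range(n):
            total += step   # the INPLACE_ADD bytecode we look at below
        return total

    # 500 million iterations on PyPy; the demo used 100 times fewer on
    # CPython so it would finish within the talk.
    compute(500 * 1000 * 1000)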
So on that specific example, PyPy is actually more than 1,000 times faster than CPython. And, well, it looks quite magical, but I'll remind you that when you're running on PyPy, it is still just regular Python. You can interrupt it in the middle and you will get a traceback. You can even run it under pdb and interrupt it there. Oops, that's not what I wanted to do; I want to run it. Yes, here I've started running the program under pdb and I've interrupted it. Now I can inspect things, so let's inspect self.value: we're at iteration 418,053,951 of the loop, and we can still do whatever we want with pdb. And despite having all the power of Python available, this ran in just one second for 500 million iterations.

So let's try to figure out how that happened. The first level we need to look at is the same as in CPython: we should first look at the bytecode, because that's what the interpreter actually sees. It doesn't see the raw source code, it sees the bytecode. So let's import our code and run it again. To inspect bytecode, there's a very simple tool, the dis module: calling dis.dis on the compute function shows us the bytecode. I won't talk about bytecode too much; I hope most of you have at least seen something like this before. But in this bytecode you can see the interesting part, which I'll highlight: this is the code for the loop, and it's actually pretty simple. As you saw in the source code, it's just a +=. In bytecode, this turns into mostly the INPLACE_ADD bytecode, with a few operations before and after to put things on the stack. The bytecode language is a stack-based language: you put things on the stack, and every operation pops its operands from the stack and pushes its result back.

So let's now have a look at this INPLACE_ADD bytecode. This is the slightly simplified source code for the INPLACE_ADD bytecode inside PyPy; the real thing does exactly the same, it just has more metaprogramming that obscures things. And since the implementation language of PyPy is (a subset of) Python, it looks like Python too, so we can easily read it. The way it works is: first it takes the values from the stack, as I explained, then it does the calculation, and at the end it pushes the result. So what is the actual calculation that happens inside the interpreter when you do a +=? Well, first you look up __iadd__ on the type of the left object. Then there's something complicated which doesn't happen in our case, because we didn't implement __iadd__. So since we don't have __iadd__, we fall back to doing the same thing as a simple addition. Addition proper is more complicated, because you have to check the types of both arguments: you look up __add__ on the left argument, and in certain cases you will also look up __radd__ on the right argument. But as it happens, in our case the two objects are of the same type, our Quantity type, so we end up here, and we just call the __add__ method we've looked up on the type. Up to here, this is pretty much the same logic as in CPython; if we want to go deeper, we have to talk about the JIT, because that's where the magic happens.
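Before we dive in, here's that dispatch logic as a rough plain-Python sketch (simplified; PyPy's real version works on wrapped interpreter-level objects and handles more corner cases):

    def inplace_add(left, right):
        # First, try the in-place hook on the left operand's type.
        iadd = getattr(type(left), '__iadd__', None)
        if iadd is not None:
            result = iadd(left, right)
            if result is not NotImplemented:
                return result
        # No __iadd__ (our Quantity doesn't define one): fall back to
        # plain addition, trying __add__ on the left type first...
        add = getattr(type(left), '__add__', None)
        if add is not None:
            result = add(left, right)
            if result is not NotImplemented:
                return result
        # ...and, in certain cases, __radd__ on the right type.
        radd = getattr(type(right), '__radd__', None)
        if radd is not None:
            result = radd(right, left)
            if result is not NotImplemented:
                return result
        raise TypeError("unsupported operand types")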
So let's talk first about how the PyPy JIT was designed. The JIT in PyPy is what's called a tracing JIT. The first idea behind a tracing JIT, and I guess it's the idea behind all JITs, is that most of the time in your code is spent in very few lines. So by compiling just in time, you can compile only a small part of your code and still get large performance benefits. The other important principle is that when you have a conditional in your code, in most cases you pretty much only ever take one of the branches. For example, as I showed you in the INPLACE_ADD bytecode, you need to look up the __iadd__ method, but we didn't implement it in our user code, so that check will always fail and we will always fall back to doing a plain addition, not the special in-place addition logic.

So the idea is: first, we compile only the hot loops, the parts of the code where we can see that the currently running program is spending its time. Then we optimize for the fast path, since we can assume that in most cases we only have to consider one branch of an alternative. And of course, the one thing that gives a just-in-time compiler an advantage over ahead-of-time compilers is that with a just-in-time compiler you know what you are running: you are at runtime, you have already seen the user code, so it's easier to optimize for what the user actually wants to run.

Therefore the JIT traces the code. To compile some optimized code, the first step is to trace one iteration of a loop and record all the operations made during that iteration; that way, you can include runtime information in the trace. Once you have the trace, you can optimize it, but you also need to add what are called guards, to check that the conditions under which the trace was recorded are still valid. And finally, an important idea in PyPy is that, because Python is so complicated, because one bytecode like INPLACE_ADD can do so many things, it would be very hard to trace at the Python level. So to simplify the implementation, PyPy traces what the interpreter does: we record the implementation I showed you and run that. As a side effect, this keeps the JIT in sync with the interpreter.

The recorded form of what the interpreter does is called jitcode. To create these jitcodes, the first thing is that the implementation of PyPy contains hints that tell the toolchain where things can be jitted, where you have an opportunity for optimizing loops. That hint is called a JitDriver, and the main one, of course, is on the main bytecode dispatch loop, so that when you jump back to the beginning of a loop, you know you can start jitting under certain conditions. The toolchain then follows all the code that is reachable from these JitDrivers. There are further hints: decorators like @dont_look_inside, which tells the JIT not to trace into a function, and @elidable, which allows certain optimizations by telling the JIT that a function is pure in the functional-programming sense, referentially transparent if you know what that means, and therefore doesn't need to be run again for the same arguments. You can also declare some attributes of certain objects as quasi-immutable, which means the JIT will assume they are always constant; they can change, but the JIT assumes they don't.
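Here is a minimal sketch of what those hints look like in RPython, for a toy interpreter (a loose illustration of the rpython.rlib.jit API, not PyPy's real dispatch loop):

    from rpython.rlib.jit import JitDriver, elidable, dont_look_inside

    # greens identify a position in the interpreted program; reds are the
    # runtime state. Same greens means same user-level code position.
    driver = JitDriver(greens=['pc', 'program'], reds=['acc'])

    @elidable
    def lookup(program, pc):
        # Pure (referentially transparent): the JIT may reuse the result
        # instead of calling this again.
        return program[pc]

    @dont_look_inside
    def debug_print(acc):
        # The tracer won't follow the call into this function.
        print(acc)

    class VersionTag(object):
        # The trailing '?' marks the field quasi-immutable: the JIT treats
        # it as constant and throws compiled code away if it ever changes.
        _immutable_fields_ = ['version?']
        def __init__(self):
            self.version = 0

    def interpret(program):
        # program is a string of one-character opcodes, e.g. "iij"
        pc = 0
        acc = 0
        while pc < len(program):
            # The hint on the dispatch loop: tracing can start whenever
            # execution comes back around to this point.
            driver.jit_merge_point(pc=pc, program=program, acc=acc)
            op = lookup(program, pc)
            if op == 'i':                      # increment
                acc += 1
                pc += 1
            elif op == 'j':                    # jump back while acc < 500
                pc = 0 if acc < 500 else pc + 1
            else:
                pc += 1
        return acc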
With all that, the toolchain converts the internal representation of the interpreter into this jitcode format, which is optimized for size. Then, when tracing, what the JIT does is take the jitcodes and actually interpret them. So during tracing, the PyPy interpreter is running on top of another, weird interpreter that runs these jitcodes and records all the operations. While it does that, it records guards. Guards are usually simple checks, but there's a slight complication: when a guard fails, the code needs to exit the optimized path and fall back to the interpreter. There's also a different sort of guard, where you assume that something is true, and if it turns out it isn't, the whole trace is thrown away. With that kind of guard, you don't need to check anything when you execute the trace, so it's very efficient.

I'd like to show you what those traces look like. For that I'll use vmprof. It's a statistical profiler, but most importantly for today, it can show the traces. To use it, you install the vmprof package; it records profiling information and JIT information, which you can then visualize on a server. Here I'm running it locally; there's also an option to run it in the cloud. And if I open this... here's what the JIT records when running our code. You can see that in the INPLACE_ADD there's a whole lot of operations. But notice this quasiimmut_field operation right here, which comes from a lookup on the type object; I guess that one is the lookup of the __add__ method on the type. It's recorded as quasi-immutable, so the JIT will, in the end, not have to worry about it.

Let's go back to the presentation. Once we have traced the raw operations, there are important optimizations that reduce the number of operations, because that was a huge lot of operations. You have classic compiler optimizations: the JIT can track the possible values of ints in order to remove, for instance, index checks on array accesses, and it removes the guards that are useless or implied by other guards. The most important optimization is virtualization: when objects don't escape the loop, or usually don't escape it, they don't need to be created at all; they will only be created on demand. That way you remove allocations, which are very expensive operations. Another important optimization is unrolling: instead of optimizing a single iteration, you first run one iteration of the loop, and, since loops very often compute things that are always the same, loop invariants, that first iteration computes all the things that stay constant. Then you can have a second iteration that doesn't need to repeat these loop-invariant operations, and that second iteration is what is actually repeated all the time.

After that, the trace, this sequence of operations, is passed on to the appropriate backend. Every architecture needs a different backend, and that's where the assembly is actually emitted. The backend is relatively simple compared to the rest of the JIT: in the end, each operation just maps to a simple piece of assembly. So let's have a look at the final result. First, after the optimizations but without unrolling (here we are; sorry, wrong slide) in this INPLACE_ADD you can see that there are still a lot of different operations.
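To preview what those optimizations add up to on our benchmark, here's a rough Python-level analogy of the fully optimized loop (an illustration of the effect, not what the JIT literally emits):

    # Per iteration, the interpreter conceptually had to:
    #   1. look up __add__ on type(total)   -> removed: quasi-immutable
    #   2. check total.unit == step.unit    -> removed: loop-invariant,
    #      checked once in the first ("preamble") iteration
    #   3. allocate a new Quantity          -> removed: it never escapes
    #      the loop, so it stays "virtual"
    # What survives is roughly this:
    def optimized_loop(n, start_value, delta):
        total_value = start_value     # an unboxed float, kept in a register
        for i in range(n):            # the loop counter that remains
            total_value += delta
        # A real Quantity object is only materialized when the loop exits
        # (or when a guard fails and we fall back to the interpreter).
        return total_value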
But after all the optimizations, after unrolling, this INPLACE_ADD has removed all the checks on the different types. It has removed the function call, and the only thing left, I think, is incrementing the loop counter. And you can see the final assembly that is generated; it fits right here, it's only about 10 assembly instructions. So that's why it is very fast.

So let's conclude. The first thing is that on this benchmark, the JIT is quite unreasonably effective: it somewhat fortuitously manages to remove all the operations, which makes a big difference compared with what CPython has to do. But more generally, the way it works is that the toolchain contains a generic JIT framework that is pretty much Python-agnostic, and in order to get the massive performance benefits you've seen, the interpreter needs to exploit the features of the toolchain. Together, they give you abstractions for free: they remove most of the overhead that comes from using a dynamic language like Python. If you want to know more, you can contact us on IRC (the #pypy channel). Just an announcement: we'll have a help desk tomorrow morning, and we'll be at the sprints, so come talk to us. And I hope we have time for a couple of questions. Thank you.

[Moderator] Thank you. So we have time for one, maybe two questions. Anybody? Please come to the microphones.

[Audience] Hey, thank you very much for the talk. I'm going to ask a somewhat mean question. In practice, I've sometimes seen PyPy behave slower than CPython. We're often shown this kind of canonical example of a for loop where PyPy does so much better than CPython, and that's great. But I think it would also be useful if you could point to some, I don't know, pathological coding patterns in which PyPy would perform slower, because of some overheads that are present in PyPy but not in CPython.

[Ronan] Well, it's always hard to really understand the bad cases, but basically the bad cases tend to be when the JIT is unable to remove the overhead. For instance, dictionary access is somewhat slow. Here the performance was very good because all the dictionary accesses could be removed, thanks to the quasi-immutable mechanism for type objects. I didn't really talk about it, but there's a similar optimization for instance attributes as well; sometimes that gets disabled, and then performance suffers.

[Audience] That's very interesting, because my test case actually involves a lot of dictionary access. So maybe that's the explanation.

[Ronan] Yeah, but every case is a bit different, so we can't really give a general answer.

[Moderator] Okay, so if you have any more questions, please find Ronan and the PyPy guys at their help desk or at the sprints. And thank you again.