Hello. OK, so, as I said, I'm the technical lead on the Faster CPython team. The Faster CPython project started just over a year ago. We made some small changes in 3.10 and we've already started making changes for 3.12, but this talk is really focused on the changes we've made in 3.11. It's possible that some of the things I say aren't 100% true for 3.11 — there may be elements of the other versions — but the aim is to give you a feel for the sort of changes we've made and how it all fits together.

So, CPython is a computer program, like many others. It's written in C, hence the name, and it runs on a computer. In order to make it faster, it helps to understand, first of all, how the computer it runs on works. Some understanding of the hardware is helpful: we don't need to be able to build or maintain a machine, but we need some sort of feeling for its performance characteristics. Now, modern CPUs are complicated things. This is an architecture diagram of a slightly out-of-date machine, but I want to bring your attention to a couple of things. First of all, it has multiple cores, but that doesn't really interest us because of the infamous GIL. Even if there weren't a GIL, we'd expect each core to be running a separate thread, so the optimisations we're going to talk about would only really apply to one thread anyway. What is interesting is that within each core there is parallelism. Every time an instruction is fetched, several instructions are usually fetched at once, and multiple instructions are dispatched. There are multiple execution units in each core, allowing you to do several floating-point operations or several memory accesses concurrently. CPUs are called superscalar for this reason.

The other important thing is memory access: it's slow. If we go back to the previous diagram, there's no actual memory on it, but you'll see various caches, and those caches are there to speed up memory access. It takes a long time for signals to move around: if you're clocking at 5 gigahertz, light travels only six centimetres in one clock cycle. That's not very far, and electrical signals move considerably slower than that. So it's simply physically impossible to make memory access fast; it's a hard physical limitation. The way CPUs work around it is with layers of cache. There's a tiny L1 cache — and I mean tiny, as in a few kilobytes — with only about a four-cycle delay. Then there's the L2 cache, typically around ten cycles; the L3 cache, which is larger still, shared across the cores and usually used for inter-core communication, at about 30 cycles; and then RAM, at a couple of hundred cycles. Superscalar CPUs can work around these delays by pushing forward some execution while other work is stalled. But as a first general principle, we want to avoid memory accesses. The exact details are very hard to work out — getting precise timings on modern hardware is extremely hard — but bear in mind one simple thing: a memory access is generally going to be slow. And because the CPU is superscalar, dependent memory accesses are what we really want to avoid. If we have two memory accesses and they're independent, the CPU can largely do them at the same time. If we have a memory access that depends on another one — in other words, we need the data from the first to work out where to look for the second — that's going to be doubly slow.
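To make that concrete, here's a small experiment you can run — my own illustration, not from the talk. It chases a randomly ordered chain of indices, so every load depends on the previous one, and compares that with a sequential scan of the same data. Interpreter overhead shrinks the gap compared with C, but the dependent-load version is usually still measurably slower:

```python
import random
import time

N = 1_000_000

# Build one big random cycle: chain[i] holds the next index to visit,
# so every read depends on the result of the previous read.
perm = list(range(N))
random.shuffle(perm)
chain = [0] * N
for a, b in zip(perm, perm[1:] + perm[:1]):
    chain[a] = b

start = time.perf_counter()
i = 0
for _ in range(N):          # dependent loads: cache-hostile pointer chasing
    i = chain[i]
dependent = time.perf_counter() - start

start = time.perf_counter()
total = 0
for x in chain:             # independent, predictable, cache-friendly loads
    total += x
sequential = time.perf_counter() - start

print(f"dependent: {dependent:.3f}s  sequential: {sequential:.3f}s")
```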
OK. Now, back in 1976, Niklaus Wirth, a famous computer scientist, wrote a textbook called Algorithms Plus Data Structures Equals Programs — that's the slightly fuzzy picture I got from Amazon. I want to use that title to split this talk in two. Before object-oriented programming, which rather merged the concepts of data structures and algorithms, this was an important way of looking at programs, and it still applies to low-level systems software like CPython, Linux or Windows. So we're going to split this talk into a data structures part and an algorithms part. First, the data structures: we're going to look at some data structures in CPython and how we've changed them in 3.11 for better performance.

Right, but before we get into the real data structures, a quick aside, just to give you a feel for this sort of thing. A linked list is a data structure you might be familiar with from your computer science classes, or from bad job interviews where they ask you to implement one. As you can see from the image, compared with an array list it's quite an inefficient structure, in a couple of ways. Suppose we want to access the integer at index 2 in the list. In the linked list on the left, we need to follow the links, and because we're following links, we first need to read the head to find the first link; once we've read that, we know where the next link is, and we can read that, and so on. So there are four dependent memory loads to get to the element. In the array list on the right, it's only two loads — still dependent, but only two. And not only that: every time we want to add an element to a linked list, we need to allocate more memory. Allocating memory is another slow operation — not for physical reasons, just because there's a lot of code involved. So designing data structures to avoid these dependent loads and excessive memory allocations is important.

Before we look at the implementation of any real data structures, I want to give you a quick refresher on frames and frame stacks in Python. Whenever you call a Python function, we need somewhere to put the arguments and local variables, any temporary values, the reference to the module globals and a few other bits and pieces, and these go in a frame object. So every time you call a Python function we create a new frame object; if it calls other Python functions, those push frame objects too, and so on, so we form a stack of frame objects. Note that I use the term stack here, because in Python 3.10 and earlier we have this structure, which looks awfully like the linked list example I just gave you (there's a small sketch of that comparison below). It isn't as bad as the linked list, because we only ever really want to access the top frame, the one we're currently executing, so we don't have to worry about the extra cost of following the links. But we do need to worry about allocating memory each time.
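Here's a minimal sketch of that linked-list-versus-array comparison — my own toy code, not CPython's. Reaching index 2 in the linked list means one dependent load per node, while the array needs only a load of the buffer pointer and then the element itself:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    value: int
    next: "Optional[Node]" = None

def linked_get(head: Node, index: int) -> int:
    node = head
    for _ in range(index):      # each step must wait for the previous load
        node = node.next
    return node.value

# Contiguous buffer: element access is an offset computation, not a chase.
array_list = [0, 1, 2, 3]

head = Node(0, Node(1, Node(2, Node(3))))
assert linked_get(head, 2) == array_list[2] == 2
```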
Now, in Python 3.10 and earlier there was some caching involved, but conceptually we still have to do this allocation, and even with the caching it's not as efficient as it could be. The point is that this is a stack, so in 3.11 we implement it as one: we just allocate a big chunk of memory per thread. (Actually there are several linked chunks, because we don't know in advance how big the stack will get.) In almost all cases we can allocate a new frame by simply reusing memory in that chunk. This has a number of advantages. We still need a link pointer to follow — we'll come to why in a minute — but it's much faster than chasing around memory, because we can compute the offset from one frame to the next. And remember what I said earlier about caches: with a stack, the memory you're about to use is probably what you just used for a previous call, so it's almost certainly still in the cache, whereas a freshly allocated frame object might be anywhere in memory. So the cache performance is about as good as you can get, and we're not doing a new allocation every time we call a Python function.

But you may be wondering: doesn't this change Python's semantics? Some of you will have used sys._getframe, and unfortunately all of us will have encountered exceptions. Exceptions come with a traceback, and that traceback includes frame objects. So how do we deal with this? Well, when we need to, we allocate a frame object. I've only shown the frame object for the frame on top of the stack, but you can imagine that we'd allocate one for each frame — the one in yellow and the lower one in white — and we'd link them together as if they were the old-style frames. So effectively we can lazily recreate the old stack when we need it. This is a general principle of the sort of optimisation we do: design so that the common case is fast, and make sure we still behave correctly in the less common case, even though the less common case might end up slightly slower than in earlier versions. Typically, normal execution becomes faster and raising an exception becomes a tiny bit slower. Unless your code is odd, shall we say, you would expect exceptions to be relatively rare. We were given an example earlier — the "look before you leap" style — where exceptions become a little more common, but generally that's still the rare case, and those pieces of code are not typical anyway. You'll have some of them in your code base, but in terms of execution counts they're rare.
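Here's a toy sketch of that per-thread chunk idea — my own simplification, not CPython's actual structures. Pushing a frame is just bumping an index into preallocated memory, and a full frame object is only materialised on demand, for example for a traceback or sys._getframe:

```python
class FrameStack:
    """Toy model of 3.11's per-thread frame chunk (simplified)."""

    def __init__(self, size: int = 1024):
        self.chunk = [None] * size   # stand-in for one contiguous memory chunk
        self.top = 0

    def push(self, frame_data: dict) -> int:
        index = self.top             # "allocation" is just bumping the top
        self.chunk[index] = frame_data
        self.top += 1
        return index

    def pop(self) -> dict:
        self.top -= 1
        data = self.chunk[self.top]
        self.chunk[self.top] = None
        return data

    def materialize(self, index: int) -> dict:
        # Only build the expensive reflective wrapper when someone asks,
        # linking it to the one below as if the old stack had existed.
        return {"frame": self.chunk[index],
                "back": index - 1 if index > 0 else None}

stack = FrameStack()
stack.push({"func": "outer"})
inner = stack.push({"func": "inner"})
print(stack.materialize(inner))      # lazily created frame-object stand-in
```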
One other thing you may have noticed is that it looks like I might be cheating by dropping a few things to make the frame look smaller. One of those is the debug information, which we now allocate lazily in the frame object. The other is the exception stack. The exception stack is gone in 3.11, because we have what are called zero-cost exceptions. Zero cost is the technical name — obviously they're not really zero cost, because nothing is, but they're pretty close.

The way exceptions worked in 3.10 is that if you had a try/except statement, when you hit the try we would push a little piece of data onto an internal stack, telling us where to jump and how much to pop off the evaluation stack if an exception occurred. When we got to the end of the try body, that would be popped off again. That worked perfectly well, but it means that every time you enter and leave a try, you do a little bit of work. And we need somewhere to keep those blocks, which took up 160 bytes, I think, in 3.10 — and 240 bytes in earlier versions — in each frame, which is somewhat wasteful. That size comes from the maximum exception depth of 20: if you try to write try/excepts nested 21 deep in Python, you'll get a syntax error. You've probably never done that, and I wouldn't recommend it.

The way it works in 3.11 is that instead of pushing these blocks, the bytecode compiler analyses where the exception would jump to for each bytecode, and we just create tables that describe that; those tables are stored alongside the normal code. Handling is a little slower, again, when an exception is actually raised, but there's no real cost when one isn't. There is a tiny cost, obviously, because the tables use a little more memory in the code object — so zero-cost exceptions should probably be thought of as mostly-zero-cost exceptions.
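You can see those tables for yourself with the dis module. This is a hedged illustration — the exact offsets and listing format depend on your Python version — but on 3.11 the disassembly ends with an ExceptionTable section, whereas 3.10 instead emits block-push instructions (SETUP_FINALLY) on entry to the try:

```python
import dis

def safe_div(x):
    try:
        return 1 / x
    except ZeroDivisionError:
        return 0

dis.dis(safe_div)
# On 3.11 the listing finishes with something like:
#
#   ExceptionTable:
#     4 to 14 -> 16 [0]
#
# i.e. "if an exception is raised between these offsets, jump there" --
# a side table consulted only when an exception actually occurs.
```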
OK, so that's frame objects. Now let's discuss more ordinary objects: plain old Python objects. Here's a plain old Python object. It doesn't do anything; it just takes two attributes and assigns them to itself. Obviously most Python objects look kind of like this, with some extra code to actually do something, so this is a fairly standard Python object. Let's look at how it's laid out in memory. Now, you may have heard that you can consider a Python object as just a thin wrapper around a dictionary. Essentially, objects give you nice syntactic sugar: instead of looking up an item in a dictionary by its quoted name, you do a dot-name attribute lookup. That's one way of looking at it — we don't necessarily think of them that way, but every object has this __dict__ attribute, which lets you get at its attributes as a dictionary. Almost all code, though, doesn't access the dictionary directly; it's very rarely used.

Another thing we need to consider about Python objects is that they are not of fixed size. By that I don't mean that they can have a variable number of attributes — as I just said, those live in the dictionary — but that other things can change the size of the object itself. This will become relevant in a second when I show you the diagrams. Objects can inherit from built-in types like int, or they can have __slots__, which changes the layout of the object. So that's a little tricky, and a naive implementation of this can be rather slow and bulky.

So here's the naive implementation. We haven't had a naive implementation since Python 3.2, but I'm going to show it anyway, because I think it illustrates the overheads and the logically simple way of implementing this. You have an object, and it has a pointer to its class and to its dictionary. Except that, because the object is variable-sized, the pointer to the dictionary isn't at a fixed offset, so we need to look up what that offset is in the class. Now, the light green colour means shared instances — there's one of those per class. If you have a thousand instances of a class, or a million, there's still only one, so its memory cost is effectively amortised; the cost per additional object is zero. The red ones, however, are redundant information per instance — stuff we don't really need and want to get rid of. So basically you've got your object; it's got its class and its dictionary, and the dictionary is essentially an array of keys, hashes and values. If you put a value in an object — going back to our self.a = whatever — conceptually the object's dictionary stores the value under the key "a", indexed somehow or other, and we use a hash table lookup and so on. There are plenty of other talks on how dictionaries work.

In 3.3 and onward, we changed this to remove the redundant keys and hashes. They moved into a separate data structure and are shared across the class in most cases. Now we've got rid of a lot of the redundancy: we have this table of values; the keys and hashes are in a separate structure, usually reached via the class, where it's effectively cached; and the dictionary points at those. But you'll see there's still some red. If we want to access value zero here, we still have to follow four memory accesses: we get the class, we get the dictionary offset from the class, we use that offset to find the dictionary pointer in the object, we follow that to the dictionary, and then we follow that to the table. So from 3.3 to 3.10 we used less memory, but we still had that number of indirections to follow.

So the first thing we did in 3.11 is move the pointers. There's nothing physically stopping us from putting pointers in front of the header of the object, so we move the dictionary pointer in front of it, which means it's at a fixed offset, which means we don't need to look the offset up. That's the first step, and it already reduces the number of indirections to reach our values to two — though we haven't saved any memory yet. The second thing we can now observe is that the dictionary itself is redundant. It has a pointer to the values, but we can put a pointer to the values on the object instead. It has a header, but that just tells us it's a dictionary, and we already know it's a dictionary. It has a pointer to the keys, but we can always reach the keys via the class. So we can just drop the dictionary. This is what a plain Python object like the simple example I gave earlier becomes when you create it, and it gives us this nice compact form in 3.11.
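One observable consequence of this layout — a hedged illustration, since exact byte counts vary by version and platform — is that instances of the same class share one keys table, so an instance's __dict__, even once it has been materialised, is typically noticeably smaller than an ordinary dictionary carrying its own keys:

```python
import sys

class Point:
    def __init__(self, a, b):
        self.a = a
        self.b = b

p = Point(1, 2)

# The instance dict shares its keys/hashes with the class's cached keys...
print(sys.getsizeof(p.__dict__))          # key-sharing ("split") dict

# ...while an equivalent standalone dict carries its own keys and hashes.
print(sys.getsizeof({"a": 1, "b": 2}))    # ordinary ("combined") dict
```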
OK, so that's the data structures — a bit of a whirlwind tour, I realise, and it doesn't get any slower from here. Now, algorithms. First of all, bytecode. Bytecode is what the interpreter runs when it's running your Python program. Again, there are plenty of talks on this, so here's a very brief refresher or introduction, depending on whether you've seen this stuff before. Take the function on the left: it returns the a attribute of its argument. In the bytecode on the right, ignore the RESUME — that's just an administrative internal thing marking the beginning of a function. What it does is: it loads the local variable self onto the evaluation stack, then replaces that with the a attribute of that value, and then it returns the value on top of the stack.

OK, so this is PEP 659, the specialising adaptive interpreter. This is the headline feature of 3.11, but the reason I did the data structures first is that much of what this does depends on how those data structures are put together and laid out. As I said earlier, the number of memory accesses we do is often what matters for performance, and designing our data structures to let us do things fast has to come before actually doing things fast. The idea behind the specialising adaptive interpreter is that for each bytecode of interest — so not simple things like LOAD_FAST or RETURN_VALUE, but complicated things like looking up an attribute — there are a million different ways the operation can go. Well, not a million; more like thirteen or so, for attribute lookups. You can have properties, methods, values on the instance, values on the class, specialised stuff done in C code, and so on. So each such instruction basically has two states. One is the general form, which we call the adaptive form: it just does the general lookup that we did in 3.10 and decrements a counter, and when that counter reaches zero, we try to specialise. The other is the specialised forms, which are customised for the particular values or types of values we've seen, and those have a counter for misses. Because just because we've seen integers being added together doesn't mean the next time we reach that addition it will be integers — it could be floating-point numbers, it could be strings. If we don't see the thing we saw before, we fall back to the general case and decrement the miss counter. This is why it's adaptive: there are two states. Ideally it goes from the adaptive state to the specialised state and stays there, but code isn't always that straightforward, so we need to cope reasonably efficiently with the cases where it isn't. So there are two transitions: specialisation, where we go from the general form to the specialised form, and de-optimisation, where we go from the specialised form back to the adaptive form.

Before we get there, though, there's a quickening step. Going back to the earlier bytecode — that's it on the left, and I've added a few little things. Internally we have what are called inline caches: a little bit of space in between some of the bytecodes for data that we need to speed things up. In the quickening step we change the straightforward form, LOAD_ATTR, into LOAD_ATTR_ADAPTIVE. It does the same thing, but it has this warm-up counter. Eventually we execute it with the warm-up counter at zero, and we specialise. So there's the adaptive form on the left and the specialised form on the right.
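You can watch this happen with the dis module on 3.11 or later — a hedged example, since specialisation thresholds and exact instruction names vary between versions. The adaptive=True flag, added to dis in 3.11, shows the quickened, specialised instructions once the function has warmed up:

```python
import dis

def get_a(obj):
    return obj.a

class C:
    def __init__(self):
        self.a = 1

dis.dis(get_a)                 # generic form: LOAD_FAST, LOAD_ATTR, RETURN_VALUE

for _ in range(1000):          # run it enough times for the specialiser to act
    get_a(C())

dis.dis(get_a, adaptive=True)  # LOAD_ATTR may now show as a specialised form,
                               # e.g. LOAD_ATTR_INSTANCE_VALUE
```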
Now, suppose we're specialising the function we just showed, for instances of the class from earlier — the simple one that just has the a and b attributes. We specialise to a form called LOAD_ATTR_INSTANCE_VALUE. There are about ten different specialised forms of LOAD_ATTR — that number might be wrong, but it was probably right at some point in the past and it will be right at some point in the future, because we keep changing it. LOAD_ATTR_INSTANCE_VALUE is for this simple case: a normal Python class, nothing special, no properties, just values stored on the instance. It has a miss counter: every time we miss, we decrement it, and when it gets to zero we go back to the adaptive form, possibly bouncing around a bit. It has a type version, which tells us whether the current state of the class of the value we're looking at is as it was when we specialised the code. To support that, we add a version number to every class; whenever a class is changed, its version is incremented, so we can compare version numbers as a quick check. And the index here is simply the index into the values array.

Note that we're not checking the keys here, which is interesting. The reason is a property of the shared dictionary keys that I omitted to mention earlier: keys cannot be removed from them. So once we know a key is in the cached dictionary keys on the class, we know it will always remain there. If we delete the attribute from the instance, we can just null out that slot; and if we change the instance so much that we'd have to rebuild the dictionary keys, then we stop using the cached shared keys and give the instance its own.

So I just want to run through quickly how this works. This is C code — if you don't read C, don't worry, I'll talk it through. As I said, the object is on top of the stack, so we pop that off, then we look at its type and check the type version: we read the version stored in the inline cache, and if it doesn't match the type's current version, we do what's called a deopt. (I'm just checking the time — I'll be quick.) If we deopt, we decrement the miss counter and fall back to the general form. If we don't deopt, we read out the index, load the value, and we're done. We do a bunch of memory reads here, but they're mostly independent, so this is much simpler and faster than the general case.
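Here's the shape of that logic transliterated into Python — my own toy paraphrase of the C on the slide, not CPython's real code. A per-call-site cache holds a type version and an index; a version mismatch is a deopt that decrements the miss counter and falls back to the generic lookup:

```python
TYPE_VERSIONS = {}                       # toy stand-in for tp_version_tag

def version_of(tp):
    return TYPE_VERSIONS.setdefault(tp, 1)

class LoadAttrSite:
    """One call site's inline cache for a LOAD_ATTR-style lookup (toy)."""

    def __init__(self, name):
        self.name = name
        self.cache = None                # (type_version, index) when specialised
        self.misses = 5

    def load(self, obj):
        values = list(obj.__dict__.values())     # toy "values array"
        if self.cache is not None:
            version, index = self.cache
            if version_of(type(obj)) == version:
                return values[index]             # fast path: one indexed read
            self.misses -= 1                     # DEOPT: class has changed
            if self.misses <= 0:
                self.cache = None                # back to the adaptive form
        result = getattr(obj, self.name)         # generic, slow lookup
        index = list(obj.__dict__).index(self.name)
        self.cache = (version_of(type(obj)), index)   # (re)specialise
        return result

class C:
    def __init__(self):
        self.a, self.b = 1, 2

site = LoadAttrSite("a")
print(site.load(C()), site.load(C()))    # second call takes the fast path
```

In the real interpreter, re-specialisation is throttled by counters rather than happening on every generic lookup; the sketch collapses that detail to keep the control flow visible.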
That's not the only specialisation we do — we specialise a whole bunch of things. There are the bytecodes on the left and the equivalent Python code on the right that we specialise. So, in summary: we design the data structures to reduce memory accesses, then we design specialised bytecodes for the common cases we're likely to see, and we design those to take advantage of the new data structures, so that we reduce the cost of each operation. In other words: data structures plus specialised code equals faster Python. I'm going to skip the future, because you'll have to come back next year to see what we're doing in 3.12 — if you can read really fast, there are some of the things we're going to be doing. And thank you. So, let's see if this works. I just want to say, on behalf of the whole team, thank you to all the other core developers who've tolerated us breaking stuff. I particularly want to thank Ken Jin, Nardosan and Dennis Sweeney for their contributions. Our team is paid, but those are volunteer contributions, so I want to thank them for doing that.

Thank you very much again for your presentation. We have a very short time for questions, and we're going to take the remote question first — actually, we're going to take a question from the audience first.

A quick question: there was this instance of the class and its layout. You have this __dict__ attribute and it was pointing to null. Do you actually need this pointer at all right now?

Well, we could in theory tag the values and the dictionary and shove them in the same pointer, but obviously if someone asks for __dict__ we'll need to fill that in. There was a slide for it — yes, so we go from this, and then if someone wants the dictionary we just fill it in and put that there.

And another question: can you actually inline the values inside the object and reallocate it when someone adds new attributes?

We could do that; it's definitely something we've considered. This is a compromise between flexibility and performance. We may add that later. The problem is that it's even more wasteful if we get the sizing wrong, because the values are just there: we'd have to reassign the values and we'd still need the pointers, so it's not clear which is better.

Thank you very much for the question. Can we have the next question, please?

Hi. It's actually very good to hear that there are so many optimisations. I was just wondering, with so many new tricks in the book now, do you think we may get some unexpected non-determinism — and, well, bugs — because the code is no longer executed in a very deterministic way?
Well, it's still deterministic; it's just more complicated. We do have fixed counters — if we had random counters there would be some advantages to randomising things, but, as you say, that would be non-determinism. To be honest, though, it's always been non-deterministic in terms of performance, because the hardware largely is. In terms of what it does, it should do the same thing — not doing the same thing would be a bug. It is more complicated, so bugs are more likely; that's definitely something we're aware of, and we do our best not to introduce more. So there's a trade-off: a slight risk of introducing new bugs, especially in obscure cases, and the question is whether we're willing to trade that slight risk for significantly improved performance. I believe we are, and I believe the community as a whole is.

OK, thank you very much for the questions. That's all the time we have now, so let's have another round of applause for Mark, and thank you for the talk.