I'm just going to talk about how we're making CPython faster, and when I say "we" I don't just mean me or my team, I mean the whole community over quite a few years. The first thing is, I've been thinking about this for a while — here's a picture of me at EuroPython, was that twelve years ago? — when I was younger.

So the first thing I want to talk about: everyone's all about speed, speed, speed, but I want to get you thinking in terms of time, and the reason is just some simple maths. If you want to make something five times as fast, you need to reduce the execution time by 80%. If you want to make it ten times as fast, you have to reduce the time by 90%. You might think five times as fast is half as good as ten times as fast — well, it isn't: five times as fast requires almost as much work as ten times as fast, as you can see from the picture. If you do only half the work needed to get to ten times as fast, you've got barely twice as fast.

The other thing is that we need to consider all aspects of the runtime. Most of us have heard of JavaScript, even the most hardcore Python programmer, and you might be aware that modern JavaScript engines are much, much faster than they used to be. A lot of that is just-in-time compilers and other things, but there are many aspects to those engines, and if you only consider part of the program and don't deal with all of the bits, you have a bit of a problem. Imagine you just ignore 10% of the program and focus on speeding up the other 90% a lot. You'd think that would be good enough, but ignoring that last 10% can really drag down your performance: the difference between ignoring it completely and merely doubling its speed means the effort you have to put into that 90% is reduced considerably.
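To make that arithmetic concrete, here is a minimal sketch — my own illustration, not something from the slides — of the speed-up you get from removing a fraction of the execution time, plus the ignore-10% example worked through with the numbers above:

```python
# Speed-up obtained by removing a given fraction of the execution time.
def speedup(fraction_removed):
    return 1 / (1 - fraction_removed)

print(speedup(0.80))   # 5.0   -> cut 80% of the time: five times as fast
print(speedup(0.90))   # 10.0  -> cut 90% of the time: ten times as fast
print(speedup(0.45))   # ~1.8  -> half the reduction needed for 10x: barely 2x

# The "ignore 10% of the program" example: to reach 5x overall, everything
# must fit into 0.20 of the original time.
#   ignore the 10% entirely : the other 90% must fit in 0.20 - 0.10 = 0.10 (9x faster)
#   merely double that 10%  : the other 90% must fit in 0.20 - 0.05 = 0.15 (6x faster)
```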
So, since we need to consider all aspects of the thing, we need to do some profiling and find out where the time is spent. If you can't read that diagram at the back, don't worry — I can't read it at the front either; it's more to give you a feel for how we profile. These are the benchmarks in our standard benchmark suite, broken down into the various places where time is spent. The interpreter itself is a surprisingly small fraction of the time spent running a Python program — these numbers are for 3.11 — something like a third. Memory management is about 10%, the cycle GC (which is something you probably don't think about much) is another 10%, and then there's library code, various dynamic lookups, and various other bits and pieces. We need to consider all of these, because if we only look at a third of the runtime, we're not going to get much of a speed-up.

There are a few general principles we like to apply. The most important one is that nothing is faster than nothing: if you can replace anything you do with doing nothing, that's the best you can possibly do. So a lot of what we do is just look for redundancy — it's maybe less "do nothing" and more "don't do it repeatedly". If you do a lookup, and then you do the same lookup again, the second time you can get away without doing it at all; that's obviously as fast as it will go. Then there's speculation, which is basically making guesses about future behaviour based on past behaviour: if you've gone round a loop a million times and seen the same types every time, you're going to see the same types next time. And the last thing is efficient data structures — how we lay out Python objects and the rest of the ancillary bits and pieces in memory. That's more about making things work well on modern hardware: CPU clock speeds are on the order of a hundred times faster than main memory reads, so fewer memory reads is always a good thing. I'm going to go through those in reverse order — sorry, I should say: I'll do some stuff on efficient data structures first, then speculation.

So, shrinking the Python object. Let's consider a simple Python object with four attributes — just code that doesn't do anything, it just exists. We have an initializer that initializes the four attributes, and if we grab the __dict__ out of it we get a dictionary showing those four attributes. This isn't really interesting in itself; it's just an example we can use as an archetype to show how it's laid out in memory.
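For concreteness, here is a minimal sketch of the kind of four-attribute object being described; the class and attribute names are my own stand-ins, not taken from the slides:

```python
class Point:
    """Four attributes and nothing else -- the archetype for the layout discussion."""
    def __init__(self, x, y, dx, dy):
        self.x = x
        self.y = y
        self.dx = dx
        self.dy = dy

p = Point(1.0, 2.0, 0.1, 0.2)
print(p.__dict__)   # {'x': 1.0, 'y': 2.0, 'dx': 0.1, 'dy': 0.2}

# Since 3.3 (PEP 412) the keys and hashes of these per-instance dictionaries are
# shared between all instances of the class; a second instance only adds room
# for its own values.
q = Point(3.0, 4.0, 0.3, 0.4)
```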
Now, if we go back to the olden days of 2.7 and 3.2 — I don't know if anyone here is old enough to have used anything as early as 2.7... there are a few hands going up, quite a few actually — anyway, it was a long time ago, and to be honest the layout was broadly the same before then too. This is how the object was laid out. I'm not going to worry about all the details, but I want to give you a sense that there's a lot of stuff there for just four attributes: on a 64-bit machine (in the early days a lot of machines were 32-bit, but let's assume 64-bit), that's 352 bytes. There's the object, the dictionary, and then all the ancillary stuff for keeping the data.

Now — I hope everyone can see the highlighting — look at the hashes and the keys. If you have a thousand objects like the one I just showed you, all of the same class, then they all have the same hashes and keys in their dictionary tables. So in 3.3 we started sharing those between all instances of a class. Now we just have a table with the values in it, and the keys are shared between all instances of the class, so the layout looks a bit like this. The reason I've drawn the keys on the class as just green circles isn't that they're small — they're quite large objects — it's that they're shared: if you have a thousand instances of the class, then each kilobyte on the class itself only counts for one byte per object, so we're going to ignore those and pretend they're effectively zero. But there's still more redundancy here: you can see some gaps in the values, and that's to do with hash tables and efficient lookup.

So in 3.6 the dictionary was shrunk and those gaps were removed — that's what's called the compact dictionary — and there were further refinements a little later. You can also see some alignment issues here. Because CPython, as the name makes clear, is written in C, we need compatibility with C libraries, and malloc restrictions insist on two-word alignment, so anything that's an odd number of words has to be padded and we waste a bit of space: there's an alignment issue at the top, and one of the GC header words is wasted. So in 3.8 we shrank the GC header a little (GC is short for garbage collector — that's to do with collecting cycles), and we're down to a more manageable size: 160 bytes, from our original 352.

But actually there's a huge redundancy left. The entire dictionary object holds no real information: all the information in it is the table of values, which is — well — the table of values. Everything else just says "I'm a dictionary": a GC header so it can be garbage collected, a reference count simply because it exists, a class pointer that points to dict, and some other stuff that is again largely redundant. So in 3.11 we can just remove the dictionary altogether, and we're down to a much more reasonable amount of memory. If you actually need the __dict__, 3.11 will dynamically recreate it, which can be a little inefficient if you use it heavily, although we're looking in 3.13 at being able to dynamically remove it again even if you have required it at some point.

There's still a little more redundancy: the dictionary pointer isn't necessary any more, because there is no dictionary, so we can merge it with the values pointer; and there's a little bit of alignment waste at the bottom. So in 3.12 we shrink this further, to 96 bytes. This is looking pretty good. It's still not as good as a C++ or Java class layout, but we're not doing too badly. There's still a bit more we could squeeze: we probably don't need two words of GC header, and there's the weak-references pointer — almost all objects don't have weak references, so we could maybe move that into flags or an external table. So this is what the future might look like. It won't happen in 3.13, but something akin to it might be in 3.14 — we'll see. And this is where we put the values at the end of the object itself, which is less to do with saving space and more to do with the time it takes to look things up: we can get to those values in a single hop from the pointer to the object, and as I said earlier, memory reads are key for performance, so skipping a memory read helps.
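If you want a rough feel for this shrinking on your own interpreter, one hedged way — my suggestion, not something from the talk — is to measure a large batch of instances with the standard tracemalloc module. The figure includes the list and allocator overhead and varies by version and platform, but the downward trend across recent releases should be visible:

```python
import tracemalloc

class Point:
    def __init__(self, x, y, dx, dy):
        self.x = x
        self.y = y
        self.dx = dx
        self.dy = dy

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
points = [Point(1, 2, 3, 4) for _ in range(100_000)]
after, _ = tracemalloc.get_traced_memory()
# Includes ~8 bytes per element for the list itself, plus allocator slack.
print(f"roughly {(after - before) / len(points):.0f} bytes per instance")
tracemalloc.stop()
```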
So over the last ten years or so — does anyone know when 3.2 came out? It was a while ago — we've reduced the size from 352 bytes down to 96, a reduction of over 70%, and reduced the number of memory reads needed for access by about 80%, at least in that future layout; the current numbers aren't quite as good. If you look at those figures they don't compare too badly to Java or C++. I am cheating a little here: when I say there's one memory read, Java or C++ really does just need that one memory read, whereas in Python we need some additional checks to make sure that nothing else has changed — a couple of extra checks to ensure we can actually use that fast path. But still, with modern CPUs, which can do several things at once, we're potentially in the same order as Java and C++, and the memory use is about double, which, given the flexibility and power of Python, I think is a reasonable price to pay.

So having gone through how we shrank things, which was mostly about the past, now I'm going to talk about speeding up the interpreter, which is probably more about the future. As I said, the interpreter is a reasonable chunk of the runtime — definitely not all of it, and there are definitely more things we need to do around that, but I won't have time to cover those in this talk, so I'll focus on how we sped up the interpreter.

A little interlude before we do that: bytecode. I don't know how many people are familiar with bytecode — a quick show of hands? Some hands up, some people keeping their hands down — so I'll give you a very quick introduction. Imagine the very simple statement y = x. CPython's virtual machine is a stack machine, which means it operates by pushing values onto a stack and popping them off. How that's implemented is not really relevant here; the point is that the VM's instructions operate on that stack. So to assign x to y we first have to load the value of x onto the stack, and then store the value on top of the stack into y. The bytecodes — the virtual machine instructions — that do that are on the right: LOAD_FAST and STORE_FAST. The reason it's called LOAD_FAST is that it was faster than what came before; calling anything "fast" is always a terrible name, because then you end up with "load faster" and "load faster 2" and so on. These instructions are just for loading local variables, so they could probably be better named LOAD_LOCAL. Anyway, LOAD_FAST x pushes the value of x onto the stack, and STORE_FAST y stores the value on top of the stack into y. For the expression a + b we do something similar: we load a, then we load b, and then we add them together using the BINARY_OP instruction — short for binary operator — in its addition variant, which leaves a + b on the stack. If we were then to store that into a local variable, we'd add a STORE_FAST.
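You can see this bytecode for yourself with the standard dis module. A minimal sketch — the exact output varies by version, and before 3.11 the addition appears as BINARY_ADD rather than BINARY_OP:

```python
import dis

def assign(x):
    y = x              # LOAD_FAST x, then STORE_FAST y
    return y

def add(a, b):
    return a + b       # LOAD_FAST a, LOAD_FAST b, then BINARY_OP (+)

dis.dis(assign)
dis.dis(add)
```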
That gives you a very simple flavour of bytecode, and hopefully the rest of this won't be too confusing if you've not seen bytecode before.

The bytecode had been fairly straightforward and we hadn't really messed with it — it just got added to as new features were added — until quite recently. In 3.7 we made a small change to improve method calls. The thing about a method call in Python is that obj.meth(arg) is equivalent to assigning obj.meth to a temporary variable and then calling that temporary with arg — and that temporary would be what's called a bound method. Up until 3.7 we actually created one of these bound methods for every method call; in 3.7 we added the instruction pair LOAD_METHOD / CALL_METHOD, which meant we could avoid creating that temporary object on every method call. Then in 3.8 we added some caches for looking up global variables: every time int or float or type — any built-in function or type — appears in your source code, that's a name lookup, and in 3.8 we added caches to speed those up. So those were fairly small improvements.

In 3.11, however, we made some pretty big changes, with what's called the specializing adaptive interpreter. What that does is specialize one bytecode at a time. Specializing is basically changing the bytecode so that it expects a certain type or types. It's very narrow in scope, but it reduces the dynamic overhead a lot. The reason is that every time we see an addition, we don't have to do a lookup on the left-hand type, then if that fails do a lookup on the right-hand type, and do lots of chasing around; we can just say, OK, we're expecting a couple of integers, we'll check they're integers, and then we'll do the simple integer addition. The way this works is that we specialize one bytecode at a time, and each bytecode is done independently — we don't try any broader or more intelligent approach — which keeps it very simple and also pretty robust.

Here's a couple of examples. As I said, we can specialize BINARY_OP into BINARY_OP_ADD_INT, which is obviously specialized for adding integers. We know statically that there's going to be an addition, because that's in the source code, but we don't know the types statically. Again, though, it's one of those things where if the last time we executed that bytecode it was integers we were adding together, it will be integers the next time. It's not guaranteed, obviously, so we need checks, but those checks almost never fail, and consequently it's much, much cheaper to do a couple of cheap checks and then exactly the operation we want than to go through all of the lookups. These specializations are responsible for pretty much most of the speed-up in 3.11.
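You can also watch specialization happen, with the caveat that the instruction names and warm-up behaviour are version-dependent internals: on 3.11+ dis accepts an adaptive flag that shows the specialized instructions currently in place. A minimal sketch:

```python
import dis

def add_ints(a, b):
    return a + b

# Warm the function up with integer arguments so the adaptive interpreter has
# something to specialize on (the exact warm-up threshold is an internal detail).
for _ in range(1000):
    add_ints(3, 4)

# On 3.11/3.12 the BINARY_OP here typically shows up as BINARY_OP_ADD_INT.
dis.dis(add_ints, adaptive=True)
```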
In 3.11 we saw varying speed-ups, but if you go back to the diagram showing where programs spend their time, I'd say 3.11 roughly doubled the speed of the actual interpreter itself — very approximately. Obviously you don't see that doubling overall; you do for a handful of benchmarks, but generally you don't, because programs don't spend all of their time in the interpreter itself.

Another example is LOAD_ATTR, which just loads an attribute of something. A very simple specialization is LOAD_ATTR_CLASS, for when the object we're looking up the attribute on is a class. Then we can do a different lookup — in fact we can say, well, we know it's a class; we check which class it is, because it's almost certainly the same as it was last time, and then we just cache the result. If you're interested in this — and it's all pretty interesting stuff, in my opinion anyway — you can watch Brandt Bucher's talk from PyCon US online; he explains it very well. It's a full half-hour talk on this bit, which I'm trying to condense into two slides, so I'm rather skimming over it.

What I want to focus on more is the future. That was the present; for the future, we want to optimize larger regions. Here's an example: we've got a little function, add, and the tiniest code snippet, which adds b and 1. There's quite a lot of bytecode on the right and I'm not really expecting you to follow it in detail, but basically there's the function, and below it the little bit of code that calls the function and stores the result. What we can do is look at larger regions of code, so here's the version where we effectively inline the function. You may note that I've changed the call to a "push frame" and the return to a "pop frame", because we're no longer doing a real call and return; we can effectively inline the function. This allows us to do two things. The first is specialization — kind of what we were doing for BINARY_OP_ADD_INT, but broken down into smaller checks: we check that the first and second items on the stack are integers, and then we add them, in the highlighted section there.
So we check that the top and First item in the second item on the stack are integers and then we add them in the highlighted section there And then what we can do is called what's called partial evaluation now I'm really I'm just going to basically just hand wave this sort of way and give you a rough feel for what it can do If you're really interested, there is a 400 page textbook online about partial evaluation You have to be really interested for that But I'm trying to give you a rough feel for what we can do But the idea is we can evaluate whatever we can upfront like during our sort of optimization phase So we don't have to recompute it later and we use a technique called and abstract interpretation to do this So I'm just going to basically just Sort of animate through this manually animate it So so here's the code on the left on the right at the bottom is the actual real frame the actual in memory that we kind of The actual work that's done will be done at runtime and the first the grayed out bit is the sort of abstract bit Which was sort of abstractly interpreting So as we go through the bytecode we abstractly do what it says So we load global that pushes the function add into our abstract frame We check as a function abdo is abstract so we can kind of just Not to do the check. We have to do a little check somewhere else, but again, I'll go to that skip through that Then we load the variable. We don't really need to load the variable We just maintain abstractly what it would have been if we'd done it Likewise, and then we push this frame. So we have another abstract frame And we keep doing this until we get to a point where we actually need to do some work. So the previous one We're checking that the value one is an integer. Well, we obviously leave you don't need to do anything that we know it is but then here we need to check that the Value B is an integer and in this case We actually have to do some real work to do this. So first of all, we need to get to be so here If you notice the I've changed the color and so he highlighted the load fast instruction there at the top And that's because we actually have to do that instruction all to get to be in order to check it Then the check itself has to be done at runtime and obviously we can't add some unknown number to one virtually We have to do it for real So in order to do that, we actually need the one so we load that then we add the int But the frame itself has remained like abstract So we can we can pop that without doing any real work and then the store has to do real work And then we've ended up at the real result with only doing much less work So we've basically reduced our 13 instructions down to five instructions in this admittedly rather contrived example So apart from demonstrating you can prove anything with a contrived example We also showed that you know, there is the potential here for reducing them out of work done considerably so there's a couple of optimizations there reducing memory use and This optimized how we reduce the work done in the interpreter But as I said, there's a whole bunch of aspects to the VM that we need to consider So how do we bring all these together? 
So those were a couple of optimizations: reducing memory use, and reducing the work done in the interpreter. But as I said, there's a whole bunch of aspects of the VM we need to consider, so how do we bring it all together? There's a whole bunch of techniques we can apply: partial evaluation; compilation to machine code, commonly known as JIT compilation; specialization, which is the conversion of, say, BINARY_OP to BINARY_OP_ADD_INT; then there's memory management and the better object layout I mentioned — partial evaluation can also help with memory management; there's also what's called unboxing, where you can represent Python integer or float objects as machine integers and floats. Then there's the cycle garbage collector: a better object layout will help there, but there's also something called incremental collection, where we collect part of the heap rather than trying to do the whole heap at once. And then there's C extension code: we can even improve the performance of some of that, by allowing C extensions to be written against a lower-level interface, and by removing some of the overhead of calling into and returning from C extensions using specialization and unboxing.

The real thing I want you to take away from this is that Python is getting faster. We've done a fair bit, but there's plenty more to do, so we expect it to keep getting faster. If you're interested in helping, or interested in this sort of stuff, come and talk to me — but the most practical thing you can do is simply upgrade to the most recent version and save yourself some energy and maybe a bit of money. Thank you. And one last thing: that list of benchmarks we had at the beginning — that's our data, that's what we're working with. If it doesn't represent your workload, there's a good chance we aren't going to be helping you. So give us your benchmarks; we always need more benchmarks. Okay, thank you.

Thank you very much, Mark. Do we have any questions? Remember we have a microphone here in the middle of the corridor, so please approach and queue up if you have questions.

Thank you for your talk. My question comes from a place of ignorance, but at the beginning you mentioned caching some lookups. If you cache everything, would it not be equivalent to not caching anything?

Yeah — most caches would be wrong in general. We cache things in the specific circumstances where we can be highly confident that the thing we evaluated last time is the thing we're going to get next time, and that's actually not very many things; they just happen to occur a lot in code. For example, if you've got a piece of code that calls int on a string — say you're turning some JSON data into numbers, and you write int(variable) with a string in the variable — that int is a global name, so we look it up every time, but we know with almost absolute certainty that it's going to be the class int. That's the sort of thing we really want to cache. Other computations — there can be infinitely many of those, and no, we don't want to cache their results, unless you explicitly use lru_cache, but then that's on you.

Okay, thanks.
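For reference, the opt-in caching being referred to there is the standard functools.lru_cache decorator; a minimal sketch:

```python
from functools import lru_cache

@lru_cache(maxsize=None)        # opt-in memoization: caching results is on you
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(300))                 # fast, because repeated sub-calls hit the cache
```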
The person here in the front now.

Thank you so much for the talk, Mark. Do you see any value in cheating for speed, as speed is often a matter of perception? Maybe if we store the first 300 Fibonacci numbers and a bunch of prime numbers in Python, we win by cheating and give a better impression.

Yeah — that's why I want more benchmarks: so we're not tempted to do that.

We also have a question from Discord, remotely: are there any plans or ideas to leverage type hints to speed things up by knowing the types up front?

Okay, well, the problem with type hints is that they're just that — hints — so we still have to check. But we also already know, statistically, what the thing is likely to be, so the statistical information is no worse than the type hints, and it's more readily available and faster to get at. The problem with type hints is that they're generally giving us the same information, but worse. There are instances where we could use them, because a type hint says we should be confident it's going to be that type right away, whereas statistically we have to build up a bit of information first — so it might give us a little bit of a head start in some circumstances, but generally it's not worth the extra complexity.

Thank you very much. The person in the back.

Are there any plans to replace the stack machine in CPython with a register machine? I think the no-GIL fork went in that direction.

Yeah, they did, and we actually did some experiments with that. There is a speed-up, but the problem is that it only speeds up the interpreter, and as soon as we move to a just-in-time compiler it doesn't help. And stack machines are very nice in terms of manipulating the code: in that example I had earlier where something is inlined, the stack machine has the very nice property that the inlined code has exactly the same effect as a call, as long as you have enough stack space, whereas with a register machine that doesn't work — you have to rename all the registers.

Thank you. The question here in the front.

The Zen of Python says there should be one, and preferably only one, obvious way to do it. I don't know how many ways we have to add things now: we have integer add, we have the general add, we probably have string concatenation too — so we have lots more implementations of add, which is going to add maintenance overhead for CPython. So what is it that you are doing to the CPython core dev maintainers, making their jobs so much harder by giving them so much more code to maintain going forward?

Well, one argument I could come up with is that as we make Python itself faster, there's less need to write extensions in C, so maybe more code gets written in Python. And the other thing is, of course, it's CPython — so we're free to do it.

Do we have any other questions? No — and I don't see any other questions on Discord either. So let's thank Mark again, and see you next time.