Hello, everyone. Welcome to this talk about optimizing Python code. Before I get started, who's familiar with Python already? Raise your hands. Fantastic. So, hey, I'm Eric. I work as a Python developer for a company called Adhimean. You can find me under the handle Eric Gezeny on GitHub, Twitter, whatever.

Since you made it to this room, I assume you're curious about optimizing code. So I'll walk you through a few methods that I use to track down performance issues. These are just my methods; there are plenty of others out there, so pick whatever works for you.

The first question you need to ask yourself is, obviously: why do you want to optimize, and what? You don't usually wake up in the morning wanting to optimize stuff, unless you just watched a talk about optimizing Python, and that's not a very good reason. Optimization is usually a byproduct of solving a problem. You might have an issue with slow I/O, for instance, and it's crippling your data analysis or your database queries, or forcing you to collect less data, or less often, from your connected devices. You might want to run programs in a shared environment where you have to pay for extra RAM, or where stuffing too much processing onto the same machine might take the machine down altogether because of RAM usage.

You might also want to optimize your uptime. I know you think it's a fetish for system administrators with beards and sandals, but uptime is very serious in some domains. And systems that use fewer resources tend to put less stress on the hardware they run on, so they stay up longer. A system designed to handle bad inputs is also a system that will survive brutal network losses, for instance, or defective hardware. And finally, some systems should never fail, because human lives depend on them. So that's the kind of thing you may want to improve and optimize.

Handling a lot of workload at the same time is a non-trivial exercise, especially in Python. But it's a pretty common scenario if you want to write, let's say, a web server or a web service, an IoT controller, or a database front-end.

And finally, you may want to make things go faster. Today we've grown into an impatient bunch: time is money, wasting time is wasting resources, so faster is always better. Well, maybe not everywhere, but when it comes to computing, nobody is going to complain if calculations go too fast or if an algorithm is too efficient, obviously. That's the kind of optimization we're going to look at today. Because, yes, you need to pick one: each of these domains is a full-blown conference in itself. Also, you don't want to get mixed up when you're optimizing. We are going to do it scientifically, with measurable outcomes. You pick one category, let's say, conveniently, CPU optimization, and you optimize that. Then, when you're satisfied, you're free to optimize I/O or reliability. I'm going to speak about optimizing for speed, but the method remains valid for all domains; it's always the same thing we do.

First, you need to set some targets, like "this page must load in under 200 milliseconds", or "one iteration of this loop must execute in under 10 milliseconds". The first one is usually the kind of thing the marketing people come and tell you: we're getting bad SEO because the page takes too long to load.
The second is more like an embedded system or a time-critical system where you need to execute something every 10 milliseconds and not miss a beat. Or maybe you're running on a controller which only has eight kilobytes of memory, so you need to optimize for that.

For each target you define, if it's not already obvious from the problem, a set of metrics that you will compare for every change you make. For CPU optimization, it's pretty easy: it's the time spent running a particular piece of code. That way, if you're making things worse, at least you know, whereas if you're just using your judgment, "oh, it feels faster", "it feels more efficient", psychology starts playing a part. It also lets you see when you've reached your target, because it's pretty easy to get sucked in, "oh, I could make it a bit faster, and a bit faster still", but if you've already reached your goal, there's no need to dig further.

This rule is often overlooked, and it becomes more true the more experience you get: benchmark, benchmark, benchmark. Facts count more than experience. You always have to measure, compare, and review your progress, because some obvious fixes sometimes aren't. You can trust your guts, but you need to verify.

So how come you can't trust your guts? Because, well, you control your program, but your program sits on top of the interpreter, which is written in another language, C. Well, depending on whether you use CPython, but let's take that for granted for now. That in turn executes on an operating system. Which operating system? They're not all made the same regarding schedulers, memory allocators, and so on. If you're doing I/O, what file system are you using? Do you use RAID? Which RAID? RAID 1? RAID 0? RAID 10? If you're doing GPU computations, who provides the drivers? Which version of the driver? Isn't there a slowdown in the driver itself? And then all this runs on hardware, where the architecture has a significant influence, as does the presence of CPU extensions. Is it an old IDE drive you've got, or a new SSD? Even the damn temperature and the material the network cables are made of can interfere with your performance measurements. All of this to say that even if you have a strong inner feeling that a given fix should make things faster, sometimes you see the expected result, sometimes the opposite, and the cause sometimes lies in the middle of all these layers of things that work together to make your program run.

So a word of advice here: you must never attempt to change a system if you don't have a solid version control system that can at least diff and commit, and solid test coverage. Because when you change things in many different places to see if it improves, you might wonder what exactly brought the improvement. A good method is: OK, I achieved a faster execution; now I rewind and apply my changes incrementally, to confirm that it was a deliberate action that brought the improvement, and not that, by random chance, I aligned all my bits in memory and things went faster, completely out of my control. So rewind your changes, reapply them, and if you break something, you can safely revert.
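A minimal sketch of the kind of measuring harness this implies, using only the standard library; none of this is from the slides, and the helper name is mine:

```python
import time

def best_of(func, *args, repeat=5):
    # Best-of-N wall-clock timing: taking the minimum damps out
    # interference from whatever else the machine happens to be doing.
    times = []
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        times.append(time.perf_counter() - start)
    return min(times)

# Measure the same workload before and after each incremental change,
# and keep the numbers: "it feels faster" doesn't count.
baseline = best_of(sorted, list(range(1_000_000, 0, -1)))
print(f"baseline: {baseline:.4f}s")
```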
And when you achieve your goals, celebrate. Sometimes you hit the end of the road: despite your best efforts, you can't do better. But again, there are many ways of meeting your targets. Maybe there are other avenues of improvement. Just take a break and look at it from a different angle; there's always more than one tool in the box.

Speaking of tools, my go-to tools for speed optimization in Python are, first, the profiler; I'll show you all of them, don't worry. So: the profiler, a graphical analyzer, the timeit function, usually used in conjunction with an improved interpreter, and the great pytest-profiling plugin. How many of you have already used pytest? Not enough.

OK, so what's a profiler? It's a library that captures every single call made while the profiler is active. Let's say you've got function A that calls function B; the profiler will watch this interaction and log: function A called function B, and function B returned after, say, 10 milliseconds. And every call adds one line to the log. It's going to capture a lot of noise as well, because you might want to examine one piece of code, but before and after it a lot of other code also runs, code you have no performance issue with. So I suggest you focus on one specific top-level function, figure out from there which function has the worst timings, and proceed by narrowing the scope until you find the function that really needs optimization. The profiler capture can be dumped in Python's stats format, also called pstats.

Just for your information, you can run the profiler on the whole program. I would not recommend it, because there's too much noise getting into play, except if you want to debug things like the importing of your modules. The only advantage I see is that you don't need to change your code to do it. Beware that the profiler will watch and log everything that occurs in your program, so it's the real-life equivalent of driving with the parking brake on. As soon as you're done profiling, remove the profiler to see the real results: if you improve something, say you shave half a second off the program's execution with the profiler still in place, when you remove the profiler you will see the real accomplishment, and it's probably going to be much better without the profiler. This is where being able to read the capture becomes a bit more critical.

So how do you embed the profiler within your code? Just import the cProfile library, create a profiler object, enable it, run your target function, disable the profiler, and dump the stats. Easy peasy; there's a sketch of it below.

So that's what a stats file looks like. If only we had a bigger screen; let me find my mouse so I can point at it. For those who can't read it: these are the timings, so total time, per-call time, cumulative time, and there, the function that got run. And again, for those who can't read it, here I ran the statistics over the full-program capture, so I get a lot of importlib bootstrap function calls. They're not relevant for this exercise; it's just the Python interpreter booting itself and loading my modules. And since my function is pretty fast to execute, the only thing I can see in this capture is noise, so it's not very useful. On the other hand, if I target just one function, then I only see what's interesting to me; here you can see it's only my own code being run. But let's face it: it's not very easy to read for the human eye.
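Here is the embedding sketch promised above, using only the standard library; my_target_function is a stand-in for whatever you actually want to profile:

```python
import cProfile

def my_target_function():
    # stand-in for the code you actually want to profile
    sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
my_target_function()
profiler.disable()
profiler.dump_stats("target.pstats")  # readable later with pstats.Stats
```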
Luckily, we are blessed with a few more tools in the box. There's a tool called gprof2dot that will turn a stats file into a dot script. So what's dot? Anyone familiar with Graphviz and dot? Ah, pretty good. dot is a program that turns a description file into an actual graph that we can render as an image, look at, and interpret.

So that's the output of gprof2dot based on a stats file collected on my whole program, and you can see there's a lot going on. The redder, the slower; the bluer, the faster, or the less time spent. Don't worry that it's pretty tiny; I'll show more examples later. What's important to see here is that each function calls one or more other functions, and the color gives you clues about which optimization path to follow. There's no need to go into the blue space here; you can just focus on following the red. This next one is from the same program, but targeting only the one function I wanted to capture. You see that it's much narrower: this is the top function I wanted to examine, and it just makes calls to other sub-functions, but it's much clearer what is taking time. Compared to the previous slide, where I had to look at, say, ten-ish functions to figure out what was taking time, here it's pretty obviously just two of them. We'll come back to looking at pictures later.

Now I'll show you a bit more of IPython and the %timeit magic, which will execute a function for you, calling it several times, just to figure out how fast it is. You may have already tried to do some performance analysis before by just timing how long something takes to run. If it runs pretty fast, the load of your computer may interfere with the actual run time of the function: if you happen to have all your CPUs free and nothing running on your device, the execution is going to be pretty fast, and if you also have a Chrome tab open somewhere streaming video, your function is magically going to be slower. Why so? Because of the interference of your system with the measurement. timeit runs the code you pass it 1000 times by default, so it kind of averages out the load of the system; the waiting doesn't weigh as much in the final figure. It's very easy to use when you're at the terminal and want to try several different parameters: let's say you're calling a function and you want to see whether it gets better or worse with this parameter or that one. It's pretty nice when you're using IPython.

So here we can see, for those who can see it, that if I call my function with the MD5 hash method, it takes 207 microseconds, and if I call it with SHA-512, it takes 231 microseconds. So this one is a bit slower; but, kind of weirdly, each loop takes longer with MD5 and less time with SHA-512. It might be that the total time was influenced by the load of the system, but each loop was more or less 2.19 microseconds in the second run. So you can tell pretty accurately what the exact timing is, both the cumulative timing and the individual timing of each loop.
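For reference, turning a capture into a picture is typically something like gprof2dot -f pstats target.pstats | dot -Tpng -o target.png on the command line. And the %timeit workflow looks roughly like this in an IPython session; the hashlib calls are stand-ins, since the talk timed its own function, and each %timeit line prints loop counts and per-loop averages when run:

```python
In [1]: import hashlib

In [2]: %timeit hashlib.md5(b"computer1234").hexdigest()

In [3]: %timeit hashlib.sha512(b"computer1234").hexdigest()
```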
And finally, if you already have good test coverage, and you should, if you remember my earlier slide, you can enable profiling on each of your unit tests using the plugin here, pytest-profiling, which is a standard plugin for pytest. This plugin does the heavy lifting for you: activating the profiler, running the stats through gprof2dot, and then outputting a few SVGs or PNGs for you to look at. That's very convenient if you're a beginner, so you don't have to go through all the steps manually every time.

So now we've got our tools; let's see how to use them. You will often hear people mention obvious performance hogs, or low-hanging fruit. That's usually the first thing you need to look at, because those are under your control, unlike CPU architecture or temperature, if you remember what I said before. Good refactoring will also usually bring a lot of good things with it and few drawbacks. So I would not recommend going for more advanced techniques before having exhausted all of this low-hanging fruit. Even if it doesn't sound sexy, it's actually the most efficient approach.

For the presentation, I will focus on a booming field: password cracking using a brute-force method. There are very efficient tools to do this, but I'm not going to show them, because obviously there is nothing left to optimize in them. So I decided to bake my own password cracker, like all the good kids do.

Just a bit of vocabulary, if you're not familiar with password cracking. What does it mean? You're not actually cracking the password; you're comparing the hash of a password with the hash of something you don't know. You generate a lot of possible passwords and run them through a hashing function. Who knows what a hashing function is? Great. So you compare the hashes, and with some luck you end up with a match, which means, since hash functions are deterministic, that what you put in was the actual password the person entered in the first place. Brute-forcing is attempting all the possible inputs in the hope of finding the one used initially, and a salt is just a piece of data that you prefix or suffix to your password; it's an added factor to artificially increase the size of the input.

So this slide is absolutely unreadable: it's a 35-line Python cracker, and it's very, very bad. Here you might see that I'm using the very clever "computer" as a password, so I'm just using the string computer and then salting it with four digits, which is, well, just don't do this. And I want to see how fast I can figure out this password using the list of the 500 worst passwords in history, one at a time. To spice things up, I salted my password with four digits, as I said, but I'll also use the very quick, very insecure MD5 function for hashing. So, fingers crossed that I end up somewhere.

I made a bunch of utility functions; there's a sketch of them below. There's a digest function that just transforms clear text into a hash. A numeric_salts function that generates salts, basically from 0000 to 9999. One function that combines both, generating a hash for all combinations of one password and all salts. And then one function that iterates over all possible passwords one by one, calls the function that appends the salts and generates the hashes, and compares each computed hash with the target hash; if I get a match, it returns which password it was and which salt it was.
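The slide code itself isn't in the transcript, so here is a rough reconstruction of the helpers just described; the exact names, signatures, and details are assumptions:

```python
import hashlib

def digest(cleartext):
    # clear text in, MD5 hex digest out
    return hashlib.md5(cleartext.encode()).hexdigest()

def numeric_salts(salt_space):
    # every four-digit salt, "0000" through "9999" for salt_space=10_000
    return [f"{n:04d}" for n in range(salt_space)]

def generate_hashes(cleartext, salt_space):
    # hash one candidate password with every possible salt
    # (note: numeric_salts is recomputed on every call; keep reading)
    return [(digest(cleartext + salt), salt)
            for salt in numeric_salts(salt_space)]

def crack(passwords, target_hash, salt_space=10_000):
    # try each candidate password with each salt until one matches
    for cleartext in passwords:
        for hashed, salt in generate_hashes(cleartext, salt_space):
            if hashed == target_hash:
                return cleartext, salt
    return None
```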
So if I run the profiler on my tool, you see that the path is pretty obvious, and you might see that there's one call here to numeric_salts that happens 110 times, which in turn generates more than a million calls, because I'm generating 10,000 salts times 110. That's one million something, and that's a lot of wasted CPU, because, as you'll see a bit further on, those millions of calls could be avoided.

So what's an invariant? Basically, if you have a function A that calls B, but B does not use anything from A's scope, then why are you even calling B from within A? B could be located outside of A's scope; B is then an invariant. So if you squint at the code, can you figure out whether there's an invariant here? I see nobody squinting, so I guess it's too hard to read; I'll just spoil it. In this first function, the obvious candidate was generate_hashes, but it uses the cleartext parameter, which comes from the for loop here, so you could not hoist generate_hashes one step up, outside of this loop, because we need that parameter. And here, is there an obvious invariant? Still not squinting? Here you see there's a function called numeric_salts, called with salt_space and only salt_space, but salt_space comes from the caller. So there's no good reason why numeric_salts should be called every time generate_hashes is called. We can take that function and move it up the call tree, since it's only used by the caller. We extract numeric_salts, put it before the loop, and just pass down the result of that one execution; it gets executed exactly once. And if we run the profiler again, numeric_salts is gone from the graph, because it's only called once, so it no longer weighs enough for the profiler to surface it.
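On those reconstructed helpers, the hoist described above looks like this (still a sketch, not the slide code):

```python
# Before: every call to generate_hashes rebuilds the same salt list.
def generate_hashes(cleartext, salt_space):
    return [(digest(cleartext + salt), salt)
            for salt in numeric_salts(salt_space)]

# After: numeric_salts is hoisted into the caller and runs exactly once.
def generate_hashes(cleartext, salts):
    return [(digest(cleartext + salt), salt) for salt in salts]

def crack(passwords, target_hash, salt_space=10_000):
    salts = numeric_salts(salt_space)  # the invariant, computed once
    for cleartext in passwords:
        for hashed, salt in generate_hashes(cleartext, salts):
            if hashed == target_hash:
                return cleartext, salt
    return None
```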
One quick tip also: on Unix, you've got the nice time utility, and time here is telling me that, OK, I found my password. Oh, right: I changed my example to increase the execution time, so now the password is dreams, which is also on the list, and the salt was 5432, and it took my clever program seven seconds, using 99% of my CPU, to come to this conclusion.

Now let's look at how to make it better with parallel computing. An embarrassingly parallel problem is one where little or no effort is needed to separate the problem into a number of parallel tasks. Thank you, Wikipedia. The difference between parallel and sequential, you probably know already: parallel means one task does not depend on the output of another, so you can run many of them at the same time; sequential means I need to wait for the output of one function so I can use it as input to another. Luckily, password cracking is embarrassingly parallel, because I'm just exploring a space of possible solutions; I don't really care whether any particular attempt succeeded or not.

So this is a 46-line parallel password cracker, still unreadable. How do you do this? Well, I had to make a few changes to how things were done, but it turns out you can make a pool of different processes. As you know, one Python interpreter runs in one single process. But your computer, I hope, has more than one CPU nowadays, and if you want to use those processors, you need to use the multiprocessing library.

Here it's a very crude example; there are more elegant ways of doing this, this is just one of them. You create a pool of processes, and then you apply work asynchronously, meaning: I hand out work to these processes and I don't wait for the answer, I don't wait for one to complete before giving a job to another worker. I pass them the clear text and the salts, so one password and all the salts, plus the target hash, and I give them a function that prints the result when they find it. That's not very smart: I'm going to get a result fast, but I didn't plan on interrupting my program when the solution is found, meaning it will still exhaust all possible solutions, running through all passwords and all salts, because it's a very crude example and I didn't think it through very thoroughly. But it will display the solution very quickly: as soon as it finds it, it displays it; I just need to terminate the program manually. There's a sketch of this below.

So how does the worker look? Exactly the same as before: I get a clear text password, I get the list of salts, and for each salt I check whether the hash matches the target; if it does, I return it.

So that's the output, and again I use time. time reports 353% of CPU time, meaning it burned 15 seconds of CPU time, but on my clock it only took four seconds to crack it. So it's about twice as fast as the iterative version, but it used a lot more CPU time. And if you squint a bit, you might be able to see this big bump in CPU usage: that's what 353% CPU usage looks like on a four-core machine.
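A rough reconstruction of that crude pool, reusing the digest and numeric_salts helpers sketched earlier; names and structure are assumptions:

```python
import multiprocessing

def check_password(cleartext, salts, target_hash):
    # Worker: try one candidate password against every salt.
    for salt in salts:
        if digest(cleartext + salt) == target_hash:
            return cleartext, salt
    return None

def report(result):
    # Callback, run in the main process when a worker finishes.
    if result is not None:
        print("Found it:", result)

def crack_parallel(passwords, target_hash, salt_space=10_000):
    salts = numeric_salts(salt_space)      # hoisted, as before
    with multiprocessing.Pool() as pool:   # one process per core
        for cleartext in passwords:
            pool.apply_async(check_password,
                             (cleartext, salts, target_hash),
                             callback=report)
        pool.close()  # no more jobs coming
        pool.join()   # wait for every worker; like the talk's version,
                      # this keeps going even after a match is printed
```

On platforms that spawn rather than fork processes, this needs to live in a module and be driven from an if __name__ == "__main__": block.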
So, one other tool in your toolbox: throwing more hardware at it. It's effective, but often overlooked, because it feels a bit like cheating: most of the time you're just postponing the problem. So what do better specs mean? Maybe a different CPU architecture: some processors meant for desktop computers are not very suited to parallel programming, while other architectures, used more in servers, are more efficient. Clock speed is obviously very important. And the size of the L2 cache matters for very CPU-intensive applications: you want your cache to be doing useful work all of the time, and the bigger the cache, the better. For non-parallel problems, the only answer is a faster CPU clock: 3.5 gigahertz versus 3.2 gigahertz, 3.5 is always going to win for non-parallel programs. But for parallel programs, you can add more CPUs: I ran on my four-core machine here; on a 32-core machine, it's obviously going to be faster. And while I'm at it, why use just one machine with 32 cores? Why not rent a farm of machines, say 50 nodes of a distributed computing system, each running 32 cores? The time divides, give or take, by the total number of cores: with 100 cores, it's going to be roughly 100 times faster than running on one single core. Just one thing: if you start using different computers to run your calculations, you might not want to roll your own distributed computing software. A quick way of achieving this is something called Celery; maybe some of you are already familiar with it. It handles all the networking, queuing, failover, and so on for you, which is much more convenient.

And then you've got high-performance libraries, because you don't want to reinvent the wheel. These high-performance libraries we have in Python rely on a concept they call vectorizing, or vectors. Basically, in the iterative world, if you want to compute a sum, what you do is: for each line in my lines, the total is the previous total plus the amount of the current line, then return the total. The good thing is that each line can be different: you can have sometimes integers, sometimes strings, and compute each line in a different way. But think of an Excel sheet: if you want the total of one column, that column usually holds only one type of data. It's always floats, or always integers, or always dates, or always whatever. So these libraries exploit this trait of data being conveniently aligned, one column representing one type of data. When your data is typed and your dataset is homogeneous, same type in the same column, you can profit from optimized calculation procedures, either in the CPU or just through faster loops. If all your data is the same type, summing up a whole column is much faster, even if that's just because it drops into C or Fortran behind the scenes. What you see from Python is usually something along the lines of lines["amount"] and then a call to a sum function that magically returns exactly the same result as the iterative version, but way, way faster; there's a sketch of this below.
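To make that sum example concrete, a small sketch; the lines data and the amount column are invented for illustration:

```python
import numpy as np
import pandas as pd

lines = [{"amount": float(i)} for i in range(1_000_000)]

# Iterative world: one Python-level addition per line.
total = 0.0
for line in lines:
    total += line["amount"]

# Vectorized world: a typed column summed by one call that loops in C.
amounts = np.array([line["amount"] for line in lines])
print(total, amounts.sum())

# The pandas spelling of the same thing.
df = pd.DataFrame(lines)
print(df["amount"].sum())
```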
So, these high-performance libraries, which are they? First, something called NumPy. Who knows NumPy already? Fantastic, so I'm going to be quick here. The base object in NumPy is the ndarray, and it incentivizes you to use the same type across a whole array. The only drawback I've found is that an array is shaped as number of rows by number of columns, so it's very dense: if you've got sparse data, say you're only interested in a diagonal, you still need to represent the whole matrix in memory, which can be a bit RAM-consuming. The syntax is a bit unfriendly unless you come from C or Fortran, where it makes more sense. It's very efficient for numerical operations, and it has good integration with Cython, which we'll see a bit further on.

Then you get pandas. pandas is the more data-analyst-friendly version of NumPy, though it still drags a lot along from NumPy. The base objects are the Series, one-dimensional, so there's always an index plus one kind of data, and the DataFrame; when your data is homogeneous and indexed, it lets you access it faster. It's batteries included: NumPy has some helper functions to read CSV and a few other file formats, but pandas ships integrations for way more formats. Sane defaults, too: parsing a CSV with pandas is less time-consuming and less error-prone than parsing it with NumPy, even though behind the scenes it's exactly the same thing happening. Handling of dates, datetimes, and time zones is also simplified and very robust. So basically, pandas is more user-friendly, but not yet Pythonic; sometimes it still looks a bit like NumPy. And NumPy is definitely not Pythonic: it stays very close to the way the data is laid out in the C structures underneath.

pandas really does make a big improvement regarding, I would say, the modeling of the data. Still, the way you apply functions, for instance, is not really the way everyone would write it: it makes excessive use of lambdas, or at least advises you to, and that's not how I would recommend doing things. But for me, pandas is way more convenient to use than NumPy; I'm not pretending NumPy is more user-friendly or Pythonic than pandas.

So here I'd like to check the behavior of counting in pure Python. I want to check how many leaked passwords, from a big password leak that occurred a few years ago on the RockYou website, contain the string "eric". This is my very crude and not very smart way of counting lines in Python; there are many different ways of doing it, but that's one of them. Executing the script finds 16,681 matches in 33 seconds; again, 99% CPU for 33 seconds. And this is how you would do it in pandas: basically reading the file into an array, counting the total lines as well, then looking for how many rows contain the string eric, and counting those. It's doing a bit more work here: it first checks the total size of the array, so it has loaded 14 million passwords, I guess. And, pretty weirdly, it finds a different count: I put that down to string comparison being implemented differently between Python's in operator and pandas, so it finds a few fewer passwords. But it still runs a bit faster than my pure Python example. The thing is, I'm demonstrating on strings, which are not the best showcase, because manipulating strings, whatever the language, is always more costly than manipulating numerical data. If I had taken a different example, with numbers, the difference in computation time would have been more striking. A sketch of both versions follows.
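A rough reconstruction of the two counters; the file name, encoding, and parsing options are assumptions, and small parsing differences like these are exactly the kind of thing that explains the two diverging counts:

```python
import csv
import pandas as pd

# Pure Python: stream the file line by line.
count = 0
with open("rockyou.txt", encoding="latin-1") as f:
    for line in f:
        if "eric" in line:
            count += 1
print(count)

# pandas: load every password into one typed column, then vectorize.
# Tab separator effectively means "one column per line" for this file.
df = pd.read_csv("rockyou.txt", header=None, names=["password"],
                 sep="\t", encoding="latin-1", dtype=str,
                 quoting=csv.QUOTE_NONE)
print(len(df))                                        # total passwords
print(df["password"].str.contains("eric", na=False).sum())
```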
Then, if you really, really want to reinvent the wheel, let's say you've got a very business-critical function that's not easy to model with a bunch of pandas calls, you might want to write it in Cython. So what's Cython? Cython code is like a bastard language between Python and C: you can inject some C-ness into Python code, hinting to the Cython compiler to make some optimizations for you. If you want to write plain C code, you're totally free to do so: you write your C code, compile it, and then use cffi or ctypes to call it. But then you're really in C land, dealing with the PyObjects yourself, and exceptions often end up in tears and segfaults. Cython, on the other hand, pre-compiles your Python code to C, doing all the linking and wrapping, so your code can be imported as easily as from my_module import my_function, and you get a nice, seamless transition between the C and Python contexts: you can still raise an exception from your C code, and while your Cython code executes a bit faster, it's not going to segfault; you'll get a proper Python exception.

You can use print too (well, printf is fine as well), and you don't have to deal with PyObjects if you don't want to: you have the ability to, but if you prefer to just write my_object.something, you're free to do so in Cython.

So here's an example shamelessly stolen from the Cython website, because it's pretty eloquent. That's your regular integrate function in Python; you just call it this way. In Cython, you start seeing those sneaky cdefs, which have nothing to do with def here; it's just a different way of hinting to Cython. And you can see that the variables have been typed. It's different from the annotations in Python 3.6 and up; here it's more like C, so if you're familiar with C, it will look very familiar. You get your declarations: i is going to be an integer, and s and dx are going to be doubles. And this alone already makes a big improvement. Then you can also type the function itself. Up to here, this function is still a Python function: Cython can make some deductions based on the types of the variables, which already produce some performance improvements, but it still runs as Python code, from within Python. Typing the function tells Cython: this is just pure C, execute it as if it were C code, don't look at anything else. But then indeed you risk a poorer exception-handling mechanism; Cython gives you the additional except keyword, which just says: if an exception occurs, return minus two, please don't segfault.

Cython also provides a nice tool that helps you figure out which parts of your code still run in Python land and which parts run in C land: the more yellow, the more Python. Here, if you can see it, this function call is obviously going to run in Python, because that's the entry point, so it's pretty hard to turn it into pure C; but my integrate function here has turned completely white. It means Python is not going to be involved there: no casting between PyObjects and integers; integers stay integers the whole time we spend in this function call. And here, most of the function, the cdefs obviously, are in C; this one is yellowish, so some bits are in Python and some bits are in C. So Cython is really helping you, with this tool, to figure out which pieces you still need to convert to pure C to achieve maximum performance.

The only drawback with Cython: it feels great, but you're actually importing all the problems you have with deploying C code in your program, meaning you need to package it and distribute it, and you need to compile it for each architecture you intend to run on. That's why there are different versions of NumPy and, sorry, different versions of pandas for each kind of architecture and platform out there. Still, Cython helps as much as it can by providing a few setuptools hooks. If you want to turn a module into Cython, it's pretty easy: you give it the .pyx extension and call the cythonize function from Cython itself. Then, every time you want to build or rebuild it, you call build_ext with the --inplace flag, and it turns your .pyx code into a proper compiled Cython module. And from there, it's very easy.
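The example from the Cython documentation that the talk borrows looks roughly like this; the file names c_integrate.pyx and setup.py are my own labels:

```cython
# c_integrate.pyx: the typed, Cython version of the integrate example
cdef double f(double x) except? -2:
    return x ** 2 - x

def integrate_f(double a, double b, int N):
    cdef int i
    cdef double s, dx
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx
```

```python
# setup.py: the setuptools hook; annotate=True produces the
# yellow/white HTML report mentioned above
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("c_integrate.pyx", annotate=True))
```

You then build it with python setup.py build_ext --inplace.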
You just do from c_integrate import integrate_f, and then you call it as if it were a Python function, which is pretty neat. That's how pandas is made, for instance: in pandas, you never know whether you're running Cython or Python code, because everything is hidden behind the curtain.

And last but not least, PyPy. It's an initiative from a couple of guys, I think; there are not very many people on this project. It tries to bring just-in-time compilation to the Python language. So what is just-in-time optimization? How many of you are familiar with the notion? That many; OK, I'm going to go fast, then. Compilers, let's say the C compiler, look at your code before it even starts and make some assumptions about how it's going to be executed, but they don't take into account the way you actually use it. So maybe some function will not receive as much optimization as another, or an optimization path will be chosen that isn't the one yielding the best results. And the problem is, the only way you'd know is after the fact, once you've been running the function in production with actual data. That's what just-in-time does: it instruments your code, looks at what's being executed, and from there improves the code and the code paths on the fly.

So PyPy is an alternative Python implementation, 100% compatible with 2.7.15 and 3.5 something, 3.5.3 I guess. It's not yet 100% compatible with some C-based libraries, namely pandas, but it is compatible with NumPy, so if you're able to do your work with just NumPy, you can use PyPy as well. It automatically rewrites the internal logic of your code for better performance, but it needs a lot of data to make good decisions, which means it has a very slow warm-up. So if you run it on a very small dataset, you won't see any performance improvement; you might even see a performance aggravation. Conversely, on a very big dataset, you're going to see a big improvement.

So here's a very simple example. Let's say I have a message I want to transmit, I'm making a lot of these messages and storing them, and I want to check the last message's length. So I'm creating, let's face it, five million objects; that's quite a lot for the CPython interpreter. Making five million objects, storing them, and looking at the last one takes 20 seconds on CPython, using 99% of the CPU, about 20 seconds in total. With PyPy, the same thing takes a mere five seconds, still using all of my CPU, but only for about six seconds in total. Why so? Because it had time to train itself while making five million messages. Now, if I try with 500, it's actually not better: CPython does the job in 0.04 seconds, but PyPy does it in 0.06. So here again, take that with a grain of salt, because measurement imprecision obviously comes into play, but still, it's not significantly better; 500 is not enough. So the good thing with just-in-time is that you can take existing code you already have and just run it through PyPy, and if you don't have any strong C dependency, it's going to be ridiculously fast. I've tried this messages example with 15 million, and it's still about 10 to 16 times faster than the CPython version.
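A reconstruction of that benchmark, runnable unchanged under both interpreters; the Message class and its field are assumptions from the description:

```python
# messages.py: run with `python3 messages.py` and `pypy3 messages.py`
import sys

class Message:
    def __init__(self, text):
        self.text = text

def main(n):
    # Build n objects, store them all, then look at the last one.
    messages = [Message("hello number %d" % i) for i in range(n)]
    print(len(messages[-1].text))

if __name__ == "__main__":
    main(int(sys.argv[1]) if len(sys.argv) > 1 else 5_000_000)
```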
The only problem is, if you need pandas, then currently you're kind of stuck. The folks working on PyPy are kind of working on improving compatibility with pandas, but it's not achieved yet. It's also another interpreter that you need to maintain, and so on, if you're deploying to servers. It definitely works better with pure Python types: if you have C structures and the like, the performance may not be the same. And it definitely needs a warm-up period.

The key takeaway is that you can't have it all. Every time you optimize in one direction, you're reducing readability, maintainability, ease of deployment, or compatibility with different libraries. Whatever kind of optimization you're doing, you're always sliding toward one thing and moving away from another that might be important. So it's pretty hard to find a sweet spot where you get it all.

In summary: if you've got a fairly simple code base and a wide deployment, let's say you've written a fun little library, you should aim for the low-hanging fruit first, then maybe an optimized library, say pandas or NumPy, and then finally better hardware, like more cores or a faster CPU. If your code is very easy to run in parallel, you should just aim for more threads or processes, depending on whether you're I/O-bound or CPU-bound; usually, just throwing more CPUs at it, in the form of more cores on the same machine or more boxes running the same code, is going to make it dramatically faster. If you're unlucky, you have to deal with sequential code, and your deployment options are limited, usually just getting better hardware brings an improvement: with a CPU that's twice as fast, your program is going to run twice as fast. You can also try PyPy, which on similar hardware sometimes runs significantly faster, and/or Cython.

Thanks a lot for your time. If you've got any optimization-related questions, I still have four minutes. If my calculations are correct... no, make that one minute. Any questions, maybe? Nope. Thank you very much.