The next talk is: is it a bird? Is it a plane? Accelerating Python with Numba. The talk is by Juan Luis Cano. Thank you.

Thank you very much, everybody, for coming, and thank you to Mario for inviting me to give this last-minute talk; I hope you enjoy it. So this is what we're going to talk about. I'm going to do a very quick introduction to why Python is slow. We're not going to dig into the details of that, just introduce the matter. Then I'm going to talk about Numba, which is a Python library to accelerate Python, and I'm going to give some examples that I think are very interesting. Then I'm going to spend some time on the limitations of Numba, the current limitations, some workarounds that I've been using, some tricks that you might find useful, and then some conclusions.

So first, a couple of words about me. I'm an aerospace engineer, and I work as a software engineer at Satellogic. It's a satellite company that is building a constellation of satellites for Earth observation. I do orbit analysis and software that goes on board the satellites, all kinds of things. I'm also the chair of Python Spain, which is kind of like the Spanish branch of the Python Software Foundation, and we organize the Python conference in Spain every year. This year is the seventh edition, which is in Alicante. There's a very cheap direct flight from here, from London, so I really encourage you to go. The paella is really good. And you can follow me on GitHub.

So first, a couple of words about why Python is slow. There are several reasons, and bear with me for a second if you are thinking, oh, but it's not really that slow. When we say that Python is not really that slow, it's probably because you're already using some workaround, for example NumPy. If you're using NumPy, it means that, in fact, you're using a thin Python layer on top of some C code, which does all the heavy lifting for you. Some of the data structures that you have in Python are not optimized for speed; they're optimized for other things. There's a very good article by Jake VanderPlas that goes into a bit of detail about that. The presentation is full of links, and the slides are already online if you want to check them later.

Then, Python is a dynamic and interpreted language rather than static and compiled, which means that if you do stuff like this, every time you're iterating over some variable, the interpreter has to check: OK, am I within the bounds? Is this the type that I'm expecting? It's doing lots of checks that add a lot of overhead, and when you have four nested loops (and we're going to see later why I would want such a thing), it really adds up and makes a big difference.

And then there's another thing that we love about Python, which is that it's a very easy language to do introspection in. We can say: take this code and tell me if it's a function, tell me what its parameters are, now change some local state. We can do all sorts of crazy stuff in Python. That's great; it's one of the things that makes Python so fun to work with and so dynamic. But on the other hand, some of these features have a huge cost. In particular, when I was doing some performance optimizations with Astropy, which is a Python library for astronomy, we found that the isinstance checks that we were doing in some places were more expensive than the actual astrodynamical algorithms that we were implementing.
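To give a rough sense of the kind of overhead being described (a toy measurement of my own, not the code from Astropy), one could time a bare arithmetic expression against the same expression guarded by an isinstance check:

```python
# Toy measurement, not from the talk: how a single isinstance() check compares
# with the small arithmetic expression it might be guarding.
import timeit

setup = "x = 3.0"
plain = timeit.timeit("x * x + 2.0 * x", setup=setup, number=1_000_000)
guarded = timeit.timeit("isinstance(x, float) and x * x + 2.0 * x",
                        setup=setup, number=1_000_000)
print(f"arithmetic only:         {plain:.3f} s per million calls")
print(f"with isinstance() check: {guarded:.3f} s per million calls")
```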
So this was really an important factor to take into account. But then, all of a sudden, in 2012, I saw this tweet by Travis Oliphant, whom you all probably know: the creator of NumPy, founder of Continuum Analytics, now Anaconda, et cetera. He said that he had just released a library to translate Python into LLVM, which is a compiler infrastructure that is very trendy nowadays and is being used in a lot of different technologies. And ever since, I've been using Numba a lot.

So now let me introduce this library. It's only one of the options that you have to accelerate Python code; we can discuss the others a little bit later, if we have time in the questions. Numba is a just-in-time compiler. It means that you write your Python code, you do something, and that turns it into compiled code on the fly. So you don't have a separate compilation step; you do everything in an interpreted way, in the same way you work in a REPL or something like that. It's released under a permissive license, BSD 2-clause, and it's very easy to install nowadays. It was not like that in the beginning, because you could only install it with Conda, but now there are wheels for all the platforms and you can pip install Numba whenever you want.

The first example I'm going to show is the Monte Carlo computation of pi. This is a very stupid example of how to optimize code, but I think it's a good working example. If we want to estimate the value of pi, there's a very stupid exercise: select points uniformly at random in a square, and then count how many of them fall within the inscribed circle, divided by the total number of points drawn. In the long run, when I draw a lot of points, this gives me an estimate of the value of pi.

So if I implement this, it would be something like the following. I'm selecting x, y coordinates in a uniform way, then I'm checking whether each point lies within the radius of the circle or not. Then I divide the number of points inside the circle by the total number of points and, scaled by four, I get a pretty bad estimate of pi. If I only use 100 points, it's going to be quite a bit off. If I use 100,000 points, it's a little bit better, but not by much. And if I use 10 million points (I didn't bring the result here), it's good to two or three decimal places, but it takes four seconds. So it's not really a super efficient method of estimating pi.

But the good thing is that we can take this function and accelerate it in a very easy way. If we do from numba import jit and then pass our original function to this just-in-time compiler, the first time that we call it, it's going to take a step to see what the types of the input parameters are, and then it's going to do some type inference on the types of the outputs. So the first call is going to take a little bit more time, in our case a little bit under 700 milliseconds. But then when you repeat it, it's only 100 milliseconds. If you remember, before it was four seconds, and now it's 100 milliseconds, so it's more than 10 times faster. With only one added line of code, we already get a very nice improvement (a short sketch of this example follows below).

Then there's another example that I'm very fond of. This is the formula for what happens to a plate when you apply some load; in layman's terms, what happens to this table when I push it towards the floor, something like that.
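Here is a minimal sketch of the Monte Carlo pi example described above; the function name and details are a reconstruction, not the exact code from the slides:

```python
# Minimal reconstruction of the Monte Carlo pi estimator described above.
import random
from numba import jit

def pi_montecarlo(n_points):
    inside = 0
    for _ in range(n_points):
        x = random.random()
        y = random.random()
        # Count the points that fall inside the quarter circle of radius 1.
        if x * x + y * y <= 1.0:
            inside += 1
    # The fraction of points inside approximates pi / 4.
    return 4.0 * inside / n_points

# One extra line: pass the original function through the JIT compiler.
pi_montecarlo_fast = jit(nopython=True)(pi_montecarlo)

print(pi_montecarlo(100_000))       # pure Python: slow
print(pi_montecarlo_fast(100_000))  # first call compiles, later calls are fast
```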
So if you go and implement this plate formula in Python, it's going to look something like this. And then you see this wonderful four-nested loop that we need to use here, and as you can imagine, this is not going to be easy on the Python interpreter. In fact, I'm going to run the quick version first. If you notice, instead of the separate step of doing fast_function = jit(...) as before, now I'm adding the decorator on top of the function definition, and everything works. This fast calculation is taking around 100 milliseconds, and I didn't separate out the step of checking the types here. If I do it with the original Python function, which I can access through the .py_func attribute that now appears on the compiled function, it takes 14 seconds. So now we're about 100 times faster, again only adding a one-line decorator. I think this is very powerful. And then, of course, you have the output, which is this: the plate when you apply the load, shown using the Jupyter widgets that Sylvain was presenting earlier this morning.

OK, so some shameless self-plug, why deny it? I was presenting these results at the European Space Agency some years ago. I was implementing a highly nontrivial algorithm, and I managed to get it more or less within the same order of magnitude as the Fortran version. Of course, it was not faster than Fortran, because that would be a little bit crazy, but it was within one order of magnitude, which I think is a very nice improvement. And the pure Python equivalent was like 1,000 times slower.

Then there's another thing that I won't cover in detail, but I think it's very interesting: you can call C functions exported through CFFI. CFFI is a project that the PyPy people started to have a better way of interfacing Python with C extensions, C or C++. Instead of hacking the CPython API directly, which is what NumPy does, for example, you have two layers: on the one hand you have the C code, and on the other hand you have some sort of interface which is made with CFFI. And the good thing is that Numba, for three years now, has had support for calling CFFI functions natively. So if you have some C code and you want to add some accelerated functions on top of that, you can still do it.

OK, so now let's talk a little bit about the caveats and limitations. I'm going a little bit too fast, so I hope you're preparing a lot of questions. The first thing is that there are two modes of operation in Numba: one is the object mode, and the other one is the nopython mode, and this is a little bit tricky. If you apply this @jit decorator to some function and, for any reason, Numba is not capable of accelerating it, then it has a fallback mode that says: OK, I'm not going to try to accelerate this; instead, I'm going to wrap the Python interpreter and pretend that I'm accelerating it. This is, 99% of the time, not what you want, because the function is going to be as slow as it was before, or even slower. So you have to avoid this object mode at all costs. In fact, this is so problematic that they're now in the process of slowly deprecating it, because they realized it was a mistake in the first place. But there is a way to deal with it, which I'm going to show in a second.

So, for example, if you have this function here that is just looping over a silly range between 0 and 10, it accelerates nicely with @jit, as in the sketch below.
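A rough reconstruction of that toy example (not the exact slide code) might look like this; it compiles cleanly, and the comment notes the change that breaks it:

```python
# Rough reconstruction of the toy example: a silly loop over range(0, 10)
# that Numba compiles without complaints.
from numba import jit

@jit
def silly_sum():
    total = 0
    for i in range(10):
        total += i
    return total

silly_sum()  # compiles fine on the first call
# Swapping range(10) for reversed(range(10)) is the kind of unsupported call
# that triggers the object-mode fallback warning described next.
```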
But then, if you add this reversed function call here, which is not supported by Numba, you get this huge warning telling you: beware, because this is actually not accelerating anything. This warning is a feature of the latest version of Numba, 0.44; before that, you would just get a slow function call without any notice.

So the solution, and what I've been doing since the beginning, is to use @jit with this parameter here, nopython=True, which forces the nopython mode. If Numba can't compile the function for any reason, it's going to give you an error straight away. So you either have fast working code, or you don't have it at all, which I think is a more reliable situation. Or you can use @njit, which is exactly the same thing; it just saves some keystrokes, because the nopython=True is built into the decorator.

Then there's another limitation which I find extremely important, and which I had the opportunity to discuss with the core developers, because they're now hosting webinars with open source projects every two weeks: passing functions as arguments is slow, and I think this is very important. Say you have some piece of reusable code and you want to say, OK, I have this algorithm, but I want it to work with different functions that share some common interface. Right now there's no easy way of doing that. One typical example where I wanted this is optimizing functions. For example, this is a very simple implementation of the Newton method: you have some numerical function and its derivative, and you can find the zeros of the function in this iterative way. The thing is that, in general, you don't want to reimplement the Newton algorithm for every function you want to optimize. What you want is to have your objective function on one side and, on the other side, the algorithm that is going to optimize it, which is what SciPy does and what every sane person does. But right now, if you try to do this, the function is actually going to be slower than it was in plain Python. This is an open issue that they're still working on.

The downside of this is that your code becomes a little bit more Fortran-esque; instead of being Pythonic, it's a little bit more low level. But I found a workaround: using some combination of caching and closures, I could reproduce this functionality in a satisfying way (there's a small sketch of the idea below). And in fact, now optimizing that function takes like 400 nanoseconds, which is totally awesome.

And then another limitation is that Numba functions can only understand very basic Python structures and NumPy arrays, and nothing else. For example, in the Astropy community, we use objects derived from NumPy arrays, the quantities, which add physical units. So I have a NumPy array, but it has some units attached to it: meters, kilometers, astronomical units, whatever. This is very convenient, because most NumPy functions, especially with the latest NumPy release, understand that this is an object derived from a NumPy array and can preserve the units. For example, if you take the square root of square kilometers, you get kilometers, which is very nice. But this doesn't work in Numba at all, and the same applies to things like automatic differentiation, so to any object that you derive from a NumPy array. So what I found, my strategy for this problem, is to have something like a two-layer architecture.
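As a sketch of the closure trick mentioned above (my own reconstruction, not the exact workaround from the talk): instead of passing the objective function as a runtime argument, a small factory closes over it and compiles a Newton solver specialized for that function.

```python
# Sketch of the closure-based workaround: compile a Newton solver specialized
# for one objective function by closing over it, instead of passing it as a
# runtime argument.
from numba import njit

def make_newton_solver(func, fprime, tol=1e-10, maxiter=50):
    # func and fprime should themselves be @njit-compiled functions; the
    # solver below is compiled once per (func, fprime) pair.
    @njit
    def solver(x0):
        x = x0
        for _ in range(maxiter):
            fx = func(x)
            if abs(fx) < tol:
                break
            x = x - fx / fprime(x)
        return x
    return solver

@njit
def f(x):
    return x * x - 2.0

@njit
def fprime(x):
    return 2.0 * x

solve = make_newton_solver(f, fprime)
print(solve(1.0))  # converges to sqrt(2), about 1.41421356
```

Caching the returned solvers (for example, in a dictionary keyed by the objective function) avoids paying the compilation cost more than once per function.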
So, coming back to the two-layer idea: on one hand, I have some high-level API that I want to offer to my users. In my case, for example, I want the user to be able to specify some quantities in terms of meters, kilometers, whatever, which are these derived NumPy objects. And then there's some intermediate code that translates this into something that Numba can understand, and that calls the quick and dangerous algorithm, which doesn't have this nice API but runs really fast. This strategy is working OK for me.

So, some more tips and tricks. When you are writing some non-trivial algorithm with Numba, at the beginning it can be a little bit difficult to get it to work, because you might hit some Python feature that is not supported, or some NumPy function that is not supported, or you made a mistake with the dtypes of the arrays, and that can make your code fail to compile in nopython mode. My advice is that if anything fails, you can export the environment variable NUMBA_DISABLE_JIT=1, and then you can keep the code the same, with the @njit decorators and everything, but Numba is not actually compiling anything. That way you can try to debug and see if there's an error, et cetera. The next thing I do is split the function into smaller chunks, to see which part is the one making Numba fail to compile, because the error messages it gives are not always super useful; they're improving, but they're not always very informative. And then, as a last step, sometimes you have to rewrite some things. For example, if you have something that is appending to a list (I think that kind of works now in some restricted cases, but it didn't use to), then you have to find a way to pre-allocate a NumPy array to store the results, that kind of thing. So in the end you sometimes have to do some rewrites, minimal ones hopefully. And you have to keep an eye on these two pages of the documentation: which Python features are supported and which NumPy functions are supported.

Things that I didn't cover in this talk but that I highly recommend you check out: you can also do ahead-of-time compilation. It works in a slightly different way, where you produce something that you can call later, but what you produce still depends on the Python runtime; so it's not a way to export a binary that you can put on some embedded computer or something like that, unluckily. They also have support for GPUs, both CUDA and AMD, which I think is very interesting, and they have some decorators to write stencil functions and that kind of thing, which you can use to manipulate regions of arrays. And then there's some support for doing multithreading without holding the global interpreter lock. So there are more interesting features that you can use to accelerate your Python code even further.

So, conclusions: Numba is awesome when you make it work, which is not always a pleasant experience at the beginning. It still requires a bit of code rewriting, but the code ends up being mostly Pythonic Python, which is my main selling point for Numba compared to, for example, Cython. For non-numerical code, unfortunately, you will probably have to find something else, because Numba is geared towards computations with arrays and so on. So that's all I wanted to say. I always finish my talks with pictures of rockets. The slides are online on GitHub.
You can follow me on Twitter or send me an email whenever you want. I will be here for the rest of the weekend, so I will be happy to take questions. Thank you very much.

Okay, so we have a few minutes for questions. One here.

Hi there, thanks very much for your talk, it was really interesting. I have a question which comprises two subparts, which is really then two questions. The first part is: could you explain in layman's terms exactly how the acceleration occurs? And the second part is: does the acceleration follow a linear pattern? If you have a more complex function, do you still get the same level of acceleration, or does it decrease? And actually a third part: I know you mentioned that you didn't cover GPUs and the like, but how would you see that being used with regard to machine learning and libraries like TensorFlow, or things of a similar ilk?

Okay, I think you will have to repeat some of those, sorry. So the first part of the question was, in layman's terms, how does the acceleration work, right? When you apply this @jit to some function, the resulting function that you get, this plate displacement one for example, has some inspection methods... ah, sorry, I have to call it once first. Thank you. So now, once you call the function, it has a bunch of LLVM code attached to it for the input types that you used. It's generating this LLVM code on the fly, and then, with the LLVM compiler toolchain, it compiles that into some binary that depends on your platform; in my case, an Intel processor, 64 bits, et cetera. That's all happening on the fly. So Numba takes care of transforming this numerical Python code into LLVM, and then LLVM takes care of the rest. That's more or less how it works. And then the second part of the question was?

If anybody has any more questions, I'm sure he'll take them afterwards on the way out. So everybody, let's please thank Juan.