Hi, thanks for coming. So I will talk about understanding Numba, the Python and NumPy compiler. And I want to be clear at the start: I don't really understand Numba. I'm a gamma-ray astronomer from Heidelberg, not a Numba, compiler, or CPU expert. I recently started using it, I think it's awesome, and I wanted to introduce it to you. So let me talk a little bit about how and why we use Numba. In gamma-ray astronomy, we have telescopes like the H.E.S.S. telescopes in Namibia, or the Cherenkov Telescope Array in Chile. And there we have to do lots of numerical computing for data calibration, reduction, and analysis. We need both interactive data and method exploration, and then production pipelines. And the software is often written by astronomers, not professional programmers. Traditionally the approach, for example for H.E.S.S. over the past decade, was to write everything in C++, also using the ROOT software from CERN, and then maybe have a little bit of Python scripting on top. And now it's different. Now we're trying to build everything in Python and NumPy, and then, when needed for performance, we write a little bit of Numba, Cython, or C/C++. So it's a different approach. For the CTA software, we're prototyping it using this Python-first approach. And you see the stack: it's based on Python, and then the PyData ecosystem, so you have NumPy, SciPy, and so on. In astronomy we also have Astropy, which is the base library, which contains the standard data formats, time handling, coordinate handling, and so on for astronomers. And then on top of this, we can implement the gamma-ray analysis software. And this approach is not really unique to what we do in our project. It's kind of become the standard in astronomy. As you can see in this graph here, Python is really now the most popular language to use in astronomy.
And as Perry Greenfield points out, one reason is that Python is a language that is very powerful for professional developers, but it is also accessible to astronomers. And this makes for a good mix. And I have two other quotes from Jake VanderPlas, from his keynote in 2017, in the same vein, saying: for scientific data exploration, speed of development is primary and speed of execution is secondary. And also: in Python we have these libraries for nearly everything, and Python is the glue that combines the scientific codes together. So we have these thousands of Python packages that implement all kinds of methods. Why do we need Numba, actually? I mean, some algorithms are hard to write in Python and NumPy. Here's an example of a function computing one step in the Game of Life. I don't know if you're aware of this, but you have this update rule, whether cells live or die. And you can write it in NumPy, but then you have to write this double numpy.roll call, and the code is really hard to write, hard to read, and hard to maintain. So this would be a case where just writing a for loop is simpler; but then if you write it in pure Python, it's inefficient, and writing it in C and wrapping it from Python can be tedious. So this would be a good case where you would go to Numba. And there's also a quote from Stan Seibert from the Numba team, saying: don't write such NumPy haikus; if Python for loops are simpler, then write loops and use Numba. Okay, so let me introduce Numba. What is it? I mean, to find out more, go to the webpage, numba.pydata.org. The tagline is "Numba makes Python code fast", and it's an open-source JIT compiler that translates a subset of Python and NumPy code into fast machine code. And actually later in the talk, I'll try to unwrap that statement and explain what it means. But first, let's see a little bit. The name Numba comes from the combination of NumPy and mamba.
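As a sketch of the kind of NumPy "haiku" the speaker means, here is one Game of Life step built from nested numpy.roll calls. This is my own minimal version, assuming a toroidal (wrap-around) grid; the exact code on the slide may differ.

```python
import numpy as np

def life_step(grid):
    """One Game of Life step via NumPy array expressions (toroidal grid)."""
    alive = grid.astype(bool)
    # Count the 8 neighbours of every cell with two nested np.roll calls:
    # shifting the whole grid in each direction and summing the shifts.
    neighbours = sum(
        np.roll(np.roll(grid, i, axis=0), j, axis=1)
        for i in (-1, 0, 1)
        for j in (-1, 0, 1)
        if (i, j) != (0, 0)
    )
    # Conway's rules: a cell is alive next step if it has exactly 3
    # neighbours, or 2 neighbours and is currently alive.
    return ((neighbours == 3) | (alive & (neighbours == 2))).astype(int)
```

The double roll works, but as the talk says, it is hard to read compared to the plain nested-loop version that Numba would happily compile.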
So you do number crunching in Python, and it's then fast like a mamba. Mambas are apparently some of the fastest snakes in the world; this is where the name comes from. And then how do you use it? Let's assume you have some function doing math, like here, using this Monte Carlo method to compute the number pi, where you just throw random points in the unit square and then check which ones have x squared plus y squared less than one, so which ones are in the circle, and you count, and that's one way to compute pi. And if you do this in pure Python, it's extremely slow for a million samples, because what Python does is that every number is represented by a PyObject, and to do the actual computation, in something like x squared or the less-than-one comparison, Python always takes the PyObject, figures out that it's holding an integer, looks at the value inside, does the operation and the computation, and then creates another PyObject. So you have all of this overhead that makes math in pure Python really slow. To speed it up with Numba is very easy: you just import numba and add the numba.jit decorator to your function, and then in this case it's 30 times faster. So this is how you tell Numba to JIT your function. Numba also understands NumPy, so in this case I have two arrays x and y, which are large NumPy arrays, and I do the same computation by passing them into a Monte Carlo pi function that uses a NumPy array expression. And this is faster than pure Python, but it also uses much more memory. But again you can add the numba.jit decorator and it will run faster, because what Numba can do is avoid these temporary array copies. When NumPy sees x squared, it creates a new NumPy array which is x squared, and creating these new arrays, allocating them, and processing them is overhead which can be avoided if you compile down to machine code, and that's why Numba in this case is faster.
If you want, you can also write the same function using a for loop, so it's really up to you. In this case writing a NumPy array expression is more convenient, but sometimes writing a for loop is more convenient, and Numba will compile your code either way into optimized machine code. So kind of the evolution of a scientific programmer coming to Python is that maybe you start out with C or Fortran and you write loops all the time to do your data processing; then you come to Python and you figure out that loops are 100 times slower, and you learn that you have to write everything as NumPy array expressions; but actually, now that we have Numba, writing for loops is okay again. So it's a bit of a regression. What are the limitations of Numba? Numba compiles individual functions. It does not compile whole programs like PyPy. Numba only supports a subset of Python. It has some support for dictionaries, lists, and sets, but you cannot have mixed types of keys and values, and not all of the operations are supported. Some Python data structures just cannot be efficiently processed and translated to machine code. And you see an example of this on the left side, where I have a list with mixed data types, a string and an integer, and Numba is not able to compile this to efficient machine code; what you get is a TypingError if you try this. Also, Numba only supports a subset of NumPy. I mean, this is ever-growing, but not all functions, and not all arguments to those functions, are available, so you have to check. And also, Numba does not support arbitrary Python code. You cannot use pandas or scikit-learn or the requests library or these kinds of things from your jitted function, because Numba will not know how to translate that into machine code. It's really focused on math and numerics code. So, a little bit more about these two JIT modes, object mode and nopython mode.
If you add the numba.jit decorator to some function which cannot be efficiently compiled to machine code, what will happen is that you get a NumbaWarning saying the compilation is falling back to object mode, but it will still run and give you the result. And what's happening under the hood is that Numba translated this function into something that's equivalent to what the CPython interpreter would execute, where you still have PyObjects and Python C API calls, and you get the same performance as if you hadn't used Numba at all. So 99% of the time this is not what you want, and to make it more obvious when you write functions which cannot be compiled efficiently, you can pass nopython=True, and then you will directly get a TypingError saying it failed to compile in nopython mode. And there's a shortcut, because typing numba.jit(nopython=True) is long: you can also type numba.njit. If you actually do have a function where you need to go back and interact with the Python interpreter and Python objects, you can, and for this you use the numba.objmode context manager. This is rarely needed, but it can be useful if you have a long-running function and you want to log the progress or update a Python progress bar, which is a Python object; then you can say "with numba.objmode():" and inside this with statement you can again interact with your Python session. Okay, so now let's come to the part where we try to understand what Numba does, at least a little bit. So the description kind of is: Numba is a type-specializing JIT compiler for Python bytecode using LLVM, and this might make your head explode, but let's try to unwrap that a little bit. There are really three parts involved here.
What's going on when you write a function, you JIT it, and you call it? The first part is Python itself, then you have Numba, and then you have LLVM in the background, and I'll explain on the next slides how these interact. So what the Python compiler does is: it starts with your source code, parses it into an abstract syntax tree, and then transforms it to bytecode, and this happens on import of a module. When the Python interpreter sees the def statement, it creates this function object and attaches the bytecode to the function object, which you can see here on the left. And then what Numba can do is start with this bytecode and compile and transform it to machine code, and that's what the numba.jit decorator does. Actually, when the numba.jit decorator is called, it does very little. All it does is create a CPUDispatcher proxy object, because it cannot compile the function yet: it doesn't know what types come in and what kind of machine code it should generate. So only when the function is called will it JIT-compile the bytecode to LLVM IR, exactly for those input types, and then it will also manage the LLVM compilation and execute the compiled function for you. So what is LLVM? LLVM is a compiler infrastructure project. There are many frontends for languages like C, C++, Fortran, Haskell, Rust, Julia, Swift, and so on. There are also many backends for all kinds of hardware, so all the different CPU types; the vendors have added support for it and optimized it well. And you could consider Numba to be just the Python frontend for LLVM. The way this works is that LLVM is shipped as a Python package called llvmlite that Numba depends on, and this is maintained by the Numba team at Anaconda, which then ships Numba and llvmlite so that they're readily available. Concerning alternatives: I mean, at least for us, the most obvious one, and what people also use, is Cython, and like Numba, Cython is often used to speed up numeric Python code.
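You can look at the bytecode attached to a function object yourself with the standard-library dis module; this is a small illustration of my own, not from the slides.

```python
import dis

def add(a, b):
    return a + b

# The compiled bytecode lives on the function object, created at def time:
print(add.__code__.co_code)   # the raw bytecode as a bytes object

# Human-readable disassembly; the exact opcode names (e.g. BINARY_ADD vs.
# BINARY_OP) vary between CPython versions.
dis.dis(add)
```

This is exactly the representation Numba starts from when it builds its own intermediate representation and, eventually, LLVM IR.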
Cython is an ahead-of-time compiler where you have to type-annotate your Python code, and then it compiles to C, and then you use a C compiler to compile this to machine code. It's more widely used at this point, and it's easier to debug, because generated C code is easier to debug than LLVM IR, which is lower level, looks more like assembly, and which probably very few Python people can read and debug. And also, Cython is very good at interfacing with C and C++ code. Numba, on the other hand, is easier to use: you don't have to add type annotations and you don't need a C compiler, but it is harder to debug. Another advantage of Numba is that it optimizes just in time for your CPU and GPU, so you don't need to build and distribute binaries for many architectures. Instead it will always use all of the CPU features you have, like advanced processor instructions and so on. Other Numba alternatives: I mean, Cython is great, and there are many other great tools that exist for high-performance computing in Python. You have Cython, C, C++, and pybind11 if you want to go the Python C extension way. There's PyPy, which is an alternative to CPython that compiles your whole program. And then you have all of these modern things like TensorFlow, JAX, PyTorch, Dask, and so on, that, similarly to Numba, also use Python and NumPy mainly as the language to specify what kind of computation should happen, but then they will do some kind of compilation and execute it in various ways. And I don't have time, and also don't have the expertise, to explain all of the differences, how JAX and TensorFlow and so on compare to Numba and how exactly they compile, but you have to make a choice there, and it's not easy, because there are many great libraries now available in Python. Okay, so some more things about Numba.
One thing you should know about is numba -s, or numba --sysinfo, from the command line. You can also use it from IPython and Jupyter: if you just put an exclamation mark in front, you can execute shell commands, so you can also get the info from there. And as you can see here on the left, this gives you all of the relevant information about the hardware on your computer: what CPU and GPU you have; which Python version, Numba version, and LLVM version you're using; whether you have the Intel short vector math library (SVML) installed, which I'll talk more about later; whether you have the Intel Threading Building Blocks (TBB) library installed, which will also make Numba faster in some cases; and then which GPUs you have available and which GPU drivers you have installed. One thing you can do, if you have a multi-core CPU and want to make your computation run faster, is to add the parallel=True option to the numba.jit call, and this will then do multi-threading using one of the backends: either OpenMP, Threading Building Blocks, or a custom one. If you want to use TBB, you have to conda install tbb in addition. And then, as shown in this example, it will work automatically for NumPy array expressions, with no code changes needed. In this case I got a 3.2x speedup on my four-core CPU for this computation by parallelizing it. If you have for loops inside your function, you should use the numba.prange generator, and then the same thing will happen: Numba will parallelize this loop, and you will get the speedup from using multiple cores, or also vector instructions. And what I show here on the lower right is just that, as always with Python decorators, you can use them in two ways: you can either put them with the @ sign on top of your function, and the decorator will be automatically applied, or you can just write your function without the decorator and then apply the decorator afterwards, by passing the function into the decorator.
So you can say numba.jit and pass in the compute function, or you can say numba.jit(parallel=True), which is a decorator, and pass the compute function into that. And this can be convenient if you want to try out different options to compile the same function, so you don't always have to copy and paste the code of the function. Another option you have is fastmath=True, and there you can trade accuracy for speed for some computations. So there is this IEEE floating-point standard that, in this case for example, requires that the loop must accumulate these numbers in order, to get precisely defined results; but if you're willing to give this up and accept slightly different accuracy, then the computation can go faster, because the compiler can vectorize this reduction. There's another way you can speed up math functions: if you have code that uses square root, sine, exponential, and so on, then you can install the icc_rt package, and this will make the Intel short vector math library (SVML) functions available, and Numba will tell LLVM to use those; these are just much faster implementations of these math functions. So that's how you get fast math. You might ask: how fast is Numba? I would say Numba gives very good performance, and there are many options to tweak the computation, but there is no simple answer to this question, and there is also no simple answer to the question of how Numba compares to Python, Cython, NumPy, C, Fortran, and so on. You can find many blog posts on the internet, and they all kind of have different outcomes, depending on what application they benchmark, what compiler flags they use, and what hardware they have. They get different speeds for different tools, and we've already seen this for Numba on the previous slides: the speedup is not always the same.
So what you have to do, if you care about performance, is start by defining a benchmark for your application that you really care about, where you have a performance bottleneck, and then you measure, and then you try to improve it. Another thing I wanted to introduce are NumPy ufuncs, or universal functions. These are functions like add, sin, and so on from NumPy itself, and they all support array broadcasting, which you can see here on the left side. So you can not just multiply two numbers; you can also multiply a number with an array, and it will broadcast these inputs and generate an output array, and there's a more complex example here. And then they also have these special methods attached, like accumulate, which applies the function one element at a time and accumulates, giving a different output array. And so far, if you wanted to make one of these ufuncs yourself, you had to write C code and use the NumPy C API, which was pretty hard; and now, with Numba, it's really easy. You use the numba.vectorize decorator and it will make the ufunc for you. The way you write your function is that you don't put a for loop: you just write the operation for one element, and then kind of the implicit for loop, to loop over arrays and do broadcasting and so on, will be generated by Numba for you. And there are two ways to do it. You can give the type signature in the vectorize call, like I do here: I say please make a ufunc assuming the inputs are integer numbers, and then Numba will generate one ufunc at the vectorize call. If you don't give a signature, then again such a dispatcher object is created, and then dynamically, when you call the function, Numba will generate ufuncs for the input types you pass. So it will generate a different one if you pass a float input array or an integer input array. Okay, so I'm almost done already. So Numba is really a family of compilers.
I've talked about two, numba.jit for regular functions and numba.vectorize for ufuncs, and just given a quick introduction to those; if you want to learn more, check out the Numba documentation. There is also the guvectorize decorator, which can make generalized ufuncs. There is stencil for neighborhood computations, so if you want to do some kind of convolution or sliding-window computation, it's easy with the stencil decorator. There is cfunc, which can generate functions with a well-defined C callback ABI. So this could be useful, for example, if you want to call a Numba function that you write in Python from C or C++ code; so if all of your application runs in C++, but then you want to extend it with Python, this would be one way to do it. And then there are the cuda.jit and roc.jit decorators to work with GPUs also. So just as a last point: who uses Numba? Jake VanderPlas, again, wrote a few blog posts in 2013, when the project was still very young, saying "I'm becoming more and more convinced that Numba is the future of fast scientific computing in Python." This has not really happened so far. And then in 2018, Matthew Rocklin wrote an article advocating that the numeric Python community should consider adopting Numba more. So I think currently many people and applications use Numba for their work, in projects like ours in gamma-ray astronomy, and many others do as well. But then the large libraries, like NumPy, SciPy, pandas, and scikit-learn, have not adopted Numba yet. And there are some nice examples now of libraries or packages using Numba: for example Datashader, which is shown on the top left, a large-data visualization library, implements its computations with Numba. There's librosa, for digital signal processing and audio and music analysis, which is implemented using Numba. And Intel has written HPAT, the High Performance Analytics Toolkit, which can do big data processing and supports pandas.
So they've done something very similar to what Numba did: they've defined a decorator, hpat.jit, which can take a function that does I/O and then NumPy or pandas computations and so on, and it will parallelize this, also to clusters, using MPI. So my summary and conclusions are that Numba is a type-specializing JIT compiler for Python bytecode to LLVM IR. The project started in 2012. The current version is 0.44, and I think they're well underway to version 1.0. It will use your CPU and GPU well, and it's really easy to use: you just have to add this Python decorator. So use numba.jit for normal functions, numba.vectorize to make your ufuncs, and use numba -s to check your machine and installation if you find you don't get good performance or it's not working. And then, if you want to use a multi-core CPU and get fast results, consider adding parallel=True and fastmath=True, and also installing this SVML package. And, yeah, I'm not a Numba developer; I didn't do anything to make this work. I just wanted to thank the Numba developers at Anaconda, and also the other people and companies that have contributed, like, for example, Intel. Very cool, thank you for the great talk. We have five minutes for questions. Again, we have microphones over there; if you have a question, just line up behind the microphone and ask away. Hi, thank you for the talk. How would you recommend packaging a package that uses Numba? Would it depend on the LLVM chain on the target machines where you install it? No, basically you don't have to do anything. You have a pure Python package, and you just put numba into your setup.py or into your dependencies, and that's all you do. And then pip or conda will handle it automatically; I mean, Numba and llvmlite ship with the Anaconda base installation, but you can also pip or conda install them.
So if you put numba in your dependencies, when someone installs your package they will always get it, and you just kind of have a pure Python package; you don't have to build wheels or binary distributions or do complex things. It's really easy. Thank you. Yeah, that was really awesome. I just wanted to know: what do you think are some of the reasons that the adoption of Numba is probably a bit slower than some may have expected? I mean, I guess it took longer to reach 1.0 than they thought, and projects wait for 1.0 and stability. I mean, Numba has been amazingly stable for the past five years, I think, but it's still constantly improving. And it is a bit scary: if, for example, NumPy itself or SciPy adopted Numba, they could get rid of a lot of code and do things much more easily, like implement all of their ufuncs and this kind of stuff; but then it would be all in, like they would depend on llvmlite and Numba, and this would have to run and be stable for the next decade on all the supported platforms. So I know, for example, that on the NumPy mailing list, about three years ago, there was a discussion whether to take Numba as a dependency for NumPy, and at the time people said no, mainly because, for example, ARM processor support was not there yet. And in the past years the Numba team has added this and is testing on a really large range of hardware, and it's getting better and better, and I think it will happen. I don't know how fast it will happen, but I think projects will start to adopt Numba more and more. Cool, thank you. Let's go to this side and then go back to that side. I have a question about the other case, the use case with C++ code: so could I use a Numba JIT-compiled Python function to call my C++ library? Is there some support for that? I have to admit I don't know. This is something where Cython really shines: you can call any C and C++ code. I have not done this myself.
I think there are ways to call into a C library, but I have to pass on that; check the documentation, please. Thanks. I have a question. First of all, thank you for your talk. I started using Numba a bit, and I have also discovered PyPy, and I was wondering: say you have multiple functions, one after the other. What is the overhead if you put a @numba.jit before each function, versus compiling or running the whole code with PyPy? Is there a difference? I'm not sure; I think it depends. I mean, for the cases where we are using Numba, this overhead of JIT-compiling the function does not matter at all. These milliseconds that I spend once to compile my functions I really don't care about, even if it took a second, because then my analysis runs for an hour and it's really fast. If you have thousands of little functions that you need to JIT-compile, and very short-running processes, then this might become relevant, and I don't know how PyPy compares to Numba in terms of JIT speed. So maybe what you're saying is that it's not at the initialization of your module that the compilation is happening; it's during the execution. Yes, so you have the overhead of starting up LLVM and doing the compilation. I mean, if you look at this slide, basically this is what's shown here: you see this chain of things that happen, and some things are done by Numba and some things are done by LLVM, but overall this is very fast for a given function. If you have thousands of functions, and they're really long and complex, and you use high optimization options for LLVM, it could take seconds, or minutes in extreme cases, I don't know; but I was never in a case where this matters. Oh, thank you. Thank you very much; we are running out of time. You can chat with the speaker during lunch at the conference. Thank you very much; let's thank all the speakers.