to add the Numba decorator on it, and it only optimizes that function, or those functions if you have many of them. The nice thing with that is that it allows us to relax the semantics: we are not bound to Python's usual semantics. We can cheat a bit, and we actually cheat quite a bit, in order to optimize your code. But the high-level code around that, all your classes and metaclasses and so on, can still use all kinds of complicated things that Numba doesn't support, and it's not a problem: since they are executed in the regular Python environment, Numba doesn't care.

So, as I said, it's specialized. Right now it's specialized for number crunching; I would say it's tailored for NumPy arrays, since the NumPy array is the dominant data type in scientific computing. It has a lot of features, and we try to support them, plus a bunch of other things. We are slowly trying to extend the range of things we support, but right now it's specialized for number crunching.

The main target is the CPU. We officially support x86 and x86-64. In principle, LLVM provides us with support for many other architectures. We also have a target for NVIDIA GPUs using CUDA, which means you write Python code and you can execute it on the GPU. But we have a limited feature set there, because there are some limitations on what you can do on a GPU, of course. There isn't a real runtime: you will be able to do memory allocation, but it will be quite slow, and the allocated memory will live in the GPU's global memory, which is not very fast. We also have potential support for other architectures thanks to LLVM. One of my colleagues tried Numba on the Raspberry Pi, and it actually works, but we don't support it officially; I think LLVM takes several hours to compile there. And we have some support in progress for HSA, which is something by AMD; it means Heterogeneous System Architecture. It's an architecture for what they call APUs. The goal is to blend the programming model between GPUs and CPUs: you write one implementation, and it can run simultaneously on the GPU and the CPU, or on either of them. And supposedly there's some memory sharing and so on.

Let's talk a bit about the architecture. Numba, if you compare it to other JITs, is quite straightforward; it's not very exciting. It works one function at a time, which is a constraint we're going to relax, because we need to relax it in order to support recursion. But right now, it's one function at a time. It starts from the Python bytecode, so we don't have a parser; we just use the bytecode emitted by CPython. And we have a compilation and analysis chain which transforms it, step by step, into LLVM IR. LLVM IR is LLVM's internal representation: a kind of portable assembly, let's say. It allows you to specify a lot of things; one difference you see, for example, is that you can specify some behaviors in a very granular way, such as whether signed integer overflow is well-defined or undefined. If you declare signed integer overflow to be undefined behavior, that allows LLVM to do further optimizations. After the LLVM IR is shipped to LLVM, everything is delegated to LLVM itself, including low-level optimizations and executing the function.
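To make the pipeline concrete, here is a minimal sketch (the function and values are mine, not from the talk) using Numba's introspection helpers; assuming a reasonably recent Numba, `inspect_types()` prints the type-annotated Numba IR, and `inspect_llvm()` returns the LLVM IR that was handed to the LLVM JIT:

```python
import numba

@numba.jit
def add(a, b):
    return a + b

add(1, 2.5)  # the first call triggers type inference and compilation

# Numba IR annotated with the inferred types (int64, float64, ...)
add.inspect_types()

# LLVM IR, one entry per compiled signature
for signature, llvm_ir in add.inspect_llvm().items():
    print(signature)
    print(llvm_ir[:500])  # the IR is long; show only the beginning
```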
And on top of that, we also generate some Python-facing wrappers, because each function gets a low-level implementation which takes native types, and you have to marshal those from and to Python objects.

So this is the compilation pipeline. You can see there are two entry points, the wavy arrows. The first entry point is the Python bytecode itself, as I said. We have an analysis chain starting from the bytecode: first the bytecode is analyzed, we build a control flow graph and a data flow graph, and we produce something called Numba IR. So Numba has its own intermediate representation, which is about as high-level as bytecode, but a bit different: it's not a stack machine, it's based on values. The second entry point is when a function is actually called. When a function is actually called, we record the types of the argument values, and we do type inference on those values: we try to propagate the types across the whole function. I'm going to talk about Numba types just after; it's much more complicated than just mapping some classes to some types, because we have more granular typing in Numba than in Python. After the type inference pass, there's a pass which rewrites the IR; it's an optional pass which does some optimizations. The next pass is lowering. Lowering is LLVM jargon: it means that you take a high-level language, which here is Numba's IR, and you lower it to something very low-level, which in this case is LLVM IR. And then we ship everything to the LLVM JIT, which produces machine code, and we execute it. There is a small rectangle named "cache" which is grayed out because it's not implemented yet; ideally, we will be able to cache either the machine code or the LLVM IR in order to have faster compilation times.

So, Numba types. As I said, the Numba type system is more granular and more precise than the Python type system. We have several integer types, depending on their bitness and signedness. We have single precision and double precision floating point types. Tuples are typed, which means that you don't have a single tuple type: tuples are typed based on each element's Numba type, so you have a different tuple type, for example, for a pair of int64 and float64, for a pair of float64 and float32, and so on. NumPy arrays themselves, which are a very important part of Numba and of scientific computing, are typed according to their dimensionality and to their contiguity.
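As a hedged illustration of that granularity (my own examples, not from the talk), `numba.typeof` shows which Numba type a given Python value would get:

```python
import numpy as np
from numba import typeof

# Sized, signedness-aware numeric types
print(typeof(1))                      # int64 (on a 64-bit platform)
print(typeof(1.0))                    # float64

# Tuples are typed element by element: these two have different types
print(typeof((1, 2.0)))               # (int64, float64)
print(typeof((2.0, np.float32(1))))   # (float64, float32)

# Arrays are typed by dtype, dimensionality and contiguity
print(typeof(np.zeros((3, 4))))       # array(float64, 2d, C)  -- C-contiguous
print(typeof(np.zeros((3, 4)).T))     # array(float64, 2d, F)  -- Fortran order
print(typeof(np.zeros((3, 4))[::2]))  # array(float64, 2d, A)  -- non-contiguous
```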
The lowering pass is what really takes the type-inferred Numba IR and transforms it into LLVM IR. This is a rather straightforward and not very exciting part, but it contains a lot of code, because we implement a lot of functions: all the operators, math functions, and so on. If we are careful enough with what we generate, we can allow LLVM to inline and do further optimizations here.

So what's supported? Numba supports a rather small subset of Python. On the syntax front, it supports quite a bit, but not all. It supports all control flow constructs. It supports raising exceptions, but not catching them. It supports calling other compiled functions. We have recent support for generators, but only the simple kind of generators: not those to which you can send values, not coroutines, just plain iterators written with the yield keyword.

So what don't we support? Well, all the rest. We don't support exception-catching code. We don't support context managers. We don't support comprehensions. And actually, we don't support lists and sets and dicts yet, although that will certainly come. And we don't support yield from.

As for the built-in types and functions, we have support for most types which are useful for scientific computing: all the numeric types, integers, floats, and so on, plus tuples and None, which are quite basic. And we have support for the buffer protocol, which means that you can index over bytes, bytearrays, memoryviews, and anything which supports the buffer protocol, which also includes, for example, memory-mapped files using the mmap module. We have support for a bunch of built-in functions. And we have support for most operators, but of course only on the types that we support, so all the numeric types. We are able to optimize several of the standard library modules, mostly those which are specialized for numeric computing, so cmath and math, of course. We have support for random number generation; we actually use the same algorithm as CPython, the Mersenne Twister, except that we keep a separate state. We have support for ctypes, which means you can call raw C functions from Numba code, which is a cheap way of calling into C libraries, and it generates very fast code because the call is made from a native context. Similarly, we support CFFI, which is more or less a replacement for ctypes.

And we support NumPy, or at least a large subset of NumPy. What we support in NumPy is really the subject of a whole page in the documentation, so I'm only talking a bit about it here. We support most kinds of arrays in NumPy, most dimensionalities from 0D to ND. We support arrays of various dtypes: scalars, numbers and so on, structured arrays, and arrays with sub-arrays in them. The only thing we don't support, and won't support for a long time I think, is arrays containing Python objects, because the whole point of Numba is to generate native code which doesn't go through the CPython API. We have recently added support for constructors, so you can allocate memory from within Numba functions. We support various operations on arrays, such as iterating, indexing, and slicing; there are various kinds of iterators we support, such as the .flat attribute and more or less fancy ones. We have support for reductions: sums, products, cumulative sums, and so on. On the scalar types front, we support datetime64 and timedelta64, which are weird and, I think, little-known types which allow you to do low-level computations on dates and time deltas. And we support numpy.random in the same way that we support the random module.

So, the limitations, apart from what we don't support in terms of syntax and in terms of types. We don't support recursion; that's because we compile one function at a time, and we'll have to change that. We can't compile classes; again, that's because we compile one function at a time, so we don't have a way of specifying a structure and several methods operating on a user-defined type. And the other limitation is that type inference really has to succeed: if the type inference pass fails to infer a type for a given variable, then the whole compilation fails. Ideally, we would have a way to say: well, this one is a Python object, but the rest is still inferred, so we would be able to bridge the two; but right now, this is not possible. Actually, when type inference fails, Numba falls back into something called object mode, which is not very interesting as far as performance is concerned.
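A small sketch of what that constraint looks like in practice (the functions are my own, hedged examples): with `nopython=True` you ask Numba to refuse the object-mode fallback, so a value whose type cannot be inferred makes compilation fail with a typing error at the first call:

```python
import numpy as np
from numba import jit

@jit(nopython=True)  # require full type inference, no object-mode fallback
def dot(xs, ys, n):
    acc = 0.0
    for i in range(n):
        acc += xs[i] * ys[i]
    return acc

@jit(nopython=True)
def uses_dict(n):
    d = dict()  # dicts have no Numba type, so inference cannot succeed
    d[0] = n
    return d[0]

a = np.arange(5.0)
print(dot(a, a, 5))        # compiles and runs natively

try:
    uses_dict(3)
except Exception as exc:   # Numba raises a TypingError here
    print(type(exc).__name__)
```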
So, as I said, the fact that it's opt-in allows us to relax the semantics. As you have perhaps understood, Numba has fixed-size integers, up to 64 bits. So, for example, if you have an addition of two integers and the result overflows, you just see a truncated result; you don't get an overflow error or anything. We also take the liberty of freezing global and outer variables: we consider them constants, which makes the code much easier to compile and allows us to generate more optimized code. For example, if you use math.pi, then usually math.pi won't change, so it's only fair to consider it a constant. But of course, if your module has a global variable whose value changes, you won't see the change inside your compiled function; it will keep the old value. And we don't have any frame introspection; basically, we don't have any debugging features right now, neither at the C level nor at the Python level. This is something we're going to work on, at least at the C level, because we want to expose the names of the JIT-compiled functions to LLVM so that you can fire up GDB and get a nice traceback.

So, how to use it? The main way to use it is the jit decorator. It's very simple: you have a function, you just put the decorator on it, and hopefully Numba will be able to compile it. The default is not to pass any argument to the jit decorator, and then the function is compiled lazily. This means Numba waits for the function to be called, does the type inference at that point, and generates the native code; and since you're calling the function, it then calls the native code on the fly. There's another way to use it, which is to manually specialize the arguments. Let's say you know you really have some 32-bit ints, and you want some double precision floats, or some single precision floats: you can pass an explicit signature to numba.jit. But this is not really recommended; it's mostly there for us, for testing.

There's also an option to release the GIL, which is quite easy for us, since we are not calling any CPython API from the generated native code. You just pass nogil=True, and the GIL will be released. The GIL is the global interpreter lock; for those who don't know, it's a lock which constrains CPython execution to a single thread. If you release the GIL, you can call your function or functions from several threads and get parallel execution on several cores. But of course, you have no protection from race conditions, so you are in the same position as a C or C++ programmer, who has to be careful about not having several threads accessing the same data and mutating it, for example. As a tip, instead of writing your own thread pool, just use concurrent.futures on Python 3.
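Here is a hedged sketch of those three flavors (the functions and numbers are mine): lazy compilation, an explicit signature, and `nogil=True` combined with concurrent.futures:

```python
import math
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from numba import jit

@jit                                 # lazy: types are inferred at the first call
def square(x):
    return x * x

@jit("float64(float64, float64)")    # eager: explicit signature, compiled right away
def hypot(a, b):
    return math.sqrt(a * a + b * b)

@jit(nogil=True)                     # the compiled code releases the GIL
def partial_sum(arr, start, stop):
    total = 0.0
    for i in range(start, stop):
        total += math.sqrt(arr[i])
    return total

arr = np.random.rand(1000000)
mid = arr.size // 2
partial_sum(arr, 0, 1)               # warm up: compile before spawning threads

# Two threads run the native loop in parallel on two cores
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(partial_sum, arr, 0, mid),
               pool.submit(partial_sum, arr, mid, arr.size)]
    print(sum(f.result() for f in futures))
```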
Another feature is the vectorize decorator. NumPy has something called a universal function, and to explain what a universal function is, it's better to take an example. If you take the plus operator between arrays, which is a shortcut for the np.add function, the np.add function is basically doing an element-wise operation on all elements of its inputs. And the way it's implemented is really to have an internal loop around the element-wise operation. The nice thing with a universal function is that you get several additional features. There's something called broadcasting in NumPy: if you are adding, for example, a scalar and an array, the scalar will actually be added to each element of the array. So really, the lower-dimensional argument is broadcast onto the higher-dimensional argument. This is handled automatically by the ufunc framework, and the inner loop doesn't have to care about it. And it also gives you, for free, some reduction methods, so you have reduce and accumulate functions.

NumPy comes with a fixed set of universal functions: addition, multiplication, square root, and so on. Traditionally, if you want to add a universal function, to write your own, you have to go to C: you write your inner loop in C with a specific C API provided by NumPy, you compile it against the right NumPy version, and you get your universal function. So it's not very convenient for users, and users don't do that. Using Numba, you can write the element-wise function in pure Python, put the vectorize decorator on it, and it will generate the ufunc.

Another, more sophisticated feature of NumPy is the generalized universal function. This is an extension of the idea of a universal function. A universal function works on one element at a time; it doesn't see the neighbors or the rest of your arrays. In a generalized universal function, you can see whole arrays, and you have to specify exactly what the layout of the inputs is. So it's meant for more sophisticated functions, such as a moving average. Numba also allows you to generate a generalized universal function, using the guvectorize decorator; see the sketch below.
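A hedged sketch of both decorators (my own examples; the moving-average pattern follows the shape described above). The signature strings and the layout declaration `"(n),()->(n)"` tell Numba the element types and the array shapes involved:

```python
import numpy as np
from numba import vectorize, guvectorize

# An element-wise function compiled into a real NumPy ufunc: broadcasting
# and the reduce/accumulate methods come for free from the ufunc machinery
@vectorize(["float64(float64, float64)"])
def scaled_add(a, b):
    return 2.0 * a + b

x = np.arange(5.0)
print(scaled_add(x, 10.0))      # the scalar 10.0 is broadcast over the array
print(scaled_add.accumulate(x)) # accumulation method provided by the ufunc

# A generalized ufunc sees whole subarrays: "(n),()->(n)" declares one
# 1-D input, one scalar input (passed as a one-element array) and one
# 1-D output allocated by the gufunc machinery
@guvectorize(["void(float64[:], intp[:], float64[:])"], "(n),()->(n)")
def move_mean(a, window, out):
    w = window[0]
    for i in range(a.shape[0]):
        lo = max(0, i - w + 1)
        acc = 0.0
        for j in range(lo, i + 1):
            acc += a[j]
        out[i] = acc / (i + 1 - lo)

print(move_mean(np.arange(10.0), 3))
```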
So here is an example. It's called an Ising model. It's something which is used, apparently, mainly for benchmarking, but it's inspired by a physics model. The basic idea is that you have a 2D grid, a two-dimensional grid of binary states: you can think of it as each element having a value of either plus one or minus one. It starts from a random state, basically, and at each iteration you update each element's value based on its neighbors. At the end, it's supposed to converge towards something which is quite stable. This animation was generated with Numba. If you look at how it's written, you have an inner function which processes each element in the array and updates it based on its neighbors' values. So there are a couple of operations: it takes its neighbors' values, combines them with the current value of the element, and takes a decision based on that and a random number. And the outer loop just loops over the whole array and updates all elements. The outer loop, which you see in the update-one-frame function, does one iteration; if you want to make the model converge, you have to call it a number of times. If you measure that, you get a 100x speedup for Numba over CPython, which is less than you get with Fortran, but still within range. In this case it's about twice as slow, and we actually know why: array indexing in Python is more sophisticated. The main reason is that Python allows negative array indexing: if you have a negative index, you are indexing from the end, so you need a runtime check on the sign of each index. And in some cases, LLVM isn't able to optimize it out.

So besides that, we have CUDA support, as I said. The main API for that is the cuda.jit decorator. We don't try to hide the CUDA programming model. The CUDA programming model is based on the notion of a grid of threads: you have blocks of threads, and you have a grid of blocks, and the GPU executes all those threads in parallel, more or less. But you have to tell the GPU what the topology of the threads is. Besides that, there are two types of functions. There are kernel functions, which are called from the CPU. A kernel function is not able to return a value: you pass it some input arrays and some output arrays, which are marshaled automatically by Numba to the GPU, and you write the results into the output arrays from the GPU. And there's something called device functions, which are really sub-functions: they are called from the GPU, on the GPU, and these ones can return values. When you're using the CUDA support in Numba, you have a limited set of features because, as I said, you don't have a large runtime available on the GPU. It also requires the programmer to have not only some knowledge of CUDA and how a GPU works, but also some intuition about how to optimize the code for execution on the GPU, because you don't usually arrange your algorithm in the same way on the GPU as on the CPU, except in trivial cases.

So here is an example; it's a very simple one, just to show you how it works. We are trying to compute the cosine of an array. We're using the cuda.jit decorator. We have a function which takes two arguments: the first argument is the input array, the second argument is the output array. There is no convention there; it's just a choice in this example. The idea is that each GPU thread will compute one value of the array: it will take one element of the input, compute the cosine, and put it in the output array. The first thing is to compute the index of the current thread, for which you call the cuda.grid function. And then you just call math.cos on the input element and write the result to the output. So this is the definition; then you want to call it. gpu_cos defines the GPU function, and then you have to instantiate the kernel, which means you define the grid topology. So this is the thread configuration here; it's a two-element tuple, where the first element is the number of blocks in the grid, I think, and the second number is the number of threads in each block. You define the topology based on the output size, and you call the gpu_cos function with the topology and the input and output. Well, in this example the numbers are better on the GPU, but that's not very important, because you wouldn't use a GPU just to compute a cosine; you would do something more complex.
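Here is a hedged reconstruction of that example (the names and sizes are mine; the structure follows the description above):

```python
import math
import numpy as np
from numba import cuda

@cuda.jit
def gpu_cos(inp, out):
    # cuda.grid(1) returns this thread's global index in a 1-D grid
    i = cuda.grid(1)
    if i < inp.shape[0]:      # guard: the grid may be larger than the array
        out[i] = math.cos(inp[i])

a = np.linspace(0.0, 2.0 * np.pi, 1000000)
out = np.empty_like(a)

# The thread configuration: (blocks per grid, threads per block)
threads_per_block = 256
blocks_per_grid = (a.size + threads_per_block - 1) // threads_per_block

# Instantiating the kernel with the topology, then calling it; the input
# and output arrays are copied to and from the GPU automatically
gpu_cos[blocks_per_grid, threads_per_block](a, out)
```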
So, if you want to install Numba, since it's open source, you can compile it from scratch if you want. But you have to compile LLVM, and a specific version of it, because LLVM has backwards-incompatible changes in each feature release. The current version of Numba requires LLVM 3.6, so you would have to fetch LLVM 3.6 and compile it for your platform, or get binary development packages if you can find them. And then you have to compile llvmlite with a sufficiently recent C++ compiler, which is not trivial at all. So we really recommend you use Conda, which is a package manager from Continuum. It's an open-source package manager, and it comes with a default distribution of binary packages called Anaconda. If you have Conda, you just type "conda install numba" and you have it on your platform.

So, let's wrap up. You can find documentation on the web. We have, of course, a GitHub account with the code and issue tracker. You are very welcome to come to the Numba users mailing list, either as a user or as a potential contributor. I must also mention that Numba is commercially supported by Continuum Analytics; so if you want to buy consulting, enhancements, or support for some architectures, you can write to sales@continuum.io. And there's one last thing called NumbaPro, which is a proprietary extension to Numba that provides bindings to some specialized GPU libraries, various specialized scientific libraries. And I think it also has extensions which make it easier to parallelize code on the CPU. So that's it.

So, two questions about your use of LLVM. First, it sounded like you support only a subset of all the platforms that LLVM supports. Why don't you just have the same support requirements and platform list as LLVM?

What did you say? We support a subset of what?

Do you support everything that LLVM supports, or only a couple of platforms?

You mean as architectures? It's a matter of validation, because ideally it works, but who knows what it will actually give, you know?

Okay. And I was also wondering about the attempt a couple of years ago to marry CPython and LLVM together, called Unladen Swallow. Nothing ultimately came of it and Unladen Swallow died, but I was wondering if the work they had done was helpful at all in the development of Numba.

I don't think so; well, not directly. At the time they said that they had helped LLVM improve its support for JIT compilers, so perhaps we benefited indirectly, but we didn't take anything from them, because we use our own wrapper around LLVM called llvmlite, and Numba is pure Python. The big difference with Unladen Swallow is that Unladen Swallow did everything in C++, which I think is necessary if you want to compile very fast, but it's also much less flexible. Pure Python allows us to experiment and develop very quickly.

I have three questions. First question: does Numba do the JIT compiling in a separate thread?

No, it's in the same thread.

So you actually have to wait for the compilation to finish before it gets fast?

Yeah, well, what would you do anyway? Because it's lazily compiling, so if it's compiling when you're calling the function, you must wait for it anyway.

Sometimes JIT compilers do it in a separate thread, and execution just continues with the slow version until it's done.

Oh, right, no, we don't do that.

Okay. Second one: do you have any support for storing the compiled code?

For storing the compiled code on disk? Not yet, no. As I said, we want to support caching, but not yet.

So that's what you meant by caching. So unlike PyPy, which has the problem that it cannot store the compiled version?

I'm not sure whether PyPy does that.

They have to redo it every time you run the code. It would be more efficient if you could just do it once, store it, and reuse it.
That may well be the next thing to add, but right now we don't have it.

Okay, and third one: how do you do error handling? Because you said you don't have any way to catch exceptions.

Yeah, so we have a way to raise them. If you raise an exception from Numba code, you just catch it when it propagates outside of the Numba code. So you can communicate errors to the user, but you can't handle them inside the Numba code.

And maybe an extra question. So you're working on support for NumPy; do you also plan to support SciPy? Are there some plans for that?

Not yet. We mostly support NumPy right now. So any kind of pure Python code which relies on NumPy arrays may perhaps be accelerated, if it intersects with the subset of things we support. But we don't have direct support for anything other than NumPy right now. I suppose someday we'll want to support pandas.

We have no more time for questions. Thank you.