Welcome to another edition of RCE. Again, this is Brock Palen, and I have with me again Jeff Squyres from Cisco Systems and Open MPI. Jeff, thanks a lot for your time. Hey Brock, how's it going? All right, you can always find all of our old shows at RCE-cast.com, and you can subscribe from there too: there's an RSS feed and a link to iTunes, so use your favorite podcatcher to check out anything you want to look at. Today, though, we have a guest, a guest from an organization that has been on here before, and he's going to be talking about some Python stuff. So Stan, why don't you take a moment to introduce yourself? Hi, I'm Stan Seibert. I'm a software developer at Continuum Analytics, and I work on the Numba team. In my previous career I was a particle physicist working on neutrino and dark matter experiments, but I was also a big Python and GPU computing advocate, which sort of led me back into software engineering. So, since we're talking about Numba, can you give us a quick overview of what Numba is? Yeah, so Numba is basically a Python compiler. It's designed to compile Python functions. You may be familiar with other Python projects that take your entire program and try to optimize it in some way; Numba is a little more specific. It lets you take a particular function and say: I'd like this function to go faster, because I'm only going to use it with certain data types. I'm only going to pass in arrays of floats, or that sort of thing. And then what Numba can do is analyze the function, take this extra information you've given it, and generate machine code that does just the thing you want, very quickly.
At a higher level, Numba's goal is really to help you in your Python programming. If you get to a point where your Python is not fast enough and you need some kind of smooth way to speed it up, we want you to be able to keep working in Python and not have to switch languages out to C++ or Fortran, or switch Python interpreters entirely. We're trying to give you a smooth ramp from your initial prototype to a faster version of your program. So how did the Numba project get started? So actually, Numba was originally started by Travis Oliphant, who you've had on the show previously. One of the main founders of NumPy, he was actually trying to find a way to write specialized functions for NumPy. He was trying to do some element-wise calculations, and he wanted to generate functions that were as fast as a compiled language. And so he went and discovered that there were some old Python bindings to LLVM, which is a fairly standard compiler toolkit. He resurrected those LLVM bindings and then started building Numba, which was at first designed for a very narrow thing, but then he started hiring more people to help him expand it into a general Python compiler framework. Okay, so are you doing code translation? Are you turning it into C and then sending it to a normal compiler, or what exactly is going on here? So what's actually going on is it's a runtime compiler, so it's not like GCC or something where you run it to compile your program. When your program starts up, either at the point where you define the function or at a point you choose manually later, it takes the Python bytecode and translates it into the LLVM intermediate representation, the LLVM IR, and that is fed into LLVM, which is another project that does all of the generation of machine code and everything. So we never pass through C or anything else. LLVM IR kind of looks like a high-level assembly language.
So in a sense, Numba is acting as the front end and LLVM is acting as the back end of a compiler toolchain that you can run in the middle of your program. So LLVM is probably best known as part of Clang. It's not normally known as a runtime compiler, so how slow or fast is this the first time you hit something? It's not too bad. Again, because Numba lets you target specific functions, you often don't have to compile a whole lot of code to make your program run faster. You can identify a couple of critical functions, target those directly, and then you can be generating this code in less than a second for a reasonably sized function. And that overhead is only really there when you start up; for the rest of your program, of course, you are just directly running the function you already compiled. All right, so how did you pick LLVM? Because like Brock said, this is not what it's typically used for, although LLVM has been used for more and more interesting things over time. Like you said, it's mostly known for Clang, but it has been used in some other offshoot projects. Did you work with some language people who suggested that, or where did this all come from? LLVM, I think, became a really easy choice early on, because it's a very modular system. It's really a toolkit for building compilers, and that's what we wanted to do, and we didn't want to have to figure out all of the optimization passes in the whole back end. In some ways, LLVM preceded Clang, and so LLVM was being used for a lot of academic compiler research, but now, with a lot of commercial support, it is being used for Clang, obviously. A lot of other companies have adopted it too: NVIDIA's compiler toolchain for their GPUs now uses LLVM, AMD is using it, and a lot of other places have picked it up. So actually, one of the advantages to Numba using LLVM is that we now get access to all of those other architectures.
Numba can compile code for the GPU, which is, I think, one of the huge points about Numba that separates it from similar projects: we can target both the CPU and the GPU, thanks to using LLVM. Actually, let's dive down on that point, because that's an interesting little bit there, and I am fairly ignorant of GPU stuff, so forgive me if I ask something really silly. Targeting the GPU is one thing, but then you also have to add additional stuff to move the memory back and forth and handle the registration and things like that. So do you have additional code in Numba that handles that kind of logic when your target is a GPU? Yeah, so when we add a new target to Numba, we include some driver-level stuff that lets you manage the device, open up a connection, and move data to it. In some cases, Numba will move the data for you; often, as a first pass, you don't have to worry about that yourself. We'll copy it over and then copy it back. Sometimes you want to control when those copies happen, and so you'll use the device API basically to say: okay, now I want you to copy it, and then I'm going to wait to bring it back until I'm done with my whole calculation. But that's sort of a little bit of infrastructure around the core LLVM piece, which is shared quite a bit between the CPU and GPU targets. A lot of that code is similar, thanks to the LLVM abstractions. So what if I wanted to, instead of using it as a just-in-time compiler, actually say: I know I'm going to run something many, many times. Can I save that compiled state? Yeah, so we have a tool, which I would say is not quite ready for primetime, that we have been working on called pycc, whose job is to do exactly what you're describing. It's certainly on our roadmap to flesh that out so that you can take a file of Python code and say: I want to compile this to a compiled module that I can load into Python later and just call, without the compiler overhead.
But right now, most of the use cases for Numba involve runtime compilation, which, as I said, for a lot of projects you just don't notice. There's enough other stuff going on, and you're not compiling your whole program. But certainly we want to support libraries of standard functions that people write and want to just reuse over and over. Now, you said earlier that it takes the Python bytecode. Which is it, or does it do both? Can you take plain scripts and parse the source, or maybe the Python interpreter does that for you, or do you only read Python bytecode? So we read the bytecode, because at runtime, once the source of a function has basically been read in, the Python interpreter converts it to bytecode immediately. And so at runtime, the thing we have access to is that bytecode. People have done code generation where you generate a Python string, ask Python to convert that to some bytecode, and then hand that off to Numba. But we don't want to re-implement a Python parser; that's already built into the interpreter for us, so we don't have to do that. Gotcha. Now, and here's some ignorance of Python itself: when Python generates the bytecode, regardless of whether it's in a separate compile step or at runtime, does it do anything like what LLVM does itself, or is it pretty much just a straight translation? It's a pretty simple translation. All of the different constructs you have in the function get turned into a sequence of bytecodes. The bytecode is really just there as an interpreter convenience: it's easier to write a fast interpreter that deals with bytecode as opposed to a more high-level construct. And since that bytecode is what's really sitting in memory when you load a Python function, that's where we start from. So what about... Python kind of keeps an address to...
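The bytecode being described here can be inspected with the standard library's `dis` module; this is roughly the view of a function that Numba starts from:

```python
import dis

def add(a, b):
    return a + b

# Disassemble the function the interpreter already holds in memory:
# each source-level construct becomes a short sequence of bytecode ops.
dis.dis(add)
```

Running this prints the opcode listing for `add`, showing how direct the translation from source to bytecode is.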
I'm thinking about data structures and moving from the Python world to handing things off to this LLVM world where this compiled thing is actually running. How much effort is involved there? Does the user have to be aware of any of this? I think the users who get the biggest benefit from Numba right now are users who are already accustomed to using NumPy, NumPy being the standard array package for Python that has become the bedrock infrastructure of really most scientific computing in Python. NumPy already stores your data in a nice, efficient machine representation. It's the same way you would store it in C or Fortran, just with a little extra metadata about how the array is laid out. So if you're using those kinds of arrays already, which you probably are if you're doing some kind of scientific computing, we can directly use those arrays right in Numba. There's no need to translate anything. We just know how to reach into a NumPy array and get pointers to the data bits we need to use. So that makes it pretty easy if you're already starting from NumPy arrays, which you probably are if you're doing some kind of major calculation. So that leads into my next thing really nicely, actually. Can I use Numba on any code, or does it only really work well if I'm using NumPy arrays? So there are a couple of different levels there. Our goal is to make it so that it will at least work on any code. You may not see a significant benefit on some code, but we don't want it to ever do the wrong thing or just fail. As I mentioned earlier about type specialization, one of the key bits of Numba is that in order to generate fast code, it has to know that you're only going to use a specific set of data types with the function.
You might have a function where we can't do that inference, because you could be calling out to plotting libraries and all kinds of things and have arbitrary Python objects floating around. But if in the middle of that function there's a loop over some fixed data type, like a NumPy array that you're looping over, we can actually extract that bit, lift it out, and compile it to fast code while the rest of the function stays on the slow path. So there's the slow Python object mode, which is at least as fast as Python but not really much faster, and then there's the faster native representation mode, where we try to make all of the language constructs as fast as we can, though there are of course certain limits. If we can't figure out what the Python code would do, or it's something we don't support yet, you'll actually just get an error back that says we can't compile this. So this is going to show my ignorance of another package here a little bit, but you just mentioned that stuff: the JIT that comes with R by default only does loops, things like that. So is that kind of the focus? Is your focus on places where you're iterating over stuff, and can you tell it to only do that? I would say in Numba, loops are the focus because, from a practical perspective, that's where you're going to get the most win. We can take a large range of functions and turn them into efficient machine code, but if the function doesn't run for more than a microsecond, making it faster probably doesn't help. Usually the core bit of your code is something you're going to do a million times, and that's often in a loop, so loops become practically the right place to focus. But we don't restrict the compiler to just loops. Now that being said, when you detect a loop in the bytecode and you give it to LLVM, you're just relying on LLVM to do whatever it's going to do to optimize that loop, like unroll it or whatever it does.
I don't remember my compilers class very well, right? You're not trying to precondition things; you're just identifying good spots and giving them to LLVM. That's mostly true. We do have to do a little bit of work to make sure that we're preserving the Python semantics. We want to make sure that that loop behaves the way it does in Python, but for the most part it is a pretty direct translation. LLVM is fairly low level, so it ends up looking kind of like assembler in some respects, with jumps and all sorts of things floating around in there. And then yes, we rely very heavily on LLVM's optimization passes, which have been written by compiler experts who are great at this stuff, to do things like unrolling, or generating SIMD instructions to automatically take advantage of the nice vector hardware in CPUs. We get the advantage of that too in Numba. Okay, and let me jump back just a little bit. You've mentioned a couple of times that there are restrictions on the types that you will try to optimize, things like that. Why is that? Really, it's performance optimization. Once you've gotten your program to where it's not doing useless work, getting more performance really is about specialization. Python being a very dynamic language, you can send any object into a basic Python function and, as long as it implements all the methods that are going to get called on it, the duck typing approach basically, it'll work. That's more general than most people's code actually is in practice. In order to specialize the function to generate machine code as efficient as Fortran, you have to tell Numba that you are going to, in some ways, limit yourself to just a few types. You can decide what those types are, and you can have more than one.
You can say: I'm going to have this function work for double precision and single precision and all of those things. But you're not going to be able to use this function with some type someone else made up without recompiling it. That's really what it is; the specialization is how you get the speed. Okay, so if I had to... oh, I'm sorry. If I had to crassly classify the types of speedups that you're getting, it's, number one, because you're taking it away from bytecode and turning it into native code, and two, you're further optimizing that native code. And when I say you, I mean LLVM is doing the bulk of the work here, but it's really two things: turning it into native code, and then optimizing that native code, loop unrolling, SIMD, blah, blah, blah. Is that an accurate characterization? Yeah, I think that's pretty close. The translation, the turning it into LLVM IR, is where we use our type knowledge to say: oh, if you're working on these types, I can generate LLVM that does only the thing I need. And then once I've generated that LLVM, the optimizer in LLVM can do a whole lot more work for us. Do you do a first pass? Sometimes when you run a compiler, it says: I could vectorize this, but it's too short, so I'm not going to bother. Do you do anything like that? We don't really have to. We let LLVM make a lot of those decisions. It's got pretty good optimization passes, and it's able to generate multiple code paths if it needs to, to handle long and short arrays, that sort of thing. What would be some good rules of thumb? If I'm a scientific programmer and I'm getting into this whole scientific computing with Python world, what are some ways that I can write my code, quote-unquote, well, that would work well with both NumPy and Numba and things like that? I would say step one is to absolutely make sure you're using NumPy.
That right there will maybe speed up your code enough that you don't even need Numba, but I would say it's a prerequisite to being able to make good use of Numba. So start with NumPy, get familiar with it, make good use of it. The second thing is to get familiar with a profiling tool. There are a number of profiling tools for Python, and you absolutely want to make sure that you are actually speeding up the thing that matters. Because Numba doesn't take your whole program, you want to tell it to focus on specific functions, so you need to know what those functions are. I am personally always surprised when I see what the actual bottleneck is in a program. So you want to start with that profiler. The profiler will probably identify a couple of functions that do a lot of work on arrays. NumPy does a lot of stuff, but sometimes you end up having to fall back to writing a manual for loop. That's exactly the place where Numba is going to help you a bunch: taking that for loop and turning it into machine code. So just out of curiosity, you mentioned that if you use NumPy itself, that might be enough acceleration that you don't need to use Numba. What would be an example of something that does need the acceleration of Numba, where NumPy is not enough? Yeah, so if you start with NumPy, that's often good enough. But where NumPy can fall down is if you have a complex expression with arrays, where NumPy has to make a bunch of intermediate arrays to store partial results. If you're adding a to b and then multiplying by c, it has to make one intermediate array in the middle of that calculation, whereas a Numba version of that function doesn't necessarily need to do that. That's the point where switching from NumPy to Numba can be a big performance win, a way to save on the memory allocation that can eat up a lot of time. Okay, so how much time are we actually talking about here?
What kind of speedups are you seeing when people use Numba in these cases? So if you're starting with a pretty solid NumPy program, we're talking a factor of a few, depending on the complexity of your expressions. If you've gotten to a point where your algorithm requires that you iterate over a NumPy array element by element, because there's just not a NumPy function to do the operation you need and you have to loop over every element, that could be a factor of 100. So that's why I say, if you find yourself having to do element-wise operations on NumPy arrays, Numba will be huge for you. If you're already getting good use out of NumPy, Numba may still help you by a factor of a few, depending on precisely what you're doing. So I want to quickly touch on something. You said first you've got to figure out those hot spots, and you've got to run it through a profiler. What profiler do you normally use when you're looking at applications to figure this out? I usually use the C profiler in Python; it's just a module called cProfile. It's a function profiler, so it'll tell you which functions you spend the most time in, and since Numba is a function optimizer, you want to look at the function level. I've also used something called kernprof, which is a line profiler. It'll actually go through and assign times to all the lines in a file, which is nice if you have a longer function and you want to know which pieces matter. Those are often candidates to pull out into a function that you then hand over to Numba. Okay, so let's say I've gone through and done that, and I found a couple of hot spots in my application. What do I need to do to say: hey, Numba, lift this part out and compile it? So right now it's pretty low level; we're working on adding higher-level constructs to be able to help you in other cases.
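A minimal cProfile session of the kind described, using only the standard library (the function names here are just a toy example):

```python
import cProfile
import io
import pstats

def hot():
    # stand-in for the expensive inner work a profiler would surface
    return sum(i * i for i in range(10_000))

def main():
    for _ in range(50):
        hot()

pr = cProfile.Profile()
pr.enable()
main()
pr.disable()

s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats(5)
print(s.getvalue())   # 'hot' shows up near the top of the listing
```

A function like `hot` that dominates the cumulative-time column is exactly the kind of candidate to hand to Numba.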
But at a very low level, if your function already has a for loop in it, the minimum thing you have to do is add a Python decorator. You can import from Numba a jit decorator that you put above the function. And when you do that, the first time you call that function, Numba will look at the data types that you pass in and compile it for those data types. There are more advanced versions where you say ahead of time: no, really, I'm just going to use these data types, compile it when my program loads, don't wait till the first call. That's the very base level. You may find that that's not enough. You may then have to go in, if you have a function that's doing array operations that we can't directly accelerate, and unpack those array operations into a set of for loops that manually loop over the data in the order that you want and do the calculation. So there are sort of two stages. One is to just try using the jit decorator, and two, you may reorganize the function a little bit to make it more obvious to Numba what you're trying to do. Okay, so you mentioned that decorator, and I assume that's part of Numba. What exactly do I have to change in my Python to say: this function or this loop is what I want Numba to work on? Yeah, so at a very minimum, you import the Numba package, and it includes a Python decorator called jit, j-i-t, that you just put above the function you want accelerated. What that will do is, the first time the function is called, inspect the data types of what you called it with and then optimize it for those data types. It's sort of an automatic JIT. There are more advanced uses where you can say ahead of time: these are the types I'm going to pass in, compile it when the module loads, don't wait for the first call. Beyond that, you may find that the initial jitted version of your function is not quite fast enough, and you actually have to unpack things.
You may be using some NumPy operations directly that you have to unpack into some for loops at this point (we're working on fixing that), doing the looping manually with the right sort of calculation and storing data where you need to store it. So you may have to rewrite your function a little bit, but the first step is really just to hand it to the jit and see what happens. Is it ever going to be easy enough that Numba would just figure it out? If I import the Numba package and I don't decorate my code anywhere, it would just say: oh, here's a loop, you seem to only be passing these data types, I can work with that. I don't know. Maybe really long term. I think that often that's not very helpful, because most of the time you would spend compiling would not be useful, since you're not going to call all the functions in your code a whole bunch. One of the things Numba is doing differently from a lot of other tools: what you're describing is kind of a tracing JIT. That's very common, like in your web browser, where the runtime inspects the code as it's running, sees where the hot spots are, and then calls the JIT on your behalf. I can imagine Numba actually being part of a bigger tool like that, but it would require instrumenting the interpreter to be able to inspect what's going on at runtime. So I can see Numba being a component of that kind of bigger package. Do you ever see functionality like Numba's being built into the Python interpreter? It would be nice. I think Python is undergoing a transition. If you look at the space, there are so many compiler and JIT related projects. I think we're undergoing in some ways the kind of transition that JavaScript did in the browser speed wars, where people are saying: you know, just because it's a dynamic language doesn't mean it has to be slow.
And so I think we're currently in the stage of exploring the space in a bunch of different ways, and as it becomes clear what's successful and what's not, things may settle down and we may see things moving into the standard interpreter. Okay, so looking at this from the perspective that this is exploration and research, what is some negative knowledge that you've learned? What are things you have learned that don't work, either in the lab or from feedback from people saying: hey, you did this feature and it's really not working out the way that I want it to. We focused very heavily on array-related work because that's very common in scientific computing and it's also a lot of low-hanging fruit; there are potentially huge speedups for array work. A lot of cases where you have, in some ways, more individual object manipulation, that sort of thing, can be harder to accelerate in something like Numba, which focuses on functions, when you have control spread among many different functions, each doing a little piece of the calculation. It can be hard to pull all that together in something architected like Numba to speed it up. Other JIT approaches, like the things you see in PyPy, can work there. What is PyPy? Oh, sorry, PyPy is a completely separate Python interpreter that incorporates a JIT inside of it for your whole program. So PyPy is trying to solve similar problems to Numba, but does it in a very different way. Numba is designed very consciously to live inside the standard Python interpreter, so that it can use all of the standard extension modules and things you already have available to you there. We don't want to replace the entire interpreter with Numba. So, Numba: I only heard about it recently, from a Python conference. Would you call it production ready?
Is it something that's safe to use today? It's not going to compile your code into something that doesn't give you the expected result? I'd say it's definitely ready for advanced users, early adopter types. Some people are using it in production, and I think that's the kind of thing you have to evaluate case by case. The fundamentals are pretty solid. Where we run into trouble with people is that there's a lot of the Python language out there. It's a very flexible language, and so not all things can be handled by Numba. We'll tell you; we'll generate an error and say: sorry, we can't compile that. So what I tell people is really that they need to try it out. And one thing I want to emphasize is that we actually love hearing about people's use cases. There's so much of this space to cover, and we have to decide what to do first, so when people get on the Numba users mailing list and let us know what they're doing, that really influences our decisions about what to work on. You've been using the word we a lot. Who's involved in Numba? Numba has a number of developers here at Continuum Analytics. The lead developer is Siu Kwan Lam. We also have Jay Bourque and Oscar Villellas, and Antoine Pitrou, who's actually one of the core Python developers, is working with us this summer to help us expand Numba's capabilities and get more of Python working in it. We don't have a whole lot of external developers, and that's something we're actually focusing on this summer: improving our developer documentation to help people outside of the organization understand the innards of Numba, understand how they can contribute, and add functionality to it. Now, just out of curiosity, I ask this of a lot of these types of projects, and it is purely a curiosity question, but I'm always fascinated: what version control system do you use, and why? We use Git, and honestly, I think most people's answer to that is because of GitHub. You know, I'll probably get yelled at for this, but I personally don't like Git.
It just never has fit in my brain properly. I'm a big Mercurial fan, but GitHub in some ways is the killer app for Git, and if you want to have access to developers and make your code easy to get to, GitHub is a great way to do it. So what's coming in the future for Numba? What are the upcoming features you've got coming? Our big push, as we solidify this low-level compiler base, is to ask the question: what higher-level constructs would help people write faster Python, especially in the scientific and data analytics world? So one thing we have been prototyping and working with is what we're calling a deferred array object. It's basically something that looks just like a NumPy array, but it doesn't do the calculation as you work with it. Part of the problem NumPy has is that every time you do an operation, it has to give you back the answer immediately, even if you're just going to take that and feed it into another calculation. So the deferred array object waits until you get to the end, looks at the whole operation you did, then asks LLVM to optimize that, and then does the calculation. And we think that will help people write array code, code that looks like what you would write with NumPy or MATLAB, so it's nice and concise, but has all of these compiler benefits and generates something that really is as fast as Fortran. That's a big feature on the horizon. We want this Numba array to work equally well on the CPU and the GPU, because we have the compiler infrastructure to target both of those. One thing we think will be huge is that you can then take this and say: okay, now that I've got this working, and I happen to be sitting at a computer with an NVIDIA GPU, I can just flip this and say: okay, now target the GPU, and my same code should just work on the GPU. So one thing we forgot to ask before: what accelerators are you currently targeting?
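Purely as an illustration of the deferred idea (this is not Numba's actual API; the class here is a toy, assuming only numpy, that records operations and evaluates them at the end):

```python
import numpy as np

class Deferred:
    """Toy lazy array: builds up a thunk instead of computing eagerly."""
    def __init__(self, thunk):
        self.thunk = thunk

    @classmethod
    def wrap(cls, arr):
        return cls(lambda: arr)

    def __add__(self, other):
        return Deferred(lambda: self.thunk() + other.thunk())

    def __mul__(self, other):
        return Deferred(lambda: self.thunk() * other.thunk())

    def evaluate(self):
        # A real implementation would hand the whole recorded expression
        # to a compiler here; the toy just runs the recorded operations.
        return self.thunk()

a = Deferred.wrap(np.ones(3))
b = Deferred.wrap(np.full(3, 2.0))
c = Deferred.wrap(np.full(3, 3.0))
expr = (a + b) * c       # nothing is computed yet
print(expr.evaluate())   # the whole expression runs in one go
```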
So aside from the CPU backend, we have a pretty well established one for CUDA, which is NVIDIA's GPU programming environment, basically. We have one in progress for OpenCL, which is what you would need to target an AMD GPU, and that's pretty much what we've got cooking right now. The CUDA one is definitely, you know, as production ready as the rest of Numba; the OpenCL one is very experimental. But we're certainly open to adding more. I think OpenCL and CUDA cover the majority of stuff out there. Things like Xeon Phi would be fantastic; we don't have the resources right now to do that, but we would be very open to that kind of thing. Okay, and finally, what's the license on Numba? So Numba is BSD licensed, much like much of the Python world. So it's open source, and it's pretty permissive. You may run into a product called NumbaPro. That is a closed source enhancement for Numba. The Numba core is open source and will continue to be open source; there is some value-add stuff that we put on top that we sell to some customers. Actually, fairly recently we were excited to be able to take the GPU part, which was originally proprietary, and move it into the open source half of Numba, and that's something we would like to do more of in the future as well. Quick question: is Numba part of the commercial version of Anaconda? The NumbaPro part? Yeah, although NumbaPro is separately licensed, I believe. I don't know the details of whether or not it comes with the commercial Anaconda support. The Anaconda distribution itself is free. Okay, well, that's everything. So Stan, thank you very much for your time. Cool, thanks. It's great to talk to you guys.