We're ready to get started. Cool. Welcome back. Our first speaker of the day, our first long keynote, sorry, speaker of the day, is Chris LeBlanc from GNS Science, talking to us about Cython. Cool. Take it away.

Thank you. Thanks for having me. So the talk I'm going to give is getting the benefits of C without leaving Python. All right, let's get right into it, because there's a ton of slides to cover. Myself, I have a background in Earth Sciences, which is really just geology with a peppering of some other stuff, like environmental science, and a background in geophysics. I've been using Python since roughly 2001. I'm now a software developer for GNS Science, working on something called Globe Claritas. We do a bunch of seismic processing, and you would instantly think earthquakes, and I wouldn't blame you. But this is more active seismic: running seismic surveys and processing the tens of terabytes or even more that can be generated from that. So we need lots of efficient code, which is generally a lot of C code, sometimes even a bit of Fortran.

So what is Cython? Well, Cython is a fork of Pyrex. What is Pyrex? Pyrex is a tool that makes C extension modules very easy to write. It's pretty much just like writing normal Python, really. The main author of Pyrex was Greg Ewing from here, believe it or not, the University of Canterbury, quite some time ago. Cython gives you all sorts of advantages. Even just doing the one simple switch, converting from Python to a C extension with Cython, can sometimes give you a 50% speed increase on its own, without adding additional features. It also acts as a very good bridge between the Python world and the C world. Since you're creating C code of your own, you can actually call C functions directly, which isn't something you can do in standard Python. You can also go the other direction: you can call Cython functions directly from C, which is, I don't have time to cover it, but a very useful thing.
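As a tiny illustration of that bridge (my own sketch, not from the talk's slides): Cython ships declarations for the C standard library, so calling a C function directly looks almost like ordinary Python:

```cython
# fast_sin.pyx -- hypothetical example of calling C directly from Cython
from libc.math cimport sin   # the C library's sin(), no Python wrapper

def sin_sum(double x, int n):
    cdef int i
    cdef double total = 0.0
    for i in range(n):
        total += sin(x * i)  # plain C function call inside the loop
    return total
```

Every call to sin here is a direct C call, with no Python object creation or attribute lookup in between.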
All right, let's get into the demos here. So here's a very, very simple demo straight from the Python 2 documentation. It's Python 2, sorry; all the demos I have here work on Python 2 and 3. If you see this little function here, it's just doing a very simple try/except block, and that's only five or six lines or so. Now, whoa, here's the C code that implements it, and that's not even all of it. There's some more, and there's some more. Don't worry, I'm not going to go through this code, because it's kind of nasty. But you can see we have to use things like this PyObject_GetItem that's part of the Python C API. You have to check for any errors. You have to worry about all your reference counting, so you have to manually increase and decrease reference counts, and you have to know when to do this and when not to. It's all pretty bad. So if we can get away from this and stay in Python, that would be a much better thing. And that's what Cython gives us.

I won't go through all these bullet points, but it's mainly Python. It gives you excellent compatibility between Python 2 and 3. You get all those things you're used to, like garbage collection and string handling. It does some really useful things, such as conversions between Python types and C types. In Python, an integer doesn't really have a specific size; it has sort of arbitrary precision, whereas C has types of very definite sizes. So Cython will handle a lot of that automatic conversion for us. It creates code that can compile on pretty much any compiler, and it's stable and mature. So it might not be quite as sexy as some of these new JIT-compiled approaches, but it's definitely a robust way to do it.

The disadvantages are, since it's not done at runtime, you need to compile it first. This usually boils down to using something like distutils, which is not the best; I'm not the first one to say I don't like distutils. And it is CPython-specific.
So this won't work on things like PyPy or IronPython or some of those.

So why don't we get into it again and just convert a very simple Python function into a Cython function? So here, this is a pretty simple thing; this is kind of as simple as Python code gets. We're just setting a variable x equal to 0 and then incrementing it through this thing called count. And there are no defined types on this one. So let's add some. We're using the cdef keyword to declare that x is an integer. And this isn't a Python integer; this is a C integer. So this is no longer an immutable type like you have in Python. In Python, if you say x = 1, that's immutable, you can't modify it; if later on you say x = 2, it's actually discarding the old value and binding a new one. This is different: it actually modifies the value in place. And integers can be different sizes on different platforms. So it's more of the C world that you have to be aware of. But as long as you're aware of that and have some knowledge of C, it's very powerful. And all we did there was one small change, and this will vastly increase your performance; I'll go into some examples of that later.

Now what if we want to make this a complete function? In the internals, if you looked at the C code, this would actually just be a C function. So we're declaring that the function itself takes an integer and returns an integer. And we still have cdef int x = 0, but at the end we're returning x, which is also an integer. So this would be very familiar to anyone that's done a bit of C. It's a little bit unusual for Python, but this new PEP that we talked about at the keynote sort of introduces some typing to Python as well, so it's a bit in line with that sort of thing.

And so how do you build this? The files I've shown you so far, they're still really just Python files, and these have a .pyx extension on them.
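Putting the pieces above together, a minimal .pyx file along those lines might look like this (a sketch of the pattern, not the exact slide code):

```cython
# count.pyx -- hypothetical typed counting function
cdef int count(int limit):
    # x is a fixed-size, mutable C integer, not a Python int
    cdef int x = 0
    while x < limit:
        x = x + 1
    return x
```

Because it's declared with cdef, this compiles to a plain C function, callable from other Cython or C code; declaring it with def (or cpdef) instead would also expose it to Python.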
So to convert that, basically to compile that to a C file, you just run the cython command on it. You can do that manually, but what is usually done is to use distutils or setuptools with a setup.py file. You just use the build_ext infrastructure that exists in Python to do all the hard work for you, so things like compiler flags are largely taken care of, but you can override them if you want to. And we build in place just so we have the shared object where we want it.

So that's the groundwork out of the way. Now, here's my sort of impression of Python and the global interpreter lock. For those of you who don't know, we do have threading in Python, but any time you're using it and modifying your typical Python object, it's going to lock the interpreter, so only one thread can run at a time. If you're running a for loop, it's going to run only that for loop until it's done, and then it might let the other threads do their thing. So it's really not much better than running things in serial. So wouldn't it be nice if we could get around that? There are actually a lot of modules in the Python standard library that do already. Things like time.sleep, that's a very simple one that releases it. Most of NumPy releases it, because most of NumPy is implemented in C. And there are lots of other C extensions that do as well.

So how do we do this? Because you can't normally just disable the GIL from Python. But using the syntax here, this with nogil, that lets us disable it. So at that point, things running in parallel, even just using standard Python threading, will actually run at the same time, and you'll get a speedup like you would expect to get with multi-threading. Now there are some caveats with this, and the big one is that you basically can't modify any Python object in here. So as long as you're dealing with Cython C-style objects and variables, you'll be fine.
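To make that build step concrete before we go further into the GIL, a minimal setup.py of the kind described above might look like this (module names are hypothetical):

```python
# setup.py -- minimal Cython build script (hypothetical names)
from distutils.core import setup
from Cython.Build import cythonize

setup(
    name="demo",
    # cythonize() translates demo.pyx to C and wires up the extension
    ext_modules=cythonize("demo.pyx"),
)
```

Running `python setup.py build_ext --inplace` then leaves the compiled shared object next to the source, as described above.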
But if you need to modify some sort of Python state, you have to ensure that you re-acquire the global interpreter lock. And that's done in a very similar way: you just say with gil. In this example here, we're just seeing if a certain condition happens, so maybe it's an error state from some C code or something. And to raise an exception, you can't do that, since that's, again, part of the Python world; you can't do that without re-acquiring the GIL. So we do that and raise the exception. And there is a bit of checking by the Cython compiler, so if you do something that's not quite right, it'll typically let you know.

So why do we stay away from threading in the first place? Well, there are all sorts of problems here: race conditions, deadlocks, data corruption. Even just implementing threading itself is a bit of a pain, with thread pools and things like that. So wouldn't it be nice if there was a simpler way to do this? Well, there is. And here's a very complex-looking diagram from Wikipedia describing OpenMP. How many people here have heard of OpenMP? So maybe a half or a third of the audience. I'm not terribly surprised, because we're Python programmers; why would we look into this? But OpenMP is a C and Fortran multi-threading API and spec. It's pragma-based, so what look like comments in your code actually direct the compiler how to execute certain tasks in parallel. In the upper section there, we see these parallel tasks. The upper one is executing things serially, so it's fairly simple: it's doing A, B, C and A, B, C, D. But the one beneath it, that's where it's doing things in parallel, as you would expect a typical parallel program to do. The advantage of OpenMP is that it abstracts a lot of this pain away from you if you use it correctly. So I'm going to look at that in a few minutes, just in relation to Python, and see how that works.
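Before moving on to the benchmarks, the with nogil / with gil pattern just described can be sketched like this (a hypothetical function, not the slide code):

```cython
# gil_demo.pyx -- sketch of releasing and re-acquiring the GIL
def check_positive(double[:] data):
    cdef Py_ssize_t i
    with nogil:                        # other Python threads can now run
        for i in range(data.shape[0]):
            if data[i] < 0:            # e.g. an error state from C-level code
                with gil:              # re-acquire before touching Python
                    raise ValueError("negative value encountered")
```

Everything inside the nogil block sticks to C-level types; only the exception, a Python object, forces the brief re-acquisition of the lock.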
And I was dabbling around with this for quite a while to find a good benchmark that would show how Cython can speed things up without getting too complex, while still touching on the aspects I needed to touch on. So I sort of fell on this 2D Laplace equation. It was first done in 2004 by Prabhu Ramachandran, I hope I'm pronouncing that correctly, and that was further updated by Travis Oliphant in 2011. So they previously compared all these things here: NumPy, Fortran, Matlab, Octave, all sorts of stuff. I'll discuss a few additional, sort of newer ones. So I'll discuss Python itself, because that's the standard; PyPy with its just-in-time compilation; NumPy again; Numba, which is a very interesting one; Cython; Cython wrapping C; and Cython running in parallel.

So here's a very simple diagram. The 2D Laplace equation, for each element it iterates over in the entire array, a big 2D array, it takes the four surrounding neighbouring elements. So it's a fairly, I suppose, I/O, not really I/O, memory-intensive algorithm. And I think it's a reasonable benchmark for this sort of thing; it's typical of the sort of things scientists would do on NumPy arrays. And here's the starting state. The entire 2D array is all 0s, except for the upper row, where those are set to 1. And this is all 64-bit floating point. There it is after 10, 100, and 10,000 iterations. So you can see what's happening here; it's almost like diffusion of heat. But don't ask me, because I'm not really a mathematician, per se.

So why don't we have a look at the various implementations and see what sort of speed-ups they give us. This one is the naive Python approach here. Let's see if my cursor works yet. So down here, we've got the work array. For those of you that haven't seen NumPy in action, this is a way of creating a 2D NumPy array that's all filled with 0s. And then this step here, it looks like we're only accessing one element.
But we're actually saying the entire row, that upper row, is equal to 1. And then we run this function up here, for x in range(100). So it's an iterative approach, and it modifies this array in place. And up here, this is the standard algorithm you see in those previous two blog posts. I'm really not a fan of variables like i and j and u, but I thought I'd stick with their implementation just to keep things simple. And it is very in line with the equation itself, so it makes sense.

All right, and here's the first benchmark. So this is Python, and as you would expect, this is going to be the slowest one. You might wonder why I chose the scale on that upper graph there; that'll make more sense later as we see various other implementations coming in. So these are graphing the same thing: the upper graph is just a linear graph, and the lower one is a log-log graph, and that shows different behaviours better. It shows the behaviour of smaller arrays much better; we'll see that later on.

So I mentioned PyPy and its just-in-time compiler. This one here gives us an order of magnitude speed improvement without changing the code at all. So that's not bad. We see this little hiccup there; that's actually just the JIT warm-up. OK, then there's the NumPy version. I only have five minutes, so I've got to move my butt. So the NumPy version eliminates all the loops and uses these so-called vectorized operations to perform the same sort of operation on the array. Unfortunately, it creates several temporary arrays, so if you're working on a massive array, you might find you run out of memory. But this gives us a huge improvement, on the order of two orders of magnitude.

OK, so then we move on to Numba. Numba is this new thing. It uses LLVM, the low-level virtual machine, which lots of things are using, like Julia, that other language. So all we have to do is decorate this function.
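To make the update rule concrete, here is a pure-Python reconstruction of the loop version being described; this is my own minimal sketch of the same algorithm, not the speaker's exact code:

```python
def laplace_step(u, dx2=0.01, dy2=0.01):
    """One in-place sweep of the 2D Laplace update.

    Each interior element becomes a weighted average of its four
    neighbours; the boundary rows and columns stay fixed.
    """
    nx, ny = len(u), len(u[0])
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            u[i][j] = ((u[i - 1][j] + u[i + 1][j]) * dy2 +
                       (u[i][j - 1] + u[i][j + 1]) * dx2) / (2.0 * (dx2 + dy2))

# starting state: all zeros, except the top row of ones
grid = [[1.0] * 5] + [[0.0] * 5 for _ in range(4)]
for _ in range(100):
    laplace_step(grid)
```

After enough sweeps the values settle into the smooth, heat-diffusion-like pattern shown on the slides, hottest near the top row.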
So it's very similar, pretty much identical, to the Python version, except we decorate it with this jit decorator, and we get this huge speedup; especially for small arrays, it does very well. And again, we see the JIT warm-up for that first data point.

Now we have the Cython version. This is, again, very similar to our Python version, except we use cdef int for the i and j variables that we're using to loop over. Apart from that, it's pretty much identical. I've included a couple of decorators here that help performance a little bit; they just turn off some of the safety checks. This is the setup.py file that compiles it. I don't think I have time to talk about that. And so that's, yeah? OK, cool. All right, so the setup.py file we use to compile from the C code to a shared object on Linux or a DLL on Windows. The magic all happens here in the cythonize command. Apart from that, it's all just very simple, relatively simple, very standard distutils approaches. This creates an extension module called demos, in this case. All right. And that gives us the purple graph we see down here. So that's very similar to Numba, and I'm actually very impressed by Numba's performance here.

So that's Cython, which is very similar to C. But why don't we see how fast this runs if we just wrap a C function that we already have? This is something we do a lot at my job, because we have oodles and oodles of C code that's very optimized and has been running for many years, so it would be foolish to just throw it away. What we do here is very similar to including a .h file in a bit of C code. This is Cython syntax: we say cdef extern from, and then give it the header file we have. That does a bit of type checking for us. And then we specify the function prototype. And here's the function that actually does it. So here's how we declare that our u object is a NumPy array, or ndarray, as they call it. Then we've got other variables.
We don't actually give them a type here, because Cython handles the type conversion on its own. Then we run this C function directly. Unlike ctypes, we don't have to declare argument types and return types, since it's really just a C function call at this stage. And this is an interesting way of getting the pointer to that NumPy array, to the first element in it, since it's really just a contiguous chunk of C memory as far as this thing is concerned. So we take the address of element [0, 0] with the ampersand, which is very C-ish. And then we pass in the shape and other variables. OK, there's the C code, enough said. Then we get on to the Cython C wrapper. The only real difference here is we're including this .c file. If it's just a simple C file without any other dependencies, we can just stick it in there, and Cython can compile it for us. And again, it's right on that sort of lower curve, so it's up with the fastest of them; it's only slightly faster than the Cython version.

OK, so let's get on to something a little more interesting here: Cython in parallel. For the parallel side of Cython, there are only two things you really use as a user. There's the parallel directive: anything inside a parallel block is local to that thread, so you get thread-local buffers. And then there's prange. prange uses the OpenMP that I mentioned earlier to spawn multiple threads and have things run in parallel. It also does lots of other trickery for us, in terms of synchronization and things like that. If you really want to get a bit more low-level, you can use the openmp module in Cython; for example, if you want the number of threads, that's how you get it. And that's just a wrapper around OpenMP and its API. So this is almost identical to our previous Cython version. The only change here is I've replaced an xrange with a prange, and I had to say nogil=True. So anything inside this for loop is not allowed to modify a Python object, or if it does, you have to reacquire the global interpreter lock.
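A prange loop of the shape just described might look like this; a hedged sketch over a typed memoryview, not the actual slide code:

```cython
# parallel_demo.pyx -- sketch; needs OpenMP flags (e.g. -fopenmp for GCC)
from cython.parallel import prange

def scale(double[:, :] u, double factor):
    cdef Py_ssize_t i, j
    # prange spawns OpenMP threads; the loop body must stay at the
    # C level, hence nogil=True and only typed variables inside
    for i in prange(u.shape[0], nogil=True):
        for j in range(u.shape[1]):
            u[i, j] = u[i, j] * factor
```

The outer rows are divided among the threads, while the inner loop runs serially within each thread.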
And in this case, we're just modifying this NumPy array. That is a Python object, but it's really just a C buffer as far as this is concerned. So let's see the response here. We can see we get on the order of, I think it's 3.2 times faster on a four-core machine. That's most likely because it's not really an algorithm that's perfectly suited to being run in parallel; if you had something that was embarrassingly parallel, you'd probably get even better performance. Down here, we can see there's a bit of a hitch there; it's always there, no matter what. And at small array sizes, you can see the performance actually drops off quite a bit, but that's because it's running on four threads. Programmatically, you can say, if such-and-such condition, then let's just use one thread. And if you do that, you get your standard C serial performance again.

OK. So what if you need more performance? Well, you can mess around with compiler flags. I don't recommend that; you won't really get that order-of-magnitude speedup you want. I would think PyCUDA and OpenCL would get you one or two, maybe more, orders of magnitude. NumbaPro does the same sort of thing. OpenMP 4 is similar to what we had already, except it offloads things onto the GPU, so potentially prange could give us a massive speedup without any added complexity. There's also distributed parallelism: MPI, IPython, Spark, et cetera. I think a combination of these two would be really powerful.

OK, so conclusions. Cython makes C extensions very easy. It offers excellent performance, especially in parallel. I'm also very impressed by Numba. They used to have a prange, but they removed it because it wasn't exactly stable. I have a feeling that'll probably come back in the future, but for now there's no simple equivalent to Cython's prange. And here are some arbitrary scores I gave all these approaches. I think both Cython in parallel and Numba are probably the shining stars here.
But if you want simple code that still runs fast and your problem is well suited to it, I think NumPy is the default option. OK, thank you. And there are some links.

Oh, we've got a couple of minutes for questions, and the first question can be mine. What's the current state of play with Cython and Python 3? Oh, they work quite well together, I think. Cython has, as far as I know, full Python 3 compatibility. And that's another reason to use Cython instead of, say, the C API, because there are a lot of differences under the hood, but Cython abstracts most of them away. So things like Unicode strings are quite easy in Cython.

In your work, is it generally obvious where the performance issue is, or is there profiling or tooling that you use to figure out where your optimization should target? I should have mentioned this; I wasn't going to include a slide on it, but of course the first thing you should always do is profile. Now in these tests here, since they were based on existing blog posts that were all about optimizing this particular algorithm, it was sort of obvious where the bottleneck was. But definitely, in our work, it's not obvious at all. Things you assume might be the bottleneck often aren't, and it's other things. So definitely, it's all about profiling. I didn't really have time to go into that in this talk, because it's only 20 minutes, but Cython does give you some good tools for that.

Sure, I can. So this is a very busy graph at the end. Yeah, no, it is. No, it is a question. So looking at this graph tells me, at the moment, and I'm not sure if I'm missing an important point, that neither Cython nor Numba are considerably faster than NumPy at large array sizes. Yeah, I would tend to agree. I mean, I didn't expect this. If I look at small arrays, I probably don't need that level of optimization in many cases. So I'm just trying to make a point for NumPy. Yeah, I agree.
I'm actually surprised to see NumPy and all the other approaches be so similar, and it's one reason why I included a graph instead of just numbers for one array size, because you can see it depends greatly on what your array size is. So I'm definitely very surprised that NumPy is as fast as it is, because I would expect another order of magnitude speedup from going to C. So either there's something wrong in my code where I'm testing these speeds, or over the past, say, four years, there have been some real optimizations to NumPy itself, which is definitely very possible. So I actually agree: I think if your problem doesn't require multiple cores and doing things in parallel, and NumPy is fast enough for you, then absolutely go with NumPy. The one condition on that is if you're using lots of memory, this sort of approach won't work very well at all. But we all knew that. I'm tempted to finish the talk on Cython with "just use NumPy", but are there any other questions? I think the parallel thing is pretty cool. That's only four cores; if you had 16 or 32, that would be quite nice. Cool. Thank you, Chris, for telling us not to use Cython. Thank you very much. All right, you're welcome.