Next we have Stefan Behnel and David Woods, who are going to talk about fast native data structures (C/C++) from Python. Hi Stefan, hi David, how are you? Nice to have you at the conference. This seems to be a very technical talk, much like the one we had before, so I won't take away any more of your time and I'll head straight into the presentation. I'd like you to share your screen, and then I'll hop off the stage; if there are questions I will run a Q&A, otherwise you can use the full 45 minutes. Thank you. Wonderful. Thanks, Marc. Welcome everyone to our talk. So this is a talk about Cython, specifically about using it to access native data structures and Python data structures in a very fast way. I'll get right into it, and I'll start with a little intro to Cython, just in case you haven't used it before. I'm going to use it from a Jupyter notebook. You would first say %load_ext cython to enable Cython support, and then you can use Cython from within the notebook in Cython cells. Cython is basically Python, but it adds a way to statically type your code: you use static C and C++ data types in your code, and Cython compiles the whole thing into an extension module that you can then load and use from your Python runtime. So for most of this talk, you would basically just write Python code, and then by adding the %%cython cell magic in the Jupyter notebook, or by compiling your module with Cython through setuptools, Cython translates it to C and gives you an extension module that you can import into Python. So from a Jupyter notebook, you just start a cell with %%cython, write Python code in it, and have Cython compile and run it for you. Okay. So here's a simple Python function. 
You can use static type declarations in your Cython code. You would use them for function arguments, for example — you can have argument type annotations as you know them from Python. You can declare variables as C data types, a C integer in this case, and then use them in your code. This is how you add static type declarations to your Python code to make Cython aware of them and to make Cython generate better code for you. You can then use your functions normally; you just call them as you would call any Python function. A bit more about functions: Cython can compile Python functions, but it also allows you to generate plain C functions, using the @cfunc decorator. Such a function is turned into a plain static C function in your code, and that's a way to get fast access to your functions from other Cython code, so that you get a straight C call into your code, while you can otherwise expose it to Python as a normal Python function — a kind of mix between the two. You can also use C code directly from your code, and here's a way to use code from the libc math module: I'm using the C sin function and the C pi constant to do a little calculation here in my code, and this executes directly in C rather than going through Python's math module, for example. So let's go straight into the topic of the talk: fast access to Python data structures. With Python data structures I basically mean the built-ins. Let's look at what Cython can do to make access to the Python list type faster. Okay, here's a very tiny toy example: calculate the sum of squares for a list of integers. I have 80,000 integers, just randomized for the sake of the example. I can use the sum built-in in Python with a generator expression to calculate the sum of squares. When I run this for 80,000 integers, it's apparently pretty quick: it runs in 11.8 milliseconds. And now we can do the same with a plain loop. 
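As a rough pure-Python analogue of that libc example (the function name below is my own, not from the talk — in Cython you would cimport sin and M_PI from libc.math so the call compiles to a direct C call with no Python overhead):

```python
from math import pi, sin


def sin_of_pi_fraction(n: int) -> float:
    # In Cython, sin and pi would come from libc.math and this
    # would execute entirely in C; here we use Python's math module.
    return sin(pi / n)


assert abs(sin_of_pi_fraction(2) - 1.0) < 1e-12   # sin(pi/2) == 1
```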
We can run a loop over the list of integers and calculate the sum of squares that way. When I run this, it's a bit slower, because the sum built-in is highly optimized and very fast, so the loop takes a bit longer. Compared to what we had before — actually, here it ran faster for whatever reason. Well, it's probably just that it's so fast that you get variance in the execution times; it's 20 percent faster for whatever reason, maybe it's just not enough data, I don't know. Okay, anyway, we have two versions of this in Python. Now, what happens when we do the same calculation in Cython? Again, we use %load_ext cython to make Cython available in the notebook. Now we put this function into a %%cython cell, a Cython-compiled cell, and let Cython compile it for us. And that's a little bit faster already — we gain a bit of speed here compared to the sum version, just by basically adding these few characters, %%cython. Why is that? Well, we can ask Cython what it made of our code, and we do that by adding -a here: show me the annotated version of my code. That gives you an HTML view that shows what Cython understood of your code and how it interpreted it. Looking at the loop line here, you can see it's apparently doing some Python operation — there's a Python C API function being called — and the loop here is very involved, with lots of code. That's also why it's a dark yellow: the darker a code line, the more Python interaction is involved in it. Well, it can run much faster if we push the actual calculation into native C integers. So what I do now is add static types, and that allows me to tell Cython that I want everything but the iteration in C. 
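The two pure-Python baselines just described can be sketched like this (function and variable names are mine, not from the slides):

```python
import random

# Hypothetical test data: 80,000 smallish integers, as in the talk.
ints = [random.randrange(10_000) for _ in range(80_000)]


def sum_of_squares_builtin(values):
    # Baseline 1: the sum() built-in with a generator expression.
    return sum(x * x for x in values)


def sum_of_squares_loop(values):
    # Baseline 2: an explicit loop -- the version that Cython can
    # then speed up once static types are added.
    s = 0
    for x in values:
        s += x * x
    return s


assert sum_of_squares_builtin(ints) == sum_of_squares_loop(ints)
```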
So the iteration over the list is still a Python operation, but everything we then do with the values in the list now uses C long integers and runs in plain C. You can see that from the annotated version: the s += line is now a straight C operation. Okay. When you time this, it's visibly faster, and we are at a relative runtime of 0.13 — so that's a lot faster. We can go even faster, because Python lists are not a very efficient way to store integers, or basic native data types in general. Commonly, there are two data types that people use for this in Python: the Python array type, and obviously NumPy arrays — we'll get to those later. For now, there's this nice array type in the Python standard library which allows you to efficiently store plain native C data types like char, int, double — all these native data types that you can work on; you can look them up in the library documentation. The way we use it is: we create a new array and tell it what data type we have — the little 'l' says we want C long integers as the native data type — and then we just put our list of values in there, and they get copied into a straight memory array of C long integers. Now we can run the whole thing on this array type, and we can see it's about the same speed, a tiny bit faster in this case. Okay. Compared to unpacking the integers directly from the list, that makes it what, 20% faster, and we get down to a relative runtime of 0.10. So this is already a bit more efficient than the list. But we can do even better by using the buffer protocol. The buffer protocol is a feature built into Python that allows native code to efficiently access data buffers. It's a protocol in the runtime that allows different extensions, different Python modules, to access the same data. 
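A minimal sketch of that array idea in plain Python, using the standard library array module (values here are made up):

```python
from array import array

ints = [3, 1, 4, 1, 5, 9]

# 'l' is the type code for a signed C long; the values are copied
# into one contiguous C memory buffer instead of a list of
# individual Python integer objects.
arr = array('l', ints)

assert list(arr) == ints
assert arr.itemsize >= 4          # stored as plain C longs
assert sum(x * x for x in arr) == sum(x * x for x in ints)
```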
So what we do is: instead of just saying there's an argument ints, we say this argument ints actually has a type, and that type is a one-dimensional C array of longs — long[:], with the single colon meaning one dimension. What this gives you is a view on that data buffer; we call this a memory view. When we do the calculation now with this typed input argument, typed as a memory view, we can see that it gets way faster. We're still passing in the Python array here, but what Cython can do now is unpack the data buffer directly and access the C memory. It doesn't need to go through the Python interface of the array anymore, it doesn't need to read Python integers; it can directly access the internal memory buffer of the array, and that gives you C speed in the end. The factor is really huge in comparison. Okay. And the same actually works with NumPy arrays, because NumPy arrays also support this buffer interface. When I create a one-dimensional NumPy array from the list and pass it into the calculation function that uses memory views, you can see that it's about the same speed in the end. And I can decide which data type I want to use. So — maybe a little small, I hope you can read it — here's a little comparison table of the capabilities of the different data types and the performance implications they give you. Now I'll pass over to David to present some more low-level data structures. Hi, yes. So I'm mainly talking about C++ and how to use the built-in containers from C++ in Cython. But just to start off, I was going to show you how to allocate raw C arrays. To do this — if we can move on to the next slide — we're using malloc and free, which, for those that are familiar with C, come from the C standard library, and they allocate a bunch of memory. 
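Python's built-in memoryview uses the same buffer protocol as a typed Cython argument like long[:], so the no-copy idea can be sketched without Cython at all:

```python
from array import array

arr = array('l', range(5))

# memoryview() asks the array for its buffer via the buffer
# protocol: no copy, just direct access to the same C memory.
view = memoryview(arr)

view[0] = 42                  # writes through to the array's buffer
assert arr[0] == 42
assert view.format == 'l'     # C long items
assert view.ndim == 1         # one-dimensional, like long[:]
```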
Well, malloc allocates a bunch of memory, and free then gets rid of it. I guess the important thing to emphasize here is that it's entirely your own fault if you forget to free it. While Python objects free themselves when the last reference goes away, with malloc and free you're getting a raw block of memory directly from C, and you're responsible for allocating it and freeing it. Just to show how you would use it in a sort of C example: what we're doing here is allocating the memory, copying our list into it, and then iterating through it as before. Obviously this is slightly artificial — we have to iterate over the list anyway just to copy it. So the first thing we do is declare a long pointer, then we call PyMem_Malloc, which is the Python allocator's version of malloc. Then we have a small loop that iterates over our list of integers and copies it into our memory, and then we do our sum of squares. It looks very much like a normal Python loop, but that loop is very fast and efficient, because it's accessing the C memory directly. And after that's done, we can free our C memory. To free it, we use this try/finally pattern, which is the most reliable way of ensuring that we actually manage to free the memory. Obviously, for this very simple example it's a bit pointless to iterate over a Python list copying, and then iterate over a C array during the calculation. But you can imagine that this is a quick way to allocate memory, and for something more complicated it might well be worth it, if we want a sort of scratch buffer to work in or something like that. 
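A rough stand-in for that pattern using ctypes from the standard library instead of C-level allocation — note this is my own analogue, not the talk's code, and unlike malloc, ctypes frees the buffer for you when it is garbage collected, so no try/finally is needed here:

```python
import ctypes

ints = [2, 4, 6, 8]
n = len(ints)

# Allocate a raw C array of longs and copy the Python list into it,
# roughly what the PyMem_Malloc + copy loop does in the Cython code.
buf = (ctypes.c_long * n)(*ints)

s = 0
for i in range(n):
    # Index into raw C memory, no Python integer objects in the loop
    # body on the Cython side.
    s += buf[i] * buf[i]

assert s == sum(x * x for x in ints)
```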
Just in terms of the timing, what you find is that it's pretty good, but you have had to copy a whole block of memory, so you've got an extra iteration. So yeah, I think we can move on to vectors next. The C++ vector is, for those of you that know C++, a typed container. In this case we're using the square-bracket syntax to say that we want to store a vector of longs, and in C++ this creates what's called a template type. What you find is that this type behaves a lot like a Python list, except it can only hold one type within it — here, only longs. When you look at the code, it looks a lot nicer than the C version, and the iteration looks very nice: you can see we can just do for value in ints, it adds up, and everything's nicely typed. Just looking into that iteration, for those of you that know C++: it iterates using the begin/end iterator protocol. That's the generic way in C++ to define a container that can be iterated, and what you find is that Cython can cope with pretty much any container that's iterable. But the timing ends up slightly slower, and this is because there's an invisible change of data type when you first call it: it needs to go over your entire Python list and copy it into the C++ container. That's automatic, which is really nice, and it's fairly quick, but you've got to be aware that it's happening. Any time you're passing data to and from Python, you have these invisible conversions. What you find is that the conversion costs you a little bit of time, about 20%; but without the conversion, it runs at pretty much the same speed as the memory view examples and that kind of thing. 
So if you can work directly with the C++ types and avoid the conversions, they can be a really nice way to work — but be aware they're happening. The hidden auto-conversion is my big message about the C++ types. If we move on to the next slide: one nice thing you can do with these containers is that you can expose them back to Python. Here, what I do is cast one to a memory view. I've got a container, I fill it with values — in this case the equivalent of a range — and then I expose it as a memory view, just by using the length and getting the pointer out of the container. The memory view knows how long it is, and the memory view is understood by Python. So what I can do is call a NumPy function, the NumPy sin function, on our exposed container. If I've decided to write something in C++ because it works faster for my particular case, but suddenly find myself needing a NumPy function, these options are available to mix and match and go between them. And the important thing to emphasize is that there's no copying here: it's a memory view of the C++ container, rather than a copy. But you do have to be careful that the C++ container outlives the memory view, otherwise you're in all kinds of exciting trouble. The final thing to mention is C arrays. In contrast to malloc, you don't have any choice about their length at runtime, so they really don't work for our sum-of-squares example; but you can allocate a small array with a static length on the stack, and Cython provides nice tools to slice them, to copy between them, to assign to them from lists, and that kind of thing. 
These are things that are actually surprisingly painful to do from C, but Cython goes a long way towards making them look like Python types, so you can iterate over them without upsetting yourself too much. You just have to be aware that the length is fixed, which really doesn't suit all problems. So, going back to our table: the advantage of the vector is that, while it can't hold Python objects, it can hold pretty much everything else you can think of — structs and more complicated data types — and it is easily resizable, and resizable more quickly than something like array.array. So it's not that it can do anything absolutely unique, but sometimes the combination of things it can do makes it a convenient tool for the job. And I think that's the end of this part, so I'll hand back to Stefan for the moment. Wonderful, thanks. So we've heard about Python sequences and native data type sequences, and how we can use those from Cython. How about Python dicts, the second important data type that we have in Python? Imagine that we have a repetitive list of four-digit integers, and we want to convert those into strings. Okay, so in Python, we would just use a list comprehension calling the str function on each number, and then we get a list of strings — that's fairly fast. We can do the same with map, so this creates a list by mapping the str function over the numbers; also fairly fast. Now the thing is, since it's a very repetitive list, we get lots of identical strings in the end, and that uses more memory than we want to invest for this. So we use a cache: we use a dict as a cache, to only create strings that we haven't seen before. We just run over the integers and look each one up in the number cache; if it's in there, we take the string from it. 
If it's not in there, we insert a new string for it. And then we build a list of basically unique strings in the end — duplicate references to identical string objects — so that uses less memory in the end. When we do this in Python, it's clearly a lot slower than the very simple list comprehension or map version: a factor of about 1.7, so almost twice as slow. But what happens when we do this in Cython? Let's just compile what we have: take the function as it is, push it into a %%cython cell, compile it. What we can see is that it's actually faster than the map version now. Why is that? Well, Cython understands what a dict is. It understands the Python operations that we do in our code, and it can speed up the iteration, the dict access, and all that. That already gives us a 10% speed advantage over the previously fastest Python implementation that we've seen, and it's about twice as fast as the interpreted version of the loop. Cool. Let's see what else we can improve. Are f-strings fast? Everyone likes f-strings — let's use an f-string here, instead of calling the str function. And we can see that makes it another bit faster: we are down to 0.8, pretty much a 20% gain compared to the map version. So that shows you that f-strings are also very fast in Cython. And as we saw before, Cython supports static typing, so let's use static typing. We can type our integers as soon as they come into the system: the loop variable that we have here, which takes the integers one after the other and looks them up in the cache and all that — let's type that as a C long. Okay. And what happens? It actually makes it slower. Why did it become slower? Well, the reason is that a lot of the operations here — the dict lookups and all that — actually use objects. 
So now that we've unboxed — unpacked — our Python integer object into a C long, we need to box it again into a Python integer object in order to ask the dict whether it's contained in there. We've actually introduced a lot of back-and-forth conversions between C data types and object data types, and that's costly. So let's use C integers only where they help. We keep our algorithm as it was before: we take the Python objects from the list one by one and look them up in the dict, and only when we do the integer-to-string conversion at the end do we tell Cython, please unpack this into an integer, and then pass the C integer into an f-string. And that is actually faster. So that gives us another speedup, because Cython can now do the C integer conversion to strings, which is faster than the generic Python object-to-string formatting. Okay, and now we're back to an example that David has written. — Right, so I just wanted to show some of the more obscure C++ data types. This example sounds slightly artificial, but I have used it myself. Calculating the rolling median is a common operation for smoothing images and time series, and we use the median because it's quite good at ignoring big outliers. I've got a picture on the next slide which shows how we might write a naive implementation of this. We're doing a median over five elements: we take a chunk, and then we do a partial sort to find the middle number, and we have to do a copy each time. If we go on to the next slide, the actual code here — what I've done is I've written it just in NumPy. We use a memory view, there's a bit of thought about picking out the start and end index, and then we just call numpy.median on the chunk we've selected. And that's basically it; it's a fairly simple implementation, and it does work. 
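The cached conversion just described might be sketched like this in plain Python (function name and cache layout are my own):

```python
def numbers_to_strings(numbers):
    # Cache so each distinct integer is converted only once and
    # duplicates share a single string object, as in the talk.
    cache = {}
    result = []
    for n in numbers:
        s = cache.get(n)
        if s is None:
            s = cache[n] = f"{n}"   # f-string instead of str(n)
        result.append(s)
    return result


strings = numbers_to_strings([1234, 5678, 1234])
assert strings == ["1234", "5678", "1234"]
assert strings[0] is strings[2]     # duplicates share one object
```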
But what you find is that it isn't phenomenally fast. So the next step — yeah, this is just creating some arbitrary data to test it — we find it takes about 360 milliseconds, so it's not phenomenally fast. The next thing to do is to avoid the Python call to numpy.median and use C++ instead. This looks like a fairly horrendous block of code, but actually a lot of it is declarations, so it's not quite as bad as it looks. The main thing I'm doing is using nth_element, which comes from the C++ standard library and does exactly what we want: it does a partial sort of the array, so that everything below the nth element is less than it and everything above the nth element is more than it. And that's almost it — there's then a tiny bit of thought about what happens if you've got an even window length. Moving on: to call that, we've got a further function, and it makes one significant saving over the Python version, which is that it allocates the working array once, because it has a known length. So we only need to do one allocation, and that's a fairly significant saving. And then it calls this calculate-median function that I've written. A final thing I want to note is that I quite like this pattern for allocating values that we return to Python: we've got a Python object, which is just the NumPy output array, and then we've got a view on it that lets it run fast in Cython; but when we get to the return statement, we just return the array object. That's a pattern I like a lot. And this slide is just noting that I've avoided the per-iteration allocation. What we find is that this is actually fairly significantly faster: it takes about 4% of the total time of the NumPy version. So it's a worthwhile optimization — but there's a better way of doing it. 
The better way is to keep two sorted containers of the values above and below the median. If we go on to the next slide, that has a little diagram. Essentially, as we go, we can quickly add and remove elements from them: in the first step we remove five, because that's the element we're no longer looking at, and we add eight, the element that's entering the rolling window. We can add these very quickly to the containers because they're sorted, and because they're sorted, we can then just look at the two end points, and the median is right there at the boundary — it's already in place because everything is sorted. All we need to do is rebalance the ends. This unfortunately becomes quite a long example, which I won't go over in immense detail, but it's just this: we're taking one element out, putting one element in, and then rebalancing. And the upshot is that it ends up fairly significantly faster: about 15% of the time taken to run the other C++ example. So it's a really quite big saving, just because we've used the right algorithm. And I guess the point I wanted to make here is that the C++ containers aren't just Python containers but faster — vector as a kind of faster list. Here, I've got access to a container that just doesn't exist in Python, and so I've managed to use the right algorithm without having to reimplement a huge container type myself. So the message is: by using C++ containers, you can get access to nice stuff that doesn't really have an equivalent in Python. And yeah, that's all I wanted to say on this particular subject, so I'll hand back over. Cool. 
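As a sketch of the sorted-window idea in plain Python — the standard library has no multiset, so a single sorted list maintained with bisect.insort stands in for the two sorted C++ containers; the point is the same, incremental updates instead of re-sorting every window:

```python
from bisect import insort


def rolling_median(values, width):
    # Keep the current window in sorted order: insert the incoming
    # value with insort, remove the outgoing one, rather than doing
    # a (partial) sort of every window from scratch.
    window = sorted(values[:width])
    mid = width // 2

    def median():
        if width % 2:
            return window[mid]
        return (window[mid - 1] + window[mid]) / 2

    out = [median()]
    for i in range(width, len(values)):
        window.remove(values[i - width])   # O(width); fine for a sketch
        insort(window, values[i])
        out.append(median())
    return out


assert rolling_median([2, 80, 6, 3, 5, 9], 3) == [6, 6, 5, 5]
```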
So, a quick view of some advanced Cython language features that help you deal with data in a way that's efficient but not so directly available in Python — namely ctuples. If you need to deal with more than one value, like a pair, you can declare this in Cython as just a tuple of types, and internally this is implemented as a C struct. So it's a very efficient way of putting together a bunch of different data types that you would normally store in a Python tuple. One more thing: you may know the total_ordering decorator from functools. We now have an implementation of it in Cython that is very efficient, and you can use it on C-implemented classes, on extension types. If you put the total_ordering decorator there, it takes something like an equals and a lower-than — a couple of comparison functions that you've implemented — and expands them into support for all comparison operators on your type. Then you can do all kinds of comparisons with it, including those that you have not implemented yourself; you can see they're all there. And if you compare the timings, it's a lot faster, because it's a type implemented in C, so the comparisons benefit a lot from it. So if you need to implement some data type that suffers from comparison performance, consider turning it into an extension type, compiling it with Cython, and then using the total_ordering decorator to let Cython generate the comparison operations for you. Okay, more on NumPy. We've seen memory views before: we've used them to make data from C++ vectors available to Python, and to access the data stored in Python arrays and NumPy arrays. Now let's look into that a bit more. So here's a NumPy expression that we calculate on a two-dimensional array. 
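The stdlib decorator itself can be shown in plain Python (the Version class is a made-up example); the Cython variant applies the same idea to compiled extension types:

```python
from functools import total_ordering


@total_ordering
class Version:
    # Implement only __eq__ and __lt__; total_ordering fills in
    # <=, >, and >= from those two.
    def __init__(self, major, minor):
        self.major, self.minor = major, minor

    def __eq__(self, other):
        return (self.major, self.minor) == (other.major, other.minor)

    def __lt__(self, other):
        return (self.major, self.minor) < (other.major, other.minor)


assert Version(3, 0) > Version(2, 9)     # generated from __lt__
assert Version(1, 2) <= Version(1, 2)    # generated as well
```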
If we do this in NumPy — let's see — it takes about 800 milliseconds on a fairly large array. Now let's see what Cython can do for us. The first thing we can do is spell out this calculation as an actual loop: a two-dimensional loop running over the arrays and doing the calculation from two input arrays into a third output array, item by item. Okay, nice. When we do this, we get a bit more than a two-times speedup, simply because we can now unpack the data and do the whole calculation directly in C. But let's see if we can speed it up further. The first thing we look at is the annotated version, and it tells us there are actually still Python operations going on in the last line. When we look more deeply into it, we can see that Cython's calculations here are safe: the memory view access takes into account Python's ability to use negative indexing, and it detects out-of-bounds access and raises an IndexError for it. But our algorithm is written in a way that makes sure we never run outside the bounds of the arrays, so we can disable these safety checks — take off our safety belt, because we know it's going to work. When we disable them, it gets another bit faster, usually 10% or so. So this is something you can do in Cython: you can really go down to C level and trade those safety checks for speed. Cython comes with another related feature, and that is Pythran support. Pythran is basically a NumPy expression compiler: it compiles NumPy expressions to C++, and you can use Pythran as a backend in Cython. Then you can take the plain NumPy expression that we had before, with a couple of type declarations in your code, and have Pythran compile that expression for you as part of the Cython compilation — and that's also a nice speedup that you get in the end. 
It's about the same performance that you get with the loop — the loop tends to be a little faster overall, but it totally depends on your specific code. And there's one more thing you can do in Cython: since we're doing all these calculations down in C space now, you can use multi-threading. You can release the GIL while it's doing the calculation and run it in multiple threads. Cython has OpenMP support, and the way to use it is the prange function, the parallel range function, instead of the range function for the iteration. When you use this in your code, you just replace range by prange and tell it to release the GIL while it's running over your loop. That gives you another nice speedup — for this specific example it's not huge, but in many cases, if it's really independent data and large arrays, you typically get something close to a factor of the number of threads. And one thing to keep in mind, as David already mentioned before: it's a good idea to take the creation of the output array out of your benchmarks and out of your hot code, because often you already know where to put your data in the end, or you can even do the calculation in place on one array. If you take that out and look at the bare calculation, without the additional allocation for the output array, the difference is again very visible. Okay, so for the takeaways — what have we learned today? Direct memory access is great. It's wonderful to be able to unpack data structures in Cython, even high-level data structures, as long as you can unbox them. So the extra toolbox that Cython puts in your hands: you can unbox the built-ins, and you get direct native access to NumPy arrays and the other data structures — you can use them directly in your code. And with the new features in Cython 3, you can now do all this in pure Python syntax as well. Okay, cool. Thank you for listening. Thank you. 
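A tiny sketch of what that pure Python syntax means: the function below is ordinary Python and runs as-is, and, as I understand it, Cython 3's pure Python mode can pick up the same annotations to generate typed C code when the module is compiled (for full control you would import cython and use types like cython.long):

```python
# Plain Python with type annotations -- runs unchanged in CPython,
# and Cython can use the annotations when compiling the module.
def sum_of_squares(values: list) -> int:
    s: int = 0
    x: int
    for x in values:
        s += x * x
    return s


assert sum_of_squares([1, 2, 3, 4]) == 30
```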
Okay, thank you very much. So let's play the applause. Thank you. Right, so we have a couple of questions — more questions than we can run through here, but you can go to the breakout room afterwards and answer the others there. The first question is: is the Jupyter notebook or the slide deck available somewhere? Because it has so many nice hints and tips and features, and it's really interesting. Sure, we can make it available. Yeah, I'll post a URL in the room chat somewhere. And related to this, I have a question as well: is there something like a Cython cookbook, which collects all these tips and features that people might not directly find in the documentation? I think one was published a few years ago. Yeah, there's a book on Cython available; it's already a couple of years old by now, so it doesn't have all these nice pure Python syntax features in there, which are still pretty recent. But the documentation has it all — we're keeping the documentation updated, definitely. Okay, great. So, next question: when will Cython 3 be released? Yeah — we're still working on that. There are still rough edges that we want to finish before releasing it, but it's really getting there now. We're adding the features that we want in it and picking the stuff that we want to take into the next major version. But you can already use it, that's the cool thing — just get the latest alpha. It's called alpha because it's not feature-complete yet, not because it's going to break your code or do anything bad to you. It's working, and you can use it right now, even for your production code; I can recommend it. Okay, excellent. So let me see — I think we can do one more question, so let's take a more technical one: is it possible to use Cython to write custom computations for data being held in the Apache Arrow memory format? 
Yeah, so the Python wrapper for Apache Arrow, PyArrow, is actually written in Cython. The PyArrow implementation was written by the pandas people, Wes McKinney amongst others, and they're big fans of Cython — they're using it for a lot of things, and pandas itself has a huge bunch of Cython code as well. So, PyArrow — yes, you can definitely use this from Cython. Okay, excellent. So thank you very much for answering the questions. I will post the other questions to the breakout room, so perhaps you can go there and join the Jitsi to answer them. Thanks again for the talk — very nice, very interesting, lots of detail, lots of hints that are probably hard to find otherwise. And yeah, enjoy the rest of the conference. Thank you.