So, yeah, we'll be talking about Cython today. Cython is a language very similar to Python that allows us to write code that executes almost at the speed of C, but with the convenience of a higher-level language. The example code for today is uploaded on bSpace. You can also grab it off my Git repo — the username is stefanv, and it's under the repo cython_ay250. That's just an archive of the different source files that we'll be looking at today.

Right, so many of you might have seen this diagram before. It illustrates the typical trade-off we make when we work with scripting languages versus compiled languages. The y-axis indicates ease of use, and the x-axis execution speed. Ease of use you can also think of as the amount of time you put in when you write the program, and execution speed is how much time it takes the computer to then evaluate that code. Now, your time is obviously much more valuable than the computer's, and that is why we choose to use Python. Python saves us a lot of time, because we can basically take our mental model of a piece of code and put it straight into source form. Python reads a lot like English, so it's very easy to express your thoughts in it: easy to use, low coding time. On the other extreme we have languages like C++. C++ can be fairly hairy to code, but once you've written your piece of C++ code, it executes extremely rapidly. Another language in this domain would be Fortran, although I might even argue that Fortran is a lot simpler than C++ to use, so we could probably move it somewhere else on the graph. What we're going to try and do today is to bridge those two worlds: to start from a high-level, easy-to-use language and get the same performance we would out of a language like C++. Now, there are many projects that already aim to do this.
In SciPy we've got Weave, which allows you to take a C snippet as a string, compile it, and execute it. SWIG allows you to wrap C and C++ code. numexpr is a project that does just-in-time compilation of expressions. And then of course NumPy is one of those efforts as well, because NumPy wraps the lower-level linear algebra and other libraries — so it's essentially Fortran and C being executed every time you do a dot product. Matrix multiplication, broadcasting, all those things execute in C, but you drive them from a much higher level.

Unfortunately, NumPy is limited if you've got a certain kind of algorithm. NumPy is ideal when you've got a large block of numbers and you want to execute the same operation on all of those numbers at the same time. But suppose you have a matrix of numbers, and you do some computation on element x that influences the computation you're going to do on y, and that in turn influences the computation you're going to do on z — or "zee", as you guys call it. This kind of operation you can't really express in NumPy, because if you write something like "my output array equals my input array times 2", or any expression of that form, it will just operate per element. So what do we have to do? We can split it up into two for loops: for i in range(rows), for j in range(cols), and then say A at position i, j equals some expression involving neighbouring values of A that have already been computed. So what's the problem with this? Is this something we want to do in Python? Why not? Right — it's slow. But why is it slow? Why is Python so ridiculously slow when you have two nested for loops like this? Well, Python is an interpreted language, right?
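A minimal pure-Python sketch of the kind of dependent update being described — each entry uses neighbours already rewritten earlier in the same sweep, which is exactly what a per-element NumPy expression can't express. The particular update rule here is made up purely for illustration:

```python
def sweep(a):
    """In-place sweep where a[i][j] depends on already-updated neighbours."""
    rows, cols = len(a), len(a[0])
    for i in range(1, rows):
        for j in range(1, cols):
            # Uses a[i-1][j] and a[i][j-1], which were rewritten earlier
            # in this same double loop -- a vectorized NumPy expression
            # would read the *old* values instead.
            a[i][j] = a[i][j] + a[i - 1][j] + a[i][j - 1]
    return a

print(sweep([[1, 1, 1],
             [1, 1, 1],
             [1, 1, 1]]))
# → [[1, 1, 1], [1, 3, 5], [1, 5, 11]]
```

Each value in the interior feeds into the ones computed after it, so the loop order is part of the algorithm — and that is the case where the raw Python loop becomes the bottleneck.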
It basically executes your source code line by line, so Python has no real knowledge, when it enters these for loops, of what comes later on. It can't do any kind of optimization; all it can do is step over each statement, over and over and over again. And this is the process we're going to try and optimize today using Cython.

Right, just one last comment on this trade-off of ease of use versus speed: there's something called the tea constant. That's when your experiment starts to run for a period of time just long enough that you no longer want to sit and watch it, but would rather go and make a cup of tea, or walk around, or do something else. So sometimes having your experiment run in ten seconds instead of five minutes makes an enormous difference — sometimes it is worth optimizing. Now, you've probably all heard the expression "premature optimization is the root of all evil." That's partly why this is the last lecture, but it's also a good guideline to keep in mind: we don't want to optimize unless we really need to.

Right, so like I said, Cython allows us to cross this gap: high-level language, low-level performance. It's great because we get to code in a language that is essentially Python — it's a superset of Python, so it's Python with a little bit of syntax added on top — but with the same speed advantage as C. When I taught in Italy, I came across an interesting saying. I can't pronounce it, but what it comes down to is that you can't have an inebriated wife and a full bottle of wine. And I think that is often the idea people have: that you cannot have high-level code and low-level performance. But we'll see that that's not really true. I think you guys say you can't have your cake and eat it too. So what kind of improvements are we looking at here?
Well, if you just take a piece of Python code and compile it with Cython without any changes, we're probably looking at about a 30% speed increase — not really worth writing home about. But the loops do get quite a bit faster; they run at something like two to eight times the speed. And then I'll show you that if we actually provide Cython with the types we're using — if we tell it we're working with integers here, we're working with floating-point numbers there — we can go hundreds of times faster. So that's how we're going to reach warp 10.

These are four of the most common use cases for Cython. The first is, of course, optimizing the execution of existing Python code. The procedure is that you profile your code — have you guys done profiling at all? I'll show you an example of that. You basically execute your code, time how long the different functions take, take the slow one that's causing the bottleneck, and replace that function with Cython. We also often want to wrap existing C or C++ code; we'll see examples of that. And then Python has this annoying thing called the global interpreter lock. If you try to split your program up into threads in Python — say you've got eight CPUs in your computer, so you believe you can split your operation into eight blocks and have each CPU do an eighth of the work for you — you're going to run into problems. The global interpreter lock means that none of those threads can access Python objects without first acquiring the lock, and what that comes down to is that only one thread executes at a time. So Python doesn't make it easy for you to use threads; you typically have to split up into multiple processes instead. But because Cython compiles to C, you don't have that constraint. So we'll see how to work around the global interpreter lock as well.
And then the last use case is when you just want to mix C and Python, but you don't want to use the Python C API directly. I don't know how many of you have seen the Python C API — not pretty. We'll have a look at the C code that Cython generates, which is a little bit uglier than what you'd write by hand, but it will give you a good indication of what you're being spared here.

Right, so for this tutorial this afternoon, I'll take a piece of pure Python code, we'll benchmark it, go through the profiling process I just described, and we'll find that it's too slow. Then we'll take that code and run it through Cython without changing anything, and we'll find that it's a little bit faster — like I said, about 30% faster. Then we'll tell Cython a little bit about that code — we're working with integers, we're working with doubles, et cetera — and we'll find that it's a lot faster. When we're done with that example, I'll show you how to integrate NumPy arrays with Cython, how to use multiple threads despite the global interpreter lock I was just referring to, and finally how to wrap C libraries in Cython, so you can access those C libraries straight from your Python scripts.

From Python to Cython, then. This example is just numerical integration. Your function is this red curve here, and these are the sampling points. So how do we approximate the integral of this function? We construct little blocks underneath the curve and add their areas together. You can see that this interval is divided into one, two, three, four blocks, so n equals four. And the width of each block is just the total interval — 0 to 2, so 2 in this case — divided by the number of blocks, which gives 0.5 as the base width of each block. The height is whatever the function value is at that point.
Then you calculate the area of each block and add them all together, and that gives you a rough approximation of the integral of your function. Here's an example where I made the blocks even smaller, and you can see that the smaller the blocks become, the more closely the integral is approximated.

So how would the computer code look for computing that? Have you seen the first line of code here? from __future__ import division just means: don't use integer division; a single slash always does true division. This is the standard behaviour in Python 3, by the way. Then we define the function we'd like to integrate — I just made up a function, x to the power of 4 minus 3 times x; it could be whatever you want. And here's the real bit of code, the contents of integrate_f. We provide it with three parameters: a and b, the start and end of the interval over which we want to integrate, and then n, the number of blocks we want to use. First we calculate the base width of a block — the interval divided by the number of blocks. Then for each block we compute the function value at that point, add them all together, and finally multiply the sum by dx. Do you guys roughly follow what I'm doing here? Is it more or less clear? OK. It doesn't really matter exactly what we're computing, as long as you have an intuitive idea of what's going on.

All right, let's see what happens when we execute this piece of code in Python. I import integrate, and I call integrate_f. We can choose any interval we want, so 1 to 10, and say we want 10,000 blocks. It tells us that the answer is 19,000-something. Now, do you remember the timeit command in IPython? timeit executes a function numerous times and then picks the minimum execution time. Why does it do that? Why does it pick the minimum?
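Reconstructed from the description, the pure Python version looks roughly like this (a sketch — the names follow the lecture, and the loop implements the left-rectangle rule just described):

```python
def f(x):
    # The made-up integrand from the lecture.
    return x ** 4 - 3 * x

def integrate_f(a, b, n):
    """Approximate the integral of f over [a, b] using n rectangular blocks."""
    s = 0.0
    dx = (b - a) / n          # base width of each block
    for i in range(n):
        s += f(a + i * dx)    # height of block i at its left edge
    return s * dx             # sum of the block areas

print(integrate_f(1, 10, 10000))  # the "19,000 something" mentioned above
```

The exact value of the integral from 1 to 10 is 19851.3, and with 10,000 blocks the rectangle rule lands within a few units of that.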
If you had to benchmark a piece of code — execute it to see how fast it runs — how would you approach it to get an accurate answer? Would you run it once? What's the problem with running a piece of code once? Right, that's one option. What else? Averaging is another suggestion. But say you ran your function 100 times, and during the first 50 runs some system process spun up like you just described, while during the last 50 nothing interfered — then your average would be a mixture of really bad timings and good timings. So that's an option, but any other ideas?

What IPython does is execute your function many times and take the minimum run. Even if there are system effects: normally, the first time you run a function, the computer needs to load a lot of stuff into memory, and that's a very expensive operation — your CPU is maybe a factor of 10 or 100 faster than your memory, so it's really slow to access main memory. The first time you run your function, data gets moved into memory; the second time, that data comes straight from the cache, the memory right on the CPU, so it's super quick. So the first timing is already useless — you can throw it out. And like I said, if anything spins up during your trials, those runs are useless too. In the end you're just looking for the fastest timing your computer can achieve, and you get that by taking the minimum.

Yes? What if it is deployed? Right — if you really want to know what you'd get on a production system, then maybe the suggestion of averaging over a very large number of runs would be the better approach. But if you want to compare different versions of your own code, you need a consistent way to measure, and for that the minimum is really good.
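You can see the same minimum-versus-mean reasoning with the standard library's timeit module, which is what IPython's %timeit wraps (the workload here is arbitrary, just something to time):

```python
import timeit

def work():
    # Arbitrary small workload to benchmark.
    return sum(i * i for i in range(1000))

# Run 5 independent trials of 200 calls each, like %timeit's repeats.
trials = timeit.repeat(work, number=200, repeat=5)

# The minimum is the cleanest estimate: caches going cold and background
# processes can only ever make a trial slower, never faster.
print(min(trials), sum(trials) / len(trials))
```

On a busy machine you will typically see the mean sit noticeably above the minimum, which is exactly the contamination the minimum throws away.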
OK, so currently we're standing at about 6.22 milliseconds per loop. This is a fast operation, so we don't really care yet — but let's take 100,000 points instead. Yes, and now we're a factor of 10 slower. It's starting to creep up, and I could keep increasing that number until we really end up in trouble. So let's just do the rest of these experiments with 100,000 points.

All right, the next thing I'm going to do is take that exact snippet of Python code and run Cython on it. Do you all have a working version of Cython? You can import Cython inside your interpreter, and you should be able to execute the cython command on the terminal as well. You run cython and then the filename, which is typically a .py file or a .pyx file — .pyx is the Cython-specific extension. So let's see what happens. I'll just adjust the font — can you see at the back, or should I make it a little bit bigger? Is it all right? Okay.

So you run cython on your .py file, and then you should have a file called integrate.c there that looks like this, just filled with C code. Let me load that up so you can see. Yeah, this is really pretty — this is the kind of thing you'd maybe not like to write yourself. Like I said, it's a little bit uglier than what you would have written by hand, because of all those __pyx_ prefixes and weird variable names; that's just Cython making sure everything is unique. But it doesn't look that much better when you do this by hand, because when you work with the Python C API you have to do reference counting, allocate and deallocate objects, track all sorts of things you wouldn't normally do. So it gets messy fairly easily. Inside the file you'll see comments with the Python source, and right below them the C code that does exactly the same thing.
So if you can somehow wrap your mind around all these __pyx_ names, somewhere in here we'll find the x to the power of four. Let's see if we can spot it. I see a power — PyNumber_Power, some variable to the power of the integer four. Then there's a minus somewhere; there's the multiply for the three times x, there's the subtraction, and then that gets returned. Oh, and that one — I guess that's a Cython-specific macro they defined. Say what? I guess it is. What was that language that implemented not goto but comefrom? I always found that rather amusing — it would make for really hard-to-understand code.

Anyway, this is the kind of thing Cython generates for you, and then you can compile it. The first line you see on the screen here is used to compile it: I just call the GCC compiler with the -O2 flag for some optimization, give it the Python include path, and it takes our C file and produces a shared object. Say again? Well, if you ever want to know what a GCC flag does, you just load up the man page — -fPIC, there we go. It says it emits position-independent code, which means your code can be relocated in memory and still execute fine; it doesn't hard-code memory addresses, I presume. Python loads this as a dynamic library, so you need that. Right, so when you execute that command you get the .so file out, and Python can import it: you do import integrate_compiled and Python will load it up.

You don't usually want to remember that GCC syntax, because sometimes your system looks a little bit different — you need different compile flags and so on. So we'd much rather write a setup.py file. You just construct a setup.py file and you put this in there.
It imports a couple of things from distutils — the compilation tools — and it also imports a special build helper from Cython. Then we write this little snippet: we call setup, we use Cython's build_ext command, and we add an extension. It's named integrate here — that's a typo, I think; it should be integrate_compiled — with integrate.py as the source. Basically you tell it the name of the extension you want to build and which source files go into it, and then you run python setup.py build_ext with -i to build it in place, in the current directory, and it will generate that GCC call for you.

So maybe, if you could all go into the demos directory, into the integrate subdirectory there, and just type python setup.py build_ext -i and see if it builds for you. Let me make the font a bit bigger — oh, that just messed up my colors. Did you compile? Anyone else manage to compile it? We're good? Okay. So you see, there's the GCC line I showed you. On my system it's a little bit fancier, with some other flags in it, but once you've written your setup.py file you don't have to care. So we generate the integrate_compiled Python module, and the source file is integrate.py.

Okay, let's import that into Python. I'll first import integrate, the pure Python version, and then import integrate_compiled. Oh, what happened there? Okay, there we go — I just removed the stale integrate.c file; I'm not sure why generating it by hand earlier caused problems. You remove integrate.c and rerun python setup.py build_ext, and then it works. Which file is being imported? integrate_compiled, from the .so file.
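The setup.py being described looks roughly like this — a sketch using the distutils-based recipe of that Cython era (newer Cython versions prefer `cythonize`, so treat the exact imports as era-specific):

```python
# setup.py -- build the integrate_compiled extension from integrate.py.
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    cmdclass={'build_ext': build_ext},   # let Cython drive the build
    ext_modules=[
        Extension('integrate_compiled',  # name of the resulting module
                  ['integrate.py']),     # Cython source file(s)
    ],
)
```

Running `python setup.py build_ext -i` then generates and runs the GCC invocation for you, leaving `integrate_compiled.so` in the current directory.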
Yeah, but you don't type the .so — it's a dynamic library, but it's a Python extension, so you just import the module name.

All right, let's time those two. We've got our original function integrate_f; let's go from 0 to 10 and, as we said, take 100,000 points. How long did that take? About 66 milliseconds. And now we'll do the compiled version. There we go: 66.4 milliseconds for the original, 47.4 for the compiled version — a 1.4-times speed increase. That's not incredibly exciting, but it's something. That's a good question — yeah, I'm not certain, but they do give the same answer. Okay.

Right, so we used timeit, and we saw there was about a 1.4-times speed increase. Now the question is: can we help Cython do even better? Yes, we can — we can give it some clues. We can tell it what types of variables we're working with. Cython does have a very basic type-inference engine, so it can guess to some extent what types your variables are, but it's very conservative, because it has to be careful: in Python you can assign anything to anything, so it must not apply an optimization to an object of the wrong type.

So why do you think it helps to provide Cython with this information? Why would Cython be able to make things faster if I tell it, for example, that i, some variable, is an integer? Yeah — without that, Cython doesn't know it's an integer. So what does it have to do whenever you touch the variable i? Exactly: if you have a simple for loop, for i in range(n), then when Cython compiles this it has to check the type of i, because it's going to be assigning different values into it. i could be a double, or an integer, or anything — in Python you're allowed all sorts of strange assignments to objects — so it can't just write the value straight in.
You have to make sure you're working with the right type. On the other hand, once Cython knows that i is an integer — because we told it so — then it doesn't have to do any checks, it doesn't have to do any unpacking of Python objects; it can just write the values one, two, three, four, five straight into that memory location. So it drastically reduces the overhead.

But doesn't it figure this out when it compiles — isn't Cython better than Python because it can work out on its own that i is never anything but an integer? No, it's not that clever. So let's look at how it works. The modest speed-up we saw so far is because the loop itself now executes in C instead of being stepped through by the Python interpreter — that's a little bit of a speed increase. But the problem is that what happens inside the for loop is still slow, so that doesn't help you much. Here's the code that was generated earlier; just squint and try to make out what's going on. You'll constantly see things like PyNumber_Power, PyNumber_Multiply, PyNumber_Subtract. Those are all calls into the Python C API. Each one says: I've got a Python object, I believe it's a number — I don't know whether it's an integer or a floating-point number, but it's some number — now do this operation, subtract these two Python numbers from one another. A Python object is a pointer to some place in memory, together with a description of exactly what is stored there. So what has to happen is that Python unpacks each operand: what's inside the first one? That might be an integer. What's inside the second one? That might be a float. Then it has to dispatch to the appropriate operation. Just a lot of overhead. So here's what we're going to do: we're going to tell Cython what these types are, and we'll see if that helps.
How do we do that? Well, Cython understands it when we just write the type in front of the parameter name: where you've got the x argument to your function f, you can write double there, so it now knows the x passed into f is always a double. And then inside our function, this is the alternative syntax: a cdef block — cdef meaning "define at the C level" — saying that s is a double, dx is a double, and i is a Py_ssize_t, which is basically C's integer type for indexing. And we also tell it that when we call integrate_f we'll be providing doubles a and b, and the number of blocks n, which is an integer. Right, so that's all I added.

Yes — cdef is special to Cython. This is one syntax; the other would be to simply write cdef in front of each declaration: cdef double s, cdef double dx. They're equivalent; when you're declaring a number of different variables it's just handy to open one cdef: block. But no, you don't need cdef when you annotate the parameters. And yes — plain Python can't understand this any more; we've now broken this file for Python, but Cython will know what to do with it.

Go for it. If you don't declare every variable, it will still run — Cython won't mind; it will just be slower, because it will then assume that the object can be any Python object. It'll still give you a speed-up, yeah. Typically, if you have a counter, for i in range(n), you want to declare that i, because it's touched on every single loop iteration. Anything used in every iteration — i, s, and in this case dx and a — you want declared. Any other questions? Does this look good? Okay, let's compile it and see what happens. So I just extended my setup.py file: I've got integrate.py, which was the original we wrote.
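Putting those declarations together, the typed version described here looks something like this (a sketch; the body is the same pure-Python loop as before, with only the type annotations added):

```cython
def f(double x):
    # Parameter annotated as a C double.
    return x ** 4 - 3 * x

def integrate_f(double a, double b, int n):
    cdef:
        double s = 0    # running sum of block areas
        double dx       # base width of each block
        Py_ssize_t i    # loop counter, now a C integer
    dx = (b - a) / n
    for i in range(n):
        s += f(a + i * dx)
    return s * dx
```

Note that this file is no longer valid Python — it must go through cython and a C compiler before it can be imported.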
I've got integrate_types, which is the version I just showed you, and then two more that I wrote. I'm just showing you that the setup.py file looks exactly the same for each different version. Okay, you call setup.py build_ext, it says all the extensions are built, so let's go into Python. We import integrate — the original — integrate_compiled, the plain compiled version, and integrate_types, the version where I specified the types. Let's time them: pure Python version, Cython version, Cython version with types. Okay, that looks a little bit better — we've shaved another 20-odd milliseconds off. So 66, 47, 23: we've roughly doubled the speed again. That's not bad, but earlier on I didn't talk about doubling speeds; I mentioned maybe a factor of 100. So this doesn't get us all that excited, right? What did I do wrong? Why isn't Cython producing what I promised it would?

Let's see what the bottleneck is. Here's what's happening. We have a Python layer, in which everything Python occurs, and a C layer, where things are really fast — Python slow, C fast. We call our function integrate_f with its parameters, and now we've got this very nice, fast for loop in C that we just defined; we gave it all the types, so we expect this loop to execute very rapidly. In C it looks just like: for i equals zero, i less than 100,000, increment the counter, and evaluate f(x). And that's exactly where the problem is. In every loop iteration we call our function f. But what is f? f is the function defined right up there — and that's a Python function. So inside your C for loop, you jump back out into the Python interpreter. The interpreter says: oh, you're trying to run f(x) — okay, f(x), you can run it, it's implemented in Cython.
So it sends you back to Cython, which calls the Cython version of that function; your s is updated and your loop goes around. But that's the big problem: you jump from C into Python and back again on every single iteration. How can we avoid that? Well, we can simply not make this function available to Python. Define it as a pure C function, because then the whole loop can execute on the C side and we never have to jump back into Python.

How do we do that? Where we used to have def f, we now say cdef f — define the function at the C level — and you have to provide the return type as well. Just like a function in C: you have to say what the types of the input parameters are and what the function returns. So you tell it this function is going to send back a double. Hmm? Yes — the second optimization I made is the power: you could have x to the power of four here, except that would call the C pow function from the math library, and it turns out to be quicker to just write x times x times x times x. And the line at the top tells Cython to use C division instead of Python division. Python division is overly careful: if you accidentally divide by a strange number like zero, Python will warn you. But we tell it to just use whatever implementation C makes available, which speeds it up even further.

Can you repeat the question? Yeah — the easiest way to check is just to look at what Cython generates. So let me open that code for you. I'll change it back to x to the power of four, and I'll call the cython command with the -a flag. -a does annotation, which means it generates an HTML file that I can open in my browser.
And that HTML file shows you the original code, and when you click on a single line it expands to the C code that was generated. This is quite handy. It also highlights lines: the darker they are, the more time they take to execute, so you can very easily see which lines you messed up on. If you didn't annotate a type, for example — let's remove one type annotation so you can see. Let's not declare the type of i, the counter. You see what happened? These two lines suddenly became very dark, because the for loop is now horribly slow: if you expand those lines, you see a lot of Python C API code that has to deal with those objects. Whereas the moment I declare that type and compile, look what happens: it essentially becomes a C for loop — for some variable equals zero, while less than some value, increment — a single line in C. But what I wanted to show you was x to the power of four. It's not highlighted; it is fast. It calls pow(x, 4). But that's still a function call every time, so if you write it as x times x times x times x, that takes the pow call out and makes it a little bit faster. Did you all see what I showed him?

Right, so let's benchmark those last two: integrate_types against this newest version. Okay, so there's the 23-millisecond version — the fastest one we had so far. And now the latest version of ours, with the function defined in C. There you go: we went down from 23.6 milliseconds to 704 microseconds. So far we'd had a 1.4-times speed-up and then a times-two speed-up — whoa, what did I just do? There we go. Is that right? 704 — yes. Okay, so I had a 1.4-times speed-up the first time.
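With the callback made a C-level function, the fast version reads something like this (a sketch; the cdivision directive and the expanded power are the two extra tweaks mentioned above):

```cython
# cython: cdivision=True
# ^ file-level directive: use raw C division, skipping Python's
#   zero-division checking.

cdef double f(double x):
    # x*x*x*x avoids a pow() call on every iteration.
    return x * x * x * x - 3 * x

def integrate_f(double a, double b, int n):
    cdef:
        double s = 0
        double dx
        Py_ssize_t i
    dx = (b - a) / n
    for i in range(n):
        s += f(a + i * dx)   # now a pure C call: no trip through Python
    return s * dx
```

The trade-off is that f is no longer callable from Python at all — only integrate_f remains visible as a Python function.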
I had approximately a two-times speed-up the second time, and now a 33.5-times speed-up. That's much closer to the factor of 100 I promised initially. So now we've taken our code, sped it up by roughly 100 times overall, and let's just look at how it reads again. There's the whole code snippet: we didn't have to add much. We annotated some types — this part looks pretty much the same as before, and this part looks the same — and we just had to tell it that the callback function is a C function. That's really it. That's the magic. Before I go on, do you have any questions? Anything that doesn't quite fit in your mind yet? We're good? Yeah — you're asking about arrays with unspecified dimensions, which is always a problem when wrapping code. Right, I'll show an example of that in a moment.

All right. The next part is probably not something you're going to use an awful lot, but I'm going to show you how to do arbitrary callbacks as well, because it does come up from time to time. Typically, when you have an integrate function like ours, you don't want only one hard-coded function that can be integrated, right? We hard-coded x to the power of four minus three times x. What we'd rather do is pass in the function that should be integrated. So how do you do that in Cython? You define a class — Cython fully supports classes as well. In this case I called my class Integrand. It contains a single method f that takes a double as input and returns a double as output, and in the base class it's just not implemented. Then I subclass it to create a class called MyFunc, and MyFunc implements f as x to the power of four minus three times x. And then you tell Cython that the first parameter to the integrate function is going to be an Integrand, and the rest is exactly the same code as before. Here I used the three cdefs.
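The callback machinery being described can be sketched like this — a cdef class with a C-level method that subclasses override (names follow the lecture; treat the exact declarations as a sketch):

```cython
cdef class Integrand:
    cdef double f(self, double x):
        # Base class: subclasses must override this C-level method.
        raise NotImplementedError

cdef class MyFunc(Integrand):
    cdef double f(self, double x):
        return x ** 4 - 3 * x

def integrate_f(Integrand integrand, double a, double b, int n):
    cdef:
        double s = 0
        double dx
        Py_ssize_t i
    dx = (b - a) / n
    for i in range(n):
        # Dispatches through the cdef class -- a cheap C-level virtual
        # call, not a round-trip into the Python interpreter.
        s += integrand.f(a + i * dx)
    return s * dx
```

From Python you would then call something like integrate_f(MyFunc(), 1, 10, 100000), swapping in any Integrand subclass you like.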
You see the different syntax than before. This is exactly the same, but here we call integrand.f. So this just allows us to specify a function to integrate instead of always integrating the same function. There's a small performance hit for that. I called it integrate_any, so you'll see we did lose some speed there, back to 1.5 milliseconds again. But it does allow you to do callbacks. All right, so if you're all happy with that, we can move on to NumPy arrays. So your setup.py file is gonna look pretty much exactly the same as in the previous example. The only thing you have to add is an include path, because the NumPy headers are in some path on your computer, and you get that by importing NumPy and calling numpy.get_include(). If you execute that (import numpy, then numpy.get_include()), it just gives you a directory on your computer, and inside of that directory you've got the NumPy headers. So when you write any code that works with NumPy, you need to tell it that's where you can find those headers. Okay, so let's see how that file looks. numpy_basic.pyx, there we go. So, Cython has a NumPy header, so it's aware of how the NumPy ndarray structure looks. The first line you write is cimport numpy as np. cimport is a special import, a Cython import, so that is not available in pure Python. But when you do that, it basically loads the Cython definition of the NumPy array. And then, just like before, we want to specify what type of parameter we're going to pass into our function. In this case, the function foo is going to take an ndarray as input. So you could just say it's gonna be an ndarray, but typically you want to tell it what type of values are going to be contained in that array and what the dimensionality of that array is. So in this case I told it, well, we're going to have an array of float64 numbers. You'll see the mapping is very similar to the normal NumPy names.
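A setup.py of the kind being described might look like this. This is a sketch in the distutils style of the time; the module and file names (numpy_basic) are taken from the lecture, and it assumes Cython is installed:

```python
# Sketch of the NumPy-aware setup.py being described.
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy

setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[
        Extension('numpy_basic', ['numpy_basic.pyx'],
                  # this is the one addition: where the NumPy headers live
                  include_dirs=[numpy.get_include()]),
    ],
)
```

Built in place with `python setup.py build_ext --inplace`, as in the earlier examples.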
It's typically just the NumPy type name with an _t afterwards; that's how Cython likes it. And then we tell it that it's going to be a two-dimensional array, so typically a matrix or something like that, and it's going to be called arr. Okay, and then we define i and j as size_t. You remember that size_t is just an unsigned integer, so i and j are both just positive integers. So: for i in range(arr.shape[0]). And that's what you asked earlier on, what's this loop gonna do? Well, it says for each i in arr.shape[0]. What is arr.shape[0]? If you have a matrix, what would arr.shape[0] be? Yeah. So if you've got an m by n array called x, then x.shape would be (m, n). So array.shape's zeroth value, that's your number of rows, and the first value is your number of columns. So: loop over the rows, loop over the columns, and at each row, column position, put the value i plus j. So if I had a two by two array, what do you think the output would be? Well, this value is at position zero, zero, and zero plus zero is zero. This one is at position one, zero, so one plus zero is one. This one is at position one, one, so one plus one is two. See how that works? [Audience question.] Yeah, those are just things I wanted to highlight; you can't do that from pure Python. Okay, so let's compile that. Just to show you, this is the same code I just showed you up there, and I call my build extensions in place. Now you see we get a couple more messages. It again calls GCC, but it now includes the NumPy include directory. That's why you needed to change your setup.py file, and then, yeah, that includes the NumPy headers for you. Okay, so let's import numpy_basic. And what did we say? numpy_basic has a function called foo.
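In plain Python the function being described is just this (a sketch; the Cython version adds the typed ndarray signature and size_t counters on top of the identical loop):

```python
import numpy as np

def foo(arr):
    # fill each position with the sum of its row and column index
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            arr[i, j] = i + j

x = np.zeros((2, 2), dtype=np.float64)
foo(x)
# x is now [[0, 1], [1, 2]], as worked through above
```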
So there's a function foo, and it takes a two-dimensional float64 array as input. So let's create one. Just make zeros((2, 2)), and we make the dtype float64. Okay, so there's our array, and then we call our function on x. There we go. Yeah, quick; it did complete. [Audience question about why the array is modified in place.] Well, in Python, it always sends a pointer around, right? So even in pure Python, if you provided the arr parameter there, it would always be a pointer to your original array. And Cython knows something about your array: it knows where the array stores its data in memory. So when you do this assignment here, it takes that array, dereferences it, goes into its memory, and puts those values down there. So you're altering the original. The NumPy array is just a sort of high-level description of some memory, and Cython understands what that description is, so it allows you to write back to that memory. I think another option would have been to create a new output array, for example; that might be a more standard approach. So you might have wanted to do something like: out is zeros_like your input array. So we could do that. And then, let's say, i plus j doesn't make much sense here, so let's just say it's our input array at that position plus 10. So what would this do? It would just take your input array, add 10 at each position, and store it to out. So what's the problem here? Well, in this case, Cython doesn't know what out is. So we also have to tell it that out is an ndarray of type float64 with two dimensions. So that would work. This call here (we'll have to import numpy up there) will unfortunately still be a Python call, but it doesn't matter, because we do it once. The expensive part of this process is this double for loop, and that will still be accelerated.
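The variant just described, in a plain-Python sketch (in the Cython version, out also needs a typed ndarray declaration, and the zeros_like call remains a one-time Python call):

```python
import numpy as np

def foo_out(arr):
    # allocate a fresh output array instead of writing into the input
    out = np.zeros_like(arr)
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            out[i, j] = arr[i, j] + 10
    return out

x = np.zeros((2, 2), dtype=np.float64)
y = foo_out(x)
# y is all 10s; x is untouched
```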
So if I didn't make any mistake, we should be able to compile that. So let's just give x some other values. Okay, so there's our x; we want it to be float64. And then we call numpy_basic.foo on x. I should probably have a return statement here; that would help. There we go. So it took my input array x, added 10 to each number, created a new output array, and sent that back. Did that answer your question? [Audience follow-up.] So what you're seeing here, if we reconstruct it, let's just write a little snippet in C. This is roughly the kind of code that will be generated. The pointer is being sent in, and it gives you access to the memory behind that pointer. So when you write something like arr[i, j] = 5, what's really happening in the background is that something like that is being generated by Cython. Any questions about the NumPy example? Do you guys think you would be able to write a small NumPy example like this? Maybe? Okay. So what I've provided you with, and this is also in the bSpace folder: if you go to problems, you'll see there's a directory called fractal, with a file called fractal.py. If you run that, it runs fairly slowly and generates a colorized version of the Mandelbrot set. You saw it takes a while for those 10 iterations; if you wanna do a hundred, you're gonna wait quite a while. I wrote the NumPy version as well. The NumPy version is nice and snappy; I had it run up to 99 iterations, and it generates the set in much higher resolution.
But I think, actually I know, that the Cythonized version of fractal.py can be even faster than the NumPy version. So your challenge for this breakout is to take this code (which one? the non-NumPy one). In this code you'll see there are a couple of double for loops. Annotate the types and see if you can make it run faster than the NumPy version. And I mean, you guys have never written Cython before, so I would expect you to have a lot of questions while you're doing this. So just put up your hand and I'll give you some hints along the way. Next: how to do parallel for loops in Cython. The example I'm using is just a matrix multiplication. So we've got a matrix A and we're going to multiply it with a matrix B. I don't know if you remember from your linear algebra class how that works, but basically, to get this element here, you take the first row of A and you multiply those elements by the elements in the second column of B, and that gives you the entry over there. So it's A11 times B12 plus A12 times B22, and if you sum those products together, you get that value. Here's another one: A31 times B13 plus A32 times B23; sum those together and you get that value in your output. So how would the Python code look for that? Well, we say for each row in A (so we're going to step through the different rows in A), we're going to multiply by each column in B. So i is the row here, j indicates the column in B, and our output goes to row i, column j. And what's that value going to be? Well, we need to sum the values in the row of A multiplied by the column in B; that's what this sum here does. So we loop for k over the number of columns in A (k is going to run zero, one; zero, one in each case), and then we just multiply those two together, summing into a variable called s, and that's the value that eventually we write to out[i, j]. So are you satisfied that that would produce the matrix multiplication?
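The triple loop being walked through, as a plain-Python sketch (the lecture's Cython version is this exact code plus type annotations):

```python
def dot(A, B, out):
    for i in range(len(A)):             # rows of A
        for j in range(len(B[0])):      # columns of B
            s = 0.0
            for k in range(len(B)):     # columns of A == rows of B
                s += A[i][k] * B[k][j]
            out[i][j] = s

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
out = [[0.0, 0.0], [0.0, 0.0]]
dot(A, B, out)
# out is now [[19.0, 22.0], [43.0, 50.0]]
```

For example, out[0][0] is 1·5 + 2·7 = 19, exactly the row-times-column rule described above.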
This is hideously slow if you run it in Python. Again, nested for loops, so it takes a really long time, and we'd like to speed it up a little bit. So what can we do? Well, we know by now that we just have to annotate the types. So we're going to tell it what the input types are and what the intermediate types are. The input types: we've got three ndarrays, all float64, with two dimensions, because they're matrices. A and B, the two matrices you want to multiply, plus out, which is where your output is going to be stored. And then our different variables: the shapes are almost all size_t, which I said is a positive integer, and our counters i, j, and k, similar thing. And then just the sum s, which is going to be a 64-bit floating point number. And then exactly the same code we had before; didn't change anything. So we will get some speedup there. And I didn't even try the original Python version; that's just really slow. So, yeah, this one is called dot. Okay, so import prange_demo, and let's just satisfy ourselves that this thing actually does the job. So I'm going to construct a, let's see, 1,000 by 500 matrix A, and a 500 by 1,000 matrix B, so those dimensions are compatible. What would the dimensions of my output be? Uh-huh, 1,000 by 1,000. So if everything works as planned, I call that function: dot(A, B, out). Takes a while. And there's the output; let's just verify that it's correct. So we take the output, we subtract from it the matrix multiplication as computed by NumPy, and take the maximum absolute value there. Yeah, 10 to the power of minus 13, that's small enough. So I'm satisfied that out actually contains the matrix product of A and B. Let's time it. Okay, three and a half seconds per loop. So this currently makes use of only one of the two CPUs on my laptop, and I suspect that I should be able to get almost half of that time if I utilize both cores. So let's see how to do that.
Okay, very minor modifications to the code. First, I added two lines at the top. Those are just telling Cython: do not do bounds checking on NumPy arrays. So if I accidentally try to assign values to an invalid row or an invalid column, it won't complain, but it's an optimization; it saves some time. And it also tells it: don't do wraparound. Normally when you reference row minus one, NumPy will give you the last row, and minus two will give you the second-to-last row. This is telling it, don't do that, which saves a little bit more time. But the important part is really this: we have our for loops, as in the previous example, but instead of range, I changed this to prange. prange is Cython language for parallel range, and it makes use of OpenMP. It will split this for loop up and send the data to the two different CPUs on my computer. That's it. The only other thing I have to remember is to use this with nogil clause. What does that mean? You'll remember I spoke about the global interpreter lock earlier on. What this basically says is: don't worry, we're not going to be altering any Python objects while we're in these loops, because that's illegal. Remember, I said that you cannot change Python objects from multiple threads at the same time. So this is a guarantee to Python that we won't be messing with Python objects. We're just going to modify s, which is a C variable, so Python doesn't care about that, and we're going to modify the memory of out. But again, that's not a Python object; that's just memory inside the computer. Okay, so let's see what that does. I compiled that part of my prange_demo file, and the parallel function is called pdot. So let's time that one. [Audience: why is prange on the range and not on the for?] I'll actually have to think about that. So we get a speedup of, yeah, almost two.
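Putting together the pieces just described, the parallel version might look roughly like this Cython sketch. The directive and function names follow standard Cython and the transcript; the exact lecture code isn't shown verbatim, and the extension must be compiled with OpenMP flags (e.g. -fopenmp) for prange to actually run in parallel:

```cython
# cython: boundscheck=False
# cython: wraparound=False
cimport numpy as np
from cython.parallel import prange

def pdot(np.ndarray[np.float64_t, ndim=2] A,
         np.ndarray[np.float64_t, ndim=2] B,
         np.ndarray[np.float64_t, ndim=2] out):
    cdef size_t i, j, k
    cdef double s
    with nogil:                      # guarantee: no Python objects touched
        for i in prange(A.shape[0]):  # rows split across cores by OpenMP
            for j in range(B.shape[1]):
                s = 0.0
                for k in range(A.shape[1]):
                    s = s + A[i, k] * B[k, j]
                out[i, j] = s
```

Only the outermost loop uses prange; the inner loops stay as ordinary ranges, and s is assigned fresh inside the parallel loop so each thread gets its own copy.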
Now, the first example I constructed for you wasn't a matrix multiply; it was something very simple. And then I noticed that I didn't get anything near this; I got sort of a 1.2 times speedup. So whatever happens inside the parallel for loop needs to be fairly expensive, otherwise you don't really gain, because shuffling the data out to the different CPUs also comes with a cost. But when you do something as expensive as a matrix multiply, it makes a huge difference. And of course, if you had four CPUs, you wouldn't get quite a quarter of the time; you've all seen those scaling curves. And with eight CPUs the curve flattens further, and you get less and less out of your multiple cores, unfortunately. Okay, right, then I just want to finally show you how to wrap C and C++ libraries, because this is something that people do quite often, especially in academic codes. If you have Fortran code that you need to wrap, I put in a link to Andre Saratek's page; he's got a very good overview of how to wrap Fortran with Python. I think what he does is generate a C wrapper for the Fortran and then wrap the C inside of Cython. But we'll just see how to wrap C today. So here's the simplest example. If you want to call functions inside the C math library, you write cdef extern from "math.h". This is basically the same as an #include <math.h> statement. And then you can say: declare the functions cos, sin, and tan for me, which take a double as input and return a double as output, and also expose the constant M_PI. Those of you who have programmed C before will recognize this; M_PI is just what C calls the constant pi. And then I have a function that evaluates cos of zero and cos of pi.
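The extern block being described might look like this Cython sketch (the test function's name is an assumption; the declared functions and constant follow the lecture):

```cython
# Sketch of the trig.pyx file being described.
cdef extern from "math.h":
    double cos(double)
    double sin(double)
    double tan(double)
    double M_PI

def test_trig():
    # expect 1.0 and -1.0
    print(cos(0), cos(M_PI))
```

Declaring a function in the extern block is what makes it callable from Cython; the actual implementation comes from libm at link time.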
So I expect to see one and minus one. I compiled it already, so if I do import trig and call the test function: there we go, one and minus one. So it's really as simple as saying which functions you want, giving their declarations, and calling them. The only modification you might need to make to your setup.py file is to tell it that there's a library you now need to link against, and that's libm, the math library that comes with C. As long as you do that, things work fine. Okay, so that's the simplest case. C++ gets a little bit more hairy. How many of you have programmed C++ before? Oh, okay, a fair number. So you'll recognize this: this is just a class definition, from inside an .h file. We've got a class Circle inside the namespace geom, for geometry. That Circle's got several public members. It's got a constructor that takes an x and y position (the center of the circle) and the radius. It's got a destructor, which in our case will do nothing. You can get the x and y attributes, you can get the radius, get the area (which gets computed), or you can set the center and set the radius. And the members x, y, and r are all private. So again, if we want to reference this from Cython, we just need to basically copy that header segment out and say: cdef extern from "circle.h", from the namespace geom, define a class for us called Circle, and Circle has a constructor, getX, getY, getRadius, et cetera. Then here's the Cython implementation of a class called PyCircle. This is a wrapper that I wrote around that C++ class, and the magic is highlighted in the two pink lines you see up there. First, we say our class PyCircle is going to contain an instance of Circle (Circle being that C++ class of ours); that's the thisptr, a pointer to an instance of that Circle.
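The wrapper being described might be sketched like this in Cython. The header name, namespace, and method names follow the lecture; the exact code isn't shown verbatim, so treat this as an illustration of the pattern:

```cython
# Sketch of circ.pyx: wrapping the C++ Circle class.
cdef extern from "circle.h" namespace "geom":
    cdef cppclass Circle:
        Circle(double x, double y, double r)
        double getX()
        double getY()
        double getRadius()
        double getArea()
        void setCenter(double x, double y)
        void setRadius(double r)

cdef class PyCircle:
    """Python wrapper around the C++ Circle class."""
    cdef Circle *thisptr              # holds the C++ instance

    def __cinit__(self, double x, double y, double r):
        self.thisptr = new Circle(x, y, r)

    def __dealloc__(self):
        del self.thisptr

    property area:
        def __get__(self):
            return self.thisptr.getArea()

    property center:
        def __get__(self):
            # combine the two C++ getters into one Python tuple
            return (self.thisptr.getX(), self.thisptr.getY())
        def __set__(self, xy):
            self.thisptr.setCenter(xy[0], xy[1])
```

Each Python-level method or property is just a thin pass-through to a call on thisptr, which is why there's "not much magic going on".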
And there's a special method here called __cinit__ that gets called when Cython constructs that PyCircle class, and you'll see that it initializes the thisptr. What does it initialize it to? Well, a new instance of Circle. So it basically calls into that C++ code, constructs a new Circle object, and assigns it to self.thisptr. And from that point on, you can just handle that object as if it were an already-constructed C++ object that you can work with. So you'll see, for example, when I defined area in my Cython class, I just return self.thisptr.getArea(). So I call straight through to the C++. Nothing magic there. Let me just show you this. Here, for example, I defined a property called center in my Cython class, and as center I return a tuple, because Python's got this nice structure, so why wouldn't I make use of it? So I just return self.thisptr.getX() and self.thisptr.getY(). Instead of having it split up into two, as on the C++ side, I combine it into one. So when you compile that and you import this class (what am I doing? there we go, circ.PyCircle), you'll see I added a bit of a docstring here. Cython handles docstrings perfectly; if you put a docstring in, it gets into your compiled code as well. So we can construct our PyCircle with a center of (5, 5) and a radius of 1. What happens when I do that? Well, it calls __cinit__ with those parameters 5, 5, 1. That calls the C++ construction new Circle(5, 5, 1), which constructs my pointer, and now I've got an object that contains thisptr. And when I, for example, access this area property, it returns self.thisptr.getArea(). So let's see, c.area, there we go. What else do we have? center is (5, 5), and you can also set center to (5, 10). There you go. Okay, so not much magic going on there.
Again, the setup.py file needs to be modified ever so slightly. You just need to tell it that we're trying to generate the extension circ, written up in circ.pyx, but in this case it also includes the C++ file circle.cpp, and please use the language C++, not C. So that's it. You've now seen how you can build code that is both high-level and executes rapidly with Cython. Oh, I never showed you the profiling; that's quite important, so let me just do that. It's really quick. In IPython, you've got the %prun command. So I just imported a module called sum_task; it's got a function called execute. And if you do %prun on it (you can put any Python function call there), it executes that Python code and then gives you a table summarizing how long each part of that file took to execute. So what do we see here? Well, we see that there's something that took 0.68 (I assume those would be seconds) to execute; that's the longest. And that is sum_task.py line three, a function called expensive_square. So let's see what that is; let's look at sum_task.py. What does expensive_square do? Well, it takes an array x, it copies that array, and then it goes with a for loop through each value, squares it, and returns x. So this is typically your workflow. You write your code, you run it in the profiler, and you see: whoa, that thing is like 10 times slower than anything else in my program. The top one is 0.68; the one below is only 0.021. So that's by far the most expensive call in my program. So let's improve that. I replaced expensive_square with cheap_square, which executes really quickly. And then when you run it and look at the summary, you'll see: aha, now the most expensive thing is assert_array_compare. That's what I used to check that the results were the same at the end. But look, this is only 0.023. The previous one was 0.6 something, and that's now completely out of the equation.
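The profiling workflow being described can be reproduced outside IPython with the standard cProfile module (a sketch: the function names follow the transcript's sum_task.py, and cProfile here plays the role of IPython's %prun):

```python
import cProfile
import io
import pstats
import numpy as np

def expensive_square(x):
    x = x.copy()
    for i in range(len(x)):      # element-by-element Python loop: slow
        x[i] = x[i] ** 2
    return x

def cheap_square(x):
    return x ** 2                # vectorized NumPy: fast

def execute():
    x = np.arange(10_000, dtype=np.float64)
    a = expensive_square(x)
    b = cheap_square(x)
    assert np.allclose(a, b)     # verify the cheap replacement is equivalent
    return a

# profile execute() and print a table sorted by cumulative time,
# much like %prun's output
prof = cProfile.Profile()
result = prof.runcall(execute)
stream = io.StringIO()
pstats.Stats(prof, stream=stream).sort_stats('cumulative').print_stats()
report = stream.getvalue()
# expensive_square dominates the report, so that's the line to Cythonize
```

The point of writing both versions plus the assert is exactly the workflow described below: find the bottleneck, replace it, and keep a check that the results stay the same.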
So that was a good place to improve performance. That line over there is now the one that takes most of the time, but that's just a check to make sure the results are correct. So I mean, I could comment that out; I don't really need it. And now you see, whoa, everything's only taking 0.00 something. And by that stage everything runs super fast, so you don't really care about performance anymore. So that's how you identify code blocks that you want to address: code blocks that are slow, that you want to replace with Cython pieces. We've now seen how to identify bottlenecks and replace them. It's always a good idea to write your code in Python, write tests to verify the behavior, and then go back and replace parts with Cython. If you have the tests in place, it's not a big worry, because if you break things, you'll know immediately. So, yeah, Cython's a bit of a hack in some sense; it's not perfect, it's a little bit ugly. We would perhaps have preferred to get the speed straight out of the box in Python, but that's not really feasible at the moment. So this is a solution that many projects out there use. We use it in scikit-image, scikit-learn uses it, and lots of other projects use it. You may have recently seen the Julia language; that's been in the news quite a lot. They basically do this right out of the box, which I think is brilliant. But we have a heavy investment in Python, so this is a good solution for the time being. All right, that's it.