So the first thing we're going to talk about is, I guess, the easiest way of interfacing with an external language. We use Weave quite a bit for prototyping. It's used in production environments as well, but it's a really nice tool when you find some hotspot in your application that's really slow and you want to try a different algorithm there. It also has some tools where you don't actually have to write any C or C++ to get some of the speed benefits, so we'll go over those.

The general notion is that we have some C or Fortran code that we want to execute, and to interface that with Python you really have to write a wrapper function around it. A Python integer is actually a structure; it's not like your int in C. It carries information about the type of the variable and what its operations are: math operations like add, subtract, things like that. So if we take one of these objects, we have to unpack it, pull out the integer value we want, and make it a C int so that we can call our C code. Then when our C code returns an integer value, we have to pack that value back up as a Python object before we pass it back to Python. So there's this unpacking and packing process that always happens around your function call if you're interfacing with Python.

Weave handles this whole packing and unpacking process for you; that's why it's the easiest example to go over. There are really three different sets of tools. We'll mainly cover the first two, and I'll probably dance over the last one. Blitz is a C++ library that does fast vector algebra, or vector math.
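To make that pack-and-unpack step concrete, here is the same idea using ctypes from the standard library. This is not Weave; it's just an illustration of the conversion a wrapper has to do, with ctypes doing it at call time the way Weave's generated C++ wrapper does it for you:

```python
import ctypes

# Load the C standard library (works on Linux/macOS; illustration only).
libc = ctypes.CDLL(None)

# Declare the C signature of abs(): a plain C int in, a plain C int out.
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

# The Python int object is unpacked to a C int, abs() is called,
# and the C int result is packed back up into a Python int.
print(libc.abs(-42))  # prints 42
```

Weave generates and compiles the equivalent wrapper code in C++ automatically, so you never write the conversion yourself.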
And you remember the slicing operations we talked about all day yesterday? Well, you can do all of that in C++ with Blitz, and NumPy expressions look very much like Blitz expressions; one is in Python, one is in C++. So Weave will translate your NumPy expression into a Blitz expression, do the wrapping and unwrapping of the variables for you, compile the wrapper, and load it on the fly, so that you don't have to deal with that process. That's one approach. There's also a tool that lets you inline C code directly in your application, and then the extension tools let you build a standalone extension module.

So we'll go through some examples here. If you try this example from IPython, it won't quite work right, because IPython does something with standard out; it's not the standard output that C uses when it writes. What this example does is set a variable a equal to one, then inline a line of C++ that writes the variable a to standard out, followed by an end-of-line character. This is basically a printf, a print statement for C++: it just prints out a. And then we tell it that this variable a needs to cross from Python into C. Under the covers, Weave takes that little line, puts it in some boilerplate C++ code that does the wrapping and unwrapping of a from a Python integer into a C integer, and calls your function. In this case, all it does is print out one, right? That's not very interesting, but it is kind of nice to know that it's done everything for you in the background; you haven't had to do anything. And if you call it a second time, that function is already there, so it doesn't recompile: you don't pay that compile step again.
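The little print example might look like this; a sketch assuming the old scipy.weave API (a Python 2 era package), guarded so it just emulates the output when weave or a C++ compiler isn't available:

```python
a = 1
code = 'std::cout << a << std::endl;'  # one line of C++; basically a print statement

try:
    from scipy import weave    # old scipy.weave API; may not exist on modern setups
    weave.inline(code, ['a'])  # compiles the wrapper on the first call, cached after
except Exception:              # no weave or no compiler: emulate what it would print
    print(a)
```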
Well, what happens if you set a equal to a string? C and C++ are not dynamic languages like Python is. When you have a variable a and you say it's an int, it has to be an int. So what happens here is it goes to call the function, finds out "oops, that was a string, I expected an int," and it's smart enough to say, okay, I'm going to recompile this. It recompiles, and now it keeps track of two versions of the same function, one for integers, one for strings, and you can call it with the string here. If you call it again, obviously it doesn't recompile. And if you change a back to one and call it again, it's immediate, because it keeps track of both versions. If you kill the application and restart, there'll be a slight pause when you run one of these things, because it has to go find the external module and load it, but there is no recompile step.

So I'll do a quick, slightly different example. We'll make a equal to one. There are some magic variables in inline code, and one of them is called return_val: if I assign a value to return_val, that's what gets returned from the function. So if we run this, it goes off and does some compiling, and the output is one. Run it again, the output is again one. Set a equal to 'qwerty', it goes off and compiles; the string version is now there too. Set a back to one, still there. So you haven't had to do a whole lot of work here to actually interface with C or C++ and try out a short algorithm.

You can also provide support code. We've created an extra string up here, called it support_code, and it's just a C++ function, or a C function. Then we say, okay, I want to call this function called bob, stick its result in return_val, and that gets returned from the function. But bob isn't defined in the inline code itself, right?
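The return_val and support-code machinery just described might be sketched like this, again assuming the old scipy.weave API and falling back to pure Python when weave or a compiler is missing (bob is the toy helper from the talk):

```python
support_code = """
int bob(int val) {    // helper function; gets pasted above the generated wrapper
    return val;
}
"""

def call_bob(a):
    """Call the C++ helper bob() and return its result via return_val."""
    code = 'return_val = bob(a);'   # assigning to return_val sends a value back
    try:
        from scipy import weave
        return weave.inline(code, ['a'], support_code=support_code)
    except Exception:    # weave or a C++ compiler is unavailable
        return int(a)    # mirrors bob(); fails on a string, just as the C++ does

print(call_bob(1))  # prints 1
```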
So we just pass in the thing that defines bob as support code; that just gets pasted in above the generated function. Then we pass in the variable a, and this runs beautifully: we get our one back. However, if we come in and pass a string, what's going to happen when you call bob, whose signature is int val, with a string? No luck, right? It's going to die. And this error message, well, there's some pretty raw stuff that comes out of Weave sometimes, especially when you're developing. You still have to be able to interpret and understand the C++ error messages that occur. The other thing you'll want to learn is where these files live: this file is actually a real file written out someplace, so when you're debugging things, you go look at that file and see what the heck's gone wrong. Here it's saying bob can't convert parameter one from a PyString into an int, because it needs an int. So you don't get away completely from having to know C and C++ when you're trying to debug these things.

Now look at what a NumPy expression like this actually does underneath. The first thing it has to do is allocate a temporary array, so it calls malloc. Then it goes through a loop, adds the values together, and assigns them out to the temporary. Very straightforward. But then it has to allocate another new array for the second part, and it runs a second loop. Instead, in C you'd write something a little more like this, which fuses those loops: it combines them into one loop. And notice, you really don't have to keep up with any temporary arrays; there aren't any, you just need the output value, and we assign straight into c. This is probably the more likely way somebody would write this algorithm by hand.
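The two styles just described, two loops with a temporary versus one fused loop, can be sketched in plain Python (hypothetical arrays a, b, c):

```python
import numpy as np

a = np.arange(5.0)
b = np.arange(5.0)
c = np.arange(5.0)

# What an expression like d = a + b + c does under the covers:
temp = np.empty_like(a)        # first allocation
for i in range(len(a)):        # first loop
    temp[i] = a[i] + b[i]
d = np.empty_like(a)           # second allocation
for i in range(len(a)):        # second loop
    d[i] = temp[i] + c[i]

# The fused version a C programmer would write: one loop, no temporary array.
d_fused = np.empty_like(a)
for i in range(len(a)):
    d_fused[i] = a[i] + b[i] + c[i]

print(np.allclose(d, d_fused))  # prints True
```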
And you haven't had to do any memory allocation, and you've fused the loops. This is going to be much faster than the two-loop version: you have a whole lot less memory thrashing through the cache, and with a fused loop you're doing more operations per pass, so you're allowing the compiler to do the best pipelining it can. You're doing more operations on values while they're in the registers. It's just a much faster approach.

So we go from that to something that looks like this. This is what's called an FDTD update equation, and it has a position-dependent constant: it depends on where you are in the grid, and every grid element has a different constant. We have a constant here that we multiply by an E value, and then you're calculating a curl over here, so you have another constant and another constant, and you're doing derivatives. You remember the offsets we were doing yesterday to calculate a derivative? This is exactly the same sort of thing: a derivative in the y direction and one in the z direction. So this is actually a fairly complex finite-difference equation, expressed fairly succinctly in Python. But there are a lot of operations; something like seven math operations have to be done. You can write this in Python and it works pretty well. But you can also take this expression, put quotes around it to make it a string, and hand it to blitz. Blitz takes the expression, creates the abstract syntax tree for it, walks that tree, and reconstructs a C++ expression that does the same operation. Then it compiles that under the covers and runs it. And all of a sudden, all those temporary arrays go away, because when it generates the C++ it fuses those loops together into a single loop.
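An update in that spirit, with hypothetical field names and just the y and z derivative terms, might look like this in NumPy. Every binary operation below allocates a temporary array, which is exactly what blitz eliminates by fusing the whole expression into one C++ loop:

```python
import numpy as np

ny, nz = 64, 64
ex = np.random.rand(ny, nz)          # E field component (hypothetical names)
hy = np.random.rand(ny, nz)          # H field components
hz = np.random.rand(ny, nz)
ca = np.random.rand(ny - 1, nz - 1)  # position-dependent constants:
cb = np.random.rand(ny - 1, nz - 1)  # one value per grid element

# Offset slices give the derivatives, just like the stencils from yesterday:
ex[1:, 1:] = (ca * ex[1:, 1:]
              + cb * ((hz[1:, 1:] - hz[:-1, 1:])      # derivative in y
                      - (hy[1:, 1:] - hy[1:, :-1])))  # derivative in z
```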
So does this make a difference? Well, you can see how old these timings are; I really need to update this slide. We're talking Numeric 17, and the last version of Numeric was 24 before we went to NumPy. But this gives you an idea of the performance. If we take this simple equation with Numeric and then run it through the blitz compiler, we get a speedup of 1.13. That's almost negligible, but for this expression you'd kind of expect that: for a large array, 512 by 512, most of the work really is the math. For a = b + c, Numeric ought to be very fast, because it's not really doing anything strange; it has to allocate the temporary and do the adds, and that's about all that happens. When you do a = b + c + d, the expression is a little larger, and you actually see a speedup with blitz of a factor of two. And when you get to that FDTD equation, it's about a factor of three. Not so bad. Now, double-precision performance in Numeric was not very good in the past; I think NumPy does better than this. But there we were seeing a speedup of around nine. And all you had to do to get that factor of nine was take the expression you wrote down here and convert it to a string, so it's exactly the same, and call blitz on it. It's a small amount of work for quite a bit of benefit; the easiest way to get a speedup.

The thing to be careful about with blitz is that it doesn't handle broadcasting. If you have a NumPy expression that uses broadcasting, blitz isn't going to solve that problem for you. We found one other condition it didn't handle a few days ago, but it should handle most of your general equations, as long as there's no broadcasting in them. And here's your five-point stencil, right?
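Calling blitz really is just a matter of quoting the expression; a sketch assuming the old scipy.weave API, with a plain NumPy fallback when weave or a compiler isn't around:

```python
import numpy as np

n = 512
b = np.random.rand(n, n)
c = np.random.rand(n, n)
d = np.random.rand(n, n)
a = np.empty((n, n))

expr = 'a = b + c + d'    # the NumPy expression, just turned into a string

try:
    from scipy import weave
    weave.blitz(expr)      # translated to Blitz++, compiled, run with fused loops
except Exception:          # no weave or no compiler: ordinary NumPy, with temporaries
    exec(expr)
```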
So one of the things you could do today is take that stencil you created, put quotes around it, call weave.blitz on it, and compare the speed between the two. Also check that the results are the same, which is what you hope will happen.

All right, here's a slightly different view that compares blitz and inline. Here's Laplace's equation, and this is an update where we're running Laplace's equation on a grid with a voltage set at the top and ground on the other edges. If you iterate Laplace's equation, it will iterate until you have the voltage across the whole grid at all of the internal points. You have the pure Python version, and you can use Numeric on the same problem. And this is one of the interesting problems: notice at the bottom that we're calculating an error, and we're watching the error go down as we iterate. This is a place where vectorization actually doesn't do very well, and here's why. To calculate the error value, when we change a pixel we need to compare the old value to the new value, right? If you're doing that in a for loop in C or C++, you can just calculate the error of that single value as you go, and if you're accumulating the error across the whole grid, you can just sum it up inside the loop. That's what we're doing right here: we save out the old value, calculate the new one, take the difference, and add that to the error. But if you do this with slicing, look what you have to do: if we're going to compare against the last values we had, we have to have the last values around.
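The slicing version of that update, with the extra copy you need just to compute the error, might look like this sketch:

```python
import numpy as np

def laplace_step(u):
    """One relaxation step of Laplace's equation on the grid interior."""
    old = u.copy()   # keep the previous values around just for the error:
                     # this copy is the doubled memory mentioned above
    u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
    return u, np.abs(u - old).sum()

u = np.zeros((50, 50))
u[0, :] = 1.0          # voltage on the top edge, ground everywhere else
u, e1 = laplace_step(u)
u, e2 = laplace_step(u)
print(e1 > e2)         # prints True: the error goes down as we iterate
```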
But I'm writing my new values directly over my old values, right? So that means I have to copy the old array out first. This is a place where vector calculations basically double your memory usage for the problem, just to calculate the error. Still, you can write this in Numeric, and it's much faster than the pure Python version; we see the error value here. We can put that update equation into blitz, still using the old approach to calculate our error, and get a speedup of about three.

The third way is to write this as inline C code. This is a little more work: you have to write your for loops and that sort of thing. Here we have the for loops over i and j, and we're looping across the arrays. These look a little nicer than usual C, where you'd normally have to do the pointer-offset arithmetic yourself: we're converting the inputs to Blitz arrays on the way in, and Blitz allows you to do indexing. So we can do the same operation, but notice that we can calculate the error right here without allocating a new array or keeping a copy of the old one; we just compare against the temporary old value. So we haven't doubled our memory, and we're down to 4.3 seconds. You can actually write a faster version still: Blitz arrays have a little bit of overhead compared to doing the pointer-offset calculations yourself, so if you make this uglier, you can cut it down a little more. That comes in at 2.9 seconds.

So if you look at all of these, you're at a runtime of around 2000 seconds for pure Python. Really, Numeric is our benchmark here. But with blitz you get a speedup of about 2.84; inline takes you to about 6; and the hand-tuned inline to about 10.
And using f2py to just wrap something in Fortran, you're also on the order of 10. If you write a pure C program, it's slightly faster than the weave.inline version, and I actually don't know why that would be; I can't see why it should be any faster. It would be interesting to rerun this on something slightly more modern than a Pentium III 450 and see where that difference comes from. But this is a nice study to give you an idea of what you can expect to see. And this is not unusual: the vector quantization algorithm we showed yesterday had a speedup of a factor of 25, and here we're seeing a speedup of around 10. So if you take numeric code from Python to C and rewrite the algorithm, you can generally expect a speedup of between 5 and 25. It's not always true, but it's true in a lot of places.
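For completeness, the inline version of that Laplace update might be sketched like this, assuming the old scipy.weave API with its blitz type converters (variable names are hypothetical). The error is accumulated inside the loop, so there is no copy of the old grid:

```python
import numpy as np

code = """
double err = 0.0;
for (int i = 1; i < nx - 1; ++i) {
    for (int j = 1; j < ny - 1; ++j) {
        double prev = u(i, j);                   // blitz-style indexing, no
        u(i, j) = 0.25 * (u(i-1, j) + u(i+1, j)  // pointer arithmetic needed
                        + u(i, j-1) + u(i, j+1));
        err += fabs(u(i, j) - prev);             // error accumulated in the loop
    }
}
return_val = err;
"""

def inline_step(u):
    nx, ny = u.shape
    try:
        from scipy import weave
        return weave.inline(code, ['u', 'nx', 'ny'],
                            type_converters=weave.converters.blitz)
    except Exception:                  # fall back to the slicing version,
        old = u.copy()                 # which needs the extra copy
        u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                u[1:-1, :-2] + u[1:-1, 2:])
        return float(np.abs(u - old).sum())

grid = np.zeros((50, 50))
grid[0, :] = 1.0
print(inline_step(grid) > 0)   # prints True
```

Note that the C loop updates the grid in place as it sweeps, so it's a Gauss-Seidel style pass rather than an exact copy of the slicing version; this is a sketch of the technique, not the talk's exact code.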