So welcome. This is Going Further with CUDA for Python Programmers, and as the name suggests, this won't make too much sense unless you've gotten started with CUDA for Python programmers. The good news is that I have a video called Getting Started with CUDA for Python Programmers, so start there. It's only a bit over an hour long. You might be surprised at how quick and easy it is to get started if you haven't. So assuming that you have got started, today we're going to be looking at the most important next step in taking advantage of CUDA. We've already learnt to take advantage of the thousands of threads that you can run simultaneously on a GPU; today we're going to learn how to take advantage of its incredibly fast memory. So up till now, although we haven't really talked about it, the memory we've been using is what's called global memory. This is the same book we looked at last week, which we do recommend, Programming Massively Parallel Processors, and the stuff we're covering today is largely covered in chapter 5 of that book. In this CUDA MODE series, there's a lecture from Thomas Viehmann which goes through this and the previous chapter in some detail, and so that's actually a good video to watch maybe after this one or before this one. Either order is fine; they're covering similar material in different ways. The key thing to understand in what we're looking at here today is that if we look at this box here, which you can basically think of as a GPU, in the GPU you have global memory. Global memory is what we've always been using so far: when we say .cuda in PyTorch, it's actually putting the tensor into global memory. Global memory is pretty fast compared to some other types of memory we might be used to, but it's far from the quickest memory available in the GPU. In fact, this shared memory is much faster.
Shared memory, however, is not global, which is to say that not all of the threads can see it. In fact, as this box indicates, shared memory is something that's only available to the threads on a specific streaming multiprocessor (SM), or in CUDA programming terms, within a block. So all of the different threads in a block can access shared memory. The reason we care about this is that shared memory is about 10 times faster than global memory, and because GPUs are so incredibly quick, with all of those thousands of threads running at the same time, the speed at which you access memory turns out to matter a whole lot. So being able to use this shared memory effectively is as important as being able to use the thousands of threads simultaneously. In the last lecture we focused on how to use all of those threads, and today we'll focus on how to use shared memory. Those two things will get you quite a long way in terms of creating pretty fast CUDA code. Okay, so the repo for these lectures is the CUDA MODE repo, specifically the cuda-mode/lectures repo, and in there you'll find there's a lecture five. You don't have to have seen all the lectures beforehand, but they're certainly all useful; you just need to have seen lecture three, which is the one I mentioned. Lecture five is where today's notebook will be found, and here it is. Okay, so one thing I'll just mention that I've added is a little utils.py, where I've put some of the stuff that we used last time and will use quite a bit, all into a script so we can access it multiple times. So we've got cdiv, which is a ceiling division function, the little load_inline wrapper called load_cuda, and the little prefix we have that has the hash-includes and the other stuff we're going to need. And so you'll see that we're going to import those here. Other than that, we're going to import all the usual stuff that we like to use.
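The cdiv helper mentioned here is just ceiling division. This is a minimal sketch of what the utils.py version might look like (the exact definition in the repo may differ slightly):

```python
def cdiv(a, b):
    "Ceiling division: how many chunks of size b are needed to cover a items."
    return (a + b - 1) // b

# e.g. 10 elements need 3 blocks of 4: two full blocks plus one partial one
print(cdiv(10, 4))  # 3
```

This is the standard way to compute how many blocks to launch so that the grid covers the whole input, with the guard inside the kernel handling the overhang.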
Last time we used SimpleNamespace, but actually I thought let's make things closer to how CUDA does things and create a little thing called dim3. So this is a 3D grid with an x, a y and a z, using the handy little Python namedtuple functionality. So this is a nice way for us to, just like in CUDA, provide as many of the dimensions as we want. Today we'll be doing two-dimensional grids, so there'll be an implicit z equals one. We can access these as d.x and d.y, for example. Like before, we'll use Wurlitzer to print stuff from CUDA if we want to. CUDA_LAUNCH_BLOCKING is helpful for debugging, so you can turn that on if you want it. And so today we're going to do a matrix multiplication of a 5120 by 256 matrix m1 by a 256 by 5120 matrix m2. So like before, we're going to start by looking at pure Python, and pure Python is going to be really, really slow. To handle that, we're going to create a sample of the first matrix with the first four rows, and a sample of the second matrix with the first four columns, and we'll use that for our pure Python example. All right, so just to remind you what we've already done in the past: we created this simple kernel runner that goes through every block and every thread (not real blocks and threads, they're actually just integers) and calls some kernel (not actually a kernel, it's just a function). And I'm going to use dim3 now, now that we've got it, to pass into that. And so this was our previous matrix multiplication. We grab the row, we grab the column from the indexes we passed in. We have a guard, and we then accumulate our dot product for whatever particular row and column we're filling in. So ignoring the extra details here, conceptually we're just doing this dot product. For example, if we're filling in this element here, then this is r and this is c. So this is r comma c.
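The dim3 namedtuple described above can be sketched in a couple of lines; the use of field defaults here is an assumption about how the notebook defines it, but it matches the behavior described (unspecified trailing dimensions default to 1):

```python
from collections import namedtuple

# A CUDA-style dim3: unspecified trailing dimensions default to 1
dim3 = namedtuple('dim3', ['x', 'y', 'z'], defaults=(1, 1))

d = dim3(2, 3)   # a 2D grid; z is implicitly 1
print(d.x, d.y, d.z)
```

Using a namedtuple rather than SimpleNamespace means the fields are positional as well as named, which mirrors how dim3 is constructed in CUDA launch syntax.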
And so we're doing the dot product between that column and that row, and that's what this looping is here. So i is going through all of the elements of the row and the column, multiplying and adding, and then putting that into the output. So that's what we do. We have a so-called kernel, and then we created something that would call the kernel by calling our kernel runner, passing in the function. We need our blocks and our threads per block, which are just dim3 tuples. And then we pass in our flattened data and any other information that's required. And so we can check that that matrix multiplication result, using these small sample versions, is close to the PyTorch version. And it is. Then we also looked at the CUDA version, and the CUDA version we created by pasting the kernel into ChatGPT, and it spat out something which we hardly had to change at all to get this. And the kernel runner also looks very similar, except that the syntax for calling a kernel in CUDA is different, with this weird triple angle bracket. To make life a little bit simpler for myself (you might remember before we had the CPP source, where we would copy and paste that line into a string), I got a bit bored of doing that manually, so I created a little getSignature function that just uses a regular expression to automatically find that line of code. And so for the rest of this lesson, I will be getting this CPP source automatically, and that way I don't have to worry about changing it. You can see that regex is just returning the necessary line of code plus a semicolon. So that makes life a little bit simpler, and I like being simple. Okay, so then we go load_cuda. It ran very quickly because I've already compiled this once and PyTorch caches that. And this is actually another change I made since last time: in the load_cuda function, if you don't pass in a name, then it uses the function's name.
And that means that PyTorch will cache different versions of your code under different names for the various different things you're doing, so you won't lose your cache each time. So that's handy as well. Okay, so now we can use the full matrices, because we're going to be nice and fast. We need them to be contiguous and on CUDA, so we create m1c and m2c for that. And the result should be similar to the result of PyTorch doing it. And it takes about six milliseconds. One thing I wondered about is how much of that six milliseconds was actually running the matrix multiplication, compared to doing all this other stuff before it. So I just commented out those two lines of code and reran it, and that took about 50 microseconds, so that's 0.05 milliseconds. So very little of this time is overhead; most of it is actually doing the matrix multiplication. So I think that's an encouraging start. Okay. So how do we take advantage of shared memory? The problem here is that in our inner loop here, m and n are global memory. And so in this loop that's happening k times, we are reading from global memory again and again and again. And that is a bit of a bummer. So there's a better thing we could do, which is instead we could use shared memory. Now the problem is that shared memory is quite small, so we can't just dump everything into shared memory for every single block, because we've got lots and lots of blocks running at the same time. If every one of them had the entire matrices in shared memory, that's going to be an enormous amount of data, and that's going to be far too much for our GPU to handle. So to deal with that, we do something called tiling. So we're going to pick a tile width of 16. So here where it says tile width, we're going to use 16.
We basically say, okay, if we're going to calculate this r comma c thing here. Instead of doing the entire dot product of all of this row and all of this column, what we could instead do is just grab the first little bit of that row and the first little bit of that column. We could take the dot product of those and put them into r comma c. And then we could do that again for the next tile across and the next tile across and so forth. And this is what this dot dot dot here is. And the next tile across. And so then eventually we get to this bit of the row by this bit of the column, take the dot product of those and add them up to the existing r comma c output we've already got. And so that's just it's doing exactly the same thing, but rather than doing the dot product all at once, we're doing it one step at a time. That's not interesting of itself, but what is interesting is you might notice that let's say for calculating this bit here, let's say, so this is thread zero comma zero, we can do the same thing. We can take the first little bit of this and the first little bit of this and take their dot product and that gives us the first piece we're going to need of that one. And we can do that again and again and again until eventually we get to this one and we do this bit times this bit and we keep going all the way to the end until there's a final tile at the end. And once we've done that for all of the bits, eventually we're going to have the correct answer in zero comma zero. Why is that interesting? Well, it's interesting because we could reorder this rather than doing the whole first little bit of this row and then the next bit of that row and the next bit of that row and the next bit of that row. Instead, what we could do is we could calculate zero comma zero for the first tile and then we could calculate zero comma one for the first tile and notice this, the zero comma one, it's exactly the same row as we had before, right, but a different column. 
Now, with normal CPU-style thinking, you'd say, oh, maybe this will be in the cache, so this could be faster. That doesn't work in GPU programming; in GPU programming, we instead use shared memory. So we could have put this into shared memory. And if we had done so, then the second time we use it, we don't have to read it from global memory; it's already there in shared memory. And then the same thing will happen when we get to the second row, right? We could put that into shared memory. And then the second row of that tile times the first column of that tile is needed to do one comma zero. And if you think about it, we've already accessed the first column of the tile in n. So if we put that in shared memory as well, then we won't have to get that from global memory either. So maybe you see where this is going. What we're going to be able to do, before we do any work, is put this whole tile into shared memory, and put this whole tile into shared memory, and then take the matrix multiplication of the two tiles. And that will give us all of the first pieces of the entire tile output. And then we'll do the same for the tile one to the right of this one and one underneath this one, and we'll take the matrix product of those and add it to this again, and so forth, until eventually we get up to here: we put that whole tile into shared memory, we put that whole tile into shared memory, we take the matrix product, which, remember, is just lots and lots of dot products, the column and row dot products. And so all of those are going to be able to use shared memory. And again, we add them to the outputs here. And so once we eventually do that for all of the tiles, we will have finished calculating these outputs. So how many times did we read from global memory? Each of the input elements only got read from global memory once.
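The key property the whole tiling scheme relies on is that a dot product can be computed one tile at a time, with the partial sums accumulated, and the answer comes out the same. A quick pure-Python check of that (this is an illustration, not the notebook's code):

```python
def dot(a, b):
    "Plain dot product of two equal-length vectors."
    return sum(x * y for x, y in zip(a, b))

def tiled_dot(a, b, tw):
    "Accumulate the dot product one tile (chunk of length tw) at a time."
    total = 0.0
    for ph in range(0, len(a), tw):      # ph indexes which tile we're on
        total += dot(a[ph:ph + tw], b[ph:ph + tw])
    return total

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [5.0, 4.0, 3.0, 2.0, 1.0]
assert tiled_dot(a, b, 2) == dot(a, b)   # same answer, tile by tile
```

Note the last chunk is shorter than the tile width; in the CUDA version that ragged edge is handled by padding the shared-memory tile with zeros instead.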
And as soon as we grabbed it, all we did with it was put it into shared memory, and then the actual dot product was entirely done from shared memory. And that's how we make this faster. So to do that, let's use plain Python. We're going to basically try to design something in Python that looks a lot like how CUDA is going to do it, and then we're going to auto-generate CUDA just like we have in the past. So in CUDA, the maximally flexible way to do things is what's called dynamic shared memory, where you tell CUDA how much shared memory you're going to want, and it sets it aside for you as basically one contiguous block, with a pointer to that block that you will have access to (which is the same as an array), and then you basically grab from that block any of the pieces you want. In Python, we can do exactly the same kind of thing by using a trick which is true for both NumPy arrays and PyTorch tensors, which is that views into those tensors are writable. So if we create a tensor of length five, and then we create views of the first three elements and of the last two elements, called b and c, if we then modify b, it actually changes a, because they're views of the same memory. And if we change c, it'll also change a. And so that's going to be handy for us; you'll see why in a moment. We're going to basically use our shared memory like that. Now, the thing is, we've got to restructure our kernel runner a little bit, because we have two steps now. Step number one is copy all of our input into shared memory, and then step two is take the dot product. And so that doesn't quite work with our previous approach, because we just had one big loop and one thing that we do. So I've changed our kernel runner to create a shared-memory kernel runner. I've still got the same loop through blocks.y, the same loop through blocks.x. This is all pure Python again. And here I'm going to create our shared memory.
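The writable-views trick described above can be illustrated with just the standard library (the notebook itself uses a torch tensor, but memoryview over an array shows the same aliasing behavior without needing NumPy or PyTorch installed):

```python
from array import array

a = array('f', [0.0] * 5)    # stands in for our flat shared-memory block
v = memoryview(a)
b, c = v[:3], v[3:]          # two views into the same underlying storage

b[0] = 1.5                   # writing through a view...
c[1] = 2.5
print(list(a))               # ...changes the original: [1.5, 0.0, 0.0, 0.0, 2.5]
```

This is exactly how one contiguous shared buffer gets split into the m-tile and n-tile halves: the two names are just views onto disjoint slices of the same memory.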
And so we're now going to pass the shared memory into each function, so all of our threads are going to have access to the same shared memory. Now, we don't actually create the threads here. So instead, step number one is, in my kernel, I'm actually going to do the loop through the threads manually. We'll improve this in a moment, don't worry. It's pretty messy with all this kind of duplicate code, but at least it's nice and simple to understand. So first of all, let's just run this and confirm we get the same answer as before. And we do. So let's see what's happening. The bit that does the running is exactly the same, except that I'm calling our new shared-memory runner. And I'm also telling it, as the third thing you have to pass in, the shared size: how much shared memory do we need? We need tile width times tile width, because that's the size of the tile, but we're going to need two of them, one for m and one for n. So the amount of shared memory we need is tile width times tile width times two. So that's what this is, tile width times tile width times two. So that's going to be passed in as the shared memory size, and that will be constructed here, torch.zeros. Okay. So that shared memory then gets passed into our kernel, our pretend kernel, and it's just one big contiguous block of memory. So we have to grab the first shared-size elements, and that will be our m shared memory (our two inputs are m and n), and everything from there onwards is going to be our n shared memory. So then what we do is we loop through; this is exactly the same as we had before. In fact, I should use cdiv here to make it a bit more obvious what's going on: cdiv. So we go through every element in the dot product we're going to need. And so the indexing starts to get a bit complicated here. So ph is what the book calls, and therefore we will call, the index of which tile we're up to. All right.
So we loop through each tile. The number of tiles we'll need is the size of the k dimension (that's the number of columns in m, or the number of rows in n) divided by the tile width, and that tells you how many tiles will fit. And we do a ceiling division to go all the way to the end. So then we need to know, let's say we're doing this r comma c one here again, we need to know where this is. Where does it start? And the answer is that we've done ph lots of tiles so far, and each one has jumped across tw, or tile width. So this distance here is ph times tile width, and we're going to call that idx. So this is an important tip: I found I had a lot of trouble getting this settled in my head until I drew it all out and wrote on my picture what everything is. I've been doing this with the help of my friend Karim, who works with me at answer.ai, and he found the same thing. Both our first attempts were to do it just in code, and we did not get it working until we actually started drawing it out. That's when Karim and I were like, oh, okay, that makes sense. So that's what idx is. And notice idx is that, but also, because these are symmetric, it's also that: that length is also idx. Okay. So now we need to fill in the shared memory. So we've got two sets of threads, one to fill in the shared memory and one to do the matrix product. Let's write that in: fill shared, then do the dot products from shared. Okay. So we need to go through all of our threads and find out what row and column we're in. So how do we find out what row and column we're in? Again, these are the things that get complicated. So this is r, as we've already mentioned. So r is going to be equal to, look, there's two pieces of it. There's the idx piece, which goes from here to here. And then there's also an additional piece, which is from here to here. What is that piece?
Well, that piece is simply the coordinates of this grid location within the tile. And so remember that we are looping through, so blockDim.y and blockDim.x is the size of the tile, right? So we've got tile row and tile column, and that's what that is. So this here is tile row, and this here is tile column. And so therefore, to find r, we have to add together idx plus tr. And here it is, idx plus tr. And that needs to be less than the second dimension of the matrix. And then here, we just need to index into it. So if this was a two-dimensional tensor, we could just do tr comma tc, but it's not, it's one-dimensional, so we have to flatten out our dimensions. So it becomes tr times tw plus tc. So this is filling in our m-shared and n-shared by going through all the possible elements of the tile and filling them all in. Okay. So after this bunch of loops is complete, ms and ns will simply contain a copy of the appropriate tile from m and n. And again, here the indexing we're doing is: this, remember, is the element that we're up to in terms of the column, and this is the row that we're doing, but we have to multiply by k in order to deal with the fact that we've flattened out our indexes. One thing to think about, that you might have been wondering, is what about the final tile that goes off the edge? It's not big enough, so what happens there? For that final tile, we put in zeros. We call that padding. And they show that in the book here. In this case they're doing, sorry, it's a three by three matrix multiplication using a two by two grid. So when we get to this piece here, it goes off the edge. So what happens when we go off the edge? We just put zeros into the shared memory.
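The index arithmetic above can be checked in isolation: the global row is idx + tr, where idx = ph * tw, and a (tr, tc) position inside a tile maps to tr * tw + tc in the flattened tile array. A tiny sketch with concrete numbers (variable names follow the lecture; the values are just an example):

```python
tw = 16          # tile width
ph = 2           # which tile we're on along the k dimension
tr, tc = 3, 5    # this thread's row/column within the tile

idx = ph * tw            # where this tile starts in the global matrix
r = idx + tr             # global row this thread reads from
flat = tr * tw + tc      # position in the flattened tw*tw shared tile

print(idx, r, flat)      # 32 35 53
```

Working through a couple of examples like this by hand (or, as suggested above, drawing the picture) is the quickest way to convince yourself the flattened indexing is right.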
And so that means that when we do the dot product between this one here containing zeros and this one here containing zeros, the zeros can just be ignored; they don't do anything, because they're just zeros. So that's why we put zeros in if we are outside the dimensions of the matrix, for both m and n. So now that has filled in our shared memory, or our pretend shared memory. I mean, it is shared memory; it's just not any faster, because we're just in Python. And so now that we've done that, we can go through all the threads again, find out what row and column we're in using exactly the same code, and then we can go through our tile width and aggregate all of the bits of our dot product. So why are we aggregating through tile width? Because the dot product will always be between tile width on this side and tile width on this side. So every row from here and every column from here will be of size tw. So that's why we do that. Okay, so that's that rather messy tiled matrix multiplication in Python. But this is a place to start. So if you don't understand anything, come back here, because you can run it through in the debugger, you can print out what the shared memory looks like, you can make sure you understand exactly what's going on, because it's plain Python. And so then all I did is I basically said, okay, well, effectively that is saying: run this bit of code as all the threads, and then run this bit of code as all the threads. So just to refactor this a little bit, I created a run_threads function that just says, okay, loop through all the threads and call some function. So with this function available, we can now change our loop so that instead of having to do this big for loop, we can just say run this function in every thread, and run this function in every thread. And so then those functions just contain the two pieces.
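A sketch of what the run_threads refactor might look like; this is an assumption about the helper's shape (the notebook's version may take extra arguments), but it captures the idea of simulating "all threads run this function" with a plain loop over thread coordinates:

```python
from collections import namedtuple

dim3 = namedtuple('dim3', ['x', 'y', 'z'], defaults=(1, 1))

def run_threads(f, blockdim, *args):
    "Simulate 'every thread runs f' by looping over every (row, col) coord."
    for i0 in range(blockdim.y):
        for i1 in range(blockdim.x):
            f(i0, i1, *args)

# Usage: call a function once per thread coordinate in a 2x2 block
seen = []
run_threads(lambda r, c: seen.append((r, c)), dim3(2, 2))
print(seen)  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

With this helper, the kernel body becomes two calls: run_threads(fill_shared, ...) then run_threads(dot_product, ...), which is exactly the two-phase structure the CUDA version will have.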
So this is now going to get a bit closer to what the CUDA code is going to look like. The CUDA code is going to have something that says go through each tile, then fill the shared memory using all the threads, and then do the dot product using all the threads. Okay, so this is identical to the last one; we've just refactored out the loops. So it's going to get a little bit closer to what the final CUDA code will look like. The thing that calls it is identical, and of course, therefore, the result is the same. Are there any questions so far? I think the question was about the relationship between blocks in CUDA and the tile size. Sure. Yeah, so in CUDA, a block, as we learned in the last one of these lectures, is just a conceptual thing that the CUDA programming model provides. It's just a bunch of numbers, basically, that are passed to your function as blockIdx. And you know that all of the threads in a block will be running on the same SM, on the same streaming multiprocessor. What you do with that information is entirely up to you, and last time we did nothing with it. This time we're taking advantage of it to say, okay, well, everything in a block has access to the same shared memory. So we decided that we will treat a block as something that is calculating one particular part of our output, a tile. So that's what we called it; we just called it a tile. So a tile is just a semantic thing that we're using. And by mapping that semantic idea of a tile to the CUDA programming model's idea of a block, and basically saying, okay, we're going to treat each block as a tile, it's going to allow us to use one block to calculate one tile in our output. And so therefore, we're going to have one set of shared memory for each block, which we're mapping to each tile in our output. So you can kind of think of them as being the same thing, but they're conceptually different.
Okay, so now we're going to make it even more CUDA-like, because actually CUDA code does not have a thing called run_threads; it doesn't look like this. Instead, in CUDA code there is no loop like this. Instead, all of these functions, across all of these possible i0 and i1 coordinates, are run at the same time. I mean, not necessarily at exactly the same time; they can all be at the same time, or some subset at the same time. But conceptually, for the programming model, you think of them as all running at the same time. To do that in Python, we have to use threads. Now, in real life, Python threads don't actually all run at the same time, except in certain situations, at least with the current version of Python, because there's a thing called the global interpreter lock; they actually run one after the other. But again, for the programming model, we can ignore that, so we're just going to pretend that they actually are in parallel. So to create threads, we use the Python threading library. It has a Thread class. And so let me show you a couple of interesting things here. I've got a function here called g that just prints whatever you pass it, then prints the negative of whatever you pass it, and then prints whatever you pass it times 10. I'm going to call g using a bunch of threads. One convenient way to create and run a bunch of threads is with a thread pool executor. So this is going to create three threads and run them at the same time, or as close to that as Python can manage. And so that threadpool.map will take all the numbers from one up to num and call our function g, so it'll call this three times. So if I just comment out these mysterious lines of code, you can see what it does: it runs all of the steps for the first thread, then all of them for the second thread, and then all of them for the third thread.
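A minimal sketch of the thread-pool pattern being described; the lecture's g prints its three values, but here it returns them instead so the behavior is checkable (the function body is an illustration, not the notebook's exact code):

```python
from concurrent.futures import ThreadPoolExecutor

def g(i):
    # Stand-in for the lecture's g: the value, its negative, and value * 10
    return (i, -i, i * 10)

num = 3
with ThreadPoolExecutor(num) as ex:          # a pool of three threads
    results = list(ex.map(g, range(1, num + 1)))

print(results)  # [(1, -1, 10), (2, -2, 20), (3, -3, 30)]
```

Note that map returns results in argument order regardless of which thread finished first; it's the interleaving of the *side effects* (the prints) that exposes the scheduling, which is what the lecture demonstrates.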
This is not going to work for us, because we want all of our threads to first complete the task of filling shared memory, and then all of them to complete the task of doing the dot product. So we need a way to tell a thread to stop until all of the threads are up to this point. And in Python, that thing is called a barrier. So we can create a barrier like so: we create a barrier so that until three threads have hit that barrier, they all stop. And so then we're going to pass that in as the sync barrier, sb. And so it's going to pause at this sync barrier until all the threads are here, and then pause at this sync barrier until all the threads are here. And now if you run it, you can see they all complete the first task, then they all complete the second task, and then they all complete the third task. And you'll see they don't necessarily do it in the same order, because threads can happen in any order. And so this is the trick which is going to allow us to have a single loop, in which everything in that loop first does the shared-memory filling-in task and then does the dot-product task. So here is our new kernel runner. As before, it goes through each block; as before, it creates our shared memory. And it's now going to create a synchronization barrier for the number of threads: threads per block y times threads per block x is how many threads there will be. And then we're going to create a whole bunch of threads, one for every y and one for every x. If you haven't seen this before in Python: if you have two things in a list comprehension, it does the Cartesian product of those. This will go through everything in y and everything in x, and so o and p will be our two coordinates. So we create a new thread; the function that it's going to call is whatever function you asked for, and the arguments are going to be the coordinates of the block and the coordinates of the thread.
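Here is a self-contained sketch of the barrier behavior: every thread finishes phase 1 before any thread starts phase 2. The event list is collected under a lock so the check is deterministic even though the per-thread ordering within a phase is not:

```python
import threading

n = 3
barrier = threading.Barrier(n)
events, lock = [], threading.Lock()

def worker(i):
    with lock:
        events.append(('phase1', i))
    barrier.wait()              # no thread proceeds until all n arrive here
    with lock:
        events.append(('phase2', i))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
for t in threads: t.start()
for t in threads: t.join()

# All phase-1 events come before any phase-2 event, in some thread order
print([p for p, _ in events])  # ['phase1', 'phase1', 'phase1', 'phase2', 'phase2', 'phase2']
```

This is exactly the role __syncthreads will play in the CUDA kernel: fill shared memory, wait, do the dot product, wait, move to the next tile.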
And then we'll say how many threads per block, pass in the shared memory, pass in the synchronization barrier, and any arguments you requested. And so now this looks, as you'll see, like actual CUDA code. We can figure out what our row and column is using exactly the approach we saw before, although now our row and column are actually going to be based on blockIdx and blockDim, because this is actually telling us whereabouts we are. The shared memory is exactly the same as before. And so again, we loop through all of our tiles, and again, we set the shared memory just like before. But you'll notice now we don't need two separate loops; we just do the set-the-shared-memory piece, and then we wait at the synchronization barrier. So remember that this is happening for every output value in the tile simultaneously, at least as far as the programming model is concerned. In fact, Python doesn't do a good job of actually parallelizing it, and in CUDA, we don't know for sure if they're happening at exactly the same time. But as far as the programming model is concerned, we should think of them as happening at the same time. So all of these different coordinates are running, conceptually, at the same time. And so when we hit wait here, that means that all of the threads have finished running those two lines of code, and so now we know that ms and ns are filled in for that tile. And so now we can go ahead and do the dot product. And then once every thread has done its own dot product, we need to stop and wait until they're all done, and once they're all done, we can go ahead and fill in the next tile's shared memory. It's very important to have this wait here, because if this wait wasn't here, then some of the threads would still be going ahead and doing the dot product while others were replacing the shared memory, and that would give you wrong answers.
So you have to wait after you've completed the shared-memory filling-in, and you have to wait after you've completed doing the dot product. Okay, this code's identical to before, and again, it's giving us the same answer, so that is a good sign. So here's the cool thing. I then took this code and I passed it into ChatGPT, and I said: convert the following Python code to CUDA C. And I pointed out that you can remove these from the argument list; we don't need those in the argument list. I mean, obviously, I could manually remove them, but I just thought if I have one prompt that always works, I don't have to do anything manually. I said: change sync_b.wait to __syncthreads. And I told it what to do for creating shared memory; we'll talk about all this in a moment. So basically I told it about the minor things that would have to change to convert the Python into CUDA C, and the thing it gave me worked first time. I did do some minor cleanups, but this is the code it created, after my minor cleanups. So you'll see it recognizes that we need float arrays for our input and output matrices; it's typed all of those things correctly. And in my cleanup I added a few things: the tile column is now threadIdx.x, the tile row threadIdx.y, and then we've got r and c just like before. Now, the way CUDA does shared memory is a little bit weird. It doesn't get passed in (just like threadIdx and blockIdx don't get passed in). You just have to put this magic incantation in, exactly one line of code, in your kernel. And so here it is, here's this one line of code: extern __shared__. Following that, you say what data type you want your shared memory to be, and then you say what you want to call it, and that it's an array. So this has created something called ms; it's just a pointer to the start of the shared memory that CUDA is going to create. That's what extern __shared__ means. So ms is a pointer to the start of the shared memory.
We need ns to be a pointer to the start of the second half of the shared memory, so we go in tile_width times tile_width, because that will skip past the M part of the shared memory — that's tile_width by tile_width — and the second half is the N part, also tile_width by tile_width. So remember, we did this in the Python version as well. If any of this is confusing, go back to the Python version and step through it in a debugger. So now we've got ms and ns as our shared memory, and then this is exactly the same as the Python version, but we've now got this __syncthreads(). __syncthreads() is identical to the sync_b.wait() we had before: it says wait until all of the threads are finished doing the previous lines of code before any of them are allowed to do the next one. Because this stuff is built into CUDA, we don't have to create a sync barrier object and pass it in and all that — you just have this magic line of code. So there's quite a bit of magic in CUDA like this, `extern __shared__` and __syncthreads(), but there aren't too many pieces, and you can see we're basically using them all here. So the next part is to call the kernel. When we call the kernel, we've got the normal triple angle brackets — blocks, threads per block — and then we pass a third argument inside the triple angle brackets, which is how much shared memory to create. That is what's going to be used automatically when it creates this shared memory that we get a pointer to here. How much shared memory should you create? Well, in this case I've commented out this section, so ignore that for a moment. For now, we're just going to do the same thing as before and make the tile width 16.
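To make the pointer arithmetic concrete, here is a pure-Python sketch of what that `extern __shared__` trick does: one flat buffer, with the M tile occupying the first tw*tw elements and the N tile the second tw*tw. The names (ms, ns, tw) follow the lecture; the helper functions are made up for illustration.

```python
tw = 4                              # tile width (illustrative)
shared = [0.0] * (2 * tw * tw)      # one flat buffer, like `extern __shared__ float shared[];`

# In CUDA C: float *ms = shared;  float *ns = &shared[tw * tw];
ms_off, ns_off = 0, tw * tw

def ms_get(r, c):                   # element (r, c) of the M tile
    return shared[ms_off + r * tw + c]

def ns_set(r, c, v):                # element (r, c) of the N tile
    shared[ns_off + r * tw + c] = v

ns_set(0, 0, 7.0)                   # writing the N tile...
assert shared[tw * tw] == 7.0       # ...lands just past the M tile's tw*tw floats
```

Indexing `r * tw + c` into a flat buffer is exactly the flattened 2D indexing the CUDA version has to do on the dynamically allocated shared memory.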
So the amount of shared memory we need, in bytes, is tile_width times tile_width for M, times two because we need the same for N, times sizeof(float), because we don't want a count of floats, we want bytes. That's the number of bytes of shared memory we need, and that's what we pass in. Okay, so that's basically that. And we can then run it and see that we get the same result, which is good. One thing, though, which is a bit of a worry, is that our time is actually slightly worse: it's gone from six milliseconds to six and a half milliseconds. We'll talk about that in a moment. I just want to mention one other thing: in the book they say, okay, you should write some function to calculate what size the shared memory should be, but they never say how to do that. In future lectures we'll be talking about how to think about things like this and how to design it correctly, but in this commented-out section here you can see the basic idea. This will work if you run it, even though it's not necessarily optimized. You can call this special function cudaGetDeviceProperties, passing in a structure to fill in — the `&` means a pointer to that structure, so it's like a reference to the structure to fill in — and I think this argument is the device number, if I remember correctly. It will return a structure containing a number of things, including max threads per block, and it'll also give you shared memory per block. So you can use that to dynamically figure out threads per block, and to dynamically figure out your tile width and so on. I'm not saying this is an optimized way to do any of those things; it's just indicative, showing you how you can get all the pieces you need to calculate them. In later lectures we'll learn more about how to actually figure out the optimal tile width and shared memory size and so forth, but for now I'm just using 16. Okay, so this is the mystery part. The mystery part is: this is slower, as we saw.
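As a quick sanity check, that byte count can be computed like this (a minimal sketch; `smem_bytes` is a made-up helper name, and `ctypes` just gives us sizeof(float) without hard-coding 4):

```python
import ctypes

def smem_bytes(tw):
    # tw*tw floats for the M tile, plus tw*tw for the N tile, times sizeof(float)
    return tw * tw * 2 * ctypes.sizeof(ctypes.c_float)

print(smem_bytes(16))  # 2 * 256 * 4 = 2048 bytes
```

For a tile width of 16 that's 2048 bytes, comfortably under the tens of kilobytes of shared memory per block that current GPUs report via cudaGetDeviceProperties.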
But if I take the exact same code — and by the way, this thing where we tell it at launch time what size to create is called dynamic shared memory allocation. If we don't use dynamic shared memory allocation, then we leave out the shared memory size here, and instead, if we know at compile time how big we want our tiles to be — I've tried both 32 and 16 — you can actually declare an ms of tw by tw and an ns of tw by tw. So you can have two separate arrays, and because we're deciding at compile time what our tile width is, this is not dynamically created. The rest of the code is the same, except now we've got proper two-dimensional indexing, which is nice. And this version is faster: we've gone down from six to five, and I think, if I remember correctly, when I tried a tile width of 16 it was a bit faster still, more like four. We'll have that running in the background — 16, okay, just recompiling that. Okay, any more questions before I move on? There's one question from G. Mesa, who asked why the size is tw times tw times two times sizeof(float). In the shared memory, we need to be able to store the tile for M and the tile for N. Each of those is tw by tw, and so we need two times tw by tw in order to have enough room for both of those input tiles. And then we use it here: we've got a pointer to the start of the M part and a pointer to the start of the N part. We also saw it in the Python version — the shared memory size we passed in was tw times tw times two, because we needed ms, the M shared memory tile, and ns, the N shared memory tile. Okay, thank you for the question. Another question: did you find some documentation or some reference for why the dynamic shared memory version is slower? Is it supposed to be this way? I'm a little bit surprised. No, it's a total mystery to me.
So maybe there's something wrong with my code, I don't know. This is something I think we should try to follow up on, and maybe some of our friends at Nvidia can tell me the dumb thing I did, because, you know, I'm a newbie to all this stuff, so I probably did something stupid. But I've looked around, I've read around, I've searched, I've asked ChatGPT — nothing so far has told me. I found some other people on the internet asking the same thing — why are my dynamic and static versions running at different speeds? — and I haven't found answers to any of those either. In theory it definitely should not be slower: they should be identical, they should generate exactly the same PTX code. So my guess is maybe I made a mistake in my code somewhere. If anybody figures this out, I will update the YouTube description to say what the answer turned out to be. Oh, hi there — Jeremy here with a message from the future. I figured out why that code was going slowly, and the reason is this tiny little bit here. The problem is that when TW, the tile width, is not known at compile time, it turns out that CUDA does not know how to create an optimized piece of code for a range of tile widths, so it falls back to the slowest possible version. I found a somewhat better solution than just supporting one constant tile width — you can skip over this if you're not interested, it's a bit more advanced — which is that you can use a C++ template and make the tile width a template parameter instead of a normal parameter. When you do this, you can only call it with a constant tile width, which is a bit ugly, but you can deal with that by supporting some fixed set of tile widths, and it will compile a separate version of the kernel for each one. So here I've got it doing an 8 and a 16 and a 32.
So here I've just got tile width equal to 16, but it's a variable, to show it's possible, and you could replace that with some code that calculates dynamically whether you want to make it 8 or 16 or 32, or you could add additional ones as well. And then there's a lambda — this is how C++ does lambdas, very ugly; it looks quite different to Python lambdas, as you can see. But basically this is a lambda that takes the function we're going to call, and we call that function with a particular kernel — kf is the kernel function. Anyway, lots of messiness there; it's pretty hideous code. And I guess this is where it gets pretty complicated when you actually want to have optimized kernels for a range of different hardware. The good news is that, at the moment at least, any reasonably modern Nvidia GPU supports exactly the same amount of shared memory, so maybe all this dynamic stuff isn't that necessary. Although, having said that, I do think you need to change the tile width depending on the size of the matrices you're using. So yes, I do think it's reasonably complicated to make this work well in lots of different situations, and I guess this is why there's a whole team of people at Nvidia who work on doing this in cuBLAS and cuDNN, so we don't have to worry about it. Anyway, I'm glad I got it figured out, and I will now return you to our scheduled programming. All right, so now I've got something really exciting to show, which is that we can do everything we've just seen in a different library called Numba. Numba is another way of writing CUDA code — a way where you actually write the CUDA code in Python. Here is the Numba code for our kernel. I haven't tried this before, but we can actually see how long it takes to run. Okay, that's interesting — this one is slower still.
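The shape of that template trick can be mimicked in Python: build one "specialized" function per supported tile width ahead of time, then dispatch on a runtime value from that fixed set. All the names here are made up for illustration, and the kernel body is just a placeholder, not a real matmul:

```python
def make_kernel(tw):
    # stands in for a kernel compiled with the tile width fixed at compile
    # time, like `template<int TW> __global__ void matmul(...)` in C++
    def kernel(n):
        return -(-n // tw)  # placeholder body: ceil-divide n by the tile width
    return kernel

KERNELS = {tw: make_kernel(tw) for tw in (8, 16, 32)}  # one per supported width

def launch(tw, n):
    # runtime choice over the fixed set, like a switch over template instantiations
    if tw not in KERNELS:
        raise ValueError(f"unsupported tile width: {tw}")
    return KERNELS[tw](n)

print(launch(16, 100))  # ceil(100 / 16) = 7
```

The point is that only a fixed menu of tile widths exists, each compiled with its width as a constant, which is what lets the compiler optimize each one fully.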
So again, maybe I'm doing something weird. This is using the dynamic shared memory approach: I've got two times tile width times tile width times — I just manually put four, which is how many bytes there are in a float. But still, it's running at CUDA speeds, which is good, even if it's not the full speed we were getting from CUDA C. Now, why would you do this? Because, I mean, if you look at the amount of code I have here, it's not less code than what I had before, so it's not easier to write. I've still got to use blockIdx, blockDim, threadIdx — all of these are now available in the numba.cuda module, and they're kind of globals available here. We can create our shared array here: if we say cuda.shared.array(0, ...) — this is actually quite a new thing in Numba — it does the dynamic approach. So when you call the kernel, rather than using triple angle brackets, you use square brackets, passing in the blocks, threads per block, the stream number (which we haven't learned about yet), and the dynamic shared memory size. And here is where it creates the array with that dynamic shared memory size; you tell it that you want the elements to be floats. And then we can do the exact same trick that we did before, grabbing our ms and ns. Instead of __syncthreads(), it's cuda.syncthreads(). So in some ways, I'd say, okay, this is not necessarily a big win. But there are a couple of things that do make it a big win. One I'll show you: we could just do something pointless like an a times one here. There we go — that should force it to recompile the kernel. Okay, run. There we go, done. It took less than a second to recompile the kernel. For some reason, which I don't fully understand, compiling Numba CUDA kernels is way faster than compiling C and C++ CUDA kernels. I've asked an Nvidia guy about this, and he was like, oh well, it's just how it is.
Sorry — there doesn't seem to be an obvious way to make the C/C++ version compile faster. So I actually think this is great for development: I can have actual CUDA running, and just change things and rerun very fast. I think that's very handy. The second thing that's very handy is that I don't have to flatten my tensors. The m and n being passed in here are actually m and n; the only thing I've done to them is wrapped them with as_cuda_array, which takes zero time — it's basically just changing the type. So you don't have to flatten, and I can use proper indexing notation, which is convenient. That's another nice thing. I can do things like .shape, so I don't have to pass in the h, k, and w, which again is quite nice. So there are some conveniences. But then I'm going to show you the most amazingly cool thing, which is that the Python threading thing we created back here — the one that simulates threads and simulates CUDA in Python — is fully built in to Numba. Everything that we recreated from scratch actually already exists. To use it, if you google for "numba cuda simulator", you'll see that if you set the environment variable NUMBA_ENABLE_CUDASIM to 1, that enables the CUDA simulator. The code is executed as normal, except that it's actually run on the CPU as pure Python code, just like ours was. So you can, for example, set a breakpoint or print stuff directly from Python. Now notice, because this is not running on CUDA, it's going to be slow — exactly as slow as our manual Python version, because this is just their manual Python version; or at least I think it's exactly as slow. So you still want to use much smaller subsets of your data. But this is a great way — in my opinion, the way to do real-world CUDA development: do it in Numba, do it all with NUMBA_ENABLE_CUDASIM set to 1 with small amounts of data until everything works.
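The simulator idea the lecture keeps coming back to can be sketched entirely in the standard library: one Python thread per CUDA thread in a block, with a threading.Barrier standing in for __syncthreads(). This is our own illustrative sketch, not Numba's or the lecture's actual code; the function and variable names are made up, and it is (deliberately) as slow as any pure-Python simulation:

```python
import math
import threading

def tiled_matmul(m, n, tw=2):
    # m is h-by-k, n is k-by-w, both plain lists of lists; tw is the tile width
    h, k = len(m), len(m[0])
    assert len(n) == k, "inner dimensions must match"
    w = len(n[0])
    out = [[0.0] * w for _ in range(h)]

    def run_block(by, bx):
        ms = [[0.0] * tw for _ in range(tw)]   # "shared" M tile
        ns = [[0.0] * tw for _ in range(tw)]   # "shared" N tile
        bar = threading.Barrier(tw * tw)       # stands in for __syncthreads()

        def run_thread(ty, tx):
            r, c = by * tw + ty, bx * tw + tx
            acc = 0.0
            for ph in range(math.ceil(k / tw)):
                mc, nr = ph * tw + tx, ph * tw + ty
                # fill the shared tiles, zero-padding outside the matrices
                ms[ty][tx] = m[r][mc] if r < h and mc < k else 0.0
                ns[ty][tx] = n[nr][c] if nr < k and c < w else 0.0
                bar.wait()                     # wait for the tile fill...
                for i in range(tw):
                    acc += ms[ty][i] * ns[i][tx]
                bar.wait()                     # ...and again before overwriting tiles
            if r < h and c < w:
                out[r][c] = acc

        threads = [threading.Thread(target=run_thread, args=(ty, tx))
                   for ty in range(tw) for tx in range(tw)]
        for t in threads: t.start()
        for t in threads: t.join()

    for by in range(math.ceil(h / tw)):
        for bx in range(math.ceil(w / tw)):
            run_block(by, bx)
    return out

print(tiled_matmul([[1, 2], [3, 4], [5, 6]], [[7, 8, 9], [10, 11, 12]]))
# [[27.0, 30.0, 33.0], [61.0, 68.0, 75.0], [95.0, 106.0, 117.0]]
```

Note that every thread hits both barrier waits in every phase, even threads whose output element falls outside the matrix — exactly the discipline real CUDA code needs, since __syncthreads() must be reached by all threads in the block.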
And you have to, by the way, set that environment variable before you import Numba — so you would have it before your import. And if you're using a notebook, you'd have to restart the kernel, change the environment variable, and then re-import Numba. Then you can set it to 0, and now the exact same code will be running on the GPU. And — I've tested this — if you then take your code and paste it into ChatGPT and say, please convert this into CUDA C code, for me at least it did it perfectly correctly first time. So I think this is a really useful way to combine all the stuff we've learned about from first principles. We've done it all from scratch, so we understand how it all works; but now, to implement it in practice, maybe the easiest way is actually to use Numba. Now, of course, you don't even need to convert it to C or C++ — you could just leave it in Numba. The challenge is that from a deployment point of view it might be a bit more tricky. With PyTorch, if you use our load_inline / load_cuda approach, the documentation explains how you can precompile that and provide a pip- or conda-installable package that people can use right away, without having the CUDA development toolkit installed. With Numba, that's not true. Having said that, if you install Numba from conda, it automatically installs all the toolkit pieces you need, so maybe that's okay. Anyway, these are some pros and cons to think about: maybe you just use Numba as is — maybe it's a little bit slower, so maybe that'll be a problem — or maybe you auto-convert it to CUDA C and actually use that in practice. Yeah, so I think that basically covers everything. Any more questions, or anything anybody wanted to add?
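For reference, that ordering constraint looks like this at the top of a script (a sketch; the numba import is left commented out here purely because the only point being made is that it must come after the assignment):

```python
import os

# Must be set before numba is imported anywhere in the process:
os.environ["NUMBA_ENABLE_CUDASIM"] = "1"   # "1" = simulate on CPU, "0" = real GPU

# import numba.cuda  # ...only after the variable is set; in a notebook,
#                    # restart the kernel first if numba was already imported
print(os.environ["NUMBA_ENABLE_CUDASIM"])
```

Numba reads the variable once at import time, which is why flipping it later in the same notebook session has no effect until the kernel is restarted.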
From my side, first of all, I must say this was a super fantastic session — I learned a lot, and I hadn't expected that you would go so deep into the material. I think it shows so many elements of CUDA development. It starts from those magic built-in variables that you normally just see — which can make CUDA a little inaccessible at first glance, because you just see those variables being added and multiplied and somehow the magic happens — and instead, as you showed, you start by drawing what the memory structures basically look like and what you really need to do, and then you can reference back to that. Because normally, on a first attempt, you get something wrong, and you have to look back at where the variables come from — that's why I always keep pen and paper beside me. So yes, marvelous. Thank you. Yeah, there are also a few things that stood out for me. If more people are interested in how to ship Numba kernels, that's something we can certainly take a look at. As far as I know, Numba did have an ahead-of-time compilation mode, which makes some of this easier; but because it's all Python, you can ship it anyway — you just have to accept the compilation cost. Yeah, the AOT, ahead-of-time, compilation is deprecated as of — what is it now — February 2024. They say they're going to replace it with something newer and better, but they haven't said what that is yet, so that's currently an outstanding question. They did say they won't remove the previous AOT approach until the new approach is working and everything. So hopefully, by the time people watch this video on YouTube in the future, maybe this will all be fully resolved.
So we have a question here from Jonas, who wants to know, if you have checked, how this compares to cuBLAS — the optimized CUDA library — or maybe also to the PyTorch matrix multiplication as a baseline? I haven't, because I'm not a CUDA expert — I actually don't know how to optimize all these shared memory sizes and tiles and so forth. I know that stuff like cuBLAS has a whole lot of... So actually, one thing you might have noticed is that I changed my input matrix sizes from last time. Last time I used the MNIST data and a pretend set of weights for MNIST, so the output was 50,000 by 10, and that really long, narrow matrix is particularly hard to optimize, because with a tile width of 32, for example, most of it is padding. So I used simpler, less awkward shapes this time, to make it a bit easier for me to think about. I think it'll be a fun exercise as we go further to see whether, for that one-layer MNIST model, we can create something that is close in performance to what cuBLAS or cuDNN does. Because, yeah, I found it quite tricky to think about how I would do all that automatically. There are things like TVM, which maybe one day we can look at together — I know Thomas Viehmann says he's used it, so maybe he can even help us there — which can automatically optimize these and a lot more details for you. And then also, sometime hopefully in the next month or so — so around late February or March 2024 — Mojo on GPUs should be becoming available, at least in a preview form, and that has an auto-tune functionality which might help us automatically find the best parameters as well. But for now, I didn't even bother checking, because I suspected I was going to be quite embarrassed at how much further there is to go.
I think this is a sort of opportunity for the community, because you have this notebook, which everybody can fork and play around with and try different stuff, make performance measurements. And I'm pretty sure somebody will report on things they've experimented with, and maybe we can look at the current status and see if we can come closer to what the state of the art in matrix multiplication basically is. I think today's session also laid the foundation for other stuff, like Tensor Cores, for example, from Nvidia — features built into the hardware specifically for assisting matrix multiplications. So, things we can look into in the future, maybe. Okay. So, for what it's worth, I think getting something cuBLAS-competitive is probably going to be very difficult, at least on server GPUs, things like A100s. I suspect that's not as true for consumer GPUs — potentially because that hardware gets less attention from Nvidia. Good point. So, actually, Mesa wants to know what PyTorch is currently using for matrix multiplications. PyTorch mostly still uses cuBLAS. There is an experimental path: if you use torch.compile, there's a flag for Triton matmuls — I think it's off by default — but that's something we're potentially interested in exploring. As of today, though, it's mostly still cuBLAS. The way you can tell, by the way, is that if you launch the PyTorch profiler, the function names will have specific signatures that say, okay, it's using cuBLAS, it's using Tensor Cores — so a profiler is where you'd really see the difference. And Jeremy, what you said regarding the speed of compilation — for me personally, this is super important when experimenting with things.
I was trying to optimize for this, and I think it's a big advantage if this is really much faster than recompiling the C++ version. Yeah, exactly — you need a quick iteration loop. So I think, between the CUDA simulator and the fast Numba CUDA JIT, it can make life a lot better. There were two more general questions regarding ChatGPT use. One was whether ChatGPT could be a possible solution for fusing multiple kernels — basically, where the limits are for what ChatGPT can do. I think you have to try it out — let's try it. That'd be an interesting thing for people to experiment with, wouldn't it? See how far we can go. I think in general people tend to underestimate what ChatGPT can do. So, yeah, why don't you try some things that you suspect might be a bit too hard for it, and we'll find out where the limits are. For example, the shared memory things — did it handle those automatically, or did you have to...? Well, when I converted the Numba code into CUDA, it automatically changed the Numba shared memory. In my prompt, I had something saying replace sync_b.wait() with __syncthreads(). I didn't try doing it without that — maybe it would have guessed. And there was another question regarding the general capabilities of ChatGPT for coding. What's your expectation for how this develops in the future — where will we still need to specifically prompt things? Personally, I haven't found it at all useful for anything remotely novel. I've found it really useful for using languages that I'm not that familiar with, or frameworks that I'm not that familiar with — it kind of gets me going — or for calling some API. I find it really good for that.
But for doing anything at all algorithm-related — anything other than just replicating a well-known algorithm — I've found it terrible. So at least for the kind of work I do, which is research-oriented, where most of what I write is pretty new, I haven't found it at all useful. So at least I think my work's safe for a while; I don't know about yours. Yeah, a little bit. I mean, I'm not an expert — you guys know a lot more than me. Triton's different in that, in Triton, you can have a matrix multiplication inside your kernel — you can have `@` inside your kernel — and Triton's really clever at optimizing these much more sophisticated kinds of things. Numba is a lot more basic, really: it's just a mapping of the CUDA C concepts directly to Python concepts. Triton is fairly recent — literally a PhD-thesis artifact — so it's doing something a lot more clever. The similarities are that Triton also works with a decorator and converts Python code to GPU code. I think they're good for different things, because Triton is somewhat limited as well: it's very, very smart, but it's not a mapping of the entire CUDA programming model. So, for example — you guys know it better — when Horace and those folks did the gpt-fast thing and wanted to do 4-bit quantization, you found you couldn't do it in Triton, because Triton doesn't have any way to express that at the moment. So yeah, I think they both have their place. Do you guys feel the same way? Yeah — so, we're going to have Charles, who is one of the authors of the gpt-fast kernels, come give us a talk in two weeks, so he can tell us firsthand. I think when people ask this question, the motivation is sort of: can you avoid learning CUDA?
And the answer to that seems to be: that's a negative. Yeah. And I'm not even sure it's easier to use Triton — I feel like Triton is more complex in many ways. Once you know CUDA, then you can find the places where Triton can help; I think trying to use Triton effectively, if you didn't know CUDA, would be an even harder path. Especially now that we've figured out these ways of doing iterative, notebook-based CUDA development — since one of the big benefits of Triton is just that it's a decorator and it's Python and so on. So, yeah. I like Triton very much too, though the running joke is that maybe only OpenAI has a fully working version of it internally, and only their team can really write kernels with it, because it changes so often in strange and unexpected ways. But also, I think its iteration speed is similar to Numba's — that's a good point in its favour. Okay. So, I think this was a wonderful session today. Thank you so much. Thank you so much for the opportunity to work on this. Bye, everybody. Thanks for joining.