Hi everybody, welcome back to lesson 12 of Practical Deep Learning for Coders. We've got a lot of stuff to cover today, so let's dive straight in. I actually thought I would start by sharing something I've seen getting a lot of attention recently, which is the CLIP Interrogator. The CLIP Interrogator is a Hugging Face Spaces, I guess, Gradio app, where I uploaded my image here, and it's output — let's just zoom in a bit — it's output a text prompt for creating a CLIP embedding from, I guess. I've seen a lot of folks on Twitter and elsewhere on the internet saying that this is producing the CLIP prompt that would generate this image. Generally speaking, the prompts it creates are rather rude. Mine's less rude than some, although "extremely long forehead" — maybe not, thanks very much. "Personal data avatar", "funny professional photo", I don't know what "tectonics" is meant to be here, "without eyebrows". So this doesn't actually return the CLIP prompt that would generate this photo at all, and the fact that some people are saying it does makes me realise that some people have no idea what's going on with stable diffusion. So I thought we might take this as an opportunity to explain why we can't do that, and what we can try to do instead. Let's imagine that my friend took a photo of himself, and he wanted to send me his photo, and he thought he would compress it a whole lot. So what he did was put it through the CLIP image encoder. That's going to take this big image and turn it into an embedding, and the embedding is much, much smaller than the image — it's just a vector of floats. My friend hopes he can send me this embedding, so he sends it over in an email and says: there you go Jeremy, there's the CLIP embedding of the photo I wanted to send you; now you just have to decode it to turn it back into a picture. So now I've got the embedding and I have to decode it. How would you do that? Well, you can't. Okay, we have a function here — let's call it f, the CLIP image encoder — which takes as input an image, which I'll call x, and returns an embedding. Does that mean that there is some other function — inverse functions we normally write with a minus one — an inverse function with which I could take that embedding, let's call it y, pass in y, and have it give us back our photo? And y, remember, is f(x). So to put it another way, this is f⁻¹(f(x)). An inverse function is something that undoes a function, and so that would give us back x. Is there an inverse function for the CLIP image encoder? Well, not everything has an inverse function. For example, consider a function, let's say in Python, that takes x and returns 0. Can you invert that function? You pass in 3, you get back 0. Is there a function that's going to take the output and give you back the input? No, of course not, because you just threw the whole thing away. So not all functions can be inverted, and indeed in this case we've started with something which is, whatever, 512 by 512 by 3, say, and we've turned it into something much, much smaller. I can't remember exactly how big a CLIP image embedding is, but it's much smaller. So clearly we're losing something. But what I could do is put it through a diffusion process. And remember, a diffusion process is something where we have taught — or rather, an algorithm has learned — to take some noise.
So we could start with some noise and we could start with an image embedding. We haven't done this before, but we could: we could train something that takes noise and an image embedding and removes a bit of the noise, and we could run that a bunch of times. It wouldn't give us back the original picture, but hopefully — if it's conditional, so remember, using the conditional diffusion approach — we'd get back something that might be something like our original image. So that's what diffusion is: diffusion is something that takes an embedding and approximately inverts an encoder, to give you back something that hopefully might have generated that embedding. Now of course, remember, we don't actually get image embeddings when we do prompts in stable diffusion. Instead, we have text embeddings. But if you remember, that doesn't actually matter, because do you remember how OpenAI trained CLIP: they had various pictures along with their captions, and they trained an algorithm that was explicitly designed so that each image returned an embedding similar to the embedding that the text encoder created for its caption. And remember, all of the pairs that didn't match were trained to be different. And so that means that a text embedding which describes this picture and the actual image embedding of this picture should be very similar, if they're CLIP embeddings — that's the definition of CLIP embeddings. So you see, this idea that you could take a text or image embedding and turn it back into the image perfectly makes no sense — inverting the embedding is the very thing we're only ever approximating here. And because what we're basically trying to do is invert the embedding function, these kinds of problems are generally referred to as inverse problems. So stable diffusion is something that attempts to approximate the solution to an inverse problem. So why does that mean that the CLIP Interrogator is not actually inverting the picture to give us back the text? Well, it's just as nonsensical: if we've got an image embedding, trying to undo it to get back to the picture, and trying to undo it to get back to a suitable prompt, are equally infeasible. Both of them require inverting an encoder, and that inverse just doesn't exist. The best we can do — or at least the best we know how to do at the moment — is to approximate it using a diffusion process. Okay, so that's why these texts that it spits back are fun and interesting, but they are not something you can put back into stable diffusion and have it generate the same photo. And the nice thing is that the code for this is actually available and you can take a look at it. Here's the app, and you'll see what it does is it has a big list of... let's have a look at some examples. It has big lists — for example, a big list of artists, a big list of mediums, a big list of movements, and so forth. It's got all these hard-coded pieces of text, and what it does is basically mix and match those various things together to see which ones work well. And it combines that with the output of something called the BLIP language model, which is not designed to give you an exactly accurate description of an image, but which has been specifically trained to give an OK-ish caption for an image. And it actually works reasonably well. But again, it's not the inverse of the CLIP encoder. So okay, that's how that all works.
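To make that contrastive-training idea a bit more concrete, here's a minimal sketch of a CLIP-style objective — not OpenAI's actual training code, just an illustration of "matching pairs get similar embeddings, everything else gets pushed apart":

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb[i] and txt_emb[i] come from the same image/caption pair
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature                   # pairwise similarities
    targets = torch.arange(len(img_emb), device=logits.device)   # matching pairs lie on the diagonal
    # pull matching pairs together, push mismatches apart, in both directions
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```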
So, where we had got to was that we had done matrix multiplication with broadcasting, where we broadcast an entire column from the right-hand matrix all at once. And that allowed us to get it down to the point where we only have one for loop written in Python. Generally speaking, we do not want to be looping through too many things in Python, because that's the slow bit. So the two inner loops we originally had — just to remind us, originally we were here, these two inner loops, looping through 10 and then through 784 respectively — have been replaced with a single line of code. So that was pretty great, and our time has improved by a factor of about 5,000: we're 5,000 times faster than when we started out. Another trick that we can use, which I'm a big fan of, is something called Einstein summation. Einstein summation is a compact notation for representing products and sums, and this is an example of one. What we're going to do now is replicate our matrix product with an Einstein summation, and believe it or not, the entire thing can be pushed down to just these few characters, which is pretty amazing. So let me explain what's happening here. The arrow separates the left-hand side from the right-hand side: the left-hand side is the inputs, the right-hand side is the output, and the comma is between each input, so there are two inputs. The letters are just names that you're giving to the number of rows and the number of columns. So the first matrix we're multiplying has i rows and k columns; the second has k rows and j columns. It's going to go through a process which creates a new tensor — actually, this version is not yet doing the matrix multiplication, this is the version without the sum. This one's going to create a new tensor that contains — well, how do we say it — i faces, k rows, and j columns. So a rank 3 tensor: the number of letters is the rank. And the rule for how this works is that if you repeat a letter between input arrays — so here are my inputs, ik and kj, and we've got a repeated letter, k — it means that values along those axes will be multiplied together. So each item across a row of the first will be multiplied by each item down each column of the second, to create this i by k by j output tensor. To remind you, our first matrix is 5 by 784 — that's m1 — and our second matrix is 784 by 10 — that's m2. So i is 5, k is 784, and j is 10. So if I do this torch.einsum, I will end up with an i by k by j tensor: it'll be 5 by 784 by 10. And if you have a look, I've run it here on these two tensors, m1 and m2, and the shape of the result is 5 by 784 by 10. What it contains is the original 5 rows of m1 and the original 10 columns of m2, and along the other dimension, the 784, they're all multiplied together, because that letter is repeated between the two arguments to the einsum. And so if we now sum up over that dimension, we get back — if we go back to the original matrix multiplication we did, 10.94, -0.68, etc. — with this Einstein summation version we've got back exactly the same thing. Because what it's done is it's taken each of these rows and columns, multiplied them together to get this 5 by 784 by 10, and then added up over the 784 for each one, which is exactly what matrix multiplication does. So far, though, we've only used the first of the two rules of Einstein summation.
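Here's a minimal sketch of that first rule in action — the rank-3 intermediate, followed by an explicit sum over the repeated letter (m1 and m2 here are just random stand-ins with the shapes from the lesson):

```python
import torch

m1 = torch.randn(5, 784)    # stand-in for the 5 validation rows
m2 = torch.randn(784, 10)   # stand-in for the weight matrix

# the repeated letter k means: multiply along that axis, keeping it in the output
mult = torch.einsum('ik,kj->ikj', m1, m2)
print(mult.shape)           # torch.Size([5, 784, 10])

# summing over the shared k dimension recovers the matrix product
res = mult.sum(dim=1)
torch.testing.assert_close(res, m1 @ m2)
```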
The second rule says that if we omit a letter from the output — the bit on the right of the arrow — the values along that axis will be summed. So if we remove this k, ik and kj goes to ij: we've removed the k entirely, which means that sum happens automatically. So if we run this, as you see, we get back matrix multiplication again. Einstein summation notation takes some practice to get used to, but it's very convenient, and once you're used to it, it's actually a really nice way of thinking about what's going on. As we'll see in lots of examples, you can often really simplify your code with just a tiny little Einstein summation. And it doesn't even have to involve a sum: you don't have to omit any letters if you're just doing products, so maybe it's a bit misnamed. So we can now define our matmul as simply this torch.einsum. If we now check it — test_close that the original result is equal to this new matmul — yes, it is. And let's see how the speed looks: 15 milliseconds. And that was for... oh, the whole thing. So compared to 600 milliseconds. As you can see, this is much faster than even the very fast broadcasting approach we used. So torch.einsum is a pretty good trick. Okay, but of course we don't have to do any of those things, because PyTorch already knows how to do matmul. There are two ways we can run matmul directly in PyTorch: you can use this special @ operator — so x_train @ weights is the same as matmul(x_train, weights), as you see with test_close — or you can say torch.matmul. And interestingly, as you can see here, the speed is about the same as the einsum, so there's no particular harm in doing the einsum. When I say einsum, that stands for Einstein summation notation. All right, let's go faster still. Currently we're just using my CPU, but I have a GPU, and it would be nice to use it. So how does a GPU work? An NVIDIA GPU — and indeed pretty much all GPUs — works by doing lots and lots of things in parallel, and you have to tell the GPU what all the things you want done in parallel are, one at a time. So what we're going to do is write, in pure Python, something that works like a GPU — except it won't actually run in parallel, so it won't be fast at all. The first thing we have to do, if we're going to get something working in parallel, is to create a function that can calculate just one thing, in such a way that even if a thousand other things are happening at the same time, it won't interact with any of them. And there's actually a very easy way to think about matrix multiplication in this way, which is: what if we create something which, just as we've done here, fills in a single item of the result? So how do we create something that just fills in row zero, column zero? Well, we could create a new matmul where we pass in the coordinates of the place we want to fill in — so we'll start by passing it (0, 0) — we pass in the matrices we want to multiply, and we pass in a tensor that we've pre-filled with zeros to put the result into. So we say: the result is torch.zeros(rows, columns); call matmul for location (0, 0), passing in those two matrices and the bunch-of-zeros matrix ready to hold the result. And if we call that, we get the answer in cell (0, 0). So here's an implementation of that.
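As a sketch, a single-cell version could look something like this (the names here are just illustrative; in the notebook the kernel is simply called matmul):

```python
import torch

def matmul_cell(grid, a, b, out):
    # fill in just one cell of the output; nothing outside (i, j) is touched,
    # so thousands of these could safely run at the same time
    i, j = grid
    if i < out.shape[0] and j < out.shape[1]:
        tmp = 0.
        for k in range(a.shape[1]):        # loop over the shared dimension
            tmp += a[i, k] * b[k, j]
        out[i, j] = tmp

m1, m2 = torch.randn(5, 784), torch.randn(784, 10)
res = torch.zeros(m1.shape[0], m2.shape[1])
matmul_cell((0, 0), m1, m2, res)           # fills in res[0, 0] only
```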
So in the implementation, first of all, we've been passed the (0, 0) coordinates, so let's destructure them — hopefully you've been experimenting with destructuring, because it's so important, you see it all the time — into i and j. That's the row and the column. We make sure that's inside the bounds of our output matrix, then we start at zero and loop through all of the columns of A — sorry, not the rows of A; the columns of A, which is the same as the rows of B — just like the very innermost loop of our very first Python attempt. And then at the end, pop that into the output. So here's something that fills in one piece of the grid successfully. We could call this rows-times-columns times, each time passing in a different grid location, and we could do that in parallel, because none of those different locations interact with each other. Something which can calculate a little piece of an output on a GPU is called a kernel, so we'd call this a kernel. And so now we can create something called launch_kernel. We pass it the kernel — that's the function — and, in this example, how many rows and how many columns there are in the output grid, and then any arguments you need to calculate it. In Python, *args just says that any additional arguments you pass are going to be put into a list called args; if you've used something like C, you might have seen variadic arguments or parameters — it's the same basic idea. So we're going to call launch_kernel, saying: launch the kernel matmul over all the rows of A and the columns of B, and then the args — the *args — are going to be m1, the first matrix, m2, the second matrix, and res, another torch.zeros we just created. So launch_kernel is going to loop through the rows of A, and then for each row of A it loops through the columns of B and calls the kernel — which is matmul — on that grid location, passing in m1, m2 and res. *args here is going to unpack those and pass them as three separate arguments. And if I run all of that, you'll see it's done it: it's filled in the exact same matrix. Okay, so that's actually not fast at all — it's not doing anything in parallel — but it's the basic idea. So now, to actually do it in parallel, we have to use something called CUDA. CUDA is a programming model for NVIDIA GPUs, and to program in CUDA from Python, the easiest way currently is with something called Numba. Numba is a compiler — you've actually seen it already for non-GPU code — a compiler that takes Python code and spits out compiled, fast machine code. If you use its CUDA module, it'll actually spit out GPU-accelerated CUDA code. So rather than using @njit like before, we now say @cuda.jit, and it behaves a little bit differently. But you'll see that this matmul — let me copy the other one over so you can compare it to our Python one — our Python matmul and this cuda.jit matmul look, I think, identical, except for one thing: instead of passing in the grid, there's a special magic thing called cuda.grid. You say how many dimensions your grid has, and you unpack it. It's just a little convenience that Numba does for you: you don't have to pass the grid in, it provides it for you, so it doesn't need this grid argument. Other than that, these two are identical. But the decorator is going to compile it into GPU code.
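Here's a rough sketch of the Numba version, including the launch details that come up next (the array sizes and the threads-per-block choice are just illustrative, and the variable names are mine):

```python
import math
import numpy as np
from numba import cuda

@cuda.jit
def matmul_kernel(a, b, out):
    # cuda.grid(2) gives this thread's (row, col) position in the output grid
    i, j = cuda.grid(2)
    if i < out.shape[0] and j < out.shape[1]:
        tmp = 0.
        for k in range(a.shape[1]):
            tmp += a[i, k] * b[k, j]
        out[i, j] = tmp

a = np.random.rand(5000, 784).astype(np.float32)
b = np.random.rand(784, 10).astype(np.float32)

# copy the inputs and an empty output over to the GPU
a_g, b_g = cuda.to_device(a), cuda.to_device(b)
out_g = cuda.to_device(np.zeros((a.shape[0], b.shape[1]), dtype=np.float32))

tpb = (16, 16)                                        # threads per block
blocks = (math.ceil(a.shape[0] / tpb[0]),
          math.ceil(b.shape[1] / tpb[1]))             # blocks per grid
matmul_kernel[blocks, tpb](a_g, b_g, out_g)           # launch the kernel
res = out_g.copy_to_host()                            # copy the result back to the CPU
```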
So now we need to create our output tensor just like before, and we need to do something else, which is take our input matrices and our output tensor and move them — I should say copy them — to the GPU. cuda.to_device copies a tensor to the GPU. So we've got three things being copied to the GPU here, and we store those three things over here. Another way I could have written this, which I quite like, is to map the function cuda.to_device over each of these arguments; that would be the same thing — it's going to call cuda.to_device on x_train and put it here, on weights and put it here, and on r and put it here. So that's a slightly more convenient way to do it. Okay, so we've got our 50,000 by 10 output, which is all zeros of course — that's just how we created it — and now we're going to try to fill it in. There's a particular detail that you don't have to worry about too much: GPUs don't just have a grid, there's also a concept of blocks, and there's something here we call tpb, threads per block. This is just a detail of the CUDA programming model; you can basically copy this, and what it's going to do is call each grid item in parallel across a number of different processes, basically. So this is just the code which turns the grid into blocks — you don't have to worry too much about the details, you just always run it. And so now, how do you call the equivalent of launch_kernel? Well, it's a slightly weird way to do it, but it works fine: you call matmul, but because matmul has @cuda.jit on it, it's got a special thing where you have to put something in square brackets afterwards — you have to tell it how many blocks per grid (that's just the result from the previous cell) and how many threads per block in each of the two dimensions. Again, you can just copy and paste this from my version. Then you pass in the three arguments to the function — this will be A, B and C — and that's how you launch a kernel. So this will launch the matmul kernel on the GPU, and at the end of it, the GPU copy of our result is going to be filled in. It's on the GPU, which is not much good to us, so we now have to copy it back to the CPU — which is called the host — with copy_to_host. We run that, and it's done, and test_close shows us that our result is similar to our original result. So it seems to be working, which is great. I see Seva on the YouTube chat is finding that it's not working on his Mac. That's right — this will only work on an Nvidia GPU, as nearly all of the GPU stuff we look at only works on Nvidia GPUs. Mac GPUs are gradually starting to get a little bit of support from machine learning libraries, but it's taking quite a while; it's got quite a way to go as I say this, towards the end of 2022. If it works for you when you're watching this later on, that's great. Okay, so let's time how fast that is. That was 3.61 milliseconds, and if we compare it to the PyTorch matmul on the CPU, that was 15 milliseconds. So that's great — it's faster still. So how much faster can we go? Oh, by the way, we can actually go faster than that: we can use the exact same code we had for the PyTorch op, but here's a trick — if you just take your tensor and write .cuda after it, it copies it over to the GPU.
That's assuming you're on an Nvidia GPU. Do the same for weights.cuda(), and these are our two CUDA versions. And now I can do the whole thing, and this will actually run on the GPU; then, to copy it back to the host, you just say .cpu(). So if we look to see how fast that is: 458 microseconds. (Somebody just pointed out that I wrote the wrong thing here, the 1e3 — okay.) So how much faster is that? Well, 458 microseconds, and the original broadcast version on the whole data set was 663 milliseconds, so compared to our broadcast version we are another thousand or so times faster. So overall, this version here compared to our original version, which was here, the difference in performance is 5 million times. So when you see people say, yeah, Python can be pretty slow, it can be better to run stuff on the GPU if possible — we're not talking about a 20% change, we're talking about a 5-million-times change. That's a big deal, and that's why you need to be running stuff on the GPU. All right. Some folks on YouTube are wondering how on earth I'm running CUDA when I'm on a Mac, given that it says localhost here. That's because I'm using something called SSH tunnelling, which we might get to sometime — I suspect my live coding from the previous course might have covered it already. But basically, you can use a Jupyter notebook that's running anywhere in the world from your own machine using SSH tunnelling, which is a good thing to look up. Okay. One person asks if Einstein summation borrows anything from APL. Yes, it does — well, it's kind of the other way around, actually: APL borrows it from Einstein notation. I don't know if you remember, but I mentioned that Ken Iverson, when he developed APL, was heavily influenced by tensor analysis, and Einstein notation is very heavily used there. A key thing that happens in Einstein notation is that there's no loop: there isn't this kind of sigma, i from here to here, where you then put the i inside the function you're summing — everything's implicit. And APL takes that a very long way, and J, which is what Ken Iverson developed after APL, takes it even further. This general idea of removing the index is very important in APL, and it's become very important in NumPy, PyTorch, TensorFlow and so forth. So finally, we know how to multiply matrices — congratulations. So let's practice that, let's practice what we've learned. We're going to go to the 02 mean shift notebook, and we're going to try to exercise our tensor-manipulation muscles in this section. The key end point for this is actually the homework: what you need to be doing is getting yourself to a point where you can implement something like this, but for a different algorithm. Why do we care about this? Because this is like learning your times tables if you're doing mathematics: it's the kind of thing that comes up all the time, and if you're not good at your times tables, a lot of other things — particularly at primary school and high school — get difficult: you get slower, it's frustrating, and you spend time thinking about these mechanical operations rather than getting your work done. It's important that when you have an idea about something you want to try, or debug, or profile, or whatever, you can quickly translate it into working code.
And the way that code is written for GPUs — or even for running fast on CPUs — is using broadcasting, Einstein notation, matrix multiplications and so forth. So you've got to practice; it's super important. And we're going to practice it by developing a clustering algorithm. The clustering algorithm we're going to work on is something called mean shift clustering, which hopefully you've never heard of before — and I say that because I just think it's a really fun algorithm that not many people have come across, and I think you'll find it really useful. So what is cluster analysis? Cluster analysis is very different to anything we've worked on in this course so far, in that there isn't a dependent variable that we're trying to match. Instead, we're just trying to find out: are there groups of similar things in this data? Those groups we call clusters. And as you can see from the wiki page, there are all kinds of applications of cluster analysis across many different areas. I will say that cluster analysis can sometimes be overused or misused: it's really best when your various columns are the same kind of thing and have the same kind of scale. For example, pixels are all the same kind of thing — they're all pixels. One of the examples they use is market research: I wouldn't use cluster analysis for socio-demographic inputs, because they're all different kinds of things, but the example they give here makes a lot of sense, which is looking at data from surveys, where you've got a whole bunch of one-to-five answers. Alright, so let's take a look. The way I like to build my algorithms is often to create some synthetic data that I know how I want to behave. So we're going to create six clusters, and each cluster is going to have 250 samples in it. First of all, I'm going to randomly create six centroids — a centroid is going to be, like, the middle of where each of my clusters is. I need n_clusters by 2, because I need an x and a y coordinate, and then I'm going to randomly generate data around those six centroids. To do that I'm going to call a little function I made here called sample, and I'm going to run it on each of those six centroids, and I'll show you what that looks like. So here's what that data looks like: the X's are the six centroids and the coloured dots are the data. If you were given this data without the X's, the idea would be to figure out where the X's would have been — what are these points clustering around? That's the goal here: to find out that there are a few distinctly different types of data in your data set. For example, for images I've used this before to discover that some images look completely different to all the other ones — for example, they were taken at night time, or they're of a different object, or something like that. So how does sample work? Well, we pass in the centroid, and each of those centroids contains an x and a y. Multivariate normal is just like normal: it's going to give you back normally distributed data, but with more than one variable — that's why it's multivariate. So we passed in two means, a mean for our x and a mean for our y, and that's the mean we're going to get, and our standard deviation is going to be 5.
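Here's a rough sketch of that data generation, assuming the shapes from the lesson (the random seed and the centroid range are just illustrative choices):

```python
import torch
from torch.distributions.multivariate_normal import MultivariateNormal

torch.manual_seed(42)
n_clusters, n_samples = 6, 250

# six random centroids, somewhere in roughly a -35..35 square
centroids = torch.rand(n_clusters, 2) * 70 - 35

def sample(m):
    # normally distributed points around centroid m; torch.diag builds the
    # (diagonal) covariance matrix, so the x and y spreads are independent
    return MultivariateNormal(m, torch.diag(torch.tensor([5., 5.]))).sample((n_samples,))

data = torch.cat([sample(c) for c in centroids])   # shape (1500, 2)
```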
So why do we pass torch.diag of (5., 5.)? That's because, for multivariate normal distributions, there isn't just one standard deviation for each column you get back — there could also be a connection between the columns; they might not be independent. So you actually need what's called a covariance matrix, not just a variance. We discussed that a little bit more in lesson 9b, if you're interested in learning more about it. So this is something that's going to give us back random columns of data with this mean and this spread, and this is the number of samples that we want. And this is coming from PyTorch: PyTorch has a whole bunch of different distributions you can use, which can be very handy. So there's our data. Okay, so remember, for clustering we don't know the different colours and we don't know where the X's are — our job is to figure that out. We might also briefly look at how to plot this. In this case we want to plot the X's and we want to plot the data, and it looks like this. All I do is loop through each centroid and grab that centroid's samples — they're all stored in order, so I grab from i times n_samples up to (i+1) times n_samples — and then create a scatter plot of those samples. And what I've done is create an axis here; you'll see later why we can also pass one in, but I'm not passing one in here. So we create a plot and an axis, and in matplotlib you can keep plotting things onto the same axis. Then at each centroid I plot a big X, which is black, and a smaller X which is — what is that — magenta, and that's how I get these X's. So that's how plot_data works. Okay, so how do we create something that starts with all the dots and returns where the X's are? We're going to use a particular clustering algorithm called mean shift. Mean shift is a nice clustering approach because you don't have to say how many clusters there are — and it's not that often that you're actually going to know how many clusters there are. Quite a few approaches, like the very popular k-means, require you to say how many; instead, we just have to pass in something called a bandwidth, which we'll learn about, and which can actually be chosen automatically. It can also handle clusters of any shape: they don't have to be ball shaped like they are here, they can be L-shaped or ellipse-shaped or whatever. And here's what's going to happen. We're going to pick some point — let's say we pick that point just there. What we do is go through each data point — so we'll pick the first one — and we find the distance between that point and every other point. So we're going to say: what is the distance between that point and that point, and that point, and that point, and also the ones further away, that point and that point — you do it for every single point compared to the one we're currently looking at. So we get all of those as a big list. And now what we're going to do is take a weighted average of all of those points. Now, that's not interesting without the weighting: if we just took a plain average of all of the points, we'd end up somewhere here — this is the average of all the points — which isn't what we want. The key is that we need an average that is weighted by how far away things are.
So for example, this one over here is a very long way away from our point of interest, and so it should have a very low weight in the weighted average; whereas this point here, which is very close, should have a very high weight. So what we do is create weights for every point, relative to the one we're currently interested in, using what's called a Gaussian kernel, which we'll look at. The key thing to know is that points further away from our point of interest — which is this one — are going to have lower weights; that's what we mean when we say they're penalised. The rate at which the weights fall to zero is determined by this thing that we set at the start called the bandwidth, and that's going to be the standard deviation of our Gaussian. So we take an average of all the points in the data set — a weighted average, weighted by how far away they are. So for our point of interest, this point's going to get a big weight, this point's going to get a big weight, this point's going to get a big weight, that point's going to get a tiny weight, and that point's going to get an even tinier weight. So it's mainly going to be a weighted average of the points that are nearby, and the weighted average of those points, I would guess, is going to be somewhere around about here. And we'd have a similar thing for the weighted average of the points near this one — that's probably going to be somewhere around about here, or maybe over here. And so it's going to move all of these points in closer; it's almost like gravity — they're going to be moved closer and closer in towards this kind of gravitational centre, and these ones will go towards their own gravitational centre, and so forth. Okay, so let's take a look at it. So what's the Gaussian kernel? This is the Gaussian kernel, which was on a sign in the original March for Science, back in the days when the idea of not following scientists was considered socially unacceptable — we used to have marches for these things, if you remember. So, this is not normal. This is the definition of the Gaussian kernel, which is also known as the normal distribution; this is the shape of it, I'm sure you've seen it before. And here is that formula, copied directly off the Science March sign — see, the square root of 2 pi, etc. — and this here is the standard deviation. Now, what does that look like? It's very helpful to have something with which we can very quickly plot any function. It doesn't come with matplotlib, but it's very easy to write one: just say, as x, let's use all the numbers from 0 to 10, 100 of them, spaced evenly — that's what linspace does, 100 linearly spaced numbers in this range — and those are going to be our x's. Then plot those x's against f of the x's as the y's. So here's a very nice little plot_func, and here it is. As you can see, we've now got something where, if you're very close to the point of interest, you're going to get a very high weight, and if you're a long way from the point of interest you'll get a very low weight. So that's the key thing we wanted: something that penalises further-away points more. Now, you'll notice I've managed to plot this function for a bandwidth of 2.5, and the way I did that was using this special thing from functools called partial.
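A minimal sketch of those pieces — the Gaussian kernel (as a function of distance d and bandwidth bw), a quick plotting helper, and the partial trick that comes up next:

```python
import math
import torch
import matplotlib.pyplot as plt
from functools import partial

def gaussian(d, bw):
    # the normal-distribution formula: bw plays the role of the standard deviation
    return torch.exp(-0.5 * (d / bw)**2) / (bw * math.sqrt(2 * math.pi))

def plot_func(f, start=0., end=10., steps=100):
    # quickly plot any one-argument function over a range
    x = torch.linspace(start, end, steps)
    plt.plot(x, f(x))

plot_func(partial(gaussian, bw=2.5))   # partial fixes bw, leaving a one-argument function
```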
Now, the first thing to point out here — it drives me crazy: I often see people trying to find out what something is in Jupyter, and the way they do it is to scroll up to the top of the notebook and search through the imports to try to find it. That is the dumb way to do it. The smart way is just to type it and press Shift-Enter, and it will tell you where it comes from; you can get its help with a question mark and its source code with two question marks. So just type it to find out where it comes from. Okay, so this is — as Seva has mentioned in the chat — also known as currying, or partial function application. It creates a new function. So let's just grab it: we create a new function, and this function f is the function gaussian, but it's going to automatically pass bw=2.5. So this is a partially applied function, and I could type f(4), for example; that's going to be a tensor — there we go — and you can see that's exactly what this is: go up to 4, go across, yep, about 0.044. We use partial function application all the time; it's a very, very important tool. Without it, for example, plotting this function would have been more complicated — with it, it was trivially easy. I guess one alternative, which would be fine but slightly more clunky, is that we could create a little function inline: we could have said, plot a function that I'm going to define right now, a lambda x which is gaussian of x with a bandwidth of 2.5. You could do that too, it's fine, but partials I think are a bit neater, a bit less to think about, and they often produce neater and clearer code. Okay, so why did we decide to make the bandwidth 2.5? As a rule of thumb, choose a bandwidth which covers about a third of the data. So if we found ourselves somewhere over here, a bandwidth which covers about a third of the data would be enough to cover two clusters, ish — it would be about this big, somewhere in the middle there. That's the basic idea, but you can play around with bandwidths and get different numbers of clusters. I should mention that often, when you see something that's on the complicated side, like a Gaussian, you can simplify things. Most implementations and write-ups I've seen talk about using Gaussians, but if you look at the shape of it, it looks a lot like this shape — a triangular weighting, which is just a linear function with clamp_min — and it occurred to me that we could probably use this just as well. So I decided to define this triangular weighting, and then we can try both. We'll start with the Gaussian version. We're going to be literally moving all the points towards their kind of centre of gravity, so we don't want to mess up our original data: we clone it — that's a PyTorch thing, .clone(). Big X is our matrix of data, and little x will be our first point; it's pretty common to use capital letters for matrices, so this is our data and this is the first point. Okay, so there it is: we're going to start at 26.2, 26.3 — so somewhere up here. Little x's shape is just a rank 1 tensor of shape 2; big X is a rank 2 tensor of 1500 data points by 2, the x and the y. And if we call x[None], that would add a unit axis to it, and the reason I'm showing you that is because we want to find the distance from little x to everything in big X, and the way we do a distance is with minus. But would you be
able to just go x minus big X and get the right answer? Let's think about it: x.shape — oh, we've got that already. Oh, no, actually, that is going to work, isn't it? So, yes. Alright, so you can see why we've got these two versions here. If we do x[None], we've got something of shape (1, 2), and we can subtract that from something of shape (1500, 2), because the 2's match up — they're the same — and the 1500 and the 1 match up because, remember our numpy rules, everything matches up against a unit axis. So it's going to copy this across every row of that matrix, and it works. But do you remember there's a special trick: if you've got two shapes of different lengths, we can use the shorter one, and broadcasting will add unit axes to the front to make it as long as necessary. So we actually don't need the x[None]; we can just use little x, and it works, because it's going to say: is this compatible with this? Well, the last axes — remember, we go right to left — match; then the second-last axis of x doesn't exist, but broadcasting understands it as a unit axis, and it does exactly the same thing as the x[None] version. If you have not studied the broadcasting from last week carefully, that might not have made a lot of sense to you, so definitely, at this point, you might want to pause the video, go back and reread the numpy broadcasting rules from last time, and practice them — because that's what we just did, and we'll use the numpy broadcasting rules thousands more times throughout the rest of the course, and many more times in this lesson alone. Okay, so now I think it's a pretty good place to have a pause, so I'll see you back here in nine minutes. Hi everybody, welcome back. So, we had got to the point where we had managed to get the distance between our first point, x, and all of the other points in the data, and we're just looking at the first eight of them here. The very first distance is of course zero on the x axis and zero on the y axis, because it is the first point. The other thing is that, because of the way we created the clusters, they're all next to each other in the list, so these first eight are all in the first cluster, and none of them are too far away from each other. So now that we've got the differences on x and y, it's easy enough to get the distance — the Euclidean distance: we can just square those differences, sum, and take the square root. And actually, maybe this is a good time to talk about norms, and about what we just did. We've got all these data points — here's one of our data points and here's another one — and there's some distance across the x axis and some distance along the y axis, so we could call those the change in x and the change in y. One way to think about the distance between them is this distance here, and to calculate that we can use Pythagoras: a squared plus b squared equals c squared, where this would be c, and these would be a and b, say. So in our case it would be the square root of the change in x squared plus the change in y squared — and instead of saying square root, we could say "to the power of a half", which is another way of saying the same thing. But there's a different way we could measure distance: we could first go along here and then go up here, and that one would be the change in x to the power of one, plus the change in y to the power of one, all to the power of one over one. I'm writing it in a slightly odd way for reasons you'll see in a moment. More generally, if we've got a whole list of
numbers — say some list v — we can take each one to the power of some number alpha, add them up, and take that sum to the power of one over alpha. This thing is called a norm. You might remember we came across norms last week, and we've come across them again this week; they basically come up all the time — they might end up coming up every week — particularly the two norm, which we could write like this, or like this, or like this; they're all the two norm, and it's just this equation for alpha equals two. And Stefano is pointing out that we should actually have an absolute value in there. I wasn't going to worry about that, since we're just doing real numbers here and keeping things simple — oh well, I guess for alpha higher than one... no, you're probably right: for something like alpha equals three we do need the absolute value, because the distance has to be positive, so the change in x should be the absolute value of that difference. Yes, thank you Stefano — we'll have the absolute value. Okay, so the two norm is what happens when alpha equals two, and in this case we would call it the Euclidean distance. But where it comes up more often is when you're doing a loss function: the mean squared error — well, the root mean squared error, I should say — is basically the two norm, whereas the mean absolute error is the one norm, and these are also known as L2 and L1 loss. And remember what we saw in that paper last week — we saw it in this form, with a two up here, which is where they got rid of the square root again, so that would have just been the change in x squared plus the change in y squared, and now we don't even need the parentheses. So all of this is to say that this comes up all the time, because we're very often interested in distances and errors and things like that. I'm trying to think — I don't feel like I've ever seen anything other than one or two used, so although it's a general concept, I don't think we're going to see values other than one or two in this course; I'd be excited if we do, that would be kind of cool. So here we're taking the Euclidean distance, which is the two norm, and this one's got eight things in it (the first eight we're looking at), each a single number because we've summed over dimension one. So here's your first homework: rewrite this using torch.einsum. You won't be able to get rid of the x minus X — you'll still need that — and you won't be able to get rid of the square root either, but when you've got a multiply followed by a sum, you should be able to replace the multiply and the sum with a single torch.einsum. So we're summing over dimension one, which is this dimension — in other words, we're summing over the x and y components. Okay, so now we can get the weights by passing those distances into our Gaussian, and as we'd expect, the biggest weights get up to about 0.16: the closest one is itself, so it gets a big weight, these other ones get reasonable weights, and the ones that are in totally different clusters have weights small enough that, at three significant figures, they appear to be zero. So we've got our weights — they're a 1500-long vector — and of course our original data is 1500 by 2, the x and the y for each one. So we now want a weighted average: we want the average of this data, weighted by these weights.
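As a quick recap of those single-point steps in code (a sketch, assuming X is the cloned (1500, 2) data and gaussian is the kernel from before):

```python
x = X[0]                               # the single point of interest, shape (2,)
dist = ((x - X)**2).sum(1).sqrt()      # Euclidean distance from x to every point, shape (1500,)
weight = gaussian(dist, 2.5)           # nearby points get big weights, far-away ones are ~0
```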
Normally, an average is the sum of your data divided by the count. In a weighted average, each item in your data — let's put some i subscripts here just to be clearer — gets its own weight, so you multiply each one by its weight, and rather than dividing by n, which is just the sum of a bunch of ones, we divide by the sum of the weights. So this is an important concept to be familiar with: weighted averages. So we need to multiply every one of these X's by this. Okay, so can we say weight times X? No. Remember, we go right to left: first it's going to look at the 2 and try to match it against the 1500 — are they compatible? Things are compatible if they're equal or if at least one of them is one; these are not equal and neither of them is one, so they're not compatible. That's why it says the size of tensor a must match — and when it says match, it doesn't mean they have to be the same; one of them can be one. That's what it means to match: they're either equal, or one of them is one. So that doesn't work. On the other hand, what if this was (1500, 1)? Then they would match, because the one and the two match — one of them is a unit axis — and the 1500 and the 1500 match because they're the same. So that's what we're going to do, because that would then copy this weight to every one of these, which is what we want: a weight for each of these x, y tuples. To add the trailing unit axis we say [:, None] — every row, and then a trailing unit axis — and that's what that shape looks like. So we can now multiply that by X and, as you can see, it's now weighting each of them — each of these x's and y's down the bottom are essentially zero — so we can sum that up and then divide by the sum of the weights. So let's now write a function that puts all this together. You can see this really important way of working — to me, the only way that makes sense for scientific, numerical programming (I actually do all my programming this way): write it all out step by step, check every piece, have it all there documented for you and for others, and then copy the cells, merge them together, indent them (to indent, it's Ctrl and the right square bracket), and put a function header on top. So here are all those things we just did, and now, rather than just grabbing the first x, we enumerate through all of them: that's the distance we had before, that's the weight we had before, there's the product we had before, and then finally we sum across the rows and divide by the sum of the weights. So that calculates, for the ith point — it's actually changing capital X, the ith thing in capital X — so that it's now the weighted sum, sorry, the weighted average, of all of the other data, weighted by how far away it is. So that's a single step. The mean shift update is then extremely straightforward: clone the data, iterate a few times, and do the update. If we run it, it takes about 600 milliseconds, and what I've done is plot the centroids moved by two units, so that you can see them. The dots are where our data is — they look like single dots now because every data point is sitting on top of the others in its cluster — and you can see they are now in the correct spots. So it has successfully clustered our data, which is great news.
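Putting those pieces together, here's a sketch of roughly what the update step and the loop look like (assuming the gaussian function and the data tensor from above; one_update modifies X in place):

```python
def one_update(X):
    # move every point towards the weighted average of all the points,
    # weighted by how close they are to it
    for i, x in enumerate(X):
        dist = ((x - X)**2).sum(1).sqrt()
        weight = gaussian(dist, 2.5)
        X[i] = (weight[:, None] * X).sum(0) / weight.sum()

def meanshift(data):
    X = data.clone()           # don't mess up the original data
    for it in range(5):        # a handful of iterations is plenty here
        one_update(X)
    return X
```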
So we can now test our hypothesis: could we use the triangular weighting just as well as the Gaussian? Ctrl-/ comments and uncomments, and yep — we get exactly the same results, so that's good. It's really important to know all these keyboard shortcuts; hit H to get a list of them. Some things that are really important don't have keyboard shortcuts, so if you click Help, Edit Keyboard Shortcuts, there's a list of all the things Jupyter can do, and you can add keyboard shortcuts to the things that don't have them. For example, I always add keyboard shortcuts to "run all cells above" and "run all cells below" — as you can see, I type Q then A for above, and Q then B for below. Alright, now that was kind of boring in a way, because it did five steps and we just saw the end result. What did it look like one step at a time? This isn't just fun — it's really important to be able to see things happening one step at a time, because there are so many algorithms we write which are updating weights or updating data; if it's stable diffusion, for example, you're very likely to want to show the incremental denoising, and so forth. So in my opinion it's important to know how to do animations, and I found the documentation for this unnecessarily complicated, because a lot of it is about how to make them performant, and most of the time we probably don't care too much about that. So I want to show you a little trick — a simple way to create animations without any trouble. matplotlib.animation has something called FuncAnimation; that's what we're going to use. You create a function, and then you call FuncAnimation, passing in the name of that function and saying how many times to run it — that's what the frames argument is: run this function this many times, and create an animation that basically contains the results, with a 500 millisecond interval between each frame. So what's this do_one going to do to create one frame of animation? We call our one_update — here it is, one_update — which updates our data; then we've got an axis, which we created here, so we clear whatever was on the plot before and plot our new data on that axis. The only other thing you need is that the very first time it's called, we want to plot the data before running an update. d is automatically passed the frame number, so for the zeroth frame we don't do the update, we just plot the data as it already is. I guess another way we could have done that would have been just to say "if d, then do the update" — I suppose that should work too, maybe it's even simpler; let's see if I've just broken it. So: we clone our data, we create our figure and our subplots, we call FuncAnimation, calling do_one five times, and then we display the animation. And let's see: HTML takes some HTML and displays it, and to_jshtml creates some HTML — that's why it's created this with JavaScript. And so we click run: one, two, three, four, five — there's the five steps — and if I click loop, you'll see them running again and again. Fantastic. So that's how easy it is to create a matplotlib animation, and hopefully now you can use that to play around with some fun stable diffusion animations as well. You don't just have to use to_jshtml: you can also create movies, for example — to_html5_video would be another option — and you can save an animation as a movie file. So there are all these different options, but hopefully that's enough to get you started.
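A minimal sketch of that animation recipe, assuming the one_update from above and the plot_data helper described earlier (its exact signature here is just an assumption — use whatever you wrote):

```python
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

X = data.clone()
fig, ax = plt.subplots()

def do_one(d):
    # frame 0 shows the data as it is; later frames do one mean shift step first
    if d:
        one_update(X)
    ax.clear()
    plot_data(centroids, X, n_samples, ax=ax)   # hypothetical signature for the plotting helper

ani = FuncAnimation(fig, do_one, frames=5, interval=500)
plt.close()                                      # avoid also displaying a static figure
HTML(ani.to_jshtml())
```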
So for your homework, I would like you to take your k-means or whatever and try to create your own animation, or create an animation of some stable diffusion thing that you're playing with. And don't forget this important ax.clear(): without the ax.clear(), it plots on top of the last frame — which, to be fair, is sometimes what you want, but in this case it's not what I wanted. Alright, so this is kind of slow: half a second for not that much data. It would be nice if it was faster. Well, the good news is we can GPU-accelerate it; the bad news is it's not going to GPU-accelerate that well, because of this loop — this is looping 1500 times, and looping isn't going to run on the GPU. The best we could do with this would be to move the body of the loop to the GPU, and the problem is that calling something on the GPU 1500 times from Python is a really bad idea, because there's a huge communication overhead in that flow of control and data switching back and forth between the CPU and the GPU — it's the kernel launch overhead, and it's bad news. So you don't want a really big Python loop that calls CUDA code inside it; we need to make all of this run without the loop, which we can do with broadcasting. So let's roll up our sleeves and try to get the broadcast version of this working. Generally speaking, the way we tend to do things with broadcasting on a GPU is that we create batches, or mini-batches — we normally just call them batches nowadays — so we create a batch size. Let's say we're going to use a batch size of 5, so we're going to do 5 points at a time. Alright, so how do we do 5 at a time? This is only doing 1 at a time. As before, let's clone our data, and this time little x, for our testing — we're going to do everything ahead of time, as we always do — is not x[0] any more but x[:bs], the first 5 items. So little x is now a 5 by 2 matrix: this is our mini-batch, the first 5 items. As before, our data itself is 1500 by 2. So we need a distance calculation, but previously our distance calculation only worked if little x was a single point, and it returned just the distances from that point to everything in big X. Now we need something that's going to return a matrix. We've got 5 by 2 in little x, and in big X we've got something much bigger — not to scale, obviously — 1500 by 2. What is "the distance" between these two things? Well, if you think about it, there's going to be a distance between item 1 and item 1, there's also going to be a distance between item 1 and item 2, and — let's use a different colour for the next one — between item 2 and item 1, and so on. So the output of this is actually going to be a matrix. It doesn't matter which way around we do it; if we do it this way around, then for each of the 5 things in the mini-batch there will be 1500 distances — the distance to every point. So we're going to need broadcasting to do this calculation. This is the function that we're going to create, and as you can see, it creates a 5 by 1500 output — but let's see how we get there. Can we do x minus X? No, we can't. Why is that? Because big X is 1500 by 2 and little x is 5 by 2; so it's going to look at — remember our rules, right to left — are these compatible? Yes, they are, they're the same. Are these compatible? No, they're not, because they're different, so that's not possible to do. What if, though,
we insert an axis at the start of big X, and in little x we add an axis in the middle? Now these are compatible — I should really use arrows — the last axes are the same, these are compatible because one of them is a 1, and these are compatible because one of them is a 1 as well, so they're all compatible. And what it's going to do is do the subtraction between them directly: this is going to be copied across these 5 rows, and this will be copied across those 1500 rows, because that's what broadcasting does — I mean, it's not really copying, but it's effectively copying. So we can now subtract them, and that gives us what we wanted, which is 5 by 1500 — and also by 2, because there's both the x and the y. So that's why this works; that's what this is doing here. It's squaring them, then summing over that last, shortest axis — summing over the x and the y — and then taking the square root. I don't know why I wrote torch.sqrt; we could just put .sqrt() at the end, same thing. In fact, it's worth mentioning that most things you can do on tensors you can either write as torch.something, as a function, or write as a method; generally speaking both should be fine — not everything, but most things work in both spots. Okay, so now we've got this matrix, which is 5 by 1500, and the nice thing is that our Gaussian kernel doesn't have to be changed at all to get the weights, believe it or not. The reason for that — now, how do we get the source code? I could scroll back up there, or I can just type gaussian with two question marks and see it — is that this is just a scalar, so it broadcasts over anything, and this is also just a scalar, so it's all going to work fine without any fiddling around. So now we've got a 5 by 1500 weight matrix: that's the weight for each of the 5 things in our mini-batch, compared to each of the 1500 points. Then we've got the shape of the data itself, X.shape, which is the 1500 points by 2. So now we want to apply each one of these weights to each of these columns, so we need to add a unit axis to the end. To add a unit axis to the end we could say [:, :, None], or [..., None] — the dot dot dot means "all of the axes, however many you need", then None for the last one. So this is going to add an axis to the end, turning weight.shape from (5, 1500) into (5, 1500, 1); and this is going to add an axis to the start — remember, X[None] is the same as X[None, :, :]. So let's check our rules, right to left: these are compatible because one of them is one, these are compatible because they're both the same, and these are compatible because one of them is one. So it's going to copy each weight across to each of the x and y — which is what we want, we want to weight both of those components — and it's going to copy the 1500 points, sorry, each of the points, 5 times, because we do in fact want a separate set of weights for each of the 5 things in our mini-batch. So that sounds perfect — and that's how I think through these calculations. Okay, so we can now do that multiplication, which gives us something of shape 5 by 1500 by 2 — rank 3, the maximum of our ranks — and then we sum over those 1500 points, and that gives us our 5 new data points.
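Here's the batched version as a sketch (x is the (5, 2) mini-batch, X is the (1500, 2) data, and gaussian is the kernel from before):

```python
def dist(x, X):
    # (5, 1, 2) - (1, 1500, 2) broadcasts to (5, 1500, 2); sum over x/y, then sqrt
    return ((x[:, None] - X[None])**2).sum(-1).sqrt()        # -> (5, 1500)

weight = gaussian(dist(x, X), 2.5)                           # -> (5, 1500)
num = (weight[..., None] * X[None]).sum(1)                   # weighted sums -> (5, 2)
```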
Now, something that you might notice here is that we've got a product and a sum, and when you see a product and a sum, that tells you that maybe we should use an einsum. So in this case we've got our weight, which is 5 by 1500, so let's call those i and j, as in the 5 and the 1500. We've got the X, which is 1500 by 2; we want to take the product of that, so we need to use the same name for this dimension, so we use j again, and then k is the 2, the x and the y. And we want to end up with i by k. torch.einsum gives exactly the same result, which is great. But you might recognise this: it's exactly the same einsum we had just before when we were doing matrix multiplication. Oh, that is a matrix multiplication; we've just reinvented matrix multiplication using this rather nifty trick. So we could also just use that. And again, this is the kind of thing I was playing around with this morning as I started to look at this, thinking: can we simplify this? I don't like this kind of messing around with axes and summing over dimensions and whatnot, so it's nice to get things down to einsums, or better still down to matrix multiplies. It's just clearer, it's stuff we'll recognise because we use it all the time, they all work, and performance would be pretty similar, I suspect.

Okay, so now that we've got that, we then need to do our sum, and we've got our 5 points: these are our 5 denominators. So we've got our numerator, which we calculated up here, for our weighted average; the denominator is just the sum of the weights, remember; and numerator divided by denominator is our answer. So again, we've gone through every step, we've checked all the dimensions along the way, so nothing's going to surprise us. Don't try to write a function like this in one go from scratch, you'll drive yourself crazy; instead, do it step by step.

So here's our mean shift algorithm: clone the data, go through 5 iterations, and go from 0 to n, a batch size at a time. Python has something called slices, so we can create a slice of X starting at i, up to i plus batch size, unless you've gone past n, in which case use n. And then we're just copying and pasting each of the lines of code that we had before (actually I just copied the cells and merged them; I don't retype things, because that's slow and boring), and there's my final step to create the new xs. And notice here that s is not a single index, it's a slice of things. You might not have seen slice before, but this is just what Python is doing internally when you use a colon, and it's very convenient when you need to use the same slice multiple times.

Okay, so let's do that using CUDA. I would normally run it first without CUDA, but I've done all the steps before, so it should be fine. So pop it on the GPU, run mean shift, and let's see how long that takes: it takes 1 millisecond, and previously, without the GPU, it took 400 milliseconds.
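Putting the pieces together, here is a rough sketch of what that batched mean shift loop might look like. The function names, the default batch size, and the iteration count are stand-ins of mine rather than the notebook's exact code.

```python
import math
import torch

def gaussian(d, bw): return torch.exp(-0.5 * (d / bw)**2) / (bw * math.sqrt(2 * math.pi))

def one_update(X, s, bw=2.5):
    # s is a slice selecting the current mini-batch of points
    x = X[s]
    dist = ((x[:, None] - X[None, :])**2).sum(-1).sqrt()   # (batch, n) distances
    weight = gaussian(dist, bw)                             # (batch, n) weights
    # weight @ X is the same as torch.einsum('ij,jk->ik', weight, X)
    X[s] = (weight @ X) / weight.sum(1, keepdim=True)       # weighted means for this batch

def meanshift(data, bs=500, n_iter=5):
    n = len(data)
    X = data.clone()
    for _ in range(n_iter):
        for i in range(0, n, bs):
            s = slice(i, min(i + bs, n))
            one_update(X, s)
    return X

# The same code runs unchanged on the GPU, e.g.:
# out = meanshift(data.cuda(), bs=1024).cpu()
```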
And you know, the other thing we should probably think about is looking at other batch sizes as well, because now we're looping over batches, right? So if we make the batch size bigger, that for loop is going to do less looping. So what if we make that 16, will that be any faster? I've actually never tried this before, so that's interesting: it's actually slower. Huh, there you go, fascinating. What if it was 8? Amazing. So the big batches don't quite seem to be working so well for some reason. Hang on, what's going on? Why is it changing? My batch size was 5; why is it slower suddenly? I think it's probably just varying a bit from run to run. Okay, so it doesn't seem like changing the batch size is changing much here, so that's fine, we'll just leave it where it was. And then, looking at the data: oh, that looks lovely. Oh, I see, thank you people on YouTube for pointing out that I'm not passing the batch size through, so I actually need to put it here. Right, so we were always using a batch size of 5; no wonder it was messing up. Oh, look at that, I've totally made it slow now: 157... seconds, really? Haha. Okay, 64: 13 milliseconds. Alright, finally that makes much more sense. 256... 1024... okay, so bigger is better, and I guess we could actually do all 5,000 at once, probably. Nice. Alright, thank you YouTube friends for solving that bizarre mystery.

So that's pretty great. I mean, to see that we can GPU-optimise a mean shift: I actually googled for this to see if it had been done before, and it's the kind of thing that people write papers about. So I think it's great that we can do it so easily with PyTorch, because it's the kind of thing that had previously been considered, you know, a very challenging academic problem to solve. So maybe you can do something similar with some of these. Now, I haven't told you what these are, so part of the homework is to go and read about them and learn about them. DBSCAN, funnily enough, is an algorithm that I accidentally invented and then discovered a year later had already been invented. That was a long time ago: I was playing around with J, which is the successor to APL, on a very old Windows phone, and I had a long plane flight, and I came up with an algorithm and implemented the whole thing on my phone using J, and then discovered a year later that I'd just reinvented DBSCAN. It's actually a really cool algorithm, and it's got a lot of similarities to mean shift. LSH comes up all the time, so that's great. In fact I have a strong feeling, and I've been thinking about this for a while, that something like LSH could be used to speed this whole thing up a lot. Because if you think about it (and again, maybe this already exists, I don't know), when we did that distance calculation, the vast majority of the weights are nearly zero, and so it seems pointless to compute that big, eventually 1500 by 1500, matrix; that's slow. It would be much better if we just found the points that are pretty close by and just took their average. So you want an optimised nearest neighbours, basically, and this is an example of something that can give you a fast nearest neighbours algorithm, or there are things like k-d trees and octrees and stuff like that. So if you want a bonus bonus: invent a new mean shift algorithm which picks only the closest points, to avoid quadratic time. It's not very often you get an assignment which is to invent a new mean shift algorithm. I guess a super super bonus: publish a paper that describes it. You definitely get four points if you do that; we'll give you a number of points equal to the impact factor of the journal you get it published in.
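Just to make the "only use the nearby points" idea concrete, here is one possible sketch. It is very much not a solution to the bonus: this naive version still computes every distance before throwing most of them away, so it is still quadratic; to actually avoid that cost you would replace the full distance computation with something like LSH or a k-d tree. The helper names and the choice of k are mine, and it reuses the gaussian kernel from the earlier sketch.

```python
import math
import torch

def gaussian(d, bw): return torch.exp(-0.5 * (d / bw)**2) / (bw * math.sqrt(2 * math.pi))

def meanshift_knn_step(X, x, k=100, bw=2.5):
    # Naive illustration: weight only the k nearest neighbours of each batch point
    dist = ((x[:, None] - X[None, :])**2).sum(-1).sqrt()   # (batch, n) -- still the full matrix
    d_k, idx = dist.topk(k, largest=False)                  # k smallest distances per row
    w = gaussian(d_k, bw)                                    # (batch, k) weights
    neighbours = X[idx]                                      # (batch, k, 2) nearby points
    return (w[..., None] * neighbours).sum(1) / w.sum(1, keepdim=True)
```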
So what I want to do now is move on to calculus, which for some of us may not be our favourite topic. That's funny, turns out there's an einsum version here already; I didn't notice. Okay, always ahead of his time, that guy. Let's talk about calculus. If you're not super comfortable with derivatives, what they are and why we care, 3Blue1Brown has a wonderful series called the Essence of Calculus, which I strongly recommend watching. It's just a pleasure to watch, as is everything on 3Blue1Brown. We're not going to get into backprop today; we're just going to have a quick chat about calculus.

Where do we start? The good news is that, just like you don't have to know much linear algebra at all (you basically just need to know about matrix multiplication), you also don't need to know much calculus at all, just derivatives. So let's think about what derivatives are. I'm going to borrow the same starting point that 3Blue1Brown uses in one of their videos, which is to consider a car, and we're going to see how far away from home it is at various time points. So after a second it's traveled 5 meters, then after 2 seconds it's traveled 10 meters, and after 3 seconds, you can probably guess, it's traveled 15 meters. So there's this concept here of... ah, I've got it the wrong way around, obviously: time, distance. Okay, so there's this concept of location: how far you've traveled at a particular point in time. So we can look at one of these points and find out how far the car has gone. We could also take two points and say: where did it start at the first of those two points, and where did it finish at the second? And we can ask, between those two points, how much time passed and how far did it travel: in 2 seconds it traveled 10 meters. So we could now also say, alright, the slope of something is rise over run (oopsie daisy): 10 meters in 2 seconds. And notice we don't just divide the numbers, we also divide the units, so we get 5 meters per second. This has now changed the dimensions entirely: we're not looking at distance anymore, we're looking at speed, or velocity. It's equal to rise over run; it's equal to the rate of change. What it really says is: as the time on the x axis goes up by 1 second, what happens to the distance in meters? As 1 second passes, how does the number of meters change?

And maybe these aren't points at all; maybe there's a function, a continuum of points, and so you can do the same thing for the function. Distance is a function of time, so we could ask: what's the slope of that function? We can get the slope from point A to point B using rise over run. So from t1 to t2, the amount of time that's passed is t2 minus t1 (let's say this is t1, this is t2), and the distance traveled is wherever it is at the end minus wherever it was at the start. So that's the change in distance divided by the change in time. Okay, let's say that's y. Now, the thing is, when we talk about calculus, we talk about finding a slope, but often of something that's more tricky than this, right? We have slopes of things that look more like this, and we say: what's this slope? Oops, I'm terrible at drawing; maybe I'll put it over here, because I'm left-handed. What's this slope now? What does it even mean to have a velocity at an exact moment in time? It doesn't mean anything, you know: at an exact moment in time everything's frozen, so what's happening exactly now? But what you can do is say: what's the change in time between a bit before our point and a bit after our point, and what's the change in distance between a bit before our point and a bit after our point? And so you can do the same kind of rise-over-run thing, but you can make that distance between t2 and t1 smaller and smaller.
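To put that rise-over-run picture in symbols, using the t1 and t2 notation from the board and the numbers from the car example (5 meters at 1 second, 15 meters at 3 seconds):

```latex
\text{slope} = \frac{\text{rise}}{\text{run}}
             = \frac{f(t_2) - f(t_1)}{t_2 - t_1}
             = \frac{15\,\text{m} - 5\,\text{m}}{3\,\text{s} - 1\,\text{s}}
             = 5\ \text{m/s}
```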
So let's rewrite this in a slightly different way. Let's call the denominator the distance between t1 plus a little bit, which we'll call d, and t1; so this is t2, it's t1 plus a little bit. And now f of t2 becomes f of t1 plus that little bit. Notice that t1 plus d minus t1... we can delete all that, because it just comes out to d. So this is another way of calculating the slope of our function, and as d gets smaller and smaller and smaller, we're getting a tinier and tinier triangle, and it still makes sense: some time has still passed and the car has still moved, it's just smaller and smaller amounts of time. Now, if you did calculus at college or at school, you might have done all this stuff messing around with limits and epsilon-delta and blah blah blah. I've got really good news: it turns out you can actually just think of this d as a really small number, where d is for difference. And so when we calculate the slope, we can write it in a slightly different way, as the change in y divided by the change in x: this here is the change in y, and this here is the change in x. In other words, this here is a very small number, and this here is the resulting change in the function when you change the input by that very small number. This way of thinking about calculus is known as the calculus of infinitesimals, and it's how Leibniz originally developed it, and it's since been turned into a whole theory. The reason I mention it here is that when we do calculus, you'll see me doing stuff all the time where I act as if dx is a really small number. When I was at school I was told I wasn't allowed to do that; I've since learned that it's totally fine. So for example, next lesson we're going to be looking at the chain rule, which looks like this: dy/dx = dy/du times du/dx, and I'm just going to say, oh, these two small numbers cancel out, because they're obviously the same thing, and that's all going to work out nicely.

So anyway, what would be very helpful, before the next lesson, if you're not totally up to date with all the calculus you did in high school, is to watch the 3Blue1Brown course. We are not going to be looking at integration at all, I don't think, so you don't have to worry about that. We're also not, on the whole, going to be doing any derivatives by hand. For example, there are rules such as: if y equals x squared, then dy/dx is 2x. Those kinds of rules you're not really going to have to learn, because PyTorch is going to do them all for you. The one we care about is the chain rule, but we're going to learn about that next time. Okay, I hope I don't get beaten to a bloody pulp the next time I walk into a mathematicians' conference. I suspect I might, but hopefully I get away with this; I think it's safe, we'll see how we go. Thanks everybody very much for joining me, and I really look forward to seeing you next time, where we're going to do backpropagation from scratch. We've already learned to multiply matrices, so once we've got backpropagation as well, we'll be ready to train a neural network. Alright, thanks all, bye.
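As a tiny follow-up to that point about PyTorch doing the derivative rules for you, here is a minimal autograd sketch showing the y = x squared, dy/dx = 2x rule; the input value 3.0 is just an arbitrary example.

```python
import torch

# PyTorch applies the derivative rules for us: for y = x**2 the gradient is 2x
x = torch.tensor(3.0, requires_grad=True)
y = x**2
y.backward()       # compute dy/dx via autograd
print(x.grad)      # tensor(6.), i.e. 2 * 3.0
```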