Hi everybody, and welcome to lesson 13, where we're going to start talking about backpropagation. Before we do, I'll just mention that there was some great success amongst the folks in the class during the week on flexing their tensor manipulation muscles. So far the fastest mean shift algorithm, which has a similar accuracy to the one I displayed, is one that actually randomly chooses a subset of the data points. And I actually think that's a great approach. Very often random sampling and random projections are two excellent ways of speeding up algorithms. So it'll be interesting to see if anybody during the rest of the course comes up with anything faster than random sampling. I've also been seeing some good Einstein summation examples and implementations, and continuing to see lots of good DiffEdit implementations. So congratulations to all the students, and I hope those of you following along with the videos in the MOOC will be working on the same homework as well and sharing your results on the fast.ai forums.
So now we're going to take a look at notebook number three in the normal repo, the course22p2 repo. And we're going to be looking at the forward and backward passes of a simple multi-layer perceptron neural network. The initial stuff up here is just importing things and settings, just copying and pasting some stuff from previous notebooks around paths and parameters and so on. So we'll skip over this. We'll often be copying and pasting stuff from one notebook into another's first cell to get things set up. And I'm also loading in our data for MNIST as tensors.
Okay, so to start with we need to create the basic architecture of our neural network. And I did mention at the start of the course that we will briefly review everything that we need to cover. So we should briefly review what basic neural networks are and why they are what they are. So to start with, let's consider a linear model. Oops, that's not how I do it.
So let's start by considering a linear model of, well, let's take the most simple example possible, which is we're going to pick a single pixel from our MNIST pictures. And so that will be our x. And for our y value, we're going to be looking at how likely is it that this is, say, the number three, based on the value of this one pixel. So the pixel's value will be x, and the probability of being the number three we will call y. And if we just have a linear model, then it's going to look like this. And so in this case, it's saying that the brighter this pixel is, the more likely it is that it's the number three.
And so there's a few problems with this. The first one, obviously, is that a linear model is very limiting, because maybe we actually are trying to draw something that looks more like this. So how would you do that? Well, there's actually a neat trick we can use. But let's first talk about something we can't do. Something we can't do is to add a bunch of additional lines. So consider what happens if we say, OK, let's also add this line. What would be the sum of our two lines? Well, the answer is, of course, that the sum of the two lines will itself be a line. So it's not going to help us at all to match the actual curve that we want.
So here's the trick. Instead, we could create a line like this. And now consider what happens if we add this original line to this new, well, it's not a line, right? It's two line segments. So what we would get is this: everything to the left of this point is not going to be changed if I add these two together, because this is zero all the way. And everything to the right of it is going to be reduced. It looks like they've got similar slopes.
So we might end up with this instead. So this would all disappear here, and instead we would end up with something like this. And then we could do that again, right? We could add an additional line that looks a bit like that, but this time it could go even further out here, and it could be something like this. So what if we added that? Well, again, up to this point it's always zero, so it won't do anything at all. But after that, it's going to make it even more negatively sloped. And you can see that using this approach, we could add up lots of these rectified lines, these lines truncated at zero, and we could create any shape we want with enough of them.
And these lines are very easy to create, because actually all we need to do is to create just a regular line, which we can move up, down, left, right, change its angle, whatever. And then just say, if it's greater than zero, truncate it to zero. Or we could do the opposite for a line going the opposite direction: if it's less than zero, truncate it to zero. And that would get rid of, as we want, this whole section here and make it flat. OK, so these are rectified lines. And so we can sum up a bunch of these together to basically match any arbitrary curve. So let's start by doing that.
Oh, the other thing we should mention, of course, is that we're going to have not just one pixel, but lots of pixels. So to start with the only slightly less simple approach, we could have something where we've got pixel number one and pixel number two. We're looking at two different pixels to see how likely it is to be the number three. And so that would allow us to draw more complex shapes that have some kind of surface between them. OK, and then we can do exactly the same thing to create these surfaces.
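The one-pixel picture of adding rectified lines to a plain line can be sketched in a few lines of PyTorch. This is only an illustration: the slopes, intercepts and sample points are made up, and the helper names line and rectified_line are hypothetical rather than anything from the notebook.

```python
import torch

def line(x, m, b):
    # An ordinary straight line with slope m and intercept b.
    return m * x + b

def rectified_line(x, m, b):
    # The same line, but any part above zero is truncated to zero.
    return line(x, m, b).clamp_max(0.)

x = torch.tensor([-2., -1., 0., 0.5, 1., 2.])

base  = line(x, 0.5, 0.)             # the original straight line
bend1 = rectified_line(x, -1., 0.)   # zero up to x = 0, slope -1 afterwards
bend2 = rectified_line(x, -1., 1.)   # zero up to x = 1, slope -1 afterwards

# The sum is no longer a line: its slope is 0.5, then -0.5, then -1.5.
y = base + bend1 + bend2
print(y)
```

Each extra rectified line leaves everything before its kink untouched and bends the curve after it, which is why enough of them can approximate an arbitrary shape.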
We can add up lots of these rectified lines together, but now they're going to be kind of rectified planes. But it's going to be exactly the same thing. We're going to be adding together a bunch of lines, each one of which is truncated at zero. OK, so that's the quick review.
And so to do that, we'll start out by just defining a few variables. So n is the number of training examples, m is the number of pixels, and c is the number of possible values of our digits. And so here they are: 50,000 samples, 784 pixels, and 10 possible outputs. OK, so what we do is we basically decide ahead of time how many of these line segment thingies to add up. And so the number that we create in a layer is called the number of hidden nodes or activations. So we'll call that nh. So let's just arbitrarily decide on creating 50 of those.
So in order to create lots of lines, which we're going to truncate at zero, we can do a matrix multiplication. So with a matrix multiplication, we're going to have something where we've got 50,000 rows by, was it 784? Yeah, by 784 columns. And we're going to multiply that by something with 784 rows and 10 columns. And why is that? Well, that's because if we take this very first row of this first matrix here, row one, its 784 values are the pixel values of the first image. OK, so this is our first image. And so each of those 784 values will be multiplied by each of these 784 values in the first column, the zero-index column. We'll multiply those together and add them up, and that result is going to end up over here in this first cell of our output. So our output is going to be 50,000 images by 10. And so each of these columns is going to eventually represent, if this is a linear model, and in this case this is just the example of doing a linear model, a probability. So this first column will be the probability of being a zero.
And the second column will be the probability of being a one. The third column will be the probability of being a two, and so forth. So that's why we're going to have these 10 columns, each one allowing us to weight the 784 inputs. Now, of course, we're going to do something a bit more tricky than that, which is actually we're going to have a 784-long input going into a 784 by 50 weight matrix to create the 50 hidden activations. Then we're going to truncate those at zero and then multiply that by a 50 by 10 to create our 10 outputs. So we'll do it in two steps.
So the way SGD works is we start with, this is our weight matrix here, and this is our data, and this is our outputs. The way it works is that this weight matrix is initially filled with random values. Also, of course, this contains our pixel values, and this contains the results. So w is going to start with random values. So here's our weight matrix. It's going to have, as we discussed, 784 by 50 random values. And it's not enough just to multiply. We also have to add. So that's what makes it a linear function. We call the things we add the biases. We can just start those at zeros. We'll need one for each output, so 50 of those. And so that'll be layer one. And then, as we just mentioned, layer two will be a matrix that goes from 50 hidden.
And now I'm going to do something totally cheating to simplify some of the calculations for the calculus. I'm only going to create one output. Why am I going to create one output? That's because I'm not going to use cross entropy just yet. Instead, I'm going to use MSE. So actually, I'm going to create one output, which will literally just be: what number do I think it is, from zero to nine? And so then we're going to compare those to the actuals. So these will be our y predictions. We normally use a little hat for that. And we're going to compare that to our actuals. And yeah, in this very hacky approach, let's say we predict over here the number nine.
And the actual is the number two. And we'll compare those together using MSE, which will be a stupid way to do it, because it's saying that nine is further away from two than it is from four in terms of how correct it is, which is not what we want at all. But this is what we're going to do just to simplify our starting point. So that's why we're going to have a single output for this weight matrix and a single output for this bias.
So let's create a function for putting x through a linear layer with these weights and these biases. So it's a matrix multiply and an add. All right, so we can now try it. We're doing x_valid this time. So just to clarify, x_valid is 10,000 by 784. So if we put x_valid through our weights and biases with the linear layer, we end up with a 10,000 by 50. So 10,000 50-long hidden activations. They're not quite ready yet, because we have to put them through ReLU. And so we're going to clamp at zero, so everything under zero will become zero. And so here's what it looks like when we go through the linear layer and then the ReLU. And you can see here's a tensor with a bunch of things, all of which are either zero or positive. So that's the result of this matrix multiplication followed by the ReLU.
OK, so to create our basic MLP, multi-layer perceptron, from scratch, we will take our mini-batch of x's; xb is an x batch. We will create our first layer's output with a linear. And then we will put that through a ReLU. And then that will go through the second linear. So the first one uses w1 and b1, these ones, and the second one uses w2 and b2. And so we've now got a simple model. And as we hoped, when we pass in the validation set, we get back 10,000 digits, so 10,000 by 1. Great, so that's a good start.
OK, so let's use our ridiculous loss function of MSE. So our result is 10,000 by 1, and our y_valid is just a vector. Now, what's going to happen if I do res minus y_valid?
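For reference, the model just described (a linear layer, a clamp at zero, then a second linear layer with a single output) might look roughly like this. It is a sketch under stated assumptions: the input is random noise standing in for MNIST pixels, the batch is only 8 rows rather than 10,000, and the function names are a guess at what the notebook uses.

```python
import torch

torch.manual_seed(0)
n, m, nh = 8, 784, 50            # tiny stand-in batch; 784 pixels; 50 hidden activations

x = torch.randn(n, m)            # random stand-in for a batch of flattened images
w1, b1 = torch.randn(m, nh), torch.zeros(nh)   # layer one: 784 -> 50
w2, b2 = torch.randn(nh, 1), torch.zeros(1)    # layer two: 50 -> 1 (single output for MSE)

def lin(x, w, b):
    # Linear layer: a matrix multiply and an add.
    return x @ w + b

def relu(x):
    # Clamp at zero: everything under zero becomes zero.
    return x.clamp_min(0.)

def model(xb):
    l1 = lin(xb, w1, b1)
    l2 = relu(l1)
    return lin(l2, w2, b2)

res = model(x)
print(res.shape)                 # one prediction per row of the batch
```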
So before you continue in the video, have a think about that. What's going to happen if I do res minus y_valid, thinking about the NumPy broadcasting rules we've learned? OK, let's try it. Oh, terrible. We've ended up with a 10,000 by 10,000 matrix, so 100 million points. Now, we would expect an MSE to be working with just 10,000 points. Why did that happen? The reason it happened is because we have to start out at the last dimension and go right to left. And we compare the 10,000 to the 1 and say, are they compatible? And the answer is, that's right, Alexei in the chat's got it right, broadcasting rules. So the answer is that this 1 will be broadcast over these 10,000. So this pair here will give us 10,000 outputs. And then we'll move to the next one. And we'll also move here to the next one. Oh, there is no next one. What happens? Now, if you remember the rules, it inserts a unit axis for us. So we're now comparing 10,000 with 1. So that means each of the 10,000 outputs from here will end up being broadcast across the 10,000 rows here. So that means that for each of those 10,000, we'll have another 10,000. So we'll end up with a 10,000 by 10,000 output.
So that's not what we want. So how could we fix that? Well, what we really want is for this to be 10,000 by 1 here. If that was 10,000 by 1, then we'd compare these two right to left. They're both 1, so those match, and there's nothing to broadcast because they're the same. And then we'll go to the next one, 10,000 to 10,000. Those match, so they just go element-wise. And we'd end up with exactly what we want: we'd end up with 10,000 results. Or, alternatively, we could remove this dimension. And then again, same thing. We're then going to go right to left, compatible, 10,000, so they'll get an element-wise operation. So in this case, I got rid of the trailing comma 1. There's a couple of ways you could do that. One is just to say, OK, grab every row and the zeroth column of res.
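The broadcasting surprise is easy to reproduce on small tensors. In this sketch, 5 stands in for 10,000 and the values are arbitrary; only the shapes matter.

```python
import torch

res = torch.zeros(5, 1)      # stand-in for the (10000, 1) model output
y   = torch.arange(5.)       # stand-in for the (10000,) vector of targets

# Shapes are compared right to left: 1 vs 5 broadcasts, then y gets a
# unit axis inserted in front, so the result is 5 by 5, not 5.
bad = res - y

# Fix 1: grab every row and the zeroth column, so both sides are length 5.
good = res[:, 0] - y

# Fix 2: give the targets a trailing unit axis, so both sides are 5 by 1.
also_good = res - y[:, None]

print(bad.shape, good.shape, also_good.shape)
```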
And that's going to turn it from a 10,000 by 1 into a 10,000. Or alternatively, we can say .squeeze(). Now .squeeze() removes all trailing unit vectors, and possibly also prefix unit vectors. I can't quite recall. I guess we should try. So let's say q equals res[None, :, None], and look at q.shape. OK, so if I go q.squeeze().shape... OK, so all the unit vectors get removed. Sorry, all the unit dimensions get removed, I should say. OK, so now that we've got a way to remove that axis that we didn't want, we can use it. And if we do this subtraction, now we get 10,000, just like we wanted.
So now let's get our training and validation y's. We'll turn them into floats because we're using MSE. So let's calculate our predictions for the training set, which is 50,000 by 1. And so we create an MSE function that just does what we just said we wanted: it does this subtraction, then squares it, and then takes the mean. That's MSE. So there we go, we now have a loss function being applied to our training set.
OK, now we need gradients. So as we briefly discussed last time, gradients are slopes. And in fact, maybe it would even be easier to look at last time. So this was last time's notebook. And so we saw how the gradient at this point is the slope here. And so it's, as we discussed, rise over run. So that means as we increase, in this case, time by 1, the distance increases by how much? That's what the slope is.
So why is this interesting? The reason it's interesting is because let's consider our neural network. Our neural network is some function that takes two things, two groups of things. It takes a matrix of our inputs, and it takes our weight matrix. And let's assume we're also putting it through a loss function. I guess we can be explicit about that. So we could say we then take the result of that and we put it through some loss function. So these are the predictions.
And we compare them to our actual dependent variable. So that's our neural net, and that's our loss function. OK, so if we can get the derivative of the loss with respect to, let's say, one particular weight, so let's say weight number 0, what is that doing? Well, it's saying, as I increase the weight by a little bit, what happens to the loss? And if it says, oh, well, that would make the loss go down, then obviously I want to increase the weight by a little bit. And if it says, oh, it makes the loss go up, then obviously I want to do the opposite. So the derivative of the loss with respect to the weights, each one of those tells us how to change the weights. And so to remind you, we then take each derivative times a little bit and subtract it from the original weights. And we do that a bunch of times, and that's called SGD.
Now, there's something interesting going on here, which is that in this case, there's a single input and a single output. And so the derivative is a single number at any point. It's the speed, in this case, that the vehicle's going. But consider a more complex function like, say, this one. Now, in this case, there's one output, but there's two inputs. And so if we want to take the derivative of this function, then we actually need to say, well, what happens if we increase x by a little bit? And also, what happens if we increase y by a little bit? And in each case, what happens to z? And so in that case, the derivative is actually going to contain two numbers. It's going to contain the derivative of z with respect to y, and it's going to contain the derivative of z with respect to x: what happens if we change each of these two numbers? So for example, these could be, as we discussed, two different weights in our neural network, and z could be our loss. Now, we've actually got 784 inputs, so we would actually have 784 of these. So we don't normally write them all like that.
We would just use this little squiggly symbol to say the derivative of the loss with respect to all of the weights. And that's just saying that there's a whole bunch of them. It's a shorthand way of writing this.
OK, so it gets more complicated still, though, because think about what happens if, for example, you're in the first layer, where we've got a weight matrix that's going to end up giving us 50 outputs. So for every image, we're going to have 784 inputs to our function and 50 outputs from our function. And so in that case, I can't even draw it, right? Because even if I had two inputs and two outputs, then as I increase my first input, I actually need to say, how does that change both of the two outputs? And as I change my second input, how does that change both of my two outputs? So for the full thing, you actually are going to end up with a matrix of derivatives. It basically says, for every input that you change by a little bit, how much does it change every output of that function? So that's what we're going to be doing: we're going to be calculating these derivatives. But rather than being single numbers, they're going to actually be matrices, with a row for every input and a column for every output. And a single cell in that matrix will tell us, as I change this input by a little bit, how does it change this output?
Now, eventually, we will end up with a single number for every input. And that's because our loss in the end is going to be a single number. And this is a requirement that you'll find when you try to use SGD: your loss has to be a single number. And so we generally get it by doing a sum or a mean or something like that. But as you'll see, on the way there, we're going to have to be dealing with these matrices of derivatives.
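To make the matrix of derivatives concrete, here is a small sketch that nudges each input in turn and records how every output moves. The function, its weights and the helper name derivative_matrix are all hypothetical; the layout (a row per input, a column per output) follows the description above, and note that some libraries use the transposed layout.

```python
import torch

# Hypothetical weights for a function with 2 inputs and 3 outputs.
w = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]], dtype=torch.float64)

def f(x):
    return x @ w

def derivative_matrix(f, x, eps=1e-4):
    # Row i answers: if input i changes by a little bit, how much does each output change?
    rows = []
    for i in range(len(x)):
        nudge = torch.zeros_like(x)
        nudge[i] = eps
        rows.append((f(x + nudge) - f(x - nudge)) / (2 * eps))
    return torch.stack(rows)

x = torch.tensor([0.5, -1.0], dtype=torch.float64)
J = derivative_matrix(f, x)
print(J)
```

For a purely linear function like this one, the matrix of derivatives comes out as the weight matrix itself, which is worth keeping in mind for the gradient of a linear layer later on.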
So I just want to mention, as I might have said before, I can't even remember, there is this paper that Terence Parr and I wrote a while ago, which goes through all this. And it basically assumes that you only know high school calculus, and if you don't, check out Khan Academy. But then it describes matrix calculus in those terms. So it's going to explain it to you exactly, and it works through lots and lots of examples. So for example, as it mentions here, when you have this matrix of derivatives, we call that a Jacobian matrix. So there's all these words. It doesn't matter too much if you know them or not, but it's convenient to be able to talk about the matrix of all of the derivatives if somebody just says the Jacobian. It's a little bit easier than saying the matrix of all of the derivatives, where all of the rows are all the inputs and all the columns are the outputs. So yeah, if you want to get to a point where papers are easier to read in particular, it's quite useful to know this notation and these definitions of words. You can certainly get away without it. It's just something to consider.
OK, so we need to be able to calculate derivatives, at least of a single variable. And I am not going to worry too much about that, A, because that is something you do in high school math, and B, because your computer can do it for you. And so you can do it symbolically, using something called SymPy, which is really great. So if you create two symbols called x and y, you can say, please differentiate x squared with respect to x. And if you do that, SymPy will tell you the answer is 2x. If you say differentiate 3x squared plus 9 with respect to x, SymPy will tell you that's 6x. And a lot of you probably will have used Wolfram Alpha, which does something very similar. I quite like this because I can quickly do it inside my notebook and include it in my prose. So I think SymPy is pretty cool.
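The symbolic differentiation described above can be reproduced with SymPy in a couple of lines. This is a minimal sketch; the variable names d1 and d2 are just for illustration.

```python
from sympy import symbols, diff

x, y = symbols('x y')

d1 = diff(x**2, x)           # derivative of x squared with respect to x
d2 = diff(3*x**2 + 9, x)     # derivative of 3x squared plus 9 with respect to x

print(d1)   # 2*x
print(d2)   # 6*x
```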
So basically, yeah, you can quickly calculate derivatives on a computer. Having said that, I do want to talk about why the derivative of 3x squared plus 9 equals 6x, because that's going to be very important. So, 3x squared plus 9. We're going to start with the information that the derivative of a to the b with respect to a equals b times a. So for example, the derivative of x squared with respect to x equals 2x. So that's just something I'm hoping you'll remember from high school, or refresh your memory using Khan Academy or similar. So there that is there. So what we could now do is we could rewrite this as y equals 3u plus 9, and then we'll write u equals x squared. OK, now this is getting easier. The derivative of two things being added together is simply the sum of their derivatives. Oh, I forgot the b minus 1 in the exponent. Thank you. So it'd be b times a to the power of b minus 1. That's what it should be. Which here would be 2x to the power of 1, and the 1 is not needed. Thank you for fixing that.
All right. So we just sum them up. So the derivative of 3u plus 9 is going to be the derivative of that plus the derivative of that. Now, the derivative of any constant with respect to a variable is 0, because if I change an input, it doesn't change the constant. It's always 9. So that's going to end up as 0. And so we're going to end up with dy/du equals something plus 0. And the derivative of 3u with respect to u is just 3, because it's just a line, so that's its slope. OK, but that's not dy/dx. We want dy/dx. Well, the cool thing is that dy/dx is actually just equal to dy/du times du/dx. I'll explain why in a moment. But for now, let's recognize we've got dy/du, and du/dx we know, that one's 2x. So we can now multiply these two bits together, and we will end up with 2x times 3, which is 6x, which is what SymPy told us. So fantastic. OK, this is something we need to know really well. And it's called the chain rule.
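As a quick numerical sanity check of the 3x squared plus 9 example, the two slopes can be estimated separately and multiplied. This sketch uses plain Python with a finite-difference helper called slope, which is a hypothetical name, and picks x = 2 arbitrarily.

```python
def u_of_x(x):
    return x ** 2            # u = x squared

def y_of_u(u):
    return 3 * u + 9         # y = 3u + 9

def slope(f, at, eps=1e-6):
    # Rise over run, estimated with a tiny step either side of the point.
    return (f(at + eps) - f(at - eps)) / (2 * eps)

x = 2.0
du_dx = slope(u_of_x, x)                          # close to 2x = 4
dy_du = slope(y_of_u, u_of_x(x))                  # close to 3
dy_dx = slope(lambda t: y_of_u(u_of_x(t)), x)     # close to 6x = 12

print(dy_du * du_dx, dy_dx)
```

The product of the two intermediate slopes matches the slope of the composed function, up to floating-point error.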
And it's best to understand it intuitively. So to understand it intuitively, we're going to take a look at an interactive animation. So I found this nice interactive animation on this page here, webspace.ship.edu/msrenault/geogebracalculus. OK, and the idea here is that we've got a wheel spinning around, and each time it spins around, this is x going up. OK, so at the moment, there's some change in x, dx, over a period of time. All right, now this wheel is 8 times bigger than this wheel. So each time this goes around once, if we connect the two together, this wheel would be going around four times faster, because the multiple between 8 and 2 is 4. Maybe I'll bring this up to here. So now that this wheel has got twice as big a circumference as the u wheel, each time this goes around once, this is going around two times. So each time x goes around once, the change in u will be 2. So that's what du/dx is saying: the change in u for each change in x is 2.
Now, we could make this interesting by connecting this wheel to this wheel. Now, this wheel is half the size of this wheel. So we can see that, again, each time this spins around once, this spins around twice, because this has twice the circumference of this. So therefore, dy/du equals 2. But now that means every time this goes around once, this goes around twice, and every time this one goes around once, this one goes around twice. So therefore, every time this one goes around once, this one goes around four times. So dy/dx equals 4. So you can see here how the du/dx has to be multiplied with the dy/du to get the total. So this is what's going on in the chain rule. And this is what you want to be thinking about: this idea that you've got one function that is kind of this intermediary, and so you have to multiply the two impacts to get the impact of the x wheel on the y wheel. So I hope you find that useful.
Personally, I find this intuition quite useful. So why do we care about this? Well, the reason we care about this is because we want to calculate the gradient of our MSE applied to our model. And so our inputs are going through a linear, they're going through a ReLU, they're going through another linear, and then they're going through an MSE. So there's four different steps going on, and we're going to have to combine those all together. And we can do that with the chain rule. So our steps are: the loss function is some function of the predictions and the actuals. And then we've got the second layer, actually, let's call this the output of the second layer. Slightly weird notation, but hopefully it's not too bad. It's going to be a function of the ReLU activations. And the ReLU activations are a function of the first layer. And the first layer is a function of the inputs. Oh, and of course, this also has weights and biases.
So we're basically going to have to calculate the derivative of that. OK, but then remember that this is itself a function, so then we'll need to multiply that derivative by the derivative of that. But that's also a function, so we have to multiply that derivative by this. But that's also a function, so we have to multiply that derivative by this. So that's going to be our approach. We're going to start at the end, we're going to take its derivative, and then we're going to gradually keep multiplying as we go each step through. And this is called backpropagation. So backpropagation sounds pretty fancy, but it's actually just using the chain rule. Gosh, I didn't spell that very well. Prop-a-gation. It's just using the chain rule. And as you'll see, it's also just taking advantage of a computational trick of memorizing some things on the way.
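Written out, the chain of four steps and the product of derivatives might look like the following. The symbols are chosen here for illustration and are not the notation drawn in the lecture: l_1 is the first linear layer's output, a is the result of clamping it at zero, y-hat is the second linear layer's output, and L is the loss. The products are schematic, glossing over matrix shapes.

```latex
\[
l_1 = x W_1 + b_1, \qquad
a = \max(l_1, 0), \qquad
\hat{y} = a W_2 + b_2, \qquad
L = \operatorname{mse}(\hat{y}, y)
\]
\[
\frac{\partial L}{\partial W_1}
  = \frac{\partial L}{\partial \hat{y}} \cdot
    \frac{\partial \hat{y}}{\partial a} \cdot
    \frac{\partial a}{\partial l_1} \cdot
    \frac{\partial l_1}{\partial W_1}
\]
```

Reading the product left to right is the "start at the end and keep multiplying" order, and the leading factors are shared by every earlier parameter, which is why it pays to store them along the way.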
And in our chat, Siva made a very good point about understanding non-linear functions in this case, which is just to consider that the wheels could be growing and shrinking all the time as they're moving, but you're still going to have the same compound effect. I really like that. Thank you, Siva. There's also a question in the chat about why this [:, 0] is being placed in the function, given that we can do it outside the function. Well, the point is we want an MSE function that will apply to any output. We're not using it once. We want it to work any time. So we haven't actually modified preds or anything like that, or y_train. So we want this to be able to apply to anything without us having to preprocess it. That's basically the idea here.
OK, so let's take a look at the basic idea. So here we're going to do a forward pass and a backward pass. So the forward pass is where we calculate the loss. So the loss is, oh, I've got an error here. That should be diff. There we go. So the loss is going to be the output of our neural net minus our target, squared, then take the mean. OK, and then our output is going to be the output of the second linear layer. The second linear layer's input will be the ReLU. The ReLU's input will be the first layer. So we're going to take our input, put it through a linear layer, put that through a ReLU, put that through a linear layer, and calculate the MSE. OK, that bit hopefully is pretty straightforward.
So what about the backward pass? So for the backward pass, what I'm going to do, and you'll see why in a moment, is I'm going to store the gradients of each layer, so for example the gradients of the loss with respect to its inputs, in the layer itself. So I'm going to create a new attribute. I could call it anything I like; I'm just going to call it .g. So I'm going to create a new attribute called out.g, which is going to contain the gradients.
You don't have to do it this way, but as you'll see, it tends to be pretty convenient. So that's just going to be two times the difference, because we've got the difference squared, so that's just the derivative. And then we have taken the mean here, so we have to do the same thing here: divide by the input shape. And so that's those gradients. That's good. And now what we need to do is multiply by the gradients of the previous layer. So here's the previous layer. So what are the gradients of a linear layer? I've created a function for that here. So for the gradient of a linear layer, we're going to need to know the weights of the layer. We're going to need to know the biases of the layer. And then we're also going to need to know the input to the linear layer, because that's the thing that's actually being manipulated here. And then we're also going to need the output, because we have to multiply by its gradients, because we've got the chain rule.
So again, we're going to store the gradients of our input. So this will be the gradients of our output with respect to the input. And that's simply the weights, because our matrix multiply is just a whole bunch of linear functions, so each one's slope is just its weight. But you have to multiply it by the gradient of the outputs because of the chain rule. And then the gradient of the outputs with respect to the weights is going to be the input times the output gradient, summed up. I'll talk more about that in a moment. The derivative with respect to the bias is very straightforward. It's the gradients of the output added together, because the bias is just a constant value. So for the chain rule, we simply use the output gradient times one, which is the output gradient. So for this one here, again, we have to do the same thing we've been doing before, which is multiply by the output gradients because of the chain rule. And then we've got the input. So every single one of those has to be multiplied by the output gradients.
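Putting the forward and backward passes together, a sketch along the lines described might look like this. Several things here are assumptions rather than the notebook's exact code: the sizes are tiny and the data is random, float64 is used only so the numbers can be compared precisely, the function returns the loss for convenience, and the gradient of the clamp-at-zero step (pass the gradient through where the input was positive, zero elsewhere) is filled in from the standard derivative since it has not been walked through yet.

```python
import torch

torch.manual_seed(0)
n, m, nh = 6, 4, 3                                  # tiny stand-in sizes
f64 = torch.float64
x = torch.randn(n, m, dtype=f64)                    # stand-in inputs
y = torch.randn(n, dtype=f64)                       # stand-in targets
w1, b1 = torch.randn(m, nh, dtype=f64), torch.zeros(nh, dtype=f64)
w2, b2 = torch.randn(nh, 1, dtype=f64), torch.zeros(1, dtype=f64)

def lin(x, w, b): return x @ w + b
def relu(x): return x.clamp_min(0.)

def lin_grad(inp, out, w, b):
    # Each gradient is stored on the tensor itself, as an attribute called g.
    inp.g = out.g @ w.t()                                   # weights times output gradient
    w.g = (inp.unsqueeze(-1) * out.g.unsqueeze(1)).sum(0)   # input times output gradient, summed over the batch
    b.g = out.g.sum(0)                                      # output gradients added together

def forward_and_backward(inp, targ):
    # Forward pass: linear, clamp at zero, linear, then MSE.
    l1 = lin(inp, w1, b1)
    l2 = relu(l1)
    out = lin(l2, w2, b2)
    diff = out[:, 0] - targ
    loss = diff.pow(2).mean()

    # Backward pass: start at the end and keep multiplying.
    out.g = 2. * diff[:, None] / inp.shape[0]       # derivative of the mean of squares
    lin_grad(l2, out, w2, b2)
    l1.g = (l1 > 0).to(l1.dtype) * l2.g             # assumed: gradient of the clamp at zero
    lin_grad(inp, l1, w1, b1)
    return loss

loss = forward_and_backward(x, y)
print(loss, w1.g.shape, w2.g.shape)
```

One way to gain confidence in a hand-written backward pass like this is to compare the stored g attributes against what PyTorch's autograd computes for the same weights.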
And so that's why we have to do an unsqueeze(-1). So what I'm going to do now is I'm going to show you how I would experiment with this code in order to understand it. And I would encourage you to do the same thing. It's a little harder to do this one cell by cell, because we kind of want to put it all into a function like this. So we need a way to explore the calculations interactively. And the way we do that is by using the Python debugger. There's a few ways to do this. Here's one way to use the Python debugger. The Python debugger is called pdb. So if you say pdb.set_trace() in your code, then that tells the debugger to stop execution when it reaches this line. So it sets a breakpoint. So if I call forward_and_backward, you can see here, it's stopped. And the interactive Python debugger, ipdb, has popped up, with an arrow pointing at the line of code it's about to run. And at this point, there's a whole range of things we can do. To find out what they are, we press h for help. Understanding how to use the Python debugger is one of the most powerful things I think you can do to improve your coding.
So one of the most useful things you can do is to print something. You see all these single-letter things? They're just shortcuts, but in a debugger, you want to be able to do things quickly. So instead of typing print, I just type p. So for example, let's take a look at the shape of the input. So I type p inp.shape. So I've got a 50,000 by 50 input to the last layer. That makes sense. These are the hidden activations coming into the last layer for every one of our images. What about the output gradients? And there's that as well. And actually, a little trick: you don't have to use the p at all if your variable name is not the same as any of these commands. So I could have just typed out.g.shape and got the same thing. Okay. So you can also put in expressions. So let's have a look at the shape of this.
So the output of this is, let's see if it makes sense. We've got the input, 50,000 by 50. We put a new axis on the end. unsqueeze(-1) is the same as indexing it with [..., None], so it puts a new axis at the end. So that would have become 50,000 by 50 by one. And then with out.g.unsqueeze(1), we're putting a new axis in at dimension one. So we're going to have 50,000 by 50 by one times 50,000 by one by one. And so we end up getting this broadcasting happening over these last two dimensions, which is why we end up with 50,000 by 50 by one. And then we're summing up, and this makes sense, right? We want to sum up over all of the inputs. Each image is individually contributing to the derivative, and so we want to add them all up to find their total impact. Because remember, the derivative of a sum of functions is the sum of the derivatives of the functions. So we can just sum them up. Now, this is one of those situations where, if you see a times and a sum and an unsqueeze, it's not a bad idea to think about Einstein summation notation. Maybe there's a way to simplify this. But first of all, let's see how we can do some more stuff in the debugger. I'm going to continue running, so I press c for continue. It keeps running until it comes back again to the same spot. And the reason we've come to the same spot twice is because lin_grad is called two times. So we would expect that the second time we're going to get a different bunch of inputs and outputs. And so I can print out a tuple of the input and output gradient. So now, yeah, this is the first layer going into the second layer. So that's exactly what we'd expect. To find out what called this function, you just type w. w is 'where am I?' And so you can see here: forward and backward was called, see the arrow, and that called lin_grad the second time. And now we're here at w.g equals.
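To make the shapes in the broadcasting step concrete, here is a small sketch that uses NumPy as a stand-in for PyTorch tensors (an assumption made so the example is light to run). Indexing with [..., None] and [:, None, :] plays the role of unsqueeze(-1) and unsqueeze(1). The sizes are tiny hypothetical stand-ins for 50,000 and 50, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_in, n_out = 6, 4, 1                 # stand-ins for 50,000 rows, 50 inputs, 1 output

inp = rng.normal(size=(n, n_in))         # activations entering the linear layer
w = rng.normal(size=(n_in, n_out))       # the layer's weights
out_g = rng.normal(size=(n, n_out))      # gradient of the loss w.r.t. the layer output

i = inp[..., None]                       # like unsqueeze(-1): shape (n, n_in, 1)
o = out_g[:, None, :]                    # like unsqueeze(1):  shape (n, 1, n_out)
prod = i * o                             # broadcasts to (n, n_in, n_out)

w_g = prod.sum(0)                        # sum over the rows: shape (n_in, n_out)
b_g = out_g.sum(0)                       # bias gradient: output gradients added up
inp_g = out_g @ w.T                      # input gradient: output gradient times weights

shapes = (i.shape, o.shape, prod.shape, w_g.shape, b_g.shape, inp_g.shape)
```

Each row contributes its own n_in by n_out block of products, and summing over axis 0 adds those contributions together, which is why the weight gradient ends up with the same shape as the weights.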
If we want to find out what w.g ends up being equal to, I can press n to say go to the next line. And so now we've moved from line five to line six. The instruction pointer is now looking at line six. So I could now print out, for example, w.g.shape, and there's the shape of our weights. One person on the chat has pointed out that you can use breakpoint() instead of this import pdb business. Unfortunately, breakpoint() doesn't currently work in Jupyter or in IPython, so we actually can't, sadly. That's why I'm doing it the old-fashioned way. Maybe they'll fix the bug at some point, but for now we have to type all this. Okay, so those are a few things to know about, but I would definitely suggest looking up a Python pdb tutorial to become very familiar with this incredibly powerful tool, because it really is so very handy. If I just press continue again, it keeps running all the way to the end, and it's now finished running forward and backward. So when it's finished, we would find that there will now be, for example, a w1.g, because this is the gradients that it just calculated. And there would also be an x_train.g and so forth. Okay, so let's see if we can simplify this a little bit. I would be inclined to take these out and give them their own variable names, just to make life a bit easier. It would have been better if I'd actually done this before the debugging, so it'd be a bit easier to type. So let's set i and o equal to the input and out.g, each unsqueezed. Okay, so we'll get rid of our breakpoint and double check that we've got our gradients. And I guess before we run it, we should probably set those to zero. What I would do here to try things out is put my breakpoint there, and then I would try things. So let's go next. And I realize here that what we're actually doing is basically exactly the same thing as an einsum would do.
So I could test that out by trying an einsum, right? Because I've just got this being replicated, and then I'm summing over that dimension, because that's the multiplication that I'm doing. So I'm basically multiplying along the first dimension of each and then summing over that dimension. So I could try running that and, ah, it works. So that's interesting. Oh, and I've got zeros because I zeroed x_train itself. That was silly. It should be the gradients that get zeroed. Okay, so let's try doing an einsum. And there we go, that seems to be working. That's pretty cool. So we've multiplied over this repeated index. We were just multiplying the first dimensions together and then summing over them, so there's no i in the output. Now, that's not quite the same thing as a matrix multiplication, but we could turn it into a matrix multiplication just by swapping i and j so that they're the other way around. That way we'd have ji,ik. And we can swap two dimensions very easily. That's what's called the transpose. So that would become a matrix multiplication if we just use the transpose. In NumPy, as in PyTorch, the transpose is the capital T attribute. So here is exactly the same thing using a matrix multiply and a transpose. And let's check. Yeah, that's the same thing as well. Okay, cool. So that tells us, now we've checked in our debugger, that we can actually replace all this with a matrix multiply. We don't need that anymore. Let's see if it works. It does. All right. x_train.g, cool. Okay, so hopefully that's convinced you that the debugger is a really handy thing for playing around with numeric programming ideas, or coding in general. And so I think now's a good time to take a break. So let's take a seven minute break, and I'll see you back here in seven minutes. Thank you. Okay, welcome back. So we've calculated our derivatives and we want to test them. Luckily PyTorch already has derivatives implemented.
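The three equivalent ways of computing the weight gradient discussed before the break can be checked side by side. This is a sketch in NumPy rather than PyTorch (an assumption for portability; np.einsum takes the same subscript string), with small made-up shapes and illustrative variable names.

```python
import numpy as np

rng = np.random.default_rng(1)
inp = rng.normal(size=(6, 4))      # (rows, inputs)
out_g = rng.normal(size=(6, 3))    # (rows, outputs): gradient of the layer output

# 1. Broadcast, multiply, then sum over the row axis.
w_g_broadcast = (inp[..., None] * out_g[:, None, :]).sum(0)

# 2. Einstein summation: the repeated index i is multiplied and summed away.
w_g_einsum = np.einsum('ij,ik->jk', inp, out_g)

# 3. Swap i and j with a transpose and it is an ordinary matrix multiply.
w_g_matmul = inp.T @ out_g
```

The matrix multiply form never materialises the intermediate three-dimensional array, which is one reason to prefer it for large inputs.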
So I get to totally cheat and use PyTorch to calculate the same derivatives. Don't worry about how this works yet, because we're actually going to be doing all this from scratch anyway. For now I'm just going to run it all through PyTorch and check that their derivatives are the same as ours, and they are, so we're on the right track. Okay, so this is all pretty clunky, I think we can all agree, and obviously it's clunkier than what we do in PyTorch. So how do we simplify things? There's some really cool refactoring that we can do. What we're going to do is create a whole class for each of our functions: for the ReLU function and for the linear function. The way we're going to do this is by creating a dunder call, that is, __call__. What does dunder call do? Let me show you. So I create a class whose __call__ is just going to print hi. If I create an instance of that class and then call it as if it was a function (oops, missing the dunder bit here), call it as if it's a function, it says hi. In other words, everything can be changed in Python. You can change how a class behaves. You can make it look like a function, and to do that you simply define dunder call. You can pass it an argument like so. Okay, so that's what dunder call does. It's just a little bit of syntactic sugar to say, I want to be able to treat it as if it's a function, without naming any method at all. You can still do it the method way. You could have done this. Don't know why you'd want to, but you can. But because it's got this special magic name, dunder call, you don't have to write the .__call__ at all. So here, if we create an instance of the ReLU class, we can treat it as a function, and what it's going to do is take its input and do the ReLU on it.
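Here is a minimal sketch of what defining __call__ gives you. The class name and the greeting text are hypothetical, and the method returns a string instead of printing so the result is easy to check.

```python
class Greeter:
    """Hypothetical class whose instances can be called like functions."""

    def __call__(self, name="there"):
        return f"hi {name}"


g = Greeter()                     # create an instance; nothing is called yet

plain = g()                       # call the instance as if it were a function
with_arg = g("everybody")         # arguments are passed through to __call__
explicit = g.__call__("everybody")  # the method way still works, just more typing
```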
But if you look back at forward and backward, there's something very interesting about the backward pass, which is that it has to know about, for example, this intermediate calculation, which gets passed over here, and this intermediate calculation, which gets passed over here. We're going to need some of the intermediate calculations because of the chain rule, and not just because of the chain rule but because of how the derivatives are actually calculated. So we need to store each layer's intermediate calculations. And that's why ReLU doesn't just calculate and return the output, but it also stores its output and it also stores its input. That way, when we call backward, we know how to calculate it. We set the input's gradient, because remember, we stored the input, so we can do that, right? And it's just going to be (inp > 0).float(). So that's the definition of the derivative of ReLU, and then the chain rule. So that's how we can calculate the forward pass and the backward pass for ReLU, and we're not going to have to store all this intermediate stuff separately. It's going to happen automatically. We can do the same thing for a linear layer. Now, a linear layer needs some additional state, weights and biases. ReLU doesn't, right? So it has no __init__. So when we create a linear layer, we have to say, what are its weights? What are its biases? We store them away. And then when we call it in the forward pass, just like before, we store the input. So that's exactly the same line here. And just like before, we calculate the output and store it and then return it. And this time, of course, we just call lin. And then for the backward pass, it's the same thing. Okay, so the input gradients we calculate just like before. Oh, .t() with brackets and a little t is exactly the same as big .T as a property. So that's the same thing, that's just the transpose. Then we calculate the gradients of the weights.
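The layer-as-class idea can be sketched as follows. This is not the notebook's code: it uses NumPy instead of PyTorch, and because NumPy arrays cannot carry a .g attribute, each layer keeps its gradients on itself (inp_g, w_g, b_g) and backward takes the output gradient as an argument. Those names and that calling convention are assumptions of this sketch.

```python
import numpy as np


class Relu:
    def __call__(self, inp):
        self.inp = inp                           # stored for the backward pass
        self.out = np.maximum(inp, 0.0)
        return self.out

    def backward(self, out_g):
        # Derivative of ReLU is 1 where the input was positive, else 0,
        # multiplied by the incoming gradient (chain rule).
        self.inp_g = (self.inp > 0).astype(float) * out_g
        return self.inp_g


class Lin:
    def __init__(self, w, b):
        self.w, self.b = w, b                    # extra state that Relu does not need

    def __call__(self, inp):
        self.inp = inp
        self.out = inp @ self.w + self.b
        return self.out

    def backward(self, out_g):
        self.inp_g = out_g @ self.w.T            # gradient w.r.t. the input
        self.w_g = self.inp.T @ out_g            # gradient w.r.t. the weights
        self.b_g = out_g.sum(0)                  # gradient w.r.t. the bias
        return self.inp_g


relu = Relu()
relu_out = relu(np.array([[-1.0, 2.0]]))
relu_g = relu.backward(np.ones((1, 2)))

lin = Lin(np.ones((2, 3)), np.zeros(3))
lin_out = lin(np.array([[1.0, 2.0], [3.0, 4.0]]))
lin_inp_g = lin.backward(np.ones((2, 3)))
```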
Again, with the chain rule, and the bias, just like we did it before. And they're all being stored in the appropriate places. And then for MSE, we can do the same thing. We don't just calculate the MSE, but we also store it. And now, the MSE needs two things, an input and a target, so we'll store those as well. Then in the backward pass, we can calculate the gradient of its input as being two times the difference. And there it all is. Okay, so our model now is much easier to define. We can just create a bunch of layers: Lin with w1 and b1, ReLU, Lin with w2 and b2. And then we can store an instance of the MSE. So this is not calling MSE, it's creating an instance of the MSE class. And this is an instance of the Lin class, and this is an instance of the ReLU class. So they're just being stored. Then when we call the model, we pass it our inputs and our target. We go through each layer, set x equal to the result of calling that layer, and then pass that to the loss. There's something kind of interesting here that you might have noticed, which is that we don't have two separate things, a loss function being applied to a separate neural net. We've actually integrated the loss function directly into the neural net, into the model. See how the loss is being calculated inside the model? Now, that's neither better nor worse than having it separately, it's just different. Generally a lot of Hugging Face stuff does it this way: they actually put the loss inside the forward. Most stuff in fastai and a lot of other libraries does it separately, that is, the loss is a whole separate function and the model only returns the result of putting the input through the layers. So for this model, we're going to do the loss function inside the model. For backward, we just do each thing. So self.loss.backward, where self.loss is the MSE object.
So that's going to call backward, right? And when it was called here, it stored, remember, the inputs, the targets, and the outputs, so it can calculate the backward. And then we go through each layer in reverse, right? This is backpropagation, backwards, reversed, calling backward on each one. So that's pretty interesting, I think. So now we can create the model, we can calculate the loss, we can call backward, and then we can check that each of the gradients that we stored earlier is equal to each of our new gradients. Okay, so William's asked a very good question, which is: if you do put the loss inside here, how on earth do you actually get predictions? Generally what happens in practice is Hugging Face models do something like this. They'll say self.preds equals x, and then they'll say self.finalloss equals that, and then return self.finalloss. I guess you don't even need that last bit. Well, anyway, that is what they do, so we'll leave it there. And that way you can check model.preds, for example. So it'll be something like that, or alternatively you can return not just the loss but both, as a dictionary, stuff like that. So there's a few different ways you could do it. Actually, now I think about it, I think that's what they do: they return both as a dictionary. So it would be like return dict, loss equals that, comma, preds equals that, something like that, I guess, is what they would do. Anyway, there's a few different ways to do it. Okay, so hopefully you can see that this is really making it nice and easy for us to do our forward pass and our backward pass without all of this manual fiddling around. Every class can now be considered totally separately and can be combined however we want. So you could try creating a bigger neural net if you want to. But we can refactor it more.
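The overall shape of such a model (layers called in order, loss computed inside the forward, backward run through the layers in reverse, predictions returned alongside the loss) can be sketched with scalar toy layers. Everything named here (Scale, SquaredError, the dictionary keys) is hypothetical, and gradients are passed along as return values rather than stored on tensors, purely to keep the sketch free of dependencies.

```python
class Scale:
    """Hypothetical toy layer: multiplies a number by a fixed factor k."""

    def __init__(self, k):
        self.k = k

    def __call__(self, x):
        self.inp = x                       # stored for the backward pass
        self.out = x * self.k
        return self.out

    def backward(self, out_g):
        self.k_g = out_g * self.inp        # gradient w.r.t. the factor
        return out_g * self.k              # gradient w.r.t. the input (chain rule)


class SquaredError:
    def __call__(self, pred, targ):
        self.pred, self.targ = pred, targ
        return (pred - targ) ** 2

    def backward(self):
        return 2 * (self.pred - self.targ)


class Model:
    def __init__(self, layers):
        self.layers = layers
        self.loss = SquaredError()         # an instance, not a call

    def __call__(self, x, targ):
        for layer in self.layers:
            x = layer(x)
        self.preds = x                     # keep predictions reachable
        return {"loss": self.loss(x, targ), "preds": x}

    def backward(self):
        g = self.loss.backward()
        for layer in reversed(self.layers):  # backpropagation: last layer first
            g = layer.backward(g)
        return g


model = Model([Scale(2.0), Scale(3.0)])
result = model(1.0, 5.0)                   # prediction is 1 * 2 * 3 = 6
x_g = model.backward()                     # d/dx of (6x - 5)^2 at x = 1
```

Returning a dictionary is one answer to the question of how to get predictions out when the loss lives inside the model. Storing self.preds is another, and the sketch does both.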
So basically, as a rule of thumb, when you see repeated code, self.inp equals inp, self.inp equals inp, self.out equals, return self.out, self.out equals, return self.out, that's a sign you can refactor things. And so a simple refactoring is to create a new class called Module. Module's going to do those things we just said. It's going to store the inputs, and it's going to call something called self.forward in order to create our self.out, because remember, that was one of the things we had to do again and again and again, self.out, self.out, and then return it. And so now there's going to be a thing called forward, which in this class doesn't actually do anything, because the whole purpose of this Module is to be inherited from. When we call backward, it's going to call self.backward, passing in self.out, because notice all of our backwards always wanted to get hold of self.out, right? self.out, self.out, because we need it for the chain rule. So let's pass that in, and pass in those arguments that we stored earlier. Star means take all of the arguments, regardless of whether it's zero, one, two or more, and put them into a single tuple. That's what happens when it's inside the function signature. And then when you call a function using star, it says take this sequence and expand it into separate arguments, calling backward with each one passed separately. So now for ReLU, look how much simpler it is. Let's compare the old ReLU to the new ReLU. The old ReLU had to do all this storing stuff manually, and it had all the self. stuff as well. But now we can get rid of all of that and just implement forward, because that's the thing that's being called and that's the thing that we need to implement. And so now the forward of ReLU just does the one thing we want, which also makes the code much cleaner and more understandable. Ditto for backward: it just does the one thing we want. So that's nice.
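A minimal sketch of this refactoring, working on plain floats so it has no dependencies. The helper name bwd, the extra out_g argument, and the Mul example are assumptions of this sketch. The point is only how star packs the call arguments into a tuple in the signature and unpacks them again at the call site.

```python
class Module:
    """Base class: stores the call arguments and output so subclasses don't have to."""

    def __call__(self, *args):             # star in a signature packs args into a tuple
        self.args = args
        self.out = self.forward(*args)     # star in a call unpacks them again
        return self.out

    def forward(self, *args):
        raise NotImplementedError("subclasses implement forward")

    def backward(self, out_g):
        # Hand the stored output and inputs to the subclass's gradient rule.
        return self.bwd(self.out, out_g, *self.args)

    def bwd(self, out, out_g, *args):
        raise NotImplementedError("subclasses implement bwd")


class Relu(Module):
    def forward(self, inp):
        return max(inp, 0.0)

    def bwd(self, out, out_g, inp):
        return (1.0 if inp > 0 else 0.0) * out_g


class Mul(Module):
    """Two-argument example to show that any number of inputs is stored."""

    def forward(self, a, b):
        return a * b

    def bwd(self, out, out_g, a, b):
        return (b * out_g, a * out_g)


relu = Relu()
relu_out = relu(3.0)
relu_g = relu.backward(2.0)

mul = Mul()
mul_out = mul(2.0, 5.0)
mul_g = mul.backward(1.0)
```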
Now, we still have to multiply it, so we still have to do the chain rule manually. Same thing for linear, same thing for MSE. So these all look a lot nicer. One thing to point out here is that there are often opportunities to manually speed things up when you create custom autograd functions in PyTorch. And here's an example. Look, this calculation is being done twice, which seems like a waste, doesn't it? So at the cost of some memory, we could instead store that calculation as diff, right? And I guess we'd have to store it for use later, so it would need to be self.diff. At the cost of that memory, we could now remove this redundant calculation, because we've done it once before already and stored it, and just use it directly. This is something that you can often do in neural nets. There's this compromise between storing things, with the memory use of that, and the computational speed-up of not having to recalculate them. This is something we come across a lot. And so now we can call it in the same way: create our model, passing in all of those layers. The model hasn't changed at this point; the definition was up here. We just pass in, not the layers, the weights for the layers. Calculate the loss, call backward. And look, it's the same, hooray. Okay, so thankfully PyTorch has written all this for us, and remember, according to the rules of our game, once we've re-implemented it, we're allowed to use PyTorch's version. PyTorch calls their version nn.Module, and it's exactly the same. You inherit from nn.Module. So if we want to create a linear layer just like this one, rather than inheriting from our Module, we inherit from their Module. But everything's exactly the same. So we create our random numbers. In this case, rather than passing in the already randomized weights, we're going to generate the random weights ourselves, and the zeroed biases.
And then here's our linear layer, which you could also use lin for, of course. So we define our forward. And why don't we need to define backward? Because PyTorch already knows the derivatives of all of the functions in PyTorch, and it knows how to use the chain rule. So we don't have to do the backward at all. It'll do that entirely for us, which is very cool. So we only need forward, we don't need backward. So let's create a model that uses nn.Module. Otherwise it's exactly the same as before. And now we're going to use PyTorch's MSE loss, because we've already implemented it ourselves. It's very common to import torch.nn.functional as capital F. This is where lots of these handy functions live, including MSE loss. And so now you know why we need the [:, None], because you saw the problem if we don't have it. And so create the model, call backward. And remember, we stored our gradients in something called .g; PyTorch stores them in something called .grad. But it's doing exactly the same thing. So there are the exact same values. So let's take stock of where we're up to. We've created a matrix multiplication from scratch, we've created linear layers, and we've created a complete backprop system of modules. We can now calculate both the forward pass and the backward pass for linear layers and ReLUs, so we can create a multi-layer perceptron. So we're now up to a point where we can train a model. So let's do that. Mini-batch training, notebook number four. Same first cell as before, so we won't go through it. This cell is also the same as before, so we won't go through it. Here's the same model that we had before, so we won't go through it. I'm just rerunning all that. Okay, so the first thing we should do, I think, is to improve our loss function, so it's not total rubbish anymore. If you watched part one, you might recall that there are some Excel workbooks. One of those is the entropy example. Okay, so this is what we looked at.
So just to remind you, what we're saying now is, okay, rather than outputting a single number for each image, we're instead going to output 10 numbers for each image. And so that's going to be a one-hot encoded set of, it'll be like 1, 0, 0, 0, et cetera. Well, actually, the outputs won't be 1, 0, 0; they'll basically be probabilities, won't they? So it'll be like 0.99, 0.01, et cetera. And the targets will be one-hot encoded. So if it's the digit zero, for example, it might be 1, 0, 0, 0, 0, and so on for all 10 possibilities. And so then we see, how good is it? In this case it's really good. It had a 0.99 probability prediction that it's zero, and indeed it is, because this is the one-hot encoded version. And the way we implement that is we don't even need to actually do the one-hot encoding, thanks to some tricks. We can just directly store the integer, but treat it as if it's one-hot encoded. So we can just store the actual target, zero, as an integer. So the way we do that is we say, for example, for a single output, it could be, let's say, cat, dog, plane, fish, building, and the neural net spits out a bunch of outputs. What we do for softmax is we go e to the power of each of those outputs. We sum up all of those e-to-the-power-ofs. So here's e to the power of each of those outputs, and here's the sum of them. And then we divide each one by the sum. That gives us our softmaxes. And then for the loss function, we compare those softmaxes to the one-hot encoded version. So let's say it was a dog; then it's going to have a one for dog and zero everywhere else. And then, and this is from this nice blog post here, this is the calculation: the sum of the ones and zeros, each multiplied by the log of the probabilities. So here is the log probability times the actuals.
And since the actuals are either zero or one, and only one of them is going to be a one, we're only going to end up with one value here. If we add them up, it's all zero except for one of them. So that's cross-entropy. In this special case where the target is one-hot encoded, doing the one-hot encoding multiplied by the log softmax is actually identical to simply saying, oh, dog is in this row, let's just look it up directly and take its log softmax. We can just index directly into it. So it's exactly the same thing. So that's just review. If you haven't seen that before, then yeah, go and watch the part one video where we went into that in a lot more detail. Okay, so here's our softmax calculation. It's e to the power of each output divided by the sum of them, or we can use sigma notation to say exactly the same thing. And as you can see, Jupyter Notebook lets us use LaTeX. If you haven't used LaTeX before, it's actually surprisingly easy to learn. You just put dollar signs around your equations like this. In your equations, backslash commands are kind of like your functions, if you like, and curly braces are used for arguments. So you can see here, here is e to the power of, and underscore is used for subscripts, so this is x subscript i, and the caret, the power-of symbol, is used for superscripts. And here's dots; you can see here it is, dots. So yeah, learning LaTeX is easier than you might expect, and it can be quite convenient for writing these functions when you want to. So anyway, that's what softmax is. As we'll see in a moment, well, actually, as you've already seen, in cross-entropy we don't really want softmax, we want log of softmax. So log of softmax is, here it is. We've got x.exp(), so e to the x, divided by x.exp().sum(), and we're going to sum up over the last dimension.
And then we actually want to keep that dimension, so that when we do the divide we have a trailing unit axis, for exactly the same reason we saw when we did our MSE loss function. If you sum with keepdim=True, it leaves a unit axis in that last position, so we don't have to put it back to avoid that horrible outer product issue. So this is the equivalent of this, and then .log(). So that's log of softmax. So there is the log of the softmax of the predictions. Now, in terms of high school math that you may have forgotten but definitely are going to want to know, a key piece in that list of things is log and exponent rules. Check out Khan Academy or similar if you've forgotten them, but a quick reminder is, for example, the one that we mentioned here: log of a over b equals log of a minus log of b. And equivalently, log of a times b equals log of a plus log of b. These are very handy because, for example, division can take a long time, and multiplying can create really big numbers that have lots of floating point error. Being able to replace these things with pluses and minuses is very handy indeed. In fact, I used to give people an interview question 20 years ago at a company where I did a lot of stuff with SQL and math. SQL has a sum function for group-by clauses but no product function. And I used to ask people how you would deal with calculating a compound interest column, where the answer is basically that, because compound interest is taking products, it has to be the sum of the log of the column, and then e to the power of all that. So there's all kinds of little places where these things come in handy, and they come into neural nets all the time. So we're going to take advantage of that, because we've got a division that's being logged. And also, rather handily, we're therefore going to have the log of x.exp() minus the log of this, but exp and log are opposites. So that is going to end up just being x minus the log of this sum.
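Both uses of the log rules mentioned here can be checked with a few lines of standard-library Python. This is a sketch on plain lists rather than tensors, and the growth factors and outputs are made-up numbers.

```python
import math

# A product computed with only a sum available: exp(sum(log(x))).
growth = [1.05, 1.02, 1.10]                    # hypothetical yearly growth factors
product_direct = 1.05 * 1.02 * 1.10
product_via_logs = math.exp(sum(math.log(g) for g in growth))

# Log softmax for one row of outputs, two ways.
x = [2.0, 1.0, 0.1]
sum_exp = sum(math.exp(v) for v in x)

# Literal definition: log(exp(x_i) / sum of exps).
log_of_softmax = [math.log(math.exp(v) / sum_exp) for v in x]

# Using log(a / b) = log(a) - log(b) and log(exp(x_i)) = x_i.
simplified = [v - math.log(sum_exp) for v in x]
```

The simplified form does one exponential per element and a single log for the whole row, instead of a division and a log per element.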
So log softmax is just x minus all this, logged. And here it is, all this, logged. So that's nice. Here's our simplified version. Okay, now there's another very cool trick, which is one of these things I figured out myself and then discovered other people had known for years. So not my trick, but it's always nice to rediscover things. The trick is what's written here. Let me explain what's going on. This piece here, the log of this sum, right? This sum here. We've got x.exp().sum(). Now x could be some pretty big numbers, and e to the power of that is going to be really big numbers. And with really big numbers, there's much less precision in your computer's floating point handling; the further you get away from zero, basically, the less precise it is. So we don't want really big numbers, particularly because we're going to be taking derivatives. If you're in an area that's not very precise as far as floating point math is concerned, then the derivatives are going to be a disaster. They might even be zero, because you've got two numbers that the computer can't even recognize as different. So this is bad. But there's a nice trick we can do to make it a lot better. We can calculate the max of x, and we'll call that a. So rather than just doing the log of the sum of e to the x i, we first define a as being the maximum of all of our x values. It's our biggest number. Now, if we then subtract that from every number, none of the numbers are going to be big, by definition, because we've subtracted it from all of them. Now, the problem is that's given us a different result, right? But if you think about it, let's expand this sum. Without our minus a, it's e to the power of x1, plus e to the power of x2, plus e to the power of x3, and so forth. Okay, now we've just subtracted a from our exponents, which means we're now wrong, but I've got good news.
I've got good news and bad news. The bad news is that you've got more high school math to remember, which is exponent rules. So x to the a plus b equals x to the a times x to the b. And similarly, x to the a minus b equals x to the a divided by x to the b. To convince yourself that's true, consider, for example, two to the power of two plus three. What is that? Well, two to the power of two is just two times two, and two to the power of three is two times two times two, and multiplied together that's two to the power of five. So you've got two of them here, and you've got another three of them here, and we're just adding up the counts to get the total index. So we can take advantage of this here and say, oh, well, this is equal to e to the x1 over e to the a, plus e to the x2 over e to the a, plus e to the x3 over e to the a. And this is a common denominator, so we can put all that together over e to the a. And why did we do all that? Because if we now multiply that all by e to the a, these cancel out and we get the thing we originally wanted. So that means we simply have to multiply this by that, and this gives us exactly the same thing as we had before. But, critically, this is no longer ever going to be a giant number. So this might seem a bit weird: we're doing extra calculations. It's not a simplification, it's a complexification. But it's one that's going to make things easier for our floating point unit. So that's our trick. Rather than doing the log of this sum, what we actually do is the log of e to the a times the sum of e to the x minus a. And since we've got the log of a product, that's just the sum of the logs, and log of e to the a is just a. So it's a plus that. So this here is called the log-sum-exp trick. Oops, people are pointing out that I've made a mistake. Thank you. That, of course, should have been inside the log. You can't just go sticking it on the outside like a crazy person.
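The trick can be demonstrated in a few lines of standard-library Python. This is a sketch on plain lists with an illustrative function name. For small inputs the naive and shifted versions agree, and for large inputs the naive version overflows while the shifted one does not.

```python
import math


def logsumexp(xs):
    """log(sum(exp(x))) computed as a + log(sum(exp(x - a))), with a = max(x)."""
    a = max(xs)
    return a + math.log(sum(math.exp(x - a) for x in xs))


small = [1.0, 2.0, 3.0]
naive_small = math.log(sum(math.exp(x) for x in small))
stable_small = logsumexp(small)            # same value as the naive version

big = [1000.0, 1001.0, 1002.0]
try:
    math.log(sum(math.exp(x) for x in big))
    naive_overflowed = False
except OverflowError:                      # exp(1000) does not fit in a float
    naive_overflowed = True

stable_big = logsumexp(big)                # fine: the largest exponent is exp(0)
```

After subtracting the maximum, every exponent is at most zero, so each term in the sum lies between 0 and 1 and the largest is exactly 1.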
Yeah, that's what I meant to say. Okay, so here is the log-sum-exp trick. Oh, I called it m instead of a, which is a bit silly; I should have called it a. But anyway, we find the maximum on the last dimension, and then here is m plus that exact thing. So that's just another way of doing that. Okay, so that's logsumexp. So now we can rewrite log softmax as x minus logsumexp. And we're not going to use our version, because PyTorch already has one, so we'll just use PyTorch's. And if we check, here we go, there's our results. And so then, as we've discussed, the cross-entropy loss is minus the sum of the targets times the log probabilities. And as we discussed, our targets are one-hot encoded, or actually, better still, they're just the integers. I guess I should make that more clear: actually, they're just the integer indices. So we can simply rewrite that as the negative log of the prediction for the target. So that's what we have in our Excel. And how do we do that in PyTorch? This is quite interesting. There's a lot of cool things you can do with array indexing in PyTorch and NumPy; basically they use the same approaches. Let's take a look. Here are the first three actual values in y_train. They're five, zero and four. Now what we want to do is find, in our softmax predictions, the prediction at index five in the zeroth row, the prediction at index zero in the first row, and the prediction at index four in the index two row. So these are the numbers that we want. This is going to be what we add up for the first three rows of our loss function. So how do we do all that in one go? Well, here's a cool trick. See here, I've got zero, one, two. If we index using two lists, we can put here zero, one, two, and for the second list we can put y_train[:3], which is five, zero, four.
And this is actually going to return the values at zero comma five, one comma zero, and two comma four, which is, as you see, exactly the same thing. So therefore, this is actually giving us what we need for the cross-entropy loss. So if we index with a range over our target's first dimension, or zero-index dimension, which is all this is, and the target, and then take the negative of that and the mean, that gives us our cross-entropy loss, which is pretty neat, in my opinion. All right, so PyTorch calls this negative log likelihood loss, but that's all it is. And so if we take the log softmax and pass it to the negative log likelihood, then we get the loss. And this particular combination in PyTorch is called F.cross_entropy. So let's check. Yep, F.cross_entropy gives us exactly the same thing. So that's cool. We have now re-implemented the cross-entropy loss, and there's a lot of confusing things going on there. So this is one of those places where you should pause the video and go back and look at each step and think, not just what is it doing, but why is it doing it? And also try typing in lots of different values yourself to see if you can see what's going on. And then put this aside and test yourself by re-implementing log softmax, NLL loss and cross-entropy yourself, and compare them to PyTorch's values. So that's a piece of homework for you for this week. So now that we've got that, we can actually create a training loop. Let's set our loss function to be cross-entropy. Let's create a batch size of 64. And so here's our first mini-batch. Okay, so xb is the x mini-batch. It's going to be from zero up to 64 from our training set. So we can now calculate our predictions. That's 64 by 10. So for each of the 64 images in the mini-batch, we have 10 outputs, one for each digit. And our y is just, all right, let's print those out. So there's our first 64 target values. These are the actual digits.
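The two-list indexing trick behind the negative log likelihood described earlier works the same way in NumPy, so here is a sketch with NumPy standing in for PyTorch (an assumption for portability). It uses three made-up rows and three classes instead of the lesson's ten, and the variable names are illustrative. It also checks that indexing gives the same loss as multiplying by a one-hot encoding.

```python
import numpy as np

# Hypothetical predicted probabilities for three rows and three classes.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])
log_probs = np.log(probs)
targets = np.array([2, 0, 1])                # integer class index for each row

rows = np.arange(len(targets))               # 0, 1, 2
picked = log_probs[rows, targets]            # elements (0, 2), (1, 0), (2, 1)
nll = -picked.mean()                         # negative log likelihood

# The same loss written out with an explicit one-hot encoding.
one_hot = np.eye(3)[targets]
nll_one_hot = -(one_hot * log_probs).sum(1).mean()
```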
And so with our loss function, we're gonna start with a bad loss, because it's entirely random at this point. Okay, so for each of the predictions we made, so those are our predictions. And so remember, those predictions are 64 by 10. What did we predict? So for each one of these 64 rows, we have to go in and see where the highest number is. So if we go through here, we can go through each one. Here's a 0.1. Okay, it looks like this is the highest number. So it's zero, one, two, three. So the highest number is this one. So you've gotta find the index of the highest number. The function to find the index of the highest number is called argmax. And yep, here it is, three. And I guess we could have also written this as preds.argmax. Normally you can do them either way. I actually normally prefer to do it this way. Yep, there's the same thing. Okay, and the reason we want this is because we want to be able to calculate accuracy. We don't need it for the actual neural net, but we just like to be able to see how we're going, because it's a metric. It's something that we use for understanding. So we take the argmax and we compare it to the actual. So that's gonna give us a bunch of bools. If you turn those into floats, they'll be ones and zeros. And the mean of those floats is the accuracy. So our current accuracy, not surprisingly, is around 10%. It's 9%, because it's random. That's what you would expect. So let's train our first neural net. So we'll set a learning rate, and we'll do a few epochs. So we're gonna go through each epoch, and we're gonna go from zero up to n, that's the 50,000 training rows, skipping by 64, the batch size, each time. And so we're gonna create a slice that starts at i. So starting at zero, it goes up to 64, unless we've gone past the end, in which case we'll just go to n. And so then we will slice into our training set for the x and for the y to get our x and y batches.
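The accuracy calculation described above (argmax, compare to the actual labels, convert the bools to floats, take the mean) can be sketched as follows. The function name `accuracy` and the tiny hand-written tensors are assumptions chosen so the result is easy to check by eye.

```python
import torch

def accuracy(out, yb):
    # Index of the highest number in each row, compared to the actual label.
    # The comparison gives bools; as floats they are ones and zeros,
    # and their mean is the accuracy.
    return (out.argmax(dim=1) == yb).float().mean()

# Made-up predictions for three items and four classes.
preds = torch.tensor([[0.10, 0.20, 0.05, 0.90],
                      [0.80, 0.10, 0.00, 0.10],
                      [0.20, 0.30, 0.40, 0.10]])
yb = torch.tensor([3, 1, 2])   # made-up labels

print(preds.argmax(dim=1))     # index of the highest number per row
print(accuracy(preds, yb))     # two of the three rows match
```

Both `torch.argmax(preds, dim=1)` and `preds.argmax(dim=1)` do the same thing; the method form is used here.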
We will then calculate our predictions and our loss function, and do our backward. So the way I did this originally was I had all of these in, oopsie daisy, in separate cells, and I just typed in i equals zero and then went through one cell at a time, calculating each one until they all worked. And so then I could put them in a loop. Okay, so once we've done backward, we can then, with torch.no_grad(), go through each layer, and if that's a layer that has weights, we'll update them to the existing weights minus the gradients times the learning rate, and then zero out the gradients of the weights and biases. This underscore means do it in place. So that sets them to zero. So if I run that, oops, got to run all of them. I guess I skipped a cell. There we go, it's finished. So you can see that our accuracy on the training set (which is a bit unfair, but it's only three epochs) is nearly 97%. So we now have a digit recognizer. It trains pretty quickly and is not terrible at all. So that's a pretty good starting point. All right, so what we're gonna do next time is we're going to refactor this training loop to make it dramatically, dramatically, dramatically simpler, step by step, until eventually we get it down to something much, much shorter. And then we're gonna add a validation set to it and a multiprocessing data loader. And then, yeah, we'll be in a pretty good position, I think, to start training some more interesting models. All right, hopefully you found that useful and learned some interesting things. And so what I'd really like you to do at this point, now that you've got all these key basic pieces in place, is to really try to recreate them without peeking as much as possible. So recreate your matrix multiply, recreate those forward and backward passes, recreate something that steps through layers, and even see if you can recreate the idea of the .forward and the .backward.
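For reference while recreating things, the manual training loop described earlier in this section could be sketched roughly as below. This is not the notebook's code: it assumes a small `nn.Sequential` model and synthetic data in place of MNIST, and the sizes, learning rate and the hypothetical `true_w` matrix used to generate learnable labels are all made up for illustration.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
n, n_in, n_hidden, n_out = 512, 20, 50, 10

# Synthetic stand-in for the training set (assumption: not MNIST).
x_train = torch.randn(n, n_in)
true_w = torch.randn(n_in, n_out)           # hypothetical matrix defining the labels
y_train = (x_train @ true_w).argmax(dim=1)

model = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU(), nn.Linear(n_hidden, n_out))
loss_func = F.cross_entropy
bs, lr, epochs = 64, 0.1, 3

initial_loss = loss_func(model(x_train), y_train).item()
losses = []
for epoch in range(epochs):
    for i in range(0, n, bs):
        s = slice(i, min(n, i + bs))        # stop at n if we go past the end
        xb, yb = x_train[s], y_train[s]
        loss = loss_func(model(xb), yb)
        loss.backward()
        with torch.no_grad():
            for l in model:
                if hasattr(l, 'weight'):    # only layers that have weights
                    l.weight -= l.weight.grad * lr
                    l.bias -= l.bias.grad * lr
                    l.weight.grad.zero_()   # trailing underscore: in place
                    l.bias.grad.zero_()
    losses.append(loss_func(model(x_train), y_train).item())

print(initial_loss, losses)
```

The parameter update sits inside `torch.no_grad()` so that the update step itself is not tracked for gradients, and the gradients are zeroed after each step because `backward` adds to whatever is already stored.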
Make sure it's all in your head really clearly so that you fully understand what's going on. At the very least, if you don't have time for that, because that's a big job, you could pick out a smaller part of that, the piece that you're more interested in, or you could just go through and look really closely at these notebooks. So if you go to Kernel, Restart and Clear Output, it'll delete all the outputs and try to think like, what are the shapes of things? Can you guess what they are? Can you check them, and so forth. Okay, thanks everybody. Hope you have a great week and I will see you next time. Bye.