Welcome back to lesson nine, part two, how to train your model. Before we talk about training our model though, I wanted to revisit a couple of things that came up last week. And the reason I really wanted to revisit them is because I wanted to give you an insight into how I do research. A lot of this course really will be me showing you how I do research and how I do software development, in the hope that that is somewhat helpful to you. So one of the questions that came up last week was: we looked inside the nn.Conv2d that comes with PyTorch to see how it goes about initializing parameters. And we found that inside Conv2d.reset_parameters, we found the way that it does initialization. And we found this math.sqrt(5) without any commentary, which was quite mysterious. So I decided to do two kinds of research in parallel. One is: what's the impact of this math.sqrt(5)? And at the same time, trying to get in touch with the PyTorch team to ask them where this math.sqrt(5) comes from. So let me show you how I went about doing that research. So I loaded up just what we had from last week, which was the ability to download the MNIST data and open it up. And then the function to normalize it, which I thought we'd export if we haven't already. And then we'd grab the data and we'd normalize it. And because we're going to be talking a lot about convolutions towards the end of today's lesson, and particularly next lesson, I suspect, I'll skip over some of the details about convolutions for now. But basically to do a convolution, as you know, we need a square or rectangular input. And our MNIST input, remember, was just a single vector per image, 784 long. So I resized them all to 28 by 28, one channel images, so that we could test out the impact of this Conv2d init in PyTorch, and set up the various variables that we wanted to have.
And then I created a Conv2d layer. So we have one input, because it's just one channel; nh, which is number hidden, which is 32 outputs; and let's do a 5 by 5 kernel. We'll talk more later about why 5 by 5 might be suitable. And just for testing, let's just grab the first 100 elements of the validation set. So we've now got a tensor of 100 by 1 by 28 by 28. So it's a really good idea when you're playing with anything in software development, including notebooks, to refactor things. I'm going to be wanting to look at the mean and standard deviation of a bunch of things, so let's create a little function called stats to do that. And I never plan ahead. When you see this in a notebook, it always means that I've written it out by hand, and then I copied it, and then I'm like, OK, I'm using it twice, I'll chuck it in a function. So then I go back and create the function. So here I've got the mean and standard deviation of my l1, which is a Conv2d layer. And a Conv2d layer contains a weight tensor parameter and a bias tensor parameter. So just to remind you, l1.weight.shape is 32 output filters, because that's what number hidden was, one input filter, because we only have one channel, and then five by five. So that's the size of our tensor. And if you've forgotten why that's the size of the tensor, you can go back to the Excel directory, for example, from part one, where you can find the conv-example spreadsheet. And in the conv-example spreadsheet, you can see what each of those parameters does. So we basically had a filter for each input channel and for each output channel. So that's kind of what it looked like. And so you can see, for the next layer, we now have a four dimensional tensor, a rank four tensor. In the spreadsheet it was three by three; here we've got a five by five for each input and for each output. So that's 32 by one by five by five. So the mean and standard deviation of the weights is 0 and 0.11.
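To make that concrete, here's a minimal sketch of the stats helper and the conv layer's weight shape. The names (stats, nh, l1) follow the lecture, but the exact notebook code may differ slightly.

```python
import torch
import torch.nn as nn

def stats(x):
    # mean and standard deviation of a tensor, as plain Python floats
    return x.mean().item(), x.std().item()

nh = 32                    # number of hidden filters
l1 = nn.Conv2d(1, nh, 5)   # 1 input channel, 32 output filters, 5x5 kernel

print(l1.weight.shape)     # torch.Size([32, 1, 5, 5]): out, in, kh, kw
print(stats(l1.weight))    # mean near 0, std around 0.11
```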
And this is because we know that behind the scenes, it's called this function to initialize. So the bias is initialized with a uniform random number between negative of this and positive of this. And then the weights are initialized with Kaiming uniform with this odd math.sqrt(5) thing. So that's fine. That's not particularly interesting. What's more interesting is to take our input tensor of MNIST numbers and put it through this layer, which we called l1, as in layer one, which remember is a Conv2d layer. And let's create an output t. And let's look at the stats of t. So this is the stats of the output of this layer. We would like it to have a mean of 0 and a standard deviation, or a variance, of 1. The mean of 0 is there, but the standard deviation of 1 is not there. So that looks like a problem. Let's compare this to the normal Kaiming init. So the normal Kaiming init, remember, is designed to be used after a ReLU layer, or more generally a leaky ReLU layer. And recall that a leaky ReLU layer has y equals x here, and here the gradient of this is called a, or the leak, or whatever. Now in our case, we're just looking at a conv layer, so we don't have any non-linearity going on; in fact, it's straight here as well. So effectively we have a leak, if you like, of 1, or an a of 1. So to use Kaiming init with no ReLU, we can just put a equals 1. And if we do that, then we get a mean of 0 and a variance of 1. So Kaiming init seems to be working nicely. Okay, so let's now try it with ReLU. So let's now define a function, which is the function for layer 1, which is to pass it through our layer 1 conv, and then do a ReLU with some a, with some leak amount, which we'll set to 0 by default. So this will be just a regular ReLU. And you can see that if we now run that with Kaiming initialization, we get a variance of 1, which is good. And the mean is no longer 0, as we discussed last week. It's about a half.
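As a sketch of that experiment (using random normal data as a stand-in for the normalized MNIST batch, so the exact numbers will differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import init

x = torch.randn(100, 1, 28, 28)   # stand-in for 100 normalized MNIST images
l1 = nn.Conv2d(1, 32, 5)

# a=1 tells Kaiming init there's no non-linearity after this layer
init.kaiming_normal_(l1.weight, a=1.)
t = l1(x)
print(t.mean().item(), t.std().item())   # roughly 0 and roughly 1

def f1(x, a=0.):
    # conv followed by a leaky ReLU with slope a (a=0 is a plain ReLU)
    return F.leaky_relu(l1(x), a)
```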
But if we go back and re-initialize the Conv2d with this default PyTorch init, this is not looking good at all. With ReLU, it's even worse, because remember, they don't have anything handling that ReLU case in the default conv init. So this looks like a problem. So, a variance of 0.35. It may not sound a lot lower than 1, but let's take a look at what that means. Oh, I forgot to mention where we are. This is the 02a_why_sqrt5 notebook. So, notebook 02a. So in order to explore this, I decided that I would try and write my own Kaiming init function. Normally with the Kaiming init function, if we were working with just a regular, fully connected matrix multiplication, we would basically be saying how many output filters are there; so if this is the weight matrix, then what's the width of the weight matrix? For a convolutional layer, it's a little bit different, right? Because what we actually want to know is, each time, like in this case, we're basically multiplying all these together with some set of inputs and then adding them all up, right? That's basically what a single step of a matrix multiplication is. In a convolution, we're also multiplying a bunch of things together and adding them up. But what we're actually adding together is, if it was three by three, each of the three by three elements, and also the channel dimension. We multiply all of those together and add them all up. Because convolution and matrix multiplication are kind of one and the same thing, as we know, with some weight tying and with some zeros. So in order to calculate the total number of multiplications and additions going on for a convolutional layer, we need to basically take the kernel size, which in this case is five by five, and multiply it by the number of input filters, okay?
So the general way to get that five by five piece is we can just grab any one piece of this weight tensor, and that will return a five by five kernel, and then say how many elements are in that part of the weight tensor. And that's going to be the receptive field size. So the receptive field size for just the immediate layer before is how many elements are in that kernel. So for this, it's 25; it's five by five, right? And so if we then say, okay, let's grab the shape of the weight matrix, it gives us the number of filters out, 32, and then the number of filters in, one. And then I'll skip the rest, because they're the only two things I want. So now for the Kaiming He init, we can calculate: fan in is the number of input filters times the receptive field size, so that's one times 25; and fan out is 32 times 25. So there you can see, this is how we calculate the effective fan in and fan out for a convolutional layer. So we can do all that by hand. And then for the Kaiming init formula, for ReLU you need to multiply by root two. Or if there's a leaky part in it, so if the a is not equal to zero, then it's actually root two divided by the square root of one plus a squared. So that's just the formula for the Kaiming init, and that's often called the gain for the init. And so there's the formula for the gain. So you can see that if the gain is one, then that's just linear; there's no non-linearity at all, so there's no change to the calculation of how you do the initialization. On the other hand, if it's a standard ReLU, then you've got the root two, which we saw last week from the Kaiming paper. With a leak of 0.01, it's about root two as well, it's pretty close, and this is a common leaky ReLU slope. But what about in the case of the PyTorch init? In the case of the PyTorch init, a is root five, which gives a gain of 0.577. Which sounds like an odd number. It's a long way away from what we were expecting to see. So that's a bit concerning.
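That gain formula and the fan calculation are short enough to sketch in a few lines of plain Python, following the lecture's description:

```python
import math

def gain(a):
    # Kaiming gain for a leaky ReLU with negative slope a:
    # sqrt(2 / (1 + a^2))
    return math.sqrt(2.0 / (1 + a**2))

# fan-in/fan-out for a conv layer: channels times receptive field size
rec_fs = 5 * 5            # 5x5 kernel
nf_in, nf_out = 1, 32
fan_in  = nf_in  * rec_fs # 25
fan_out = nf_out * rec_fs # 800

print(gain(0))            # sqrt(2) ~= 1.414, standard ReLU
print(gain(0.01))         # ~= 1.414 still, a common leaky slope
print(gain(1))            # 1.0, plain linear layer, no non-linearity
print(gain(math.sqrt(5))) # ~= 0.577, PyTorch's mysterious default
```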
But one thing we have to account for here is that the initialization that they use for PyTorch is not Kaiming normal, it's Kaiming uniform, right? And so normally distributed random numbers look like that, but uniform random numbers look like that, right? And the uniform random numbers they were using as their starting point were between minus one and one. And so the standard deviation of that obviously is not one; the standard deviation is obviously less than one. And so you can Google for the standard deviation of a uniform distribution, or you could jump into Excel or Python and just grab a bunch of random numbers and find out what the standard deviation is. And you'll find that, I've done it here actually, I've grabbed 10,000 random numbers in that uniform distribution and asked for their standard deviation, and it turns out that it's one over root three, okay? So part of the reason for this difference actually is that they need the gain to handle uniform random numbers rather than just normal random numbers. But it still doesn't quite account for the difference. So let's take a look. So here's my version of Kaiming init, in which I've just grabbed all of the previous lines of code and merged them together. And then I've just added this thing to multiply it by root three, because of the uniform random numbers. And so then if I run this kaiming2 on my weights and get the stats of that: nice, I again get a variance of about one. And again, confirming that if I, well, this is interesting: if I do it with a equals math.sqrt(5), I would expect to get the same result as the PyTorch default, which I do. It's about 0.4, which is what we found back here, 0.35. So it seems like we've successfully re-implemented what they had. So at this point, I was like, okay, well, what does this do? What does this mean? So to see what this looks like, I threw together a quick conv net, and I grabbed the first 100 dependent variables.
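Here's my reconstruction of that kaiming2 function from the description; the argument names are guesses, but the logic, including the root-three correction for uniform random numbers, is what the lecture describes:

```python
import math
import torch

# empirical check: uniform numbers in [-1, 1] have std 1/sqrt(3)
u = torch.rand(10000) * 2 - 1
print(u.std())             # about 0.577, i.e. 1/sqrt(3)

def kaiming2(x, a, use_fan_out=False):
    # fill x uniformly so that its std matches gain / sqrt(fan)
    nf, ni, *_ = x.shape
    rec_fs = x[0, 0].numel()          # receptive field size, 25 for 5x5
    fan = nf * rec_fs if use_fan_out else ni * rec_fs
    gain = math.sqrt(2.0 / (1 + a**2))
    std = gain / math.sqrt(fan)
    bound = math.sqrt(3.) * std       # uniform(-b, b) has std b/sqrt(3)
    x.uniform_(-bound, bound)

w = torch.zeros(32, 1, 5, 5)
kaiming2(w, a=0)                      # the ReLU case
print(w.std())                        # about sqrt(2)/sqrt(25) = 0.283
```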
And so then I took my input and I ran it through the whole conv net to get the stats of the result. So this is now telling me what happens when I use the default PyTorch init and put it through a four layer conv net. And the answer is, I end up with a variance of 0.006. And that sounds likely to be a really big problem, right? Because there's so little variation going on in that last layer. And also there's a huge difference between the first layer and the last layer; that's the really big issue. Well, the input layer is one, the first hidden layer is 0.4, and the last layer is 0.006. So these are all going at totally different rates. And then what we could do is we could grab that prediction and put it through mean squared error, this is the function we created last week, run backward, and get the stats on the gradients for the first layer weights. So this has now gone all the way forward and all the way back again. And again, the standard deviation is nowhere near one. So that sounds like a big problem. So let's try using Kaiming uniform instead. And if you look at the kaiming_uniform_ source code, you'll see that it's got the steps that we saw, gain over root of the fan, and here is the square root of three, because it's uniform, okay? And so we can confirm: let's go through every layer, and if it's a convolutional layer, then let's call kaiming_uniform_ on the weights and set the biases to zero. So we'll initialize it ourselves, and then we'll grab t. And it's not one, but it's a lot better than 0.006. So this is pretty encouraging that we can get through four layers. We wouldn't want to have, say, a 40 layer neural network which is losing this much variance, but it should be fine, plenty good enough, for our four layer network. And then let's also confirm on the backward pass. And on the backward pass, the first layer's gradient is 0.5.
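A rough reconstruction of that experiment. The architecture here is my guess at a similar four-layer conv net, not the exact notebook one, and random data stands in for MNIST, so the numbers are only illustrative of the effect:

```python
import torch
from torch import nn

torch.manual_seed(42)

def conv(ni, nf): return nn.Conv2d(ni, nf, 5, stride=2, padding=2)

# four conv layers with ReLUs between them (a stand-in architecture)
m = nn.Sequential(conv(1, 8), nn.ReLU(), conv(8, 16), nn.ReLU(),
                  conv(16, 32), nn.ReLU(), conv(32, 1))
x = torch.randn(100, 1, 28, 28)

t = m(x)
print(t.std())            # tiny: the default init shrinks the variance

# re-initialize every conv with kaiming_uniform_ and zero the biases
for l in m:
    if isinstance(l, nn.Conv2d):
        nn.init.kaiming_uniform_(l.weight)
        l.bias.data.zero_()
t2 = m(x)
print(t2.std())           # not 1, but far closer than before
```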
So that was my kind of starting point for the research here. And at the end of this, I kind of thought, this is pretty concerning. And why did I think it was concerning? We'll be seeing a lot more about why it's concerning, but let's quickly look at 02b_initializing. So Sylvain put this together today, and he called this "why you need a good init". And he's pointed out here that if you grab a random vector x and a random matrix a, which is normally distributed, mean 0 and standard deviation of 1, then 100 times you basically go x is a times x. So you're basically multiplying again and again and again. After 100 iterations, your standard deviation and mean are not a number. So basically the issue is that when you multiply by a matrix lots and lots of times, you explode out to the point that your computer can't even keep track. So what Sylvain did next was he actually put something in a loop to check whether it's not a number yet, and he found it was only 28 iterations before it died. So it didn't take very long to explode. Now, on the other hand, what if we scale the random numbers to a standard deviation of 0.01 instead of 1? If we do that 100 times, then it disappears to zero. So you can see, if you've got a 100 layer neural net, because that's what this is doing, it's doing 100 matrix multiplies, each on the output of the previous one, you've got to be super careful to find some good set of weights. Because if this is your starting set of weights, whether it's 0.01 or one standard deviation, you can't ever learn anything, because there are no gradients; the gradients are either 0 or NaN, right? So you actually have to have a reasonable starting point. And this is really why, for decades, people weren't able to train deep neural networks: people hadn't figured out how to initialize them. So instead, we have to use some better init, okay? And we'll talk about that in a moment.
For those who are interested, Sylvain has then gone on to describe why it is that you have to divide by the square root of the fan. And so feel free to keep reading that if you're interested. It's cool, but we don't need to know it for now. It's just some derivation and further discussion. So in parallel, I also asked the PyTorch team about this. I sent these results to them, and I said, what's going on? And so Soumith finally appeared, and he said it was a historical accident. Because for 15 years before PyTorch appeared, there was a product called Torch, a neural network library in Lua, and they did it that way. And so then on Google+, in 2014, he started talking to Sander Dieleman, who's now at DeepMind, and who at about this time, maybe a bit before, was our intern actually, at Enlitic. And Sander said, this root five thing looks weird. And Soumith said, no, no, go look at the paper. And Sander said, no, that's not what the paper said. And Soumith said, oh, it's a bug. But it's a good bug, because somebody went and checked it out, and they thought that they were getting better results with this thing. So then I talked to Soumith, and he was already aware of this issue to some extent, and within a couple of hours the PyTorch team had created an issue, saying they're gonna update their init. So this is super cool. This is partly to say, well, this is an awesome team, super responsive, and this is why PyTorch works so well: they see issues and they fix them. But it's also to say, when you see something in a library, don't assume it's right or that it makes sense. When it comes to deep learning, none of us know what we're doing. And you can see it doesn't take too much to dig into something. And then you can raise an issue and say, here's the analysis that I did. There's a fantastic extension called Gist-it, G-I-S-T, gist it.
For Jupyter Notebooks, it lets you take your little research notebook, press a single button, and it turns it into a shareable gist that you can then link to and say, here's the analysis that I did. And so, yeah, that's a little bit of fun, a little bit of research I did into answering this question from last week. There are lots of interesting initialization approaches you can use. We've already talked about the Glorot and Bengio paper. We've already talked about the Kaiming He paper. There's an interesting paper called "All You Need Is a Good Init", which describes how you can iteratively go through your network and set one layer of weights at a time, literally doing a little optimization to find which set of parameters gets you a unit variance at every point. There's another cool paper which talks about something called orthogonal initialization. If you've done some linear algebra, particularly if you've done Rachel's computational linear algebra course, you'll know about the idea of orthogonal matrices, and they make good inits. We talked briefly last week about Fixup initialization, and then there's also something called self-normalizing neural networks. Fixup and self-normalizing neural networks are both interesting papers, because they describe how to try to set a combination of activation functions and init such that you are guaranteed a unit variance as deep as you like. And both of those two papers went to something like a thousand layers deep and trained them successfully. Fixup is much more recent, but in both cases people have kind of hailed them as reasons we can get rid of batch norm. I think that's very unlikely to be true. Very few people use this SELU thing now, because in both cases they're incredibly fiddly. So for example, in the self-normalizing neural networks case, if you put in dropout, you need to put in a correction. If you do anything different, you need to put in a correction.
As you've seen, as soon as something changes, like the amount of leakiness in your activation function or whatever, all of your assumptions about what your variance will be in the next layer disappear. And for this SELU paper, it was a particular problem, because it relied on two specific numbers that were calculated in a famous 96 page long appendix of math in the SELU paper. And so if you wanted to do a slightly different architecture in any way, you'd be stuck: they only showed this for a fully connected network. So if you want to do convolutions, what are you going to do? Redo that 96 pages of math? So that 96 pages of math is now so famous that it has its own Twitter handle, the SELU appendix, which has the pinned tweet, "why does nobody want to read me?" And this is literally what the entire 96 pages of the appendix looks like. I will mention that in my opinion, this is kind of a dumb way of finding those two numbers. The "All You Need Is a Good Init" paper is a much better approach to doing these things, in my opinion, which is: if you've got a couple of parameters you need to set, then why not set them using a quick little loop or something? So if you want to find two SELU-style parameters that work for your architecture, you can find them empirically pretty quickly and pretty easily. Okay, so that's a little bit about init. We'll come back to more of that very shortly. There was one other question from last week, which was: we noticed that the shape of the manual linear layer we created and the shape of the PyTorch one were transposed. And the question was, why? And so again, I did some digging into this until eventually Soumith from the PyTorch team pointed out to me this commit from seven years ago in the old Lua Torch code where this actually happened. Basically, it's because that old Lua library couldn't handle batch matrix multiplication without doing it in this transposed way.
And that's why, still to this day, PyTorch does it kind of upside down. Which is fine; it's not slower, it's not a problem. But again, it's an interesting case of something I find happens all the time in deep learning: something's done a particular way forever, and then everybody does it that way forever, and nobody goes back and says why. In this particular case, it really doesn't matter, I don't think. But often it does, right? So, things like how we initialize neural networks and how many layers they should have and stuff like that; nobody really challenged the normal practices for years. So I'm hoping that with this really ground up approach, you can see what the assumptions we're making are, and see how to question them, and see that, to me, PyTorch is the best library around at the moment, and even PyTorch has these weird kind of archaic edges to it. Okay, so that was a little diversion to start with, but a fun diversion, because that's something I spent a couple of days this week on and think is pretty interesting. So to go back to how we implement a basic modern CNN model, we got to this point. So we've done a matrix multiplication, so that's our affine function. We've done ReLU, so that's our non-linearity. And a fully connected network forward pass is simply layering together those two things. So we did that, and then we did the backward pass and refactored that nicely. And it turned out that it looked pretty similar to PyTorch's way of doing things. And so now we're ready to train our model, and that's where we're up to. So here we are, 03_minibatch_training, and we're going to train our model. So we can start by grabbing our MNIST data. So again, we're just importing the stuff that we just exported from the previous class. Here's the model we created in the previous class. And so let's get some predictions from that model, and we'll call them preds, okay?
And so now to train our model, the first thing we need is a loss function, because without a loss function, we can't train it. Now, previously we used mean squared error, which I said was a total cheat. Now that we've decided to trust PyTorch's autograd, we can use many more things, because we don't have to write our own gradients, and I'm too lazy to do that. So let's go ahead and use cross entropy, because cross entropy makes a lot more sense. To remind you from the last class, there is an entropy example notebook where we learned, first of all, that cross entropy requires doing two things. In the case of this multi-class categorical cross entropy, you first do softmax, and then you do the negative log likelihood. The softmax was: if we have a bunch of different possible predictions, and we got some output for each one from our model, then we take e to the power of that output, we sum them all up, and then we take e to the power of each one divided by the sum of the e to the powers, and that was our softmax. So there it is in math form, there it is in summation math form, and here it is in code form, okay? So x.exp() divided by x.exp().sum(). And then for the whole thing, we do a .log(). And that's because in PyTorch, negative log likelihood expects a log softmax, not just a softmax, and we'll see why in a moment. So we pop a log on the end. So here's our log_softmax function. So now we can go ahead and create our softmax predictions by passing preds to log_softmax. Now that we've done that, we can calculate cross entropy loss. And cross entropy loss is generally expressed in this form, which is the sum of the actual times the log of the predicted probability of that actual. So in other words, if we have is-cat and is-dog, then here's our actuals. So it's one hot encoded: is cat, yes; is dog, no. We have our predictions from our model, from our softmax. We can then say, well, what's the log of the probability it's a cat?
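In code, the naive version looks something like this (a sketch following the lecture's description, with made-up model outputs):

```python
import torch

def log_softmax(x):
    # softmax: exp(x) over the sum of exp(x) along the last axis,
    # then take the log, because nll expects log-probabilities
    return (x.exp() / x.exp().sum(-1, keepdim=True)).log()

pred = torch.randn(4, 10)          # fake model outputs: 4 samples, 10 classes
sm_pred = log_softmax(pred)
print(sm_pred.exp().sum(-1))       # each row's probabilities sum to 1
```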
So log of this; and what's the log of the probability it's a dog? So log of one minus that. And so then our negative log likelihood is simply b times e plus c times f, and then take the negative of all that. That's negative log likelihood, which is what this is. But remember, and I know I keep saying this because people keep forgetting (not you guys, but people out in the world keep forgetting), that when you're multiplying by something which is mainly zeros, and in one hot encoded multi-class classification most of your categories are zero, every time you multiply by zero, you're doing nothing at all, but you're doing it very slowly. So rather than multiplying by zero and then adding up the one that you have, a much faster way, as we know, to multiply by a one hot encoded thing is to first of all simply say: what's the location of the one here? So in this case, it's location two; in this case, it's location one, if we index from one. And then we just look up into our array of probabilities directly, offset by this amount. Or to put it in math terms, for one hot encoded x's, the above is simply log of p_i, where i is the index of, sorry, not our prediction, the actual; so the index into here of the actual. So how do we write this in PyTorch? And I'll show you a really cool trick. This is what we're gonna end up with. This is our negative log likelihood implementation. And it's incredibly fast and it's incredibly concise, and I'll show you how we do it. Let's look at our dependent variable. So let's just look at the first three values: they're five, zero, four. So those are the first three elements of the dependent variable. And so what we wanna do is we wanna find what is the probability associated with five in our predictions, and with zero, and with four. So our predictions, our softmax predictions, remember, are 50,000 by 10, okay? And so if we take the very first of those, there they all are, all right?
And it said that the actual answer should be five. So if we count in: zero, one, two, three, four, five; that's the answer that we're gonna want, right? Okay, so here's how we can grab all three of those at once. We can index into our array with the whole thing: five, zero, four. And for the first bit we pass in just the contiguous integers: zero, one, two. Why does this work? This works because PyTorch supports all of the advanced indexing support from NumPy. And if you click on this link, one of the many types of indexing that NumPy, and therefore PyTorch, supports is integer array indexing. And what this is, is that you pass a list for each dimension. So in this case we have two dimensions, so we need to pass two lists. And the first is the list of all of the row indexes you want, and the second is the list of all of the column indexes you want. So this is gonna end up returning element (0, 5), element (1, 0), and element (2, 4), which are the exact numbers that we wanted. So for example, element (0, 5) is minus 2.49, right? So to grab the entire list of the exact things that we want for our negative log likelihood, we basically say, okay, let's look in our predictions. And then for our row indexes, it's every single row index, so range(targ.shape[0]). targ.shape[0] is the number of rows, so range of that is all of the numbers from zero up to the number of rows: 0, 1, 2, 3, blah, blah, blah, up to 49,999. And then which columns do we want for each of those rows? Well, whatever our target is, whatever the actual value is. So in this case, 5, 0, 4, etc. So that returns all of the values we need. We then take the minus, because it's negative log likelihood, and take the mean. So that's all it takes to do negative log likelihood in PyTorch. Which is super wonderfully easy. So now we can calculate our loss, which is the negative log likelihood of the softmax predictions, that's what we had up here, compared to our actual y_train. And so there it is.
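So the whole thing comes down to a couple of lines. Here's a sketch with made-up predictions (only the indexing trick is the point; the lecture's tensor is 50,000 by 10, this one is 3 by 10):

```python
import torch
import torch.nn.functional as F

def nll(inp, targ):
    # integer array indexing: row i is paired with column targ[i],
    # pulling out each row's log-probability of the correct class
    return -inp[range(targ.shape[0]), targ].mean()

sm_pred = torch.log_softmax(torch.randn(3, 10), dim=-1)
targ = torch.tensor([5, 0, 4])     # the first three MNIST labels
loss = nll(sm_pred, targ)
print(loss.item())
```

F.nll_loss(sm_pred, targ) should give exactly the same number.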
Now, this was our softmax formula, which is e to the x over the sum of the e to the x's. So we have that, and then it's all logged. So we've got a log of a over b. And remember, I keep telling you that one thing you want to remember from high school math is how logs work. So I do want you to try to recall that log of a over b is log of a minus log of b. And so we can rewrite this as log of e to the x minus log of all that. And of course, e to the something and log are opposites of each other, so log of e to the x is just x. So that ends up being x minus x.exp().sum().log(). Okay, so this is useful. And let's just check that that actually works. As I keep refactoring these things (and to me, these mathematical manipulations are just refactoring, right? just refactoring the math), you keep checking along the way. So we created test_near last time; let's use it to make sure that it's the same as our loss. Now, you'll see here, this is taking the log of the sum of the exp. And there's a trick called logsumexp. The reason we need this trick is that when you go e to the power of something, you can get ridiculously big numbers. And if you've done Rachel's computational linear algebra course, then you'll know that very big numbers in floating point on a computer are really inaccurate. Basically, the further you get away from zero, the less fine grained they are. It gets to the point where, like, two numbers a thousand apart, the computer thinks they're the same number. So you don't want big numbers, particularly when you calculate gradients. So anywhere we see an e to the x, we get nervous; we don't want x to be big. But it turns out that if you do this little mathematical substitution, you can actually subtract a number from your x's and add it back at the front, and you get the same answer. So what you can do is find the maximum of all of your x's.
You can subtract it from all of your x's and then add it back afterwards, outside the exp, and you get exactly the same answer. So in other words: let's find the maximum, let's subtract it from all of our x's, then let's do logsumexp, and then at the end we'll add it back again. And that gives you exactly the same number, but without this numerical problem. So when people talk about numerical stability tricks, they're talking about stuff like this, and this is a really helpful numerical stability trick. So this is how you do logsumexp in real life. We can check that this one here is the same as, and look, in fact, logsumexp is already a method in PyTorch; it's such an important and useful thing. You can just actually use PyTorch's, and you'll get the same result as the one we just wrote. So now we can use it. So log_softmax is now just x minus x.logsumexp(). And let's check: yep, still the same. So now that that's all working, we may as well just use PyTorch's log_softmax and PyTorch's nll_loss. But actually, nll_loss of log_softmax is called cross entropy. So finally, we test that F.cross_entropy is near the same as our loss, and it is, okay? So we've now recreated PyTorch's cross entropy, so we're allowed to use it according to our rules, okay? So now that we have a loss function, we can use it to train. And we may as well also define a metric, because it's nice to see accuracy to see how we're going; it's just much more interpretable. And remember from part one that the accuracy is simply: grab the argmax, to find out which of the numbers in our softmax is the highest, and the index of that is our prediction. And then check whether that's equal to the actual. And then we want to take the mean. But in PyTorch, you can't take the mean of ints; you have to take the mean of floats, which makes some sense. So turn it into a float first. So there's our accuracy. So let's just check. Let's grab a batch size of 64, and let's grab our first x batch.
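Here's a sketch of the trick: m is the max along the last axis, and since log(sum(e^x)) equals m plus log(sum(e^(x - m))), no exp ever sees a big number:

```python
import torch

def logsumexp(x):
    # subtract the max before exponentiating, add it back outside the log
    m = x.max(-1)[0]
    return m + (x - m[:, None]).exp().sum(-1).log()

def log_softmax(x):
    return x - logsumexp(x)[:, None]

x = torch.tensor([[0., 1000.]])   # a naive exp(1000.) would overflow to inf
print(logsumexp(x))               # finite, and about 1000
print(torch.logsumexp(x, -1))     # PyTorch's built-in gives the same answer
```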
This is our first playing around with mini-batches, right? So our first x batch is gonna be our training set from zero up to batch size. So our predictions — we're just gonna run our model. And remember our model was linear, ReLU, linear, so still a super simple model. So let's calculate some predictions and have a look at them. And it's 64 by 10, as you'd expect: batch size 64 and 10 possible probabilities. So now we can grab our first batch of dependent variables and calculate our loss — okay, that's 2.3 — and calculate our accuracy. And as you'd expect, it's about 10%, because we haven't trained our model. So we've got a model that's giving basically random answers. So let's train it. We need a learning rate, we need to pick a number of epochs, and we need a training loop. So our training loop, if you remember from part one — remember lesson two, SGD — looks like this: calculate your predictions, calculate your loss, do a backward, subtract learning rate times gradients, and zero the gradients. So let's do exactly the same thing. We're gonna go through each epoch, and go through i up until n, which is 50,000 — that's the number of rows — but integer-divided by batch size, because we're gonna do a batch at a time. And so then we'll grab everything starting at i times batch size, and ending at that plus batch size. So this is gonna be our ith mini-batch. And so let's grab one x mini-batch, one y mini-batch, pass that through the model and our loss function, and then do backward. And then we're gonna do our update, which remember we have to do with no_grad, because this is not part of the gradient calculation — it's the result of it. But now we can't just go a minus-equals learning rate times gradient; we have to do that for every single one of our parameters. So our model has three layers, and the ReLU has no parameters in it.
So this linear layer has a weight and bias, and this linear layer has a weight and bias. So we've basically got four tensors to deal with. So we're gonna go through all of our layers, and let's just check whether that layer has an attribute called weight or not — that's a bit more flexible than hard-coding things. And if it does, then let's update the weight with minus-equals its gradient times the learning rate, and the bias with its gradient times the learning rate, and then zero those gradients when we're done. So let's run it, and then let's check the loss function and the accuracy. And the loss has gone down from 2.3 to 0.05, and the accuracy has gone up from 0.12 to 1. Notice that this accuracy is for only a single mini-batch, and it's a mini-batch from the training set, so it doesn't mean too much. But obviously our model is learning something, so this is good. So we're now — well, we haven't really done conv yet, but we've got a basic training loop. We're now here: we have a basic training loop, which is great. So we've kind of got all the pieces. So let's try to make this simpler, because this is too much code, right? And it's too hard to fiddle around with. So the first bit we'll do is try to get rid of this mess and replace it with this. And the difference here is that rather than manually going through weight and bias for each layer, we're going to loop through something called model.parameters. So we're not even going to loop through the layers; we're just going to loop directly through model.parameters. And for each parameter, we'll say that parameter minus-equals gradient times learning rate. So somehow, we need to be able to get all of the parameters of our model. Because if we could do that, we could greatly simplify this part of the loop and also make it much more flexible. So to do that, we could create something like this. I'm just calling this DummyModule, right?
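A runnable sketch of that update step, with an illustrative model and fake data standing in for the real MNIST setup (the shapes, learning rate, and layer sizes here are my stand-ins, not the lesson's):

```python
import torch
from torch import nn

lr = 0.5
model = nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 2))
xb = torch.randn(64, 10)                    # fake mini-batch of inputs
yb = torch.randint(0, 2, (64,))             # fake labels

loss = nn.functional.cross_entropy(model(xb), yb)
loss.backward()
with torch.no_grad():                       # the update isn't part of the graph
    for l in model:
        if hasattr(l, 'weight'):            # the ReLU has no parameters, skip it
            l.weight -= l.weight.grad * lr
            l.bias   -= l.bias.grad * lr
            l.weight.grad.zero_()
            l.bias.grad.zero_()
```

The refactor described above replaces that inner loop with `for p in model.parameters(): p -= p.grad * lr`.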
And in DummyModule, what I'm going to do is say: every time I set an attribute like l1 or l2 — in this case, to a linear layer — I want to update a list called _modules with all of the modules I have. So in other words, after I create this DummyModule, I want to be able to print out its representation — the list of those modules — and see the modules that are there. Because then I can define a method called parameters that will go through everything in my _modules list, and then go through all of their parameters. And that's what I'll be able to do — see, here I can do model.parameters. So how did I create this? You see, it's not inheriting from anything, right? This is all written in pure Python. How did I make it so that as soon as I set an attribute in my __init__, it magically appeared in this _modules list, so that I could then create this parameters method, so that I could then do this refactoring? The trick is that Python has a special dunder __setattr__ method. Every time you assign to anything inside self, Python will call this method if you've got one. And so this method just checks that the key — in other words, the attribute name — doesn't start with underscore. Because if it does, it might be _modules itself, and then it's gonna be a recursive loop; and also Python's got all kinds of internal stuff that starts with underscore. So as long as it's not some internal private stuff, store that value inside my _modules dictionary under the key k. That's it. And then after you've done that, do whatever the superclass does when it sets attributes. And in this case, the superclass is object — if you don't say what it is, then it's just Python's top-level object class. So now we have something that has all of the stuff we need to do this refactoring. But the good news is PyTorch also has something that does that, and it's called nn.Module.
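A sketch of the DummyModule idea as just described (layer sizes are illustrative): `__setattr__` intercepts every attribute assignment and records non-private attributes as submodules.

```python
import torch.nn as nn

class DummyModule():
    def __init__(self, n_in, nh, n_out):
        self._modules = {}
        self.l1 = nn.Linear(n_in, nh)   # each assignment goes via __setattr__
        self.l2 = nn.Linear(nh, n_out)

    def __setattr__(self, k, v):
        if not k.startswith('_'):       # skip _modules itself and other privates
            self._modules[k] = v
        super().__setattr__(k, v)       # then do object's normal behavior

    def __repr__(self):
        return f'{self._modules}'

    def parameters(self):
        for m in self._modules.values():
            yield from m.parameters()

mdl = DummyModule(784, 50, 10)
```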
So we can do the exact same thing: rather than implementing that __setattr__ stuff ourselves, we can just inherit from nn.Module, and it does it for us. And so this is why you have to call super().__init__() first, right? Because it has to set up its equivalent of this _modules dictionary. And then after you've done that, in PyTorch it's exactly the same as what I just showed you. It now creates something which you can access through named_children. And you can see here, if I print out the name and the layer, there is the name and the layer. So this is how PyTorch does the exact same thing. Just like I created a dunder __repr__, PyTorch also has a dunder __repr__, so if you just print out the model, it prints it out like so. You can grab the attributes in the normal Pythonic way. It's just a normal Python class that has a bit of this extra behavior. So now we can run it with this refactoring, make sure everything works — and there we go, okay? So this is doing exactly the same thing as before, but a little bit more conveniently. Not convenient enough for my liking, though. So one thing we could try to do is get rid of the need to write every layer separately — maybe go back to having them as a list again. So if we made it a list of layers, we then want to be able to pass that to some model class — pass in the layers. But this is not enough to make them available as parameters, right? Because the only things that PyTorch is going to make available as parameters are things that it knows are proper nn.Modules. But here's the cool thing: you can just go through all of those layers and call self.add_module. That's just the equivalent of what I did when I said self._modules[k] = v. So in PyTorch, you can just call self.add_module, and just like I did, you give it a name and then pass in the layer.
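The nn.Module version of the same idea might look like this (layer names and sizes are illustrative): super().__init__() sets up the registration machinery, after which ordinary attribute assignment registers each layer.

```python
import torch
from torch import nn

class Model(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()               # must come first: sets up _modules etc.
        self.l1 = nn.Linear(n_in, nh)
        self.relu = nn.ReLU()
        self.l2 = nn.Linear(nh, n_out)

    def forward(self, x):
        return self.l2(self.relu(self.l1(x)))

model = Model(784, 50, 10)
for name, layer in model.named_children():   # the registered submodules
    print(name, layer)
```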
And if you do that, then you end up with the same thing, okay? So that's one thing you can do. But this is kind of clunky, so it'd be nice if PyTorch would do it for you — and it does. That's what nn.ModuleList does. If you use an nn.ModuleList, it basically calls that line of code for you. So you can see us doing it here: we've created something called SequentialModel, which just sets self.layers to that module list. And then when we call it, it just goes through each layer — x equals that layer applied to x — and returns it. And there it is, okay? Even this is a little bit on the clunky side. Why would we have to write it ourselves? We don't — PyTorch has that code already. It's called nn.Sequential. So we've now recreated nn.Sequential, and there it is doing the same thing. So again, we're not creating dumbed-down versions. If you look at nn.Sequential's source code and you look at forward — it's even the same idea: go through each module in self._modules.values(), input = module(input), return input. So that's their version, and remember, our version was basically the same. And they even put it in something called _modules. So yeah, that's all nn.Sequential is doing for you. Okay, so we're making some progress. It's less ugly than it used to be — still more ugly than we would like. This is where we got our fit function up to, so let's try and simplify it a bit more. Let's replace all this torch.no_grad, for p in model.parameters, blah, blah, blah, with something where we can just write those two lines of code. That would be nice. So let's create a class called Optimizer. We're going to pass in some parameters and store them away, and we're going to pass in the learning rate and store that away. And if we're going to be able to go opt.step, then opt.step has to do this. So here is step: with torch.no_grad.
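A sketch of the SequentialModel just described (layer sizes are illustrative): nn.ModuleList registers each layer in the list, so their parameters appear in model.parameters(), and nn.Sequential is the built-in equivalent.

```python
import torch
from torch import nn

class SequentialModel(nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)   # registers every layer in the list

    def forward(self, x):
        for l in self.layers:
            x = l(x)
        return x

layers = [nn.Linear(784, 50), nn.ReLU(), nn.Linear(50, 10)]
model = SequentialModel(layers)
model2 = nn.Sequential(*layers)   # PyTorch's version, sharing the same layer objects
x = torch.randn(2, 784)
```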
Go through each parameter and do that, okay? So let's just factor that out. And for zero_grad — we probably shouldn't actually go model.zero_grad, because it's actually possible for the user, as you know, to say: I don't want to include certain parameters in my optimizer — when we're doing gradual unfreezing and stuff like that. So really, zero_grad should go through the list of parameters that you asked it to optimize, and zero those gradients. So here we've now created something called Optimizer. And we can now grab our model — remember the model now has something called .parameters, so we can pass that to our optimizer — and then we can just go opt.step, opt.zero_grad. And let's test it — and it works, okay? Now, of course, luckily for us, PyTorch already has these lines of code: it's called optim.SGD. Now, optim.SGD does do a few more things — weight decay, momentum, stuff like that. So let's have a look. Here's optim.SGD, and here's its step function. It's got weight decay, momentum, dampening, Nesterov — we're gonna see all these things very shortly. But basically all it does is go through each parameter group and do the exact thing that we just saw. And we're gonna be implementing this in a much better way than PyTorch very soon. So once you remove the momentum and all that, their optim.SGD is exactly the same as our Optimizer. So let's go ahead and use that instead. And it's kind of nice then, if we're gonna use all the parameters of the model, to just create a get_model function, which creates a model and returns it along with an SGD optimizer over all its parameters. And okay, there's our training loop, and it seems to be working. It's nice to put tests in from time to time. And I like to put in tests like: hey, my accuracy should be significantly better than 50%. Note that these kinds of stochastic tests are highly imperfect in many ways.
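A sketch of the Optimizer class as described (default lr is my illustrative choice): step() does the SGD update under no_grad, and zero_grad() zeroes only the parameters it was asked to optimize.

```python
import torch

class Optimizer():
    def __init__(self, params, lr=0.5):
        self.params, self.lr = list(params), lr

    def step(self):
        with torch.no_grad():            # the update isn't part of the graph
            for p in self.params:
                p -= p.grad * self.lr

    def zero_grad(self):
        for p in self.params:            # only the params we were given
            p.grad.zero_()

model = torch.nn.Linear(10, 2)
opt = Optimizer(model.parameters(), lr=0.1)
loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()
opt.step()
opt.zero_grad()
```

`torch.optim.SGD(model.parameters(), lr=0.1)` is the drop-in PyTorch equivalent once you strip away momentum and friends.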
It's theoretically possible it could fail because you got really unlucky, though I know that's vanishingly unlikely to happen — it's always much more than 90%. It's also possible that your code could be failing in a way that causes the accuracy to be a bit lower than it should be, but not this low. But I still think it's a great idea to have these kinds of tests when you're doing machine learning, because they give you a hint when something's going wrong. And you'll notice I don't set a random seed at any point. This is very intentional. If there's variation going on when I run my model at different times, I want to see it — I don't want it hidden away behind a fixed seed. There's a big push in science for reproducible science, which is great for many reasons, but it's not how you should develop your models. When you're developing your models, you want to have a good intuitive sense of what bits are stable and what bits are unstable, and how much variation to expect. And so if you have a test which fails one in every 100 times, it's good to know that. And so in the fastai code, there are lots of tests like that. And so sometimes there'll be a test that fails, and it's nothing particularly to do with the push that just happened. But it's really helpful for us, because then we can look at it and be like: this thing we thought should pretty much always be true sometimes isn't true. And then we'll go back and deeply study why that is, and figure out how to make it more stable and how to make it reliably pass that test. So this is a kind of controversial test to have, but it's something that I've found in practice is very, very helpful. It's not complete, it's not totally automated, and it's imperfect in many ways, but it's nonetheless helpful. Okay, let's get rid of these two lines of code.
These were the lines of code that grabbed our x mini-batch and our y mini-batch from the training set. Let's do them both in one line of code. So it'd be nice to have one line of code where we have some kind of object where we can pass in the indexes we want and get back both x and y. And that's called a dataset, as you know. So here's our Dataset class. Again, not inheriting from anything — it's all from scratch, pure Python. We initialize it by passing in the x and the y, and we store them away. It's very handy to have a length. Hopefully you know by now — and if you don't, now's a good time to learn — that dunder __len__ is the thing that lets you go len(something) in Python and have it work; that's what len will call. So now we've got the length of our dataset. And dunder __getitem__ is the thing that, when you index into it, returns that item. And so we just return the tuple of x[i] and y[i]. So let's go ahead and create a dataset for our training set and our validation set, check that the lengths are right, check the first few values, make sure they all seem sane. Now we'll grab our model, and as I said, we'll replace those two lines of code with one. And so at this point, our training loop is getting quite neat. It's not as neat as it could be, but it's getting quite neat. Okay, so that's a dataset. The next thing we're going to do is create a data loader. This is what the start of our training loop looked like before, and let's replace it with this single line of code. So to do that, we're going to have a class that takes a dataset and a batch size and stores them away. And when you go for blah in blah, behind the scenes, Python calls dunder __iter__. And so what we're going to do is loop through a range from zero up to the size of the dataset, jumping by batch size each time. So zero, 64, 128, et cetera, up to 50,000.
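The Dataset class as described — pure Python, nothing but __len__ and __getitem__. Toy lists stand in here for the real x_train/y_train tensors (tensor indexing works the same way):

```python
class Dataset():
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __len__(self):
        return len(self.x)            # what len(ds) calls

    def __getitem__(self, i):
        return self.x[i], self.y[i]   # what ds[i] calls: the (x, y) tuple

train_ds = Dataset([10, 20, 30], ['a', 'b', 'c'])
```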
And each time we go through, we will yield our dataset at an index starting at i and ending at i plus self.bs. Probably quite a lot of you haven't seen yield before. It's an incredibly useful concept — if you're really interested, it's related to something called a coroutine. It's basically this really interesting idea that you can have a function that doesn't return just one thing once, but can return lots of things, and you can ask it for them lots of times. The way these iterators work in Python is that when you call this, it returns something which you can then call next on lots of times, and each time you call next, it returns the next thing that is yielded. I don't have time to explain coroutines in detail here, but it's really worth looking up and learning about — we'll be using them lots. They're a super valuable thing, and not just for data science; they're really handy for things like network programming and web apps as well. So it's well worth being familiar with yield in Python. And nowadays, most programming languages have something like this, so you'll be able to take it wherever you go. So now we have a data loader. We can create a training one and a validation one, and this is how we use it: iter(valid_dl) is the thing that basically generates our coroutine for us, and then next is the thing that grabs the next thing yielded out of that coroutine. So this is a very common thing you'll be doing lots: next(iter(blah)). You probably did it a whole lot of times in part one, because we kind of did it without diving very deeply into what's going on. And that returns one thing from our dataset. And the dataset returns two things, because that's what we put in it, so we expect to get two things back. And we can check that those two things are the right size. So that's our data loader. And let's double check — there it is, good stuff. So now, there's our fit function.
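The first DataLoader as described: __iter__ yields one batch at a time, so batches are produced lazily rather than materialized up front (the toy dataset here is just a list, standing in for the Dataset class):

```python
class DataLoader():
    def __init__(self, ds, bs):
        self.ds, self.bs = ds, bs

    def __iter__(self):
        # calling iter() on this gives a generator; each next() yields a batch
        for i in range(0, len(self.ds), self.bs):
            yield self.ds[i:i + self.bs]

ds = list(range(10))           # any indexable object works as a toy dataset
dl = DataLoader(ds, bs=4)
xb = next(iter(dl))            # the next(iter(...)) pattern from part one
```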
Let's call the fit function — looking good. So this is about as neat as we're gonna get, and that's quite beautiful, right? It's all the steps you'd think of if you said it in English: go through each epoch, go through each batch, grab the independent variable, calculate the predictions, calculate the loss, calculate the gradients, update with the learning rate, reset the gradients. And that's where you want to get to — a point where you can read your code in a very intuitive way to a domain expert. Until you get to that point, I find it's very hard to really maintain the code and understand the code. And this is the trick for doing research as well — this is not just for hardcore software engineering. A researcher that can't do those things to their code can't do research properly, right? Because if you think of something you wanna try, you don't know how to do it, or it takes weeks, or there are bugs in it you don't know about. So you want your code to be quite beautiful, and I think this is beautiful code. And at this point, this Dataset and this DataLoader are the same abstractions that PyTorch uses. So let's dig into this a little bit more. We do have a problem, which is that we're always looping through our training set in order, and that's very problematic, because we lose the randomness of shuffling it each time. Particularly if our training set was already ordered by dependent variable — then every batch is gonna contain exactly the same dependent variable. So we really wanna shuffle it. So let's try random sampling. For random sampling, I'm gonna create a Sampler class, and we're gonna pass into it a dataset to sample, a batch size, and something that says whether to shuffle or not. And as per usual, we just store those away. I don't actually store away the dataset — I just store away the length of the dataset, so that we know how many items to sample.
Okay, and then here's our dunder __iter__ — remember, this is the thing that we can call next on lots of times. And so if we are shuffling, then let's grab a random permutation of the numbers from 0 to n-1; and if we're not shuffling, let's grab all of the integers in order from 0 to n-1. And then — this is the same as we had before — go through that range and yield the indexes. So what does that look like? Here's a sampler with shuffle=False and a batch size of 3: 0, 1, 2; 3, 4, 5; 6, 7, 8; 9. And here it is with shuffle=True: 5, 4, 3; 7, 6, 2; 8, 9, 0; 1. So now that we've got these, we can replace our data loader with something where we pass it a sampler, and we then loop through: for s in sampler. So it's going to loop through each of these, right? And the cool thing is that because we used yield, these are only going to be calculated when we ask for them — they're not all calculated up front. So we can use this on really big datasets, no problem. And this is a common pattern: you're looping through something which is itself a coroutine, and then yielding something which does other things to it. So this is a really nice way of doing streaming computation — it's being done lazily, you're not going to run out of memory; it's a really neat way to do things. And so then we're going to grab all of the indexes in that sample, and grab the dataset at each index. So now we've got a list of tensors, and we need some way to collate them all together into a single pair of tensors. So we've created a function called collate, which just grabs the x's and the y's and stacks them up — torch.stack just grabs a bunch of tensors and glues them together along a new axis. You might want to do different things, like add some padding or stuff like that, so you can pass in a different collate function if you want, and it'll store it away and use it. So now we can create our two samplers.
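A sketch of the Sampler plus collate-based DataLoader just described (the tiny dataset of (x, y) tensor pairs is my illustrative stand-in):

```python
import torch

class Sampler():
    def __init__(self, ds, bs, shuffle=False):
        self.n, self.bs, self.shuffle = len(ds), bs, shuffle   # only the length, not the ds

    def __iter__(self):
        idxs = torch.randperm(self.n) if self.shuffle else torch.arange(self.n)
        for i in range(0, self.n, self.bs):
            yield idxs[i:i + self.bs]          # one batch of indexes at a time

def collate(b):
    xs, ys = zip(*b)
    return torch.stack(xs), torch.stack(ys)    # glue tensors along a new axis

class DataLoader():
    def __init__(self, ds, sampler, collate_fn=collate):
        self.ds, self.sampler, self.collate_fn = ds, sampler, collate_fn

    def __iter__(self):
        for s in self.sampler:                 # lazily consume the sampler
            yield self.collate_fn([self.ds[int(i)] for i in s])

ds = [(torch.tensor([float(i)]), torch.tensor(i)) for i in range(10)]
dl = DataLoader(ds, Sampler(ds, bs=3, shuffle=False))
xb, yb = next(iter(dl))
```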
We can create our two data loaders with those samplers — the training one shuffling, the validation one not. So let's check. There's the validation data loader. And for the training data loader, if we call it twice with exactly the same index, we get different things — in this case, we've got two 8s, but they're different 8s. Call it another two times, and we're getting different numbers. Okay, so it is shuffling, as we hoped. And so again, we can train our model, and that's fine. So the PyTorch DataLoader does exactly that. So let's import the PyTorch DataLoader — and you can see, it takes exactly the same arguments. We can even pass in the exact collate function that we just wrote. It doesn't have a single sampler that you pass shuffle=True or False to; it has two samplers, one called random and one called sequential. So it's slightly different to the API we just wrote, but it does exactly the same thing. And so you can create those data loaders, and it works exactly the same. So that's what a PyTorch DataLoader does. Most of the time, you don't need the flexibility of writing your own sampler and your own collation function, so you can just pass in shuffle, and it will use the default sampler and collation function, which work the way we just showed. Something that we did not implement from PyTorch's DataLoader is that you can pass in an extra parameter called num_workers, and that will fire off that many processes. Each one will separately grab stuff out of your dataset, and then it will collect them together afterwards. So if your dataset is doing things like opening big JPEG files and doing all kinds of image transformations, that's a really good idea. But we won't implement that. All right, so finally for this section, we should add validation. To know if we're overfitting, we need to have a separate validation set. So here's the same loop that we had before.
And here's pretty much the same loop again, but with torch.no_grad, going through the validation set. For this, we grab the predictions and the loss as before, but we don't call backward and we don't step the optimizer, because it's just validation. Instead, we just keep track of the loss, and we also keep track of the accuracy. The only other difference is that we've added model.train here and model.eval here. What does that do? Well, actually, all it does is set an internal attribute called .training to true or false. So let's try it. If I put print(model.training) after each one and train this, it prints true, false, true, false, true, false. So why does it set this thing called model.training? Because some kinds of layers need to have different behavior depending on whether you're training or evaluating. For example, BatchNorm only updates its running statistics if it's training, and dropout only does randomized dropout if it's training — those are the two main ones. So that's why you always want to call train and eval. And if you forget to put your model into eval mode when you're done training, you'll often be surprised, because you'll be getting worse results than you expected. Okay, so that's our fit loop. One thing to note: are these validation results correct if the batch size varies? Because what we're doing here is adding up the loss and adding up the accuracy, and then at the end, we see how big our data loader is — how many batches there are — and divide by that. But if you think about it, if you had one mini-batch of size 1,000 and one mini-batch of size 1, you can't just do that, right? You actually need a weighted average, weighted by the size of each mini-batch. This incorrect way is how nearly every library does it. Fastai does it the proper way, and next time we do this, we're going to do it the proper way.
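A runnable sketch of that fit loop with validation (the model, data, and stand-in data loaders here are illustrative). Note it averages by batch count, which, as just discussed, is only correct when all batches are the same size:

```python
import torch
import torch.nn.functional as F

def accuracy(out, yb):
    return (out.argmax(dim=-1) == yb).float().mean()

def fit(epochs, model, loss_func, opt, train_dl, valid_dl):
    for epoch in range(epochs):
        model.train()                        # sets model.training = True
        for xb, yb in train_dl:
            loss = loss_func(model(xb), yb)
            loss.backward()
            opt.step()
            opt.zero_grad()
        model.eval()                         # sets model.training = False
        with torch.no_grad():                # no backward, no optimizer step
            tot_loss = tot_acc = 0.
            for xb, yb in valid_dl:
                pred = model(xb)
                tot_loss += loss_func(pred, yb)
                tot_acc += accuracy(pred, yb)
        nv = len(valid_dl)                   # dividing by batch count: only
        print(epoch, tot_loss / nv, tot_acc / nv)  # right for equal-size batches

# a plain list of (xb, yb) pairs works as a stand-in data loader
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 4), torch.randint(0, 2, (32,))
dl = [(x[i:i + 8], y[i:i + 8]) for i in range(0, 32, 8)]
fit(2, model, F.cross_entropy, opt, dl, dl)
```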
Okay, but for now, here's what most people do — and it does not work correctly when your batch size varies. So it's handy to have something where we can basically pass in a training dataset, a validation dataset, and a batch size, and just grab the data loaders. The training dataset will be shuffled; the validation one won't be. Also, for the validation dataset, we don't need to do the backward pass, so we don't need to store the gradients. That means we have twice as much room, so we can make it twice the batch size. So it's another nice thing to refactor out — you don't have to type it anymore, and it also means you won't accidentally make a mistake. And so now we can go ahead and fit, and let's do five epochs — and now these are actual validation accuracies. Okay, great. So we've successfully built a training loop. Let's have a six-minute break — come back at 7.55 and talk about callbacks. Before we continue, Rachel, any questions? Okay: why do we have to zero out our gradients in PyTorch? So, let's go back to — here's our optimizer, right? Or let's go back even further. Here's our first version, with no additional help from PyTorch at all. If we didn't zero the gradients here, then the next time we go through and call loss.backward, it's going to add the new gradients to those existing gradients. Now, why does that happen? Well, it happens because we often have lots of sources of gradients — there are lots of different modules all connected together, so they're getting their gradients from lots of different places, and those all have to be added up. So when we call backward, we wouldn't want backward to zero the gradients, because then we would lose this ability to plug lots of things together and have them just work. So that's why we need the zeroing here.
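A sketch of that get_dls refactor, using PyTorch's DataLoader (the datasets here are illustrative TensorDatasets): shuffle only the training set, and double the batch size for validation since no gradients are stored there.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

def get_dls(train_ds, valid_ds, bs, **kwargs):
    return (DataLoader(train_ds, batch_size=bs, shuffle=True, **kwargs),
            DataLoader(valid_ds, batch_size=bs * 2, shuffle=False, **kwargs))

train_ds = TensorDataset(torch.randn(100, 4), torch.randint(0, 2, (100,)))
valid_ds = TensorDataset(torch.randn(64, 4), torch.randint(0, 2, (64,)))
train_dl, valid_dl = get_dls(train_ds, valid_ds, bs=16)
xb, yb = next(iter(valid_dl))   # validation batches come out at size 32
```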
So that's part one of the answer. Part two of the answer is: why did we write our optimizer so there was one thing called step and one thing called zero_grad? Because what we could have done is remove these lines, push this up here, and have step do both. And then, since we'd actually have this loop twice, we could put it all inside one for loop. So we could certainly have written our optimizer like this: go through each parameter, do the update, and set the gradient to zero — and then we'd be able to remove this line. The problem with that is that we'd then remove the ability to not zero the gradients here. And that means any time we don't want to zero the gradients, we can't use the optimizer. For example, what if you're working with some pretty big objects? Say you're doing super-resolution and trying to create a 2K output — your batch size might be such that you can only fit two images on the GPU at a time. And the stability of the gradients you get from a batch size of two is so poor that you need a larger batch size. Well, that would be really easy to do if you did it like this, right? Because we could say: if i mod 2, then — and so this is now going to only run these things every two iterations. And so that means our effective batch size is now doubled. So that's handy. That's called gradient accumulation: you change your training loop so that your optimizer step and your zero_grad only happen occasionally. So that's really the reason: there might be times you don't want to zero the gradients every time you do a step, and if there's no way to do that, that's a problem. You could argue — well, I can't think of a reason this isn't a good idea — we could make our optimizer take something like auto_zero=True, say.
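A sketch of gradient accumulation as just described (model, data, and the accumulation factor are illustrative): step and zero_grad only run every `accum` iterations, so gradients from consecutive mini-batches add up before each update, doubling the effective batch size here.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum = 2                                # effective batch size = accum * bs

batches = [(torch.randn(4, 10), torch.randint(0, 2, (4,))) for _ in range(8)]
for i, (xb, yb) in enumerate(batches):
    loss = F.cross_entropy(model(xb), yb)
    loss.backward()                      # backward *adds into* existing .grad
    if i % accum == accum - 1:           # only every `accum` batches...
        opt.step()                       # ...update with the summed gradients
        opt.zero_grad()
```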
And then we could have something in step which says: if self.auto_zero, then self.zero_grad — something like that. And that could even be the default, and then you wouldn't have to worry about it unless you explicitly wanted to do gradient accumulation. I think that would really be a better API design, maybe. But that's not what they've done. It's so easy to write your own optimizers, though, that you could totally do that. The upside is only removing a single line of code, which isn't a huge upside anyway. Any other questions, Rachel? Okay, so that's our training loop. But it's not quite where we want it to be. And I'm stealing some slides here from Sylvain, who had a really cool talk recently called An Infinitely Customizable Training Loop — so I'll steal his slides. Before I do, I'd like to say a big thank you to Sylvain. He has been working full time with fast.ai for well over a year now, I guess, and a huge amount of what you see in the fastai library, research, and courses is him. So a massive thank you to Sylvain, who's the most awesome person I've worked with in my whole life. That's pretty cool. But also thank you to lots of other people. A huge thanks to Stas, who a lot of you will have come across on the forum. He's done a lot of the stuff that makes fastai work well — the stuff that lets you check whether your installation works properly, and quickly check whether your performance is what it should be — and he's entirely a volunteer, so I'm super grateful to him. He's also been fantastic at organizing lots of helpful projects through the forums. Lots of other folks as well: Andrew Shaw wrote a lot of the original documentation tooling that we have, and Fred Monroe has been helpful in thousands of ways and is just incredibly generous.
Jason, who a lot of you will already be aware of, helped a lot with the final lesson of the last course and is hard at work now taking it even further, doing some stuff that's going to blow you away. I particularly wanted to point out Radek, because this is the list of the 20 or so most helpful people on the forum, as ranked by number of likes. When somebody clicks that like button, they're saying: you've helped me — and more people have said that about Radek than anybody else. And it's not surprising, because Radek is not just an incredibly helpful person but extremely thoughtful. When he started as a fast.ai student, he considered himself, if I remember correctly, basically a failed ML student — he'd tried a number of times to learn ML and hadn't succeeded. But he's applied himself so well for the last couple of years, and he's now a Kaggle winner and a world-recognized deep learning practitioner. So thank you to all of these people and everybody else who's contributed in so many ways. And of course, Rachel, who's sitting right next to me. So this is the fit function that we just wrote — or the slightly more elegant one, before we added validation to it. Go through each epoch, go through each mini-batch, get the predictions, the loss, the backward pass, update your parameters, and then zero the gradients. So that's basically what we're doing: model, predictions, loss, gradients, step — and each time we grab a bit more training data. But that's not really all we want to do in a training loop. We might want to add the beautiful progress bars and animations that Sylvain created in his fastprogress library. Or TensorBoard, or whatever — and thanks to Jason, actually, we now have TensorBoard integration in fastai. So be sure to check that out if you want extra pretty graphs like these.
Hyperparameter scheduling. You might want to add all kinds of different regularization techniques; these are all examples of regularization techniques that we support in FastAI, and there are many more. Mixed precision training, which also takes advantage of the tensor cores in a Volta GPU to train much faster. There are more tweaks you might want to make to the training loop than we could possibly think of, and even if we did think of all the ones that exist now, somebody will come up with a new one tomorrow. Some of the things we're talking about are even things like how do you add GANs, more complex stuff. So there are some possible ways you could solve this problem. One approach is to write a separate training loop for every possible way you want to train. That's particularly problematic when you want to combine multiple different tweaks, because you end up cutting and pasting. So that's certainly not going to work for FastAI. Then there's what I tried for FastAI 0.7. This is my training loop from the last time I tried this, which was: throw in every damn thing. And every time somebody would say, "oh, a new paper's come out, can you please implement it?", I'd just be like, no, I couldn't bear it. So now we have something better: callbacks. Every library has callbacks, but nobody else's callbacks are anything like our callbacks, and you'll see what I mean. Callbacks let you not only look at but fully customize every one of these steps. So here's our starting training loop, and here's the FastAI version one training loop. It's the same: the exact same lines of code, plus a bunch of calls to callbacks. Each one basically says: before I do a step, on step begin; after I do a step, on step end; after I do a batch, on batch end; after I do an epoch, on epoch end; after I finish training, on train end. 
The callbacks also have the ability to change things, or even to say "please skip the next step" by returning a Boolean. So with this we can create, and have created, all kinds of things in FastAI, like learning rate schedulers and early stopping and a parallel trainer. When I wrote the parallel trainer, this was literally the entire callback I wrote. And this is the entire gradient clipping callback: after you do the backward pass, clip the gradients. So you can do a lot with a little. And then you can mix them all together, because all of the callbacks work with all of the other callbacks. These are some of the callbacks that we have in FastAI right now. So for example, how did we do GANs last course? What we did behind the scenes was we created a GAN module, and it was ridiculously simple. It had a forward method that just asked: are you in generator mode or not, where not means discriminator mode? If you're in generator mode, call the generator, otherwise call the critic. And there was a function called switch that just toggled generator mode backwards and forwards between generator and discriminator. Same thing for the loss function: there was a generator loss and a critic loss. And then we created a callback which had a switch that turned generator mode on and off and passed that along to the model I just showed you and the loss function I just showed you. It would then set requires_grad on the generator or discriminator as appropriate, and would have on train begin, on train end, on batch, and so on: callbacks to do the right thing at the right time. Most importantly, at the start of an epoch, set your generator mode, and at the end of training, set your generator mode. 
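To make "the entire callback" point concrete, here is a hedged, framework-free sketch of how small a useful callback can be. The real fastai gradient clipping callback calls PyTorch's clipping utility; here the class name, the `after_backward` hook, and the dict-style "parameters" are illustrative stand-ins so the idea is visible without a framework:

```python
# Sketch only: after the backward pass, clip each gradient to a maximum
# absolute value. Real versions clip by norm via the framework's utility.
class GradClipCallback:
    def __init__(self, clip=1.0):
        self.clip = clip

    def after_backward(self, params):
        for p in params:
            # clamp the gradient into [-clip, clip]
            p["grad"] = max(-self.clip, min(self.clip, p["grad"]))

cb = GradClipCallback(clip=0.5)
params = [{"grad": 2.0}, {"grad": -3.0}, {"grad": 0.1}]
cb.after_backward(params)
# gradients are now 0.5, -0.5, 0.1
```

The whole behavior lives in one small method attached to one event, which is the "do a lot with a little" property being described.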
If you look at other libraries' implementations of GANs, they're basically a whole new training loop, whole new data loaders, whole new everything. So it's really cool that in FastAI we were able to create a GAN in this incredibly small amount of code for such a complex task. So let's do that ourselves. We've got a training loop; if we add callbacks, we should then be able to do everything. Let's start out by grabbing our data as before. The number of hidden units is 50, batch size is 64, and the loss function is cross entropy. This is the signature of our fit function from before, and I get very nervous when I see functions with lots of things being passed to them. It makes me think: do we really need to pass all those things, or can some of them be packaged up together? There's a lot of benefit to packaging things up. When you can package up things that belong together, you can pass them around to everything that needs them as one unit, you can create them with factory methods that build them together, and you can do smart things like look at the combination of them and make smart decisions for your users, rather than making users set everything themselves. So I'm happy to keep epochs as an argument, but I'd like to put all these other things into a single object. Specifically, we can do that in two steps. First of all, the training and validation data conceptually should be one thing: it's my data (maybe there's test data there as well). So let's create a class called DataBunch, which we pass training data and validation data, and which just stores them away. And that's the entirety of it; there's no logic here. But for convenience, let's also make it easy to grab the datasets out of the data loaders. 
And remember, you can either use the handmade data loader that we built last time or the PyTorch data loader; they both provide exactly the same API at this point, except for the num_workers issue. Remember that we pass these data loaders a dataset, which you can access. Then it would be nice if we could create a get_model function which creates our model but automatically sets the last layer to have the correct number of activations, because the data knows how many activations it needs. So let's also optionally let you pass in c, which gets stored away, so that when we create our data we can pass in c, which remember we set from our maximum y value. That way we never have to think about it again. So that's our DataBunch class. And there's our get_model: it creates a model where the number of inputs is the size of the input data, the number of hidden units is whatever we pass in, then a ReLU, then a linear layer from hidden to data.c, and it returns the model and an optimizer. And we all know all about .parameters now. Then for the rest of the stuff, model, loss_func, opt, and data, let's store them in something: model, opt, loss_func, data, and we'll just store them away. That thing we'll call a Learner. Notice our Learner class has no logic at all; it's just a storage device for those four things. So now we can create a Learner, passing in the model and the optimizer. Since they're returned in that order from get_model, we can just say *get_model(...), which passes in the model and the optimizer. We've got our loss function already from the top, where we set it to cross entropy, and we've got our data, because it's the DataBunch we just created. So there's nothing magic going on with DataBunches and Learners; they're just wrappers for the information that we need. 
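The two storage classes just described can be sketched in a few lines. This is a minimal, framework-free version (the `FakeDL` stand-in loader is made up for the sketch; any object with a `.dataset` attribute would do, and the real notebook builds a model and optimizer too):

```python
# DataBunch: pure storage for the train/valid loaders plus c, the number
# of output activations, with convenience properties for the datasets.
class DataBunch():
    def __init__(self, train_dl, valid_dl, c=None):
        self.train_dl, self.valid_dl, self.c = train_dl, valid_dl, c

    @property
    def train_ds(self): return self.train_dl.dataset

    @property
    def valid_ds(self): return self.valid_dl.dataset

# Learner: no logic at all, just the four things fit() needs.
class Learner():
    def __init__(self, model, opt, loss_func, data):
        self.model, self.opt = model, opt
        self.loss_func, self.data = loss_func, data

# Tiny stand-in loader to show the dataset pass-through.
class FakeDL:
    def __init__(self, dataset): self.dataset = dataset

data = DataBunch(FakeDL([1, 2, 3]), FakeDL([4, 5]), c=10)
learn = Learner("model", "opt", "loss_func", data)
```

Neither class does anything clever; the point is that fit() can now take one learner instead of five separate arguments.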
So now we'll take the fit function we had before, and I just pasted it here, but every time I had model, I replaced it with learn.model; every time I had data, I replaced it with learn.data; and so forth. So it's the exact same thing that we had before, still working fine. And now let's add callbacks. Our fit function before basically said: for epoch in range(epochs), for batch in train_dl, and then it had these contents: predictions, loss, backward, step, zero grad. I factored out the contents into something called one_batch, and then I added all these callbacks: cb.after_backward, cb.after_step, and so on. I did one other refactoring, which is that the training loop has to loop through every batch and the validation loop has to loop through every batch, so I created something called all_batches. So this is my fit loop: begin fit; for epoch in epochs, begin epoch; all_batches with the training set; begin validate; with no_grad, all_batches with the validation set; after epoch; after fit. So that's that. Then here's a Callback class which has all the stubs. And then we need a CallbackHandler, which is something where you just say, here are all my callbacks, and for each event it goes through every callback and calls it, keeping track of whether we've received a False yet or not. False means don't keep going any more, and then it returns that. We do that for begin_fit, after_fit, begin_epoch, begin_validate, after_epoch, begin_batch, after_loss, after_backward, and after_step. So here's an example of a little callback we could create. At the start of the fit, it sets the number of iterations to zero; after every step, it says number of iterations plus equals one and prints that out; and if we get past ten iterations, it tells the learner to stop, because we have this little thing called do_stop that gets checked at the end. So let's test it. 
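The control flow just described can be sketched without any framework at all. This is a stripped-down approximation: the real notebook computes predictions and losses inside one_batch, whereas here the per-batch work is stubbed out (and the handler only implements two of the events) so the stop-after-ten mechanism is easy to see:

```python
# Handler: calls every callback for an event; a callback returning False
# means "don't keep going".
class CallbackHandler():
    def __init__(self, cbs=None):
        self.cbs = cbs or []

    def begin_fit(self, learn):
        self.learn = learn
        return all(cb.begin_fit(learn) for cb in self.cbs)

    def after_step(self):
        return all(cb.after_step() for cb in self.cbs)

# A callback that stops training after ten iterations.
class TestCallback():
    def begin_fit(self, learn):
        self.n_iters = 0
        return True

    def after_step(self):
        self.n_iters += 1
        return self.n_iters < 10   # False once we pass ten steps

def fit(epochs, learn, cb):
    if not cb.begin_fit(learn): return 0
    done = 0
    for epoch in range(epochs):
        for batch in learn.data:       # per-batch work stubbed out
            done += 1
            if not cb.after_step(): return done
    return done

learn = type("Learn", (), {"data": range(100)})()
n = fit(1, learn, CallbackHandler([TestCallback()]))
# n == 10: the loop stopped after ten batches, not the full hundred
```

Even with a hundred "batches" available, the callback cuts the run short, which is exactly the handy quick-sanity-check behavior mentioned below.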
There we go: it called fit, and it only did ten batches. This is actually a really handy callback, because quite often you want to just run a few batches to make sure things seem to be working; you don't want to run a whole epoch. So here's a quick way to do something like that. This is basically what FastAI v1 looks like right now. It does have a little bit of extra stuff that lets you pass back a different loss and different data, but it's nearly exactly the same. But I really like rewriting stuff, because when I rewrite stuff, it lets me look at what I've written. And when I looked back at this, I saw cb, cb, cb, cb. There's this object, the callback handler, being passed everywhere, and that's a code smell. That code smell says something should own that state; specifically, these three functions should be methods of something that has this state. So after I wrote this part of the lesson, I suddenly realized: oh, FastAI is doing it the dumb way. So let's fix it, and this is likely to appear in a future version of FastAI. I created a new class called Runner, which contains the three things I just mentioned: one_batch, all_batches, and fit. So here's fit, and it's incredibly simple. We keep track of how many epochs we're doing and of the learner we're running (and remember, the learner has no logic in it; it stores four things). Then we tell each of our callbacks what runner they're currently working with, and then we call begin fit. Then we go through each epoch, set the epoch, call begin epoch, call all_batches; then with no_grad, call begin validate and call all_batches again; then call after epoch; and finally call after fit. That's it. Now, this self('begin_fit') style of call might look a bit weird, but look at what we had before. 
Again, a horrible code smell is lots of duplicate code: res equals True, for cb in callbacks, blah blah blah, begin epoch; res equals True, for cb in callbacks, blah blah blah, begin validate. That's bad. Code duplication means cognitive overhead to understand what's going on, lots of opportunities to accidentally have an "or" instead of an "and", and lots of places you have to change if you need to edit something. So I factored it out into dunder call (__call__). We've seen __call__ before: it's the thing that lets you treat an object as if it were a function. I could have called this lots of things; I could have made it self.run_callback or whatever. But it's the thing that happens absolutely everywhere, and my rule of thumb is: if you do something lots of times, make it small. __call__ is the smallest possible way you can call something; you don't have to give it a name at all when you call it. So we say: call the callback called after epoch. It also makes sense semantically: we're calling a callback, so why not use __call__ to call a callback? So for after epoch, I go through all of my callbacks (I'll talk about this sorted in a moment). The other thing I didn't like before is that all of my callbacks had to inherit from the Callback superclass, because if they didn't, they would have been missing one of these methods, and when something tried to call that method, there would have been an exception. I don't like forcing people to inherit from something; they should be able to do whatever they like. So what we did here was use getattr, the Python built-in which says: look inside this object and try to find something of this name, e.g. begin_validate, and default to None if you can't find it. So it tries to find that callback method, and it will be None if the callback doesn't define it. 
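Here is a hedged sketch of that __call__ refactoring (a simplified stand-in for the real Runner, which also sorts callbacks by _order and stores much more state): the event is looked up by name with getattr, so callbacks that don't define an event are silently skipped and no superclass is required:

```python
class Runner():
    def __init__(self, cbs):
        self.cbs = cbs

    def __call__(self, cb_name):
        # Dispatch one named event to every callback; a callback
        # returning True means "stop", so we report that back.
        for cb in self.cbs:
            f = getattr(cb, cb_name, None)   # None if cb lacks this event
            if f and f(): return True
        return False

class OnlyEpochEnd():
    # No superclass needed: implement only the events you care about.
    def after_epoch(self):
        self.called = True   # returns None (falsy), i.e. "keep going"

cb = OnlyEpochEnd()
run = Runner([cb])
stopped = run("after_epoch")   # dispatches to cb.after_epoch
ignored = run("begin_fit")     # silently skipped: cb has no begin_fit
```

Because __call__ takes the event name as a string, the fit loop can be written as a flat sequence of `self('begin_epoch')`-style one-liners with no duplicated res/for/and boilerplate.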
And if you find it, then you can call it. So this is a nice way to call any callback. But when you implement a callback, look how much easier our test callback is now: it's just super simple. You just implement what you need. We do inherit from a new Callback class here, but we don't have to any more. The main reason we do is that our Callback class now has an _order attribute, which we can use to choose what order callbacks run in. We'll talk about that after we handle this question: what is the difference between hooks in PyTorch and callbacks in FastAI? We're going to do hooks very shortly. But think about it: if I want to add a callback after the forward pass of the second layer of my model, there's no way for me to do that, because the point at which I do the forward pass looks like this: self.model(...). Or if I want to hook into the point at which I've just done the backward pass of my penultimate layer, I can't do that either, because the whole thing appears here as self.loss.backward(). So PyTorch hooks are callbacks that you can add to specific PyTorch modules. We're going to see them very shortly; well, it might be next class, we'll see how we go. Okay, so very often you want to be able to inject behavior into something, but the different things can influence each other. For example, transformations: we're going to see this when we do data augmentation. So quite often you need things to run in a particular order. When I add this kind of injectable behavior, like callbacks, I like to also add something that says what order it should run in. You don't always have to have this; actually, when we look at transformations, they won't require an order, but this one does. So your callbacks need to be things that have an _order attribute. 
And this way we can make sure that some things run after other things. For example, you might have noticed that our Runner's fit function never calls model.eval and never calls model.train. It literally doesn't do anything like that itself; it just says these are the steps I have to run, and the callbacks do the running. So I created a TrainEvalCallback that, at the beginning of an epoch, calls model.train, and at the beginning of validation, calls model.eval. I also added stuff to keep track of how many epochs it has done, and this is quite nice: it does it as a floating point, not just as an int, so you could be 2.3 epochs in. It also keeps track of how many iterations you've done. So now we have this thing keeping track of iterations, and the test callback that should stop training after 10 iterations, rather than keeping track of n_iter itself, can just use the n_iter that was defined in this callback. So what we can do is say: TrainEvalCallback has an order of 0, because that's what it inherits, and we just make sure that this one runs later, with _order = 1. That way we can refer to stuff that's inside the TrainEvalCallback, like n_iter. Actually, we don't even need to do anything special, because it puts n_iter inside self.run, so we can just go self.n_iter. If this ran before TrainEvalCallback, that would be a problem, because n_iter might not have been updated yet. So that's what the order is for. Another nice thing about the Callback class is that I've defined a dunder getattr (__getattr__) on it, and defined it to return getattr(self.run, ...). An important thing to know about __getattr__ is that it is only called by Python if it can't find the attribute that you've asked for. So if something asks for self.name, and I have self.name, it's never going to get here. 
So if you get here, it means Python looked for this attribute and couldn't find it. And very, very often the thing you actually want in a callback is actually inside the runner, which we store away as self.run. So this means that in all of our callbacks, you can basically just use self.<whatever> for pretty much everything, and it will grab what you want, even though most of the stuff you want is inside the runner. You'll see this pattern in FastAI a lot: when one object contains or composes another object, we very often delegate __getattr__ to the inner object. For example, if you're looking at a dataset, I think we delegate to x; if you're looking at stuff in the data blocks API, it will often delegate to stuff lower in the data blocks API; and so forth. I find this pretty handy. Okay, so we have a callback, and as you see, there's very little to it. One interesting thing you might notice is that a callback has a name property, and the name property works like this. If you have a class called TrainEvalCallback, we've got a function called camel2snake. That naming style is called camel case: it means you've got uppercase and lowercase letters like the humps of a camel. And snake case looks like this. So camel2snake turns the camel into a snake. Then what we do here is remove "Callback" from the end, and that's its name: TrainEvalCallback has the name train_eval, with an underscore. Then in the Runner, any callback functions that you pass in, which it uses to create new callbacks, it actually assigns to an attribute with that name. So we now have something called runner.train_eval, for example. We do this in the FastAI library: when you say learn.recorder, we didn't actually add an attribute called recorder to Learner; it just automatically gets set because there's a Recorder callback. So let's see how to use this. There's a question; okay, let's do that in a moment. 
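The naming trick just described is small enough to show in full. This sketch follows the regex approach used in the course notebooks (the two precompiled patterns are the standard camel-to-snake idiom; treat the exact regexes as an illustrative reconstruction):

```python
import re

# Split at lower/upper boundaries, then lowercase everything.
_camel_re1 = re.compile('(.)([A-Z][a-z]+)')
_camel_re2 = re.compile('([a-z0-9])([A-Z])')

def camel2snake(name):
    s1 = _camel_re1.sub(r'\1_\2', name)
    return _camel_re2.sub(r'\1_\2', s1).lower()

name = camel2snake('TrainEvalCallback')   # 'train_eval_callback'
attr = re.sub(r'_callback$', '', name)    # 'train_eval'
```

So a TrainEvalCallback instance ends up reachable as runner.train_eval, which is the same mechanism behind learn.recorder in the library.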
So let's use this to add metrics, because it's no fun having a training loop where we can't actually see how we're going. Part of the whole point of this is that our actual training loop is now incredibly tight and neat and easy, but we still want to do all the stuff we want to do. So what if we create a little callback called AvgStatsCallback, into which we stick a couple of objects to keep track of our loss and metrics, one for training and one for validation? At the start of an epoch, we reset the statistics; at the end of an epoch, we print out the statistics; and after the loss is calculated, we accumulate the statistics. So then all we need is an object that has an accumulate method. Let's create a class that does that. Here's our accumulate method, which adds up the total loss, and for each metric, adds up the metric totals. Then we give it a property called avg_stats that goes through all of those losses and metrics and returns the averages. You might notice here that I've fixed the problem of having different batch sizes in the average: we're actually adding loss times the size of the batch, adding the size of the batch to the count, and adding each metric times the size of the batch, and then dividing by the total count at the end. So this keeps track of our stats. We add a dunder repr (__repr__) so that it prints out those statistics in a nice way. And now we can create a learner, add our AvgStatsCallback, and when we call fit, it prints out how we're going. That's the entirety of what it took to add metrics and loss tracking to our minimal training loop. Yes, Rachel? "Runner's __call__ exits early when the first callback returns True. Why is that?" So one of the things I noticed was really annoying in the first way I wrote the callback handler was that something had to return True to mean keep going, so basically False meant stop. 
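The batch-size-weighted averaging just described can be sketched framework-free. This is an illustrative simplification (the real AvgStats takes metric functions and tensors; here we pass in already-computed metric values so the weighting arithmetic stands alone):

```python
class AvgStats():
    def __init__(self, n_metrics):
        self.n_metrics = n_metrics
        self.reset()

    def reset(self):
        self.tot_loss, self.count = 0.0, 0
        self.tot_mets = [0.0] * self.n_metrics

    def accumulate(self, loss, met_vals, bs):
        # Weight everything by batch size so a short final batch
        # doesn't skew the epoch average.
        self.count += bs
        self.tot_loss += loss * bs
        for i, v in enumerate(met_vals):
            self.tot_mets[i] += v * bs

    @property
    def avg_stats(self):
        return ([self.tot_loss / self.count]
                + [t / self.count for t in self.tot_mets])

stats = AvgStats(n_metrics=1)
stats.accumulate(loss=0.5, met_vals=[0.75], bs=4)  # full batch of 4
stats.accumulate(loss=0.2, met_vals=[1.0], bs=2)   # short final batch of 2
# avg loss = (0.5*4 + 0.2*2) / 6 = 0.4; avg accuracy = (3 + 2) / 6
```

A naive unweighted mean of the two batch losses would give 0.35, so the weighting genuinely matters whenever the last batch is smaller.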
And that was really awkward, because if you don't add a return in Python, then it actually returns None, and None is falsy. I thought: if I forget to return something, that should mean keep going; that should be the default. So the first thing to point out is that the basic loop now actually says "if not" rather than "if". So "if not begin epoch": in other words, if your callback handler returns False, keep going. That means basically none of my callbacks need to return anything most of the time, except for the test callback, which returns True. So True means cancel; it means stop. Now, if one of my callbacks says stop, I could certainly imagine an argument either way, but the way I see it, if it says stop, let's just stop right now. Why do we need to run the other callbacks? So if one says stop, then it returns stop: it says we don't want to go any more. And then, depending on where you are, if after epoch returns stop, it's actually going to stop the loop entirely. So that's why. Next, there's something a little awkward here: we had to construct our AvgStatsCallback, and then we had to pass that to the runner, and then later on refer to stats.valid_stats.avg_stats, because remember, avg_stats was where we grabbed this from. That's okay, but it's a little awkward. So instead, what I do is create an accuracy callback function: it's the AvgStatsCallback constructor with accuracy passed in, using partial. Partial is a function that returns a function, so this is now a function which can create a callback. And I can pass this to cb_funcs, and now I don't have to store it away, because the runner, as we saw before, goes through each of the cb_funcs, calls each function to create the callback, and then sticks that callback inside the runner, giving it its name as the attribute. So this way, we can say: this is our callback function. 
This is our runner; fit. And now it's automatically available inside run.avg_stats. This is what FastAI v1 does, except it puts them inside the learner, because we don't have a runner concept there. So I think that's pretty handy. It looks a little bit awkward the first time you do it, but you can create a standard set of callback functions that you want to use for particular types of models, store them away in a list, and not have to think about them again, which is what you'll see us do lots of times. Like a lot of things in this part two of the course, you can choose how deep to go on different things. I think our approach to callbacks is super interesting, and if you do too, you might want to go deep here and really look into what kinds of callbacks you can build and what things you can do with them that we haven't done yet. But a lot of these details around exactly how I do this, if you're not as interested in the details of software engineering, might be something you care less about, which is fine. The main thing that everybody should take away is this: that's our training loop. The other stuff, like exactly how we created our AvgStatsCallback and exactly what __call__ does, are fairly minor details, but you should recognize that the fit function stores how many epochs we're doing and what learner we're working with, and calls each of the different callbacks at each point. And I never remember which callbacks run at which place. If you go to docs.fast.ai, the callbacks documentation will show you. Personally, I just always look at the source code, because it's so easy to see exactly what happens and exactly what's available at each point. So let's use this. And let's use this to do one cycle training. 
Because it's pretty hard to train well when you have a constant learning rate the whole time, and particularly because I really want to show you a deep dive, which we're about to see using hooks, into what the dynamics of training models look like. What we'll learn is that the first batches are everything: if you can get the first batches working well, then things will tend to be good. And this is how you can get super convergence. If you want your first batches to be good, it turns out that good annealing is critical. So let's do that right away; let's set up good annealing, because we now have the mechanics we need, because we have callbacks. So we're inside 05_anneal. We'll get our data; this is all the same as before. Here's something to create a learner with one line. So let's create a learner with that same little model we had before, our loss function, and our data, and we'll create a runner with our AvgStatsCallback. This defaulted to a learning rate of 0.5; maybe we could try it with a learning rate of 0.3. It's pretty handy being able to quickly create things with different learning rates, so let's create a function that's just a partial of get_model with a learning rate. And now we can call get_model_func, pass a learning rate in, and immediately have something with a different learning rate. Yes, tell me the question. "What is your typical debugging process?" My debugging process is to use the debugger. If I got an exception while I was running a cell, I just go into the next cell and type %debug, and that pops open the debugger. If things aren't working the way I expected but there wasn't an exception, then I'll just add set_trace somewhere around the point I care about. That's about it. Yeah, I find that works pretty well. 
Most of the time, then, it's just a case of looking at the shape of everything and what everything contains, like a couple of objects in the batch. I normally find something's got NaNs or zeros or whatever. It's really rare that, using the debugger, I find debugging is that difficult. If it is, then it's a case of stepping away and questioning your assumptions. But with the help of a debugger, all of the state is right there in front of you, which is one of the great things about PyTorch: it supports this kind of development. Okay. So we're going to create a callback that does hyperparameter scheduling. For this notebook, we're just going to schedule the learning rate. But in the last 12 months, one of the really successful areas of research has been people pointing out that you can and should schedule everything: your dropout amount, what kind of data augmentation you do, weight decay, learning rate, momentum, everything. Which makes sense, because the other thing we've been learning a lot about in the last 12 months is how, as you train a model, it goes through different phases: the loss landscapes of neural nets look very different at the start, in the middle, and at the end. So it's very unlikely that you would want the same hyperparameters throughout, and being able to schedule anything is super handy. So we'll create a parameter scheduler callback, and you just pass in a function and a parameter to schedule. We'll be passing in "lr", because lr is what PyTorch calls the learning rate. And this function will be something which takes a single argument, which is the number of epochs done divided by the total epochs. And remember, I told you that the TrainEvalCallback we added sets this as a float, so this will be like epoch number 2.35 out of six. 
So this will be a float saying exactly how far through training we are, and we'll pass it to some function that we're going to write, and the result of that function will be used to set the hyperparameter, in this case the learning rate. As you know from part one, you don't necessarily want the same value of a hyperparameter for all of your layers. PyTorch has something called parameter groups for this, which we wrap in an abstraction we call layer groups in FastAI, but they're basically the same thing. A PyTorch optimizer contains a number of parameter groups; unless you explicitly create more than one, everything will be in a single group. But any time we do stuff with hyperparameters, we have to loop through "for pg in self.opt.param_groups". Then the learning rate for each parameter group (this layer group) is set to the result of this function. And every time we start a new batch, if we're training, we run our scheduler. It's pretty hard to know if our scheduler is working if we can't actually see what's happening to the learning rate as we go. So let's create another callback called Recorder, which at the start of fitting sets the lrs and losses lists to empty, and after each batch, as long as we're training, appends the current learning rate and the current loss. Now, there are potentially lots of learning rates, because there are lots of layer groups. In FastAI, we tend to print out the learning rate of the final layer group, but you don't have to do it that way. Then we add something to plot the learning rates, and something to plot the losses. So hopefully this looks pretty familiar compared to the Recorder in FastAI v1. With that in place, we now need to create a function that takes the percentage of the way through training, which we'll call pos for position, and returns the value of the learning rate. So let's create one for linear schedules. 
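The param_groups loop just described can be sketched with a dummy optimizer. This is a hedged simplification (the real ParamScheduler is a callback wired into the runner; `DummyOpt` just mimics the list-of-dicts shape that PyTorch optimizers expose as `param_groups`):

```python
# Sketch: set one hyperparameter on every parameter group from a
# schedule function of training progress pos in [0, 1].
class ParamScheduler():
    def __init__(self, pname, sched_func):
        self.pname, self.sched_func = pname, sched_func

    def set_param(self, opt, pos):
        # pos is epochs-done divided by total epochs, as a float
        for pg in opt.param_groups:
            pg[self.pname] = self.sched_func(pos)

class DummyOpt():
    # Mimics the PyTorch param_groups shape: a list of dicts.
    def __init__(self):
        self.param_groups = [{'lr': 0.0}, {'lr': 0.0}]

sched = lambda pos: 0.1 * (1 - pos)   # toy linear decay from 0.1 to 0
opt = DummyOpt()
ParamScheduler('lr', sched).set_param(opt, 0.5)
# both groups now have lr == 0.05, halfway down the decay
```

Because the loop goes over every group, the same mechanism handles discriminative learning rates: you would simply pass a different schedule per group.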
What we want to be able to pass in is a starting learning rate and an ending learning rate. So we might pass it 10 and 1, and it would start at a learning rate of 10 and go down to 1; that would be ridiculously high, but whatever. But we need a function that just takes a position. So this is a function that's going to return a function. Here's a function that takes a start learning rate, an end learning rate, and a position, and returns the learning rate: start plus position times the difference. To convert that function into one which only takes a position, we use partial, passing in that function and the start and the end we were given. So now this function just takes pos, because that's the only argument of the inner function we haven't set. That works fine, but it's inconvenient, because we're going to create lots of different schedulers, and I don't want to have to write all this every time. So we can simplify the way we create these by using a decorator. Here's the version with a decorator. With the decorator, you create the linear scheduler in the natural way: it's something that takes a start learning rate, an end learning rate, and a position, and returns this. Then we add an annealer decorator, and the annealer decorator is the thing that does all this inner/partial business. What's a decorator? A decorator is a function that returns a function. What Python does is, if it sees the name of a function with an at sign before it above a function definition, it takes the function being defined, passes it into the decorator function, and replaces the definition with whatever the decorator returns. So it's going to take this, pass it over here, and then say return inner, where inner is the partial, as we described before. So let's see that. So now, sched_lin: we wrote it as taking start, end, and pos. But if I hit shift-tab, it says it only takes start and end. Why is that? Because we've replaced the original function with the decorated one. 
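The annealer decorator pattern just described can be written out in a few lines. This follows the shape used in the course notebook (minor details of the real implementation may differ):

```python
from functools import partial

def annealer(f):
    # Decorator: wrap a (start, end, pos) function so that calling it
    # with (start, end) returns a one-argument function of pos.
    def _inner(start, end):
        return partial(f, start, end)
    return _inner

@annealer
def sched_lin(start, end, pos):
    # Linear interpolation from start to end as pos goes 0 -> 1.
    return start + pos * (end - start)

f = sched_lin(1, 2)
# f(0.3) is 30% of the way from 1 to 2, i.e. 1.3
```

Because the decorator replaces sched_lin with _inner, shift-tab in Jupyter shows the (start, end) signature, exactly as described below.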
And this function just takes start and end. And this is where Jupyter is going to give you a much happier time than pretty much any IDE, because this kind of dynamic code generation is pretty hard for an IDE to handle, whereas Jupyter is actually running the code in an actual Python process, so it knows exactly what `sched_lin` means. So this has now created a function that takes start and end and returns a function which takes pos, which is what we need for our scheduler. So let's try it. Let's say `f = sched_lin(1, 2)`. So this is a scheduler that starts at learning rate 1 and ends at learning rate 2, and then we'll say, hey, what should that be 30% of the way through training? And again, if I hit Shift-Tab here, it knows that `f` is something that takes pos. So it's really nice: in Jupyter, you can take advantage of Python's dynamic nature. And there's no point using a dynamic language if you're not taking advantage of its dynamic nature. So things like decorators are a super convenient way to do this stuff. There are other languages, like Julia, that can do similar things with macros. This is not the only way to get this kind of nice, very expressive ability, but it's one good way to do it. So now we can just go ahead and define all of our different schedulers, each as a function of start, end, and pos. So, for example, the no-op scheduler is something which always returns start, and there's cosine scheduling and exponential scheduling. So let's define those, and then let's try to plot them — and it doesn't work. Why doesn't it work? Because you can't plot PyTorch tensors. But it turns out the only reason you can't plot PyTorch tensors is that tensors don't have an `ndim` attribute, which tells matplotlib how many dimensions there are. So watch this: we set `torch.Tensor.ndim` to a property that returns the length of the shape.
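The other schedulers follow the same pattern. A sketch of the no-op, cosine, and exponential versions (the formulas here are one standard way to write them; the notebook's may differ in minor details):

```python
import math
from functools import partial

def annealer(f):
    def _inner(start, end):
        return partial(f, start, end)
    return _inner

@annealer
def sched_no(start, end, pos):
    # constant: ignore pos (and end) and always return start
    return start

@annealer
def sched_cos(start, end, pos):
    # half-cosine from start (pos=0) to end (pos=1)
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

@annealer
def sched_exp(start, end, pos):
    # exponential (geometric) interpolation from start to end
    return start * (end / start) ** pos

# The matplotlib fix from the lesson (run where PyTorch is available):
#   torch.Tensor.ndim = property(lambda x: len(x.shape))
```

Each of these, once given a start and end, is just a function of `pos` in `[0, 1]`, so they all plug into the same `ParamScheduler`.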
This has now, again using the dynamic features of Python, inserted into the definition of Tensor a new property called `ndim`, and now we can plot tensors. So the nice thing about Python is you never have to be like, oh, this isn't supported, because you can change everything. You can insert things, you can replace things, whatever. So here we've now got a nice printout of our four different schedulers — which isn't really enough, because if you want to do one-cycle scheduling then, in fact, most of the time nowadays, you want some kind of warm-up and some kind of cool-down, or if you're doing something like SGDR, you've got multiple cool-downs. So we really need to be able to piece some of these schedulers together. So let's create another function called `combine_scheds`, and it's going to look like this. We're going to pass in the phases we want. So phase one will be a cosine schedule from a learning rate of 0.3 to 0.6. Phase two will be a cosine schedule with the learning rate going from 0.6 to 0.2. And phase one will take up 30% of our batches, and phase two will take up 70%. So that's what we're going to pass in: how long is each phase, and what's the schedule in each phase. So here's how we do that. I don't think I need to go through the code; there's nothing interesting about it. But what we can do once we have that is plot that schedule. And you can kind of see why we're very fond of these cosine one-cycle schedules. I don't think this has ever been published anywhere, but it's what fast.ai uses by default nowadays. You get a nice gentle warm-up at the start — this is the time when things are just super sensitive and fall apart really quickly. But it doesn't take long, as you'll see in next week's lesson when we do a deep dive using hooks, for it to get into a decent part of the loss landscape.
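A plain-Python sketch of what `combine_scheds` does (the notebook's version works on tensors, but the logic is the same: find the current phase, then rescale `pos` to be relative to that phase):

```python
def combine_scheds(pcts, scheds):
    # pcts: fraction of training taken by each phase, summing to 1
    # scheds: one scheduler (a function of pos in [0, 1]) per phase
    assert abs(sum(pcts) - 1.0) < 1e-9
    # cumulative phase boundaries, e.g. [0.3, 0.7] -> [0.0, 0.3, 1.0]
    bounds = [0.0]
    for p in pcts:
        bounds.append(bounds[-1] + p)

    def _inner(pos):
        # find which phase pos falls into
        idx = 0
        while idx < len(pcts) - 1 and pos >= bounds[idx + 1]:
            idx += 1
        # rescale pos to be relative to the current phase
        actual_pos = (pos - bounds[idx]) / pcts[idx]
        return scheds[idx](actual_pos)
    return _inner
```

So the one-cycle schedule from the lesson would be something like `combine_scheds([0.3, 0.7], [sched_cos(0.3, 0.6), sched_cos(0.6, 0.2)])`.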
And so you can quite quickly increase the learning rate. And then something that people have realized in the last four months or so — and we'll start looking at papers next week for this — although Leslie Smith really showed us this two years ago, it's only in the last four months or so that it has really been understood in the wider academic literature: you need to train at a high learning rate for a long time. And so with this kind of cosine schedule, we keep it up high for a long time. But then you also need to fine-tune at a very low learning rate for a long time. So this has all of the nice features that we want. So cosine one-cycle schedules are terrific, and we can now build them from scratch. So let's try training like this. So let's create a list of callback functions that has a Recorder in it, an average-stats callback with accuracy in it, and a parameter scheduler that schedules the learning rate using this schedule. And then fit. And that's looking pretty good — we're getting up towards 94% pretty quickly. And we can now call `plot_lr`, and it's the shape that we hoped for, and we can even call `plot_loss`. Okay. So we now have really all of the pieces we need to try out lots of different ways of training neural nets. We still haven't looked at convolutions, really. We'll do that next week, and a lot more. But you now have the ability, hopefully, to think of lots of things that you might want to try, and to try them out. So next week, we're going to be starting with convnets, and we're going to finally be using our GPU, because once we start creating convnets of this size, it starts taking a little bit too long. But just to read ahead a little bit: what's it going to take to put stuff on the GPU? This is the entirety of the callback. So we've now got the mechanics we need to do things unbelievably quickly.
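To preview what that GPU callback might look like — this is an assumed sketch with hypothetical names, not necessarily the notebook's exact class — it only has to move the model once, and each batch as it arrives:

```python
class CudaCallbackSketch:
    # device: e.g. torch.device('cuda', 0); anything supporting the usual
    # .to(device) protocol works, which is why this sketch runs without torch.
    def __init__(self, device):
        self.device = device

    def begin_fit(self, model):
        # move the model's parameters onto the GPU once, before training
        model.to(self.device)

    def begin_batch(self, xb, yb):
        # move each batch onto the GPU as it's needed
        return xb.to(self.device), yb.to(self.device)
```

Because it's just another callback, nothing else in the training loop has to change to run on the GPU.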
And then we'll be able to — oh, and also we'll be wanting to add some transformations. This is the entirety of what it takes to do batch-wise transformations with our callback. As we discussed, though, we can't currently add callbacks between layers. So we will add that ability: initially manually, and then using PyTorch hooks. And that way, we're going to be able to plot and see exactly what's going on inside our models as they train. And we'll find ways to train them much, much more nicely, so that by the end of the next notebook, we'll be up over 98% accuracy. And that's going to be super cool. And then we're going to do a deep dive into batch norm, the data blocks API, optimizers, and transforms. And at that point, I think we'll have basically all the mechanics we need to go into some more advanced architectures and training methods, and to see how we did some of the cool stuff that we did in part one. So I'll see you next week.
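A sketch of such a batch-wise transform callback (assumed names, not the notebook's exact code): it holds an arbitrary transform and applies it to each input batch before the model sees it — for MNIST, the transform would reshape each batch of flat 784-vectors into 1×28×28 images.

```python
class BatchTransformX:
    # Apply an arbitrary transform to the input batch at the start of each batch.
    def __init__(self, tfm):
        self.tfm = tfm

    def begin_batch(self, xb):
        return self.tfm(xb)

# With PyTorch, the MNIST resize transform might look like:
#   view_tfm = lambda x: x.view(-1, 1, 28, 28)
#   cb = BatchTransformX(view_tfm)
```

Because the transform runs per batch inside the callback, it can also run on the GPU once the batch has been moved there.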