Hi folks. Thanks for joining me for lesson 18. We're going to start today in Microsoft Excel. You'll see there's an Excel folder in the course22p2 repo, and in there there's a spreadsheet called "graddesc" — as in gradient descent. Let me zoom in a bit here. There are some instructions here, basically describing what's in each sheet. We're going to be looking at the various accelerated SGD approaches we saw last time, but done in a spreadsheet. We're going to do something very, very simple, which is to try to solve a linear regression. The actual data was generated with y = ax + b, where a, the slope, was 2, and b, the intercept or constant, was 30. So you can see we've got some random numbers here, and over here we've got the ax + b calculation. Then what I did was copy and paste, as values, one set of those random numbers into the next sheet, called "basic" — the basic SGD sheet. So that's what x and y are. The idea is that we're going to use SGD to learn that the intercept is 30 and the slope is 2 — those are our weights, or parameters. The way we do SGD is to start out with some random guess. My random guess is going to be 1 and 1 for the intercept and slope. So if we look at the very first data point, where x is 14 and y is 58, and the intercept and slope are both 1, we can make a prediction. Our prediction is just the slope times x plus the intercept, so the prediction will be 15. The actual answer was 58, so we're a long way off. We're going to use mean squared error — for a single point, that's just the error, the difference, squared. Okay. So: how much would the error change if we changed the weights?
One way to find out how much the squared error would change if we changed the intercept, b, is just to change b by a little bit and see what the error becomes. That's what I've done here: I've added 0.01 to the intercept, then calculated the prediction, then calculated the difference squared. So this cell is the squared error I get if I change b by 0.01 — and it's made the error go down a little bit, which suggests we should probably increase b, the intercept. We can calculate an estimated derivative by simply taking the change in squared error between using the actual intercept and using the intercept plus 0.01 — that's the rise — and dividing it by the run, which as we said is 0.01. That gives us the estimated derivative of the squared error with respect to b, the intercept. It's about -86; -85.99. We can do exactly the same thing for a: change the slope by 0.01, calculate the prediction, calculate the difference and square it, and estimate the derivative the same way — the rise, which is the difference, divided by the run, which is 0.01. That's quite a big number: about -1,200. In both cases the estimated derivatives are negative, which suggests we should increase both the intercept and the slope. And we know that's true, because both are currently 1, while the intercept should be 30 and the slope should be 2. So that's one way to calculate the derivatives. Another way is analytically — the derivative of x² is 2x. So here it is; I've just written it down for you. The analytic derivative for the intercept is just two times the difference, and the derivative for the slope is here. And you can see that the estimated version, using rise over run with the little 0.01 change, and the analytic version are pretty similar — same thing here, they're pretty similar.
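What the sheet is doing here can be sketched in a few lines of Python. This is a minimal sketch, not the spreadsheet itself — the function and variable names are mine, and only the data point, starting weights, and 0.01 step come from the lesson:

```python
# A minimal sketch of the sheet's finite-differencing check (names are
# illustrative, not cell references from the spreadsheet).
def sq_err(x, y, a, b):
    """Squared error of the prediction a*x + b against the target y."""
    return (y - (a * x + b)) ** 2

x, y = 14.0, 58.0   # the first data point from the sheet
a, b = 1.0, 1.0     # initial guesses for slope and intercept
eps = 0.01          # the small change used in the sheet

# Estimated derivative w.r.t. the intercept b: rise over run
db_est = (sq_err(x, y, a, b + eps) - sq_err(x, y, a, b)) / eps

# Analytic derivative: d/db (y - (a*x + b))^2 = 2 * ((a*x + b) - y)
db_analytic = 2 * ((a * x + b) - y)
```

Here `db_est` comes out around -85.99 and `db_analytic` is exactly -86, matching the numbers read off the sheet.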
Any time I calculate gradients analytically — by hand — I always like to test them against the actual rise-over-run calculation with some small number. This is called the finite differencing approach. We only use it for testing, because it's slow — you have to do a separate calculation for every single weight — but it's good for testing. In real life we use analytic derivatives all the time. Anyway, however we calculate the derivatives, we can now calculate a new slope: our new slope is equal to the previous slope, minus the derivative times the learning rate, which we've just set here at 0.0001. And we can do the same thing for the intercept, as you see. So here are our new slope and intercept, which we can use for the second row of data. The second row of data is x = 86, y = 202. Our intercept and slope are not 1 and 1 anymore — they're 1.01 and 1.12. Here we're just using a formula to point at the new intercept and slope. We can get a new prediction, squared error, and derivatives, and then another new slope and intercept. That was a pretty good step, actually — it really helped our slope head in the right direction, although the intercept is moving pretty slowly. And so we can do that for every row of data. Now, strictly speaking, this is not the mini-batch gradient descent we normally do in deep learning; it's a simpler version where every batch is of size 1. It's still stochastic gradient descent, just with a batch size of 1 — I think it's sometimes called online gradient descent, if I remember correctly. So we go through every data point in our very small dataset until we get to the very end. At the end of the first epoch, we've got an intercept of 1.06 and a slope of 2.57, and those are indeed better estimates than our starting estimates of 1 and 1. So what I would do is copy our slope, 2.57, up to here.
I'll just type it for now. And I'll copy our intercept up to here. Then it goes through the entire epoch again, and we get another intercept and slope. We could keep copying and pasting again and again, and watch the root mean squared error going down. Now, all that copying and pasting is pretty boring. So what we can do instead is fire up Visual Basic for Applications. Sorry, this might be a bit small — I'm not sure how to increase the font size, so you might want to open it on your own computer to see it clearly. Basically, I've created a little macro: if you click the reset button, it just sets the slope and constant to 1 and recalculates, and if you click the run button, it goes through five iterations of calling "one step". What one step does is copy the last slope to the new slope, and the last constant — intercept — to the new constant, and do the same for the RMSE. It also pastes it down at the bottom, for reasons I'll show you in a moment. So if I now click reset and then run — there you go, you can see it's run it five times, and each time it's recorded the RMSE, and here's a chart of it going down. You can see the new slope is 2.57 and the new intercept is 1.27. I could keep running it another five — this is just doing copy, paste, copy, paste five times — and you can see that the RMSE is very, very slowly going down, and the intercept and slope are very, very slowly getting closer to where they want to be. The big issue is that the intercept is meant to be 30; it looks like it's going to take a very long time to get there. It will get there eventually if you click run enough times, or maybe set the VBA macro to loop more than five steps at a time — but it's moving very slowly.
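Everything the macro is looping over — predict, compute analytic derivatives, update — can be sketched in a few lines of Python. A minimal sketch, assuming the sheet's update rule weight -= lr * derivative; the two data points and the learning rate are the ones mentioned in the lesson, and all names are mine:

```python
# Batch-size-1 SGD, chained row by row the way the sheet does it.
lr = 1e-4                 # the learning rate set in the sheet
a, b = 1.0, 1.0           # starting slope and intercept
data = [(14.0, 58.0), (86.0, 202.0)]   # first two rows (data came from y = 2x + 30)

for x, y in data:
    diff = y - (a * x + b)        # target minus prediction
    da = -2 * x * diff            # analytic d(sq_err)/d(slope)
    db = -2 * diff                # analytic d(sq_err)/d(intercept)
    a -= lr * da                  # step each weight against its gradient
    b -= lr * db
```

After the first row this gives a ≈ 1.12 and b ≈ 1.01, matching the 1.12 and 1.01 that appear in the sheet; an epoch just continues this loop over every row.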
Importantly, though, you can see it's kind of taking this linear route — the updates keep increasing in the same direction every time. So why not increase the step by more and more? And you'll remember from last time that that is what momentum does. So on the next sheet, we show momentum. Everything's exactly the same as the previous sheet, but on this sheet we didn't bother with the finite differencing — we just have the analytic derivatives, which are exactly the same as last time. The data is the same as last time, and the slope and intercept have the same starting points. And this is the new b and new a that we get. What we've done this time is add a momentum term, which we're calling beta, and it feeds into these cells here. What are these cells? It's maybe most interesting to take this one here. It takes the gradient, but it also takes the previous update — you can see here the blue one, -25. That previous update gets multiplied by 0.9, the momentum, and then the derivative gets multiplied by 0.1. So this is momentum: a little bit of each. And then what we do is use that, instead of the raw derivative, to multiply by our learning rate. We keep doing that again and again, as per usual. So we've got one column calculating the momentum — the lerped (linearly interpolated) version of the gradient — for both b and for a. And for this one it's the same thing: you look at what the previous move was, and that makes up 0.9 of your momentum version of the gradient, with 0.1 coming from the current gradient. And then that, again, is what we multiply by the learning rate.
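The momentum column just described can be sketched like this. The beta of 0.9 is the sheet's momentum; the gradient values below are made up purely to show the effect of repeated same-sign gradients:

```python
# A sketch of the sheet's momentum update: step with a lerped
# (linearly interpolated) moving average of gradients, not the raw gradient.
beta, lr = 0.9, 1e-4

def momentum_step(w, grad, avg):
    avg = beta * avg + (1 - beta) * grad   # 0.9 of previous average, 0.1 of current
    return w - lr * avg, avg

w, avg = 1.0, 0.0
steps = []
for grad in [-100.0, -100.0, -100.0]:      # same direction again and again
    w, avg = momentum_step(w, grad, avg)
    steps.append(-lr * avg)                # each step taken
```

Because the gradient keeps the same sign, `avg` grows from -10 to -19 to -27.1, so each step is bigger than the last — exactly the "why not increase it by more and more" behaviour.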
So you can see what happens: when you keep moving in the same direction — here the derivative is negative again and again and again — the update gets higher and higher. That's where these big jumps come from: we keep getting big jumps because the gradient is still negative, step after step. So at the end of this, our b and our a have jumped ahead. We can click run, and keep clicking it, and you can see it's moving — not super fast, but certainly faster than it was before. If you haven't used VBA, Visual Basic for Applications, before: you can hit Alt-F11 (or Option-F11) to open it, and you may need to go into your preferences and turn on the developer tools so that you can see it. You can also right-click a button and choose "Assign Macro" to see what macro has been assigned to it. So if I hit Alt-F11, I can just double-click on the sheet name and it'll open up, and you can see that this is exactly the same as the previous one. Oh — one difference: to keep track of momentum, at the very end — I've got my momentum values going all the way down — the very last momentum gets copied back up to the top each epoch, so that we don't lose track of our optimizer state, if you like. Okay, so that's what momentum looks like. So yeah, if you're more of a visual person like me — you like to see everything laid out in front of you and to be able to experiment, which I think is a good idea — this can be really helpful. Next, RMSProp, which we've seen; it's very similar to momentum, but in this case, instead of keeping track of a lerped moving average — an exponential moving average — of the gradients, we keep track of a moving average of the gradients squared.
Then, rather than using that directly as the gradient, what we do instead is divide our gradient by the square root of it. And remember, the reason we do that is: if there's very little variation — very little going on in your gradients — then you probably want to jump further. So that's RMSProp. And then finally Adam, remember, was a combination of both. In Adam, we've got both the lerped version of the gradient and the lerped version of the gradient squared. When we update, we divide the momentumized gradient by the square root of the exponentially weighted moving average of the squared gradients. And so again, we just go through that each time. So if I reset, then run — oh wow, look at that, it jumped up there very quickly. Because remember, we wanted to get to 2 and 30, and that's just two runs of five steps each, so ten epochs. Now, if I keep running it, it's kind of not getting closer anymore; it's jumping back and forth between pretty much the same values. So probably what we'd need to do at that point is decrease the learning rate. And yeah, that's pretty good — and now it's jumping between the same two values again, so maybe decrease the learning rate a little bit more. I kind of like playing around like this because it gives me a really intuitive feeling for what training looks like. So, I've got a question from our YouTube chat: how is J33 being initialized? What happens is we take the very last cells here — actually all four of these last cells — and copy them up here as values. So these are what those looked like in the last epoch: we go copy, then paste as values, and then this here just refers back to them, as you see.
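Stepping back to the update itself: put together, the Adam step the sheet performs looks roughly like this. It's a rough sketch — the betas, learning rate, eps, and gradient values below are illustrative rather than read off the sheet, and I've left out the bias correction the full Adam paper adds:

```python
import math

# A rough sketch of the Adam update: a lerped gradient (momentum)
# divided by the square root of a lerped squared gradient (RMSProp).
beta1, beta2, lr, eps = 0.9, 0.99, 0.01, 1e-8   # illustrative values

def adam_step(w, grad, m, v):
    m = beta1 * m + (1 - beta1) * grad          # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # moving average of squared gradients
    return w - lr * m / (math.sqrt(v) + eps), m, v

w, m, v = 1.0, 0.0, 0.0
for grad in [-4.0, -4.0]:
    w, m, v = adam_step(w, grad, m, v)
```

Dividing by the square root of `v` is what lets the step stay large when the gradients are small but consistent.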
And it's interesting that those last values are exact opposites of each other — you can really see that it's just fluctuating around the actual optimum at this point. Okay, thank you to Sam Watkins — we've now got a nicer-sized editor. That's great. Where were we? Adam. So with Adam, it all basically looks pretty much the same, except now at the end of each step we have to copy and paste both our momentums and our squared gradients, as well as the slopes and intercepts. But other than that, it's just doing the same thing. And when we reset it, it just sets everything back to its default values. Now, one thing that occurred to me when I first wrote this spreadsheet a few years ago was that manually changing the learning rate seems pretty annoying. Of course, we can use a scheduler, but a scheduler is something we set up ahead of time, and I wondered if it's possible to create an automatic scheduler. So I created this Adam annealing tab — which, honestly, I've never really got back to experimenting with, so if anybody's interested, they should check it out. What I did here was use exactly the same spreadsheet as the Adam sheet, but after each step I added an extra thing: I automatically decreased the learning rate in a certain situation. The situation was this: I kept track of the average of the squared gradients, and any time that average decreased during an epoch, I stored it — so I basically kept track of the lowest average squared gradients we'd seen. Then, if the average squared gradients halved relative to that, I decreased the learning rate by a factor of four. So I was keeping track of this "gradient ratio".
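That rule can be sketched as a tiny function. This is my reading of the description, not the actual VBA — the names and everything except the "halved → divide the learning rate by 4" rule are assumptions:

```python
# A sketch of the "automatic annealer" idea: track the lowest
# epoch-average of squared gradients seen so far, and when a new
# average is less than half of it, cut the learning rate by 4.
def maybe_anneal(avg_sq_grad, lr, best):
    if avg_sq_grad < best / 2:    # gradients have halved: a flatter region
        return lr / 4, avg_sq_grad
    return lr, min(best, avg_sq_grad)

lr, best = 0.1, 10.0
lr, best = maybe_anneal(8.0, lr, best)    # decreased, but not halved: lr unchanged
lr, best = maybe_anneal(3.0, lr, best)    # less than half of 8: anneal
```

The two calls show both branches: the first just updates the stored minimum, the second triggers the learning rate cut.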
Now, a named range like gradient_ratio — you can find what it refers to by just clicking up here and finding it. And there it is: you can see that it's equal to the ratio between the current average of the squared gradients and the minimum we've seen so far. My theory here was that as you train, you kind of get into flatter, more stable areas, and as you do, that's a sign that you might want to decrease your learning rate. So if I try that — if I hit run — again, it jumps straight to a pretty good value, but this time I'm not going to change the learning rate manually. I just press run, because it changes the learning rate automatically now. And if I keep hitting run without doing anything — look at that. It's got pretty good, hasn't it? The learning rate's got lower and lower, and we've got almost exactly the right answer. So that's a little experiment I tried. Maybe some of you should try experiments around whether you can create an automatic annealer using miniai — I think that would be fun. So that is an excellent segue into our notebook, because we are going to talk about annealing now. We've seen it done manually before, where we just decreased the learning rate in a notebook and ran a second cell, and we've seen something in Excel — but let's look at what we generally do in PyTorch. We're still in the same notebook as last time, the accelerated SGD notebook. And now that we've re-implemented, mostly from scratch, all the main optimizers that people tend to use, we can of course use PyTorch's. So let's look now at how we can do our own learning rate scheduling, or annealing, within the miniai framework. When we implemented the learning rate finder, we saw how to create something that adjusts the learning rate. Just to remind you, this was all we had to do: we go through the optimizer's parameter groups,
and at each group we multiply the learning rate by some factor — that was for the learning rate finder. So since we know how to do that, we're not going to bother re-implementing all the schedulers from scratch, because we know the basic idea now. Instead, let's have a look inside the torch.optim.lr_scheduler module and see what's defined in there. You can hit dot-tab to see what's in it, but something that I quite like to do is to use dir, because dir(lr_scheduler) is a nice little call that tells you everything inside a Python object — and this particular object is a module object, so it tells you all the stuff in the module. When you use tab completion, by the way, it doesn't show you names that start with an underscore, because those are considered private — whereas dir does show you that stuff. Now, I can kind of see from here that the things that start with a capital letter and then a lowercase letter look like the things we care about. We probably don't care about this, and we probably don't care about these. So we can just write a little list comprehension that checks that the first letter is uppercase and the second letter is lowercase, and then joins those all together with spaces. And so here is a nice way to get a list of all of the schedulers that PyTorch has available. Actually, I couldn't find such a list on the PyTorch website in the documentation, so this is a handy thing to have. So here are the various schedulers we can use, and I thought we might experiment with cosine annealing. Before we do, we have to recognize that these PyTorch schedulers work with PyTorch optimizers — not, of course, with our custom SGD class — and PyTorch optimizers have a slightly different API. So we need to learn how they work, and to learn how they work, we need an optimizer.
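As an aside, that dir-plus-comprehension trick works on any module. Here's a self-contained version using a stand-in module (so it runs without torch installed), with a few real scheduler names sprinkled in to show what the filter keeps and drops:

```python
import types

# dir() lists every name in an object, including the underscore-prefixed
# ones that tab completion hides; filtering for uppercase-then-lowercase
# picks out class names. On torch.optim.lr_scheduler this would list
# the schedulers themselves.
mod = types.ModuleType("fake_lr_scheduler")   # stand-in for lr_scheduler
for name in ["LambdaLR", "StepLR", "CosineAnnealingLR", "OneCycleLR",
             "EPOCH_DEPRECATION_WARNING", "_LRScheduler", "math"]:
    setattr(mod, name, object())

scheds = ' '.join(o for o in dir(mod)
                  if o[0].isupper() and o[1].islower())
```

With the real module you'd write `' '.join(o for o in dir(lr_scheduler) if o[0].isupper() and o[1].islower())`, as in the notebook.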
One easy way to grab an optimizer is to create a learner — pretty much any old random learner — and pass in that single-batch callback that we created. Do you remember the single-batch callback? After one batch, it cancels the fit, so it literally does just one batch. We can call fit, and from that we've now got a learner and an optimizer. And so we can do the same thing: we can dir our optimizer to see what attributes it has — this is a nice way, or of course you can just read the PyTorch documentation; this one is documented, I think, showing all the things it can do. As you would expect, it's got step and zero_grad, which we're familiar with. Or you can just type opt and hit shift-enter: the optimizers in PyTorch do actually have a repr, as it's called, which means you can see the information about them that way too. The repr tells you what kind of optimizer it is — in this case, the default optimizer for a learner, which we decided when we created it was optim.SGD. So we've got an SGD optimizer. And it's got these things called param groups. What are parameter groups? Well, as the name suggests, they're groups of parameters. And in fact, we only have one parameter group here, which means all of our parameters are in this group. So let me try to show you — it's a little bit confusing, but it's quite neat. Let's grab all of our parameters. That's actually a generator, so we have to turn it into an iterator and call next, and that will give us our first parameter. Okay, now what we can do is check the state of the optimizer. The state is a dictionary, and the keys are parameter tensors. This is pretty interesting, because I'm sure you're familiar with dictionaries — I hope you're familiar with dictionaries —
but normally you'd probably use numbers or strings as keys. Actually, you can use tensors as keys, and indeed that's what happens here. If we look at param, it's a tensor — actually a parameter, which, remember, is a tensor that knows to require grad and to be listed in the parameters of the module. And we're using that to index into the state: if you look at opt.state, it's a dictionary where the keys are parameters. Now, what's this for? Well, if you think back to our implementation: for each parameter we had state — the exponentially weighted moving averages of the gradients and of the squared gradients — and we stored them as attributes. PyTorch does it a bit differently. It doesn't store them as attributes; instead, the optimizer has a dictionary which you index into using a parameter, and that gives you that parameter's state. And you can see here it's got — this is the exponentially weighted moving average — and because we haven't done any training yet, and because we're using SGD without momentum, it's None. But that's how it would be stored. So this is really important for understanding PyTorch optimizers. I quite liked our way of doing it, just storing the state directly as attributes, but this works as well; it's fine, you just have to know it's there. And then, as I said, rather than just storing the parameters directly like our SGD did, in PyTorch those parameters can be put into groups. And since we haven't put them into groups, the length of param_groups is one: this is one group. So here are the param groups, and that group contains all of our parameters. Okay. So, just to clarify what's going on here: pg is a dictionary — it's a parameter group — and to get the keys from a dictionary, you can just listify it. That gives you back the keys.
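The tensor-keyed state dictionary can be shown without torch at all: anything hashable can be a dict key, and tensors hash by identity. A stand-in sketch, where FakeParam plays the role of nn.Parameter:

```python
# Sketch of how optimizer state is stored: a dict mapping each parameter
# object to its own state dict. FakeParam stands in for
# torch.nn.Parameter, which (like all tensors) hashes by identity.
class FakeParam:
    def __init__(self, data):
        self.data = data

p1, p2 = FakeParam([1.0, 2.0]), FakeParam([3.0])

state = {}                                  # like opt.state
state[p1] = {"momentum_buffer": None}       # no training yet, so None
state[p2] = {"momentum_buffer": None}
```

Indexing `state[p1]` retrieves the per-parameter buffers, exactly the lookup pattern `opt.state[param]` uses in PyTorch.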
Listifying a dictionary is one quick way of finding out all its keys. So you can see all the parameters in the group, and you can see all of the hyperparameters: the learning rate, momentum, weight decay, and so forth. That gives you some background about what's going on inside an optimizer. Seva asks: isn't indexing by a tensor just like passing a tensor argument to a method? No, it's not quite the same, because this is state — this is how the optimizer stores state about the parameters, and it has to be stored somewhere. For our homemade miniai version, we stored it as attributes on the parameter; the PyTorch optimizers store it in a dictionary. It's just how it's stored. Okay. So with that in mind, let's look at how schedulers work, and create a cosine annealing scheduler. A scheduler in PyTorch has to be passed the optimizer, and the reason for that is that we want it to be able to change the learning rates of our optimizer — so it needs to know which optimizer's learning rates to change, and it can then do that for each set of parameters. The reason it works per parameter group is that, as we'll learn in a later lesson, for things like transfer learning we often want to adjust the learning rates of the later layers differently from the earlier layers, and actually have different learning rates. That's why we can have different groups, and the different groups can have different learning rates, momentum, and so forth. Okay. So we pass in the optimizer, and then if I hit shift-tab a couple of times, it'll tell me all of the things that I can pass in. It needs to know T_max — how many iterations you're going to do — and that's because it's trying to do half a wave, if you like, of the cosine curve, so it needs to know how far to step each time. So we're going to do 100 iterations.
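Given T_max, the curve the scheduler will follow is just a cosine half-wave. A sketch of the formula, taking the minimum learning rate as 0 (0.06 and 100 are the base LR and T_max used in the lesson):

```python
import math

# The cosine-annealing curve (with the minimum LR taken as 0):
#   lr(t) = base_lr * (1 + cos(pi * t / T_max)) / 2
base_lr, T_max = 0.06, 100

def cos_lr(t):
    return base_lr * (1 + math.cos(math.pi * t / T_max)) / 2

lrs = [cos_lr(t) for t in range(131)]   # intentionally past T_max
# starts at 0.06, reaches ~0 at t = 100, then climbs again
```

Evaluating past T_max shows why the plot in the notebook turns back upward: the cosine keeps going.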
So the scheduler stores the base learning rate. And where did it get that from? It got it from our optimizer, on which we'd set a learning rate — it steals the optimizer's learning rate, and that becomes the starting, or base, learning rate. It's a list, because there could be a different one for each parameter group; we only have one parameter group. You can also get the most recent learning rate from a scheduler, which at this point is of course the same. Now, I couldn't find any method in PyTorch to actually plot a scheduler's learning rates, so I just made a tiny little function: it creates a list, initialized with the scheduler's last learning rate — which is going to start at 0.06 — and then, for however many steps you ask for, it steps the optimizer, steps the scheduler (this is the thing that causes the scheduler to adjust its learning rate), appends the new learning rate to the list of learning rates, and then plots it. What I've done here is intentionally gone past 100, even though I told it I'd do 100 iterations. And you can see that if we did 100 iterations, the learning rate would start high for a while, then go down, then stay low for a while — and if we intentionally go past the maximum, it actually starts going up again, because this is a cosine curve. So one of the main things I wanted to show here is what it looks like to really investigate, in a REPL environment like a notebook, how an object behaves and what's in it. This is something I would always want to do when I'm using something from an API I'm not very familiar with. I really want to see what's in it, see what things do, run it totally independently, plot anything I can plot — this is how I like to learn about the stuff I'm working with. Data scientists don't spend all of their time just coding.
So that means we can't just rely on using the same classes and APIs every day; we have to be very good at exploring them and learning about them. And that's why I think this is a really good approach. Okay, so let's create a scheduler callback. A scheduler callback is something we're going to pass the scheduler callable into. And remember that when we create the scheduler, we have to pass in the optimizer to schedule. So before_fit — the point at which we have an optimizer — is where we create the scheduler object, by passing the optimizer into the scheduler callable. I've called the resulting object "schedo", which is very Australian. Then, in our step method, we check if we're training, and if so, we step the scheduler. So what's going to call step? after_batch: after each batch, we call step. That's if you want your scheduler to update the learning rate every batch. We could also have an epoch scheduler callback, which we'll see later — that's just going to step after each epoch instead. Okay. So, in order to actually see what the scheduler is doing, we're going to need to create a new callback to keep track of what's going on in our learner. I figured we could create a recorder callback. We're going to pass in the name of the thing that we want to record — the thing we want to keep track of in each batch — and a function which is going to be responsible for grabbing that thing. In this case, the function grabs, from the callback, its pg attribute — a parameter group — and looks up the learning rate. Where does the pg attribute come from? Before the fit, the recorder callback grabs just the first parameter group — you've got to pick some parameter group to track, so we'll just grab the first one.
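The shape of that scheduler callback can be sketched with stand-ins. The method names follow the lesson, but the fake learner and counting scheduler below are mine — just enough structure to exercise the callback without torch or miniai:

```python
# A sketch of a batch scheduler callback: build the scheduler once the
# optimizer exists (before_fit), then step it after each training batch.
class BatchSchedCB:
    def __init__(self, sched_callable):
        self.sched_callable = sched_callable    # e.g. partial(CosineAnnealingLR, T_max=300)
    def before_fit(self, learn):
        self.schedo = self.sched_callable(learn.opt)   # optimizer exists now
    def after_batch(self, learn):
        if learn.training:
            self.schedo.step()

# Minimal stand-ins to exercise the callback:
class CountingSched:
    def __init__(self, opt): self.opt, self.steps = opt, 0
    def step(self): self.steps += 1

class FakeLearn:
    opt, training = object(), True

learn = FakeLearn()
cb = BatchSchedCB(CountingSched)
cb.before_fit(learn)
for _ in range(3):
    cb.after_batch(learn)
```

After three training batches the scheduler has been stepped three times, and stepping is skipped whenever `learn.training` is False.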
Continuing with the recorder callback: we then create a dictionary of all the things that we're recording. We get all the names — in this case just "lr" — and initially each one maps to an empty list. Then, after each batch, we go through each of the items in that dictionary — here the key is "lr" and the _lr function is the value — and append to that list the result of calling that function, passing in this callback. That's why the function receives the callback. So after each batch during training, that dictionary accumulates the results of each of these functions, and at the end we just go through and plot them all. Let me show you what that looks like. Let's create a cosine annealing callable: we have to use a partial to say that this callable is going to get T_max equal to three times however many mini-batches we have in our data loader, because we're going to do three epochs. Then we set it running, passing in the batch scheduler with the scheduler callable, along with our recorder callback, saying we want to track the learning rate using the _lr function, and we call fit. And this actually gets a pretty good accuracy — we're getting close to 90% now, in only three epochs, which is impressive. So when we then call rec.plot — remember, rec is the recorder callback — it plots the learning rate. Isn't that sweet? As I said, we could do exactly the same thing but replace after_batch with after_epoch, and this now becomes a scheduler which steps at the end of each epoch rather than at the end of each batch. So I can do exactly the same thing using an epoch scheduler — and this time, T_max is three, because we're only going to be stepping three times.
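The recorder callback just described can be sketched with stand-ins as well. The name=function interface follows the lesson; the fake optimizer and learner are assumptions to make it runnable here:

```python
# A sketch of the recorder callback: pass name=function pairs; after
# every batch, each function is called with the callback itself and
# its result is appended to that name's list.
class RecorderCB:
    def __init__(self, **d):
        self.d = d
    def before_fit(self, learn):
        self.recs = {k: [] for k in self.d}     # one list per recorded name
        self.pg = learn.opt.param_groups[0]     # track the first param group
    def after_batch(self, learn):
        for k, f in self.d.items():
            self.recs[k].append(f(self))

def _lr(cb):                  # grabs the current learning rate from the group
    return cb.pg['lr']

class FakeOpt:
    param_groups = [{'lr': 0.06}]

class FakeLearn:
    opt = FakeOpt()

learn = FakeLearn()
rec = RecorderCB(lr=_lr)
rec.before_fit(learn)
for _ in range(2):
    rec.after_batch(learn)
```

`rec.recs['lr']` now holds one learning-rate reading per batch — exactly the list the lesson's `rec.plot` would chart.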
With the epoch scheduler, we're not stepping at the end of each batch, just at the end of each epoch. So that trains, and then we can call rec.plot afterwards — and as you can see, it's just stepping three times. So you can see that we're really digging in deeply to understand what's happening everywhere in our models: what do the activations look like, what do the losses look like, what do our learning rates look like — and we've built all of this from scratch. So yeah, hopefully that gives you a sense that we really can do a lot ourselves. Now, if you've done the fast.ai part 1 course, you'll be very aware of 1cycle training, which came from a terrific paper by Leslie Smith — which I'm not sure ever actually got published. Let's take a look at it. We can just replace our scheduler with the one-cycle learning rate scheduler — that's in PyTorch, and of course, if it weren't, we could very easily write our own. We make it a batch scheduler, and this time we're going to do five epochs, so we're training a bit longer. The first thing to point out is: hooray, we have got a new record for us, 90.6% — that's great. And secondly, you can see the plot — and now two things are being plotted, because I've passed into the recorder callback both a function to plot learning rates and one to plot momentums. For momentums, it grabs betas[0] — remember, for Adam the two betas are the momentum of the gradients and the momentum of the gradients squared. And you can see what 1cycle is doing: the learning rate starts very low, goes up to high, and then comes down again, while the momentum starts high, then goes down, and then comes back up. So what's the theory here?
Well, starting out at a low learning rate is particularly important if you have a not perfectly initialized model, which almost everybody almost always does. Even though we spent a lot of time learning to initialize models, we use a lot of models that get more complicated, and it takes a while for people to figure out how to initialize more complex models properly. So for example, this is a very, very cool paper: in 2019, this team figured out how to initialize ResNets properly — we'll be looking at ResNets very shortly — and they discovered that when they did that, they did not need batch norm: they could train networks of 10,000 layers, and they could get state-of-the-art performance with no batch norm. And there's been something similar for transformers, called T-Fixup, that does a similar kind of thing. But anyway, it is quite difficult to initialize models correctly. Most people fail to realize that they generally don't need tricks like warmup and batch norm if they do initialize them correctly. In fact, T-Fixup explicitly looks at this: it looks at the difference between no warmup versus with warmup, with their correct initialization versus with normal initialization. And you can see these pictures they're showing — log-scale histograms of gradients — are very similar to our colorful dimension plots. I kind of like our colorful dimension plots better in some ways because I think they're easier to read, although I think theirs are probably prettier. So there you go, Stefano — there's something to inspire you if you want to try more things with our colorful dimension plots. I think it's interesting that some papers are actually starting to use a similar idea. I don't know if they got it from us or came up with it independently; it doesn't really matter.
But so, we do a warmup if our network's not quite initialized correctly: starting at a very low learning rate means it's not going to jump off way outside the area where the weights even make sense, and then you gradually increase the learning rate as the weights move into a part of the space that does make sense. And during that time, while we have low learning rates, if the weights keep moving in the same direction, then with this very high momentum they'll move more and more quickly; but if they keep moving in different directions, the momentum will just average out to the underlying direction they're moving in. And then once you have got to a good part of the weight space, you can use a very high learning rate. And with a very high learning rate, you wouldn't want so much momentum — so that's why there's low momentum during the time when there's high learning rate. And then, as we saw in our spreadsheet, which did this automatically, as you get closer to the optimum you generally want to decrease the learning rate, and since we're decreasing it, again we can increase the momentum. So you can see that starting from random weights, we've got a pretty good accuracy on Fashion-MNIST with a totally standard convolutional neural network — no ResNets, nothing else, everything built from scratch by hand, artisanal neural network training — and we've got 90.6% on Fashion-MNIST. So there you go. All right, let's take a seven-minute break and I'll see you back shortly. I should warn you, we've got a lot more to cover, so I hope you're okay with a long lesson today. Okay, we're back. I just wanted to mention something we skipped over here, which is this change to how the callbacks get access to the learner. This is more important for the people doing the live course than the recordings; if you're doing the recording, you will have already seen this.
But since I created learner, Peter — actually, I don't know how to pronounce your surname, sorry — pointed out that there's actually a nicer way of handling learner. Previously we were putting the learner object itself into self.learn in each callback, and that meant we were using self.learn.model and self.learn.opt and self.learn.everything all over the place. It was kind of ugly. So we've modified learner this week so that instead, when the learner calls the callbacks via run_cbs (which, you might remember, is what it calls), it passes the learner as a parameter to the method. So now the learner no longer goes through the callbacks and sets their .learn attribute; instead, in your callbacks, you have to put learn as a parameter in all of the callback methods. So for example, DeviceCB has a before_fit, so now it's got comma learn here, and now this is not self.learn, it's just learn. It does make a lot of the code less yucky to not have all this self.learn everywhere — self.learn.model and self.learn.batch are now just learn.model and learn.batch. It's also good because you don't generally want the learner to have a reference to the callbacks and also the callbacks to have a reference back to the learner: that creates something called a cycle. So there's a couple of benefits there. And that reminds me, there are a few other little changes we've made to the code, and I want to show you a cool little trick for how I'm going to quickly find all of the changes that we've made to the code in the last week. To do that, we can go to the course repo — and on any repo, you can add /compare in GitHub, and then you can compare across all kinds of different things. One of the examples they've got here is to compare across different times: look at the master branch now versus one day ago. I actually want the master branch now versus seven days ago.
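To recap the callback-parameter change above in code — a simplified sketch of the convention, not the exact miniai source:

```python
# run_cbs passes the learner to each callback method, so callbacks
# never store a .learn attribute (which would create a reference cycle).
def run_cbs(cbs, method_nm, learn):
    for cb in sorted(cbs, key=lambda c: getattr(c, 'order', 0)):
        method = getattr(cb, method_nm, None)
        if method is not None:
            method(learn)

class DeviceCB:
    def before_fit(self, learn):
        learn.device = 'cpu'   # stand-in for moving learn.model to a device
```

Each callback method now takes learn as a parameter and uses it directly, rather than going through self.learn.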
So I just change this to seven, and there we go — there are all my commits, and I can immediately see the changes from last week. And so you can basically see what are the things I had to do when I changed things. For example, you can see here all of my self.learns became learns. I added the — ah, that's right — my augmentation. And in learner, I added an lr_find. Ah yes, I will show you that one; that's pretty fun. So here are the changes we made to run_cbs and to fit. So this is a nice way I can quickly find out what I've changed since last time and make sure that I don't forget to tell you folks about any of them. Oh yes, cleaned-up fit — I have to tell you about that as well. Okay, that's a useful reminder. So the main other change to mention is that calling the learning rate finder is now easier, because I added what's called a patch to the learner. fastcore's patch decorator lets you take a function and turn that function into a method of whatever class you put after the colon. So this has created a new method called lr_find — learner.lr_find. And what it does is it calls self.fit, where self is a learner, passing in however many epochs you set as the maximum, and what to start the learning rate at for the learning rate finder, and then it says to use as callbacks the learning rate finder callback. Now this is new as well: learner.fit didn't use to have a callbacks parameter. That's very convenient, because what it does is it adds those callbacks just during the fit. So if you pass in callbacks, it goes through each one and appends it to self.cbs, and when it's finished fitting, it removes them again. So these are callbacks that are just added for the period of this one fit, which is what we want for a learning rate finder — it should just be added for that one fit.
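The idea behind patch can be sketched with a tiny stand-in. fastcore's real @patch is more capable than this; the lr_find body and its parameters here are hypothetical placeholders, not the notebook's actual code:

```python
# A minimal stand-in for fastcore's @patch decorator: the type
# annotation on `self` names the class the function is attached to.
def patch(f):
    cls = f.__annotations__['self']
    setattr(cls, f.__name__, f)
    return f

class Learner:
    def fit(self, n_epochs, lr=None, cbs=None):
        self.last_fit = (n_epochs, lr, cbs)   # stand-in for real training

@patch
def lr_find(self: Learner, start_lr=1e-5, max_epochs=10):
    # hypothetical body: fit with an LR-finder callback added just for this fit
    self.fit(max_epochs, lr=start_lr, cbs=['LRFinderCB'])
```

After the decorator runs, any Learner instance has an lr_find method, exactly as if it had been defined inside the class.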
So with this patch in place, this is all that's required to do the learning rate finder now: create your learner and call .lr_find(). And there you go — bang. So patch is a very convenient thing. It's one of those things where Python has a lot of folk wisdom about what is and isn't considered Pythonic or good, and a lot of people really don't like patching; in other languages it's used very widely and is considered very good. I don't tend to have strong opinions either way about what's good or what's bad; instead I just figure out what's useful in a particular situation. And in this situation, obviously, it's very nice to be able to add this additional functionality to our class. So that's what lr_find is. And then the only other thing we added to the learner this week was a few more parameters to fit. Fit used to just take the number of epochs; as well as the callbacks parameter, it now also has a learning rate parameter. You've always been able to provide a learning rate to the constructor, but now you can override the learning rate for one fit: if you pass in the learning rate it will use it, and if you don't, it will use the learning rate passed into the constructor. And then I also added these two booleans to say, when you fit, do you want to do the training loop and do you want to do the validation loop. By default it'll do both, and you can see here there's just an "if train: do the training loop" and "if valid: do the validation loop". I'm not even going to talk about this, but if you're interested in testing your understanding of decorators, you might want to think about why it is that I didn't have to say "with torch.no_grad()", but instead called torch.no_grad(), parentheses, on a function. If you can get to a point where you understand why that works and what it does, you'll be well on your way. Okay, so that is the end of accel_sgd. Next up: ResNets.
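On that torch.no_grad point: calling torch.no_grad() produces an object that works both as a context manager and, when called on a function, as a decorator that wraps the function to run with gradient tracking disabled. A minimal sketch — the validate_step function is just an illustration, not the learner's code:

```python
import torch

def validate_step(x):
    # a toy computation standing in for a validation forward pass
    return (x * 2).sum()

# torch.no_grad()(fn) wraps fn so it runs with gradients disabled,
# exactly as if it had been defined with @torch.no_grad() on top.
validate_step_ng = torch.no_grad()(validate_step)

x = torch.ones(3, requires_grad=True)
y1 = validate_step(x)      # tracked: part of the autograd graph
y2 = validate_step_ng(x)   # untracked: no graph is built
```

This is why the learner can take an ordinary method and disable gradients for it without a with-block.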
So, what are we up to? Oh yeah — 90.6 percent is what we're up to; let's keep track of that. Okay, so to remind you of the model: we're going to open 13_resnet now and do the usual imports and setup initially, and the model that we've been using for a while is a convolution and an activation and an optional batch norm — in our models we were using batch norm and applying our weight initialization, the Kaiming weight initialization. And then we've got convs that take the channels from 1 to 8 to 16 to 32 to 64, each one stride 2, and at the end we do a flatten. That ended up with a one by one, so that's been the model we've been using for a while. The number of layers is one, two, three, four — four convolutional layers, with a maximum of 64 channels in the last one. So can we beat 90.6 percent? Before we do a ResNet, I thought, well, let's just see if we can improve the architecture thoughtfully. Generally speaking, more depth and more channels gives the neural net more opportunity to learn, and since we're pretty good at initializing our neural nets and using batch norm, we should be able to handle deeper. So one thing we could do — let's just remind ourselves of the previous version so we can compare — is to have our channels go up to 128. The way we do that is to make our very first convolutional layer have a stride of one. That's the one that goes from the one input channel to eight output channels — eight filters, if you like. If we make it stride one, that allows us to have one extra layer, and that one extra layer can again double the number of channels and take us up to 128. So that would make it deeper, and effectively wider as a result. So we can do a normal batch norm 2d and our new one cycle learning rate with our scheduler, and the callbacks we're going to use are the device callback,
our metrics, our progress bar, and our activation stats looking for the GeneralRelu layers. And I won't have you watch them train, because that would be kind of boring, but if I do this with this deeper and eventually wider network — this is pretty amazing — we get up to 91.7%. So that's quite a big difference, and literally the only difference to our previous model is this one line of code, which lets us go from 8 up to 128 instead of from 1 up to 64. So that's a very small change, but it massively improved things: the error rate's gone down by over 10%, relatively speaking. So there's a huge impact we've already had, again in five epochs. So now what we're going to do is make it deeper still. But there comes a point — as Kaiming He et al. noted, there comes a point where making neural nets deeper stops working well. And remember, this is the guy who created the initializer that we know and love, and he pointed out that even with that good initialization, there comes a time where adding more layers becomes problematic. And he pointed out something particularly interesting — this is in a paper called Deep Residual Learning for Image Recognition, the paper that introduced ResNets. He said: let's take a 20-layer neural network, train it for tens of thousands of iterations, and track its test error. And now let's do exactly the same thing on a 56-layer network — otherwise identical, but deeper. And he pointed out that the 56-layer network had a worse error than the 20-layer one. And it wasn't just a problem of generalization, because it was worse on the training set as well. Now, the insight he had is: if you just set the additional 36 layers to identity — identity matrices — they would do nothing at all, and so a 56-layer network is a superset of a 20-layer network. It should be at least as good. But it's not — it's worse. So clearly
the problem here is something about training it. And so he and his team came up with a really clever insight, which is: can we create a 56-layer network which has the same training dynamics as a 20-layer network, or even less? And they realized yes, you can: you can add something called a shortcut connection. Basically, the idea is that normally we have our inputs coming into our convolution — say that's our inputs, here's our convolution, and here's our outputs. Now, if we do this 56 times, that's a lot of stacked-up convolutions, which are effectively matrix multiplications, with a lot of opportunity for gradient explosions and all that fun stuff. So how could we make it so that we have convolutions, but with the training dynamics of a much shallower network? Here's what he did. He said: let's actually put two convs in here, to make it twice as deep — because we are trying to make things deeper — but then let's add what's called a skip connection. So instead of out = conv2(conv1(in)) — this is conv one, this is conv two, and assume these include activation functions — instead of just doing that, let's make it out = conv2(conv1(in)) + in. Now, if we initialize these so that at first they have weights of zero, then initially this part will do nothing at all — it will output zero — and therefore at first you'll just get out = in, which is exactly what we wanted: we want it to be as if there are no extra layers. And so this way we end up with a network which can be deep, but which, at least when you start training, behaves as if it's shallow. It's called a residual connection because if we subtract in from both sides, we get out − in = conv2(conv1(in)) — in other words, the difference between the endpoint and the starting point, which is the residual. And so another way of
thinking about it is that this is calculating a residual. So there are a couple of ways of thinking about it, and this thing here is called the ResBlock, or ResNet block. Okay, so Sam Watkins has just pointed out the confusion here, which is that this only works — let's put the minus in back, and put it back over here — this only works if you can add these together. Now, if conv one and conv two both have the same number of channels as in — the same number of filters — and they also have stride one, then that will work fine: you'll end up with exactly the same output shape as the input shape, and you can add them together. But if they are not the same, then you're in a bit of trouble. So what do you do? The answer, which Kaiming He et al. came up with, is to add a conv on the identity path as well, but to make it as simple as possible. We call this the identity conv — it's not really an identity anymore, but we're trying to make it as simple as possible, so that we do as little to mess up these training dynamics as we can. And the simplest possible convolution is one with a one by one kernel — a one by one kernel size, I guess we should call it — and we can also add a stride or whatever if we want to. So let me show you the code. We're going to create something called a conv block, and the conv block is going to do the two convs. So we've got some number of input filters, some number of output filters, some stride, some activation function, possibly a normalization layer, and some kernel size. The second conv is actually going to go from output filters to output filters, because the first conv goes from input filters to output filters — so by the time we get to the second conv, it's nf to nf. The first conv we will give stride one, and then the second conv will have the requested stride, so that the two convs back to back are going to overall
have the requested stride. So the combination of these two convs is eventually going to take us from ni to nf in terms of the number of filters, and it's going to have the stride that we requested. So the conv block is a sequential block consisting of a convolution followed by another convolution, each one with the requested kernel size, the requested activation function, and the requested normalization layer. The second conv won't have an activation function — I'll explain why in a moment. And I mentioned that one way to make this behave as if it didn't exist would be to set the convolutional weights to zero and the biases to zero, but actually we would like to have correctly, randomly initialized weights. So instead, what we can do, if we're using batch norm, is initialize the batch norm weights of that second conv to zero. Now, if you've forgotten what that means, go back and have a look at our implementation from scratch of batch norm, because the batch norm weights are the thing we multiply by. Remember, in batch norm we subtract the exponential moving average mean and divide by the exponential moving average standard deviation, and then we multiply by the batch norm's weights and add back the batch norm's bias — multiply by weights first. So if we set the batch norm layer's weights to zero, we're multiplying by zero, and this will cause the initial conv block output to be all zeros. And that gives us what we wanted: nothing's happening here, so we just end up with the input — with this possible id conv. So a ResBlock is going to contain those convolutions in the convolution block we just discussed, and then we're going to need this id conv. The id conv is going to be a no-op — nothing at all — if the number of channels in is equal to the number of channels out; but
otherwise we're going to use a convolution with a kernel size of one and a stride of one. And so that is going to, with as little work as possible, change the number of filters so that they match. Also, what if the stride's not one? Well, if the stride is two — actually, this isn't going to work for any stride, it only works for a stride of two — we will simply average, using average pooling. This just says: take the mean of every set of two items in the grid. So we basically have pool(id_conv(in)) if the stride is two and the number of filters has changed, and that's the minimal amount of work. So here it is — here's the forward pass. We get our input, and on the identity connection we call pool — if stride is one, that's a no-op, so it does nothing at all — then we do id conv, and if the number of filters has not changed, that's also a no-op, so in that situation it's just the input. And then we add that to the result of the convs. And here's something interesting: we then apply the activation function to the whole thing. I wouldn't say this is the only way you can do it, but it's a way that works pretty well — apply the activation function to the result of the whole ResNet block — and that's why I didn't add an activation function to the second conv. So that's a ResBlock; it's not a huge amount of code. And so now I've literally copied and pasted our get_model, but everywhere that previously we had a conv, I've just replaced it with a ResBlock. In fact, let's have a look at get_model. Previously we started with a conv from 1 to 8; now we do a ResBlock from 1 to 8, stride 1. Then we added convs from number of filters i to number of filters i plus 1; now it's ResBlocks from number of filters i to number of filters i plus 1. So it's exactly the same.
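Here's a compact sketch of the ResBlock just described: two convs with the stride on the second and no activation on the second, a zero-initialized batch norm weight so the block starts out as the identity, a 1x1 conv when the channel counts differ, average pooling for stride two, and the activation applied after the addition. The helper names are mine, not necessarily the notebook's exact code:

```python
import torch
from torch import nn

def conv(ni, nf, ks=3, stride=1, act=True, norm=True):
    # conv + optional batch norm + optional activation
    layers = [nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2, bias=not norm)]
    if norm: layers.append(nn.BatchNorm2d(nf))
    if act: layers.append(nn.ReLU())
    return nn.Sequential(*layers)

class ResBlock(nn.Module):
    def __init__(self, ni, nf, stride=1, ks=3):
        super().__init__()
        # two convs: stride goes on the second; no activation on the second
        self.convs = nn.Sequential(conv(ni, nf, ks, 1),
                                   conv(nf, nf, ks, stride, act=False))
        # zero the second conv's batch norm weight: the block starts as identity
        nn.init.zeros_(self.convs[1][1].weight)
        self.idconv = nn.Identity() if ni == nf else nn.Conv2d(ni, nf, 1)
        self.pool = nn.Identity() if stride == 1 else nn.AvgPool2d(2)
        self.act = nn.ReLU()

    def forward(self, x):
        # activation applied to the whole block's output
        return self.act(self.convs(x) + self.idconv(self.pool(x)))
```

At initialization, a same-shape block computes relu(x + 0), i.e. it behaves like a plain activation, which is the training-dynamics trick the lecture describes.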
One change I have made, though — it doesn't actually make any difference at all, I think it's mathematically identical — is that previously the very last conv at the end went from the 128 channels down to the 10 channels, followed by a flatten. But that conv is actually working on a one by one input. So an alternate way, which I think makes it clearer, is to flatten first and then use a linear layer — because a conv on a one by one input is identical to a linear layer. If that doesn't immediately make sense, that's totally fine, but this is one of those places where you should pause and have a little stop and think about why a conv on a one by one is the same, and maybe go back to the Excel spreadsheet if you like, or the Python from-scratch conv we did, because this is a very important insight. So I think it's very useful, with a more complex model like this, to take a good look at it to see exactly what the inputs and outputs of each layer are. Here's a little function called print_shape, which takes the things that a hook takes, and we will print out, for each layer, the name of the class, the shape of the input, and the shape of the output. So we can get our model, create our learner, use our handy little hooks context manager we built in an earlier lesson, and call the print_shape function; then we call fit for one epoch. If we use the single batch callback, it will just do a single batch — pass it through — and that hook will, as you see, print out each layer's input shape and output shape. So you can see we're starting with an input of batch size 1024, one channel, 28 by 28. Our first ResBlock was stride 1, so we still end up with 28 by 28, but now we've got eight channels. And then we gradually decrease the grid size — to 14, to 7, to 4, to 2, to 1 — as we gradually increase the number of channels. We then flatten it, which gets rid of that one by one, which allows us then to do the linear down to the 10.
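The shape-printing hook can be sketched like this, collecting into a list instead of printing. The model here is a small stand-in, not the notebook's:

```python
import torch
from torch import nn

# A forward hook receives (module, input, output); input is a tuple.
shapes = []
def print_shape(module, inp, outp):
    shapes.append((type(module).__name__, tuple(inp[0].shape), tuple(outp.shape)))

model = nn.Sequential(nn.Conv2d(1, 8, 3, stride=2, padding=1),
                      nn.Flatten(),
                      nn.Linear(8 * 14 * 14, 10))
hooks = [m.register_forward_hook(print_shape) for m in model]
model(torch.randn(16, 1, 28, 28))   # one batch through the model
for h in hooks: h.remove()           # always remove hooks when done
```

After the forward pass, shapes holds one (class name, input shape, output shape) tuple per layer — exactly the table the lecture walks through.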
There's some discussion about whether you want a batch norm at the end or not; I was finding it quite useful in this case, so we've got a batch norm at the end. I think this view is very useful, so I decided to create a patch for learner called summary that does basically exactly the same thing, but as a markdown table. So if we create a learner with our model and call .summary() — this method is now available because it's been patched into the learner — it does exactly the same thing as our print, but more prettily, using a markdown table if it's in a notebook; otherwise it'll just print it. fastcore has a handy thing for keeping track of whether you're in a notebook, and in a notebook, to make something markdown, you can just use IPython.display.Markdown, as you see. And the other thing that I added, as well as the input and the output, is the number of parameters. We can calculate that, as we've seen before, by summing up the number of elements for each parameter in that module. And I've kept track of that as well, so that at the end I can also print out the total number of parameters. So we've got a 1.2 million parameter model, and you can see that there are very few parameters in the input layers — nearly all the parameters are actually in the last layer. Why is that? Well, you might want to go back to our Excel convolutional spreadsheet to see this: you have a set of parameters for every input channel, they all get added up across each of the three by three in the kernel, and then that's done for every output filter — every output channel — that you want. In fact, let's take a look: let's just create a model and look at the sizes. And you can see here there is this 256 by 256 by 3 by 3 — so that's a lot of parameters.
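The parameter counting the summary patch does is just numel summed per module. A sketch with a stand-in model, showing why the last linear layer dominates:

```python
from torch import nn

# Per-module parameter counts, the same way the summary patch computes them.
model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1),   # 8*1*3*3 + 8 = 80
                      nn.Flatten(),                     # no parameters
                      nn.Linear(8 * 28 * 28, 10))       # 6272*10 + 10 = 62730
per_layer = [sum(p.numel() for p in m.parameters()) for m in model]
total = sum(per_layer)
```

Even in this tiny model, the final linear layer holds the overwhelming majority of the parameters, mirroring the pattern in the summary table.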
Okay, so we can call lr_find on that and get a sense of what kind of learning rate to use. I chose 2e-2, so 0.02. This is our standard training setup — you don't have to watch it train, I've just trained it. And look at this: by using a ResNet we've gone up from 91.7 — this just keeps getting better — to 92.2 in five epochs. So that's pretty nice. And this ResNet is not anything fancy: it's the simplest possible ResBlock, and the model is literally copied and pasted from before, replacing each place that said conv with ResBlock. We've just been thoughtful about it. And here's something very interesting: we can actually try lots of other ResNets by grabbing timm — that's Ross Wightman's PyTorch Image Models library. If you call timm.list_models('*resnet*'), there are a lot of ResNets, and I tried quite a few of them. Now, one thing that's interesting is that if you look at the source code for timm, you'll see that the various different ResNets — like ResNet18, ResNet18D, ResNet10D — are defined in a very nice way, using a very elegant configuration where you can see exactly what's different: there's basically only one line of code different between each type of ResNet, for the main ResNets. And so what I did was I tried all the timm models I could find, and I even tried importing the underlying pieces and building my own ResNets from those pieces, and the best I found was the ResNet18D. If I train it in exactly the same way, I get to 92%. And the interesting thing is, you'll see that's less than our 92.2. And it's not like I tried lots of things to get here — this was the very first thing I tried — whereas this ResNet18D was the best after trying lots and lots of different timm models. So what this shows is that a thoughtfully designed, basic architecture goes a very long way: it's actually better for this problem than any of the PyTorch Image Models ResNets that I could find. So I think that's quite amazing, actually — it's really cool, you know,
and it shows that you can create a state-of-the-art architecture just by using some common sense. So I hope that's encouraging. Anyway, we're up to 92.2%, and we're not done yet, because we haven't even talked about data augmentation. All right, let's keep going. We're going to make everything the same as before, but before we do data augmentation, we're going to try to improve our model even further if we can. Now, I said this ResNet wasn't constructed with any great care and thought, really — we just took the conv net and replaced each conv with a ResBlock, so it's effectively twice as deep, because each conv block has two convolutions. But ResNets train better than conv nets, so surely we could go deeper and wider still. So I thought, okay, how could we go wider? And I thought, well, let's take our model — previously we were going from 8 up to 256 — what if we could get up to 512? One way to do that would be to make our very first ResBlock have not a kernel size of three but a kernel size of five. That means each grid area is going to be five by five — that's 25 inputs — so I think it's fair enough then to have 16 outputs. So if I use a kernel size of five and 16 outputs, then, if I keep doubling as before, I'm going to end up at 512 rather than 256. So the only change I made was to add ks=5 here and then double all the sizes. And if I train that — wow, look at this: 93.7%! So we're getting better still, and again, it wasn't with lots of trying and failing; it was just saying, well, this just makes sense — and the first thing I tried just worked. We're just trying to use these sensible, thoughtful approaches. Okay, the next thing I'm going to try isn't necessarily something to make it better, but something to make our ResNet more flexible. Our current ResNet is a bit awkward, in that the number of stride two layers has to be
exactly big enough that the last of them ends up with a one by one output, so you can flatten it and do the linear. That's not very flexible, because what if you've got images of a different size? 28 by 28 is pretty small. So, to show how to handle that, I've created a get_model2 which goes less far — it has one fewer layer, so it only goes up to 256 despite starting at 16. And because it's got one fewer layer, it's going to end up at two by two, not one by one. So what do we do? Well, we can do something very straightforward, which is to take the mean over the two by two. Taking the mean over the two by two gives us a batch size by channels output, which is what we can then put into our linear layer. This ridiculously simple thing is called a global average pooling layer — that's the Keras term; in PyTorch it's basically the same thing, called an adaptive average pooling layer. In PyTorch you can make it produce an output other than one by one, but nobody ever really uses it that way, so they're basically the same thing. This version is actually a little bit more convenient than the PyTorch one because you don't have to flatten it. So this is global average pooling: you can see here, after our last ResBlock, which gives us a two by two output, we have global average pooling, which just takes the mean, and then we can do the linear and batch norm as usual. So, I wanted to improve my summary patch to include not only the number of parameters but also the approximate number of megaFLOPs. A FLOP is a floating-point operation. I'm not going to promise my calculation is exactly right — I think the basic idea is right — and actually it's not really FLOPs; I actually calculated the number of multiplications.
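A sketch of the equivalence described here: taking the mean over the spatial grid matches nn.AdaptiveAvgPool2d(1) up to the flatten. The GlobalAvgPool class name is mine:

```python
import torch
from torch import nn

class GlobalAvgPool(nn.Module):
    # mean over the spatial grid: (N, C, H, W) -> (N, C)
    def forward(self, x):
        return x.mean((-2, -1))

x = torch.randn(32, 256, 2, 2)                  # e.g. the 2x2 ResBlock output
a = GlobalAvgPool()(x)                          # (32, 256)
b = nn.AdaptiveAvgPool2d(1)(x).flatten(1)       # same values, needed a flatten
```

The output can go straight into a linear layer, regardless of the input grid size — which is exactly the flexibility being added here.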
So it's not perfectly accurate, but it's pretty indicative, I think. So this is the same summary I had before, but I've added an extra thing, which is a flops function, where you pass in the weight matrix and the height and the width of your grid. Now, if the number of dimensions of the weight matrix is less than three, then we're just doing something like a linear layer, so the number of elements is the count, because it's just a matrix multiply. But if you're doing a convolution — so the weight has four dimensions — then you do that matrix multiply for everything in the height by width grid. So that's how I calculate this FLOPs-equivalent number. Okay, so if I run that on this model, we can now see that our number of parameters, compared to the ResNet model, has gone from 1.2 million up to 4.9 million. The reason why is that we've got this ResBlock that gets all the way up to 512, and the way we did this is we made that a stride one layer — that's why you can see here it's gone 2, 2 and it stayed at 2, 2. I wanted to make it as similar as possible to the previous model: it's got the same 512 final number of channels. And so most of the parameters are in that last block, for the reason we just discussed. Interestingly, though, it's not as clear-cut for the megaFLOPs: that block is the greatest of them, but while in terms of number of parameters it has more than all the other ones added together, by a lot, that's not true of megaFLOPs. That's because this first layer has to be done 28 by 28 times, whereas this layer only has to be done 2 by 2 times. Anyway, I tried training that and got a pretty similar result, 92.6, and that made me think: let's fiddle around with this a little bit more, to see what kinds of things would reduce the number of parameters and the megaFLOPs. The reason you care about reducing the number of parameters is that it means lower memory requirements.
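The multiplication count can be sketched like this — count_mults is my name for it; the lecture's flops function is similar in spirit:

```python
import torch
from torch import nn

def count_mults(weight, h, w):
    # 2-d weight (linear layer): just the number of elements in the matrix.
    # 4-d weight (conv): that multiply is repeated for every output grid cell.
    if weight.ndim < 3:
        return weight.numel()
    return weight.numel() * h * w

conv = nn.Conv2d(16, 16, 5, padding=2)          # like the first block's second conv
lin = nn.Linear(256, 10)
conv_mults = count_mults(conv.weight, 28, 28)   # applied across the 28x28 grid
lin_mults = count_mults(lin.weight, 1, 1)
```

The 16-to-16, five-by-five conv on the 28 by 28 grid comes to about 5.0 million multiplies; adding the block's first conv (1 to 16 channels, about 0.3 million) gives roughly the 5.3 megaFLOPs figure discussed below, which is why that first block dominates the compute.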
In this case, what I've done here is remove the line of code that takes it up to 512 channels. That means we don't have that layer anymore, and so the number of parameters has gone down from 4.9 million to 1.2 million — not a huge impact on the megaflops, but a huge impact on the parameters; we've reduced them by something like three quarters by getting rid of that. And if we look at the very first ResNet block — why is it 5.3 megaflops? It's because, although the very first conv starts with just one channel, remember our ResNet blocks have two convs, so the second conv is going to be 16 by 16 by 5 by 5. I'm partly doing this to show you the actual details of this architecture, but I'm also showing it so you can see how to investigate exactly what's going on in your models — and I really want you to try these things yourself. So if we train that one — interestingly, even though it's only about a quarter of the size, we get the same accuracy, 92.7. That's interesting. Can we make it faster? At this point, the obvious place to look is that first ResNet block, because that's where the megaflops are, and as I said, the reason is that it's got two convs, the second one being 16 channels in, 16 channels out with 5x5 kernels, and it has to do that across the whole 28x28 grid. That's the biggest chunk of the compute. So what we could do is replace that res block with just one convolution, and if we do that, you'll see we've now got rid of the 16 by 16 by 5 by 5 and just have the 16 by 1 by 5 by 5, so the number of megaflops has gone down from 18.3 to 13.3. The number of parameters hasn't really changed at all, because that block's parameters were only about 6,800. So be
very careful when you see people say "my model has fewer parameters": that doesn't necessarily mean it's faster. It really doesn't — there's no particular relationship between parameter count and speed. Even counting megaflops doesn't always work that well, because it doesn't take account of the amount of data moving through memory, but it's not a bad approximation here. So here's a model with far fewer megaflops, and in this case about the same accuracy as well. I think this is really interesting: we've managed to build a model that has far fewer parameters, far fewer megaflops, and basically exactly the same accuracy. That's a really important thing to keep in mind. And remember, this is still way better than the ResNet-18d from timm — we've built something that is fast, small and accurate. So the obvious question is: what if we train for longer? The answer is, if we train for 20 epochs — we won't make you wait for it — the training accuracy gets up to 0.999, but the validation accuracy is worse: 0.924. The reason is that after 20 epochs it's seen the same pictures so many times it's just memorizing them, and once you start memorizing, things actually go downhill. So we need to regularize. Now, something we have claimed in the past can regularize is weight decay, but here's where I'm going to point out that weight decay doesn't regularize at all if you use batch norm. It's fascinating — for years people didn't even seem to notice this, and then somebody finally wrote a paper pointing it out, and everyone went "oh wow, that's weird". But it's really obvious when you think about it: a batch norm layer has a single set of coefficients which multiplies an entire layer. That set of coefficients could just be, say, the number 100 in every place, and that's going to multiply the entire previous weight matrix — the convolution kernels — by 100.
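One way to see why weight decay loses its grip: at training time, batch norm normalizes by the batch statistics, so rescaling the preceding weights (and hence the pre-norm activations) by any constant leaves the output unchanged — and the handful of batch norm gains then set the effective scale almost for free. A tiny NumPy sketch of that invariance (simplified: no learned gain/bias):

```python
import numpy as np

def batchnorm(z, eps=1e-5):
    # training-time batch norm without gain/bias: normalize each
    # feature by the batch mean and standard deviation
    return (z - z.mean(0)) / (z.std(0) + eps)

rng = np.random.default_rng(0)
z = rng.normal(size=(64, 16))      # pretend pre-norm activations
a = batchnorm(z)
b = batchnorm(100.0 * z)           # as if the preceding weights were scaled 100x
# a and b are (almost exactly) identical: weight decay shrinking the
# weights changes nothing about the function the network computes
```

So the penalty can push the conv weights toward zero without constraining the network at all, while the batch norm coefficients quietly carry the scale.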
As far as weight decay is concerned, that's not much of an impact at all, because the batch norm layer has very few weights, so it doesn't contribute much to the weight decay penalty — but it massively increases the effective scale of the weight matrix. Batch norm basically lets the neural net cheat, increasing the effective parameters nearly as much as it wants, indirectly, just by changing the batch norm layer's weights. So weight decay is not going to save us, and that's something really important to recognize. With batch norm layers, I don't see the point of it at all. There have been some studies of what it does, and it does have some weird second-order effects on the learning rate, but I don't think you should rely on them — you should use a scheduler for changing the learning rate, rather than weird second-order effects caused by weight decay. So instead we're going to do data augmentation, which is where we modify every image a little bit by some random change, so that the model doesn't see the same image each time. Now, there's not any particular reason to implement these from scratch, to be honest. We have implemented them all from scratch in fastai, so you can certainly look them up if you're interested, but it's a little bit separate from what we're meant to be learning about, so I'm not going to go through it. If you're interested, go into fastai.vision.augment and you'll be able to see, for example, how we do flip — it's just x.transpose, which is not that interesting — how we do cropping and padding, how we do random crops, and so on. fastai's are probably the best implementations of these, but torchvision's are fine, so we'll just use those. We've created before a batch transform callback, and we
used it for normalization, if you remember. So what we can do is create a transform-batch function which transforms the inputs and the outputs using two different functions — that would be an augmentation callback. Then you'd say: for the transform-batch function, in this case we want to transform our x's. And how do we want to transform them? Using this module, which is a sequential module that first does a random crop and then a random horizontal flip. Now, it seems weird to randomly crop a 28x28 image to get a 28x28 image, but we can add padding to it, so effectively it randomly adds padding on one or both sides to do this kind of random crop. One thing I did to change the batch transform callback — I can't remember if I've mentioned this before, but it's something I changed slightly since we first wrote it — is I added on_train and on_validate, so that it only runs if you said you want it on training and it's training, or you want it on validation and it's not training. And that's all the code there is. Data augmentation generally speaking shouldn't be done on validation, so we set on-validation to false. Okay, so what I'm going to do first of all is use our classic single-batch callback trick and fit — in fact, even better, fit one epoch, just doing training. Then, after I fit, I can grab the batch out of the learner. This is quite cool: it's a way I can see exactly what the model sees, not relying on any approximations. Remember, when we fit, the learner puts the batch it looks at into learn.batch, so if we fit for a single batch, we can then grab that batch back out and call show_images. And here you can see the little crop it's added. Now, something you'll notice is that every single image in this batch (this shows the first 16 — I don't want to show you all 1024) has exactly the same augmentation.
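The on_train / on_validate logic is small enough to sketch in a few lines. This is a simplified, hypothetical stand-in for the miniai callback — names and details differ in the real one:

```python
class BatchTransformCB:
    # apply a transform to learn.batch, but only in the phases asked for
    def __init__(self, tfm, on_train=True, on_val=False):
        self.tfm, self.on_train, self.on_val = tfm, on_train, on_val

    def before_batch(self, learn):
        if (self.on_train and learn.training) or (self.on_val and not learn.training):
            learn.batch = self.tfm(learn.batch)

# toy demonstration with a stand-in learner and a stand-in "augmentation"
class Learner: pass
learn = Learner()
learn.training, learn.batch = True, [1, 2, 3]
cb = BatchTransformCB(lambda b: [x * 10 for x in b], on_train=True, on_val=False)
cb.before_batch(learn)    # applied: we're training
learn.training = False
cb.before_batch(learn)    # skipped: on_val is False
```

In the notebook, the transform would be something along the lines of a torchvision nn.Sequential of RandomCrop(28, padding=...) and RandomHorizontalFlip, applied to the whole GPU batch at once.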
And that makes sense, because we're applying a batch transform. Now, why is this good and why is it bad? It's good because it's running on the GPU, which is great, because nowadays it's often really hard to get enough CPU to feed your fast GPU quickly enough, particularly if you use something like Kaggle or Colab, which are really underpowered for CPU — particularly Kaggle. This way, all of our augmentation happens on the GPU. On the downside, it means there's a little bit less variety: every mini-batch has the same augmentation. I don't think the downside matters, though, because the model is going to see lots of mini-batches, and the fact that each mini-batch gets a different augmentation is all I really care about. If we run this multiple times, you can see it gets a different augmentation on each mini-batch. Okay, so I decided I'm actually just going to use a padding of one — a very, very small amount of data augmentation — and do 20 epochs using the one-cycle learning rate. This takes quite a while to train, so we won't watch it, but check this out: we get to 93.8. That's pretty wild. I actually went on Twitter (if you're watching this and Twitter doesn't exist anymore, ask somebody to tell you about what Twitter used to be — it still does, for now) and I said to the entire world: can anybody beat this in 20 epochs? You can use any model you like, any library you like. And nobody's got anywhere close. So this is pretty amazing, and when I had a look at Papers with Code, you can see it's right up there with the best models listed — certainly better than these ones — and the better models all use 250 or more epochs. So I'm
hoping that somebody watching this will find a way to beat it in 20 epochs. That would be really great, because as you can see, we haven't done anything amazingly, weirdly clever — it's all very basic. And actually, we can go even a bit further than 93.8. Just before we do: since this is actually taking a while to train now — I can't remember exactly, maybe 10 to 15 seconds per epoch, so you're waiting a few minutes — you may as well save the model. You can just call torch.save on a model and load it back later. So, something that can make things even better is called test time augmentation — I should write this out properly: test time augmentation. Test time augmentation runs our batch transform callback on validation as well. In this case we're going to do a very, very simple version: we add a batch transform callback that runs on validate, and it's not random — it just always does a horizontal flip. So check this out. We create a new callback called CapturePreds, and after each batch it appends the predictions to one list and the targets to another. That way we can call learn.fit with train=False, and it shows us the accuracy — this is just the same number we saw before. But then we can do the same thing with a different callback, the horizontal flip callback, so it does exactly the same thing as before except every image is horizontally flipped. Weirdly enough, that accuracy is slightly higher — but that's not the interesting bit. The interesting bit is that we've now got two sets of predictions.
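With the two captured prediction lists, the TTA step is just stack-and-average. Here's a toy sketch with made-up numbers (in the notebook this is torch.stack on the real captured predictions):

```python
# made-up 2-class "probabilities" for 4 samples, from two passes
preds_plain   = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.7, 0.3]]
preds_flipped = [[0.8, 0.2], [0.7, 0.3], [0.6, 0.4], [0.6, 0.4]]
targets = [0, 0, 1, 0]

def accuracy(preds, targets):
    return sum(p.index(max(p)) == t for p, t in zip(preds, targets)) / len(targets)

# each pass alone gets 3 of 4 right, but their errors differ, so
# averaging the two prediction sets gets all 4
avg = [[(a + b) / 2 for a, b in zip(p, q)] for p, q in zip(preds_plain, preds_flipped)]
```

The numbers are contrived, of course, but they show the mechanism: the two views make different mistakes, and the average smooths them out.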
We've got the set of predictions from the unflipped version and the set from the flipped version, and what we can do is stack those together and take the mean. Taking the average of the flipped and unflipped predictions gives us a better result still: 94.2%. So why is it better? Because looking at the image from multiple different directions gives the model more opportunities to understand what it's a picture of. In this case I'm just giving it two directions — the flipped and unflipped versions — and taking their average. So this is a really nice little trick. Sam points out in the chat that it's a bit like random forests, which is true: it's a kind of bagging, where we get multiple predictions and bring them together. So 94.2 is my best 20-epoch result, and notice I didn't have to do any additional training, so it still counts as a 20-epoch result. You can do test time augmentation with a much wider range of the augmentations you trained with, and use those at test time as well — more crops or rotations or warps or whatever. Now I want to show you one of my favourite data augmentation approaches, which is called random erasing. I'll show you what it's going to look like: we're basically going to delete a little bit of each picture and replace it with some random Gaussian noise. In this case we've just got one patch, but eventually we're going to do more than one. I wanted to implement this because, remember, we have to implement everything from scratch, and this one's a bit less trivial than the previous transforms, so we should do it from scratch. Also, I'm not sure there are that many good implementations — Ross Wightman's timm has one, I think. And it's also a very good exercise to see how to
implement this from scratch. So let's grab a batch out of the training set, take the first 16 images, and grab the mean and standard deviation. What we want to do is delete a patch from each image — but actually deleting it would change the statistics: if we set those pixels to zero, the mean and standard deviation are not going to be zero and one anymore. But if we replace them with pixels that have exactly the same mean and standard deviation as our dataset, it won't change the statistics. That's why we grabbed the mean and standard deviation. So say we want to delete 20 percent of the height and width. Let's work out how big that is — 0.2 of the height and of the width — that's the size in x and y. Then for the starting point we just randomly grab some position; in this case the starting point for x is 14 and for y is zero, and it's going to be a 5x5 patch. Then we do a Gaussian (normal) initialization of that slice of the mini-batch — everything in the batch, every channel, for this x slice and y slice — with that mean and standard deviation of normal random noise. And that's all the code is — just those few lines. You'll see I don't start by writing a function: I start by writing single lines of code that I can run independently and check that they all work, looking at the pictures to make sure it's doing the right thing. Now, one thing that's wrong here is that some images look black and some look gray. At first this confused me — what's changed? The original images didn't look like that. And I realized the problem is that the minimum and maximum have changed: it used to be from about negative 0.8
to 2 — that was the previous min and max — and now it goes from -3 to 3. The noise we've added has the same mean and standard deviation, but it doesn't have the same range, because the pixels were not normally distributed originally — so normally distributed noise is actually wrong. To fix that I created a new version, putting it in a function now. It does all the same stuff as before, but it clamps the random pixels to be between the min and max, so it's exactly the same thing except it makes sure it doesn't change the range. That's really important, I think, because changing the range really impacts your activations quite a lot. Here's what that looks like: as you can see, now all the backgrounds have that nice black, and it's still giving me random pixels. And I can check — because of the clamping and so on, the mean and standard deviation aren't quite zero and one, but they're very, very close, so I'm going to call that good enough. And of course the min and max haven't changed, because I clamped them to ensure they didn't. So that's my random erasing, which randomly erases one block. Then I can create a rand_erase that randomly chooses up to, in this case, four blocks. With that function — oh, that's annoying, it happened to be zero this time; we'll just run it again — this time it's got, let me count, one, two, three, four blocks. So that's what this data augmentation looks like. Then we can create a class to do it: you pass in what percentage to erase in each block and the maximum number of blocks, store those away, and in the forward we just call our random erase function, passing in the input and those parameters. Great — so now we can use random crop, random flip and random erase, make sure a batch looks okay, and now we're going to go all the way up to 50 epochs.
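Put together as a function, the clamped version looks roughly like this — a NumPy sketch of the idea; the notebook works on GPU tensors with in-place normal_ and clamping:

```python
import numpy as np
rng = np.random.default_rng(0)

def rand_erase_block(x, pct=0.2):
    # x: (batch, channels, h, w); overwrite one random pct*h by pct*w
    # patch with noise matching the batch mean/std, clipped to the
    # batch min/max so the pixel range is preserved
    n, c, h, w = x.shape
    szy, szx = int(pct * h), int(pct * w)
    sty = int(rng.integers(0, h - szy))
    stx = int(rng.integers(0, w - szx))
    noise = rng.normal(x.mean(), x.std(), (n, c, szy, szx))
    x[:, :, sty:sty + szy, stx:stx + szx] = noise.clip(x.min(), x.max())
    return x
```

Calling it k times, with k drawn randomly up to some maximum, gives the multi-block version described above.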
Running this for 50 epochs, I get 94.6 — isn't that crazy? We're really right up there now; we're even above this one, so we're somewhere up here, and this is the kind of stuff people wrote papers about in 2019 and 2020. Oh look, here's the random erasing paper — that's cool; they were way ahead of their time in 2017, though that would have trained for a lot longer. Now, I was having a think, and I realized something: how do we actually get the correct distribution? In some ways it shouldn't matter, but I was bothered by the fact that we don't actually end up with exactly zero-one, and the clamping all feels a bit weird. How do we replace these pixels with something that's guaranteed to have the correct distribution? And I realized there's a very simple answer: we could copy another part of the picture over. If we copy part of the picture, we're guaranteed to have the correct distribution of pixels. It wouldn't exactly be random erasing anymore — it would be random copying. Now, I'm sure somebody else has invented this — I'm not claiming nobody's ever thought of it before, so if anybody knows a paper that's done this, please tell me — but I think it's a very sensible approach, and it's very, very easy to implement. So again we're going to implement it all manually: get our x mini-batch, get our size again, and get the x, y that we're going to be — not erasing this time, but copying to. Then we randomly get a different x, y to copy from, and now, instead of putting in random noise, we just replace this slice of the batch with that slice of the batch. We end up with — you can see here it's copied little bits across. Some of them you can't really see and some you can, I think because some of them are black and it's copied black over black.
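The copying variant is almost the same sketch, minus the noise (again a NumPy stand-in for the tensor version in the notebook):

```python
import numpy as np
rng = np.random.default_rng(1)

def rand_copy_block(x, pct=0.2):
    # overwrite one random patch of each image with a different randomly
    # chosen patch of the same image, so the replacement pixels are
    # guaranteed to follow the data's own distribution
    n, c, h, w = x.shape
    szy, szx = int(pct * h), int(pct * w)
    sty1, stx1 = int(rng.integers(0, h - szy)), int(rng.integers(0, w - szx))
    sty2, stx2 = int(rng.integers(0, h - szy)), int(rng.integers(0, w - szx))
    # copy the source patch first, in case the two slices overlap
    src = x[:, :, sty2:sty2 + szy, stx2:stx2 + szx].copy()
    x[:, :, sty1:sty1 + szy, stx1:stx1 + szx] = src
    return x
```

Since every written pixel already existed somewhere in the image, the mean, standard deviation and range are all preserved by construction — no clamping needed.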
But I guess it's knocked a bit off the end of this shoe, and added a little bit extra here, and a little bit extra here. So again we turn it into a function once I've tested it in the REPL and made sure it works — obviously in this case it's copying largely from regions that are largely black for a lot of them. Then again we can do the thing where we apply it multiple times, and here we go — now it's got a couple of random copies. So again we turn that into a class, create our transforms, and look at a batch to make sure it looks sensible. I did it for just 25 epochs here, and it gets to 94 percent. Now, why 25 epochs? Because I was trying to think about how to beat my 50-epoch record, which was 94.6, and I thought: what I could do is train for 25 epochs, then train a whole new model for a different 25 epochs and put it in a different learner, learn2. This one is 94.1 — so one of the models was 94.1 and one was 94. You can maybe guess what we're going to do next: it's a bit like test time augmentation, but instead we grab the predictions of our first learner and the predictions of our second learner, stack them up and take their mean. This is called ensembling, and not surprisingly, the ensemble is better than either of the two individual models, at 94.4 — although, unfortunately, I'm afraid to say we didn't beat our best. But it's a useful trick. In this case I was trying something a bit interesting: using the exact same total number of epochs, can I get a better result by ensembling instead of training for longer? And the answer was, I couldn't. Maybe it's because the random copy isn't as good, or maybe I'm using too much augmentation — who knows — but it's something you could experiment with. Someone mentions in the chat that CutMix is similar to this, which
is actually a good point — I'd forgotten CutMix. Yes, CutMix copies from different images rather than from the same image, but it's pretty much the same thing — ish; very similar. All right, so that brings us to the end of the lesson, and I am so pumped and excited to share this with you, because I don't know that this has ever been done before — even in our previous courses we've never done this — going from scratch, step by step, to an absolute state-of-the-art model where we build everything ourselves, and it runs this quickly, and we're even using our own custom ResNet, just using common sense at every stage. Hopefully that shows that deep learning is not magic: we can actually build the pieces ourselves. And as you'll see, going up to larger datasets, absolutely nothing changes — it's exactly these techniques. I actually do 99 percent of my research on very small datasets, because you can iterate much more quickly and understand them much better, and I don't think there's ever been a time when I've then gone up to a bigger dataset and my findings didn't continue to hold true. Now, homework. What I would really like you to do is the thing that I didn't do: create your own schedulers that work with PyTorch optimizers. The tricky bit will be making sure you understand the PyTorch API well, which I've really laid out here, so study it carefully. Create your own cosine annealing scheduler from scratch, then create your own one-cycle scheduler from scratch, and make sure they work correctly with this batch scheduler callback. This will be a very good exercise in — hopefully — getting extremely frustrated as things don't work the way you hoped, being mystified for a while, and then working through it.
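To give a feel for the homework, here's just the shape of the cosine annealing schedule on its own — hypothetical names, pure math; wiring it into a real torch.optim optimizer's param_groups and the batch scheduler callback is the actual exercise:

```python
import math

class CosineSched:
    # anneal from base_lr down to 0 over total_steps, following half a cosine
    def __init__(self, base_lr, total_steps):
        self.base_lr, self.total, self.n = base_lr, total_steps, 0

    def step(self):
        lr = self.base_lr * (1 + math.cos(math.pi * self.n / self.total)) / 2
        self.n += 1
        return lr

sched = CosineSched(1.0, 100)
lrs = [sched.step() for _ in range(101)]
# lrs starts at 1.0, passes through 0.5 halfway, and ends at 0.0
```

A one-cycle schedule is then two pieces glued together — a warm-up phase followed by an annealing phase — applied to the learning rate (and, in the full version, inversely to momentum).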
Using this very step-by-step approach — lots of experimentation, lots of exploration — and then figuring it out: that's the journey I'm hoping you have. If it's all super easy and you get it first go, then you'll have to find something else to do; but I'm hoping you'll find it surprisingly tricky to get everything working properly, and that in the process you'll do a lot of exploration and experimentation and realize it requires no prerequisite knowledge at all. If it doesn't work first time, it's not because there's something you didn't learn in graduate school — if only you had done a PhD, or whatever. It's just that you need to dig through it slowly and carefully to see how it all works. And then see how neat and concise you can get it. The other homework is to try to beat me — I really, really want people to beat me — on the 5-epoch, 20-epoch or 50-epoch Fashion-MNIST results, ideally using miniai with things you've added yourself. You can try grabbing other libraries if you like, but ideally, if you do grab another library and find you can beat my approach, try to re-implement that library — that way you're still within the spirit of the game. Okay, so in our next lesson, Jono and Tanishk and I are going to be putting this all together to create a diffusion model from scratch — and we're actually going to take a couple of lessons for this: not just a diffusion model, but a variety of interesting generative approaches. So we're starting to come full circle. Thank you so much for joining me on this very extensive journey, and I look forward to hearing what you come up with. Please do come and join us on forums.fast.ai and share your progress. Bye!