Okay, so welcome back to Deep Learning lesson two. Last week we got to the point where we had successfully trained a pretty accurate image classifier. And so just to remind you about how we did that: can you guys all see the screen okay? We can turn just these lights off, can we? Don't pitch us all into darkness. Okay, that's better, isn't it? Great. So, just to remind you, the way that we built this image classifier was with a small amount of code, basically three lines of code, and these three lines of code pointed at a particular path which already had some data in it. And so the key thing for it to know how to train this model was that this path, which was data/dogscats, had to have a particular structure, which is that it had a train folder and a valid folder, and in each of those train and valid folders there was a cats folder and a dogs folder, and in each of the cats and dogs folders was a bunch of images of cats and dogs. So this is pretty standard; it's one of the two main structures that are used to say: here is the data that I want you to train an image model from. So I know some of you during the week went away and tried different datasets, where you had folders with different sets of images in them and created your own image classifiers, and generally that seems to be working pretty well from what I can see on the forums. So to make it clear, at this point, this is everything you need to get started.
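The folder layout described above can be sketched in plain Python; this is just an illustration of the convention (folder name equals label), not fastai code, and the temporary path is only there so the sketch is self-contained:

```python
from pathlib import Path
import tempfile

# Build the directory layout the lesson describes:
# PATH/train/{cats,dogs} and PATH/valid/{cats,dogs}, each holding images.
root = Path(tempfile.mkdtemp()) / "dogscats"
for split in ("train", "valid"):
    for label in ("cats", "dogs"):
        (root / split / label).mkdir(parents=True)

# The label of each image is simply the name of the folder it sits in.
labels = sorted(p.name for p in (root / "train").iterdir())
print(labels)  # ['cats', 'dogs']
```

The point is that no separate label file is needed: the library can read the class names straight off the directory structure.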
So if you create your own folders with different sets of images, you know, a few hundred or a few thousand in each folder, and run the same three lines of code, that'll give you an image classifier, and you'll be able to see that this third column tells you how accurate it is. So we looked at some simple visualizations to see what it was uncertain about, what it was wrong about, and so forth, and that's always a really good idea. And then we learned about the one key number you have to pick. So this number here is the one key number, it's 0.01, and this is called the learning rate. And so I wanted to go over this again, and we'll learn about the theory behind what this is during the rest of the course in quite a lot of detail. But for now, I just wanted to talk about the practice. Yes, Yannet? Well, they cannot see you in the video, I guess. They can now, I just turned it around. Oh, okay, perfect. I also was wondering, could you tell us about the other three numbers in that output? These three here? We're going to talk about the other ones shortly. So the main one we're going to look at for now is the last column, which is the accuracy. The first column, as you can see, is the epoch number. So this tells us how many times it has been through the entire dataset, trying to learn a better classifier. And then the next two columns are what's called the loss, which we'll be learning about either later today or next week. The first one is the loss on the training set; these are the images that we're looking at in order to try to make a better classifier. And the second is the loss on the validation set; these are the images that we're not looking at when we're training, but just setting aside to see how accurate we are. So we'll learn more about loss and accuracy later.
Okay, so we've got the epoch number, the training loss is the second column, the validation loss is the third column, and the accuracy is the fourth column. Okay, so the basic idea of the learning rate is that it's the thing that decides how quickly we hone in on the solution. And so I find that a good way to think about this is to ask: what if we were trying to fit a function that looks something like this, right? And we're trying to say, okay, whereabouts is the minimum point? This is basically what we do when we do deep learning: we try to find the minimum point of a function. Now, our function happens to have millions or hundreds of millions of parameters, but it works the same basic way. And so when we look at it, we can immediately see that the lowest point is here. But how would you do that if you were a computer algorithm? What we do is we start out at some point at random, so we pick, say, here, and we have a look and we say, okay, what's the loss, or the error, at this point? And we ask, what's the gradient? In other words, which way is up and which way is down? And it tells us that down is going to be in that direction. It also tells us how fast it's going down, which at this point is pretty quickly. And so then we take a step in the direction that's down, and the distance we travel is going to be proportional to the gradient, that is, proportional to how steep it is. The idea is that if it's steeper, then we're probably further away. That's the general idea, right? And so specifically, what we do is we take the gradient, which is how steep it is at this point, and we multiply it by some number, and that number is called the learning rate, okay? So if we pick a number that is very small, then we're guaranteed that we're going to get a little bit closer and a little bit closer and a little bit closer each time, right?
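The update rule just described can be written in a few lines. This is a deliberately tiny one-dimensional sketch (the loss function and its derivative are made up for illustration), but the step is exactly the one in the lecture: move by minus the learning rate times the gradient.

```python
# Gradient descent on the toy loss f(x) = (x - 3)^2, whose minimum is at x = 3.
def grad(x):
    return 2 * (x - 3)  # derivative of (x - 3)^2

x, lr = 10.0, 0.1       # random-ish starting point and a small learning rate
for _ in range(100):
    x -= lr * grad(x)   # step downhill, proportional to the gradient

print(round(x, 4))  # 3.0 -- converged to the minimum
```

With a small learning rate like 0.1 this reliably creeps toward the minimum; the next paragraph explains what goes wrong when the learning rate is too large.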
But it's going to take us a very long time to eventually get to the bottom. If we pick a number that's very big, we could actually step too far. We could go in the right direction, but we could step all the way over to here, right? As a result of which we end up further away than we started, and we could oscillate, getting worse and worse. So if you start training a neural net and you find that your accuracy or your loss is shooting off towards infinity, almost certainly your learning rate is too high. So in a sense, a learning rate that's too low is a better problem to have, because you just have to wait a long time. But wouldn't it be nice if there was a way to figure out the best learning rate, something where you could quickly go boom, boom, boom, right? And so that's why we use this thing called a learning rate finder. Remember the term mini-batch? A mini-batch is the few images we look at each time, so that we're using the parallel processing power of the GPU effectively; we generally look at around 64 or 128 images at a time. What the learning rate finder does is, for each mini-batch, which is labelled here as an iteration, it gradually increases the learning rate, in fact multiplicatively increases it. We start at really, really tiny learning rates to make sure that we don't start at something too high, and we gradually increase it. And so the idea is that eventually the learning rate will be so big that the loss will start getting worse. And so what we're going to do then is look at the plot of learning rate against loss, right? So when the learning rate is tiny, the loss improves slowly, then it starts to improve a bit faster, and then eventually it stops improving as quickly, and in fact starts getting worse, right? So clearly here, and make sure you're familiar with this scientific notation, okay?
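The mechanics of the learning rate finder can be sketched in miniature. This is not the fastai implementation, just a toy on a made-up one-dimensional loss: multiply the learning rate by a constant factor every "mini-batch", record (learning rate, loss) pairs, and stop once the loss blows up. The function name `lr_find` and all the constants are illustrative.

```python
# Toy learning-rate finder on the 1-D loss f(x) = x^2 (gradient 2x).
def lr_find(start_lr=1e-5, factor=1.3, steps=60):
    x, lr = 5.0, start_lr
    history = []
    for _ in range(steps):
        x -= lr * 2 * x          # one gradient step at the current learning rate
        loss = x * x
        history.append((lr, loss))
        if loss > 1e3:           # loss has blown up: the rate got too high, stop
            break
        lr *= factor             # multiplicative increase per mini-batch
    return history

hist = lr_find()
lrs = [lr for lr, _ in hist]
losses = [loss for _, loss in hist]
```

Plotting `lrs` against `losses` would give the same shape as the lecture's plot: flat at first, improving fastest somewhere in the middle, then diverging at the end.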
So 10 to the negative 1 is 0.1, and 10 to the negative 2 is 0.01. When we write this in Python, we'll generally write it like this: rather than writing 10 to the negative 1 or 10 to the negative 2, we'll just write 1e-1 or 1e-2, okay? They mean the same thing; you're going to see that all the time. So remember that these equal 0.1 and 0.01, right? So don't be confused by this text that it prints out here. This loss here is the final loss at the very end, and it's not of any interest, right? So ignore this. It's only interesting when we're doing regular training, but not for the learning rate finder. The thing that's interesting for the learning rate finder is this learn.sched.plot. And specifically, we're not looking for the point where the loss is lowest, right? Because at the point where it's lowest, it's actually not getting better anymore, so that's too high a learning rate. So I generally look to see where it's lowest, and then I go back one order of magnitude. So 1e-2 would be a pretty good choice, okay? And that's why, when we ran our fit here, we picked 0.01, right? Which is 1e-2. So an important point to make here is that this is the one key number that we've learned to adjust. And if you just adjust this number and nothing else, most of the time you're going to be able to get pretty good results. And this is a very different message to what you would hear or see in any textbook or any video or any course. Because up until now, there have been dozens and dozens of these things, they're called hyperparameters, dozens and dozens of hyperparameters to set, and they've been thought of as highly sensitive and difficult to set. So inside the fastai library, we do all that stuff for you as much as we can. And during the course, we're going to learn that there are some more we can tweak to get slightly better results.
But it's kind of a funny situation here, because for those of you who haven't done any deep learning before, it's like, oh, that's all there is to it, this is very easy. And then when you talk to people outside this class, they'll say deep learning is so difficult, there's so much to set, it's a real art form. And so that's why there's this difference, right? The truth is that the learning rate really is the key thing to set. And as for this trick for figuring out how to set it: although the paper is now probably 18 months old, almost nobody knows about it. It was from a guy who's not from a famous research lab, so most people kind of ignored it. And in fact, even this particular technique was one subpart of a paper that was about something else. So again, this idea of how you can set the learning rate: really, just about nobody outside this classroom knows about it. Obviously, the guy who wrote it, Leslie Smith, knows about it. So it's a good thing to tell your colleagues about: here is actually a great way to set the learning rate. There have even been papers about this; one of the famous ones is called "No More Pesky Learning Rates", which actually is a less effective technique than this one. But this idea that setting learning rates is very difficult and fiddly has been true for most of deep learning history. So here's the trick, right? Go look at this plot, find roughly the lowest point, go back about a multiple of 10, and try that, right? And if that doesn't quite work, you can always try going back another multiple of 10. But this has always worked for me so far. Why does this learning rate method work versus something else, like momentum-based approaches? What are the advantages and disadvantages of this technique versus something else? That's a great question.
So we're going to learn during this course about a number of ways of improving gradient descent, like you mentioned: momentum, and Adam, and so forth. This is orthogonal, in fact. So one of the things the fastai library tries to do is figure out the right gradient descent version, and in fact, behind the scenes, this is actually using something called Adam. And so this technique is telling us the best learning rate to use, given whatever other tweaks you're using, in this case the Adam optimizer. So it's not that there's some compromise between this and some other approaches; this sits on top of those approaches, and you still have to set the learning rate when you use other approaches. So we try to find the best kind of optimizer to use for a problem, but you still have to set the learning rate, and this is how we can do it. And in fact, this idea of using this technique on top of more advanced optimizers like Adam, I haven't even seen mentioned in a paper before. So I think this is, I mean, it's not a huge breakthrough, it seems obvious, but nobody else seems to have tried it. And as you can see, it works well. When we use optimizers like Adam which have adaptive learning rates, when we set this learning rate, is it an initial learning rate, since it changes during the epoch? So we're going to be learning about things like Adam, the details of it, later in the class, but the basic answer is no. Even with Adam there actually is a learning rate; it's just being divided by, basically, the average of the previous gradients and also the recent sum of squares of the gradients. So there's still a number called the learning rate. Even the so-called dynamic learning rate methods still have a learning rate. Okay, so the most important thing that you can do to make your model better is to give it more data.
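The answer above, that Adam still has a learning rate, it just rescales the gradient by running averages, can be made concrete with the standard Adam update (these are the textbook formulas from the Adam paper, not fastai's internals; the toy loss is the same illustrative quadratic as before):

```python
import math

# One Adam step for a single parameter: lr is still there, but the raw
# gradient is replaced by a running mean of gradients divided by the square
# root of a running mean of squared gradients (with bias correction).
def adam_step(x, g, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # running mean of gradients
    v = b2 * v + (1 - b2) * g * g      # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)          # bias-corrected estimates
    v_hat = v / (1 - b2 ** t)
    x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v

# Minimise (x - 3)^2 with Adam; note that we still had to choose lr.
x, m, v = 10.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * (x - 3), m, v, t)
```

The point of the sketch is simply that `lr` appears explicitly in the update, which is why the learning rate finder is still useful on top of Adam.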
So the challenge is that these models have hundreds of millions of parameters, and if you train them for a while, they start to do what's called overfitting. Overfitting means that they start to learn the specific details of the images you're giving them, rather than the more general patterns that can transfer across to the validation set. So the best thing we can do to avoid overfitting is to find more data. Now, obviously one way to do that would be to collect more data from wherever you're getting it, or to label more data. But a really easy thing that we should always do is use something called data augmentation. Data augmentation is one of these things that in many courses isn't even mentioned at all, or if it is, it's an advanced topic right at the end, but actually it's like the most important thing that you can do to make a better model. And so it's built into the fastai library to make it very easy to do. We're going to look at the details of the code shortly, but the basic idea is that in our initial code we had a line that said ImageClassifierData.from_paths, and we passed in the path to our data, and for transforms we passed in basically the size and the architecture. We'll look at this in more detail shortly. We just add one more parameter, which is what kind of data augmentation you want to do. And so to understand data augmentation, it's maybe easiest to look at some pictures of it. So what I've done here, and again we'll look at the code in more detail later, is I've built the data object multiple times; I'm going to do it six times, and each time I'm going to plot the same cat. And you can see that what happens is that this cat here is further over to the left, and this one here is further over to the right, and this one here is flipped horizontally, and so forth.
So, data augmentation: different types of image are going to want different types of data augmentation. For example, if you were trying to recognize letters and digits, you wouldn't want to flip them horizontally, because a flipped letter actually has a different meaning. Whereas, on the other hand, if you're looking at photos of cats and dogs, you probably don't want to flip them vertically, because cats aren't generally upside down. Whereas if you're looking at, there's a current Kaggle competition which is recognizing icebergs in satellite images, you probably do want to flip them upside down, because it doesn't really matter which way up the satellite was relative to the iceberg. So one of the examples of the transform sets we have is transforms_side_on. In other words, if you have photos that are generally taken from the side, which generally means you want to be able to flip them horizontally but not vertically, this is going to give you all the transforms you need for that. So it'll flip them sideways, rotate them by small amounts but not too much, slightly vary their contrast and brightness, slightly zoom in and out a little bit, and move them around a little bit. So each time, it's a slightly different image. I'm getting a couple of questions from people: could you explain again the reason why you don't take the minimum of the loss curve, but a slightly higher rate? And also, people want to understand whether this works for every CNN, or only for certain architectures. This being the learning rate finder? Yeah, exactly. Okay, great. Could you put your hand up if there's a spare seat next to you? So there was a question about the learning rate finder: why do we use a learning rate that's less than the lowest point? To understand why, let's go back to our picture here of how we figure out what learning rate to use, right?
And so what we're going to do is take steps, and each time we're going to multiply the learning rate by some constant, so increase the amount by which we're multiplying the gradient. In other words, we'd go tiny step, slightly bigger, slightly bigger, slightly bigger, slightly bigger, slightly bigger. Okay, and so the purpose of this is not to find the minimum. The purpose of this is to figure out what learning rate is allowing us to decrease the loss quickly, right? So the point at which the loss was lowest is actually there, right? But that learning rate actually looks like it's probably too high; it's going to just jump backwards and forwards. Okay, so instead, what we do is go back to the point where the learning rate was still giving us a quick decrease in the loss. So here is the actual learning rate increasing every single time we look at a new mini-batch, right? So that's mini-batch, or iteration, versus learning rate. And then here is learning rate versus loss. So here's that point at the bottom, where it was already too high, okay? And here's the point where we go back a little bit, and the loss is still decreasing nice and quickly. We're going to learn about something called stochastic gradient descent with restarts shortly, where we're going to see that in a sense you might want to go back to 1e-3, where it's actually even steeper still, and maybe we would find it actually learns even quicker. You could try it, but we're going to see later why using a higher number is going to give us better generalization. So for now, I'll just put that aside. Do you mean a higher learning rate when you say higher? Do I mean a higher learning rate when I say higher? Yeah, is it the higher iteration or something else? I mean a higher learning rate. So as we increase the iterations in the learning rate finder, the learning rate is going up. This is iterations versus learning rate.
Okay, so as we do that, as the learning rate increases and we plot it here, the loss goes down, until we get to the point where the learning rate is too high, and at that point the loss starts getting worse. I asked the question because you were just indicating that even though the minimum was at 10 to the minus 1, you suggested we should choose 10 to the minus 2, but now you're saying we should go back the other way, higher. I didn't mean to say that; I'm sorry if I said something backwards. I want to go back down to a lower learning rate. So possibly I said higher when I meant lower. Okay, thanks. In the last class you said that all the local minima are the same, and this graph also seems to show that. Is that something that was observed, or is there logic or theory behind it? That's not what this graph is showing. This graph is simply showing that there's a point where, if we increase the learning rate more, then it stops getting better and actually starts getting worse. The idea that all local minima are the same is a totally separate issue, and it's actually something we'll see a picture of shortly, so let's come back to that. Jeremy, do we have to find the best learning rate every time we run an epoch? Every time where? Every epoch. So how many times should I run this learning rate finder during my training? That's a great question, Yannet. I certainly run it once when I start. Later on in this class, we're going to learn about unfreezing layers, and after I unfreeze layers, I sometimes run it again. If I do something to change the thing I'm training, or change the way I'm training it, you may want to run it again, basically. Particularly if you've changed something about how you train, like unfreezing layers, which we're going to learn about soon, and you find that the training is now unstable or too slow, you can run it again. There's never any harm in running it.
It doesn't take very long. That's a great question. Okay, so back to data augmentation. So when we run this little tfms_from_model function, we pass in augmentation transforms. The main two we can pass in are transforms_side_on or transforms_top_down. Later on, we'll learn about creating your own custom transform lists as well. But for now, because we're taking pictures of cats and dogs from the side, we'll say transforms_side_on. And now, each time we look at an image, it's going to be zoomed in or out a little bit, moved around a little bit, rotated a little bit, and possibly flipped. Okay, and so what this does is, it's not exactly creating new data, but as far as the convolutional neural net is concerned, it's a different way of looking at this thing, and it therefore allows it to learn how to recognize cats or dogs from somewhat different angles. So when we do data augmentation, we're basically trying to say: based on our domain knowledge, here are different ways that we can mess with this image that we know still make it the same image, and that we could expect you might actually see that kind of image in the real world. So what we can do now is, when we call this from_paths function, which we'll learn more about shortly, we can pass in this set of transforms, which actually have these augmentations in them. So we're going to start from scratch here. We do a fit, and initially the augmentations actually don't do anything, and the reason is that we've got here something that says precompute=True. We're going to come back to this lots of times, but basically: do you remember this picture we saw, where we learned that each different layer has these activations that look for anything from the middle of flowers to eyeballs of birds, or whatever?
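Two of the transforms just mentioned, horizontal flipping and a small shift, can be sketched on a tiny "image" represented as a grid of numbers. This is purely illustrative; the `flip_horizontal`, `shift_right`, and `augment` names are made up for this sketch and are not the fastai transforms:

```python
import random

# A tiny 3x3 image as a grid of pixel values.
img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]

def flip_horizontal(image):
    # Mirror each row left-to-right: fine for cats, wrong for digits.
    return [row[::-1] for row in image]

def shift_right(image, fill=0):
    # Slide pixels one column right, padding on the left: "move it around".
    return [[fill] + row[:-1] for row in image]

def augment(image):
    # Each call returns a randomly, slightly different view of the same image.
    if random.random() < 0.5:
        image = flip_horizontal(image)
    if random.random() < 0.5:
        image = shift_right(image)
    return image

print(flip_horizontal(img)[0])  # [3, 2, 1]
```

Calling `augment(img)` repeatedly mimics what the data loader does: the underlying picture is the same, but the network sees a slightly different version every epoch.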
Right, and so literally what happens is that the later layers of this convolutional neural network have these things called activations, and an activation is literally a number. An activation is a number that says: this feature, like "eyeball of bird", is in this location, with this level of confidence, with this probability, right? And so we're going to see a lot of this later, but what we can do is say: all right, we've got a pre-trained network, remember, and a pre-trained network is one that has already learned to recognize certain things; in this case, it's learned to recognize the one and a half million images in the ImageNet dataset. And so what we could do is take the second-to-last layer, so the one which has got all of the information necessary to figure out what kind of thing a thing is, and save those activations. So basically we're saving things saying, you know, there's this level of eyeball-ness here, and this level of dog's-face-ness here, and this level of fluffy-ear-ness there, and so forth. And so we save these activations for every image, and we call them the precomputed activations.
And so the idea is that now, when we want to create a new classifier which can take advantage of these precomputed activations, we can very quickly train a simple linear model based on them; we'll learn all the details of this shortly. That's what happens when we say precompute=True, and that's why you may have noticed this week that the first time you run a new model, it takes a minute or two, whereas when I ran it, it took like five or ten seconds. It took you a minute or two because it had to precompute these activations, and it only has to do that once. If you're using your own computer or AWS, it just has to do it once ever. If you're using Crestle, it actually has to do it once every single time you restart your Crestle instance, because, just for these precomputed activations, Crestle uses a special little scratch space that disappears each time you restart the instance. So, other than the special case of Crestle, generally speaking you just have to run it once ever for a dataset. Now, the issue with that is this: since we've precomputed, for each image, how much it has an ear here and how much it has a lizard's eyeball there, data augmentations don't work, right? In other words, even though we're trying to show it a different version of the cat each time, we've precomputed the activations for one particular version of that cat. So in order to use data augmentation, we just have to go learn.precompute=False, okay? And then we can run a few more epochs. And so you can see here that as we run more epochs, the accuracy isn't particularly getting better; that's the bad news. The good news is that the train loss, which is a way of measuring the error of this model, is getting better: the error is going down. The validation error isn't going down, but we're not overfitting, and overfitting would mean that
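The precompute idea, and why it disables augmentation, can be shown with a toy cache. This is only a sketch of the caching logic, with a stand-in function for the expensive pretrained layers; none of these names are fastai's. An augmented image is a different input, so it would miss the cache, which is exactly why the transforms do nothing while precompute is on:

```python
# Sketch of precomputed activations: run the expensive pretrained body once
# per image, cache the result, and train only the cheap head on the cached
# features.
calls = {"body": 0}

def pretrained_body(image):
    # Stand-in for the expensive convolutional layers.
    calls["body"] += 1
    return sum(image)  # pretend this is the penultimate-layer feature vector

cache = {}

def features(image):
    key = tuple(image)
    if key not in cache:           # compute only on the first sighting
        cache[key] = pretrained_body(image)
    return cache[key]

dataset = [(1, 2), (3, 4)]
for _epoch in range(3):            # three epochs over two images
    for image in dataset:
        features(image)

print(calls["body"])  # 2 -- the body ran once per image, not once per epoch
```

Setting `learn.precompute = False` corresponds to bypassing this cache entirely, so every (freshly augmented) image flows through the full network again.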
the training loss is much lower than the validation loss. We're going to talk about that a lot during this course, but the general idea is: if you're doing a much better job on the training set than you are on the validation set, that means your model is not generalizing. So we're not at that point, which is good, but we're not really improving, so we're going to have to figure out how to deal with that. Before we do, I want to show you one other cool trick. I've added here cycle_len=1, and this is another really interesting idea. Here's the basic idea: cycle_len=1 enables a fairly recent discovery in deep learning called stochastic gradient descent with restarts. And the idea is this: as you get closer and closer to the right spot, you may want to start to decrease your learning rate, because you're pretty close now, so you slow down your steps to try to land exactly on the right spot. And so as we do more iterations, our learning rate perhaps should actually go down, because as we go along, we're getting closer and closer to where we want to be, and we want to get exactly to the right spot. Okay, so the idea of decreasing the learning rate as you train is called learning rate annealing, and it's very, very common, very popular; basically everybody uses it basically all the time. The most common kind of learning rate annealing is really horrendously hacky: researchers pick a learning rate that seems to work for a while, and then, when it stops learning well, they drop it down by about 10 times, and they keep learning a bit more until it doesn't seem to be improving, and they drop it down by another 10 times. That's what most academic research papers and most people in industry do. So this would be stepwise annealing: very manual, very annoying. A better approach is simply to pick some kind of functional form,
like a line. It turns out that a really good functional form is one half of a cosine curve, and the reason why is that for a while, when you're not very close, you keep a really high learning rate, and then, as you do get close, you quickly drop down and do a few iterations with a really low learning rate. And so this is called cosine annealing. For those of you who haven't done trigonometry for a while, cosine basically looks something like this, and we've picked one little half-piece of it. Okay, so we're going to use cosine annealing, but here's the thing: when you're in a very high dimensional space, and here we're only able to show three dimensions, but in reality we've got hundreds of millions of dimensions, we've got lots of different fairly flat points. They may not be actual local minima, but they're fairly flat points, all of which are pretty good, right? But they might differ in a really interesting way. Let me show you: let's imagine we've got a surface that looks something like this, right? Now imagine that our random starting guess was here, and our initial learning rate annealing schedule therefore got us down to here. Now indeed, that's a pretty nice low error, but it probably doesn't generalize very well, which is to say, if we used a different dataset where things were slightly different in one of these directions, suddenly it would be a terrible solution. Whereas over here is basically equally good in terms of loss, but it rather suggests that if you have slightly different datasets, slightly moved in different directions, it's still going to be good. So in other words, we would expect this solution here to generalize better than this spiky one. So here's what we do. We've got a bunch of different low bits, and our standard learning rate annealing approach will go downhill, downhill, downhill, downhill to
one spot, right? But what we could do instead is use a learning rate schedule that looks like this, which is to say, we do a cosine annealing and then suddenly jump up again, and do a cosine annealing and then jump up again, and so on. And so each time we jump up, it means that if we're in a spiky bit, then when we suddenly increase the learning rate, it jumps all the way over to here, and then the learning rate annealing takes us down to here, and then we jump up again to a high learning rate, and, oh, it stays here, right? So in other words, each time we jump up the learning rate, if it's in a nasty spiky part of the surface, it's going to hop out of the spiky part, and hopefully, if we do that enough times, it'll eventually find a nice smooth bowl. Could you get the same effect by running multiple iterations from different randomized starting points, so that eventually you explore all possible minima, and then you can compare them? Yeah, so that's a great question, and before this approach, which is called stochastic gradient descent with restarts, was created, that's exactly what people used to do. They used to create these things called ensembles, where they would basically relearn a whole new model 10 times, in the hope that one of them would end up being better. And the cool thing about stochastic gradient descent with restarts is that, once we're in a reasonably good spot, each time we jump up the learning rate, the model doesn't restart: it actually hangs out in this nice part of the space and then keeps getting better. So interestingly, it turns out that this approach, where we do a bunch of separate cosine annealing cycles, ends up with a better result than if we just randomly try a few different starting points. So it's a super neat trick, and it's a fairly recent development, and again, almost nobody's heard of it. But I've found it's now like my superpower: using this along with the learning rate finder, I
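The schedule just described, half a cosine per cycle, then a jump back up, is easy to write down. This is a minimal sketch of the shape of the schedule, with illustrative constants, not fastai's scheduler:

```python
import math

# Cosine annealing with restarts: within each cycle the learning rate follows
# half a cosine from lr_max down towards lr_min, then jumps back up to lr_max
# at the start of the next cycle.
def sgdr_lr(iteration, iters_per_cycle, lr_max=1e-2, lr_min=1e-5):
    t = (iteration % iters_per_cycle) / iters_per_cycle  # position in cycle, [0, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

# Three cycles of 100 iterations each (roughly cycle_len=1 over three epochs).
schedule = [sgdr_lr(i, iters_per_cycle=100) for i in range(300)]
```

Plotting `schedule` gives exactly the sawtooth-of-cosines picture from the lecture: a smooth descent within each cycle and a sudden restart at iterations 100 and 200. Note that the rate changes every iteration (mini-batch), not once per epoch, which is the point made in the answer below.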
can get better results than nearly anybody. Like, in a Kaggle competition, in the first week or two, I can jump in, spend an hour or two, and bang, I've got a fantastically good result. And so this is why I didn't pick the point where it's got the steepest slope; I actually tried to pick something kind of aggressively high, where it's still getting down but maybe getting to the point where it's nearly too high. That's because when we do this stochastic gradient descent with restarts, this 10 to the negative 2 represents the highest number that it uses: it goes up to 10 to the negative 2 and then down, and then up to 10 to the negative 2 and then down. So if I used too low a learning rate, it's not going to jump to a different part of the function. I have a few questions, but the first one is, how many times do you change the learning rate? So in terms of this part here where it's going down, we change the learning rate every single mini-batch, and then the number of times we reset it is set by the cycle length parameter. So one means reset it after every epoch; if I had two there, it would reset it after every two epochs. And interestingly, this point, that when we do the learning rate annealing we actually change it every single batch, turns out to be really critical to making this work, and again it's very different to what nearly everybody in industry and academia has done before. And when you have a chance, could you explain precompute equals true? Because it's still very confusing. Yeah, we're going to come back to that multiple times in this course. The way this course is going to work is we're going to do a really high-level version of each thing, and then we're going to come back to it in two or three lessons, and then come back to it at the end of the course, and each time we're going to see more of the math, more of the
code, and get a deeper view. Okay, and we can talk about it also on the forums during the week. Our main goal is to generalize, and we don't want to get those narrow optima? Yeah, that's a very good summary. So with this method, are we keeping track of the minima and averaging them? Ah, ensembling them. That's another level of sophistication, and indeed you can see there's something here called snapshot ensembles. We're not doing it in the code right now, but yes, if you wanted to make this generalize even better, you can save the weights here and here and here and then take the average of the predictions. But for now, we're just going to pick the last one. Thank you. If you want to skip ahead, there's a parameter called cycle_save_name, which you can add as well as cycle_len, and that will save a set of weights at the end of every learning rate cycle, and then you can ensemble them. Okay, so we've got a pretty decent model here, 99.3 percent accuracy, and we've gone through a few steps that have taken a minute or two to run. And so from time to time, I tend to save my weights, so if you go learn.save and then pass in a file name, it's going to go ahead and save that for you; later on, if you go learn.load, you'll be straight back to where you came from. So it's a good idea to do that from time to time. This is a good time to mention what happens when you do this. When you go learn.save, when you create precomputed activations (another thing we'll learn about soon), when you create resized images, these are all creating various temporary files. And so if we go to data, and we go to dogscats, this is my data folder, and you'll see there's a folder here called tmp. This is automatically created, and all of my precomputed activations end up in here. I mention this because if you're getting weird errors, it might be because you've got some precomputed activations that were only
half completed, or are in some way incompatible with what you're doing. So you can always go ahead and just delete this tmp, this temporary directory, and see if that causes your error to go away. This is the fastai equivalent of turning it off and then on again. You'll also see there's a directory called models, and that's where, when you say .save with a model, that's where that's going to go. Actually, it reminds me, when this stochastic gradient descent with restarts paper came out, I saw a tweet from somebody who was like, oh, to make your deep learning work better, turn it off and then on again. That's pretty well said. Is there a question there? So if I want to retrain my model from scratch again, do I just delete everything in that folder? If you want to train your model from scratch, there's generally no reason to delete the precomputed activations, because the precomputed activations are, without any training, what the pre-trained model created with the weights that you downloaded off the internet. The only reason that you would want to delete the precomputed activations is if there was some error caused by half creating them and crashing, or something like that. As you change the size of your input, change different architectures and so forth, they all create different sets of activations with different file names, so you generally shouldn't have to worry about it. If you want to start training again from scratch, all you have to do is create a new learn object: each time you go ConvLearner.pretrained, that creates a new object with new sets of weights to be trained from. Okay, so before our break we'll finish off by talking about fine-tuning and differential learning rates. So far, everything we've done has not changed any of these pre-trained filters. We've used a pre-trained model that already knows how to find, at the early stages, edges
and gradients, and then corners and curves, and then repeating patterns and bits of text, and eventually eyeballs. We have not retrained any of those activations, any of those features, or more specifically any of those weights in the convolutional kernels. All we've done is we've learned some new layers that we've added on top of these things; we've learned how to mix and match these pre-trained features. Now, obviously it may turn out that your pictures have different kinds of eyeballs or faces, or, if you're using different kinds of images like satellite images, totally different kinds of features altogether. So if you're training to recognize icebergs, you'll probably want to go all the way back and learn different combinations of these simple gradients and edges. In our case, dogs versus cats, we're going to have some minor differences, but we still may find it's helpful to slightly tune some of these later layers as well. So to tell the learner that we now want to start actually changing the convolutional filters themselves, we simply say unfreeze. A frozen layer is a layer which is not trained, which is not updated, so unfreeze unfreezes all of the layers. Now, when you think about it, it's pretty obvious that layer one, which is like a diagonal edge or a gradient, probably doesn't need to change by much, if at all. From the one and a half million images on ImageNet, it probably already has figured out pretty well how to find edges and gradients; it probably already knows which kinds of corners to look for, and how to find which kinds of curves, and so forth. So in other words, these early layers probably need little if any learning, whereas these later ones are much more likely to need more learning, and this is universally true, regardless of whether you're looking at satellite images of rainforest or icebergs, or whether you're looking at cats versus dogs. So what we do is we create an array
of learning rates, where we say: okay, these are the learning rates to use for our additional layers that we've added on top, these are the learning rates to use in the middle few layers, and these are the learning rates to use for the first few layers. So these are the ones for the layers that represent very basic geometric features, these are the ones used for the more complex, sophisticated convolutional features, and these are the ones used for the features that we've added and learned from scratch. So we can create an array of learning rates, and then when we call .fit and pass in an array of learning rates, it's now going to use those different learning rates for different parts of the model. This is not something that we've invented, but I'd also say it's so uncommon that it doesn't even have a name as far as I know, so we're going to call it differential learning rates. Whether it actually has a name, or indeed whether somebody's actually written a paper specifically talking about it, I don't know. There's a great researcher called Jason Yosinski who did write a paper about the idea that you might want different learning rates, and showing why, but I don't think any other library supports it, and I don't know of a name for it. Having said that, this ability to unfreeze and then use these differential learning rates, I've found, is the secret to taking a pretty good model and turning it into an awesome model. Just to clarify: so you have three numbers there, three hyperparameters. Is the first one for the late layers, or the other way around? And how many layers are in a model? So the short answer is many, many, and they're kind of in groups, and we're going to learn about the architecture. This is called a ResNet, a residual network, and it has ResNet blocks, and so what we're doing is we're grouping the blocks into three groups, and so this one is
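The effect of differential learning rates can be sketched without any framework: one SGD step where each layer group gets its own rate. The numbers here are toy stand-ins, with the three rates mirroring the hundred-to-one spread between the earliest layers and the newly added head.

```python
# One SGD step where each layer group has its own learning rate.
# weights/grads are toy stand-ins for the three ResNet layer groups.
group_lrs = [1e-4, 1e-3, 1e-2]   # earliest layers ... newly added head
weights   = [1.0, 1.0, 1.0]
grads     = [0.5, 0.5, 0.5]      # identical gradients, to isolate the lr effect

def sgd_step(weights, grads, group_lrs):
    """Update each group's weight with its own learning rate."""
    return [w - lr * g for w, g, lr in zip(weights, grads, group_lrs)]

new_weights = sgd_step(weights, grads, group_lrs)
# the early-layer weight barely moves; the head moves 100x as far
```

In raw PyTorch the same idea is expressed by passing parameter groups with per-group `lr` values to the optimizer.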
actually... this first number is for the earliest layers, the ones closest to the data. The ones closest to the pixels? Yeah, the ones closest to the pixels, the ones that represent corners and edges and gradients. But why... I thought those layers were frozen at first? They are, and we just said unfreeze. So you're unfreezing them because you have partially trained all the late layers? We've trained our added layers. And now you're retraining the whole set? Exactly. I see. But the learning rate is particularly small for the early layers? That's right. Because you just want to fine-tune them, you don't want them to... Yeah, we probably don't want to change them at all, but if it does need to, then it can. Thanks. No problem. So, using the differential learning rates, how is this approach like grid search? There's no similarity to grid search. Grid search is where we're trying to find the best hyperparameter for something, so for example you could kind of think of the learning rate finder as a really sophisticated grid search, which is trying lots and lots of learning rates to find which one is best. But this is nothing to do with that: for the entire training from now on, it's actually going to use a different learning rate for each layer group. So I was wondering, you have a pre-trained model, then you have to use the same input dimensions, right? Because I was thinking, okay, say they used big machines to train these things and you want to take advantage of it, how would you go about it if you have images that are bigger than the ones that they used? We're going to be talking about sizes later, but the short answer is that with this library and the modern architectures we're using, we can use any size we like. So Jeremy, can we unfreeze just a specific layer? We can. We're not doing it yet, but if you wanted to, you can type learn.freeze_to
and pass in a layer number. Much to my surprise, or at least initially to my surprise, it turns out I almost never need to do that; I almost never find it helpful, and I think it's because, with differential learning rates, the optimizer can learn just as much as it needs to. What about if you have very little data? Yeah, it still doesn't seem to help. The one place I have found it helpful is if I'm using a really big, memory-intensive model and I'm running out of GPU: the fewer layers you unfreeze, the less memory it takes and the less time it takes, so there's that kind of practical aspect. Also, did I ask the question right: can I unfreeze just a specific layer? No, you can only unfreeze layers from layer n onwards. You could probably delve inside the library and unfreeze one layer, but I don't know why you would. Okay, so I'm really excited to be showing you guys this stuff, because it's something we've been kind of researching all year, figuring out how to train state-of-the-art models, and we've found this tiny number of tricks. And so once we do that, we now go learn.fit, and you can see, look at this, we get right up to 99.5 percent accuracy, which is crazy. There's one other trick you might see here, which is that as well as using stochastic gradient descent with restarts, i.e.
cycle length equals one, we've done three cycles. So earlier on I lied to you: I said this is the number of epochs; it's actually the number of cycles. So if you said cycle length equals two, it would do three cycles, each of two epochs, so that's six epochs. So here I've said do three cycles, yet somehow it's done seven epochs, and the reason why is I've got one last trick to show you, which is cycle mult equals two. And to tell you what that does, I'm simply going to show you a picture: if I go learn.sched.plot_lr, there it is, and now you can see what cycle mult equals two is doing. It's doubling the length of the cycle after each cycle. And in the paper that introduced stochastic gradient descent with restarts, the researcher kind of said, hey, this is something that seems to sometimes work pretty well, and I've certainly found that often to be the case. So basically, intuitively speaking, if your cycle length is too short, it starts going down to find a good spot, then it pops out, and it goes down to try and find a good spot and pops out, and it never actually gets to find a good spot. Earlier on, you want it to do that, because it's trying to find the bit that's smoother, but then later on you want it to have more time to settle into a good spot, so lengthening the cycles gives it that. So that's why this cycle mult equals two thing often seems to be a pretty good approach. So suddenly we're introducing more and more hyperparameters, having told you that there aren't that many, but the reason is that you can really get away with just picking a good learning rate, and then adding these extra tweaks really helps get that extra level up without any effort. And so in practice, I find this kind of thing, three cycles starting at length one with mult equals two, works very often to get a pretty decent model, and if it doesn't, then often I'll just do three cycles of length two with no mult. That's kind of like two things that seem to work a
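How cycle length and cycle mult interact can be sketched with a hypothetical helper (not fastai code), showing why three cycles with cycle length one and cycle mult two take seven epochs.

```python
def cycle_epochs(n_cycles, cycle_len=1, cycle_mult=1):
    """Length in epochs of each SGDR cycle, as controlled by cycle_len/cycle_mult."""
    lengths, current = [], cycle_len
    for _ in range(n_cycles):
        lengths.append(current)
        current *= cycle_mult      # cycle_mult=2 doubles the cycle length each time
    return lengths

print(cycle_epochs(3, cycle_len=1, cycle_mult=2))  # [1, 2, 4] -> 7 epochs total
print(cycle_epochs(3, cycle_len=2))                # [2, 2, 2] -> 6 epochs total
```

The first call is the schedule used in the lesson; the second is the fallback of three cycles of length two with no mult.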
lot, and there's not too much fiddling necessary. And as I say, even if you just use this line every time, I'd be surprised if you didn't get a reasonable result. So a question here: why does a smoother surface correlate to a more generalized network? So it's this intuitive explanation I tried to give back here, which is that if you've got something spiky... what this x-axis is showing is how good this is at recognizing dogs versus cats as you change this particular parameter. And for something to be generalizable, that means that we want it to work when we give it a slightly different data set, and so a slightly different data set may have a slightly different relationship between this parameter and how catty versus doggy it is; it may instead look a little bit like this. So in other words, if we end up at this point, then it's not going to do a good job on this slightly different data set, whereas if we end up at this point, it's still going to do a good job on this data set. Okay, so that's what cycle mult does. Okay, so we've got one last thing before we take a break, which is we're now going to take this model, which has 99.5 percent accuracy, and we're going to try and make it better still. And what we're going to do is we're not actually going to change the model at all; instead, we're going to look back at the original visualization we did, where we looked at some of our incorrect pictures. Now, I've printed out the whole of these incorrect pictures, but the key thing to realize is that, particularly when we do the validation set, all of our inputs to our model all the time have to be square. And the reason for that is kind of a minor technical detail, but basically the GPU doesn't go very quickly if you have different dimensions for different images, because it needs things to be consistent, so that every
part of the GPU can do the same thing. And I think this is probably fixable, but for now that's the state of the technology we have. So for our validation set, when we actually say, for this particular image, is this a dog, what we actually do to make it square is we just pick out the square in the middle: we take off its two edges, and so we take the whole height and then as much of the middle as we can. And so you can see in this case we wouldn't actually see this dog's head, so I think the reason this was not correctly classified was because the validation set only got to see the body, and the body doesn't look particularly dog-like or cat-like, so it's not at all sure what it is. So what we're going to do when we calculate the predictions for our validation set is use something called test time augmentation. What this means is that every time we decide, is this a cat or a dog, not in the training but after we've trained the model, we're going to actually take four random data augmentations (and remember, the data augmentations move around and zoom in and out and flip), so we're going to take four of them at random, and we're going to take the original un-augmented center-crop image, and we're going to do a prediction for all of those, and then we're going to take the average of those predictions. So we're going to say, is this a cat, is this a cat, is this a cat, is this a cat, and so hopefully in one of those random ones the face is actually there, it's zoomed in by a similar amount to other dog faces it's seen, it's rotated by the amount that it expects to see, and so forth. And so to do that, all we have to do is call TTA; TTA stands for test time augmentation. This term for when we're making predictions from a model we've trained: sometimes it's called inference time, sometimes it's called test time; everybody seems to have a different name. So TTA, and so when we do that, we go learn.TTA, check the
accuracy, and lo and behold, we're now at 99.65 percent, which is kind of crazy. But for every epoch, we are only showing one type of augmentation of a particular image, right? So when we were training back here, we were not doing any TTA. TTA is not... sometimes, like, I've written libraries where after each epoch I run TTA to see how well it's going, but that's not what's happening here. I trained the whole thing with training time augmentation, which doesn't have a special name, because that's what we mean when we say data augmentation: we mean training time augmentation. So here, every time we showed a picture, we were randomly changing it a little bit, so in each epoch, each of these seven epochs, it was seeing slightly different versions of the picture. Having done that, we now have a fully trained model. We then said, okay, let's look at the validation set (so TTA by default uses the validation set) and said, okay, what are your predictions of which ones are cats and which ones are dogs, and it did four predictions with different random augmentations, plus one on an un-augmented version, averaged them all together, and that's what we got, and that's what we calculate the accuracy from. So is there a high probability of having a sample in TTA that was not shown during training?
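The averaging that TTA performs can be sketched in numpy. The function and the toy probabilities below are illustrative, not the fastai implementation.

```python
import numpy as np

def tta_predict(probs_per_version):
    """Average class probabilities over several augmented versions of each image.
    probs_per_version has shape (n_versions, n_images, n_classes)."""
    return np.mean(probs_per_version, axis=0)

# one image, two classes (cat, dog); the center crop misses the dog's face,
# but the four random augmentations catch it
versions = np.array([
    [[0.6, 0.4]],   # center crop: leans cat (wrong)
    [[0.2, 0.8]],
    [[0.1, 0.9]],
    [[0.3, 0.7]],
    [[0.2, 0.8]],
])
avg = tta_predict(versions)
print(avg.argmax(axis=1))   # the averaged prediction is class 1, "dog"
```

A single unlucky crop no longer decides the answer; the average over five views does.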
Yeah, actually every data-augmented image is unique, because the rotation could be like 0.034 degrees and the zoom could be 1.0165, so every time it's slightly different. No problem, it's behind you. Why not use white padding or something like that? White padding, like, put a white border around it? So there's lots of different types of augmentation you can do, and one of the things you can do is to add a border around it. Basically, adding a border around it, in my experiments, doesn't help: it doesn't make it any less cat-like; the convolutional neural network doesn't seem to find it very interesting, basically. Something that I do do, which we'll see later, is something called reflection padding, which is where I add some borders that are just the outside reflected; it's a way to make some bigger images, and it works well with satellite imagery in particular. But in general I don't do a lot of padding; instead, I do a bit of zooming. Kind of a follow-up to that last one: rather than cropping, just add white space, because when you crop, you lose the dog's face, but if you added white space, you wouldn't. Yeah, so that's where the reflection padding or the zooming or whatever can help, and there are ways in the fastai library, when you do custom transforms, of making that happen. I find that it kind of depends on the image size, but generally speaking, it seems that with TTA plus data augmentation, the best thing to do is to try to use as large an image as possible. And so if you crop the thing down and put white borders on top and bottom, it's now quite a lot smaller, and so to make it as big as it was before, you now have to use more GPU, and if you're going to use more GPU, you could have zoomed in and used a bigger image. So in my playing around, that doesn't seem to be generally as successful. Okay, there is a lot of interest in the topic of how
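The reflection padding just mentioned mirrors border pixels outward, and numpy's np.pad does exactly this on a toy image:

```python
import numpy as np

# a tiny 3x3 "image"; reflection padding mirrors the border pixels outward
img = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

padded = np.pad(img, pad_width=1, mode='reflect')
print(padded)
# [[5 4 5 6 5]
#  [2 1 2 3 2]
#  [5 4 5 6 5]
#  [8 7 8 9 8]
#  [5 4 5 6 5]]
```

Unlike a white border, the padded region has the same local statistics as the image edge, which is why it tends to work well for things like satellite tiles.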
to do the augmentation in data that is not images. Yeah, no one seems to know. I actually asked some of my friends in the natural language processing community about this (we'll get to natural language processing in a couple of lessons); it seems like it'd be really helpful. There have been a very, very few examples of papers that would, like, try replacing synonyms, for instance, but on the whole, an understanding of appropriate data augmentation for non-image domains is under-researched and under-developed. The question was, couldn't we just use a sliding window to generate all the images? So in that dog picture, couldn't we generate three parts of that? Wouldn't that be better? For TTA, you mean, or just in general? So for training time, I would say no, that wouldn't be better, because we're not going to get as much variation. We want to have it one degree off, five degrees off, ten pixels up, like lots of slightly different versions, and so if you just have three standard ways, then you're not giving it as many different ways of looking at the data. For test time augmentation, having fixed crop locations I think probably would be better, and I just haven't gotten around to writing that yet (I have a version in an older library); I think having fixed crop locations plus random contrast, brightness, and rotation changes might be better. The reason I haven't gotten around to it yet is that in my testing it didn't seem to help in practice very much, and it made the code a lot more complicated. So, you know, it's an interesting question. I just want to know, all these fastai APIs that you are using, are they open source? Yeah, that's a great question. So the fastai library is open source, and let's talk about it a bit more generally, because the fact that we're using this library is kind of interesting and unusual, and it sits
on top of something called PyTorch. So PyTorch is a fairly recent development, and I've noticed all the researchers that I respect pretty much are now using PyTorch. I found in part two of last year's course that a lot of the cutting-edge stuff I wanted to teach, I couldn't do in Keras and TensorFlow, which is what we used to teach with, and so I had to switch the course to PyTorch halfway through part two. The problem was that PyTorch isn't very easy to use: you have to write your own training loop from scratch, basically; you write everything from scratch. All the stuff you see inside the fastai library, we would have had to have written, you know, to learn, and so it really makes it very hard to learn deep learning when you have to write hundreds of lines of code to do anything. So we decided to create a library on top of PyTorch, because our mission is to teach world-class deep learning: we wanted to show you, here's how you can be the best in the world at doing X, and we found that a lot of the world-class stuff we needed to show really needed PyTorch, or at least with PyTorch it was far easier. But PyTorch itself just wasn't suitable as a first thing to teach with for new deep learning practitioners, so we built this library on top of PyTorch, initially heavily influenced by Keras, which is what we taught last year, but then we realized we could actually make things much, much easier than Keras. So in Keras, if you look back at last year's course notes, you'll find that all of the code is two to three times longer, and there are lots more opportunities for mistakes, because there are just a lot of things you have to get right. So we ended up building this library in order to make it easier to get into deep learning, but also easier to get state-of-the-art results. And then over the last year, as we started developing on top of that, we started discovering that using this library made us so much more productive that we actually
started developing new state-of-the-art results and new methods ourselves, and we started realizing that there is a whole bunch of papers that have kind of been ignored or lost, which, when you use them, can automate or semi-automate stuff, like the learning rate finder; that's not in any other library. So I've kind of got to the point where now, not only does fastai let us do things much easier than any other approach, but at the same time it actually has a lot more sophisticated stuff behind the scenes than anything else, so that's an interesting mix. So yeah, we've released this library; at this stage it's a very early version, and so through this course, by the end of this course, I hope (a lot of people are already helping) that as a group we'll have developed it into something that's really pretty stable and rock solid, and anybody can then use it to build your own models under an open source license. As you can see, it's available on GitHub. Behind the scenes, it's creating PyTorch models, and so PyTorch models can then be exported into various different formats. Having said that, for a lot of folks, like, if you want to do something on a mobile phone, for example, you're probably going to need to use TensorFlow, and so later on in this course we're going to show how some of the things that we're doing in the fastai library you can do in Keras and TensorFlow, so you can get a sense of what the different libraries look like. Generally speaking, for the simple stuff, it'll take you a small number of days longer to learn to do it in Keras and TensorFlow versus fastai and PyTorch, and the more complex stuff often just won't be possible, so if you need it to be in TensorFlow, you'll often just have to simplify it a little bit. But I think the more important thing to realize is that every year the libraries that are available, and which ones are the best, totally changes. So
the main thing I hope that you get out of this course is an understanding of the concepts: here's how you find a learning rate, here's why differential learning rates are important, here's how you do learning rate annealing, here's what stochastic gradient descent with restarts is, and so on and so forth, because by the time we do this course again next year, the library situation is going to be different again. A question: I was wondering if you have an opinion on Pyro, which is Uber's new release? I haven't looked at it, no. I'm very interested in probabilistic programming, and it's really cool that it's built on top of PyTorch. So one of the things we'll learn about in this course is that PyTorch is much more than just a deep learning library; it actually lets us write arbitrary GPU-accelerated algorithms from scratch, which we're actually going to do, and Pyro is a great example of what people are now doing with PyTorch outside of the deep learning world. Great, okay, let's take an eight-minute break and we'll come back at 7:55. So, 99.65 percent accuracy: what does that mean? So in classification, in machine learning, a really simple way to look at the result of a classification is what's called the confusion matrix. This is not just deep learning, but any kind of classifier in machine learning, where we say: okay, what was the actual truth? There were a thousand cats and a thousand dogs, and of the thousand actual cats, how many did we predict were cats? This is obviously on the validation set; these are the images that we didn't use to train with. It turns out there were 998 cats that we actually predicted as cats, and two that we got wrong, and then for dogs there were 995 that we predicted were dogs, and five that we got wrong. And so often these confusion matrices can be helpful, particularly if you've got four or five classes you're trying to predict, to see which group you're having the most
trouble with, and you can see it uses color coding to highlight the large bits; hopefully, the diagonal is the highlighted section. So now that we've retrained the model, it can be quite helpful to actually look back and see which ones in particular were incorrect. And we can see here there were actually only two incorrect cats. It prints out four by default, and you can actually see these two are actually less than 0.5, so they weren't wrong; so it's actually only these two that were wrong cats. This one isn't obviously a cat at all; this one is, but it looks like it's got a lot of weird artifacts and you can't see its eyeballs at all. And then here are the dogs that were wrong: there were five wrong dogs, and here are four of them. That's not obviously a dog; that looks like a mistake; that looks like a mistake; that one, I guess it doesn't have enough information, but I guess it's a mistake. So we've done a pretty good job here of creating a good classifier. Based on entering a lot of Kaggle competitions and comparing results I've done to various research papers, I can tell you it's a state-of-the-art classifier; it's right up there with the best in the world. We're going to make it a little bit better in a moment, but here are the basic steps. So if you want to create a world-class image classifier, the steps that we just went through were: we turned data augmentation on by saying aug_tfms equals (and you either say side_on or top_down, depending on what you're doing); start with precompute equals true; find a decent learning rate; we then train just one or two epochs, which takes a few seconds because we've got precompute equals true; then we turn off precompute, which allows us to use data augmentation, and do another two or three epochs, generally with cycle length equals one; then I unfreeze all the layers; I then set the earlier layers to
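The confusion matrix described just above (998 and 2 for the cats, 995 and 5 for the dogs) can be computed by hand in a few lines of numpy; scikit-learn's confusion_matrix does the same thing.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=2):
    """cm[i, j] = number of examples whose true class is i, predicted as j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# reproduce the lecture's numbers: 1000 cats (class 0) and 1000 dogs (class 1)
y_true = np.array([0] * 1000 + [1] * 1000)
y_pred = np.concatenate([
    np.array([0] * 998 + [1] * 2),    # 2 cats predicted as dogs
    np.array([1] * 995 + [0] * 5),    # 5 dogs predicted as cats
])
print(confusion_matrix(y_true, y_pred))
# [[998   2]
#  [  5 995]]
```

The diagonal holds the correct predictions, which is why the diagonal is the highlighted part of the plot when the classifier is good.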
be somewhere between a three times and ten times lower learning rate than the next group. In this case I did ten times: this was the learning rate I found from the learning rate finder, then ten times smaller, then ten times smaller again. As a rule of thumb, knowing that you're starting with a pretrained ImageNet model: if the things you're now trying to classify are pretty similar to the kinds of things in ImageNet — i.e. pictures of normal objects in normal environments — you probably want about a 10x difference, because you think the earlier layers are probably very good already. Whereas if you're doing something like satellite imagery or medical imaging, which is not at all like ImageNet, you probably want to train those earlier layers a lot more, so you might have just a 3x difference. So that's one choice I make: 10x or 3x. Then, after unfreezing, you can call lr_find again. I actually didn't in this case, but once you've unfrozen all the layers and turned on differential learning rates, you can call lr_find again and check: does it still look like the point I had last time is about right? Something to note: if you call lr_find having set differential learning rates, the thing it actually prints out is the learning rate of the last layers — you've got three different learning rates, and it's showing you the last one. Then I train the full network with cycle_mult=2 until either it starts overfitting or I run out of time. So let me show you — let's do this again for a totally different data set. This morning I noticed that some of you on the forums were playing around with a similar playground Kaggle competition called dog breed identification. So the dog
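The differential learning rates just described are simply three values spaced by a fixed ratio. A tiny sketch (the helper name is mine, not fastai's):

```python
def differential_lrs(last_layer_lr, ratio=10):
    """Learning rates for the [early, middle, last] layer groups.

    ratio=10 suits ImageNet-like data (early layers are nearly right
    already); ratio=3 suits very different data such as satellite or
    medical images, where earlier layers need more training."""
    return [last_layer_lr / ratio**2, last_layer_lr / ratio, last_layer_lr]

lrs = differential_lrs(0.01, ratio=10)   # roughly [0.0001, 0.001, 0.01]
```

In fastai 0.7 you would pass a list like this straight to `learn.fit` in place of a single learning rate.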
breed identification Kaggle challenge is one where you don't have to decide which ones are cats and which are dogs — they're all dogs — but you have to decide what kind of dog it is, and there are 120 different breeds. Obviously this could equally be different types of cells in pathology slides, different kinds of cancers in CT scans, different kinds of icebergs in satellite images, whatever — as long as you've got some kind of labeled images. So I want to show you what I did this morning. It took me about an hour, basically, to go end to end on something I'd never seen before. I downloaded the data from Kaggle — I'll show you how to do that shortly, but the short answer is there's something called Kaggle CLI, which is a GitHub project you can search for, and if you read the docs you basically run kg download, provide the competition name, and it'll grab all the data for you onto whatever instance you're using. I put it in my data folder and then went ls, and I saw that it's a little bit different to our previous data set: there isn't a train folder with a separate folder for each kind of dog; instead there was a CSV file. I read the CSV file in with pandas — pandas is the thing we use in Python to do structured data analysis on things like CSV files. If you import pandas, we call it pd, which is pretty much universal, and pd.read_csv reads in the CSV file. We can then take a look at it, and you can see it basically has some kind of identifier and then the breed. So this is the second main way people give you image labels: one is to put different images into different folders; the second is generally to give you some kind of file, like a CSV file, that says here's the image name and here's the label. What I then did was use pandas again to create a pivot table, which basically groups it up, just to see how many of each
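This second labeling style — a CSV mapping image id to label — can also be read and tallied with just the standard library if pandas isn't to hand. A minimal sketch (the file contents, ids, and variable names here are made up for illustration; the real labels.csv has thousands of rows):

```python
import csv
import io
from collections import Counter

# Stand-in for the Kaggle labels.csv: an image id column and a breed column
labels_csv = """id,breed
000a1,boston_bull
000b2,dingo
000c3,dingo
000d4,pekinese
"""

rows = list(csv.DictReader(io.StringIO(labels_csv)))
breed_counts = Counter(row["breed"] for row in rows)  # like the pivot table
n_images = len(rows)  # row count, header already excluded by DictReader
```

`breed_counts.most_common()` then gives you the same sorted view of common versus rare breeds that the pivot table does.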
breed there were, and I sorted them. I saw that they've got about a hundred of some of the more common breeds, while some of the less common breeds got 60 or so. All together there are 120 rows — 120 different breeds represented. Okay, so I'm going to go through the steps. Enable data augmentation: to enable data augmentation, when you call tfms_from_model you just pass in aug_tfms; in this case I chose side_on again — these are pictures of dogs and such, so they're side-on photos. We'll talk about max_zoom in more detail later, but max_zoom basically says: when you do the data augmentation, zoom into the image by up to 1.1 times — randomly between the original image size and 1.1 times — so it's not always cropping out the middle or an edge; it could be cropping out a smaller part. Having done that, the key step now is that rather than going from paths — previously we went from paths, which tells it that the names of the folders are the names of the labels — we go from CSV, and we pass in the CSV file that contains the labels. So we pass in the path that contains all the data, the name of the folder that contains the training data, and the CSV that contains the labels. We also need to tell it where the test set is if we want to submit to Kaggle later; more about that next week. Now, with the previous data set I had actually separated a validation set out into a separate folder, but in this case you'll see there is no separate folder called validation. We still want to be able to track how good our performance is locally, so we're going to have to separate some of the images out into a validation set. I do that at random, and up here you can see I've basically opened up the CSV file, turned it into a list of rows, and taken the length of that minus one, because there's a header at the top. So that's the number of rows in
the CSV file, which must be the number of images we have. Then this is a fastai thing, get_cv_idxs — get cross-validation indexes. We'll talk about cross-validation later, but basically if you call this and pass in a number, it returns by default a random 20 percent of the rows to use as your validation set, and you can pass in parameters to get different amounts. So this grabs 20 percent of the data and says: these are the indexes — the numbers of the files — which we're going to use as a validation set. So let's run this so you can see what it looks like: val_idxs is just a big bunch of numbers, and n is 10,000, so about 20 percent of those will be in the validation set. When we call from_csv, we can pass in a parameter that tells it which indexes to treat as the validation set, so we pass in those indexes. One thing that's a little bit tricky here: the file names — I checked — actually have a .jpg on the end, and these obviously don't, so when you call from_csv you can pass in a suffix that says the labels don't contain the full file names and this needs to be added to them. That's basically all I needed to do to set up my data. And as a lot of you noticed during the week, inside that data object you can actually get access to the data sets — the training data set by saying trn_ds — and inside trn_ds is a whole bunch of things, including the file names. So trn_ds.fnames contains all the file names of everything in the training set, and here's an example of one file name. So I can now go ahead and open that file and take a look at it. The next thing I did was try to understand what my data set looks like, and it found an adorable puppy, so that was very nice — feeling good about this.
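The random 20-percent split that get_cv_idxs produces can be sketched in plain Python. This is a stand-in for illustration, not fastai's actual implementation (the function name and the fixed seed are my choices):

```python
import random

def get_val_idxs(n, val_pct=0.2, seed=42):
    """Pick a random val_pct of row indexes to hold out as a validation
    set -- a plain-Python stand-in for fastai's get_cv_idxs. A fixed
    seed keeps the split reproducible between runs."""
    rng = random.Random(seed)
    return rng.sample(range(n), int(n * val_pct))

val_idxs = get_val_idxs(10000)   # 2000 distinct indexes out of 10000
```

Because `random.sample` draws without replacement, no image can land in the validation set twice, and everything not in `val_idxs` stays in the training set.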
I also want to know how big these images are, because that's a key issue. If they're huge, I'm going to have to think really carefully about how to deal with huge images — that's really challenging. If they're tiny, that's also challenging. Most ImageNet models are trained on either 224 by 224 or 299 by 299 images, so any time you have images in that kind of range, that's really helpful — you're probably not going to have to do too much different. In this case the first image I looked at was about the right size, so I'm thinking: this is looking pretty hopeful. What I did then was create a dictionary comprehension. If you don't know about list comprehensions and dictionary comprehensions in Python, go study them — they're super handy, among the most useful things in the language. The basic idea here is that I'm going through all of the files and creating a dictionary that maps the name of each file to the size of that image. Then — again, a handy little Python feature which I'll let you learn about during the week if you don't know it — zip, with the special star notation, takes this dictionary and turns it into the rows and the columns, and I can then turn those into numpy arrays. So here are the first five row sizes for my images. And matplotlib is something you want to be very familiar with if you do any kind of data science or machine learning in Python; matplotlib we always refer to as plt. So this is a histogram of how many rows there are in each image. You can see what I'm doing here: before I start doing any modeling, I need to know what I'm modeling with, and I can see some of the images are going to be 2,500 or 3,000 pixels high, but most of them seem to be around 500. Given that a few of them were bigger than a thousand, I used standard numpy slicing to just grab those that are
smaller than a thousand and histogrammed that, just to zoom in a little bit, and I can see the vast majority are around 500. This also prints out the histogram counts, so I can go through and see that about 4,500 of them are around 450 — so I get a sense of the sizes. So, Jeremy, how many images should we have in the validation set — is it always 20 percent? Using 20 percent for the size of the validation set is fine unless you're feeling like your data set is really small and you're not sure that's enough. Basically, think of it this way: if you train the same model multiple times and get very different validation set results, and your validation set is kind of small — smaller than a thousand or so — then it's going to be quite hard to interpret how well you're doing. This is particularly true if you care about the third decimal place of accuracy: with a thousand things in your validation set, a single image changing class changes what you're looking at. So it really depends on how much difference you care about. I would say, in general, at the point where you care about the difference between, say, 0.01 and 0.02 — the second decimal place — you want that difference to represent 10 or 20 rows; if it corresponds to changing the class of 10 or 20 rows, that's something you can be pretty confident about. So most of the time, given the data sizes we normally have, 20 percent works fine, but it depends a lot on specifically what you're doing and what you care about — and it's not a deep-learning-specific question either. For those interested in this kind of thing, we're going to look into it in a lot more detail in our machine learning course, which will also be available online. Okay, so I did the same thing for the
columns, just to make sure these aren't super wide, and I got similar results — again, around 400 or 500 seems to be about the average size. So based on all of that, I thought: okay, this looks like a pretty normal kind of image data set that I can probably use pretty normal kinds of models on. I was also particularly encouraged to see that when I looked at the dog, the dog takes up most of the frame, so I'm not too worried about cropping problems. If the dog was just a tiny little piece in one corner, I'd be thinking about doing things differently — maybe zooming in a lot more or something. In medical imaging that happens a lot: often the tumor or the cell or whatever is one tiny piece, and that's much more complex. So based on all that, this morning I thought: okay, this looks pretty standard. I went ahead and created a little function called get_data that basically had my normal two lines of code in it, but I made it so I could pass in a size and a batch size. The reason for this is that when I start working with a new data set, I want everything to go super fast, and if I use small images, it goes super fast. So I actually started out with size equals 64, to create some super small images that take just a second to run through, and see how it went. Later on I started using some bigger images and some bigger architectures, at which point I started running out of GPU memory — I started getting these errors saying CUDA out of memory. When you get a CUDA out-of-memory error, the first thing you need to do is go Kernel, Restart: once you get an out-of-memory error on your GPU, you can't really recover from it; it doesn't matter what you do, you have to restart. Once I restarted, I then just changed my batch size to something smaller. So when you create your data object you can
pass in a batch size parameter. I normally use 64 until I hit something that says out of memory, and then I'll just halve it — and if I still get out of memory, I'll halve it again. So that's why I created this: to let me make my sizes bigger as I looked into it more and, as I started running out of memory, to decrease my batch size. At this point — I went through this a couple of iterations — I basically found everything was working fine. So once it was working fine, I set size to 224 and trained with precompute=True. The first time I did that, it took a minute to create the precomputed activations, and then it ran through in about four or five seconds, and you can see I was getting 83% accuracy. Now remember, accuracy means it's exactly right, and here it's predicting exactly right out of 120 categories. So when you see something with two classes that's 80% accurate versus something with 120 classes that's 80% accurate, those are very different levels. When I saw 83% accuracy with just a precomputed classifier — no data augmentation, no unfreezing, nothing else — across 120 classes, I thought: this looks good. So then I just kept going through our little standard process. I turned precompute off, set cycle_len equals one, and started doing a few more epochs. Remember, an epoch is one pass through the data, and a cycle is however many epochs you said are in a cycle: the learning rate going from the top that you asked for all the way down to zero. Since cycle_len equals one here, a cycle and an epoch are the same. So I tried a few epochs — I did actually run the learning rate finder, and 1e-2 again looked fine; it often looks fine — and it kept improving, so I tried five epochs and found my accuracy getting better. So then I saved that, and I tried something which we haven't
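The halve-on-OOM habit described above is simple enough to write down. A minimal sketch (the function name is mine; remember the lecture's point that after a real CUDA OOM you must restart the kernel before retrying):

```python
def next_batch_size(bs, minimum=1):
    """Halve the batch size after a CUDA out-of-memory error.

    Mirrors the routine from the lecture: start at 64, halve on OOM,
    and halve again if it still doesn't fit."""
    return max(bs // 2, minimum)

bs = 64
bs = next_batch_size(bs)   # first OOM  -> 32
bs = next_batch_size(bs)   # second OOM -> 16
```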
looked at before, but it's kind of cool. If you train something on a smaller size, you can then actually call learn.set_data and pass in a larger-sized data set, and that's going to take your model, however it's been trained so far, and let you continue to train on larger images. And I'll tell you something amazing: this is actually another way you can get state-of-the-art results, and I've never seen it written in any paper or discussed anywhere — as far as I know, this is a new insight. Basically, I've got a pretrained model which in this case I've trained for a few epochs at 224 by 224, and I'm now going to do a few more epochs at 299 by 299. Now, I've got very little data by deep learning standards — only 10,000 images. At 224 by 224, I built these final layers to find things that worked well at 224 by 224, but when I go to 299 by 299, if I was overfitting before, I'm definitely not going to be overfitting now: I've changed the size of my images, so they're kind of totally different, but conceptually they're still pictures of the same kinds of things. So I've found this trick — start training on small images for a few epochs, then switch to bigger images and continue training — is an amazingly effective way to avoid overfitting, and it's so easy and so obvious that I don't understand why it's never been written about before. Maybe it's in some paper somewhere and I haven't found it, but I haven't seen it. Would it be possible to do the same thing using, let's say, Keras or TensorFlow as well — to feed in an image of a different size?
Yeah, I think so — as long as you use one of these more modern architectures, what we call fully convolutional architectures. That means not VGG — you'll see we don't use VGG in this course because it doesn't have this property — but most of the architectures developed in the last couple of years can handle pretty much arbitrary sizes. It'd be worth trying; I think it ought to work. Okay, so I call get_data again — remember, get_data is just the little function I created back up here — and I just pass a different size to it. Then I call freeze, just to make sure everything but the last layer is frozen (actually it already was at this point), and you can see that now, with precompute off, I've got data augmentation working. So I run a few more epochs, and what I notice here is that my validation set loss is a lot lower than my training set loss. This is still just training the last layer, and what that's telling me is that I'm underfitting. Underfitting means this cycle_len of one is too short: it's finding something better, popping out, and never getting a chance to zoom in properly. So then I set cycle_mult equals two to give it more time — the first cycle is one epoch, the second is two epochs, the third is four epochs — and you can see now the validation and training losses are about the same. So that tells me: yes, this is about the right track. Then I tried using test-time augmentation to see if that got any better — it didn't actually help a hell of a lot, just a tiny bit — and at this point I'm thinking this is nearly done, so I did one more cycle of two to see if it got any better, and it did get a little bit better, and then I thought: okay, that looks pretty good. I've got a validation set loss of 0.199,
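The cycle_len/cycle_mult arithmetic just described is easy to check by hand. A small sketch (helper name mine):

```python
def epochs_per_cycle(n_cycles, cycle_len=1, cycle_mult=2):
    """How many epochs each cycle lasts under cycle_len/cycle_mult.

    With cycle_len=1 and cycle_mult=2 the cycles last 1, 2, 4, ...
    epochs, giving the later cycles more time to settle into a good
    minimum instead of popping out every epoch."""
    return [cycle_len * cycle_mult**i for i in range(n_cycles)]

lengths = epochs_per_cycle(3)   # [1, 2, 4] -> 7 epochs total
```

This is why asking fastai for three cycles with cycle_mult=2 runs seven epochs, not three.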
and you'll notice here I actually haven't tried unfreezing. The reason is that when I tried unfreezing and training more, it didn't get any better, and the reason for that, clearly, is that this data set is so similar to ImageNet that training the convolutional weights doesn't help in the slightest. In fact, when I looked into it, it turns out this competition is actually using a subset of ImageNet. Okay, so then checking that 0.199 against the leaderboard — this is only a playground competition, so the best people aren't necessarily here, but it's still interesting — it gets us somewhere around 10th. And in fact, looking at who we're competing against: I notice this is a fast.ai student, this is a fast.ai student, and these people up here, I know they actually posted that they cheated — they went and downloaded the original images and trained on those. That's why they call this a playground competition: it's not real, it's just to let us try things out. But you can basically see that out of 200-something people, we're getting some very good results without doing anything remotely interesting or clever — and we haven't even used the whole data set, only 80% of it. To get a better result, I would go back, remove that validation set, rerun the same steps, and submit that — just use 100% of the data. I have three questions. The first one: the classes in this case are unbalanced, unlike the dogs and cats. It's not totally balanced, but it's not bad — it's between 60 and 100 per class, so it's not unbalanced enough that I would give it a second thought. Okay, but let's just say it's very unbalanced — what do you do then? Yeah, let's get to that later in this course — don't let me forget. The short answer is that a paper came out about two or three weeks ago on
this, and it said the best way to deal with very unbalanced data sets is basically to make copies of the rare cases. My second question: I want to pin down the difference between precompute=True and freezing — you have these two options at the beginning. Right — and at the start, not only are the layers frozen, they're precomputed, so data augmentation doesn't do anything at that point. So before you unfreeze everything, what exactly is happening? We're going to learn more about the details as we look into the math in coming lessons, but basically what happened was: we started with a pretrained network, which was finding activations with these kinds of rich features, and then we added a couple of layers on the end of it, which start out random. With everything frozen — and indeed with precompute=True — all we are learning is those couple of layers that we've added. With precompute=True, we actually pre-calculate how much each image has something that looks like this eyeball, or this face, and so forth, and therefore data augmentation doesn't do anything, because we're showing exactly the same precomputed activations each time. We can then set precompute=False, which means it's still only training those last couple of layers we added — it's still frozen — but data augmentation is now working, because it's actually going through and recalculating all of the activations from scratch. And then finally, when we unfreeze, that's saying: okay, now you can go ahead and change all of these earlier convolutional weights as well. So why use precompute=True at all? The only reason is that it's much faster — about ten or more times faster. So,
particularly if you're working with quite a large data set, it can save quite a bit of time, but there's never an accuracy reason to use precompute=True — it's just a shortcut. It's also quite handy if you're throwing together a quick model; it can take just a few seconds to create one. My last question, which I think you've partly answered: in your suggestions for building a model you have this staged approach — what if we just want one initial setting, without checking after each step? Is that okay? If your question is whether there's a shorter version of this that's a bit quicker and easier: I could delete a few things here. I think this is a kind of minimal version that gets you a very good result. Don't worry about precompute=True, because that's just saving a little bit of time. I'd still suggest you run lr_find at the start to find a good learning rate. By default everything is frozen from the start, so you can just go ahead and run two or three epochs with cycle_len equals one, unfreeze, and then train the rest of the network with differential learning rates. So it's basically three steps: learning rate finder; train the frozen network with cycle_len equals one; then train the unfrozen network with differential learning rates and cycle_mult equals two. That's something you could turn into, I guess, five or six lines of code total. I think there's a question behind you. Does reducing the batch size only affect the speed of training? Yeah, pretty much. Again, all this stuff about precomputing and batch sizes will make a lot more intuitive sense when we dig into the details of the algorithms, but basically if you're showing it fewer images each time, then it's calculating the gradient with fewer images,
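The three-step minimal recipe just described can be sketched in the fastai-0.7-style API used in the lecture; treat this as pseudocode — exact names and signatures may differ in your installed version, and the 10x spacing assumes ImageNet-like data (use 3x spacing for very different data):

```
learn = ConvLearner.pretrained(arch, data)
learn.lr_find()                                # 1. find a good learning rate
learn.fit(lr, 3, cycle_len=1)                  # 2. train the frozen network
learn.unfreeze()
learn.fit([lr/100, lr/10, lr], 3,              # 3. differential learning rates
          cycle_len=1, cycle_mult=2)           #    with cycle_mult=2
```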
which means the gradient is less accurate — knowing which direction to go, and how far to go in that direction, is less accurate. So as you make the batch size smaller, you're basically making training more volatile. It does impact the optimal learning rate you'd need to use, but in practice I generally find I'm only dividing the batch size by two or at most four, and it doesn't seem to change things very much. Should I reduce the learning rate accordingly? If you change the batch size by much, you can rerun the learning rate finder to see if it's changed, but since we're generally only looking at powers of 10, it probably isn't going to change things enough to matter. This is a conceptual, basic question: going back to your previous slide, where you showed what the different layers were doing — for that slide, am I right that the meaning of, say, the third or fourth column is that you're interpreting what the layer is doing based on which images actually trigger that filter? Yes — and we're going to look at this in more detail. These gray ones basically show what the filter looks like. On the first layer you can see exactly what the filter looks like, because its inputs are pixels — remember, we looked at what a convolutional kernel is, that three-by-three thing; these look like seven-by-seven kernels — so you can say this is actually what it looks like. But later on, the inputs to a layer are themselves activations, which are combinations of activations, which are combinations of activations, so you can't draw the filter directly. There's a clever technique that Zeiler and Fergus created which allowed them to say: this is kind of what the filters
tended to look like on average — so that's kind of what the pictures on the left show — and then here are specific examples of patches of images that activated that filter highly. The photos are the ones I find more useful, because they tell you: this kernel is kind of a unicycle-wheel finder. And the schematics on the left seem to match up pretty closely to what it's keying on, on the right — how do we know that's in fact what the filter was? We may come back to that — if not in this part, then in part two — because this paper uses something called a deconvolution to create these visualizations, which I'm pretty sure we won't cover in this part, but we will in part two. If you're interested, check out the paper — there's a link to it in the notebook; it's Zeiler and Fergus. It's a very clever technique and not terribly intuitive. So, you mentioned that it was good that the dog took up the full picture, and that it would have been a problem if it was off in one of the corners and really tiny — what would your technique have been to make that work? Something we'll learn about in part two, but basically there's a technique that lets you figure out roughly which parts of an image are most likely to contain the interesting things, and then you can crop out those bits. If you're interested in learning about it, we did cover it briefly in lesson seven of part one, but I'm going to do it properly in part two of this course, because I didn't really cover it thoroughly at all. Maybe we'll find time to have a quick look at it — we'll see; I know Yannet's written some of the code we need already. So, once I have something like this notebook basically working, I can immediately make it better by doing two things. Assuming the image size I was using
is smaller than the average size of the images we've been given, I can increase the size — and as I showed before with the dog breeds, you can actually increase it during training. The other thing I can do is use a better architecture. We're going to talk a lot in this course about architectures, but basically there are different ways of putting together what size convolutional filters to use and how they're connected to each other, and different architectures have different numbers of layers, sizes of kernels, numbers of filters, and so forth. The one we've been using, ResNet-34, is a great starting point and often a good finishing point, because it doesn't have too many parameters and often works pretty well with small amounts of data, as we've seen. But there's an architecture I really like called ResNeXt — not ResNet, ResNeXt — which was actually the second-place winner in last year's ImageNet competition. Like ResNet, you can put a number after ResNeXt to say how big it is, and my next step after ResNet-34 is always ResNeXt-50. Now, ResNeXt-50 can take like twice as long as ResNet-34, and two to four times as much memory. So what I wanted to do was rerun that previous notebook with ResNeXt and increase the image size to 299. Here I just said architecture equals resnext50, size equals 299, and then I found I had to take the batch size all the way back to 28 to get it to fit. My GPU is 11 gig; if you're using AWS or Crestle I think they're like 12 gig, so you might be able to make it a bit higher, but this is what I found I had to do. Then — this is literally a copy of the previous notebook, so you can actually go File, Make a Copy, and rerun it with these different parameters — I deleted some of the prose and some of the exploratory stuff, so you
know, basically, everything else is the same — all the same steps as before. In fact, you can see what this minimal set of steps looks like. I didn't need to worry about the learning rate finder, so I just left it as it was: transforms, data, learn, fit with precompute; precompute=False; fit with cycle_len equals one; unfreeze; differential learning rates; fit some more. You can see here I didn't do the cycle_mult thing, because I found that now that I'm using a bigger architecture with more parameters, it was overfitting pretty quickly. Rather than cycle_len equals one never finding the right spot, it actually did find the right spot, and if I used longer cycle lengths I found my validation error was higher than my training error — it was overfitting. But check this out: using these three steps, plus TTA, I got 99.75%. What does that mean? It means I have one incorrect dog and four incorrect cats — and when we look at the pictures of them, my incorrect cat has a dog in it, this one isn't clearly either, and neither is this one, so I've actually got one real mistake, and my incorrect dog is just teeth. So we're at a point where we can train a classifier so good that it makes basically one mistake — and when people say we have superhuman image performance now, this is kind of what they're talking about. When I looked at the dog breed one I did this morning, it was getting the dog breeds much better than I ever could. This is what you can get to if you use a really modern architecture like ResNeXt — and this only took, I don't remember exactly, like 20 minutes to train. So that's kind of where we're up to. If you wanted to do satellite imagery instead, it's the same thing — and in fact the Planet satellite data set is already on Crestle, so if you're using Crestle you can jump straight there. I just linked it into data
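As an aside, the TTA used above just averages the model's class probabilities over several augmented versions of each image. A plain-Python sketch of the idea (function name and toy numbers are mine, not fastai's implementation):

```python
def tta_average(pred_sets):
    """Average class probabilities over several augmented versions of an
    image -- the idea behind test-time augmentation (TTA).

    pred_sets: one prediction list per augmentation; each inner list
    holds the per-class probabilities for the same image."""
    n = len(pred_sets)
    return [sum(ps) / n for ps in zip(*pred_sets)]

# Two augmentations of one image, probabilities for [cat, dog]
avg = tta_average([[0.6, 0.4], [0.8, 0.2]])   # roughly [0.7, 0.3]
```

Averaging smooths out augmentation-dependent wobble in the predictions, which is why TTA usually buys a small but real accuracy improvement.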
And I can do exactly the same thing: ImageClassifierData.from_csv. You can see these three lines are actually exactly the same as my dog breeds lines: how many lines are in the file, grab my validation indexes, and this get_data is identical except I've changed side_on to top_down. Satellite images are taken top down, so I can flip them vertically and they'll still make sense. And you can see here I'm doing this trick of starting with size equals 64 and training a little bit first. Then the learning rate finder, and interestingly, in this case, you can see it wants really high learning rates. I don't know what it is about this particular dataset, but clearly I can use super high learning rates, so I used a learning rate of 0.2. So I trained for a while, then differential learning rates, and remember I said that if the dataset's very different to ImageNet, I probably want to train those middle layers a lot more, so I'm using divided by 3 rather than divided by 10. Other than that it's the same thing, with cycle_mult equals 2. And I was kind of keeping an eye on it, so I plotted the loss; if you go learn.sched.plot_loss() you can see here's the first cycle, here's the second cycle, here's the third cycle, and you can see it gets better, pops out, gets better, pops out, gets better, pops out, and each time it finds something better than the last time. Then set the size up to 128 and repeat exactly the last few steps, then set it up to 256 and repeat the last two steps, then do TTA, and if you submit this, it gets about 30th place in this competition. So these basic steps work super well. This thing where I went all the way back to a size of 64, I wouldn't do that if I was doing dogs and cats or dog breeds, because if the thing I was working on was very similar to ImageNet, 64 by 64 is so small that I would kind of destroy those ImageNet weights.
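By the way, those pops in the loss plot come from the cosine-annealing-with-restarts schedule: cycle_len sets the length of the first cycle in epochs, and cycle_mult makes each subsequent cycle that many times longer. Here's a rough sketch of that schedule (the function name and the steps-per-epoch count are mine for illustration, not fastai's):

```python
import math

def sgdr_lrs(lr_max, cycle_len, cycle_mult, n_cycles, steps_per_epoch):
    """Cosine annealing with warm restarts, as driven by cycle_len/cycle_mult."""
    lrs, epochs = [], cycle_len
    for _ in range(n_cycles):
        steps = epochs * steps_per_epoch
        for t in range(steps):
            # anneal from lr_max towards 0 within the cycle, then restart
            lrs.append(lr_max * (1 + math.cos(math.pi * t / steps)) / 2)
        epochs *= cycle_mult            # next cycle is cycle_mult times longer
    return lrs

sched = sgdr_lrs(lr_max=0.2, cycle_len=1, cycle_mult=2,
                 n_cycles=3, steps_per_epoch=100)
# cycles of 100, 200 and 400 steps; the rate jumps back to 0.2 at each restart
```

Each restart is where the loss briefly "pops out" in the plot before settling into a better spot.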
But in this case the satellite imagery data is so different to ImageNet that I really found it worked pretty well to start right back at these tiny images, and it really helped me to avoid overfitting. Interestingly, using this kind of approach, I found that even using only 128 by 128 I was getting much better Kaggle results than nearly everybody on the leaderboard. And when I say 30th place, this is a very recent competition, and I find that in the last year a lot of people have got a lot better at computer vision, so the people in the top 50 in this competition were generally ensembling dozens of models, with lots of people on a team and lots of pre-processing specific to satellite data, and so forth. So to be able to get 30th using this totally standard technique is pretty cool. All right, so now that we've got to this point, we've got through two lessons, and if you're still here then hopefully you're thinking, okay, this is actually pretty useful, I want to do more, in which case Crestle might not be where you want to stay. I mean, Crestle is pretty handy and it's pretty cheap. And something we haven't talked about much is Paperspace, which is another great choice, by the way. Paperspace is shortly going to be releasing Crestle-like instant Jupyter Notebooks; unfortunately they're not quite ready yet, but they basically have the best price-performance relationship right now, and you can SSH into them and use them, so they're also a great choice. Probably by the time this is a MOOC we'll have a separate lesson showing you how to set up Paperspace, because they're likely to be the great option. But at some point you're probably going to want to look at AWS, for a couple of reasons. The first is, as you all know by now, Amazon have been kind enough to donate about $200,000 worth of compute time to this course.
So I want to say thank you very much to Amazon. Everybody who's here has been given credits, so thanks very much, AWS. Sorry if you're on the MOOC, we didn't get them for you, but everybody here, that's AWS credits for everybody. But even if you're not here in person, you can get AWS credits from lots of places: GitHub has a student pack, so Google for "github student pack", and that's like 150 bucks worth of credits, and AWS Educate can get you credits. These are all for students, so there's lots of places you can get started on AWS. A lot of the people that you might work with will be using AWS, because it's super flexible. Right now AWS has the fastest GPU you can get in the cloud, the P3s. They're kind of expensive at three bucks an hour, but if you've got a model where you've done all the steps before and you're thinking this is looking pretty good, then for six bucks you could get a P3 for two hours and run at turbo speed. We didn't start with AWS because (a) its cheapest GPU is like twice as expensive as Crestle, and (b) it takes some setup. But I wanted to go through and show you how to get your AWS set up, and so we're going to be going slightly over time to do that, so feel free to leave if you have to, but I want to show you very quickly how you can get AWS set up right from scratch. So basically you have to go to console.aws.amazon.com and it'll take you to the console, and you can follow along on the video with this, because I'm going to do it very quickly. From here you have to go to EC2; this is where you set up your instances. And from EC2 you need to do what's called launching an instance, which means you're basically creating a computer, a computer on Amazon. So I say Launch Instance.
And what we've done is we've created a fast.ai AMI. An AMI is like a template for how your computer is going to be created. So if you go to Community AMIs and type in fastai, you'll see that there's one there called fastai part 1 version 2 for the p2, so I'm going to select that. Then we need to say what kind of computer we want, so I can say I want a GPU compute computer, and then I can say I want a p2.xlarge; this is the cheapest reasonably effective instance type they have for deep learning. And then I can say Launch, and at this point they ask you to choose a key pair. Now, if you don't have a key pair, you have to create one. To create a key pair, you need to open your terminal. If you've got a Mac or a Linux box, you've definitely got one; if you've got Windows, hopefully you've got Ubuntu, and if you don't already have Ubuntu set up, you can go to the Windows Store, click on Ubuntu, and get it from there. From the terminal, you basically go ssh-keygen, and that will create like a special password for your computer to be able to log into Amazon; you just hit Enter three times, and that creates the key that you can use to get into Amazon. So then what I do is copy that key somewhere I know where it is. It'll be in the .ssh folder, and it's called id_rsa.pub, so I'm going to copy it to my hard drive; if you're on a Mac or Linux it'll already be in an easily found place, your .ssh folder. Let's put it in Documents. From there, back in AWS, you have to tell it that you've created this key, so you go to Key Pairs, say Import Key Pair, and just browse to that file you just created, there it is, and I say Import. So if you've ever used SSH before, you've already got the key pair and you don't have to do those steps, and if you've used AWS before, you've already imported it, so you don't have to do that step.
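The key-creation step amounts to something like this (hitting Enter three times at the interactive prompts gives you the default path ~/.ssh/id_rsa instead; the /tmp path here is purely for illustration):

```shell
# Clear out any leftover demo key so ssh-keygen won't prompt to overwrite
rm -f /tmp/aws_demo_key /tmp/aws_demo_key.pub

# Generate an RSA key pair non-interactively; -N "" means no passphrase
ssh-keygen -t rsa -N "" -f /tmp/aws_demo_key

# The .pub half is the file you import into AWS under Key Pairs
cat /tmp/aws_demo_key.pub
```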
But if you haven't done either of those things, you have to do both steps. So now I can go ahead and launch my instance: Community AMIs, search fastai, Select, Launch, and now it asks me, where's your key pair, and I can choose the one that I just imported. So this is going to go ahead and create a new computer for me to log into, and you can see here it says the launch has been initiated, and if I click on that, it'll show me this new computer that I've created. To be able to log into it, I need to know its IP address, so here it is, there's the IP address, and I can copy that. To get to this computer I need to SSH to it; SSHing into a computer means connecting to it so that it's as if you're typing at that computer. So I type ssh, and the username for this instance is always ubuntu, and then I can paste in that IP address. Then there's one more thing I have to do, which is to connect up the Jupyter Notebook on that instance to the Jupyter Notebook on my machine, and to do that there's just a particular flag that I set; we can talk on the forums about exactly what it does, but you just type -L 8888:localhost:8888. Once you've done it once, you can save that as an alias and type the same thing every time. So we can check here, we can see it says that it's running, so we should be able to now hit Enter. The first time we ever connect to it, it just checks that this is okay, and I'll say yes, and then that goes ahead and SSHes in. So this AMI is all set up for you. You'll find that the very first time you log in it takes a few extra seconds, because it's just getting everything set up, but once you've logged in, you'll see there's a directory called fastai, and the fastai directory contains our fastai repo, which contains all the notebooks, the code, and so forth. So I can just go cd fastai.
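Putting those pieces together, the full connection command looks something like this (the IP address below is a placeholder for whatever your instance shows in the EC2 console; -L 8888:localhost:8888 forwards your local port 8888 to the instance's port 8888, which is why the notebook URL later works in your local browser):

```shell
# Placeholder IP; substitute your instance's public IP from the EC2 console
ssh -L 8888:localhost:8888 ubuntu@54.123.45.67
```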
The first thing you do when you go in is to make sure it's updated, so you just go git pull, and that makes sure that your repo is the same as the most recent repo; as you can see there, we've now got all the most recent code. The second thing you should do is type conda env update. You can just do this maybe once a month or so, and it makes sure the libraries there are all the most recent libraries; I'm not going to run it now, because it takes a couple of minutes. And then the last step is to type jupyter notebook, and this is going to go ahead and launch the Jupyter Notebook server on this machine. Again, the first time you do anything on AWS it takes like a minute or two, and then once you've done it, in the future it'll be just about as fast as running locally. So you can see it's going ahead and firing up the notebook, and what's going to happen is that because when we SSHed in we said to connect our notebook port to the remote notebook port, we're just going to be able to use this locally. So I see it says here, copy/paste this URL, so I'm going to grab that URL and paste it into my browser, and that's it. So this notebook is now actually not running on my machine; it's running on AWS, using the AWS GPU. We've got a lot of memory; it's not the fastest around, but it's not terrible, and you can always fire up a P3 if you want something super fast. This is costing me 90 cents an hour, so when you're finished, please don't forget to shut it down. To shut it down, you can right-click on it and say Instance State, Stop. We've got 500 bucks of credit each, assuming that you put your code down in this spreadsheet. One thing I forgot to do the first time I showed you this, by the way: I said make sure you choose a p2, and then the second time I went through, I didn't choose p2 by mistake, so just don't forget to choose GPU compute, p2. Do you have a question? "You said 90 cents an hour?" Thank you, yes, 90 cents an hour; thanks for checking that.
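Those hourly rates add up very differently depending on how you use them, so it's worth doing the arithmetic (rates as quoted in the lesson: p2 at $0.90/hour, p3 at $3/hour):

```python
p2_hourly, p3_hourly = 0.90, 3.00    # USD per hour, as quoted in the lesson

# A quick p3 burst: two hours at turbo speed
p3_burst = p3_hourly * 2             # $6.00

# Forgetting to Stop a p2 instance for a 30-day month
p2_forgotten = p2_hourly * 24 * 30   # $648.00 -- so do shut it down!

print(p3_burst, p2_forgotten)
```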
It also costs, I don't know, three or four bucks a month for the storage as well. All right, see you next week; sorry, we've run over.