Peter, there we go. Yeah, question.

Yeah, it's a joke. The training process in fastai: is there a concept or capability to do early stopping, or a keep-the-best kind of thing? Or if there isn't, is there a reason why you chose not to do that?

I never remember, because I don't use it myself. So what I would check, and I'm just checking now, is the callbacks, which is under training. I'm just going to the docs: training, callbacks. And if anybody else knows, please shout out.

Probably early stopping callback.

Yeah, I found it. Okay, it's under tracking callbacks. So if you go to the docs, training, callbacks, tracker, there's an early stopping callback. So perhaps the more interesting part then is, why do I not use it, to the point that I don't even know whether it exists? There's a few reasons. One is that it doesn't play nicely with one cycle training or fine tuning. If you stop early, then the learning rate hasn't had a chance to go down. And for that reason, it's almost never the case that earlier epochs have better accuracy, because the learning rate hasn't settled down yet. If I was doing one cycle training and I saw that an earlier epoch had a much better accuracy, then I would know that I'm overfitting, in which case I would be adding more data augmentation rather than doing early stopping, because it's good to train for the amount of time that you have. So yeah, I haven't come across a situation where I've personally wanted to use early stopping.

So in some of the training examples where you had the error rate, some of the prior runs may have had a better, lower error rate.

In the ones I've shown, like a tiny bit better? Yeah, but not enough to be meaningful. And so there's no reason to believe that those are actually better models.
And there's plenty of a priori reason to believe that they're actually not, which is that the learning rate still hasn't settled down at that point. So we haven't let it fine tune into the best spot yet. So if it's going down and down, and it's bottoming out and just bumps a little bit at the bottom, that's not a reason to use early stopping. It's also, I think, important to realize that the validation set is relatively small as well. So it's only a representation of the distribution that the data is coming from. So reading too much into those small fluctuations can be very counterproductive. I know that I've wasted a lot of time in the past doing that. We're looking for changes that dramatically improve things, like changing from ResNet26d to ConvNeXt, where we improved by what, four or 500%. And it's like, okay, that's an improvement.

Over the weekend, I went on my own server that I have here behind me, that I have a 10 API. And I ran all like 35 models for the Paddy thing. I didn't do the example, but I was thinking about this: when I was taking algebra back in high school or college, you have some of these expressions where the function of x is equal to x squared for x greater than something, and the absolute value of x when x is equal to something. So my idea is that maybe some of the dataset is going to fail the target value for every single one of the models that we tried. But if we try different models, it's going to be successful. So can we do that? I mean, of course we can, but what would be the easiest approach to say, for this validation, when x is equal to this or greater than that, this is the model to use, and otherwise this other model is what you have to use?

Yeah, I mean, you could do that, right?
And a really simple way to do that, which I've seen used to some success on Kaggle, is to train lots of models and then to train a gradient boosting machine whose inputs are those models' predictions and whose output is the targets. And so that'll do exactly what you just described. It's very easy to overfit when you do that. And if you've trained them well, you're only going to get a tiny increase, right? Because a neural net's very flexible, it shouldn't have that situation where in this part of the space it has bad predictions and in this part of the space it has good predictions. That's not really how neural nets work. If you had a variety of totally different types of model, like a random forest, a GBM, and a neural net, I could see that, maybe. But most of the time, one of those will be dramatically better than the other ones. And so I don't that often find myself wanting to ensemble across totally different types of model. So I'd say it's another one of these things, like early stopping, which a lot of people waste huge amounts of time on. And it's not really where the big benefits are going to be seen. But yeah, if you're in gold medal zone on a Kaggle competition and you need another 0.002% or something, then these are all things you can certainly try at that point.

That question reminded me of AutoML, that regime of tools. I don't know how you feel about those things.

Yeah, we talked about that in last night's lesson, actually. So you'll have to catch up to see what I said if you haven't seen the lesson yet. I'll mention also that reading Kaggle winners' descriptions of their approaches is great. But you've got to be very careful, because remember the Kaggle winners are the people who did get that last 0.002%, because everybody found all the low hanging fruit, and the people who won grabbed the really high hanging fruit.
And so every time you read a Kaggle winner's description, they almost always have complex ensembling methods. And that's why in something like a big image recognition competition, it's very hard to win, or probably impossible to win, with a single model, unless you invent some amazing new architecture or something. And so you might get the impression then that ensembling is the big thing that gets you all the low hanging fruit, but it's not. Ensembling, or particularly complex ensembling, is the thing that gets you that last fraction of a fraction of a percent.

One more question?

Yeah, of course.

The TTA concept, right? I'm trying to understand conceptually why TTA improves the score. Because technically when you're training, it is using those augmented pictures and providing a percentage number. But when you run that TTA function, why is it able to predict better?

Sure. So you know how sometimes you're looking at, I don't know, a screw head or a socket or something that's really small, and you can't quite see how many pins are in it or what type it is or whatever. And you're looking at it from different angles, and you put it up to the light, and at some point you're like, okay, I see it. Right? And there's some angle and some lighting where you can see it. That's what you're doing for the computer. You're giving it different angles and you're giving it different lighting in the hope that in one of those it's going to be really clear. And for the ones where it's easy, it's not going to make any difference, right? But for the ones where it's like, oh, I don't know if it's this disease or that disease, but when it's a bit brighter and you zoom into that section, oh, now I can see. And so when you then average them out, the other ones are all like, oh, I don't know which kind it is.
It's like 0.5, 0.5, 0.5. And then this one is like 0.6. And so that's the one that, in the average, it's going to end up picking. That's basically what happens.

It also has another benefit, which is when we train our models, I don't know if you've noticed, but our training loss generally gets much lower than our validation loss. Basically what's happening there is that on the training set, the model is getting very confident, right? Because even though we're using data augmentation, it's seeing slightly different versions of the same image dozens of times. And it's like, oh, I know how to recognize these. And so the probabilities it associates with them are like 0.9, 0.99, saying, I'm very confident of these. And it actually gets overconfident, which doesn't necessarily impact our accuracy. But at some point it can. And so we are systematically going to have overconfident predictions of probability, even when it doesn't really know, just because it's seen that kind of image before. So then on the validation set, it's going to be overpicking probabilities as well. And so one nice benefit is that when you average out a few augmented versions, it's like, 0.9 probability it's this one. And then on the next one, an augmented version of the same image, it's like, oh, no, 0.1 probability it's that one. And they'll average out to much more reasonable probabilities, which can allow it sometimes to combine these ideas into an average that makes more sense. And so that can improve accuracy, but in particular, it improves the actual probabilities, to get rid of that overconfidence.

Is it fair to say that when you train, it's not able to group the replicated images, the slightly distorted variants of the original image?
But when you use TTA, it is able to group all four images and pick.

Well, that's what TTA is. We present them all together and average out that group. Yes. But in training, we don't indicate in any way that they're the same image, although they're the same underlying object.

One other question, Jeremy. I'm glad we touched on ensembling and how to pick the best ensembling. I was going to ask about that. But another question is, we have a fairly unbalanced dataset, I guess, with the normal versus the disease states. When you're doing augmentation, is there any benefit to over-representing the minority classes?

So let's pull away augmentation. It's actually got nothing to do with augmentation. So more generally, when you're training, does it make sense to over-represent the minority class? And the answer is maybe, yeah, it can, right? And so, just for those who aren't following, the issue Matt's talking about is that there were a couple of diseases which appear lots and lots in the data and a couple which hardly appear at all. And so, do we want to try to balance this out more? And one thing that people often do to balance it out more is that they'll throw away some of the images in their more highly represented classes. And I can certainly tell you straight away, you should never, ever do that. You never want to throw away data. But Matt's question was, well, could we oversample the less common diseases? And the answer is, yeah, absolutely you could. And in fastai, if you go into the docs, now where is it? There is a weighted data loader somewhere. Here we go. Of course, it's a callback. So if you go to the callbacks data section, you'll find a weighted DL callback, or a weighted dataloaders method.

Are you sharing your screen?

I'm not. No, I'm just telling you where to look. Thanks for checking. So, yeah, I mean, let's look at that today, right?
Because I kind of want to look at things we can do to improve things today. It doesn't necessarily help, because, let's say you do 10 epochs of a thousand images, it's going to get to look at 10,000 images, right? And if you oversample a class, then that also means that it's going to see less of some images and get more repetition of other images, which could be a problem. And really, it just depends on a few things. If it's really unbalanced, like 99% all of one type, then you're going to have a whole lot of batches where it never sees anything of the underrepresented class. And so basically, there's nothing for it to learn from. So at some point, you almost certainly need weighted sampling. It also depends on the evaluation. If they say in the evaluation, okay, we're going to average out, for each disease, how accurate you were, so every disease will then be equally weighted, then you would definitely need to use weighted sampling. But in this case, presuming that the test set has a similar distribution to the training set, weighted sampling might not help, because they're going to care the most about how well we do on the highly represented diseases.

I'll note my experience with oversampling and things like that. I think one time I had done it with diabetic retinopathy; there was a competition for that. And I had used weighted sampling or oversampling, and it did seem to help. And then also a while back, I did an experiment, I think this was back with fastai version one, where I took the MNIST dataset, and then I artificially added some sort of imbalance. And then I trained with and without weighted sampling. And I saw there was an improvement with the weighted sampling on accuracy, on just a regular MNIST validation set.
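To put a rough number on the point above about very unbalanced data, here is a minimal sketch in plain Python. It assumes batches are drawn uniformly at random, and both helper names (`prob_batch_misses_class`, `inverse_frequency_weights`) are hypothetical, made up for illustration; they are not fastai functions. The first estimates how often a batch contains no example of a rare class, and the second builds per-item weights that would give every class the same total sampling weight.

```python
from collections import Counter


def prob_batch_misses_class(class_fraction, batch_size):
    """Chance that a uniformly sampled batch has no example of a class
    that makes up `class_fraction` of the data (independent draws assumed)."""
    return (1.0 - class_fraction) ** batch_size


def inverse_frequency_weights(labels):
    """One weight per item, equal to 1 / (count of that item's class),
    so that every class ends up with the same total weight."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# With a 1% minority class and batches of 64, roughly half of all batches
# contain no minority example at all.
p_rare = prob_batch_misses_class(0.01, 64)

# With a 25% minority class (closer to the 75/25 case mentioned in the
# discussion), a batch with no minority example is extremely unlikely.
p_mild = prob_batch_misses_class(0.25, 64)

weights = inverse_frequency_weights(["blast", "blast", "blast", "normal"])
```

Weights like these are the kind of per-row values a weighted sampler takes as input; whether oversampling actually helps still depends on how the predictions are evaluated, as discussed above.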
So from those couple of experiments, I'd say I've at least seen some help and improvement with weighted sampling.

Cool. And was that in cases where the dataset was highly unbalanced? Or was it more like the dataset that we're looking at at the moment?

It wasn't highly unbalanced. It was maybe, I don't know, just 75% versus 25% or something like that. It's not like 99 versus 1%. Nothing like that. It wasn't that bad.

Let's try it today. Yeah. I see we've got a new face today as well. Hello, Zach. Thanks for joining.

Hey, hey, glad I could finally make these.

Yeah. Are you joining from Florida?

No, I'm in Maryland now.

Maryland now. Okay. That's quite a change.

Yes. Much more up north.

Okay, great. So let's try something. Okay, so let's connect to my little computer upstairs. Is there a way to shrink my Zoom out of the way? It takes up so much space. Hide floating meeting controls. I guess that's what I want. Control, Alt, Shift, H. Wow. Press Escape to show floating meeting controls. That doesn't work very well with Vim. Oh, well, Control, Alt, Shift, H. Okay. All right. We're not doing tabular today. So let's get rid of that. So I think what I might do is, because we're iterating, well, I guess we could start with the multitask button, because this is our things-to-try-to-improve version. I'll close that. I'll leave that open just in case we want it. Okay. By the way, if you've got multiple GPUs, this is how you just use one of them. You can just set an environment variable. Okay. So this is where we did the multi-target model. Just moved everything slightly, not comp path, back to where we were. Okay. So now what? What's broken? Data block, image files. Oh, this was working the other day. So I guess we'd better try to do some debugging.
So the obvious thing to do would be to call this thing here, get image files, on the thing that we passed in here, which is train path. Okay. So that's working. Then the other thing to do would be to check our data by doing show batch. Okay. That's working. Then I guess, all right, and it's doing our two different things. That's good. Oh, is it? Oh, right. We've got the two category blocks. So we can't use this one. We have to use this one. So fit one cycle. So to remind you, this is the one where we had two categories and one input. And to get the two categories, we used the parent label and this function, which looked up the variety from this dictionary. Okay. And then when we fine-tuned it, and let's just check, yeah, seed equals 42. So that's our standard set. We should be able to then compare that to small models, trained for 12 epochs. And then that was this one, part two. They're not quite the same, because this was 480 squish, whereas this was rectangular pad. Let's do five epochs. Let's do it the same as this one. Yeah, let's do this one, because we want to be able to do quick iterations. Let's see, resize 192 squish. And then we trained it at 0.01 with FP16 for five epochs. All right. So this will be our base case. Well, I guess this is our base case, 0.045. This will be our next case. Okay.

So while that's running, the next thing I wanted to talk about is progressive resizing. So this is training at a size of 128, which is not very big, and we wouldn't expect it to do very well. But it's certainly better than nothing. And as you can see, its disease error is down to 7.5% already, and it's not even done. So that's not bad. And in the past, what we've then done is we've said, okay, well, that's working pretty well. Let's throw that away and try bigger. But there's actually something more interesting we can do, which is we don't have to throw it away.
What we could do is to continue training it on larger images. So we're basically saying, okay, this is a model which is fine tuned to recognize 128 by 128 pixel images of rice. Let's fine tune it to recognize 192 by 192 pixel images of rice. And there's a few benefits to that. One is that it's very fast to do the smaller images, and it can recognize the key features from them. So this lets us do a lot of epochs quickly. And then the difference between small images of rice disease and large images of rice disease isn't very big. So you would expect it would probably fine tune to bigger images of rice disease quite easily. So we might get most of the benefit of training on big images, but without most of the time. The second benefit is it's a kind of data augmentation, which is that we're actually giving it different sized images. So that should help.

So here's how we would do that. Let's grab this data block, and let's make it into a function, get DL. Okay, and the key thing, I guess, well, let's just do the item transforms and the batch transforms as usual. Whoops. So the things we're going to change are the item transforms and the batch transforms. And then we're going to return the data loaders for that, which is here. So let's try going up a bit. DLs equals get DL. I guess it should be get DLs really, because it returns data loaders. Get DLs, item. Okay, so let's see what we did last time. Let's scale up a bit. So this is going to be data augmentation as well. We're going to change how we scale. So we'll scale with zero padding. And let's go up to 160. Okay. So then we need a learner. So where's our squish one here? Squish. So the squish here got 0.045. Our multitask got 0.048. So it's actually a little bit worse.
This might not be a great test actually, because I feel like one of the reasons that doing a multitask model might be useful is that we might be able to train for more epochs, because we're giving it more signal. So we should probably revisit this with like 20 epochs. Any questions or comments about progressive resizing while we wait for this to train?

Sorry, I can't see how you progressively change the size.

I actually didn't. I messed it up. Whoops. Thank you. I have to do that again. Oh, and we need to get our DLs back as well. Okay. Let's start again. Okay. And in case I mess this up again, let's export this. We'll call this stage one. So yeah, the problem was we created a new learner. So what we should have done is gone learn.dls = dls. That would actually change the data loaders inside the learner without recreating it. Was that where you were heading with your comment?

There was an unfreeze method, right?

Sorry?

There was an unfreeze method, and the same thing in the book actually mentions using the unfreeze method.

There is an unfreeze method. Yes. What were you saying about the unfreeze method?

Isn't unfreeze required for progressive resizing? Am I wrong?

No, because fine tune has already unfrozen, although I actually want to fine tune again. So if anything, I actually want to refreeze it, because we've changed the resolution. I think fine tuning the head might be a good idea to do again.

Which line of code is doing the progressive resizing part, just to be clear?

It's not a single line of code. I mean, it's basically this. It's basically saying our current learner is getting new data loaders. And the new data loaders have a size of 160, whereas the old data loaders had a size of 128.
And our old data loaders did a pre-sizing of 192 squish and our new data loaders are doing a pre-sizing of rectangular padding. Does that make sense?

Why are you calling it progressive in this case? Are you going to keep changing the size or something like that?

Yeah, it's changing the size of the images without resetting the learner.

Just looked it up because I was curious. Fine tune calls a freeze first.

I had a feeling it did. Thanks for checking, Zach. So this time, it'll be interesting, right, to see how it does. So after the initial epoch, it's got a 0.09, right? Whereas previously it had 0.27. So obviously it's better than last time, but it's actually worse than the final point, right? Last time it got all the way to 0.0418, whereas this time it has got worse. So it's got some work to do to learn to recognize what 160 pixel images look like.

Can I just clarify, Jeremy? I'd say you're doing one more step in the progressive resizing here. It's not an automated resizing yet.

Correct. Yeah. There isn't anything in fastai to do this for you. And in fact, this technique is something that we invented, so it doesn't exist in other libraries at all. So yeah, it's the name of a technique. It's not the name of a method in fastai. And the technique is basically to replace the data loaders with ones at a larger size. And we invented it as part of a competition called DAWNBench, which is where we did very well on a competition for ImageNet training. And Google then took the idea and studied it a lot further as part of a paper called EfficientNetV2 and found ways to make it work even better.

Oh my gosh, look at this. So we've gone from 0.0418 to 0.0336. Have we done training at 160 before? I don't think we have. Oh, I should be checking this one. 128, 128. 171 by 128. No, we haven't. This is 256 by 192. So eventually, I guess we're going to get to that point.
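The progressive resizing step described above, where the existing learner is given new data loaders at a larger size instead of being recreated, might be sketched roughly as follows. This is a sketch under stated assumptions rather than the notebook's actual code: it uses a single-target data block instead of the multi-target one, the wrapper name `progressive_resize` is hypothetical, and the 256 pixel padded pre-size in the second stage is an assumed value, since the session only says that stage used padding and a final size of 160. It also assumes fastai and timm are installed and that `trn_path` is a folder with one subfolder of images per class.

```python
def progressive_resize(trn_path, arch="convnext_small_in22k"):
    """Train at 128px, then keep training the SAME learner at 160px."""
    # Imported lazily so that defining this function does not require fastai.
    from fastai.vision.all import (
        DataBlock, ImageBlock, CategoryBlock, get_image_files, parent_label,
        RandomSplitter, Resize, ResizeMethod, PadMode, aug_transforms,
        vision_learner, error_rate,
    )

    def get_dls(item_tfms, batch_tfms):
        # Same data and same split every time; only the transforms change.
        return DataBlock(
            blocks=(ImageBlock, CategoryBlock),
            get_items=get_image_files,
            get_y=parent_label,
            splitter=RandomSplitter(0.2, seed=42),
            item_tfms=item_tfms,
            batch_tfms=batch_tfms,
        ).dataloaders(trn_path)

    # Stage 1: squish to 192, then augment down to 128px.
    dls = get_dls(Resize(192, method="squish"),
                  aug_transforms(size=128, min_scale=0.75))
    # The architecture name is an assumption; string names need timm installed.
    learn = vision_learner(dls, arch, metrics=error_rate).to_fp16()
    learn.fine_tune(5, 0.01)

    # Stage 2: swap in larger data loaders WITHOUT creating a new learner,
    # so the weights learned at 128px carry over.
    # The 256px zero-padded pre-size here is an assumed value.
    learn.dls = get_dls(
        Resize(256, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros),
        aug_transforms(size=160, min_scale=0.75),
    )
    # fine_tune freezes the body first, so the head is re-tuned at the new size.
    learn.fine_tune(5, 0.01)
    return learn
```

The important line is `learn.dls = ...`: calling `vision_learner` again at that point would start over from the pretrained weights, which is the mistake that was caught earlier in the session.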
So let's keep going. So okay, so we're down to 2.9% error.

How did you come up with the idea for this? Is it something that you just wanted to try, or did you stumble upon it while looking at something else?

Oh, I mean, it just seemed very obvious to me as something we should do, because, okay, so on DAWNBench we were training on ImageNet. It was taking 12 hours, I guess, to train a single model. And for the vast majority of that time, it's just recognizing very, very basic things about images. It's not learning the finer details of different cat breeds or whatever, but it's just trying to understand the concepts of fur or sky or metal. I thought, well, there's absolutely no reason to need 224 by 224 pixel images to be able to do that. It just seemed obviously stupid that we would do it. And partly it was also that I was just generally interested in changing things during training. In particular, learning rates, right? So the idea of changing learning rates during training goes back a lot longer than DAWNBench. People had been generally training by having a learning rate that dropped by a lot and then stayed flat, and dropped by a lot and stayed flat. And Leslie Smith in particular came up with this idea of gradually increasing it over a curve and then gradually decreasing it following another curve. And so I was definitely in the mindset of, oh, there's interesting things we can change during training. So I was looking at, what if we change data augmentation during training? For example, maybe towards the end of training, we should turn off data augmentation so it could learn what unaugmented images look like, because that's what we really care about. So yeah, that was the kind of stuff that I was interested in at the time.
And so yeah, definitely this thing of, why are we looking at 224 by 224 pixel images the entire time? That just seemed obviously stupid. And so it wasn't something where I was like, wow, here's a crazy idea, I bet it won't work. As soon as I thought of it, I just thought, okay, this is definitely going to work. And it did.

Interesting. Thanks.

Yeah. No worries.

One question I have for you, Jeremy.

Yeah.

There was a paper that came out in 2019 called Fixing the Train-Test Resolution Discrepancy, I think, where they trained on 224 and then did inference finally on 320 by 320.

Yeah.

Have you seen that still work? Have you done that at all in your workflow?

I mean, honestly, I don't remember. I need to revisit that paper, because you're right, it's important. I would generally try to fine tune on the final size I was going to be predicting on anyway. So yeah, I guess we'll see how we go with this, right? I mean, you can definitely take a model that was trained on 224 by 224 images and use it to predict 360 by 360 images, and it will generally go pretty well. But I think it will go better if you first fine-tune it on 360 by 360 images.

Yeah, I don't think they tried pre-training and then also training on 320, versus just 320 and the 224.

Yeah.

That would definitely be an interesting experiment.

Yeah, it would be an interesting experiment. And it's definitely something that any of us here could do. I think it'd be cool. Right. So let's try scaling this up. We can change these two lines to one. And so this is something I often do, I do things like, yeah.

I think we don't have your screen.

Okay. So as I was saying, previously I had two cells to do this. And so now I'm just going to combine it into one cell. So this is what I tend to do: I fiddle around and try to gradually make things a little bit more concise. Okay.
Does it make sense to go smaller than the original pre-trained model's size, like with ConvNeXt?

Yeah. I mean, you can fine-tune to any size you like. Absolutely. I'm just going to get rid of the zero padding because, again, I want to try to change it a little bit each time. It's a kind of augmentation, right? So, okay. So let's go up to 192. One thing I find encouraging is that my training loss isn't getting way underneath the validation loss. It feels like we could do this for ages before our error rates start going up. Interestingly, when I re-ran this, my error rate was much better, 0.0418.

You've got a good memory to remember these old papers. It's very helpful to be able to do that.

Usually what I wind up doing is my dad and I will email papers back and forth to each other, so I can just go through my sent mail, look for arXiv, and usually if I don't remember the name of it, I remember the subject of it to some degree. Yeah, I can just go through it all.

I mean, it's a very, very good idea to use a paper manager of some sort to save papers, whether it be Mendeley or Zotero or arXiv Sanity or whatever, or bookmarks or something. Because otherwise these things disappear. Personally, I just tend to tweet, or favorite tweets, about papers I'm interested in. And then I've set up pinboard.in. I don't know if you guys have seen that, but it's a really nice little thing where basically anytime you're on a website, you can click a button in the extension and it adds it to Pinboard, but it also automatically adds all of your tweets and favorites, and it's got a full text search of the thing that the URL is linked to, which is very helpful.

So you favorited something that just says oh shit?

No, I actually wrote something that just said oh shit. That was me writing oh shit. It was this, I mean, totally off topic, but this absolute disaster.
I hope it's wrong, but there's an absolutely disastrous sounding paper that came out yesterday that basically, where was this key thing? People who've had one COVID infection have a risk of sequelae of 8.4%, two infections 23%, three infections 36%. It's like my worst nightmare: the more times people get infected with COVID, the more likely it is that they'll get long-term symptoms, which is horrifying. That was my oh shit moment.

It's very horrifying.

It's really awful. Okay, so it keeps going down, right? Which is cool. Let's keep bringing it along, I suppose. I guess what we could do is just grab this whole damn thing here, to have a bit of a comparison. So we're basically going to run exactly the same thing we did earlier, but this time with some pre-sizing first. So that'll be an interesting experiment. So while that's running, this is where I hit the old duplicate button. This is why it's nice, if you can, to have a second card, because while something's running, you can try something else. CUDA_VISIBLE_DEVICES. There we go. So we can keep working.

Okay, so weighted data loader. So this is something I added to fastai a while ago and haven't used much myself since. But if I just search for weighted, here it is. So you can see in the docs, it shows you exactly how to use weighted data loaders. And so we pass in a batch size. We pass in some weights. These are the weights; it's going to be 1, 2, 3, 4, 5, 6, 7, 8. Or it's actually 0, 1, 2, 3, 4, 5, 6, 7. And then some item transforms. Oh, turning things to tensor. So these are really interesting in the docs. In some ways, it's extremely advanced. And in other ways, it's extremely simple, which is to say, if you look at this example in the docs, everything is totally manual, right? So our labels are some random integers. And I've even added a comment here, right? Eight are going to be in the training set, two are going to be in the validation set.
So our data block is going to contain one category block, because we've just got the one thing, right? And rather than doing get x and get y, you can also just say getters. Because get x and get y basically become getters, which is a list of transformations to do. And so this is going to be a single getter, or a single get x if you like, which is going to return the ith label. And a splitter, which is going to decide whether something's valid or not based on this function. So you can see this whole thing is totally manual. So we can create our datasets by passing in a list of the numbers from 0 to 9, and a single item transform that's going to convert that to a tensor. And then our weights will be the numbers from 0 to 7. And so then we can take our datasets and turn them into data loaders using those weights. So with a batch size of 1, if we say show batch, we get back a single number. And it's not doing random shuffling, so we get the number 0, because that was the first thing in our dataset. Let's see, what do we do next? Now we've got n equals 160. So now we've got all of the numbers from 0 to 159.

For getters.

Yes, for getters, yep.

You mentioned this is for x or y?

This is a list. That's whatever, right? So there is just one thing. I don't know if you call that x or you call it y. It's just one thing. So if you have a get x and a get y, that's the same as having a getters with a list of two things. Okay. So yeah, I think I could just write get x here and put this not in a list. It would probably be the same thing. Okay.

To handle a little bit of mystery that might be happening as well: the data block has an n_inp parameter, which is how it determines which of the getters is x versus y.

Correct, which we actually looked at last time here when we created our multi-image block. Before you joined, Zach. But yes, useful reminder. Okay.
So here we see a histogram of how often each item comes up. So we create a little synthetic learner that doesn't really do anything, but we can pass callbacks to it, and there's a callback called CollectDataCallback, which just collects the data that is passed to the learner. And so this is how we can then find out what data was passed to the learner and get a histogram of it. And we can see that, yep, the higher numbers, up towards 160, were received a lot more often when we trained this learner, which is what you would expect. This is the source of the weighted data loader class here. And as you can see, other than the boilerplate, it's one, two, three, four, five lines of code. And then the weighted_dataloaders method is one, two lines of code. So there's actually a lot more lines of example than there is of actual code. So often it's easier just to read the source code. Because, you know, thanks to the very layered approach of fastai, we can do so much stuff with so little code. And so in this case, if we look through the code, we're passing in some weights. And basically the key thing here is that if you pass in no weights at all, then we're just going to set it equal to the number one repeated n times. So everything's going to get a weight of one. And then we divide the weights by the sum of the weights, so that the weights end up summing to one, which is what we want. And then if you're not shuffling, then there's no weighted anything to do. So we just pass back the indexes. And if we are shuffling, we will grab a random choice based on the weights. Cool. All right. So there's going to be one weight per row. All right. Let's come back to that because I want to see how our thing's gone. It looks like it's finished. Notice that the favicon in Jupyter will change depending on whether something's running or not. So that's how you can quickly tell if something's finished. 0.216, 0.221. Okay.
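As a rough illustration of the index-selection logic just described (default weights of one, normalize so they sum to one, return the indexes in order when not shuffling, otherwise draw at random according to the weights), here is a minimal standard-library sketch. It is not the fastai source: the function name `weighted_indices`, the `seed` argument, and the choice to sample with replacement are all assumptions made for the example.

```python
import random

def weighted_indices(n, wgts=None, shuffle=True, seed=None):
    """Hypothetical sketch of weighted index selection (not fastai's code)."""
    # No weights given: every row gets a weight of one.
    if wgts is None:
        wgts = [1.0] * n
    if len(wgts) != n:
        raise ValueError("need exactly one weight per row")
    # Divide by the sum so the weights add up to one.
    total = sum(wgts)
    probs = [w / total for w in wgts]
    # Not shuffling: nothing weighted to do, just hand back the indexes.
    if not shuffle:
        return list(range(n))
    # Shuffling: draw n indexes at random, in proportion to the weights
    # (sampling with replacement is an assumption of this sketch).
    rng = random.Random(seed)
    return rng.choices(range(n), weights=probs, k=n)

# Mirror the n = 160 example: item i gets weight i, so larger numbers
# should be drawn more often and item 0 (weight zero) never.
idxs = weighted_indices(160, wgts=range(160), shuffle=True, seed=0)
low = sum(i < 80 for i in idxs)
high = sum(i >= 80 for i in idxs)
print("draws below 80:", low, "draws at or above 80:", high)
```

With these weights roughly three quarters of the draws should land in the upper half of the range, which is the skew the histogram in the docs is showing.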
I mean, it's not a huge difference, but maybe it's a tiny bit better. I don't know. You know, the key thing though is this lets us use our resources better, right? So we often will end up with a better answer, but you can also train for a lot less time. In fact, you can see that the error was at 0.216 back here. So, you know, we could probably have trained for a lot fewer epochs. So that's progressive resizing. Is there a way to look at that and go, oh, actually, I'd like to take the outputs from epoch nine because it had a better... Yeah. So that was the question we got earlier. That's called early stopping. And the answer is no. You probably wouldn't want to do early stopping. But you can't go back to a previous, like, epoch. There's no history, sort of. You can. You have to use the early stopping callback to do that. All right. Cool. Okay. Yeah. Okay. I'll look at that. Or there's other things you can use. As I say, I don't think you should. But you can. If I go training, callbacks, tracker. Okay. So the other part of that is, yeah, is it counterproductive? Or yeah, it's going to treat if it works, but not if it doesn't. It's probably not a good idea. Probably it'll make it worse. Yeah. Okay. Great. So the other thing you can do is SaveModelCallback, which is kind of like early stopping, but it doesn't stop. It saves the parameters of the best model during training, which is probably what you want instead of early stopping. But I don't think you should do that either, for the same reason we discussed earlier. Why shouldn't you do this? It seems like you could just ignore it if you didn't want it. Or later. Like it might not hurt you. Well, so this actually automatically loads the best set of parameters at the end. And you're just going to end up with this kind of model that just so happened to look a tiny bit better on the validation set at an earlier epoch.
But at that earlier epoch, the learning rate hadn't yet stabilized. And it's very unlikely it really is better. So you've probably actually just picked something that's slightly worse. And, you know, made your process slightly more complicated for no good reason. Being better on an epoch there doesn't necessarily say anything about the final hidden test set. Yeah. Yeah. We have a strong prior belief that it will improve each epoch unless you're overfitting. And if you're overfitting, then you shouldn't be doing early stopping. You should be doing more augmentation. It seems like a good opportunity for somebody to document the arguments. Because I'm, like, curious what at_end does. Yes. That would be a great opportunity for somebody to document the arguments. And if somebody is interested in doing that, we have a really cool thing called docments, which I only invented after we created fastai. Oh, this is like, that's not... I should delete this because this is the old version. It's part of fastcore. And with docments, you document each parameter by putting a comment after it. And you document the return by putting a comment after it. And Zach actually started a project after I created docments to add docments comments through everything in fastai, which of course is not finished because fastai is pretty big. And so here's an example of something that doesn't yet have docments comments. So if somebody wants to go and add a comment to each of these things and put that into a PR, then that will end up in the documentation. I highly recommend that anyone who wants to do that goes and does it. I was just going to say, Zach, something we should do is to actually include that example in the docments documentation, of what it ends up looking like in nbdev. Because I can see that's missing. That might be a good idea. I can see if I can get on that tomorrow. Yeah. Sorry, Hamel, what were you saying?
I just wanted to encourage everybody that writing the documentation is an excellent way to learn deeply how everything works. What ends up happening is you write this documentation and somebody like Jeremy will review it carefully and let you know what you don't understand. And that's how I learned about some other fastai libraries. So I highly recommend going and doing that. And here's what it ends up looking like, right? So here's Optimizer. And you can see it's got a little table underneath. And if we look at the source of Optimizer, you'll see that each parameter has a comment next to it. So those comments are automatically turned into this table. All right. Yeah, docments are super cool. They are super cool. This sounds like a good place to wrap up. Anybody got any questions or comments or anything before we wrap up? I have a question regarding progressive resizing. We didn't actually do lr_find after each step. Don't you think that would be helpful? lr_find, did you say? Yeah. To be honest, I don't use lr_find much anymore nowadays, because at least for object recognition in computer vision, the optimal learning rate is pretty much always the same. It's always around 0.008, 0.01. Yeah, there's no reason to believe that we have any need to change it just because we changed the resolution. So yeah, I wouldn't bother; I'd leave it where it was. Jeremy, if your training and validation loss are still decreasing after 12 epochs, can you pick up and train for a little longer without restarting? You can. The first thing I'll say is you shouldn't be looking at the validation loss to see if you're overfitting. You should be looking at the error rate. So the validation loss can get worse whilst the error rate gets better. And that doesn't count as overfitting, because the thing you want to improve is the error rate. That can happen if it gets overconfident, but it's still improving.
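To make the last point concrete, here is a small hand-made example where the loss gets worse while the error rate gets better. The probabilities are invented for illustration (they are not from the notebook), and the helper names `cross_entropy` and `error_rate` are just for this sketch; it treats a prediction as correct when the true class gets a probability of at least 0.5.

```python
import math

def cross_entropy(true_class_probs):
    """Mean negative log of the probability given to the correct class."""
    return sum(-math.log(p) for p in true_class_probs) / len(true_class_probs)

def error_rate(true_class_probs, threshold=0.5):
    """Fraction of items whose correct class got less than `threshold`."""
    return sum(p < threshold for p in true_class_probs) / len(true_class_probs)

# Hypothetical probability assigned to the correct class for four
# validation items at two points in training.
earlier = [0.6, 0.6, 0.4, 0.4]      # cautious: two right, two wrong
later = [0.95, 0.95, 0.95, 0.01]    # confident: three right, one badly wrong

for name, probs in [("earlier", earlier), ("later", later)]:
    print(f"{name}: loss={cross_entropy(probs):.3f} "
          f"error_rate={error_rate(probs):.2f}")
```

The later model gets more items right, so its error rate is lower, but the single very confident mistake is penalized heavily by the log, so its loss is higher. That is the overconfident-but-still-improving case.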
Yeah, you can keep training for longer. If you're using fit_one_cycle or fine_tune, and fine_tune uses fit_one_cycle behind the scenes, then continuing to train further, your learning rate is going to go up and then down and then up and then down each time, which is not necessarily a bad thing. But if you basically want to keep training at that point, you would probably want to decrease the learning rate by maybe 4x or so. And in fact, I think after this, I'm going to rerun this whole notebook, but halve the learning rate each time. Because I think that would be potentially a good idea. I have a question. I don't know if it's too late. I think it might be useful to discuss, when you do the progressive resizing, what part of the model gets dropped? Is there some part of the model that needs to be re-initialized for the new resolution? Nothing needs to be re-initialized. I found this on the web. Who found what on the web? I thought you were talking to me, but you're talking to Siri? I'm offended. Siri, teach me deep learning. ConvNeXt is what we call a resolution independent architecture, which means it works for any input resolution. Time permitting, in the next lesson we will see how convolutional neural networks actually work, but I guess a lot of you probably already know. So for those of you who do, if you think about it, it's basically going patch by patch and doing this kind of mini matrix multiply for each patch. So if you change the input resolution, it just has more patches to cover, but it doesn't change the parameters at all. So there's nothing to re-initialize. Does that make sense, Hamel? Yeah, that makes sense. I was just asking for the record. Fair enough. Yeah, I have a question. Yeah, go ahead. Oh, I was just going to do a quick note to say, is ResNet resolution independent?
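The patch-by-patch point can be seen in a tiny pure-Python sketch: the same 3x3 kernel (the only "parameters" here) is applied to two inputs of different sizes, and only the number of patches changes. This is a toy illustration, not how ConvNeXt or fastai implement convolutions; the function name `conv2d_valid` and the input values are made up for the example.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` patch by patch (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            # Multiply one patch elementwise by the kernel and sum it up.
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# One fixed 3x3 kernel: these nine numbers are the only parameters.
kernel = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]

# Two made-up inputs at different resolutions.
small = [[(r * c) % 5 for c in range(6)] for r in range(6)]      # 6x6
large = [[(r * c) % 5 for c in range(12)] for r in range(12)]    # 12x12

out_small = conv2d_valid(small, kernel)
out_large = conv2d_valid(large, kernel)
print(len(out_small), len(out_small[0]))    # 4 4
print(len(out_large), len(out_large[0]))    # 10 10
```

The larger input simply produces a larger output grid from the same nine weights, which is why nothing has to be re-initialized when the resolution changes.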
Yeah, basically everything we normally use is, but have a look at that best fine-tuning models notebook, and you'll see that two of the best ones are called ViT and Swin, and also Swin V2. None of those are resolution independent. Although there is a trick you can use to kind of make them resolution independent, which we should try out in a future walkthrough. Is that fiddling with the head? Oh, there's a thing you can pass to timm. I don't know if we can use it to support progressive resizing or not. It'll be interesting to experiment with. It's basically changing the positional encodings. I have a question. Yeah. After you've done your experiments, progressive resizing, and fine-tuning, how do you, in fastai, train with the whole training set? I never got around to doing that. Do you think it's done? I almost never do. Instead I do what we saw in the last walkthrough, which is I just train on a few different randomly selected training and validation sets, because that way, you know, you get the benefit of ensembling, and you're going to end up seeing all the images at least once anyway. And you can also kind of see if something's messed up, because you've still got a validation set each time. So, yeah. I used to do this thing where I would create a validation set with a single item in it to get that last bit of juice, but I don't even do that anymore. Okay. Thanks. No worries. All right, gang. Enjoy the rest of your day slash evening. Nice to see you all. Bye. Goodbye. Thanks. Thank you.