Right. Welcome to lesson seven, the penultimate lesson of Practical Deep Learning for Coders, part one. Today we're going to be digging into what's inside a neural net. We've already seen what's inside the most basic possible neural net, which is a sandwich of fully connected layers (or linear layers) and ReLUs, and we built that from scratch. But there are a lot of tweaks we can make, and most of the tweaks we probably care about involve the very first layer or the very last layer, so that's where we'll focus. Over the next couple of weeks we'll look at some of the tricks we can do inside as well. I'm going to do this through the lens of the Paddy (rice paddy) competition we've been talking about. We got to a point where, let's have a look: we created a ConvNeXt model, we tried a few different types of basic preprocessing, we added test time augmentation, and then we scaled that up to larger images and rectangular images. That got us into the top 25% of the competition. So that's part two of the so-called Road to the Top series, which is increasingly misnamed, since as we've been presenting these notebooks, more and more of our students have been passing me on the leaderboard. Currently first and second place are both people from this class, Kurian and Nick. Go to hell, you're in my target, and leave my class immediately. And congratulations, good luck to you. In part three, I'm going to show you a really interesting trick, a very simple trick, for scaling up these models further. You may have discovered this if you've tried to use larger models: you can replace the word small with the word large in those architectures and try to train a larger model. A larger model has more parameters. More parameters means it can find more tricky little features. And broadly speaking, models with more parameters therefore ought to be more accurate.
The problem is that those activations, or more specifically the gradients that have to be calculated, chew up memory on your GPU. And your GPU is not as clever as your CPU at sticking stuff it doesn't need right now into virtual memory on the hard drive. When it runs out of memory, it runs out of memory. It also doesn't do as good a job as your CPU at shuffling things around to try to find memory. It just allocates blocks of memory, and they stay allocated until you remove them. So if you try to scale up to bigger models, unless you have very expensive GPUs, you will run out of space and you'll get an error, something like a CUDA out of memory error. If that happens, the first thing I'll mention is that it's not a bad idea to restart your notebook, because these errors can be a bit tricky to recover from otherwise. And then I'll show you how you can use as large a model as you like. Almost; basically you'll be able to use an extra-large model on Kaggle. So let me explain. When you run something actually on Kaggle, you're generally going to be on a 16 gig GPU. You don't have to run stuff on Kaggle; you can run it on your home computer or Paperspace or whatever. But sometimes, if you want to do a Kaggle competition, you'll have to run on Kaggle, because a lot of competitions are what they call code competitions, where the only way to submit is from a notebook that you're running on Kaggle. A second reason to run stuff on Kaggle is that your notebooks will appear with the leaderboard score on them, so people can see which notebooks are actually good. Even in things that aren't code competitions, I love trying to be the person who's number one on the notebook score leaderboard, because there you can't just work at NVIDIA and use a thousand GPUs and win a competition through a combination of skill and brute force.
Everybody has the same nine hour timeout to work with. So I think it's a good way of keeping things a bit more fair. Now, my home GPU has 24 gig, so I wanted to find out what I can get away with in 16 gig. The way I did that is, I think, a useful thing to discuss, because again it's all about fast iteration. I wanted to find out really quickly how much memory a model will use. There's a quick, hacky way to do that. Here are the value counts of the labels, so the number of images of each disease. For the training set, let's not look at all the diseases. Let's just pick one, the smallest one, and make that our training set. So our training set is the bacterial panicle blight images, and now I can train a model with just 337 images without changing anything else. Not that I care about that model, but then I can see how much memory it used. It's important to realize that each image you pass through is the same size and each batch is the same size, so training for longer won't use more memory. So that'll tell us how much memory we're going to need. What I then did was try training different models to see how much memory they used up. Obviously ConvNeXt small doesn't use too much memory. Here's something that reports the amount of GPU memory, basically by printing out CUDA's list of GPU processes, and you can see ConvNeXt small took up four gig. This might also be interesting to you: if you then call Python's garbage collection, gc.collect, and then call PyTorch's empty cache, that should basically get your GPU back to a clean state of not using any more memory than it needs to, so you can start training the next model without restarting the kernel. So what would happen if we tried to train this little model and it crashed with a CUDA out of memory error? What do we do?
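As a rough sketch of the cleanup step just mentioned (an illustration, not the notebook's actual code): the helper name `free_gpu_memory` is made up here, and the import of PyTorch is wrapped so the function does nothing harmful on a machine without PyTorch or without a GPU. The two calls it relies on, `gc.collect` and `torch.cuda.empty_cache`, are the ones named in the talk.

```python
import gc

def free_gpu_memory():
    """Collect Python garbage, then ask PyTorch to release cached GPU blocks.

    Returns True if a CUDA cache was emptied, False if there was nothing to do.
    The function name is hypothetical; only the two calls inside come from the talk.
    """
    gc.collect()  # drop unreachable Python objects (and the tensors they hold) first
    try:
        import torch  # optional dependency: only needed when PyTorch is installed
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()  # hand cached, currently unused blocks back to the driver
    return True

# Typical use: call it between training runs in the same kernel.
freed = free_gpu_memory()
```

Calling this after each experiment is cheap, so it is reasonable to do it unconditionally rather than only after an out of memory error.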
We can use a cool little trick called gradient accumulation. What's gradient accumulation? Well, I added this parameter to my train method here. My train method creates my data loaders, creates my learner, and then, depending on whether I'm fine-tuning or not, either fits or fine-tunes it. But there's one other thing it does: this gradient accumulation thing. What's that about? Well, the key step is here. I set my batch size, that's the number of images I pass through to the GPU all at once, to 64, which is my default, divided by this number (slash slash means integer divide in Python). So if I pass two, it's going to use a batch size of 32. If I pass four, it'll use a batch size of 16. Now that obviously should let me cure any memory problems: use a smaller batch size. But the problem is that now the dynamics of my training are different. The smaller your batch size, the more volatility there is from batch to batch, so now your learning rates are all messed up. You don't want to be messing around trying to find a different set of optimal parameters for every batch size and every architecture. So what we want is a way to run with, let's say, accum equals two. Let's say we just want to run 32 images at a time through. How do we make it behave as if it was 64 images? Well, the solution to that problem is to consider our training loop. This is basically the training loop we used a couple of lessons ago, the one we created manually. We go through each x, y pair in the data loader. We calculate the loss using some coefficients based on that x, y pair. Then we call backward on that loss to calculate the gradients. Then we subtract from the coefficients the gradients times the learning rate. And then we zero out the gradients. I've skipped a bit of stuff, like the with torch.no_grad thing. Actually, no, I don't need that because I've got .data.
No, that's it. That should all work fine. I've skipped out printing the loss. That's about it. So here is a variation of that loop where I do not always subtract the gradient times the learning rate. Instead, I go through each x, y pair in the data loader and calculate the loss. I look at how many images are in this batch and add that to count. Initially count starts at zero, so after the first batch it's going to be 32, say, if I've divided the batch size by two. Then, if count is greater than 64, I do my coefficients update. Well, it's not, so I skip back to here and do this again. And if you remember, there was this interesting subtlety in PyTorch: if you call backward again without zeroing out the gradients, it adds this set of gradients to the old gradients. So by doing these two half-sized batches without zeroing out the gradients between them, it's adding them up. I'm going to end up with the total gradient of a 64 image batch, but passing only 32 at a time. If I used accum equals four, it would go through this four times, adding them up, before it subtracted coeffs.grad times the learning rate and zeroed the gradients. If I put in accum equals 64, it would go through a single image at a time, and after 64 passes through, eventually count would be greater than 64 and we'd do the update. So that's gradient accumulation. It's a very simple idea: you don't have to actually update your weights every loop through, for every mini-batch. You can just do it from time to time. But it has quite significant implications, which I find most people seem not to realize. If you look on Twitter or Reddit or whatever, people say, oh, I need to buy a bigger GPU to train bigger models. But they don't; they could just use gradient accumulation. Consider the huge price differential between, say, an RTX 3080 and an RTX 3090 Ti. The performance is not that different; the big difference is the memory. So what?
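To make the accumulation loop concrete, here is a minimal sketch in plain Python rather than PyTorch, so the gradient is computed by hand. Everything in it is an assumption made for illustration: a one-parameter model `w * x`, a squared-error loss, made-up data, and the function names. It accumulates the summed gradient across micro-batches and only updates once `count` reaches the effective batch size, dividing by `count` at update time so the step equals the mean gradient over the effective batch.

```python
import math

def grad_of_squared_error(w, batch):
    """Gradient with respect to w of sum((w*x - y)**2) over one batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch)

def train_one_epoch(w, data, micro_bs, effective_bs=64, lr=0.1):
    """One pass over data, updating w only after effective_bs samples have been seen."""
    grad, count = 0.0, 0
    for start in range(0, len(data), micro_bs):
        batch = data[start:start + micro_bs]
        grad += grad_of_squared_error(w, batch)  # accumulate instead of zeroing
        count += len(batch)
        if count >= effective_bs:        # fires once exactly 64 samples are accumulated
            w -= lr * grad / count       # step with the mean gradient of the effective batch
            grad, count = 0.0, 0         # "zero the gradients" for the next group
    return w

# Made-up data for y = 3x, 128 samples so every group of 64 is complete.
data = [(x, 3.0 * x) for x in ((i % 10) / 10 for i in range(128))]

w_full = train_one_epoch(0.0, data, micro_bs=64)     # no accumulation
w_half = train_one_epoch(0.0, data, micro_bs=32)     # two micro-batches per update
w_quarter = train_one_epoch(0.0, data, micro_bs=16)  # four micro-batches per update

# Summation order differs slightly in floating point, so compare with a tolerance.
same = math.isclose(w_full, w_half, rel_tol=1e-9) and math.isclose(w_full, w_quarter, rel_tol=1e-9)
```

For a model like this one, with no batch-dependent layers, the three runs take the same steps up to floating-point rounding, which is the point of the technique: the memory footprint follows `micro_bs` while the training dynamics follow `effective_bs`.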
Just put in a somewhat smaller batch size and do gradient accumulation. So there's actually not that much reason to buy giant GPUs. John? Are the results with gradient accumulation numerically identical? They're numerically identical for this particular architecture. There is something called batch normalization, which we will look at in part two of the course, which keeps track of the moving average of standard deviations and averages, and does it in a mathematically slightly incorrect way. As a result, if you've got batch normalization, gradient accumulation could basically introduce more volatility, which is not necessarily a bad thing, but because it's not mathematically identical, you won't necessarily get the same results. ConvNeXt doesn't use batch normalization, so it is the same. And in fact, a lot of the models people want to use really big versions of, which are the NLP ones, transformers, tend not to use batch normalization; instead they use something called layer normalization, which doesn't have the same issue. I think that's probably fair to say; I haven't thought about it that deeply. In practice, I've found adding gradient accumulation for ConvNeXt has not caused any issues for me. I don't have to change any parameters when I do it. Any other questions on the forum, John? To Maury asking, shouldn't it be count greater than or equal to 64 if bs equals 64? I haven't... No, I don't think so. Oh, yeah. So we start at zero, then it's going to be 32, then it's going to be... yeah, yeah, probably. You can probably tell I didn't actually run this code. Medav is asking, does this mean that lr_find is based on the batch size set during the data block? Yeah, so lr_find just uses your data loaders' batch size. Edward is asking, why do we need gradient accumulation rather than just using a smaller batch size, and follows up with how would we pick a good batch size? Well, if you just use a smaller batch size, here's the thing, right?
Different architectures take up different amounts of memory, so you'll end up with different batch sizes for different architectures. That's not necessarily a bad thing, but each of them is then going to need a different learning rate, and maybe even different weight decay or whatever. The settings that work really well for batch size 64 won't necessarily work really well for batch size 32. And you want to be able to experiment as easily and quickly as possible. I think the second part of your question was how do you pick an optimal batch size? Honestly, the standard approach is to pick the largest one you can, just because it's faster that way; you're getting more parallel processing going on. Although to be honest, I quite often use batch sizes that are quite a bit smaller than I need, because quite often it doesn't make that much difference. But yeah, the rule of thumb would be to pick a batch size that fits in your GPU. And for performance reasons, I think it's generally a good idea to have it be a multiple of eight. Everybody seems to always use powers of two; I don't think it actually matters. And look, there's one other, just a clarification or a check: should the learning rate be scaled according to the batch size? Yeah, so generally speaking, the rule of thumb is that if you divide the batch size by two, you divide the learning rate by two. But unfortunately, it's not quite perfect. Did you have a question, Nick? If you do, you can. Okay, cool. Yeah. No, that's us all caught up. Thanks, Jeremy. Good questions. Thank you. So gradient accumulation in fastai is very straightforward. You just divide the batch size by however much you want to divide it by, and then you add something called a callback. A callback is something which changes the way the model trains. This callback is called GradientAccumulation, and you pass in the effective batch size you want.
And then when you create the learner, you say, these are the callbacks I want, and so it's going to pass in the GradientAccumulation callback. So it's only going to update the weights once it's got 64 images. If we pass in accum equals one, it won't do any gradient accumulation, and that uses four gig. If we use accum equals two, about three gig. Accum equals four, about two and a half gig. Generally, the bigger the model, the closer you'll get to linear scaling, because models have a bit of overhead that they carry anyway. So what I then did was go through all the different models I wanted to try: ConvNeXt large at 320 by 240, ViT large, Swin V2 large, and Swin large. On each of these I tried running with accum equals one, and every single time, for all of them, I got an out of memory error. Then I tried each of them independently with accum equals two, and it turns out that all of them worked with accum equals two. It only took me 12 seconds each time, so that was a very quick way to find out how to train all of these models on a 16 gigabyte card. I can check here: they're all under 16 gig. So then I created a little dictionary of all the architectures I wanted and, for each architecture, all of the resize methods and final sizes I wanted. Now these models, ViT, Swin V2 and Swin, are all transformer models, and most transformer models, nearly all of them, have a fixed input size. This one's 224, this one's 192, this one's 224. So I have to make sure that my final sizes are squares of the required size, otherwise I get an error. There is a way of working around this, but I haven't experimented with it enough to know when it works well and when it doesn't, so we'll probably come back to that in part two. For now, we're just going to use the size they ask us to use.
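One way to picture that dictionary, as a sketch only: the short architecture keys, the resize labels "squish" and "pad", and the helper names below are all placeholders invented for this example, not fastai or timm identifiers. The sizes are the ones mentioned in the talk. The check encodes the rule that a fixed-size transformer must be given a square of exactly its required size.

```python
# Input size each fixed-size transformer expects (values from the talk;
# the keys are illustrative labels, not real library model names).
FIXED_SIZE = {"vit_large": 224, "swinv2_large": 192, "swin_large": 224}

# architecture -> list of (resize method label, final size) variants to train.
EXPERIMENTS = {
    "convnext_large": [("pad", (320, 240))],  # flexible input size, so rectangular is fine
    "vit_large": [("squish", (224, 224)), ("pad", (224, 224))],
    "swinv2_large": [("squish", (192, 192)), ("pad", (192, 192))],
    "swin_large": [("squish", (224, 224)), ("pad", (224, 224))],
}

def check_sizes(experiments, fixed_size):
    """Raise ValueError if a fixed-size model is paired with a size it cannot accept."""
    for arch, variants in experiments.items():
        required = fixed_size.get(arch)
        if required is None:
            continue  # this architecture accepts flexible input sizes
        for method, size in variants:
            if size != (required, required):
                raise ValueError(f"{arch} needs {required}x{required}, got {size} ({method})")

def run_all(experiments, train_fn):
    """Call train_fn(arch, method, size) for every variant and collect the results."""
    results = []
    for arch, variants in experiments.items():
        for method, size in variants:
            results.append(train_fn(arch, method, size))
    return results

check_sizes(EXPERIMENTS, FIXED_SIZE)  # fails fast, before any GPU time is spent
```

Validating the table up front is a design choice for this sketch: a size mismatch would otherwise only surface as an error partway through a long run.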
So with this dictionary of architectures and, for each architecture, the preprocessing details, we switch the training path back to using all of our images, and then we can loop through each architecture, and through each set of item transforms and sizes, and train the model. The training script, if you're fine-tuning, returns the TTA predictions, so I append the TTA predictions for each model and each type into a list. After each one, it's a good idea to do this garbage collection and empty cache, because otherwise I find that your GPU memory, I don't know, I think it gets fragmented or something, and after a while it runs out of memory even when you thought it wouldn't. This way you can really do as much as you like without running out of memory. So they all train, train, train, train. One key thing to note here is that in my train script, my data loaders call does not have the seed equals parameter, so I'm using a different training set every time. That means each of these runs is also using a different validation set, so they're not directly comparable. But you can see they're all doing pretty well: 2.1%, 2.3%, 1.7% and so forth. So why am I using different training and validation sets for each of these? That's because I want to ensemble them. I'm going to use bagging, which means I'm going to take the average of their predictions. Now, really, when we talked about random forest bagging, we were taking the average of intentionally weak models. These are not intentionally weak models; they're meant to be good models, but they're all different. They're using different architectures and different preprocessing approaches. So in general we would hope that some of these approaches work well for some images and some work well for other images.
And so when we average them out, hopefully we'll get a good blend of different ideas, which is what you want in bagging. So we can stack up that list of all the different probabilities and take their mean. That's going to give us 3469 predictions, which is our test set size, and each one has 10 probabilities, the probability of each disease. Then we can use argmax to find which probability index is the highest, and that gives us our list of indexes. These are basically the same steps we used before to create a CSV submission file. At the time of creating this analysis, that got me to the top of the leaderboard. In fact, these are my four submissions, and you can see each one got better. Now, you're not always going to get this nice monotonic improvement, right? But you want to be trying to submit something every day, to try out something new. And the more you practice, the better your intuition for what's going to help. So partly I'm showing you this to say it's not purely random whether things work or don't. Once you've been doing this for a while, you will generally be improving things most of the time. As you can see from the descriptions, my first submission was our ConvNeXt small for 12 epochs with TTA. Then an ensemble of ConvNeXts, so basically the exact same thing, but retraining a few with different training subsets. Then this is the thing we just saw, the ensemble of large models with TTA. And the last one was something I skipped over: the ViT models were the best in my testing, so I basically weighted them double in the ensemble. Pretty unscientific, but again, it gave it another boost. And so that was it. All right, John. Yes. Thanks, Jeremy. So, in no particular order.
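The averaging and argmax steps described above can be sketched in plain Python. This is an illustration under assumptions: the two tiny prediction tables are made up (two "models", two test rows, three classes instead of ten), and `average_predictions` and `argmax` are hypothetical helper names. The optional `weights` argument mirrors the idea of counting one family of models double.

```python
def average_predictions(all_preds, weights=None):
    """Weighted mean over models.

    all_preds is a list with one entry per model; each entry is a list of rows,
    and each row is a list of class probabilities.
    """
    if weights is None:
        weights = [1.0] * len(all_preds)  # plain average: every model counts once
    total = sum(weights)
    n_rows = len(all_preds[0])
    n_classes = len(all_preds[0][0])
    return [
        [sum(w * preds[r][c] for w, preds in zip(weights, all_preds)) / total
         for c in range(n_classes)]
        for r in range(n_rows)
    ]

def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=lambda i: row[i])

# Made-up probabilities from two models for two test images.
model_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
model_b = [[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]]

plain = average_predictions([model_a, model_b])
plain_idxs = [argmax(row) for row in plain]          # -> [1, 2]

# Counting model_a double can change which class wins.
weighted = average_predictions([model_a, model_b], weights=[2, 1])
weighted_idxs = [argmax(row) for row in weighted]    # -> [0, 1]
```

The resulting index lists are what would then be mapped back to class names for a submission file.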
Kurian is asking, would trying out cross-validation with k folds with the same architecture make sense? Okay, so, ensembling of models. Yeah, a popular thing is to do k-fold cross-validation. K-fold cross-validation is something very similar to what I've done here. What I've done here is train a bunch of models with different training sets, each one a different random 80% of the data. Five-fold cross-validation does something similar, but rather than picking, say, five different random subsets, it instead first uses all except the first 20% of the data, then all but the second 20%, then all but the third, and so forth. So you end up with five subsets, each of which has a non-overlapping validation set, and then you ensemble those. In theory, maybe that could be slightly better, because you're guaranteed that every row appears four times, effectively. It also has the benefit that you could average the results on those five validation sets, because there's no overlap between them, to get a cross-validation score. Personally, I generally don't bother. The reason I don't is that this way I can add and remove models very easily. I can just add another architecture or whatever to my ensemble without trying to find a different non-overlapping subset. So cross-validation is something that I use probably less than most people, or almost never. Awesome, thank you. Just coming back to gradient accumulation, are there any other drawbacks or potential gotchas with gradient accumulation? No, not really. Amazingly, it doesn't even really slow things down much, going from a batch size of 64 to a batch size of 32. By definition, you had to do it because your GPU's full, so you're obviously giving it a lot of data.
So it's probably going to be using its processing speed pretty effectively. So no, it's just a good technique. We should all be buying cheaper graphics cards with less memory in them and using gradient accumulation. I don't know the prices, but I suspect you could probably buy two 3080s for the price of one 3090 Ti or something. That would be a very good deal. Yes, clearly you're not on the NVIDIA payroll. So look, this is a good segue: we did have a question about GPU recommendations, and there's been a bit of chat on that as well. Any additional commentary around GPU recommendations? No, not really. I mean, obviously, at the moment NVIDIA is the only game in town. If you're trying to use an Apple M1 or M2 or an AMD card, you're basically in for a world of pain in terms of compatibility and unoptimized libraries and whatever. The NVIDIA consumer cards, the ones that start with RTX, are much cheaper but are just as good as the expensive enterprise cards. So you might be wondering why anybody would buy the expensive enterprise cards. The reason is a licensing issue: NVIDIA will not allow you to use an RTX consumer card in a data center. That's also why cloud computing is more expensive than it ought to be, because everybody selling cloud GPUs is selling these cards that are, I can't remember, I think something like three times more expensive for kind of the same features. So if you do get serious about deep learning, to the point that you're prepared to invest a few days in administering a box and, I guess prices will hopefully start to come down, but currently a thousand or two thousand dollars in buying a GPU, then that'll probably pay you back pretty quickly. Great, thank you. Let's see, some other ones have come in.
Back on models, not hardware: if you have a well-functioning but large model, can it make sense to train a smaller model to produce the same final activations as the larger model? Oh yeah, absolutely. I'm not sure we'll get into that this time around, but we'll cover it in part two, I think. Basically there are teacher-student models and model distillation; broadly speaking, there are ways to make inference faster by training small models that work the same way as large models. Great, thank you. All right, so that is the actual real end of Road to the Top, because beyond that we don't actually cover how to get closer to the top. You'd have to ask Kurian to share his techniques to find that out, or Nick to get to second place from the top. Part four is something that I think is very useful to know about for learning, and it's going to teach us a whole lot about how the last layer of a neural network works. Specifically, we're going to try to build a model that doesn't just predict the disease but also predicts the type of rice. So how would you do that? Here's the data loader we're going to try to build. It's going to be something that, for each image, tells us the disease and the type of rice. I say disease; sometimes it's normal, so I guess some of them are not diseased. To build a model that can predict two things, the first thing we're going to need is data loaders that have two dependent variables. And that is shockingly easy to do in fastai, thanks to the data block. We've seen the data block before. We haven't been using it for the Paddy competition so far because we haven't needed it; we could just use ImageDataLoaders.from_folder. That's the highest-level API, the simplest API. If we go down a level deeper, into the data block, we have a lot more flexibility.
So if you've been following the walkthroughs, you'll know that as I built this, the first thing I actually did was simply replicate the previous notebook, but replacing ImageDataLoaders.from_folder with a data block, to first of all do exactly the same thing. Then I added the second dependent variable. If we look at the previous ImageDataLoaders.from_folder thingy, here it is: we're passing in some item transforms and some batch transforms, and we had something saying what percentage should be the validation set. In a data block, if you remember, we have to pass in a blocks argument saying what kind of data the independent variable is and what the dependent variable is. To replicate what we had before, we would just pass in ImageBlock, CategoryBlock, because we've got an image as our independent variable and a category, one type of rice disease, as the dependent variable. So the new thing I'm going to show you here is that you don't have to put in only two things; you can put in as many as you like. If you put in three things, we're going to generate one image and two categories. Now, if you say you want three things, fastai doesn't know which of those are independent variables and which are dependent variables. So the next thing you have to tell it is how many inputs there are, the number of inputs, n_inp. Here I've said there's one input, so that means this is the input, and therefore by definition the two categories will be the outputs, because remember, we're trying to predict two things, the type of rice and the disease. Okay, this is the same as what we've seen before: to get our list of items, we'll call get_image_files. Now here's something we haven't seen before. get_y is our labeling function. Normally we pass get_y a single thing, such as the parent_label function, which looks at the name of the parent directory, which remember is how these images are structured. And that would tell us the label.
But get_y can also take an array, and in this case we want two different labels. One is the name of the parent directory, because that's the disease. The second is the variety. So what's get_variety? get_variety is a function, so let me explain how it works. We can create a data frame containing our training data that came from Kaggle. For each image, it tells us the disease and the variety. And what I did is something I haven't shown before: in pandas, you can set one column to be the index. When you do that, in this case with image_id, it makes this data frame kind of like a dictionary. I can index into it by saying, tell me the row for this image. To do that, you use the loc attribute, the location. So we want, in the data frame, the location of this image. Then you can also optionally say what column you want: this column. So here's this image and here's this column, and as you can see, it returns that thing. So hopefully you can now see it's pretty easy to create a function that takes a path and returns the location in the data frame of the name of that file (because remember, the index values are the names of files) for the variety column. So that's our second get_y. Okay, and then we've seen this before: randomly split the data into 20% and 80%. And we can just squish them all to 192, just for this example, and then use data augmentation to get us down to 128 square images, just for this example. And that's what we get when we say show_batch: we get what we just discussed. So now we need a model that predicts two things. How do we create a model that predicts two things? Well, the key thing to realize is that we never actually had a model that predicted one thing. We had a model that predicted 10 things before. The 10 things we predicted were the probabilities of each disease. So we don't actually now want a model that predicts two things.
We want a model that predicts 20 things: the probability of each of the 10 diseases and the probability of each of the 10 varieties. So how could we do that? Well, let's first of all try to just create the same disease model we had before with our new data loader. This is going to be reasonably straightforward. The key thing to know is that since we told fastai there's one input, and therefore by definition two outputs, it's going to pass our metrics and our loss functions three things instead of two: the predictions from the model, the disease, and the variety. So we can't just use error_rate as our metric anymore, because error_rate takes two things. Instead we have to create a function that takes three things and returns the error rate on the two things we want, which are the predictions from the model and the disease. Okay, so these are the predictions from the model, and this is the target. That's actually all we need to do to define a metric that's going to work with our new data loader. This is not going to tell us anything about variety; first we're just going to try to replicate something that can do just disease. So when we create our learner, we'll pass in this new disease error function. Okay, so we're halfway there. The other thing we're going to need is to change our loss function. Now, we never actually talked about what loss function to use, and that's because vision_learner guessed what loss function to use. vision_learner saw that our dependent variable was a single category, and it knows the loss function that's probably going to be best for things with a single category, and it knows how big the category is. So it just didn't bother us at all. It just said, okay, I'll figure it out for you. So the only time we've provided our own loss function is when we were doing linear models and neural nets from scratch, and we did, I think, mean squared error. We might also have done mean absolute error.
Neither of those works when the dependent variable is a category. How would you use mean squared error or mean absolute error to say how close these 10 probability predictions were to this one correct answer? So in this case we have to use a different loss function, something called cross-entropy loss. This is actually the loss function that fastai picked for us before without us knowing. But now that we're having to pick it manually, I'm going to explain exactly what cross-entropy loss does. And these details are very important indeed. Remember I said at the start of this class that the stuff that happens in the middle of the model you're not going to have to care about much in your life, if ever. But the stuff that happens in the first layer and the last layer, including the loss function that sits between the last layer and the loss, you're going to have to care about a lot. This stuff comes up all the time, so you definitely want to know about cross-entropy loss. I'm going to explain it using a spreadsheet, which is in the course repo. Let's say you were predicting something like a mini ImageNet thing, where you're trying to predict whether an image is a cat, a dog, a plane, a fish or a building. You set up some model, whatever it is, a ConvNeXt model or just a big bunch of linear layers connected up or whatever. Initially you've got some random weights, and it spits out five predictions at the end, right? Remember, to predict something with five categories, your model will spit out five probabilities. Now, it doesn't initially spit out probabilities; there's nothing making them probabilities. It just spits out five numbers, which could be negative or positive. Okay, so here's the output of the model. What we want to do is convert these into probabilities, and we do that in two steps. The first thing we do is we go exp, that's e to the power of.
We go e to the power of each of those things. Like so, okay? And so here's the mathematical formula we're using. This is called the softmax that we're working through. We're going to go through each of the categories. So these are our five categories, so here K is five. We're going to go through each of our categories. And we're going to go e to the power of the output. So z_j is the output for the jth category. So here's that. And then we're going to sum them all together. Here it is, summed up together, okay? So this is the denominator. And then the numerator is just e to the power of the thing we care about, so this row. So the numerator is e to the power of cat on this row, e to the power of dog on this row, and so forth. Now, if you think about it, since the denominator adds up all the e to the power ofs, when we do each one divided by the sum, that means the sum of these will equal one by definition, right? And so now we have things that can be treated as probabilities. They're all numbers between zero and one. Numbers that were bigger in the output will be bigger here. But there's something else interesting, which is because we did e to the power of, it means that the bigger numbers will be pushed up to numbers closer to one. It's like we're saying, really try to pick one thing as having most of the probability. Because we are trying to predict, you know, one thing. We're trying to predict which one it is. And so this is called softmax. So sometimes you'll see people complaining about their model, which they built to say, let's say, is it a teddy bear or a grizzly bear or a black bear? And they feed it a picture of a cat. And they say, oh, the model's wrong because it predicted grizzly bear. But it's not a grizzly bear. As you can see, there's no way for this to predict anything other than the categories we're giving it. We're forcing it to that.
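As a rough sketch of the two softmax steps just described (exponentiate each output, then divide by the sum), here is a plain-Python version. The five raw outputs are illustrative values made up for this example, not numbers read off the course spreadsheet, and the `softmax` helper is written here for illustration rather than taken from any library. Subtracting the largest output before exponentiating is a common numerical safeguard that is not mentioned in the lecture; it does not change the result.

```python
import math

def softmax(outputs):
    """Convert raw model outputs into probabilities that sum to one."""
    biggest = max(outputs)                             # stability trick; result is unchanged
    exps = [math.exp(z - biggest) for z in outputs]    # step 1: e to the power of each output
    total = sum(exps)                                  # the denominator
    return [e / total for e in exps]                   # step 2: each one divided by the sum

# Hypothetical raw outputs for the five categories (can be negative or positive).
categories = ["cat", "dog", "plane", "fish", "building"]
outputs = [-4.89, 2.60, 0.59, -2.07, -4.57]

probs = softmax(outputs)
for name, p in zip(categories, probs):
    print(f"{name:>8}: {p:.4f}")
print("sum:", round(sum(probs), 6))
```

Because of the exponential, the largest raw output ends up with most of the probability, which is the "pick one thing" behaviour described above.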
Now, there's something else you could do if you want, which is you could actually have them not add up to one. Right? You could instead have something which simply says, what's the probability it's a cat? What's the probability it's a dog? What's the probability it's a plane? Totally separately. And they could add up to less than one, or more than one, in which case you could have more than one thing being true or zero things being true. But in this particular case, where we want to predict one and one thing only, we use softmax. The first part of the cross entropy formula, in fact, let's look it up: nn.CrossEntropyLoss. The first part of what cross entropy loss in PyTorch does is to calculate the softmax. It's actually the log of the softmax, but don't worry about that too much. It's just slightly faster to do the log. Okay, so now for each one of our five things, we've got a probability. The next step is the actual cross entropy calculation, which is we take our five things, we've got our five probabilities, and then we've got our actuals. Now, the truth is the five things would have indices, right? Zero, one, two, three, or four. And the actual turned out to be number one. But what we tend to do is we think of it as being one hot encoded, which is we put a one next to the thing for which it's true and a zero everywhere else. And so now we can compare these five numbers to these five numbers. And we would expect to have a smaller loss if the softmax was high where the actual is high. And so here's how we calculate it, this is the formula, the cross entropy loss. We sum up, so we switched to M this time for some reason, but it's the same thing. We sum up across the five categories, so M is five. And for each one, we multiply the actual target value, so that's zero.
So here it is here, the actual target value, and we multiply that by the log of the predicted probability, the log of pred, the predicted probability. And so, of course, for four of these, that value is zero, because see here, y_j equals zero by definition for all but one of them, because it's one hot encoded. So for the one that it's not, we've got our actual times the log softmax. Okay. And so now actually you can see why PyTorch prefers to use log softmax, because it kind of skips over having to do this log at all. So this equation looks slightly frightening, but when you think about it, all it's actually doing is finding the probability for the one that is one and taking its log, right? It's kind of weird doing it as a sum, but in math it can be a little bit tricky to say, oh, look this up in an array, which is basically all it's doing. But yeah, basically, at least in this case, for a single result where it's softmax, all it's doing is finding the 0.87, the one where the actual is one, and taking the log and then finally the negative. So that is what cross entropy loss does. We add that together for every row. So here's what it looks like if we add it together over every row, right? So N is the number of rows. And here's a special case. This is called binary cross entropy. What happens if we're not predicting which of five things it is, but we're just predicting, is it a cat? So in that case, if you look at this approach, you end up with this formula, which is identical to this formula, but with just two cases, which is you either are a cat or you're not a cat, right? And so if you're not a cat, it's one minus you are a cat. And same with the probability. You've got the probability you are a cat. And then not a cat is one minus that. So here's this special case of binary cross entropy. And now our rows represent rows of data. Okay, so each one of these is a different image, a different prediction. And so for each one, I'm just predicting, are you a cat?
And this is the actual. And so the actual for "are you not a cat" is just one minus that. And so then these are the predictions that came out of the model. Again, we can use softmax or its binary equivalent. And so that will give you a prediction that you're a cat. And the prediction that it's not a cat is one minus that. And so here is each of the parts, y_i times the log of p(y_i). And here is, why did I subtract? That's weird. Oh, because I've got minus of both. So this way avoids parentheses. Yeah, minus the "are you not a cat" times the log of the prediction of "are you not a cat". And then we can add those together. And so that would be the binary cross entropy loss of this data set of five cat or not cat images. Now, if you've got an eagle eye, you may have noticed that I am currently looking at the documentation for something called nn.CrossEntropyLoss. But over here, I had something called F.cross_entropy. Basically, it turns out that all of the loss functions in PyTorch have two versions. There's a version which is a class, which you can instantiate passing in various tweaks you might want. And there's also a version which is just a function. And so if you don't need any of these tweaks, you can just use the function. The functions live in a, I can't remember what the submodule is called, I think it might be torch.nn.functional, but everybody, including the PyTorch official docs, just calls it capital F. So that's what this capital F refers to. So our loss, if we just care about disease, is going to be passed the three things, but it's just going to calculate cross entropy on our input versus disease. All right. So that's all fine. So now when we create a vision learner, you can't rely on fastai to know what loss function to use, because we've got multiple targets. So you have to say, this is the loss function I want to use. These are the metrics I want to use.
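To make the cross entropy and binary cross entropy calculations described above concrete, here is a small plain-Python sketch. The probabilities and targets are invented for illustration, and the helper names are hypothetical (they are not PyTorch functions). The binary version averages over rows, which is an assumption on my part; the lecture only says the per-row terms are added together.

```python
import math

def cross_entropy_lookup(probs, target_idx):
    """Find the probability of the correct class, take its log, then negate."""
    return -math.log(probs[target_idx])

def cross_entropy_one_hot(probs, one_hot):
    """Same thing written as the sum over classes of actual * log(predicted)."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs))

def binary_cross_entropy(preds, actuals):
    """Mean over rows of -(y*log(p) + (1-y)*log(1-p)); averaging is assumed here."""
    per_row = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
               for p, y in zip(preds, actuals)]
    return sum(per_row) / len(per_row)

# Hypothetical softmax output for one image; the correct class is index 1.
probs = [0.0005, 0.8736, 0.1171, 0.0082, 0.0006]
loss_a = cross_entropy_lookup(probs, 1)
loss_b = cross_entropy_one_hot(probs, [0, 1, 0, 0, 0])
print(loss_a, loss_b)

# Hypothetical "is it a cat?" predictions and actuals for five images.
cat_preds = [0.9, 0.2, 0.7, 0.4, 0.95]
cat_actuals = [1, 0, 1, 0, 1]
print(binary_cross_entropy(cat_preds, cat_actuals))
```

The two multi-class versions give the same number, which is the point made above: the sum over a one hot encoded target is just an array lookup in disguise.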
And the other thing you can't rely on is that fastai no longer knows how many activations to create. Because again, there is more than one target. So you have to say the number of outputs to create at the last layer is 10. So this is just saying what's the size of the last matrix. And once we've done that, we can train it. And we get, you know, basically the same kind of result as we always get. Because this model at this point is identical to our previous ConvNeXt small model. We've just done it in a slightly more roundabout way. So finally, before our break, I'll show you how to expand this now into a multi target model. And the trick is actually very simple. And you might have almost got the idea of it when I talked about it earlier. Our vision learner now requires 20 outputs. We now need that last matrix to produce 20 activations, not 10. 10 of those activations are going to predict the disease. And 10 of the activations are going to predict the variety. So you might then be asking, well, how does the model know what it's meant to be predicting? And the answer is, with the loss function, you're going to have to tell it. So for example, disease loss, remember, is going to get the input, the disease and the variety. The input is now going to have 20 columns in it. So we're just going to decide, all right, the first 10 columns are the prediction of what the disease is, that is, the probability of each disease. So we can now pass to cross entropy the first 10 columns and the disease target. So the way you read this, colon means every row. And then colon 10 means every column up to the 10th. So these are the first 10 columns. And that's a loss function that just works on predicting disease using the first 10 columns. For variety, we'll use cross entropy loss with the target of variety. And this time we'll use the second 10 columns. So here's column 10 onwards.
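The column split described above can be sketched without PyTorch using plain lists, where each row holds 20 activations. This is only an illustration: `disease_loss` and `variety_loss` follow the argument order mentioned in the lecture (input, disease, variety), but the `cross_entropy` helper and the toy batch are assumptions written for this example. The list slices `row[:10]` and `row[10:]` play the role of the tensor slices read aloud as "colon, colon 10" and "column 10 onwards".

```python
import math

def log_softmax(row):
    """Log of the softmax, computed stably via log-sum-exp."""
    biggest = max(row)
    log_sum = biggest + math.log(sum(math.exp(z - biggest) for z in row))
    return [z - log_sum for z in row]

def cross_entropy(rows, targets):
    """Mean over rows of minus the log-probability of the target class."""
    losses = [-log_softmax(row)[t] for row, t in zip(rows, targets)]
    return sum(losses) / len(losses)

def disease_loss(inp, disease, variety):
    # First 10 columns of every row are treated as the disease predictions.
    return cross_entropy([row[:10] for row in inp], disease)

def variety_loss(inp, disease, variety):
    # Columns 10 onwards of every row are treated as the variety predictions.
    return cross_entropy([row[10:] for row in inp], variety)

# Toy batch of two rows with 20 activations each (made-up numbers).
inp = [[0.0] * 20, [0.0] * 20]
inp[0][3] = 5.0   # row 0 strongly favours disease class 3
inp[1][7] = 5.0   # row 1 strongly favours disease class 7
disease = [3, 7]
variety = [0, 9]

print(disease_loss(inp, disease, variety))   # small: the disease columns are confident and right
print(variety_loss(inp, disease, variety))   # larger: the variety columns are uniform
```

Nothing in the model itself says which columns mean what; it is only the way each loss slices the activations that assigns the first ten to disease and the rest to variety.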
So then the overall loss function is the sum of those two things, disease loss plus variety loss. And that's actually it. That's all the model needs. If you think through the manual neural nets we've created, this loss function will be reduced when the first 10 columns are doing a good job of predicting the disease probabilities and the second 10 columns are doing a good job of predicting the variety probabilities. And therefore the gradients will point in an appropriate direction, so that the coefficients will get better and better at using those columns for those purposes. It would be nice to see the error rate as well for each of disease and variety. So we can call error rate passing in the first 10 columns and disease, and then for variety, the second 10 columns and variety. And we may as well also add the losses to the metrics. And so now when we create a learner, we're going to pass in as the loss function the combined loss, and as the metrics our list of all the metrics, and n_out equals 20. And now look what happens when we train. As well as telling us the overall train and valid loss, it also tells us the disease and variety error and the disease and variety loss. And you can see our disease error is getting down to similar levels to what it was before. It's slightly less good. But it's similar. It's not surprising it's slightly less good, because we've only given it the same number of epochs. And we're now asking it to try to do more stuff, which is to learn to recognize what the rice variety looks like and also learn to recognize what the disease looks like. Here's the counterintuitive thing though. If we train it for longer, it may well turn out that this model, which is trying to predict two things, actually gets better at predicting disease than our disease specific model. Why is that? Like that sounds weird, right? Because we're having to do more stuff. And it's the same size model.
Well, the reason is that quite often it'll turn out that the kinds of features that help you recognize a variety of rice are also useful for recognizing the disease. You know, maybe there are certain textures, right? Or maybe some diseases impact different varieties in different ways. So it'd be really helpful to know what variety it was. So I haven't tried training this for a long time. And I don't know what the answer is in this particular case: does a multi target model do better than a single target model at predicting disease? But I just want to let you know, sometimes it does. So for example, a few years ago, there was a Kaggle competition for recognizing the kinds of fish on a boat. And I remember we ended up doing a multi target model where we tried to predict a second thing. I can't even remember what it was, maybe it was the type of boat or something. And it definitely turned out in that Kaggle competition that predicting two things helped you predict the type of fish better than predicting just the type of fish. So there are at least two reasons to learn about multi target models. One is that sometimes you just want to be able to predict more than one thing. So this is useful. And the second is sometimes this will actually be better at predicting just one thing than a just one thing model. And of course, the third reason is it really forced us to dig quite deeply into these loss functions and activations in a way we haven't quite done before. So it's okay. It's absolutely okay if this is confusing. The way to make it not confusing is, well, the first thing I'd do is go back to our earlier models where we did stuff by hand on the Titanic data set and built our own architectures. And maybe you could try to build a model that predicts two things in the Titanic data set. Maybe you could try to predict both sex and survival or something like that, or class and survival.
Because that kind of forces you to look at it on very small data sets. And then the other thing I'd say is run this notebook and really experiment at trying to see what kind of outputs you get. Like actually look at the inputs and look at the outputs and look at the data loaders and so forth. All right, let's have a six minute break. So I'll see you back here at 10 past seven. Okay, welcome back. Oh, before I continue, I very rudely forgot to mention that this very nice equation image here is from an article by Chris Said called Things That Confused Me About Cross Entropy. It's a very good article. So I recommend you check it out if you want to go a bit deeper there. There's a link to it inside the spreadsheet. So the next notebook we're going to be looking at is this one called collaborative filtering deep dive. And this is going to cover our last of the four major application areas, collaborative filtering. And this is actually the first time I'm going to be presenting a chapter of the book largely without variation. Because this is one where I looked back at the chapter and I was like, oh, I can't think of any way to improve this. So I thought I'll just leave it as is. But we have put the whole chapter up on Kaggle. So that's the way I'm going to be showing it to you. And so we're going to be looking at a data set called the MovieLens data set, which is a data set of movie ratings. And we're going to grab a smaller version of it, the 100,000 record version of it. And it comes as a CSV file, which we can read in. It's not really a CSV file. It's a TSV file. This here means a tab in Python. These are the names of the columns. So here's what it looks like. It's got a user, a movie, a rating and a timestamp. We're not going to use the timestamp at all. So basically three columns we care about. This is a user ID. So maybe 196 is Jeremy and maybe 186 is Rachel. And 22 is John. I don't know. Maybe this movie is Return of the Jedi. And this one's Casablanca.
This one's LA Confidential. And then this rating says, how did Jeremy feel about Return of the Jedi? He gave it a three out of five. That's how we can read this data set. This kind of data is very common. Anytime you've got a user and a product or service, and you might not even have ratings. Maybe just the fact that they bought that product. You could have a similar table with zeros and ones. So for example, Radek, who's in the audience here, is now at NVIDIA and basically does this, right? Recommendation systems. So recommendation systems, you know, it's a huge industry. And so what we're learning today is a really key foundation of it. So these are the first few rows. This is not a particularly great way to see it. I prefer to kind of cross tabulate it, like this. This is the same information. So for each movie, for each user, here's the rating. So user 212 never watched movie 49. Now, if you're wondering why there are so few empty cells here, I actually grabbed the most watched movies and the most movie watching users for this particular sample matrix. So that's why it's particularly full. So yeah, this is what a collaborative filtering data set looks like when we cross tabulate it. So how do we fill in this gap? So maybe user 212 is Nick and movie 49, what's a movie you haven't seen, Nick, that you might quite like but you're maybe not sure about? The new Elvis movie. Baz Luhrmann, good choice, Australian director, filmed in Queensland. Yeah. Okay, so that's movie number 49. So is Nick going to like the new Elvis movie? Well, to figure this out, what we'd ideally like to know is, for each movie, what kind of movie is it? Like what are the kind of features of it? Is it action-y, science fiction-y, dialogue driven, critically acclaimed? So let's say, for example, we were trying to look at The Last Skywalker. Maybe that was the movie that Nick's wondering about watching.
And so if we had three categories, being science fiction, action, or kind of classic old movies, we'd say The Last Skywalker is very science fiction, and see, this is from like negative one to one. Pretty action, definitely not an old classic, or at least not yet. And so then maybe we could say, okay, well, maybe Nick's tastes in movies are that he really likes science fiction, quite likes action movies and doesn't really like old classics. Right. So then we could kind of match these up to see how much we think this user might like this movie. To calculate the match, we could just multiply the corresponding values, user one times Last Skywalker, and add them up. 0.9 times 0.98, plus 0.8 times 0.9, plus negative 0.6 times negative 0.9. That's going to give us a pretty high number. Right. With a maximum of three. So that would suggest Nick probably would like The Last Skywalker. On the other hand, the movie Casablanca, we would say definitely not very science fiction, not really very action, definitely very old classic. So then we'd do exactly the same calculation and get this negative result here. So you probably wouldn't like Casablanca. This thing here, when we multiply the corresponding parts of a vector together and add them up, is called a dot product in math. So this is the dot product of the user's preferences and the type of movie. Now the problem is we weren't given that information. We know nothing about these users or about the movies. So what are we going to do? We want to try to create these factors without knowing ahead of time what they are. We wouldn't even know what factors to create. What are the things that really matter when people decide what movies they want to watch? What we can do is we can create things called latent factors. Latent factors is this weird idea that we can say, I don't know what things about movies matter to people, but there's probably something.
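The match score worked through above is just a dot product, which can be sketched in a few lines of plain Python. The user and Last Skywalker numbers are the ones read out in the lecture; the Casablanca numbers are not stated there, so the values below are assumptions chosen only to fit the description (not science fiction, not much action, very much an old classic).

```python
def dot(a, b):
    """Multiply corresponding values and add them up."""
    return sum(x * y for x, y in zip(a, b))

# Factors in the order: science fiction, action, old classic (range roughly -1 to 1).
user = [0.9, 0.8, -0.6]             # likes sci-fi, quite likes action, dislikes old classics
last_skywalker = [0.98, 0.9, -0.9]  # values quoted in the lecture
casablanca = [-0.99, -0.3, 0.8]     # assumed values for illustration only

print(dot(user, last_skywalker))  # high score: a good match
print(dot(user, casablanca))      # negative score: a poor match
```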
And let's just try using SGD to find them. And we can do it in everybody's favorite mathematical optimization software, Microsoft Excel. So here is that table. And what we can do, let's head over here actually, here's that table. So what we could do is we could say for each of those movies, so let's say for movie 27, let's assume there are five latent factors. I don't know what they're for. They're just five latent factors. We'll figure them out later. And for now, I certainly don't know what the values of those five latent factors are for movie 27. So we're just going to chuck random numbers in them. And we're going to do the same thing for movie 49. Pick another five random numbers. And the same thing for movie 57. Pick another five numbers. And you might not be surprised to hear we're going to do the same thing for each user. So for user 14, we're going to pick five random numbers for them. And for user 29, we'll pick five random numbers for them. And so the idea is that this number here, 0.19, is saying that user ID 14 feels not very strongly about the factor that, for movie 27, has a value of 0.71. So therefore in here, we do the dot product. The details of why don't matter too much, but actually you can figure this out from what we've said so far. If you go back to our definition of matrix product, you might notice that the matrix product of a row with a column is the same thing as a dot product. And so here in Excel, I have a row and a column. So therefore I say matrix multiply that by that, and that gives us the dot product. So here's the dot product of that by that, or the matrix multiply, given that they're a row and a column. The only other slight quirk here is that if the actual rating is empty, I'm just going to leave it blank. I'm going to set it to zero actually. So here is everybody's predicted rating of movies. I say predicted, of course, these are currently random numbers. So they are terrible predictions.
But when we have some way to predict things, and we start with terrible random predictions, we know how to make them better, don't we? We use stochastic gradient descent. Now to do that, we're going to need a loss function. So that's easy enough. We can just calculate the sum of x minus y squared divided by the count. That is the mean squared error. And if we take the square root, that is the root mean squared error. So here is the root mean squared error in Excel between these predictions and these actuals. And so now that we have a loss function, we can optimize it: Data, Solver, set objective to this one here, by changing cells, these ones here and these ones here, and Solve. Okay. And initially our loss is 2.81. So we hope it's going to go down. And as it solves, not a great choice of background color, but it says 0.68. So this number is going down. So actually in Excel, it's not quite using stochastic gradient descent, because Excel doesn't know how to calculate gradients. There are actually optimization techniques that don't need gradients. They calculate them numerically as they go. But that's a minor quirk. One thing you'll notice is it's doing it very, very slowly. There's not much data here and it's still going. One reason for that is that because it's not using gradients, it's much slower. And the second is Excel is much slower than PyTorch. Anyway, it's come up with an answer and look at that. It's got to 0.42. So it's got a pretty good prediction. And so we can kind of get a sense of this, for example, looking at the last three. User 14 likes, dislikes, likes. Let's see somebody else like that. Here's somebody else. This person likes, dislikes, likes. So based on our kind of approach, we're saying, okay, since they have the same feeling about these three movies, maybe they'll feel the same about these three movies. So this person likes all three of those movies and this person likes two out of three of them.
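For readers who want to see the whole loop outside Excel, here is a minimal sketch that starts from random latent factors, measures root mean squared error on the known ratings only, and improves the factors with per-rating gradient updates. Everything specific here is an assumption made for illustration: the tiny ratings list, the two latent factors, the learning rate and the number of passes. It uses hand-written gradients of the squared error rather than Excel's Solver or PyTorch's autograd.

```python
import math
import random

random.seed(42)

# Hypothetical known ratings as (user index, movie index, rating); other cells are missing.
ratings = [(0, 0, 4.0), (0, 1, 1.0), (0, 2, 5.0),
           (1, 0, 5.0), (1, 1, 2.0),
           (2, 1, 4.0), (2, 2, 1.0)]
n_users, n_movies, n_factors = 3, 3, 2

# Start every latent factor as a random number, as in the spreadsheet.
user_factors = [[random.random() for _ in range(n_factors)] for _ in range(n_users)]
movie_factors = [[random.random() for _ in range(n_factors)] for _ in range(n_movies)]

def predict(u, m):
    """Predicted rating is the dot product of the user's and the movie's factors."""
    return sum(a * b for a, b in zip(user_factors[u], movie_factors[m]))

def rmse():
    """Root mean squared error over the ratings we actually have."""
    return math.sqrt(sum((predict(u, m) - r) ** 2 for u, m, r in ratings) / len(ratings))

start = rmse()
lr = 0.01  # assumed learning rate
for epoch in range(1000):
    for u, m, r in ratings:
        err = predict(u, m) - r
        for k in range(n_factors):
            # Gradients of (prediction - rating)^2 with respect to each factor.
            grad_u = 2 * err * movie_factors[m][k]
            grad_m = 2 * err * user_factors[u][k]
            user_factors[u][k] -= lr * grad_u
            movie_factors[m][k] -= lr * grad_m
end = rmse()

print("RMSE before:", start)
print("RMSE after: ", end)
print("filled-in guess for user 1, movie 2:", predict(1, 2))
```

The last line is the matrix completion step: once the factors fit the known ratings, the same dot product gives a guess for a cell that was never rated.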
So, you know, this is the idea, right? It's as if somebody says to you, I like this movie, this movie, this movie. And you're like, oh, I like those movies too. What other movies do you like? And they'll say, oh, how about this? There's a good chance that you're going to like the same thing. That's the basis of collaborative filtering. Okay. And mathematically, we call this matrix completion. So this matrix is missing values, we just want to complete them. So the core of collaborative filtering is, it's a matrix completion exercise. Can you grab a microphone? My question was with the dot products, right? So if we think about the math of that for a minute, if we think about the cosine of the angle between the two vectors, that's going to roughly approximate the correlation. Is that essentially what's going on here, in one sense? So is the cosine of the angle between the vectors much the same thing as the dot product? The answer is yes. They're the same once you normalize them. So, yeah. Is that still on? Is it correlation, what we're doing here, at scale as well? Yeah, you can think of it that way. Okay, cool. Now this looks pretty different to how PyTorch looks. PyTorch has things in rows, right? We've got user, movie, rating; user, movie, rating, right? So how do we do the same kind of thing in PyTorch? So let's do the same kind of thing in Excel, but using the table in the same format that PyTorch has it. Okay. So to do that in Excel, the first thing I'm going to do is say, okay, I've got to look at user number 14. And I want to know what index, like how far down this list, 14 is. So we use MATCH; MATCH means find the index. So this is user index one. And then what I'm going to do is say that for these five numbers, basically I want to find row one over here. And in Excel, that's called OFFSET. So we're going to offset from here by one row.
And so you can see here it is, 0.19, 0.63; 0.19, 0.63, etc. Right. So here's the second user, 0.25, 0.03, etc. And we can do the same thing for movies. Right. So movie 417 is index 14. That's going to be 0.75, 0.47, etc. And so same thing, right, but now we're going to offset from here by 14 to get this row, which is 0.75, 0.47, etc. And so the prediction now is the dot product, which is called SUMPRODUCT in Excel. This is the sum product of those two things. This is exactly the same as we had before, right. But when we put everything next to each other like this, we have to manually look up the index. And so then for each one, we can calculate the error squared, prediction minus rating, squared. And then we could add those all up. And if you remember, this is actually the same root mean squared error we had before we optimized, 2.81, because we've got the same numbers as before. And so this is mathematically identical. So what's this weird word up here, embedding? You've probably heard it before. And you might have come away with the impression it's some very complex fancy mathematical thing. But actually it turns out that it is just looking something up in an array. That is what an embedding is. So we call this an embedding matrix. And these are our user embeddings and our movie embeddings. So let's take a look at that in PyTorch. And you know, at this point, if you've heard about embeddings before, you might be thinking, that can't be it. And yeah, it's just as complex as the rectified linear unit, which turned out to be: replace negatives with zeros. Embedding actually means look something up in an array. So there's a lot of things that we use as deep learning practitioners to try to make you as intimidated as possible, so that you don't wander into our territory and start winning our Kaggle competitions. And unfortunately, once you discover the simplicity of it, you might start to think that you can do it yourself. And then it turns out you can.
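The "find the index, then grab that row" lookup described above can be sketched in plain Python. The first two numbers of the rows for user 14, the second user, and movie 417 are the ones read out in the lecture; every other ID and number below is made up to fill out the rows, and the `embed` helper is a hypothetical name. Note that Python counts from zero, whereas the spreadsheet lookup counts from one.

```python
# Embedding "matrices": one row of five latent factors per user or movie.
user_ids = [14, 29, 72]                      # IDs other than 14 are assumed
user_embeddings = [
    [0.19, 0.63, 0.31, 0.44, 0.51],          # user 14 (first two values from the lecture)
    [0.25, 0.03, 0.75, 0.12, 0.66],          # second user (first two values from the lecture)
    [0.82, 0.41, 0.09, 0.57, 0.20],
]
movie_ids = [27, 49, 417]                    # IDs other than 417 are assumed
movie_embeddings = [
    [0.71, 0.81, 0.74, 0.04, 0.04],
    [0.36, 0.55, 0.28, 0.90, 0.13],
    [0.75, 0.47, 0.32, 0.64, 0.08],          # movie 417 (first two values from the lecture)
]

def embed(ids, embeddings, wanted_id):
    """Look up the row for an ID: find its index (like MATCH), then take that row (like OFFSET)."""
    index = ids.index(wanted_id)
    return embeddings[index]

user_vec = embed(user_ids, user_embeddings, 14)
movie_vec = embed(movie_ids, movie_embeddings, 417)

# The prediction is the sum of the element-wise products (SUMPRODUCT in Excel).
prediction = sum(u * m for u, m in zip(user_vec, movie_vec))
print(user_vec, movie_vec, prediction)
```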
So yeah, that's basically what pretty much all of this jargon turns out to be. So we're going to try to learn these latent factors, which is exactly what we just did in Excel. We just learned the latent factors. All right. So if we're going to learn things in PyTorch, we're going to need data loaders. One thing I did is, there is actually a movies table as well with the names of the movies. So I merged that together with the ratings so that we've now got the user ID and the actual name of the movie. We don't need that obviously for the model, but it's just going to make it a bit more fun to interpret later. So this is called ratings. We have something called CollabDataLoaders, so collaborative filtering data loaders, and we can get that from a data frame by passing in the data frame. And it expects a user column and an item column. So the user column is what it sounds like, the person that is rating this thing. And the item column is the product or service that they're rating. In our case, the user column is called user. So we don't have to pass that in. And the item column is called title. So we do have to pass this in, because by default, the user column should be called user and the item column will be called item. Give it a batch size. And as usual, we can call show_batch. And so here's our data loaders, a batch of data loaders, or at least a bit of it. And since we told it about the names, we actually get to see the names, which is nice. All right. So now we're going to create the user factors and movie factors, i.e. this one, and this one. So the number of rows of the movie factors will be equal to the number of movies. And the number of rows of the user factors will be equal to the number of users. And the number of columns will be whatever we want, however many factors we want to create. John. This might be a pertinent time to jump in with a question. Any comments about choosing the number of factors?
We have defaults that we use for embeddings in fastai. It's a very obscure formula and people often ask me for the mathematical derivation of where it came from. But what actually happened is, I wrote down how many factors I think is appropriate for different size categories on a piece of paper at a table. Well, actually in Excel. And then I fitted a function to that, and that's the function. So it's basically a mathematical function that fits my intuition about what works well. But it seems to work pretty well. I've seen it used in lots of other places now. Lots of papers will be like, using fastai's rule of thumb for embedding sizes. Here's the formula. Cool. Thank you. It's pretty fast to train these things, so you can try a few. So the number of users is just the length of how many users there are, and the number of movies is the length of how many titles there are. So we create a matrix of random numbers of users by five, and one for movies of movies by five. And now we need to look up the index of the movie in our movie latent factor matrix. The thing is, when we've learned about deep learning, we learned that we do matrix multiplications, not look something up in a matrix or in an array. So in Excel, we were saying OFFSET, which is to say find element number 14 in the table, and that's not a matrix multiply. How does that work? Well, actually it is. It actually is, for the same reason that we talked about here, which is that we can represent "find element number one in this list" as, actually, the same as multiplying by a one hot encoded matrix. So remember how, let's just take off the log for a moment. Look, this has returned 0.87. And particularly if I take the negative off here, if I add this up, this is 0.87, which is the result of finding the index number one thing in this list. But we didn't do it that way. We did this by taking the dot product of this, sorry, of this and this.
But that's actually the same thing, right? Taking the dot product of a one hot encoded vector with something is the same as looking up this index in the vector. So that means that this exercise here of looking up the 14th thing is the same as doing a matrix multiply with a one hot encoded vector. And we can see that here. This is how we create a one hot encoded vector of length n_users in which element number three is set to one and everything else is zero. And if we multiply that, so @ means, do you remember, matrix multiply in Python. So if we multiply that by our user factors, we get back this answer. And if we just ask for user factors number three, we get back the exact same answer. They're the same thing. So you can think of an embedding as being a computational shortcut for multiplying something by a one hot encoded vector. And so if you think back to what we did with dummy variables, right, this basically means embeddings are like a cool math trick for speeding up doing matrix multiplies with dummy variables. Not just speeding up, we never even have to create the dummy variables. We never have to create the one hot encoded vectors. We can just look up in an array. All right, so we're now ready to build a collaborative filtering model. And we're going to create one from scratch. And as we've discussed before, in PyTorch, a model is a class. And so we briefly touched on this, but I've got to touch on it again. This is how we create a class in Python. You give it a name. And then you say, how do we initialize it? How do we construct it? So in Python, remember, they call these things dunder whatever, this is dunder init. These are magic methods that Python will call for you at certain times. The method called dunder init is called when you create an object of this class. So we could pass it a value. And so now we set the attribute called a equal to that value.
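Returning to the one hot equivalence described earlier in this passage, a short plain-Python sketch shows that multiplying a one hot encoded vector by a factor matrix gives back exactly the row you would get by indexing. The helper names and the random factor matrix are assumptions for illustration; in PyTorch the same comparison would use the `@` operator and tensor indexing.

```python
import random

random.seed(0)

def one_hot(index, length):
    """Vector of zeros with a single one at the given index."""
    return [1.0 if i == index else 0.0 for i in range(length)]

def vec_mat(vec, mat):
    """Multiply a row vector by a matrix (what @ would do for these shapes)."""
    n_cols = len(mat[0])
    return [sum(vec[r] * mat[r][c] for r in range(len(mat))) for c in range(n_cols)]

n_users, n_factors = 6, 5
user_factors = [[random.random() for _ in range(n_factors)] for _ in range(n_users)]

via_multiply = vec_mat(one_hot(3, n_users), user_factors)  # the "dummy variable" route
via_lookup = user_factors[3]                               # the embedding route

print(via_multiply)
print(via_lookup)
```

The lookup does no multiplications at all and never builds the one hot vector, which is why it is the shortcut used in practice.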
And so then later on, we could call a method called say that will say hello to whatever you passed in here. And this is what it'll say. So for example, if you construct an object of type Example, passing in Sylvain, self.a now equals Sylvain. So if you use the .say method with nice to meet you, x is now nice to meet you. So it'll say hello, Sylvain, nice to meet you. So that's kind of all you need to know about object oriented programming in PyTorch to create a model. Oh, there's one more thing we need to know, sorry, which is you can put something in parentheses after your class name. And that's called the superclass. It's basically going to give you some functionality for free. And if you create a model in PyTorch, you have to make Module your superclass. This is actually fastai's version of Module, but it's nearly the same as PyTorch's. So when we create this DotProduct object, it's going to call dunder init. And we have to say, well, how many users are going to be in our model? And how many movies? And how many factors? And so we can now create an embedding of users by factors for users and an embedding of movies by factors for movies. And so then PyTorch does something quite magic, which is that if you create a DotProduct object like so, you can then treat it like a function. You can call it and calculate values on it. And when you do that, this is really important to know, PyTorch is going to call a method called forward in your class. So this is where you put the calculation of your model. It has to be called forward. And it's going to be passed the object itself and the thing you're calculating on. In this case, the user and movie for a batch. So this is your batch of data. Each row will be one user and movie combination, and the columns will be users and movies. So we can grab the first column. Right.
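For reference, the small class being described can be written out roughly like this. It is plain Python with no PyTorch involved; the exact wording of the returned string is an assumption based on what is read out above.

```python
class Example:
    def __init__(self, a):
        # Python calls __init__ ("dunder init") when the object is created.
        self.a = a

    def say(self, x):
        # Uses the attribute stored at construction time plus the argument x.
        return f"Hello {self.a}, {x}."

ex = Example("Sylvain")
print(ex.say("nice to meet you"))  # Hello Sylvain, nice to meet you.
```

A PyTorch model follows the same pattern, with a superclass in the parentheses after the class name and a method named forward holding the calculation.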
So this is every row of the first column, and we look it up in the user factors embedding to get our user embeddings. So that is the same as doing this. Let's say this is one mini batch. And then we do exactly the same thing for the second column, passing it into our movie factors to look up the movie embeddings, and then take the dot product. dim equals one because we're summing across the columns for each row. We're calculating a prediction for each row. So once we've got that, we can pass it to a learner, passing in our data loaders and our model and our loss function, mean squared error. And we can call fit and away it goes. And this, by the way, is running on CPU. And these are very fast to run. So this is doing 100,000 rows in 10 seconds, which is a lot faster than our few dozen rows in Excel. And so you can see the loss going down. And so we've trained a model. It's not going to be a great model. And one of the problems is that, let's see if we can see this in our Excel one, look at this one here. This prediction is bigger than five, but no rating is bigger than five. So that seems like a problem. We're predicting things that are bigger than the highest possible number. And in fact, these are very much movie enthusiasts, in that nobody gave anything a one. Yeah, nobody even gave anything a one here. So do you remember when we learned about sigmoid, the idea of squishing things between zero and one? We could still do stuff without a sigmoid. But when we added a sigmoid, it trained better, because the model didn't have to work so hard to get it kind of into the right zone. Now, if you think about it, if you take something and put it through a sigmoid and then multiply it by five, now you've got something that's going to be between zero and five, where you used to have something between zero and one. So we could do that. In fact, we could do that in Excel. I'll leave that as an exercise to the reader. Let's do it over here in PyTorch.
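As a rough illustration of the dot product model's forward pass and the mean squared error loss described above, here is a plain-Python sketch. It is not the fastai or PyTorch code: the sizes, the random initialization, the batch, and the function names `forward` and `mse` are all assumptions for illustration, and nested lists stand in for embedding tensors.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

n_users, n_movies, n_factors = 4, 6, 5  # hypothetical sizes

def random_matrix(n_rows, n_cols):
    """Small random numbers, standing in for a freshly created embedding."""
    return [[random.gauss(0, 0.1) for _ in range(n_cols)] for _ in range(n_rows)]

user_factors = random_matrix(n_users, n_factors)
movie_factors = random_matrix(n_movies, n_factors)

def forward(batch):
    """batch: rows of (user_index, movie_index). Returns one prediction per row."""
    preds = []
    for row in batch:
        users = user_factors[row[0]]    # look up the first column
        movies = movie_factors[row[1]]  # look up the second column
        # Sum across the factors for this row (the role of dim=1 on tensors).
        preds.append(sum(u * m for u, m in zip(users, movies)))
    return preds

def mse(preds, targets):
    """Mean squared error between predictions and actual ratings."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

batch = [(0, 2), (3, 5), (1, 1)]  # made-up (user, movie) pairs
ratings = [4.0, 3.0, 5.0]         # made-up actual ratings
preds = forward(batch)
print(len(preds), mse(preds, ratings))
```

Nothing here constrains the predictions to the rating scale, so an untrained or badly trained model can land anywhere, including above five.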
So we take the exact same class as before, and this time we call sigmoid_range. And so sigmoid_range is something which will take our prediction and then squash it into our range. And by default, we'll use a range of zero through to 5.5. So it can't be smaller than zero, and it can't be bigger than 5.5. Why don't I use five? That's because a sigmoid can never hit one, right? And a sigmoid times five can never hit five. But some people do give movies a five. So you want to make it a bit bigger than our highest. So this one got a loss of 0.862886. Oh, it's not better. Isn't that always the way? All right. It didn't actually help. It doesn't always. So be it. Let's keep trying to improve it. Let me show you something I noticed. Some of the users, like this one, this person here just loves movies. They give nearly everything a four or five. Their worst score is a three. All right. This person, oh, here's a one. This person's got much more range. Some things are twos, some ones, some fives. This person doesn't seem to like movies very much, considering how many they watch. Nothing gets a five. They've got discerning tastes, I guess. At the moment, we don't have any way in our formulation of this model to say this user tends to give low scores and this user tends to give high scores. There's just nothing like that, right? But that would be very easy to add. Let's add one more number to our five factors, just here, for each user. And now, rather than doing just the matrix multiply, let's add. Oh, it's actually the top one. Let's add this number to it, H19. And so for this one, let's add I19 to it. Yeah. So I've got it wrong. This one here. So this row here, we're going to add to each rating. And then we're going to do the same thing here. Each movie's now got an extra number here. Again, we're going to add A26. So it's our matrix multiplication, plus, we call it the bias, the user bias, plus the movie bias.
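Here is a minimal plain-Python sketch of the two ideas above: squashing a prediction into a range with a sigmoid, and adding a per-user and per-movie bias to the dot product. The function names, the example numbers, and the choice to apply the range after adding the biases are assumptions for illustration, not the library's actual code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_range(x, low, high):
    """Squash x into the open interval (low, high)."""
    return sigmoid(x) * (high - low) + low

def predict(user_vec, movie_vec, user_bias, movie_bias, y_range=(0.0, 5.5)):
    """Dot product plus user bias plus movie bias, squashed into y_range."""
    raw = sum(u * m for u, m in zip(user_vec, movie_vec)) + user_bias + movie_bias
    return sigmoid_range(raw, *y_range)

# Made-up factors: the same user and movie vectors, with two different user biases.
user_vec, movie_vec = [0.5, -0.2], [1.0, 0.4]
generous = predict(user_vec, movie_vec, user_bias=1.0, movie_bias=-0.1)
harsh = predict(user_vec, movie_vec, user_bias=-1.0, movie_bias=-0.1)
print(generous > harsh)  # True
```

With an upper bound of 5.5 the output can get close to five and beyond it, which a range ending exactly at five would never allow.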
So effectively, that's like making it so we don't have an intercept of zero anymore. And so if we now train this model: Data, Solver, Solve. So previously we got to 0.42. Okay. And so we're going to let that go along for a while. And then let's also go back and look at the PyTorch version. So for PyTorch now, we're going to have a user bias, which is an embedding of n_users by one. Right? Remember there was just one number for each user. And movie bias is an embedding of n_movies, also by one. And so we can now look up the user embedding and the movie embedding, do the dot product, and then look up the user bias and the movie bias and add them, and chuck that through the sigmoid. Let's train that. So let's see if we beat 0.865. Wow, we're not training very well, are we? Still not too great, 0.894. I think Excel normally does do better though. Let's see. Okay, Excel. Oh, Excel has done a lot better. It's gone from 0.42 to 0.35. Okay. So what happened here? Why did it get worse? Well, look at this. The valid loss got better, and then it started getting worse again. So we think we might be overfitting, which, you know, we have got a lot of parameters in our embeddings. So how do we avoid overfitting? A classic way to avoid overfitting is to use something called weight decay, also known as L2 regularization, which sounds much more fancy. What we're going to do is, when we compute the gradients, we're going to first add to our loss function the sum of the weights squared. This is something you should go back and add to your Titanic model, not that it's overfitting, but just to try it, right? So previously, our loss function has just been about the difference between our predictions and our actuals, right? And so our gradients were based on the derivative of that with respect to the coefficients. But we're saying now, let's add the sum of the square of the weights times some small number.
So what would make that loss function go down? That loss function would go down if we reduce our weights, or I should say, if we reduce the magnitude of our weights. For example, if we reduce them all to zero, that part of the loss function will be zero, because the sum of zero squared is zero. Now, the problem is that if our weights are all zero, our model doesn't do anything, right? So it would have crappy predictions. So it would want to increase the weights so that it's actually predicting something useful. But if it increases the weights too much, then it starts overfitting. So how is it going to actually get the lowest possible value of the loss function? By finding the right mix: weights not too high, right, but high enough to be useful at predicting. If there's some parameter that's not useful, for example, say we asked for five factors and we only need four, it can just set the weights for the fifth factor to zero, right? And then problem solved. It won't be used to predict anything, but it also won't contribute to our weight decay part. So previously, we had something calculated in the loss function. So now we're going to do exactly the same thing, but we're going to square the parameters, we're going to sum them up, and we're going to multiply them by some small number, like 0.01 or 0.001. And in fact, we don't even need to do this, because remember, the whole purpose of the loss is to take its gradient, right, and to print it out. The gradient of parameters squared is two times parameters. It's okay if you don't remember that from high school, but you can take my word for it. The gradient of y equals x squared is 2x. So actually, all we need to do is take our gradient and add the weight decay coefficient, 0.01 or whatever, times two times parameters. And given this is just some number we get to pick, we may as well fold the two into it and just get rid of it.
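The shortcut above can be checked numerically. This is a small plain-Python sketch with made-up parameter values and hypothetical function names: it compares the gradient of the penalty term, wd times the sum of squared parameters, against wd times two times the parameters, and then shows the folded form where the two is absorbed into the coefficient.

```python
def penalty(params, wd):
    """Weight decay term added to the loss: wd * sum(p**2)."""
    return wd * sum(p * p for p in params)

def penalty_grad(params, wd):
    """Analytic gradient of the penalty: wd * 2 * p for each parameter."""
    return [wd * 2 * p for p in params]

def numeric_grad(params, wd, eps=1e-6):
    """Central finite-difference estimate of the penalty gradient."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        grads.append((penalty(up, wd) - penalty(down, wd)) / (2 * eps))
    return grads

def add_weight_decay(grads, params, wd):
    """Folded form: the factor of 2 is absorbed into wd, so we add wd * p."""
    return [g + wd * p for g, p in zip(grads, params)]

params = [0.5, -1.5, 2.0]  # made-up parameter values
analytic = penalty_grad(params, wd=0.01)
numeric = numeric_grad(params, wd=0.01)
print(all(abs(a - n) < 1e-6 for a, n in zip(analytic, numeric)))  # True
```

Because the extra gradient is proportional to each parameter, every step nudges the weights toward zero unless the rest of the loss pushes back.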
So when you call fit, you can pass in a wd parameter, which adds this times the parameters to the gradient for you. And so that's going to say to the model, please don't make the weights any bigger than they have to be. And yay, finally, our loss actually improved. And you can see it getting better and better. In fastai applications like vision, we try to set this for you appropriately, and we generally do a reasonably good job. The defaults are normally fine. But in things like tabular and collaborative filtering, we don't really know enough about your data to know what to use here. So you should just try a few things. Let's try a few multiples of 10: start at 0.1, and then divide by 10 a few times, and just see which one gives you the best result. So this is called regularization. Regularization is about making your model no more complex than it has to be. It has a lower capacity. And so the higher the weights, the more they're moving the model around. So we want to keep the weights down, but not so far down that they don't make good predictions. And so the value of this, if it's higher, will keep the weights down more. It will reduce overfitting, but it will also reduce the capacity of your model to make good predictions. And if it's lower, it increases the capacity of the model and increases overfitting. All right, I'm going to leave this bit for next time. Before we wrap up, John, are there any more questions? Yeah, there are. There's some from back at the start of the collaborative filtering. So we had a bit of a conversation a while back about the size of the embedding vectors. And you talked about your fastai rule of thumb. So there was a question of whether anyone has ever done a kind of hyperparameter search and exploration. I mean, people will often do a hyperparameter search for their model, for sure, but I haven't seen any other rules other than my rule of thumb.
Right, so not productively, to your knowledge. Oh, productively for an individual model that somebody's built. And then there's a question here from Zaki, which I didn't quite wrap my head around. So Zaki, if you want to maybe clarify in the chat as well, but can recommendation systems be built based on average ratings of users' experience rather than collaborative filtering? Not really. I mean, if you've got lots of metadata, you could. So if you've got lots of demographic data about where the user's from and what loyalty scheme results they've had and blah, blah, blah, and then for products, there's metadata about that as well, then sure, averages would be fine. But if all you've got is kind of purchasing history, then you really want the granular data. Otherwise, how could you say, they like this movie, this movie and this movie, therefore they might also like that movie? If all you've got is, oh, they kind of like movies, there's just not enough information there. Yeah, great. And that's about it. Thanks. Okay, great. All right. Thanks, everybody. See you next time for our last lesson.