Hi everybody, and welcome to lesson six, where we're going to continue looking at training convolutional neural networks for computer vision. We last looked at this the lesson before last, and specifically we were looking at how to train an image classifier to pick out breeds of pet, one of 37 breeds of pet. We got as far as training a model, but we also had to figure out what loss function was actually being used in this model. And so we talked about cross entropy loss, which is a really important concept, and some of the things we're talking about today depend a bit on your understanding of it. So if you were at all unsure about where we got to with that, go back and have another look, have a look at the questionnaire in particular, and make sure that you're comfortable with cross entropy loss. If you're not, you may want to go back to the 04 MNIST basics notebook and remind yourself about MNIST loss, because it's very, very similar; that's what we built on to build up cross entropy loss. So having trained our model, the next thing we're going to do is look at model interpretation. There's not much point having a model if you don't see what it's doing. One thing we can do is use a confusion matrix, which in this case is not terribly helpful; there are a few too many classes. I mean, it's not too bad. We can see some colored areas, and this diagonal here is all the ones that are classified correctly. So for Persians, there were 31 classified as Persians. But we can see there are some bigger numbers off the diagonal: for Siamese, six were misclassified, actually considered a Birman. When you've got a lot of classes like this, it might be better instead to use the most_confused method, which tells you the combinations it got wrong the most often. In other words, which off-diagonal numbers are the biggest? So here's the biggest one, 10.
And that's confusing an American Pit Bull Terrier with a Staffordshire Bull Terrier; that happened 10 times. And Ragdoll is getting confused with Birman eight times. Now, I'm not a dog or cat expert, so I didn't know what this meant, and I looked it up on the internet. I found that American Pit Bull Terriers and Staffordshire Bull Terriers are almost identical, but I think they sometimes have a slightly different colored nose, if I remember correctly. And Ragdolls and Birmans are types of cats that are so similar to each other that there are whole long threads on cat lover forums about "is this a Ragdoll or is this a Birman?", with experts disagreeing with each other. So it's no surprise that these things are getting confused. When you see your model making sensible mistakes, the kind of mistakes that humans make, that's a pretty good sign that it's picking up the right kind of stuff, and also that the kinds of errors you're getting might be pretty tricky to fix. But let's see if we can make it better. One way to try to make it better is to improve our learning rate. Why would we want to improve the learning rate? Well, one thing we'd like to do is train faster, getting more done in fewer epochs. So one way to do that would be to call our fine_tune method with a higher learning rate. We used the default, which is 1e-2. So if we pump that up to 0.1, it's going to jump further each time. Remember, the learning rate (if you've forgotten this, have a look again at notebook 4) is the thing we multiply the gradients by to decide how far to step. Unfortunately, when we use this higher learning rate, the error rate after 3 epochs goes from 0.08 to 0.83. So we're getting the vast majority of them wrong now, which is not a good sign. Why did that happen? Well, rather than a gradual move towards the minimum, we stepped too far, and we get further and further away.
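The most_confused idea above is simple enough to sketch by hand. Here is a minimal illustration (not fastai's actual implementation) on a made-up three-class confusion matrix; the class names and counts are invented for the example:

```python
import numpy as np

# Toy 3-class confusion matrix: rows = actual class, columns = predicted.
classes = ["pit_bull", "staffordshire", "ragdoll"]
cm = np.array([
    [30, 10,  0],
    [ 8, 32,  1],
    [ 0,  0, 40],
])

# Collect every off-diagonal (actual != predicted) cell with its count,
# then sort so the most frequent confusions come first.
confused = [
    (classes[i], classes[j], int(cm[i, j]))
    for i in range(len(classes))
    for j in range(len(classes))
    if i != j and cm[i, j] > 0
]
confused.sort(key=lambda t: -t[2])
print(confused[0])  # → ('pit_bull', 'staffordshire', 10)
```

The biggest off-diagonal entries are exactly the "most confused" pairs the transcript is describing.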
So when you see this happening, which in practice looks like your error rate getting worse right from the start, that's a sign your learning rate is too high. We need to find something just right: not so small that we take tiny jumps and it takes forever, and not so big that we either get worse and worse, or we just bounce backwards and forwards and converge quite slowly. To find a good learning rate, we can use something that the researcher Leslie Smith came up with called the learning rate finder. The learning rate finder is pretty simple. Remember, when we do stochastic gradient descent, we look at one mini-batch at a time, so a few images at a time in this case; we find the gradient for that mini-batch, and we step our weights based on the learning rate and the gradient. What Leslie Smith said was: OK, let's do the very first mini-batch at a really, really low learning rate, like 10 to the minus 7, and then let's increase it by a little bit, maybe 25% higher, and do another step, and then 25% higher again, and do another step. These are not epochs; each step is just a single mini-batch. And then we can plot on this chart the loss at 10 to the minus 7, the loss at 25% higher than that, the loss at 25% higher again, and so on. Not surprisingly, at the low learning rates, the loss doesn't really come down, because the learning rate is so small that the steps are tiny, tiny, tiny. Then gradually we get to the point where the steps are big enough to make a difference, and the loss starts coming down, because we've plotted here the learning rate against the loss. The loss keeps coming down as we continue to increase the learning rate, until we get to a point where our learning rate is too high: the curve flattens out, and then it starts getting worse again. So there's a point, above about 0.1 here, where we're in that territory.
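To make the procedure concrete, here is a rough sketch of a learning rate finder in plain PyTorch on a tiny synthetic regression problem. This is not fastai's implementation: the growth factor, the stopping rule, and the "steepest drop" suggestion at the end are simplified assumptions for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Tiny synthetic regression problem: learn y = 3x + 1.
x = torch.randn(512, 1)
y = 3 * x + 1 + 0.1 * torch.randn(512, 1)

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=1e-7)

lr, mult = 1e-7, 1.2            # start tiny, grow ~20% per mini-batch
lrs, losses = [], []
for step in range(100):         # one mini-batch per step, cycling the data
    i = (step * 32) % 512
    xb, yb = x[i:i + 32], y[i:i + 32]
    loss = loss_fn(model(xb), yb)
    opt.zero_grad(); loss.backward(); opt.step()
    lrs.append(lr); losses.append(loss.item())
    if loss.item() > 4 * losses[0]:   # stop once the loss blows up
        break
    lr *= mult
    for g in opt.param_groups:        # raise the LR for the next batch
        g["lr"] = lr

# A crude suggestion: the LR where the loss dropped fastest between steps,
# comfortably below where the loss starts exploding.
best = min(range(1, len(losses)), key=lambda j: losses[j] - losses[j - 1])
print(f"suggested lr ≈ {lrs[best]:.2g}")
```

Plotting `losses` against `lrs` on a log scale reproduces the characteristic flat-then-falling-then-exploding curve described above.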
So what we really want is somewhere around here, where it's nice and steep. You can actually ask the learning rate finder for suggestions. We used lr_find to get this plot, and we can get back from it the minimum and the steepest point. So "steep" is where the curve was steepest: the steepest point was 5e-3. And the minimum point divided by 10, which is quite a good rule of thumb, is 1e-2. So somewhere around this range might be pretty good. Each time you run it, you'll get different values; a different time we ran it, we thought that maybe 3e-3 would be good, so we picked that. And you'll notice the learning rate finder plot is on a logarithmic scale, so be careful interpreting it. So we can now rerun the training, setting the learning rate to the number we picked from the learning rate finder, which in this case was 3e-3. And we can see that's now looking good: we've got an 8.3% error rate after three epochs. This idea of the learning rate finder is very straightforward. I can describe it to you in a couple of sentences, and it doesn't require any complex math. And yet it was only invented in 2015, which is super interesting. It just shows that there are so many interesting things still for us to learn and discover. Part of the reason it took a while, perhaps, is that engineers love using lots and lots of computers. Before the learning rate finder came along, people would run lots of experiments on big clusters to find out which learning rate was best, rather than just doing a batch at a time. I think partly also the idea of having a human in the loop, where we look at something and make a decision, is kind of unfashionable; a lot of folks in research and industry love things which are fully automated. But anyway, it's great we now have this tool, because it makes our life easier. And fastai was certainly the first library to have it.
And I don't know if it's still the only one to have it built into the basic library. So now we've got a good learning rate, how do we fine-tune the weights? So far we've just been running this fine_tune method without thinking much about what it's actually doing, though we did briefly mention in lesson one what's basically happening with fine_tune. What is transfer learning doing? Before we look at that, let's take a question. "Is the learning rate plot in lr_find plotted against one single mini-batch?" No, it's not. It's actually just the standard walking through the data loader, getting the usual mini-batches of the shuffled data, so it's just normal training. The only thing that's different is that we're increasing the learning rate a little bit after each mini-batch, and keeping track of the loss. "Along with that, is the network reset to the initial state after each trial?" No, certainly not. We actually want to see how it learns; we want to see it improving. So we don't reset it until we're done. At the end of it, we go back to the random weights we started with, or whatever the weights were at the time we ran this. So what we're seeing here is actual learning happening as we increase the learning rate at the same time. "Why would an ideal learning rate found with a single pass at the start of training keep being a good learning rate even after several epochs and further loss reductions?" Great question. It absolutely wouldn't, so let's look at that too, shall we? "Oh, can I ask one more?" Of course. This is an important point, so ask as many as you can; this is very important. "For the learning rate finder, why use the steepest point and not the minimum?" We certainly don't want the minimum, because the minimum is the point at which it's not learning anymore. This flat section at the bottom here means that on those mini-batches it didn't get better.
We want the steepest point because that's where the loss was improving the fastest, and that's what we want: we want the weights to be moving as fast as possible. As a rule of thumb, though, we do find that the minimum divided by 10 works pretty well; that's Sylvain's favorite approach, and he's generally pretty spot on with it. So that's why we actually print out those two things: lr_min is actually the minimum divided by 10, and lr_steep is the suggested steepest point. Great, good questions all. So let's remind ourselves what transfer learning does. Remember what our neural network is: it's a bunch of linear models, basically, with activation functions between them, and our activation functions are generally ReLUs, rectified linear units. If any of this is fuzzy, have a look at the 04 notebook again to remind yourself. Each of those linear layers has a bunch of parameters, so the whole neural network has a bunch of parameters. After we train a neural network on something like ImageNet, we have a whole bunch of parameters that aren't random anymore; they're actually useful for something. We've also seen that the early layers seem to learn fairly general ideas, like gradients and edges, and the later layers learn more sophisticated ideas, like what eyes look like, or what fur looks like, or what text looks like. So with transfer learning, we take a model, or in other words a set of parameters, which has already been trained on something like ImageNet. We throw away the very last layer, because the very last layer is the bit that specifically says which one of those (in the case of ImageNet) 1,000 categories this image is in. So we throw that away, and we replace it with random weights, sometimes with more than one layer of random weights. And then we train that. Now, yes? Oh, I just wanted to make a comment.
"And that's that I think the learning rate finder, after you learn about it, the idea almost seems so simple or approximate that it's like: wait, this shouldn't work. Shouldn't you have to do something more complicated or more precise? I just want to highlight that it's a very surprising result that such a simple, approximate method would be so helpful." Yeah, I would say it's particularly surprising to people who are not practitioners, or who have not been practitioners for long. I've noticed that a lot of my students at USF have a tendency to jump straight to doing something very complex, where they account for every possible imperfection from the start, and it's very rare that that's necessary. So one of the cool things about this is that it's a good example of trying the easiest thing first and seeing how well it works. "And this was a very big innovation when it came out. I think it's easy to take for granted now, but this was super, super helpful when it was new." It was super helpful, and it was also nearly entirely ignored. None of the research community cared about it, and it wasn't until fastai, I think in our first course, talked about it that people started noticing. For quite a few years, and in fact it's still a bit the case, super fancy researchers didn't know about the learning rate finder and got beaten by first-lesson fastai students on practical problems, because those students could pick learning rates better, and could do it without a cluster of thousands of computers. OK, so transfer learning. We've got our pre-trained network. It's really important that every time you hear the words "pre-trained network", you're thinking: a bunch of parameters which have particular numeric values, and go with a particular architecture, like ResNet34. We've thrown away the final layer and replaced it with random numbers.
And so now we want to fine-tune this set of parameters for a new set of images, in this case pets. fine_tune is the method we call to do that, and to see what it does, we can type learn.fine_tune?? and look at the source code. Here is the signature of the function. The first thing that happens is that we call freeze. freeze is the method which makes it so that only the last layer's weights will get stepped by the optimizer: the gradients are calculated just for those last layers of parameters, and the step is done just for those last layers of parameters. Then we call fit, and we fit for some number of epochs, which by default is 1; we don't change that very often. What that fit is doing is just fitting those randomly added weights, which makes sense, right? They're the ones that are going to need the most work, because at the time we add them they're doing nothing at all; they're just random. So that's why we spend one epoch trying to make them better. After you've done that, you now have a model which is much better than we started with. It's not random anymore: all the layers except the last are the same as the pretrained network, and the last layer has been tuned for this new data set. And the closer you get to the right answer, as you can see in this picture, the smaller the steps you want to take, generally speaking. So the next thing we do is divide our learning rate by 2, and then we unfreeze, which means we make it so that all the parameters can now be stepped, and all of them will have gradients calculated. Then we fit for some more epochs, and that number is something we have to pass to the method. So that's now going to train the whole network. If we want to, we can do all of this by hand. And actually, cnn_learner will, by default, freeze the model for us, freeze the parameters for us, so we actually don't have to call freeze.
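The freeze-fit-unfreeze sequence above can be sketched in plain PyTorch. This is an illustration of the idea, not fastai's actual source: the body/head split and the halving of the learning rate are the assumptions being demonstrated.

```python
import torch
from torch import nn

# A stand-in for a pretrained body plus a freshly added random head.
body = nn.Sequential(nn.Linear(10, 10), nn.ReLU())
head = nn.Linear(10, 2)
model = nn.Sequential(body, head)

# Step 1 of fine_tune ("freeze" + fit): only the head gets gradients/steps.
for p in body.parameters():
    p.requires_grad_(False)
trainable = [p for p in model.parameters() if p.requires_grad]
print(len(trainable))   # just the head's weight and bias → 2
# ... fit for one epoch, stepping only `trainable` ...

# Step 2: halve the learning rate, "unfreeze", and train every layer.
base_lr = 3e-3
for p in model.parameters():
    p.requires_grad_(True)
opt = torch.optim.SGD(model.parameters(), lr=base_lr / 2)
print(sum(1 for p in model.parameters() if p.requires_grad))  # → 4
```

The key point is that freezing is just a matter of which parameters receive gradients and optimizer steps.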
So if we just create our learner and then fit for a while, this is three epochs of training just the last layer. Then we can manually unfreeze it ourselves. And now at this point, as the question earlier suggested, maybe this is not the right learning rate anymore, so we can run lr_find again. This time you don't see the same shape. You don't see the rapid drop, because it's much harder to train a model that's already pretty good; instead, you just see a very gentle little gradient. Generally here, what we do is try to find the bit where it starts to get worse again, which is about here, and go back about a multiple of 10 less than that. So about 1e-5, I would guess, and yeah, that's what we picked. So then, after unfreezing and finding our new learning rate, we can train a bunch more. And here we are, getting down to 5.9% error, which is OK, but we can do better. The reason we can do better is that at this point, we're training the whole model at a 1e-5, so 10 to the minus 5, learning rate, which doesn't really make sense, because we know that the last layer is still not that great. It's only had three epochs of training from random, so it probably needs more work. We know that the second-to-last layer was probably pretty specialized to ImageNet and less specialized to pet breeds, so that probably needs a lot of work too. Whereas the early layers, the gradients and edges, probably don't need to be changed much at all. What we'd really like is to have a small learning rate for the early layers and a bigger learning rate for the later layers. And this is something that we developed at fastai, and we call it discriminative learning rates.
And Jason Yosinski is a guy who wrote a great paper that some of these ideas are based on: he showed that different layers of the network really want to be trained at different rates, although he didn't go as far as trying that out and seeing how it goes; it was more of a theoretical thing. So in fastai, if we want to do that, rather than just passing a single number as our learning rate, we can pass a slice. Now, slice is a built-in feature of Python; it's just an object which can hold a few different numbers. In this case, we're passing it two numbers, and the way fastai reads those is: the very first layer will have this learning rate, 10 to the minus 6; the very last layer will have 10 to the minus 4; and the layers between the two will be equal multiples apart, so they'll have equally spaced learning rates, on a log scale, from the start to the end. So here you can see us basically doing our own version of fine_tune: we create the learner, we fit with that automatically frozen version, we unfreeze, we fit some more. And when we do that, you can see this works a lot better; we're getting down to 5.3, 5.1, 5.4 error. So that's pretty great. One thing you'll notice here is that we did overshoot a bit; it looks like epoch number 8 was actually better. Well, actually, let me first explain something about fit_one_cycle. fit_one_cycle is a bit different to just fit. What fit_one_cycle does is start at a low learning rate, increase it gradually for the first one-third or so of the batches until it gets to a high learning rate (this is why the parameter is called lr_max; it's the highest learning rate we get to), and then, for the remaining two-thirds or so of the batches, gradually decrease the learning rate.
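The "equal multiples" spacing described above is just geometric spacing on a log scale. A minimal sketch, assuming three layer groups between the two ends of the slice (the function name is made up for illustration; fastai applies these per parameter group internally):

```python
import numpy as np

def discriminative_lrs(lr_min, lr_max, n_groups):
    # Equally spaced on a log scale, from the earliest layer group to the
    # last — the idea behind passing slice(lr_min, lr_max) in fastai.
    return np.geomspace(lr_min, lr_max, n_groups)

lrs = discriminative_lrs(1e-6, 1e-4, 3)
print(lrs)   # three LRs, each 10x the previous: 1e-6, 1e-5, 1e-4
```

So the first group trains 100 times slower than the last one here, matching the intuition that early layers need the least change.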
And the reason for that is largely empirical: researchers have found that it works the best. In fact, this was developed, again, by Leslie Smith, the same guy that did the learning rate finder. Again, it was a huge step; it really dramatically accelerated the speed at which we can train neural networks, and also made them much more accurate. And again, the academic community basically ignored it. In fact, the key publication that developed this idea didn't even pass peer review. The reason I mention this now is that we don't really want to just go back and pick the model that was trained back here at epoch 8, because with the one-cycle schedule, that model hadn't yet had the benefit of the low learning rates at the end of training; we really want a model that finished on a low learning rate. So what I would generally do here is change this 12 to an 8, because this is looking good, and then retrain it from scratch. Normally you'd find a better result. You can plot the loss and see how the training and validation loss moved along, and you can see here that the error rate was starting to get worse here. What you'll often see is that the validation loss gets worse a bit before the error rate gets worse. We're not really seeing it so much in this case, but the error rate and the validation loss aren't always in lockstep. So what we're plotting here is the loss, but you actually mainly want to look at what's happening with the error rate, because that's the thing we care about. Remember, the loss is just an approximation of what we care about that happens to have a gradient that works out nicely. So how do you make it better now? We're already down to just 5.4% error, or, if we'd stopped a bit earlier, maybe 5.1% or less. On 37 categories, that's pretty remarkable; that's a very, very good pet breed predictor.
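The warm-up-then-anneal shape of fit_one_cycle can be sketched with a couple of cosines. This is a rough approximation for intuition only: the function name, the warm-up fraction, and the divisor are assumptions, and fastai's actual schedule (and its momentum schedule) differs in detail.

```python
import math

def one_cycle_lr(pct, lr_max, div=25.0, pct_warmup=0.3):
    """Sketch of a 1cycle schedule: cosine warm-up from lr_max/div up to
    lr_max over the first pct_warmup of training, then cosine annealing
    back down toward zero for the rest."""
    lr_start = lr_max / div
    if pct < pct_warmup:
        p = pct / pct_warmup
        return lr_start + (lr_max - lr_start) * (1 - math.cos(math.pi * p)) / 2
    p = (pct - pct_warmup) / (1 - pct_warmup)
    return lr_max * (1 + math.cos(math.pi * p)) / 2

# Low at the start, peaks at the end of warm-up, low again at the end.
schedule = [one_cycle_lr(t / 100, 3e-3) for t in range(101)]
print(max(schedule) == one_cycle_lr(0.3, 3e-3))  # → True
```

This also makes the earlier point concrete: a model saved at epoch 8 of a 12-epoch cycle is still in the high-LR middle of the schedule, which is why retraining with 8 total epochs works better than stopping early.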
If you want to do something even better, you could try creating a deeper architecture. A deeper architecture is literally just putting more pairs of activation function (also known as a nonlinearity) followed by these little linear models onto the end. And the number of these sets of layers you have is the number you'll see at the end of an architecture name: ResNet18, ResNet34, ResNet50, and so forth. Having said that, you can't really pick ResNet19 or ResNet38. I mean, you can make one, but nobody's created a pre-trained version of it for you, so you won't be able to do any fine-tuning. You can theoretically create any number of layers you like, but in practice, most of the time you'll want to pick a model that has a pre-trained version, so you kind of have to select from the sizes people have pre-trained. There's nothing special about these sizes; they're just ones that people happen to have picked. For the bigger models, there are more parameters and more gradients that are going to be stored on your GPU, and you will, unfortunately, get used to seeing this error: CUDA out of memory. CUDA is the language and system used for your GPU, so that's not out of memory in your RAM; that's out of memory on your GPU. If that happens, unfortunately, you actually have to restart your notebook: Kernel, Restart, and try again. That's a really annoying thing, but such is life. One thing you can do if you get an out-of-memory error is, after your cnn_learner call, add this magic incantation: to_fp16. What that does is use, for most of the operations, numbers that take half as many bits as usual, so they're less accurate. This is half-precision floating point, or FP16, and it will use less memory.
And on pretty much any NVIDIA card created in 2020 or later, and some more expensive cards created even in 2019, that's often going to result in a two to three times speed-up in how long training takes as well. So if I add in to_fp16, I will often see much faster training. In this case, what I actually did is I switched to a ResNet50, which would normally take about twice as long, and my per-epoch time has gone from 25 seconds to 26 seconds. So the fact that we used a much bigger network and it was no slower is thanks to FP16. But you'll see our error rate hasn't improved; it's pretty similar to what it was. So it's important to realize that just increasing the number of layers doesn't always make things better; it tends to require a bit of experimentation to find what's going to work for you. And of course, don't forget: the trick is to use small models for as long as possible to do all of your cleaning up and testing and so forth, and wait until you're all done to try some bigger models, because they're going to take a lot longer. "A question, Jeremy: how do you know, or suspect, when you can 'do better'?" You have to always assume you can do better, because you never know. Part of it, though, is: do you need to do better? Or do you already have a good enough result to handle the actual task you're trying to do? Often people spend too much time fiddling around with their models rather than actually trying to see whether the model is already going to be super helpful. So the sooner you can actually try to use your model to do something practical, the better. But how much can you improve it? Who knows. Go through the techniques that we're teaching in this course and see which ones help. Unless it's a problem that somebody has already tried before and written down their results in a paper or a Kaggle competition or something, there's no way to know how good it can be.
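The memory saving behind to_fp16 is easy to see directly in PyTorch. Note this snippet only illustrates the halved storage; fastai's to_fp16 actually sets up mixed-precision training, which keeps some computations in FP32 for numerical stability.

```python
import torch

t32 = torch.randn(1024, 1024)   # single precision: 4 bytes per element
t16 = t32.half()                # half precision (FP16): 2 bytes per element

print(t32.element_size(), t16.element_size())  # → 4 2
print(t16.dtype)                               # → torch.float16
```

Halving the bytes per activation and gradient is what lets a ResNet50 fit, and often run two to three times faster on GPUs with FP16 hardware support.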
So don't forget, after you do the questionnaire, to check out the further research section. One of the things we've asked you to do here is to read a paper: find the learning rate finder paper, read it, and see if you can connect what you read up to the things that we've learned in this lesson. See if you can maybe even implement your own learning rate finder, as manually as you need to; see if you can get something to work yourself based on reading the paper. You can even look at the source code of fastai's learning rate finder, of course. And then: can you make this classifier better? That's further research, so maybe you can start doing some reading to see what else you could do. Have a look on the forums to see what people are trying, have a look on the book website or the course website to see what other people have achieved and what they did, and play around. We've got some tools in our toolbox now for you to experiment with. So that is pet breeds. This is a pretty tricky computer vision classification problem, and we've now seen most of the pieces of what goes into training it. We haven't seen how to build the actual architecture, but other than that, we've worked our way up to understanding what's going on. So let's build from there into another kind of data set: one that involves multi-label classification. Maybe let's look at an example. Here is a multi-label data set where you can see that it's not just one label on each image; sometimes there are three: bicycle, car, person. I don't actually see the car here; I guess it's been cropped out. So a multi-label data set is one where you've still got one image per row, but you can have zero, one, two, or more labels per row. So we're going to have a think about how we handle that. But first of all, let's take another question.
"Does dropping floating point precision, switching from FP32 to FP16, have an impact on the final result?" Yes, it does. Often it makes it better, believe it or not. Dropping some of that precision is effectively doing a little bit of rounding off, and that creates a bit more bumpiness, a bit more uncertainty, a bit more stochastic behavior. When you introduce slightly more random stuff into training, it very often makes things a bit better. So, yeah, FP16 training often gives a slightly better result. I wouldn't say it's generally a big deal either way, and it's not always better. "Would you say this is a bit of a pattern in deep learning, that less exact and more stochastic approaches often work better?" For sure, and not just in deep learning, but machine learning more generally. There's been some interesting research looking at, for example, matrix factorization techniques: if you want them to run super fast across lots of machines, there's a lot of approximation involved, and when you then use the results, you often find you actually get better outcomes. "Just a brief plug for the fast.ai computational linear algebra course, which talks a little bit about randomized methods." Does it really? Well, that sounds like a fascinating course. And look at that, it's the number one hit here on Google, so it's easy to find. By somebody called Rachel Thomas; hey, that person's got the same name as you! All right. So how are we going to do multi-label classification? Let's look at a data set called PASCAL, which is a pretty famous data set. We'll look at the version that goes back to 2007; it's been around for a long time. It comes with a CSV file, which we will read in; CSV is comma-separated values. Let's take a look. Each row has a file name, one or more labels, and something telling you whether it's in the validation set or not. The list of categories in each image is a space-delimited string. So this row doesn't have a single label called "horse person"; it has a horse and a person.
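Splitting those space-delimited label strings into lists is a one-liner in pandas. Here is a sketch on a toy in-memory CSV (the file names and labels are invented, mimicking the PASCAL CSV's shape):

```python
import pandas as pd
from io import StringIO

# A toy version of the PASCAL-style CSV: space-delimited labels per image.
csv = StringIO(
    "fname,labels,is_valid\n"
    "000005.jpg,chair,True\n"
    "000007.jpg,car,False\n"
    "000009.jpg,horse person,True\n"
)
df = pd.read_csv(csv)

# Split each space-delimited string into a list of individual labels.
df["label_list"] = df["labels"].str.split(" ")
print(df["label_list"][2])  # → ['horse', 'person']
```

So the last row really does carry two labels, a horse and a person, not one label called "horse person".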
pd here stands for Pandas. Pandas is a really important library for any kind of data processing, and you'll use it all the time in machine learning and deep learning, so let's have a quick chat about it. Pandas is the name of a library, and it creates things called DataFrames; that's what the df here stands for. A DataFrame is a table containing rows and columns. Pandas can also do some slightly more sophisticated things than that, but we'll treat it that way for now. So you can read in a DataFrame by saying pd, for Pandas, dot read_csv, with a file name, and you've now got a DataFrame. You can call head to see the first few rows of it, for instance. The DataFrame has an iloc (integer location) property, which you can index into as if it were an array; in fact, it looks just like NumPy. So colon means every row (remember, it's row comma column), and zero means the zeroth column, and so here is the first column of the DataFrame. You can do the exact opposite: the zeroth row and every column gives us the first row, and you can see the row has column headers and values, so it's a little bit different to NumPy. And remember, if there's a comma colon, or a bunch of comma colons, at the end of an indexing expression in NumPy or PyTorch or Pandas, you can get rid of it; these two are exactly the same. You could do the same thing here by grabbing the column by name: the first column is fname, and that gets you the first column. You can also create new columns. So here's a tiny little DataFrame I've created from a dictionary, and I can create a new column by, for example, adding two columns together, and you can see there it is. So it's a lot like NumPy or PyTorch, except you have this idea of rows and named columns; it's all about tabular data. I find its API pretty unintuitive, and a lot of people do, but it's fast and powerful, so it takes a while to get familiar with it, but it's worth taking that while.
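Those indexing rules can be tried out on a tiny DataFrame like the one described. This is a minimal sketch with made-up column names and values:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

first_col = df.iloc[:, 0]      # every row, column 0
first_row = df.iloc[0, :]      # row 0, every column
same_row = df.iloc[0]          # the trailing ", :" can be dropped
print(first_row.equals(same_row))  # → True

# Grab a column by name, or build a new column from existing ones.
df["c"] = df["a"] + df["b"]
print(df["c"].tolist())  # → [5, 7, 9]
```

Note how the row comes back as a Series with the column names attached, which is the "column headers and values" behavior mentioned above.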
And the creator of Pandas wrote a fantastic book called Python for Data Analysis; I've read both editions and found it fantastic. It doesn't just cover Pandas; it covers other stuff as well, like IPython, NumPy, and Matplotlib, so I highly recommend it. So this is our table, and what we want to do now is construct data loaders that we can train with. We've talked about the data block API as being a great way to create data loaders, so let's use this as an opportunity to create a DataBlock, and then DataLoaders, for this data, and let's try to do it right from square one, so we can see exactly what's going on with DataBlock. First of all, let's remind ourselves what a dataset and a data loader are. A dataset is an abstract idea of a class: a dataset is anything which you can index into, like so, and which you can take the length of, like so. So, for example, take the list of the lowercase letters, each along with a number saying which lowercase letter it is: I can index into it to get (0, 'a'), and I can get the length of it to get 26, and therefore it qualifies as a dataset. In particular, for datasets, you would normally expect that when you index into one, you get back a tuple, because you've got the independent and dependent variables; not necessarily always just two things, there could be more or fewer, but two is the most common. So once we have a dataset, we can pass it to a DataLoader. We can request a particular batch size, and we can shuffle or not, and so there's our data loader built from a. We can grab the first value from that iterator, and here is the shuffled result: (7, 'h'), (4, 'e'), (20, 'u'), and so forth. And remember, a mini-batch has a mini-batch of the independent variable and a mini-batch of the dependent variable. If you want to see how the two correspond to each other, you can use zip.
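The letters example can be reproduced directly with PyTorch's DataLoader. A minimal sketch of the idea that "anything indexable with a length is a dataset":

```python
import string
from torch.utils.data import DataLoader

# Anything you can index into and take the length of acts as a dataset.
a = list(enumerate(string.ascii_lowercase))
print(a[0], len(a))  # → (0, 'a') 26

# A DataLoader batches (and optionally shuffles) such a dataset.
dl = DataLoader(a, batch_size=4, shuffle=True)
xb, yb = next(iter(dl))  # a tensor of 4 indices, and a batch of 4 letters
print(xb.shape, len(yb))
```

The default collation stacks the integer halves of each tuple into a tensor and gathers the string halves into a batch alongside it, which is exactly the "mini-batch of independent variable, mini-batch of dependent variable" structure described above.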
So if I call zip passing in this list and then this list, so b[0] and b[1], you can see what zip does in Python: it grabs one element from each of those in turn and gives you back the tuples of the corresponding elements. Since we're just passing in all of the elements of b to this function, Python has a convenient shortcut for that, which is to just say star b. Star means: insert into this parameter list each element of b, just like we did here. So these are the same thing. This is a very handy idiom that we use a lot in Python, zip star something; it's kind of a way of transposing something from one orientation to another. All right, so we've got a dataset, and we've got a DataLoader. Then what about Datasets? A Datasets is an object which has a training dataset and a validation dataset. So let's look at one. Now, normally you don't start with an enumeration like this, with an independent variable and a dependent variable already paired up. Normally you start with something like a file name, and then you calculate or compute or transform your file name into an image, by opening it, and into a label, by, for example, looking at the file name and grabbing something out of it. So we could do something similar here; this is what Datasets does. We could start with just the lowercase letters. This is still a dataset, right? Because we can index into it and we can get the length of it, although it's not giving us tuples yet. So if we now pass that list to the Datasets class and index into it, we get back a tuple, and it's actually a tuple with just one item. This is how Python shows a tuple with one item: it puts it in parentheses with a comma and then nothing. OK. So in practice, what we'd really want to do is say: OK, take this, and do something to compute an independent variable, and do something to compute a dependent variable.
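The zip and star-b idiom can be seen with a tiny hand-built mini-batch (the values are the ones read out above):

```python
# One mini-batch: a tuple of (independent batch, dependent batch)
b = ((7, 4, 20), ('h', 'e', 'u'))

# zip grabs one element from each argument in turn
pairs = list(zip(b[0], b[1]))  # -> [(7, 'h'), (4, 'e'), (20, 'u')]

# star b passes each element of b as a separate argument: the same thing
assert list(zip(*b)) == pairs

# Applying the zip(*x) idiom again transposes back to the original
assert tuple(zip(*pairs)) == b
```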
So here's a function we could use to compute an independent variable; it just sticks an 'a' on the end. And the dependent variable might just be the same thing with a 'b' on the end. So here are two functions. Now we can call Datasets, passing in a, and we can also pass in a list of transformations to do. In this case I've just got f1, which is the function that adds an 'a' on the end. So now if I index into it, I don't get 'a' anymore; I get 'aa'. If you pass multiple functions, then it's going to do multiple things: here I've got f1 then f2, giving 'aab'. That's this one, then that's this one. And you'll see this is a list of lists, and the reason for that is that you can also pass something like this: a list containing f1, and a list containing f2. This will take each element of a, pass it through the first list of functions (there's just one of them) to give you 'aa', and then start again and separately pass it through the second list of functions (again just one) to get 'ab'. This is actually the main way we build up independent variables and dependent variables in fastai: we start with something like a file name and pass it through two lists of functions. One of them will generally open up the image, for example, and the other one will parse the file name, for example, giving you an independent variable and a dependent variable. You can then create a DataLoaders object from Datasets by passing in the datasets and a batch size, and here you can see I've got the shuffled 'oa' and so on. So this is worth studying to make sure you understand what Datasets and DataLoaders are. We don't often have to create them from scratch; we can create a DataBlock to do it for us. But now we can see what the DataBlock has to do; let's see how it does it. We can start by creating an empty DataBlock. An empty DataBlock is going to take our DataFrame, so we're going to go back to looking at our DataFrame, which remember was this guy.
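What Datasets does with those lists of functions can be sketched in plain Python; this TinyDatasets class is my own illustration of the idea, not fastai's implementation:

```python
# A minimal sketch of applying one pipeline of functions per tuple element,
# the way fastai's Datasets does (illustration only, not fastai code).
class TinyDatasets:
    def __init__(self, items, tfms_lists):
        self.items, self.tfms_lists = items, tfms_lists
    def __len__(self):
        return len(self.items)
    def __getitem__(self, i):
        out = []
        for tfms in self.tfms_lists:   # one list of functions per element
            v = self.items[i]
            for f in tfms:             # apply each function in turn
                v = f(v)
            out.append(v)
        return tuple(out)

def f1(x): return x + 'a'
def f2(x): return x + 'b'

# Two pipelines: independent variable via f1, dependent variable via f2
dss = TinyDatasets(['a', 'b', 'c'], [[f1], [f2]])
dss[0]  # -> ('aa', 'ab')

# One pipeline with both functions chained
dss2 = TinyDatasets(['a'], [[f1, f2]])
dss2[0]  # -> ('aab',)
```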
And so if we pass in our DataFrame, we'll find that this DataBlock has created datasets, a training and a validation dataset, for us. If we look at the training set, it'll give us back an independent variable and a dependent variable, and we'll see that they are both the same thing: this is the first row of the table, except it's actually shuffled, so it's a random row of the table, repeated twice. The reason for that is that by default the DataBlock assumes that we have two things, the independent variable and the dependent variable (or the input and the target), and by default it just keeps exactly whatever you gave it for each. To create the training set and the validation set, by default it just randomly splits the data with a 20% validation set. So that's what's happened here, and this is not much use. What we actually want to do, if we look at x for example, is grab the fname field, the file name, because we want to open this image; that's going to be our independent variable. And then for the label we're going to want this labels field here, 'person cat'. So we can pass get_x and get_y as parameters: functions that return the bit of data that we want. You can create and use a function in the same line of code in Python by saying lambda. So lambda r means create a function: it doesn't have a name, it's going to take a parameter called r, and we don't even have to say return; it's going to return the fname column in this case. And get_y is a function that takes an r and returns the labels column. So now we can do the same thing: call dblock.datasets, grab a row from the training set, and you can see, look, here it is: there is the image file name, and there is the space-delimited list of labels. So here's exactly the same thing again, but done with functions defined separately. The one line of code above has become three lines of code, but it does exactly the same thing.
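The lambda accessors can be tried on a plain dict standing in for a DataFrame row (the file name and labels here are made up for illustration):

```python
# Anonymous accessor functions, as described above: one for the
# independent variable, one for the dependent variable.
get_x = lambda r: r['fname']
get_y = lambda r: r['labels']

# A hypothetical row, standing in for one row of the lesson's table
row = {'fname': '000005.jpg', 'labels': 'chair person'}

get_x(row)  # -> '000005.jpg'
get_y(row)  # -> 'chair person'
```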
We don't get back the same result, though. Wait, why don't we get the same result? I know why: because the random split is done differently each time, so it's picking a different validation set. That's why we don't get the same result. One thing to note: be careful of lambdas. If you want to save this DataBlock for use later, you won't be able to, because Python doesn't like saving things that contain lambdas. So most of the time in the book and the course we avoid lambdas for that reason, because it's often very convenient to be able to save things. The word used here is serialization, which basically just means saving something. Now, this is not enough to open an image, because we don't have the path. So rather than just using this function to grab the fname column, we should actually use pathlib to build path/'train'/the fname column. And then for the y, again, the labels column is not quite enough: we actually have to split on spaces. But this is Python; we can use any function we like. And so then we use the same three lines, and the code is here, and now we've got a path and a list of labels. So that's looking good. Now we want this path to be opened as an image. The data block API lets you pass a blocks argument, where you tell it, for each of the things in your tuple (there are two of them), what kind of block you need. We need an ImageBlock to open an image, and then, whereas in the past we've used a CategoryBlock for categorical variables, this time we don't have a single category, we've got multiple categories, so we have to use a MultiCategoryBlock. Once we do that and have a look, we now have a 500 by 375 image as our independent variable, and as our dependent variable we have a long list of zeros and ones.
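The serializable named-function version might look like this; the column names match the lesson's table, but the base path and row values here are placeholders:

```python
from pathlib import Path

# Named functions instead of lambdas, so the DataBlock can be saved.
# The base path is a stand-in; in the lesson it points at the dataset.
path = Path('.')

def get_x(r): return path/'train'/r['fname']
def get_y(r): return r['labels'].split(' ')

# A hypothetical row from the table
row = {'fname': '000005.jpg', 'labels': 'chair person'}

get_x(row)  # -> Path('train/000005.jpg')
get_y(row)  # -> ['chair', 'person']
```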
The long list of zeros and ones is the labels as a one-hot-encoded vector, a rank-1 tensor. Specifically, there will be a zero in every location in the vocab where that kind of object is not in this image, and a one in every location where it is. For this one there's just a person, so this must be the location in the vocab where 'person' is. Do you have any questions? So, one-hot encoding is a very important concept, and we didn't have to use it before: we could just have a single integer saying which one thing it is. But when we've got lots of potential labels, it's convenient to use this one-hot encoding. And it's actually what's going to happen with the actual matrices anyway: when we compare the activations of our neural network to the target, it's going to be comparing each one of these. OK, so the categories, as I mentioned, are based on the vocab. We can grab the vocab from our datasets object, and then we can say: let's look at the first row, look at the dependent variable, and look for where the dependent variable is 1. Then we can pass those indexes to the vocab and get back a list of what was actually there. And again, each time I run this I'm going to get different results, because I called .datasets again here, so it's going to give me a different train/validation split, and this time it turns out that this is actually a chair. And we have a question: shouldn't the tensor be of integers? Why is it a tensor of floats?
Yeah, conceptually this is a tensor of integers; they can only be 0 or 1. But we're going to be using a cross-entropy-style loss function, so we're going to need to do floating point calculations on them, and it's faster to just store them as floats in the first place rather than converting backwards and forwards. Even though they're conceptually ints, we're not going to be doing int-style calculations with them. Good question. I mentioned that by default the DataBlock uses a random split. You might have noticed in the DataFrame, though, that there's a column saying which rows go in the validation set, and if the dataset you're given tells you what validation set to use, you should generally use it, because that way you can compare your validation set results to somebody else's. So you can pass a splitter argument, which again is a function; here we pass it a function also called splitter. The function returns the indexes where is_valid is false (that's going to be the training set) and the indexes where it is true (that's going to be the validation set); the splitter argument is expected to return two lists of integers. If we do that, we get the same thing again, but now we're using the correct training and validation sets. Another question: any particular reason we don't use 8-bit floating point? Is it just that the precision is too low?
Yeah, trying to train with 8-bit precision is super difficult; it's so flat and bumpy that it's pretty difficult to get decent gradients. But it's an area of research. The main thing people do with 8-bit or even 1-bit data types is take a model that's already been trained with 16-bit or 32-bit floating point and then round it off (it's called discretizing) to create a purely integer or even binary network, which can do inference much faster. Figuring out how to train with such low-precision data types is an area of active research. I suspect it's possible, and I suspect people have fiddled around with it and had some success. It could turn out to be super interesting, particularly for stuff that's being done on low-powered devices that might not even have a floating point unit. Right, so the last thing we need to do is add our item transform, RandomResizedCrop. We've talked about that enough, so I won't go into it, but basically it means we ensure that everything has the same shape, so that we can collate it into a mini-batch. Then, rather than calling .datasets, we call .dataloaders and display our data. And remember, if something goes wrong, as we saw last week, you can call summary to find out exactly what's happening in your DataBlock. This section is really worth studying, because data blocks are super handy, and if you haven't used fastai v2 before they won't be familiar to you, because no other library uses them. This has really shown you how we can go right back to the start and gradually build them up; hopefully that'll make a whole lot of sense now. We're going to need a loss function again, and to get there let's start by just creating a learner: let's create a ResNet-18 from the DataLoaders object that we just created, grab one batch of data, and put it into a mini-batch of independent variables. And learn.model is the thing that actually contains the model itself, in this case a
CNN. You can treat it as a function, and so we can just pass something to it: if we pass a mini-batch of the independent variable to learn.model, it will return the activations from the final layer, and that has shape 64 by 20. Any time you get a tensor back, look at its shape: predict what the shape should be, and then make sure that you're right. If you're not, either you guessed wrong, so try to understand where you made a mistake, or there's a problem with your code. This shape, 64 by 20, makes sense, because we have a mini-batch size of 64, and for each of those we're going to make predictions about the probability of each of the 20 possible categories. And we have a question. Two questions? Two questions, all right. Is the data block API compatible with out-of-core datasets like Dask? Yeah, the data block API can do pretty much anything you want it to. If we go back to the start: you can create an empty one, and then you can pass it anything that is indexable, and that can be anything you like. Pretty much anything can be made indexable in Python, and something like Dask is certainly indexable, so that works perfectly fine. If it's not indexable, like a network stream or something like that, then you can use the DataLoaders and Datasets APIs directly, which we'll learn about either in this course or the next one. But anything that you can index into, and that certainly includes Dask, you can use with data blocks. Next question: where do you put images for multi-label classification with that CSV table? Should they be in the same directory? They can be anywhere you like. In this case we used a pathlib object, like so. And by default, let me think about this: what's happening here is the path is, oh, it's saying dot. The reason for that is that BASE_PATH is currently set to path, so it displays things relative to that. OK, so the path we set is here, and so then when we said get_x, it's saying path/'train'/
whatever the file name is. So this is an absolute path, and here is the exact path. You can put them anywhere you like; you just have to say what the path is. And then, if you don't want to get confused by having a big long prefix you don't want to see all the time, just set BASE_PATH to the path you want everything to be relative to, and then it will print things out in this more convenient manner. Right, so it's really important that you can do this: that you can create a learner, grab a batch of data, and pass it to the model. This line here is just plain PyTorch, no fastai. You can see the shape, and you can recognize why it has this shape. And so now, if you have a look, here are the 20 activations. Now, this is not a trained model; it's a pre-trained model with a random set of final-layer weights, so these specific numbers don't mean anything, but it's worth remembering that this is what activations look like. Most importantly, they're not between 0 and 1, and if you remember from the MNIST notebook, we know how to scale things to be between 0 and 1: we can pop them into the sigmoid function. The sigmoid function is something that scales everything to be between 0 and 1, so let's use that. You'll also hopefully remember from the MNIST notebook that the MNIST loss function first did sigmoid, then torch.where, then .mean. So we're going to use exactly the same thing as the MNIST loss function, and we're just going to do one extra thing, which is add .log, for the same reason we talked about when we were looking at softmax: we talked about why log is a good idea as a transformation. As we saw in the MNIST notebook, we didn't strictly need it, but we're going to train faster and more accurately if we use it, because it's going to be better behaved, as we've seen. This particular function, which is identical to MNIST loss plus .log, has a specific name: it's called binary cross entropy. And we used it for the threes versus sevens problem to decide
whether a column is a three or not. But because we can use broadcasting and element-wise arithmetic in PyTorch, when we pass this function a whole matrix it's going to be applied to every column: it'll basically do a torch.where on every column separately, and every item separately. So that's great. It basically means that this binary cross entropy function is going to be just like MNIST loss, but rather than just asking "is this the number three?", it'll ask "is this a dog? is this a cat? is this a car? is this a person? is this a bicycle?" and so forth. This is where PyTorch is so cool: we can write one thing and then have it expand to higher-dimensional tensors without doing any extra work. We don't have to write this ourselves, of course, because PyTorch has one, and it's called F.binary_cross_entropy, so we can just use PyTorch's. As we've talked about, there's always an equivalent module version, so this is exactly the same thing as the module nn.BCELoss. Those ones don't include the initial sigmoid; if you want to include the initial sigmoid, you need F.binary_cross_entropy_with_logits, or the equivalent nn.BCEWithLogitsLoss. BCE is binary cross entropy. So those are two functions, plus two equivalent classes, for multi-label or binary problems. The equivalents for single-label problems, like MNIST and pets, are nll_loss and cross_entropy: those are the single-label counterparts of binary cross entropy and binary cross entropy with logits. These are pretty awful names, I think we can all agree, but it is what it is. In our case, we have a one-hot-encoded target and we want the version with the sigmoid in, so the equivalent built-in is nn.BCEWithLogitsLoss. We can make that our loss function, compare the activations to our targets, and get back a loss, and that's what we can use to train. And then finally, before we take our break, we also need a metric. Previously we've been using as
a metric accuracy, or actually error rate; error rate is one minus accuracy. Accuracy only works for single-label datasets like MNIST and pets, because what it does is take the input, which is the final-layer activations, and do argmax. What argmax does is say: what is the index of the largest number in those activations? So for example, for MNIST, maybe the highest probability is for 7; argmax would return 7, and then it says, OK, those are my predictions, then asks whether the prediction is equal to the target or not, and takes the floating point mean. That's what accuracy is. So argmax only makes sense when there's a single maximum thing you're looking for. In this case we've got multi-label, so instead we have to compare each activation to some threshold; by default it's 0.5. We basically say: if the sigmoid of the activation is greater than 0.5, let's assume that means that category is there, and if it's not, let's assume it's not there. This gives us a list of trues and falses for the categories that, based on the activations, it thinks are there, and we can compare that to the target and again take the floating point mean. So we can use the default threshold of 0.5, but we don't necessarily want to; we might want a different threshold. And remember, when we create a learner, we have to pass to the metrics argument a function. So what if we want to use a threshold other than 0.5? Well, we'd like to create a special function which is accuracy_multi with a different threshold, and the way we do that is with a special built-in in Python called partial. Let me show you how partial works. Here's a function called say_hello; it says hello to somebody with some greeting. So say_hello('Jeremy'), where the default greeting is 'Hello', says 'Hello Jeremy', and say_hello('Jeremy', 'Ahoy!') gives 'Ahoy! Jeremy'. Let's create a special version of this function that will be more suitable for Sylvain: it's going to
use French. So we can say partial: create a new function that's based on the say_hello function, but always setting say_what to 'Bonjour', and we'll call that f. Now f('Jeremy') is 'Bonjour Jeremy' and f('Sylvain') is 'Bonjour Sylvain'. So you see, we've created a new function from an existing function by fixing one of its parameters. We can do the same thing for accuracy_multi, say with a threshold of 0.2, and we can pass that to metrics. So let's create a cnn_learner, and you'll notice here we don't actually pass a loss function. That's because fastai is smart enough to realize, hey, you're doing a classification model with a multi-label dependent variable, so I know what loss function you probably want, and it picks it for us. We can call fine_tune, and here we have an accuracy of 94.5% after the first few epochs, and eventually 95.1%. That's pretty good: an accuracy of over 95%. Was 0.2 a good threshold to pick? Who knows. Let's try 0.1: well, that's a worse accuracy. So maybe we should try a higher threshold? Also not good. So what's the best threshold? Well, what we can do is call get_preds to get all of the predictions and all of the targets, and then calculate the accuracy at some threshold. Then we can grab lots of numbers between 0.05 and 0.95 and, with a list comprehension, calculate the accuracy for all of those different thresholds and plot them. It looks like we want a threshold somewhere a bit above 0.5, so we can just use that, and it's going to give us 96% and a bit, a better accuracy. This is something that a lot of theoreticians would be uncomfortable about: I've used the validation set to pick a hyperparameter, the threshold. People might say, oh, you're overfitting, using the validation set to pick a hyperparameter. But if you think about it, this is a very smooth curve; it's not some bumpy thing where we've accidentally randomly grabbed some unexpectedly good value. When
you're picking a single number from a smooth curve, the theory of "don't use a validation set for hyperparameter tuning" doesn't really apply. So it's always good to be practical: don't treat these things as rules, but as rules of thumb. OK, so let's take a break for five minutes, and we'll see you back here in five minutes' time. Hey, welcome back. I want to show you something really cool: image regression. We are not going to learn how to use a fastai image regression application, because we don't need one. Now that we know how to build stuff up with loss functions and the data block API ourselves, we can invent our own applications. So there is no image regression application per se, but we can do image regression really easily. What do we mean by image regression? Well, remember back to, I think it was lesson one, we talked about the two basic types of supervised machine learning: regression and classification. Classification is when our dependent variable is a discrete category, from a set of categories, and regression is when our dependent variable is a continuous number, like an age or an x,y coordinate or something like that. So image regression means our independent variable is an image and our dependent variable is one or more continuous values. And here's what that can look like, using the Biwi head pose dataset. It has a number of things in it, but one of the things we can do is find the midpoint of a person's face. The Biwi head pose dataset comes from the paper Random Forests for Real Time 3D Face Analysis, so thank you to those authors. We can grab it in the usual way, untar the data, and have a look at what's in there, and we can see there are 24 directories, numbered from 01 to 24, and each one also has a corresponding .obj file. We're not going to be using the .obj files, just the directories. So let's look at one of the directories, and as you can see, there's a thousand things in the first directory. So each
one of these 24 directories is one different person that they photographed, and you can see for each person there's frame 3 pose, frame 3 RGB, frame 4 pose, frame 4 RGB, and so forth. So in each case we've got the image, which is the RGB file, and we've got the pose, which is the pose.txt file. As we've seen, we can use get_image_files to get a list of all of the image files recursively in a path, and once we have an image file name like this one, we can turn it into a pose file name by removing the last few letters (the 'rgb.jpg' part) and adding 'pose.txt' on the end. Here is a function that does that, so you can see I can pass an image file to img2pose and get back a pose file. PILImage.create is the fastai way to create a PIL image, and it has a shape. In computer vision, shapes are normally backwards: they normally go columns by rows, and that's why it's this way around, whereas PyTorch and NumPy tensors and arrays are rows by columns. That's confusing, but that's just how things are, I'm afraid. So here's an example of an image. When you look at the readme on the dataset website, they tell you how to get the center point from one of the text files; it's just this function. It doesn't matter exactly what it does; it is what it is. We call it get_center, and it returns the x,y coordinates of the center of the person's face. So we can pass this as get_y, because get_y, remember, is the thing that gives us back the label. OK, so here's the thing: we can create a DataBlock, and we can pass in as the independent variable's block an ImageBlock as usual, and then as the dependent variable's block we can say PointBlock, which is a tensor with two values in it. By combining these two things, this says we want to do image regression, with a dependent variable with two continuous values. To get the items, we call get_image_files; to get the y, we call the get_center function. For the split, this is important: we should make sure that the validation set contains one or more people that don't appear
in the training set. So I'm just going to grab person number 13 (I grabbed it randomly) and use all of those images as the validation set. Because I think they did this with an Xbox Kinect, you know, the video thing, there are a lot of images that look almost identical, so if you randomly assigned them, you would be massively overestimating how effective you are. You want to make sure that you're actually doing a good job with a new set of people, not just a new set of frames; that's why we split this way. FuncSplitter is a splitter that takes a function, and in this case we're using lambda to create the function. We will use data augmentation, and we will also normalize. This is actually done automatically now, but in this case we're doing it manually: it's going to subtract the mean and divide by the standard deviation of the original dataset that the pre-trained model used, which is ImageNet. So that's our DataBlock, and we can call dataloaders to get our DataLoaders, passing in the path, and show_batch, and we can see that looks good: here are our faces and the points. And particularly as a student, don't just look at the pictures; look at the actual data. So grab a batch, put it into an xb and a yb (x batch and y batch), and have a look at the shapes, and make sure they make sense. The ys is 64 by 1 by 2: there are 64 rows in the mini-batch, and then the coordinates are a 1 by 2 tensor. So there's a single point with two things in it; you could have several points, like nose and ears and mouth, but in this case we're just using one point, and the point is represented by two values, the x and the y. And then why is the x 64 by 3 by 240 by 320? Well, there are 240 rows by 320 columns: those are the pixels, the size of the images that we're using. The mini-batch has 64 items. And what's the 3? The 3 is the number of channels, which in this case means the number of colors. If we open up some random grizzly bear image and
then go through each of the elements of the first axis and do a show_image, you can see that it's got the red, the green, and the blue as the 3 channels. So that's how we store a 3-channel image: it's stored as a 3 by number-of-rows by number-of-columns rank-3 tensor, and a mini-batch of those is a rank-4 tensor; that's why this is that shape. So here's a row from the dependent variable, and there's that x,y location we talked about. We can now go ahead and create a learner, passing in our DataLoaders as usual, passing in a pre-trained architecture as usual, and, if you think back, you may remember that in lesson 1 we learned about y_range. y_range is where we tell fastai what range of data we expect to see in the dependent variable, and we generally want to use this when we're doing regression. The range of our coordinates is between -1 and 1; that's how fastai and PyTorch treat coordinates: the left-hand side and the top are -1, and the bottom and the right are 1. So there's no point predicting something that's smaller than -1 or bigger than 1, because that's outside the area we use for our coordinates. We have a question. Sure, just a moment. So how does y_range work? Well, it actually uses this function called sigmoid_range, which takes the sigmoid of x, multiplies by high minus low, and adds low. Here is what it looks like for -1 to 1: it's just a sigmoid where the bottom is the low and the top is the high, and that way all of our activations are going to be mapped to the range from -1 to 1. Yes, Rachel? Can you provide images with an arbitrary number of channels as inputs, specifically more than 3 channels? Yeah, you can have as many channels as you like. We've seen images with fewer than 3, because they've been grayscale; more than 3 is common as well. You could have an infrared band, for example; satellite images often have multi-spectral bands, and there are some kinds of medical images where there are bands that are outside the visible range.
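The sigmoid_range function described a moment ago can be sketched for a single number (fastai's real version operates elementwise on tensors, but the formula is the same):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Sketch of sigmoid_range: squash any activation into the range (lo, hi)
# by taking the sigmoid, multiplying by (hi - lo), and adding lo.
def sigmoid_range(x, lo, hi):
    return sigmoid(x) * (hi - lo) + lo

sigmoid_range(0.0, -1, 1)  # -> 0.0, the midpoint of the range
```

For coordinates, lo = -1 and hi = 1, so a very negative activation maps to near -1 and a very positive one to near 1.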
Your pre-trained model will generally have 3 channels. fastai does some tricks to use 3-channel pre-trained models for non-3-channel data, but that's the only tricky bit; other than that, it's just an axis that happens to have 4 things or 2 things or 1 thing instead of 3 things. There's nothing special about it. OK, we didn't specify a loss function here, so we get whatever fastai gave us, which is MSE loss. MSE loss is mean squared error, and that makes perfect sense: you would expect mean squared error to be a reasonable thing to use for regression. We're just testing how far we are from the target, squaring it, and taking the mean. We didn't specify any metrics, and that's because mean squared error is already a good metric: it has nice gradients, it behaves well, and it's also the thing we care about tracking. So let's go ahead and use lr_find, and we can pick a learning rate, maybe about 10 to the minus 2. We can call fine_tune, and we get a validation loss of 0.0001. That's the mean squared error, so we should take the square root: on average we're about 0.01 off, in a coordinate space that goes from -1 to 1. So that sounds super accurate. We can always, and always should, check what our results look like, and as you can see, fastai has automatically figured out how to display the combination of an image independent variable and a point dependent variable: on the left is the target and on the right is the prediction, and it is pretty close to perfect. One of the really interesting things here is that we used fine_tune even though, think about it, the thing we're fine-tuning, an ImageNet classifier, isn't even an image regression model. We're actually fine-tuning an image classification model into something totally different, an image regression model. Why does that work so well? Well, an ImageNet classification model must have learned a lot about how images look, what things look like and where the pieces of them are, to know
how to figure out what breed of animal something is, even if it's partly obscured, or in the shade, or turned at different angles. These pre-trained image models are incredibly powerful. So built into every ImageNet pre-trained model is all this capability that it had to learn for itself, and asking it to use that capability to figure out where something is, is just actually not that hard for it. And so that's why we can actually fine-tune an ImageNet classification model to create something completely different, which is an image point regression model. I find that incredibly cool. So again, look at the further research after you've done the questionnaire, and particularly, if you haven't used data frames before, please play with them, because we're going to be using them more and more. I'll just do the last one: also go back and look at the bear classifier from notebook 2, or whatever other classifier you hopefully created for your own data. Because remember, we talked about how it would be better if the bear classifier could also recognize that there's no bear at all, or maybe there's both a grizzly bear and a black bear, or a grizzly bear and a teddy bear. So if you retrain it using multi-label classification, see what happens: see how well it works when there's no bears, and see whether it changes the accuracy of the single-label predictions when you turn it into a multi-label problem. So have a fiddle around and tell us on the forum what you find. And we've got a question, Rachel? "Is there a tutorial showing how to use pre-trained models on four channel images? Also, how can you add a channel to a normal image?" So the last bit — how do you add a channel to an image — I don't know what that means. An image is what it is; you can't add a channel to an image. I don't know if there's a tutorial, but we can certainly make sure somebody on the forum has learned how to do it. It's super straightforward, it should be
pretty much automatic. We're going to talk about collaborative filtering. What is collaborative filtering? Well, think about Netflix or whatever: you might have watched a lot of movies that are sci-fi, and have a lot of action, and were made in the 70s. Netflix might not know anything about the properties of the movies you watched — it might just know that they're movies with titles and IDs. But what it could absolutely do, without any manual work, is find other people that watched the same movies that you watched, and it could see what other movies those people watched that you haven't — and you would probably find they're also science fiction, and full of action, and made in the 70s. So we can use an approach where we recommend things even if we don't know anything about what those things are, as long as we know who else has used or liked many of the same things that you've liked or used. This doesn't necessarily mean users and products; in fact, in collaborative filtering problems we normally say items, and items could be links you click on, diagnoses for a patient, and so forth. So there's a key idea here, which is that in the underlying items — and we're going to be using movies in this example — there are some features. They may not be labeled, but there's some underlying concept of features of those movies, like the fact that there's an action concept, and a sci-fi concept, and a 1970s concept. Now, you may never have actually told Netflix you like these kinds of movies, and maybe Netflix never actually added columns to their movies saying which movies are of those types. But as long as, in the real world, there's this concept of sci-fi and action and movie age, and those concepts are relevant for at least some people's movie-watching decisions — as long as this is true — then we can actually uncover these. They're called latent factors: these things that kind of decide what kind of movies you want to watch. And they're latent
because nobody necessarily ever wrote them down, or labeled them, or communicated them in any way. So let me show you what this looks like. There's a great data set we can use called MovieLens, which contains tens of millions of movie ratings. A movie rating looks like this: it has a user number, a movie number, a rating, and a timestamp. So we don't know anything about who user number 196 is — I don't know if that is Rachel or Dongbae or somebody else — and I don't know what movie number 242 is; I don't know if that's Casablanca or Lord of the Rings or The Mask. And the rating is a number between, I think it was, 1 and 5. A question? Sure. "In traditional machine learning we perform cross-validation and k-fold training to check the variance and bias trade-off. Is this common in training deep learning models, and how?" So cross-validation is a technique where you don't just split your data set into one training set and one validation set, but you basically do it five or so times — like five training sets and five validation sets, representing different overlapping subsets. This used to be done a lot: people often used to have not enough data to get a good result, and this way, rather than having 20% that you would leave out each time, you could just leave out like 10% each time. Nowadays it's less common that we have so little data that we need to worry about the complexity and extra time of lots of models. It's done on Kaggle a lot, because on Kaggle every little fraction of a percent matters, but it's not a deep learning thing or a machine learning thing — it's just a "lots of data or not very much data" thing, and a "do you care about the last decimal place or not" thing. It's not something we're going to talk about, certainly in this part of the course, if ever, because it's not something that comes up in practice that often as being important. There are two more questions. "What would be some good applications of collaborative filtering outside of
recommender systems?" Well, I mean, it depends how you define recommender system. If you're trying to figure out what kind of other diagnoses might be applicable to a patient, I guess that's kind of a recommender system; or if you're trying to figure out where somebody is going to click next, that's kind of a recommender system. But really, conceptually, it's anything where you're trying to learn from past behavior, where that behavior is kind of like "a thing happened to an entity". "What is an approach to training using video streams, i.e. from drone footage, instead of images — to break up the footage into image frames?" In practice, quite often you would, because videos just tend to be pretty big. Theoretically, time could be a fourth channel — or a fifth, I guess, for a full-color movie: you can absolutely have that; it would be a rank-5 tensor, being batch by time by color by row by column. But often that's too computationally and too memory intensive, so sometimes people just look at one frame at a time; sometimes people use a few frames around the key frame, like three or five frames at a time; and sometimes people use something called a recurrent neural network, which we'll be seeing in the next week or two, treating it as sequence data. Yeah, there's all kinds of tricks you can do to try and work with that. Conceptually, though, there's no reason you can't just add an additional axis to your tensors and it will work — it's just a practical issue around time and memory. And someone else noted that it's pretty fitting that you mentioned the movie The Mask. Yes, it was not an accident, because I've got masks on the brain. I'm not sure if we're allowed to like that movie anymore, though — I kind of liked it when it came out; I don't know what I think nowadays, it's been a while. Okay, so let's take a look. We can untar_data ml-100k. So ml-100k is a small subset of the full set — there's another one
that we can grab which is about the whole lot, 25 million, but 100k is good enough for messing around. If you look at the readme you'll find the main table — the main table is in a file called u.data. So let's open it up with read_csv. Again, this one is actually not comma-separated values, it's tab-separated, so let's open it up with read_csv and just say the delimiter is a tab — backslash t is tab. There's no row at the top saying what the columns are called, so we say header is None, and then pass in a list of what the columns are called. .head() will give us the first five rows, and we mentioned just before what it looks like. It's not a particularly friendly way to look at it, so we're going to cross-tab it. What I've done here is I've grabbed the top — I can't remember how many it was, well I can, 15 or 20 — movies, based on the most popular movies, and the top bunch of users who watched the most movies. So I've basically reoriented this so that for each user I have all the movies they've watched and the rating they gave them. Empty spots represent users that have not seen that movie. So this is just another way of looking at this same data. Basically what we want to do is guess what movies we should tell people they might want to watch, and so it's basically filling in these gaps — to tell user 212, do we think they might like movie 49 or 99 best to watch next. So let's assume that we actually had columns for every movie that represented, say, how much sci-fi they are, how much action they are, and how old they are, and maybe they're between minus 1 and 1. So, like, The Last Skywalker is very sci-fi, fairly action, and definitely not old. And then we could do the same thing for users: we could say user 1 really likes sci-fi, quite likes action, and really doesn't like old. And so now if you multiply those together — and remember, in PyTorch and NumPy you have element-wise calculations, so this is going to multiply each corresponding item; it's not matrix multiplication, don't go there — this is element-wise
multiplication (if we wanted matrix multiplication it would be an at sign). So if we multiply each element together with the equivalent element in the other one and then sum them up, that's going to give us a number which will basically tell us how much these two correspond — because remember, two negatives multiplied together give a positive. So user 1 likes exactly the kind of stuff that The Last Skywalker has in it, and so we get 2.1. Multiplying things together element-wise and adding them up is called the dot product, and we use it a lot — it's the basis of matrix multiplication. So make sure you know what a dot product is: it's this. Now, Casablanca is not at all sci-fi, not much action, and is certainly old. So if we do user 1 times Casablanca, we get a negative number, so we might think, okay, user 1 won't like this movie. The problem is, we don't know what the latent factors are, and even if we did, we don't know how to label a particular user or a particular movie with them. So we have to learn them. How do we learn them? Well, with a spreadsheet — I've got a spreadsheet version. Basically, what I did was I popped this table into Excel, and then I randomly created a 15 by 5 table here (these are just random numbers), and I randomly created a 5 by 15 table here. And I basically said, okay, let's just assume that every movie and every user has 5 latent factors — I don't know what they are — and let's then do a matrix multiply of this set of factors by this set of factors. A matrix multiply of a row by a column is identical to a dot product of two vectors, which is why I can just use matrix multiply. So this is just what this first cell contains, and then I copied it to the whole thing. So all these numbers here are being calculated from the row latent factors dot product with — or matrix multiplied with — the column
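That multiply-and-sum can be written out directly; the factor values below are the made-up illustrative ones from the lesson's table ([sci-fi, action, oldness]), not learned numbers:

```python
import torch

# Hypothetical latent factors: [sci-fi, action, oldness], each in (-1, 1)
last_skywalker = torch.tensor([0.98, 0.9, -0.9])
casablanca     = torch.tensor([-0.99, -0.3, 0.8])
user1          = torch.tensor([0.9, 0.8, -0.6])

print((user1 * last_skywalker).sum())  # about 2.14: a good match
print((user1 * casablanca).sum())      # negative: a poor match
print(user1 @ last_skywalker)          # the same dot product via the @ operator
```

The third line shows the equivalence mentioned above: for two vectors, the at sign computes exactly this element-wise multiply followed by a sum.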
latent factors. So in other words, I'm doing exactly this calculation, but I'm doing it with random numbers. And so that gives us a whole bunch of values, and then what I could do is calculate a loss by comparing every one of these numbers here to every one of these numbers here, and then I could do mean squared error, and then I could use stochastic gradient descent to find the best set of numbers in each of these two locations. And that is what collaborative filtering is — that's actually all we need. So rather than doing it in Excel — check out the Excel version later if you're interested, because we can actually do this whole thing in Excel and it works — let's jump in and do it in PyTorch. Now, one thing that might just make this more fun is actually to know what the movies are, and MovieLens tells us in u.item what the movies are called — and that uses the delimiter of the pipe sign, weirdly enough. So here are the names of each movie. One of the nice things about pandas is it can do joins, just like SQL, so you can use the merge method to combine the ratings table and the movies table, and since they both have a column called movie, by default it will join on those. And so now here we have the ratings table with actual movie names. That's going to be a bit more fun — we don't need it for modeling, but it's just going to be better for looking at stuff. We could use the DataBlock API at this point, or we can just use the built-in application factory method; since it's there, we may as well use it. So we can create a collaborative filtering DataLoaders object from a data frame by passing in the ratings table. By default the user column is called user, and ours is, so that's fine; by default the item column is called item, and ours is not — it's called title — so let's pick title, and choose a batch size. And so if we now say show_batch, here is some of that data — and the rating is called rating by default, so that worked fine too. So here's some data. We need to now create our — let's assume
we're going to use five factors — so the number of users is however many classes there are for user, and the number of movies is however many classes there are for title. We don't just have a vocab now, right — we've actually got a list of classes for each categorical variable, for each set of discrete choices. So we've got a whole bunch of users, 944, and a whole bunch of titles, 1635. So for our randomized latent factor parameters, we're going to need to create those matrices, and we can just create them with random numbers — normally distributed random numbers, that's what randn gives us. That will be n_users (944) by n_factors (5) — exactly the same as the spreadsheet, except that was just 15. And let's do exactly the same thing for movies: random numbers, number of movies by 5. Okay, so to calculate the result for some movie and some user, we have to look up the index of the movie in our movie latent factors, and the index of the user in our user latent factors, and then do a dot product. In other words, for this particular combination, we would have to look up that numbered user over here and that numbered movie over here to get the two appropriate sets of latent factors. But this is a problem, because looking up in an index is not a linear model — remember, our deep learning models really only know how to multiply matrices together and do simple element-wise nonlinearities like ReLU; there isn't a thing called "look up in an index". Okay, I'll just finish this bit. Here's a cool thing, though: the look-up in an index actually can be represented as a matrix product, believe it or not. If you replace our indices with one-hot-encoded vectors, then a one-hot-encoded vector times something is identical to looking up in an index. Let me show you. If we call the one_hot function, that creates — as it says here — a one-hot encoding, and we one-hot encode the value 3 with n_users classes — and n_users, as we've
discussed, is 944 — then if we go one_hot(3, n_users), we get this big tensor where at index 3 (0, 1, 2, 3) we have a 1, and the size of that is 944. So if we then multiply that by user_factors — user_factors, remember, is that random matrix of this size — what's going to happen? We're going to go 0 times the first row, and so that's going to be all zeros; and then 0 again, and 0 again; and then we're going to finally go 1 times the index-3 row, so it's going to return exactly that row; and then we'll go back to 0 again. So if we do that — remember, the at sign is matrix multiply — and compare that to user_factors[3]: same thing. Isn't that crazy? So it's a kind of weird, inefficient way to do it, but matrix multiplication is a way to index into an array, and this is the thing that we know how to do SGD with and we know how to build models with. So it turns out that anything that we can do with indexing into an array, we now have a way to optimize. And we have a question — there are two questions. One: "How different in practice is collaborative filtering with sparse data compared to dense data?" We are not doing sparse data in this course, but there's an excellent fast.ai course, I hear, called Computational Linear Algebra for Coders, which has a lot of information about sparse data. Second question: "In practice, do we tune the number of latent factors?" Absolutely we do — it's just like the number of filters we have in pretty much any kind of deep learning model. All right, so now we know that the procedure of finding the right set of latent factors — looking something up in an index — is the same as matrix multiplication with a one-hot vector (I already had it over here), so we can go ahead and build a model with that. Basically, if we do this for a few indices at once, then we have a matrix of one-hot-encoded vectors, so the whole thing is just one big matrix multiplication. Now, the thing is, as
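Here's that equivalence spelled out in plain PyTorch (using torch.nn.functional.one_hot rather than fastai's one_hot helper, which wraps the same idea):

```python
import torch
import torch.nn.functional as F

n_users, n_factors = 944, 5
user_factors = torch.randn(n_users, n_factors)  # random latent factors

# One-hot encode index 3, then matrix-multiply: every zero entry kills its
# row, and the single 1 at position 3 passes row 3 through unchanged.
one_hot_3 = F.one_hot(torch.tensor(3), num_classes=n_users).float()
print(torch.equal(one_hot_3 @ user_factors, user_factors[3]))  # True
```

So the matrix multiply and the array index return the same row, which is exactly why the embedding trick in the next paragraph works.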
I said, this is a pretty inefficient way to do an index lookup, so there is a computational shortcut, which is called an embedding. An embedding is a layer that has the computational speed of an array lookup and the same gradients as a matrix multiplication. How does it do that? Well, internally it just uses an index lookup to actually grab the values, and it also knows what the gradient of a matrix multiplication by a one-hot-encoded vector or matrix is, without having to go to all this trouble. So an embedding is a matrix multiplication with a one-hot-encoded vector where you never actually have to create the one-hot-encoded vector — you just need the indexes. This is important to remember, because a lot of people have heard about embeddings and they think there's something special and magical about them, and there absolutely isn't: you can do exactly the same thing by creating a one-hot-encoded matrix and doing a matrix multiply. It is just a computational shortcut, nothing else. I often find, when I talk to people about this in person, I have to tell them this six or seven times before they believe me, because they think embeddings are something more clever — and they're not; it's just a computational shortcut to do a matrix multiplication with a one-hot-encoded matrix more quickly, by instead doing an array lookup. Okay, so let's try and create a collaborative filtering model in PyTorch. A model, or an architecture — really, an nn.Module — is a class. If you use PyTorch to its fullest, you need to understand object-oriented programming, because we have to create classes. There are a lot of tutorials about this, so I won't go into detail, but I'll give you a quick overview. A class could be something like Dog, or ResNet, or Circle: it's something that has some data attached to it and some functionality attached to it. Here's a class called Example — the data it has attached to it is a, and the functionality attached to it is say. And so we can, for example, create an instance of this class, an
object of this type Example. We pass in Sylvain, so Sylvain will now be in ex.a, and we can then say ex.say, passing in "nice to meet you" — that will be x — and so it will say hello, self.a (that's Sylvain), nice to meet you. Here it is. So in Python, the way you create a class is to say class and its name, and then to say what is passed to it when you create that object — that's a special method called dunder init. As we've briefly mentioned before, in Python there are all kinds of special method names that have special behavior: they start with two underscores, they end with two underscores, and we pronounce that "dunder" — dunder init. All regular instance methods in Python always get passed the actual object itself first — we normally call that self — and then, optionally, anything else. We can then change the contents of the current object by just setting self.whatever to whatever we like, so after this, self.a is now equal to Sylvain. When we call a method, same thing: it gets passed self, and optionally anything you pass to it, and then you can access the contents of self which you stashed away back here when we initialized it. So that's how the basics of object-oriented programming work in Python. There's something else you can do when you create a new class, which is you can pop something in parentheses after its name, and that means we're going to use something called inheritance. What inheritance means is: I want to have all the functionality of this class, plus I want to add some additional functionality. Module is a PyTorch class that fastai has customized — it's kind of a fastai version of a PyTorch class, and probably in the next course we'll see exactly how it works — but it acts almost exactly like a regular Python class: we have an init, and we can set attributes to whatever we like. And it uses an embedding — an embedding is just this class that does what I just described: it's the same as a linear layer with a
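Written out, the Example class described above looks like this (the exact greeting string is a paraphrase of the one on screen):

```python
class Example:
    def __init__(self, a):
        self.a = a                      # stash the data on the instance

    def say(self, x):
        return f'Hello {self.a}, {x}'   # use the stashed data later

ex = Example('Sylvain')
print(ex.say('nice to meet you'))  # Hello Sylvain, nice to meet you
```

Note that self appears first in both method signatures but is never passed explicitly — Python supplies the object itself automatically.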
one-hot-encoded matrix, but it does it with this computational shortcut. So you can say how many — in this case — users there are, and how many factors they will have. Now, there is one very special thing about things that inherit from Module, which is that when you call them, they call the method named forward. forward is a special PyTorch method name — the most important PyTorch method name — and this is where you put the actual computation. So to grab the factors from an embedding, we just call it like a function. This is going to get passed the user IDs and the movie IDs as two columns, so let's grab the index-0 column and grab the embeddings by passing it to user_factors, and then we'll do the same thing for the index-1 column — that's the movie IDs — passing them to movie_factors. And then here is our element-wise multiplication, and then a sum. Now remember, we've got another dimension: the first axis is the mini-batch dimension, so we want to sum over the other dimension, the index-1 dimension. That's going to give us a dot product for each — sorry — for each rating, for each user–movie combination. So this is the DotProduct class. You can see, if we look at one batch of our data, it's of shape 64 by 2, because there are 64 items in the mini-batch, and each one has — this is the independent variable — the user ID and the movie ID. ("Do deep neural network based models for collaborative filtering work better than more traditional approaches like SVD or other matrix factorization methods?" Let's wait until we get there.) So here's x — here is one user ID, movie ID combination — and then, for each one of those 64, here are the ratings. So now we've created a dot product module from scratch, and we can instantiate it, passing in the number of users and the number of movies, and let's use 50 factors. And now we can create a Learner. This time we're not creating a cnn_learner or a specific application learner — it's just a totally generic Learner, so this is a Learner that doesn't really
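A plain-PyTorch sketch of that DotProduct model (the lesson uses fastai's Module and Embedding, which let you skip the super().__init__() call; with nn.Module it's required):

```python
import torch
from torch import nn

class DotProduct(nn.Module):
    def __init__(self, n_users, n_movies, n_factors):
        super().__init__()
        self.user_factors = nn.Embedding(n_users, n_factors)
        self.movie_factors = nn.Embedding(n_movies, n_factors)

    def forward(self, x):
        users = self.user_factors(x[:, 0])    # column 0: user IDs
        movies = self.movie_factors(x[:, 1])  # column 1: movie IDs
        return (users * movies).sum(dim=1)    # sum over the factor axis

model = DotProduct(944, 1635, 50)
# A fake mini-batch of 64 (user ID, movie ID) pairs, shape 64 x 2:
xb = torch.stack([torch.randint(0, 944, (64,)),
                  torch.randint(0, 1635, (64,))], dim=1)
print(model(xb).shape)  # torch.Size([64])
```

Summing over dim=1 (the factor axis) rather than dim=0 (the mini-batch axis) is exactly the point made above: we want one dot product per user–movie pair, not one number for the whole batch.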
know how to do anything clever: it just stores away the data you give it and the model you give it. And since we're not using an application-specific learner, it doesn't know what loss function to use, so we'll tell it to use MSE, and fit. And that's it — we've just fitted our own collaborative filtering model, where we literally created the entire architecture (a pretty simple one) from scratch. So that's pretty amazing. Now, the results aren't great — if you look at the MovieLens data set benchmarks online, you'll see this is not actually a great result. So one of the things we should do is take advantage of the tip we mentioned earlier in this lesson, which is: when you're doing regression — which we are here, the number between 1 and 5 is like a continuous value we're trying to get as close to as possible — we should tell fastai what the range is. So we can use y_range as before. Here's exactly the same thing: we've got a y_range, we've stored it away, and then at the end we use, as we discussed, sigmoid_range, passing in — look here — *self.y_range, which by default is (0, 5.5). And so we can see — ah, not really any better. It was worth a try; normally this is a little bit better, but it always depends on when you run it. I'll just run it a second time while it's working. Now, there is something else we can do, though. If we look back at our little Excel version, the thing is, when we multiply these latent factors by these latent factors and add them up, it's not really taking account of the fact that this user may just rate movies really badly in general, regardless of what kind of movies they are, and this movie might just be a great movie in general — everybody likes it, regardless of what kind of stuff they like. And so it would be nice to be able to represent this directly, and we can do that using something we've already learned about, which is bias. We could have another single number for each movie which we just
add, and another single number for each user which we just add — we've already seen this for linear models, this idea that it's nice to be able to add a bias value. So let's do that. We're going to need another embedding for each user, which is of size one — it's just a single number we're going to add. In other words, it's just an array lookup, but remember, to do an array lookup that we can take a gradient of, we have to say Embedding. We do the same thing for movie bias, and then all of this is identical to before, and we just add this one extra line, which adds the user and movie bias values. So let's train that and see how it goes. Well, that was a shame — it got worse: we used to have it finish here at 0.87, and it's 0.88, 0.89, so it's a little bit worse. Why is that? Well, if you look earlier on, it was quite a bit better — it was 0.86 — so it's overfitting very quickly. So what we need to do is find a way to train more epochs without overfitting. Now, we've already learned about data augmentation, like rotating images and changing their brightness and color and stuff, but it's not obvious how we would do data augmentation for collaborative filtering. So how are we going to make it so that we can train lots of epochs without overfitting? To do that, we're going to have to use something called regularization. Regularization is a set of techniques which basically allow us to use models with lots of parameters and train them for a long period of time, but penalize them effectively for overfitting — or in some way cause them to try to stop overfitting. So that is what we will look at next week. Okay, well, thanks everybody. There's a lot to take in there, so please remember to practice, to experiment, to listen to the lessons again, because for the next couple of lessons things are going to really quickly build on top of all the stuff that we've learned. So please be as comfortable with it as you can; feel free to go back and re-listen and follow through the
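Putting the bias terms and the y_range squashing together, a sketch of the improved model might look like this (again plain PyTorch standing in for fastai's Module and Embedding, with the sigmoid-range expression inlined):

```python
import torch
from torch import nn

class DotProductBias(nn.Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        super().__init__()
        self.user_factors = nn.Embedding(n_users, n_factors)
        self.user_bias = nn.Embedding(n_users, 1)     # one number per user
        self.movie_factors = nn.Embedding(n_movies, n_factors)
        self.movie_bias = nn.Embedding(n_movies, 1)   # one number per movie
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors(x[:, 0])
        movies = self.movie_factors(x[:, 1])
        res = (users * movies).sum(dim=1, keepdim=True)
        # the one extra line: add the per-user and per-movie bias values
        res = res + self.user_bias(x[:, 0]) + self.movie_bias(x[:, 1])
        low, high = self.y_range
        return torch.sigmoid(res) * (high - low) + low  # sigmoid_range

model = DotProductBias(944, 1635, 50)
xb = torch.stack([torch.randint(0, 944, (64,)),
                  torch.randint(0, 1635, (64,))], dim=1)
preds = model(xb)
print(preds.shape)  # torch.Size([64, 1])
```

Because of the sigmoid squashing, every prediction is guaranteed to land inside the (0, 5.5) range, whatever the raw dot product and bias values come out to.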
notebooks and then try to re-create as much of them yourself thanks everybody and I will see you next week or see you in the next lesson whenever you watch it