Welcome, everybody, to lesson five. We have officially peaked, and everything is downhill from here, as of halfway through the last lesson. We started with computer vision because it's the most mature, out-of-the-box, ready-to-use deep learning application. It's something where, if you're not using deep learning, you won't be getting good results. So hopefully the difference between not doing lesson one and doing lesson one is that you've gained a new capability you didn't have before, and you get to see a lot of the tradecraft of training an effective neural net.

Then we moved into NLP, because text is another one which you really can't do well without deep learning, generally speaking, and it's just got to the point where it works pretty well now. In fact, the New York Times just featured an article yesterday about the latest advances in deep learning for text, and talked quite a lot about the work that we've done in that area, along with OpenAI and Google and the Allen Institute for Artificial Intelligence.

Then we finished our application journey with tabular and collaborative filtering, partly because tabular and collaborative filtering are things that you can still do pretty well without deep learning, so it's not such a big step; it's not a whole new thing that you couldn't do before. And also because we're going to try to get to a point where we understand pretty much every line of code in the implementations of these things, and the implementations of those things are much less intricate than vision and NLP. So as we come down this other side of the journey, which is "all the stuff we've just done, how does it actually work?", by starting where we just ended, with collaborative filtering and then tabular data, we're going to be able to see what all those lines of code do by the end of today's lesson. That's our goal. So
particularly in this lesson, you should not expect to come away knowing how to do applications you couldn't do before. Instead, you should have a better understanding of how we've actually been solving the applications we've seen so far. In particular, we're going to understand a lot more about regularization, which is how we go about managing over- versus underfitting. So hopefully you can use some of the tools from this lesson to go back to your previous projects and get a little more performance, or handle models where previously maybe you felt like your data was not enough, or maybe you're underfitting, and so forth. It's also going to lay the groundwork for understanding convolutional neural networks and recurrent neural networks, which we'll do deep dives into in the next two lessons. And as we do that, we're also going to look at some new applications: some new vision and NLP applications.

Let's start where we left off last week. Do you remember this picture? We were looking at what a deep neural net looks like, and we had various layers. The first thing we pointed out is that there are only and exactly two types of layer: there are layers that contain parameters, and there are layers that contain activations. Parameters are the things that your model learns. They're the things that you use gradient descent on, to go parameters -= learning rate * parameters.grad. That's what we do. And those parameters are used by multiplying them by input activations, doing a matrix product. So the yellow things are our weight matrices (or weight tensors, more generally, but that's close enough). We take some input activations, or some layer activations, and we multiply them by a weight matrix to get a bunch of activations. So activations are numbers, but these are numbers that are calculated. I find in our study group I keep getting questions about "where does that number come from?", and I always answer it in
the same way: you tell me, is it a parameter or is it an activation? Because it's one of those two things. That's where numbers come from. I guess inputs are kind of a special activation: they're not calculated, they're just there, so maybe that's a special case. So maybe it's an input, or a parameter, or an activation.

Activations don't only come out of matrix multiplications; they also come out of activation functions. The most important thing to remember about an activation function is that it's an element-wise function. It's a function that is applied to each element of the input activations in turn, and creates one activation for each input element. So if it starts with a 20-long vector, it creates a 20-long vector, by looking at each one of those in turn, doing one thing to it, and spitting out the answer. ReLU is the main one we've looked at, and honestly it doesn't matter too much which you pick. So we don't spend much time talking about activation functions, because if you just use ReLU you'll get a pretty good answer pretty much all the time.

Then we learnt that this combination of matrix multiplications followed by ReLUs, stacked together, has this amazing mathematical property called the universal approximation theorem: if you have big enough weight matrices, and enough of them, it can approximate any arbitrarily complex mathematical function to any arbitrarily high level of accuracy, assuming that you can train the parameters, both in terms of time and data availability and so forth. That's the bit which I find particularly more advanced computer scientists get really confused about. They're always asking, "Where's the next bit? What's the trick? How does it work?" But that's it. You just do those things, and you pass back the gradients, and you update the weights with the learning rate, and that's it.

So that piece where we take the loss function between the actual targets and the output of the final layer, so the final
activations, we calculate the gradients with respect to all of these yellow things, and then we update those yellow things by subtracting learning rate times the gradient. That process of calculating those gradients and then subtracting like that is called backpropagation. When you hear the term backpropagation, it's one of these terms that neural networking folks love to use. It sounds very impressive, but you can replace it in your head with weights -= weights.grad * learning rate (parameters, I should say, rather than weights; a bit more general). OK, so that's what we covered last week. I also mentioned last week that we were going to cover a couple more things. I'm going to come back to these ones, cross-entropy and softmax, later today.

Let's talk about fine-tuning now. What happens when we take a ResNet-34 and we do transfer learning? What's actually going on? The first thing to notice is that the ResNet-34 that we grab from ImageNet has a very specific weight matrix at the end. It's a weight matrix that has 1,000 columns. Why is that? Because the problem they ask you to solve in the ImageNet competition is: please figure out which one of these 1,000 image categories this picture is. That's why they need a thousand things here: in ImageNet, the target vector is of length a thousand. You've got to pick the probability that it's each one of those thousand things.

There are a couple of reasons this weight matrix is no good to you when you're doing transfer learning. The first is that you probably don't have a thousand categories. I was trying to do teddy bears, black bears, or brown bears, so I don't want a thousand categories. The second is that even if I did have exactly a thousand categories, they're not the same thousand categories that are in ImageNet. So basically this whole weight matrix is a waste of time for me. So what do we do? We throw it away. When you call create_cnn in fastai, it deletes that. And what does it do instead? Instead,
it puts in two new weight matrices for you, with a ReLU in between. There are some defaults as to what size the first one is, but the second one is as big as you need it to be. From the data bunch which you passed to your learner, we know how many activations you need. If you're doing classification, it's however many classes you have; if you're doing regression, it's however many numbers you're trying to predict in the regression problem. Remember that if your data bunch is called data, that'll be called data.c. So we'll add for you this weight matrix of size data.c by however much was in the previous layer.

OK, so now we need to train those, because initially these weight matrices are full of random numbers. New weight matrices are always full of random numbers, and these ones are new: we've just grabbed them and thrown them in there, so we need to train them. But the other layers are not new. The other layers are good at something. What are they good at? Well, let's remember that Zeiler and Fergus paper. Here are examples of visualizations of some filters (some weight matrices) in the first layer, and examples of some things that they found. The first layer had one part of the weight matrix that was good at finding diagonal edges in this direction. In layer two, one of the filters was good at finding corners in the top left. In layer three, one of the filters was good at finding repeating patterns, another one was good at finding round orange things, another one was good at finding furry or floral textures. So as we go up, they're becoming more sophisticated, but also more specific. Layer four, I think, was finding eyeballs, for instance. Now if you want to transfer-learn to something for histopathology slides, there are probably going to be no eyeballs in that, right? So the later layers are no good for you. But there'll certainly be some
repeating patterns, and there'll certainly be some diagonal edges. So the earlier you go in the model, the more likely it is that you want those weights to stay as they are.

Well, to start with, we definitely need to train these new weights, because they're random. So let's not bother training any of the other weights at all to start with. What we do is we basically say, let's freeze all of those other layers. What does that mean? All it means is that we're asking fastai and PyTorch that when we train, for however many epochs we do when we call fit, don't backpropagate the gradients back into those layers. In other words, when you go parameters = parameters - learning rate * gradient, only do it for the new layers; don't bother doing it for the other layers. That's what freezing means: it just means don't update those parameters. So it'll be a little bit faster as well, because there are a few fewer calculations to do. It'll take up a little bit less memory, because there are a few fewer gradients that we have to store. But most importantly, it's not going to change weights that are already better than nothing; they're better than random, at the very least. So that's what happens when you call freeze. It doesn't freeze the whole thing: it freezes everything except the randomly generated added layers that we put on for you.

So then what happens next? After a while we say, OK, this is looking pretty good, we probably should train the rest of the network now. So we unfreeze, and now we're going to train the whole thing. But we still have a pretty good sense that these new layers we added to the end probably need more training, and these ones right at the start, which might just be things like diagonal edges, probably don't need much training at all. So we split our model into a few sections, and we say, let's give different parts of the model different learning rates. So this part of the model we might give a learning rate of
1e-5, and this part of the model we might give a learning rate of 1e-3, say. What's going to happen now is that we can keep training the entire network, but because the learning rate for the early layers is smaller, it's going to move them around less, because we think they're already pretty good. Also, if a weight is already pretty close to the optimal value and you use a higher learning rate, it could kick it out; it could actually make it worse, which we really don't want to happen. This process is called using discriminative learning rates. You won't find much online about it, because I think we were kind of the first to use it for this purpose, or at least to talk about it extensively; probably other people used it without writing it down. So most of the stuff you'll find about this will be from fastai students, but it's starting to get more well known slowly now. It's a really, really important concept: for transfer learning, without using this you just can't get nearly as good results.

So how do we do discriminative learning rates in fastai? Anywhere you can put a learning rate in fastai, such as with the fit function (the first thing you put in is the number of epochs, and the second thing you put in is the learning rate; same if you use fit_one_cycle), the learning rate can be a number of things. You can put a single number, like 1e-3. You can write a slice with a single number, for example slice(1e-3). Or you can write slice with two numbers. What do each of those mean? In the first case, just using a single number means every layer gets the same learning rate, so you're not using discriminative learning rates. If you pass a single number to slice, that means the final layers get a learning rate of whatever you wrote down, 1e-3, and then all the other layers get the same learning rate, which is that divided by 3. So all of the other layers will be 1e-3 divided by 3, and the
last layers will be 1e-3. In the last case, the final layers, these randomly added layers, will still be 1e-3, the first layers will get 1e-5, and the other layers will get learning rates that are equally spread between those two: multiplicatively equal. So if there were three layers, there would be 1e-5, 1e-4, 1e-3; equal multiples each time.

One slight tweak, to make things a little bit simpler to manage: we don't actually give a different learning rate to every layer. We give a different learning rate to every layer group, which is just where we decide to put the groups together for you. Specifically, the randomly added extra layers we call one layer group (this is by default; you can modify it), and then all the rest we split in half into two layer groups. So by default, at least with a CNN, you'll get three layer groups. If you say slice(1e-5, 1e-3), you will get a 1e-5 learning rate for the first layer group, 1e-4 for the second, and 1e-3 for the third. So now if you go back and look at the way that we're training, hopefully you'll see that this makes a lot of sense. This divided-by-three thing is a little weird, and we won't talk about why that is until part two of the course. It's a specific quirk around batch normalization, so we can discuss that in the advanced topic if anybody is interested.

All right, so that is fine-tuning. Hopefully that makes it a little bit less mysterious.

So we were looking at collaborative filtering last week, and in the collaborative filtering example we called fit_one_cycle and we passed in just a single number. That makes sense, because in collaborative filtering we only have one layer. There are a few different pieces in it, but there isn't a matrix multiply followed by an activation function followed by another matrix multiply. I'm going to introduce another piece of jargon here. They're not always exactly matrix multiplications; they're something very similar.
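As a rough illustration of what "something very similar" to a matrix multiplication can mean, here is a minimal sketch in plain Python of a matrix product with an extra vector added on. The helper names matvec and affine, and the values of W, b, and x, are made up for this example and are not from the lesson's notebooks; plain lists stand in for tensors.

```python
# Illustrative only: hypothetical helper names, plain lists instead of tensors.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def affine(W, b, x):
    """Matrix product plus an added offset vector: W @ x + b."""
    return [y_i + b_i for y_i, b_i in zip(matvec(W, x), b)]

W = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # 3 outputs, 2 inputs
b = [0.5, 0.0, -1.0]                       # one offset per output
x = [2.0, 1.0]

plain = matvec(W, x)       # matrix multiplication on its own
shifted = affine(W, b, x)  # the same product with the offsets added
print(plain, shifted)
```

With b set to all zeros, affine returns exactly what matvec returns, which is one way to see how closely the two are related.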
They're linear functions that we add together, but the more general term for these things that are more general than matrix multiplications is affine functions. So if you hear me say the words affine function, you can replace it in your head with matrix multiplication. But as we'll see when we do convolutions, convolutions are matrix multiplications where some of the weights are tied, so it would be slightly more accurate to call them affine functions. I like to introduce a little bit more jargon each lesson, so that when you read books or papers, or watch other courses, or read documentation, there will be more of the words you recognize. So when you see affine function, it just means a linear function, and it means something very, very close to matrix multiplication. Matrix multiplication is the most common kind of affine function, at least in deep learning.

So specifically for collaborative filtering, the model we were using was this one. We had a bunch of numbers here, and we took the dot product of them. Given that one here is a row and one is a column, that's the same as a matrix product. MMULT in Excel multiplies matrices, so here is the matrix product of those two. I started this training last week by using Solver in Excel, and we never actually went back to see how it went, so let's have a look now. The average sum of squared error got down to 0.39. We're trying to predict something on a scale of 0.5 to 5, so on average we're being wrong by about 0.4. That's pretty good. And you can kind of see it's pretty good if you look: 3, 5, 1 is what it's meant to be, and we get 3.25, 5.1, 0.98. That's pretty close. So you get the general idea.

Then I started to talk about this idea of embedding matrices. In order to understand that, let's put this worksheet aside and look at another worksheet. What I've done here is I have copied over those two weight matrices from the previous worksheet. Here's the
one for users, and here's the one for movies. The movies one I've transposed, so it's now got exactly the same dimensions as the users one. So here are two weight matrices. Initially they were random; we can train them with gradient descent. In the original data, the user IDs and movie IDs were numbers like these. To make life more convenient, I've converted them to numbers from 1 to 15. So in these columns, for every rating, I've got user ID, movie ID, and rating, using these mapped numbers so that they're contiguous, starting at 1.

Now I'm going to replace user ID number 1 with this vector: the vector contains a 1 followed by 14 zeros. User number 2 I'm going to replace with a vector of a 0, then a 1, then 13 zeros, and so forth. Movie ID 14 (all these are movie ID 14) I've also replaced with another vector, which is 13 zeros, then a 1, then a 0. These are called one-hot encodings, by the way. This is not part of a neural net; this is just some input preprocessing, where I'm literally making this my new input. These are my new inputs for my movies, these are my new inputs for my users. These are my inputs to a neural net.

What I'm going to do now is take this input matrix and do a matrix multiply by this weight matrix. That'll work, because this has 15 rows and this has 15 columns, so I can multiply those two matrices together because they match. You can do matrix multiplication in Excel using the MMULT function. Just be careful if you're using Excel: because this is a function that returns multiple numbers, you can't just hit Enter when you finish with it, you have to hit Ctrl+Shift+Enter. Ctrl+Shift+Enter means this is an array function, something that returns multiple values. So here is the matrix product of this input matrix and this parameter matrix, or weight matrix. That's just a normal neural network layer: it's just a regular matrix multiply. We can do the same thing for movies, and so here's the
matrix multiply for movies. Well, here's the thing. This input, we claim, is the one-hot encoded version of user ID number 1, and these activations are the activations for user ID number 1. Why is that? Because if you think about it, a matrix multiplication between a one-hot encoded vector and some matrix is actually going to find the nth row of that matrix, when the 1 is in position n. Does that make sense? So what we've done here is we've got a matrix multiply that is creating these output activations, but it's doing it in a very interesting way: it's effectively finding a particular row in the weight matrix. Having done that, we can then multiply those two sets together (just a dot product), and we can then find the loss squared, and then the average loss. And lo and behold, that number, 0.39, is the same as this number, because they're doing the same thing. This one was finding this particular user's embedding vector, this one is just doing a matrix multiply, and therefore we know they are mathematically identical.

So let's lay that out again. Here's our final version. These are the same weight matrices again, exactly the same; I've copied them over. And here are those user IDs and movie IDs again, but this time I've laid them out in a normal tabular form, just like you would expect to see in the input to your model. This time I've got exactly the same set of activations here that I had here, but in this case I've calculated these activations using Excel's OFFSET function, which is an array lookup. It says, find the first row of this. So this version is identical to this version, but obviously it's much less memory intensive and much faster, because I don't actually create the one-hot encoded matrix, and I don't actually do a matrix multiply. That matrix multiply is nearly all multiplying by zero, which is a total waste of time. So in other words, multiplying by a one-hot encoded matrix is
identical to doing an array lookup. Therefore we should always do the array lookup version, and therefore we have a specific way of saying: do a matrix multiplication by a one-hot encoded matrix without ever actually creating it. I'm just instead going to pass in a bunch of ints and pretend they're one-hot encoded. And that is called an embedding. You might have heard this word embedding all over the place, as if it's some magic advanced mathy thing, but embedding means look something up in an array. It's interesting to note, though, that looking something up in an array is mathematically identical to doing a matrix product by a one-hot encoded matrix, and therefore an embedding fits very nicely in our standard model of how neural networks work. So now suddenly it's as if we have another whole kind of layer: a kind of layer where we get to look things up in an array. But we actually didn't do anything special. We just added this computational shortcut, this thing called an embedding, which is simply a fast and memory-efficient way of multiplying by a one-hot encoded matrix. This is really important, because when you hear people say embedding, you need to replace it in your head with an array lookup, which we know is mathematically identical to a matrix multiply by a one-hot encoded matrix.

Here's the thing, though: it has kind of interesting semantics. When you multiply something by a one-hot encoded matrix, you get this nice feature where the values in row number 1 of your weight matrix, for example, only appear where you get user ID number 1 in your inputs. In other words, you end up with this weight matrix where certain rows of weights correspond to certain values of your input. That's pretty interesting. It's particularly interesting here, going back to the most convenient way to look at this, because the only way that we can calculate an output activation is by doing a dot product of
these two input vectors. That means that they kind of have to correspond with each other. There has to be some way of saying: if this number for a user is high, and this number for a movie is high, then the user will like the movie. So the only way that can possibly make sense is if these numbers represent features of personal taste and corresponding features of movies. For example, if the movie has John Travolta in it, and the user likes John Travolta, then the user will like this movie. We're not actually deciding the rows mean anything; we're not doing anything to make the rows mean anything. But the only way that this gradient descent could possibly come up with a good answer is if it figures out what the aspects of movie taste are and what the corresponding features of movies are. Those underlying features that appear are called latent factors or latent features. They're these hidden things that were there all along, and once we train this neural net, they suddenly appear.

Now here's the problem. No one's going to like Battlefield Earth. It's not a good movie, even though it has John Travolta in it. So how are we going to deal with that? Because there's this feature called "I like John Travolta movies", and this feature called "this movie has John Travolta", and so this is now saying you're going to like the movie. But we need to have some way to say "unless it's Battlefield Earth" (or you're a Scientologist, either one). So how do we do that? We need to add in bias.

So here is the same thing again. Not the same weight matrix, sorry, but the same construct, the same shape of everything. But this time we've got an extra row. So now this is not just the matrix product of that and that, but I'm also adding on this number and this number. That means each movie can now have an overall "this is a great movie" versus "this isn't a great movie", and every user can have an overall "this user rates movies highly" or "this user doesn't rate movies highly". So that's called the bias. So this is
hopefully going to look very familiar. This is the same usual linear model concept, or linear layer concept, from a neural net: you have a matrix product and a bias. And remember from the lesson two SGD notebook, you never actually need a bias: you could always just add a column of ones to your input data, and that gives you bias for free. But that's pretty inefficient, so in practice all neural network libraries explicitly have a concept of bias; we don't actually add the column of ones.

So what does that do? Well, just before I came in today I ran Data, Solver on this as well, and we can check the RMSE. The root mean squared error here is 0.32, versus 0.39 for the version without bias. So you can see that this slightly better model gives us a better result. It's better because it's giving more flexibility, and it also just makes sense semantically: you need to be able to say that whether I like the movie is not just about the combination of what actors it has, and whether it's dialogue driven, and how much action is in it, but also: is it just a good movie? Or am I somebody who rates movies highly?

So there are all the pieces of this collaborative filtering model. How are we going? Any questions? We have three questions. OK.

Our first question, then: when we load a pretrained model, can we explore the activation grids to see what they might be good at recognizing? Yes, you can, and we will learn how to; it should be in the next lesson.

Can we have an explanation of what the first argument in fit_one_cycle actually represents? Is it equivalent to an epoch? Yes, the first argument to fit_one_cycle or fit is the number of epochs. An epoch is looking at every input once. So if you do 10 epochs, you're looking at every input 10 times, and so there's a chance you might start overfitting if you've got lots and lots of parameters and a high learning rate. If you only do one epoch, it's impossible to
overfit, so that's why it's kind of useful to remember how many epochs you're doing.

Can we have an explanation of what an affine function is? An affine function is a linear function. I don't know if we need much more detail than that. If you're multiplying things together and adding them up, it's an affine function. I'm not going to bother with the exact mathematical definition, partly because I'm a terrible mathematician and partly because it doesn't matter. But if you just remember that you're multiplying things together and then adding them up, that's the most important thing. It's linear, and therefore if you put an affine function on top of an affine function, that's just another affine function. You haven't won anything at all; that's a total waste of time. So you need to sandwich it with a nonlinearity. Pretty much any kind of nonlinearity works, including replacing the negatives with zeros, which we call ReLU. So if you do affine, ReLU, affine, ReLU, affine, ReLU, you have a deep neural network.

So let's go back to the collaborative filtering notebook. This time we're going to grab the whole MovieLens 100K dataset. There's also a 20 million dataset, by the way. It's a really great project, made by this group called GroupLens. They actually update the MovieLens datasets on a regular basis, but they helpfully provide the original one, and we're going to use the original one because that means that we can compare to baselines. Basically they say, if you're going to compare to baselines, make sure you all use the same dataset; here's the one you should use. Unfortunately, it means that we're going to be restricted to movies that are before 1998, so maybe you won't have seen them all, but that's the price we pay. You can replace this with ml-latest when you download it and use it, if you want to play around with movies that are up to date.

The more recent MovieLens datasets are in a CSV file that's super convenient to use. The original one is slightly messy. First of all, they don't use commas
for delimiters, they use tabs, so in pandas you can just say what the delimiter is when you load it in. The second is that they don't add a header row so that you know what column is what, so you have to tell pandas there's no header row. And since there's no header row, you have to tell pandas what the names of the columns are. Other than that, that's all we need. We can then have a look at head, which shows the first few rows, and there are our ratings: user, movie, rating.

Let's make it more fun; let's see what the movies actually are. I'll just point something out here, which is that there's this thing called encoding=. If I get rid of it, I get this error. I just want to point this out because you'll all see this at some point in your lives: Unicode can't decode blah blah blah. What this means is that this is not a Unicode file. This will be quite common when you're using datasets that are a little bit older, from back before us folks in the West really realized that there are people who use languages other than English. Nowadays we're much better at handling different languages: we use this standard called Unicode, and Python very helpfully uses Unicode by default. But if you try to load an old file that's not Unicode, you actually, believe it or not, have to guess how it was encoded. But since it's really likely that it was created by some Western European or American person, they almost certainly used Latin-1. So if you just pop in encoding='latin-1', whether you use file open in Python or pandas or whatever, that will generally get around your problem.

Again, they didn't have the names, so we had to list the names. This is kind of interesting: they had a separate column for every one of however many genres they had, 19 genres. You'll see this looks one-hot encoded, but it's actually not; it's actually N-hot encoded, because a movie can be in multiple genres. We're not going to look at
genres today, but it's just interesting to point out that this is a way that sometimes people will represent something like genre. In the more recent version, they actually list the genres directly, which is much more convenient.

So we've got 100,000 ratings. I find life is easier when you're modeling if you actually denormalize the data, so I want the movie title directly in my ratings. pandas has a merge function to let us do that, so here's the ratings table with actual titles. As per usual, we can create a data bunch for our application: a CollabDataBunch for the collab application. From what? From a data frame; there's our data frame. Set aside some validation data. Really, we should use the validation sets and cross-validation approach that they used if we're going to properly compare with a benchmark, so take these comparisons with a grain of salt. By default, CollabDataBunch assumes that your first column is user, your second column is item, and your third column is rating. But now we're actually going to use the title column as item, so we have to tell it what the item column name is. And then all of our data bunches support show_batch, so you can just check what's in there, and there it is.

I'm going to try to get as good a result as I can, so I'm going to use whatever tricks I can come up with to get a good answer. One of the tricks is to use the y_range. Remember, the y_range was the thing that made the final activation function a sigmoid. Specifically, last week we said let's have a sigmoid that goes from 0 to 5, and that way it's going to help the neural network predict things that are in the right range. I actually didn't do that in my Excel version, and so you can see I've got some negatives, and there's also something bigger than 5. So if you want to beat me in Excel, you could add the sigmoid to Excel and train this, and you'll get a slightly better answer. Now the problem is that a sigmoid actually
asymptotes at whatever the maximum is (we said 5), which means you can never actually predict 5. But plenty of movies have a rating of 5, so that's a problem. So it's actually slightly better to make your y_range go from a little bit less than the minimum to a little bit more than the maximum. The minimum of this data is 0.5 and the maximum is 5, so this range just goes a little bit further. That's one little trick to get a little bit more accuracy. The other trick I used is to add something called weight decay, and we're going to look at that next; after this section we're going to learn about weight decay.

So then how many factors do you want? Well, what are factors? The number of factors is the width of the embedding matrix. So why don't we say embedding size? Maybe we should, but in the world of collaborative filtering they don't use that word. They use the word factors because of this idea of latent factors, and because the standard way of doing collaborative filtering has been with something called matrix factorization. In fact, what we just saw happens to be a way of doing matrix factorization, so we've accidentally learned how to do matrix factorization today. So this is a term that's kind of specific to this domain, but you can just remember it as the width of the embedding matrix.

And so why 40? Well, this is one of these architectural decisions you have to play around with and see what works. I tried 10, 20, 40 and 80, and I found 40 seemed to work pretty well. And it trained really quickly, so you can chuck it in a little for loop to try a few things and see what looks best. And then for learning rates, use the learning rate finder as usual. 5e-3 seemed to work pretty well. Remember this is just a rule of thumb: 5e-3 is a bit lower than both Sylvain's rule and my rule. Sylvain's rule is find the bottom and go back by 10, so his rule would be more like 2e-2, I reckon. My rule is kind of find about the steepest section, which is 
about here, which again often agrees with Sylvain's, so that would be about 2e-2. I tried that, and I always like to try 10x less and 10x more just to check, and actually I found a bit less was helpful. So the answer to the question "should I do blah?" is always "try blah and see". That's how you actually become a good practitioner. So that gave me 0.813, and as usual you can save the result to save you another 33 seconds from having to do it again later.

There's a library called LibRec, and they publish some benchmarks for MovieLens 100k. There's a root mean squared error section, and about 0.91 is about as good as they seem to have been able to get. 0.91 is the root mean squared error; we use the mean squared error, not the root, so we have to go 0.91 squared, which is 0.83, and we're getting 0.81. So that's cool: with this very simple model we're doing a little bit better, quite a lot better actually, although as I said, take it with a grain of salt because we're not doing the same splits and the same cross-validation. So we're at least highly competitive with their approaches.

We're going to look at the Python code that does this in a moment, but for now just take my word for it that we're going to see something that's just doing this: looking things up in an array, then multiplying them together, adding them up, and using the mean squared error loss function. Given that, and given that we noticed that the only way that can do anything interesting is by trying to find these latent factors, it makes sense to look and see what they found, particularly since as well as finding latent factors we also now have a specific bias number for every user and every movie. Now you could just say what's the average rating for each movie, but there's a few issues with that. In particular, this is something you see a lot with anime. People who like anime just love anime, and so they watch 
lots of anime and then they just rate all the anime highly. So very often on charts of movies you'll see a lot of anime at the top. Particularly if it's a hundred-long series of anime, you'll find every single item of that series in the top thousand movie list or something. So how do we deal with that? Well, the nice thing is that instead we can look at the movie bias. The movie bias says: once we've included the user bias, which for an anime lover might be a very high number because they're just rating a lot of movies highly, and once we account for the specifics of this kind of movie, which again might be that people love anime, what's left over is something specific to that movie itself. So it's interesting to look at movie bias numbers as a way of saying what are the best movies, or what do people really like as movies, even if those people don't rate movies very highly, or even if that movie doesn't have the kind of features that people tend to rate highly. It's funny to say this, but by using the bias we get an unbiased kind of movie score.

So how do we do that? Well, to make it interesting, particularly because this dataset only goes to 1998, let's only look at movies that plenty of people watched. We'll use pandas to grab our rating movie table, group it by title, and then count the number of ratings; not measuring how high they're rated, just how many ratings they have. So the top thousand are the movies that have been rated the most, and so they're hopefully movies that we might have seen. That's the only reason I'm doing this. I've called this top movies, by which I mean not good movies, just movies we're likely to have seen. Not surprisingly, Star Wars is the one that, at that point, the most people had put a rating to. Independence Day, there you go.

We can then take our learner that we trained and ask it for the bias of the items listed here. So this is_item=True: you would pass True to say 
I want the items, or False to say I want the users. This is a pretty common piece of nomenclature for collaborative filtering: these IDs tend to be called users, these IDs tend to be called items, even if your problem has got nothing to do with users and items at all. We just use these names for convenience, so they're just words. In our case we want the items, this is the list of items we want, and we want the bias. So this is specific to collaborative filtering, and that's going to give us back a thousand numbers, because this has a thousand movies in it.

Just for comparison, let's also group the titles by the mean rating. Then we can zip through, going through each of the movies together along with the bias, and grab their rating and the bias and the movie. And then we can sort them all by the zero-index thing, which is the bias. So here are the lowest numbers. I can say, you know, Mortal Kombat: Annihilation, not a great movie. I haven't seen Children of the Corn, but we did have a long discussion at the SF study group today, and people who have seen it agree: not a great movie. And you can see some of them actually have pretty decent ratings; this one's actually got a much higher rating than the next one. But that's kind of saying: given the kind of actors that were in this, and the kind of movie that this was, and the kind of people who watch it, you would expect it to be higher.

And then here's the sort by reverse: Schindler's List, Titanic, Shawshank Redemption. Seems reasonable. And again you can look for ones where the rating isn't that high but it's still very high here. So that's kind of like, at least in 1998, people weren't that into Leonardo DiCaprio, or people aren't that into dialogue-driven movies, or people aren't that into romances or whatever, but still people liked it more than you would have expected. So it's interesting to interpret our models in 
this way. We can go a bit further and grab not just the biases but the weights. So that is these things. Again we're going to grab the weights for the items, for our top movies, and that is a thousand by 40 because we asked for 40 factors. So rather than having a width of 5 we have a width of 40. Often there isn't really conceptually 40 latent factors involved in taste, and so trying to look at the 40 can be not that intuitive. So what we want to do is squish those 40 down to just 3, and there's something that we're not going to look into called PCA, which stands for principal components analysis. So this movie_w is a torch tensor, and fastai adds the pca method to torch tensors. What PCA does is a simple linear transformation that takes an input matrix and tries to find a smaller number of columns that cover a lot of the space of that original matrix. If that sounds interesting, which it totally is, you should check out our course Computational Linear Algebra, which Rachel teaches, where we will show you how to calculate PCA from scratch and why you'd want to do it and lots of stuff like that.

It's absolutely not a prerequisite for anything in this course, but it's definitely worth knowing that taking layers of neural nets and chucking them through PCA is very often a good idea, because very often you have way more activations than you want in a layer, and there's all kinds of reasons you might want to play with it. For example, Francisco, who's sitting next to me today, has been working on something to do image similarity, and for image similarity a nice way to do that is to compare activations from a model. But often those activations will be huge, and therefore your thing could be really slow and unwieldy. So people often, for something like image similarity, will chuck it through a PCA first, and that's kind of cool. In our case we're just going to do it so that we take our 40 components down to 3 components, so 
hopefully they'll be easier for us to interpret. So we can grab each of those 3 factors; we'll call them factor 0, 1 and 2. Let's grab that movie component and then sort. Now the thing is, we have no idea what this is going to mean, but we're pretty sure it's going to be some aspect of taste and movie feature. So if we print out the top and the bottom, we can see that the highest ranked things on this feature you would kind of describe as connoisseur movies, I guess. You know, like Chinatown, a really classic Jack Nicholson movie; everybody knows Casablanca; and even The Wrong Trousers is this kind of classic claymation movie, and so forth. So yeah, this is definitely measuring things that are very high on the connoisseur level, whereas maybe Home Alone 3 is not such a favorite with connoisseurs, perhaps. That's not to say that there aren't people who like it, but probably not the same kind of people that would appreciate Secrets & Lies. So you can see this idea that this has found some feature of movies and a corresponding feature of the kind of things people like.

Let's look at another feature. Here's factor number one. This seems to have found: okay, these are just big hits that you could watch with the family. These are definitely not that; you know, Trainspotting is a very gritty kind of thing. So again it's found this interesting feature of taste. And we could even draw them on a graph. I've just cut off a few of them randomly to make them easier to see, and this is just the top 50 most popular movies by how many times they've been rated. On this one factor you've got the Terminators really high up here, The English Patient and Schindler's List at the other end, and then here's your Godfather and Monty Python over here, and others over there. So you get the idea. So that's kind of fun. It would be interesting to see if you can come up 
with some stuff at work or on other kinds of datasets where you could try to pull out some features and play with them.

So how does that work? Any questions? Okay, the question is: why am I sometimes getting negative loss when training? You shouldn't be, so you're doing something wrong. So show us; particularly since people are upvoting this, I guess other people have seen it too, so put it on the forum. They said they're doing negative log likelihood. Yeah, so we're going to be learning about cross-entropy and negative log likelihood after the break today. They are loss functions that have very specific expectations about what your input looks like, and if your input doesn't look like that then they're going to give very weird answers. So probably you pressed the wrong buttons; don't do that.

Okay, so we said collab_learner, and so here is the collab_learner function. The collab_learner function, as per usual, takes a data bunch, and normally learners also take something where you ask for particular architectural details. In this case there's only one thing which does that, which is basically: do you want to use a multi-layer neural net or do you want to use classic collaborative filtering? We're only going to look at the classic collaborative filtering today, or maybe we'll briefly look at the other one too, we'll see. So what actually happens here? Well, basically we create an EmbeddingDotBias model, and then we pass back a learner which has our data and that model. So obviously all the interesting stuff is happening here in EmbeddingDotBias, so let's take a look at that. I clearly pressed the wrong button. EmbeddingDotBias, there we go.

So here's our EmbeddingDotBias model. It is an nn.Module. In PyTorch, to remind you, all PyTorch layers and models are nn.Modules. They are things that, once you create them, look exactly like a function: you call them with parentheses and you pass them arguments. But they're not functions. Normally in Python, to 
make something look like a function you have to give it a method called dunder call, remember that means __call__ (underscore underscore call underscore underscore), which doesn't exist here. The reason is that PyTorch actually expects you to have something called forward, and that's what PyTorch will call for you when you call it like a function. So when this model is being trained, to get the predictions it's actually going to call forward for us. So this is where we do the calculations to calculate our predictions.

This is where you can see we grab our... why is this users rather than user? That's because everything's done a mini-batch at a time. When I read the forward in a PyTorch module, I tend to ignore in my head the fact that there's a mini-batch and I pretend there's just one, because PyTorch automatically handles all of the stuff about doing it to everything in the mini-batch for you. So let's pretend there's just one user. We grab that user, and what is this self.u_weight? self.u_weight is an embedding. We create an embedding for each of users by factors, items by factors, users by one, items by one. That makes sense, right? So users by one is here, that's the user's bias, and then users by factors is here. Users by factors is the first tuple, so that's going to go in u_weight, and users by one is the third, so that's going to go in u_bias.

Remember, when PyTorch creates our nn.Module it calls dunder init (__init__), and so this is where we have to create our weight matrices. We don't normally create the actual weight matrix tensors; we normally use PyTorch's convenience functions to do that for us, and we're going to see some of that after the break. For now, just recognize that this function is going to create an embedding matrix for us. It's going to be a PyTorch nn.Module as well, so to actually pass stuff into that embedding matrix and get activations out, you treat it as if it was a function: stick it in parentheses. So if you want to 
look in the PyTorch source code and find nn.Embedding, you will find there's something called forward in there which will do this array lookup for us. So here's where we grab the users, here's where we grab the items, and we've now got the embeddings for each. At this point we're kind of here, and we've found that and that, so we multiply them together and sum them up, and then we add on the user bias and the item bias. And then if we've got a y_range, we do our sigmoid trick. So the nice thing is you now understand the entirety of this model, and this is not just any model: this is a model that we just found is at the very least highly competitive with, and perhaps slightly better than, some published table of pretty good numbers from a software group that does nothing but this. So you're doing well, this is nice.

That's probably a good place to have a break, and after the break we're going to come back and talk about the one piece of this puzzle we haven't learned yet, which is: what the hell does this do? So let's come back at 7:50.

Okay, so this idea of interpreting embeddings is really interesting, and as we'll see later in this lesson, the things that we create for categorical variables more generally in tabular datasets are also embedding matrices. Again, that's just a normal matrix multiply by a one-hot encoded input, where we skip the computational and memory burden of it by doing it in a more efficient way, and it happens to end up with these interesting semantics kind of accidentally. There was this really interesting paper by these folks who came second in a Kaggle competition for something called Rossmann. We'll probably look in more detail at the Rossmann competition in part two; I think we're going to run out of time in part one. It's basically pretty standard tabular stuff, and the main interesting stuff is in the pre-processing. It was interesting because they came second despite the fact that the person who came first and 
pretty much everybody else who was at the top of the leaderboard did a massive amount of highly specific feature engineering, whereas these folks did way less feature engineering than anybody else. Instead they used a neural net, and this was at a time, in 2016, when just no one did that; no one was doing neural nets for tabular data. So the kind of stuff that we've been talking about kind of arose there, or at least was kind of popularized there. And when I say popularized, I mean only popularized a tiny bit; still most people aren't aware of this idea. But it's pretty cool, because in their paper they showed the mean average percentage error for various techniques: k-nearest neighbors, random forests and gradient boosted trees. Well, first, neural nets just worked a lot better. But then with entity embeddings, which is what they call this, just using embedding matrices in tabular data, they actually added the entity embeddings to all of these different techniques after training them, and they all got way better. So neural nets with entity embeddings are still the best, but a random forest with entity embeddings was not at all far behind. And that's kind of nice, because you could train these embedding matrices for products or stores or genome motifs or whatever, and then use them in lots of different models, possibly using faster things like random forests but getting a lot of the benefits.

But here was something interesting. They took a two-dimensional projection of their embedding matrix for state, for example German state, because this was a German supermarket chain I think, using the same kind of approach we did; I don't remember if they used PCA or something else. And then here's the interesting thing. I've circled a few things in this embedding space here, and I've circled it with the same color over here, and here I've circled some with the same color over here, and it's like, oh my god, the embedding projection has actually 
discovered geography. They didn't do that, but it's found things that are nearby each other in grocery purchasing patterns, because this was about predicting how many sales there will be, and there is some geographic element of that. In fact, here is a graph of the distance between embedding vectors. You can just take an embedding vector and say what's the sum of squared differences compared to some other embedding vector, and that gives you the Euclidean distance (once you take the square root), so what's the distance in embedding space, and then plot it against the distance in real life between shops, and you get this very strong positive correlation. Here is an embedding space for the days of the week, and as you can see there's a very clear path through them. Here's the embedding space for the months of the year, and again there's a very clear path through them. So embeddings are amazing, and I don't feel like anybody is even close to exploring the kind of interpretation that you could get. So if you've got genome motifs or plant species or products that your shop sells or whatever, it would be really interesting to train a few models and try to fine-tune some embeddings, and then start looking at them in these ways, in terms of similarity to other ones, and clustering them, and projecting them into 2D spaces, and whatever. I think it's really interesting.

So we were trying to make sure we understood what every line of code did in this pretty good collab learner model we built, and the one piece missing is this wd piece. wd stands for weight decay. So what is weight decay? Weight decay is a type of regularization. What is regularization? Well, let's start by going back to this nice little chart that Andrew Ng did in his terrific machine learning course, where he plotted some data and then showed a few different lines through it. This one here, because Andrew is at Stanford he has to use Greek letters. We could say this is a plus bx, but if you want to go there, theta naught plus 
theta one x is a line. It's a line even if it's got Greek letters; it's still a line. So here's a second-degree polynomial, a plus bx plus cx squared, a bit of curve. And here's a high-degree polynomial, which is curvy as anything. So models with more parameters tend to look more like this. In traditional statistics we say, hey, let's use fewer parameters, because we don't want it to look like this. Because if it looks like this, then the predictions over here and over here are going to be all wrong. It's not going to generalize well; it's going to be overfitting. So we avoid overfitting by using fewer parameters.

So if any of you are unlucky enough to have been brainwashed by a background in statistics or psychology or econometrics or any of these kinds of courses, you're going to have to unlearn the idea that you need fewer parameters. What you instead need to realize is that you were fed this lie that you need fewer parameters because it's a convenient fiction for the real truth, which is that you don't want your function to be too complex. Having fewer parameters is one way of making it less complex. But what if you had a thousand parameters and 999 of those parameters were 1e-9? Or what if they were zero? If they're zero then they're not really there, or if they're 1e-9 they're hardly there. So why can't I have lots of parameters if lots of them are really small? And the answer is you can. So this thing of counting the number of parameters as how we limit complexity is actually extremely limiting. It's a fiction that really has a lot of problems. So if in your head complexity is scored by how many parameters you have, you're doing it all wrong. Score it properly.

So why do we care? Why would I want to use more parameters? Because more parameters means more nonlinearities, more interactions, more curvy bits, and real life is full of curvy bits. Real life does not look like this. But we don't want them to be 
more curvy than necessary or more interacting than necessary. So therefore let's use lots of parameters and then penalize complexity. One way to penalize complexity, as I kind of suggested before, is: let's sum up the value of your parameters. Now that doesn't quite work, because some parameters are positive and some are negative. So what if we sum up the square of the parameters? That's actually a really good idea. Let's actually create a model, and in the loss function we're going to add the sum of the squares of the parameters. Now here's the problem with that, though: maybe that number is way too big, and it's so big that the best loss is to set all of the parameters to zero. That would be no good, so we want to make sure that doesn't happen. So therefore let's not just add the sum of the squares of the parameters to the loss, but let's multiply that by some number that we choose. That number that we choose in fastai is called wd. So that's what we're going to do: we're going to take our loss function and add to it the sum of the squares of the parameters multiplied by some number wd.

What should that number be? Well, generally it should be 0.1. People with fancy machine learning PhDs are extremely skeptical and dismissive of any claims that a learning rate can be 3e-3 most of the time or a weight decay can be 0.1 most of the time. But here's the thing: we've done a lot of experiments on a lot of datasets, and we've had a lot of trouble finding anywhere a weight decay of 0.1 isn't great. However, we don't make that the default. We actually make the default 0.01. Why? Because in those rare occasions where you have too much weight decay, no matter how much you train it just never quite fits well enough, whereas if you have too little weight decay you can still train well; you'll just start to overfit, so you just have to stop a little bit early. So we've been a little bit conservative with our defaults, but my 
suggestion to you is this. You now know that every learner has a wd argument. I should mention you won't always see it in this list, because there's this concept of kwargs in Python, which is basically parameters that get passed up the chain to the next thing that we call. Basically all of the learners will eventually call this constructor, and this constructor has a wd. So this is just one of those things that you can either look up in the docs or you now know it. Anytime you're constructing a learner from pretty much any kind of function in fastai you can pass wd, and passing 0.1 instead of the default 0.01 will often help, so give it a go.

So what's really going on here? It would be helpful, I think, to go back to lesson 2 SGD. Everything we're doing for the rest of today really is based on this. This is where we created some data, and then we added a loss function, mse, and then we created a function called update, which calculated our predictions (that's our matrix multiply; this is just one layer, so there's no ReLU), we calculated our loss using that mean squared error, we calculated the gradients using loss.backward, we then subtracted in place the learning rate times the gradients, and that is gradient descent. So if you haven't reviewed lesson 2 SGD, please do, because this is our starting point. If you don't get this then none of this is going to make sense. If you're watching the video, maybe pause now, go back, rewatch this part of lesson 2, and make sure you get it. Remember, a.sub_ is basically the same as a minus equals, because a.sub is subtract, and everything in PyTorch, if you add an underscore to it, means do it in place. So this is updating our a parameters, which started out as -1. and 1. (we just arbitrarily picked those numbers), and it gradually makes them better.

So let's write that down. We are trying to calculate the parameters (I'm going to call them weights, because this is just more common) in kind of epoch t or time t, and they're going to be equal to whatever the 
weights were in the previous epoch, minus our learning rate multiplied by the derivative of our loss function with respect to our weights at time t minus 1. So that's what this is doing. We don't have to calculate the derivative, because it's boring and because computers do it for us fast, and then they store it here for us, so we're good to go. Make sure you're exceptionally comfortable with either that equation or that line of code, because they are the same thing.

Where do we go from here? So what's our loss? Our loss is some function of our independent variables x and our weights. In our case we're using mean squared error, for example, and it's between our predictions and our actuals. So where do x and w come in? Well, our predictions come from running some model, we'll call it m, on those independent variables, and that model contains some weights. So that's what our loss function might be, and this might be all kinds of other loss functions; we'll see some more today. And so that's what ends up creating a.grad over here. We're going to do something else: we're going to add weight decay, some number, which in our case is 0.1, times the sum of the weights squared.

So let's do that, and let's make it interesting by not using synthetic data but using some real data. We're going to use MNIST, but we're going to do this as a standard fully connected net, not as a convolutional net, because we haven't learnt the details of how to really create one of those from scratch. In this case deeplearning.net provides MNIST as a Python pickle file; in other words, it's a file that Python can just open up and it'll give you NumPy arrays straight away, and they're flat NumPy arrays, so we don't have to do anything to them. So go grab that. It's a gzip file, so you can just gzip.open it directly and then pickle.load it directly, and again encoding equals latin-1, because, yeah, you know. And then we can 
just put that; that'll give us the training, the validation and the test set. I don't care about the test set, so generally in Python you tend to use this special variable called underscore. There's no reason you have to; it's just that people know you mean "I don't care about this". So there's our training x and y, and our valid x and y. Now this actually comes in, as you can see if I print the shape, as 50,000 rows by 784 columns, but those 784 columns are actually 28 by 28 pixel pictures. So if I reshape one of them into a 28 by 28 pixel picture and plot it, you can see it's the number 5. So that's our data. We've seen MNIST before in its pre-reshaped version; here it is in its flattened version, so I'm going to be using it in its flattened version. Currently they are NumPy arrays and I need them to be tensors, so I can just map torch.tensor across all of them, and so now they're tensors.

I may as well create a variable with the number of things I have, which we normally call n. And we tend to use c to mean... sorry, this is not going to be the number of activations, this is going to be the number of columns. That's not a great name for it, sorry. So there we are. And then the y, not surprisingly, has a minimum value of 0 and a maximum of 9, because that's the actual number we're going to predict.

In lesson 2 SGD we created data where we actually added a column of 1s on, so that we didn't have to worry about bias. We're not going to do that; we're going to have PyTorch do that implicitly for us. We had to write our own mse function; we're not going to do that. We had to write our own little matrix multiplication thing; we're going to have PyTorch do all this stuff for us now. And what's more, and really important, we're going to do mini-batches. This is a big enough dataset that we probably don't want to do it all at once. So, to do mini-batches, we're not going to use too much fastai stuff here. 
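
To make the mini-batch idea concrete, here's a minimal sketch in plain Python of what grabbing shuffled mini-batches of x and y from a dataset amounts to. This is not how PyTorch or fastai implement it; the function name iter_minibatches and the stand-in data (200 all-zero rows of 784 values, rather than the real 50,000 MNIST images) are made up purely for illustration.

```python
import random

def iter_minibatches(xs, ys, batch_size, shuffle=True, seed=None):
    """Yield (x_batch, y_batch) pairs that together cover the dataset once."""
    assert len(xs) == len(ys)
    idxs = list(range(len(xs)))
    if shuffle:
        # Shuffle the row order so each pass sees the data differently
        random.Random(seed).shuffle(idxs)
    for start in range(0, len(idxs), batch_size):
        chunk = idxs[start:start + batch_size]
        yield [xs[i] for i in chunk], [ys[i] for i in chunk]

# Stand-in data (an assumption, not real MNIST): n rows of c "pixels", labels 0-9
n, c = 200, 784
xs = [[0.0] * c for _ in range(n)]
ys = [i % 10 for i in range(n)]

batches = list(iter_minibatches(xs, ys, batch_size=64, seed=0))
xb, yb = batches[0]
# Number of batches, then the first batch's x "shape" and y length
print(len(batches), len(xb), len(xb[0]), len(yb))  # 4 64 784 64
```

Note the last batch here only has 8 rows, because 200 isn't a multiple of 64.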
PyTorch has something called TensorDataset that basically grabs two tensors and creates a dataset. Remember, a dataset is something where if you index into it you get back an x value and a y value, just one of them. So it kind of looks a lot like a list of (x, y) tuples. Once you have a dataset, then you can use a little bit of convenience by calling DataBunch.create, and what that's going to do is create data loaders for you. A data loader is something where you don't say I want the first thing or the fifth thing; you just say I want the next thing, and it will give you a mini-batch of whatever size you asked for. Specifically it will give you the x and the y of a mini-batch. So if I just grab the next of the iterator (this is just standard Python, if you haven't used iterators in Python before) from my training data loader that DataBunch.create creates for you, you can check that, as you would expect, the x is 64 by 784, because it's 784 pixels flattened out, 64 in a mini-batch, and the y is just 64 numbers; they're the things we're trying to predict. If you look at the source code for DataBunch.create you'll see there's not much there, so feel free to do so. We just make sure that your training set gets randomly shuffled for you, and we make sure that the data is put on the GPU for you, just a couple of little convenience things like that. But don't let it be magic: if it feels magic, check out the source code to make sure you see what's going on.

Okay, so rather than do this y hat equals x@a thing, we're going to create an nn.Module. If you want to create an nn.Module that does something different to what's already out there, you have to subclass it. Subclassing is very, very normal in PyTorch, so if you're not comfortable with subclassing stuff in Python, go read a couple of tutorials to make sure you are. The main thing is you have to override the constructor, dunder init (__init__), and make sure that you call the superclass's 
constructor, because nn.Module's constructor is going to set it all up to be a proper nn.Module for you. So if you're trying to create your own PyTorch subclass and things don't work, it's almost certainly because you forgot this line of code. So the only thing we want to add is an attribute in our class which contains a linear layer, an nn.Linear module. What is an nn.Linear module? It's something which does that x at a, but actually it doesn't only do that, it does x at a plus b, so in other words we don't have to add the column of ones. That's all it does. If you want to play around, why don't you try to create your own nn.Linear class? You could create something called MyLinear, and how long it takes you will depend on your PyTorch background. We don't want any of this to be magic, and you already know all of the things necessary to create this. These are the kind of things that you should be doing for your assignments this week: not so much new applications, but try to start writing more of these things from scratch and get them to work, learn how to debug them, check what's going in and out, and so forth. But we can just use nn.Linear, and that's just going to have a forward in it that goes a at x plus b. And so then in our forward, how do we calculate the result of this? Well, remember every nn.Module looks like a function, so we pass our x mini batch (I tend to use xb to mean a batch of x) to self.lin, and that's going to give us back the result of the a at x plus b on this mini batch. So this is a logistic regression model. A logistic regression model is also known as a neural net with no hidden layers, so it's a one layer neural net with no nonlinearities. Because we're doing stuff ourselves a little bit, we have to put the weight matrices, the parameters, onto the GPU manually, so just type .cuda() to do that. So here's our model, and as you can see the nn.Module machinery has automatically given us a
representation of it. It's automatically stored the .lin thing and it's telling us what's inside it. So there's a lot of little conveniences that PyTorch does for us. If you look now at model.lin you can see, not surprisingly, here it is. Perhaps the most interesting thing to point out is that our model automatically gets a bunch of methods and properties, and perhaps the most interesting one is the one called parameters, which contains all of the yellow squares from our picture. It contains our parameters: our weight matrices and bias vectors, inasmuch as they're different. So if we have a look at p.shape for p in model.parameters(), there's something of 10 by 784 and there's something of 10. So what are they? Well, the 10 by 784 is the thing that's going to take in a 784 dimensional input and spit out a 10 dimensional output. That's handy, because our input is 784 dimensional and we need something that's going to give us a probability for each of 10 numbers. After that happens we've got 10 activations, which we then want to add the bias to, so there we go, here's a vector of length 10. So you can see why this model we've created has exactly the stuff that we need to do our a x plus b. So let's grab a learning rate. We're going to come back to this loss function in a moment, but we can't really use mse for this, because we're not trying to say how close you are. Did you predict 3 and actually it was 4? Gosh, you were really close! No, 3 is just as far away from 4 as 0 is away from 4 when you're trying to predict what number somebody drew. So we're not going to use mse, we're going to use cross entropy loss, which we'll look at in a moment. And here's our update function. I copied it from lesson 2 SGD, but now we're calling our model as if it was a function to get y hat, rather than going a at x, and we're calling our loss func to get our loss, rather than calling mse. And then this is all the same as before, except rather than going
through each parameter by hand and going parameter.sub_(learning rate times gradient), we loop through the parameters, because very nicely for us PyTorch will automatically create this list of the parameters of anything that we created in our dunder init. And look, I've added something else. I've got this thing called w2: I go through each p in model.parameters() and I add to w2 the sum of squares, so w2 now contains my sum of squared weights, and then I multiply it by some number, which I set to 1e-5. So now I've just implemented weight decay. When people talk about weight decay, it's not an amazing, magic, complex thing containing thousands of lines of CUDA C++ code; it's those two lines of Python. That's weight decay. This is not a simplified version that's just enough for now, this is weight decay, that's it. And so here's the thing: there's a really interesting kind of dual way of thinking about weight decay. One is that we're adding the sum of squared weights to the loss, and that seems like a very sound thing to do, and it is. Well, let's go ahead and run this. Here I've just got a list comprehension that's going through my data loader; the data loader gives you back one mini batch at a time through the whole thing, giving you x, y each time, and I'm going to call update for each. Each one returns the loss. Now with PyTorch tensors, since I did it all on the GPU, that's sitting in the GPU, and there's all this stuff attached to it to calculate gradients, so it's going to use up a lot of memory. If you call .item() on a scalar tensor it turns it into an actual normal Python number, so this just means I'm returning back normal Python numbers, and then I can plot them. And there you go, my loss function is going down. It's really nice to try this stuff to see it behaves as you expect. We thought this is what would happen: as we get closer and closer to the answer it bounces around more and more, because we're kind of close to where we should be, and it's probably getting flatter in weight space so we're
kind of jumping further, and so you can see why we would probably want to be reducing our learning rate as we go: learning rate annealing. Now here's the thing. That term is only interesting for training a neural net because it appears here, because we take the gradient of it; that's the thing that actually updates the weights. So actually the only thing interesting about wd times the sum of w squared is its gradient. We don't do a lot of math here, but I think we can handle that. The gradient of this whole thing, if you remember back to your high school math, is equal to the gradient of each part taken separately and then added together. So let's just take the gradient of that part, because we already know the gradient of this is just whatever we had before. What's the gradient of wd times the sum of w squared? Let's remove the sum and pretend there's just one parameter; it doesn't change the generality of it. So what's the gradient of wd times w squared with respect to w? It's just 2 wd times w. Remember wd is our constant, which in that little loop was 1e-5, and w is our weights. And we could replace wd with 2 wd without loss of generality, so let's throw away the 2. So in other words, all weight decay does is subtract some constant times the weights every time we do a batch; that's why it's called weight decay. When it's in this form where we add the square to the loss function, that's called L2 regularization. When it's in this form where we subtract wd times the weights from the gradients, that's called weight decay. They are mathematically identical for everything we've seen so far, and we'll see in a moment a place where they're not, where things get interesting. So this is just a really important tool you now have in your toolbox. You can make giant neural networks and still avoid overfitting by adding more weight decay, or you could use really
small data sets with moderately large sized models and avoid overfitting with weight decay. It's not magic. You might still find you don't have enough data, in which case you get to the point where you're not overfitting by adding lots of weight decay and it's just not training very well; that can happen. But at least this is something that you can now play around with. Just to go on here, now that we've got this update function, we could replace this MNIST logistic with an MNIST neural network and build a neural network from scratch. Right now we just need two linear layers. For the first one we could use a weight matrix of size 50, and so we need to make sure that the second linear layer has an input of size 50 so it matches. The final layer has to have an output of size 10, because that's the number of classes we're predicting. And so now our forward just goes: do a linear layer, calculate ReLU, do a second linear layer. And now we've actually created a neural network from scratch. I mean, we didn't write nn.Linear, but you can write it yourself, or you could do the matrices directly, you know how to. So again, we go model.cuda() and then we can calculate losses with the exact same update function, and there it goes. So this is why this idea of neural nets is so easy: once you have something that can do gradient descent, you can try different models. And then you can start to add more PyTorch stuff. So rather than doing all this stuff yourself, why not just go opt equals optim dot something? The something we've done so far is SGD, and so now you're saying to PyTorch, I want you to take these parameters and optimize them using SGD. And so now, rather than saying for p in parameters, p minus equals lr times p.grad, you just say opt.step(). It's the same thing, it's just less code. But the reason it's particularly interesting is that now you can replace SGD with Adam, for example, and
you can even add things like weight decay, because there's more stuff that's in these things for you. So that's why we tend to use optim.blah, and behind the scenes this is actually what we do in fastai. So if I go optim.SGD, there's that, and that's just that same picture. But if we change to a different optimizer, look what happened: it diverged. We've seen a great picture of that from one of our students who showed what divergence looks like; this is what it looks like when you try to train something. We're using a different optimizer, so we need a different learning rate. And you can't just continue training, because by the time it's diverged the weights are really, really big and really, really small, and they're not going to come back, so start again. Okay, there's a better learning rate. But look at this: we're down underneath 0.5 by about batch 200, whereas before I'm not even sure we ever got to quite that level. So what's going on? What's Adam? Let me show you, and we're going to do gradient descent in Excel, because why wouldn't you? So here is some randomly generated data, some x's and some y's. Well, actually, they're randomly generated x's, and the y's are all calculated by doing a x plus b, where a is 2 and b is 30. So this is some data that we're going to try to match, and here is SGD, so we're going to do it with SGD. Now, in our lesson 2 SGD notebook we did the whole data set at once as a batch. In the notebook we just looked at, we did mini batches. In this spreadsheet we're going to do online gradient descent, which means every single row of data is a batch, so it's got a batch size of one. As per usual, we're going to start by picking an intercept and slope kind of arbitrarily, so I'm just going to pick them at one; it doesn't really matter. Here I've copied over the data, this is my x and y, and my intercept and slope, as I said, are one; I'm just literally referring back to this cell here.
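The three batching regimes just mentioned (whole data set, mini batches, online) differ only in the batch size. A minimal sketch in plain Python, assuming made-up x values and hypothetical helper names `make_data` and `batches`; only a = 2 and b = 30 come from the spreadsheet description:

```python
import random

def make_data(n=20, a=2.0, b=30.0, seed=0):
    """Random x values with y = a*x + b (a=2, b=30 as in the spreadsheet)."""
    rng = random.Random(seed)
    xs = [float(rng.randint(1, 100)) for _ in range(n)]
    ys = [a * x + b for x in xs]
    return xs, ys

def batches(xs, ys, batch_size):
    """Yield (x_batch, y_batch) pairs of at most batch_size rows each."""
    for i in range(0, len(xs), batch_size):
        yield xs[i:i + batch_size], ys[i:i + batch_size]

xs, ys = make_data()
full = list(batches(xs, ys, len(xs)))   # whole data set as one batch
mini = list(batches(xs, ys, 8))         # mini batches (size 8 is arbitrary)
online = list(batches(xs, ys, 1))       # online: every row is its own batch
```

With 20 rows this gives one full batch, three mini batches, and twenty online batches, so one epoch of online gradient descent means twenty separate weight updates.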
So my prediction for this particular intercept and slope would be 14 times 1 plus 1, which is 15, and so there's my error, there's my sum of squares. Well, it's not even a sum at this point, it's the squared error. So now I need to calculate the gradient so that I can update. There's two ways you can calculate the gradient. One is analytically, and you can just look them up on Wolfram Alpha or whatever, so there's the gradients if you work them out by hand or look them up. Or you can do something called finite differencing. The gradient is just how far the outcome moves divided by how far your change was, for really small changes. So let's just make a really small change. Here we've taken our intercept and added .01 to it, and then calculated our loss, and you can see that our loss went down a little bit. We added .01 here, so our derivative is that difference divided by that .01, and that's called finite differencing. You can always do derivatives with finite differencing. It's slow, we don't do it in practice, but it's nice for just checking stuff out. So we can do the same thing for our a term: add .01 to that, take the difference, and divide by .01. Or, as I say, we can calculate it directly using the actual analytical derivative, and you can see that that and that are, as you'd expect, very similar, and that and that are very similar. So gradient descent then just says let's take our current value of that weight and subtract the learning rate times the derivative, there it is. And so now we can copy that intercept and that slope to the next row and do it again, and do it lots of times, and at the end we've done one epoch. At the end of that epoch we could say, great, this is our slope, so let's copy that over to where it says slope, and this is our intercept, so we'll copy it to where it says intercept, and now it's done another epoch. So that's kind of boring, copying and pasting, so I created a very sophisticated macro which copies and
pastes for you; I just recorded it, basically. And then I created a very sophisticated for loop that goes through and does it five times, and I attached that to the run button. So if I press run, it'll go ahead and do it five times and just keep track of the error each time. So that is SGD, and as you can see it is just infuriatingly slow. In particular the intercept is meant to be 30 and we're still only up to 1.57, it's just going so slowly. So let's speed it up. The first thing we can do to speed it up is to use something called momentum. Here's the exact same spreadsheet as the last worksheet. I've removed the finite differencing version of the derivatives because they're not that useful, so it's just the analytical ones here. And here's the thing where I take the derivative and I'm going to update by the derivative. But what I do (it's kind of more interesting to look at this one) is I take the derivative and multiply it by 0.1, and I look at the previous update and multiply that by 0.9, and I add the two together. So in other words, the update that I do is not just based on the derivative: a tenth of it is the derivative and 90% of it is just the same direction I went last time. This is called momentum. What it means is, remember how we thought about what might happen if you're trying to find the minimum of this, and you were here, and your learning rate was too small, and you just keep doing the same steps? Well, if you keep doing the same steps, then if you also add in the step you took last time, your steps are going to get bigger and bigger, aren't they?
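The spreadsheet update described above can be sketched in a few lines of plain Python. This is a hedged sketch, not the spreadsheet itself: the x values, the learning rate of 1e-4, and the helper names are assumptions chosen for illustration; the 0.1 / 0.9 blend, the starting values of 1, and the targets a = 2, b = 30 come from the description.

```python
import random

def make_data(n=30, a=2.0, b=30.0, seed=0):
    """Random x values with y = a*x + b; the x range is an arbitrary choice."""
    rng = random.Random(seed)
    xs = [rng.uniform(1.0, 20.0) for _ in range(n)]
    return xs, [a * x + b for x in xs]

def mse(xs, ys, a, b):
    """Mean squared error of the line a*x + b over the whole data set."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(xs, ys, lr=1e-4, momentum=0.9, epochs=5):
    """Online gradient descent (batch size 1) with a momentum-blended step.

    step = (1 - momentum) * gradient + momentum * previous step,
    so momentum=0.9 gives the 0.1 / 0.9 blend and momentum=0.0 is plain SGD.
    """
    a, b = 1.0, 1.0                 # arbitrary starting slope and intercept
    step_a, step_b = 0.0, 0.0       # previous updates, initially zero
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = a * x + b - y
            grad_a = 2 * x * err    # d/da of (a*x + b - y)^2
            grad_b = 2 * err        # d/db of (a*x + b - y)^2
            step_a = (1 - momentum) * grad_a + momentum * step_a
            step_b = (1 - momentum) * grad_b + momentum * step_b
            a -= lr * step_a
            b -= lr * step_b
    return a, b

xs, ys = make_data()
loss_start = mse(xs, ys, 1.0, 1.0)
a_sgd, b_sgd = train(xs, ys, momentum=0.0)
a_mom, b_mom = train(xs, ys, momentum=0.9)
```

Printing `b_sgd` and `b_mom` after a few epochs is a quick way to compare how far the intercept has moved under each rule; the exact numbers depend on the made-up data and learning rate.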
Okay, until eventually they go too far. But now, of course, your gradient is pointing the other direction to where your momentum is pointing, so you might just take a little step over here, and then you'll start going small steps, bigger steps, like that. So that's kind of what momentum does. Or if you're going too far, like this, which is also slow, then the average of your last few steps is actually somewhere between the two, isn't it? So this is a really common idea. When you have something that says my step at time t equals some number (people often use alpha, because, like I say, you've got to love these Greek letters) times the actual thing I want to do, so in this case it's the gradient, plus 1 minus alpha times whatever you had last time, S(t-1), this thing here is called an exponentially weighted moving average. The reason why is that, if you think about it, these 1 minus alphas are going to multiply, so S(t-2) is in here with a 1 minus alpha squared, and S(t-3) is in there with a 1 minus alpha cubed. So in other words, this ends up being the actual thing I want plus a weighted average of the last few time periods, where the most recent ones are exponentially higher weighted. This is going to keep popping up again and again. So that's what momentum is: it says I want to go based on the current gradient plus the exponentially weighted moving average of my last few steps. That's useful, and that's called SGD with momentum. We can do it by changing this here to say SGD with momentum, and momentum 0.9 is really common; it's so common it's always 0.9, just about, for basic stuff. So that's how you do SGD with momentum, and again, I didn't show you some simplified version, I showed you the version that is SGD with momentum. Again, you can write your own and try it out. A great assignment would be to take lesson 2 SGD and add momentum to it, or even the new notebook we've
got for MNIST: get rid of the optim dot and write your own update function with momentum. Then there's a cool thing called RMSProp. One of the really cool things about RMSProp is that Geoffrey Hinton created it, the famous neural net guy. Everybody uses it, it's really popular, it's really common, and the correct citation for RMSProp is the Coursera online free MOOC; that's where he first mentioned RMSProp. So I love this thing that cool new things appear in MOOCs, not in a paper. RMSProp is very similar to momentum, but this time we have an exponentially weighted moving average not of the gradient updates but of F8 squared, and F8 is the gradient, so that's the gradient squared. So it's the gradient squared times 0.1 plus the previous value times 0.9. This is an exponentially weighted moving average of the gradient squared. So what's this number going to mean? Well, if my gradient's really small and consistently really small, this will be a small number. If my gradient is highly volatile it's going to be a big number, or if it's just really big all the time it'll be a big number. And why is that interesting? Because when we do an update this time, we say weight minus learning rate times gradient divided by the square root of this. So in other words, if our gradient's consistently very small and not volatile, let's take bigger jumps. And that's kind of what we want: when we watched how the intercept moves so damn slowly, it's obvious you need to just try to go faster. So if I now run this, after just 5 epochs this is already up to 3, whereas with the basic version after 5 epochs it's still at 1.27, and remember we have to get to 30. So the obvious thing to do, and by obvious I mean only a couple of years ago did anybody actually figure this out, is do both, and that's called Adam. Adam is simply: keep track of the exponentially weighted moving average of the gradient squared, and also keep track of the exponentially weighted moving average of my steps, and divide by the exponentially weighted moving average of the
squared terms, and take 0.9 of a step in the same direction as last time, so it's momentum and RMSProp together, and that's called Adam. And look at this: 5 steps and we're at 25. So these optimizers, people call them dynamic learning rates, and a lot of people have the misunderstanding that you don't have to set a learning rate. Of course you do. It's just trying to identify parameters that need to move faster, or that consistently go in the same direction; it doesn't mean you don't need learning rates. We still have a learning rate. In fact, if I run this again (we're trying to get to 30 and 2), it's getting better, but eventually now it's just moving around the same place. So you can see what's happened is the learning rate is too high, so we could just go in here and drop it down and run it some more, and we're getting pretty close now. So you can see how you still need learning rate annealing, even with Adam. So that's the spreadsheets, fun to play around with. I do have a Google Sheets version of basic SGD that actually works, and the macros work and everything. Google Sheets is so awful, and I went so insane making that work, that I gave up on making the other ones work, so I'll share a link to the Google Sheets version. Oh my god, they do have a macro language, but it's just ridiculous. So anyway, if somebody feels like fighting it to actually get all the other ones to work, they will work, it's just annoying, so maybe somebody can get this working on Google Sheets too. So that's weight decay and Adam, and Adam is amazingly fast. Let's go back to this one. We don't tend to use optim.whatever and create the optimizer ourselves and all that stuff, because instead we use a Learner, but Learner is just doing those things for you. Again, there's no magic. So if you create a learner, you say here's my data bunch, here's my PyTorch nn.Module instance, here's my loss function, and here are my metrics. Remember, the metrics are just stuff to print out, that's it. Then you just get a
few nice things, like learn.lr_find starts working and it starts recording this, and you can say fit_one_cycle instead of just fit. But these things really help a lot. By using the learning rate finder I found a good learning rate, and then look at this: my loss here is 0.13, whereas here I wasn't getting much beneath 0.5. So these tweaks make huge differences, not tiny differences, and this is still just one epoch. Now, what does fit_one_cycle do? What does it really do? This is what it really does, and we've seen this chart on the left before. Just to remind you, this is plotting the learning rate per batch. Remember, Adam has a learning rate, and we use Adam by default, or a minor variation which we might try to talk about. So the learning rate starts really low, and it increases for about half the time, and then it decreases for about half the time. That's because at the very start we don't know where we are, so we're in some part of function space that's just bumpy as all hell. If you start jumping around, those bumps have big gradients and they'll throw you into crazy parts of the space. So it starts low, and then you'll gradually move into parts of the weight space that are kind of sensible, and as you get to the points where they're sensible you can increase the learning rate, because the gradients are actually in the direction you want to go. And then, as we've discussed a few times, as you get close to the final answer you need to anneal your learning rate to hone in on it. But here's the interesting thing: on the right is the momentum plot, and actually every time our learning rate is small our momentum is high. Why is that?
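To make the shape of those two curves concrete, here is a simplified sketch in plain Python. It assumes straight-line ramps and an exact half-and-half split, and the `one_cycle` helper, its parameter names, and the numeric defaults are illustrative assumptions; the schedule fastai actually uses may differ in its proportions and curve shape.

```python
def one_cycle(n_batches, lr_max=1e-2, lr_min=1e-3, mom_max=0.95, mom_min=0.85):
    """Return per-batch learning rates and momentums for a simplified one-cycle schedule.

    The learning rate rises linearly for the first half and falls for the second;
    momentum moves in the opposite direction (high when the learning rate is low).
    Requires n_batches >= 2.
    """
    half = n_batches // 2
    lrs, moms = [], []
    for i in range(n_batches):
        if i < half:
            frac = i / half                              # warm-up: 0 -> 1
        else:
            frac = 1 - (i - half) / (n_batches - half)   # anneal: 1 -> towards 0
        lrs.append(lr_min + frac * (lr_max - lr_min))
        moms.append(mom_max - frac * (mom_max - mom_min))
    return lrs, moms

lrs, moms = one_cycle(100)
peak = lrs.index(max(lrs))   # batch where the learning rate is highest
```

In this sketch the batch with the highest learning rate is also the batch with the lowest momentum, which is the mirror-image relationship the two plots show.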
Because if you do have a small learning rate but you keep going in the same direction, you may as well go faster. But if you're jumping really far, don't also jump really far with momentum, because it's going to throw you off. And then as you get to the end again you're fine-tuning, but actually if you keep going in the same direction again and again, go faster. So this combination is called one cycle, and it's a simple thing but it's astonishing: this can help you get what's called super convergence, which can let you train 10 times faster. This is just last year's paper, and some of you may have seen the interview with Leslie Smith that I did last week. Amazing guy, incredibly humble, and also, I should say, somebody who is doing groundbreaking research well into his 60s, and all of these things are inspiring. I'll show you something else interesting. When you plot the losses with fastai it doesn't look like that, it looks like that. Why is that? Because fastai calculates the exponentially weighted moving average of the losses for you. So this concept of exponentially weighted stuff is just really handy and I use it all the time, and one of the things it does is make it easier to read these charts. It does mean that these charts from fastai might be a batch or two behind where they should be. There's that slight downside when you use an exponentially weighted moving average: you've got a little bit of history in there as well. But it can make it much easier to see what's going on. So we're now at a point, coming to the end of this collab and tabular section, where we're going to try to understand all the code in our tabular model. Remember the tabular model uses this dataset called adult, which is trying to predict who's going to make more money. It's a classification problem, and we've got a number of categorical variables and a number of continuous variables. So the first thing we realize is we actually don't know how to predict a categorical variable
yet, because so far we did some hand waving around the fact that our loss function was nn.CrossEntropyLoss. What is that? Let's find out, and of course we're going to find out by looking at Microsoft Excel. So cross entropy loss is just another loss function. We already know one loss function, which is mean squared error, y hat minus y, squared. That's not a good loss function for us, because in our case we have, for MNIST, 10 possible digits, and we have 10 activations, each with a probability of that digit. So we need something where predicting the right thing correctly and confidently should have very little loss, and predicting the wrong thing confidently should have a lot of loss. That's what we want. So here's an example. Here is cat versus dog, one hot encoded, and here are my two activations for each one from some model that I built: probability cat, probability dog. This one's not very confident of anything. This one's very confident of it being a cat, and it's right. This one's very confident of it being a cat, and it's wrong. So we want a loss that for this one should be a moderate loss, because not predicting anything confidently is not really what we want, so here's a 0.3. This thing's predicting the correct thing very confidently, so 0.01. This thing's predicting the wrong thing very confidently, so 1. So how do we do that? This is the cross entropy loss, and it is equal to minus whether it's a cat multiplied by the log of the probability of cat (well, this is actually an activation, so I should say the log of the cat activation), minus whether it's a dog times the log of the dog activation, and that's it. So in other words, it's minus the sum of all of your one hot encoded variables times the logs of your activations. Interestingly, these ones here are exactly the same numbers as these ones here, but I've written it differently. I've written it with an if function, because it's exactly the same, because the zeros don't actually add anything. So actually it's exactly
the same as saying if it's a cat then take the log of cattiness, and otherwise, if it's a dog, take the log of one minus cattiness, in other words of dogginess. So the sum of the one hot encoded values times the log activations is the same as an if function. And if you think about it, because this is just a matrix multiply, we now know from our embedding discussion that it's the same as an index lookup. So to do cross entropy you can also just look up the log of the activation for the correct answer. Now, that's only going to work if these rows add up to one, and this is one reason that you can get screwy cross entropy numbers; this is why I said you pressed the wrong button. If they don't add up to one, you've got trouble. So how do you make sure that they add up to one? You make sure they add up to one by using the correct activation function in your last layer, and the correct activation function to use for this is softmax. Softmax is an activation function where all of the activations add up to one, all of the activations are greater than zero, and all of the activations are less than one. So that's what we want, that's what we need. How do you do that? Well, let's say we were predicting one of five things: cat, dog, plane, fish, building, and these were the numbers that came out of our neural net for one set of predictions. What if I did e to the power of that? That's one step in the right direction, because e to the power of something is always bigger than zero, so there's a bunch of numbers that are always bigger than zero. Here's the sum of those numbers. Here is e to the number divided by the sum of e to the number. Now this number is always less than one, because all of the things were positive, so you can't possibly have one of the pieces be bigger than a hundred percent of its sum. And all of those things must add up to one, because each one of them was just that percentage of the total. So that's it. So this thing, softmax, is equal to e to
the activation divided by the sum of e to the activations, and that's called softmax. So when we're doing single label multi-class classification, you generally want softmax as your activation function and you generally want cross entropy as your loss. And because these things go together in such friendly ways, PyTorch will do them both for you. You might have noticed that in this MNIST example I never added a softmax here, and that's because if you ask for cross entropy loss it actually does the softmax inside the loss function. So it's not really just cross entropy loss, it's actually softmax then cross entropy loss. You've probably noticed this, but sometimes your predictions from your models will come out looking more like this, pretty big numbers with negatives in, rather than this, numbers between 0 and 1 that add up to 1. The reason would be that it's a PyTorch model that doesn't have a softmax in it because we're using cross entropy loss, and so you might have to do the softmax for it. fastai is getting increasingly good at knowing when this is happening. Generally, if you're using a loss function that we recognize, when you get the predictions we will try to add the softmax in there for you. But particularly if you're using a custom loss function that might call nn.CrossEntropyLoss behind the scenes, or something like that, you might find yourself with this situation. We only have three minutes left, but I'm going to point something out to you, which is that next week, when we finish off tabular, which we'll do in like the first ten minutes, this is forward in tabular. It basically goes through a bunch of embeddings; it's going to call each one of those embeddings e, and you can use it like a function, of course. It's going to pass in each categorical variable to each embedding, it's going to concatenate them together into a single matrix, and it's going to then call a bunch of layers, which are basically a bunch of linear layers, and then it's going to do our
sigmoid trick. And then there's only two new things we need to learn: one is dropout and the other is bn_cont, which is batch norm. These are two additional regularization strategies. Basically, batch norm does more than just regularization, but amongst other things it does regularization. The basic ways you regularize your model are weight decay, batch norm, and dropout, and then you can also avoid overfitting using something called data augmentation. So batch norm and dropout we're going to touch on at the start of next week, and we're also going to look at data augmentation. Then we're also going to look at what convolutions are, and we're going to learn some new computer vision architectures and some new computer vision applications. But basically we're very nearly there. You already know how the entirety of collab.py, fastai.collab, works; you know why it's there and what it does. And you're very close to knowing what the entirety of the tabular model does. This tabular model is actually the one that, if you run it on Rossmann, you'll get the same answer that I showed you in that paper; you'll get that second place result. In fact, even a little bit better. I'll show you next week, if I remember, how I actually ran some additional experiments where I figured out some minor tweaks that can do even slightly better than that. So yeah, we'll see you next week. Thanks very much, and enjoy the smoke outside.
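As a recap of the softmax and cross entropy walkthrough from this lesson, here is a minimal sketch in plain Python. The activation values and probabilities are made up rather than taken from the spreadsheet, the function names are hypothetical, and subtracting the maximum before exponentiating is a common numerical-stability step that the lesson does not mention; it does not change the result.

```python
import math

def softmax(activations):
    """e^a divided by the sum of e^a, so outputs lie in (0, 1) and add up to 1."""
    m = max(activations)                       # stability shift (an assumption, see lead-in)
    exps = [math.exp(a - m) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_one_hot(probs, one_hot):
    """Minus the sum of one-hot target times log of the predicted probability."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def cross_entropy_lookup(probs, target_index):
    """Same loss written as an index lookup of the correct class."""
    return -math.log(probs[target_index])

# Five classes: cat, dog, plane, fish, building (activation values are invented).
activations = [-4.9, 1.5, 0.3, 2.0, -1.2]
probs = softmax(activations)

loss_a = cross_entropy_one_hot(probs, [0, 0, 0, 1, 0])   # target: fish
loss_b = cross_entropy_lookup(probs, 3)                   # same target by index

# Cat-vs-dog style examples where the true class is index 0 (numbers are invented).
unsure = cross_entropy_lookup([0.5, 0.5], 0)
confident_right = cross_entropy_lookup([0.98, 0.02], 0)
confident_wrong = cross_entropy_lookup([0.1, 0.9], 0)
```

The one-hot form and the lookup form give the same number because the zeros in the one-hot vector contribute nothing to the sum, and the three two-class examples order themselves as confident and right, then unsure, then confident and wrong.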