Welcome back. We had a busy lesson last week, and I was thrilled to see that one of our master's students here at USF took what we learned about structured deep learning and turned it into a blog post, which, as I suspected, has been incredibly popular, because it's something people just didn't know about. It ended up getting picked up by the Towards Data Science publication, which I quite like; if you're interested in keeping up with what's going on in data science, it's a good Medium publication. So Kerem wrote about structured deep learning, introducing the basic ideas we learned about last week, and it got picked up quite widely. One thing I was pleased to see is that Sebastian Ruder, who I mentioned in last week's class as one of my favorite researchers, tweeted it, and then somebody from Stitch Fix said "oh yeah, we've actually been doing that for ages", which is kind of cute.
I've known this is happening in industry a lot, and I've been telling people it's happening a lot, but nobody had been talking about it. Now Kerem has published a blog post saying "hey, check out this cool thing", and Stitch Fix is like "yeah, we're doing that already". So that's been great to see, and I think there's still a lot more that can be dug into with this structured deep learning stuff. To build on top of Kerem's post, you could experiment with some different data sets: maybe find some old Kaggle competitions and see whether there are some you could now win with this, or some where it doesn't work, which would be equally interesting. You could also experiment with different amounts of dropout and different layer sizes, because nobody much has written about this; I don't think I've seen any blog posts about it anywhere, so there's a lot of unexplored territory we can build on. There's definitely a lot of interest; I saw one person on Twitter saying "this is what I've been looking for for ages".

Another thing I was pleased to see is from Nikhil, whose cricket-versus-baseball predictor and currency predictor we saw after lesson one. He went on to something a bit bigger: he downloaded a couple of hundred images of actors, used Google to find ones with glasses and ones without, and then manually went through and checked that they'd all been put in the right place. This is a good example of a case where vanilla ResNet didn't do so well with just the last layer retrained, so Nikhil tried unfreezing the layers and using differential learning rates, and got up to 100% accuracy. What I like about the things Nikhil is doing is that he's not downloading a Kaggle data set; he's deciding on a problem he wants to solve and going from scratch, from Google, and he's even got a link to a suggested way to help you download images from Google. I actually gave a talk just this afternoon at Singularity University to the executive team of one of the world's largest telecommunications companies, and I showed them this post, because the folks there were telling me that all the vendors who come to them say they need millions of images, huge data centers full of hardware, and special software that only those vendors can provide. I said: actually, this person has been doing a course for three weeks, and look at what he's just done with a computer that costs him 60 cents an hour. They were so happy to hear that this is actually within the reach of normal people. (I'm assuming Nikhil is a normal person; I haven't actually met him. If you're proudly abnormal, Nikhil, I apologize.) I went and had a look at his cricket classifier, and I was really pleased to see that his code is the exact same code we used in lesson one, which is what I was hoping; the only thing he changed was the number of epochs. So this idea that we can take those four lines of code and reuse them to do other things has definitely turned out to be true. These are good things to show at your organization. If your colleagues are anything like the executives at this big company I spoke to today, there will be not just surprise but almost pushback: "if this were true, somebody would have told us, so why isn't everybody doing this already?" So you might have to actually show them; maybe you can build your own version with some internal data you've got at work, and say "here it is, and it didn't cost me anything".

Vitaly (I'm not sure I'm pronouncing his name correctly) has done another very nice introductory post on how we train neural networks, and I wanted to point it out because I think he's one of the participants in this course with a particular knack for technical communication; I think we can all learn from his posts about good technical writing. What I particularly like is that he assumes almost nothing: he has a very chatty tone and describes everything, but he also assumes the reader is intelligent. He's not afraid to show a paper or an equation, but then he goes through and tells you exactly what that equation means. It's a nice mix of writing respectfully for an intelligent audience while not assuming any particular background knowledge.

Then I made the mistake earlier this week of posting a picture of my first-place position on the Kaggle plant seedlings competition, at which point five other fast.ai students posted pictures of themselves passing me over the next few days. This is the current leaderboard for the plant seedlings competition, and I believe the top six are all fast.ai students, or at worst fast.ai teachers. Oh look, James has just passed me; he was first. This is a really good example of what you can do: the competition has a small number of thousands of images, and most of the images are less than a hundred pixels by a hundred pixels. My approach was basically to run through the notebook we already have, pretty much with the defaults; it took me maybe an hour, and I think the other students did a little more than that, but not a lot more. Basically what this is saying is that these techniques work pretty reliably.
People that aren't using the fast.ai library are literally really struggling by comparison; I suspect nearly all of the top entries are fast.ai students, and you might have to go down quite a way to find one that isn't. I thought that was very interesting and really cool.

So today we're going to start what I'd call the second half of this course. The first half has been about getting through the applications we can use this for: here's the code you have to write, and here's a fairly high-level description of what it's doing, and we're now done with that bit. What we're going to do now is go in reverse: we'll go back over all of those exact same things, but this time we'll dig into the detail of each one, look inside the source code of the fast.ai library to see what it's doing, and try to replicate it. In a sense, there aren't going to be many more best practices to show you; I've shown you the best practices I know. But for us to build on top of them, to debug these models, and to come to part two where we'll try out some new things, it really helps to understand what's going on behind the scenes.

So the goal today is to create a pretty effective collaborative filtering model almost entirely from scratch. We'll use PyTorch as an automatic differentiation tool and a GPU programming tool, and not much else: we'll try not to use its neural net features, and we'll try not to use the fast.ai library any more than necessary. We only very quickly looked at collaborative filtering last time, so let's go back and have a proper look, using the MovieLens data set. The MovieLens data set is basically a list of ratings: a bunch of users represented by some ID, a bunch of movies represented by some ID, and a rating. It also has a timestamp, which I haven't actually ever tried to use; I guess it's the time at which that person rated that movie. So all we're going to use for modeling is three columns: user ID, movie ID, and rating. Thinking of it in structured-data terms, user ID and movie ID would be categorical variables, and rating would be the dependent variable. We can also grab a list of the names of the movies; we're not going to use it for modeling, but we can use it for looking at things later. You could use the genre information too; I haven't tried it, and I'd be interested if anybody tries it during the week and finds it helpful. I guess you might not find it helpful; we'll see.

To look at this better, I grabbed the users that have rated the most movies and the movies that have been rated the most, and made a cross-tab of it. This is exactly the same data, just a subset: rather than rows of user, movie, rating, we've got users down the side, movies along the top, and ratings in the cells. Some users haven't watched some of these movies, which is why some of the cells are NaN. I then copied that into Excel; you'll see there's a file called collabfilter.xls, and if you don't see it there now, I'll make sure it's there by tomorrow. That's where I've copied this table. As I go through this setup of the problem and how it's described, if you're ever feeling lost, feel free to ask, either directly or through the forum. If you ask through the forum and somebody answers there, I won't need to answer it here; if somebody else asks a question you'd like answered, just like it, and Yannet will keep an eye out for that. As we dig into the details of what's going on behind the scenes, it's important that at each stage you feel you can see what's going on.
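That kind of cross-tab is easy to build with pandas. A minimal sketch (the column names `userId`, `movieId`, and `rating` match the MovieLens CSV; the tiny table here is made up for illustration):

```python
import pandas as pd

# A made-up miniature version of the MovieLens ratings table
ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3],
    'movieId': [10, 20, 10, 30, 20],
    'rating':  [4.0, 3.5, 5.0, 2.0, 4.5],
})

# Users down the side, movies along the top, ratings in the cells;
# user/movie pairs with no rating show up as NaN
xtab = pd.crosstab(ratings.userId, ratings.movieId,
                   values=ratings.rating, aggfunc='mean')
print(xtab)
```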
We're actually not going to build a neural net to start with. Instead we're going to do something called matrix factorization, because it so happens there's a really simple way of solving these kinds of problems, which I'm going to show you. If I scroll down, what I've got here is the same table again, but this time these are my predictions rather than my actuals, and I'm going to show you how I created these predictions. So here are my actuals, here are my predictions, and down here we have our score, which is the square root of the average squared difference; in other words, this is the RMSE. On average, our randomly initialized model is out by 2.8.

So let me show you what this model is, by asking: how do we guess how much user 14 likes movie 27? The prediction here (which at this stage is still random) is 0.91. How are we calculating 0.91? We're taking the dot product of this vector here with this vector here: 0.71 times 0.19, plus 0.81 times 0.63, plus 0.74 times 0.31, and so forth. In linear algebra speak, because one of them is a column and one of them is a row, this is the same as a matrix product, so here I've used the Excel matrix-multiply function, and that's my prediction. Having said that, if the original rating doesn't exist at all, then I just set the prediction to zero, because there's no error in predicting something that hasn't happened. So every one of my predictions is not a neural net; it's a single matrix multiplication. In practice, that matrix multiplication is between this matrix and this matrix, and each prediction is a single element of the resulting product. I randomly initialized these; they're just random numbers I pasted in. So I've started off with two random matrices, and I've said: let's assume, for the time being, that every rating can be represented as the matrix product of those two.

In Excel you can actually do gradient descent. You have to go to your options, to the add-ins section, and check the box to turn it on; once you do, you'll see something called Solver. If I open Solver, it asks what your objective is, and you just choose the cell, in this case the cell that contains our root mean squared error. Then it asks what you want to change, and you can see we selected this matrix and this matrix. So it's going to do a gradient descent for us, changing those matrices to, in this case, minimize (because it says "min") this Excel cell; GRG Nonlinear is a gradient descent method. I'll say Solve, and you'll see it starts at 2.8, and down here the number is going down. It's not actually showing us what it's doing, but we can see the number going down. This has a neural-netty feel to it, in that we're doing a matrix product and a gradient descent, but we don't have a non-linear layer and we don't have a second linear layer on top, so we don't get to call this deep learning. When people do deep-learning-ish things, with matrix products and gradient descent, but it's not deep, they tend to call it shallow learning; so we're doing shallow learning here. I'm going to press Escape to stop it, because I'm sick of waiting, and you can see we've now got down to 0.39. So, for example, it guessed that movie 27 for user 72 would get a 4.44 rating, and it actually got a 4 rating, so you can see it's doing something quite useful.
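The same shallow model the spreadsheet fits can be sketched in a few lines of PyTorch: two small random factor matrices, predictions as their matrix product, an RMSE computed only over the cells that actually have ratings, and plain gradient descent on that objective. The 15x15 size and 5 factors mirror the spreadsheet; the ratings here are random stand-ins, and the learning rate and step count are arbitrary choices, not from the lesson.

```python
import torch

torch.manual_seed(0)
n_users, n_movies, n_factors = 15, 15, 5

# Stand-in ratings matrix with some missing entries (NaN), playing the
# role of the 15x15 cross-tab in the spreadsheet
ratings = torch.randint(1, 6, (n_users, n_movies)).float()
ratings[torch.rand(n_users, n_movies) > 0.8] = float('nan')
mask = ~torch.isnan(ratings)

# Two randomly initialized factor matrices, as in the spreadsheet
u = torch.randn(n_users, n_factors, requires_grad=True)
m = torch.randn(n_movies, n_factors, requires_grad=True)

losses = []
for _ in range(1000):
    pred = u @ m.t()                 # every prediction is a dot product
    err = (pred - ratings)[mask]     # no error where there's no rating
    loss = (err ** 2).mean().sqrt()  # RMSE, the Solver objective
    losses.append(loss.item())
    loss.backward()
    with torch.no_grad():            # a plain gradient descent step
        u -= 0.1 * u.grad
        m -= 0.1 * m.grad
        u.grad.zero_()
        m.grad.zero_()
print(losses[0], '->', losses[-1])   # the RMSE should fall substantially
```

Like the Solver run, nothing here is deep: one matrix product, one objective, gradient descent.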
So why is it doing something quite useful? One thing to note is that the number of things we're trying to predict is 225, while the number of parameters we're using to predict them is two 15-by-5 matrices, so 150. It's not as if we can just exactly fit the data; we actually have to do some kind of machine learning here. So what this is saying is that there does seem to be some way of making predictions in this way. For those of you that have done some linear algebra, this is actually a matrix decomposition; normally you would do it with an analytical technique, or with techniques specifically designed for this purpose, but the nice thing is that we can use gradient descent to solve pretty much everything, including this.

I don't like to think of it from a linear algebra point of view so much, though; I like to think of it from an intuitive point of view, which is this. Let's say movie 27 is Lord of the Rings part one, and we're trying to predict whether user 72 is going to like it. Conceptually, that movie has five numbers here, and we could say the first one is how much it's sci-fi and fantasy, the second is how recent a movie it is and how much special effects there are, and the one at the top might be how dialogue-driven it is. Say these five numbers represent particular things about the movie. If that were the case, then we could have the same five numbers for the user: how much does this user like sci-fi and fantasy, how much does this user like modern CGI-driven movies, how much does this user like dialogue-driven movies. If you then took the dot product, you'd expect to get a reasonable rating.

Now, the problem is that we don't have this information for each user, and we don't have it for each movie. So we just assume that this is a reasonable way of thinking about the system, and we use gradient descent to try to find these numbers. We call these things factors, because you can multiply them together to create the ratings; that's what a factor is in linear algebra. And we call them latent factors, because this is not a vector that we've named, understood, and entered manually. We've assumed that we can think of movie ratings as a dot product of some particular features of a movie and some particular features describing which users like those kinds of movies, and then we've used gradient descent to find numbers that work. That's basically the technique, and the entirety of it is in this spreadsheet. So that is collaborative filtering using what we call probabilistic matrix factorization, and as you can see, the whole thing is easy to do in an Excel spreadsheet; the entirety of it really is a single matrix multiplication, plus random initialization.

A question: Jeremy, we'd like to know whether it would be better to cap the predictions to between zero and five. Yeah, and we're going to do that later; there's a whole lot we can do to improve this, and this is our simplest possible starting point. What we're going to do now is implement this in Python and run it on the whole data set. Another question: how do you figure out how long these vectors should be; why is it five? So, something to think about: given that this is movie 49, and we're looking up a rating for movie 49, this is actually an embedding matrix, and this length is the dimensionality of the embedding. I'm not saying this as an analogy; it literally is an embedding matrix. We could have a one-hot encoding with a one in the 72nd position, look it up, and it would return this list of five numbers. So the question is really: how do we decide on the dimensionality of our embedding vectors? The answer is that we have no idea; we have to try a few things and see what works. The underlying idea is that you need to pick an embedding dimensionality big enough to reflect the true complexity of this causal system, but not so big that you have too many parameters, which could take forever to run or could overfit.

What does it mean when a factor is negative? A negative factor in the movie case would mean, say, this is not dialogue-driven; in fact, the dialogue is terrible. A negative for the user would be: I actually dislike modern CGI movies. So the range of values isn't zero upwards; it includes negatives. Is there a range, a maximum? No, there are no constraints at all here; these are just standard embedding matrices.

Another question: first, why can we trust these embeddings? If you take the number six, it can be expressed as one times six, or six times one, or two times three, or three times two. So you're saying we could reorder these five numbers, or the values themselves might be different, as long as the product comes out the same? Well, we're using gradient descent to find the best numbers, so once we've found a good minimum, yes, there are other sets of numbers, but they don't give you as good an objective value.
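The point above, that an embedding is literally a one-hot-encoded lookup, can be checked directly: multiplying a one-hot row vector into the factor matrix returns exactly the same five numbers as indexing into it. (The matrix here is random, purely for illustration.)

```python
import torch

torch.manual_seed(0)
n_movies, n_factors = 100, 5
movie_factors = torch.randn(n_movies, n_factors)  # the embedding matrix

idx = 72
one_hot = torch.zeros(n_movies)
one_hot[idx] = 1.0

# Matrix product with a one-hot vector == simple row lookup
via_matmul = one_hot @ movie_factors
via_lookup = movie_factors[idx]
print(torch.allclose(via_matmul, via_lookup))  # True
```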
And of course, we should really be checking that on a validation set, which we'll be doing in the Python version. The second question: when we have a new movie or a new user, do we have to retrain the model? That's a really good question, and there isn't a straightforward answer. Time permitting we'll come back to it, but basically you would need some kind of new-user model or new-movie model that you would use initially, and then over time, yes, you would have to retrain the model. I don't know if they still do it, but Netflix used to have this thing where, when you were first onboarded, they would ask what movies you like; you'd go through and mark a bunch of movies, and it would then train its model. Could you just find the nearest movie to the new movie you're trying to add? You could use nearest neighbors for sure, but the thing is that initially, at least in this case, we have no columns describing a movie. If you had something like the movie's genre, its release date, or who was in it, you could have some kind of non-collaborative-filtering model, and that's what I meant by a new-movie model: you'd have to have some kind of predictors.

A lot of this is going to look familiar, and the way I'm going to do it is again this top-down approach: we'll start by using a few features of PyTorch and fast.ai, and gradually redo it a few times in different ways, digging a little deeper each time. Regardless, we need a validation set, so we can use our standard cross-validation-indexes approach to grab a random set of IDs. This here is something called weight decay, which we'll talk about later in the course; for those of you that have done some machine learning, it's basically L2 regularization. And this is where we choose how big an embedding matrix we want.

So again, here's where we get our model data object from CSV, passing in that ratings file, which remember looks like this. Stuff tends to look pretty familiar after a while. Then you just have to pass in, effectively, what are your rows, what are your columns, and what are your values. In any collaborative filtering or recommendation system approach, there's basically a concept of a user and an item. They might not literally be users and items: in that Ecuadorian groceries competition, there are stores and items, and you're trying to predict how many things of this type you're going to sell at this store. Generally speaking, the idea is that you've got a couple of high-cardinality categorical variables and something that you're measuring, and you're conceptualizing it by saying we can predict the value with this dot product.

Interestingly, and this is relevant to that last question, an identical way to think about this is to say: when we're deciding whether user 72 will like movie 27, we're basically asking which other users liked the movies that user 72 liked, and which other movies were liked by people like user 72. It turns out these are two ways of saying the exact same thing. So what collaborative filtering is doing, conceptually, is to say: for this movie and this user, which other movies are similar to it, in the sense that similar people enjoyed them, and which people are similar to this person, based on liking the same kinds of movies. Any time there's an underlying structure like that, this kind of collaborative filtering approach is likely to be useful. So there are basically three pieces: the two parts of the thing you're factoring, and then the value, the dependent variable.

As per usual, we can take our model data and ask for a learner from it, and we need to tell it what size embedding matrix to use, what validation set indexes to use, what batch size to use, and what optimizer to use. We'll be talking more about optimizers shortly; we won't do Adam today, but we'll do it next week or the week after. Then we can go ahead and say fit, and it all looks pretty similar to usual. Interestingly, I only had to do three epochs; this kind of model seems to train super quickly. You can use the learning rate finder as per usual; all the stuff you're familiar with will work fine. And that was it: this took about two seconds to train, and there's nothing pre-trained here, this is from random scratch. So this is our validation set result, and it's a mean squared error, not a root mean squared error, so we can take the square root: the last time I ran it, it was 0.776, and the square root of that is about 0.88. There are some benchmarks available for this data set, and when I scrolled through, the best benchmark I could find, from a recommendation-system-specific library, was 0.91. So we've got a better loss in two seconds already, which is good. That's basically how you can do collaborative filtering with the fast.ai library without thinking too much. Now we're going to dig in and try to rebuild that; we'll try to get to the point where we're getting something around 0.77 or 0.78 from scratch. But if you want to do this yourself at home, without worrying about the detail, those three lines of code are all you need. So we can get the predictions in the usual way, and we can, for example, plot them with sns; sns is seaborn, a really great plotting library.
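For reference, the "three lines" look roughly like this in the fast.ai library of that era; I'm reconstructing the names and arguments from memory, so treat them as an approximation rather than the exact API. The executable part below is just the square-root step used to compare against the benchmark.

```python
import math

# Approximate fast.ai 0.7-era calls (from memory; names and arguments
# may differ slightly from the real API):
#   cf = CollabFilterDataset.from_csv(path, 'ratings.csv',
#                                     'userId', 'movieId', 'rating')
#   learn = cf.get_learner(n_factors, val_idxs, bs=64, opt_fn=optim.Adam)
#   learn.fit(1e-2, 3, wds=wd)

# fit() reports mean squared error; take the square root to get RMSE
mse = 0.776
rmse = math.sqrt(mse)
print(round(rmse, 3))  # 0.881, which beats the 0.91 benchmark
```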
Seaborn sits on top of matplotlib, or rather it leverages matplotlib, so anything you learn about matplotlib will help you with seaborn. It's got a few nice plots, like this jointplot, which here is plotting predictions against actuals: these are my actuals, these are my predictions, and you can see the shape, which is that as we predict higher numbers, the actuals really are higher. You can also see the histogram of the predictions and the histogram of the actuals. I'm plotting this just to show you another interesting visualization.

Could you please explain n_factors; why is it set to 50? It's set to 50 because I tried a few things and that's what worked. What it means is that it's the dimensionality of the embedding matrices; to think of it another way, rather than the five we had in Excel, it's 50. Jeremy, I have a question: suppose your recommendation system is more implicit, so you have zeros and ones instead of actual ratings. Basically we would then need to use a classifier instead of a regressor. You'd have to sample negatives, or something like that? If you just have ones, just implicit feedback, I'm not sure we'll get to that in this class. But what I will say is that in the case where you're doing classification rather than regression, we haven't actually built that into the library yet; maybe somebody this week wants to try adding it. It would only be a small number of lines of code: you'd have to change the activation function to a sigmoid, and change the criterion, the loss function, to cross-entropy rather than RMSE. Those are the only things you'd have to change, and that would give you a classifier rather than a regressor. So hopefully somebody this week will take up that challenge, and by the time we come back next week we'll have that working.

Okay, so I said that we're basically doing a dot product; a dot product is, I guess, the vector version of this matrix product. We're basically multiplying each of these things by each of these things and adding them together, so that's a dot product. Let's have a look at how we do that in PyTorch. We can create a tensor in PyTorch using this little capital T function; that's the fast.ai version, and the full version is torch.from_numpy or something like it, but I've got it set up so you can also pass in a list of lists. So this creates a torch tensor with 1, 2, 3, 4, and here's a torch tensor with 2, 2, 10, 10. I didn't say .cuda(), so they're not on the GPU; they're sitting on the CPU, just FYI. We can multiply them together: any time you have a mathematical operator between tensors in NumPy or PyTorch, it operates element-wise, assuming they're the same dimensionality, which they are; they're both two by two. So here we've got 2 times 2 is 4, 3 times 10 is 30, and so forth: that's a times b. Now think about what we actually want: 1 times 2 is 2, 2 times 2 is 4, and 2 plus 4 is 6, and that is the dot product of (1, 2) and (2, 2). Then 3 times 10 is 30, 4 times 10 is 40, and 30 plus 40 is 70. In other words, a times b, dot sum along the first dimension, summing up the columns within each row, gives the dot product of each row of a with each row of b. We could obviously do this with some kind of matrix multiplication approach, but I'm trying to do things with as little special-case stuff as possible. So that's what we're going to use for our dot products from now on.
we have um the data we have is not in that cross tab format so in excel we've got it in this cross tab format but we've got it here in this listed format user movie rating user movie rating so conceptually we want to be like looking up this user into our embedding matrix to find their 50 factors looking at that movie to find their 50 factors and then take the dot product of those two 50 long vectors so let's do that um to do it we're going to build um a layer our own custom neural net layer that's not going to be a neural net right so the the the more generic vocabulary we call this is we're going to build a pytorch module okay so a pytorch module is a very specific thing it's something that you can use as a layer in a neural net once you've created your own pytorch module you can throw it into a neural net um and a module works by assuming we've already got one say called model you can pass in some things in parentheses and it will calculate it right so assuming that we already have a module called dot product we can instantiate it like so to create our dot product object and we can basically now treat that like a function right but the thing is it's not just a function um because we'll be able to do things like take derivatives of it um stack them up together into a big um stack of neural network layers blah blah blah right so it's basically a function that we can kind of compose very conveniently so here how do we define a module which as you can see here returns a dot product well we have to create a python class right and so if you haven't done python o o before um you're going to have to learn um because all pytorch modules are written in python o o and it's one of the things I really like about pytorch is that it doesn't reinvent totally new ways of doing things like tensorflow does all the time uh and pytorch that you know really tend to use pythonic ways to do things so in this case how do you create you know some kind of new behavior you create a python 
class. "So Jeremy, suppose that you have a lot of data, not just a little bit of data you can fit in memory. Will you be able to use fastai to solve collaborative filtering?" Yes, absolutely. It uses mini-batch stochastic gradient descent, which does it a batch at a time. This particular version is going to create a pandas DataFrame, and a pandas DataFrame has to live in memory. Having said that, you can easily get 512 GB instances on Amazon, so if you had a CSV that was bigger than 512 GB, that would be impressive. If that did happen, I guess you would have to instead save it as a big bcolz array and create a slightly different version that reads from the bcolz array, which streams it in, or maybe from a dask DataFrame, which also streams. It would be easy to do. I don't think I've seen real-world situations where you have 512-gigabyte collaborative filtering matrices, but yes, we can do it. OK, now this next bit is PyTorch-specific: when you define the actual work to be done, which here is returning users times movies, summed, you have to put it in a special method called forward. The idea is that it's very likely you're putting this in a neural net, and in a neural net, the step where you calculate the next set of activations is called the forward pass, so this is doing the forward calculation. Calculating the gradients is called the backward calculation; we don't have to define that, because PyTorch calculates it automatically. So we just have to define forward. So we create a new class, we define forward, and there we write our definition of the dot product. That's it. Now that we've created this class definition, we can instantiate our model, call our model, and get back the numbers we expected. So that's how we create a custom PyTorch layer, and if you compare that to pretty much any other library around, this is way easier.
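A minimal sketch of that custom layer, assuming the two mini-batches arrive as arguments to forward:

```python
import torch
import torch.nn as nn

class DotProduct(nn.Module):
    # All the work goes in forward(); PyTorch derives the backward
    # pass automatically via autograd.
    def forward(self, u, m):
        return (u * m).sum(1)   # per-row dot product

model = DotProduct()            # instantiate the module...
a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[2., 2.], [10., 10.]])
out = model(a, b)               # ...and call it like a function
```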
Basically, I guess, that's because we're leveraging what's already in Python. So let's now create a more complex module. We're basically going to do the same thing: we'll have a forward again, with our users times movies, summed, but we're going to do one more thing beforehand, which is to create two embedding matrices and look up our users and our movies in those embedding matrices. So let's go through and do that. The first thing to realize is that the user IDs and the movie IDs may not be contiguous; maybe they start at a million and go to a million one thousand, say. If we just used those IDs directly to look up into an embedding matrix, we would have to create an embedding matrix of size one million one thousand, which we don't want to do. So the first thing I do is get a list of the unique user IDs and then create a mapping from every user ID to a contiguous integer. This thing I've done here, where I've created a dictionary which maps from every unique thing to a unique index, is well worth studying during the week, because it's super handy; it's something you very often have to do in all kinds of machine learning. I won't go through it here; it's easy enough to figure out, and if you can't figure it out, just ask on the forum. Anyway, once we've got the mapping from user to a contiguous index, we can then replace the user ID column with that contiguous index. pandas' apply applies an arbitrary function; in Python, lambda is how you create an anonymous function on the fly, and this anonymous function simply returns the index. Then we do the same thing for movies. After that, we have the same ratings table we had before, but our IDs have been mapped to contiguous integers, and therefore they are things we can look up in an embedding matrix. So let's get the count of our users and our movies.
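The id-to-contiguous-index trick might look like this sketch (the column names and toy ids here are assumptions, not the real MovieLens data):

```python
import pandas as pd

# Toy stand-in for the ratings table: ids are deliberately non-contiguous
ratings = pd.DataFrame({
    'userId':  [1000000, 1000003, 1000000],
    'movieId': [50, 99, 99],
    'rating':  [4.0, 3.0, 5.0],
})

# Map every unique id to a contiguous integer 0, 1, 2, ...
u_uniq = ratings['userId'].unique()
user2idx = {o: i for i, o in enumerate(u_uniq)}
ratings['userId'] = ratings['userId'].apply(lambda x: user2idx[x])

m_uniq = ratings['movieId'].unique()
movie2idx = {o: i for i, o in enumerate(m_uniq)}
ratings['movieId'] = ratings['movieId'].apply(lambda x: movie2idx[x])

n_users, n_movies = len(u_uniq), len(m_uniq)
```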
So now let's go ahead and try to create our PyTorch version of this. Earlier on, when we created our simplest possible PyTorch module, there was no state: we didn't need a constructor because we weren't saying how many users there are, or how many movies, or how many factors we want. Any time we want to do something like this, where we're constructing our module with this number of users and this number of movies, we need a constructor for our class. You create a constructor in Python by defining dunder init, __init__; it's a special name. This just creates a constructor. Again, if you haven't done OO before, you'll want to do some study during the week, but it's a pretty simple idea: this is just the thing that gets run when we create the object. Another special Python thing: when you create your own constructor, you have to call the parent class constructor, and if you want all of the cool behavior of a PyTorch module, you get it by inheriting from nn.Module, the neural net module. So by inheriting here and calling the superclass constructor, we now have a fully functioning PyTorch layer. Now we have to give it some behavior, and we give it behavior by storing some things in it. Here we're going to create something called self.u, for users, and that is going to be an embedding layer: the number of rows is n_users and the number of columns is n_factors. So that is exactly this: number of rows is n_users, number of columns is n_factors. Then we do the same thing for movies. So that's going to create these two randomly initialized arrays. However, when you randomly initialize an array, it's important to initialize it to a reasonable set of numbers, at a reasonable scale. If we randomly initialized them from nought to a million, then these
things would start out being billions and billions in size for a rating, and that's going to be very hard to do gradient descent on. So I just manually figured out roughly what size numbers are going to give me about the right ratings: we know our ratings are between about nought and five, so if we start out with values between about 0 and 0.05, then we're going to get ratings of about the right level. You can easily enough back-calculate that. In neural nets there are standard algorithms for doing that calculation, and the key algorithm is something called He initialization, from Kaiming He. The basic idea is that you set the weights equal to a normal distribution with a standard deviation that is basically inversely proportional to the number of things in the previous layer. So in this case, if you take that 0 to 0.05 and multiply it by the fact that you've got 50 things coming out of it, you're going to get something about the right size. PyTorch already has Kaiming He initialization built in; normally, in real life, we don't have to think about this, we can just call the existing initialization functions. But we're trying to do this all from scratch here, without any special stuff going on. There's quite a bit of PyTorch notation here: self.u we've already set to an instance of the Embedding class; it has a weight attribute which contains the actual embedding matrix. That embedding matrix is not a tensor, it's a variable. A variable is exactly the same as a tensor, in other words it supports the exact same operations as a tensor, but it also does automatic differentiation; that's all a variable is. To pull the tensor out of a variable, you get its data attribute, so this is now the tensor of the weight matrix of the self.u embedding.
Then something that's really handy to know is that all of the tensor functions in PyTorch let you stick an underscore at the end, and that means do it in place. So this says: create uniform random numbers of an appropriate size for this tensor, and don't return them, but actually fill in that matrix in place. That's a super handy thing to know about. It wouldn't be rocket science otherwise, we would have just used the non-in-place version; it saves us some typing and some screen noise, that's all. OK, so now we've got our randomly initialized embedding weight matrices. Now, for the forward, I'm actually going to use the same columnar model data that we used for Rossmann, so it's going to be passed both categorical variables and continuous variables. In this case there are no continuous variables, so I'm just going to grab the zeroth column out of the categorical variables and call it users, and the first column and call it movies. I'm kind of too lazy to create my own; well, not so much too lazy, since we do have a special class for this, but I'm trying to avoid creating a special class, so I'm just going to leverage this ColumnarModelData class. So we can grab our users and movies mini-batches, and remember, this is not a single user and a single movie, this is going to be a whole mini-batch of them. We can now look up that mini-batch of users in our embedding matrix u, and the movies in our embedding matrix m. This is exactly the same as doing an array lookup to grab the value at the user ID's index, but we're doing it a whole mini-batch at a time. And because PyTorch can do a whole mini-batch at a time with pretty much everything, we get a really easy speed-up: we don't have to write any loops, on the whole, to do everything through our mini-batch.
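Putting the constructor, the in-place uniform initialization, and the mini-batch forward together, the model so far might look like this sketch (the cats/conts calling convention follows the ColumnarModelData description above; the sizes are made up):

```python
import torch
import torch.nn as nn

class EmbeddingDot(nn.Module):
    def __init__(self, n_users, n_movies, n_factors=50):
        super().__init__()                          # required superclass call
        self.u = nn.Embedding(n_users, n_factors)   # users x factors
        self.m = nn.Embedding(n_movies, n_factors)  # movies x factors
        # trailing underscore = in place: fill the weight tensors with
        # small uniform random numbers rather than returning new tensors
        self.u.weight.data.uniform_(0, 0.05)
        self.m.weight.data.uniform_(0, 0.05)

    def forward(self, cats, conts):
        # column 0 of the categorical variables is the user index,
        # column 1 the movie index; conts is unused here
        users, movies = cats[:, 0], cats[:, 1]
        u, m = self.u(users), self.m(movies)        # whole-mini-batch lookup
        return (u * m).sum(1)                       # per-row dot product

model = EmbeddingDot(n_users=10, n_movies=7)
cats = torch.tensor([[0, 1], [3, 2]])               # a toy mini-batch of 2
preds = model(cats, None)                           # one prediction per row
```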
In fact, if you ever loop through your mini-batch manually, you don't get GPU acceleration; that's really important to know. You never want a for loop going through your mini-batch; you always want to do things a whole mini-batch at a time. Pretty much everything in PyTorch operates on a whole mini-batch at a time, so you shouldn't have to worry about it. And then here's our dot product, just like before. So, having defined that, I'm now going to say: my x values are everything except the rating and the timestamp in my ratings table, my y is my rating, and then I can just say, OK, let's grab a ModelData from a DataFrame using that x and that y, and here is our list of categorical variables. Then let's instantiate that PyTorch object, which we've now created from scratch. The next thing we need to do is create an optimizer. This is part of PyTorch; the only fastai thing here is this line, because I don't think showing you how to build datasets and data loaders is interesting enough, really. We might do that in part two of the course; it's actually so straightforward that a lot of you are already doing it on the forums, so I'm not going to show it in this part, but if you're interested, feel free to discuss it on the forums. I'm just going to take the thing that feeds in this data as a given, particularly because these things are so flexible: if you've got stuff in a DataFrame, you can just use this, you don't have to rewrite it. So that's the only fastai thing we're using. This is a PyTorch thing: optim is the module in PyTorch that gives us an optimizer; we'll be learning about that very shortly. It's the thing that's actually going to update our weights, which PyTorch calls the parameters of the model. So earlier on we said model equals EmbeddingDot and so on, and because EmbeddingDot derives from nn.Module, we get all of the PyTorch
module behavior, and one of the things we get for free is the ability to say .parameters(). That's pretty handy: it's the thing that automatically gives us a list of all of the weights in our model that have to be updated, and that's what gets passed to the optimizer. We also pass the optimizer the learning rate, the weight decay (which we'll talk about later), and momentum (which we'll also talk about later). One other thing that I'm not going to do right now, but we will do later, is write a training loop. The training loop is the thing that loops through each mini-batch and updates the weights, subtracting the gradient times the learning rate. There's a function in fastai which is the training loop, and it's pretty simple. Here it is: for epoch in epochs (this next bit just shows a progress bar, so ignore it), then for x, y in my training data loader: calculate the loss, print out the loss in the progress bar, call any callbacks you have, and at the end call the metrics on the validation set. So for each epoch, go through each mini-batch and do one step of our optimizer; step is basically going to take advantage of this optimizer, and we'll be writing that from scratch shortly. Notice we're not using a Learner here; we're just using a PyTorch module. So this fit thing, although it's part of fastai, is lower down the layers of abstraction: it's the thing that takes a regular PyTorch model. So if you ever want to skip as much fastai stuff as possible, say you've got some PyTorch model, you've got some code from the internet, and you basically want to run it but you don't want to write your own training loop, then this is what you want: call fastai's fit function. What you'll find is that the library is designed so that you can dig in at any layer of abstraction you like.
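The shape of that training loop, minus the progress bar, callbacks, and metrics, can be sketched on a toy problem (the model and data here are stand-ins, not the collaborative filtering model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)
model = nn.Linear(2, 1)               # toy stand-in model
# parameters() gives the optimizer every weight that needs updating;
# lr, weight_decay and momentum are the same knobs mentioned above
opt = torch.optim.SGD(model.parameters(), lr=0.1,
                      weight_decay=1e-5, momentum=0.9)

x = torch.randn(64, 2)
y = x.sum(1, keepdim=True)            # toy target: sum of the inputs

losses = []
for epoch in range(50):               # for each epoch...
    opt.zero_grad()                   # clear the old gradients
    loss = F.mse_loss(model(x), y)    # forward pass + loss
    loss.backward()                   # backward pass (autograd)
    opt.step()                        # one optimizer step
    losses.append(loss.item())
```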
At this layer of abstraction, you're not going to get things like stochastic gradient descent with restarts, or differential learning rates, or all the other stuff that's in the Learner. You could do it, but you'd have to write it all by hand yourself, and that's the downside of going down to this level of abstraction. The upside is that, as you saw, the code for this is very simple: it's just a simple training loop, and it takes a standard PyTorch model. So this is a good thing for us to use here. We just call it, and it looks exactly like what we're used to seeing: we get our validation and training loss for the three epochs. Now, you'll notice that we wanted something around 0.76, and we're not there, so in other words the default fastai collaborative filtering algorithm is doing something smarter than this, and we're going to try to do that too. One thing we can do, since we're calling this lower-level fit function with no learning rate annealing, is our own learning rate annealing. You can see here there's a fastai function called set_lrs: you can pass in a standard PyTorch optimizer and your new learning rate, and then call fit again. So this is how we can manually do a learning rate schedule, and you can see we've got a little bit better, 1.13, but we've still got a long way to go. OK, so I think what we might do is have a seven-minute break, and then we'll come back and try to improve this score a bit. For those who are interested, somebody was asking me at the break for a quick walkthrough, so this is totally optional, but if you go into the fastai library there's a model.py file, and that's where fit is, which we were just looking at. It goes through each epoch in epochs, then goes through each x and y in the mini-batch, and then it calls this step function. The step function is here, and you can see the key thing is that it calculates the output from the
model. And if you remember, with our dot product we didn't actually call model.forward, we just called model with parentheses, and that's because nn.Module, when you call it as if it's a function, passes the call along to forward. So that's what's happening there. The rest of this we'll learn about shortly: it's basically doing the loss function and then the backward pass. So for those who are interested, that gives you a bit of a sense of how the code is structured, if you want to look at it. And as I say, the fastai code is designed to have both world-class performance and to be pretty easy to read, so feel free to take a look at it, and if you want to know what's going on, just ask on the forums. If you think there's anything that could be clearer, let us know, because we're going to be digging into the code more and more. OK, so let's try to improve this a little bit, and let's start by improving it in Excel. You might have noticed that we've got the idea that user 72 likes sci-fi, modern movies, special effects, whatever, and movie number 27 is sci-fi with special effects and not much dialogue, but we're missing an important piece, which is that maybe user 72 is pretty enthusiastic on the whole and on average rates things highly, and maybe movie 27 is just a popular movie which on average gets rated higher. So what we'd really like is to add a constant for the user and a constant for the movie, and remember, in neural network terms we call that a bias. So we want to add a bias, and we can easily do that. If we go into the bias tab here, we've got the same data as before and the same latent factors as before, and I've just got one extra row here and one extra column here, and you won't be surprised that we now take the same matrix
multiplication as before, and we add in that, and we add in that. So that's our bias. Other than that, we've got exactly the same loss function over here, and just like before we can now go ahead and solve: our changing variables now include the biases, and if we say solve and leave it for a little while, it will come to a better result than we had before. OK, so that's the first thing we're going to do to improve our model. There are quite a few variables here, so just to make the code a bit shorter I've defined a function called get_emb, which takes a number of inputs and a number of factors, so the number of rows and columns of the embedding matrix, creates the embedding, and then randomly initializes it. (I don't know why I'm doing negative to positive here and zero last time; honestly, it doesn't matter much as long as it's in the right ballpark.) Then we return that initialized embedding. So now we need not just our users-by-factors matrix, which we're chucking into u, and our movies-by-factors matrix, which we're chucking into m, but also users-by-one, which we'll put into ub, the user bias, and movies-by-one, which we'll put into mb, the movie bias. This is just a list comprehension, going through each of the tuples, creating an embedding for each, and putting them into these things. So now our forward is almost the same as before: u times m, summed. It's actually a little confusing because we're doing it in two steps, so maybe to make it a bit easier, let's pull this out, put it up here, and put this in parentheses. Maybe that looks a little more familiar: u times m, summed, is the same dot product, and then here we're just adding in our user bias and our movie bias. .squeeze() is the PyTorch function that removes a unit axis, so that the bias broadcasts correctly; that's not going to make any sense if you haven't done broadcasting before. I'm not going to cover broadcasting in this course, because we're doing it in the machine learning course.
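The bias pieces just described might be set up like this (get_emb and the sizes are my reconstruction, so treat the names as assumptions):

```python
import torch.nn as nn

def get_emb(ni, nf):
    # ni rows, nf columns; initialize in place to a small random range
    e = nn.Embedding(ni, nf)
    e.weight.data.uniform_(-0.01, 0.01)
    return e

n_users, n_movies, n_factors = 10, 7, 5
# a list comprehension over (rows, cols) tuples: two factor matrices
# plus a one-column bias embedding for users and for movies
u, m, ub, mb = [get_emb(*o) for o in
                [(n_users, n_factors), (n_movies, n_factors),
                 (n_users, 1), (n_movies, 1)]]
```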
But basically, in short, broadcasting is what happens when you do something like this, where u times m is a matrix and the bias is a vector. How do you add a vector to a matrix? What it does is duplicate the vector so that it becomes the same size as the matrix, and the particular way it duplicates it, whether across columns or down rows, is governed by the broadcasting rules. The broadcasting rules are the same as NumPy's. PyTorch didn't actually used to support broadcasting; I was actually the guy who first added broadcasting to PyTorch, using an ugly hack, and then the PyTorch authors did an awesome job of supporting it properly inside the language, so now you can use the same broadcasting operations in PyTorch as in NumPy. If you haven't dealt with this before, it's really important to learn, because it's kind of the most important fundamental way to do computations quickly in NumPy and PyTorch; it's the thing that lets you avoid loops. Can you imagine if I had to loop through every row of this matrix and add the vector to every row? It would be slow, and it would be a lot more code. The idea of broadcasting actually goes all the way back to APL, a language designed in the 1950s by an extraordinary guy called Ken Iverson. APL was originally designed as a new type of mathematical notation; he has this great essay called Notation as a Tool of Thought, and the idea was that really good notation could actually make you think of better things, and part of that notation is this idea of broadcasting. I'm incredibly enthusiastic about it, and we're going to use it plenty, so either watch the machine learning lesson or google numpy broadcasting for more information. Anyway, it works reasonably intuitively: we can add the vectors to the matrix. All right, having done that, we're now going to do one more trick.
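The vector-plus-matrix behavior just described can be checked in a couple of lines (NumPy shown; PyTorch follows the same rules):

```python
import numpy as np

m = np.array([[1., 2., 3.],
              [4., 5., 6.]])
v = np.array([10., 20., 30.])

# Broadcasting: v is (conceptually) duplicated down the rows of m,
# with no explicit loop
out = m + v
```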
I think it was Yannet who asked earlier whether we could squish the ratings to be between one and five, and the answer is we could. Specifically, what we could do is put the result through a sigmoid function. To remind you, a sigmoid function looks like that, and that asymptote is one. So we could take, say, 4.96 and put it through a sigmoid, and that's kind of high, so it would end up over here somewhere. Then we could multiply the result of that sigmoid by five, for example. In this case we want it to be between one and five, so maybe we'd multiply it by four and add one instead. So that's the basic idea, and here is that trick: we take the result, which is basically the thing that comes straight out of the dot product plus the biases, and put it through a sigmoid function. Now, in PyTorch, basically all of the functions you can apply to tensors are available inside this thing called capital F. This is totally standard in PyTorch: it's actually called torch.nn.functional, but everybody, including all of the PyTorch docs, imports torch.nn.functional as capital F. So F.sigmoid means a function called sigmoid coming from torch's functional module, and that's going to apply a sigmoid to the result, squishing it all between zero and one using that nice little shape. Then I can multiply that by five minus one, which equals four, and add on one, and that's going to give me something between one and five. Now, there's no need to do this; I could comment it out and it would still work, but then the model has to come up with a set of calculations that are always between one and five. Whereas if I leave this in, it makes it really easy: it's basically saying, if you think this is a really good movie, just calculate a really high number, and if it's a really crappy movie, calculate a really low number, and I'll make sure it ends up in the right region.
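That squishing trick, sketched as a function (torch.sigmoid is the same function the lecture reaches via F.sigmoid; the 1-to-5 range is the one discussed above):

```python
import torch

def squash(res, min_rating=1.0, max_rating=5.0):
    # sigmoid maps any score into (0, 1); scale by (max - min) and
    # shift by min to land in the desired rating range
    return torch.sigmoid(res) * (max_rating - min_rating) + min_rating

scores = torch.tensor([-10.0, 0.0, 10.0])   # raw dot-product-plus-bias scores
out = squash(scores)                        # close to [1.0, 3.0, 5.0]
```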
So even though this isn't a neural network, it's still a good example of a general principle: if you're doing any kind of parameter fitting, try to make it so that the thing you want your function to return is easy for it to return. That's why we do that squishing. So we call this EmbeddingDotBias, and we can create it in the same way as before. You'll see here I'm calling .cuda() to put it on the GPU; because we're not using any Learner stuff, which would normally handle that for us, we have to say manually: put it on the GPU. This is the same as before: create our optimizer, fit exactly as before, and these numbers are looking good. Again we'll make a little change to our learning rate schedule, and we're down to 0.8, so we're actually pretty close. So those are the key steps, and this is how most collaborative filtering is done. Yannet reminded me of an important point, which is that this is not, strictly speaking, a matrix factorization. Strictly speaking, a matrix factorization would take that matrix times that matrix to create this matrix, and remember that anywhere this is empty, like here or here, we'd be putting in a zero: we'd be saying, if the original was empty, put in a zero. Normally you can't avoid that with normal matrix factorization; normal matrix factorization creates the whole matrix, and that was a real problem when people used to try to use traditional linear algebra for this, because these matrices are sparse. In practice this particular matrix doesn't have many gaps, because we picked the users that watch the most movies and the movies that are most watched, but if you look at the whole matrix, it's mainly empty. Traditional techniques treated empty as zero, so you basically had to predict a zero, as if
the fact that I haven't watched a movie means I don't like the movie, and that gives terrible answers. This probabilistic matrix factorization approach takes advantage of the fact that our data structure actually looks like this list, rather than that crosstab: it only calculates the loss for the user-ID and movie-ID combinations that actually appear. So it's like: user ID 1, movie ID 1029 should be three; it's actually three and a half, so our loss is 0.5. There's nothing here that's ever going to calculate a prediction or a loss for a user-movie combination that doesn't appear in this table; by definition, the only stuff that can appear in a mini-batch is what's in this table. And a lot of this happened, interestingly enough, in the Netflix prize. Before the Netflix prize came along, this probabilistic matrix factorization had actually already been invented, but nobody noticed. Then, in the first year of the Netflix prize, someone wrote a really famous blog post where they basically said: hey, check this out, this incredibly simple technique works incredibly well, and suddenly all the Netflix leaderboard entries got much, much better. That's quite a few years ago now, and this is what every collaborative filtering approach does. Not every collaborative filtering approach adds this sigmoid thing, by the way; it's not rocket science, and it's not like the NLP thing we saw last week, which was a new state of the art. This is not particularly uncommon, but there are still people who don't do it, and it definitely helps a lot to have it. So actually, you know what, maybe now is a good time to have a look at the definition of this. The column data module contains all these definitions, and we can now compare this to the thing we originally used, which was whatever came out of CollabFilterDataset
right. So let's go to CollabFilterDataset; here it is. We called get_learner, so we can go down to get_learner, and that created a collaborative filtering learner, passing in the model from get_model. Here's get_model: it created an EmbeddingDotBias, and here is EmbeddingDotBias, and you can see it's the same thing: there's the embedding for each of the things; here's our forward, which does the u times i dot product, summed, plus the biases, then the sigmoid. So in fact we have just rebuilt, literally, what's in the fastai library. It's a little shorter and easier because it takes advantage of the fact that there's a special collaborative filtering dataset, so it gets passed the users and the items directly and doesn't have to pull them out of cats and conts, but other than that, this is exactly the same. So hopefully you can see that the fastai library is not some inscrutable code containing concepts you can never understand: we've actually just built this entire thing up from scratch ourselves. And why did we get 0.76 rather than 0.8? I think it's simply because it used stochastic gradient descent with restarts, with the cycle multiplier, and an Adam optimizer; a few little training tricks. Yes, Yannet? "So I'm looking at this and I'm thinking we could totally improve this model, maybe by looking at the date and doing some tricks with the date, because this is kind of just a regular model, in a way." Yeah, you can add more features, exactly. So now that you've seen this, even if the model isn't in a notebook you've written yourself but is some other model in fastai, you could look at it in fastai and be like, oh, that does most of the things I'd want to do, but it doesn't deal with time. And so you could just go, OK, let's grab it, copy it, pop it into my notebook, and let's create the better
version, and then you can start playing: you can now create your own model class from the open source code here. So yes, that suggests a couple of things we could do. We could try incorporating timestamps, so we could model the idea that a particular user tends to get more or less positive about movies over time; and remember there was the list of genres for each movie, so maybe we could incorporate that too. One problem is that it's a bit difficult to incorporate that stuff into this EmbeddingDotBias model, because it's pretty custom. So what we're going to do next is try to create a neural net version of this. The basic idea is that we take exactly the same thing as before: here's our list of users and here are their embeddings, and here's our list of movies and here are their embeddings. As you can see, I've just transposed the movie ones so that they're all in the same orientation, and here is our user-movie-rating data, but de-crosstabbed, in the original format, so each row is a user, movie, rating. The first thing I do is replace user 14 with that user's contiguous index, and I can do that in Excel using the MATCH function, which basically says how far down this list you have to go. It said user 14 was the first thing in that list, user 29 was the second thing in that list, and so forth. So this is the same as the thing we did in our Python code, where we basically created a dictionary to map this stuff. So now, for this particular user-movie-rating combination, we can look up the appropriate embedding, and you can see what it's doing here: it's basically offsetting from the start of this list, and the number of rows we go down is equal to the user index, and the number of columns we go across is one, two, three, four
or five. And you can see what it does: it creates 0.19, 0.63, 0.31; here it is, 0.19. So this is literally what an embedding does. But remember, this is exactly the same as doing a one-hot encoding, because if instead this were a vector containing one, zero, zero, zero, zero, and we multiplied that by this matrix, then the only row it would return would be the first one. So it's really useful to remember that an embedding actually just is a matrix product. The only reason it exists is as an optimization: it lets PyTorch know, OK, this is just a matrix multiply, but I guarantee you that this thing is one-hot encoded, therefore you don't have to actually do the matrix multiply, you can just do a direct lookup. That's literally all an embedding is: a computational performance trick for a particular kind of matrix multiply. All right, so that looks up the user's embedding, and then we can look up that user's movie. Here is movie ID 417, which apparently is index number 14, here it is, so it should be 0.75, 0.47, and yes it is: 0.75, 0.47. OK, so we've now got the user embedding and the movie embedding, and rather than taking a dot product of those two, which is what we did before, what if we concatenate the two together into a single vector of length 10 and feed that into a neural net? Any time we've got a tensor of activations (in this case it's a tensor of output activations, since it's coming out of an embedding layer) we can chuck it into a neural net, because neural nets, we now know, can calculate anything, including, hopefully, collaborative filtering. So let's try that.
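The embedding-is-just-a-one-hot-matrix-product point above can be verified directly (the numbers are reused from the Excel example, so treat them as illustrative):

```python
import numpy as np

# A tiny embedding matrix: one row of factors per user
emb = np.array([[0.19, 0.63, 0.31],
                [0.75, 0.47, 0.10]])

one_hot = np.array([1., 0.])    # "index 0", one-hot encoded
via_matmul = one_hot @ emb      # the full matrix product...
via_lookup = emb[0]             # ...equals a direct row lookup
```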
PyTorch already has a bias in it. So when we go nn.Linear — let's draw this out. We've got our U matrix, which is the number of users by the number of factors, and we've got our M matrix, which is the number of movies by, again, the number of factors. And remember, we look up a single user and we look up a single movie, grab them, and concatenate them together — so here's the user part, here's the movie part — and then we put that through a matrix product. The number of rows here is going to have to be the number of user factors plus the number of movie factors, because that's how long that concatenated vector is, and then the number of columns can be anything we want — in this case we're going to pick 10, apparently. So let's pick 10, and then we're going to stick that through a ReLU, and then stick that through another matrix, which obviously needs to have 10 rows here, and the number of columns is 1, because we want to predict a single rating. So that's our flow chart of what's going on. It's a standard — I would call it a one-hidden-layer neural net. It depends how you think of it: there's kind of an embedding layer, but because this is linear and this is linear, the two together are really one linear layer — it's just a computational convenience. So it's really got one hidden layer, because it's just got one layer before this nonlinear activation. In order to create a linear layer with some number of rows and some number of columns, you just go nn.Linear. In the machine learning class this week we learnt how to create a linear layer from scratch, by creating our own weight matrix and our own biases, so if you want to check that out, you can do so there — but it's the same basic technique we've already seen. So we create our embeddings, we create our two linear layers — that's all the stuff we need to start with. Really,
if I wanted to make this more general, I would have had another parameter here, called something like num_hidden=10, and then this would be a parameter, and then you could more easily play around with different numbers of activations. When we say, okay, in this layer I'm going to create this many activations, all I mean — assuming it's a fully connected layer — is that my linear layer has that many columns in its weight matrix; that's how many activations it creates. All right, so we grab our users and movies, we put them through our embedding matrices, and then we concatenate them together. torch.cat concatenates them on dimension one — in other words, we concatenate the columns together to create longer rows. Dropout we'll come back to in a moment; we've looked at it briefly. Then, having done that, we put it through that linear layer we had, and do our ReLU — and you'll notice that ReLU is again inside our capital F, nn.functional; it's just a function. Remember, activation functions are basically things that take one activation in and spit one activation out — in this case, take in something that can be negative or positive and truncate the negatives to zero. That's all ReLU does. And then here's our sequence, so that's that. That is now a genuine neural network — I don't know if we get to call it deep, it's only got one hidden layer, but it's definitely a neural network. And so we can now construct it, we can put it on the GPU, we can create an optimizer for it, and we can fit it. Now, you'll notice there's one other thing I've been passing to fit, which is what loss function we're trying to minimize — and this is the mean squared error loss, and again it's inside F; pretty much all the functions are inside F. So one of the things you have to pass fit is something saying how you score — what counts as good or bad.
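The network just described — two embedding lookups, concatenation, dropout, a linear layer, a ReLU, and a final linear output — can be sketched in PyTorch something like this. This is a sketch under assumptions, not the course's exact code: the class name, sizes, and dropout probabilities are made up to match the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """One-hidden-layer collaborative filtering net (illustrative sizes)."""
    def __init__(self, n_users, n_movies, n_factors=5, num_hidden=10,
                 p1=0.75, p2=0.75):
        super().__init__()
        self.u = nn.Embedding(n_users, n_factors)    # user embeddings
        self.m = nn.Embedding(n_movies, n_factors)   # movie embeddings
        # rows of the weight matrix = length of the concatenated vector
        self.lin1 = nn.Linear(n_factors * 2, num_hidden)
        self.lin2 = nn.Linear(num_hidden, 1)         # predict a single rating
        self.drop1, self.drop2 = nn.Dropout(p1), nn.Dropout(p2)

    def forward(self, users, movies):
        # look up both embeddings and concatenate on dimension 1 (longer rows)
        x = torch.cat([self.u(users), self.m(movies)], dim=1)
        x = self.drop1(x)
        x = F.relu(self.lin1(x))     # truncate negatives to zero
        x = self.drop2(x)
        return self.lin2(x)

model = EmbeddingNet(n_users=100, n_movies=200)
users = torch.tensor([0, 1])
movies = torch.tensor([3, 4])
preds = model(users, movies)                          # shape (2, 1)
loss = F.mse_loss(preds, torch.tensor([[4.0], [3.0]]))  # what we hand to fit
```

From here you would construct an optimizer over `model.parameters()` and fit with `F.mse_loss` as the thing that says what counts as good or bad.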
"Jeremy, now that we have a real neural net, do we have to use the same number of embeddings for users and movies?" That's a great question — no, that's absolutely right, you don't. And we've got a lot of benefits here, because think about it: we're grabbing a user embedding, we're concatenating it with a movie embedding which maybe is some different size, but then also perhaps we looked up the genre of the movie — there's actually an embedding matrix of, say, number of genres by three or something — so we could then concatenate a genre embedding, and then maybe the timestamp goes in here as a continuous number. And then that whole thing we can feed into our neural net. And then at the end — remember, our final non-linearity was a sigmoid — so we can now recognize that this thing we did, where we did sigmoid times max rating minus min rating, plus blah blah blah, is actually just another non-linear activation function. And remember, in our last layer we generally use different kinds of activation functions. As we said, we don't need any activation function at all — we could just leave it out — but by not having any non-linear activation function, we're just making it harder. That's why we put the sigmoid in there as well. So we can then fit it in the usual way, and there we go. Interestingly, we actually got a better score than we did with our previous model, so it'll be interesting to try training this with stochastic gradient descent with restarts and see if it's actually better. You can play around with the number of hidden layers and the dropout and whatever else, and see if you can get a better answer than 0.76-ish. So, generally, if you are going deep into collaborative filtering at your workplace or whatever, this wouldn't be a bad way to go. I'd start out with, oh, okay,
here's a collaborative filtering data set: throw it into the fastai get_learner — there's not much I can send it, basically the number of factors is about the only thing I pass in — learn for a while, maybe try a few different approaches, and then: okay, that's how I go if I use the defaults; how do I make it better? And then I'd be digging into the code and saying, okay, what did Jeremy actually do here? This is actually what I want — and fiddle around with it. So one of the nice things about the neural net approach is that, as Yannet mentioned, we can have different numbers of embeddings, we can choose how many hidden activations, and we can also choose dropout. So what we're actually doing is we haven't just got ReLU, but we're also saying, okay, let's delete a few things at random — that's dropout. In this case we were deleting, after the first linear layer, 75 percent of them, and then after the second linear layer, 75 percent of them. So we can add a whole lot of regularization here. And it kind of feels like this embedding net is something you could change again. We could have it so that we pass things into the constructor — if we wanted to make it look as much as possible like what we had before, we could pass in ps=[0.75, 0.75]. I'm not sure this is the best API, but it's not terrible. Probably, since we've got exactly two layers, we could say p1=0.75, p2=0.75, and then this would be p1 and this would be p2. And if you wanted to go further, you could make it look more like our structured data learner: you could make this number of hidden a list, and then rather than creating exactly one hidden layer and one output layer, this could be a little loop that creates n hidden layers, each of the size you want. So this is all stuff you can play
with during the week if you want to. And I feel like if you've got a much smaller collaborative filtering data set, maybe you'd need more regularization or whatever; if it's a much bigger one, maybe more layers would help — I don't know. I haven't seen much discussion of this kind of neural network approach to collaborative filtering, but I'm not a collaborative filtering expert, so maybe it's around. That'd be an interesting thing to try. So the next thing I wanted to do was talk about the training loop — what's actually happening inside the training loop. At the moment we're basically passing off the actual updating of the weights to PyTorch's optimizer, but what I want to do is understand what that optimizer is actually doing, and I also want to understand what this momentum term is doing. You'll find we have a spreadsheet called graddesc — gradient descent — and it's designed to be read, sorry, right to left, worksheet-wise. So the rightmost worksheet is some data, and we're going to implement gradient descent in Excel — because obviously everybody wants to do deep learning in Excel. We've done collaborative filtering in Excel, we've done convolutions in Excel, so now we need SGD in Excel, so we can replace Python once and for all. So let's start by creating some data. Here I've got one column of x's and one column of y's, and these are directly linearly related: this one is random, and this one here is equal to x times 2 plus 30. So let's try and use Excel to take that data and learn those parameters — that's going to be our goal. Let's start with the most basic version of SGD. The first thing I'm going to do is run a macro, so you can see what this looks like: I hit run and it does five epochs, I do another five
epochs, do another five epochs. Okay, so the first one was pretty terrible — it's hard to see, so I just delete that first one to get better scaling. All right, so you can see it's actually pretty constantly improving the loss; this is the loss per epoch. So how do we do that? Let's reset it. Here are my x's and my y's, and what I do is start out by assuming some intercept and some slope — these are my randomly initialized weights. I have randomly initialized them both to one. You could pick a different random number if you like, but I promise that I randomly picked the number one, twice — there you go, it was a random number between one and one. So here is my intercept and slope; I'm just going to copy them over here, so you can literally see this is just =C1, and here is =C2. So I'm going to start with my very first row of data, x equals 14, y equals 58, and my goal is, after I look at this piece of data, to come up with a slightly better intercept and a slightly better slope. To do that, I need to first of all figure out which direction is down — in other words, if I make my intercept a little bit higher or a little bit lower, would it make my error a little bit better or a little bit worse? So let's start by calculating the error. To calculate the error, the first thing we need is a prediction, and the prediction is equal to the intercept plus x times the slope — that is our zero-hidden-layer neural network. And here is our error: it's equal to our prediction minus our actual, squared. So we could play around with this. I don't want my error to be 1849; I'd like it to be lower. So what if we set the intercept to 1.1? 1849 goes to 1840, okay, so a higher intercept would be better. What about the slope? If I increase that, it goes from 1849 to 1730, so a higher slope would be better as well — not surprising, because we know they should actually be 30 and 2. So one way to figure that out
in the spreadsheet is to do literally what I just did: add a little bit to the intercept and the slope and see what happens. That's called finding the derivative through finite differencing. So let's go ahead and do that. Here is the value of my error if I add 0.01 to my intercept — so C4 plus 0.01 — and then I just put that into my linear function, subtract my actual, and square it. And that causes my error to go down a bit. So increasing — which one is that? — increasing C4, the intercept, a little bit has caused my error to go down. So what's the derivative? The derivative is equal to how much the dependent variable changed, divided by how much the independent variable changed. And so there it is: our dependent variable changed by that minus that, and our independent variable changed by 0.01. So there is the estimated value of dE/db. Remember, when people talk about derivatives, this is all they're doing: they're saying, what's this value as we make this number smaller and smaller and smaller, as it limits to zero? I'm not smart enough to think in terms of derivatives and integrals and stuff like that, so whenever I think about this, I always think about an actual plus 0.01, divided by 0.01, because I just find that easier — just like I never think about probability density functions, I always think about actual probabilities: toss a coin, something happens three times, blah blah blah. And remember, it's totally fair to do this, because a computer is discrete, not continuous — a computer can't do anything infinitely small anyway, so it's actually got to be calculating things at some level of precision, and our brains kind of need that as well. So this is my version of Geoffrey Hinton's trick — to visualize things in more than two
dimensions, you just say "twelve dimensions" really quickly while visualizing it in two dimensions. This is my equivalent for derivatives: to think about derivatives, just think about division. And although the mathematicians say you can't do that, you actually can — if you think of dy/dx as being literally the change in y over the change in x, the calculations still work, all the time. Okay, so let's do the same thing now, changing my slope by a little bit. Here's the same thing, and you can see both of these are negative. So that's saying: if I increase my intercept, my loss goes down; if I increase my slope, my loss goes down. And the derivative of my error with respect to my slope is actually pretty high — which is not surprising, because the constant term is just being added, whereas the slope is being multiplied by 14. Now, finite differencing is all very well and good, but there's a big problem with finite differencing in high-dimensional spaces, and the problem is this — and this is one of those things where you don't need to learn how to calculate derivatives or integrals, but you do need to learn how to think about them spatially. Remember, we have some very high-dimensional vector — it's got, say, a million items in it — and it's going through some weight matrix of size, say, one million by a hundred thousand, and it's spitting out something of size a hundred thousand. And you need to realize: there isn't a single gradient here. For every one of the things in this input vector, there's a gradient in every direction — in every part of the output. So it actually has not a single gradient number, not even a gradient vector, but a gradient matrix. And that is a lot to calculate. I would literally have to add a little bit to this and see what happens to all of these, add a little bit to this,
see what happens to all of these, to fill in one column of this at a time. So that's going to be horrendously slow. So if you're ever thinking, oh, we can just do this with finite differencing, just remember: we're dealing with these very high-dimensional vectors, where all the concepts of this kind of matrix calculus are identical, but when you actually draw it out like this, you suddenly realize: for each number I could change, there's a whole bunch of numbers it impacts, and I have this whole matrix of things to compute. So your gradient calculations can take up a lot of memory, and they can take up a lot of time — we want to find some way to do this more quickly. And it's definitely well worth spending time studying these ideas about gradients — look up things like "Jacobian" and "Hessian"; they're the things you want to search for to get started. Unfortunately, people normally write about them with lots of Greek letters and blah blah blah, but there are some nice, intuitive explanations out there, and hopefully you can share them on the forum if you find them, because this is stuff you really need to understand — for when you're trying to train something and it's not working properly. Later on we'll learn how to look inside PyTorch to actually get the values of the gradients, and you need to know: okay, how would I plot the gradients? What would I consider unusual? These are the things that turn you into a really awesome deep learning practitioner: when you can debug your problems by grabbing the gradients and doing histograms of them, and knowing that you could plot, oh, each layer of my average gradient's getting smaller, or bigger, or whatever. Okay,
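To make that cost concrete, here's a tiny sketch of estimating a full Jacobian by finite differencing. Everything here is made up for illustration: the point is that you need one extra function call per input element, which is exactly why this doesn't scale to a layer with a million inputs.

```python
def jacobian_fd(f, x, eps=1e-6):
    """Estimate the Jacobian of f at x by nudging each input in turn.
    Cost: len(x) extra evaluations of f - hopeless for huge layers."""
    base = f(x)
    jac = []
    for i in range(len(x)):
        nudged = list(x)
        nudged[i] += eps               # nudge one input element...
        out = f(nudged)
        # ...and record how EVERY output moved in response
        jac.append([(o - b) / eps for o, b in zip(out, base)])
    return jac                          # len(x) rows by len(f(x)) columns

# A tiny "weight matrix" layer: 3 inputs -> 2 outputs (made-up numbers)
w = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
layer = lambda x: [sum(x[i] * w[i][j] for i in range(3)) for j in range(2)]

jac = jacobian_fd(layer, [1.0, 1.0, 1.0])
print(jac)   # recovers w, approximately, at 3 extra calls
```

For a linear layer the Jacobian is just the weight matrix itself, which the estimate recovers; for a million-by-hundred-thousand layer you would need a million extra forward passes, which is the slowness being described.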
so the trick to doing this more quickly is to do it analytically rather than through finite differencing. Analytically basically means there is a list — you probably all learned it at high school — there is literally a list of rules: for every mathematical function, this is the derivative of that function. You probably remember a few of them; for example, the derivative of x squared is 2x. And we actually have an x squared here, so here is our "two times". Now, the one I actually want you to know is not any of the individual rules — I want you to know the chain rule, which handles a function of a function of something. Why is this important? Well, that's a linear layer, and that's a ReLU, and then we can keep going backwards, etc. — a neural net is just a function of a function of a function of a function, where from the inside out it's basically linear, ReLU, linear, ReLU, ..., linear, sigmoid or softmax. And so, since it's a function of a function of a function, to calculate the derivative of the loss of your model with respect to the weights of your model, you're going to need to use the chain rule. And specifically, whatever layer you're up to — if I want to calculate the derivative here, I'm going to need to use all of these, because that's the function that's being applied — and that's why they call this backpropagation: the derivative gets propagated back through the layers. Basically, you can do it like this: let's call this inner part u; then the derivative is simply equal to the derivative of that, times the derivative of that — you just multiply them together. And so that's what backpropagation is. It's not that backpropagation is a new thing for you to learn, it's not a new algorithm: it is literally take the derivative of every one of your layers and multiply them all
together. So it doesn't really deserve a new name — "apply the chain rule to my layers" does not deserve a new name — but it gets one, because us neural network folks really need to seem as clever as possible. It's really important that everybody else thinks that we are way outside of their capabilities, so the fact that you're here means that we've failed, because you somehow think that you're capable. So remember, it's really important when you talk to other people that you say "backpropagation" and "rectified linear unit" rather than "multiply the layers' gradients" or "replace negatives with zeros". Okay. So here we are — I've just gone ahead and grabbed the derivative. Unfortunately there is no automatic differentiation in Excel yet, so I did the alternative, which is to paste the formula into Wolfram Alpha and get back the derivative. So there's the first derivative and there's the second derivative, analytically. We only have one layer in this tiny neural network, so we don't have to worry about the chain rule, and we should see that this analytical derivative is pretty close to our estimated derivative from the finite differencing — and indeed it is — and we should see that these ones are pretty similar as well, and indeed they are. Back when I implemented my own neural nets 20 years ago, I had to actually calculate the derivatives, and so I always had something that would check the derivatives using finite differencing. For those poor people who do have to write these things by hand, you'll still see that they have a finite differencing checker — so if you ever do have to implement a derivative by hand, please make sure that you have a finite differencing checker so that you can test it. All right, so there are our derivatives. We know that if we increase b, then we're going to get a slightly better loss, so let's increase b by a bit. How much should we increase it by? Well,
we'll increase it by some multiple of this, and the multiple we're going to choose is called a learning rate. So here's our learning rate: 1e-4. Our new value is equal to whatever it was before, minus our derivative times our learning rate. So b has gone from 1 to 1.01, and then for a we've done the same thing, so it's gone from 1 to 1.12. This is a special kind of mini-batch — a mini-batch of size 1 — and we call this online gradient descent; it just means a mini-batch of size 1. So then we can go on to the next one: x is 86, y is 202. This is my intercept and slope copied across from the last row; here's my new y prediction, here's my new error, here are my derivatives, here are my new a and b. So we keep doing that for every mini-batch of 1, until eventually we get to the end of an epoch. And then at the end of an epoch, we grab our intercept and slope and paste them back over here as our new values — there we are — and we can now continue again. So we're now starting with — oops, that's in the wrong spot; it should be paste special, transpose, values — all right, so there's our new intercept, there's our new slope. Possibly I got those the wrong way around, but anyway, you get the idea — and then we continue. So I recorded the world's tiniest macro, which literally just copies the final slope and puts it into the new slope, copies the final intercept and puts it into the new intercept, does that five times, and after each time grabs the root mean squared error and pastes it into the next spare area — and that is attached to this Run button. So that's going to go ahead and do that five times. And that's stochastic gradient descent in Excel. To turn this into a CNN, you would just replace this error function — and therefore this prediction — with the output of that convolutional example spreadsheet, and that would then be a CNN being trained with SGD. Now, the problem is that you'll see, when
I run this, that it's going very slowly. We know that we need to get to a slope of 2 and an intercept of 30, and you can see that at this rate it's going to take a very long time. And specifically, it keeps going in the same direction — so it's like, come on, take a hint: that's a good direction, please keep doing that, but more. That idea is called momentum. So on our next spreadsheet, we're going to implement momentum. What momentum does is the same thing — and to simplify this spreadsheet I've removed the finite differencing columns, but other than that it's just the same: we've still got our x's, our y's, our a's and b's, our predictions; our error is now over here, and here are our derivatives. Our new calculation — let's grab a particular row — our new calculation for our new a term, just like before, is equal to whatever a was before, minus — now, this time I'm not taking the derivative, I'm subtracting some other number times the learning rate. So what's this other number? This other number is equal to the derivative times one minus beta, plus beta times the thing just above it. So this is a linear interpolation between this row's derivative — this mini-batch's derivative — and whatever direction we went last time. In other words: keep going in the same direction as you were before, but update it a little bit. And in our Python just before, we had a momentum of 0.9. So you can see what tends to happen: our negative number gets more and more negative, all the way up to around negative 2000, whereas with our standard SGD approach our derivatives are all over the place — sometimes 700, sometimes negative 700, sometimes positive 100. So this is basically saying: if you've been going down for quite a while, keep doing that — until finally, here, it's like, okay, that seems to be far enough, so that's getting
less and less negative, until eventually we start going positive again. So you can kind of see why it's called momentum: once you start traveling in a particular direction for a particular weight, the wheels start spinning, and then once the gradient turns around the other way, it's like, oh, slow down — we've got this momentum — and then finally it turns back around. So when we do it this way, we can do exactly the same thing, and after five iterations we're at 89, whereas before, after five iterations, we were at 104 — and after a few more, let's do maybe 15, okay, it's well ahead of where the plain version got to. So it's a bit better — not heaps better; you can still see these numbers aren't exactly zipping along — but it's definitely an improvement, and it also gives us something else to tune, which is nice. If this is a reasonably well-behaved error surface — in other words, although it might be bumpy along the way, there's some overall direction, like imagine you're going down a hill with bumps on it — then the more momentum you've got, the more you're going to skip over the tops of the bumps. So we could say, okay, let's increase our beta up to 0.98 and see if that allows us to train a little faster — and whoa, look at that, it suddenly went straight to 82. So one nice thing about things like momentum is that it's another parameter you can tune to try and make your model train better. In practice, basically everybody does this — you look at any ImageNet winner or whatever, they all use momentum. And so, back over here, when we said use SGD, that basically means use the basic tab of our Excel spreadsheet, but momentum=0.9 means add in a 0.9 over here. And that's kind of your default starting point. So let's keep going and talk about Adam. Adam is something which — I was actually not right earlier on in this course: I
said we've been using Adam by default, and we actually haven't — I noticed we've actually been using SGD with momentum by default. And the reason is that, while Adam is much, much faster to learn with, as you'll see, there have been some problems: people haven't been getting quite as good final answers with Adam as they have with SGD with momentum. That's why you'll see all the ImageNet-winning solutions and so forth, and all the academic papers, always use SGD with momentum — and Adam seems to be a particular problem in NLP, where people really haven't got Adam working well at all. The good news is that it looks like this was solved two weeks ago. It turned out that the way people were dealing with the combination of weight decay and Adam had a nasty kind of bug in it, basically, and that's carried through to every single library. One of our students has actually just completed a prototype of adding this new version of Adam, called AdamW, into fastai, and he's confirmed that he's getting both the faster performance and also the better accuracy. So hopefully we'll have this AdamW in fastai, ideally before next week — we'll see how we go; very, very soon. So it is worth telling you about Adam, so let's talk about it. It's actually incredibly simple — but again, make sure you make it sound really complicated when you tell people, so that you can look clever. So here's the same spreadsheet again, and here are our randomly selected a and b — somehow, again, they're still one. Here's our prediction, here are our derivatives. So now, how do we calculate our new a and our new b? You can immediately see it's looking pretty hopeful, because even by row 10 we're seeing the numbers move a lot more — so this is looking pretty encouraging. So how are we calculating this? It's equal to our previous value of b, minus J8 — so we're going to
have to find out what that is — times our learning rate, divided by the square root of L8. So we're going to have to dig in and see what's going on. One thing to notice here is that my learning rate is way higher than it used to be, but then we're dividing it by this big number. So let's start by looking at what this J8 thing is. J8 is identical to what we had before: J8 is equal to the linear interpolation of the derivative and the previous direction. So that was easy — one part of Adam is to use momentum in the way we just defined. The second piece is to divide by the square root of L8. What is L8? It's another linear interpolation of something and something else — specifically, it's a linear interpolation of F8 squared, the derivative squared, along with the derivative squared from last time. In other words, we've got two pieces of momentum going on here: one is calculating the momentum version of the gradient, the other is calculating the momentum version of the gradient squared. And we often refer to this idea as an exponentially weighted moving average. In other words, it's basically equal to the average of this one, and the last one, and the one before, and the one before that — but we're multiplicatively decreasing the previous ones, because we're multiplying them by 0.9 times 0.9 times 0.9 times 0.9. And you actually see that, for instance, in the fastai code. If you look at fit, we don't just calculate the average loss — and we certainly don't just report the loss for every mini-batch, because that just bounces around so much. Instead I say: the average loss is equal to whatever the average loss was last time, times 0.98, plus the loss this time, times 0.02. So in other words, when the fastai library does the learning rate finder, or plots the loss, it's actually showing you the exponentially weighted moving average of the loss. So
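Putting those two pieces together, the update just described — previous b, minus J8 times the learning rate, divided by the square root of L8 — can be sketched in Python like this. This is a sketch, not the spreadsheet's exact cells: the betas and the gradient values are made up, and the published Adam also adds a bias-correction step that is omitted here for simplicity.

```python
def adam_step(b, grad, m, v, lr=1.0, beta1=0.9, beta2=0.9, eps=1e-8):
    """One Adam-style step: momentum of the gradient (like cell J8) and
    momentum of the gradient squared (like cell L8)."""
    m = beta1 * m + (1 - beta1) * grad         # EWMA of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2    # EWMA of the gradient squared
    b = b - lr * m / (v ** 0.5 + eps)          # b - J8 * lr / sqrt(L8)
    return b, m, v

b, m, v = 1.0, 0.0, 0.0            # "randomly" initialized to 1 again
for g in [-86.0, -90.0, -80.0]:    # made-up gradients, all pointing up
    b, m, v = adam_step(b, g, m, v)
print(round(b, 2))                  # -> 2.27: b climbs steadily
```

Notice the learning rate of 1.0 is way higher than the 1e-4 from before, exactly because it is then divided by the big sqrt-of-gradient-squared term.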
it's a really handy concept — it appears quite a lot. The other handy concept to know about is this idea that you've got two numbers, one of them multiplied by some value, the other multiplied by one minus that value: this is a linear interpolation of two values, and you'll see it all the time. For some reason, deep learning people nearly always use the letter alpha when they do this — so keep an eye out, if you're reading a paper, for alpha times blah blah blah, plus one minus alpha times some other blah blah blah. When people read papers, none of us reads everything in the equation — we look at it and go, oh, a linear interpolation. And this is something I was just talking to Rachel about yesterday: whether we could start trying to find a new way of writing papers where we literally refactor them. It'd be so much better if it were written as "linear interpolate blah with blah", because then you wouldn't need that pattern recognition. But until we convince the world to change how they write papers, this is what you have to do: you have to know what to look for. And once you do, suddenly the huge pages of formulas aren't bad at all. You often notice, for example, that two things in a formula might be totally identical, except one is at time t and one is at time t minus one or something — very often these big ugly formulas turn out to be really, really simple, if only they were refactored. Okay, so what are we doing with this gradient squared? What we were doing with the gradient squared is: we were taking the square root, and then we were adjusting the learning rate by dividing the learning rate by that. So, the gradient squared is always positive, and we're taking the exponentially weighted moving average of a bunch of things that are always positive, and then we're taking
So when is this number going to be high? It's going to be particularly high if the gradient has a lot of variation. If there's high variance in the gradient, then this g squared thing is going to be a really big number, whereas if the gradient is a pretty constant amount it's going to be smaller, because when you add things that are squared, the big ones jump out much more; if there wasn't much change, it's not going to be as big. So basically this number at the bottom is going to be high if our gradient is changing a lot. Now, what do you want to do if you've got something which is first negative, then positive, then small, then high? Well, you probably want to be more careful; you probably don't want to take a big step, because you can't really trust it. So when the variance of the gradient is high, we're going to divide our learning rate by a big number, whereas if our gradient is a very similar size all the time, then we probably feel pretty good about this step, so we're dividing by a small amount. This is called an adaptive learning rate. A lot of people have confusion about Adam; I've seen it on the forum, actually, where people ask, isn't there some kind of adaptive learning rate where somehow you're setting different learning rates for different layers or something? No, not really: all we're doing is keeping track of the average of the squares of the gradients, and using that to adjust the learning rate. So there's still one learning rate; in this case it's one. But effectively every parameter at every epoch is getting a bigger jump if the gradient has been pretty constant for that weight, and a smaller jump otherwise. And that's Adam. That's the entirety of Adam in Excel.
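Since the whole of Adam fits in a spreadsheet, it also fits in a few lines of plain Python. Here is my own minimal sketch for a single parameter (no bias correction, variable names mine; the two decay rates correspond to the `betas` tuple you pass to PyTorch's Adam):

```python
import math

def adam_step(w, grad, m, v, lr=1.0, betas=(0.9, 0.99), eps=1e-8):
    # Momentum: exponentially weighted moving average of the gradient.
    m = betas[0] * m + (1 - betas[0]) * grad
    # Second piece: moving average of the gradient squared.
    v = betas[1] * v + (1 - betas[1]) * grad ** 2
    # Adaptive step: divide the learning rate by sqrt of that average,
    # so a jumpy gradient history gives a smaller, more cautious step.
    w = w - lr * m / (math.sqrt(v) + eps)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adam_step(w, grad=2.0, m=m, v=v)
```

With a constant gradient, m divided by sqrt(v) stays close to one, so the step is close to the full learning rate; with a gradient that flips sign a lot, m shrinks towards zero while v stays large, and the step shrinks.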
So there's now no reason at all why you can't train ImageNet in Excel, because you've got access to all of the pieces you need. So let's try this out. Run... okay, that's not bad: five, and we're straight up to 29 and 2. So the difference between standard SGD and this is huge, and the key difference is basically that it figured out that we need to be moving this number much faster, and so it did. And you can see we've now got two different parameters: one is the momentum for the gradient piece, the other is the momentum for the gradient squared piece. When you want to change them in PyTorch, there's a thing called betas, which is just a tuple of those two numbers.

Jeremy, so, I think I understand this concept: when a gradient goes up and down, you're not really sure which direction to go, so you should kind of slow things down, and therefore you adjust the learning rate by that gradient. But how do you implement that? How far do you go? Maybe I missed something early on; do you set a number somewhere?

We divide, here: we divide the learning rate by the square root of the moving average of the squared gradients. That's where we use it.

I'm sorry, could you show where exactly?

Sure. D2 is the learning rate, which is one, and M27 is our moving average of the squared gradients, so we just go D2 divided by the square root of M27. That's it.

Okay, thanks. I have one question: the new method that you just mentioned, which is in the process of getting implemented, AdamW: how different is it from this?

Okay, let's do that. To understand AdamW we have to understand weight decay, and maybe we'll learn more about that later; let's see how we go now with weight decay. So the idea is that when you have lots and lots
of parameters, like we do with most of the neural nets we train, you very often have more parameters than data points, and regularization becomes important. We've learned how to avoid overfitting by using dropout, which randomly deletes some activations in the hope that the network will learn a more resilient set of weights. There's another kind of regularization we can use, called weight decay or L2 regularization, and it actually comes from a classic statistical technique. The idea is that we take our loss function, our squared-error loss function, say, and we add an additional piece to it. The additional piece we add is basically the square of the weights: so we'd say plus b squared plus a squared. That is weight decay, or L2 regularization. The idea is that now the loss function wants to keep the weights small, because increasing the weights makes the loss worse, and so it's only going to increase a weight if the loss improves by more than the amount of that penalty. In fact, to make this proper weight decay we then need some multiplier here. If you remember, back here we said weight decay equals wd, 5e-4, so to actually use the same weight decay I would have to multiply by 0.0005; that's now the same weight decay. If you have a really high weight decay, then it's going to set all the parameters to zero, so it will never overfit, because it can't set any parameter to anything. And as you gradually decrease the weight decay, a few more weights can actually be used, but the ones that don't help much it will still leave at zero or close to zero. So that's what weight decay is: it's literally changing the loss function to add in this sum of squares of weights, times some hyperparameter.
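For the simple y = ax + b model used in the spreadsheet, the modified loss can be sketched like this (my own illustration; the wd = 5e-4 default is the value quoted above):

```python
def l2_loss(a, b, xs, ys, wd=5e-4):
    # Ordinary mean squared error of the predictions...
    mse = sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    # ...plus the weight decay penalty: wd times the sum of squared weights.
    return mse + wd * (a ** 2 + b ** 2)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
loss = l2_loss(2.0, 0.0, xs, ys)  # a perfect fit, but the penalty is still nonzero
```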
The problem is that if you put that into the loss function, as I have here, then it ends up in the moving average of gradients and in the moving average of squares of gradients for Adam. So basically, when there's a lot of variation we end up decreasing the amount of weight decay, and when there's very little variation we end up increasing the amount of weight decay. We end up saying: penalize weights that are really high, unless their gradient varies a lot, which is never what we intended; that's just not the plan at all. So the trick with AdamW is that we basically remove weight decay from here, so it's not in the loss function, not in the g, not in the g squared, and we move it so that it's instead added directly into the update with the learning rate. In other words, we'd put the weight decay, or actually the gradient of the weight decay, in here, when we calculate the new a and b, so it never ends up in our g and g squared. That was a super fast description, which will probably only make sense if you listen to it three or four times on the video and then talk about it on the forum. If you're interested, let me know, and we can also look at Anand's code that implements this. The idea of using weight decay is that it's a really helpful regularizer: it's a way we can say, please don't increase any of the weight values unless the improvement in the loss is worth it. Generally speaking, pretty much all state-of-the-art models have both dropout and weight decay, and I don't claim to know how to set each one, or how much of each to use, other than to say it's worth trying both.

Can I ask, to go back to the idea of embeddings: is there any way to interpret the final user embeddings?

Absolutely, we're going to look at that next week.
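Coming back to AdamW for a second: the difference can be sketched side by side in plain Python (my own single-parameter illustration, no bias correction; `2 * wd * w` is the gradient of the wd times w squared penalty). In plain Adam the penalty's gradient is folded into the gradient before the moving averages; in AdamW the decay is applied directly in the weight update, so it never touches g or g squared:

```python
import math

def step(w, grad, m, v, lr, wd, decoupled, beta1=0.9, beta2=0.99, eps=1e-8):
    if not decoupled:
        grad = grad + 2 * wd * w      # plain Adam: decay enters m and v
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    w = w - lr * m / (math.sqrt(v) + eps)
    if decoupled:
        w = w - lr * wd * w           # AdamW: decay applied to the weight directly
    return w, m, v

w_adam, m_adam, _ = step(1.0, 2.0, 0.0, 0.0, lr=0.1, wd=0.01, decoupled=False)
w_adamw, m_adamw, _ = step(1.0, 2.0, 0.0, 0.0, lr=0.1, wd=0.01, decoupled=True)
# Only the plain-Adam version has the decay mixed into its moving averages.
```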
It's super fun. It turns out we'll learn what some of the worst movies of all time are; I think it's one of those old Scientology ones, like Battlefield Earth, that was the worst movie of all time according to our embeddings. At least we've learned something.

Do you have any recommendations for scaling the L2 penalty? Is that based on how wide the layers are, or how many nodes there are?

I have no suggestion at all. I kind of look for papers or Kaggle competitions or whatever that are similar, and frankly try to set it the same. It seems like in a particular area, such as computer vision object recognition, somewhere between 1e-4 and 1e-5 tends to work. Actually, in the AdamW paper, the authors point out that with this new approach it seems to be much more stable as to what the right weight decay amounts are, so hopefully when we start playing with it we'll be able to have some definitive recommendations by the time we get to part two.

All right, well, that's nine o'clock. So this week, practice the thing that you're least familiar with: if it's Jacobians and Hessians, read about those; if it's broadcasting, read about that; if it's understanding PyTorch OO, read about that. Try implementing your own custom layers, read the fastai layers, and talk on the forum about anything that you find weird or confusing. All right, see you next week.