Welcome back to lesson six. This is our penultimate lesson, believe it or not. A couple of weeks ago, in lesson four, I mentioned I was going to share that lesson with this terrific NLP researcher, Sebastian Ruder, which I did, and he said he loved it. Yesterday he released a new post called "Optimization for Deep Learning Highlights in 2017", in which he covered basically everything that we talked about in that lesson, with some very nice shout-outs to work that some of the students here have done. That includes when he talked about this separation of weight decay from the momentum term: he mentions the opportunities this allows in terms of improved software decoupling, and links to the commit from Anand Saha showing how to implement this in fastai. So fast.ai's code is being used as a bit of a role model now.

He then covers some of the learning rate tuning techniques that we've talked about. This is the SGDR schedule; it looks a bit different from what you're used to seeing because it's on a log curve, which is the way they show it in the paper. For more information he again links to two blog posts: the one from Vitaly on this topic, and again Anand Saha's blog post. So it's great to see that some of the work from fast.ai students is already getting noticed, picked up and shared. This blog post went on to reach the front page of Hacker News, so that's pretty cool, and hopefully more and more of this work will be picked up once this is released publicly.

Last week we were doing a deep dive into collaborative filtering, so let's remind ourselves what our final model looked like. In the end we rebuilt the model that's actually in the fastai library. We had this little get-embedding function that grabbed an embedding and randomly initialized the weights, for the users and for the items; that's
the generic term, and in our case the items are movies. We had the bias for the users and the bias for the items, and we had n factors as the embedding size for each one; of course, the biases just had a single one. Then we grabbed the user and item embeddings, multiplied them together, summed up each row, added on the bias terms, and popped that through a sigmoid to put it into the range that we wanted. So that was our model. One of you asked if we can interpret this information in some way, and I promised that this week we would see how to do that, so let's take a look.

We're going to start with the model we built here, where we just used the fastai library's CollabFilterDataset.from_csv and then .get_learner, and we fitted it: three epochs, 19 seconds, and we've got a pretty good result. What we can now do is analyze that model. You may remember that right back when we started we read in the movies.csv file, but that's just a mapping from the ID of the movie to the name of the movie, so we're just going to use that for display purposes so we can see what we're doing. Because not all of us have watched every movie, I'm going to limit this to the top 500, sorry, 3,000 most popular movies, so we have more chance of recognizing the movies we're looking at. Then I'll go ahead and change the movie IDs from MovieLens into the contiguous unique IDs that we're using, because that's what our model has.

Inside the learn object that we create, inside a learner, we can always grab the PyTorch model itself just by saying learn.model. I'm going to show you more and more of the code from now on, so let's take a look at the definition of model. It's a property. If you haven't seen a property before, a property is just something in Python which looks like a method when you define it, but which you can call without parentheses, as we do here. So it looks, when
you call it, like it's a regular attribute, but when you define it, it looks like a method. So every time you call it, it actually runs this code. In this case it's just a shortcut to grab something called .models.model. You may be interested to know what that looks like, so here's learn.models. The fastai model type is a very thin wrapper for PyTorch models. We could take a look at this CollabFilterModel and see what that is: it's only one line of code. We'll talk more about these in part two, but basically it's this very thin wrapper. One of the main things that fastai does is that we have this concept of layer groups, where when you say there are different learning rates, they get applied to different sets of layers. That's something that's not in PyTorch, so when you say "I want to use this PyTorch model", the one thing we have to do is say what our layer groups are. The details aren't terribly important, but in general, if you want to create a little wrapper for some other PyTorch model, you could just write something like this.

So to get inside that and grab the actual PyTorch model itself, it's models.model, and the learn object has a shortcut to that. We're going to set m to be the PyTorch model. When you print out a PyTorch model, it basically lists all of the layers that you created in the constructor. It's quite nifty when you think about the way this works: thanks to some very handy stuff in Python, we're able to use standard Python OO to define these modules and layers, and they basically register themselves with PyTorch automatically. So back in our EmbeddingDotBias we just had a bunch of things where we said each of these things is equal to these things, and then it
automatically knows how to represent that. You can see the name is u, and the name is just literally whatever we called it here, u, and then the definition is that it's this kind of layer. So that's our PyTorch model, and we can look inside it. If we say m.ib, that's referring to the embedding layer for an item, which is the bias layer. An item bias in this case is the movie bias, so each movie, and there are 9,000 of them, has a single bias element.

Now the really nice thing about PyTorch layers and models is that they all look the same: to use them, you call them as if they were a function. So we can go m.ib with parentheses, and that basically says "I want you to return the value of that layer", and that layer could be a full-on model. To actually get a prediction from a PyTorch model, I would just go m and pass in my variable. In this case it's m.ib, and I pass in my top movie indexes.

Now remember, models and layers require variables, not tensors, because they need to keep track of the derivatives, so we use this capital V to turn the tensor into a variable. It was just announced this week that PyTorch 0.4, which is the version after the one that's just about to be released, is going to get rid of variables, and we'll be able to use tensors directly to keep track of derivatives. So if you're watching this on the MOOC and you're looking at 0.4, you'll probably notice that the code doesn't have this V in it anymore. That will be pretty exciting when it happens, but for now we have to remember that if we're going to pass something into a model, we turn it into a variable first. And remember, a variable has a strict superset of the API of a tensor, so anything you can do to a tensor you can do to a variable, like add it up or take its log or whatever. So that's going to return a variable which consists of going through each of these movie IDs, putting
it through this embedding layer to get its bias, and that's going to return a variable. Let's take a look. Before I press Shift-Enter here, have a think about what I'm going to get: I've got a list of 3,000 movies going in, being turned into a variable and put through this embedding layer, so just have a think about what you expect to come out. And we have a variable of size 3,000 by 1. Hopefully that doesn't surprise you: we had 3,000 movies that we're looking up, and each one had a one-long embedding. So there's our 3,000-long result. You'll notice it's a variable, which is not surprising, because we fed it a variable so we get a variable back, and it's a variable that's on the GPU: .cuda.

We have a little shortcut in fastai, because we very often want to take variables, turn them into tensors and move them back to the CPU so we can play with them more easily. to_np is "to NumPy", and it does all of those things. It works regardless of whether it's a tensor or a variable, and regardless of whether it's on the CPU or GPU; it'll end up giving you a NumPy array. So if we do that, it gives us exactly the same thing as we just looked at, but now in NumPy form. That's a super handy thing to use when you're playing around with PyTorch.

My approach is that I try to use NumPy for everything, except when I explicitly need something to run on the GPU or I need its derivatives, in which case I use PyTorch. I find NumPy is often easier to work with. It's been around many years longer than PyTorch, and lots of things, like the Python Imaging Library, OpenCV and pandas, work with NumPy. So my approach is to do as much as I can in NumPy land, move to PyTorch when I'm finally ready to do something on the GPU or take a derivative, and then as soon as I can, put it back in NumPy. You'll see that the fastai library really works this
way: all the transformations and so on happen in NumPy, which is different from most PyTorch computer vision libraries, which tend to do as much as possible in PyTorch. I try to do as much as possible in NumPy.

We have a question here. Let's say we wanted to build a model on the GPU and train it, and then we want to bring this to production. Would we call to_np on the model itself, or would we have to iterate through all the different layers and then call to_np?

Yeah, good question. It's very likely that you want to do inference on a CPU rather than a GPU: it's more scalable, you don't have to worry about putting things in batches, and so on. You can move a model onto the CPU just by typing m.cpu(), and that model is now on the CPU. You can then also put your variable on the CPU by doing exactly the same thing, like so. Having said that, if your server doesn't have a GPU, or a CUDA GPU, you don't have to do this, because it won't put it on the GPU at all. So for inference on the server, if you're running it on some t2 instance or something, it'll work fine; it'll all run on the CPU automatically.

Quick follow-up: if we train the model on the GPU and then save those embeddings and the weights, would we have to do anything special to load them?

No, you won't. Well, it depends how much of fastai you're using, so I'll show you how you can do that in case you have to do it manually. One of the students figured this out, which is very handy. There's a load_model function, and you'll see what it does with torch.load: this is some magic incantation. Normally it has to load onto the same GPU it was saved on, but this will load it into whatever's available. So that was a handy discovery. Thanks for the great questions. To put that back on the GPU I'll need to say .cuda(), and now there we
go, I can run it again.

It's really important to know about the zip function in Python, which iterates through a number of lists at the same time. In this case I want to grab each movie along with its bias term so that I can pop them into a list of tuples. If I just go zip like that, it's going to iterate through each movie ID and each bias term, and I can use that in a list comprehension to grab the name of each movie along with its bias. Having done that, I can then sort. And here, as I told you, the John Travolta Scientology movie is the most negative, by quite a lot. If this was a Kaggle competition, Battlefield Earth would have won by miles. Look at this: seven, seven, seven, ninety-six. So here is the worst movie of all time according to IMDb.

It's interesting when you think about what this means, because this is a much more authentic way to find out how bad this movie is. Some people are just more negative about movies, and if more of them watch your movie, a highly critical audience, they're going to rate it badly. So if you take an average, it's not quite fair. What this is doing is saying: once we remove the fact that different people have different overall positive or negative experiences, and different people watch different kinds of movies, and we correct for all that, this is the worst movie of all time. So that's a good thing to know. This is how we can look inside our model and interpret the bias vectors.

You'll see here I've sorted by the zeroth element of each tuple by using a lambda. Originally I used this special itemgetter, which is part of Python's operator library and creates a function that returns the zeroth element of something, in order to save time. Then I realized that the lambda is only one more character to write than the itemgetter, so maybe we don't need to know this
after all. So it's really useful to make sure you know how to write lambdas in Python. This is a function, and the sort is going to call this function every time it decides whether this thing is higher or lower than that other thing, and this is going to return the zeroth element. Here's the same thing in itemgetter format, and here it is in reverse: Shawshank Redemption right at the top, and I definitely agree with that. Godfather, Usual Suspects; yeah, these are all pretty great movies. 12 Angry Men, absolutely. So there you go, that's how we can look at the bias.

The second piece to look at would be the embeddings. How can we look at the embeddings? We can do the same thing. Remember, i was the item embeddings, rather than ib, the item bias. We can pass in our list of movies as a variable, turn it into NumPy, and here are our movie embeddings: for each of the 3,000 most popular movies, here are its 50 embedding values. It's very hard, unless you're Geoffrey Hinton, to visualize a 50-dimensional space, so we'll turn it into a three-dimensional space.

We can compress high-dimensional spaces down into lower-dimensional spaces using lots of different techniques. Perhaps one of the most common and popular is called PCA, which stands for principal components analysis. It's a linear technique, but linear techniques generally work fine for this kind of embedding. I'm not going to teach you about PCA now, but I will say that in Rachel's computational linear algebra class, which you can get to from fast.ai, we cover PCA in a lot of detail. It's a really important technique. It turns out to be almost identical to something called singular value decomposition, which is a type of matrix decomposition that does turn up in deep learning from time to time, so it's somewhat worth knowing. If you are going to dig more into linear algebra, you know,
SVD and PCA, along with eigenvalues and eigenvectors, which are all slightly different versions of the same kind of thing, are all worth knowing. But for now, just know that you can grab PCA from sklearn.decomposition and say how much you want to reduce the dimensionality to. I want to find three components, and what this is going to do is find three linear combinations of the 50 dimensions which capture as much of the variation as possible but are as different from each other as possible. We would call this a lower-rank approximation of our matrix.

Then we can grab the components, which are going to be the three dimensions, and once we've done that we've got three by 3,000. We can now take a look at the first of them, and we'll do the same thing of using zip to look at each one along with its movie. And here's the thing: we don't know ahead of time what this PCA thing is. It's just a bunch of latent factors; it's the main axis in this space of latent factors. But what we can do is look at it and see if we can figure out what it's about. Given that Police Academy 4 is high up here along with Waterworld, whereas Fargo, Pulp Fiction and Godfather are high up here, I'm going to guess that a high value is not going to represent critically acclaimed movies or serious watching. What did I call this? Yeah, I called this "easy watching versus serious". This is how you have to interpret your embeddings: take a look at what they seem to be showing and decide what you think it means. So this is the principal axis in this set of embeddings.

We can look at the next one, so do the same thing and look at the index-one embedding. This one's a little bit harder to figure out, but with things like Mulholland Drive and Purple Rose of Cairo,
these look more like dialogue-driven ones, whereas things like Lord of the Rings, Aladdin and Star Wars look more like modern CGI ones. So you can imagine that this pair of dimensions probably represents a lot of the differences between how people rate movies. Some people like Purple Rose of Cairo type movies, you know, Woody Allen, the classics, and some people like these big Hollywood spectacles. Some people presumably like Police Academy 4 more than they like Fargo. So you can get the idea of what's happened: for a model which literally multiplies two things together and adds them up, it's learned quite a lot, which is cool. So that's what we can do with that, and then we could plot them if we wanted to; I just grabbed a small subset to plot on those first two axes.

All right, so that's that. Next I wanted to dig a layer deeper into what actually happens when we say fit. When we say learn.fit, what's it doing?

For something like the store model, is there a way to interpret the embeddings? Something like the Rossmann one?

Yes, we'll see that in a moment. Well, let's jump straight there. So for the Rossmann model, which predicts how much we are going to sell at each store on each date, this is from the paper by Guo and Berkhahn. It's a great paper, by the way, well worth reading and pretty accessible. I think any of you would at this point be able to at least get the gist of it, and much of the detail as well, particularly if you've also done the machine learning course. They make this point in the paper that what they call entity embedding layers, so an embedding of a categorical variable, are identical to a one-hot encoding
followed by a matrix multiply. So they're basically saying that if you've got three embeddings, that's the same as doing three one-hot encodings, putting each one through a matrix multiply, and then putting that through a dense layer, or what PyTorch would call a linear layer. One of the nice things here is that because this is, well, they thought it was the first paper, but I think it was actually the second paper to show the idea of using categorical embeddings for this kind of dataset, they really go into quite a lot of detail, right back to the detailed stuff that we learned about. So it's a second cut at thinking about what embeddings are doing.

One of the interesting things they did was to ask: after we've trained a neural net with these embeddings, what else could we do with it? They got a winning result with a neural network with entity embeddings, but then they said, you know what, we could take those entity embeddings, replace each categorical variable with the learnt entity embeddings, and feed that into a GBM. In other words, rather than passing a one-hot encoded version or an ordinal version into the GBM, let's replace the categorical variable with its embedding for the appropriate level for that row. So it's actually a way of doing feature engineering. The mean average percentage error for GBMs using just one-hot encodings was 0.15, but with the embeddings it was 0.11. Random forests without that were 0.16, and with that 0.108, nearly as good as the neural net.

So this is an interesting technique, because it means that in your organization you can train a neural net that has an embedding of stores, an embedding of product types, and an embedding of whatever other high-cardinality or even medium-cardinality categorical variables you have, and then
everybody else in the organization can chuck those into their GBM or random forest or whatever and use them. In fact, what this is saying is you can even use k-nearest neighbors with this technique and get nearly as good a result. So this is a good way of giving the power of neural nets to everybody in your organization without having them do the fast.ai deep learning course first; they can just use sklearn or R or whatever they're used to. Those embeddings could literally be in a database table, because if you think about it, an embedding is just an index lookup, which is the same as an inner join in SQL. If you've got a table of each product along with its embedding vector, you can literally do an inner join, and now you have every row in your table along with its product embedding vector. So this is a really useful idea, and GBMs and random forests learn a lot quicker than neural nets do, so even if you do know how to train neural nets, this is still potentially quite handy.

Here's what happened when they took the various states of Germany and plotted the first two principal components of their embedding vectors: here is where they were in that 2D space. And wackily enough, I've circled three cities in red here, and here I've circled the same three cities in Germany. Here I've circled in purple, sorry, blue; here are the blue ones, and here's the green, and here's the green. So it's actually drawn a map of Germany, even though it was never told anything about how far these states are from each other; the very concept of geography didn't exist. So that's pretty crazy. That was from their paper.

Here's another thing, which I think is also from their paper. They took every pair of places and looked at how far apart they are on a map versus how far apart they are in embedding space, and they got this
beautiful correlation. So again, apparently stores that are near each other physically have similar characteristics in terms of when people buy more or less stuff from them. I looked at the same thing for days of the week: here's an embedding of the days of the week from our model, and I just joined up Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday. I did the same thing for the months of the year, and again you can see here's winter, here's summer. So I think visualizing embeddings can be interesting. It's good to first of all check that you can see things you would expect to see, and then you could try to see things you didn't expect to see; you could try all kinds of clusterings or whatever. This is not something which has been widely studied at all, so I'm not going to tell you what the limitations of this technique are.

I've heard of other ways to generate embeddings, like skip-grams. I was wondering if you could say whether one is better than the other, using neural networks or skip-grams?

Skip-grams are quite specific to NLP. I'm not sure if we'll cover them in this course, but basically the original word2vec approach to generating embeddings was to say: we don't actually have a labeled dataset. All we have is, say, Google Books. So they have an unsupervised learning problem, an unlabeled problem, and the best way, in my opinion, to turn an unlabeled problem into a labeled problem is to invent some labels. What they did in the word2vec case was to say: here's a sentence with 11 words in it; let's delete the middle word and replace it with a random word. So originally it said "cat", and they say no, let's replace that with "justice".
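The word-replacement step just described can be sketched in a few lines of Python. This is only an illustration of the lecture's simplified description, not the actual word2vec training code: the function name make_fake_task_pairs, the position argument and the toy vocabulary are all made up for this example. It keeps the original sentence with label 1 and returns a copy with one word swapped for a random vocabulary word, labelled 0.

```python
import random

def make_fake_task_pairs(sentence, vocab, position=None, rng=None):
    """Build one (original, 1) and one (corrupted, 0) example from a sentence.

    Hypothetical helper for illustration: `position` is the index of the
    word to replace and defaults to the middle word.
    """
    rng = rng or random.Random()
    words = sentence.split()
    if position is None:
        position = len(words) // 2
    # Only consider replacements that actually change the sentence.
    candidates = [w for w in vocab if w != words[position]]
    corrupted = list(words)
    corrupted[position] = rng.choice(candidates)
    return [(words, 1), (corrupted, 0)]

sentence = "the cute little cat sat on the fuzzy mat"
vocab = ["justice", "banana", "river", "cat"]  # toy vocabulary (assumption)
pairs = make_fake_task_pairs(sentence, vocab, position=3, rng=random.Random(0))
for words, label in pairs:
    print(label, " ".join(words))
```

Running this over a large corpus would give a labelled dataset of real and fake sentences that any classifier could be trained on; the embeddings are a by-product of that training.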
Right, so before it said "the cute little cat sat on the fuzzy mat", and now it says "the cute little justice sat on the fuzzy mat". What they do is have one sentence which they keep exactly as is, and then they make a copy of it and do the replacement. Then they have a label where they say it's a one if it was unchanged, the original, and zero otherwise. So now you have something you can build a machine learning model on, and they went and built a machine learning model on this. The model was to try and find the fake sentences, not because they were interested in a fake sentence finder, but because as a result they now have embeddings that, just like we discussed, you can use for other purposes. And that became word2vec.

Now it turns out that if you do this as effectively a single matrix multiply rather than making a deep neural net, you can train it super quickly, and that's basically what they did. They decided to make a pretty crappy model, a shallow learning model rather than a deep model, with the downside that it's a less powerful model but a number of upsides. The first is that we can train it on a really large dataset, and also, really importantly, we end up with embeddings which have very linear characteristics, so we can add them together and subtract them and so on.

So there's a lot we can learn from that for other types of embedding, like categorical embeddings. Specifically, if we want categorical embeddings which we can draw nicely, and which we can expect to add and subtract and have behave linearly, and probably if we want to use them in nearest neighbors and so on, we should probably use shallow learning. If we want something that's going to be more predictive, we probably want to use a
neural net. And actually in NLP I'm really pushing the idea that we need to move past word2vec and GloVe, these linear-based methods, because it turns out that those embeddings are way less predictive than embeddings learnt from deep models. The language model that we learned about, which ended up getting state-of-the-art results on sentiment analysis, didn't use GloVe or word2vec. Instead we pre-trained a deep recurrent neural network, and we ended up with not just pre-trained word vectors but a full pre-trained model.

So it looks like to create embeddings for entities we need a dummy task?

Not necessarily a dummy task. In this case we had a real task: we created the embeddings for Rossmann by trying to predict store sales. And this isn't just for learning embeddings; for learning any kind of feature space you either need labeled data or you need to invent some kind of fake task.

So does the task matter? If I choose a task and train embeddings, and I choose another task and train embeddings, which one is...?

It's a great question, and it's not something that's been studied nearly enough. I'm not sure that many people even quite understand that when they say unsupervised learning nowadays, they nearly always mean fake-task labeled learning. So the idea of what makes a good fake task, I don't know that I've seen a paper on that. Intuitively, we need something where the kinds of relationships it's going to learn are likely to be the kinds of relationships that you care about. For example, in computer vision one fake task people use is to take some images and apply some unreal and unreasonable data augmentation, like recoloring them too much or whatever, and then ask the neural net to predict which one was augmented and which one was not. So I think it's a
fascinating area, and one which would be really interesting for people, maybe some of the students here, to look into further: take some interesting semi-supervised or unsupervised datasets, try to come up with some more clever fake tasks, and see whether it matters, and how much it matters. In general, if you can't come up with a fake task that you think seems great, I would say use the best you can; it's often surprising how little you need.

The ultimately crappy fake task is called the autoencoder, and the autoencoder is the thing which won the claims prediction competition that just finished on Kaggle. They had lots of examples of insurance policies where we knew how much was claimed, and then lots of examples of insurance policies where, I guess they must have been still open, we didn't yet know how much was claimed. So what they did was to start off by grabbing every policy. We take a single policy and put it through a neural net, and we try to have it reconstruct itself, but in at least one of the intermediate layers we make sure there are fewer activations than there were inputs. So let's say there were a hundred variables on the insurance policy; we'll have something in the middle that only has, like, 20 activations. When you're basically saying "reconstruct your own input", it's not a different kind of model and it doesn't require any special code. You can use any standard PyTorch or fastai learner; you just say my output equals my input. That's the most uncreative invented task you can create, and that's called an autoencoder. It works surprisingly well, to the point that it literally just won a Kaggle competition: they took the features that it learnt and chucked
them into another neural net and, yeah, won. Maybe if we have enough students taking an interest in this, we'll be able to cover unsupervised learning in more detail in part two, especially given this recent Kaggle win.

I think this may be related to the previous question. When training language models, is a language model trained on, for example, the arXiv data useful at all on the movie data, like the IMDb data?

Great question. I was just talking to Sebastian about this; Sebastian wrote about this this week, and we thought we'd try to do some research on it in January. Again, it's not well known. We know that in computer vision it's shockingly effective to train on cats and dogs and use that pre-trained network to do lung cancer diagnosis in CT scans. In the NLP world nobody much seems to have tried this. The NLP researchers I've spoken to about this, other than Sebastian, assume that it wouldn't work, and they generally haven't bothered trying. I think it would work great.

Since we're talking about Rossmann, I'll just mention that during the week I was interested to see how good this solution actually was, because I noticed that on the public leaderboard it didn't look like it was going to be that great. I also thought it would be good to see what it actually takes to use a test set properly with this kind of structured data. So if you have a look at Rossmann now, I've pushed some changes that run the test set through as well, so you can get a sense of how to do this. You'll see that basically every line appears twice, one for test and one for train, when we get there: test, train, test, train. Obviously you could do this in a lot fewer lines of code by putting all of the steps into a method and then passing either the train data frame or the test data frame to it. In this case, for teaching purposes, I wanted you to be able to see each step and to experiment to see
what each step looks like, but you could certainly simplify this code. So we do this for every data frame, and for some of these you can see I loop through the data frames, joined and joined test, so train and test. For this whole thing about the durations I basically put two lines here, one that says the data frame equals the train columns and one that says the data frame equals the test columns, and my basic idea is you'd run this line first, then skip the next one and run everything beneath it, and then you'd go back and run this other line and then run everything beneath it again. Some people on the forum were asking how come this code wasn't working this week, which is a good reminder that the code is not designed to be code that you always run top to bottom without thinking. You're meant to think: what is this code here, should I be running it right now? In the early lessons I tried to make it so you can run it top to bottom, but increasingly as we go along I make it more and more that you actually have to think about what's going on. Jeremy, you're talking about shallow learning and deep learning; could you define that a bit better? By shallow learning I think I just mean anything that doesn't have a hidden layer, so something that's like a dot product, a matrix multiply basically. Okay, so we end up with a training and a test version, and then everything else is basically the same. One thing to note, and a lot of the details of this we cover in the machine learning course, by the way, because it's not really deep learning specific, so check that out if you're interested in the details: we use apply_cats rather than train_cats to make sure that the test set and the training set have the same categorical codes. We also need to make sure that we keep track of the mapper, which is the thing that basically says what's the mean and standard deviation of each continuous
column, and then apply that same mapper to the test set. When we do all that, that's basically it; the rest is easy. We just have to pass in the test data frame in the usual way when we create our model data object, and then there are no changes through all of this: we train it in the same way, and once we've finished training we can call predict as per usual, passing in True to say this is the test set rather than the validation set, and pass that off to Kaggle. It was really interesting, because this was my submission: it got a public score of 103, which would put us in about 300-and-something-th place, which looks awful, and a private score of 107, which on the private leaderboard is about fifth. So if you're competing in a Kaggle competition and you haven't thoughtfully created a validation set of your own, and you're relying on public leaderboard feedback, this could totally happen to you, but the other way around: you'll be like, oh, I'm in the top 10, I'm doing great, and then, uh-oh. For example, at the moment in the icebergs competition, recognizing icebergs, a very large percentage of the public leaderboard set is synthetically generated data-augmentation data, totally meaningless, and so your validation set is going to be much more helpful than the public leaderboard feedback. So be very careful. Our final score here is within statistical noise of the actual third place, so I'm pretty confident that we've captured their approach, and that's pretty interesting. Something else to mention: there are quite a few nice kernels about Rossmann, and you can go back and see them; particularly if you're doing the groceries competition, go and have a look at the Rossmann kernels, because quite a few of them are higher quality than the ones for the Ecuadorian groceries competition. One of them, for example, showed how for particular stores, like store 85, the sales for non-Sundays
and the sales for Sundays looked very different, whereas there are some other stores where the sales on Sunday don't look any different, and you can get a sense of why you need these kinds of interactions. The one I particularly wanted to point out, which I think I briefly mentioned, shows something the third-place winners, whose approach we used, didn't notice, and it's this one. Here's a really cool visualization. Here you can see that this store is closed, and just after: oh my god, we ran out of eggs; and just before: oh my god, go and get the milk before the store closes. And here again: closed, bang. This third-place winner actually deleted all of the closed-store rows before they started doing any analysis. So remember how we talked about not touching your data unless you first analyze to see whether the thing you're doing is actually okay: no assumptions. In this case, I haven't tried it, but I'm sure they would have won otherwise, because although there weren't actually any store closures, to my knowledge, in the test set period, the problem is that their model was trying to fit to these really extreme things, and because it wasn't able to do that very well it was going to end up getting a little bit confused. It's not going to break the model, but it's definitely going to harm it, because it's trying to do computations to fit something which it literally doesn't have the data for. Yannet, can you pass that back there? All right, so that Rossmann model: again, it's nice to look inside to see what's actually going on, and I want to make sure you know how to find your way around the code so you can answer these questions for yourself. So it's inside ColumnarModelData. Now, we started out by saying: hey, if you want to look at the code for something, you can go question mark question mark like this and
oh, okay, I haven't got this read in, but you can use question mark question mark to get the source code for something. But obviously that's not really a great way, because often you look at that source code and it turns out you need to look at something else. For those of you that haven't done much coding, you might not be aware that the editor you're using almost certainly has the ability both to open up stuff directly over SSH and to navigate through it, so you can jump straight from place to place. I'm going to show you what I mean. If I want to find ColumnarModelData, and I happen to be using vim here, I can basically say tag ColumnarModelData and it will jump straight to the definition of that class. Then I notice here that, oh, it's actually building up a data loader, that's interesting; if I hit Ctrl-] it'll jump to the definition of the thing that was under my cursor, and after I've finished reading it for a while I can hit Ctrl-T to jump back up to where I came from. You get the idea. Or if I want to find every usage of ColumnarModelData in this file, I can hit star to jump to the next place it's used, and so forth. So in this case get_learner was the thing which actually got the model. We want to find out what kind of model it is, and apparently it uses a, oh, we're not using collaborative filtering, are we, we're using ColumnarModelData, sorry: ColumnarModelData's get_learner. Here you can see MixedInputModel is the PyTorch model, and then it wraps it in the StructuredLearner, which is the fastai learner type that wraps the data and the model together. So if we want to see the definition of this actual PyTorch model, I can hit Ctrl-] to see it, and here is the model, and nearly all of this we can now understand. So we've been passed a list of embedding
sizes. Sure. In the mixed model that we saw, does it always expect categorical and continuous together? Yes, it does, and behind the scenes, if there are none of the other type, the model data creates a column of ones or zeros or something. Okay, so if it is null it can still work? Yeah. It's kind of ugly and hacky, and hopefully we'll improve it, but you can pass in an empty list of categorical or continuous variables to the model data and it'll basically pass an unused column of zeros to avoid things breaking. I'm leaving fixing some of these slightly hacky edge cases because PyTorch 0.4, as well as getting rid of Variables, is also going to add rank-zero tensors, which is to say, if you grab a single thing out of a rank-one tensor, rather than getting back a number, which is qualitatively different, you're actually going to get back a tensor that just happens to have rank zero. It turns out that a lot of this kind of code is going to be much easier to write then, so for now it's a little bit more hacky than it needs to be. Jeremy, you talked about this a little bit before, but maybe it's a good time at some point to talk about how we can write something that is slightly different from what is in the library? Yeah, I think we'll cover that a little bit next week, but I'm mainly going to do that in part two. Part two is going to cover quite a lot of stuff; one of the main things we'll cover is what are called generative models, so things where the output is a whole sentence or a whole image, but I'll also dig into how to really either customize the fastai library or use it on more custom models. If we have time we'll touch on it a little bit next week. Okay, so for the learner we were passing in a list of embedding sizes, and as you can see, that embedding sizes list was literally just the number of rows and the number of columns in
each embedding. The number of rows was just coming from literally how many stores there are in the store category, for example, and the number of columns was just equal to that divided by two, with a maximum of 50. So that list of tuples was coming in, and you can see here how we use it: we go through each of those tuples, grab the number of categories and the size of the embedding, and construct an embedding, and so that's a list. One minor PyTorch-specific thing we haven't talked about before: remember how we said it registers your parameters, it registers your layers, so when we listed the model it actually printed out the name of each embedding and each bias? It can't do that if they're hidden inside a list; there has to be an actual nn.Module subclass. So there's a special thing called an nn.ModuleList, which takes a list and basically says: I want you to register everything in here as being part of this model. So that's just a minor tweak. So our mixed input model has a list of embeddings, and then I do the same thing for a list of linear layers. When I said here 1000, 500, this is saying how many activations I wanted for each of my linear layers, and so here I just go through that list and create a linear layer that goes from this size to the next size. So you can see how easy it is to construct not just your own model, but a model which you can pass parameters to and have it constructed on the fly, dynamically. Batch norm we'll talk about next week. This is initialization; we've mentioned Kaiming He initialization before, and we mentioned it last week. And then dropout, same thing: we have here a list of how much dropout to apply to each layer, so again here it just goes through each thing in that list and creates a dropout
layer for it. Okay, so we understand everything in this constructor except for batch norm, which we don't have to worry about for now. So that's the constructor, and then the forward is also all stuff we're aware of: go through each of those embedding layers that we just saw, and remember we just treat each one as a function, so call it with the i-th categorical variable, and then concatenate them all together. Put that through dropout, then go through each one of our linear layers and call it, apply ReLU to it, and apply dropout to it, and then finally apply the final linear layer. The final linear layer has this as its size, which is here, size one: there's a single unit, sales. And then of course at the end, I mentioned we would come back to this, if you passed in a y_range parameter then we're going to do the thing we just learned about last week, which is to use a sigmoid. This is a cool little trick, not just to make your collaborative filtering better; in this case my basic idea was that sales are going to be greater than zero and probably less than the largest sale they've ever had, so I just pass that in as y_range, and so we do a sigmoid and multiply the sigmoid by the range that I passed in. Hopefully we can find that here; yeah, here it is. So I actually said, hey, maybe the range is between zero and the highest times 1.2, because maybe in the next two weeks we have one bigger. This is again trying to make it a little bit easier for it to give us the kind of results that it thinks are right. So increasingly I'd love you all to try not to treat these learners and models as black boxes, but to feel like you now have the information you need to look inside them. And remember, you could then copy this class and paste it into a cell in a Jupyter notebook
and start fiddling with it to create your own versions. Okay, I think what I might do is take a bit of an early break, because we've got a lot to cover and I want to do it all in one big go. So let's take a break until 7:45, and then we're going to come back and talk about recurrent neural networks. All right, so we're going to talk about RNNs. Before we do, we're going to dig a little bit deeper into SGD, because I just want to make sure everybody's totally comfortable with SGD. So we're going to look at the lesson 6 SGD notebook, and we're going to look at a really simple example of using SGD to learn y = ax + b. What we're going to do here is create the simplest possible model, y = ax + b, and then we're going to generate some random data that looks like so: here's our x and here's our y. We're going to predict y from x, and we passed in 3 and 8 as our a and b, so we're going to try and recover that. The idea is that if we can solve something like this, which has two parameters, we can use the same technique to solve something with 100 million parameters, without any changes at all. So in order to find an a and a b that fit this, we need a loss function, and this is a regression problem because we have a continuous output, so for continuous-output regression we tend to use mean squared error. Obviously for all of this stuff there are implementations in NumPy and implementations in PyTorch; we're just doing stuff by hand so you can see all the steps. So there's MSE: y hat is what we often call our predictions, so y hat minus y, squared, mean; there's our mean squared error. So for example if we had 10 and 5 as our a and b, then there's our mean squared error, 3.25. So if we've got an a and a b, and we've got an x and a y, then
our mean squared error loss is just the mean squared error of our linear function, that's our predictions, and our y. So there's our loss for 10, 5, x, y. So that's a loss function. When we talk about combining linear layers and loss functions and optionally non-linear layers, this is all we're doing: we're putting a function inside a function. That's all. I know people draw these clever-looking dots and lines all over the screen when they're saying this is what a neural network is, but it's just a function of a function of a function. So here we've got a prediction function, being a linear layer, followed by a loss function, being MSE, and now we can say, oh well, let's just define this as MSE loss and we'll use that in the future. So there's our loss function, which incorporates our prediction function. All right, so let's generate 10,000 items of fake data, and let's turn them into Variables so we can use them with PyTorch, because Jeremy doesn't like taking derivatives, so we're going to use PyTorch for that. And let's create a random weight for a and for b, so a single random number each, and we want the gradients of these to be calculated as we start computing with them, because these are the actual things we need to update in our SGD. So here's our a and b: 0.029, 0.111. All right, so let's pick a learning rate, and let's do 10,000 epochs of SGD. In fact this isn't really SGD, it's not stochastic gradient descent; this is actually full gradient descent: each loop is going to look at all of the data, whereas stochastic gradient descent would be looking at a subset each time. So to do gradient descent we basically calculate the loss. Remember, we've started out with a random a and b, so this is going to compute some amount of loss, and then it's nice from time to time, and one way of saying 'from time to time' is if the epoch number mod a thousand is zero, so every
thousand epochs, just print out the loss and see how we're doing. So now that we've computed the loss, we can compute our gradients. Just remember, this thing here is both a single number that is our loss, something we can print, but it's also a Variable, because we passed Variables into it, and therefore it also has a method, .backward(), which means: calculate the gradients of everything that we asked it to, everything where we said requires_grad equals True. So at this point we now have a .grad property inside a and inside b, and here they are; here is that .grad property. So now that we've calculated the gradients for a and b, we can update them by saying a is equal to whatever it used to be minus the learning rate times the gradient. We use .data because a is a Variable, and a Variable contains a tensor in its .data property; again, this is going to disappear in PyTorch 0.4, but for now it's actually the tensor that we need to update. So update the tensor inside here with whatever it used to be minus the learning rate times the gradient. And that's basically it; that's basically all gradient descent is. So it's as simple as we claimed. There's one extra step in PyTorch, which is that you might have multiple different loss functions, or lots of output layers, all contributing to the gradient, and you have to add them all together. So if you've got multiple loss functions you could be calling loss.backward() on each of them, and what it does is add to the gradients, and so you have to tell it when to set the gradients back to zero. So that's where you just go: okay, set a's gradients to zero and set b's gradients to zero. And this is wrapped up inside the optim.SGD class: when we say optim.SGD and we just call .step(), it's just doing these updates for us, and when we call zero_grad it's just doing this for
us. And this underscore here: pretty much every function that applies to a tensor in PyTorch, if you stick an underscore on the end, means do it in place. So this is actually not going to return a bunch of zeros; it's going to change this in place to be a bunch of zeros. So that's basically it. We can look at the same thing without PyTorch, which means we actually do have to do some calculus. So if we generate some fake data again, we're just going to create 50 data points this time, just to make this fast and easy to look at. Let's create a function called update; we're just going to use NumPy, no PyTorch. So our predictions are equal to, again, linear, and in this case we're actually going to calculate the derivatives. The derivative of the squared error is just two times the error, and then the derivative with respect to a is just that times x; you can confirm that yourself if you want to. And so here we're going to update a: a minus-equals learning rate times the derivative of the loss with respect to a, and for b it's learning rate times the derivative with respect to b. So let's just run all this. Just for fun, rather than looping through manually, we can use the matplotlib FuncAnimation command to run the animate function a bunch of times, and the animate function is going to run 30 epochs, and at the end of each epoch it's going to draw on the plot where the line currently is, and that creates this little movie. So you can actually see the line moving into place. So if you want to play around with understanding how PyTorch gradients actually work, step by step, here's the world's simplest little example. And it's kind of weird to say 'that's it': when you're optimizing a hundred million parameters in a neural net it's doing the same thing, but it actually is. You can look at the PyTorch code and see that this is it; there's no trick. We,
well, we learned a couple of minor tricks last time, which were momentum and Adam, but if you can do it in Excel, you can do it in Python. Okay, so let's now talk about RNNs. We're now in the lesson 6 RNN notebook, and we're going to study Nietzsche, as you should. So Nietzsche says: supposing that Truth is a woman, what then? I love this. Apparently all philosophers have failed to understand women, so apparently at the point that Nietzsche was alive there were no female philosophers, or at least those that were around didn't understand women either. So anyway, this is the philosopher we've apparently chosen to study. Nietzsche is actually much less bad than people think he is, but it's a different era, I guess. All right, so we're going to learn to write philosophy like Nietzsche, and we're going to do it one character at a time. So this is like the language model that we did in lesson four, where we did it a word at a time, but this time we're going to do a character at a time. The main thing I'm going to try and convince you of is that an RNN is no different to anything you've already learned, and to show you that, we're going to build it from plain PyTorch layers, all of which are extremely familiar already. Eventually we're going to use something really complex, which is a for loop; that's when we're going to make it really sophisticated. So the basic idea of RNNs is that you want to keep track of state over long-term dependencies. For example, if you're trying to model something like this kind of template language, then at the end of your "percent comment do percent" you need a "percent comment end percent", and so somehow your model needs to keep track of the fact that it's inside a comment over all of these different characters. So this is this idea of state: it needs memory. This is quite a difficult thing to do with just a convnet. It turns out
actually to be possible, but it's a little bit tricky, whereas with an RNN it turns out to be pretty straightforward. So these are the basic ideas: if you want a stateful representation, where you're keeping track of where you are now, have memory, have long-term dependencies, and potentially even have variable-length sequences, these are all difficult things to do with convnets, and they're very straightforward with RNNs. So for example SwiftKey, a year or so ago, did a blog post about how they had a new language model, and, this is from the blog post, they basically said this is what their neural net looks like (somehow they always look like this on the internet): you've got a bunch of words, and it's basically going to take your particular words in their particular order and try to figure out what the next word is going to be, which is to say they built a language model. They actually have a pretty good language model; if you've used SwiftKey, they still seem to do better predictions than anybody else. Another cool example was Andrej Karpathy, who a couple of years ago showed that he could use a character-level RNN to actually create an entire LaTeX document. He didn't tell it in any way what LaTeX looks like; he just passed in some LaTeX text like this and said 'generate more LaTeX text', and it literally started writing something which means about as much to me as most math papers do. Okay, so we're going to start with something that's not an RNN, and I'm going to introduce Jeremy's patented neural network notation, involving boxes, circles and triangles. So let me explain what's going on. A rectangle is an input, an arrow is a layer, a circle... in fact, every shape is a bunch of activations: the rectangle is the input activations, the circle is the hidden activations, and a triangle is the output activations. An arrow is a layer operation
or possibly more than one. So here my rectangle is an input, with number of rows equal to batch size and number of columns equal to the number of inputs, the number of variables. My first arrow, my first operation, is going to represent a matrix product followed by a ReLU, and that's going to generate a set of activations. Remember, an activation is a number: a number that's being calculated by a ReLU or a matrix product or whatever. So this circle here represents a matrix of activations, all of the numbers that come out when we take the inputs and do a matrix product followed by a ReLU. We started with batch size by number of inputs, and so after we do this matrix operation we now have batch size by whatever the number of columns in our matrix product was, so by number of hidden units. If we now take these activations, which is a matrix, and put it through another operation, in this case another matrix product and a softmax, we get a triangle: that's our output activations, another matrix of activations, and again the number of rows is batch size and the number of columns is equal to the number of classes, again however many columns our matrix in this matrix product had. So that's a neural net; that's our basic one-hidden-layer neural net. If you haven't written one of these from scratch, try it. In fact, in lessons 9, 10 and 11 of the machine learning course we do this, we create one of these from scratch, so if you're not quite sure how to do it you can check out the machine learning course. In general the machine learning course is much more about building stuff up from the foundations, whereas this course is much more best practices, top down. All right, so if we were doing a convnet with a single dense hidden layer, our input would be equal to, actually, yeah, sorry, in PyTorch: number of channels by
height by width. And notice that here batch size appeared every time, so I'm not going to write it anymore; I've removed the batch size. Also the activation function: it's always basically ReLU or something similar for all the hidden layers, and softmax at the end for classification, so I'm not going to write that either. So with each picture I'm going to simplify it a little bit: I'm not going to mention batch size, but it's still there, and I'm not going to mention ReLU or softmax, but they're still there. So here's our input, and in this case, rather than a matrix product, we'll do a convolution. Let's say a stride 2 convolution, so we'll skip over every second one, or it could be a convolution followed by a max pool; in either case we end up with something where we replace the number of channels with the number of filters, and we now have height divided by two and width divided by two. And then we can flatten that out somehow. We'll talk next week about the main way we do that nowadays, which is basically to do something called an adaptive max pooling, where we basically pool across the height and the width and turn that into a vector. Anyway, somehow we flatten it out into a vector, and we can do a matrix product, or a couple of matrix products as we actually tend to do in fastai, so that'll be our fully connected layer with some number of activations, and a final matrix product gives us some number of classes. So this is our basic component. Remember: a rectangle is input, a circle is hidden, a triangle is output; all of the shapes represent a tensor of activations, and all of the arrows represent a layer operation. All right, so now let's jump to the first one that we're going to actually try to create for NLP. We're going to basically do exactly the same thing as here, and we're going to try and predict the third character in a three-character sequence based on the previous two characters. So now our input, and
again, remember we've removed the batch size dimension, so we're not saying it but it's still here, and also here I've removed the names of the layer operations entirely, just to keep simplifying things. So for example our first input would be the first character of each string in our mini-batch, and assuming this is one-hot encoded, then the width is just however many items there are in the vocabulary, how many unique characters we could have. We probably won't really one-hot encode it; we'll feed it in as an integer and pretend it's one-hot encoded by using an embedding layer, which is mathematically identical. That's going to give us some activations, which we can put through a fully connected layer to get some more activations. We can then put that through another fully connected layer, and now we're going to bring in the input of character two. The character two input will be exactly the same dimensionality as the character one input, and we now need to somehow combine these two arrows together. We could just add them up, for instance, because remember this arrow here represents a matrix product, and this matrix product is going to spit out the same dimensionality as this matrix product, so we could just add them up to create these activations. And so now we can put that through another matrix product, and of course remember all these matrix products have a ReLU as well, and this final one will have a softmax instead, to create our predicted set of characters. So it's a standard two-hidden-layer (I guess it's actually three matrix products) neural net, where this first one is coming through an embedding layer. The only difference is that we've also got a second input coming in here that we're just adding in, but it's conceptually identical. So let's implement that for Nietzsche. I'm not going to use torchtext; I'm going to try
not to use almost any fastai, so we can see it all again from raw. So here's the first 400 characters of the collected works. Let's grab a set of all of the letters that we see there and sort them; a set gives us all the unique letters, so we've got 85 unique letters in our vocab. It's nice to put an empty, null, or some kind of padding character in there for padding, so we're going to put a padding character at the start, and so here is what our vocab looks like. So chars is our vocab. As per usual, we want some way to map every character to a unique ID and every unique ID to a character, and so now we can just go through our collected works of Nietzsche and grab the index of each one of those characters. So now we've just turned it into this: rather than 'PRE' we now have 40, 42, 29. So that's basically the first step, and just to confirm, we can now take each of those indexes and turn them back into characters and join them together, and yeah, there it is. So from now on we're just going to work with this idx list, the list of character numbers in the collected works of Nietzsche. Yes? So Jeremy, why are we doing a model of characters and not a model of words? I just thought it seemed simpler; with a vocab of 80-ish items we can see it better. Character-level models turn out to be potentially quite useful in a number of situations, but we'll cover that in part two. The short answer is that you generally want to combine both a word-level model and a character-level model. If you're doing, say, translation, it's a great way to deal with unusual words: rather than treating a word as unknown any time you see one you haven't seen before, you could use a character-level model for that. And there's actually something in between the two called byte pair encoding, BPE, which basically looks at little n-grams of characters, but we'll cover all that in part two. If you want to
look at it right now, part two of the existing course already has this stuff taught, and in part two of version one of this course all the NLP stuff is in PyTorch, by the way, so you'll understand it straight away. It was actually the thing that inspired us to move to PyTorch, because trying to do it in Keras turned out to be a nightmare. All right, so let's create the inputs to this. We're actually going to do something slightly different from what I said: we're going to predict the fourth character using the first three, that is, the index 3 character using the index 0, 1 and 2 characters. So we're going to do exactly the same thing but with just a couple more layers. That means we need a list of the zeroth, first, second and third characters; that's why I'm just grabbing every character from the start, from one, from two and from three, skipping over three at a time. So, to be clear, we're predicting the fourth character from the first three. Our inputs will be these three lists, and we can just use np.stack to pop them together. So here are the zero, one and two characters that are going to feed into our model, and then here is the next character in the list: x1, x2, x3 and y. You can see, for example, the very first item would be 40, 42 and 29, so that's characters zero, one and two, and then we'd be predicting 30, the fourth character, which is the start of the next row. Then from 30, 25, 27 we need to predict 29, which is the start of the next row, and so forth. So we're always using three characters to predict the fourth, and there are 200,000 of these that we're going to try to model. We're going to build this model, which means we need to decide how many activations, so I'm going to use 256, and we need to
decide how big our embeddings are going to be, and I decided to use 42, so about half the number of characters I have. You can play around with these and see if you can come up with better numbers; it's just experimental. Now we're going to build our model. I'm going to change my model slightly, and here is the full version: predicting character four using characters one, two and three. As you can see, it's the same picture as the previous page, but I put some very important colored arrows here. All the arrows of the same color are going to use the same weight matrix. So all of our input embeddings are going to use the same matrix, all of our layers that go from one layer to the next are going to use the same orange-arrow weight matrix, and then our output will have its own matrix. So we're going to have one, two, three weight matrices. The reason I'm not going to have a separate one for everything here is: why would a character semantically have a different meaning depending on whether it was the first, the second or the third item in a sequence? It's not like we're even starting every sequence at the start of a sentence; we just arbitrarily chopped it into groups of three. So you would expect these all to have the same conceptual mapping. And ditto when we're moving from character zero to character one, to build up some state here: why would that be any different an operation from moving from character one to character two? So that's the basic idea. Let's create a three-character model. We're going to create one linear layer for our green arrow, one linear layer for our orange arrow and one linear layer for our blue arrow, and then also one embedding. The embedding is going to bring in something of size, whatever it was, 84 I think, the vocab size, and spit out something with the number of factors in the embedding. We'll then put that
through a linear layer, and then we've got our hidden layers and our output layer. So when we call forward, we're going to be passing in one, two, three characters. For each one, we'll stick it through an embedding, we'll stick it through a linear layer and we'll stick it through a ReLU. So we do it for character one, character two and character three. Then I'm going to create this circle of activations here, and that matrix I'm going to call h. It's going to be equal to my input activations after going through the embedding, the linear layer and the ReLU, and then I'm going to apply this l_hidden, the orange arrow, and that's going to get me to here. So that's what this layer here does. Then to get to the next one I need to apply the same thing, the orange arrow, to that, but I also have to add in the second input: so take my second input and add in my previous layer. Yannet, could you pass that back? "I don't really see how these dimensions are the same, from h and in2." From which to which? OK, let's figure out the dimensions together. self.e is going to be of length 42, and then it's going to go through l_in, which I'm just going to make of size n_hidden. Then we're going to pass that, which is now size n_hidden, through this, which is also going to return something of size n_hidden. It's really important to notice that this is a square weight matrix. So we now know that this is of size n_hidden, and in2 is going to be exactly the same size as in1 was, which is n_hidden, so we can sum together two sets of activations, both of size n_hidden, pass the result into here, and again it returns something of size n_hidden. So basically the trick was to make this a square matrix, and to make sure that the square matrix was the same size as the output of this hidden layer. Thanks for the great
question. Can you pass that back to Yannet now? "Jeremy, is summing the only thing people can do in these cases?" No, we'll come back to that in a moment; that's a great point. I don't like it when I have three bits of code that look identical and then three bits of code that look nearly identical but aren't quite, because it's harder to refactor. So I'm going to make h into a bunch of zeros, so that I can then put h here, and these are now identical. The hugely complex trick that we're going to do very shortly is to replace these three things with a for loop, and it's going to loop through one, two and three, or actually zero, one and two. At that point we'll be able to call it a recurrent neural network. So, just to skip ahead a little bit. All right, so we create that model; let me make sure I've run all these so we can actually run this thing. We can now just use the same ColumnarModelData class that we've used before, and if we use from_arrays, it's basically just going to spit back the exact arrays we gave it. So if we stack together those three arrays, it's going to feed those three things back to our forward method. If you want to play around with training models using as raw an approach as possible, but without writing lots of boilerplate, this is how to do it: use ColumnarModelData.from_arrays, and whatever you pass in here you're going to get back here. I've passed in three things, which means I'm going to get sent three things. So that's how that works. Batch size 512, because this data is tiny, so I can use a bigger batch size. So I'm not really using much fastai stuff at all; I'm using fastai just to save me fiddling around with data loaders and datasets and so on, but I'm actually going to create a standard PyTorch model. I'm not going to create a
learner. So this is a standard PyTorch model, and because I'm using PyTorch that means I have to remember to write .cuda() to stick it on the GPU. Here is how we can look inside at what's going on. We can say iter(md.trn_dl) to grab an iterator over the training set, and we can then call next on that to grab a mini-batch, which returns all of our xs and our y tensor. So we can take a look at, say, our xs. Have a think about what you would expect for its length: three, not surprisingly, because these are the three things. And then xs[0], not surprisingly, is of length 512. It's not actually one-hot encoded, because we're using an embedding to pretend it is. We can then use the model as if it's a function, by passing to it the Variable-ized version of our tensors, and have a think about what you would expect to be returned here. Not surprisingly, we had a mini-batch of 512, so we still have 512, and then 85 is the probability of each of the possible vocab items. Of course we've got the log of them, because that's what we do in PyTorch; you can see the log softmax here. So that's how you can look inside, and you can see here how to do everything very much by hand. We can create an optimizer, again using standard PyTorch. When you use a PyTorch optimizer you have to pass in a list of the things to optimize, and if you call m.parameters() that will return that list for you. Then we can fit, and there it goes. We don't have learning rate finders and SGDR and all that stuff, because we're not using a learner, so we'll have to do learning rate annealing manually: set the learning rate a little bit lower and fit again. Now we can write a little function to test this thing out. Here's something called get_next, where we can pass in three
characters, like 'y', full stop, space. I can then go through and turn that into a tensor, with capital T, of an array of the character index for each character in that list. So basically: turn those into integers, turn those into Variables, pass that to our model, and then do an argmax on that to grab which character number it is. In order to do stuff in NumPy land I use to_np to turn that Variable into a NumPy array, and then I can return that character. So, for example, a capital T is what it thinks would be reasonable after seeing 'y', full stop, space; that seems like a very reasonable way to start a sentence. If it was 'ppl', 'e' sounds reasonable; after ' th', 'e' sounds reasonable; after 'and', a space sounds reasonable. So it seems to have created something sensible. The important thing to note here is that our character model is a totally standard fully connected model. The only slightly interesting thing we did was this addition of each of the inputs, one at a time. But there's nothing new conceptually here; we're training it in the usual way. All right, let's now create an RNN. An RNN is when we do exactly the same thing that we did here, but I could draw this more simply by saying: if we've got a green arrow going to a circle, let's not draw a green arrow going to a circle again and again and again, let's just draw it once, like this. And rather than drawing an orange arrow going to a circle repeatedly, let's just draw it like this. So this is exactly the same picture as this one, and you just have to say how many times to go around this circle. In this case, if we want to predict character number n from characters one through n minus one, then we can take the character one input, get some activations, and feed that to some new activations that go through, remember, orange is the hidden-to-hidden weight matrix, and each time we'll also
bring in the next character of input through its embeddings. So that picture and that picture are two ways of writing the same thing, but this one is more flexible, because rather than me having to say, hey, let's do it for eight, I don't have to draw eight circles; I can just say, repeat this. I could simplify this a little bit further by saying: rather than having this thing as a special case, let's actually start out with a bunch of zeros, and then let's have all of our characters inside here. Yes? "I was wondering if you can explain a little bit better why you are reusing the same color arrows. You seem to be reusing the same weight matrices. Maybe this is similar to what we did in convolutional neural nets?" No, I don't think so, at least not that I can see. The idea is just, semantically speaking, this arrow here is saying: take a character of input and represent it as some set of features. And this arrow is saying the same thing: take some character and represent it as a set of features, and so is this one. So why would the three be represented with different weight matrices, when they're all doing the same thing? And this orange arrow is saying: transition from character zero's state to character one's state to character two's state. Again, it's the same thing: why would the transition from character zero to one be different from the transition from one to two? So the idea is to say: if it's doing the same conceptual thing, let's use the exact same weight matrix. "My comment on convolutional neural networks is that a filter also gets applied to multiple places." Oh, yes, that's an interesting point of view. I see, so you're saying a convolution is almost like a kind of special dot product
with shared weights. Yes, that's a very good point, and in fact one of our students actually wrote a good blog post about that last year; we should dig that up. I totally see where you're coming from and I totally agree with you. All right, so let's implement this version. This time we're going to do eight characters, so cs is 8. Let's create a list of every run of eight characters, from zero through seven, and then our outputs will be the next character, and we can stack that together. So now we've got 600,000 by eight. Here's an example: this is characters zero through seven, this is characters one through eight, this is two through nine, so these are all overlapping. After characters zero through seven, this is going to be the next one, and then after these characters, this will be the next one. You can see that this one here has 43 as its y value, because after those the next one will be 43. So these are overlapping groups of eight characters, and then this is the next one along. So let's create that model. Again we use from_arrays to create a model data class, and you'll see here we have exactly the same code as we had before: there's our embedding, linear, hidden and output, and these are literally identical. Then we've replaced our ReLU of the linear input of the embedding with something that's inside a loop, and we've put the self.l_hidden thing inside the loop as well. I just realized I didn't mention last time the use of the hyperbolic tan. Hyperbolic tan looks like this; it's just a sigmoid that's been rescaled and offset. It's very common to use a hyperbolic tan inside this state-to-state transition, because it stops the activations from flying off too high or too low; it's nicely controlled.
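The remark that hyperbolic tan is just an offset sigmoid can be checked numerically. This is a minimal sketch using only the Python standard library; the helper name sigmoid and the sample points in xs are illustrative choices, not taken from the lesson notebook.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# tanh is the sigmoid stretched to the range (-1, 1) and centred on zero:
#   tanh(x) = 2 * sigmoid(2x) - 1
xs = [-5.0, -1.0, 0.0, 0.5, 3.0]  # arbitrary sample points
for x in xs:
    rescaled = 2.0 * sigmoid(2.0 * x) - 1.0
    print(f"x={x:+.1f}  tanh={math.tanh(x):+.6f}  2*sigmoid(2x)-1={rescaled:+.6f}")
```

Because the output always stays strictly between -1 and 1, applying tanh after each hidden-to-hidden step keeps the stored activations bounded no matter how many times the loop runs.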
Back in the old days we used to use hyperbolic tan, or the equivalent sigmoid, a lot, as most of our activation functions. Nowadays we tend to use ReLU, but in these hidden-state transition weight matrices we still tend to use hyperbolic tan quite a lot, so you'll see I've done that here too: hyperbolic tan. So this is exactly the same as before, but I've just replaced it with a for loop, and then here's my output. Yes? "Does it have anything to do with convergence of these networks?" Yes, kind of. We'll talk about that a little bit over time; let's come back to that. For now, we're not really going to do anything special at all, just recognizing that this is a standard fully connected network. Interestingly, it's quite a deep one, because this is actually this, but we've got eight of these things now: we've now got a deep, eight-layer network, which is why Yannet is starting to suggest we should be concerned. As we get deeper and deeper networks, they can be harder and harder to train. But let's try training this. Away it goes: as before, we've got a batch size of 512, we're using Adam, and away it goes. So we'll sit there watching it. We can then set the learning rate down to 1e-3 and fit it again, and yes, it actually seems to be training fine. But we're going to try something else, which is the trick that Yannet rather hinted at before: maybe we shouldn't be adding these things together. The reason you might feel a little uncomfortable about adding these things together is that the input state and the hidden state are qualitatively different kinds of things. The input state is the encoding of this character, whereas h represents the encoding of the series of characters so far, so adding them together is potentially going to lose information. So I think what Yannet
was going to prefer that we do is to concatenate these instead of adding them. Does that sound good to you, Yannet? She's nodding. OK, so let's now make a copy of the previous cell, all the same, but rather than using plus, let's use cat. Now, if we concatenate, then we need to make sure that our input layer is not from n_fac to n_hidden, which is what we had before; because we're concatenating, it needs to be from n_fac plus n_hidden to n_hidden. That's going to make all the dimensions work nicely: this now is of size n_fac plus n_hidden, this makes it back to size n_hidden again, and then this is putting it through the same square matrix as before, so it's still size n_hidden. This is a good design heuristic if you're designing an architecture: if you've got different types of information that you want to combine, you generally want to concatenate them. Adding things together, even if they're the same shape, is losing information. And once you've concatenated things together, you can always convert the result back down to a fixed size by just chucking it through a matrix product. So that's what we've done here; again it's the same thing, but now we're concatenating instead. We can fit that, and last time we got 1.72 and this time we got 1.68, so it's not setting the world on fire, but it's an improvement, and improvements are good. We can now test that with get_next, and now we can pass in eight things. So 'for thos' becomes 'for those', which looks good; 'part of ' gets a 't', which sounds good as well; and 'queens a' gets an 'n', which sounds good too. All right, great, so that's enough manual hackery. Let's see if PyTorch can do some of this for us. Basically, what PyTorch will do for us is write this loop automatically, and it will create these linear input layers automatically. To ask it to do that, we can use the nn.RNN class. So here's the exact same thing in less code, by taking
advantage of PyTorch. And again, I'm not using a conceptual analogy to say PyTorch is doing something like it; I'm saying PyTorch is doing it. This is just the code you just saw, wrapped up and refactored a little bit for your convenience. So where we say we want to create an RNN called rnn, what this does is that for loop. Now notice that our for loop needed a starting point. You remember why: otherwise our for loop didn't quite work, and we couldn't quite refactor it out. Because this is exactly the same, this needs a starting point too, so let's give it one: you have to pass in your initial hidden state. For reasons that will become apparent later on, it turns out to be quite useful to be able to get back that hidden state at the end, and just as we could keep track of the hidden state here, we get back two things: both the output and the hidden state. So we pass in the input and the hidden state, and we get back the output and the hidden state. Yes? "Could you remind us what the hidden state represents?" The hidden state is h, so it's the orange ellipse of activations, and it is of size 256. There's one other thing to know. In our case we were replacing h with the new hidden state. The one minor difference in PyTorch is that they append the new hidden state to a tensor which gets bigger and bigger, so they actually give you back all of the hidden states. In other words, rather than just giving you back the final ellipse, they give you back all the ellipses stacked on top of each other. Because we just want the final one, I've just indexed into it with minus one here. Other than that, this is the same code as before: put that through our output layer to get the correct vocab size, and then we can train that. You can see here I can do it manually: I can create some hidden state, I can pass it
to that RNN, and I can see the stuff I get back. You'll see that h is actually a rank-three tensor, whereas in my version it was a rank-two tensor. The difference is that here we've got a unit axis at the front. We'll learn more about why that is later, but basically it turns out you can have a second RNN that goes backwards: one that goes forwards and one that goes backwards, and the idea is that it's going to be better at finding relationships that go backwards. That's called a bidirectional RNN. It also turns out you can have an RNN feed into an RNN; that's called a multi-layer RNN. If you have those things, you need an additional axis on your tensor to keep track of those additional layers of hidden state, but for now we'll always have a one here, and we'll always get back a one at the end as well. If we go ahead and fit this now, let's actually train it for a bit longer. Last time we only did a couple of epochs; this time we'll do four epochs at 1e-3, and then another two epochs at 1e-4, and we've now got our loss down to 1.5, so it's getting better and better. Here's our get_next again, doing just the same thing. What we can now do is loop through, say, 40 times, calling get_next each time, and each time we'll replace our input by removing the first character and adding the thing that we just predicted. That way we can feed in a new set of eight characters again and again, and we'll call that get_next_n. So here are 40 characters that we've generated. We started out with 'for thos', so we got 'for those of the same to the same to the same'. You can probably guess what happens if you keep predicting: 'the same to the same'. But it's doing OK. We now have something which we've basically built from scratch, and then we've seen how PyTorch refactored it for
us. So if you want an interesting little homework assignment this week, try to write your own version of an RNN class. Literally create your own, say, JeremysRNN, and then type in JeremysRNN here (or in your case maybe your name's not Jeremy, which is OK too) and get it to run, writing that class from scratch without looking at the PyTorch source code. Basically it's just a case of going up and seeing what we did back here. Make sure you get the same answers and confirm that you do. That's a good little test, a very simple assignment, but I think you'll feel really good when you see, oh, I've just re-implemented nn.RNN. All right, so I'm going to do one other thing. When I switched from this one, when I moved the char 1 input inside the dotted line (this dotted rectangle represents the thing I'm repeating), I also moved the triangle, the output, inside as well. Now that's a big difference, because what I'm actually saying now is: spit out an output after every one of these circles, so spit out an output here and here and here. In other words, if I have a three-character input, I'm going to spit out a three-character output. I'm saying: after character one, this will be next; after character two, this will be next; after character three, this will be next. So again, nothing different, and if you wanted to go a bit further with the assignment you could write this by hand as well. Basically, in the for loop we'd be saying results equals some empty list, and then we'd be going through and, rather than returning that, we'd instead be saying results.append of that, and then return torch.stack of the results, something like that. That may be right; I'm not quite sure. So now at every step we've created an output, so
which is basically this picture. There are lots of reasons that's interesting, but I think the main reason right now is that you've probably noticed this approach to dealing with the data seems terribly inefficient. We're grabbing the first eight, but then in this next set, all but one of them overlap the previous one. So we're recalculating the exact same embeddings, seven out of eight of them, and the exact same transitions. It seems weird to do all this calculation to just predict one thing, and then go back and recalculate seven out of eight of them and add one more to the end to calculate the next thing. So the basic idea is to say: let's not do it that way; instead, let's take non-overlapping sets of characters. So here are our first eight characters, here are the next eight characters, here are the next eight characters. If you read this top left to bottom right, that would be the whole of Nietzsche. And then, if these are the first eight characters, offset this by one, starting here: that's a list of outputs. So after we see characters zero through seven, we should predict characters one through eight. That makes sense: after 40 should come 42, as it did; after 42 should come 29, as it did. And now those can be our inputs and labels for that model. It shouldn't be any more or less accurate, it should be pretty much the same, but it should allow us to do it more efficiently. So let's try that. I mentioned last time that we had a minus one index here, because we just wanted to grab the last triangle. In this case we're going to grab all the triangles. This is actually the way nn.RNN creates things; we only kept the last one, but this time we're going to keep all of them. So we've made one change, which is to remove that minus one. Other than that, this is the exact same code
as before. There's nothing much to show you here, except of course that this time, if we look at the labels, they are now 512 by 8, because we're trying to predict eight things every time through. There is one complexity here, which is that we want to use the negative log likelihood loss function as before, but the negative log likelihood loss function, just like RMSE, expects to receive two rank-one tensors, or actually, with the mini-batch axis, two rank-two tensors, so two mini-batches of vectors. The problem is that we've got eight time steps (in an RNN we call each of the eight characters a time step), and then for each one we have 84 probabilities. We have the probability for every single one of those eight time steps, and then we have that for each of our 512 items in the mini-batch. So we have a rank-three tensor, not a rank-two tensor, and that means the negative log likelihood loss function is going to spit out an error. Frankly, I think this is kind of dumb. I think it would be better if PyTorch had written their loss functions in such a way that they didn't care about rank at all and just applied to whatever rank you gave them, but for now at least it does care about rank. The nice thing is that I get to show you how to write a custom loss function. We're going to create a special negative log likelihood loss function for sequences. It's going to take an input and a target, and it's going to call F.nll_loss, so the PyTorch one, but what we're going to do is flatten our input and flatten our targets. And it turns out the first two axes are going to have to be transposed. The way PyTorch handles RNN data by default is that the first axis is the sequence length, in this case eight. The sequence length of an RNN is how many time steps; we have eight characters, so a
sequence length of eight. The second axis is the batch size, and then, as you would expect, the third axis is the actual hidden state itself. So this is going to be 8 by 512 by n_hidden, which I think was 256. So we can grab the size and unpack it into each of these: sequence length, batch size and n_hidden. Our target, yt.size(), is 512 by 8, whereas this one here was 8 by 512, so to make them match we're going to have to transpose the first two axes. When you do something like transpose, PyTorch doesn't generally shuffle the memory order; instead it just keeps some internal metadata to say, hey, you should treat this as if it's transposed. Some things in PyTorch will give you an error if you try to use a tensor while it has this internal state, and it'll basically say: error, this tensor is not contiguous. If you ever see that error, add .contiguous() after it and it goes away. I don't know why they can't do that for you, apparently. In this particular case I got that error, so I wrote .contiguous() after it. Then finally we need to flatten it out into a single vector, so we can just go .view, which is the same as NumPy's reshape, and minus one means as long as it needs to be. Then the input, sorry, the predictions: we also reshape those, but remember the predictions also have this axis of length 84, all of the predicted probabilities. So here's a custom loss function; that's it. If you ever want to play around with your own loss functions, you can just do that like so and then pass it to fit. It's important to remember that fit is the lowest-level fastai abstraction; this is the thing that implements the training loop. The stuff you pass in is all standard PyTorch stuff, except for this: this is our model data object, the thing that wraps up the test
set, the training set and the validation set together. Yannet, could you pass that back, please? "So when we pull the triangle into the repeated structure, for the first n minus one iterations of the sequence we don't see the whole sequence length. Does that mean that the batch size should be much bigger, so that you get a triangular kind of..." Be careful: you don't mean batch size, you mean sequence length, right? Because the batch size is something else entirely. OK, so yes, if you have a short sequence length like eight, the first character has nothing to go on; it starts with an empty hidden state of zeros. What we're going to start with next week is learning how to avoid that problem. It's a really insightful question, or concern. If you think about it, the basic idea is: why should we reset this to zero every time? If we can line up these mini-batches somehow, so that the next mini-batch joins up correctly and represents the next letters in Nietzsche's works, then we'd want to move this up into the constructor, and then pass that here and store it here. Now we're not resetting the hidden state each time; we're actually keeping the hidden state from call to call, and so the only time that it would fail to benefit from learned state would be literally at the very start of the document. So that's where we're going to try to head next week. I feel like in this lesson, every time I've got a punchline coming, somebody asks me a question where I have to do the punchline ahead of time. OK, so we can fit that, and we can fit that, and I want to show you something interesting. This is coming to another punchline that Yannet tried to spoil, which is: remember, this is just doing a loop, applying the same matrix multiply again and again. If that matrix multiply
tends to increase the activations each time, then effectively we're doing that to the power of eight, so it's going to shoot off really high. Or if it's decreasing it a little bit each time, it's going to shoot off really low. This is what we call a gradient explosion. So we really want to make sure that the initial l_hidden that we create is of a size that's not going to cause our activations on average to increase or decrease. And there's actually a very nice matrix that does exactly that, called the identity matrix. So the identity matrix, for those that don't quite remember their linear algebra, is this. This would be a size three identity matrix. The trick about an identity matrix is that anything times an identity matrix is itself. And so therefore you could multiply by this again and again and again and still end up with itself. So there's no gradient explosion.

So what we could do is, instead of using whatever the default random init is for this matrix, after we create our RNN we can go into that RNN. Notice this: we can go m.rnn, and if we now go like so, we can get the docs for m.rnn. As well as the arguments for constructing it, it also tells you the inputs and outputs for calling the layer, and it also tells you the attributes. And so it tells you there's something called weight_hh, and these are the learnable hidden-to-hidden weights. That's that square matrix. So after we've constructed our m, we can just go in and say m.rnn.weight_hh_l0.data, that's the tensor, .copy_, in place, torch.eye, that is "eye" for identity in case you're wondering. So this is an identity matrix of size n_hidden. So this both puts it into this weight matrix and returns the identity matrix. And so this was actually a Geoffrey Hinton paper that was like, hey, you know, after, what is this, 2015, so after recurrent neural nets have been
around for decades, here's like, hey gang, maybe we should just use the identity matrix to initialize this, and it actually turns out to work really well. So that was a 2015 paper, believe it or not, from the father of neural networks. And so here is our implementation of his paper. This is an important thing to note: when very famous people like Geoffrey Hinton write a paper, sometimes an entire implementation of that paper looks like one line of code.

Okay, so let's do it. Before, we got 0.61257. We'll fit it with exactly the same parameters, and now we've got 0.51, and in fact we can keep training: 0.50. So this tweak really, really, really helped. And one of the nice things about this tweak was that before I could only use a learning rate of 1e-3 before it started going crazy, but after I used the identity matrix I found I could use 1e-2. Because it's a better behaved weight initialization, I found I could use a higher learning rate. And honestly, these things we're increasingly trying to incorporate into the defaults in fast.ai, so increasingly you don't necessarily need to actually know them. But at this point, where most things in most libraries most of the time don't have great defaults, it's good to know all these little tricks. It's also nice to know, if you want to improve something, what kind of tricks people have used elsewhere, because you can often borrow them yourself.

All right, well, that's the end of the lesson today. So next week we will look at this idea of a stateful RNN that's going to keep its hidden state around, and then we're going to go back to looking at language models again, and then finally we're going to go all the way back to computer vision and learn about things like ResNets and batch norm and all the tricks that were figured out in cats versus dogs. See you then.
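
As a recap of the repeated-matrix-multiply argument from this lesson, here is a minimal sketch in plain Python, with no PyTorch needed. It is not the lesson's notebook code: the helper names (`matvec`, `scaled_identity`, `apply_repeatedly`) and the scale factors 1.5 and 0.5 are made up for illustration. It applies the same square matrix eight times, the way the RNN loop does, and compares a matrix that grows its input, one that shrinks it, and the identity.

```python
# Illustrative sketch only: helper names and scale factors are assumptions,
# not taken from the lesson notebook.

def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def scaled_identity(size, scale):
    """Return scale * I as a list of rows; scale=1.0 gives the identity."""
    return [[scale if i == j else 0.0 for j in range(size)] for i in range(size)]


def apply_repeatedly(matrix, vector, steps):
    """Apply the same matrix multiply `steps` times, like the RNN loop."""
    for _ in range(steps):
        vector = matvec(matrix, vector)
    return vector


h0 = [1.0, 2.0, 3.0]  # stand-in for some hidden activations
steps = 8             # the short sequence length discussed in the lesson

grown = apply_repeatedly(scaled_identity(3, 1.5), h0, steps)   # scaled by 1.5 ** 8
shrunk = apply_repeatedly(scaled_identity(3, 0.5), h0, steps)  # scaled by 0.5 ** 8
kept = apply_repeatedly(scaled_identity(3, 1.0), h0, steps)    # unchanged

print("grows:  ", grown)
print("shrinks:", shrunk)
print("kept:   ", kept)
```

With a real weight matrix the growth or shrinkage depends on the whole matrix rather than a single scale factor, so treat this as intuition for why starting the hidden-to-hidden weights at the identity is better behaved, not as a model of the trained network.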