Okay. Hi everybody. Welcome back. Good to see you all here. It's been another busy week of deep learning. Lots of cool things going on. And like last week, I wanted to highlight a few really interesting articles that some of you folks have written. Vitaly wrote one of the best articles I've seen for a while, I think, actually talking about differential learning rates and stochastic gradient descent with restarts. Be sure to check it out if you can, because I feel like he's done a great job of positioning it in a place where you can get a lot out of it regardless of your background. But for those who want to go further, he's also got links to the academic papers these ideas came from, and graphs showing examples of all of the things he's talking about. I think it's a particularly nicely done article, so it's a good role model for technical communication. One of the things I've liked about seeing people post these articles during the week is that the discussion on the forums has also been really great. There have been a lot of people helping out and explaining things; maybe there are parts of a post where people have said "actually, that's not quite how it works", and people have learned new things that way. People have come up with new ideas as a result as well. The discussions of stochastic gradient descent with restarts and cyclical learning rates have been a few of them, actually. Anand Saha has written another great post talking about a similar topic and why it works so well. And again, lots of great pictures, references to papers, and, most importantly perhaps, code showing how it actually works. Kaufman covered the same topic at a nice introductory level; I think it gives a really, really clear intuition. Manikanta talks specifically about differential learning rates and why they're interesting, and again provides some nice context for people not familiar with transfer learning, going back to saying: what is transfer learning? Why is that interesting? And given that, why could differential learning rates be helpful? And then one thing I particularly liked about Arjun's article was that he talked not just about the technology that we're looking at, but also about some of the implications, particularly from a commercial point of view. So thinking about, based on some of the things we've learned about so far, what are some of the implications in real life: lots of background, lots of pictures, and then a discussion of some of the implications. So there's been lots of great stuff online, and thanks to everybody for all the great work that you've been doing. As we talked about last week, if you're vaguely wondering about writing something but you're feeling a bit intimidated because you've never really written a technical post before, just jump in. It's a really welcoming and encouraging group to work with, I think. So we're going to have kind of an interesting lesson today, in which we're going to cover a whole lot of different applications. We've spent quite a lot of time on computer vision, and today we're going to try, if we can, to get through three totally different areas. We're going to start out looking at structured learning, or structured data learning, by which I mean building models on top of things that look more like database tables: columns of different types of data.
They might be financial or geographical or whatever. We're going to look at using deep learning for language: natural language processing. And we're going to look at using deep learning for recommendation systems. We're going to cover these at a very high level, and the focus will be on "here is how to use the software to do it" more than "here is what's going on behind the scenes". The next three lessons will then dig into the details of what's been going on behind the scenes, and also come back to a lot of the details of computer vision that we've skipped over so far. So the focus today is really on how you actually do these applications, and we'll talk briefly about some of the concepts involved. Before we do, I did want to talk about one key new concept, which is dropout. You might have seen dropout mentioned a bunch of times already and got the impression that this is something important, and indeed it is. So to look at dropout, I'm going to look at the dog breeds competition that's currently running on Kaggle. And what I've done is I've gone ahead and created a pre-trained network as per usual, and I've passed in precompute=True. That's going to pre-compute the activations that come out of the last convolutional layer. And remember, an activation is just a number. Just to remind you, an activation like this one here is a number. And specifically, the activations are calculated based on some weights, also called parameters, that make up kernels or filters, and they get applied to the previous layer's activations, which could well be the inputs, or could themselves be the results of other calculations. So when we say activation, keep remembering we're talking about a number that's being calculated. So we pre-compute some activations, and then what we do is put on top of that a bunch of additional, initially randomly generated, fully connected layers. So we're just going to do some matrix multiplications on top of these, just like in our Excel worksheet, where at the very end we had this matrix that we just did a matrix multiplication with. Now, if you just type the name of your learner object, you can actually see what's in it; you can see the layers in it. So where I was previously skipping a little bit over how we add a few layers to the end, these are actually the layers that we add. We're going to do batch norm in the last lesson, so don't worry about that for now. A linear layer simply means a matrix multiply. So this is a matrix which has 1,024 rows and 512 columns; in other words, it's going to take in 1,024 activations and spit out 512 activations. Then we have a ReLU, which remember just means replace the negatives with zero. We'll skip over the batch norm. We'll come back to the dropout. Then we have a second linear layer that takes those 512 activations from the previous linear layer and puts them through a new matrix multiply, 512 by 120, which spits out 120 new activations, and then finally we put that through softmax. And for those of you that don't remember softmax, we looked at that last week. It's this idea that we basically just take the activation, let's say for dog, go e to the power of that, and then divide it by the sum of e to the power of all the activations. So it was the thing that adds up to 1: all of them together add up to 1, and each one individually is between 0 and 1.
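Here's a minimal sketch of that softmax calculation in plain NumPy, with made-up activations for three classes:

```python
import numpy as np

def softmax(x):
    # e to the power of each activation (shifted by the max for
    # numerical stability), divided by the sum so the results add to 1
    e = np.exp(x - x.max())
    return e / e.sum()

acts = np.array([2.0, -1.0, 0.5])  # made-up activations for three classes
print(softmax(acts))               # each between 0 and 1, summing to 1
```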
So that's what we added on top, and when we have precompute=True, that's the thing we train. Now I wanted to talk about what this dropout layer is and what this p is, because it's a really important thing that we get to choose. A dropout layer with p=0.5 literally does this: we go over to our spreadsheet, pick any layer with some activations, and say, okay, I'm going to apply dropout with a p of 0.5 to conv2. What that means is I go through and, with a 50% chance, I pick a cell, pick an activation; so I pick half of them at random and I delete them. That's what dropout is. So p=0.5 means: what's the probability of deleting that cell? Now when I delete those cells, if you have a look at the output, it doesn't actually change very much at all, just a little bit, particularly because remember it's going through a max pooling layer, so it's only going to change at all if the deleted activation was actually the maximum in that group of four. And furthermore, if it's going into a convolution rather than into a max pool, it's just one piece of that filter. So interestingly, this idea of randomly throwing away half of the activations in a layer has a really interesting result. And one important thing to mention is that on each mini-batch we throw away a different random half of the activations in that layer. And what that means is it forces the network to not overfit. In other words, if there's some particular activation that has learned just that exact dog or that exact cat, then when that gets dropped out, the whole thing isn't going to work as well anymore; it's not going to recognize that image. So in order for this to work, the network has to try and find a representation that continues to work even as a random half of the activations get thrown away, every time. Dropout is, I guess, about three or four years old now, and it's been absolutely critical in making modern deep learning work. The reason why is that it pretty much solved the problem of generalization for us. Before dropout came along, if you tried to train a model with lots of parameters and you were overfitting, and you'd already tried all the data augmentation you could, and you already had as much data as you could get, there were some other things you could try, but to a large degree you were stuck. And so then Geoffrey Hinton and his colleagues came up with this dropout idea, loosely inspired by the way the brain works, and also, apparently, loosely inspired by Geoffrey Hinton's experience with bank tellers. And somehow they came up with this amazing idea of: hey, let's try throwing things away at random. As you can imagine, if your p was 0.01, then you're throwing away 1% of your activations for that layer at random; that's not going to change things very much at all, so it's not really going to protect you from overfitting much at all. On the other hand, if your p was 0.99, that would be like going through the whole thing and throwing away nearly everything. It would be very hard to overfit, so that would be great for generalization, but it's also going to kill your accuracy. So this is the trade-off: high p values generalize well but will decrease your training accuracy, and low p values will generalize less well but will give you a better training accuracy.
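Here's a minimal NumPy sketch of what a dropout layer does, including the rescaling trick that comes up in the Q&A below (the function name and shapes are made up for illustration):

```python
import numpy as np

def dropout(acts, p=0.5, training=True):
    # inverted dropout: zero each activation with probability p, then
    # rescale the survivors by 1/(1-p) so the average activation is
    # unchanged; at inference time dropout is simply turned off
    if not training:
        return acts
    mask = np.random.rand(*acts.shape) >= p  # a new mask every mini-batch
    return acts * mask / (1 - p)

layer = np.random.randn(8)        # some made-up activations
print(dropout(layer, p=0.5))      # roughly half are now zero
```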
So for those of you that have been wondering: why is it that, particularly early in training, my validation losses are better than my training losses? That seems otherwise really surprising; hopefully some of you have been wondering why, because on a data set the model never gets to see, you wouldn't expect the losses to ever be much better. The reason is that when we look at the validation set, we turn off dropout. In other words, when you're doing inference, when you're trying to say "is this a cat or is this a dog", we certainly don't want to be including random dropout; we want to be using the best model we can. So that's why, particularly early in training, we actually see that our validation accuracy and loss tend to be better if we're using dropout. So yes, Yannette, let me pass you that. "Do you have to do anything to accommodate for the fact that you're throwing away some of the activations?" That's a great question. We don't, but PyTorch does. PyTorch behind the scenes does two things: if you say p=0.5, it throws away half of the activations, but it also doubles all of the activations that are left, so that on average the average activation doesn't change. It's a pretty neat trick; you don't have to worry about it, it's done for you. So you can pass in ps. This is the p value for all of the added layers: it says, with fastai, what dropout you want on each of the layers in these added layers. It won't change the dropout in the pre-trained network; the hope is that that's already been pre-trained with some appropriate level of dropout. But on the layers that we add, you can say how much. And so you can see here, I said ps=0.5, so my first dropout has 0.5 and my second dropout has 0.5. And remember, coming into the input of this was the output of the last convolutional layer of the pre-trained network, and we actually throw away half of that before we even start; then we go through our linear layer, throw away the negatives, throw away half of the result of that, go through another linear layer, and then pass that to our softmax. For minor numerical precision reasons, it turns out to be better to take the log of the softmax than the softmax directly, and that's why you'll have noticed that when you actually get predictions out of our models, you always have to go np.exp of the predictions. Again, the details of why aren't important. So if we want to try removing dropout, we can go ps=0. And you'll see, whereas before we started with a 0.76 accuracy in the first epoch, now we've got a 0.8 accuracy in the first epoch. So by not doing dropout, our first epoch works better; not surprising, because we're not throwing anything away. But by the third epoch, where before we had 84.8, here we have 84.1. So it started out better and ended up worse, and even after three epochs you can already see we're massively overfitting: we've got 0.3 loss on the training set and 0.5 loss on the validation set. And if you look now, you can see in the resulting model there's no dropout at all: if p is 0, we don't even add it to the model. Another thing to mention: you might have noticed that we've been adding two linear layers in our additional layers. You don't have to do that, by the way. There's actually a parameter called xtra_fc, for extra fully connected layers, where you can basically pass a list of...
How big do you want each of the additional fully connected layers to be? By default, well, you need to have at least one, because you need something that takes the output of the convolutional layer, which in this case is of size 1,024, and turns it into the number of classes you have: cats versus dogs would be 2, dog breeds 120, the planet satellite data 17, whatever. So you always need at least one linear layer, and you can't pick how big that one is; that's defined by your problem. But you can choose what the other sizes are, or whether they happen at all. So we can pass in an empty list, and now we're saying: don't add any additional linear layers, just the one that we have to have. So here, if you've got ps=0 and the extra fully connected layers list is empty, this is the minimum possible top model we can put on top. And if we do that, you can see above that we actually end up with, in this case, a reasonably good result, because we're not training it for very long, and this particular pre-trained network is very well suited to this particular problem. Yes, Yannette? "So Jeremy, what kind of p should we be using by default?" So the one that's there by default is 0.25 for the first layer and 0.5 for the second layer. That seems to work pretty well for most things; you don't necessarily need to change it at all. Basically, if you find it's overfitting, just start bumping it up. So try first of all setting it to 0.5, which will set them both to 0.5; if it's still overfitting a lot, try 0.7. You can narrow it down, and there aren't that many numbers to change. And if you're underfitting, you can try making it lower, though it's unlikely you'd need to make it much lower; even in these dogs-versus-cats situations we don't seem to have to. So it's more likely you'll be increasing it, to 0.6 or 0.7. But you can fiddle around. I find the ones that are there by default seem to work pretty well most of the time. One place I actually did increase it was in the dog breeds one: I set them both to 0.5 when I used a bigger model. A ResNet34 has fewer parameters, so it doesn't overfit as much; but when I bumped it up to a ResNet50, which has a lot more parameters, I noticed it started overfitting, so then I also increased my dropout. So as you use bigger models, you'll often need to add more dropout. Can you pass that over there, please? "If we set p to 0.5, roughly what percentage is dropped?" 50% dropout. Can you pass that back? Thanks. "Is there a particular way in which you can determine if the data is being overfitted?" Yeah, you can see, like here, that the training loss is much lower than the validation loss. You can't really tell if it's too overfit, though; zero overfitting is not generally optimal. The only way to find that out is to remember that the only thing you're trying to do is get this number, the validation loss, low. So in the end you have to play around with a few different things and see which gets the validation loss lowest. You'll get a feel over time, for your particular problem, for what too much overfitting looks like. Great. So that's dropout, and we're going to be using it a lot. And remember, it's there by default.
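Pulling that together, the learner call in the fastai library from the course looks something like this (a sketch; `arch` and `data` are assumed to be set up as in earlier lessons):

```python
# ps gives the dropout for each of the added layers (the defaults are
# 0.25 then 0.5); xtra_fc sizes the extra fully connected layers, and
# an empty list keeps only the one required final linear layer
learn = ConvLearner.pretrained(arch, data, ps=0.5, xtra_fc=[], precompute=True)
learn.fit(1e-2, 2)
```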
Sorry, was there another question? "So I have two questions. One is: when it says the dropout rate is 0.5, does it delete each cell with a probability of 0.5, or does it just pick 50% randomly? I know they're both effectively the same." It's the former. Okay. "Second question is: why does the average activation matter?" Well, it matters because, remember, if you look at the Excel spreadsheet, the result of this cell, for example, is equal to these nine cells multiplied by each of these nine weights and added up. So if we deleted half of those, that would also cause this number to halve, which would cause everything else after that to change. And if you change what it means, then you're changing something that used to say, for example, "ears are fluffy if this is greater than 0.6"; you're changing the meaning of everything. So the goal here is to delete things without changing what everything means. "Why are we using a linear activation for one of the earlier layers?" Why that particular layer? Because that's what this set of layers is: the pre-trained network is the convolutional network, and that's pre-computed, so we don't see it. What it spits out is a vector, so the only choice we have is to use linear layers at this point. Okay. "Can we have different levels of dropout by layer? And how do we do that?" Great question. You can absolutely have different dropout by layer, and that's why this is actually called ps: you can pass in an array here. So if I went ps=[0, 0.2], for example, and for the extra fully connected layers I added [512], then that's zero dropout before the first of them and 0.2 dropout before the second of them. Good question. And I must admit I don't have a great intuition, even after doing this for a few years, for when earlier or later layers should have different amounts of dropout. It's still something I play with, and I can't quite find rules of thumb. So if some of you come up with some good rules of thumb, I'd love to hear about them. I think, if in doubt, use the same dropout in every fully connected layer. The other thing you can try is what a lot of people do: only put dropout on the very last linear layer. Those would be the two things to try. "So Jeremy, why do you monitor the loss instead of the accuracy?" Because the loss is the only thing that we can see for both the validation set and the training set, so it's nice to be able to compare them. Also, as we'll learn about later, the loss is the thing that we're actually optimizing, so it's a little easier to monitor it and understand what it means. Can you pass it over there? "So with dropout, we are adding some random noise every iteration, right? So that means we don't do as much learning, so we have to play around with the learning rate?" I would say you're probably right in theory, but it doesn't seem to impact the learning rate enough that I've ever noticed it. It might, but not enough that it's ever affected me. Okay. So let's talk about this structured data problem. To remind you, we were looking at Kaggle's Rossman competition, Rossman being a German chain of supermarkets, I believe. You can find this in lesson three, Rossman. The main data set is the one where we were looking to say: at a particular store, how much did they sell? And there are a few key pieces of information. One is: what was the date? Another was: were they open? Did they have a promotion on?
Was it a state holiday there? Or was it a school holiday there? And then we had some more information about stores: for this store, what kind of stuff did they tend to sell? What kind of store are they? How far away is the competition? And so forth. So with a data set like this, there are really two main kinds of column. There are columns that we think of as categorical: they have a number of levels. So the assortment column is categorical, and it has levels such as A, B and C. Whereas something like competition distance we would call continuous: it has a number attached to it where differences or ratios of that number have some kind of meaning. And we need to deal with these two kinds quite differently. Anybody who's done any machine learning of any kind will be familiar with using continuous columns: if you've done any linear regression, for example, you can just multiply them by parameters. Categorical columns we're going to have to think about a little bit more. We're not going to go through the data cleaning and feature engineering; we're going to assume all that's been done. And so basically, at the end of that, we have a list of columns. In this case, I didn't do any of the thinking around the feature engineering or data cleaning myself; this is all directly from the third place winners of this competition, and they came up with all of these different columns that they found useful. So you'll notice here is the list of things that we're going to treat as categorical variables. Numbers like year, month and day: although we could treat them as continuous, since the difference between 2000 and 2003 is meaningful, we don't have to. You'll see shortly how categorical variables are treated, but basically, if we decide to make something a categorical variable, what we're telling our neural net down the track is that for every different level of, say, year (2000, 2001, 2002), it can treat it totally differently. Whereas if we say it's continuous, it's going to have to come up with some kind of function, some kind of smooth-ish function. And so often, even for things like year that actually are continuous but don't have many distinct levels, it works better to treat them as categorical. Another good example is day of week. Day of week is a number between 0 and 6, and it means something: the difference between 3 and 5 is two days, and that has meaning. But if you think about how sales in a store would vary by day of week, it could well be that Saturdays and Sundays are over here, Fridays are over here, and Wednesdays are over here; each day is going to behave qualitatively differently. So by saying this is a categorical variable, as you'll see, we're going to let the neural net do that. So the question of which variables are continuous and which are categorical is, to some extent, a modeling decision you get to make. Now, if something is coded in your data as A, B, and C, or "Jeremy" and "Yannette" or whatever, you're going to have to call that categorical; there's no way to treat it directly as a continuous variable. On the other hand, if it starts out as a continuous variable like age or day of week, you get to decide whether to treat it as continuous or categorical. So, summarized: if it's categorical in the data, it's going to have to be categorical in the model.
If it's continuous in the data, you get to pick whether to make it continuous or categorical in the model. In this case, again, I just did whatever the third place winners of this competition did: these are the ones that they decided to use as categorical, and these are the ones they decided to use as continuous. And you can see that basically the continuous ones are all the ones which are actual floating point numbers: competition distance actually has a decimal place to it, and temperature actually has a decimal place to it. These would be very hard to make categorical because they have many, many levels; if it's five digits of floating point, then potentially there will be as many levels as there are rows. And by the way, the word we use to say how many levels are in a category is cardinality. So if you hear me say cardinality: for example, the cardinality of the day of week variable is seven, because there are seven different days of the week. "Do you have a heuristic for when to bin continuous variables, or do you ever bin variables?" I don't ever bin continuous variables. So one thing we could do with, say, max temperature is group it into 0 to 10, 10 to 20, 20 to 30, and then call that categorical. Interestingly, a paper just came out last week in which a group of researchers found that sometimes binning can be helpful; but that literally came out in the last week, and until that time I hadn't seen anything in deep learning saying that, and I haven't looked at it myself. Until this week I would have said it's a bad idea; now I have to think differently, I guess. Maybe it is, sometimes. "If you're using year as a category, what happens when you run the model on a year it's never seen? Say you trained it in 2000." We'll get there, but the short answer is it'll be treated as an unknown category. Pandas, which is the underlying data frame library we're using, has a special category called unknown, and if it sees a category it hasn't seen before, it gets treated as unknown. So for our deep learning model, unknown is just another category. "If our training data set doesn't have a category and the test set has unknown, how will it treat it?" It'll just be part of this unknown category. "Will it still predict?" It'll predict something: it'll just have the value zero behind the scenes, and if there have been any unknowns of any kind in the training set, then it'll have learned a way to predict unknown. If it hasn't, it's going to have some random vector; that's an interesting detail around training that we probably won't talk about in this part of the course, but we can certainly talk about on the forum. Okay. So we've got our categorical and continuous variable lists defined. In this case there were 800,000 rows: 800,000 dates by stores, basically. And so you can now take all of these columns, loop through each one, and replace each in the data frame with a version where you say: take it and change its type to category. That's just a pandas thing; I'm not going to teach you pandas. There are plenty of books, particularly Wes McKinney's book, Python for Data Analysis, which is great, but hopefully it's intuitive what's going on even if you haven't seen the specific syntax before. So we're going to turn each of those columns into a categorical column. And then for the continuous variables, we're going to make them all 32-bit floating point, and the reason for that is that PyTorch expects everything to be 32-bit floating point.
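In code, the conversion described above looks roughly like this (`cat_vars` and `contin_vars` being the lists of column names defined earlier):

```python
# change each categorical column's type to a pandas category
for v in cat_vars:
    df[v] = df[v].astype('category').cat.as_ordered()

# make every continuous column 32-bit, since PyTorch expects float32
for v in contin_vars:
    df[v] = df[v].astype('float32')
```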
Okay. Some of these are one/zero things; I can't see them straight away, but some of them, like was there a promo, or was there a holiday, will become the floating point values one and zero. Okay. So I try to do as much of my work as possible on small data sets. When I'm working with images, that generally means resizing the images to 64 by 64 or 128 by 128. We can't do that with structured data, so instead I tend to take a sample: I randomly pick a few rows. So I start running with a sample, and I can use exactly the same function we've seen before for getting a validation set to grab some random row numbers for a random sample. So this is just a bunch of random numbers, and that's going to give a size of 150,000 rather than 840,000. And so my data, before I go any further, basically looks like this: you can see I've got some booleans here, I've got some integers here of various different scales, and I've got some letters here. Even though I said please call that a pandas category, pandas still displays it in the notebook as strings; it's just stored differently internally. So then the fastai library has a special little function called proc_df, for "process data frame". proc_df takes a data frame, and you tell it what your dependent variable is, and it does a few different things. The first thing is it pulls out that dependent variable, puts it into a separate variable, and deletes it from the original data frame. So df now does not have the Sales column in it, whereas y just contains the Sales column. Something else that it does is scaling. Neural nets really like the input data to all be somewhere around zero, with a standard deviation of somewhere around one, so we can always take our data, subtract the mean, and divide by the standard deviation to make that happen. That's what do_scale=True does. And it actually returns a special object which keeps track of which mean and standard deviation it used for that normalizing, so you can then do the same thing to the test set later. It also handles missing values. Missing values in categorical variables just become ID zero, and the other categories become one, two, three, four, five for that categorical variable. For continuous variables, it replaces the missing value with the median and creates a new column, a boolean, that just says: is this missing or not? And I'm going to skip over this pretty quickly because we talk about it in detail in the machine learning course; if you've got any questions about this part, that would be a good place to go. There's nothing deep learning specific there. So you can see afterwards, year 2014, for example, has become year two, because these categorical variables have all been replaced with contiguous integers starting at zero. The reason for that is that later on we're going to be putting them into a matrix, and we wouldn't want the matrix to be 2014 rows long when it could just be two rows long. So that's the basic idea there. And you'll see that the a and c values, for example, have been replaced in the same way, with one and three. So we now have a data frame which does not contain the dependent variable and where everything is a number. That's where we need to get to to do deep learning; all of the stages before that, as I said, we talk about in detail in the machine learning course, and there's nothing deep learning specific about any of them.
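The proc_df call itself looks roughly like this (a sketch, assuming `joined_samp` is the sampled data frame from above):

```python
import numpy as np

# pulls out the dependent variable, scales the continuous columns
# (do_scale=True), and returns a mapper that remembers the means and
# standard deviations so the test set can be transformed the same way
df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True)
yl = np.log(y)  # log of sales; see the metric discussion below
```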
This is exactly what we throw into our random forests as well. Another thing we talk about a lot in the machine learning course is validation sets. In this case, we need to predict the next two weeks of sales: not a random set of sales, but specifically the next two weeks. That was what the Kaggle competition folks told us to do, and therefore I'm going to create a validation set which is the last two weeks of my training set, to try and make it as similar to the test set as possible. And Rachel actually wrote this thing last week about creating validation sets, so if you go to fast.ai you can check that out; we'll put it in the lesson wiki as well. It's basically a written summary of a recent machine learning lesson that we did, and the videos are available for that as well. Rachel and I spent a lot of time thinking about how you need to think about validation sets, training sets, test sets and so forth, and that's all there. But again, nothing deep learning specific. So, in this particular competition, as always with any competition or any kind of machine learning project, you really need to make sure you have a strong understanding of your metric and how you're going to be judged. And in this case, Kaggle makes it easy: they tell us we're going to be judged on the root mean squared percentage error. So we say: you predicted 3, it was actually 3.3, so you were 10% out; then we square all those percentage errors, average them, and take the square root. And remember, I warned you that you're going to need to know logarithms really well. In this case, the thing we care about is basically your prediction divided by the actual. We don't have a metric in PyTorch called root mean squared percentage error; we could actually easily create it, by the way (if you look at the source code, you'll see it's a line of code). But the easier approach is to realize that if you have a ratio a/b, then you can replace a with log(a) and b with log(b), and replace the division with a subtraction. That's just the rule of logs, and if you don't know that rule, make sure you go look it up, because it's super helpful. What it means in this case is that all we need to do is take the log of our data, which I actually did earlier in this notebook; and once you've taken the log of the data, computing the root mean squared error actually gets you the root mean squared percentage error for free. But then when we want to print out our root mean squared percentage error, we have to go e to the power of it again, and then we can return the percentage difference. So that's all that's going on here. It's, again, not really deep learning specific at all.
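So the metric, roughly as it appears in the lesson notebook, comes down to training with root mean squared error on the logged sales, plus this function to turn predictions back into a percentage error for printing:

```python
import numpy as np

def inv_y(a): return np.exp(a)   # undo the log, since log(a/b) = log(a) - log(b)

def exp_rmspe(y_pred, targ):
    # root mean squared percentage error on the un-logged values:
    # sqrt(mean(((actual - predicted) / actual) ** 2))
    targ = inv_y(targ)
    pct_var = (targ - inv_y(y_pred)) / targ
    return np.sqrt((pct_var ** 2).mean())
```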
So here we finally get to the deep learning. As per usual, everything we look at today looks exactly the same as everything we've looked at so far: first we create a model data object, something that has a validation set, a training set, and an optional test set built into it. From that we'll get a learner; we'll then optionally call learner.lr_find; we'll then call learner.fit, with all the same parameters you've seen many times before. The difference, though, is that obviously we're not going to go ImageClassifierData.from_csv or .from_paths; we need a different kind of model data. For stuff that's in rows and columns, we use ColumnarModelData, but this returns an object with basically the same API that you're familiar with. And rather than from_paths or from_csv, this is from_data_frame. This gets passed a few things. The path here is just used for it to know where to store things like model files; it's basically saying, where do you want to store anything that you save later? Then there's the list of the indexes of the rows that we want to put in the validation set, which we created earlier. And here's our data frame. And then, let's have a look: this is where we did the log. I took the y that came out of proc_df, our dependent variable, logged it, and called that yl. So when we create our model data, we tell it that that's our dependent variable. So far we've got: the list of stuff to go in the validation set, what our independent variables are, and what our dependent variable is. And then we have to tell it which things we want treated as categorical. Because remember, by this time everything's a number; it could do the whole thing as if everything were continuous, and that would be totally meaningless. So we need to tell it which things to treat as categories, and here we just pass in that list of names that we used before. And then a bunch of the parameters are the same as the ones you're used to; for example, you can set the batch size here. So after we do that, we've got a standard model data object: there's a trn_dl attribute, a val_dl attribute, a trn_ds attribute, a val_ds attribute; it's got a length; it's got all the stuff exactly like it did in all of our image-based data objects. Okay, so now we need to create the model, or rather create the learner. And, to skip ahead a little bit, we're basically going to pass in something that looks pretty familiar: we're going to say, from our model data, create a learner that is suitable for it, and we'll be passing in a few other bits of information, which will include how much dropout to use at the very start, how many activations to have in each layer, and how much dropout to use at the later layers. But then there are a couple of extra things we need to learn about, and specifically it's this thing called embeddings. This is really the key new concept we have to learn about. So let's forget about categorical variables for a moment and just think about the continuous variables. For our continuous variables, all we're going to do is grab them all. So we basically say: okay, here's a big list of all of our continuous variables, like the minimum temperature and the maximum temperature and the distance to the nearest competitor and so forth; just a bunch of floating point numbers. And what the neural net's going to do is take that 1D array, or vector, or, to be very DL, rank one tensor (they all mean the same thing), and put it through a matrix multiplication. So let's say this has got, I don't know, 20 continuous variables; then we can put it through a matrix which must have 20 rows.
That's how matrix multiplication works. And then we can decide how many columns we want; so maybe we decide on 100. And so that matrix multiplication is going to spit out a new length-100 rank one tensor. That's what a matrix product does, and that's the definition of a linear layer in deep learning. And then the next thing we do is put that through a ReLU, which means we throw away the negatives. And now we can put that through another matrix product; this one has to have 100 rows, by definition, and we can have as many columns as we like. And let's say maybe this was the last layer. The thing we're trying to do is predict sales, so there's just one value we're trying to predict. So we could put it through a matrix product that just has one column, and that's going to spit out a single number. So that's like a one-layer neural net, if you like. In practice, we wouldn't make it one layer; so we'd actually have, say, maybe 50 here, which gives us a 50-long vector, and then maybe we put that into a final 50-by-1 matrix, which spits out a single number. And one reason I wanted to draw it that way was to point out: you would never put a ReLU in the last layer. You'd never want to throw away the negatives there, because the softmax needs negatives in it; it's the negatives that allow it to create low probabilities. That's a minor detail, but it's useful to remember. So basically, a simple view of a fully connected neural net is something that takes in as input a rank one tensor, spits it through a linear layer, an activation layer, another linear layer, a softmax, and that's the output. We could obviously decide to add more linear layers, and we could decide to add dropout; these are some of the decisions we get to make. But there's not that much we can do: there's no really crazy architecture stuff to do. When we come back to image models later in the course, we're going to learn about all the weird things that go on in ResNets and Inception networks and so on, but these fully connected networks are really pretty simple: just interspersed linear layers, that is, matrix products, and activation functions like ReLU, with a softmax at the end. And if it's not classification (and ours is actually not classification; in this case we're trying to predict sales), there isn't even a softmax. We don't want the output to be between zero and one, so we can just throw away the last activation function altogether. If we have time, we can talk about a slight trick we can do there, but for now we can think of it that way. So that was all assuming everything was continuous. But what about categorical? We've got a day of week, and we're going to treat it as categorical: Saturday, Sunday, Monday, and so on through Friday. How do we feed that in? Because I want to find a way of getting that in so that we still end up with a rank one tensor of floats. And the trick is this: we create a new little matrix with seven rows and as many columns as we choose; let's pick four. So here's seven rows and four columns. And what we do is, let's say the first row of our data was a Sunday: we do a lookup into this matrix. We say, here's Sunday; we look it up in here, and we grab its row.
And so this matrix, we basically fill with floating point numbers, and we end up grabbing a little subset of four floating point numbers: Sunday's particular four floating point numbers. And that way we convert Sunday into a rank one tensor of four floating point numbers. Initially those four numbers are random; in fact, this whole thing initially starts out random. But then we put it through our neural net: we take those four numbers, remove Sunday, and instead append our four numbers. So we've turned our categorical thing into a floating point vector, and now we can just put that through our neural net, just like before. At the very end we find the loss, and then we figure out which direction is down and do gradient descent in that direction. And eventually that will find its way back to this little list of four numbers, and it will say: okay, those random numbers weren't very good; this one needs to go up a bit, that one needs to go down a bit, and so on. So we actually update those original four numbers in that matrix, and we do this again and again and again. And so this matrix stops looking random, and it starts looking more and more like the exact four numbers that happen to work best for Sunday, the exact four numbers that happen to work best for Friday, and so forth. In other words, this matrix is just another bunch of weights in our neural net. And matrices of this type are called embedding matrices. So an embedding matrix is something where we start out with an integer between zero and the maximum number of levels of that category, and we literally index into a matrix to find a particular row. If the level was one, we take the row at index one, grab it, and append it to all of our continuous variables; and so we now have a new, longer vector of continuous values. And then we can do the same thing for, let's say, zip code. We could have an embedding matrix for that too. Let's say there are 5,000 zip codes: it would be 5,000 rows long, and as wide as we decide, maybe 50 wide. And so we'd say: okay, here's 94003; that zip code is index number 4 in our matrix, so go down, find the fourth row, grab those 50 numbers, and append them onto our big vector. And everything after that is just the same: we put it through our linear layer, ReLU, linear layer, whatever. "What do those four numbers represent?" That's a great question, and we'll learn more about that when we look at collaborative filtering. For now, they represent no more and no less than any other parameter in our neural net: they're just parameters that we're learning that happen to end up giving us a good loss. We will discover later that these particular parameters often are human interpretable and can be quite interesting, but that's a side effect, not fundamental. Right now they're just four random numbers that we're learning; or sets of four random numbers.
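Here's a tiny PyTorch sketch of the lookup-and-append idea, wired into a couple of linear layers (all the sizes are made up for illustration):

```python
import torch
from torch import nn

emb = nn.Embedding(8, 4)      # day of week: 8 rows (7 days + unknown), 4 wide

day = torch.tensor([0])       # say Sunday is encoded as index 0
contin = torch.randn(1, 20)   # 20 made-up continuous variables
x = torch.cat([emb(day), contin], dim=1)  # a rank one tensor of 24 floats

model = nn.Sequential(
    nn.Linear(24, 100), nn.ReLU(),  # matrix multiply, throw away negatives
    nn.Linear(100, 1),              # one output, e.g. predicted log sales
)
print(model(x))                     # a single number
```

The gradients flow all the way back into `emb`, so those four numbers per day get updated by gradient descent just like any other weights.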
"Do you have a good heuristic for the dimensionality of the embedding matrix? So why four here?" I sure do. Wait, there's more. What I first of all did was make a little list of every categorical variable and its cardinality, and there they all are. There are over a thousand different stores in Rossman's network. There are eight "days of the week": that's seven days of the week plus one left over for unknown. Even if there were no missing values in the original data, I always still set aside one, just in case there's a missing or an unknown or something different in the test set. Again, four years, but that's actually three plus room for an unknown, and so forth. So my rule of thumb is this: take the cardinality of the variable, divide it by two, but don't make it bigger than 50. So these are my embedding matrices. My store matrix has to have 1,116 rows, because I need to be able to look up, say, store number three, and it's going to return a rank one tensor of length 50. Day of week looks up which one of the eight and returns a thing of length four. "So would you typically build an embedding matrix for each categorical feature?" Yes, and that's what I've done here. I've said: for each c in the categorical variables, see how many categories there are, and for each of those create one of these pairs. This is called the embedding sizes, and you may have noticed that it's actually the first thing that we pass to get_learner; so that tells it, for every categorical variable, what embedding matrix to use for that variable. Now, just behind you there's a question. "So besides random initialization, are there other ways of initializing embeddings?" Yes and no. There are two ways: one is random, the other is pre-trained, and we'll probably talk about pre-training more later in the course. But the basic idea is: just like you would use a pre-trained net from ImageNet to look at pictures of cats and dogs, if somebody else at Rossman had already trained a neural net to predict cheese sales, you may as well start with their embedding matrix of stores to predict liquor sales. This is what happens, for example, at Pinterest and Instacart: they both use this technique. Instacart uses it for routing their shoppers; Pinterest uses it for deciding what to display on a web page when you go there. They have embedding matrices of products, or in Instacart's case of stores, that get shared in the organization, so people don't have to train new ones. "For the embedding size, why wouldn't you just use a one-hot scheme? What is the advantage of doing this as opposed to just doing one-hot encoding?" Good question. We could easily, as you point out, instead of passing in these four numbers, have passed in seven numbers, all zeros except for a single one. That's also a list of floats, and that would totally work; it's how, generally speaking, categorical variables have been used in statistics for many years. It's called dummy variable coding. The problem is that in that case, the concept of Sunday could only ever be associated with a single floating point number, so it gets this kind of linear behavior: it says Sunday is more or less of a single thing. And it's not just about interactions: with an embedding, Sunday is now a concept in four-dimensional space. What we tend to find happens is that these embedding vectors pick up rich semantic concepts. For example, if it turns out that weekends have different behavior, you'll tend to see that Saturday and Sunday have some particular number that's higher. Or, more likely, it turns out that certain days of the week are associated with higher sales of certain kinds of goods that you can't go without, I don't know, like gas or milk, say.
Whereas there might be other products, like wine, that tend to be associated with the days before weekends or holidays. So there might be a column which is like: to what extent is this day of the week associated with people going out? So basically, by having this higher dimensional vector rather than just a single number, you give the deep learning network a chance to learn these rich representations. And this idea of an embedding is what's called a distributed representation; it's kind of the most fundamental concept of neural networks. It's the idea that a concept in a neural network has a high dimensional representation, and often it can be hard to interpret, because each of these numbers in the vector doesn't even have to have just one meaning: it could mean one thing if this one is low and that one's high, and something else if that one's high and this one's low, because it's going through this rich non-linear function. It's this rich representation that allows it to learn such interesting relationships. Oh, another question? Sure. I'll speak louder. "So I get the fundamentals, like the vector algebra you can run on word vectors; but are embeddings only suitable for certain types of variables?" An embedding is suitable for any categorical variable. The only thing it can't work well for would be something with too high a cardinality. In other words, we had, whatever it was, 600,000 rows; if you had a variable with 600,000 levels, that's just not a useful categorical variable. You could bucketize it, I guess. But in general, you can see here that the third place getters in this competition really decided that everything that was not too high cardinality, they made a categorical variable, and I think that's a good rule of thumb: if you can make something a categorical variable, you may as well, because that way it can learn this rich distributed representation. Whereas if you leave it as continuous, the most it can do is try to find a single functional form that fits it well. "I have a question. So you were saying that you're kind of increasing the dimension, but actually in most cases we would use a one-hot encoding, which has an even bigger dimension; so in a way you are also reducing it, but in a richer way." I think that's fair, yes. You can think of it versus a one-hot encoding, which actually is high dimensional, but it's not meaningfully high dimensional, because everything except one entry is a zero. "I'm saying that also because this will reduce the amount of memory and things like that; in practical terms this is better." You're absolutely right. And so we may as well go ahead and actually describe what's going on with the matrix algebra behind the scenes. If this doesn't quite make sense you can skip over it, but for some people I know this really helps. If we started out with something saying "this is Sunday", we could represent that as a one-hot encoded vector: Sunday, say, is in this position, so that entry is a one and the rest are zeros. And then we've got our embedding matrix, with eight rows and, in this case, four columns. One way to think of the embedding, then, is as a matrix product. I said you could think of it as looking at the number one and finding its index in the array; but if you think about it, that's actually identical to doing a matrix product between the one-hot encoded vector and the embedding matrix: you go zero times this row, one times this row, zero times this row, and so on. So a one-hot encoding times the embedding matrix is identical to doing a lookup. And some people, in the bad old days, actually implemented embedding matrices by doing a one-hot encoding and then a matrix product; in fact, a lot of machine learning methods still kind of do that. But as the question was alluding to, that's terribly inefficient, so all of the modern libraries implement this as: take an integer and do a lookup into an array. The nice thing about realizing that it's mathematically a matrix product, though, is that it makes it more obvious how the gradients are going to flow: when we do stochastic gradient descent, it's just another linear layer. As I say, that's a somewhat minor detail, but hopefully for some of you it helps.
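Here's a tiny sketch demonstrating that equivalence, with made-up numbers:

```python
import torch

emb = torch.randn(7, 4)      # 7 levels, 4-dimensional embedding

one_hot = torch.zeros(7)
one_hot[1] = 1.0             # level 1, one-hot encoded

print(one_hot @ emb)         # the matrix product...
print(emb[1])                # ...gives exactly the same row as a plain lookup
```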
"Could you touch on using dates and times as categoricals, and how that affects seasonality?" Yeah, absolutely, that's a great question. Did I cover dates at all last week? No? Great. So I cover dates in a lot of detail in the machine learning course, but it's worth briefly mentioning here. There's a fastai function called add_datepart, which takes a data frame and a column name; that column needs to be a date. It removes the column from the data frame (unless you've passed drop=False) and replaces it with lots of columns representing all of the useful information about that date: day of week, day of month, month of year, year, is it the start of a quarter, is it the end of a quarter; basically everything that pandas gives us. And so when we look at our list of features, we can see them here: year, month, week, day, day of week, etc. These all get created for us by add_datepart. So we end up with, say, this 8-row-by-4-column embedding matrix for day of week, and conceptually that allows our model to create some pretty interesting time series models. If there's something that has a seven day cycle that goes up on Mondays and down on Wednesdays, but only for dairy and only in Berlin, it can totally do that: it has all the information it needs. This turns out to be a really fantastic way to deal with time series, so I'm really glad you asked the question. You just need to make sure that the cycle indicator in your time series exists as a column. If you didn't have a column there called day of week, it would be very, very difficult for the neural network to somehow learn to do a divide-mod-seven and then look the result up in an embedding matrix. Not impossible, but really hard: it would take lots of computation, and it wouldn't do it very well. An example of the kind of thing you need to think about might be holidays; or, if you were doing something with sales of beverages in San Francisco, you'd probably want a list of when the ballgame is on at AT&T Park, because that's going to impact how many people are drinking beer in SoMa. So you need to make sure the basic indicators or periodicities are there in your data, and as long as they are, the neural net is going to learn to use them. So I'm kind of trying to skip over some of the non-deep-learning parts.
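For reference, calling it looks roughly like this (the import path and column names are from the course-era fastai library, so treat them as assumptions):

```python
from fastai.structured import add_datepart  # assumed fastai 0.7-era import

# replaces (or, with drop=False, supplements) the 'Date' column with
# Year, Month, Week, Day, Dayofweek, start/end-of-period flags, etc.
add_datepart(df, 'Date', drop=False)
df[['Date', 'Year', 'Month', 'Dayofweek']].head()
```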
All right, so the key thing here is that we've got our model data that came from the data frame. We tell it how big to make the embedding matrices, and we also have to tell it, of the columns in that data frame, how many are continuous variables. The actual parameter is the number of continuous variables, and we just pass in how many columns there are minus how many categorical variables there are. That way the neural net knows how to create something that puts the continuous variables over here and the categorical variables over there. The embedding matrices have their own dropout, so this first number is the dropout that applies to the embeddings. Then this is the number of activations in the first linear layer, the activations in the second linear layer, the dropout for the first linear layer, and the dropout for the second linear layer. This y_range bit we won't worry about for now. And then finally: how many outputs do we want to create? This is the output of the last linear layer, and obviously it's 1, because we want to predict a single number, which is sales. So after that we have a learner, where we can call lr_find, and we get the standard looking shape to decide what learning rate we want to use; and we can then go ahead and start training using exactly the same API we've seen before. So this is all identical. You can pass in (I'm not sure if you've seen this before) custom metrics. What that does is it just says: please print out a number at the end of every epoch by calling this function, the function we defined a little bit earlier, the root mean squared percentage error, going e to the power of our predictions, because our sales were originally logged. This doesn't change the training at all; it's just something to print out. So we trained that for a while, and we've got some benefits that the original competitors didn't have, specifically things like stochastic gradient descent with restarts and cyclical learning rates. And so it's actually interesting to compare. Although our validation set isn't identical to the test set, it's very similar: a two week period at the end of the training data, so our numbers should be similar. And if we look at what we get, 0.097, and compare that to the public leaderboard... let's have a look at the top. Actually, that's interesting: there's a big difference between the public and private leaderboards. It would have been right at the top of the private leaderboard, but only in the top 30 or 40 on the public leaderboard. So I'm not quite sure, but you can see we're certainly at the top end of this competition. I actually tried running the third place getters' code, and their final result was over 0.1, so I actually think ours should be compared to the private leaderboard; but I'm not sure. So anyway, you can see there's basically a technique here for dealing with time series and structured data. And interestingly, the group that used this technique actually wrote a paper about it, which is linked in this notebook. When you compare them to the folks that won this competition and came second, those other folks did way more feature engineering: the winners of this competition were actually subject matter experts in logistics sales forecasting, and they had their own code to create lots and lots of features. And talking to the folks at Pinterest, who built a very similar model for recommendations, they said the same thing: when they switched from gradient boosting machines to deep learning, they did way, way less feature engineering.
So we trained that for a while, and we've got some benefits that the people who originally built this didn't have, specifically things like stochastic gradient descent with restarts. And it's actually interesting to have a look and compare: although our validation set isn't identical to the test set, it's very similar (a two-week period at the end of the training data), so our numbers should be comparable. If we look at what we get, 0.097, and compare that to the public leaderboard... actually, that's interesting: there's a big difference between the public and private leaderboards. This would have been right at the top of the private leaderboard, but only in the top 30 or 40 on the public leaderboard, so I'm not quite sure, but you can see we're certainly at the top end of this competition. I actually tried running the third-place getters' code, and their final result was over 0.1, so I think we should be comparing to the private leaderboard, but I'm not sure. So anyway, you can see there's basically a technique here for dealing with time series and structured data. And interestingly, the group that used this technique actually wrote a paper about it, which is linked in this notebook. When you compare it to the folks who won this competition and came second, those folks did way more feature engineering: the winners were actually subject-matter experts in logistics sales forecasting, and they had their own code to create lots and lots of features. And talking to the folks at Pinterest, who built a very similar model for recommendations at Pinterest, they said the same thing: when they switched from gradient boosting machines to deep learning, they did way, way less feature engineering. It was a much simpler model and required much less maintenance. And so this is one of the big benefits of using this approach to deep learning: you can get state-of-the-art results, but with a lot less work. Yes, are we using any time series in any of these fits? Indirectly, absolutely. Using what we just saw, we have day of week, month of year, all those columns, and most of them are being treated as categories, so we're building a distributed representation of January, a distributed representation of Sunday, a distributed representation of Christmas. We're not using any classic time series techniques; all we're doing is two fully connected layers on top of the embedding matrices. The embedding matrix is able to deal with things like day-of-week periodicity in a richer way than any standard time series technique I've come across. And one last question: in the earlier models, when we did the CNN, we didn't pass the data during the fit; we passed it when we got the data, so we weren't passing anything to fit except the learning rate and the number of cycles, while in this case we're passing in metrics and other things. There is a difference, in that here we're calling data.get_learner. With the imaging approach we just create a learner and pass it the data, but for these kinds of models, in fact for a lot of models, the model we build depends on the data: in this case we actually need to know what embedding matrices we have, and so forth. So here it's actually the data object that creates the learner; yes, it is a bit upside down compared to what we've seen before. I have a question, just to summarize, or maybe I'm confused: in this case, we have some kind of structured data, we did feature engineering, we got a columnar database or something similar, a pandas data frame, and then we're mapping it to deep learning by using this embedding matrix for the categorical variables, while the continuous ones we just put straight in. So all I need to do, if I already have a feature-engineered model, to map it to deep learning is figure out which columns I can treat as categorical and then let it learn by itself? Great question; yes, exactly. If you want to use this on your own data set: step one, list the categorical variable names and the continuous variable names, and put them in a pandas data frame. Step two, create a list of which row indexes you want in your validation set. Step three, call this line of code; you can just copy and paste it. Step four, create your list of how big you want each embedding matrix to be. Step five, call get_learner; you can use these exact parameters to start with, and if it overfits or underfits you can adjust them. And then the final step is to call fit. So yes, almost all of this code will be nearly identical.
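Sketched as code, with hypothetical column names, and the fastai calls as I remember them from the Rossmann notebook (verify the exact signatures against it; `joined` is the assumed name of the engineered data frame):

```python
# Step 1: name the categorical and continuous columns (hypothetical examples)
cat_vars = ['Store', 'DayOfWeek', 'StateHoliday', 'Month']
contin_vars = ['CompetitionDistance', 'Temperature']
df, y, nas, mapper = proc_df(joined, 'Sales', do_scale=True)

# Step 2: pick the row indexes for the validation set (e.g. the last two weeks)
val_idx = list(range(len(df) - 5000, len(df)))

# Step 3: build the model data object straight from the data frame
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, y,
                                       cat_flds=cat_vars, bs=128)

# Step 4: one embedding size per categorical variable
cat_sz = [(c, len(joined[c].cat.categories) + 1) for c in cat_vars]
emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]

# Step 5: get the learner (dropouts and layer sizes as starting points)
m = md.get_learner(emb_szs, len(df.columns) - len(cat_vars),
                   0.04,           # embedding dropout
                   1,              # one output: sales
                   [1000, 500],    # activations in the two linear layers
                   [0.001, 0.01])  # dropout for each linear layer

# Step 6: fit
m.fit(1e-3, 3)
```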
I have a couple of questions: one is, how can data augmentation be used in this case, and the second is, what are the dropouts doing here? Okay, so data augmentation: I have no idea. That's a really interesting question. I think it would have to be domain specific; I've never seen any paper, or anybody in industry, doing data augmentation with structured data and deep learning. I think it can be done, I just haven't seen it done. What is dropout doing? Exactly the same as before: the output of each of these linear layers is just a rank-one tensor, and so dropout is going to go ahead and say, let's throw away half of the activations. And the very first dropout, the embedding dropout, literally goes through the embedding matrix output and says, let's throw away half of those activations. That's it. Okay, let's take a break and come back at five past eight. Okay, thanks everybody. So now we're going to move into something equally exciting. Actually, before I do, I got a good question during the break, which was: what's the downside? Almost no one is using this; why not? Basically, I think the answer is, as we discussed before, that almost no one in academia is working on this, because it's not something people really publish on, and as a result there haven't been really great examples where people could look and say, oh, here's a technique that works well, so let's have our company implement it. But perhaps equally importantly, until now, with this fastai library, there hasn't been any way to do it conveniently. If you wanted to implement one of these models, you had to write all the custom code yourself, whereas now, as we discussed, it's basically a six-step process involving not much more than six lines of code. The reason I mention this is to say: I think there are a lot of big commercial and scientific opportunities to use this to solve problems that previously haven't been solved very well. So I'll be really interested to hear if some of you try this out, maybe on old Kaggle competitions; you might find, oh, I would have won this if I'd used this technique, which would be interesting. Or if you've got some data set you work with at work, some kind of model you've been building with GBMs or a random forest, does this help? I'm still somewhat new to this: I've been working on these structured deep learning models basically since the start of the year, so I haven't had enough opportunity to know where it might fail. It's worked for nearly everything I've tried it with so far. But I think this class is the first time there's going to be more than half a dozen people in the world actually working on this, so as a group we'll hopefully learn a lot and build some interesting things. And this would be a great area if you're thinking of writing a post, because there's very little out there: there's a post from Instacart about what they did, Pinterest has an O'Reilly AI video about what they did, and that's about it. There are also two academic papers, both about Kaggle competition victories: one from Yoshua Bengio and his group, who won a taxi destination forecasting competition, and then the one linked for this Rossmann competition. So there's some background on that. All right, so natural language processing. NLP is kind of the most up-and-coming area of deep learning; it's maybe two or three years behind computer vision. It was the second area that deep learning started getting really popular in. Computer vision got to the point where deep learning was the clear state of the art for most computer vision problems in 2014, and for some things as early as 2012; in NLP we're still at the point where, for a lot of things, deep learning is now the state of the art, but not quite everything. And as you'll see, the state of the software and some of the concepts is much less mature than it is for computer vision.
So in general, none of the stuff we talk about after computer vision is as settled as the computer vision material was. In NLP, one of the interesting things is that in the last few months some of the good ideas from computer vision have started to spread into NLP for the first time, and we've seen some really big advances, so a lot of the stuff you'll see in NLP is pretty new. I'm going to start with a particular kind of NLP problem. One of the things you'll find in NLP is that there are particular problems you can solve, and they have particular names. There's a particular kind of problem called language modeling, and language modeling has a very specific definition: it means building a model where, given a few words of a sentence, you can predict what the next word is going to be. If you're using your mobile phone and you're typing away, and you press space and it suggests what the next word might be: SwiftKey does this really well, and SwiftKey actually uses deep learning for it. That's a language model. So it has a very specific meaning: when we say language modeling, we mean a model that can predict the next word of a sentence. Let me give you an example. I downloaded about 18 months' worth of papers from arXiv. For those of you who don't know it, arXiv is the most popular preprint server in this community and various others, and it has lots of academic papers. I grabbed the abstracts and the topics for each. So here's an example: the category of this particular paper was cs.NI, which is computer science, networking, and the summary, the abstract of the paper, said 'the exploitation of mmWave bands is one of the key enablers for 5G mobile' blah blah blah. So here's an example piece of text from my language model. I trained a language model on this arXiv data set that I downloaded, and then I built a simple little test where you pass it some priming text. So you'd say, imagine you started reading a document that said 'category: computer science, networking; summary: algorithms that', and then I said, please write an arXiv abstract. And it wrote: 'algorithms that use the same network as a single node are not able to achieve the same performance as traditional network-based routing algorithms. In this paper we propose a novel routing scheme' blah blah blah. So it's learned, by reading arXiv papers, that somebody writing 'algorithms that', where the token cs.NI came before it, is going to talk like this. And remember, it started out not knowing English at all: it started out with an embedding matrix for every word in English that was random, and by reading lots of arXiv papers it learnt what kinds of words follow others. Then I tried: what if we said category computer science, computer vision; summary: 'algorithms that'? It wrote: 'algorithms that use the same data to perform image classification are increasingly being used to improve the performance of image classification algorithms. In this paper we propose a novel method for image classification using a deep convolutional neural network (CNN)'. So you can see it's almost the same sentence as before, but things have been changed into the world of computer vision rather than networking. Then I tried something else: category computer vision, and I created the world's shortest-ever abstract, just 'algorithms', and then I said the title starts with 'on'. And the title it generated was 'on the performance of deep learning for image classification'.
(EOS is end-of-stream, so that marks the end of the title.) What if it was networking, with summary 'algorithms' and a title starting with 'on'? It gave 'on the performance of wireless networks'. Priming the title with 'towards' instead: for computer vision it gave 'towards a new approach to image classification', and for networking, 'towards a new approach to the analysis of wireless networks'. I find this mind-blowing. I started out with some random matrices, literally no pre-trained anything, I fed it 18 months' worth of arXiv articles, and it learned not only how to write English pretty well, but also that after you say something is a convolutional neural network you should then use parentheses to give its abbreviation, and furthermore that the kinds of things people say they create algorithms for in computer vision are performing image classification, while in networking they're achieving the same performance as traditional network-based routing algorithms. So a language model can be incredibly deep and subtle. And we're going to try and build that, but actually not because we care about it at all: we're going to build it because we're going to try to create a pre-trained model. What we're actually going to do is take IMDB movie reviews and figure out whether they're positive or negative. If you think about it, this is a lot like cats versus dogs: it's a classification problem, except that rather than an image, we have the text of a review. I'd really like to use a pre-trained network; I'd at least like to start with a network that knows how to read English. And my view was: to know how to read English means you should be able to predict the next word of a sentence. So what if we pre-train a language model, and then use that pre-trained language model and, just like in computer vision, stick some new layers on the end and ask it, instead of predicting the next word in the sentence, to predict whether something is positive or negative? When I started working on this, it was actually a new idea; unfortunately, in the last couple of months a few people have started publishing it, so it's gone from being a totally new idea to being a fairly new idea. But this idea of creating a language model and making it the pre-trained model for a classification model is what we're going to learn to do now, and the idea is that we're really trying to leverage exactly what we learned in our computer vision work: how to do fine-tuning to create powerful classification models. Yes? Why do you think that directly training the model you actually want doesn't work better? Well, A, because it doesn't: it just turns out, empirically, that it doesn't, and for a number of reasons. First of all, as we know, fine-tuning a pre-trained network is really powerful: if we can get it to learn a related task first, then we can use all that information to help it on the second task. The other reason is that IMDB movie reviews are up to a thousand words long; they're pretty big. So after reading a thousand words, knowing nothing about how English is structured, or even what the concept of a word or punctuation is, all you get at the end of those thousand integers (the words end up being integers) is a one or a zero: positive or negative. Trying to learn the entire structure of English, and then how it expresses positive and negative sentiments, from a single number is just too much to expect.
So by building a language model first, we can try to build a neural network that kind of understands the English of movie reviews, and then we hope that some of the things it's learned will be useful in deciding whether something is a positive or a negative review. That's a great question. Thanks. Is this similar to the char-RNN by Karpathy? Yeah, this is somewhat similar: the famous char (as in C-H-A-R) RNN tried to predict the next letter given a number of previous letters. Language models generally work at a word level (they don't have to), and doing things at a word level turns out to be quite a bit more powerful; we're going to focus on word-level modeling in this course. And to what extent are these generated words actual copies of what it found in the training data, versus things it actually learned, and how do we know how to distinguish between those two? Yeah, these are all good questions. The words are definitely words it's seen before, because it's not at a character level, so it can only give us words it's seen before. As for the sentences, there are a number of rigorous ways of checking, but I think the easiest is to get a sense of it from examples like these, where for two different categories it's created very similar concepts but mixed them up in just the right way: it would be very hard to do what we've seen here just by spitting back things it's seen before. You could of course go back and check: have you seen that exact sentence before, or, with a string distance, have you seen a similar sentence before? But most importantly, when we train the language model, as we'll see, we'll have a validation set, so we're trying to predict the next word of something it's never seen before, and if it's good at doing that, it should be good at generating text. In this case the purpose is not to generate text; that was just a fun example, so I'm not really going to study it too much. But during the week you can totally build your great American novel generator or whatever. There are actually some tricks to using language models to generate text that I'm not using here; they're pretty simple, and we can talk about them on the forum if you like. My focus is on classification, because I think that's the thing which is incredibly powerful. Text classification: say you're a hedge fund and you want to read every article as soon as it comes out through Reuters or Twitter or whatever, and immediately identify things which in the past have caused massive market drops; that's a classification model. Or you want to recognize all the customer service queries which tend to be associated with people who cancel their contracts in the next month; that's a classification problem. So it's a really powerful kind of thing for data journalism, activism, law, commerce, and so forth. (I'm trying to classify documents by whether or not they're part of legal discovery.) So you get the idea. In terms of what we're importing, there are a few new things here. One of them is torchtext: torchtext is PyTorch's NLP library, and fastai is designed to work hand in hand with torchtext, as you'll see. And then there are a few text-specific sub-parts of fastai that we'll be using.
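For reference, the notebook's imports look something like this; I'm reconstructing them from memory of the fastai 0.7-era library layout, so treat them as a sketch:

```python
from fastai.learner import *

import torchtext
from torchtext import vocab, data

from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle  # used later to save/load the TEXT field
```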
We're going to be working with the IMDB large movie review data set, which is very, very well studied in academia; lots and lots of people over the years have studied it: 50,000 highly polarized reviews, either positive or negative, each one classified by sentiment. We're going to try, first of all, to create a language model, so we're going to ignore the sentiment entirely: just like dogs and cats, pre-train the model to do one thing, then fine-tune it to do something else. Because this kind of idea is so new in NLP, there are basically no models you can download for this, so we're going to have to create our own. Having downloaded the data (you can use the link here), we do the usual stuff of setting the path to it and the training and validation paths, and as you can see, it looks pretty traditional compared to vision: there's a training directory and a test directory (we don't actually have separate test and validation sets in this case), and just like in vision, the training directory has a bunch of files in it, in this case representing not images but movie reviews. So we can cat one of those files, and here we learn about the classic movie Zombiegeddon: 'I have to say, with a name like Zombiegeddon and an atom bomb on the front cover, I was expecting a flat-out chop-socky fung-ku... rent it if you want to get stoned and laugh with your buddies; don't rent it if you're an uptight weenie or want a zombie movie with lots of flesh-eating.' I think I'm going to enjoy Zombiegeddon, so we've learned something today. We can use standard Unix tools to see how many words are in the data set: in the training set we've got 17 and a half million words; in the test set, 5.6 million. This is IMDB, written by random people; these are not New York Times reviews, as far as I know. Before we can do anything with text, we have to turn it into a list of tokens. A token is basically a word. We're going to try to turn this eventually into a list of numbers, and the first step is to turn it into a list of words; that's called tokenization in NLP (NLP has a lot of jargon that we'll learn over time). One thing that's a bit tricky when doing tokenization: here I've tokenized that review and then joined it back up with spaces, and you'll see that 'wasn't' has become two tokens, which makes perfect sense ('was' and 'n't' are two things), 'dot dot dot' has become one token, whereas lots of exclamation marks have become lots of tokens. A good tokenizer will do a good job of recognizing the pieces of an English sentence: each separate piece of punctuation will be separated, and each part of a multi-part word will be separated as appropriate. spaCy, which I believe is an Australian-developed piece of software, does lots of NLP stuff, and it's got the best tokenizer I know, so fastai is designed to work well with the spaCy tokenizer, as is torchtext. So here's an example of tokenization. What we do with torchtext is basically start by creating something called a field, and a field is a definition of how to pre-process some text. Here's an example of the definition of a field: it says I want to lowercase the text, and I want to tokenize it with the function called spacy_tok. It hasn't done anything yet; we're just telling it, when we do do something, this is what to do. And we're going to store that description of what to do in a thing called TEXT (in capitals). None of this is fastai-specific at all; this is part of torchtext.
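In code, the field definition is something like the following; here spacy_tok is written out by hand, though the fastai library provides a similar helper:

```python
import spacy
from torchtext import data

spacy_en = spacy.load('en')  # spaCy's English model ('en_core_web_sm' in newer versions)

def spacy_tok(x):
    # split a raw string into a list of token strings
    return [tok.text for tok in spacy_en.tokenizer(x)]

# a Field describes how to pre-process text: lowercase it, then tokenize with spaCy
TEXT = data.Field(lower=True, tokenize=spacy_tok)
```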
You can go to the torchtext website and read the docs; there aren't lots of docs yet, as this is all very new, so probably the best information you'll find is in this lesson, but there's some more on the site. So what we can now do is go ahead and create the usual fastai model data object. To create it, we have to provide a few bits of information: what's the training set (the path to the text files), the validation set, and the test set. In this case, just to keep things simple, I don't have a separate validation and test set, so I pass in the validation set for both. So now we can create our model data object: as per usual, the first thing we give it is the path, the second thing is the torchtext field definition of how to pre-process the text, and the third thing is the dictionary of all the files we have: train, validation, test. As per usual we can pass in a batch size, and then we've got a couple of special extra things here. One is very commonly used in NLP: minimum frequency. In a moment we're going to be replacing every one of these words with an integer, basically a unique index for every word, and this says that if any word occurs fewer than 10 times, just call it unknown; don't treat it as a word. We'll see that in more detail shortly. And then, also to be covered in more detail, BPTT stands for backprop through time, and this is where we define how long a sequence we'll stick on the GPU at once; we're going to break the text up, in this case into chunks of 70 tokens or fewer, on the whole. We'll see all this in a moment. After building our model data object, what it actually does is fill this TEXT field with an additional attribute called vocab, and this is a really important NLP concept (I'm sorry there are so many NLP concepts we have to throw at you quickly, but we'll see them a few times). Vocab is the vocabulary, and the vocabulary in NLP has a very specific meaning: it is the list of unique words that appeared in this text, and every one of them is going to get a unique index.
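Putting that together looks roughly like this; the path variables are placeholders, and the constructor form follows my memory of the lesson-era fastai API, so double-check it against the notebook:

```python
bs = 64    # batch size
bptt = 70  # backprop through time: tokens per chunk

# directories of plain-text review files; validation stands in for test too
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)

# min_freq=10: words seen fewer than 10 times become the unknown token
md = LanguageModelData(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)
```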
So let's take a look. TEXT.vocab.itos, int-to-string (this is all torchtext, not fastai), maps the integer 0 to <unk> (unknown), the integer 1 to <pad> (padding), the integer 2 to 'the', then comma, dot, 'and', 'a', 'of', and so forth. This is the first 12 elements of the vocab array from the IMDB movie reviews, sorted by frequency except for the first two special tokens. We can then go backwards with stoi, string-to-int: 'the' is in position 2, so stoi of 'the' is 2. So the vocab lets us take a word and map it to an integer, or take an integer and map it to a word. And that means we can take, say, the first 12 tokens of our text and turn them into 12 ints: for example, here is 'of the', and here you can see 7, 2 in the integer version, matching 'of' and 'the'. So we're going to be working in this form. Did you have a question? Could you pass that back? Is it common to do any stemming or lemmatizing? Not really, no; generally tokenization is what we want. With a language model we want to keep things as general as possible: we want to know what's coming next, and whether something is future tense or past tense, plural or singular, we don't really know in advance which distinctions are going to be interesting and which aren't, so it seems generally best to leave the text alone as much as possible. That would be the short answer. Having said that, as I say, this is all pretty new, so if some researcher has already discovered that some other kind of preprocessing is helpful, I wouldn't be surprised if there's something I don't know about. When you're dealing with natural language, isn't context important? If you're using this tokenizer and literally just looking at individual words... We're not throwing away the context: the words are still in order. Just because we replaced 'I' with the number 12, these tokens are still in that order. There is a different way of dealing with natural language called bag of words, and with bag of words you do throw away the order and the context; in the machine learning course we'll be learning about working with bag-of-words representations, but my belief is that they are no longer useful, or on the verge of becoming no longer useful, because we're starting to learn how to use deep learning to use context properly, really only in the last few months. So, I mentioned that we've got two numbers: batch size and BPTT, backprop through time. This is kind of subtle. We've got some big, long piece of text, a bunch of words, and what actually happens in a language model is that even though we have lots of movie reviews, they all get concatenated together into one big block of text, so it's basically predicting the next word in this huge long thing, which is all of the IMDB movie reviews concatenated together, tens of millions of words long. What we do is split it up into batches first: if we say we want a batch size of 64, we break the whole corpus into 64 equal sections, and then take each of the 64 sections and move it underneath the previous one (I didn't do a great job of drawing that), or rather, I think we move them across-wise, so effectively we just transpose it, and we end up with a matrix that is 64 columns wide.
If the original corpus was, say, 64 million words long, then each column is roughly a million words long, so each of these columns represents one sixty-fourth of our entire IMDB review set, each starting at a different point. Then what we do is grab a little chunk of this at a time, with chunk lengths approximately equal to BPTT, which we set to 70: we grab a little 70-long section, and that's the first thing we chuck onto our GPU. That's a batch. So a batch is always 64 (the batch size) wide, and each piece is a sequence of length up to 70. Let me show you. If I take my training data loader (I don't know if you've tried playing with this yet, but you can take any data loader, wrap it with iter to turn it into an iterator, and then call next on it to grab a batch of data, just as if you were a neural net), you can see we get back a 75-by-64 tensor. So it's 64 wide, and I said it's approximately 70 high, but not exactly, and that's actually kind of interesting: a really neat trick torchtext does is to randomly change the backprop-through-time length every time, so each epoch it's getting slightly different chunks of text. This is like how, in computer vision, we randomly shuffle the images; we can't randomly shuffle the words, because they need to be in the right order, so instead we randomly move the breakpoints a little bit. That's the equivalent. So this column here is of length 75 (there's an ellipsis in the middle), and it represents the first 75 words of the first of the 64 segments, whereas this 75 here represents the first 75 words of the second of the 64 segments (you'd have to go something like a million words in to find that one), and here's the first 75 words of the last of those 64 segments. And then what we have down here is the next sequence along: the same numbers appear, just moved along by one position. In this case it is also 75 by 64, although for minor technical reasons it's actually flattened out into a single vector; basically, it's exactly the same as this matrix, just shifted by one, because we're trying to predict the next word. So that all happens for us, and this is the fastai part: if you ask for a language model data object, it's going to create these batches, batch-size wide by BPTT high, of our language corpus, along with the same thing shuffled along by one word, and so we're always going to try to predict the next word.
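A minimal sketch of that inspection, assuming the md object built above:

```python
# grab a single batch from the language-model training data loader
x, y = next(iter(md.trn_dl))

x.size()  # e.g. torch.Size([75, 64]): roughly bptt rows by batch-size columns
y.size()  # the same tokens shifted along by one position, flattened to a vector
```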
Instead of arbitrarily choosing 64 columns, why not do it by sentences, and pad them with zeros so that you actually have one full sentence per line? Wouldn't that make more sense? Not really, because remember, we're using columns, and each of our columns is around a million words long. So although it's true that those columns aren't always exactly finishing on a full stop, they're so long we don't care. And each column contains multiple sentences? Yes: each column is around a million words long and contains many, many sentences, because remember, the first thing we did was take the whole thing and split it into 64 groups. Okay, great. I found this question of what's in this language model matrix a little mind-bending for quite a while, so don't worry if it takes a while and you have to ask a thousand questions on the forum; that's fine. Go back and listen to the bit of this lecture where I showed splitting the corpus into 64 pieces and moving them around, try it with some sentences in Excel or something, and see if you can do a better job of explaining it than I did. This is how torchtext works, and what fastai adds on top is the idea of how to build a language model out of it, although a lot of that is borrowed from torchtext as well; where torchtext ends and fastai starts, or vice versa, is sometimes a little subtle, because they really work closely together. So now that we have a model data object that can feed us batches, we can go ahead and create a model, and in this case we're going to create an embedding matrix. We can see how big our vocab is: looking back at the model data object, there are 4,602 batches to go through per epoch, which is basically the total length of everything divided by batch size times BPTT. And this one I wanted to show you: md.nt (I've got the definition up here: number of tokens) is the size of our vocab. We've got 34,945 unique words, and remember, a unique word had to appear at least 10 times, because otherwise it was replaced with unk. The length of the data set is 1, because as far as the language model is concerned there's only one thing, the whole corpus, and that thing has 20.6 million words. Those 34,945 tokens are used to create an embedding matrix with 34,945 rows: the first row represents unk, the second represents pad, the third 'the', then comma, dot, and so forth. Each one of these gets an embedding vector, so this is literally identical to what we did before the break: it's a categorical variable, just a very high-cardinality categorical variable, and furthermore it's the only variable. That's pretty standard in NLP: your variable is a word, so we have a single categorical variable, a single column, basically a 34,945-cardinality categorical variable, and we're going to create an embedding matrix for it. em_sz is the size of the embedding vector: 200, a lot bigger than our previous embedding vectors. That's not surprising, because a word has a lot more nuance to it than the concept of Sunday, or Rossmann's Berlin store, or whatever; generally an embedding size for a word will be somewhere between about 50 and about 600, so I've gone for somewhere in the middle. We then have to say, as usual, how many activations we want in our layers; we're going to use 500.
And then, how many layers do you want in your neural net? We're going to use three. There's a minor technical detail here: it turns out (we'll learn about the Adam optimizer later) that its defaults don't work very well with these kinds of models, so we have to change them; basically, any time you're doing NLP you should probably include this line, because it works pretty well. Having done that, we can again take our model data object and grab a model out of it, and we can pass in a few different things: what optimization function we want, how big an embedding, how many hidden activations, how many layers, and how much dropout, of many different kinds. This language model we're going to use is a very recent development called the AWD LSTM, by Stephen Merity, an NLP researcher based in San Francisco, and his main contribution really was to show how to put dropout all over the place in these NLP models. We're not going to worry now (we'll do it in the last lecture) about what the architecture is and what all those dropouts are for; for now, just know it's the same as usual: if you try to build an NLP model and you're underfitting, decrease all of these dropouts, and if you're overfitting, increase them all, in roughly this ratio. That's my rule of thumb. This is such a recent paper that hardly anybody else is working with this model, so there's not a lot of guidance, but I've found these ratios work well, and that's what Stephen's been using too. There's another way to avoid overfitting that we'll talk about in the last class; for now, this line works totally reliably, so all of your NLP models probably want it. And then this one we'll also talk about in the last lecture; you can always include it. Basically, what it says is: when you look at your gradients and multiply them by the learning rate to decide how much to update your weights, clip them; literally, don't let them be more than 0.3. And this is quite a cool little trick, because if your learning rate is pretty high, you don't want to get into the situation we talked about where, rather than taking little steps down, you keep overshooting back and forth, each step too big. With gradient clipping, it goes so far and then says, my goodness, I'm going too far, I'll stop there. That's basically what gradient clipping does. So these are a bunch of parameters; the details don't matter too much right now, you can just steal them, and then we can go ahead and call fit with exactly the same parameters as usual. So Jeremy, there are all these other word-embedding things, like word2vec and GloVe. I have two questions: one, how are those different from these, and two, why don't you initialize the embeddings with one of them? That's a great question. Basically, people have pre-trained these embedding matrices before to do various other tasks. They're not whole pre-trained models; they're just a pre-trained embedding matrix, and you can download them, and as Yannet says, they have names like word2vec and GloVe, and they're literally just a matrix. There's no reason we couldn't download them. But I found that building a whole pre-trained model in this way didn't seem to benefit much, if at all, from using pre-trained word vectors, whereas using a whole pre-trained language model made a much bigger difference. Those of you who saw word2vec when it came out will remember what a big splash it made; I'm finding this technique of pre-trained language models seems much more powerful. But I think we could combine both to make them a little better still.
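Assembled, the learner setup looks roughly like this; the dropout values, the Adam betas, and seq2seq_reg follow my recollection of the lesson notebook, so verify them against it:

```python
from functools import partial
import torch.optim as optim

em_sz = 200  # embedding vector size
nh = 500     # hidden activations per layer
nl = 3       # number of layers

# Adam's default betas don't work well here; lower the momentum term
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=0.05,  # dropout on the embedding output
                       dropout=0.05,   # dropout on the final layer's output
                       wdrop=0.1,      # weight dropout inside the LSTM
                       dropoute=0.02,  # dropout on the embedding matrix itself
                       dropouth=0.05)  # dropout between LSTM layers
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)  # the reliable regularizer mentioned above
learner.clip = 0.3                                      # clip gradient updates at 0.3
learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)
```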
What is the model that you've used? How can I know the architecture of the model? We'll be learning about the model architecture in the last lesson; for now, it's a recurrent neural network using something called an LSTM, long short-term memory. So yes, there are lots of details we're skipping over, but you can do all of this without them. We go ahead and fit the model. I found that this language model took quite a while to fit, so I ran it for a while, noticed it was still underfitting, saved where it was up to, ran it a bit more with a longer cycle length, saved it again, found it was still somewhat underfitting, ran it again, and finally got to the point where, honestly, I ran out of patience, so I just saved it at that point. Then I did the same kind of test we looked at before, priming it with the start of a review and seeing what it generated; the output ('it wasn't quite what I was expecting, but I really liked it anyway... the best performance was one of the...') was a little weird in places, but okay: it looks like the language model is working pretty well. So I've pre-trained the language model, and now I want to fine-tune it to do classification, sentiment classification. Now, obviously, if I'm going to use a pre-trained model, I need to use exactly the same vocab: the word 'the' still needs to map to the number 2, so that I can look up the right vector for it. That's why I first of all load back up my field object, the thing with the vocab in it. If I run this straight afterwards it's unnecessary, since it's already in memory, but it means I can come back to this later, in a new session. I then go ahead and say, okay, I've now got one more field: in addition to my field which represents the reviews, I've also got a field which represents the label (the details aren't too important here). This time I need to not treat the whole thing as one big piece of text: every review is separate, because each one has a different sentiment attached to it. And it so happens that torchtext already has a data set that does this for IMDB, so I just used the IMDB data set built into torchtext. Once we've done all that, we end up with something where, for a particular sample, we can grab its label (positive) and some of its text ('this is another great Tom Berenger movie' blah blah blah). There's nothing fastai-specific here; we'll come back to it in the last lecture, and the torchtext docs can help you understand what's going on. All you need to know is that once you've used this special torchtext thing called splits to grab a splits object, you can pass it straight into fastai's TextData.from_splits, and that basically converts a torchtext object into a fastai object we can train on. As soon as you've done that, you can just go ahead and say get_model, and that gets us our learner, and then we can load into it the pre-trained model, the language model.
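In code, that sequence looks roughly like this, using the objects defined earlier; the saved-file names are placeholders, and the calls follow my memory of the lesson notebook:

```python
# reload the TEXT field (with its vocab) saved after language-model training
TEXT = pickle.load(open(f'{PATH}models/TEXT.pkl', 'rb'))

# a non-sequential field for the positive/negative label
IMDB_LABEL = data.Field(sequential=False)

# torchtext ships an IMDB dataset; splits yields the train/test examples
splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')

# convert the torchtext splits into a fastai model data object
md2 = TextData.from_splits(PATH, splits, bs)

# build the classifier, then load the pre-trained language-model encoder
m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl,
                   dropout=0.1, dropouti=0.4, wdrop=0.5,
                   dropoute=0.05, dropouth=0.3)
m3.load_encoder('lm_encoder')  # placeholder name for the saved encoder weights
```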
And so we can now take that pre-trained language model and use the stuff we're already familiar with: make sure all except the last layer is frozen, train a bit, unfreeze, train a bit. And the nice thing is, once you've got a pre-trained language model, the classifier trains super fast: you can see here it's a couple of minutes per epoch, and my best result came after about 10 epochs, so it took something like 20 minutes to train this part. I ended up with 94.5%. So how good is 94.5%? It so happens that one of Stephen Merity's colleagues, James Bradbury, recently wrote a paper in which they tried to set a new state of the art for a bunch of NLP tasks, and one of the things they looked at was IMDB. They have a list of the current world's best results for IMDB, and even with methods highly specialized for sentiment analysis, the best anybody had previously come up with was 94.1. In other words, this technique, getting 94.5, is literally better than anything anybody had created before, as far as we know, as far as James Bradbury knows. So when I say there are big opportunities to use this, I mean this is a technique that nobody else currently has access to: whatever IBM has in Watson, or whatever any big company is advertising, unless they have some secret sauce they're not publishing (which they don't, because people publish these things), you now have access to a better text classification method than has ever existed before. So I really hope you can try this out and see how you go. There may be some things it works really well on and others it doesn't; I don't know. I think this was kind of a sweet spot: we had about 25,000 short-to-medium-sized documents, and if you don't have at least that much text, it may be hard to train a decent language model. But having said that, there's a lot more we could do here. We won't be able to do it in part one of this course, but in part two, for example, we could start training language models that look at lots and lots of medical journals, and then make a downloadable medical language model that anybody could use to fine-tune on, say, a prostate cancer subset of the medical literature. There's so much we could do; it's kind of exciting. And then, to Yannet's point, we could also combine this with pre-trained word vectors. Even without trying that hard, we could have pre-trained a language model on, say, a Wikipedia corpus, then fine-tuned it into an IMDB language model, then fine-tuned that into an IMDB sentiment analysis model, and we would have got something better than this. So I really think this is the tip of the iceberg. There's a really fantastic researcher called Sebastian Ruder, who is basically the only NLP researcher I know who's been writing a lot about pre-training, fine-tuning, and transfer learning in NLP, and I was asking him why this isn't happening more. His view was that it's because there isn't the software to make it easy. I'm actually going to share this lecture with him tomorrow, because it feels like there's hopefully going to be a lot of work coming out of this now that we're making it really easy to do. Okay, we're kind of out of time.
So what I'll do is quickly introduce collaborative filtering, and then we'll finish it next time. With collaborative filtering there's very, very little new to learn; we've basically learnt everything we're going to need. We'll cover it quite quickly next week, and then we're going to do a really deep dive, where from scratch we'll learn how to do stochastic gradient descent, how to create loss functions, and how they work exactly, and then we'll go from there and gradually build back up to really deeply understanding what's going on in the structured models, then what's going on in convnets, and finally what's going on in recurrent neural networks, and hopefully we'll be able to build them all from scratch. So this MovieLens data set is going to be really important, because we're going to use it to learn a lot of the foundational theory and the math behind it. The MovieLens data set basically looks like this: it contains a bunch of ratings. It says user number 1 watched movie number 31 and gave it a rating of 2.5 at this particular time; then they watched movie 1029 and gave it a rating of 3; and they watched movie 1172 and gave it a rating of 4; and so forth. So this is the ratings table, and it's really the only one that matters. Our goal will be: for some user and movie combination we haven't seen before, predict whether they'll like it. This is how recommendation systems are built; it's how Amazon decides what books to recommend, how Netflix decides what movies to recommend, and so forth. To make it more interesting, we'll also download a list of movies, so for each movie we'll have the title, and so, for that question earlier about what's actually going to be in these embedding matrices and how to interpret them, we'll actually be able to look and see how that's working. Basically, what we're creating is a kind of cross-tab of users by movies. Feel free to look ahead during the week; you'll see that, as per usual, it's a collaborative filtering data set from CSV, model data, get_learner, learn.fit, and we're done. And you won't be surprised to hear that when we compare it to the benchmarks, it seems to be better than the ones we looked at. So that'll basically be it, and then next week we'll do the deep dive and see how to actually build this from scratch. All right, see you next week. Thank you.