Hi everybody and welcome to Lesson 7. We're going to start by having a look at a kind of regularization called weight decay. The issue that we came to at the end of the last lesson is that we were training our simple dot product model with bias, and our loss started going down and then it started going up again. So we have a problem: we are overfitting. And remember, in this case we're using mean squared error. Try to recall why it is that we don't need a metric here: because mean squared error is pretty much the thing we care about, really. Or we could use mean absolute error if we like. Either of those works fine as a loss function; they don't have the problem of big flat areas like accuracy does for classification. So what we want to do is make it less likely that we're going to overfit, by doing something we call reducing the capacity of the model. The capacity of the model is basically how much space it has to find answers. If it can find any answer anywhere, those answers can include basically memorizing the data set. So one way to handle this would be to decrease the number of latent factors. But generally speaking, reducing the number of parameters in a model, particularly as we look at more deep-learning-style models, ends up biasing the models towards very simple kinds of shapes. So there's a better way to do it than reducing the number of parameters: instead, we try to force the parameters to be smaller, unless they're really required to be big. The way we do that is with weight decay. Weight decay is also known as L2 regularization; they're very slightly different, but we can think of them as the same thing. What we do is change our loss function, and specifically we change it by adding to it the sum of all the weights squared (in fact, really, it's all of the parameters squared, not just the weights). Why do we do that?
Well, because if that's part of the loss function, then one way to decrease the loss would be to decrease the weights: one particular weight, or all of the weights, or something like that. To see what decreasing the weights would do, think about, for example, the different possible values of a in y = a * x². The larger a is, for example a = 50, the more you get these very narrow peaks. In general, big coefficients are going to cause big swings: big changes in the loss from small changes in the parameters. And when you have these kinds of sharp peaks or valleys, it means that a small change to a parameter can make a big change to the loss. If you're in that situation, then you can basically fit all the data points close to exactly, with a really complex, jagged function with sharp changes, which tries to sit exactly on each data point rather than finding a nice smooth surface which connects them all together, or goes through them all. So if we limit our weights by adding the sum of the weights squared into the loss function, then it's going to fit less well on the training set, because we're giving it less room to try anything that it wants; but we hope that it will result in a better loss on the validation set or the test set, so that it will generalize better. One way to think about this is that the loss with weight decay is just the loss plus the sum of the parameters squared, times some number we pick, a hyperparameter; it's usually in the region of 0.1 or 0.01 or 0.001. So that's basically what loss with weight decay looks like as an equation. But remember, when it actually comes to how the loss is used in stochastic gradient descent, it's used by taking its gradient. So what's the gradient of this? Well, if you remember back to when you first learnt calculus (it's okay if you don't), the gradient of something squared is just two times that something.
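To make that equivalence concrete, here's a minimal sketch (the names `wd` and `params` are just illustrative) showing that adding `wd * (params**2).sum()` to the loss produces exactly the same gradient as adding `2 * wd * params` directly:

```python
import torch

# A hyperparameter value in the region mentioned above.
wd = 0.1
params = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)

# View 1: add wd * sum(params**2) to the loss, then take its gradient.
loss_penalty = wd * (params ** 2).sum()
loss_penalty.backward()
grad_from_loss = params.grad.clone()

# View 2: skip the loss term entirely and just add 2 * wd * params
# to the gradients, which is what "weight decay" does in practice.
grad_direct = 2 * wd * params.detach()

print(torch.allclose(grad_from_loss, grad_direct))  # True
```

Since the two views give identical gradients, an implementation never needs to compute the squared-parameter sum at all; it just nudges each gradient by a multiple of the parameter itself.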
We've changed from talking about parameters to weights, which is a bit confusing, so I'll just use weight here to keep it consistent (maybe parameters would be better). So the derivative of weight squared is just two times weight. In other words, to add this term to the gradient, we can just add weight decay times 2 times weight to the gradients. And since weight decay is just a hyperparameter, we can fold the 2 into it, which would give us just weight decay times weight. So weight decay refers to adding the weights times some hyperparameter onto the gradients. And that is going to try to create these kinds of more shallow, less bumpy surfaces. To do that, when we call fit or fit_one_cycle or whatever, we can simply pass in a wd parameter, and that's just this number here. So if we pass in 0.1, then the training loss goes from 0.29 to 0.49; that's much worse, because we can't overfit anymore. The valid loss goes from 0.89 to 0.82: much better. This is an important thing to remember for those of you that have done a lot of more traditional statistical modeling: in more traditional statistical models, we try to avoid overfitting and increase generalization by decreasing the number of parameters. But in a lot of modern machine learning, and certainly deep learning, we tend instead to use regularization such as weight decay, because it gives us more flexibility: it lets us use more non-linear functions and still reduces the capacity of the model. Great, so we're down to 0.823; this is a good model, this is really actually a very good model. So let's dig into what's actually going on here, because in our architecture, remember, we basically just had four embedding layers. So what's an embedding layer? We've described it conceptually, but let's write our own.
And remember, we said that an embedding layer is just a computational shortcut: doing a matrix multiplication by a one-hot-encoded matrix is actually the same as just indexing into an array. So an embedding is just indexing into an array. It's nice to be able to create our own versions of things that exist in PyTorch and fastai, so let's do that for embedding. If we're going to create our own kind of layer, which is pretty cool, we need to be aware of something: normally a layer is created by inheriting, as we've discussed, from Module or nn.Module. So here's an example of a module, where we've created a class called T that inherits from Module, and when it's constructed (remember, that's what dunder init, __init__, does), this dummy little module sets self.a to the number one repeated three times, as a tensor. Now, if you remember back to notebook 4, we talked about how the optimizers in PyTorch and fastai rely on being able to grab the parameters attribute to find a list of all the parameters. If you want to be able to optimize self.a, it would need to appear in parameters, but actually there's nothing there. Why is that? That's because PyTorch does not assume that everything in a module is something that needs to be learnt. To tell it that something needs to be learnt, you have to wrap it with nn.Parameter. So here's exactly the same class, but torch.ones (which is just a list of three ones, in this case) is wrapped in nn.Parameter, and now if I go parameters, I see I have a parameter with three ones in it. That's going to automatically call requires_grad_ for us as well. We haven't had to do that for things like nn.Linear in the past, because PyTorch automatically uses nn.Parameter internally; so if we have a look at the parameters for something that uses an nn.Linear layer with no bias, you'll see again we have here a parameter with three things in it.
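A minimal sketch of the two dummy modules just described (the class names here are made up, but the behavior is what PyTorch does):

```python
import torch
from torch import nn

class TPlain(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = torch.ones(3)  # a plain tensor: NOT registered as a parameter

class TParam(nn.Module):
    def __init__(self):
        super().__init__()
        # wrapped in nn.Parameter: now registered, and requires_grad is set
        self.a = nn.Parameter(torch.ones(3))

print(len(list(TPlain().parameters())))  # 0
print(len(list(TParam().parameters())))  # 1
print(TParam().a.requires_grad)          # True
```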
So, in general, we want to be able to create a parameter, i.e. a tensor with a bunch of things in it, and generally we want to randomly initialize it. To randomly initialize, we can pass in the size we want, initialize a tensor of zeros of that size, and then fill it with normally distributed random numbers with a mean of zero and a standard deviation of 0.01 (don't worry about why those particular numbers for now; this is just to see how it works). So here's something that will give us back a parameter of any size we want, and now I'm going to replace everywhere that used to say Embedding with create_params. Everything else in the __init__ is the same, and then the forward is very, very similar to before: as you can see, I'm grabbing the zero-index column from x, that's my users, and I just look it up in that user_factors array. The cool thing is that I don't have to do anything with gradients myself for this manual embedding layer, because PyTorch can figure out the gradients automatically, as we've discussed. Then I just do the dot product as before, add on the bias as before, do the sigmoid range as before; so here's a DotProductBias without any special PyTorch layers, and we fit and we get the same result. I think that is pretty amazingly cool. We've really shown that the embedding layer is nothing fancy, it's nothing magic, right? It's just indexing into an array. So hopefully that removes a bit of the mystery for you. Now let's have a look at this model that we've created and trained, and find out what it's learned. It's already useful: we've got something we can make pretty accurate predictions with. But let's find out what the model looks like. We have a question, so let's take it before we look at this. What's the advantage of creating our own embedding layer over the stock PyTorch one? Oh, nothing at all; we're just showing that we can.
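The manual-embedding model described above can be sketched like this; treat it as an illustration rather than the exact course notebook (the sizes are placeholders, and `sigmoid_range` is reimplemented inline so the block is self-contained):

```python
import torch
from torch import nn

def create_params(size):
    # Zeros, filled with normally distributed noise (mean 0, std 0.01),
    # wrapped in nn.Parameter so the optimizer can find it.
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

def sigmoid_range(x, lo, hi):
    # Squash activations into (lo, hi); fastai provides this helper too.
    return torch.sigmoid(x) * (hi - lo) + lo

class DotProductBias(nn.Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        super().__init__()
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors[x[:, 0]]   # indexing IS the embedding
        movies = self.movie_factors[x[:, 1]]
        res = (users * movies).sum(dim=1)
        res += self.user_bias[x[:, 0]] + self.movie_bias[x[:, 1]]
        return sigmoid_range(res, *self.y_range)
```

Note that no gradient code appears anywhere: because the lookups are ordinary tensor indexing, autograd handles everything.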
It's great to be able to dig under the surface, because at some point you'll want to try doing new things, and a good way to learn to do new things is to replicate things that already exist, so that you can check that you understand how they work. It's also a great way to understand the foundations of what's going on: to actually create your own implementation in code. But I wouldn't expect you to use this implementation in practice. Basically, it removes all the mystery. So, if you remember, we've created a learner called learn, and to get to the model that's inside it you can always call learn.model; and then inside that there are going to be, well, not automatically created, we've created all these attributes: movie_factors, movie_bias, user_bias and so forth. So we can grab learn.model.movie_bias, and now what I'm going to do is sort that vector and print out the first five titles. What this is going to do is print out the movies with the smallest bias, and here they are. What does this mean? Well, it kind of means these are the five movies that people really didn't like. But it's more than that: it's not only that people didn't like them, but even when we take account of the genre they're in, the actors they have, whatever the latent factors are, people liked them a lot less than they would have been expected to. So maybe, for example (I haven't seen any of these movies, luckily), perhaps one is a sci-fi movie, and even people who like sci-fi movies found it so bad that they still didn't like it. We can do the exact opposite, which is to sort descending, and here are the top five movies, specifically the top five by bias.
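As a sketch, with a made-up bias vector and titles standing in for `learn.model.movie_bias` and the real movie list, the sorting looks like:

```python
import torch

# Hypothetical learned biases and matching titles, just to show the mechanics.
movie_bias = torch.tensor([0.3, -0.9, 1.2, -0.1, 0.7])
titles = ["A", "B", "C", "D", "E"]

idxs = movie_bias.argsort()                      # ascending: most-disliked first
print([titles[i] for i in idxs[:2]])             # ['B', 'D']
idxs_desc = movie_bias.argsort(descending=True)  # descending: most-liked first
print([titles[i] for i in idxs_desc[:2]])        # ['C', 'E']
```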
So these are the movies that people still like more than you'd expect, even after you take account of everything else. LA Confidential (I have seen all of these ones) is kind of a murder-mystery cop movie, I guess, and people who don't necessarily like that genre, or maybe don't like Guy Pearce very much (I think he was in it), whatever: people still like this movie more than they expected. So this is a nice thing, that we can look inside our model and see what it's learned. We can look not only at the bias vector but also at the factors. Now, there are 50 factors, which is too many to visualize, so we can use a technique called PCA, principal component analysis. The details don't matter, but basically it's going to squish those 50 factors down to three, and then we'll plot the top two, as you can see here. What we see when we plot the top two is that the movies have been spread out across a space of latent factors of some kind. If you look at the far right, there's a whole bunch of big-budget action-type things, and on the far left there's more like cult kinds of things: Fargo, Schindler's List, Monty Python. By the same token, at the bottom we've got The English Patient and When Harry Met Sally, so romance-drama kind of stuff, and at the top we've got action and sci-fi kind of stuff. So you can see that even though we haven't passed in any information about these movies (all we've seen is who likes what), these latent factors have automatically figured out a space, a way of thinking about these movies, based on what kinds of movies people like, and what other kinds of movies they like along with those. It's really interesting to try and visualize what's going on inside your model. Now, we don't have to do all this manually: we can actually just say give me a collab_learner using this set of data loaders, with this number of factors and this y_range, and it does everything we've just seen. Again, about the same number.
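The squish-50-factors-down-to-a-few step can be sketched without any special library, using the SVD of the centered factor matrix (the factor matrix here is random, just to show the mechanics; the course itself uses a PCA helper):

```python
import torch

movie_factors = torch.randn(100, 50)           # 100 movies, 50 latent factors
centered = movie_factors - movie_factors.mean(dim=0)
U, S, Vh = torch.linalg.svd(centered, full_matrices=False)
top2 = centered @ Vh[:2].T                     # coordinates on the top 2 components
print(top2.shape)                              # torch.Size([100, 2])
```

Those two coordinates per movie are what get scattered on the chart, with titles as labels.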
Now you can see, this is nice, right? We've actually been able to see right underneath, inside the collab_learner part of the fastai collaborative filtering application, and we can build it all ourselves from scratch. We know how to create the SGD, how to create the embedding layer, how to create the model, the architecture; so you can see we've really built up our own version of this from scratch. If we just type learn.model, you can see here the names are a bit more generic: user weight, item weight, user bias, item bias; but it's basically the same stuff we've seen before, and we can replicate the exact analysis we saw before using this same idea. It's in a slightly different order this time, because it is a bit random, but pretty similar as well. Another interesting thing we can do is think about the distance between two movies. So let's grab all the movie factors and pop them into a variable, then pick some movie, and then find the distance from that movie to every other movie. One way of thinking about distance is that you might recall the Pythagorean formula: the hypotenuse of a triangle, which is also the distance to a point on a Cartesian plane, is the square root of x squared plus y squared. It doesn't matter if you don't remember it, but you can do exactly the same thing for 50 dimensions; it doesn't just work for two dimensions. There's a formula that tells you how far away one point is from another point, if x and y are actually the differences between two movie vectors. Then what gets interesting is that you can divide by the lengths, to make all the lengths the same, to find the angle between any two movies. And that actually turns out to be a really good way to compare the similarity of two things; it's called cosine similarity. The details don't matter, and you can look them up if you're interested, but the basic idea here is to see that we can actually pick a movie and
find the movie that is the most similar to it, based on these factors. We have a question: what motivated learning a 50-dimensional embedding and then using the top three, versus just learning a 3-dimensional one? Well, the purpose of this was actually to create a good model; the visualization part is normally just exploration of what's going on in your model, and with 50 latent factors you're going to get a more accurate model. So that's one approach: this dot product version. There's another version we could use, which is that we could create a set of user factors and a set of item factors, and just like before we could look them up; but instead of doing a dot product, we could concatenate them together into a tensor that contains both the user and the movie factors next to each other, and then we could pass them through a simple little neural network: linear, ReLU, linear, and then sigmoid range as before. Importantly, here the number of inputs to the first linear layer is equal to the number of user factors plus the number of item factors, and the number of outputs is however many activations we want, which we just default to 100 here; then the final layer will go from 100 to 1, because we're just making one prediction. So we can create that (we'll call it CollabNN), we can instantiate it to create a model, we can create a learner, and we can fit.
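The distance-and-angle idea from a moment ago can be sketched like this; the factor matrix here is random, and `F.cosine_similarity` handles the divide-by-lengths normalization for us:

```python
import torch
import torch.nn.functional as F

movie_factors = torch.randn(100, 50)   # made-up factors: 100 movies, 50 dims
target = movie_factors[0]              # the movie we picked

# Normalizing to unit length turns the dot product into the cosine of the
# angle between the two vectors.
sims = F.cosine_similarity(movie_factors, target[None], dim=1)
most_similar = sims.argsort(descending=True)[1]  # [0] is the movie itself
print(most_similar)
```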
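The concatenation model just described might be sketched as follows; sizes and names are illustrative, not the exact fastai classes, and `sigmoid_range` is reimplemented inline:

```python
import torch
from torch import nn

def sigmoid_range(x, lo, hi):
    return torch.sigmoid(x) * (hi - lo) + lo

class CollabNN(nn.Module):
    def __init__(self, user_sz, item_sz, y_range=(0, 5.5), n_act=100):
        super().__init__()
        self.user_factors = nn.Embedding(*user_sz)  # (n_users, n_user_factors)
        self.item_factors = nn.Embedding(*item_sz)  # (n_items, n_item_factors)
        self.layers = nn.Sequential(
            # inputs = user factors + item factors, since we concatenate them
            nn.Linear(user_sz[1] + item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1),  # one prediction per row
        )
        self.y_range = y_range

    def forward(self, x):
        embs = self.user_factors(x[:, 0]), self.item_factors(x[:, 1])
        x = self.layers(torch.cat(embs, dim=1))
        return sigmoid_range(x, *self.y_range)
```

Notice that nothing forces the two embedding sizes to match here, which is exactly the extra flexibility discussed next.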
It's not going quite as well as before; it's not terrible, but it's not quite as good as our dot product version. But the interesting thing here is that it does give us some more flexibility, which is that, since we're not doing a dot product, we can actually have a different embedding size for users versus items. And actually, fastai has a simple heuristic: if you call get_emb_sz and pass in your data loaders, it will suggest appropriate size embedding matrices for each of your categorical variables, in this case your users and items. So we pass in *embs (that's the * prefix we learned about in the last class, in case you forgot), which passes in the user tuple and the item tuple, which we can then pass to Embedding. So this is kind of interesting: we can see here that there are two different architectures we could pick from, and it wouldn't necessarily be obvious ahead of time which one's going to work better. In this particular case, the simplest one, the dot product one, actually turned out to work a bit better, which is interesting. If we call collab_learner and pass use_nn=True, then what that's going to do is use this version, the version with concatenation and the linear layers. So collab_learner with use_nn=True: again we get about the same result, as you would expect, because it's just a shortcut for this version. And interestingly, it actually returns an object of type EmbeddingNN; and it's kind of cool: if you look inside the fastai source code, or use the double question mark trick to see the source code for EmbeddingNN, you'll see it's three lines of code. How does that happen?
Because we're using this thing called TabularModel, which we will learn about in a moment; basically, this neural net version of collaborative filtering is literally just a tabular model in which we pass no continuous variables, and some embedding sizes. We'll see that in a moment. So that is collaborative filtering; and again, take a look at the further research section in particular, after you finish the questionnaire, because there are some really important next steps you can take to push your knowledge and your skills. So let's now move to notebook 9, on tabular data; we're going to look at tabular modeling and do a deep dive. Let's start by talking about this idea that we were starting to see here, which is embeddings; and specifically, let's move beyond just having embeddings for users and items to embeddings for any kind of categorical variable. Because we know an embedding is a lookup into an array, it can handle any kind of discrete categorical data. Things like age are not discrete; they're continuous numerical data. But something like sex or postcode is a categorical variable: they have a certain number of discrete levels, and the number of discrete levels they have is called their cardinality. To have a look at an example of a dataset that contains both categorical and continuous variables, we're going to look at the Rossmann sales competition that ran on Kaggle a few years ago. Basically, what's going to happen is we're going to see a table that contains information about various stores in Germany, and the goal will be to try and predict how many sales there are going to be for each day over a couple-of-week period, for each store. One of the interesting things about this competition is that one of the gold medalists used deep learning, and it was one of the earliest known examples of a state-of-the-art deep learning tabular model. I mean, this is not long ago, maybe 2015 or something; but really, this idea of creating state-of-the-art tabular models with deep learning
has not been common for very long. Interestingly, the gold medalists in this competition, the folks that used deep learning, used a lot less feature engineering and a lot less domain expertise than the other competitors. They wrote a paper called Entity Embeddings of Categorical Variables, in which they basically described the exact thing that you saw in notebook 8: the way you can think of one-hot encodings as just being embeddings, and you can concatenate them together and put them through a couple of layers (they call them dense layers, we call them linear layers) and create a neural network out of that. So this is really a neat, simple, obvious-in-hindsight trick. They actually did exactly what we did: in the paper, they look at the results of the trained embeddings. So, for example, they had an embedding matrix for regions in Germany; there wasn't really metadata about these, they were just learned embeddings, just like we learned embeddings about movies. And so then they created, just like we did before, a chart where they plotted each region according to (I think probably) a PCA of its embedding. If you circle the ones that are close to each other in blue, you'll see that they're actually close to each other in Germany; ditto for red, and ditto for green; and then here's the ground truth map. So this is pretty amazing: we can see that it's kind of learned something about what Germany looks like, based entirely on the purchasing behavior of people in those states. Something else they did was to look at every store: they looked at the distance between stores in practice, like how many kilometers away they are, and then they looked at the distance between stores in terms of their embedding distance, just like we saw in the previous notebook. And there was this very strong correlation: stores that were close to each other physically ended up having close embeddings as well, even though the actual location of these stores in physical space
was not part of the model. Ditto with days of the week: the days of the week were another embedding, and the days of the week that were next to each other ended up next to each other in embedding space; and ditto for months of the year. So it's pretty fascinating the way information about the world ends up captured, just by training embeddings, which as we know are just index lookups into an array. The way we then combine these categorical variables, with their embeddings, with continuous variables was described both in the entity embeddings paper that we just looked at, and then in more detail by Google, when they described how their recommendation system in Google Play works. This diagram is from Google's paper: they have the categorical features that go through the embeddings, and then there are continuous features, and then all the embedding results and the continuous features are just concatenated together into one big concatenated table, which then goes through (in this case) three layers of a neural net. Interestingly, they also take the collaborative-filtering bit and do the dot product as well, and combine the two; so they use both of the tricks we used in the previous notebook, combined together. So that's the basic idea we're going to be seeing, for moving beyond just collaborative filtering, which is just two categorical variables, to as many categorical and as many continuous variables as we like. But before we do that, let's take a step back and think about other approaches; because, as I mentioned, the idea of deep learning as a kind of best practice for tabular data is still pretty new, and it's still kind of controversial; it's certainly not always the case that it's the best approach. So when we're not using deep learning, what would we be using? Well, what we'd probably be using is something called an ensemble of decision trees; and the two most popular are random forests and gradient boosting machines (or something similar). So basically, between
multi-layered neural networks trained with SGD and ensembles of decision trees, that covers the vast majority of approaches that you're likely to see for tabular data; so we're going to make sure we cover them both today, of course. Although deep learning is nearly always clearly superior for stuff like images and audio and natural language text, these two approaches tend to give somewhat similar results a lot of the time for tabular data. You really should generally try both and see which works best for you, for each problem you look at. We have a question: why does the range go from 0 to 5.5 if the maximum is 5? That's a great question. The reason is, if you think about it, it's actually impossible for a sigmoid to get all the way to the top or all the way to the bottom; those are asymptotes. So no matter how big your x is, it can never quite get to the top, and no matter how small it is, it can never quite get to the bottom. So if you want to be able to actually predict a rating of 5, then you need to use something higher than 5 as your maximum. Next question: are embeddings used only for high-cardinality categorical variables, with one-hot encoding used in general for low-cardinality ones? I'll remind you, cardinality is the number of discrete levels in a variable; and remember that an embedding is just a computational shortcut for a one-hot encoding. So there's really no reason to use a one-hot encoding: as long as you have more than two levels, it's always going to use more memory and give you exactly, mathematically, the same thing; and if there are just two levels, then it is basically identical. So there isn't really any reason not to use an embedding. Thank you for those great questions. Okay, so one of the most important things about decision tree ensembles is that, at the current state of the technology, they do provide faster and easier ways of interpreting the model.
I think that's rapidly improving for deep learning models on tabular data, but that's where we are right now. They also require less hyperparameter tuning, so they're easier to get right the first time. So my first approach for analyzing a new tabular data set is always an ensemble of decision trees; and specifically, I pretty much always start with a random forest, because it's just so reliable. Question: in your experience, for highly imbalanced data such as fraud or medical data, what usually works best out of random forest, XGBoost, or a neural network? I'm not sure that whether the data is balanced or unbalanced is a key reason for choosing one of those over the others; I would try all of them and see which works best. So the exception to the guideline about starting with decision tree ensembles as your first thing to try would be: if there are some very high-cardinality categorical variables, then they can be a bit difficult to get to work really well in decision tree ensembles; or, most importantly, if there's something like plain text data or image data or audio data, then you're definitely going to need to use a neural net in there, though you could actually ensemble it with a random forest, as we'll see. So clearly we're going to need to understand how decision tree ensembles work. PyTorch isn't a great choice for decision tree ensembles: it's really designed for gradient-based methods, and random forests and decision-tree growing aren't really gradient-based methods in the same way. So instead, we're going to use a library called scikit-learn, imported as sklearn. Scikit-learn does a lot of things, and we're only going to touch on a tiny piece of them: the stuff we need to train decision trees and random forests. We've already mentioned Wes McKinney's book before; it's also a great book for understanding more about scikit-learn. So the data set for learning about decision tree ensembles is going to be another data set.
It's called the Blue Book for Bulldozers competition data set. Kaggle competitions are fantastic: they are machine learning competitions where you get interesting data sets, you get feedback on whether your approach is any good or not, you can see on a leaderboard what approaches are working best, and then you can read blog posts from the winning contestants sharing tips and tricks. It's certainly not a substitute for actual practice doing end-to-end data science projects, but for becoming good at creating models that are predictive, it's a really fantastic resource; highly recommended. And you can also submit to most old competitions to see how you would have gone, without having to worry about the stress of people looking at your results, because they're not publicized if you do that. There's a question: can you comment on real-time applications of random forests? In my experience, they tend to be too slow for real-time use cases like a recommender system; a neural network is much faster when run on the right hardware. We'll get to that. Now, you can't just download and untar Kaggle data sets using the untar_data function that we have in fastai, so you actually have to sign up to Kaggle and then follow these instructions for how to download data from Kaggle. Make sure you replace creds here with what it describes; you need to get a special API code, and then run this one time to put that up on your server, and now you can use Kaggle to download data using the API. After we do that, we're going to end up with a bunch of CSV files. Let's take a look at this data. The main table is train.csv (remember, that's comma separated values), and the training set contains information such as the unique identifier of a sale, the unique identifier of a machine, the sale price, and the sale date.
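Reading a table like this with pandas might look like the following; a tiny inline CSV with a few of the real column names stands in for the actual train.csv here:

```python
import io
import pandas as pd

# A made-up two-row stand-in for the competition's train.csv.
csv = io.StringIO(
    "SalesID,MachineID,SalePrice,saledate\n"
    "1,100,66000,2006-11-16\n"
    "2,101,57000,2004-03-26\n"
)
# low_memory=False makes pandas read the whole file before inferring dtypes,
# which matters for wide, messy CSVs like this competition's.
df = pd.read_csv(csv, low_memory=False, parse_dates=["saledate"])
print(df.columns.tolist())
```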
So what's going on here is that one row of the data represents the sale of a single piece of heavy machinery, like a bulldozer, at an auction; so it happens at a date, at a price, it's of some particular piece of equipment, and so forth. If we use pandas again to read in the CSV file (let's combine training and valid together), we can then look at the columns and see there are a lot of columns there, and many things where I don't know what the hell they mean, like blade extension and pad type and ride control. But the good news is we're going to show you a way that you don't have to look at every single column and understand what they mean; random forests are going to help us with that as well. So once again we're going to be seeing this idea that models can actually help us with data understanding and data cleanup. One thing we can look at is ordinal columns; this is a good place to handle that now. If there are things there that you know are discrete values but have some order, like product size (it has levels like large, large/medium, medium, small, mini), these should not be in alphabetical order or some random order; they should be in this specific order, they have a specific ordering. So we can use astype to turn it into a categorical variable, and then we can say set_categories with ordered=True to basically say this is an ordinal column. So it's got discrete values, but we actually want to define what the order of the classes is. We also need to choose which is the dependent variable, and we do that by looking on Kaggle. Kaggle will tell us that the thing we're meant to be predicting is sale price; and actually, specifically, they'll tell us the thing we're meant to be predicting is the log of sale price, because root mean squared log error is what we're actually going to be judged on in the competition, so it makes sense that we take the log. So we're just going to replace sale price with its log, and that's what we'll be using from now on. So a decision tree ensemble requires decision trees.
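Those two preprocessing steps might look like this in pandas; the tiny DataFrame is made up, and the level names follow the competition data:

```python
import numpy as np
import pandas as pd

# The ProductSize levels in their meaningful (not alphabetical) order.
sizes = ["Large", "Large / Medium", "Medium", "Small", "Mini", "Compact"]

df = pd.DataFrame({"ProductSize": ["Medium", "Small", "Large"],
                   "SalePrice": [66000.0, 9500.0, 121000.0]})

# Turn ProductSize into an *ordered* categorical (an ordinal column).
df["ProductSize"] = df["ProductSize"].astype("category")
df["ProductSize"] = df["ProductSize"].cat.set_categories(sizes, ordered=True)

# The metric is RMSLE, so replace the dependent variable with its log.
df["SalePrice"] = np.log(df["SalePrice"])
print(df["ProductSize"].cat.categories.tolist())
```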
So let's start by looking at decision trees. A decision tree, in this case, is something that asks a series of binary — that is, yes or no — questions about data. Such as: is somebody less than 30? Yes they are. Are they eating healthily? Yes they are. Okay, then we're going to say they're fit. So there's an example of some arbitrary decision tree that somebody might have come up with: a series of binary yes or no choices, and at the bottom are leaf nodes that make some prediction. Now of course for our bulldozers competition we don't know what binary questions to ask about these things, and in what order, in order to make a prediction about sale price. We're doing machine learning, so we're going to try and come up with some automated way to create the questions. And there's actually a really simple procedure for doing that. You just have to think about it. So if you want to stretch yourself here, have a think about what automatic procedure you could come up with that would build a decision tree whose final answer would do a significantly better than random job of estimating the sale price of one of these auctions. So here's the approach that we could use. Loop through each column of the data set — well, obviously not sale price, that's the dependent variable, but sale ID, machine ID, sale year, year made and so on. One of those will be, for example, product size. And so then what we're going to do is loop through each possible value of product size: Large, Large / Medium, Medium, etc. And then we're going to do a split, basically like where this comma is, and say: okay, let's get all of the auctions of large equipment and put that into one group, and everything that's smaller than that into another group. And so that's here: split the data into two groups based on whether they're greater than or less than that value.
If it's a categorical, non-ordinal variable, the split will just be whether it's equal to that level. And then we're going to find the average sale price for each of the two groups. So for the large group, what was the average sale price? For the smaller-than-large group, what was the average sale price? And that will be our model: our prediction will simply be the average sale price for that group. And so then you can say, well, how good is that model? If our model was just to ask a single question with a yes or no answer, put things into two groups and take the average of each group as our prediction, then how good would that model be? What would be the root mean squared error from that model? And so we can then say, all right, how good would it be if we used Large as a split? And then let's try again: what if we used Large / Medium as a split? What if we used Medium as a split? In each case we can find the root mean squared error of that incredibly simple model. And once we've done that for all of the product size levels, we can go to the next column and look at usage band and do every level of usage band, and then every level of state, and so forth. And so there'll be some variable and some split level which gives the best root mean squared error of this really, really simple model. And so then we'll say, okay, that will be our first binary decision. It gives us two groups, and then we're going to take each one of those groups separately and find another single binary decision for each of those two groups, using exactly the same procedure. And then we'll have four groups, and then we'll do exactly the same thing again separately for each of those four groups, and so forth. So let's see what that looks like. And in fact, once we've gone through this, you might even want to see if you can implement this algorithm yourself. It's not trivial, but it doesn't require any special coding skills, so hopefully you'll find you're able to do it.
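The split-search procedure just described can be sketched in a few lines of plain numpy. This is our own toy version for illustration, not sklearn's implementation, and the column names in the usage are made up:

```python
import numpy as np

def split_rmse(col, y, thresh):
    # Score a candidate split: predict each group's mean sale price,
    # then measure root mean squared error against the actual prices.
    mask = col <= thresh
    if mask.all() or not mask.any():
        return float('inf')   # degenerate split: one group is empty
    pred = np.where(mask, y[mask].mean(), y[~mask].mean())
    return np.sqrt(((y - pred) ** 2).mean())

def best_split(columns, y):
    # Loop through each column and each value it takes, keeping the
    # column/threshold pair whose two-group model has the lowest RMSE.
    best = (None, None, float('inf'))
    for name, col in columns.items():
        for thresh in np.unique(col):
            score = split_rmse(col, y, thresh)
            if score < best[2]:
                best = (name, thresh, score)
    return best

# Toy usage: one informative column, one uninformative one.
cols = {'ProductSize': np.array([0, 0, 1, 1]),
        'MachineID': np.array([3, 7, 2, 9])}
y = np.array([10.0, 10.0, 20.0, 20.0])
name, thresh, score = best_split(cols, y)
```

Growing the full tree is then just a matter of applying `best_split` recursively to each of the two resulting groups.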
There's a few things we have to do before we can actually create a decision tree, in terms of just some basic data munging. One is, if we're going to take advantage of dates, we actually want to call fastai's add_datepart function. And what that does, as you see after we call it, is it creates a whole bunch of different bits of metadata from that date: sale year, sale month, sale week, sale day and so forth. So the sale date of itself doesn't have a whole lot of information directly, but we can pull lots of different information out of it. And so this is an example of something called feature engineering, which is where we take some piece of data and we try to create lots of other pieces of data from it. So is this particular date the end of a month or not? Is it the end of a year or not? And so forth. That handles dates. There's a bit more cleaning we want to do, and fastai provides some things to make cleaning easier. We can use the TabularPandas class to create a tabular data set in pandas. And specifically we're going to use two tabular processors, or TabularProcs. A TabularProc is basically just a transform — and we've seen transforms before, so go back and remind yourself what a transform is; it's like three lines of code if you look at the code for it — except it's slightly different: it's actually going to modify the object in place rather than creating a new object and giving it back to you, because often tables of data are really big and we don't want to waste lots of RAM. And it's going to run the transform once and save the result, rather than doing it lazily when you access it, for the same reason: to make this a lot faster. Otherwise you can just think of them as transforms, really. One of them is called Categorify, and Categorify is going to replace a column with numeric categories, using the same basic idea of a vocab like we've seen before.
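fastai's add_datepart generates a longer list of attributes than this, but a hand-rolled sketch of the idea — expanding one datetime column into several engineered features — looks like so (the attribute subset and function name here are ours, not fastai's full list):

```python
import pandas as pd

def add_date_parts(df, col):
    # Expand one datetime column into several engineered features.
    # (A cut-down, hypothetical subset of what add_datepart emits.)
    dates = pd.to_datetime(df[col])
    df[f'{col}Year'] = dates.dt.year
    df[f'{col}Month'] = dates.dt.month
    df[f'{col}Day'] = dates.dt.day
    df[f'{col}Dayofweek'] = dates.dt.dayofweek
    df[f'{col}Is_month_end'] = dates.dt.is_month_end
    return df.drop(columns=[col])

df = add_date_parts(pd.DataFrame({'saledate': ['2011-10-31', '2011-11-01']}),
                    'saledate')
```

Each derived column is something a decision tree can usefully split on, which the raw timestamp is not.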
FillMissing is going to find any columns with missing data, fill in the missing data with the median of that column, and create a new boolean column which is set to true for anything that was missing. These two procs are basically enough to get you to a point where, most of the time, you'll be able to train a model. Now the next thing we need to do is think about our validation set. As we discussed in lesson one, a random validation set is not always appropriate, and certainly for something like predicting auction results it almost certainly is not appropriate, because we're going to be wanting to use the model in the future, not at some random date in the past. The way this Kaggle competition was set up was that the test set — the thing that you had to fill in and submit for the competition — was two weeks of data that came after any of the training set. So we should do the same thing for our validation set: we should make the validation set the last couple of weeks of data, and then the training set will only be data before that. We can basically do that by grabbing everything before October 2011 and creating a training and validation set based on that condition, grabbing those bits. So that's going to split our training set and validation set by date, not randomly. When you create a TabularPandas object, you're going to be passing in a data frame and your TabularProcs, and you also have to say what your categorical and continuous variables are. We can use fastai's cont_cat_split to automatically split a data frame into continuous and categorical variables for you, so we can just pass those in. Then tell it what the dependent variable is (you can have more than one) and what the indexes are to split into training and valid. And this gives us a TabularPandas object.
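A minimal sketch of both of those steps — median-fill with an indicator column, then the date-based split. The column names are hypothetical stand-ins, and the real FillMissing proc also remembers the medians so it can apply them to validation and test data:

```python
import numpy as np
import pandas as pd

def fill_missing(df, col):
    # Rough equivalent of the FillMissing proc: fill NaNs with the column
    # median and record which rows were missing in a new boolean column.
    missing = df[col].isna()
    if missing.any():
        df[f'{col}_na'] = missing
        df[col] = df[col].fillna(df[col].median())
    return df

df = pd.DataFrame({
    'MachineHours': [100.0, None, 300.0, 250.0],
    'saledate': pd.to_datetime(
        ['2011-09-01', '2011-09-30', '2011-10-15', '2011-11-01']),
})
df = fill_missing(df, 'MachineHours')

# The time-based split: training is everything before October 2011,
# validation is everything on or after, mirroring the competition's test set.
cond = df.saledate < np.datetime64('2011-10-01')
splits = (list(np.where(cond)[0]), list(np.where(~cond)[0]))
```

The `splits` tuple of index lists is the shape of thing you hand to TabularPandas as the training/validation split.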
So it's got all the information you need about the training set, the validation set, categorical and continuous variables, the dependent variable, and any processes to run. It looks a lot like a Datasets object: it has a .train and a .valid, and if we call .show we can see the data. Now .show is going to show us the string data, but if we look at .items you can see that internally it's actually stored these very compact numbers, which we can use directly in a model. So fastai has basically got us to a point where we have our data in a format ready for modeling and our validation set created. To see how these numbers relate to these strings, we can again, just like we saw last week, use the classes attribute, which is a dictionary that basically gives us the vocab — so we can look up what level a number like 6 corresponds to. That processing takes a little while to run, so you can go ahead and save the tabular object, and then you can load it back later without having to rerun all the processing. So that's a nice fast way to quickly get back up and running without having to reprocess your data. So we've done the basic data munging we need, and we can now create a decision tree. In scikit-learn, a decision tree where the dependent variable is continuous is a DecisionTreeRegressor. Let's start by telling it we just want a total of four leaf nodes — we'll see what that means in a moment. In scikit-learn you generally call fit, so it looks quite a lot like fastai, and you pass in your independent variables and your dependent variable. We can grab those straight from our tabular object's training set as .xs and .y, and we can do the same thing for validation, just to save us some typing. Okay, question: do you have any thoughts on what data augmentation for tabular data might look like?
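The save-and-reload trick comes down to pickling the processed object. fastai has its own save_pickle/load_pickle helpers; plain pickle shows the idea (using an in-memory buffer and a made-up dictionary here, rather than the real TabularPandas object and a file on disk):

```python
import io
import pickle

# Stand-in for the processed tabular object: once the slow procs have run,
# serialize the result so it can be restored without reprocessing.
processed = {'train_xs': [[1, 0], [2, 1]], 'train_y': [9.2, 10.1]}

buf = io.BytesIO()              # a file path would normally go here
pickle.dump(processed, buf)

buf.seek(0)
restored = pickle.load(buf)     # later: load instead of re-running the procs
```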
I don't have a great sense of data augmentation for tabular data. We'll be seeing later, either in this course or the next part, dropout and mixup and things like that, which you might be able to apply in later layers of a tabular model; otherwise I think you'd need to think about the semantics of the data, and what things you could do to change the data without changing its meaning. It sounds like a pretty tricky question. Does fastai distinguish between ordered categories such as low, medium, high and unordered categorical variables? Yes — that was that ordinal thing I told you about before, and all it really does is ensure that your classes list has a specific order, so that these numbers actually have a specific meaning. As you'll see, that's actually going to turn out to be pretty important for how we train our random forest. Okay, so we can create a DecisionTreeRegressor, and then we can draw it with a fastai function, and here is the decision tree we just trained. Behind the scenes this used basically the exact process that we described back here — so this is where you can try to create your own decision tree implementation, if you're interested in stretching yourself. We're going to use one that already exists, and the best way to understand what it's done is to look at this diagram from top to bottom. The first step says: the initial model it created is a model with no binary splits at all. Specifically, it's always going to predict the value 10.1, for every single row. Why is that? Well, because the simplest possible model is to take the average of the dependent variable and always predict that — and that should pretty much always be your basic baseline for regression. There are 404,710 rows — auctions — that we're averaging, and the mean squared error of this incredibly simple model, in which there are no rules at all, no groups at all, just a single average, is 0.48.
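On toy data, the sklearn calls look like this. The real notebook passes the xs and y pulled from the TabularPandas object; the stand-in data here is ours:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Stand-in data: one 'YearMade'-like column and log-price-like targets.
xs = np.array([[1990], [1995], [2000], [2005], [2010], [2012]])
y = np.array([9.0, 9.2, 9.9, 10.1, 10.5, 10.6])

# Limit the tree to four leaf nodes, as in the lecture, then fit.
m = DecisionTreeRegressor(max_leaf_nodes=4)
m.fit(xs, y)
```

`m.predict(xs)` then returns, for each row, the mean of the training targets in whichever of the four leaves that row lands in.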
So then the next most complex model is to take a single column, coupler_system, and a single binary decision: is coupler_system less than or equal to 0.5? There are 360,847 auctions where it's true, and 43,863 where it's false. Interestingly, in the false case you can see that there are no further binary decisions. So this is called a leaf node — a node where this is as far as you can get — and so if your coupler_system is not less than or equal to 0.5, then the prediction this model makes for your sale price is 9.21, versus 10.21 if it's true. So you can see it's actually found a very big difference here, and that's why it picked this as the first binary split. And the mean squared error for this section here is 0.12, which is far better than the 0.48 we started out at. This group still has 360,000 in it, and so it does another binary split: this time, is the year that this piece of equipment was made less than or equal to 1991.5? If it's true, then we get a leaf node, and the prediction is 9.97, with mean squared error 0.37. If it's false, we don't have a leaf node, and we have another binary split. And you can see eventually we get down to here: coupler_system true, year made false, product size false, mean squared error 0.17. So all of these leaf nodes have MSEs that are smaller than that original baseline model of just taking the mean. So this is how you grow a decision tree. And we only stopped here because we said max_leaf_nodes is 4 — 1, 2, 3, 4 — so if we want to keep training it further, we can just use a higher number. There's actually a very nice library by Terence Parr called dtreeviz which can show us exactly the same information, like so. And so here are the same leaf nodes, 1, 2, 3, 4, and you can see the chart of how many are in each. This is the split, coupler_system at 0.5, here are the two groups, you can see the sale price in each of the two groups, and then here's the leaf node.
And so then the second split was on year made. And you can see here that something weird is going on with year made: there's a whole bunch of YearMade values that are 1000, which is obviously not a sensible year for a bulldozer to be made. So presumably that's some kind of missing value. When we look at a picture like this, it can give us some insights about what's going on in our data. And so maybe we should replace those 1000s with 1950, because that's obviously a very, very early year for a bulldozer — we can pick it somewhat arbitrarily. It's actually not really going to make any difference to the model that's created, because all we care about is the order, since we're just doing these binary splits; but it'll make the chart easier to look at, as you can see. Here's our 1950s now, and it's much easier to see what's going on in that binary split. So let's now get rid of max_leaf_nodes and build a bigger decision tree. And then, just for the rest of this notebook, let's create a couple of little functions: one to compute the root mean squared error, which is just here, and another one to take a model and some independent variables, predict from the model on those independent variables, and then take the root mean squared error against the dependent variable. So that's going to be our model's root mean squared error. For this decision tree, in which we didn't have a stopping criterion — so, as many leaf nodes as you like — the model's root mean squared error is zero. So we've just built the perfect model! This is great news, right? We've built the perfect auction trading system. Well, remember we actually need to check the validation set. So let's check the RMSE on the validation set and... oh, it's much worse than zero. Our training set error is zero, but our validation set error is much worse than zero. Why has that happened?
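Those two helper functions, reconstructed from the description (the names follow the published fastbook notebook, but treat this as a sketch):

```python
import numpy as np

def r_mse(pred, y):
    # Root mean squared error, rounded for tidy printing.
    return round(np.sqrt(((pred - y) ** 2).mean()), 6)

def m_rmse(m, xs, y):
    # Predict with a fitted model, then score against the targets.
    return r_mse(m.predict(xs), y)
```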
Well, one of the things that a decision tree in sklearn can tell you is the number of leaf nodes. The number of leaves is 341,000; the number of data points is 400,000. So in other words, we have nearly as many leaf nodes as data points — most of our leaf nodes only have a single thing in them, and they're taking an average of a single thing. Clearly this makes no sense at all. So what we should actually do is pick a different stopping criterion, and say: okay, don't split things in a way that would create a leaf node with less than 25 things in it. And now if we fit, and we look at the root mean squared error for the validation set, it's gone down from 0.33 to 0.32. So the training set's got worse, from zero to 0.248, the validation set's got better, and now we only have 12,000 leaf nodes, which is much more reasonable. All right, so let's take a 5 minute break, and then we're going to come back and see how we get the best of both worlds: how are we going to get something which has the kind of flexibility of these really deep trees — the kind that got us down to zero training error — but without overfitting. The trick will be to use something called bagging. We'll come back and talk about that in 5 minutes. Okay, welcome back. So we're going to look at how we can get the best of both worlds, as we discussed, and let's start by having a look at what we're doing with categorical variables, first of all. You might notice that previously, with categorical variables — for example in collaborative filtering — we had to think about things like how many embedding levels we have. If you've used other modeling tools, you might have seen things like creating dummy variables. For random forests, on the whole, you don't have to. The reason is that, as we've seen, all of our categorical variables have been turned into numbers.
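The effect of that stopping criterion is easy to see on synthetic data: with min_samples_leaf=25, 200 rows can produce at most eight leaves, while the unconstrained tree is free to memorize. The data here is made up:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
xs = rng.uniform(1990, 2012, size=(200, 1))       # a 'YearMade'-like column
y = xs[:, 0] / 200 + rng.normal(0, 0.1, 200)      # noisy target

# No stopping criterion: the tree is free to memorize the training data.
unlimited = DecisionTreeRegressor().fit(xs, y)
# Require at least 25 samples per leaf: far fewer, better-populated leaves.
limited = DecisionTreeRegressor(min_samples_leaf=25).fit(xs, y)
```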
And so we can perfectly well have decision tree binary decisions which use those particular numbers. Now, the numbers might not be ordered in any interesting way, but if there's a particular level which stands out as being important, it only takes two binary splits to isolate that level into a single piece. So generally speaking, I don't normally worry too much about encoding categorical variables in a special way. As I mentioned, I do try to encode ordinal variables by saying what the order of the levels is, because often, as you'd expect, sizes like medium and small are going to behave similarly, and large and extra large will behave similarly, so it's good to have those as adjacent numbers. Having said that, you can one-hot encode a categorical variable if you want to, using get_dummies in pandas, but there's not a lot of evidence that that actually helps — that's actually been examined in a paper. So I would say, in general, for categorical variables, don't worry about it too much; just use what we've shown you. We have a question: for ordinal categorical variables, how do you deal with NA or missing values — where do you put them in the order? In fastai, NA or missing values always appear as the first item; they'll always be the zero item. And also, if you get something in the validation or test set which is a level we haven't seen in training, that will be treated as that missing or NA value as well. All right, so what we're going to do to try and improve our random forest is use something called bagging. This was developed by a retired Berkeley professor named Leo Breiman in 1994, and he did a lot of great work — you could perhaps argue that most of it happened after he retired.
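For reference, one-hot encoding with pandas looks like this — though, as just noted, random forests generally don't need it. The toy column is ours:

```python
import pandas as pd

# One column of three rows becomes one indicator column per level.
df = pd.DataFrame({'ProductSize': ['Large', 'Small', 'Large']})
dummies = pd.get_dummies(df, columns=['ProductSize'])
```

Each level becomes its own boolean column, which is exactly the expansion that the two-binary-splits argument above makes unnecessary for tree ensembles.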
His technical report was called "Bagging Predictors", and he described how you could create multiple versions of a predictor — multiple different models — and then aggregate them by averaging over their predictions. Specifically, the way he suggested doing this was to create what he called bootstrap replicates: in other words, randomly select different subsets of your data, train a model on each subset, and store it away as one of your predictors; then do that a bunch of times. Each of these models is trained on a different random subset of your data, and to predict, you predict with all of those different versions of your model and average them. And it turns out that bagging works really well. So the sequence of steps is basically: randomly choose some subset of rows; train a model using that subset; save that model; and then return to step one. Do that a few times to train a few models, and then to make a prediction, predict with all the models and take the average. That is bagging, and it's very simple, but it's astonishingly powerful. The reason why is that each of these models we've trained, although it's less accurate than a model trained on all of the data — because it's only using a subset — has errors that are not correlated with the errors of the other models, because they're random subsets. And when you look at the average of a bunch of errors which are not correlated with each other, the average of those errors is zero. So the average of the models should give us an accurate prediction of the thing we're actually trying to predict. So, as I say here, it's an amazing result: we can improve the accuracy of nearly any kind of algorithm by training it multiple times on different random subsets of the data and then averaging the predictions. So then Breiman in 2001 showed a way to do this specifically for decision trees, where not only did he randomly choose a subset of
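That four-step recipe is short enough to write out directly. A sketch using sklearn trees as the base models, with row subsets drawn without replacement rather than Breiman's bootstrap sampling; the data in the usage is made up:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(xs, y, n_models=10, sample_frac=0.75, seed=42):
    # Steps 1-3, repeated: draw a random subset of rows, fit a model
    # on just that subset, and store it.
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(n_models):
        idx = rng.choice(n, int(n * sample_frac), replace=False)
        models.append(DecisionTreeRegressor().fit(xs[idx], y[idx]))
    return models

def bagged_predict(models, xs):
    # Step 4: predict with every model and average the predictions.
    return np.stack([m.predict(xs) for m in models]).mean(axis=0)

# Toy usage on a simple linear relationship.
xs = np.arange(100, dtype=float).reshape(-1, 1)
y = xs[:, 0] * 2
models = bagged_trees(xs, y)
preds = bagged_predict(models, xs)
```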
rows for each model, but for each binary split he also randomly selected a subset of columns. This is called the random forest, and it's perhaps the most widely used, most practically important machine learning method — and it's astonishingly simple. To create a random forest regressor, you use sklearn's RandomForestRegressor. If you pass n_jobs=-1, it will use all of the CPU cores you have, to run as fast as possible. n_estimators says how many trees — how many models — to train; max_samples says how many randomly chosen rows to use in each one; max_features is how many randomly chosen columns to consider at each binary split point; and min_samples_leaf is the stopping criterion, which we'll come back to. So here's a little function that will create a random forest regressor with a few default values and fit it to a set of independent variables and a dependent variable. We create a random forest, train it, and our validation set RMSE is 0.23. If we compare that to what we had before, 0.32, that's dramatically better, just by using a random forest. So what's happened when we called RandomForestRegressor is that it's using the decision tree builder we've already seen, but it's building multiple versions on different random subsets, and for each binary split it's also randomly selecting a subset of columns; and when we make a prediction, it averages the predictions of all the trees. And as you can see, it's giving a really great result. One of the amazing things we'll find is that it's going to be hard for us to improve on this very much — the default starting point tends to turn out to be pretty great. The sklearn docs have lots of good information, and one of the things they have is this nice picture that shows, as you increase the number of estimators, how the error rate improves for different max_features levels. In general, the more trees you add, the more accurate your model — it's not going to overfit,
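Reconstructed from that description, the little helper function looks roughly like this. The default values follow the lecture's discussion but should be treated as a sketch, and the toy usage data is ours:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf(xs, y, n_estimators=40, max_samples=200_000,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    # n_jobs=-1 uses every CPU core; the other arguments control how many
    # trees, how many random rows per tree, what fraction of columns to
    # consider per split, and the leaf-size stopping criterion.
    return RandomForestRegressor(
        n_jobs=-1, n_estimators=n_estimators, max_samples=max_samples,
        max_features=max_features, min_samples_leaf=min_samples_leaf,
        oob_score=True, **kwargs).fit(xs, y)

# Toy usage (the notebook passes its real xs and y; these are stand-ins,
# with max_samples lowered to fit the tiny data set).
xs = np.arange(200, dtype=float).reshape(-1, 1)
y = xs[:, 0]
m = rf(xs, y, max_samples=100)
```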
right, because it's averaging more of these weak models — more of these models that are trained on subsets of the data. So train using as many estimators as you like; it's really just a question of how much time you have, and whether you reach a point where it's not really improving anymore. You can actually get at the underlying decision trees in a random forest model using estimators_, so with a list comprehension we can call predict on each individual tree. And so here's a numpy array containing the predictions from each individual tree, for each row in our data. If we take the mean across the zero axis, we'll get exactly the same number as the forest's prediction, because remember, that's what a random forest does: it takes the mean of the trees' predictions. So one cool thing we can do is look at the 40 estimators we have, grab the predictions for the first i of those trees, take their mean, and then find the root mean squared error. In other words, here is the accuracy when you've got just one tree, two trees, three trees, four trees, five trees, etc. And it's kind of nice, right — you can create your own little tools to look inside these things and see what's going on. And so we can see here that as you add more and more trees, the root mean squared error did indeed keep improving, although the improvement slowed down after a while. The validation set error is worse than the training set error, and there's a couple of reasons that could have happened: maybe it's because we're still overfitting — which is not necessarily a problem, just something to identify — or maybe it's because the last two weeks we're trying to predict are actually different from the other auctions in our data set; maybe something changed over time. So how do we tell which of those two reasons is the cause of our validation set being worse? We can actually find
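The estimators_ peek can be reproduced on toy data; the forest's prediction is exactly the mean over its trees, and the same stack of per-tree predictions gives the tree-count-versus-RMSE curve. The synthetic data here is ours:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
xs = rng.uniform(0, 10, (200, 1))
y = xs[:, 0] ** 2

m = RandomForestRegressor(n_estimators=40, random_state=42).fit(xs, y)

# One row of predictions per individual tree.
preds = np.stack([t.predict(xs) for t in m.estimators_])
# RMSE using only the first i trees, for i = 1..40.
rmse_by_n = [np.sqrt(((preds[:i + 1].mean(0) - y) ** 2).mean())
             for i in range(len(m.estimators_))]
```

Plotting `rmse_by_n` reproduces the curve described in the lecture: error falls quickly at first, then flattens out as trees are added.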
out using a very clever trick called out-of-bag (OOB) error, and we use OOB error for lots of things. You can grab the OOB predictions from the model with oob_prediction_, take the RMSE, and find that the OOB RMSE is 0.21, which is quite a bit better than 0.23. So let me explain what OOB error is. We look at each row of the training set — not the validation set — and we say, for row number one: which trees included row number one in their training? Let's not use those for calculating the error, because it was part of those trees' training; we'll calculate the error for that row using only the trees where that row was not included in training. Remember, every tree is using only a subset of the data. So we do that for every row: we find the prediction using only the trees that didn't train on that row, and those are the OOB predictions. In other words, this gives us a validation-set-like result without actually needing a validation set. But the thing is, it doesn't have that time offset — it's not looking at just the last two weeks, it's looking at the whole training set. So this basically tells us how much of the error is due to overfitting, versus due to being the last couple of weeks. That's a cool trick: OOB error is something that very quickly gives us a sense of how much we're overfitting, and we don't even need a validation set to do it. So that's telling us a bit about what's going on in our model. But there's a lot more we'd like to find out from our model, and I've got five things in particular here which I generally find pretty interesting: how confident are we about our predictions for some particular row — like, we can say this is what we think the prediction is, but how confident are we? Is it exactly that, or just about that, or do we really have no idea? And then, for predicting a particular item, which factors were the most important in
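With sklearn this is just a flag: when oob_score=True, the fitted model's oob_prediction_ attribute holds, for each training row, the average prediction from only those trees whose sample did not include that row. Toy data again:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
xs = rng.uniform(0, 10, (300, 1))
y = xs[:, 0] * 3

m = RandomForestRegressor(n_estimators=100, oob_score=True,
                          random_state=0).fit(xs, y)

# A validation-style RMSE computed from the training set alone.
oob_rmse = np.sqrt(((m.oob_prediction_ - y) ** 2).mean())
```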
that prediction, and how did they influence it? Overall, which columns are making the biggest difference in our model, and which ones could we maybe throw away without it mattering? Which columns are basically redundant with each other, so we don't really need both of them? And as we vary some column, how does it change the prediction? Those are the five things I'm interested in figuring out, and we can do all of them with a random forest. Let's start with the first one. We've already seen that we can grab all of the predictions for all of the trees and take their mean to get the actual predictions of the model, and then get the RMSE. But what if, instead of taking the mean, we did exactly the same thing but took the standard deviation? This is going to tell us, for each row in our data set, how much the trees varied. And so if our model had really never seen data like this before — if it's something where different trees are giving very different predictions — that might give us a sense that this is something we're not at all confident about. And as you can see, when we look at the standard deviation of the trees for, say, the first five rows, they vary a lot: 0.2, 0.1, 0.09, 0.3. So this is really interesting — it's not something a lot of people talk about, but I think it's a really interesting approach to figuring out whether we might want to be cautious about a particular prediction, because maybe we're not very confident about it. And it's something we can easily do with a random forest. The next thing — and this is, I think, the most important thing for me in terms of interpretation — is feature importance. Here's what feature importance looks like: we can call feature importance on a model with some independent variables and, say, grab the first 10, and this says these are the 10 most important features in this random forest — the things most strongly driving the sale price. Or we can plot them,
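Swapping mean for standard deviation over the stack of per-tree predictions gives that per-row confidence measure. A self-contained sketch on made-up data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
xs = rng.uniform(0, 10, (200, 1))
y = xs[:, 0] ** 2
m = RandomForestRegressor(n_estimators=40, random_state=0).fit(xs, y)

# Same stack of per-tree predictions as for the mean, but std instead:
# rows where the trees disagree a lot are rows to be cautious about.
preds_std = np.stack([t.predict(xs) for t in m.estimators_]).std(0)
```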
and so you can see here there's just a few things that are by far the most important: what year the equipment was made, how big it is, and the product class. And you can get this by simply looking inside your trained model and grabbing the feature_importances_ attribute; here, to make it print out more nicely, I'm just sticking that into a data frame and sorting descending by importance. So how is this actually being done? It's actually really neat. What scikit-learn does — and Breiman, the inventor of random forests, described this — is to go through each tree, start at the top, and look at each branch; at each branch, see which column the binary split was based on, and how much better the model got after that split compared to beforehand. We basically say that column was responsible for that amount of improvement. You add that up across all of the splits in all of the trees for each column, and then normalize so they all add to one, and that's what gives you these numbers — the first few shown in this table, and the first 30 in this chart. So this is something that's fast and easy, and it gives us a good sense that, for instance, maybe the columns with importance less than 0.005 could be removed. If we did that, it would leave us with only 21 columns. So let's try it: keep only the xs which are in this list of important columns, do the same for valid, retrain our random forest, and have a look at the result. Basically our accuracy is about the same, but we've gone down from 78 columns to 21 columns. And I think this is really important: it's not just about creating the most accurate model you can — you want to be able to fit it in your head as best as possible, and 21 columns is going to be much easier for us to check for data issues and understand. And the accuracy, or
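The feature-importance helper just pairs sklearn's feature_importances_ attribute with the column names. The helper name follows the fastbook notebook; the toy model around it is ours:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def rf_feat_importance(m, df):
    # One row per column, sorted so the most important feature comes first.
    return pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_}
                        ).sort_values('imp', ascending=False)

# Toy usage: one informative column and one pure-noise column.
rng = np.random.default_rng(4)
df = pd.DataFrame({'YearMade': rng.uniform(1990, 2012, 300),
                   'noise': rng.standard_normal(300)})
y = df['YearMade'] / 10
m = RandomForestRegressor(n_estimators=20, random_state=0).fit(df, y)
fi = rf_feat_importance(m, df)
```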
the RMSE, is about the same. So I would say, okay, let's just stick with the important xs from now on. And so here's the entire set of 21 features, and you can see it now looks like year made and product size are the two really important things, and then there's a cluster of mainly product-related things at the next level of importance. One of the tricky things here is that we've got things like the product class description, the model description and the base model, which all look like they might be similar ways of saying the same thing. So one thing that can help us interpret the feature importance better, and understand better what's happening in the model, is to remove redundant features. One way to do that is to call fastai's cluster_columns, which is basically a thin wrapper around functionality that already exists elsewhere. What that's going to do is find pairs of columns which are very similar. You can see here saleYear and saleElapsed — see how this line is way out to the right — whereas MachineID and ModelID are not at all similar; they're way out to the left. So that means that saleYear and saleElapsed are very, very similar: when one is low the other tends to be low, and vice versa. Here's a group of three which all seem to be much the same, and then the product group description and product group, and then the base model and model description. These all seem like cases where maybe we could remove one of each pair, because they basically seem to be much the same: when one is high the other is high, and vice versa. So let's try removing one of each. Now, it takes a little while to train a random forest, so just to see whether removing something makes things much worse, we can train a very fast version: something with only 50,000 rows per tree, and just 40 trees, and then get the OOB score. For that fast, simple version, our baseline OOB score with our important xs is 0.877,
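fastai's cluster_columns does this with hierarchical clustering on rank correlations; a much cruder stand-in, entirely our own, catches the same obvious pairs by flagging any two columns whose Spearman rank correlation is very high:

```python
import numpy as np
import pandas as pd

def redundant_pairs(df, thresh=0.9):
    # Rank-correlate every pair of columns (rank then Pearson = Spearman)
    # and flag pairs similar enough that one of them is likely redundant.
    corr = df.rank().corr()
    pairs = []
    cols = list(df.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if abs(corr.loc[a, b]) > thresh:
                pairs.append((a, b))
    return pairs

# Toy usage: two columns in lockstep, one unrelated.
rng = np.random.default_rng(5)
df = pd.DataFrame({'saleYear': np.arange(30),
                   'saleElapsed': np.arange(30) * 365,
                   'MachineID': rng.permutation(30)})
pairs = redundant_pairs(df)
```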
and here, for OOB, a higher number is better. So then let's try going through each of the things we thought we might not need, try dropping it, and get the OOB score for our xs with that one column removed. Compared to 0.877, most of them don't seem to hurt very much; saleElapsed hurts quite a bit. So for each of those groups, let's go and see which ones it seems like we can remove. Here's the five I found; let's remove the whole lot and see what happens. The OOB went from 0.877 to 0.874, so hardly any difference at all, despite the fact that we managed to get rid of five of our variables. So let's create something called xs_final, which is xs_imp with those five columns dropped, and save it for later (we can always load it back again), and then let's check our random forest using those. Again, 0.233 or 0.234, so we've got about the same accuracy, but with even fewer columns now. So we're getting a simpler and simpler model without hurting our accuracy, which is great. The next thing we said we were interested in learning about is, for the columns that are most important, what's the relationship between that column and the dependent variable? For example, what's the relationship between ProductSize and sale price? The first thing I would do is just look at a histogram. One way to do that is with value_counts in pandas, and we can see here our different levels of ProductSize. One thing to note is that missing is actually the most common, the next most common are Compact and Small, and Mini is pretty tiny. We can do the same thing for YearMade. Now for YearMade we can't just use a basic bar chart; we actually need a histogram, and pandas has stuff like this built in, so we can just call hist. That 1950, remember, is something we created: it's the missing-value marker that used to be 1000. Most of them seem to be well into the 90s and 2000s. So let's now
look at something called a partial dependence plot. I'll show it to you first: here is a partial dependence plot of YearMade against partial dependence, that is, the average predicted log sale price. What does this mean? Well, we should focus on the part where we actually have a reasonable amount of data, so at least well into the 80s, around here. So let's look at this bit here. Basically, what this says is that as YearMade increases, the predicted sale price (log sale price, of course) also increases. And since the log sale price is increasing roughly linearly, this is actually an exponential relationship between YearMade and sale price. Why do we call it a partial dependence? Are we just plotting the year against the average sale price? Well no, we're not; we can't do that, because a lot of other things change from year to year. For example, maybe more recently people tend to buy bigger bulldozers, or more bulldozers with air conditioning, or more expensive models of bulldozers. We really want to be able to say: no, what's the impact of just the year, and nothing else? And if you think about it from an inflation point of view, you would expect that bulldozers would get a constant ratio cheaper the further you go back, which is what we see. So what we really want to say is: all other things being equal, what happens if only the year changes? And there's a really cool way we can answer that question with a random forest. How does YearMade impact sale price, all other things being equal? What we can do is go into our actual data set and replace every single value in the YearMade column with 1950, then calculate the predicted sale price for every single auction, and then take the average over all the auctions, and that's what gives us this value here. Then we can do the same for 1951, 1952, and so forth, until eventually we get to our final year of 2011. So this isolates the effect of only YearMade. It's a kind of curious thing to do
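The replace-the-column-and-average trick described above can be written in a few lines with any fitted model. Here is a sketch on synthetic data; the column names are borrowed from the lesson, but the data, model, and helper function are my own, not the lesson's code:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: log price rises linearly with YearMade and with size
rng = np.random.default_rng(0)
n = 3000
year = rng.integers(1980, 2012, size=n)
size = rng.normal(size=n)
log_price = 0.05 * (year - 1980) + 0.5 * size + rng.normal(scale=0.1, size=n)
df = pd.DataFrame({"YearMade": year, "ProductSize": size})

m = RandomForestRegressor(n_estimators=40, n_jobs=-1, random_state=0)
m.fit(df, log_price)

def partial_dependence(model, df, col, values):
    # For each candidate value: overwrite the whole column with it,
    # predict for every row, and average the predictions
    means = []
    for v in values:
        df2 = df.copy()
        df2[col] = v
        means.append(model.predict(df2).mean())
    return np.array(means)

years = np.arange(1980, 2012)
pdp = partial_dependence(m, df, "YearMade", years)
# pdp rises steadily with year: YearMade's effect, isolated from ProductSize
```

In practice you would use a library implementation (scikit-learn has `sklearn.inspection.partial_dependence`), but the mechanics are exactly this loop.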
but it's actually a pretty neat trick for trying to pull things apart and create this partial dependence, to say what might be the impact of just changing YearMade. We can do the same thing for ProductSize, and one of the interesting things if we do is that we see the lowest value of predicted log sale price is for NA, which is a bit of a worry. It means the question of whether or not the product size is labeled is really important, and that is something I would want to dig into before I actually used this model: why is it that sometimes things aren't labeled, and what does it mean? Why is that actually such an important predictor? So that is the partial dependence plot, and it's a really clever trick. Now, we have looked at four of the five questions we said we wanted to answer at the start of this section, so the last one we want to answer is this one here: when we're predicting with a particular row of data, what were the most important factors, and how did they influence that prediction? This is quite related to the very first thing we saw. Imagine you were using this auction price model in real life: you had something on your tablet, you went into some auction and looked up what the predicted auction price would be for the lot that's coming up, to find out whether it seems to be under- or overvalued, and then you decide what to do about that. One thing we said we'd be interested to know is: are we actually confident in our prediction? And then we might be curious to find out: oh, I'm really surprised it was predicting such a high value; why was it predicting such a high value? To find the answer to that question we can use a module called treeinterpreter. The way treeinterpreter works is that you pass in a single row. So it's like: here's the auction that's coming up, here's the model, here's the
auctioneer ID, et cetera; please predict the value from the random forest, the expected sale price. Then what we can do is take that one row of data and put it through the first decision tree, and see what's the first split that's selected, and then, based on that split, does it end up increasing or decreasing the predicted price compared to the raw baseline of just taking the average? Then you can do that again at the next split, and again at the next split. So for each split, we see what the increase or decrease in the prediction is compared to the parent node. Then you can do that for every tree, and add up the total change in the prediction by split variable, and that allows you to draw something like this. So here's something that's looking at one particular row of data. Overall we start at zero, where zero stands for the initial 10.1; do you remember this number? 10.1 is the average log sale price of the whole data set; treeinterpreter calls it the bias. So, taking that as zero, for this particular row YearMade has a negative 0.42 impact on the prediction, then ProductSize has a positive 0.2, Coupler_System a positive 0.046, ModelID a positive 0.127, and so forth. The red ones are negative and the green ones are positive, and you can see how they all join up until eventually, overall, the prediction is negative 0.122 relative to 10.1, which is equal to 9.98. This kind of plot is called a waterfall plot. So basically, when we call treeinterpreter's predict, it gives us back the prediction, which is the actual number we get back from the random forest; the bias, which is just always this 10.1 for this data set; and the contributions, which are all of these different values: how much, how important, each factor was. And here I've used a threshold, which means anything that was less than
0.08 all gets thrown into this "other" category. I think this is a really useful thing to have in production, because it can help you answer questions, whether from a customer or from whoever's using your model, if they're surprised about some prediction: why is that the prediction? Now I'm going to show you something really interesting using some synthetic data, and I want you to really have a think about why this is happening before I tell you; pause the video, if you're watching the video, when I get to that point. Let's start by creating synthetic data, like so: I'm going to grab 40 values evenly spaced between 0 and 20, and then we're going to create the y = x line and add some normally distributed random noise to it. So here's the plot: here's some data we want to try to predict, and we're going to use a random forest, which is a bit of overkill here. Now, in this case we only have one independent variable, but scikit-learn expects us to have more than one, so we can use unsqueeze in PyTorch to go from a shape of 40, in other words a vector with 40 elements, to a shape of (40, 1), in other words a matrix of 40 rows with one column. So this unsqueeze(1) means "add a unit axis here". I don't use unsqueeze very often, though, because I actually generally prefer to index with the special value None. This works in both PyTorch and NumPy, and the way it works is to say x_lin[:, None]: take every row (remember, that's a vector of length 40), and then None means "insert a unit axis here" for the column. So these are two ways of doing the same thing, but this one is a little bit more flexible, so it's what I use more often. Now that we've got the shape that is expected, which is a rank-2 tensor, an array with two dimensions or axes, we can create a random forest and fit it, and let's just use the first 30 data points. Then let's make a prediction, plot the original data points, and also plot the predictions. Look what happens: the prediction is kind of nice and
accurate, and then suddenly, what happens? This is the bit where, if you're watching the video, I want you to pause and have a think. So what's gone on here? Well, remember, a random forest is just taking the average of the predictions of a bunch of trees, and the prediction of a tree is just the average of the values in a leaf node. And remember, we fitted using a training set containing only the first 30 points, so none of these later points appeared in the training set. The highest prediction we could possibly get is the average of values that are inside the training set; in other words, there's this maximum we can get to. So random forests cannot extrapolate outside of the bounds of the data that they've seen. This is going to be a huge problem for things like time series prediction with an underlying trend, for instance, but really it's a more general issue than just time variables: it's going to be hard, or often impossible, for a random forest to extrapolate outside the types of data that it's seen, in a general sense. So we need to make sure that our validation set does not contain out-of-domain data. How do we find out-of-domain data? We might not even know whether our test set is distributed in the same way as our training data: if they're from two different time periods, how do you tell how they vary? Or if it's a Kaggle competition, how do you tell whether the test set and the training set which Kaggle gives you have some underlying differences? There's actually a cool trick you can do, which is to create a column called is_valid which contains zero for everything in the training set and one for everything in the validation set, concatenating the independent variables of both the training and validation sets together. So that concatenation is our independent variable, and is_valid becomes our dependent variable, and we create a random forest, not for predicting price, but a random forest that predicts: is this row from the validation
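That extrapolation failure is easy to reproduce with scikit-learn on the same kind of synthetic line. The shapes and values below are assumptions in the spirit of the lesson, not its exact code:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x = np.linspace(0, 20, 40)
y = x + rng.normal(scale=1.0, size=40)   # y = x plus noise

# sklearn wants a 2-D array: indexing with None adds a unit column axis,
# the same trick as PyTorch's unsqueeze(1)
X = x[:, None]

# Train on only the first 30 points (x up to about 15)
m = RandomForestRegressor(n_estimators=100, random_state=0)
m.fit(X[:30], y[:30])
preds = m.predict(X)
# Every prediction is an average of training targets, so the forest can
# never predict above the highest target it saw: the curve goes flat
```

In-domain predictions track the line, but beyond x of about 15 the predictions plateau at roughly the largest training target.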
set or the training set? If the validation set and the training set are from the same distribution, if they're not different, then this random forest should basically have zero predictive power. If it has any predictive power, then it means that our training and validation sets are different, and to find out the source of that difference we can use feature importance. You can see here that the difference between the validation set and the training set is, not surprisingly, saleElapsed. That's the number of days since, I think, 1970 or something, so it's basically the date. So yes, of course you can predict whether something is in the validation set or the training set by looking at the date, because that's actually how we defined them; it makes sense. This is interesting: SalesID. It looks like SalesID is not some random identifier but increases over time, and ditto for MachineID. Then there are some other smaller ones here that kind of make sense; for something like fiModelDesc, I guess there are certain models that were only made in later years, for instance. But you can see those top three columns are a bit of an issue. So then we can say: what happens if we take each one of those first three columns, remove it, and see how that changes the RMSE of our sale price model on the validation set? We start from 0.232, and removing SalesID actually makes it a bit better, saleElapsed makes it a bit worse, and MachineID is about the same. So we can probably remove SalesID and MachineID without losing any accuracy, and yep, it's actually slightly improved. Most importantly, though, it's going to be more resilient over time, because we're removing time-related features. Another thing to note is that maybe this saleElapsed issue is making a big difference. Looking at the saleYear distribution (this is the histogram), most of the sales are in the last few years anyway. So what happens
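The is_valid trick needs nothing more than a classifier and its feature_importances_. A sketch with one deliberately drifting column; the column names and data are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "drift": np.arange(n, dtype=float),   # saleElapsed-style column: grows over time
    "stable": rng.normal(size=n),         # identically distributed everywhere
})

# 0 = training rows, 1 = validation rows (here: simply the latest rows)
is_valid = np.zeros(n)
is_valid[-400:] = 1

m = RandomForestClassifier(n_estimators=40, oob_score=True,
                           n_jobs=-1, random_state=0)
m.fit(df, is_valid)

# High OOB accuracy => train and validation are distinguishable,
# and the feature importances point at the columns responsible
imp = pd.Series(m.feature_importances_, index=df.columns)
```

If the classifier's accuracy were near chance, the two sets would look alike; here nearly all the importance lands on the drifting column, just as saleElapsed dominated in the lesson.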
if we only include the most recent few years? Let's just include everything after 2004: that is xs_filt. With that subset, my accuracy improves a bit more, from about 0.231 to 0.230. So that's interesting: we're actually using less data, fewer rows, and getting a slightly better result, because the more recent data is more representative. So that's about as far as we can get with our random forest. But what I will say is this: this issue of extrapolation would not happen with a neural net, would it? Because a neural net's underlying layers are linear layers, and linear layers can absolutely extrapolate. So the obvious thing to think at this point is: maybe a neural net would do a better job of this. That's what we'll try next. But a question first: how does feature importance relate to correlation? Feature importance doesn't particularly relate to correlation; correlation is a concept from linear models, and this is not a linear model. Remember, feature importance is calculated by looking at the improvement in accuracy as you go down each tree, through each binary split. If you're used to linear regression then I guess correlation can sometimes be used as a measure of feature importance, but this is a much more direct version that takes account of the nonlinearities and interactions as well, so it's a much more flexible and reliable measure of feature importance. Any more questions? So, to do the same thing with a neural network, I'm going to just copy and paste the same lines of code I had before, but this time call the result df_nn. These are the same lines of code, and I'll grab the same list of columns we had before, plus the dependent variable, to get the same data frame. Now, as we've discussed, for categorical columns we probably want to use embeddings. To create embeddings we need to know which columns should be treated as categorical variables, and as we discussed, we can use cont_cat_split for
that. One of the useful things we can pass it is the maximum cardinality: max_card=9000 means if there's a column with more than 9,000 levels, you should treat it as continuous, and if it's got fewer than 9,000 levels, treat it as categorical. So it's a simple little function that just checks the cardinality and splits the columns based on how many discrete levels they have, and of course their data type: if it's not actually a numeric data type, it has to be categorical. So there's our split. Then from there, we've got to be a bit careful of saleElapsed, because saleElapsed, I think, has fewer than 9,000 distinct values, but we definitely don't want to use it as a categorical variable: the whole point was to make it something we can extrapolate. Anything that's time dependent, or where we think we might see values outside the range of inputs in the training data, we should make a continuous variable. So let's take saleElapsed, put it into the continuous variables for the neural net, and remove it from the categoricals. Next, here's the number of unique levels, from pandas, for each of the categorical variables in our neural net data set, and I get a bit nervous when I see really high numbers here. I don't want too many columns with lots and lots of categories, simply because they're going to take up a lot of parameters: every one of these levels is a row in an embedding matrix. In this case, I notice ModelID and fiModelDescriptor might be describing something very similar, so I'd quite like to find out whether I can get rid of one, and an easy way to do that is with a random forest. So let's try removing fiModelDescriptor, create a random forest, and see what happens: oh, it's actually a tiny bit better, and certainly not worse. So that suggests we can actually get rid of one of
these two variables. So let's get rid of that one, and now we can create a TabularPandas object just like before, but this time we're going to add one more processor, which is Normalize. The reason we need Normalize (which subtracts the mean and divides by the standard deviation) is this: we didn't need it for a random forest, because a random forest is just looking at less-than or greater-than through its binary splits, so all that matters is how things are ordered, not whether they're super big or super small. It definitely matters for neural nets, though, because we have these linear layers: we don't want things with crazy distributions, with some super big numbers and some super small numbers, because it's not going to work well. So it's always a good idea to normalize things in neural nets, and we can do that for a tabular neural net by using the Normalize tabular proc. So we do the same thing we did before, creating a TabularPandas object for the neural net, and then we create data loaders from that with a batch size. This is a large batch size, because tabular models generally don't require nearly as much GPU RAM as a convolutional neural net or an RNN. Since it's a regression model, we're going to want a y_range, so let's find the minimum and maximum of our dependent variable. We can now go ahead and create a tabular learner. Our tabular learner is going to take our data loaders, our y_range, how many activations we want in each of the linear layers (and you can have as many linear layers as you like here), how many outputs there are (this is a regression with a single output), and what loss function we want. We can use lr_find, and then go ahead and use fit_one_cycle. There's no pre-trained model, obviously, because this is not something where people have pre-trained models for industrial equipment auctions, so we just use fit_one_cycle and train for a minute
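The normalization idea is worth seeing in isolation. fastai's Normalize proc subtracts the mean and divides by the standard deviation using training-set statistics; here's the same idea in plain scikit-learn, with StandardScaler and a small MLP standing in for the fastai tabular learner (the layer sizes, data, and split are my own invented example, not the lesson's):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
# Two continuous features on wildly different scales
X = np.column_stack([
    rng.normal(2000, 10, n),   # YearMade-scale numbers
    rng.normal(0, 1, n),       # already unit scale
])
y = 0.1 * (X[:, 0] - 2000) + X[:, 1]

# Normalize: subtract the mean, divide by the standard deviation,
# using statistics computed from the training split only
scaler = StandardScaler().fit(X[:800])
Xn = scaler.transform(X)

m = MLPRegressor(hidden_layer_sizes=(100, 50), max_iter=300, random_state=0)
m.fit(Xn[:800], y[:800])
r2 = m.score(Xn[800:], y[800:])   # fits easily once inputs are normalized
```

Note the scaler is fit on the training rows only and then applied everywhere, which is exactly what a tabular proc does for you.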
and then we can check: our RMSE is 0.226, where before it was 0.230. So that's amazing: we actually have, straight away, a better result than the random forest. It's a little more fussy and takes a little bit longer, but as you can see, on interesting data sets like this we can get some great results with neural nets. Now, here's something else we could do. The random forest and the neural net each have their own pros and cons, things they're good at and things they're less good at, so maybe we can get the best of both worlds. A really easy way to do that is to use an ensemble. We've already seen that a random forest is itself an ensemble of decision trees, so now we can put that into another ensemble: an ensemble of the random forest and the neural net. There are lots of super fancy ways you can do that, but a really simple way is to take the average: sum up the predictions from the two models, divide by two, and use that as the prediction. So our ensemble prediction is literally just the average of the random forest prediction and the neural net prediction, and that gives us 0.223, versus 0.226. How good is that? Well, it's a little hard to say, because unfortunately this competition is old enough that we can't even submit to it and find out how we would have gone on Kaggle. So we don't really know, and we're relying on our own validation set. But it's quite a bit better than even the first place score on the test set. So if the validation set is doing a good job, this is a good sign that it's a really, really good model, which wouldn't necessarily be that surprising, because in the last few years we've learned a lot about building these kinds of models, and we're taking advantage of a lot of the tricks that have appeared in recent years. Maybe this goes to show, well, I think it certainly goes to show, that both random forests and neural nets have a lot to offer: try both, and maybe even combine both. Now, we've talked about an approach to
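Averaging two models' predictions is one line of code. Here's a sketch with a Ridge regression standing in for the neural net; the models, data, and metric helper are invented for illustration, not the lesson's code:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 1500
X = rng.normal(size=(n, 5))
y = X[:, 0] + np.sin(2 * X[:, 1]) + 0.1 * rng.normal(size=n)
Xtr, Xva, ytr, yva = X[:1000], X[1000:], y[:1000], y[1000:]

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xtr, ytr)
other = Ridge().fit(Xtr, ytr)   # stand-in for the second (neural net) model

def rmse(pred, targ):
    return np.sqrt(((pred - targ) ** 2).mean())

rf_rmse = rmse(rf.predict(Xva), yva)
other_rmse = rmse(other.predict(Xva), yva)
# The ensemble is literally the average of the two models' predictions
ens_rmse = rmse((rf.predict(Xva) + other.predict(Xva)) / 2, yva)
```

Averaging helps most when the two models make different kinds of errors; by convexity of squared error, the averaged prediction's error can never be worse than the worse model's, and it often beats both.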
ensembling called bagging, which is where we train lots of models on different subsets of the data and take the average. Another approach to ensembling, particularly ensembling of trees, is called boosting. Boosting involves training a small model which underfits your data set, maybe with just a very small number of leaf nodes. Then you calculate the predictions using that small model, and subtract the predictions from the targets; these are the errors of your small, underfit model, and we call them residuals. Then you go back to step one, but now, instead of using the original targets, you use the residuals: train another small model which underfits your data set, attempting to predict the residuals, and do that again until you reach some stopping criterion, such as a maximum number of trees. That leaves you with a bunch of models which you don't average but sum, because each one was fit to the residuals of all the previous ones; the residuals get smaller and smaller, and to make predictions we just have to do the opposite of subtracting, which is to add them all together. There are lots of variants of this, but you'll see things like GBMs, gradient boosted machines, or GBDTs, gradient boosted decision trees. There are lots of minor details, and some significant ones, but the basic idea is what I've described. Now a question: "Dropping features in a model is a way to reduce the complexity of the model, and thus overfitting; is this better than adding some regularization like weight decay to tabular models?" Well, I didn't claim we removed columns to avoid overfitting: we removed the columns to simplify, to have fewer things to analyze. It should also mean we don't need as many trees, but there's no particular reason to believe this will regularize, and the idea of regularization doesn't necessarily make a lot of sense for random forests; you can always add more trees. Next question: is there a rule of thumb for picking
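The boosting loop just described fits in a few lines. A from-scratch sketch with tiny scikit-learn trees, leaving out shrinkage and the other refinements real GBM libraries add:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

# Boosting by hand: repeatedly fit a small, underfitting tree
# to the residuals of everything fitted so far
trees = []
residual = y.copy()
for _ in range(50):                               # stopping criterion: max trees
    t = DecisionTreeRegressor(max_leaf_nodes=4)   # tiny tree that underfits
    t.fit(X, residual)
    residual -= t.predict(X)                      # residuals shrink each round
    trees.append(t)

# Prediction is the *sum* (not the average) of all the trees
pred = sum(t.predict(X) for t in trees)
mse_boosted = ((pred - y) ** 2).mean()
mse_first = ((trees[0].predict(X) - y) ** 2).mean()
```

Note this also illustrates the overfitting warning below: the training error keeps falling as you add trees, with nothing to stop it chasing the noise.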
the number of linear layers in the tabular model? Not really; if there is, I don't know what it is. I guess two hidden layers works pretty well, and the numbers I showed are pretty good for a large-ish model. So maybe start with the defaults, then go up to 500 and 250; if that's an improvement, keep doubling them until it stops improving or you run out of memory or time. The main thing to note about boosted models is that there's nothing to stop us from overfitting. If you add more and more trees to a bagging model, i.e. a random forest, it should generalize better and better, because each tree is based on a different subset of the data; but with boosting, each model fits the training set better and better, gradually overfitting more and more. So boosting methods do generally require hyperparameter tuning and fiddling around: you certainly need regularization with boosting, and they're pretty sensitive to their hyperparameters, which is why they're not normally my first go-to. They do win Kaggle competitions more often than random forests, though; they tend to be good at getting that last little bit of performance. So the last thing I'm going to mention is something super neat which a lot of people don't seem to know exists, something from the entity embeddings paper. In a table from that paper, they built a neural network, got the entity embeddings, and then tried a random forest using the entity embeddings as predictors, rather than the raw categorical variables as in the approach I described. The error for the random forest went from 0.16 to 0.11: a huge improvement, from a very simple method. k-nearest neighbors went from 0.29 to 0.11. Basically all of the methods, when they used entity embeddings, suddenly improved a lot. So one thing you should try, if you have a look at the further research section after the questionnaire: it asks you to actually take those entity
embeddings that we trained in the neural net and use them in the random forest, maybe try ensembling again, and see if you can beat the 0.223 that we had. This is a really nice idea: you get all the benefits of the boosted decision trees plus all of the nice features of entity embeddings, and it's something that not enough people seem to be playing with, for some reason. So, overall: random forests are nice and easy to train; they're very resilient; they don't require much pre-processing; they train quickly; and they don't overfit. They can be a little less accurate, and a bit slow at inference time, because for inference you have to go through every one of those trees. Having said that, a binary tree can be pretty heavily optimized: you can basically create a totally compiled version of a tree, and the trees can certainly be evaluated entirely in parallel, so that's something to consider. Gradient boosting machines are also fast to train on the whole, but a little more fussy about hyperparameters; you have to be careful about overfitting, but they're a bit more accurate. Neural nets are maybe the fussiest to deal with: they've got the fewest rules of thumb or tutorials around saying "this is how to do it", just because they're a bit newer and a little less well understood. But they can give better results in many situations than the other two approaches, or at least, in an ensemble, improve on the other two approaches. So I would always start with a random forest, and then see if you can beat it using these others. So yeah, why don't you now see if you can find a Kaggle competition with tabular data, whether it's running now or a past one, and see if you can repeat this process for that, and see if you can get in the top 10% of the private leaderboard. That would be a really great stretch goal at this point. Also, try writing the decision tree algorithm yourself; I think that's an important one. And from there, create your own random forest from
scratch: you might be surprised, it's not that hard. Then go and have a look at the tabular model source code, and at this point (this is pretty exciting) you should find you pretty much know what all the lines do, with two exceptions. And if you don't, dig around, explore, and experiment, and you can figure it out. With that, we are, I'm very excited to say, at a point where we've really dug all the way into these really valuable, effective fastai applications, and we understand what's going on inside them. What should we expect for next week? Next week we will look at NLP and computer vision, and we'll apply the same kinds of ideas: delve deep to see what's going on. Thanks everybody, see you next week.
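As a postscript to that "create your own random forest from scratch" suggestion, here is one minimal way it can look. This sketch is simplified to a single feature with no min-leaf enforcement at the split; all names and parameters are my own, not from the lesson:

```python
import numpy as np

def best_split(x, y):
    # Try each value of the (single) feature as a threshold; pick the one
    # minimizing the weighted sum of the two sides' variances
    best_score, best_t = np.inf, None
    for t in np.unique(x)[:-1]:
        lhs = x <= t
        score = y[lhs].var() * lhs.sum() + y[~lhs].var() * (~lhs).sum()
        if score < best_score:
            best_score, best_t = score, t
    return best_t

def fit_tree(x, y, min_leaf=5, depth=10):
    if depth == 0 or len(y) <= 2 * min_leaf:
        return y.mean()                           # leaf: predict the mean
    t = best_split(x, y)
    if t is None:
        return y.mean()
    lhs = x <= t
    return (t,
            fit_tree(x[lhs], y[lhs], min_leaf, depth - 1),
            fit_tree(x[~lhs], y[~lhs], min_leaf, depth - 1))

def predict_tree(tree, xi):
    while isinstance(tree, tuple):                # walk down to a leaf
        t, left, right = tree
        tree = left if xi <= t else right
    return tree

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 300)
y = np.sin(x) + 0.1 * rng.normal(size=300)

# Bagging: each tree sees a different bootstrap sample of the data
forest = []
for _ in range(15):
    idx = rng.integers(0, len(x), len(x))
    forest.append(fit_tree(x[idx], y[idx]))

def predict(xi):
    # The forest's prediction is the average of the trees' predictions
    return np.mean([predict_tree(t, xi) for t in forest])

# Compare against the noise-free truth on a grid inside the training range
grid = np.linspace(0.5, 9.5, 50)
mse = np.mean([(predict(xi) - np.sin(xi)) ** 2 for xi in grid])
```

Extending this to multiple features (pick the best column as well as the best threshold) and to random column subsets per split gets you most of the way to a real random forest.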