All right, welcome back. Something to mention: somebody asked a really good question on the forums, which was, how do I deal with version control and notebooks? The question was something like: every time I change a notebook, Jeremy goes and changes it on git, and then I do a git pull and I end up with a conflict, and so on. That happens a lot with notebooks, because behind the scenes notebooks are JSON files, and every time you run a cell, even without changing it, it updates the little number saying which execution this was. So now suddenly there's a change, and trying to merge notebook changes is a nightmare. My suggestion, a simple way to handle it, is this: when you're looking at some notebook, like lesson2-rf_interpretation, and you want to start playing around with it, the first thing I would do is go File, Make a Copy, and then in the copy go File, Rename and give it a name that starts with tmp. That will hide it from git, and now you've got your own version of the notebook that you can play with. If you now do a git pull and the original changed, it won't conflict with yours, and you can see the two different versions. There are different ways of dealing with this Jupyter notebook git problem; everybody has it. One is that there are hooks you can use to remove all of the cell outputs before you commit to git, but in this case I actually want the outputs to be in the repo so you can read it on GitHub. So it's a minor issue, but it's something which catches everybody.

"Before we move on to interpretation of the random forest model, I wonder if we could summarize the relationship between the hyperparameters of the random forest and their effect on overfitting, dealing with collinearity, and so on?"

Yeah, that sounds like a question born from experience. Absolutely. So I'm going to go back to lesson1-rf; if you're ever unsure about where I am, you can always see the path at the top here: courses/ml1, lesson1-rf. In terms of the hyperparameters that are interesting, and I'm ignoring pre-processing here, just the actual hyperparameters: the first one of interest, I would say, is set_rf_samples, which determines how many rows are in each sample, in other words how many rows each tree is created from. Before we start a new tree, we either bootstrap a sample (sampling with replacement from the whole thing), or we pull out a subsample of a smaller number of rows, and then we build a tree from there. So step one is we've got our whole big data set, we grab a few rows at random from it, we turn them into a smaller data set, and from that we build a tree. The size of that sample is what set_rf_samples sets. So let's say the data set originally had a million rows and we said set_rf_samples(20000); then we're going to grow each tree from a sample of 20,000 rows. Assuming the tree remains roughly balanced as we grow it, can somebody tell me how many layers deep this tree would be, if we're growing it until every leaf has size one?

"Log base two of 20,000."

Right, okay. So the depth of the tree doesn't actually vary that much with the number of samples, because it's related to the log of the size. Can somebody tell me, at the very bottom, once we go all the way down, how many leaf nodes there would be?
20,000, right, because every single leaf node has a single thing in it. So we've got a linear relationship between the number of leaf nodes and the size of the sample. When you decrease the sample size, it means there are fewer final decisions that can be made, so the tree is going to be less rich in terms of what it can predict, because it's making fewer individual decisions, and it's also making fewer binary choices to get to those decisions. So setting rf_samples lower is going to mean that you overfit less, but it also means you're going to have a less accurate individual tree. Remember the way Breiman, the inventor of random forests, described this: you're trying to do two things when you build a model with bagging. One is that each individual tree, or as scikit-learn would say, each individual estimator, is as accurate as possible on the training set, so each model is a strong predictive model. But then, across the estimators, the correlation between them should be as low as possible, so that when you average them together you end up with something that generalizes. So by decreasing the set_rf_samples number we are actually decreasing the power of each estimator and decreasing the correlation between them, and whether that results in a better or a worse validation set result for you depends; this is the kind of compromise you have to figure out when you do machine learning models. Can you pass that back there?

"If I set the OOB option to true in the random forest, doesn't that ensure that about 37% of my data won't be in each tree?"

All oob_score=True does is say: whatever your subsample is, and it might be a bootstrap sample or it might be a subsample, take all of the other rows, the ones not used for that tree, put them into a different data set, and calculate the error on those. So it doesn't actually impact training at all; it just gives you an additional metric, the OOB error. If you don't have a validation set, this allows you to get a kind of quasi-validation set for free.

"If I don't call set_rf_samples, what is the default?"

The default, and you get back to it if you call reset_rf_samples, is to bootstrap, so it will sample a new data set as big as the original one, but with replacement. Okay, and obviously the second benefit of set_rf_samples is that you can run more quickly; particularly if you're running on a really large data set, like 100 million rows, it won't be possible to run on the full data set, so you'd either have to pick a subsample yourself before you start or use set_rf_samples.

The second key parameter that we learned about was min_samples_leaf. Before, we assumed min_samples_leaf was equal to one; if I set it equal to two, how deep would the tree be now? Yes, log base two of 20,000, minus one. So each time we double min_samples_leaf, we're removing one layer from the tree. And I'll come back to you again since you're doing so well: how many leaf nodes would there be in that case?
10,000. Okay, so we're again dividing the number of leaf nodes by that number. So the result of increasing min_samples_leaf is that now each of our leaf nodes has more than one thing in it, so we're going to get a more stable average that we're calculating in each tree. We've got a little less depth, we've got fewer decisions to make, and we've got a smaller number of leaf nodes. So again, we would expect the result to be that each estimator is less predictive, but the estimators are also less correlated, so this might help us avoid overfitting. Could you pass the microphone over here, please?

"Hi, Jeremy. I'm not sure that in that case every node will have exactly two."

No, it won't necessarily have exactly two, and thank you for mentioning that. It might try to do a split, but what would be an example of a reason you wouldn't split, even if you had a hundred items in a leaf node? They're identical: they're all the same in terms of the dependent variable. It could be either the independent or the dependent, but much more likely the dependent. If you get to a leaf node where every single one of them has the same auction price, or, in classification, every single one of them is a dog, then there is no split you can do that's going to improve your information. And remember, "information" is the term we use in a fairly general sense in random forests to describe the amount of additional information we get from a split: how much are we improving the model? So you'll often see the phrase "information gain", which means how much better the model got by adding an additional split point, and it could be based on RMSE, or cross-entropy, or how different the standard deviations are, or whatever; it's just a general term.

Okay, so that's the second thing we can do, and again it's going to speed up our training, because it's one less set of decisions to make. Remember, even though it's one less set of decisions, those decisions have as much data again as the previous set, so each layer of the tree can take twice as long as the previous layer. So it can definitely speed up training, and it can definitely make the model generalize better.

The third one that we had was max_features. Who wants to tell me what max_features does? Do you want to pass that back over there? Okay, Vinay.

"Max features determines how many features you're going to use in each tree. In this case it's a fraction, so you're going to use half of the features for each tree."

Nearly right, or kind of right. Can you be more specific, or can somebody else be more specific? It's not exactly for each tree. Chen Xi?

"For each tree, you randomly sample half of the features."

Not quite; it's not for each tree. set_rf_samples picks a subset of rows for each tree, but max_features does something different.

"At each split, it picks a different subset of the features."

Yeah, right. It kind of sounds like a small difference, but it's actually quite a different way of thinking about it.
We do our set_rf_samples, so we pull out our subsample or a bootstrap sample, and that's kept for the whole tree, and we have all of the columns in there. Then, with max_features=0.5, at each split we pick a different half of the features; at the next split we pick a different half, and so on. The reason we do that is that we want the trees to be as rich as possible. Particularly if you were only building a small number of trees, say 10, and you picked the same column set all the way through each tree, you're not really getting much variety in the kinds of things they can find. So this way, at least in theory, we get a better set of trees by picking a different random subset of features at every decision point.

The overall effect of max_features is the same kind of trade-off: each individual tree is probably going to be less accurate, but the trees are going to be more varied. In particular, this can be critical, because imagine you've got one feature that's just super predictive, so predictive that every random subsample you look at always starts out by splitting on that same feature. Then the trees are going to be very similar, in the sense that they all have the same initial split, but there may be other interesting initial splits, because they create different interactions of variables. So if half the time that feature isn't even available at the top of the tree, at least half the trees are going to have a different initial split. It definitely can give us more variation, and therefore it can help us create more general trees that have less correlation with each other, even though the individual trees probably won't be as predictive.

In practice, there's a little picture of this in the scikit-learn docs: as you add more trees, if you have max_features=None, which uses all the features every time, then with very few trees that can still give you a pretty good error, but as you create more trees it doesn't help as much, because they're all pretty similar; they're all trying every single variable. Whereas if you say max_features='sqrt' or max_features='log2', then as we add more estimators we keep seeing improvements. So there's an interesting interaction between those two; that cool little chart is from the sklearn docs.

Then there are things which don't impact our training at all. n_jobs simply says how many cores we run on, so it makes things faster, up to a point; generally speaking, going beyond eight or so cores may have diminishing returns, and minus one says use all of your cores. I don't know why the default is to use only one core; that seems weird to me, and you'll definitely get more performance by using more cores, because all of you have computers with more than one core nowadays. And then oob_score=True simply allows us to see the OOB score; if you don't say it, the score doesn't get calculated. And particularly if you set set_rf_samples pretty small compared to a big data set, the OOB score is going to take forever to calculate.
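To put the parameters we've just walked through in one place, here's a minimal sketch of how they look in code. It assumes the fastai 0.7-era structured module this course uses (for set_rf_samples) and that X_train and y_train are your already-processed training data:

```python
from sklearn.ensemble import RandomForestRegressor
from fastai.structured import set_rf_samples, reset_rf_samples  # old fastai helpers

set_rf_samples(20_000)            # grow each tree from a random subsample of 20,000 rows

m = RandomForestRegressor(
    n_estimators=40,              # number of trees
    min_samples_leaf=3,           # don't allow leaves with fewer than 3 rows
    max_features=0.5,             # consider a random half of the columns at each split
    n_jobs=-1,                    # use every CPU core
    oob_score=True)               # report the out-of-bag score as a free quasi-validation metric
m.fit(X_train, y_train)
print(m.oob_score_)

reset_rf_samples()                # back to bootstrapping samples as large as the full data set
```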
Hopefully at some point we'll be able to fix the library so that doesn't happen; there's no reason it needs to be that way, but right now that's how the library works.

Okay, so those are the key basic parameters we can change. There are more that you can see in the docs, or shift-tab to have a look at them, but the ones you've seen are the ones I've found useful to play with, so feel free to play with others as well. Generally speaking, for max_features, values like None (which means use all of them), 0.5, 'sqrt', or 'log2' seem to work pretty well. And for min_samples_leaf I would generally try something like 1, 3, 5, 10, 25, 100; as you start doing that, if you notice that by the time you get to 10 it's already getting worse, there's no point going further, and if you get to 100 and it's still getting better, then you can keep trying. But those are the general ranges that most things seem to sit in.

All right, so: random forest interpretation. This is something you could use to create some really cool Kaggle kernels. Now, obviously one issue is that the fastai library is not available in Kaggle kernels, but if you look inside fastai.structured, and remember you can just use a double question mark to look at the source code for something, or go into the editor to have a look at it, you'll see that most of the methods we're using are a small number of lines of code and have no dependencies on anything. So if you need one of those functions, you can just copy it into your kernel, and if you do, say that it's from the fastai library; you can link to it on GitHub because it's available as open source. But you don't need to import the whole thing. So this is a cool trick: because you're among the first people to learn how to use these tools, you can start to show things that other people haven't seen. For example, this confidence based on tree variance is something which doesn't exist anywhere else. Feature importance definitely does exist, and it's already in quite a lot of Kaggle kernels, but if you're looking at a competition or a data set where nobody's done feature importance, being the first person to do it is always going to win lots of votes, because the most important question is: which features are important?

So, last time: let's just make sure we've got our data. We need to change this to add one extra thing, and that's going to load in our data split; there's our data, okay. As I mentioned, when I do model interpretation I tend to set rf_samples to some subset small enough that I can run a model in under ten seconds or so, because there's just no point running a super accurate model; 50,000 is more than enough. You'll basically see that each time you run an interpretation you get the same results back, and as long as that's true, you're already using enough data.
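Before moving on, here's a minimal sketch of the kind of min_samples_leaf sweep suggested above; the variable names (X_train, X_valid, y_train, y_valid) are assumptions for data that has already been split:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Try a few leaf sizes and watch when the validation RMSE stops improving.
for leaf in (1, 3, 5, 10, 25, 100):
    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=leaf,
                              max_features=0.5, n_jobs=-1)
    m.fit(X_train, y_train)
    print(leaf, rmse(y_valid, m.predict(X_valid)))
```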
Okay. So, feature importance: we learned that it works by randomly shuffling a column, each column one at a time, and then seeing how accurate the model, the pre-trained model we've already built, is when you pass in all the data as before but with that one column shuffled. Some of the questions I got after class reminded me that it's very easy to under-appreciate how powerful, and kind of magic, this approach is, so to explain, I'll mention a couple of the questions I heard.

One question was: why don't we just take one column at a time and build a tree on just that one column? We've got our data set with a bunch of columns, so why not just grab each column, build a tree from it, and see which column's tree is the most predictive? Can anybody tell me why that may give misleading results about feature importance? Karen?

"We would lose the interactions between the features."

Yeah. And if we just shuffle them instead, the randomness lets us capture both the interactions and the importance of the feature at the same time. This issue of interactions is not a minor detail; it's massively important. Think about this bulldozers data set, where there's one field called YearMade and one field called saledate. If we think about it, it's pretty obvious that what matters is the combination of these two, in other words: how old was the piece of equipment when it got sold? So if we only included one of these, we're going to massively underestimate how important that feature is.

Now, here's a really important point, though. It's pretty much always possible to create a simple logistic regression which is as good as pretty much any random forest, if you know ahead of time exactly what variables you need, exactly how they interact, exactly how they need to be transformed, and so forth. In this case, for example, we could have created a new field equal to saleYear minus YearMade, fed that to a model, and got that interaction for free. But the point is, we never know that. You might have a guess: I think these things interact in this way, and I think this thing needs a log transform, and so forth. But the truth is that the way the world works, the causal structures, involve many, many things interacting in many, many subtle ways. And that's why using trees, whether gradient boosting machines or random forests, works so well.
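To make the shuffling idea concrete, here's a minimal sketch of permutation-based importance. This is not the fastai or sklearn implementation, just the idea, assuming m is an already-fitted model and X_valid, y_valid are a validation set:

```python
import numpy as np
from sklearn.metrics import r2_score

def permutation_importance(m, X_valid, y_valid):
    baseline = r2_score(y_valid, m.predict(X_valid))
    importances = {}
    for col in X_valid.columns:
        saved = X_valid[col].copy()
        X_valid[col] = np.random.permutation(X_valid[col].values)  # shuffle just this column
        importances[col] = baseline - r2_score(y_valid, m.predict(X_valid))
        X_valid[col] = saved                                       # put the column back
    return importances
```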
Can you pass that to Terence, please?

"One thing that bit me years ago was trying exactly that: one variable at a time, thinking I'd figure out which one is most correlated with the dependent variable. What that doesn't pull apart is: what if all the variables are basically copies of the same variable? Then they're all going to seem equally important, when in fact it's really just one factor."

Yeah, and that's also true here. If we had a column that appeared twice, then shuffling that column isn't going to make the model much worse. Think about how the forest was built: particularly if we had max_features=0.5, some of the time we're going to get version A of the column and some of the time version B. So half the time, shuffling version A of the column makes a tree a bit worse, and half the time shuffling version B makes it a bit worse, and so it'll show that both of those features are somewhat important, sharing the importance between the two of them. That's why I wrote "collinearity" here; collinearity literally means that they're linearly related, so the word isn't quite right, but this is why having two or more variables that are closely related to each other means you will often underestimate their importance using this random forest technique.

Yes, Terence?

"Once we've shuffled and we get a new model, what exactly are the units of these importances? Is it a change in the R squared?"

It depends on the library we're using. The units are something I never really think about; I just know that in this particular library, 0.005 is often the kind of cutoff I would tend to use. All I actually care about is this picture: the feature importance, ordered, for each variable, and then zooming in and turning it into a bar plot. Here they're all pretty flat, and that's about 0.005, so I remove them at that point and just check that the validation score didn't get worse; if it did get worse, I'll decrease the threshold a little until it doesn't. So the units of measure don't matter too much, and we'll learn later about a second way of doing variable importance. By the way, can you pass that over there?

"Is one of the goals here to remove variables because your score won't get worse without them, so you might as well get rid of them?"

Yeah, and that's what we're going to do next. So, having looked at our feature importance plot, we said, okay, it looks like the ones less than 0.005 are this long tail of boringness, so let's try removing them. I just grab the columns where the importance is greater than 0.005, create a new data frame called df_keep, which is df_trn with just those kept columns, create a new training and validation set with just those columns, create a new random forest, and look at how the training set score and the validation RMSE changed; I found they got a tiny bit better. If they're about the same, or a tiny bit better, then my thinking is: this is just as good a model, but it's now simpler. And now, when I redo the feature importance, there's less collinearity. In this case I saw that YearMade went from being quite a bit better than the next best thing, which was Coupler_System, to way better than the next best thing, and Coupler_System went from being quite a bit more important than the next two to being about equally important. So it did seem to definitely change these feature importances, and hopefully gives me some more insight.
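A minimal sketch of that step, assuming fi is a feature-importance table with 'cols' and 'imp' columns (roughly what fastai's rf_feat_importance returns) and split_vals is whatever helper you use to split rows into training and validation sets:

```python
from sklearn.ensemble import RandomForestRegressor

to_keep = fi[fi.imp > 0.005].cols              # drop the long tail of unimportant features
df_keep = df_trn[to_keep].copy()

X_train, X_valid = split_vals(df_keep, n_trn)  # hypothetical helper: first n_trn rows are training
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                          max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)                        # then check the validation RMSE didn't get worse
```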
So how does that help our model in general? What does it mean that YearMade is now way ahead of the others?

Yeah, so we're going to dig into that kind of thing now, but basically it tells us where to focus. For example: how are we dealing with missing values in that column? Is there noise in the data? If it's a high-cardinality categorical variable, there are all sorts of different steps we might take. For example, fiProductClassDesc, one of the ones we looked at the other day, was, I remember, first the type of vehicle, then a hyphen, then the size of the vehicle. We might look at that and say, okay, that was an important column; let's try splitting it into two on the hyphen, take the bit which is the size, and try to parse it and convert it into an integer. We can try to do some feature engineering, and basically, until you know which columns are important, you don't know where to focus that feature engineering time.

You can also talk to your client, or, if you're doing this inside your workplace, go and talk to the folks who were responsible for creating the data. If you were actually working at a bulldozer auction company, you might now go to the actual auctioneers and say: I'm really surprised that Coupler_System seems to be driving people's pricing decisions so much; why do you think that might be? And they can say to you: oh, it's actually because only these classes of vehicles have coupler systems, or only this manufacturer has coupler systems, so frankly this is not telling you about coupler systems but about something else. And, oh hey, that reminds me, that something else is something we actually measure; it's in this different CSV file, I'll go get it for you. So it kind of helps you focus your attention.
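For instance, a minimal sketch of that kind of feature engineering on a column like fiProductClassDesc; the separator, string format, and new column names here are guesses, so inspect the real values first:

```python
# Split something like "Wheel Loader - 110.0 to 120.0 Horsepower" into two new columns.
parts = df_raw.fiProductClassDesc.str.split(' - ', n=1, expand=True)
df_raw['ProdClassType'] = parts[0]    # hypothetical: the kind of vehicle
df_raw['ProdClassSpec'] = parts[1]    # hypothetical: the size/spec part of the string

# Pull the first number out of the spec so it can be used as a numeric feature.
df_raw['ProdClassSize'] = (df_raw.ProdClassSpec
                           .str.extract(r'([\d.]+)', expand=False)
                           .astype(float))
```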
So I had a fun little problem this weekend. As you know, I introduced a couple of crazy computations into my random forest, and all of a sudden they were the most important variables ever, squashing all of the others, and then I got a terrible score. Now that I think I have my scores computed correctly, what I noticed is that the importance went through the roof, but the validation score stayed bad, or got worse. Is that because somehow that computation allowed the training set to map, almost like an identifier, exactly what the answer was going to be for training, but of course that doesn't generalize to the validation set? Is that what I observed?

Okay, so there are two reasons why your validation score might not be very good. Let's go up here. We get these five numbers: the RMSE of the training set, the RMSE of the validation set, the R squared of the training set, the R squared of the validation set, and the R squared of the OOB sample. And really, in the end, what we care about, for this Kaggle competition at least, is the RMSE of the validation set, assuming we've created a good validation set. In Terence's case, he's saying that the number he cares about got worse when he did some feature engineering. Why is that?

There are two possible reasons. Reason one is that you're overfitting. If you're overfitting, your OOB score will also get worse. If you're working with a huge data set and a small set_rf_samples, so you can't use an OOB score, then instead create a second validation set which is a random sample, and use that. In other words, if your OOB or your random-sample validation score has got much worse, then you must be overfitting. I think in your case, Terence, it's unlikely that's the problem, because random forests don't overfit that badly; it's very hard to get them to overfit that badly unless you use some really weird parameters, like only one estimator, for example. Once you've got ten trees in there, there's enough variation that you can definitely overfit, but not so much that you're going to destroy your validation score by adding one variable. So I think you'll find that's probably not the case, but it's easy to check, and if it's not the case, you'll see that your OOB score, or your random-sample validation score, hasn't got worse.

The second reason your validation score can get worse, if your OOB score hasn't got worse (so you're not overfitting) but your validation score has, is that you're doing something that is true in the training set but not true in the validation set. This can only happen when your validation set is not a random sample. For example, in this bulldozers competition, or in the grocery shopping competition, we've intentionally made a validation set that covers a different date range: the most recent two weeks. So if something different happened in the last two weeks compared to the previous weeks, you could totally break your validation set. For example, if there was some kind of unique identifier which is different in the two date periods, then you could learn to identify things using that identifier in the training set, but the last two weeks may have a totally different set of IDs with different behavior, and it could get a lot worse. What you're describing is not common, though, so I'm a bit skeptical; it might be a bug. But hopefully there's enough here that you can now use to figure out whether it is a bug, and we'll be interested to hear what you learn.

Okay, so that's feature importance. I'd like to compare that to how feature importance is normally done in industry, and in academic communities outside of machine learning, like psychology and economics and so forth. Generally speaking, people in those kinds of environments tend to use some kind of linear regression, logistic regression, or general linear model. They start with their data set and they say: I'm going to assume that I know the parametric relationship between my independent variables and my dependent variable. So I'm going to assume it's a linear relationship, say, or a linear relationship with a link function like a sigmoid, as in logistic regression. Assuming I already know that, I can write this as an equation: if I've got x1, x2 and so forth, I can say my y values are equal to a*x1 + b*x2, and therefore I can find the feature importance easily enough by just looking at the coefficients and asking which one is the highest, particularly if you've normalized the data first.
There's this kind of trope out there, and it's very common, that this is somehow a more accurate, or more pure, or in some way better way of doing feature importance. But that couldn't be further from the truth. Think about it: if you were missing an interaction, or missing a transformation you needed, or if you've in any way been anything less than a hundred percent perfect in all of your preprocessing, so that your model is not the absolutely correct description of the situation, then your coefficients are wrong. Your coefficients are telling you "in your totally wrong model, this is how important those things are", which is basically meaningless. Whereas the random forest feature importance is telling you "in this extremely flexible functional form, with few if any statistical assumptions, this is your feature importance."

So I would be very cautious. And again, I can't stress this enough: when you leave MSAN, when you leave this program, you are much more often going to see people talk about logistic regression coefficients than about random forest variable importance, and every time you see that happen you should be very, very skeptical of what you're seeing. Any time you read a paper in economics or psychology, or the marketing department tells you they did a regression or whatever, every single time those coefficients are going to be massively biased by any issues in the model. Furthermore, if they've done so much preprocessing that the model actually is pretty accurate, then you're now looking at coefficients of something like a principal component from a PCA, or a distance from some cluster, at which point they're very, very hard to interpret anyway; they're not actual variables. So those are the two options I've seen when people try to use classic statistical techniques to do a variable importance equivalent. I think things are starting to change, slowly; there are some fields that are starting to realize this is totally the wrong way to do things. But it's been nearly 20 years since random forests appeared, so it takes a long time. People say that the only way knowledge really advances is when the previous generation dies, and that's kind of true, particularly with academics: they make a career of being good at a particular sub-thing, and often it's not until the next generation comes along that people notice that, oh, that's actually no longer a good way to do things. I think that's what's happened here.

Okay. So we've now got a model which isn't really any better, predictive-accuracy-wise, but we're getting a good sense that there seem to be four main important things: when it was made, the coupler system, its size, and its product classification. Okay, so that's cool. There is something else we can do, however, which is called one-hot encoding, and this is where we talk about categorical variables.
So remember a categorical variable. Let's say we had a string like "High", and remember, the order we got was kind of weird: High, Low, Medium, in alphabetical order by default. That was our original category for something like UsageBand, and we mapped it to 0, 1, 2. So by the time it gets into our data frame it's just a number, and the random forest doesn't know it was originally a category. So when the random forest is built, it basically asks: is it greater than one or not, or is it greater than zero or not; those are basically the possible decisions it could make.

Now, for something with five or six levels, it could be that just one of the levels of the category is actually interesting. Say the levels were very high, very low, high, low, medium, and unknown, so we've got six levels, and maybe the only thing that mattered was whether it was unknown; maybe not knowing its size somehow impacts the price. If we wanted the model to be able to recognize that, and it just so happened that the way the numbers were coded put unknown in the middle, then what the tree is going to do is say: okay, there's a difference between these two groups, less than or equal to two versus greater than two, and then when it gets into that leaf it says: oh, there's a difference between less than four and greater than or equal to four. So it takes two splits to get to the point where we can see that it's actually "unknown" that matters. That's a little inefficient, and we're wasting tree computation, and wasting tree computation matters, because every time we do a split we're at least halving the amount of data we have left for further analysis. So it's going to make our tree less rich, less effective, if we're not giving it the data in a form that's convenient for the work it needs to do.

What we could do instead is create six columns: is_very_high, is_very_low, is_high, is_unknown, is_low, is_medium, and each one would be ones and zeros. Having added those six columns to our data set, the random forest now has the ability to pick one of them and say: let's have a look at is_unknown; there's one possible split I can do, one versus zero, let's see if that's any good. So it now has the ability, in a single step, to pull out a single category level. This kind of coding is called one-hot encoding, and for many, many types of machine learning model something like this is necessary; if you're doing logistic regression, you can't possibly put in a categorical variable coded zero through five, because there's obviously no linear relationship between that and anything. A lot of people incorrectly assume that all machine learning requires one-hot encoding, but in this case I'm going to show you how we can use it optionally and see whether it improves things.

"Hi, Jeremy. If we have six categories, like in this case, would there be any problem with adding a column for each of the categories? Because in linear regression we said that with six categories we should only do it for five of them."
Yeah, so you certainly can say: let's not worry about adding is_medium, because we can infer it from the other five. I would say include it anyway, because otherwise the random forest would have to say: is_very_high? No. is_very_low? No. is_high? No. is_unknown? No. is_low? No. Okay, finally, I'm there. That's five decisions to get to that point. The reason in linear models that you need to leave one out is that linear models hate collinearity, but we don't care about that here.

So we can do one-hot encoding easily enough, and the way we do it is to pass one extra parameter to proc_df: what's the maximum number of categories? If we say it's seven, then anything with fewer than seven levels is going to be turned into a one-hot encoded bunch of columns; so in this case, something with six levels would be one-hot encoded, whereas zip code has more than six levels and so would be left as a number. And generally speaking, you obviously wouldn't want to one-hot encode zip code, because that's just going to create masses of data, memory problems, computation problems, and so forth. So this is another parameter you can play around with.

So I do that, try it out, run the random forest as per usual, and you can see what happens to the R squared of the validation set and to the RMSE of the validation set; in this case, I found it got a little bit worse. That isn't always the case, and it's going to depend on your data set: do you have a data set where single categories tend to be quite important, or not? In this particular case it didn't make the model more predictive. However, what it did do is give us different features: proc_df names them with the variable name, then an underscore, then the level name. And interestingly, it turns out that whereas before it said Enclosure was somewhat important, when we do it one-hot encoded it actually says Enclosure_EROPS w AC is the most important thing. So, at least for the purpose of interpreting your model, you should always try one-hot encoding quite a few of your variables. I often find somewhere around six or seven is a pretty good threshold; you can try making that number as high as you can, so long as it doesn't take forever to compute and the feature importance isn't cluttered with lots of really tiny levels that aren't interesting. That's up to you to play around with. But in this case I found this very interesting: it clearly tells me I need to find out what "EROPS w AC" is and why it's important, because it means nothing to me, and yet it's the most important thing. So I should go figure that out.

Savannah had a question.

"Can you explain how changing the maximum number of categories works? Because to me it just seems like there are five or six categories."

Oh yeah, sorry. All it's doing is saying: okay, here's a column called zip code, here's a column called usage band, and here's a column called sex, say.
Zip code has, whatever, 5,000 levels; the number of levels in a category we call its cardinality, so it has a cardinality of 5,000. Usage band maybe has a cardinality of six, and sex maybe has a cardinality of two. So when proc_df goes through and says, okay, this is a categorical variable, should I one-hot encode it, it checks the cardinality against max_n_cat: 5,000 is bigger than seven, so don't one-hot encode it; usage band, six is less than seven, so do one-hot encode it; sex, two is less than seven, one-hot encode it. So it just decides, for each variable, whether to one-hot encode it or not.

"In proc_df, are we keeping both the label encoding and the one-hot encoding?"

No. Once we decide to one-hot encode a column, it does not keep the original variable.

"But wouldn't the best split possibly be an interval, in which case we'd need the label encoding?"

Well, you don't need the label encoding for that: if the best split is an interval, the tree can approximate it with multiple one-hot encoded levels. The truth is that each column could get its own decision, should it be label encoded or not, which you could make on a case-by-case basis. I find in practice it's just not that sensitive, so using a single number for the whole data set gives me what I need. But if you were building a model that really had to be as good as possible, and you had lots and lots of time to do it, you could go through manually, not using proc_df, and decide which things to turn into dummies or not.

You'll see, if you look at the code for proc_df, and I never want you to feel like the code that happens to be in the fastai library is the code you're limited to, that max_n_cat gets passed to numericalize, and numericalize simply checks: is it a numeric type, and has the number of categories either not been passed to us at all, or do we have more unique values than max_n_cat? If so, we use the categorical codes. For any column where that check means it's skipped over, so it remains a category, then at the very end we just call pd.get_dummies, passing in the whole data frame; pd.get_dummies checks for anything that's still a categorical variable and turns it into dummy variables, which is another way of saying a one-hot encoding. So with that kind of approach, you can easily override it and do your own dummy-variable creation.
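Here's a minimal sketch of that same idea, one-hot encoding only the low-cardinality categoricals, written directly with pandas rather than proc_df; it assumes the string columns have already been converted to pandas category dtype (as train_cats does in the course):

```python
import pandas as pd

def one_hot_low_cardinality(df, max_n_cat=7):
    """Label-encode high-cardinality categoricals, one-hot encode the rest."""
    df = df.copy()
    for col in df.select_dtypes('category').columns:
        if df[col].cat.categories.size > max_n_cat:
            df[col] = df[col].cat.codes      # keep high-cardinality columns as integer codes
    return pd.get_dummies(df)                # one-hot encode whatever is still categorical
```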
"Some data has a quite obvious order, like if you have a rating system: good, bad, poor, things like that. There's an order to that, and destroying that order by doing the dummy variable thing probably won't work in your favor. So is there a way to force it to leave one variable alone, or do you just convert it beforehand yourself?"

Not in the library, no. And to remind you, unless we explicitly do something about it, we're not going to get that order anyway. When we import the data, and this is in lesson1-rf, we showed how, by default, the categories are ordered alphabetically, and we have the ability to order them properly. So yes, if you've actually made an effort to turn your ordinal variables into proper ordinals, using proc_df with max_n_cat can destroy that. The simple way to avoid it is: if we know we always want to use the codes for UsageBand, rather than ever one-hot encoding it, you could just go ahead and replace it beforehand, just say df.UsageBand = df.UsageBand.cat.codes, and it's now an integer, so it'll never get changed.

All right. So we've already seen how variables which are basically measuring the same thing can confuse our variable importance, and they can also make our random forests slightly less good, because it requires more computation to do the same thing and there are more columns to check. So I've got to do some more work to try to remove redundant features, and the way I do that is with something called a dendrogram. It's a kind of hierarchical clustering.

Cluster analysis is something where you're trying to look at objects, which can be either rows in a data set or columns, and find which ones are similar to each other. Often when you see people talking about cluster analysis, they're referring to rows of data, and they'll say, let's plot it: look, there's a cluster, and there's a cluster. A common type of cluster analysis, and time permitting we may get around to talking about this in some detail, is called k-means, which is basically where you assume you don't have any labels at all, you take a couple of data points at random, you gradually find the points that are near them and move the centroids closer and closer, and you repeat that again and again. It's an iterative approach where you tell it how many clusters you want, and it tells you where it thinks the clusters are.

A really underused technique, and I don't know why, because 20 or 30 years ago it was much more popular than it is today, is hierarchical clustering, also known as agglomerative clustering. In hierarchical or agglomerative clustering, we look at every pair of objects and ask: which two objects are the closest? So we might say those two objects are the closest, and so we delete them and replace them with the midpoint of the two; then, okay, here are the next two closest, delete them and replace them with their midpoint; and keep doing that again and again. Since we're removing points and replacing them with their averages, we gradually reduce the number of points by pairwise combining. And the cool thing is that you can plot it. So if, rather than looking at rows, you look at variables, we can ask: which two variables are the most similar? And it says, okay, saleYear and saleElapsed are very similar. The horizontal axis here is how similar the two points being compared are, so if they're joined further to the right, it means they're more similar. So saleYear and saleElapsed have been combined, and they were very similar. Similar on what measure, you might ask? It could be a correlation coefficient or
something like that; you get to tell it, and in this particular case what I actually used was Spearman's R. You're familiar with correlation coefficients already: correlation is almost exactly the same as the R squared, but it's between two variables rather than between a variable and its prediction. The problem with a normal correlation is that it assumes linearity. If you have data that looks roughly like a line, you can do a correlation and get a good result, but if you've got data that's curved and you do a correlation, which assumes linearity, that's not very good. So there's a thing called rank correlation, and it's a really simple idea: replace every point by its rank. On the x-axis we say, okay, this is the smallest, call it one; there's the next one, two; the next, three; four; five. You just replace every value by its rank, and then you do the same for the y-axis. Then you do a new plot where you don't plot the data, you plot the rank of the data, and if you think about it, the rank of that data set is going to look like an exact line, because every time something was greater on the x-axis, it was also greater on the y-axis. If we do a correlation on the ranks, that's called a rank correlation.

Because I want to find the columns that are similar in a way the random forest would find them similar, and random forests don't care about linearity, they just care about ordering, a rank correlation is the right way to think about that. Spearman's R is the name of the most common rank correlation, but you can literally replace the data with its rank and throw it at the regular correlation function and you'll get basically the same answer; the only difference is how ties are handled, which is a pretty minor issue.

"But if you had a full parabola, the rank correlation won't work, right?"

Right; it has to be monotonic.

Okay, so once I've got a correlation matrix, there are basically a couple of standard steps to turn it into a dendrogram, which I have to look up on Stack Overflow each time I do it: you turn it into a distance matrix, and then you create the linkage, the thing that tells you which items are connected to which other items hierarchically. Those are the standard steps you always do to create a dendrogram, and then you can plot it.

So: saleYear and saleElapsed are measuring basically the same thing, at least in terms of rank, which is not surprising, because saleElapsed is the number of days since the first date in my data set, so obviously those two are nearly entirely correlated, with some ties. Grouser_Tracks, Hydraulics_Flow, and Coupler_System all seem to be measuring the same thing, and this is interesting, because remember Coupler_System was said to be super important; this rather supports our hypothesis that it's nothing to do with whether it has a coupler system, but rather with whatever kind of vehicle it is that has these kinds of features. ProductGroup and ProductGroupDesc seem to be measuring the same thing, and fiBaseModel and fiModelDesc seem to be measuring the same thing.
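The standard steps I just mentioned look roughly like this in code; a sketch using scipy, assuming df_keep is the data frame of kept columns:

```python
import numpy as np
import scipy.stats
from scipy.cluster import hierarchy as hc
from scipy.spatial.distance import squareform
import matplotlib.pyplot as plt

# Spearman rank correlation between every pair of columns in df_keep.
corr = np.round(scipy.stats.spearmanr(df_keep).correlation, 4)

# Turn the similarity matrix into a condensed distance matrix, then build the linkage.
dist = squareform(1 - corr)
z = hc.linkage(dist, method='average')

plt.figure(figsize=(14, 8))
hc.dendrogram(z, labels=df_keep.columns, orientation='left', leaf_font_size=14)
plt.show()
```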
Once we get past those, everything else is suddenly much further away, so I'm probably not going to worry about them; we're going to look into these four groups of variables that are very similar. Could you pass that over there?

"In that graph, is it important that the similarity between Stick_Length and Enclosure is higher than between Stick_Length and anything above it?"

Yeah, pretty much. It's a little hard to interpret, but given that Stick_Length and Enclosure don't join up until way over here, it would strongly suggest they're a long way away from each other; otherwise you would expect them to have joined up earlier. It is possible to construct a synthetic data set where you end up joining things that were close to each other through different paths, so you've got to be a bit careful, but I think it's fair to assume that Stick_Length and Enclosure are probably very different.

"So they are very different, but would they be more similar than, for example, Stick_Length and saleDayofyear, which is at the very top?"

No, there's nothing to suggest that here, because the key point is to notice where they sit in this tree, and they sit in totally different halves of the tree. But really, to actually know, the best way would be to look at the Spearman's R correlation matrix itself; if you just want to know how similar one thing is to another, the correlation matrix tells you that. Can you pass that over there?

"Are we passing in the data frame, or are we passing the model to it?"

This is just a data frame; we're passing in df_keep, the data frame containing the thirty or so features that our random forest thought were interesting. There's no random forest being used here; the distance measure is done entirely on rank correlation.

So what I then do is take these groups, and I create a little function I call get_oob, which fits a random forest for some data frame: I make sure I've taken that data frame and split it into a training and validation set, then I call fit and return the OOB score. Basically, what I'm going to do is try removing each one of these eight or nine variables, one at a time, and see which ones I can remove without the OOB score getting worse. (Each time I run this I get slightly different results, so actually it looks like last time I had seven things, not eight.) You can see I just loop through each of the things that I'm thinking maybe I could get rid of because it's redundant, and I print out the column name and the OOB score of a model trained after dropping that one column. The OOB score on my whole data frame is 0.89, and after dropping each one of these things, basically none of them get much worse. Dropping saleElapsed hurts quite a bit more than dropping saleYear does, but pretty much everything else I can drop with only a third-decimal-place change. Obviously, though, you've got to remember the dendrogram: take fiModelDesc and fiBaseModel, which are very similar to each other. What this says isn't that I can get rid of both of them; I can get rid of one of them, because they're basically measuring the same thing.
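A minimal sketch of that loop; get_oob is roughly this, where split_vals, n_trn, and y_train are assumed helpers and variables following the notebook's naming, and the candidate columns are whichever ones your own dendrogram suggests:

```python
from sklearn.ensemble import RandomForestRegressor

def get_oob(df):
    m = RandomForestRegressor(n_estimators=30, min_samples_leaf=5,
                              max_features=0.6, n_jobs=-1, oob_score=True)
    X, _ = split_vals(df, n_trn)   # hypothetical helper: first n_trn rows are the training set
    m.fit(X, y_train)
    return m.oob_score_

print('baseline', get_oob(df_keep))
for c in ('saleYear', 'saleElapsed', 'fiModelDesc', 'fiBaseModel',
          'Grouser_Tracks', 'Coupler_System'):
    print(c, get_oob(df_keep.drop(c, axis=1)))
```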
So then I try it: okay, let's try getting rid of one from each group, say saleYear, fiBaseModel, and Grouser_Tracks. And now let's have a look: I've gone from 0.890 to 0.888, which is again so close as to be meaningless. So that sounds good, and simpler is better. So I'm now going to drop those columns from my data frame, and then I can try running the full model again; reset_rf_samples means I'm using my whole bootstrap sample, I use 40 estimators, and I get 0.907. So I've now got a model which is smaller and simpler, and I'm getting a good score for it.

At this point I've got rid of as many columns as I feel I comfortably can: ones that either didn't have a good feature importance, or were highly related to other variables and where the model didn't get significantly worse when I removed them. So now I'm at the point where I want to really understand my data better by taking advantage of the model, and we're going to use something called partial dependence. Again, this is something you could use in a Kaggle kernel, and lots of people are going to appreciate it, because almost nobody knows about partial dependence, and it's a very, very powerful technique. What we're going to do is find out, for the features that are important, how they relate to the dependent variable.

So let's have a look. Again, since we're doing interpretation, we'll set set_rf_samples to 50,000 to run things quickly. We'll take our data frame and get our feature importance, and notice that we're using max_n_cat, because for interpretation I'm actually pretty interested in seeing the individual levels. Here's the top ten, so let's try to learn more about that top ten. YearMade is the second most important, so one obvious thing we could do is plot YearMade against saleElapsed, because, as we've talked about already, it seems very likely that they combine to say how old the product was when it was sold; they're both important, but probably what matters is the combination. So we plot YearMade against saleElapsed to see how they relate to each other, and when we do, we get this very ugly graph. It shows us that YearMade actually has a whole bunch of values that are 1000. Clearly this is where I would tend to go back to the client, or whoever, and say: I'm guessing that these bulldozers weren't actually made in the year 1000. And they would presumably say to me: oh yes, they're ones where we don't know when they were made; maybe before 1986 we didn't track that, or maybe the things sold in Illinois don't have that data provided, or whatever. They'll tell us some reason.

In order to understand this plot better, I'm just going to remove those rows from this interpretation section of the analysis: I'm just going to grab the things where YearMade is greater than 1930. So let's now look at the relationship between YearMade and SalePrice. There's a really great package called ggplot. ggplot was originally an R package; "gg" stands for the grammar of graphics, and the grammar of graphics is a very powerful way of thinking about how to produce charts in a very flexible way. I'm not going to be talking about it much in this class.
There's lots of information available online, but I definitely recommend it as a great package to use. You can pip install ggplot, and it's part of the fastai environment already. ggplot in Python has basically the same parameters and API as the R version; the R version is much better documented, so you should read its documentation to learn how to use it. Basically you say: I want to create a plot of this data frame.

Now, when you create plots, most of the data sets you're using are going to be too big to plot. If you do a scatter plot, it'll create so many dots that it's just a big mess, and it'll take forever. Remember, when you're plotting things you're just looking at them, so there's no point plotting something with a hundred million samples when, if you only used a hundred thousand samples, it would be pixel identical. That's why I call get_sample first; get_sample just grabs a random sample. So I'm just going to grab 500 points from my data frame for now.

I plot YearMade against SalePrice; aes stands for aesthetic, and this is the basic way you set up your columns in ggplot. So this says to plot these columns from this data frame, and then there's this slightly weird thing in ggplot where plus basically means add chart elements. So I'm going to add a smoother. Very often you'll find that a scatter plot is hard to read because there's too much randomness, whereas a smoother basically fits a little linear regression for every little subset of the graph, joins them up, and lets you see a nice smooth curve. This is the main way I tend to look at univariate relationships, and by adding se=True it also shows me the confidence interval of the smoother. loess stands for locally weighted regression, which is exactly this idea of doing lots of little mini linear regressions.

So we can see here that the relationship between YearMade and sale price is kind of all over the place, which is not really what I would expect. I would have expected that stuff made more recently would be more expensive, because of inflation and because they're more current models, and so forth. The problem is that when you look at a univariate relationship like this, there's a whole lot of collinearity going on, a whole lot of interactions that are being lost. For example, why did the price drop here? Is it actually because things made between 1991 and 1997 are less valuable, or is it because most of them were also sold during that period and there was maybe a recession then? Or maybe during that period a lot more people were buying types of vehicle that were less expensive. There are all kinds of possible reasons. As data scientists, one of the things you're going to keep seeing is that at the companies you join, people will come to you with these kinds of univariate charts, where they'll say, oh my god, our sales in Chicago have disappeared.
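(The plotting call being described, sketched out with the Python port of ggplot. get_sample is assumed to be the fastai helper that returns n random rows; a plain df.sample(500) would do much the same thing, and the exact argument names may differ across ggplot versions.)

```python
from ggplot import ggplot, aes, stat_smooth

# Grab a random subsample first: 500 points is plenty to see the shape.
x = get_sample(df_raw[df_raw.YearMade > 1930], 500)

# aes picks the columns; '+' adds chart elements; the loess smoother with
# se=True overlays a locally weighted regression and its confidence band.
ggplot(x, aes('YearMade', 'SalePrice')) + stat_smooth(se=True, method='loess')
```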
They've gotten really bad, or people aren't clicking on this ad anymore, and they'll show you a chart that looks like this and ask, what happened? Most of the time you'll find the answer to "what happened?" is that something else is going on. So actually: oh, in Chicago last week we were running a new promotion, and that's why revenue went down. It's not that people aren't buying stuff in Chicago anymore; it's that the prices were lower, for instance.

So what we really want to be able to do is say: what's the relationship between sale price and YearMade, all other things being equal? All other things being equal basically means: if we sold something in 1990 versus 1980, and it was exactly the same thing, to exactly the same person, at exactly the same auction, and so on, what would have been the difference in price? To do that, we use something called a partial dependence plot, and this is a partial dependence plot. There's a really nice library which nobody's heard of called pdp which does these partial dependence plots, and what happens is this.

We've got our sample of 500 data points, and we're going to do something really interesting. We're going to take each one of those 500 randomly chosen auctions and make a little data set out of it. So here's our data set of 500 auctions, and here are our columns, one of which is the thing we're interested in, YearMade. What we're going to do is try to create a chart where we say: all other things being equal, how much did bulldozers cost at auction in 1960? The way we do that is to replace the YearMade column with 1960: we copy in the value 1960 again and again, all the way down, so that in every row YearMade is 1960 and all of the other data is exactly the same. Then we take our random forest and pass all of this through it to predict the sale price. That tells us, for everything that was auctioned, how much we think it would have been sold for if that thing had been made in 1960, and that's the price we plot here. Then we do the same thing for 1961: we replace all of these with 1961, and so on.

So to be clear: we've already fit the random forest, and then we're just passing in a new year and seeing what it determines the price should be? Yes. This is a lot like the way we did feature importance, but rather than randomly shuffling the column, we replace the column with a constant value. Randomly shuffling the column tells us how accurate the model is when you can't use that column anymore; replacing the whole column with a constant estimates for us how much we would have sold that product for, in that auction, on that day, in that place, if the product had been made in 1961. We then take the average of all of the sale prices that we calculate from the random forest. We do it for 1961 and we get this value, and so forth. So what the partial dependence plot shows us is that each of these light blue lines is one of the 500 rows; it's actually showing us all 500 lines.
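(The pdp library wraps this up for you, but the underlying computation is just the column-replacement loop described above. Here's a minimal hand-rolled sketch, assuming m is the fitted random forest and x is the sampled data frame of 500 rows; the names are mine, not the library's API.)

```python
import numpy as np

def partial_dependence(model, df, feature, values):
    # For each candidate value, overwrite `feature` in every row, predict,
    # and keep both the per-row curves (the light blue lines) and their average.
    ice = []
    for v in values:
        df_mod = df.copy()
        df_mod[feature] = v            # every row now "made" in year v
        ice.append(model.predict(df_mod))
    ice = np.stack(ice, axis=1)        # shape: (n_rows, n_values)
    return ice.mean(axis=0), ice

years = np.arange(1960, 2011)
avg_curve, ice_lines = partial_dependence(m, x, 'YearMade', years)
```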
So it says: for row number one in our data set, if it had been made in 1960, we index that prediction to zero; if it had been made in 1970, that particular auction would have been here; in 1980 it would have been here; in 1990 it would have been here. So we actually plot all 500 predictions of how much every one of those 500 auctions would have gone for if we replaced its YearMade with each of these different values, and then this dark line here is the average. So this tells us how much we would have sold all of those auctions for, on average, if all of those products had actually been made in 1985, 1990, 1993, 1994, and so forth.

And you can see what's happened: at least in the period where we have a reasonable amount of data, which is since 1990, this is basically a straight line, which is what you would expect. If it was sold on the same date, and it was the same kind of tractor, sold to the same person in the same auction house, then you would expect more recent vehicles to be more expensive, because of inflation and because they're newer, not as secondhand, and you would expect that relationship to be roughly linear. That's exactly what we're finding. So by removing all of these externalities, it often allows us to see the truth much more clearly.

Is there a question at the back? Can you pass that back there? You're done, okay.

So this partial dependence plot concept is something which uses a random forest to get us a clearer interpretation of what's going on in our data. The steps were: first, look at the feature importance to tell us which things we think we care about; then use the partial dependence plot to tell us what's going on, on average.

There's another cool thing we can do with pdp: we can use clusters. What clusters does is use cluster analysis to look at each one of the 500 rows and ask, do some of those 500 rows move in the same way? We can kind of see that there seems to be a whole lot of rows that go down and then up, and a bunch of rows that go up and then flatten out; it does seem like there are different types of behavior being hidden. Here is the result of doing that cluster analysis: we still get the same average, but it says here are the five most common shapes we see. This is where you could then go in and say: all right, it looks like for some kinds of vehicle, after 1990 their prices are pretty flat, and before that they were pretty linear; some kinds of vehicle are exactly the opposite. Different kinds of vehicle have these different shapes, and this is something you could dig into. I think there's one at the back. Oh, you're good, okay.

So what are we going to do with this information? Well, the purpose of interpretation is to learn about a data set, and why do you want to learn about a data set? Because you want to do something with it. In this case it's not so much about winning a Kaggle competition; I mean, it can be a little bit, since some of these insights might make you realize, oh, I could transform this variable or create this interaction or whatever, and obviously feature importance is super important for Kaggle competitions.
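(The pdp library can do this clustering for you; here's a hand-rolled version continuing the sketch above, where ice_lines is the per-row matrix from partial_dependence. The variable names are mine.)

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Group the 500 individual curves into a handful of typical shapes.
k = 5
labels = KMeans(n_clusters=k, n_init=10).fit_predict(ice_lines)

for i in range(k):
    # Average curve within each cluster of rows.
    plt.plot(years, ice_lines[labels == i].mean(axis=0), label=f'cluster {i}')
plt.plot(years, ice_lines.mean(axis=0), 'k--', label='overall average')
plt.xlabel('YearMade'); plt.ylabel('predicted log SalePrice'); plt.legend()
plt.show()
```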
But this one's much more for real life. This is when you're talking to somebody and you say to them: okay, those plots you've been showing me, which seemed to say there was this dip in prices for things made between 1990 and 1997, there wasn't really; actually prices were increasing, and there was something else going on at that time. It's basically the thing that allows you to say: for whatever outcome I'm trying to drive in my business, this is how something is driving it. So if I'm looking at, say, advertising technology and asking what's driving clicks, I'm actually digging in to say: this is how clicks are being driven, this is the variable that's driving them, this is how it's related, and therefore we should change our behavior in this way. That's really the goal of any model. I guess there are two possible goals: one goal of a model is just to get the predictions; if you're doing hedge fund trading, you probably just want to know what the price of that equity is going to be, and if you're doing insurance, you probably just want to know how many claims that customer is going to have. But probably most of the time you're actually trying to change something about how you do business, how you do marketing, how you do logistics, so the thing you actually care about is how the things are related to each other.

I'm sorry, can you explain again: when you scrolled up and were looking at sale price against YearMade over the entire model, you saw that dip, and you said something about that dip not signifying what we thought it did. Can you explain why?

Yeah, so this is a classic, boring univariate plot: it's basically just taking all of the dots, all of the auctions, plotting YearMade against sale price, and fitting a rough average through them. It's true that products made between 1992 and 1997, on average in our data set, are being sold for less. Very often in business you'll hear somebody look at something like this and say, oh, we should stop auctioning equipment made in those years because we're getting less money for it. But maybe the truth is that during those years people were just making more small industrial equipment, which you would expect to be sold for less, and actually our profit on it is just as high. Or maybe it's not that things made during those years would be cheaper today; it's that when we were selling things in those years, they were cheaper because there was a recession going on. So if you're trying to actually take some action based on this, you probably don't just care about the fact that things made in those years are cheaper on average, but how that impacts today.

So this approach, where we say let's try to remove all of these externalities, so that if something is sold on the same day, to the same person, and it's the same kind of vehicle, how does YearMade actually impact price? This basically says, for example, if I'm deciding what to buy at an auction, then this is kind of saying to me:
Okay: getting a more recent vehicle really does, on average, give you more money, which is not what the naive univariate plot said. Can you pass it to Tyler?

So for this: bulldozers made in 2010 are probably not close to the type of bulldozers that were made in 1960, and if you take something very different, like a 2010 bulldozer, and just try to say, oh, what if it was made in 1960, that may cause a poor prediction at that point because it's so far outside what the model has seen.

Absolutely, and I think that's a good point. It's a limitation of random forests: if you've got a data point that sits in a part of the space the model hasn't seen before, say people didn't really put air conditioning in bulldozers in 1960 and you're asking how much this bulldozer with air conditioning would have gone for in 1960, you don't really have any information to know that. This is still the best technique I know of, but it's not perfect, and you kind of hope the trees are still going to find some useful truth even though they haven't seen that combination of features before. But yes, it's something to be aware of.

You can also do the same thing with a pdp interaction plot, which is really what I'm trying to get to here: how saleElapsed and YearMade together impact price. If I do a pdp interaction plot, it shows me saleElapsed versus price, YearMade versus price, and the combination of the two versus price. Remember, this is always the log of price; that's why these prices look weird. You can see that the combination of saleElapsed and YearMade is as you would expect. Later dates, so more elapsed time, is giving me... oh, sorry, it's the other way around, isn't it: the highest prices are those with the least elapsed time and the most recent YearMade. So you can see here the univariate relationship between saleElapsed and price, the univariate relationship between YearMade and price, and then the combination of the two. It's enough to see clearly that these two things are driving price together, and you can also see these are not simple diagonal lines, so there's some interesting interaction going on. Based on these plots, it's enough to make me think we should maybe put in some kind of interaction term and see what happens. Let's come back to that in a moment, but let's just look at a couple more.

Remember, in this case I did one-hot encoding way back at the top; I set max_n_cat to 7, so I've got columns like Enclosure EROPS w AC. If you've got one-hot-encoded variables, you can pass an array of them to the pdp plotting function and it'll treat them as a single category. So in this case I'm going to create a pdp plot of these three categories.
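(Conceptually, grouping one-hot columns into one categorical feature is the same trick as before: set one indicator to 1 and the others to 0, predict, and average. A hand-rolled sketch, where the column names below are hypothetical examples of what max_n_cat one-hot encoding might produce, and m and x follow the earlier sketches.)

```python
# Treat a group of one-hot columns as one categorical feature.
enclosure_cols = ['Enclosure_EROPS w AC', 'Enclosure_EROPS', 'Enclosure_OROPS']

def categorical_pd(model, df, onehot_cols):
    means = {}
    for level in onehot_cols:
        df_mod = df.copy()
        for c in onehot_cols:
            df_mod[c] = 1 if c == level else 0   # force this level on, others off
        means[level] = model.predict(df_mod).mean()
    return means

print(categorical_pd(m, x, enclosure_cols))
```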
I'm going to call it Enclosure, and I can see here that EROPS w AC is on average more expensive than EROPS; EROPS and OROPS actually look pretty similar, whereas EROPS w AC is higher. At this point I'd probably be inclined to hop into Google and type in EROPS and OROPS to find out what the hell these things are. And here we go: it turns out that EROPS is an enclosed rollover protective structure, and OROPS is an open one, so if your bulldozer is fully enclosed, then optionally you can also get air conditioning. So actually this variable is telling us whether it's got air conditioning; if it's an open structure, then obviously you don't have air conditioning at all. That's what these three levels are, and we've now learnt that, all other things being equal, the same bulldozer built at the same time, sold at the same time, sold to the same person, is going to be quite a bit more expensive if it has air conditioning than if it doesn't. So again, we're getting this nice interpretation ability, and now that I've spent some time with this data set, I've certainly noticed that knowing this matters: you do notice that there are a lot more air-conditioned bulldozers nowadays than there used to be, so there's definitely an interaction between date and that.

So based on the earlier interaction analysis, I've tried, first of all, setting everything before 1950 to 1950, because it seems to be some kind of missing-value marker, and then setting age equal to saleYear minus YearMade, and I try running a random forest on that. Indeed, age is now the single biggest thing; saleElapsed is way back down here, and YearMade is back down here. So we've kind of used this to find an interaction. But remember, of course, a random forest can create an interaction on its own through having multiple split points, so we shouldn't assume that this is actually going to give a better result, and in practice, when I looked at my score, my RMSE after adding age was actually a little worse. We'll see more about that later, probably in the next lesson.

Okay, so one last thing is the tree interpreter. This is also in the category of things that most people don't know exist, but it's super important: almost pointless for Kaggle competitions, but super important for real life. Here's the idea. Let's say you're an insurance company, and somebody rings up and you give them a quote, and they say, oh, that's $500 more than last year, why? In general, you've made a prediction from some model and somebody asks why. This is where we use this method called a tree interpreter. What the tree interpreter does is let us take our particular row. In this case we're going to pick row number zero; here is row zero, and these are all of the columns in row zero, presumably things like YearMade, though I don't know what all the codes stand for. What I can do with the tree interpreter is call ti.predict, passing in my random forest and my row, so this would be this particular customer's insurance information, or in this case this particular auction, and it'll give me back three things. The first is the prediction from the random forest; the second is the bias, which is basically the average sale price across the whole original data set.
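(The treeinterpreter package can be pip installed. A minimal sketch of the call being described, assuming m is the fitted forest and X_valid is the validation data frame from the earlier split; those names are mine.)

```python
from treeinterpreter import treeinterpreter as ti

# One auction: a single row, kept 2-D so the predict call works.
row = X_valid.values[None, 0]

prediction, bias, contributions = ti.predict(m, row)
# prediction:    the forest's prediction for this particular auction
# bias:          the mean log sale price of the training set (every tree's root)
# contributions: per-feature amounts by which splits on that feature pushed the
#                prediction up or down from that starting point
```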
So remember, in our random forest we started with single trees. We haven't got the drawing there anymore, but remember we started with a single tree, and we split it once, and then split that once, and then split that once. We asked: what's the average value for the whole data set? Then, what's the average value for those where the first split was true? Then, what's the average value where the next split was also true? Until eventually you get down to the leaf nodes, where you've got the average value you predict. So you can think of it this way: for a single tree, if this is our final leaf node, maybe we're predicting 9.1, and maybe the average log sale price for the whole lot is 10.2; that's the average over all the auctions. And you can work your way down from one to the other.

So let's actually go and run this so we can see it, and redraw that single tree. You'll find in Jupyter notebooks that a lot of the things we create, like visualizations and progress bars, don't know how to save themselves into the file, so you'll just see a little string here, and you actually have to rerun the cell to recreate it. So this was the single tree that we created. The whole data set had an average log sale price of 10.2; the subset with Coupler_System true had an average of 10.3; the subset with Coupler_System true and Enclosure less than 2 was 9.9; and eventually we get all the way down here, where with ModelID less than 4573 as well, it's 10.2.

So you can ask: why did this particular row, say one that ended up in this leaf node, get a prediction of 10.2? Well, it's because we started at 10.19; then, because Coupler_System was less than 0.5 (so actually false), we added a little bit, going from about 10.2 to 10.3; then, because Enclosure was less than 2, we subtracted about 0.4; and then, because ModelID was less than 4573, we added about 0.7. So with a single tree you can break down exactly why we predicted 10.2: at each one of these decision points we're adding or subtracting a little bit from the value.

What we can then do is do that for all the trees and take the average: every time we see Enclosure, did we increase or decrease the value, and by how much? Every time we see ModelID, did we increase or decrease the value, and by how much? We take the average of all of those, and that's what ends up in this thing called contributions. So here are all of our predictors, and here is the value of each, and this is telling us, sorted here, that the fact that this thing was made in 1999 was what most negatively impacted our prediction, and that the fact that the age of the vehicle was 11 years was what most positively impacted it.

Student: I think you actually need to sort after you zip them together. The values look sorted, but then they're just reassigned to the columns in their original order, which is why this one shows up as the most positive contribution to price.

Thank you, that makes perfect sense. Yes, we need to do an index sort. Okay, thank you.
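(As the question points out, the column names and contribution values have to be sorted together rather than separately. A sketch of the fix, using the outputs from ti.predict above; df_keep.columns is assumed to match the feature order the forest was trained on.)

```python
# Pair each column name with its contribution, then sort the pairs together,
# so the labels can never get out of step with the values.
pairs = sorted(zip(df_keep.columns, contributions[0]), key=lambda p: p[1])
for name, value in pairs:
    print(f'{name:25} {value:+.3f}')
```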
We'll make sure we fix that by next week; we need to sort the columns by the index from the contributions. Then there's this thing called bias, and the bias is just the average before we do any splits. So you basically start with the average log sale price, and then as we go down each tree, each time we see YearMade it has some impact, Coupler_System has some impact, ProductSize has some impact, and so forth.

Okay, I think what we might do, because we're about out of time, is come back to the tree interpreter next time. But the basic idea, and this was the last of our key interpretation points, is that we want some ability not only to tell us about the model as a whole and how it works on average, but also to look at how the model makes predictions for an individual row. And that's what we're doing here. Okay, great. Thanks everybody. See you on Thursday.