So from here, the next two or three lessons, we're going to be really diving deep into random forests. So far all we've learned is that there's a thing called random forests, and for some particular data sets they seem to work really well without too much trouble. But we don't really know yet: how do they actually work? What do we do if they don't work properly? What are their pros and cons? What can we tune? And so forth. So we're going to look at all of that, and then after that we're going to look at how to interpret the results of random forests, to get not just predictions but to actually deeply understand our data in a model-driven way. So that's where we're going to go here.

Let's just review where we're up to. We learned that there's this library called fastai, and the fastai library is a highly opinionated library, which is to say we've spent a lot of time researching the best techniques for getting state-of-the-art results, and then we package those techniques into pieces of code so that you can get state-of-the-art results yourself. Where possible we wrap or provide things on top of existing code, and in particular, for the kind of structured data analysis we're doing, scikit-learn has a lot of really great code. So most of the stuff we're showing you from fastai is there to help us get data into scikit-learn and then interpret what comes out of scikit-learn.

The way the fastai library works in our environment here is that our notebooks are inside the fastai repo, under courses/ml1 and courses/dl1, and inside there is a symlink to the fastai directory in the parent of the parent: a directory containing a bunch of modules. So if you want to use the fastai library in your own code, there are a number of things you can do. One is to put your notebooks or scripts in the same directory as ml1 or dl1, where this symlink already exists, and just import it like I do. You could copy that directory (../../fastai) somewhere else and use it, or you could symlink to it, just as I have, from wherever you want to use it.

Notice it's mildly confusing: there's a GitHub repo called fastai, and inside that repo there is a folder called fastai. The fastai folder in the fastai repo contains the fastai library, and that's the library we use. When we write "from fastai.imports import *", that's looking inside the fastai folder for a file called imports.py and importing everything from it.

Yes, Danielle? "Just a clarifying question about the symlink: is it the ln thing you talked about last class?" Yeah. A symlink is something you can create by typing ln -s, then the path to the source, which in this case would be ../../fastai (it can be relative or absolute), and then the name of the destination. If you just give the current directory as the destination, it'll use the same name as the source. It's like an alias on the Mac or a shortcut on Windows.

"And when you do the import..." Can I just... hang on. Yeah, go ahead. "There was import sys and then appending that relative path. Does that also create the symlink?" I don't think I've created a symlink anywhere in the notebooks; the symlink actually lives inside the repo. I created some symlinks in the deep learning notebooks, but those were to some data, which is different.
"Yeah, at the top of Tim Lee's workbook from the last class there was import sys and then appending the fastai path." Oh yeah, don't do that, probably. I mean, you can, but I think this way is better: it's a good example of using the symlink, because this way you can write "from fastai.imports import *" and, regardless of how you got the library there, it's going to work.

Okay. So then we had all of our data for the Blue Book for Bulldozers competition in data/bulldozers, and here it is. We were able to read that CSV file, and the only thing we really had to do was say which columns were dates. Having done that, we were able to take a look at a few example rows of the data.

We also noted that it's very important to deeply understand the evaluation metric for this project. For Kaggle, they tell you what the evaluation metric is, and in this case it was the root mean squared log error (RMSLE): the square root of the mean of (log(actual) minus log(prediction)) squared. So if we replace the actuals with log(actuals) and the predictions with log(predictions), it's just the same as root mean squared error. That's what we did: we replaced SalePrice with log(SalePrice), and now if we optimize for root mean squared error, we're actually optimizing for the root mean squared error of the logs.

Then we learned that we need all of our columns to be numbers. The first way we did that was to take the date column, remove it, and replace it with a whole bunch of different columns, such as: is that date the start of a quarter? Is it the end of a year? How many days have elapsed since January 1st, 1970? What's the year? What's the month? What's the day of the week? And so forth. They're all numbers.

Then we learned that we can use train_cats to replace all of the strings with categories. When you do that, it doesn't look like you've done anything different: they still look like strings. But if you take a deeper look, you'll see that the data type is now not string but category. Category is a pandas class, and you can then go .cat.<something> to find a whole bunch of attributes, such as .cat.categories for the list of all possible categories (this says High will become zero, Low will become one, Medium will become two), and .cat.codes to actually get the numbers.

So then, to actually use this data set, we need to take every categorical column and replace it with .cat.codes, and we did that using proc_df. How do I get the source code for proc_df? Two question marks: ??proc_df. If I scroll down, I see it goes through each column and numericalizes it. That's the bit I want, so now I look up numericalize (tab to complete it). If the column is not numeric, it replaces the data frame's field with that column's .cat.codes plus one, because otherwise unknown is minus one, and we want unknown to be zero. So that's how we turn the strings into numbers: they get replaced with a unique, basically arbitrary index.
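As a rough sketch of what that numericalize step is doing (the real proc_df handles more cases; this is just the core idea):

```python
import pandas as pd

# Replace every non-numeric column with its category codes, shifted by one
# so that pandas' "unknown" code of -1 becomes 0.
for name, col in df.items():
    if not pd.api.types.is_numeric_dtype(col):
        df[name] = col.cat.codes + 1
```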
The codes, by the way, are based on the alphabetical order of the categories. The other thing proc_df did, remember, was handle continuous columns with missing values: the missing values got replaced with the median, and we added an additional column called column-name_na, a boolean column which tells you whether that particular item was missing or not.

Once we did that, we were able to call RandomForestRegressor, fit, and get the score, and it turns out we have an R² of 0.98. Can anybody tell me what an R² is? You want to show me? Okay. "R² essentially shows how much variance is explained by the model. It's the relation of... this is SS_res... I'm trying to remember the exact formula, but roughly, intuitively, it's how much the model accounts for the variance in the data." Okay, good. So let's talk about the formula, and with formulas the idea is not to learn the formula and remember it, but to learn what the formula does and understand it.

Here's the formula: R² = 1 - SS_res / SS_tot. It's one minus something divided by something else. So what's the something else on the bottom? SS_tot = sum of (y_i - y_mean)². What this is saying is: we've got some actual data, some y_i's, say 3, 2, 4, 1, and then we've got their average. SS_tot is the sum, over each of these data points, of (the data point minus the mean) squared. In other words, it's telling us how much this data varies. Perhaps more interestingly: remember last week when we talked about the simplest non-stupid model you could come up with? The simplest non-stupid model we came up with was to create a column of the mean, just copy the mean a bunch of times, and submit that to Kaggle. If you did that, your error would be this. So the bottom is the error of the most naive non-stupid model, the model that just predicts the mean.

On the top we have SS_res = sum of (y_i - f_i)², where f_i is our column of predictions. So now, rather than taking y_i minus the mean, we take y_i minus f_i. Instead of asking what the error of our naive model is, we're asking what the error of the actual model we're interested in is, and then we take the ratio. In other words: if we were exactly as effective as just predicting the mean, the top and bottom would be the same, the ratio would be one, and one minus one is zero. If we were perfect, so f_i minus y_i was always zero, then it's zero divided by something, and one minus that is one.

Okay, so what is the possible range of values of R²? I heard a lot of "zero to one". Does anybody want to give me an alternative? "Negative one to one?" Anything less than one: that's the right answer. Let's find out why; who's got the box? "Because you can make a model basically as crap as you want, with errors as big as you want, and you're just subtracting from one in the formula." Exactly. So, interestingly, I was talking to our computer science professor Terrence this morning, who had been talking to a statistics professor who told him that the possible range of values of R² was zero to one. I said that is totally not true: if you predict infinity for every row, then you're going to have infinity for every residual, and so you're going to have one minus infinity.
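You can see all of this concretely in a few lines (toy data, illustrative only):

```python
import numpy as np

def r_squared(y, pred):
    ss_res = ((y - pred) ** 2).sum()        # error of our model
    ss_tot = ((y - y.mean()) ** 2).sum()    # error of the "just predict the mean" model
    return 1 - ss_res / ss_tot

y = np.array([3., 2., 4., 1.])
print(r_squared(y, y))                       # perfect model -> 1.0
print(r_squared(y, np.full(4, y.mean())))    # predicting the mean -> 0.0
print(r_squared(y, np.full(4, 1000.)))       # terrible model -> hugely negative
```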
Okay, so the possible range of values is anything less than one; that's all we know. And this will happen: you will sometimes get negative values for your R², and when that happens it's not a mistake, it's not a bug. It means your model is worse than predicting the mean, which suggests it's not great.

So that's R². It's not necessarily what you're actually trying to optimize, but the nice thing about it is that it's a number you can use for pretty much every model, so you can start to get a feel for what 0.8 looks like, or what 0.9 looks like. Something I find interesting is to create some different synthetic two-dimensional data sets with different amounts of random noise, look at them on a scatter plot, and see what their R² is, just to get a feel for how close to the data an R² of 0.9 means you are, or 0.7. So I think R² is a useful number to have a familiarity with, and you don't need to remember the formula if you remember the meaning: it's the ratio between your model's squared error and the squared error of the naive "predict the mean" model.

In our case, 0.98 says it's a very good model. However, it might be a very good model only because it looks like this, and this would be called overfitting. We may well have created a model which is very good at running through the points we gave it, but which is not going to be very good at running through points we didn't give it. That's why we always want to have a validation set.

Creating your validation set is the most important thing you need to do when you're doing a machine learning project, at least in terms of the actual modeling, because what you need is a data set where the score of your model on that data set is representative of how well your model is going to do in the real world: on the Kaggle leaderboard, or, off Kaggle, when you actually use it in production. Very, very often I hear people in industry say: I don't trust machine learning; I tried modeling once, it looked great, we put it in production...
...and it didn't work. But whose fault is that? It means their validation set was not representative.

So here's a very simple thing, which generally speaking Kaggle is pretty good about doing: pay attention to whether your data has a time piece in it, as happens in Blue Book for Bulldozers. In Blue Book for Bulldozers we're talking about the sale price of a piece of industrial equipment on a particular date, and the startup running this competition wanted to create a model that wouldn't predict last February's prices, but would predict next month's prices. So what they did was give us training data representing a particular date range, and then the test set represented a future range of dates that wasn't represented in the training set. That's pretty good: it means that if we do well on this model, we've built something which can actually predict the future, or at least could predict the future then, assuming things hadn't changed dramatically.

That's the test set we have, so we need to create a validation set that has the same properties. The test set had 12,000 rows in it, so let's create a validation set that has 12,000 rows: split the data into the first n minus 12,000 rows for the training set and the last 12,000 rows for the validation set. We've now got something which hopefully looks like Kaggle's test set, close enough that when we actually use this validation set, we're going to get reasonably accurate scores. The reason we want this is that on Kaggle you can only submit so many times, and if you submit too often you'll end up overfitting to the leaderboard anyway. And in real life, you actually want to build a model that's going to work in real life.
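Concretely, because the data is sorted by date, that split is just slicing; a minimal sketch in the notebook's style (df and y assumed sorted by sale date):

```python
n_valid = 12000                    # same size as Kaggle's test set
n_trn = len(df) - n_valid

# earliest rows for training, the most recent 12,000 rows for validation
X_train, X_valid = df[:n_trn], df[n_trn:]
y_train, y_valid = y[:n_trn], y[n_trn:]
```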
Did you have a question? Can we get the green box over there? "Can you explain the difference between a validation set and a test set?" Absolutely. One of the things we're going to learn today is how to set hyperparameters: tuning parameters that change how your model behaves. Now, if you just have one holdout set, that is, one set of data you're not using to train with, and we use it to decide which set of hyperparameters to use, then if we try a thousand different sets of hyperparameters we may end up overfitting to that holdout set; that is to say, we'll find something which only accidentally worked. So what we actually want is a second holdout set where we can say: okay, I'm finished, I've done the best I can, and now, just once, right at the end, I'm going to see whether it works.

This is something which almost nobody in industry does correctly. You really need to remove that holdout set, which is called the test set, from the data, give it to somebody else, and tell them: do not let me look at this data until I promise you I'm finished. It's so hard otherwise not to look at it. For example, in the world of psychology and sociology you might have heard about the replication crisis. This is basically because people in these fields have, accidentally or maybe intentionally, been p-hacking, which means they've been trying lots of different variations until they find something that works. Then it turns out that when people try to replicate the studies (in other words, it's as if somebody creates a test set: somebody says, okay, this study which shows the impact of whether you eat marshmallows on your tenacity later in life, I'm going to redo it), over half the time the effect turns out not to exist. So that's why we want to have a test set.

Can you pass that next door? "For handling categorical data, you converted the categories to numbers, ordinal numbers. I've seen a lot of models where we convert categorical data into different columns using one-hot encoding. Which approach should we use in which model?" Yeah, we're going to tackle that today. It's a great question.

Okay, so I'm splitting my data into validation and training sets, and you can see now that my validation set is 12,000 by 66, whereas my training set is 389,000 by 66. We're going to use the training set to train a model and the validation set to see how well it's working. When we tried that last week, we found that our model, which had a 0.982 R² on the training set, only had 0.887 on the validation set, which makes us think we're overfitting quite badly. But it turned out it wasn't too bad, because the root mean squared error on the logs of the prices would actually have put us in the top 25% of the competition anyway. So even though we were overfitting, it wasn't the end of the world.

Could you pass the microphone to Marsha, please? "In dividing the set into training and validation, it seems like you simply take the first n_trn observations of the data set and set the rest aside. Why don't you randomly pick the observations?" Because if I did that, I wouldn't be replicating the test set. Kaggle has a test set, and when you actually look at the dates in that test set, they are all more recent than any date in the training set. If we used a validation set that was a random sample, that would be a much easier problem: we'd be predicting the value of a piece of industrial equipment at auction on a day when we actually already have some observations from that very day. In general, any time you're building a model that has a time element, you want your test set to be a separate time period, and therefore you really need your validation set to be a separate time period as well. In this case the data was already sorted by date, so that's why this works.

"So let's say we have the training set, where we train the model, and the validation set against which we measure the R². In case our R² turns out to be really bad, we would want to tune our hyperparameters and run it again, yes? So wouldn't that eventually be overfitting on the overall training set?"
Yeah, so actually that's the issue: that would eventually have the possibility of overfitting on the validation set. And then when we try it on the test set, or we submit it to Kaggle, it turns out not to be very good. This happens in Kaggle competitions all the time. Kaggle actually has a fourth data set, called the private leaderboard set. Every time you submit to Kaggle, you actually only get feedback on how well your submission does on something called the public leaderboard set, and you don't know which rows those are. At the end of the competition you get judged on a different data set entirely, the private leaderboard set. So the only way to avoid this is to actually be a good machine learning practitioner and know how to set these parameters as effectively as possible, which we're going to be doing partly today and over the next few weeks.

Can you pass that back? "Is it too early or too late to ask what the difference is between a hyperparameter and a parameter?"

Okay, so let's start tracking things on root mean squared error. Here is root mean squared error in a line of code, and you can see this is one of those examples where I'm not writing code the way a proper software engineer would. A proper software engineer would do a number of things differently: they would put it on multiple lines, use longer variable names, write documentation, and so on. But I really think that being able to look at something in one go with your eyes, and over time learn to immediately see what's going on, has a lot of value; and so does consistently using particular letters and abbreviations to mean particular things. I think that works really well in data science. If you're doing a take-home interview test or something, you should write your code according to PEP 8 standards. PEP 8 is the style guide for Python code, and you should know it and use it, because a lot of software engineers are very particular about this kind of thing. But for your own work, I think this approach works well for me. So I just wanted to make you aware, (a) that you shouldn't necessarily use this as a role model when dealing with software engineers, but (b) that I actually think this is a reasonable approach.

So there's our root mean squared error, and from time to time we're just going to print out the score, which gives us the RMSE of the predictions on the training set versus the actuals, the RMSE of the predictions on the validation set versus the actuals, the R² for the training set, and the R² for the validation set. We'll come back to OOB in a moment.
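A sketch consistent with what's just been described (the notebook's actual helpers may differ in detail; X_train, X_valid and friends are the variables from the split above):

```python
import math

def rmse(x, y): return math.sqrt(((x - y) ** 2).mean())

def print_score(m):
    # [train RMSE, valid RMSE, train R^2, valid R^2, (OOB R^2 if available)]
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)
```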
When we ran that, we found this RMSE was in the top 25%, so okay, that's a good start. Now, this took eight seconds of wall time, eight actual seconds; if you put %time in front, it will tell you how long things took. Luckily I've got quite a few cores, quite a few CPUs, in this computer, because it actually took over a minute of compute time, parallelized across the cores. If your data set were bigger, or you had fewer cores, you could well find this takes a few minutes to run, or even a few hours. My rule of thumb is that if something takes more than 10 seconds to run, it's too long for me to do interactive analysis with it: I want to be able to run something, wait a moment, and then continue. So what we do is try to make sure that things can run in a reasonable time, and then when we're finished at the end of the day, we can say: okay, this feature engineering, these hyperparameters, and so on are all working well; and then rerun it the big, slow, precise way.

One way to speed things up is to pass the subset parameter to proc_df, which will randomly sample the data; here I'm randomly sampling 30,000 rows. Now, when I do that, I still need to be careful that my validation set doesn't change, and that my training set doesn't overlap with its dates; otherwise I'm cheating. So I call split_vals again to do the split by date. You'll also see that rather than putting the second piece into a validation set, I'm putting it into a variable called underscore. This is a standard approach in Python: use a variable called _ when you want to throw something away. I don't want to change my validation set, because no matter what different models I build, I want to be able to compare them all to each other, so I keep my validation set the same all the time. So all I'm doing here is resampling my training set to be the first 20,000 rows of the 30,000-row subset. I can now run that, and it runs in 621 milliseconds, so I can really zip through things and try things out.

So with that, let's use this subset to build a model that is so simple we can actually take a look at it. A forest is made of trees, so before we look at the forest, we'll look at the trees. In scikit-learn they don't call them trees; they call them estimators. So we're going to pass in the parameter n_estimators=1 to create a forest with just one tree, and then we're going to make it a small tree, so we pass in max_depth=3. And since a random forest, as we're going to learn, randomizes a whole bunch of things, we want to turn that off, which you do by saying bootstrap=False. With these parameters it creates a small, deterministic tree. If I fit it and call print_score, my R² has gone down from 0.85 to 0.4. So this is not a good model. It's better than the mean model, since it's at least above zero, but it's not a good model. It is, however, a model we can draw.

So let's learn from what it built. A tree consists of a sequence of binary splits. First of all, it decided to split on coupler_system less than or greater than 0.5. That's a boolean variable, so this is really just true versus false. Then, within the group where coupler_system was true, it decided to split on YearMade at 1986.5 (that is, 1986 or earlier versus 1987 or later). And then, where coupler_system was true and YearMade was 1986 or earlier, it split on fiProductClassDesc less than or equal to 7.5, and so forth. Right at the top we have 20,000 samples, 20,000 rows, and the reason for that is that's what we asked for when we split our data for the sample.

"I just want to double-check, for the decision tree you have there: is the coloration whether it's true or false, or is darker a higher value?" Darker is a higher value; we'll get to that in a moment.
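For reference, the exact settings just described, as standard scikit-learn parameters:

```python
from sklearn.ensemble import RandomForestRegressor

# one tree, three levels deep, no bootstrap randomness:
# weak, but deterministic and small enough to draw
m = RandomForestRegressor(n_estimators=1, max_depth=3, bootstrap=False, n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)
```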
So let's look at these numbers here. In the whole data set, or rather in the sample we're using, there are 20,000 rows, the mean of the log of price is 10.1, and if we built a model where we just used that average all the time, then the mean squared error would be 0.477. In other words, this is the denominator of an R². This is the most basic model: a tree with zero splits, which just predicts the average.

The best single binary split we can make turns out to be splitting on whether coupler_system is less than or greater than 0.5, in other words whether it's true or false. It turns out that if we do that, the mean squared error of the group where coupler_system is less than 0.5 (so, false) goes down from 0.477 to 0.11; the split has really improved the error a lot for that group. In the other group it's only improved a bit: it's gone from 0.477 to 0.41. And we can see that the coupler_system-false group is a pretty small fraction, only 2,200 of the 20,000 rows, whereas the other group is a much larger fraction but hasn't improved as much.

So let's say you wanted to create a tree with just one split; you're trying to find the very best single binary decision you can make for your data. How might you be able to do that? Can we give it to Ford? "Specify a max depth of one?" Right, but you don't have a random forest yet; we're writing it ourselves. What's a simple algorithm you could use? We want to start building a random forest from scratch; the first step is to create a tree, and the first step of creating a tree is to create the first binary decision. How are you going to do it? I'll give it to Chris. "Isn't this simply trying to find the best predictor, based on maybe a linear regression?" You could use a linear regression, but could you do something much simpler and more complete? We're trying not to use any statistical assumptions here. I can't see your name, sorry. "Could we just take one variable: if it is true, put the row in the true group, and if it is false..." Okay, but which variable are we going to choose? At each binary point we have to choose a variable and something to split on. How are we going to do that? How do I pronounce your name? Shikhar. "The variable to choose could be the one which divides the population into two groups which are heterogeneous between themselves and homogeneous within themselves, having the same quality within themselves while being very different from each other." Could you be more specific? "In terms of the target variable, maybe. Let's say we have two groups after the split: one has an altogether different price from the second group, while internally they have similar prices." Okay, that's good. So, to simplify things a little, we're saying: find a variable and a split such that the two groups are as different from each other as possible. Okay, and how would you pick which variable and which split point? That's the question. What's your first cut? We're making a tree from scratch; we want to create our own tree.
We've got somebody over here: Macy. "Can we test all of the possible splits and see which one gives a smaller RMSE?" That sounds good. Okay, so let's dig into this. When you say test all of the possible splits, what does that mean? How do we enumerate all the possible splits? "For each variable, you could try one value, then try another, and compare the two to see which is better." Good. Okay: for each variable, for each possible value of that variable, see whether splitting there is better.

Now give it back to Macy, because I want to dig into the "better". When you say see if the RMSE is better, what does that mean? Because after a split you've got two groups, and so two RMSEs. "You're just going to fit with that one variable, comparing to the others..." So what I mean here is that before we decided to split on coupler_system, we had a mean squared error of 0.477, and after, we've got two groups: one with a mean squared error of 0.11 and another with a mean squared error of 0.41. The model with zero splits has a single mean squared error; the model with one split, the very first thing we try, now has two groups with two mean squared errors.

You want to give it to Daniel? "Do you pick the split that gets them as different as they can be?" Well, okay, that would be one idea: get the two mean squared errors as different as possible. But why might that not work? What might be a problem with that? "Sample size?" Go on. "Because you could just literally leave one point out." Yeah. We could have YearMade less than 1950, and it might have a single sample with a low price, and that's not a great split, is it? Because the other group is actually not going to be any more interesting than before.

Can Jason improve it a bit? "Could you take a weighted average?" Yeah, a weighted average: we could take 0.41 times 17,000 plus 0.1 times 2,000. That's good, and it would be the same as saying: I've got a model, the model is a single binary decision, and for everybody with YearMade less than 1986.5 I'm going to fill in 10.2, for everybody else I'm going to fill in 9.2, and then I'm going to calculate the root mean squared error of this crappy model. That would give exactly the same ranking as the weighted average you're suggesting.

Okay, good. So we now have a single number that represents how good a split is: the weighted average of the mean squared errors of the two groups it creates. And thanks to (I think it was Jake?) we have a way to find the best split, which is to try every variable, and every possible value of that variable, and see which variable and which value gives us the split with the best score. Does that make sense? Okay. What's your name, sir? Can somebody give Natalie the box? "When you say every possible number for every possible variable: here we have 0.5 as our criterion to split the tree, so are you saying we try every single number, every possible value?" Right. So coupler_system only has two values, true and false...
...so there's only one way of splitting it: trues and falses. YearMade is an integer which varies between, I don't know, 1960 and 2010, so we can just ask: what are all the possible unique values of YearMade? And try them all. So we're trying all the possible split points.

Can you pass that back to Daniel, or pass it to me and I'll pass it to Daniel. "I don't speak loudly, so that's why I'm right up here. I just want to clarify again: for the first split, why did we split on coupler_system true-or-false to start with?" Because of what we just built up: we used Jake's technique. We tried every variable, and for every variable we tried every possible split. For each one we noted down (I think it was Jason's idea) the weighted average mean squared error of the two groups it created. We found which one had the best mean squared error, and we picked it, and it turned out to be coupler_system, true or false. Does that make sense? "I guess my question is more: so coupler_system is one of the best indicators?" It's the best. Everything else it tried wasn't as good. "Okay, and then you do that again each time you split?" Right. So now that we've done that, we take this group here, everybody who's got coupler_system equals true, and we do it again: for every possible variable, for every possible level, among the rows where coupler_system equals true, what's the best possible split? "And are there circumstances when it's not just binary? Like, you split into three groups, for example on YearMade?" So I'm going to make a claim, and then I'm going to see if you can justify it. I claim that it's never necessary to do more than one split at a level. Why? "Because you can just split it again." Exactly: you can get exactly the same result by splitting twice.

Okay, good. So that is the entirety of creating a decision tree. You stop either when you hit some limit that was requested (we had a limit where we said max_depth=3, so one way to stop is to stop at a requested depth), or otherwise when your leaf nodes, these things at the end, have only one thing in them. That's a decision tree. That is how we grow a decision tree.

And this decision tree is not very good, because it's got a validation R² of 0.4. So we could try to make it better by removing max_depth=3 and creating a deeper tree. It's going to go all the way down, splitting further until every leaf node has only one thing in it. If we do that, the training R² is, of course, one, because we can exactly predict every training element: each one is in a leaf node all of its own. But the validation R² is not one. It's actually better than our really, really shallow tree, but it's not as good as we'd like. So we want to find some other way of making these trees better, and the way we're going to do it is to create a forest.
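Before moving on to forests, here's the whole split search we just assembled, as a from-scratch sketch (X a DataFrame, y a numpy array; purely illustrative and far slower than real implementations). The MSE of predicting a group's mean is just that group's variance, which keeps the scoring line short:

```python
import numpy as np

def find_best_split(X, y):
    """Try every column and every unique value; score each candidate split by
    the weighted average of the two groups' MSEs; return the best."""
    best_col, best_val, best_score = None, None, np.inf
    for col in X.columns:
        for val in np.unique(X[col]):
            lhs = (X[col] <= val).values
            rhs = ~lhs
            if lhs.sum() == 0 or rhs.sum() == 0:
                continue
            # weighted average of the two groups' MSEs
            score = (lhs.sum() * y[lhs].var() + rhs.sum() * y[rhs].var()) / len(y)
            if score < best_score:
                best_col, best_val, best_score = col, val, score
    return best_col, best_val, best_score
```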
So what's a forest? To create a forest, we're going to use a statistical technique called bagging. You can bag any kind of model. In fact, Michael Jordan, one of the speakers at the recent Data Institute conference here at the University of San Francisco, developed a technique called the bag of little bootstraps, in which he shows how to use bagging with absolutely any kind of model to make it more robust, and also to give you confidence intervals. The random forest is simply a way of bagging trees.

So what is bagging? Bagging is a really interesting idea. What if we created five different models, each of which was only somewhat predictive, but the models weren't at all correlated with each other; they gave predictions that weren't correlated with each other? That would mean the five models had each found different insights into the relationships in the data. If you took the average of those five models, you'd effectively be bringing in the insights from each of them. This idea of averaging models is a technique for ensembling, which is really important.

Now let's come up with a more specific version of this. What if we created a whole lot of big, deep, massively overfit trees, but for each one we only picked a random one-tenth of the data? Pick one out of every ten rows at random, and build a deep tree, which is perfect on that subset and kind of crappy on the rest. Say we do that a hundred times, with a different random sample every time. All of the trees are going to be better than nothing, because they each really do have a random subset of the data, so they've each found some insight; but they're also overfitting terribly. Since they all use different random samples, they all overfit in different ways on different things. In other words, they all have errors, but the errors are random. And what is the average of a bunch of random errors? Zero. So if we take the average of these trees, each of which has been trained on a different random subset, the errors will average out to zero, and what's left is the true relationship. That's the random forest.

So there's the technique. We've got a whole bunch of rows of data; we grab a few at random, put them into a smaller data set, and build a tree based on that. Then we put that tree aside and do it again with a different random subset, and again and again, a whole bunch of times. Then, for each tree, we can make predictions by running our test data through the tree to get to a leaf node, and taking the average in that leaf node; and then we average all the trees' answers together.

To do that, we simply call RandomForestRegressor. By default it creates 10 of what scikit-learn calls estimators; an estimator is a tree, so this is going to create 10 trees. So we go ahead and train it, still just on our little random subset of 20,000, and then let's take a look at one example.
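A bare-bones hand-rolled version of that procedure, to make the idea concrete (the tenth-of-the-data subsets and the count of 100 are the illustrative numbers from above; y_train assumed to be a numpy array):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
trees = []
for _ in range(100):
    # a different random tenth of the rows for every tree
    idx = rng.choice(len(X_train), size=len(X_train) // 10, replace=False)
    t = DecisionTreeRegressor()                # grown to purity: overfits its subset
    trees.append(t.fit(X_train.iloc[idx], y_train[idx]))

# average the trees' predictions: the uncorrelated errors tend to cancel out
bagged_preds = np.mean([t.predict(X_valid) for t in trees], axis=0)
```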
Can you pass the box to Devin? "So, just to make sure I'm understanding this: you're saying we take 10 kind-of-crappy models, we average 10 crappy models, and we get a good model?" Exactly, because the crappy models are based on different random subsets, so their errors are not correlated with each other. If the errors were correlated with each other, this wouldn't work. So the key insight here is to construct multiple models which are better than nothing, and where the errors are, as much as possible, not correlated with each other. "So is there a certain number of trees that we need for it to be valid?" There's no such thing as valid or invalid here; there's only "has a good validation-set RMSE" or not. And that's what we're going to look at: how to make that metric better. So this is the first of our hyperparameters, and we're going to learn about how to tune hyperparameters; the first one is going to be the number of trees, and we're about to look at it now.

Yes, Mesley? "The subsets you're selecting: are they exclusive, or can they overlap?" Yeah, so I mentioned one approach would be to pick out, say, a tenth at random. But what scikit-learn actually does by default is, for n rows, pick out n rows with replacement; that's called bootstrapping. If memory serves me correctly, that gets you, on average, 63.2% of the rows represented, with a bunch of them represented multiple times. Yeah, sure, again: rather than just picking out a tenth of the rows at random, from an n-row data set we pick out n rows with replacement, which on average means about 63.2% of the rows will be represented, and many of those rows will appear multiple times.
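That 63.2% figure (1 − 1/e in the limit) is easy to check empirically:

```python
import numpy as np

n = 100_000
sample = np.random.randint(0, n, size=n)   # n draws with replacement
print(len(np.unique(sample)) / n)          # ~0.632, i.e. about 1 - 1/e
```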
I think there's a question behind you. "In essence, what this model is doing, if I understand correctly, is just picking out the data points that look most similar to the one you're looking at." Yeah, that's a great insight; that is kind of what a tree is doing. "Isn't that quite a complicated way of going about it? There would be other ways of assessing similarity." There are other ways of assessing similarity, but what's interesting about this way is that it's doing it in tree space. We're basically saying, in this case, for this little tree: what are the 593 samples closest to this one, and what's their average, where closest means closest in tree space. Other ways of doing that (and we'll learn about k-nearest neighbours later in this course) would be to use, say, Euclidean distance. But here's the thing: the whole point of machine learning is to identify which variables actually matter the most, and how they relate to each other and to your dependent variable together.

Imagine a synthetic data set where you create five variables that add together to create your dependent variable, plus 95 variables which are entirely random and don't impact the dependent variable. If you then do a k-nearest neighbours in Euclidean space, you're going to get meaningless nearest neighbours, because most of your columns are actually meaningless. Or imagine your actual relationship is that your dependent variable equals x1 times x2. Then you actually need to find that interaction: you don't care how close a point is on x1 or on x2 individually, but how close the product is. So the entire purpose of modeling in machine learning is to find a model which tells you which variables are important and how they interact together to drive your dependent variable. You'll find in practice that the difference between using tree space (random forest space) to find your nearest neighbours and using Euclidean space is the difference between a model that makes good predictions and a model that makes meaningless predictions.

Melissa, do we have a break now? "You did, but I feel like we've got only 35 minutes left." Okay, let's keep going.

So, in general, a machine learning model which is effective is one which is accurate when you look at the training data (it's accurate at actually finding the relationships in that training data) and then generalizes well to new data. In bagging, that means that each of your individual estimators, your individual trees, should be as predictive as possible, but the predictions of your individual trees should be as uncorrelated as possible. The inventor of random forests talks about this at length in the original paper that introduced them in the late '90s: this idea of trying to come up with predictive but poorly correlated trees. The research community in recent years has generally found that the more important thing seems to be creating uncorrelated trees rather than individually accurate ones; more recent advances tend to create trees which are less predictive on their own, but also less correlated with each other.

For example, in scikit-learn there's another class you can use called ExtraTreesRegressor (or ExtraTreesClassifier) with exactly the same API. You can try it tonight: just replace my RandomForestRegressor with it. It's called an extremely randomized trees model, and what it does is exactly the same as what we just discussed, except that rather than trying every split of every variable, it randomly tries a few splits of a few variables. So it's much faster to train and it has more randomness; but then, with the time you save, you can build more trees and therefore get better generalization. So in practice, if you've got crappy individual models, you just need more trees to get a good final model.

Melissa, could you pass that over to Devon? "Could you talk a little bit more about what you mean by uncorrelated trees?"
Yeah. If I build a thousand trees, each one on just 10 data points, then it's quite likely that the 10 data points for every tree will be totally different, and so it's quite likely that those thousand trees will give totally different answers from each other. The correlation between the predictions of tree one and tree two will be very small, between tree one and tree three very small, and so forth. On the other hand, if I create a thousand trees where each time I use the entire data set with just one element removed, all those trees will be nearly identical, i.e., their predictions will be highly correlated. In the latter case the ensemble probably isn't going to generalize much better than one tree, whereas in the former case the individual trees aren't going to be very predictive. So I need to find a nice in-between.

So, yes, Daniel? "Is there a case where you'd want to use one over the other, any particular times?" Yeah, so again, hyperparameter tuning. Do you mean random forests versus extremely randomized trees? Again, it's a hyperparameter: which tree architecture do we use? We're going to talk about that now. Can you pass that to Dina?

"I was just trying to understand how this random forest actually makes sense for continuous variables. I'm assuming you build a tree structure, and at the last, final nodes you'd be saying, maybe, this node represents category A or category B. But how does it make sense for a continuous target?" That is actually what we have here: the value here is the average. This is the average log of price for this subgroup, and that's all we do; the prediction is the average of the value of the dependent variable in that leaf node. "So that means, finally, if you have just 10 leaf nodes, you only have 10 possible values?" Yes, if it were only one tree. A couple of things to remember: the first is that by default we're going to train the tree all the way down until the leaf nodes are of size one, which means for a data set with n rows we're going to have n leaf nodes. And then we're going to have multiple trees, which we average together. So in practice we're going to have lots of different possible values.

Is there a question behind you? "For a continuous variable, how do we decide which value to split at? There can be many values." We try every possible value of that variable in the training set. "Won't that be computationally expensive?" This is where it's very good to remember that your CPU's performance is measured in gigahertz, billions of clock cycles per second; it has multiple cores; and each core has something called SIMD, single instruction multiple data, where it can do up to eight computations per core at once. And if you do it on the GPU, performance is measured in teraflops, trillions of floating point operations per second. When it comes to designing algorithms, it's very difficult for us mere humans to realize how brute-force an algorithm can afford to be, given how fast today's computers are. So yes, it's quite a few operations, but at trillions of operations per second, you hardly notice it.

Marsha: "I have a question. So essentially at each node we make a decision: which variable to use and which split point. But one thing I can't understand: we have an MSE calculated for each node, right?"
"So this is one of our decision criteria. But this MSE: which model is it calculated for? Which model underlies it?" The model, for the initial root node, is: what if we just predicted the average, which here is 10.098? Just the average. Then the next model is: what if we predicted the average of the people with coupler_system equals false, and the average of the people with coupler_system equals true? And the next: what if we predicted the average of those with coupler_system equals true and YearMade up to 1986? And so on. "Is it always the average? Or can we use the median, or even run a linear regression?" There are all kinds of things we could do; in practice, the average works really well. There are kinds of trees (they're not called random forests) where the leaf nodes are independent linear regressions. They're not terribly widely used, but there are certainly researchers who have worked on them. Okay? Thank you.

Pass it back over to Ford, and then to Jake. "So this tree has a depth of three, and then in one of the next commands we get rid of the max depth. Does the tree without the max depth contain the tree with the depth of three, like, by definition?" Yeah, except in this case we've added randomness. But if you turn bootstrapping off, then yes: the less deep tree is how the deeper tree starts, and then it just keeps splitting. "Okay. So if you have many trees, you're going to have different leaf nodes across trees. How do you average leaf nodes across different trees?" We just take, say, the first row in the validation set and run it through the first tree, and find its average there, 9.28. Then we run it through the next tree and find its average in the second tree, 9.95, and so forth. And we're about to do that, so you'll see it.

Okay, so let's try it. After you've built a random forest, each tree is stored in an attribute called estimators_. One of the things you guys need to be very, very comfortable with is using list comprehensions; I hope you've all been practicing. Here I'm using a list comprehension to go through each tree in my model and call predict on it with my validation set. That gives me a list of arrays of predictions: each array is all of the predictions for one tree, and I have 10 trees. np.stack concatenates them together on a new axis, so after I run this and call .shape, you can see the first axis is 10, meaning I have my 10 different sets of predictions, and for each one my validation set is of size 12,000. So here are my 12,000 predictions for each of the 10 trees.

Then let's take the first row of that and print it out: here are 10 predictions, one from each tree. If we take the mean of those, here is the mean of the 10 predictions. And what was the actual? The actual was 9.1, and our prediction was 9.07. So you see how none of our individual trees made very good predictions, but the mean of them was actually pretty good. This is what I mean when I talk about experimenting: Jupyter Notebook is great for experimenting.
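Here is that experiment in full, using scikit-learn's estimators_ attribute (shapes and values are the ones from this notebook):

```python
import numpy as np

# one array of 12,000 validation predictions per tree, stacked on a new axis
preds = np.stack([t.predict(X_valid) for t in m.estimators_])
print(preds.shape)             # (10, 12000)

print(preds[:, 0])             # the 10 trees' predictions for the first row
print(np.mean(preds[:, 0]))    # their mean: ~9.07 here
print(y_valid[0])              # the actual: ~9.1
```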
This is the kind of stuff I mean: dig inside these objects and look at them, plot them, take your own averages, cross-check to make sure they work the way you thought they did. Write your own implementation of R² and make sure it's the same as the scikit-learn version. Plot things.

Here's an interesting plot I did. Let's go through each of the 10 trees and take the mean of all of the predictions up to the i-th tree: start by predicting based on just the first tree, then the first two trees, then the first three, and then plot the R². So here's the R² of just the first tree; here's the R² of the first two trees, three trees, four trees, and so on, up to 10 trees. Not surprisingly, R² keeps improving, because the more estimators we have, the more bagging we're doing, and the better it's going to generalize. And you should find that the last number there, a bit under 0.86, matches this number here. (Let's rerun that... okay, so it's actually slightly above 0.86.) Again, these are the cross-checks you can do, the things you can visualize, to deepen your understanding.

So as we add more trees, our R² improves, but it seems to flatten out after a while. We might guess that if we increase the number of estimators to 20, it's maybe not going to be that much better. Let's see: we get 0.862 versus 0.860, so doubling the number of trees didn't help very much. Double it again: 0.867. Double it again: 0.869. So you can see there's some point at which you're not going to want to add more trees; not because the model ever gets worse (every tree gives you more semi-random models to bag together), but because it stops improving things much. So this is the first hyperparameter you learn to set, the number of estimators, and the method for setting it is: as many as you have time to fit, and as seem to actually be helping.

In practice, we're going to learn to set a few more hyperparameters. Adding more trees slows things down, but with fewer trees you can still get the same insights. So I've built most of my models in practice with something like 20 to 30 trees, and it's only at the end of the project, or maybe at the end of the day's work, that I'll try, say, a thousand trees and run it overnight.

Was there a question? Yes, can we pass that to Prince? "So each tree might have different estimators, different combinations of estimators?" Each tree is an estimator; the words are synonyms here. In scikit-learn, when they say estimator, they mean tree. "I mean features: each tree will have different break points on different columns. If at the end we want to look at the important features...?" We'll get to that. After we finish with setting hyperparameters, the next stage of the course will be learning about what the model tells us about the data. If you need to know it now for your projects, feel free to look ahead: Lesson 2, RF interpretation, is where you can see it.
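Given the stacked preds from before, that plot is a one-liner (r2_score is scikit-learn's R² metric):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

# R^2 of the ensemble built from just the first i+1 trees, for i = 0..9
plt.plot([r2_score(y_valid, np.mean(preds[:i + 1], axis=0)) for i in range(10)])
plt.show()
```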
Okay, so that's our first hyperparameter. I want to talk next about the out-of-bag (OOB) score. Sometimes your data set will be kind of small, and you won't want to pull out a validation set, because doing so means you no longer have enough data to build a good model. What do you do? There's a cool trick which is pretty much unique to random forests, and it's this. What we can do is recognize that for our first tree, some of our rows didn't get used. So we could pass those rows through the first tree and treat them as a validation set for it. For the second tree, we could pass through the rows that weren't used for the second tree, creating a validation set for that one. Effectively, we'd have a different validation set for each tree. Then, to calculate our prediction for a row, we average all of the trees where that row was not used for training. For tree number one, that would be the rows I've marked in blue here; for tree number two, maybe it turned out to be this one, this one, this one and this one; and so forth. As long as you've got enough trees, every row is going to appear in the out-of-bag sample for at least one of them, so you'll be averaging, hopefully, a few trees. And if you've got a hundred trees, it's very likely that all of the rows will appear many times in these out-of-bag samples.

So what you can do is create an out-of-bag prediction by averaging, for each row, all the trees that didn't use that row for training, and then calculate your root mean squared error, R², and so on, on that. If you pass oob_score=True to scikit-learn, it will do this for you, and it will create an attribute called oob_score_. My little print_score function adds that attribute to the printout if it exists. So if you take a look here, with oob_score=True we've now got one extra number, and it's the R² for the OOB sample; it's very similar to the R² on the validation set, which is what we hoped for.

Can we pass it over? "Is it the case that the OOB score must be mathematically lower than the one for our entire forest?" Well, it's certainly not true that the prediction is lower; it's possible that the accuracy, the R², is lower. It's not mathematically necessary, but it is going to be true on average, because each row appears in fewer trees in the OOB samples than it does in the full set of trees. As you see here, it's a little less good. So, in general (and it's a great insight, Chris), the OOB R² will slightly underestimate how generalizable the model is. The more trees you add, the less serious that underestimation is.
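On the scikit-learn side this is just one flag:

```python
from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(n_estimators=40, oob_score=True, n_jobs=-1)
m.fit(X_train, y_train)
print(m.oob_score_)   # R^2 where each row is scored only by trees that never saw it
```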
Okay, so this OOB score is super handy, and one of the things it's super handy for is this: you're going to see there are quite a few hyperparameters we're going to set, and we'd like some automated way to set them. One way to do that is what's called a grid search. Scikit-learn has grid search functionality where you pass in a list of all the hyperparameters you want to tune and, for each one, a list of all the values of that hyperparameter you want to try; it then runs your model on every possible combination of those hyperparameters and tells you which one is best. And the OOB score is a great choice of metric for getting it to tell you which combination is best. That's an example of something you can do with OOB that works well.
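As a sketch of the grid-search idea, here's scikit-learn's GridSearchCV. Note that out of the box it scores by cross-validation rather than by OOB, and the parameter values below are just illustrative:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    # Every combination of these values will be tried
    param_grid = {
        'n_estimators': [20, 40, 80],
        'min_samples_leaf': [1, 3, 5],
        'max_features': [0.5, 'sqrt', None],
    }

    gs = GridSearchCV(RandomForestRegressor(n_jobs=-1), param_grid, cv=3)
    gs.fit(X_train, y_train)
    print(gs.best_params_, gs.best_score_)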
Now, if you think about it, I did something pretty dumb earlier: I took a subset of 30,000 rows of the data and built all my models on that, which means every tree in my random forest used a different subset of that subset of 30,000. Why do that? Why not leave the entire set of roughly 400,000 records as is and, if I want to make things faster, pick a different subset of 30,000 each time? In other words, rather than bootstrapping the entire set of rows for each tree, let's just have each tree randomly sample a subset of the data.

And we can do that. So let's go back and re-call proc_df without the subset parameter to get all of our data again. To remind you, that's about 400,000 rows in the whole data frame, of which 389,000 are in our training set. And instead we're going to call set_rf_samples(20000); remember, of the 30,000 subset we used before, 20,000 were in our training set. If I do this, then when I run a random forest it's not going to bootstrap an entire set of 389,000 rows for each tree; it will just grab a subset of 20,000 rows. So now if I run this, it still runs just as quickly as if I had originally done a random sample of 20,000, but now every tree has access to the whole data set, so with enough estimators it will eventually see everything.

In this case, with 10 trees, which is the default, I get an r squared of 0.86, which is about the same as my r squared with the 20,000 subset. That's because I haven't used many estimators yet. If I increase the number of estimators, it makes more of a difference: with 40 estimators it takes a little longer to run, but each tree gets to draw on a larger portion of the data set, and as you can see the r squared has gone up from 0.86 to 0.876. So this is a great approach, and for those of you doing the groceries competition, which has something like 120 million rows, there's no way you'd want to build a random forest using 120 million rows in every tree; it would take forever. What you could do is use set_rf_samples with, I don't know, 100,000 or a million; play around with it. The trick here is that with a random forest using this technique, no data set is too big. I don't care if it's got 100 billion rows: you can create a bunch of trees, each on a different random subset.

Can somebody pass the mic? "My question is about the OOB score with this: does it take the held-out rows only from the sample, or from all of the rows?" That's a great question. Unfortunately, scikit-learn doesn't support this functionality out of the box, so I had to write it, and it's kind of a horrible hack, because we'd much rather pass in a sample-size parameter than do this kind of set-up. What set_rf_samples actually does, if you look at the source code, is replace an internal function that scikit-learn calls with a lambda function that has the behavior we want. Unfortunately, the current version doesn't change how OOB is calculated, so at this stage OOB scores and set_rf_samples are not compatible with each other: you need to set oob_score=False if you use this approach. I hope to fix that, but it's not fixed yet. And if you want to turn the sampling off, you just call reset_rf_samples, which returns things to how they were.
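For reference, the fast.ai implementation of that hack looks essentially like this. It monkey-patches the internal helper that scikit-learn used at the time to draw each tree's bootstrap sample; treat it as a sketch of the trick, since newer scikit-learn versions have moved and changed the signature of that internal function:

    from sklearn.ensemble import forest

    # Make each tree draw a random sample of n rows instead of a
    # full-size bootstrap of the whole training set
    def set_rf_samples(n):
        forest._generate_sample_indices = (lambda rs, n_samples:
            forest.check_random_state(rs).randint(0, n_samples, n))

    # Undo set_rf_samples, restoring the default bootstrap behavior
    def reset_rf_samples():
        forest._generate_sample_indices = (lambda rs, n_samples:
            forest.check_random_state(rs).randint(0, n_samples, n_samples))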
Okay, so in practice, when I'm doing interactive machine learning with random forests, in order to explore my model and explore hyperparameters (plus the stuff we'll learn in a future lesson, where we analyze feature importance, partial dependence and so forth), I generally use subsets and reasonably small forests. All the insights I'm going to get are exactly the same as from the big versions, but I can run it in three or four seconds rather than hours. This is one of the biggest tips I can give you, and very, very few people in industry or academia actually do it. Most people run all of their models on all of the data all of the time, using their best possible parameters, which is just pointless. If you're trying to find out which features are important and how they relate to each other, having that fourth decimal place of accuracy isn't going to change any of your insights at all. So do most of your modeling on a sample size that's large enough for your accuracy to be reasonable (within a reasonable distance of the best accuracy you can get) and that trains in a small number of seconds, so you can do your analysis interactively.

There are a couple more parameters I want to talk about, so I'm going to call reset_rf_samples to get back to our full data set, because in this case, at least on this computer, it runs in less than 10 seconds. Here's our baseline: 40 estimators, and each of those 40 estimators will train all the way down until the leaf nodes have just one sample in them. That takes a few seconds to run. Here we go: that gets us a 0.898 r squared on the validation set, and 0.908 on the OOB. In this case the OOB is better. Why is it better? Well, remember that our validation set is not a random sample; it's a different time period, and it's much harder to predict a different time period than to predict random rows from the same one. That's why these two numbers aren't the way around we expected.

The first parameter we can try fiddling with is min_samples_leaf. Setting min_samples_leaf=3 says: stop training the tree further when your leaf node has three or fewer samples in it, rather than going all the way down to one. In practice this means there will be one or two fewer levels of decisions, so roughly half the number of decision criteria to compute, which means the model trains more quickly. It also means that when we look at an individual tree, rather than taking one point we're taking the average of at least three points, so we'd expect each tree to generalize a little better, though each tree will probably be slightly less powerful on its own. Values of min_samples_leaf that I find work well are 1, 3, 5, 10, 25: that kind of range. But sometimes, if you've got a really big data set and you're not using small samples, you might need a min_samples_leaf in the hundreds or thousands. So you've got to think about how big your subsamples are and try things out. In this case, going from the default of 1 to 3 has increased our validation set r squared from 0.898 to 0.902, a slight improvement, and it trains a little faster as well.

Something else you can try (and since min_samples_leaf worked, I'm going to leave it in) is adding max_features=0.5. What does max_features do? The idea is that the less correlated your trees are with each other, the better. Imagine you had one column that was so much more predictive than all the others that every single tree you built, regardless of which subset of rows it got, always started with that column. The trees would all be pretty similar. But you can imagine there might be some interaction of variables that's more important than that individual column, and if every tree always splits on the same thing first, you won't get much variation among the trees. So what we do is, in addition to taking a subset of rows, take a different subset of columns at every single split point. It's slightly different from the row sampling: with row sampling, each new tree is based on a random set of rows; with column sampling, every individual binary split chooses from a different random subset of columns. In other words, rather than looking at every possible level of every possible column, we look at every possible level of a random subset of columns, and each decision point, each binary split, uses a different random subset. How many? You get to pick: 0.5 means randomly choose half of them. The default is to use all of them, and there are also a couple of special values you can pass to max_features: 'sqrt' for the square root of the number of features, or 'log2' for log base two of them. In practice, good values I've found are 1, 0.5, 'log2', and 'sqrt'; that range gives you a nice bit of variation.
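Putting the pieces from this section together, the model looks something like this. A sketch, again using the notebook's assumed variable names:

    from sklearn.ensemble import RandomForestRegressor

    # min_samples_leaf=3: stop splitting once a leaf holds 3 rows or fewer
    # max_features=0.5: consider a random half of the columns at each split
    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                              max_features=0.5, n_jobs=-1, oob_score=True)
    m.fit(X_train, y_train)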
All right, can somebody pass it to Daniel? "Just to clarify: does it break the set of columns up smaller each time it goes through the tree, or does it take half of whatever's left over, whatever hasn't been touched, each time?" There's no such thing as what's left over. After you've split on YearMade less than or greater than 1984, YearMade is still there, so later on you might split on YearMade less than or greater than 1989. Each time, rather than checking every variable to see where its best split is, you check half of them, and the next time you check a different half. "But as you get further down toward the leaves, won't you have fewer options?" No, you never remove the variables. You can use them again and again and again, because each one has lots of different split points. Imagine, for example, that the relationship between YearMade and price was entirely linear. Then the best the tree could do to model it would be to split here, then split here and here, and split and split and split, approximating the line as a staircase of steps. "What if they're binary?" Even if they're binary, most random forest libraries don't do anything special about that. They just go: okay, we'll try this variable; oh, it turns out there's only one split point left. They don't do any kind of clever bookkeeping.
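You can see that staircase behavior for yourself with a tiny tree on synthetic data. This is purely illustrative; nothing here comes from the lesson's notebook:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # A perfectly linear relationship between "year made" and price
    x = np.arange(1960, 2010).reshape(-1, 1)
    y = (x.ravel() - 1960) * 100.0

    # A shallow tree can only approximate the line with a few flat steps
    tree = DecisionTreeRegressor(max_depth=2).fit(x, y)
    print(tree.predict([[1965], [1985], [2005]]))  # plateau values, not a line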
Okay, so if we add max_features=0.5, the r squared goes up from 0.901 to 0.906, so that's better still. And as we've been doing this, you'll hopefully have noticed that our root mean squared error of log price has been dropping on the validation set as well; it's now down to 0.2286. So how good is that? Our totally untuned random forest got us roughly into the top 25 percent. Now, remember, our validation set isn't identical to the Kaggle test set, and this competition is unfortunately old enough that you can't even put in an after-the-deadline entry to find out how you would have gone, so we can only approximate how we would have done; but generally speaking it's going to be a pretty good approximation. Here's the public leaderboard: 0.2286 comes in at about 14th or 15th place. So, roughly speaking, it looks like we'd be about in the top 20 of this competition with a basically brainless random forest and some brainless minor hyperparameter tuning.

This is why the random forest is such an important first step, and often the only step, for machine learning: it's hard to screw up. Even when we didn't tune the hyperparameters, we still got a good result, and then a small amount of hyperparameter tuning got us a much better one. Whereas other kinds of model, and I'm particularly thinking of linear-type models, which have a whole bunch of statistical assumptions and require you to get a whole bunch of things right before they work at all, can really throw you off track, because they can give you totally wrong answers about how accurate the predictions are. Random forests, by contrast, tend to work on most data sets, most of the time, with most sets of hyperparameters.

For example, consider what we did with our categorical variables. Let's take a look at our single tree. Look at this: fiProductClassDesc <= 7.5. What does that mean? Here are some examples of that column. So what does it mean to be less than or equal to 7? Well, we'd have to look at .cat.categories to find out: the categories are coded zero, one, two, three, four, five, six, seven. So what it's done is create a split where all of the backhoe loaders and these three types of hydraulic excavator end up in one group, and everything else is in the other group. That seems weird; these aren't even in a meaningful order. We could have put them in order if we'd bothered to tell pandas that the categories are ordered, but we hadn't. So how come this even works? Because when we turn the column into codes, those integer codes are what the random forest actually sees.

To think this through, imagine the only thing that mattered was whether the machine was a 'hydraulic excavator, 0 to 2 metric tons', and nothing else mattered at all, so the tree has to pick out that single level. Well, it can do that. First, it can split everything less than or equal to 7 from everything greater than 7, creating one group here and another group there. Then, within the first group, it can split everything less than or equal to 6 from greater than 6, which picks out that one item. So with two split points, it can pull out a single category. This is why it works: the tree is infinitely flexible, so even with a categorical variable, if there are particular categories with different price levels, it can gradually zoom in on those groups using multiple splits. You can help it by telling it the order of your categorical variable, but even if you don't, it's okay; it just takes a few more decisions to get there. And you can see it's actually using fiProductClassDesc quite a few times, and as you go deeper down the tree, you'll see it used more and more.
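If you did want to tell pandas the order, as just mentioned, it looks something like this, shown for the UsageBand column from earlier in the notebook (the names here are assumptions; older pandas versions also accepted an inplace flag on set_categories):

    # Give the categories a meaningful order, so the integer codes
    # (and therefore the tree's split thresholds) follow that order
    df_raw.UsageBand = df_raw.UsageBand.cat.set_categories(
        ['High', 'Medium', 'Low'], ordered=True)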
Whereas in a linear model, or almost any other kind of model, certainly any non-tree model, encoding a categorical variable like this won't work at all, because there's no linear relationship between totally arbitrary identifiers and anything. So these are the kinds of things that make random forests very easy to use and very resilient. And by using them, we've gotten ourselves a model that's clearly world class at this point, probably well into the top 20 of this Kaggle competition, and in our next lesson we'll learn how to analyze that model to learn more about the data and make it even better.

Great, so this week, really experiment. Dig inside the objects and have a look, try to draw the trees, try to plot the different errors, and maybe try different data sets to see how they work. Really experiment to get a sense of things, and try to replicate things: write your own r squared (see the sketch below), write your own versions of some of these functions. See how much you can really learn about your data set and about the random forest. Great. See you on Thursday.
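As a starting point for that 'write your own r squared' exercise, here's a minimal sketch you can cross-check against sklearn.metrics.r2_score:

    import numpy as np

    def r_squared(y_true, y_pred):
        # 1 - (residual sum of squares) / (total sum of squares)
        ss_res = ((y_true - y_pred) ** 2).sum()
        ss_tot = ((y_true - np.mean(y_true)) ** 2).sum()
        return 1 - ss_res / ss_tot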