Hello, and welcome to the start of the video series on random forests. Random forests, as you've learned in this lesson, are a great tool for making predictions: essentially, they build a lot of individual decision trees and then merge them together to get a more stable and often more accurate prediction. Within this lesson, we're going to start with classification using random forests, and then we're going to get into regression with random forests. That's one of the nice things about this technique: you can use it for both classification of groups and regression of numbers. So with that, let's go ahead and dive in.

We're here in Google Colab, and we've got some libraries that we need to load. One thing you might notice, depending on when you start working with random forests, is that TensorFlow Decision Forests is a new library for us, and sometimes, depending on your runtime, you need to install it first. Right now you can see a squiggly line underneath the library, which tells me something is wrong, and if I run this block of commands that loads all my libraries, we get the error "No module named 'tensorflow_decision_forests'". If you get that error, all you need to do is run this pip install line to install the package in Google Colab. The reason this happens is that TensorFlow Decision Forests is not a standard package, so we need to install it ourselves. It runs through all of these lines, and at the end it tells us that it successfully installed the library. Now I can go through and import all of the libraries.

Once our libraries are installed and our Google Drive is mounted, we can get into the actual problem: we are going to try to predict the type of house based on energy consumption.
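For reference, the install step described above looks roughly like this in a Colab cell (a minimal sketch; in a notebook you prefix the command with "!" so Colab runs it as a shell command):

```shell
# Run once per Colab runtime if the import fails
# (in a notebook cell, write it as: !pip install tensorflow_decision_forests)
pip install tensorflow_decision_forests
```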
To set the scene, imagine you're working at a utility that wants to better understand its customer base. They want to figure out which housing unit types exist among their customers without needing to visit those specific houses. So we are going to try to predict the housing unit type based on feature variables that you can easily derive from bills.

To give some further background on our data set, we will be predicting TYPEHUQ, the housing unit type, and there are five categories: mobile home, single-family detached, single-family attached, small apartment, and large apartment. I want to point out that it's really convenient that our data is already coded numerically, because that's a requirement for the random forest package. If you were running this with a different data set that was labeled, say, A, B, C, D, you would need to convert those labels into numbers to get this model to run correctly.

Before we get started, I've already set up the data set so we are only looking at the energy use data. We've got our response variable, and I've selected several predictors, or explanatory variables, that could plausibly predict housing type from data available in a billing data set: electricity, natural gas, energy assistance, liquid propane, fuel oil, wood cords, and wood pellets.

So with that, let's get into the actual modeling. The first step is to split our data into training and test sets. I'm going to create both of these variables simultaneously by writing train, test on the left-hand side, which creates one variable called train and another called test. The command we are using is train_test_split: we give it our data set, and then we give it our test size. Because we want an 80% training / 20% test split, I specify 0.2 as my test size.
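The 80/20 split described above can be sketched like this, using scikit-learn's train_test_split. The DataFrame here is a small synthetic stand-in, not the actual energy data, and the column names are illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the energy-use data (column names are illustrative)
df = pd.DataFrame({
    "TYPEHUQ": [1, 2, 3, 4, 5] * 20,   # housing-unit type (response), coded 1-5
    "KWH": range(100),                 # electricity use
    "CUFEETNG": range(100, 200),       # natural gas use
})

# 80% training, 20% test; rows are shuffled before the split
train, test = train_test_split(df, test_size=0.2, random_state=42)

print(len(train), len(test))  # 80 20
```

Changing test_size (e.g. to 0.3) is the one-line tweak mentioned in the video if you want a larger or smaller holdout set.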
This is something you could change in your own work if you thought you'd get a better fit with a smaller or larger amount of test data. So let's look at what this training data set looks like. You can see random index numbers, because the split shuffled the data and then randomly pulled out 80%, and we've got our response variable and all of our explanatory variables.

But before we can use this in our TensorFlow model, we need to convert the data into a TensorFlow object. I'm going to create a new data set called train_ds. The command comes from TensorFlow Decision Forests, which I've imported under the nickname tfdf: we say tfdf.keras.pd_dataframe_to_tf_dataset. We give it the name of our data frame, train, and, in a very important step, we tell it which column to treat as the label, or response variable: TYPEHUQ. I'm going to copy that line and change a few words, because we need to do the same for our test data set, but the label stays the same. If we look at the result, it's essentially just an object. It isn't a data frame or even a dictionary like we're used to; it's an object stored in memory that the computer will know what to do with when we get to the model definition.

So that's step two. Now that we have our data set, we need to define the model, and we've got a few parameters to set here that tell the model how to actually build the random forest. I'm calling this model, and we're using that tfdf library again: tfdf.keras.RandomForestModel. This is where we tell it what we want the model to look like. The first thing we set is compute_oob_variable_importances, where OOB stands for out-of-bag, and we set it to True.
This tells the computer to calculate the importance of each of our predictors, or explanatory variables, which we'll use later when we do interpretation. Next, we set the number of trees. I'm going to set it to 15; it defaults to 300, and the more trees you use, the longer the model takes to run, so I'm using a lower number for the sake of the video. Then I'm going to set a max depth of 12. This controls how complex you want your model to be: a larger depth means more complexity, and therefore a greater chance of overfitting the data.

We can see that it's told us where it's actually storing this model. If we print it, we can see that it's once again one of these opaque objects; we can't actually see anything yet, but the computer knows where it is and knows how to use it. The last part of the model definition is an optional chance to specify metrics. We're going to specify accuracy, which is defined here: the way we're going to measure how good our model is is by how accurate it is, the total number of correct predictions divided by the total number of predictions.

At this point we've defined the model, but we haven't added any data, so if you look at this, there's nothing to suggest that this model is connected to our training data set yet. That's what we do when we fit the model in step three: we say model.fit(train_ds), using that TensorFlow object. As we run this, it prints some output, so it tells us what it's doing as it goes. Then it tells us that it took 5.6 seconds to read in the training data set, but then trained it in 0.78 seconds, so less than a second of training. It gives us some warnings, but we don't really need to worry about those, because we're not interested in this autograph command.
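The video uses TFDF's tfdf.keras.RandomForestModel, which isn't always easy to install locally, so as an analogous, self-contained sketch, here is the same configuration (15 trees, max depth 12, OOB estimates on) with scikit-learn's RandomForestClassifier on synthetic data. The parameter names differ between the two libraries, and the data here is random, not the energy data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 7))      # 7 fuel-usage features, as in the lesson
y = rng.integers(1, 6, size=500)   # 5 housing-type classes, coded 1-5

# Mirror the video's settings: 15 trees, max depth 12, out-of-bag on.
# (TFDF uses num_trees / max_depth / compute_oob_variable_importances;
#  scikit-learn's rough equivalents are n_estimators / max_depth / oob_score.)
model = RandomForestClassifier(n_estimators=15, max_depth=12,
                               oob_score=True, random_state=0)
model.fit(X, y)

# Per-feature importances, used later for interpretation
print(model.feature_importances_.round(3))
```

Note the trade-off the video mentions: fewer trees train faster but give a noisier ensemble, and a larger max depth increases the risk of overfitting.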
Then we can continue on, because it's told us that it has finished correctly. And the final step: we've now defined our model and fit it to the training data, but we want to see how accurate it is, so we're going to compute the accuracy with the test set. In this case, that's the 20% of the data that we held out in step one. We say model.evaluate, where model is what I named my model, and we give it the test object. We also say return_dict=True to return the result as a dictionary. I accidentally double-clicked and interrupted the cell, but here we can see that it's now printed out the accuracy for our model: it has an accuracy of 0.71, which means that 71% of the data points in the test set were classified correctly. We'll stop here for this video, and in the next one we'll get into how to interpret the results of this classification model.
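The accuracy metric that model.evaluate reports is just the definition given earlier, correct predictions divided by total predictions. A minimal sketch of the computation, on made-up labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# 4 of these 5 predictions match the true labels
y_true = [1, 2, 2, 3, 1]
y_pred = [1, 2, 3, 3, 1]
print(accuracy(y_true, y_pred))  # 0.8
```

So an accuracy of 0.71 on the held-out set means 71 out of every 100 test houses were assigned the right housing-unit type.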