We've come to the last lecture, a very exciting one, and that is on random forests. It's really one of the better and more exciting machine learning algorithms that we can use, and these days it's becoming more and more important. People are understanding how useful it can be, and in many cases it outperforms something like a deep neural network. So in this last notebook, number 14, decision trees and random forests, we're going to start with the building block of a random forest, and that is the decision tree. Random forests, and something very similar called gradient boosted trees, are very commonly used machine learning techniques, and they are built up from this basic building block called a decision tree. The reason it has become so popular of late is that it's not only very accurate and useful, but also very interpretable, because a decision tree by nature looks very much like a flow diagram.

So I want to show you this little representation here, a schematic where we have a bowl of fruit containing some green apples, some oranges, and some bananas. We don't get to see the bowl; the bowl is there, and we instruct someone to pick up a fruit. Before they do, and before they answer some question pertaining to that fruit, we have very little information about what fruits are in the bowl. Our aim is to ask questions so that we can improve our knowledge; we gain information from asking those questions. So here's a very simplified decision tree analog in the image below. As you can see, we have feature variables and a target class. For the feature variables, we know the color, which is either green, orange, or yellow; we know the shape, either round or oblong; and we know the weight, in grams, say from 50 to 100 grams. In our target class we have green apples, oranges, and bananas. We don't get to see the bowl; we just ask someone to take a random fruit and tell us something about it, or at least answer one of our questions. One of the first questions we can ask, pertaining to these three feature variables, is: what is the color? Or, in this simplified analog, we could just ask: is it orange? If they say yes, we move down here, and of course then it is an orange; it can't be a yellow banana and it can't be a green apple. On the other side, if they say no, it's not orange, the next question we can ask is: is the fruit round? If yes, it's got to be a green apple, and if no, it's got to be a banana. That's not quite how a decision tree works, but it gives you an intuitive understanding of what is going on: we ask questions based on the feature variables that we have, and that allows us to gain information about what's in the bowl.

The terminology I want you to get used to here is this: the very first question that we ask, we call the root node, and all of these are nodes. When we ask a question, that's a node, and a node will have child nodes, and child nodes will have a parent node. We also have the idea of the depth of our tree, and that is how many levels of nodes there are beyond the root node. There's one, two, so the depth of this tree would be two. We also have this idea of what we have left behind.
There's a big difference between here, where we only have oranges, and here, where we might have bananas and green apples. So this node we will refer to as a pure node, and this node is an impure node; it still contains more than one class. And then all these nodes right at the end that are completely pure, or where we might just decide to stop at some point even though they're not pure, we call leaf nodes, leaves, or terminal nodes. So that's the terminology we have to get used to when we talk about decision trees. Of course, in a real tree we're not going to ask a yes/no question like "is it orange?"; we're just going to ask "what is the color?", and that is going to split into three child nodes: one would be green, one would be orange, and one would be yellow. That's how a true decision tree works, but the analog captures the idea that we ask questions, and from that we gain information.

So let's have a look at the packages that we're going to use: NumPy and Pandas, as we always do, and then a bunch of things from scikit-learn, which is our favorite library as far as machine learning algorithms are concerned. So let's import those. Plotly we're going to use for our plotting, and we might use Matplotlib as well. I've put Seaborn in there too; I want you to have a look at Seaborn, even though we're not going to use it here specifically. We're setting the backend display format in case we're using a retina display, and then, as usual, a helper to display tables.

So let's look at a proper decision tree. I have hand-created a Pandas data frame here. It has three feature variables and a target variable; I've called the feature variables cat1, cat2, and cat3, and the target variable target, and you can see how it's constructed. Of course, I'm just going to print that to the screen. This cat1 is a variable that contains three unique classes, or three sample space elements: the Roman numerals I, II, and III. For cat2 it will be A, B, and C, and for cat3 the numerical, discrete values 1, 2, and 3. Our target variable only takes no and yes, a binary target variable, so here we're dealing with a binary classification problem. And of course it is part of supervised learning, inasmuch as we know what the actual target is. Let's just have a look at the frequency of each of our sample space elements: we can see the counts of each of the Roman numerals, the sample space elements of cat1. We do the same for cat2, and for cat3, which, even though it's numerical, is of course a discrete variable, and we see the counts for 1, 2, and 3 there. Then let's have a quick look at our target variable: no class imbalance there, 11 nos and 10 yeses.

Now the question is our root node: which one of those three feature variables should be the root node? How do we decide? Well, for now we don't know how that works, so let's just go one by one, and we're just going to use the groupby method. What we're going to say is: let's make cat1 our root node and see what pops out. So we say df, for the data frame, dot groupby cat1; then, once you've done that groupby, look at the target variable and give us the value counts. Let's see what happens. For cat1, of course, it found the three values, Roman numerals I, II, and III. Where it found I, the child node will have six nos and four yeses, so that's going to be impure. The second child node is the one where the answer was Roman numeral II.
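As a minimal sketch of this step: the data frame below is only an illustrative stand-in (the notebook's hand-built frame has 21 rows, with the class counts described above), but the value_counts and groupby calls are the ones being discussed.

```python
import pandas as pd

# Illustrative stand-in for the hand-built frame in the notebook; the real one
# has 21 rows (11 no, 10 yes). Column names and sample space elements match the lecture.
df = pd.DataFrame({
    'cat1':   ['I',  'I',   'II', 'III', 'I',   'III', 'II', 'I'],
    'cat2':   ['A',  'B',   'B',  'C',   'A',   'B',   'C',  'C'],
    'cat3':   [1,    2,     3,    1,     2,     3,     1,    2],
    'target': ['no', 'yes', 'no', 'no',  'yes', 'yes', 'no', 'no'],
})

# Frequency of each sample space element, and the class balance of the target
print(df['cat1'].value_counts())
print(df['target'].value_counts())

# Try cat1 as the root node: what would its child nodes look like?
print(df.groupby('cat1')['target'].value_counts())
```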
The node for Roman numeral II will be pure, because all three observations that end up there are nos as far as the target is concerned, and the node for Roman numeral III will be impure again. So let's represent that with this nice little schematic here. Our root node was the cat1 variable: we asked a question about its sample space elements, so the child nodes just represent each of those sample space elements, and for cat1 we get I, II, and III. Looking at each of those child nodes at depth one: when cat1 was I we had an impure node, four yeses and six nos; when it was II we had a pure node, because all three were nos; and when it was III we also had an impure node. The pure one becomes a terminal node, but the other two are not pure, so we have to go on with them.

So what shall we do? Let's start with this one. It now becomes a parent node, because we are going to ask another question and that's going to split it into child nodes. We've done cat1, so for this node let's use cat2. Remember, we say df.loc where df cat1 equals equals I, so we're only looking at the observations that turned out to be I, and we group those by cat2: from this node we're asking, what is your cat2 value, but only for the observations where cat1 was I. Then we look at the value counts. We see that in cat2 there were As, Bs, and Cs: if it was A, it gives us an impure node; if it was B, an impure node; but if it was C, it gives us another pure node. So that's that node taken care of.

Now let's take the other impure child node and make it the parent node, and for that one let's also ask about cat2. So we say df.loc where cat1 equals equals III, so it's only this node here, group it by cat2 as well, and look at the value counts as far as the target column is concerned. You see we get only Bs and Cs from there, and neither of them is pure. So let's decide, instead of using cat2 on that split, to use cat3 there. It's still cat1 equals III, so we're still looking at this node, but instead of asking the question about cat2 we ask the question about cat3. If we do that, we see we have one, two, three child nodes, and they're all pure. So that gives you an understanding that asking about cat3 at that level was somehow better than using cat2 as the way to split this node.
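Assuming the stand-in data frame df from the sketch above, the conditional splits just described come down to filtering with .loc and grouping by the next candidate variable:

```python
# Child node where cat1 == 'I': split it again, this time asking about cat2
print(df.loc[df['cat1'] == 'I'].groupby('cat2')['target'].value_counts())

# Child node where cat1 == 'III': splitting on cat2 leaves impure children...
print(df.loc[df['cat1'] == 'III'].groupby('cat2')['target'].value_counts())

# ...whereas splitting on cat3 gives pure children (in the lecture's data)
print(df.loc[df['cat1'] == 'III'].groupby('cat3')['target'].value_counts())
```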
So there must be some way to measure which variable to choose at every level, and that has to do with gaining knowledge: with information gain. We see this equation here for information gain, IG, and it is very, very important; we have to develop some sense of what it means without knowing every detail of what is going on. In general terms, we have I(D_p), the impurity at the parent node, minus the average of the impurities of the child nodes. So we take the impurity we have at the parent node, we look at the impurity in each of the child nodes and average over those, and the gap, the difference, is the amount of information we gain: the parent node will have a higher impurity and we're trying to get to a lower impurity, and if it's a lot lower, that difference is a lot of information gained.

That is how a decision tree decides, at every root node or at every parent node, which of the variables to choose: it wants maximum information gain. We saw that intuitively here: choosing cat3 to split there, versus cat2, gained us more information. We can see that because those child nodes are all pure, whereas with cat2 the child nodes were not pure; going from a parent node that is impure, of course we choose the split that gives us the greater information gain.

So how do we measure impurity, then? There are two common ways: one is entropy, and the other is the Gini score, or Gini index. You can see the two of them there; the Gini index really is only for categorical variables. Both are useful, very powerful, and used many times, but we're going to discuss this idea of entropy. We see the formula has the summation symbol in it, and the logarithm base 2, so I just want to spend a few seconds on those two so that we have some idea of what is going on when we measure the impurity of a node.

Remember what the log means: if I write y equals log base 2 of x, what we're asking is, 2 to the power of what gives me x? Remember how powers work: 2 to the power 2 means 2 times 2, which is 4. So y equals log base 2 of x means 2 to the power y gives me x; we're asking 2 to the what gives me x, and that what is the y we're trying to get to. The summation symbol is just shorthand: we have a counter, and the counter counts up in increments of 1. We start the counter at i equals 1 and we end at 3, so whatever is at the bottom is where we start and the top is where we end, and we put whatever we want to iterate over, some value with a subscript i, so that would be x sub 1 plus x sub 2 plus x sub 3. The summation symbol sums over all of these terms; we just increment the counter i, and that i might not be a subscript, it might appear somewhere else in the expression, but every time you go to the next term of the sum you increment that counter by 1 until you reach the end. It's just a shorthand way of writing out a sum that could otherwise be quite long.

So let's talk about entropy, or more specifically Shannon entropy, and that's what we see here in equation 2. We have this negative up front, and that minus one we multiply by the summation from i equals 1 to c, where c is the number of classes that appear, of p sub i times log base 2 of p sub i; p sub i is the probability of one of the classes. Let's just go down and look at an example, because that will make a lot more sense. Let's have a look at what we did here again: for our first root node we said group by cat1, so cat1 was our decision, and we wanted to see how that turns out.
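As a minimal sketch of that formula, here is Shannon entropy written as a small NumPy function; the function name and the choice to pass raw class counts are my own here, not necessarily how the notebook writes it.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy of a node, given the class counts inside that node."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()     # p_i: the probability of each class
    p = p[p > 0]                  # log2(0) is undefined, so drop empty classes
    return -np.sum(p * np.log2(p))

print(entropy([6, 4]))   # impure node: 6 nos, 4 yeses -> about 0.971
print(entropy([3, 0]))   # pure node: 3 nos, 0 yeses   -> 0.0
```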
Now, we saw there were three of these child nodes, so we're going to go through them one by one. Let's start where cat1 was I: we had six nos and four yeses. Going way back up, that's what we have at this node, and we have to express an entropy for this four versus six. Remember, at the root we had 11 and 10, and that will give us an entropy as well; so we have to calculate an entropy for four versus six, for zero versus three, and for six versus two, calculate the entropy of the parent node, average over the child entropies, subtract, and look for the maximum information gain.

So let's look at the first one, where cat1 was I: there were six nos and four yeses, so there are two classes, no and yes, and those are the p's that we sum over. i is going to go from 1, and c is going to be 2, so it's p1 log base 2 of p1 plus p2 log base 2 of p2; there are only the two of them. If we make p1 the yes and p2 the no, it becomes: minus the probability of yes times the log base 2 of the probability of yes, minus the probability of no times the log base 2 of the probability of no (remember the minus sign up front: if you multiply the minus through the summation, every term becomes a minus). So what was the probability of yes? There were four yeses, and in total 6 plus 4 is 10, so there were 10 samples in that node; 4 over 10, or 0.4, is the probability of yes, which makes 6 over 10 the probability p2 of no. So it's minus 4 over 10 times log base 2 of 4 over 10, minus 6 over 10 times log base 2 of 6 over 10. You see it's really not that difficult. We do that, and the entropy, the impurity measured by entropy, of that first child node is 0.97.

Now let's look at the second one, which was pure. The logarithm of 0 is not defined, it's something you can't calculate, so we just leave that term out, we make it 0: 0 of the 3 were yeses, and 0 times anything is 0, so all we have is the nos. That's 3 out of 3 for no, times the log base 2 of 3 over 3, and remember, to take the log base 2 we use numpy.log2. We do that and we see that the impurity, using entropy, of the second node is 0: minimum entropy, and that's really what we're after, because it means it's a pure node. Then let's look at the node where cat1 was III: 6 out of the 8 were nos and 2 out of the 8 were yeses, and if we calculate the impurity there it's 0.81.

Now let's have a look at what the entropy of the parent node was. Remember I said there were 11 nos and 10 yeses; in this case that is the root node. We work that out with 10 over 21 and 11 over 21, and that's 0.99. So for our information gain, we take that first equation: I assigned this parent impurity to the computer variable start, so it's start minus the mean of all of those three child impurities, and if we do that we get the information gain if we chose cat1 as our root node. Now, instead of cat1, let's choose cat2 as our root node and see what happens. We group by cat2 and have a look at that: we have 5 nos and a single yes, 8 yeses and 3 nos, and 3 nos and 1 yes, and I very quickly use my equation there for entropy.
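Using the entropy helper sketched above, the whole calculation for the cat1 split looks roughly like this; the variable name start follows the lecture, and note that the lecture averages the child impurities with a plain, unweighted mean.

```python
# Entropy of the parent (root) node: 11 nos and 10 yeses
start = entropy([11, 10])                        # about 0.998

# Entropy of each child node if we split on cat1
children_cat1 = [entropy([6, 4]),                # cat1 == I   -> about 0.971
                 entropy([3, 0]),                # cat1 == II  -> 0.0 (pure)
                 entropy([6, 2])]                # cat1 == III -> about 0.811

# Information gain: parent impurity minus the mean of the child impurities
ig_cat1 = start - np.mean(children_cat1)
print(ig_cat1)                                   # about 0.404
```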
We calculate the three entropies, take the mean of those, and subtract it from the parent entropy, and now let's look at our information gain. Our information gain was 0.404 if we chose cat1 as our root node, and now it goes up to 0.46, so cat2 would be a better choice as our root node. Let's quickly go through cat3 as a choice of root node: that looks a lot better, there's lots more purity there if we look at it, and the information gain is much higher, 0.767. This is how a decision tree decides that cat3 would be best for the root node. Then each of those child nodes, three in this case because cat3 has the three classes 1, 2, and 3, will in turn become a parent node, and we go through all the variables again for each one of them to see which would be the best one to select. And I just want to let you know, you can absolutely reselect one of those variables; it's not the case that once you've used it you can't use it again. That's how you build the decision tree, because you want maximum information gain at every step.

Now, how far do we go? We can go on and on and on. We had a contrived little example, but in a bigger data set the depth of your tree can be enormous before you get to pure leaf nodes, and what will you get then? I think you can guess; we had that in the previous section. You can have high variance in your model: it's going to overfit the data that you have there, it's going to learn everything about the training data, and it's going to perform very poorly on unseen data. So that doesn't really work: a single decision tree is also very poor at averaging out over complicated data sets, and it does not generalize well to unseen data. One thing we could do is set a minimum information gain, so that if a split doesn't achieve that drop in entropy we say stop, we stop the bus right there; or we can build the whole tree right out to the bottom and then prune it back. Those would be hyperparameters, yes, hyperparameters that we can set in the design stage to decide how to build these trees.

So let's look at a decision tree classifier based on the data that we have, using the decision tree classes in scikit-learn. We have to tell it what type of data we're dealing with, so we get these two classes, LabelEncoder, which we've imported, and LabelBinarizer, and I'm just going to instantiate them and assign them to computer variables. Then we're going to encode all of our variables: we say label_encoder, which is now an instantiation of this class, and use the fit_transform method, so it's going to fit the values and transform them for cat1, and we assign that to the computer variable encoded_cat1. What the label encoder does is encode the labels, the sample space elements, because some of them were As, Bs, and Cs, some were Roman numerals, and some were 1s, 2s, and 3s; we just have to tell the algorithm that these are labels, nominal categorical variables, and we do that with the fit_transform method of the LabelEncoder class. We do that for all three. Our target variable had binary classes, only two classes, so there we use the LabelBinarizer: we take y, the target, and fit-transform it with the label binarizer.
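A minimal sketch of that encoding step, assuming the df built earlier (the notebook encodes its real 21-row frame the same way):

```python
from sklearn.preprocessing import LabelEncoder, LabelBinarizer

label_encoder = LabelEncoder()
label_binarizer = LabelBinarizer()

# Each nominal feature becomes integer labels 0, 1, 2
encoded_cat1 = label_encoder.fit_transform(df['cat1'])
encoded_cat2 = label_encoder.fit_transform(df['cat2'])
encoded_cat3 = label_encoder.fit_transform(df['cat3'])

# The binary target becomes a flat array of 0s and 1s
y = label_binarizer.fit_transform(df['target']).flatten()
```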
We also have to call the .flatten() method there, because the binarizer gives us a list of lists and we just want a single list as far as y is concerned. Now we have to build our feature set back up again, row by row, because we've changed all these values into individual NumPy arrays. What we're going to do is build a for loop: we have this empty list X, and then we say, for i in the range of the length of y, so for each one of those rows, take the first value of encoded cat1, the first value of encoded cat2, and the first value of encoded cat3 and put them together in a row; then i increases by 1, so for the next row it takes the second value of encoded cat1, the second value of encoded cat2, and the second value of encoded cat3. We just run through all of those and build X back up.

Now that we have that, we can instantiate our decision tree classifier; there's our class there, and the criterion we want, that is, how we measure impurity, is entropy. We assign that to the variable d_tree, and it remains for us to fit our data to the tree: d_tree.fit(X, y). It's going to do all those things that we discussed, choosing what to put where, and it's going to build that whole tree out for us. You can see all the hyperparameters that we can set, all the arguments in the DecisionTreeClassifier; we left all of them at the defaults, and we get a solution. By the way, if you run this on your local system and you have pydot and graphviz installed, you can actually draw a nice little graph of the tree, and I just leave you the code there that you can use to illustrate it.

Anyway, let's give it a new, unseen sample now. Remember, all those values went through the label encoder, so we can't use Roman numerals I, II, and III anymore, or As, Bs, and Cs, or 1s, 2s, and 3s; they were all transformed into encoded values, because remember there were only three classes, three sample space elements, in each. So our unknown sample would be Roman numeral I, B, and the value 1, and I have to put the encoded values in. If we pass that to the predict method, it predicts that this would be class 0, based on what it learned.

We can do that for all the values. Remember, we haven't split anything; we're just taking all of our feature variables, passing them to predict, and assigning the result to y_pred, and now we can measure how many were correct. We ask: is y, the actual values, equal to y_pred? That gives us a True or a False, and Falses are 0s and Trues are 1s, so we can sum over all the ones that were True, divide by how many there are, and that instantly gives us the fraction that were correct, our accuracy. Our accuracy is 90.47%. And if we look at the confusion matrix of this, it's exactly what we saw previously: the true labels on the one axis and the predicted values on the other, on the main diagonal the 0s correctly predicted as 0s and the 1s correctly predicted as 1s, and off the diagonal the ones that were incorrectly predicted. Once again we can designate one of these the positive and the other the negative outcome, and we can compute all the things we spoke about before: sensitivity, specificity, positive and negative predictive values, every other metric beyond just the accuracy.
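Pulling that together into a runnable sketch: the encoded unseen sample [0, 1, 0] below corresponds to Roman numeral I, B and the value 1 under the encoding just described, which is an assumption about how the encoder numbered the labels.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# Rebuild the feature matrix row by row from the three encoded columns
X = []
for i in range(len(y)):
    X.append([encoded_cat1[i], encoded_cat2[i], encoded_cat3[i]])

# Grow a tree that measures impurity with entropy
d_tree = DecisionTreeClassifier(criterion='entropy')
d_tree.fit(X, y)

# Predict an unseen, already-encoded sample
print(d_tree.predict([[0, 1, 0]]))

# Predict every observation and compute the accuracy by hand
y_pred = d_tree.predict(X)
print(sum(y == y_pred) / len(y))
print(confusion_matrix(y, y_pred))
```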
So let's quickly look at what a decision tree regressor would look like; it's not very different. This time we're going to create a data set for ourselves, making use of the make_regression function from scikit-learn. You see where we set 1000 samples, we want 4 feature variables, 2 of them being informative; this time there's a noise argument, which we set to 90 just to scramble up our data a little bit, and we set a random state as well. Then I just create the column names var 1, var 2, var 3, var 4 and build a little data frame so we can have a look at it. We can see we have numerical feature variables, but our target variable now is also continuous and numerical. And of course we do a little scatter matrix there, just to see, by looking at it, whether there is some signal in the data: if we look at the target against variable 3, for instance, there seems to be a good correlation there, so there's a bunch of information we can get from that.

We're going to instantiate the DecisionTreeRegressor class so that we have an instance of it, and we assign that to the regressor variable, a nice descriptive name. This time around we're going to go the whole hog: we will do a train-test split, because that is very important. We've seen how train_test_split works in the previous notebook; we just pass our feature variables, our target vector, and the size, splitting off 20%, and we set a random state. Remember that that gives us four objects and we have to name them appropriately: the training and test feature matrices, and then the training and test targets, in that order. If we quickly look at the shapes, just to verify, as always, that everything worked out properly, it looks good.

Now we just use the fit method: our regressor, the class that we've instantiated, we fit our training data to it, X train and y train, the features and target of our training set, and we assign that to a computer variable, in this instance dt_reg_model. Then we can see how accurate it is. Of course, we can't say how many were correct and how many were incorrect, because our target variable is now a continuous numerical variable, so in this instance we express a coefficient of determination, and that's what the score method gives us. We see our value there, 0.97, quite high. We can also do the prediction: I take X test and pass it to the predict method, so we take our trained model, use predict on all the X test values, and that gives us a predicted value for each of our observations, also continuous and numerical. What we can now build, very nicely, is a scatter plot of the actual numerical values against the predicted numerical values, and what we want to see is a very good correlation between those two. And indeed we do: look at that, there are all our actual values along the bottom, and our predicted values are very much in line, very accurate, as far as our R squared is concerned.
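A compact sketch of that regression walk-through; the random_state values are placeholders, since the lecture sets one but doesn't say which number is used.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: 1000 samples, 4 features (2 informative), plus noise
X, y = make_regression(n_samples=1000, n_features=4, n_informative=2,
                       noise=90, random_state=42)
df_reg = pd.DataFrame(X, columns=['var_1', 'var_2', 'var_3', 'var_4'])
df_reg['target'] = y

# Hold out 20% of the observations as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

regressor = DecisionTreeRegressor()
dt_reg_model = regressor.fit(X_train, y_train)

print(dt_reg_model.score(X_test, y_test))   # coefficient of determination (R squared)
y_pred = dt_reg_model.predict(X_test)       # for the actual-vs-predicted scatter plot
```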
So now that we've done that, let's look at a random forest. What that is going to do is combine multiple decision trees into one: what the computer, our code, scikit-learn, is going to do for us is build a bunch of trees and then somehow average over all of those trees, and that makes it what we call an ensemble technique. Techniques such as XGBoost, or boosted trees, are likewise just an ensemble of a bunch of trees, and by bringing them together and averaging over them somehow we get much, much better models. Now, the way a random forest works, it is going to sample only some of your data for every tree; it's not going to use all of the data, and it's not going to use all of your variables either, it selects only some of the variables, and each time it builds a tree it uses a different subset of them. So it's quite an intricate design, the way random forests work, and I encourage you to read up a little bit more about it. It's easy enough for us to code, though: we instantiate the RandomForestRegressor, assign it to a computer variable, and with that computer variable we just fit our data to this instance of the RandomForestRegressor; it's simply a different class than the decision tree class. So there we go, it was as simple as that. Now let's look at our score again: we pass X test and y test, and look at that, 0.99, very close to 1. And if we create predicted values for our test set, the scatter plot looks very good indeed. So all we're doing here is, in a very specified way, the random forest assembles a lot of decision trees and averages over all of them, giving us a much better model.
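The scikit-learn version of that step is only a couple of lines; this sketch reuses the train/test split from the regression example above.

```python
from sklearn.ensemble import RandomForestRegressor

# An ensemble of decision trees instead of a single tree
rf_regressor = RandomForestRegressor()
rf_reg_model = rf_regressor.fit(X_train, y_train)

print(rf_reg_model.score(X_test, y_test))   # noticeably better than the single tree
y_pred_rf = rf_reg_model.predict(X_test)    # for the actual-vs-predicted scatter plot
```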
So in this last section I want to introduce you to TensorFlow Decision Forests. A couple of words about TensorFlow: TensorFlow is a deep neural network architecture, a sophisticated form of machine learning. It's an open source architecture written and designed by the friendly folks over at Google; there's also PyTorch, developed by Facebook, and these tools are openly and freely available. It's written in a variety of languages, but usually it has what we call a Python wrapper around it; in other words, it's Python code that we type, but under the hood there is a more sophisticated language, in this case C++, and I use that term very loosely: a much faster, compiled language. TensorFlow, then, is all about deep neural networks, but at the time of this recording they've just released a new module for random forests, and it's called TensorFlow Decision Forests. If you run a Linux machine you can install it right now; it's a bit more tricky if you're running Windows or Mac, but of course here in Google Colab, which runs on Linux machines in Google's cloud, we can use it. It uses what are called the Yggdrasil decision forest C++ libraries (some of you will know what the term Yggdrasil refers to), and we can write very simple Python code to make use of them; it has rapidly become my favorite tool for designing random forests. So we have to install it first, and this is how: if you install your own version of Python on your system you can create virtual environments for all of your projects, and the way you install things into them is either through the package manager called conda or through pip. Here you will have exclamation mark pip, bang pip, and then install tensorflow_decision_forests. When you install packages like NumPy and Plotly for yourself, you'll use either conda install or pip install, and you can certainly learn how to do that.

So there we go, that's installed. Now I'm going to import tensorflow_decision_forests as tfdf, and I'm also going to import TensorFlow itself as tf. Let's do that import, and now we have those namespace abbreviations. Because it's so new, there are new versions coming out all the time, so I always like to see which version Google Colab currently has installed on its back end, 1.7 here; if I read up anything about it, I just make sure I'm reading up on the right version.

The data set we're going to work from we'll just download from the internet, and it is a data set on lovely penguins. The bang, that's the exclamation mark, then wget and the URL for where the data is: if you're downloading curated data that's already in the correct format from certain websites, you can just use the wget command, and that makes the file available in our Colab session's storage. We can then use read_csv to import that temporarily stored CSV file. Let's have a look at the shape of this data set, which I've called penguins instead of the usual df: penguins.shape gives 344 observations and 8 variables. That's a very small data set, and decision trees and random forests are very good for smaller data sets, as opposed to deep neural networks, which are hungry for data. Let's have a look at it: we have information about the species, which is what we're going to try and predict from the set of feature variables, and those are the island where the observation was noted, the bill length in millimeters, the bill depth, the flipper length, the body mass in grams, the sex, and the year in which the observation was made. You can very much see there's some data missing: there are 344 observations, but as far as bill length is concerned there are only 342. The beauty behind this Yggdrasil algorithm is that we don't have to worry about missing data; it will take care of all of that for us.

Let's have a look at the first five observations, just to get a sense: there's the species, there's the island, the bill length in millimeters, and you can see all the rest of them there. And let's see whether there's any class imbalance: we see the three different species, and the Chinstrap is a bit underrepresented, but not too bad. What we do have to do to use this algorithm, though, at least when it comes to the metrics at the end, is convert these names, these nominal values, into numbers. A quick and easy way to do that is to call the unique method on penguins.species, that series, that column, and then convert the result to a Python list. We see the three species there, and then we're going to use map: remember, in a Python list each element has an index, so the first species, Adelie, would be index 0, Gentoo would be index 1, the second one, and Chinstrap would have index 2, because Python is zero-indexed, so 0, 1, and 2. We map these values 0, 1, and 2 to each of the values in the species column: penguins.species gives us that column back, that Pandas series, and we map it to either 0, 1, or 2 depending on each observation; as it goes down that column row by row it's just going to assign the numbers 0, 1, and 2 to those three species.
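A sketch of that loading and encoding step. The wget URL below is the one used in the TF-DF tutorials and is an assumption on my part, since the lecture doesn't read it out; the mapping via classes.index is likewise just one way to do what's described.

```python
import tensorflow_decision_forests as tfdf
import tensorflow as tf

print(tfdf.__version__)   # check which version Colab has installed

# In the notebook the CSV is fetched with wget, e.g.:
# !wget -q https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins.csv
penguins = pd.read_csv('penguins.csv')
print(penguins.shape)                         # (344, 8)
print(penguins['species'].value_counts())     # check for class imbalance

# Map the three species names to the integer labels 0, 1, 2
classes = penguins['species'].unique().tolist()
penguins['species'] = penguins['species'].map(classes.index)
```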
That's a very quick and easy way to do it; you can of course also use the replace method, as you'll remember from before. Now, as always, we're going to split our data. This time we're not making use of scikit-learn's train_test_split; there's another easy way to go about it that I'll show you here, by defining a user-defined function (I'm just showing you different ways; you could use train_test_split as well). So we define our own function, which I'm going to call split. It takes two arguments: one is ds, which is going to be our data frame, and the second one is a variable with a default value, r equals 0.3, so if the user leaves it out it takes the default value of 0.3. Then what we simply do is create a local variable, local to this function, so it doesn't exist outside of it, which I'll call test_ind for test index. It generates random values for us with NumPy's rand: rand draws from a uniform distribution on the interval from 0 to 1, so we get values between 0 and 1, and how many do we want? The length of the data frame that we pass in, so that's the number of observations. What we want back is the conditional: is each value less than 0.3? You have these values between 0 and 1, all equally likely because it's a uniform distribution, so about 0.3 of them are going to be less than 0.3 and the other 0.7 will be higher. What this does is build up all the yeses and noes for us: in roughly 30% of the cases the value is less than 0.3, so you'll have roughly 30% True values and 70% False values, completely at random. Then we return two data frames: the data frame that does not contain that index, and the data frame that does contain it, the True values. That gives us the split of the data into two separate data frames. Very simple; instead of using train_test_split you can build your own.

I'm going to seed the pseudo-random number generator, and then I'm going to use penguins_train and penguins_test, because I'm returning two data frames, and I'm passing my penguins data frame to the split function that we've just created. Lo and behold, we have a roughly 70-30 split: 237 observations in our training set versus the remaining 107 observations in our test set. Let's just check that underrepresentation again, to make sure we have enough of each class in each split, and we can see the numbers in the bar chart for our test set: yes, we do still have the underrepresentation, but it's not too bad.
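Here is that user-defined split as a small sketch; the seed value is a placeholder, since the lecture seeds the generator without saying which number it uses.

```python
def split(ds, r=0.3):
    """Randomly split a data frame into train and test parts (roughly 70/30 by default)."""
    test_ind = np.random.rand(len(ds)) < r   # True for about 30% of the rows
    return ds[~test_ind], ds[test_ind]       # (train, test)

np.random.seed(42)                           # seed value is a placeholder
penguins_train, penguins_test = split(penguins)
print(penguins_train.shape, penguins_test.shape)   # roughly (237, 8) and (107, 8)
```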
Now, if you use TensorFlow for deep neural networks, there's a beautiful part of the architecture that takes your data frames and passes them, in the correct format, to your model, and that's very important for TensorFlow; of course, we're using decision forests here. So we're going to say tfdf.keras. Keras, very importantly: we have TensorFlow, and then we also have what I'll call a simplified version of the TensorFlow code, very successful, called Keras, built into TensorFlow. You can go quite deep with quite complex TensorFlow code, and then you have Keras, a module inside TensorFlow that uses simpler code; you can still do all the powerful things you can do with the deeper TensorFlow code, but it's just much easier to write. It has a function called pd_dataframe_to_tf_dataset: it takes a data frame and turns it into a TensorFlow data set, and that is the format that TensorFlow models want the data in, so it's a very easy way to do this. All we have to say is what the training set is; remember, the training set still contains both the features and the target, so we just have to tell it what the target is, and that's the label argument, which we set to species. Then we also build penguins_test the same way; again, this pandas-data-frame-to-TensorFlow-data-set function converts the data frame into the correct format for use with these random forests and with TensorFlow.

Now that we have them in the correct format, these random forests know what the feature variables are and what the target variable is. All we have to do is instantiate, once again, just as with scikit-learn: tfdf, remember that was our namespace abbreviation, dot keras dot RandomForestModel, that's the one we use, and we assign it to a computer variable name; it's now instantiated. We don't have to do the next step, but I like to do it, and that's where we compile the model: if you use a deep neural network you have to compile your model before you can use it. We take the model we've instantiated, we use the .compile method, and here all we have to pass is the metrics that we want; in this instance I'm interested in accuracy, but there are other metrics you can use as well. And now everything is ready, our model is ready. Remember, the data had missing values, it had all sorts of different types of variables: no problem whatsoever. We just say rf_model.fit, and we can pass as our x argument penguins_train, because penguins_train is now a TensorFlow data set, so it's in the right format and it can just run. There you'll see it start building; it runs through it, gives you some information there, and what we can do now, with rf_model a fully trained random forest model, is call the .summary method on it, and that gives us a bunch of information.

There are all sorts of things you can read there; let's go right back up to the top. Oops, that was too far. It gives us a little bit of information about our random forest: what the input features were; weights, which we didn't add (weights are where we could correct for the fact that we had a bit of class imbalance, but we didn't do that here); and then the variable importances, the feature variables in order of how much they contribute to the solution, to the prediction. We see bill length in millimeters, and we see all the little hashtag symbols there: the more of them there are, the more important that variable was, and you can see they're listed in order of importance. When you have a big data set with many, many feature variables, that shows you which ones you could build simpler models from, by including only those that were quite important. So it's a very useful thing that we can do here.
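A minimal sketch of that TF-DF workflow, assuming the penguins_train and penguins_test data frames from the split above:

```python
# Convert the pandas data frames into TensorFlow datasets, naming the target column
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(penguins_train, label='species')
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(penguins_test, label='species')

# Instantiate, compile (only to attach the accuracy metric) and fit the forest
rf_model = tfdf.keras.RandomForestModel()
rf_model.compile(metrics=['accuracy'])
rf_model.fit(x=train_ds)

# A long text report: input features, variable importances, OOB evaluation, ...
rf_model.summary()
```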
If we scroll down further and further we get some more indication of what's going on, and we get to a very important bit: the training OOB, out-of-bag, evaluation. Now, remember I said we didn't go fully into the detail, because it's quite complex, of how we take a bunch of decision trees and build a random forest from them. Remember that it uses, at each and every split, only a subset of your feature variables, for every tree and at every split, but it also uses only a random subset of the data inside your training set. Because it only uses a subset, there's a set left over, and that's like having a training and a test set inside your training set. In other architectures we refer to that as a validation set: you break your training set up, while the model trains, into a training and a validation set. As far as random forests are concerned, we call those the out-of-bag values, the extra ones that were not used in a given tree. Every time, the model takes the values that were not selected to be part of the tree being built and tests against them. What we want to see, as it builds more and more trees, is the accuracy going up: there are a few dips here and there, but it goes up and up and up as the model builds, and you can see the trees there, we went to 300 trees. As your random forest gets more complicated, with more information from more trees, your accuracy goes up, and there we also have the log loss, an indication of the error rate, going down: more and more accurate as it runs.

So we have the out-of-bag errors and the accuracy, because we compiled with the accuracy metric, but remember we still have our test set, so we can look at that as well; there are sort of two ways we're testing our model here. I'm going to create a computer variable, evaluation; my trained model is rf_model, and I call the evaluate method on it and pass my test set. Remember, my test set is a TensorFlow data set: we put in our data frame and it gave us back a data set ready for use. And return_dict equals True just gives us a dictionary of metrics, because that's what's important for us. It now takes our test data and runs it through all those trees in our random forest, and because we said return_dict equals True we get the keys, which are loss and accuracy, and the values for each, now for the test data: the loss was 0 and the accuracy was 100%. Absolutely phenomenal. Random forests are a very powerful machine learning tool, and the more sophisticated forms, such as XGBoost, can be even more so; I really encourage you, if you're interested after this fundamentals of data science course, to look into this.

Very nicely, it can also give us an overview of our decision structure, of how this was done, by using the tfdf.model_plotter.plot_model_in_colab function. That gives us an idea of how the node structure eventually panned out, an overall view of our forest, some overview of the combination of those trees, and it's very good information. You can see flipper length: remember, that was highest up when we looked at the importance of all these variables, and you can see it was chosen as the root. You can see the three classes here, so an impure node to start off with, and what we're trying to do is get to pure nodes; and you can see there's a pure node, there's a pure node, and there's a pure node. That is the sort of structure the test data passes through when we make predictions.
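A sketch of those last evaluation and inspection steps (the tree_idx argument just picks which of the trees in the forest to draw):

```python
# Evaluate on the held-out test set; return_dict gives us a dictionary of metrics
evaluation = rf_model.evaluate(test_ds, return_dict=True)
print(evaluation)   # e.g. {'loss': ..., 'accuracy': ...}

# Draw one of the trees in the forest (renders inside a Colab notebook)
tfdf.model_plotter.plot_model_in_colab(rf_model, tree_idx=0)

# The inspector exposes the variable importances and the model's self-evaluation
inspector = rf_model.make_inspector()
print(inspector.variable_importances())
print(inspector.evaluation())
```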
We can then also look at the variable importances, and that gives us the same picture: flipper length, then bill length, then bill depth, then the island, and then the body mass. Very, very nice. We can also call the make_inspector method on our rf_model, and from the inspector the evaluation method, and that gives us all the information we've talked about before.

And that's it: an introduction to data science. It really is a massive, massive field that gives us the ability to tackle the problems we're dealing with, and data science has been very successful in democratizing working with data and finding solutions from data; in the end, finding the story that the data is trying to tell you.