Welcome back. We're going to be talking today about random forests. We're going to finish building our own random forest from scratch. But before we do, I wanted to tackle a few things that have come up during the week, a few questions that I've had. And I want to start with the position of random forests in general. So we spent about half of this course doing random forests. And then after today, the second half of this course will be neural networks, broadly defined. This is because these two represent the two key classes of techniques which cover nearly everything that you're likely to need to do. Random forests belong to the class of techniques called decision tree ensembles, with gradient boosting machines being the other key type, and some variants like extremely randomised trees. They have the benefit that they're highly interpretable, scalable, flexible, and work well for most kinds of data. They have the downside that they don't extrapolate at all to data that's outside the range that you've seen, as we looked at at the end of last week's session. But you know, they're a great starting point. And so I think, you know, there's a huge catalogue of machine learning tools out there, and a lot of courses and books don't attempt to curate that down and say, for these kinds of problems use this, for these kinds of problems use that, finished. Rather, it's like, here's a description of 100 different algorithms. And you just don't need them. Like, I don't see why you would ever use a support vector machine today, for instance. No reason at all I could think of for doing that. People loved studying them in the 90s because they are very theoretically elegant, and you can really write a lot of math about support vector machines, and people did. But in practice I don't see them as having any place.
So there are a lot of techniques that you could include in an exhaustive list of every way that people have looked at machine learning problems. But I would rather tell you how to actually solve machine learning problems in practice. We're about to finish today the first class, which is one type of decision tree ensemble. In part two, Yannet will tell you about the other key type, that being gradient boosting, and we're about to launch next lesson into neural nets, which includes all kinds of GLM, ridge regression, elastic net, lasso, logistic regression, etc. They're all variants of neural nets. You know, interestingly, Leo Breiman, who created random forests, did so very late in his life and unfortunately passed away not many years later. So partly because of that, very little has been written about them in the academic literature. Partly because SVMs were just taking over at that point, other people didn't look at them. And also just because they're quite hard to grasp at a theoretical level, to analyze theoretically, it's quite hard to write conference papers or academic papers about them. So there hasn't been that much written about them, but there's been a real resurgence, or not a resurgence, a new wave in recent years of empirical machine learning, like what actually works. Kaggle has been part of that, but also part of it has just been companies using machine learning to make shitloads of money, like Amazon and Google. So nowadays a lot of people are writing about decision tree ensembles and creating better software for decision tree ensembles, like GBM and XGBoost and Ranger for R and scikit-learn and so forth. But a lot of this is being done in industry rather than academia. But you know, it's encouraging to see.
There's certainly more work being done in deep learning than in decision tree ensembles, particularly in academia, but there's a lot of progress being made in both. If you look at the packages being used today for decision tree ensembles, all the best ones, the top five or six, I don't know that any of them really existed five years ago, maybe other than scikit-learn, or even three years ago. So that's been good. But I think there's a lot of work still to be done. We talked about, for example, figuring out which interactions are the most important last week. And some of you pointed out in the forums that actually there is such a project already for gradient boosting machines, which is great. But it doesn't seem that there's anything like that yet for random forests. And random forests do have a nice benefit over GBMs in that they're kind of harder to screw up and easier to scale. So hopefully that's something that this community might help fix. Another question I had during the week was about the size of your validation set. How big should it be? So to answer this question about how big your validation set needs to be, you first need to answer the question: how precisely do I need to know the accuracy of this algorithm? Right. So if the validation set that you have is saying this is 70% accurate, and somebody said, well, is it 75% or 65% or 70%, and the answer was, I don't know, anything in that range is close enough, that would be one answer. Whereas if it's like, is it 70% or 70.01% or 69.99%, then that's something else again. Right. So you need to start out by saying, how accurate do I need this? So for example, in the deep learning course, we've been looking at dogs versus cats images. And the models that we're looking at had about a 99.4, 99.5% accuracy on the validation set. Okay.
And our validation set size was 2000. Okay. In fact, let's do this in Excel. That'll be a bit easier. So our validation set size was 2000. And our accuracy was 99.4%. Right. So the number incorrect is something around one minus accuracy, times n. So we're getting about 12 wrong. Right. And the number of cats we had is half. And so the number of wrong cats is about six. Okay. So then we run a new model and we find instead that the accuracy has gone to 99.2%. Right. And then it's like, okay, is this less good at finding cats? It's like, well, it got two more cats wrong. Probably not. Right. But then it's like, well, does this matter? Does 99.4 versus 99.2 matter? And if it wasn't about cats and dogs, but it was about finding fraud, then the difference between a 0.6% error rate and a 0.8% error rate is like 25% of your cost of fraud. So that can be huge. It's really interesting. When the ImageNet results came out earlier this year, the new competition results, the error had gone down from 3% to 2%. And I saw a lot of people on the internet, like famous machine learning researchers, being like, yeah, some Chinese guys got it better, from like 97% to 98%, it's like statistically not even significant, who cares, kind of a thing. But actually I thought, holy crap, this Chinese team just blew away the state of the art in image recognition. Like the old one was 50% less accurate than the new one. That's actually the right way to think about it, isn't it? Because it's like, you know, say we were trying to recognize which tomatoes were ripe and which ones weren't. With the old approach, 50% more of the time we were letting in the unripe tomatoes. Or 50% more of the time we were accepting fraudulent customers. Like that's a really big difference.
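The spreadsheet arithmetic above can be reproduced in a few lines. This is only a sketch of the calculation as described; the function name and the assumption that cats make up exactly half of the validation set are illustrative.

```python
def expected_errors(n_rows, accuracy, class_share=1.0):
    """Expected number of misclassified rows for a class that makes up
    `class_share` of a validation set with `n_rows` rows."""
    return n_rows * (1 - accuracy) * class_share

n_val = 2000
old_wrong_cats = expected_errors(n_val, 0.994, class_share=0.5)  # about 6
new_wrong_cats = expected_errors(n_val, 0.992, class_share=0.5)  # about 8

old_err, new_err = 1 - 0.994, 1 - 0.992
# The same 0.2 point gap, expressed relative to each error rate:
share_of_new_errors = (new_err - old_err) / new_err   # about 0.25
increase_over_old = (new_err - old_err) / old_err     # about 0.33

print(round(old_wrong_cats), round(new_wrong_cats))
print(round(share_of_new_errors, 2), round(increase_over_old, 2))
```

Two extra wrong cats is hard to see in a set of 2000 rows, yet the same gap is a quarter of the new error rate, which is why the relative view matters for something like fraud.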
So just because with this particular validation set we can't really see 6 versus 8 doesn't mean the 0.2% difference isn't important. It could be. Well, my rule of thumb is that this number, how many observations you're actually looking at, I want that generally to be somewhere higher than 22. Why 22? Because 22 is the magic number where the t distribution roughly turns into the normal distribution. Right. So as you may have learned, the t distribution is the normal distribution for small data sets. And so in other words, once we have 22 of something or more, it kind of starts to behave normally, in both senses of the word. It's more stable and you can understand it better. So that's my magic number when somebody says, do I have enough of something? I start out by saying, do you have 22 observations of the thing of interest? So if you were looking at lung cancer, and you had a data set that had a thousand people without lung cancer and 20 people with lung cancer, I'd be like, I very much doubt we're going to make much progress, because we haven't even got 20 of the thing you want. So ditto with the validation set. If you don't have 20 of the thing you want, then it's very unlikely to be useful. Or if at the level of accuracy we need it's not plus or minus 20, that's the point where I'm thinking, be a bit careful. So just to be clear, you want 22 to be the number of samples in each set, like in the validation, the test and the train? So what I'm saying is, if there's less than 22 of a class in any of the sets, then it's getting pretty unstable at that point. And so that's just the first rule of thumb. But then what I would actually do is start practicing what we learned about the binomial distribution, or actually the Bernoulli distribution.
So what is the mean of the binomial distribution of n samples and probability p? n times p. Okay. Thank you. n times p is our mean. Right. So if you've got a 50% chance of getting a head and you toss it 100 times, on average you get 50 heads. And then what's the variance? n p times one minus p, and the standard deviation is the square root of that. Okay. So these are two numbers. Well, the first number you don't really have to remember. It's intuitively obvious. The second one is one to try to remember forevermore, because not only does it come up all the time, the people that you work with will all have forgotten it. So you'll be the one person in the conversation who could immediately go, we don't have to run this 100 times. I can tell you straight away, it's binomial. The variance is going to be npq, n p times 1 minus p. Then there's the standard error. The standard error is, if you run a bunch of trials, each time getting a mean, what is the standard deviation of the mean? I don't think you guys have covered this yet. Is that right? No. So this is really important, because this means if you train 100 models, each time the validation set accuracy is like the mean of a distribution. And so therefore the standard deviation of that validation set accuracy can be calculated with the standard error. And this is equal to the standard deviation divided by square root n. So one approach to figuring out whether my validation set is big enough is to train your model five times with exactly the same hyperparameters each time, and look at the validation set accuracy each time. And there's a mean and a standard deviation of five numbers you could use, or a maximum and a minimum you can use. But to save yourself some time, you can figure out straight away that, okay, well, I have a 0.99 accuracy as to whether I get it correct or not correct. So therefore the standard deviation of a single prediction is equal to the square root of 0.99 times 0.01. Okay. And then I can get the standard error of that. Right.
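As a rough sketch of that shortcut: the standard deviation of a binomial count is the square root of n times p times (1 minus p), and the standard error of a measured accuracy is the per-prediction standard deviation divided by the square root of the validation set size. The function names and the 2000-row validation set below are assumptions for illustration.

```python
import math

def binomial_mean_std(n, p):
    """Mean and standard deviation of the number of successes in n trials."""
    return n * p, math.sqrt(n * p * (1 - p))

def accuracy_standard_error(accuracy, n_val):
    """Standard error of an accuracy measured on n_val independent rows."""
    std_one_row = math.sqrt(accuracy * (1 - accuracy))  # one right/wrong outcome
    return std_one_row / math.sqrt(n_val)

mean_heads, std_heads = binomial_mean_std(100, 0.5)
print(mean_heads, std_heads)   # 50.0 5.0

se = accuracy_standard_error(0.99, 2000)
print(f"{se:.4f}")             # 0.0022
```

Under these assumptions, a measured accuracy of 99% on 2000 rows is uncertain by roughly plus or minus 0.2 percentage points per standard error, which is about the size of the 99.4 versus 99.2 gap discussed earlier.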
So basically the size of the validation set you need is however big it has to be such that your insights about its accuracy are good enough for your particular business problem. And so like I say, the simple way to do it is to pick a validation set of a size of, like, 1000, train five models, and see how much the validation set accuracy varies. And if they're all close enough for what you need, then you're fine. If not, maybe you should make it bigger, or maybe you should consider using cross-validation instead. Okay. So as you can see, it really depends on what it is you're trying to do, how common your less common class is, and how accurate your model is. Could you pass that back to Melissa, please? Thank you. I have a question about the less common classes. If you have less than 22, let's say you have one sample of something. Let's say it's a face and I only have one representation from that particular country. Do I toss that into the training set, or do I pull it completely out of the data set? Or do I put it in a test set instead of the validation set? So you certainly couldn't put it in the test or the validation set, I mean in general, because then you're asking, can I recognize something I've never seen before? But actually this question of, can I recognize something I've not seen before, there's actually a whole class of models specifically for that purpose. It's called either one-shot learning, which is where you get to see something once and then you have to recognize it again, or zero-shot learning, which is where you have to recognize something you've never seen before. We're not going to cover them in this course, but they can be useful for things like face recognition. Like, is this the same person I've seen before? And so generally speaking, obviously for something like that to work, it's not that you've never seen a face before. It's that you've never seen Melissa's face before.
And so you see Melissa's face once and you have to recognize it again. Yeah. So in general, your validation set and test set need to have the same mix or frequency of observations that you're going to see in production, in the real world. And then your training set should have an equal number in each class. And if it doesn't, just replicate the less common one until it is equal. So I think we've mentioned this paper before, a very recent paper that came out. They tried lots of different approaches to training with unbalanced data sets and found consistently that oversampling the less common class until it is the same size as the more common class is always the right thing to do. So you could literally copy, you know, so like I've only got 10 examples of people with cancer and 100 without. So I could just copy those 10 until I've got another 90. That's kind of a little memory inefficient. So a lot of things, including I think scikit-learn's random forests, have a class weights parameter that says each time you're bootstrapping or resampling, I want you to sample the less common class with a higher probability. Or ditto if you're doing deep learning, make sure in your mini-batch it's not randomly sampled, but it's a stratified sample, so the less common class is picked more often. Okay. So let's get back to finishing off our random forests. And so what we're going to do today is we're going to finish off writing our random forest. And then after today, your homework will be to take this class and to add to it all of the random forest interpretation algorithms that we've learned. Okay. So obviously to be able to do that, you're going to need to totally understand how this class works. So please ask lots of questions as necessary as we go along. So just to remind you, we're doing the bulldozers Kaggle competition data set again. We split it as before into 12,000 validation, the last 12,000 records.
And then just to make it easier for us to keep track of what we're doing, we're going to just pick two columns out to start with, YearMade and MachineHoursCurrentMeter. Okay. And so what we did last time was we started out by creating a tree ensemble, and the tree ensemble had a bunch of trees, which was literally a list of n_trees trees, where each time we just called create_tree. Okay. And create_tree picked sample_sz random indexes. Okay. And this one was drawn without replacement. So remember, bootstrapping means sampling with replacement. So normally with scikit-learn, if you've got n rows, we grab n rows with replacement, which means many of them will appear more than once. So each time we get a different sample, but it's always the same size as the original data set. And then we have our set_rf_samples function that we can use, which does with-replacement sampling of fewer than n rows. This is doing something else again, which is sampling sample_sz rows without replacement. Okay, because we're permuting the numbers from 0 to len(self.y) minus 1 and then grabbing the first self.sample_sz of them. Actually, there's a faster way to do this. You can just use np.random.choice with replace=False, which is a slightly more direct way, but this way works as well. All right. So this is our random sample for this one of our n_trees trees. And so then we're going to create a decision tree. And our decision tree, we don't pass it all of x, we pass it these specific indexes. And remember, x is a pandas data frame. So if we want to index into it with a bunch of integers, we have to use iloc, integer locations. And that makes it behave indexing-wise just like numpy. Our y vector is numpy, so we can just index into it directly. And then we're going to keep track of our minimum leaf size. So then the only other thing we really need in the ensemble is some way to make a prediction. And so we were just going to do the mean of the tree prediction for each tree. All right. So that was that.
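To make the two sampling schemes concrete, here is a small sketch using only Python's standard library in place of the NumPy calls mentioned above. The function names and sizes are illustrative, not the lecture's code.

```python
import random

def bootstrap_indexes(n, rng):
    """n row indexes drawn WITH replacement (a bootstrap sample)."""
    return rng.choices(range(n), k=n)

def subsample_indexes(n, sample_sz, rng):
    """sample_sz distinct row indexes drawn WITHOUT replacement."""
    return rng.sample(range(n), sample_sz)

rng = random.Random(42)                  # fixed seed so runs are repeatable
boot = bootstrap_indexes(1000, rng)
sub = subsample_indexes(1000, 100, rng)

print(len(boot), len(set(boot)))  # 1000 rows, but fewer distinct ones
print(len(sub), len(set(sub)))    # 100 rows, all distinct
```

The bootstrap sample is the same size as the data but repeats some rows, while the subsample is smaller and never repeats a row, which is the behaviour described for create_tree.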
And so then in order to be able to run that, we need a decision tree class, because it's being called here. And so there we go. Okay. So that's the starting point. So the next thing we need to do is to flesh out our decision tree. So the important thing to remember is all of our randomness happened back here in the tree ensemble. The decision tree class we're going to create doesn't have randomness in it. Okay. So right now we are building a random tree regressor, right? So that's why we're taking the mean of the trees' outputs. If we were to work with classification, do we take the max? Like the classifier will give you either zeros or ones. No, I would still take the mean. So each tree is going to tell you what percentage of that leaf node contains cats and what percentage contains dogs. So then I would average all those percentages and say, across the trees, on average there is 19% cats and 81% dogs. Good question. So, you know, random tree classifiers are almost identical, or can be almost identical, to random tree regressors. The technique we're going to use to build this today will basically work exactly for classification. Certainly for binary classification you can do it with exactly the same code. For multi-class classification, you just need to change your data structure so that you have a one-hot encoded matrix, or a list of integers that you treat as a one-hot encoded matrix. Okay. So, our decision tree. So remember our idea here is that we're going to try to avoid thinking. So we're going to basically write it as if everything we need already exists. Okay. So we know from when we created the decision tree, we're going to pass in the x, the y, and the minimum leaf size. So here we need to make sure we've got the x and the y and the minimum leaf size. Okay.
So then there's one other thing, which is as we split our tree into subtrees, we're going to need to keep track of which of the row indexes went into the left-hand side of the tree and which went into the right-hand side of the tree. Okay. So we're going to have this thing called indexes as well. Right. So at first we just didn't bother passing in indexes at all. So if indexes is not passed in, if it's None, then we're just going to set it to everything, the entire length of y. Right. So the root of a decision tree contains all the rows. That's the definition, really, of the root of a decision tree. So all the rows is row 0, row 1, row 2, et cetera, up to row len(y) minus 1. Okay. And then we're just going to store away all that information that we were given. We're going to keep track of how many rows there are and how many columns there are. Okay. So then every leaf and every node in a tree has a value. It has a prediction. And that prediction is just equal to the average of the dependent variable. Okay. So for every node in the tree, y indexed with the indexes is the values of the dependent variable that are in this branch of the tree. And so here is the mean. Okay. Some nodes in a tree also have a score, which is like how effective was the split here, right? But that's only going to be true if it's not a leaf node, right? A leaf node has no further splits. And at this point, when we create a tree, we haven't done any splits yet. So its score starts out as being infinity. Okay. So having built the root of the tree, our next job is to find out which variable we should split on and what level of that variable we should split on. So let's pretend that there's something that does that. Let's call it find_varsplit. So then we're done. Okay. So how do we find a variable to split on? Well, we could just go through each potential variable. So c contains the number of columns we have.
So go through each one and see if we can find a better split than we have so far on that column. Okay. Now, notice this is not the full random forest definition. This is assuming that max features is set to all, right? Remember we could set max features to like 0.5, in which case we wouldn't check all the numbers from 0 to c. We would check half the numbers at random from 0 to c. So if you want to turn this into a random forest that has max features support, you could easily add one line of code to do that. But we're not going to do it in our implementation today. So then we just need find_better_split. And since we're not interested in thinking at the moment, for now we're just going to leave that empty. All right. So the one other thing I like to do when I start writing a class is I like to have some way to print out what's in that class. All right. And so if you type print followed by an object, or in a Jupyter Notebook you just type the name of the object, at the moment it's just printing out __main__.DecisionTree at blah, blah, blah, which is not very helpful. Right. So if we want to replace this with something helpful, we have to define the special Python method named dunder repr, __repr__, to get a representation of this object. So when we basically just write the name like this, behind the scenes that calls that function, and the default implementation of that method is just to print out this unhelpful stuff. So we can replace it by instead saying, let's create a format string where we're going to print out n and then show n, and then print val and then show val. Okay. So how many rows are in this node? And what's the average of the dependent variable? Okay. Then if it's not a leaf node, so if it has a split, then we should also be able to print out the score, the value we split on, and the variable that we split on.
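Pulling the pieces described so far into one place, a skeleton of the class might look roughly like this. It is a sketch under assumptions: it uses plain Python lists rather than pandas and NumPy so that it runs without dependencies, the attribute names (idxs, val, score, var_idx, split) are guesses at reasonable names rather than the lecture's exact code, and find_better_split is deliberately left empty, as in the talk.

```python
import math
from statistics import mean

class DecisionTree:
    def __init__(self, x, y, idxs=None, min_leaf=5):
        # x: list of rows, each row a list of column values; y: list of targets
        if idxs is None:
            idxs = list(range(len(y)))       # the root node holds every row
        self.x, self.y, self.idxs, self.min_leaf = x, y, idxs, min_leaf
        self.n, self.c = len(idxs), len(x[0])  # number of rows and columns
        self.val = mean(y[i] for i in idxs)    # this node's prediction
        self.score = math.inf                  # no split found yet
        self.find_varsplit()

    def find_varsplit(self):
        # Checks every column, i.e. behaves like max_features = all
        for var_idx in range(self.c):
            self.find_better_split(var_idx)

    def find_better_split(self, var_idx):
        pass  # intentionally empty for now

    @property
    def is_leaf(self):
        # A node that was never split still has its initial infinite score
        return self.score == math.inf

    def __repr__(self):
        s = f"n: {self.n}; val: {self.val}"
        if not self.is_leaf:
            # split and var_idx would be set by find_better_split once written
            s += f"; score: {self.score}; split: {self.split}; var: {self.var_idx}"
        return s

# Hypothetical toy data: two columns, three rows
tree = DecisionTree([[1960, 100], [1990, 2500], [2004, 40]], [9.5, 10.5, 11.0])
print(tree)
```

Because nothing has been split yet, the printed representation only shows the row count and the mean of the dependent variable.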
Now you'll notice here self.is_leaf. is_leaf is defined as a method, but I don't have any parentheses after it. This is a special kind of method called a property. And so a property is something that kind of looks like a regular variable, but it's actually calculated on the fly. So when I call is_leaf, it actually calls this function. Right. But I've got this special decorator, property. Okay. And what this says is basically you don't have to include the parentheses when you call it. Okay. And so it's going to say, all right, is this a leaf or not? So a leaf is something that we don't split on. If we haven't split on it, then its score is still set to infinity. So that's my logic. Does that make sense? So this at notation is called a decorator. It's basically a way of telling Python more information about your method. Does anybody here remember where you have seen decorators before? Can you pass it over here? Yeah. Where have you seen decorators before? Tell us more about Flask and how it uses decorators. It was the at app.route. Yeah. What does that do? That I forgot. Okay. How to describe it? No worries. So Flask, so anybody who's done any web programming before with something like Flask or a similar framework would have had to have said, this method is going to respond to this bit of the URL, and either to POST or to GET, and you put it in a special decorator. So behind the scenes, that's telling Python to treat this method in a special way. So here's another decorator. Okay. And so, you know, if you get more advanced with Python, you can actually learn how to write your own decorators, which, as was mentioned, basically insert some additional code. But for now, just know there's a bunch of predefined decorators we can use to change how our methods behave. And one of them is at property, which basically means you don't have to put parentheses anymore, which of course means you can't add any more parameters beyond self. Yep.
If it's not a leaf, why is the score infinity? Because doesn't infinity mean you're at the root? Well, infinity means that you're not at the root. It means you're at a leaf. So the root will have a split, assuming we find one, right? Yeah. Everything will have a split till we get all the way to the bottom, the leaf. And so the leaves will have a score of infinity because they won't split. Great. All right. So that's our decision tree. It doesn't do very much, but at least we can create an ensemble, right? 10 trees, sample size of 1,000, right? And we can print out. So now when I go m.trees[0], it doesn't say blah, blah, blah, blah, blah, blah. It says what we asked it to say, n: 1000, val: 10.8. Oh wait. Okay. And this is a leaf because we haven't split on it yet. So we've got nothing more to say. Okay. So then the indexes are all the numbers from 0 to 1,000. Okay. Because the base of the tree has everything. This is everything in the random sample that was passed to it. Because remember, by the time we get to the point where it's a decision tree, we don't have to worry about any of the randomness in the random forest anymore. Okay. All right. So let's try to write the thing which finds a split. Okay. So we need to implement find_better_split. And so it's going to take the index of a variable, variable number 1, variable number 3, whatever, and it's going to figure out what's the best split point, and is that better than any split we have so far? And for the first variable, the answer will always be yes, because the best one so far is none at all, which is infinitely bad. Okay. So let's start by making sure we've got something to compare to. That'll be scikit-learn's random forest. And so we need to make sure that scikit-learn's random forest gets exactly the same data that we have. So we start out by creating an ensemble, grab a tree out of it, and then find out which particular random sample of x and y this tree used. Okay.
And we're going to store them away so that we can pass them to scikit-learn, so we have exactly the same information. So let's go ahead and now create a random forest using scikit-learn. So one tree, one decision, no bootstrapping, so the whole data set. So this should be exactly the same as the thing that we're going to create, this tree. Okay. So let's try. So we need to define find_better_split. So find_better_split takes a variable. Okay. So let's define our x, our independent variable, and say, okay, well, it's everything inside our tree, but only those indexes that are in this node, which at the top of the tree is everything, and just this one variable. Okay. And then for our y, it's just whatever our dependent variable is at the indexes in this node. Okay. So there's our x and y. So let's now go through every single value in our independent variable. And so I'll show you what's going to happen. So let's say our independent variable is YearMade, and it's not going to be in order. And so we're going to go to the very first row and we're going to say, okay, YearMade here is three, right. And so what I'm going to do is I'm going to try and calculate the score if we decided to branch on the number three. Right. So I need to know which rows are greater than three and which rows are less than or equal to three. And they're going to become my left-hand side and my right-hand side. Right. And then we need a score. Right. So there's lots of scores we could use. So in random forests, we call this the information gain. Right. The information gain is like how much better does our score get because we split it into these two groups of data. There's lots of ways we could calculate it. Gini, cross-entropy, root mean squared error, whatever.
If you think about it, there is an alternative formulation of root mean squared error, which is mathematically the same to within a constant scale, but it's a little bit easier to deal with, which is we're going to try and find a split which causes the two groups to each have as low a standard deviation as possible. Right. So I want to find a split that puts all the cats over here and all the dogs over here. Right. So if these are all cats and these are all dogs, then this has a standard deviation of zero and this has a standard deviation of zero. Or else, if this is a totally random mix of cats and dogs, and this is a totally random mix of cats and dogs, they're going to have a much higher standard deviation. Does that make sense? And so it turns out if you find a split that minimizes those group standard deviations, or specifically the weighted average of the two standard deviations, it's mathematically the same as minimizing the root mean squared error. That's something you can prove to yourself after class if you want to. All right. So we're going to need to, first of all, split this into two groups. So where's all the stuff that is greater than three? So greater than three is this one, this one, and this one. So we need the standard deviation of that. So let's go ahead and say standard deviation of greater than three. That one, that one, and that one. Okay. And then the next will be the standard deviation of less than or equal to three. So that would be that one, that one, that one. And then we just take the weighted average of those two. And that's our score. That would be our score if we split on three. Does that make sense? And so then the next step would be try to split on four. Try splitting on one. Try splitting on six. Redundantly try splitting on four again. Redundantly try splitting on one again. And find out which one works best. So that's what our code here does: we're going to go through every row.
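The scoring step for one candidate split value could be sketched as follows. The x values mirror the order walked through above (three, four, one, six, four, one); the y values are invented for illustration, and the score here is the sum of each group's standard deviation weighted by its row count, one reasonable reading of "weighted average" up to a constant.

```python
from statistics import pstdev

def split_score(xs, ys, split_value):
    """Sum of the two groups' standard deviations, each weighted by the
    number of rows in the group, for the split `x <= split_value`."""
    lhs = [y for x, y in zip(xs, ys) if x <= split_value]
    rhs = [y for x, y in zip(xs, ys) if x > split_value]
    if not lhs or not rhs:
        return float("inf")   # one side is empty, so this is not a usable split
    return pstdev(lhs) * len(lhs) + pstdev(rhs) * len(rhs)

xs = [3, 4, 1, 6, 4, 1]   # one column, deliberately unsorted
ys = [1, 1, 0, 1, 1, 0]   # hypothetical targets (1 = dog, 0 = cat)

for value in sorted(set(xs)):
    print(value, split_score(xs, ys, value))
```

With this toy data, splitting on 1 separates the two classes perfectly, so both groups have a standard deviation of zero and the score is zero. Splitting on 6 leaves nothing on the right-hand side and is rejected.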
And so let's say, okay, left-hand side is any values in x that are less than or equal to this particular value. Our right-hand side is every value in x that is greater than this particular value. Okay. So what's the data type that's going to be in lhs and rhs? What are they actually going to contain? They're going to be arrays. Arrays of what? Arrays of booleans. Yeah. Which we can treat as zero and one. Okay. So lhs will be an array that's false every time it's not less than or equal to, and true otherwise, and rhs will be a boolean array of the opposite. Okay. And now we can't take a standard deviation of an empty set. Right. So if there's nothing that's greater than this number, then these will all be false, which means the sum will be zero. Okay. And in that case, let's not go any further with this step, because there's nothing to take the standard deviation of, and it's obviously not a useful split. Okay. So assuming we've got this far, we can now calculate the standard deviation of the left-hand side and of the right-hand side and take the weighted average, or the sum, which is the same thing up to a scalar. Right. And so there's our score. And so we can then check, is this better than our best score so far? And our best score so far, we initialized it to infinity. Right. So initially this is better. So if it's better, let's store away all of the information we need. Which variable has found this better split? What was the score we found? And what was the value that we split on? Okay. So there it is. So if we run that, and I'm using timeit. So what timeit does is it sees how long this command takes to run, and it tries to give you a statistically valid measure of that. So you can see here it's run it 10 times to get an average, and then it's done that seven times to get a mean and standard deviation across runs. And so it's taking me 75 milliseconds plus or minus 10. Okay. So let's check that this works. find_better_split, tree, zero. So zero is YearMade.
One is MachineHoursCurrentMeter. So with one we got back MachineHoursCurrentMeter, 3744, with this score. And then we ran it again with zero, that's YearMade, and we got a better score, 658, and split 1974. And so 1974, let's compare. Yep, that was what this tree did as well. So we've confirmed that this method is giving the same result that scikit-learn's random forest did. And you can also see here the value 10.08, and again, matching here, the value 10.08. Okay. So we've got something that can find one split.
Could you pass that to Yannet, please? So Jeremy, why don't we put a unique on the x there? Because I'm not trying to optimize the performance yet. But you see that, no, like, he's doing more. Yeah. So you can see in the Excel, I checked this one twice, I checked this four twice, unnecessarily. Okay. And so you're already thinking about performance, which is good.
So tell me, what is the computational complexity of this section of the code? Have a think about it, but also feel free to talk us through it if you want to think and talk at the same time. What's the computational complexity of this piece of code? Can I pass it over there? Yes. All right, Jane, take us through your thought process. I think you have to take each different value through the column to calculate it once to see the splits, and then compare all the possible combinations between these different values. So that can be expensive. Can you, or does somebody else want to, tell us the actual computational complexity? So, yeah, quite high, is Jane's thinking. How high? I think it's n squared. Okay. So tell me, why is it n squared? Because for the for loop it is n. Yes. And then each comparison is another n, so it's n squared. Okay. Or this one maybe is even easier to know: this is, which ones are less than x[i]? I'm going to have to check every value to see if it's less than x[i]. Okay.
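The loop being analysed here can be sketched as a standalone function. This is a simplified illustration rather than the notebook's method: the name find_better_split_slow, the single-column signature and the toy data are assumptions made for the example.

```python
import numpy as np

def find_better_split_slow(x, y):
    """Try every row's value as a split; return (best_score, best_split)."""
    best_score, best_split = float('inf'), None
    for i in range(len(x)):        # outer loop: n iterations
        lhs = x <= x[i]            # element-wise comparison touches all n values
        rhs = x > x[i]
        if rhs.sum() == 0:         # nothing on the right: cannot take a std, skip
            continue
        score = y[lhs].std() * lhs.sum() + y[rhs].std() * rhs.sum()
        if score < best_score:     # keep the best split seen so far
            best_score, best_split = score, x[i]
    return best_score, best_split

# Hypothetical toy data (note the repeated values 4 and 1 are checked twice).
x = np.array([3, 4, 1, 6, 4, 1])
y = np.array([0., 1., 0., 1., 1., 0.])
print(find_better_split_slow(x, y))
```

The n iterations of the for loop each do work proportional to n (the comparisons, the sums and the standard deviations), which is where the n squared comes from.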
And so it's useful to know how to quickly calculate computational complexity. I can guarantee most of the interviews you do are going to ask you to calculate computational complexity on the fly. And also, when you're coding, you want it to be second nature. So the technique is basically: is there a loop? Okay, then we're obviously doing this n times, so there's an n involved. Is there a loop inside the loop? If there is, then you need to multiply those two together. In this case, there's not. Is there anything inside the loop that's not a constant time thing? So you might see a sort in there, and you just need to know that sort is n log n; that should be second nature. If you see a matrix multiply, you need to know what that is. In this case, there are some things that are doing element-wise array operations. So keep an eye out for anything where NumPy is doing something to every value of an array. In this case, it's checking every value of x against a constant, so it's going to have to do that n times. So to flesh this out into a computational complexity, you just take the number of things in the loop and you multiply it by the highest computational complexity inside the loop. n times n is n squared.
Can you pass that? In this case, couldn't we just pre-sort the list and then do one n log n computation? There's lots of things we can do to speed this up. At this stage, it's just: what is the computational complexity we have? But absolutely, it's certainly not as good as it can be. Okay, so that's where we're going to go next. All right, n squared is not great, so let's try and make it better. So here's my attempt at making it better, and the idea is this. Okay, who wants to first of all tell me what's the equation for standard deviation? Marsha, can you grab the box? So for the standard deviation, it's the difference between the value and its mean. We take a square root of that.
Sorry, we take the power of 2. Then we sum up all of these observations and we take the square root of all this sum. Yeah, and you have to divide by n. Good, okay. Now in practice, we don't normally use that formulation, because it requires us to calculate x minus the mean lots of times. Does anybody know the formulation that just requires x and x squared? Anybody happen to know that one? Do you want to pass that back there? Square root of mean of squares minus... Square root of mean? Yeah, great: the square root of the mean of squares minus the square of the mean. So that's a really good one to know, because you can now calculate variances or standard deviations of anything. You just have to first of all grab the column as it is and the column squared, and as long as you've got those stored away somewhere, you can immediately calculate the standard deviation.
So the reason this is handy for us is that if we first of all sort our data (so let's go ahead and sort our data), then if you think about it, as we start going down one step at a time, each group is exactly the same as the previous group, on the left-hand side with one more thing in it and on the right-hand side with one less thing in it. So given that we just have to keep track of the sum of x and the sum of x squared, we can just add one more thing to x and one more thing to x squared on the left, and remove one thing on the right. So we don't have to go through the whole lot each time, and so we can turn this into an order n algorithm. So that's all I do here: I sort the data, and I'm going to keep track of the count of things on the right, the sum of things on the right, and the sum of squares on the right. Initially everything's in the right-hand side, so initially n is the count, y sum is the sum on the right, and y squared sum is the sum of squares on the right. And nothing is initially on the left, so those are zeros.
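The formulation just mentioned, the square root of the mean of squares minus the square of the mean, can be checked numerically with a small sketch. The helper name std_agg and the sample values are assumptions for the example; the max(..., 0.0) guard is an addition of this sketch to avoid a tiny negative value from floating-point rounding.

```python
import math
import numpy as np

def std_agg(cnt, s1, s2):
    """Standard deviation from a count, a sum and a sum of squares."""
    variance = s2 / cnt - (s1 / cnt) ** 2   # mean of squares minus square of mean
    return math.sqrt(max(variance, 0.0))    # guard against rounding below zero

# Hypothetical sample values.
vals = np.array([2., 4., 4., 4., 5., 5., 7., 9.])
from_aggregates = std_agg(len(vals), vals.sum(), (vals ** 2).sum())
print(from_aggregates, vals.std())
```

Because only the count, the sum and the sum of squares are needed, moving one value from one group to another just means adding to or subtracting from three running totals.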
And then we just have to loop through each observation and add one to the left-hand count, subtract one from the right-hand count, add the value to the left-hand sum, subtract it from the right-hand sum, add the value squared to the left-hand sum of squares, and subtract it from the right-hand one. Now we do need to be careful, though, because if we're saying less than or equal to one, say, we're not stopping here, we're stopping here; we have to have everything in that group. So the other thing I'm going to do is make sure that the next value is not the same as this value. If it is, I'm going to skip over it. So I'm just going to double check that this value and the next one aren't the same. As long as they're not the same, I can go ahead and calculate my standard deviation now, passing in the count, the sum and the sum of squares, and there's that formula: the mean of the squares minus the square of the mean, and then the square root. Do that for the right-hand side, and so now we can calculate the weighted average score just like before, and all of these lines are now the same.
So we've turned our order n squared algorithm into an order n algorithm, and in general, stuff like this is going to get you a lot more value than pushing something onto a Spark cluster or ordering faster RAM or using more cores in your CPU or whatever. This is the way you want to be improving your code. And specifically: write your code without thinking too much about performance. Run it. Is it fast enough for what you need? Then you're done. If not, profile it. So in Jupyter, instead of saying %timeit, you say %prun, and it will tell you exactly where the time was spent in your algorithm, and then you can go to the bit that's actually taking the time and think about, okay, is this algorithmically as efficient as it can be?
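The running-totals idea described above can be sketched as a standalone function. As before, this is an illustration under stated assumptions rather than the notebook's exact method: the names find_better_split_fast and std_agg, the single-column signature and the toy data are made up for the example.

```python
import math
import numpy as np

def std_agg(cnt, s1, s2):
    """Standard deviation from count, sum and sum of squares."""
    return math.sqrt(max(s2 / cnt - (s1 / cnt) ** 2, 0.0))

def find_better_split_fast(x, y):
    """One pass over the sorted data; return (best_score, best_split)."""
    order = np.argsort(x)
    sort_x, sort_y = x[order], y[order]
    n = len(x)
    # Initially every row is on the right-hand side.
    rhs_cnt, rhs_sum, rhs_sum2 = n, sort_y.sum(), (sort_y ** 2).sum()
    lhs_cnt, lhs_sum, lhs_sum2 = 0, 0.0, 0.0
    best_score, best_split = float('inf'), None
    for i in range(n - 1):               # stop early so the right side is never empty
        xi, yi = sort_x[i], sort_y[i]
        lhs_cnt += 1; rhs_cnt -= 1       # move one row from right to left
        lhs_sum += yi; rhs_sum -= yi
        lhs_sum2 += yi ** 2; rhs_sum2 -= yi ** 2
        if xi == sort_x[i + 1]:          # same value next: the group is not complete yet
            continue
        score = (std_agg(lhs_cnt, lhs_sum, lhs_sum2) * lhs_cnt
                 + std_agg(rhs_cnt, rhs_sum, rhs_sum2) * rhs_cnt)
        if score < best_score:
            best_score, best_split = score, xi
    return best_score, best_split

# Hypothetical toy data.
x = np.array([3, 4, 1, 6, 4, 1])
y = np.array([0., 1., 0., 1., 1., 0.])
print(find_better_split_fast(x, y))
```

One caveat worth keeping in mind: the loop itself is linear in n, but the sort in front of it costs n log n, so in this sketch the function as a whole is n log n rather than strictly order n.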
So in this case we run it and we've gone down from 76 milliseconds to less than 2 milliseconds. Now some people that are new to programming think, oh great, I've saved 60-something milliseconds. But the point is this is going to get run tens of millions of times. So the 76 millisecond version is so slow that it's going to be impractical for any random forest you use in practice, whereas the 1 millisecond version I found is actually quite acceptable. And then check: the numbers should be exactly the same as before, and they are.
So now that we have a function, find_better_split, that does what we want, I want to insert it into my DecisionTree class, and this is a really cool Python trick. Python does everything dynamically, so we can actually say the method called find_better_split in DecisionTree is that function I just created, and that sticks it inside that class. Now I'll tell you what's slightly confusing about this: this word here and this word here actually have no relationship to each other. They just happen to have the same letters in the same order. So I could call this find_better_split_foo, and then I could call that, and call that. So now my function is actually called find_better_split_foo, but my method, I'm expecting to call something called DecisionTree.find_better_split. So here I could say DecisionTree.find_better_split = find_better_split_foo. You see, that's the same thing. So it's important to understand how namespaces work. In every language that you use, one of the most important things is understanding how it figures out what a name refers to. So this one here means find_better_split as defined inside this class, and nowhere else. Well, I mean, or a parent class, but never mind about that.
This one here means find_better_split_foo in the global namespace. A lot of languages don't have a global namespace, but Python does. And so even if the two happen to have the same letters in the same order, they're not referring in any way to the same thing. Does that make sense? It's like this family over here may have somebody called Jeremy and my family has somebody called Jeremy; our names happen to be the same, but we're not the same person. Great.
So now that we've stuck the find_better_split method inside DecisionTree with this new definition, when I now call the TreeEnsemble constructor, the TreeEnsemble constructor called create_tree, create_tree instantiated DecisionTree, and DecisionTree called find_varsplit, which went through every column to see if it could find a better split. We've now defined find_better_split, and therefore TreeEnsemble, when we create it, has gone ahead and done the split. Does that make sense? Does anybody have any questions or uncertainties about that? We're only creating one single split so far.
All right, so this is pretty neat. We just do a little bit at a time, testing everything as we go. And so as you all implement the random forest interpretation techniques, you may want to try programming this way too: at every step, check that what you're doing matches up with what scikit-learn does, or with a test that you've built, or whatever. So at this point we should try to go deeper, very Inception. So let's go: now max depth is 2, and here is what scikit-learn did. After breaking at YearMade 1974, it then broke at MachineHoursCurrentMeter 2956. So we had this thing called find_varsplit, which just went through every column and tried to see if there was a better split there. But actually we need to go a bit further than that.
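The method-patching trick from a moment ago can be sketched in isolation. The tiny DecisionTree class below is a hypothetical stand-in, not the class built in the lesson; it exists only to show that the method name inside the class and the function name in the global namespace are independent.

```python
class DecisionTree:
    """Hypothetical stand-in for the lesson's class, just to show patching."""
    def __init__(self, x):
        self.x = x
        # The method is looked up on the class at call time, so whatever
        # function is attached when the constructor runs is the one used.
        self.result = self.find_better_split(0)

    def find_better_split(self, var_idx):
        return 'placeholder: not implemented yet'

# A function in the global namespace; its name need not match the method's name.
def find_better_split_foo(self, var_idx):
    return f'split on column {var_idx} using {len(self.x)} rows'

# Attach the global function to the class under the method's name.
DecisionTree.find_better_split = find_better_split_foo

tree = DecisionTree([1, 2, 3])
print(tree.result)
```

After the assignment, every new DecisionTree picks up the patched method, which is why re-running the ensemble constructor in the notebook uses the new definition without any other change.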
Not only do we have to go through every column and see if there's a better split in this node, but then we also have to see whether there's a better split in the left and the right sides that we just created. In other words, the left-hand side and the right-hand side should become decision trees themselves. So there's no difference at all between what we do here to create this tree and what we do here to create this tree, other than that this one contains 159 samples and this one contains a thousand. So this line of code is exactly the same as we had before. And then we check, and actually we could do this a little bit easier: we could say if self.is_leaf. It would be the same thing, but I'll just leave it here for now. So if the score is still infinite... in fact, let's write it properly, is_leaf. Let's go back up and just remind ourselves: is_leaf is self.score equals inf. So since it's there, we might as well use it. So if it's a leaf node, then we have nothing further to do. That means we're right at the bottom, there's no split that's been made, so we don't have to do anything further. On the other hand, if it's not a leaf node, so it's somewhere back earlier on, then we need to split it into the left-hand side and the right-hand side.
Now earlier on we created a left-hand side and a right-hand side array of booleans. Better would be to have an array of indexes, and that's because we don't want to have a full array of all the booleans in every single node. Because remember, although it doesn't look like there are many nodes when you see a tree of this size, when it's fully expanded, the bottom level, if there's a minimum leaf size of one, contains the same number of nodes as the entire data set. And so if every one of those contained a full boolean array the size of the whole data set, we've got n squared memory requirements, which would be bad, right?
On the other hand, if we just store the indexes of the things in this node, then that's going to get smaller and smaller. So np.nonzero is exactly the same as just this thing which gets the boolean array, but it turns it into the indexes of the Trues. So this is now a list of indexes for the left-hand side and indexes for the right-hand side. So now that we have the indexes of the left-hand side and the right-hand side, we can just go ahead and create a decision tree. So there's a decision tree for the left, and there's our decision tree for the right. And we don't have to do anything else. We've already written these; we already have a constructor that can create a decision tree.
So when you really think about what this is doing, it kind of hurts your head. Because the whole reason that find_varsplit got called is that find_varsplit is called by the DecisionTree constructor, but then find_varsplit itself calls the DecisionTree constructor. So we actually have circular recursion, and I'm not nearly smart enough to be able to think through recursion, so I just choose not to. I just write what I mean and then I don't think about it anymore. What do I want? Well, to find a variable split, I've got to go through every column and see if there's something better; if it managed to do a split, figure out the left-hand side and the right-hand side and make them into decision trees. Now, to try to think through how these two methods call each other, that would just drive me crazy, but I don't need to. I know I have a DecisionTree constructor that works. I know I have a find_varsplit that works. So that's it, right?
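The mutual recursion between the constructor and find_varsplit, together with the use of np.nonzero to turn boolean masks into index arrays, can be sketched with a deliberately simplified, single-feature class. The class name Node, its attributes and the toy data are assumptions for the example; the lesson's DecisionTree handles multiple columns and uses the faster split search.

```python
import numpy as np

class Node:
    """Hypothetical one-feature tree node; a simplified stand-in for DecisionTree."""
    def __init__(self, x, y, idxs):
        self.x, self.y, self.idxs = x, y, idxs
        self.n = len(idxs)
        self.val = y[idxs].mean()       # prediction if this node ends up a leaf
        self.score = float('inf')       # no split found yet: infinitely bad
        self.find_varsplit()            # the constructor calls find_varsplit...

    @property
    def is_leaf(self):
        return self.score == float('inf')

    def find_better_split(self):
        x, y = self.x[self.idxs], self.y[self.idxs]
        for v in x:                     # slow version, kept simple on purpose
            lhs, rhs = x <= v, x > v
            if rhs.sum() == 0:
                continue
            score = y[lhs].std() * lhs.sum() + y[rhs].std() * rhs.sum()
            if score < self.score:
                self.score, self.split = score, v

    def find_varsplit(self):
        self.find_better_split()
        if self.is_leaf:
            return                      # nothing more to do at the bottom
        x = self.x[self.idxs]
        lhs = np.nonzero(x <= self.split)[0]   # positions of the Trues
        rhs = np.nonzero(x > self.split)[0]
        # ...and find_varsplit calls the constructor again for each side.
        self.lhs = Node(self.x, self.y, self.idxs[lhs])
        self.rhs = Node(self.x, self.y, self.idxs[rhs])

# Hypothetical toy data.
x = np.array([3, 4, 1, 6, 4, 1])
y = np.array([0., 1., 0., 1., 1., 0.])
root = Node(x, y, np.arange(len(x)))
print(root.split, root.lhs.idxs, root.rhs.idxs)
```

Each child only stores the row indexes that reached it, so the index arrays shrink as the tree gets deeper, and the recursion stops on its own once a node can no longer be split.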
That's how I do recursive programming: by pretending I don't. I just ignore it. That's my advice. A lot of you are probably smart enough to be able to think through it better than I can, so that's fine, if you can.
So now that I've written that, again I can patch it into the DecisionTree class, and as soon as I do, the TreeEnsemble constructor will now use that, because Python is dynamic; that just happens automatically. So now I can check: my left-hand side should have 159 samples and a value of 9.66. There it is, 159 samples, 9.66. Right-hand side, 841, 10.15. The left-hand side of the left-hand side, 150 samples, 9.62; 150 samples, 9.62. So you can see, because I'm not nearly clever enough to write machine learning algorithms, not only can I not write them correctly the first time, often every single line I write will be wrong. So I start from the assumption that the line of code I just typed is almost certainly wrong, and I just have to see why and how, and so I check as I go. And eventually I get to the point where, much to my surprise, it's not broken anymore. So here I can feel like it would be surprising if all of these things accidentally happened to be exactly the same as scikit-learn, so this is looking pretty good.
So now that we have something that can build a tree, we want to have something that can calculate predictions. To remind you, we already have something that calculates predictions for a tree ensemble by calling tree.predict, but there is nothing called tree.predict, so we're going to have to write that. Okay, to make this more interesting, let's start bringing up the number of columns that we use. Let's create our tree ensemble again, and this time let's go to a maximum depth of 3. So now our tree is getting more interesting. And let's now define how we create a set of predictions for a tree. A set of predictions for a tree is simply the prediction for a row, for every row. That's it, that's our predictions. So the predictions for a tree are every row's
predictions in an array. So again we're skipping thinking. Thinking is hard, so let's just keep pushing it back. This is kind of handy: notice that you can do "for blah in array" with a NumPy array regardless of the rank of the array, regardless of the number of axes in the array, and what it does is it will loop through the leading axis. These concepts are going to be very, very important as we get into more and more neural networks, because we're going to be doing tensor computations all the time. So the leading axis of a vector is the vector itself, the leading axis of a matrix is the rows, the leading axis of a three-dimensional tensor is the matrices that represent the slices, and so forth. So in this case, because x is a matrix, this is going to loop through the rows. And if you write your tensor code this way, then it will tend to generalize nicely to higher dimensions. It doesn't really matter how many dimensions are in x; this is going to loop through the leading axis. Okay, so we can now call that DecisionTree.predict.
All right, so all I need to do is write predict_row, and I've delayed thinking so much, which is great, that at the point where I actually have to do the work it's now basically trivial. If we're at a leaf node, then the prediction is just equal to whatever that value was, which we calculated right back in the original tree constructor; it's just the average of the y's. If it's not a leaf node, then we have to figure out whether to go down the left-hand path or the right-hand path to get the prediction. So if this variable in this row is less than or equal to the amount we decided to split on, then we go down the left path; otherwise we go down the right path. And then having figured out which tree we want, we can just call predict_row on that. And again we've accidentally created something recursive. Again, I don't want to think about how that works control-flow-wise or whatever, but I don't need to, because
it just does. I just told it what I wanted, so I'll trust it to work: if it's a leaf, return the value; otherwise return the prediction for the left-hand side or the right-hand side as appropriate.
Notice this "if" here has nothing to do with this "if". This "if" is a control flow statement that tells Python to go down that path or that path to do some calculation. This "if" is an operator that returns a value. Those of you that have done C or C++ will recognize it as being identical to what's called the ternary operator there. If you haven't, that's fine. Basically what we're doing is we're going to get a value where we say it's this value if this thing is true, and that value otherwise. You could write it this way instead, but that would require writing four lines of code to do one thing, and also require you to have code that, if you read it to yourself or to somebody else, is not at all naturally the way you would express it. I want to say the tree I've got to go down is the left-hand side if the variable is less than or equal to the split, or the right-hand side otherwise. So I want to write my code the way I would think about it, or the way I would say it. So this kind of ternary operator can be quite helpful for that.
All right, so now that I've got a prediction for a row, I can dump that into my class, and now I can calculate predictions, and I can plot my actuals against my predictions. When you do a scatter plot, you'll often have a lot of dots sitting on top of each other, so a good trick is to use alpha. Alpha means how transparent the things are, not just in matplotlib but in pretty much every graphics package in the world. So if you set alpha to less than one, then, as here, this is saying you would need 20 dots on top of each other for it to be fully blue, and so this is a good way to see how much things are sitting on top of each other. So it's a good trick for scatter plots. There's my r squared, not bad. And so let's now go ahead and do
a random forest with no max amount of splitting, and our tree ensemble had no max amount of splitting, so we can compare our r squared to their r squared. They're not the same, but actually ours was a little better. I don't know what we did differently, but we'll take it. So we now have something which, for a forest with a single tree in it, is giving as good accuracy on a validation set, using an actual real-world data set, Blue Book for Bulldozers, compared to scikit-learn.
So let's go ahead and round this out. What I would want to do now is to create a package that has this code, and I created it by creating a method here, a method here, a method here, and patching them together. So what I did now is I went back through my notebook and collected up all the cells that implemented methods and pasted them all together, and I've just pasted them down here. So this is my original tree ensemble, and here are all the cells in the decision tree. I just dumped them all into one place without any change. So that was it, that was the code we wrote together. So now I can go ahead and create a tree ensemble, I can calculate my predictions, I can do my scatter plot, I can get my r squared, and this is now with five trees. And here we are: we have a model of Blue Book for Bulldozers with a 71% r squared, with a random forest we wrote entirely from scratch. So that's pretty cool.
Any questions about that? I know there's quite a lot to get through, so during the week feel free to ask on the forum about any bits of code you come across. Can somebody pass the box to Marsha? Oh, there it is. Can we get back to, probably, the top, or maybe the decision tree? When we set the score equal to infinity, do we calculate the score further? I mean, I lost track of that. And specifically I wonder, when we implement find_varsplit, we check whether self.score is equal to infinity or not. It seems unclear to me whether we fall out of this, I mean, like,
if we ever implement the method, if our initial value is infinity. So, okay, let's talk through the logic. The decision tree starts out with a score of infinity. In other words, at this point when we've created the node, there is no split, so it's infinitely bad. That's why the score is infinity. And then we try to find a variable and a split that is better, and to do that we loop through each column and say, hey, column, do you have a split which is better than the best one we have so far? And so then we implement that. Let's do the slow way, since it's a bit simpler. In find_better_split, we do that by looping through each row and finding out: this is the current score if we split here; is it better than the current score? The current score is infinitely bad, so yes, it is. And so now we set the new score equal to what we just calculated, and we keep track of which variable we chose and the split we split on. No worries? Okay, great. Let's take a five minute break and I'll see you back here at 22.
So when I tried comparing the performance of this against scikit-learn, this is quite a lot slower. And the reason why is that although a lot of the work's being done by NumPy, which is nicely optimized C code, think about the very bottom level of a tree. If we've got a million data points, then the bottom level of the tree has something like 500,000 decision points with a million leaves underneath. And so that's 500,000 split methods being called, each of which contains multiple calls to NumPy which only have one item that's actually being calculated on. So it's very inefficient, and it's the kind of thing that Python is particularly not good at performance-wise: calling lots of functions lots of times. I mean, we can see it's not bad; for a random forest which 15 years ago would have been considered pretty big, this would be considered pretty good performance. But nowadays this is some hundreds of times, at least, slower than it should be. So what the scikit-learn folks did to avoid this
problem was that they wrote their implementation in something called Cython. Cython is a superset of Python, so pretty much any Python you've written you can use as Cython. But then what happens is Cython runs it in a very different way. Rather than passing it to the Python interpreter, it instead converts it to C, compiles that, and then runs that C code. Which means the first time you run it, it takes a little longer, because it has to go through the translation and compilation, but then after that it can be quite a bit faster. And so I wanted just to quickly show you what that looks like, because you are absolutely going to be in a position where Cython is going to help you with your work, and most of the people you're working with will have never used it, may not even know it exists. So this is a great superpower to have.
So to use Cython in a notebook, you say %load_ext Cython. And so here's a Python function, fib1. Here is the same thing as a Cython function; it is exactly the same thing with %%cython at the top. This actually runs about twice as fast as this, just because it does the compilation. Here is the same version again where I've used a special Cython extension called cdef, which defines the C data type of the return value and of each variable. And so basically that's the trick that you can use to start making things run quickly. At that point it knows it's not just some Python object called t. In fact, I probably should put one here as well. Let's try that. So we've got fib2; we'll call that fib3. So for fib3, it's exactly the same as before, but we say what the data type of the thing we passed to it is, and then define the data types of each of the variables. And so then if we call that, okay, we've now got something that's 10 times faster. So it doesn't really take that much extra, and it's just Python with a few little bits of markup. So it's good to know that
that exists, because if there's something custom you're trying to do, I find it kind of painful having to go out and go into C and compile it and link it back and all that, whereas doing it here is pretty easy.
Can you pass that to your right, please, Marsha? So when you're doing the Cython version of it, in the case of an array or a NumPy array, is there a specific C type for that? Yeah, so there's a lot of specific stuff for integrating Cython with NumPy, and there's a whole page about it. So we won't worry about going over it, but you can read that and you can see the basic ideas. There's this cimport, which basically imports certain types of Python library into the C bit of the code, and you can then use it in your Cython. Yeah, it's pretty straightforward. Good question, thank you.
So your mission now is to implement confidence based on tree variance, feature importance, partial dependence, and tree interpreter for that random forest. Removing redundant features doesn't use a random forest at all, so you don't have to worry about that. Extrapolation is not an interpretation technique, so you don't have to worry about that. So it's just the other ones. Confidence based on tree variance: we've already written that code, so I suspect it's the exact same code we have in the notebook; just make sure you get that working. Feature importance is with the variable shuffling technique, and once you have that working, partial dependence will just be a couple of lines of code away, because rather than shuffling a column you're just replacing it with a constant value, but it's nearly the same code. And then tree interpreter is going to require you to write some code and think about that. Once you've written tree interpreter, if you want to, you're very close to creating the second approach to feature importance, the one where you add up the importance across all of the rows, which means you would then be very close to doing interaction importance. So it turns out that there's
actually a very good library for interaction importance for XGBoost, but there doesn't seem to be one for random forest. So if you want to do interaction importance, you could start by getting it working on our version, and then you could get it working on the original sklearn version, and that would be a cool contribution. Sometimes writing it against your own implementation is kind of nicer, because you can see exactly what's going on. So that's your job. You don't have to rewrite the random forest; feel free to if you want to practice.
So if you get stuck at any point, ask on the forum. There is a whole page here on wiki.fast.ai about how to ask for help. When you ask your coworkers on Slack for help, when you ask people in a technical community on GitHub or Discourse for help, or whatever, asking for help the right way will go a long way towards having people want to help you and be able to help you. So search for the error you're getting and see if somebody's already asked about it. How have you tried to fix it already? What do you think is going wrong? What kind of computer are you on? How is it set up? What are the software versions? Exactly what did you type, and exactly what happened?
Now you could do that by taking a screenshot, so make sure you've got some screenshot software that's really easy to use. If I want to take a screenshot, I just hit a button, select the area, copy to clipboard, go to my forum, paste it in, and there we go. That looks a little bit too big, so let's make it a little smaller. And so now I've got a screenshot; people can see exactly what happened. Better still, if there's a few lines of code and error messages to look at, create a gist. A gist is a handy little GitHub thing which basically lets you share code. So if I wanted to create a gist of this, I actually have an extension, there we are, that little extension. So if I click on here, give it a name, say make public, and that takes my
Jupyter Notebook and shares it publicly. I can then grab that URL, copy link location, and paste it into my forum post, and then when people click on it they'll immediately see my notebook when it renders. Okay, so that's a really good way. Now that particular button is an extension, so in Jupyter you need to click Nbextensions and click on Gist-it. While you're there, you should also click on Collapsible Headings; that's this really handy thing I use that lets me collapse things and open them up. If you go to your Jupyter and don't see this Nbextensions button, then just Google for Jupyter nbextensions. It'll show you how to pip install it and get it set up. But those two extensions are super duper handy.
All right, so other than that assignment, we're done with random forests, and until the next course, when you look at GBMs, we're done with decision tree ensembles. And so we're going to move on to neural networks, broadly defined. Neural networks are going to allow us to go beyond just the kind of nearest neighbors approach of random forests. All random forests can do is to average data that they've already seen; they can't extrapolate, they can't calculate. Linear regression can calculate and can extrapolate, but only in very limited ways. Neural nets give us the best of both worlds.
We're going to start by applying them to unstructured data. Unstructured data means things like pixels, or the amplitudes of sound waves, or words: data where everything in all the columns is the same type, as opposed to a database table where you've got a revenue and a cost and a zip code and a state, which would be structured data. We're going to use neural nets for structured data as well, but we're going to do that a little bit later. Unstructured data is a little easier, and it's also the area which more people have been applying deep learning to for longer. If you're doing the deep learning course as well, you'll see that we're going to be approaching kind of the same conclusion from
two different directions so the deep learning course is starting out with big complicated convolutional neural networks being solved with sophisticated optimisation schemes we're going to gradually drill down into exactly how they work where else with the machine learning course we're going to be starting out more with how does stochastic gradient descent actually work what can we do with one single layer which would allow us to create things like logistic regression when we add regularisation to that how does that give us things like ridge regression elastic net lasso how do we add additional layers to that how does that let us handle more complex problems and so we're not going to we're only going to be looking at fully connected layers in this machine learning course and then I think next semester with your net you're probably going to be looking at some more sophisticated approaches and so yes on this machine learning we're going to be looking much more at like what's actually happening with the matrices and the deep learning it's much more like what are the best practices for actually solving you know at a world class level real world deep learning problems right so next week we're going to be looking at like the classic MNIST problem which is like how do we recognise digits now if you're interested you can like skip ahead and like try and do this with a random forest and you'll find it's not bad but given that a random forest is basically a type of nearest neighbours right it's finding like what are the nearest neighbours in in tree space then a random forest can absolutely recognise that this nine those pixels you know are similar to pixels we've seen in these other ones and on average they were nines as well right and so like we can absolutely solve these kinds of problems to an extent using random forests but we end up being rather data limited because every time we put in another decision point you know we're halving our data roughly and so there's just 
this limitation in the amount of calculation that we can do where else with neural nets we're going to be able to use lots and lots and lots of parameters using these tricks we're going to learn about with regularisation so we're going to be able to do lots of computation and there's going to be very little limitation on really what we can actually end up calculating as a result good luck with your random forest interpretation and I will see you next time
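As a rough illustration of the "halving our data" point above, here is a minimal sketch in plain Python. It assumes an idealised tree where every decision point splits its rows exactly in half, and it assumes 60,000 training rows (the size of the standard MNIST training split). Real random forest trees are not perfectly balanced and are grown on bootstrap samples, so treat the numbers as an order-of-magnitude guide only. The function names `samples_per_leaf` and `max_balanced_depth` are made up for this sketch.

```python
def samples_per_leaf(n_samples: int, depth: int) -> float:
    """Average rows per leaf, assuming every split divides its node exactly in half."""
    return n_samples / (2 ** depth)


def max_balanced_depth(n_samples: int, min_leaf: int = 1) -> int:
    """Deepest level at which an idealised balanced tree still has >= min_leaf rows per leaf.

    Computes floor(log2(n_samples / min_leaf)) with integer arithmetic.
    """
    if n_samples < min_leaf:
        raise ValueError("n_samples must be at least min_leaf")
    return (n_samples // min_leaf).bit_length() - 1


if __name__ == "__main__":
    n = 60_000  # assumption: size of the standard MNIST training split
    for depth in (0, 4, 8, 12, 16):
        print(f"depth {depth:2d}: about {samples_per_leaf(n, depth):,.1f} rows per leaf")
    print("deepest balanced tree:", max_balanced_depth(n))
```

Under these assumptions the tree runs out of data after about 15 decision points on any path, which is the sense in which the amount of calculation a single tree can do is limited by the amount of data.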