Now let's talk about three methods for improving decision tree methods. I'm going to talk about these primarily in the context of regression, but you can extend each of them to classification as well. These are all known as ensemble methods. Ensemble, of course, means together: these are methods where we combine multiple models of the same type into some sort of overall improved model.

We already talked about bagging. In bagging, you have your training data X and y, and from this you make bootstrap samples, say (X_1, y_1) through (X_B, y_B). For each of those, you fit a model, f hat_1 through f hat_B. These are bootstrap samples, which, again, we draw from our original data set with replacement; we just want some sub-samples, all of roughly the same size. Assuming they are all the same size, we simply set f hat = (1/B) * (f hat_1 + ... + f hat_B), so it's literally the average. The idea here is that if each of these models has a certain variance sigma squared, then f hat should have variance sigma squared over B, because when you average independent, identically distributed random variables, the variance goes down by that factor.

So that's bagging. Bagging is very simple. However, it's not so great in the sense that if you have a single feature which is really driving the regression, then each of these models will probably pick it out, and you may essentially be averaging the same model over and over again. Here's a simple example. Suppose you want a nuanced model of things that lead to early mortality, so you collect dozens, maybe hundreds, of medical criteria about different patients. But as it happens, every single model simply says that smoking and obesity lead to higher mortality. That's fine, and to its own extent it's certainly true, but you're never going to learn about other, more subtle effects, because they're all washed out by the fact that every model picks up the same overwhelming signal. So if you want a more nuanced picture, models which can pick up more subtle variations, then in some sense you want to force the models to stop picking up the same details. That's the idea behind random forests.
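Before moving on to random forests, here is a minimal, hedged sketch of the bagging procedure just described, written in Python. The helper names bagged_fit and bagged_predict are made up for this illustration, and it leans on scikit-learn's DecisionTreeRegressor as the base tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_fit(X, y, B=50, rng=None):
    """Fit B regression trees, each on a bootstrap sample drawn with replacement."""
    rng = np.random.default_rng(rng)
    n = len(y)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)      # bootstrap sample: n draws with replacement
        tree = DecisionTreeRegressor()
        tree.fit(X[idx], y[idx])
        models.append(tree)
    return models

def bagged_predict(models, X):
    """Average the trees' predictions: f_hat = (1/B) * sum over b of f_hat_b."""
    return np.mean([m.predict(X) for m in models], axis=0)
```

Calling bagged_predict(bagged_fit(X, y), X_new) returns the averaged prediction, which is exactly the f hat = (1/B) * sum of the f hat_b described above.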
So a random forest is the following idea. You have your training data X and y, and again you do a bootstrap procedure, drawing sub-samples (X_1, y_1) through (X_B, y_B). The key change happens when we build the models. Suppose X involves p predictors, so X is an n by p array and y is a vector of n responses, one for each input point. We choose a maximum number of predictors the trees are allowed to use, call it m, and experience suggests you want m to be approximately the square root of p. Note that in this bagging step we are not partitioning the n data points into B pieces; we are sampling from the data B times, with replacement, into roughly equal-size bootstrap sub-samples. If those samples are sized so that their total size is roughly n, then m approximately root p is a sensible choice. And the idea is that when you build model f hat_b, at each layer of the tree you choose m of the p predictors to use.

So choose only m. Instead of building a tree that searches over all p predictors at each split, we choose a random subset of m of them and build that layer from those. If you think about it, what's happening is that we're still building a tree, but instead of searching across the data points in every one of the p dimensions at each level, you're only searching in m dimensions at each layer of the tree. So you're dropping that factor of p down quite a bit, and this happens per bag. Let's say these bootstrap samples are of size n tilde, where n tilde is roughly n over B; building one tree then costs on the order of d times m times log n tilde, and we do this B times. One interesting thing about random forest procedures is that although the total work is of that order times B, once you've done the bootstrap sampling you can parallelize over the B trees. So it's this overall order, but it's extremely parallelizable in B. If you have B processors, say 10 cores in your computer and you do 10 bootstrap samples, then each core only does one tree's worth of work, so this can sometimes be faster than fitting a standard decision tree, because the work breaks up more readily.

Anyway, the point is that by choosing predictors at random for each layer of each tree you build, it's statistically unlikely that every model will keep picking up the same predictor. So if you have predictors like smoking, or measurements of obesity, versus how much steak someone eats, versus how much they run every day, or whether they have Girl Scout Cookies for lunch every day, each model at each layer is likely to pick up a different variable. A very dominant variable might be picked up once or twice, but you'll also see the nuance of the other, more subtle features. So this is called a random forest. Again, at the end you essentially average these models, but the idea is that the average is across a more diverse collection of trees.

Now interestingly, there's also something called extremely randomized trees, sometimes called an extremely random forest. There, when you build the tree, you don't even search for the best cutoff; you simply pick random cutoffs. Rather than trying to optimize every split, you just pick cutoffs at random at every layer, build a lot of trees that way, and let the averaging sort out the errors. But random forests in this form are quite effective: they find the big effects quickly with high probability, and they also pick up smaller effects and give more nuanced models that aren't as redundant as a classic bagging procedure.
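As a rough, hedged illustration (not anything from the lecture itself), both of these ideas are available off the shelf in scikit-learn; max_features="sqrt" is the m approximately root p heuristic, and n_jobs=-1 parallelizes over the trees, as discussed above. The arrays X_train, y_train, and X_test are assumed placeholders here.

```python
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

# Random forest: at each split, only a random subset of about sqrt(p)
# predictors is considered, and each tree sees a bootstrap sample.
rf = RandomForestRegressor(n_estimators=100, max_features="sqrt", n_jobs=-1)

# Extremely randomized trees: on top of the random feature subsets,
# candidate cutoffs are drawn at random rather than optimized.
xt = ExtraTreesRegressor(n_estimators=100, max_features="sqrt", n_jobs=-1)

# rf.fit(X_train, y_train)         # X_train, y_train: assumed training data
# y_pred = rf.predict(X_test)      # averaged prediction across the trees
```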
OK, finally, I want to talk about boosting. Boosting is a simple idea, actually similar to the idea we had back when we talked about the lasso and LARS: learn slowly. The idea is that you can avoid overfitting if you build your model gradually. So again, take the data X and y, and again draw B bootstrap sub-samples with replacement. But instead of building all the models independently and averaging them in some way, as both bagging and random forests do, we're going to build one incremental model, adding the trees one at a time so the fit grows gradually.

Here's the procedure; it's a loop. Step 1: set the model f hat equal to 0. If your model is 0, then the residual of the model, and again we're talking about regression here, is r_i = y_i minus f hat of x_i, which is just y_i, because the model is currently 0. Step 2: for each bootstrap sample b = 1 to B, fit a tree f hat_b of some small, fixed depth d. I don't want this tree to be too deep, otherwise we'd already be overfitting, so maybe depth 2. But fit it not to X versus y; fit it to X versus the residual. That's the same idea as what we did for the lasso and LARS, where you fit against the residual, not against the original data. Then update f hat: replace f hat with f hat plus lambda times f hat_b, where lambda is small, maybe a tenth or a hundredth, so you don't overwhelm the existing model; you're making a small, incremental change. Then update the residuals: the new residual is the old residual minus lambda times the new tree's prediction, which is easy to compute once you have the tree. Step 3: loop through all B trees this way, and finally output the full model f hat.

So the idea is that you build each of the trees to a certain depth, don't build them too deep, and let each one make only a small impact on f hat, scaled by that small factor lambda. And each time, we don't fit to the original data; we fit to the residual, to what's left. We're learning the difference between what the model already predicts and what it should predict next.

The last remark I want to make in this video is that of these three procedures, bagging (bootstrap aggregating), random forests, and boosting, all are ensemble methods for trees. But if you think about it for a moment, both bagging and boosting are very general, in the sense that they do not require a tree; they work with essentially any base model. A random forest, of course, is inherently tied to the idea of a tree, hence the name forest. But bagging and boosting are general procedures that apply in a wide variety of settings. You could, for instance, do bagging or boosting on linear models, or quadratic models, or logistic models, or any sort of regression model you might imagine.
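To close, here is a minimal, hedged sketch of the boosting loop in Python. For simplicity it fits each shallow tree to the residuals on the full data set rather than on separate bootstrap sub-samples, which is the more common textbook presentation; the names boost_fit, boost_predict, B, depth, and lam are made up for this illustration. And since boosting is general, DecisionTreeRegressor could be swapped for any base regressor.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_fit(X, y, B=500, depth=2, lam=0.01):
    """Boosting for regression: repeatedly fit shallow trees to the residual."""
    resid = np.asarray(y, dtype=float).copy()    # with f_hat = 0, the residual is just y
    trees = []
    for _ in range(B):
        tree = DecisionTreeRegressor(max_depth=depth)
        tree.fit(X, resid)                       # fit to the residual, not to y
        resid -= lam * tree.predict(X)           # r <- r - lambda * f_hat_b(x)
        trees.append(tree)
    return trees

def boost_predict(trees, X, lam=0.01):
    """The full model is f_hat(x) = sum over b of lambda * f_hat_b(x)."""
    return lam * np.sum([t.predict(X) for t in trees], axis=0)
```

The shrinkage factor lam plays the role of lambda above: each new tree nudges the model only slightly, which is what makes the procedure learn slowly.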