In this video, I'm going to do a couple of short demos of decision trees in scikit-learn in a Jupyter notebook, and then we'll talk about bagging. So what we're going to do here is build a decision tree on a simple dataset. I'm going to, of course, import NumPy, scikit-learn, and the tree sub-package of scikit-learn, and we're going to plot some things in matplotlib. To make some sample data to play with, I'm going to take 20 random points from a normal distribution around (-1, 0) and make those green, and another 20 points around (1, 0) and make those magenta. So if we look at this picture, it looks something like this. Now what we want to do is build a decision tree which can classify green versus magenta. So I'm going to use the cross-validation and train/test split tools in scikit-learn, and we'll break this dataset of only 40 points total into a training and testing set using train_test_split. We have 32 training points and 8 testing points, so that's an 80/20 split. Now let's instantiate a decision tree classifier from scikit-learn; that's what I'm going to call T. We fit T to the training points, so of course we feed it the training points and their corresponding labels. scikit-learn tells us a little bit about what it's doing: it has a Gini criterion, a max depth, and so on. And actually, since we know how decision trees work, let me set the max depth right away. Let's say max depth equal to four. If you think about a binary tree of depth four, that should give us up to 16 leaf regions on which we can make decisions. We can predict the value of a new point by giving it an array. Say, if I give it the point (2, 0), I think that's going to be magenta; if I give it the point (-2, 0), I think that's going to be green.
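The setup described above can be sketched roughly like this. This is my own reconstruction, not the notebook's actual code: the variable names, the random seed, and the standard deviation of the clusters are assumptions.

```python
# Sketch of the demo setup: two Gaussian clusters, an 80/20 split,
# and a depth-4 decision tree. Names and seed are my own choices.
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# 20 green points around (-1, 0) and 20 magenta points around (1, 0)
X_green = rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(20, 2))
X_magenta = rng.normal(loc=[1.0, 0.0], scale=1.0, size=(20, 2))
X = np.vstack([X_green, X_magenta])
y = np.array([0] * 20 + [1] * 20)  # 0 = green, 1 = magenta

# 80/20 split: 32 training points, 8 testing points
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

T = tree.DecisionTreeClassifier(max_depth=4)
T.fit(X_train, y_train)

# Classify two probe points, one deep in each cluster
print(T.predict([[2, 0], [-2, 0]]))
```

With a depth of four, the tree can carve the plane into at most 2^4 = 16 axis-aligned regions.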
Okay, so what is this actually doing? I'm going to set up a somewhat complicated plot here. What I'm really going to do is show you the training set, and then take a 200-by-200 grid of points and color every single point either green or magenta, as the model tells us. This takes a little while to draw, because it evaluates all 40,000 points. And this is what it comes up with. It looks kind of weird, right? As a human, you'd say, look, you've got a blob of magenta on the right and a blob of green on the left, and they sort of overlap. But the way the decision tree works, it has to cut along linear boundaries parallel to the axes. So it makes some vertical cut, then it cuts that half-space into smaller pieces, and so on. These are the cuts it decides to make. Using GraphViz, which is a nice graph-drawing tool, we can look at what it's actually deciding here. First of all, it makes a cut at x0 less than or equal to -0.781; that's probably this vertical line right here. Then, depending on whether that's true or false, it starts subdividing the data more and more. So ultimately it comes up with something like nine different regions, and those produce this picture. It doesn't look like nine, because some of them appear contiguous, but in fact every vertical or horizontal line here has to come from some cut. I can count maybe seven, and there may be some more refined ones in there. Of course, we can predict on the testing set, and it actually does surprisingly well; if we do a cross-validation, it gets a reasonable accuracy. It's not great, but it's okay. Again, the one thing I want to point out about decision trees is that they're very chunky, right?
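The grid-coloring trick described above can be sketched as follows. This is a self-contained stand-in, not the notebook's code: the data, model, grid bounds, and file name are all my own assumptions.

```python
# Sketch of the boundary plot: color a 200x200 grid by the tree's
# prediction. Data and model are stand-ins so the snippet runs alone.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripting
import matplotlib.pyplot as plt
from sklearn import tree

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal([-1, 0], 1, (16, 2)),
                     rng.normal([1, 0], 1, (16, 2))])
y_train = np.array([0] * 16 + [1] * 16)  # 0 = green, 1 = magenta
T = tree.DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)

# A 200 x 200 grid covering the data: 40,000 points total
xs = np.linspace(-4, 4, 200)
ys = np.linspace(-4, 4, 200)
xx, yy = np.meshgrid(xs, ys)
grid = np.column_stack([xx.ravel(), yy.ravel()])
zz = T.predict(grid).reshape(xx.shape)

# Fill class-0 regions green and class-1 regions magenta
plt.contourf(xx, yy, zz, levels=[-0.5, 0.5, 1.5],
             colors=["green", "magenta"], alpha=0.3)
plt.scatter(X_train[:, 0], X_train[:, 1],
            c=np.where(y_train == 0, "green", "magenta"))
plt.savefig("boundary.png")
```

Because the tree only splits on one coordinate at a time, every boundary in the resulting image is a horizontal or vertical segment.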
They're very pixelated and prone to overfitting. So let's use a greater tree depth. I'm going to go back and change the tree depth; with 32 training points, a depth of five or ten should do it. Let's do a tree depth of nine and see how that comes out. With a tree depth of nine, look at this: it manages to get essentially every single point correct, but by making some very strange cuts. It made a very small vertical cut right there to isolate that one green point; I think that magenta one's on the other side of that line. So by making enough cuts, it's able to exactly classify every single point, but it's obviously overfit, because if this green point weren't there, or if it moved left or right, it would misclassify a whole bunch of things. If you look at the different leaves here, it's Gini index of zero, Gini index of zero, zero, zero, zero, zero. So based on this, it has completely isolated all of the points; they're all perfectly separated. This is extremely overfit. Again, the reason I say it's overfit is that if you were to jiggle any individual point, the model would change dramatically. If that green point weren't there, or if that magenta point were a little bit to the right, you'd have to do a whole different fitting process over here. It's really extremely overfit. And that keeps happening. So let's add some more points: I'm going to up the dataset size here to 200 points per class and plot those. Here are those points. Now that I have a total of 400 points, a depth of nine should still overfit, shouldn't it? Let's make it a depth of 10 just to be sure it's clearly overfit. Again, overfitting is bad, but I just want to demonstrate what's going on here. Ah, that's the wrong dataset; I didn't rerun these cells. All right, let me rerun these. I think I just skipped over a couple of cells here.
There are 320 training points. Sorry about that; it always happens in a Jupyter notebook, because you forget to rerun a cell. So here you can really see overfitting happening. Obviously, by random chance, there are some magenta points here and there, and with a tree depth of 10 you can get up to 1,024 different subregions. And man, it's really trying to get every single magenta and green point isolated. It's trying really hard, and that's just absurd: there's no reason you should have a model that fits that tightly to these individual random points. So decision trees are fairly simple, they're fairly dumb, and they overfit very easily. Let me demonstrate a similar phenomenon now for regression trees. So let's use a tree for regression, and again, to be able to visualize it, I'm going to have one input variable and one output variable. Let's make a sine wave. If you look at how I generate this, I took 100 points from 0 to 4π, took their sine, and then added a little bit of noise, roughly 20% noise relative to the signal. Let's plot that; it looks like this, a noisy sine wave. Now we're going to fit a decision tree regressor to this, and let's use a max depth of two so that we can see the tree. So we're going to cut this in half and then in half again. Well, not exactly in half, but cut it, and then cut each of those pieces again, to do a fit. Let's plot that. Here's the plot. So it cuts somewhere, then it cuts somewhere else, and somewhere else. I've drawn this as line segments, but you can see that on this piece it's taking the average, on that piece it's taking the average, and so on. If you look at the actual decision tree itself, it's cutting at x0 less than or equal to 2.96, so it's cutting about here, and then it's cutting at 0.8, and so on.
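The regression-tree demo above can be sketched like this. Again, this is my own reconstruction under stated assumptions: the seed, variable names, and exact noise scale are guesses at what the notebook does.

```python
# Sketch of the regression demo: noisy sine wave, depth-2 regression tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# 100 points on [0, 4*pi], sine plus ~20% noise
x = np.sort(rng.uniform(0, 4 * np.pi, 100)).reshape(-1, 1)
y = np.sin(x).ravel() + 0.2 * rng.normal(size=100)

# Depth 2: cut once, then cut each piece once more -> at most 4 leaves
reg = DecisionTreeRegressor(max_depth=2)
reg.fit(x, y)

# Each leaf predicts the mean of its training points, so the fitted
# curve is piecewise constant with at most 4 distinct levels
grid = np.linspace(0, 4 * np.pi, 1001).reshape(-1, 1)
y_hat = reg.predict(grid)
print(np.unique(y_hat))
```

The piecewise-constant output is exactly why the plot looks like a staircase of horizontal segments.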
And each time, it's trying to minimize the mean squared error on that subregion; that's how it's choosing these cut points. Now we can get a better model if we increase the depth. So let's do a depth of, I don't know, six, so roughly 2 to the 6 different pieces. And here, again, you really see the overfitting happening. Because of a couple of outlying points, it's making the model jump up and down to try to fit those points. So this is what overfitting looks like if you're doing regression: it's trying to fit each individual point. Some of them it skips, because it doesn't quite have enough depth. But if we make the tree depth even higher, say a tree depth of 10, it's extremely overfit: it's trying hard to hit every individual point by slicing off a region and taking a new average. So how do we improve this overfitting problem? Well, there are three main tools. One is called bagging, which is short for bootstrap aggregating; that's a portmanteau that kind of irritates me, to be honest. But it's bootstrapping, in other words taking subsamples, fitting models to them, and averaging the results. There are random forests, which are a more clever way of doing that bagging process, where you keep the individual trees from interacting too strongly. And then there's boosting, where you grow the model gradually, stage by stage, focusing each new stage on the things the previous ones got wrong. The idea is that the slower you learn, the less you're going to overfit, or at least the less quickly you will overfit. So now let me show a very simple version of bagging. I'm not going to use any fancy library here; I'm just going to do it by hand. Okay, so I'm going to set up a very similar system: import numpy, import scikit-learn, matplotlib, et cetera. In fact, this time I'm going to restart the kernel so that we know how it's going to run. I'll do a similar model: 100 random points between 0 and 4π, with some noise added to a sine wave.
So here is our signal; here's the training set. We're going to do a regression tree. Here's what I'm going to do: I'm going to take those 100 points and draw samples from them, with replacement. I'm going to call these bags, since the process is called bagging. So from the numbers 0 to 99, I draw 33 indices, then another 33, then another 33. And notice I'm doing this with replacement, so I'm allowing something to occur in bag 2 that was also in bag 1. Now, for each of those three subsamples, I make arrays of their predictors and responses. So x1, y1, x2, y2, x3, y3 are the three different bags of training data. To each of those bags, I fit a decision tree: tree 1, tree 2, and tree 3. Now each of those trees can predict a value, so to average over the bootstrap samples, all I have to do is take each tree's predictions and average them. I'm going to make a common testing set of equally spaced points from 0 to 4π; let's take 1,001 of them. We apply each of the three tree models to that testing data, and we simply take their outputs, average them, and call that y-hat-bagging. If we do that and plot the different models, it's a little confusing to look at, but we have the pink model, the green model, and the blue model. These are the three individual models, each built by fitting a tree to a random bootstrap sample. And if we average the values of those three different regression trees, we get the black regression curve. Now, the idea here is that the green, the pink, and the blue were built on different subsets of the data. Those subsets may overlap, but they don't have to, and of course they're fairly unlikely to overlap very much. So averaging them should, in some sense, wash out the overfitting.
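The by-hand bagging procedure described above could look roughly like this. It's a sketch under the same assumptions as before (seed, names, bag size of 33); the notebook's actual code may differ in details.

```python
# By-hand bagging sketch: three bootstrap "bags", three depth-4 trees,
# and an averaged prediction on a common test grid.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 4 * np.pi, 100)).reshape(-1, 1)
y = np.sin(x).ravel() + 0.2 * rng.normal(size=100)

# Draw three bags of 33 indices each, WITH replacement, so the same
# point can land in more than one bag (or twice in the same bag)
bags = [rng.choice(100, size=33, replace=True) for _ in range(3)]

# Fit one regression tree per bag
trees = [DecisionTreeRegressor(max_depth=4).fit(x[idx], y[idx])
         for idx in bags]

# Common testing set: 1,001 equally spaced points on [0, 4*pi]
grid = np.linspace(0, 4 * np.pi, 1001).reshape(-1, 1)

# Average the three trees' predictions pointwise
y_hat_bagging = np.mean([t.predict(grid) for t in trees], axis=0)
```

Plotting each tree's prediction alongside `y_hat_bagging` would reproduce the pink/green/blue curves and the black averaged curve described above.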
At the same time, it's actually an enhanced model in terms of intervals, because in this case I built all three trees with depth four, so they each have at most about 16 different intervals. But if you think about it, when I average trees that are cut at different points, like where green, pink, and blue make their cuts in different places here, you can get more refined information in black. Now, that might lead to overfitting, like it sort of does right here. But typically, as you know from statistics, if you have three independent and identically distributed random variables, say x1, x2, and x3, and you take their average, the variance goes down by a factor of the number of samples you've taken; here, that would be three. So in a theoretical sense, the variance of this bagging model, being the average of three models of the same variance, should be about a third of the variance of any one of them. So we expect the variance to go down in the bagging model. And that's all bagging is: taking bootstrap samples of your data, meaning different samples with replacement, fitting a model to each one, and then averaging those models. You can do the same thing for classification. Of course, what do you mean by averaging in classification? You usually interpret that as voting. So if you have three different bootstrap models drawn from your data, and they have, say, a green-versus-blue decision to make, you simply ask them, on any input point, to vote green versus blue, and the majority of the three wins. So averaging for classification is essentially voting. Whether that's plurality voting or something more sophisticated is up to you, but the notion of averaging the sub-models still makes sense. The other methods, random forests and boosting, are a little bit more sophisticated.
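The voting version of bagging for classification can be sketched like this. The data, seed, and the `vote` helper are my own illustrative choices, not anything from the notebook.

```python
# Bagging for classification: averaging becomes majority voting
# among models fit to bootstrap samples.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-1, 0], 1, (20, 2)),
               rng.normal([1, 0], 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)  # 0 = green, 1 = magenta

# Three bootstrap models: each is fit to a sample drawn with replacement
models = []
for _ in range(3):
    idx = rng.choice(len(X), size=len(X), replace=True)
    models.append(DecisionTreeClassifier(max_depth=4).fit(X[idx], y[idx]))

def vote(point):
    """Plurality vote of the three bootstrap models on one input point."""
    votes = [m.predict([point])[0] for m in models]
    return max(set(votes), key=votes.count)

print(vote([2.0, 0.0]))
```

With only two classes and three voters, the plurality vote is just a simple majority; with more classes, you would take whichever class gets the most votes.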
But in the end, they come down to the same basic idea of taking bootstrap samples and trying to somehow combine them in a way that's useful and decreases variance.