In the previous video, we saw how decision trees are used for classification problems. In this video, we'll talk about decision trees for regression problems. Suppose you have input data X and responses y, where again X is n by p: you have p predictors, in this case one response variable, and n data points total. And of course the goal for regression is to find a function f that predicts y from x by minimizing the residual sum of squares between y and f(x).

In this case, so that I can draw a picture that makes sense with a marker, let's suppose I only have one input variable, call it x-naught, and a single response variable y. So here's a function, something like a sine wave, that varies with the input data. What we're going to do is build a decision tree, and again we're going to look for threshold cutoffs. But instead of scoring a cut with entropy or the Gini index, we'll choose the cutoff that minimizes the residual sum of squares error. And the way we choose f is very simple: once a cut defines an interval or region, we take f on that region to simply be the average of the y values inside it.

So let me just go through this by eye and we'll see how it works. Looking at this data, if I wanted to model it, well, I could model it with a single horizontal segment going all the way across. If I use the entire domain, that constant is actually the value of f that minimizes the residual sum of squares; of course it's extremely biased and has low variance. But if I cut maybe right about here, making the decision "x is greater than or equal to 3.25" my first split, then on this side of the cut I take the function to be the average of the y values inside that region, and over here we choose this value. So after one layer of the tree, the function is a piecewise function with two horizontal components.

Then inside each of those regions, I look for a place to cut that further reduces the residual sum of squares in that region. Maybe that happens right here, so I make a cut, replace this segment with these two, and this one drops down a little bit. Over here, maybe right about here we cut, and we get an average there and an average here, which is up a tiny bit. To emphasize this, let me mark it in a different color: we now have this model for our data, and that's a tree of depth two. Let me indicate that in our chart. If the first condition is false, then we test the condition "x is greater than or equal to 1", which divides these two regions; over here, on the other branch, we test "x is greater than or equal to 7.5", roughly. So now we have a total of four regions to consider.

Of course you can keep doing this, right? Cut this region and draw a segment on each side, cut that one and do the same, and keep cutting the regions into smaller and smaller pieces, taking an average in each subregion. You end up with a tree of depth D, and the function has two to the D pieces. Now, unlike a linear model, a logistic regression model, a polynomial model, or many of the other models we like to consider, this model is in some sense never smooth. In fact, what it's actually pushing us towards is the classic idea of a Riemann integrable function.
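To make the procedure concrete, here is a minimal sketch of this greedy splitting on one input variable, assuming NumPy is available; the names `best_split`, `fit_tree`, and `predict_one` are illustrative, not from any particular library. Each cut is chosen to minimize the residual sum of squares when both sides are fit by their means, exactly as described above.

```python
# Minimal regression-tree sketch for one input variable (illustrative only).
import numpy as np

def best_split(x, y):
    """Return the threshold whose cut minimizes the residual sum of squares
    when each side is fit by the mean of its y values."""
    best_t, best_rss = None, np.inf
    for t in np.unique(x)[1:]:                 # candidate thresholds
        left, right = y[x < t], y[x >= t]
        rss = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if rss < best_rss:
            best_t, best_rss = t, rss
    return best_t

def fit_tree(x, y, depth):
    """Recursively split; each leaf predicts the average of its region."""
    if depth == 0 or len(np.unique(x)) < 2:
        return {"leaf": True, "value": y.mean()}
    t = best_split(x, y)
    mask = x < t
    return {"leaf": False, "threshold": t,
            "left": fit_tree(x[mask], y[mask], depth - 1),
            "right": fit_tree(x[~mask], y[~mask], depth - 1)}

def predict_one(tree, xi):
    while not tree["leaf"]:
        tree = tree["left"] if xi < tree["threshold"] else tree["right"]
    return tree["value"]

# Example: a depth-2 tree gives a four-piece horizontal fit to a sine wave,
# like the picture drawn in the video.
x = np.linspace(0, 10, 100)
y = np.sin(x)
tree = fit_tree(x, y, depth=2)
y_hat = np.array([predict_one(tree, xi) for xi in x])
```

With depth 2 the fitted function has at most four horizontal pieces; increasing the depth doubles the number of pieces at each level, which is the piecewise-constant, Riemann-sum-like behavior discussed next.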
If our original data did have a smooth function underlying it, this is replacing that function by a sequence of horizontal line segments that get finer and finer, and in each region it gives you the average of the data points. So it's a lot like the rectangle sums, the Riemann sums you'd use in, say, a calculus two course. In some sense, what decision-tree regression is doing is building those Riemann-sum segments: piecewise constant functions that approximate whatever function you give it. And since the world of integrable functions is quite vast, at least for the purposes of applying this to data, you can get very intricate models.

However, just like in the classification case, it's very easy to overfit. If I only have maybe a hundred data points here, then if I go beyond a depth of, say, seven or eight, I will be pinning down every individual data point exactly. Now that looks fine in a case like this, but imagine a situation where my data is a little noisy. After all, data is always a little noisy, so maybe it's really supposed to look like this. If you keep increasing the depth, you will gradually fit every individual data point, and your function will jump around to catch every single point. That jerkiness, that pixelated look of the graph, is the indication that your model's variance is too high and it is therefore badly overfit. Even though you can drive the residual sum of squares error to zero with sufficient tree depth, that's really not a good model, because a new data point won't get an accurate prediction: you're not averaging over a wide enough region. (A short code sketch illustrating this is at the end of this section.)

So trees are extremely simplistic. In the classification case you're simply looking for cutoffs in each predictor variable that minimize the Gini impurity or the entropy, and in the regression case you're looking for cutoffs in each input variable that minimize the residual sum of squares error on each piece. But it's too easy to overfit, and too easy to say nothing sensible about data you haven't seen. And if you don't have very many data points, it's too easy to make the tree so deep that, because the number of pieces of your function is exponential in the depth, you hit every data point individually without actually getting a useful function out of it. You're simply labeling where the data points were, which isn't really regression; it's just a label.

Okay, so this is the overall point: basic decision trees are extremely naive. They're very simple to implement, of course; it's a simple optimization problem. But it turns out they're actually quite powerful if you combine them. So what we're going to do next is imagine building a hundred or a thousand different trees on different subsets of the data in a bootstrap-type procedure and averaging them. We can do a bootstrap aggregation of our data, which is called bagging; we can combine the trees in a slightly more sophisticated way, which is called random forests; or we can build them gradually, which is called boosting. So we'll talk about bagging, which is bootstrap aggregation, random forests, and boosting in the subsequent videos. After we talk about those topics, we'll move on and see what this idea has to do with the idea of a neural network.
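As a small illustration of the overfitting point above, here is a sketch assuming scikit-learn and NumPy are installed: as the maximum depth grows, the training residual sum of squares heads toward zero while the error on fresh noisy data from the same sine curve stops improving or gets worse.

```python
# Illustrative only: deep trees pin down the noisy training points exactly
# but do not generalize to new noisy samples from the same curve.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 10, 100)).reshape(-1, 1)
y_train = np.sin(x_train).ravel() + rng.normal(0, 0.3, 100)   # noisy sine
x_test = np.sort(rng.uniform(0, 10, 100)).reshape(-1, 1)
y_test = np.sin(x_test).ravel() + rng.normal(0, 0.3, 100)

for depth in [2, 4, 8, 16]:
    tree = DecisionTreeRegressor(max_depth=depth).fit(x_train, y_train)
    rss_train = ((tree.predict(x_train) - y_train) ** 2).sum()
    rss_test = ((tree.predict(x_test) - y_test) ** 2).sum()
    print(f"depth {depth:2d}: train RSS {rss_train:7.2f}, test RSS {rss_test:7.2f}")
```

The shrinking training RSS alongside a flat or rising test RSS is exactly the high-variance, jumpy behavior described above, and it is what bagging, random forests, and boosting are designed to tame.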