In this video, we'll talk about decision trees as used for both classification and regression. The idea of a decision tree is absurdly simple, and by itself it's actually a fairly bad method to use for essentially any machine learning problem. However, it serves as the foundation for certain ensemble methods that we'll talk about next, and also as a foundation for neural network methods, and those are extremely powerful. But for a simple classification problem, here's the idea of a decision tree. Suppose we have n data points in a p-dimensional space, so we've got p different predictor variables. In this case, we have two predictors, x0 and x1, and each data point has a response, in this case a class, which is either a green circle or a purple x. The idea of a decision tree is to sweep through each variable for a threshold or cutoff value, and choose the cutoff that best separates the data between those two classes. So in this case, what I would do is sweep across the x0 variable, looking for the best vertical line that separates the data, and I'd also sweep across the x1 variable, looking for the best horizontal line that separates the data. Of those two, I'll choose the better one. It looks to me like that is probably right about here. Let's put some numbers on that cutoff threshold: one, two, three, four, five. Maybe this is x0 greater than or equal to 3.75. That's our first decision cutoff, and it has divided our space of predictors into two half-spaces. Now what we'll do is, in each of those half-spaces, sweep through and find a new cutoff or separator. So in this right-hand half-space, I'll sweep through it both horizontally, to make vertical cutoffs, and vertically, to make a horizontal cutoff, and look for a nice division line. That looks right about here. Let's put some numbers here: maybe that's 5.5. So x1 greater than or equal to 5.5 is the new cutoff there. And in this other half-space, we have another cutoff. This one's a lot trickier. Let's see, maybe the best we can do is cut it this way to get those green ones clustered a little bit, so this side's slightly more purple and that side's slightly more green. So if we're to the left, then we'll make another cutoff, which is x0 greater than or equal to 3.25. As you can see, what you're going to end up doing here is cutting the space along each axis and going down recursively to make the regions smaller and smaller and smaller. I won't write out the entire tree for the sake of time, but you can see that here we're actually pretty happy; in this half-space, we'll maybe want to cut there, in this one maybe we'll want to cut there, and there, and there, and so on. And you can keep going down as far as you want. So if you have a tree of depth d (in this case I've gone down about three levels), you of course end up with up to 2 to the d regions as a result. So this is the entire idea of a decision tree.
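To make that sweep concrete, here is a minimal from-scratch sketch; it's my own illustration rather than code from the lecture, and it's deliberately naive. It scores each candidate cutoff by how many points end up on the "wrong" side when each half-space is labeled by its majority class; the Gini index and entropy discussed later in the video are the more standard ways to score a cut.

```python
import numpy as np

def best_axis_aligned_split(X, y):
    """Sweep every predictor and every candidate cutoff; return the
    (feature index, threshold, errors) of the cut whose two half-spaces
    misclassify the fewest points when each half-space is labeled by its
    majority class.  y is assumed to hold integer class labels (0, 1, ...)."""
    n, p = X.shape
    best_j, best_t, best_errors = None, None, n + 1
    for j in range(p):                    # sweep each predictor x_j
        for t in np.unique(X[:, j]):      # candidate cutoffs at the observed values
            on_right = X[:, j] >= t
            sides = (y[on_right], y[~on_right])
            if sides[0].size == 0 or sides[1].size == 0:
                continue                  # this cutoff doesn't actually split anything
            # errors = points that disagree with the majority label on their side
            errors = sum(s.size - np.bincount(s).max() for s in sides)
            if errors < best_errors:
                best_j, best_t, best_errors = j, t, errors
    return best_j, best_t, best_errors

# Growing the whole tree just means applying this recursively to the points in
# each half-space, down to whatever depth d you decide on.
```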
You can see, however, that it has at least three fundamental problems immediately. Perhaps the most fundamental problem, from a data point of view, is that it requires the cuts to be aligned with the predictor axes. And as we know from our discussion of principal component analysis, it's often true that interesting data is not recorded in the coordinates that reveal the most interesting clustering or classification. So if the variables you recorded, x0 and x1, happen not to be the most relevant ones for your data, cutting along hyperplanes aligned with x0 and x1 isn't all that useful.

The second issue, which is related to the first one, is that those cuts necessarily give you little rectangular regions. And if your data lies in circles or blobs or some non-rectangular shape, you're going to need a lot of tree depth to fit something that runs along a diagonal, for example. You have to make a little jagged edge, which is sort of what we're getting right here: we're gradually building a jagged edge to separate the purple from the green, approximating it with little horizontal and vertical cuts.

Related to that is the issue of tree depth. Here I have roughly, say, 5, 10, 15, 20, 30; maybe 30 or so points. I didn't actually count them. But obviously, if I have a tree of depth, say, 5 or 6, then 2 to the 5th is 32 and 2 to the 6th is 64, and in that sort of situation I could easily overfit my data, because I could make little rectangular regions, each of which contains a single training point. So with a single decision tree, it's extremely easy to overfit the data, and the regions are not really aligned to the data in any sane way.

But that's the idea. It's very simple, it gives you a logic tree, and it's actually fairly fast. If you think about it, what we have to do to build this is, at each layer, sweep through each variable. So if we're going to build a tree of depth d, then at each layer I have to sweep through the p different predictors. And how hard is it to find that division line? Well, if I have a way of measuring how different the data is versus x0 or x1, so if I have some measurement of the mixing of the data as you sweep across, then in fact all I have to do is, for each location where there is a data point, do some sort of root-finding method to find the point at which the difference is maximized. So you can think of it as a recursive Newton's method or a binary tree search on the data points. So if you already have your data organized so you know where it is in space, and you don't have to do the geometric analysis from scratch to compute those differences, then finding the cut is of course a tree search, so that would be a log-in-the-number-of-points type of method. If you do have to compute how mixed up the data is, maybe there's another factor of n in there to compute, say, how many purple and green data points are in each region. But generally this is actually reasonably fast: your number of predictors is, of course, fixed; the depth of your tree you decide ahead of time, usually fairly small so that you don't overfit; and if you have n points, well, you can increase n quite rapidly and the speed of the algorithm isn't too bad.

The one final remark to make about classification is: what do you use to measure how mixed the data is? There are two primary measures here. Let me change the sheet. So, how to measure mixing or difference. If you think about that previous page, we're really trying to essentially segregate the data. Now, there are two primary measurements that we use. One is called the Gini index, which comes out of statistics and economics. It was originally used to study things like wealth inequality in different countries: you want to compare, for example, what portion of the population is below a certain income versus what portion is above a certain income. The other measure, which comes out of statistical physics, is entropy.
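The lecture doesn't write out the formulas, but as a small sketch, here are the standard definitions from the machine-learning literature applied to the class labels of a single region. In this convention both quantities are zero for a pure region and largest for an evenly mixed one, so the split search minimizes their size-weighted sum over the two half-spaces.

```python
import numpy as np

def gini_impurity(y):
    """Gini impurity of one region: 1 - sum_k p_k**2, where p_k is the fraction
    of points in class k.  Zero for a pure region, 1/2 for two evenly mixed classes."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(y):
    """Shannon entropy of one region: -sum_k p_k * log2(p_k).  Zero for a pure
    region, 1 bit for two evenly mixed classes."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Example: a perfectly mixed region vs. a pure one.
print(gini_impurity(np.array([0, 0, 1, 1])), entropy(np.array([0, 0, 1, 1])))  # 0.5 1.0
print(gini_impurity(np.array([1, 1, 1, 1])), entropy(np.array([1, 1, 1, 1])))  # 0.0 -0.0
```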
So with either the Gini index or entropy, suppose you have data spread out in some way in different classes; maybe I'll have my O's over here and my X's over here. For any given partition of this data set, you can ask: what is the Gini index, and what's the entropy? If I were to cut this data this way, each subset would actually still be approximately evenly mixed; in fact, the way this works is that each subregion would have high entropy and a low Gini index. (Note that the Gini impurity used in most machine-learning libraries is defined the other way around, so that a mixed region scores high and the split search minimizes it; the lecture is using the index in its original economics sense, where a well-segregated, "unequal" region scores high.) But you can see that you really want to segregate the data as much as possible, so you'd actually want to cut it roughly here instead. And now if you compare these two different cells, each of them has minimal entropy and maximum Gini index. You can go look up the definitions of these online; I don't want to go through all of them right now. So there's a function you can measure which depends on the location of the cut, and that function is something you want to optimize: you want the cut that leaves the resulting regions as segregated, as unmixed, as possible. And therefore finding these hyperplane cuts for a decision tree is essentially an optimization, or root-finding, problem from calculus. In the next video, we'll talk about how decision trees are used for regression. And following that, we will study how to combine multiple decision trees into ensemble methods that are much more powerful.
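One last practical aside (this is my own usage sketch, not part of the lecture; it assumes scikit-learn is available and uses a made-up two-class data set): in practice you'd let a library do the sweeping, with the mixing measure and the depth chosen up front, and the depth limit is how you guard against the overfitting problem described above.

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

# A small synthetic two-class data set, standing in for the green circles and purple x's.
X, y = make_blobs(n_samples=30, centers=2, n_features=2, random_state=0)

# criterion can be "gini" or "entropy"; keeping max_depth small stops the tree
# from carving out one tiny rectangle per training point.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
print(clf.predict(X[:5]))
```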