In this video, I want to talk about subset selection, and specifically the Lasso-LARS algorithm and what it has to do with the curse of dimensionality.

So here's the general setup. Suppose you have a collection of data, such as this orange point cloud, and you want to fit a linear model to the data. Well, there are many, many choices. Of course, the Gauss-Markov theorem tells us the best choice in terms of accuracy, but there are other models one could work with. For example, one could use the zero model, where the coordinate y, in this case, depends on x1 and x2 as the constant function zero. One could imagine gradually adjusting the coefficients of that model to account for more and more of the data. For example, one could measure that using the fraction of variance explained, which is based on the RSS error of the model. So instead of thinking about a particular model being a best fit, I want you to think about having a family of models that live in some space, some space of hyperplanes in this case, and we swing through that space looking in different directions, going from models that explain very little of the data to models that explain most of the data.

When we think about a linear fit, we usually have an array of predictors x, which is n entries with p predictors each, and an array of responses y, which is n entries with a single response each, and we're trying to figure out how y depends on x. The goal is to find a linear model, which is really p coefficients m, where y is approximately x times m.

Now, we can measure the quality of a model using the residual sum of squares. The residual is the actual y value minus the predicted value xm. This gives us a vector in n-dimensional space, one entry for each data point. The residual sum of squares is simply the sum of the squares of those residuals; it's the squared L2 norm of the error between the actual and predicted values across your data set. And the fraction of variance explained is a quantity derived from the residual sum of squares. A value of one means that all of your data is explained perfectly by the model: there's zero residual, and all the variance of the response y is accounted for. A fraction of variance explained of zero means, effectively, that your model is worthless: it explains none of the variation of your response variable y. The classic case of a model which explains zero of the variance is the zero model.
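To make those two error measures concrete, here's a minimal NumPy sketch. This is my own illustration, not code from the video; I center y in the usual R-squared convention, so the zero model scores exactly zero when the response has mean zero, matching the convention above.

```python
import numpy as np

def rss(X, y, m):
    # Residual vector: actual minus predicted, one entry per data point.
    residual = y - X @ m
    # RSS is the squared L2 norm of that residual vector.
    return float(residual @ residual)

def fraction_of_variance_explained(X, y, m):
    # 1 - RSS / total sum of squares.  Centering y is the usual R^2
    # convention; for a mean-zero response the zero model scores exactly 0.
    total = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - rss(X, y, m) / total

# Tiny demo: n = 100 points, p = 2 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
m_true = np.array([2.0, -1.0])
y = X @ m_true + 0.1 * rng.normal(size=100)

print(fraction_of_variance_explained(X, y, np.zeros(2)))  # ~0: the zero model
print(fraction_of_variance_explained(X, y, m_true))       # ~1: near-perfect fit
```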
Okay, so now let's think about that space of models. We start off with the zero model, indicated by this blue dot at (0, 0) in model space. Maybe the horizontal axis here is the coefficient m1, and the vertical axis is the coefficient m2, for a model y = m1 x1 + m2 x2. The Gauss-Markov theorem tells us that the classic linear fit is the optimal minimizer based on the RSS measurement. So our goal is to somehow find a way from the zero model, which explains nothing, to the Gauss-Markov linear fit model, which explains as much of the data as possible. In other words, we have a sort of energy landscape in model space, and we want to find the optimum point in that landscape. The closer you are to the dot in the upper right, the better your model is at explaining the data.

Okay. So the best fit model should minimize error, and we can derive an exact formula for it. These are the classic normal equations for a linear model, which you can derive simply using multivariable calculus. To find the best fit model, you solve the system x-transpose-x times the model equals x-transpose-y. That's a p-by-p linear system, and solving it gives you the best fit model from the Gauss-Markov theorem.

An important question, however, is: is that really the best model? Suppose you're in a situation where you have 10,000 variables to work with, and the coefficients of the best fit model are all over the place. Now, as a human, it's very hard to process what the model is telling you when it involves, say, 10,000 variables, and some of the coefficients are very large. Say x10000 in this case has a lot of effect, whereas some variables, say x2, have very little effect on the ultimate answer. So one can imagine gradually throwing away the variables that matter less and less to the model, until maybe eventually you simplify your model so much that it involves only the top two or three variables that really matter. Anything else is effectively noise. After all, a variation in x10000 has roughly, oh, I don't know, a million times more importance to the model than x2. Of course, one can take this too far and use the zero model: if you dispose of all your variables, you can't explain very much of the data.

So in real life, when you're trying to model a system, even with a linear model, there's a key choice to make, and that's finding a balance between accuracy and simplicity. The Gauss-Markov theorem tells you the most accurate model in terms of prediction, but often a simpler model is more instructive about how to proceed. For example, maybe we're talking about marketing data. There's lots and lots of data one can collect about customers. But ultimately, if, say, you're a shoe store, you might key in on just a couple of factors, such as what the household income is and how much the person spent on clothing in the last six months. Those probably tell you a lot more about their buying habits for shoes than which articles they've read on various news websites and other social media information. It might be that just a couple of factors give you the information you really need to solve your problem.

So in fact, even when we're fitting a linear model, it's never just a question of accuracy. It's a question of accuracy versus simplicity, and finding a balance. What we want is a way to measure the simplicity of a model, so that we can balance accuracy against simplicity. So here's our new goal: minimize the residual sum of squares error of a model m, but minimize only among all models of a given complexity.
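Here's a short NumPy sketch of the normal-equations solve described above, again my own illustration rather than code from the video. It forms the p-by-p system and solves it; in practice np.linalg.lstsq or a QR factorization is numerically safer, since forming x-transpose-x squares the condition number.

```python
import numpy as np

def normal_equations_fit(X, y):
    # Form the p-by-p system (X^T X) m = (X^T y) and solve for m.
    # Caution: forming X^T X squares the condition number, so in practice
    # np.linalg.lstsq (or a QR solve) is numerically safer; this version
    # just mirrors the formula from the video.
    XtX = X.T @ X
    Xty = X.T @ y
    return np.linalg.solve(XtX, Xty)

# Sanity check against NumPy's built-in least-squares solver.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + 0.1 * rng.normal(size=100)

m_hat = normal_equations_fit(X, y)
m_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(m_hat, m_ref))  # True
```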
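As a preview of where the video is heading, here's a sketch of that accuracy-versus-complexity tradeoff using scikit-learn's lars_path (assuming scikit-learn is available; the algorithm itself hasn't been explained yet, so treat this as an illustration, not the video's own code). It sweeps from the zero model toward the full least-squares fit, one variable at a time, printing the complexity and the fraction of variance explained at each step.

```python
import numpy as np
from sklearn.linear_model import lars_path

# Synthetic data where only 2 of 10 variables actually matter;
# the other 8 are effectively noise.
rng = np.random.default_rng(2)
n, p = 200, 10
X = rng.normal(size=(n, p))
m_true = np.zeros(p)
m_true[[0, 3]] = [3.0, -2.0]
y = X @ m_true + 0.5 * rng.normal(size=n)

# lars_path traces a path through model space from the zero model
# toward the full fit; coefs[:, k] is the coefficient vector at step k.
alphas, active, coefs = lars_path(X, y, method="lasso")

total = np.sum((y - y.mean()) ** 2)
for step in range(coefs.shape[1]):
    m = coefs[:, step]
    rss = np.sum((y - X @ m) ** 2)
    print(f"step {step}: {np.count_nonzero(m):2d} nonzero coefficients, "
          f"fraction of variance explained = {1 - rss / total:.3f}")
```

The printout makes the balance visible: the first couple of steps, using only the variables that really matter, already explain most of the variance, and the remaining steps buy very little additional accuracy at the cost of a more complicated model.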