In the previous two videos, we saw that when finding a linear model, we want to balance simplicity against accuracy. We also saw that accuracy is judged by the residual sum of squares, which is a quadratic function of the model, and that the simplicity of a model can be judged using the L1 norm, because in high dimensions the L1 norm is very similar to the L0 norm, which simply counts the number of nonzero terms. The Lasso algorithm is what we have that balances these two against each other. There are three ways to think about it.

The first way is a constrained optimization problem that's familiar from multivariable calculus. Imagine that you fix a certain complexity, that is, a certain L1 norm of your model. Then among all models with that complexity, you find the one which minimizes the residual sum of squares. So subject to a linear constraint, or rather a piecewise linear constraint, you want to minimize a quadratic function. On the left, here's a picture of this. Here we have our model space. The blue point in the lower left is the zero model, which is certainly the simplest possible. The blue point to the upper right is the Gauss-Markov model, which minimizes the residual sum of squares error: it's the most accurate model. Now, to constrain the L1 norm means, for example, to stay within this red box. You want to find the model that gets closest to the minimum of the residual sum of squares while staying inside the red box; in this case, it looks like it's the upper corner of the box. From this perspective, fixing a budget on one quantity and minimizing the other is familiar from multivariable calculus as either a Lagrange multiplier problem or, to find the corners, a linear programming problem.

There's another way to think about this, which is more akin to the way we do physics. The residual sum of squares is a quadratic function, which you can think of as an energy that you want to minimize.
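To make the constrained picture concrete, here is a small sketch in Python, not from the video; the function names and the numerical setup are my own. It minimizes the residual sum of squares by gradient descent while projecting back onto the L1 ball of radius t after each step, so every iterate stays inside the "red box."

```python
import numpy as np

def project_l1_ball(v, t):
    """Euclidean projection of v onto the L1 ball of radius t."""
    if np.abs(v).sum() <= t:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]           # magnitudes, largest first
    css = np.cumsum(u)
    # largest k with u_k > (css_k - t) / k  (standard simplex-projection rule)
    k = np.nonzero(u * np.arange(1, len(v) + 1) > css - t)[0][-1]
    theta = (css[k] - t) / (k + 1)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def lasso_constrained(X, y, t, n_iter=500):
    """Minimize RSS(m) subject to ||m||_1 <= t, by projected gradient descent."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1/L, L = largest eigenvalue of X^T X
    m = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ m - y)            # gradient of (1/2) * RSS
        m = project_l1_ball(m - step * grad, t)
    return m

# toy data: true model is (1.5, 0, -0.5), so its L1 norm is about 2
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.5, 0.0, -0.5]) + 0.01 * rng.normal(size=40)

m_budget = lasso_constrained(X, y, t=1.0)    # tight budget: constraint is active
m_free = lasso_constrained(X, y, t=100.0)    # loose budget: Gauss-Markov model
```

With a loose budget the red box contains the unconstrained minimum, so the answer is just the least-squares model; with a tight budget the iterates are pinned to the boundary of the box, typically at a corner.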
On the other hand, the L1 norm is a piecewise linear function that you also want to minimize. So the question is: if I want to balance these two things, let me choose a way in which to balance them, and then just minimize the combined energy function. Choose a parameter T, a weight on how important you think the L1 norm, the simplicity, is versus the accuracy, and look at the joint function, the total energy, E_T of M, which is one-half the residual sum of squares plus T times the L1 norm.

This is very reminiscent of equations in physics for total energy, whether for wave functions or something else. The first term, one-half RSS of M, is one-half times a quadratic function, just as kinetic energy, one-half m v squared, is one-half times a quadratic. The L1 norm is more like a potential energy. So we decide how important potential energy minimization is versus kinetic energy minimization, and we look to minimize the combined energy of the entire system across those two types of energy.

If you choose T equal to zero as your parameter, it means you only care about minimizing the residual sum of squares, and you will arrive at the Gauss-Markov model. If you choose T to be very, very large, as T tends to infinity, you are saying that the residual sum of squares does not matter, and what really matters is the simplicity of the model. But of course ultimately you want to find some balance. So we have two valleys here, and we can decide which one is deeper relative to the other by adjusting the parameter T. As we do so, we move through a trajectory of optimal points, where some particular balance of the quadratic and piecewise linear terms lands us at a minimum of the total energy.
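Here is a minimal sketch of minimizing this total energy directly, assuming a numpy setup and cyclic coordinate descent; the name lasso_cd is my own, not from the lecture. Each coordinate update is an exact one-dimensional minimization of E_T, which works out to a soft-thresholding step, and the two extremes behave as described: T = 0 recovers the Gauss-Markov model, while a huge T drives every coefficient to zero.

```python
import numpy as np

def lasso_cd(X, y, T, n_iter=300):
    """Cyclic coordinate descent for E_T(m) = 0.5 * ||y - X m||^2 + T * ||m||_1."""
    n, p = X.shape
    m = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)           # ||x_j||^2 for each column j
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove coordinate j's own contribution
            r = y - X @ m + X[:, j] * m[j]
            rho = X[:, j] @ r
            # soft-thresholding: the exact minimizer of E_T in coordinate j
            m[j] = np.sign(rho) * max(abs(rho) - T, 0.0) / col_sq[j]
    return m

# toy data with a sparse true model
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + 0.01 * rng.normal(size=50)

m0 = lasso_cd(X, y, T=0.0)      # only accuracy matters: the Gauss-Markov model
m_inf = lasso_cd(X, y, T=1e6)   # simplicity dominates: the zero model
```

The point of the sketch is just the shape of the update: the quadratic term alone would give an ordinary least-squares step, and the T-weighted L1 term shrinks that step toward zero, snapping small coefficients exactly to zero.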
So there's some sort of path along these overlapping level sets that moves through state space, going from the simplest model to the most accurate model along a clear trajectory which depends on this parameter T. Those are two ways to think about the Lasso algorithm: one is a constrained optimization problem; the other is an energy minimization problem with a tuning parameter T. The third way is to design an algorithm that iteratively finds this path and the points where it changes. Where does a particular variable become important, and how do you solve the system overall algorithmically? It's very reminiscent of solving linear systems of equations, with a few slight adjustments. In the next video, we'll go through that algorithm line by line and think about what it means.
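As a rough preview of the path idea, here is a hypothetical sketch, with names and setup of my own choosing, that sweeps T from large to small using a simple coordinate-descent solver and records which variables are active at each stage. The exact algorithm in the next video finds the breakpoints of the path directly rather than by sweeping, but the qualitative picture is the same: variables enter one by one as T decreases.

```python
import numpy as np

def soft_cd(X, y, T, n_iter=300):
    # cyclic coordinate descent with soft-thresholding, standing in here
    # for the exact path-following algorithm covered in the next video
    m = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rho = X[:, j] @ (y - X @ m + X[:, j] * m[j])
            m[j] = np.sign(rho) * max(abs(rho) - T, 0.0) / col_sq[j]
    return m

# toy data: only variables 0 and 2 matter in the true model
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 4))
y = X @ np.array([3.0, 0.0, -1.0, 0.0]) + 0.05 * rng.normal(size=60)

# at T = ||X^T y||_inf the zero model is already optimal; below it, variables enter
T_max = np.abs(X.T @ y).max()
for T in [T_max, 0.5 * T_max, 0.1 * T_max, 0.0]:
    active = np.flatnonzero(np.abs(soft_cd(X, y, T)) > 1e-8)
    print(f"T = {T:9.2f}  active variables: {list(active)}")
```

At the top of the sweep the model is the simplest one (all zeros); at T = 0 it is the most accurate one; in between, the active set grows, tracing out the trajectory from one blue point to the other.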