This is lecture four of our Introduction to Machine Learning class. Today we're going to talk about regularization, and this may be one of the most important lectures of the entire course, maybe even the most important. The reason is that regularization is a key concept in machine learning and statistics, but also, if you're applying these methods in practice outside of machine learning itself, it's really important that you understand what regularization is, how it works, and how to apply it correctly.

Last time we introduced the concept of overfitting, and to illustrate it I used the example of polynomial regression. Let me briefly remind you what it was about. In my example from last week, we were fitting polynomials of different degrees to a small dataset with maybe 10 points in it. What you can see here is that I'm increasing the degree of the polynomial. Here we're just fitting a straight line, and it doesn't fit very well. Here we're fitting a quadratic polynomial, and it fits really well. Then, if you increase the degree even further, for example to a ninth-degree polynomial, the function space becomes so rich that the fit can go through every single point in the data, and the loss on your training data is zero. But of course, in some sense, this is a really bad fit, right? Even though your error on the training set is zero and can't be better, we intuitively see that this is a really bad fit: if you're predicting something that lies in between the points, you're way off. So this is overfitting, and this is something we want to avoid. It typically happens whenever your model is too flexible for the data that you have.

Here I'm showing the coefficients that you get by fitting polynomials of different degrees to a fixed dataset. Here you see the zero-degree polynomial, which means you're just fitting an offset. Then the first-degree polynomial is a line, then a cubic polynomial, and then the ninth degree, which is the situation with the wiggly line that passes through all the points. Notice what's interesting here: you get very large coefficients, and the higher the degree of the polynomial, the larger the absolute values of these coefficients, with some positive and some negative. This happens not only in this particular situation, but very often. The reason is that if you have a rich function set that you can bend so that it passes through all your points, then typically you will need some coefficients to be very large and some very low; they compensate each other so that you get an exact fit to what you have in the data. So the conclusion is that one way overfitting manifests itself is by giving you very large coefficients in your estimate, in this case the beta hat estimate in linear regression.

So here's the key idea of regularization: we will change the model, we will change the loss function that we're using, such that the model has to pay a price for having very large coefficients. We will penalize, that's the technical term here, large coefficients, and this will make the model prefer having small coefficients. If you think about the previous slide, it will make the model prefer simpler models: models that don't produce these very wiggly lines with very large coefficients.
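To make this concrete, here is a minimal sketch of that experiment. The sine-plus-noise data and the random seed are my own stand-ins for the lecture's slide, not the actual dataset:

```python
# A minimal sketch of the polynomial overfitting demo (synthetic data,
# assumed here for illustration; the lecture's actual dataset may differ).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)                      # ~10 training points
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)

for degree in [1, 2, 9]:
    coefs = np.polyfit(x, y, degree)           # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.4f}, "
          f"max |coef| = {np.max(np.abs(coefs)):.1f}")
# Typically: degree 9 drives the train MSE to ~0 while the coefficients blow up.
```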
So in some sense, this is one way to formalize Occam's razor: we will prefer simpler models over more complicated models that fit the data similarly well. Let me now write a formula for that. We had the loss function of linear regression from previous weeks, and now I'm just adding a term. I denote it r of beta; it's some function, it can be different functions, and it's called a penalty function or penalty term. The idea is that the larger the coefficients of beta, the larger the value of this term will be; it penalizes large betas. Lambda is something that we're going to adjust later on; it's called the regularization parameter. If you set it to zero, you're not regularizing at all. If you set it to something very large, you're regularizing very strongly. So lambda is the tuning parameter, the knob that you will want to adjust, whereas r of beta is a particular function that you choose upfront.

There are several common choices for the regularization term; here are the ones we're going to discuss today. The most common, the most standard, is called the ridge penalty, or ridge regression. That's just the squared norm of the beta vector, so the sum of squared coefficients: the larger the coefficients, the larger the beta squared term. The second important choice for the penalty function is the sum of the absolute values of the beta coefficients, or, another mathematical way to write it, the so-called L1 norm of beta. When I write a one below, this means the L1 norm, as opposed to the standard L2 norm, also called the Euclidean norm, where I would write a two at the bottom right, or, if I omit it, that just means the standard L2 norm. So this is the L1 norm, the sum of absolute values. It turns out that this penalty term has very different properties; it's called lasso regression, and we're going to talk about it later. What you can also do is combine these things and add some ridge and some lasso. This has a separate name, elastic net, and it's also very often used. Other choices are possible, but these are the ones we're discussing today.
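Here is a short sketch of the penalized loss with these three penalty choices. The names, data, and the elastic net mixing parameter `alpha` are illustrative placeholders, not the lecture's notation:

```python
# A sketch of the penalized loss with the three penalty terms just listed;
# X, y, beta, lam and alpha are placeholders for illustration.
import numpy as np

def penalized_loss(beta, X, y, lam, penalty="ridge", alpha=0.5):
    mse = np.mean((y - X @ beta) ** 2)          # ordinary squared-error loss
    if penalty == "ridge":
        r = np.sum(beta ** 2)                   # squared L2 norm of beta
    elif penalty == "lasso":
        r = np.sum(np.abs(beta))                # L1 norm of beta
    else:  # elastic net: a mix of ridge and lasso
        r = alpha * np.sum(np.abs(beta)) + (1 - alpha) * np.sum(beta ** 2)
    return mse + lam * r
```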
Before we get to ridge regression, let's think more conceptually about what happens in terms of the bias-variance tradeoff when you use any penalty. It doesn't matter which penalty: it can be ridge, it can be lasso, it can be anything. This is a very high-level conceptual picture. I'm plotting lambda on the horizontal axis. Here's zero, which means this is the unregularized solution, and I'm imagining that my model is complicated enough that I'm getting significant overfitting if I don't penalize it. Let's say it's polynomial regression with a high degree, or for some other reason I have a lot of predictors in my data. So here I have a low training error, maybe even zero, but my test error is very large: whenever you get a new sample, you get a large squared error. This of course means that we are overfitting, and it also means that the model exhibits high variance, as we discussed last time.

Now we start increasing lambda, and it's helpful to think about what happens when lambda is huge. If your lambda is huge, then this penalty term entirely dominates the loss function, and it doesn't really matter anymore what happens in the data term. The model will just focus on making the penalty small, and in the limit, making the penalty small just means setting beta to zero. So what you end up predicting is just zero, whatever your training data is. Of course, this means you're not doing a good job: your error is large, and your test and training errors are around the same, because you're just predicting zero all the time. The mean squared error will be similar, and it has an asymptote here, because when lambda is very large, that's just what you end up with; that's why I'm drawing these lines horizontally here. So on this side of the bias-variance tradeoff you have large bias, because all your predictions are biased: you're just predicting zero all the time.

The interesting thing is what happens in the middle. On this bias-variance tradeoff, between high variance and high bias, what often happens is that there is a sweet spot, an optimum that yields the smallest possible mean squared error; we talked about this tradeoff last time. Another crucial thing to understand is that the training error will just keep increasing monotonically: the more you penalize, the worse you fit your training dataset. So if you look at the training error alone, you see a monotonically increasing function from here to somewhere on the right, and by looking at the training error alone there is no way to see where the sweet spot is, where the optimal lambda is. You need the test error to see that.

Okay, with this we can now talk more specifically about ridge regression, which is the most standard choice, the first choice whenever you think about regularization. The reason it's so standard is that it's so simple: we have this quadratic penalty term here, which is similar to the quadratic mean squared error term here, and, as you know from the previous lectures, this quadratic loss is mathematically very easy to analyze. So let's see how it works. What do we need to deal with ridge regression? We need the gradient in order to do gradient descent, and also, if we want to obtain the analytic solution, we need to set the gradient to zero. So we'll try that. I will always mark the penalty terms, the contributions of the penalty, in this brick color: everything in black is exactly as it was before, and the red things are new. When we differentiate the squared norm of beta, we get just beta. This either follows directly from matrix calculus, or you can see it very simply if you remember that the squared norm of beta is the sum of the squared beta_i coefficients: when you differentiate with respect to any beta_i, you get two beta_i, and when you collect them all together in the gradient, you get two times the vector beta, with lambda staying as a multiplier. And that's it.

So if we want to do gradient descent, we just use that in our update rule: we have beta, that's the old gradient, and that's the new gradient term from the penalty, and this goes into the update. I'm writing this down here because one can write it a bit differently. One can take this beta term over here, and since it is also a linear term in beta, I can collect the terms, and what I end up with is this. The difference from the non-regularized version, which of course is what you get when you put lambda to zero, so that this factor becomes one and you recover the gradient descent rule from before, is that now we have this coefficient in front of beta. And this coefficient will typically be smaller than one.
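In symbols, a sketch of that collected update rule, assuming the lecture's conventions (loss $\frac{1}{n}\lVert y - X\beta\rVert^2 + \lambda\lVert\beta\rVert^2$, learning rate $\eta$; other conventions only change the constants):

$$
\beta \;\leftarrow\; \beta - \eta\,\nabla L(\beta)
\;=\; (1 - 2\eta\lambda)\,\beta \;+\; \frac{2\eta}{n}\, X^\top (y - X\beta),
$$

so the old beta gets multiplied by the factor $(1 - 2\eta\lambda)$, which is below one.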
In fact, these values only make sense if the coefficient is smaller than one. So let's say it's 0.5. What does that mean? It means that on every gradient descent update step, you take your previous beta and you make a step; eta is the learning rate, so you make a little step in the direction of the gradient. Imagine that you are on this loss surface and you are like a ball moving on it, following the gradient downwards. But now the beta that you had gets multiplied by something below one. So you keep forgetting the beta that you had and follow the gradient more strongly, so to say. This procedure has less memory, if you want, because on every step we keep decreasing the coefficients. You take your beta as it is on a given step, you multiply all coefficients by 0.5, decreasing them, and then you go in the direction of the gradient; then you again decrease them, then you go in the direction of the gradient. At some point this balances out, and that is the beta hat solution you're converging towards. In the neural network community, this is called weight decay. It's something that neural networks also very often use; they don't call it ridge, they call it weight decay, but it's the same thing, and the terminology comes from the fact that the weights decay a little bit on each step.

So that's gradient descent for ridge regression. But we can also take the gradient at the beta hat solution, at the optimum: at the minimum of the loss function, the gradient is equal to zero. And it turns out that in this case we can actually solve this equation analytically, which is very convenient. Let me open the brackets, divide by two, and put n over here. Now I want to isolate beta hat, so that I can write beta hat equals something. How can I do that? Here I have some matrix times beta hat, and here just a number times beta hat. The trick is to say that this number is in fact also a matrix: it's the identity matrix times a number, times beta hat. So let me write it like that. This is the X transpose X matrix, and here is n lambda times the identity matrix; when you open the brackets, you get exactly the same terms. Okay, great. Now we can invert this whole thing in brackets and say that this is our solution, which again differs from the ordinary least squares solution by this extra term, and if you set lambda to zero, you get what we had before.

Two comments here. One is that this is also called a shrinkage estimator, or rather one of many possible shrinkage estimators, because you see that you are adding something positive (lambda is always positive, n is always positive) to the X transpose X matrix before you invert it. It makes sense that when you add something before inversion, the inverse gets smaller, and your coefficients will decrease, which of course is what we wanted to begin with. So everything shrinks, hence shrinkage: essentially just another term for regularization.
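As a quick numerical sanity check, here is a sketch showing that the weight-decay update converges to the analytic solution. The data, lambda, and learning rate are made up for illustration, and the conventions match the lecture's (1/n in the loss, hence n lambda in the closed form):

```python
# A sketch checking that gradient descent with weight decay converges to the
# analytic ridge solution; data, lambda and learning rate are made up.
import numpy as np

rng = np.random.default_rng(1)
n, p, lam, eta = 50, 5, 0.1, 0.05
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

# Analytic solution: beta_hat = (X^T X + n*lambda*I)^(-1) X^T y
beta_closed = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

beta = np.zeros(p)
for _ in range(5000):
    # weight-decay form: shrink the old beta, then step along the gradient
    beta = (1 - 2 * eta * lam) * beta + (2 * eta / n) * X.T @ (y - X @ beta)

print(np.allclose(beta, beta_closed, atol=1e-6))  # True once converged
```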
An interesting statement here, a theorem which I'm not going to prove, is the following. For any dataset where the X transpose X matrix has full rank, which means all singular values are above zero, which means you have more samples than features (n is larger than p, and your X matrix is a tall matrix of full rank), the optimal lambda is always strictly larger than zero. This is an interesting statement: it says that under this condition, ridge regression can always improve your mean squared error on the test set. It is not possible to know, just by looking at X, what the optimal lambda is; there's no formula for the optimal lambda, and you have to measure the test error to estimate it. However, what one can prove is that whatever the optimal value is, it is strictly positive, not even zero. So just using ordinary least squares will always be worse than regularizing at least a little bit. In some cases, of course, the optimal lambda may be so close to zero that it doesn't matter; that's fine. But in other cases, this can make a huge difference.

Here is a useful plot, I think, that makes this a bit more intuitive. Here is the formula for the ridge estimator, which I will sometimes denote beta hat lambda to emphasize that it depends on lambda. You can see it as a function of lambda: if you change lambda from zero to infinity, your beta hat changes with it, and this is something we can plot. So imagine again that this is my lambda axis and these are my coefficients. If the model has 10 predictors, I will have 10 lines here, one line per predictor. The values marked on the y-axis itself are the unpenalized ordinary least squares estimates, beta hat without lambda. Now, as you increase lambda, they shrink: they decrease, and in the limit, over there, everything goes to zero. It is not necessarily the case that each of these lines decreases monotonically, and that's why I drew a bump here, to illustrate that this can happen. What is guaranteed is that if you look at the norm of the beta vector, the sum of the squared coefficients, then as you move in this direction, the norm decreases. But in principle it's possible that the norm decreases while some individual coefficients temporarily increase before decreasing again.
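Here is a sketch of such a coefficient-path plot, again with synthetic stand-in data rather than the slide's actual 10-predictor dataset:

```python
# A sketch of ridge coefficient paths as a function of lambda (synthetic data;
# on the slide this was a plot with one line per predictor).
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

lambdas = np.logspace(-3, 3, 50)
paths = np.array([
    np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)
    for lam in lambdas
])  # shape (50, 10): one row of coefficients per lambda value

# The squared norm of beta_hat decreases monotonically with lambda,
# even though individual coefficients may temporarily increase:
norms = np.sum(paths ** 2, axis=1)
print(np.all(np.diff(norms) <= 0))  # True
```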
One note on something I omitted on the previous slides for simplicity, but now I want to mention it. I already said what happens with your beta hat and with your y hat, your predictions, when lambda goes to infinity: beta has to go to zero, so y hat also goes to zero. And this is sometimes not super convenient. One can argue that it's more convenient if your y hat in the limit goes to the average value of y in your data. That is because, imagine that you're using linear regression without a single predictor: you took all predictors out and just have the intercept left, the zeroth column of X. Then, as we discussed previously, you will just be predicting the average of y, a straight horizontal line at the average value. And that makes sense: the dumbest model you can have, given some data, is to always predict the average of the dataset. Doing worse than predicting the average just doesn't make a lot of sense. People often compare their models against the model that just predicts the average; that's what the R-squared coefficient does. So it's often convenient not to have models that perform worse than that.

One way to achieve that here is to say, remember we talked about this previously, let's imagine all our features have been centered, so all columns of X sum to zero, the response variable y has also been centered, and we removed the intercept column. Then y hat in the limit will be zero, but zero is the average of the y vector, so now it makes sense. If you don't want to center, then another way is to write the loss explicitly such that you only penalize the non-intercept coefficients. The formula unfortunately becomes a mess, because I have to write these two sums explicitly now, but if you look up the definition of ridge regression in a textbook or somewhere online, you might find a formula that looks like that. Here I'm summing over all training samples of y minus my prediction, where my prediction is an intercept plus everything else, and I penalize only the everything else. If the intercept is not penalized, then even when lambda goes to infinity, everything shrinks to zero apart from the intercept, which then adjusts itself to just predict the average of y, which is what we want. This is a small technicality, and one doesn't even necessarily have to do it this way; penalizing the intercept is a valid choice in principle, but the standard implementations will usually not penalize the intercept, just so that you know.

And speaking of implementations: if you compare different implementations across programming languages and libraries, you will see small differences in how this loss function is set up. For example, some implementations might not have the one over n term here. If you don't have the one over n term, then the optimal lambda will have a different value; it's just a scale factor. It doesn't change the minimum of the loss, but the lambda that you end up selecting will have a different value. So sometimes people run something in, I don't know, Python, then run the same thing in R with a different library, get a different lambda, and worry that something is wrong; but these are just small choices that don't matter for anything apart from comparing between implementations. When I say they don't matter for anything, I mean that the mean squared error you get at the optimum is the same, and the coefficients you get are the same.
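For reference, a textbook-style way to write that messy formula, the ridge loss with an unpenalized intercept, using the lecture's 1/n convention (libraries that drop the 1/n just rescale lambda):

$$
\min_{\beta_0,\,\beta}\;\; \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \;+\; \lambda \sum_{j=1}^{p} \beta_j^2 .
$$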
Okay, I want to give two more perspectives on ridge regression, in addition to what I said before; consider this supplementary material for understanding ridge regression better. The first perspective has to do with the singular value decomposition, which we talked about two weeks ago. What we discussed back then was that if you consider the singular value decomposition of the X matrix, you can write the formulas for beta hat, and also for y hat, in terms of the U, S, and V matrices. For example, the formula for y hat becomes particularly simple: it just has U in it, U U transpose y.

Remember that in the SVD, the U matrix has orthogonal, orthonormal even, columns. And since y hat is in some sense just an orthogonal projection onto the predictor space, again something we discussed previously, you can write this orthogonal projection very simply if your matrix U has orthonormal columns. Okay, how does this change in ridge regression? We can see this very easily. We have this formula now for X times beta hat in ridge regression; I'm just adding this term over here. Remember that this is called the hat matrix. Now, plugging the singular value decomposition in everywhere instead of X, and applying a little trick of rewriting the identity matrix as V V transpose, we can simplify it like that. If lambda is zero, the red factor falls out: these fractions become ones, so this is just the identity matrix, you can erase the whole red term, and you just have U U transpose, which is what we started with. Fine, that was a sanity check; it makes sense.

The interesting thing, of course, is to see what happens when lambda is not zero. These are the singular values: a diagonal matrix with singular values on the diagonal, some large, some small. Now let's see what happens. If you have a singular value s_i that is very large, say much larger than n lambda, then this fraction will be close to one; the n lambda has basically no influence, as if you didn't regularize. However, for small singular values this changes a lot: instead of one, you get something much closer to zero here, because now the n lambda term is much larger than s_i squared, so the denominator is much larger than the numerator and the fraction goes towards zero. So what happens, and this is a really important insight, is that in some sense ridge regression, for a given lambda, affects the small singular values much more strongly. It shrinks your predictions, it shrinks the betas, in the directions in which your data has small variance. Remember, when we discussed unpenalized regression in terms of the singular value decomposition, we talked about the covariance of your data being stretched, with some directions having very small variance. These are the directions in which the uncertainty of your estimate is really large, and these are exactly the directions that ridge regression penalizes, so that the beta coefficients don't blow up because of the small singular values. The directions in the data with large singular values are the certain directions, the directions in which the data gives you a lot of reliable information, and those are effectively penalized much less strongly than the others. I find this a useful perspective.
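Here is a quick numerical check of that SVD shrinkage formula, a sketch with made-up data and the same conventions as above (1/n loss, hence n lambda):

```python
# A sketch verifying the SVD view of ridge predictions (synthetic data).
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 50, 5, 0.3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Direct route: y_hat = X (X^T X + n*lambda*I)^(-1) X^T y
y_hat = X @ np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

# SVD route: shrink each direction u_i by the factor s_i^2 / (s_i^2 + n*lambda)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
shrink = s**2 / (s**2 + n * lam)
y_hat_svd = U @ (shrink * (U.T @ y))

print(np.allclose(y_hat, y_hat_svd))  # True: directions with small s_i shrink most
```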
Now, a very different perspective that has nothing to do with the SVD: the Bayesian perspective on this entire thing. What we discussed last time is that you can see the standard beta hat regression estimate as the maximum likelihood solution of a generative model where the error is Gaussian with mean zero and the same variance for all points. That treats beta as something fixed: the true beta, not the beta hat, is just a fixed thing that we are estimating.

Another interesting setting is to treat beta as a random variable, a random vector with its own distribution, and one can think of that as a prior distribution. Let's say you have prior knowledge, or a guess, that each coefficient of beta is something that on average has value zero, with a particular variance tau squared. So you don't expect the coefficients to be super large; you just say, well, I expect them to be around zero. It turns out that the entire ridge regression can be equivalently understood as estimating the maximum of the posterior distribution in this setting, and I want to spend some time unpacking these terms, the prior and the posterior and so on. For that we need to discuss the Bayes theorem, very briefly. So let me step back; forget about regression for a moment, we're just talking about probabilities for the next five minutes. I assume many of you will have seen this already.

If you have two events A and B, then one can write the joint probability that both happened at the same time, and one can decompose it in two ways. Let me actually use an example throughout this slide; it makes it easier. Let A be the event that there is a pandemic right now, and B the event that you're wearing a mask. The joint probability that both are true can be decomposed as either the probability that there is a pandemic, times the probability that, given there's a pandemic, you're wearing a mask; or, alternatively, the probability that you're wearing a mask, times the probability that, given you're wearing a mask, there is a pandemic. These are the same thing, and one can then just divide by the probability of B and trivially obtain the Bayes theorem. People sometimes think the Bayes theorem is something confusing and complicated, but it's almost a triviality that allows you to express the conditional probability of A given B in terms of the conditional probability of B given A. In my example: the probability that there is a pandemic, given that I see you wearing a mask, can be computed as the probability that you would wear a mask given a pandemic, times the probability that there is a pandemic, divided by the probability that you're wearing a mask at all. This denominator can be a little confusing; one can rewrite it as a sum over all possible options. What is the probability that you're wearing a mask? It's the probability that there's a pandemic and you're wearing a mask, plus the probability that there's no pandemic but you're still wearing a mask.

Okay, this is usually introduced in the context of, I don't know, predicting whether you have a disease given a positive test, and then there are all these exercises about computing such probabilities. We will not need that now; instead, we need to discuss what happens if you have multiple possibilities. Let's say you can wear a mask because there's a pandemic, that's the formula from the previous slide, but maybe you also tend to wear a mask when it's Halloween. I can write the same formula; note that the denominator is the same, but the numerator changes, of course. Or maybe you wear a mask in some other situation, say if you're diving; a different kind of mask, it doesn't matter. I can write it like that. And now let's say I see you in a mask, or rather I don't see you, but I'm told that you were wearing a mask today, and I want to guess what is more likely.
Is there a pandemic now? Is it Halloween, or were you diving? I need to compare these options, but since the denominator is the same, I just need to compare the numerators to see which one is larger. So what I need to do is multiply by my prior probabilities. What's the probability that today is Halloween? Maybe one over 365. What's the probability you're diving? What's the probability there's a pandemic? Maybe one over 100, if a pandemic happens every 100 years. And then I need the conditional probabilities: maybe if you're a good citizen, the probability of wearing a mask given a pandemic is really high, close to one, and so on. So these are the conditional probabilities; I compare the products and conclude which option is more likely. Good.

Now, this works the same way if everything is continuous instead of discrete events, and then I can write the same formula in terms of probability distributions. This is what we're going to need now, because what we have is not masks and pandemics but data and parameters. The data is a continuous thing, and in our case the parameters are beta, also a continuous variable: you can change every coefficient by any amount. What we're interested in is the probability of the parameters, of beta, given the dataset, and one can say that it's proportional (there's some denominator that I just don't write down because it doesn't matter) to the probability of getting this data given the parameters, given beta, times the probability of the parameters. Notice that the first factor is exactly the likelihood; we discussed this already: the probability of getting your particular dataset given beta, seen as a function of beta, is called the likelihood. The second factor is called the prior: the probability of this or that value of beta before you saw any data. And the left-hand side is what we want, the posterior. Now one can take the logarithm, because that's what we did before: we obtain the log likelihood, and the product becomes a sum, which is convenient. You have the log likelihood, you have what I will call the log prior, and their sum gives you the log posterior.

So now let's apply this to our linear regression case. We have the model itself, and we have the prior on beta. Here's the log likelihood; see the last lecture for why it looks like that, I'm just copying it from there. And now the log prior: it's almost the same, because the probability of seeing the data given beta has a Gaussian distribution, and we worked out the likelihood for that; but the prior on beta also has a Gaussian distribution, an even much simpler one, and if you write it down, you see that this is the log prior. Now we need to add them up. Let's do that; I will flip the signs to get the negative log likelihood and the negative log prior, and there are these terms that are constant in beta, which I will hide. The rest looks like that, which can be equivalently rewritten, by dividing by n and multiplying by two sigma squared, in this form. And now you see that this is actually what I started with: this is the ridge regression loss. Here's the mean squared error, here's the beta squared, and the coefficient in front depends on these relative variances.
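Putting the algebra from the slide in one place (a sketch assuming, as in the lecture, noise variance sigma squared and prior variance tau squared):

$$
-\log p(\beta \mid X, y) \;=\; \frac{1}{2\sigma^2} \sum_{i=1}^{n} \big( y_i - x_i^\top \beta \big)^2 \;+\; \frac{1}{2\tau^2}\,\lVert \beta \rVert^2 \;+\; \text{const},
$$

and multiplying by $2\sigma^2/n$, which doesn't move the minimum, gives

$$
\frac{1}{n}\,\lVert y - X\beta \rVert^2 \;+\; \frac{\sigma^2}{n\tau^2}\,\lVert \beta \rVert^2,
\qquad\text{i.e.}\quad \lambda = \frac{\sigma^2}{n\tau^2}.
$$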
So the conclusion is that if you start with this particular prior on your beta coefficients, compute the likelihood given your data, multiply them to arrive at the posterior, and then look at the maximum of the posterior, this is the same as computing the ridge regression solution with a particular lambda that depends on these variances.

Let me walk through this plot; maybe it will make things more intuitive. I'm plotting here the value of a particular coefficient, just one coefficient at a time, and these are probability distributions. Here's your prior: a Gaussian centered at zero, with some width, which I call tau. The likelihood is something else: a Gaussian centered at the maximum likelihood estimate, which is the standard regression beta hat estimate, with a variance given by this expression; again, we talked about that previously. So the likelihood just depends on the data, but now you have the prior in addition. What happens if you multiply these two Gaussians? This I leave as an exercise: the product of two Gaussians is a Gaussian; it's a very simple exercise, and very convenient, which is why people love Gaussians. So you multiply the prior and the likelihood, or, another way to say it, you add the log prior and the log likelihood, and you get another Gaussian somewhere in the middle here, which has its maximum wherever the negative log posterior has its minimum. That's the beta hat lambda solution. Another term here is the maximum a posteriori estimate: you find your posterior, look where its maximum is, and that is exactly the ridge regression estimate.

So you can conceptualize ridge regression as having a prior on your parameters, and the prior does not let the estimate get very, very large. If your likelihood says my parameter is equal to a million, but your prior says it should be really close to zero, then probably it's not a million; probably that was a mistake, and the posterior will be much closer to zero. The trick, of course, when you apply ridge regression, is usually to find the lambda. If you start from a prior, then a fixed lambda just comes out of it; but what we usually want is to tune the lambda. One can in fact think about tuning the prior; that is then not exactly Bayesian anymore, it's called empirical Bayes, and it's something we're not going to talk about today. But I wanted to give you this conceptual picture of ridge regression. And in fact, this is true not only for ridge regression: any penalty can be thought of as some prior; it's just that for other penalties, it will not be a Gaussian prior. It's some prior on the coefficients, centered at zero, that shrinks everything towards zero.

Good. The next chapter is lasso regression. As I said before, here we just have the L1 norm of beta instead of the squared L2 norm, and, as a reminder, the L1 norm of beta is by definition the sum of the absolute values of its coefficients. It turns out that in this case, there is no analytic solution. If you try to write the gradient, you will start seeing that it gets messy, because the absolute value is a function that looks like that.
It doesn't have a derivative here at the bottom: it's not smooth, but it is convex. And by the way, that's something I didn't explicitly say about ridge regression: the penalty there is a convex function, it's just a quadratic, and the loss is a convex function; it turns out that if you add two convex functions, the result is also convex. So ridge regression has a convex loss, which means that you can start anywhere you want, use gradient descent, and you should converge to the bottom. That's great. And the lasso loss is also convex; it's just more annoying to work with because of these discontinuities in the derivative of the absolute value. One can still use gradient descent and still quite simply estimate the beta hat for this loss function; it's just that we cannot write down an analytic solution, unfortunately.

What one can show, though, is that the solutions you get out of this are going to be sparse, and sparse means that they have a lot of zeros. As you increase lambda, you get more and more zeros in your beta hat vector. That's interesting, because it was not the case for ridge regression. For ridge regression I had this plot, the same plot as before: as you increase lambda, your coefficients decrease, at least on average (some can temporarily increase and then decrease), but they will never hit zero exactly, unless you tune your dataset in a very specific way. They all just decrease and decrease; the asymptote of all these lines is the zero line, but for any lambda you pick, the beta hat vector has very small values, none of them exactly zero. Whereas often, or at least in some situations, you would actually like some coefficients to be exactly zero. It's like choosing which predictors matter: the ones that get an exactly-zero estimate don't matter for predicting your y. So if your estimate is sparse, that's like performing feature selection, which is often useful.

It turns out this is exactly what lasso does. If I draw a similar kind of plot for the lasso case, everything decreases, similarly to here, but every line hits zero at some point. The ones that started small will usually hit zero first, and the ones that started large later, though not necessarily. Everything decreases and at some point hits zero; then the others start decreasing faster, and then they hit zero too. And sometimes some of them can temporarily increase, the same thing as before. What we can say for sure is that as you move to the right, the sum of the absolute values decreases; that has to be the case, even if some coefficients temporarily increase. But also, and this is not obvious from the loss function, it turns out that as you move to the right, you get fewer and fewer nonzero values, which means that at some point everything is zero. That's also interesting: if you put a large enough lambda into this loss function, you get exactly the zero vector as the output. You don't need an infinite lambda for that; there is some finite value that yields a zero beta hat.
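Here is a small sketch of that sparsity effect using scikit-learn, with synthetic data; note that sklearn calls the regularization parameter `alpha` rather than lambda, and its exact scaling convention differs slightly from the lecture's:

```python
# A sketch of lasso sparsity: larger regularization -> more exact zeros.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, -2.0, 1.0] + [0.0] * 7) + rng.normal(size=n)

for alpha in [0.01, 0.1, 1.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    n_zero = np.sum(model.coef_ == 0.0)
    print(f"alpha = {alpha}: {n_zero} of {p} coefficients are exactly zero")
# Coefficients dropping to exactly zero is feature selection in action.
```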
I'd like to give a bit of an intuitive explanation for why this happens, because, as I said, it is not directly obvious from the loss function I wrote. One can see it by thinking about the loss functions for ridge and for lasso in a slightly different way, and we need to introduce the concept of a Lagrange multiplier for that. I will not go into a lot of detail here; I will just mention it, and you can look it up if you're interested. It turns out that if you are minimizing any loss function subject to some constraint, say some function of w has to be equal to zero (that is called the constraint), then this is equivalent to minimizing your loss plus lambda times the constraint function, over w and over lambda, and this lambda is called the Lagrange multiplier. You'll recognize that this looks very similar to my loss function for ridge, for lasso, for anything where this is the penalty. And exactly, this is true. In fact, the same thing, more or less (I'm glossing over the technicalities), is almost true for inequality constraints, where some function of w has to be, for example, less than zero.

So these two ways of writing the loss function are equivalent, or at least nearly equivalent; for our purposes we can say they're equivalent. One way is the way I used before: you have your mean squared error loss plus the penalty, lambda times the L1 norm of beta. An alternative way to write the same thing is to say that you are minimizing the mean squared error without a penalty, subject to the constraint that the L1 norm of beta is less than some value. Here lambda is the adjustable parameter, and there this t is the adjustable parameter; they are not the same. When I say it's equivalent, I mean that for every lambda there is some value of t such that the two problems are equivalent. If you forget all this math, it's very simple: I can say, look for the solution that minimizes your mean squared error, but you're not allowed to have a norm of beta larger than 10. You look within all betas that have norm below 10 (for ridge it's the squared Euclidean norm, for the lasso it's the L1 norm), you find the best solution among those, and that's your estimate. And this estimate will be the same as the estimate from the penalized loss function for some value of lambda.
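In symbols, for the lasso case (and analogously for ridge with the squared L2 norm):

$$
\min_{\beta}\; \frac{1}{n}\lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert_1
\quad\Longleftrightarrow\quad
\min_{\beta}\; \frac{1}{n}\lVert y - X\beta \rVert^2 \;\;\text{subject to}\;\; \lVert \beta \rVert_1 \le t,
$$

where the correspondence between lambda and t depends on the data.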
The reason I mention all this, well, I think it's good to know as an alternative way of thinking about it, but another reason is that it lets us see why lasso is sparse and ridge is not. And this has to do with the shape of these constraint sets when you plot them in the space of betas. So imagine that I have two predictors, so two coefficients, beta one and beta two, that I'm estimating. I had a similar plot some weeks ago that only had these Gaussian contours over here, and this is the minimum of the loss function: the ordinary least squares, unpenalized solution is just over here. You can think of a loss surface over this beta one, beta two plane, and the minimum of the loss is over here. But now there's a twist: now we say we are looking for the minimum subject to a constraint. Let's talk about ridge first.

So what is this constraint? It just means that the squared length of beta has to be less than t, which means beta has to lie within a circle centered at zero with squared radius t. Depending on t, you have either a very small circle or a larger and larger circle, but for a given circle, you want to minimize the loss while staying inside the circle. The answer, in this particular case, is here: the point where the circle and this ellipse touch. Wherever they touch, that is the minimum, because if you move away from that point while staying inside the circle, the loss increases. How can you obtain the minimum of this loss while staying inside the circle? You have to go right here.

Now, what happens in the lasso case? The interesting thing is that this part, the likelihood part if you want, is exactly the same, but the constraint looks very different. Now your constraint set is kind of a circle in the L1 norm, and a circle in the L1 norm looks like a diamond. Think about it: the absolute value of beta one plus the absolute value of beta two has to be less than some constant, which just means the point has to lie within a diamond like that; t adjusts the size of the diamond. So again, we fix t, which fixes the diamond, and we ask: staying inside this diamond, where is the minimum of this loss? They touch over here, and that's the minimum. And it is exactly, not approximately but exactly, on the y-axis, and that's because the diamond has a corner there. The diamond has these corners, and as you grow the diamond, a corner is much more likely to be the first thing that touches whatever your loss surface is, especially in a high-dimensional space, where the diamond has a lot of corners everywhere. One of these corners is going to touch the surface, and the corners all lie on the axes, which means that some betas are set exactly to zero at that point. This is not exactly a proof (one can work it out so that it becomes a proof), but it is the intuition for why you typically get a lot of zero coefficients in the lasso estimate. You can cook up situations where you don't, or where they all go to zero at the same time or something, but typically, for a normal dataset, as you increase lambda, the coefficients drop out one after another.

Okay. What we need to discuss now is how to select the lambda; that's something I haven't mentioned at all so far. You choose your penalty, either ridge or lasso, or, by the way, you can also use the elastic net, which has two lambdas, one for ridge and one for lasso, and then you have to choose both of them. How can you do that? The answer is: we need to look at the test error, because looking at the training error, there is no way to tell which lambda is optimal. So you need to estimate the test error. The simplest thing you can do is to just get yourself a test set. Let's say you have some dataset; you split it in two parts: one part is my training data, and the other part I will call the test data. Then I fit my models only on the training data, and I assess the performance of the model.
For example, I look at the mean squared error on the test data, or on both training and test, but it's the test error I'm interested in. That essentially allows me to plot this function over here. In a way, that's it; there's more to it, but conceptually, that's the most important thing: you want an independent test set. You don't look at the test set at all while fitting the model; once you have the model, you test it on your test set.

You can ask: okay, how should I split the data between training and test? There's a tradeoff here, because if your test set is very small, your estimates will be very noisy, and if your training set is very small, your model will just not be very good, because not enough data is used to fit it. So usually you want a larger training set and a smaller test set, but the test set should not be too small, so people will often go with something like 10% of the data for the test.

An important point is what happens if you have some hyperparameter, a tuning parameter, a regularization parameter like lambda to adjust, which is what we're discussing here, and in particular what happens if you have many of them. For ridge regression and lasso, you just have one parameter, but in more complicated situations, if you have a neural network, you're tuning, I don't know, the number of hidden layers and the number of neurons and so on; we will talk about that in later lectures. Sometimes you have more things to tune. What you don't want to do is change these parameters, look at your test set performance, compare all these parameter settings with respect to the test set performance, take the minimum, and just say: that's my best model, and that's the performance of this model. The reason you don't want to do this is that you can cherry-pick. You have a lot of different models for different regularization or tuning parameters, you look at the performance, and maybe it's just noise; you pick the lowest value and say that's my best model. That part is maybe okay; but then you would also report the test set performance you obtained as the performance of this model, and that is already not okay, because this number is cherry-picked and can be lower than the actual error. The model will look better than it actually is.

For that reason, people like to use three sets instead of two: a training set, a validation set, and a test set. You fit the models on the training set, all the different models for different lambda coefficients. You compare them first on the validation set and choose the best model, the one with the lowest error on the validation set. Then you take this one winner (maybe it's a bit cherry-picked, but whatever) and you test it on the test set, only once. And then you can say: that's my selected model, and that's its performance. The performance on the test set will typically be a bit worse than the performance on the validation set, because of the additional cherry-picking you did when comparing different lambdas. If you read machine learning papers, deep learning papers, you will typically see a training set, a validation set, and a test set; they are fixed, and that's it. Those papers usually have a lot of data, so data is often not a problem: you can say, I will save 10% for the test; I still have so much data that I can fit my model on the rest.
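Here is a sketch of such a three-way split with scikit-learn; the 80/10/10 proportions and the synthetic data are illustrative choices, not a prescription from the lecture:

```python
# A sketch of a train/validation/test split (80/10/10, illustrative).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X, y = rng.normal(size=(1000, 10)), rng.normal(size=1000)

# First carve out 10% as the final test set ...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)
# ... then split the rest into training and validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=1/9, random_state=0)  # 1/9 of 90% = 10% overall

# Fit all candidate lambdas on (X_train, y_train), pick the winner on
# (X_val, y_val), and touch (X_test, y_test) exactly once at the end.
```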
Often, though, in situations where we don't have access to so much data, reserving a fixed 10% as a test set does not feel very good. You're reducing your training data by that much, and the 10% you reserve for the test is a small number of points, so maybe it's still not enough to get a good estimate of the test performance. What one can do instead is repeat the entire procedure with different splits between training and test. This has a name: cross-validation. You will see this very often. Cross-validation means that you, for example, select 10% of your data as the test set, do everything I described on the previous slide, then select another 10% of your data as the test set and repeat the entire thing, then another, and another, until you have used the final 10% of your data as the test set. Each time, you get some error for the different lambdas. These repetitions are called folds: the first cross-validation fold, the second, and so on. You average your test error estimates across all folds, you get the final estimate, and there you look for the optimal lambda. If your model fitting is not too expensive, which for linear regression it certainly isn't, and if your dataset is not very large, which is also often the case, then this looks like a good idea and is a good idea, a better idea than a single fixed training/test split.

This is called k-fold cross-validation. Sometimes people use n-fold cross-validation, which means that your test set consists of exactly one example; you then of course have n folds, because you try out each point as the test set, so in the end every point gets tested once, across the n folds. This is also called leave-one-out cross-validation. The standard rule of thumb is to use 10 folds, which corresponds to using 10% of the data for the test.
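A sketch of selecting lambda with 10-fold cross-validation, using scikit-learn's ridge (whose `alpha` plays the role of lambda, up to the scaling conventions mentioned earlier) on synthetic stand-in data:

```python
# A sketch of 10-fold cross-validation over a grid of lambda values.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(6)
X, y = rng.normal(size=(200, 10)), rng.normal(size=200)

lambdas = np.logspace(-3, 3, 20)
cv_error = np.zeros(len(lambdas))
for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                 random_state=0).split(X):
    for i, lam in enumerate(lambdas):
        model = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])
        resid = y[test_idx] - model.predict(X[test_idx])
        cv_error[i] += np.mean(resid ** 2) / 10  # average over the 10 folds

best_lam = lambdas[np.argmin(cv_error)]  # lambda with the lowest CV error
```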
Note that cross-validation does not measure the performance of one particular model, because you end up with 10 different models in 10-fold cross-validation: you fit a bunch of models for different lambdas on this training set, then you repeat and fit the same bunch on that training set, and so on through the last training set. So for each given lambda, you have 10 different models fit on different training sets. I think the right way to think about it is that cross-validation measures the performance of a model-building procedure. Your model-building procedure says: I take these values of lambda, I try them out, and I pick the best. That is what is being tested here. This is important, because sometimes people want to look at the actual model after cross-validation. Very often in science, for example, you want to use the coefficients to interpret something; say you're using lasso regression and you want to see which coefficients are zero, which features got dropped. But maybe different features get kicked out on this fold and on that fold and on that fold. So now you have 10 models; what do you do?

I think the right way to deal with this is that once you have finished with all the cross-validation, you say: my best lambda, on the basis of the average test set performance, is, I don't know, 7.5. And now you take this winning lambda of 7.5 and fit the model on the entire data. You pool together training and test, fit the model with lambda 7.5, get some beta hat vector, and that's what you can then interpret. But the reported cross-validation performance is, of course, the average performance measured on the test folds; that's the only thing that makes sense. Another situation where this refitting makes sense is when you want to actually use the model, so not a science example but more of a production example: you built a model and now you want to deploy it somewhere. It makes sense to put all the data you have into this final model, so you refit the model using the chosen lambda on the entire dataset.

Okay, there's one more tricky thing here. I spoke before about how you need training, validation, and test, and then I introduced cross-validation where I just had training and test. So where is the validation? In the situation where you also want a validation set, you use something called nested cross-validation. At this point it becomes a little bit confusing, so spend some time here making sure that you understand it; conceptually, it's not very confusing, though. We are just repeating the training/validation/test split over and over and over again. The way people typically do this is with nested loops. You have an outer cross-validation loop, which is what we had before: it fixes a test set. Say on the first fold of the outer loop, this is my test set, and the rest is what I can use for model fitting. But now, since I want a validation set, I start an inner loop, where I split this remaining part into the actual training set and the validation set. This will also be a k-fold cross-validation, for example here; that's my inner loop. I try out different lambdas on my training set here, then test them on the validation set and choose the best lambda. Once I'm done with that, I use this best lambda to fit the model on the entire white part here, and then I test it on my test set; that's my final model performance for this outer fold. And I keep doing this over the k folds of the outer loop, average everything at the end, and get my final estimate. That's the correct way to use cross-validation when you have tuning parameters: it becomes a cross-validation within a cross-validation.
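Here is one way to sketch that nested procedure in scikit-learn. Using `GridSearchCV` for the inner loop is my shorthand, not the lecture's construction; it performs exactly the inner k-fold search over lambdas and refits the winner on the whole non-test part:

```python
# A sketch of nested cross-validation: 10 outer folds, 5 inner folds.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, GridSearchCV

rng = np.random.default_rng(7)
X, y = rng.normal(size=(200, 10)), rng.normal(size=200)
param_grid = {"alpha": np.logspace(-3, 3, 20)}  # sklearn's alpha = our lambda

outer_scores = []
for fit_idx, test_idx in KFold(n_splits=10, shuffle=True,
                               random_state=0).split(X):
    # Inner loop: choose the best lambda using only the non-test part.
    inner = GridSearchCV(Ridge(), param_grid, cv=5,
                         scoring="neg_mean_squared_error")
    inner.fit(X[fit_idx], y[fit_idx])   # refits the winner on all of fit_idx
    resid = y[test_idx] - inner.predict(X[test_idx])
    outer_scores.append(np.mean(resid ** 2))

# This average scores the whole model-building procedure, not one model:
print(np.mean(outer_scores))
```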
You can, in principle, use a different number of folds in the inner and outer loops. And again, if you need a final model after all of that, you apply the same model-building procedure that you were testing here to the entire data, which of course means applying the inner loop to the entire data, without a held-out test set. Pause here for a second and try to work out how this goes. Let's say you have 10 folds for the outer loop and five folds for the inner loop; how many models are you going to build? Imagine, for example, that you're trying 100 values of lambda.

If you're trying 100 values of lambda, it means that in one iteration of the inner loop, you're building 100 models that you test on the validation set. The inner loop has five iterations, so you fit 500 models over the entire inner loop. Then you need one more fit, with the best lambda, on the whole non-test part, so that's 501, and you test it on the test set. You do these 501 fits for each of the 10 outer folds, and then you're done, unless you also want a model for inspection, in which case you repeat the entire inner procedure, another 501 fits, once more on the full data. So you can compute how many model fits you need for all of that.

Okay, so that is how you do model selection whenever you have something to choose. In practice, if you want to use ridge regression, for example, there are of course packages that will do this for you: you throw the data into the package, the package splits it into 10 folds, tries different lambdas from very small to very large, picks the best lambda, and gives you the result. If you want to code it yourself, then either you use cross-validation and have a loop over folds, or you just have a fixed test set and validation set and do it that way.

Okay, the last thing I want to talk about today, very briefly, and then we're done with this lecture, is the following topic. This is the same figure I already used to talk about the bias-variance tradeoff for polynomial regression fits; we talked about this last week, and I mentioned it again at the beginning of today's lecture. Here, as the number of predictors grows, we go from a high-bias situation to a very high-variance situation: the training error monotonically decreases, and the test error behaves like that. And by the way, let me say it again: this bias-variance tradeoff can arise in different scenarios. On this slide, it arises because I'm including more and more polynomial predictors into my model; fifteen minutes ago, we were talking about the same tradeoff with the lambda penalty on the x-axis. You can think of these as different things you change in your model that move you from a situation with more bias to a situation with more variance, and the conceptual statement is that there is typically a sweet spot somewhere in between.

Anyway, now I want to talk about polynomial regression again. You're fitting polynomials of higher and higher degree to a fixed dataset, and that's what happens. And then I stop over here. Why does this plot stop here? Because this is the point where your training error hits zero. It hits zero when you fit a ninth-degree polynomial to ten data points, or a polynomial of degree 99 to 100 data points: that's where you can fit all the points exactly. Your training error goes to zero, and your test error becomes huge, or even diverges. However, one can ask: what happens beyond this regime, if you put even more predictors into your model? We have 10 samples, but let's include all predictors up to x to the power of 100; why not? So what happens then? What happens is that you have an underdetermined problem. You now have many ways to choose beta such that the training loss is exactly zero. Usually in regression, there's only one minimum of this loss function.
Okay, the last thing I want to talk about very briefly today, and then we are done with this lecture, is the following topic. This is the same figure I already used to talk about the bias-variance trade-off for polynomial regression fits; we talked about it last week, and I came back to it at the beginning of today's lecture. As the number of predictors grows, we move from a high-bias situation to a very high-variance situation: the training error decreases monotonically, and the test error first falls and then rises. And by the way, let me say it again: this bias-variance trade-off can arise in different scenarios. On this slide, the trade-off appears because I am including more and more polynomial predictors in my model; just fifteen minutes ago, we saw the same trade-off with the penalty strength lambda on the x-axis. So you can think of these as different knobs you turn in your model that move you from a regime with more bias to a regime with more variance, and the conceptual statement is that there is typically a sweet spot somewhere in between. Anyway, now I want to talk about polynomial regression again: you fit polynomials of higher and higher degree to a fixed dataset, and this is what happens. Notice that the plot stops at a particular point. Why does it stop there? Because this is the point where your training error reaches zero, which happens as soon as the polynomial has as many coefficients as you have data points, say a ninth-degree polynomial fitted to ten points. There you can fit all the points exactly, your training error is zero, and your test error becomes huge or even diverges. But one can ask: what happens beyond this regime? What if you put even more predictors into the model? We have ten samples, but let's include all powers of x up to x to the hundredth. Why not? What happens then is that you have an under-determined problem: there are now many ways to choose beta such that the training loss is exactly zero. Usually in regression, the loss function has a single minimum, and that minimizer is the beta hat we are looking for. But if you have more predictors than samples, there are many minimizers. The loss still looks like a quadratic surface, but there are directions along which it stays flat at zero, so you can choose many beta vectors that all yield zero loss, and you cannot even say which of them is best. That is called an under-determined problem, and it is what happens when p is larger than n and you use no penalty. If you added a ridge penalty, there would again be only one beta, because you would be choosing the one with the smallest coefficients; that is what ridge wants, though of course your loss would then no longer be exactly zero. But let's forget about ridge regression, or any penalty, for a moment; we are talking about the plain least-squares estimate, and then the problem is under-determined. Here is a trick, and after everything we discussed today about penalties shrinking coefficients, it should make sense to you. I can get many different betas that all yield zero loss, so how do I choose one? There are different rules you could imagine, but a sensible approach is to choose the one with the smallest norm: among all the betas that yield zero loss, with no penalty anywhere, I pick the one whose norm is smallest. That fixes the choice: there is exactly one beta hat of minimum norm. It is called the minimum-norm solution of the under-determined problem, and it turns out, I am not proving this, that it can be written very simply in terms of the SVD of the X matrix again. Previously we wrote the solution for the well-determined case as beta hat = V S^{-1} U^T y, and now I just use the same formula. Think about the SVD: if X is a tall matrix, then U is tall, S is square, p by p, and V is square, p by p. If X is instead a fat matrix, with more predictors than samples, the SVD still works, but now U is square, S is a square n by n diagonal matrix, and V^T becomes the fat one, n by p. Everything else stays the same: S is still a square diagonal matrix, so we can still invert all the singular values and write exactly the same formula to get an estimate. It still works, and out of all the beta estimates with zero loss it gives you exactly one, namely the one with minimum norm. That's nice: another illustration of how useful the singular value decomposition is. You plug in the same thing and get a sensible estimator out, with the nice property of having minimum norm.
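To make this concrete, here is a minimal numpy sketch of the same formula applied to a fat X, with randomly generated data purely for illustration.

```python
# Minimal sketch: minimum-norm least squares via the SVD, p > n case.
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 100                            # more predictors than samples
X = rng.standard_normal((n, p))           # a "fat" design matrix
y = rng.standard_normal(n)

# Thin SVD of a fat X: U is n x n, s holds the n singular values, Vt is n x p.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Same formula as in the well-determined case: beta_hat = V S^{-1} U^T y.
beta_hat = Vt.T @ ((U.T @ y) / s)

print("training loss:", np.sum((X @ beta_hat - y) ** 2))   # essentially zero
# np.linalg.lstsq also returns the minimum-norm solution here; norms agree.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(beta_hat), np.linalg.norm(beta_lstsq))
```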
And now a very interesting thing can happen once you go beyond this point, which people like to call the interpolation threshold, because it is the point from which you can interpolate all your training data with zero error. (A predictor with zero training error is sometimes called an interpolator.) In our case of linear regression, the threshold sits at p equal to n. Past the interpolation threshold your training error is simply zero and never goes up again: add more predictors, fine, you can still fit the training data exactly. The test error is another matter. In a sense there are many possible curves here, because I can choose different betas: all of them have zero training error, but their test performance can be very different, and I have said nothing about the test performance yet. But if I select the minimum-norm beta, then what can happen, it does not always happen, but it can, is that the test error of the minimum-norm estimator goes down again. It comes down from this awful overfitting peak, down, down, down, and in particular situations it even drops below the optimum on the classical side, without any explicit ridge penalty being added to the model. That is why this is called implicit regularization: the implicitness sits in the minimum-norm condition. Many solutions are possible, but say I am using the SVD formula; perhaps I even programmed my software so that it uses the SVD formula without thinking much about it, and then I hand it an X with more predictors than samples. The formula still works, I still get some output, and it happens to be the minimum-norm solution. I may not even be aware of that, but I see that my error decreases, because the minimum-norm solution is performing implicit regularization for me. People have been extremely interested in phenomena like this for the last few years; it is a very hot topic right now, because it turns out that something similar happens when you fit a neural network. We will talk a bit about that when we get to neural networks, but neural networks are also so flexible that they can essentially fit any data you give them perfectly, with zero loss; that is at least often the case. And yet their test performance is also pretty good. Why is it good? That is what people are debating. Some implicit regularization happens simply through the way neural network training proceeds: there are other possible weight configurations that would also achieve zero loss and would perform very badly, but the ones you actually get by training are good. Why this happens is largely an open question, but it turns out that even in linear regression you can have exactly the same phenomenon, in a setting simple enough that you can analyze it and understand when and why it occurs. Unfortunately I do not have time to discuss that today; I just wanted to mention that this implicit regularization phenomenon can happen here. The resulting curve is sometimes called the double descent curve, because it descends once before the threshold and then descends again beyond it. If you just Google "double descent", you will find a lot of research on it, and one can use linear regression with this polynomial predictor setting to study it too. It is related to what we discussed today, because the minimum-norm requirement is conceptually related to an explicit ridge penalty, which also says: I want the norm of beta to be small. The ridge penalty mattered in the regime where the solution was not under-determined; here the solutions are under-determined and the minimum-norm rule takes over. And you can use both: you can be in this regime and still add a ridge penalty, that is fine; then you have explicit regularization on top of the implicit one, in some sense.
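Since we will not go deeper today, here is a minimal sketch of the phenomenon, assuming numpy. I use random Gaussian features instead of the polynomial predictors from the slide purely to keep the numerics simple, and all the sizes are arbitrary illustrative choices; the shape of the curve you get depends on the setup.

```python
# Minimal sketch of double descent with the minimum-norm estimator.
# We keep n = 20 training samples fixed and grow the number of features p.
import numpy as np

rng = np.random.default_rng(0)
n, n_test = 20, 200
x_train = rng.standard_normal((n, 500))
x_test = rng.standard_normal((n_test, 500))
true_beta = rng.standard_normal(500) / np.sqrt(500)        # signal spread over all features
y_train = x_train @ true_beta + 0.5 * rng.standard_normal(n)
y_test = x_test @ true_beta + 0.5 * rng.standard_normal(n_test)

for p in [5, 10, 15, 20, 25, 50, 100, 500]:
    # lstsq gives the ordinary least-squares solution for p < n and the
    # minimum-norm solution for p > n -- no explicit penalty anywhere.
    beta = np.linalg.lstsq(x_train[:, :p], y_train, rcond=None)[0]
    test_mse = np.mean((x_test[:, :p] @ beta - y_test) ** 2)
    print(f"p = {p:4d}   test MSE = {test_mse:.3f}")

# In this setup the test error typically spikes near p = n, the
# interpolation threshold, and descends again as p grows past it.
```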
Okay, this is beyond the scope of the lecture. I just wanted to mention that. And with this, we finish this lecture. Thank you.