The following program is brought to you by Caltech. Welcome back. Last time, we introduced the notion of overfitting. And the idea was that we are fitting the data all too well at the expense of the generalization out of sample. And we took a case where the target was simple, but we added very little noise. And that little noise was enough to misguide the fit using a higher-order polynomial into producing a very poor approximation of the target that we are trying to approximate. Overfitting as a notion is broader in scope than just bad generalization. If you think of what the VC analysis told us, the VC analysis told us that given the data resources and the complexity of the hypothesis set, with nothing said about the target, we can predict the level of generalization as a bound. Now, overfitting relates to the target. For example, in this case, the target is noisy, and we have overfitting. If the target was noiseless, if we had points coming from the blue curve and we fit them, we would fit them perfectly, because it's a very simple equation to solve for a polynomial. And then we would get the blue curve exactly. Since the VC analysis doesn't tackle the target function, you might be curious: are we changing the game? What is the deal here? So the idea is that the VC analysis doesn't tackle the target, not in the sense that it doesn't know how to deal with it. What it does is get you a bound that is valid for all possible targets, noisy or noiseless. And therefore, it allows the notion of overfitting. It gives you a bar for bad generalization, and the actual generalization could be that bad, or it could be better, et cetera. And furthermore, it could be that the in-sample error is going down while the out-of-sample error is going up, which is our definition of overfitting. Or it could be that both of them are going down, but the generalization is getting worse, and so on.
So it doesn't specify to us whether overfitting will happen or not. So although it doesn't predict it, it allows it. So now we are zooming in on the details of the theory and trying to characterize a situation that happens very often in practice, where the noise in the target function results in overfitting. And we can do something about it. That's why we are actually studying it, because there will be ways to cure that disease, if you will. And then we characterized the source of overfitting as fitting the noise. And the conventional meaning of the noise is what we are now referring to as stochastic noise, because we are introducing another type. And the idea is that if you fit the noise, this is bad, because you are fitting something that cannot be fit. And because you are fitting it, you are predicting, extrapolating out-of-sample into a non-existing pattern. And that non-existing pattern will take you away from the target function. So it will contribute in a harmful way to your out-of-sample error. But the novel notion was the fact that even if we don't have stochastic noise, even if the data is not noisy in the conventional sense, there is something which we refer to as deterministic noise, which is a function of the limitations of your model. So here, your model is a polynomial of a particular order. Other models will give you different deterministic noise. And it is defined as the difference between the target function, in this case the blue wiggly curve, and the best approximation within your hypothesis set to that target function. So again, it captures a notion of something that we cannot learn at all, because it's outside of our ability as a hypothesis set. And therefore, it behaves like a noise. If we try to fit it on a finite sample, try to dedicate some resources to it, whatever we are learning doesn't make sense, and it will lead to a pattern that harms the out-of-sample error.
And indeed, we ran an extensive experiment where we compared the deterministic noise, parameterized by the target complexity. The more complex the target is, the more deterministic noise we have. And we found that the impact on overfitting is fairly similar to the behavior when we increase the stochastic noise in a similar experiment. So today, we are going to introduce the first cure for overfitting, which is regularization. And next time, we are going to introduce validation, which is the other side of this. And regularization is a technique that you will be using in almost every machine learning application you will encounter. So it's a very important technique, very important to understand. And there are many approaches to it. So as an outline, I am going to first talk about it informally and talk about the different approaches. Then I'm going to give a mathematical development of the most famous form of regularization. And from that, we are not only going to get the mathematical result, but we are going to get very good intuition about the criteria for choosing a regularizer. And we'll discuss it in some detail. And then we will talk about the ups and downs of choosing a regularizer at the end, which is the practical situation you will face. You have a problem that is overfitting. How do I choose my regularizer? OK, let's start. You will find two approaches to regularization in the literature, which are as vigorous as one another. One of them is mathematical, purely mathematical. And it mostly comes from function approximation, where you have an ill-posed problem. You want to approximate a function, but there are many functions that actually fit it. So the problem is ill-posed, and then you impose smoothness constraints on it in order to be able to solve it. This is a very well-developed area, and it has been borrowed by machine learning. And actually, the mathematical development I'm going to give relates to that work.
Also, in the Bayesian approach to learning, regularization is completely mathematical. You put it in the prior, and from then on, you have a very well-defined regularizer in this case. And in all of those cases, if the assumptions that you made in order to make the development hold, then this is the way to go. There is no reason to go for intuition and heuristics and sort of the other stuff, if you have a solid assumption and a solid mathematical derivation that gets you the result. The problem really is that in most of the cases you will encounter, the assumptions that are made here are not realistic. Therefore, you end up with these approaches having a very careful derivation, based on assumptions that do not hold. And it's a strange activity when this is the case. If you are very rigorous in trying to get a very specific mathematical result, when your main assumption is not going to hold in the application you are going to use it in, then you are being penny-wise and dollar-foolish. The best utility for the mathematical approach in practical machine learning is to develop the mathematics in a specific case, and then try to interpret the mathematical result in such a way that we get an intuition that applies even when the assumptions don't apply. Very much like we did with the VC analysis. We don't compute the VC dimension and get the bound in every case, but we got something out of the VC bound, which gives us the behavior of generalization. And from then on, we used it as a rule of thumb. So here we are going to do something similar. The other approach you will find is purely heuristic. And in this case, you are just handicapping the minimization of the in-sample error, which is putting the brakes, as we described it last time. And indeed, this is what we are going to do. And we are going to borrow enough from the math to make this not a completely random activity, but rather pointed at something that is likely to help our cause of fighting overfitting.
So I'm going to start with an example of regularization and how it affects overfitting. And the example will be quite familiar. You have seen it before. You probably recognize this picture. This is a sinusoid. And we had a funny problem where we had only two points in the training set, so capital N equals 2. And we were fitting our model, which was a general line in this case. So we passed the line through the two points. And we get a variety of lines depending on the data set you have. And we noticed, after doing a careful analysis of this using the bias-variance analysis, that this is really bad. And the main reason it's bad is that it's all over the place, and being all over the place results in a high variance term. That was the key. And in that case, a simplistic constant model, where you approximate the sine by just a constant, which is zero on average, is actually better in performance out of sample than fitting with a line. That was the lesson that we got there. So let's see if we can improve the situation here by regularizing it, by controlling the lines. Instead of having wild lines, we are going to have mild lines, if you will. So what we are going to do, we are going to not let the lines be whatever they want. We are going to restrict them in terms of the offset they can have, and the slope they can have. That is how we are putting the brakes on the fit. Obviously, we are sacrificing the perfect fit on the training set when we do that. But maybe we are going to gain, yet to be seen. So this would be without regularization, using our new term. And when you have it with regularization, and put the constraint on the offset and the slope, these are the fits you are going to get on the same data sets. Now you can see that they are not as crazy as the lines here. Each line tries to fit the two points. It doesn't fit them perfectly, because it is under a constraint that prevents it from passing through the points perfectly.
Nonetheless, it looks like the gray variance region here has been diminished. But we don't have to judge it visually. We can go to our standard quantities, the bias and variance, and do a complete analysis here, and see which one wins. So let's see who the winner is. This is without regularization versus with regularization. We have seen without regularization before. This was the case where, if you remember, this red guy is the average line you get. It is not a hypothesis that you're going to get in any given learning scenario, but it is the average of all the lines people get when they get different data sets of two points each. And around that is a gray variance, depending on which two points you get. And this is depicted as a standard deviation by this region. And the width of the gray region is what killed us in that case. The variance is so big that, in spite of the fact that with an infinite number of data sets of two points each you would get the red thing, which is not bad at all, you don't get that; you get only two points. So sometimes you will be doing something like that instead of this, and on average, the out-of-sample error will be terrible. So let's look at the situation with regularization. As expected, the gray region has diminished, because the lines weren't as crazy. If you look at the red line, the red line is a little bit different, because we couldn't fit the points, so there is a little bit of an added bias, because the fit is not perfect, and we get this. Now, regularization in general reduces the variance at the expense of possibly increasing the bias just a little bit. So think of it as that I am handicapping the fit. Well, you are handicapping the fit on both the noise and the signal. You cannot distinguish one from another. But the handicapping of the noise is significant. That's what reduced the variance.
The handicapping of the fit results in a certain loss of the quality of the fit that is reflected by that. Let's look at the numbers and see whether this actually holds up. So the bias here was 0.21. We have seen these numbers before, and the variance was a horrific 1.69, and when we added them up, the linear model lost to the simplistic constant model. So let's look at it with regularization. So now we are still using the linear model, but we are regularizing it, and you get a bias of 0.23. Well, that's not too bad. We lost a little bit. Think of it as a side effect of the treatment. You are attacking the disease, which is overfitting, and you will get some funny side effects. So instead of getting the 0.21, you are getting 0.23. OK, fine. How about the variance? Totally dramatic, 0.33. And when you add them up, not only do you win over the unregularized guy, you also win over the constant model. If you get the numbers for the constant model, this guy wins. And that has a very interesting interpretation, because when you are trying to choose a model, you have the constant, and then you have the linear, and then you have the quadratic. This is sort of a discrete grid. Maybe the best choice is actually in between these guys. And you can look at regularization as a way of getting the intermediate guy. There is a continuous set of models that go from extremely restricted to extremely unrestricted. And therefore, you fill in the gap, and by filling in the gap, you might find the sweet spot that gives you the best out-of-sample error. And in this case, we don't know that this is the best out-of-sample error for the particular level of regularization that I did, but it certainly beats the previous champion, which was the constant model. So knowing this, we would like to understand what was the regularization, in specific terms, that resulted in this. And I'm going to present it in a formal setting.
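The N = 2 sinusoid experiment above can be sketched in a few lines. This is a minimal simulation, not the lecture's exact code: the target, the two-point data sets, and the bias/variance estimates follow the description in the lecture, while the particular regularization strength (lam = 0.1) and the grid resolution are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(np.pi * x)

def fit_line(x, y, lam=0.0):
    # Least squares on features (1, x); the lam * I term is the soft
    # constraint on the offset and slope (lam = 0 means no regularization).
    Z = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(Z.T @ Z + lam * np.eye(2), Z.T @ y)

def bias_variance(lam, trials=10000):
    # Estimate bias and variance over many data sets of two points each.
    grid = np.linspace(-1, 1, 201)
    G = np.column_stack([np.ones_like(grid), grid])
    preds = np.empty((trials, grid.size))
    for t in range(trials):
        x = rng.uniform(-1, 1, size=2)              # a data set: N = 2
        preds[t] = G @ fit_line(x, target(x), lam)
    gbar = preds.mean(axis=0)                        # the average hypothesis (red line)
    bias = np.mean((gbar - target(grid)) ** 2)
    var = np.mean(preds.var(axis=0))                 # squared width of the gray region
    return bias, var
```

With lam = 0 this reproduces numbers in the neighborhood of the lecture's bias 0.21 and variance 1.69; with a modest lam, the variance collapses while the bias barely moves.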
And in this formal setting, I'm going to give a full mathematical development until we get to the solution for this regularization, which is the most famous regularization you will encounter in machine learning. And my goal is not mathematics for the sake of mathematics. My goal is to get a concrete conclusion in this case, and then read off that conclusion what lessons we can learn in order to be able to deal with a situation which is not as ideal as this one, which indeed we will succeed in. So let's look at the polynomial model. We are going to use polynomials as the expansion components. And we are using Legendre polynomials, which I alluded to last time. And there is nothing mysterious about them. They are polynomials, as you can see. L2 is of order 2, L3 is of order 3, and so on. And the only thing is that they are created such that they are orthogonal to each other, which makes the mathematics nice. And it makes it such that when we combine them using coefficients, the coefficients can be treated as independent. They deal with different coordinates that don't interfere with each other. If we used just the monomials, the monomials are extremely correlated. Therefore, the relevant parameter, as far as you are concerned, would rather be a combination of the weights than an individual weight. So this saves you by forming the combinations ahead of time, so that the weights actually are meaningful in their own right. That's the purpose here. So what is the model? The model will be h sub q, which is, by definition, the polynomials of order q. And the non-linear transformation that takes the scalar variable x and produces this polynomial is given by this vector, as usual. You start with the mandatory 1, and then you have the Legendre polynomial of order 1 up to the Legendre polynomial of order q. When you combine these linearly, you are going to get a polynomial of order q.
Not a weird polynomial of order q, just a regular polynomial of order q, just represented in this way. So if you actually expand it, there will be a coefficient for the constant, a coefficient for x, a coefficient for x squared, up to x to the q. So using the polynomials, the formal parameterization of the hypothesis set would be the following. You take these guys and give them weights. And these weights are the parameters that will tell you one hypothesis versus the other. And you sum up over the range that you have. And this will be the general hypothesis in this hypothesis set. And because it has that nice form, which is linear, we obviously are going to apply the old-fashioned linear regression in this space in order to find the solution. So it will be a very easy analytic solution because of this. Let me just underline one thing. I'm talking here about the hypothesis set. I'm using the Legendre polynomials and this model in order to construct the hypothesis set. I didn't say a word about the target function. The target function here is unknown. And the reason I am saying that is because last time, in the experimental overfitting, I constructed the target function using the same apparatus. And I did it just because the overfitting depended on the target function, and I wanted to pin it down. But here, the target function goes back to the normal learning scenario. The target function is unknown. And I am using this as a parameterized hypothesis set in order to get a good approximation for the target function using a finite training set. That's the deal. OK. So let's look at the unconstrained solution. Let's say I don't have regularization. This is my model. What do you do? We have seen this before. I'm just repeating it because it's in the z-space, and in order to refresh your memory. OK. So you are given the examples, x1 up to xn, with the labels, the labels being real numbers in this case.
And x1 up to xn, I'm writing them as scalars because they are. And then I transform them into the z-space. So I get a full vector corresponding to every x, which is the vector of the Legendre polynomials evaluated at the corresponding x. So I get this. And my goal of the learning is to minimize the in-sample error. The in-sample error will be a function of the parameters w. And this is the formula for it. Exactly the squared-error formula that we use for linear regression. So you do this, which is the linear combination in the z-space. You compare it to the target value, which is yn. The error measure is squared. You sum up over all the examples and normalize by N. So this is indeed the in-sample error, as we know it. And we put it in vector form, if you remember this one. So all of a sudden, instead of z as a vector, we have Z as a matrix. And instead of y as a scalar, we have y as a vector. So everybody got promoted. And the matrix Z is where every vector z is a row. So you have a bunch of rows describing this. So it's a tall matrix if you have a big training set, which is the typical situation. And the vector y is the corresponding vector of the labels y. And when you put it in vector form, you have this equals that, very easy to verify. And it allows us to do the operations in a matrix form, which is much easier to do. So we want to minimize that. And the solution, we are going to call w_lin, for linear regression in this case. And we have the form for it. It's the one-step learning. And it happens to be the pseudo-inverse, now in the z-space, so it has this form. So if I give you the x's, and you know the form for the Legendre polynomials, you compute the z's. You have the matrix, you have the labels, you plug it into the formula, and you have your solution. So this is an open-and-shut case. Let's look at the constrained version. What happens if we constrain the weights?
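As a concrete sketch of this unconstrained pipeline, here is the Legendre transform plus the pseudo-inverse in a few lines, assuming numpy's built-in Legendre utilities (`legvander` builds exactly the matrix Z described above, with the mandatory 1 as column 0):

```python
import numpy as np
from numpy.polynomial.legendre import legvander

def legendre_transform(x, Q):
    # z_n = (1, L_1(x_n), ..., L_Q(x_n)): legvander's column k is the
    # Legendre polynomial L_k evaluated at x; column 0 is the mandatory 1.
    return legvander(x, Q)

def w_lin(x, y, Q):
    # One-step learning: the pseudo-inverse solution minimizing
    # E_in(w) = (1/N) * ||Z w - y||^2
    Z = legendre_transform(x, Q)
    return np.linalg.pinv(Z) @ y
```

For example, if the labels come exactly from a cubic, then w_lin with Q at least 3 fits them with zero in-sample error, since the target lies inside the hypothesis set.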
Now, come to think of it, we have already constrained the weights in one of the applications. We didn't say that we did, but that's what we effectively did. Why is that? We actually had a hard constraint on the weights when we used h2 instead of h10. Remember, h2 was the second-order polynomial, h10 was the tenth-order polynomial. Wait a minute. These are two different hypothesis sets. I thought the constraint was going into one hypothesis set and then playing around with the weights. Yes, that's what it is. But one way to view h2 is as h10 with a constraint. And what would that constraint be? Just set all the parameters to 0 above power 2. That is a constraint. But obviously, that's an extreme case. What we usually mean by a constraint in regularization is something a little bit softer. So here is the constraint we are going to work with. We are going to work with a budget C for the total squared magnitude of the weights. Now, before we interpret this, let's first concede that this is indeed a constraint. The hypotheses that satisfy this are a proper subset of h sub q, because I have excluded the guys that happen to have weights bigger than that. So because of that, I'm already ahead, using the VC analysis that I have in my mind. Oh, I have a smaller hypothesis set. So the VC dimension is going in the direction of being smaller, so I stand a chance of better generalization. So this is good. Now, interpreting this as something along the same lines: instead of setting some weights to 0, which is a little bit hard, you just want them in general to be small. So you cannot have them all big. So if you have some of them at 0, that leaves more in the budget for you to play around with for the rest of the guys. And because of this, if you think of the previous one as a hard-order constraint, that is, you say the order is 2 and anything above 2 is 0, here you can deal with it as a soft-order constraint. I'm not really excluding any orders, whatever.
I'm just making it harder for you to play around with all the powers. So now let's look at the problem, given that this is the constraint. You are still minimizing the in-sample error. But now you are not free to choose the w's any which way you want. You have to satisfy the constraint. So the minimization is subject to, and you put the constraint in vector form, and this is what you have. So this is now the problem you have. Now when you solve it, however you do that, we are going to call the solution w_reg, for regularization, instead of w_lin for linear regression. And the question is, what happens when you put that constraint? What happens to the old solution, which is w_lin? Given w_reg, which one generalizes better, what is the form for each, et cetera? So let's see what we do to solve for w_reg. You are minimizing this subject to the constraint. Now, I could do this mathematically very easily using Lagrange multipliers, or the inequality version of Lagrange multipliers, KKT, which I will actually use in the derivation of support vector machines next week. But here I am just going to settle for a pictorial proof of what the solution is, in order to motivate it. And obviously, after you learn the KKT, you can go back and verify that this is indeed the solution you get analytically. So let's look at this. I have two things here. I have the error surface that I'm trying to minimize, and I have the constraint. So let's plot both of them in two dimensions, because that's what we can plot. So here is the way I'm drawing the in-sample error. I am putting contours where the in-sample error is constant. So inside will be smaller E_in, and outside will be bigger E_in, et cetera. But for all points on that contour, which actually happens to be the surface of an ellipsoid if you solve it analytically, E_in is the same value. When you look at the constraint, the constraint tells you to be inside this circle.
So let's look at the centers for these. What is the center here? Well, the center here is the minimum possible in-sample error you can get without a constraint, and that we already declared to be w_lin. That's the solution for linear regression. So that is where you achieve the minimum possible E_in. And as you go further and further out, E_in increases. And now here's the constraint. What is the center of the constraint? Well, the center of the constraint is the origin, just because of the nature of it. So now the idea here is that you want to pick a point within this disk such that it minimizes that. It shouldn't be a surprise to you that I will need to go as far out as I can afford to, without violating the constraint, because this gets me closer to that. So the visual impression here is actually true mathematically. So indeed, the constraint that you will actually end up working with is not that w transposed w is less than or equal to C, but actually equal to C. That is, the best value for E_in, given the constraint, will occur at the boundary. So let's take a possible point that satisfies this and try to find an analytic condition for the solution. Before we do that, let's say that the constraint was big enough to include the solution for linear regression. That is, C is big enough that this is the big circle. What is the solution? You already know it. It's w_lin, because that is the absolute minimum, and it happens to be allowed by the constraint, so this is the solution. So the only case where you are interested in doing something new is when the constraint takes you away from that, and now you have to find a compromise between the objective and the constraint. The compromise is such that you have to obey the constraint, there is no compromise there, but given that condition, what would be the best you can get in terms of the in-sample error? So let's take a point on the surface. This is a candidate.
I don't know whether this gives you the minimum. I don't think it will, because I already said that it should be as close as possible to the outside. But let's see, maybe this will give us the condition. Let's look at this point and look at the gradient of the objective function. The gradient of the objective function will give me a good idea about directions to move in order to minimize E_in, as we have done before. So if you draw this, you will find that the gradient has to be orthogonal to the ellipse, because the ellipse, by definition, has the same value of E_in. So the value of E_in does not change as you move along it. So any change would have to come from moving orthogonal to it. So the direction of the gradient will be this way, and I'm drawing it pointing outside, because E_in grows as you move away from w_lin. So that's one vector. Now let's look at the vector orthogonal to the other surface, the red surface. That's not a gradient of anything yet, but if we draw it, it looks like that. And then I find out that this is what? This is just w. If I take a point here, this is the origin, this is the vector, and it happens to be orthogonal to the circle. So this is the direction of the vector w, and this is the direction of the gradient of E_in. Now, by looking at this, I can immediately tell you that this w does not achieve the minimum of this function subject to this constraint. How do I know that? Because I look at these two vectors, and there is an angle between them, which means that if I move in this direction, E_in will increase, and if I move in this direction, E_in will decrease. I wouldn't have that situation if they were exactly opposite to each other. Then I could move, and nothing would happen. But now the gradient of E_in has a component along the tangent here. And therefore, moving along this circle will change the value of E_in. And if I can increase it and decrease it by moving, then definitely this point does not achieve the minimum of E_in.
So I keep going until I get to the point where I achieve the minimum of E_in. And at that point, what would be the analytic condition? The analytic condition is that this guy is going in one direction, and this guy is going in exactly the opposite direction. So let's write the condition. The condition is that the gradient, which is the blue guy, is proportional to the negative of w at your solution. Because now we have declared the solution: this is the value at which you achieve the optimum under the constraint, and we already called that w_reg. So at the value w_reg, the gradient should be proportional to the negative of that. Now, because it's proportional to the negative of it, I'm going to put the constant of proportionality in a very convenient form for further derivation. I'm going to write it as minus twice, because I'm going to differentiate a square somewhere and I don't want the 2 to hang around, times lambda, which is my generic parameter, divided by N. Of course, I'm allowed to do that, because there is some lambda that makes it right. So I'm just putting it in that form. So when I put it in this form, I can now say: this is the condition for w_reg. This equals minus that. I can move things to the other side. And now I have an equation which is very interesting. I have this plus that equals the vector 0. Now, this looks suspiciously close to being the gradient of something. And if it is the gradient of something, then at the point where that gradient is 0, I am at the minimum of whatever that something is. So let's look at what this is the derivative of. It's as if I was minimizing something: differentiating E_in gives me the first term, and conveniently, differentiating lambda over N times w transposed w gives me the second term. So the solution here is the minimization of this augmented quantity. That's actually pretty cool, because I started with a constrained optimization problem, which is fairly difficult to do in general. You need some method to do that.
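The chain of reasoning above, written out as equations with the lecture's choice of the proportionality constant:

```latex
\nabla E_{\text{in}}(\mathbf{w}_{\text{reg}})
  = -\,2\,\frac{\lambda}{N}\,\mathbf{w}_{\text{reg}}
\;\;\Longleftrightarrow\;\;
\nabla\!\left(E_{\text{in}}(\mathbf{w})
  + \frac{\lambda}{N}\,\mathbf{w}^{\mathsf{T}}\mathbf{w}\right)
  \bigg|_{\mathbf{w}=\mathbf{w}_{\text{reg}}} = \mathbf{0},
```

so the constrained minimizer $\mathbf{w}_{\text{reg}}$ is exactly the unconstrained minimizer of $E_{\text{in}}(\mathbf{w}) + \frac{\lambda}{N}\,\mathbf{w}^{\mathsf{T}}\mathbf{w}$, for some value of $\lambda$ that corresponds to the budget $C$.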
And by doing this logic, I ended up with minimizing something unconditionally. Just minimize this, and whatever you find will be your solution. And here, we have a parameter lambda, and there, we had a parameter C. They are related to each other. And actually, the parameter lambda depends on C, depends on the data set, depends on a bunch of stuff. So I'm not going to even attempt to get lambda analytically. I just know that there is a lambda. Because when we are done, you will realize that the lambda we get for regularization is decided by validation, not by solving anything. So we don't have to worry about it yet. But it's a good idea to think about how C relates to lambda, just to be able to relate to the translation of the problem from the constrained version to the unconstrained version. So the idea is that as C goes up, lambda goes down, and vice versa. So let's start with the following. What happens if C is huge? Well, if C is huge, then w_lin is already the solution. And therefore, you should be just minimizing E_in, as if there was no constraint. But that corresponds to lambda equals 0, doesn't it? You would be minimizing E_in. So if C is huge, lambda is 0. Now let's make C smaller and smaller. When C is smaller, the regularization is more severe, because the constraint is becoming more severe. And in order to make the condition here more severe in terms of the regularization term, you need to increase lambda. The bigger lambda is, the more emphasis you have to put on the regularization part of the game. And therefore, indeed, if C goes down, lambda goes up. Take it all the way down: let's say that C is 0. What is the solution here? Well, you have left me just one point in the domain. I don't care what E_in is. It happens to be the minimum, because it's the only value. So the solution is that single point, w equals 0. How do you force this to have the solution w equals 0? By making lambda infinite.
In which case, you don't care about the first term. You just absolutely, positively have to make w equal 0. So indeed, that correspondence holds. So we put it there, and we understand in our minds that there are two parameters that are related to each other, and analytically, we didn't find them. But now we have a correspondence, and the form we have here will serve as our form. And we have to be able to get lambda in a principled way, which we will. So this is the only remaining outstanding item of business. Now let's look at the augmented error, which is an interesting notion. If you are minimizing E_aug, what is E_aug? We used to minimize E_in. Now we have augmented it with another term, which is a regularization term. So we write it down this way. And this can simply be written out for this particular case, because E_in is no mystery. We have a formula for it, and you look at this. And now this looks very promising. If I ask you to solve this: oh, this used to be a quadratic form, and it's still a quadratic form. So I don't think the solution will be difficult at all. But the good news is that solving this, which is unconditional, unconstrained optimization, is equivalent to solving the following problem: you minimize E_in by itself, which we have the formula for, subject to the constraint. Now, this is an important correspondence because of the following. The bottom formulation of the problem lends itself to the VC analysis. I am restricting my hypothesis set explicitly. There are certain hypotheses that are no longer allowed. I am using a subset of the hypothesis set. I expect good generalization. Now, mathematically, this is equivalent to the top one. If you look at the top one, I am using the full hypothesis set without explicitly forbidding any value. I'm just using a different learning algorithm to find the solution. Here, in principle, you just minimize this, and whatever the solution happens to be, that's what you get.
And I'm going to get a full-fledged w that happens to be a member of H_Q, my hypothesis set. So nothing here is forbidden; certain solutions are just more likely than others, but that's an algorithmic question. So it would be very difficult to invoke a VC analysis on the top form, but it's easy to invoke it on the bottom one. And that correspondence between the constrained version, which is the pure form of regularization as stated, and the augmented error, which doesn't impose a constraint but adds a term that captures the constraint in a soft form, is the justification of regularization in terms of generalization, as far as VC analysis is concerned. And it's true for any regularizer you use; we're just giving an example with this particular regularizer. Now let's get the solution. That's the easy part. We minimize this, not subject to anything, and this is the formula for it. What do you do? You take the gradient of it and equate it to 0. Can anybody differentiate this? We have done it before. The first part gives us what we got in linear regression; that's what got us the pseudo-inverse solution in the first place. The other term conveniently gives us the lambda term. And you can see why I chose the constant of proportionality in that funny form: the 2 comes from differentiating the square, and the 1/N is there because E_in has a 1/N, so I was able to factor it out and leave a clean lambda. That's why I chose the constant of proportionality in that particular functional form. So I set this to zero and solve it, and when you solve it, you get w_reg, the formal name of the solution to this problem. It's not the pseudo-inverse, but it's not that far from it. All you do is group the w terms, move y to the other side, and take an inverse, and that's what you get. So this is the solution with regularization.
And as a reminder, if we didn't have regularization and were solving for w_lin, w_lin would simply be this fellow, the regular pseudo-inverse, which you can also get by setting lambda to 0 here. So let's look at the solution. We have this without regularization, and this with regularization, which is the one we are going to use. Now, this is remarkable: under these clean assumptions, we actually have one-step learning, including regularization. You tell me what the problem is, and I have the solution outright. Instead of doing a constrained optimization, or doing it in increments, this is the solution. That's a pretty good tool to have. It is also very intuitive, because look at it: if lambda is 0, you have the unconstrained case, without regularization. As you increase lambda, what happens? The regularization term becomes dominant in the solution. This term is the one that carries the information about the inputs; the other term is just lambda I. Now take it to the extreme and say lambda is enormous. Then lambda I completely dominates the other term, which becomes mere noise next to it, and when I invert, I get something like 1 over lambda. So w_reg would be 1 over lambda, for lambda huge, times something. Who cares about the something? 1 over lambda knocks it down to 0. I get a solution that is very close to 0. And indeed, I'm getting a smaller and smaller w_reg as lambda gets large, which is what I expect, and in the extreme case I'm forced to have w equal 0, which is the extreme case we discussed before. So this indeed stands up to the logic of what we expect. So we have the solution. Let's apply it and see the result in a real case. We are now minimizing this, but we know what the solution is explicitly.
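As a hedged sketch of this one-step solution (the names `Z` for the input matrix and `ridge_solution` are my own labels, not from the lecture), the closed form w_reg = (ZᵀZ + λI)⁻¹Zᵀy takes only a few lines:

```python
import numpy as np

def ridge_solution(Z, y, lam):
    """One-step regularized solution (sketch):
    w_reg = (Z^T Z + lambda * I)^{-1} Z^T y.
    Setting lam = 0 recovers the plain pseudo-inverse solution w_lin."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

# Toy check on random data: as lambda grows, the weights shrink toward 0.
rng = np.random.default_rng(0)
Z = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
w_lin = ridge_solution(Z, y, 0.0)     # no regularization
w_big = ridge_solution(Z, y, 1e6)     # enormous lambda knocks w down
print(np.linalg.norm(w_lin), np.linalg.norm(w_big))
```

The huge-lambda case behaves exactly as argued above: the inverse is roughly (1/λ)I, so the solution is pushed toward zero.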
And what I'm going to do is vary lambda, because this will be a very important parameter for us. We have the same regularizer, w transpose w, and I'm going to vary the amount of regularization I put in, and apply it to a familiar problem, for different lambdas. Remember this problem? We saw it last time; actually, we saw it earlier in this lecture. So this is the case, now put in the new terminology. What is this? This is unconstrained; in other words, it is the regularized version, but with lambda equal to 0. Now let's put in a little bit of regularization. And here's what I mean by a little bit. Is this a little bit for you? Let's see the result. Wow. This is the fit I showed you last time, just as an appetizer. Remember? That's all it took. So the medicine is working: a small dose did the job. That's good. Now let's get carried away, like people get carried away with medicine, and take a bigger dose. What happens? I think we are overdosing here. Let's go further. You can see what's happening: I'm constraining the weights, and now all the solution is doing is constraining the weights; it doesn't care as much about the fit. The curve keeps getting flatter and more horizontal, the slope and the curvature shrinking, until there is nothing left in the line. If you keep increasing lambda, eventually what happens? You get just a silly horizontal line. You have taken a fatal dose of the medicine. So when you deal with lambda, you really need to understand that the choice of lambda is extremely critical. And the good news is that, although our choice of the type of regularizer, like w transpose w in this case, will be largely heuristic, studying the problem, trying to understand how to pick a regularizer,
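To make the lambda sweep concrete, here is a small numerical sketch along these lines; the target, noise level, and polynomial degree are stand-ins I chose, not the lecture's exact experiment:

```python
import numpy as np

# Fit a 10th-order polynomial to a few noisy samples of a simple target,
# for several lambdas, and watch out-of-sample error go from overfitting
# (tiny lambda) to underfitting (huge lambda).
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 15)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(15)   # noisy samples
Z = np.vander(x, 11)                                    # degree-10 design
x_test = np.linspace(-1, 1, 500)
y_test = np.sin(np.pi * x_test)

def e_out(lam):
    """Ridge fit for one lambda, then squared error on a dense grid."""
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(11), Z.T @ y)
    return np.mean((np.vander(x_test, 11) @ w - y_test) ** 2)

lambdas = [0.0, 1e-4, 1e-2, 1.0, 100.0]
errors = {lam: e_out(lam) for lam in lambdas}
print(errors)
```

The point of the sweep is the shape: some intermediate dose does at least as well as no regularization, while an enormous lambda flattens the fit toward a horizontal line.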
This will be a heuristic choice. The choice of lambda, on the other hand, will be extremely principled, based on validation. And that will be the saving grace if our heuristic choice of regularizer is not that great, as we will see in a moment. If you want to characterize what happens as you increase lambda: we started with overfitting, which was the problem we were trying to solve, and we solved it, and we solved it all too well. We are certainly not overfitting at the end, but the problem is that we went to the other extreme, and now we are underfitting, which is just as bad. So the proper choice of lambda is important. Now, the regularizer I described to you is the most famous regularizer in machine learning, and it's called weight decay. The name is not strange, since we're trying to get the weights to be small, so 'decay' is not a stretch. But I would like you to understand why it is specifically called decay. The reason is the following. Let's say you are not in a neat linear case like this one; let's say you are doing this in a neural network, where weight decay, trying to minimize w transpose w, is a very important regularization method. We know that in a neural network you don't have a neat closed-form solution, and you use gradient descent. So let's say you use gradient descent on this, and let's say batch gradient descent, for simplicity of the derivation. What do you do? In batch gradient descent, you have a step that takes you from w at time t to w at time t plus 1, which happens to be w(t) minus eta, the learning rate, times the gradient. So we just need to plug in the gradient, and we have our step. The gradient is the gradient of this sum. The gradient of the first part is what we had before: if we didn't have regularization, that's what we would be doing, and that is what happens.
And we do backpropagation and whatnot. But now there is an added term because of the regularizer, and that added term looks like this, just by differentiating. Now, if I reorganize this by collecting the terms that multiply w(t) by themselves, I get this term, collecting these two fellows, this one and this one, which happen to be multiplied by w(t), and then I have the remaining term, which I can put this way. So now look at the interpretation of the step. I am in weight space, and this is my weight, and here is the direction that backpropagation suggests I move in. Without regularization, I would be moving from here to here. Now, using this form, before I take that move, which I will, I am actually going to shrink the weights: here is the origin, I'm here, and I move in this direction, because this factor is a fraction. It could be a sizable shrinkage, depending on lambda; I could be shrinking by a factor of a half or something, though most likely it will be very slight, like 0.999. But in every step now, instead of just moving according to the gradient, I shrink, then move; shrink, then move. The gradient terms are the informative ones: they tell me what to do in order to approximate the function. The shrinking term just obediently pulls toward 0. And that makes you unable to escape very far. If I were just moving this way, this way, that way, and so on, I could go very far; but now, in every step, there is something that grounds you. And if you take lambda big enough that the shrinkage is drastic, the trajectory goes like this: here is the suggested direction, and I'm going to take it, but before I do, I shrink toward the origin; then I move this way, and next time I shrink again. Before you know it, you are at the origin, regardless of what the gradient is suggesting. And that is indeed what happens when lambda is huge.
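The shrink-then-move step can be written out directly. Minimizing E_aug(w) = E_in(w) + (λ/N) wᵀw by batch gradient descent rearranges into the update below; the demo numbers are hypothetical:

```python
import numpy as np

def weight_decay_step(w, grad_Ein, eta, lam, N):
    """One batch gradient-descent step on the augmented error
    E_aug(w) = E_in(w) + (lam / N) * w @ w.
    The update w <- w - eta * grad E_aug rearranges into
    "shrink, then move":  w <- (1 - 2*eta*lam/N) * w - eta * grad_Ein."""
    return (1 - 2 * eta * lam / N) * w - eta * grad_Ein

# With these (made-up) numbers the shrink factor is 0.5 per step, so the
# weights are grounded near the origin no matter what the gradient says.
w = np.array([1.0, -2.0])
for _ in range(100):
    g = np.array([0.1, 0.1])   # stand-in for the back-prop gradient
    w = weight_decay_step(w, g, eta=0.1, lam=25.0, N=10)
print(w)
```

With a drastic shrink factor like this one, the iteration settles at a fixed point very close to the origin, which is the decay behavior described above.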
You are so pulled toward the zero solution that you don't really care about learning the function itself; the shrinking factor pushes you there. So that's why it's called weight decay: the weights decay from one iteration to the next. And it applies to neural networks; all you need to remember is that in a neural network, w transpose w is a fairly elaborate sum: you sum over all the layers, all the input units, all the output units, the value of each weight squared. So that's what you have. Now let's look at variations of weight decay. This is the method we developed, and we'd like to move to other regularizers and try to build some intuition about the type of regularizer we pick. So what do we do here? Instead of uniformly giving a budget C and requiring the sum of the w squared to be less than or equal to C, you can decide that some weights are more important than others. The way you do it is by taking this as your regularizer: you introduce an importance factor for each weight, call it gamma. These are constants whose choice specifies which type of regularizer you are working with, and the constraint is now that this weighted sum is less than or equal to C. Now you have some play: if a particular gamma is very small, I have more liberty to make the corresponding weight big, because it doesn't take much from the budget; if gamma is big, I had better be careful with the corresponding weight, because it kills the budget. So let's look at two extremes. Say I take the gammas to grow exponentially with the order. How do you articulate what this regularizer is doing? Well, it puts huge emphasis on the higher-order terms, so it is trying to find, as much as possible, a low-order fit. Because if, say, Q equals 10 and it tries to use the 10th-order term, even the smallest weight on that term will kill the budget. Now let's look at the opposite.
If you reverse that, then the bad guys are the early, low-order terms: I'm fine with the high-order terms but not the low-order ones, so this would be a high-order fit. You can see there is quite a variety here. In fact, this functional form is indeed used in neural networks, but not for high order versus low order; it is used for something else. When you do the analysis properly for neural networks, you find that the best way to do weight decay is to give different emphasis to the weights in different layers, because they play different roles in affecting the output. That is accommodated by choosing the proper gammas. And the most general form of this type of thing is the famous Tikhonov regularizer, a very well-studied family of regularizers with this general form. What we had before is a quadratic form, but a diagonal one: it only involves w0 squared, w1 squared, up to wQ squared. This one, written in matrix form, is a general quadratic form: it has the diagonal terms and it has off-diagonal terms, so it can give weight to cross terms like w1 times w3, et cetera. By the proper choice of the matrix Gamma, you can get weight decay, the low-order and high-order versions, and many others that fit in this framework. Studying this form is therefore very interesting, because you cover a lot of territory with it. So these are some variations. Now let's go even more extreme and consider not weight decay but weight growth. Why not? What was the game? The game was constraining: you don't want to allow all values of the weights. Weight decay didn't allow big weights; now I'm going to disallow small weights. What's wrong with that? It's a constraint. Let's see how it behaves. First, let's look at weight decay. I'm plotting the performance of weight decay, the expected out-of-sample error, as a function of the regularization parameter lambda.
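Before looking at the performance curves, the Tikhonov form just described can be sketched in code; the diagonal choice `2.0 ** q` for the importance factors is a hypothetical example of "punish high-order terms", not a value from the lecture:

```python
import numpy as np

def tikhonov_solution(Z, y, lam, Gamma):
    """Minimize (1/N)||Z w - y||^2 + (lam/N)||Gamma w||^2 (sketch).
    Closed form: w = (Z^T Z + lam * Gamma^T Gamma)^{-1} Z^T y.
    Gamma = I gives plain weight decay; a diagonal Gamma whose entries
    grow with the order steers the fit toward low-order terms."""
    return np.linalg.solve(Z.T @ Z + lam * Gamma.T @ Gamma, Z.T @ y)

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 30)
Z = np.vander(x, 6, increasing=True)        # columns 1, x, ..., x^5
y = rng.standard_normal(30)
Gamma_low = np.diag(2.0 ** np.arange(6))    # high-order terms kill the budget
w_low = tikhonov_solution(Z, y, 1.0, Gamma_low)   # pushed to a low-order fit
w_lin = tikhonov_solution(Z, y, 0.0, np.eye(6))   # unregularized solution
```

By construction, the regularized solution spends less of the Gamma-weighted budget than the unregularized one.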
There is an optimal value of the parameter, like we saw in the example, that gives me the smallest error; before it, I'm overfitting, and after it, I'm starting to underfit. Any time you see the curve going down and then going up, it means that the regularizer works if you choose lambda right: if I choose lambda at the minimum, I get better out-of-sample performance than if I didn't use regularization at all, which is lambda equals 0. Now let's plot the curve for the case where we constrain the weights to be large, so the penalty is for small weights, not for large weights. What does the curve look like? If it went down from 0 to something before going up, it would be fine, but it doesn't; it just goes up. It's simply bad. But it's not fatal. Why? Because our principled way of choosing lambda will pick lambda equals 0 as the proper choice, so we kill the regularizer altogether. It is a curious case, though, because now it seems you can even pick a regularizer that harms you, and you might start wondering whether you should use a regularizer at all. You do have to use a regularizer, because without one you are going to get overfitting; there is no question about it. It's a necessary evil. But there are guidelines for choosing the regularizer, which I'm going to talk about now. And after you choose the regularizer, there is the check on lambda: if you happen to choose the wrong regularizer and you use correct validation, the validation will recommend giving it lambda equals 0, so there is no downside except the price you pay for validation. So what is the lesson here? It's a practical rule; I'm not going to make a mathematical statement. What criterion did we learn from weight decay that will guide us in the choice of a regularizer? Here are the observations that lead to the practical rule. Stochastic noise, which we are trying to avoid fitting, happens to be high-frequency.
That is, when you think of noise, it looks like this, whereas typical target functions look like that: the noise is jagged, the target is smooth. How about the other type of noise, deterministic noise, which is also a culprit for overfitting? Well, it's not as high-frequency, but it is also non-smooth. That is, we captured what we could capture with the model, and what is left over, chances are, we couldn't capture because it goes up and down faster or more sharply than our model can follow. Again, this is a practical observation; it holds for most of the hypothesis sets one gets to choose and the target functions one gets to encounter. And because of this, here is the guideline for choosing a regularizer: make it tend to pick smoother hypotheses. Why is that? We said that regularization is a cure, and the cure has a side effect. A cure for what? For fitting the noise. So you want to make sure you are punishing the noise more than you are punishing the signal. The noise is the organism we are trying to fight: if we harm it more than we harm the patient, we'll be OK; we'll put up with the side effect, because we are killing the disease. The noise happens to be high-frequency, so if your regularizer prefers smooth hypotheses, it will fail to fit the noise more than it fails to fit the signal. That is the guideline. And it turns out that in most of the ways you mathematically write a hypothesis set as a parameterized set, smaller weights correspond to smoother hypotheses. I could do it the other way around: instead of my hypothesis being a sum of w times a polynomial, I could make it a sum of 1 over w times a polynomial, with the w's as my parameters, in which case big weights would be smoother. But that's not the way people write hypothesis sets. So in most of the parameterizations you will see, small weights correspond to smoother hypotheses.
That's why small weights, and weight decay, work very well in those cases: they tend toward the smooth hypotheses. OK. So now let's write the general form of regularization, and then talk about choosing a regularizer. We are going to call the regularizer, like the weight decay regularizer by itself, without the lambda, capital Omega, and it is a function Omega of h. It used to be a function of w, since the w's are the parameters that determine h; if I now leave out the parameters and talk about the general hypothesis set, the value of the regularizer depends on which hypothesis you pick. The regularizer will prefer the hypotheses for which Omega of h is smaller in value. So you define this function, and you have defined a regularizer. OK. So what is the augmented error that we minimize? In this case, the augmented error of a hypothesis, which is the augmented error of the weights if that is how you parameterize your hypotheses, is written down as this: you get E_in, which we already have, plus lambda, the important parameter, the dose of the regularizer, times the form of the regularizer itself, which we just called Omega of h. This is what we minimize. Does this ring a bell? Does it look like something you have seen before? Well, yes, it does, though the relation may not be obvious. I have seen this one before, from the VC analysis. But that was a completely different ball game: we were talking about E_out, not E_aug; we were not optimizing anything; it was a less-than-or-equal-to. OK, less than or equal to is fine, because we said that the behavior generally tracks the bound. This term is E_in, so that's a perfect match. And this term is capital Omega. Oh, I'm sneaky: I called this capital Omega deliberately. But that Omega was the penalty for model complexity, and the model there was the whole model.
That was not about a single hypothesis: you give me the hypothesis set, and I come up with a number that tells you how bad the generalization would be for that model. But now look at the correspondence: this is a complexity, and this is a complexity, although the complexity here is of an individual hypothesis, which is why it helps me navigate the hypothesis set, whereas that one was just sitting there as an estimate for the whole set. Now, when I talk about Occam's razor, I will relate the complexity of an individual object to the complexity of the set of objects, which is a very important notion in its own right. But if you look at the correspondence, you realize that what I'm really doing, instead of using E_in, is using E_aug as an estimate for E_out, if I take it literally. And the point is that E_aug, the augmented error, is better than E_in. Better at what? Better as a proxy for E_out. You can think of the holy grail of machine learning as finding an in-sample estimate of the out-of-sample error: if you get that, you are done; minimize it, and go home. But there is always this slack, and there are bounds, and this and that. And now our augmented error is our next attempt, beyond the plain-vanilla in-sample error, adding something that gets us closer to the out-of-sample error. So of course the augmented error is better than E_in at approximating E_out, because it's purple, and purple is closer to red than blue is. No, that's not the reason, but it's at least the reason for the slide. So this is the idea in terms of the theory: we found a better proxy for the out-of-sample error. Now, very quickly, let's see how we choose a regularizer. Very quickly not because of anything else, but because it's really a heuristic exercise, and I want to emphasize a main point here. What is the perfect regularizer? Remember when we talked about the perfect hypothesis set? It was the hypothesis set containing a singleton that happens to be our target function. Dream on.
We don't know the target function; we cannot construct something like that. Well, the perfect regularizer would likewise be one that restricts, but in the direction of the target function. You can see that we are going in circles here: we don't know the target function. Now, if you know a property of the target function that lets you move toward it, that is not regularization. There is another technique that uses known properties of the target function to improve the learning explicitly: this property holds for the target function, and there is a prescription for how to use it. Regularization is an attempt to reduce overfitting. It is not matching the target; it doesn't know the target. All it does is apply, generically, a methodology that harms the overfitting more than it harms the fitting: it harms fitting the noise more than it harms fitting the signal. That is our guideline, and because of that, it's a heuristic. So the guiding principle we found is to move in the direction of smoother, and we need to keep the logic straight in our minds: we move toward smoother because the noise is not smooth. That is really the reason; we tend to harm the noise more by doing that. And smoother makes sense when we have a surface to fit. In some learning problems, we don't have a surface to be smooth, and the corresponding notion is simpler. So I'll give you an example from something you have seen before: the movie rating problem, our famous example that we keep going back to. We had an error function for the movie ratings: we were trying to get the factors to multiply to a quantity very close to the rating given by this user, with these factors, to this movie, with those factors. That's what we did. The factors were our parameters, and we adjusted them to match the ratings. And now, in the new terminology, you realize that this is very susceptible to overfitting.
Because, say I have a user and I'm using 100 factors: that's 100 parameters dedicated to that user. If that user only rated 10 movies, then I'm trying to determine 100 parameters from 10 ratings. That's bad news. So clearly regularization is called for. And the notion of simpler here is very interesting: the default you pull toward is that everything gets the average rating. In the absence of further information, assume everything is just the average rating over all movies or all users. Or you can be more finicky about it: the average of the movies this user has seen, or the average of the ratings they have given, since maybe this is an optimistic user or not. But still just an average: you don't consider this particular movie paired with this particular user. So if you pull your solution toward the average, you are regularizing toward the simpler solution. And indeed, that is the type of regularization that was used in the winning solution of the Netflix competition. So this is another notion of simpler, for a case where smoother doesn't apply. Now, what happens if you choose a bad Omega? Which happens: it's a heuristic choice, and I may choose a good one or a bad one. In a real situation, you will be choosing the regularizer in a heuristic way. You can do all the math in the world, but whenever you do the math, remember that you are always making an assumption, and your math will be as good or as bad as your assumption is valid or invalid. There is no escaping that. So don't hide behind a great-looking derivation when its basis is shaky. Still, we don't worry too much, because we have the saving grace of lambda: we are going to use validation. So we had better show up for the next lecture, where we are going to choose lambda.
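As a minimal sketch of "pulling toward the average", here is a one-parameter caricature (my own construction, far simpler than the actual factor model): estimate a single per-user rating m by minimizing the squared error to that user's ratings plus a penalty pulling m toward the global mean, which gives a closed-form blend.

```python
import numpy as np

def regularized_rating(user_ratings, global_mean, lam):
    """Minimize sum_i (r_i - m)^2 + lam * (m - global_mean)^2 over m.
    Setting the derivative to zero gives a weighted blend:
        m = (sum(r_i) + lam * global_mean) / (n + lam).
    With few ratings (small n), the global mean dominates."""
    n = len(user_ratings)
    return (np.sum(user_ratings) + lam * global_mean) / (n + lam)

# Hypothetical numbers: a user with only two (high) ratings is pulled
# strongly toward the overall average of 3.2.
print(regularized_rating(np.array([5.0, 5.0]), 3.2, lam=10.0))
```

With lam = 0 you get the user's plain average back; the larger lam is, the simpler (more average) the estimate becomes.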
And if we happen to be unlucky and, after applying the guidelines, we end up with something that is actually harmful, then validation will tell us it's harmful and will factor the regularizer out of the game altogether. But trying a regularizer in the first place is inevitable: if you don't, you will end up with overfitting in almost all the practical machine learning problems you will encounter. So now let's look at neural-network regularizers, in order to get more intuition about them; it's actually pretty useful for the intuition. So let's look at weight decay for a neural network. The math is not as clean, because we don't have a closed-form solution, but there is a very interesting interpretation that relates small weights to simplicity in this case. So remember this: this was the activation function of the neurons, a soft threshold, and we said that the soft threshold sits somewhere between linear and hard threshold. What does it mean to be between? It means that if the signal is very small, you are almost linear, and if the signal is very large, one way or the other, you are almost binary. So compare using very small weights versus big weights. If you use very small weights, you always stay within this region, because the weights are what determine the signal, so every neuron is basically computing a linear function. Now, I have this big network, layer upon layer upon layer, which I took because someone told me that multilayer perceptrons are capable of implementing elaborate things, so if I put in enough of them, I'll be able to implement whatever I want. Then I look at the functionality I'm actually implementing if I force the weights to be very, very small. Well, this layer is linear. But then it's linear of linear, and linear of linear of linear. And when I'm done, what am I computing? Just a simple linear function in a huge camouflage disguise.
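This near-linearity with tiny weights is easy to check numerically. Here is a toy sketch of my own: a two-layer tanh network whose weights are scaled down to 0.01 agrees, to high precision, with the single linear map obtained by multiplying the weight matrices.

```python
import numpy as np

def two_layer_net(x, W1, W2):
    """Tiny two-layer soft-threshold network: output = W2 @ tanh(W1 @ x)."""
    return W2 @ np.tanh(W1 @ x)

rng = np.random.default_rng(1)
W1 = 0.01 * rng.standard_normal((4, 3))   # very small weights
W2 = 0.01 * rng.standard_normal((1, 4))
x = rng.standard_normal(3)

# With tiny weights the signals are tiny, tanh(s) ~ s, and the whole
# network collapses to the linear map W2 @ W1: linear of linear.
print(two_layer_net(x, W1, W2), (W2 @ W1) @ x)
```

Scaling the weights up moves the neurons into the nonlinear and eventually near-binary regime, which is exactly the simple-to-complex progression described above.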
All the weights just interact and add up, and I end up with a linear function, a very simple one. With very small weights, I'm implementing a very simple function. As you increase the weights, you get into the more interesting nonlinear region, and if you go all the way, you end up with logical dependencies, and with logical dependencies, as we did with the sum of products, you can implement any functionality you want. So you go from the most simple to the most complex by increasing the weights. So indeed, we have a correspondence in this case, not just with smoothness per se, but with the simplicity of the function you are implementing in terms of the size of the weights. There is another regularizer for neural networks, called weight elimination. The idea is the following. We said that the VC dimension of a neural network is, more or less, the number of weights. OK, so maybe it's a good idea to take the network and just kill some of the weights. Although I have the full-fledged network, I force some of the weights to be zero, in which case the number of free parameters is smaller, I have a smaller VC dimension, and I stand a better chance of generalizing; maybe I won't overfit. Now, this is true, and the argument is simply that fewer weights lead to a smaller VC dimension. There is a version of it that lends itself to regularization, called soft weight elimination. So I'm not going to go combinatorially deciding whether to kill this weight or that weight; you can see that would be a nightmare in terms of optimization. Instead, I apply something to a continuous function that, more or less, ends up emphasizing some of the weights and killing the others. So here is the regularizer in this case. It looks awfully similar to weight decay.
If all I had were the w squared in the numerator, and the denominator weren't sitting there in anticipation of what is about to happen, this would be just weight decay. But the actual form is this fellow, with beta squared plus w squared in the denominator. So what does this do? For very small w, beta squared dominates the denominator, so you end up with something proportional to w squared: for very small weights, you are doing weight decay. For very large w, the w squared dominates, so the term is basically one, close to one, and there is almost nothing to gain by changing the weight. So in this case, big weights are left alone, and small weights are pushed toward zero. You end up, after doing the optimization, clustering the weights into two groups: serious weights, which carry real value, and other weights that have been pushed toward zero, which you can consider eliminated, albeit softly. And that's the corresponding notion. Early stopping, which we alluded to last time, is a form of regularizer, and it's an interesting one. So remember this: we were training on E_in, no augmentation, nothing, just the in-sample error, and we realized, by looking at an estimate of the out-of-sample error, that it's a good idea to stop before you get to the end. So this is a form of regularization, but a funny one: it works through the optimizer. You are not changing the objective function; you hand the objective function, the in-sample error, to the optimizer and say, please minimize this, and by the way, could you please not do a great job? Because if you do a great job, I'm in trouble. It's a funny situation. It's not a problem for early stopping itself, because the way we choose when to stop is principled: we are going to use validation to determine that point.
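Going back to soft weight elimination for a moment, its two regimes can be sketched directly (the function name and the test values are my own):

```python
import numpy as np

def soft_weight_elimination(w, beta):
    """Soft weight-elimination regularizer (sketch):
    Omega(w) = sum_i w_i^2 / (beta^2 + w_i^2).
    For |w_i| << beta each term is ~ (w_i / beta)^2, i.e. weight decay,
    pushing the weight toward zero. For |w_i| >> beta the term saturates
    near 1, so the penalty barely changes and big weights are left alone."""
    return np.sum(w**2 / (beta**2 + w**2))

# Hypothetical values with beta = 1: a serious weight saturates the
# penalty near 1, while a tiny weight contributes almost nothing.
print(soft_weight_elimination(np.array([10.0]), beta=1.0))
print(soft_weight_elimination(np.array([0.1]), beta=1.0))
```

Minimizing this alongside E_in is what clusters the weights into the two groups described above: serious weights and softly eliminated ones.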
But some people get carried away and think, OK, maybe we can always put the regularizer in the optimizer and just do a sloppy job of optimization, thus regularizing the thing. Oh, wait a minute, maybe local minima are a blessing in disguise: they force us to stop short of the global minimum, and therefore they're a great regularizer. OK, guys. Heuristic is heuristic, but we are still scientists and engineers. Separate the concerns: put what you consider the right thing to minimize in the proper objective function, in this case the augmented error, and after that, let the optimizer go all the way in optimizing it. With the wishy-washy approach, I have no idea how well things will work. But if we capture as much as we can in the objective function, and we know that we really want to minimize it, then we have a principled way of doing it, and we'll get what we want. OK. The final slide is the optimal lambda, which is a good lead-in to the next lecture. What I'm going to show you is the choice of the optimal lambda in the big experiment I did last time. Last time, we had overfitting in the situation with the colorful graphs. Now I'm applying regularization there, using weight decay, and asking myself: what is the best lambda for different levels of noise? So you look here: I'm applying regularization, this axis is lambda, it's the same regularizer, and I'm changing the emphasis on it, and the curves show the resulting expected out-of-sample error. When there is no noise, guess what? Regularization is not indicated: you just put lambda equals 0, and you are fine; there's no overfitting to begin with. As you increase the level of noise, as you see here, first you need regularization: the minimum occurs at a lambda that is not 0, which means you actually need regularization.
And the end result is worse anyway, but the end result has to be worse, because there is noise. The expected out-of-sample error will have to have that level of noise in it, even if I fit the target perfectly. And as I increase the noise, I need more regularization; the best value of lambda gets bigger. This is very, very intuitive. And if we can determine this value using validation, then we have a very good thing. Now, instead of getting this, which was horrible overfitting, I am getting this, the best possible performance given the circumstances. Now, this happens to be for stochastic noise. Out of curiosity, let's see what the situation would be if we were talking about deterministic noise. And when you plot deterministic noise, well, you could have fooled me. Now, I'm not increasing sigma squared; I'm increasing the complexity of this guy, the complexity of the target, and therefore I'm increasing the deterministic noise. Exactly the same behavior. Again, if I have this, I don't need any regularization. As I increase the deterministic noise, I need more regularization, the lambda is bigger, and I end up with worse performance. And if you look at these two, that should seal the correspondence in your mind: as far as overfitting and its cures are concerned, deterministic noise behaves almost exactly as if it were stochastic noise. I will stop here, and we'll take questions after a short break. Let's start the Q&A. So the first question is: when the regularization parameter lambda is chosen according to the data, does that mean we're doing data snooping? OK. So if we were using the same data for training as for choosing the regularization parameter, that would be bad news. And I mean, it's snooping, but it's so clear that I wouldn't even call it snooping. It's blatant in this case.
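The principled way to pick lambda, as the answer goes on to say, is validation. A hedged sketch of a grid search over lambda on a held-out split (the function name, the grid, and the synthetic data are all illustrative):

```python
import numpy as np

def choose_lambda(Z_train, y_train, Z_val, y_val, lambdas):
    """Grid-search the weight-decay parameter: fit on the training split,
    score on the held-out validation split, and keep the lambda with the
    lowest validation squared error."""
    best_lam, best_err = None, np.inf
    d = Z_train.shape[1]
    for lam in lambdas:
        w = np.linalg.solve(Z_train.T @ Z_train + lam * np.eye(d),
                            Z_train.T @ y_train)
        err = np.mean((Z_val @ w - y_val) ** 2)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err

# synthetic data, standing in for the lecture's experiment
rng = np.random.default_rng(1)
Z = rng.normal(size=(30, 10))
y = Z @ rng.normal(size=10) + rng.normal(scale=0.5, size=30)
grid = [0.0, 0.01, 0.1, 1.0]
best_lam, best_err = choose_lambda(Z[:20], y[:20], Z[20:], y[20:], grid)
```

Note that only the validation split touches the choice of lambda; the training split never does, which is the controlled use of the data referred to in the answer.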
So the reality is that we determined this using validation, which is a very controlled form of using the data, and we will discuss the subject completely from beginning to end in the next lecture. So there will be a way to deal with that. Would there be a case where you use different types of regularization at the same time? So sometimes you use a combination of regularizers with two different parameters, depending on performance. As I mentioned, it is an experimental activity more than a completely principled activity. There are guidelines, and there are regularizers that stood the test of time. And you can look at the problem and realize that, OK, I'd better use these two regularizers, because they behave differently in different parts of the space or something of that sort, and then decide to have a combination. So in the examples you were using, the Legendre polynomials, the orthogonal functions. Was there any reason for these, or can you choose other functions? So they give me a level of generality, which is pretty interesting, and the solution is very simple. So it's the analytic appeal that got me into this. The typical situation in machine learning is somewhere between theory and practice, and it really has very strong grounding in both. So the way to use theory is this: you cannot really model every situation in such a way that you can get a closed-form solution; you are far from that. What you do is take an idealized situation, but a situation as general as you can get it. With polynomials, you can do a lot of things. So because I can get the solution in this case, when I look at the form of the solution, I may be able to read off some intuitive properties that I can extrapolate and apply, as a leap of faith, to situations where my assumptions don't hold. And in this case, after getting this, we had a specific form for weight decay. And when we look at the performance, we realize that smoothness is a good criterion.
And then we look for smoothness or simplicity. And we interpret that smoothness is actually good because of the properties of noise, and so on. So there is a formal part where we can develop it completely and try to make it as general as possible while mathematically tractable. But then we try to see if the lessons learned from the solution that we got analytically can apply, in a heuristic way, to a situation where we don't have the full mathematical benefit because the assumptions don't hold. Could noise be an indicator of missing input? So missing input is a big deal in machine learning. Sometimes you are missing some attributes of the input and whatnot, and it can be treated in a number of ways, one of which is to treat it as if it's noise. But missing inputs are sufficiently well defined that they are treated with their own methodology rather than as generic noise. How do you trade off choosing more features in your transformation against regularization? Yeah, it's a good question. This is a question that we addressed even before we heard of overfitting and regularization, and it was a question of generalization: what is the dimensionality that we can afford given the data resources? What regularization adds to the equation is that maybe you can afford a little bit of a bigger dimension, provided that you do the proper regularization. So again, instead of having discrete steps, going from this hypothesis set to this hypothesis set to this hypothesis set, let me try to find a continuum such that, by validation or by other methods, I will be able to find a sweet spot where I get the best performance, and the best performance could lie between two of the discrete steps. So in this case, I couldn't initially afford to go to the bigger hypothesis set, because if I go for it unconstrained, the generalization just kills me.
But now what I'm going to do is go to it anyway and apply regularization. So I go there, and then I track back in continuous steps using regularization, and I may end up in a situation that I can afford, one that wasn't accessible to me without regularization, because it didn't belong to the discrete grid that I used to work in. OK, when regularization is done, will it depend on the data set that you use for training? OK, so the regularization is a term added. So there is no explicit dependency of the regularization on the data set. The data set goes into the in-sample error. The regularization is a property of the hypothesis that, in the examples we gave, was independent of the inputs. The dependency comes from the fact that the optimal parameter lambda does depend on the training set. But I said that we are not going to worry about that analytically, because when all is said and done, lambda will be determined by validation, so it will inherit the dependency on the data just because of that. OK, I think that's it. That's good. So we'll see you next week.