The following program is brought to you by Caltech. Welcome back. Last time we introduced neural networks, and we started with multilayer perceptrons. The idea is to combine perceptrons using logical operations like ORs and ANDs in order to implement more sophisticated boundaries than the simple linear boundary of a perceptron. We took a final example where we were trying to implement a circular boundary, and we realized that we can actually do this, at least approximately, if we have a sufficient number of perceptrons. And we convinced ourselves that combining perceptrons in a layered fashion will be able to implement more interesting functionalities. Then we faced the simple problem that, even for a single perceptron, when the data is not linearly separable, finding the boundary based on data is a pretty difficult optimization problem; it's combinatorial optimization. Therefore it is next to hopeless to try to do that for a network of perceptrons. And so we introduced neural networks, which came in as a way of having a nice algorithm for multilayer perceptrons by simply softening the threshold: instead of jumping from minus 1 to plus 1, it goes from minus 1 to plus 1 gradually, using a sigmoid function, in this case the tanh. If the signal, which is given by this quantity, the usual signal that goes into the perceptron, is large, large negative or large positive, the tanh approximates minus 1 or plus 1, so we get the decision function we want. And if s is very small, this is almost linear: tanh of s is approximately s. The most important aspect is that it's differentiable, a smooth function, and therefore the dependency of the error in the output on the parameters w_ij of each layer l will be a well-behaved function for which we can apply things like gradient descent. And the neural network looks like this.
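The soft threshold just described can be sketched in a few lines; the specific input values below are illustrative, not from the lecture.

```python
import numpy as np

# Soft threshold: tanh(s) approaches the hard threshold sign(s) for large |s|,
# and is approximately linear (tanh(s) ~ s) for small |s|.
def soft_threshold(s):
    return np.tanh(s)

# Large positive / negative signals saturate near +1 / -1.
print(soft_threshold(5.0))    # close to +1
print(soft_threshold(-5.0))   # close to -1

# Small signals: nearly linear, tanh(s) is close to s itself.
print(soft_threshold(0.01))   # close to 0.01
```

Because tanh is smooth, its derivative exists everywhere, which is what makes gradient descent applicable.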
So it starts with the input, followed by a bunch of hidden layers, followed by the output layer. We spent some time arguing about the function of the hidden layers, how they transform the inputs into a particularly useful nonlinear transformation as far as implementing the output is concerned, and the question of interpretation. Then we introduced the backpropagation algorithm, which applies stochastic gradient descent to neural networks. Very simply, it decides on the move along every coordinate in the w space using the very simple rule of gradient descent. In this case, you only need two quantities. One of them is x, computed using this formula, the forward formula, so to speak, going from layer l minus 1 to layer l. And then there is another quantity we defined, called delta, that is computed backwards: you start from layer l and go to layer l minus 1. The formula is strikingly similar to the forward formula, but instead of applying the nonlinearity, you multiply by a factor involving its derivative. Once you get all the deltas and x's by a forward and a backward run, you can simply decide on the move of every weight according to this very simple formula that involves the x's and the deltas. The simplicity of the backpropagation algorithm and its efficiency are the reasons why neural networks have been very popular as a standard tool for implementing functions that need machine learning in industry for quite some time now. Today I'm going to start a completely new topic. It's called overfitting. And it will take us three full lectures to cover overfitting and the techniques that go with it. The techniques are very important because they apply to almost any machine learning problem that you are going to see, and they are applied on top of any algorithm or model you use. So you can use neural networks or linear models, et cetera.
But the techniques that we are going to see here, which are regularization and validation, apply to all of these models. So this is another layer of techniques for machine learning. Overfitting is a very important topic, and it is fair to say that the ability to deal with overfitting is what separates professionals from amateurs in machine learning. Everybody can fit, but if you know what overfitting is and how to deal with it, then you have an edge that someone who doesn't know the fundamentals will not have. So the outline today: first we are going to start with the notion itself, what is overfitting. Then we are going to identify the main culprit for overfitting, which is noise. After observing an experiment, we will realize that noise covers more territory than we thought. There is actually another type of noise, which we are going to call deterministic noise. It's a novel notion that is very important for overfitting in machine learning, and we are going to talk about it a little bit. Then, very briefly, I'm going to give you a glimpse into the next two lectures by telling you how to deal with overfitting. Having diagnosed what the problem is, we will be ready to go for the cures: regularization next time, and validation the time after that. OK. So let's start by illustrating the situation where overfitting occurs. Let's say we have a simple target function. Let's take it to be a second-order target function, a parabola. So my input space is the real numbers; I have only a scalar input x, and there is a value y, and I have this target that is second-order. OK. We are going to generate five data points from that target in order to learn from. This is an illustration. So let's look at the five data points. As you see, the data points look like they belong to the curve, but they don't seem to belong perfectly to the curve. So there must be noise.
So this is a noisy case, where the target itself, the deterministic part of the target, is a function, and then there is added noise. It's not a lot of noise, obviously, a very small amount, but nonetheless it will affect the outcome. OK. So we do have a noisy target in this case. Now, if I just told you that you have five points, which is the case you face when you learn, the target disappears. I have five points and you want to fit them. Going back to your math, you realize: OK, I want to fit five points; maybe I should use a fourth-order polynomial. That will do it, right? We have five parameters. So let's fit it with a fourth-order polynomial. This is the guy who doesn't know machine learning, by the way. So you say, OK, I'm going to use a fourth-order polynomial. And what will the fit look like? Perfect fit in sample. And you measure your quantities. The first quantity is E in. Success. We achieved zero training error. And then when you go for the out-of-sample, you are comparing the red curve to the blue curve, and the news is not good. I'm not going to even calculate it; it's just huge. Now, this is a familiar situation for us, and we know what the deal is. The point I want to make here is that when you say overfitting, overfitting is a comparative term. It must be that one situation is worse than another: you went further than you should. And there is a distinction between overfitting and just bad generalization. The reason I'm calling this overfitting is that if you use, let's say, a third-order polynomial, you will not be able to achieve zero training error in general, but you will get a better E out. Therefore, the overfitting here happened by using the fourth order instead of the third order. You went further. That's the key. That point is made even more clearly when you talk about neural networks and overfitting within the same model.
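The situation just described can be reproduced in a few lines; the particular parabola, noise level, and test grid below are illustrative assumptions.

```python
import numpy as np

# Sketch: 5 noisy points from a parabola, fit with a fourth-order polynomial.
rng = np.random.default_rng(1)
f = lambda x: x ** 2                      # a second-order target
x_train = np.linspace(-1, 1, 5)
y_train = f(x_train) + rng.normal(scale=0.1, size=5)

coef4 = np.polyfit(x_train, y_train, 4)   # 5 parameters, 5 points: exact fit
E_in = np.mean((np.polyval(coef4, x_train) - y_train) ** 2)

x_test = np.linspace(-1, 1, 1000)
E_out = np.mean((np.polyval(coef4, x_test) - f(x_test)) ** 2)
print(E_in, E_out)                        # E_in is essentially zero; E_out is not
```

The fit interpolates the five points perfectly, noise and all, which is exactly why it strays from the underlying parabola out of sample.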
In the case of overfitting with the third-order polynomial versus the fourth-order polynomial, you are comparing two models. Here, I'm going to take just neural networks, and I'll show you how overfitting can occur within the same model. So let's say you have a neural network, and it is fitting noisy data. That's a typical situation. You run your back propagation algorithm for a number of epochs, and you plot what happens to E in, and you get this curve. Can you see this curve at all? Let me try to magnify it, hoping that it will become clearer. OK, a little bit better. So this is the number of epochs. You start from an initial condition, a random weight vector, then you run stochastic gradient descent, evaluate the total E in at the end of every epoch, and plot it, and it goes down. It doesn't go to zero; the data is noisy, and you don't have enough parameters to fit it perfectly. But this looks like a typical situation where E in goes down. Now, because this is an experiment, you have set aside a test set that you did not use in training. And what you are going to do is take this test set and evaluate what happens out of sample. Not only at the end, but as you go, just to see: as I train, am I making progress out of sample or not? You are definitely making progress in sample. So you plot the out-of-sample error, and this is what you get, estimated by the test set. Now, there are many things you can say about this curve. One of them is that in the beginning, when you start with a random weight vector, in spite of the fact that you are using a full neural network, at this point you have only one hypothesis that does not depend on the data set: the random weights that you got. So it's not a surprise that E in and E out are about the same value here, because they are floating around.
As you go down the road and start exploring the weight space by going from one iteration to the next, you are exploring more and more of the space of weights. So you are getting the benefit, or the harm, of having the full neural network model gradually. In the beginning here, you are only exploring a small part of the space, et cetera. So if you can think of an effective VC dimension as you go, if you can define that, then there is an effective VC dimension that is growing with time until, after you have explored the whole space, or at least potentially explored the whole space if you had different data sets, the effective VC dimension will be the total number of free parameters in the model. So the generalization error, which is the difference between the red and green curves, is getting worse and worse. That's not a surprise. But there is an important point here, which happens around here. So let me now shrink this back, now that you know where the curves are, and let's look at where overfitting occurs. Overfitting occurs when you knock down E in, so you get a smaller E in, but E out goes up. If you look at these curves, you realize that this is happening around here. Now, there is very little difference in generalization error just before this boundary and just after it. Yet I am making a specific distinction that crossing this boundary went into overfitting. Why is that? Because up till here, I can always reduce E in, and in spite of the fact that E out is following suit with very diminishing returns, it's still a good idea to minimize E in, because you are getting a smaller E out. The problems happen when you cross, because now you think you are doing well, you are reducing E in, and you are actually harming the performance. That's what needs to be taken care of. So that's where overfitting occurs.
So in this situation, it might be a very good idea to be able to detect when this happens and simply stop at that point, and report that hypothesis instead of reporting the final hypothesis you would get after all the iterations. Because in this case, you are going to get this E out instead of that E out, which is better. And indeed, the algorithm that goes with that is called early stopping, and it will be based on validation. And although it's based on validation, it really is regularization, in terms of putting on the brakes. So now we can see the relative aspect of overfitting. Overfitting can happen when you compare two things, whether the two things are two different models or two instances within the same model. And we look at this and say that if there is overfitting, we'd better be able to detect it in order to stop earlier than we would otherwise, because otherwise we'll be harming ourselves. So this is the main story. Now let's look at overfitting as a definition, and at the culprit for it. Overfitting as a criterion is the following: fitting the data more than is warranted. And that is a little bit strange. What would be more than is warranted? I mean, we are in machine learning; we are in the business of fitting data. So I can fit the data, and I keep fitting it, but there comes a point where this is no longer good. Why does this happen? What is the culprit? The culprit in this case is that you are actually fitting the noise. The data has noise in it, and you are trying to look at the finite sample that you got and get it right. In trying to get it right, you are inadvertently fitting the noise. Now, it is understood that this is not good. At the very least, it's not useful at all. There is no pattern to detect in the noise, so fitting the noise cannot possibly help me out of sample. However, if it were only just useless, we would be OK.
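The early-stopping idea above can be sketched generically: keep the weights with the best validation error seen so far, rather than the weights after the final iteration. Here `train_step` and `val_error` are hypothetical stand-ins for whatever training loop and validation set you have.

```python
# Sketch of early stopping based on validation.
def early_stopping(w0, train_step, val_error, max_epochs=1000):
    w, best_w, best_e = w0, w0, val_error(w0)
    for _ in range(max_epochs):
        w = train_step(w)          # one epoch of training
        e = val_error(w)           # estimate E_out on the validation set
        if e < best_e:
            best_w, best_e = w, e  # remember the best hypothesis so far
    return best_w                  # report the early-stopped hypothesis
```

The point is that even though training keeps reducing the in-sample error, the reported hypothesis is frozen at the point where the validation estimate of E out bottomed out.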
We wouldn't be having this lecture, because you would think: OK, I am given the data, and the data has the signal and the noise. I cannot distinguish between them; I just get x and get y. y has a component which is signal and a component which is noise, but I get just one number. I cannot distinguish between the two, and I am fitting them, so I am going to fit the noise. Look at it this way: I'm in the business of fitting, I cannot distinguish the two, so fitting the noise is the cost of doing business. If it were just useless, I wasted some effort, but nothing bad happened. The problem really is that it's harmful. It's not a question of being useless, and that's a big difference. Because machine learning is machine learning: if you fit the noise in sample, the learning algorithm detects a pattern. It imagines a pattern and extrapolates it out of sample. So based on the noise, it gives you something out of sample, as if this were the pattern in the data, which it isn't. And that will obviously worsen your out-of-sample performance, because it's taking you away from the correct solution. So you can think of the learning algorithm, when detecting a pattern that doesn't exist, as hallucinating: oh, there's a great pattern, and this is what it looks like, and it reports it. And eventually, obviously, that imaginary thing ends up hurting the performance. So let's look at a case study. The main reason for the case study is that we vaguely understand now that it's a problem of the noise. So let's see: how does the noise affect the situation? Can we get overfitting without noise? What is the deal? I'm going to give you a specific case. I'm going to start with a 10th-order target. A 10th-order target means a 10th-order polynomial. I always work on the real numbers, so the input is a scalar, and I'm defining polynomials based on that. And I'm going to take a 10th-order target. One such target looks like this.
You choose the coefficients somehow, and you get something like that, a fairly elaborate thing. Then you generate data, and the data will be noisy, because we want to investigate the impact of noise on overfitting. Let's say I'm going to generate 15 data points in this case. So this is what you get. You get these points. The noise here is not trivial as it was last time; there is a difference. But obviously, these points are not lying on the curve, so there is noise contributing to that. Now, the other guy, which is a 50th-order target, is noiseless. That is, I'm going to generate a 50th-order polynomial, so it's obviously much more elaborate than the blue curve here, but I'm not going to add noise to it. I'm going to generate also 15 points from this guy. But the 15 points, as you will see, perfectly lie on the curve. So all of them are here: this is the data, this is the target, and the data lies on the target. So these are two interesting cases. One of them is a simple target, so to speak, with added noise, which makes it complicated. This one is complicated in a different way: it's a high-order target to begin with, but there is no noise. So these are the two cases in which I'm going to try to investigate overfitting. We are going to have two different fits for each target. We are in the business of overfitting; we have to have comparative models. So I'm going to have two models to fit every case, and see if I get overfitting here, and if I get it here. So this is the first guy that we saw before, the simple target with noise. And this guy is the other one, the complex target without noise. 10th order, 50th order. We'll just refer to them as the noisy low-order target and the noiseless high-order target. This is what we want to learn. Now, what are we going to learn with? We are going to learn with two models. One of them is the same as before: a second-order polynomial that we are going to use to fit. That's our model.
And we are going to have a 10th-order polynomial. These are the two guys that we're going to use. So here's what happens with the second-order fit. You have the data points, and you fit them. And it's not surprising: the second order is a simple curve, and it tries to find a compromise here. We are applying the mean squared error. So this is what you get. Now let's analyze the performance of this pair. What I'm going to list here, as you see, is the in-sample error and the out-of-sample error, for the second order, which is already here, and for the 10th order, which I haven't shown yet. The in-sample error in this case is 0.05. This is a number; obviously, it depends on the scale. It's some number. When you get the out-of-sample version, not surprisingly, it's bigger, because the one fit the data and the other is out of sample, so it's going to be bigger. But the difference is not dramatic. And this is the performance you get. Now let's apply the 10th-order fit. You already foresee what problem can arise here. The red curve sees the data, tries to fit it, and uses all the degrees of freedom it has; it has 11 of them. And then it gets this guy. When you look at the in-sample error, obviously the in-sample error must be smaller than the in-sample error here: you have more to fit with, and you fit it better. So you get a smaller in-sample error. And what is the out-of-sample error? Just terrible. So this is patently a case of overfitting. When you went from second order to tenth order, the in-sample error indeed went down. The out-of-sample error went up, way up. So you say, OK, this confirms what we have said before: we are fitting the noise, et cetera. And you can see here that you are actually fitting the noise. You can see the red curve trying to go for this guy, and you know that these guys are off the target. Therefore, the red curve is bending particularly in order to capture something that is really noise.
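The pair of fits just discussed can be sketched directly; the target below uses random coefficients in the power basis rather than the lecture's construction, and the noise level is an assumption, so the numbers are illustrative only.

```python
import numpy as np

# Sketch: 15 noisy points from a 10th-order target, fit by H2 and H10.
rng = np.random.default_rng(6)
f = np.polynomial.Polynomial(rng.normal(size=11))   # a 10th-order target
x = rng.uniform(-1, 1, 15)
y = f(x) + rng.normal(scale=0.3, size=15)           # added noise

g2 = np.polynomial.Polynomial.fit(x, y, 2)          # second-order fit
g10 = np.polynomial.Polynomial.fit(x, y, 10)        # tenth-order fit

E_in = lambda g: np.mean((g(x) - y) ** 2)
x_test = np.linspace(-1, 1, 1000)
E_out = lambda g: np.mean((g(x_test) - f(x_test)) ** 2)

# H10 always achieves the smaller in-sample error; out of sample, on this
# little data, it typically does far worse.
print(E_in(g2), E_out(g2))
print(E_in(g10), E_out(g10))
```

Since H2 is contained in H10, the tenth-order in-sample error can never be larger, which is the "you have more to fit with" statement in code.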
So this is that case. Here it's a little bit strange, because here we don't have any noise. And we are going to take the same two models: second order and tenth order fitting here. So let's see how they perform. Well, this is the second-order fit. Again, that's what you expect from a second-order fit. You look at the in-sample error and out-of-sample error, and they are OK. Ballpark fine. You get some error, and the other one is bigger than it. Now we go for the tenth order, which is the interesting one. So this is the tenth order. Now you need to remember that the tenth order is fitting a fiftieth order, so it really doesn't have enough parameters to fit it if we had all the glory of the target function in front of us. But we don't have all the glory of the target function; we have only 15 points. So it does as good a job as possible at fitting. When we look at the in-sample error, the in-sample error is definitely smaller than here, because we have more parameters. It's actually extremely small; it did really, really well. And then when you go for the out-of-sample error, oh no. You see, this is squared error, so these excursions, when you go down here and when you go up there, kill you. And indeed they did. So this is overfitting galore. Now you ask yourself: OK, you just told us about noise and no noise. This is noiseless, right? Why did we get overfitting here? And we will find out that the reason we are getting overfitting here is that this guy actually has noise. But it's not your usual noise; it's another type of noise. And getting that notion down is very important to understand the situations in practice where you are going to get overfitting. You could be facing a situation that is completely noiseless in the conventional sense, and yet there is overfitting, because you are fitting another type of noise. So let's look at the irony in this example. Here is the first example, the noisy simple target.
You are learning a 10th-order target, and the target is noisy. I'm not showing the target here; I'm showing the data points together with the two fits. Now let's say that I tell you that the target is 10th-order. And you have two learners. One of them is O, and one of them is R. O for overfitting, and R for restricted, as it turns out. And you tell them: guys, I'm not going to tell you what the target is, because if I tell you what the target is, this is no longer machine learning. But let me help you out a little bit. The target is a 10th-order polynomial, and I'm going to give you 15 points. Choose your model. Fair enough; the information given does not depend on the data set, so it's a fair thing. So the first learner says: OK, I know that the target is 10th-order. Why not pick a 10th-order model? Sounds like a good idea. And they do this, and they get the red curve, and they cry, and cry, and cry. The other guy says: oh, it's a 10th-order target. Who cares? How many points do you have? 15. OK, 15. I'm going to take a second order, and I'm actually pushing my luck, because second order is three parameters and I have 15 points. The ratio is five. Someone told us a rule of thumb that it should be 10. I'm flirting with danger, but I cannot use a line when you're telling me the thing is 10th-order, so let me try my luck with second order. That's what they do. And they win. So it's a rather interesting irony, because there is a thought in people's minds that you should try to get as much information about the target function as possible and put it in the hypothesis set. In some sense, this is true for certain properties. But if you are matching the complexity, here's the guy who took the 10th-order target and decided to put that information in all too well, choosing a 10th-order hypothesis set, and lost. So again, we know it all too well now: you match the data resources rather than the target complexity.
There will be other properties of target functions that we will take to heart, symmetry and whatnot; there are a bunch of hints that we can take. But the question of complexity is not one of the things where you just apply the general idea of matching the target function. That's not the case. In this case, you are looking at generalization issues, and you know the generalization issues depend on the size and the quality of the data set. Now, the examples that I just gave you, we have seen before, when we introduced learning curves, if you remember what those were. Those were plots of how E in and E out change with the number of examples. I gave you something then and told you that this is an actual situation we'll see later, and this is that situation. So this is the case where you take the second-order polynomial model, h2, and the inevitable error, which is the black line, comes now not only from the limitations of the model, the inability of a second order to replicate a 10th order, which is the target in this case, but also because there is noise added. Therefore, there is an amount of error that is inevitable, because of the noise and because the model is very limited. The generalization, which is the difference between the two curves, is not bad. And if you have more examples, the two curves will converge, as they always do, but they converge to the inevitable amount of error, which is dictated by the fact that you are using such a simple model in this case. And when we looked at the other case, also introduced before, this was the 10th-order fellow. The 10th-order fellow can fit a lot, so the in-sample error is always smaller than here; that is understood. The out-of-sample error starts by being terrible, because you are overfitting. And then it goes down, and it converges to something that is better, because that reflects the ability of h10 to approximate a 10th order, which should be perfect, except that we have noise.
So all of this is actually due to the noise added to the examples. And the gray area is the interesting part for us. Because in the gray area, the in-sample error for the more complex model is smaller. I mean, it's smaller always, but we are observing it in this case. And the out-of-sample error is bigger. That's what defines the gray area. Therefore, in this gray area, very specifically, overfitting is happening. If you move from the simpler model to the bigger model, you get better in-sample error and worse out-of-sample error. Now you realize that this guy is not going to lose forever. The guy who chose the correct complexity is not going to lose forever. They lost only because the number of examples was inadequate. If the number of examples is adequate, they will win handsomely. Like here: if you look here, you end up with an out-of-sample error far better than you will ever get there. But now I have enough examples in order to be able to do that. So now we understand overfitting, and we understand that overfitting will not happen for all numbers of examples, but for a small number of examples, where you cannot pin down the function, and then you suffer from the usual bad generalization that we saw. Now, we noticed that we get overfitting even without noise, and we want to pin that down a little bit. So let's look at this case. This is the case of the 50th-order target, the higher-order target, which doesn't have any noise, conventional noise at least. And these are the two fits. And there is still an irony, because here are the two learners. The first guy chose the 10th order; the second guy chose the second order. And the idea here is the following. You told me that the target doesn't have noise, right? That means I don't have to worry about overfitting. Wrong, but we will see why. So given the choices, I'm going to try to get close to the 50th order, because I have a better chance.
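The learning curves described above can be sketched by averaging over many data sets; the random power-basis target, noise level, and run count below are assumptions for illustration, not the lecture's exact setup.

```python
import numpy as np

# Sketch of learning curves: average E_in and E_out versus N for the
# 10th-order model on a noisy 10th-order target.
rng = np.random.default_rng(4)
f = np.polynomial.Polynomial(rng.normal(size=11))   # fixed 10th-order target
x_test = np.linspace(-1, 1, 500)

def avg_errors(degree, N, runs=50, sigma=0.3):
    e_in, e_out = 0.0, 0.0
    for _ in range(runs):
        x = rng.uniform(-1, 1, N)
        y = f(x) + rng.normal(scale=sigma, size=N)
        g = np.polynomial.Polynomial.fit(x, y, degree)
        e_in += np.mean((g(x) - y) ** 2) / runs
        e_out += np.mean((g(x_test) - f(x_test)) ** 2) / runs
    return e_in, e_out

for N in (15, 50, 200):
    print(N, avg_errors(10, N))   # E_in rises toward sigma^2, E_out falls
```

With few examples the tenth-order model overfits badly, and with enough examples it wins, which is the crossing behavior the gray area captures.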
If I choose the 10th order and somebody else chooses the second order, I'm closest to the 50th, so I think I will perform better. That's the concept. So you do this, and you know that there is no noise, so you decide on this idea. And again, you get bad performance. And you say to yourself, this is not my day. I tried everything, and I seem to be making the wise choice, and I'm always losing. Why is this the case when there is no noise? And then you ask: is there really no noise? That will lead us to defining an actual noise in this case, and we'll analyze it and understand what it is about. So I will take these two examples and then make a very elaborate experiment, and I will show you the results of that experiment. And I encourage you, if you are interested in the subject, to simulate this experiment; all the parameters are given. It will give you a very good feel for overfitting, because now we are going to look at the figure and have no doubt in our mind that overfitting will occur whenever you actually encounter a real problem, and therefore you have to be careful. It's not that I constructed a particular funny case, et cetera. No: if you average over a huge number of experiments, you will find that overfitting occurs in the majority of the cases. So let's look at the detailed experiment. I'm going to study the impact of two things. The noise level, which I have already conceptually convinced myself is related to overfitting. And the target complexity, just because it does seem to be related. Not sure why, but it seems that when I took a complex target, albeit noiseless, I still got overfitting. So let me see what the target complexity does. We are going to take a general target function; I'm going to describe what it is. And I'm going to add noise to it. The noise is a function of x, so I'm describing it generically.
And as always, we have independence from one x to another, so in spite of the fact that the parameters of the noise distribution depend on x, I can have different noise for different points in the space. The realization of epsilon is independent from one x to another. That is always the assumption: when we have different data points, they are independent. So this is the setup. And I'm going to measure the level of noise by the energy in that noise; I'm going to call it sigma squared. I'm taking the expected value of epsilon to be 0. If there were a nonzero expected value, I would put it in the target, so I remain with 0. And then there is fluctuation around it, and the fluctuation could be big or small: large sigma squared or small. I'm quantifying it with sigma squared. No particular distribution is needed. You can say Gaussian or not, and indeed I applied Gaussian in the experiment, but for the statement you just need the energy. So now let's write it down. Now I want to make the target function more complex at will, so I'm going to make it a higher-order polynomial. So now I have another parameter, pretty much like the sigma squared. This parameter is capital Q; I'm calling it capital Q sub f, because it describes the target complexity of f, just to remember that it's related to f. And what I do is define a polynomial, which is a sum of coefficients times powers of x, for q from 0 up to capital Q sub f. So it's indeed a Q_f-th-order polynomial. And I add the noise here. Now, in order to run the experiment right, I'm going to normalize this quantity such that its energy is always 1. And the reason I do that is that I want the sigma squared to mean something; it's really the signal-to-noise ratio that means something. So if I normalize the signal to energy 1, then I can say sigma squared is really the amount of noise.
If you look at this, it is not easy to generate interesting polynomials using this formula if you just pick these coefficients at random, let's say independently, in order to generate a general target. The powers of x themselves are OK: you start with x, and then the parabola, and then the third order, and then the fourth order, and then the fifth order. They are very, very boring guys. One of them is going this way, and the other one is going that way, and they get steeper and steeper. So if you combine them with random coefficients, you will almost always get something that looks this way or something that looks that way, and the other terms don't play a role, because this one dominates. So the way to get interesting targets here is, instead of generating the alphas at random, you go for a standard set of polynomials, which are called Legendre polynomials. Legendre polynomials are just polynomials with specific coefficients. There's nothing mysterious about them, except that the choice of the coefficients is such that, from one order to the next, they are orthogonal to each other. So it's like harmonics in a sinusoidal expansion. If you take the first-order Legendre, and the second, and the third, and the fourth, and you take the inner product between any two of them, it is 0. They are orthogonal to each other, and you normalize them to get energy 1. Because of this, if you take a combination of Legendres with random coefficients, then you get something interesting. All of a sudden you get shape. And when you are done, it is still just a polynomial. All you do is collect what happen to be the coefficients of x, the coefficients of x squared, the coefficients of x cubed, and these will be your alphas. So nothing has changed in the fact that I'm generating a polynomial; I just generated the alphas in a very elaborate way in order to make sure that I get interesting targets. That's all there is to it.
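The construction above can be sketched with NumPy's Legendre utilities; normalizing by averaging the squared value over a uniform grid on [-1, 1] is an assumption about how the unit-energy condition is enforced.

```python
import numpy as np

# Sketch: generate a Q_f-th-order target as a random combination of
# Legendre polynomials, then normalize it to unit energy on [-1, 1].
rng = np.random.default_rng(2)
Qf = 10
coeffs = rng.normal(size=Qf + 1)          # random Legendre coefficients

def target(x):
    # Evaluate the Legendre series; it is still just a polynomial in x.
    return np.polynomial.legendre.legval(x, coeffs)

# Normalize so the average squared value over uniform x on [-1, 1] is 1,
# which makes sigma^2 directly comparable to the signal energy.
x_grid = np.linspace(-1, 1, 10001)
coeffs /= np.sqrt(np.mean(target(x_grid) ** 2))
```

Because the Legendre basis is orthogonal, a random combination produces varied shapes instead of being dominated by the highest power.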
As far as we are concerned, we generated targets of this form that happen to be interesting, representative of different functionalities. So in this case, we have the noise level; that's one parameter that affects overfitting. We have the target complexity, which seems to affect overfitting, at least we are conjecturing that it does. And the final parameter that affects overfitting is the number of data points: if I give you more data points, you are less susceptible to overfitting. So now I'd like to understand the dependency between these. If we go back to the experiment we had, this is just one instance, where the target complexity Q_f is 10, a 10th-order polynomial. The noise is whatever the distance between the points and the curve is; that's what sigma squared captures. And the data size here is 15; I have 15 data points. So this is one instance, and I'm going to generate random instances of that at will in order to see if the observation of overfitting persists. So how am I going to measure overfitting? I'm going to define an overfit measure, a pretty simple one. We are fitting a data set from (x1, y1) to (xN, yN), and we are using our usual two models, nothing changed: either second-order polynomials or 10th-order polynomials. And if going from the second-order polynomial to the 10th-order polynomial gets us into trouble, then we are overfitting, and we'd like to quantify that. So you compare the out-of-sample errors of the two models: you have a final hypothesis from H2, the fit you saw as the green curve, and another final hypothesis from the other model, which is the red curve, the wiggly guy. And the overfit measure based on the two is the out-of-sample error of the more complex guy minus the out-of-sample error of the simpler guy. Why is this an overfit measure?
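The overfit measure can be sketched like this; in a simulation we know the target, so E_out can be estimated on fresh test points (the function name, uniform test inputs, and test-set size are assumptions of this sketch):

```python
import numpy as np

def overfit_measure(x_train, y_train, f, rng, n_test=10_000):
    """Fit H2 (2nd order) and H10 (10th order) by least squares and return
    E_out(g10) - E_out(g2), estimated on fresh test points scored against
    the noiseless target f. Positive means the complex model overfits.
    """
    x_test = rng.uniform(-1, 1, n_test)
    y_test = f(x_test)
    e_out = {}
    for Q in (2, 10):
        coeffs = np.polynomial.polynomial.polyfit(x_train, y_train, Q)
        pred = np.polynomial.polynomial.polyval(x_test, coeffs)
        e_out[Q] = np.mean((pred - y_test) ** 2)
    return e_out[10] - e_out[2]
```

As a sanity check: on a clean quadratic target with plenty of points, both models recover the target and the measure is essentially zero; with few noisy points it typically comes out positive.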
Because if the more complex guy is worse, its out-of-sample error is bigger, and you get a positive number, a large positive number if the overfitting is terrible. And if this is negative, it means the more complex guy is actually doing better, so there is no overfitting. Zero means they are the same. So now I have a number in mind that measures the level of overfitting in a particular setting. And if you apply this to the same case we had before, look here: the out-of-sample error for the red curve is terrible, and the out-of-sample error of the green curve is nothing to be proud of, but definitely better. So the overfit measure in this case will be positive, and we have overfitting. So now let's look at the result of running this for tens of millions of iterations. Each iteration is a complete run: generate a target, generate the data set, fit both models, look at the overfit measure. Repeat millions of times for all kinds of parameters, so you get a pattern for what is going on. So this is what you get. First, the impact of sigma squared. I'm going to have a plot whose axes are N, the number of examples, and the level of noise, sigma squared. And on the plot, I'm going to give a color depending on the intensity of the overfit, which will depend on the number of points and the level of the noise. So first, the color convention: 0 is green. Redder means more overfitting; bluer means less overfitting. Now for the number of examples, I picked the interesting range, from 80 through 100 to 120 points. What happens below that, say at 40 points? Everything is dark red, terrible overfitting. And if you go beyond this range, you have enough examples not to overfit, so it's almost all blue. So I'm just giving you the transition part. You look at it: as I increase the noise level, overfitting worsens. Why is that?
Because if I pick any number of examples, say 100: with 100 points and very little noise, I am doing fine in terms of not overfitting. And as the noise grows, I get into the red region, and then deeply into the red region. So this tells me, indeed, that overfitting worsens with sigma squared. By the way, for all of the targets here, I picked a fixed complexity, Q_f = 20, a 20th-order polynomial. I fixed it because I just wanted a number, and I wanted only to relate the noise to the overfitting. When I change the complexity, that will be the other plot. So for this plot, we get something that is nice and really according to what we expect: as you increase the number of points, the overfitting goes down; as you increase the level of noise, the overfitting goes up. That is what we expect. So now let's go for the impact of Q_f, because that was the mysterious part. There was no noise, and we were still getting overfitting. Is this going to persist? What is the deal? This is what you get. Here, we fixed the level of noise at sigma squared equals 0.1, and we are increasing the target complexity from trivial up to a 100th-order polynomial. That's a pretty serious guy. And we are plotting the same range for the number of points, from 80 through 100 to 120. That's where it happens. And you can see that overfitting occurs significantly, and it worsens with the target complexity: if you fix the number of examples and move up in complexity, you start in the green, then it gets red, and then dark red. Not as pronounced as in the noise case, but you do get the overfitting effect by increasing the target complexity. And when the number of examples is bigger, there is less overfitting, as you expect. But if you go high enough in complexity, it gets lighter, blue, green, yellow, and eventually it will get to red. And if you look at these two plots, the main observation is that the red region is serious.
Overfitting is real, it's here to stay, and we have to deal with it. It's not just an isolated case. Now there are two things you can derive from these two figures. The first is that there seems to be another factor, other than conventional noise, let's call it conventional noise for the moment, that affects overfitting, and we want to characterize it. The second thing we derive is a nice logo for the course. That's where it came from. So now let's look at noise and its impact, and you notice that noise is between quotation marks here, because now we are going to expand our horizon about what constitutes noise. Here are the two plots. In the first case, we are now going to call the noise stochastic noise. The noise is obviously stochastic; we give it that name because the other guy will not be stochastic. And there is absolutely nothing to add here. This is what we expect; we are just giving it a name. Now, whatever effect comes from having a more complex target, we are also going to call noise. But it is going to be called deterministic noise, because there is nothing stochastic about it. There is a particular target function; I just cannot capture it, so it looks like noise to me. And we would like to understand what deterministic noise is about. So now we speak in terms of stochastic noise and deterministic noise, and we would like to see what affects overfitting. We'll put it in a box. First observation: if I have more points, I have less overfitting. If you move along the N axis in either plot, things get bluer. Second: if I increase the stochastic noise, increase the energy in the stochastic noise, the overfitting goes up. Indeed, moving up the sigma squared axis, things get redder.
And finally, with deterministic noise, which is vaguely associated in my mind with the increase of target complexity, I also increase the overfitting. Moving up the Q_f axis, things get redder. Albeit I have to travel further, and it's a bit more subtle, but the direction is that I get more overfitting as I get more deterministic noise, whatever that might be. So now, let's spend some time analyzing what deterministic noise is and why it affects overfitting the way it does. Let's start with the definition. What is it? It will actually be noise. If I ask what the stochastic noise is, you say: here's my target, and there is something on top of it; that is what I call stochastic noise. So the deterministic noise will be the same thing, except that it captures something deterministic. It's the part of the target that your hypothesis set cannot capture. Let's look at a picture. This is your target, the blue curve. You take a hypothesis set that, let's say, is simple, and you look for the hypothesis that best approximates f, not in the learning sense; you actually try very hard to find the best possible approximation. You are still not going to get f, because your hypothesis set is limited. The best hypothesis will sit there, and it will fail to pick up a certain part of the target. That part is what we are labeling the deterministic noise. And if you think from an operational point of view, if you are that hypothesis, noise is all the same: it's something I cannot capture. Whether I couldn't capture it because there is nothing to capture, as in stochastic noise, or because I am limited in capturing and have to consider it out of my league, both of them are noise as far as I'm concerned, something I cannot deal with. So this is how we define it. And you ask yourself, why are we calling it noise? That's a bit of a philosophical issue.
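A sketch of this definition in code: find the best approximation h* to a target within H_Q (polynomials of order Q) by least squares on a dense grid, and measure the energy of what is left over. The grid-based notion of "best" and the uniform input on [-1, 1] are assumptions of this sketch:

```python
import numpy as np

def deterministic_noise_energy(f, Q, grid=None):
    """Energy of the deterministic noise of hypothesis set H_Q with respect
    to target f on [-1, 1]: fit h* (the best squared-error approximation on
    a dense grid, no learning involved), then measure E_x[(f(x) - h*(x))^2].
    """
    if grid is None:
        grid = np.linspace(-1, 1, 2001)
    fx = f(grid)
    c = np.polynomial.polynomial.polyfit(grid, fx, Q)   # h*, best fit in H_Q
    det_noise = fx - np.polynomial.polynomial.polyval(grid, c)
    return float(np.mean(det_noise ** 2))
```

For example, with the target sin(pi x), H2 leaves a substantial residue while H9 captures almost everything; the deterministic noise depends on the hypothesis set, exactly as discussed below.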
But let's say that you have a young sibling, your kid brother, who has just learned fractions. They used to have just 1, 2, 3, 4, 5, 6; they're not even into negative numbers and whatnot, and now they've learned fractions. And they are very excited: they realize that there is more to numbers than just 1, 2, 3. So you are the big brother, a big Caltech guy, so you must know more about numbers. Tell me more about numbers. Now, in your mind, you probably can explain negative numbers to them a little bit, as a deficit, and real numbers, just intuitively, as continuous. You're not going to tell them about limits or anything like that; they are too young for that. But you probably are not going to tell them about complex numbers, are you? Because their hypothesis set is so limited that complex numbers, for them, would be complete noise. And the problem with explaining something that people cannot capture is that they will create a pattern that really doesn't exist. Say you tell them about complex numbers. They really can't comprehend it, but they got the notion. So now it's noise to them. They fit the noise, and they ask you, is 7.34521 a complex number? Because in their mind, it's a complex number; they just latched onto the wrong thing. So you are better off just killing that part and giving them the simple thing that they can learn, because the additional part will actually mislead them. Mislead them, as noise does. So this is our idea: if I have a hypothesis set, and there is a part of the target that I cannot capture, there is no point in trying to capture it, because when you try, you are detecting a false pattern that you cannot extrapolate, given your limitations. That's why it's called noise. Now, both deterministic noise and stochastic noise can be plotted as a realization, but there are two main differences between them. The first is that deterministic noise depends on your hypothesis set.
For the same target function, if you use a more sophisticated hypothesis set, the deterministic noise will be smaller, because you are able to capture more. Obviously the stochastic noise will be the same; nothing can capture it, so all hypothesis sets are alike in that regard. It cannot be captured, and therefore it's noise. The other difference is that, for a given point x, the deterministic noise is a fixed amount: the difference between the value of the target at that point and the best hypothesis approximation you have. Stochastic noise, on the other hand, is generated at random, so for two occurrences of the same x, the noise will change from one occurrence to the other, whereas the deterministic noise stays the same. Nonetheless, they behave exactly the same for machine learning, because invariably we have a given data set. Nobody changes the x's on us and hands us another realization; the x's are given to us together with the labels. So that difference doesn't matter for us. And we settle on a hypothesis set, and once you settle on a hypothesis set, the deterministic noise is as bad as the stochastic noise: it's something we cannot capture, and it depends only on something we have already fixed. So in a given learning situation, they behave the same. So now let's see the impact on overfitting. This is what we have seen before: increasing target complexity, so increasing deterministic noise in the terminology we just introduced, versus the number of points, with red meaning overfitting. So this is how much overfitting there is. And we are looking at deterministic noise as it relates to the target complexity, because the quantitative handle we have is the target complexity. We defined what a realization of deterministic noise is, but it's not yet clear what quantity we should measure out of the deterministic noise to tell us the level of noise that results in overfitting.
We have that quantity in the case of stochastic noise very easily: we just take its energy. So here we realize that as you increase the target complexity, the deterministic noise increases, or at least the overfitting phenomenon that we observe increases. But you notice something interesting here: the effect doesn't really start until past 10. Why? This was overfitting of what? The 10th-order model versus the second-order model. So for there to be deterministic noise for the 10th-order model, you'd better go above a 10th-order target, so that there is something it cannot approximate. That's the region where the effect appears. So here, I wouldn't say it's proportional, but the overfitting definitely increases with the target complexity, and it decreases with N, as we expect. Now, for finite N, you suffer from deterministic noise the same way you suffer from stochastic noise. But wait: we declared that deterministic noise is the part that your hypothesis set cannot capture. So what is the problem? If I cannot capture it, it shouldn't hurt me, because when I try to fit, I won't capture it anyway. No, you cannot capture it in its entirety. But if I give you only a finite sample, then you only get a few points, and you may be able to capture a little bit of the stochastic noise, or of the deterministic noise in this case. With 10 points, you can; give me a million points, and even with stochastic noise present, there is nothing I can do to fit the noise. Let me remind you of the example we gave in linear regression. We took linear regression and said, let's say we are learning a linear function, so linear regression would be perfect in this case. So this is the target. And then we added noise to the examples, so instead of getting the points perfectly on that line, you get points to the right and left of it. And then we tried to use linear regression to fit it. If you didn't have any noise, linear regression would be perfect in this case.
Now, since there is noise, and the algorithm doesn't really see the line, it only sees those points, it eats a little bit into the noise and therefore gets deviated from the target. That is why you get worse performance than without the noise. Now, if I have 10 points, linear regression will have an easy time eating into that, because there isn't much to fit: there are only 10 points, and maybe there's some linear pattern in them. If I get a million points, the chances are I won't be able to fit the noise at all, because there is noise all over the place, and I cannot find a compromise using my few parameters, and therefore I will end up essentially unaffected by it. In the infinite case, I cannot capture anything: the fluctuations are noise, and I cannot fit them; they are beyond my ability. But the problem is that once you have a finite sample, you are given the unfortunate ability to fit the noise, and you will indeed fit it. Whether it's stochastic, where fitting makes no sense, or deterministic, where there is no point in fitting it because you know your hypothesis set has no way to generalize it out of sample, it is beyond your ability. So the problem is that for finite N, you get to try to fit the noise, both stochastic and deterministic. Now let me go quickly through a quantitative analysis that will put deterministic noise and stochastic noise in the same equation, so that they become clearer. Remember bias-variance? That was a few lectures ago. What was that about? We had a decomposition of the expected out-of-sample error into two terms. This is the expected value of the out-of-sample error; remember, this is the hypothesis we get, and it depends on the data set that was given to us.
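The linear regression story can be sketched numerically; the target y = 2x, the noise level, and the trial counts are illustrative assumptions, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitted_slope_error(n, sigma=0.5, trials=500, rng=rng):
    """Fit a line to n noisy samples of the target y = 2x and return the
    average absolute deviation of the fitted slope from the true slope 2.

    With few points, least squares 'eats into' the noise and the slope
    wanders; with many points the noise averages out and the fit locks
    onto the target.
    """
    errs = []
    for _ in range(trials):
        x = rng.uniform(-1, 1, n)
        y = 2 * x + rng.normal(0, sigma, n)
        slope = np.polynomial.polynomial.polyfit(x, y, 1)[1]
        errs.append(abs(slope - 2.0))
    return float(np.mean(errs))
```

With n = 10 the fitted slope strays noticeably from 2; with n = 1000 it is nearly exact, which is the small-sample versus large-sample contrast described above.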
We compare it to the target function, take the expected value with respect to those, and that ended up being the variance, which tells me how far I am from the centroid as the training set varies, meaning there is a variety of hypotheses I can get depending on D; plus how far the centroid is from the target, which is the bias of my hypothesis set with respect to the target. And the leap of faith we had is that this quantity, the average hypothesis over all data sets, is about the same as the best hypothesis in the hypothesis set. So we had that, and in that analysis, f was noiseless. Now I'd like to add noise to the target and see how the decomposition goes, because this will give us very good insight into the role of stochastic noise versus deterministic noise. So we add noise, and we are going to keep track of it, because we want to pay attention to it and because we are going to take expected values with respect to it. So y, the realization, is now the target plus epsilon, and I'm going to assume that the expected value of the noise is 0. Again, if the expected value is something else, we put that into the target and leave outside only the part which is pure fluctuation, and call that epsilon. So now I would like to repeat the analysis, more quickly obviously, with the added noise. First, this is what we started with: I'm comparing the hypothesis you get in a particular learning situation to the target. But now the target is noisy, so the first thing is to replace the target by the noisy version, which is y, and y is f of x plus the noise. That's what I'm comparing to. And because y depends on the noise, I'm not only averaging with respect to the data set; I'm taking the expected value with respect to both D and epsilon, since epsilon affects y. So you expand this, and this is just rewriting it.
f of x plus epsilon is y, so I'm writing it this way. And we do the same thing we did before, just carrying this along until we see where it goes. So what did we do? We added and subtracted the centroid, the average hypothesis, in preparation for getting squared terms and cross terms and whatnot, and here we still have the epsilon added to the mix. Then we write it out. We group these two together as a square, these two together as a square, and this one, by itself, as a square. We still have cross terms, and there are more of them than before because epsilon is in the mix. But the good news is that if you take the expected value, all of the cross terms go to zero: the ones involving epsilon because the expected value of epsilon is zero, and because epsilon is independent of the other random quantity here, the data set. The data set and its noise are generated separately; epsilon is generated at the test point x, which is independent. So it's very easy to argue that these are zero, and you get basically the same decomposition as before, with this fellow added. So let's look at it. You will see that there are actually two noise terms that come up. This is the variance term, this is the bias term, and this is the added term, which is just sigma squared, the energy of the noise. Let me discuss this a little. We had the expected value with respect to D and with respect to epsilon, and then remember that we also take the expected value with respect to x, averaging over the whole space, in order to get just the bias and variance rather than bias(x) at a particular test point. I have done that already.
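In symbols, the decomposition just described can be written as follows; this is a sketch using the standard bias-variance notation, with $g^{(\mathcal{D})}$ the learned hypothesis, $\bar{g}$ the average (centroid) hypothesis, and $y = f(\mathbf{x}) + \varepsilon$ where $\mathbb{E}[\varepsilon] = 0$ and $\mathbb{E}[\varepsilon^2] = \sigma^2$:

```latex
\mathbb{E}_{\mathcal{D},\varepsilon}\!\left[\big(g^{(\mathcal{D})}(\mathbf{x}) - y\big)^2\right]
= \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[\big(g^{(\mathcal{D})}(\mathbf{x}) - \bar{g}(\mathbf{x})\big)^2\right]}_{\text{variance}}
+ \underbrace{\big(\bar{g}(\mathbf{x}) - f(\mathbf{x})\big)^2}_{\text{bias (deterministic noise)}}
+ \underbrace{\mathbb{E}_{\varepsilon}\!\left[\varepsilon^2\right]}_{\sigma^2 \text{ (stochastic noise)}}
```

The cross terms vanish because $\mathbb{E}[\varepsilon] = 0$ and $\varepsilon$ is independent of $\mathcal{D}$; taking the further expectation over $\mathbf{x}$ gives the overall bias plus variance plus $\sigma^2$.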
So every expectation is with respect to the data set, with respect to the input point, and with respect to the realization of the noise epsilon, but in the notation I am only keeping the subscripts that matter for each term. Epsilon doesn't appear in the first term, so the term is constant with respect to it and I take it out. In the second, neither epsilon nor D appears, so I leave it plain for simplicity. And in the third, D doesn't appear, but epsilon and x do, so I write it that way. I could use the more elaborate notation, but I just wanted to keep it simple. Now look at this decomposition. We are moving from your hypothesis to the centroid, from the centroid to the target proper, and then from the target proper to the actual output, which has a noise aspect to it. So it's again the same idea of approximating something in steps. If you look at the last quantity, that is patently the stochastic noise. The interesting thing is that there is another term here corresponding to the deterministic noise, and it is this fellow: another name for the bias. Why is that? Because our leap of faith told us that the average hypothesis is about the same as the best hypothesis. So we are measuring how well the best hypothesis can approximate f, and that is exactly the energy of the deterministic noise. That is why the bias is really deterministic noise. Putting it this way gives you solid ground to treat them the same. Now, if you increase the number of examples, you may get a better variance: with more examples, you don't float around trying to fit all of them, so the red region that used to be the variance shrinks and shrinks. But these two noise terms are both inevitable. There is nothing you can do about the stochastic noise, and nothing you can do about the deterministic noise given a hypothesis set. These are fixed. But again, remember that in bias-variance the approximation was an overall approximation: we took the entire target function and the entire hypothesis.
We didn't look at particular data points; we looked at that approximation proper, and that's why those terms are inevitable. You tell me what the hypothesis set is; well, that's the best I can do, and the best I can do about the noise is simply not to predict anything in the noise. Now, both the deterministic noise and the stochastic noise have a finite-sample version on the data points, and the algorithm will try to fit them. That's why this variance term picks up a variety: it depends on the particular fit of those realizations, and you will get one or another. So the noise terms affect the variance by making the fit more susceptible to going to more places depending on what happens. I go this way and that way, not because it's indicated by the target function I want to learn, but just because there is noise present in the sample that I am blindly following, because I cannot distinguish the noise from the signal. And therefore I end up with more variety, worse variance, and overfitting. OK. Now, very briefly, I'm going to give you a lead into the next two lectures. We understand what overfitting is, and we understand that it's due to noise. And we understand that noise is in the eye of the beholder, so to speak. There is stochastic noise, but there is another noise which is not really noise; it depends on which hypothesis set is looking at it. It looks like noise to some and not to others, and we call that deterministic noise. And we saw experimentally that it affects overfitting. So how do we deal with overfitting? What does it mean to deal with it? We want to avoid it: we don't want to spend more energy fitting and get worse out-of-sample error, whether by the choice of a model or by optimizing within a model, like we did with neural networks. So there are two cures. One of them is called regularization, and that is best described as putting on the brakes. With overfitting, you are going, going, going, going.
And you hurt yourself. So all I'm doing is making sure that you don't go all the way, and when I do that, I avoid overfitting. The other cure is called validation. What is the cure in this case? You check the bottom line and make sure that you don't overfit. It's a different philosophy. The reason I'm overfitting is that I'm going for E_in, minimizing it, and going all the way. So I say, wait a minute: E_in is not a very good indication of what happens out of sample. Maybe there is another way to tell what is actually happening out of sample, and therefore to avoid overfitting, because you can check the real quantity you care about. So these are the two approaches. I'll give you just an appetizer, a very short one, for putting on the brakes. This is the regularization part, which is the subject of the next lecture. Remember this curve? That's what we started with: we had the five points, we had the fourth-order polynomial, we fit, and we ended up in trouble. We can describe this as a free fit: fit all you can. Five points, I'll take a fourth-order polynomial, go for it; I get this, and that's what happens. Now, putting on the brakes means that you are not going to go all the way; you are going to have a restrained fit. The reason I'm showing this is that it's fairly dramatic: this curve is so incredibly bad that you'd think you really need to do something dramatic in order to avoid it. But here is what I'm going to do. I'm going to let you fit, and actually let you fit using a fourth-order polynomial; I'll give you that privilege. But I'm going to prevent you from fitting the points perfectly. I'm going to put some friction in, such that you cannot get exactly to the points.
And the amount of brakes I'm going to apply here is so minimal, it's laughable. When you go for your car service, they measure the brakes and tell you, oh, the brakes are at 70%, and when it gets to 40%, they tell you you need to do something about them. The brakes here are about 1%. If this was a car, you'd start braking here and stop in Glendale. It's completely ridiculous. But that little amount of braking results in this: totally dramatic, a fantastic fit. The red curve is a fourth-order polynomial, but we didn't allow it to fit all the way, and you can see that, because it really doesn't pass through the points exactly; it gets close, but not exactly there. So we don't have to do much to prevent overfitting, but we need to understand what regularization is and how to choose it, et cetera. That we'll talk about next time. And the time after that, we are going to talk about validation, which is the other prescription. I will stop here, and we will take questions after a short break. So let's start the Q&A with a question in-house. In the previous lecture, we spoke about stochastic gradient descent, and we said that we should take the points one by one and move in the direction of the negative gradient of the error on that point. So the question is, how important is it to choose the points randomly? We could just take them in order: the first point, the second point, and so on. Yeah. So depending on the runs, there could be no difference at all, or there could be a real difference. The best way to think of randomization in this case is that it's an insurance policy. If there is something about the pattern of presentation that is detrimental in a particular case, you are always safe by picking the points at random, because there is no chance that the random choice will keep producing a harmful pattern if you keep doing it.
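A sketch of the "brakes" using weight decay, one common regularizer; the lecture defers the actual mechanism to next time, so the ridge form and lam = 0.01 standing in for the "1% brake" are my assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def ridge_poly_fit(x, y, Q, lam):
    """Fit a Q-th order polynomial with brakes: minimize
    sum_i (h(x_i) - y_i)^2 + lam * ||w||^2 (weight-decay regularization).
    lam = 0 is the free fit; a tiny lam > 0 restrains it.
    """
    Z = np.vander(x, Q + 1, increasing=True)   # columns [1, x, x^2, ..., x^Q]
    # Normal equations with the regularization term added on the diagonal
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(Q + 1), Z.T @ y)
    return w

x = rng.uniform(-1, 1, 5)
y = np.sin(np.pi * x) + rng.normal(0, 0.1, 5)  # 5 noisy points, wiggly target
w_free = ridge_poly_fit(x, y, Q=4, lam=0.0)    # free fit: interpolates exactly
w_braked = ridge_poly_fit(x, y, Q=4, lam=0.01) # ~1% brakes: misses the points
```

The free fourth-order fit passes through all five points; the braked fit has smaller weights and no longer gets the points exactly right, which is the behavior described above.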
So in many cases, you just run through examples 1 through N, 1 through N, 1 through N, and you will be fine. In some cases, you take a random permutation. In some cases, you even stay true to picking the point at random each time, and you count on every point being presented about equally often in the long run. In my own experience, there is little difference in a typical case. Every now and then there is a funny case, and therefore you are safer using the stochastic presentation, the random presentation of the examples, in order not to fall into the trap in those cases. There is another question in-house. Hi, Professor. I have a question about slide 4, about neural networks. I don't understand: how do you draw the out-of-sample error on that plot? In general, you obviously cannot draw the out-of-sample error; if you could, you would just pick its minimum. So this is a case where I give you a data set, and you decide to set aside part of it for testing, not involving it at all in the training. Then you go about your training, and at the end of every epoch, when you evaluate the in-sample error on the entire batch, which is the green curve here, you also evaluate, for that set of weights, the frozen weights at the end of the epoch, the error on the test set, and you get a point. And because that point is not involved in the training, it is an out-of-sample estimate, and that gives you the red point, and you go on. Now there is an interesting, tricky point here. If you decide at some point to look at the red curve and say, I am going to stop where the red curve is at its minimum, then the set that used to be a test set is no longer a test set, because it has just been involved in a decision regarding training. It becomes slightly contaminated. This is the validation set, which we are going to discuss when we talk about validation, but that is really the premise.
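Returning to the stochastic gradient descent question above, the three presentation schemes mentioned (fixed order, a fresh random permutation per epoch, independent random picks) can be sketched as follows; the function and its callback interface are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_epochs(update, n, epochs, scheme="permute", rng=rng):
    """Present training indices to an SGD step `update(i)` under one of the
    schemes discussed: fixed order 1..n, a fresh random permutation per
    epoch, or sampling with replacement. Randomizing is the 'insurance
    policy' against a detrimental pattern in the data ordering.
    """
    for _ in range(epochs):
        if scheme == "fixed":
            order = np.arange(n)
        elif scheme == "permute":
            order = rng.permutation(n)   # each point exactly once per epoch
        else:                            # "replace": independent random picks
            order = rng.integers(0, n, size=n)
        for i in order:
            update(i)
```

Under "permute", every example is still visited exactly once per epoch, so the long-run representation of each point is guaranteed, while the order carries no fixed pattern.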
OK, so here... All right, I understand. Also, can I go back to slide 16? Slide 16, yeah. I didn't follow why the two noises are the same for the same learning problem. OK, so they are the same in the sense that they are part of the outputs that I am given, or that I am trying to predict, and it's a part I cannot predict regardless of what I do. In the case of stochastic noise, it's obvious: there is nothing to predict there, so whatever I do, I miss it. In the case here, it's particular to the hypothesis set that I have. So I take a hypothesis set and, in a non-learning scenario, look at the target function and choose the best approximation to it; that is my best hypothesis, which we call h star. If you look at the difference between h star and f, that difference is the part which I cannot capture, because the best I could do is h star. So the remaining part is what I am referring to as deterministic noise, and it is beyond my ability, given my hypothesis set. So that's why they are the same: the same in the sense of being unreachable as far as my resources are concerned. So in a real problem, do we know the complexity of the target function? In general, no. We also don't know the particulars of the noise. We know that the problem is noisy, but we cannot identify the noise; in most cases, we cannot even measure it. So the purpose here is to understand that even in the case of a target that is noiseless in the conventional sense, there is something that we can conceptually identify that does affect the overfitting, and even if we don't know its particulars, we will have to put guards in place in order to avoid overfitting. That was the goal here, rather than try to... Any time you see the target function drawn, an alarm bell should immediately go off that this is conceptual, because you never actually see the target function in a real learning situation.
Oh, so that's why the noises are equivalent: because we don't know the target function, we don't know which part is the deterministic noise. I mean, if I knew the target, then the situation would be good, but then I wouldn't need machine learning; I would already have the answer. So we go to the questions from the outside. Yeah, okay, a quick conceptual one. Is it okay to say that deterministic noise is a part of the reality that is too complex to be modeled? It is definitely part of the reality, that part. And basically, our failure to model it is what made it noise as far as we are concerned. Obviously, you can, in some sense, model it by going to a bigger hypothesis set. The bigger hypothesis set will have an h star closer to the target, and therefore the difference will be smaller. But the situation pertains to the case where you already chose the hypothesis set according to the prescriptions of VC dimension, number of examples, and other considerations. And given that hypothesis set, you already concede that even if the target is noiseless, there is a part of it that behaves as noise as far as you are concerned, and you will have to treat it as such when you consider overfitting and the other considerations. Also, is it fair to say that overtraining will cause overfitting? I think they are close to synonymous. Overfitting is relative. Overtraining, if I try to give it a definition, would be relative within the same model: you have already settled on the model, and you are overtraining it. So the neural network case would be overtraining. The case of choosing the third-order polynomial versus the fourth-order polynomial would not really be overtraining, but it would be overfitting. It's all technicalities, but just to answer the question. Okay, practically, when you implement these algorithms there are also some approximations, maybe due to floating-point numbers or something. Is this another source of error? Does it produce overfitting?
Okay, formally speaking, yes, it's another source, but it is so minute with respect to the other factors that it's never mentioned. We have another in-house question. So a couple of lectures ago we spoke about the third linear model, which is... You said the third linear model? Yes. Okay. So the question is: if initially I have data which is completely linearly separable, so some points are marked minus one and some are plus one, and there is a plane which separates them, is it true that applying this learning we will never get stuck in a local minimum, and we will get zero in-sample error? Okay, this is a very specific question about that model. If the data is completely clean, then you can get closer and closer to having the probability be perfect by having bigger and bigger weights. So there is a minimum, and again it's a unique minimum, except that the minimum is at infinity in terms of the size of the weights. But this doesn't bother you, because you are really going to stop at some point when the gradient is small, according to your specification, and you can specify this any way you want. So the goal is not necessarily to arrive at the minimum, which hardly ever happens even when the minimum is not at infinity, but to get close enough, in the sense that the value is close to the minimum, and therefore you achieve the small error that you want. Okay, can you resolve again the contradiction: when you increase the complexity of the model, you should be reducing your bias, and hence your deterministic noise. But here we had an example where H sub 10 had more error than H sub 2; H sub 10 had larger total error than H sub 2. So if we were only playing the approximation game, H sub 10 would be better.
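The minimum-at-infinity behavior is easy to see numerically. Below is a sketch under the assumption that the model in question is logistic regression with the cross-entropy error, on four separable 1-D points of my own choosing: the weight keeps growing, the gradient keeps shrinking, and we stop when the gradient is small, exactly as described:

```python
import numpy as np

# Logistic regression on perfectly separable data: the cross-entropy error
# ln(1 + exp(-y w x)) has its minimum "at infinity" in w. Gradient descent
# never reaches it; we simply stop when the gradient is small enough.
x = np.array([-2.0, -1.0, 1.0, 2.0])  # separable at the origin
y = np.array([-1.0, -1.0, 1.0, 1.0])

def gradient(w):
    # derivative of the mean cross-entropy error with respect to w
    return float(np.mean(-y * x / (1.0 + np.exp(y * w * x))))

w, lr = 0.0, 1.0
weight_sizes = []
for _ in range(10000):
    g = gradient(w)
    if abs(g) < 1e-3:                 # stopping criterion, not a true minimum
        break
    w -= lr * g
    weight_sizes.append(abs(w))
```

By the time the loop stops, the weight has grown large and is still growing; the error value, however, is already essentially as small as you asked for.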
If we had the three terms in the bias-variance decomposition, and we were only going by these two, then there is no question that the bigger model, H sub 10, would win, because this one is the same for all models, and this one will be better for H sub 10 than for H sub 2, because H sub 10 is closer to the target we want, and therefore we will be making a smaller error. This is not the source of the problem of overfitting. This is just identifying terms in the bias-variance decomposition (the bias-variance-noise decomposition, in this case) that correspond to the different types of noise. The problem of overfitting happens here, and it happens because of the finite-sample version of both. That is, I get N points in which there is a contribution of noise coming from the stochastic part and from the deterministic part. On those points, the algorithm will try to fit that noise, in spite of the fact that if it knew better, it wouldn't, because it would know that the noise is out of reach. But it gets a finite sample, and it can use its resources to try to fit part of that noise, and that is what causes overfitting. And that ends up being harmful, so harmful in the H sub 10 case that the harm offsets the fact that H sub 10 is closer to the target function. Being closer doesn't help me very much, for the same reason we said before. Let's say that it's H sub 10, and the target function is sitting here. That doesn't do me much good if my algorithm, with the distraction of the noise and whatnot, leads me to go in that direction. I would be further from the target function than another guy who was only working with this, remained within its confines, and ended up closer to the target function. So it's the variance term that results in overfitting, not this one, in spite of the fact that these terms contain both types of noise, contributing to the value. But their value is static; it doesn't change with N, and it has nothing to do with the overfitting aspect.
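The finite-sample effect described here is easy to reproduce. The sketch below is my own toy version of the H2-versus-H10 comparison (12 noisy points from a simple target; all constants are illustrative, not the lecture's setup): the 10th-order fit wins in sample by fitting the noise, and typically loses badly out of sample:

```python
import numpy as np

# Toy version of the H2-versus-H10 overfitting experiment (assumed setup).
np.random.seed(0)

def target(x):
    return 0.5 * x                    # simple, noiseless target function

x_train = np.linspace(-1.0, 1.0, 12)
y_train = target(x_train) + np.random.normal(0.0, 0.5, x_train.size)

x_test = np.linspace(-1.0, 1.0, 400)  # dense grid for the out-of-sample error

results = {}
for degree in (2, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    e_in = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    e_out = float(np.mean((np.polyval(coeffs, x_test) - target(x_test)) ** 2))
    results[degree] = (e_in, e_out)

# results[10] has the smaller in-sample error (11 parameters fit the noise),
# while results[2] has the smaller out-of-sample error: that gap is overfitting.
```

Note that the in-sample ordering is guaranteed (H2 is contained in H10, so least squares on the same data can only do better in sample), while the out-of-sample ordering is the statistical phenomenon the lecture is after.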
In the case of polynomial fitting, a way to avoid overfitting could be to use piecewise linear functions around each point. Is this a method of regularization? It depends on the number of degrees of freedom you have. You can have piecewise linear models that are really horrible; it depends on how many pieces. If you have as many pieces as there are points, you can see what the problem is. What really matters is the VC dimension of your model. If it's piecewise linear and I have only four parameters, then I don't worry too much that it's piecewise linear; I only worry about the four-parameters aspect of it. The 10th-order polynomial was bad because of the 11 parameters, not because of other factors. But anything you do to restrict your model in terms of the fitting can be called regularization. There are good methods and bad methods, but they are all regularization in the sense of putting on the brakes. Some practical questions. How do you usually get the profile of the out-of-sample error? Do you sacrifice points? This is obviously a good question. We will talk about this when we talk about validation. Validation has an impact on overfitting; it is used to deal with it. But it is also used in model selection in general. Because of that, it's very tempting to say, I'm going to use validation, I'm going to set aside a number of points. But obviously the problem is that when you set aside a number of points, you deprive yourself of a resource that you could have used for training in order to arrive at a better hypothesis. So there is a trade-off, and we'll discuss that trade-off in very specific terms and find ways to get around it, like cross-validation and whatnot. But this will be the subject of the lecture on validation, coming up soon. In the example of the color plots, isn't the order of the polynomial a good indication of the VC dimension? These are the plots. What is the question? Here Q sub f is directly related to the VC dimension.
The target complexity has nothing to do with the VC dimension. It's the target; I'm talking about different targets and whatnot. The VC dimension has to do only with the two models we are using, H2 and H10, second-order polynomials and 10th-order polynomials. If we take the degrees of freedom as a proxy for the VC dimension, they will have different VC dimensions. And the discrepancy in the VC dimension, given the same number of examples, is the reason why we have a discrepancy in the out-of-sample error. But we also have a discrepancy in the in-sample error, and the case of overfitting is such that the in-sample error is moving in one direction and the out-of-sample error is moving in the other direction. So the only thing in this plot relevant to the VC dimension is the fact that the two models, H2 and H10, have different VC dimensions. So you never really have a measure of the target complexity in practice? Correct. This was an illustration. And even in the case of the illustration, where we had an explicit definition of the target complexity, it wasn't completely clear how to map it into an energy of deterministic noise, a counterpart of sigma squared. Sigma squared is completely clear, and as you can see, because of that, its plot is very regular and whatnot. Here, first we defined the target complexity in a particular way in order to be able to run an experiment. Second, in terms of energy, it's not clear. Can you tell me what the energy of the deterministic noise is here? There is quite a bit of normalization that was done. When we normalized the target in order to make sigma squared meaningful, the target ended up sandwiched within a limited range. And therefore, the amount of energy of whatever the deterministic noise is will be limited, regardless of how complex the target is. So there is a compromise we had to make in order to be able to produce these plots.
However, the moral of the story here is that there is something about the target complexity that behaves, as far as overfitting is concerned, in the same way as noise. And we identified it as deterministic noise. We didn't quantify it further, though it is possible to quantify it; you can get the energy for this case and that case, and you can do it, but these are research topics. As far as we are concerned, in a real situation we won't be able to identify either the stochastic noise or the deterministic noise. We just know they are there, we know their impact on overfitting, and we will be able to find methods to cure the overfitting without knowing all the specifics that we could possibly know about the noise involved. Do you ever measure the... Is there some similar kind of measure of the model complexity of the target function? Do you ever use the VC dimension for that? Not explicitly. One can apply it: you ask what model would include the target function, and then, based on that inclusion, you can say that the complexity of the target is the complexity of that model. The analysis we use is such that the complexity of the target function does not enter the VC analysis. But there are other approaches, other than the VC analysis, where the target complexity matters. So I didn't particularly spend time trying to capture the complexity of the target function until this moment, where the complexity of the target function could translate into something in the bias-variance decomposition that has an impact on overfitting and generalization. Okay. I think that's it for today. So we will see you on Thursday.