The following program is brought to you by Caltech. Welcome back. Last time we finished the VC analysis, and that took us three full lectures. The end result was the definition of the VC dimension of a hypothesis set: the most points that the hypothesis set can shatter. We used the VC dimension both in establishing that learning is feasible and in estimating the example resources needed in order to learn. One of the important aspects of the VC analysis is its scope. The VC inequality, and the generalization bound that corresponds to it, describe the generalization ability of the final hypothesis you are going to pick. They describe it in terms of the VC dimension of the hypothesis set, and they make a statement that is true for all but a fraction delta of the data sets you might get. So this is where it applies. The most important part of the statement is what is absent from it, because that is what gives the VC inequality its generality: the VC bound is valid for any learning algorithm, for any input distribution, and for any target function you may be trying to learn. That was the most theoretical part. Then we went into a more practical part, asking about the utility of the VC dimension in practice. Someone comes to you with a learning problem, and you would like to know how many examples, what size of data set, you need in order to achieve a certain level of performance. The way we did this analysis was by plotting the quantity in the VC bound that bounds delta, the probability of error. We found that it behaves regularly. We focused on a certain aspect of these curves, which correspond to different VC dimensions, and the main aspect is below the line that designates probability 1. We want the probability of the bad event to be small, so we are working in this region.
The x-axis here is the number of examples, the size of your data set. We don't particularly care about the exact shape of these curves; they could be a little nonlinear, et cetera. The quantity we are looking for is: if we cut through horizontally, what is the behavior of the x-axis value, the number of examples, as a function of the VC dimension, which is the label for the colored curves? We realized that, given this analysis, it is very much proportional. We were able to say that, theoretically, the bound tells us the number of examples needed is proportional to the VC dimension, more or less. The constant of proportionality, if you go by the bound, is horrifically pessimistic: you end up requiring tens of thousands of examples for something that really needs maybe 50. The good news is that the actual quantity behaves the same way as the bound. So the number of examples needed is, as a practical observation, indeed proportional to the VC dimension. Furthermore, as a rule of thumb, in order to get into the interesting regime, interesting delta and epsilon, you need the number of examples to be 10 times the VC dimension. More is better, less might work, but the ballpark is a factor of 10 before you start getting interesting generalization properties. We ended by summarizing the entire theoretical analysis into a very simple bound, which we refer to as the generalization bound, that bounds the out-of-sample performance given the in-sample performance. That involved adding a term, capital Omega, which captures all the theoretical analysis we had. It is a function of N, a function of the hypothesis set through the VC dimension, and a function of your tolerance for the probability of error, which is delta. Although this is a bound, we keep saying that in reality, E_out will be equal to E_in plus something that behaves like Omega.
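The sample-complexity contrast described here, tens of thousands of examples from the bound versus the factor-of-10 rule of thumb, can be sketched numerically. This is a sketch under assumptions: it uses the implicit condition N ≥ (8/ε²) ln(4 ((2N)^d_vc + 1)/δ) obtained from the VC generalization bound with the polynomial estimate of the growth function, and solves it by simple fixed-point iteration; the exact constants depend on which form of the bound one uses.

```python
import math

def sample_complexity(dvc, epsilon, delta, n0=1000.0, iters=100):
    """Solve the implicit VC sample-complexity condition
        N >= (8 / epsilon**2) * ln(4 * ((2N)**dvc + 1) / delta)
    by simple fixed-point iteration, starting from the guess n0.
    The iteration converges because N appears only inside a logarithm."""
    n = n0
    for _ in range(iters):
        n = (8.0 / epsilon ** 2) * math.log(4.0 * ((2.0 * n) ** dvc + 1.0) / delta)
    return n

# d_vc = 3, generalization error epsilon = 0.1, confidence delta = 0.1:
# the bound demands on the order of tens of thousands of examples,
# versus the practical rule of thumb of roughly 10 * d_vc examples.
n_bound = sample_complexity(dvc=3, epsilon=0.1, delta=0.1)
rule_of_thumb = 10 * 3
```

The gap between the two numbers is exactly the pessimism of the bound discussed above.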
And we will take advantage of that when we get to a technique like regularization. So that is the end of the VC analysis, which is the biggest part of the theory here. Today we are going to switch to another approach, the bias-variance trade-off. It is a stand-alone theory; it gives us a different angle on generalization, and I am going to cover it beginning to end during this lecture. The outline is very simple. We are going to talk about bias and variance: define them and see the trade-off. Then we take one very detailed example in order to demonstrate what the bias and variance are. After that we introduce a very interesting tool for illustrating learning, called learning curves; we will contrast the bias-variance analysis with the VC analysis on these learning curves, and then apply them to the linear regression case we are familiar with. So that is the plan. The first part is bias and variance. In the big picture, we have been trying to characterize a trade-off, and roughly speaking, the trade-off is between approximation and generalization. Let me discuss this for a moment before we put bias and variance into the picture. We would like to get small E_out; that is the purpose of learning. If E_out is small, then you have learned: the hypothesis you produce approximates the target function well. There are two components to this, and we are very familiar with them by now. We are looking for a good approximation of f; that is the approximation part. But we would like that approximation to hold out of sample. We are not going to be happy if we approximate f well in sample and behave badly out of sample. So these are the two components. With a more complex hypothesis set, you obviously have a better chance of approximating f: you have more hypotheses to choose from, so you will be able to find one of them that is closer to the target function you want.
The problem is that with the bigger hypothesis set, you are going to have a problem identifying the good hypothesis. That is, with fewer hypotheses, you have a better chance of generalizing. One way to look at it is this: I am trying to approximate the target function, and you give me a hypothesis set. Now suppose I tell you, I have good news: the target function is actually in the hypothesis set. You have the perfect approximation under your control. Well, it is in your hands, but not necessarily under your control, because you still have to navigate through the hypothesis set in order to find the good candidate. And the way you navigate is through the data set; that is your only resource for preferring one hypothesis over another. The target function could be sitting there, calling out: please, I am the target function, come. But you cannot see it. You are just navigating with the data, with very limited resources, and you may end up with something that is really bad. So having f in the hypothesis set is great for approximation, but having a hypothesis set big enough to include f may be bad news, because you will not be able to find it. Now, if you think about it, what is the ideal hypothesis set for learning? If only I had a hypothesis set containing a single hypothesis which happens to be the target function. Then I would have the best of both worlds: the perfect approximation, and I zoom in on it directly, because it is the only one. Well, you might as well go and buy a lottery ticket; that is the equivalent. We do not know the target function, so we have to make the hypothesis set big enough to stand a chance. And once we do that, the question of generalization kicks in. This is the big picture. So let us try to fit the VC analysis into it, and then fit the bias-variance analysis into it, before we even know what the bias-variance analysis is, in order to see where we are going with this.
So we are quantifying this trade-off, and the quantification in the case of the VC analysis was what? The generalization bound. E_in is the approximation part, because I am actually trying to fit the target function; I am just fitting it on the sample. That is the restriction here. If I do this well, then I am approximating f well, at least on some points. The other term is purely generalization: the question is, how do you generalize from in sample to out of sample? So that is the VC way of quantifying it. The bias-variance analysis takes another approach. It also decomposes E_out, as we did in the generalization bound, but it decomposes it into two different entities. The first one is an approximation entity: how well can H approximate f? Well, what is the difference, then? The difference is that the bias-variance analysis asks how well H can approximate f overall, not on your sample, but in reality. As if you had access to the target function, and you are stuck with this hypothesis set, and you are eagerly looking for the hypothesis that best describes the target function. Then you quantify how well that best hypothesis performs, and that is your measure of the approximation ability. Then what is the other component? The other component is exactly what I alluded to: can you zoom in on it? This is the best hypothesis, and it has a certain approximation ability. Now I need to pick it. I have to use the examples in order to zoom in within the hypothesis set and pick this particular one. Can I zoom in on it, or do I get something that is a poor approximation of the approximation? That decomposition will give us the bias and variance, and at the end of the lecture we will be able to put them side by side: here is what the VC analysis does, and here is what bias-variance does. From a mathematical point of view, the analysis applies to real-valued targets.
This is good news, because remember, in the VC analysis we were confined to binary functions, at least in the particular analysis that I did. You can extend it, but it is very technical. So it is good to see that the same trade-off and the same generalization questions apply to real-valued functions. Now we have regression, and we are able to make a statement about generalization for regression, which we will apply very specifically to linear regression, the model we already studied that has real-valued outputs. We are going to confine the analysis here to squared error. The reason is that for the math to go through in such a way that the two components decompose cleanly, with no cross terms, we will need the squared error. So this is a restriction of the analysis. There are ways to extend it, but they are not as clean, so this is the simplest form that we are going to use. OK, let us start. Our starting point is E_out, so let me put it up. Do not worry about the gap; the gap here will be filled. So what do we have? We have E_out. E_out depends on the hypothesis you pick: E_out is the out-of-sample error of your final hypothesis. How does it perform on the overall space? Since we are talking about squared error, you take the value of your hypothesis, compare it to the value of the target function, and square the difference; that is your error. This is the building block for getting the out-of-sample performance. Now, the gap here comes from the fact that the final hypothesis depends on a number of things. Among other things, it depends on the data set that I give you: if I give you a different data set, you will find a different final hypothesis. That dependency is quite important in the bias-variance analysis, so I am going to make it explicit in the notation.
It has always been there, but I did not need to carry ugly notation throughout when I was not using it. Here I am using it, so we will have to live with it. So now I will make that dependency explicit: I am adding a superscript which tells me that this g comes from that particular data set D. If you give me another data set, this will be a different g. You take that g, apply it to x, compare it to f, and this is your error. And finally, for it to be genuinely the out-of-sample error, you take the expected value of that error over the entire input space. So this is what we have. What we would like now is a decomposition of this quantity into the two conceptual components we saw, approximation and generalization. Here is what we are going to do. We take this quantity, which equals the expression I mentioned, and then realize that it depends on that particular data set. I would like to rid it of the dependency on the specific data set that I give you. So I am going to play the following game. I give you a budget of capital N training examples to learn from. With that budget N, I could generate one D, and another D, and another D, each of them with N examples; each of them will result in a different hypothesis g, and each of them will result in a different out-of-sample error. Correct? So if I want to get rid of the dependency on the particular sample that I give you, and just know the general behavior when I give you N data points, then I would like to integrate D out. I am going to take the expected value of that error with respect to D. This is not a quantity you will encounter in any given situation; in any given situation, you have a specific data set to work with. However, I want to analyze the general behavior. Someone comes to my door; I ask, how many examples do you have? And they tell me 100.
I have not seen the examples yet. So it stands to logic that I say: for 100 examples, the following behavior follows. That means I must be taking an expected value with respect to all possible realizations of 100 examples, and that is indeed what I am going to do. I take the expected value, and this is the quantity that I am going to decompose. It obviously happens to equal the expected value of the other expression, and we have that. So now I take the expression for the quantity I am interested in and keep deriving until I get to the decomposition I want. The first order of business: I have two expectations, so the first thing I do is reverse their order. Why can I do that? I am integrating, and I am changing the order of integration; I am allowed to do that because the integrand is non-negative. So I get this. The reason I do it is that I am really interested in the expectation with respect to D, and I would rather not carry the expectation with respect to x throughout. So I am going to set that expectation aside for a while until I get a clean decomposition, and once I have the clean decomposition, I will go back and take the expectation with respect to x, just to keep the focus clear. So focus on the inside quantity. If I give you the expression for the inside quantity for any x, then all you need to do to get the quantity you want is take the expected value of it with respect to x. This is the quantity we carry to the next slide. Let us do that. The main notion we need in order to evaluate this quantity is the notion of an average hypothesis, which is a pretty interesting idea. Here is the idea. You have a hypothesis set, and you are learning from a particular data set. I am going to define a particular hypothesis, which I will call the average hypothesis, and because it is an average, I am going to give it a bar notation.
So what is this fellow? It is defined as follows. You learn from a data set and get a hypothesis. Someone else learns from another data set and gets another hypothesis, et cetera. So how about taking the expected value of these hypotheses? What does that formally mean? We hold x fixed, and we are actually in a good position, because g of x is really just a random variable at this point: a random variable determined by the choice of your data set. The data is the randomization source; x is fixed. Think of it as having one test point in the space that you are interested in. Maybe you are playing the stock market, and you are only interested in what is going to happen tomorrow. You take the inputs for tomorrow, and those are the only inputs you care about performing on; that is your x. All the questions now pertain to this point. You learn from your data, and then you ask yourself: how am I performing on this point x? So now we look at this point and say: if you give me one data set versus another, I will get different values for the hypothesis at that point. It stands to logic that if I take the average with respect to all possible data sets, that would be awesome, because now I am getting the benefit of an infinite number of data sets. I am using them in the capacity of one data set of size N at a time, but I am getting the benefit of all of them. Maybe the correct value should be here, but because of fluctuations due to the data set, sometimes I am here, sometimes I am there. If you take the expected value, you will hit it right. So this looks like a great quantity to have, and in reality we will never have it. Because if you give me an infinite number of examples, I am not going to divide them neatly into groups of N, learn from each group, and then take the average; I am just going to take all your examples, learn from all of them, and get the target function almost perfectly.
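This averaging game can be simulated directly. A minimal sketch, borrowing the sinusoid target that appears later in the lecture, with the constant model and data sets of size N = 2 (the particular target, model, and sizes are this sketch's own choices for illustration):

```python
import math
import random

def fit_constant(data):
    """Least-squares constant fit: the mean of the y-values."""
    c = sum(y for _, y in data) / len(data)
    return lambda x: c

def g_bar_estimate(fit, target, n, k, xs, seed=0):
    """Estimate the average hypothesis g_bar at the test points xs:
    draw K independent data sets of size N, fit one hypothesis per
    data set, and average the fitted values point by point."""
    rng = random.Random(seed)
    avg = [0.0] * len(xs)
    for _ in range(k):
        data = [(x, target(x)) for x in (rng.uniform(-1, 1) for _ in range(n))]
        g = fit(data)
        for i, x in enumerate(xs):
            avg[i] += g(x) / k
    return avg

target = lambda x: math.sin(math.pi * x)
gbar = g_bar_estimate(fit_constant, target, n=2, k=20000, xs=[-0.5, 0.0, 0.5])
# By the symmetry of sin(pi*x) on [-1, 1], g_bar should be near 0 everywhere.
```

Each hypothesis is learned from only two points, yet averaging many of them recovers a stable g bar.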
So this is just a conceptual tool for the analysis, but we understand what it is. And if you now vary x, so your test point is general, you take that random variable and its expected value at every point, and the function constituted by these expected values at different points is your g bar. So this is understood. Why do I need this for the analysis? Because if you look at the top expression, I have a square there, so I am probably going to expand it. In expanding it, I will get a term that is linear in g, inside an expected value, so you can see that I will get something that requires me to define g bar. That is the technical utility. But the conceptual utility is very important. If you want to tell someone what g bar is: imagine you have many, many data sets, and the game is that you learn from one data set at a time, and you want to make the most of all of them afterwards. What you do is take a vote. Since these are real-valued hypotheses, the vote is just an average: 1 over capital K, where K is the number of data sets, times the sum of the hypotheses. So this is the average. Now let us see how we can use g bar to get the decomposition we want. Here, again, is the quantity I am passing from one slide to another in order not to forget: this is the quantity I want to decompose. The first thing I do is make it longer, by the simple trick of adding g bar and subtracting it. I am allowed to do that. Having done that, I consolidate these two terms, and I consolidate those two terms, and then expand the square. So let us do that. You get this: here is the first consolidated term, squared, and here is the second consolidated term, squared. Am I missing something? Yes, the cross terms. So let us add the cross terms, and I get twice the product. So this equals that. Now, note that the expected value here applies to the whole thing.
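The add-and-subtract expansion just described, written out in the lecture's notation (a sketch, before the cross term is dealt with):

```latex
\mathbb{E}_{\mathcal{D}}\!\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x})\right)^{2}\right]
= \mathbb{E}_{\mathcal{D}}\!\Big[\left(g^{(\mathcal{D})}(\mathbf{x}) - \bar{g}(\mathbf{x})\right)^{2}
+ \left(\bar{g}(\mathbf{x}) - f(\mathbf{x})\right)^{2}
+ 2\left(g^{(\mathcal{D})}(\mathbf{x}) - \bar{g}(\mathbf{x})\right)\left(\bar{g}(\mathbf{x}) - f(\mathbf{x})\right)\Big]
```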
Now, the first order of business is to look at the cross terms, because they are annoying, and see if we can get rid of them. That is where the benefit of the squared error comes in. I am taking the expected value with respect to D, and this factor is a constant with respect to D. Therefore, when I take the expected value of the whole product, all I need is the expected value of the other factor, because the constant factors out. Now, the expected value of a sum is the sum of the expected values, one of the few universal rules you can apply without asking any questions. So I get the expected value of this. What is the expected value of g^D? Wait a minute: that is g bar, by definition. So I get g bar minus the expected value of a constant, which happens to be g bar. So this goes to 0, and, happily, the cross term goes away. Now I have only these two terms, so let us write them. I have the expected value of the whole thing, which again is the sum of the expected values. The first term is a genuine expected value. When I apply the expectation to the second term, it is just a constant with respect to D, so its expected value is itself; I add it without bothering with the expectation. So this is the expression for the quantity that I want. Now let us take this and look at it closely, because this will be the bias and variance. Here is the quantity again, and it equals this expression. Let us look beyond the math and understand what is going on. This quantity tells you how far the hypothesis you got by learning on a particular data set differs from the ultimate thing, the target. And we are decomposing this into two steps. The first step asks: how far is the hypothesis you got from that particular data set from the best possible you can get using your hypothesis set?
Now there is a leap here, because I do not actually know that this is the best hypothesis in the hypothesis set. I got it by averaging, but since I am averaging over several data sets, it looks like a pretty good hypothesis. I am not even sure that it is actually in the hypothesis set: it is the average of hypotheses that came from the hypothesis set, but I can definitely construct a hypothesis set where the average of hypotheses does not necessarily belong to it. So there is some funny business there. But just think of it as an intermediate step. Instead of going all the way to the target function: here is your hypothesis set, which restricts your resources, and I am getting the best possible out of it, using some formula, by learning from an infinite number of data sets. This is a pretty good hypothesis. So, how far are you from that hypothesis? That is the first step. The second step is how far that hypothesis, that great hypothesis, is from the ultimate target function. Hopping from your hypothesis to the target decomposes into a small hop from your hypothesis to the best hypothesis, and another hop from the best hypothesis to the target function. And the neat thing is that they decompose cleanly; we found that they decompose cleanly because the cross term disappeared. That is the advantage of the particular error measure that we chose. Now we need to give names to these two quantities: they will be the bias and the variance. I would like you to think for five seconds about which would be a better description for each; I am not going to ask, since this is not a quiz like last time. This one is the bias. Why is it the bias? Because what it says is that, learning or no learning, your hypothesis set is biased away from the target function. This is the best I could do even under a fictitious scenario.
You have infinite data sets, you do all of this, you take the average, and that is the best you could come up with, and you are still far away from the target. So it must be a limitation of your hypothesis set. I am going to measure that limitation and say that your hypothesis set, represented at its best by this g bar, is biased away from the target function. So that is the bias, and again, it is the bias at that particular point x, the test point in the input space that I am interested in. The other quantity must be the variance. Why is that? Because if I knew everything, if I could zoom in perfectly, I would zoom in on the best, assuming it is there. So I would have g bar. But you do not: you have one data set at a time. When you get one data set, you get one hypothesis; with another data set, you get another one. These are different from g bar, so you are away from it, and I am measuring how far away you are. Because the expected value of g^D is g bar, and I am comparing the squared difference with it, it is properly called a variance: the variance of what I get due to the fact that I get a finite data set. Every time I get a data set, I get a different hypothesis, and I am measuring the squared distance from the center. So this we will call the bias, and this we will call the variance. Very clean. Now let us go back and put it into the original form. Remember this quantity? This is where we started, and we got the other expression, and then we deliberately postponed taking the expected value with respect to x in order to simplify the analysis. We would like to get that back. This was the expected value with respect to x of the quantity we just decomposed. So now I take the decomposition and plug it back in, in order to get the expected value of the out-of-sample error in terms of the bias and variance. So what would this be?
This would be the expected value with respect to x of bias of x plus variance of x. The expected value of the bias with respect to x I will just call the bias, and the expected value of the variance I will call the variance. And that is the bias-variance decomposition. So now I have a single number that describes the expected out-of-sample error. Give me a full learning situation: a target function, an input distribution, a hypothesis set, and a learning algorithm, all the components. You learn from one data set; someone else learns from another data set; and we take the expected value of the out-of-sample error. And I can tell you: if this expected out-of-sample error is 0.3, then 0.05 of it is because of bias and 0.25 is because of variance. The 0.05 means that your hypothesis set is pretty good at approximation, but maybe it is too big, and therefore you have a lot of variance, the 0.25. So this is the decomposition. Now let us look at the trade-off of approximation versus generalization in terms of this decomposition; that was the purpose. Here is the bias, explicitly written as a formula, and here is the variance. We would like to argue that there is a trade-off: when you change your hypothesis set, making it bigger and more complex or smaller, one of these goes up and the other goes down. I will argue this informally, and then we will take a specific example where we get exact numbers. This is just to realize that the decomposition actually captures the trade-off of approximation versus generalization. Why is that? Look at this picture. Here I have a small hypothesis set, a single function if you want, but in general let us just call it small. There I have a huge hypothesis set. The black points are our hypotheses, the candidates. Someone gives me a data set, and I learn and choose one of them.
Now, the target function is sitting here. If I use the small set, obviously I am far away from the target function, and therefore the bias is big. If I have the big hypothesis set, big enough that it actually includes f, then when I learn, on average I will be very close to f. Maybe I will not hit f exactly, because of the nonlinearity of the regime: the regime being that I take N examples, learn, and keep the result; take another N examples, learn, and keep the result; and then take the average. I might lose something because of that nonlinearity, so I might not get f itself, but I will get pretty close. So the bias here is very small, close to 0. In terms of the variance: for the small set, there is no variance. If I have one hypothesis, I do not care what data set you give me; I will always return that function, so there is nothing to lose in terms of variance. For the big set, well, it depends: I have so much variety that, depending on the examples you give me, I may pick this one, or with other examples another one, because I am fitting your data. So I get a red cloud around f, whose centroid is g bar, the good one, but I may end up with one point or another, and the size of this cloud measures the variance. This is the trade-off. Now you can see that if I go from a small hypothesis set to a bigger hypothesis set, the bias goes down and the variance goes up. If I make the hypothesis set bigger, I make the bias smaller, because the set gets closer to f and I am able to approximate it better, so the bias diminishes; but the variance goes up. Why does the variance go up? Because the red cloud becomes bigger and bigger: with a bigger set I have more variety to choose from, and I get a bigger variance to deal with. So this is the nature of the trade-off. You may not believe it yet, because I just drew a picture and argued very informally.
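For reference in the concrete example that follows, here is the decomposition collected in one place (in the lecture's notation):

```latex
\operatorname{bias}(\mathbf{x}) = \left(\bar{g}(\mathbf{x}) - f(\mathbf{x})\right)^{2},
\qquad
\operatorname{var}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}}\!\left[\left(g^{(\mathcal{D})}(\mathbf{x}) - \bar{g}(\mathbf{x})\right)^{2}\right]

\mathbb{E}_{\mathcal{D}}\!\left[E_{\text{out}}\!\left(g^{(\mathcal{D})}\right)\right]
= \mathbb{E}_{\mathbf{x}}\!\left[\operatorname{bias}(\mathbf{x}) + \operatorname{var}(\mathbf{x})\right]
= \operatorname{bias} + \operatorname{var}
```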
So now let us take a very concrete example, and we will solve it beginning to end. If you understand this example fully, you will understand bias and variance perfectly. Let us see. I took the simplest possible example that I can solve fully. My target is a sinusoid; that is an easy function, and I want to restrict myself to the interval from minus 1 to plus 1, so I am going to use sine of pi x, scaled so that going from minus 1 to plus 1 gets me the whole action. Formally, the target function maps the interval from minus 1 to plus 1 to the real numbers; the co-domain is the real numbers, but obviously the range of the function will be restricted to between minus 1 and plus 1. Now, the target function is supposed to be unknown; that is what we have been preaching for several lectures, and here I am just giving it to you. Again, this is an illustration. When we come to learning, we will try to blank it out so that it becomes unknown in our mind; but in order to understand the bias-variance analysis, we would like to know what target function we are working with. We are going to get things in terms of it, and then you will understand why the trade-off exists. So the function looks like this, a sinusoid, fine. Now, the catch is the following. You are going to learn this function, and I am going to give you a data set. How big is the data set? I am not in a generous mood today, so I am just going to give you two examples, and from those two examples you need to learn the whole target function. We will try N equals 2. The next item is to give you the hypothesis set, and I am going to give you two hypothesis sets to play with. One of you gets one, the other gets the other, and you try to learn and then compare the results. With only two examples, I cannot give you a 17th-order polynomial, so here is what you get. The two models are H0 and H1. H0 is the constant model: just give me a constant.
I am going to approximate the sine function with a constant. This does not look good, but that is what we are working with. The other model is far more sophisticated; it is so elaborate, you will love it: it is linear. Looks good now, having seen the constant already, right? These are your two hypothesis sets, and we would like to see which one is better. Better for what? That is the key issue. So let us answer the question of approximation first, and then go to the question of learning. Here is the question of approximation: H0 versus H1. When I talk about approximation, I am not talking about learning; I am giving you the target function outright. It is a sinusoid. If it is a sinusoid, why don't I just say it is a sinusoid and have E_out equal 0? Because the rule of the game is that you are using one of the models. You have to use either the constant or the linear model. Do your best, use all the information you have, but if you use the constant model, return a constant; if you use the linear model, return a line. You are not allowed to return anything bigger than those. That is the game. So let us see what happens with H1. Here is the target, and I am trying to fit it with a line, an arbitrary line. Can you picture what it looks like? A line is not much, but at least I can get something like this, right? Try to capture part of the slope, et cetera. I can solve this: take a general line, compute the mean squared error, which will be a function of a and b, differentiate with respect to a and b, and get the optimum. It is not a big deal. You end up with this; that is your best approximation. This is not a learning situation, but this is the best you can do using the linear model. Under those conditions, you made errors, right? These are your errors: you did not get it right, and these regions tell you how far you are from the target. Now let us do it with the other model. I want to approximate this with a constant. What is the constant?
I guess I have to work with 0. That's the best I have. Remember, this is mean squared error. If I move away from 0, the big errors will contribute a lot because they are squared. So I just put it in the middle, and this is your hypothesis. And how much is your error? Big. The whole thing is your error. Let's quantify it. If you get the expected value of the squared error, you get a number, which here will be 0.5 for the constant, and here will be approximately 0.2 for the line. So the linear model wins. Of course, I'm just approximating; give me more and I will do better for sure. If you give me a third-order polynomial, I will be able to do better. If you give me a seventh-order, I'll be able to do better. But that's the game. In terms of approximation, the more the merrier, because you have all the information; there is no question of zooming in. Now let's go for learning. This course is about machine learning, right? Not about approximation. So this is the important part for us. Let's play the same game with a view to learning. You have two examples. You are going to learn from them. You are restricted to one hypothesis set or the other. OK. So let's start with H1, and then go to H0 again. This is your target function. Now you get two examples. I'm going to, let's say, pick two examples uniformly and independently. And I get these two examples. Now I'd like you to fit the examples, and we'll see how well you approximate the target function. The first item of business is to get rid of the target function, because you don't know it. You only know the examples. So in a learning situation, this is what you get. Now I ask you to fit a line. A line through two points? OK, I can do that. This is what you do. Now that you've settled on the final hypothesis, I'm going to grade you. So I'm going to bring back the target function, compare this to that, and give you your out-of-sample error. Let's do it for H0. You have the same two points. You are fitting them with a constant. How would you do that?
Probably the midpoint will give you the least squared error on these two points. So this would be your final hypothesis. And you get back your target function in order to evaluate your out-of-sample error, and this is what you get. Now you can see what the problem is. I can compute the error here, and I can have the yellow regions and all of that, but this depends on which two points I gave you. I could give you another two points, and another two points, et cetera. So I'm not sure how to really compare the models, because the comparison depends on your data set. That's why we needed the bias-variance analysis. That's why we took the expected value of the error with respect to the choice of the data set, so that we are actually talking inherently about the model learning a target from two points, regardless of which two points I'm talking about. So let's do the bias and variance decomposition for the constant guy. Here is the figure. It's an interesting figure. Here I am generating data sets, each of size two, and then fitting a constant, which will be the midpoint of the two values. And I keep repeating this exercise, and I am showing you the final hypothesis you get each time. I repeated it a very large number of times. This is a real simulation, and these are the hypotheses you get. So you can see that when you get this line, it means that the two points were equally distant on either side of it. Sometimes I get both points high, so I get a high line. Sometimes I get them low, so I get a low line. The middle is a little bit heavier, because obviously the chances of getting the two points on opposite lobes, where they average out to something near zero, are good, and so on. So this is basically the distribution you get. Each of these hypotheses will give you an out-of-sample error, and the interesting thing for us is the expected out-of-sample error. That's what will grade the model. So now what we are going to do is get the bias and variance decomposition based on that, and that is our next figure. So now look at this carefully.
The very light green guy is g bar of x. This is the average hypothesis you get. How did I get that? I simply added up all of these guys and divided by their number. And it is expected, obviously, by the symmetry, that on average I will get something very close to 0. So the interesting thing is that you can see now that g bar here happens to be also the best approximation. If I keep repeating this, I will actually get the zero guy, which is what I got when I had the full target function I was approximating. Here I don't have the full target function. I take one hypothesis at a time and get the average, but I end up with the same thing. So there is a justification for saying that g bar will be close to the best hypothesis, because this game of getting one at a time and then taking the average does get me somewhere. But do remember, this is not the output of your learning process. I wish it were. It isn't. The output of your learning process is one of those guys, and you don't know which. It just happens that if you repeat the process, this will be your average. And because you are getting different guys here, there will be a variance around this, and I'm describing the variance basically by the standard deviation you are going to get. So the error between the green line and the target function will give you the bias, and the width of the gray region will give you the variance. Now we understand what the analysis is. So that takes care of H0. Let's go to H1. To remind you, the learning situation for H0 was this. This is when I had the constant model. What will happen if you fit the two points, not with a constant, which you do at the midpoint, but with a complete line? What will it look like? It will look like this. Wow, you can see where the problem is. Talk about variance. You take two points and connect them, and wherever the two points are, you get this jungle of lines.
This is for exactly the same data sets that gave me the horizontal lines in the previous slide. So this is what I get. So now I ask myself, what on average will I get? On average, you'd better get a positive slope, because there is a tendency to get a positive slope: when you get one point on each lobe, you will get a positive slope. Sometimes you get a negative slope, when both points sit on the same lobe, but that is balanced by the positive-slope cases. You can argue this, or you can do the math. And then you get the bias-variance decomposition. So this will be your average. This is g bar. And this will be the variance you get. The variance depends on x. That is the way we defined it. And when you want a single variance to describe it, you take the expected value of the square of that gray region. The gray region shows the standard deviation. So now you can see exactly that I am getting a better approximation than the previous guy, but I'm getting a very bad variance, which is expected here. So now you can see what the tradeoff is. And the question is, given these two models, which one wins in a learning scenario? You need to remember what the question is. I am trying to approximate a sinusoid. Is it better to do it with a constant or with a general line? The answer to that question is obvious, but that is not the question I am asking in learning. The question I'm asking in learning is: you have two points coming from something I don't know. Is it better to use a constant or a line? You notice the difference. So I'm going to put them side by side and see which is the winner. This guy has a big bias and a small variance. This guy has a small bias and a big variance. And let's get quantitative. What is the bias here? It's actually 0.5, exactly the same as we got when we were approximating outright. The average, g bar, is 0, and against the target you get 0.5 mean squared error. What is the bias here? It's 0.21. Interestingly enough, when we did the approximation, it was about 0.2.
And indeed, this is not exactly the best fit. Remember when I told you there is a nonlinearity aspect: you are taking two points at a time, fitting them, and then taking the average, and it's conceivable that this will give you something different from fitting the full curve outright. The difference is usually very small, and it is here. So you get something which is not exactly the best fit, but very close to it. Obviously, here the bias is much smaller. Let's look at the variance. What is the variance here? The variance here is 0.25. It's not too bad. The variance here, we expect it to be bigger, but is it big enough to kill us? It's a disaster. Complete and utter disaster. And now when you want the expected out-of-sample error, you add these two numbers. Here I'm going to get 0.75, and here you are going to get something much bigger. And the winner is? So now you go to your friends and tell them that I learned today that in order to approximate a sine, I am better off approximating it with a constant than with a general line, and have a smile on your face. Of course, you know what you are talking about, but they might not really appreciate the humor. This is the game, and I think we understand it well. So the lesson learned, if I want to articulate it, is that when you are in a learning situation, you are matching the model complexity to the data resources you have, not to the target complexity. I don't know the target, and even if I knew the level of complexity it has, I don't have the resources to match it. Because if I match it, I will have the target in my hypothesis set, but I will never arrive at it. It's pretty much like I'm sitting in my office, and I want a document of some kind. Someone has asked me for a letter of recommendation, and I don't want to rewrite it from scratch, so I want to take the old one and just see what I wrote and then add the update to that.
Before everything was archived on computers, it used to be a piece of paper, so I know that the letter of recommendation is somewhere. Now, should I write the letter of recommendation from scratch, or should I look for the old one? The letter is there. It's much better when I find it. However, finding it is a big deal. So the question is not whether the target function is there; the question is, can I find it? Therefore, when I give you 100 examples, you choose the hypothesis set to match the 100 examples. If the 100 examples are terribly noisy, that's even worse, because their information to guide you is worse. So that's what I mean by the data resources you have. The data resources are what you have in order to navigate the hypothesis set. Let's pick a hypothesis set that we can afford to navigate. That is the game in learning. Done with the bias and variance, now we are going to take an illustrative tool called the learning curves, and then we are going to put the bias-variance analysis and the VC analysis on those curves. So what are the learning curves? They are related to what we think of intuitively as a learning curve, but here they are a technical term. We are basically plotting the expected values of E out and E in. We have done E out already, but here we also plot the expected value of E in, as a function of N. So let's go through the details. I give you a data set of size N. We know what the expected value of the out-of-sample error is. We have seen that already in the bias-variance decomposition, and this is the quantity. I know this is the quantity that I will get in any learning situation. It depends on the data set. If I want a quantity that describes just the size of the data set, I will integrate the data set out and get the expected value with respect to D. That's the quantity I have. And the other one is exactly the same, except it's in-sample. We didn't use it in the bias-variance analysis.
For this one, I'm going to get the expected value of the in-sample error. I want to know, given this situation, if I give you N examples, how well are you going to fit them? Well, it depends on the examples, but on average, this is how well you are going to fit them. And you ask yourself, how does this vary with N? That is what the learning curve shows. As you get more examples, you learn better, so hopefully the learning curve looks better and better. We'll see what it looks like. So let's take a simple model first. Because it's a simple model, it does not approximate your target function well. The best out-of-sample error you can achieve is pretty high. When you learn, the in-sample will be very close to the out-of-sample. So let's look first at the behavior as you increase N. As you increase N, hopefully the out-of-sample error goes down. I have more examples to learn from. I have a better chance of approximating the target function. And indeed, it can go down and down until it gets to the absolute limit of your hypothesis set. The hypothesis set is very simple. It doesn't have a very good approximation for your target. This is the best it can do, and the best you can do is the best you can do. So that's what you get. When you look at the in-sample error, it actually goes the other way around, because here my task is simpler than there. Here, I'm trying to fit five examples. There, I'm trying to fit 20 examples. And I only have the examples to fit. I'm not looking at the target function or anything like that. So obviously, I can use my degrees of freedom in the hypothesis set to fit the five examples better and get a smaller in-sample error. Whereas if I increase N, I will get a worse in-sample error. It doesn't bother me, because the in-sample error is not the bottom line. The out-of-sample error is. And as you can see, although I'm getting worse in-sample, I'm getting better out-of-sample.
And indeed, the discrepancy between them, which is the generalization error, is getting tighter and tighter as N increases. Completely logical. By the way, this is a real model. When we talk about overfitting, I will tell you what that model is, the simple model and the complex model. The complex model has exactly the same behavior, except it's shifted. Because it's a complex model, it has a better approximation for your target function, so it can achieve, in principle, a better out-of-sample error. You have so many degrees of freedom that you are able to fit the training set perfectly up to here. This corresponds more or less to the VC dimension. Up to the VC dimension, you can shatter everything. So you can shatter these guys, you can fit them perfectly, and you get zero in-sample error. You start compromising when you have more points than you can shatter, and you end up starting to have in-sample error. The in-sample error goes up, and the out-of-sample error goes down. The interesting thing is that in this early region, I fit the examples perfectly. I'm so happy. And what is the out-of-sample error? An utter disaster. Absolutely no information. We didn't learn anything. We just memorized the examples. So here, again, the out-of-sample error goes down and the in-sample error goes up, by the same argument exactly. They get closer together, but obviously the discrepancy between them is bigger, because I have a more complex hypothesis set. Therefore, the generalization error is bigger. The bound on it is bigger in the VC analysis, the actual value is bigger, and so on. So this is the analysis. It's a very simple tool, and the reason I introduced it here is that I want to illustrate the bias-variance analysis versus the VC analysis using the learning curves. It will be very illustrative to understand how the two theories relate to each other. So let's start with the VC analysis on the learning curve. These are the learning curves. The in-sample error goes up, as promised.
The out-of-sample error goes down. There is a best approximation that corresponds to this level of out-of-sample error, which we could achieve if we actually knew the target. And what did we do in the VC analysis? We had the in-sample error, which is the height of this region. And then we had a bound on the generalization error, which is omega. And we said that the bound behaves the same way as the quantity itself. So the bound actually will not be this size; it's way bigger. But in proportion, it gives us the same behavior. So as you increase N, the generalization error goes down, and the bound on it goes down. Omega goes down, which we already realized. And obviously, you can take another model that is very complex, and the discrepancy between the curves becomes bigger, which agrees with that. So this is the decomposition. Now, I took some liberties in order to be able to do this. The VC analysis doesn't have expected values, so I took expected values of everything there is. There is some liberty taken in order to fit it in that diagram. But the principle holds: the blue region is the in-sample error, and the red region is basically the omega. That is what happens in the generalization bound. Think for a moment which region will be blue and which region will be red in the bias-variance analysis. I'll use exactly the same curves, the same model. So what will it be? It will be this. That's the difference. In the bias-variance analysis, I got the bias based on the best approximation. I didn't look at how you perform in-sample. I assumed, hypothetically, that you could look for the best possible approximation, and I charged you the bias for that. So this is the best you can do, and this is the error you are making. Again, there is a liberty taken here, because this line is genuinely the best approximation in your hypothesis set, while the one I am using for the bias-variance analysis is the error of g bar.
And we said, OK, g bar will be close in error to this guy. It may not even be in the hypothesis set. So there is some liberty, but it's not a huge liberty. This is very much close to what you are getting in the bias-variance analysis. And the rest of it is the variance, because you take the bias plus that, and you get the expected value of the out-of-sample error. Now you can see why they are both talking about the same thing. Both of them are talking about approximation. That's the blue part. Here, it's approximation overall, and here, it's approximation in-sample. And both of them take into consideration what happens in terms of generalization. Well, the red region here is maybe twice the size. Not twice the size in general; it will be twice the size, actually, in the linear regression example. But basically, they have the same behavior, just different scales. They capture the same principle of generalizing: the uncertainty about which hypothesis to pick, or how much I lose in going from in-sample to out-of-sample. So they have the same behavior. And the only difference here is that the bias, obviously, is constant with respect to N. The bias depends on the hypothesis set. Now, this is also an assumption, because it says: if I have two examples at a time and take the average, I will get an error. If I have 10 examples at a time and take the average, I'll get an error. Is it the same? Well, in both cases, you effectively used an infinite number of examples, because in the first case, you used 2 at a time, repeated it an infinite number of times, and took an average; in the second, you used 10 at a time and took an average. OK, I grant you that maybe the 10 will give you a better situation. But again, it's a little bit of a license in order to be able to attribute the bias and variance to this line, which happens to be the best hypothesis proper within your hypothesis set.
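Before moving to linear regression, the sinusoid experiment above can be reproduced end to end. This is a sketch, not the lecture's actual code: it assumes x uniform on the interval from minus 1 to plus 1, mean squared error, and N equals 2 points per data set, and its Monte Carlo estimates should land near the numbers quoted in the lecture (bias about 0.5 and variance about 0.25 for the constant model H0, bias about 0.21 and a disastrous variance around 1.69 for the linear model H1):

```python
import numpy as np

rng = np.random.default_rng(0)
n_datasets = 20_000
xg = np.linspace(-1.0, 1.0, 201)        # grid for averaging over x
fg = np.sin(np.pi * xg)                 # target f(x) = sin(pi x)

# Each data set is two points sampled uniformly and independently.
x = rng.uniform(-1, 1, size=(n_datasets, 2))
y = np.sin(np.pi * x)

# H0: the best constant for two points is their midpoint (the mean).
G0 = np.broadcast_to(y.mean(axis=1)[:, None], (n_datasets, xg.size))

# H1: the line through the two points.
slope = (y[:, 1] - y[:, 0]) / (x[:, 1] - x[:, 0])
intercept = y[:, 0] - slope * x[:, 0]
G1 = slope[:, None] * xg + intercept[:, None]

results = {}
for name, G in (("H0", G0), ("H1", G1)):
    gbar = G.mean(axis=0)               # the average hypothesis g_bar(x)
    bias = np.mean((gbar - fg) ** 2)    # E_x[(g_bar(x) - f(x))^2]
    var = np.mean((G - gbar) ** 2)      # E_x[E_D[(g(x) - g_bar(x))^2]]
    results[name] = (bias, var)
    print(name, round(bias, 2), round(var, 2), round(bias + var, 2))
# Expected, roughly: H0 -> bias 0.5, variance 0.25, total 0.75;
#                    H1 -> bias 0.21, variance near 1.69, total near 1.9.
```

The only modeling choices are the two fitting rules, midpoint for H0 and the connecting line for H1, which are exactly the least-squares fits to two points.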
So this is the contrast between the two theoretical approaches that we have covered in this lecture and the previous three lectures. I'm going to end with the analysis for the linear regression case, and I'm going to go through it fairly quickly. This is a very good exercise to do. If you read it and follow the steps, it will give you very good insight into linear regression. I'll try to explain the highlights. So let's start with a reminder of linear regression. For linear regression, I'm using a target, and for the purpose of simplification, I am going to use a noisy target which is linear plus noise. So I'm using linear regression to learn something that is linear plus noise. If it weren't for the noise, I would get it perfectly; it's already linear. But because of the noise, I will be deviating a little bit. This is just to make the resulting mathematics easier to handle. Now, you are given the data set, and the data set is a noisy data set. Each of these examples is picked independently. Each y depends on its x, and the only unknown here is the noise: you take the value of the linear part, which gives you the mean, and then you add noise to get the y. Now, do you remember the linear regression solution? Regardless of what the target function is, you look at the data, and this is what you get for the solution. You take the input data set and the output data set, and you do this algebraic combination, the pseudo-inverse of X applied to y. Whatever comes out is the output of linear regression. This is your final hypothesis. We've done that. And now we are going to think about the notion of the in-sample error, not the in-sample error as a summary quantity, but the in-sample error pattern. How much error do I get on the first example? How much error do I get on the second, the third, et cetera? Just for a little bit. So what would that be? Well, that would be based on the final hypothesis. So I apply the final hypothesis to the input points.
I'm going to get a pattern of values that my hypothesis is predicting. I compare them to the actual targets, which happen to be stored in the vector y. And that gives an error pattern: plus something, minus something, plus something, minus something. And if I take the squared values and average them, I will get what we call the in-sample error. For the out-of-sample error, I'm going to play a simplifying trick here in order to get the learning curve in the finite case. In order to get the out-of-sample error, I'm going to just generate the same inputs, which is a complete no-no for out-of-sample. Supposedly, out-of-sample, you get points that you haven't seen before, but you have seen these x's before. The redeeming value is that I'm going to give you fresh noise. That's the unknown part, and that is what allows me to say that it plays the role of out-of-sample. So I'm going to generate another set of points with different noise, but on the same inputs, in order to simplify the analysis. You see that the x's are involved here, and if I use the same inputs, things will simplify. In that case, if you ask yourself what the out-of-sample error is, it's exactly the same expression. I evaluate the hypothesis on the same points, which now happen to be the out-of-sample points, and I compare with y-dash, which is exactly the same thing except with noise-dash, another realization of the noise. So this is the outline of the setup that gets us the learning curves we want. When you do the analysis, which is not that difficult at all, you get this very interesting curve. So this is the learning curve, and it has very specific values. Sigma squared is the variance of the noise. This is the best you can do. I expect that, because you told me the target is linear, so I can get that part perfectly. But then there is this added noise, and I cannot capture the noise.
The variance of the noise is sigma squared, so this is the error that is inevitable. Look at the in-sample error. Up to N equals d plus 1, you are perfect. Of course I'm perfect, because I have d plus 1 parameters, and I'm fitting fewer points than that, so I can fit them perfectly. It doesn't mean much for the out-of-sample error, but that's what I get. I start compromising when I get more points. And as I go with more points, I'm fitting less and less of the noise. The noise averages out, and I'm getting very, very close to the situation as if there were no noise, because the pattern that persists is the linear guy, and the noise, as I get more examples, more or less cancels out in the fitting. I don't have enough degrees of freedom to fit all of it, so I get an in-sample error, until eventually I get to the point where it's as if I'm doing it perfectly. And the out-of-sample error goes down. And there is a very specific formula that you can get, which is interesting, so let me finish with this. The best approximation error is sigma squared. That's the line, right? What is the expected in-sample error? It has a very simple formula, in which everything is scaled by sigma squared: sigma squared times 1 minus the ratio of d plus 1 over N. So what you have here is almost perfect, and you are doing better than perfect by this amount, the ratio of d plus 1 over N. Remember what d plus 1 was? For the perceptron, it was the VC dimension. Here, it's also a VC dimension of sorts, the number of degrees of freedom that linear regression has. So you divide the degrees of freedom by the number of examples, and that is the factor you get. And you realize that sigma squared is the best you can do, and here you are doing better than the best. Why is it better than the best? Because I'm not trying to fit the whole function. I am only fitting the finite sample. So I'm doing very well, and I'm very happy about it. Little do I know that I'm actually harming myself, because what I'm doing here is fitting the noise.
And as a result of that, I'm deviating from the optimal guy, and I'm paying the price in out-of-sample error. And what is the price I'm paying in out-of-sample error? It is the mirror image: sigma squared times 1 plus the ratio of d plus 1 over N. I lose in out-of-sample exactly what I gained in-sample. And the most interesting quantity is the summary quantity: the expected generalization error. The generalization error is the difference between this and that, and I have the formulas for them, so all I need to do is subtract. Let me magnify this. The generalization error is 2 sigma squared times d plus 1 over N. It has the form of the VC dimension divided by the number of examples, and in this case, it's exact. This is what I promised last time. I told you that this rule of proportionality between the VC dimension and the number of examples persists, to the level where sometimes you just divide the VC dimension by the number of examples, and that gives you a generalization error. This is the concrete version of it. In spite of the fact that d plus 1 here is not strictly a VC dimension, because this is a real-valued problem, it is the number of degrees of freedom, so it plays the role. And you can see that this is indeed the compromise between the degrees of freedom I have in the case of linear regression and the number of examples I am using. OK? So we will stop here, and we will go into questions and answers after a short break. OK. Let's go into the questions. Right. So the first question is if you can go back to slide 19. The question is if you can explain how complex models are better than simple models. OK. Better at something. I think the key issue in the theory is that there is a tradeoff. Nothing is better on all fronts, and nothing is worse on all fronts. So let's compare the simple model and the complex model in terms of the ability to approximate, whether that ability to approximate is in-sample or whether it is absolute. What is the ability to approximate in the absolute sense?
Here is my hypothesis set, and I have a target function. The horizontal line, that height, gives you the error of the best approximation. So if you go from a simple model to a complex model, you will be able to approximate better. That is obvious. And that is also inherited if your approximation is focused only on the training examples. In that case, you are comparing not the horizontal lines, but the blue curves. This is the error you make in approximating the sample you get. And again, the approximation by the simple model is worse than the approximation by the complex model. So if your game is approximation, and that's your purpose, then obviously the complex model is better. You can also ask yourself about the generalization ability. The generalization ability would be the discrepancy between the blue curve and the red curve; that would be the VC analysis, how much I lose in going from in-sample to out-of-sample. Or, in the case of the bias-variance analysis, how much I lose from the perfect approximation in getting E out, because of my inability to zoom in on the right hypothesis; that would be this area here. So whether you are taking the difference between the red curve and the blue curve, or the difference between the red curve and the black line, that area is smaller here than there. Therefore, the simple model is better as far as generalization is concerned. Now, because it's a tradeoff, and I have one quantity better and one quantity worse, the question is, when I put them together, which model is better? Because the bottom line in learning is the red curve. That's what I care about. This is the performance of the system that I'm going to deliver to my customer, and they are going to test it out-of-sample. And if they get it right, they will be happy. Now, because I have two quantities that I'm adding, and one of them is going down and one of them is going up, it is obvious that the sum could go either way.
In this case, you can see that it does go either way. For example, if you have few examples, then E out here is not great, but it's decent. If you have the same number of examples here, E out is a disaster. So if you have few examples, you simply cannot afford the complex model. You are better off working with the simple model, and you will get a better out-of-sample error. If I give you a much bigger resource of examples, out here, then this model is limited by the fact that it's simple. It cannot get any better. It has all the information, it zooms in perfectly, but it cannot get any better. The complex model now gets to use its degrees of freedom properly and gets you to a smaller value. So for a larger number of points, you get better performance here. That's why we are saying that you should match the complexity of the model to the data resources you have, which in this case is represented by N. We may be talking about different target functions and different situations, but in choosing one model or another, what really dictates the performance is the number of examples versus the complexity of the model. When you did the analysis for linear regression, if you did it using the perceptron model, would you get the same generalization error? Let's look at that. The bias-variance analysis, and this is inherited in the learning curves, is very clean when you use mean squared error. Obviously, you can use mean squared error with the perceptron. There will be a correspondence, but the ability to get such a clean formula really depends on the very particulars of linear regression. If you go back to the previous slide where the assumptions are, it was very critical to assume that the out-of-sample is done this way, and to make the target very specifically linear plus noise, in order to be able to simplify. The result, by the way, holds in general asymptotically.
So if you take genuine out-of-sample, which means that you pick different points, you will get a different matrix, X-dash. You'll apply the w that you got from in-sample to X-dash, and compare with y-dash. The problem is that when you plug it in and try to get a formula, the formula will depend on how X-dash relates to X. When they are the same, things cancel out neatly and you get the formula that I had. But asymptotically, if you make certain assumptions about how X is generated and you take the asymptotic result, you will get the same thing. The short answer is the following: it is the exact setup that I gave that yields these very neat results. It's very specific to linear regression, and very specific to the choice of out-of-sample as I did it, if you want the answer exactly in the finite case. If you use a perceptron, you will be able to find a parallel, but it may not be as neat. Quick clarification: sigma squared is the variance of the noise in the... Yeah. OK, I just realized that I've been saying bias variance, bias variance, the lecture is called bias-variance, and now we have the variance of the noise. Obviously I'm so used to these things that I didn't notice. When I say the variance here, it has absolutely nothing to do with the bias-variance analysis that I talked about. It's noise, and I'm trying to measure the energy of it. It's zero-mean noise, so its energy is given by the variance, sigma squared. I should perhaps have called it the energy of the noise in order not to confuse people, but I hope that I did not confuse too many. Can the bias-variance analysis be used for model selection? OK. The bias-variance analysis, just because it is so specific, actually assumes that you know the target function if you want to get the quantities explicitly. So for example, in linear regression, I assumed the form is linear plus noise.
For the sinusoidal case, we got the answers and we were able to choose, but we actually knew that it was a sinusoid. So the bias-variance analysis is taken as a guide, but it's a very important guide, because I can ask myself: I want to get E out down, and now I know that there are two contributing factors, bias and variance. Can I get the variance down without getting the bias up? That's a whole bunch of techniques; regularization will belong to that category. Can I get both of them down? That would be learning from hints. There would be something that affects both of them, and so on. So you can map different techniques to how they affect the bias and the variance. I would say that in terms of any application to a learning situation, it's a guideline rather than something that I'm going to plug in and that will tell you what the model is. The answer for model selection is mostly through validation, which we're going to talk about in a few lectures. And that is the gold standard for the choices you make in a learning situation, including choosing the model. A question getting a little bit ahead, about methods where you use ensembles, like boosting or something: is there a reason under this analysis why those methods work? I almost included this in the lecture, but I thought it was one too many. So look at the idea of g-bar, and let me try to get to its definition. This was just a theoretical tool of analysis. I have g-bar(x) equal to the expected value of g^(D)(x) over data sets D, and if I want to do it with a finite number of data sets, I sum up g^(D_1)(x) through g^(D_K)(x) and normalize by 1/K. Now although this was just a theoretical way of getting the bias-variance decomposition, and a conceptual way of understanding what it is, there is an ensemble learning method that builds exactly on this, which is called bagging: bootstrap aggregation. And the idea is, what do I need in order to get g-bar?
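(The g-bar just defined can be computed directly for the lecture's sinusoid example: target sin(pi x) on [-1, 1], two-point data sets, and the line through the two points as the hypothesis. A large finite K stands in for the expectation over data sets; the result should land near the values the lecture computed for this example, bias of about 0.21 and variance of about 1.69.)

```python
import numpy as np

rng = np.random.default_rng(2)
K = 10000  # stand-in for "infinitely many" two-point data sets

# target f(x) = sin(pi x); hypothesis: the line through the two data points
slopes = np.empty(K)
intercepts = np.empty(K)
for k in range(K):
    x0, x1 = rng.uniform(-1.0, 1.0, 2)
    y0, y1 = np.sin(np.pi * x0), np.sin(np.pi * x1)
    a = (y1 - y0) / (x1 - x0)
    slopes[k], intercepts[k] = a, y0 - a * x0

x = np.linspace(-1.0, 1.0, 201)
f = np.sin(np.pi * x)
gbar = slopes.mean() * x + intercepts.mean()   # the average hypothesis g-bar

bias = float(np.mean((gbar - f) ** 2))
preds = slopes[:, None] * x[None, :] + intercepts[:, None]
var = float(np.mean((preds - gbar) ** 2))
print(bias, var)   # roughly 0.21 and 1.69
```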
We said g-bar is great if I can get it, but it requires an infinite number of data sets, and I have only one data set. So the idea of bagging is that I am going to use my one data set to generate a large number of different data sets. How am I going to do that? Well, that's bootstrapping. Bootstrapping always looks like magic, because you know where the expression comes from: you try to lift yourself by pulling on your bootstraps, which you obviously cannot do, because you are the one pulling, but that's what you do. Here you are trying to create something out of what is already in there. And in this particular case, what you do is sample randomly, with replacement, from your data set in order to get different data sets, and then average. And believe it or not, that actually gives you a dividend; it gives you some of the benefit of ensemble learning. There are other, obviously more sophisticated, methods of ensemble learning, and one way or another they appeal to the fact that you are reducing the variance by averaging a bunch of things. So you can say that they either take this outright, like bagging, or are inspired by it in some sense: it's a good idea to average, because you cancel out fluctuations. If you use the Bayesian approach, does this bias-variance dilemma still appear? Repeat the question, please. If you use a Bayesian approach, does this bias-variance dilemma still appear? Okay. The bias-variance dilemma is there to stay. It's a fact. We can take a particular approach, and then we are perhaps going to find an explicit expression for the bias and an explicit expression for the variance, but nothing will change about the nature of things because of the approach I take. Now, the Bayesian approach is very particular, because it makes certain assumptions, and after you make these assumptions, you can answer all questions perfectly. So you can answer questions like that, and other questions as well.
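(A minimal sketch of the bagging idea, with a toy target and polynomial learner of my own choosing: resample the single data set with replacement, fit each resample, and average. To exhibit the variance reduction from averaging, the sketch repeats the whole procedure over independent data sets and compares how much a single bootstrap model fluctuates versus the bagged average.)

```python
import numpy as np

rng = np.random.default_rng(3)

def data_set(n=25):
    # hypothetical noisy target, my own choice for illustration
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2.0 * np.pi * x) + 0.3 * rng.standard_normal(n)
    return x, y

x_test = np.linspace(0.1, 0.9, 50)
runs, B, degree = 60, 50, 5

one_boot, bagged = [], []
for _ in range(runs):
    x, y = data_set()
    boot_preds = []
    for _ in range(B):
        idx = rng.integers(0, len(x), len(x))   # bootstrap sample, with replacement
        coeffs = np.polyfit(x[idx], y[idx], degree)
        boot_preds.append(np.polyval(coeffs, x_test))
    one_boot.append(boot_preds[0])              # a single bootstrap model
    bagged.append(np.mean(boot_preds, axis=0))  # the bagged average of all B

# variance of the predictions across independent runs, averaged over x_test
var_one = float(np.mean(np.var(one_boot, axis=0)))
var_bagged = float(np.mean(np.var(bagged, axis=0)))
print(var_one, var_bagged)   # the bagged average fluctuates much less
```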
And I will talk about the Bayesian approach in the very last lecture of the course, so I will defer answers that are specific to it until that point. But basically, the answer to this very specific question is: it's like asking, does the VC dimension change if you apply the Bayesian approach? You apply the Bayesian approach, you add a bunch of assumptions, but the VC dimension is there. Maybe by using the Bayesian approach you'll be able to find more direct quantities to predict what you want, but the VC dimension is there, because it's defined in a general setup. A question about the relation with numerical function approximation: in that field, there's interpolation and extrapolation. When is there extrapolation in machine learning? Function approximation is one of the fields that is very much related, because you are given a finite sample coming from a function and you're trying to approximate it, and this is one of the applications. And in general, interpolation is easier than extrapolation, because you have a handle on it. If you want to articulate that in terms of the stuff we have: the variance in interpolation is smaller than the variance in extrapolation, in general. Remember the lines in the sinusoid example; they're all over the place. If you take two points and I'm fitting them with a line, then between the two points I'm very much in good shape, because the sine goes this way and I go this way, so it's not that big a deal. The further out you go, the more fluctuation there is, and that is reflected in the extrapolation. Okay, so basically, when the variance is big, we know we're extrapolating; is that the answer? No, I would say there is an association between them. To answer this specifically, you need to understand the particular case. There may be cases where the extrapolation doesn't have a lot of variance, and whatnot. I'm just trying to map, in general, what the quantity here corresponds to in that field.
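(The fan-out of the lines can be measured directly: using the same two-point line fits to sin(pi x) on [-1, 1], compare the variance of the predictions at a point inside the training range with a point beyond it. The specific evaluation points are my own choice for illustration.)

```python
import numpy as np

rng = np.random.default_rng(5)
K = 20000

# two-point data sets from the sinusoid, each fit with a line
slopes = np.empty(K)
intercepts = np.empty(K)
for k in range(K):
    x0, x1 = rng.uniform(-1.0, 1.0, 2)
    y0, y1 = np.sin(np.pi * x0), np.sin(np.pi * x1)
    a = (y1 - y0) / (x1 - x0)
    slopes[k], intercepts[k] = a, y0 - a * x0

def pred_var(x):
    # variance of the predictions g(x) = a x + b across data sets
    return float(np.var(slopes * x + intercepts))

inside = pred_var(0.0)    # interior of the sampled range [-1, 1]
outside = pred_var(2.0)   # extrapolating beyond it: the lines fan out
print(inside, outside)
```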
So the problem with extrapolation can be posed, in this picture, in terms of more variance than interpolation, but I'm not making a mathematical statement that this is guaranteed to be the case. Could you explain what the literature means by the bias-variance-covariance dilemma? Okay. You can pursue this analysis a little bit further to cases where you have cross terms; particularly for boosting this is the case. Then there is the question: I'm trying to get these hypotheses that I'm going to average in order to get the final hypothesis; that's my game. Now, it would be nice if I could get them to be independent, because when I get them to be independent, adding them up reduces the variance in a very good way. But in general, when you actually apply some of these algorithms, there is a correlation between one and another, so there's a covariance, and there's a question of the balance between the two. But in terms of application, it really relates more to ensemble learning than to the general bias-variance analysis, because in the bias-variance analysis, when I really did it, I had the luxury of picking independently generated data sets, generating independent hypotheses, and then averaging them, because it's a conceptual exercise. But when you are actually using a technique where you construct these hypotheses based on variations of the one data set, then the covariance starts playing a role. A question about, I guess, naming: is linear regression actually learning, or is it just fitting, along the lines of function approximation? Linear regression is a learning technique, and fitting is the first part of learning. You always fit in order to learn. The only added thing is that you want to make sure that, as you fit, you also perform well out of sample. That's what the theory was about. I've been spending four lectures trying to make sure that when you do the intuitive thing, I give you data, you fit them.
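(The covariance effect just described can be put in one standard formula: for K averaged models, each with variance sigma squared and pairwise correlation rho, the variance of the average is sigma^2/K + ((K-1)/K) * rho * sigma^2, so it stops shrinking once the correlated part dominates. A quick numerical check, with synthetic correlated "models" of my own construction:)

```python
import numpy as np

rng = np.random.default_rng(6)
K, trials, rho = 10, 200000, 0.5

# K exchangeable "models", each with unit variance and pairwise correlation rho,
# built from one shared fluctuation plus K independent ones
shared = rng.standard_normal(trials)
own = rng.standard_normal((trials, K))
models = np.sqrt(rho) * shared[:, None] + np.sqrt(1.0 - rho) * own

var_avg = float(np.var(models.mean(axis=1)))
formula = 1.0 / K + (K - 1) / K * rho   # = 0.55 here
independent = 1.0 / K                   # = 0.10, if they were uncorrelated
print(var_avg, formula, independent)
```

With rho = 0, averaging K models would cut the variance by a factor of K; with correlation, the second term puts a floor under it.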
You could do that without taking a machine learning course. So now I'm telling you that you have to have the checks in place, such that when you fit in sample, something good happens in what you care about, which is out of sample. All right, I think that's it. Very good. So we'll see you next week.