The following program is brought to you by Caltech. Welcome back. Last time, we introduced the main result in learning theory. And there were two parts. The first part is to get a handle on the growth function, m sub h of n, which characterizes the hypothesis set h. And the way we got a handle on it is by introducing the idea of a breakpoint, and then bounding the growth function in terms of a formula that depends on that breakpoint. There was a simple recursion that you can recall from this figure. And then we finally found the formula that upper-bounds the growth function, given that it has a breakpoint k. And it's a combinatorial formula that is fairly easy to understand. And the most important aspect about it, as far as the theory is concerned, is that this is polynomial. It is bounded above by a polynomial in n, since k is a constant. And if you look at this, this is indeed a polynomial, and the maximum power you have in this expression is n to the k minus 1. So not only is it polynomial, but also the order of the polynomial depends on the breakpoint. We were interested in the growth function because it was our way of characterizing the redundancy that we need to understand in order to be able to switch from the Hoeffding inequality to the VC inequality. And the VC inequality will be the one that handles learning proper. And in order to do that, we looked at the bad events that are characterized by a small area according to Hoeffding. And then we went here and looked at the redundancy that results from the fact that the different hypotheses have by and large overlapping bad regions. And the way to characterize this was through the growth function. 
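As an aside (this sketch is my addition, not part of the lecture), the recursion and the combinatorial formula from the review can be checked numerically. Here B(n, k) is the recursive bound on the number of dichotomies on n points when k is a breakpoint, and poly_bound is the sum of binomial coefficients; solving the recursion with equality reproduces the formula exactly:

```python
from math import comb
from functools import lru_cache

# B(n, k): the recursive upper bound on the number of dichotomies
# on n points when k is a breakpoint.
@lru_cache(maxsize=None)
def B(n, k):
    if k == 1:
        return 1      # breakpoint of 1: no point can take both labels
    if n == 1:
        return 2      # one point, breakpoint >= 2: both labels allowed
    return B(n - 1, k) + B(n - 1, k - 1)

def poly_bound(n, k):
    # the combinatorial formula: sum of C(n, i) for i = 0, ..., k - 1
    return sum(comb(n, i) for i in range(k))

# the recursion, solved with equality, matches the formula exactly
for n in range(1, 15):
    for k in range(1, n + 1):
        assert B(n, k) == poly_bound(n, k)
```

The top term of the sum is C(n, k minus 1), which grows like n to the k minus 1, matching the order of the polynomial quoted above.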
And after an argument that took the redundancy and related it to the growth function, and then got rid of a technical problem with E out, if you recall that one, we ended up switching completely from the Hoeffding inequality, which is the top one, into the VC inequality, which is the final theoretical result in machine learning, the characterization of generalization. And they are very similar, except for a fundamental difference, which is here, and technical differences which are in the constants. So as you go through the proof, you will have to change 2's into 4's into smaller epsilon and whatnot. But the main thing is that instead of the number of hypotheses, capital M, we were able to replace it by the growth function. And we had a final technical finesse, because we took the growth function not on n points, but on 2n points, in spite of the fact that we have only n points in the sample. We needed to have 2n in order to have another sample, and carry the argument, not for the single sample that we have, but for the difference between two samples. And that got rid of the technicalities that we alluded to, which is the role of E out that really destroys the utility of the growth function, because the growth function depends on dichotomies, and E out depends on the full hypothesis itself. OK, so this is where we stand. So in today's lecture, I'm going to put this together in the main notion of the theory, which is the VC dimension. It will not be a new notion for you. It's very much related to the breakpoint. But it is the quantity that you are going to remember from all of this after a while. So you may forget about the recursion. You may forget about the growth function. But you will remember the VC dimension. And when you are in a learning situation, you ask yourself, what is the VC dimension? Is it 7 or 10? And then you say, oh, this guy is using a hypothesis set with VC dimension 5,000. He must be crazy, and so on. 
So this will be the currency we take out of the theory in order to use in a real learning situation. So the topics for today: first, I'm going to give the definition. This will be an easy definition. And I'm going to discuss it a little bit to make sure that everybody understands it. And then we are going to spend some time getting the VC dimension of perceptrons. We will be able to compute the VC dimension exactly for perceptrons in any dimension. So it doesn't have to be the two-dimensional space like the one we did before. You take any dimension, and we'll get the value of the VC dimension exactly. This is a rare case, because usually when we get the VC dimension, we get a bound on the VC dimension, just out of the practicality of the situation. But here, we'll be able to get it exactly. And that will help us in going through the interpretation of the VC dimension. We will ask ourselves: now that we understand it, and have computed it for a concrete case that we are familiar with, what does it signify, and how do we apply it in practice? This will be the subject of the interpretation. Finally, I will spend the last few minutes of the lecture transforming the theory into a form that is extremely simple, and it's very easy to remember. And this is the one that will survive with us for the rest of the course, and we'll be able to relate it to different theories and techniques as we go. So let's start with the definition. The VC dimension is a quantity that is defined for a hypothesis set. You give me a hypothesis set. I return a number, which I call the VC dimension. And the notation for it will be d, as in dimension, sub VC, as in Vapnik-Chervonenkis. And it is applied to script H. And every now and then, we will drop the dependency on script H, where it is clear from the context. So we don't have to carry this sort of long notation. We will just say d sub VC, and we'll understand that this is the VC dimension. What is it? 
In words, it is the most points you can shatter. That is not a foreign notion for us. So if you can shatter 20 points, and that is the most you can do, then the VC dimension is 20. In terms of the technical quantities we defined, this would be the largest value of n such that the growth function is 2 to the n. So if you go one above, the 2 to the n will be broken. So you can think, ah, OK, so the VC dimension is the maximum. The next one must be a breakpoint. And that is indeed the case. The most important thing to realize is that we are talking about the most points you can shatter. It doesn't guarantee that every n points can be shattered — let's say that the VC dimension is n; it doesn't say that every set of n points can be shattered. All you need is one set of n points that can be shattered in order to say that you can shatter n points. That has always been the case in our analysis. OK, so let's try to take this definition and interpret it. Let's say that I computed the VC dimension. I told you the VC dimension in this case is 15. Now what can you say about n, which is at most 15, in terms of the ability to shatter or not? You can say that if n is at most 15, the VC dimension, then you are guaranteed to be able to shatter n points. Which n points, I haven't said, but there have to be n points that the hypothesis set can shatter. Why is that? It's simply because, since the VC dimension is this number, there will be that many points that can be shattered. Well, any subset of them will have to be shattered as well. Therefore, a smaller number will also be shattered, which means that if n is smaller, you can shatter it. The other direction is also meaningful. If n is greater than the VC dimension, now the statement is stronger: n is a breakpoint. You cannot shatter any set of that many points. Because by definition, the VC dimension was the maximum. And although I called it n here, we used to call it small k. So anything above the VC dimension is a breakpoint. And anything below, you can shatter. 
Very simple notion. So if you look at the growth function in terms of the VC dimension: when we had the breakpoint in terms of this k, we were able to find the bound that I showed in the review. So we know that the growth function is bounded above by this formula. And k appears here in the index of summation, which gives us the maximum power of this formula. Now, in terms of the VC dimension, it's not a big deal, because the smallest breakpoint is one above the VC dimension. So all you need to do is substitute, and you will get this formula involving the VC dimension. The VC dimension is unique, because it's the maximum. So you get that number. And now you can say that the growth function for any hypothesis set that has VC dimension d sub VC is bounded above by this. A nicer formula than the other one — it doesn't have the annoying minus 1. And furthermore, when you look at the maximum power in this polynomial, the maximum power happens to be n to the VC dimension. So the VC dimension will also serve as the order of the polynomial that bounds the growth function of a hypothesis set that has that VC dimension. All of this is very simple. Now, let's take examples in order to get the VC dimension in some cases. And you have seen these before. Remember positive rays? How many points can positive rays shatter? Oh, it's just one point. This is where you can get all possible patterns. If you get two, it's a breakpoint. Remember that argument? Therefore, the VC dimension here is one. Good. How about 2D perceptrons? How many can we shatter? Oh, we remember that constellation. If we have three points in that position, then we can get all possible patterns. And four is a breakpoint, so we cannot go up. Therefore, the VC dimension is three. Convex sets. What is the VC dimension of convex sets in two dimensions? That's the other example we gave before. We had this funny construction, where if you choose your points on the perimeter of a circle, you can shatter any number of points. 
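To make the bound concrete, here is a small Python sketch (my addition, not from the lecture) that evaluates the bound — the sum of C(n, i) for i up to the VC dimension — and compares it against the growth functions of the examples just mentioned:

```python
from math import comb

def vc_bound(n, d_vc):
    # m_H(n) <= sum_{i=0}^{d_vc} C(n, i), a polynomial of order n^{d_vc}
    return sum(comb(n, i) for i in range(d_vc + 1))

for n in range(1, 20):
    # positive rays: m_H(n) = n + 1, d_vc = 1 (the bound is met exactly)
    assert n + 1 == vc_bound(n, 1)
    # positive intervals: m_H(n) = C(n+1, 2) + 1, d_vc = 2 (also exact)
    assert comb(n + 1, 2) + 1 == vc_bound(n, 2)

# 2D perceptron: d_vc = 3, and m_H(4) = 14 while the bound allows 15,
# so the bound need not be tight
assert vc_bound(4, 3) == 15
```

For the rays and intervals the bound happens to be exact; for the 2D perceptron it is a genuine over-estimate, which is all the theory needs.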
Therefore, what would be the VC dimension in this case? It would be infinite. There's no maximum. Now, we said before that this one is particularly pessimistic, which indeed it is, because this is a very specific way of getting the points. And in all of the analysis, you will find that using the VC dimension always gives an upper bound, a worst case. You cannot violate it, but you can do better at times. Here, for example, you can do better if you choose the points, let's say, uniformly over a space. Then some of them will be internal points, and therefore the corresponding growth function, which can be defined in this case, will not be 2 to the n, and you may be able to learn. Now, let's look at the VC dimension as it relates to learning. This is an important view graph. And when we talk about learning, we have to go back to our friend, the learning diagram. And in case you forgot it, let me magnify it a little bit. There are different components, and we have studied the matter so well by now that we can relate more to this. Remember, this is the target function. It gives you the examples, the learning algorithm picks the hypothesis and puts it out as the final hypothesis. We hope that the final hypothesis approximates this guy, and we introduced this thing in order to get the probabilistic analysis. We have seen this before. Now, let's look at this diagram and see what the VC dimension says. The main result is that if the VC dimension is finite — that's all you are asking — then the green final hypothesis will generalize. That we have established by the theory. So you don't even need to know its value. You just need to know it's finite. And then you can say that g will generalize. That we have in the bag. Now, I'd like to understand the rest of the diagram in terms of the VC dimension. So we can understand the green part. So here, g will generalize to the target function, for better or for worse. 
g could be doing very poorly in sample, and that will generalize, or it could be doing great in sample, and that will generalize. We're only talking about generalization here. So now, this statement is independent of the learning algorithm. Why is that? Because the learning algorithm here, if it picks a hypothesis, it will have to pick it from the hypothesis set. We have gone through all of this trouble in order to guarantee that generalization will happen uniformly, regardless of which hypothesis you pick. Therefore, you can find the craziest learning algorithm, and it can pick anything you want, and you still can make the statement about the final hypothesis. So now, the learning algorithm doesn't matter as far as generalization is concerned. So let's punish it by graying it out. Now, it's also independent of the input distribution. So this is the box. Now, this was technically introduced in order to get Hoeffding, and obviously, it has to survive in order to get the VC inequality. The reason I am talking about the independence here is because of an interesting point. We mentioned that when we defined the growth function, or the VC dimension, I give you the budget N, and then you choose the points any way you want, with a view to maximizing the dichotomies. So now, there is no probability distribution that can beat you. I can pick the weirdest probability distribution that has preferences for funny points and whatnot, and your choice of the points will be fine, because you chose the points that maximize. So whatever the probability distribution does, you will be doing at least as well. And therefore, your bound will hold. Therefore, we don't have to worry about probability distributions. The learning statement that this guy will generalize will hold for any probability distribution. So another guy bites the dust. Now, you look at this, and then there is the third guy, which is an obvious one, which is the target function. 
All of this analysis, the target function didn't matter at all. As far as generalization is concerned, we are generalizing to it, but we don't care what it is. As long as it generates the examples we learn from, and then we test on it, that's all we care about. The generalization statement will hold. So it also goes. So now, as far as the VC theory is concerned, we really have three blocks. The first one is the final hypothesis. That is the one that we are claiming generalization for. That's number one. The hypothesis set is where we define the VC dimension. And if you remember very early on, I told you that the hypothesis set is a little bit of an artificial notion to introduce as part of the learning diagram. And I said that we are going to introduce it because there is no downside, there is no loss of generality, which is true, and there is an upside for the theory. Now you can see the upside. The entire VC theory deals with the hypothesis set by itself. That's what has a VC dimension, and that's what will tell you whether you are able to generalize. The rest of the guys that are more intuitive, that disappeared here, are not relevant to that theory. Now, the training examples are left because the statement that involves the VC dimension is a probabilistic statement. It says that with high probability, you will generalize. With high probability with respect to what? It's with respect to generating the data. You may get a very unlucky data set, for which you are not going to generalize. The guarantee is that this happens with a very small probability. So this remains here just because it is part of the statement. And this triangle is where the VC inequality lives. Now we go into a fun thing of computing the VC dimension for the perceptrons. There are two goals for doing this. We'll do it exactly. We'll get the exact formula for it. The one thing is to test your understanding of the definition. The definition is a little bit tricky, because I give you n. 
You choose the points any which way. You maximize this. This is the bound. So what is minimum and what is maximum may be a little bit fuzzy, and trying to get the number for a particular case will seal the deal. So that's number one. Number two is that because we understand the perceptron model so well, we will be able to get the result, which is the VC dimension of perceptrons. And that will give us insight into what the VC dimension signifies. And that will set the stage when we go to interpreting the VC dimension. So this is an important part that will take a little bit of analysis. So for the two-dimensional perceptron, we already have done the exercise, and we got the VC dimension to be 3. Now, if you go for the general case and you have d-dimensional space, you expect the VC dimension to be more, because even if you just go to three dimensions, the troublesome case of four points that we had before is very easily shattered in this case. Just pick the points not all on the same plane. And remember, the problem with those guys is that if you want these guys to be minus 1, and these guys to be plus 1, it was a problem for the plane. Now you can very easily separate any two points from the other two points, and you can shatter four points. So the VC dimension went up for sure. And we ask ourselves: how much did it go up? It turns out to be a very simple formula. The VC dimension of perceptrons is exactly d plus 1. Now we need to prove that. And we're going to prove it in two stages, very simple stages. One of them is that we are going to show that the VC dimension is at most d plus 1. And then we are going to show that the VC dimension is at least d plus 1. And that leaves the single possibility that the VC dimension is d plus 1. So let's go. Here is one direction. And by the way, pay attention, because I'm going to give you a quiz in the middle of the argument to make sure that you are paying attention. This is for real, and for the online audience as well. 
So here is the first direction. I am going to construct a specific set of n points. And that n in this case is d plus 1, because that's the number of points I want to shatter. And I'm going to construct them in the d-dimensional Euclidean space, R to the d. I am going to construct them with a view to being able to shatter them. So I get to choose the points, which is my privilege. As long as I can shatter them, we are OK. So what are these points? I'm going to construct them using a matrix. And you have seen this matrix before. Remember our old friend, linear regression? We actually arranged the input points in linear regression this way in order to get the algorithm — the pseudo-inverse and all of that. And in the case of linear regression, this was a very tall matrix, where this is one data point, which means that it's a d plus 1 dimensional vector. The extra dimension is the constant x0, which is the constant plus 1 we add to take care of the threshold. And then the rest of the dimensions from 1 to d are actually the coordinates of the point. So we put this, and this is one data point, this is the second, this is the third, this is the nth. And usually, since we have many, many more points than dimensions, this is a tall matrix. In this case, I am choosing n to be exactly d plus 1. And since the dimension of each point is also d plus 1, this is actually a square matrix in this case. But that's all I need for the purpose that I am after here. So I need to give you the identity of these guys. What are these guys? These guys look like this. This is no mystery. If you look at the first column, it's all ones. Well, it has to be. That's dictated. That is the constant coordinate. It has to be plus 1. If I want a legitimate point in this representation, the d plus 1 dimensional representation, the first coordinate has to be 1. The rest of the guys, I chose the simplest possible form I can imagine. So you have basically a diagonal matrix here, and I added the all 0's here. 
So these are the guys that are my data set. Now you can see that I chose them such that X is invertible, because that is the technique I'm going to use in order to be able to shatter them. You will see in a moment. Do I know that this is invertible? Yes, the determinant is 1, actually. And that means it's invertible. Can you compute the determinant? This is 1. And then every time you have this guy, you have the 0 term wiping out everything, so I get a 1. So this is an invertible matrix. Why am I interested in an invertible matrix? Because then we can shatter the data set. This is how we are going to do it. Look at any set of labels you want. So this is a dichotomy. This is the value at the first one, plus or minus 1, plus or minus 1, and plus or minus 1, on the x points that I just showed you. So all of these could be any pattern, plus or minus 1. So I would like to tell you that any dichotomy you pick from this — plus 1, plus 1, minus 1, minus 1, plus 1, et cetera — I can find a perceptron that realizes this dichotomy. If I do that, then I have shown you that I can shatter the set. So let us look for the W that satisfies what we want. And what does it satisfy? It satisfies this condition. This computes the signal for all the points at once in vector form. You take the sign of that, and you would like the sign of that to agree with the particular Y you chose. So you give me Y. I am supposed to come up with W, such that this holds. If I can do that for every choice of Y you give me, then I'm done. I have shattered your set, or my set, the set I chose. How am I going to do that? It's pretty easy. What I'm going to do, I'm going to do even better than this. I'm going to actually have XW numerically equal to Y, even before taking the sign. So when you multiply the matrix X by W, you're going to get specifically a pattern of plus or minus 1. Well, if you get plus or minus 1, guess what happens when you take the sign of that? You'll get the same thing, plus or minus 1. 
So that will satisfy that. But that is easy to handle, because now I have algebra working for me. Remember that X was invertible. That's pretty easy. So all you do is just solve for it. W would be the inverse of X times Y. And you have a solution that realizes any dichotomy you can think of. That's wonderful. So we were able to shatter d plus 1 points. Now comes the quiz. We can shatter these points. Wonderful. But we are not really interested in shattering for its own sake. We were trying to establish the value of the VC dimension. So let's see what we have established. I showed you particular d plus 1 points. I showed you that we can shatter them. What is the conclusion? Is it: oh, we have established the VC dimension is d plus 1, thank you. Or we only established that it's greater than or equal to d plus 1. Wait a minute — we actually established that it's less than or equal to d plus 1. Or maybe we didn't establish anything at all, as far as the value of the VC dimension. I'd like to ask you to think about it and tell me which of those we can conclude. And I'd like the online audience to text A, B, C, or D, as if you were solving a homework problem. And tell me which of these choices is a valid conclusion given what we have argued. So let's answer by shouting — you just shout A or B or C or D. And I hope that there is enough signal that I will be able to decipher the majority. Shout. You guys are a tough crowd. Well, why is that? We were able to shatter d plus 1 points. So we are guaranteed that for at least d plus 1 points, we are OK. It is conceivable that we can shatter a bigger set. We haven't argued that yet. But even if we fail there, we at least have the VC dimension being at least d plus 1. If we can shatter a bigger set, it will be even bigger. If we cannot, it will be equal. So that is what we have established. Since you are very good at this, let's do another quiz. So I have now greater than or equal to d plus 1. 
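The shattering construction just described is easy to run. Here is a minimal NumPy sketch (my own, not from the lecture) that builds the matrix of d plus 1 points — the all-ones first column, then the origin and the standard basis vectors — and solves w equals X inverse times y for every dichotomy y:

```python
import numpy as np
from itertools import product

def shatter_d_plus_1(d):
    """Realize every dichotomy on d + 1 specially chosen points in R^d."""
    n = d + 1
    # rows: [1, 0, ..., 0], [1, e_1], ..., [1, e_d]
    # (the leading 1 is the constant x0 coordinate for the threshold)
    X = np.hstack([np.ones((n, 1)), np.vstack([np.zeros(d), np.eye(d)])])
    for y in product([-1.0, 1.0], repeat=n):
        y = np.array(y)
        w = np.linalg.solve(X, y)      # w = X^{-1} y, so X w = y exactly
        if not np.array_equal(np.sign(X @ w), y):
            return False
    return True

# 2^5 = 32 dichotomies on 5 points in R^4, all realized
assert shatter_d_plus_1(4)
```

Since X w reproduces y numerically, taking the sign is trivially consistent, exactly as in the argument above.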
So now I need to show that the VC dimension is less than or equal to d plus 1. I wonder what I need to do in order to achieve that. We need to show one of several choices. I need to show that there are some points, a set of d plus 1 points, that we cannot shatter? No, no, no. I need to show that there is a set of d plus 2 points that I cannot shatter? Oh, no, no, no. Maybe we need to show that we cannot shatter any set of d plus 1 points. Or was it d plus 2? I'm confused now. Which among those will establish the premise? The premise that we are trying to establish is that the VC dimension is at most d plus 1. Which of these statements will establish that? Again, think about it. And similarly, for the online audience, text the result. I'll give you 10 seconds, and then we'll also answer by shouting. Shout. Oh, I like that. How about the online audience? D. So everybody gets the idea. So now we know what we want to prove. Let's go ahead and prove it. So now it's a question of any d plus 2 points. So I don't get to choose the points. You get to choose them. So I tell you, choose them, please. And give them to me. Now, when you give me your points, I'm going to make a statement about the points you chose. How can I make a statement if you choose them any way you want? I'm going to make a statement. I'm going to say that for those particular points that you chose, which I really don't know what they are, I can say that you have more points than dimensions. Why is that? Each of these guys is still a d plus 1 dimensional vector, because it's a d-dimensional space, plus the added coordinate. And I asked you to give me d plus 2. So obviously, d plus 2 is bigger than d plus 1. What do you know when you have more vectors than dimensions? Oh, I know that they must be linearly dependent. So therefore, we must have that one of them will be a linear combination of the rest of the guys. So you take x j, whichever it might be. 
And this will be equal to the sum over the rest of the guys of some coefficients that I don't know, times those guys. This will apply to any set you choose. And this is the property that I'm actually going to use in order to establish what I want. Now, furthermore, I can actually say something about the a i's. The a i's could be anything for this statement to hold. But I'm going to claim that not all of them are zeros in this case — at least some of the a i's are non-zero. How do I know that? This is not part of the linear dependence. This is actually because of the particular form of these guys, where the first coordinate of all of these guys is always 1. So when you apply this equation to the first coordinate, you get: 1 equals the sum of the a i's. Well, these cannot all be zeros, because they have to add up to 1. Therefore, some of the a i's will be non-zero. That's all I need. I need this to hold. I need some of the a i's not to be zero. Everybody buys that this is the case. That's all I need. And then we go and show the dichotomy that you cannot implement. We have that, right? Consider the following dichotomy. I'm going to take the x i's corresponding to non-zero a i's. So some of the a i's are non-zero, for sure. Maybe some of them are zeros. I'm going to focus only on the non-zero guys. I don't care what you do with the terms that have a i equal to 0. Do whatever you want. Give them any plus or minus one. Let's not look at them. So I'm looking at those guys, and I am now constructing a dichotomy that I'm going to show you that you cannot implement using a perceptron. So for the x i's with a non-zero a i, I'm going to give the label which happens to be the sign of that coefficient. That is a non-zero number. It's positive or negative. So I will give it plus one or minus one, according to whether it's positive or negative. And I will do that for every non-zero term here. Everybody sees that. And now I'm going to complete the dichotomy by telling you what will happen with x j. 
I'm going to require that x j goes to minus one. Now all you need to realize is that this is a dichotomy. These are values of plus or minus one on specific points. The other guys, which happen to have a i equal to zero, give them any plus or minus one. You choose. And for the final guy, which is sitting here, I'm going to give it minus one. This is a legitimate dichotomy. And I'm going to show you that you cannot implement this particular one. How is that? Because I really don't know your points. So I must be using just that algebraic property in order to find this. And the idea is very simple. This is the form we have. x j happens to be the linear sum of these guys. I'm going to multiply by any w. So for any w — and by w, I mean the perceptron's weight vector; you multiply by w, that is what makes it a perceptron — I'm going to multiply by it. And then I realize that w transpose times x j, which would actually be the signal for the last guy, is actually the sum of the signals for the different guys with these coefficients. That has to happen. So what is the problem? The problem is that when you take this as your perceptron, then by definition, the label is the sign of this quantity. For the guys where a i is non-zero, we forced this quantity, which is the value y i, to be the sign of a i. That's what we constructed. What can you conclude, given that the sign of this fellow is the same as the sign of that fellow? It must be that these guys agree in sign. They are either both positive or both negative. Therefore, I can conclude that if you multiply them, you get something positive. That is for sure. So now I have a handle on this term. Now, this forces the sum of these guys to be greater than 0. Why is that? Because this happens for every non-zero a i. For zero a i's, they don't contribute anything here. So if I add up a bunch of positive numbers and zeros, I'm going to get a positive number. What is this quantity? Do I see it anywhere else on the slide? Oh yeah, I can see it here. 
So this actually is the signal on the outstanding point. So I know that the signal on the outstanding point is positive. What does this force the value of the perceptron — your perceptron, the one you had here — to be? It will have to be plus 1. Therefore, it's impossible to get that to be minus 1 if you chose this. This is a choice that is legitimate. It is a dichotomy. And now if you pick those guys, and pick the rest of the zero-coefficient guys any way you want, you are forbidden from having this as minus 1. Therefore, you cannot shatter the set, for any set you choose. Therefore, you cannot shatter any set of d plus 2 points. And I have the result. So let's put it together. First, we showed that the VC dimension is at least d plus 1, and then we showed that it's at most d plus 1. Or did we do it the other way around? That's another quiz. No, no, it's not another quiz. The conclusion is that the VC dimension is d plus 1. And now, for the d-dimensional perceptron, the VC dimension is d plus 1. Let's ask ourselves a simple question. What exactly is d plus 1 in a d-dimensional perceptron? It's 1 above the dimension, and you can find many interpretations for it. But the interpretation of interest to us will be the fact that this is actually the number of parameters in the perceptron model. What are the parameters in the perceptron model? We used to call it a vector w. Let's spell it out in order to count. This happens to be w0 — the one for the threshold — and w1 up to wd. These are the parameters you are free to choose, and that we have been choosing through this argument. And how many of them are there? d plus 1. Why am I attaching the VC dimension to the number of parameters? Because the VC dimension gives me the maximum number of points I can shatter — label them any which way. The reason I can do that is that I have a bunch of parameters that I can set one way or the other in order to achieve that. 
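The d plus 2 argument can also be exercised numerically. Below is a Python sketch (my addition): take d plus 2 random points, solve for the dependence coefficients a i, build the dichotomy with y i equal to the sign of a i and minus 1 on the dependent point, and confirm that randomly sampled perceptrons never realize it. The random search is only evidence, of course — the proof is the sign argument just given:

```python
import numpy as np

def impossible_dichotomy_demo(d, trials=20000, seed=0):
    rng = np.random.default_rng(seed)
    # d + 2 points in the (d + 1)-dimensional representation (x0 = 1)
    X = np.hstack([np.ones((d + 2, 1)), rng.standard_normal((d + 2, d))])
    # d + 2 vectors in d + 1 dimensions are dependent: x_j = sum_i a_i x_i
    a = np.linalg.solve(X[:-1].T, X[-1])
    assert np.allclose(X[:-1].T @ a, X[-1])
    # the dichotomy from the lecture: y_i = sign(a_i), -1 on the dependent
    # point (with random points, every a_i is nonzero with probability 1)
    y = np.append(np.sign(a), -1.0)
    # any w matching the first d + 1 labels forces
    # w . x_j = sum_i a_i (w . x_i) > 0, so the last label cannot be -1
    for _ in range(trials):
        w = rng.standard_normal(d + 1)
        if np.array_equal(np.sign(X @ w), y):
            return False   # a perceptron realized it (should never happen)
    return True

assert impossible_dichotomy_demo(3)
```

No matter how many weight vectors are tried, the forbidden dichotomy is never produced, matching the conclusion that d plus 2 points cannot be shattered.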
So it stands to logic that when I have more parameters, I will have a higher VC dimension. And that will be the basic part of the interpretations that we are going to go into. So let's do now the interpretation of the VC dimension. Now, look at this. We are going to prove two things — not prove, but show two things — in terms of the interpretation. One of them: we understand the mathematical definition of the VC dimension, but what does it signify? That's number one. And it will relate to the number of parameters and whatnot, and we'll get at it a little more elaborately. The second one: I know what it signifies, but is it at all useful for me? I am a learning person. I went through the theory just because you asked us to do that. But when all is said and done, I just care about the result and how I'm going to use it in practice. So how can we apply the value of the VC dimension in practice? That's number two. These are the two parts of the interpretation. So the main idea of understanding what the VC dimension signifies is to look at the degrees of freedom. So what is that? When you have a model, the model is characterized by a set of hypotheses. You get one hypothesis or another by setting the set of parameters one way or another. So the parameters give you degrees of freedom in order to create one hypothesis or another hypothesis. So think of this picture. Think of the parameters as knobs. So this is w0, w1, w2, et cetera. And when you are actually having a hypothesis set, you are given this, and you are able to set the knobs any way you want. Increase the volume, decrease this, et cetera. Just do this, and you get a setting that tells you what the hypothesis is. These are obviously degrees of freedom. And there is actually a pleasant analogy: let's say that you are buying a big-time audio system. Usually, if you are not very much into stereo and stuff, you want a couple of knobs, and you adjust them and get it right. 
If I give you 17 channels and this and that, and I hand you this panel, that's great if you know how to use it. If you don't know how to use it, what then? So the problem we are facing is actually here, because now the specification needed to get you to pick the right hypothesis is pretty elaborate. You need a lot of examples. So that is the relation we are going to see. Now, the parameters happen to be analog degrees of freedom. When I talk about w0, w0 can assume a continuous value from R, and it matters: if you pick a different threshold, you will get a different perceptron. It will return different values for parts of the space. So these are genuinely degrees of freedom that happen to be analog. The great thing about the VC dimension is that it translates those into binary degrees of freedom, if you will. Because all you are trying to do is get a dichotomy. So you ask yourself: when am I free to get any dichotomy I want? For the first point, I can get plus 1 or I can get minus 1; independently, for the second point, I can get plus 1 or minus 1, and so on all the way. So you keep adding points and see how far you can get, and the maximum you can get to is the VC dimension. By the time you get there, you have that number, the VC dimension, of degrees of freedom, but they are binary degrees of freedom, which is what matters here. Because inside the box that tells you it's a perceptron or a neural network or something like that, there may be parameters playing around and whatnot. As far as I'm concerned, all I'm interested in is: how expressive is this model? How many different things can I actually get? So the VC dimension abstracts away whatever mathematics goes on inside. You look from outside, and if I can shatter 20 points, that's good. If someone else can shatter 30 points, they have more degrees of freedom to be able to do that, regardless of where those degrees of freedom came from.
Now let's look at the usual suspects and see if the correspondence between degrees of freedom and VC dimension holds. Who are the usual suspects? I think this is the last lecture where we are going to see the positive rays and the rest of the gang, so don't despair. Positive rays. What is the VC dimension? That was 1: I can shatter at most one point. And what were positive rays in the first place? Oh yes, that was the diagram: we had this line, returning plus 1 here and minus 1 here. And what determines one hypothesis versus another within this model is the choice of a. The choice of a? One parameter, one degree of freedom, corresponding to VC dimension equals 1. That's nice. Let's see if this survives. Positive intervals. For positive intervals, the VC dimension was 2; that is the most I can shatter. What do they look like? In this case, I have this guy, the small blue interval. There is a beginning and an end, and between them I return plus 1, and outside I return minus 1. So depending on the choice of the beginning and the end, I get one hypothesis versus another. How many parameters, how many degrees of freedom? 2. And what is the VC dimension? 2. Then by induction, it's true? No, no, no, that induction isn't valid; this is just to illustrate the idea. So now let's go back and contradict ourselves: it's not just parameters, it's really degrees of freedom, and I'd like to make the distinction. Let's take an example where the parameters are not contributing degrees of freedom. I'll construct an artificial example just to give you the idea. In more complicated models, it may be difficult to argue which parameter is contributing and which is not. But at least we are establishing the principle that a parameter may not necessarily contribute a degree of freedom. And since the VC dimension is a bottom line, it looks at what you are actually able to achieve.
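The correspondence for the two usual suspects can be verified by brute force. This is an illustrative sketch of my own (the lecture only states the results): enumerate the dichotomies each model can produce on n points on a line and find the largest n that gets shattered.

```python
import numpy as np

# Brute-force check that VC dimension matches degrees of freedom for the
# simple models: positive rays (one parameter a) and positive intervals
# (two parameters: the interval's two ends).

def dichotomies_positive_rays(x):
    # h(x) = +1 for x > a: collect achievable labelings on the points x
    out = set()
    for a in list(x - 0.5) + [x.max() + 1]:        # thresholds between points
        out.add(tuple(np.where(x > a, 1, -1)))
    return out

def dichotomies_positive_intervals(x):
    out = set()
    cuts = list(x - 0.5) + [x.max() + 1]           # candidate interval ends
    for lo in cuts:
        for hi in cuts:
            out.add(tuple(np.where((x > lo) & (x <= hi), 1, -1)))
    return out

def vc_dimension(dich_fn, max_n=5):
    # Largest n for which SOME set of n points is shattered. On a line, any
    # n distinct points behave the same, so checking one set suffices.
    d = 0
    for n in range(1, max_n + 1):
        x = np.arange(n, dtype=float)
        if len(dich_fn(x)) == 2 ** n:
            d = n
    return d

print(vc_dimension(dichotomies_positive_rays))       # 1 parameter -> VC dim 1
print(vc_dimension(dichotomies_positive_intervals))  # 2 parameters -> VC dim 2
```

With three points, the intervals model fails only on the labelings (+1, -1, +1) and (-1, +1, -1), which is exactly why its VC dimension stops at 2.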
It would be a more reliable way of measuring the actual degrees of freedom you have, instead of going through the analysis of your model. So let's take the one-dimensional perceptron. Very simple model. You have only one variable, and you are going to give it a weight; that's one parameter. Then you are going to compare it with a threshold, w0; that's a second parameter. And then you are going to give me plus or minus 1. So this fellow has two parameters, and indeed two degrees of freedom, and the VC dimension is 2, because it's d plus 1. We proved that in general, and here d, the dimensionality of the space, is 1; the input is just a real number. Now, this is not my model, actually; this is only part of the model. What I'm going to do is take that output and feed it into a perceptron, then take that output and feed it into a perceptron, then take that output and feed it into yet another perceptron. And that will give me the output of the model. So let's see how many parameters I have. This guy has two: one for the weight here, and one for the threshold. The output here gets weighted by something, that's a third parameter, compared to a threshold, fourth, then fifth, sixth, seventh, eighth. I have eight parameters in this model. Would anybody argue about that? I have eight parameters; there's no question about it. Do I get eight degrees of freedom? No, because these guys are horribly redundant. By the time I did this first one, I am done; the rest don't add anything. Take its output, plus 1 or minus 1, give it a weight, compare it to a threshold: what are you going to get? You're either going to get plus 1 for plus 1 and minus 1 for minus 1, or vice versa. So you are just replicating a function, again and again. This whole thing is a very, very elaborate perceptron in one dimension. That's all. I know that I constructed it in a funny way, but that's the function it computes.
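The redundancy claim can be checked directly. Below is an illustrative sketch of my own (the parameter grids are an assumption, chosen to cover all the interesting cases on three points): cascade a 1D perceptron through three more one-input perceptrons, for 8 parameters total, and confirm that no new dichotomies appear.

```python
import numpy as np
from itertools import product

# Cascading sign units is redundant: on 3 points, the 8-parameter cascade
# produces exactly the same dichotomies as the plain 2-parameter perceptron.

x = np.array([0.0, 1.0, 2.0])
wgrid = [-1.0, 1.0]                                  # weight values to try
bgrid = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]            # threshold values to try

def sgn(z):
    return np.where(z >= 0, 1, -1)

single, cascade = set(), set()
for w1, b1 in product(wgrid, bgrid):
    h = sgn(w1 * x + b1)                             # the first perceptron
    single.add(tuple(h))
    # feed its +/-1 output through three more one-input perceptrons
    for (w2, b2), (w3, b3), (w4, b4) in product(product(wgrid, bgrid), repeat=3):
        out = sgn(w4 * sgn(w3 * sgn(w2 * h + b2) + b3) + b4)
        cascade.add(tuple(out))

print(len(single), len(cascade))    # same count of dichotomies
print(cascade == single)            # 8 parameters, no new dichotomies
```

Each downstream stage can only copy its input, flip it, or output a constant, and the plain perceptron's dichotomy set is already closed under all three, which is why the sets coincide.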
So if you are counting the number of parameters, you will say: OK, I have a bunch of parameters. But if you are resorting to the VC dimension, you don't care about this box. You don't even know it; it's a black box. You look at x and y, ask yourself how many points you can shatter, and you get the answer you would get for one of these blocks by itself. The rest of the guys don't matter. So you can think of the VC dimension as measuring the effective number of parameters, rather than the raw number of parameters. I gave you a case where the effective number of parameters is smaller, which is the common case. Believe it or not, you can also construct mathematically a case where you have one parameter, a literal parameter, a single real number, and then milk out of it so many degrees of freedom that you get far more than one degree of freedom from one parameter. But that case is really something you construct because you want to. The message here is that you don't look at the number of parameters; you look at the effective number of parameters, and effective for us means effective as far as the result goes. The result is captured by the VC dimension, so this is our quantity for measuring the degrees of freedom. Now let's look at the number of data points needed, which a practitioner would be interested in, without caring about the rest of the stories I told you. So you have a system. Let's say that you manage a learning system, and you look at the hypothesis set, and you say: the VC dimension is 7; I want a certain performance; could you please tell me how many examples I need? First, we know that the most important theoretical point is that the mere existence of a finite VC dimension means that you can learn. That is the biggest achievement of the theory. But now we go a little closer and ask: how does the value of the VC dimension affect the number of examples you need?
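The one-parameter case can be made concrete with a well-known construction (this specific construction is an assumption of mine; the lecture only asserts that such a case exists): the model h(x) = sign(sin(omega x)), with the single real parameter omega, shatters arbitrarily many points if you place them at x_i = 10^(-i) and encode the desired labels in the decimal digits of omega.

```python
import numpy as np
from itertools import product

# One real parameter, unbounded VC dimension: h(x) = sign(sin(omega * x))
# shatters the points x_i = 10^{-i}. For labels y_i, set the i-th decimal
# digit of omega/pi to 1 where y_i = -1; the digit flips the sign of sin
# at exactly that point, while the other digits contribute even multiples
# of pi (no effect) or a small positive phase.

n = 5
x = 10.0 ** -np.arange(1, n + 1)                      # x_1 .. x_n

realized = 0
for y in product([-1, 1], repeat=n):                  # every dichotomy
    z = (1 - np.array(y)) // 2                        # label -1 -> digit 1
    omega = np.pi * (1 + np.sum(z * 10.0 ** np.arange(1, n + 1)))
    if np.array_equal(np.sign(np.sin(omega * x)), np.array(y)):
        realized += 1

print(realized, "of", 2 ** n, "dichotomies with a single parameter")  # all 32
```

Since n here is arbitrary, the VC dimension of this one-parameter model is infinite, which is exactly the point: count effective degrees of freedom, not parameters.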
In order to do that, let's do the following. We notice that the VC inequality, in which the VC dimension arose, has two performance quantities that we'd like to be small. Let's remind ourselves. One of them is this fellow: you didn't want E in to be far from E out. You said they should track each other within epsilon, and therefore the probability that they fail to track within epsilon should be small. The other quantity is that probability itself, and we are going to give it a name now: delta. It's a small quantity, small not because of the expression, the expression is long, but hopefully small in value: when you get large n, this reduces to a small number, so we'll just call it delta for now. So phase out the details of the expression and look at it this way: I have two quantities. This is the probability; this is the approximation. And we are making a statement that is probably approximately correct, as we said before; these are our two guys. In a normal situation, what you are trying to do is say: I want a particular epsilon and delta. I want to be at most 10% away from E out, and I want that statement to be correct 95% of the time. That's your starting point. And then you ask yourself: how many examples do I need? Fair enough. When you say you want to be 10% away from E out, you are really saying that epsilon is 0.1. When you say you want to be 95% sure that the statement is correct, you are picking delta to be 5%, or 0.05. So that's what you do. You want a certain epsilon and delta, and then you ask: how does n depend on the VC dimension? Say you are competing with someone else. You are solving the same problem; you are given the same data. And you look at it and say: OK, I'm using this VC dimension, and they are using that VC dimension.
If I achieve this performance with my VC dimension, can they also achieve it with a bigger VC dimension? Because a bigger VC dimension gives them more flexibility: they might fit the data better and get a better E in. So if they can get the same generalization bound, they are better off. That's why I'm interested in this question; I just want to know how n and the VC dimension relate. In order to do this, we are going to look at this function, which is a very simple polynomial, just one term; a monomial, I guess: n to the d, times e to the minus n. I'm writing the power as d because it will play the role of the VC dimension; I just want the notation to be simple. What is this quantity? It is a caricature of the right-hand side of the VC inequality. There are constants there, and multiple terms; I'm taking just the biggest term, because it's dominant. There is a negative exponential, but with a damping coefficient in the exponent, et cetera; I'm leaving those out. I'm just trying to understand the behavior of functions that look like this, taking the simplest case, because this will give me the trade-off between d and n, which is the trade-off I want. So let's look at it. This is our quantity, and I'm going to plot it for you. Let's plot it for the case of d, the power, equal to 4. Here is what it looks like. The initial part is mostly n to the 4: if it weren't for the exponential, this would keep going up and up. The negative exponential is just warming up here, then it starts becoming really effective here, and then it wins over and keeps winning. By then, you will forget that there was an n to the 4; you will just remember that there is a negative exponential. And the interesting part is that this plays the role of the right-hand side of the VC inequality, so it would be a probability, and we'd like that probability to be small.
So obviously, the initial part of the curve is completely meaningless. The bound tells you the probability is less than 5. That's very nice, but vacuous. It only becomes interesting once you cross into the region where the bound is less than 1; then you are actually making a statement. So this will be the interesting part. Let's look at the next one. This was n to the 4th; now let's look at n to the 5th, just to get a feel for these quantities. Well, this one will peak at a different point, and the peak is huge now: the "probability" is less than 20-something. That's nice. But eventually the exponential wins; that's the good part, because the curve comes down and crosses. So now you can see the interesting region, and I'm going to ask myself: in order to get to this region, which is the performance you want, how many examples, which is this coordinate, do I need, given the different VC dimensions, supposedly 4 here and 5 here? Again, this is a caricature, because this is not the actual function I have. If you put the real constants back in, then instead of 5 examples or 10 examples, it would probably be 5,000 examples or 10,000 examples. It's a pessimistic estimate, because it's an upper bound, but the shape will remain the same. And similarly, if you add the other constants, the crossing point will not be exactly at 1; everything shifts a little, and you get the same picture. So if you understand this quantity, you'll be able to translate it to the real one. But because it's going up so fast, I have to do something to make it visible, and I'll do that in a second. So now let's look at it this way. You fix the value here at a small number; whatever it is, not 1, maybe 0.5, or 0.1, or 0.01, et cetera. And you would like to see how n changes with d. Now I'm going to switch plots on you.
And I'm going to draw the y-coordinate here, which is the probability, on a log scale. The reason I'm doing that is that, obviously, d equals 4 and d equals 5 give me these curves; if I go to d equals 10, the curve will go through the roof. So I really want to keep a handle on it. And more importantly, this very, very thin slice is all I'm interested in: the region where the value is less than 1. I want to magnify it and look at it clearly, and on a log scale all the negative logs will be down here, where I can see them. So let's plot d equals 5 on a log scale. That's what it looks like. Now I can afford to have more curves. This is the blue curve we had before: it peaks at a certain point, then goes down, and 10 to the 0 here is probability 1. Below that line is the interesting region: when you tell me what delta is, you are telling me a level. And these levels are not going to change things very much, because on this scale, this is 1, and this is what? 10 to the minus 5; a very, very small probability. So the play here is small even as you vary delta significantly. The play with epsilon, which affects the exponent, is that these curves will spread out: instead of 20 and 40, it will be 2,000 and 4,000, and so on. But this will be the shape. Now let's add the other curve. This was n to the 5th; how about n to the 10th? What does it look like? Now it didn't go through the roof. Well, it would have, if you plotted it linearly, because look at the values at the top; this is really serious business. But you get this curve. Varying the VC dimension, I get a different curve, and I get the behavior in the interesting region. You see the point here. And I keep adding curves, up to d equals 30.
So these are the curves I get by varying the VC dimension, the alleged VC dimension here: 5, 10, 15, 20, 25, 30. Something very nice is observable: these guys are extremely regular in their progression. They are very much linear; not exactly linear, but very close to linear. And indeed, you will find theoretically that the bound behaves this way in terms of the number of examples needed to achieve a certain level. Say you cut at this level here and you increase the VC dimension from 5 to 10 to 15 to 20, et cetera: the number of examples you need is pretty much proportional to the VC dimension. Now, this is in terms of the bound. The problem with a bound is this: if I'm using one system and another guy is using another system, I have my VC dimension and he has his. I know that my performance is less than or equal to something, and his performance is less than or equal to something, and let's say his bound is better than mine. That's the bound. There is no guarantee that the actual quantities being bounded follow the same ordering. It is conceivable that the bounds go one way and the actual quantities go the other way. Under those conditions, the bounds are still satisfied, correct? The bounds are monotonic in one direction, and the quantities are monotonic in the other. That would be a pretty annoying feature. So what I'm going to make now is a statement that is not a mathematical statement, but a practical observation that is almost as good as one. The practical observation is that the actual quantity we are trying to bound follows the same monotonicity as the bound. You use a bigger VC dimension, the quantities you get are bigger, and in fact very close to proportional. This is an observation from trying this n times, where n is very large.
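The near-linear progression can be checked on the caricature itself. A small sketch (the target level of 10^-3 is an arbitrary choice of mine): for each "VC dimension" d, find the smallest n past the curve's peak where n^d e^(-n) drops below the target.

```python
import math

# Numerical sketch of the caricature bound n^d * exp(-n): for each d, find
# the smallest n past the peak (which sits at n = d) where the bound drops
# below a target delta. The required n grows roughly linearly in d.

def n_needed(d, delta=1e-3):
    n = d                                  # peak of n^d e^{-n} is at n = d
    # work in logs to avoid overflow: compare d*ln(n) - n against ln(delta)
    while d * math.log(n) - n > math.log(delta):
        n += 1
    return n

for d in (5, 10, 15, 20, 25, 30):
    print(d, n_needed(d))
```

Reading off the printed pairs, doubling d roughly doubles the required n, which is the proportionality the curves on the slide suggest; the real VC bound's constants would scale these numbers up without changing the shape.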
That is the observation I got, and many other people got. So in spite of the fact that we cannot get a tight absolute value, because this is just a bound, the relative behavior with the VC dimension holds: a bigger VC dimension means you will need more examples. And in that vein, there are some practical estimates: if you take the ratio of the VC dimension to the number of examples, it gives you a handle on the error. We will see provable versions of that when we get to the bias-variance trade-off next time, in a different theoretical line of analysis. So the first lesson is that there is a proportionality between the VC dimension and the number of examples needed to achieve a certain level of performance. That is a theoretical observation for the bound, and a practical observation for the actual quantity you get. That's number one. Number two: give us just a guide. Proportional is fine, but for a reasonable epsilon and a reasonable delta, how many examples does it take me to get over this hump? How many examples does it take to reach the comfort zone of the VC inequality, where I'm actually making a statement, where the probability bound is meaningfully less than 1? Do I take the VC dimension, twice the VC dimension, 100 times the VC dimension? This is a practical observation, and obviously it depends on your epsilon and delta, and on the particular application. The statement I'm making is that for a huge range of reasonable epsilons and deltas, and for a huge range of practical applications, the following rule of thumb holds. You have a VC dimension, and you are asking for the number of examples needed to get reasonable generalization: the rule is that you need 10 times the VC dimension. No proof; it's not a mathematical statement.
And I'm saying greater than or equal, obviously, because if you get more examples, you will get better performance. But 10 times will get you into the middle of the interesting region, where the probability statement is meaningful, rather than the region where the generalization is completely unknown. Now I'll spend just a couple of minutes talking about generalization bounds, which is the form of the theory that will survive with us. We are not going to talk about growth functions anymore; we are not going to go through the details. We are only going to remember that the VC dimension is there, that it determines the number of examples, and that it corresponds to the degrees of freedom. And the form of the theoretical bound we are going to carry will be the following. I'm just rearranging things; there is absolutely nothing new introduced here except simplification. But it's an important simplification, because it's what survives with us. We start with the VC inequality. We can bid it farewell; this is the last time we'll see it in this form. It's complex and all, but we will now simplify it. So we have epsilon and delta again, and I give them different colors. The logic we will use is: you specify epsilon, and I will compute what delta is, depending on the number of examples. You tell me what your tolerance is, and I'll tell you what the probability is. Another way of looking at it is the other way around: you tell me what delta is, you would like to make a statement with reliability 95%, and I tell you what tolerance I can guarantee at that 95%. You start with the delta, and you go to the epsilon. That's not very difficult; there is nothing mysterious about it. Delta equals that expression. Can I solve for epsilon? If I start with delta, I take this constant factor, put it on the other side, which makes this ready for taking the log of both sides, so the exponent comes down.
Now I need to get rid of the extra constants. They go to the other side, but inside a log now, because I took the log. And finally, I take a square root. So that's what you get; very straightforward. Now I can start with this fellow, delta, and get epsilon. I'm going to call this formula capital omega, a notation that will survive with us. It's a formula that depends on several things. As you can see, if the growth function is bigger, that is, the VC dimension is bigger, omega is worse. That's deserved: the bigger the VC dimension, the worse the guarantee on generalization, which is this approximation business. And if I have more examples, I am in good shape, because the growth function is polynomial, and by the time you take the natural logarithm, it becomes logarithmic. Logarithmic gets killed by linear, much as linear gets killed by an exponential; this is just one step down on the exponential scale of the previous statement. So indeed, if I have more examples, I get a smaller value of omega. And obviously, if you are more finicky, if you want the guarantee at 99% instead of 95%, so delta is 0.01 instead of 0.05, then epsilon will be looser, because you are making a statement that is true more of the time, so you have to accommodate a bigger tolerance. So that's what we have. Now the statement is a positive one. It used to be that we characterized the bad events; now we state the good event. The good event happens most of the time, with probability greater than or equal to 1 minus delta, and that statement is that E in tracks E out: they are within this omega. And omega is a function of the number of examples, and goes down with more examples; of the hypothesis set, and goes up with its VC dimension; and of delta, the reliability you choose for the statement, and goes up with smaller delta.
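The rearranged quantity can be computed directly. A sketch, with one assumption flagged: in place of the exact growth function I use the polynomial bound m_H(N) <= N^d_vc + 1 from the earlier theory.

```python
import math

# Omega from the rearranged VC inequality:
#   Omega = sqrt( (8/N) * ln( 4 * m_H(2N) / delta ) )
# using the polynomial bound m_H(N) <= N^d_vc + 1 in place of the exact
# growth function (an assumption standing in for the real m_H).

def omega(N, d_vc, delta=0.05):
    m_H = (2 * N) ** d_vc + 1                 # polynomial bound on m_H(2N)
    return math.sqrt(8.0 / N * math.log(4.0 * m_H / delta))

# More examples -> smaller Omega; bigger d_vc or smaller delta -> bigger Omega.
for N in (1_000, 10_000, 100_000):
    print(N, round(omega(N, d_vc=3), 3))
```

Because m_H only enters through a logarithm, omega shrinks roughly like sqrt(ln N / N) as N grows, which is the "logarithm killed by linear" behavior described above.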
I'm keeping it in this form because we don't worry about the internals anymore; we just want to understand this form well. So let's look at it. This will be called the generalization bound: with probability at least 1 minus delta, we have this inequality. Now I'm going to simplify it. Here is the first simplification. Instead of the absolute value of E out minus E in, I'm going to use just E out minus E in. Why is that? Well, because I can. If I have a guarantee on the absolute value, I have a guarantee on this quantity and on its opposite; so among other things, I guarantee this one, and I can make the statement. The reason I'm making it is two-fold. First, this is really the direction that matters, because invariably E in will be much smaller than E out, or at least smaller than E out, because E in is the quantity you minimize deliberately. It used to be that, in terms of a sample, this is E out, and the sample values would sit around it. Now you start deliberately pulling E in down. The other guy is attached by an elastic band, but the band gets looser and looser as you make more effort. So invariably, E in has a bias, an optimistic bias, and therefore E out minus E in is the quantity that actually tends to be positive. This doesn't say that E out is always bigger than E in; once in a blue moon, maybe on your birthday, you will get an E out that is smaller than E in. But the rule, in general, is that E out is bigger than E in. So E out minus E in is less than or equal to omega. And in spite of all the dependencies omega has, I'm going to suppress them for the moment. I know that omega is an elaborate quantity; I don't want to carry the details, and I understand its general behavior. By the way, this quantity is called the generalization error, because it's the difference between what you do out of sample versus in sample. So this is a bound on the generalization error.
When you rearrange it, you can say: with probability greater than or equal to 1 minus delta, and so on. And this is the form that will survive with us: you just take E in and put it on the other side. Now, this is a generalization bound, and it is very interesting to look at. It bounds E out on the left-hand side by E in plus omega. E out we don't know; but both of the quantities on the right we know and have some control over. E in is the one we minimize; omega follows from the choice of our hypothesis set. So it tells us something about E out in terms of quantities we control. Furthermore, remember when we talked about the trade-off, when someone asked whether a bigger hypothesis set is good or bad? It's good for E in, but bad for generalization. Now you can see why. E in goes down with a bigger hypothesis set; omega goes up with a bigger hypothesis set, meaning poorer generalization. Therefore, it's not clear whether a bigger or a smaller hypothesis set is the better idea. There may be a balance between them that makes the bound on E out as small as possible, and that affects the quantity I actually care about. The other thing is that now that I got rid of the absolute values, we'll be able to take expected values in certain cases and compare with other quantities. So this is a very friendly form to work with. It's so friendly that we are going to derive from it one of the most important techniques in machine learning: regularization. The idea there is: I used E in as a proxy for E out. After all this analysis, I realize that it's not only E in that affects the game; it's also the choice that determines omega. So maybe instead of using E in as a proxy, I should use E in plus something else as a proxy, hoping that this will better reflect the E out I want. That will be the subject of regularization.
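The balance between the two terms can be sketched numerically. One loud caveat: the E in curve below is purely hypothetical, invented only to stand in for "a bigger hypothesis set fits the data better"; only the omega term comes from the bound (with the polynomial growth-function bound assumed, as before).

```python
import math

# Approximation-generalization trade-off in E_out <= E_in + Omega.
# E_in here is a HYPOTHETICAL, illustrative curve (decreasing in d_vc);
# Omega is the rearranged VC bound with m_H(2N) <= (2N)^d_vc + 1 assumed.

def omega(N, d_vc, delta=0.05):
    return math.sqrt(8.0 / N * math.log(4.0 * ((2 * N) ** d_vc + 1) / delta))

def e_in(d_vc):
    return 1.0 / d_vc              # made-up stand-in: bigger set, better fit

N = 1000
bounds = {d: e_in(d) + omega(N, d) for d in range(1, 21)}
best = min(bounds, key=bounds.get)
print("bound minimized at an intermediate d_vc:", best)
```

The minimizing d_vc lands strictly in the interior: too small a hypothesis set loses on the E in term, too big a one loses on omega, which is exactly the tension that regularization will later address.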
We'll stop here and take questions and answers after a short break. OK, so let's go for the Q&A. Are there questions? Yeah, OK, so there was one confusion: why is the VC dimension exactly k minus 1? When we defined the breakpoint, we said that k is a breakpoint if I cannot get all dichotomies on any k points. That means, really, that if I have a breakpoint, then any bigger number is also a breakpoint, and most of the discussion deals with the smallest breakpoint. So the notion of a breakpoint covers a lot of values. The VC dimension is a unique value, which happens to be the biggest value just short of the first breakpoint. Does this cover it? Yeah, because people were wondering whether it was about failing on some set of n points or on all sets. It is always the case that when I say you are able to shatter, I give you the privilege of picking the points to shatter. I insist that you get all possible dichotomies, but you get to choose which points to shatter. That is always the logic, and it does not differ between a breakpoint and the VC dimension: when we talk about shattering n points, it means that you shatter some set of n points. Now, the only distinction between a breakpoint and the VC dimension is that a breakpoint is posed negatively and the VC dimension is posed positively: a breakpoint is a failure to shatter, and the VC dimension is an ability to shatter. And obviously, if you take the maximum ability to shatter, which gives you the value of the VC dimension, that will be one short of the next number, which you fail to shatter; that is your smallest breakpoint, and the bigger numbers are the other breakpoints. Can you repeat the practical interpretation of epsilon and delta? Epsilon and delta, as two quantities, are the performance parameters of learning. There are two things that I want to make sure of. I want to make sure that E in tracks E out.
The level of tracking is epsilon; that's the approximation parameter. Now, I cannot guarantee that statement absolutely; I can only guarantee it in a probabilistic sense. But I'd like that probability to be as high as possible, so the probability that the statement doesn't hold should be small, and that is delta, the probability parameter. There are always these two quantities, and that is an integral part of this type of analysis. I think we have an in-house question. "I wanted to know: what is the effect of the error measure on the number of points that we have to choose?" OK. As you can see from the VC analysis, the error measure has always been a probability of error; it's a binary error. When you go to other co-domains, real-valued or otherwise, or to other error measures, you need to modify these things: some variances will come in, and some other aspects. For example, the binary error measure happens to be bounded, so you never worry about the variance, because there is an upper bound on it. If you talk about, say, mean squared error, then depending on the probability distribution you put on things, this could be very big. So you first need the variance to be finite, and then the actual value of the variance will enter these inequalities as you go through them. However, the reason I didn't venture into that is very simple: there is really nothing added conceptually. And as you can see from the use we are making of the theory, we are not going to go back, unravel the mathematics, and apply it to a practical situation. We borrowed the following: a finite VC dimension means we can learn; the number of examples needed is proportional to the VC dimension; and the rest are rules of thumb. That's where we stand. So it's not worth sweating bullets over the other technicalities, when this is the message we are getting, and that message holds intact in the other situations.
"A question about the bound: you usually say the VC dimension is known. Is it true that for most hypothesis sets it is really known?" Getting the VC dimension exactly is the exception, not the rule, as I mentioned. Getting it for perceptrons is really a great achievement, because the perceptron is a real model that you use, and we know its VC dimension exactly. When you go to a neural network, we will get a VC dimension estimate: we can say the VC dimension cannot be more than such-and-such, for the same reason we had when we talked about parameters versus the effective number of parameters. In a neural network, the parameters go from one layer to another, and there will be some cancellation or redundancy; you can't really keep track of those redundancies exactly, so you settle for saying it cannot be more than the number of parameters, even though that count overestimates. So in many cases, the VC dimension is estimated by a bound. But then again, we are already working with a bound: even if you know the VC dimension exactly, it's not as if we know what the generalization error will be; we know a bound on the generalization error. So in this series of logical developments, we get a bound on a bound on a bound on a bound. By the time we are done, the bound is so loose that, in absolute value, it's really not indicative at all. But the good news is that in relative value, it maintains its conceptual meaning. We can use it as a guide to compare models and to get a general number of examples. Notwithstanding all that, if you decide to say: OK, I'm going to use a perceptron in two dimensions, I want epsilon to be 0.1 and delta to be 0.05, could you please tell me how many examples I need? If you actually go and solve the VC inequality for that, the bound will be ridiculously high, much higher than you actually need in practice. So you don't use it as an absolute indication.
You use it only as a relative indication. "Did we come across any interesting examples where n has to be much bigger than 10 times the VC dimension?" Well, the interesting example is when the customer is very finicky and wants a smaller epsilon and delta, because the smaller epsilon and delta are, the more examples you need. The rule of thumb is not telling you to use exactly 10 times the VC dimension; it tells you that you are in the thick of the game when you have 10 times the VC dimension. Now we are talking: there is actually generalization at a certain level, some compromise between epsilon and delta. From there you can tighten the screws and try to get it better. So this is just the rule of thumb for getting into the interesting region of the VC inequality, and it has stood the test of time. "Is there a relation between this material and the topic of design of experiments, and the number of experiments you require to achieve a certain confidence?" There is a relationship, and with experimental design and such there are lots of commonalities. There you have control over certain things that you may not have here, but some of the principles definitely extend to it. As I mentioned when we talked about the premise of learning, it is so general that it would not be a surprise at all that many of the concepts go and tackle situations that are not strictly learning but have the same theme as learning. I think that's it. OK. So we will see you on Thursday.