Welcome back. Last time, we covered some important notions, and the first one was dichotomies. The way to think of it is that there could be a full hypothesis behind the scenes, say a perceptron boundary separating blue points from red points, but we don't get to see that. What we should see are just the data points, holes in that sheet if you will, and there could be very exciting stuff happening behind that sheet; all you get to see is when the boundary crosses one of these points and blue points turn red or vice versa. So if you think of the purpose for the dichotomies: we had a problem with counting the number of hypotheses, because we ended up with a very large number. But if you restrict your attention to the dichotomies, which are the hypotheses restricted to a finite set of points, the blue and red points here, then you don't have to count everything that is happening outside; you only count it as different when something different happens on those points. So a dichotomy is a mini-hypothesis, if you will, and it counts the hypotheses only on the finite set of points. This resulted in a definition that parallels the number of hypotheses, which is the number of dichotomies in this case. So we defined the growth function. The growth function is: you pick the points x1 up to xN, and you pick them wisely, with a view to maximizing the dichotomies, such that the number you get will be more than any number another person gets with N points. That's the purpose. So you take your hypothesis set, which applies to the entire input space, and apply it only to x1 up to xN. This will result in a pattern of plus or minus 1's, N of them, and as you vary the hypothesis within the set, you will get another pattern. So you will get a set of different patterns, which are all the dichotomies that can be generated by this hypothesis set on this set of points, and the number of those is what we are interested in. It will play the role of the number of hypotheses, and that is the growth function. Now in principle, the growth function can be 2 to the N.
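The growth-function definition above can be made concrete with a small sketch. As an illustration I use a simple hypothesis set, positive rays on the real line (h(x) = +1 for x > a, else -1), which this lecture returns to later; the function name and sample points are my own, not from the lecture.

```python
def dichotomies_positive_rays(points):
    """Enumerate the distinct +/-1 patterns that positive rays
    h(x) = +1 for x > a, else -1, generate on the given points."""
    xs = sorted(points)
    patterns = set()
    # One threshold below all points, one between each adjacent pair,
    # and one above all points cover every distinct dichotomy.
    thresholds = [xs[0] - 1.0]
    thresholds += [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    thresholds += [xs[-1] + 1.0]
    for a in thresholds:
        patterns.add(tuple(+1 if x > a else -1 for x in xs))
    return patterns

# The growth function of positive rays turns out to be m_H(N) = N + 1.
print(len(dichotomies_positive_rays([0.5, 1.7, 2.2, 3.9])))  # -> 5
```

Varying the hypothesis (the threshold a) while looking only at the N points collapses the infinite hypothesis set to finitely many patterns, which is exactly the counting the growth function performs.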
You may be in an input space and a hypothesis set such that you can generate any pattern you want. However, in most cases the restriction of using hypotheses coming from H will result in missing out on some of the patterns; some patterns will simply be impossible. And that led us to the idea of a breakpoint. For the case of a perceptron in 2 dimensions, which is the case we studied, we realized that for 4 points there will always be a pattern that cannot be realized by a perceptron. There is no way to have a line come here and separate those red points from the blue points, and any choice of 4 points will also result in missing patterns. Therefore the number k equals 4, in this case, is defined as a breakpoint for the perceptrons. And our theoretical goal is to take that single number, the breakpoint, and be able to characterize the entire growth function for every N, and therefore be able to characterize generalization, as we will see. We then talked about the maximum number of dichotomies under the constraint that there is a breakpoint, and we had an illustrative example to tell you that when you tell me that you cannot get all patterns on, in this case, any 2 points, that is a very strong restriction on the number of dichotomies you can get on a larger number of points. So this is the simplest case: if you take any 2 columns, you could in principle get all 4 patterns. I am telling you that the hypothesis set has a breakpoint of 2, and then I am asking you, under those constraints, how many rows you can get, how many different patterns you can get. And you go and you add them up, and you end up in this case with only 4. So you lost half of them. And you can see that if we have 10 points and you apply the same restriction, there will be so many lost, because now the restriction applies to any pair of points. Now, this table does not appeal to any particulars. In a real hypothesis set and input space, other than the fact that the breakpoint is 2, I could be in a situation
where the hypothesis set cannot generate some of these patterns for other reasons. But here I'm abstracting away the hypothesis set and the input space; I don't want to bother about them anymore. Just tell me that they have a breakpoint, and I'm trying to find, under that single constraint, how many patterns I can possibly have. And that combinatorial constraint alone turns out to be strong enough to get me a good enough result. That is good, because now I don't have to worry about the particulars of every hypothesis set and every input space you give me. I just ask you what the breakpoint is, and I can make a statement about the growth function, nothing more than that. That is the key. So let's go to today's lecture. We continue with the theory of generalization, and today's lecture is the most theoretical of the entire course. We have two items on the agenda. The first is to show that the growth function with a breakpoint is indeed polynomial. The second is to show that we can take that notion, the growth function, and put it in place of capital M, the number of hypotheses, in Hoeffding's inequality. So the first part says it's worthwhile to study the growth function, because being polynomial will be very advantageous; and the second one says we can actually do something good with it: we can do the replacement. These are the only two items. Let's start. So we are going to bound the growth function by a polynomial, and I just wanted to point out some aspects of that. If I say m_H(N) is polynomial, it's not that I'm going to actually solve for the growth function and show that it is this particular polynomial with these coefficients. All I am saying is that it is bounded above by a polynomial. So I don't have to get the particulars of m_H(N), the growth function. I'm just going to tell you that it is less than something, less than something, less than a polynomial.
That's all I need, because eventually I'm going to put this in the Hoeffding inequality, and as long as it's bounded by a polynomial, I am in business, because the negative exponential will kill it, as we discussed, and we are okay. So we can be a bit loose, which is very good in theory, because now you leave out a lot of artifacts that you don't need to study, and just talk about the upper bound in the general case, and still get what you want to get. So the key quantity we are going to use, which is a purely combinatorial quantity, we are going to call B of N and K. This is exactly the quantity we were seeking in the puzzle: I give you N points, I tell you that K is a breakpoint, and I ask you how many different patterns you can get under those conditions. In that case, we had 3 points and the breakpoint was 2, and we answered this question by construction: we played around with the patterns until we got the maximum, and then we said it's 4. Now, as I develop the theory, the puzzle will come up in one of the results. So I would like you to keep an eye out, and say which slide, and which particular part of the slide, addresses the very specific puzzle we talked about. So the definition here is the maximum number of dichotomies on N points, such that they have a breakpoint K. So this is N, and this is K. And the good thing here is that I didn't appeal to any hypothesis set or any input space. This is a purely combinatorial quantity, and because it's a combinatorial quantity, I'm going to be able to pin it down exactly, as it turns out. And when I pin it down exactly, you go and find the fanciest input space and the fanciest hypothesis set, you pick the breakpoint for that, and you use that here, ridding the problem of all the other aspects, and you are still able to make an upper bound statement. You can say that the growth function for the particular case you talked about is less than or equal to this combinatorial quantity. The plan is clear.
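The definition of B(N, K) just given can be checked by brute force for tiny cases, including the puzzle's case of 3 points with breakpoint 2. This is a minimal sketch with names of my own choosing; the search is exponential, so it is only meant for very small N.

```python
from itertools import combinations, product

def has_all_patterns(rows, cols):
    """Do the given columns of this row set exhibit all 2^|cols| patterns?"""
    seen = {tuple(row[c] for c in cols) for row in rows}
    return len(seen) == 2 ** len(cols)

def max_rows(n, k):
    """Brute-force B(n, k): the largest number of distinct +/-1 rows on
    n points such that no k columns show all 2^k possible patterns."""
    all_rows = list(product([-1, +1], repeat=n))
    for size in range(len(all_rows), 0, -1):
        for subset in combinations(all_rows, size):
            if not any(has_all_patterns(subset, cols)
                       for cols in combinations(range(n), k)):
                return size
    return 0

# The puzzle: 3 points, breakpoint 2.
print(max_rows(3, 2))  # -> 4
```

The function literally implements the definition: the maximum over sets of distinct rows, subject to the breakpoint constraint on every choice of k columns.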
So let's look at the bound for B of N and K, and we are going to do it recursively. It's a very cute argument, and I'm going to build it very carefully, so I want your attention. Consider the following table. Very much like the puzzle, we are going to list x1, x2, up to xN, capital N points, which used to be 3 points, and I'm going to try to put as many patterns as I can under the constraint that there is a breakpoint. So I will be putting the first pattern this way, and the second pattern, and so on, trying to fill the table. Now I'm going to do a structural analysis of this, and this will happen through this division. So look at it. Still the same problem: x1 up to xN are my points, and I'm trying to fill this with as many rows as possible under the constraint of a breakpoint. But now I'm going to isolate the last point. Why am I isolating the last point? Because I want a recursion. I want to be able to relate this fellow to the same fellow applied to smaller quantities. And you have seen enough of that to realize that if I manage to do that, I might be able to actually solve for B of N and K. That's why I'm isolating the last point. So after I do the isolation, I am going to group the rows of the big matrix into some groups. This is just my way of looking at things; I haven't changed anything. What I'm going to do is shuffle the rows around after you have constructed them. So we have a full matrix now, and I'm shuffling the rows and putting some of them in the first group. And the first group I'm going to call S1. Here is the definition of the group S1: these are the rows that appear only once as far as x1 up to xN minus 1 are concerned. Well, every row in its entirety appears only once, because these are different rows; that's how I'm constructing the matrix. But if you take out the last point, it is conceivable that the first N minus 1 coordinates happen twice: once with extension minus 1, once with extension plus 1.
So I'm taking the rows that go with only one extension, whatever it might be (it could be minus 1 or it could be plus 1, but not both), and putting them in this group. Fairly well defined. So you fill it up, and these are all the rows that have a single extension. Now you go under this and define the number of rows in this group to be alpha. It is a number; I'm just going to call it alpha. And you can see where this is going, because now I'm going to claim that B of N and K, which is the total number of rows in the entire matrix, is alpha plus something. That is obvious: I have already taken care of alpha, and I'm going to add up the other stuff later on. So what is the other stuff? The other stuff I'm going to call S2, and you probably have a good guess what these are. These are the rows that happen with both extensions; that is, they happen with extension plus 1 and with extension minus 1. That is disjoint from the first group. So a typical member will look like this: this is the same row from x1 up to xN minus 1 as it appears here; it just appears here with plus 1 and appears here with minus 1. And I keep doing it. So what I'm doing is just reorganizing the rows of the matrix to fall into these nice categories. The other row, exactly the same thing; the second one corresponds to the second one, and so on. Now that covers all the rows. When I look at x1 up to xN minus 1, I either have both extensions or one extension. That's it. One extension belongs to the first group; two extensions belong to the second group, both ways, with plus 1 and minus 1. So in terms of the accounting, this has beta rows, whatever beta might be. This also has beta rows, because they are identical. And therefore the number B of N and K, which I'm interested in, is alpha plus 2 beta. That is complete; I'm just calling things names. So now I am going to try to find a handle on alpha and beta, so that I can find a recursion for the big function B of N and K.
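The row grouping just described can be sketched directly in code: given any set of distinct rows, split them by whether their first N minus 1 coordinates occur with one extension or with both. The four example rows below are one valid maximal solution of the 3-point, breakpoint-2 puzzle that I chose for illustration; they are not necessarily the lecture's.

```python
from collections import defaultdict

def split_by_extension(rows):
    """Group distinct rows by their first n-1 coordinates: S1 holds the
    prefixes occurring with a single extension (xN = +1 or -1, not both),
    S2 holds the prefixes occurring with both extensions."""
    extensions = defaultdict(list)
    for row in rows:
        extensions[row[:-1]].append(row[-1])
    s1 = [p for p, exts in extensions.items() if len(exts) == 1]
    s2 = [p for p, exts in extensions.items() if len(exts) == 2]
    return s1, s2

# Four rows on 3 points with no pair of columns showing all 4 patterns.
rows = [(-1, -1, -1), (-1, -1, +1), (-1, +1, -1), (+1, -1, -1)]
s1, s2 = split_by_extension(rows)
alpha, beta = len(s1), len(s2)
print(alpha, beta, alpha + 2 * beta)  # -> 2 1 4
```

The accounting comes out as in the lecture: the total number of rows equals alpha plus 2 beta.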
B of N and K is the maximum number of rows, patterns, I can get on N points such that no K columns have all possible patterns. That's the definition. I'm going to relate that to the same quantity on smaller numbers: smaller N and smaller K. So the first step is to estimate alpha and beta. I'd like to ask you to focus on the x1 up to xN minus 1 columns, and I'm going to help you visually do that by graying out the rest. Now for a moment, look at these. Are these rows different? They used to be different when you had the extension. Well, let me see. The first group, I know they are different, because they have one extension: if one of them were repeated, it would have to be repeated with both extensions in order to give different rows overall, and that violates the condition for being in this group. They are here because they have only one extension. These rows, on the other hand, are the same: this one appears with minus 1 and this one appears with plus 1, but if you cut out the last point, this one is identical to this one, and the second one is identical to the second one. So I cannot count these as different rows. I can do that when I gray out one of the groups. Now these are patently different: nothing here is repeated, because we said they have only one extension, and they are all tucked in here. And here, there are no two rows that are equal, because they all have the same extension, and the whole rows were different to begin with. So these rows are different from each other. And these rows are different from those, because again, if they were equal, then I would have an extension, and the rows here would then belong to a prefix that had both extensions. Very easy; it's just a simple process-of-elimination argument, but we end up with all of these rows being different. Now, I like the fact that these rows are different, because when they are different, I can relate them to B of N and K. B of N and K was the maximum number of rows, patterns, different rows; that's how I'm counting them.
Such that a condition occurs. So what is the condition that is occurring here? I can say that alpha plus beta is the total number of rows, or patterns, in this mini matrix. Can I say something about a breakpoint for this small matrix? Yes. In the original matrix, I could not find all possible patterns on any K columns, right? So I cannot possibly find all possible patterns on any K columns in this smaller set, because if I found all possible patterns on K columns here, they would serve as all possible patterns in the big matrix, and I know that doesn't exist. So I can now confidently say that alpha plus beta, which is the number of different rows here, is less than or equal to B of N minus 1 (because I have only x1 up to xN minus 1) and K (because that is the breakpoint for these as well). Why am I saying less than or equal to, not equal? When I constructed the original matrix, it was equal by construction: I looked at the maximum number of rows I could get, I told you this is what I constructed, and therefore by definition this is B of N and K. Here I obtained the matrix in a special way: I took a point out of the other matrix and did all that. I am not sure that this is the best way to maximize the number of rows; it is conceivable that it's not. But for sure it's at most B of N minus 1 and K, because that is the maximum number. So I am safe saying that it's less than or equal to. Good. So I have the first one. Now let's try to estimate beta by itself. This is the more subtle argument. In this case we are going to focus on the second part only, the S2 part: the rows that appear twice in the big matrix. So let's focus on them. When I focus on them, these rows are very easy to analyze: they are here and here exactly the same, so this block is identical to this block. Now, the interesting thing when I look at these rows is that I am going to be able to argue that they have a breakpoint of K minus 1, not K. The argument is very cute.
Let's say that you have all possible patterns on K minus 1 points in this small matrix. First, I have to remove the duplicates; these are not all different rows, because these are identical to these. So let me reduce it to the rows that are patently different. So I'm now looking at this matrix, and I'm claiming that K minus 1 is a breakpoint here. Why is that? Because if you had K minus 1 columns here on which you get all possible patterns, then by adding both copies, plus 1 and minus 1, and adding xN, you would be getting K columns overall that have all possible patterns, which you know you cannot have, because K is a breakpoint for the whole thing. So now I'm taking advantage of the fact that these rows repeat. It's very dangerous to have all patterns on K minus 1 columns, because then I can build the K columns that I know cannot exist. So let's do it illustratively. Here's a pattern here. You add the plus 1 extension and the minus 1 extension by taking this column, and you add this point. Then you have both patterns here, and you will end up with all possible patterns on K points in the overall matrix. So that enables me to count this in terms of B of N and K again, with the proper values of N and K. We can say that beta is less than or equal to; again less than or equal to, because I obtained this matrix by lots of eliminations. I didn't construct it deliberately to maximize the number, so I don't know whether it's the maximum, but I sure know that it's at most the maximum, by the definition of what the maximum is. And that would be B of what? I have N minus 1 points, and I argued for a breakpoint of K minus 1, so I end up with B of N minus 1 and K minus 1. Both arguments are very simple. Now we pull the rabbit out of the hat. You put it together. What do we have? This is the full matrix. The first item was just calling things names: the number of rows in the big matrix is B of N and K, by definition, by construction.
I organized it such that there is alpha, and there is beta, and there is another beta. So this is the first result I got, which is: B of N and K equals alpha plus 2 beta. What else did I get? I got that alpha plus beta is at most B of N minus 1 and K. That was the first slide of the analysis; we have seen that. This basically takes this matrix, does an analysis on it, and it has breakpoint K, because K is inherited from the bigger matrix. That's what we did. The other one is: beta is less than or equal to B of N minus 1 and K minus 1. And this is the case where I only looked at this part, and I had to be more restrictive in terms of all possible patterns, because I have an extension to add, and I would be violating the big constraint. So I ended up with this being less than or equal to B of N minus 1 and K minus 1. Anybody notice anything in this slide? How convenient: I have alpha plus 2 beta there, and I have alpha plus beta in one and beta in the other. If I add them, I am in business. I can now relate B of N and K to other B's with smaller arguments, as alpha and beta are gone. So B of N and K, I now know, has to be at most B of N minus 1 and K plus B of N minus 1 and K minus 1. So you can see where the recursion is going. Now that I know this property holds for B of N and K, all I need to do is solve it in order to find an actual numerical value for B of N and K, and that numerical value will serve as an upper bound for the growth function of any hypothesis set that has a breakpoint K. Let's do the numerical computation first. So I have this recursion, and I can see that from smaller values of N and K I can get bigger values, or rather an upper bound on the bigger values. So let's do it in a table. Here is a table. This is the value of N: 1, 2, and so on; this is the number of points, the number of columns in the matrix. And this is K; this is the breakpoint I'm talking about. So there is a breakpoint of 1, breakpoint 2, breakpoint 3, etc.
And what I'd like to do is fill this table with an upper bound on B of N and K. I'd like to put numbers here such that I know B of N and K can be at most that number. And we can construct this table very, very easily using the recursion. So here's what we do. First, I fill in the boundary conditions. So let's look at this. Here it says that there is a breakpoint of 1: I cannot get all possible patterns on one point. Well, what are all possible patterns on one point? That's minus 1 and plus 1; it's one point. So I cannot get both minus 1 and plus 1. That's a pretty heavy restriction. So I'm asking myself: let's say you now have N columns in the matrix; how many different rows can you get in that matrix under that constraint? Well, I'm in trouble, because if I have the first pattern and then I put a second pattern, the second pattern must be different from the first one in at least one column. That's what makes it different; if it's identical in every column, then it's not a different pattern. Right? So you go to that point where it's different, and unfortunately, for that point, you get both possible patterns. So you are stuck: we can only have one pattern under this constraint. Hence the ones: 1, 1, 1, 1, 1, 1, 1. That's good. Now in the other direction it's also easy; in this case it's two, and it's very easy to argue. Now I'm taking the case where I have only one column, and I'm asking myself how many patterns I can get on one column. Well, the most is two. Why am I getting twos here? Because in the upper diagonal of this table, the constraint I am putting is vacuous. Here, for example, I am asking: how many different patterns can you get on one point, such that no four points have all possible patterns? Four points? What are you talking about? You have only one point. So that's no constraint at all. Therefore it doesn't restrict the choices, and the maximum number is the maximum I would get unrestricted, which happens to be two.
If I have one point, I get two patterns. That's why you have the twos sitting here. Now I have covered the boundary conditions, and that's really all I need to complete the entire table, given the nature of the recursion I have. Why is that? Because the recursion looks like this: if you know the solid blue points, I can tell you the empty blue point. Look at N and K: this is N and K, this would be N minus 1 and K, and this would be N minus 1 and K minus 1. That's exactly what the recursion says, right? So if I have these two values, I can get a value here which would be an upper bound on this one. So let us actually go through this table and fill it up. The first one I'm going to take is this: from these two, according to this shape, I can get this fellow. What would that fellow be? Three; you just add the two numbers. How about the next one? Anybody have a guess? Four. And then a bunch of fours, because you always add two twos. I'm actually happy about this, because you see that when K grows much bigger than N, as we said, the constraint is vacuous, so I should be getting all possible patterns on the number of points I have. And as you can see, for one point I get the twos, for two points I eventually get the fours, and for three it will be the eights. So that is very nice. Let's go for the next row. Can I solve this one? Now that I've got this one, I can become more sophisticated and get this one. See where this came from? How about the next one? What would that be? That should be seven, right? And then eight, and a bunch of eights. This is kind of fun, and you can fill up the entire table. So it's completely solved numerically. It would be nice to have a formula, which we will have in a minute, but numerically we have it. Now let me highlight one entry. Do you see anything that changed colors? I claim that you have seen this before. That's the puzzle. Right? You had three points; your breakpoint was two.
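The table just filled in can be computed directly from the boundary conditions and the recursion; a minimal sketch, with the function name my own:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def B_upper(n, k):
    """Upper bound on B(N, K) from the boundary conditions and the
    recursion B(N, K) <= B(N-1, K) + B(N-1, K-1)."""
    if k == 1:
        return 1   # breakpoint 1: only a single row is possible
    if n == 1:
        return 2   # one point, K >= 2: the constraint is vacuous
    return B_upper(n - 1, k) + B_upper(n - 1, k - 1)

# Reproduce the table rows for N = 1..5 and K = 1..4.
for n in range(1, 6):
    print([B_upper(n, k) for k in range(1, 5)])
# The highlighted entry, B_upper(3, 2) == 4, is the puzzle's answer.
```

Each interior entry is just the sum of the entry above it and the entry above and to the left, exactly as in the lecture's table.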
And now we know for a fact that the maximum number you can get is four, without having to go through the entire torture we went through last time. Can we try this? Can we try that? Oh, this one violates the constraint. You don't have to do any of that. Here are the numbers, just by computing a very simple recursion. So now let's go for the analytic version. What I'd like to do is find a formula that computes this number outright, so that I don't have to go through the computation numerically. So let's do that. This is the analytic solution for B of N and K. Again, this is the recursion. And now we have a theorem. Yeah! When you are doing mathematical theory, you have to have theorems; otherwise you lose your qualifications. So what does the theorem say? It tells you that this formula is an upper bound for B of N and K. What is this formula? This is N choose i, the combinatorial quantity, and you sum it up from i equals 0 to K minus 1. So both N and K appear: N appears as the number here, and K appears in the limit for the index of summation; it appears as K minus 1. So this quantity will be an upper bound for B of N and K. So if you believe that, which we will argue in a moment, you compute this number, and that will be an upper bound for the growth function of any hypothesis set that has a breakpoint K, without asking any questions whatsoever about the hypothesis set or the input space. Now, it shouldn't come as a surprise that the quantity has this form, because if you look at the recursion, it is really screaming for something binomial or combinatorial; clearly it would come out one way or the other. But why is it exactly this? Well, we are going to show that this is exactly the quantity we computed numerically in the table, and we are going to do it by induction. So the recursion we did numerically, we are just going to do analytically. So how do you do that?
You start with boundary conditions. What were the boundary conditions? We argued that this is indeed the value of B of N and K, and hence an upper bound on it, from the last slide. So now we want to verify that the formula actually returns those numbers when you plug in the value N equals 1 or K equals 1. How do I do that? You just do it; plug in and it will come out. I'm not even going to bother doing it. It's a very simple formula; you just evaluate it and you get that. The interesting part is the recursion. So now I would like to argue that if the formula holds for the solid blue points, then it will also hold for this point. And then by induction, since it holds for all of these points, I can do this step by step and fill the table, with the formula being the correct value for the numbers that appear. Everybody is clear about the logic of it. So let's do the induction. We have the induction step, and we just want to make clear what the induction step is. You assume that the formula holds for this point and this point. So indeed, if you plug in the values for N and K, which here are N minus 1 and K minus 1, and here are N minus 1 and K, into that particular formula, then the numbers will be correct. That's the assumption. And then you prove that if this is true, then the formula will hold here. That's the induction step. Fair enough. So let's do that. This is the formula for N and K. You just need to remember it: N appears here and K appears here; the minus 1 is an integral part of the formula. So this is the value for K, not for K minus 1; the value for K happens to be the sum from i equals 0 to K minus 1. So this is the formula that is supposed to be here. And we would like to argue that this is equal to... what is this one? This one is for N minus 1 and still K. So I moved from here to here; this would be the value here.
And what is the other guy? That would be the value for N minus 1, and now for K minus 1, because you still take the extra minus 1, so it becomes K minus 2. So this part belongs here. So this is the induction step. We don't have it yet; that's what we want to establish, so let me put a question mark to make sure we haven't established it yet. What I'm going to do is take the right-hand side and keep reducing it until the left-hand side appears. That's all. Then we'll be done with the induction step, and since we have the boundary conditions, we will have proved the theorem we asserted. OK. The first thing I'm going to do is look at this fellow, and I notice that the index goes from i equals 0 to K minus 1; here it goes from i equals 0 to K minus 2. I'd like to merge the two summations. So in order to merge them, I will first make them have the same number of terms. Very easy: I will just take out the zeroth term, which would be N minus 1 choose 0, which is 1, and now the summation goes from i equals 1 to K minus 1. Now I go to the other guy and do this. So what did I do? I just changed the name of the dummy variable i. I want the index to go from 1 to K minus 1, in order to be able to merge easily. Here it goes from 0 to K minus 2. So what do I do? I make this i, and make this i minus 1. So i minus 1 goes from 0 to K minus 2, as i used to. Just changing names. And now, having done that, I'm ready to merge the two summations. And they are merged. So now I would like to take this and produce one quantity, and you can do it by brute force. This is no mysterious quantity: this is N minus 1 times N minus 2 times N minus 3, and so on, i terms, divided by i factorial, and the same applies to the other one. So you end up with something, and then you do all kinds of algebra, and it looks familiar, and then you reduce it to another quantity.
So there's always an algebraic way of reducing it, but I'm going to reduce it with a very simple combinatorial argument. I'm going to claim that the 1 remains the same, and the whole thing here, N minus 1 choose i plus N minus 1 choose i minus 1, reduces to N choose i. So these two become this one. Instead of doing the algebra, I'm going to give you a combinatorial argument that this quantity is identical to N choose i. Let's say that I'm trying to choose 10 people from this room, and let's say that the room has capital N people. How many ways can you choose 10 people out of this room? That is N choose 10. Let's put this on the side. Here is another way of counting it. We can count the number of ways you can pick 10 people excluding me, plus the number of ways you can pick 10 people including me. Right? These are disjoint, and they cover all the cases. Let's look at excluding me. How many ways can you pick 10 people from the room excluding me? Well, then you are picking the 10 people from N minus 1; I am the minus 1. So that would be N minus 1 choose 10. Put this in the bank. How many ways can you pick 10 people including me? Well, you already decided you are including me, so you are only deciding on the 9 remaining people. So that would be N minus 1 choose 9. So we have N minus 1 choose 10, plus N minus 1 choose 9, and that equals the original number, which was N choose 10. Look at this. What do we have? This is excluding me, this is including me, and this is the original count. So it's a combinatorial identity, and we don't have to go through the torture of the algebra to prove that it's exactly the same. So now I go back and look: this goes from i equals 1 to K minus 1, and I have this 1. So I conveniently put it back and get this formula. Have you seen it before? Yeah, it looks familiar. Oh, this is the one we want to prove. So it means that we are done. That's it.
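The induction just completed says the closed form agrees with the recursion table everywhere; that claim can be sanity-checked numerically. A sketch, with function names my own:

```python
from math import comb

def bound_formula(n, k):
    """Closed form from the theorem: sum of C(n, i) for i = 0 .. k-1."""
    return sum(comb(n, i) for i in range(k))

def bound_recursive(n, k):
    """The same quantity built from the boundary conditions and recursion."""
    if k == 1:
        return 1
    if n == 1:
        return 2
    return bound_recursive(n - 1, k) + bound_recursive(n - 1, k - 1)

# The two agree everywhere (the closed form satisfies the same recursion
# with equality, which is what the combinatorial identity shows).
assert all(bound_formula(n, k) == bound_recursive(n, k)
           for n in range(1, 12) for k in range(1, 12))
print("closed form matches the recursion table")
```

Pascal's identity, C(n-1, i) + C(n-1, i-1) = C(n, i), is precisely the "excluding me plus including me" count, which is why the two functions produce identical tables.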
We have an exact solution for the upper bound on B of N and K. Now, since we spent some time developing it, let's look at it, celebrate it, and be happy about it. So the first thing: yes, it's a polynomial, because all of this torture was to get a polynomial. Right? If we did all of this, and it's perfect math, and the end result was not a polynomial, then we would be in trouble, because although the quantity would be correct, it would not be useful for the purpose we are aiming at. So why is it polynomial? Remember that for a particular hypothesis set, the breakpoint is a fixed number; it doesn't grow with N. You ask a hypothesis set: can I get all possible dichotomies on four points? That's a question for the perceptron. No. Then four is a breakpoint for the perceptron. Now I can ask what the perceptron does on 100 points, and the breakpoint is still four. Just a constant. So you give me a hypothesis set, I give you a breakpoint; that's a fixed number. So according to our argument now, the growth function for a hypothesis set that has a breakpoint K is less than or equal to the purely combinatorial quantity B of N and K, which is defined as the maximum number of dichotomies you can get under the constraint that K is a breakpoint. And that was less than or equal to the nice formula we had. So we can now make this statement. You go into a real learning situation, let's say you have a neural network making a decision, and you tell me the breakpoint for that neural network is 17. I don't ask you what a neural network is, because we don't know yet, so you don't have to know. I don't ask you about the dimensionality of the Euclidean space you are working in. You told me 17. The growth function of your neural network, which I don't know, in the space that I don't know, happens to be less than or equal to that, and I know that I'm correct. So is this quantity polynomial in N? That's what we need.
Because remember, in the Hoeffding inequality there was a negative exponential in N. If we get this to be polynomial in N, we are in business. Well, any one of those terms is what? N times N minus 1 times N minus 2, and so on, i factors, divided by i factorial. The i factorial doesn't matter; it's a constant. So you basically get N multiplied by itself a number of times, i times for the i-th term. So the most that N will be multiplied by itself is when you get to i equals k minus 1, the maximum, and then N will be multiplied by itself k minus 1 times. Therefore the maximum power in this quantity is N to the k minus 1. This comes from N times N minus 1 times N minus 2, and so on, k minus 1 times; that corresponds to the case where i equals k minus 1. So when you get N choose k minus 1, that's what you get. Anything else will give you a power of N, but a smaller power. So this is the most you will have. What do we know about this fellow k? We know it's just a number. It's a constant. It doesn't change with N. And therefore this is indeed a polynomial in N, and we have achieved what we set out to do. OK. That is pretty good. So let's take three examples in order to relate this to experiences we had before. This is the famous quantity by now. You know it by heart. I have the N, I remember k, I have to put minus 1, and that is the upper bound for anything that has a breakpoint k. So now let's take hypothesis sets we looked at before, for which we computed the growth function explicitly, and see if they actually satisfy the bound. I mean, they had better, because this is math. We proved it. But just to see that this is indeed the case. OK. Positive rays. Remember positive rays from some time ago? These were one-dimensional. So that's the real line, and from a point on it goes to plus 1; before it, minus 1. And we said that the whole analysis here is meant exactly to avoid what I just did. You don't have to tell me what the input space is. You don't have to tell me what the hypothesis set is. You just have to tell me the breakpoint.
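The "maximum power is N to the k minus 1" argument can be checked numerically; a small sketch, assuming the standard sharpening that the sum never exceeds N to the K minus 1 plus one:

```python
from math import comb

K = 4  # a fixed breakpoint, as for the 2D perceptron
for N in range(2, 500):
    bound = sum(comb(N, i) for i in range(K))
    # the sum is dominated by its largest term C(N, K-1), so it is O(N^(K-1));
    # numerically it stays at or below N^(K-1) + 1
    assert bound <= N ** (K - 1) + 1
print("polynomial growth confirmed for K =", K)
```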
That's all we want. OK. So you can call it positive rays, you can call it George, I don't care. It has a breakpoint of 2. That's all I ask. OK? Now, we did compute the growth function for the positive rays. We did it by brute force: we looked at it, saw what the patterns are, and did a combinatorial argument. And we ended up with the growth function for these guys being N plus 1. Let us see if this satisfies the bound. So this is supposedly less than or equal to the formula, and you substitute here capital N, which is the number of points, and the breakpoint k, which is 2. So you're summing from i equals 0 to 1. So you have N choose 0, also known as 1, plus N choose 1, also known as N. That's it. So you get that the growth function is less than or equal to N plus 1. Wow. Look at the analysis we did to get the N plus 1, and we get exactly the same thing from the bound, where we thought there would be a big slack. But here actually it's exactly tight. We get the same answer exactly, without looking at anything of the geometry of what the hypothesis set was. OK. Let's try another one. Maybe we'll continue to be lucky. Positive intervals. Yeah, I remember, those were the more sophisticated ones. I'm sorry, I'm not supposed to ask any questions about the hypothesis set. I'm asking about the breakpoint only. OK, I remember now. So tell me what the breakpoint is. That was k equals 3. And we did compute the growth function. Remember, this one was a funny one: we were picking two of the N plus 1 segments, and then adding the one remaining case. So this would be the formula. What would be the bound according to the result we just had? This would again be the formula, and now k equals 3. So I have N choose 0 plus N choose 1 plus N choose 2. So I get 1 plus N plus something that has square terms, and I do the reduction, and what do I get? Exactly the same thing again. Boring, boring. I seem to be getting it all the time. OK. It doesn't always happen this way.
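Both checks from the lecture, positive rays and positive intervals, can be replayed in a few lines; a sketch (the function name is mine):

```python
from math import comb

def growth_bound(N, K):
    # sum_{i=0}^{K-1} C(N, i): the bound for a hypothesis set with breakpoint K
    return sum(comb(N, i) for i in range(K))

for N in range(1, 100):
    # positive rays: growth function N + 1, breakpoint 2; the bound is tight
    assert N + 1 == growth_bound(N, 2)
    # positive intervals: growth function C(N+1, 2) + 1, breakpoint 3; tight again
    assert comb(N + 1, 2) + 1 == growth_bound(N, 3)
print("both bounds are exactly tight")
```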
I mean, it will always happen that the bound is true, but there will be a slack in many cases. OK. So now we verified it. We are very comfortable with the result. Let's apply it to something where we could not get the growth function. Remember this old fellow? In the two-dimensional perceptron, we went through a full argument just to prove that the breakpoint is 4. But we didn't bother to go through a general number of points N and ask ourselves how many dichotomies the perceptron can generate on N points. Can you imagine the torture? You do this: can I get this pattern? And you have to do this for every N. So we didn't do it. So the growth function is unknown to us. We just know the breakpoint. But using just that fact, we are able to bound the growth function completely. You substitute again with k equals 4. You get another term, which is cubic. And you do the reduction, and lo and behold, you have that statement. That statement holds for perceptrons in two dimensions. And you can see that this will apply to anything. So now it was all worth the trouble, because now we have a very simple characterization of hypothesis sets, and we can take this and move to the other part. Remember? This part, which has now disappeared, was proving that the bound is polynomial. That's why we were interested in the growth function; if it wasn't polynomial, we wouldn't be interested in it. So now, OK, this is an interesting quantity. And by the way, it's not only interesting, we can actually use it. We can put it in the Hoeffding inequality and claim that the Hoeffding inequality is true using the growth function. Now, let me remind you of the context of substituting the growth function for the total number of hypotheses. We had this fellow: this is Hoeffding, and this is the number of hypotheses, using the union bound, which we said is next to useless whenever M is big or M is infinite.
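For the two-dimensional perceptron we only know the breakpoint, 4, so the most we can check is that any count of dichotomies we actually find stays under the cubic bound. A rough sketch that samples random perceptrons on 5 points (random search undercounts the true number, but whatever it finds must respect the bound):

```python
import random
from math import comb

def growth_bound(N, K):
    return sum(comb(N, i) for i in range(K))

random.seed(0)
pts = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(5)]

# sample many random 2D perceptrons and record the dichotomies they realize
seen = set()
for _ in range(100000):
    w0, w1, w2 = (random.uniform(-1, 1) for _ in range(3))
    seen.add(tuple(1 if w0 + w1 * x + w2 * y > 0 else -1 for x, y in pts))

# breakpoint K = 4, N = 5 points: the bound is 1 + 5 + 10 + 10 = 26
assert len(seen) <= growth_bound(5, 4)
print(len(seen), "dichotomies found, bound is", growth_bound(5, 4))
```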
And instead of that, we wanted to substitute the growth function for this. So this is what we are trying to do. We are trying to justify that instead of this, you can actually say that. Well, it turns out that this is not exactly what we are going to say. We are going to modify the constants here, for technical reasons that will become clear, but the essence is the same. There will be the growth function here; it will be polynomial in N, and it will be killed by the negative exponential, provided that there is a breakpoint, any breakpoint. Okay? So now, how are we going to prove that? We are going to have a pictorial proof. What a relief, because I think you are exhausted by now. The formal proof is in the appendix. It's six pages. And my purpose in the presentation here is to give you the key points of the proof, so that you don't get lost in the epsilons and deltas. There are basically certain things you need to establish, and once you know that that's what you are looking for, you can bite the bullet and go through it line by line. So the two aspects are the following. Why did we use this growth function? We use the growth function because it's smaller, so it will be helpful. But how could it possibly replace M? Ah! Because capital M was assuming no overlaps at all in the bad regions. Remember? Now we know that there are overlaps, and this will take care of it. So the question is: how does the growth function actually relate to the overlaps? You need to establish that. So this is the first one. And when we establish that, we find that it's a completely clean argument, everything except for one annoying aspect. The growth function relates to a finite sample. So we will get a perfect handle on E in, the in-sample error part of the deal. But in the Hoeffding inequality there is this E out, and E out relates to the performance over the entire space. So we are no longer talking about dichotomies. We are talking about full hypotheses.
So we lose the benefit of the growth function. So what do we do about E out? That was a question that was asked last time. So what to do about E out, in order to get the argument to conform while we are using just a finite sample, is the second step. And then after that, it's a technical matter of putting it together in order to get the final result. That's the plan. Okay? But the proofs are pictures. Let's say that you are an artist and this is your canvas. It's a very special canvas: the canvas of data sets. What is that? Every point here is an entire data set, X1, X2 up to XN. Fix N in your mind. So this is one vector, this is another vector, this is another vector. And this canvas covers the entire set of possible data sets. Now, why am I using this space? Well, I'm using this space because the event being good or bad, whether E in tracks E out, depends on the sample. It depends on the data set. For some data sets, E in will be close to E out. For some data sets, it is not going to be close. So I want to draw it here in order to look at the bad regions and the overlaps, and then argue why the growth function is useful for the overlaps. Now, we assume that there's a probability distribution, and for simplicity, let's say that the area corresponds to the probability, so the total area of the canvas is one. Now you look at the bad event, that E in is far away from E out, and let's say that you paint the points that correspond to that event red. So you ask: is this data set good or bad? What does it mean, good or bad? I look at E in on that data set, compare it to E out for a particular hypothesis, and then paint the point red if it's bad. So I have a hypothesis in mind, and I'm painting the points here red, or leaving them alone, according to whether they violate Hoeffding's inequality or not. And I get this, just illustratively. And you realize that I didn't paint a lot of area, and that is because of Hoeffding's inequality.
Hoeffding's inequality tells me that that area is small. So I'm entitled to put a small patch. Now, we went from one hypothesis, which is this guy, to the case where we have multiple hypotheses, using the union bound. So again, this is the space of data sets, exactly the same one. And now I am saying, for the first hypothesis you get this bad region. What happens when you have a second hypothesis? Because I'm using the very pessimistic union bound, I am hedging my bets and saying that, okay, you get a bad region that is disjoint. Another hypothesis. Two of them. More. More. Oh no. We are in trouble. The colored area is the bad area. Now the canvas is the bad area. That's why we get the problem with the union bound: obviously, having them disjoint fills up the canvas very quickly. Each of them is small, but I have so many of them, an infinity of them as a matter of fact. This will overflow. No, it won't overflow; just figuratively speaking. So that's what I'm going to have. So now, what is the argument we are applying? We are not applying the union bound. We are going to a new canvas, and that canvas is called the VC bound, as in Vapnik-Chervonenkis. We'll see it in a moment. So what do you do? Your first hypothesis is the same thing. When you take the second hypothesis, you take the overlaps into consideration. So it falls here. You get more. You get all of them. It's not as good as the first one; I never expected it to be. But it is definitely not as bad as the second one, because now they're overlapping. And indeed the total area, which is the bad region, something bad happening, is a small fraction of the whole thing, and I can learn. So we are trying to characterize this overlap. That's the whole deal with the growth function. Now, one way to do it is the one that I alluded to before: study the hypothesis set, study the probability distribution, get the full joint probability distribution of any two events involving two hypotheses.
And then characterize this. Well, good luck. We won't do that. The reason we introduced the growth function is that it's an abstract quantity that is simple, and it's going to characterize the overlaps. So the question is: how is the growth function going to characterize the overlaps? Here is what is going to happen. I am going to tell you that if you look at this canvas, if any point gets painted at all, it will get painted over a hundred times. Let's say that I have that guarantee. I don't know which hypothesis will paint it again, but any point that gets a red will get a blue and a green after it, a hundred times. If I tell you that statement, what do you know about the total area that is colored? Now it's at most one hundredth of what it used to be, because when they were disjoint, they filled the canvas up. Now, for every point that is colored, I have to color it a hundred times. So I am overusing these regions, and they will have to shrink, and I will get one hundredth of that. That is basically the essence of the argument. What the growth function tells you is that, okay, there are only a growth-function number of dichotomies. If you take a dichotomy, this is not the full hypothesis, but the hypothesis on a finite set of points. There are many, many hypotheses that return the exact same dichotomy. Right? Remember the gray sheet with the holes? Lots of stuff can be happening behind the sheet, and as far as I'm concerned, they are all the same dichotomy. So all of these guys will be behaving exactly the same way. If one of them colored the point, the others will. So now this tells me that the redundancy is captured by the growth function. That would be a very clean argument, and it would have been a very simple proof, except for one annoyance: the point being colored doesn't depend only on the sample, but depends also on the entire space, because the point gets colored because it's a bad point. What is a bad point?
It's a point where the frequency on the sample deviates from E out. Oh, E out involves the entire input space. If I have the gray sheet and the holes, I cannot compute E out. I have to peel the sheet off, look, and get the areas in order to get E out. So the argument is great, as long as you can tell me how to go around the presence of E out. And that's the second part of the proof. What to do about E out? The simplest argument possible. This is really the breakthrough that Vapnik and Chervonenkis made. Back to the bin, just because it's an illustration of the binary case. So here we have one hypothesis, and we have E out, which is the error in the entire space. We pick a sample, and then we get E in, which is the value of the error on this sample. We've seen this before. And we said this tracks that according to the Hoeffding inequality, and the problem is that when you have many, many bins, some of these guys will start deviating from E out, to the level that if you pick according to the sample, you are no longer sure that you picked the right one, because the deviation could be big. That was the argument. Now I want to get rid of E out. The way I'm going to do it is this. Instead of picking one sample, I'm going to pick two samples independently. Obviously they are not identical samples; you know, some green and some red, et cetera. But they are coming from the same distribution. Now let's see what is going on. E out and E in track each other, because E in is generated from this distribution. Now let's say I look at these two samples and give them names. I'm going to call this one E in and this one E in dash. They are both in-sample; each happens to be a different sample. My question is: does E in track E in dash? If you have one bin, well, each of them tracks E out. Right? Because it was generated by it.
So consequently they track each other. A bit more loosely, because you now have two ways of getting the sample error. On the other hand, if I do two presidential polls, one polling 3,000 people, another polling a different 3,000 people, you fully expect that the results will be close to each other. Right? So these guys track each other. Fine. What is the advantage? The advantage is the following. If I now have multiple bins, the problem I had here is reflected exactly in the new tracking. That is, when I had multiple bins, the tie between E out and E in became looser and looser, because I'm looking at the worst case, and I might be unlucky enough that the tracking lost the tightness that one bin with Hoeffding would dictate. If I am doing multiple bins, and not looking at the bin at all, just looking at the two samples from each bin, they track each other, but they also get loosely apart as I go for more bins. Let's say I give you this experiment. You pick two samples, and they are close in terms of the fraction of red. If you keep repeating it, can you get one sample to be mostly red and the other sample to be mostly green? Yeah, if you are patient enough, it will happen, exactly for the same reason: you keep looking for it until it happens. So the mathematical ramifications of multiple hypotheses happen here exactly the same way they happen there. The finesse now is that if I characterize it using the two samples, then I am completely in the realm of dichotomies, because now I'm not appealing to E out at all; I am only appealing to what happens in a sample. It's a bigger sample, I have 2N marbles now instead of N, but still I can define a growth function on them, and now the characterization is complete and I'm ready to go. These are the only two components you need to worry about as you read the proof. So now let's put it together.
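The "two presidential polls" intuition is easy to simulate: draw two independent samples from the same bin and watch their in-sample frequencies track each other. A sketch with assumed numbers (mu, the sample size, and the trial count are mine):

```python
import random

random.seed(1)
mu = 0.3        # assumed E_out: the true "red marble" probability for one bin
n = 3000        # like the two polls of 3,000 people each
diffs = []
for _ in range(500):
    # two independent samples from the same distribution
    e_in = sum(random.random() < mu for _ in range(n)) / n
    e_in_dash = sum(random.random() < mu for _ in range(n)) / n
    diffs.append(abs(e_in - e_in_dash))

# E_in and E_in' track each other, a bit more loosely than either tracks E_out
print("largest gap over 500 trials:", max(diffs))
```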
This is what we wanted. It is not true, so don't hold it against me; to make sure, the slide says this is not quite what we have. This would be a direct substitution of the plain-vanilla growth function in place of M. We are not going to have that. Instead we are going to have this. Let's look and compare. This looks the same, except that this 2 became 4. Is this good or bad? Well, it's bad. I want this probability to be small. Bad, but not fatal. This one goes to here: I have twice the sample. You know why I have 2N now? Because I'm using the bigger sample for the argument, so I need 2N. Oh, but all of this was about a polynomial, and now I don't know whether this will be a polynomial. Yes, you do: if it's polynomial in N, it's polynomial in 2N, because you get 2N to the k minus 1, which is 2 to the k minus 1 times N to the k minus 1; the 2 to the k minus 1 is a constant, and you still get N to the k minus 1. So that remains a polynomial, a bigger polynomial. I don't like it, but you don't have to like it; it just has to be true and do the job we want. And finally, you can see this is minus 2, which was a very helpful factor; this is in the exponent, and a 2 in the exponent goes a long way, and now we knock it down all the way to 1/8. That's really bad news. The reason this is happening is that as we go through the technicalities of the proof, the epsilon will become epsilon over 2 and then epsilon over 4, just to take care of different steps, and when you plug in epsilon over 4 here, you get epsilon squared over 16, and so you get a factor of 1/8. That's the reason for it.
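Plugging the polynomial bound into the modified inequality shows the exponential winning; a sketch where the growth function is replaced by its polynomial bound, with the 4, the 2N, and the 1/8 just discussed (function names are mine):

```python
from math import comb, exp

def growth_bound(N, K):
    # polynomial bound on the growth function for breakpoint K
    return sum(comb(N, i) for i in range(K))

def vc_style_bound(N, K, eps):
    """4 * m_H(2N) * exp(-(1/8) * eps^2 * N), with m_H replaced by its bound."""
    return 4 * growth_bound(2 * N, K) * exp(-eps ** 2 * N / 8)

# the polynomial in 2N is eventually killed by the negative exponential in N
print(vc_style_bound(1000, 4, 0.1))     # still a huge, useless bound
print(vc_style_bound(100000, 4, 0.1))   # essentially zero: generalization kicks in
```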
So this is what we'll end up with, and you can be finicky and try to improve the constants a lot, but the basic message is that here is a statement that holds true for any hypothesis set that has a breakpoint, and this fellow is polynomial in N, with the order of the polynomial decided by the breakpoint, and you will eventually learn, because if N is big enough, if I give you enough examples, then using that hypothesis set you will be able to claim that E in tracks E out correctly. This result, which is called the Vapnik-Chervonenkis inequality, is the most important theoretical result in machine learning. On that happy note, we will stop here and take questions after a short break. OK? So let's start the Q&A. OK, so first a few clarifications from the beginning. In slide 5, when you choose the N points, does it mean your data set is of N points, or did you just choose N points from the data set? OK. When I apply this to an actual hypothesis set in an input space, then this actually corresponds to a particular set of points in that space. However, in the abstraction that just defines the function B, these are just abstract labels; they are labels for which column I'm talking about. So although I call them x1 up to xN, in the abstraction here they don't correspond to any particular input space. But when they do, they will correspond to a sample, and I'm supposed to pick the sample in that space that maximizes the number of dichotomies, et cetera, as we defined the growth function. But it's a sample that I pick, when I apply this to a particular input space and a hypothesis set. OK? Also, some people are asking: they didn't understand exactly why alpha is different from beta. Alpha is different from beta, yeah, why?
Well, the short answer is that I never made the statement that alpha is different from beta. I just didn't bother to ascertain any relationship between alpha and beta. I just called them names. If they happen to be equal, I'm happy. If they happen to be unequal, I'm happy. So all I'm doing here is calling the number of rows that happen to have a single extension alpha, and the number that happen to have a double extension beta. I don't know whether alpha is bigger than beta or smaller than beta in any particular case, and it doesn't matter. As far as the analysis is concerned, if I call them this way, then it will always be true that the total number of rows here is alpha plus beta plus beta, which is alpha plus twice beta. So there is really no assertion about the relative values of alpha and beta. OK. So moving on to the case where you show the breakpoints and how they satisfy the bound: what happens when K equals infinity, no breakpoints, basically? This is for the positive rays and whatnot? Yeah. So, for example, if you had the convex sets. OK. K equals infinity means there is no breakpoint. In that case, you don't have to bother with any of the analysis I did. No breakpoint means what? It means the growth function is 2 to the N for every N. Right? We just computed it exactly. So if you want a bound for it, yes, it's bounded by 2 to the N. Not a polynomial.
So in all of these cases we were addressing the case where there is a breakpoint, because that is the case where I can guarantee a polynomial, and therefore I can guarantee learning. That is the interesting case. If there is no breakpoint, this theoretical line of analysis will not guarantee learning. So if I have a hypothesis set that happens to be able to shatter every set of points, I cannot make a statement using this line of analysis that it will learn. And one example we had was convex sets. Convex sets have a growth function of 2 to the N. Well, it really is a very pessimistic estimate here, because the points have to be placed in a really funny way; you have to build a pathological case in order not to be able to learn, and in many cases you might be fine. But again, if I want a uniform statement based only on the breakpoint, this is the most I can say using this line of analysis. OK. So, just a quick review: how is the breakpoint calculated? The breakpoint: this is the only time you actually need to visit the input space and the hypothesis set. So basically you are sitting in a room with your hypothesis set. Someone gave you a problem for credit approval; you decided to use perceptrons, and you decided to use a nonlinear transformation, and you do that and you start programming it. And you would like to know some prediction of the generalization performance that you are going to get. So you go into the room and ask yourself: for this hypothesis set over this input space, what is the breakpoint?
So now you have to actually go and study your hypothesis set, and then find out that using this hypothesis set, I cannot separate, let's say, 10 points in every possible way, very much along the argument we used for the perceptron in two dimensions, where we found out that we cannot separate four points in every possible way. But the good news is that you don't have to do it anew, because for most of the famous learning models this has already been done. For the perceptrons, we will get an exact breakpoint for any dimension: for a 20-dimensional perceptron, here is the breakpoint, and here is the growth function, or here is the bound. Similarly, for a neural network there is a breakpoint; not an exact estimate of the breakpoint, but a bound on the breakpoint. And again, in most of these cases bounds work, because we are always trying to bound from above, and we have room to play with, because a polynomial is a polynomial is a polynomial. So if you become a little bit sloppy and miss something on the breakpoint, and you say 10 instead of 7, it's not going to break the back of learning versus non-learning. It's just going to tell you, more pessimistically, how many resources you need in order to learn, which is a more benign damage than deciding, oh, I cannot learn at all. OK. Also: can you come up with an example where these bounds are not as tight as here? OK. There is one case, which I could have covered but didn't, where you take positive and negative rays. Positive rays: you take the real line, and from a point on it's plus one, before it minus one. Positive and negative rays means you are also allowed to take rays that return plus one first and then minus one later. And the union of them is the model called positive and negative rays. It's a good exercise to do. Take that home and try to find what the breakpoint is. And you'll find that although the breakpoint for positive rays is 2, in this case the breakpoint is actually 3. And the reason is
that for 2 points now you can get everything, because the ray boundary can be here, so they are both minus; the boundary could be here, so they are both plus; the boundary could be here, so it's minus plus; and now you use the negative ray to get the plus one minus one. So now you can shatter 2 points. And you will fail only for 3 points, when the middle guy is different, because you cannot get that this way. So the breakpoint is 3. When the breakpoint is 3, you will get the bound, the blue bound here, to be quadratic, pretty much like before, because a breakpoint of 3 corresponds directly to a quadratic. I don't care whether the 3 is coming from positive intervals or coming from positive and negative rays. It's 3, therefore the blue bound is quadratic. If you compute the number of dichotomies you can get, which is the growth function, it will actually be linear. So there will be a discrepancy between linear, for the exact value of the growth function, and quadratic, for the bound. So there are cases you can come up with easily, and as a matter of fact, slack is the rule rather than the exception: in most cases there will be a slack. OK. And this question, I think, drives home the point of the whole lecture: we have been focusing on having E in close to E out, not on the actual value of E in. So using our hypothesis, there are just as many percentage errors in the training data as in the real data. Why is that? OK. So this goes back to separating the question of learning into two questions. There was one question which was addressed now. We are trying to get E in to track E out. Why do I need that?
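The positive-and-negative-rays exercise can be brute-forced; a sketch (point positions and function names are mine) confirming a linear growth function sitting under the quadratic bound:

```python
from math import comb

def growth_bound(N, K):
    return sum(comb(N, i) for i in range(K))

def pn_ray_dichotomies(N):
    """Count dichotomies of positive and negative rays on N points on a line."""
    pats = set()
    for t in range(N + 1):  # threshold falls between point t and point t + 1
        pats.add(tuple(-1 if i < t else 1 for i in range(N)))  # positive ray
        pats.add(tuple(1 if i < t else -1 for i in range(N)))  # negative ray
    return len(pats)

for N in range(2, 30):
    assert pn_ray_dichotomies(N) == 2 * N   # linear growth function
    assert 2 * N <= growth_bound(N, 3)      # the quadratic bound holds
print("linear growth function, quadratic bound: a visible slack")
```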
Because I don't know E out, and I will not know E out. That is simply an unknown quantity for me. And I want to work with something to minimize; I cannot minimize something that I don't know. So if the theoretical guarantees tell me that E in is a proxy for E out, and that if I minimize E in, E out will follow suit, then I can work with the quantities that I know. So that's the first part: that they track each other. The second part is the practical one: now I go and try to minimize E in. OK. Also, they are asking: can you clarify more why the VC dimension is useful? OK. The VC dimension, as of now, is an unknown quantity. I didn't say the words "VC dimension" at all. I gave every building block that will result in the definition. However, the good news is: what is the title of the next lecture? The VC dimension. You will be completely content with everything you wanted to know about the VC dimension and weren't afraid to ask. OK. Yeah, the crowd is saying that they are still digesting the lecture. OK. As I mentioned before, if you didn't follow this in real time, don't be discouraged. It's actually very sweet material. You can look at the lecture again, and you can read the proof, and you can do all of the homework, until it settles in your mind. This is the most abstract, the most theoretical lecture of the entire course, and if you get through this one and understand it well, you are in good shape as far as the rest of the course is concerned. There will be mathematics, but it will be more friendly mathematics; friendly as in less abstract. For someone who is not theoretically inclined, the more abstract the mathematics is, the less they can follow it, because they cannot relate to it. So this one has the abstraction; the other mathematics will be much easier to relate to. OK, so: what was wrong with the "not quite" expression on the last slide?
OK. So basically, the top statement is simply false. It was my way of relating what I'm trying to do to what has already happened. There used to be a capital M in place of the growth function. So the growth function is here; there used to be capital M. So the easiest way for me to describe what is happening with this theory is to tell you that you are going to take capital M out and replace it with this. As usual, it's not that easy. Remember even with the Hoeffding inequality, when I complained about the 2 here and the 2 here? Well, you have to have them in order for the statement to be true. So for this statement to be true, we needed to do some technical stuff that didn't really change the essence of the statement, but made it a little bit different by changing the constants. And therefore we have a proof for it at all, and it captures the essence of the original. I just didn't want to bother telling you this, because if I told you this in the first place, you would have been completely lost: why 4, why 2N, what is this 1 over 8, and forget about the essence. So the easiest way to do it is to wait until you get the idea that indeed I can replace M; but oh, in order to replace it I need the bigger sample that we argued for, so I need 2N; oh, and now the bigger samples are not tracking each other as well as each of them is tracking the actual out-of-sample error, so I need to modify these values, and so on. So it becomes much easier to swallow that the technicalities will come in, in order to make the proof go through. OK. So, can you review the definition of B of N and K? B of N and K. OK. Assume that you have N points, and assume that K is a breakpoint, so you make sure that no K points can have all possible patterns. Beyond these two, make no assumptions. You don't know where the points came from, you don't know what hypothesis set you are working with, you don't know what the input space is. You only know that you have N points, that is the N here, and the breakpoint is K, that is the K here. Under those conditions, can you bound the growth function? Can you
tell me that the growth function can never be bigger than something? That something is what I'm calling B of N and K. So what I'm doing is taking the minimal conditions you gave me, I have N points and K is a breakpoint, and asking myself: what is the maximum number of dichotomies you can possibly have under no other constraints than these two? And I'm calling this B of N and K. Why did I do it? First, it's going to serve as an upper bound for any hypothesis set that has a breakpoint K, because it is the maximum. Second, it's a purely combinatorial quantity, so I have a much better chance of analyzing it without going through the hairy details of input spaces and correlations between events and so on. And that is indeed what ended up being the case: we had a very simple recursion on it, and we found a formula for it, and that formula now serves as an upper bound for the hairier quantity, which is the growth function, which is very particular to a learning situation, an input space, and a hypothesis set. Also, a particular question on the proof of B of N and K, on the recursion, slide five: the question is, why does K not change when going back from N to N minus 1? OK. So here, if you look at x1, x2 up to xN, with xN appearing here, no K columns can have all possible patterns. These K columns could involve the last column, or could involve only the first N minus 1 columns; just no K columns whatsoever can have all possible patterns. So when I look at the reduced set of N minus 1 columns, I know for a fact that no K columns of these can have all possible patterns, because that would qualify as K columns of the bigger one. So K doesn't really change. The only time I had a different K is when I had a sort of a nice argument, that if you have K minus 1 columns which have all possible patterns on the smaller set, then adding the last column will get us in trouble with K columns. So for that I needed an argument. But in general, when I take
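The recursion and boundary conditions reviewed here can be coded directly and compared against the closed-form sum; a sketch (the lecture proves the recursion as an upper bound; with these boundary conditions its solution coincides with the sum by Pascal's identity):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B(N, K):
    """B(N, K) via the lecture's recursion:
    B(N, 1) = 1, B(1, K) = 2 for K > 1,
    and B(N, K) = B(N-1, K) + B(N-1, K-1)."""
    if K == 1:
        return 1
    if N == 1:
        return 2
    return B(N - 1, K) + B(N - 1, K - 1)

# the recursion's solution matches sum_{i=0}^{K-1} C(N, i)
for N in range(1, 25):
    for K in range(1, 10):
        assert B(N, K) == sum(comb(N, i) for i in range(K))
print("recursion matches the closed form")
```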
the statement at face value, K is fixed, and the K columns could be anything: they could involve the last column, or could be restricted to the first N minus 1 columns; they could be the first K columns for all I care.

So, how does this formalization apply to, say, a regression problem? Again, here it's a classification of plus 1 and minus 1, and as I mentioned, the entire analysis, the VC analysis, can be extended to real-valued functions. It's a very technical extension that, in my humble opinion, does not add to the insight. Therefore, instead of doing that and going very technical in order to gain very little in terms of insight, I decided that when I get to regression functions, I'm going to apply a completely different approach, which is the bias-variance tradeoff. It will give us another insight into the situation, and it will tackle the real-valued functions, the regression functions, directly. So I think we'll have both the benefit of having another way of looking at it, and of covering both types of functions.

There's also this person that says: I feel silly asking this, but is the bottom line that we can prove learnability if the learning model cannot learn everything? Okay. We proved learnability under a condition on the hypothesis set. When you say learning everything, you are really talking about the target function, and the target function is unknown. What I am telling you here is that if you tell me that there is a break point, I can tell you that if you have enough examples, E_in will be close to E_out for the hypothesis you pick, whichever way you pick it. It remains to be seen whether you are going to be able to minimize E_in to a level that will make you happy; I will never know that until you start minimizing. So if the target function happens to be extremely difficult, or completely random, unlearnable, you are not going to see this in the generalization question. The generalization question is independent of the target function; I didn't bring it up here at all.
It has to do with the hypothesis set only. Where the target function comes in is this: if I get E_in to be small, E_out will be small; I know that from the generalization argument that I made. Can I get E_in to be small? If the target function is random, you will get a sample that is extremely difficult to fit, and you are not going to be able to get E_in to be small, but at least you realize that you could not learn in that particular case. And with another target function, you realize that you can learn, because E_in went down. So for the question of whether I can learn or not, the generalization part is independent of the target function. The second question is very much dependent on the target function, but the good news is that it happens in sample: I can observe it and realize how well, or not so well, I learned.

Also, going back to previous questions: does this also generalize to multi-class problems? Basically, there is no restriction on the inputs or the outputs. There is a counterpart for each notion: what is the break point, and what are the dichotomies when they are not really dichotomies, when you have multiple classes or real values? So there are technicalities to be done in order to reduce them to this case, but the same principle applies regardless of the type of function you have. Okay, I think that's it. Very good. Thank you, and we'll see you next week.
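[Editor's note] The combinatorics discussed in the questions above can be checked numerically. The sketch below is an illustration, not course code: it takes the recursion B(N, K) ≤ B(N-1, K) + B(N-1, K-1) as an equality, with the base cases B(N, 1) = 1 and B(1, K) = 2 for K ≥ 2, and compares it with the closed-form sum of binomial coefficients. The function names and the choice of parameters (eps = 0.1, K = 4) are mine, and the final lines only illustrate the shape of the bound: a quantity like 4 · B(2N, K) · exp(-ε²N/8), with the growth function's bound playing the role capital M used to play, still goes to zero as N grows.

```python
from functools import lru_cache
from math import comb, exp

@lru_cache(maxsize=None)
def B(n, k):
    """Maximum number of dichotomies on n points when k is a break point
    (no k points exhibit all 2^k patterns), via the lecture's recursion."""
    if k == 1:   # break point 1: no point may take both values -> 1 dichotomy
        return 1
    if n == 1:   # a single point with k >= 2: both +1 and -1 are allowed
        return 2
    # Recursion taken with equality; it reproduces the closed form below.
    return B(n - 1, k) + B(n - 1, k - 1)

def closed_form(n, k):
    """The formula found from the recursion: sum of C(n, i) for i < k."""
    return sum(comb(n, i) for i in range(k))

# The recursion and the formula agree, and when the break point exceeds
# the number of points we recover all 2^n dichotomies.
assert all(B(n, k) == closed_form(n, k)
           for n in range(1, 12) for k in range(1, 12))
assert B(5, 6) == 2 ** 5

# Because the bound is polynomial in n, the exponential decay wins:
eps = 0.1
print(4 * closed_form(2000, 4) * exp(-eps**2 * 1000 / 8))      # huge
print(4 * closed_form(200000, 4) * exp(-eps**2 * 100000 / 8))  # vanishing
```

The point of the sketch is the last comparison: B(N, K) is at most a polynomial of degree K - 1 in N, so unlike the case with an infinite capital M, the exponential factor drives the bound to zero for large enough N.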