The following program is brought to you by Caltech. Welcome back. Last time we talked about error measures and noisy targets, and these are the two notions that relate the learning problem to practical situations. In the case of error measures, we know that in order to specify the error that is caused by your hypothesis, we should try to estimate the cost of using h instead of f, which should have been used in the first place. That is something the user can specify: the price to pay when they use h instead of f, and that is the principled way of defining an error measure. In the absence of that, which happens quite a bit of the time, we go to plan B and resort to analytic properties or practical properties of optimization in order to choose the error measure. Once you define the error between the performance of your hypothesis and the target function on a particular point, you can plug this into different error quantities, like the in-sample error and the out-of-sample error, and get those values in terms of the error measure by taking an average. In the case of the training set, you evaluate the error on the training points and then average with respect to the capital-N examples that you have. In the case of out-of-sample, theoretically the definition would be that you also evaluate the error between h and f on a particular point x, give that x a weight according to its probability, and get the expected value with respect to x.
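To make this averaging concrete, here is a minimal sketch in Python. The pointwise error measure, the target, the hypothesis, and the input distribution below are all hypothetical stand-ins chosen for illustration (squared error, f(x) = x, h(x) = 0.9x, x uniform on [0, 1]); the structure is the point: averaging over the N training points for E_in, versus taking an expectation under the input distribution for E_out.

```python
import random

def pointwise_error(h_val, f_val):
    """Assumed pointwise error measure: squared error e(h(x), f(x))."""
    return (h_val - f_val) ** 2

def e_in(h, f, training_points):
    """In-sample error: average the pointwise error over the N training points."""
    return sum(pointwise_error(h(x), f(x)) for x in training_points) / len(training_points)

def e_out(h, f, draw_x, n_mc=100000):
    """Out-of-sample error: expected pointwise error, with each x weighted by
    its probability; approximated here by Monte Carlo draws from draw_x()."""
    total = 0.0
    for _ in range(n_mc):
        x = draw_x()
        total += pointwise_error(h(x), f(x))
    return total / n_mc

rng = random.Random(0)
f = lambda x: x          # hypothetical target function
h = lambda x: 0.9 * x    # hypothetical hypothesis
print(e_in(h, f, [0.1, 0.5, 0.9]))                  # average over 3 training points
print(e_out(h, f, lambda: rng.uniform(0.0, 1.0)))   # roughly 0.01 * E[x^2] = 1/300
```

The same skeleton works for any pointwise error measure; only `pointwise_error` changes.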
The notion of noisy targets came from the fact that what we are trying to learn may not be a deterministic function, a function in the mathematical sense where y is uniquely determined by the value of x, but rather a situation where y is merely affected by x, so y is distributed according to a probability distribution that gives you y given x. We talked, for example, about the case of a credit application: two identical applications may lead to different credit behavior, so the credit behavior is a probabilistic thing, not a deterministic function of the credit application. Or go back to our first example, say of movie rentals: if you rate a movie, you may rate the same movie differently at different times, depending on your mood and other factors. So there is always a noise factor in these practical problems, and that is captured by the conditional probability from x to y, the probability of y given x. When we look at the diagram involving this probability, we now replace the target function, which used to be a function, by a probability distribution, which can be modeled as a target function plus noise, and this feeds into the generation of the training examples. The unknown input distribution, which we introduced technically in order to get the benefit of the Hoeffding inequality, also feeds into the training examples: one distribution determines x, and the other determines y given x, and you generate the examples independently according to this joint distribution. When x was the only probabilistic thing and y was a deterministic function of x, x_1 was independent of x_2, independent of x_N, and you computed each y according to the function on the corresponding x. In the noisy version, the pair (x_1, y_1) is generated according to the joint probability distribution, which is P(x), the original one, times P(y given x), the one you introduced to accommodate the noise, and the independence is between different pairs, so (x_1, y_1) would be
independent of (x_2, y_2), independent of (x_3, y_3), and so on. When you take the expected values for errors, you now have to take into consideration the probability with respect to both x and y, so what used to be the expected value with respect to x is now the expected value with respect to x and y: you plug x into h and compare with the probabilistic value of y that happened to occur, and that would now be the out-of-sample error in this case. Now, in this lecture I am going to start the theory track, which on this particular route will last for 3 lectures, followed by another theory lecture on a related but different topic. The idea is to relate training to testing, in-sample and out-of-sample, in a realistic way. So the outline will be the following. We will spend some time talking about training versus testing, then we will introduce the mathematical quantity that will let us characterize the complexity of a hypothesis set, and I will give you some examples to make sure that the notion is clear. Then comes the key notion, the break point, and the break point is the one that will later result in the VC dimension, the main notion in the theory of learning. And we end with a puzzle, an interesting puzzle that will hopefully fix the ideas that we talked about in the lecture. OK, now let's talk about training versus testing, and I'm going to take a very simple example that you can relate to. Let's say that I'm giving you a final exam. Now, I want to help you out, so before the final exam I give you some practice problems and solutions that you can work on to prepare yourself for the final exam. That is very typical.
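One quick aside before continuing with the exam analogy: the generation of noisy examples described a moment ago, each pair (x_n, y_n) drawn independently from P(x) times P(y given x), can be sketched in Python. The specific choices here are assumptions for illustration only: x uniform on [-1, 1], a deterministic part f(x) = sign(x), and noise that flips the label with probability 0.1.

```python
import random

def sample_noisy_dataset(n, flip_prob=0.1, seed=0):
    """Draw n examples i.i.d. from the joint distribution P(x) P(y | x).

    Hypothetical setup: x ~ uniform on [-1, 1], deterministic part
    f(x) = sign(x), and P(y = f(x) | x) = 1 - flip_prob.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)      # x drawn from the input distribution P(x)
        y = 1 if x >= 0 else -1         # the deterministic part, f(x)
        if rng.random() < flip_prob:    # the noise: y drawn from P(y | x)
            y = -y
        data.append((x, y))             # different pairs are independent
    return data

data = sample_noisy_dataset(1000)
noise_rate = sum(1 for x, y in data if y != (1 if x >= 0 else -1)) / len(data)
print(noise_rate)   # close to flip_prob = 0.1
```

Modeling P(y given x) as "target function plus noise" in this way is general: any conditional distribution over binary labels can be written as a deterministic part plus a flip probability.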
Now, if you look at the practice problems and solutions, this would be your training set, so to speak. You are going to look at a question, you are going to answer it, you are going to compare with the real answer, and then you are going to adjust your hypothesis, your understanding of the material, in order to do better, and go through them, and perhaps go through them again, until you get them right or mostly right, or figure out what the material is, and now you are more ready for the final exam. Now, the reason I gave you the practice problems is to help you do better on the final. Why don't I just give you the problems of the final then? Excellent idea! I can see that the problem is now obvious. The problem is that doing well on the final is not the goal in and of itself; the goal is for you to learn the material, to have a small E_out. The final exam is only a way of gauging how well you actually learned, and in order for it to gauge how well you actually learned, I have to give you the final at a point where you have already fixed your hypothesis. You prepared, you studied, you discussed with people, and now you sit down to take the final exam. So you have one hypothesis, and you go through the exam, and therefore your performance on the, say, 50 questions of the final (hopefully it's not going to be that long) will reflect what your performance will be outside. So the distinction is conceptual, and now let's state mathematically what training versus testing is. It will be an extremely simple distinction, although an important one. So here is what testing is, in terms of a mathematical description. You have seen this before; this is plain-vanilla Hoeffding. This part is how well you did on the final exam, and this is how well you understand the material proper. Since you have only one hypothesis, this is the final: you are fixed and you just take the exam, your performance on the exam tracks well how you understand the material, and therefore the difference
between them is small, and the probability that it's not small becomes less and less as the number of questions, N in this case, goes up. So that is what testing is. How about training? Almost identical, except for one thing: this fellow, capital M. Because in the case of training, this is how you perform on the practice problems, and in the practice problems you had the answers. You modified your hypothesis, you looked at a question, you got an answer wrong, so you modified your hypothesis again. You are learning better; that's all very nice. But now the practice set is contaminated: you have pretty much memorized it, and there's a price to pay for that, in terms of how your performance on the practice set, which is E_in in this case, tracks how well you understand the material, which is still E_out. The price you pay is how much you explored, and that was reflected by the simple capital M, the number of hypotheses, in the very simple derivation we did. OK, so if you want an executive summary of this lecture: we are just going to try to get M replaced by something more friendly. Because, you realize, if you just measure the complexity of your hypothesis set by the number of hypotheses, capital M is next to useless in almost all cases. Something as simple as the perceptron has capital M equal to infinity, and therefore this guarantee is no guarantee at all. If we can replace capital M with another quantity, and justify that, and that quantity is not infinite even if the hypothesis set is infinite, then we are in business, and we can start talking about the feasibility of learning in an actual model, and establish the notion in a way that we can apply to a real situation. So that's the plan. We are talking about M, so the first question to ask is where this M came from; if we are going to replace it, we need to understand where it came from, to understand the context for replacing it. Well, there are bad events that we are talking about, and the bad events
are called, you know, script B, because they are bad. So these are the bad events. What is the bad event? You want to avoid the situation where your in-sample performance does not track the out-of-sample performance, so if their difference is bigger than epsilon, this is a bad situation, and we are trying to say that the probability of a bad situation is small. That was the starting point. Then we applied the union bound and got the probability of several bad events. This is the bad event for the first hypothesis; you can see here that small m runs 1, 2, 3, 4 up to capital M, so there are capital-M hypotheses I am talking about, and I would like the probability of any of them happening to be small. Why is that? Because your learning algorithm is free to pick whichever hypothesis it wants, based on the examples. So if you tell me that the probability of any of the bad events happening is small, then whichever hypothesis your algorithm picks will be OK, and I want that guarantee to be there. So let's try to understand the probability of B_1 or B_2 or, up to, B_M. What does it look like? Well, if you draw an event diagram and place B_1 and B_2 and B_3 as areas, these are different events: they could be disjoint, they could be on top of each other, they could be independent, which means that they overlap proportionately; there could be many situations. Now, the point of the bound is that we would like to make the statement regardless of the correlations between the events, and therefore we use the union bound, which bounds the probability by the total area of the first one, plus the total area of the second one, and so on, as if they were disjoint. That always holds, regardless of the level of overlap, but you can see that it is a poor bound, because in this case we are estimating the probability to be about three times the area when it's actually closer to just one area, because the overlap is so significant. Therefore we would like to be able to take
into consideration the overlaps, because with no overlaps you get the full capital-M terms, and you're stuck with capital M being infinity in almost all the interesting hypothesis sets. Now, of course, in principle you can take the hypothesis set I give you, say the perceptrons, and try to formalize what this bad event is in terms of one perceptron, and what happens when you go to another perceptron, and try to get the full joint distribution of all of these guys and solve it exactly. You can in principle; in practice it's a complete nightmare, completely undoable, and if we had to do that, we would have to give up. So what we are going to do is try to abstract from the hypothesis set a quantity that is sufficient to characterize the overlaps, and get us a good bound, without having to go through the intricate details of analyzing how the bad events are correlated and whatnot. That is the goal, and we will achieve it through a very simple argument. Now, when we ask whether we can improve on M, maybe M is the best we can do; it's not as if wishing to improve it means it must be improvable. Maybe that's the best we can say, and if you have an infinite hypothesis set, you are stuck, and that's that. But it turns out that no: the overlap situation we talked about is actually very common, and yes, we can improve on M, and the reason is that the bad events are extremely overlapping. Let's understand why. I am going through the example carefully, because now we have lots of binary things: plus one versus minus one for the target, plus one versus minus one for the hypothesis, agreeing versus disagreeing, etc., so I want to pin down exactly what the bad event is in terms of this picture, so that we understand what we are talking about. So here is the target function, itself a perceptron in this example, and it returns plus one for some guys and minus one for some guys; that's easy. And then you have a hypothesis, a perceptron, and this is not the final hypothesis; this is sort of a badly
performing hypothesis, but it is a general perceptron: if you pick any vector of weights, you will get another blue line. Now, in terms of this picture, could someone tell me what E_out is, the out-of-sample error for this hypothesis when it's applied to this target function? It's not that difficult: it is actually just these areas, the differential areas where they disagree, one saying plus one and the other saying minus one. So take these two areas: the total area is the total probability if the distribution is uniform, and if it's not, the corresponding probability mass will give you the value of E_out. That's one quantity. How about E_in? For E_in you need a sample, so first you generate a sample, here and here, and I color the misclassified points red, so the fraction of red points compared to the whole sample gives you E_in. OK, so that is understood: this is E_in and E_out, and these are the guys that I want to track each other. Fine, I understand this part. In words now, look at the change in E_in and E_out when you change your hypothesis. Here's your first hypothesis, a perceptron. You probably already suspect that whatever events we are talking about must be hugely overlapping, because the hypotheses are so close to each other, but let's pin down the specific event that we would like to argue is overlapping. The change in E_out when you go from, let's say, this blue hypothesis to the green hypothesis would be the area of this yellow thing: not very much, a very thin area. That's also where E_in changes, right? If you look at the area that gives you E_out, and then look at delta E_in, the change in the labels of the data points: if one of the data points happens to fall in this yellow region, then its error status will change from one hypothesis to the other, because one hypothesis gets it right and the other one gets it wrong. Now, the chance of a point falling here is small, so you can see why we are arguing that the change in E_out and the change in E_in are both small.
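This picture can be checked numerically. The sketch below is built on assumptions made up for illustration: a uniform input distribution on the square [-1, 1]^2, a linear target, and two hypothetical perceptron-style hypotheses that differ only slightly, so that the region where they disagree (the thin yellow sliver in the figure) carries little probability. All disagreement probabilities are estimated by Monte Carlo.

```python
import random

def disagreement(h1, h2, points):
    """Fraction of points on which two classifiers disagree: a Monte Carlo
    estimate of the probability mass of their disagreement region."""
    return sum(h1(p) != h2(p) for p in points) / len(points)

rng = random.Random(1)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200000)]

f = lambda p: 1 if p[0] + p[1] > 0.0 else -1                # hypothetical target
h = lambda p: 1 if p[0] + 1.0 * p[1] - 0.1 > 0.0 else -1    # "blue" hypothesis
g = lambda p: 1 if p[0] + 0.9 * p[1] - 0.1 > 0.0 else -1    # nearby "green" hypothesis

print(disagreement(h, f, points))   # E_out of h: mass of the h-versus-f disagreement area
print(disagreement(g, f, points))   # E_out of g: comparable, since the lines are close
print(disagreement(h, g, points))   # the thin sliver between h and g: small
```

Moving from h to g can change E_out by at most the mass of that sliver, which is the quantitative sense in which the two bad events overlap heavily.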
The area is small, and the probability of a point falling there is small. Moreover, they actually move in the same direction, because both changes depend on the area of the yellow part. So let's say this area is increasing: if they increase, they increase together, because I get a net positive area for the change in E_out, and the probability of a point falling there also increases. Now, the reason I am saying this is that what we care about are these quantities: we would like to make the statement that how E_in tracks E_out for the first hypothesis, the blue perceptron, is comparable to how E_in tracks E_out for the second one. Why are we interested in that? Because we would like to argue that when this deviation exceeds epsilon, that one often exceeds epsilon as well: the events are overlapping. We are not looking at the absolute values; we are just saying that if this exceeds epsilon, that also exceeds epsilon most of the time, and therefore the picture we had last time is actually true. These guys are overlapping, the bad events are overlapping, and at least we stand a hope of getting something better than just counting the number of hypotheses for the complexity measure we are seeking. OK, so we can improve on M; that's good news. What are we going to replace it with? I'm going to introduce to you now the notion that will replace capital M. It is not going to be completely obvious that we can actually replace M with this quantity; that will require a proof, and that will take us into next lecture. The purpose here is to define the quantity and make you understand it well, because this is the quantity that will end up characterizing the complexity of any model. We are going to motivate that it can replace M: it will be plausible, it makes sense, it's not a crazy quantity, and it also counts the number of hypotheses, of sorts. So let's define the quantity and become familiar with it, and then next time we will like the quantity so much that we will bite the bullet and
go through the proof that we can actually replace capital M with this quantity. When we count hypotheses, we obviously take into consideration the entire input space. What does that mean? These are four different perceptrons. I take the input space, and the reason these guys are different is that each pair differs on at least one point in the input space; that's what makes two functions different. Instead of counting the number of hypotheses on the entire input space, I am going to restrict my attention only to the sample. So I generate only the input points, which are finitely many, and put them on the diagram, so I have this constellation of points, and when I look at these points alone, regardless of the entire input space, these guys will turn red and blue according to the region they fall in. Now, in order to fully understand what it means to count only on the sample points, we have to wipe out the input space, so that's what I'm going to do, and that's what you get. You can imagine the perceptron is somewhere and it's splitting the points. Now, when you do this, you are not counting the hypotheses proper, because the hypotheses are defined on the whole input space; you are counting them on a restricted set. But still you are counting: you are counting the number of "hypotheses". For example, if I give you a hypothesis set where you can get all possible combinations of red and blue, that's a powerful hypothesis set; if I give you one where you get only a few, that's a weaker one, and this is what in our minds we tried to capture by the crude capital M. So we are going to count the number of "hypotheses", in quotation marks. Why? Because now the hypotheses are defined only on a subset of the points, so I'm going to give them a different name in that case, in order not to confuse the hypotheses on the general input space with this case. I'm going to call them dichotomies, and there is a dichotomy between what goes into red and what
goes into blue; that's where the name came from. So when you look only at the points, which ones are blue and which ones are red is a dichotomy. If you want to understand it, look at this. Let's say that you are looking at the full input space, and this is your perceptron, but you only see through the eyes of those points. So what do you see? You end up with this: you just see that these guys turn blue and these guys turn red, or pink. Now, as you vary the perceptron, as you vary the line here, you are not going to notice anything there until the line crosses one of the points. So I could be moving the line around here, here, here, and here, generating an infinite number of hypotheses for which I'm being charged, and then when you cross a point, you end up with another pattern: all of a sudden these guys are blue and these guys are red, which happens when, say, the line is horizontal here rather than vertical. So you can always think of it as reducing the situation: we are going to look at the problem exactly as it is, except in terms of the dichotomies, which are the mini-hypotheses, the hypotheses restricted to the data points. Formally, a hypothesis is a function that takes the full input space X and produces minus one or plus one; that's the blue and red regions that we saw. A dichotomy, on the other hand, is also a hypothesis, and we can even give it the same symbol h, because of how it's used, but its domain is not the full input space; it is very specifically x_1 up to x_N. Each one of these points belongs to the input space, script X, but now I'm restricting my function to them, and again the result is minus one or plus one, exactly as it was before. That's what a dichotomy is. Now, if I ask you how many hypotheses there are, that's very easy: it can be infinite, and in the case of the perceptron it is infinite. Why?
Because this input space is seriously infinite, so the number of functions is just infinite, by a wide margin. That's fine. Now ask yourself what the number of dichotomies is; let's look at the notation first and then answer the question. A dichotomy involves a function, small h, applied to one value at a time: I would say h of x_1, or h of x_2. If I decide to use the fancy notation and apply small h to the entire tuple x_1, x_2 up to x_N, I mean that you tell me the values of h on each of them, so h returns a vector of the values h(x_1), h(x_2), up to h(x_N). That's not an unusual notation. Now, if you apply the entire hypothesis set, capital H, to that, what you are doing is applying each member, small h, to the entire tuple, so each time you apply one of those guys you get minus 1, minus 1, plus 1, plus 1, minus 1, plus 1, minus 1, etc., a full dichotomy, and then you apply another h and get another dichotomy, and so on. However, as you vary h over the hypothesis set, which has an infinite number of elements, many of these guys will return exactly the same dichotomy, and a dichotomy returns only plus 1 or minus 1 on each point. So how many different ones can I possibly get? At most 2 to the N. If capital H is extremely expressive, it will get you all 2 to the N; if not, it will get you fewer than 2 to the N. So I can start with the most infinite type of hypothesis set, and once I translate it into dichotomies, this becomes a candidate for replacing the number of hypotheses: instead of the number of hypotheses, we are talking about the number of dichotomies. So now we define the actual quantity. Capital M is red, and I keep it red throughout, and we are now going to define small m, which I will also keep red, that will hopefully, and provably as we will see next time, replace capital M. It's called the growth function. The growth function counts the most dichotomies, the most dichotomies you can get using
your hypothesis set on any N points. So here is the game. I give you a budget, N; that's my decision. You choose where to place the points x_1 up to x_N, your points, to the advantage of the hypothesis set. It would be silly, for example, to take the points and put them, let's say, on a line, because now you are restricted in separating them; you can see that you get the most if you put them in a general constellation, and then you count the number of dichotomies. So let's put it formally. The growth function is going to be called small m, in red as I promised, and it is a maximum. Maximum with respect to what? With respect to any choice of the N points from the input space; that is your part, I only give you the N. What are you maximizing? Well, we had this funny notation: capital H applied to the entire tuple is actually the set of dichotomies, the vectors of minus 1's and plus 1's, and then the next guy, and the next guy. When you put the cardinality symbol around this set, you are just counting: you are asking how many dichotomies I get on x_1 up to x_N. Taking the maximum gives you the most expressive facet of the hypothesis set on N points. If I tell you N is 10 and you come back with the number 500, it means that with your choice of x_1 up to x_10 you managed to generate 500 different dichotomies according to the hypothesis set that I gave you. Because of this dependence, you can see that there is an added notation: this is the growth function for your hypothesis set, and I am making that dependence explicit by putting a subscript, script H. Furthermore, this is a full-fledged function of N. Capital M is a number: I give you a hypothesis set, and it's a single number, which happens to be infinite, but a number. Here I am giving you a full function: you tell me N, and I tell you the value of the growth function. So it's a little bit more complicated, and that's the growth function. So that is the notion.
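To make the definition concrete, here is a small Python sketch of the quantity inside the maximum: the number of distinct dichotomies H(x_1, ..., x_N). An infinite hypothesis set cannot be enumerated in code, so the example uses an assumed finite stand-in: a handful of one-dimensional threshold classifiers of the form sign(x - a).

```python
def dichotomies(hypotheses, points):
    """Apply every h in the hypothesis set to the whole sample and collect
    the distinct tuples (h(x_1), ..., h(x_N)): the set of dichotomies."""
    return {tuple(h(x) for x in points) for h in hypotheses}

# Assumed finite stand-in for a hypothesis set: thresholds sign(x - a).
# (The a=a default argument pins down the loop variable in each lambda.)
hypotheses = [lambda x, a=a: 1 if x > a else -1 for a in (-2.0, -0.5, 0.5, 2.0)]
points = (-1.0, 0.0, 1.0)

print(len(dichotomies(hypotheses, points)))   # 4: at most |H|, and at most 2**3 = 8
```

The growth function would then be this count maximized over every possible placement of the N points, which is what makes it a function of N alone, not of any particular sample.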
Now, what can we say about the growth function? Well, since the number of dichotomies is at most 2 to the N, because that's as many plus-or-minus-1 N-tuples as you can produce, the maximum of them is also bounded by the same thing: at most 2 to the N. If we are going to replace capital M with small m, I would say 2 to the N is an improvement over infinity, if we can afford it. Maybe not a great improvement; nonetheless, an improvement. So now let's apply the definition to the case of perceptrons, in order to give it some flesh, so we understand that the notion is not just an abstract quantity. We take the perceptrons, and we would like to get the growth function of the perceptron. Getting the growth function of the perceptron is quite a task: if I ask you what capital M is, you say infinity and go home; if I ask what the growth function of the perceptron is, you have to tell me what it is at N equals 1, at N equals 2, at N equals 3, at N equals 4; it's a whole function. So we say, OK, 1 and 2 are easy; let's start with N equals 3. So I'm choosing 3 points, and I chose them wisely, so that I can get the growth function of the perceptron for the value N equals 3. Well, it's not that difficult. You can say: I can actually get everything there is to get. Why? Because I can have my line here, or here, or here, and that's 3 possibilities, times 2, because I can flip which side is plus 1 and which is minus 1. Or I can have the line sitting out here, where it makes them all plus 1, or out there, where it makes them all minus 1. That's 8, and that's all of them: the perceptron hypothesis set is as strong as you can get if you restrict your attention to these 3 points. So the answer would be, what is it already, 8. Wait a minute: someone else chose the points collinear, and then found out that if you want the two outer guys to go to the plus-1 class and the middle one to go to the minus-1 class, there is no perceptron that
is capable of doing this. Correct: you cannot draw a line that makes the two end points go to plus 1 and the middle point go to minus 1 if they are collinear. Does this bother us? No, because we are taking the maximum. Since the first choice got to 8, and you cannot go above 8, that settles it, and indeed you can answer with authority that the growth function of the perceptron at N equals 3 is 8. Now let's see if we are still in luck when we go to N equals 4. What is the growth function for 4 points? We choose the points in general position again; we are not going to allow any collinearity, in order to maximize our chances. But then we are stuck with the following problem: even with points in general position, there is this particular pattern, minus 1, plus 1, plus 1, minus 1, and you cannot generate it using a perceptron; and the opposite of it, with the plus 1's and minus 1's exchanged, you cannot generate either. Can you find any other 4 points where you can generate everything? No. I can play around, and there are always 2 missing dichotomies, or even worse: if I choose the points unwisely, I will be missing more of them. So the maximum falls 2 short of all the possibilities, and the growth function here is 14, not 16 as it might have been at the maximum. Now, this is a very satisfactory result, because perceptrons are pretty limited models; we use them because they are simple and there is a nice algorithm that goes with them. So we should expect that the quantity with which we are measuring the sophistication of the perceptrons, namely the growth function, had better not declare perceptrons as strong as can be, and indeed they break, and they are limited. If I pick another model, let's say, just for the extreme case, the set of all hypotheses, what would the growth function be for the set of all hypotheses? It would be 2 to the N, because I can generate anything.
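As a quick aside, the count of 14 can be checked by brute force. The sketch below places four points in an assumed general-position configuration (the corners of a unit square), enumerates all 2^4 = 16 labelings, and tests each for linear separability using the standard fact that a labeling is achievable by a perceptron exactly when the convex hulls of the plus-1 points and the minus-1 points are disjoint; for four points this reduces to a segment-intersection test and a point-in-triangle test.

```python
from itertools import product

def cross(o, a, b):
    """2-D cross product of the vectors (a - o) and (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def segments_intersect(p1, p2, q1, q2):
    """Strict proper-intersection test (enough for points in general position)."""
    d1, d2 = cross(q1, q2, p1), cross(q1, q2, p2)
    d3, d4 = cross(p1, p2, q1), cross(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0

def point_in_triangle(p, a, b, c):
    """Strictly inside: all three cross products share the same sign."""
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    return (d1 > 0 and d2 > 0 and d3 > 0) or (d1 < 0 and d2 < 0 and d3 < 0)

def separable(pos, neg):
    """Linearly separable iff the convex hulls of pos and neg are disjoint."""
    if not pos or not neg:
        return True
    if len(pos) == 1 and len(neg) == 3:
        return not point_in_triangle(pos[0], *neg)
    if len(pos) == 3 and len(neg) == 1:
        return not point_in_triangle(neg[0], *pos)
    if len(pos) == 2 and len(neg) == 2:
        return not segments_intersect(pos[0], pos[1], neg[0], neg[1])
    return True   # remaining small splits are separable in general position

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
count = 0
for labels in product((+1, -1), repeat=4):
    pos = [p for p, label in zip(points, labels) if label == +1]
    neg = [p for p, label in zip(points, labels) if label == -1]
    count += separable(pos, neg)
print(count)   # 14: only the two "XOR" labelings are missing
```

The two unachievable labelings are exactly the diagonal splits, which is the minus 1, plus 1, plus 1, minus 1 pattern from the slide and its complement.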
So now, according to this measure that I just introduced, the set of all hypotheses is stronger than the perceptrons: a satisfactory result, simple but satisfactory. Now what I'm going to do is take some examples in which we can compute the growth function completely, for all values of N. You can see that if I continued with the perceptron to 5 points, you would put the 5 points down and then try: am I missing this one, or maybe if I change the position of the points... it's just a nightmare even to get 5, and if you just do it by brute force, it's not going to happen. So I'm taking examples where we can actually, by a simple counting argument, get the value of the growth function for the entire domain, N from 1 up to infinity, in order to get a better feel for the growth function; that's the purpose of this portion. Our first model I'm going to call positive rays. Let's look at what positive rays look like. They are defined on the real line, so the input space is R, the real numbers, and they are very simple: there is a point, which we are going to call a, that distinguishes one hypothesis from another. In this particular hypothesis set, all the points that are bigger than a go to plus 1, and all the points that are smaller go to minus 1. They are called positive rays because here is the ray; a very simple hypothesis set. Now, in order to compute the growth function, I need a bunch of points, so I'm going to generate some points, which I'm going to call x_1 up to x_N, and I am going to choose them as generally as possible. I guess there is very little generality to choose when you are talking about a line; just make sure that the points don't fall on top of each other, because if they do, you cannot really dichotomize them at all. So if you keep them separate, you will be OK. You have these N points, and when you apply your hypothesis, the particular hypothesis that is drawn on the slide, to these points, you get this pattern, and you ask yourself how many
different patterns I can get on these N points as I vary my hypothesis, which means varying the value of a, the parameter that distinguishes one hypothesis from another. So, formally, each hypothesis is a function from the reals to minus 1 and plus 1, and you can actually write an analytic formula: if you want one, it is h(x) = sign(x - a). So here is a very simple argument. If you have N points, the resulting dichotomy, which ones go to blue and which ones go to red, depends only on which segment between the points a falls in. If a falls here, you get this pattern; if a falls there, this guy will be red and the rest will be blue, so I get a different dichotomy. I get a different dichotomy for each different line segment. How many line segments are there to choose from? I have N points, so I have N minus 1 sandwiched segments, plus one out here, where all the points are red, and one out there, where all of them are blue. So I have N plus 1 choices, and that's exactly the number of dichotomies I'm going to get on N points, regardless of where they are. So I have found that the growth function for this hypothesis set is exactly N plus 1. Now let's take a more sophisticated model and see if we get a bigger growth function; that's the whole idea, right? The next one is positive intervals. What are these? They are like the previous guys, except a little more elaborate: instead of having a ray, you have an interval. Again we are talking about the real line, and you define an interval from here to here; anything that lies within the interval goes to plus 1 and becomes blue, and anything outside, whether to the right or to the left, goes to minus 1. That's obviously more powerful than the previous model, because you can think of the positive ray as an interval with an infinite right end. So you put the points down, we have done this before, and they get classified this way, and I ask myself how many different
Now I'm choosing two parameters, the beginning of the interval and the end of the interval; these are the two parameters that tell me one hypothesis versus another, and I'm asking how many different patterns I can get. Again the hypothesis is very simple, defined on the real numbers, and now comes the counting argument, which is an interesting one. The way you get a different dichotomy is by choosing two different segments to put the ends of the interval in. If I start the interval here and end it here, I get something; if I start it here and end it there, I get something else; and there is essentially a one-to-one mapping between the dichotomies and the choice of two segments. So I can very simply say that the growth function in this case is the number of ways to choose two segments out of the N + 1 available, which is N + 1 choose 2, and there is only one dichotomy missing. When you count, there are two rules: make sure you count everything, and make sure you don't count anything twice. We counted almost everything; what is the missing one? Let's say all the points are blue. Is that counted already? Yes: the interval starts in the leftmost segment and ends in the rightmost one, and that is already counted. But if they are all red, what does that mean? It means the beginning and the end of the interval happen to fall within the same segment, so the interval didn't capture any point, and that case I didn't count. It doesn't matter which segment the interval sits in, because I get the same all-red result either way; it's one dichotomy. So all I need to do is add one, and that's the number. Do a little algebra and you get (1/2)N^2 + (1/2)N + 1. That is the growth function for this hypothesis set, and now I'm happy, because it's quadratic: more powerful than the previous model, whose growth function was linear. Now let's up the ante and go to the third example, convex sets. This time I'm taking the plane rather than the line, so the input space is R^2, and my hypotheses are simply the convex regions.
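The interval count can be verified the same way; a sketch under the same assumption (one representative location per segment for each end of the interval; names are mine):

```python
from math import comb

def positive_interval_dichotomies(points):
    """Enumerate dichotomies of h(x) = +1 iff a <= x <= b on 1-D points,
    placing each interval end at one representative location per segment."""
    points = sorted(points)
    locs = [points[0] - 1.0]
    locs += [(points[i] + points[i + 1]) / 2 for i in range(len(points) - 1)]
    locs.append(points[-1] + 1.0)
    # a == b in the same segment captures nothing: the single all-red dichotomy.
    return {tuple(+1 if a <= x <= b else -1 for x in points)
            for a in locs for b in locs if a <= b}

# The count matches m_H(N) = (N+1 choose 2) + 1 for every N.
for N in range(1, 9):
    assert len(positive_interval_dichotomies(list(range(N)))) == comb(N + 1, 2) + 1
```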
A convex region is a region where, if you pick any two points within it, the entirety of the line segment connecting them lies within the region; that's the definition. So this is my artwork for a convex region: you take any two points, and the segment stays inside, so this is an example. The blue is +1 and the red, the rest of the space, is -1, so this is a valid hypothesis. You can see there is an enormous variety of convex sets that qualify as hypotheses, but there are some that don't qualify. For example, this one is not convex because of this point: here is the line segment, and it went outside the region, so that's not convex. So we understand the hypothesis set; now we come to the task: what is the growth function for this hypothesis set? In order to answer this, I give you N and you place the points. Here is a cloud of points. It seems like putting them in general position is a good idea, so let's do that and try to see how many patterns I can get out of them using convex regions. This is going to be tough, because I can already see that I cannot get all of them. Say I take the outermost points and map them all to +1; that forces all the interior points to be +1 as well, because I'm using a convex region. Therefore I cannot have +1 for the outer points together with any -1 whatsoever inside, and that excludes a lot of dichotomies. So now I have to do real counting and whatnot. But wait a minute: the criterion for choosing the cloud of points was not to make them look good in general, but to maximize your growth function. Is there another choice of points that gives me more dichotomies than these? As
a matter of fact, is there a choice of where to put the points that gives me all possible dichotomies using convex regions? If you succeed in that, then you don't care about this cloud; the other one is what counts, because you are taking the maximum. Here is the way to do it: take a circle and put your points on the perimeter of that circle. Now I maintain that you can get any dichotomy you want on these points. What is the argument? Pick your favorite dichotomy, so you have a bunch of blues and a bunch of reds. Can I realize it using a convex region? Yes: I just connect the blue points, take the interior of that polygon to be +1, and whatever is outside goes to -1, and I am assured it's convex because the points lie on the perimeter of a circle. That means the growth function is 2^N, notwithstanding the other cloud. Now you realize a weakness in defining the growth function as the maximum: in a real learning situation, the chances are your points are not going to end up on a perimeter; there will be interior points, in which case you will not get all the possibilities. But we don't want to keep studying the particular probability distribution and the particular dataset you get, and so on; we would like a simple quantity, so we take the maximum over all datasets, which has a simple combinatorial property. The price we pay is that the bound we get will probably not be as tight as possible, but that's a normal price: if you want a general result that applies to all situations, it's not going to be all that tight in any given situation. That is the normal trade-off, but here the growth function is indeed 2^N. Just as a term: when you get all possible dichotomies, you say that the hypothesis set shattered the points, broke them in every possible way. So you can ask, can H shatter this set? It means you get all possible combinations on those points. Just a term.
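The circle argument can be checked numerically. A sketch (not from the lecture; helper names are mine), testing every dichotomy of N points on a circle for realizability by the convex hull of its +1 points:

```python
from math import cos, sin, pi
from itertools import product

def inside_hull(p, hull):
    """Is p strictly inside the convex polygon with vertices in CCW order?"""
    if len(hull) < 3:
        return False  # a segment or a point has no interior
    for i in range(len(hull)):
        ax, ay = hull[i]
        bx, by = hull[(i + 1) % len(hull)]
        # p must lie strictly to the left of every directed edge a -> b.
        if (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax) <= 0:
            return False
    return True

N = 8
pts = [(cos(2 * pi * k / N), sin(2 * pi * k / N)) for k in range(N)]  # on a circle, CCW

shattered = all(
    # A dichotomy is realizable by a convex set iff no -1 point falls
    # inside the convex hull of the +1 points.
    not any(inside_hull(q, [pts[i] for i in range(N) if y[i] == +1])
            for q in (pts[i] for i in range(N) if y[i] == -1))
    for y in product([+1, -1], repeat=N)
)
assert shattered  # all 2^N dichotomies realized: the N circle points are shattered
```

Running the same check on a cloud with interior points fails, matching the observation that the maximum in the growth function is achieved only by special configurations.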
Now let's look at the three growth functions on one slide, in order to compare. We started with positive rays and got a linear growth function; then we went to positive intervals and got a quadratic one, which is good, because as the model gets more sophisticated the growth function gets bigger; and then we went to convex sets, which are powerful and two-dimensional but not that powerful, and yet the growth function we got is inordinately bigger: 2^N. So sometimes this quantity will be too much, but in general you can see the trend: with more sophistication you get a bigger growth function. Now let's go back to the big picture to see where the growth function fits. Remember this inequality? Oh yes, we have seen it often; we are tired of it. What we are trying to do is replace M, and we decided to replace it with the growth function m_H(N). M can be infinite; m_H(N) is a finite number, at most 2^N. That's good. So what happens if we replace M with m_H(N)? Let's say we can do that, which we will establish in the next lecture. If your growth function happens to be polynomial, you are in great shape. Why? Because this quantity is a negative exponential in N. Epsilon can be very, very small, and epsilon squared really, really small, but the expression remains a negative exponential in N for any choice of epsilon you wish, and eventually it will kill any polynomial you put in front of it. I can put a thousandth-order polynomial and take epsilon equal to 10^-6, and if you are patient enough, or if your customer has enough data, which will be an enormous amount of data, you will eventually get the exponential to win, and the probability bound becomes diminishingly small, which means that you can generalize. So that's a very attractive observation.
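Numerically, with the bound written in the slide's form, 2 m_H(N) exp(-2 eps^2 N), and taking the substitution of m_H for M on faith until next lecture, even a high-order polynomial is eventually crushed. A sketch (names are mine):

```python
from math import exp

def bound(m_H, N, eps):
    """RHS of the modified inequality: 2 * m_H(N) * exp(-2 * eps^2 * N)."""
    return 2.0 * m_H(N) * exp(-2.0 * eps ** 2 * N)

poly = lambda N: N ** 10 + 1      # even a 10th-order polynomial growth function
assert bound(poly, 100, 0.1) > 1          # the bound is vacuous for small N...
assert bound(poly, 10_000, 0.1) < 1e-40   # ...but the exponential wins eventually
```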
Now all you need to do is declare that the growth function is polynomial and you are in business. We saw that it is not easy to evaluate it explicitly, but maybe there is a trick that will let us declare that it is polynomial, and once you declare that a hypothesis set has a polynomial growth function, you can declare that learning is feasible using that hypothesis set, period. We may become finicky and ask how many examples you need for what performance, et cetera, but at least we know we can do it: given enough examples, you will be able to generalize from a finite sample to a big, general space, with a probabilistic assurance. So that's pretty good. Maybe we can, as I mentioned, just prove that m_H, the growth function, is polynomial. Can we do that directly? Maybe not, so here's the key notion that will enable us to do it: we are going to define what is called a break point. You give me a hypothesis set; for the perceptrons the answer will be 4, for another set maybe 7. Just one number. That's much better than giving me the full growth function for every N. So what is the break point? The definition is the following: it is the point at which you fail to get all possible dichotomies. You can see that if the break point is 3, the hypothesis set cannot even generate all eight possibilities on three points. If the break point is 100, that's a pretty respectable hypothesis set, because it can generate everything up to 99 points, all 2^99 dichotomies, and only starts failing at 100. So you can see that the break point also corresponds to the complexity of the hypothesis set. Formally: if no dataset of size k can be shattered by H, that is, if on no k points can you generate all possible dichotomies, then k is a break point for H, and the growth function satisfies m_H(k) < 2^k, where 2^k counts all the possibilities for k data points. So for the 2D
perceptron, can you think of what the break point is? We did it already; we didn't say it explicitly in these terms, but here is the hint: for 3 points we did everything, and for 4 points we knew we cannot do everything. It doesn't matter which arrangement of 4 points you pick; as long as it breaks, it breaks, and therefore in this case the break point is 4. That number, 4, will characterize the perceptrons: don't tell me the hypothesis set, just tell me the break point, and I will tell you the learning behavior. Note also that if you have a break point, then every bigger number is also a break point: if you cannot get all possibilities on 10 points, then you certainly cannot get all of them on 11, because if you could shatter 11 points, you could just delete one and you would have shattered 10. Positive rays had growth function N + 1; that's a formula we can plug into, and we ask ourselves when we stop getting 2^N. What is the break point here? For N = 1 I get 1 + 1 = 2, which also happens to be 2^1. For N = 2, N + 1 is 3, which is less than 4, so 2 must be a break point. Since we invested in computing the growth function, we are just being lazy and substituting, but you could go back to the original setting and say it's obvious: with two points, if I want the rightmost point to be red and the leftmost to be blue, there is no way for a positive ray to generate that, and therefore 2 is a break point; there is a case where I fail. Let's go to the intervals; we need faster calculators now. The formula is (1/2)N^2 + (1/2)N + 1; when I put in N = 1 it gives 2, so it must be the correct formula. Let's try N = 2: I get 4, which is exactly 2^2. So what is the break point? It must be bigger than the other one, because this model is more elaborate.
And you realize it's 3: if you put 2 points you get all 4 dichotomies, and if you put 3 you get 7, which is short of 8. Again, that's not a mystery: you cannot get the middle point to be red while the other two are blue with a single interval, so you cannot get all possibilities on 3 points, and therefore 3 is a break point. What is the break point for the convex sets? Tell me a number of points where I fail; I'm never going to fail, so if you like you can say the break point is infinity, or that there is no break point; we will define it that way. So the break point, too, is just a single number. Now, what is the main result? A first attempt would be: if you don't have a break point, I have news for you, the growth function is 2^N for all N. Yes, that's just the definition, thank you, so that cannot possibly be the main result. The main result is this: if you have a break point, any break point, 1, 5, 7000, just tell me that there is one, you don't even have to tell me its value, then we can make a statement about the growth function. Do I hear a drum roll? The growth function is guaranteed to be polynomial in N. Wow. We have come a long way. I used to ask what the hypotheses are and count them; that was hopeless, because it's infinity. We defined the growth function, but we had to evaluate it; that was painful. Then we found the break point; maybe that's easier to compute, because I just need a clever argument showing some configuration I cannot shatter. Now all I need to hear from you is that there is a break point, and I'm in business as far as generalization is concerned, because regardless of which polynomial you get, you will be able to learn eventually. I will become more particular and ask for the value of the break point when I try to find the budget of examples you need for a particular performance, but in principle, if I just want to say that you can use this hypothesis set and learn, all I want you to tell me is that you have a break point.
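Given a growth-function formula, the break point is just the first N where the formula falls below 2^N. A sketch tying together the three examples (names are mine):

```python
def break_point(m_H, max_k=100):
    """Smallest k with m_H(k) < 2^k, or None if no break point is found."""
    for k in range(1, max_k + 1):
        if m_H(k) < 2 ** k:
            return k
    return None

positive_rays      = lambda N: N + 1
positive_intervals = lambda N: N * (N + 1) // 2 + 1   # (N+1 choose 2) + 1
convex_sets        = lambda N: 2 ** N

assert break_point(positive_rays) == 2       # 3 < 4 first at N = 2
assert break_point(positive_intervals) == 3  # 7 < 8 at N = 3
assert break_point(convex_sets) is None      # never breaks
```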
That's all I want. Now, this is a remarkable result, and I have to give you a puzzle to appreciate it. The idea of the puzzle is the following: if I just tell you that there is a break point, the constraint on the number of dichotomies you can get is enormous. If I tell you the break point is, say, 3, then on 100 points, for any choice of 3 of them, you cannot possibly have all possible combinations, and that holds for all 100 choose 3 of them. That combinatorial restriction is enormous, and you will end up losing possible dichotomies in droves because of it; that is why the count, which would be 2^N if unrestricted, collapses to a polynomial. So let's take a puzzle and compute this in a particular case. We have only 3 points, and for this hypothesis set I am telling you that the break point is 2, so you cannot get all four dichotomies on any 2 points. If you take x1 and x2, you cannot get all of -1 -1, -1 +1, +1 -1, and +1 +1; one of them has to be missing. Given this constraint, how many dichotomies can you get on the 3 points? You can see what I am trying to do: without the restriction on pairs I would be getting 8. For visual clarity I am going to write the dichotomies as black or white circles instead of -1 and +1. The first row is fine; it doesn't violate anything, being only one possibility. We keep adding rows, and everything remains fine until we get to a fourth row, because the whole idea is that I cannot get all 4 patterns on any pair of points; with fewer than 4 rows I cannot possibly have 4 combinations on two columns. So this row is still allowed. I am going through the rows in binary order, 000, 001, et cetera, and I
am still okay, right? Am I still okay? No: you have violated the constraint. You cannot keep the last row, so I have to take it out. Let's take it out and try the next one; maybe we are in luck. Are we okay? Yes, that's promising. Let's go for the next one. Are we okay? No, tough; we have to take out the last row again. How about this one? We take it out too. We don't have too many options left; actually, this is the last one, so it had better work. Does it work? No. So that's all we can do: we lost half of them. Now you may think that maybe I messed it up because I started very regularly, from 000, then 001, and that if I started differently I might be able to achieve more. It's conceivable, but please don't lose sleep over it: the only row you will be able to add to this table is this one. This is indeed the solution, and you can verify it at home. So now we know that the break point is a very strong restriction, and in the next lecture we are going to prove that it actually leads to polynomial growth, which is the main result we want. Let me stop here, and we will take the questions after a short break.

Let's start with the questions. The first question is: what if the target or the hypotheses are not binary? There is a counterpart of the entire theory that I am going to develop for real-valued functions and other types of functions. The development of the theory is technical enough that I am going to develop it only for the binary case, because it is manageable and it carries all the concepts that you need. The other case is more technical, and I don't find going to that level of technicality useful in terms of adding insight. What I am going to do instead is apply a different approach to real-valued functions, the bias-variance tradeoff, a completely different approach from this one that will give us another angle on generalization that
is particularly suitable for real-valued functions. But the short answer is that if the function is not binary, there is a counterpart of what I am saying that will work; it is just significantly more technical than the one I am developing. Next, a sanity check: when a hypothesis set can shatter the points, this is a bad thing, right? There is a tradeoff here that will stay with us for the entire course: it's bad and good. If you shatter the points, it's good for fitting the data, because I know that whatever your data is, I am going to be able to fit it; I have a hypothesis for every particular combination of labels. So if your question is whether I can fit the data, then shattering is good. When you go to generalization, shattering is bad, because then fitting the data doesn't mean anything, and you have less hope of generalizing; this will be formalized through the theoretical results. The right question is: what is the good balance between the two extremes? We will find a setting in which we are not exactly shattering the points, but we are not very restricted either, so we get some approximation and some generalization. Does this carry to higher dimensions? I explained the principle in terms of the two-dimensional perceptron, but if you look at the essence of it, the space X could be anything; the only restriction is that the hypotheses are binary functions. So the input could be a high-dimensional space and the separating surfaces could be very sophisticated, and all I am reading off, as far as this lecture is concerned, is how many patterns I get on N points. There was also a question on complexity: why is polynomial usually considered acceptable? Polynomial here means polynomial growth in the number of points N. It just so happens that we are working with
the Hoeffding inequality, which gives us a negative exponential, and therefore if you get a polynomial, any polynomial, as I mentioned, you are guaranteed that for large enough N the right-hand side of the Hoeffding bound, including the growth function, will be small, and therefore the probability that something bad happens is small. Now, obviously there are other functions that would also be killed by the negative exponential without being polynomials. But it just so happens that we are in the very fortunate situation that the growth function is either identically 2^N or else polynomial; there is nothing in between. If you draw something super-polynomial and sub-exponential and try to find a hypothesis set for which this is the growth function, you will fail. So I am simply taking advantage of the polynomials because, luckily for me, polynomials are what come out, and they happen to serve the purpose. A few people are asking: could you repeat the constraints of the puzzle? Let's look at the puzzle. I'm putting 3 bits on every row, and I'm trying to get as many different rows as possible under the constraint. If I focus on x1 and x2 and go down those columns, it must be that one of the four possible patterns for x1 and x2 is missing, because I'm saying that 2 is a break point, so I cannot shatter any two points; therefore I cannot shatter x1 and x2, among others, meaning I cannot get all possible patterns on that pair. In this first row, x1 and x2 get 00, so to speak. If I keep adding rows: look here at x1 and x2. How many patterns do they have? They have this pattern, and they have it again, which doesn't count, so there's only one pattern, plus this one makes 2 patterns on x1 and x2, so I'm okay, and similarly for the other pairs. Things become interesting
when you add the fourth row. Again, if you look at the first two points, I get one pattern here and one pattern there; there are only two patterns, so nothing is violated as far as those two points are concerned. But the constraint has to be satisfied for any choice of two points, so choose x2 and x3 in particular and count the number of patterns; that's why we put them in red. These columns now have all four possible patterns, and I know by the assumption of the problem that I cannot get all four patterns on any two points. So I'm unable to add this row under the constraint, and therefore I take it away. I go through the exercise, and every time I add a row I keep an eye on all possible pairs. Here, look at x1 and x2: I'm okay. x2 and x3: one pattern here and one there, also okay. Then x1 and x3: here is a pattern, and it repeats here, 00 and 00, so that's one; then I get this one and this one, making three. I extend this further and start adding the new rows. For this one there is a violation; you can scan with your eyes and try to find it, and I'm highlighting it in red. I am showing you that for x1 and x3 all four patterns are there: here is one; the second one I marked only once, since it already happened; then the third and the fourth. So I cannot possibly add this row, because it violates the constraint. This is the next attempt, and it still violates, by the same argument: look at the red entries and you find all possible patterns, so I cannot have it; take it away. The last remaining row also doesn't work, because it violates the constraint for those points, as you can verify. The conclusion is that I cannot add anything; that's what I'm stuck with, and therefore the number of different rows I can get, under the constraint that 2 is a break point, is, in this case, four.
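The whole puzzle can be brute-forced over every subset of the 8 candidate rows; a sketch (not from the lecture; names are mine):

```python
from itertools import combinations, product

rows = list(product((0, 1), repeat=3))   # the 8 candidate dichotomies on 3 points

def no_pair_shattered(table):
    """True if no two of the three columns show all four 2-bit patterns."""
    return all(len({(r[i], r[j]) for r in table}) < 4
               for i, j in combinations(range(3), 2))

best = max((s for k in range(len(rows) + 1) for s in combinations(rows, k)
            if no_pair_shattered(s)), key=len)
assert len(best) == 4   # at most 4 of the 8 dichotomies survive the constraint
```

Since the search covers every subset, not just those built up in binary order, this also confirms the symmetry remark: no cleverer starting order does better than four.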
The remark I mentioned is that maybe, instead of going gradually through 000, 001, and so on, you could start more cleverly; however, any way you try it, the problem is sufficiently symmetric in the bits that it doesn't make a difference: you will be stuck with at most four. Next question: in the slide with the Hoeffding inequality, does anything change, specifically does the probability measure change, when you go from hypotheses to dichotomies? The idea here is that M is the number of hypotheses, period, so it's infinite for perceptrons, and we have to live with that. In our attempt to replace it with the growth function, we replace it by something that is not infinite, bounded above by 2^N. As you can see, 2^N by itself is not very helpful; I am trying not only to put the growth function in that slot, but also to show that the growth function is polynomial for the models of interest, and therefore to get the right-hand side to be a small quantity for a real learning model like the perceptron, or neural networks; all of these will have polynomial growth functions, as we will see. So that's where the number of hypotheses, which is M, goes over to the number of dichotomies, which is the growth function; that is what gets me a manageable right-hand side, one that goes to zero as N grows, which tells me that the probability of good generalization will be high. Is there a systematic way to find break points? There is, though not one-size-fits-all. There are arguments; for example, with neural networks you can sometimes find a particular configuration that you cannot shatter and argue that this gives a break point. Sometimes you can find a crude estimate for the growth function, say that the growth function cannot be more than some quantity, and then, as you go along,
you realize that this estimate is not exponential, so there has to be a break point: at some point the estimate drops below 2^N, and that gives a break point. In that case the break point you get is just an estimate, not an exact value. Next: in this slide, is the top N the number of testing points and the lower N the number of training points? N is always the size of the sample; it's a question of interpretation whether that sample is used for testing, which means you have already frozen your hypothesis and are just verifying it, or whether you haven't frozen your hypothesis and are using the same sample to go around and find one, in which case you are charged for the going around. So let's say a customer gives us K sample points; how do we decide how many to reserve for testing and how many for training? This is a very good point, and there will be a lecture down the road, called Validation, in which this is addressed very specifically. There are rules of thumb, and some mathematical results, that I will state without proof when the time comes, balancing two goals: not to diminish the training set very much, while still having a big enough test set that the estimate is reliable. So this will come up; thank you. There is another question: if two hypotheses produce the same dichotomy, is it true that the in-sample error is the same for the two hypotheses? If they produce the same dichotomy, that is an even stronger condition than having the same error, because they return exactly the same values on the sample. The in-sample error is the fraction of points you got right and wrong; the target function is fixed, so that is not going to change, and obviously I'm going to get the same pattern of errors, and if I get the same pattern of errors then obviously I'm getting
the same fraction of errors, among other things. Now, if you ask what the out-of-sample error of these two hypotheses is, that's a different story, because the out-of-sample error depends on the entire input space. In spite of agreeing on the sample points, they may not agree on the entire input space; in fact they don't, because they are different hypotheses, and therefore you get a different E_out. But the answer to your question is yes, you get the same E_in. I see; then, if the out-of-sample error is different for two hypotheses, how can we replace M with the growth function exactly? That is the biggest technicality in the proof. When I said we are going to replace M by the growth function, that is a very helpful step, but there has to be a proof, and I will argue for the proof and the overlapping aspect. The key point is what to do about E_out: when I consider the sample, E_in is very much under control, as you said, because two hypotheses with the same dichotomy agree there; but they are not the same for E_out, and the statement in the inequality depends on E_out, which depends on the whole space. How I get away with that is really the main technical contribution of the proof, and it will come next time. Sure, thank you. Why is it called the growth function? The person who introduced it called it the growth function, I guess because it grows as you increase N; I don't think there is any particular merit to the name. What is a real-life situation, similar to the one in the puzzle, where you realize that the break point may be too small? The first order of business is to establish that there is a break point; then we are in business. The second is how the value of the break point relates to the learning situation: do I need more examples when I have a bigger break point? The answer is yes. What is the estimate? There is a
theoretical estimate, a bound; maybe the bound is too loose, and we will also find practical rules of thumb in the course that translate the break point into a number of examples. All of this is coming up. So the existence of a break point means learning is feasible, and the value of the break point tells us the resources needed to achieve a certain performance; that will be addressed. Is there a probabilistic statement for the Hoeffding inequality, as an alternative to the case-by-case discussion of M's growth rate in N? There are alternatives to Hoeffding, and with them you can get different results or emphasize different things. I'm sticking to Hoeffding, and I'm not indulging too much in its derivation or its alternatives, because this is a mathematical tool that I'm borrowing and taking for granted, and I picked the one that will help us the most, which is this one. So yes, there are variations, but I'm deliberately not getting into them, in order not to dilute the message. I want people to become so incredibly familiar, even bored, with this one, that when we get to modify it, including the growth function and the other technical points, the baseline is complete in people's minds and they don't get lost in the modifications. That's why I'm sticking to it. I think that's it. Very good; we'll see you next time.