So, let us get started. Today we are going to look into what is the best you can do when you are in the online learning setup we discussed in the last class: what is the smallest number of errors that you could incur, and what number of errors is unavoidable for you? What is that number? To answer that, we started introducing a setup to understand what is the best the adversary can do to inflict the maximum number of errors on you. For that we started defining a tree, and then we introduced the notion of shattering. So, let us repeat some aspects of that. I hope all of you were able to go through what I asked you to do in the last class, that is, to understand what the VC dimension is. The VC dimension basically captures the complexity of the hypothesis class you are going to learn, and there is a similar notion for the online learning setting that we are going to discuss today.

So, in the last class we introduced a binary graph, sorry, a binary tree, whose nodes we call v_1, v_2, v_3, v_4, v_5, v_6, v_7 and so on. These nodes are points coming from my sample space, and each point has an associated label; say the root has label y_1. If y_1 is 1, we said we go right, and if it is 0, we go left. When I say you fix a binary tree, you are basically coming up with this sequence of points, numbered in this fashion, and then, depending on your labeling sequence, you take one of the paths. For example, if your labeling sequence has y_1 = 0, then y_2 = 1, then say 0, the sequence 0, 1, 0 takes you down one particular path; and if you change the points v_1, v_2, v_3, you end up with another graph, sorry, another tree. So a binary tree of depth d we are simply going to denote by its sequence of points v_1, v_2, ..., v_{2^d - 1}; it consists of that many points, and your binary labeling sequence determines the path you take through it.

Now, what we are doing with this is thinking about a strategy for the adversary. Suppose at the beginning the sample point v_1 is shown to you, that is, x_1 = v_1, and you assign a label y_1 to this point; based on whether you selected 1 or 0, you end up at one of the two children. What will be the index in the t-th round? In round t you are shown x_t = v_{i_t}, where the index is updated as i_1 = 1 and i_{t+1} = 2 i_t + y_t (label 0 takes you to the left child, label 1 to the right child), and based on the associated label y_t you then see x_{t+1}. Now, the question is: you are going through these points, and these are the associated labels. Suppose I can come up with a hypothesis h belonging to my hypothesis class H such that h(x_t) = y_t for every t. What I am basically saying is: whatever path you go through, I have a hypothesis that maps each point on that path to its label y_t.
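To make this indexing concrete, here is a minimal sketch (assuming the breadth-first array layout and the update rule i_{t+1} = 2 i_t + y_t described above; the function name is illustrative):

```python
# A binary tree of depth d stored as a flat list v[1], ..., v[2^d - 1]
# (index 0 unused): node i has left child 2i and right child 2i + 1.

def path_points(tree, labels):
    """Return the points x_1, x_2, ... visited when following the binary
    label sequence through the tree (label 0 goes left, 1 goes right)."""
    i = 1                       # i_1 = 1: start at the root
    points = []
    for y in labels:
        points.append(tree[i])  # x_t = v_{i_t}
        i = 2 * i + y           # i_{t+1} = 2 i_t + y_t
    return points

# Example: a depth-2 tree with nodes v_1, v_2, v_3 (index 0 is a placeholder).
tree = [None, "v1", "v2", "v3"]
print(path_points(tree, [0, 1]))  # ['v1', 'v2']: y_1 = 0 goes left to v_2
```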
If you could do this for any given labeling sequence, then we say that this tree is shattered. Is this point clear? You take a depth-d tree and you fix its points; if you change the points, it is a different tree represented by those points. We say that this tree is shattered if, for any labeling sequence you give me, I can come up with an h in H such that all the x_t along the path traversed according to those labels receive exactly those labels y_t. Let me make that more formal. This is the formal definition of shattering. Notice that what is important here is that, once I am given this binary tree, I should be able to do this job of matching the labels to the points along the path for every labeling sequence. So let us think like this: I have been given this tree, and in addition I have been given a binary sequence of d labels. Can you come up with a hypothesis which, along the path taken according to these labels, ensures that all the points on the path have the associated labels? If you can do this for every sequence, we call this binary tree shattered.

Now, let us introduce this definition and then discuss how it is useful: Ldim(H). This is shorthand for the Littlestone dimension. Is this notion of shattering of a tree clear? Now, you give me a hypothesis class; right now I am not saying anything about whether this hypothesis class is finite or infinite, it could be countably infinite or uncountably infinite. Then we say that Ldim(H), the Littlestone dimension, is the maximal integer d such that there exists a tree of depth d that is shattered by H. That means you keep taking such trees of different depths. If you can ensure that some tree of depth d is shattered by the hypothesis class H, but no tree of depth d + 1 is shattered, then it is clear that this integer d is the maximal depth of a shattered tree, and that value we call the Littlestone dimension of the hypothesis class H.

Good. So what is the usefulness of this? Can you think of what kind of errors an adversary can force on you using such a tree? From this, can you get some intuition that it will be at least Ldim(H)? Why is that? The adversary always answers in the opposite direction, while still being able to keep his labels legitimate with respect to some hypothesis. So let us understand this. You first show the learner the point v_1, that is, x_1 = v_1. The learner predicts some y_1 hat. Then you say: the true label is the opposite of y_1 hat; whatever that opposite is, call it y_1 and declare y_1 as the true label. So you forced the learner to incur one error in the first round. Now, based on whether the declared label was 0 or 1, you move in the corresponding direction. Say he happened to predict 0 and you happened to declare it as 1; now you came down this path. You show him the next point, and whatever label he says there, you again declare the opposite. Because of that, what you are basically generating is this sequence of labels y_t.
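For a finite hypothesis class over a finite sample space, the shattering condition can be checked directly. A rough sketch (hypotheses represented as Python functions, trees as flat 1-indexed lists in the indexing used above):

```python
from itertools import product

def is_shattered(tree, hypotheses, depth):
    """Check whether the tree (flat 1-indexed list with 2^depth - 1 nodes)
    is shattered: for every binary label sequence there must be a hypothesis
    agreeing with the labels along the corresponding root-to-leaf path."""
    for labels in product([0, 1], repeat=depth):
        found = False
        for h in hypotheses:
            i, consistent = 1, True
            for y in labels:
                if h(tree[i]) != y:  # need h(x_t) = y_t on the path
                    consistent = False
                    break
                i = 2 * i + y
            if consistent:
                found = True
                break
        if not found:
            return False  # this label sequence has no consistent hypothesis
    return True
```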
So, if the learner says y_t hat in round t, you are going to declare the label to be y_t = 1 - y_t hat. Whatever sequence of predictions he makes, whatever algorithm or strategy he applies, you just declare the complement to be the true label. But these labels are not arbitrary: they are still governed by some hypothesis in your class. The adversary is still sticking to a rule; remember we are trying to enforce the realizability assumption here. It is not that the adversary just looks at your prediction and says the opposite of that, because then there may not be any hypothesis which makes such labels feasible. What you have to ensure is that whatever the adversary declares corresponds to some hypothesis on those samples, and this kind of shattering is exactly what ensures that: whatever path I take, there is some hypothesis conforming to the labels I am going to observe. Is this clear?

So, because of that, I am able to inflict errors on the learner. But now the question is: what number of errors can I force? The learner may make some unforced errors himself; I do not care about that. What I care about is the minimum number that I can enforce on him. By doing this, we are ensuring that the adversary will be able to force at least this many errors on him. Is this clear? Yes, all we are saying is that the adversary has a strategy, there exists a strategy, which allows him to force this many errors on the learner.

How is it coming about? We are describing the strategy used by the adversary. Remember what happens in our online algorithm: in every round you show one context and the learner has to give a prediction for it. Say the learner gives prediction y_t hat in round t. Now what do you do? You declare that the true label is the complement of this, y_t = 1 - y_t hat. By doing this you have made sure that the learner has made an error, but you also have to ensure that this y_t is such that there is a hypothesis which has been consistent up to that point, because the adversary is sticking to one hypothesis and generating these labels from it. So let us say that up to round t - 1 he has produced labels y_1, ..., y_{t-1} on x_1, ..., x_{t-1} consistent with some hypothesis; in the t-th round he also has to ensure he can stay consistent. Is it feasible for him to do so? It is feasible if he follows this strategy on a tree which is shattered by the hypothesis class. Now we are just asking what is the maximum depth he can go on doing this, and that is exactly the maximal integer in the definition, the Littlestone dimension. Because of this, even if the learner is perfect, in the sense that he makes no unforced errors himself, the adversary forces at least this many errors on him.

You remember we had a notation for the number of mistakes an algorithm A makes while learning on hypothesis class H. Now, can I relate these two things? We are saying this is the maximum number of errors by our definition. So, what is our definition of M_A(H)?
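A minimal sketch of this adversary strategy (the tree is assumed to be shattered by H, so some hypothesis is always consistent with the declared labels; `learner_predict` stands in for an arbitrary learner):

```python
def adversary_play(tree, depth, learner_predict):
    """Walk a tree shattered by H for `depth` rounds: show v_{i_t}, observe
    the learner's prediction, and declare the complement as the true label.
    Shattering guarantees some h in H is consistent with all declared labels,
    so the adversary never violates realizability."""
    i, history = 1, []
    for t in range(depth):
        x_t = tree[i]
        y_hat = learner_predict(x_t)
        y_t = 1 - y_hat        # y_t = 1 - y_t_hat: the learner is wrong
        history.append((x_t, y_t))
        i = 2 * i + y_t        # descend according to the declared label
    return history             # `depth` forced mistakes, one per round
```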
M_A(H) we have defined as the supremum over all sequences of inputs S of M_A(S), that is, M_A(H) = sup_S M_A(S), where M_A(S) is the number of prediction errors you made on the sequence S using your algorithm A. So M_A(H) is the maximum number of errors you would make using algorithm A while learning hypothesis class H. And what did we just say? Ldim(H) is the minimum number of errors that can be forced on you. So it has to be a lower bound: Ldim(H) <= M_A(H). Is this clear now? By using this kind of strategy, where whatever you predict the adversary just makes the complement of that the true label, the adversary is going to make sure that you incur at least this many mistakes. Note that we have taken the supremum over all possible sequences; this is the worst-case scenario, and the worst case will be at least this much. If you remove the sup, you may be lucky and make fewer mistakes on a particular sequence, but if you are testing your algorithm over all possible sequences, then there exists one sequence on which you will be forced to make this many mistakes.

Now, let us say my hypothesis class is strictly finite. Can I give an upper bound on Ldim(H)? Notice that the Littlestone dimension is the maximal integer d such that some tree of that depth gets shattered. So can I connect this with the size of my hypothesis class? Log, right? We can always think of it like this: whatever points you take, each root-to-leaf path corresponds to a unique binary sequence, and hence must correspond to a different hypothesis. So the total number of leaves is at most the number of hypotheses, and if the tree has |H| leaves, what will be its depth? It will be exactly log2 |H|. So we have this natural bound, Ldim(H) <= log2 |H|. Fine, this is one way of looking into it, but we can also derive this result from what we know already. We know that if you take the algorithm A to be the halving algorithm, what is an upper bound on M_A(H)? Log2 |H|, right. So we already have this result, and if you just combine the two, Ldim(H) <= M_Halving(H) <= log2 |H|, which is just an alternate way of saying the same thing. Yes, halving is a specific algorithm, but the bound Ldim(H) <= M_A(H) is independent of the algorithm: it holds irrespective of what algorithm you run. For the halving algorithm I know the upper bound log2 |H| holds, and since the lower bound is independent of the algorithm, I can make this comparison.

Now, let us take a simple example. Let us say my sample set has only d points, which we have enumerated as 1, 2, ..., d. And let us say my hypothesis class consists of again only d hypotheses, where h_j(x) = 1 if x = j and 0 otherwise. Is this example clear? Just realize what this example is: I have only d points and I have only d hypotheses. Let us focus on hypothesis 1. For j = 1, this hypothesis assigns label 1 only when x = 1, and everywhere else it is 0.
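For reference, a sketch of the halving algorithm the upper bound refers to (finite class, realizable labels; each mistake happens when the majority of the surviving hypotheses is wrong, so it removes at least half of them, giving at most log2 |H| mistakes):

```python
from math import log2

def halving_run(hypotheses, stream):
    """Run the halving algorithm on a stream of (x, y) pairs whose labels
    are realizable by some hypothesis in the class.  Predict the majority
    vote of the surviving version space; each mistake removes at least
    half of it, so mistakes <= log2(len(hypotheses))."""
    version_space = list(hypotheses)
    mistakes = 0
    for x, y in stream:
        votes = sum(h(x) for h in version_space)
        y_hat = 1 if 2 * votes > len(version_space) else 0  # majority vote
        if y_hat != y:
            mistakes += 1
        # keep only the hypotheses consistent with the revealed label
        version_space = [h for h in version_space if h(x) == y]
    assert mistakes <= log2(len(hypotheses))
    return mistakes
```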
So, if you want to represent this h_1 just as a vector of labels on these points, it is 1 0 0 ... 0 with d entries, and h_2 will be 0 1 0 ... 0, and so on. Now, can we compute the Littlestone dimension of this hypothesis class? How do we go about computing it? I know I have to look for the maximal integer d, but right now I do not know what the maximum is. So I will start with some small depth and keep checking up to what depth I am successfully able to shatter; after that I will not be able to shatter.

One simple thing to do: take some initial point x, one of these d points, and show it to the learner. Let us say the learner predicted the label y_1 hat to be 0. You are going to contradict him, so you say the label is 1. But the only way you can say the label is 1 is by matching the hypothesis to the point: if you have shown point 1, you must have been applying h_1; if you have shown 2, you must have been applying h_2. So suppose you started by showing point 1; the only way you can say y_1 = 1 is had you selected h_1. Now you are at the next node, and here you show some other point, say 3. Suppose the learner again says 0. Can you contradict him here? You cannot contradict him here, because h_1 assigns label 0 to point 3; contradicting him would be inconsistent with h_1. Because of this, if my learner keeps on just saying 0, 0, you can force at most one mistake on him, nothing more than that. So what will be the Littlestone dimension of this hypothesis class? It is going to be 1.

We have to show that this is the maximum depth of a tree that can be shattered, and that beyond that you cannot shatter; that is the whole point here. We showed that the adversary could shatter only up to depth 1. But you might say that for depth 2 you can also go; for depth 2, though, you would have to show that for any sequence of labels the tree is shattered, and I am now showing you a sequence you are not able to realize: if the learner's predictions are 0, 0, the declared labels would have to be 1, 1, but the first label 1 pins the hypothesis down to h_j for the shown point j, and then a second label 1 is infeasible on any other point (and if you show j again, you cannot realize 1, 0 instead). Yes, you can choose anything; the definition says there exists a tree, and it could be any tree. For a given tree, this is the definition of shattering; for the Littlestone dimension to be d, all I want is that there exists a tree of depth d that I am able to shatter, and what I am looking for is the maximal depth I can go to, whichever tree it is. So here I am showing you one tree which is not able to shatter this sequence, and I am saying that if you take any other tree, it is also not going to work: you cannot go beyond one step. So this is why the Littlestone dimension is going to be 1. (Student: the points can be arranged in any form?) Yes, you take any points; you will not be able to go beyond one step for this specific example. But then what is the cardinality of H here? It is d, because there are exactly d hypotheses in it.
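Using the is_shattered sketch from above, the claim can be verified by brute force for small d (enumerating all placements of points at the three nodes of a depth-2 tree; points may repeat):

```python
from itertools import product

d = 4
points = list(range(1, d + 1))
singletons = [lambda x, j=j: 1 if x == j else 0 for j in points]

# Depth 1: a single-node tree is shattered (label 1 via h_v, label 0 via
# any other h_j), so Ldim is at least 1.
assert any(is_shattered([None, v], singletons, 1) for v in points)

# Depth 2: no placement of points at nodes v_1, v_2, v_3 gives a shattered
# tree, so Ldim is exactly 1 no matter how large d is.
assert not any(
    is_shattered([None, v1, v2, v3], singletons, 2)
    for v1, v2, v3 in product(points, repeat=3)
)
```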
You see that d here could be very large; whatever d is, the Littlestone dimension is always 1 for this hypothesis class, whereas the size of your hypothesis class grows as d grows. Because of that, you see that for this specific example |H| can be made very large, since I can keep increasing d, whereas Ldim(H) is going to stay small. So even though you can have a large number of hypotheses, the adversary can in the worst case, which is the best case for you, force only one error. So, in a way this is again capturing the complexity of the hypothesis class: analogous to the VC dimension, in the online learning setup the Littlestone dimension captures the complexity of the class in terms of learning it, that is, how many errors you are going to incur before you learn the right hypothesis. That is the analogous notion here.

Now, I want you to work out this exercise yourself; it is an exercise that is there in the book, but you have to analyze it carefully. Let us take my sample space X to be the interval [0, 1], and let my hypothesis class be the thresholds, H = { x -> 1[x < a] : a in [0, 1] }. So what is my sample space? All points in the interval [0, 1]. What is my hypothesis class? For each fixed a, one hypothesis is defined: whatever x you give, if it happens to be less than the given a, the label is 1, and if x happens to be greater than or equal to a, the label is going to be 0. Depending on what threshold a you choose, you get a different hypothesis, so in fact the cardinality of this H is unbounded. Now compute Ldim(H) for this hypothesis class; you will end up showing that this is also infinity.
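As a hint for the exercise, here is a sketch of the standard bisection construction showing that thresholds shatter a tree of any depth (each node is the midpoint of the interval of thresholds still consistent with the labels so far), which is why Ldim is infinite here:

```python
def threshold_path(labels):
    """Follow a label sequence down the bisection tree for thresholds on
    [0, 1], where h_a(x) = 1 iff x < a.  Each node is the midpoint of the
    interval of thresholds consistent with the labels seen so far.  Returns
    the visited points and one threshold a realizing exactly those labels."""
    lo, hi = 0.0, 1.0
    points = []
    for y in labels:
        x = (lo + hi) / 2
        points.append(x)
        if y == 1:
            lo = x             # label 1 needs x < a, so a lies in (x, hi]
        else:
            hi = x             # label 0 needs x >= a, so a lies in [lo, x]
    return points, (lo + hi) / 2   # any a in the final interval works

pts, a = threshold_path([1, 0, 1])
assert all((x < a) == bool(y) for x, y in zip(pts, [1, 0, 1]))
```

Since this works for a label sequence of any length, a shattered tree of every depth exists, so no finite maximal depth exists and Ldim(H) is infinite, even though the VC dimension of thresholds is only 1.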