So now the question is: fine, I went from the cardinality of H down to the log of the cardinality of H; I have given you two algorithms. We may then ask whether something better is possible. Something better could be, say, half of log_2 |H|; an algorithm achieving that would be better still, since it makes fewer mistakes. Once this algorithm seems better than that one, the natural question is: how much better can we do? Is there a limit? Who decides that limit? There is an environment and there is a learner; you are learning against an environment. If the environment becomes tougher and tougher, things change. As I said earlier, had I not imposed the realizability condition, there is no way you could have obtained all these bounds: the environment could have made you incur a loss in every round. So the kind of bounds we get depends on how tough our environment is. We restricted the environment's capability by imposing the realizability condition. Even with the environment's power restricted in this way, we may still ask: under this scenario, what is the best I can do, what is the best bound I can achieve? Is log_2 |H| the best, or can I go beyond it? To answer this, let us introduce some notation. Notice that the bounds I gave you hold irrespective of how the environment chooses the x_t's. Even though I said there is a fixed h* that assigns the labels, I have made no assumption about how this h* itself is chosen.
This h* could be very complicated, and because of that it may not be easy for you to figure out what h* is. To understand the best we can do, we should try to give a lower bound: a statement that, no matter what algorithm you use, you will incur at least this much loss. That is like a dead end: no matter how intelligent I am, however good my algorithm is, I will suffer at least that much loss. If we can derive such a bound, we can then see how good the upper bound we obtained is compared to the lower bound. To do this, let us formally introduce the notions of mistake bounds and online learnability. Let S denote one sequence you faced from the environment: in the first round the environment generated the pair (x_1, h*(x_1)), and so on, up to the n-th round. Notice that this is not the same sequence your algorithm will face every time you run it; if you restart your algorithm, the sequence it sees could be altogether different. So S is one sequence your algorithm faced, and let M_A(S) denote the number of mistakes your algorithm A made on this sequence S. For this discussion we fix n: we assume the algorithm is run for some fixed number of rounds n. My interest is in the number of mistakes I make, but I am not interested in a bound for one particular sequence; I would like a bound that holds irrespective of which sequence I see, because I do not know which sequence my algorithm will face. So let me set up the following notation.
So M_A(S) is the number of mistakes made by my algorithm A on a sequence S. Now I take this against all possible sequences and ask: what is the maximum number of mistakes it makes on any of these sequences? In a way I am taking the worst case: the toughest sequence is the one on which you make the largest number of mistakes. I denote this by M_A(H), where H is my hypothesis class and A is my algorithm; this is the maximum number of mistakes made by algorithm A while learning the hypothesis class H. Let me restate this: I take a sequence S containing n points, call M_A(S) the number of mistakes made on S, and take the supremum over S to get the maximum number of mistakes on any sequence. Here I need not restrict S to be of size n; it can be of arbitrary length: you may have run for n = 100, for n = 1000, or for n = 1 million. Now we say the hypothesis class is online learnable if there exists some algorithm A, I do not specify which, for which I can bound this quantity, the maximum number of mistakes on any sequence, by some B that is a finite constant. If that is the case, the class is online learnable. Now, my two algorithms, the consistency algorithm and the halving algorithm: if I have a finite hypothesis class H, can that finite class be made online learnable using them?
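Written out (with y-hat_t denoting the learner's prediction in round t), the quantities just introduced are:

```latex
M_A(S) = \bigl|\{\, t : \hat{y}_t \neq h^*(x_t) \,\}\bigr|,
\qquad
M_A(\mathcal{H}) = \sup_{S}\, M_A(S),
```

and H is online learnable if there exist an algorithm A and a finite constant B such that M_A(H) <= B.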
Put alternatively: if I have a hypothesis class that is finite in size, is it online learnable according to this definition? We are still under the realizability assumption. So what was B for my consistency algorithm? If H is finite then its cardinality |H| is finite, and I can take that value to be my B. So I have a bound independent of the sequence length. For the halving algorithm I got log_2 |H|, which I can take to be B. So as long as I have a hypothesis class that is finite in size, I am good: the class is learnable according to this definition. It is not that, if I continue to use the algorithm forever, I will make a large number of errors; at some point I stop making errors, because the number of mistakes is bounded by a finite number, and after that no more mistakes are made. Think of it like this: you start applying the algorithm and keep applying it forever; after some point the algorithm makes no more mistakes, because you have ensured its mistake bound is finite. So now we have to answer the question: what is the best we can do? Are the B's we got earlier, namely |H| and log_2 |H|, the best possible, or can we do better, and if so, what should we aim for? For that we now look at a lower bound on the number of mistakes the environment can force on me, because the environment is the one against whom we are learning. If the environment is powerful, maybe it can force more errors on me; but what is the maximum number of errors it can force, and can I also ensure that I make no more errors than that?
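As a concrete illustration, here is a minimal sketch of the halving algorithm for a finite hypothesis class under the realizability assumption. The function names and the toy class of threshold hypotheses are my own hypothetical choices, not from the lecture; the point is only to see that the mistake count never exceeds log_2 |H|.

```python
import math

def halving(hypotheses, stream):
    """Run the halving algorithm on a stream of (x_t, y_t) pairs.

    hypotheses: list of callables h(x) -> 0/1 (the finite class H)
    stream:     iterable of (x_t, y_t) with y_t = h_star(x_t) (realizability)
    Returns the number of mistakes made.
    """
    version_space = list(hypotheses)  # V_1 = H
    mistakes = 0
    for x, y in stream:
        # Predict by majority vote over the current version space.
        votes = sum(h(x) for h in version_space)
        prediction = 1 if 2 * votes >= len(version_space) else 0
        if prediction != y:
            mistakes += 1  # each mistake at least halves the version space
        # Keep only hypotheses consistent with the revealed label.
        version_space = [h for h in version_space if h(x) == y]
    return mistakes

# Toy class: thresholds h_c(x) = 1 iff x >= c, for c in 0..7, so |H| = 8.
H = [(lambda x, c=c: 1 if x >= c else 0) for c in range(8)]
h_star = H[5]
xs = [3, 6, 1, 5, 4, 7, 0, 2]
m = halving(H, [(x, h_star(x)) for x in xs])
assert m <= math.log2(len(H))  # mistake bound: at most log2|H| = 3
```

On this particular stream the algorithm makes 2 mistakes, within the guaranteed bound of 3, even though the sequence has 8 rounds.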
Whatever errors are forced on me are unavoidable; that many errors I will make, but no more than that. Any questions so far about these algorithms and the notion of a hypothesis class being online learnable? Notice how these results rest on the assumption: without the realizability assumption, these statements are all incorrect. In the next class we will relax that assumption and ask what we should look for when it does not hold. To understand the lower bound we are going to get on the mistakes, we have to understand what power the environment has, and we should think of it in an adversarial manner: what is the largest number of mistakes the environment can force on you? We sometimes see this in sports: if you are playing against a very tough opponent, he can force a lot of errors on you. You can make unforced errors yourself, but a strong opponent can hit the ball into spots that make you err. So it depends on the power of your adversary. Here let us treat the environment in a similar fashion: the environment's aim is simply to make you commit many errors, to win against you. How many errors can it force on you? In this way we will always treat the interaction between the environment and the learner as a game: one player, you, tries to learn while making the minimum number of mistakes, and the opponent tries to force the maximum number of mistakes on you.
Now, you have ensured through these bounds that, however tough your opponent is, you will not make more than this many mistakes. But let us ask from the opponent's point of view: how many mistakes can he guarantee to force on you? That is his question. For that we now need to understand the power of the environment, and remember, we look at this power still under the restriction that it must follow the realizability assumption. That is the rule we have set up for this game, and the environment has to show its power under this rule. To understand this we will consider a tree structure. Consider a binary tree of depth n. How many nodes does it have? 2^(n+1) - 1. For simplicity, take the case of depth 2: a depth-2 tree has 7 nodes. Call the root v_1. These points v_1, v_2, v_3, and so on are sample points, the ones we earlier denoted as x_t: v_1 is the first point, v_2 the second, v_3 the third, and so on. Now suppose the adversary applies the following strategy; forget what the learner is doing and think only from the adversary's, the environment's, point of view. He starts with v_1, the root node, and assigns some label to it. Depending on the label he moves to one of the two children, and in the next round he uses that child as his sample.
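The node count follows from summing over the levels of a complete binary tree of depth n, with 2^d nodes at level d:

```latex
\sum_{d=0}^{n} 2^d = 2^{n+1} - 1,
\qquad\text{e.g. for depth } n = 2:\quad 1 + 2 + 4 = 2^{3} - 1 = 7,
```

which matches the 7-node example above.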
If he gives the other label he moves to the other child and uses that as the next sample, and he keeps doing this. Let us fix the convention (say this side is left and that side is right): if he assigns label 0 he moves to the left child, and if he assigns label 1 he moves to the right child, and he keeps branching in this way. Since we have 2^(n+1) - 1 nodes, I enumerate them as v_1, v_2, v_3, all the way up to v_{2^(n+1) - 1}. These are the node choices he has, and using them he has constructed a binary tree. As I said, his strategy is to choose x_1 = v_1 to begin with. Now, at round t, let I_t be the index of the current node, so he chooses x_t = v_{I_t}. From this node, if he assigns label y_t = 0 to the point, he goes to the left child, and if he assigns label y_t = 1, he goes to the right child. Let us first define just this strategy, this construction, and then see what kind of damage the adversary can inflict on the learner using it.
So at I_t he goes either left or right depending on whether the label y_t is 0 or 1, and I have numbered my nodes in this fashion: v_1, v_2, v_3, and further down v_4, v_5, v_6, v_7, then v_8, v_9, and so on. In that case, what is I_{t+1} going to be? Going one level down, it is I_{t+1} = 2 I_t + y_t. So I can write the node numbering in an iterative fashion. Take a simple case: suppose I am at node I_t = 3 and I assign label y_t = 0; then I_{t+1} = 2 x 3 + 0 = 6, which matches the tree. So this expresses I_{t+1} in terms of I_t, and I can do the same thing again, replacing I_t by 2 I_{t-1} + y_{t-1}. If I keep repeating this, you can verify that I can express I_{t+1} entirely in terms of the labels: I_{t+1} = 2^t + sum over j from 1 to t of y_j 2^(t-j). So finally I have I_{t+1} expressed in terms of all the labels seen up to time t. Now, my environment just uses this strategy: it starts at the root node, assigns a label, and based on whether the label is 0 or 1 it moves to the corresponding child, and it continues like this. Just to be concrete, let us verify this one more time on a path.
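The indexing above can be checked mechanically. Here is a small sketch (function names are my own) that walks the tree with the recursion I_{t+1} = 2 I_t + y_t starting from the root I_1 = 1, and compares against the unrolled closed form:

```python
from itertools import product

def node_index(labels):
    """Iterative node index: i_1 = 1 (root), i_{t+1} = 2*i_t + y_t."""
    i = 1
    for y in labels:
        i = 2 * i + y
    return i

def node_index_closed(labels):
    """Unrolled form: i_{t+1} = 2^t + sum_{j=1}^{t} y_j * 2^(t-j)."""
    t = len(labels)
    return 2 ** t + sum(y * 2 ** (t - 1 - j) for j, y in enumerate(labels))

# The worked example from the lecture: reach node 3 (y=1), then y=0 gives node 6.
assert node_index([1, 0]) == 6
# The verified path: labels (0, 1) lead to node 5.
assert node_index([0, 1]) == 5
# Both forms agree on every label sequence of length up to 4.
for t in range(5):
    for ys in product([0, 1], repeat=t):
        assert node_index(list(ys)) == node_index_closed(list(ys))
```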
So we get node 5 if we take the path with labels 0 and then 1: I_2 = 2 x 1 + 0 = 2 and I_3 = 2 x 2 + 1 = 5, which looks correct. Let me introduce one more definition and then we will break. How many of you have done a machine learning course properly? Raise your hand. How many of you already know the notion of VC dimension? For those who do not: in machine learning, especially in the supervised learning setting, the VC dimension tells you the complexity of your hypothesis class, how tough it is, and based on it we determine how many samples we need in order to guarantee a certain bound on the risk of your hypothesis. We are going to use a notion similar to VC dimension; we will come to that. But, as always, before defining VC dimension one first defines something called shattering. We will introduce the analogous notion here and discuss it in the next class. Those who do not know the notion of VC dimension, please read about it before you come to the next class: if you look into the learning theory of the supervised setting, you will definitely come across the terms shattering and VC dimension. As I said, we constructed this binary decision tree using the instances. So suppose you have a sequence of instances that forms a decision tree of depth d. We say this decision tree is shattered by a hypothesis class H if you can give any sequence of labels and it can be realized, in the following sense.
You have these points; give any arbitrary sequence of labels y_1 up to y_d. Then we should be able to find a hypothesis h in H such that every point along the induced path gets the label y_t according to the strategy we adopted: the t-th node is selected by the rule above, and the hypothesis assigns to that node exactly the label you wanted. So basically we are saying this decision tree is shattered if, for any label sequence you give me, I can come up with a hypothesis in H that reproduces that sequence along the corresponding path. Do you follow this? We will discuss this notion of shattering more; it will be easier for you to digest the concept if you look into what VC dimension is in supervised learning. We will revisit this in the next class; please read up on the notion of VC dimension.
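To make the shattering definition concrete, here is a sketch (all names hypothetical, not from the lecture) that checks whether a depth-d tree of instances is shattered by a finite hypothesis class: for every label sequence, some h in H must output exactly those labels on the nodes visited along the induced root-to-leaf path.

```python
from itertools import product

def is_shattered(tree, hypotheses, depth):
    """tree: dict mapping node index (1 = root) -> instance x.
    Shattered iff for every label sequence (y_1..y_depth) some h in H
    outputs y_t on the node visited at step t along the induced path."""
    for labels in product([0, 1], repeat=depth):
        # Walk the path determined by the labels: i_{t+1} = 2*i_t + y_t.
        path = []
        i = 1
        for y in labels:
            path.append(tree[i])
            i = 2 * i + y
        # Need some hypothesis realizing exactly these labels on this path.
        if not any(all(h(x) == y for x, y in zip(path, labels))
                   for h in hypotheses):
            return False
    return True

# Toy depth-1 example: one root instance x = 0; H = {constant-0, constant-1}.
tree = {1: 0}
H = [lambda x: 0, lambda x: 1]
assert is_shattered(tree, H, depth=1)
# A singleton class cannot shatter even a depth-1 tree.
assert not is_shattered(tree, [lambda x: 0], depth=1)
```

Note that only the nodes on the chosen path need to receive the requested labels; different label sequences may be realized by different hypotheses, which is exactly what the definition demands.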