So, in the last class we started discussing how to deal with the case where the label sequence is not generated by any hypothesis in our hypothesis class. We said we want to relax the realizability assumption, and we started defining the right performance criterion once realizability is gone. We introduced the notion of regret, which asks, roughly, how good you are in hindsight: how well your algorithm performs compared to one that knew all the information, that is, the best fixed hypothesis chosen after seeing the whole sequence. Just to reiterate, we defined our regret as

\[
\mathrm{Regret}(\mathcal{H}, n) \;=\; \sup_{y_1, \dots, y_n} \left[ \sum_{t=1}^{n} \mathbf{1}\{\hat{y}_t \neq y_t\} \;-\; \min_{h \in \mathcal{H}} \sum_{t=1}^{n} \mathbf{1}\{h(x_t) \neq y_t\} \right],
\]

and we said that our goal is to identify an algorithm which makes this regret sublinear. Equivalently, the hypothesis class is learnable if there is an algorithm for which the regret per round goes to 0 as \(n \to \infty\).

Now let us try to understand what kind of bound we can expect on this quantity. Is sublinear regret achievable at all; can I come up with an algorithm which makes my hypothesis class learnable? Now that I have removed the realizability assumption, the adversary is much more powerful: he can simply look at the label you predicted and give the opposite of it as the true label. If the adversary can see what you predicted before declaring the true label, he can set \(y_t = 1 - \hat{y}_t\), flipping whatever you predicted, and force you to make a mistake in every single round; your cumulative loss is then n.

We will now work out what this means for a simple case of two hypotheses. Take the hypothesis class to consist of just \(h_0\) and \(h_1\), where \(h_0\) always predicts 0 and \(h_1\) always predicts 1. Is this hypothesis class clear? It is a trivial one. Given any sequence of labels, since the minimum is over \(\mathcal{H} = \{h_0, h_1\}\), I only have to evaluate the comparator term for these two hypotheses. Consider a sequence \(y_1, \dots, y_n\): either fewer than half the labels are 0 and the rest are 1, or the other way around. By choosing the better of \(h_0\) and \(h_1\), depending on what the sequence is, can the comparator term be made at most n/2 mistakes? Let us check. Say you have been given the sequence, and you have to match it using one of \(h_0\) and \(h_1\).
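The lecture makes this flipping argument only verbally; the following minimal sketch (the names flipping_adversary and run, and the learner interface, are my own illustration, not anything from the lecture) shows that any deterministic learner whose prediction the adversary can see suffers exactly n mistakes.

```python
# Minimal sketch (hypothetical interface): an adversary that sees the
# learner's prediction before committing to the label forces a mistake
# in every round, so the learner's cumulative loss equals n.

def flipping_adversary(learner_prediction: int) -> int:
    """Declare the opposite of the learner's prediction as the true label."""
    return 1 - learner_prediction

def run(learner, n: int) -> int:
    """Play n rounds; `learner` maps the history of past labels to {0, 1}."""
    history, mistakes = [], 0
    for _ in range(n):
        y_hat = learner(history)          # deterministic prediction
        y = flipping_adversary(y_hat)     # adversary flips it
        mistakes += int(y_hat != y)       # always a mistake
        history.append(y)
    return mistakes

# Any deterministic learner suffers exactly n mistakes:
assert run(lambda hist: 0, 100) == 100
assert run(lambda hist: len(hist) % 2, 100) == 100
```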
Suppose fewer than half of the labels are 0 and the remaining ones are 1. By choosing the hypothesis \(h_1\), which always predicts 1, the number of mistakes is less than n/2: \(h_1\) errs only on the rounds whose label is 0, and the remaining labels are already 1, so there is no error there. Now look at the other possibility, where the 1s are in the minority and the rest are 0. In this case, choosing \(h_0\) again gives at most n/2 mistakes. So irrespective of what the sequence is, minimizing over \(h_0\) and \(h_1\) always gives at most n/2. (Yes, for specific sequences it can be much smaller: if the environment gives all \(y_t = 0\), the minimum is 0. But we are taking the worst case, and in the worst case, if I have to pick one of the two, the comparator term will not be more than n/2.)

Now put the two facts together. Against the flipping adversary, your cumulative loss is n, while the comparator term is at most n/2, so the regret is at least n - n/2 = n/2. Even for this simple case of two hypotheses, the regret is at least n/2: the bound on the comparator held for an arbitrary sequence, so in particular it holds for the sequence the adversary produces. (I have been writing t for the horizon in places; let me be consistent and use n.) Now, with a lower bound of n/2, can this hypothesis class ever be learnable? No: if I divide by n and let n go to infinity, this fraction never goes to 0; it is at least one half. So in this unrealizable case, with a fully powerful adversary, there is no way I can come up with an algorithm that makes even such a simple hypothesis class learnable. This is an impossibility result: if the adversary is this powerful, there is nothing you can do; you will never get sublinear regret.
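Written out, the calculation on the board amounts to this (a reconstruction in our notation): \(h_0\) errs exactly on the rounds with \(y_t = 1\) and \(h_1\) exactly on those with \(y_t = 0\), so

\[
\sum_{t=1}^{n} \mathbf{1}\{h_0(x_t) \neq y_t\} + \sum_{t=1}^{n} \mathbf{1}\{h_1(x_t) \neq y_t\} = n
\quad\Longrightarrow\quad
\min_{h \in \{h_0, h_1\}} \sum_{t=1}^{n} \mathbf{1}\{h(x_t) \neq y_t\} \;\le\; \frac{n}{2},
\]

while the flipping adversary makes the learner err in all n rounds; hence

\[
\mathrm{Regret} \;\ge\; n - \frac{n}{2} \;=\; \frac{n}{2},
\qquad
\frac{\mathrm{Regret}}{n} \;\ge\; \frac{1}{2} \;\not\to\; 0.
\]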
So what we will do is this: instead of competing against such a powerful adversary, we will restrict the adversary's power, still within the unrealizable case, while giving a bit more flexibility to the learner. We allow the learner to make probabilistic predictions: his strategy in a round need not be deterministic; even if he leans towards label 1, he may declare it only with some probability. Now, if I make only this change, does anything improve; does a probabilistic learner make the class learnable? Not really. When I wrote \(\hat{y}_t\) for the predictions made by the learner, I never said what strategy he used; he could already have been using a randomized strategy. The problem was that the adversary could see the final, realized prediction, and then he can force a mistake exactly as before.

That is why, in addition to allowing the learner to make probabilistic predictions, we will further require that the adversary commits to his label \(y_t\) before the learner makes his prediction \(\hat{y}_t\). Do you see this point? In round t, the adversary produces his label before he sees the prediction made by the learner in that round. If we impose this restriction, then such situations can be avoided; at the very least, the adversary cannot make you wrong in every round. And when I say this, it is only about that particular round: in round t, the adversary neither knows a priori the probability with which you are going to predict 1, nor sees the realized prediction before he chooses his label. But he may know all the predictions you made in the past; he has that whole history. Note also that we are not assuming the adversary generates labels according to any particular rule: he follows his own logic, whatever it is. The only restriction we are putting on the adversary is that he cannot see your current prediction before he declares the current label.

Now, how do things change with this? We are allowing the learner to confuse the adversary: because the prediction is randomized, even knowing your strategy the adversary cannot anticipate the value you will actually predict. And now that we have also restricted the adversary's power, maybe the learner can do better than in the bad case above, where there was no way to learn even the simplest class. With this assumption, \(\hat{y}_t\) in round t is no longer deterministic, so we are going to start dealing with the probability \(p_t = \mathbb{P}(\hat{y}_t = 1)\). This is the strategy of the learner: in each round he comes up with the probability with which he will declare the label to be 1. The environment, too, could be behaving completely randomly; if you somehow figure out the randomness it is using, say it declares label 1 with probability one half and 0 with probability one half, and you do the same, you are doing as well as it is. That is, if there is a hypothesis that matches the labels, you should be able to figure that out. The question is what happens when no such hypothesis need exist.
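To see concretely that the restricted adversary can no longer force a mistake in every round, here is a small simulation (a sketch of the protocol; simulate is a hypothetical helper, not from the lecture): the adversary must commit to \(y_t\) first, and a learner that plays \(p_t = 1/2\) errs on only about half the rounds, whatever labeling rule the adversary uses.

```python
import random

def simulate(adversary, n: int, seed: int = 0) -> float:
    """Adversary commits y_t first; learner then draws y_hat ~ Bernoulli(1/2).

    Returns the fraction of rounds on which the learner is wrong."""
    rng = random.Random(seed)
    history, mistakes = [], 0
    for t in range(n):
        y = adversary(t, history)        # chosen WITHOUT seeing y_hat or p_t
        y_hat = int(rng.random() < 0.5)  # p_t = 1/2 in every round
        mistakes += int(y_hat != y)
        history.append(y)
    return mistakes / n

# Whatever deterministic rule the adversary uses, the expected error
# rate of the uniformly random learner is 1/2, not 1:
print(simulate(lambda t, h: 1, 10_000))      # ~0.5
print(simulate(lambda t, h: t % 2, 10_000))  # ~0.5
```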
Hold on to that last question, whether such a hypothesis needs to exist at all. Right now the answer is not obvious, but notice that if, without making that assumption, we can come up with an algorithm for which sublinear regret holds, that by itself settles the matter. So if you cannot yet see an algorithm that guarantees this, fine; if we can eventually exhibit one, that should tell you why it is possible. That is exactly what we will do: we will show an algorithm for which sublinear regret indeed holds, and there we will see whether the condition that a matching hypothesis exists is necessary. It is not necessary at all.

So now the strategy of the learner is, in every round, to come up with the probability \(p_t\) with which he is going to declare the label to be 1. Now that the learner is randomizing his predictions, we have to redefine the regret in an expected sense, because the learner's loss is no longer deterministic: we take the expectation of the cumulative loss over the randomness of the learner, and this is what I will now call the expected regret. The comparator term does not change; there is nothing random there, because for a given sequence the labels are fixed and the hypothesis h is fixed. What changes, depending on your strategy, is the learner's term.

Now let us write this out. If I take the expectation of the learner's per-round loss, the indicator \(\mathbf{1}\{\hat{y}_t \neq y_t\}\), what will it be? The expectation of an indicator is a probability, so it is \(\mathbb{P}(\hat{y}_t \neq y_t)\). I claim this can also be written as \(\mathbb{E}|\hat{y}_t - y_t|\). Let us understand this; let me write it a bit more clearly:

\[
\mathbb{E}\,|\hat{y}_t - y_t| \;=\; \mathbb{E}\,\mathbf{1}\{\hat{y}_t \neq y_t\} \;=\; \mathbb{P}(\hat{y}_t \neq y_t).
\]

Why is this true? Let us begin with \(y_t = 0\). The right-hand quantity is simply \(p_t\): saying \(\hat{y}_t \neq 0\) is the same as saying \(\hat{y}_t = 1\), and \(\mathbb{P}(\hat{y}_t = 1)\) is exactly \(p_t\) by our definition; and the left-hand side is \(\mathbb{E}[\hat{y}_t] = p_t\) as well. Now take \(y_t = 1\). Because of the modulus, the left-hand side is \(\mathbb{E}|\hat{y}_t - 1| = 1 - p_t\), and what is \(1 - p_t\)? It is the probability that you predict 0, and that is exactly \(\mathbb{P}(\hat{y}_t \neq 1)\). So the relation holds in both cases, and in fact \(\mathbb{E}|\hat{y}_t - y_t| = |p_t - y_t|\). Because of this, if we simplify, we can write our expected regret purely in terms of the \(p_t\)'s.
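A quick numerical check of this identity (an illustrative sketch, not part of the lecture; expected_abs_loss is a hypothetical helper): draw \(\hat{y} \sim \mathrm{Bernoulli}(p)\) many times and compare the empirical mean of \(|\hat{y} - y|\) with \(|p - y|\).

```python
import random

def expected_abs_loss(p: float, y: int, trials: int = 200_000, seed: int = 1) -> float:
    """Monte Carlo estimate of E|y_hat - y| for y_hat ~ Bernoulli(p)."""
    rng = random.Random(seed)
    total = sum(abs(int(rng.random() < p) - y) for _ in range(trials))
    return total / trials

p = 0.3
print(expected_abs_loss(p, y=0), abs(p - 0))  # ~0.3 vs 0.3
print(expected_abs_loss(p, y=1), abs(p - 1))  # ~0.7 vs 0.7
```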
So what is the quantity inside now? Can I write the learner's term as \(\sum_{t=1}^{n} |p_t - y_t|\)? Yes; I have just rewritten it: the expectation has been removed, and we showed each term is exactly this quantity. Is that not what we just argued? The comparator term is deterministic, there is nothing to take an expectation over; the only random quantity was the learner's loss. So the expected regret against a fixed hypothesis h is

\[
\sum_{t=1}^{n} |p_t - y_t| \;-\; \sum_{t=1}^{n} |h(x_t) - y_t| .
\]

What is the difference between this setting and the earlier one? Earlier, the prediction \(\hat{y}_t\) was either 0 or 1; here, \(p_t\) can be anything between 0 and 1. In moving from that setting to this one, everything remains the same except that instead of requiring the prediction to be in \(\{0, 1\}\), we allow any value in \([0, 1]\), which we interpret as the probability of predicting label 1. Now that we have redefined the regret like this, I am again interested in whether there exists an algorithm making this quantity sublinear; whenever that happens, we say the regret is sublinear, or that the hypothesis class is learnable.

The next theorem makes this a bit more formal. What does it say? Take any hypothesis class \(\mathcal{H}\); we are not assuming the class is finite or infinite, you just take any hypothesis class. Then there exists an algorithm for online classification (notice that we are still in the classification regime; the true labels are 0/1) whose predictions come from \([0, 1]\); remember that instead of predictions in \(\{0, 1\}\), we have translated to predictions in \([0, 1]\), interpreted as the probability of giving label 1. The guarantee is that for every \(h \in \mathcal{H}\),

\[
\sum_{t=1}^{n} \mathbb{E}|\hat{y}_t - y_t| \;-\; \sum_{t=1}^{n} |h(x_t) - y_t| \;\le\; \sqrt{\min\{\log|\mathcal{H}|,\; \mathrm{Ldim}(\mathcal{H})\log(en)\}\cdot n}\,,
\]

up to a small constant factor. All of you can see this and read it. The left-hand side is exactly the quantity inside the square brackets earlier: take any hypothesis, use it as a benchmark, and compare the loss you incur by your policy against the loss incurred by that fixed hypothesis; the theorem says this difference is upper bounded by the right-hand side. And this h can be any member of the hypothesis class.

(On the earlier question: yes, as we already said, if the adversary can see your prediction he can force a total loss of n. The theorem is valid provided our conditions hold, that is, the learner randomizes and the adversary commits his label first; if any of these are violated, the theorem is no longer guaranteed.)

Now read what the bound is. One term is \(\mathrm{Ldim}(\mathcal{H})\), the Littlestone dimension of \(\mathcal{H}\), multiplied by \(\log(en)\); the other is \(\log|\mathcal{H}|\), the log of the cardinality of \(\mathcal{H}\). It is a bit complicated, but I think you can parse it: take the minimum of these two terms, multiply the whole thing by n, and take the square root. So this is the precise statement, but what we will be interested in is a slightly weaker version of it.
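The lecture does not spell out which algorithm achieves this bound. For the finite-class term, a standard choice is the exponential-weights (randomized Weighted-Majority) predictor; the sketch below is my own illustration under assumed names (ExponentialWeights, predict, update) and the usual \(\sqrt{\log|\mathcal{H}|/n}\) learning-rate tuning, not the lecture's construction.

```python
import math

class ExponentialWeights:
    """Prediction with expert advice over a finite hypothesis class.

    Keeps one weight per hypothesis.  In round t the prediction p_t is
    the weighted fraction of hypotheses voting for label 1, and after
    seeing y_t each hypothesis is downweighted exponentially in the
    loss it incurred this round.
    """

    def __init__(self, hypotheses, horizon):
        self.hypotheses = list(hypotheses)       # callables: x -> {0, 1}
        self.weights = [1.0] * len(self.hypotheses)
        # assumed tuning: eta ~ sqrt(log|H| / n)
        self.eta = math.sqrt(2.0 * math.log(len(self.hypotheses)) / horizon)

    def predict(self, x) -> float:
        """Return p_t = P(y_hat = 1), the weighted vote for label 1."""
        total = sum(self.weights)
        ones = sum(w for w, h in zip(self.weights, self.hypotheses) if h(x) == 1)
        return ones / total

    def update(self, x, y) -> None:
        """Multiply each weight by exp(-eta * |h(x) - y|)."""
        self.weights = [w * math.exp(-self.eta * abs(h(x) - y))
                        for w, h in zip(self.weights, self.hypotheses)]
```

For the two-hypothesis example from earlier, ExponentialWeights([lambda x: 0, lambda x: 1], horizon=n) shifts \(p_t\) toward whichever of \(h_0, h_1\) has made fewer mistakes so far, which is exactly the track-the-best-hypothesis behaviour the theorem formalizes.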
This statement holds irrespective of the cardinality of your hypothesis class \(\mathcal{H}\). It may happen that your hypothesis class has infinitely many hypotheses in it and yet its Littlestone dimension is finite; in that case \(\log|\mathcal{H}|\) is infinite, so inside the minimum it has no impact, and only the Littlestone-dimension term matters. In the other direction, if the hypothesis class is finite, we know its Littlestone dimension is also going to be finite. Why? Because \(\mathrm{Ldim}(\mathcal{H})\) is upper bounded by what? By \(\log_2|\mathcal{H}|\), the log of the cardinality of \(\mathcal{H}\) to the base 2. So if \(\mathcal{H}\) is finite, everything is fine; and if \(\mathcal{H}\) is infinite, the only term that can matter is the Littlestone-dimension one.

Since there are many terms here, suppose for the time being that my hypothesis class is finite. In that case, can I write the bound like this:

\[
\sum_{t=1}^{n} \mathbb{E}|\hat{y}_t - y_t| \;-\; \sum_{t=1}^{n} |h(x_t) - y_t| \;\le\; \sqrt{\log|\mathcal{H}|\cdot n}\;?
\]

This is a simplified version of the theorem for a finite class. Why is it valid? Because the original bound takes the minimum of two quantities, and if I drop one term and retain only \(\log|\mathcal{H}|\), what remains is still an upper bound; instead of the minimum of the two terms, I just took one of them. Now notice that this bound is independent of the label sequence. Is that clear? And it holds irrespective of which hypothesis you have chosen. Because of that, it is also a bound on the regret itself, the supremum over h and over sequences. So can I say there exists an algorithm such that the regret is sublinear? Yes. Why is that? Because up to a constant the bound is a square root of n; this regret grows like order \(\sqrt{n}\), which we denote \(O(\sqrt{n})\). All of you understand this \(O(\sqrt{n})\) notation. So the regret is \(O(\sqrt{n})\), and if you divide \(\sqrt{n}\) by n and let n go to infinity, the ratio goes to 0. So this theorem is saying that the hypothesis class \(\mathcal{H}\) is learnable as long as either it is finite or its Littlestone dimension is finite. If the Littlestone dimension is infinite, this upper bound is already vacuous and I cannot say anything; but as long as the Littlestone dimension is finite, we can always guarantee that there exists an algorithm which makes my hypothesis class learnable.
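For completeness, the fact \(\mathrm{Ldim}(\mathcal{H}) \le \log_2|\mathcal{H}|\) quoted above has a one-line justification (a reconstruction; the lecture only states the fact): a complete mistake tree of depth \(d\) shattered by \(\mathcal{H}\) has \(2^d\) root-to-leaf label sequences, each of which must be realized by a distinct hypothesis, so

\[
2^{\mathrm{Ldim}(\mathcal{H})} \;\le\; |\mathcal{H}|
\quad\Longleftrightarrow\quad
\mathrm{Ldim}(\mathcal{H}) \;\le\; \log_2 |\mathcal{H}|,
\]

and the learnability claim at the end is just the computation

\[
\frac{\sqrt{\log|\mathcal{H}|\cdot n}}{n} \;=\; \sqrt{\frac{\log|\mathcal{H}|}{n}} \;\longrightarrow\; 0
\quad \text{as } n \to \infty.
\]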