Now, we know that the minimum number of mistakes any algorithm can be forced to make is the L dimension of the hypothesis class. Can we come up with an algorithm which also ensures that it makes no more than this many mistakes, that is, no more than the L dimension of H? Is the question clear? We know that, in the worst case, L-dimension-many mistakes are unavoidable. Can I now aim for an algorithm which actually makes no more than that many mistakes? If I can do that, then that is the optimal algorithm: it only makes the mistakes that are unavoidable, and nothing more. How do we come up with such an algorithm? One does exist, and we will discuss one algorithm which indeed does that job, called the Standard Optimal Algorithm. Now, notice that when I computed the L dimension, I computed it for a simple case where there were only d points and d hypotheses, for which it was almost easy. But I avoided computing the L dimension for a class with a large number of hypotheses, because that takes a bit of working out; it is not as simple as the first case. When you finally work it out you will feel that it is not much, but the general situation is that computing the L dimension is not easy: for an arbitrary hypothesis class it may be quite computationally involved. Recall the definition: the L dimension is the maximal depth d such that there exists a binary tree of depth d which is shattered by the class. That is the definition, but how do you actually compute it? Even in the simple example, you had to ask: if I go further down and take a tree of a bit more depth, will it still be shattered or not? We have to exclude those cases. To pin down the L dimension we must show both that a tree of this depth can be shattered and that no deeper tree can.
So, all of these things can be pretty computationally involved, but in this course we do not care about the computational aspects. We just care that these quantities exist, and that our definitions are consistent and make sense. Computationally, even for a simple hypothesis class it might take several GPUs and 4 or 5 days, weeks, months, years; I do not care, as long as it is well defined and exists, we are fine. Now, here is the Standard Optimal Algorithm, which we will argue is optimal. You will see that this algorithm is almost the same as the Halving algorithm, except that here you compute the L dimension of the sub-class of hypotheses that predict label 0 and of the sub-class that predict label 1, and then make your prediction based on that. Let us understand this. At any given time you maintain a set of hypotheses, and at time t you partition it into two parts: the set of all hypotheses assigning label 0 to x_t, and the set assigning label 1. So you have two sub-classes, one saying 0 and one saying 1. In the Halving algorithm, we predicted the label corresponding to the part with the larger number of hypotheses; here, instead, we compute the L dimension of both parts and predict according to the one with the larger L dimension. In some rounds it may happen that the two parts end up with the same L dimension, in which case we break the tie arbitrarily.
If both parts have the same L dimension, you just say either 0 or 1: toss a coin, or decide at your own discretion. Then you receive y_t, and you retain only those hypotheses which made the correct prediction on the observed point x_t; the rest you throw away. One can show that this algorithm is optimal in the following sense, which I am just going to state: it is guaranteed to make no more than L-dimension-many mistakes. Since we already know that the number of mistakes is lower bounded by the L dimension, as a corollary to my lower bound I can say that the mistake bound of this particular algorithm is exactly equal to the L dimension. The proof is also straightforward, but I will skip that part. So we are completing the loop for the case when the realizability assumption holds. We first showed what kind of bounds we can get under the realizability assumption, then we asked what the best bound is we can hope for, then we established the minimum number of mistakes we must expect through a lower bound, and now we have shown that this lower bound is achievable by this algorithm. This is the best any algorithm can guarantee; it cannot guarantee anything better when the realizability assumption holds. That is why I have skipped the proof; we are just stating that if you choose y_t hat based on the part with the larger L dimension, you will be ensuring that you make no more than this many mistakes. Fine. But now: why should the realizability assumption hold at all?
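As a rough sketch of the procedure just described (this is not the lecturer's code): hypotheses are represented as Python callables, and `ldim` stands in for an L-dimension oracle. Since, as the lecture notes, computing the L dimension exactly is in general hard, the placeholder below uses the crude bound Ldim(H) ≤ log2|H| for a finite class, just so the sketch runs.

```python
import math

def ldim(H):
    # Hypothetical stand-in for an L-dimension oracle: uses the crude
    # upper bound Ldim(H) <= log2(|H|) for a finite class. The exact
    # computation is, as discussed in the lecture, generally expensive.
    return math.floor(math.log2(len(H))) if H else -1

def soa_predict(H, x):
    # Partition the current set of hypotheses by the label each one
    # assigns to x, then predict with the side of larger L dimension.
    H0 = [h for h in H if h(x) == 0]
    H1 = [h for h in H if h(x) == 1]
    return 1 if ldim(H1) >= ldim(H0) else 0  # tie broken as 1, arbitrarily

def soa_update(H, x, y):
    # After y_t is revealed, retain only the consistent hypotheses.
    return [h for h in H if h(x) == y]
```

For instance, with threshold hypotheses `h_i(x) = 1 if x >= i else 0` for i = 0..3, observing (x, y) = (2, 1) discards only the threshold i = 3; the predictions then proceed on the surviving sub-class.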
Now, what happens if there is no fixed hypothesis in my class according to which the labels are generated? What will my strategy be, and what will my evaluation criterion be? First, if there is no hypothesis in my hypothesis class according to which my true labels are generated, can you guarantee a finite mistake bound for any algorithm? Whatever hypothesis your algorithm chooses, it is not necessary that the labels come from it; they could be coming from some hypothesis outside your hypothesis class. Because of that, there is no way to guarantee that you will stop making mistakes after some point. Still, you may want to guarantee that you minimize the number of mistakes you make. So now we have to see what kind of evaluation criterion to set up, and how we can do best with respect to it. We are now looking at the unrealizable case. By unrealizable I simply mean that the hypothesis that generates the labels need not be in my hypothesis class H; that is, the h* which generates my labels need not belong to H. The first question, then, is how to set up the evaluation criterion: how am I going to evaluate how well I am doing? While learning, all I can use are hypotheses from this class H, but the true labels could be generated by a hypothesis outside it; against what, then, do I measure my performance? Earlier, when h* was included in my hypothesis class H, my aim was to quickly identify it and always play it; if you can identify it quickly and always play it, you make no further mistakes. But now that h* is not in my hypothesis class.
What I would like to do now is ask: which single hypothesis in my class, had I kept playing it throughout, would have given me the smallest number of mistakes? I do not know a priori which one that is, but I want to do almost as well as it. So suppose I tell you: in round 1 the instance was x_1 with corresponding label y_1, then x_2 with y_2, and so on, and these labels are generated by some h* which does not belong to my class H. All you can use are hypotheses from H; which one among them would you have liked to apply on this sequence, after observing all of it? The one with the minimum number of mistakes. Right. You take one hypothesis from H, apply it to x_1, x_2, and so on, and count at how many points it made a mistake; that is the number of mistakes for that hypothesis. You can do this for every hypothesis and find the one which made the smallest number of mistakes. Ideally, you would have liked to start playing that one right from the beginning. But notice that here you have the advantage of hindsight: you have observed everything, and only at that point are you asking what the best thing to do would have been. I am going to take exactly that as my evaluation criterion: the hypothesis in my class which does best in hindsight on the observed sequence is my benchmark. A priori, of course, I do not know it, because I have not seen the sequence; I am in an online setting, not a batch setting. The question is: if I start making predictions, how many more mistakes am I going to make compared to the hypothesis that is best in hindsight? We are going to express that as follows. First, fix a hypothesis h. What is the first part here? y_t hat is what you predicted, and y_t is what was revealed.
So |y_t hat − y_t| is the mistake indicator in round t; we are in a binary setting. Summed over the n rounds, this is the total number of mistakes you made. Now, suppose you had applied the same hypothesis h in every round; summing |h(x_t) − y_t| gives the number of mistakes h would have made. So this difference compares you against the fixed hypothesis h. Next, I take the worst case, because I do not want to depend on a particular sequence: I do not care which particular sequence I happen to observe, I want to give this guarantee for all possible sequences. That is why I take a supremum over sequences. What does this mean? For every possible sequence, playing a single hypothesis h throughout my interaction with the environment gives some number of mistakes, and applying my own algorithm, whose predictions are the y_t hat, gives some other number: the latter is what you incur by your algorithm, the former by playing a fixed hypothesis. Which h, the one giving the minimum number of mistakes? No, I am not saying that yet; it is just one fixed hypothesis. Belonging to H? Belonging to H, yes. Let me call the algorithm A; it predicts y_t hat in round t, and over n rounds this sum is what it incurs. Think of it like this: suppose you face an environment and observe a sequence; you count the mistakes you incurred, you count the mistakes hypothesis h incurred, and you take the difference between the two. I then look at the maximum of this difference over sequences. Is your prediction also coming from some hypothesis? No, the y_t hat comes from your algorithm.
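The hindsight computation being described can be sketched as follows. The names `mistakes` and `best_in_hindsight` are illustrative, not from the lecture; hypotheses are again taken to be callables and the observed sequence is a list of (x_t, y_t) pairs with binary labels.

```python
def mistakes(h, sequence):
    # Total mistakes sum_t |h(x_t) - y_t| of a fixed hypothesis h
    # on the whole observed sequence, for binary labels.
    return sum(abs(h(x) - y) for x, y in sequence)

def best_in_hindsight(H, sequence):
    # The hypothesis in H with the fewest mistakes on the observed
    # sequence -- only computable after the fact, i.e., in hindsight.
    return min(H, key=lambda h: mistakes(h, sequence))
```

For example, on the sequence (1, 1), (2, 0), (3, 1), (4, 0), the parity hypothesis `lambda x: x % 2` makes 0 mistakes and so is the best in hindsight among any class containing it, even though an online learner could not have known this in round 1.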
Your algorithm may well use one of the hypotheses in each round; in the SOA the prediction y_t hat comes out one way, in the Halving algorithm a different way. Whatever it is, I am just writing the prediction as y_t hat, and I then look at the worst case over sequences. Why does this quantity grow with n? Because I am looking at the number of rounds: I said the interaction lasts for n rounds, and the sequence length here is also n, that is, x_1, y_1 all the way up to x_n, y_n, so I am looking at n interactions. If you increase n, the summation can also become larger. Now, I am going to call this quantity the regret for not using hypothesis h all the time: the number of mistakes you incurred, minus what you would have gotten had you played h throughout. So you are asking how well, or how badly, you did compared to the case where you had played h throughout; the benchmark here is playing the same hypothesis throughout. But, as you said, this h need not be the best one, and I do not know which hypothesis in my class does the best job. I want to compare my performance against that best hypothesis. How do I take that into account? I define my regret as the supremum of this quantity over h. If h is a good hypothesis, the subtracted term is small, because a good h makes fewer mistakes, and so the regret against it is larger; that is why I take the supremum over all h. Alternatively, before writing that step, let me do the following: instead of a particular h, I look for the h which minimizes this mistake count.
In effect, I am looking for my best hypothesis, the one which gives me the smallest number of mistakes. But note that if I write it this way, this h is a function of the sequence: it is the best hypothesis for that particular sequence, and I did not want to define it in that fashion. The minimum carries a minus sign; if I pull the minus sign through, the minimum over h becomes a supremum over h of the difference, and I can take that supremum over h together with the supremum over sequences. That is why we define it as the supremum, over sequences and over h in H, of the difference; this way the definition does not depend on any particular h. Fine, you can ponder over this later, but are you convinced that the benchmark we are setting is the single best hypothesis I would have liked to use, against which we compare what my algorithm incurs? I am just comparing whatever algorithm I use against the hypothesis which gives me the smallest number of mistakes; I do not know a priori what it is, but I want to perform as close to it as possible. This definition captures exactly that, and since we have taken the supremum over h, the resulting quantity is now only a function of n. Suppose my realizability condition holds; what would the benchmark term be? It would be 0, because there exists an h* which always makes the correct prediction. But now I have removed that condition, so this term need not be 0: even the best hypothesis I have in my class will possibly make mistakes. That is fine; that is the best I have access to, and I compare my performance against it. This is what we are going to call regret in this setup. We will keep defining different notions of regret throughout the course, but in this case this is how we define it. Did I make any assumption here, say that the class is finite?
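Putting the pieces of this passage together, and hedging that this is a reconstruction in the lecture's notation rather than the board formula itself, the regret of an algorithm A over n rounds reads:

```latex
R_A(\mathcal{H}, n)
  = \sup_{(x_1, y_1), \dots, (x_n, y_n)}
    \left(
      \sum_{t=1}^{n} \lvert \hat{y}_t - y_t \rvert
      \;-\;
      \min_{h \in \mathcal{H}} \sum_{t=1}^{n} \lvert h(x_t) - y_t \rvert
    \right),
```

where $\hat{y}_t$ is the algorithm's prediction in round $t$. Pulling the minus sign through the minimum turns $-\min_{h}$ into $\sup_{h}(-\,\cdot\,)$, which is the equivalent sup-over-$h$ form discussed above.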
In defining this I did not make any such assumption. Even if the hypothesis class is not finite, nothing breaks down in my definitions; everything still makes sense as long as it is consistent and does not violate any of our consistency checks, and here nothing is violated. Fine: this is what you have defined your regret to be, and this is your performance criterion; this is how you are going to evaluate yourself. Now, what do you want this regret to be, small or large? Small. How small, zero? Can you guarantee zero? The only way it can be zero is if I am doing just as well as the best hypothesis I have. Now we come to the notion of learnability here. Remember how we defined learnability in the realizable case: as long as the number of mistakes could be bounded by some finite number, we said the hypothesis class is learnable by algorithm A. What is the notion of learnability here? Actually, I should also write the regret as a function of the hypothesis class, since it is this class you are learning; it is important to mention which hypothesis class we are learning, so write it as R_A(H, n). Now, we say that the hypothesis class H is learnable if my regret is sublinear. And what is the definition of sublinear? If you divide the regret R_A(H, n) by n and let n go to infinity, the ratio goes to 0. What is that ratio telling us? R_A was the regret over n rounds; when you divide it by n, that is the regret per round. If you are learning, that means you are figuring out which is the best hypothesis in your hypothesis class; you should eventually make fewer and fewer mistakes, and you should start doing as well as the best hypothesis in your class would have done.
If you do this, you should be incurring roughly the same number of mistakes as the best hypothesis in your class. Because of that, for some initial stretch the regret could be some finite number, but if you look at the difference accumulated beyond that point, it should be small. Say the horizon is n; let us split it into two parts, n = n_1 + n_2, the first n_1 rounds and the subsequent n_2 rounds. Maybe in the initial n_1 rounds you have no clue what is happening and you make some mistakes, but you are figuring things out. Later, in the next n_2 rounds, you have largely figured out what is happening, and you expect yourself to make fewer mistakes; after still further rounds, fewer and fewer. Because of that, if you look at the summation, the first n_1 terms contribute some fixed number, but in the next n_2 rounds you expect the difference to keep shrinking as n grows. So if your algorithm is smart and is really figuring out what is happening, then as n increases it will not incur much additional penalty in the later rounds; it will be doing as well as the best hypothesis in your class would have done. Because of that, if this condition holds, then on average, when n is large, you are doing as well as the best hypothesis in your hypothesis class. Do you see that letting R_A(H, n)/n go to 0 as n goes to infinity expresses exactly this? That is why, when this condition holds, we call the hypothesis class learnable: if you give me a sufficient number of rounds, I am figuring out the best hypothesis and incurring fewer and fewer mistakes.
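To see numerically what the sublinearity condition buys, here is a small sketch. It assumes, purely hypothetically, one algorithm whose regret grows like sqrt(n) (sublinear, hence learnable) and another whose regret grows linearly (not learnable); neither corresponds to a specific algorithm from the lecture.

```python
import math

def per_round(regret_fn, n):
    # Average regret per round, R(n)/n: the quantity whose limit
    # (as n goes to infinity) decides learnability.
    return regret_fn(n) / n

sqrt_regret = lambda n: math.sqrt(n)   # hypothetical sublinear regret
linear_regret = lambda n: n / 10       # hypothetical linear regret

for n in (100, 10_000, 1_000_000):
    # The sqrt per-round regret shrinks toward 0 as n grows;
    # the linear per-round regret stays stuck at 0.1.
    print(n, per_round(sqrt_regret, n), per_round(linear_regret, n))
```

The sqrt case prints per-round values 0.1, 0.01, 0.001, heading to 0; the linear case prints 0.1 every time, so dividing by n never helps, matching the definition above.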
So, as I said, you will be seeing different notions of regret as we go along in this course. Our next step is to see what kind of bounds and guarantees we can give on this regret, and, as we did in the realizable case, we will always ask whether that is the best guarantee possible. So note: if I can come up with an algorithm A such that this condition holds, then the hypothesis class is called learnable. Fine.