So, in the last class, we briefly touched upon the supervised learning setup. We described the setting we consider in supervised learning: what the sample space is, how the samples are generated, and how the labels are assigned to them. We set our metric of evaluation to be the test error, which we also called the risk, and said we are interested in a hypothesis that minimizes it. Then we looked at how to come up with a good hypothesis achieving small test error, and we did it through the empirical risk minimizer. That was all in the batch setting: given a bunch of data points, how do I come up with a hypothesis that performs well on a test point I have not seen, but which is generated by the same process on which I trained myself? Now we will move on and consider a scenario where I do not have the luxury of collecting all these data points a priori to train on. Instead, the data points come one at a time, and once I see a data point I have to make a decision. For the time being we will restrict ourselves to binary classification: when a data point comes, I have to say whether its label is, say, 0 or 1. My goal is to very quickly reach the point of making few errors, and eventually no errors at all, and I want to get there as fast as possible. In this setup there is no separation between training and testing: you get a point, you have to decide, and then you get to see whether you made a good decision or a bad one. That is what we call feedback.
Then in the next round you get another point, you apply some hypothesis, and again you get to see whether you made a good choice or a bad one; the process continues. So we introduce this general notion of an online learning algorithm. As usual, samples are drawn from some sample space, we have a hypothesis class H, and the labels come from the label set. The environment draws a sample and also associates a label with that sample, but the label is not revealed to you. Since my data points come one at a time, I am now talking about rounds: round 1, round 2, and so on, indexed by time t = 1, 2, .... In every round a data point is drawn, and when I say data point it consists of both x_t and y_t. What is y_t? It is the label associated with x_t. At the beginning of the round you get to see x_t; you choose a hypothesis h from the hypothesis class, apply it to x_t, and make the prediction h(x_t). After you do this, you get to see y_t. Now you can compare whether h(x_t) is the same as y_t: if they are the same, you know you made a correct decision; otherwise not. This repeats, and based on the feedback you observe, you choose your hypothesis for the next round. How exactly you choose your next action is the part that is specific to your algorithm; based on that, you can come up with different algorithms. So in every round the true label y_t is revealed only after you predict, and then you may suffer a loss. Let us state this once more.
This runs in rounds t = 1, 2, ..., up to some fixed number of rounds, which I will denote by n. In each round the environment selects a pair (x_t, y_t). How the environment assigns a label to x_t, and how it draws x_t, we do not know; that is entirely up to the environment. In round t, as a learner, my action is to choose a hypothesis h_t and give the prediction y_t hat = h_t(x_t). After I do this, I get to know y_t, and if my prediction y_t hat is not the same as y_t, I incur a penalty of 1 unit; if they are the same, I incur no loss. I have no control over how the environment selects the pair (x_t, y_t); what is in my control is how I choose a hypothesis in each round. Depending on the logic by which you choose your h in each round, the loss you incur will differ. Any algorithm in this format, where this interaction between the learner and the environment takes place, I am going to call an online learning algorithm and denote it simply by A. Now, if we have an online algorithm, what are the desirable properties we expect from it? First some notation: the total loss, or number of mistakes, of algorithm A over n rounds is the number of rounds in which A's prediction differs from the true label. What do you expect? What are the desirable properties of a good algorithm? You want the number of mistakes made by the algorithm to be as small as possible. Now suppose you know nothing about the environment: as a learner, if you start choosing your hypotheses in this way, is it possible to make this total loss 0?
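The interaction protocol just described can be sketched in code. This is a minimal, hypothetical Python sketch, not the lecture's notation: the function names, the `stream` of (x_t, y_t) pairs, and the learner's choice rule `choose_hypothesis` are all illustrative placeholders.

```python
def run_online_learner(choose_hypothesis, stream, n):
    """Run the learner for n rounds; return the total number of mistakes."""
    mistakes = 0
    history = []                        # (x_s, y_s) pairs revealed in past rounds
    for t in range(n):
        x_t, y_t = next(stream)         # environment draws the pair; y_t stays hidden
        h = choose_hypothesis(history)  # learner picks h_t using only the past
        y_hat = h(x_t)                  # prediction for round t
        if y_hat != y_t:                # true label revealed as feedback
            mistakes += 1               # 0-1 loss: one unit per wrong prediction
        history.append((x_t, y_t))     # full history is available in round t+1
    return mistakes
```

Note the ordering inside the loop: the learner commits to h_t before y_t is revealed, which is exactly what separates this from the batch setting.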
Can you think of an algorithm that makes this total loss 0? Say I forgive you for the first round and do not count a mistake there: can you guarantee you get the second one correct? Right now I do not know whether I will get even the second one correct, or the third, or from which point onward I will start getting them correct. All I desire is the algorithm that gives me the smallest total number of mistakes. So we are going to see what bounds our algorithms can guarantee on this quantity. Obviously, one trivial bound is n: the number of mistakes cannot exceed the number of rounds, but an algorithm that only achieves that is useless. If I can give a bound that is strictly less than n, maybe that is not so bad, and I will not immediately reject it. So first we will look at algorithms and the bounds they give, and then ask what is the best we can do. By a bound I mean a quantity that the number of mistakes is guaranteed never to exceed, whatever sequence the environment produces; the mistakes may well be smaller than the bound, and I will ask whether I can achieve the smallest such bound. Before we get into that: any confusion or doubts about this setup? (A student asks what the learner gets to see.) Let me write it clearly: the environment selects (x_t, y_t) and x_t is revealed to the learner; the learner sees x_t and based on that chooses a hypothesis. In round t the learner has x_1, ..., x_t and also y_1, ..., y_{t-1}; he only does not know y_t. All the past history is available to him. (And yes, we denote the horizon, the number of rounds, by n.)
(A student asks about choosing the best hypothesis after seeing the whole sequence.) If you do that, you are making your decision in hindsight: after seeing everything, you say "I made all the wrong decisions, I should have done something else." But I am asking you to guarantee something up to round n as you go; if you first get to see everything that happened over n rounds and then decide, that is the same as the batch setting. That is exactly the difference between the batch setting and the online setting: I do not have the luxury of seeing all the points up to round n before making my decisions. Now, to start thinking about what a good algorithm is, let us follow the approach we took in the supervised setting. There, to guarantee anything about my risk or test error, I made an assumption; what did we call it in the last class? The realizability assumption. Let us make such an assumption here too. First, suppose I make no assumption at all and place no restriction on the way the samples and labels are generated. Say I do not specify how y_t is generated: the environment generates x_t, gives it to you, and you make a prediction. If the environment is an adversary, then once you make this prediction it can declare the true label to be exactly the opposite, and you will always make a mistake. If the adversary generating the labels is that powerful, there is no way you can do a good job. So that is why we assume the label y_t in each round is not generated arbitrarily: there is a fixed rule that governs the label-generation process. We do not know this rule, though.
If the rule had been told to us, we would know the best hypothesis; we just do not know it, and we also do not know how the x_t's are generated. All we assume is that the labels are assigned to the samples in some fixed fashion. By putting in such a constraint we are restricting the power of the environment: it can no longer do what I described earlier, declaring the negation of your prediction to be the true label, because that negation may not be consistent with the fixed rule. So the environment cannot play such an adversarial role in this case. Now suppose such a realizability condition holds: can you think of a good algorithm for this case? What would your hypothesis-selection strategy be? In terms of terminology, I may also sometimes call these hypotheses actions: we have a hypothesis class, and choosing one hypothesis in each round can be thought of as picking one action in each round. (Yes, let us also assume the hypothesis class is finite.) Realizability here means that the true label-generation process is given by some h* in my class itself. By this I am ensuring that there is at least one hypothesis in my hypothesis class which, had I applied it in every round, would incur zero loss; it is just that I do not know which one it is, and I need to identify it. So, putting the question alternatively: how quickly can I identify that hypothesis? Once I identify it, I am no longer going to make any mistakes. To begin, let us start looking at some algorithms. One simple thing I can do: in some round I pick some arbitrary hypothesis, I make a prediction, and at the end I get the true label.
Now, does it make sense, once I get this label, to keep only those hypotheses that output the label y_t on that x_t, and throw away everybody else? Why? Because if I throw away everybody else, h* is still among the remaining ones, so I am not losing it. I can keep throwing out the hypotheses that make mistakes on each revealed y_t, and thereby narrow down the remaining ones. Let us see this algorithm; I am going to call it the Consistent algorithm. As I said, once I make the realizability assumption that h* belongs to H, the problem boils down to identifying the right hypothesis in my hypothesis class H. Initially, only the class is given to me: my input is the hypothesis class, and my objective is to identify the good hypothesis in it. What I will do is keep updating my set of candidate hypotheses. Initially I take my entire hypothesis class as the current set. Then, after I receive x_t in round t, I choose one hypothesis arbitrarily from the current set. Notice that I am maintaining a set V_t which gets updated in every round; V_t is the set of hypotheses remaining. We eliminate hypotheses, and whatever survives stays in this set. In round t I choose one hypothesis from V_t and predict the label given by that hypothesis. After I do this, I receive the true label, which is generated according to h*. Then I retain only those hypotheses in V_t that are consistent with the true label; all the other hypotheses are thrown out. Can V_{t+1} be bigger than V_t? No, it cannot, because we are only selecting hypotheses from V_t.
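The Consistent algorithm just described can be written as a short Python sketch. The representation is illustrative (hypotheses as Python functions, the stream as a list of (x_t, y_t) pairs); the algorithm itself is the one from the lecture.

```python
def consistent_algorithm(hypotheses, stream):
    """Run Consistent on a stream of (x_t, y_t) pairs; return the mistake count."""
    V = list(hypotheses)      # V_1 = H: the current set of surviving hypotheses
    mistakes = 0
    for x_t, y_t in stream:
        h_t = V[0]            # choose any hypothesis from V_t (arbitrary choice)
        y_hat = h_t(x_t)      # predict with it
        if y_hat != y_t:      # true label y_t is revealed after predicting
            mistakes += 1
        # retain only the hypotheses consistent with the revealed label
        V = [h for h in V if h(x_t) == y_t]
    return mistakes
```

Under realizability, h* is never removed by the update, so V never becomes empty, and every mistake removes at least the hypothesis that made it.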
Now the question is: how many mistakes can this algorithm make? Can we say anything about it? The intuition says that there are |H| hypotheses, one of them is good, and the remaining ones get eliminated, so I make at most |H| - 1 mistakes. Can we write that formally? Let us say that up to some round t, m mistakes have been made; I do not know which round t that is. From the update logic we know that whenever a mistake is made, at least the hypothesis that made the mistake will definitely get eliminated. So whenever a mistake happens, |V_{t+1}| is smaller than |V_t| by at least 1. Therefore, if m mistakes have happened up to round t, I can upper bound |V_{t+1}| by |H| - m, because at least m hypotheses have been thrown out of H. Notice that the bounds I am writing here are independent of the sequence I am going to see; the sequence has nothing to do with the way I am bounding these errors. Can I also say something about a lower bound? In every round |V_t| is at least 1, because the true hypothesis h* belongs to my class H and will never be eliminated, since it never makes a mistake. So at the final round, if the maximum number m of mistakes has been made, we get 1 <= |V_{n+1}| <= |H| - m, which gives the bound.
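The counting argument above can be written compactly; here is a sketch in the lecture's notation, with M the total number of mistakes over the n rounds.

```latex
% Each mistake removes at least one hypothesis from the surviving set,
% while the true hypothesis h^* always survives, so:
1 \;\le\; |V_{n+1}| \;\le\; |H| - M
\quad\Longrightarrow\quad
M \;\le\; |H| - 1
```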
So from this we know that the maximum number of mistakes is at most |H| - 1, which is what we said. Now the question is: can we do better than this? (A student suggests: look at what the majority of the remaining hypotheses predict and output that.) So how are you maintaining the hypotheses that remain in each round? They are the ones that have made correct decisions so far. But note that to eliminate hypotheses you need to know y_t, which you do not know beforehand; the suggestion only changes the step "choose any h": instead, choose the label that agrees with the maximum number of remaining hypotheses. That means you use whatever remains from the previous round, see which hypotheses say 0 and which say 1 on x_t, and if the majority says 1, you predict 1. Why is that better? Because if a mistake happens, more than half of the remaining hypotheses were wrong, and they all get eliminated before the next round. Is this natural? At least if you want to improve on the previous algorithm, this is the kind of thing you would like to do. Let us first write the algorithm and then see why it is better, if at all it is better. We are going to call it the Halving algorithm, because in each round, according to this logic, if you make a mistake you throw away more than half of the hypotheses. Let me write it. The algorithm is the same as the Consistent algorithm except that it differs in the way I choose the label. The input is the same and the initialization is the same: you initialize with V_1 = H. Wherever I have written a bar, that step is the same as the corresponding step in the Consistent algorithm.
Now, in each round, after you receive x_t, here is what you do; can you all parse the statement I have written? I am taking an argmax over a variable r which takes two values, 0 and 1: 0 for the label 0 and 1 for the label 1. Maybe I should write it in set notation: for each value of r, I look at the set of all hypotheses in V_t that give label r on x_t, and I count its cardinality. So first take r = 0: count how many hypotheses in V_t label this x_t as 0. Then take r = 1 and do the same thing: count how many hypotheses give label 1 on that point x_t. So there are two sets: one is all hypotheses giving label 0, and the other is all hypotheses giving label 1 on the same point x_t. Now I take the argmax: I look at whichever set has the larger cardinality, whatever label the maximum number of hypotheses are saying, and I take that as my prediction y_t hat. If the set that said 0 is larger in number, then I take 0 as my prediction. Then I receive the true label y_t and do the same update step as before. (A question: if we predicted y_t hat = 0, do all the hypotheses that said 1 get rejected?) We do not know yet whether they will be rejected; it depends on y_t. Say, for the time being, that in round t it so happened that the hypotheses that said label 0 were larger in number, they were in the majority, so you gave 0 as your prediction. Now suppose you receive 0 as the true label: then those hypotheses remain, and the ones that said 1 get kicked out.
But suppose y_t happened to be 1: then all those hypotheses that said 0, which were in the majority, get kicked out. So it depends: you make the elimination only after seeing the true label y_t. (A student: so even if the 0s are more numerous and we receive y_t = 1, the larger set gets kicked off?) Yes, and it makes sense, because by kicking them out I am not losing h*; my focus is on keeping the true hypothesis. Now, with this, can you see that whenever you make a mistake, you are ensuring that more than half of your hypotheses get eliminated? In the Consistent algorithm you could only ensure that one hypothesis got eliminated, whereas here at least half of the bad ones get eliminated. Because of this, suppose a mistake happens in round t; let me take some space here. What is the relation between |V_{t+1}| and |V_t|? We can say |V_{t+1}| <= |V_t| / 2. And if I keep iterating this over the rounds, what kind of bound do I get? Over the n rounds, whenever a mistake happens the set gets halved, so if m mistakes have happened I eventually get 1 <= |V_{n+1}| <= |H| / 2^m. Is this correct? We have given this example many times: say at some round t you are left with 20 hypotheses. An instance x_t comes and you notice that 15 of them say 1 and 5 of them say 0. What will your prediction be? It will be 1. Now suppose after you predicted 1 you see that the true label is 0: then you kick out all 15 that said 1. (A student: it can also happen that the true label was 1, so the majority was right and the other 5 would have been kicked out.)
Yes; but in that case you did not make a mistake, and that is why the halving relation holds only when a mistake happens. To be clear: if a mistake happens in round t, the relation |V_{t+1}| <= |V_t| / 2 holds; if no mistake happened in round t, this relation need not hold. That is why I am saying: if the maximum number of mistakes is m, then 1 <= |H| / 2^m is the bound we get. Inverting this, 2^m <= |H|, so the maximum number of mistakes is at most log_2 |H|. So you see that if I go for the Halving algorithm, I get a significantly better bound: compared to |H| - 1 for the Consistent algorithm, I get log_2 |H|, which means this bound is exponentially better.
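To get a feel for how far apart the two guarantees are, here is an illustrative comparison for a few class sizes (the sizes themselves are arbitrary):

```python
import math

# Mistake-bound guarantees from the lecture:
#   Consistent: at most |H| - 1 mistakes
#   Halving:    at most log2 |H| mistakes
for size in (8, 1024, 2**20):
    consistent_bound = size - 1
    halving_bound = math.log2(size)
    print(f"|H| = {size:>8}: Consistent <= {consistent_bound}, Halving <= {halving_bound:.0f}")
```

For |H| = 2^20 the Consistent bound is over a million mistakes, while Halving guarantees at most 20.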