So, until the last class we were discussing the adversarial setting, where we went from prediction with expert advice, where we had full information, to the bandit setting. In the bandit setting we came across different algorithms: Exp3, Exp3.P, and Exp3-IX. If you run any of these algorithms over n rounds, where pi is the policy, what is the regret you get? Ignoring the constants, the expected regret, or rather the pseudo-regret I am talking about (I should write a bar here), came out to be of order sqrt(n k log k). There were constants, like a factor of 2 for Exp3, and another constant for Exp3-IX; I am just ignoring those. In terms of the parameters, the number of rounds n and the number of actions k, this is how the bound looks. Of course this regret is sublinear, but then the question is whether these regret bounds are good, and how far they are from optimal. For that we need to know what is the best we can do with any algorithm. One can come up with a lower bound for this setup. So far pi was a specific policy, either Exp3 or Exp3.P or Exp3-IX; now one can show that for any policy pi, the adversary can choose a distribution and come up with a sequence such that, irrespective of what policy you use, you incur a pseudo-regret of order sqrt(n k). That is, no matter what algorithm you use, the adversary or the environment can come up with a scheme, basically a distribution, that forces this much regret. So how much is the gap between the upper and lower bounds? A factor of sqrt(log k). And k, the number of arms, is fixed.
What we are mostly interested in is how the regret varies with the number of rounds, because as we run more rounds we want to see how quickly we start doing well compared to the hindsight strategy. If you ignore the sqrt(log k) factor, these bounds are almost identical, up to constants. Because of that, the three strategies we saw are essentially optimal, up to this log k factor. We are not going to prove this lower bound now; we will revisit it a bit later, once I have also covered stochastic bandits, because the lower bound proof uses a stochastic argument. Right now we are only talking about the adversarial setting; after some time I will also cover stochastic bandits and then we will revisit this lower bound proof. So, in this sense, all three algorithms are optimal. What we will do now is stop our study of the adversarial bandit setting, and today we move on to something called online convex optimization. We will show that the algorithms we are going to discuss for online convex optimization already capture what we studied for the adversarial bandit setting. Also notice that when we started with the full information setting, that is, prediction with expert advice, what was the regret bound we had when we used the weighted majority algorithm? It was of order sqrt(2 n log d), where d plays the role of k: the number of experts is the number of actions. So in the full information setting, which we already discussed, weighted majority gives a regret of order sqrt(n log k), and even for that case one can show this is essentially the best one can get.
For the full information setting too, one can show a lower bound of order sqrt(n), again with some logarithmic term. I did not discuss the lower bound proof for the full information setting and we will not do it; we will only eventually do the bandit one. But the algorithms we have studied, weighted majority for the full information case and the Exp3-based algorithms for the bandit case, are optimal. Before I move on to the new topic, any questions about the adversarial setting we studied so far? Yes. We already discussed the estimators we had: the estimate was an indicator divided by p_{it}, where p_{it} was defined through the exponential weights distribution, and we also argued that the variance of this estimator is of order 1/p_{it}. So if p_{it} is very small, the variance can be very high. But once you add gamma, the variance becomes proportional to 1/(p_{it} + gamma), so the variance will not be bad even if p_{it} is very small; it is capped by the gamma term. That was Exp3-IX, but a similar logic also holds for Exp3.P; let me review that. In Exp3.P, we defined p_{it} to be (1 - gamma) times the exponential-weights term divided by the normalizing sum, plus gamma/k, and the estimator for Exp3.P was defined with this p_{it} in the denominator. If you look at this estimator, the probability in the denominator is always at least gamma/k, so its variance will not blow up even when the exponential-weights part is small, because p_{it} is guaranteed to be at least gamma/k. In Exp3-IX, p_{it} was just the exponential-weights distribution, without the gamma/k term; instead, we simply added gamma in the denominator of the estimator.
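As a concrete sketch of the two variance fixes just described, note that Exp3.P floors the sampling probability itself, while Exp3-IX floors only the estimator's denominator. The probability vector `p`, the value of gamma, and the observed loss below are all made-up toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
k, gamma = 5, 0.1

# A hypothetical exponential-weights distribution (any simplex point works).
p = np.array([0.70, 0.15, 0.05, 0.05, 0.05])

# Exp3.P: mix in explicit uniform exploration, so every probability >= gamma/k.
p_exp3p = (1 - gamma) * p + gamma / k

i = rng.choice(k, p=p_exp3p)       # arm actually played this round
loss = 0.8                         # its observed loss (made up)

est_p = np.zeros(k)
est_p[i] = loss / p_exp3p[i]       # denominator >= gamma/k by construction

# Exp3-IX: sample from p itself, but bias the estimator's denominator by gamma.
j = rng.choice(k, p=p)
est_ix = np.zeros(k)
est_ix[j] = loss / (p[j] + gamma)  # denominator >= gamma, exploration is implicit
```

In both cases the importance-weighted estimate can never exceed the observed loss divided by the gamma floor, which is exactly the variance cap discussed above.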
So, the effect in both cases is to make sure that the denominator in the variance term does not become arbitrarily small; it is capped by the gamma term. But then why is the Exp3-IX version called implicit exploration? Earlier you were forcing exploration by adding gamma/k to the sampling distribution, forcing in some uniform distribution. Now that term is not there; if you look a bit carefully, the same kind of exploration effect is already being achieved through this gamma, but without actually forcing uniform exploration. That is why it is called implicit exploration. Adding gamma in the sampling distribution or in the estimator denominator does essentially the same job of reducing your variance. Anything else about the setup we discussed, the algorithms, or the analysis? OK. Now, when we started with our first method, we began by assuming that my labels are generated by a hypothesis in my hypothesis class; we called that the realizable case. For the realizable case with a finite hypothesis class to learn from, what was the algorithm and the best bound we showed? At least one thing we did was discuss the halving algorithm, and we showed it gives a mistake bound of log_2 |H|. So we were able to argue that in the realizable case we can guarantee that much. But if we remove the realizability assumption, then we argued, via Cover's impossibility result, that the adversary can make you incur a really large regret.
Since that is too bad, we restricted the adversary's power by allowing the learner to randomize his predictions, and once the learner randomized, we were able to show that one can still come up with sublinear regret; that means we can do, on average, asymptotically as well as an oracle. So these two things, realizability and randomization, allowing the learner to randomize, helped us come up with good algorithms. As we go along, we will show that both of these, realizability and randomization, basically come from a convexification of the problem: in both cases we are really looking at convex loss functions. We will come to that, but before that let me define the online convex optimization problem we are going to set up, and then we will see how all of those things fit into this setup. So, this is the setup we are going to consider. In online convex optimization, we assume that the adversary chooses a convex function in each round. I do not know that function beforehand, but in each round the adversary chooses a convex function, and my goal is to choose a vector in each round such that the function evaluated at my chosen point is as small as possible. We will see what exactly my objective is, but this is how the interaction happens. In every round, the learner predicts a vector w_t from a set S, which is assumed to be a convex set, and after that the environment, or the adversary, reveals a convex function f_t. It is not a single value; it is a function defined on the entire convex set. Then, whatever vector you predicted, you incur the loss f_t(w_t), the function evaluated at w_t.
Now, the regret for this setting is defined as follows. Depending on how you choose your w_t, the algorithms will differ; let pi denote your policy. The total loss you incur is the sum over t of f_t(w_t): remember, w_t is what you predict in each round, f_t is generated by the environment, and f_t(w_t) is what you incur in round t. Now you want to compare this against the smallest loss you would have gotten by playing a single fixed point. What is that? Suppose you knew all the functions f_1, ..., f_n in hindsight; for the time being let us take the comparator set U to be S itself (we will discuss later how it could be different), since each f_t is defined from S to R. You want to find the u in S that, in hindsight, would have given you the smallest total loss. The difference is what we call regret, and your goal is to come up with an algorithm that minimizes this value. So what is the environment doing? The environment chooses the f_t's, you choose the w_t's in each round; the first sum is the total loss you incurred, and the second term is the best you could have done knowing the whole sequence of f_t's over n rounds. Now, does this setup capture our earlier setup? Remember, earlier we were discussing the online classification problem, where in each round a point x_t was revealed, you decided what label to predict, then the true label was revealed and you incurred a loss. Let us see whether that setup can be fit into this one. Any guess how you can do that? First, let us see what the convex set will be in this case.
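The interaction protocol and regret just described can be sketched as a small harness. The learner, the loss functions, and the comparator candidates below are all made-up toy choices, and the comparator set is taken to be finite just so the hindsight minimum can be computed by brute force:

```python
import numpy as np

def oco_regret(learner, fs, comparators):
    """Run the OCO protocol: in round t the learner sees only f_1..f_{t-1},
    plays w_t, then f_t is revealed and f_t(w_t) is incurred.  Regret is the
    total loss minus the best single point from `comparators` in hindsight."""
    total = 0.0
    for t, f in enumerate(fs):
        w_t = learner(fs[:t])                 # past functions only
        total += f(w_t)
    best = min(sum(f(u) for f in fs) for u in comparators)
    return total - best

# Toy run: linear losses f_t(w) = <w, v_t> on the 2-simplex, with the two
# corners e_1, e_2 as the hindsight comparators.
vs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0])]
fs = [lambda w, v=v: float(w @ v) for v in vs]
uniform = lambda past: np.array([0.5, 0.5])   # always plays (1/2, 1/2)
corners = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(oco_regret(uniform, fs, corners))       # 1.5 total vs best corner 1.0 -> 0.5
```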
There, every time we came up with a probability vector over the hypotheses and then took the weighted average of the hypotheses. So can I take my S to be the probability simplex: all w with w_i >= 0 for all i and sum_i w_i = 1? Note that I am now going a bit backwards: I am not recovering the bandit setting but the full information setting. Recall what we did in the full information setting when we applied the weighted majority algorithm: we maintained weights over all the hypotheses, played an expert drawn according to this probability vector, incurred a loss, and were interested in the expected loss. So I set S to be this simplex. In each round, the adversary selects a loss vector v_t, and if w_t was your weight vector, the expected loss you incurred was the inner product <w_t, v_t>. So can I treat this as f_t(w_t)? Here w_t is what you chose, and the function is parameterized by the loss vector the adversary chose. Is this a convex function in w_t? Yes; in fact, it is linear in w_t. Now let us also choose the comparator set. I will allow the reference point to come from a set U, which need not always be the same as the convex set S; it could be a subset of my convex set. For instance, I could choose U to be the set of all unit vectors. What do I mean by unit vectors? Say we are in d dimensions, so S is a convex set which is a subset of R^d, the d-dimensional Euclidean space.
Then U consists of the vectors (1, 0, ..., 0), (0, 1, 0, ..., 0), ..., (0, ..., 0, 1); these are the unit vectors e_1, e_2, ..., e_d in d-dimensional space. Is that clear? Now I look at f_t(u) over these unit vectors. Suppose I take u to be e_1: then f_t(u) = <u, v_t> retains only the first component of the loss vector, and in general, taking u = e_i retains the i-th component. Taking the minimum over u in U therefore picks out the smallest component of the cumulative loss: this portion is just min over i from 1 to d of the sum over t of v_{ti}. And we already argued that the first portion is the sum of <v_t, w_t>. So you see this is the same setup as prediction with expert advice; this is exactly the regret we defined there and tried to solve. What we did earlier was a convexification: we allowed the learner to randomize over his actions and then took the expected loss he is going to incur, and by doing so we made the loss a convex function of the weight vector, even though it was linear. What we are saying now is: let us not restrict ourselves to linear functions like this; let us try to solve the slightly more general case of convex functions.
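A quick numerical check of the claim just made, that minimizing the linear loss over the unit vectors e_i recovers the smallest component, i.e. the best single expert in hindsight. The loss vector here is a made-up example:

```python
import numpy as np

v = np.array([0.3, 0.7, 0.1, 0.9])     # cumulative loss vector (made up)
units = np.eye(len(v))                 # rows are the unit vectors e_1, ..., e_d

# f(e_i) = <e_i, v> retains only the i-th component, so the minimum over
# the unit vectors equals the minimum component of v.
best_unit = min(float(e @ v) for e in units)
print(best_unit, v.min())              # both are 0.1
```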
So as you can see, the randomization we allowed the learner basically convexified our problem. Now we want to solve a problem where I observe a convex function in each round. Earlier, that convex function was parameterized by v_t, the loss vector we observed every round: this v_t defined my f_t, and that is what the environment chose, which I am now calling the function f_t, while w_t is what you chose as the learner. Yes. Yes, it could be something else: I am just saying that what we studied earlier happens to be a special case of this setup by choosing U in this fashion. That is why I say U need not be the same as S; U is my benchmark. I may be interested in a benchmark in whichever set I like; I need not always go for the entire set S. In this case, if you set U to be the unit vectors, you recover the full information setting. Fine. So from now on we will simply focus on this setup and try to see how to solve it for different convex functions. The linear function was one example, but a convex function could be more than linear; we will consider the other cases. The setup is clear, the goal is clear; now, what is the algorithm, how are we going to approach this? Recall the consistent algorithm we had, and the halving algorithm.
In the consistent algorithm, we picked a hypothesis from the hypothesis class, checked which hypotheses were consistent with the observed labels, retained those, threw away everyone else, and whichever hypothesis we picked in that round, its prediction determined our loss. So in every round we tried to keep the hypotheses that were consistent with our observations and discard everybody else. In a way, we were trying to see which hypotheses were performing better at that time, and by the realizability assumption we knew the good one had to be in there. Drawing an analogy: there we tried to be consistent with our observations every time; here we can try to find the prediction that best fits the past observations. In round t, after you make a prediction, f_t is revealed, but before round t you have seen the functions of all the previous rounds. So in round t we know f_1, f_2, ..., f_{t-1}; these functions have been revealed to me. What I do not know is the function that will be selected in round t, yet I have to make a choice of w_t in round t. What is the best thing I can do? This is what we know so far, and I would like to minimize the total loss, but I do not have all n observations; at least I can try to be consistent with what I have seen so far. So one thing to do is to select w_t as follows. Henceforth, to avoid writing U and S every time, let us take U to be S. Then w_t is the minimizer over w in S of the sum from i = 1 to t-1 of f_i(w).
So what I am doing in round t is picking the point that is consistent, in the sense of minimizing the losses that have been observed so far, and this algorithm is called the follow the leader algorithm. I am just specifying how to choose w_t, and the way I choose it is via this minimization. Now the question is: fine, you are trying to be consistent with your observations, but why does this work well? We are not making any assumptions on the functions other than that they are all convex on this set. There need not be any relation between the rounds: just as in the earlier case, when we said v_t is the loss vector selected by the environment, we did not assume anything about how v_1, v_2, v_3 are related; they could be arbitrary. Here, likewise, f_1, f_2, ... could be arbitrary; what I am interested in is minimizing this objective. Next, let us discuss what kind of guarantee we get if we use this follow the leader algorithm. Any questions or doubts about this setup? Is there anything else we could do here, or is follow the leader the intuitive, most natural choice anyway? So let us see how to get a bound for it. First, in the first round you have to select w_1 before knowing any function, so you do something. You have to select w_1 without any knowledge. Earlier, when starting weighted majority, w_1 was the uniform distribution: I had no knowledge, so I put equal mass on all of them, started with that, observed the loss, and then in the second round updated my weights.
So you start with some w_1 because you have no information; you play it and incur some loss you have no control over. Then you observe f_1, and you choose w_2 = argmin over w of f_1(w) and play it. Then comes f_2, and w_3 = argmin over w of f_1(w) + f_2(w). Why should this w_3 be the same as w_2? It could potentially be different.
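The update rule just described can be sketched under toy assumptions: the decision set is discretized to a finite grid so the argmin can be taken by brute force, and the losses are simple quadratics f_t(w) = (w - z_t)^2; neither choice is required by the algorithm itself.

```python
import numpy as np

def follow_the_leader(fs, grid, w1):
    """In round t >= 2, play the grid point minimizing the sum of the
    losses f_1, ..., f_{t-1} seen so far; w1 is the arbitrary first play."""
    plays = [w1]
    for t in range(1, len(fs)):
        cumulative = [sum(f(w) for f in fs[:t]) for w in grid]
        plays.append(float(grid[int(np.argmin(cumulative))]))
    return plays

zs = [0.0, 1.0, 1.0]
fs = [lambda w, z=z: (w - z) ** 2 for z in zs]
grid = np.linspace(0.0, 1.0, 101)
print(follow_the_leader(fs, grid, w1=0.5))
# w_2 minimizes (w-0)^2 -> 0.0; w_3 minimizes (w-0)^2 + (w-1)^2 -> 0.5,
# so the plays are [0.5, 0.0, 0.5]: the leader changes from round to round.
```

Notice that w_3 differs from w_2 here, which is exactly the point made above: the minimizer of the cumulative loss can move as new functions arrive.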