So, by the way, what was $R_s$ here in my stochastic case? The model was $R_s = A_s^\top \theta^* + \epsilon_s$: whatever action $A_s$ you chose, you observed a noisy reward in every round. The noise we assumed to be sub-Gaussian, but what remained fixed throughout was this $\theta^*$; because of this, my rewards are all correlated across rounds. But here, when I come to the adversarial setting, that need not be the case. In the adversarial setting we actually get rid of the noise; yes, there is no noise, but this parameter can be adversarially selected by the environment. It is under the environment's control, unknown, and it could be changing from round to round. In the stochastic case the parameter was also selected by the environment, but it was held fixed throughout, and we observed only noisy versions of the rewards. Now, let us see how to adapt Exp3 to the adversarial case. The algorithm goes as follows. Inputs: the horizon $T$, the action set $\mathcal{A}$, the learning rate $\eta$ (we will see how to set it), an exploration distribution $\pi$, and an exploration parameter $\gamma$; $\gamma$ and $\pi$ are used when we bring in the exploration. (No, $\pi$ does not depend on time; as written, it is one fixed distribution which I am using throughout.) Then you do the following for $t = 1, 2, \ldots$: in each round, the algorithm defines a probability for each action as a convex combination of the exploration distribution and the exponential-weights distribution built from the loss estimates, $P_t(a) = \gamma\,\pi(a) + (1-\gamma)\, \exp\big(-\eta \sum_{s<t} \langle a, \hat{Y}_s\rangle\big) \big/ \sum_{a'} \exp\big(-\eta \sum_{s<t} \langle a', \hat{Y}_s\rangle\big)$. Notice I did not explicitly spell out what happens initially: when I start at round $t = 1$, this summation is empty, so the exponential-weights part is just $1/|\mathcal{A}|$, the uniform distribution, the way we usually do. Subsequently, we sample an arm $A_t$ from this distribution $P_t$, play the action $A_t$, and observe a loss for that action. Once you observe the loss, you compute $\hat{Y}_t$, the estimate of the loss vector that the adversary, the environment, would have selected. We have just discussed that, estimated in this fashion, it is an unbiased estimator of that vector $y_t$. Now you go back and see what loss you would have incurred for each of the possible actions. Notice that for this particular $A_t$ you already know the loss you observed, but you are not going to take that observed loss for action $A_t$: you take whatever the estimated vector gives, $\langle A_t, \hat{Y}_t\rangle$, just the way we did it in Exp3, and then you repeat the process. (Yes, the assumption on the action set is satisfied; we are going to use it in the statement of the regret.)
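To make the loop concrete, here is a minimal Python sketch of this algorithm for a finite action set. The mixture distribution is the one just described; the specific least-squares form of the estimator, $\hat{Y}_t = Q_t^{-1} A_t \langle A_t, y_t\rangle$ with $Q_t = \sum_a P_t(a)\, a a^\top$, is my assumption for the "this fashion" above (it is the standard unbiased choice), and `sample_loss` is a hypothetical callback standing in for the environment.

```python
import numpy as np

def exp3_linear(actions, T, eta, gamma, pi, sample_loss):
    """Exp3 for adversarial linear bandits (sketch).

    actions: (k, d) array, one action vector per row
    pi: (k,) exploration distribution over the actions
    sample_loss: callable (t, a) -> observed loss <a, y_t>
    """
    k, d = actions.shape
    cum_est = np.zeros(k)            # sum_{s<t} <a, Y_hat_s> for each action
    total_loss = 0.0
    for t in range(T):
        # Exponential weights on estimated losses, mixed with pi.
        w = np.exp(-eta * (cum_est - cum_est.min()))   # shift for stability
        p = (1 - gamma) * w / w.sum() + gamma * pi
        i = np.random.choice(k, p=p)
        a = actions[i]
        x = sample_loss(t, a)        # only <A_t, y_t> is revealed
        # Unbiased estimate of the loss vector: Y_hat = Q^{-1} a <a, y_t>.
        Q = actions.T @ (p[:, None] * actions)         # sum_a p(a) a a^T
        y_hat = np.linalg.solve(Q, a) * x
        cum_est += actions @ y_hat   # estimated loss of every action
        total_loss += x
    return total_loss
```

Unbiasedness is immediate from this form: $\mathbb{E}[\hat{Y}_t] = \sum_a P_t(a)\, Q_t^{-1} a\, a^\top y_t = Q_t^{-1} Q_t\, y_t = y_t$.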
Okay, this algorithm is generic; it works as long as you can compute these sums over $\mathcal{A}$. So where is the issue if $\mathcal{A}$ happens to contain uncountably many actions? This normalizing sum cannot be defined properly, because it would be a summation over infinitely many terms. We will look into how to handle the case where $\mathcal{A}$ is uncountable, or a continuous set, later; but as long as $\mathcal{A}$ is finite, everything here works. Now, coming to this exploration distribution: it is given to me, and the same one is used in every round. The question is how to choose it; naturally, if you change the exploration distribution, the performance of the algorithm may change. So first we are going to claim that there exists some good exploration distribution that gives sublinear regret, and then we will see whether such an exploration distribution indeed exists and, if at all, how to get it. So this is the statement. There is no standard name for this algorithm, so I am just going to call it Exp3-Lin; this is just our notation. The theorem requires that your action set spans $\mathbb{R}^d$, and it says: there exists an exploration distribution $\pi$ such that if you set $\eta = \sqrt{\log k/(3 d T)}$, where $k = |\mathcal{A}|$ (gamma we have not specified; we will see), then the regret of Exp3-Lin is upper bounded as $R_T \le 2\sqrt{3\, d\, T \log k}$. This is sublinear in $T$, and it looks very similar to what we had for Exp3. But what is the difference in the regret bound? The dimension $d$ comes into the picture: earlier it was of order $\sqrt{T k \log k}$, and that $k$ under the square root got replaced by $d$. So it is not the number of arms that matters but the number of dimensions, because we have linearized the losses; what matters is the dimension in which the unknown parameter lies. As long as I can figure out those $d$ parameters, I have knowledge of all the arms. Now, what is this distribution $\pi$? All we said is that there exists a $\pi$ such that this holds. That part is a bit involved; it comes from another result which guarantees the existence of such a $\pi$, and we will briefly discuss it. I am not going to go through the proof of the regret bound itself; you can look into the book. Most of it is very similar to what we did for Exp3, of course with a little juggling for the new kind of estimates we have brought into the picture.
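For reference, the statement in display form; this is my reconstruction from the discussion above, and the exact constants are worth checking against the book.

```latex
% Regret of Exp3-Lin (reconstruction; constants as discussed above).
\[
  \text{Assume } \operatorname{span}(\mathcal{A}) = \mathbb{R}^d,\ |\mathcal{A}| = k.
  \text{ There exists } \pi \text{ such that, with }
  \eta = \sqrt{\tfrac{\log k}{3 d T}},
\]
\[
  R_T \;=\; \mathbb{E}\Big[\sum_{t=1}^{T} \langle A_t, y_t\rangle\Big]
        \;-\; \min_{a \in \mathcal{A}} \sum_{t=1}^{T} \langle a, y_t\rangle
  \;\le\; 2\sqrt{3\, d\, T \log k}.
\]
```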
To understand the existence of this $\pi$ and what it looks like, we have to look into some design of experiments. Have any of you gone through design of experiments in any of the courses you have taken? Okay. A design-of-experiments problem could be as simple as this: you want to estimate some parameter with high confidence, and every time you play an action the observation is linear in that unknown parameter. For the time being, assume the observation is $X^\top \theta^*$: if you are going to choose $X$, the thing you are going to observe is $X^\top \theta^*$, but with noise added. We already noticed that this is exactly the setup of linear bandits we had. There, what we wanted was to quickly get a good bound on the estimation error. We had this $V_t$ there, and we said the error is upper bounded by some quantity with high probability. Before that, we made the following argument about $\theta^*$, the estimate $\hat\theta$, and some arm $a$: the probability that the projection of the error onto a particular arm exceeds a threshold is small, $\mathbb{P}\big(\langle a, \hat\theta - \theta^*\rangle > \beta\, \|a\|_{V^{-1}}\big) \le \delta$ for a suitable $\beta$. We did not show this in full generality, but under some assumptions we did show such a thing is possible. Now suppose we want to achieve such a thing: I keep observing rewards by playing actions, and each observation is simply $A^\top \theta^* + \epsilon$. The question is how I should choose the sequence of $A$'s to play so that this is achieved as quickly as possible; that means I have been able to estimate my $\hat\theta$ well, very fast. What is this $V^{-1}$ here? It depends on the data we have been gathering, $V = \sum_s A_s A_s^\top$. How should I make my observations so that my estimation error quickly falls? This, then, is exactly the question of how I should design my experiments so that I am quickly able to estimate the underlying parameter well. If you are just going to randomly select some $A$'s in every round, maybe that is not a good idea. What you want is to always select actions such that all the dimensions, all the directions of $\theta^*$, are well explored. If you just happen to play random actions, you may end up exploring only certain directions; or you may touch all the directions but have good information in none of them. But if you design your experiment, maybe in an adaptive fashion, so that whenever you feel some directions are not explored well you choose actions that give better information in those directions, your estimates improve from each step to the next. The point is that the way you select actions in each round matters, and linear bandit algorithms did exactly this: they select actions in each round such that the estimates improve from step to step. So, to get to this distribution $\pi$, we are going to state a result from the design of experiments which tells us that indeed some good $\pi$ exists and how to compute it. Let us say I have a $\pi$ which is a map from $\mathcal{A}$ to $[0,1]$ such that $\sum_{a} \pi(a) = 1$.
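As a toy illustration of why the choice of actions matters, here is a small sketch (hypothetical numbers, assuming the least-squares setup above) comparing the worst-case confidence width $\max_a \|a\|_{V^{-1}}$ under lopsided versus balanced exploration:

```python
import numpy as np

actions = np.eye(3)   # three orthogonal unit actions in R^3 (toy example)

def worst_confidence(counts):
    """max_a ||a||_{V^{-1}} with V = sum_s A_s A_s^T from the play counts."""
    V = sum(c * np.outer(a, a) for c, a in zip(counts, actions))
    V = V + 1e-9 * np.eye(3)          # tiny ridge so V is invertible
    Vinv = np.linalg.inv(V)
    return max(np.sqrt(a @ Vinv @ a) for a in actions)

print(worst_confidence([28, 1, 1]))   # lopsided exploration: ~1.0
print(worst_confidence([10, 10, 10])) # balanced exploration:  ~0.32
```

Spreading the same 30 samples across all three directions shrinks the worst confidence width by roughly a factor of three.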
Now, let me define $Q(\pi) = \sum_{a \in \mathcal{A}} \pi(a)\, a a^\top$, and define $g(\pi) = \max_{a \in \mathcal{A}} \|a\|^2_{Q(\pi)^{-1}} = \max_{a} a^\top Q(\pi)^{-1} a$. Now map this to our stochastic linear bandit problem: I want to come up with a sampling distribution $\pi$ such that my confidence bounds become tight. If my confidence bounds are tight, can I come up with a better algorithm? Yes, because we already know the confidence bounds play an important role: when I select my action by ordering the arms optimistically, based on the estimate plus the confidence term, tighter confidence terms mean the probability that I make an error is also smaller. So let us say I want to sample my actions and observe losses in the linear setting, and I want to quickly know how much confidence I have in these arms about their rewards. This quantity $g(\pi)$ in a way corresponds to that: it tells you, over all actions $a$, the largest confidence width you get by sampling from this particular distribution $\pi$; and $Q(\pi)$ here is nothing but the expected value $\mathbb{E}_{A \sim \pi}[A A^\top]$. If I think of $\|a\|_{Q(\pi)^{-1}}$ as a confidence term and $g(\pi)$ as the largest such term, then if I want a good experiment, a good selection of samples through this $\pi$, I want this quantity to be small, to fall quickly. In the design-of-experiments terminology, such a $\pi$ is usually called a design, and minimizing $g(\pi)$ is called the G-optimal design problem. Just think of this as a separate problem; we will see how it connects to what we want to do. Note that as of now there is no iteration here, no round 1, round 2; it is one shot: I want to sample my arms such that this quantity, the $g$ of that sampling distribution, is minimized. Now, for this we have a classical result, the Kiefer-Wolfowitz theorem; our regret theorem uses it, and it may be of independent interest to you, you may want to use it in some other analysis also. Assume $\mathrm{span}(\mathcal{A}) = \mathbb{R}^d$. Then the result says the following three statements are equivalent: (i) $\pi^*$ is a minimizer of $g(\pi)$; (ii) $\pi^*$ is a maximizer of $\log \det Q(\pi)$; (iii) $g(\pi^*) = d$.
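A small numerical check of these definitions, and of the equivalence $g(\pi^*) = d$: for the standard basis of $\mathbb{R}^3$, the uniform design is optimal and $g$ comes out exactly $d = 3$ (a sketch; the function names are mine).

```python
import numpy as np

def Q(pi, actions):
    """Design matrix Q(pi) = sum_a pi(a) a a^T for a (k, d) action array."""
    return actions.T @ (pi[:, None] * actions)

def g(pi, actions):
    """g(pi) = max_a a^T Q(pi)^{-1} a, the worst-case confidence width."""
    Qinv = np.linalg.inv(Q(pi, actions))
    return max(a @ Qinv @ a for a in actions)

# Uniform design over the standard basis of R^3: Q = I/3, so g = 3 = d.
actions = np.eye(3)
pi = np.full(3, 1 / 3)
print(g(pi, actions))   # 3.0, matching g(pi*) = d at an optimal design
```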
So a minimizer $\pi^*$ of $g$ is at the same time the maximizer of $\log \det Q(\pi)$, and at it $g(\pi^*)$ is exactly $d$. The algorithm's exploration distribution is constructed from such a $\pi^*$, because as I said this $\pi$ is the one which minimizes the term that is the equivalent of a confidence term for me, and as I go from one round to another I want this confidence term to be small. Now, this minimization as stated is for one round, one shot, but we have multiple rounds; the overall $\pi$ is built from the action set together with the knowledge of $Q(\pi^*)$. I am just telling you that this exploration distribution exists; for the exact $\pi$ the algorithm uses, I will leave it to you to look into the proof. (Yes, this $\pi$ is fixed.) What they make sure is that this $\pi$ keeps this confidence quantity small. And notice that the minimization problem, the way it is defined, is only in terms of the action set I have and these norms; I am not sure whether the $\pi$ whose existence the regret theorem claims is exactly this one or some tweaked version of it. Okay, fine; if that is the case, how do we compute such a $\pi$? The computation is already suggested here: it is the maximizer of $\log \det Q(\pi)$, which is easier to compute. All you need to do is take the determinant of $Q(\pi)$, take its log, and find the $\pi$ that maximizes it. So we know there are some good $\pi$'s which solve this G-optimal design problem, or in a way shrink the confidence terms, give us tighter ones. The exact exploration distribution is based on the $\pi^*$ we get here; exactly how, I will just leave to you to look into the proof. The last part of the result says that such a $\pi^*$ exists whose support is bounded: $|\mathrm{supp}(\pi^*)| \le d(d+1)/2$, where supp stands for support. Do you understand what the support of $\pi^*$ means? It is the set of actions where it puts nonzero mass, and that is at most $d(d+1)/2$; this bound is used only in the analysis. So it is actually not putting mass on all the actions. As we said, the number of actions could be much, much larger than the dimension: my dimension could be 10 but the number of actions could be 1000. Although this $\pi^*$ is defined on all possible actions, if you take $d = 10$, then $d(d+1)/2 = 10 \times 11/2 = 55$: even though you could have 1000 actions, it only puts mass on at most 55 of them. So it will not ask you to go on exploring all the actions.
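Since maximizing $\log \det Q(\pi)$ is a concave problem over the simplex, one standard way to approximate $\pi^*$ is a Frank-Wolfe iteration that repeatedly boosts the currently most under-explored action. This is a hedged sketch of that generic method, not necessarily the construction used in the proof:

```python
import numpy as np

def g_optimal_design(actions, iters=500):
    """Approximate the G-optimal design by Frank-Wolfe on log det Q(pi).

    Uses the equivalence above: maximizing log det Q(pi) drives
    g(pi) = max_a a^T Q(pi)^{-1} a down toward d.
    """
    k, d = actions.shape
    pi = np.full(k, 1 / k)                     # start from the uniform design
    for it in range(iters):
        Qinv = np.linalg.inv(actions.T @ (pi[:, None] * actions))
        widths = np.einsum('ij,jl,il->i', actions, Qinv, actions)
        i = np.argmax(widths)                  # most under-explored action
        step = 2 / (it + 2)                    # standard Frank-Wolfe step size
        pi = (1 - step) * pi
        pi[i] += step                          # move mass toward that action
    return pi

actions = np.random.default_rng(1).normal(size=(50, 4))
pi = g_optimal_design(actions)
Qinv = np.linalg.inv(actions.T @ (pi[:, None] * actions))
print(max(a @ Qinv @ a for a in actions))      # approaches d = 4
```

The step uses the fact that the gradient of $\log \det Q(\pi)$ in the coordinate $\pi(a)$ is exactly $a^\top Q(\pi)^{-1} a$, so the linear maximization picks the action with the widest confidence term.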
So, fine. It is saying that as long as you can come up with a good exploration distribution, and with the standard trick of setting the parameter $\eta$ as above, you get a regret bound of this form, and we know it is sublinear. (For $\gamma$, I do not have the exact value here; I think it is somewhere submerged in the proof, and it should be some function of $\eta$, so once you have $\eta$, $\gamma$ can be set in terms of it.) Before we conclude this part, I just want to highlight one more aspect of these studies. In this setup, what if the entire vector $y_t$ were revealed to you in every round? What I said just now is that in every round, if you take action $A_t$, what is revealed to you is $y_t(A_t)$, and from this we estimate. Suppose the environment is nice to you: you play whatever action $A_t$ you wanted, you actually incur that loss, and afterwards the environment reveals $y_t$ itself. Would you be in a better position; could you come up with a better algorithm? Yes: now it is the full-information case; once you know $y_t$, you have the loss of all the actions. This is exactly what we studied in the first couple of classes: in the classification problem with a set of hypotheses, once the label of an instance is revealed, you already know the loss you would have incurred by applying any of the classifiers. So in this case, which algorithm would you like to use if $y_t$ is revealed at the end of each round? Weighted majority. In general there is a whole class of algorithms, which we will not have time to look into, called follow the regularized leader. I am now talking about this full-information case. What these algorithms do is, in every round, try to play the action that is best so far. What do I mean by that? If $y_1, y_2, \ldots, y_{t-1}$ have been revealed to you up to round $t-1$, then one possibility for round $t$ (where $y_t$ is not yet revealed, but you have to act) is to play the action $A_t = \arg\min_{a \in \mathcal{A}} \sum_{s=1}^{t-1} \langle a, y_s\rangle$: for everything observed so far, see which action would have given the best cumulative loss, and play that. This is the unregularized version. But you can show that even in this case there are instances where you get stuck with some bad actions (see the sketch below). The general thing one can do to allay that is to make this a bit smooth by bringing in a regularizer term: for each action you define a function $h$, and you now minimize $\sum_{s=1}^{t-1} \langle a, y_s\rangle + h(a)/\eta$ and play the minimizing action. One can show that by properly choosing this regularizer you are able to get good performance.
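To see the instability of the unregularized version, here is a tiny sketch on the classic alternating-loss example (the numbers are mine, purely illustrative):

```python
import numpy as np

def follow_the_leader(loss_vectors):
    """Play argmin of cumulative loss; the full y_t is revealed after acting."""
    cum = np.zeros(loss_vectors.shape[1])
    total = 0.0
    for y in loss_vectors:
        a = int(np.argmin(cum))       # the leader so far
        total += y[a]                 # incur its loss, then observe all of y
        cum += y
    return total

# Classic bad case: alternating losses make FTL switch every round.
ys = np.array([[0.5, 0.0]] + [[0.0, 1.0] if t % 2 == 0 else [1.0, 0.0]
                              for t in range(10)])
print(follow_the_leader(ys))          # ~10.5: pays almost every round
print(ys.sum(axis=0).min())           # ~5.0: best fixed action in hindsight
```

On this sequence FTL chases the new leader every round and incurs roughly twice the loss of the best fixed action; this is exactly the brittleness the regularizer is meant to smooth out.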
So, one particular choice of this regularizer which is often used is the entropy function. I am right now assuming that the actions are probability vectors for me; in that case we already know the entropy is nothing but $\sum_i a_i \log(1/a_i)$, and the regularizer we use is its negative, $h(a) = \sum_i a_i \log a_i$. So did you notice, did you realize where you have seen this formulation before? If you take this entropy regularizer, what is the optimizer $a^*_t$ going to be? It is a distribution, and it is going to look like exponentiated weights: the $i$-th component is $a_{t,i} = \exp\big(-\eta \sum_{s=1}^{t-1} y_s(i)\big) \big/ \sum_j \exp\big(-\eta \sum_{s=1}^{t-1} y_s(j)\big)$, the exponential of the cumulative loss observed for that particular $i$ so far, divided by a normalization. This is just what we had in the weighted majority algorithm. So with the entropy regularizer we get the weighted majority correspondence, and once we have this kind of distribution we already know what a good value of $\eta$ should be: when we used weighted majority, we started with such a distribution and then further optimized the regret by tuning this parameter $\eta$. So once my distribution comes out in this form, I already know, from my knowledge of the weighted majority algorithm, how I should be tuning $\eta$. There is a class of algorithms based on this idea. This entropy regularization is just one function; you could think of other functions, and people have used things like a divergence, I mean not the Kullback-Leibler divergence itself but another notion called the Bregman divergence. By using different regularizers you will get different performance. You may also want to look into a chapter on such regularized methods; this is called follow the regularized leader, the FTRL algorithm. Because such FTRL algorithms give very good performance, people have been playing with different regularizers and coming up with different bounds, so you may just want to look at that. I am not going to go into those, because to define the other regularizers we would need to diverge into some other topics, so we will not go into that. As you see, we started by looking into specific cases, but things can be studied in more generality: what we stated as the weighted majority algorithm is nothing but this regularized leader. And why is it called follow the leader? Because we are trying to play the leader: up to that point, whoever is the leader, you just want to play it; the regularized version is just a smoothed form of that.
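To tie it together, a minimal sketch of FTRL with the (negative) entropy regularizer over the simplex; the closed-form minimizer is exactly the exponentiated-weights update, and the $\eta$ below is the usual weighted-majority-style tuning (my choice, for illustration):

```python
import numpy as np

def ftrl_entropy(loss_vectors, eta):
    """FTRL over the simplex with negative-entropy regularizer.

    The minimizer of <p, L> + (1/eta) * sum_i p_i log p_i over the simplex
    is p_i proportional to exp(-eta * L_i): the weighted-majority update.
    """
    k = loss_vectors.shape[1]
    L = np.zeros(k)                       # cumulative revealed losses
    expected_loss = 0.0
    for y in loss_vectors:
        w = np.exp(-eta * (L - L.min()))  # shift for numerical stability
        p = w / w.sum()                   # closed-form FTRL solution
        expected_loss += p @ y            # play p, then y is revealed
        L += y
    return expected_loss

ys = np.random.default_rng(0).uniform(size=(1000, 5))
T, k = ys.shape
print(ftrl_entropy(ys, eta=np.sqrt(np.log(k) / T)))   # near the best fixed arm
print(ys.sum(axis=0).min())
```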