So, when we looked into contextual bandits initially, especially the stochastic version, we made the assumption that the mean rewards are all linear in the context, and we then tried to solve that problem by building an algorithm based on confidence ellipsoids; the main difficulty there was how to construct the confidence ellipsoid. In the stochastic case we assumed there was a fixed parameter θ*, and that if you play an arm x (the arms were vectors), the mean reward you get is x^T θ*; that is, the rewards were linear in some unknown parameter θ*. Now we move to the adversarial version, in which this parameter need not be fixed. Earlier, in the stochastic case, it was fixed but unknown; now we consider a setting where it can be selected by an adversary in an arbitrary fashion. So what is the learning setup? We are going to study the adversarial linear bandit. Assume you have an action set A ⊆ R^d, and the game in each round t goes as follows: a vector y_t is selected by the environment, you select an action a_t ∈ A, and by playing a_t you receive the inner product ⟨a_t, y_t⟩. Naturally your goal would be to maximize cumulative reward, but I am going to work in the loss setting here, so treat ⟨a_t, y_t⟩ as a loss. The regret is R_T = E[ Σ_{t=1}^T ⟨a_t, y_t⟩ ] − min_{a ∈ A} Σ_{t=1}^T ⟨a, y_t⟩. What does the second term give you? If you knew the whole sequence y_1, y_2, ..., y_T, it is the best loss you could have incurred in hindsight: you look for the single action you should have played so that over the whole sequence you get the smallest total loss. And the first term? That is the total loss you actually incur by playing a_t in round t, summed over the given sequence y_1, ..., y_T. The expectation is there because the learner may select the actions a_t in a randomized fashion. So in this setup we want an algorithm that minimizes this regret. Now, suppose initially that A is finite, and as a further special case set A = {e_1, e_2, ..., e_d}, the standard unit vectors. What is this setup then? It is exactly the standard adversarial bandit we have already studied: in every round the adversary assigns a reward or loss to each of the arms, here packaged as the components of the vector y_t, and whichever arm you pick, you observe only the loss of that arm, not the others.
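To make the protocol and the regret expression concrete, here is a minimal simulation sketch in Python; the setup (an oblivious adversary fixing y_1, ..., y_T up front, a placeholder learner that plays uniformly at random) is my own illustration, not an algorithm from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance: the unit-vector special case A = {e_1, ..., e_d}
# with an oblivious adversary that fixes y_1, ..., y_T in advance.
d, T = 3, 1000
A = np.eye(d)
Y = rng.uniform(0.0, 1.0, size=(T, d))   # y_t per round, losses in [0, 1]

total_loss = 0.0
for t in range(T):
    a_t = A[rng.integers(len(A))]        # placeholder learner: uniform play
    total_loss += a_t @ Y[t]             # bandit feedback: only <a_t, y_t>

# Best single action in hindsight: min over a in A of sum_t <a, y_t>.
best_in_hindsight = min(a @ Y.sum(axis=0) for a in A)
print(f"regret of the uniform player: {total_loss - best_in_hindsight:.1f}")
```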
So, here you pick a particular a_t, one of these unit vectors, you observe only that component of y_t, and that is the loss you incur; the hindsight term is then simply the single best arm you would want to pull in hindsight, because the actions all come from the unit vectors. In that way this is just a generalization of the K-armed adversarial setting: the environment still chooses a vector y_t, but now the learner is allowed to play actions a_t that are not just unit vectors; A can be any subset of R^d. For the time being I am going to focus only on the case where the cardinality of A is finite; we will discuss later what happens when this is not the case. Before we continue, we make the following two assumptions. First, for any y_t selected by the adversary or the environment and any action a, the loss ⟨a, y_t⟩ lies in the interval [0, 1]. This just makes sure the losses are bounded; it is equivalent to what we did in the stochastic case, where we assumed the means (or the supports) lie in [0, 1]. If this is not the case, you simply rescale everything appropriately, dividing by the maximum loss you could incur in any round, so that the per-round losses come back into [0, 1]. Second, I am going to assume that the action set A spans R^d, that is, A contains a basis for R^d. We will use this assumption when we derive the regret bound for this setup: since A spans R^d, the learner can explore all possible directions of the y_t vectors it is going to face, and if this is the case we can achieve that target. So now the setup for the adversarial linear bandit should be clear, and under these assumptions the question is: what is a good algorithm to minimize this regret? Can you think of a generalization of the algorithms we already know? In the special case where the action set consists of the unit vectors, we already know how to solve the problem: simply use Exp3 or Exp3-IX. Now we have a generalized version, so what could be a good algorithm? When we studied Exp3, the main issue was how to estimate the losses of each of the actions in every round: for the action you actually played you observe the loss, and for the ones you did not play you have no information, but you still want to estimate, in that round, the loss of every arm. Now the arms have been replaced by the action set. Similarly, in every round I want to estimate the loss I would have observed for each of my actions, and that is possible if I can estimate the vector y_t that occurred in round t.
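As a quick sanity check of these two assumptions, here is a small sketch; the helper name `check_assumptions` and the rescaling convention are mine, assuming the NumPy setup from the previous snippet.

```python
import numpy as np

def check_assumptions(A, Y):
    """A: (K, d) array of actions; Y: (T, d) array of adversary vectors."""
    losses = Y @ A.T                     # every possible loss <a, y_t>
    # Assumption 1: losses lie in [0, 1]; if not, rescale by the worst loss.
    worst = np.abs(losses).max()
    if worst > 1.0:
        Y = Y / worst
    # Assumption 2: the action set spans R^d (needed to explore all directions).
    assert np.linalg.matrix_rank(A) == A.shape[1], "A must span R^d"
    return Y
```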
So, if you can somehow figure out what the vector y_t in round t potentially was, you can go and find out, for each action, what loss it would incur, and from that you can play the one with the smallest estimated loss. The question then boils down to how to estimate y_t in any given round. And while estimating, we also want to ensure that whatever we estimate is unbiased, the way we achieved it with the importance-sampling method. Let us now discuss how to do this. Let Ŷ_s(a) denote the estimated loss of action a in round s. Suppose I can estimate this for all actions a; how should I then play an arm in round t? We can use the exponentially weighted distribution; we know it has good properties, having used it many times in our Exp3 algorithms. So in round t we construct a probability distribution over actions, and one particular way to do this is p_t(a) ∝ exp( −η Σ_{s=1}^{t−1} Ŷ_s(a) ), the exponentially weighted probability distribution, provided we can estimate the loss each of our actions would incur in each round. At this point a student asked: since the action set sits in R^d, can we not transform it into an orthogonal basis, decompose each y_t we see into components along those directions, and reduce to the earlier unit-vector setup? The difficulty is that the actions may be correlated, and A may have far more than d elements; I only said that A is a subset of R^d, not that it has exactly d elements. The unit-vector case is precisely the special case with exactly d elements. So I do not see how to restrict an arbitrary action space to a particular basis and still keep a one-to-one mapping between the loss of an action in the original set and the loss of an action in the transformed space. Let us instead work with what we are given: an arbitrary finite subset A of R^d as the action set. Now, we have already seen in the Exp3 setup that such a distribution, even though it yields unbiased estimates, can suffer from very bad variance. How did we handle that bad variance there? We deliberately included an exploration term. So one thing we can do here, instead of going with the exponential weights alone, is mix: p_t(a) = (1 − γ) · exp( −η Σ_{s<t} Ŷ_s(a) ) / (normalizer) + γ π(a), where π is an exploration distribution; we will specify how it looks later.
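A minimal sketch of this mixed sampling distribution, assuming the cumulative estimated losses are kept in an array; the names `L_hat` and `explore_pi` are my own.

```python
import numpy as np

def exp_weights_distribution(L_hat, eta, gamma, explore_pi):
    """p_t(a) = (1 - gamma) * exponential weights + gamma * pi(a).

    L_hat:      (K,) cumulative estimated losses sum_{s<t} Yhat_s(a)
    eta:        learning rate of the exponential weights
    gamma:      weight of the exploration distribution pi
    explore_pi: (K,) exploration distribution over the K actions
    """
    w = np.exp(-eta * (L_hat - L_hat.min()))   # shift by min for stability
    return (1.0 - gamma) * w / w.sum() + gamma * explore_pi

# Usage: idx = rng.choice(K, p=exp_weights_distribution(L_hat, 0.05, 0.1, pi))
```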
So, once we can construct this distribution, I simply draw an action a_t from it, play it, and observe the corresponding loss. What remains is how to estimate y_t; if I have that, I can do everything. One possible way is the following. Suppose in round t you played a_t; whatever vector y_t was selected by the adversary, you observe the inner product, and let us call this observed loss Y_t = ⟨a_t, y_t⟩. I now propose the estimate ŷ_t = R_t a_t Y_t, where R_t is a matrix still to be specified. To keep the types straight, a few questions came up here: a_t is the actual d-dimensional action played in round t, selected from A according to the distribution the learner constructed; Y_t is just a scalar, because it is an inner product; y_t is d-dimensional, so R_t must be d × d, and ŷ_t is an estimate of the entire vector y_t. The per-action loss estimate is then Ŷ_t(a) = ⟨a, ŷ_t⟩ for each particular a; and note that the p_t written above is only proportional to the exponential weights, so for the exact probabilities you normalize by the sum of those quantities. Now, how should I choose R_t so that ŷ_t, conditioned on whatever I have observed so far, becomes an unbiased estimate of y_t? Take the expectation of ŷ_t given a_1, ..., a_{t−1}, the actions you have already played before time t. If I substitute the definition of Y_t, the estimate simplifies to ŷ_t = R_t a_t a_t^T y_t; I have just replaced the observed Y_t by its definition. I am going to choose R_t deterministically in round t (we will see what it is), so conditioned on the history the only random quantity here is a_t, entering through the matrix a_t a_t^T.
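Collecting that verbal computation in one display, with E_t[·] denoting expectation conditioned on a_1, ..., a_{t−1} (a reconstruction of the board work):

```latex
\hat{y}_t \;=\; R_t\, a_t Y_t \;=\; R_t\, a_t a_t^{\top} y_t,
\qquad
\mathbb{E}_t\!\left[\hat{y}_t\right] \;=\; R_t\, \mathbb{E}_t\!\left[a_t a_t^{\top}\right] y_t .
```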
And the expected value of this estimate ŷ_t, conditioned on the past, is simply R_t times E_t[a_t a_t^T] y_t: conditioned on a_1, a_2, ..., a_{t−1}, I already know the probabilities p_t(a), so the expectation of a_t a_t^T is just the sum of a a^T over all possibilities a in the action set A, weighted by the corresponding probabilities p_t(a); y_t does not depend on a_t, and R_t, as I said, is chosen deterministically in that round. So E_t[ŷ_t] = R_t ( Σ_{a ∈ A} p_t(a) a a^T ) y_t. If I want ŷ_t to be an unbiased estimator of y_t, I should therefore choose R_t to be the inverse of this matrix. Since p_t depends on the exploration distribution π we have mixed in, let me denote the whole quantity Q_t(π) = Σ_{a ∈ A} p_t(a) a a^T, and set R_t = Q_t(π)^{-1}; then E_t[ŷ_t] = y_t, where the subscript t on the expectation means conditioning on the history up to round t. (This is also where the spanning assumption earns its keep: because A spans R^d and the mixture with π keeps every p_t(a) positive, Q_t(π) is invertible.) So we have now built an unbiased estimator for y_t, and with that the job is done: we just repeat the process each round, and I will write down the exact algorithm next. Where does this particular form come from? It is basically the intuition from least-squares regression. Recall how the least-squares estimate θ̂_t worked in the stochastic case: θ̂_t = V_t^{-1} Σ_{s=1}^{t} a_s Y_s, where V_t = Σ_{s=1}^{t} a_s a_s^T and a_s is the action you played in round s. In the stochastic case there is a fixed θ* through which all the rewards are correlated, so we use all the information we have gathered up to round t, summing over s from 1 to t. In the adversarial case, however, there need not be any correlation across the vectors the adversary selects; the parameter need not stay the same and could change in every round. So we focus on a single round, which is why there is no summation here, and we try to use the same idea on that one round.
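A quick Monte Carlo check of this unbiasedness claim; the action set, the distribution p_t, and the vector y_t below are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 3, 6
A = rng.normal(size=(K, d))            # finite action set (spans R^d a.s.)
p = rng.dirichlet(np.ones(K))          # some full-support distribution p_t
y = rng.uniform(size=d)                # the adversary's vector y_t

Q = sum(p[i] * np.outer(A[i], A[i]) for i in range(K))   # Q_t(pi)
R = np.linalg.inv(Q)                                     # R_t = Q_t(pi)^{-1}

# E_t[y_hat] with y_hat = R a_t <a_t, y>, estimated by sampling a_t ~ p.
idx = rng.choice(K, size=100_000, p=p)
y_hat_mean = np.mean([R @ A[i] * (A[i] @ y) for i in idx], axis=0)
print(np.round(y_hat_mean, 3), np.round(y, 3))           # should match
```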
So, if you drop the summation and keep only a single round s, the least-squares formula formally becomes (a_s a_s^T)^{-1} a_s Y_s, and that is exactly the shape we have used. Of course a single rank-one matrix a_s a_s^T is not actually invertible, so instead of writing it directly like that, we wrote the estimate as ŷ_t = R_t a_t Y_t and argued that the right choice of R_t, the one that makes ŷ_t an unbiased estimator of y_t, is the inverse of the expectation Q_t(π) = E_t[a_t a_t^T].
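Putting the pieces together, here is a sketch of the full round-by-round loop described above; it is my own assembly of the lecture's ingredients, with a uniform exploration distribution standing in for the π that is yet to be specified.

```python
import numpy as np

def linear_exp3(A, Y, eta=0.05, gamma=0.1, seed=0):
    """Sketch: exponential weights for the adversarial linear bandit.

    A: (K, d) finite action set spanning R^d
    Y: (T, d) adversary's loss vectors; the learner only sees <a_t, y_t>
    """
    rng = np.random.default_rng(seed)
    K, d = A.shape
    pi = np.full(K, 1.0 / K)             # stand-in exploration distribution
    L_hat = np.zeros(K)                  # cumulative estimated losses
    total_loss = 0.0
    for y_t in Y:
        w = np.exp(-eta * (L_hat - L_hat.min()))
        p = (1.0 - gamma) * w / w.sum() + gamma * pi
        i = rng.choice(K, p=p)           # draw a_t ~ p_t
        loss = A[i] @ y_t                # the only feedback observed
        total_loss += loss
        # Q_t(pi) = sum_a p_t(a) a a^T, then y_hat_t = Q^{-1} a_t Y_t.
        Q = (p[:, None, None] * (A[:, :, None] * A[:, None, :])).sum(axis=0)
        y_hat = np.linalg.inv(Q) @ A[i] * loss
        L_hat += A @ y_hat               # Yhat_t(a) = <a, y_hat_t> for all a
    return total_loss
```

Note that on the unit-vector action set this collapses to ordinary Exp3 with a uniform exploration mixture: Q_t is then the diagonal matrix of the probabilities p_t(a), so ŷ_t recovers exactly the importance-weighted loss estimate Y_t / p_t(a_t) on the played coordinate.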