So far we have been studying stochastic contextual bandits. We observed that stochastic contextual bandits can be considered a special case of stochastic linear bandits when we assume the mean rewards are linear. In the last class we discussed some algorithms and sketched broadly how the proofs go under some assumptions, and we said that using some special techniques we should be able to relax those assumptions and prove the regret bounds we claimed; we left it there. That was the stochastic case, and in the assignment you are going to work through some problems and some algorithms for the stochastic bandit case.

Now we move to the case where the setup can be adversarial. So far the rewards were all stochastic, but we could just as well consider rewards that are adversarial. How do contextual bandits work in the adversarial case? The setup is the same as earlier, except that when you have a context, the reward associated with that context for an arm need not come from a fixed distribution: it could be chosen in an arbitrary fashion, and in particular it could be selected by an adversary.

We consider the following setup. Let C denote the set of contexts; this set is assumed to be fixed, and in every round the context is drawn from it. Let there be k arms. In every round t = 1, ..., T the following happens: the learner observes a context c_t from C; the environment assigns a reward vector x_t = (x_t(1), ..., x_t(k)), one reward per arm for that round; the learner, having observed c_t but not the rewards assigned in that round, selects a distribution P_t over the k arms, samples an arm I_t from P_t, plays it, and receives x_t(I_t). So x_t(I_t) is the reward assigned to arm I_t in round t, and this is the learner's reward in that round. This is how the interaction between the learner and the environment happens.

Our goal, as usual, is to compare the performance of the learner against the best reward we would have got had we known all the reward assignments. We define the expected regret of a policy \pi as

R_T(\pi) = \mathbb{E}\left[ \sum_{c \in C} \max_{i \in [k]} \sum_{t=1}^{T} \mathbb{1}\{c_t = c\}\, x_t(i) \;-\; \sum_{t=1}^{T} x_t(I_t) \right].

What is the first quantity here? For the time being, assume that C is finite. Take a context c and look for the arm that gives the maximum total reward over those rounds in which the context is c, then sum this over all possible contexts. Notice that the indicator restricts the sum: we are only adding up rewards from the rounds t in which the context is c.
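To make the interaction protocol concrete, here is a minimal sketch in Python. This is an illustration under assumptions, not anything from the lecture itself: env_rewards and learner are hypothetical stand-ins for the adversary and the policy, rewards are assumed to lie in [0, 1], and the context is sampled uniformly here purely for the demo (in the actual setup the context sequence can also be arbitrary).

```python
import random

def run_protocol(T, k, contexts, env_rewards, learner):
    """Sketch of the adversarial contextual bandit interaction.

    env_rewards(t, c) is a hypothetical adversary: it returns the length-k
    reward vector x_t for round t (fixed before the arm is drawn).
    learner is a hypothetical policy with act(c) -> (arm, distribution)
    and update(c, arm, reward, distribution); it sees only x_t(I_t).
    """
    total = 0.0
    for t in range(T):
        c_t = random.choice(contexts)   # context revealed to the learner
        x_t = env_rewards(t, c_t)       # adversary's rewards for this round
        i_t, p_t = learner.act(c_t)     # P_t depends on c_t, not on x_t; I_t ~ P_t
        total += x_t[i_t]               # bandit feedback: only x_t(I_t) is observed
        learner.update(c_t, i_t, x_t[i_t], p_t)
    return total
```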
This inner sum gives the total reward an arm accumulates on exactly those rounds in which the context is c, and summing the maxima over all contexts gives the total benchmark reward over the T periods. We compare this against the reward we would have collected by playing our policy \pi. How does \pi affect the x_t(I_t) terms? The policy tells us how to choose the distribution P_t in every round; according to that distribution we draw an arm, play it, and receive the corresponding reward. So all these collected rewards are governed by \pi, and their sum is the total reward we accumulate over the horizon T.

A question came up: why write the indicator c_t = c explicitly? The indicator is exactly doing that job of restricting the sum to those rounds in which the context is c. Take a context c: out of, say, 100 rounds, that particular c may have occurred only 10 times, and only over those 10 rounds are we looking, in hindsight, for the arm we would have liked to pick. In every round there is a reward vector, one entry per arm, assigned for whatever context was observed in that round; if the context in round t happens to be c, that round contributes x_t(i) to the sum, otherwise it contributes 0.

Now suppose there is only one context. Then the outer summation over contexts disappears, since c_t = c in every round, and we are back in the standard adversarial bandit setup: we compare what we got against the single best arm in hindsight. Now that we have different possible contexts, we instead look, for every context, at the best we would have obtained. The point is that the optimal arm could depend on the context: whenever a context is observed, we want to know, over the rounds in which that context appeared, which action would have been best in hindsight. So this inner part says: wherever you saw context c, give me the total reward you accumulated whenever you saw c.
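As a sanity check, the single-context reduction can be written out explicitly (a small worked step under the assumption |C| = 1, using the regret definition above):

\[
|C| = 1 \;\Longrightarrow\; R_T(\pi) = \mathbb{E}\left[\, \max_{i \in [k]} \sum_{t=1}^{T} x_t(i) \;-\; \sum_{t=1}^{T} x_t(I_t) \,\right],
\]

since \mathbb{1}\{c_t = c\} = 1 in every round t; this is exactly the standard adversarial bandit regret against the single best arm in hindsight.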
The max over i in [k] then says: in hindsight, knowing all the x_t(i) for the rounds with context c, what is the best action you should have played, and what is the corresponding best reward? This is taken for every possible context, and the total is compared with the cumulative reward you actually got. Obviously, we want to minimize this regret.

Now let us think about the regret for a particular context. The regret above is over all possible contexts; for a particular context c, we write

R_T^c(\pi) = \mathbb{E}\left[ \max_{i \in [k]} \sum_{t=1}^{T} \mathbb{1}\{c_t = c\}\, x_t(i) \;-\; \sum_{t=1}^{T} \mathbb{1}\{c_t = c\}\, x_t(I_t) \right].

Here we have fixed the context c and focus only on those instances where we observed it: the first term is the best we could have got for that context, and the second is the reward we accumulated whenever we saw it. As long as the set of contexts is finite, the total regret is simply the sum of these terms:

R_T(\pi) = \sum_{c \in C} R_T^c(\pi).

Now let us focus on how to deal with each context. Yes, there are multiple contexts, but if you focus on one particular context, you are already in known territory: that is the standard adversarial bandit setting, which you already know. So for each particular context you can run an Exp3 algorithm. Indeed, the standard adversarial setting was the special case of this where there is only one context; now that we have multiple contexts, and we know which contexts exist, we can run a different Exp3 instance for each of them.

If we do that, what regret do we get for a particular context? Let R_T^c denote this regret when the policy runs Exp3 for context c. For the standard setup, Exp3 gives a regret bound of about 2\sqrt{kT \log k}, where T is the number of times you have played. But here that T becomes the number of times you have observed the context: a particular context c appears only at some rounds, not necessarily in every round; out of 100 rounds, it may have appeared only 10 times. With that information we can write

R_T^c \le 2\sqrt{k \log k \sum_{t=1}^{T} \mathbb{1}\{c_t = c\}},

where the sum inside is simply the number of times context c has been observed.
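To make "one Exp3 instance per context" concrete, here is a minimal Python sketch. It is a hedged illustration, not the lecture's exact pseudocode: it uses the loss-based variant of Exp3 (importance-weighted losses 1 - x), assumes rewards in [0, 1], and uses a fixed learning rate eta, deferring the unknown-horizon issue discussed next.

```python
import math
import random

class Exp3:
    """One Exp3 instance (loss-based variant; a sketch under the assumptions
    above). eta is a fixed learning rate; the doubling trick for an unknown
    per-context horizon is omitted here for brevity."""
    def __init__(self, k, eta):
        self.k, self.eta = k, eta
        self.L = [0.0] * k              # cumulative importance-weighted losses

    def distribution(self):
        m = min(self.L)                 # shift to stabilise the exponentials
        w = [math.exp(-self.eta * (Li - m)) for Li in self.L]
        z = sum(w)
        return [wi / z for wi in w]

    def update(self, i, reward, p):
        # Importance-weighted loss estimate, nonzero only for the played arm.
        self.L[i] += (1.0 - reward) / p[i]

class PerContextExp3:
    """The lecture's scheme: an independent Exp3 instance per observed context."""
    def __init__(self, k, eta):
        self.k, self.eta = k, eta
        self.instances = {}             # context -> Exp3 instance

    def act(self, c):
        inst = self.instances.setdefault(c, Exp3(self.k, self.eta))
        p = inst.distribution()
        i = random.choices(range(self.k), weights=p)[0]   # sample I_t ~ P_t
        return i, p

    def update(self, c, i, reward, p):
        self.instances[c].update(i, reward, p)
```

A caller would pair this with the protocol loop sketched earlier: act(c_t) returns I_t and P_t, the arm is played, and the observed x_t(I_t) is fed back via update, touching only the Exp3 instance belonging to c_t.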
But notice that we got this bound for Exp3 by assuming that we knew the time horizon: in Exp3 we needed to tune a parameter \eta, and that parameter depended on the horizon. When we apply Exp3 in this setup, we apparently do not know how many times we are going to see a particular context c: it may appear in all the rounds, it may never appear, or it may appear only in a few of the instances. So how do we apply Exp3? We use what we usually do when the time horizon is unknown: the doubling trick. With the doubling trick we know we get almost the same regret bound, with only a constant-factor loss. So we can still proceed: keep doing the doubling trick within each context's Exp3 instance, and the bound above holds up to constants.

Then the total regret is R_T(\pi) = \sum_{c \in C} R_T^c(\pi), and using the per-context Exp3 bound,

R_T(\pi) \le \sum_{c \in C} 2\sqrt{k \log k \sum_{t=1}^{T} \mathbb{1}\{c_t = c\}}.

This upper bound appears to depend on the particular sequence of contexts observed. Let us try to get rid of that and obtain a bound independent of the context sequence. Can you apply the Cauchy-Schwarz inequality here and see what bound you get? We have used this trick many times: treat each summand as a product a_c b_c, where a_c = 1 and b_c is the square-root term, which changes with the context. Keeping the constant outside, Cauchy-Schwarz gives

R_T(\pi) \le 2\sqrt{\sum_{c \in C} 1}\; \sqrt{\sum_{c \in C} k \log k \sum_{t=1}^{T} \mathbb{1}\{c_t = c\}}.

The first factor is simply the square root of the cardinality of C. And the double sum inside the second factor is simply T, because every round has exactly one context. So

R_T(\pi) \le 2\sqrt{|C|\, k\, T \log k}.

What has just happened is this: if you treat each context as a separate adversarial bandit problem and maintain an Exp3 instance for each context, this is the regret you get, and compared to the single-context bandit the regret has scaled by a factor of \sqrt{|C|}. As long as your context set is finite, you can apply Exp3 separately for each context and still get sublinear regret; here |C| is fixed (it is not a quantity varying with T), so the bound is sublinear in T. Fine. The question then is: what if the cardinality of C is large? Then this regret bound can be very bad.
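To get a feel for the \sqrt{|C|} factor, here is a small worked instance with hypothetical numbers chosen purely for illustration (nothing from the lecture):

\[
|C| = 100,\; k = 10,\; T = 10^6:\qquad 2\sqrt{|C|\,k\,T\log k} = 2\sqrt{100 \cdot 10 \cdot 10^6 \cdot \ln 10} \approx 9.6 \times 10^4,
\]

roughly 10% of the horizon T, whereas the single-context bound 2\sqrt{kT\log k} \approx 9.6 \times 10^3 for the same k and T, exactly a factor \sqrt{|C|} = 10 smaller.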