So, if you have any doubts on this setup, ask me now. The environment could be assigning the losses arbitrarily: you are just assuming that there is some phenomenon according to which these losses are generated, and you do not know what it is. Our goal is to select an action which would have given the smallest loss. Just to rewind: in every round the environment comes up with a loss vector x_t, and sum_{t=1}^{n} x_{t,i} is the total loss we would incur at the end of n rounds if we happened to always play the i-th action. We are saying: I have no control over how the environment generates the losses, but I want to play the action on which the total loss assigned by the environment is the smallest; that is, I look at the total loss assigned to each action i and pick the smallest one. That is what I will take as my benchmark. The loss vector can keep changing; that is why we index it by the subscript t, and if t changes the vector can change. One of you is asking whether we are selecting, in each round, the action on which the loss is minimum. No: you take one action i, look at the loss it gets in all the rounds, sum over all the rounds, and then look for the action on which this sum is the smallest. Maybe to make this concrete, take the three of you sitting in this column. Say in every class I come up with some numbers for you according to my own criteria — I assign this one 2, this one 3, and so on — and I keep doing this for, say, 10 days. Think of the number assigned to you as your score. Now somebody who does not know how I am assigning the scores would like to pick the one among you to whom I am going to assign the highest total score. If he wants to identify the student who is going to get the maximum total score — the best student among the three of you, say — that is the choice he should make. But he does not know a priori how I am going to assign the scores; the assignment process is entirely up to me, and none of you knows about it. How is he going to make that selection? That is what this setup is asking. And note, I am talking about the single best action over the entire horizon. Now one of you asks: why not allow the minimum inside the summation, so that in each round you look at the smallest score? I have not gone into that yet — that situation is harder to handle — but you could have asked this anyway, so let me write that quantity as well, so that all of you understand the difference between these two criteria.
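Written out from the board discussion, the two criteria are:

```latex
% Criterion 1: total loss of the single best action over the horizon
% (our benchmark):
\min_{i}\;\sum_{t=1}^{n} x_{t,i}

% Criterion 2: minimum taken inside the sum — the best action
% in every round separately:
\sum_{t=1}^{n}\;\min_{i}\, x_{t,i}
```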
So, in my criterion I am looking at the single best action throughout, min_i sum_{t=1}^{n} x_{t,i}, whereas if you take the minimum inside the sum, sum_{t=1}^{n} min_i x_{t,i}, then in each round you are looking at the smallest value, and because of that it is not necessary that the same action wins in every round. Which criterion is better? The per-round one is more stringent: you want to match the smallest loss in every round. It may happen that in one particular round I assign him 3, him 2, and him 1 — so in that round the third of you gets the smallest value — but in the next round I assign 0, 4, and 2, so now the first of you gets the smallest value. The per-round criterion says: go and match the best action in each round; that is the benchmark. But it often happens that this benchmark is very hard to compete against. So this is what we incur, and this is our benchmark strategy: in the one case, playing the single best arm throughout is the benchmark, and in the other, playing the best arm in each round is the benchmark — and that second benchmark always happens to be hard to deal with. Now, what is the relation between these two quantities? Note that so far we are putting no constraint on how the x_{t,i}'s are generated. As a special case the sequence could be cyclic, as one of you is suggesting. For example, take K = 3 and let x_1 = (1, 2, 3), x_2 = (3, 1, 2), x_3 = (2, 3, 1), and then after one full cycle x_4 = x_1, x_5 = x_2, and so on in this cyclic fashion. On such a sequence every single action accumulates the same total loss, so — as you are saying — you learn nothing by comparing against the single best action, and in this case the per-round criterion is better. Fine, but that is for such a special sequence; a priori you know nothing about how the sequence is being generated. For an arbitrary sequence, giving any performance guarantee against the per-round benchmark is very hard, whereas if you compare against a single best action we will be able to say something. The per-round benchmark is indeed the stronger one, but it is not tractable — we cannot say much about it — while we will be able to say a lot about the other one. For the time being, assume that the x_t's are randomly generated — some random values. Can I say something about the expectation of a minimum of random variables versus the minimum of their expectations — which one is greater? The expectation of the minimum is smaller, because you are taking the expectation of the path-wise minimum. But this term enters with a negative sign, so which regret is going to be larger? The one whose benchmark keeps the minimum inside the expectation. And recall what we defined our regret to be — that is the expected regret.
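As a quick numerical check of the cyclic example just discussed, here is a minimal sketch (the array layout and variable names are my own):

```python
import numpy as np

# Cyclic loss sequence from the lecture: K = 3 arms, losses repeat
# (1,2,3), (3,1,2), (2,3,1), (1,2,3), ...
cycle = np.array([[1, 2, 3],
                  [3, 1, 2],
                  [2, 3, 1]])
n = 9                                  # number of rounds (three full cycles)
losses = np.tile(cycle, (n // 3, 1))   # shape (n, K)

# Criterion 1: single best action for the whole horizon, min_i sum_t x_{t,i}
single_best = losses.sum(axis=0).min()

# Criterion 2: best action in each round, sum_t min_i x_{t,i}
per_round_best = losses.min(axis=1).sum()

print(single_best)     # 18 -- every arm totals 18 on this cyclic sequence
print(per_round_best)  # 9  -- the per-round minimum (always 1) sums to 9
```

On this sequence the single-best benchmark cannot distinguish the arms at all, which is exactly the student's point; the per-round benchmark is strictly more informative here, but that is an artifact of this special sequence.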
What we would ideally have liked is to give a bound on the expected regret; that is what we are interested in here. But if I allow the x_t's to be random, then this relationship holds, and we will see later that when the x_{t,i}'s are random I will only be interested in bounding the weaker quantity, not the expected regret — and at that point we are going to call it the pseudo regret. So: as long as the sequence is given to you, there is no issue. If the sequence is random, then the regret measured against a single best action is a lower bound on the expected regret, in this fashion. We will revisit this when we talk about stochastic bandits, where the losses are drawn according to some distribution. Here we are saying the x_{t,i}'s could be stochastic or arbitrarily generated — stochastic in the sense that they follow some particular distribution, or arbitrary in the sense that they follow no particular distribution at all. The weaker quantity is what we call the pseudo regret. Now, the question came up: for a given sequence, what does the expectation mean? If you give me the sequence x_t, the benchmark term is a fixed quantity, whereas the learner's cumulative loss is random because i_t is random — so I might as well not write an expectation around the benchmark. But if the x_{t,i}'s are stochastic — you are assuming this is not a particular sequence but a stochastically generated one, according to some arbitrary distribution, not necessarily in an i.i.d. fashion — then the sequence itself is random and the expectation is valid. In that case, if I take the single best action as my benchmark, the pseudo regret becomes a lower bound on the expected regret. That is the point here. What was perhaps confusing is that earlier the sequence was given, in which case I do not need to write an expectation around the benchmark; once I allow the x_t's to be stochastic, the inequality holds. Is that clear? In the stochastic case the expectation involves two levels of randomness: the x_t's are random, and the selection of i_t is also random. So the expectation is over the randomness of the losses and of the arm selection — the randomness of the losses comes from the adversary, that is, the environment, and the arm selection comes from the learner. You are averaging over the distributions of both the environment and the learner, whereas in the benchmark term we have removed the learner: we are just looking at the best one could get, taking the expectation only with respect to the randomness of the loss values. One of you says the index i is also random there — no, because we are minimizing over all possible i's; the minimizer may depend on the values of the x_t's, but it is not a choice of the learner. Again, let me revisit these things. The environment in each round generates a loss vector, and then the learner picks an action according to a policy; that policy is based on the past history the learner has observed. Maybe I should also write the regret as a function of the sequence x_1, ..., x_n, because this regret is defined for a given sequence — is that clear? For a given sequence, this is the loss you have incurred for playing i_t, and the benchmark is the best you could have got. So let me write it once again.
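To pin down the board notation in one place, the quantities discussed above can be written as follows (a reconstruction from the spoken definitions; pi denotes the learner's policy):

```latex
% Regret for a given loss sequence x_1,\dots,x_n
% (random only through the learner's randomized choices i_t):
R_n(\pi; x_{1:n}) = \mathbb{E}\left[\sum_{t=1}^{n} x_{t,i_t}\right]
                    - \min_{i}\sum_{t=1}^{n} x_{t,i}

% With stochastic losses X_t, the expected regret keeps the
% benchmark inside the expectation:
\bar{R}_n(\pi) = \mathbb{E}\left[\sum_{t=1}^{n} X_{t,i_t}
                 - \min_{i}\sum_{t=1}^{n} X_{t,i}\right]

% The pseudo regret moves the benchmark outside; since
% \mathbb{E}[\min_i Z_i] \le \min_i \mathbb{E}[Z_i], it lower-bounds \bar{R}_n:
\tilde{R}_n(\pi) = \mathbb{E}\left[\sum_{t=1}^{n} X_{t,i_t}\right]
                   - \min_{i}\,\mathbb{E}\left[\sum_{t=1}^{n} X_{t,i}\right]
                 \;\le\; \bar{R}_n(\pi)
```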
So, this is the regret I have got — and even though it is for a given sequence, this regret is still random, because the learner randomizes his strategy. So I may want to look at its expected value, still for this given sequence: only the learner's term is random, so I take the expectation of that term, while the benchmark term has no randomness once you give me the sequence. This is the expected regret for the given sequence — and because the sequence is given, having an expectation around the benchmark has no meaning; I could get rid of that expectation as well. Are these things clear? All the confusion arose when I wrote an expectation there, and the question was whether the expectation and the minimization were interchangeable. Now, instead of being given, suppose the sequence is stochastic. Because it is stochastic I am going to write it as capital X_t: the loss vector assigned in each round is now a random quantity, and the learner's loss in round t is X_{t,i_t}. Instead of looking at the regret for a given sequence, I allow the sequence to be random, and the regret becomes R_n(pi) in expectation — notice that I no longer write it as a function of the sequence, because I am not considering a fixed sequence but a stochastic one. It is the same thing I wrote before, except that I have also taken the expectation over the losses, because the X_t's are random. You are asking whether the two quantities remain the same — no, they are not equal; so what is the relation? Since the sequence is stochastic, the per-round benchmark involves the expectation of a minimum. Maybe you are still confused, so let me rewrite it: start from E[sum_{t=1}^{n} X_{t,i_t}] - sum_{t=1}^{n} E[min_i X_{t,i}]. Now, in the subtracted term, instead of the expectation of the minimum write the minimum of the expectations, min_i E[X_{t,i}]: the expectation of a minimum is at most the minimum of the expectations, so the subtracted term can only grow, and the whole expression can only shrink. In a second step I can pull the minimum outside the sum over rounds, replacing sum_t min_i E[X_{t,i}] by min_i sum_t E[X_{t,i}] = min_i E[sum_t X_{t,i}], which again makes the subtracted term larger and the expression smaller — is this clear? We are just doing these manipulations. So where we started by looking at the minimum quantity in every round, we now look at the single action which gives the minimum total over all the rounds, and the resulting quantity is a lower bound on what we started with. What we actually wanted — what we originally defined to be our regret — is often too demanding: the benchmark wants the minimum loss in each round.
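Spelled out, the chain of manipulations just described is (each step makes the subtracted benchmark larger, hence the regret smaller):

```latex
\mathbb{E}\Big[\sum_{t=1}^{n} X_{t,i_t}\Big] - \sum_{t=1}^{n}\mathbb{E}\big[\min_i X_{t,i}\big]
  \;\ge\; \mathbb{E}\Big[\sum_{t=1}^{n} X_{t,i_t}\Big] - \sum_{t=1}^{n}\min_i \mathbb{E}\big[X_{t,i}\big]
  \;\ge\; \mathbb{E}\Big[\sum_{t=1}^{n} X_{t,i_t}\Big] - \min_i \mathbb{E}\Big[\sum_{t=1}^{n} X_{t,i}\Big]

% First step:  \mathbb{E}[\min_i Z_i] \le \min_i \mathbb{E}[Z_i]
% Second step: \sum_t \min_i a_{t,i} \le \min_i \sum_t a_{t,i}
```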
But if I do this manipulation I end up with the quantity whose benchmark is the single best action for the entire n rounds. It is only a lower bound, fine, but we will take this as our regret definition, and we call it — as I wrote earlier — the pseudo regret. The whole business of getting confused here happened because we first defined the regret for a given sequence and then wanted to take its expected value. The expected regret is not the same as this: had we allowed an arbitrary probabilistic sequence from the start, we would have ended up with the expected-regret definition, but that definition involves the stringent benchmark; instead we consider this weaker benchmark. I would ideally have come to this a bit later, but since you asked, we have clarified it now. Is there any confusion remaining about what the expected regret is and what the pseudo regret is? In the expected regret the benchmark is a bit stringent; that is the difference. Now let us come back — by the way, we should call this the adversarial bandit setting, not just the bandit setting. We will discuss the main idea we are going to use in this setup now, and give the algorithm and its proof in the next class. In the weighted majority algorithm we got to know the loss of all the actions, and what did we do? We updated the weight of each action according to an exponential factor. There we were able to update the weights of all the actions in each round because we had an observation for every action. But now, in the bandit setting, we observe the loss only for the action we played, not for the others. So how am I going to update? Is it that in each round I only update the action I played and say nothing about the other actions, or is there a mechanism by which I can update all the weights in each round? We will see that even though we do not observe the losses of the actions we did not play, we can come up with a mechanism by which we pretend to have some information about the other arms and update their weights accordingly. What is that mechanism? It is called importance sampling. Let P_t be the distribution with which you select the arms in round t — we said arms are selected according to some distribution, and the policy governs what that distribution is. So we select i_t according to the distribution P_t. Suppose you happen to select i_t = i in round t: then you observe x_{t,i} as your loss for this action, and you observe nothing for the other actions. Now, how do we update the weights of all the actions? Let me define estimates for the losses. In round t the environment chooses the vector x_t; I am going to estimate that vector by the following mechanism, and I write the estimate as \hat{X}_{t,j} — my estimator for the loss of arm j in round t.
So, how am I defining it? \hat{X}_{t,j} = (x_{t,j} / P_{t,j}) 1{i_t = j}, where 1{.} is the indicator. Suppose you played i_t = i in the t-th round: what does this quantity become? For j = i it is x_{t,i} / P_{t,i}, and for j not equal to i the indicator makes it 0. This is how I estimate the loss values in round t. I am saying: if I played action i, I know the value x_{t,i} because I observed it, and I define its estimate to be x_{t,i} / P_{t,i}; for the other arms I simply define the estimate to be 0. That is what this definition says. Now, this is an estimator of the loss values, and the question is: in what sense is it an estimator? First, is it a random quantity? Yes — because it depends on i_t, which is selected randomly in that round; since it is random, I write it with a capital letter. What is its expected value? I take the expectation with respect to i_t. Since i_t takes the value i with probability P_{t,i}, for i going from 1 to K, E[\hat{X}_{t,j}] = sum_{i=1}^{K} (x_{t,j} / P_{t,j}) 1{i = j} P_{t,i}. Is this expected value correct? Now simplify: the indicator survives only when i = j and vanishes for all the others, so the quantity becomes (x_{t,j} / P_{t,j}) P_{t,j} = x_{t,j}. So the estimator is such that its expected value is exactly the loss value in that round — even though I may only have observed x_{t,i} in that round — and, if you define your estimates like this, this holds for every j. So for every component you have an unbiased estimator. Do you understand what I mean by an unbiased estimator? The expectation of the estimator equals — what? — the value we are estimating. Here, if you regard this quantity as trying to estimate the true value x_{t,j} in round t, then in expectation it is doing exactly that job; so this estimator is an unbiased estimator. Now you see that, with this estimator, in every round you have a prediction for each component, and these predictions are good estimators of the true values: even though you do not know the true values, you have a mechanism by which you can estimate those unknown quantities, and your estimates are unbiased. So can you now use them in your learning, as you did in the full-information case? In the full-information case you observed the loss of every action, so you used all of them to update all the actions. Here you observe the loss of only one action, but for everyone else you have these estimators, which in expectation are as good as the true values you did not observe.
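Here is a minimal numerical check of this unbiasedness claim (variable names and the specific P_t and x_t values are my own; only the estimator itself comes from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3
p_t = np.array([0.5, 0.3, 0.2])  # P_t: sampling distribution over arms in round t
x_t = np.array([0.9, 0.1, 0.4])  # true (hidden) loss vector chosen by the environment

def loss_estimate(p_t, x_t, i_t):
    """Importance-weighted estimate: x_hat[j] = x_t[j]/p_t[j] if i_t == j else 0."""
    x_hat = np.zeros_like(x_t)
    x_hat[i_t] = x_t[i_t] / p_t[i_t]   # only the played arm's loss is observed
    return x_hat

# Empirical check of unbiasedness: average the estimate over many draws of i_t.
trials = 200_000
avg = np.zeros(K)
for _ in range(trials):
    i_t = rng.choice(K, p=p_t)         # arm selected according to P_t
    avg += loss_estimate(p_t, x_t, i_t)
avg /= trials

print(avg)  # ~ [0.9, 0.1, 0.4]: E[x_hat] recovers the full loss vector x_t
```

Note that although each individual estimate is zero on all but one arm, the average over the randomness of i_t recovers every component of x_t, which is exactly the unbiasedness computed above.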
So: maybe I do not have the true observations, but I have these estimated values, which can act as a proxy for the true losses, and I can use them to update the weights of all the actions. We will use this idea to come up with an algorithm for the adversarial bandit setting, called Exp3, and we will do that in the next class.
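To preview how the estimator plugs into the exponential-weights update recalled from the full-information case, here is a minimal sketch of the idea (my own illustration, not the algorithm as it will be given in the next class; the learning rate eta, the uniform initialization, and the omission of any explicit exploration mixing are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def exp3_sketch(loss_fn, K, n, eta=0.1):
    """Exponential weights driven by importance-weighted loss estimates.

    loss_fn(t) returns the environment's loss vector x_t; only the played
    entry is treated as observed, the rest are estimated as 0.
    """
    weights = np.ones(K)
    total_loss = 0.0
    for t in range(n):
        p_t = weights / weights.sum()     # sampling distribution P_t
        i_t = rng.choice(K, p=p_t)        # play one arm
        x_t = loss_fn(t)
        total_loss += x_t[i_t]            # only this loss is observed
        x_hat = np.zeros(K)
        x_hat[i_t] = x_t[i_t] / p_t[i_t]  # unbiased estimate of x_t
        weights *= np.exp(-eta * x_hat)   # update all arms, as in weighted majority
        weights /= weights.max()          # renormalize for numerical stability
    return total_loss

# Toy run on the cyclic sequence from earlier in the lecture.
cycle = np.array([[1, 2, 3], [3, 1, 2], [2, 3, 1]]) / 3.0  # losses scaled to [0, 1]
print(exp3_sketch(lambda t: cycle[t % 3], K=3, n=3000))
```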