So, one more thing: I want to make a distinction between the types of adversaries we are going to see. We said that the adversary cannot see the prediction made by the player in the current round. He will have observed it in the past rounds, but in the current round he does not know what the learner is going to predict. Had he been able to see it, he could always make the loss incurred by the learner very high: if the learner predicts 0, the adversary can assign 1 and make him suffer more loss. Because of that, we allowed the learner to randomize his strategy, and we did not allow the adversary to know the final action the player is going to select in that round, because the selection is randomized. So the adversary cannot see, because of this randomness, which action the learner will select; but once the learner selects an action in that round, the adversary knows which action was taken. And the adversary has to assign his losses before the learner selects an action in that particular round, so he cannot manipulate them afterwards. So, revisiting the setting: in each round the learner picks a distribution P_t and the adversary picks a loss vector l_t. The adversary may have observed the actual actions selected by the learner in the past, and based on that he may have some information about how the learner is going to come up with his strategy.
But suppose the learner is using a deterministic strategy. Then the adversary can see what the next action of the player is going to be and give a very high loss to that action. Given that in the t-th round we have made the learner pick his actions according to some distribution, the adversary no longer has that capability. Because you are choosing your actions according to a distribution, he cannot selectively give some actions very high losses and others low losses, because he simply does not know which action you are going to select; you are randomizing. But once you select an action, he knows that in that round you selected it, and in the next round he may get some sense of what your strategy is and set his losses accordingly. Still, in the next round you again randomize your actions, so he does not know precisely which action you are going to pick and cannot penalize that action directly. So the adversary can also update the way he picks his losses in every round; we are not assuming anything about the way he assigns losses. We can therefore consider the worst case: the adversary chooses his losses so as to make you incur the maximum loss. Based on this, we are going to consider two types of adversaries, called oblivious and non-oblivious. You will come across these terms in the literature when you read about the adversarial setting. Oblivious means that the adversary is not adapting his losses by observing what the learner is doing. For example, the adversary can decide the loss vectors for all the rounds before the game starts: "this is how I am going to assign the losses in each round," without knowing what the learner is doing.
So, he comes up with his losses, fixes them, and you play against that fixed sequence. If this is the case, we say the adversary is oblivious. The learner still does not know the sequence; it is just that the adversary committed to it before the start of the game. In the non-oblivious case, the adversary can decide in each round what loss he is going to assign to the actions, based on his past observations. So you see the difference between the oblivious and non-oblivious adversary. Which is the more difficult adversary? Non-oblivious, right, because in each round he can see what you are doing and try to inflict maximum damage on you. But does the analysis we did so far hold for both cases, or only for one of them? We did not assume anything about the strategy of the adversary; he may even be updating his losses based on the observations made so far. So our analysis still holds; that is why we had two expectations, one of which is over the randomness of the adversary. In each round, based on his observations, the adversary can come up with a new loss; he may be updating his own distribution in every round. Seeing which actions you select, he may update his strategy every round. One possibility is that he has his own P_t, saying "I am going to assign high losses to these actions and low losses to those," and he updates it accordingly each round. So the analysis we have done holds for the non-oblivious adversary, and of course it also holds for the oblivious one, because the oblivious case is simpler.
So, in the oblivious case I really do not need the expectation over the strategy of the adversary, because the adversary has already picked a sequence and I am facing that fixed sequence; whereas in the non-oblivious case the adversary can keep updating his strategy, there can be randomness in the way he assigns the losses, and that randomness can keep changing from one round to another. Recall how we defined the regret of a policy pi, and that we said we were interested in its expected value, which we called the pseudo-regret. There we were guaranteeing the regret in expectation; but when you actually go through one particular sequence of losses, the realized regret is your true incurred loss. So suppose that, instead of a bound on the expected regret, I want a bound on the regret itself, where you do not take the expectation over the randomness of the adversary and the randomness of the learner. When you run the algorithm for n rounds, you incur some regret based on the sequence of losses you observed; but the bound we gave in expectation need not hold on that particular realization, because the expected value can differ from the value on a single realization. So it may happen that in expectation the bound is small, but on one particular realization the regret is pretty large. It is like a random variable X and its expectation: say X takes 100 values; when you draw one sample from X, the value you get could be much higher than the expected value.
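The two notions of regret referred to above (written on the board in the lecture) can be sketched as follows; the symbols l_{t,i} for the losses, A_t for the chosen action, and K for the number of actions are my notation, not necessarily the course's:

```latex
% Realized (sample) regret of a policy over n rounds:
R_n \;=\; \sum_{t=1}^{n} \ell_{t,A_t} \;-\; \min_{i \in [K]} \sum_{t=1}^{n} \ell_{t,i}

% Pseudo-regret: taking expectations over the learner's
% (and, if non-oblivious, the adversary's) randomness:
\bar{R}_n \;=\; \mathbb{E}\!\left[\sum_{t=1}^{n} \ell_{t,A_t}\right]
\;-\; \min_{i \in [K]} \mathbb{E}\!\left[\sum_{t=1}^{n} \ell_{t,i}\right]
```

The realized regret R_n is a random variable; the pseudo-regret is a number, which is why a bound on it says nothing directly about any single run.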
So, in this case we may want a guarantee on the sample itself: when you draw one sample, what is the probability that it is actually small, or that the value you got satisfies the same bound you gave on the expectation? Basically I want to ask: with what probability is the realized value smaller than the bound you have on the expectation? If you have a bound on the expectation that also holds on the sample X with that probability, you are happy. The same thing here: if you observe one sample of the regret, how does it concentrate around its mean value? Can we say something about that? When we do this, we are going to consider only the oblivious case, where the loss sequence is fixed by the adversary a priori, and the only randomness left is the learner's. Can I say anything about what bound this realized regret satisfies? You may be interested in that. So let us see if we can give a bound for this quantity rather than for its expected value. Quickly, one thing: when you want a bound on the sample itself rather than on the expectation, you want to make sure the estimates you are using are unbiased, and what is the other property you would like? When you have an estimator you look at two properties: it should be unbiased, and its variance should be small; if the variance is too large, the estimate fluctuates too much. So let us now quickly see what the variance of the estimates in EXP3 is. We have estimates of the form l-hat_{t,i} = l_{t,i} * 1{A_t = i} / P_{t,i}.
We know that in expectation this equals l_{t,i}, but what about its variance? You can verify that the variance is of the order 1/P_{t,i}: the term gets squared, a P_{t,i} appears from the expectation, and the computation comes out to be of the order 1/P_{t,i}. So if P_{t,i} happens to be small in a particular round, the variance can be very large. When the variance is very large, your estimates are not good enough, and because of that you cannot expect the values to start concentrating well, or soon. So even though this kind of estimator in the EXP3 algorithm gave us a good bound on the expected regret, if you look at the realized regret directly, we may end up with a very bad regret, because with some probability this quantity can be very high; every time we draw a sample it need not be smaller than its expectation, it may also be larger. So if at all I want to give a good bound in probability, I need to make sure my estimators have small variance. How do we get a small variance; do you see any easy way? When is the variance bad? When P_{t,i} is very close to 0, that is, when some action is selected with very small probability; then I am not observing it enough times. One possibility is: if you can somehow increase all of these probabilities by some factor gamma, then even if P_{t,i} would otherwise be arbitrarily small, this estimate will not be very large; it is going to be bounded, since the probability is at least of order gamma. One way to ensure this is to do some explicit exploration. Let me write that now; this algorithm is called EXP3.P.
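To make the variance point concrete, here is a small Monte Carlo sketch of the importance-weighted estimator l_i * 1{A = i} / p_i. Everything here, including the function name, is illustrative, not the course's code; it just checks numerically that the estimator is unbiased while its variance blows up for arms with small selection probability:

```python
import random

def iw_estimate(losses, probs, n_samples=200000, seed=0):
    # Monte Carlo check of the importance-weighted loss estimator
    # lhat_i = l_i * 1{A = i} / p_i (the EXP3-style estimate).
    rng = random.Random(seed)
    k = len(losses)
    sums = [0.0] * k
    sq = [0.0] * k
    for _ in range(n_samples):
        a = rng.choices(range(k), weights=probs)[0]  # sample an arm
        for i in range(k):
            est = losses[i] / probs[i] if a == i else 0.0
            sums[i] += est
            sq[i] += est * est
    means = [s / n_samples for s in sums]
    variances = [sq[i] / n_samples - means[i] ** 2 for i in range(k)]
    return means, variances
```

With two arms of equal loss 0.5 but probabilities 0.9 and 0.1, both estimates average to about 0.5 (unbiased), but the variance of the rarely-played arm is roughly l^2 * (1/p - 1) = 2.25 versus about 0.03 for the frequent one, matching the 1/P_{t,i} order discussed above.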
So, again this algorithm takes eta, gamma, and k as inputs, initializes the estimates with L-tilde_{0,i} = 0, and then for t = 1, 2, 3, ... it does the same thing: compute P_{t,i} for all i, and for all i update the estimators. So what is the difference between EXP3 and EXP3.P? Which step differs? The P_{t,i} step, right: I am adding a factor (1 - gamma) in front, plus a gamma/k term, and gamma is an input with gamma between 0 and 1. So now tell me, what is it that I am changing between EXP3 and EXP3.P? I am assigning a weight of (1 - gamma) to the old probabilities and a weight of gamma to the uniform term. What does this mean? I am going to choose an arm i from a convex combination of these two distributions: the earlier probability and a new one. What does the 1/k mean here? What is the 1/k implying? It is basically a uniform weight. So what this is saying is: I take the uniform distribution with weight gamma, and with weight (1 - gamma) the other distribution I had. Alternatively, the new P_t is (1 - gamma) times the earlier P_t plus gamma times the uniform distribution over the k arms. You understand this? Earlier I was taking only the exponential-weights probabilities; now I perturb them with a uniform distribution, but I weigh the two accordingly. The earlier algorithm assigned the probabilities based only on the losses you had observed. There was already a learning-rate factor that controlled exploration and exploitation, but here you are deliberately bringing in another term that biases the distribution towards uniform exploration.
Suppose you set gamma = 1. Gamma equal to 1 means the first term no longer matters; what remains is only the 1/k, that is, you are selecting from the uniform distribution. If you set gamma = 0, it is exactly the EXP3 algorithm we had earlier, which only looked at the exponential weights; but now we are taking a convex combination of the uniform weights and the probabilities defined through the exponential weights. With this, you see that there is always one constant term gamma/k being added, so P_{t,i} will never be 0 or very close to 0: it will always be at least gamma/k. By doing this I am controlling my variance. Because P_{t,i} is lower bounded by a positive quantity, the estimate cannot blow up. Now the question is: what bound do I get for this algorithm? I am just going to state it, and we are not going through the proof; you have already seen the proof for EXP3, and that was already pretty long. You can go through this proof yourself; the steps are largely the same, except that you take into account the additional gamma/k factor. One more thing I have to add: not only gamma, but there is also a beta that goes as an input to this algorithm; we are basically biasing the estimator with it. So we can demonstrate the following kind of bound for EXP3.P. Now suppose you also give delta as an input; that is another parameter here. So you see that we have several input parameters, and you are going to set beta in terms of delta as follows.
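The mixing step described above can be sketched in a few lines. This is a sketch, not the full EXP3.P algorithm (the bias term beta and the updates are omitted); the exponential-weights form exp(-eta * cumulative estimated loss) is the standard convention and is an assumption on my part about the course's exact notation:

```python
import math

def exp3p_probs(cum_loss_est, eta, gamma):
    # EXP3.P action distribution: weight (1 - gamma) on the exponential
    # weights and gamma on the uniform distribution, so that every
    # probability is at least gamma / k.
    k = len(cum_loss_est)
    w = [math.exp(-eta * L) for L in cum_loss_est]  # exponential weights
    total = sum(w)
    return [(1 - gamma) * wi / total + gamma / k for wi in w]
```

Setting gamma = 1 recovers the uniform distribution, and gamma = 0 recovers plain EXP3, exactly as discussed; for any gamma in between, min_i P_{t,i} >= gamma/k, which is what caps the variance of the loss estimates.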
So, maybe what you should do is this: these are the inputs, and you can set beta however you want, but I am saying the following is enough: based on delta you set your beta like this, your eta to be a constant set like this, and your gamma like this. If you do this, you can show that the regret is upper bounded by some constant, 5.15, times sqrt(n k log(k/delta)), and this bound holds with probability 1 - delta. Because we are bounding the sample regret, not the expected regret, we can only say it holds with some probability, and here that probability is 1 - delta. Suppose you want this to hold with arbitrary precision; that means you want to set delta very small. If you set delta very small, beta has to be large, and you also see that the bound itself becomes larger, because the bound is a function of delta. So if you want arbitrary precision, you have to expect the bound to be worse, larger: if you want very high confidence, one can only guarantee a larger bound. Similarly, here I wanted you to give delta as input. Suppose you do not give delta as input. If delta is not available, I am going to set beta differently, and if I stop after n rounds I claim that the regret satisfies a similar bound, which again holds with probability 1 - delta, for any delta. If I had known delta a priori I would have tuned the parameters with it and guaranteed the tighter bound; if I do not know delta, I can still give a guarantee with probability 1 - delta, but only with this weaker bound. You see that, compared to the tuned case, I have ended up with an extra factor.
So, this extra factor is the penalty I incur if I do not know delta up front. Any doubts? Do you understand the difference between an expected regret bound and a high-probability regret bound? Which one is desired: expected regret bounds or high-probability regret bounds? These are called high-probability bounds because the regret guarantee holds with high probability, and the precision you want is given through this delta parameter. Any questions or confusion about the difference between high-probability bounds and expected bounds? Which one would you prefer for an algorithm? Say you want somebody to give you an algorithm for your problem: would you be happy with an expected guarantee, or with a high-probability guarantee? You would like the high-probability one, right? Why? Because it is about the single run you actually face, not about what would have happened if you kept running it many times, that is, in expectation. So these are desired, but they are also delicate, in the sense that you want the sample regret you actually observe to be bounded. It may happen that the bound holds with probability 1 - delta, but with the remaining probability delta it fails, and in that event the regret could be very large; that is often the case. So, to ensure that with the stated probability the regret is well behaved, it requires a bit more effort, tweaking your algorithm in a certain fashion so that you control your variance. I will just take two or three more minutes to introduce one more algorithm, which you are going to see in the assignment. By the way, if you have a high-probability regret bound like this one, you can always convert it to an expected bound. Do you know how? Let me write this here.
Suppose X is a nonnegative random variable, and suppose that for every delta in (0, 1] you can say P(X > log(1/delta)) <= delta. Setting t = log(1/delta), this is the same as saying P(X > t) <= e^(-t) for all t >= 0. Then, using the identity E[X] = integral over t of P(X > t) for nonnegative X, you get E[X] <= integral of e^(-t), which equals 1. So if you can tell, for some random variable X and every delta, the probability that X exceeds log(1/delta), then you can translate the high-probability bound into a bound on its expectation. You understand this? Let us quickly see how to use this result in our case. Here my random quantity is the regret. The bound says that with probability 1 - delta the regret is at most 5.15 sqrt(n k log(k/delta)); that means the opposite inequality, the regret exceeding this bound, holds with probability at most delta. If I rearrange and take the bound to the other side, the event becomes a certain normalized quantity exceeding log(1/delta), and that happens with probability at most delta: I said the relation holds with probability at least 1 - delta, so its negation holds with probability at most delta. So if I define that normalized quantity as X, I know that P(X > log(1/delta)) <= delta for every delta, and by the proposition above, this quantity is upper bounded by 1 in expectation.
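The key identity used above, E[X] = integral of P(X > t) dt for nonnegative X, can be checked numerically. This is an illustrative sketch (the function name and discretization are my own): it approximates the tail integral by a Riemann sum and confirms that a tail of e^(-t), which is exactly what P(X > log(1/delta)) <= delta for all delta means, gives an expectation of at most 1:

```python
import math

def expected_from_tail(tail_prob, grid=20000, t_max=50.0):
    # E[X] = integral_0^inf P(X > t) dt for a nonnegative X,
    # approximated by a simple left Riemann sum on [0, t_max].
    dt = t_max / grid
    return sum(tail_prob(i * dt) * dt for i in range(grid))

# If P(X > t) <= e^{-t}, i.e. P(X > log(1/delta)) <= delta for all delta,
# then the tail integral, and hence E[X], is at most about 1:
bound = expected_from_tail(lambda t: math.exp(-t))
```

The computed value is about 1, matching the integral of e^(-t) from 0 to infinity; any random variable whose tail sits below e^(-t) therefore has expectation at most 1.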
So, the expected value of that normalized quantity, built from the regret of EXP3.P minus the 5.15 sqrt(n k log k)-type bound, is at most 1, and from that you can translate it into an expected regret bound for EXP3.P. So through this relation you end up with an expected regret bound for EXP3.P. A homework for you: go and compare the expected regret bound you get this way for EXP3.P with the bound you got for EXP3; you have a regret bound there also, so just compare. One last remark: there is one more algorithm you have to implement, which in the book is called EXP-IX. What that algorithm does is, instead of adding gamma/k to the probabilities here, it adds the gamma explicitly in the denominator of the loss estimator; that is the only difference. Earlier, by adding gamma/k you had explicit exploration: you were forcing uniform exploration. Now, by adding gamma here, you are not forcing that explicit exploration, but you are still ensuring that the denominator is at least gamma even when P_{t,i} is 0. By this you can see that I do not have explicit exploration; some kind of implicit exploration happens through this addition, and you will see in your assignment, for the cases we have specified, that this kind of algorithm actually performs better than your EXP3.P and EXP3 algorithms. So that is it; I will not mention any regret bound for it, but you understood what the EXP-IX algorithm is. For that you also need to give gamma as input.
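The one-line difference between the two estimators can be sketched as follows; the function names and the exact placement of gamma are illustrative, and the book's notation may differ slightly:

```python
def exp3_estimate(loss, was_chosen, p_i):
    # Plain importance-weighted estimate used by EXP3 / EXP3.P:
    # unbiased, but can be as large as loss / p_i when p_i is tiny.
    return loss / p_i if was_chosen else 0.0

def exp_ix_estimate(loss, was_chosen, p_i, gamma):
    # EXP-IX estimate: gamma in the denominator caps the estimate at
    # loss / gamma even as p_i -> 0 (slightly biased downward, but with
    # much smaller variance -- the "implicit exploration" idea).
    return loss / (p_i + gamma) if was_chosen else 0.0
```

Note the trade-off: EXP-IX gives up exact unbiasedness (the estimate is always a bit smaller than the plain one) in exchange for a hard cap of loss/gamma on its magnitude, which is what controls the variance without forcing uniform exploration.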