In the last class we started discussing adversarial multi-armed bandits. There was a slight confusion in the last class about the way we are going to define regret, so let us revisit that part; MAB, by the way, stands for multi-armed bandits. We said that given n rounds, a policy pi, and a loss sequence — which you do not know, but which is the sequence you will be faced with — we define the regret, where I_t is random because the learner can randomize his choice of I_t. This is the regret you are going to incur when you have played against a particular sequence generated by the adversary. Of course, you do not know the sequence a priori. But since we are allowing the learner to randomize his selection, we also said that instead of this we will look at the expected regret. When I wrote that expectation, it was with respect to the randomness of the learner — but why should I necessarily worry about one particular sequence that I will be faced with? The adversary may himself be generating this sequence in a random fashion, in which case I want to account for both the randomness with which the sequence is generated and the randomness with which the learner plays his actions. In that case, I said, I will not worry about a particular sequence: the adversary can generate the sequence in an arbitrary fashion, and I will be interested in the expected regret defined accordingly. Notice that now I have allowed the adversary to randomize his sequence as well. Now, what is this expectation with respect to? What are the things I am averaging over?
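The quantities being referred to on the board can be written out explicitly. Using the lecture's notation (losses l_{i,t}, played arm I_t, k arms, n rounds), the per-sequence regret and the expected regret are:

```latex
% Regret against a fixed loss sequence (l_{i,t}), with the learner's
% (possibly randomized) choices I_1, ..., I_n:
R_n \;=\; \sum_{t=1}^{n} l_{I_t,t} \;-\; \min_{i \in \{1,\dots,k\}} \sum_{t=1}^{n} l_{i,t}

% Expected regret, where the expectation is over the learner's internal
% randomization (and, if the adversary also randomizes, over the
% adversary's sequence as well):
\mathbb{E}[R_n] \;=\; \mathbb{E}\!\left[\, \sum_{t=1}^{n} l_{I_t,t} \;-\; \min_{i} \sum_{t=1}^{n} l_{i,t} \,\right]
```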
Whereas the earlier expectation was only over the randomness of the player's strategy, this expectation involves two random quantities. What are those? One is the randomness of the adversary, and the other is the randomness of the learner; so this expectation is over these two sources. Looking into this further, we discussed the benchmark against which we are competing. If I were to compete against the minimum loss in each round — in every round asking which arm had the smallest loss — that would be too demanding a task. So instead we compete against the single best fixed action. Here we also noted an inequality between benchmarks: since the maximum of expectations is smaller than the expectation of the maximum, competing against the single best action with respect to my expected total loss gives a quantity that is smaller than the expected regret. This quantity we called the pseudo regret, and I am going to denote it with a bar. Henceforth we are going to give bounds on this pseudo regret, not the actual expected regret. Today what we are going to look at is whether it is possible to bound this quantity: what is a good algorithm for me in the adversarial multi-armed bandit setting, where in each play I only observe the loss from the arm that was chosen and not from the other arms? In the last class we briefly discussed the notion of importance sampling. What did it do? In this bandit setting you only observe the reward or loss of the action you played, but not of the others.
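Written out, the pseudo regret and its relation to the expected regret (via exchanging the max and the expectation) are:

```latex
% Pseudo regret: compete against the single best fixed arm in
% expectation (expectation over learner and adversary):
\bar{R}_n \;=\; \max_{i \in \{1,\dots,k\}} \mathbb{E}\!\left[ \sum_{t=1}^{n} l_{I_t,t} - \sum_{t=1}^{n} l_{i,t} \right]

% Since  \max_i \mathbb{E}[X_i] \le \mathbb{E}[\max_i X_i],  the pseudo
% regret lower bounds the expected regret:
\bar{R}_n \;\le\; \mathbb{E}\!\left[ \sum_{t=1}^{n} l_{I_t,t} - \min_{i} \sum_{t=1}^{n} l_{i,t} \right] \;=\; \mathbb{E}[R_n]
```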
But we can estimate the losses of the other arms in each round, and we came up with an estimation strategy called importance sampling; we said that this estimation strategy is unbiased. So in each round I am going to estimate the losses of all the arms, and I am going to pretend these estimates are the true losses that I have observed from all the arms. Then I am going to use them to update my weights. Based on that, let us write down this algorithm, which is called Exp3. Why is it called Exp3? Because there are three "exp" terms in its full name — the Exponential-weight algorithm for Exploration and Exploitation — and that is why the name is abbreviated as Exp3. Regarding the benchmark: just as before, if you apply the expectation on both sides and then interchange the expectation and the minimization, you still get this lower bound. So we could consider either of these benchmarks, but the others are difficult to handle; that is why we are going to take this lower bound, the pseudo regret, as our benchmark. Now, what does this algorithm look like? Let me first write it down. As you see, I am switching notation a bit here because I am also switching books, but let us try to be consistent with our notation. If you recall, in the weighted majority algorithm I was using W's for the weights, which we finally converted to probabilities; here I am again working with probabilities, but instead of W I am going to use the notation p. So p_t is the probability distribution over arms after round t - 1. One more notation change I am going to make henceforth: earlier I used x for the loss; for the loss I am now going to switch to small l.
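A minimal sketch of the importance-sampling estimator (the function name and the toy numbers are mine, not from the lecture): if arm I_t is drawn with probabilities p, the estimate l̃_i = (l_i / p_i) · 1{I_t = i} is unbiased, which we can verify numerically by averaging over many draws.

```python
import random

def importance_weighted_estimate(losses, probs, chosen):
    """Estimate the loss of every arm from the single observed loss.

    The played arm's loss is scaled up by 1/p_i; unplayed arms get 0.
    """
    return [losses[i] / probs[i] if i == chosen else 0.0
            for i in range(len(losses))]

# Numerical check of unbiasedness: E[l_tilde_i] = l_i for every arm i.
random.seed(0)
true_losses = [0.2, 0.7, 0.5]
probs = [0.5, 0.3, 0.2]
n_samples = 100_000
sums = [0.0, 0.0, 0.0]
for _ in range(n_samples):
    chosen = random.choices(range(3), weights=probs)[0]
    est = importance_weighted_estimate(true_losses, probs, chosen)
    for i in range(3):
        sums[i] += est[i]
averages = [s / n_samples for s in sums]
print(averages)  # each entry should be close to the corresponding true loss
```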
So, again, the notation change is just to say that I am in a loss setup here: I want to minimize the total cumulative loss I incur. Then you observe the loss l_{I_t,t}. So this is the whole algorithm, and it basically has three steps. First, you play an arm according to your current distribution p_t. Then, once you have played an arm, you update the loss estimates of all the arms. After you update the loss estimates of all the arms, the third step is to update your strategy itself. What is the strategy here? The strategy is to decide with what probability I am going to play each of these arms. For this algorithm to work, you have to tell it how many arms you are dealing with, and you also have to pass a sequence eta_t, defined for every t: for t equal to 1 what is eta_1, for t equal to 2 what is eta_2, and so on. Recall that when we did the weighted majority algorithm for expert predictions, there also we had a parameter eta, and we set it in a specific fashion: roughly the square root of log d divided by n, where d was the number of experts and n the number of rounds. But you will see in the assignment that it is not necessary for eta to be fixed like this. One can take a specific non-constant sequence and try to get a better bound than what we got in the weighted majority algorithm. So there eta was fixed, but eta can be chosen in a better way, and it can change in every round. We are directly bringing in that concept here: a priori we are not going to fix eta to one value.
We are just saying: you pass to this algorithm the sequence of eta_t. Again, recall what eta was doing: in the weighted majority it was telling, in some sense, how much importance we give to exploration versus exploitation, because it was basically controlling how much weight you give to past observations. This sequence will play the same role here. Now, this algorithm starts with p_1 being the uniform distribution — initially we do not know anything, so we put equal likelihood on each of the arms — and then in each round the algorithm keeps updating this distribution p_t. It picks an arm and observes the loss for the arm it picked, but it keeps estimates for all the arms: for all i in [k] it does this estimation. We discussed this last time; it is basically a compact way of saying that for the arm you played, your estimate is l_{i,t}/p_{i,t}, and for the arms which you did not play the estimate l̃_{i,t} is simply 0. I will write a tilde on these quantities to indicate that they are estimates. After that you also keep updating your cumulative loss estimates: for all i you update the total estimated loss of that arm, and using the total estimated loss observed so far you update the probabilities — this is where your eta_t comes into the picture. What the update is basically saying is: take the exponentiated negative cumulative estimated loss of each arm, and divide it by the sum of the same quantity over all k arms. For each arm you take one component and divide it by the sum over all the arms.
It is now easy for you to see that this forms a distribution, because if you add it over all the arms it adds up to 1, and each term is a positive number. So we are saying that I observe the loss only for the arm I played: if you picked I_t and I_t = i happens, that arm gets not simply l_{i,t} but the scaled value l_{i,t}/p_{i,t}. If I observe a loss for some arm, I divide it by p_{i,t} and take that as my estimate; the arms which I did not observe are assigned the value 0. Even though they are assigned 0, we discussed last time that in expectation the estimate equals the true loss of that action. Now, fine, we have an algorithm like this — you can write down any algorithm you want — but what guarantee does this algorithm have? That is what we will now see. Of course, this is one particular strategy, one algorithm, where you have specified the way you come up with your distribution, and to come up with that distribution you have used a particular estimator. Tomorrow you can come up with another distribution, and maybe that will have a different performance. But once we agree to update in this fashion, what performance are we going to get? Also, I said the sequence eta_t is an input, so depending on how I choose the sequence the performance can differ. Let us see what performance I am going to get. Suppose that instead of the whole sequence eta_t I only pass the number of arms k and the number of rounds n. One can set eta_t to be the square root of 2 log k divided by n k, for every t. This is a constant: it does not depend on t. If I am going to choose it this way, let me just call it a constant eta.
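The three steps above (sample from p_t, build importance-weighted estimates, exponentially re-weight) can be sketched as follows. This is my own minimal implementation under the lecture's description; the function names and the synthetic adversary at the bottom are assumptions for illustration, not from the lecture.

```python
import math
import random

def exp3(n_rounds, k, etas, loss_fn, rng):
    """Exp3 with a user-supplied learning-rate sequence etas[t].

    loss_fn(t, i) returns the adversary's loss in [0, 1] for arm i at
    round t; the algorithm only queries it for the arm it plays.
    """
    cum_est = [0.0] * k            # tilde-L_i: cumulative loss estimates
    total_loss = 0.0
    for t in range(n_rounds):
        # Step 1: p_t proportional to exp(-eta_t * tilde-L_i).
        m = min(cum_est)           # subtract the min for numerical stability
        w = [math.exp(-etas[t] * (c - m)) for c in cum_est]
        z = sum(w)
        p = [wi / z for wi in w]
        # Step 2: play I_t ~ p_t and observe only that arm's loss.
        i_t = rng.choices(range(k), weights=p)[0]
        loss = loss_fn(t, i_t)
        total_loss += loss
        # Step 3: importance-weighted estimate; unplayed arms get 0.
        cum_est[i_t] += loss / p[i_t]
    return total_loss

# Toy stochastic adversary: arm 0 is better on average.
rng = random.Random(1)
n, k = 5000, 3
eta = math.sqrt(2 * math.log(k) / (n * k))   # the constant tuning
etas = [eta] * n
means = [0.3, 0.5, 0.5]
loss_fn = lambda t, i: 1.0 if rng.random() < means[i] else 0.0
alg_loss = exp3(n, k, etas, loss_fn, rng)
print(alg_loss / n)  # should drift toward the best arm's mean of 0.3
```

Note that Exp3 never looks at losses of unplayed arms, so it works with exactly the bandit feedback described above.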
If you take eta to be constant like this, then one can show that the pseudo regret of this algorithm is upper bounded by the square root of 2 n k log k. You can do this if you know n, the number of rounds, a priori. But suppose you do not know a priori how many rounds you are going to run: you might stop at some time, and at that point ask what the regret is. Then you do not know n a priori. In this case you set, in every round t, eta_t to be the square root of log k divided by t k: in round t you know the value of t, and you set eta_t accordingly. If you do that, the regret you get is 2 times the square root of n k log k. How much worse is this regret than the first one? By a factor of root 2. So the first regret bound is guaranteed if you tell the algorithm up front how many rounds you are going to run; the second is guaranteed at whatever round n you happen to stop, without knowing n a priori. In the first setup you already knew the horizon — how many rounds you are going to play; in the second you did not. When you do not need to know the horizon and can give a regret bound at any stopping time, such bounds are called anytime bounds. So in this case, if you set eta_t like this without knowing n, we call the algorithm an anytime algorithm. An anytime algorithm is basically saying: I do not know a priori at which round I am going to stop, so I do not have the luxury of setting eta using n; I have to set it without knowing n. If that is the case, we call it an anytime algorithm. This algorithm now does not need to know what n is.
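Collecting the two tunings just discussed (the constant choice requires knowing the horizon n; the anytime choice does not):

```latex
% Known horizon: constant learning rate
\eta_t \;=\; \sqrt{\frac{2\log k}{nk}}
\quad\Longrightarrow\quad
\bar{R}_n \;\le\; \sqrt{2nk\log k}

% Unknown horizon (anytime): decreasing learning rate
\eta_t \;=\; \sqrt{\frac{\log k}{tk}}
\quad\Longrightarrow\quad
\bar{R}_n \;\le\; 2\sqrt{nk\log k}

% The anytime bound is worse only by a factor of 2/\sqrt{2} = \sqrt{2}.
```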
If you do not know n, you can set eta_t like this, and the algorithm becomes an anytime algorithm. So now let us compare the bound we got for Exp3 with the bound we got for weighted majority. What was the weighted majority bound? The square root of 2 n log d. There d was the number of experts, which I can treat as the number of arms; if I take d to be k, the weighted majority bound was the square root of 2 n log k. Compared to this, by what factor is the Exp3 bound worse? By a square root k factor: the Exp3 bound is larger than this quantity by a factor of root k. Now let us compare the amount of information that the weighted majority algorithm had with what my Exp3 algorithm had. In weighted majority, in every round I get to observe the losses of all k arms, whereas Exp3 is working with restricted information: it uses the loss of only one arm in each round. So in terms of the information available, Exp3 gets a 1/k factor of what weighted majority has — is that clear? — whereas in terms of the regret bound it is worse only by a factor of root k. For weighted majority I got the square root of 2 n log k, with k pieces of information in every round. But suppose that in weighted majority, to collect k pieces of information, I had to wait k rounds. In that case, to gather the same amount of information, instead of running it for n rounds I would have to run it for n k rounds. That is exactly what is happening here: n gets replaced by n k.
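This back-of-the-envelope substitution can be written out: replacing the horizon n by nk in the full-information bound reproduces the bandit rate.

```latex
% Full information (weighted majority):  \sqrt{2 n \log k}
% Bandit feedback gives ~1/k of the information per round, so pretend
% the horizon stretches from n to nk:
\sqrt{2\,(nk)\log k} \;=\; \sqrt{k}\cdot\sqrt{2n\log k}
% i.e. the Exp3 bound sits a factor \sqrt{k} above the
% full-information bound.
```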
So, if you are getting information about only one arm instead of k arms in each round, it is like elongating your horizon to n k rounds; that is why n gets replaced by n k there, and the regret bound is worse by a square root k factor. Now, to prove the bounds for this algorithm, what we will show is that the pseudo regret R̄_n is at most log k divided by eta_n, plus k/2 times the sum over t from 1 to n of eta_t. If this holds, you just plug in the values of eta_t: with the constant eta you get the first bound, and when you set eta_t to be the square root of log k divided by t k and stop at some time n — the time at which you want to measure regret — the last term uses eta_n, with t replaced by n, while for the other rounds eta_t changes with t. To put it differently: suppose you run this algorithm and stop at round n, whatever n is, and in each round you use eta_t in this fashion; then you just plug those values into the bound. Now, if the lemma holds, what is the bound in the anytime case? The second term is k/2 times the sum from t equal to 1 to n of the square root of log k divided by t k. The first term, log k divided by eta_n — eta_n being the value in the last round, when you stopped at round n — is nothing but the square root of n k log k, which you can compute. What I want to examine now is the second part: in the square root of log k divided by t k, only t is varying, and it sits in the denominator. So we need to upper bound the sum over t of 1 over root t.
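The key lemma and the plug-in for the anytime tuning, written out (this is the standard Exp3 analysis; the lemma itself is proved next):

```latex
% Key lemma (to be proved):
\bar{R}_n \;\le\; \frac{\log k}{\eta_n} \;+\; \frac{k}{2}\sum_{t=1}^{n}\eta_t

% Anytime tuning \eta_t = \sqrt{\log k/(tk)}:
\frac{\log k}{\eta_n} \;=\; \log k\,\sqrt{\frac{nk}{\log k}} \;=\; \sqrt{nk\log k},
\qquad
\frac{k}{2}\sum_{t=1}^{n}\sqrt{\frac{\log k}{tk}}
\;=\; \frac{1}{2}\sqrt{k\log k}\;\sum_{t=1}^{n}\frac{1}{\sqrt{t}}
```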
Now we are going to use an integral inequality. Does anybody know how we can upper bound this quantity? If instead of summing I integrate — bounding the sum of 1 over root t, for t from 1 to n, by the integral of 1 over root t — that is always an upper bound. And what is this integral? I can take the lower limit to be 0; the integral of 1 over root t from 0 to n is 2 root n. So the sum from t equal to 1 to n of 1 over root t is upper bounded by 2 root n. Now let us use that bound here: the second term becomes k/2 times the square root of log k over k, times 2 root n, and everything is now in the form we want — we have basically gotten rid of the sum. If you simplify, there is a k outside and a root k in the denominator, and you get the square root of n k log k from this term. The first term, log k divided by eta_n, also simplifies to the square root of n k log k. (One correction here: earlier I wrote the anytime eta_t with a 2 inside the square root; it should be the square root of log k divided by t k, without the 2 — with that choice the stray factors of 2 disappear.) So I end up with the square root of n k log k plus another square root of n k log k, and that is exactly 2 times the square root of n k log k. Is that fine?
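A quick numerical sanity check of the integral bound used above, namely that the sum of 1/sqrt(t) from t = 1 to n is at most 2*sqrt(n) (the helper function is mine, for illustration):

```python
import math

def harmonic_sqrt_sum(n):
    """Compute sum_{t=1}^{n} 1/sqrt(t)."""
    return sum(1.0 / math.sqrt(t) for t in range(1, n + 1))

# The sum is dominated by the integral of 1/sqrt(t) from 0 to n,
# which evaluates to 2*sqrt(n).
for n in (1, 10, 1000, 100_000):
    s = harmonic_sqrt_sum(n)
    print(n, round(s, 3), round(2 * math.sqrt(n), 3))
    assert s <= 2 * math.sqrt(n)
```

The bound is also fairly tight: the same sum exceeds the integral from 1 to n+1, which is 2*sqrt(n+1) - 2, so nothing of the right rate is lost by integrating.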
Finally, that is the bound I am going to get. So, fine: we have now shown that if I can bound the pseudo regret by that lemma, then all of these bounds are true. Next, let us try to see why the lemma itself is true. That is going to be a bit involved, and in whatever time remains I will just write down the proof steps. Any doubts about this algorithm? So: understand what an anytime algorithm is, the difference between the full-information setting and the bandit setting, and what kind of factor we can expect in the regret bound when we go from full information to bandit information. The bandit literature studies this spectrum in general. It is not necessary that in every round you get only the loss of the arm you play — you may get something more than that; and it is also not necessary that you get information about all the arms in each round. These are the two extremes: one, getting information about only the arm I play; the other, getting information about all the actions even though I played only one. There could also be something in between, depending on how the actions are related to each other. People study a lot of varieties of this, but right now we are focusing on the two extreme cases, full information and bandit. In fact, for the rest of the course we will focus on the bandit case, and at some point we will touch upon something in between these two.