So, after talking about concentration inequalities, we returned to the stochastic bandit setting. We defined the regret we are interested in for this setting, then defined the pseudo-regret, which we take as our performance measure and on which we want a performance guarantee. We concluded the last class with a regret decomposition result. Now we will move on and try to see what kind of bounds we can give on the pseudo-regret. So, suppose we have a bandit instance: a set of k distributions, one assigned to each arm. When you play an arm a, the sample you observe is a random variable X drawn according to distribution a, and we denote its expected value by mu_a. If we want to minimize the regret, our goal boils down to identifying the arm with the largest mean. And what did we say last time? Through our discussion of concentration inequalities, we addressed how to estimate the parameter mu_a when we have many samples from a particular arm: if we use the sample mean as the estimator, we can bound how far the estimated mean is from the true mean after a certain number of rounds. We specifically focused on the case where the arm distributions are all sub-Gaussian, and for that case we gave concentration bounds. Now let us start thinking about how to estimate these means. But our goal is not just to estimate them; we want to estimate them quickly, so that after some time we stop incurring regret. What could be our strategy?
Now, we are interested in coming up with policies for this. What are the policies? Policies are the algorithms. So, any thoughts on how we will go about this? Say you have been given n rounds, and your goal is to come up with a policy pi that keeps the pseudo-regret small, which in the last class we showed can be written as a sum over the k arms: R_n = sum over i = 1 to k of Delta_i times E[T_i(n)], where T_i(n) is the number of times arm i is played in n rounds, and the gap is defined as Delta_i = mu_1 - mu_i. Also recall from last time: suppose some other arm happened to be optimal; throughout we will assume, without loss of generality, that arm 1 is optimal, that is, the arm with the highest mean is arm 1. We know this, but the algorithm does not; it is just for our analysis. Because arm 1 is the optimal one, Delta_1 is going to be 0, while Delta_2, Delta_3, and so on will be positive. We will also assume for the time being that the optimal arm is unique: if mu_1 is the highest mean, then mu_2 and all the other means are strictly smaller than the mean of arm 1, which is why arm 1 is the optimal arm. So Delta_1 is 0, but Delta_2 and the rest are positive. If I want to minimize my regret, I need to ensure that the expected number of plays of arm 2 and the other non-optimal arms is as small as possible. From now on I will use the terminology that arm 1, since it has the highest mean, is the optimal arm, and all other arms are suboptimal arms. Any thought on how to do this? Just some rough ideas? So, first we will start with something called the explore-then-commit algorithm.
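The regret decomposition above can be sketched numerically. This is a minimal illustration, not part of the lecture: the three arm means and the expected play counts are made-up numbers, chosen only to show how the gaps Delta_i and the expected pulls E[T_i(n)] combine into the pseudo-regret.

```python
# Numeric sketch of the regret decomposition R_n = sum_i Delta_i * E[T_i(n)].
# The means and expected play counts are hypothetical, for illustration only.
mus = [0.9, 0.6, 0.5]                      # arm 1 has the highest mean, so it is optimal
deltas = [max(mus) - mu for mu in mus]     # Delta_1 = 0, all other gaps are positive
expected_plays = [800, 120, 80]            # hypothetical E[T_i(n)] for n = 1000 rounds
pseudo_regret = sum(d * t for d, t in zip(deltas, expected_plays))
print(pseudo_regret)                       # only the suboptimal arms contribute
```

Note that the optimal arm contributes nothing, since its gap is zero; the whole regret comes from how often the suboptimal arms get played.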
So, one obvious thing is to sample each of the arms for a certain number of rounds, compute the sample mean of each arm, and after that just play the one with the highest sample mean. But here the question is: how long should I sample each of them so that I get good estimates for all of them? Once I am sufficiently confident that the estimates are good, maybe I do not need to explore any further; I can just play the one with the highest estimated mean from then on. So that is one natural policy; let us call it explore-then-commit. Now, how does the algorithm look? For this algorithm we need to specify how many rounds it should explore, that is, how many times to sample each arm; that has to go as an input. So the inputs are that number m and, of course, k, the number of arms. What does the algorithm do? It plays arm 1 m times, arm 2 m times, arm 3 m times, and so on: in the first mk rounds it samples each of the k arms exactly m times. After that, it computes the estimates for all of them and commits to the one with the highest empirical mean. Let me write the exploration rule: for t up to mk, play arm A_t = (t mod k) + 1. Do all of you see what this is doing? Suppose t = 1: then 1 mod k is 1 (say k is greater than 1), and with the plus 1 it plays arm 2. When t = 2 it plays arm 3, when t = 3 it plays arm 4, and so on; when t reaches k, t mod k becomes 0, so it plays arm 1, and it continues like that.
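The round-robin rule just described can be written as a one-liner. This is a minimal sketch of the exploration schedule A_t = (t mod k) + 1; the values of k and m are arbitrary, chosen just for the demonstration.

```python
# Exploration schedule of explore-then-commit: A_t = (t mod k) + 1 for t = 1, ..., mk.
def etc_explore_arm(t, k):
    """Arm (1-indexed) to play in round t of the exploration phase."""
    return (t % k) + 1

k, m = 4, 2
schedule = [etc_explore_arm(t, k) for t in range(1, m * k + 1)]
print(schedule)   # round-robin order 2, 3, 4, 1, 2, 3, 4, 1: each arm played exactly m times
```

The schedule confirms what the lecture says: the order starts at arm 2, wraps around to arm 1 when t is a multiple of k, and every arm appears exactly m times in the first mk rounds.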
So it plays the arms in a round-robin fashion: it starts with 2, 3, 4, 5, up to k, then comes back with 1, 2, 3, 4, 5, and so on, for the first mk rounds; then for t greater than mk it commits. Now, what is hat mu_i(t) here? We define hat mu_i(t), the estimate of arm i's mean at round t, as the sum of the samples collected from arm i up to round t divided by the number of plays of arm i up to round t: hat mu_i(t) = (sum over s = 1 to t of X_s 1{A_s = i}) / (sum over s = 1 to t of 1{A_s = i}). What is the denominator doing? It is counting the number of plays of arm i till round t. And the numerator? It is taking the sum of all the samples that came from arm i till round t: the indicator 1{A_s = i} retains the term for round s only if the arm played in round s equals i, so only the samples collected from arm i are retained. Once you have finished mk rounds, you see which arm has the highest empirical mean at that round, and after that you keep playing that same arm; you do not update the estimates any more. That is why we say the algorithm explores for the first mk rounds, and after round mk it commits to the arm with the highest empirical mean up to that round. Now, if I do this, what performance do I get? Let us try to bound T_i(n), the number of rounds arm i is played within n rounds. We know that arm i is played exactly m times in the first mk rounds; that is guaranteed by the round-robin phase. From round mk + 1 onwards you do not know which arm is played; that depends on which one has the highest empirical mean.
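The estimator hat mu_i(t) can be sketched directly from its definition with the indicator. The play sequence and rewards below are hypothetical, just to exercise the formula.

```python
# Sketch of hat{mu}_i(t): sum of rewards from rounds where arm i was played,
# divided by the number of plays of arm i, exactly as in the indicator formula.
def empirical_mean(i, arms_played, rewards, t):
    """hat{mu}_i(t) computed from the history (arms_played[s], rewards[s]), s < t."""
    picked = [x for a, x in zip(arms_played[:t], rewards[:t]) if a == i]
    return sum(picked) / len(picked)

arms_played = [1, 2, 1, 2]          # hypothetical play sequence A_1, ..., A_4
rewards = [0.8, 0.4, 1.0, 0.6]      # hypothetical observed samples X_1, ..., X_4
print(empirical_mean(1, arms_played, rewards, 4))   # (0.8 + 1.0) / 2
print(empirical_mean(2, arms_played, rewards, 4))   # (0.4 + 0.6) / 2
```

The list comprehension plays the role of the indicator 1{A_s = i}: it keeps only the rounds where arm i was the one played.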
So, from round mk + 1 to n, whether you play arm i depends on whether, at the end of round mk, arm i happened to be the empirically best one. Let us write that down: T_i(n) = m + (n - mk) times the indicator 1{hat mu_i(mk) >= max over j not equal to i of hat mu_j(mk)}. I am saying that after round mk I will be playing arm i in rounds mk + 1 to n only if its empirical mean, computed at round mk, happens to be larger than those of all the other arms. At the end of round mk I have computed the empirical means of all the arms, and for t greater than mk I play arm i only if its empirical mean is the largest. Now, m is a constant given as an input to the algorithm, and let us assume n, the number of rounds, is also given. So what is the expected number of plays E[T_i(n)]? It is m + (n - mk) times the probability of that event, since the expectation of an indicator is the probability of the event: E[T_i(n)] = m + (n - mk) P(hat mu_i(mk) >= max over j not equal to i of hat mu_j(mk)). So the problem boils down to bounding this probability: what is the probability that at the end of mk rounds, arm i happens to be the one with the highest empirical mean? Let us focus on this term. For the time being, assume I am interested in an arm i other than the optimal arm, that is, i is not equal to 1. I am interested in a bad event: I have assumed arm 1 is optimal, but what is the probability that some other arm i ends up being empirically best?
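Written out, the counting argument above is (with ties at round mk assumed broken in favour of arm i):

```latex
T_i(n) = m + (n - mk)\,\mathbf{1}\!\left\{\hat{\mu}_i(mk) \ge \max_{j \ne i} \hat{\mu}_j(mk)\right\},
\qquad
\mathbb{E}[T_i(n)] = m + (n - mk)\,\mathbb{P}\!\left(\hat{\mu}_i(mk) \ge \max_{j \ne i} \hat{\mu}_j(mk)\right),
```

where the second equality uses that the expectation of an indicator is the probability of the corresponding event.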
So if arm i, which is not arm 1, happens to have a higher empirical mean than arm 1, that is a bad thing for me; it is going to cause regret. Let us consider that event and see if we can bound its probability. What we have is the event that hat mu_i(mk) is larger than the empirical means of all the other arms; let us compare it to the event that hat mu_i(mk) >= hat mu_1(mk), that is, the empirical mean of the i-th arm is greater than or equal to the empirical mean of arm 1, the best arm. Here I was looking at the max over all the other arms, and now I am replacing that max by just one of the terms inside it, the one for arm 1. So what is the relation between these two probabilities? Which one is going to be larger? The first event says hat mu_i(mk) is greater than the maximum of several terms, and hat mu_1(mk) is one of the terms inside that max. If I retain only one of the terms, I am asking hat mu_i(mk) to be greater than a smaller, or at most equal, quantity. So which event implies which? If the first happens, the second automatically happens: being above the max of a set of terms implies being above any single term in that set. So A implies B, where A is the first event and B is the second; whenever A happens, B happens. The first event is the more stringent one, and because it is more stringent, its probability must be smaller: A implies B is the same as saying A is contained in B, and therefore P(A) <= P(B).
But the other direction is not true: hat mu_i(mk) >= hat mu_1(mk) does not mean that hat mu_i(mk) is greater than or equal to the max over all the other arms; that is not automatically implied, because the max involves further terms that may be larger than hat mu_1(mk). So we have P(hat mu_i(mk) >= max over j not equal to i of hat mu_j(mk)) <= P(hat mu_i(mk) >= hat mu_1(mk)). Now let us manipulate this. What I will do is take hat mu_1(mk) to the other side and bring in the true means by adding and subtracting mu_i and mu_1 on both sides. Starting from hat mu_i(mk) >= hat mu_1(mk), this is the same as (hat mu_i(mk) - mu_i) - (hat mu_1(mk) - mu_1) >= mu_1 - mu_i. Is that correct? I have just done a manipulation: hat mu_1(mk) moved to the left, and minus mu_i plus mu_1 added to both sides. And by definition, what is mu_1 - mu_i for us? It is Delta_i, the gap we defined. So we have P(hat mu_i(mk) >= hat mu_1(mk)) = P((hat mu_i(mk) - mu_i) - (hat mu_1(mk) - mu_1) >= Delta_i). Now, recall what these quantities are: hat mu_i(mk) is nothing but the average of the m samples obtained from arm i.
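Putting the chain of steps above together in one line:

```latex
\mathbb{P}\!\left(\hat{\mu}_i(mk) \ge \max_{j \ne i} \hat{\mu}_j(mk)\right)
\;\le\;
\mathbb{P}\!\left(\hat{\mu}_i(mk) \ge \hat{\mu}_1(mk)\right)
\;=\;
\mathbb{P}\!\left(\bigl(\hat{\mu}_i(mk) - \mu_i\bigr) - \bigl(\hat{\mu}_1(mk) - \mu_1\bigr) \ge \Delta_i\right),
```

where the inequality drops all terms of the max except arm 1, and the equality is the add-and-subtract manipulation with Delta_i = mu_1 - mu_i.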
That is how we computed it: at t = mk, the estimator looks at all the samples drawn from arm i up to round mk, and up to round mk we have exactly m such samples, because arm i has been played exactly m times. So the denominator is m, and this quantity is nothing but the average of m samples drawn from arm i. We know that these samples are i.i.d. and that the mean of arm i is mu_i. Now, we argued last time: if all my arm distributions are sub-Gaussian, what is this average going to be? Sub-Gaussian with what parameter? Suppose all my distributions are 1-sub-Gaussian; then the centered average is (1/sqrt(m))-sub-Gaussian. Why is that? If you take a weighted sum of independent sub-Gaussian random variables with parameters sigma_1, sigma_2, and so on, the result is sub-Gaussian with parameter the square root of the sum of the squared, scaled parameters; in our case every sigma equals 1 and every weight is 1/m. Now let us rewrite it. At t = mk the denominator becomes m, and the numerator is the sum over rounds of X_s 1{A_s = i}: here X_s is the sample you observe in the s-th round, but I am interested only in the rounds where A_s = i. If you ignore all the rounds where A_s is not equal to i and keep only those where A_s = i, you have exactly m samples in the numerator. Relabeling them, the numerator is just the sum over s = 1 to m of X_{i,s}, where X_{i,s} denotes the s-th sample from arm i. All these samples come from arm i, and each of them, once centered, is 1-sub-Gaussian. So we have this average, and we also have mu_i to subtract: take the mu_i inside, and the whole thing becomes the average of the centered samples.
We have already written this as (sum over s = 1 to m of X_{i,s}, minus m mu_i), the whole thing divided by m. Actually, there is a small correction we should make: we did not say that the samples themselves are 1-sub-Gaussian. What we said in the last class is that the distributions are such that when you subtract the mean, that is, when you center the distribution, the centered samples are 1-sub-Gaussian. So when you simplify, you should write hat mu_i(mk) - mu_i = (1/m) sum over s = 1 to m of (X_{i,s} - mu_i), and each term X_{i,s} - mu_i is 1-sub-Gaussian. Now, if each centered sample is 1-sub-Gaussian, what is this average going to be? It is (1/sqrt(m))-sub-Gaussian. Why? We already said the parameter is the square root of the sum of the squared, scaled parameters: we have m terms here, and each contributes (1/m) squared times 1. So it is 1/m^2 plus 1/m^2, all the way up to 1/m^2, m terms in total, which gives m/m^2 = 1/m, and under the square root it becomes 1/sqrt(m). So this entire quantity, hat mu_i(mk) - mu_i, is (1/sqrt(m))-sub-Gaussian. And what about the corresponding quantity for arm 1? All we know is that every distribution, after centering, is 1-sub-Gaussian, even though they have different centering values. So even for the optimal arm, the centered average hat mu_1(mk) - mu_1 is sub-Gaussian with the same parameter, 1/sqrt(m). So now, what about the difference?
So I know that hat mu_i(mk) - mu_i is (1/sqrt(m))-sub-Gaussian, and hat mu_1(mk) - mu_1 is also (1/sqrt(m))-sub-Gaussian. What about their difference? First, what happens with the minus sign? We said that if you scale a sigma-sub-Gaussian random variable by a constant c, whether positive or negative, the result is |c| sigma-sub-Gaussian; here the constant is minus 1, so minus (hat mu_1(mk) - mu_1) is still (1/sqrt(m))-sub-Gaussian. Now we are adding two sub-Gaussian random variables; are they independent here? They are: when I draw a sample from an arm, it is independent of the past pulls of that arm and also independent of the pulls of the other arms. So these two are independent, and then what is the parameter of the sum? It is the square root of 1/m + 1/m, that is, sqrt(2/m). So this entire difference, (hat mu_i(mk) - mu_i) - (hat mu_1(mk) - mu_1), is sqrt(2/m)-sub-Gaussian. Now, I already know a result that we showed last time and wrote as a lemma; can I apply it and bound this probability? Recall what we said: if X is sigma-sub-Gaussian, then P(X >= epsilon) <= exp(-epsilon^2 / (2 sigma^2)), where sigma is the sub-Gaussian parameter of X. Here X is this entire difference, so replace sigma by sqrt(2/m), and your epsilon is Delta_i. What do you get if you do that?
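Collecting the sub-Gaussian bookkeeping above in one place:

```latex
% centered sample mean of arm i: average of m independent 1-sub-Gaussian terms
\hat{\mu}_i(mk) - \mu_i = \frac{1}{m}\sum_{s=1}^{m}\bigl(X_{i,s} - \mu_i\bigr)
\quad\text{is}\quad \sqrt{m \cdot \tfrac{1}{m^2}} = \tfrac{1}{\sqrt{m}}\text{-sub-Gaussian};
% the two centered means are independent, so the parameters combine in quadrature
\bigl(\hat{\mu}_i(mk) - \mu_i\bigr) - \bigl(\hat{\mu}_1(mk) - \mu_1\bigr)
\quad\text{is}\quad \sqrt{\tfrac{1}{m} + \tfrac{1}{m}} = \sqrt{\tfrac{2}{m}}\text{-sub-Gaussian};
% tail lemma P(X >= eps) <= exp(-eps^2 / (2 sigma^2)) with sigma^2 = 2/m, eps = Delta_i
\mathbb{P}\Bigl(\bigl(\hat{\mu}_i(mk) - \mu_i\bigr) - \bigl(\hat{\mu}_1(mk) - \mu_1\bigr) \ge \Delta_i\Bigr)
\le \exp\!\left(-\frac{m\,\Delta_i^2}{4}\right).
```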
So, if you just apply that, this probability is upper bounded by exp(-Delta_i^2 / (2 sigma^2)) with sigma = sqrt(2/m): squaring gives sigma^2 = 2/m, so the denominator 2 sigma^2 becomes 4/m, and the bound is exp(-m Delta_i^2 / 4). So what we have finally shown, going back to the expected number of plays (I will just erase this part and rewrite it there), is E[T_i(n)] <= m + (n - mk) exp(-m Delta_i^2 / 4). Now that we have our bound on the expected pulls of an arm i, we can go and use our regret decomposition result to get a bound on the regret. What is that? The pseudo-regret is R_n = sum over i of Delta_i E[T_i(n)]. Just substitute the bound, and this quantity is upper bounded as follows: R_n <= m times the sum over i = 1 to k of Delta_i, plus (n - mk) times the sum over i = 1 to k of Delta_i exp(-m Delta_i^2 / 4). Is this correct? I have just substituted the bound on the expected number of pulls into this expression.
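As a small sanity check on the final expression, here is a sketch that evaluates the bound R_n <= m sum_i Delta_i + (n - mk) sum_i Delta_i exp(-m Delta_i^2 / 4) for a made-up instance; the gaps and the horizon n are hypothetical, and the loop just shows how the choice of m trades the exploration cost against the risk of committing to the wrong arm.

```python
import math

# Evaluate the explore-then-commit pseudo-regret bound derived above
# (1-sub-Gaussian arms): R_n <= m*sum_i Delta_i + (n - m*k)*sum_i Delta_i*exp(-m*Delta_i^2/4)
def etc_regret_bound(m, n, deltas):
    k = len(deltas)
    explore_cost = m * sum(deltas)                       # paid during the first mk rounds
    commit_risk = (n - m * k) * sum(d * math.exp(-m * d * d / 4) for d in deltas)
    return explore_cost + commit_risk

deltas = [0.0, 0.3, 0.5]   # hypothetical gaps; Delta_1 = 0 for the optimal arm
n = 10_000
for m in (10, 100, 500):
    print(m, etc_regret_bound(m, n, deltas))
```

Running this shows the trade-off explicitly: too small an m leaves a large probability of committing to a suboptimal arm, while too large an m pays the exploration cost m sum_i Delta_i for nothing.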