We are going to start looking at another setup in multi-armed bandits today, called pure exploration. Here we do not care about the regret, or about the reward accumulated up to some point; the only thing I worry about is whether the arm finally handed to me is the optimal one or not. Let us consider an example. You all know that drugs are first tested in a laboratory on animals, say on mice. The experimenters have a set of candidate drugs and want to identify, at the end, the drug that is most effective. While running these trials they do not worry about how many mice are lost; what matters is whether the best drug is identified, so that when it is applied to humans the risk is minimal. So during the exploration phase I am not worried about how many mice I save; any experiment is fine as long as it helps me identify the best drug, and I have full freedom. Similarly, say you want to identify the best restaurant in the area around Hawaii. If you do not care about the money you are going to spend, you just go and eat at every place, and at the end, when you have to recommend a restaurant to a friend, or your parents have come and you have to take them out, you say: let us go to this one, I have already explored. While exploring you did not care whether each meal was the best; you were only trying to identify the best one.
In this setup you are not worried about the regret, that is, the cumulative reward or loss you incur along the way. Such problems are usually called pure exploration problems, and within them there are different versions; we are going to start with the notion of simple regret. Let us recall the standard stochastic multi-armed bandit setting. There we evaluated performance through the regret incurred: for an algorithm run on an environment nu, the regret is R_T = E[ sum_{t=1}^{T} (mu* - X_t) ], where X_t is the reward sample obtained when arm I_t was played in round t; before taking the expectation this is still a random quantity, because the arm played depends on all past observations. Fix a particular round t: the term mu* - X_t can be treated as the loss, the instantaneous regret, incurred in round t, and the regret is the sum of the instantaneous regrets over the rounds. Now, instead of worrying about the instantaneous regrets and how they accumulate over time, suppose I am only worried about the arm I_{T+1}, the arm recommended by the algorithm after the T-th round. Whatever you wanted to do in the first T rounds, you did it; now you only care about the expected value of mu* - mu_{I_{T+1}}. This quantity we are going to call the simple regret.
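To make the two notions concrete, here is a tiny sketch in Python; the arm means, the play sequence, and the recommended arm are all made up for illustration, and rewards are taken to be unit-variance Gaussians (hence 1-sub-Gaussian):

```python
import random

rng = random.Random(0)
mus = [0.2, 0.8, 0.5]          # hypothetical arm means; mu* = 0.8
mu_star = max(mus)

plays = [0, 1, 2, 1, 1, 0, 2, 1]                   # arms I_t played in rounds t = 1..T
rewards = [rng.gauss(mus[i], 1.0) for i in plays]  # reward samples X_t

# cumulative (expected) regret: sum over rounds of mu* - mu_{I_t}
cumulative_regret = sum(mu_star - mus[i] for i in plays)

# simple regret: depends only on the arm recommended after round T
recommended = 1                                    # say I_{T+1} = 1
simple_regret = mu_star - mus[recommended]
```

The cumulative regret charges every round played, while the simple regret charges only the final recommendation.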
Whatever I got in the first T rounds, I do not care; my performance is now defined by what I would have got in the (T+1)-th round, and my goal is to minimize that quantity. So, does anyone have a simple algorithm to minimize the simple regret? The simple way: recommend the arm with the highest empirical mean. That is essentially what we did in the explore-then-commit algorithm, except that there we also worried about the cumulative regret; let us apply the same idea here and evaluate its simple regret. Call this algorithm uniform exploration: in the first T rounds it spreads the plays equally across the arms, and then it recommends the empirically best arm, where mu_i_hat(T) is the empirical mean estimate of arm i after T rounds and K is the number of arms. So what will the simple regret of this simple algorithm be, and how do we bound it? The bound we are going to get is very similar to the one for explore-then-commit. Denote by Delta_i = mu* - mu_i the gap of arm i, where mu* is the mean of the optimal arm and mu_i is the mean of the i-th arm. I am going to assume that arm 1 is the optimal arm; this is without loss of generality, and the algorithm does not know it, it is only for the analysis. Now suppose the recommendation happens to be I_{T+1} = i.
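A minimal sketch of this uniform exploration rule, under the same assumptions (unit-variance Gaussian rewards, so 1-sub-Gaussian arms); the means below are made up for the example:

```python
import random

def uniform_exploration(means, T, seed=0):
    """Play each of the K arms floor(T / K) times, then recommend the
    arm with the highest empirical mean."""
    rng = random.Random(seed)
    K = len(means)
    n = T // K                                   # samples per arm
    empirical = []
    for mu in means:
        samples = [rng.gauss(mu, 1.0) for _ in range(n)]
        empirical.append(sum(samples) / n)
    return max(range(K), key=lambda i: empirical[i])   # recommendation I_{T+1}

means = [0.9, 0.5, 0.4, 0.1]                     # arm 0 is optimal here
best = uniform_exploration(means, T=4000)
simple_regret = max(means) - means[best]         # Delta of the recommended arm
```

With 1000 samples per arm and a smallest gap of 0.4, the recommendation is the true best arm except with vanishingly small probability.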
If arm i was chosen, then its empirical mean must have been at least the empirical mean of the optimal arm, and we want to bound the probability of that event. By the way, how many times is each arm sampled? Floor of T/K; let us say T/K is an integer for the time being. The event is mu_i_hat(T) >= mu_1_hat(T), which I can rewrite as mu_i_hat(T) - mu_1_hat(T) + Delta_i >= Delta_i. Here mu_i_hat(T) is the average of T/K samples of arm i, and mu_1_hat(T) is the average of T/K samples of arm 1; I am not writing the indices, but each sum is understood to run over T/K samples. The samples X_i of arm i have mean mu_i, the samples X_1 of arm 1 have mean mu_1, and the difference of the means is mu_1 - mu_i = Delta_i; that is exactly what I added on both sides. The expectation of mu_i_hat(T) - mu_1_hat(T) is -Delta_i, so after adding Delta_i, is the left-hand side a zero-mean random variable? Yes, it is. Let me make one more assumption: each arm has a sub-Gaussian distribution, specifically 1-sub-Gaussian, so the parameter sigma^2 is 1, and the arms are independent. Then what does the distribution of a single difference X_i - X_1 look like? It is sqrt(2)-sub-Gaussian, that is, sub-Gaussian with parameter sigma^2 = 2.
Now apply the tail property we have for sub-Gaussian averages to P( mu_i_hat(T) - mu_1_hat(T) + Delta_i >= Delta_i ). In the denominator of the exponent we usually have 2 sigma^2; here sigma^2 is 2, so the denominator becomes 4, and the bound is exp( - floor(T/K) Delta_i^2 / 4 ). Notice that when you do the averaging, the number of samples is fixed, no longer random: after T rounds each arm has exactly floor(T/K) samples, a deterministic quantity, which is why the result for sub-Gaussian tails applies. Once I have this, I can bound the simple regret. The simple regret is r_T = sum_i Delta_i * P( I_{T+1} = i ), and P( I_{T+1} = i ) is upper bounded by exactly the quantity above, because the event I_{T+1} = i implies the condition we just bounded. So r_T <= sum_i Delta_i exp( - floor(T/K) Delta_i^2 / 4 ). Now, to tighten this bound, let delta >= 0 be some quantity of your choosing, and split the sum into two parts: the sum over all i with Delta_i <= delta, plus the sum over all i with Delta_i > delta. Picture the gaps laid out on a line: Delta_1 = 0, then somewhere Delta_2, and so on, up to Delta_K.
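We can sanity-check this tail bound numerically; a small Monte Carlo sketch with made-up values n = 40 and Delta_i = 0.5, using two independent unit-variance Gaussian arms (the empirical misidentification rate should land below the bound):

```python
import math
import random

def misid_prob(delta_i, n, trials=5000, seed=1):
    """Monte Carlo estimate of P( mu_i_hat >= mu_1_hat ) for two
    independent unit-variance Gaussian arms with mean gap delta_i,
    each empirical mean averaged over n samples."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        mu1_hat = sum(rng.gauss(delta_i, 1.0) for _ in range(n)) / n  # optimal arm
        mui_hat = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n      # suboptimal arm
        bad += mui_hat >= mu1_hat
    return bad / trials

n, delta_i = 40, 0.5
estimate = misid_prob(delta_i, n)
bound = math.exp(-n * delta_i ** 2 / 4)          # exp(-floor(T/K) * Delta_i^2 / 4)
```

Here the bound is exp(-2.5), roughly 0.08, while the true probability is closer to 0.01, so the bound is loose but valid.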
I have chosen some delta, and that is why I could split the sum; it may happen that delta lies beyond the largest gap, in which case the second sum simply does not arise and there is only one part, but delta is arbitrary, so that is fine. In the first sum, each Delta_i is upper bounded by delta; pulling delta out, what remains is a sum of probabilities over the arms satisfying the condition, and a sum of probabilities is at most one, so the first part is at most delta. The second sum I cannot bound the same way, because there delta is a lower bound on Delta_i; instead I use it in the exponent. Since Delta_i > delta, we have exp( - floor(T/K) Delta_i^2 / 4 ) <= exp( - floor(T/K) delta^2 / 4 ), so replacing Delta_i by delta in the exponent still gives an upper bound. Altogether, r_T <= delta + sum_{i: Delta_i > delta} Delta_i exp( - floor(T/K) delta^2 / 4 ). The way I chose delta was arbitrary, just some delta >= 0; so if I take the minimum over all delta >= 0, the bound is still valid. This is what we are going to state as a result, which we have just proved. I will use the notation E_SG^K(1) for the class of K-armed environments in which each arm's distribution is sub-Gaussian with parameter 1; when I say nu belongs to this class, each arm of nu is 1-sub-Gaussian, and K denotes the number of arms. For all T >= K, what we have just shown is that the simple regret of uniform exploration satisfies r_T <= min_{delta >= 0} [ delta + sum_{i: Delta_i > delta} Delta_i exp( - floor(T/K) delta^2 / 4 ) ]. Now, what does this bound tell us?
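The resulting bound can be evaluated by brute force; a sketch that minimizes over a grid of delta values, with made-up gaps (the point is only that the bound shrinks as the per-arm budget n = floor(T/K) grows):

```python
import math

def simple_regret_bound(gaps, n, grid=1000):
    """Evaluate min over delta >= 0 of
       delta + sum_{i: Delta_i > delta} Delta_i * exp(-n * delta^2 / 4)
    by searching delta over a grid up to max(gaps)."""
    dmax = max(gaps)
    best = float("inf")
    for j in range(grid + 1):
        delta = dmax * j / grid
        tail = sum(d for d in gaps if d > delta) * math.exp(-n * delta ** 2 / 4)
        best = min(best, delta + tail)
    return best

gaps = [0.0, 0.3, 0.5, 0.9]      # Delta_1 = 0 for the optimal arm
b_small = simple_regret_bound(gaps, n=10)
b_large = simple_regret_bound(gaps, n=1000)
```

With only 10 samples per arm the bound is close to the trivial one (the largest gap), while with 1000 samples it drops well below it.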
Look at how T enters the picture: the number of rounds appears with a negative sign inside the exponent, so if I increase T, the exponential term vanishes fast. In that sense uniform exploration is asymptotically optimal: as T goes to infinity the exponential term vanishes, taking the minimum over delta >= 0 drives the upper bound to 0, and since the simple regret is a non-negative quantity, it must go to 0. So this simple algorithm gives a simple regret that falls exponentially fast in the number of rounds. A student brings up the central limit theorem: yes, the claim is that as T goes to infinity I also get the right means, so I should not be making any mistake, and the bound is saying the same thing. But note that when I said the simple regret goes to 0 exponentially fast, that is for a fixed environment as the number of rounds increases. Now suppose instead I fix the number of rounds T and keep changing the environment: I give you a fixed T and ask you to apply the algorithm on different environments, so the gaps Delta_i change. What is the bound then? Notice the tension: as delta increases, the exponential factor falls fast, but delta also appears additively, and the gaps appear as coefficients, which push the bound up. So let us fix T and try to get a problem-independent bound: irrespective of what the gaps are, what simple regret do we get?
One thing you can do is choose delta in a particular fashion; recall that delta was arbitrary, only required to be >= 0. Write n = floor(T/K) and choose delta = sqrt( 8 log K / n ). Can you quickly plug this delta into the bound and see what value you get? Squaring, delta^2 = 8 log K / n, so the exponent is n * delta^2 / 4; the n cancels, the 8 against the 4 leaves a factor of 2, and we get exp( -2 log K ) = 1/K^2. Let me just cross-verify the calculation: n * (8 log K / n) / 4 = 2 log K, yes. So the second sum becomes (1/K) * (1/K) * sum_{i: Delta_i > delta} Delta_i, and there are at most K terms in that sum. And what are these Delta_i? We do not know their values, but each one is a fixed constant: the difference between two means. We do not know the mu_i, but we know they are finite quantities, at least not infinite, so the gaps Delta_i, being differences of the means, are bounded as well.
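We can check the arithmetic of this choice in a couple of lines (K and T below are arbitrary example values):

```python
import math

K, T = 10, 1000
n = T // K                                   # floor(T / K) samples per arm
delta = math.sqrt(8 * math.log(K) / n)       # the chosen delta

# n * delta^2 / 4 = n * (8 log K / n) / 4 = 2 log K,
# so the exponential factor is exactly exp(-2 log K) = 1 / K^2
factor = math.exp(-n * delta ** 2 / 4)
```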
So if you assume all the means are finite, the residual sum is just bounded by a constant. For instance, if I assume the support is in [0, 1], all the gaps are bounded by 1 and that is fine. But when I started, I only said the environment belongs to a class of sub-Gaussian distributions; I did not specify the support. If the support happens to lie in [0, 1], everything goes through and I can bound the gaps by 1, so the sum is a constant. If not, I can always assume that the means are finite for the entire class; once they are finite, this quantity is bounded in terms of that constant. A student asks: since there are at most K terms, is the sum not just bounded using Delta_max? Yes, but what is Delta_max? If I bound the means, Delta_max is bounded indirectly. The point is that Delta_max looks like a problem-dependent quantity; I want to write the bound with a constant that depends on the entire class, not on a particular distribution. When I am trying to get a problem-independent bound, Delta_max could depend on the particular problem instance, and I do not want an instance-dependent parameter; I want a parameter that depends on the entire class. Of course, you can take the maximum over the Delta_i and call it Delta_max, I agree. But when I say the support is [0, 1], that is not a property of the instance; that is a property of the entire class.
If I do that, then irrespective of the problem instance, the gaps Delta_i always lie in [0, 1], which is why Delta_max is at most 1. In that case, yes, it is less than 1; I am only trying to deal with this part separately, and that is fine. I know that if the support is in [0, 1], the gaps are already between 0 and 1, but initially I did not make that assumption; I said the support can be anywhere. That is my point: when you want a problem-independent bound, let us say the class is such that the means are bounded, in [0, 1] or by whatever value, but bounded. In that case the gaps Delta_i are all bounded, and that gives some universal constant Delta_max that depends only on the class, not on a particular instance. A student points out that the simple regret is anyway trivially upper bounded by Delta_max; that is true, but that is a trivial bound, and what I want is a bound in terms of the number of rounds. Right now, because of the other term, I cannot immediately get that; I think this needs a slightly more refined argument, so let us see if we can argue that bit later. In any case, if we have sub-Gaussian arms and some class-level parameter that controls how large the gaps can be, then I can treat the residual term as a constant, and what we have is a simple regret bound of the order sqrt( log K / floor(T/K) ).
Doing the manipulation, sqrt( log K / floor(T/K) ) is of the order sqrt( K log K / T ): taking the K into the numerator, the delta term reads 2 sqrt( 2 K log K / T ), plus the class-dependent constant term. So what have we done here? We had an upper bound on the simple regret that falls exponentially fast, but that bound is problem dependent: it depends on the particular instance through the gaps. Getting rid of the problem dependence by bounding the gaps with a class-level parameter, we obtain a bound in which the simple regret falls like 1 over sqrt(T). And this constant really is a constant: it does not change with time; I am only describing how the bound behaves with respect to time. So the two statements are: if you fix a problem instance and increase T, the simple regret goes to 0 exponentially fast; but if you look at the worst case over all instances in the class for a given T, there can be an instance on which the simple regret falls only like order sqrt(1/T).
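Putting the pieces together as a function; this is a sketch of the bound's shape, not a tight constant, and delta_max = 1 corresponds to the assumption that the class has means in [0, 1]:

```python
import math

def worst_case_simple_regret(K, T, delta_max=1.0):
    """Problem-independent shape of the bound: the delta term
    2 * sqrt(2 K log K / T), plus the residual term delta_max / K
    coming from the sum of at most K gaps times 1 / K^2."""
    return 2 * math.sqrt(2 * K * math.log(K) / T) + delta_max / K

b_1k = worst_case_simple_regret(K=10, T=1_000)
b_100k = worst_case_simple_regret(K=10, T=100_000)   # first term falls like 1/sqrt(T)
```

Multiplying T by 100 shrinks the T-dependent part of the bound by a factor of 10, as the 1/sqrt(T) rate predicts.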