This is the first exercise problem you are going to do for the assignment. It is Exercise 1 in Chapter 33 of the book by Lattimore and Szepesvári. What this exercise asks you to show is the following: fix the number of rounds n; it says that there exists a distribution ν such that, whatever policy you apply, the simple regret grows at least like c√(k/n), where c is a constant. Here ν belongs to the class of bandit instances whose reward distributions are sub-Gaussian with parameter 1. So it says: there exists a ν in this class such that, irrespective of which policy π you choose, you incur a regret of the order c√(k/T). You have to work this out; it is an exercise, and it comes with a hint: it follows along lines similar to what you did in the adversarial case. There, too, you showed that there exists an adversary that can select a sequence in such a way that your regret grows at least like √T. You are going to show this result along the same lines.

Another thing: if you now look at uniform exploration, is it already optimal? For uniform exploration, what bound did we get? We got something like 2√(k log k / T). The 2 is a constant you can ignore, but if you compare in terms of the parameters k and T, the only extra factor here is the log k; otherwise it is already √(k/T). So uniform exploration is already optimal up to this log k factor. Another exercise in the same chapter asks you to show that, yes, uniform exploration is optimal up to the log k factor, but you cannot avoid that factor: for uniform exploration it will always be there. So you have to show that if you use uniform exploration, the bound will always be of the order √(k log k / T). For uniform exploration you cannot avoid it; maybe there is some other, better policy which achieves this lower bound, but the uniform policy does not have that optimality, it carries this extra log k factor. Both of these you will do for assignment 4.
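To keep these two statements side by side, here is a rough restatement of the bounds being compared (c denotes an absolute constant, and the constant 2 for uniform exploration is the one quoted from the earlier lecture, which you can ignore when comparing the dependence on k and T):

```latex
% Minimax lower bound (the exercise): for every policy \pi there is a
% 1-sub-Gaussian instance \nu such that
\mathbb{E}\bigl[R^{\mathrm{simple}}_T(\pi, \nu)\bigr] \;\ge\; c\,\sqrt{\frac{k}{T}} .

% Uniform exploration (earlier lecture): for every instance \nu,
\mathbb{E}\bigl[R^{\mathrm{simple}}_T(\mathrm{uniform}, \nu)\bigr] \;\le\; 2\,\sqrt{\frac{k\log k}{T}} ,
% which matches the lower bound except for the extra \log k inside the square root.
```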
Now the question is: is there a policy which will achieve this lower bound for the simple regret? The answer turns out to be positive, and if that is the case, then the question we have to answer is: what is that policy? This is also the stochastic case, only instead of cumulative regret we are looking at simple regret. Suppose that, instead of worrying about a special algorithm for this, I just take one of the algorithms I already have for the stochastic multi-armed bandit, say UCB or KL-UCB. I let it do exactly what it would normally do, but whatever arm it selects in the (T+1)-th round, for that round I compute its instantaneous regret. Do you think that if I do this it should give me good performance?

So, let us say I give you a time horizon, a capital T. What you do is you just run your UCB algorithm: whatever it did up to round T, let it do that, and whatever arm it would select in the (T+1)-th round, you give me that as the output, and then I compute the simple regret on that arm. Do you think the performance of that is going to be good or bad? Why is that? But there also it has to be doing some optimization, right? If it is not bothering about optimization in each round, its regret will not be sublinear; it has to also do some optimization, and in every round it is exploring less there compared to uniform exploration. Yes, but in that case most of the time it will be picking the optimal arm, or it will be exploiting. If it is doing a good job, if it is already exploiting, it had better be exploiting the best arm; if it is exploiting some arm, that means it believes that arm is the best one. But it could also be exploring: the way UCB selects, it has both an exploitation term and an exploration term. If you are unlucky, it may happen that in the (T+1)-th round the exploration term becomes dominant and it is exploring some arm. In that case it would not give you good performance, right? But that is not a criticism of UCB; I am just saying you take it plainly, apply it, and output whatever it does.

But think of it another way. Suppose I took UCB, I am given the time T, and I just look at how many times it has played each of the arms up to this point, and then I pick the arm which was played the highest number of times. If I do this, do you expect it to have a better simple regret? So if you just pick the arm which has been played the highest number of times so far, you expect that to already be the best one, or at least the one that gives the smallest simple regret. Then you are saying it depends on T: if T is large enough, then maybe it has explored all of them a sufficiently good amount, it has good confidence in each one of them, and maybe you just pick the one it has been selecting the most number of times. If T is small, it does not have that much confidence, because who knows. Yes, but that is not the goal here; what I want is simply how good the final choice is. See, the rest it can distinguish fine, but for two arms which are very close to each other, if it has not resolved them, finally when it has to give one arm, which one should it give? Fine, so in that case, if you are in a bad situation where two arms are very close, it has not figured out which one of them is better, and so it might be playing them an equal number of times. Because of that you look at the empirical means; but what I want is, in round T+1, whatever it plays, how far is that from the optimal arm; that is the question, whether it is going to be good or not. And it may still be fine: suppose two arms are very close to each other and it has been playing them for quite some time to distinguish which one is better. Now in the (T+1)-th round, if you ask it to pick one, it should be picking one of these two, because it is trying to resolve between these two. And because these two are close, it may happen that even their simple regret is not that large, because their difference is not that much; for the remaining arms it will be all right. So you are saying doing UCB is not bad; in that case we should just take whatever it plays.
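Here is a minimal simulation sketch, not from the lecture, of the two recommendation rules just discussed, assuming a standard UCB1 index and an illustrative Gaussian instance (the arm means, horizon, and seed below are arbitrary choices):

```python
# A minimal sketch: run UCB1 on an assumed Gaussian instance and compare two
# recommendation rules, (a) the arm UCB would pull in round T+1 and
# (b) the arm played most often in the first T rounds.
import numpy as np

def run_ucb(means, T, rng):
    k = len(means)
    counts = np.zeros(k, dtype=int)   # N_i(t): number of pulls of each arm
    sums = np.zeros(k)                # cumulative reward of each arm
    for t in range(T):
        if t < k:
            arm = t                   # play every arm once to initialise
        else:
            # UCB index = exploitation term + exploration term
            index = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
            arm = int(np.argmax(index))
        sums[arm] += rng.normal(means[arm], 1.0)   # 1-sub-Gaussian reward
        counts[arm] += 1
    # (a) the arm UCB would pull in round T+1
    index = sums / counts + np.sqrt(2.0 * np.log(T) / counts)
    next_pull = int(np.argmax(index))
    # (b) the most-played arm
    most_played = int(np.argmax(counts))
    return next_pull, most_played

rng = np.random.default_rng(0)
means = np.array([0.5, 0.45, 0.2, 0.1])   # assumed instance; arm 0 is optimal
gaps = means.max() - means
a, b = run_ucb(means, T=2000, rng=rng)
print("simple regret if we output UCB's next pull:", gaps[a])
print("simple regret if we output the most-played arm:", gaps[b])
```

Either rule can be fooled on a particular run, which is why the distribution-based recommendation discussed next is the cleaner construction.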
So, fine, that is what I am asking: whether you would go for UCB and just take whatever it says in the (T+1)-th round as the recommendation for the simple regret. So it depends on the problem instance, right? You cannot blindly apply it on any problem instance given to you. For simple regret you have to be more careful.

But still, if I want to get a bound, if I want to adapt an algorithm which is designed for regret minimization to a setting where I want to do simple regret minimization, what could I do? One simple thing I could do is this: based on the number of times I have observed each of these arms up to time T, I construct a probability distribution on the arms which is proportional to the number of times I have played each one of them, and then I pick an arm according to this probability distribution. We will see that if there is a good algorithm for the regret minimization problem, and if you do this in the (T+1)-th round, that is, you construct a probability distribution based on how many times your regret minimization algorithm has played each arm up to time T, then even your simple regret will be good.

So let us understand this. First I want to express the simple regret in terms of the cumulative regret. This is a proposition I am going to write. Let π = (π_1, ..., π_T), defined like this, be a policy; it is defined for the first capital-T rounds. Now you define this policy for the (T+1)-th round in the following fashion. What are we going to say? We say that we play the i-th arm, given that you have observed, in our notation, i_1, r_1, ..., i_T, r_T; that is, you played arm i_1 in the first round and observed reward r_1, and in the T-th round you played arm i_T and observed reward r_T. We define this quantity to be the average of the indicator that arm i was played. What is this going to give you? What does this p_{T+1} give you? It gives you a distribution on the arms, and how does this distribution look? It is proportional to the number of times you have played each particular arm. Now, if you do this, and you compute the simple regret of the policy where you run the first T rounds exactly as before and in the (T+1)-th round you do exactly this, then the simple regret of that algorithm is nothing but the cumulative regret of the policy π, whatever it did in the first T rounds, π_1, π_2, all the way up to π_T, divided by capital T. That is, the average of the cumulative regret is exactly equal to your simple regret.
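Written out, the recommendation rule and the proposition on the board take the following shape, where N_i(T) is the number of times arm i was played in the first T rounds and I_{T+1} is the recommended arm:

```latex
% Recommendation distribution for round T+1, given the history (i_1, r_1, \dots, i_T, r_T):
p_{T+1}(i) \;=\; \pi_{T+1}\bigl(i \mid i_1, r_1, \dots, i_T, r_T\bigr)
          \;=\; \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}\{ i_t = i \}
          \;=\; \frac{N_i(T)}{T} .

% Proposition: if I_{T+1} is drawn from p_{T+1}, then
R^{\mathrm{simple}}_T \;=\; \mathbb{E}\bigl[\Delta_{I_{T+1}}\bigr]
  \;=\; \frac{1}{T}\, R^{\mathrm{cum}}_T(\pi) .
```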
So why is that true? Let us write the cumulative regret. I know my cumulative regret is the sum over arms of the gap Δ_i times the expected number of plays of that arm; we know this by the regret decomposition theorem. Now let me divide it by T. What is N_i(T) here? N_i(T) gives you the number of times you have played arm i up to round T. Is N_i(T)/T the same as the quantity here? What is it? It is nothing but the distribution p_{T+1}, right? So then I can write this expectation as nothing but the expectation of Δ_{I_{T+1}}, the gap of whatever arm we selected according to exactly this distribution. And that is nothing but the simple regret. So, if you give me any policy which I can apply for the cumulative regret minimization problem, and at the end I sample my (T+1)-th arm according to the distribution defined like this, then the average of the cumulative regret is nothing but my simple regret for that policy π. What I have basically done is: I took a policy π which I apply on the standard bandit problem to minimize cumulative regret, but the way I pull the (T+1)-th arm is based on this new distribution. If I do this, this is how the simple regret and the cumulative regret are related: this is the simple regret I would get in the (T+1)-th round, and that is the cumulative regret over the first T rounds.

Now, given that this relation holds, can you tell me what is the best simple regret bound I can get? Order √(k/T), one single big O of √(k/T). Why is that? Which algorithm gives it? The MOSS algorithm. If I use the MOSS algorithm, that already guarantees me a cumulative regret of order √(kT). Because of the denominator T here, I get order √(k/T), which is exactly what this lower bound tells us. So if I have a good algorithm for cumulative regret, I have a good algorithm for simple regret. But it is not quite as simple as it looks: it is not necessary that any algorithm which gives a good cumulative regret has to yield a good simple regret. We will talk about that a bit later.

But let me complete this statement here as a corollary: for all T there exists a policy such that, for any instance, this bound holds. As you already see, the policy which gives me this bound is the MOSS policy. So, fine, let me first discuss. This corollary says that for all T there exists a policy such that for any instance this relation holds; it is giving an upper bound. The lower bound was already given above: irrespective of what your policy is, there exists an instance such that the lower bound holds. Now we have to see that we can achieve it. What we are saying is: there exists a policy such that for any instance this holds, and you already know what that policy is. It is the MOSS policy, because that gives me √(kT) cumulative regret, and from that I achieve √(k/T) simple regret.
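Putting the whole argument together in one place (Δ_i is the suboptimality gap of arm i, and the MOSS bound C√(kT) is the one quoted from the regret minimization lectures, with C an absolute constant):

```latex
% Regret decomposition for the expected cumulative regret over T rounds:
R^{\mathrm{cum}}_T \;=\; \sum_{i=1}^{k} \Delta_i \,\mathbb{E}\bigl[N_i(T)\bigr] .

% Divide by T and use p_{T+1}(i) = N_i(T)/T:
\frac{R^{\mathrm{cum}}_T}{T}
  \;=\; \sum_{i=1}^{k} \Delta_i \,\mathbb{E}\!\left[\frac{N_i(T)}{T}\right]
  \;=\; \sum_{i=1}^{k} \Delta_i \,\mathbb{E}\bigl[p_{T+1}(i)\bigr]
  \;=\; \mathbb{E}\bigl[\Delta_{I_{T+1}}\bigr]
  \;=\; R^{\mathrm{simple}}_T .

% Corollary: run MOSS for the first T rounds and recommend from p_{T+1}; since MOSS has
% R^{\mathrm{cum}}_T \le C\sqrt{kT},
R^{\mathrm{simple}}_T \;\le\; \frac{C\sqrt{kT}}{T} \;=\; C\,\sqrt{\frac{k}{T}} ,
% which matches the lower bound up to constant factors.
```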