So, in the last class we talked about pure exploration methods. We discussed the uniform exploration method and saw what bound we get from it. We also discussed the lower bound, and then we connected how to obtain a policy for pure exploration from a policy designed for regret minimization: we showed that if you appropriately define how the arm is chosen in the last round, the simple regret is related to the cumulative regret of that policy; in fact, the simple regret equals the average of the cumulative regret. Today we are going to discuss something called the fixed confidence setting. That is, I tell you: with this confidence you have to identify the optimal arm, and you may take however many samples you like. I do not care how many samples you take, or what regret you incur while you are exploring; but whenever you stop and tell me "this is the arm", I want a guarantee that it is the optimal arm with probability at least 1 minus delta. This is called best arm identification. So when I am doing pure exploration, one question I could ask is: I give you a confidence term, you do whatever amount of exploration you want, and at the end you give me an arm that is the optimal one with probability at least 1 minus delta. What other kind of question could I ask here? There is something called the fixed budget setting: I give you this much budget, that is, this many rounds; you are allowed to do whatever you want, but at the end you have to output an arm that is the correct one with as high a probability as possible. We will formalize that later, but let us call this second possible question the fixed budget setting.
We will study the fixed budget setting in the next class; today we focus only on the fixed confidence part. In this case, the performance criterion is as I said: you are given a confidence term, you are allowed to do as much exploration as you want, and at the end you must return an arm that is the right one with probability at least 1 minus delta, where delta is what I have passed to you. There could be many algorithms that do this, so which one would you prefer? The one that takes fewer rounds, that is, the one that identifies the optimal arm in as few rounds as possible.

Let us try to formalize that; before doing so, let me give a definition. In best arm identification with fixed confidence, the only inputs to the algorithm are the set of arms and the confidence parameter. Internally it does whatever exploration it needs, but it has to come up with its own stopping criterion, stop at that point, and then output an arm. So every such algorithm comes with a stopping criterion; let me call the pair a policy and a stopping time. Pi is the policy you apply, and tau is when it decides to stop. We say the policy together with its stopping time is sound for any confidence term delta you pass in if the following holds. What is the condition about? It involves the event that tau is finite (the algorithm has stopped in finite time) and the arm it outputs in the next round, I tau plus 1, has sub-optimality gap Delta of I tau plus 1, its gap with respect to the optimal arm, greater than 0. What does this mean?
If this gap is greater than 0, the output arm I tau plus 1 is sub-optimal, and soundness requires that the probability of this event is upper bounded by delta. That is: the probability that the algorithm stops in finite time and gives me an arm which is not optimal is at most delta, which means it gives me the correct arm with probability at least 1 minus delta. When this happens, we call such a pair (pi, tau) sound.

Before this I should have made tau more formal. Tau is the time when the player stops, and we say tau is a stopping time. Can anybody tell me what it means for tau to be an F_t stopping time? Here F_t is the sigma-algebra generated by X_1, I_1, and so on: everything you have observed up to time t, together with the action you are going to play next; we have not yet observed the reward of the arm in round t plus 1. So what does tau being a stopping time with respect to this filtration mean? Recall the definition of a stopping time: you can answer yes or no to "should I stop?" based only on the observations so far, without needing to know what happens in the future. It says exactly that: the algorithm decides whether to stop based on what it has observed up to that point and the action it could potentially take in that round, but without knowing the reward X_{t+1} or anything that happens after it. This is the stopping time associated with a policy pi, and the sigma-algebra generated here will indeed depend on the policy, because the policy governs how you choose the arms; it also depends on the underlying distribution, because X_1, X_2, and so on are generated according to that distribution.
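As a small illustration of the stopping-time idea, here is a sketch in Python; the threshold rule and the names in it are my own illustration, not part of the lecture. The point is only that the decision at round t is a function of the history up to t.

```python
def stopping_time(rewards, threshold, n_min=5):
    """First round t at which the running mean of the first t rewards
    exceeds `threshold` (with at least n_min samples).  The decision at
    time t uses only rewards[0..t-1], nothing from the future -- that is
    exactly what makes tau a stopping time with respect to F_t."""
    total = 0.0
    for t, x in enumerate(rewards, start=1):
        total += x
        if t >= n_min and total / t > threshold:
            return t  # decided to stop using only the past
    return None  # never stopped within the observed history
```

A rule like "stop if the reward at time t+1 will be large" would not qualify, because it peeks at information outside F_t.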
Can the time tau at which the algorithm stops be random? Suppose I give you K stochastic distributions and you run your algorithm; it stops after a certain number of rounds and outputs an arm. Now you rerun the algorithm from scratch on the same distributions. The stopping time could be different, right? So tau can indeed be a random variable. That is why, for a given pair (pi, tau), we define the sample complexity as the expected value of tau, where the expectation is induced by the policy pi as well as the underlying distribution nu. What we are interested in is a policy, with its stopping criterion, that has small sample complexity.

Now, before we talk about how small this sample complexity can be: is there some value that must be incurred irrespective of the policy, for a given underlying distribution? That is, is there a lower bound? We are not going to look into that first; instead we start directly with an algorithm that we hope minimizes this sample complexity. There are different versions of these lower bounds and they have rather complicated characterizations, so I will pick a simpler one to discuss in the next class; today our focus will be mostly on the algorithm. The algorithm we are going to discuss is called KL-LUCB. It is a derivative of another algorithm called LUCB, for lower upper confidence bound, but we will discuss just this one version, KL-LUCB.
Does anybody have intuition for how to approach such a pure exploration problem? We discussed this a bit in the previous class: how to make use of an algorithm of the UCB kind for pure exploration. We noticed that the algorithms we had in the regret minimization setting, like UCB, do not necessarily do well in the pure exploration setting. So what are other possibilities? One possibility: I estimate confidence intervals along with the means, compare the estimated mean plus confidence term for each arm, and order them. I check which is the highest among them, and if the highest and the next highest are separated by at least some amount, then maybe I have enough confidence that the first one is the best, because I have ensured that the second is already separated from the first by a certain amount. So the idea is: as in UCB, you maintain the estimate plus the confidence term, that is, the optimistic value at any time for all arms. If at any point you notice that the best and the next best are separated by a certain amount (an amount you have to define), then you can be confident at that point: there is already enough separation between them, so you stop. KL-LUCB tries to formalize this, but it addresses a somewhat more general problem. What we have been trying to do so far is always identify the single best arm. Instead, I could ask you to find the best 3 arms, or the best 4 arms. So in this class let me ask the question: I want to identify the best 5.
When somebody gives me these 5 arms, I do not care which is exactly first, which is exactly second, which is exactly third among them; as long as you can tell me with confidence that these are the top 5, I am fine. KL-LUCB makes exactly this generalization of the problem: instead of just the top 1, it tries to identify the top m arms. Let me formalize that notion. Say the arms have means mu_1, mu_2, up to mu_K, and assume they are already ordered, so mu_1 is the best, mu_2 the second best, and so on. Then the top m arms are simply those with means mu_1, mu_2, all the way up to mu_m.

Now, instead of exactly identifying these top m arms, I will slightly weaken the requirement and ask for the top-m epsilon arms: the set of all arms a whose mean mu_a is at least mu_m minus epsilon. What does that mean? Suppose I set epsilon equal to 0. Then this set is the set of all arms with mean at least mu_m, which is exactly the top m arms. Suppose instead I relax this and allow a positive epsilon that is given to you. Will this set include more arms? It can include more arms, right. But I am fine with that: as long as you output me m arms from this bigger set, I am happy to take them as equivalent to the top m arms. What we defined before is the special case of this with epsilon equal to 0 and m equal to 1. So we relax the problem and say: as long as you give me m epsilon-optimal arms, I am fine; and this algorithm addresses exactly this situation. Now, the question is how to identify these top m arms. Let me write down the inputs. I am also going to set B_1 equal to infinity; I will define B_t shortly. So let us see.
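The epsilon-relaxed top-m set just defined can be written down in a few lines. This is a minimal sketch of my own (the function name is not from the lecture), only to make the definition concrete.

```python
def top_m_eps(means, m, eps):
    """Return the indices of all eps-optimal top-m arms: arms whose
    mean is at least mu_(m) - eps, where mu_(m) is the m-th largest
    mean.  With eps = 0 this is exactly the set of top-m arms."""
    mu_m = sorted(means, reverse=True)[m - 1]  # m-th largest mean
    return {a for a, mu in enumerate(means) if mu >= mu_m - eps}
```

For example, with means (0.9, 0.8, 0.7, 0.5) and m = 2, epsilon = 0 gives exactly the first two arms, while epsilon = 0.15 also admits the third arm: a positive epsilon can only enlarge the set, and any m arms drawn from it are accepted.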
So, this algorithm takes epsilon as an input, along with U and L. These are confidence bound functions which we are going to define a bit later: U gives an upper confidence bound and L a lower confidence bound on the arms. First, it plays each of the K arms once and, for each of them, computes its upper and lower confidence bounds; the argument 1 here means the first round, because each arm has just one sample. It also maintains a quantity B_t, a function of t; we will see how B_t is defined. At every time t it identifies two arms, u_t and l_t, one from each side of a partition.

Let me come to that. After you pull an arm you have its samples, so you can compute its empirical mean, and using the functions U and L you also get its confidence terms. At any time you maintain a set J_t. What does J_t do? It splits the arms into two sets: the m arms with the highest empirical means, and the rest. Say at round t the empirical means of your arms happen to lie as drawn, and you are interested in m equal to 4. Then the algorithm takes arms 1, 2, 3, 4 as one set and puts the remaining arms in the other. The first set we call J_t and the second its complement, J_t complement. Now, from the complement set it finds u_t. What is u_t?
So, "a not in J_t" means the arm comes from the complement set: u_t is the arm in J_t complement with the largest index value once its upper confidence term is taken into account, that is, the largest value of empirical mean plus upper confidence term. In the picture, for an arm a in the complement set its value is mu hat of a in round t plus its upper confidence term, while for an arm in J_t we look at mu hat of a minus its confidence term, the lower end of its interval. Let me restate what is happening. We have simply partitioned the arms by their empirical means into the top m and the rest. For each arm in J_t, I compute its lower confidence bound; for each arm in the complement set, I compute its upper confidence bound. So if I take an arm from J_t, I look at the lower end of its confidence interval, and if I take an arm from the complement, I look at the upper end of its interval. In each round the algorithm selects these two arms, and these are the ones played in that particular round. So here we actually play two arms per round, but that is the same as spreading the plays out: either I say that in a single round I play two arms, or that one round is spread over two rounds in which I play one arm each.
Now, you play these two arms, observe a sample for each, update their empirical means, and from those you recompute the two arms u_t and l_t the way we defined. Then you look at the difference between the upper confidence bound of arm u_t and the lower confidence bound of arm l_t. If this difference happens to be larger than epsilon, you continue; if it happens to be smaller than epsilon, you exit.

Let us understand what this means. For clarity, say this is my J_t set and this is my J_t complement set. From each arm in J_t I take the lower end of its confidence interval, and from each arm in the complement I take the upper end. What is the algorithm trying to resolve? It has proposed J_t as the top m set, right, and the potentially conflicting arms are the ones at the border. In J_t, the worst among the best might actually belong on the other side, but by mistake I have put him here; and in the complement, the best among the worst might actually belong to the top set, but by mistake I have put him there. So I need to be more confident about exactly these two edge points; about the other arms I am already fairly confident, but my resolution effort should go to the border. This algorithm does exactly that: u_t comes from the complement set, and we take its upper confidence term; l_t comes from J_t, and we take its lower confidence term. If the difference between them happens to be large, then I will continue.
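To tie the pieces together, here is a rough Python sketch of the LUCB loop just described: split by empirical means, pick the weakest arm of the top set and the strongest challenger, stop when their confidence bounds are separated to within epsilon. The Hoeffding-style radius and its constant are placeholders of my own; KL-LUCB replaces them with KL-divergence-based bounds, which we define next.

```python
import math

def lucb(pull, K, m, eps, delta, horizon=100_000):
    """Sketch of the LUCB loop for finding m (eps-)optimal arms out of K.
    `pull(a)` returns a reward in [0, 1] for arm a.  The Hoeffding-style
    radius below stands in for the KL-based bounds of KL-LUCB."""
    counts = [0] * K
    sums = [0.0] * K
    mu = lambda a: sums[a] / counts[a]          # empirical mean
    def rad(a, t):
        # illustrative exploration radius, not a tuned constant
        return math.sqrt(math.log(5 * K * t * t / (4 * delta)) / (2 * counts[a]))
    for a in range(K):                          # play each arm once
        sums[a] += pull(a); counts[a] += 1
    t = K
    while t < horizon:
        order = sorted(range(K), key=mu, reverse=True)
        J, Jc = order[:m], order[m:]            # top-m empirical means vs rest
        l_t = min(J,  key=lambda a: mu(a) - rad(a, t))  # weakest of the top set
        u_t = max(Jc, key=lambda a: mu(a) + rad(a, t))  # strongest challenger
        # stop when the challenger's UCB is within eps of l_t's LCB
        if (mu(u_t) + rad(u_t, t)) - (mu(l_t) - rad(l_t, t)) < eps:
            return set(J)
        for a in (u_t, l_t):                    # sample only the two border arms
            sums[a] += pull(a); counts[a] += 1
        t += 1
    return set(sorted(range(K), key=mu, reverse=True)[:m])
```

Note that all the sampling effort after the initial round goes into the two border arms u_t and l_t, which is exactly the "resolve the edge points" intuition above.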
Let me just confirm whether the condition is greater than or less than epsilon. So, if this difference is greater than epsilon, am I confident enough? Is it positive then? Yes, then the upper end from J_t complement is going ahead of the lower end from J_t. Exactly: in that case the upper bound of u_t lies to the right of the lower bound of l_t, which means it may still be possible that u_t actually belongs in the top set and l_t does not. That is why I want to resolve it further and continue. If that is not the case, it means there is enough separation between the lower bound of l_t and the upper bound of u_t, so I can be more confident about the partition and stop. So the algorithm takes this epsilon, which acts as a kind of resolution parameter, as an input; depending on how much separation you want to ensure, it stops once the difference drops below epsilon, and otherwise keeps going. What now remains to be defined is how the confidence terms are computed: what are these functions U and L?