So, what is the performance guarantee of this algorithm? Before we write the bound, recall that we already showed that the pseudo-regret can be decomposed: it is the sum over arms of the expected number of pulls of each arm times its suboptimality gap,
$$\bar R_n = \sum_{i=1}^{K} \Delta_i\, \mathbb{E}[T_i(n)].$$
So it is enough to bound the expected number of pulls of each arm, and that is exactly what this theorem does: first it bounds $\mathbb{E}[T_i(n)]$, and then the pseudo-regret bound follows by plugging that into the decomposition. One more observation: if some arm $i$ is optimal, then $\Delta_i = 0$ by definition, so that term contributes nothing. Hence I only need to consider the arms whose gap is strictly positive. Plugging the theorem's bound, a $6\log n/\Delta_i^2$ term plus a constant $c$ that does not depend on $n$, into the decomposition, we get
$$\bar R_n \le \sum_{i:\Delta_i>0} \Delta_i\left(\frac{6\log n}{\Delta_i^2} + c\right) = \sum_{i:\Delta_i>0}\frac{6\log n}{\Delta_i} + c\sum_{i:\Delta_i>0}\Delta_i.$$
Recall the definition of the gap: $\Delta_i = \max_j \mu_j - \mu_i$. Our convention was that arm 1 is optimal (the algorithm does not know this; it is only for the analysis), so $\Delta_1 = 0$, and the remaining gaps $\Delta_2, \Delta_3, \dots$ are positive.
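To make the decomposition concrete, here is a minimal sketch that computes the gaps $\Delta_i$ and the pseudo-regret $\sum_i \Delta_i\,\mathbb{E}[T_i(n)]$ from a vector of means and expected pull counts. The function names and the numbers are hypothetical, chosen only for illustration; the pull counts are made up, not produced by any algorithm.

```python
from typing import Sequence


def gaps(means: Sequence[float]) -> list[float]:
    """Suboptimality gaps Delta_i = max_j mu_j - mu_i (zero for optimal arms)."""
    best = max(means)
    return [best - mu for mu in means]


def pseudo_regret(means: Sequence[float], expected_pulls: Sequence[float]) -> float:
    """Pseudo-regret via the decomposition sum_i Delta_i * E[T_i(n)].

    Only arms with a positive gap are summed; optimal arms contribute zero anyway.
    """
    return sum(d * p for d, p in zip(gaps(means), expected_pulls) if d > 0)


# Hypothetical instance: arm 1 (Python index 0) is optimal.
means = [0.9, 0.6, 0.5]
# Hypothetical expected pull counts E[T_i(n)] for n = 1000 rounds.
expected_pulls = [950.0, 30.0, 20.0]
print(pseudo_regret(means, expected_pulls))
```

Pulls of the optimal arm never add to the pseudo-regret, which is why the sum only needs to range over arms with $\Delta_i > 0$.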
Now I define another quantity, $\Delta = \min_{i:\Delta_i>0}\Delta_i$, or, in our case where arm 1 is optimal, $\Delta = \min_{i>1}\Delta_i$. What am I doing here? Since arm 1 is optimal, $\Delta_1 = 0$ and $\Delta_2, \Delta_3, \dots$ are all positive; among these positive values I want the smallest one. Let us look at an example. Suppose the highest mean is $\mu_1$, and $\mu_2$, $\mu_3$, $\mu_4$ lie below it in that order. The gap from $\mu_1$ to $\mu_2$ is $\Delta_2$, the gap from $\mu_1$ to $\mu_3$ is $\Delta_3$, and the gap from $\mu_1$ to $\mu_4$ is $\Delta_4$. Among $\Delta_2, \Delta_3, \Delta_4$ take the smallest one; that is what I denote by $\Delta$. I have labeled the means in decreasing order only for convenience; the labels could be anything. What $\Delta$ captures is the gap between the highest mean and the next-highest mean; in this example, that is the gap between $\mu_1$ and $\mu_2$. By definition, $\Delta \le \Delta_i$ for all $i > 1$. Now I plug this into the bound: I replace each $\Delta_i$ in the denominator by $\Delta$. Since $\Delta$ is a lower bound on every $\Delta_i$, we have $1/\Delta_i \le 1/\Delta$, and because there are at most $K-1$ suboptimal arms ($K$ being the number of arms), this gives the upper bound
$$\bar R_n \le \frac{6(K-1)\log n}{\Delta} + \text{(a constant independent of } n).$$
Now, if you look at this bound, is it sublinear?
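A small sketch of the definition of $\Delta$: the smallest positive gap, which equals the distance from the best mean to the second-best (distinct) mean. The function name and the example means are hypothetical. The second call relabels the same arms to show that $\Delta$ does not depend on how the arms are numbered.

```python
from typing import Sequence


def min_positive_gap(means: Sequence[float]) -> float:
    """Delta = min over arms with Delta_i > 0 of Delta_i (best minus second-best distinct mean)."""
    best = max(means)
    positive_gaps = [best - mu for mu in means if mu < best]
    if not positive_gaps:
        raise ValueError("all arms have the same mean, so no positive gap exists")
    return min(positive_gaps)


means = [0.9, 0.7, 0.4, 0.2]     # hypothetical mu_1 > mu_2 > mu_3 > mu_4
shuffled = [0.4, 0.9, 0.2, 0.7]  # same instance, different labels
print(min_positive_gap(means), min_positive_gap(shuffled))
```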
So, how does this regret grow with $n$? It grows logarithmically in $n$, and therefore it is sublinear. Order-wise, what is the regret? It is $O(K\log n/\Delta)$: strictly we have $K-1$ in front, but up to constants I will just write it as $K$. So the expected regret of UCB scales like $K\log n/\Delta$. Here $\Delta$ depends on the problem instance: given an instance with means $\mu_1, \mu_2, \mu_3, \mu_4$, $\Delta$ captures the gap between the best mean and the second-best mean. So we are expressing the complexity of the problem through $\Delta$ rather than through all the individual $\Delta_i$'s, because $\Delta$ is what matters most. To see why, suppose you are given two arms with means $\mu_1$ and $\mu_2$, and your goal is to identify the arm with mean $\mu_1$. If there is a good separation between $\mu_1$ and $\mu_2$, this is relatively easy; if $\mu_1$ and $\mu_2$ are very close to each other, it becomes harder. In other words, if $\mu_2$ were very close to the optimal mean instead of well separated from it, would the problem become easier or harder? Take the case of just two arms. In one instance, the means are 0.5 and 0.8, and I ask you to identify the arm with the highest mean. In a second instance, I keep the first mean the same but make the second one 0.6, so the gap has shrunk from 0.3 to 0.1. In which case is it easier to identify the best arm? In the first, because the two arms are well separated. Think of a class in which there is only one strong student and all the others are weak.
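The two-arm comparison can be quantified through the leading term $6(K-1)\log n/\Delta$ of the bound. This is a sketch that evaluates only that leading term (the constant is dropped), so it compares upper bounds, not actual regret. The function name and horizon $n = 10{,}000$ are hypothetical choices.

```python
import math


def ucb_regret_leading_term(num_arms: int, n: int, min_gap: float) -> float:
    """Leading term 6 (K - 1) log n / Delta of the UCB pseudo-regret bound (constant ignored)."""
    return 6 * (num_arms - 1) * math.log(n) / min_gap


n = 10_000
easy = ucb_regret_leading_term(2, n, 0.8 - 0.5)  # gap 0.3: well separated
hard = ucb_regret_leading_term(2, n, 0.6 - 0.5)  # gap 0.1: close means
print(f"easy: {easy:.1f}, hard: {hard:.1f}, ratio: {hard / easy:.2f}")
```

Shrinking the gap by a factor of 3 inflates the leading term by the same factor, while multiplying the horizon by 1000 (for example from $10^3$ to $10^6$) only doubles it, which is the logarithmic growth in $n$.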
Then it is easy to identify the strong student: if you want to assign only one top grade, you know whom to give it to. But if there are many strong students and still only one top grade, deciding whom to give it to becomes challenging. So the problem is all about separating the best arm from the rest. If many arms have means close to the best one, separating the best from the rest is challenging, and the problem becomes hard. That is why the gap between the mean of the best arm and the mean of the second-best arm matters. I illustrated this with $K = 2$, but the same intuition holds for any number of arms: if many arms are close to the best one, separating them is not an easy task. The bound captures exactly this: if the gap $\Delta$ becomes very small, the bound becomes correspondingly large. Now, how do we prove this bound? We will introduce some notation now and continue next time. In the algorithm I wrote $\hat\mu_i(t-1)$. What does that mean? It is the estimate of arm $i$'s mean that I use in round $t$, and we said it is simply the sample average of all the samples observed from arm $i$ so far. Up to round $t$, arm $i$ has been observed $T_i(t-1)$ times, so
$$\hat\mu_i(t-1) = \frac{1}{T_i(t-1)}\sum_{s=1}^{T_i(t-1)} X_{i,s}.$$
So we average $T_i(t-1)$ samples to get the estimate. But how should we interpret $X_{i,s}$ here?
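In an implementation, $\hat\mu_i(t-1)$ and $T_i(t-1)$ are usually maintained incrementally rather than by storing all samples. A minimal sketch, assuming one such object per arm; the class name `RunningMean` is hypothetical. The update $\hat\mu \leftarrow \hat\mu + (x - \hat\mu)/T$ gives the same value as the sample average.

```python
from statistics import fmean


class RunningMean:
    """Incrementally maintained sample average hat mu_i and count T_i for one arm."""

    def __init__(self) -> None:
        self.count = 0   # T_i: number of samples observed so far
        self.mean = 0.0  # hat mu_i: average of those samples

    def update(self, reward: float) -> None:
        # Equivalent to recomputing sum(samples) / len(samples), without storing samples.
        self.count += 1
        self.mean += (reward - self.mean) / self.count


samples = [1.0, 0.0, 1.0, 1.0]  # hypothetical rewards observed from one arm
est = RunningMean()
for x in samples:
    est.update(x)
print(est.count, est.mean, fmean(samples))
```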
The way to interpret it is as follows. Imagine a table of rewards with one column per arm and one row per round, where $X_{i,t}$ is the reward arm $i$ would give if it were played in round $t$:
$$\begin{array}{c|cccc} & \text{arm } 1 & \text{arm } 2 & \cdots & \text{arm } K\\ \hline \text{round } 1 & X_{1,1} & X_{2,1} & \cdots & X_{K,1}\\ \text{round } 2 & X_{1,2} & X_{2,2} & \cdots & X_{K,2}\\ \text{round } 3 & X_{1,3} & X_{2,3} & \cdots & X_{K,3}\\ \vdots & \vdots & \vdots & & \vdots\\ \text{round } t & X_{1,t} & X_{2,t} & \cdots & X_{K,t} \end{array}$$
Had you played arm 1 in every round, you would have observed $X_{1,1}$ in round 1, $X_{1,2}$ in round 2, $X_{1,3}$ in round 3, and so on up to $X_{1,t}$ in round $t$. Had you played arm 2 every time, you would have observed the second column. But in each round you choose only one arm: say you played one arm in round 1, another in round 2, another in round 3, and arm 3 in round $t$; in each round you see only the entry of the arm you played. So you do not get a sample from every arm in every round; you get a sample from an arm only in the rounds in which you played it. In particular, when I write $X_{i,s}$ in the sample average, I may not have played arm $i$ in round $s$ (for example, in round 2), so that table entry was never observed. Our interpretation is therefore: with a slight abuse of notation, $X_{i,s}$ in the average denotes the $s$-th sample actually observed from arm $i$, whichever round it came from. Whatever rounds you played arm $i$ in, you take the samples you obtained and average them.
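The reward-table picture can be sketched directly: draw the full table up front, fix a sequence of played arms, and keep only the entries that were actually observed. This sketch assumes Bernoulli rewards and a hypothetical, hand-picked play sequence (arms and rounds are 0-indexed in the code). It shows that the $s$-th observed sample of an arm generally comes from a round other than $s$, and that the remaining table entries are never used.

```python
import random


def reward_table(means, num_rounds, seed=0):
    """Hypothetical Bernoulli reward table: table[i][r] is what arm i would pay in round r."""
    rng = random.Random(seed)
    return [[1.0 if rng.random() < mu else 0.0 for _ in range(num_rounds)] for mu in means]


def observed_samples(table, plays, arm):
    """(round, reward) pairs actually seen from `arm`; the s-th pair is its s-th observed sample."""
    return [(r, table[arm][r]) for r, played in enumerate(plays) if played == arm]


def empirical_mean(table, plays, arm):
    """Average of the observed samples only, i.e. hat mu_i with T_i = number of observations."""
    samples = observed_samples(table, plays, arm)
    return sum(x for _, x in samples) / len(samples)


means = [0.7, 0.5, 0.4]
plays = [0, 1, 2, 0, 0, 2, 1, 0]  # hypothetical arm choices in rounds 1..8
table = reward_table(means, len(plays))
for s, (r, x) in enumerate(observed_samples(table, plays, 0), start=1):
    print(f"sample {s} of arm 1 came from round {r + 1}: reward {x}")
print("estimate for arm 1:", empirical_mean(table, plays, 0))
```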
Let us take the two-arm case with $t = 4$: suppose you played arm 1 in rounds 1 and 4, and arm 2 in rounds 2 and 3. After these four rounds you have two samples from arm 1 and two from arm 2. For arm 1 you can only average these two samples, $X_{1,1}$ and $X_{1,4}$. So even though the sum is written with $s$ running from 1 upward, its terms are not the table entries of rounds 1 and 2; they are the two samples observed so far, from rounds 1 and 4. Similarly, for arm 2 we average the two samples observed so far, $X_{2,2}$ and $X_{2,3}$. That is the meaning of the estimate $\hat\mu_i(t-1)$: the average of all samples observed from arm $i$, and the number of such samples is what we denoted by $T_i(t-1)$.

Now, how do we go about the proof? Let us write down the first step, and we will continue the discussion next time. Suppose that in round $t$ some arm $i \ne 1$ is played, that is, a suboptimal arm is played. Why could a suboptimal arm have been played? Obviously, because its index was at least as large as everybody else's:
$$\hat\mu_i(t-1) + \sqrt{\frac{2\log t}{T_i(t-1)}} \ge \hat\mu_j(t-1) + \sqrt{\frac{2\log t}{T_j(t-1)}} \quad\text{for all } j.$$
This must hold, because the algorithm chose arm $i$ in round $t$. The question is in which situations this can happen. In particular, arm $i$'s index must be at least as large as that of the optimal arm.
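The first step of the proof can be checked on a concrete state. This is a sketch, assuming the index $\hat\mu_i(t-1) + \sqrt{2\log t / T_i(t-1)}$ and every arm pulled at least once; the state (round $t = 11$, estimates, counts) is hypothetical, with arm 1 (Python index 0) taken as the truly optimal arm. Arm 2 is chosen here because it has a slightly higher estimate and far fewer pulls, and, as argued above, its index is then at least arm 1's.

```python
import math


def ucb_index(mean_hat: float, pulls: int, t: int) -> float:
    """UCB index hat mu_i(t-1) + sqrt(2 log t / T_i(t-1)); assumes pulls >= 1."""
    return mean_hat + math.sqrt(2 * math.log(t) / pulls)


def select_arm(means_hat, pulls, t):
    """Return the arm with the highest index (first one on ties) and all indices."""
    indices = [ucb_index(m, n_pulls, t) for m, n_pulls in zip(means_hat, pulls)]
    return max(range(len(indices)), key=indices.__getitem__), indices


# Hypothetical state at the start of round t = 11 (10 pulls made so far).
t = 11
means_hat = [0.55, 0.60, 0.30]
pulls = [6, 2, 2]
i, indices = select_arm(means_hat, pulls, t)
print(f"UCB plays arm {i + 1}; its index {indices[i]:.3f} >= arm 1's index {indices[0]:.3f}")
```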
That is,
$$\hat\mu_i(t-1) + \sqrt{\frac{2\log t}{T_i(t-1)}} \ge \hat\mu_1(t-1) + \sqrt{\frac{2\log t}{T_1(t-1)}}.$$
You played arm $i$ because its index was better than everyone else's. By the way, we call this quantity the index of arm $i$. Earlier we defined it as $\mathrm{UCB}_i(t-1)$, so in this case it is called the UCB index; in general, whatever value an algorithm assigns to an arm in a given round is called the index of that arm. What the UCB algorithm does is find the arm with the highest index and play it. So if $I_t = i$, the UCB index of arm $i$ must be at least as large as everybody else's, and in particular at least as large as that of the optimal arm in that round. Now assume arm $i$ is suboptimal: in some round you did not play the optimal arm but played arm $i$ instead. Then its index must have been at least that of the optimal arm in that round. How can this happen? Either in that round arm $i$'s index was overestimated, or the optimal arm's index was underestimated, or the exploration (confidence) term of arm $i$ dominated because arm $i$ had not yet been played sufficiently many times. We will write this formally next time; for now, let us look at it pictorially. Draw the estimate of arm $i$ and the estimate of the optimal arm, each with its confidence interval around it. It may happen that the true mean $\mu_1$ lies outside its confidence interval.
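Putting the pieces together, here is a minimal UCB sketch that always plays the arm with the highest index and maintains $T_i$ and $\hat\mu_i$ incrementally. Assumptions, all hypothetical and not taken from the lecture: Bernoulli rewards, an initialization phase that plays each arm once so that every $T_i(t-1) \ge 1$, ties broken toward the lowest arm index, and the instance means $0.9$ and $0.1$.

```python
import math
import random


def run_ucb(means, n, seed=0):
    """Run a UCB sketch for n rounds on Bernoulli arms with the given means.

    Returns the pull counts T_i(n) and the final empirical means hat mu_i(n).
    """
    rng = random.Random(seed)
    num_arms = len(means)
    pulls = [0] * num_arms
    mean_hat = [0.0] * num_arms
    for t in range(1, n + 1):
        if t <= num_arms:
            arm = t - 1  # initialization (assumption): play each arm once
        else:
            index = [mean_hat[i] + math.sqrt(2 * math.log(t) / pulls[i]) for i in range(num_arms)]
            arm = max(range(num_arms), key=index.__getitem__)  # highest index wins
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        mean_hat[arm] += (reward - mean_hat[arm]) / pulls[arm]  # running sample average
    return pulls, mean_hat


pulls, mean_hat = run_ucb([0.9, 0.1], n=2000)
print("pull counts:", pulls, "estimates:", mean_hat)
```

With a gap this large, the suboptimal arm should be pulled only a number of times on the order of $\log n/\Delta^2$, so almost all pulls go to the optimal arm.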
Picture it like this: around $\hat\mu_1(t-1)$ we have the confidence interval for arm 1, and around $\hat\mu_i(t-1)$ the confidence interval for arm $i$. If I picked arm $i$ in round $t$, then its upper confidence bound, the top end of its interval, must have been at least the upper confidence bound of arm 1; that is the only reason arm $i$ could have been picked. Given this, several things could have gone wrong. The interval for arm 1 is where I expect $\mu_1$ to lie, but $\mu_1$ may actually lie outside it, above its upper end, so the optimal arm's index underestimates $\mu_1$. Likewise, I expected $\mu_i$ to lie in arm $i$'s interval, but $\mu_i$ may actually lie below it, so the estimate for arm $i$ is too high. Even though arm 1 is the optimal arm, so that $\mu_1 > \mu_i$, the estimates came out such that arm $i$'s upper confidence bound exceeds arm 1's, and I end up playing arm $i$. So this is, pictorially, what could cause arm $i$ to be picked instead of arm 1. We will formalize this in the next class and write it as three possible conditions.
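For reference, the standard UCB1-style argument writes the three conditions as follows, with $c_j = \sqrt{2\log t / T_j(t-1)}$: (1) $\hat\mu_1(t-1) + c_1 \le \mu_1$, (2) $\hat\mu_i(t-1) \ge \mu_i + c_i$, (3) $\Delta_i < 2c_i$, which is equivalent to $T_i(t-1) < 8\log t/\Delta_i^2$. If none holds, then $\hat\mu_i + c_i < \mu_i + 2c_i \le \mu_1 < \hat\mu_1 + c_1$, so arm $i$'s index is strictly below arm 1's. The formal version in the next class may be written differently; this sketch only checks the implication on random exact rationals. The widths are treated as arbitrary positive numbers, an assumption that is harmless because the argument does not use their specific form. All names are hypothetical.

```python
import math
import random
from fractions import Fraction


def three_conditions(mu1, mui, hat1, hati, c1, ci):
    """Which of the three conditions hold for suboptimal arm i versus optimal arm 1 (exact Fractions)."""
    gap = mu1 - mui
    return (
        hat1 + c1 <= mu1,  # (1) optimal arm's upper confidence bound at or below mu_1
        hati >= mui + ci,  # (2) arm i's estimate overshoots mu_i by at least its width
        gap < 2 * ci,      # (3) arm i's width still too large: too few pulls
    )


def pulls_needed_to_rule_out_3(gap: float, t: int) -> float:
    """With c_i = sqrt(2 log t / T_i), condition (3) fails once T_i >= 8 log t / gap**2."""
    return 8 * math.log(t) / gap ** 2


def rand_frac(rng: random.Random, low: int = 0) -> Fraction:
    return Fraction(rng.randint(low, 1000), 1000)


rng = random.Random(0)
checked = 0
for _ in range(20_000):
    mu1, mui = sorted((rand_frac(rng), rand_frac(rng)), reverse=True)
    if mu1 == mui:
        continue  # need a strictly suboptimal arm i
    hat1, hati = rand_frac(rng), rand_frac(rng)
    c1, ci = rand_frac(rng, low=1), rand_frac(rng, low=1)  # strictly positive widths
    if hati + ci >= hat1 + c1:  # arm i's index is at least arm 1's, so UCB may play it
        assert any(three_conditions(mu1, mui, hat1, hati, c1, ci))
        checked += 1
print(f"checked {checked} cases; at least one condition held in each")
print(pulls_needed_to_rule_out_3(0.1, 10_000))
```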