So, we have been discussing the UCB algorithm, which we introduced briefly last time. This is one of the algorithms based on optimism in the face of uncertainty. The UCB algorithm combines exploration and exploitation. Unlike the ETC algorithm, which first explores and then exploits separately in the subsequent rounds, UCB combines exploration and exploitation in each step by looking at the upper confidence bounds of the arms in each round.

Like I said last time, UCB is an index-based algorithm: we assign an index to each arm, and in each round we pick the arm with the highest index. So, what is the index in the UCB algorithm? The index of arm $i$ at time $t$ is defined to be

$$\mathrm{UCB}_i(t) = \hat{\mu}_i(t-1) + \sqrt{\frac{\alpha \log t}{T_i(t-1)}}.$$

So, what is this? The first term, $\hat{\mu}_i(t-1)$, is the estimate of the mean of arm $i$ based on the samples of arm $i$ I have observed till time $t-1$. The second term is the confidence term, which determines the confidence width of my estimate. And how did we define $T_i(t-1)$? This is the number of pulls of arm $i$ over the rounds $s = 1, 2, \dots, t-1$:

$$T_i(t-1) = \sum_{s=1}^{t-1} \mathbb{1}\{I_s = i\}.$$

Recall that $I_s$ here denotes the arm I pull in round $s$, so counting how many times I have pulled arm $i$ over the first $t-1$ rounds gives me $T_i(t-1)$. This is just a recall of our notation. Based on this, we said that the estimate $\hat{\mu}_i(t-1)$ is nothing but a summation over $s = 1$ to $t-1$ of the samples $X_i(s)$. What is $X_i(s)$? This is the sample observed from arm $i$ if you have pulled it in round $s$. It is not necessary that you would have pulled it in round $s$.
If at all you pulled it, and that is what the indicator is for, then you include that sample in the summation, and the whole of this is divided by the number of pulls:

$$\hat{\mu}_i(t-1) = \frac{\sum_{s=1}^{t-1} \mathbb{1}\{I_s = i\}\, X_i(s)}{\sum_{s=1}^{t-1} \mathbb{1}\{I_s = i\}} = \frac{1}{T_i(t-1)} \sum_{s=1}^{t-1} \mathbb{1}\{I_s = i\}\, X_i(s).$$

Notice that when I write $\hat{\mu}_i(t-1)$, this is a random quantity based on the $T_i(t-1)$ samples you have observed till round $t-1$. You do not know in advance how many samples of arm $i$ you would have observed till round $t-1$, so $T_i(t-1)$ is itself a random quantity. But suppose I fix this number: let us say that till round $t-1$ I have observed exactly $u$ samples of arm $i$. For that I am going to use the notation $\hat{\mu}_{i,u}$. When I use this, I mean that I have exactly $u$ samples of arm $i$ and I use them to find the estimate, so in this case it is simply the average of those $u$ samples. When I wrote $\hat{\mu}_i(t-1)$, the number of samples was whatever I happened to observe till round $t-1$, that is $T_i(t-1)$, and I used those to find the average.

Now, our goal is to find the regret of the UCB algorithm, which selects arms like this. To prove the regret bound, we are going to use the regret decomposition result that we showed last time. Let us recall the notation: $\mu^* = \max_i \mu_i$ is the highest mean across all the arms, and $i^* = \arg\max_i \mu_i$. This $i^*$ we call the optimal arm, and all other arms $i \neq i^*$ we call suboptimal arms.
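To make the notation concrete, here is a minimal sketch of the index computation from a recorded history. The function names, the 0-based arm numbering, and the value `alpha=2.0` are assumptions made for illustration and are not fixed by the lecture; the sketch also assumes the arm has been pulled at least once, so that $T_i(t-1) > 0$.

```python
import math

def empirical_mean(arms_played, rewards, arm):
    """Return (mu_hat_i(t-1), T_i(t-1)) from the history of rounds 1..t-1."""
    # Keep only the samples from rounds in which `arm` was actually pulled;
    # this is the role of the indicator 1{I_s = i} in the formula.
    samples = [x for a, x in zip(arms_played, rewards) if a == arm]
    return sum(samples) / len(samples), len(samples)

def ucb_index(arms_played, rewards, arm, t, alpha=2.0):
    """UCB index of `arm` in round t: estimate plus confidence term."""
    mean, pulls = empirical_mean(arms_played, rewards, arm)
    return mean + math.sqrt(alpha * math.log(t) / pulls)

# Hypothetical history of 4 rounds: arm 0 pulled three times, arm 1 once.
arms_played = [0, 1, 0, 0]
rewards = [1.0, 0.5, 0.0, 1.0]

idx0 = ucb_index(arms_played, rewards, arm=0, t=5)
idx1 = ucb_index(arms_played, rewards, arm=1, t=5)
```

In this toy history arm 1 has the lower empirical mean but the larger index, because it has been pulled only once and its confidence term is wider. That is the sense in which the index mixes exploration with exploitation.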
Last time we also denoted by $\Delta_i = \mu^* - \mu_i$ the gap between the mean of the optimal arm and the mean of the $i$-th arm. And we also defined $\Delta = \min_{i \neq i^*} \Delta_i$. The term $\Delta_i$ we call the suboptimality gap of arm $i$, and $\Delta$ we simply call the suboptimality gap. We are also going to assume that the optimal arm is unique; because of that, $\Delta$ is positive in our case.

This was the notation, and with it we showed that the regret of any policy over $n$ rounds can be written as the gaps times the expected number of pulls, summed over the arms:

$$R_n = \sum_{i=1}^{K} \Delta_i \, \mathbb{E}[T_i(n)],$$

where $K$ is the number of arms. Now, to bound the regret of our UCB algorithm, we just need to bound the expected number of pulls of each arm, and in particular of the suboptimal arms, since the optimal arm has zero gap. If we have a bound on this, we directly get a bound on the regret of the UCB algorithm.

So, let us start the proof. Assume that a suboptimal arm $i$ is played in round $t$. What I am saying is that $I_t = i$ for some $i \neq i^*$. Now, what could be the reason that this suboptimal arm has been selected in round $t$? Obviously, if it is selected in this round, it must be the arm with the highest UCB index. Then the question is why this arm happens to have the highest UCB index. Obviously, it could be because somehow my estimates are not good. Let us visualize this. Let us say this is $\mu_i$ and this is $\mu_1$. By the way, we are going to assume without loss of generality that arm 1 is the optimal one, so $i^* = 1$.
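As a small worked illustration of the regret decomposition, the sketch below computes the regret from the gaps and a set of expected pull counts. The means and pull counts are made-up numbers chosen only for illustration, and the function name is hypothetical.

```python
def regret_from_pulls(means, expected_pulls):
    """Regret decomposition: sum over arms of gap_i * E[T_i(n)]."""
    mu_star = max(means)                    # mean of the optimal arm
    gaps = [mu_star - m for m in means]     # suboptimality gaps Delta_i
    return sum(g * pulls for g, pulls in zip(gaps, expected_pulls))

# Hypothetical three-armed instance with n = 1000 rounds in total.
means = [0.9, 0.8, 0.5]
expected_pulls = [900.0, 80.0, 20.0]

regret = regret_from_pulls(means, expected_pulls)
# Smallest gap among the suboptimal arms (assumes a unique optimal arm).
min_gap = min(max(means) - m for m in means if m < max(means))
```

The optimal arm contributes nothing because its gap is zero, which is why only the pulls of the suboptimal arms need to be bounded.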
So, let us say the mean of the optimal arm, $\mu_1$, is here, and the true mean of the $i$-th arm, $\mu_i$, is here, below it. Now, when we construct confidence intervals for these arms, the intervals may or may not include the true means; they may overshoot or undershoot the true values. Suppose arm $i$ has been selected in round $t$ and the optimal arm has not. We have already discussed the confidence term, so let me directly write the intervals. For arm $i$, the upper end is $\hat{\mu}_i(t-1) + \sqrt{\alpha \log t / T_i(t-1)}$ and the lower end is $\hat{\mu}_i(t-1) - \sqrt{\alpha \log t / T_i(t-1)}$.

One possibility is that the interval for the suboptimal arm overshoots and overestimates its value: my interval is telling me that the true value of $\mu_i$ is up here, but it is actually below the interval. Another possibility is on the side of the optimal arm: the true mean $\mu_1$ is here, but my confidence interval says that it lies below that. In either case arm $i$ gets preference over arm 1, because the UCB index is the upper end of the interval, and because of this bad case I may end up missing my optimal arm.

The other possibility is that I have not sampled arm $i$ sufficiently. Because of this, $T_i(t-1)$ is small, the confidence term is large, and this makes the UCB index of arm $i$ dominate the others. So the third possibility is that $T_i(t-1)$ is small, that is, arm $i$ is not sampled enough.
So, because of these reasons, let us call them 1, 2 and 3, we may end up choosing an arm $i$ which is suboptimal in this round. Let us put these three conditions more formally.

Condition 1: $\hat{\mu}_i(t-1) - \sqrt{\alpha \log t / T_i(t-1)} > \mu_i$.

Condition 2: $\hat{\mu}_1(t-1) + \sqrt{\alpha \log t / T_1(t-1)} < \mu_1$.

Condition 3: $T_i(t-1) < 4 \alpha \log n / \Delta_i^2$.

By the way, in condition 3 this is not $t$; we take it as $n$, where $n$ is the number of rounds the algorithm is played for.

Now, the claim is: if a suboptimal arm $i$ has been played in round $t$, then at least one of these three conditions must hold. That is, if $I_t = i$, at least one of 1, 2, 3 must hold. Let us see why this claim holds true. We are going to show that if none of these conditions holds, then arm $i$ cannot have been played in round $t$. So assume that none of 1, 2, 3 is true; we will show that $I_t$ cannot be $i$.

First, let us say condition 2 is violated. That means

$$\hat{\mu}_1(t-1) + \sqrt{\frac{\alpha \log t}{T_1(t-1)}} \ge \mu_1 = \mu_i + \Delta_i,$$

where the equality is just the definition of the suboptimality gap. Next, let us take that condition 3 does not hold. Then we take the opposite inequality, $T_i(t-1) \ge 4 \alpha \log n / \Delta_i^2$, and simplifying this gives a lower bound on $\Delta_i$:

$$\Delta_i \ge 2 \sqrt{\frac{\alpha \log n}{T_i(t-1)}}.$$
Now, let us appeal to the remaining condition, condition 1. Suppose this also is violated. Then I take the opposite inequality, and if I take the confidence term to the other side, it says that $\mu_i$ plus this quantity is lower bounded by $\hat{\mu}_i(t-1)$. Putting everything together,

$$\mathrm{UCB}_1(t) = \hat{\mu}_1(t-1) + \sqrt{\frac{\alpha \log t}{T_1(t-1)}} \ge \mu_1 = \mu_i + \Delta_i \ge \mu_i + 2\sqrt{\frac{\alpha \log n}{T_i(t-1)}} \ge \mu_i + 2\sqrt{\frac{\alpha \log t}{T_i(t-1)}} \ge \hat{\mu}_i(t-1) + \sqrt{\frac{\alpha \log t}{T_i(t-1)}} = \mathrm{UCB}_i(t).$$

So, what is this? We started with the left-hand side, which is the UCB index of arm 1 in round $t$, and we have ended up with the UCB index of arm $i$ in round $t$. One of these inequalities we have to take as strict; this is the case, for example, if one of the three conditions is stated with a non-strict inequality, so that its negation is strict. So, if none of the three conditions holds, what we have just demonstrated is that the UCB index of arm 1 dominates that of arm $i$. That means the arm played in round $t$ cannot be $i$, because arm 1 would have been picked ahead of it, if nothing else. We do not know which arm is played, but at least we know for sure that arm $i$ cannot have been played in round $t$. So this implies $I_t \neq i$, a contradiction. We have just proved that if a suboptimal arm $i$ is played in round $t$, at least one of these conditions must hold.

Just to be careful about one step: when I apply the bound from condition 3, I first get $\mu_i + 2\sqrt{\alpha \log n / T_i(t-1)}$, with $\log n$, because the threshold in condition 3 depends on $n$, not $t$. But we are considering round $t$, and $t$ is less than or equal to $n$, so the inequality with $\log t$ also holds, and the rest of the things are fine.

Next, based on this, we are going to bound the expected number of pulls of arm $i$. Before that, I am going to take $u$ to be the threshold in condition 3, $u = 4 \alpha \log n / \Delta_i^2$.
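The claim can also be checked on a simulated run. The sketch below plays UCB on a hypothetical instance and counts the rounds in which a suboptimal arm is played although none of conditions 1, 2, 3 holds; by the claim this count should be zero. Several things here are assumptions for illustration only: the unit-variance Gaussian rewards, `alpha=2.0`, the instance itself, the 0-based numbering (arm 0 plays the role of the optimal arm 1), the infinite index for unpulled arms, and ties going to the lowest arm number.

```python
import math
import random

def count_claim_violations(means, n, alpha=2.0, seed=0):
    """Run UCB for n rounds; return (violations, suboptimal_pulls)."""
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k    # T_j(t-1) for every arm j
    sums = [0.0] * k   # sum of the rewards observed from arm j
    violations = 0
    suboptimal_pulls = 0

    for t in range(1, n + 1):
        def mean(j):
            return sums[j] / pulls[j] if pulls[j] else 0.0

        def width(j):
            # Unpulled arms get an infinite width, so they are tried first.
            if pulls[j] == 0:
                return math.inf
            return math.sqrt(alpha * math.log(t) / pulls[j])

        index = [mean(j) + width(j) for j in range(k)]
        # max() returns the first maximiser, so ties go to the lowest arm.
        arm = max(range(k), key=lambda j: index[j])

        if arm != 0:
            suboptimal_pulls += 1
            gap = means[0] - means[arm]
            cond1 = mean(arm) - width(arm) > means[arm]   # arm i overestimated
            cond2 = mean(0) + width(0) < means[0]         # optimal arm underestimated
            cond3 = pulls[arm] < 4 * alpha * math.log(n) / gap ** 2
            if not (cond1 or cond2 or cond3):
                violations += 1

        reward = rng.gauss(means[arm], 1.0)
        pulls[arm] += 1
        sums[arm] += reward

    return violations, suboptimal_pulls

violations, suboptimal_pulls = count_claim_violations([0.9, 0.6, 0.3], n=5000)
```

Breaking ties in favour of arm 0 is what makes the non-strict chain of inequalities sufficient here: if the two indices are equal, arm 0 is the one that gets played.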
Now, we define $u = 4 \alpha \log n / \Delta_i^2$, which is a constant: $\alpha$ is some constant, $n$ is the number of rounds, which is again a constant, and $\Delta_i$ is the suboptimality gap, which is a constant. This quantity need not be an integer, so we can simply take the ceiling of it, but I will not write the ceiling all the time.

Now, how to bound the expected number of pulls of arm $i$? By definition,

$$T_i(n) = \sum_{t=1}^{n} \mathbb{1}\{I_t = i\}.$$

I am interested in the expectation, so I write $\mathbb{E}[T_i(n)]$ and split it into two parts, according to whether arm $i$ has been pulled fewer than $u$ times before round $t$ or not:

$$\mathbb{E}[T_i(n)] = \mathbb{E}\Big[\sum_{t=1}^{n} \mathbb{1}\{I_t = i,\ T_i(t-1) < u\}\Big] + \mathbb{E}\Big[\sum_{t=1}^{n} \mathbb{1}\{I_t = i,\ T_i(t-1) \ge u\}\Big].$$

Notice that when I said that if the $i$-th arm is played in round $t$ then at least one of the conditions must hold, that round $t$ was arbitrary: I could take it to be any round between 1 and $n$. I am going to use that logic here. The first expectation I can upper bound by $u$, because every pull of arm $i$ increases its count by one, so it can be pulled at most $u$ times while its count is still below $u$. So

$$\mathbb{E}[T_i(n)] \le u + \mathbb{E}\Big[\sum_{t=1}^{n} \mathbb{1}\{I_t = i,\ T_i(t-1) \ge u\}\Big].$$

So, the remaining indicator is based on a joint event: the $i$-th arm is played in round $t$, and the number of pulls of arm $i$ before that round is at least $u$. Now, at this point, we are going to appeal to the claim that we have made and proved.
We are saying that if arm $i$ is played in round $t$ and $T_i(t-1) \ge u$, then condition 3 is violated in that round. As I said, the claim is true for any round $t$. So, if condition 3 is violated, at least one of the other two conditions must hold. Because of that,

$$\mathbb{E}[T_i(n)] \le u + \mathbb{E}\Big[\sum_{t=1}^{n} \mathbb{1}\{I_t = i,\ \text{1 or 2 holds}\}\Big].$$

Now, applying the union bound, I am going to write it as

$$\mathbb{E}[T_i(n)] \le u + \mathbb{E}\Big[\sum_{t=1}^{n} \mathbb{1}\{I_t = i,\ \text{1 holds}\}\Big] + \mathbb{E}\Big[\sum_{t=1}^{n} \mathbb{1}\{I_t = i,\ \text{2 holds}\}\Big].$$

So, I have simply applied the union bound on these indicators: the indicator of "$I_t = i$ and (1 or 2 holds)" is at most the indicator of "$I_t = i$ and 1 holds" plus the indicator of "$I_t = i$ and 2 holds".

Now, pulling the expectations inside the summations, the expectation of an indicator is a probability, so you get

$$\mathbb{E}[T_i(n)] \le u + \sum_{t=1}^{n} \mathbb{P}(I_t = i,\ \text{1 holds}) + \sum_{t=1}^{n} \mathbb{P}(I_t = i,\ \text{2 holds}).$$

Further, each of these is a joint event, so if I just drop the $I_t = i$ part, I get a further upper bound:

$$\mathbb{E}[T_i(n)] \le u + \sum_{t=1}^{n} \mathbb{P}(\text{1 holds}) + \sum_{t=1}^{n} \mathbb{P}(\text{2 holds}).$$

Now, let us recall what condition 1 was. Condition 1 told us that the confidence interval of the $i$-th arm overshoots, that is, it overestimates the mean value of arm $i$.
So, let us write them separately. The probability that condition 1 holds in round $t$ is

$$\mathbb{P}\Big(\hat{\mu}_i(t-1) - \sqrt{\frac{\alpha \log t}{T_i(t-1)}} > \mu_i\Big),$$

and the probability that condition 2 holds in round $t$ is

$$\mathbb{P}\Big(\hat{\mu}_1(t-1) + \sqrt{\frac{\alpha \log t}{T_1(t-1)}} < \mu_1\Big).$$

Now, these probabilities we already know how to handle. If you take $\mu_i$ to the left-hand side, the first one is the probability that the difference between the estimate and the true parameter exceeds the confidence term. The second one is similar: the first is about the error in $\hat{\mu}_i$ and the second is about the error in the estimate $\hat{\mu}_1$. Both of them we know how to bound using our concentration inequality for sub-Gaussian distributions. So, I am going to work out the bound for the first one, and we are going to get a similar bound for the second.
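The decomposition of $T_i(n)$ used above, before expectations are taken, can be checked on a single simulated run: the pulls of arm $i$ made while $T_i(t-1) < u$ number at most $\lceil u \rceil$, and every later pull should fall in a round where condition 1 or condition 2 holds. The sketch below counts both. As before, the unit-variance Gaussian rewards, `alpha=2.0`, the instance, the 0-based numbering with arm 0 optimal, and the tie-breaking towards the lowest arm number are assumptions for illustration, and the function name is hypothetical.

```python
import math
import random

def pull_count_decomposition(means, i, n, alpha=2.0, seed=1):
    """Run UCB for n rounds and return (T_i(n), u, early, bad) for arm i != 0.

    early: pulls of arm i made in rounds with T_i(t-1) < u
    bad:   pulls of arm i made in rounds where condition 1 or 2 holds
    """
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    sums = [0.0] * k
    u = 4 * alpha * math.log(n) / (means[0] - means[i]) ** 2
    early = 0
    bad = 0

    for t in range(1, n + 1):
        index = []
        for j in range(k):
            if pulls[j] == 0:
                index.append(math.inf)   # unpulled arms are tried first
            else:
                width = math.sqrt(alpha * math.log(t) / pulls[j])
                index.append(sums[j] / pulls[j] + width)
        # max() returns the first maximiser, so ties go to the lowest arm.
        arm = max(range(k), key=lambda j: index[j])

        if arm == i:
            if pulls[i] < u:
                early += 1
            cond1 = False
            if pulls[i] > 0:
                width_i = math.sqrt(alpha * math.log(t) / pulls[i])
                cond1 = sums[i] / pulls[i] - width_i > means[i]
            cond2 = pulls[0] > 0 and index[0] < means[0]
            if cond1 or cond2:
                bad += 1

        reward = rng.gauss(means[arm], 1.0)
        pulls[arm] += 1
        sums[arm] += reward

    return pulls[i], u, early, bad

total, u, early, bad = pull_count_decomposition([0.9, 0.6, 0.3], i=1, n=5000)
```

Comparing `total` with `math.ceil(u) + bad` on a run gives the sample-path version of the bound; taking expectations and dropping the event $I_t = i$ is what turns the count `bad` into the two sums of probabilities.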