So, now we want to show that it suffices to exhibit parameters mu and mu' such that, on at least one of the two corresponding bandit instances, the regret is at least c sqrt(KT). Here mu denotes the collection of means of the Gaussian reward distributions, and mu' another such collection. If I can choose these parameters so that the maximum of the two regrets is larger than c sqrt(KT), that implies the regret lower bound. So, let us now look at these parameters. For mu, I am going to set mu_i = Delta for i = 1 and mu_i = 0 otherwise, where Delta > 0 is something we have not yet specified; for the time being, just assume it is strictly positive. Yes, we are running through the Gaussian case in this example, but we also said in the statement that the rewards are sigma-sub-Gaussian, and that is fine: if the result holds for the sigma-sub-Gaussian class, it automatically covers the bounded case as well. So, I am looking at one bandit instance where the first arm has mean Delta and all the other arms have mean 0. For this bandit instance, arm 1 is clearly the optimal arm. Now, what is the regret going to be? Let N_1(T) denote the number of plays of arm 1. The rounds in which you play the optimal arm contribute no regret; each of the remaining T - E[N_1(T)] rounds, in expectation, adds a regret of Delta. So the regret we are going to get for this instance is Delta times (T - E[N_1(T)]).
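The regret decomposition just described can be written as a one-line function; this is a minimal sketch, and the function name and the illustrative numbers below are my own, not from the lecture.

```python
# First bandit instance: mu = (Delta, 0, ..., 0), so arm 1 is optimal.
# Only rounds that do not play arm 1 incur regret, Delta each, giving
#   R_T = Delta * (T - E[N_1(T)]).
def regret_first_instance(delta, T, expected_pulls_of_arm1):
    """Expected regret of the instance mu = (delta, 0, ..., 0)."""
    return delta * (T - expected_pulls_of_arm1)

# Illustration: a policy that picks arm 1 with probability p each round
# has E[N_1(T)] = p * T (hypothetical numbers for the sketch).
delta, T, p = 0.1, 1000, 0.9
print(regret_first_instance(delta, T, p * T))  # Delta * (T - pT)
```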
So, this is the total regret for this problem instance. Now, if the total number of rounds is T, the claim is that there must be at least one arm that has been played at most T/K times in expectation. Is that correct? By what principle? The pigeonhole principle; we are just going to apply that. If every arm were played more than T/K times, the total number of plays would exceed T, which is not possible. So there exists an arm i with E[N_i(T)] at most T/K. In fact, let us exclude the optimal arm: the remaining K - 1 arms share at most T plays among them, so there must exist some i not equal to 1 with E[N_i(T)] at most T/(K-1). The earlier bound with T/K was also correct, but it was loose. Now, based on this arm, we are going to construct the second bandit instance as follows. I define the new set of parameters by mu'_j = mu_j for j not equal to i, and mu'_i = 2 Delta. So what we are saying is: there is at least one arm i not equal to 1 whose expected number of pulls is at most T/(K-1), and for that arm I now make the mean reward 2 Delta, keeping the others the same. What has changed? Earlier my mu looked like (Delta, 0, 0, ..., 0), and now my mu' looks like (Delta, 0, ..., 2 Delta, ..., 0), with the 2 Delta sitting in the i-th position, whichever arm that turned out to be.
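The pigeonhole step can be checked mechanically; this is a small sketch with made-up pull counts, just to make the inequality concrete.

```python
# Pigeonhole sketch: the K - 1 suboptimal arms share at most T pulls in
# total, so at least one of them is pulled at most T / (K - 1) times.
def least_pulled_suboptimal(pull_counts):
    """pull_counts[i] = pulls of arm i; arm 0 plays the role of arm 1."""
    return min(pull_counts[1:])

T, K = 100, 5
counts = [60, 25, 5, 7, 3]            # hypothetical counts summing to T
assert sum(counts) == T
assert least_pulled_suboptimal(counts) <= T / (K - 1)
```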
So, what is the optimal arm in the first bandit instance, and what is it in the second? Arm 1 in the first, and the i-th arm in the second. The two instances differ only in the i-th arm, yet the two optimal arms are different. Now the question is: can I set Delta in such a way that even when the true bandit instance happens to be the second one, my algorithm still thinks it is the first instance and ends up selecting arm 1 most of the time? If I can do this, then I have made my algorithm make an error, and make it most of the time. So, let us see what a good choice of Delta would be. From here on the argument will be quite hand-wavy; we will make it more formal later. Notice that the upper and lower bounds we have seen involve a factor of n Delta squared. Before I write the choice, one remark: the expectation written above is the one induced by the interaction between my policy and the environment, where the environment here is defined by nu, the first instance. Now, what is the expected regret on the second instance if I continue to apply the same policy? Arm 1 is no longer optimal, and each of its plays costs a regret of Delta, because the optimal arm now has mean 2 Delta while arm 1 has mean Delta. There is no regret from the i-th arm in this instance, because it is the optimal one. Every other arm j, with j not equal to 1 and not equal to i, costs a regret of 2 Delta per play. So the regret is Delta times E'[N_1(T)] plus, summed over j not in {1, i}, 2 Delta times E'[N_j(T)], and this quantity is at least Delta times E'[N_1(T)].
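In symbols, the regret decomposition just described for the second instance reads as follows, where E' denotes expectation under nu' (this is only the inequality stated above, written out term by term):

```latex
R_T(\pi,\nu') \;=\; \Delta\,\mathbb{E}'[N_1(T)]
\;+\; \sum_{j \neq 1,\, i} 2\Delta\,\mathbb{E}'[N_j(T)]
\;\;\ge\;\; \Delta\,\mathbb{E}'[N_1(T)].
```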
So, I have just written the expression for the regret incurred on the second bandit instance. Note that the expectation E' here is induced by the interaction of my policy pi with the environment nu', whereas the earlier expectation E was induced by the interaction of pi with nu. Now suppose I set Delta to be 1 over the square root of E[N_i(T)], where E[N_i(T)] is the expected number of pulls of arm i under the original instance; recall we know this quantity is upper bounded by T/(K-1). What I have basically done is this: if you ignore the expectation and, with some hand-waving, treat this simply as the number of pulls of the i-th arm, then I am setting Delta squared times the number of pulls of arm i equal to 1. That is, I am keeping the factor n Delta squared close to 1. If I can do this, you can again go back to the lower bound we derived earlier for the sample mean of a Gaussian random variable and see that my algorithm will fail to identify the i-th arm as optimal on this instance: it will confuse this instance with the first one and think that arm 1 is the optimal arm most of the time.
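The choice of Delta just described, combined with the pigeonhole bound from before, can be written compactly; the second implication is the form of the bound used in the two cases later on:

```latex
\Delta \;=\; \frac{1}{\sqrt{\mathbb{E}[N_i(T)]}}
\;\Longrightarrow\;
\Delta^2\,\mathbb{E}[N_i(T)] \;=\; 1,
\qquad
\mathbb{E}[N_i(T)] \;\le\; \frac{T}{K-1}
\;\Longrightarrow\;
\Delta \;\ge\; \sqrt{\frac{K-1}{T}}.
```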
So, we will again make this more formal, but that is the idea: if you can formally argue that, by this choice of Delta, even on the second instance the algorithm is likely to keep thinking it is facing the first instance and plays arm 1 most of the time, then it is making the wrong choice of optimal arm. To be clear about a question that came up: no, the instance does not change mid-run. Once the algorithm starts, the bandit instance is fixed; I simply restart the algorithm and give it a fresh instance. Because even on the second instance my algorithm gets confused and thinks it is the first one, it is going to spend most of its time on the first arm there too, so we expect the number of pulls of arm 1 under the second instance to be almost the same as under the first instance, that is, E'[N_1(T)] is approximately E[N_1(T)]. Now I think we are more or less done. First, consider the case where E[N_1(T)] is at most T/2. Earlier we demonstrated that R_T(pi, nu) equals Delta times (T - E[N_1(T)]). If E[N_1(T)] is at most T/2, this is at least Delta times T/2, and since Delta is at least the square root of (K-1)/T, we get R_T(pi, nu) at least (T/2) times sqrt((K-1)/T), which is one half of sqrt(T(K-1)).
Now consider the second case, where E[N_1(T)] is at least T/2. We have also argued that R_T(pi, nu') has the lower bound Delta times E'[N_1(T)]. Using the approximation E'[N_1(T)] approximately equal to E[N_1(T)], this is again at least Delta times T/2, and by the same substitution for Delta it is at least one half of sqrt(T(K-1)). So in both cases we have this lower bound, and if you now take the max over these two instances, the regret is of order sqrt(T(K-1)), with a factor of one half. These are very heuristic, top-level arguments: what is happening is that we are arguing that if my algorithm confuses one instance with the other, then on one of these two instances the regret is of order sqrt(TK). Now we have to make this a bit more formal. For this we need some information-theoretic quantities and some bounds based on them; I am just going to introduce them in today's class, and in the next class we will try to go through the steps formally. How many of you already know entropy? About half of you, OK. So let us go through some definitions. Suppose we have a probability distribution P on an alphabet of K symbols. We define the entropy of this distribution as H(P) equals the sum over i of p_i log(1/p_i). Those of you who know entropy: can you tell me its operational meaning?
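The two-case arithmetic above can be sanity-checked numerically; this is a sketch of the heuristic bound only (the case split itself is the argument in the text), with illustrative values of T and K chosen by me.

```python
import math

# Heuristic two-case bound: with Delta = sqrt((K - 1) / T), both the case
# E[N_1(T)] <= T/2 (regret on nu) and the case E[N_1(T)] >= T/2 (regret on
# nu', via E'[N_1(T)] ~ E[N_1(T)]) give regret at least Delta * T / 2.
def heuristic_lower_bound(T, K):
    delta = math.sqrt((K - 1) / T)
    case1 = delta * T / 2   # R_T(pi, nu)  >= Delta * T/2
    case2 = delta * T / 2   # R_T(pi, nu') >= Delta * T/2 (approximately)
    return max(case1, case2)

# In both cases this equals (1/2) * sqrt(T * (K - 1)).
T, K = 10000, 11
print(heuristic_lower_bound(T, K))
```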
We know that the amount of information contained in an event is inversely related to its probability. Suppose an event happens with probability 1: is there any information contained in observing it? No, you already knew it was going to happen. If an event is very unlikely, then the information contained in it is large, because when something rare happens, that is big news. In the same spirit, you can give another interpretation: if you want to assign codewords to symbols, should the frequent symbols get longer codes or shorter codes? The symbols that occur very frequently should get shorter codes, because they appear again and again, while the symbols that occur rarely will be forced to take longer codes, since you need to distinguish them, and if all the short codewords are taken by the frequent symbols, what remains is the longer ones. So you want to encode what is frequent with short code lengths and what is rare with longer ones. In that way, if you think of log(1/p_i) as the length of the codeword for symbol i, then H(P) is basically the expected code length, and the operational meaning of entropy is that it is the minimum average code length you need if you want to recover the message correctly. One technical point: I am assuming that all the p_i have positive mass, because if some p_i is 0, then 1/p_i is not well defined; so all the alphabet symbols here have positive probability.
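The definition and the coding interpretation can be illustrated in a few lines; a minimal sketch, measuring entropy in bits (base-2 logarithm), with the example distribution chosen by me.

```python
import math

# Entropy of a distribution p on a finite alphabet, in bits:
#   H(p) = sum_i p_i * log2(1 / p_i),
# interpretable as the minimum average code length per symbol.
def entropy(p):
    assert abs(sum(p) - 1.0) < 1e-9 and all(pi > 0 for pi in p)
    return sum(pi * math.log2(1 / pi) for pi in p)

# For p = (1/2, 1/4, 1/4) a code with lengths (1, 2, 2) bits matches
# log2(1/p_i) exactly, and the average length equals the entropy:
#   0.5*1 + 0.25*2 + 0.25*2 = 1.5 bits.
print(entropy([0.5, 0.25, 0.25]))  # 1.5
```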
Then there is a quantity called divergence, which we also defined in the last class. Here I am assuming that P and Q are distributions on the same probability space. The divergence between these two is defined as D(P||Q) equals the sum over i of p_i log(p_i/q_i). Can I write it in terms of entropy? Yes: it is the sum over i of p_i log(1/q_i), minus H(P). So you can give the same interpretation we just gave. Suppose messages are generated according to the distribution P, but you misinterpret the situation and assume they are generated according to another distribution Q. The true generation is P, but you got confused and code for Q, so the lengths you assign are log(1/q_i), while the expected length is still taken under P. That first term is the expected length of your code, and H(P) is the best you could have done; you are not doing that well. What do you think: does this difference have to be positive or negative? It has to be nonnegative, because the mismatched code length is at least the optimal one. We already discussed this divergence in the proof of KL-UCB, where, at least for the Bernoulli case, it gave us very tight bounds. In general, this divergence goes by other names: the Kullback-Leibler divergence, or KL divergence for short. It has some nice properties and in a sense measures the distance between two distributions, even though it is not a true metric: it is not symmetric and does not satisfy the triangle inequality.
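The definition and the "not a true metric" remark can be checked directly; a small sketch assuming full-support distributions, with the example pair P, Q chosen by me to exhibit the asymmetry.

```python
import math

# KL divergence between discrete distributions with full support, in bits:
#   D(P || Q) = sum_i p_i * log2(p_i / q_i)
#             = (cross-entropy of P coded for Q) - H(P)  >=  0.
def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.25, 0.25]
q = [0.1, 0.45, 0.45]
print(kl(p, p))            # 0: coding for the true distribution costs nothing extra
print(kl(p, q), kl(q, p))  # both positive, and generally unequal
```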
Now I am just going to state one result and then we will stop; it will come in handy when we state our proof formally. By the way, suppose P happens to be Gaussian with mean mu_1 and variance sigma squared, and Q is another Gaussian with mean mu_2 and the same variance sigma squared. Notice that the way I defined divergence above assumes discrete probability distributions. But the distributions could be over continuous random variables, in which case I have to define it appropriately. That definition is a bit more involved and depends on the Radon-Nikodym derivative, but let me write it here: D(P||Q) is the integral of log of (dP/dQ)(omega) with respect to dP(omega), whenever P is absolutely continuous with respect to Q, and it is defined to be infinity otherwise. The absolute continuity condition is there to make sure the quantity is well defined: I do not want to end up dividing by zero, or with a 0/0 form. Just to recall what absolute continuity means: whenever Q assigns zero probability to a set, P assigns zero probability to it as well. In the discrete picture, if p_i is 0 while q_i is not, that term contributes 0, using the convention that 0 log 0 equals 0; but wherever p_i is positive, we need q_i to be positive. One more caution: dP/dQ here is the Radon-Nikodym derivative, which is not simply the ratio of the density functions in general; let us not go into that now, this is a quantity we have to interpret slightly differently.
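The role of absolute continuity shows up already in the discrete case; a small sketch of the convention just stated, with the example distributions chosen by me.

```python
import math

# Discrete KL with the standard conventions:
#   0 * log(0 / q) := 0,  and  p * log(p / 0) := +inf for p > 0,
# i.e. D(P || Q) is finite only if P is absolutely continuous w.r.t. Q.
def kl_extended(p, q):
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue               # 0 * log(0 / q) contributes nothing
        if qi == 0:
            return math.inf        # P puts mass where Q puts none
        total += pi * math.log(pi / qi)
    return total

print(kl_extended([0.5, 0.5, 0.0], [0.5, 0.25, 0.25]))  # finite
print(kl_extended([0.5, 0.25, 0.25], [0.5, 0.5, 0.0]))  # inf
```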
So now, if P and Q are these two Gaussians, you can compute the divergence explicitly; we only need this special case, which is why I am writing it down instead of working everything out: D(P||Q) equals (mu_1 minus mu_2) squared, divided by 2 sigma squared. Next, a theorem. Let P and Q be probability measures on the same measurable space (Omega, F), and let A in F be an event, with A complement equal to Omega minus A. Then P(A) + Q(A complement) is at least one half times exp(minus D(P||Q)); this is sometimes called the Bretagnolle-Huber inequality. We are not going to prove this. What it basically says is: if you are interested in an event A under some measure P, then the probability of A under P, plus the probability of the event not happening under the other measure Q, is lower bounded like this. In other words, if the two measures are close in divergence, you cannot make both P(A) and Q(A complement) small at the same time: any test will confuse the two measures with substantial probability. We are just going to use this result later in the proof. So, let us stop here; we will continue in the next class.
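Both the Gaussian divergence formula and the inequality can be sanity-checked numerically; a sketch under my own choice of event A = {x > delta/2} for P = N(0,1) and Q = N(delta,1), using only the standard normal CDF.

```python
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

# Check  P(A) + Q(A^c) >= (1/2) * exp(-D(P||Q))  for
# P = N(0, 1), Q = N(delta, 1), A = {x > delta / 2}.
# By the formula above, D(P||Q) = delta^2 / 2 (means delta apart, sigma = 1).
for delta in [0.1, 0.5, 1.0, 2.0]:
    p_A = 1.0 - Phi(delta / 2)    # P(A):   N(0,1) mass above delta/2
    q_Ac = Phi(-delta / 2)        # Q(A^c): N(delta,1) mass at or below delta/2
    kl = delta ** 2 / 2
    assert p_A + q_Ac >= 0.5 * math.exp(-kl)
    print(delta, p_A + q_Ac, 0.5 * math.exp(-kl))
```

Note that both sides shrink as delta grows: far-apart measures are easy to tell apart, while for small delta the left side stays near 1, exactly the regime the lower-bound argument exploits.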