Like we discussed in the last class, let us take an environment nu with mean vector (mu_1, 0, 0, ..., 0), where I assume the mean mu_1 of the first arm lies between 0 and 1/2; we are going to choose its exact value later. So I have this one environment, and I already know that when I run my policy on this bandit instance there exists at least one arm that has been played, in expectation, no more than T/(k-1) rounds. At least one such arm exists, right. Let us call that arm i; so I know this i has been played at most T/(k-1) rounds in expectation. Here I am assuming that the first arm is the optimal arm, and I am looking among the arms other than the optimal one. Now, for this i, I am going to change the mean reward and construct a new bandit instance nu' = (mu_1, 0, ..., 0, 2 mu_1, 0, ..., 0), where the 2 mu_1 sits exactly at the i-th position. Do not confuse the notation here: initially I said mu was a vector; mu_1 is a scalar, the mean of the first arm, and we will choose it later. Now, we already know what the regret is: the regret of a policy pi on environment nu is nothing but the sum over arms of Delta_j times the expected number of pulls of arm j, for j from 1 to k, where the expectation is with respect to the distribution induced by running the policy on this environment nu. Similarly, you can write the same thing when you replace nu by nu'. Now I am going to write down two claims; let us check whether they are correct.
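Since the spoken notation gets tangled here, the construction written out (using T for the horizon, k for the number of arms, and N_j(T) for the number of pulls of arm j, as in the lecture) is:

```latex
% Base environment: arm 1 is optimal with mean \mu_1, all other arms have mean 0.
\nu = (\mu_1, 0, 0, \dots, 0), \qquad 0 < \mu_1 \le \tfrac12 .
% Pigeonhole: some suboptimal arm i is pulled rarely in expectation under \nu,
\mathbb{E}_{\nu}[N_i(T)] \;\le\; \frac{T}{k-1}.
% Perturbed environment: identical to \nu except at the i-th coordinate,
\nu' = (\mu_1, 0, \dots, 0, \underbrace{2\mu_1}_{i\text{-th entry}}, 0, \dots, 0).
% Regret decomposition (expectation under the law induced by running \pi on \nu):
R_T(\pi,\nu) \;=\; \sum_{j=1}^{k} \Delta_j\, \mathbb{E}_{\nu}\!\left[N_j(T)\right].
```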
The regret I am going to incur from policy pi on this environment nu is at least P_nu(N_1(T) <= T/2) times T mu_1 / 2. Is this correct? I am basically saying the number of pulls of the optimal arm, arm 1, is at most T/2: it has been played at most T/2 rounds. If that is the event I am looking at, then it must be the case that the suboptimal arms have been played more than T/2 rounds in total, and on each of those rounds I incur a regret of at least mu_1, because whenever I play a suboptimal arm I incur regret mu_1 - 0 in that round. Is that clear? So whenever this event happens I collect a regret of at least T mu_1 / 2, and that event happens with the probability shown; the regret is of course an expectation, and this quantity lower-bounds it. Just write out the expectation and manipulate it and you will get that. Similarly on nu': now I take the event that in the second environment I have played arm 1 more than T/2 rounds. If I have played arm 1 in my second bandit more than T/2 rounds, how much regret must I have incurred? At least T mu_1 / 2 again, because in nu' the optimal arm is arm i with mean 2 mu_1, so every round I play arm 1 I incur regret 2 mu_1 - mu_1 = mu_1.
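On the board, the two claims read:

```latex
% Under \nu, every pull of a suboptimal arm costs \mu_1 - 0 = \mu_1 that round:
R_T(\pi,\nu) \;\ge\; \frac{T\mu_1}{2}\; P_{\nu}\!\left(N_1(T) \le \tfrac{T}{2}\right).
% Under \nu', arm i is optimal with mean 2\mu_1, so every pull of arm 1 costs
% 2\mu_1 - \mu_1 = \mu_1:
R_T(\pi,\nu') \;\ge\; \frac{T\mu_1}{2}\; P_{\nu'}\!\left(N_1(T) > \tfrac{T}{2}\right).
```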
And if I have played arm 1 itself more than T/2 rounds, I must also have incurred at least T mu_1 / 2 regret; but the probability of that event is taken under nu': P_nu'(N_1(T) > T/2), times T mu_1 / 2. Is that clear? Now I have these two lower bounds, each with the factor T mu_1 / 2. So I will just take their sum: R_T(pi, nu) + R_T(pi, nu') is at least (T mu_1 / 2) times the sum of these two probabilities. Yes, the policy is the same, but as I said, the probabilities are under the two different environments, so we are dealing with two different distributions of the same counting quantity. What I am saying is: in the first environment I look at the event that the number of pulls is at most T/2; when I consider the event that the number of pulls of the same arm is greater than T/2, its distribution is now governed by nu'. But I am still going to call the number of pulls of that arm N_1(T); N_1(T) just counts the pulls of arm 1. How it is distributed is governed by the underlying environment. It is the same counting in either environment; the values will be different, yes, of course, because the distribution is governed by the environment. I am just looking at the event of whether that arm has been pulled more than this many rounds or not: here, whether it has been pulled at most T/2 rounds, and there, whether it has been pulled more than T/2 rounds. Yes, the distributions are different.
Yeah, the second one depends on nu'; it is with respect to that distribution. Now notice: this is an event A and this is the event A complement, even though I am looking at their probabilities in different environments, that is, with respect to two different probability measures. So now I can use the result we had: I have two events A and A complement, where A corresponds to the number of pulls of arm 1 being at most T/2; the measure P is the one induced by environment nu, and Q is the one induced by nu'. Then the sum of the regrets is at least (T mu_1 / 2) times (1/2) exp(-D(P_nu, P_nu')). This applies because the underlying sample space is the same for both, right: you are looking at the same set of arms, and the rewards come from the same range, even though the distributions are different. And what is that underlying sample space? Over T rounds I observe T arm pulls and T rewards, and since we have assumed the rewards to be Gaussian they are real numbers. So the sample space is ({1, 2, ..., k} x R)^T, the Cartesian product taken T times; both distributions are defined on this same space. Do you understand this notation? Good. Now we want to invoke the divergence decomposition. The way nu and nu' are constructed, they are the same for every arm and only differ at the i-th location. Only at that i will the per-arm divergence be non-zero; for all the others it is 0, because their distributions are identical.
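The result being invoked here is usually called the Bretagnolle-Huber inequality: for any two measures P, Q on the same space and any event A, P(A) + Q(A^c) >= (1/2) exp(-D(P, Q)). As a quick numerical sketch (the function names here are mine, not from the lecture), we can check it for two unit-variance Gaussians at the event A = {x > delta/2}, which is where the likelihood ratio flips and which minimizes the sum:

```python
import math

def Phi(x):
    # Standard normal CDF written via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bh_check(delta):
    # P = N(0, 1), Q = N(delta, 1); their KL divergence is delta^2 / 2.
    kl = delta ** 2 / 2.0
    # A = {x > delta/2} minimizes P(A) + Q(A^c); by symmetry both terms
    # equal 1 - Phi(delta/2).
    lhs = 2.0 * (1.0 - Phi(delta / 2.0))
    rhs = 0.5 * math.exp(-kl)  # Bretagnolle-Huber lower bound
    return lhs, rhs

for d in [0.25, 0.5, 1.0, 2.0, 4.0]:
    lhs, rhs = bh_check(d)
    assert lhs >= rhs, (d, lhs, rhs)
```

Even at the minimizing event the sum stays above (1/2) exp(-KL), which is exactly why a small divergence forces the policy to pay regret in at least one of the two environments.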
I have changed the distribution of only one arm. So the decomposition, which in general is a sum over j from 1 to k, collapses to a single term: the divergence is simply E_nu[N_i(T)] times the divergence between the i-th reward distributions; every other term is 0, because for all the other arms the distributions are the same and hence their divergence is 0. Note, by the way, that having the same mean does not by itself mean having the same distribution. So let us state it clearly: we are not disturbing the distributions of the other arms at all, and only changing the i-th arm so that it has a different mean, 2 mu_1. The rest of the distributions remain the same, and so their means also remain the same. That is what lets us have a good expression for this divergence; in general we do not have a good closed form for a divergence, but for two Gaussians we do, as I told you in the last class. What is the divergence between two Gaussians with the same variance? It is (m_1 - m_2)^2 / (2 sigma^2). So let us put in the values. I am comparing the distributions of the i-th component only. The i-th component has what mean here? 0. And there it is 2 mu_1. So we get (0 - 2 mu_1)^2 divided by, 2 sigma^2 or just sigma^2? According to the formula it should be 2 sigma^2. For now we assume the variance sigma^2 is fixed and the same for all arms; only the means change. Without loss of generality we can take sigma^2 = 1. And what do we know about E_nu[N_i(T)] for this i? We already know that under environment nu this arm has been pulled, in expectation, no more than T/(k-1) rounds.
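The Gaussian formula used here, D(N(m1, s^2) || N(m2, s^2)) = (m1 - m2)^2 / (2 s^2), is easy to check numerically. This sketch (names are illustrative) integrates p log(p/q) on a grid for the i-th arm's pair of distributions, mean 0 versus mean 2 mu_1:

```python
import math

def kl_same_variance_numeric(m1, m2, sigma, lo=-12.0, hi=12.0, n=200000):
    # Midpoint-rule integral of p(x) * log(p(x)/q(x)) for two Gaussians
    # with means m1, m2 and a common variance sigma^2.
    h = (hi - lo) / n
    norm = sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * h
        p = math.exp(-((x - m1) ** 2) / (2.0 * sigma ** 2)) / norm
        # log(p/q) has this closed form; using it avoids tail underflow.
        log_ratio = ((x - m2) ** 2 - (x - m1) ** 2) / (2.0 * sigma ** 2)
        total += p * log_ratio * h
    return total

mu1, sigma = 0.3, 1.0
closed_form = (2.0 * mu1) ** 2 / (2.0 * sigma ** 2)   # = 2 * mu1^2 for sigma = 1
numeric = kl_same_variance_numeric(0.0, 2.0 * mu1, sigma)
assert abs(numeric - closed_form) < 1e-6
```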
So the divergence I have between the two environments, just repeating what we derived, is E_nu[N_i(T)] times (2 mu_1)^2 / (2 sigma^2), which is at most (T/(k-1)) times 2 mu_1^2 once I take sigma^2 = 1. Now I have an upper bound on the divergence. Let us go back and plug it in: since the divergence appears with a minus sign inside the exponential, plugging in the upper bound gives a further lower bound, R_T(pi, nu) + R_T(pi, nu') >= (T mu_1 / 4) exp(-2 T mu_1^2 / (k-1)). So, making use of both results, I have shown that the sum of the regrets of the same policy under those two environments has to be at least this much. Now we are going to set mu_1: we choose it so that mu_1^2 is exactly (k-1)/(4T), that is, mu_1 = sqrt((k-1)/(4T)). By assumption, the hypothesis of my theorem was that T is at least k-1; we have assumed the number of rounds is at least that, right. So (k-1)/T is at most 1, hence mu_1^2 is at most 1/4 and mu_1 is at most 1/2. That is what I wanted: initially the means should be in the interval [0, 1], and I have deliberately made mu_1 <= 1/2 so that in my other environment nu', where I have the mean 2 mu_1, that mean is also between 0 and 1. So in both environments all the means lie in [0, 1]. I just want to explicitly give you an environment whose means lie in the interval [0, 1] for which this holds; other choices could also be fine, but I want to be consistent with the theorem statement, where I said I would produce the environment from that class.
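Chaining the pieces together:

```latex
% Bretagnolle--Huber with the divergence upper bound (the minus sign means an
% upper bound on the divergence yields a lower bound on the regret sum):
R_T(\pi,\nu) + R_T(\pi,\nu')
  \;\ge\; \frac{T\mu_1}{2}\cdot\frac12\,
          \exp\!\left(-\frac{2T\mu_1^2}{k-1}\right)
  \;=\; \frac{T\mu_1}{4}\,\exp\!\left(-\frac{2T\mu_1^2}{k-1}\right).
% The chosen value of \mu_1 makes the exponent exactly 1/2, and the hypothesis
% T \ge k-1 keeps all means in [0,1]:
\mu_1 = \sqrt{\frac{k-1}{4T}}
\;\Longrightarrow\;
\frac{2T\mu_1^2}{k-1} = \frac12,
\qquad
T \ge k-1 \;\Longrightarrow\; \mu_1 \le \frac12 .
```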
So suppose later you want to make the same claim for the case where the rewards are bounded. Even though we are doing it for Gaussian distributions here, if you want to do it for rewards in some bounded interval, with the means further restricted to some interval, you want to ensure you can come up with an environment from that class on which any policy suffers this regret. Whatever class you claim the bound for, you have to give me an environment from that class on which the policy fails. That is why I am constructing the environment from the class I was looking at, fine. So now, with mu_1 set like this, what do I get? Let me just simplify: the exponent 2 T mu_1^2 / (k-1) becomes exactly 1/2, so the exponential is exp(-1/2) = 1/sqrt(e), and the sum becomes (T/4) sqrt((k-1)/(4T)) / sqrt(e) = sqrt(T(k-1)) / (8 sqrt(e)). Now you can see where the 1/27 I wrote in the theorem is going to come from: it comes from 16 times sqrt(e). But we are not there yet; what we have so far is only the sum of the two regrets. So how does the final claim, that my regret is at least this much, come about? The worst-case regret I incur is the max of these two regrets, and the max of two quantities is at least half of their sum. So let me write that: 2 times the max is at least the sum. When I apply a policy pi, my regret on a particular environment is R_T(pi, mu).
Now what I am saying is: the max of these two regrets, whether you look at the regret in this environment or that one, is at least this sum divided by 2, taking the 2 to the other side. I could be facing either of these environments, right; that is not in my control. I could be facing nu or I could be facing nu'. In one of these cases I am going to do worse, and I can guarantee that the worse case is at least this much. You may be lucky and do extremely well on the environment nu, but then on nu' you may take a hit, because you may confuse nu' for nu and incur a large regret there; or it may be the other way round: you do extremely well on nu' but get confused when playing on environment nu and do extremely badly there. So my worst-case regret on the particular environment I face is at least max(R_T(pi, nu), R_T(pi, nu')) >= (1/2)(R_T(pi, nu) + R_T(pi, nu')) >= sqrt(T(k-1)) / (16 sqrt(e)), and 16 sqrt(e) is approximately 26.4, less than 27, which gives the (1/27) sqrt(T(k-1)) in the theorem. So yes, that is the proof. Once you have some nice results relating an event and its complement under different measures, you are able to show this; but if I just told you to prove this, how would you get all this intuition, how would you know to do all these things? I mean, this is hard. In general these lower bounds are very hard: it is hard to come up with instances and to get the intuition about which instances can make your algorithm potentially get confused, think one environment is the other, and make mistakes. But somebody has done it; these results take quite some effort, and you can see they need some nice tricks.
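Plugging everything in numerically, with illustrative values of T and k (the function name is mine), confirms the final constant:

```python
import math

def minimax_lower_bound(T, k):
    # Max of the two regrets is at least half the sum
    #   (T * mu1 / 4) * exp(-2 T mu1^2 / (k-1))
    # evaluated at mu1 = sqrt((k-1) / (4T)).
    mu1 = math.sqrt((k - 1) / (4.0 * T))
    total = (T * mu1 / 4.0) * math.exp(-2.0 * T * mu1 ** 2 / (k - 1))
    return total / 2.0

T, k = 10000, 11
lb = minimax_lower_bound(T, k)
# Closed form: sqrt(T (k-1)) / (16 sqrt(e)), and 16 sqrt(e) ~ 26.4 < 27.
closed = math.sqrt(T * (k - 1)) / (16.0 * math.sqrt(math.e))
assert abs(lb - closed) < 1e-9
assert lb >= math.sqrt(T * (k - 1)) / 27.0
```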
And so, yeah, the thing is: in general, to prove lower bounds, these are the tricks; I do not think there are many more tricks available. Somebody got hold of these tricks and was able to do this, but if you understand all of this, then when you want to prove a lower bound in some other case, by generally following these tricks you should be able to prove the lower bound for that case also. The main ingredients are this result and the divergence result. For algorithms we can think constructively, "this is how I am going to do it"; but for proving a lower bound you have to make any algorithm fail, and you have to construct such an environment. That may not always be easy, but that is fine; we now have some idea of the potential tools and tricks to use to get a feel for what the lower bound looks like. So, let us stop here.