So, let us start from where we stopped last time. Last class we broadly discussed the idea behind the lower bound proof, right? We stated that for any stochastic $K$-armed bandit where the rewards are either bounded or $\sigma$-sub-Gaussian, the minimax regret has to scale like some constant times $\sqrt{TK}$. Last time we only discussed the idea; let us now make that formal in today's class.

Recall that at the end of the last class, I stated one result, the Bretagnolle–Huber inequality: for any two measures $P$ and $Q$ and any event $A$,

$$P(A) + Q(A^c) \ge \tfrac{1}{2}\, e^{-D(P, Q)}.$$

We are going to crucially exploit this result. For the proof, I need a little notation, so let me define it.

There is an environment and there is a learner. Let us fix a policy of the learner, call it $\pi$. When $\pi$ interacts with the environment, the interaction induces a distribution on the way actions and rewards are drawn. What do I mean by that? If in round one you play, say, action $A_1$, you observe the corresponding reward $X_1$; then you play $A_2$ and observe $X_2$, and so on until round $T$, where you play $A_T$ and observe $X_T$. Here $X_1, X_2, \dots, X_T$ are the random samples you have observed by playing the corresponding arms, and $A_1, A_2, \dots, A_T$ are the arms you have pulled. The arm pulls themselves are random, because each pull is induced by what you have observed so far.

So, if you look at the sequence $(A_1, X_1, A_2, X_2, \dots, A_T, X_T)$, which consists of actions and rewards, its distribution is induced by the environment and your policy: $\nu$ is your environment and $\pi$ is your policy. The sequence we observe through the interaction between the learner and the environment is random, and its distribution is induced by $\nu$ and $\pi$ together. Let me call that induced distribution $P_{\nu\pi}$, is that okay?

Now, how do we write this? For a particular realization $(a_1, x_1, a_2, x_2, \dots, a_T, x_T)$, the probability is defined as

$$P_{\nu\pi}(a_1, x_1, \dots, a_T, x_T) = \prod_{t=1}^{T} \pi(a_t \mid a_1, x_1, \dots, a_{t-1}, x_{t-1})\; p_{a_t}(x_t).$$

The first factor is the policy, and the second, $p_{a_t}(x_t)$, is the probability of observing the sample $x_t$ when you play arm $a_t$; this comes from your environment. Whatever policy you use, it plays action $a_t$ based on the observations made so far, so the action is random; and after you play action $a_t$, you observe a sample $x_t$ drawn from the distribution of arm $a_t$. Note that we assume the environment stays the same throughout: every time you play action $a_t$, your sample comes from the same distribution, and that is what the underlying environment captures.

Now, similarly, suppose there is another environment, call it $\nu'$. Under the same policy $\pi$, the environment $\nu'$ will also induce a distribution, which I am going to call $P_{\nu'\pi}$, again over sequences $(a_1, x_1, a_2, x_2, \dots, a_T, x_T)$.
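To make this induced distribution concrete, here is a minimal sketch in Python of one interaction between a policy and an environment; the Bernoulli rewards, the uniform policy, and all the names are illustrative choices of mine, not something fixed in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_trajectory(policy, means, T):
    """Sample one trajectory (A_1, X_1, ..., A_T, X_T) under (pi, nu).

    `policy` maps the history [(a_1, x_1), ...] to the next arm index;
    `means` are the Bernoulli means of the K arms (the environment nu)."""
    history = []
    for t in range(T):
        a = policy(history)            # A_t may depend on everything seen so far
        x = rng.binomial(1, means[a])  # X_t ~ P_{A_t}, fixed by the environment
        history.append((a, x))
    return history

# Example: a policy that picks an arm uniformly at random each round.
K = 3
uniform_policy = lambda history: int(rng.integers(K))
print(run_trajectory(uniform_policy, means=[0.2, 0.5, 0.8], T=5))
```

Running this repeatedly samples trajectories from exactly the distribution $P_{\nu\pi}$ written above: each round contributes one policy factor and one environment factor.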
Everything remains the same; the only thing that changes is the environment factor, which becomes $p'_{a_t}(x_t)$, because my environment has changed. Our first claim is a lemma, the divergence decomposition lemma. Since I have fixed a policy $\pi$ and only changed the environment, I am considering two environments, $\nu$ and $\nu'$, and the divergence between the two induced distributions. We are going to argue that

$$D(P_{\nu\pi}, P_{\nu'\pi}) = \sum_{k=1}^{K} \mathbb{E}_{\nu}\big[N_k(T)\big]\, D(P_k, P'_k),$$

where $P_k$ and $P'_k$ are the distributions of arm $k$ under $\nu$ and $\nu'$, and $N_k(T)$ is the number of times arm $k$ is pulled in $T$ rounds. And what is the expectation here? It is taken with respect to the distribution that the underlying environment $\nu$ induces.

Now, why is this true? When I look at the induced distribution $P_{\nu\pi}$, the components $a_1, a_2, \dots, a_T$ are actions, each taking one of the $K$ discrete values, while $x_1, x_2, \dots, x_T$ are the reward samples coming from the different arms, and these are continuous-valued, okay? So $P_{\nu\pi}$ is defined on a vector where some components are continuous and some are discrete, and I have to use the appropriate definition of divergence between such distributions. We have already defined the divergence for continuous random variables: in the last class, we wrote the divergence between two continuous distributions, when one is absolutely continuous with respect to the other, in terms of an integral involving the term $\log \frac{dP_{\nu\pi}}{dP_{\nu'\pi}}$. What do I mean by $\frac{dP_{\nu\pi}}{dP_{\nu'\pi}}$? As I also said in the last class, this is a Radon–Nikodym derivative, but let us not get into that; just think of it as the ratio of the density of $P_{\nu\pi}$ to the density of $P_{\nu'\pi}$, evaluated at the corresponding point.

Let us compute this quantity. If I plug in the two product formulas, $dP_{\nu\pi}$ is the product with factors $p_{a_t}(x_t)$ and $dP_{\nu'\pi}$ is the same product with $p'_{a_t}(x_t)$. Taking the log and simplifying, what we get is simply

$$\log \frac{dP_{\nu\pi}}{dP_{\nu'\pi}}(a_1, x_1, \dots, a_T, x_T) = \sum_{t=1}^{T} \log \frac{p_{a_t}(x_t)}{p'_{a_t}(x_t)}.$$

The factors corresponding to the policy cancel out, because I am using the same policy in both bandit environments, okay? Now, this is for one realization. If I look at the expected value of this quantity, expectation with respect to what? With respect to the distribution induced by the underlying environment $\nu$. This gives

$$\mathbb{E}_{\nu}\left[\sum_{t=1}^{T} \log \frac{p_{A_t}(X_t)}{p'_{A_t}(X_t)}\right],$$

and this is exactly equal to the divergence $D(P_{\nu\pi}, P_{\nu'\pi})$. Recall how we defined the divergence between two distributions: $D(P, Q) = \mathbb{E}_{P}\big[\log \frac{dP}{dQ}\big]$.
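Here is a small numerical illustration of that cancellation; the fixed trajectory, the two mean vectors, and the two made-up policies below are arbitrary choices of mine. For a fixed realization, the log-likelihood ratio comes out identical whichever policy's probabilities we plug in, because the same policy factors appear in numerator and denominator.

```python
import numpy as np

def log_traj_prob(traj, means, policy_probs):
    """log P_{nu,pi}(a_1, x_1, ..., a_T, x_T) for Bernoulli arms, where
    policy_probs[t] is the probability pi assigned to the arm it played."""
    logp = 0.0
    for (a, x), pi_a in zip(traj, policy_probs):
        p = means[a]
        logp += np.log(pi_a) + np.log(p if x == 1 else 1 - p)
    return logp

traj = [(0, 1), (2, 0), (1, 1)]             # one fixed realization (a_t, x_t)
mu, mu2 = [0.2, 0.5, 0.8], [0.3, 0.5, 0.7]  # environments nu and nu'

for probs in ([0.5, 0.25, 0.9], [0.1, 0.6, 0.3]):  # two different policies
    ratio = log_traj_prob(traj, mu, probs) - log_traj_prob(traj, mu2, probs)
    print(ratio)  # same value both times: the pi factors cancel
```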
So, this expectation is nothing but exactly that definition, applied to the two distributions $P_{\nu\pi}$ and $P_{\nu'\pi}$. That is why this is the divergence, and the divergence is now the expectation of the summation of the logarithms of these ratios. In this expression both $X_t$ and $A_t$ are random quantities, okay? Now, what you can do is compute this expectation over two steps: first condition on $A_t$ and take the expectation over the remaining randomness, and then take the expectation of that quantity again. Can I do this? Yes, by the tower property:

$$\mathbb{E}_{\nu}\left[\sum_{t=1}^{T} \log \frac{p_{A_t}(X_t)}{p'_{A_t}(X_t)}\right] = \sum_{t=1}^{T} \mathbb{E}_{\nu}\left[\mathbb{E}_{\nu}\left[\log \frac{p_{A_t}(X_t)}{p'_{A_t}(X_t)} \,\middle|\, A_t\right]\right].$$

Again, everything is under the distribution that $\nu$ induces. It does not make much difference that I moved the expectation inside the summation: there are only finitely many terms, so I can write the sum of expectations and condition each term on $A_t$ separately; earlier too, whatever we did, we could have taken the expectation inside the summation, okay?

Now, conditioned on $A_t$, what randomness is left? Only $X_t$. And what is $X_t$? Conditioned on $A_t$, the sample $X_t$ comes from that particular arm: the only randomness left is due to the arm you conditioned upon. So $p_{A_t}$ is the distribution of that arm under $\nu$, $p'_{A_t}$ is the distribution of the same arm under $\nu'$, and the conditional expectation is taken with respect to the former. What is this inner quantity going to be? It is nothing but the divergence between the distribution of that arm in the first environment and the distribution of the same arm in the other environment:

$$\mathbb{E}_{\nu}\left[\log \frac{p_{A_t}(X_t)}{p'_{A_t}(X_t)} \,\middle|\, A_t\right] = D(P_{A_t}, P'_{A_t}).$$

So these arm distributions have completely absorbed the randomness in the reward samples $X_t$; the remaining outer expectation is only over the randomness in the arm pulls $A_t$, okay?

Now another step. I am simply going to rewrite this sum in the following fashion; let me know if it is correct. Initially the sum runs over all rounds. Now I am splitting it over the different arms: first I take a summation over $k = 1, \dots, K$, and in this summation I only keep those rounds $t$ in which $A_t = k$, and then I do this for each arm. Then this sum should be equal to the original sum:

$$\sum_{t=1}^{T} \mathbb{E}_{\nu}\big[D(P_{A_t}, P'_{A_t})\big] = \sum_{k=1}^{K} \sum_{t=1}^{T} \mathbb{E}_{\nu}\big[\mathbb{1}\{A_t = k\}\, D(P_k, P'_k)\big].$$
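If it helps, here is a quick Monte Carlo sanity check of the conditioning step above, assuming Bernoulli arms and an arbitrary fixed distribution over which arm is pulled; all the numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
mu  = np.array([0.2, 0.5, 0.8])          # arm means under nu
mu2 = np.array([0.3, 0.5, 0.7])          # arm means under nu'
arm_probs = np.array([0.6, 0.3, 0.1])    # some distribution of A_t

def kl_bernoulli(p, q):
    """D(Ber(p), Ber(q)), elementwise."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Direct Monte Carlo estimate of E[ log p_A(X) / p'_A(X) ] with X | A ~ P_A.
n = 500_000
A = rng.choice(3, size=n, p=arm_probs)
X = rng.binomial(1, mu[A])
log_ratio = np.where(X == 1, np.log(mu[A] / mu2[A]),
                     np.log((1 - mu[A]) / (1 - mu2[A])))
print(log_ratio.mean())

# Tower property: this should match E[ D(P_A, P'_A) ].
print(np.dot(arm_probs, kl_bernoulli(mu, mu2)))
```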
What I am doing is fixing $A_t$ to one arm, looking at the divergence between the two distributions corresponding to that arm, and then doing this over all arms; I achieved it by bringing the indicator $\mathbb{1}\{A_t = k\}$ inside the summation over $t = 1, \dots, T$. Now we are done with what we wanted to do, right? Keep the outer summation over $k = 1, \dots, K$ just like that. For a fixed arm $k$, the divergence $D(P_k, P'_k)$ is independent of $t$, so it comes out of the inner sum, and what remains is the expectation of the sum of indicators. And what is that going to be? Exactly the number of pulls of that arm: when I put $A_t = k$, the sum $\sum_{t=1}^{T} \mathbb{1}\{A_t = k\}$ is $N_k(T)$, and its expectation under the environment $\nu$ is the expected number of pulls of that arm. So this is going to be

$$D(P_{\nu\pi}, P_{\nu'\pi}) = \sum_{k=1}^{K} \mathbb{E}_{\nu}\big[N_k(T)\big]\, D(P_k, P'_k),$$

and that is exactly what our claim was. What we have expressed through this lemma is that the divergence between the distributions induced in the two environments by my policy decomposes into the corresponding divergences of the individual arms' distributions, weighted in this fashion, okay?

So, this is one result. The rest of it, once we have established this, is going to follow exactly the same way we discussed in the previous class. We are going to construct two environments that differ at only one arm, but still have different optimal arms, invoke this result on that setup, and we will be able to show the bound. Just to recall what we wanted to show. Assume $K > 1$ and that the number of rounds satisfies $T \ge K - 1$. Then, for every policy $\pi$, there exists a mean vector $\mu$ such that

$$R_T(\pi, \nu_\mu) \ge \frac{1}{27}\sqrt{(K-1)T}.$$

What is $\nu_\mu$ here? $\nu_\mu$ is nothing but an environment whose arm distributions have the means given by the mean vector $\mu$; here $\mu$ is a vector with $K$ components, each coming from $[0, 1]$. So there exists such a vector $\mu$ from which you can define an environment on which your regret is going to be at least this much. What I am saying is that through this setup, for any policy, we will be able to come up with one mean vector, that is, one environment, on which that policy incurs at least this much regret; that means there exists a class of stochastic bandit environments such that, on that class, the policy $\pi$ is going to incur at least this much regret.
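Before moving on to the construction, here is a sketch that evaluates the right-hand side of the lemma numerically, assuming Bernoulli arms and an epsilon-greedy policy (both illustrative stand-ins; any policy would do). The point to notice is that when $\nu$ and $\nu'$ differ at a single arm, only that arm contributes to the decomposition, and its contribution is governed by how often the policy pulls it; this is exactly what the lower bound construction exploits.

```python
import numpy as np

rng = np.random.default_rng(2)

def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def expected_pulls(means, T, eps=0.1, n_runs=2000):
    """Monte Carlo estimate of E_nu[N_k(T)] under an epsilon-greedy policy."""
    K = len(means)
    counts = np.zeros(K)
    for _ in range(n_runs):
        sums, pulls = np.zeros(K), np.zeros(K)
        for t in range(T):
            if t < K:                             # pull each arm once first
                a = t
            elif rng.random() < eps:
                a = int(rng.integers(K))          # explore
            else:
                a = int(np.argmax(sums / pulls))  # exploit
            x = rng.binomial(1, means[a])
            sums[a] += x
            pulls[a] += 1
        counts += pulls
    return counts / n_runs

mu  = np.array([0.6, 0.5, 0.5])   # environment nu
mu2 = np.array([0.6, 0.5, 0.7])   # nu': differs only at the third arm
N = expected_pulls(mu, T=50)
rhs = np.sum(N * kl_bernoulli(mu, mu2))
print("sum_k E[N_k(T)] D(P_k, P'_k) =", rhs)  # driven entirely by the third arm
```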
We have demonstrated this for an arbitrary policy $\pi$, so it holds for every policy: there exists an environment in the class on which your algorithm is going to incur at least this much regret. What this implies is basically

$$R_T^*(\mathcal{E}_K) \ge \frac{1}{27}\sqrt{(K-1)T},$$

where $\mathcal{E}_K$ is the class of all environments whose mean parameters lie in the interval $[0, 1]$, okay? Now, earlier we stated this result saying it holds for any bounded environment where the rewards take values in a particular interval, or for any $\sigma$-sub-Gaussian environment, right? But in the result we are going to show, we restrict ourselves to Gaussian distributions, because we know that Gaussian distributions are also sub-Gaussian, okay? And for that we are going to show that we can construct an environment in which all the arms have Gaussian distributions and the mean vector lies in $[0, 1]^K$; we are going to show that such an environment, with appropriately defined $\mu$ values, satisfies this result.
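As a sketch of the construction to come, assume unit-variance Gaussian arms, for which $D(\mathcal{N}(m_1, 1), \mathcal{N}(m_2, 1)) = (m_1 - m_2)^2 / 2$. Below are two such environments that differ at a single arm yet disagree on the optimal arm; the gap $\Delta$, its scaling with $K$ and $T$, and the choice of which arm to bump are the proof's free parameters, and the concrete values here are only illustrative.

```python
import numpy as np

def kl_gaussian(m1, m2, sigma=1.0):
    """D(N(m1, sigma^2), N(m2, sigma^2)) = (m1 - m2)^2 / (2 sigma^2)."""
    return (m1 - m2) ** 2 / (2 * sigma ** 2)

def two_environments(K, Delta, bumped_arm):
    """nu: first arm optimal with gap Delta over the rest;
    nu': identical except `bumped_arm` is raised to 2*Delta, so the two
    environments differ at one arm but have different optimal arms."""
    mu = np.zeros(K)
    mu[0] = Delta
    mu2 = mu.copy()
    mu2[bumped_arm] = 2 * Delta
    return mu, mu2

K, T = 10, 10_000
Delta = 0.5 * np.sqrt((K - 1) / T)   # the order of the gap used in the proof
mu, mu2 = two_environments(K, Delta, bumped_arm=3)
print("means in [0, 1]?", mu.min() >= 0 and mu2.max() <= 1)
print("KL at the differing arm:", kl_gaussian(mu[3], mu2[3]))
print("lower bound value:", np.sqrt((K - 1) * T) / 27)
```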