So, at least one thing we are happy about: we now have a bound which decays exponentially in n. But this happens for sub-Gaussian random variables, so the question is, what kinds of random variables satisfy the sub-Gaussianity property? One we already saw: a Gaussian with mean 0 satisfies it. Here are the other ones; you should verify all of them. If X has 0 mean and |X| ≤ b almost surely, then X is sub-Gaussian with parameter b. Another possibility: if X has 0 mean but takes values in a bounded interval [a, b], then it is sub-Gaussian with parameter (b − a)/2. So you see that whenever a random variable has 0 mean and bounded support, it is going to be sub-Gaussian with some appropriate parameter. Henceforth we will focus on random variables, or distributions, that are just sub-Gaussian with some parameter; we really need not worry whether they are bounded like this or supported on some bounded interval, we will just say each is sub-Gaussian with some σ. So, henceforth, according to our notation, we are going to assume the distributions come from an environment class in which the arm distributions are σ-sub-Gaussian: the class of bandits (ν_a) such that ν_a is "σ-sub-Gaussian" for every arm a. But note we cannot literally assume each distribution ν_a is σ-sub-Gaussian: by our definition a sub-Gaussian random variable has mean 0, so that would force all the means to be the same, namely 0. What we will assume instead is that X − E[X], where X is drawn from ν_a, is σ-sub-Gaussian. When I say E[X] with X drawn according to ν_a, this is nothing but the mean of the distribution ν_a, whatever the true mean is. So if X is drawn from that distribution and you subtract its mean, you get a 0-mean random variable, and we are going to assume those centered variables are σ-sub-Gaussian, all of them. Is this notation clear to you? We are just saying that for the i-th arm, if you subtract the mean from the associated random variable, the result is a 0-mean quantity, and it is sub-Gaussian with parameter σ; each of the arms will be. We do not know what the associated means are, but each distribution, centered around its mean, is σ-sub-Gaussian. Now, let us return to our stochastic bandits. So, what do we say in our stochastic bandit?
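To make the bounded-support claim concrete, here is a minimal numerical sanity check in Python (not a proof). It uses the moment-generating-function characterization E[e^{λX}] ≤ e^{λ²σ²/2}; the helper name and the choice of a uniform distribution are my own assumptions, just one convenient bounded example.

```python
import numpy as np

rng = np.random.default_rng(0)

def mgf_ratio(samples, sigma, lambdas):
    """For each lambda, compare the empirical MGF E[exp(lambda*X)] against
    the sub-Gaussian bound exp(lambda^2 sigma^2 / 2). Ratios <= 1 are
    consistent with X being sigma-sub-Gaussian."""
    return [np.mean(np.exp(lam * samples)) / np.exp(lam**2 * sigma**2 / 2)
            for lam in lambdas]

# A zero-mean variable supported on [a, b] = [-1, 3]:
# the claimed sub-Gaussian parameter is (b - a)/2 = 2.
a, b = -1.0, 3.0
x = rng.uniform(a, b, size=200_000)
x -= x.mean()                        # centre to (approximately) zero mean
print(mgf_ratio(x, sigma=(b - a) / 2, lambdas=[-2.0, -0.5, 0.5, 2.0]))
```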
There is a learner; in each round he has to decide which arm to play, and when he plays that arm the environment generates a sample from the distribution of that arm, which the learner gets to observe. The learner's goal in this case is to quickly identify the arm which has the highest mean. So, let us rewrite the interaction. Our setup was: the input is the number of arms k, and the horizon n is known. For t = 1, 2, …, n: the learner selects I_t from one of the arms, and the environment draws a random variable X_{I_t, t} from ν_{I_t}, the distribution associated with arm I_t. Earlier we just wrote X with subscript i; let us also append the t here, to indicate that this is the t-th round. So this is how the interaction happens, and based on these observations the learner has to decide, in every round, which action to take. We said that the policy in every round is based on the past observations: based on the history, the learner decides which action to pull in round t, and the policy consists of all of these decision rules. And we defined our regret for a policy π over n rounds; let me define it in a more refined way: R_n(π) = E[max_i Σ_{t=1}^n X_{i,t} − Σ_{t=1}^n X_{I_t,t}]. So, what is this? Σ_{t=1}^n X_{i,t} is the cumulative reward you would have gotten if you had always played arm i over the n rounds; you will be interested in taking as your baseline the arm which would have given you the maximum reward, and you want to compare it with the reward you actually got by playing I_t in round t. This is the same as E[max_i Σ_t X_{i,t}] − E[Σ_t X_{I_t,t}]; I can just split the expectation. But the first quantity seeks the maximum among realizations: the X_{i,t}, for t running from 1 to n, are the realizations you would have seen on arm i, and you are looking at the maximum cumulative value of these. In general, as we argued in the adversarial case, this pointwise, sample-wise maximization is hard to deal with. So we will relax it and take the expectation inside the max; since a maximum of expectations is at most the expectation of the maximum, doing so gives a lower bound. Now, what is this expectation over? The X_{i,t} are random variables, and I_t is also a random variable because it depends on the past history. So there are two sources of randomness here. One is the randomness in pulling the arm: even though the learner is not randomizing — he is not picking arms according to some distribution but deterministically — his choice is influenced by the past observations, so I_t has some randomness. And the X_{i,t} are anyway drawn according to some distribution. Taking the distributions of both into account, let us first fix one sequence of pulls; if I fix this and push the expectation inside, averaging the samples, I get Σ_{t=1}^n μ_{I_t}, because the expected value of the sample in round t is the mean μ_{I_t} — that is what we said, each of these distributions has a mean.
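As a concrete illustration of this interaction protocol, here is a minimal simulation sketch in Python. The function and variable names are my own, and Gaussian arms are used only as one convenient σ-sub-Gaussian example, not as part of the setup above.

```python
import numpy as np

def run_bandit(policy, means, n, sigma=1.0, seed=0):
    """Simulate the learner/environment interaction for n rounds.
    `policy` maps the history (arms played, rewards seen) to the next
    arm index I_t; the environment then draws X_{I_t,t} from nu_{I_t}."""
    rng = np.random.default_rng(seed)
    arms, rewards = [], []
    for t in range(n):
        i_t = policy(arms, rewards)           # decision based on the history
        x_t = rng.normal(means[i_t], sigma)   # sample from arm I_t's distribution
        arms.append(i_t)
        rewards.append(x_t)
    return arms, rewards

# Exercise the loop with a placeholder (uniformly random) policy.
rng_pol = np.random.default_rng(1)
means = [0.1, 0.5, 0.3]
arms, rewards = run_bandit(lambda a, r: int(rng_pol.integers(len(means))),
                           means, n=1000)
```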
So, our assumption was that the distribution ν_i has mean μ_i; X_{i,t} is drawn from ν_i, which is why E[X_{i,t}] = μ_i, and likewise E[X_{I_t,t}] = μ_{I_t}. Now, in the first quantity I can interchange the expectation and the sum: it becomes max over i of Σ_{t=1}^n E[X_{i,t}], which is nothing but max over i of Σ_{t=1}^n μ_i, that is, max_i of n times μ_i; the other part remains the same. So, define μ* = max_i μ_i. Then the whole thing becomes n μ* − E[Σ_{t=1}^n μ_{I_t}]. Now, what is this expectation over? It is over the I_t, because I_t itself can be random depending on the past history; the past history induces some stochasticity in I_t. So let me call this quantity R̄_n(π); earlier we denoted the pseudo-regret by R̄, with a bar, and the original regret by R. We are going to take this as the pseudo-regret, and we will be interested in bounding this regret. I do not know if we discussed this during the adversarial part: the pseudo-regret is a lower bound on the actual regret — that is what we just showed, R̄_n(π) ≤ R_n(π). This is the quantity we will focus on henceforth, and what we will usually do is develop algorithms which give an upper bound on it. If I have an upper bound on the pseudo-regret, will it be an upper bound on the actual regret? No, right? Because the pseudo-regret is itself a lower bound on the actual regret, an upper bound on it cannot in general upper-bound the actual regret. But we will ignore that fact, study only the pseudo-regret, and see how we can upper bound it. And as usual, our interest is to find policies π that make this environment class learnable. What does that mean? We are interested in policies π such that R̄_n(π)/n → 0, that is, policies which give us sub-linear regret. So, now, before we start looking into which policies give us this sub-linear regret, let us establish one more property of the regret, which we call the regret decomposition. Henceforth I will not keep saying "pseudo-regret"; I will just say "regret", but what we mean is actually the pseudo-regret. So, what do we have? For a policy π we have R̄_n(π) = n μ* − E[Σ_{t=1}^n μ_{I_t}], where μ_{I_t} is the part that depends on your policy π. Now, let us try to simplify this. Before that, I am going to define the total number of pulls of an arm. The learner, in each round, is going to play different arms.
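Continuing the simulation sketch above (same hypothetical names), the pseudo-regret of a run can be computed directly from this definition:

```python
def pseudo_regret(arms, means):
    """Direct form of the pseudo-regret n*mu_star - sum_t mu_{I_t}:
    the gap to the best mean, accumulated over the rounds played."""
    mu_star = max(means)
    return sum(mu_star - means[i] for i in arms)

print(pseudo_regret(arms, means))  # for uniform play, roughly n times the average gap
```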
So, after n rounds he would have pulled each arm some number of times. Let us define the number of pulls of arm i over n rounds to be T_i(n) = Σ_{t=1}^n 1{I_t = i}. If in round t he happens to pull arm i, it gets counted in the summation; all we are saying is how many times arm i is pulled over the n rounds. Is this correct? Ok. Now, using this, let us rewrite the quantity. First, tell me: is T_i(n) a random variable? Yes, right — it depends on which arms are pulled, and the I_t are going to influence that. If I take the expectation E[T_i(n)] and sum over all arms i = 1 to k, is it equal to n? Yes: in each round you pull exactly one of the k arms — either arm i or something else — so if you take the sum of all the pulls it has to equal n, the number of rounds. So the first thing I will do in the equation is replace this n by that quantity: n μ* = Σ_{i=1}^k μ* E[T_i(n)]. Next, inside the second expectation I am going to add another summation, writing μ_{I_t} = Σ_{i=1}^k μ_i 1{I_t = i}. Did I change anything from here to here? Let us look at what happens. I_t has to take some value; so I am running i from 1 to k and asking which value I_t takes. When i equals that particular value, the indicator is 1 and that μ_i is retained; everything else vanishes. So in this sum only the i corresponding to I_t remains, which is exactly the term μ_{I_t}; the two expressions are the same. Now let us do one thing: these are both finite summations, so I can interchange them, and since it is a finite sum I can also take the expectation inside. After interchanging, look at the internal sum over t: as t runs from 1 to n, the indicator 1{I_t = i} is 1 only in the rounds where I_t = i — only then is μ_i retained, otherwise the term is 0. So can I then say the internal sum is nothing but the number of times I have played arm i, times μ_i? Yes: it is μ_i T_i(n). And anyway the means μ_i are constants in this summation, so I can take them out: E[Σ_{t=1}^n μ_{I_t}] = Σ_{i=1}^k μ_i E[T_i(n)]. Now, I know that μ* is the maximum of all the μ_i — that is how I defined μ* — so μ* − μ_i has to be non-negative. What I will do now is define this difference as Δ_i = μ* − μ_i. What is this telling you? It is the gap between the best arm and the i-th arm. If I do that, the regret becomes R̄_n(π) = Σ_{i=1}^k Δ_i E[T_i(n)], and this is going to be called your regret decomposition formula.
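Written out compactly, the chain of steps just described is:

```latex
\bar{R}_n(\pi)
  = n\mu^\star - \mathbb{E}\Big[\sum_{t=1}^{n}\mu_{I_t}\Big]
  = \sum_{i=1}^{k}\mu^\star\,\mathbb{E}[T_i(n)]
    - \sum_{i=1}^{k}\mu_i\,\mathbb{E}[T_i(n)]
  = \sum_{i=1}^{k}\Delta_i\,\mathbb{E}[T_i(n)],
\qquad \Delta_i := \mu^\star - \mu_i .
```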
So, does this formula make sense? It is saying that if you have played n rounds in total, in which you played arm i for T_i(n) rounds: suppose arm i happens to be the optimal one; then Δ_i is 0, so it contributes nothing. But if i is anything other than the optimal arm, then every play of it contributes an amount Δ_i to the regret. So what we are basically saying is that the regret is nothing but the number of plays of each suboptimal arm times the per-play regret it incurs, summed over all suboptimal arms — even though I have taken the sum over all arms, for the optimal arm that term is 0. Now, let me denote i* = argmax_i μ_i, so that i* is my optimal arm. What do you expect: do you want T_{i*}(n) to be much higher than the other counts, or is it fine if they are all equal, if your policy is good? Say you play n rounds: arm 1 you played T_1(n) rounds, arm 2 you played T_2(n) rounds, and the optimal arm you played T_{i*}(n) rounds. If your algorithm is good, you expect T_{i*}(n) to be much, much higher than the others; that is, the number of plays of each suboptimal arm should be much, much smaller compared to the number of plays of your optimal arm. The expected number of plays of these arms depends on your policy π. So finally, if for a given policy you can bound how many times it is going to pull the suboptimal arms, then you can come up with a bound on the regret. All the regret analyses we are going to do exploit exactly this line of thought: given the policy, identify how many times it plays the optimal arm and how many times it plays each suboptimal arm, come up with a bound on that, and based on that end up with a bound on the regret. That is the line of attack we will take to prove the regret bounds. Now, before we start with any algorithms: do you have any idea how to go about getting a small value of this regret — anything you can think of? It all boils down to how quickly you can correctly estimate the means, right? If you somehow get a good estimate of the means, then you already know what to do: just pick the arm with the highest mean. So the question is how quickly you are going to narrow down on the arm which has the highest mean. It is not even necessary to estimate every arm's value correctly; it is about just identifying which is the best among them. We do not need to know how much better it is than the others; all we need to know is that this is the best arm. If we can do this, then we are done. So how are you going to do that? We will discuss different algorithms in the coming classes.
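As a quick sanity check on the decomposition, using the simulated run and the hypothetical helpers from the earlier sketches: counting pulls and weighting them by gaps gives exactly the same number as the direct per-round sum.

```python
def decomposed_regret(arms, means):
    """Pseudo-regret via the decomposition sum_i Delta_i * T_i(n):
    count the pulls T_i(n) of each arm and weight them by the gap Delta_i."""
    mu_star = max(means)
    return sum((mu_star - means[i]) * arms.count(i) for i in set(arms))

# Agrees exactly with the direct definition computed earlier.
assert abs(decomposed_regret(arms, means) - pseudo_regret(arms, means)) < 1e-9
```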