So far, we have been dealing with learning problems in which we did not make any specific assumption about how the environment generates the losses; it could even be adversarial, with the environment trying to make you incur as much loss as possible. In that setting we defined the notion of regret and developed algorithms that minimize it. Now we are going to switch to the stochastic setting, where we assume that the loss the environment assigns to an arm, or action, is drawn from a distribution. The environment always draws the losses from a fixed distribution, and the goal is to identify which of the arms has the smallest loss in expectation. So we are restricting ourselves in the sense that the losses now come from fixed distributions, and those distributions do not change once the game has started. In other words, the environment follows a fixed rule: a fixed distribution for each arm, according to which it generates the losses.

How does the interaction happen in the stochastic bandit? As usual, there is a template: the input specifies how many arms there are, in each round the learner selects an arm, and the environment has associated a distribution with each arm. Let us say there are k arms, and let ν_a denote the distribution associated with arm a. The learner selects an arm a from {1, ..., k}, and the environment assigns a loss x drawn from ν_a, the distribution associated with arm a. This game is played in every round; let us say there are n rounds. The number of rounds may be specified a priori, in which case we play exactly that many rounds, or it may not be specified a priori, in which case we just keep playing. We denote by I_t the action selected by the learner in round t, which is one of the k arms, and by X_t the loss, or reward, observed in that round. From this point on I am going to switch from the loss setting to the reward setting: when the player plays an action he receives a reward, and his goal is to accumulate as much reward as possible. The reward he gets is X_t, which is drawn from ν_{I_t}, the distribution associated with the arm I_t selected in round t. In this stochastic bandit setting, the environment's side is simple: it has already decided the distribution of each arm, that is fixed, and every time you select an arm it draws one sample from that distribution and reveals it to you. Although I_t is just the action selected by the learner in round t, the learner's performance will depend on which actions he selects.
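To make the interaction protocol concrete, here is a minimal simulation sketch, assuming a small hypothetical instance with Bernoulli arms and a placeholder learner that picks arms uniformly at random; none of these specific choices come from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bandit instance: k = 4 arms, each with a fixed reward
# distribution chosen by the environment before the game starts.
# Bernoulli arms are used purely as an illustration.
means = [0.3, 0.3, 0.2, 0.6]   # nu_a is Bernoulli(means[a])
k = len(means)
n = 10                          # number of rounds

rewards = []
for t in range(n):
    # The learner selects an arm I_t (here: uniformly at random,
    # just a placeholder rule for the sketch).
    I_t = rng.integers(k)
    # The environment draws one sample X_t from nu_{I_t} and reveals it.
    X_t = rng.binomial(1, means[I_t])
    rewards.append(X_t)

print(sum(rewards))  # total accumulated reward S_n
```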
The way the learner selects an action in each round is what we call the policy of the learner, and this policy depends on the actions and the corresponding rewards he has seen so far. I will denote it by π_t: in round t the learner has the history up to time t, which includes what he played in round 1 and the reward he observed, what he played in round 2 and the reward he observed, and so on up to round t − 1. Based on that information he has to decide which action to play next. So I_t = π_t(I_1, X_1, ..., I_{t−1}, X_{t−1}) in round t, and the sequence π_1, π_2, π_3, ... constitutes the policy. Basically, these are rules, one for each round: if you tell me what you have observed so far, the rule tells you which action to play in that round, and the collection of these rules over all rounds is your policy.

What is the objective here? As I said, the objective is to accumulate as much reward as possible, but you do not know the distributions; you do not know a priori which distribution each arm is associated with. If you run the game for n rounds, your total accumulated reward is S_n = X_1 + X_2 + ... + X_n, and the goal is to maximize the expected value of S_n without knowing the distributions. Can anybody say what the maximum value of this expected value of S_n could be? Suppose I tell you a priori the distributions associated with each of the actions, and let μ_a denote the mean of distribution ν_a. If I have to maximize E[S_n], the best I can do is determined by the largest of these means; call it μ* = max_a μ_a. In that case the maximum value is n μ*, because you play n rounds and in each round you play the arm with the highest mean, which gives n μ* in expectation. This is what I get in expectation when I know the distributions of all the arms. We do not know them a priori; they have been chosen by the environment but not revealed to me, and my goal is still to get close to this much reward in expectation. So I set this as my benchmark when defining the regret of a policy: the regret of a policy π over n rounds is n μ* minus the expected value of S_n. Note that the policy I have described can be taken to be deterministic; there is no randomization by the learner. Given the history, the rule tells him exactly which arm to play in round t; we are not allowing any randomization here, it is just a plain deterministic function of the past. And X_t, the reward observed in round t, if you happen to select action I_t, is drawn from ν_{I_t}, the distribution of arm I_t.
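To make the notion of a policy concrete, here is a sketch of one deterministic rule π_t that maps the history (I_1, X_1), ..., (I_{t−1}, X_{t−1}) to the next action. The particular rule (play each arm once, then play the arm with the highest empirical mean) is only an illustration of what such a rule can look like, not a method introduced here, and the example means at the end are hypothetical.

```python
from collections import defaultdict

def policy(history, k):
    """A deterministic rule pi_t: given the history
    [(I_1, X_1), ..., (I_{t-1}, X_{t-1})], return the next arm I_t.
    Illustrative rule only: try every arm once, then play the arm with
    the highest empirical mean reward so far."""
    counts = defaultdict(int)
    sums = defaultdict(float)
    for arm, reward in history:
        counts[arm] += 1
        sums[arm] += reward
    for a in range(k):                 # any arm not tried yet gets priority
        if counts[a] == 0:
            return a
    return max(range(k), key=lambda a: sums[a] / counts[a])

# The benchmark when the means are known is n * mu_star.
means = [0.3, 0.3, 0.2, 0.6]           # hypothetical example means
n = 1000
print(n * max(means))                   # n * mu_star = 600.0
```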
In round t you play some action, say I_t, and the distribution of the reward in that round is ν_{I_t}. Because of this, X_t is a random quantity: the reward you observe in each round is random, and when you define S_n as the sum of the X_t's, S_n is also random. What we are interested in is the expected total reward. And as I said, n μ* is the best you could get if you happened to know all the distributions; you achieve it by playing the arm with the highest mean in every round. Let me adjust the notation slightly and append the arm index, writing something like X_{t, I_t}, just to emphasize that the reward observed in round t is associated with the arm I_t, which is itself a random quantity. Now, when I take the expectation here, what is it over? There are two sources of randomness. Even though the action selected in round t is a deterministic function of the history, the history itself is random: X_1, X_2, ... are random samples, and that randomness induces randomness in the choice of arm. So the I_t's are random variables; they depend on the samples observed in the past. The expectation therefore involves two sources of randomness: one is the randomness in the samples themselves, and the other is the randomness in the choice of actions, which is induced by those samples. Rather than carry the heavier notation, let me simply say that you play action I_t in round t and observe a reward X_t drawn from ν_{I_t}. Then the regret is R_n = n μ* − E[X_1 + X_2 + ... + X_n], where n is the number of rounds, and I am going to call this the regret. How does the policy influence this regret? The effect of the policy π comes through the choice of the I_t's: if you change π, the way the I_t's are chosen changes, and that changes the expected-reward term in the regret. Note that the benchmark here is still of the same flavor as in the adversarial case: we compare the reward incurred by playing our policy with the best we could get from a fixed arm. This benchmark n μ* is nothing but n times max_i μ_i, where the maximum is over the k arms. That is, I am still looking at the mean reward I would get if I had to play a single arm throughout the n rounds: if I play the arm with the highest mean for all n rounds, this is the total expected reward, and I compare it with whatever I get by playing my policy.
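Because the expectation in the regret is over both the reward samples and the arm choices they induce, one way to approximate E[S_n] is to average the total reward over many independent runs of the game. Here is a sketch of such a Monte Carlo estimate, reusing the illustrative greedy-style rule from above on a hypothetical Bernoulli instance; the instance and run counts are made up for the example.

```python
import numpy as np

def run_once(means, n, rng):
    """Play n rounds with a simple rule (play each arm once, then the
    empirically best arm) and return the total reward S_n."""
    k = len(means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    total = 0.0
    for t in range(n):
        untried = np.where(counts == 0)[0]
        arm = untried[0] if len(untried) > 0 else int(np.argmax(sums / counts))
        x = rng.binomial(1, means[arm])       # X_t ~ nu_{I_t}
        counts[arm] += 1
        sums[arm] += x
        total += x
    return total

means = [0.3, 0.3, 0.2, 0.6]                  # hypothetical instance
n, runs = 500, 100
rng = np.random.default_rng(1)

# Monte Carlo estimate of E[S_n]: averaging over independent runs averages
# over both the reward randomness and the induced randomness in the I_t's.
est_Sn = np.mean([run_once(means, n, rng) for _ in range(runs)])
regret = n * max(means) - est_Sn              # n * mu_star - E[S_n]
print(regret)
```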
So what we are basically doing is comparing what I got with the arm which, had I played it throughout all n rounds, would have given me the best possible reward. My benchmark is still a single arm, but that arm is now the one with the highest mean reward. Now, we said these are unknown distributions. The question is: can they be arbitrary distributions, or do we allow only some specific set of distributions on the arms? Is there some special structure we are going to assume on these distributions? That is what will define the environment class we look at. So far we have only said that the environment chooses these distributions a priori and fixes them, they are not revealed to me, and my goal is to identify the one with the highest mean.

Let us now define this properly. Denote a bandit instance by ν = (ν_1, ..., ν_k); the number of arms k is fixed, and ν is the set of distributions assigned to the arms. Once you fix this set of distributions, that defines your bandit instance. What I just said is that we can assume these bandit instances come from some special class, or equivalently that they are drawn from some environment class. Is the notion of a bandit instance clear to all of you? There are k arms; say the environment decides on one particular set of distributions, arm one gets this distribution, arm two gets another distribution, arm three gets yet another, and so on. Assigning one distribution to each arm defines one bandit instance. Maybe later, or tomorrow, the environment assigns a different set of distributions to the arms; that makes another bandit instance. And note that irrespective of which bandit instance the learner is facing, we define the regret in the same way. The maximum mean in the regret depends on the bandit instance: once you fix the distributions, their means are fixed, and so is the associated maximum, and the regret is defined in terms of it. If you change the bandit instance, the maximum value can be different. So I am saying that the instances come from some class. The environment can assign a distribution to an arm, but the distributions it can assign to that arm come from some set; call this set M_a for arm a. The environment has one bunch of distributions for one arm, another bunch for another arm, and so on; each time it picks one distribution from the first set and assigns it to the first arm, one from the second set for the second arm, and so on. That is why I say the environment class is basically the collection of all possible bandit instances. At any time the environment can pick one bandit instance from this class, and I will have to learn against it.
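A bandit instance is just the tuple of per-arm distributions, and the benchmark μ* is a property of that instance. Here is a tiny sketch with two made-up instances, showing that changing the instance changes the best arm and μ*.

```python
# Two hypothetical bandit instances over the same k = 4 arms, described here
# just by their mean vectors; mu_star and the best arm are properties of the
# particular instance the environment picked.
instance_1 = [0.3, 0.3, 0.2, 0.6]
instance_2 = [0.1, 0.5, 0.8, 0.4]   # a different instance from the same class

for nu in (instance_1, instance_2):
    mu_star = max(nu)
    best_arm = nu.index(mu_star) + 1   # 1-indexed arm
    print(f"best arm = {best_arm}, mu_star = {mu_star}")
```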
Now, what are the typical environment classes? Let me give some examples. One possibility is that the environment assigns a Bernoulli distribution to each arm, with a different parameter for each: for each arm i it can pick a value μ_i between 0 and 1 and associate the distribution Bernoulli(μ_i) with the i-th arm. For example, with k = 4 arms, the arms could take Bernoulli(0.3), Bernoulli(0.3), Bernoulli(0.2) and Bernoulli(0.6). This makes one bandit instance. In this instance, which is the best arm? The last one, because it has the highest mean, 0.6. Another bandit instance could keep the arms Bernoulli but with different parameters, say 0.2, 0.4 and 0.9 on some of the arms. That is what the Bernoulli environment class looks like: the set of all such instances.

Another environment class you can think of is where the environment assigns to each arm a uniform distribution. A uniform distribution is defined by two parameters, the lower and upper limits of its range, and depending on the range you get different uniform distributions. So I can write this class as the set of instances where arm i gets Uniform[a_i, b_i], with a_i and b_i real numbers and a_i < b_i for all i. Similarly, another example is where all the distributions are Gaussian: arm i gets a Gaussian with mean μ_i and variance σ_i², where μ_i can be any real number and σ_i² any positive number, for all i. You can think of many different environment classes like this. Another one is the finite-variance class. Here the environment class consists of all instances ν = (ν_1, ..., ν_k) such that the variance of a random variable drawn from ν_i is at most σ², for all i. What does this say? The distribution assigned to each arm could be anything; the only restriction is that its variance is at most σ², and we assume this σ² is given to you a priori. This defines another class of environments. Now, does this finite-variance class contain the Gaussian class we just wrote down? In the Gaussian class the σ_i² are arbitrary positive numbers, whereas here we impose a bound: if you take the variance of the i-th arm's distribution, it is σ_i², and how do you know it is going to be less than the given σ²? You don't; σ_i² can be any positive number. Suppose instead I modify the Gaussian environment and say that all the arms have the same variance; then I do not need the subscript i on the variance. This is the fixed-variance Gaussian class.
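Here is a small sketch of how one instance from each of these classes might be represented as a list of per-arm samplers; all parameter values are hypothetical. The check at the end verifies the finite-variance condition for the Bernoulli instance, using the fact that a Bernoulli(p) has variance p*(1 - p).

```python
import numpy as np

rng = np.random.default_rng(2)

# One hypothetical instance from each environment class, represented as a
# list of per-arm samplers (each call returns one reward draw).
bernoulli_instance = [lambda p=p: rng.binomial(1, p) for p in (0.3, 0.3, 0.2, 0.6)]
uniform_instance = [lambda a=a, b=b: rng.uniform(a, b)
                    for a, b in ((0.0, 1.0), (-1.0, 2.0), (0.5, 0.7))]
gaussian_instance = [lambda m=m, s=s: rng.normal(m, s)     # s is the std deviation
                     for m, s in ((0.0, 1.0), (1.5, 0.5))]

print([arm() for arm in bernoulli_instance])   # one draw from each Bernoulli arm
print([arm() for arm in gaussian_instance])    # one draw from each Gaussian arm

# Finite-variance class with a known bound sigma^2: check Var(nu_i) <= sigma^2
# for every arm of the Bernoulli instance (Bernoulli(p) has variance p*(1-p)).
sigma_sq = 0.25
print(all(p * (1 - p) <= sigma_sq for p in (0.3, 0.3, 0.2, 0.6)))   # True
```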
So what I have done is take Gaussian distributions for all the arms and say that they can have different means, but all of their variances equal the same σ². That σ² is fixed and known; only the means are unknown. Is this fixed-variance Gaussian class a subset of the finite-variance class? Yes: I say nothing about the means, they can be anything, and all the variances are equal to σ², so the variance condition is satisfied. So that is the finite-variance class and this is the Gaussian one. Another class could simply be the bounded-support one. Here the environment class is the set of all instances such that the support of ν_i is a subset of [a, b] for every i. Do all of you understand what I mean by the support of a distribution? It is the set of values where the distribution places positive probability, where the density or probability mass is not zero. Will this bounded-support class contain the Bernoulli environment from before? Suppose I take [a, b] to be [0, 1]. Then the support of each Bernoulli distribution lies in [0, 1], because a Bernoulli only takes the values 0 and 1. The only thing to be careful about is that we ask for the support to be a subset of [a, b] rather than equal to it; with the subset condition, this class already contains the Bernoulli environment class. Another class could be the one where every ν_i is σ²-sub-Gaussian. How many of you know what a sub-Gaussian random variable is? We will define it a bit later, but it is a generalization of a Gaussian random variable. We will see that if a distribution is σ²-sub-Gaussian, its variance is already at most σ², so every distribution in this class also satisfies the finite-variance condition with the same σ²; and since a Gaussian with variance σ² is σ²-sub-Gaussian, this class in particular contains the fixed-variance Gaussian class from before. As you can see, the environment class can be almost anything; the point is that we restrict a priori the distributions the environment can put on the arms: it can choose them, but they must come from the given environment class. In most of our discussion we will focus on something like the sub-Gaussian class. By the way, although for the bounded-support class the support is bounded, the support of a Gaussian random variable is not bounded, and for a sub-Gaussian it need not be bounded either. We will see later that, even though its support is not bounded, the analysis for the sub-Gaussian family is not much different from the analysis when we deal with the family with bounded support.