So, in the last class we said that instead of taking as our benchmark all possible functions that map a context to an arm, we restrict to a smaller class, so the benchmark becomes slightly weaker. We considered three possibilities. The first was based on a partition: take a partition P of the context space and assume the function selects the same arm for all contexts falling in the same part; that is, we take Phi to be the set of all functions that are constant on each part of P. The second was based on similarity: we consider all functions phi from C to A such that some similarity-based quantity is at most a threshold theta. The third possibility was to simply consider some experts, these experts corresponding to some finite number of functions. In each case the benchmark is relaxed, because we look only for functions of that kind.

Now we are going to focus on the third case. What is happening in this case? We have finitely many functions, and we want to select the best function among the available ones. I am going to treat each of these functions as an expert, and the question is which expert I should be interested in. My benchmark is now defined in terms of these functions. So what does my benchmark become, and what is the regret I am interested in? Over the time horizon T, I look at the best expert in hindsight and see how I compare against it using whatever policy or algorithm I use: one term is the total reward you collected, the other is the best you could have gotten in hindsight. So we have relaxed the benchmark: instead of considering all possible functions we consider only finitely many, we call them experts, and we try to see how to solve this problem.

Now I am going to look into this setup, but with a slight generalization, and then we will come back to this. Assume there are m experts. When you have these m functions, what happens? When you observe a context in a round, each of these functions points you to a particular arm: choose this arm, or this one, or another. We can simply think of these functions as m experts, each playing the arm prescribed by its function. Since a function is a deterministic map, when you see a context you are recommended a particular arm, and that expert will play it. Now let us go for a slight generalization: assume the experts are such that, instead of telling you which particular arm to play for a given context, they come up with a distribution; when you see a context, each expert gives you a distribution according to which the arm should be selected.
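To pin the benchmark down, here is one way to write it out (this is my transcription; the symbols c_t for the round-t context, x_{t,a} for the reward of arm a in round t, and I_t for the arm the learner plays are not fixed in the lecture):

```latex
% The three relaxed benchmark classes from the lecture:
%  (1) functions constant on each part of a partition P of the context space,
%  (2) functions within a similarity threshold \theta,
%  (3) a finite set of functions, treated as experts.
\Phi_P = \{\varphi : \mathcal{C} \to \mathcal{A} \mid
          \varphi \text{ is constant on each part of } P\}

% Regret against the best function in a finite class \Phi:
R_T = \max_{\varphi \in \Phi} \mathbb{E}\Big[\sum_{t=1}^{T} x_{t,\varphi(c_t)}\Big]
      \;-\; \mathbb{E}\Big[\sum_{t=1}^{T} x_{t,I_t}\Big]
```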
In that case we can set up the problem as follows. There are m experts; in each round I am going to show them the context I received, and in turn each of these experts is going to tell me its distribution: each expert has a distribution with which you should be playing the arms, and they reveal it to me. So what the learner actually gets is one distribution from each of these experts, and we can think of that as a matrix. Having received a distribution from each expert, I have to decide according to which expert's distribution I am going to pull an arm. What you do is this: you yourself come up with a distribution on the experts, accordingly you choose an expert, and then you play the arm according to the distribution that expert suggested. Earlier, in EXP3, there was only one expert, namely you: you came up with a distribution on the arms and pulled. Now there are multiple experts, so you maintain a distribution on them, and these experts maintain distributions on the arms.

So the setup is this: in every round, when a context comes, that context is visible to all the experts; they see it, and each tells how the arm should be picked according to its own distribution. The learner observes E_t. What is this E_t? E_t is a matrix: we will assume that each row of this matrix is the probability vector associated with one expert, so it is an m by k matrix, one row per expert, and we write e_m^t for the row given by expert m. This is what the learner observes from all the experts. Then the learner selects a distribution p_t on the k arms in some way; how the learner comes up with this distribution we will specify later. For now just assume that after getting E_t, the learner comes up with a distribution on the k arms, going through the m experts, according to which it wants to play. Then it plays the action I_t sampled from p_t and receives the reward. Recall that in every round t a reward vector, which we denoted x_t, is assigned to the arms: x_{t,1} is the reward assigned to arm 1 in round t, x_{t,2} the reward assigned to arm 2 in round t, and so on. So if you play action I_t, the reward you get is x_{t,I_t}.

With this setup, what is the regret? Let us understand it term by term. One part is clear: it is the total reward you have accumulated over the period. Now, what does e_m^t x_t give you? We said e_m^t is nothing but the probability distribution on the arms suggested by expert m, and x_t is the reward vector. Treating x_t as a column vector and e_m^t as a row vector, their product gives you the expected reward for expert m.
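Written out, the regret in this expert setting compares us against the best expert in hindsight (again my transcription; I write M for the number of experts so that m stays free as the index):

```latex
% Expected reward of expert m in round t: its arm distribution
% e_m^t (a row of E_t) paired with the reward vector x_t.
\langle e_m^t, x_t \rangle = \sum_{i=1}^{k} e^t_{m,i}\, x_{t,i}

% Regret over the horizon T against the best expert in hindsight:
R_T = \mathbb{E}\Big[\max_{1 \le m \le M} \sum_{t=1}^{T} \langle e_m^t, x_t \rangle
      \;-\; \sum_{t=1}^{T} x_{t,I_t}\Big]
```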
If you had gone with expert m, this is the expected reward you would have obtained in round t. Now, what am I looking at? The best reward I could have gotten in hindsight, that is, which expert I should have followed to get the best reward in hindsight, and I compare that with what I actually got. With this setup we now have our regret minimization problem. Let us focus on it and try to see what algorithm we should use to solve it. We have already discussed EXP4, the Exponentially weighted algorithm for Exploration and Exploitation with Experts; what follows is a version of that, and these are its inputs.

[Student:] Which expectation are we talking about? Yes, but you have done it for all arms, right? That is what importance sampling does: this is for all i. In EXP3, there also we got to observe the reward of only one arm, but we estimated the rewards of all arms using importance sampling; we continue to do that here. So for all arms, with i ranging from 1 to k, you are going to have this estimate.

Let us try to understand what this algorithm is doing. As I said, it has two things to deal with: the experts and the distributions given by them. The algorithm simply maintains a distribution on the experts, and once it selects an expert according to that distribution, it simply follows the distribution given by that expert to pull an arm. Initially it assumes a uniform distribution on the experts, since we do not know which one is good. Then what does it do? It sets p_t = q_t E_t. What is E_t? It is the matrix we have already described, where each row corresponds to the distribution given by one expert. And then it pulls an arm according to p_t.

Now, if I do this, is it the same as first selecting an expert according to the distribution q_t and then selecting an arm according to the distribution given by that expert? Putting it in a different form: suppose I have a distribution q_t on the experts; first I select an expert according to q_t; when I select that particular expert, what I have now is a probability vector, and I select an arm according to that vector. Is that the same as pulling an arm I_t according to the distribution p_t? Yes, because p_t is nothing but the product q_t E_t (a small numerical check of this is sketched below).

So you have selected an arm and you receive the reward for that arm. Based on what you observed, you do importance sampling and get estimates for all the arms. What is x̂_{t,i}? It is the estimate you have for arm i in round t, and x̂_t is nothing but the vector of these estimates, with x̂_{t,i} as its ith component. Now, given these estimates of the arms, what is the average reward each expert would have obtained? Let us focus on one expert.
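Here is a minimal numerical check of the equivalence just claimed: sampling an expert from q_t and then an arm from that expert's row of E_t induces the same arm distribution as sampling directly from p_t = q_t E_t (the sizes and the random E_t below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 5, 3                          # number of experts, number of arms

E = rng.random((M, K))
E /= E.sum(axis=1, keepdims=True)    # each row: one expert's arm distribution
q = rng.random(M)
q /= q.sum()                         # learner's distribution over experts

p = q @ E                            # p_t = q_t E_t, a distribution over arms

# Two-stage sampling: expert ~ q_t, then arm ~ that expert's row.
n = 200_000
experts = rng.choice(M, size=n, p=q)
arms = np.array([rng.choice(K, p=E[m]) for m in experts])
empirical = np.bincount(arms, minlength=K) / n

print(np.round(p, 4))                # direct mixture distribution
print(np.round(empirical, 4))        # agrees up to Monte Carlo noise
```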
So, E_t is a matrix; let us focus on one row of it, corresponding to one particular expert, call it expert m. Take expert m's probability vector and multiply it with the column vector of the estimated rewards for that round. What will that give you? The estimated expected reward you would have obtained from expert m in that round. So what does x̃_t give you? It is again a vector, and each component gives the estimated expected reward that the corresponding expert would have obtained in that round. Now you take those values and update the weight of each expert in a way very similar to what we did earlier: you give them weights in the exponentially weighted form, so the weight of the ith expert accumulates the estimated rewards that expert would have obtained so far. There are tuning parameters here: of course you need to know how many experts and how many arms there are, and we also have the parameter gamma. If gamma is greater than 0 and appears like this, which algorithm was this in the standard bandit setting, the adversarial setting, when we used an estimator of this form? It was EXP3-IX, because the gamma has come into the denominator; so what we have here is the IX variant.

Note that we are not updating the experts; the experts are already fixed. They are each using some fixed policy: whenever you give them a context, they tell, for that context, what they would use. We are only updating our weights on them. [Student:] What does the superscript t on e_m^t depend on? It depends on which context you have observed in round t, because for each context an expert may have a different distribution; but those distributions are fixed. For example, if there are 10 contexts, an expert may have one distribution for the first context, another for the second, and so on, but they are not going to change with time: if at two different points in time you observe the same context, the expert would return you the same distribution for both. The expert is not going to adapt based on what contexts it has seen. Our goal here is just to identify, among experts who have already figured out how arms should be selected for each context, who has figured it out well; from the experts we want to identify the best one.
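Putting the pieces together, here is a rough sketch of one round of the scheme just described, following the lecture's reward-based description with the IX-style gamma in the denominator of the importance-sampling estimator. The function name, the learning rate eta, rewards in [0, 1], and the fixed-context toy usage are my assumptions, not fixed by the lecture:

```python
import numpy as np

rng = np.random.default_rng(1)

def exp4_round(weights, E, reward_fn, eta=0.1, gamma=0.01):
    """One round of an EXP4-style update (sketch).

    weights   : current weights over the M experts
    E         : M x K matrix; row m is expert m's arm distribution
                for the context observed this round
    reward_fn : returns the bandit reward of the pulled arm
    """
    M, K = E.shape
    q = weights / weights.sum()          # distribution q_t over experts
    p = q @ E                            # induced arm distribution p_t = q_t E_t
    arm = rng.choice(K, p=p)             # play I_t ~ p_t
    x = reward_fn(arm)                   # observe only the pulled arm's reward

    # Importance-sampling estimate of the full reward vector;
    # the gamma > 0 in the denominator is the IX modification.
    x_hat = np.zeros(K)
    x_hat[arm] = x / (p[arm] + gamma)

    x_tilde = E @ x_hat                  # estimated reward of each expert
    new_w = weights * np.exp(eta * x_tilde)
    return new_w / new_w.sum(), arm, x   # normalize for numerical stability

# Toy usage: 2 experts, 3 arms, one fixed context (so E_t is constant),
# rewards in [0, 1]; only arm 0 ever pays off.
E = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.1, 0.8]])
w = np.ones(2)                           # start uniform over the experts
for t in range(1000):
    w, arm, x = exp4_round(w, E, lambda a: rng.random() * (a == 0))
print(w)                                 # mass shifts to expert 0, who favors arm 0
```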