So, let us continue our discussion on contextual multi-armed bandits that we started. Just a quick recap of what we did: we said that in every round a context is revealed to us, and looking at it the learner has to figure out which action to play, and we said that we have k actions. Then we said that when the learner plays an action i_t, he gets a reward; how did we write the reward in round t? We said the reward in round t is some function of the context in that round and the arm he played, plus a noise term:

r_t = f(x_t, i_t) + η_t.

And for this we wanted to see how we should choose an arm in every round, looking at the context, so that the regret is minimized. How did we define the regret? The regret of a policy π is

R_T(π) = E[ Σ_{t=1}^T max_a f(x_t, a) − Σ_{t=1}^T f(x_t, i_t) ],

where x_t is the context and i_t is the arm played by the learner using policy π in round t. The second sum is the total cumulative reward you actually got, and the first sum is the best you could have got in every round if you knew the reward function: when the learner observes x_t, it is the best he could have gotten in that round, summed over all T rounds, and we compare it with what he got by playing i_t.

As of now, the reward function is just a function of two variables, the context in that round and the action played; it is a map from the context set and the action set to a number. The learner does not know this function; he has to figure it out. Then we said that if there are only finitely many contexts, the learner can treat each (context, arm) pair as another arm whose reward he does not know.
So, in this case, instead of thinking of k arms, he can take each (context, arm) pair, treat these pairs as individual arms, and try to learn the mean reward of each. This corresponds to the standard k-armed bandit we have, except that the number of arms is now the number of contexts times the number of arms. And η_t, we said, is a zero-mean sub-Gaussian noise. But if the number of contexts is huge, then we are basically learning over a large number of arms, and if we just apply the standard bandit algorithm, its regret is going to scale with the number of pairs.

Then we said that in the contextual case it is not that the rewards across these arms are independent: the reward from one arm for a context may be somewhat similar to what you observe for another context, so there could be a potential correlation. That is, if I observe some rewards for one context across the arms, and then I get another context and observe its rewards across the arms, the two could be correlated. For example, if the two contexts both happen to correspond to, say, a young user who is trying to log in, their interests could be similar, and because of that I can expect similar rewards across the arms.

We are henceforth going to assume a structure on the reward function of the form

f(x, a) = φ(x, a)^T θ*,

where the function φ, which we called the feature map, is known for every possible (context, arm) pair, and θ* is an unknown parameter, independent of the context and the arm, which parameterizes the reward function.
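To make the assumed structure concrete, here is a minimal sketch. The particular feature map, the dimension d = 3, and the Gaussian noise are illustrative assumptions; the lecture only requires that φ is known and η_t is zero-mean sub-Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 3
theta_star = rng.normal(size=d)  # unknown parameter; the learner never sees it

def phi(context, arm):
    """Known feature map: maps a (context, arm) pair to a d-dimensional vector.
    This particular construction is a toy choice for illustration."""
    return np.array([context, float(arm), context * arm])

def observed_reward(context, arm, noise_std=0.1):
    """Noisy reward the learner sees: f(x, a) + eta = phi(x, a)^T theta* + eta."""
    return phi(context, arm) @ theta_star + rng.normal(scale=noise_std)

# Mean rewards for one context x = 0.5 across k = 4 arms:
means = [phi(0.5, a) @ theta_star for a in range(4)]
best_arm = int(np.argmax(means))
```

Note that the learner only ever sees `observed_reward`; computing `means` directly requires θ*, which is exactly what he has to learn.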
So, this is one setting, which we called the stochastic contextual multi-armed bandit. Then we said that we have another setting called stochastic linear bandits. Let us treat this as a different problem for the time being and then try to connect the two.

In every round a decision set is revealed to you: we have a sequence of decision sets D_1, D_2, ..., where each D_t is some subset of R^d. Let us assume each D_t is bounded; I am not saying that D_t has finitely many elements, it could have uncountably many, but each vector in it is bounded, meaning each component of the vector is bounded. Now, in every round you pick some d_t ∈ D_t, and the reward you obtain in round t, since we are already looking at the linear bandit, is

r_t = d_t^T θ* + η_t.

So what is d_t? It is the arm: the arms are now nothing but feature vectors, and D_t is the set of feature vectors revealed to you in round t. You have to select one of them, and if you select d_t, this is the reward you get. Now, what is the maximum expected reward you can get in a round? If you know θ*, the best thing you can do is choose from the decision set the d that maximizes d^T θ*. Comparing this over T rounds with whatever d_t your algorithm plays, let me call the difference R_T(π):

R_T(π) = Σ_{t=1}^T max_{d ∈ D_t} d^T θ* − Σ_{t=1}^T d_t^T θ*.

This is what we are going to call the stochastic linear bandit setting. So, what am I doing?
I am basically trying to optimize a linear function here. Let me put it another way. I have a function f whose output at x is x^T θ*, so f is linear, parameterized by θ*. If I knew θ*, then for any x I would know the output of this function. Now suppose this function sits in a black box: you do not know the function, but you want to find an x that maximizes it. How are you going to do this? Every time you are given a set D_t, you do not observe θ*; what you observe is the output of the function. If you give d_t as the input, the output is d_t^T θ*, but you do not observe even that exactly; you observe a noisy version of it. That, as I said, is your reward.

Your goal is to pick a point from the decision set every time so that you get the highest reward. Every time you play, the output depends on θ*, but all you observe is the inner product plus noise. In every round you want to match the best you could have gotten, max_{d ∈ D_t} d^T θ*, which you could compute if you knew θ*; but you do not know θ*, and all you get are these noisy observations. From them you want to learn θ*, so that every time you play an x which optimizes this function; that is the term being subtracted in the regret. This is exactly the stochastic linear bandit: you are trying to optimize a linear function. Are you following? I have just a linear function; I do not have direct access to its parameter, but in every round I can play a point and query it, like querying a black box.
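A single round of this black-box view can be sketched as follows. The decision set, dimension d = 2, and Gaussian noise are illustrative assumptions; the setting only asks for a bounded D_t and sub-Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(1)

theta_star = np.array([1.0, -0.5])  # parameter inside the black box; hidden

def query(x, noise_std=0.1):
    """Black-box oracle: returns x^T theta* plus observation noise eta."""
    return x @ theta_star + rng.normal(scale=noise_std)

# A finite decision set D_t of bounded feature vectors (it could also be
# an uncountable bounded set; finite keeps the sketch simple).
D_t = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.7, 0.7]])

# Best achievable mean reward this round -- computable only with theta*.
best_mean = max(x @ theta_star for x in D_t)

# The learner picks some x_t from D_t and sees only a noisy reward.
x_t = D_t[0]
r_t = query(x_t)
instant_regret = best_mean - x_t @ theta_star  # this round's term in R_T
```

Here the learner happens to pick the best vector, so the instantaneous regret is zero, even though the observed `r_t` is noisy.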
I can query with whatever input I want, and I observe the output; my goal is to make this output as large as possible. The first term is the best I could have got, and the second is what I get by choosing d_t in round t.

A question from the class: does this noise represent the error we made in the features? At least in this model, we are going to treat it as observation noise. Remember we already assumed what it is: we said it is conditionally sub-Gaussian, that is, given your observations up to time t, the noise could depend on everything you have observed so far, but conditioned on that it is sub-Gaussian. Coming to the other part of the question: there is some true reward function, and we have parameterized it through these features; is it guaranteed that this is a good approximation of that function? I do not think η_t corresponds to that. We are simply assuming that there exists a good feature map such that the linear model holds. We are not saying that we can add some noise correction term so that whatever the function is, it is exactly represented by this linear part. We just assume the model holds, and η_t is the observation noise incurred whenever we play a particular feature.

Okay. I have written the two problems separately because one is the stochastic contextual problem, and the other is a stochastic linear optimization problem, where I have this function which I am trying to optimize over many rounds. The second sum is what I have collected from it, and the first is what I would ideally have liked to collect.
I would have collected that had I known θ*, but I do not. Now, in the last class, we said that these two problems are identical. How did we map one to the other? We said that D_t is nothing but the set of feature maps for the context and all possible actions: as soon as I see the context x_t, since I have assumed I know the function φ, I can compute φ(x_t, a) for every action a, and I take that set as my D_t. Now, looking in this D_t for the feature vector that maximizes d^T θ* is the same as asking for the arm that maximizes the reward for that context.

Once I do this mapping, the set D_t is finite for all t, because there are only finitely many arms, k of them. When I defined the stochastic linear bandit, I said D_t is a bounded set, but it could have uncountably many elements; if I can solve the problem for uncountably many, I should be able to solve it for finitely many as well. So, as a special case, if I take D_t to be the set of feature vectors for each arm, D_t is finite, and the contextual problem becomes an instance of the linear bandit problem. The contexts x_t come from some context set C; whenever I get x_t, I map it to features, which now lie in a space of the same dimension as θ*, namely d.
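The reduction just described can be sketched as follows: on seeing context x_t, build D_t = {φ(x_t, a) : a = 1, ..., k}, so that choosing the best vector in D_t is exactly choosing the best arm. The feature map and dimensions below are illustrative assumptions.

```python
import numpy as np

k, d = 4, 3

def phi(context, arm):
    """Known feature map for a (context, arm) pair (toy construction)."""
    return np.array([context, float(arm), context * arm])

def decision_set(context):
    """D_t for the linear bandit: one feature vector per arm."""
    return [phi(context, a) for a in range(k)]

theta_star = np.array([0.2, 0.5, -0.1])  # unknown to the learner in reality

x_t = 1.5                    # context revealed in round t
D_t = decision_set(x_t)      # the linear bandit's decision set, |D_t| = k

# argmax over D_t of d^T theta* is exactly the best arm for context x_t
best_index = int(np.argmax([v @ theta_star for v in D_t]))
```

After this mapping, the identity of the arm is carried entirely by its feature vector, which is the point made in the next paragraph.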
Now I am just saying that this set of feature vectors is itself my decision set in the stochastic linear bandit. Through this reduction there is no context set C any more; there is just a decision set that is given to you in every round, and from it you have to identify the feature vector that maximizes the function. The only thing that comes to you is the set D_t, and the identity of the arms is no longer important, because I have mapped the arms to feature vectors. What matters now is the feature vectors, and the question is simply which feature vector maximizes my reward. That is why we can think of this problem as a stochastic linear bandit, or as stochastic linear optimization.

Now that we have done this abstraction, we can focus on this problem. Rather than assuming D_t is finite in every round, we will allow D_t to be anything, as long as it is bounded: in this setting I am fine with D_t being any bounded subset of R^d. If it consists of only finitely many points, it is still a bounded subset of R^d. I am only finding points from the given set D_t, and the question is whether I can find the best point in it; whether it consists of uncountably many points or finitely many, I do not care. How you optimize over it, whether you use gradient descent or some other optimization routine, I also do not care; that is irrelevant to me. All I care about is the optimal value.
So, right now we are not talking about how to optimize; we are just asking how far your policy is from this best value. As long as D_t is a bounded subset of R^d, whether it consists of finitely many points or uncountably many, it all falls into this setup. Okay, fine. So, now the question is how to go about solving this.

If you recall, at the end of our last class we also discussed that if I take D_t to be the set of unit vectors, then this problem is actually a standard stochastic k-armed bandit, with the components of θ* playing the role of the means of the arms. Let me write these as special cases. Abbreviate the stochastic linear bandit as SLB and the stochastic contextual bandit as SCB. Case 1: if D_t = {φ(x_t, a) : a = 1, ..., k} in every round, then SLB is the same as SCB. Case 2: if D_t = {e_1, e_2, ..., e_d}, the standard unit vectors, where d is the dimension of θ*, then the stochastic linear bandit is the same as a stochastic multi-armed bandit with d arms.

So, that is fine. We have now seen that both the contextual bandits and the standard bandits, which we specifically studied earlier, arise as special cases, and we are going to study the more general structure, the stochastic linear bandit. Just for this, I am going to write R_T(π) for the regret of a policy π without the expectation; whenever I want the expectation, I will write it directly as E[R_T]. Now, how do we go about solving this?
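Special case 2 can be checked directly: with D_t = {e_1, ..., e_d}, the mean reward of "arm" i is e_i^T θ* = θ*_i, so the components of θ* are exactly the arm means of a d-armed bandit. The numbers below are illustrative.

```python
import numpy as np

d = 4
theta_star = np.array([0.1, 0.9, 0.4, 0.3])  # plays the role of (mu_1, ..., mu_d)

E = np.eye(d)                  # rows are the unit vectors e_1, ..., e_d
mean_rewards = E @ theta_star  # e_i^T theta* picks out component theta*_i

best_arm = int(np.argmax(mean_rewards))
```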
So, what is the unknown here? It is θ*. If you can somehow figure out θ*, then we are more or less done. In the multi-armed bandit with k arms, the means μ_1, μ_2, ..., μ_k were the unknowns; if you figure out μ_1, μ_2, ..., μ_k, that problem I can solve. Here θ* is the unknown, but θ* is not just one real number, it is a vector; if I can figure out θ*, I can solve this problem. Now, how do we figure it out and how do we go about it? And note that in the d-armed bandit special case, k is already d: we are dealing with d arms there, okay.
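The section closes by asking how one might figure out the vector θ* from noisy observations r_t = x_t^T θ* + η_t. One natural first idea, sketched here as an assumption rather than something the lecture has introduced, is ordinary least squares on the features played and the rewards observed.

```python
import numpy as np

rng = np.random.default_rng(2)

d, n = 3, 200
theta_star = np.array([0.5, -0.2, 0.8])  # unknown; used here only to simulate data

X = rng.normal(size=(n, d))                          # features played over n rounds
r = X @ theta_star + rng.normal(scale=0.05, size=n)  # noisy observed rewards

# Least-squares estimate of theta* from the (feature, reward) history.
theta_hat, *_ = np.linalg.lstsq(X, r, rcond=None)
```

With enough well-spread queries, `theta_hat` concentrates around θ*; how to trade off such estimation against playing well is exactly the bandit question.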