In the last class, we started thinking about how to solve this stochastic linear bandit. We said the first question is how to estimate the parameter $\theta^*$. What was the natural candidate for this estimation? How did we get an estimate of $\theta^*$? We already discussed it: we use the regularized least squares method. So the estimate comes from least squares. And the second part? We assume that in every round we can find a confidence set, so that $\theta^*$ belongs to that confidence set with high probability.

The first part is settled: given my observations, I find an estimate of $\theta$ by regularized least squares. The main question is how to find a confidence set so that $\theta^*$ belongs to it with high probability. Last class, we said: let us assume the confidence set to be a ball around the current estimate,
$$C_t = \{\theta \in \mathbb{R}^d : \|\theta - \hat{\theta}_{t-1}\| \le \beta_t\},$$
for some sequence $(\beta_t)$. Suppose I can construct such sets based on my estimates up to round $t-1$, and suppose this ball contains my $\theta^*$ with high probability. I have not yet specified how the $\beta_t$ are defined. They may depend on $\delta$, the confidence with which you want $\theta^*$ to belong to the set, and they may also depend on all the observations made so far: all the arms played and the corresponding rewards observed. How we get this, we will look into later; for now, suppose we are able to do it.

Now, how did we define the regret of the stochastic linear bandit? We defined it as
$$R_T(\pi) = \sum_{t=1}^{T} \Big( \max_{d \in D_t} \langle d, \theta^* \rangle - \langle d_t, \theta^* \rangle \Big).$$
If I play $d_t$ in round $t$, then $\langle d_t, \theta^* \rangle$ is the mean reward I would get. What you actually observe in round $t$ is the noisy reward
$$r_t = \langle d_t, \theta^* \rangle + \eta_t,$$
where the noise $\eta_t$ is conditionally sub-Gaussian. We want to bound this regret, and also its expected value. Note that $d_t$, the arm played in round $t$, can be random, because the choice of $d_t$ depends on what has been observed so far, and that in turn depends on the decision sets $D_t$ revealed so far.
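To keep the overall picture concrete, here is a minimal sketch of this interaction protocol and the regret just defined. Everything in it is assumed for illustration only: the dimension, the Gaussian noise standing in for the sub-Gaussian $\eta_t$, the random finite decision sets, and the uniform placeholder policy are not part of the lecture's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, T, sigma = 3, 100, 0.1

# Unknown true parameter; unit norm so that, with unit-norm arms,
# the mean rewards satisfy <d, theta*> <= 1 (Cauchy-Schwarz).
theta_star = rng.normal(size=dim)
theta_star /= np.linalg.norm(theta_star)

regret = 0.0
for t in range(T):
    # Decision set revealed in round t (finite and random purely for the demo).
    D_t = rng.normal(size=(5, dim))
    D_t /= np.linalg.norm(D_t, axis=1, keepdims=True)

    d_t = D_t[rng.integers(len(D_t))]              # placeholder policy: uniform choice
    r_t = d_t @ theta_star + sigma * rng.normal()  # observed reward r_t = <d_t, theta*> + eta_t

    # Regret accumulates the gap between the best mean reward in D_t
    # and the mean reward of the arm actually played.
    regret += np.max(D_t @ theta_star) - d_t @ theta_star
```

A real policy would, of course, use the observed rewards $r_t$ to choose $d_t$; that is exactly what the algorithm described below does.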
So, how do we go about this? Before we start proving anything, we are going to make a couple more assumptions about this setup. First, we assume the mean reward is bounded. The mean reward in round $t$ is $\langle d, \theta^* \rangle$, since the noise has mean zero. We assume
$$\langle d, \theta^* \rangle \le 1 \quad \text{for all } d \in \bigcup_t D_t.$$
What does this mean? In round $t$, $D_t$ is the decision set; consider the union of all these decision sets. The assumption says: play any element from all these possible decisions, and the mean reward for it is bounded by 1. This is the same as saying, in the earlier setting, that the mean value of each arm of the bandit is at most 1. The analogue of the arm means here is the quantity $\langle d, \theta^* \rangle$: irrespective of which arm we play, this mean is at most 1. The arms here are nothing but the decision vectors, and whichever one you play, the mean reward you get is at most 1. Note this also covers the maximum reward: in round $t$, the $d$ that maximizes $\langle d, \theta^* \rangle$ is chosen only from $D_t$, so it also belongs to the union of the decision sets, and its mean reward is at most 1 as well. The assumption does not say the set itself is bounded by 1, but that all these mean values are bounded by 1. Remember, the reward is $\langle d_t, \theta^* \rangle + \eta_t$ when you play $d_t$ in round $t$, and the mean reward is the first term, depending on which arm you play; I am just saying that whichever arm you play, in that round or in fact in any round, the mean value is at most 1.

Second, we assume the elements $d$ are bounded in $\ell_2$ norm, say $\|d\|_2 \le L$, again for all $d \in \bigcup_t D_t$.

Why is $d_t$ random? Because it depends on what you observe in every round. At any point in time, you know which arms you have played and what rewards you observed; that is the information you have, and any estimate you make depends on these two quantities. It depends on observations that are noisy, and because of that, so does every subsequent decision. So how did we find $\hat{\theta}_t$? We said it is the regularized least squares estimate,
$$\hat{\theta}_t = \arg\min_{\theta} \sum_{s=1}^{t-1} \big( r_s - \langle d_s, \theta \rangle \big)^2 + \lambda \|\theta\|_2^2,$$
where the sum runs over $s = 1$ to $t-1$: by round $t$ you have observed the reward $r_s$ for each arm $d_s$ played so far. The decision of which arm to play in the next round has to depend on this; based on this estimate, I decide what to play next. I will come to that; first let me write down the conditions under which I am going to prove the result. So the bounded means are one assumption, the bounded norms are another, and the third assumption is simply that $\theta^* \in C_t$ for all $t$ with high probability. Fine, this is the setup.
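As a small illustration of the estimator above: the minimization has the standard closed-form solution $\hat{\theta}_t = \big(\lambda I + \sum_{s=1}^{t-1} d_s d_s^\top\big)^{-1} \sum_{s=1}^{t-1} r_s d_s$, which the sketch below computes. The function name and the default $\lambda = 1$ are mine, not from the lecture.

```python
import numpy as np

def ridge_estimate(arms, rewards, lam=1.0):
    """Regularized least squares:
    argmin_theta sum_s (r_s - <d_s, theta>)^2 + lam * ||theta||^2.

    arms:    (n, d) array whose rows are the arms d_1, ..., d_{t-1} played so far
    rewards: (n,) array of the corresponding observed rewards r_1, ..., r_{t-1}
    """
    arms = np.asarray(arms, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    V = lam * np.eye(arms.shape[1]) + arms.T @ arms  # V_{t-1} = lam*I + sum_s d_s d_s^T
    b = arms.T @ rewards                             # sum_s r_s d_s
    return np.linalg.solve(V, b)                     # closed-form minimizer theta_hat
```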
Now I have to give you an algorithm: how the generic algorithm looks, and then, for that algorithm, how the regret bounds are going to look. You estimate your $\theta$ in round $t$ based on your past observations as above, so this $\hat{\theta}_t$ is a random quantity. Then, using it, in round $t$ you construct a confidence set $C_t$ as described. Now let us see how to make a decision in every round based on this information. In every round, a decision set $D_t$ is revealed to me; looking at that set, I have to decide which element $d_t$ from it I am going to play. How do we do that? Suppose I can assign a value to each element in the decision set $D_t$.

Let us go back to the original multi-armed bandit setting. What did we do there? For each arm, based on its estimate plus a confidence term, I defined a value: in the standard setting I had
$$\hat{\mu}_i(t) + \sqrt{\frac{2 \log t}{N_i(t)}},$$
where $\hat{\mu}_i(t)$ is my estimate of arm $i$ up to round $t-1$ and $N_i(t)$ is the number of pulls of that arm. This is a kind of index value assigned to arm $i$, and I played the arm with the highest value. We said this is nothing but the UCB of arm $i$, and I basically played the arm with the highest UCB. Now, for me, the arms are the entire set $D_t$. So I will do a similar thing: when $D_t$ is revealed, define a UCB index for every point in it, and then find the point in $D_t$ with the highest UCB index.

So now let us worry about how to find this UCB index. That was for the multi-armed bandit; now I have to find the UCB of an element $d$ in round $t$: for all $d \in D_t$, I need $\mathrm{UCB}_t(d)$. Let me write it more clearly. One natural candidate for assigning a UCB value to each element is
$$\mathrm{UCB}_t(d) = \max_{\theta \in C_t} \langle d, \theta \rangle.$$
See what I have defined? This $C_t$ is a set which contains my true parameter with high probability; it is a ball which contains my $\theta^*$ with high probability. What am I doing for my arm $d$? If I have to play this arm $d$, I am looking at the best mean reward I would get if the parameter were drawn from this set $C_t$. All I know is that the true parameter is somewhere in this set, and I take as my UCB index the best value I would obtain if I played $d$: the best that could happen to me. So I am choosing the reward for $d$ optimistically. How did we do it in UCB? We had $\hat{\mu}$, we constructed a confidence interval around it, and I always went and chose the upper end as the index.
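The lecture has not yet fixed the shape of $C_t$, but if one assumes the usual ellipsoidal ball $C_t = \{\theta : \|\theta - \hat{\theta}\|_V \le \beta\}$, with $V$ the regularized design matrix from the least squares step, then this maximization has a closed form: by Cauchy-Schwarz in the $V$-weighted norm, $\max_{\theta \in C_t} \langle d, \theta \rangle = \langle d, \hat{\theta} \rangle + \beta \|d\|_{V^{-1}}$. A sketch under that assumption:

```python
import numpy as np

def ucb_index(arm, theta_hat, V, beta):
    """UCB(d) = max over theta in C_t of <d, theta>, for the ellipsoid
    C_t = {theta : ||theta - theta_hat||_V <= beta}.

    The maximum is attained at theta = theta_hat + beta * V^{-1} d / ||d||_{V^{-1}},
    which gives the closed form <d, theta_hat> + beta * ||d||_{V^{-1}}.
    """
    arm = np.asarray(arm, dtype=float)
    V_inv_arm = np.linalg.solve(V, arm)     # V^{-1} d
    width = np.sqrt(arm @ V_inv_arm)        # ||d||_{V^{-1}}
    return arm @ theta_hat + beta * width
```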
So I am trying to do the same thing here. In this ball, I take my arm $d$ and see which $\theta$ in the set $C_t$ gives me the maximum value of $\langle d, \theta \rangle$; that value I define as the UCB index of arm $d$. It is like saying: with my current information, the confidence set, I optimistically assume that the unknown parameter is the best possible one in $C_t$ for the arm I am about to play. I do not know what that parameter is, but I do this maximization over $C_t$ and assign the UCB to be this value. Do you agree that this assigns a value to $d$ optimistically, from the set $C_t$? Earlier I did this optimistically by adding an upper confidence bound; here too, in an analogous way, I am taking a kind of upper confidence bound: the best I could get from the set $C_t$ if I have to play arm $d$. This is how we define the UCB index of arm $d$. Once it is defined, I find the element that maximizes it and play
$$d_t = \arg\max_{d \in D_t} \mathrm{UCB}_t(d).$$
So this is the one I am going to play, and this is basically my algorithm; a sketch follows below. You see it is nothing but the upper confidence bound idea again, applied to this setting where I do not have finitely many arms: the arm set is now a whole set $D_t$, which could have uncountably many elements. But we have parameterized the rewards, we estimate this parameter, we build confidence about it through $C_t$, and we define an upper confidence value for each arm in this fashion.

Fine, this is the broad idea on which all algorithms for linear stochastic bandits work, and they go by different names based on how they come up with the confidence sets. In the literature you will see algorithms like LinUCB, which stands for UCB for linear stochastic bandits; LinRel, which I think stands for linear reinforcement learning and is one of the earlier algorithms; and a more recent one, OFUL, optimism in the face of uncertainty, where the L stands for linear. So people have these different algorithms, and all of them hinge on the same idea; they just differ in how they construct the confidence sets.

A question came up: are these assumptions restrictive, and is the complex part simply assumed away? No, the assumptions are not really bad; they are just about the setup. We are only saying the mean rewards are bounded by 1. Yes, of course, constructing the confidence sets is the more complex part, and later we will see how to construct such sets. If we began right now with how to construct them, we would lose the overall picture; so for now, let us say we have such sets and see how the algorithm goes.
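Here is a minimal end-to-end sketch of this generic optimistic algorithm under the same ellipsoid assumption as above. The radius schedule `beta_t` below is a crude placeholder, since the lecture deliberately defers how the real $\beta_t$ is chosen, and the finite random decision sets are again just for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, T, lam, sigma = 3, 200, 1.0, 0.1
theta_star = rng.normal(size=dim)
theta_star /= np.linalg.norm(theta_star)   # unknown parameter (demo only)

V = lam * np.eye(dim)   # V_{t-1} = lam*I + sum_s d_s d_s^T
b = np.zeros(dim)       # sum_s r_s d_s
for t in range(1, T + 1):
    # Decision set revealed in round t; finite here purely for illustration.
    D_t = rng.normal(size=(10, dim))
    D_t /= np.linalg.norm(D_t, axis=1, keepdims=True)

    theta_hat = np.linalg.solve(V, b)       # regularized least squares estimate
    beta_t = 1.0 + np.sqrt(np.log(t + 1))   # placeholder radius, NOT the real beta_t

    # UCB(d) = <d, theta_hat> + beta_t * ||d||_{V^{-1}}, maximized over D_t.
    V_inv_Dt = np.linalg.solve(V, D_t.T).T            # rows are V^{-1} d
    widths = np.sqrt(np.sum(D_t * V_inv_Dt, axis=1))  # ||d||_{V^{-1}} per arm
    d_t = D_t[np.argmax(D_t @ theta_hat + beta_t * widths)]

    r_t = d_t @ theta_star + sigma * rng.normal()     # noisy reward
    V += np.outer(d_t, d_t)                           # update statistics
    b += r_t * d_t
```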
Now you will see that even with this setup, to bound my regret I need to take a slightly different approach than what I did for the standard bandit algorithm. How did we prove the regret bound for multi-armed bandits in the standard setup? We basically bounded the number of pulls of each suboptimal arm. And how were we able to do that? If you play a total of $T$ rounds, of course each arm can be played at most $T$ times. But here the number of arms itself could be countably infinite or even uncountable, so it may happen that some arms are never played at all. In the standard multi-armed bandit setting you played each arm at least once: there were only finitely many arms, you always took the time horizon larger than the number of arms, and so you got at least one sample per arm. Here, with $T$ finite, even if the set $D_t$ stayed the same in every round, it may happen that for some arms you never get any observation. Because of this, it is not clear that the same method we used to bound the number of plays of each suboptimal arm is going to work out here. That is why we have to see, given this setup, what is the way to prove the bound.

But what are the arms now? They have been absorbed into the set $D_t$; its elements are what I am going to call arms, all the values possible in $D_t$. Let us rewind: what is $D_t$ for us? Going back to the initial problem from which we arrived here,
$$D_t = \{\phi(x_t, a) : a \in [K]\}.$$
In round $t$, a context $x_t$ was revealed, and I had a feature map $\phi$ which said, for each action $a$, what its feature vector is. That was the contextual bandit; for linear bandits we generalize this even more. I am just saying that the set $D_t$ consists of these feature vectors. In the contextual case this set is finite; by making it more general, $D_t$ could be any arbitrary bounded subset. Where did each of these feature vectors live? Each is a vector in $\mathbb{R}^d$. So instead of always talking about these specific features, I say that in every round I get a set $D_t$, a bounded subset of $\mathbb{R}^d$, consisting of feature vectors, and those feature vectors are what I am going to call arms. The other way to think about it: in this contextual case it is clear there are only finitely many features, one corresponding to each arm. But let the number of arms go to infinity, or even be uncountable; for each arm I have its feature vector, and that collection I now call $D_t$. Each feature vector corresponds to one particular arm, and because we have done this abstraction, I now simply say that any element of $D_t$ corresponds to an arm. So when I take a particular element $\phi(x_t, a)$, it corresponds to that arm $a$: in round $t$, I have mapped an arm to a feature.
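To make that connection concrete, here is one assumed example of how a context and a feature map generate $D_t$. The "disjoint" construction below, which gives each arm its own block of $\theta$, is a common choice but by no means the only one; the map $\phi$ and the dimensions are mine, not from the lecture.

```python
import numpy as np

K, ctx_dim = 4, 2   # number of arms and context dimension (assumed for the demo)

def phi(x, a):
    """One possible feature map phi(x, a): the context is placed in arm a's block,
    so the resulting vectors live in R^(K * ctx_dim) and each arm gets its own
    slice of the shared parameter theta."""
    v = np.zeros(K * ctx_dim)
    v[a * ctx_dim:(a + 1) * ctx_dim] = x
    return v

x_t = np.array([0.3, -1.2])                        # context revealed in round t
D_t = np.stack([phi(x_t, a) for a in range(K)])    # decision set: one feature vector per arm
```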
So here, when I am given all these features, I just think of them as corresponding to different arms. That is why we have thrown away the notion of actual physical arms: we have these feature vectors forming the decision space, and now I come back and say that every element of that decision space is an arm. It is just an arm.