So, during the first half we covered bandit problems: we started with the full-information setting in the adversarial case, then moved to the case where we have only bandit feedback, and then we considered the case where the environment is not adversarial but stochastic, and we looked at different algorithms. At the end we derived the minimax regret one can achieve on all these instances and compared it with what our algorithms actually achieved. We showed that all the algorithms we discussed in class were almost minimax optimal, and one algorithm was exactly minimax optimal; the others were off by a factor of square root of log T in some cases. Before I start the next part: in the last lecture we ended the first half by giving the minimax regret, but there is another notion, called problem-dependent lower bounds, which we did not cover; there is a separate chapter in the book you can look into for that. You remember that we gave two kinds of bounds for each algorithm, right? Problem-dependent and problem-independent bounds. For the problem-independent case we looked at the minimax lower bound and showed that the problem-independent upper bounds we obtained match the minimax lower bound up to log factors. We also gave problem-dependent upper bounds, but we did not establish whether they are optimal, that is, what the problem-dependent lower bounds are. Again, whatever we have is almost optimal, but we are not going to go into problem-dependent lower bounds. So, today we are going to start stochastic contextual bandits. If you recall the setup we studied so far: I have a set of actions, each action gives me some stochastic reward with its own mean value, and my whole interest was to find the action with the best mean, right?
And we posed that problem as a regret minimization problem: what is the total reward I am going to accumulate in expectation over a period of time, compared to the case where I knew exactly which arm has the highest mean. So, the goal there was to find the action with the highest mean, the single best action. But it often happens that the best action for you in a particular round depends on something related to that round itself. It is not that you are interested in one best action over all rounds; the best action could be different in each round. What could be an example? Nowadays you might be doing a lot of shopping on e-commerce sites; when you log in, items you like, but have never told the website about, already start popping up on your screen. What these sites are doing is recommending items you are likely to like. So what is the action set for them? The set of actions is, let us say, either the advertisements or the products they want to sell. Users will be logging into the website; whenever a new user logs in, that is like a new time step that I will be looking at. Naturally, whenever a new user logs in, the website would like to show the product or advertisement that is most likely to be clicked by that user, and for each user the likings could be different. So, if new users logging into the system correspond to different times, then the optimal action at every time is potentially different: it depends on the user who is logging in at that time. And the website you are logged into may have some information about you.
For example, when you registered on the website you might have given your date of birth, your sex, and maybe the region where you reside. These are pieces of information the website has, and maybe this information is useful for the website to tell what is the best advertisement or product to recommend for you. So, now the question is: suppose I have a set of advertisements I want to show to users, and my goal is to get the maximum number of clicks on them. Whenever an advertisement is shown and you click on it, the website makes money out of it, so it wants more clicks to happen on its advertisements. It is going to get more clicks if the advertisement it shows you is one you are likely to click. Over a period of time it wants to maximize the number of clicks it gets. Now, can I pose this problem as a bandit problem in a setting I already know from the first half, either the stochastic or the adversarial bandit setting? If I want to do this, what is the mean reward here? The reward here is the probability of a click: if a user clicks, the website gets some money, so that happens with the click probability. Now, if I use my earlier setting, what will I eventually end up finding? I will end up finding a single best advertisement that I want to show to everybody, because the earlier setup only cares about one best action, and here my actions are the set of advertisements. It cared about finding the single best action among all, the one that would have got the most clicks, so it will always try to find that single best advertisement which gets the maximum number of clicks.
But now, if you customize the advertisement based on the information about the user who has logged in, do you expect to get more clicks than by always showing a single best advertisement across all users? You want to show an advertisement that is most likely to be clicked by each individual user; you want to make it personalized rather than choose one that is globally optimal across all users, right? The question is whether the earlier setting allowed for that, and it did not allow for it, right? So now the question is how to incorporate this possibility into that setting. Fine, you might want to run a separate instance for every user, but user one comes, clicks or does not click on something, and then he may vanish; he may never come again. Did you learn anything from that? And there are so many such users; you cannot design one algorithm for every user. You could potentially treat every user as a separate bandit instance on which I want to learn what is best for him, but that user stays only for a short time; he just comes and leaves. So in that way I cannot directly use the standard bandit setting. What could be the other issues? One, as we already discussed, is that the number of users could be very large and I cannot run a separate algorithm for each. The other is that whatever little I learnt from one user during the time he was in the system, can I transfer that knowledge to other users? The earlier setting did not have that facility either. So, now we are going to study this version of stochastic bandits called stochastic contextual bandits, where the best action can change depending on the context, and the context is whatever information I have about an instance at that time.
So, let me denote my set of actions, and let us look at rounds t = 1, 2, and so on. Think of each round as an instance happening; for example, somebody logs into your system. What the earlier setup did is look at these actions and try to find the single best advertisement I should be showing to every one of them. But now let us say I have some contextual information for every user: at time t I observe a context vector x_t. For example, as I said, when you log in, the website already knows through your profile your age category, possibly the shopping you have done earlier, and so on. On observing this, you now have to find the best action to show for this instance x_t. This x_t is what we are going to call the context or side information, and this context x_t is drawn from some set C, which could be a subset of R^d. So x_t could be a vector of dimension d coming from some subset of R^d. For example, if I am only going to consider your age, height, and location, just those 3, then x_t is a three-dimensional vector for each person. Now, the reward you are going to get if you play an action after observing a context x we are going to denote as r(x_t, I_t), where I_t is the action chosen in round t. So, the learner in round t first observes the contextual information, based on that he plays an action I_t, which is one of the arms, and then he observes a reward in that round which depends on both the context and the arm he played. But what he observes is not just r(x_t, I_t); it is this value plus some noise eta_t. So he observes a reward which is noisy.
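To make the interaction concrete, here is a minimal sketch of one way this protocol could be simulated. Everything here is a hypothetical illustration: the function `reward_mean` is a made-up stand-in for the unknown r(x, a), the noise scale 0.1 is arbitrary, and the learner simply plays uniformly at random rather than trying to learn anything.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 3, 4, 5                  # context dimension, number of arms, horizon

def reward_mean(x, a):
    # Hypothetical mean reward r(x, a): depends on both context and arm.
    return float(np.tanh(x.sum()) * (a + 1) / K)

history = []
for t in range(T):
    x_t = rng.normal(size=d)       # environment reveals the context x_t
    a_t = int(rng.integers(K))     # learner picks an arm I_t (here: uniformly at random)
    eta_t = rng.normal(scale=0.1)  # noise term added to the true reward
    y_t = reward_mean(x_t, a_t) + eta_t   # learner observes only this noisy reward
    history.append((x_t, a_t, y_t))

print(len(history))                # one (context, action, reward) triple per round
```

The point of the sketch is only the shape of the loop: context first, then action, then a noisy reward that depends on both.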
For example, you may like something when you see it, and you might click it, but it may happen that for some reason you do not click it. It just happened that you saw a new mobile which had all the features you wanted, but a week earlier you had bought another mobile, and because of that you do not buy this one. So, the clicks or no-clicks have to be treated as somewhat noisy. For that reason we are going to assume that the reward the learner observes is a noisy version of the actual reward, which depends on both the context and the arm played. So, what is this r? We are going to assume that r is a map from the set of contexts cross the set of arms to the reals. This reward function tells you, for a particular context and a particular action, that is, for a context-action pair, what reward you are going to get. Could the action depend on the history? Yes, it could depend on the history: I_t is selected after observing x_t; the learner may use the context or ignore it, but he has observed it, and after that he chooses the action. For the learner this reward function is unknown. Suppose, for a moment, that for every context and action the learner knew this reward function. What would he do? What is the best thing for him to do? For him the best thing is just, for each context, to find out which action gives the best reward: whenever you observe a particular context, go and see which action maximizes r on that context, and play that. But the learner does not know r a priori, and if he wants to select the best action in each round, one of his goals is to figure out what this function is.
In the current setting, he has to figure out this function for every possible pair of context and action, or at least for whatever contexts he is going to observe, paired with all possible actions. We are going to call this r the reward function, and we are going to assume that the noise eta is conditionally sub-Gaussian. Let me explain what I mean by this. We already know what sub-Gaussian noise is, right? If we say eta is sigma-sub-Gaussian, we know that E[e^{lambda eta}] is bounded above by e^{lambda^2 sigma^2 / 2} for all lambda. Now, the noise could depend on the context and the action; that is why we make it time dependent, and that is why I brought in the word "conditionally". If I say that eta_t is conditionally sigma-sub-Gaussian given F_t, that means E[e^{lambda eta_t} | F_t] <= e^{lambda^2 sigma^2 / 2}, and what is this F_t? F_t is the sigma-algebra generated by x_1, I_1, x_2, I_2, all the way up to x_t, I_t. So, let us understand what this means. I am saying that eta_t is conditionally sub-Gaussian, where the conditioning is on the sigma-algebra generated by your observations so far: x_1 is your first context, after which you played action I_1; x_2 is the second context, on which you played I_2; and so on until round t, where you observed x_t and played I_t. Conditioned on all of this, the noise in that round is sub-Gaussian, ok.
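As a quick numerical sanity check of the sub-Gaussian condition (an illustration, not part of the lecture's development): for Gaussian noise with standard deviation sigma, the moment generating function E[e^{lambda eta}] equals e^{lambda^2 sigma^2 / 2} exactly, so a Monte Carlo estimate should sit essentially at the sub-Gaussian bound; bounded Rademacher noise (+1/-1) satisfies the same bound with sigma = 1 since cosh(lambda) <= e^{lambda^2 / 2}.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, lam = 1.0, 0.5

# Monte Carlo estimate of E[exp(lam * eta)] for Gaussian noise.
eta = rng.normal(scale=sigma, size=1_000_000)
mgf_gauss = np.exp(lam * eta).mean()

# Rademacher noise: E[exp(lam * eta)] = cosh(lam), which is 1-sub-Gaussian.
mgf_rad = np.cosh(lam)

bound = np.exp(lam**2 * sigma**2 / 2)   # the sub-Gaussian MGF bound
print(mgf_gauss <= 1.01 * bound, mgf_rad <= bound)
```

The same check could be repeated for any fixed lambda; the sub-Gaussian definition requires the bound to hold for all lambda simultaneously.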
So, you are just saying that if I know everything that has happened till now, the noise I am going to observe in the reward is sub-Gaussian, conditioned on my observations so far. Notice that I have already conditioned on the context observed and the action selected, so the noise could potentially depend on both the context and the arm selected in that round. That is the noise part. Now, since it is sigma-sub-Gaussian noise, what is its mean? It is 0, right? So what is the expected reward we are going to get in any round? It is simply r(x_t, I_t), where I_t is whatever arm you play in that round. The way we write this is E[Y_t | F_t] = r(x_t, I_t): when I take the expectation of the observed reward Y_t in round t conditioned on F_t, which includes all my observations so far, I already know which context I observed and which arm I played, so the conditional mean is exactly the reward function at that pair. So, now, how should we select actions? This is the problem setup: the environment generates the context, the learner observes that context, and the learner has to decide which arm to play. So what does the learner have to do? He has to come up with a mapping that maps a given context to an arm. Any mapping from contexts to arms is going to be a policy here. That is basically what the learner is doing: he observes a context and then has to decide which arm to play; that mapping is what he has to find.
Now, the learner's goal is to come up with a policy pi that gives him the maximum reward. What is the maximum reward a learner can get? Suppose he knows the reward function; what is the maximum he can get over a horizon of T rounds? First, recall how we denoted the contexts: x_t in round t. But I have not yet specified how these contexts are generated. They are generated stochastically in every round, and we are going to assume they are generated in an i.i.d. fashion: the contexts are drawn according to some common distribution and then revealed to the learner. I am going to denote the contexts observed over the horizon as X_1, X_2, up to X_T; these are random, so I write them with capital letters. Then suppose you have observed a particular realization x_1, x_2, ..., x_T over the period of time T. Over this horizon, what is the best total reward the learner could get? If you have observed x_t in round t, what action a should you play in that round? Let me write it as a_t^*. Can you tell me what a_t^* should be? If you know the function r and you observe x_t in round t, you should play the action a that maximizes r(x_t, a), that is, a_t^* = argmax_a r(x_t, a). But the learner a priori does not know this reward function; it is hidden from him. So he will play whatever actions are derived from the policy he is using, and we are going to define the regret of that policy with respect to this benchmark, which is the best he could get. So we define the regret of a policy pi as an expected value, as follows.
So, the sum over t of r(x_t, a_t^*) is the best he could get if he applies the best action in every round; a_t^* is the best action he could apply if he knew the r function, which is how we defined a_t^*. And we compare this against the sum over t of r(x_t, I_t), the total reward he accumulates by playing I_t in round t. So the regret of policy pi over horizon T is R_T(pi) = E[ sum_{t=1}^{T} r(X_t, a_t^*) - sum_{t=1}^{T} r(X_t, I_t) ]. The benchmark term we can compute only if we know the reward function; the second term is what you get through your policy by playing I_t, without knowing the reward function, where I_t is whatever the policy pi tells you to play in round t after observing the context. Now, what is this expectation over? What does it average over? One source of randomness is the contexts, which differ across rounds. What is the other? I_t could also be random: the arm played in each round may be deterministic, but it could also be randomized. So I am averaging over both sources of randomness here. Now suppose that for every possible context x and arm A I knew the mean reward r(x, A): on a particular context, if I play a particular arm, I observe a noisy reward whose mean is given by this reward function. Then, instead of treating only the actions as arms, I can treat each context paired with an arm as a separate arm. Can I do that? In that case, how many pairs am I going to get? Cardinality of C times cardinality of K, right: every context paired with every arm. So let us collect these values r(x, A) for x in C and A in K.
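To see numerically why a context-blind policy suffers against this benchmark, here is a toy regret computation; the two-context mean-reward table `r` is entirely made up for illustration. The oracle plays argmax_a r(x_t, a) in every round, while the context-blind policy commits to the single arm that is best on average across contexts.

```python
import numpy as np

rng = np.random.default_rng(2)
K, T = 3, 2000
contexts = rng.integers(2, size=T)          # two contexts, drawn i.i.d. each round

# Hypothetical mean-reward table r[x, a]; the best arm differs per context.
r = np.array([[0.9, 0.2, 0.1],
              [0.1, 0.3, 0.8]])

# Context-blind policy: the single arm with the best average reward.
best_fixed = int(np.argmax(r.mean(axis=0)))

oracle_total = r[contexts].max(axis=1).sum()   # sum_t r(x_t, a_t*)
fixed_total = r[contexts, best_fixed].sum()    # sum_t r(x_t, best_fixed)
regret = oracle_total - fixed_total
print(regret > 0)
```

Because the fixed arm is wrong on roughly half the rounds, its regret grows linearly in T, whereas a policy that adapts to the context can avoid that.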
If I knew the r function I could compute all these values, and I am going to treat r(x, A) as the mean reward I get when I play action A on observing context x. So now suppose I think of this as a bandit problem where my arms are these (x, A) pairs. If I can find estimates of the reward value for each of these pairs, then whenever I observe a context I can figure out which arm I should play on it. Alternatively, you can just think of these pairs as different arms: I am going to play them, and the goal is, whenever I observe a particular context, to play the arm with the highest mean for it. That is why I enumerated over all contexts and all arms. Could the context set be infinite? Yes, it could be, but for the time being assume it is finite; the number of arms is also finite, so we have finitely many pairs. Now I treat them as different arms, and my goal is just to find, for every context, which is the best one. If I treat them as separate arms, can I use my standard bandit algorithms to figure out the best action eventually for each context? Yes, right? Basically, for every context you play actions and estimate their rewards, and you do the same for every context. Once you have these estimates, whenever you observe a new context you play whichever arm has the highest estimated mean for it. But what is the problem with that? The set of arms could be potentially large, especially, as you are seeing, if the context set is very large.
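As a sketch of this brute-force approach, one could run a separate standard UCB1 instance per context, which is the same as treating every (context, arm) pair as its own arm. The two-context reward table and the noise level are made up; UCB1 here is the usual index algorithm from the first half of the course.

```python
import numpy as np

rng = np.random.default_rng(3)
n_ctx, K, T = 2, 3, 20000
r = np.array([[0.9, 0.2, 0.1],          # hypothetical means r(x, a)
              [0.1, 0.3, 0.8]])

counts = np.zeros((n_ctx, K))            # pulls of each (context, arm) pair
sums = np.zeros((n_ctx, K))              # accumulated rewards per pair

for t in range(1, T + 1):
    x = int(rng.integers(n_ctx))         # context arrives i.i.d.
    with np.errstate(divide="ignore", invalid="ignore"):
        # UCB1 index for each pair: empirical mean + exploration bonus.
        ucb = sums[x] / counts[x] + np.sqrt(2 * np.log(t) / counts[x])
    ucb[counts[x] == 0] = np.inf         # unplayed pairs are tried first
    a = int(np.argmax(ucb))
    y = r[x, a] + rng.normal(scale=0.1)  # noisy reward for the chosen pair
    counts[x, a] += 1
    sums[x, a] += y

learned = [int(np.argmax(counts[x])) for x in range(n_ctx)]
print(learned)
```

This works when the number of pairs is small, but the regret now scales with the total number of pairs rather than with K, which is the problem discussed next.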
Even if the context set is finite but very large, the number of arms becomes large, and we already know how the regret scales with the number of arms: like square root of K in the minimax setting. If I treat all these pairs as arms instead of just K arms, the regret scales like the square root of |C| times K, which can be very bad. But think about how we are gathering information right now. We have just said that for every context-arm pair there is a reward. Did I say anything about how one pair tells you anything about another pair? No, as of now we have not brought in any such structure. So, if you blindly treat this problem in the earlier bandit setting with all these pairs as arms, your regret bounds can be very bad. Now the question is: when can we actually do better? As you are saying, if there is information I can extract from one context about another context, maybe I should be able to do better. And this is often the case when you have contextual information: for instance, the kinds of things youngsters click on a web page could be very different from the kinds of things older people click on. So if you know that some people are in a certain age category, you have already narrowed down what their interests could be, and you can infer the interests of one person from those of another person in a similar age group; all of this you could potentially do. So what we are going to assume next is that the rewards can be parameterized in such a way that there is some correlation of rewards across contexts. So far we did not say anything about how the function r looks; now we are going to make a specific assumption.
Now we are going to assume that the function r, for a given context x and arm a, has the form r(x, a) = <theta^*, phi(x, a)>. What is phi? We now assume there is some theta^*, which is unknown, and this theta^* belongs to R^d; phi is a known feature map, and the reward is given as the inner product of these two quantities. Of course, the feature map also gives vectors in R^d: phi(x, a) belongs to R^d for all pairs (x, a). Yes, phi is known; that is what I have written, a known feature map. So, let us understand what we are saying here. We have a fixed theta^* that does not depend on which context or action we are talking about; this parameter is independent of them, and it is unknown. And we have a known map phi which does depend on contexts and actions. We are now saying that the reward is the inner product of these two.
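One standard way to realize this linear parameterization, shown here as a hypothetical construction and certainly not the only choice of feature map, is the "disjoint" encoding: stack the context into the block of a d·K-dimensional vector corresponding to the chosen arm, so that the reward of every (context, arm) pair is a single inner product with one shared theta^*.

```python
import numpy as np

rng = np.random.default_rng(4)
d, K = 3, 2
theta_star = rng.normal(size=d * K)     # unknown parameter theta* (hypothetical values)

def phi(x, a):
    # Known feature map: place context x in arm a's block of a (d*K)-vector.
    v = np.zeros(d * K)
    v[a * d:(a + 1) * d] = x
    return v

x = rng.normal(size=d)
rewards = np.array([phi(x, a) @ theta_star for a in range(K)])

# Arm a's reward is just the inner product of x with theta*'s a-th block,
# so learning the single vector theta* ties all contexts and arms together.
assert np.isclose(rewards[1], x @ theta_star[d:2 * d])
print(rewards.shape)
```

The payoff of this structure is exactly the knowledge transfer discussed above: one observation on one (context, arm) pair is informative about every other pair, because they all share the same theta^*.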