So far we have been studying what we call the learning with full information setup. In the online binary classification problem we were interested in, we said that when an instance (a sample) arrives, we get to see the predictions of all the hypotheses available to us; that is, you get to know what every hypothesis would have said. That was the full information setting.

Let me revisit the regret definition we had in this case. How did we write the regret $R_n$? We had another parameter here, the hypothesis class $\mathcal{H}$, and we wrote it as a supremum over $h \in \mathcal{H}$, summed over $n$ rounds:

$$R_n = \sup_{h \in \mathcal{H}} \left[ \sum_{t=1}^{n} \mathbb{1}\{\hat{y}_t \neq y_t\} - \sum_{t=1}^{n} \mathbb{1}\{h(x_t) \neq y_t\} \right].$$

This is what we defined as regret. Let us revisit this definition a bit. What we said is: whatever the sequence I am going to see, I am interested in the single best hypothesis; if I had applied that one best hypothesis throughout, how far would I be from it? And when we bounded this quantity using our algorithms, the bounds held independently of the hypothesis and of the sequence.

Now let me ask: suppose instead I had defined the regret with a supremum over sequences as well,

$$R_n = \sup_{(x_1,y_1),\dots,(x_n,y_n)} \left[ \sum_{t=1}^{n} \mathbb{1}\{\hat{y}_t \neq y_t\} - \min_{h \in \mathcal{H}} \sum_{t=1}^{n} \mathbb{1}\{h(x_t) \neq y_t\} \right].$$

Do all my earlier bounds still continue to hold? Yes: when I gave the bound, whatever $h$ you took and whatever sequence you took, I was able to bound this quantity. The earlier bounds, if you recall, were of the form $\sqrt{2n\ln|\mathcal{H}|}$, which was independent of both.

So what is the difference between these two metrics? In the second one, the hypothesis attaining the minimum is selected based on the particular sequence; it can change with the sequence. In the first one, I am asking for one hypothesis that maximizes the gap over all possible sequences. Still, whatever bounds I gave for the first remain valid for the second as well, because they were independent of $h$ and of the sequence. If you rewrite the first quantity a little, pulling the supremum over $h$ inside the bracket, it becomes a minimum; we have already discussed this. It says: given a sequence, what is the best you could have done in hindsight? This is the loss you incurred; you look at which among your hypotheses is the best one, and you compare that against what you actually got. So even though earlier we bounded the first quantity, all our bounds were in fact valid for this one too.

Another thing we said is that the learner's prediction $\hat{y}_t$ in round $t$ need not be deterministic. We said that if he gives a deterministic prediction, he can be made to err in every round; so we allowed $\hat{y}_t$ to be random, and once $\hat{y}_t$ is random we considered the expected regret instead. So this was the setup in the case I called the full information setting.
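Written out for the 0-1 loss, the rewriting step just mentioned is the following identity, valid for any fixed sequence:

$$\sup_{h \in \mathcal{H}} \left[ \sum_{t=1}^{n} \mathbb{1}\{\hat{y}_t \neq y_t\} - \sum_{t=1}^{n} \mathbb{1}\{h(x_t) \neq y_t\} \right] = \sum_{t=1}^{n} \mathbb{1}\{\hat{y}_t \neq y_t\} - \min_{h \in \mathcal{H}} \sum_{t=1}^{n} \mathbb{1}\{h(x_t) \neq y_t\},$$

since the first sum does not depend on $h$, and $\sup_h (-a_h) = -\min_h a_h$ for a finite class. The right-hand side reads exactly as described: the loss the learner incurred, minus the loss of the best hypothesis in hindsight for that particular sequence.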
In this setup we obtained the bound using the weighted majority algorithm, and in the weighted majority algorithm we said that we get to see the predictions made by all the hypotheses in every round. So basically, using this prediction-with-expert-advice setting, we said that one can achieve a regret bound of $\sqrt{2n\ln|\mathcal{H}|}$ for this setup.

Now consider a case where my information structure is slightly limited. When we are doing this learning, the information we get by taking an action is very important: if we get more information, maybe we can learn faster; if we get less information, we will be slower in learning. So I now want to move to something called the bandit setting. What I mean by the bandit setting is this: when you play an action, you only get to see the loss you incurred from playing that action, and nothing else. Earlier, in the full information case, whatever action you played, you incurred the loss of that action, but you also got to observe the losses of all the other actions. Here that is not so: whatever you play, you incur the loss for that action and you observe only that loss. So now we are going to focus on this setting.

Can somebody give an example of the full information setting and of the bandit setting? I mentioned this briefly in the last class. [A student answers.] Yes, whatever you have played so far you can observe, but I am asking about the current round: when I say full information and bandit information, this is about what information you get in each round. Exactly. Think about a casino: as a player you go and play one machine, and you get to see whether you lose or win, whatever amount of money you make with that machine. In that round you do not know what the other machines would have given you; the ones you have not played, you do not see. You only get to see the one which you played. That is exactly what we are going to call the bandit setting.

Now, what is an example of the full information setting? As I also said in the last class: suppose you have, say, 10 routes from your home to your office, and each gives a different travel time on a given day. If you take a particular route, you only get to see the travel time on that route, not on the other routes. But this could also become a full information setting. Why? Suppose you, the learner, have turned on the radio in your car while driving, and the radio keeps announcing how much congestion there was on each route. Even though you experience the actual traffic only on the route you are travelling, through the radio announcements you get to know the traffic being experienced on the other routes, even though you are not there. If you do not have that radio announcing this, you only know about the route on which you are travelling; if somebody gives you this information about the other routes as well, then you have full information.
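To make the full-information recap concrete, here is a minimal sketch of a randomized weighted majority (exponential weights) learner of the kind the lecture refers to; the function name, the array layout, and the stated learning-rate tuning are my illustrative assumptions, not the lecture's notation.

```python
import numpy as np

def exponential_weights(losses, eta, seed=0):
    """Randomized weighted majority / exponential weights (full information).

    losses: (n, K) array; losses[t, k] is the loss of expert k in round t.
            After playing, the learner observes the ENTIRE row losses[t].
    eta:    learning rate; eta = sqrt(2 * ln(K) / n) is a typical tuning.
    """
    n, K = losses.shape
    log_w = np.zeros(K)                 # log-weights; uniform at the start
    rng = np.random.default_rng(seed)
    total = 0.0
    for t in range(n):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()                    # randomized prediction over experts
        i_t = rng.choice(K, p=p)
        total += losses[t, i_t]
        log_w -= eta * losses[t]        # full-information update: every
                                        # expert is reweighted, not just i_t
    return total
```

The point to contrast with what follows: the update touches every expert's weight, which is only possible because every expert's loss is revealed in every round.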
The other example could be the stock market. On a particular day, say you have the option to buy one of ten stocks; you buy one share, and on that share you get to see whether you made money or not. You may not get to see the values of the other stocks, but if you go and look at the newspaper, it will have reported the gains and losses incurred by the other stocks as well.

So in the setup so far, we somehow assumed that we get to observe the losses incurred by all the actions we have. Henceforth I will restrict this: I will get to see the losses of only the actions I play, and not of the others. Getting more information is always costly: in the road example you need to have a radio for that, and in the stock market example you may have to go to newspapers or web portals to get the details about the other stocks.

Now, how is this setup going to differ from the earlier one? When it comes to the regret definition, which was our performance criterion, will there be any difference, or do you think we should consider something else to evaluate performance in this setup?

Let me make this a bit more formal. Henceforth I will only talk about actions; I will not talk about hypotheses. Let us say there are $K$ actions, which I denote $\{1, 2, \dots, K\}$. In each round the learner is going to choose one of the actions, and when he chooses that action he is going to observe the loss for that action only. I am going to denote by $x_{t,i_t}$ the loss incurred by playing action $i_t$ in round $t$; you understand this notation: he observes the loss of $i_t$ and not of the other actions.

So this is what the learner is doing; what is the environment doing? In round $t$, the environment assigns a loss vector $x_t = (x_{t,1}, \dots, x_{t,K})$ to these actions, and the player selects $i_t$ and observes the loss of whichever action he has played. And this $i_t$, the learner can select it, like earlier, at random: he need not select $i_t$ according to a deterministic policy; he can randomize.

If this is the interaction between the learner and the environment, we are now going to define the regret of a policy $\pi$. There is no hypothesis class here; the environment is just assigning these loss vectors, and $\pi$ is some policy the learner is using. Let us make precise what a policy is. The policy of the player is to select an action in each round, and he selects his action based on his past observations. So let $H_t$ be the history he has observed up to round $t$: in round 1 he played action $i_1$ and observed the corresponding loss, and so on, all the way up to $i_{t-1}$ and $x_{t-1,i_{t-1}}$. You understand this: $i_1$ is the action played by the player in the first round, and $x_1$ is the loss vector of round 1, but only its $i_1$-th component is the loss he actually observed, and similarly up to round $t-1$. The learner's policy in round $t$ is then the probability that $i_t = i$ given this history.
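In symbols, the history and the policy just described are:

$$H_t = \bigl(i_1,\, x_{1,i_1},\, i_2,\, x_{2,i_2},\, \dots,\, i_{t-1},\, x_{t-1,i_{t-1}}\bigr), \qquad \pi_t(i) = \Pr\bigl(i_t = i \mid H_t\bigr), \quad i \in \{1, \dots, K\},$$

so a policy $\pi = (\pi_1, \pi_2, \dots)$ is a sequence of maps from histories to probability distributions over the $K$ actions.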
So the learner has access to this history; based on it, he is going to choose an $i_t$ in that round, and with what probability he chooses each action, that is his policy. So $\pi$ is the collection of probability distributions that he comes up with in each round. And notice that we are allowing $i_t$ to be random: he is going to choose action $i$ with some probability.

[In response to a question:] Yes, the losses could be arbitrarily selected by the environment, like in the case we did in the expert advice setting. The environment is just going to assign a loss vector to all the arms you have, and how it assigns them can be totally arbitrary; whether it assigns the losses such that the components across the arms are correlated or uncorrelated, we are not putting any constraint. Then the learner selects an arm $i_t$ and observes the loss $x_{t,i_t}$; in this case, whatever he observes is exactly what he has incurred.

Now we are going to define the regret of the learner under this policy. Depending on how he chooses $i_t$, these probabilities define his policy. We define the regret of the policy to be what he incurred by playing this sequence of $i_t$'s, minus the best he could have incurred had he seen the whole sequence:

$$R_n(\pi) = \sum_{t=1}^{n} x_{t,i_t} - \min_{i \in \{1,\dots,K\}} \sum_{t=1}^{n} x_{t,i}.$$

What is the second term? If he knew all the loss vectors for all the rounds, he would have chosen the single action that gives the minimum total loss. We compare what he incurred against that, and this difference we call regret.

Is this regret a random quantity? Yes, because $i_t$ is random. So instead of bounding this random quantity, we may be interested in the average performance of the learner, and for that we can take the expected value. For a given sequence of loss vectors, had we known them, we would have liked to select the $i$ that minimizes the total loss; I am comparing my loss with that, and that is what I call regret. We have already discussed this for the full information case. Now I can look at the expected value of this quantity; the expectation is governed by the probability vectors $p_t$, where $p_{t,i} = \Pr(i_t = i \mid H_t)$, and hence by my policy $\pi$:

$$\bar{R}_n(\pi) = \mathbb{E}\left[\sum_{t=1}^{n} x_{t,i_t}\right] - \min_{i} \sum_{t=1}^{n} x_{t,i}.$$

If you notice, this regret is the same as the one we defined earlier; it is just that there it was defined for the binary classification problem. Instead of thinking of a pair $(x_t, y_t)$ arriving in round $t$ and computing the loss each hypothesis would incur, we simply say that the loss is $x_{t,i}$: in this language, you can think of $x_{t,i}$ as the loss I incur if I apply the $i$-th hypothesis to the context arriving in round $t$. Fine; now we are going to see how to bound this.
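As a sanity check on these definitions, here is a small simulation sketch: the environment fixes an arbitrary loss matrix up front, the learner sees only the entry for the arm it plays, and the realized regret is computed against the best fixed arm in hindsight. The function and variable names are mine, purely for illustration.

```python
import numpy as np

def realized_regret(loss_matrix, policy, seed=0):
    """Realized (random) regret of a bandit learner on one loss sequence.

    loss_matrix: (n, K) array chosen by the environment in advance;
                 only the played entry is ever shown to the learner.
    policy:      callable(history) -> length-K probability vector, where
                 history is the list of (action, observed_loss) pairs H_t.
    """
    n, K = loss_matrix.shape
    rng = np.random.default_rng(seed)
    history, incurred = [], 0.0
    for t in range(n):
        p = policy(history)
        i_t = rng.choice(K, p=p)
        x_ti = loss_matrix[t, i_t]       # bandit feedback: this entry only
        incurred += x_ti
        history.append((i_t, x_ti))
    best_fixed = loss_matrix.sum(axis=0).min()   # best single arm in hindsight
    return incurred - best_fixed

# Usage with a placeholder policy that ignores the history:
env = np.random.default_rng(1).random((1000, 5))
print(realized_regret(env, lambda h: np.full(5, 1 / 5)))
```

Averaging this quantity over the randomness in the $i_t$'s (rerunning with different seeds) estimates the expected regret $\bar{R}_n(\pi)$ defined above.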
What we would like to do is bound this quantity by taking the supremum over all possible loss sequences, the way we have done before:

$$\sup_{x_1, \dots, x_n} \left( \mathbb{E}\left[ \sum_{t=1}^{n} x_{t,i_t} \right] - \min_{i} \sum_{t=1}^{n} x_{t,i} \right).$$

So we have just rewritten the whole regret. Notice that when we wrote these definitions, we really did not worry about the information structure: there we had full information, and here we have bandit information, but the regret definition is still the same. Where that picture comes in, what kind of information you have, is in your algorithm, when you are going to update, because here the policy observes only the losses of the actions it played, not of everything else. Now let us try to see what algorithm we get.
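The lecture develops the algorithm next. As a preview, one standard answer for this adversarial bandit setting, sketched here only as an illustration and not necessarily in the form the course will use, is Exp3: run exponential weights on importance-weighted loss estimates, so the update needs only the single observed entry. Losses are assumed to lie in $[0,1]$, and all names here are illustrative.

```python
import numpy as np

def exp3(loss_matrix, eta, seed=0):
    """Exp3 sketch for adversarial bandits with losses in [0, 1].

    The trick: for each arm i, the estimator (x_{t,i} / p_{t,i}) * 1{i_t = i}
    has expectation x_{t,i}, so the exponential-weights update can run on
    bandit feedback alone (unplayed arms get estimate 0, i.e. no update).
    """
    n, K = loss_matrix.shape
    rng = np.random.default_rng(seed)
    log_w = np.zeros(K)
    incurred = 0.0
    for t in range(n):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        i_t = rng.choice(K, p=p)
        x_ti = loss_matrix[t, i_t]        # the ONLY loss observed this round
        incurred += x_ti
        x_hat = x_ti / p[i_t]             # importance-weighted loss estimate
        log_w[i_t] -= eta * x_hat         # only the played arm is updated
    return incurred

# Usage: regret against the best fixed arm, with a typical tuning of eta.
env = np.random.default_rng(1).random((5000, 10))
eta = np.sqrt(2 * np.log(10) / (10 * 5000))
print(exp3(env, eta) - env.sum(axis=0).min())
```

Compare the last update line with the full-information sketch earlier: there the whole weight vector moved every round; here only the played arm's weight moves, which is precisely where the bandit information structure enters the algorithm.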