Yeah, you do not know the means, you are not an oracle, but your goal has been set: you want to maximize the mean reward. How are you going to do it? One naive strategy is to just shoot in the dark: I do not care what goal has been set. I have been asked to play T rounds, so I will play each action in an equal proportion of the rounds. Say there are k actions, or k arms, and T rounds: you play each of them T/k times. You can do this, but do you think this is going to be a good policy, just trying all of them equally? Or, instead of playing all of them an equal number of times, you could select an arm uniformly at random in each round. Say there are 10 arms: you give probability 1/10 to each of them, and in each round you pull one arm according to this uniform distribution. Do you think this is going to be any good?

In both of these cases, whether you played each arm T/k times or sampled according to the uniform distribution, you are basically not learning: you are not taking into account the feedback you observe. You have decided, "whatever happens, I am just going to do this," and that is never going to get you to the optimal strategy; it is never going to improve. If you just ignore whatever you observed, you are not even attempting to learn, so you are not going to do any better.

So, for policy 2, what can you do better? Think it through. I am interacting with the environment: I apply an action, the environment tells me whether that action was good or bad, and I have this information. From this, can I do better? How many of you know what an empirical mean is? Raise your hands. You have observed how good an action is through your past actions and past observations. Let us focus on one action: say that out of 100 rounds you have applied one particular action 10 times, and for those 10 plays you observed its feedback. Can you take these 10 observations and find the empirical mean of that action? Let us say this is the Bernoulli case, and every time I play this action I get an independent observation. How many of you do not know what IID means? All of you know it, ok. If these observations are independent Bernoulli samples, you can just count how many times I observed a 1, take the average, and that is going to give me the empirical mean. So, out of these 100 rounds, based on my observations I can compute an empirical mean for each arm. Once I have the empirical means, what can I do? Take the arm with the highest empirical mean. So, is that good?
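Just to make this concrete, here is a small Python sketch of the two naive policies and the empirical mean computation, assuming Bernoulli arms. The arm means and the helper name pull are made up for illustration; the learner, of course, never sees the means.

```python
import random

def pull(p):
    """Simulate one Bernoulli pull of an arm whose (unknown) mean is p."""
    return 1 if random.random() < p else 0

means = [0.3, 0.5, 0.7]   # illustrative; unknown to the learner
k, T = len(means), 90

# Naive policy (a): play each arm T/k times, ignoring all feedback.
total_round_robin = sum(pull(means[t % k]) for t in range(T))

# Naive policy (b): select an arm uniformly at random in each round.
total_uniform = sum(pull(random.choice(means)) for _ in range(T))

# Empirical mean of one arm from 10 IID Bernoulli samples:
# count the 1s and divide by the number of pulls.
samples = [pull(means[0]) for _ in range(10)]
empirical_mean = sum(samples) / len(samples)
print(total_round_robin, total_uniform, empirical_mean)
```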
So, does anybody see that this is not really a good strategy? And how many of you agree at least that it is better than the first policy? "Just one round, how will we get the..." Not one round; in the first round you do not have any samples, so you cannot do anything. Say you have 3 actions: you do not know which one is better because you do not have even one sample. So, as a bare minimum, let us start by playing each of them once. Now I have one sample for each of them. You have spent k rounds on this already, because you played each of them once, and in round k+1 you can compute the empirical means, because you have at least one sample each; in this trivial case that one sample itself is the empirical mean. In general, if you have played a particular action n times, you compute its empirical mean from those n samples. Once you have the empirical means, you can order them and play the best one.

Many of you said this is better than the first policy, but do any of you feel that it is not really better, or at least that in some cases the first policy would have been better? Remember, the goal, as I said, is to maximize the sum of the rewards. To be concrete: we play the first k rounds to get one sample from each arm; after that we compute the empirical means, and from round k+1 onwards we pick the arm with the highest empirical mean. Say in round k+1 I played some arm and it gave me some reward: I update that arm's empirical mean, the rest remain the same, and in round k+2 I again choose the arm with the highest empirical mean.

So, fine, it looks like the second policy is going to be better than the first, but what is bad about it? Suppose there are two arms with means p1 and p2, where one of them is somewhat good but the other is very good. If we had been choosing actions uniformly at random, we would have earned roughly the average, (p1 + p2)/2, per round. But if early on arm 2 happened to look bad, we start picking arm 1 and end up playing arm 1 until the end. So there might be cases where policy 1 performs better. Just for understanding purposes, let us say there are only two actions, and action 1 has a higher mean than action 2. In this case action 1 is optimal, but I do not know this; I just know there are two actions. Can you imagine a scenario in which the second policy is going to be bad here? Recall what the first policy said: if T = 100, play the first action 50 times and the second action 50 times. And what is the second policy saying? In the first two rounds play each arm once, and after that pick the one with the highest empirical mean at that time.
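Here is a minimal sketch of this greedy policy, play each arm once and then always the empirically best, again assuming Bernoulli arms; the tie-breaking rule and the arm means are my own choices for illustration.

```python
import random

def pull(p):
    return 1 if random.random() < p else 0

def greedy(means, T):
    """Play each arm once, then always the arm with the highest empirical mean."""
    k = len(means)
    counts = [0] * k    # number of pulls of each arm
    sums = [0] * k      # total reward collected from each arm
    total = 0
    for t in range(T):
        if t < k:
            a = t       # first k rounds: one sample from each arm
        else:
            # thereafter: exploit the empirically best arm (ties -> lowest index)
            a = max(range(k), key=lambda i: sums[i] / counts[i])
        r = pull(means[a])
        counts[a] += 1
        sums[a] += r
        total += r
    return total

print(greedy([0.4, 0.8], T=100))
```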
So, let us say k = 2 and look at the first two samples. In the first round I played arm 1, and it so happened that I got a 0. In the second round I played arm 2, and I happened to get a 1 there. I can still get a 1 from the second arm; it is not necessary that it gives only 0s just because it has a smaller mean; it may still throw up a 1. So what happens in the third round? You are going to pick arm 2. Will you be stuck in a bad arm in this case? Right: whatever you are going to observe from arm 2, its empirical mean is going to stay larger than 0, because that first observed 1 is already included, while arm 1's empirical mean is stuck at 0. So you could end up stuck in a bad arm under policy 2.

So, how do we avoid this trap of getting stuck in a bad action? He is suggesting that we do not take just one sample and then go by the empirical means: initially dedicate some rounds to collecting samples, and only then take the empirical means. In this case the probability that you get trapped in this bad scenario is going to be small. Do all of you agree with that? We might still end up getting trapped in a bad arm, but the chance of that scenario arising is smaller than under policy 2.

So, how to go about this? One way of characterizing what was said is: let us choose an epsilon between 0 and 1. For the initial epsilon*T rounds, some fraction of my overall T rounds, let us collect samples from all the arms. Collecting samples from all of them is what we are going to call, simply, exploration. How you collect the samples is up to you: you can do a random selection during the initial epsilon*T rounds, or play each arm for epsilon*T/k rounds. Then, after the epsilon*T rounds, you go and select the arm that is empirically best. When I say select the one which is empirically best, that is what I call exploitation: from the information I currently have, I am just going to try and choose the best one.

So, do you think this should work better than policy 1 and policy 2? It depends. Depends on what? Now one headache has been transferred to another: how to choose this epsilon. What happens at the extremes? If epsilon = 0, I am not doing any exploration; I am only doing exploitation. (It is not quite policy 2, because there I still got one sample from each arm first.) And what is it if epsilon = 1? It is going to be policy 1: I am only exploring, selecting all of them uniformly at random or playing each one an equal number of times. So the problem here is how to choose epsilon between 0 and 1. If epsilon is a small value, you are exploring less and just jumping to exploitation: not good. If epsilon is close to 1, you are exploring too much and exploiting less: also not good. You have to hit a balance between the two; too much exploration or too much exploitation is not good. If exploration is high, what is going to happen?
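A minimal sketch of this explore-then-exploit policy, assuming Bernoulli arms; the choice of round-robin exploration (rather than uniform random) and the function name epsilon_first are mine.

```python
import random

def pull(p):
    return 1 if random.random() < p else 0

def epsilon_first(means, T, eps):
    """Explore round-robin for the first eps*T rounds, then exploit
    the empirically best arm for the remaining rounds."""
    k = len(means)
    n_explore = int(eps * T)
    counts, sums, total = [0] * k, [0] * k, 0
    for t in range(n_explore):                # exploration phase
        a = t % k
        r = pull(means[a])
        counts[a] += 1; sums[a] += r; total += r
    # Empirically best arm; arms never pulled default to mean 0.
    best = max(range(k),
               key=lambda i: sums[i] / counts[i] if counts[i] else 0.0)
    for _ in range(T - n_explore):            # exploitation phase
        total += pull(means[best])
    return total
```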
If exploration is high, at the end of the exploration you might indeed start playing the good arm, but you have already spent a large number of rounds playing bad actions, and that is not good. Similarly, if you start exploiting prematurely, you will not have collected enough information, so you may start exploiting a bad arm, and that is also not good. You have to find the balance between these two. You see that whenever you try to make decisions in an uncertain environment, this natural question of how much to explore and how much to exploit comes into the picture. The whole crux of this course will be about understanding how much to explore and how much to exploit, or whether there is a way to do exploration and exploitation together.

Here is a quick experiment just to demonstrate this. Let us say I have set up 5 arms with means 0.3, 0.35, 0.78, 0.8, 0.5. Which is the best arm here? Arm 4, with mean 0.8, is the best one. If you knew that, you would always like to play it, but you do not know it. Let us say I have been given 10000 rounds. In the previous policy I discussed, suppose I set epsilon = 0, meaning I am only doing exploitation; this gives me a certain total mean reward over the 10000 rounds. Had I always pulled the 0.8 arm, the total I expect is about 0.8 * 10000, around 8000. With pure exploitation I get much less than that, but when I increase epsilon to 0.01, doing a little bit of exploration, my mean reward actually increases. But if I increase epsilon further, doing more exploration, the total mean decreases again, and I get a bad result once more. So as I increase epsilon from 0 to 1, I get a kind of bell-shaped curve. It says there is some epsilon between 0 and 1 that is good, but I have to figure out what it is. A good policy has to balance this exploration and exploitation; throughout our course that will be our focus, and there are some good policies we are going to visit.

Now, a few quick applications. We just discussed some, but what are the applications from other fields? It is not just about recommendation systems; the multi-armed bandit has applications in many areas. One I did not put here, but which is of great importance, is the medical field, and that is where some of the early studies of multi-armed bandits started. Suppose you want to identify the drug that is most effective on a population. When a drug is initially prepared, nobody knows how effective it is; it has to be lab tested. How do you really do that? Either you test on rats or something, or, if you are brave enough, you may go and try directly on humans; but whatever it is, you want to make sure the damage is minimal. Initially we do not know which is the best, so we may want to keep trying different ones; but because the penalty is too high here, lives are involved, what we want to do is quickly identify the drug that is most effective.
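A sketch reproducing this kind of experiment, reusing pull and epsilon_first from the sketch above; the epsilon grid and the averaging over 20 runs are my additions, to smooth out randomness.

```python
means = [0.3, 0.35, 0.78, 0.8, 0.5]   # arm 4 (mean 0.8) is optimal
T = 10000

for eps in [0.0, 0.01, 0.05, 0.1, 0.3, 0.5, 1.0]:
    avg = sum(epsilon_first(means, T, eps) for _ in range(20)) / 20
    print(f"eps={eps:<5} avg total reward ~ {avg:.0f}  (always-best ~ {0.8 * T:.0f})")
```

With a sweep like this you would expect to see the bell shape: very small epsilon risks committing to a suboptimal arm, very large epsilon wastes rounds on bad arms, and something in between does best.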
So, do you feel this model fits here? Think of the patients, who come in each round, as the environment: we do not know the effectiveness of the drugs, and your actions are the drugs, or treatments, you have. You apply one to a patient and you see how effective that treatment was: maybe when you apply the drug the patient is cured immediately and starts jumping around, and you are very happy with that drug; but it may happen that the same drug, applied to someone else, kills him immediately. The effectiveness need not be the same for everyone, but what we are interested in is the drug that is most effective on this population; that is the best we can guarantee, and the question is how to figure it out. From the modeling point of view, at least in the beginning, we will assume that the effectiveness is the same for each patient. It is like assuming, to start with, that all of you have the same capability; it may well be that you all have different capabilities, and the attention each of you needs would then differ, but in the basic drug example it is assumed that the drugs are equally effective on all patients. Both models are possible, and we will actually study both: the effectiveness of a drug on a person need not be the same, it could depend on that person, and in that sense the environment is not the same across rounds.

There are many other possibilities. If you are from a communication background, it could be packet routing; and not just packet routing, but more generally finding the best path from a source to a destination. Let us say you are at home, you are going to the office some day, and you have three or four route options available. The time it takes you to traverse a route depends on multiple things: congestion, weather, whatever else. You have to figure out which is the best route for you, where best means the one that gives you the smallest travel time. Can you think of modeling this as a multi-armed bandit problem? What are the actions here? The routes available to you. And who is the learner? You, the one taking these paths to reach your office.

There is also ad placement. When we visit web pages, we are bombarded with so many advertisements, and there is a whole big business behind this. For each slot, people want to put the advertisement you are most interested in and most likely to click, because if you click on it, somebody is going to make money out of that. So they would like to identify, for a given slot, the most appropriate ad to put there. Can you think of this as a multi-armed bandit problem? Say I have one slot and there are some 10 ads available to put in this slot, and I have to decide which advertisement to put there so that it gets the most clicks. So, what are the actions here? There are many advertisement slots, but let us consider only one slot at the top.
So, just as before, the advertisements are the arms, and you want to decide which one to put there: the one with the highest click probability. You can think of modeling each of these applications differently, or come up with variants: if you are interested not in just one location but in all of them, you can think of the arms as all possible combinations and expand the problem accordingly. What we discussed is the bare minimum; many variants are possible, and as I said this is a very active research area. We will cover the basics, but for the project I will ask you to explore different variants of bandits that have been studied. One example is contextual bandits, where you are not interested in playing one fixed action; the action is curated, in the sense that you want to play the action that is specific to the instance. For example, when you put up an ad and you know that the person visiting the web page is very young, you would like to show a sports shoe or something related to sports rather than an insurance ad. The ad you want to put depends on the context, and you, the user, are the context here: if the website knows this is a young person, just below 20 or 22 years, it will show ads to you accordingly. So there are different versions of this problem. We are going to largely follow three books and some other research material for this course. These books are all available online and you can download them; I will also put them on the course web page.
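To give a flavor of the contextual variant, here is a minimal sketch of an epsilon-greedy contextual bandit over a discrete context, keeping separate empirical means per (context, ad) pair. The contexts, click probabilities, epsilon value, and per-round exploration rule are all hypothetical choices of mine, not something fixed in the lecture.

```python
import random

# Hypothetical click probabilities: each (context, ad) pair has its own mean.
CLICK_PROB = {
    ("young", "sports_shoe"): 0.10, ("young", "insurance"): 0.01,
    ("old",   "sports_shoe"): 0.02, ("old",   "insurance"): 0.08,
}
ADS = ["sports_shoe", "insurance"]

def contextual_eps_greedy(T=10000, eps=0.1):
    """With probability eps explore a random ad; otherwise exploit the
    empirically best ad for the context observed in this round."""
    counts = {key: 0 for key in CLICK_PROB}
    sums = {key: 0.0 for key in CLICK_PROB}
    clicks = 0
    for _ in range(T):
        ctx = random.choice(["young", "old"])   # context arrives each round
        if random.random() < eps:
            ad = random.choice(ADS)             # explore
        else:                                   # exploit within this context
            ad = max(ADS, key=lambda a: sums[ctx, a] / counts[ctx, a]
                     if counts[ctx, a] else 0.0)
        r = 1 if random.random() < CLICK_PROB[ctx, ad] else 0
        counts[ctx, ad] += 1; sums[ctx, ad] += r; clicks += r
    return clicks

print(contextual_eps_greedy())
```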