So, what is this course all about? You might already have heard about the many machine learning courses that are floating around, so what is this online machine learning? Today I will keep the class short; I have about 10 slides, and I will just give a brief overview of what I mean by this course and a flavor of what will be in it. We will get into the main material from the next class.

This course is all about decision making. You all do a lot of surfing over the internet, right? That is the motivation for where the online aspect comes into this course. Let us say you are a seller of cars, not from a particular brand like Tata or Mahindra, but a dealer with a web page where you show some car models and try to get people interested in them. Your basic goal would be this: instead of randomly showing some car to a visitor, you want to show the car that person is most interested in, the one most suited to him, or the one he is most likely to buy. If he clicks on it and buys it, that is a profit for you. So, with many people visiting your web page, what would your goal be? You basically want to show the cars that will be bought in maximum numbers; if that happens, it is a great profit for you. But you are on an online platform, and of the many people coming to it, you may not know anything about them. If you already knew something about a person, you could have made a proper recommendation, but you do not, so how are you going to proceed? You have to somehow figure out, or learn, what that person is interested in. If you can figure that out, you can make appropriate recommendations.

So here you see that some kind of decision making is already happening, and it is happening in an online fashion: a visitor arrives, you make a recommendation, and he clicks or buys, or he does not, so you get feedback on whether he was really interested in what you showed. It is, in a sense, feedback on the action you took, where the actions here are the car models. You keep showing cars, you get to know whether each visitor is interested or not, and based on this feedback you can refine what you show. You keep iterating: you show some item, the person on the platform at that time clicks on it and you get a response; if he does not click, you also get a response, because then you know he is not interested. Then for the next person you do the same thing. So you are interacting: somebody comes, you show him something; if he clicks, you get some information; if he does not click, you still get some information. Using this information, can you refine your recommendations and try to show each visitor the item he is most likely to be interested in or to buy? If you can do that, you are maximizing the number of units you are going to sell.
Similarly, you can think of many, many other examples. You might be more familiar with recommendation systems, especially movie recommendation systems. If you log into Hotstar, Amazon Prime, or Netflix, then the first time you create an account, Netflix asks you something like: tell me three examples of what you would like. From that it starts trying to infer what your potential interests could be. Then, as you watch more and more movies, it starts figuring out what would be of interest to you; maybe you are someone who is more into science fiction, or action, or romantic movies. By the way, these are all my old slides; the only thing I changed is the third movie, Tanaji, which is a recent one, otherwise these are all old movies.

Abstracting all of this a little, online decision making is not just standard decision making: you have to make decisions under uncertainty, because you do not know a priori the environment in which you are taking the decisions. For example, in a movie recommendation system, say on the Netflix platform, Netflix does not already know who you are. For Netflix, you are like a random entity; it does not know a priori what your interests are, so it has to figure them out. And Netflix or Amazon will have thousands and thousands, maybe millions, of users like you, and it may not know a priori the interests of any of them, but it would like to recommend to each one the movie that is most appropriate for that person. If it can do that, it keeps you more engaged and you are more likely to continue the subscription; if you continue the subscription, they earn more money, and that is how they sustain the business. So for them, you are like a random entity, and they want to make appropriate decisions in this random environment, where the decision is showing you the right thing.

In simplified terms, we now want to model and analyze how this works. As a learner, you are one entity in this whole setup, and there is an environment; the whole thing is an interaction between you, the learner, and the environment. In the Netflix example, the Netflix platform is the learner, and the users are the environment, whose preferences you may not know a priori. So I am going to put into a box this random environment that I do not know a priori. I have to take actions in this random environment, and what is my interest? I want to take the action that gives me the maximum benefit in this random environment. The movies I mentioned, or the cars, think of them as different actions; I am going to denote the actions a_1, a_2, and so on. Is this abstraction clear to you, what I am treating as the learner and what as the environment?
So, as the learner, you can decide which action to apply, say when you see a user, or when you are making decisions in a sequential fashion, one by one. You have m actions, denoted a_1, a_2, all the way to a_m, and each of these actions, when you play it, gives an output whose value is stochastic in nature. For example, all of you know Bernoulli random variables, right? A Bernoulli random variable with parameter p shows 1 with probability p and 0 with probability 1 - p, so p is its mean. Now take two Bernoulli random variables, one with parameter p_1, call it process 1, and another with parameter p_2, call it process 2, where p_1 is larger than p_2. In which process are we going to see more 1s? In the first process, because it has the larger mean. Any time you play one of these processes, you observe a realization, 0 or 1, governed by the underlying means p_1 and p_2; the 1s and 0s are stochastic observations associated with the parameters p_1 and p_2.

In the same way, we can think of the actions a_1, a_2, ..., a_m as having associated mean values, which I denote mu_1, mu_2, all the way up to mu_m. These are the true means of the rewards they give under this environment, but you will not get to observe mu_1 right away: when you apply action a_1, the value generated is some random value with mean mu_1, and when you apply a_2, the value generated has mean mu_2. Is that part clear? These are the true mean rewards associated with the actions. We are talking about things in expectation; I am interested in the expected behavior of the system, not in the individual realizations. When I apply action a_1, its mean value is mu_1, but each time I apply it I only see a stochastic value, some random value with mean mu_1. So you have m actions, their associated means are mu_1, mu_2, ..., mu_m, and you do not know these values a priori; they depend on the random environment, which is unknown to you. That is the interaction: you apply an action and observe the random reward associated with that action.
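To make this concrete, here is a minimal sketch in Python, with made-up means, of an environment with Bernoulli arms: pulling arm i returns 1 with probability mu_i and 0 otherwise, so the learner only ever sees noisy realizations of the hidden means.

```python
import random

class BernoulliEnvironment:
    """m Bernoulli actions; pulling action i returns a 0/1 reward with mean mu_i."""

    def __init__(self, means):
        self.means = means  # true means mu_1, ..., mu_m (hidden from the learner)

    def pull(self, i):
        # Noisy observation: 1 with probability mu_i, 0 otherwise.
        return 1 if random.random() < self.means[i] else 0

# Two "processes" with p_1 > p_2: process 1 shows more 1s on average.
env = BernoulliEnvironment([0.7, 0.3])
print(sum(env.pull(0) for _ in range(1000)) / 1000)  # close to 0.7
print(sum(env.pull(1) for _ in range(1000)) / 1000)  # close to 0.3
```

Running either arm many times recovers its mean on average, but no single pull reveals it; that gap between the mean and its noisy realizations is exactly what the learner has to deal with.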
Now, what is the best action? Let me put it this way: say you are doing this for 100 rounds. You have these m actions with associated values mu_1 all the way to mu_m; in each round you take one of the actions and observe the associated stochastic value, and I say you are going to do this 100 times and I want you to generate the maximum reward over the entire 100 rounds. What would you do in that case?

Let us come back to the movie recommendation system. Say you have 10 movies to recommend and there are 100 people. For the time being, let us idealize the scenario: all 100 people have the same preference over these 10 movies; that is, all of them maintain the same preference list. So, say movie 1 is liked the most by all of them, movie 2 is the next preference for all of them, movie 3 the next, and so on, but you do not know their preference list. In that case, if you want to make sure that the maximum number of the movies you show are actually watched by these 100 people, which movie do you want to recommend to all of them? Let us abstract a bit again: for movies 1, 2, ..., m, say the probabilities of being watched are mu_1, mu_2, all the way up to mu_m. I said you do not know these values, but let us assume for the moment that we do know mu_1, mu_2, ..., mu_m, and that I have figured out that mu_1 is the highest, that is, movie 1 has the highest probability of being watched. Is the setup clear now? You have been told that 100 people are going to come and you have to recommend a movie to each one of them; which movie are you going to recommend? Movie 1, right, because it has the highest likelihood of being watched. If the person watches the movie, say that is a reward of 1 for you, and if he does not watch it, that is a reward of 0; then naturally your preference would be to recommend movie 1.

Now, this is what I call the ideal case, where you know the environment; but possibly you do not know the environment a priori. So what are you going to do then? What do you think you should be doing? Whatever way you proceed, the first thing you have to do is figure out these values; only once you can figure them out can you see which one is the best. How are you going to figure them out? Maybe you do sampling, or whatever way you want to sample, and try to estimate these numbers. This is where the online part of machine learning comes in: you have this environment, you have this set of actions, you are going to interact with this environment, and from that interaction you want to figure out which is the best action.
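In code, the ideal, oracle case is trivial; a minimal sketch with hypothetical watch probabilities:

```python
# Oracle case: the means are known, so simply recommend the arm with the
# largest mean to every person. Hypothetical watch probabilities below.
means = [0.9, 0.6, 0.4]                                # mu_1, mu_2, mu_3
best = max(range(len(means)), key=lambda i: means[i])  # index 0, i.e. movie 1
print("expected watches over 100 people:", 100 * means[best])  # 90.0
```

The whole difficulty of the online problem is that the `means` list is not available to the learner.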
Now, how does this contrast with standard machine learning? How many of you have done a machine learning course? Many of you. So can any of you contrast this way of doing machine learning with what you have learnt? Decision points can change with data, right. Anything else? There, you have all the possible data with you; here, you have to somehow sample the data yourself. Let us revisit the classical supervised learning method you might have learnt. You have a bunch of hypotheses and a bunch of data, and you want to find the hypothesis that does the best classification on your test point. How are you going to do that? The test point is going to come from the same distribution as your batch data. So somebody has already collected data points from a distribution and given them to you, along with a bunch of hypotheses, and asked you: from these, give me a hypothesis that does a good job of classification when I generate a new sample from the same distribution. How do you do that?

You basically take the bunch of hypotheses and train on your data points. What do we mean by training? Training here just means finding the hypothesis that does the best job of classification on the data points already collected. So here you can think of the hypotheses as the actions; does this analogy make sense to you? I have a set of hypotheses, I have the bunch of data given to me, and the goal I have set is: from these hypotheses, give me the one that does a good classification job on a new data point. Then what is the environment here? The environment is whatever is generating the data points. And what are the data points? In classical terminology, each data point is a feature vector and its associated label, and you want to find a hypothesis that, when I give the feature vector as input, gives me the label associated with that feature vector.

In this example, say you have a simple case where all these blue and red points are your data points, and you have some hypotheses, say h_1, h_2, and h_3. So h_1, h_2, h_3 are, let us say, your actions, and I want you to find the hypothesis that classifies the points well. Which hypothesis does a good job here, well separating the blue points from the red points? Which one has the smallest error, putting all the blue points on one side and all the red points on the other without mixing them at any point? The true classifier is not among them; each of them makes some errors, but if your goal is to take the one that makes the smallest error, which one is good here? Let us say h_2 is the one making the minimum number of classification errors. So on the bunch of data points that have been given to you, you notice that h_2 is doing a good job. Now, if I give you a new data point drawn from the same distribution as these points, which of the three do you want to apply on it? You would like to apply h_2, because it did a good job on this training data. So in this classical batch method, you already have a bunch of generated data, you train on it, and you find the best action, in this case the best hypothesis, which is going to do a good job on the future points. What you usually do is train on this data and then keep on applying whatever hypothesis did the good job; whatever new points I give you henceforth, you keep on using the same hypothesis.
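Before moving on, here is a minimal sketch of that batch procedure, assuming hypothetical one-dimensional data and three threshold rules standing in for h_1, h_2, h_3:

```python
# Batch setting sketch: score every hypothesis on the full training set once,
# then commit to the one with the fewest errors for all future points.
data = [(0.2, 0), (0.4, 0), (0.58, 0), (0.6, 1), (0.7, 1), (0.9, 1)]  # (feature, label)

hypotheses = {
    "h1": lambda x: int(x > 0.3),
    "h2": lambda x: int(x > 0.55),
    "h3": lambda x: int(x > 0.8),
}

# "Training" = count each hypothesis's classification errors on the batch.
errors = {name: sum(h(x) != y for x, y in data) for name, h in hypotheses.items()}
best = min(errors, key=errors.get)
print(errors, "-> commit to", best)  # h2 makes the fewest errors here
```

Note that every hypothesis gets evaluated on every data point, because the whole batch is available up front; that is precisely what the online setup takes away.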
Now, can you contrast this with what I call the online setup? In the online setup, my actions are still these hypotheses h_1, h_2, h_3, but you have not been given the batch of data a priori. What will I give you? One point at a time, in an incremental fashion. In round 1, I give you one point; you decide which hypothesis you want to apply, you apply it, and you get to know whether you did a good or bad job of classification on that point. Then I give you a second data point. Here, when I say I am giving you points, think of me as the environment that is generating the data points, and you are the learner who has access to these hypotheses. As the learner, your job is to find out which among these hypotheses is the good one. So I give you the second data point, you select one of the hypotheses again, apply it on that point, and get to know whether you did a good job or not. Again I generate a third one, and you again apply a hypothesis. This kind of interaction goes on, but here in the online setup I would like to figure out which is the best hypothesis among these as quickly as possible. If I can do that, it means I start applying the right hypothesis as quickly as possible.

[Student:] In this example we have one best hypothesis, but in general could the hypotheses each vote on a particular point, and the learner combine the votes? So you are saying that instead of applying one hypothesis, you want to apply multiple hypotheses and ensemble their values. What I am saying here is that I am interested in applying the one best action; in that logic, I am interested in finding the one best hypothesis I should be applying. If you are asking which subset of the hypotheses you should select, so that aggregating the values they give, either by weighting them or just adding them, gives you a good prediction, that is fine; we can set up the same kind of goal there too. We can ask which of these hypotheses I should use and how I should weight them so that their combined classification is the best. This is what we are going to see throughout the course: what different criteria we can set, and how we are going to evaluate against those criteria. I am just giving you one simple example here.

So, if you think of these hypotheses as different actions, and your aim is to find the best one in terms of classification error, how are we going to do that? The environment generates a point, you apply a hypothesis, and you get to know whether you did a good job or not; from that, you quickly try to figure out which is the best one. The difference between the online setup we are talking about and the batch setup is this: in batch, we already have a bunch of data, I train on that, and after that I fix one hypothesis and keep applying it again and again; in online, the points come to me in an incremental fashion, and I keep on updating what the best hypothesis for me is.
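As a rough sketch of that contrast, the following loop (hypothetical stream and hypotheses; the greedy rule is just one naive choice) receives points one at a time, applies a single hypothesis per point, and only observes whether that choice was correct:

```python
import random

# Online sketch: points arrive one at a time; each round we choose one
# hypothesis, apply it, and only observe whether *that* choice was correct.
hypotheses = [lambda x: int(x > 0.3), lambda x: int(x > 0.55), lambda x: int(x > 0.8)]
correct = [0, 0, 0]  # running correct counts per hypothesis
plays = [0, 0, 0]    # how often each hypothesis was chosen

def stream():
    x = random.random()
    return x, int(x > 0.5)  # hypothetical hidden "true" labeling rule

for t in range(1000):
    if t < 3:
        i = t  # play each hypothesis once to initialize the counts
    else:
        # Greedy choice: the hypothesis with the best empirical accuracy so far.
        i = max(range(3), key=lambda j: correct[j] / plays[j])
    x, y = stream()
    plays[i] += 1
    correct[i] += int(hypotheses[i](x) == y)  # feedback only for the chosen action
```

Unlike the batch version, the estimates here keep updating as points arrive, and the choice of hypothesis can change from round to round.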
So, in this course we are going to talk a lot about one particular online setting called multi-armed bandits. There are different flavors of it, which we are going to visit throughout the course. What happens here? As I said, unlike in the batch setting, we will not have a set of data points already available, and only when you apply a particular action do you get to see its reward. For example, in the hypothesis class example we just discussed, when you pick a particular hypothesis and apply it on the sample, you get to know whether it did a good job or not; that is what I mean when I say that only when an action is applied do you get to know whether it did a good job. But we would like to identify the best action quickly. You could be aimless: when a point comes, you apply whatever hypothesis strikes you at that time. But what is our goal? Our goal is to identify the best action; that is what we are trying to optimize here.

Broadly, this is the kind of setup we are going to revisit in this course. As the learner, you are on the left side, and you take an action in each round; I am going to denote by I_t the action you take in round t. When you apply this action on the environment, the environment gives you some noisy reward associated with that action. Now suppose your actions are not noisy, everything is deterministic. Then what happens? You apply an action and you get to know its reward; there is nothing noisy about it, and that is the reward you will get each time you apply that action. In that case, how are you going to determine the optimal action? Take every action: give each action one turn, you get to know their values, and then you just choose the best one. But things are noisy here, and that is also how we should be modeling it: if you are on Netflix and you showed some movie and somebody clicked on it, that does not mean everybody is going to click on it. Everybody's preference carries some bias, and that bias is what you want to figure out.

This is the classical image you see when people talk about the bandit problem: a bandit pulling the arms of slot machines. I do not know exactly how this name got stuck in the bandit literature, but it comes from something like this: if you visit a casino, there are different slot machines; you put money on the slot machine you choose, you win or lose on it, and from that you get some information about that machine. Now, given a few slot machines, your goal is to identify the slot machine that gives you the best winning probability.

The goal here is this: you are interacting with the environment over T rounds, and the environment is unknown to you, but you still want to make sure that the total mean reward you accumulate over your T rounds is the maximum. Does that make sense? As I said, when you are on the Netflix platform, the platform does not know you, but it wants to show movies such that, over 100 people, the maximum number of them watch. That is why I am interested in maximizing this, and we will see that there are different notions of exactly what we are interested in maximizing. This is the simplest one: over the T rounds you are doing something, and you want to maximize the total reward. You could have maximized it if you already knew the best arm; you would have just played it again and again. But I do not know that, right? So how are you going to set your goal? All I want is this: if you are going to run over T rounds, I want whatever you collect in those rounds to be maximized, and this total is going to be the largest if you have been playing the best arm in every round. But this number of rounds may not be enough for you to figure that out. So you just set the goal: I want to maximize my reward over this period T.
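Putting the protocol together, here is a minimal sketch of the T-round interaction, with made-up means and a placeholder uniformly random policy, just to show the interface between learner and environment:

```python
import random

T = 1000
means = [0.2, 0.5, 0.8]  # hidden from the learner; arm 2 is best

def pull(i):
    return 1 if random.random() < means[i] else 0  # noisy 0/1 reward

total_reward = 0
for t in range(T):
    i_t = random.randrange(len(means))  # placeholder policy: pick I_t uniformly
    total_reward += pull(i_t)           # observe the reward of the chosen arm only

# Benchmark: always playing the best arm gives expected reward T * max(means).
print(total_reward, "vs best possible in expectation:", T * max(means))
```

The random policy collects roughly T times the average of the means, well below the benchmark; designing policies that close this gap is what the course is about.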
Does anybody have concerns or doubts, like why we are saying maximize the total reward here? Our goal, we said, was to identify the best action, but now I am rephrasing it as maximizing the mean reward. Are these two goals aligned with each other, or not aligned at all? [Student:] It is a version of the same thing, but the true goal we had was to identify the best arm; if possible we could run it long enough and eventually find the best one. Yes, but that would take us a long time. So we are, in some sense, putting a limitation: we want to identify the best arm as soon as possible, so that later on we can play that arm itself and enjoy the reward.

So, I_t is the action you played in round t, and mu_{I_t} is its mean value. When is this total mean reward maximized? When I_t is the best action, right. So at least this goal is in the same direction I want to go: if I pull the best action in each round, I maximize this sum, but I do not know the best action. My goal will still be set this way, and we will see how best, or how close, we can come to it.

In the picture, the bandit, that guy, is the learner, and he has to deal with some k machines; those are his actions. The amount you win on each of these machines is a different value. Say there are 5 machines, at least 5 machines are captured in the picture; on each one of them you have a different probability of winning. What you do in each round is play one of these machines and see whether you win or lose. Now, mu_{I_t} is the expected value of winning on whatever action you took, and I want to maximize the sum of these over the T rounds. Is this part clear? The index I_t is going to change in each round: if you play the same action every time, you may figure out that on that particular machine you are losing again and again, so you may ditch it and play something else. So I_t changes, but whatever you play, I am saying you get a mean reward of mu_{I_t}, and summed over t this is your total reward, which I want to maximize. If you are the oracle, say the casino owner who actually designed how wins and losses happen on these machines, you already know which machine is the best to play; but as a learner you do not. So you play different actions and somehow try to make sure that the total reward you get over T rounds is maximized. Let us discuss how to possibly do this.

Why is the word multi coming up here? You play one action at a time, but your options are multiple. And again, the terminology arm just means an action here. From the picture you can make the association: the machines have these levers, so think of a lever as an arm; in each round you have to pull one of these levers, that is, play one of these arms, one of these actions. So you have multiple actions, or arms, to choose from, and that is why we are in the multi-armed bandit setting here.
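In symbols, the objective we have been discussing can be sketched as follows (same notation as above; the inequality is immediate because each term in the sum is at most the largest mean):

```latex
% Total mean reward collected by playing actions I_1, \dots, I_T over T rounds:
S_T = \sum_{t=1}^{T} \mu_{I_t}.
% Since \mu_{I_t} \le \mu^{*} := \max_{1 \le i \le m} \mu_i in every round t,
S_T \le T \mu^{*},
% with equality exactly when the best action is played in every round.
```

So maximizing the total mean reward and identifying the best arm point in the same direction: the sum reaches its ceiling T mu* only by playing the best arm throughout.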
We said these casino devices are designed so that the winning probabilities are built in, so let us say you are the casino owner. If you have some five machines, are you going to design them in such a way that some machines always win and some machines always lose? You are not going to do that. You will set them with some stochastic behavior in which the winning probability is small; you want to make money, so you will probably design it in such a way that any player using a machine wins only with a small probability. But for you as a player here, among these machines, whatever the winning probabilities are, you still want to go with the one that maximizes yours. The winning probability on all of them need not be the same; it may be different, and based on the winning probabilities you want to identify the one on which your winning probability is the highest. That is the stochasticity I am talking about: when you play an arm, or action, you win or lose with the associated probability, and the win, 1, or loss, 0, is a noisy observation of that parameter.

We said that if you were an oracle who knows everything about the winning probabilities, what you would basically do is go and play the one with the highest mean. But you are not an oracle, so you do not know this; you have to learn it. Now comes the question: if you are not an oracle and you want to figure out which is the best action, what are the options available to you? How will you go about this? What could a policy be? I am going to call it a policy: the action you choose in each round, based on the observations you have made so far, is your policy.
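As a preview of what such a policy might look like, here is a sketch of one naive option, often called explore-then-commit (the exploration length n is an arbitrary choice here): sample every arm a fixed number of times, estimate the means from those samples, then commit to the empirical best.

```python
import random

means = [0.3, 0.5, 0.45]  # hypothetical hidden winning probabilities

def pull(i):
    return 1 if random.random() < means[i] else 0

# Naive "explore-then-commit" policy sketch: pull each arm n times,
# estimate its mean, then play the empirical best for the remaining rounds.
n, T = 50, 1000
estimates = [sum(pull(i) for _ in range(n)) / n for i in range(len(means))]
best = max(range(len(means)), key=lambda i: estimates[i])

reward = n * sum(estimates)            # reward collected during exploration
for t in range(T - n * len(means)):
    reward += pull(best)               # exploitation phase
print(estimates, best, reward)
```

This already exposes the central tension we will study: explore too little and you may commit to the wrong arm; explore too much and you waste rounds on bad arms.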