 So, today I am going to start with some basics of supervised learning. This is actually not online learning, but supervised learning that just as a precursor to our online learning ok. So, we will try to map what we are going to do in supervised learning in the offline setting, how the kind of metrics, the kind of algorithms, the kind of strategy we are going to see whether we will have a natural mapping when we move to the online learning settings. So, fine let us do the revisit our basic things that we would have done in the supervised learning methods. As I said in the beginning this course is about learning setups, learning different learning setups, looking for the best algorithms. When I say best, if I am saying something is best I need to show it like this is the best. So, how one is going to show something is best that you have to say that you cannot do better than this ok. So, our focus of this course will be like coming on with algorithm and then showing their best through giving a detailed analysis because of that as we grow on this as we go into this course we will get into lot of analysis. So, you have to be like with focus sometimes the proofs and all can get dry. So, if you lose track then you will be totally lost for the rest of the class. So, interrupt me at any point if you are not following anything. So, it is not necessary that you have to cover all the proofs, but till whatever steps we are going to cover in the group it is important that we are all in sync with it ok. So, what is our supervised learning method? What we have? So, basically supervised thing is you have let us say let us take the simplest case of classification further simplest case that is a binary classification you have given an instance you have to say whether it has a label 0 or 1. So, let us try to set up this simple thing in a bit more formal way. So, for this what are the things we know? First thing we are going to start with we are going to assign a label to something right. What is that we are going to assign a label? We are going to assign a label to let us say some instance. To be concrete let us take some example let us say you are a stock enthusiasts and you want to predict your stock is going to go up or down ok your share value is going to up or down. Then if I give you a stock that stock is what you are basically predicting the label right you are basically. So, that we are going to call the point from the set from which this stocks are coming we are going to let us say we are going to call it as some domain set and we are going to denote it by X and this is set of objects that we wish to label. So, in our stock example this script X could be a collection of stocks to which I want to say whether there we want to label whether their share price are going to go up or down. So, another example could be let us say you have some information about the weather, whether let us say you know let us say weather information has only two things like how humidity is and how what is the temperature. If I tell you this information you have to predict whether it is going to rain or not your job is to say whether it is going to rain or not. In that case this X is what in this in that case the X could be collection of all these pairs the pairs is like humidity value and the temperature value. So, in the in this case this X could be subset of R2. So, you give me any point in this that means somebody told today this is the humidity level and this is the temperature and you would like to tell whether it is going to rain or not. And second thing is label set, but the label set we have already fixing we have kind of already restricting ourselves to binary labels, but labels could be any number of labels. So, this could be like let us say set of outcomes and I am going to denote it by a script Y and in the binary case this guy can be simply 0 or it could be also I can equivalently take it as plus 1 and minus 1. And now the one who is going to give labels based on the input you give from this domain set let us call him a learner. So, learner is basically trying to do what he is coming up with a rule that will tell if this is the input this should be the output. And he may have access to multiple rules right it is there is like if you give me whether that let us say humidity and the temperature value I may combine them in some various ways to come up with the output that is saying that is going to rain or not. So, I have multiple possible rules right and the learner may have have access to finite set of rules and he may want to pick one from that to give you the outcome. So, we are going to that set of rules that that is accessible to the learner we are going to call it as hypothesis class and we are going to denote it as H and this H is nothing, but a collection of rules which are maps from Y to. So, you all of you understand this notation these are all functions which maps your input to some label in set Y. There could be multiple such hypothesis and all so that is why I am doing to denote it as a set this set is collection of these rules. Now, the learner here he has to find which is the rule is the best or like like you have to basically let us say for time being he has to pick one rule from this. So, how he is going to decide he is going to possibly decide based on the past observations or the past history. For example, he might have from historical data from weather department you would have seen that with this temperature with this humidity this happened rain happened or not. So, he may have all the data observed from past and he is going to use that to see which is the which rule or which hypothesis he likes to pick ok. So, he the learner has access to training data we are going to call it as training data and that is we are going to call it that is s and this s will have where we are going to say xi belongs to that we are all i to m and so there are m points we are going to call this training data there are m points in the training data each point here is a pair the first part denotes the object here that is coming from my set x that is my domain set and the second component is the label that is coming from my label set ok. So, we have these components in this the learner is have access to this training data he has access to this hypothesis class and now he has to figure out which is the best hypothesis for him that does a good job of prediction on a given data point ok. So, when I say data point it will be it can mean two things data point could be it will have this object as well as associated label sometimes data point I mean to say only the first part it may not have the associated label and in that case you have to come up with the label ok. So, we will either say simply data point or a sample when I say sample if we do not have label for it our job is to predict it when we once we have already have it that is part of our training data ok. Now, when I said the learner's objective is to pick and hypothesis from this that in some way does a good job right on the prediction task. So, let us try to quantify that like what is that the learner's objective would be. So, we are going to say here learning for the learner is we can hypothesis from this class which is good in some sense that we will make it clear in what sense ok. So, now, learning boils down to learning this hypothesis I want to identify which is a good hypothesis from this class which is given to me ok. Now, let us say now, I am learning on a data set right, but then the question is how is this data set looks like or how it gets generated right. Is this data is arbitrary like if I ask you somebody to give me data like can you just generate data and give it to me anything we will do. We are going to assume that there is an underlying process that is going to generate this data ok and actually that process is known to me and my job is to kind of figure out that underlying process that is generating data ok. So, for just for a beginning we will are going to assume a simple data generation model. We are going to assume that data is generated according to some distribution D. We said that the data points are coming from this domain set and how they are picked, they are picked according to this distribution D. So, is this clear to you? Let us say I have said I have a training data right. This training data when I said how are these points generated in this each of these points I had told you they belong to this, but when you generate this point it is going to be sampled from this set according to this distribution D ok. So, we have in this case we are going to say that Xi is according to this distribution D. So, this is about the first part of this data point how these samples are generated ok. Also this first part of this data point they are also called features ok and so, henceforth we start calling them as features and we are going to call this as domain set as also feature set ok. So, for example, in the rain prediction problem we said the humidity level and the temperature level right this could just act as a feature for us ok. So, given a feature I want to identify whether it is going to rain or not that is I want to assign a label and these features are generated according to this distribution D. We are going to say that this distribution D is fixed throughout it is not going to change, but I do not know it ok. What I get to see is the samples generated from this distribution D ok. This is one part of this data set right I told you how to generate the features and we are going to say that there is a label function f which will make it y i equals to f of x. Again I am assuming that this once you tell me feature the associated label is governed by this function f and we will assume again for simplicity for just in the beginning that these f's are again fixed, but I do not know them ok. So, how these features are generated and how the labels are generated both I do not know, but they are fixed. Now and we are going to usually say that this is the environment ok. The D together with f this constitute for us environment both of this I do not know, but the data generation is governed by this environment ok. So, now we are going to I am going to do a measure I am going to say that. So, we said our goal is to pick a good hypothesis from this hypothesis class right then we need to quantify what is that good we are talking about. So, that is the measure of success. So, the measure of success we are going to use is called if you let us say give me hypothesis h from this hypothesis class I am going to say its success is probability that h of x is not equals to. So, what does this tell? The probability that your hypothesis does not give the same label as the underlying label generating function f right and notice that I am taking this probability with respect to this distribution D ok. So, now this x is generated according to this distribution, but this is kind of any x it is it could be any x here right. So, whereas, when I did this training that these are the some specific data points which has been sampled from distribution D and that is made available to me. When you are trying to measure the success of the hypothesis you came I am going to look at all x's entire x, but I will be looking at by sampling them according to distribution and look over this ok. So, what we are basically do is whatever the training set that is given to me I am going to use this let us say and I will come up with some hypothesis and I want to measure the success of that hypothesis I am going to see how good it performed on a random x that I picked from some distribution D. So, when I pick a x randomly from this this that could act as my test point and these points are all my training points. Let us make this bit more formal. So, if I have decide defined my measure of success like this for a hypothesis H what kind of H you would like to use. So, let us say you are a learner and I am like a. So, you have access to the hypothesis class and let us say I am a measurer like I am going to evaluate your performance and my evaluation performance criteria is this. What kind of H you would like to give me? The one minimizes this quantity right and we are going to call this guy as here. So, this is like prediction error can also call test error. As a learner you would like to find a hypothesis which has the smallest test error right. So, my goal could be minimize the hypothesis class H ok. Right now let us say we have not made any assumption that whether this hypothesis class is finite or infinite or countable infinite or uncountable anything, but let us let us be bit vague here and still write minimum of this quantity here. Ok fine. So, is this entire setup is clear now? What we want to what we usually try to do in the supervised learning case like under this simplified case like binary classification and everything ok. So, maybe let us why not to just write this x is a random variable. You understand probability with respect to random variable right this x is a random variable here and that x has distribution d that is what I mean that is ok. So, even though I have written this x 1, x 2 as small quantities, but these are like x when I draw right. So, maybe I should just write it here this x the feature vector is a random quantity that I am going to sample according to distribution d I am not sampling like somebody has sampled and given it to you and from that you have now got the realizations and then you are trying to do. So, think about think this setup like this. So, let us say nature has its own rule mother nature it kind of it relates whether the rain happens or not to humidity and temperature in some fashion. It follows its own laws we do not know about those laws, but we have observed this like maybe like for the past 50 years or something we have been observing what is happening when the humidity level was this temperature. So, let us say somebody has recorded this information. So, whatever the nature's law it has been using we have seen the samples of this over the last 50 years and that law is this that is the environment. It is generating this features and associated labels according to this distribution which we do not know, but what we are interested is suppose if I just get to observe humidity and and the temperature level will I be able to predict my rain the way the nature would have done it the way it would have happened in reality that is the goal here that is what I am trying. So, this is like it always happens that you will see the feature vector and at the end then you can think of label as the result. So, in this case whether rain happened or not is the label that you are going to see the result. So, let us say you saw both temperature and the humidity growing in some fashion and at the end of this as a result of these two the rain is going to happen or not. So, that is like an outcome like rain happened or not and that depends on these two values. You have observed let us say humidity and temperature and now you want to kind of predict whether it is going to happen or not and also like in the stock market case like you want to see tomorrow my stock price is going to share price is going to go up or not. From your so, market has its own dynamics a market laws are governed may be very complicated you do not know it. So, based on that you have passed historical data from the stock exchange you have observed that based on this like if this has happened like let us say whatever the feature vectors if this has happened the next day the share price went up or not. So, you have that passed observation and now let us say today something happened like you have observed the current market scenario and now you want to predict whether tomorrow your share price are going to go up or not. So, that is what we are trying to do. So, this is based on the samples generated by the underlying environment which I do not know and now I want to see that if any features are generated on this how good my predictions are h of x is what is basically prediction or the label assigning right. How good I am assigning a label that is as good as the nature would have assigned it if I can do this. So, suppose I find a h such that this guy this probability is very very small if this guy this probability is very very small that whatever the feature vector I am able to make the correct prediction of its label or its outcome right. That means, in a in a in a in a sense I have figured out what nature is doing that means I have learnt these things from my observations ok. Now, the next question is fine you have set up the problem you have set up your learning environment you have set up your measure of success how you go about achieving how you are going going to get a hypothesis from your observed data ok. So, all you have this somebody generated a data and give it to you they have been also you you let us say Lerner has said ok I am going to use this much hypothesis now this is your objective how you are going to go about it. So, basically I want to find a hypothesis that kind of minimizes this quantity right ok. Suppose, suppose for time being pretend that you happen to be the god or you happen to be the nature or you happen to the oracle or god revealed you what is the distribution he is going to use to generate this feature vectors and he also told you what is the labels function that he is using to assign labels to the feature. So, I am saying god told you these two things then what would have done in that case for any h can you compute this if you know the distribution if you know your function for any h you can compute this right then it simply boils down to minimizing this. But god why should god tell me right I do not know this what I have been like, but what I have been doing is I am observing what like what I call or god is doing and now I am trying to imitate him right. So, how best you can imitate. So, you have access to only this ok. So, from this you have to come up with something. So, how you how we will do this.