So, this is the standard trick we always use in probability: if we do not know the underlying distribution, we take samples from that distribution and use them to form an empirical estimate of the quantity we are trying to optimize. Here, if I knew this true risk, I could directly run the optimization, but I do not know it, so I want an empirical estimate built from my data samples. So, what would be a good empirical estimate of this quantity from the data samples I have? Given the training set S = {(x_1, y_1), ..., (x_m, y_m)}, I am going to write the empirical estimate for a hypothesis h as L_S(h) = |{ i : h(x_i) ≠ y_i }| / m. Let us understand what this is doing. S is given to me, and for my hypothesis h I simply check on how many points h(x_i) disagrees with the corresponding label y_i. The set inside the braces collects all the data points on which my hypothesis h does not agree with the label, so taking the cardinality, the numerator counts the number of points on which h makes an incorrect prediction, and I divide that by m. So, what is this quantity telling us? The fraction of the points on which my hypothesis makes a wrong prediction. Henceforth, instead of writing a generic sample (x_i, y_i), I am also going to write a random draw as capital (X, Y).
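The empirical risk above can be sketched in a few lines of Python; the toy data set and the two hypotheses here are made up purely for illustration:

```python
def empirical_risk(h, S):
    """L_S(h): fraction of points in S = [(x_1, y_1), ...] on which
    the hypothesis h disagrees with the label."""
    mistakes = sum(1 for x, y in S if h(x) != y)
    return mistakes / len(S)

# Hypothetical toy data: the label is the parity of x.
S = [(1, 1), (2, 0), (3, 1), (4, 0)]

print(empirical_risk(lambda x: x % 2, S))  # perfect hypothesis -> 0.0
print(empirical_risk(lambda x: 1, S))      # always predicts 1 -> 0.5
```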
That means X is a random variable and Y is also a random variable, but Y depends on X through the function f: X is drawn from my distribution D, and once X is drawn, Y is simply f(X). The pairs (x_1, y_1), ..., (x_m, y_m) are then realizations of this random pair, all drawn from the same distribution, the same environment. Now, if these m points are drawn independently (and they are anyway identically distributed), what is this quantity equal to as m goes to infinity? It equals the probability of disagreement, L_{D,f}(h) = P_{X~D}[h(X) ≠ f(X)]. What did we apply to conclude this? If the pairs (X_i, Y_i) are i.i.d., independent and identically distributed, the law of large numbers says that L_S(h) converges to this probability. But m is not infinity here; m is some finite number, and that is why we call L_S(h) the empirical risk. We are going to use this quantity as a proxy for the true risk and minimize it over h, and that is why we call this method empirical risk minimization, or ERM. The ERM technique solves: minimize L_S(h) over h belonging to H. My goal was to minimize the true risk, but since I do not know it, I found a proxy for it based on the quantities I do know, and I minimize that quantity instead.
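A small simulation, under an assumed setup (x uniform on [0,1], true labeler f(x) = 1 when x >= 0.5, and a fixed hypothesis h(x) = 1 when x >= 0.6), illustrates the law-of-large-numbers argument: the empirical risk approaches the true disagreement probability, which in this construction is exactly 0.1.

```python
import random

random.seed(0)

f = lambda x: int(x >= 0.5)   # assumed true labelling function
h = lambda x: int(x >= 0.6)   # some fixed hypothesis

# h and f disagree exactly when 0.5 <= x < 0.6, so for
# x ~ Uniform[0,1] the true risk is L_{D,f}(h) = 0.1.
def empirical_risk(m):
    return sum(h(x) != f(x) for x in
               (random.random() for _ in range(m))) / m

for m in (10, 1000, 100_000):
    print(m, empirical_risk(m))   # drifts towards 0.1 as m grows
```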
Next, the question is: fine, if you do this, how far apart are these two quantities, the minimum of the empirical risk and the minimum of the true risk? Are they very close to each other, or very far? Your goal was to minimize the true risk, but you ended up minimizing the empirical risk, so you would like to ask how far the value you got is from the value you wanted. Let me call the hypothesis I get the arg min of the empirical risk, h_S = argmin over h in H of L_S(h). Notice that it depends on the training set you got and on the hypothesis class over which you are searching. If I change this S, can your empirical risk minimizer change? It can potentially change, because all the data points in S have been explicitly used; if S changes, the minimizer can change too. That is why I write it as h_S: the subscript S makes it explicit that this hypothesis depends on your training data. Now, the training data given to you is generated from this particular distribution D, and every time you generate the data you may end up with a different set of data points. Because of this, the set S itself is a random quantity, and therefore h_S can also be random: it depends on the data set you used to learn this hypothesis. So yes, h_S is a random quantity that depends on the data set, but as I said, the question is what kind of guarantee we can give about the hypothesis we get by doing this empirical risk minimization, with respect to the best one.
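As a sketch (the finite class of threshold rules here is an assumption made for illustration, not something fixed by the lecture), ERM over a finite class is just a minimum over training errors, and rerunning it on a fresh draw of S can return a different h_S:

```python
import random

random.seed(0)

# Hypothetical finite class: threshold rules h_t(x) = 1 if x >= t.
thresholds = [i / 10 for i in range(11)]

def empirical_risk(t, S):
    return sum(int(x >= t) != y for x, y in S) / len(S)

def erm(S):
    """h_S = argmin over the finite class of the training error L_S."""
    return min(thresholds, key=lambda t: empirical_risk(t, S))

def draw_sample(m, true_t=0.5):
    """m i.i.d. points: x ~ Uniform[0,1], label y = f(x) = 1[x >= true_t]."""
    return [(x, int(x >= true_t)) for x in
            (random.random() for _ in range(m))]

S1, S2 = draw_sample(10), draw_sample(10)
print(erm(S1), erm(S2))   # two draws of S can give two different h_S
```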
So, when we do all this, the hypothesis class matters; if you change the hypothesis class, things can also change. To understand the difference between these two minima, let us focus on the case where the hypothesis class has only finitely many elements. Before that, how many of you know what overfitting is? What happens on the training data when you overfit? If the class is rich enough, you can always find a hypothesis that does exactly the correct job on your training points, and then the empirical risk is 0. It may do very well on the training set, but when you look at the actual test error, where the points are not just these training points but can come from other parts of the space, it may not do well. Things like how big your hypothesis class is affect these overfitting issues; that is not our main interest here, so we will not go deeper into it. Instead, we are going to assume that the hypothesis class is of finite size, that is, the cardinality of this set is finite. If the class has infinitely many elements in it, you can always come up with a hypothesis that overfits and makes the training loss 0, and in that case your empirical risk minimization will always end up handing you a hypothesis that overfits the data. When you restrict yourself to a finite hypothesis class, you may avoid that kind of overfitting. So, for the time being, assume the hypothesis class has finitely many elements. Now, whatever hypothesis you got from your training data, I want to measure its performance.
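The overfitting point can be made concrete with a cartoon "memorizing" hypothesis (a deliberately silly construction for illustration, not a method anyone would use): it drives the training error to 0 while saying nothing useful about unseen points.

```python
def memorizer(S, default=0):
    """Memorize the training set: perfect on S, a constant elsewhere."""
    table = dict(S)
    return lambda x: table.get(x, default)

S = [(1, 1), (2, 0), (3, 1)]
h = memorizer(S)

train_error = sum(h(x) != y for x, y in S) / len(S)
print(train_error)   # 0.0: the training error alone cannot expose this
print(h(4))          # on a fresh point it just outputs the default
```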
So, how am I going to do this? The hypothesis you have found is h_S, based on your training data, and now you want to see how successful it is. So I am interested in the true risk of h_S, L_{D,f}(h_S). You have done empirical risk minimization on the training set and come up with this hypothesis h_S, and now I want to measure its performance. I would say your empirical risk minimization has given you a good hypothesis if I can show that this quantity is small, and if it is 0, that is the best; but can I claim that it is going to be small? What are the things in your setup that govern how good h_S is? What do you think? I am assuming the data points are generated in an i.i.d. fashion. Suppose I take m = 10, that is, I give you 10 points, and using those 10 points you compute your empirical risk minimizer; call it h_{S_1}. Later I give you 90 more points, so you have 100 points in total, and using these 100 points you come up with another minimizer, h_{S_2}. Which do you think will be better, h_{S_1} or h_{S_2}? Why? When you have a larger sample, you approximate the true risk more accurately, and because of that you should end up with a better hypothesis. So naturally m has a role to play here: how many data points you have affects how good h_S is going to be. And notice there is more to it: beyond having m points, each of the points itself is generated independently.
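A quick experiment under the same assumed threshold setup (a made-up finite class, not prescribed by the lecture) hints at the role of m: averaged over many independent training sets, the ERM rule learned from 100 points sits closer to the true threshold than the one learned from 10 points.

```python
import random

random.seed(0)

thresholds = [i / 10 for i in range(11)]

def erm(S):
    return min(thresholds,
               key=lambda t: sum(int(x >= t) != y for x, y in S))

def draw_sample(m, true_t=0.5):
    return [(x, int(x >= true_t)) for x in
            (random.random() for _ in range(m))]

def avg_true_risk(m, trials=200):
    # true risk of threshold t under x ~ Uniform[0,1] is |t - 0.5|
    return sum(abs(erm(draw_sample(m)) - 0.5)
               for _ in range(trials)) / trials

print(avg_true_risk(10), avg_true_risk(100))  # the second is smaller
```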
So, for the time being let us take a two-dimensional case: say the x_1 component denotes humidity and the x_2 component denotes temperature, and these are the points I have observed. Say it rains when humidity and temperature are high, and does not rain otherwise; let us pretend we are some fake weather experts. In one region of this plane, all the points carry the label "rain", and below those points the label is "no rain". This is a hypothetical case; there need not be a nice separation like that, it could be much more complicated. So these are the data points. Say humidity ranges from 0 to some maximum value, call it x_1^max, and temperature likewise from 0 to some x_2^max. Now suppose the true behavior of nature, the actual way it happens, is that in one region it always rains and in the other it does not; that is how the actual environment behaves. What you are doing is getting samples from this environment over past history: this region is "rain" and this one is "no rain". Now, say you collected data from the last 10 years, and the last 10 years happened to be drought years, meaning there was not enough rain in that period. Because of that, most of the data you collected comes from the "no rain" portion; your data set will mostly have points from that portion.
If you train on this portion of the data, do you think the hypothesis you end up with will perform well? No, because it did not get to see the full picture, so it may not be able to. Looking only at this data, it will feel that in this part of the world it is always dry; every year it will declare a drought, and everybody will panic. Conversely, if the last 10 years happened to be flood years and you got lots of points from the "rain" region, that is also going to be bad: whatever hypothesis you get will predict rain, rain, rain, and that is not a good thing either. So which portion of your distribution the data comes from also matters; that is another reason the set S itself matters for how good or bad your hypothesis is, and why we put the subscript S on it. Notice that we are saying these points are generated i.i.d. When the points are i.i.d., it is unlikely that all the samples concentrate in one area; most likely they will be spread out and give a true representation of what is happening. But still, when you generate i.i.d. samples, with some small probability you may get samples from only one region. It is not the case that i.i.d. generation gives you samples from all over the region with probability 1; with some small probability, it may happen that all the data points you generated come from one small region, which would be a bad representation of the space.
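The flood-years story can be simulated under the same assumed threshold setup (all names and numbers here are hypothetical): if every observed humidity reading x lies in [0.6, 1], and hence every label is "rain", then nothing in the data rules out even the rule "always predict rain", and ERM, breaking ties towards the smallest threshold, happily returns it.

```python
import random

random.seed(1)

thresholds = [i / 10 for i in range(11)]
f = lambda x: int(x >= 0.5)   # assumed true rule: rain iff x >= 0.5

def erm(S):
    # min breaks ties towards the smallest threshold in the list
    return min(thresholds,
               key=lambda t: sum(int(x >= t) != y for x, y in S))

# Biased sample: only flood years, so x ~ Uniform[0.6, 1], all labels 1.
S_biased = [(x, f(x)) for x in
            (0.6 + 0.4 * random.random() for _ in range(100))]

t_hat = erm(S_biased)
print(t_hat)             # 0.0: "it always rains"
print(abs(t_hat - 0.5))  # its true risk under x ~ Uniform[0,1] is 0.5
```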
Because of this, the empirical risk itself is determined by the data samples I happen to see, and not only by the size of the data set. You may increase m to a large number of points, but with some small probability all of them might come from one small area, which could make your empirical risk minimizer really bad, and therefore its true risk bad as well. So we want to ask: for the hypothesis we learn from a given data set S, what kind of guarantee can I give on its true risk? Is it true that having a large number of points in S is enough to make the true risk arbitrarily small? Not necessarily, because as I said, even with many points, those sample points might come from some small region and be a bad representation of your space. That is why the kind of guarantee we give from now on is a probabilistic guarantee: we will only say how small this is going to be with high probability. Before we state any such result, to make the reasoning concrete we need an assumption; let me tell you what I mean. We are going to assume that there exists a hypothesis in my hypothesis class that achieves 0 loss; this assumption is called realizability. Formally, we assume there exists a hypothesis h* in H such that the true risk L_{D,f}(h*) = 0. What does this imply? We are basically saying that h* equals f with probability 1. It may happen that on some points h* does not assign the same label as f, but that set of points has 0 mass. So, let us say this assumption holds.
If this assumption holds, is it true that if I compute the empirical risk L_S(h*) for any sample S, that value will also be 0? Yes, with probability 1 over the draw of S. So this basically means that for any data sample, there will be a hypothesis in the class with 0 empirical loss. Now, if I go and do empirical risk minimization, we know there is at least one hypothesis, h*, that makes the empirical risk 0, so whatever h_S we get will also have empirical risk 0; there could be more than one such hypothesis, but we are interested in one that minimizes L_S. This assumption cannot always be easily realized, but for the time being, let us say my hypothesis class is expressive enough to capture the true labeling phenomenon: it can do as good a job as the true labeling function f. If this holds, we have the following result, and you will have to get used to such statements from now on. Let us try to understand what the result is saying. It says: let H be a hypothesis class with |H| finite, and suppose delta and epsilon are given to you. If you choose m, the number of data points, to be at least (1/epsilon) * log(|H|/delta), and you are given a data set generated i.i.d. with at least this many points, then the hypothesis h_S learned by empirical risk minimization is guaranteed to have true loss smaller than epsilon, and this happens with probability at least 1 - delta. As I said, I cannot give a deterministic guarantee here, because even if you give me a lot of samples, the samples may come from some bad region.
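The bound in the theorem, m >= (1/epsilon) * ln(|H|/delta) for the finite-class realizable case, turns directly into a sample-size calculator; the class size, epsilon, and delta below are made-up inputs for illustration.

```python
import math

def sample_complexity(H_size, epsilon, delta):
    """Finite-class, realizable case: with at least this many i.i.d.
    examples, ERM returns h_S whose true risk is at most epsilon
    with probability at least 1 - delta."""
    return math.ceil(math.log(H_size / delta) / epsilon)

print(sample_complexity(1000, 0.01, 0.05))   # ceil(ln(20000)/0.01) = 991
# Halving epsilon doubles the requirement; shrinking delta only
# costs logarithmically:
print(sample_complexity(1000, 0.005, 0.05))
print(sample_complexity(1000, 0.01, 0.005))
```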
So, I can only give a probabilistic guarantee. Now let us try to understand how to interpret this result. Say you are in a company and you have been told: give me a hypothesis whose true risk is going to be less than epsilon, with probability 1 - delta. Your boss has given these parameters to you. We cannot guarantee anything in this world with probability 1; realistically, we can only guarantee that something happens with some probability. Maybe that probability is very high, like me being alive tomorrow, but it is still probabilistic; even the probability that the sun does not rise tomorrow is not exactly 0. Keeping that in mind, the requirement has been specified by two things: I want the loss to be less than epsilon, and this should happen with probability 1 - delta. These are the parameters given to you. If anybody asks you to guarantee this, you also cannot promise it unconditionally; you will be demanding in return: fine, I will guarantee you this, but you give me this many data points, and that number m is determined by the bound above. If you are asking for epsilon accuracy with confidence 1 - delta, then give me this many samples and I will guarantee that the hypothesis I return has loss less than epsilon. It is like somebody asking you to get a job done in 1 hour: you will say, give me 10 GPUs and servers and I will get it done in 1 hour.
So, like this, when somebody asks you for this guarantee, you say: give me this many data points and you are guaranteed. I wanted to go through this supervised setting quickly; those of you who have done a course in machine learning and touched upon learning theory will have seen exactly this. This is touching upon sample complexity: if you want to learn something with some guarantee, you need a certain number of data samples, and results of this kind lead to sample complexity results. But this is not our main interest; it is just a precursor. What we are interested in is the setting where I do not have a priori access to these m data points. In the batch case you have already been given m data points, you train on them, and you try to guarantee how well you will perform on a new point. Now I am going to say I do not have the luxury of collecting all this data up front. Take the weather case: suppose somebody collected data for the last 50 years, gave it to you, and asked you to predict; even then we do not do well. We have so much historical data, yet our weather forecasts are often wrong. But that is one case; what happens if you have no such information at all, and yet you want to make good predictions from day one? How are you going to go about it? Fine, on day one you have nothing, so you cannot do anything; but after doing something on day one, you have some information for day two. Can you do something better on day two than what you did on day one?
And on day three you have the information from days one and two; can you use it to do something better on day three than what you did on days one and two? As information comes to you, can you quickly approach the best you could have done? That is where we are going to get into online machine learning. In the batch case we have all the data bunched up front, we train on it, and we guarantee future performance from whatever we learned from the past; in the online setting I do not have this luxury, but we are going to be ambitious and set our goal as doing well as quickly as possible. You will also see that we set our goal by comparing what we are doing against the case where we knew everything, as if I were the oracle of nature itself; when I say nature, I mean the data generating process. If I knew everything, I would know the best I could have done, and I am always going to compare myself to that. I know nothing at the start, I receive incremental observations, and the question is how quickly, using them, I can do as well as if I had known everything; if you can do that, it basically means you have learnt. Now the question is how to set this up and how to characterize all these things; that is what we are going to start looking at from the next class, where we will see various settings. So, let us stop here.