The following program is brought to you by Caltech. Welcome back. Last time we talked about regularization, which is a very important technique in machine learning. The main step we took was to go from a constrained form of regularization, where you explicitly forbid some of the hypotheses from being considered, thereby reducing the VC dimension and improving the generalization property, to an unconstrained version that creates an augmented error. In the augmented version, no particular vector of weights is prohibited per se, but you have a preference among weights based on a penalty that comes from the constraint, and that equivalence is what makes us focus on the augmented-error form of regularization in every practical case. The argument for it was to take the constrained version and look at it either as a Lagrangian, which would be the formal way of solving it, or, as we did, in a geometric way: find a condition that corresponds to minimization under a constraint, and observe that it is locally equivalent to minimizing the augmented error in an unconstrained way. Then we went to the general form of a regularizer, which we called capital Omega of h; it depends on small h, the individual hypothesis, rather than on capital H, which is what the other capital Omega from the VC analysis depended on. We formed the augmented error as the in-sample error plus this term, and the idea is that the augmented error is a better thing to minimize, if your goal is to minimize the out-of-sample error, than E_in by itself. There are two choices here: one is the regularizer Omega, with weight decay or weight elimination or the other forms you may find, and the other is lambda, the regularization parameter, the amount of regularization you are going to apply. The long and short of it is that the choice of Omega in a practical situation is really a heuristic choice, guided by theory and by certain goals, but there is no mathematical way, in a given practical situation, to come up with a totally principled Omega. We follow the guidelines and we do quite well: we choose Omega toward smoother or simpler hypotheses, and then we leave the amount of regularization to the determination of lambda. Lambda is a little more principled; we will find that we determine lambda using validation, which is the subject of today's lecture. When you do that, you get some benefit out of Omega: if you choose a great Omega you get a great benefit, if you choose an okay Omega you get some benefit, and if you choose a terrible Omega you are still safe, because validation will tell you to take lambda equal to zero, and no harm is done. As you can see in the example, the choice of lambda is indeed critical. When you take the correct amount of regularization, which happens to be very small in this case, the fit, which is the red curve, is very close to the target, which is the blue curve; whereas if you push your luck and apply more regularization, you end up constraining the fit so much that the red curve wants to move toward the blue one but cannot, because of the penalty, and ends up being a poor fit. So that leads us to today's lecture, which is about validation. Validation is another technique that you will be using in almost every machine learning problem you encounter. The outline is very simple. First I'm going to talk about the validation set; there are two aspects to discuss, and the first is its size.
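(Before diving into validation, here is a parenthetical recap of last lecture's machinery as a minimal code sketch. The closed form below is the weight-decay result derived in the regularization lecture; the function name and setup are my own.)

```python
import numpy as np

def weight_decay_fit(Z, y, lam):
    """Minimize E_aug(w) = E_in(w) + (lam/N) * w'w for linear regression.

    The minimizer is the weight-decay solution from last lecture:
        w_reg = (Z'Z + lam*I)^(-1) Z'y
    """
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

# lam = 0 recovers plain least squares; larger lam prefers smaller weights.
```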
The size of the validation set is critical, so we will spend some time on it. Then we will ask ourselves why we called it validation in the first place; it looks exactly like the test set that we looked at before, so why call it validation? The distinction will turn out to be pretty important. Then we go to model selection, a very important subject in machine learning; it is the main task of validation, that's what you use validation for, and we will find that model selection covers more territory than the name may suggest. Finally we go to cross validation, a very interesting type of validation that allows you, if I give you a budget of N examples, to basically use all of them for validation and all of them for training. That looks like cheating, because validation will look like a distinct activity from training, as we will see, but with this trick you will find a way around that. So now let me contrast validation with regularization as far as controlling overfitting is concerned. We have seen, in one form or another, the following by-now-famous equation, or inequality, or rule: the out-of-sample error, which is what you want, equals the in-sample error, or is at most the in-sample error plus some penalty. It could be called a penalty for model complexity, an overfit penalty, a bunch of other names, but basically it tells us that E_in is not exactly E_out, which we know all too well; there is a discrepancy, and the discrepancy has to do with the complexity of something — the overfit penalty has to do with the complexity of the model you are using to fit. In terms of this equation, I'd like to pose both regularization and validation as activities that deal with the equation. What did regularization do? It tried to estimate the penalty term: we concoct a quantity that we think captures the overfit penalty, and then instead of minimizing the in-sample error we minimize the in-sample error plus that quantity, which we call the augmented error, hoping the augmented error will be a better proxy for E_out. That was the deal, and we noticed that we are very, very inaccurate in that choice; we just say, okay, smoother is better, pick a lambda, use this form or that. Obviously we are not satisfying any inequality exactly, but we are getting a quantity with a monotonic property: when you minimize it, E_out tends to get minimized with it, which does the job for us. Now, to contrast this, look at what validation does with the same equation. Validation cuts to the chase: it just estimates the out-of-sample error directly. Why bother with the analysis of overfitting and this and that? You want to minimize the out-of-sample error, so estimate it and minimize the estimate. Obviously that sounds too good to be true, but it's not totally untrue; validation does achieve something in that direction. So let me spend a few slides describing the estimate of the out-of-sample error. This is not a completely foreign idea, because we used a test set to do exactly that; let's focus on it and see what parameters are involved.
The starting point is to take an out-of-sample point (x, y), a point that was not involved in training. We used to call it a test point; now we are going to call it a validation point. Why we give it a different name will not be clear for a while, until we use the validation set for something, and then the distinction will become clear; for now it is just a test point, and we are estimating E_out. The error on that point is the discrepancy between what the hypothesis does on x and the target value y. We have seen many forms of the error; let's mention two to make it concrete: it could be the simple squared error, which we saw in linear regression, or the binary error, which we saw in classification — nothing foreign here. Now take this quantity and treat it as an estimate of E_out; a poor estimate, but nonetheless an estimate. What is its expected value with respect to the probability distribution over the input space that generates x? It is simply E_out. So this random variable indeed has the correct expected value: it is an unbiased estimate of E_out. But unbiased only means that it is as likely to land above as below in terms of the expected value; on a single point it could still be a terrible estimate, because if the quantity swings widely and I tell you it is an estimate of E_out, whatever single value you happen to draw is what you will think E_out is. There is an error, but the error is not biased; that is all the equation says. So we have to evaluate the swing, and the swing is measured by the usual quantity, the variance; let's call it sigma squared. It depends on a number of things, including your error measure, but that is what a single point gives you: an unbiased estimate whose variance is likely to be large, so you are unlikely to use a one-point estimate as your guide to E_out. What do you use? You move from one point to a full set: a validation set used to estimate E_out. The notation will be that the number of points in the validation set is capital K; remember that the number of points in the training set was capital N. These K points are also generated independently according to the same probability distribution over the input space, and the error on that set we call E_val, the validation error. We had E_in and E_out; now we introduce E_val. Its form is what you expect: you take the individual errors on the examples and average them, exactly as you did with the training set. The only difference is that this is done out of sample — these points were not used in training — and therefore you would expect it to be a good estimate of the out-of-sample performance. Let's check. What is the expected value of E_val? The expectation goes inside the average, so the main component is the expected value of the error on a single point, which we have just seen: the expected value of the error on one point is E_out.
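In symbols, what we have so far reads as follows (a reconstruction consistent with the notation above, with h standing for the hypothesis being validated on K points):

```latex
E_{\text{val}} = \frac{1}{K}\sum_{k=1}^{K} e\big(h(\mathbf{x}_k), y_k\big),
\qquad
\mathbb{E}\big[E_{\text{val}}\big]
= \frac{1}{K}\sum_{k=1}^{K} \mathbb{E}\big[e\big(h(\mathbf{x}_k), y_k\big)\big]
= E_{\text{out}}(h)
```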
Therefore, when you average, you just get E_out again, so indeed the validation error is an unbiased estimate of the out-of-sample error — provided that all you did with the validation set is measure that error, and you didn't use the set in any other way. Now let's look at the variance, because that was our problem with the single-point estimate, and see whether there is an improvement. When you compute the variance of E_val from the formula, you get a double summation with all the cross terms between different points — the covariance between the error for k equals 1 and k equals 2, between k equals 1 and k equals 3, and so on — plus the diagonal terms, where both indices coincide, which are the variances. So the main components are the variances and a bunch of covariances; actually there are more covariances than variances, since the variances are the diagonal and the covariances are off-diagonal, almost K squared of them. The good thing is that the covariances here are zero, because we picked the points independently, so the covariance between quantities that depend on different points vanishes. I am left only with the diagonal elements, and something interesting happens in the summation: the double sum reduces to a single one, because I am only summing the diagonal, but I keep the normalizing factor 1 over K squared, because I started with K squared elements; the fact that most of them dropped out is just to my advantage. That gives a better variance for the estimate based on E_val than for a single point — the typical analysis of averaging independent estimates: the sigma squared that was the variance on a single point is now divided by K. Now we see hope: even if the single-point estimate has a wide spread, maybe we can make K big enough to keep shrinking the error bar, so that E_val itself, as a random variable, concentrates around E_out, which is what we want, and becomes a reliable estimate. This looks promising, so we can write E_val, the random variable, as E_out, the value we want, plus or minus a term that averages to zero and is of order 1 over the square root of K: if the variance goes as 1/K, the standard deviation goes as 1 over root K. I am assuming here that sigma squared is roughly constant in the range I am using, so the dependence on K comes only from the averaging. This tells me what I am estimating, what error I am committing, and how that error behaves as I increase K. The interesting point now is that K is not free. It is not as if, having noticed that increasing K improves the estimate, we can simply use more and more validation points; the reality is that K is not given to you on top of your training set. What I give you is a data set of N points, and it is up to you how many to train on and how many to validate on; I am not going to give you extra points just because you want to validate. Every time you take a point for validation, you are taking it away from training, so to speak. So let's see the ramifications of this regime, where K is taken out of N.
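Numerically, the 1-over-root-K behavior looks like this (a toy sketch under made-up assumptions — a fixed hypothesis and a noisy linear target; the name `val_error` is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: target y = x + noise, fixed hypothesis h(x) = 0.9 * x.
# The validation error on K points is an unbiased estimate of E_out,
# and its standard deviation should shrink like 1/sqrt(K).
def val_error(K):
    x = rng.uniform(-1, 1, K)
    y = x + rng.normal(0, 0.1, K)
    return np.mean((0.9 * x - y) ** 2)

for K in [10, 100, 1000]:
    estimates = [val_error(K) for _ in range(2000)]
    print(K, np.mean(estimates), np.std(estimates))
# The mean stays put (unbiased); the spread drops by about sqrt(10) per row.
```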
Here is the notation: we are given a data set D, as we always called it, and we take K points from it for validation. You can take any K points, as long as you don't look at the particular inputs and outputs when choosing them; say you pick K points at random from the N, and that is a valid validation set. So I have the K validation points, and I am left with N minus K points for training. I never needed the name D_train before, because without validation all of D went to training; now, because the data has two uses, I call the subset that goes into training D_train and the subset held out for validation D_val, and their union is D. That's the setup. Now, we found in the previous slide that the reliability of the estimate, measured by the error bar on the fluctuation, is of order 1 over the square root of K, where capital K is the number of validation points. So if you use a small K you have a bad estimate, and since the whole role we have for validation so far is to estimate, a small K means we are not doing a good job; from that point of view alone it looks like a good idea to take K large. But let's put a question mark there and try to be quantitative. Remember the learning curve? It tells us, for a given model, the expected values of E_out and E_in as you increase the number of training examples. The number of training points used to be N; here I write it as N minus K, because under the validation regime that is what is actually used for training, so if you increase K you are moving backwards along the curve. I used to sit at N and expect a certain level of E_out; now things look less promising — I may get a reliable estimate, because I am using a bigger K, but it is a reliable estimate of a worse quantity. To take an extreme case: you take this estimate and go to your customer to tell them what performance to expect — you deliver not only the final hypothesis but an estimate of how it will do when they test it. Suppose you want the estimate to be very reliable and you forget about the quality of the hypothesis, so you keep increasing K, and you end up with a very, very reliable estimate. The problem is that it is an estimate of a very poor quantity, because you used two examples to train and you are basically in the noise. The statement you make to your customer in this case is: here is a system, together with an extremely reliable certificate that it performs terribly — which is unlikely to please any customer. So now we realize there is a price to be paid for K. It turns out we have a trick that lets us not pay that price, but still, what happens when K is big remains a question in our minds. So here is what I propose: use the K points to estimate the error, and once the estimate is in your pocket, restore the data set and train on the full set, so that you get the better hypothesis. But wait — I estimated on the smaller set; what exactly are we doing?
So let's do this systematically; let's put K back into the pot. Here is the regime — I will describe the figure piece by piece. We split D into D_train and D_val: D itself has N points, and we took N minus K to train and K to validate; that's the game. If we used the full set D to train, we would get a final hypothesis that we call g; that is just a matter of notation. But here you are using only D_train to train, and it has N minus K points, not all the examples; therefore I generically label the final hypothesis obtained by training on the reduced set D_train as g⁻ (g minus), just to remind ourselves that it was not trained on the full set. So: if I used the full data set I would get g; what I do instead is take D_train, which has fewer examples, while the rest go to validation, and use D_train to get g⁻ together with an estimate of its performance. The trick now is that instead of reporting g⁻ as my final hypothesis, I observe that if I add the held-out points back into the pot and retrain, I get a better out-of-sample error — I don't know what it is, and I don't have an estimate for it, but I know it will on average be better than that of g⁻, simply because of the learning curve: more examples, better out-of-sample error. So I put the validation points back and report g. It is a funny situation: I am giving you g, and I am giving you the validation estimate on g⁻. Why? Because that is the only estimate I have; I cannot get an estimate on g, since once I train on everything there are no points left to validate on. So you can see the compromise. Under this scenario I am not really losing performance by taking a bigger validation set, because the points go back in before I produce the final hypothesis; what I am losing is that the validation error I report is measured on a different hypothesis than the one I deliver. If the difference is big, my estimate is bad, because I am estimating something other than what I am giving you — and that is exactly what happens with large K: the discrepancy between g⁻ and g grows, I am reporting the estimate on g⁻, and that estimate becomes poor, so I get a bad estimate again. Now you see the subtlety. This is the regime used in validation universally: after you do your estimates and, as you will see, your choices, you put all the examples back in to train on, because that is your best bet for a good hypothesis. If K is small, the validation error is unreliable — a bad estimate, simply because its fluctuation, of order 1 over root K, is big. If K is big, the problem is not the reliability of the estimate; it is that the thing you are estimating gets further and further from the thing you are reporting. So we have a compromise: we don't want K too small, to avoid fluctuations, and we don't want K too big, to stay close to what we report. As usual in machine learning, there is a rule of thumb, and it's pretty simple — that's why it's a rule of thumb: take one fifth of your data for validation. That usually gives you the best of both worlds.
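Here is the whole regime in one toy sketch (my own illustration, not from the lecture; the data, the cubic model, and all names are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: y = x^2 + noise, fit with a cubic polynomial.
N = 50
x = rng.uniform(-1, 1, N)
y = x**2 + rng.normal(0, 0.1, N)

# Rule of thumb: one fifth for validation.
K = N // 5
idx = rng.permutation(N)
val, train = idx[:K], idx[K:]

# Train on D_train only -> g_minus.
g_minus = np.polyfit(x[train], y[train], 3)

# Validation error of g_minus: unbiased estimate of its E_out.
e_val = np.mean((np.polyval(g_minus, x[val]) - y[val]) ** 2)

# Restore the full set and retrain -> g, the hypothesis we report,
# together with e_val, which was measured on g_minus.
g = np.polyfit(x, y, 3)
print("reported hypothesis:", g, " validation estimate:", e_val)
```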
Nothing is proved; you can find counterexamples, and I am not going to argue with that — it's a rule of thumb. Use it in practice, and you will be quite successful indeed. There is an argument among some people about whether it should be N over 5 or N over 6; I am not going to quarrel over that — it's a rule of thumb, for crying out loud — so we'll leave it at that. So we now know what validation is, we understand how critical the choice of K is, and we have a rule of thumb. Now let's ask the other question: why are we calling this validation in the first place? So far it is purely a test: we take out-of-sample points, the estimate is unbiased — what's the deal? We call it validation because we use it to make choices, and this is a very important point, so let's discuss it in detail. Once my estimate affects the learning process, the set I am using changes its nature. Look at a situation we have seen before — remember this one? This was early stopping in neural networks; let me magnify it so you can see the green curve. The in-sample error goes down, and the out-of-sample error — say I have some general estimate of it — goes down with it, until a point where it turns back up, and we have overfitting; we talked about this, and in this case it's a good idea to stop early. Now, say you are using K points that were not used in training to estimate E_out. That is E_test, the test error, if all you do is plot the red curve in order to look at it and admire it: oh, that's a nice curve; oh, it's going up — but you take no action based on it. If instead you decide, this is going up, I had better stop here, that changes the game dramatically: all of a sudden this is no longer a test error; it is a validation error. You may ask, what the heck — this is just semantics, it's the same curve, why the different name? I call it a different name because it used to be unbiased: as an estimate of E_out it carried an error bar, but it was as likely to be optimistic as pessimistic. Once you do early stopping — once you say, I am going to stop here and use this value as my estimate of what you will get — I claim your estimate is biased. What?! It's the same point; you told us it was unbiased before. What's the deal? Let's look at a very specific, simple case in order to understand what happens. The set is no longer a test set; it has become, in red, a validation set. Fine — now that we will be convinced of the substance, we know why the name changes. So let's look at the difference when you actually make a choice. I have a test set, which is unbiased, and I claim the validation set has an optimistic bias. Optimism sounds good, but here it is optimism followed by disappointment — it's deception. We call it optimistic to capture that the bias is always in the direction of thinking the error is smaller than it will actually turn out to be. So let's say we have two hypotheses.
Each of them has out-of-sample error 0.5. Now I use a point to estimate that error, and I have two estimates: e1 for hypothesis 1 and e2 for hypothesis 2. Because the estimates have fluctuations in them, let me assume, again for simplicity, that both e1 and e2 are uniform between 0 and 1; then indeed the expected value of each is one half, which is the out-of-sample error, as it should be. I will also assume that e1 and e2 are independent — this is not strictly necessary; the argument goes through with some level of correlation and you still get the same effect, but let's take them to be independent random variables. Now: e1 is an unbiased estimate of its out-of-sample error, right? Right. e2 the same? Right. Unbiased means the expected value is what it should be, and the expected value here is indeed 0.5. Now let's play the game where we pick one of the two hypotheses, and we pick according to the value of the error: I take the smaller of e1 and e2, and whichever one it is, I pick the hypothesis corresponding to it. This is mini-learning — look at the errors, pick the smaller, done. My question to you: what is the expected value of e, the minimum of e1 and e2? You might reason: you told us the expected value of e1 is a half, you told us the expected value of e2 is a half, and e has to be either e1 or e2, so the expected value should be a half. Of course not — because now the rules of the game, the probabilities we are applying, have changed: you are deliberately picking the minimum of the two realizations, and the expected value of e is actually less than 0.5.
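You can verify this with a short simulation (my own sketch; for two independent uniforms the exact answers are E[min] = 1/3 and P(min < 0.5) = 3/4):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two unbiased single-point estimates, each uniform on [0, 1],
# for two hypotheses that both have E_out = 0.5.
e1 = rng.uniform(0, 1, 1_000_000)
e2 = rng.uniform(0, 1, 1_000_000)

e = np.minimum(e1, e2)    # "mini-learning": pick the smaller error
print(np.mean(e))         # ~0.333, not 0.5: an optimistic bias
print(np.mean(e < 0.5))   # ~0.75: it usually looks better than the truth
```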
It is easy to argue why: the probability that e is less than 0.5 is 75%, because all you need is for one of the two estimates to come out below 0.5 — if either e1 or e2 is below 0.5, the minimum is below 0.5. So e lands below 0.5 three quarters of the time, and its expected value is below 0.5. Now we realize what this is: an optimistic bias, and it is exactly what happened with early stopping. We picked a point because it is the minimum on the realization, and that is what we reported; the curve used to be unbiased, but now we wait — when the estimate is high we ignore it, and when it is low we take it — and that introduces a bias, an optimistic one, and the same will be true for the validation set. Our discussion so far was based on merely looking at the estimate; now we are going to use it, and we are going to introduce a bias. Fortunately for us, the use of validation in machine learning is so light that we can swallow the bias; it is minor. We are not going to push our luck, estimating tons of stuff and adding bias until the validation error becomes a training error in disguise; we will just choose a parameter here, choose between models there. By and large, if you do that with a respectable-size validation set, you get a pretty reliable estimate of E_out — conceding that it is biased, but the bias will not hurt us much. That is the general philosophy. So, with this understanding, let's use the validation set for model selection, which is the main use of validation sets; the choice of lambda in the case we saw happens to be a manifestation of it. Basically, we are going to use the validation set more than once — that is how we make the choice. Here is a diagram; I will build it up and then focus on how it reflects the logic. We have capital M models to choose from. When I say models, you may think of one model versus another, but this is more general. I could be talking about models as in: should I use linear models, neural networks, or support vector machines? Or I could be using only polynomial models and asking myself: should I go for second order, fifth order, or tenth order? That, too, is a choice between models. Or I could be using fifth-order polynomials throughout, and the only thing I am choosing is whether lambda, the regularization parameter, should be 0.01, 0.1, or 1. All of this falls under model selection: there is a choice to be made, and I want to make it in a principled way based on the out-of-sample error, because that is the bottom line, and I am going to use the validation set to do it. So here is the game. Call the models' hypothesis sets H1 up to HM. We use D_train to train — not the whole set, as usual, since some points are left for validation — and from each model we get a final hypothesis. Our convention is to write g⁻ whenever we train on less than the full set, and because I get a hypothesis from each model, I label them with the subscript m: g⁻1 up to g⁻M, with the minus because they used D_train to train. So I get one hypothesis from each model, and then I evaluate each of them using the validation set — the examples that were left out of D when I took D_train to train.
So all I am doing is exactly what I did before, except that I am doing it capital M times and introducing the notation that goes with that. Let's look at the figure. Here is the situation: I have the data set; I break it into validation and training; I apply the training part to each of the hypothesis sets H1 up to HM, and each training run ends with a final hypothesis — with a minus, because I am training on D_train — corresponding to the hypothesis set it came from: g⁻1, g⁻2, up to g⁻M. These are obtained without any validation, just training on a reduced set. Once I have them, I evaluate their performance using the validation set: I run the validation set through each of them — it is out of sample as far as they are concerned, since it is not part of D_train — and I get the validation errors, which I give the simple notation E1, E2, up to EM. Model selection is then: look at these errors, which supposedly reflect the out-of-sample performance of each candidate if you used it as your final product, and pick the best. The moment you pick one of them, alarm bells should go off: bias, bias, bias. Each of these was an unbiased estimate of the out-of-sample error of the corresponding hypothesis; you pick the smallest of them, and now you have a bias. The smallest gives you the index m-star, whichever it may be, so E at m-star is the validation error on the model we selected, and we realize it has an optimistic bias. And we are not going to deliver g⁻ at m-star, the hypothesis that gave rise to that error; as we said in our regime, we go back to the full data set, train the chosen model on it, and from that training we get the final hypothesis g at m-star. So again we report the validation error measured on a reduced-set hypothesis while delivering the full-set hypothesis, because that is the best we can do: we know we get better out-of-sample performance when we add the examples back. To complete the slide: E_m is the validation error of g⁻_m, the hypothesis trained on the reduced set, for each m; you pick the model m-star with the smallest E_m; you restore D, retrain the chosen model, and report the result. This is the algorithm for model selection.
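A compact sketch of this selection loop (my own illustration: the "models" here are polynomial degrees on a toy target, and all names are made up):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data and a split into D_train and D_val.
N, K = 50, 10
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + rng.normal(0, 0.2, N)
x_tr, y_tr = x[K:], y[K:]
x_va, y_va = x[:K], y[:K]

# M models: polynomial degrees 1..6. Train each on D_train -> g_minus_m,
# then score each on D_val -> E_m.
E = {}
for m in range(1, 7):
    g_minus = np.polyfit(x_tr, y_tr, m)
    E[m] = np.mean((np.polyval(g_minus, x_va) - y_va) ** 2)

m_star = min(E, key=E.get)          # smallest validation error (optimistic!)
g_final = np.polyfit(x, y, m_star)  # restore D: retrain chosen model on all N
print("chose degree", m_star, "with E_val", E[m_star])
```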
Now let's look at the bias. I am going to run an experiment to show it to you, and I will ask you questions about the result, so pay attention. Here is the experiment — a very simple situation. I have only two models to choose between: second-order polynomials and fifth-order polynomials. I generate a bunch of problems; in each of them I make the choice based on a validation set, and afterwards I look at the actual out-of-sample error, trying to find out whether there is a systematic bias in the chosen model's validation error relative to its out-of-sample error. It is not predetermined which model wins: in each run I may have chosen H2 or H5, whichever gave the smaller E_val, and I take the average over an enormous number of runs — that is why the curves are smooth. This gives an indication of the typical bias you get when choosing between two models under the circumstances of the experiment. The experiment is done deliberately with few examples — the total is thirty-some — and the validation set size goes from 5 examples, through 15, up to 25; at the top end, the number of examples left for training is very small. What I plot is the average, over the runs, of the validation error of the chosen model's final hypothesis, together with the out-of-sample error of that same hypothesis. Now two questions — and for the online audience, please think about them too. First question: why are the curves going up? The horizontal axis is K, the size of the validation set, but it is not because I evaluate on more points that the curves go up; it is because using more points for validation inherently means fewer for training — there is an N minus K moving in the other direction — so what we are really seeing is the learning curve backwards: E_out goes down with more training examples, so read in this direction it goes up, and the validation error, being an estimate of it, goes up with it. That makes sense. Second question: why are the two curves getting closer together — never mind up or down, just the fact that they converge to each other? That has to do with K directly, whereas the first effect involved K only indirectly through N minus K: with bigger K the estimate is more and more reliable, so it gets closer to what it is estimating. So we understand the picture, and it is definite evidence that in every such situation there will be a bias; how much depends on a number of factors, but the bias is there. Now let's find, analytically, a guideline for the size of the bias. Why? Because I am using the validation set to estimate the out-of-sample error, and I am claiming the estimate is close; we realized that if I don't use the set too much I will be okay — but what is "too much"? I want to be at least a little quantitative about it. I have capital M models, and you see that the M is in red; that should remind you of the capital M, also in red, from very early in the course — the number of hypotheses, back when we were talking about generalization. Capital M used to make things worse: with a bigger M, you were in bigger trouble.
It seems we are also headed for bigger trouble here, though the manifestation is different. We now have capital M models to choose from — models in the general sense; these could be M values of the regularization parameter lambda in an otherwise fixed situation — but either way we are making one of M choices. The way to look at it is to regard the validation set as actually being used for training, but training on a very special hypothesis set: the hypothesis set of the finalists. What does that mean? I have H1 up to HM; I run a full training session on each of them, using D_train, to find a final hypothesis from each. After I am done, I am left with the finalists, g⁻1 up to g⁻M, with the minus sign because they were trained on the reduced set. The hypothesis set I am now "training" on is just those M hypotheses. As far as the validation set is concerned, it has no idea what happened before — it has no relation to D_train; all you did was hand it this hypothesis set, the final hypotheses from the previous stage, and ask it to choose. And what does it choose? The minimum error. Well, that is simply training: if I told you this is your hypothesis set and D_val is your training data, what would you do? Look for the hypothesis with the smallest error — exactly what we are doing here. So we can think of it as genuinely training on the finalist set, and that tells us what to estimate: the discrepancy, or bias, between the validation error and the out-of-sample error — where the validation error is really the training error on this special set. So we go back to our good old Hoeffding and VC analysis and say that the out-of-sample error of the chosen finalist — the choice is m-star, so the "final final" hypothesis is g⁻ at m-star — is less than or equal to its validation error plus a penalty for the complexity of the choice. Even with the simple union bound, the penalty has the familiar form: there is still the 1 over root K, so you can always do better with more validation examples, but there is also a contribution from the number of candidates you choose from. Choosing among 10 is one thing; choosing among 100 is worse — benignly worse, because the dependence is logarithmic, but worse nonetheless.
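In symbols, the union-bound version of this statement reads (a reconstruction consistent with the course's notation):

```latex
E_{\text{out}}\big(g^{-}_{m^{*}}\big)
\;\le\;
E_{\text{val}}\big(g^{-}_{m^{*}}\big)
\;+\;
O\!\left(\sqrt{\frac{\ln M}{K}}\right)
```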
And what if you are choosing among an infinite number of candidates? We know better by now than to dismiss that case out of hand; once you go to infinite choices you don't count anymore — you go to the VC dimension of what you are doing, because that is what the effective complexity goes with. Indeed, consider choosing the value of one parameter: say you are picking the regularization parameter, and you haven't put down a grid — you are not choosing between 1, 0.1, 0.01 and so on, a finite menu, but the actual numerical value of lambda, whatever it may be, so you could end up with lambda equals 0.127543. You are choosing among an infinite number of candidates, but you don't look at it as an infinite number; you look at it as a single parameter, and a single parameter goes with a VC dimension of one, which doesn't faze us. We have dealt with VC dimensions much bigger than that, and we know that if you have one parameter, or maybe two, so the VC dimension is perhaps two, and you have a decent-sized set — here, the validation set — then your estimate will not be far from E_out. That is the idea: you can apply the VC analysis, instead of the simple union-bound counting, to this situation, and ask what you need when choosing a regularization parameter, or when early stopping — with early stopping I am choosing the number of epochs, when to stop; epochs are integers, but effectively I am choosing a stopping point. All such choices, where one parameter is chosen one way or another, correspond to roughly one degree of freedom. So here is the rule of thumb: if you use a validation set of reasonable size, say 100 points, to choose a couple of parameters, you are okay — you can relate to that without me telling you: 100 points, VC dimension two, I can get something. If I give you the same 100 points and you choose 20 parameters, you immediately say: this is crazy; the estimate will be completely ruined, because you are now contaminating the set — this is now genuinely training. After all, what is training? Choosing parameter values: training a neural network chooses the weights, the parameters; there are just so many of them that we call it training, while with only one or two we call it choosing a parameter by validation. It is a gray area, and if you push your luck in that direction, the validation estimate loses its main attraction — being a reasonable estimate of the out-of-sample error that we can rely on; the reliability goes down. So there is a trade-off. Let me summarize it in terms of data contamination. We have error estimates — we have seen the in-sample error, the out-of-sample error as estimated by E_test, and now E_val, the validation error — and I'd like to describe the issue as data contamination: when you use the data to make choices, you contaminate it as far as its ability to estimate the real performance goes. What is contamination? It is the built-in optimistic — better described as deceptive — bias, and it's bad: you go to the bank and tell them, I can forecast the stock market; no, you can't. You were optimistic before you went; afterwards you are in trouble. So you want an estimate of E_out and a gauge of the level of contamination. Let's look at the three sets we have used. The training set is totally contaminated — forget it: we took a neural network with 70 parameters, ran backpropagation back and forth, and ended up with something that has a great E_in, and we know E_in is no indication of E_out; this set has been contaminated to death, so you cannot rely on E_in as an estimate of E_out. The test set is totally clean: it was not used in any decisions, and it will give you an estimate.
The estimate is unbiased: when you give it as your estimate, your customer is as likely to be pleasantly surprised as unpleasantly surprised, and if your test set is big they are likely not to be surprised at all — the performance will be very close to your estimate. So there is no bias there. The validation set is in between: it is slightly contaminated, because it made a few choices. The wisdom here is: please keep it only slightly contaminated; don't get carried away. Sometimes, in the middle of a big problem with lots of data, you choose one parameter, and then — oh, there is another parameter I want to choose — you use the same validation set again (alarm bells!), and you keep doing it. You should have a regime from the start in which you keep not just one validation set but several, so that when one gets dirty, contaminated, you move on to another that hasn't been used for decisions, and the estimates stay reliable. Now we go to cross validation — a very sweet regime — and it has to do with the dilemma about K. Here we are no longer talking about biased versus unbiased; that is behind us, and we have the discipline to make sure we don't mess the estimate up by biasing it, so that is taken for granted. We are looking at an estimate and its fluctuation, as before, and comparing the regime of validation described so far with another regime that gets us a better estimate in terms of the error bar around the quantity we want. We had the following chain of reasoning. E_out(g), the out-of-sample error of the hypothesis we actually report, is what we would like to know; if we knew it we would be set, but we don't have it. It is approximately the same as E_out(g⁻), the proper out-of-sample error of the hypothesis trained on the reduced set — close, as long as I didn't take too many points away. And E_out(g⁻), in turn, is close to its validation estimate E_val(g⁻). Notice the two different reasons: the first approximation holds because a different set is used for training, and the second because I am making a finite-sample estimate of the quantity, which can fluctuate up or down. I look at this chain because the first quantity is what I want and the last is what I can compute; the first is unknown to me. To get from one end to the other I need two things. I need K to be small, so that g⁻ is fairly close to g and hence their out-of-sample errors are close — the bigger K is, the bigger the discrepancy between the training set and the full set, and therefore between the corresponding hypotheses. But I also need K to be large, because the bigger K is, the more reliable E_val is as an estimate of E_out(g⁻). So I want K to satisfy two conditions: it has to be small, and it has to be large. We will achieve both, as you will see in a moment — new mathematics is about to be introduced.
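Here is that chain of reasoning in symbols (a reconstruction in the course's notation, annotated with the condition each approximation needs):

```latex
E_{\text{out}}(g)
\;\underset{\text{small } K}{\approx}\;
E_{\text{out}}(g^{-})
\;\underset{\text{large } K}{\approx}\;
E_{\text{val}}(g^{-})
```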
So here is the dilemma: can K be both small and large? The method I will describe looks like complete cheating at first, and then you realize it is actually valid. I am going to describe the simplest form of cross validation, called leave-one-out; other methods leave more than one out, that's all, but let's focus on leave-one-out. Here is the idea. You give me a data set of size N, and I will use N minus 1 points for training. That's good, because N minus 1 is very close to N, so the hypothesis g⁻ will be awfully close to g — great, wonderful, except for one problem: you have one point to validate on, so your estimate will be completely laughable. Right? Not so fast. In terms of notation, I create a reduced data set from D and call it D_n, because I am going to repeat this exercise for different indices n: take the full data set, take out point number n — that one will be used for validation — and use the remaining points for training. Nothing is different from before, except that the validation set is very small. The final hypothesis learned from this particular set has to be called g⁻, since it is not trained on the full set, but because it depends on which point we left out, we give it the label of that point: g⁻_n, trained on all the examples except example n. Now look at the validation error, which here must be computed on one point: formally it is E_val of this setup, but in reality it is simply the error on the point I left out, e_n, the error of g⁻_n on (x_n, y_n). The training of g⁻_n did not involve example n — it was taken out — and now that g⁻_n is frozen, we evaluate it on that example, which is genuinely out of sample for it. So I get this number, and I know two things about it: it is an unbiased estimate, and it is a crummy estimate. Now here is the idea: what happens if I repeat this exercise for different n? Generate D_1, do all of this, end up with one estimate; do D_2, end up with another; and so on. Each estimate is out of sample with respect to the hypothesis it evaluates — but the hypotheses are different, so I am not getting the performance of one particular hypothesis: for this hypothesis, this is the estimate; for that hypothesis, that is the estimate. The common thread among all these hypotheses is that each was obtained by training on N minus 1 data points — different sets of N minus 1 points, but always N minus 1 — and because of the learning curve, if I tell you the number of training examples, you can tell me the expected out-of-sample error. So in spite of being different hypotheses, the fact that they all come from the same number of points, N minus 1, means they are all realizations of something with a common expected value: the small errors e_n estimate the errors of their respective hypotheses, and those in turn estimate the expected error of training on N minus 1 examples, regardless of the identity of the examples. So now I define the cross-validation error, E_cv, to be the average of those e_n's.
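A direct implementation of leave-one-out as just described (a sketch; the polynomial model, squared error, and the name `loocv_error` are my own toy choices):

```python
import numpy as np

def loocv_error(x, y, degree):
    """Leave-one-out cross-validation error for a polynomial model.

    For each n: train g_minus_n on all points except n, then evaluate
    the squared error e_n on the left-out point. E_cv is the average.
    """
    N = len(x)
    errors = []
    for n in range(N):
        mask = np.arange(N) != n                    # D_n: everything but point n
        g_minus_n = np.polyfit(x[mask], y[mask], degree)
        e_n = (np.polyval(g_minus_n, x[n]) - y[n]) ** 2
        errors.append(e_n)
    return np.mean(errors)
```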
It is a funny situation: these numbers came from N full training sessions, each followed by a single evaluation on one point, and after I am done with all of that I take the N numbers and average them. If you now think of it as a validation set, all of a sudden the validation set is very respectable — it has N points — never mind that each of them is evaluated on a different hypothesis. So I was able to use N minus 1 points to train, which gives me something very close to what happens with N, and I am using N points to validate. The catch, obviously, is that these numbers are not independent: the examples were used both to create hypotheses and to evaluate them, and each e_n is affected by the others. Take e_1 and e_3: e_1 evaluates a hypothesis whose training involved the third example, while e_3 is evaluated on the third example for a hypothesis whose training involved the first; you can see where the correlation comes from. Surprisingly, the effective number of points you get this way is very close to N — if you do the variance analysis, out of 100 examples it is as if you were using maybe 95 independent ones — so it is remarkably efficient. That is the algorithm; now let's illustrate it, and if you understand this illustration, you understand cross validation. I will illustrate leave-one-out on a case where I am trying to estimate a function: I actually generated the data from a particular target — I am not telling you yet which one — and added some noise, and I want to use cross validation to choose a model, or just to estimate the out-of-sample error. Let's estimate the out-of-sample error, using the leave-one-out method, for a linear model, on a data set of three points. First order of business: take out the point you will leave out; now those two points are the training set, and this one point is the validation set. Train — connect the two points, which is the best line through them — and evaluate the validation error on the point you left out; that is one session. We repeat this three times, because there are three points: the second time, this point is left out and those are the training points, I connect them and compute the error; the third time, you see the pattern. After I am done, I compute the cross-validation error as simply the average of the three errors — say we use squared error, so e_1 is the square of this distance; you add the three up and divide by three, and that is the cross-validation error. What I am saying is that you should take this as an indication of how well the linear model fits this data out of sample. In sample, obviously, a line fits two points perfectly, and if you fit all three points the line would do decently on them, but you would have no way to tell how it performs out of sample; here we created a miniature out-of-sample test in each session and took the average performance as that indication. Mind you, we trained on only two points each time, while in the end we will use the model trained on all three — that is g⁻ versus g again. It looks a bit dramatic here, because with two versus three the difference is only one point but the ratio is huge; think of 99 versus 100, though — who cares, it's close enough. This is just for illustration.
So let's use this for model selection. We did the linear model; now let's try the usual suspect, the constant model, on exactly the same data set. Look at the first session: these are the two points left in for training, and this is the one left out for validation. You train on the two — for a constant model you take their average value, the horizontal line in the middle — and this is your validation error on the left-out point. Second session, you get the idea; third session likewise. Now, if the question is whether the linear model is better than the constant model in this case, the only thing you look at in all of this is the cross-validation error: the average of the first three errors is the grade — a negative grade, since it is an error — for the linear model, and the average of the other three is the grade for the constant model. As you can see, the constant model wins, and it is a matter of record that these three points were actually generated by a constant target plus noise. Obviously, with three points they could have been generated by anything, but on average this procedure gives you the correct decision, and it avoids a lot of funny heuristics you might otherwise apply. You could say: wait a minute — for the linear model, any two points I pick give a positive slope, so there is a very strong indication of a positive slope, and maybe it's a linear model with positive slope. Don't go there; you can fool yourself into any pattern you want. Go about it in a systematic way: the cross-validation error is a quantity we know how to compute, and we take it as the indication — notwithstanding that it carries an error bar, both because the sample is small (three points here) and because we make the decision based on two-point training while using the model on three points. Those limitations are inherent, but at least the procedure is systematic, and indeed it gives the correct choice in this case.
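The same three-point comparison, spelled out in code (a sketch with made-up numbers standing in for the lecture's three points; it reuses the leave-one-out routine from above, with degree 0 for the constant model and degree 1 for the line):

```python
import numpy as np

# Three noisy points from a constant-plus-noise target (made-up values).
x = np.array([-0.8, 0.1, 0.9])
y = np.array([0.4, 0.6, 0.5])

def loocv_error(x, y, degree):
    N = len(x)
    e = []
    for n in range(N):
        mask = np.arange(N) != n
        g = np.polyfit(x[mask], y[mask], degree)   # line (1) or constant (0)
        e.append((np.polyval(g, x[n]) - y[n]) ** 2)
    return np.mean(e)

print("linear   E_cv:", loocv_error(x, y, 1))
print("constant E_cv:", loocv_error(x, y, 0))
# With data like these, the constant model gets the better (lower) grade.
```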
When you look at the training error, not surprisingly, it always goes down; what else is new, with more features you fit better. The out-of-sample error, which I'm evaluating on points that were not involved in this process at all, cross-validation or otherwise, totally out of sample, gives me this curve. And the cross-validation error, which I get from the 500 examples by excluding one point at a time and taking the average, is remarkably similar to E_out; it tracks it very nicely. If I use it as a criterion for model choice, the minima are here, somewhere between five and seven features. Let's say I take six; I would say, let me cut off at six and see what the performance is like.

So let's look at the result of that, without validation versus with validation. Without validation, I'm using the full model, all 20 features, and you can see what we have seen before: overfitting. I'm sweating bullets to include this single point in the middle, and after I include it, guess what, none of the out-of-sample points was red in that region; that point was just an anomaly, so I got nothing for the effort. This is the typical unregularized behavior. Now, when you use validation and stop at six features, because the cross-validation error told you so, you get a nice smooth surface. It is not a perfect fit, but it didn't put effort where effort didn't belong. And look at the bottom line. What is the in-sample error in the first case? Zero percent; you get it perfect, we know that. And the out-of-sample error is two and a half percent, which for digits is okay but not great. Here, in the second case, the in-sample error is 0.8 percent, but we know better; we don't care about the in-sample error going to zero, and in some cases that is actually harmful. The out-of-sample error is 1.5 percent. Now consider the range you are in: two and a half percent error means you are performing at 97.5 percent, while here you are performing at 98.5 percent. A 40 percent reduction of the error in that range is a lot, because there is a limit you cannot exceed. So you are really doing great by doing that simple thing. And now you can see why validation is considered, in this context, as similar to regularization. It does the same thing, it prevented overfitting, but it prevented overfitting by estimating the out-of-sample error rather than estimating something else.

Now let me go very quickly, and I will close the lecture with this, to the more general form. We talked about leave-one-out. Seldom do you use leave-one-out in real problems, and you can see why: if I give you 100,000 data points and you want to leave one out, you are going to have 100,000 training sessions, training on 99,999 points each, and you would be an old person before the results are out. With leave-one-out you have N training sessions using N-1 points each. So now let's consider taking more points for validation. One point makes the training great, because N-1 is so close to N that my g-minus will be very close to g. But hey, with 100,000 points, if you decide to train on 100,000 minus 1,000, that is still 99,000, fairly close to 100,000; you don't have to make the difference exactly one. So what you do is take your data set and break it into a number of folds, say 10. This is 10-fold cross-validation: each time you take one of the folds, one tenth of the data in this case, and use it for validation, and the other nine tenths you use for training, and you change from one run to another which fold you take for validation.
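Here is a sketch of those fold mechanics, assuming generic `train` and `err` functions as hypothetical stand-ins for whatever learning algorithm and error measure are in play:

```python
import numpy as np

def k_fold_cv(X, y, train, err, k=10, seed=0):
    """K-fold cross-validation estimate: average validation error over
    k runs, each holding out one fold (about N/k points) and training
    on the remaining k-1 folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # random assignment to folds
    folds = np.array_split(idx, k)
    e_val = []
    for i in range(k):
        val = folds[i]                     # this fold validates...
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        h = train(X[tr], y[tr])            # ...the rest train (one session per fold)
        e_val.append(err(h, X[val], y[val]))
    return float(np.mean(e_val))
```

Setting k equal to the number of examples recovers leave-one-out as the special case where each fold is a single point.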
So leave-one-out is exactly the same, except that the 10 is replaced by capital N: there I break the data into one example at a time and validate on a single example, whereas here I'm taking a chunk. Therefore you have fewer training sessions, 10 in this case, with not that much difference in the number of training examples if N is big; instead of leaving out one point, you leave out a few more. The reason I introduce this is that it is what I actually recommend to you, very specifically: 10-fold cross-validation works very nicely in practice. So the rule is: take the total number of examples, divide by 10, and that is the size of your validation set; repeat 10 times, and you get your estimate and are ready to go. That's it. I will stop here, and we will take questions after a short break.

Okay, let's start the Q&A, and we have a question: you talked about validation, and you said we should restrict the number of parameters we estimate with it. Do we have a rule of thumb for that number; is K over 20 reasonable as a maximum?

It obviously depends on the number of data points, which is why I didn't give a rule of thumb; it scales with the number of points. But let's say I have 100 points for validation, a small set. I would say a couple of parameters would be fine; at least that is my own experience. You can afford more when you have more points, and when you have more you can even afford more than one validation set, in which case you use each of them for a different estimate. But the simplest answer is that a couple of parameters for 100 points would be okay.

Can you clarify why model choice by validation doesn't count as data snooping?

For the same reason that is usually given in answer to a question like that: because it is accounted for. I took the validation set, whose points are patently out of sample, and I used them to make a choice. And when I made that choice, I made sure that the discrepancy between in-sample and out-of-sample performance on the validation set is very small; we had this discussion of how much bias there is, and we want to make sure the discrepancy is very small. Because I have already done the accounting, I can take the result as a reasonable estimate of the out-of-sample error. The problem in the data snooping case I gave is that the data was used to make choices, huge choices in that case, looking at the data and choosing between different models, without paying for it, without accounting for it. That is where the problem was.

Some people recommend using 10-fold cross-validation 10 times. What does that add?

In the regime I described, I only need to tell you 10-fold, 12-fold, 50-fold, and the rest is fixed. If I use 10-fold, then by definition I'm going to do this 10 times; it is not a choice, given the regime in which each run takes one of the 10 folds for validation and the rest for training, and then you take the average. So doing it 10 times is inherently built into the method, if that is the question.

I think the question goes further: you chose your 10 subsets and ran cross-validation; what if you do it again, choosing 10 different subsets, and repeat that process?
I mean, there are variations. For example, even with leave-one-out, maybe I take a point at random rather than insisting on going through all the examples, do it, say, 50 times, and take the average. Or I take random subsets, as in the 10-fold case, and stop at some point. So there are variations of these methods; the ones I described are the most standard, but there are obviously variations, and one can do an analysis for them as well.

Is there any rule for separating the data among training, validation, and test?

Random is the only trustworthy thing, because if you use your judgment you may somehow introduce a sampling bias, which we'll talk about in a later lecture. The way to avoid that for sure is to flip coins to choose your examples; then you know you are safe.

What is the computational complexity of adding cross-validation?

I didn't give a formula for it, but basically, with leave-one-out you are doing capital N times as much training as you did before. The evaluation is trivial; most of the time goes into the training. So ask yourself how many training sessions you have to do with cross-validation versus without. Before, you had one session; now you have as many sessions as there are folds. So 10-fold will be 10 times the training, and leave-one-out will be capital N times, because it really is N-fold, if you want.

A clarification: can you use both regularization and cross-validation?

Absolutely. In fact, one of the biggest utilities of validation is to choose the regularization parameter, so in those cases you inherently use both. You can use validation to choose the regularization parameter and then also use it on the side for something else, so both are active in the same problem. In most of the practical cases you will encounter, you will actually be using both: very seldom can you get away without regularization, and very seldom can you get away without validation.

Someone is asking: this seems like a brute-force method for model selection. Is there a way to branch and bound how many hypotheses to consider?

There are lots of methods for model selection. This is the only one, at least among the major ones, that does not require assumptions. I can do model selection based on, say, knowing that my target function is symmetric and therefore choosing a symmetric model; that can be considered model selection, and there are a bunch of other logical methods for choosing the model. The great thing about validation is that there are no assumptions whatsoever. You have capital M models. What are the models, what assumptions do they carry, how close are they to the target function? Who cares. You have M models, you take a validation set, you compute an objective criterion, the validation or cross-validation error, and you use it to choose. It is extremely simple to implement and very immune to assumptions. Obviously, if you make assumptions and you know the assumptions are valid, you will do better than I am doing; but then you know the assumptions are valid. I am taking the case where I don't want to make assumptions that I don't know to hold, and I still want to do model selection.
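As an aside, to make the earlier answer about choosing the regularization parameter concrete, here is a minimal sketch assuming linear regression with weight decay (ridge regression), squared error, and an arbitrary illustrative grid of candidate lambdas; none of these specifics come from the lecture itself.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Weight-decay solution: w = (X'X + lambda*I)^{-1} X'y
    # (assumes X'X is invertible when lambda is 0)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def loo_error(X, y, lam):
    """Leave-one-out cross-validation error for a given lambda."""
    errs = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        w = ridge_fit(X[mask], y[mask], lam)
        errs.append((X[i] @ w - y[i]) ** 2)
    return np.mean(errs)

def choose_lambda(X, y, candidates=(0.0, 0.01, 0.1, 1.0, 10.0)):
    # lambda = 0 stays on the menu, so if regularization doesn't help,
    # validation simply declines it and no harm is done
    return min(candidates, key=lambda lam: loo_error(X, y, lam))
```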
In the case where the data depends on time evolution, how can validation update the model? Is it used for that or not?

Validation makes a principled choice, regardless of the nature of that choice. So say you have a time series, and one of the things with time series, for example in financial forecasting, is that you can train and get a system, and then the world is not stationary, so a system that used to work doesn't work anymore. You can still make choices: say I have a bunch of models and I want to know which of them works at a particular time, given some conditions. I can make that model selection based on validation, and then take the chosen model and apply it to the real data. There are a bunch of things you can do, but in terms of tracking the evolution of a system, if you can translate the problem into making a choice, then you are ready to go with validation. So the answer is yes, and the method is to spell the problem out as a choice.

Another clarification: with cross-validation there is still some bias, so can you quantify why it is better than just regular validation?

Both validation and cross-validation will have bias, for the same reasons. The only question is the reliability of the estimate. Say I use leave-one-out. Bias aside, leave-one-out eventually uses all capital N of the examples when I average them, so the error bar is small. Granted, it is not as small as it would be if the N errors were independent of each other, but it is fairly close to what independence would give. And any time you have a tight estimate, it becomes less vulnerable to bias: if the estimate sits in a narrow band and I am pulling it down, I am not going to pull it down too far, because I am still within the band; whereas if I have the other estimate, which swings widely, it is very easy to pull it down, and I get a worse effect from the bias. So whenever you shrink the error bar, you shrink the vulnerability to bias as well. That is the only thing cross-validation does: it allows you to use a lot of examples to validate while using a lot of examples to train. That is the key.
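The error-bar argument can be illustrated with a small simulation; this is my own toy illustration rather than anything from the lecture, and it deliberately ignores the mild correlations between leave-one-out errors mentioned above.

```python
import numpy as np

# Compare the spread (error bar) of an out-of-sample estimate based on
# one validation point against one that averages N such point errors.
rng = np.random.default_rng(0)
trials, N = 10_000, 100
# pretend each point's error is a noisy draw whose mean is the true E_out
point_errors = rng.exponential(scale=1.0, size=(trials, N))
single = point_errors[:, 0]            # one validation point per trial
averaged = point_errors.mean(axis=1)   # leave-one-out-style average
print("std, single point:", single.std())    # large error bar
print("std, averaged    :", averaged.std())  # roughly 1/sqrt(N) of the above
```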
Going back to the previous lecture: can you see the augmented error as conceptually the same as a low-pass-filtered version of the initial error, or can it be translated into that?

It can, under the condition that the regularizer is a smoothness regularizer, because that is what low-pass filters do; as an intuition it is not a bad thing to consider. In the case of something like weight decay it is not going to be strictly low-pass, as in working in the Fourier domain and cutting off, but it will have the same effect of enforcing smoothness.

If you have a question in the house, please step to the microphone. Yes, please.

It seems that cross-validation is a method to deal with the limited size of the data set. Is it possible in practice to have a data set so large that cross-validation is not needed or not beneficial, or could we do it all the time?

In principle it is possible, and one of the cases is the Netflix case, where they had 100 million points. You would think that at that point nobody would care about cross-validation. But it turned out that even in this case, the 100 million points had only a very small subset that came from the same distribution as the out-of-sample points. It is the same issue as the time evolution: you have people making ratings, different people making different numbers of ratings, and so on, and this changes for a number of reasons. Even the same user, after rating for a while, tends to drift from the initial ratings; maybe you are initially excited, or something. There are lots of considerations like that, so eventually the number of points that patently came from the same distribution as the out-of-sample data was much smaller than 100 million, and those are the ones that were used to make the big decisions, the validation-type decisions. In that case, even if you start with 100 million points, it might be a good idea to use cross-validation at the end, and if you use something like 10-fold cross-validation, it is not that big a deal: you are just multiplying the effort by 10, which, given what is involved, is affordable, and you really get a dividend in performance. If you insist on performance, then it becomes indicated. So the answer is yes, both because it doesn't cost that much and because sometimes, in a big data set, the relevant part, or the most relevant part, is smaller than the whole set.

Say there is a scenario where you find your model through cross-validation, and then you test the out-of-sample error, but somehow a different model gives you a smaller out-of-sample error. Should you still keep the one you found through cross-validation?

So I went through this learning exercise and came up with a model; someone else went through whatever exercise they went through and came up with a final hypothesis; and I declared mine the winner because of cross-validation. Now there is further statistical evidence, an out-of-sample error, telling me that mine is not as good as the other one. Then it really is a question of: I have two samples, I am doing an evaluation, and one tells me one thing while the other tells me the opposite. So I need to consider, first, their sizes, which give me the real size of the error bars; then correlations, if any; and bias, which cross-validation may have, whereas the other estimate, if it is genuinely out of sample, does not. If I go through the math, and maybe the math won't settle it, it is not always the case that it does, I will get an indication of which one to favor. But basically it is purely a statistical question at this point.

When there are few points and cross-validation is going to be done, is it a good idea to re-sample to enlarge the current sample?

So the premise is that I have a small data set and I am doing cross-validation, and the question is whether, instead of breaking the data into fixed chunks, I should keep drawing validation subsets at random. I don't have, from my experience, anything indicating that one wins over the other, and I suspect that if you are close to 10-fold, you are probably close to the best performance you can get with variations of these methods. The problem is that none of these things is completely pinned down mathematically; there is a heuristic part, because even with cross-validation we don't know what the correlations are, et cetera. So we cannot definitively answer which one is better; it is a question of trying them on a number of problems, after absorbing the theoretical guidelines, and then choosing something. What is being reported here is that 10-fold cross-validation stood the test of time. That is the statement.
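For completeness, here is a sketch of the random-subset variant discussed in that answer (sometimes called repeated random subsampling, or Monte Carlo cross-validation); `train` and `err` are the same hypothetical stand-ins as in the earlier k-fold sketch.

```python
import numpy as np

def shuffle_split_cv(X, y, train, err, runs=10, val_frac=0.1, seed=0):
    """Instead of fixed folds, draw a fresh random validation chunk each
    run and average the validation errors. Unlike K-fold, an example may
    appear in several validation sets, or in none."""
    rng = np.random.default_rng(seed)
    n_val = max(1, int(val_frac * len(X)))
    e_val = []
    for _ in range(runs):
        idx = rng.permutation(len(X))
        val, tr = idx[:n_val], idx[n_val:]   # random validation/training split
        h = train(X[tr], y[tr])
        e_val.append(err(h, X[val], y[val]))
    return float(np.mean(e_val))
```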
When there is a big class-size imbalance, does cross-validation become a problem?

When there is an imbalance between the classes, that is, very different numbers of plus-ones and minus-ones, certain things need to be taken into consideration for learning to go through well, basically to keep the learning algorithm from going for the all-plus-one solution, because that is a very attractive one. So there are a bunch of things to take into account, and I can see a possible role for cross-validation, but it is not a strong component as far as I can see. Balancing the classes, and making sure you avoid the all-constant solution, will probably play a bigger role.

How does the bias behave when we increase the number of points we leave out, the size of the left-out set in leave-T-out?

The points we leave out are the validation points, and if we are using 10-fold or 12-fold, et cetera, the total number of points that go into the summation is constant: in spite of taking different chunk sizes, we go through all of the examples and add them up, so that number doesn't change.

So how does it change if, instead of 10-fold, you use 20-fold?

It doesn't change the total number of points going into the cross-validation estimate. As for the original question about the bias: the total number gives you the error bar, and since the bias is really a function of how you use the estimate rather than something inherent in the estimate, the error bar gives you an indication of how vulnerable you are to bias. If you take two scenarios where the error bars are comparable, you have no reason to think that one of them is more vulnerable to bias than the other. You would need a very detailed analysis to see the difference between taking one point at a time, trained on N minus 1 points, versus one tenth at a time, considering the correlations in each case and working out the effective number of examples and therefore the error bar in each situation. The short answer is that as long as you choose the number of folds so that every example appears in the cross-validation estimate exactly once, there is no preference between them as far as the bias is concerned.

Okay, I think that's it for today. Very good; we will see you on Thursday.