So far we have discussed classifiers, and we have also discussed feature selection. In this lecture I shall discuss how to compare the performances of classifiers. There are basically a few methods available; I shall discuss two of them.

The first method is what is known as the leave-one-out method. Suppose we are given a set of points A = {x_1, x_2, ..., x_n}, a subset of d-dimensional space, and there are some number of classes. Let ω_i denote the label of x_i, that is, ω_i denotes the class of x_i, for all i = 1, 2, ..., n. If the number of classes is C, then each ω_i belongs to the set {1, 2, ..., C}. We know x_i and we know ω_i for each i = 1 to n; we have the complete information.

In the initial stages I talked about the training set and the test set: we divide the whole set into one part, which we call the training set, and another part, which we call the test set; using the training set we train the classifier, and using the test set we measure its performance. That is true, but there is a small point here. The performance of a classifier depends on the training set: for the same data set, if I take a different training set, the performance may well go down. Now, if I use the same training set and test set with two different classifiers, then I can say that, with respect to this training set and with respect to this test set, this classifier is better than that one. Those are the sort of statements I can make, and each such statement carries the qualifier "with respect to this training set and this test set."

Now suppose I do not want those two phrases, "with respect to this training set" and "with respect to this test set." I would like to say "with respect to this data set": given a data set, how do I say that, with respect to this data set, one classifier works better than another? Basically, one somehow needs to vary the training set. If you keep the training set fixed, you will only be in a position to make statements with respect to that training set; the moment you start varying the training set, you can no longer attach that qualifier. So how do you do it?

In the leave-one-out method, as the name suggests, we leave out one observation at a time. Given the set A above, write a loop for i = 1, 2, ..., n. Let B_i = {x_i}, the singleton, and let E_i = A − B_i, for i = 1 to n. So for i = 1, B_1 is the singleton {x_1} and E_1 = A − B_1, that is, {x_2, x_3, ..., x_n}. Take E_i as the training set and develop the classifier: if you are using a normal-distribution-based classifier, you estimate the means and covariance matrices from E_i; if you are using something like a k-nearest-neighbour decision rule, then E_i is your training set. Then check whether x_i is classified correctly by the classifier so developed. Before the loop, initialize sum = 0; if x_i is classified correctly there is no change, and if the classification is wrong, set sum = sum + 1. Then end the loop (end for). I
will come to that; I will come to the remarks part later. First, tell me whether you have understood the method. Your main question is how feasible it is to implement — I understand, the feasibility part. Feasibility means that if the number of observations n is of the order of, say, one lakh (100,000), then you have to repeat this procedure 100,000 times.

A student raises another point: suppose we have n = 100 samples, and we remove one observation at a time and then calculate the mean and variance; does it really make sense? The mean will be somewhere near the original mean. That is so if you are considering the normal-distribution-based classifier — and it may or may not be near: if we remove some kind of outlier, there will be some change in the mean, but if we remove an example from inside the cluster, the mean will be near the original mean. What you are trying to say, then, is that removing one observation from a big set of training examples does not really change the classifier, so most of the points will anyway be classified correctly and there is no significant change in the performance. That is your basic contention.

Try to look at this from another angle. Ultimately you would want every point in the data set, at some time or other, to be included in the test set. How are you going to do it? I think you have understood the problem here: if you want the performance of the classifier to be, in some way, not dependent on one particular training set, then there has to come a situation where every point, at some time or other, belongs to the test set. How are you going to do that? Leave-one-out is the simplest way of doing it. Suppose instead you remove two observations at a time; then how many sets do you have to make? nC2 — and that is huge for large values of n. Even leave-one-out is expensive for large n, and if you remove two at a time you are going to have nC2 such sets, which is a very large number. The question I am putting to you is: at some point of time, every point should belong to the test set; how, then, are you going to make this training-set/test-set distinction? That is the basic question. Leave-one-out is the simplest way of doing it, and that way there is no bias on any point. It also has some nice properties, into whose details I am not going at the present moment. But implementation-wise, if the value of n is very large, implementation becomes a real issue, as you can see.

So the next question is: how do you do the implementation in a nice way, without really changing the soul of this method? For that, people have come up with the second method: K-fold cross-validation. What is the meaning of K-fold cross-validation? Here the whole formulation is as before: A is the set of points, ω_i denotes the class label of x_i, and the number of classes is C.
Each ω_i belongs to {1, 2, ..., C}. Let me just write it all down, no problem: A = {x_1, ..., x_n}, a subset of d-dimensional space, is the given point set; ω_i denotes the class of x_i; the number of classes is C; and ω_i ∈ {1, 2, ..., C} for all i.

Here what we are going to do is take a partition of A. Let A_1, A_2, ..., A_K be such that (1) A_i ≠ ∅, (2) A_i ∩ A_j = ∅ if i ≠ j, and (3) A_1 ∪ A_2 ∪ ... ∪ A_K = A. So it is a partition of A into K subsets. But there is one more condition, the fourth condition: the sizes of the A_i are more or less the same. Let me expand on what that means. Suppose K = 10 and n = 100; 100 divided by 10 is 10, so every set has exactly 10 elements. On the other hand, suppose n = 101: then there will be 9 sets with 10 elements, and the tenth set will have 11 elements. If n = 102, there will be 8 sets with 10 elements and 2 sets with 11 elements, and so on. Have you understood what I wanted to say? There may be some difference in sizes, but the difference should not exceed 1. Basically we would like the sizes to be equal, and where that is not possible — in some cases it won't be — we make them more or less the same.

Once you have done this, you have a for-loop: for i = 1, 2, ..., K, let B_i = A − A_i. Take B_i as the training set and A_i as the test set, and find the misclassification rate. You will then get K such misclassification rates. You obtain the mean and standard deviation of these K rates, and when you give the result you should report both the mean and the standard deviation, and you should mention how many folds you have taken — five-fold, ten-fold, twenty-fold. If you read articles in the literature, you are going to find that the results of classifiers are given using exactly these things: they tell you the number of folds of cross-validation and they give you the mean value. You can report the misclassification rate or the correct-classification rate — that does not matter; give the rate and give the standard deviation of the K values.

I think I need to mention a few more points here. Suppose my classifier is normal-distribution-based: I estimate the means and covariance matrices, assume some prior probabilities, and then do the classification. If I take B_1 as my training set I may get some means and some covariance matrices; but when I take B_2 as the training set, I may not get the same means and I may not get the same covariance matrices. You see, the classification scheme is the same, but between the exact classifiers there is a difference, because the means have changed and the covariance matrices have changed. Similarly, if I use something like a nearest-neighbour rule with B_1 as training set and A_1 as test set, I will get some classification; but when I make the training set B_2, since the training sets have changed, the scheme is the same but between the exact classification rules there is a
difference, since the training sets have become different. The same thing happens with other classification schemes as well. So here, basically, we are comparing the performance of different classification schemes — not the exact μ or the exact Σ you have got, which may differ, but the classification schemes themselves.

Note that in the leave-one-out method, every point in the data set, at some time or other, becomes a point in the test set. The same thing happens in K-fold cross-validation: every point in the data set, at some time or other, becomes a point in the test set. But there is a difference between the two methods. Leave-one-out is an exhaustive method: at a time you leave out one point, and you do that for every point in the data set. If you look at K-fold cross-validation, there are several ifs and buts. Let me tell you the ifs and buts. I wrote three conditions here: A_i ≠ ∅; A_i ∩ A_j = ∅ for i ≠ j; and the union of the A_i equals A — a partition, right.
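The partition-and-loop procedure just described can be sketched in Python. The random shuffling, the toy function names, and the 1-nearest-neighbour rule below are illustrative stand-ins, not part of the method — any classification scheme can be plugged in; only the fold construction and the per-fold misclassification rates come from the method itself.

```python
import random

def k_fold_partition(n, k, seed=0):
    """Randomly split the index set {0, ..., n-1} into k folds whose
    sizes differ by at most one (conditions (1)-(4) of the partition)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Taking every k-th index gives near-equal fold sizes automatically.
    return [idx[i::k] for i in range(k)]

def nn_label(x, train_pts, train_labels):
    """1-nearest-neighbour rule: label of the closest training point."""
    return min(zip(train_pts, train_labels),
               key=lambda pl: sum((a - b) ** 2 for a, b in zip(x, pl[0])))[1]

def k_fold_rates(points, labels, k, seed=0):
    """Return the k per-fold misclassification rates: for each fold A_i,
    train on B_i = A - A_i and test on A_i."""
    rates = []
    for fold in k_fold_partition(len(points), k, seed):
        held = set(fold)
        tr_p = [p for j, p in enumerate(points) if j not in held]
        tr_l = [l for j, l in enumerate(labels) if j not in held]
        wrong = sum(nn_label(points[j], tr_p, tr_l) != labels[j] for j in fold)
        rates.append(wrong / len(fold))
    return rates
```

The reported figure would then be the mean of `rates` together with their standard deviation, plus the number of folds used.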
Let us say I have 100 points and K = 10. I know that A_1 should have 10 elements, A_2 should have 10 elements, and so on up to A_10 having 10 elements. But the actual question is: how are you going to get those 10 elements? Are you going to choose them selectively, with no randomness involved, or are you going to do it randomly? How are you going to get those A_1, A_2, ..., A_10? Suppose I get one A_1, A_2, ..., A_10 and someone else gets another A_1, A_2, ..., A_10; they need not be the same. Are you understanding what I am trying to say? In the leave-one-out method, for the same data set, whatever I do and whatever anyone else does, the results are going to be the same. But in K-fold cross-validation the result depends on those K sets, and they can be different for different persons, so there is a problem here. Naturally, the more folds you take, the better it is going to be. And suppose you make the value of K equal to n — then what is going to happen? You will get the leave-one-out method.
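Setting K = n recovers the leave-one-out procedure itself; it can also be written directly, as below. Again the 1-nearest-neighbour rule is only a stand-in for whatever classifier is actually being evaluated.

```python
def nn_label(x, train_pts, train_labels):
    """1-nearest-neighbour rule: label of the closest training point."""
    return min(zip(train_pts, train_labels),
               key=lambda pl: sum((a - b) ** 2 for a, b in zip(x, pl[0])))[1]

def leave_one_out_errors(points, labels):
    """For each i: train on E_i = A - {x_i}, test on x_i alone, and
    count the misclassifications (the 'sum' variable of the lecture)."""
    total = 0
    for i in range(len(points)):
        tr_p = points[:i] + points[i + 1:]
        tr_l = labels[:i] + labels[i + 1:]
        if nn_label(points[i], tr_p, tr_l) != labels[i]:
            total += 1
    return total
```

Note there is nothing random here: for the same data set, everyone obtains the same count — exactly the determinism of leave-one-out discussed above.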
So basically the main problem with the leave-one-out method is that for a very high value of n you really do not know how to implement it, and in order to take care of that, people use K-fold cross-validation: if the value of n is really high, you can use this. But people also understand that the result depends on how those A_1, A_2, ..., A_K are chosen. So what many people do is choose the partition randomly and repeat: they form a fresh A_1, A_2, ..., A_K maybe 20 or 40 times and take the average over these repetitions, so as to make the result independent of the particular choice of A_1, A_2, ..., A_K. Are you understanding what I am trying to say? Is it clear? No? Say I get one partition, you get another, he gets another; generate some 40 or 50 such partitions, so that your results become independent of the choice. People do this too, so that whatever results they show in an article are independent of the choice of the K folds. Some such thing became necessary because it is difficult to implement the leave-one-out method directly over the whole data set; doing it this way, you preserve the basic spirit of the leave-one-out method while also making the result independent of the choice of A_1, A_2, ..., A_K. This is one thing that people have been doing. If anyone can suggest something better, people will take it — I am not saying this is the best way of doing it; I have told you the limitations, and maybe you can come out with a better scheme than what exists. Any questions? We will stop here.
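The repetition over many random partitions can be sketched as follows; the default of 20 repetitions and the 1-nearest-neighbour stand-in classifier are illustrative choices, not part of the method.

```python
import random
from statistics import mean, stdev

def nn_label(x, train_pts, train_labels):
    """1-nearest-neighbour stand-in for the actual classification scheme."""
    return min(zip(train_pts, train_labels),
               key=lambda pl: sum((a - b) ** 2 for a, b in zip(x, pl[0])))[1]

def one_k_fold(points, labels, k, rng):
    """One pass of k-fold cross-validation over a random partition;
    returns the k per-fold misclassification rates."""
    idx = list(range(len(points)))
    rng.shuffle(idx)                      # a fresh A_1, ..., A_K each call
    rates = []
    for fold in (idx[i::k] for i in range(k)):
        held = set(fold)
        tr_p = [p for j, p in enumerate(points) if j not in held]
        tr_l = [l for j, l in enumerate(labels) if j not in held]
        wrong = sum(nn_label(points[j], tr_p, tr_l) != labels[j] for j in fold)
        rates.append(wrong / len(fold))
    return rates

def repeated_k_fold(points, labels, k, repeats=20, seed=0):
    """Average k-fold results over many random partitions, so the reported
    figure does not depend on one particular choice of A_1, ..., A_K."""
    rng = random.Random(seed)
    all_rates = []
    for _ in range(repeats):
        all_rates.extend(one_k_fold(points, labels, k, rng))
    return mean(all_rates), stdev(all_rates)
```

The pair returned — mean misclassification rate and its standard deviation over all folds of all repetitions — is the kind of figure reported in articles, alongside the number of folds and repetitions.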