I shall talk about the K nearest neighbor rule now. This is a procedure for supervised classification. Here (X_i, θ_i), i = 1 to n, are given to us: the X_i are the points, and they belong to the m-dimensional Euclidean space; θ_i denotes the label of X_i for each i, and by label I mean the class from which the observation X_i has come. Let us assume that the number of classes is C, where C is naturally an integer greater than or equal to 2. That is, each θ_i can take values from the set {1, 2, ..., C}: θ_i = 1 means the observation X_i has come from class 1, θ_i = 2 means it has come from class 2, and so on.

The problem is this: let X be a point for which the label is not known, that is, the class to which X belongs is not known. How do we get the label of X from X_1, X_2, ..., X_n? That is the basic problem: we need to find the label of X.

The procedure is the following.

1. Let K be a positive integer. Note that we are talking about the K nearest neighbor decision rule, so we are going to take a value of K; how to choose that value we will come to later.
2. Calculate the distances from X to each point: d(X, X_1), d(X, X_2), ..., d(X, X_n), where d denotes the Euclidean distance. So we have in total n distances.
3. Arrange these n distances in increasing order, or to be precise, non-decreasing order.
4. Take the first K distances: if K = 1, take the least distance; if K = 2, take the first two distances; in general, for any K, take the first K distances.
5. Find the K points corresponding to these K distances.
6. Let k_i denote the number of points belonging to the i-th class among these K points, for i = 1, 2, ..., C. That is, the number of the K points belonging to class 1 we call k_1, the number belonging to class 2 we call k_2, and so on up to k_C.

It may so happen that some classes have no point among the K. For example, say C = 10, so there are 10 classes, and we have taken K = 1. Then we have the n distances, we arrange them in non-decreasing order, and we take the first distance; so we will have exactly one point. Say this point belongs to class 2. Then k_2 = 1, and for every other class the value will be 0: k_1 = 0, k_3 = 0, and so on up to k_10 = 0.
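To make these steps concrete, here is a minimal sketch in Python; the language choice is mine, since the lecture names none, and the array layout and the function name knn_counts are also mine:

```python
import numpy as np

def knn_counts(X_train, labels, x, K, C):
    """Steps 2-6: the per-class counts k_1, ..., k_C among the K nearest points."""
    # Step 2: the n Euclidean distances d(x, X_1), ..., d(x, X_n).
    dists = np.linalg.norm(X_train - x, axis=1)
    # Steps 3-5: the K points with the smallest distances (non-decreasing order).
    nearest = np.argsort(dists)[:K]
    # Step 6: k_i = number of the K nearest points carrying label i.
    counts = np.zeros(C, dtype=int)
    for label in labels[nearest]:
        counts[label - 1] += 1      # labels take values 1, 2, ..., C
    return counts
```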
So it is not necessarily true that k_i is positive; it can be 0 also. Now the rule is: put X in class i if k_i > k_j for all j ≠ i. That means the class which has the maximum number of members among these K nearest points: put the point X into that class. That is the rule.

Let me give you a graphical example. Say three points belong to one class, four points belong to another class, two points belong to a third class, and there is a point X that should be classified. Call the class of the three points class 1, the class of the four points class 2, and the class of the two points class 3. So in total there are nine points, three plus four plus two; the value of n is nine. Now apply this rule with, say, K = 3. You have to find the nine distances and arrange them in non-decreasing order. Suppose the least distance and the second distance are to points of class 1, and the third distance is to a point of class 2. Then from class 1 we have two points, from class 2 one point, and from class 3 zero points: k_1 = 2, k_2 = 1, k_3 = 0. The maximum occurs for class 1, so put the point in class 1. This is basically the rule.

Here you are probably going to have very many doubts; let me state them on my own. The first doubt is: how to choose the value of K? The second doubt is: what will happen if there is equality? Equality in the sense that, supposing I had taken K = 4 and the fourth nearest point is from class 2; then from class 1 there are two representatives, from class 2 also there are two representatives, and from class 3 there is no representative. Into which class should I put the point, class 1 or class 2? That is another doubt.

Let me tell you a few more doubts. Is it necessarily true that for different values of K you will get the same result? The answer is no: for different values of K you may get different results. Then the next question is: if different values of K give different results, how do I say that some particular value of K is better than another? And there is a much more fundamental question: what is the theoretical justification of this rule? Just because some people have given this rule, do I need to follow it?
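Before coming to those questions, the rule itself is one comparison on top of the counts from the sketch above. Below is a hedged illustration reusing knn_counts; the coordinates are made up by me and only mimic the class sizes 3, 4, 2 of the picture above:

```python
def knn_classify(X_train, labels, x, K, C):
    counts = knn_counts(X_train, labels, x, K, C)   # k_1, ..., k_C from above
    # Put x in class i if k_i > k_j for all j != i.  np.argmax breaks ties
    # by the smallest index, so the equality doubt raised above is NOT
    # settled here; one fix discussed later is to increase K by 1.
    return int(np.argmax(counts)) + 1

# Hypothetical coordinates: 3 points of class 1, 4 of class 2, 2 of class 3 (n = 9).
X_train = np.array([[0., 0.], [0., 1.], [1., 0.],            # class 1
                    [4., 4.], [4., 5.], [5., 4.], [5., 5.],  # class 2
                    [0., 5.], [1., 5.]])                     # class 3
labels = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3])
print(knn_classify(X_train, labels, np.array([0.5, 0.5]), K=3, C=3))  # class 1
```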
First let me tell you the history. Around the early 1950s, Fix and Hodges came out with this rule, the K nearest neighbor decision rule; if you go through the papers, you will find that theirs is one of the earliest references, the first paper on the K nearest neighbor decision rule. When they gave this rule there was no theory; they just applied it on a few data sets and found that the rule works well. But the problem still remained: how to choose the value of K? And it can very easily be found that for different values of K you will get different results.

Now, in 1965, Loftsgaarden (with Quesenberry) published a paper in the Annals of Mathematical Statistics on the K nearest neighbor density estimation procedure. I have to tell you this, so let me actually tell you what the procedure is; density here means probability density. We have X_1, X_2, ..., X_n, independent and identically distributed random vectors, following a probability density function p, where p is unknown. The question is: on the basis of these n points, how do you estimate the probability density function?

The procedure to estimate p is the following. Again, the first step is as before: let K be a positive integer. Say we want to estimate the density at a point X. I am assuming that all the X_i belong to R^m, so I am not going to write that once again; this X also belongs to R^m.

What we do is find the K nearest neighbors of X among X_1, ..., X_n. I hope you know the meaning of K nearest neighbors here: you calculate the distances d(X, X_1), d(X, X_2), ..., d(X, X_n), and the distance function is always Euclidean. In the K nearest neighbor decision rule, too, I wrote that the distance is Euclidean. So there must be a question: if it is non-Euclidean, what are you going to do? Many people apply the K nearest neighbor decision rule where they take the distance function to be non-Euclidean, some other metric; how does one justify that? We are going to look at these things in at least some amount of detail, though maybe not much. But when I say K nearest neighbors of X here, the distance function is Euclidean only.

So you get hold of the n distances, arrange them in non-decreasing order, and find the first K; you get the K nearest neighbors. Now let R denote the distance between X and its K-th nearest neighbor.
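In code, the K nearest neighbors and the distance R to the K-th of them come straight from the sorted distance list; a minimal sketch (the function name is mine):

```python
import numpy as np

def k_nearest(X_train, x, K):
    """Indices of the K nearest neighbors of x, and the K-th NN distance R."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to all n points
    order = np.argsort(dists)                     # non-decreasing order
    return order[:K], dists[order[K - 1]]         # the K neighbors, and R
```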
Then we get hold of an m-dimensional volume. Let A_n(R, m) denote the volume of a disc (ball) of radius R in m dimensions; m is the dimension, R is the radius, and I write the subscript n because this R is a function of n: it depends on which X_1, ..., X_n we have got (you need not write the subscript if you do not want to). And then the estimated density at X, denoted by p̂(X), is given as

p̂(X) = K / (n · A_n(R, m)).

What Loftsgaarden had shown was that under some conditions on K, this is asymptotically unbiased and consistent.

Let me tell you the intuition behind this. When did we first come across the word density? I think it was in class 8 or 9, where we defined density as mass by volume. Note that here there is a volume, A_n(R, m), the volume of a ball of radius R; and within this volume, how many points are there? There are K points out of n. So K out of n basically looks like mass: within this volume, K of the n points are there. At least it is similar to the density that we had used in class 8 or 9. And what Loftsgaarden had shown is that this quantity indeed goes to the probability density at the point X as n goes to infinity, under some conditions on K.

What are the conditions on K? First, K is a function of n, the number of points, so we shall denote it by K_n. Then the conditions are: (a) K_n → ∞ as n → ∞, and (b) K_n / n → 0 as n → ∞. If these two conditions are satisfied, and X is a continuity point of the density function p, then p̂(X) is an asymptotically unbiased and consistent estimate of p(X).

What is the meaning of asymptotically unbiased? p̂(X) is, after all, a random variable based on the n observations. Asymptotically unbiased means that its expected value, that is its average, goes to the actual value p(X) as n goes to infinity. Consistent means that the difference between p̂(X) and p(X) gets reduced (in probability) as n goes to infinity; that is basically consistency. So this is the density estimation procedure.
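Putting the pieces together, here is a sketch of the estimator. The closed form π^(m/2) · R^m / Γ(m/2 + 1) for the volume of an m-dimensional ball of radius R is standard, and the default schedule K_n = ⌈√n⌉ is only one hypothetical choice satisfying the two conditions above:

```python
import numpy as np
from math import gamma, pi, ceil, sqrt

def knn_density(X_train, x, K=None):
    """K-NN density estimate p_hat(x) = K / (n * A) at the point x."""
    n, m = X_train.shape
    if K is None:
        K = ceil(sqrt(n))  # one hypothetical schedule: K_n -> inf, K_n/n -> 0
    # R: distance from x to its K-th nearest neighbor among the n points.
    R = np.sort(np.linalg.norm(X_train - x, axis=1))[K - 1]
    # A: volume of the m-dimensional ball of radius R.
    A = (pi ** (m / 2)) * (R ** m) / gamma(m / 2 + 1)
    return K / (n * A)
```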
This was given in 1965. Now, using this density estimation procedure, you can apply the Bayes decision rule. For the Bayes decision rule you need the prior probabilities and the class conditional probability density functions: estimate the prior probabilities by the class proportions, estimate the density functions by this procedure, and use the Bayes decision rule on the estimated priors and estimated densities. Then you will get the K nearest neighbor decision rule. I will do it now.

There are C classes, with prior probabilities P_1, P_2, ..., P_C (capital P) and class conditional density functions p_1, p_2, ..., p_C (small p). The mixture density function, if I write it as p(x), is

p(x) = Σ_{i=1}^{C} P_i · p_i(x).

Now, what are our X_1, X_2, ..., X_n? They are IID, independent and identically distributed, and the density function they follow is this mixture density p. Let me just tell you how such data is generated; it is just an example to make you understand how these things are done. Generate a random number from 0 to 1. If the value is less than or equal to P_1, generate a point from the density function p_1; if it lies between P_1 and P_1 + P_2, generate a point from p_2; if it lies between P_1 + P_2 and P_1 + P_2 + P_3, generate a point from p_3; and so on up to p_C. So you first generate a random number from 0 to 1, on the basis of that you decide from which distribution to generate a point, and then you generate a point from that distribution randomly (a code sketch of this scheme follows at the end of this discussion). When you generate a point this way, you get not only the point but also its label; that is your first observation. Similarly for X_2, the second observation, and so on: for each observation you generate a random number from 0 to 1 and then a point from the corresponding density function. So with the values X_i a label is automatically given to you as well.

So X_1, X_2, ..., X_n are IID following p. Let n_i points out of these n belong to class i, i = 1, 2, ..., C. That means our estimate of the prior probability P_i is n_i / n.

Now let X be the point to be classified. What do I do? I find the K nearest neighbors of X among X_1, ..., X_n, and let k_i of these nearest neighbors belong to the i-th class, i = 1, 2, ..., C. This naturally means that Σ_{i=1}^{C} k_i = K.
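Here is the generation scheme sketched in code, as promised; the two Gaussian class densities at the end are purely hypothetical stand-ins for the p_i:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, priors, samplers):
    """Draw n labeled points from the mixture p(x) = sum_i P_i * p_i(x).

    priors:   [P_1, ..., P_C], summing to 1.
    samplers: [s_1, ..., s_C], where s_i() draws one point from p_i
              (hypothetical callables standing in for the class densities).
    """
    cum = np.cumsum(priors)               # P_1, P_1 + P_2, ..., 1
    X, labels = [], []
    for _ in range(n):
        u = rng.uniform()                 # a random number from 0 to 1
        i = int(np.searchsorted(cum, u))  # smallest i with u <= P_1 + ... + P_{i+1}
        X.append(samplers[i]())           # a point from p_{i+1}, drawn randomly
        labels.append(i + 1)              # and its label comes for free
    return np.array(X), np.array(labels)

# Example with two hypothetical Gaussian class densities:
samplers = [lambda: rng.normal([0.0, 0.0], 1.0),
            lambda: rng.normal([3.0, 3.0], 1.0)]
X, labels = sample_mixture(100, [0.4, 0.6], samplers)
```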
We know what our p̂(X) is. For finding the estimates, let R denote the distance between X and its K-th nearest neighbor, and again let A denote the volume of a ball of radius R in m-dimensional Euclidean space. So what do we know? We know that

p̂(X) = K / (n · A).

That is true; but what is p̂_i(X)? Within this volume A, out of the n_i points of class i there are k_i points, so

p̂_i(X) = k_i / (n_i · A), i = 1, 2, ..., C.

Now apply the Bayes decision rule with these estimates: put X in class i if

P̂_i · p̂_i(X) ≥ P̂_j · p̂_j(X) for all j ≠ i.

Let me write ≥ rather than >: if you remember the Bayes decision rule, the equality part does not have any relevance; it is not going to give you any extra error, and equality is actually the decision boundary, so let me just keep the equality. What does this mean? P̂_i is n_i / n and p̂_i(X) is k_i / (n_i · A), so the condition is

(n_i / n) · (k_i / (n_i · A)) ≥ (n_j / n) · (k_j / (n_j · A)),

and then you cancel out everything: k_i ≥ k_j. That is exactly the K nearest neighbor rule.

Now, about your doubt regarding equality: what will happen if, for a particular K, I get the same number of points from two classes and that is the maximum; to which class do I put the point? The answer from this derivation is, in a sense, obvious. First, the justification we are using is asymptotic: these things happen as n goes to infinity, and unfortunately we always live in finite samples; our n never goes to infinity. So, assuming that you are starting from smaller values of K, if there is a confusion you increase the value of K by 1.

Then the next question is: fine, wherever there is an equality you modify the value of K, but how do you choose the value of K in the first place? This question still remains. One of the popular ways in which people choose the value of K now is by cross validation. But even that does not answer the question completely. If you choose K by cross validation, what exactly are we saying? We are saying that for this set of data, this is the best that we are able to do. It still does not answer the next question: if one more point comes in, following the same distribution, will your K help? The answer to that is not exactly clear. What cross validation and principles of that sort basically state is: for this set of data, I choose some scheme (maybe 10-fold, 15-fold, or some such) and then I do the best that I can. So this is a partial answer, not a complete answer.
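As an illustration of choosing K by cross validation, here is a sketch assuming scikit-learn is available; the lecture names no library, and the candidate grid and fold count below are arbitrary choices of mine:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def choose_K_by_cv(X_train, labels, candidates=range(1, 26), folds=10):
    """Pick the K with the best cross-validated accuracy on this data set."""
    scores = {K: cross_val_score(KNeighborsClassifier(n_neighbors=K),
                                 X_train, labels, cv=folds).mean()
              for K in candidates}
    return max(scores, key=scores.get)
```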
If you look at the history of the K-NN rule, you will find many papers which have made a difference over the years. First a remark: when K = 1, the rule is known as the 1-NN rule (NN means nearest neighbor), or simply the nearest neighbor rule. In 1967, Cover and Hart (Hart is the Hart of Duda and Hart) wrote a paper where, for the K = 1 case, they found the probability of error: they showed it to be a function of the error of the Bayes decision rule (in fact, asymptotically at most twice the Bayes error). Dasarathy and others did quite a lot of work on the K nearest neighbor rule, and many modifications of it have been made; some of them you will find in the book Dasarathy wrote on the K nearest neighbor rule, though I do not know whether it is available in India now. Fukunaga gave a theoretical procedure for choosing the value of K, at least in some situations; that was one of the reasons Fukunaga became famous. Then there are many people who try to choose the value of K adaptively; adaptively means that you do not take the same value of K for every point: you have many points in the test set, and for one point you may choose one value of K while for another point you may have another, and some work has been done on how to do that. So, how to choose the value of K for any data set is still not completely answered to the satisfaction of everyone. There are partial solutions available, but only up to that point will I go; I do not want to say anything more about whether the partial answers can become a complete solution. Let me stop here.