We were discussing the K-means algorithm and the single linkage algorithm in the previous few classes. Today we will discuss a slight generalization of the K-means algorithm, the K-medoids algorithm: instead of a mean you are going to have a medoid. The word medoid refers to a generalization of the median to higher dimensions. I hope all of you know the meaning of the word median, but let me recapitulate.

For single-dimensional data, the median is the middle-most point. What is the meaning of middle-most point? Suppose your data set has just 3 points, 1, 2, 3. Arrange them in increasing order (in fact they are already arranged in increasing order); the middle one is 2, so 2 is the median. Suppose your data set is 1, 2, 10. Here also, if you arrange the points in increasing order, the middle one is 2, so 2 is the median. So basically you arrange the points in increasing order and take the middle one, but you can do that only when the number of points is odd. If the number of points is even, what is the meaning of the middle one?

Suppose your data set is 1, 2, 3, 4. There are 4 points and they are already arranged in increasing order. You can call 2 the middle one, and you can also call 3 the middle one, so here the middle one is not unique. In this case there are a few conventions. One convention is that you take the average of these two points, 2 and 3, which is 2.5, and call it the median. Note that 2.5 does not belong to the data set. If you want to put a restriction that the median has to belong to the data set, that is fine; then it can
be either 2 or 3, and it is not unique. In a single dimension this is what people do. But what will happen when you have multiple dimensions, 2, 3, 4, 5 and so on? Can you arrange the points in increasing order? No. What is the meaning of arranging points in increasing order? It is not clear; in fact we do not have such an ordering. Since I assume that you are all computer science students, you probably know the meaning of partial ordering, linear ordering and so on. For higher dimensions you do not have that ordering, so what does one do?

The generalization of the median to higher dimensions is called the medoid, and you will see different definitions at different places. Basically the concept of the median is to be generalized to higher dimensions, so how does one generalize it? I will tell you two definitions; let me tell you the first one.

Say your data set is S = {x_1, ..., x_n}, a subset of m-dimensional Euclidean space, and d is the Euclidean distance. Define a_i as the maximum of d(x_i, x) over all x in S; that is, from the point x_i you calculate the distance to every point in the set S and find the maximum. That maximum I denote by a_i, for i = 1 to n. Next, let i_0 be a subscript for which the minimum of a_1, ..., a_n is attained, so that a_{i_0} is the minimum of the a_i. Then the medoid of S is x_{i_0}.

Let me explain with the data set 1, 2, 3. Take the point 1: the distance between 1 and 2 is 1, and between 1 and 3 it is 2, so the maximum is 2. When you take this point the value is 2. When you take the point 3, this distance is 2 and
this one is 1, so for 3 also the corresponding value is 2. What about 2? The value is 1. So you get 2, 1, 2; the minimum is 1, and that 1 is obtained for the point 2, so 2 is the median. Is this clear?

Let us look at the example 1, 2, 10. For the point 1 the distances are 1 and 9, so the maximum is 9. For 10 also it is 9, but for 2 it is 8. So the values are 9, 8, 9; the minimum is 8, and that is happening for 2.

Now let us look at 1, 2, 3, 4. For 1 the maximum is 3, for 2 the maximum is 2, for 3 also the maximum is 2, and for 4 the maximum is 3. So the values are 3, 2, 2, 3; the minimum is 2, and it is obtained for two points, 2 and 3. Is this clear?

So this is a way of defining the medoid. But why do I say "a way"? The definition as it is is fine, but look at the number of computations you need to do: you need the distance between every pair of points. Isn't that very high? Even though this definition is good, because so many computations are involved people do not like it, so they have defined something else.

That second definition is actually very simple. Find the mean x bar of the n points, and let x_i belonging to S be such that the distance between x_i and x bar is less than or equal to the distance between x and x bar for every x in S; then call x_i the medoid of S. I will explain. When we calculate the mean of a data set, most of the time the mean does not belong to the data set; it is something outside the data set. Then you find the point in the data set which is closest to the mean and call that the median or medoid.
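The two definitions above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the lecture: the function names `medoids_minimax`, `medoids_nearest_mean` and the helper `as_points` are made up for this example, points are assumed to be tuples of floats, and both functions return every point that attains the minimum, so that non-unique medoids show up as a list with more than one entry.

```python
import math

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def medoids_minimax(points):
    """Definition 1: a_i = max distance from x_i to any point of S;
    the medoids are the points whose a_i is smallest."""
    a = [max(dist(p, q) for q in points) for p in points]
    best = min(a)
    return [p for p, a_i in zip(points, a) if math.isclose(a_i, best)]

def medoids_nearest_mean(points):
    """Definition 2: the points of S that are closest to the mean of S."""
    n, m = len(points), len(points[0])
    mean = tuple(sum(p[k] for p in points) / n for k in range(m))
    d = [dist(p, mean) for p in points]
    best = min(d)
    return [p for p, d_i in zip(points, d) if math.isclose(d_i, best)]

def as_points(values):
    """Turn a list of numbers into a list of 1-dimensional points."""
    return [(float(v),) for v in values]

# The three one-dimensional data sets used in the lecture.
for data in ([1, 2, 3], [1, 2, 10], [1, 2, 3, 4]):
    pts = as_points(data)
    print(data, medoids_minimax(pts), medoids_nearest_mean(pts))
```

On these three data sets both functions agree: they give 2 for the first two sets and the pair 2, 3 for the last one. The first function computes all pairwise distances, while the second needs only one distance per point, which is the computational difference mentioned above.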
So this x_i is closest to the mean; its distance to the mean is the minimum among all such distances. See, this is a mathematical way of representing it. If you represent something in a mathematical way, there is absolutely no reason for getting confused. Whereas if you say something intuitively, it looks fine, but the same intuition can be put into mathematics in different ways. So one has to have intuition, but on the other hand one should also be able to write the corresponding mathematics; then everything will be crystal clear. That is one of the reasons why in all my lectures I try to write everything in mathematical form, so that there will not be any confusion whatsoever. So this is a way of writing it in mathematical form: the distance between x_i and x bar is less than or equal to all such distances, which means x_i is closest to x bar. Because this is a vector, I am writing it like this. Then call x_i the medoid of S. Yes, please.
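Written out in symbols, the two definitions discussed so far could read as follows. This is one possible notation rather than the form written on the board: S, d, a_i, i_0 and x bar follow the lecture, while the use of arg min to allow for ties is an assumption made here.

```latex
\begin{align*}
\text{Definition 1:}\quad
  & a_i = \max_{x \in S} d(x_i, x), \quad i = 1, \dots, n, \qquad
    i_0 \in \arg\min_{1 \le i \le n} a_i, \qquad
    \operatorname{medoid}(S) = x_{i_0}. \\
\text{Definition 2:}\quad
  & \bar{x} = \frac{1}{n} \sum_{j=1}^{n} x_j, \qquad
    x_i \in S \text{ is a medoid of } S \text{ if }
    d(x_i, \bar{x}) \le d(x, \bar{x}) \text{ for all } x \in S.
\end{align*}
```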
Yes, the output would be similar to that of K-means. But can you say that there will be absolutely no difference? The difference would be small; that I agree, I am not going to disagree with that. But this is the way it is there in the literature. I mean, I can always say that personally I do not like this. But see one thing: what is the mean of 1, 2, 10? The sum is 13, so the mean is 13/3, something like 4.33. Which point is closest to it? 2. So 2 is the median; by the first definition 2 is the median here, and according to this definition also 2 is the median.

Basically, when you say that it is very close to the K-means algorithm, there is a general feeling in your mind that the medoid is close to the mean. That is true, I agree, but the closeness may not be as much as you are thinking. In some cases it will be really close, but in other cases, although this is the closest point, the distance may not be as small as you think. Had the third point been 100 instead of 10, then also the median would have been 2 according to this definition. But then what is the mean? 103 divided by 3, about 34.3, which is a very large value compared to 2. Are you understanding what I am trying to say? So the medoid is not always as close to the mean as you are thinking. I agree it is the closest, because that is what the definition says, but that distance may not be small; it may be really large.
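The arithmetic in this answer can be checked with a small sketch. The helper name `nearest_to_mean` is hypothetical and the data are treated as plain numbers; it returns the mean, the data point closest to the mean, and the gap between the two.

```python
def nearest_to_mean(values):
    """Return (mean, closest data point to the mean, distance between them)."""
    mean = sum(values) / len(values)
    closest = min(values, key=lambda v: abs(v - mean))
    return mean, closest, abs(closest - mean)

for data in ([1, 2, 10], [1, 2, 100]):
    mean, medoid, gap = nearest_to_mean(data)
    print(f"{data}: mean={mean:.2f}, medoid={medoid}, gap={gap:.2f}")
# [1, 2, 10]: mean=4.33, medoid=2, gap=2.33
# [1, 2, 100]: mean=34.33, medoid=2, gap=32.33
```

In both cases the point closest to the mean is 2, but the gap grows from about 2.33 to about 32.33, which is the sense in which the closest point need not be close.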
I agree. See, I understand in what situation you are saying this, but do you think that every situation would be like this? That is one question I would like you to think about. Personally speaking, I would prefer definition 1 to definition 2. If you tell me that definition 1 needs too many computations, I agree, but maybe I am always more bothered about accuracy than about computations, so I would prefer definition 1. But definition 2 is also something that you will see in many books. Take the books on data mining, where you will find many clustering algorithms, and look at the definition of medoid; this is one definition that you will find there. How much you like it is a different thing. You may not like it, I may not like it, but this is a definition that exists. Other questions, please.

[Student] The median is used to take care of outliers: if you take a mean, a large value might influence the result, and that is why we normally go for a median calculation. But in this second definition (this is related to the previous question) we are actually taking the mean into account, so what might happen is that a large value might influence the result.

[Instructor] Here, if you take 100, it is a very large value, and even then you are getting the same median: for 1, 2, 100, 2 is the median, and if you follow this definition you will also get 2.

[Student] The difference in the second example is not that much; it is within that range.

[Instructor] Then take 10 to the power of 10, or 10 to the power of 100; even then you will get 2 as the median according to this definition. Are you getting it? 10 to the power of 10 and 10 to the power of 100 are very far away from 2, but even then you will get 2 as the median if you follow this definition. I mean, it is not really as bad as you
think; that is one of the points that I want to mention. And I am not saying that this is the ultimate; you can always improve upon it. I am not saying that this is something like the Bible or the Quran that you should not change. You can always change it according to your convenience. If you do not like it, make your own definition, which should also have some intuition behind it, and which on these data sets gives the same medians that you are getting here. After all, people have tried to generalize the median, and this is what they have given. You may not always agree with it, but these are the definitions you find in books. Personally speaking I prefer the first one, because I do not mind doing a lot of computations, but that is only my personal whim. The second one is what you are going to find in books, and it is not really that bad.

So this is one definition of the medoid. Then you can have the K-medoids algorithm, which you can make exactly the same as the K-means algorithm, Forgy's K-means algorithm, except that in place of the mean you write the medoid. You can follow it in the same way, and you can decide the number of iterations beforehand. For the K-means algorithm people have tried to prove convergence; for K-medoids I am not sure whether any proofs of convergence exist. What people generally do is fix the maximum number of iterations beforehand and then run the algorithm. Which definition of the medoid you take is again up to you. Let us just take 2
minutes' break, just 2 minutes.

So I shall now teach you an algorithm which is popularly known as DBSCAN. DBSCAN came into existence sometime in 1995-96, when the authors published the paper in a conference proceedings, and it became really famous. But unfortunately the exact algorithm was published in 1973 in the Journal of the American Statistical Association (JASA) by an author called R. F. Ling, who called it a generalization of the single linkage method. So I am going to call it R. F. Ling's algorithm, and that is what I shall describe now.

R. F. Ling, 1973. You are given a set S of n points x_1, ..., x_n, a subset of m-dimensional space.

Step 1: choose values for R greater than 0 and epsilon greater than 0. What these R and epsilon are, you will find out.

Step 2: let A_i be the set of all x belonging to S such that the distance between x_i and x is less than or equal to epsilon, for i = 1 to n. That means for each point x_i we construct a disc of radius epsilon around it and find all those points of S which are within the disc. That particular set we call A_i, and we do this for every x_i.

Step 3: if the cardinality of A_i is less than R, then we shall not involve this A_i in our calculations. That means we will only consider those A_i whose cardinality is greater than or equal to R.

Step 4: from this step onwards, all the A_i that we are considering have cardinality greater than or equal to R. Take the union of
A_i and A_j if the intersection of A_i and A_j is not empty, and go on doing this; step 4 you should go on and on doing. I will tell you the meaning of this.

Say this is your data set. You have to choose the values for R and epsilon. So basically you need a certain radius, epsilon, and once you consider the disc, how many points must be there in that disc; that is the value of R. What values you are going to choose, for the present moment we can forget about. Let us just look at step 4. Suppose the A_i that we got are something like this. If you take the union of A_i and A_j whenever their intersection is not empty, then you are going to get all of this as one cluster and all of this as another cluster. When I say that you should repeat this step, you should go on doing it till no unions take place: repeat step 4 till no unions take place.

Note that this method gives you the number of clusters automatically. Is this clear? And you will get really good clustering provided you choose the values of R and epsilon appropriately. If there are some points remaining, for example here there are some points which you are not able to put in any of the clusters, you can have your own convention about putting them into one of the clusters. Maybe something like nearest neighbor: put the point in whichever cluster it is closest to. The number of points left out will be a very small percentage of points compared to the size of
the whole set, given that you have chosen the values of R and epsilon properly. So this is basically DBSCAN.

What is that? The condition in step 3: if the cardinality of A_i is less than R, then we shall not involve this A_i in our calculations. That means all the A_i that we are considering have size at least R: the number of points in those sets may be R, R + 1, R + 2, R + 3, any such number. So it may so happen that some points are left out. Once you write this step, you cannot guarantee that every point will be put into one of the clusters. If there are some points which do not go into any of the clusters, you can have your own convention, like the nearest neighbor or any such rule.

So this is basically R. F. Ling's method. Now, if the value of R is equal to 1, what is it that you are going to get? My question to you is: can you get the single linkage method if you choose the values of R and epsilon appropriately? If R is equal to 2 then you will get the single linkage method; think about it. That is why R. F. Ling called it a generalization of the single linkage method, way back in 1973.

Now, the word density is used here. Density comes into the picture because in a disc of radius epsilon we want at least R points to be there, so it is something like saying that the density at the point x_i is at least R. So this is an algorithm with which you find the number of clusters automatically. In clustering, one of the first questions is how you decide the number of clusters. If someone gives it to you, it is fine; if
someone does not give it to you, is there any way in which you can decide? This is one of the algorithms where the number of clusters is decided automatically, given that you have chosen the values of R and epsilon. I think we will stop here.
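The four steps of the method described in this lecture can be sketched in Python as follows. This is a naive illustration under stated assumptions, not Ling's published procedure or a reference DBSCAN implementation: the function name `ling_clusters` and the parameter names `eps` and `r` are made up here, each disc A_i is assumed to include the point x_i itself, sets are merged by a simple repeated pairwise scan, and points that end up in no cluster are only reported, leaving the nearest-neighbor convention to the reader.

```python
import math

def ling_clusters(points, eps, r):
    """Sketch of the steps described in the lecture.

    points: list of coordinate tuples; eps: disc radius;
    r: minimum number of points a disc must contain.
    Returns (clusters, leftover), both as lists of point indices.
    """
    n = len(points)
    # Step 2: A_i = indices of all points within distance eps of x_i.
    discs = [
        {j for j in range(n) if math.dist(points[i], points[j]) <= eps}
        for i in range(n)
    ]
    # Step 3: keep only the discs that contain at least r points.
    clusters = [a for a in discs if len(a) >= r]
    # Step 4: take the union of any two sets that intersect,
    # and repeat until no union takes place.
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if clusters[i] & clusters[j]:
                    clusters[i] |= clusters[j]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    # Points that belong to no surviving set are left out.
    covered = set().union(*clusters)
    leftover = sorted(set(range(n)) - covered)
    return [sorted(c) for c in clusters], leftover

# Two tight groups of three points and one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
clusters, leftover = ling_clusters(pts, eps=1.5, r=3)
print(clusters, leftover)  # [[0, 1, 2], [3, 4, 5]] [6]
```

In this small example the number of clusters, two, is not supplied anywhere; it comes out of the choice of `eps` and `r`, and the isolated point is the kind of leftover point that would need a separate convention.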