In many of these clustering problems we would like to define an optimization criterion, and whichever clustering optimizes that criterion is what we call a clustering of the data. The definition of the criterion should, in principle, be problem dependent. Since we are just starting the subject, we will begin with one such optimization function and see how it proceeds. Why do you need an optimization function for clustering at all? As I said, sometimes someone tells you which properties the clusters should have, and you do the clustering according to those properties. But sometimes you are simply given a data set; the person may not have anything particular in mind, and you just want to see what sort of groupings are there. In a simple example like playing cards one can easily find the groupings, but the data sets given to us usually do not have such simple properties. So what one would like is that within a cluster the points should be similar to each other, while between two different clusters there should be more dissimilarity. What is a way of formulating this mathematically? The following. You are given a data set, and let us say the number of clusters k is known.
So what do you need to do? You need to divide the data set into k clusters; that is, you need to make a partition. Let us call one such partition A_1, A_2, ..., A_k, and write the data set as S. The properties are: (1) A_i ≠ ∅ for all i; (2) A_i ∩ A_j = ∅ for all i ≠ j; (3) A_1 ∪ A_2 ∪ ... ∪ A_k = S. So first you need a partition. Then you need some notion of similarity or dissimilarity; let us assume we measure dissimilarity by the Euclidean distance. Then, for a partition A_1, A_2, ..., A_k of the whole set S, let us define a function L. Let y_i be the mean of A_i: you take the mean of all the points in that set, and that is y_i.
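Before the formula is written down in full, here is a minimal sketch in code (my own illustration, not from the lecture) of the criterion being defined: for each cluster A_i, sum the squared Euclidean distances from every point x in A_i to the cluster mean y_i, then sum over the clusters.

```python
def cluster_mean(points):
    """y_i: the component-wise mean of the points in one cluster A_i."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def loss(partition):
    """L(A_1, ..., A_k): within-cluster sum of squared distances to the means."""
    total = 0.0
    for cluster in partition:
        y = cluster_mean(cluster)
        for x in cluster:
            total += sum((xd - yd) ** 2 for xd, yd in zip(x, y))
    return total

# Four points in R^2 and two candidate 2-cluster partitions:
S = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = [S[:2], S[2:]]                 # left pair vs right pair
bad = [[S[0], S[2]], [S[1], S[3]]]    # mixes far-apart points
print(loss(good), loss(bad))          # 1.0 100.0 -- the tight partition wins
```

The partition grouping nearby points gives the much smaller value of L, which is exactly why minimizing L is proposed as the clustering criterion.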
So what you do is: take a point x in A_i, find the distance between x and the mean y_i, square it, sum over all x in A_i, and then sum over i = 1 to k:

L(A_1, ..., A_k) = Σ_{i=1}^{k} Σ_{x ∈ A_i} ||x − y_i||².

This is the function, and we would like to get a partition A_1, A_2, ..., A_k which minimizes L. Intuitively it looks nice. Now let me start asking you questions. You are given n points, and let us say the number of clusters is 2, not a general k. How many such partitions {A_1, A_2} can you have? Say n = 4: you can put one point in one cluster and three in the other, or two and two, and for each choice you can calculate L; you want the partition that minimizes it. So how many partitions are possible for k = 2? That is right: (2^n − 2)/2. It is 2^n − 2 because we do not allow all the points in one cluster and none in the other, and we divide by 2 because the split (1, n − 1) is the same as (n − 1, 1). What about k = 3, k = 4, and, my next question, any value of k? Try to give me the answer tomorrow; otherwise I will tell you. For any k, first you need to know how many partitions are possible; then you know the complexity of the problem. It is actually of the order of k^n: k^n minus something, plus
something will come in, but I want you to give me the exact expression. So that is the computational complexity part. The second part is this: fine, you have got the clusters somehow, by exhaustive search or whatever; what properties do these clusters possess? This is the function we want to optimize; we would like to get hold of the A_1, A_2, ..., A_k which give the minimum value of L. Before looking at the properties, let me write it down. We are to find A_1^0, A_2^0, ..., A_k^0 (the superscript 0 is for optimal) such that L(A_1^0, ..., A_k^0) ≤ L(P) for every partition P = {A_1, ..., A_k} of the set S; L is for loss, so the loss of the optimal partition is less than or equal to the loss of any other partition. Now, if we really do find this optimal partition, what properties does it possess? Note that we are using only the Euclidean distance, not any other distance; this is really important, and you will see why. Suppose this is the mean of one of the optimal clusters and this is the mean of another one, and here is a point from the data set. You are supposed to put this point into either this cluster or this
one. To which cluster would you put it? You put it into this cluster, because the distance from the point to this cluster mean is less than the distance to the other cluster mean, and in our expression for optimality we are always looking at the distance between x and the cluster mean. So what does that mean? All the points on this side of the line go to this cluster, and all the points on the other side go to that cluster. If there is another mean, you will have another line like this, and for yet another mean one more line. Do you know the meaning of a convex set? A set is called convex if the line segment joining any two of its points is completely contained in the set. Now, this line gives one half space. Is a half space convex? It is: take any two points in it; the segment between them stays inside. With another mean you get another line, and on that side another half space; so the region is the intersection of half spaces. Is the intersection of two convex sets convex? Yes, and the intersection of finitely many convex sets is convex. So the region corresponding to a particular mean, if you forget about the finitely many points, is the intersection of all these half spaces, and that is convex. It may happen that the points of a cluster form a shape like this, but the region the method assigns is still convex: this method provides convex-shaped clusters. Have you understood? If you use this optimization criterion,
the clusters that you get are basically convex shaped. What is the meaning of convex shaped? Note that the shape of this point set is not actually convex; it is a non-convex shape. But once you do this, this side of this line is a convex set. When I say you get convex-shaped clusters, I am not bothered about the shape the finitely many points themselves form; what I mean is that the region is an intersection of convex sets. By the very nature of the definition, a convex set in R², R³, ..., R^m is an uncountable set, certainly infinite; and an intersection of convex sets, if it is not a singleton, is again uncountable, because if it contains two points it contains all the points between them. But we are clustering finitely many points, not uncountably many. So the finitely many points may form some sort of shape, yet once you do the clustering, the space, the set, corresponding to one cluster is convex: in this example the whole region corresponds to this cluster and is a convex set, though the shape of the points is not convex. So this criterion essentially provides convex clusters, and once I say that, you can see all the other complications coming.
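The claim above, that nearest-mean assignment carves the space into half spaces, can be checked numerically. This is a sketch of my own, not from the lecture: assigning a point x to the nearer of two means agrees exactly with testing which side of the perpendicular bisector of the two means x falls on.

```python
def nearer_mean(x, m1, m2):
    """Assign x to the nearer of two cluster means (squared Euclidean distance)."""
    d1 = sum((a - b) ** 2 for a, b in zip(x, m1))
    d2 = sum((a - b) ** 2 for a, b in zip(x, m2))
    return 1 if d1 <= d2 else 2

def halfspace_test(x, m1, m2):
    """Same assignment via the perpendicular bisector of m1 and m2:
    x lies in m1's half space iff (m2 - m1) . (x - midpoint) <= 0."""
    mid = [(a + b) / 2 for a, b in zip(m1, m2)]
    dot = sum((b - a) * (xi - mi) for a, b, xi, mi in zip(m1, m2, x, mid))
    return 1 if dot <= 0 else 2

m1, m2 = (0.0, 0.0), (4.0, 0.0)
for x in [(1.0, 3.0), (3.0, -2.0), (2.0, 0.0), (-1.0, 5.0)]:
    assert nearer_mean(x, m1, m2) == halfspace_test(x, m1, m2)
print("nearest-mean assignment == half-space test")
```

Expanding ||x − m1||² ≤ ||x − m2||² and cancelling the ||x||² terms gives exactly the half-space inequality, which is why each cluster region is an intersection of half spaces and hence convex.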
The complication is this: look at example 2 here. If you want that sort of clustering, you will not be able to get it using this optimization criterion. See the power of mathematics: without running a single algorithm we are able to say that you cannot get the non-convex shape in example 2 by optimizing this criterion. (I will make slides of all these things and give them to you, if not within the next day or two then surely by the following session.) So when you have non-convex shaped clusters, it is not necessarily true that you will get them using this criterion. Then comes the next question: suppose the clusters are convex shaped; do we get those clusters by this criterion? The answer, unfortunately, is no. Take two elongated strips, really long, each with finitely many points; each strip is convex shaped. But if you optimize this criterion, you may get something else: if you make the strips long enough, a partition cutting across the strips may give a smaller value of L than the partition separating the two strips. So it is not necessarily true that the convex clusters existing in the data set are always obtained by this criterion. It will give you convex clusters, but they may not be the convex clusters you are desiring: you desire one type of convex cluster and it gives you another type.
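The elongated-strips claim can also be checked numerically; this is my own illustration, not from the lecture. Two thin horizontal strips (here length 20, vertical gap 1) are each convex, yet the criterion L prefers the partition that cuts both strips crosswise over the "natural" top-strip/bottom-strip partition.

```python
def loss(partition):
    """Within-cluster sum of squared Euclidean distances to cluster means."""
    total = 0.0
    for cluster in partition:
        dim = len(cluster[0])
        mean = [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]
        for x in cluster:
            total += sum((a - b) ** 2 for a, b in zip(x, mean))
    return total

length, gap = 20.0, 1.0
xs = [i * length / 19 for i in range(20)]       # 20 evenly spaced x-coordinates
top = [(x, gap) for x in xs]                    # upper strip
bottom = [(x, 0.0) for x in xs]                 # lower strip

natural = [top, bottom]                         # the clusters we actually want
left = [p for p in top + bottom if p[0] < length / 2]
right = [p for p in top + bottom if p[0] >= length / 2]
crosscut = [left, right]                        # cutting both strips in half

# Once the strips are long enough, the crosscut partition has smaller L:
print(loss(natural), loss(crosscut))
```

A rough calculation shows the crossover: the crosscut wins whenever the strip length exceeds about twice the gap, so the "wrong" convex clusters genuinely minimize L here.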
So the whole point I want to make is that whenever anyone gives you a criterion for doing any work, be it clustering or anything else, first try to analyze its properties; you will get many clues from there. This is one very standard criterion, and many people use it. A generalization of it is used in fuzzy C-means: if you look at the optimization function there, it is basically this one with a slight generalization. But since this criterion has all these drawbacks, the slight generalization will also possess at least some of them, if not all. So this is my general advice: for any criterion anyone gives you, look at the plus points and the minus points. The mathematics I have done here is nothing great; I am just trying to explain what the criterion is doing. So, the non-convex clusters you may not get, and even the convex clusters existing in the data set you may not get: that is one part. The second part is the question I asked you: how many partitions are there? It is a huge, huge number. The problem is similar to counting the onto mappings from an n-point set to a k-point set; it is exactly the same problem, with a small difference. If you count only the onto mappings you will not get the division: for k = 2 the answer was (2^n − 2)/2!, because the partition {A_1, A_2} is taken to be the same as {A_2, A_1}; that is how the 2! comes in, and for general k it will be k!. So the number of onto mappings from an n-element set to a k-element set, divided by k!, will give you the
answer. And you will see that this number is huge. I hope all of you know about the traveling salesman problem and NP-hard problems. If you have k clusters and the count is of the order of k^n, you are going to be in serious, serious trouble. So whatever algorithms exist for optimizing this criterion are basically sub-optimal: you are not in a position to truly optimize the criterion with them. If you can suggest an algorithm that really optimizes it, for any data set and any value of k, without doing the exhaustive search, it will be a very nice contribution to the literature and people will surely lap it up, because we do not have such algorithms. The k-means algorithm tries to do the optimization, but it does not guarantee the optimal solution. Several versions of k-means are available in the literature: there is Forgy's k-means, there is MacQueen's k-means, there is Johnson's k-means, and there are modifications and generalizations, as in fuzzy C-means. If instead of the mean you use a medoid, you get what is known as the k-medoids algorithm. Sometimes you have a data set where for some points you know the classification and for others you do not, and you would like to use that information while clustering; that is semi-supervised clustering, which many people are doing nowadays, and there is a k-means version even for that case. Like that, there are several, several k-means algorithms.
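Returning to the counting question: the exact expression, onto mappings divided by k!, is what is known as the Stirling number of the second kind S(n, k). Here is a small sketch of my own (not from the lecture) computing it via the standard inclusion-exclusion formula for onto maps and checking it against the (2^n − 2)/2 count for k = 2.

```python
from math import comb, factorial

def onto_maps(n, k):
    """Number of onto maps from an n-element set onto a k-element set,
    by inclusion-exclusion: sum_j (-1)^j C(k, j) (k - j)^n."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))

def num_partitions(n, k):
    """Partitions of an n-set into k non-empty clusters (Stirling number of
    the second kind): onto maps / k!, since relabelling the k clusters
    gives the same partition."""
    return onto_maps(n, k) // factorial(k)

# Sanity check against the k = 2 count derived in the lecture:
for n in range(2, 12):
    assert num_partitions(n, 2) == (2 ** n - 2) // 2

print(num_partitions(4, 2))    # 7, matching the four-point example
print(num_partitions(100, 5))  # astronomically large: exhaustive search is hopeless
```

Even for modest n the count explodes, which is the precise sense in which exhaustive minimization of L is out of the question.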
There is one leader algorithm; Professor M. N. Murty of IISc Bangalore is one of its authors. That is a very nice algorithm: it does things very fast. It does not guarantee optimality, but then even k-means does not guarantee that, and the leader algorithm is very fast and gives reasonable results. So, like that, you will find several versions of these algorithms, depending on the type of use and on the constraints that you have. I will do one of the versions here in class: Forgy's k-means, which came in 1965 (MacQueen's k-means algorithm came in 1967). I am doing Forgy's because most pattern recognition books present that one, although personally I would prefer MacQueen's to Forgy's; if you want, I will give you both algorithms. Shall we stop here?
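As a preview of the next session, here is a minimal sketch (my own, in the usual batch form found in pattern recognition texts, not the lecturer's exact presentation) of the k-means iteration: pick k initial means, assign each point to its nearest mean, recompute the means, and repeat until the assignments stop changing. Consistent with the discussion above, it converges only to a local optimum of L, with no global guarantee.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Batch k-means sketch: random initial means, then alternate
    nearest-mean assignment and mean recomputation until stable."""
    rng = random.Random(seed)
    means = rng.sample(points, k)          # k distinct data points as initial means
    assign = None
    for _ in range(max_iter):
        new_assign = []
        for x in points:
            dists = [sum((a - b) ** 2 for a, b in zip(x, m)) for m in means]
            new_assign.append(dists.index(min(dists)))
        if new_assign == assign:
            break                          # assignments stable: converged
        assign = new_assign
        for i in range(k):
            members = [x for x, c in zip(points, assign) if c == i]
            if members:                    # keep the old mean if a cluster empties
                dim = len(members[0])
                means[i] = tuple(sum(p[d] for p in members) / len(members)
                                 for d in range(dim))
    return assign, means

# Two well-separated groups of three points each:
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
assign, means = kmeans(pts, 2)
print(assign)  # the two tight groups land in different clusters
```

On well-separated data like this, the iteration recovers the obvious grouping; on the elongated strips above it can happily converge to the crosswise partition instead.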