Okay, let me start sharing my screen. Hello everybody. We meet again today, and today I'm going to talk to you about another technique from unsupervised machine learning that is also closely related to many, many physical problems. The technique is clustering: essentially, finding the natural groups in your data. Let's see some examples. The first question one can ask is: what is a cluster? I think all of you will agree with me that if I have this distribution of points in two dimensions, it is fairly clear that I have two clusters, right? Two groups of points. I think for everybody it should be like that, but there are many, many possible clusterings that we can imagine just in two dimensions. This one is strange, but it is anyway a grouping into three clusters. So what we are going to show you today and the next day (Wednesday, I think) is an overview of the techniques we have for allowing our computers to do this in an automatic way.

Clustering has many, many uses. It can be used, for instance, for classifying the documents of a library, and whenever you google something, the algorithm performs a clustering for you before providing the results. In image recognition, for instance, these are images from a cancer, and when performing image recognition you often need a clustering algorithm as a preliminary step. In general, everything that needs some kind of classification can be addressed with clustering techniques.

There are many types of clustering. I'm going to provide you with some kind of classification, so that you keep in mind the several types that we can have when we use a clustering algorithm. Imagine that you have something like this. You can imagine that the most natural way of dividing these points is like this, but you can also think about something like this; why not something like this, or even like this, or this? All these partitions are equally valid. So one thing to realize is that the result of your clustering is going to depend on what you define a cluster to be. That sounds like a tautology, but it's always like that: there is no single mathematical definition of what a cluster is. It is really problem dependent, so each problem will have a different optimal clustering algorithm.

The result also depends on how you compare the points that you want to cluster: that's the metric. The metric that you use, for instance, for comparing two different configurations will determine the result of your clustering. And it also depends on the features that you have chosen. Imagine that you have, I don't know, a simulation of a protein. You can try to cluster the configurations that you sample in your simulation depending on, let's say, the similarity to a given structure, or according to the angles explored during the simulation. The choice of these features, the choice of this metric, is not trivial, and it depends on the problem. In some cases it is almost trivial, but in many cases it is not. So it is something that one must keep in mind when applying a clustering algorithm: you cannot use clustering as a black-box method; you have to take all these variables into account.

As I told you, I'm going to give you a classification of the different clustering algorithms, and I will classify them according to the output of the clustering.
The first, let's say the simplest, type is flat clustering. In flat clustering, you perform just a hard partition of your data into groups. Then, in fuzzy clustering, you also perform a partition of your data into groups, but instead of assigning each point to a cluster, you assign degrees of membership. I will show you later what that means precisely. Finally, in hierarchical clustering, instead of having a single partition, you generate a tree, and this tree of partitions allows you to see many, many partitions. It's a kind of dendrogram, a kind of classification tree. Think, for instance, of when you were at school studying species of animals: there were the vertebrates; inside the vertebrates there were the mammals, the reptiles, and so on and so forth, down to species and individuals. Here we do something similar. Let's see later how it works.

I will go into the details of each of these points. In flat clustering, as I told you, each element is assigned to a single cluster: point one belongs to cluster two, point two to cluster two also, point three to cluster one, let's say, and so on and so forth. In traditional methods you need to define the number of clusters, and it is provided as an external parameter to the algorithm. The thing is that usually, when one performs clustering, one looks for a hard partition: one wants to say that each element belongs to a given group. That's what clustering is, by definition. But, for instance, if your data structure is multi-level, flat clustering cannot deal with that. This is a case of flat clustering in which I divide my points into five groups; all the points drawn with the same color belong to the same group.

Fuzzy clustering is a bit more sophisticated. As I told you, you give each point a degree of membership in each cluster. Of course, since a point cannot count for more than one point, this membership vector is normalized: the memberships of a point sum to one. And again, the number of clusters usually must be provided. It's not always easy to transform a fuzzy clustering into a hard partition. In this case, for instance, if I perform a fuzzy clustering on this set of points, I will obtain something like this. For this point, I would say that 90 percent is assigned to cluster one, nine percent to cluster two, and marginally to cluster five. This other point, which is in the middle between clusters one and two, belongs 45 percent to cluster one, 50 percent to cluster two, and a bit to cluster three. This point does not belong at all to cluster one, nothing to cluster two, quite a lot to cluster three, more to cluster four, and a bit to cluster five, and so on and so forth.

Please, if you have questions, interrupt me, because I cannot follow the chat while presenting. Excuse me, yes? A question from the audience: in the flat clustering, was the number of clusters arbitrary, or how did you choose five clusters in the previous slides? Let's say that in this case I chose them because I had seen the points, mostly, and I decided; but in general we will see techniques for choosing the number of clusters. It's not trivial, but for now this is just an introduction to the different methods. When I explain the methods in detail, we will come back to that point in a bit more detail.
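Before moving on to hierarchical clustering: to make the difference between the flat and the fuzzy output concrete, here is a tiny sketch. The numbers are illustrative, loosely echoing the percentages above, not actual lecture data.

```python
import numpy as np

# Flat (hard) output: one cluster label per point
# (0-indexed: point 1 -> cluster 2, point 2 -> cluster 2, point 3 -> cluster 1).
hard_labels = np.array([1, 1, 0])

# Fuzzy output: one row of membership degrees per point, one column per cluster.
# Each row sums to one, since a point cannot count for more than one point.
U = np.array([
    [0.90, 0.09, 0.00, 0.00, 0.01],  # almost entirely in cluster 1
    [0.45, 0.50, 0.05, 0.00, 0.00],  # halfway between clusters 1 and 2
])
assert np.allclose(U.sum(axis=1), 1.0)
```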
Okay. Finally, hierarchical clustering. As I told you, it produces a set of nested clusters; it's a kind of hierarchical tree, and the output is visualized as a tree. We will see that later. But by working with a tree, you don't need any assumption about the number of clusters. If you are lucky, the hierarchical clustering may correspond to a meaningful taxonomy: if there is a multilevel structure in your data, what you want is that your hierarchical clustering reproduces this multilevel structure. And just by cutting the tree, it can be transformed into many hard partitions. For instance, for something like this I would obtain this partition, where you see that all the points start in the tree and I keep merging branches until I have the full structure. This is a good multilevel structure in the sense that I see that the red and yellow clusters are more connected to each other than to, let's say, green, blue and purple, right? And I can also see that green and blue are more connected to each other, and then to this third one, and so on. This is what hierarchical clustering provides you: apart from partitions, it also provides you with a view of the structure of your data.

However, it's not easy to flatten. For instance, in this case I obtain the two-cluster partition easily by cutting with a straight line, but for obtaining the five-cluster partition I have to do something like this, which is not straight at all. So it's not so easy to do in real-world applications. I did it here just as an example, but I would not know how to do it if I could not see the data.

So let me check if there are questions in the chat before going ahead. Okay, there are many questions; let me start replying to them.

What are the mathematical differences between these methods? The point is that I'm not talking about actual methods; I'm talking about general classes of methods. Flat, fuzzy and hierarchical clustering are classes of methods, and each of them contains many different methods, so to answer that you should look at each method individually. Generally speaking, the difference between these three classes is the output: in one case you have a hard partition, meaning an assignment of each element to a single cluster; in the fuzzy one you have degrees of membership; and in the hierarchical case you don't have a single partition but a hierarchy of partitions.

Can we use the number of components from PCA as a guide for the number of clusters? Not exactly, because it's a different thing: the dimension in which your data lives is not necessarily the number of groups that you have. For instance, the number of components of PCA in this case would be two, but the number of clusters would be five, right?

Regarding the hierarchical clustering, do you have any references on these non-straight cuts of the dendrograms? I should look for that; it's been a long time since I looked into non-straight cuts. There are many methods trying to optimally divide a dendrogram; as far as I know, none of them is generally applicable, so I decided not to explain them, but I can check. It is still a work in progress for the community, trying to find a general method for non-straight cuts of dendrograms.
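Since the cutting of the tree came up twice, here is a minimal sketch, assuming SciPy is available, of how a single hierarchical tree yields many hard partitions just by cutting at different levels; the data are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))  # illustrative two-dimensional data

# Build the full tree of nested partitions (average linkage, one common choice).
Z = linkage(X, method="average")

# Cutting the same tree at different levels gives different hard partitions:
labels_2 = fcluster(Z, t=2, criterion="maxclust")  # the two-cluster partition
labels_5 = fcluster(Z, t=5, criterion="maxclust")  # the five-cluster partition

# dendrogram(Z) draws the tree itself, e.g. inside a matplotlib figure.
```

Note that `fcluster` performs the straight, horizontal cut; the non-straight cuts discussed in the question are exactly what this function cannot do.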
Another question: what about the position of the dots? On what basis are they placed? These are examples; I put them by hand, but generally speaking, what they represent is the position of your data. In the case of physical many-body systems, the ones we are interested in, they are usually the coordinates in your configuration space. If you are talking about, let's say, the XY spin model or the 2D Ising model, the position of a configuration would be given by all the coordinates of each of the spins in your system.

What is the point in using clustering techniques if the result depends on the technique, the number of clusters, and so on? Well, the point is that you need it. Usually what happens is that you need to perform some kind of division of your data, and since you need it, the better you know the clustering techniques, the better you can decide which technique to apply to your data. So it is mostly a matter of need: in many cases you need to divide your data into groups blindly, because it is not a given that you can inspect your data. Clustering is useful for many, many things, and since it is useful and you need it, you need to know these techniques.

There is another question, from Richmond; it is a technical question about the silhouette score: in assessing clustering techniques, depending on the data context, how reliable is the silhouette coefficient? I would tell you that it is a coefficient that I don't like, especially because it has the problem that it tends to over-score, let's say, spherical-like clusters. Let me show you; I'm sorry for the other people, but I think it deserves a reply. If you have something like that (let me go back), the silhouette coefficient of this partition would be really, really low, because the silhouette coefficient somehow relies on the fact that your data are more or less compact, more or less spherical. If your data are clustered in a weird way, it is not going to give you the correct answer. In many cases, however, it is a good option: when you have some hints about the shape of your clusters, and you think they look like this, the silhouette coefficient will work perfectly. I don't know if that answers your question.

Okay, how can we say whether a clustering applied to data is reliable or not? Mostly, what you do is rely on experience, in the sense that you have cases in which you have tested your methods, and if your method works on cases really similar to the one under study, you somehow trust it. The other option is to apply external coefficients, like the silhouette coefficient mentioned by your colleague, which try to give you some measure of how good, how compact, for instance, your clusters are. But this is really dependent on the actual shape of your clusters, so, personally speaking, I don't like it so much. In my opinion, the best way is to perform an external validation on some data set in which you know the groups; if this data set is really similar to the ones you are applying the method to, then you can rely on the method.
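Since the silhouette coefficient came up, here is a minimal sketch of how it is typically computed with scikit-learn; the synthetic blobs are an assumption, chosen deliberately as the favourable, roughly spherical case.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Roughly spherical, compact groups: the case the silhouette score rewards.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=0)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette lies in [-1, 1]; higher means more compact, better-separated clusters.
    print(k, round(silhouette_score(X, labels), 3))
```

On elongated or oddly shaped clusters, as noted above, the score can prefer a wrong partition.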
Okay, let me now go to the first method that I want to explain to you: K-means clustering. This one is, let's say, the father of all the clustering algorithms, and it is still widely used. I put here a reference, but there are several references that are used as the reference for K-means, and they still gather thousands of citations per year; its use keeps increasing and it is still really, really useful.

So what is K-means, and what does it do? It attempts to minimize the intra-cluster distance, that is, the distance between points belonging to the same cluster, while maximizing the inter-cluster distance, the distance between clusters. It is based on the concept of the cluster centroid: the centroid is the average position of the elements of a cluster. As I told you, it is still widely used; it can be easily parallelized and it scales almost linearly, which means it is really, really fast. But, as I told you, the user must provide K, the number of clusters.

What one minimizes is this objective function, or loss function. Theta is the array that reflects the assignment of points to clusters, and c_l is the vector of coordinates of the centroid of cluster l. The loss function is the sum over clusters of this quantity, where delta(theta_i, l) is zero if point i is not assigned to cluster l, and one if it is. So here you compute just the average of the coordinates, and here you compute the sum of the squared distances from each element assigned to the cluster to the center of the cluster. This is what we try to minimize.

And the way of minimizing it is a really simple algorithm. The first thing we do is initialize the cluster centroids, which are points randomly chosen from the data. Then we enter a loop in which we first decide the membership of all the data points by assigning each of them to the nearest centroid, and then we recompute the centroids using this formula. This way of minimizing allows you, in the first step, to minimize the intra-cluster distance, because you are just saying: I assign all my data points to the nearest center. And then, when you update your centroids, what you do is maximize the inter-cluster distance, because if two centers end up nearby, recomputing them as the averages of the positions of the data in their clusters pushes them apart. (Lucio asked a question; I will reply to it later, it is covered further on in the lecture.)

So, for instance, imagine that you have this data. The first thing you do is to randomly pick K centers; in this case I pick 15 centers. Then the computer assigns each point to its nearest center; now you can see them drawn with the same color. And then I recompute the centers, meaning that I move each center towards the middle of its cluster, and I continue iterating. This procedure is guaranteed to converge, and it converges when there is no change in the assignment of points: you then have the final centers and the final assignment. But it is not guaranteed that you arrive at the global minimum: when you perform this iteration, you arrive at a minimum, but not necessarily the global minimum of the loss.
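The formulas on the slides are not captured in the transcript, so take this as a reconstruction from the description above: in the standard notation, the K-means loss and the centroid update are

```latex
L(\theta, \{c_l\}) \;=\; \sum_{l=1}^{K} \sum_{i=1}^{N} \delta(\theta_i, l)\,\lVert x_i - c_l \rVert^2,
\qquad
c_l \;=\; \frac{\sum_{i=1}^{N} \delta(\theta_i, l)\, x_i}{\sum_{i=1}^{N} \delta(\theta_i, l)},
```

where delta(theta_i, l) is one if point i is assigned to cluster l and zero otherwise. And here is a minimal NumPy sketch of the two-step iteration (Lloyd's algorithm); the function name and defaults are mine, not from the lecture.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid updates."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the centroids as k points randomly chosen from the data.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        # (this lowers the intra-cluster term of the loss).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: no change in the assignment of points
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points.
        for l in range(k):
            members = X[labels == l]
            if len(members) > 0:
                centers[l] = members.mean(axis=0)
    return labels, centers
```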
You can see that, for instance, in this case, where I know the solution is 15 clusters: I set K to 15, but due to the minimization procedure I get, for instance, this cluster split in two, or these two clusters merged. This is a problem due to the fact that my initialization is random: I arrive at a given minimum of the loss function, but not necessarily at the global one.

So K-means, the father of all the methods, is already pretty old and has some weaknesses, and we are going to comment on them. The first one is the one I already mentioned: it is sensitive to initialization. (Yes, to answer the question from the chat: the initial centers are totally random. Let me finish this point, because we are going to see exactly that now.) I mean, if I pick these centers, I will have one solution; if I pick other centers, I will have another solution, which in this case is optimal; but I can get other solutions that are far from optimal. So each different initialization will give me a different result.

What people suggest is to perform a better initialization, one that is not totally random, but still random. The most used method is K-means++, in which the first center is chosen at random; then, for each data point, you compute the distances to the already chosen centers and take the minimum; and then you choose a new center with a probability proportional to the square of this quantity. We are not going into the details because we don't have a lot of time, but the idea is: if I pick one center in one place, it is better if the next center I pick is far away from the one I already chose. So you choose centers with higher probability when they are far away from the existing centers, and then you run K-means with those centers.

Imagine that you have this case and I choose K equal to three. In the first step, I pick one point at random, with uniform probability; let's say I pick point six. Now I compute the distances from all the points to this center and take the minimum, which in this case, with a single center, is simply the distance to it. Then I pick the next center with a probability proportional to the square of this minimum distance. So the probabilities of the points that are far away from center six will be bigger than the others, and I would pick, for instance, point one. If I have picked this point and this point, what happens is that I now have to recompute, for each point, the minimum distance to a center: for point nine it would be this one, for point eight this one, for point two this one, for point three this one, and similarly for four, five and seven. Now I pick a new center with a probability proportional to the square of this minimum, so the probabilities of points eight and nine will be much higher than those of four, five, seven, two and three. Therefore I would probably pick, for instance, point nine. And that's all: now I have these three centers, which are no longer randomly picked, but picked in a wise way, and in this way I will most probably reach the global minimum of this loss function, in which I have three clusters here.

So this is one of the weaknesses, and this is how it has been addressed.
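Here is a minimal NumPy sketch of the K-means++ seeding just described; the function name is mine, not from the lecture.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-means++ seeding: each new center is drawn with probability
    proportional to the squared distance to the nearest existing center."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # First center: a uniformly random data point.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Minimum squared distance from each point to the already chosen centers.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # Sample the next center with probability proportional to d^2,
        # so far-away points are strongly preferred.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

The centers it returns can then seed the plain K-means loop sketched earlier, instead of the fully random initialization.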
The other weakness, which you all asked me about, is which K to employ: which number of clusters? In K-means, this problem is usually addressed by what is called the scree test. In this test, I run my K-means for K equal to one, two, three, four, five, six, seven, and so on, and I plot my loss function; call it dissimilarity, objective function, whatever: it is the sum of the squared distances from all the data points to the centers of their clusters. If I plot that, I obtain what is called a scree plot, and I look at this plot for an elbow (there is a small sketch of this test at the end of this passage). If I find an elbow, I can say with some confidence that the correct number of clusters, in this case, is three. The "similarity" on the axis, sorry Alejandro, is the loss function; let me go back... exactly, that one. It is just a different name for this objective function. So you plot the values of the loss function for each different partition and you look for an elbow. Of course, since the K-means algorithm is sensitive to initialization, what is usually done is to take, say, 10 runs for each number of clusters and keep the minimum value.

Then, K-means is sensitive to outliers. Why is it sensitive to outliers? Imagine that you have something like this. You would agree with me that the optimal partition would be this one, right? And these are the cluster centers. Now I add a point, this one, which is far away: what is called an outlier. What I would like is to keep the same partition: this point is far away, it's just one point, why should it modify my partition? However, what happens is that it moves my center a lot, precisely because it is really far away; if you remember the formula for the centers, it carries a lot of weight. And by moving my centers, what may happen is that my partition is no longer the optimal one. That's the reason why K-means is really sensitive to outliers.

So people proposed a variant of K-means called K-medoids. K-medoids is essentially the K-means method, but instead of working with centroids you work with what are called medoids; the name should remind you of the median instead of the mean. The medoid is the element of the data set that is the most central element of its cluster. So you are still optimizing your centers, but restricting yourself to centers belonging to the data set. You can see that this will somehow mitigate the effect of outliers, right? Because if you have something like this, your center will never be here; it will be here, because you are not allowed to move the center away from the data. And having the center here, you recover the correct partition.

Lucio Garcia asked me if this is like selecting the most central point in the simplex of the data. It is the most central point in the cluster, not in the simplex, so it's a bit different, but the idea is the same: you choose as the most central point the one that minimizes the sum of the distances to all the rest of the data belonging to the cluster.
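Coming back to the scree test from the beginning of this passage (the sketch promised above), here is how it is typically run with scikit-learn; the synthetic data set and the range of K are illustrative assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=600, centers=3, random_state=1)

ks = range(1, 10)
losses = []
for k in ks:
    # n_init=10 repeats each K ten times and keeps the best (lowest-loss) run,
    # mitigating the sensitivity to the random initialization.
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    losses.append(km.inertia_)  # sum of squared distances to the nearest center

plt.plot(list(ks), losses, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("loss (within-cluster sum of squares)")
plt.show()  # look for the elbow; here it should appear at K = 3
```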
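And here is a minimal NumPy sketch of the K-medoids idea just described; it operates on a precomputed distance matrix, which is exactly the point of the remark that follows. The function name and the simple update scheme are mine: a sketch, not the full PAM algorithm.

```python
import numpy as np

def k_medoids(D, k, max_iter=100, seed=0):
    """Simplified K-medoids: D is a precomputed (n x n) distance matrix,
    computed with any metric, not necessarily the Euclidean one."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        # Update step: within each cluster, the new medoid is the member
        # minimizing the sum of distances to all the other members.
        new_medoids = medoids.copy()
        for l in range(k):
            members = np.where(labels == l)[0]
            if len(members) > 0:
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[l] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # converged: the medoids stopped moving
        medoids = new_medoids
    return labels, medoids
```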
This K-medoids has an additional advantage: it can be used with whatever distance you have between your points. While in K-means we use the properties of the Euclidean distance to compute the centroids, in K-medoids you just need the distances between all your data points, and these do not need to be Euclidean distances; you can employ whatever kind of distance you want.

And finally, the last problem: spherical clusters. K-means only allows you to find what are called spherical clusters. It's not that they are literally spherical; let me explain it a bit better with an example. Imagine that you have something like this. If you have something like this and you apply K-means, K-medoids, whatever, you are going to get, in the best case, with K equal to 2 something like this, with K equal to 3 something like this, with K equal to 4 something like this. You are never going to be able to reproduce these two clusters with K-means, nor with K-medoids. And that is because the way the centers are defined does not allow you to obtain something like this: a point that is near the center of this cluster will be assigned, by distance, to that center. In this sense you can talk about spherical clusters, because points are assigned to the nearest center. This problem is addressed by other methods, like kernel K-means, but today I'm not going to get to those methods.

I think for today we can leave it here. Well, not quite; let me reply to your questions before continuing. There is one: is it possible that there is more than one elbow in the cost function? The answer is yes. Imagine, for instance, that you have a multi-level structure; in that case you would probably obtain more than one elbow. More questions? If there are no more questions, I can go to the next point, which is fuzzy c-means.

This is a method that is somewhat newer than K-means, something like 10 years newer, and still old, of course, because it is from '84, but it is also still widely used. As I told you, it can be considered a version of K-means where the assignment is fuzzy, in the sense that you have a degree of membership from each point to each cluster. Once you allow that, you just adapt your loss function and the optimization algorithm accordingly. The new loss function comes from this formula, where the assignment is now a matrix, right? Because for each element you have a vector of membership degrees, the total assignment of your clustering is a matrix. And what you compute is the squared distances from all the points to the centers, weighted by the memberships. This m is a parameter of the method; in general, m equal to 2 is widely used. And the center is also computed as an average, but an average weighted by the memberships of each element.

Once you have this loss function and this way of computing the centers, the algorithm is a bit different: instead of picking random centers, you randomly initialize your memberships; then you compute the centers from the memberships, and from the centers you update the memberships. The exact formula for updating the memberships is not so important, but it is this one; it just says that the membership of a data point in a cluster is bigger when the center is nearer, which is kind of logical, and by using this formula the memberships come out normalized.
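Again the slide formulas are not in the transcript; the standard fuzzy c-means objective and update rules matching this description (a reconstruction, so take the notation as an assumption) are

```latex
J(U, \{c_l\}) \;=\; \sum_{i=1}^{N} \sum_{l=1}^{K} u_{il}^{\,m}\,\lVert x_i - c_l \rVert^2,
\qquad \sum_{l=1}^{K} u_{il} = 1 \;\;\forall i,
\\[4pt]
c_l \;=\; \frac{\sum_i u_{il}^{\,m}\, x_i}{\sum_i u_{il}^{\,m}},
\qquad
u_{il} \;=\; \Bigg[\, \sum_{j=1}^{K} \bigg( \frac{\lVert x_i - c_l \rVert}{\lVert x_i - c_j \rVert} \bigg)^{\!2/(m-1)} \Bigg]^{-1}.
```

And here is a minimal NumPy sketch of the alternating iteration, including the stopping threshold on the change of the membership matrix mentioned just below; all names are mine.

```python
import numpy as np

def fuzzy_c_means(X, k, m=2.0, tol=1e-5, max_iter=200, seed=0):
    """Fuzzy c-means: alternate center and membership updates until the
    change in the membership matrix U falls below a threshold."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Random membership matrix U (n points x k clusters), each row sums to one.
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        W = U ** m
        # Centers: membership-weighted averages of the data points.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Distances from every point to every center (the small epsilon
        # avoids division by zero when a point sits exactly on a center).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Membership update: nearer centers get larger membership,
        # normalized so that each row of U sums to one.
        U_new = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
        if np.max(np.abs(U_new - U)) < tol:
            return U_new, centers  # converged: memberships barely changed
        U = U_new
    return U, centers
```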
And since the memberships are now real numbers rather than integers, you have to put a threshold on the change of the membership matrix in order to terminate the iterative procedure. It has the same problems as K-means: it is sensitive to initialization; you have to decide the number of clusters, which you can again do with the scree plot; it is also sensitive to outliers, and here there is no K-medoids-like way of fixing that; and it generates clusters that are spherical around the centers, although this is somewhat mitigated by the fact that you have a fuzzy assignment.

Questions? Maybe I went too fast on that, but now we have time for questions. If there are no questions, I think we are going to finish here. Okay, there are no questions. Let me stop the screen sharing. I will share the slides, not right now, but I will share the slides with the notes, together with the other materials. That's all. Mateo, I think we can stop the recording.