I don't know whether there are questions on the previous lectures; otherwise I'll leave you the floor, Alex. Okay, yes, please, if there are questions, ask them now. If not, I will start sharing my screen. Can you see that? Yes, yes. Thank you.

So today I will start with one thing that I forgot to tell you last time, I'm sorry: a kind of reminder about the intrinsic dimension. In the first two lectures we were talking about dimensionality reduction, and we saw that if your points look something like this, you can map them to one dimension easily by using principal component analysis. But if the manifold in which the data lie is much more complex, like this one, there is no way PCA can work. PCA performs a rotation of the space and tries to find the maximum-variance directions, but if the data look like that, a rotation is useless, right? So we introduced, in a really shallow manner, other methods that can deal with this kind of problem. I put here the references, which I will give you together with the slides, probably tomorrow. You see that some of these methods can work, but how do we know which dimension we are going to use? If we are lucky and we are using what is called a spectral method, that is PCA or variants of it like kernel PCA or Isomap, and we have a nice gap in the spectrum, we know which dimension to use, right? It would be this value in this case. This kind of gap is a natural way to decide onto which dimension we are going to project. But imagine that none of these methods works, or that we don't have a gap. How do we decide which dimension to use? What we have to do is estimate the intrinsic dimension (ID) directly.

Probably the oldest method for doing that is the box-counting method. Imagine that I have these two distributions. If I sit on this point and plot the number of neighbors of the point as a function of the distance from it, what I would see is that in this case the number of neighbors increases with the square of the radius. In this other situation, when I increase the radius, the number of neighbors increases linearly. Why? What I'm showing you here is pictorial, so don't take the exact counts literally, but this is the trend: the number of neighbors of a given point scales with the distance from that point raised to the power D, where D is the dimension of the manifold in which the data are embedded, and the prefactor is the density. So if I take the logarithm of this equation, I have a nice linear dependence, right? So what we do in the box-counting method, or in one of its versions, is simply take all the points in the data set, obtain the list of the k nearest neighbors of each point, where k is a parameter, together with the distances to these neighbors, fit this equation with a log-linear fit, and obtain D as the slope of the curve (see the sketch below).

Okay, that's a really easy method, but it has some problems. Among them: what happens if the density is not uniform?
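Here is a minimal sketch of that fit in Python; the data array X, the choice of k and the use of scikit-learn's NearestNeighbors are illustrative assumptions, not the exact code used in the course. The idea is simply that N(r) ≈ ρ r^D, so log N(r) = log ρ + D log r, and a log-log fit gives D as the slope.

```python
# Minimal sketch of the neighbour-counting (box-counting style) ID estimate.
# Assumes X is an (n_points, n_features) NumPy array with no duplicate points.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def box_counting_id(X, k=20):
    """Estimate the intrinsic dimension from the scaling N(r) ~ rho * r**D."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)          # dist[:, 0] is the point itself (distance 0)
    dist = dist[:, 1:]                  # distances to the k nearest neighbours
    ranks = np.arange(1, k + 1)         # by construction, N(r_k) = k
    slopes = []
    for d in dist:                      # one log-linear fit per point
        slope, _ = np.polyfit(np.log(d), np.log(ranks), 1)
        slopes.append(slope)
    return np.mean(slopes)              # D is the average slope

# Example: points filling a 3D cube should give an estimate close to 3.
# X = np.random.rand(2000, 3); print(box_counting_id(X))
```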
If the density is not uniform, this equation still holds, but the dependence is not linear anymore, because the log of the density is no longer a constant term, right? So in cases where the density is not nearly uniform, this method will fail. There is also a problem when the manifold is curved. Imagine that instead of this straight line you have a curved line. When you go far from your point, you start cutting across your manifold. Let me try to draw it here; I hope you can see the points. When you look at the distances, the first one is still along the line, but the second one starts cutting across the manifold; the manifold would be something like that, right? So with this method we are computing distances that go outside the manifold. That's also a problem for the box-counting method.

Therefore, people are constantly developing new methods for estimating the ID. Let me explain the one that we employ for this course. Before explaining it, I want to show you one property of points at constant density. Imagine that I have these points at constant density, and I sit on one of them. I compute the volumes of the shells defined by the nearest neighbors of that point: this is the shell defined by my first nearest neighbor, this one by the second, and so on. What happens is that these volumes follow an exponential distribution. This exponential distribution allows us to derive, for instance, the nearest-neighbor formula for density estimation, which we are not going to cover today, but that is one use of it. The other use is that it allows us to derive another method for estimating the ID.

Imagine that we take the ratio between the distance to the second nearest neighbor and the distance to the first nearest neighbor. One can show, using the previous probability, that this quantity follows a Pareto distribution. And the magic of this method is that, assuming only that the density is constant out to the second nearest neighbor, the expression becomes independent of the density: taking the ratio cancels the dependence on the density. I'm not going to derive it here; if you are interested in the derivation, you can check the reference I put here. From this, you can apply maximum likelihood, which is nothing more than the product of the probabilities of each point; by taking the logarithm and maximizing, you obtain the expression for the ID. And this expression is easy: it's the inverse of the average of the logarithm of this ratio, and it has an associated relative error that depends only on the inverse square root of the total number of points (see the sketch below). Is it clear?

Okay, I will stop here and ask for questions. Are there questions about this method or the previous one? Okay. If there are no questions, we can go back to clustering.
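And here is a minimal sketch of this estimator; again, the helper name and the use of scikit-learn are illustrative assumptions. The estimate is d = 1 / ⟨log(r2 / r1)⟩, with a relative error of roughly 1/√N; the original method also discards the points with the largest ratios before averaging, which is omitted here.

```python
# Minimal sketch of the two-nearest-neighbour ID estimator described above.
# Assumes X is an (n_points, n_features) NumPy array with no duplicate points.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(X):
    """d = 1 / <log(r2 / r1)>, with relative error ~ 1 / sqrt(N)."""
    dist, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = dist[:, 2] / dist[:, 1]        # ratio of second to first neighbour distance
    d = 1.0 / np.mean(np.log(mu))       # maximum-likelihood estimate
    err = d / np.sqrt(len(X))           # relative error scales as N**-0.5
    return d, err
```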
In the last lecture we were talking about methods of flat clustering, and I explained k-means. We also saw methods of fuzzy clustering, where the assignment is not so clear, and I explained fuzzy c-means. There are also methods of hierarchical clustering, which I didn't explain. Today we are going to introduce hierarchical clustering, and let's see if we arrive at other methods that are more advanced.

Okay, let's start with hierarchical clustering algorithms. There are two main types. One is the agglomerative methods: you start by considering all the points as individual clusters, and at each step you merge the closest pair of clusters until only one remains. In divisive clustering, the approach is the opposite: you start by considering all your points as belonging to the same cluster, and at each step you split a cluster, until each cluster contains a single point. I introduce both because historically both were considered, but in practice the only ones people use are the agglomerative ones, mainly because divisive clustering is extremely expensive computationally.

So let's see how these methods work. The basic algorithm is really easy. You start by computing the distance matrix between all the input data points and assume that each point is a cluster. Then you iterate: you merge the two closest clusters and update the distance matrix. What does updating the distance matrix mean? It means that once you merge two clusters, you have to decide how to compute the distance between clusters, not between data points anymore. And this is the key operation, and the key difference between the many agglomerative clustering algorithms: each of them has a different definition of the distance between clusters.

Let's see it in practice. Imagine that you have these six points. As I told you, at each step you pick the minimum distance and merge the elements. So you start by drawing an axis that is the joining distance. I pick the minimum distance and merge those elements; in this case the minimum distance is between elements five and six, which are the two nearest points. So I pick them, put them in my graph, and say that they are joined at this distance. Then I take the second smallest distance, which is between three and four, and I record that they become the same cluster at that distance. I label the first cluster, cluster one, the second, cluster two, the third, cluster three. And now the critical point: how to compute the distance between these clusters? There are many solutions, and we are going to see some of them (a small code illustration follows below). The point is that each cluster is a set of points, so we have to think about how to turn a set of distances into a single one.
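As a concrete illustration of the agglomerative procedure just described, here is a small sketch using SciPy; the six 2D points are invented for the example, and the 'single' criterion anticipates the single-link distance discussed next.

```python
# Small illustration of agglomerative clustering: distance matrix, repeated merges, dendrogram.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0],
              [1.0, 1.1], [2.0, 2.0], [2.1, 2.05]])   # six made-up points

D = pdist(X)                          # condensed distance matrix between all points
Z = linkage(D, method='single')       # 'single' = single-link (minimum) distance
# Each row of Z records one merge: which two clusters were joined and at what distance,
# which is exactly the joining-distance axis drawn in the example above.
labels = fcluster(Z, t=2, criterion='maxclust')        # cut the tree into 2 clusters
# dendrogram(Z) would draw the tree.
```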
Probably the oldest alternative, and still quite popular, is the single-link distance: when you compare two clusters, you take the minimum distance between them as the distance between the clusters. How does it apply? In this case, the distance from cluster one to cluster two is the distance between the two nearest objects, which are five and four. Likewise, from cluster one to cluster three the two nearest objects are five and two, and from two to three they are points two and three. So I pick the minimum of all of these; in this case the minimum is between two and three, so I join these two clusters, consider them as a single one, and repeat. In the next step the distance is just the one between five and two, which are the nearest points, and I join them here. This is how single linkage works.

It has some advantages and some drawbacks. The advantage: imagine that you have something like this. You can obtain the two clusters; you will obtain a dendrogram where the two main branches define the two clusters. Single linkage can handle non-elliptical shapes quite easily: whatever shape your clusters have, you will recover them correctly. But if you have some noise, as in this case, the noise will destroy the definition of the clusters, because by looking only at minimum distances you are really sensitive to noise. When there is noise, single linkage tends to really elongate the clusters, so in the end you don't see a clear structure.

Another way of defining the distance between two clusters is what is called the complete link. It is the opposite: instead of the minimum, we take the maximum distance. In this case it doesn't change a lot, but it changes the definition of the distances. The distance between cluster one and cluster two is now defined by the two furthest objects, which are points three and six; the distance between clusters one and three is defined by the two furthest objects, six and one; and the distance between two and three is the distance between points one and four. So I again pick the minimum, and I join the two clusters at the minimum distance, which is this one. Then I continue once they are joined.

There is a question in the chat. Okay, let me just finish here and then I will reply. So, once you have that, you compute again the distance between these two clusters, which comes from the maximum distance, and then you join them. And that's all for complete linkage.

Let me read the question: "What happens if, for example, we have a line of three data points in which each two points are near each other, but there are two points that are farther away and do not fit the distance threshold to be considered a single cluster? Does that raise a problem in this approach?" Sorry, I don't fully understand your question. "Can you hear my voice?" Yes. "So the idea here is to find the two data points that are near each other, the nearest neighbors." Yes, but at first you start with all the data points as individual clusters. "So what happens if, for example, one and two are near each other, and there is number three that is near number two but is further away from number one?"
Okay, so the question is exactly what you said: there is a line where number one is near number two and number two is near number three, so we can consider one and two a cluster, and two and three a cluster, but three and one are not in the same cluster. It depends, because the real question is what a cluster is. The answer does not depend on a distance threshold; it depends on the method that you use for building your clusters. So the reply to your question is: it depends on the method. For example, if you use the single-link distance, you will obtain a kind of worm-like cluster, in which a whole line of points belongs to the same cluster. If you use complete linkage, it will probably split your cluster. [Alex:] If I may: usually the distance satisfies the triangle inequality, which means that the distance between points one and three in your example cannot be larger than the sum of the distances between one and two and between two and three. So if your distance satisfies the triangle inequality, then I think you have some guarantee that the situation you are thinking about does not arise.

There is another question: "I'm trying to understand: in your example where there was noise between the two clusters, which of the introduced clustering methods will result in two clusters?" Okay, let me finish this part: this clustering method, complete linkage for instance, will generate two clusters in that case. But let's continue seeing methods, and if it's not clear at the end of the lesson, I will come back to it.

So, for instance, if you have something like this, you will obtain the two clusters. Complete linkage gives more balanced clusters and is less susceptible to noise. But it has a problem: imagine that your data contain two clusters that are really unbalanced. This method will force them to be balanced anyway, so the balancing of the clusters would be imposed artificially by the clustering method, not discovered naturally from the data. With complete-link clustering, all the clusters tend to have more or less the same diameter, so small clusters are essentially ignored.

There are other distances that we will go through a little faster. One is the group average distance, in which you simply take the average distance between the points of the two clusters; it is a kind of compromise between single and complete link, but it is still biased towards globular clusters. Another one considers the centroid: if you remember the centroid from k-means, here you take the distance between two sets of points to be the distance between their centroids. There is also Ward's distance, which I only mention without explaining: it is probably among the most widely employed; the formula is a bit complex, but it is somehow a hierarchical equivalent of k-means, and that's why it is so popular. Whatever the choice, the problem is that the group average, centroid and Ward's distances all tend to be biased towards globular clusters: if your clusters are not globular, the result will probably not be what you want (a short comparison of these criteria is sketched below).
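A hedged sketch comparing the linkage criteria just mentioned on the same toy data; the toy data and the two-cluster cut are illustrative assumptions, and which criterion is "right" depends entirely on the cluster shapes you expect.

```python
# Compare the linkage criteria on a deliberately unbalanced toy data set.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),        # a small cluster
               rng.normal(3, 0.3, (80, 2))])       # a much larger one

for method in ('single', 'complete', 'average', 'centroid', 'ward'):
    Z = linkage(X, method=method)                   # raw coordinates, Euclidean distance
    labels = fcluster(Z, t=2, criterion='maxclust') # cut the dendrogram into 2 clusters
    print(method, np.bincount(labels))              # cluster sizes under each criterion
# 'single' follows elongated / arbitrary shapes but is fragile to noise; 'complete',
# 'average', 'centroid' and 'ward' tend to give more balanced, globular clusters.
```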
On the other hand, the method that lets you deal with clusters of any shape, the single link, has the problem that if you have noise you are going to obtain bad results. So it is not easy to decide which hierarchical clustering to use. I don't know if there are questions about these methods. No questions in the chat.

So far I have explained what are considered the classical methods for clustering. These classical methods are still widely used, but there are hundreds, maybe thousands of different clustering methods, so we cannot see all of them. And as I said, all the methods I have explained so far have problems; some work really well in particular cases, but there is no universal solution for clustering. So people keep developing clustering methods, and today I will try to explain two methods, if we have time, that are quite successful and have a kind of natural connection to physics.

One method is expectation-maximization clustering. What is the basis of this method? You have a model: you consider that your data came from an underlying probability density function. Then you assume a functional form for this probability function, and you estimate its parameters by maximum likelihood. Let's look at one example. Imagine that you have these two sets of points. (And I think, my dear, I will answer your question at the end of the lecture.) If you have these points, what you can do is say: okay, I assume that they came from a probability density function that is the sum of two Gaussian functions. I place my Gaussian functions randomly, and then I optimize the parameters of the functions. Once I have optimized these parameters, I can partition my data according to which Gaussian function generated each point. This is a really nice way of clustering, because it allows you to interpret your clusters as coming from an underlying probability function. But you will see that it has problems.

First of all, let's see how you do it. You use an iterative procedure to compute the maximum of the likelihood. In the expectation step, you estimate the assignments of your points given the observed data; then, in the maximization step, you estimate the parameters. Let's see it in an example. We have a mixture of K Gaussians. A mixture of K Gaussians is defined by a weight for each Gaussian and a Gaussian function that depends parametrically on some parameters. Of course, if the Gaussian functions are normalized, the weights must sum to one, and the parameters are nothing other than the means and the covariances. So what I assume is that all my points come from this mixture of K Gaussians, and I try to estimate the parameters, the weights pi and the parameters of each Gaussian, by maximizing the likelihood, or the log-likelihood. By maximizing that, I obtain the pi's and the parameters. How to do it? That's where the method starts working (a minimal fit with an off-the-shelf EM implementation is sketched below).
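As a minimal sketch, this is how you could fit the two-Gaussian mixture from the example using scikit-learn's off-the-shelf EM implementation (GaussianMixture); the data array X and the choice of two components are assumptions of the example, not part of the lecture.

```python
# Fit a 2-component Gaussian mixture by EM and read off the fuzzy partition.
from sklearn.mixture import GaussianMixture

# X is assumed to be an (n_points, n_features) NumPy array.
gmm = GaussianMixture(n_components=2, covariance_type='full', n_init=5).fit(X)
hard_labels = gmm.predict(X)          # hard partition: most likely Gaussian per point
soft_labels = gmm.predict_proba(X)    # fuzzy assignment: P(point | each Gaussian)
print(gmm.weights_, gmm.means_)       # the fitted mixing weights pi_k and means mu_k
```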
The idea is that each point is probabilistically assigned to, or generated by, one Gaussian. So you can estimate the probability that a given point was generated by a given Gaussian by means of this formula, where you take the values of the Gaussian functions at the data point. Once you have this probability, you can use it to estimate the effective number of points of each Gaussian, which is nothing more than the sum over all points of the probabilities of belonging to that Gaussian. With that, you can compute a mean and a covariance, and once you have these parameters, you can recompute the weights. This is what expectation-maximization does: you initialize your parameters, the means, the covariances and the mixing coefficients; you evaluate the log-likelihood; then you evaluate the responsibilities, re-evaluate the parameters from them, evaluate the likelihood again, and stop at convergence. At the end you have the probabilities that your points were generated by each Gaussian, which is a kind of fuzzy partition of your data, and you also have a model.

The problems are the following. First of all, the method is really similar to k-means, and since k-means is really, really fast, people usually initialize the parameters of expectation-maximization by running k-means first. The real problem is that if your model is not realistic, if the functional form is wrong (nothing says that your points must be Gaussian, for instance), or if K is not correct, your results will fail. And then there is the additional problem that this expectation-maximization is a local optimization, so there is no guarantee that you will find the global maximum of the likelihood.

So before going to the next method, the last one, let me see if there are questions. (Okay, I will reply to you at the end.) Are there any questions about this expectation-maximization algorithm? If not, I'm going to explain another method, and we will finish there.

The last method that I want to explain is called clustering by fast search and find of density peaks. It is similar in spirit to expectation-maximization in the following sense: if you have this distribution of points, you will agree with me that there are these five classes, and that they come from peaks in the density of points. That means there is an underlying probability density function, which in this case is not Gaussian; whatever form it has, your clusters are defined by the peaks of this function, and what you try to do is identify these peaks so that you can identify the clusters. The difference is that here you don't have a model: you just say that your clusters are peaks of some probability density function.

So how does the method work? One concept that we need for understanding the method is what is called delta, which I will define in a moment. Imagine that you have these points in two dimensions. The first thing you can do is compute the local density around each point. Computing the local density can be done quite easily, for instance by counting the number of points within a given radius, or by looking at the distance to the k-th nearest neighbor (a minimal sketch of the first option is below).
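A minimal sketch of that simplest option, counting neighbours within a cutoff radius; the function name and the cutoff d_c are illustrative assumptions.

```python
# Local density of each point = number of other points closer than the cutoff radius d_c.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def local_density(X, d_c):
    """X: (n_points, n_features) array; d_c: cutoff radius (the method's only free parameter)."""
    D = squareform(pdist(X))            # full pairwise distance matrix
    return (D < d_c).sum(axis=1) - 1    # subtract 1 to exclude the point itself
```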
There are also more advanced options, like kernel density estimation or k-nearest-neighbor estimators, but let's forget about those. In practice you can simply count the number of points around each point: you will agree with me that in this case the density of this point is seven, the density of point eight is five, because you have five points there, the density of point ten is four, because you have four points within the radius, and so on.

Then, for each point, you compute the distances to all the points with higher density and take the minimum value. In this case, the distance from eight to one would be this one, while the distance from ten to nine would be this one, right? This is the key concept for identifying the density peaks. Why? Because if I plot these distances as a function of the density, the density peaks are the outliers of this graph. This happens because the density peaks are, by definition, points that are dense, so they sit on the right of the graph, but they are also far away from any point that is denser than themselves: if it is a density peak, by definition it cannot have a denser point nearby. So by computing this delta, the distance to the nearest point of higher density, you can identify the density peaks, the centers of the clusters. Then, in this algorithm, what you do is follow the density profile by assigning each point to the same cluster as its nearest neighbor of higher density. So you identify these two points as cluster centers, and then you follow the density profile in order to assign all the points to a given peak. Clear? Notice that with this method I did not need to tell the algorithm the number of clusters: the clusters just pop up. This is an advantage common to many density-based algorithms. Are there questions? Okay, let me repeat: delta is nothing more than the distance to the nearest point with higher density. Of course this quantity is undefined for the point with the highest density, because there is no point denser than it. So in that case what is done is to assign delta a value that is big by definition: you compute the deltas of all the other points and then say that the delta of the point with highest density is, let's say, 10% larger than all the others.

So this is barely even an algorithm. You have a distance matrix between the data points, you compute the densities (either the number of points within a radius, or a kernel density estimate, something like that), you compute delta, which is the minimum distance to a point of higher density, and then you plot the decision graph, delta as a function of the density. The outliers of this graph are your cluster centers (a sketch of these steps is below). The only free parameter is how you compute the density, for instance the cutoff radius if you use one, and the method is relatively insensitive to it, in the sense that what matters is not the absolute value of the density but the relative one.
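Putting these pieces together, here is a hedged sketch of the core of the procedure, not the authors' reference implementation: one helper computes delta and the index of the nearest denser neighbour, the other assigns labels once you have picked the peaks by eye from the decision graph. The names and the 10%-above-the-rest convention for the densest point follow the description above; everything else is an illustrative assumption.

```python
import numpy as np

def deltas(D, rho):
    """For each point: distance to the nearest point of higher density, and that point's index.
    D is the full pairwise distance matrix, rho the local densities."""
    n = len(rho)
    order = np.argsort(-rho)                 # points sorted from densest to least dense
    delta = np.zeros(n)
    nearest_denser = np.full(n, -1)
    for rank in range(1, n):
        i = order[rank]
        denser = order[:rank]                # all points with higher density than i
        j = denser[np.argmin(D[i, denser])]
        delta[i], nearest_denser[i] = D[i, j], j
    delta[order[0]] = 1.1 * delta.max()      # densest point: delta set "big", 10% above the rest
    return delta, nearest_denser

def assign_clusters(rho, nearest_denser, centers):
    """centers: indices picked by hand as the outliers of the decision graph (rho vs delta).
    Assumes the globally densest point is among the centers (it always has the largest delta)."""
    labels = np.full(len(rho), -1)
    for k, c in enumerate(centers):          # the chosen peaks seed the clusters
        labels[c] = k
    for i in np.argsort(-rho):               # sweep downhill along the density profile
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels
```

In practice you would compute rho and the distance matrix, call deltas, plot rho against delta (the decision graph), pick the outliers by eye as centers, and then call assign_clusters.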
So, doing that: if I have this probability density distribution and I sample points from it, imagine that these are my data points. I compute the decision graph, I obtain this, and if these five points are my outliers, I obtain this cluster partition, okay? The advantages of this method are that it is well suited to whatever shape your clusters have, and it is quite robust to noise, because by focusing on regions of high density the noisy regions are not so important. And the outliers, you find them just by looking at the decision graph: looking here, I see that these five are outliers. In this version of density peaks, which is already a bit old, you identify these peaks by hand.

Now, the black points, which I haven't explained yet. To identify them we use the concept of the halo. What is the halo? Imagine that you have this density distribution, and instead of looking at it as a density distribution, look at it as if it were a mountain range. If you see a climber here, I think you will agree with me that this climber is climbing this peak, right? If you see a climber there, you will agree that he is climbing that peak. But what happens if the climber is here, near the pass? You don't know which peak this climber is going to: he can go to this peak or to the other one. So what we do to identify the noise, the halo points, is this: I identify my peaks and my clusters, but for the points whose density is below the barrier between the two peaks, the saddle point of the distribution, I am not sure which peak they are going to climb. So I assign them to the halo, as noise, as outliers. To do that you need to define what the border between clusters is; it is a bit technical, and I'm not going to explain it now, because I want to reply to the questions.

So let me try to reply to the questions now. Before the general questions: are there questions about density peaks? If there are no questions, I will take the general ones. "Are there any measures to help us choose the best clustering method based on specific data?" I think Matteo already replied to you. The problem is that, in practice, the approach is: let's use clustering methods that are already known to work for my kind of data. So what you do is try different methods on data for which you already have an answer, see which method fits that data best, and then apply it to your own data. That's what you do in practice, because if you don't have that kind of reference data, well, it depends on whether or not you have an intuition about your data. Imagine that your data come from a molecular dynamics simulation; what you usually have are clusters that are defined by density peaks, so there you usually use density-based methods. But if your data come from other kinds of generative models in which you know something about the structure, for instance in economic data (correct me if I'm wrong), I think you usually assume that the clusters come from a mixture of Gaussians, or a mixture of some functions. Well, your face does not look really convinced.
[Alex:] Oh, I mean, there are all sorts of clustering methods that are used. In particular, there is a theorem telling you that there is no single best clustering method. Yeah, that's the problem. There are many criteria for deciding what the best clustering method is for your data, based on internal criteria, external criteria, and then also one that we proposed based on information maximization. Yeah, as you said, it's a jungle. The problem is that in two hours it is difficult to explain all this. Let's say that in practice, what most people do is take a method that works on data similar to the ones they have; if that is the case, you can be somewhat confident that your method will not give you nonsensical results.

And then there is the other question: "I'm interested in the hierarchical clustering of a semantic data set. Starting from a feature vector which describes each document, in your opinion, is it better to create a similarity matrix between the elements of the database and cluster the resulting network, or could it be better to apply hierarchical clustering directly on the high-dimensional data given by the feature vectors?" It depends on how much data you have and how many features. In general, the problem is that if your data set is not really big, you have to deal with the curse of dimensionality. It also depends on the intrinsic dimension of your data: not the number of features, but the intrinsic number of features that your data have. If your intrinsic dimension is low, I would go directly to working with the feature vectors; the clustering algorithm should work well. If your intrinsic dimensionality is large compared with the number of data points, then you somehow need to reduce the dimensionality, because otherwise the curse of dimensionality will kill you and the distances will be meaningless: many of the distances between your data points will be almost the same, and you will obtain partitions that are not meaningful. It is really an equilibrium between the number of data points that you have, the intrinsic dimensionality of your data, and the method that you employ. But I would try to check the intrinsic dimensionality of the data before making a decision.

And I think that's all. If there are no more questions, we can finish here. Thank you very much, Alex. You're welcome.