Okay, it is Alex Rodriguez from ICTP. So I'll simply hand over the floor to Alex.

Hello, hello to everybody. As usual, it's a pleasure to be here and see so many people. Let me start sharing my screen. Okay, the title of my talk is clustering molecular data. I will try to start from the basics. I will not go into advanced issues, but we will try to cover the fundamentals of clustering molecular data.

First of all, a definition of what clustering is. Clustering is an unsupervised machine learning technique. Nowadays machine learning is on everybody's lips. Unsupervised machine learning is a bit particular because what we try to do is to learn from the characteristics of the data, from its inner structure. We don't have any prior information, any prior classification of our data, but we want to classify it automatically, just from the data itself.

Let's take an example. Imagine that my data is in two dimensions and I plot it like this. Can you see my pointer? If you look at this data, I think you would agree with me that there are two clusters, right? That is, two regions of high density of points. And why is that so important? Well, first of all, we want a computer to do it for us; we cannot inspect the data by eye. We want to do it even if our data is high dimensional: here the data is in two dimensions, but when your data is in really high dimensions, the task gets tougher and tougher. And we want our method to work even if the data has odd shapes; we will see later what I mean by that.

You could ask me, quite fairly, why I am talking about clustering in a biology-oriented talk. Let me give some cases of applicability. Imagine that you have molecular dynamics simulations. The one on the left, for instance, is a molecular dynamics simulation of a peptide, and you have a lot of data there that you need to analyze. One way of analyzing it is to obtain the features of the energy surface, for example the metastable states. These metastable states correspond to regions explored many times during your simulation, because if you just pass once through a state, you cannot even say that it is metastable. If during your run you visit a given region many times, that region corresponds to a metastable state, and obtaining these metastable states directly from the simulation is done by clustering.

Another case of interest: imagine that you have a library of compounds that you want to test against some kind of illness, some kind of target. What you have to do is assess the chemical diversity, and that is usually done by a scheme like the one on the right: you start with all the commercial products, something like seven million compounds, reduce them to the ones that have a unique structure and do not have undesirable properties, or filter by drug-likeness. But there are still a lot of them. So what you do is cluster these roughly four million remaining compounds and take the representatives of the clusters found in the library as the ones you are going to test. In this example, that gives about 30,000 compounds, which is still a lot.

Another thing that you can do is cluster protein sequences. For instance, imagine that you have many protein sequences and you want to see whether they are related to each other. You perform some kind of clustering, and you can obtain, as in this example, a tree in which the sequences that are similar to each other are grouped together.
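As a toy illustration of the two-dimensional example from the start of the talk, the following is a minimal scikit-learn sketch of data with two regions of high density; the sample size and the centers are arbitrary choices of mine, not values from the talk.

```python
# Toy data with two dense regions, like the 2D example above.
from sklearn.datasets import make_blobs

# 500 points drawn around two arbitrary centers; y holds the "true"
# labels, which a clustering method must recover without seeing them.
X, y = make_blobs(n_samples=500, centers=[(0.0, 0.0), (5.0, 5.0)],
                  cluster_std=1.0, random_state=0)
print(X.shape)  # (500, 2): 500 objects described by two features
```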
Is everything okay? Please interrupt me if there are questions; otherwise I will just keep talking.

So how do we perform this clustering? The general pipeline starts from a set of data described by a set of features. Then you perform your clustering on this set of features. Then you validate the clustering by assessing whether the groups are meaningful, and finally you interpret the clusters based on your expertise, and this generates knowledge. Of course, the validation feeds back into the clustering: if you try to validate and fail, you will probably try another clustering technique. But you may also go back earlier in the process and compute different features in order to validate your groups.

Let's start with the data. What are its characteristics? Our data can be of many types. One characteristic is how many objects there are in the dataset; in the case of simulations, this depends on the computational power you have at your disposal and also on how much sampling you need for your analysis. Then, how is your data described? In the case of structures, you can describe your proteins by their coordinates, which are real numbers. But in the case of chemical structures you have many different descriptors, and these can be integers, real numbers, letters, some kind of classification, even graphs: all kinds of descriptors. In the case of sequences, you have just letters. As you can imagine, the kind of data you have at your disposal critically influences the kind of analysis you can perform. In most cases you need to transform the features.

So how can you transform your features? There are three things related to feature transformation: feature selection, dimensionality reduction, and distance computation. Let's see what I mean by that.

By feature selection I mean an expertise-based choice in which we keep only the features that are useful for our purpose. If we are talking about an atomistic simulation, you can wonder whether you should include the water molecules. From Ali's talk yesterday, you saw that sometimes they are important, so why not? I need my expertise to decide whether, in my case, they matter. I can even simplify my system further and decide that I only want the main chain of my protein, ignoring the side chains. Of course, I am talking about the analysis: in the simulation you will have the water and the side chains, but when you analyze, it is often easier not to consider all these variables. Also, in the case of a molecular database, if you are an expert you can decide which descriptors are relevant for your problem. I don't know if you have ever seen a database of drugs or other molecules, but usually there are thousands of possible descriptors. Some of them are heavily used in drug design: for instance the logarithm of the partition coefficient, but also pharmacophore profiles, connectivity indices, and so on and so forth. But maybe you are not interested in them, and you can ignore them. In the case of protein sequences, there are often segments that you should not consider in your analysis, and people who study sequences routinely ignore parts of the protein in order to obtain relevant results.
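As an illustration of this kind of expertise-based selection on a trajectory, here is a hedged sketch using the mdtraj library; the file names are hypothetical placeholders, not files from the talk.

```python
# Expertise-based feature selection on an MD trajectory with mdtraj:
# keep only protein backbone atoms, dropping waters and side chains.
import mdtraj as md

traj = md.load('traj.xtc', top='system.pdb')  # hypothetical file names

# The simulation contains everything; for the analysis we keep only
# the protein backbone.
backbone_idx = traj.topology.select('protein and backbone')
backbone = traj.atom_slice(backbone_idx)
print(backbone.n_atoms, 'atoms kept out of', traj.n_atoms)
```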
Another way of transforming the features is what is called dimensionality reduction. What is dimensionality reduction? Imagine that your data can somehow be projected into a lower-dimensional space of real numbers. This simplifies the problem enormously, so it is something that is very often done; in many cases your data is so complex that without some kind of dimensionality reduction you are not going to be able to perform your analysis at all. There are many methods for dimensionality reduction; I put here a rather old review, but for general purposes it is still a good starting point.

The point is that, however many methods there are, they can be classified into a few big categories. Some of them are linear, and these are the easy ones. If your data lies roughly on a line or a hyperplane in a higher-dimensional space, it can easily be projected onto that subspace using principal component analysis. In this case we pass from two dimensions, x and y, to just one dimension, and we keep almost all the information. But if your data does not lie on a hyperplane, for instance here, where you would agree with me that this is a two-dimensional surface, a two-dimensional manifold, but embedded in a three-dimensional space, you need a complex, nonlinear transformation, and PCA cannot work. A lot of the research on dimensionality reduction has been devoted to dealing with these kinds of cases. There are many methods; I will just mention some of them without explaining them, because I would need, let's say, four hours to explain many of them properly. The ones I mention here are extremely widely used, and if you are interested in them, we can talk about that.

And there are even more complex cases. Imagine that you have something like this. You would agree with me that this is a kind of line, but it is topologically complex, so complex that you cannot map it onto a straight line, right? That is something that can be really, really difficult to reduce in dimensionality.

Sometimes PCA works. In a real paper, where we studied the conformational profile of a small peptide of ten amino acids, by projecting our simulation in this way we found two minima, that is, two metastable states, which were later confirmed by the experimentalists. In other cases it does not work. For instance, if you try to project the folding trajectory of the villin headpiece into two dimensions, even using a nonlinear projection like Isomap, you cannot recover a physically meaningful projection. To see that, what we did was to plot the two-dimensional projection and color each snapshot according to a folding coordinate Q, where Q = 1 means folded and lower Q means unfolded. You can see that folded snapshots lie quite near unfolded ones; there is some separation, but it is not really clear, and there is superposition between folded and unfolded points. It means that this projection is not capturing the folding process.

So, to summarize, what we do in dimensionality reduction is to project our data into a lower-dimensional space. On the plus side, if we are able to project into 2D or even 3D, we can perform the analysis visually; and even if we cannot project into such a low-dimensional space, projecting into a somewhat lower-dimensional space still simplifies the analysis. However, if we do not know exactly what we are doing, dimensionality reduction can lead to an important loss of information, and it is not always easy to perform. As I told you, if the data does not lie on a hyperplane, or the manifold containing the data is topologically complex, it can be very hard to obtain a meaningful projection.
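To make the linear-versus-nonlinear distinction concrete, here is a minimal scikit-learn sketch on a swiss-roll toy set, a curved two-dimensional manifold embedded in 3D like the surface just described; the toy data and parameter choices are mine, not from the talk.

```python
# Linear vs nonlinear dimensionality reduction on a curved 2D
# manifold embedded in 3D (a swiss roll standing in for real data).
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

X, t = make_swiss_roll(n_samples=1000, random_state=0)  # shape (1000, 3)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)  # linear: cannot unroll the manifold
print(pca.explained_variance_ratio_.sum())  # fraction of variance kept

# Isomap approximates geodesic distances along the manifold, so it
# can unroll the surface where PCA fails.
X_iso = Isomap(n_components=2, n_neighbors=10).fit_transform(X)
```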
Finally, the last point I want to mention related to feature transformation is distance computation. Why is distance computation important? Because most clustering methods rely on quantifying how similar the objects are to each other. You want to put together things that are similar, so you need to quantify similarity, and for that you compute a distance: things that are similar will be at a low distance, things that are different will be at a high distance. Depending on the data we have, we have different distances at our disposal. When dealing with real numbers, we can use the Minkowski distances, which with p equal to two reduce exactly to the Euclidean distance; the family is a generalization of it. You can also consider more complex distance definitions, like the cosine distance. When dealing with proteins, we know that the rotation of a protein configuration is not important for us, and neither is the translation, so we can use the root mean square deviation (RMSD), in which you first superimpose the two protein configurations and then compute the deviation. You can also compute distances in the dihedral space of the protein, using the dihedral angles of the backbone. When dealing with compound libraries you can use various distances, but in many cases, if you have binary descriptors, you are constrained to use the Jaccard distance. And for protein sequences you compute the Hamming distance, in which you simply count the number of positions at which two sequences differ.
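Here are hedged examples of these distances computed with SciPy and plain Python; the small arrays and sequences are made up for illustration. For protein conformations one would instead superpose the structures and compute the RMSD, for example with mdtraj's md.rmsd.

```python
# Distance computations for different data types.
import numpy as np
from scipy.spatial.distance import pdist

X = np.random.rand(5, 3)                 # real-valued descriptors
# Minkowski distance: d_p(x, y) = (sum_i |x_i - y_i|^p)^(1/p)
d_eucl = pdist(X, metric='euclidean')        # Minkowski with p = 2
d_mink = pdist(X, metric='minkowski', p=4)   # a more general choice
d_cos = pdist(X, metric='cosine')            # cosine distance

B = np.random.rand(5, 64) > 0.5          # binary fingerprints
d_jac = pdist(B, metric='jaccard')           # Jaccard for binary data

# Hamming distance between sequences: fraction of differing positions.
s1, s2 = 'ACDEFGHIK', 'ACDEYGHIK'
d_ham = sum(a != b for a, b in zip(s1, s2)) / len(s1)
print(d_ham)  # 1/9: one letter out of nine differs
```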
So now that we have transformed the features, we can talk about the clustering itself. What is a cluster? Imagine that you have this set of points. In this set of points you can see two clusters, but depending on what you are looking for, you could also see three, maybe a different three, maybe four, or even five. So our clustering result will depend on the definition we adopt for a cluster. There are many kinds of clustering; in general they can be divided into three types, what we call flat, fuzzy, and hierarchical clustering, depending on the output we obtain.

In flat clustering, we obtain a hard partition: each element is assigned to a given cluster. In fuzzy clustering, we also have clusters, but each element is assigned to them with a degree of membership. In hierarchical clustering, instead of a single partition, we obtain a tree, a tree that describes the hierarchical structure of our data.

Let's see this with examples. This is a case of flat clustering, in which red points belong to cluster one, yellow points belong to cluster two, and so on. In fuzzy clustering, say we have the same five clusters, but now, for instance, this point belongs 90% to cluster one, 9% to cluster two, and 1% to cluster five, while this one in the middle belongs 45% to cluster one, 50% to cluster two, and 5% to cluster three, because it lies near the boundary. So even though you have a cluster partition, the elements have degrees of membership to the clusters. In hierarchical clustering, what we want to obtain is a tree that summarizes all this information. For instance, in this tree, all the elements belonging to cluster one are in one branch, those of cluster two are in another branch connected at a certain level, while the other three clusters are connected at another level. That is the idea of hierarchical clustering. You can recover different flat partitions from it by flattening the tree: you cut the tree and say, okay, if I cut here, I recover this flat partition. But to obtain this nice partition I had to cut along a rather strange line; it is not trivial to do. So hierarchical clustering is not always trivial to interpret.

Within flat clustering we have two kinds of methods that both produce a flat partition but differ in philosophy. On the one hand we have partition methods, in which we look for a partition that is optimal according to some criterion. In density-based methods, on the other hand, the clusters we look for are regions of high density.

As an example of partition methods, let me describe k-means clustering. In k-means, you first randomly pick K points as centers. Then you assign each point to its nearest center, and then you recompute each center as the average position of all the points belonging to its cluster. By iterating this, the centers move until we reach convergence, and this gives us a partition.

An example of density-based clustering methods, which is also a flat clustering method, is density peaks clustering. In density peaks clustering, we first compute the density of each point; then we plot this density against the distance from the nearest point with higher density. When you do that, the centers of the clusters, the highest-density points of each cluster, appear as outliers in this graph. You can pick them, and once you have picked them, you assign all the remaining points to these clusters. To assign a point, you follow the density from lower to higher until you arrive at a density maximum, and you assign the point to the same cluster as the maximum you arrive at. Doing this, in this case, we obtain three clusters, corresponding to the partition shown. You can also obtain the borders between the clusters, which is useful for other applications.
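Here is a minimal NumPy sketch of the k-means loop just described; the convergence test, the handling of empty clusters, and the iteration cap are my own choices for a self-contained example. In practice one would typically use a library implementation such as scikit-learn's KMeans.

```python
# k-means: pick K points as centers, assign, recompute, iterate.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. randomly pick k data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute each center as the mean of its points
        #    (keeping the old center if a cluster ends up empty)
        centers_new = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(centers_new, centers):  # converged
            break
        centers = centers_new
    return labels, centers
```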
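And here is a hedged sketch of the density peaks procedure; the cutoff density estimator and the automatic selection of centers by the rho-times-delta product are simplifications of my own, since in the method as described in the talk the centers are picked by eye as outliers in the decision graph.

```python
# Density peaks clustering (sketch): density via a cutoff, delta as
# the distance to the nearest denser point, centers as outliers,
# remaining points inherit the label of their nearest denser neighbor.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def density_peaks(X, d_c, n_centers):
    D = squareform(pdist(X))             # pairwise distance matrix
    rho = (D < d_c).sum(axis=1) - 1      # density: neighbors within d_c
    n = len(X)
    order = np.argsort(-rho)             # indices by decreasing density
    delta = np.empty(n)
    nneigh = np.zeros(n, dtype=int)      # nearest point of higher density
    delta[order[0]] = D[order[0]].max()  # convention for the densest point
    for rank in range(1, n):
        p = order[rank]
        denser = order[:rank]            # every point denser than p
        j = denser[np.argmin(D[p, denser])]
        delta[p], nneigh[p] = D[p, j], j
    # Centers: largest rho * delta (picked by eye in the decision graph
    # in practice). This assumes the densest point is among them, as it
    # virtually always is.
    centers = np.argsort(-(rho * delta))[:n_centers]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_centers)
    for p in order:                      # in decreasing density, so the
        if labels[p] == -1:              # denser neighbor is already labeled
            labels[p] = labels[nneigh[p]]
    return labels
```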
These two kinds of methods are useful for different purposes. In real-world applications, imagine that you are studying the villin headpiece. What you may want to obtain is a kind of graph showing not only the conformers that are most visited along your simulation, but also the relationships between them, because this can be related to the free energy profile. This can be done with density-based clustering, and also hierarchically. It is also important when you build, for instance, a Markov state model of a protein simulation. This is a Markov state model of the villin, and it uses the previous clustering output as input in order to determine which conformations are stable. With a Markov state analysis you can even obtain the transition times between the different conformers.
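As a sketch of what that Markov-state step looks like downstream of the clustering: given the discrete trajectory of cluster labels, one counts transitions at a chosen lag time and row-normalizes them into a transition matrix. The toy label trajectory and lag below are made up, and real MSM packages such as PyEMMA or deeptime do this far more carefully, for instance with reversible estimators.

```python
# From cluster labels to a Markov-state transition matrix.
import numpy as np

dtraj = np.array([0, 0, 1, 1, 2, 2, 1, 0, 0, 1])  # cluster label per frame
n_states, lag = dtraj.max() + 1, 1                # toy values

C = np.zeros((n_states, n_states))
for i, j in zip(dtraj[:-lag], dtraj[lag:]):
    C[i, j] += 1                                  # transition counts
T = C / C.sum(axis=1, keepdims=True)              # row-stochastic matrix
print(T)  # T[i, j]: probability of moving from state i to j in one lag
```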
When you analyze chemical diversity, you can also obtain a tree in which your compounds are grouped into families. In this case, and this work is not mine, a large set of chemical compounds coming from fungi are related according to the fungal family they came from. In proteins, too, you can find these kinds of divisions, and you can see that the clustering can somehow mirror a classification made by humans. Why is this important? Because in the future, if we have a lot of sequences and no humans to perform this classification, we can trust the clusters to perform it. And I think I am almost out of time, so thank you very much. Questions are welcome.

So please unmute yourself and go ahead, or you can write in the chat as yesterday. Apparently there are no questions.

Hi, can you hear me?

Yes.

Could you describe the most recent problems that people try to attack, for instance in protein structure prediction, and, along that line, what are the latest developments in that field?

Sorry, I didn't get the question.

I'm asking: protein structure prediction is very popular, a lot of datasets are out there, and a lot of people are attempting different things. What are the most recent things people do in structure prediction?

Okay, in protein structure prediction. Yes, clustering is used there in many ways. When the structures come from a simulation, if you have a good method or a big computer, you sample many possible states of the protein, and by clustering you obtain those that are representative; then you can apply other techniques to check how likely they are to be the real protein structure. In other cases, when you do not generate the structures by molecular dynamics but with other methods, Rosetta-type methods and the like, you still usually cluster these structures and take the representatives, because otherwise you are dealing with hundreds of thousands of possible structures. Did I answer the question?

Actually, what I wanted was not really how we do it, but some examples of applications where this clustering is used, some recent examples.

Okay, in the field, I think in almost all recent papers analyzing long molecular dynamics simulations you need to perform clustering at some point. The method will depend on the preferences of the authors, but if you want to identify microstates, as far as I know there are no other methods nowadays for identifying them, so any paper analyzing a long molecular dynamics simulation will be an example.

Hi, Alex. More of a curiosity than a question. When you speak about the density estimation used for clustering: I was recently reading about normalizing flows. Did you ever look into them? Because I think they are a nice way to estimate densities, and maybe with that you could do some clustering once you have your density estimate.

Okay. Indeed, I didn't talk about density estimation, but...

Yeah, but you had the density...

Yes, in the density-based methods you of course need to estimate the density. I didn't know about normalizing flows, so I will look at them; I think they can be interesting. Just to say that the better the estimation of the density, the better the clustering will be, so that is a really good suggestion.

Yeah, and you can even generate samples from the density you estimate afterwards, so it can do two interesting things, I think.

Interesting. Thank you.

Thank you, Alex. Okay, I think we need to move on. Alex will be on gather.town this afternoon for more interactions. So thank you, Alex, and we move on to the next speaker.