Good morning everybody, and welcome to the webinar series. Today we have webinar number 44, and it will deal with clustering free energy landscapes from molecular dynamics simulations. The speakers are Lucie Delemotte and Annie Westerlund. I am hosting it together with Julian Singh from the University of Edinburgh. Today we have exactly two speakers. Lucie Delemotte is professor in biophysics at the Royal Institute of Technology (KTH), and she moved here in 2016 after postdocs at Temple University in Philadelphia and in Lausanne in Switzerland. Her main research focus is allosteric regulation, as well as the development of protocols to describe these phenomena with quantitative methods. Annie has a background in mathematical engineering and is very interested in complex adaptive systems. She joined Lucie's group in 2016, where she started her PhD. She is interested in developing and applying data analysis methods, and her aim is to understand how protein conformational dynamics works together with ion channel gating and allosteric pathways. Now I will give the word to Lucie to start the webinar.

Thank you, Alessandra, for the very nice introduction. It's a great pleasure to be here today and have the opportunity to discuss how we can use clustering to make sense of molecular dynamics simulations, so I would like to thank BioExcel for this opportunity. We have a fantastic tool in molecular dynamics simulations, because they allow us to look at the evolution of dynamical systems over long times at a very fine resolution, both in time and in space. Unfortunately, they can be quite difficult to interpret because of their noisiness, as you can see in this movie of calmodulin, a calcium-sensing protein. We will explore in this webinar how we can use clustering to make sense of this noisy data and to uncover mechanistic insights about our proteins of interest, or biomolecules in general. Before I start, I just want to give a little bit of background on the vocabulary, the words that we will use in this presentation, and stress that when we talk about configurations, we mean the arrangement of atoms in three dimensions, and when we talk about states instead, we are talking about collections of similar configurations. So in this simple picture that you can see to the right of the screen, if ligands are represented by these gray squares and the receptor is represented in green, you can see that these different configurations are distributed into two states: the bound state, here in blue, is when a ligand is bound to the receptor, and the unbound state, here in green, is when there is no ligand bound to the receptor. And if this is the entire landscape of configurations that can be explored by the system, then we can get the probabilities of the states and say that the unbound state is twice as likely as the bound state, which enables us to calculate the free energy, or the relative free energy, of these states according to the Boltzmann distribution. If we do this calculation of the probability and the free energy along a specific degree of freedom that is of interest for our system, which could also be called an order parameter or a collective variable, then we can recover a so-called free energy landscape, as you can see here in yellow. The minima in this free energy landscape will be metastable states, for which we will use here the term core states, while the states that are localized on the free energy barriers will be called transition states.
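To make the numbers concrete, here is a small worked example of that Boltzmann relation; the 300 K temperature is an assumed value for illustration.

```python
import numpy as np

# Worked example of the probability/free-energy relation described above:
# if the unbound state is twice as likely as the bound state, its free
# energy is lower by k_B * T * ln(2). Temperature assumed to be 300 K.
kB = 0.0019872041            # Boltzmann constant in kcal/(mol*K)
T = 300.0                    # K (assumed for illustration)
p_unbound, p_bound = 2.0 / 3.0, 1.0 / 3.0
dF = -kB * T * np.log(p_unbound / p_bound)
print(f"F(unbound) - F(bound) = {dF:.2f} kcal/mol")   # about -0.41 kcal/mol
```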
So now that we have decided what we define as states, I want to talk a little bit about clustering. Clustering, in data analysis, is assigning unlabeled data to groups, and if we extend this definition to our field in statistical physics, clustering is basically choosing the definition of a state. When we do clustering, it hinges on measuring similarity or dissimilarity between configurations, and we can do this in basically two ways: in a structural manner or in a kinetic manner, where the latter means measuring whether configurations are close to one another in time. In this webinar we will not talk about kinetic clustering; instead we will focus on structural clustering. Measuring similarity can be done by considering the similarity in terms of Cartesian coordinates, which might seem like the most straightforward choice. Unfortunately, it is often dependent on the type of alignment that we use, and choosing an alignment is not trivial in the general case. It can instead be based on internal coordinates, which is more robust, or rather rototranslationally invariant; those could be interatomic distances, dihedrals, or similar. And then we can also measure similarity in a reduced dimensionality, that is, along collective variables, as I was describing in the previous slide. This is what is usually done in practice, because doing clustering in the high-dimensional space usually becomes impractical: you need a large number of configurations to be able to cluster there. So most of the methods that I will describe today will be based on measuring similarity using collective variables, and I will mention when this is not the case.

So now I want to discuss how the clustering is actually done in practice. This talk will be divided into two parts. I'll start by describing seven clustering methods that are available out there, and for each method I will describe the principle and the algorithm, list the free parameters that need to be picked, and describe how they can be picked in an advantageous manner. In the second part, I will hand over to Annie Westerlund, who will describe a new clustering method that she invented, called Inflection Core State clustering, or InfleCS. She will provide you with a comparison between the seven methods that I will have described and her method, on toy models for which we know the ground truth. And then finally, she'll show a short tutorial on how we apply InfleCS to a calmodulin data set.

So let's dive in. The first method I want to mention is k-means, which is probably the most popular clustering algorithm out there. k-means is basically a four-step algorithm. We initiate a number of centers at random, k being the number of centers; in the example that I'm showing here, we have three centers initiated at random. Then we create the k clusters by assigning each point to the nearest mean, measuring the distance between each point and the means and saying, this one is closest to the blue mean, so it belongs to the blue cluster, and these ones are closer to the orange one, so they belong to the orange cluster. With this done, we move each mean to the centroid of its cluster and just repeat these steps until convergence.
So you can see, for example, that at this step, this point that was originally in the orange cluster was actually closer to the yellow center, so it became labeled as yellow, and then the centroid of the yellow cluster moved to the center between these two points. k-means, as you can imagine, is a stochastic method, which means that you don't get the same result twice if you carry out the clustering several times, because the means are initiated at random. And by design it works very well for spherical clusters, but it doesn't perform well on data sets that have a different shape. In terms of parameters, what's quite nice about k-means is that there is only one parameter to pick, and that is the number of clusters k. We can pick it using the silhouette score. I'm highlighting here the clustering that might result from three centers or from four centers, and you can see immediately that having four clusters appears to be better in this case, because it clusters the data in a more intuitive way. In fact, the silhouette score, which is high when there is a small within-cluster distance and a large distance to the closest cluster, would be high for this partitioning, whereas it would be low here, because you instead have a large within-cluster distance and a small distance between clusters. So you can use the silhouette score to determine that, in this case, a partitioning into four states is better than a partitioning into three states, and this can be used for generic data sets.

Now I want to talk about the second method, which is also very popular: hierarchical clustering. Hierarchical clustering can be done in two ways, divisive or agglomerative, and in this presentation I'll focus on the principle of agglomerative clustering. The idea of this method is to compute the similarity between every pair of objects in the data and cluster hierarchically by grouping similar objects using linkage functions. So on this example data set, if you measure the distances between all pairs of points, you'll see that the closest distances are between points A and D and between F and G, so these get clustered together initially; continuing that way, we see that B is closest to AD and E is closest to FG, so they get clustered in a second step, and you keep going until you have constructed the entire tree, which is also called a dendrogram. In terms of parameters, there are a few here. First, you need to pick which linkage function you want to use: you could measure the distance between the closest members of the clusters, between the furthest members, between the averages or centroids, or use Ward's method, which consists in minimizing the variance within the clusters. And then, as in the previous case, the number of clusters needs to be determined, and here we can also use the silhouette score, as described above.

And now I'm going to describe the last of the geometric methods, spectral clustering. This method is quite powerful and has gained traction over the years, and it comes from graph theory. It basically hinges on building an undirected connectivity graph, representing the data as an adjacency matrix, and building the graph Laplacian by subtracting this adjacency matrix from the degree matrix, which contains on the diagonal the total number of points each single point is connected to.
Once you have the graph Laplacian, you can diagonalize it to find its eigenvalues and eigenvectors, and then cluster the top eigenvectors using k-means. So this method basically applies k-means to a data set that has been pre-processed to take neighboring points into account, and this means that, contrary to regular k-means, it allows us to consider clusters that are not only spherical but of other shapes. The number of clusters can in principle be picked quite easily here, because the spectrum of eigenvalues should show gaps at the points where the partitioning of space is optimal. In this case you can see a jump at eigenvalue number four, which means that the optimal partitioning of space is into four clusters. The problem is that for a real data set you might not see such clear gaps in the spectrum of eigenvalues, which might make it quite difficult to pick the optimal number of clusters. Two other issues with this method are how to build the connectivity graph, that is, how to determine whether there should be a link between two points. This is usually based on the distance between points, which is then sparsified to build the connectivity graph; but this sparsification means that a decision has to be made about when points that are close should actually be considered connected, and that is not always straightforward. And finally, different versions of the Laplacian matrix can be used, which is another parameter to pick. Now, spectral clustering, as I mentioned, became quite popular, and that is also because, contrary to the other methods I mentioned, it can be performed directly in the high-dimensional space: in fact, when you diagonalize the Laplacian you are doing a dimensionality reduction step, which is similar to picking the collective variables along which you want to do your clustering. That makes it quite a powerful technique, because you don't have to pick the collective variables beforehand.
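As a concrete illustration of these three geometric methods, here is a minimal scikit-learn sketch on a generic two-dimensional data set with one row per frame and one column per collective variable; the data below is synthetic and the parameter choices are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering
from sklearn.metrics import silhouette_score

# Synthetic stand-in for an (n_frames, n_collective_variables) array
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(200, 2))
               for c in [(-1, -1), (1, 1), (1, -1), (-1, 1)]])

# k-means: pick k with the silhouette score (small within-cluster distance,
# large distance to the closest other cluster -> high score)
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
kmeans_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)

# Agglomerative (hierarchical) clustering with Ward linkage
ward_labels = AgglomerativeClustering(n_clusters=best_k, linkage="ward").fit_predict(X)

# Spectral clustering: build a nearest-neighbor graph, diagonalize the graph
# Laplacian and cluster the top eigenvectors with k-means (done internally)
spectral_labels = SpectralClustering(n_clusters=best_k, affinity="nearest_neighbors",
                                     n_neighbors=10, random_state=0).fit_predict(X)

print("chosen k:", best_k, "cluster sizes:", np.bincount(kmeans_labels))
```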
So these three methods are really widely used in data science to perform clustering, but they do not really take advantage of the type of data that we are considering when we analyze molecular dynamics simulations. As we mentioned earlier, biomolecular behavior is easily understood in terms of free energy landscapes, which are essentially probability densities, so it might be nice to use clustering methods that are based on densities. Before I describe these methods, it's worth mentioning that the first step with this kind of method is density estimation. Many of you might have done this using histogramming, without realizing that assumptions are made when building histograms, but there are better methods for density estimation that use continuous basis functions, like Gaussians, and global estimation. This is a fascinating topic, but unfortunately I don't have time to go into it; you can go to Annie's first paper for a review and a proposal on how to use Gaussian mixture models in an advantageous way for density estimation. I mention this because it is a prerequisite of clustering with density-based methods that you pick a way to estimate the density.

The first of the density-based methods that I am going to talk about is HDBSCAN, which builds on traditional DBSCAN. DBSCAN is a method that defines core points as having at least n neighbors within a cutoff epsilon. In this example you can see that this point has no neighbors, this point has one, whereas this point has two; so if you choose the minimum number of neighbors to be two, you would in this case have four core points. You then connect the core points if they belong to each other's neighborhoods, which is the case for these three, so they become part of the same cluster, and then you assign any point that is part of one of the core points' neighborhoods to the same cluster, which in this case yields two clusters and two outliers. As you can imagine, the parameter epsilon that you use to do this is quite crucial: here we have extended epsilon to epsilon prime, and you can see that now all the points are considered core points and they all belong to a single cluster. HDBSCAN basically builds on DBSCAN by varying epsilon, building the corresponding hierarchy tree, and then extracting the clusters through local cuts. Unfortunately, this method has quite a few parameters to pick, which would be difficult or impossible to review in this presentation, so if you're interested in the method I refer you to the documentation.

There are two methods I will go through very quickly that have gained traction in our field, also because they were designed by statistical physicists from our community. The first is density peaks, which comes from the group of Alessandro Laio. This is a clever method that hinges on calculating the local density, calculating the minimum distance to any other point with higher density, and then plotting a decision graph, in which the outliers, with a high density and a high distance, will be the centers of clusters. It is quite a clever method, but unfortunately the decision graph is difficult to interpret in the general case: you are sometimes not able to decide which points stick out. There is therefore an advanced version of density peaks that attempts to automatically figure out the topology of the landscape and pick the peaks directly.
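For HDBSCAN itself, a minimal sketch using the hdbscan Python package might look as follows; recent scikit-learn versions also ship an HDBSCAN class, and the parameter values below are arbitrary choices for illustration.

```python
import numpy as np
import hdbscan  # pip install hdbscan

# Synthetic stand-in for an (n_frames, n_collective_variables) array
rng = np.random.default_rng(2)
X = np.vstack([rng.normal((-1, -1), 0.2, size=(300, 2)),
               rng.normal((1, 1), 0.4, size=(300, 2))])

# min_cluster_size plays a role similar to the minimum number of neighbors n;
# epsilon is effectively scanned internally, which is the point of HDBSCAN.
clusterer = hdbscan.HDBSCAN(min_cluster_size=20, min_samples=5)
labels = clusterer.fit_predict(X)   # label -1 marks outliers / noise points
print(np.unique(labels, return_counts=True))
```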
The next method, designed by the group of Gerhard Stock, is called robust density-based clustering, and it is quite interesting for us to use because it is intuitive. It estimates the density by counting the number of points within a radius r, and then joins points that lie within a free energy cutoff of each other and closer than a lumping distance d_lump; the free energy cutoff is increased iteratively until you can build the structure of the clusters in a hierarchical way. As I mentioned, this method takes advantage of the shape of the free energy landscape, which is nice; unfortunately, it has quite a few parameters to fit, and the authors describe in their papers how to pick these parameters in the best way possible.

This brings me to the last density-based method that I want to describe, and also the last clustering method that I will talk about: Gaussian mixture models. It is based on the density estimation method of the same name. In this method, we fit the points to a number of Gaussian components using a maximum likelihood approach; the algorithm used to do this, expectation maximization, allows us to find the parameters of the Gaussian mixture: the amplitudes, the means mu that mark the positions, and the covariances, which parameterize the shapes of the Gaussians. Based on this, we can then assign each point to the Gaussian component it is most likely to have been sampled from. This method is very nice because it is robust, and there is only one parameter to pick, the number of clusters, which is also the number of components. We can avoid overfitting either by using a cross-validation scheme or by using the Bayesian information criterion, in which case the only parameters that we need to pick are the minimum and maximum number of components that we want to try out. So this is very nice, but unfortunately it assumes that the data is Gaussian distributed.
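A minimal sketch of Gaussian mixture clustering with BIC-based model selection in scikit-learn; the component range and the data below are arbitrary illustrations.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for an (n_frames, n_collective_variables) array
rng = np.random.default_rng(3)
X = np.vstack([rng.normal((-1.0, 0.0), 0.3, size=(400, 2)),
               rng.normal((1.5, 0.5), 0.5, size=(200, 2))])

# Fit mixtures with 1..10 components and keep the one with the lowest BIC,
# which penalizes overfitting; cross-validation could be used instead.
models = [GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
          for k in range(1, 11)]
best = min(models, key=lambda m: m.bic(X))

labels = best.predict(X)                  # most likely component for each frame
density = np.exp(best.score_samples(X))   # GMM density estimate at each frame
print("components:", best.n_components, "cluster sizes:", np.bincount(labels))
```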
So I'm going to hand over to Annie now, and she will describe how she extended this framework of Gaussian mixture models and devised a new clustering scheme that is particularly well suited to data that comes from molecular dynamics simulations.

OK, thank you, Lucie. So yes, as we've talked about a few times now, we're interested in how to cluster molecular dynamics data. These simulations are noisy, so we would like to somehow identify well-defined core states and leave the transition points outside of these definitions, and these core states should be located at the free energy minima. What InfleCS does is use the shape of the density to achieve this. Specifically, the first step is to compute the second derivative of the density landscape: if we show the second derivative as a color on this free energy landscape, we see that the second derivative of the density is negative where we have a free energy minimum. This way we can label the points as being either core-state points or transition points. When we've done this, we can see that we have these kinds of islands of core-state points that belong to the same free energy minimum, and these islands are separated by transition points. To be able to extract the core states, we need to join together the points that are in the same free energy minimum, and we do this by connecting points within the same island: two core-state points are connected if they don't have a transition point between them, so here, for example, we don't allow a connection. This then leads us to these two subgraphs, which we also call connected components. We then label each connected component and assign this label to all the points that belong to it, and this gives us the core states that we were looking for. What's kind of nice about this clustering is that, because we rely on a Gaussian mixture density, we use a functional definition; basically, that means that we can extract the regions of core states not using the data itself but using, for example, a grid, which would mean that we have equidistantly spaced points regardless of whether we are in a core state or a transition region. This makes it a little bit more robust, but in the most basic case we still use the data itself.
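The following is not the authors' InfleCS implementation, just a minimal one-dimensional sketch of the idea described above: fit a Gaussian mixture, evaluate its density on a grid, mark grid points where the second derivative of the density is negative as core-state points, and join contiguous core points into connected components.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Minimal 1D illustration of the core-state idea (not the authors' code):
# core regions are where the GMM density curves downwards (negative second
# derivative), i.e. around density maxima / free energy minima.
rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-1.5, 0.4, 2000), rng.normal(1.0, 0.6, 1000)])

gmm = GaussianMixture(n_components=2, random_state=0).fit(x.reshape(-1, 1))

grid = np.linspace(x.min(), x.max(), 500)
density = np.exp(gmm.score_samples(grid.reshape(-1, 1)))  # p(x) on the grid
d2 = np.gradient(np.gradient(density, grid), grid)        # numerical second derivative

is_core = d2 < 0                        # core-state points vs. transition points
# Join contiguous core points into connected components (trivial in 1D)
labels = np.zeros(len(grid), dtype=int)
state = 0
for i, core in enumerate(is_core):
    if core:
        if i == 0 or not is_core[i - 1]:
            state += 1                  # a new island of core points starts here
        labels[i] = state
print("core states found:", state)
```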
So, to go further, we wished to evaluate the clustering of this method with respect to the other seven methods that Lucie went through, and as she mentioned, we are using toy models. We have three toy models that we are testing these methods on; a toy model is a fake data set, so we know the ground truth, but we're going to pretend that we don't know it, apply each of these methods to the data sets, and let the data give us the clusters. For example, for k-means we will use the silhouette score, and we will do the same for agglomerative clustering with the Ward criterion, to select the number of clusters; for spectral clustering we will instead use the eigengap. In general, I will just mention that we are using recommended and default settings, so we are not optimizing these methods on the data, and this goes for all of the methods; we are just trying to see how well we can do if we don't know anything about the structure of the data. For GMM and InfleCS we will use the BIC score to decide on the number of Gaussian components. Since we are evaluating the full clustering, this means that we also want to assign labels to transition points; for InfleCS this means that we sort the transition points in order of decreasing density and then assign them one by one to the closest labeled point, which effectively fills up the free energy well from the minimum up to the transition barriers. We want to quantitatively describe how well these methods perform, so we use a scoring function, or metric, called the V-measure. It is a number between 0 and 1, and it measures how well the points from one true class are clustered together while not being merged with points from other classes. As I mentioned, we have three toy models, and for each toy model we sample data from the model, cluster this data using all of the clustering methods, and then evaluate the clustering with the V-measure; this is repeated 50 times for each toy model, so we have 50 data sets per toy model.

If we start out pretty simple, we have a toy model with Gaussian clusters. To the left here we see the free energy landscape, which we use to sample data points; in the middle we see one example of a sampled data set, colored according to the classes; and to the right we see the corresponding V-measure. What we can see is basically that all of these methods perform well on this type of data set, which is not surprising because it is full of convex, Gaussian-shaped clusters, but we can also see that InfleCS and GMM actually perform a little bit better.

For a real data set we might not have exactly Gaussian-shaped clusters; we might have something that is close to Gaussian-shaped, or not Gaussian at all. So we decided to see how these methods perform if we introduce one cluster that is not Gaussian-shaped: this curved, non-Gaussian cluster. What we see is kind of interesting: k-means, agglomerative clustering with Ward linkage, and the canonical Gaussian mixture model fail to cluster this; basically what they do is divide this one state into several clusters. The density-based methods perform better on this data set, and specifically HDBSCAN and InfleCS perform best. But this still doesn't look exactly like the free energy landscapes we would get from our MD simulations: in reality we have something high-dimensional that we project down onto a few dimensions, for example using collective variables, and this leads to poorly separated states, probably with different densities, and we might even have some non-linearity in the data. In the third toy model we try to mimic this situation: if we look at the free energy landscape we do have free energy minima, but when we look at the sampled data it is difficult to see that there are actually three states, because they are so poorly separated. Looking at how well the methods perform, we see that the clustering methods based on geometric criteria fail to do the clustering properly; it seems to be very difficult overall, but InfleCS at least does a reasonable job. So what we can say is that basically no clustering method is perfect and can handle all types of data sets, but it seems like InfleCS is able to handle the type of data that we expect to get from MD simulations.
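For reference, the V-measure used in this comparison is available in scikit-learn; here is a minimal example with made-up labels, purely to show the call.

```python
from sklearn.metrics import v_measure_score

# True classes of the toy-model points vs. labels returned by a clustering;
# the values below are invented just to illustrate the call.
true_labels    = [0, 0, 0, 1, 1, 1, 2, 2, 2]
cluster_labels = [1, 1, 1, 0, 0, 2, 2, 2, 2]

# 1.0 means a perfect clustering, 0.0 an uninformative one; the score
# penalizes both splitting a true class and merging different classes.
print(v_measure_score(true_labels, cluster_labels))
```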
We prepared a Jupyter notebook tutorial for this clustering, and it is available on the Delemotte lab GitHub, so you can go in there and try it on your own data. Here we have adapted it to calmodulin, to show you how to do the clustering. The first thing we need to do is to import free energy clustering, which gives us access to all of these things; we call it FEC, with capital letters, so that we don't have to write out the full name. Before actually starting the free energy estimation and the clustering, we need to get our data into the correct shape, and this forces us to decide on collective variables, that is, how to describe the simulations. In our case we used two collective variables: one called DRID, which, very roughly speaking, is like a contact-based version of RMSD and measures global conformational changes, and another CV, called VDAC, which we localize to the linker and which measures the changes in secondary structure in the linker. We have already processed our simulations, so we have computed the CVs in each frame, and this is what we're loading here: the text file with the first, DRID, CV and then the text file with the VDAC CV. The data that we use as input to our function has to have the shape number of samples times number of dimensions, where the number of samples corresponds to the number of frames in your trajectory and the number of dimensions corresponds to the number of collective variables; in our case it is number of frames times 2, because we have two CVs. So we construct this matrix, and in the next step we construct a free energy clustering object, using this matrix as input.

We need to specify the minimum number of components and the maximum number of components for the density estimation of the Gaussian mixture. That is all we strictly need to specify, but if we're interested in looking at the free energy landscape, we also want to specify the temperature that was used when we ran the simulation, and for visualizing the free energy landscape we might want to set the resolution of the visualization. This is set by n_grids, and we set n_grids equal to 100; since we have two dimensions, we will get a grid that is 100 by 100. In this example I will show you how to do the clustering using this grid, but of course you don't have to do it on the grid, you can do it on the data. Then n_splits determines what type of model selection you use to choose the number of components: if you set n_splits equal to one, you use the BIC criterion; if you set it to a number larger than one, you use cross-validation, dividing your data into as many parts as n_splits. If you run this cell, you get this type of output, where you have a list of the parameters: the input arguments that you have defined and those that were set to default values. In the Jupyter notebook tutorial we also list these input arguments with a short description of what they do and what they mean, so you can go in there and read for yourself.

The next step is to estimate the free energy landscape, and this is done by just writing fec.landscape(). This will estimate the density and then get the free energy from it, and it will return the coordinates of the grid, which are kept in coords; it will also give back the free energy estimate at each coordinate on the grid, and then the free energy of each of the data points that you used as input, that is, the free energy in each of the frames that we have. Before we move on to the clustering, we might want to look at the free energy landscape, and we can do this by writing fec.visualize(). If we don't use any input arguments here, we get a basic visualization, but there are some keywords we can set to change the figure a little bit: for example, in this case we will not show our original data as a projection on top of the landscape, we will change the x label and y label to DRID and VDAC, the names of our CVs, and we will change the title to "Free energy landscape". This is what comes out if we run this: great, we have a free energy landscape of our simulations, and what we can see is that it looks pretty complex. We would like to decipher this free energy landscape, and this is where we start extracting the core states. Extracting the core states is done by calling the cluster function, fec.cluster(). As the first input argument we pass the coordinates that we use for extracting the regions of metastable core states; we could for example supply the data as input, but here we are using the grid coordinates, because we have a high-resolution grid. Then we need to supply the free energy of each frame, and also the data that we want to cluster, which is the data of our collective variables. This returns the cluster labels and also the cluster centers.
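Putting the steps described so far together, the workflow might look roughly like the sketch below. The module, class, and argument names are written as they were described in the talk (FEC, the minimum and maximum number of components, temperature, n_grids, n_splits), the file names are placeholders, and the exact API and argument order should be checked against the Jupyter notebook tutorial on the Delemotte lab GitHub.

```python
import numpy as np
import free_energy_clustering as FEC   # module name as described in the talk; check the repository

# Load precomputed collective variables, one value per frame (file names are placeholders)
cv1 = np.loadtxt("drid_cv.txt")
cv2 = np.loadtxt("vdac_cv.txt")
data = np.column_stack([cv1, cv2])     # shape: (n_frames, n_collective_variables)

# Construct the free energy clustering object (argument names as described in the talk)
fec = FEC.FreeEnergyClustering(data,
                               min_n_components=2, max_n_components=15,
                               temperature=300.0,  # K, the simulation temperature
                               n_grids=100,        # 100 x 100 grid for a 2D landscape
                               n_splits=1)         # 1 = BIC, >1 = k-fold cross-validation

# Estimate the density / free energy landscape: grid coordinates, free energy on
# the grid, and free energy of each input frame (as described in the talk)
coords, free_energy_grid, free_energy_points = fec.landscape()

# Basic visualization of the landscape
fec.visualize()

# Extract the core states; argument order as described in the talk: the coordinates
# used to extract core-state regions (here the grid), the free energies, and the
# data to be labelled. Verify against the notebook before use.
labels, centers = fec.cluster(coords, free_energy_points, data)
```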
As I mentioned, when we compared this method to the other clustering methods we used the full clustering; in case you would like to do that, there is a keyword argument called assign_transition_points. By default this is False, but if you want the full clustering you can change it to True, and then this will compute the second derivatives, construct the connected components, and label all the data points. If we then run fec.visualize() again, we get what is shown here in the right panel, and comparing this to the free energy landscape on the left, you see that where we have free energy minima is also where we extract core states, so it seems to work pretty well.

In the next step we thought we would see if we could understand something from this result. We identified the most common state, which was this cyan state: it has a helical linker and looks very calmodulin-like. Then we found a really interesting state, this yellow state, which is compact and collapsed, and it sits over here in the landscape. We asked ourselves what would be one possible pathway in this landscape from the most common state to this strange collapsed state, and what we saw is that one possible pathway goes through blue and then green before reaching yellow. When we analyzed the states, we saw that this pathway is driven by electrostatic interactions, by forming and breaking salt bridges in the linker, resulting in this collapsed state that is stabilized by a network of salt bridges. OK, so that was an example of how to use this on biological data.

What we would like you to remember from this presentation is that clustering is a data-driven way to extract states from these simulations, and all clustering methods have different properties and limitations that are due to the assumptions made when constructing them. InfleCS is a method that is well adapted to identifying well-defined core states, and these core states will be at the free energy minima. Then we went through this Jupyter notebook tutorial for how to use InfleCS, and you're welcome to try it out on your own data. So thank you for listening, and let's start the question session.

Thank you very much for the really, really interesting talk. For everyone at home listening, if you want to ask questions, there is a questions tab where you can put in whichever questions you want, and we will get through all of them. In keeping with the theme of this talk, the first person to ask has asked what I will refer to as a cluster of questions, so do be ready for more than one question, and our first asker is Shashank. Shashank, I've unmuted your microphone; if you would like to ask your questions, please go right ahead.

Thank you, can you hear me? Thank you, it was a really nice talk. My first question is on the clustering methods, and specifically on spectral clustering: is spectral clustering based on the principle of path entropy, or the maximum caliber model?

No. You can see spectral clustering as either a flow or a graph: first you represent your data as a graph, and then you can view this as random walks on top of the graph, and spectral clustering tries to identify regions of the graph where the random walk would stay for a longer time.

Oh, okay. Okay, thank you.

Okay, thanks Shashank, would you like to ask your other questions?
Oh yeah, so my second question is on InfleCS. It looks like a very nice method, but as I initially understood it, an underlying assumption in this method is that we know the underlying probability distribution of our landscape; we should have that information beforehand.

Well, InfleCS uses the Gaussian mixture model density estimation to get that free energy landscape, or the density landscape. Basically, Gaussian mixture models are pretty good at estimating densities, because they estimate the density globally, using all the data points, rather than with local basis functions; also, the estimate is continuous, which makes it more stable in the regions that are sparsely sampled. So we use that, together with model selection, for example cross-validation or the BIC criterion, and then we use the estimated density to extract the core states. Does that answer the question?

So my only confusion is: let's say we just have an unbiased MD simulation. We know that in an unbiased MD simulation we might not be able to sample the entire landscape, or the phase space along the reaction coordinates. Can we still apply InfleCS to that data set?

Yes, but we might not be able to extract complete information about the system. If I can maybe add something to what Annie was saying: I think what you're getting into is the choice of the collective variables. In the previous paper where we talked about density estimation, we indeed showed that using Gaussian mixture models allows us to reconstruct a free energy landscape in a robust way, but the problem might be that you need to pick a set of collective variables to project the landscape onto initially, and that is a broad topic; it is not always straightforward to determine one. But provided you have a collective variable that you're confident in, or a few that you could try, then you can calculate the free energy landscape with only a few parameters and then perform the clustering using InfleCS. So I'd say that the biggest issue here might be picking the correct set of collective variables that separates the states advantageously.

OK, OK, thank you.

Great, the next question we have is from Alexander Pain. Alexander, I've now unmuted your microphone; if you would like to ask your question, please feel free to do so.

Hey, I'm curious whether you can use this Gaussian mixture model in a higher-dimensional space, so, say, five collective variables instead of two. And, sort of related to that, you just mentioned this I guess, but have you looked at how the choice of your collective variables affects the ability of InfleCS to cluster the data effectively? And related to that, can you use something like TICA to perform the clustering, to minimize the decisions you have to make?

So yes, this method should work basically regardless of the CVs that you use, given that you have something you can describe with a density, and indeed we did try this out on data sets with higher dimensions. Here we have four up to eight dimensions, and we can see that up to six dimensions, or even to some extent seven, we can do the clustering pretty well with InfleCS. But then you can't use a grid; you could for example use k-means to pre-process the data and then use the centers, or something like that, to extract the core states, or just use the data itself, and if you have enough data it should be possible. Also, we did apply this to TICA data, not in this project but in a different project, so yes, you can use TICA.
Awesome, thank you.

Great, the next question we have is from Eric Lang. Eric, your mic is unmuted if you want to ask your question. Eric, are you there? As Eric does not seem to be responding, I will ask the question on his behalf. Eric points out that on the GitHub page that you linked to, only the toy model tutorials are available, not the calmodulin one. Would it be possible to access real case studies, such as the calmodulin case, and maybe have those added to your GitHub page as well, please?

Yeah, I mean, that could be possible, I think, though the tutorial we went through now is exactly the same as the tutorial that is up on the GitHub, so it doesn't really matter what data you use; you can use your own data, or the toy model data, or this data. I don't think it brings too much to put it there, but in principle we could.

Great, thank you very much. The next question we have is from Elvis Martis, who is not currently on the stream, so let me again ask the question on his behalf. Elvis asks: any specific things to keep in mind while deciding the minimum and maximum numbers of components?

That's a good question. I think a good way to do it is to start by choosing a quite large range of components and do a coarse scan through it, and then, once you have looked at the results, if you end up choosing models at either end point of this interval, you should move your range. For example, if you pick between 2 and 10 components and you end up choosing 10 components all the time, you should probably move your range and try 5 to 15; if you then start choosing 15 components, you can move it even further. Basically, the model selection will show you whether you are within the range or on the edge of it.

Great, thank you very much for that answer. Next we have Simone Orioli, who also will not be able to ask their question themselves, and they ask: is this method compatible with data sets where each point is assigned a different weight, such as free energies obtained from metadynamics?

Yeah, that's also a good question. We have made an adjustment for this: as it is described here it's not, but we did add this to the method, so it should be on GitHub. Let's see; here, for example, we print out whether or not we use weighted data, and in that case the data should be weighted, from for example metadynamics, and the GMM density estimation will be done in a slightly different way to incorporate this, but it should be there.

Great, thank you very much. The next question we have is from Henry Wittler. Henry, I've unmuted your microphone; if you would like to ask your question, please go right ahead.

Yes, thank you. I was wondering: I have a similar webpage on GitHub, on molecular dynamics analysis, and I did a 1500 nanosecond simulation saved in 1 nanosecond steps. I ran several simulations, and in the second one I found different, differently populated states. I was just wondering whether one can find minor populated states, like, if there is some kind of minor change, whether one might be able to use InfleCS for discovering minor changes in a molecular dynamics trajectory.

That is a good question. It depends a little bit on the data, whether there are minor states and how well those states are captured. Basically, the density estimation will be performed so that we select a model that tries not to overfit the data,
but if there is data there that shows that there is a small state, then it can be found. For example, here we see these kinds of smaller states; although they are high-density states they are quite small, and they might be difficult to see directly: if you just look at the data you might think that this is one state, with roughly two states over there, but we actually have several smaller states in this region. One thing to remember is that this is based on having minima in the landscape, so if you have a completely flat region you wouldn't see a core state appear; the population of that state has to be larger than that of the surrounding regions, which would be transition states.

Yes. Sorry, I can continue, I have a second question.

Sure.

So I was just wondering about the license: can I copy this code from GitHub and then adapt it in any way I like, or does it have some kind of license, so that it can be applied in any BioExcel software later on?

It doesn't have a license right now. Maybe we should put an MIT license on it, but you can fork the repository and adapt the code; just don't incorporate it in some software that you sell, I don't know.

I have the same problem with my repository; it's hard to find the right one, but I have the BSD 3-clause for mine, because it depends on the circumstances and everything.

Yeah, I'll have a look.

In the interest of time, because we have a fair few questions still to get through, I will now move on to the next question, if that is okay. The next question we have is from Bakari. Bakari, I've unmuted your microphone if you would like to ask your question.

Yes, so my question, and again, thank you for the talk: I wanted to know whether you can assess your clustering on some labeled data, or, if you don't have labeled data, how you can assess the clustering on your data.

So this is a tricky question. Clustering per se means that you don't know what labels your data has and you try to figure that out, so in that case it becomes difficult to assess whether you did a good clustering or not. But, for example, on this type of landscape, if we ran some other clustering method, we could check whether the clusters fall within the free energy minima or not, so that could be one way to say whether we did an okay clustering. If we look at these toy models, on the other hand, we know the labels of the data, and in that case we can quantify how well the clustering is done: zero means that nothing was correct and one means that we had a perfect clustering. There are different ways to do this, and the V-measure takes into account both the splitting of the original classes and the merging of classes, so you avoid both of those failure cases. Does that answer your question?

Yeah, okay, thank you. Sure.

Great. The next question we have is from Ivan Pulido. Ivan, I've unmuted your microphone if you'd like to ask your question.

Okay, sure, can you hear me?
Yes.

Okay, thank you for the talk. I was going to ask about something you said when you were finding the pathway: you said it goes from the most stable structure to another one. Is it the most stable according to the clustering, or because you already know that from experiments or something like that? And, in that sense, can you use this clustering to assess structural model refinement with molecular dynamics?

So I can start by answering the first question, which is how we know that this cluster corresponds to the most likely state in this data set, and this has to do with the free energy landscape.

Okay, yeah, I see it.

Yeah, so it sits at the lowest free energy minimum, but what we do specifically to quantify this, let me see if I can find the picture, yes: we have this density landscape, and then we integrate the density over the region of the core state to get the probability of that core state. This gives us state populations, and here we can see that this state pops up as being the most populated. But you can also look at the states and how many points you have in them, that is, how big the clusters are in number of points.

Okay, yeah. For your second question, can you repeat that?

Yeah, whether you could use this clustering method for refinement of a structure using molecular dynamics; the conformation hopefully doesn't change much, but you would still get some free energy landscape, and I was thinking of using this method for that.

So you're thinking of just doing plain MD on the structure, seeing how it relaxes, and then picking a cluster center? Yeah, okay. Lucie, do you want to add something?

Yes, I think this has to do with molecular dynamics itself rather than the analysis, possibly, but of course, like Annie said, you can use this clustering method to see which state is the most probable, and I guess that would be close to your input structure.

Okay, thank you.

Thank you very much. The next question we have is from Cesar Mendoza Martinez. Cesar, I've unmuted your microphone; if you would like to ask your question, please go right ahead. Cesar doesn't seem to be here at the moment, so I will ask the question for him; it's an easy one: which software do you use when running your simulations?

GROMACS.

The next question, or more likely a set of questions, is from Arjun, who has asked a number of questions over this question session. Arjun, your mic does not seem to be unmutable, so I will put them on your behalf. The first one is: if we have multiple runs of the same system, how can this be implemented? That seems to be linked to a specific point in the presentation. The next question is: can you use more collective variables to help increase the sensitivity, and how would increasing the number of variables affect the sensitivity of your results? We answered that question earlier, when Annie showed the graph demonstrating that it is possible to cluster in high dimensionality, up to about seven dimensions.

I can add something to that. We did touch upon it, but if you want to add more dimensions, you have to have a lot of data points to be able to carry out the clustering, and this goes for all the clustering methods, especially those that are based on density: when you increase the dimensionality you need more points to be able to estimate the density, and after a while you basically have no density, the data is so sparse. So this is a point that should be stressed: just because you
add more dimensions to describe your data doesn't mean that you will get a better result per se; it's more important that you pick a few good CVs, for example.

Great, thank you very much for that answer. The next question; sorry, there are a lot of questions that have already been answered, I think. The next question goes back to Eric Lang. Eric, I've unmuted your microphone if you want to ask the question yourself. Eric doesn't seem to be here. Specifically, Eric is asking: on slide 45, is the pathway shown here more likely than the one going through several clusters at the bottom of the landscape, or does your program choose the shortest pathway?

Yeah, that's a good question as well. Actually, this pathway we just picked by hand, but you can get the shortest pathway; of course, that depends on what you use as your starting pathway, so if you pick a straight line and then let it relax in the landscape, you will get something very similar to this. But that doesn't make it more likely than the other transition, so we don't really know which pathway is more likely; I think that's why we are also careful to say that it is one possible pathway.

Great, thank you very much. That will be the final answer that we'll be taking, considering that we're already running 15 minutes overtime. If anyone has any other questions that they desperately want to ask, please ask them on the BioExcel page: this recording will become a YouTube video, which will make it to the BioExcel YouTube page, so feel free to ask questions in the comments there. Lastly, I wanted to very briefly mention future BioExcel webinars that are coming up: we have a webinar, I think in two weeks' time, on the 28th of May, by Brinda Vallat and Benjamin Webb, about a prototype system for archiving integrative structures, and then we also have a webinar coming up in mid-June by Alexandre Bonvin about HADDOCK, specifically the new features of the HADDOCK 2.4 server, in a guided demo. Thank you again for the very interesting talks and for answering so many questions, and thank you to everyone for attending this presentation.