Good afternoon, everyone. My name is Matteo Carli, and this afternoon I will try to introduce some of the problems one can face when trying to compute high-dimensional free energy landscapes, or high-dimensional probability densities, and I will also present some possible solutions that we elaborated in our group.

When working with high-dimensional data sets, one of course faces the problem of classifying, understanding, and representing the huge amount of information they contain, and one key quantity for doing so is the probability density function of the data. Since the data are generally distributed in a non-homogeneous way in the embedding space, one can hope that there are a few relevant degrees of freedom, a few so-called collective variables, that are enough to give a satisfactory description of the problem one wants to study once one restricts to them. One general approach in big-data challenges is that of free energy methods. Free energy methods are quite general, but molecular simulations, in which we are especially interested in this workshop, are a prototypical example. If I consider a molecular simulation, at every time step I save a vector of, for example, all the coordinates of my atoms (the molecule's atoms and sometimes even the water atoms), so my data space is a vector of thousands, tens of thousands, even hundreds of thousands of coordinates. This is a really hard problem.

If one wants to go through a free energy approach, a possible pipeline is the following. First there is the problem of identifying the relevant degrees of freedom of the problem and restricting to them: the well-known and very hard problem of dimensionality reduction. Assuming one can do it, and thus go from the full coordinate space to a reduced space of collective variables, one can then try to compute the probability density reduced to this space. This too is a very challenging task, and there are several methods that attempt it. There are parametric methods, which typically fit some functional form to the data, and there are non-parametric methods. The most famous, the one I show in the picture, is the histogram; more generally there are the so-called kernel methods, of which the histogram can be considered a member, and which depend only on a scale parameter defining the resolution at which I want to look at the data. Once I have the probability density, I can compute the free energy by taking minus k_B T times its logarithm (written out below). With the free energy in hand, I can find its extremal points, the landmarks of this landscape: maxima, minima, saddle points. In this way I characterize the landscape that the degrees of freedom go through during their dynamics, in this case during a molecular simulation. Once I have mapped this landscape, I can assign all the configurations to states by applying clustering; this is something Alex Rodriguez, for example, talked about on Monday. And once I have clustered my data, I have a description of my system.
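For reference, the free energy mentioned in the pipeline is the standard definition: the free energy of a set of collective variables s is minus k_B T times the logarithm of their marginal density (textbook material, not specific to this talk):

```latex
% Marginal density of the collective variables s(x), and the associated free energy
p(\mathbf{s}) = \int \mathrm{d}\mathbf{x}\; \delta\big(\mathbf{s} - \mathbf{s}(\mathbf{x})\big)\, p(\mathbf{x}),
\qquad
F(\mathbf{s}) = -k_B T \,\ln p(\mathbf{s}).
```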
In the rest of the talk I will mostly focus on the problems one faces before arriving at an estimate of the probability density, or of the free energy; the rest I do not have time to cover.

The first problem, of course, is that of finding good collective variables, and one would like to do it automatically. But as we have also heard in some of the talks, automatic collective variable selection is a very challenging, open problem. There are attempts, even successful ones, to do it via machine learning. But for sure, one can resort to intuition and insight about the problem one is studying, to chemical intuition, and define the collective variables, or supervise their definition, in this way. For example, in the very famous and very didactic case of alanine dipeptide, a peptide of two residues and 22 atoms, the metastable states of the molecule are well described by projecting on only two variables, the backbone dihedral angles phi and psi around which the two parts of the molecule rotate. One can simply make a histogram of these two variables during a simulation and obtain this kind of landscape (a minimal sketch of this computation follows below). But this is not a realistic case one would be interested in studying. A realistic case would be, for example, the main protease of SARS-CoV-2: a single monomer of this homodimer has 306 residues and around 5,000 atoms, so there is no hope of defining collective variables with chemical intuition alone.

At this point it is worth opening a parenthesis, because the concept of intrinsic dimensionality is very important. I have talked about a coordinate space, or embedding space, the space of all the coordinates that I input to my method or protocol: these can be the Cartesian coordinates of all the atoms, or only of the alpha carbons, or the backbone dihedral angles. This embedding space has dimension D, for example the 3N I mentioned in my first slide. But in this space, due to soft chemical constraints (also called restraints) that inhibit visiting some parts of the configuration space, the data distribute themselves on a manifold whose dimension is generally much lower than the embedding dimension. This is called the intrinsic dimension. The intrinsic dimension can be estimated, and this is a very interesting problem in itself; some people in my group are working on it. For our purposes it is important only to know that estimating it is possible and that, being an intrinsic property, the estimate should not depend on the choice of parameters. One very important remark: one can be tempted to use two, or at most three, collective variables to study a problem, simply because they are the easiest for us humans to handle; one, two, or three collective variables we can visualize, manipulate, and think about. But if the intrinsic dimension of the problem is higher than the number of collective variables we retain, this can lead to misleading descriptions and wash out important details, so it is very important not to do this.
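Here is the minimal sketch promised above for the alanine dipeptide example (my own illustration: in a real analysis `phi` and `psi` would be backbone dihedrals extracted from a trajectory, e.g. with MDTraj; here they are replaced by a toy bimodal sample so the script runs on its own):

```python
import numpy as np

kBT = 2.494  # k_B T in kJ/mol at ~300 K

# Toy stand-in for dihedral angles (radians); real data would come from a trajectory.
rng = np.random.default_rng(0)
phi = np.concatenate([rng.normal(-1.5, 0.3, 5000), rng.normal(1.0, 0.3, 5000)])
psi = np.concatenate([rng.normal(2.5, 0.4, 5000), rng.normal(-0.8, 0.4, 5000)])

# 2D histogram -> empirical probability -> free energy F = -k_B T log p
counts, _, _ = np.histogram2d(phi, psi, bins=50, range=[[-np.pi, np.pi]] * 2)
p = counts / counts.sum()
with np.errstate(divide="ignore"):
    F = -kBT * np.log(p)            # empty bins get F = +inf
F -= F[np.isfinite(F)].min()        # shift the global minimum to zero
```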
Of course, a system like the one I mentioned before, the main protease of SARS-CoV-2, can have an intrinsic dimensionality of 26, and that is very difficult to handle in a visual way.

Since it is difficult to find collective variables based on intuition, one can hope to find them automatically. The simplest method is to project linearly onto a hyperplane of lower dimensionality, and the most famous way to do so is principal component analysis (PCA), in which one finds the eigenvectors of the correlation matrix of the coordinates. If there is a gap in the spectrum of this eigenvalue problem, one has also found the intrinsic dimensionality: a gap after g eigenvalues means that the intrinsic dimension of the problem is not greater than g, and one can then project onto the first g principal components. Here I have the example of a Swiss roll in three dimensions. PCA finds that one component varies little, as you see from the colors, and it would retain two principal components; but with our eyes we can see that this is actually intrinsically a one-dimensional manifold, and PCA cannot project onto that. As you can see in the image below, if one projects onto the principal components of this surface, one ends up with a meaningless description. On the left I also put the example of a three-well potential: the principal component lies mostly along the x-axis, but if one projects onto it, one finds a free energy with two metastable states, which does not correspond to the real one.

When linear projection fails, but the manifold is still isomorphic to a hyperplane, so topologically trivial, let's say, one can hope to use a nonlinear projection method, which basically tries to find a nonlinear transformation that rectifies, or irons out, the coordinates; this is why I put the image of an iron here. There are several such methods in the literature, and they work quite well when the topology is not complex. Here on the right, the Swiss roll has been ironed out by locally linear embedding into a two-dimensional manifold, and if I then apply PCA, I find something actually meaningful: as we can see visually, we recover the intrinsic dimensionality of our data (a small sketch contrasting the two projections follows below). But the problem arises when there are topologically complex features: if there are loops, I cannot hope to iron a loopy shape onto a hyperplane. There are methods that try to combine PCA or nonlinear projection methods locally, for example by defining charts that connect local maps, but these are quite recent and also quite complicated. So much for the problem of topology.
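Here is the small sketch promised above, contrasting a linear and a nonlinear projection on the Swiss roll with scikit-learn (my own illustration, not the speaker's code; the parameter values are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding

# A Swiss roll: a sheet rolled up in 3D; t parametrizes position along the roll.
X, t = make_swiss_roll(n_samples=2000, noise=0.05, random_state=0)

# Linear projection: PCA keeps the directions of largest variance, which mix
# points from different turns of the roll, giving a misleading 2D picture.
X_pca = PCA(n_components=2).fit_transform(X)

# Nonlinear projection: locally linear embedding "irons out" the roll first.
X_lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2,
                               random_state=0).fit_transform(X)

# A faithful embedding should track t along one axis (up to sign); the LLE
# coordinate should correlate with t much more strongly than the PCA one.
print(np.corrcoef(t, X_lle[:, 0])[0, 1], np.corrcoef(t, X_pca[:, 0])[0, 1])
```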
The third problem one can face is the curse of dimensionality. Here I made a simple example to illustrate the so-called bias-variance trade-off: I sampled a one-dimensional Gaussian with 1,000 points and put the sample into six histograms with six different smoothing parameters, that is, six different bin widths. Starting from the left, when the number of bins is much lower than the number of data points, I underfit and miss relevant features; a Gaussian does not have very complex features, but still, looking at the leftmost picture we cannot even be sure it is a Gaussian. This is the case of high bias, underfitting. If I go all the way to the right, I get a barcode-like graph from which I cannot understand anything about the data: the number of bins is much higher than the number of data points, I am overfitting, and I have high variance. It is only noise; the signal-to-noise ratio is very low. In the middle there is a good balance and good generalization, and I can actually learn something from the data. This was one dimension, but in higher dimension the bias-variance trade-off is even more delicate: the risk of getting pure noise (having, let's say, much more resolution than data) or, in the opposite direction, of blurring all my information across many dimensions, is very high. This is a very well-known problem when dealing with high-dimensional probability distributions in general (the toy experiment is easy to reproduce; see the sketch below).
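A few lines reproduce the toy experiment just described (again my own sketch; the particular bin counts are illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sample = rng.normal(size=1000)   # 1,000 points from a 1D standard Gaussian

# Six bin widths, from strong smoothing (high bias, underfitting)
# to barcode-like noise (high variance, overfitting).
fig, axes = plt.subplots(1, 6, figsize=(18, 3))
for ax, n_bins in zip(axes, [3, 8, 20, 80, 400, 2000]):
    ax.hist(sample, bins=n_bins, density=True)
    ax.set_title(f"{n_bins} bins")
fig.tight_layout()
plt.show()
```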
Over the years, Alessandro and other collaborators in my group developed a protocol that tries to go through the pipeline I illustrated in the first slide: it estimates the intrinsic dimensionality and the free energy at each point, then performs a clustering and represents the states it has found. In this talk I will mostly focus on the estimation of the free energy at each point, trying to address all the problems I have described. The method is called PAk, point-adaptive k-nearest neighbors. It does not perform an explicit dimensional reduction, which gets around the problem of collective variable selection; it works in complex topologies, because it only looks at the data locally; and it handles the curse of dimensionality rather better than other methods, thanks to an adaptive selection of the smoothing parameter and to the fact that, by computing the intrinsic dimensionality correctly, we restrict to the intrinsic manifold and do not dilute our information over fictitious dimensions of the embedding space. The inputs are basically only a metric, that is, a set of pairwise distances between points, and of course a reliable intrinsic dimension.

What is the basic idea of this method? If I sample points from a non-uniform density but zoom in close enough, I cannot distinguish the sample from a uniform distribution: it is only a matter of scale. Looked at closely enough, all data manifolds look locally uniform. And this is not made up: this is actually the same Gaussian plotted again. If I project the cusp of my Gaussian onto the tangent hyperplane at its top, I cannot distinguish the Gaussian sample from a uniform sample. But at which scale does the sample look uniform? It depends on the point. For example, here is a scatter plot from a two-dimensional sample of a probability distribution that you can already kind of grasp from the scatter plot. Going around it: here is a low-density region with a slowly varying density, so the scale at which I see a uniform density is a big radius that includes a lot of points. At this other point there is again a low density, but a suddenly varying one, so the scale at which the density is constant is quite small and contains few points. And this point sits at the center of one of the peaks of the distribution, so even though the density varies very fast, I have a lot of points. So what we see is that the scale at which my manifold is locally uniform depends on the point. What we do is estimate an optimal number of neighbors to look at, which is adaptive and point-dependent, and which gives the scale at which, at that point, I can consider my distribution uniform. I will not go into the details of this adaptive choice of k; we may have time later or in the Q&A session.

At the scale at which my manifold looks locally uniform, I am very close to the manifold, so I also see it as flat: if we sit on a manifold and look around, we can approximate it by its tangent hyperplane in d dimensions, and I stress that the tangent hyperplane is in d, the intrinsic dimension. (I see I have five minutes left, so I will go quickly.) The important point is that we restrict to a neighborhood very close to the manifold, a regime in which we can use Euclidean distances and hypervolumes in the small intrinsic dimension d; computing the correct intrinsic dimensionality is therefore crucial. Credit for this picture goes to the "locally flat earth society".

These are the key ingredients of the k-nearest-neighbor estimator: I fix a number of neighbors k, which we choose optimally, as I said; I take the hypersphere containing those k points; and I compute the density as the ratio of k over its volume. If I had time, I would explain that there is also a less intuitive but more formal way to obtain the same result via a maximum-likelihood procedure, which is the one shown here: I write the likelihood of having density rho given the observed volume shells around the point, and by maximizing the logarithm of this likelihood I obtain the k-nearest-neighbor free energy (the derivation is sketched below). The key ingredient of PAk is to modify this likelihood maximization with a variational parameter a, in order to allow for linear corrections around the constant density. This also provides a nice error estimate. Moreover, the maximization procedure has the advantage of giving a pointwise estimate, which is equivalent to taking the limit of the smoothing parameter going to zero. This has some consequences, for example on the possibility of reweighting the free energy of a biased simulation without incurring the exponential average of the bias, doing it instead with a simple pointwise reweighting.
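For reference, the maximum-likelihood route to the plain kNN estimate alluded to above can be written as follows. This is a standard derivation under the local-uniformity assumption; the PAk-specific linear correction with the parameter a is built on top of it, in a form I do not reproduce here:

```latex
% Under local uniformity with density \rho, the volume shells between
% consecutive neighbors of a point,
%   v_l = \omega_d \left( r_l^d - r_{l-1}^d \right), \qquad l = 1, \dots, k,
% (\omega_d: volume of the unit d-ball), are i.i.d. exponential with rate \rho:
\log L(\rho) = \sum_{l=1}^{k} \left( \log \rho - \rho\, v_l \right)
             = k \log \rho - \rho\, V_k .
% Maximizing over \rho recovers the intuitive kNN estimator and, up to an
% additive constant, the free energy at the point:
\hat{\rho} = \frac{k}{V_k}, \qquad \hat{F} = -k_B T \log \hat{\rho} .
```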
Here is the test of this method against realistic free energy landscapes, shown as correlation plots: we assessed whether we are able to estimate the free energy correctly by plotting the analytically known true free energy against the estimated one. The correlation plots are very nice. And these are not toy models: they are landscapes in two to seven dimensions, embedded in 20 dimensions after twisting and similar transformations. So the result is very striking.

And with that, I conclude. Basically, giving as input only a metric, or the pairwise distances, and a reliable intrinsic dimension estimate, we are able to solve quite satisfactorily all the problems I mentioned before, and we obtain a robust and unbiased free energy estimator that performs quite well in high dimension and with real-world data, very much thanks to its adaptivity. It incidentally provides a good error estimate, which we could see in the distributions shown before, and it allows reweighting biased free energies without the problems of the exponential average. I thank all the people who worked on these methods, and I thank you for your attention; sorry for taking a bit longer.

Okay, thank you very much, Matteo, for this talk. If there is a quick question from the audience we can take it, or we can wait for the gather session afterwards. Anyone with a quick question? Okay, maybe we'll leave it for after. So thank you again, Matteo. We'll move on to our last talk.