The talk, a real talk after so many months, is a little bit impressive, to give a real talk in such a big room; the transition between Zoom and in-person makes it a little bit less sharp, so this is a quasi-remote talk.

Over the last decades computer power has grown enormously, and in a typical molecular dynamics simulation one now generates very long trajectories in which the same system visits an enormous number of conformations. If you have a protein, you want to know which conformations it actually visits; if, like me, you simulate crystals and defects, you want to know which defects are actually realized in the crystal. That is the motivation. So how does one look at these very long trajectories of a complex molecular simulation? You start from some representation of the system, and then, for every frame of the trajectory, you perform a projection onto a few selected variables. To perform this projection there are many methods; the simplest ones are unsupervised, for example methods that choose the variables which maximize the variance.

The next steps of this pipeline: after I perform my projection, I estimate the probability density as a function of the variables which I have selected, and then I somehow end up with something which, in the fairly general case, is a kinetic model. A kinetic model means that I am describing a complex molecular system with a few states which interconvert with each other on a rare, long time scale; this is what is called a Markov state model. So this is, let's say, the typical analysis pipeline of molecular simulations. On this pipeline there is a large community of people still working hard, and new things keep being discovered even now; for every one of these passages there are good solutions available, but none of them is totally solved. A critical point here, and "critical" refers to a difficulty, is actually the one we were mentioning before: projecting a multidimensional data landscape onto a number of variables which can in principle be visualized, or understood with little a priori knowledge of the nature of your system.
Okay, so here I'm going to run through this quickly. Let's say the hydrogen atom of unsupervised dimensional reduction is principal component analysis, which basically means computing the covariance matrix of your data, diagonalizing it, and projecting the data on the main eigenvectors of the covariance matrix. This approach, which is really, I would say, a holy grail in anything that we are doing (we are all somehow using it in some of its variants), in practice fails to find a minimal description if the manifold containing the data is curved and twisted. This is a very important and famous problem in manifold learning, which, like all famous problems, has been solved in many different manners, and there are now several approaches which basically allow ironing your embedding manifold. Really ironing: the point is the following. If you have this piece of paper, it doesn't matter how twisted it is, you can always flatten it out and map it onto a flat piece of paper. But if this thing is a tube, you start having a problem, because in order to map a tube onto a piece of paper, even if it is clearly two-dimensional, I have to break it. So this is clearly a problem of topology. If the topology of your manifold is the topology of a hyperplane, then it is possible to map your description into something with the correct dimension, and you can visualize it on paper. But if your manifold is not isomorphic to a hyperplane, then you are in trouble: either you break it, or you give up on performing an optimal dimensional reduction. For example, if it is a tube, you can of course say, OK, I keep it in three dimensions; then I don't have to break it in any manner, but it means I have to use one extra dimension to represent my data.

Unfortunately, the situation is even a little bit worse than tubes. Tubes are optimistic, let's say. In many real-world cases the manifold containing the data looks more like a foam: an object in which the intrinsic dimension of the manifold is typically locally very small (here, in this specific case, it is one), but in which there are several points, of null measure, where different manifolds meet each other. There, in some way, one can say that the intrinsic dimension is larger. This actually hinders the possibility of deriving a meaningful description in terms of a few explicit coordinates. Imagine that you really want to map this complicated object into something where the coordinates describing it are explicit functions: this is really very, very difficult. And the situation is actually even worse than that, because the walls of these foam bubbles happen to be typically high-dimensional. We would like complex systems, molecular systems, solid-state physics systems, to be low-dimensional, but they are typically not.
So here, one of the analysis tools that one can actually use for unsupervised manifold learning, a very important one actually, is estimating what we call the intrinsic dimension of the dataset. The intrinsic dimension of the dataset is basically the dimension of the foam walls; in this case we would get an intrinsic dimension of one. Now if you take, for example, a protein which is folding, and you compute the intrinsic dimension in the space of all its internal dihedrals, you find a number which is of order 10. Ten means that the walls of the bubbles are ten-dimensional. In other words, from any configuration there are ten linearly independent directions in which the system can move on a significantly large scale. Clearly this means that even if I were able to get an explicit map of this foam, the map would be useless, because I am not able to visualize more than three coordinates. So performing a dimensional reduction with no information loss, and that is what we are talking about, is the problem. Of course you can always reduce the dimension of your data representation if you accept throwing away some information; if you don't want to do that, then you are typically in exactly this kind of trouble. Mapping this object to dimension two or three, in such a way that I can visualize it, typically implies an important information loss.

OK, still, let's try to do it. Let's try, for example, a two-dimensional projection of the folding of a small protein. For this protein a very long molecular simulation was available, and we applied one of these methods for ironing the embedding manifold, which is called Isomap. If you try to iron the data landscape explored in this extremely long molecular simulation, you get a representation like this. Blue points and yellow points should be separated into two clusters, because one should represent the folded state and the other the unfolded state. Clearly there is a very high degree of mixing between states which have nothing to do with each other, so this does not give you, let's say, an insightful two-dimensional representation of the data.

OK, so now we are going to move to a description of what we have been trying to do in my group, and with collaborators in other groups, over the last eight years. The idea is to develop a set of methods which allow understanding something about the structure and the properties of complicated data manifolds. Like with all the tools of computer modeling, one should first try simple things. My suggestion to everybody: when you have a problem, the first thing you should always try is a principal component analysis, and then make a histogram as a function of the first two principal components. If this doesn't work, if you find something that doesn't fit what you expect or the structure is too complicated, then you move forward: you can try kernel principal component analysis, which already does something better. There are other very powerful methods, and if you are really, really desperate, what do you do? You try to learn from the Marshallese sailors.
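Since the suggestion above is concrete (PCA first, then a histogram of the first two components), here is a minimal sketch of that first diagnostic step. It is not code from the talk; the library choices (numpy, scikit-learn, matplotlib) and the placeholder feature matrix `X` are assumptions.

```python
# Minimal sketch of the "try PCA first" diagnostic described above.
# Assumes X is an (n_samples, n_features) array of descriptors, e.g. dihedral angles.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 50))          # placeholder data; replace with your features

pca = PCA(n_components=2)
Y = pca.fit_transform(X)                   # project on the first two principal components

# 2D histogram; a log scale mimics a free-energy-like view of the density
H, xedges, yedges = np.histogram2d(Y[:, 0], Y[:, 1], bins=80)
plt.imshow(np.log(H.T + 1), origin="lower",
           extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]], aspect="auto")
plt.xlabel("PC 1"); plt.ylabel("PC 2"); plt.colorbar(label="log(1 + counts)")
plt.show()

print("explained variance ratio:", pca.explained_variance_ratio_)
```

If the resulting histogram already shows well-separated basins, there may be no need for anything more sophisticated; if not, the talk's argument is that the manifold is likely curved, foam-like, or too high-dimensional for such a projection.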
So, the Marshallese sailors. I would actually like to learn something from them, but unfortunately they live on the opposite side of the Earth. They are incredibly skilled sailors who sail the Pacific, and they have been doing so for centuries. In order to do that, they had in practice to record the relative positions of several islands, and at the same time the main direction of the wind in between and the main direction of the ocean currents underneath. So let's say it is a six-dimensional landscape: the coordinates of the islands, the directions of the wind, and the directions of the water streams. Clearly they immediately realized that they could not make a graph with six coordinates, and so they ended up with a compact representation of this six-dimensional landscape, totally unreadable to anyone who is not a Marshallese sailor, which looks exactly like that. For them this is a map, a map that allows them to go safely, or quasi safely, from one island to the next, and even if you don't notice it, it is six-dimensional. The thing is that, at variance with ordinary maps, which are, let's say, a drawing of how the land looks, this is a map which conveys information about abstract quantities. These abstract quantities must mean something to the people who are able to handle the map; typically, you need a translator for this map.

So now, our perspective on complex data landscapes is that we do not even attempt to project the data on a low-dimensional manifold, because for the kind of manifolds we are interested in this is meaningless. Instead, what we do is build a topography of the landscape. A topography of the landscape is basically a list of the probability peaks and of the saddle points connecting them. This topography can typically (even if not always) be represented in two dimensions, but you can think of it as living in a space of any dimension you like. It is as if you map your data landscape into, say, 20 probability peaks, which can be connected or not with each other. You can imagine mapping your probability landscape into a symmetric matrix, 20 by 20: on the diagonal of this matrix you have the heights of the probability peaks, and on the off-diagonal elements you have the heights of the saddle points, the critical points. So this is, in general, our perspective on how we try to represent the data landscape. And this is an illustration: this is a probability landscape, you harvest points from it, then you assign points to states, and then you build a representation which can look like this, or like that if you only care about, say, the highest saddle points between two different probability peaks. We are going to see examples of these representations in the rest of the presentation. OK, in order to do this we had either to learn to use or to develop many, many tools. In particular, we had to develop tools to compute the intrinsic dimension of the manifold containing the data.
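As a toy illustration of the "topography as a symmetric matrix" idea described above, here is a small hedged sketch. It is not the group's code, just an assumed data structure in which the diagonal stores peak heights (for example log-densities) and the off-diagonal entries store saddle heights between pairs of peaks.

```python
# Toy sketch of a landscape "topography" stored as a symmetric matrix.
# Diagonal: height (e.g. log-density) of each probability peak.
# Off-diagonal (i, j): height of the saddle point connecting peaks i and j
# (entries stay -inf if two peaks are not directly connected).
import numpy as np

n_peaks = 3
topo = np.full((n_peaks, n_peaks), -np.inf)

np.fill_diagonal(topo, [5.2, 4.8, 3.9])   # peak heights (illustrative numbers)
topo[0, 1] = topo[1, 0] = 2.1             # saddle between peak 0 and peak 1
topo[1, 2] = topo[2, 1] = 1.4             # saddle between peak 1 and peak 2
# peaks 0 and 2 are not directly connected: their entry stays -inf

def barrier(i, j, topo):
    """Height difference between peak i and the saddle toward peak j."""
    return topo[i, i] - topo[i, j]

print(barrier(0, 1, topo))   # ~3.1
```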
We are going to see why this number is so incredibly important, not only conceptually but also computationally. After computing the intrinsic dimension, we estimate the probability density, and this is where the importance of the intrinsic dimension comes out clearly for the first time. The data are contained in the embedding manifold, and in order to define what the probability density is you need a measure. This measure must be defined on the embedding manifold; it cannot be defined in the space of features, because otherwise you would have a lot of empty space. Imagine that you have a sheet of paper and the data can only live on it: if I say that my measure lives in the three-dimensional space, even defining what the probability density is becomes tricky. The point is that the support of our probability density is the embedding manifold, and we explicitly estimate it on that manifold. Then we find the probability maxima by density peak clustering, we compute the probability at the boundary between each pair of maxima, and then we move to more standard ground, which is how to represent topographies which in real-case applications include tens, if not hundreds, of probability peaks, in a manner that conveys useful information. This is somehow nontrivial, but it is where we exploit many tools which have already been developed by other people.

Okay, let's start from the estimate of the intrinsic dimension. From now on part of my talk will be partially technical, so I will necessarily have to go pretty quickly, but if you want more details you are more than welcome to interact with me. The intrinsic dimension is the minimum number of parameters, or coordinates, required to describe the data while minimizing the information loss; this definition is from Wikipedia. The classical manner of estimating the intrinsic dimension basically uses this equation here, which says that the number of data points within a distance r scales as r to the d, where d is the intrinsic dimension. This is a very obvious equation: if I am on a line, the count grows linearly with r; if I am on a sheet of paper, it grows with the square of the distance. The critical thing about this scaling law, which has been used in many different approaches for estimating the intrinsic dimension (in the 70s and 80s it was called the fractal dimension, and even now), is that the equation has a prefactor. This prefactor is rho_i, the local density of the data points. So this equation tells you that the estimates of the intrinsic dimension and of the density are in principle entangled: using this formula, you cannot do inference on d and rho_i separately, because they appear in the same scaling law. If you are studying a dynamical system, this is not really a problem, because the density, on a Lorenz attractor for example, is typically not so variable over a large region of your phase space. But if you are studying a molecular system, the density varies by orders of magnitude.
Imagine a metastable system: by definition the density varies by orders of magnitude. In fact, in molecular simulations we don't really even talk about density, we talk about free energy, and the free energy is minus the logarithm of the density; we perform a logarithmic transformation precisely in order to appreciate the extremely large variations in density that we have in a molecular system. So the prefactor in such an estimator will vary brutally in a molecular simulation, and you get artifacts. This was well known. One of the developments we made, five years ago now, was an approach which allows disentangling the estimate of the intrinsic dimension from the estimate of the local density. How do you do that? As a probe variable, instead of a distance we consider a ratio of distances; this is the key point. For each point i we find the nearest neighbor, whose distance we call r_i1, and the second nearest neighbor, whose distance is r_i2. The ratio of the two is dimensionless, and you can qualitatively understand that its probability distribution is independent of the density: if the density is large, r_i1 is small, but r_i2 is also small, so their ratio is much less dependent on the density. In fact you can prove that it is not only qualitatively but exactly independent of the density: the probability distribution of this ratio nu, given the density rho, does not depend on rho at all. It is a Pareto distribution, the expression shown here, which is parametrized only by the intrinsic dimension; the density has disappeared. So what we do to compute the intrinsic dimension is measure this nu_i on all our data points and then do inference. You can do it with Bayesian inference, with maximum likelihood, in many different manners, but it is a simple inference problem, because we have to infer the value of a single number, just the intrinsic dimension.

OK, questions? The intrinsic dimension: imagine that here you have something like a string; its intrinsic dimension is 1, even if I twist it around, while the intrinsic dimension of something like this sheet of paper is 2. The intrinsic dimension is the dimension of the manifold which contains my data; it is just the dimension of that object, and that is what we are trying to compute. This quantity is highly nontrivial even in materials science databases, which are objects that you for sure have to handle within this workshop, because you have many, many different features, and these features are, let's say, locally correlated. If you are at a specific data point, which can be a specific material, you don't find other materials in all possible directions; you find materials only in a few directions, which unfortunately happen not to be the same directions everywhere. This is the reason why you cannot simply use principal component analysis. Is it clearer? All right.
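The two-nearest-neighbour idea described above can be written down in a few lines. The sketch below is an illustration under stated assumptions, not the group's implementation (their tools are in the Python package mentioned at the end of the talk): it uses the fact that, on a locally uniform manifold of dimension d, the ratio nu = r2/r1 follows the Pareto law P(nu | d) = d * nu^(-d-1), whose maximum-likelihood estimate is d ≈ N / sum(log nu_i).

```python
# Hedged sketch of a two-nearest-neighbour intrinsic dimension estimate
# (ratio of the first two neighbour distances, Pareto likelihood).
import numpy as np
from scipy.spatial import cKDTree

def twonn_id(X):
    """Intrinsic dimension from the ratio of second to first neighbour distances."""
    tree = cKDTree(X)
    dists, _ = tree.query(X, k=3)          # column 0 is the point itself
    r1, r2 = dists[:, 1], dists[:, 2]
    mask = r1 > 0                          # discard exact duplicates
    nu = r2[mask] / r1[mask]
    # maximum-likelihood estimate of d under P(nu|d) = d * nu**(-d-1)
    return len(nu) / np.sum(np.log(nu))

# quick check: points on a curved 2D surface embedded in 3D
rng = np.random.default_rng(0)
u, v = rng.uniform(0, 1, (2, 5000))
X = np.column_stack([u, v, np.sin(3 * u) * np.cos(3 * v)])
print(twonn_id(X))   # should come out close to 2
```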
Now, a very important observation. Mathematically speaking, the intrinsic dimension is very well defined: you have a manifold and, say it is a Riemannian manifold, then what its dimension is, is totally well defined. In real-world applications, though, you have a problem. Take the example of a configuration in a molecular dynamics run of a biomolecule with, say, a thousand atoms and no constraints on the bonds. What will be the intrinsic dimension of the manifold containing the data? It will be 3000, right? Because in principle there are no constraints, so you can go everywhere. But this estimate is irrelevant. Why irrelevant? Because a useful estimate of the intrinsic dimension (we will see in the following what "useful" means) should provide the number of independent directions in which the system can move significantly. In a way we are not discovering anything new here: if you do principal component analysis on a real-world dataset, you will never find an eigenvalue which is exactly zero. You will find maybe a few eigenvalues which are very large, and those tell you what the meaningful components are, but all the others will still be different from zero, and you throw them away anyway. This is the same thing we are trying to do with the intrinsic dimension. The components you throw away in principal component analysis are, let's say, short components, components which matter only for a description of your system on a very small length scale. If you look at your system on a larger scale, you discover that only a few components are actually strictly necessary to describe it.

Here is a typical example. If you look at this data landscape on a small scale, the intrinsic dimension is two, but on the length scale of the red line it is clearly one. A human eye is always capable of saying: this is one-dimensional almost everywhere, except at the few points, this one and this one, where you can really go in two directions. You could even try this as a neuroscience experiment; we tried with 10 subjects and all gave the same answer. Great statistics.

I'm not going to spend a lot of time here, but basically, when you have a real-world system, a database in materials science or a molecular dynamics trajectory, you always have to estimate the intrinsic dimension as a function of the scale. For example, you may find that the intrinsic dimension at a scale of 0.1 angstrom is 100, at a scale of one angstrom it is seven, and maybe it is still seven at a scale of three angstroms. It is very important that your estimate of the intrinsic dimension comes together with a scale analysis, because the scale is exactly what gives you a handle on the meaning of the number you are measuring. If I say I have an intrinsic dimension of four on the scale of five angstroms, you know it means that if I move my system by five angstroms of RMSD, I can move in four different directions. If I simply give you the number without the scale, it means little; the two things must always come together.
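One simple way to probe the scale dependence just described, offered here only as an assumed illustration and not as the procedure used in the talk, is to decimate the dataset: with fewer points the typical neighbour distances grow, so the two-nearest-neighbour estimate effectively probes larger scales. A hedged sketch, reusing the `twonn_id` function from the earlier snippet:

```python
# Hedged sketch: intrinsic dimension as a function of scale by decimation.
# Reuses twonn_id(X) from the previous snippet; X is (n_samples, n_features).
import numpy as np
from scipy.spatial import cKDTree

def id_vs_scale(X, fractions=(1.0, 0.5, 0.25, 0.1, 0.05), seed=0):
    rng = np.random.default_rng(seed)
    results = []
    for f in fractions:
        idx = rng.choice(len(X), size=max(int(f * len(X)), 10), replace=False)
        Xs = X[idx]
        d_est = twonn_id(Xs)
        # mean first-neighbour distance as a rough proxy for the probed scale
        r1 = cKDTree(Xs).query(Xs, k=2)[0][:, 1]
        results.append((r1.mean(), d_est))
    return results   # list of (scale, intrinsic dimension); look for a plateau

# for scale, d in id_vs_scale(X): print(f"scale ~ {scale:.3f}  ->  ID ~ {d:.2f}")
```

A plateau of the estimated dimension over a range of scales is what the next part of the talk takes as the signature of a meaningful intrinsic dimension.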
The reason is that the intrinsic dimension is rigorously independent of the scale only in idealized test systems. So we have a procedure for studying the dependence of the intrinsic dimension on the scale, and we call an intrinsic dimension meaningful if it is, let's say, scale invariant at least over a range of scales. That is, in a way, our definition. There are systems in which you cannot define an intrinsic dimension, because you don't have such a plateau: depending on the scale at which you look at your system, you get a different intrinsic dimension. If you are in such a condition, then your tour through the methods developed in my group is over; you cannot continue. One thing you cannot do in that case, for example, is the second step, which is estimating the probability density at each point. Okay, I have finished the part on the intrinsic dimension. I don't know if you have other questions, also remotely, or curiosities.

Thanks, Alessandro. Can I just ask about this last thing you said? Would a fractal be a case where you don't have an intrinsic dimensionality?

Yeah, well, the fact that your intrinsic dimension is not an integer doesn't imply that the intrinsic dimension is not defined. The thing that kills our possibility of estimating the density, for example, are so-called multifractals, which are cases where you have no scale invariance of any sort. What makes everything else possible is not having an integer number there, but having a number which is at least approximately scale invariant.

Okay. One more, if I may. Maybe you'll cover this, but are there methods that can calculate the intrinsic dimensionality taking into account symmetry-equivalent configurations?

This is absolutely a key question. The thing is the following: the features that one uses to describe the system must obey all the symmetries that you have in the system, because otherwise you are computing a number which doesn't make sense. For example, if you are studying, I don't know, a silicon cluster (we may show an example), you need to use atomic symmetry functions, or SOAP, or features which automatically encode translational and rotational invariance, and encode permutational invariance. This is something that you have to solve in advance. So finding a description which is automatically resilient to all the symmetries (I like using "resilient" because it's a very fashionable word in Italy, sorry for that) is a key prerequisite in all our analyses.

But presumably you can go both ways, right? You can either have features that respect the symmetry, or consider all possibilities without taking symmetry into account, and then determine the number of symmetry operations and subtract that from your intrinsic dimensionality.

Well, that is something we don't do, and I'm not sure it would be very meaningful. Okay. Thanks a lot.

Okay. Now, let me see how we are doing with time. Yeah, time is running. Amazing. All right. So, how do we estimate the density? Let's start from the k-nearest-neighbor density estimator, which works like this.
You take the point here, and you fix a parameter called k, which here is 13; then you measure the distance between the central point and its 13th neighbor. Your density will then be the number of data points, 13, divided by the volume they occupy, which in two dimensions is pi times r_13 squared. It is clear that if you don't know this exponent, the intrinsic dimension, then you are in trouble: if you only measure distances between data points and you don't know the intrinsic dimension of the manifold in which they are embedded, you could put 1000 there, for example, your number of degrees of freedom, and you get nonsense.

OK, I don't know if you can see it, but the k-nearest-neighbor estimator has a problem in data landscapes where the density varies sharply. For example, here, if I use k equal to 20, I have a large statistical error here, while there the estimator is somehow good. If instead I use k equal to 375, on this point I have a small statistical error on the estimate, but here I have a systematic error, because I am including in my estimate regions with a totally different density of data points. So we need a different k for each point. This is well known; it has been studied a lot. The classical problem in unsupervised data analysis, in particular in density estimation, is called the bias-variance trade-off: finding a compromise between taking your neighborhood large enough that the statistical error on your estimate is small, but not so large that you start inducing a systematic error in your estimate. So we have to find the compromise.

Our contribution, which is of course, almost needless to say, strongly connected to things other people have done, was developed with a specific goal: performing an adaptive density estimate in which two conditions are satisfied. First, the density is implicitly computed on the embedding manifold, without defining its coordinates explicitly. These two things are very important: I don't want to define the explicit coordinates of my embedding manifold, yet I want to compute the density exactly on that manifold. How do we do that? By throwing away features and using only distances. If my data are contained in a manifold, no matter how curved it is, as long as the manifold is locally flat, the local distance on the manifold will be locally Euclidean. So you can measure this distance using all your original features, and this distance, measured in the full feature space, will be a meaningful distance also on your manifold. It means that in everything we do we cannot use features, we use only distances; in this manner we are able to compute things on the embedding manifold without defining its coordinates explicitly. This is a subtlety, but an important one.
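To make the role of the intrinsic dimension in the exponent concrete, here is a hedged sketch of a plain k-nearest-neighbor density estimate in which the volume is computed in the intrinsic dimension d rather than in the number of features. It is a textbook-style illustration, not the adaptive estimator described next; the function and variable names are my own.

```python
# Hedged sketch of a k-NN density estimate using the intrinsic dimension d
# as the exponent of the volume, instead of the (much larger) number of features.
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gammaln

def knn_density(X, k, d):
    """Log of a k-NN density estimate at every point of X, with distances measured
    in the full feature space but volumes computed in dimension d."""
    n = len(X)
    r_k = cKDTree(X).query(X, k=k + 1)[0][:, -1]     # distance to the k-th neighbour
    # log volume of the d-dimensional ball of radius r_k
    log_vol = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1) + d * np.log(r_k)
    return np.log(k) - np.log(n) - log_vol           # log rho_i = log[ k / (N * V_d(r_k)) ]

# usage sketch: log_rho = knn_density(X, k=13, d=round(twonn_id(X)))
```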
So how do we perform this adaptive estimate? The idea is to find, for each point, the largest neighborhood in which the density can be considered constant, and we do this with a statistical test. We start from a very small k, say k equal to 1 or 2, and the question we ask is: should we include the next neighbor in the density estimate? The answer will be yes if the density doesn't vary too much. How do we assess this? We compare two hypotheses. Hypothesis one is that the density around point i and the density around its neighbors are different; hypothesis two is that they are the same. Then we do a model comparison, which is a typical procedure in statistics: you have two hypotheses and you want a criterion which tells you when to reject, for example, hypothesis two. That is exactly what we do. We reject hypothesis two when a likelihood, which I cannot derive here because I don't have enough time, namely the likelihood of hypothesis one, is much larger than the likelihood of hypothesis two; how much larger is a metaparameter of our approach, so I have to say at which confidence level I reject the hypothesis that the density is constant. When this happens, the size of the neighborhood of point i is taken as optimal.

A small illustration of what happens: I am increasing the size of my neighborhood more and more and comparing the densities of the two neighborhoods, and this is the difference in the likelihoods. You see that at a certain point the likelihoods of the two hypotheses become very different and cross our confidence threshold; this happens when my red point reaches the region where the density is significantly higher. At that point I stop, and I use this neighborhood to estimate the density at my central point. Is it qualitatively clear what we are doing?

This procedure has the advantage that the test can be performed in parallel for each data point: if I want to write a code that performs it, I can parallelize it very easily, because the test is performed independently for each point. This is an important difference with respect to other adaptive approaches, which typically require N-squared operations. This is the outcome of the procedure: 13 as the optimal neighborhood size for this data point and 270 for this other one. Clearly this implies that the statistical confidence on the density will be different at different points of my landscape: here I will have a very high statistical confidence and there a low one. So each point in our density landscape comes with an error bar, and this error bar is not uniform. This is another important difference with respect to other approaches for density estimation, where the errors are typically uniform; in our case they are not. OK, a point-dependent k_i: this is what we already said. Our model is likelihood-based, therefore it also entails an estimate of the confidence on the quantity I am inferring. This can be done using the Fisher information, basically the curvature of the likelihood at its maximum. Very trivial.
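Below is a simplified, hedged stand-in for the adaptive-neighborhood idea just described: grow k and stop when a likelihood-ratio test rejects the hypothesis that the density is constant. It is not the test used in the talk; it assumes that the spherical-shell volumes between consecutive neighbours are i.i.d. exponential variables (true for a locally uniform density), compares a "one density" against a "two densities" model, and uses an arbitrary strict threshold `D_thr`.

```python
# Simplified stand-in for the adaptive-neighbourhood test described above:
# grow k and stop when "constant density" is rejected by a likelihood ratio.
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gammaln

def shell_volumes(r, d):
    """Volumes of the spherical shells between consecutive neighbour distances r."""
    log_ball = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1) + d * np.log(r)
    ball = np.exp(log_ball)
    return np.diff(np.concatenate([[0.0], ball]))

def optimal_k(X, i, d, k_max=256, D_thr=23.9):
    """Largest k around point i for which the density can be considered constant.
    D_thr is the rejection threshold, a metaparameter (a strict value is used here)."""
    dists = cKDTree(X).query(X[i], k=k_max + 1)[0][1:]   # exclude the point itself
    vols = shell_volumes(dists, d)
    for k in range(4, k_max // 2):
        vA, vB = vols[:k], vols[k:2 * k]                 # inner shells vs outer shells
        rhoA, rhoB = k / vA.sum(), k / vB.sum()
        rho0 = 2 * k / (vA.sum() + vB.sum())
        # likelihood-ratio statistic for "two densities" vs "one density"
        D = 2 * (k * np.log(rhoA) + k * np.log(rhoB) - 2 * k * np.log(rho0))
        if D > D_thr:
            return k
    return k_max // 2
```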
All right, we benchmarked this procedure on artificial data landscapes designed to be challenging for probability density estimation. We started from analytical probability densities in dimensions between 2 and 7, sampled 10,000 points from each landscape, and then transformed them. Say we are in dimension 7: we enlarge the feature space by adding, in principle, as many variables as we want, and then we rotate the landscape and make it curved. There are many manners of doing this: we can put the landscape on a torus, on a spiral, on a sphere, many manners of making it topologically complex and curved, and then we embed it in a high-dimensional space. This plot is the correlation between the ground-truth density on the embedding manifold and the density measured with this procedure, and the correlation is very good. As you can see, the error bars are also correct, in a rather rigorous sense: the difference between the true and the measured density, divided by the estimated error, is a standard normal variable with variance 1. This means that our estimator is unbiased and the model of the error is correct. How correct it is depends on the intrinsic dimension: it is basically perfect for an intrinsic dimension of 2, while at an intrinsic dimension of 7 we start seeing some deviations between a normal distribution and the observed distribution. In practice this means that we can estimate the density, or equivalently the free energy, for systems of intrinsic dimension up to about 15; above 15 the errors become too large, but up to 15 we are able to keep things under control. Fifteen does not mean that we have only 15 coordinates: we can have 1000 coordinates, but an intrinsic dimension of 15. Clear?

OK, since I'm running out of time, I have to decide a little which material to show. Maybe I'm not going to show you how to analyze entropic basins, but I do want to tell you a little bit about finding the probability peaks. Say we have a data landscape with 100,000 points, immersed in a space of, I don't know, 100 features, with an intrinsic dimension of 7. For each point we now have an estimate of the probability density. Since these points are immersed in a very high-dimensional space, it is not possible simply to visualize this density, so we need a procedure which finds the probability peaks automatically. We do this by density peak clustering. How does it work? The idea of density peak clustering, which is once again a method requiring only distances between data points (once again extremely important: we don't want to use features, we want to use distances, because this allows working directly on the embedding manifold), is that the point at the top of a density peak is far from any other point with higher density. So what we do is, first of all, compute the local density around each point.
We can do this with the adaptive method I showed you before, or simply by counting the number of points within a neighborhood. For example, the density around point one will be seven, the density around point eight will be five, and around point ten it will be four. Then we have to encode the idea: for each point, we compute the distance to all the points with higher density, and we take the minimum value. This is the key idea: we only look at points which have a higher density. Look at data point number ten here: it is a local density maximum, but its density is relatively low, only four. How can we distinguish this data point, with density four, from another point which also has density four but is not a local maximum? The thing is that for this point, if I look for another point with higher density, say density five, that point will be far away, for example in this region here; instead, for another data point with density four which is not a local maximum, the nearest point with higher density will be close by. That is exactly what I am depicting here: these two points have basically the same density, but the nearest neighbor with higher density is close in one case and far in the other.

Now, if one makes a plot where on the x-axis I have the density and on the y-axis I have this distance delta (which is, I recall, the minimum distance to the points with higher density), the cluster centers pop up as outliers. This is the key observation at the basis of the method, and the rest is trivial. The important thing is that this approach is deterministic: no optimization is required. If you are a little familiar with clustering methods, you may know that in k-means, for example, the result depends on the initialization, so you have to repeat the clustering many times and make sure you take the best solution. Here, instead, finding the cluster centers is deterministic, and it is also unsupervised. It is unsupervised because, don't forget, when one estimates the density with our procedure, one also knows the neighborhood size, the size of the neighborhood in which the density can be considered constant. The points whose delta, the distance I have just defined, is larger than the radius of the neighborhood used to compute their density are what we call cluster centers. So we are able to find cluster centers without making any system-dependent choice; the whole thing can be pipelined, let's say. The procedure is nonparametric, in the sense that the density estimate is nonparametric except for the statistical confidence (for which we have a manner of correcting its effect, which I haven't described today), and the number of clusters is determined automatically. Here I had a slide, which I deleted, where you would see that the clusters come out to be what you would expect them to be.

Now there is a very important detail, and with this detail I don't think I'm going to have time to show you many other things, but this I want to tell you.
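Below is a compact, hedged sketch of the density-peak idea just described: a density per point, then delta as the distance to the nearest point of higher density, with centers picked as outliers. It is a generic illustration of the published algorithm, not the speaker's implementation; in particular it uses a simple k-NN density proxy and a quantile rule to pick centers, where the talk uses the adaptive neighborhood radius.

```python
# Hedged sketch of density peak clustering: rho (local density), delta (distance
# to the nearest point with higher density), centers as outliers in (rho, delta).
# Uses a full distance matrix, so it is O(n^2) memory: fine only for a sketch.
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist

def density_peak_clusters(X, k=20, delta_quantile=0.99):
    n = len(X)
    # simple k-NN density proxy: inverse of the distance to the k-th neighbour
    r_k = cKDTree(X).query(X, k=k + 1)[0][:, -1]
    rho = 1.0 / (r_k + 1e-12)

    D = cdist(X, X)
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)
    order = np.argsort(-rho)                      # from highest to lowest density
    delta[order[0]] = D[order[0]].max()           # global maximum: conventional value
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]                     # all points with higher density
        j = higher[np.argmin(D[i, higher])]
        delta[i], nearest_higher[i] = D[i, j], j

    centers = np.where(delta > np.quantile(delta, delta_quantile))[0]
    centers = np.union1d(centers, [order[0]])     # the global maximum is always a center
    # assign each point to the cluster of its nearest higher-density neighbour
    labels = np.full(n, -1)
    labels[centers] = np.arange(len(centers))
    for i in order:
        if labels[i] < 0:
            labels[i] = labels[nearest_higher[i]]
    return labels, centers
```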
As I told you, our density estimate comes with an error bar. This means that if I have two probability peaks, the saddle point separating them, in fact every point along this profile, comes with an error bar: an error bar at the top of the peaks and an error bar at the saddle. In any real-world application these error bars will never be zero. This means that a clustering partition based on the density, like the one we are doing, always comes with a statistical confidence. If somebody tells you "I have five clusters", that is not a well-defined statement, especially with this approach; you have to say "I have five clusters with a statistical confidence of z equal to", for example, some value. How do you estimate the statistical confidence? Qualitatively, the statistical confidence of a peak is given by the difference between the density at the peak and the density at the saddle, divided by the sum of the two errors. If this ratio is large, the structure in your probability density is statistically robust; if it is of order one, it is not robust, or rather it has a low level of statistical confidence. So, in a way, the true final metaparameter of our approach is the statistical confidence: when we talk about the probability landscape, we always have to set the value of this number and then, in principle, see how the results change by varying it. Was there a question? No, maybe not.
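The confidence criterion just described (peak density minus saddle density, divided by the sum of the two error bars) is easy to state in code. This is a hedged sketch of that ratio and of one obvious way to use it, merging peaks that fall below a chosen z threshold; the merging rule shown here is my own simplified illustration, not necessarily the rule used in the group's package.

```python
# Hedged sketch: statistical confidence of a density peak relative to a saddle,
# and a naive merge rule for peaks that are not statistically robust.
import numpy as np

def peak_confidence(log_rho_peak, err_peak, log_rho_saddle, err_saddle):
    """z-like confidence: (peak - saddle) / (sum of the two error bars)."""
    return (log_rho_peak - log_rho_saddle) / (err_peak + err_saddle)

def merge_weak_peaks(peaks, saddles, z_thr=2.0):
    """peaks: dict {label: (log_rho, err)}; saddles: dict {(i, j): (log_rho, err)}.
    Returns the pairs of peaks whose separation is below the confidence threshold."""
    to_merge = []
    for (i, j), (s_rho, s_err) in saddles.items():
        z_i = peak_confidence(*peaks[i], s_rho, s_err)
        z_j = peak_confidence(*peaks[j], s_rho, s_err)
        if min(z_i, z_j) < z_thr:       # at least one peak not robust w.r.t. this saddle
            to_merge.append((i, j))
    return to_merge

# usage sketch:
# peaks = {0: (5.2, 0.3), 1: (4.9, 0.4)}; saddles = {(0, 1): (4.6, 0.5)}
# print(merge_weak_peaks(peaks, saddles, z_thr=2.0))   # -> [(0, 1)]
```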
All right, I'm really running out of time, so I'm not going to tell you how we find the saddle points. One thing I do want to tell you before the conclusions: we have verified on many systems of increasing complexity that the probability peaks, the density peak clusters that we find in this manner, correspond almost one to one with the inherent states of a Markov state model. Many of you have heard of Markov state modeling; it is a very classical approach for characterizing the kinetics of complex systems, and it gives you as an output the number of metastable states. We are basically finding the same thing, but without looking at the kinetics: we have absolutely zero information on the transition probabilities in our approach. We only estimate a density; there are no kinetics, no transition matrices, no diagonalization of anything. And we actually find that our inherent states are better than the ones obtained with the classical Markov state modeling analysis. This plot shows the mean first passage times in a pretty complex molecular system: with our description we obtain flat lines as a function of the lag time. I'm sorry, this is very technical; the graph will be understandable only to those of you who know about Markov state modeling. It is even more evident here, for an even more complex system: using the standard procedure your mean first passage time looks like that, with a plateau, but not such a good one; using our inherent states you get something much flatter, which is this one and this one.

All right, with this I'm going to skip the part on entropic traps, and I'm really moving towards the conclusions of my talk. First of all, I would like to bring to your attention that tomorrow morning there will be a tutorial on a Python package implementing all the methods I mentioned, and several others. This package was developed by this group of people; in particular, Aldo played a really major role in making this complicated project work. Here is where you can read about the package, and you can download the code.

With this, I'm really going to the conclusions. What I presented today is an approach, actually a combination of different approaches, which allows computing what we call the topography of a multidimensional probability distribution. It consists of, first of all, a robust algorithm to determine the intrinsic dimension. This is really a key quantity: it is a number which enters our likelihood model, and if you get this number wrong, the likelihood model selects the neighborhood in a totally wrong manner, so you cannot avoid computing it. Then a probability density estimator capable of also providing an estimate of the error, and a procedure for finding the probability peaks automatically, regardless of their shape and of the dimensionality of the manifold in which they are embedded, which also provides a statistical confidence on these peaks. I think I am going to close the presentation here, and I am ready to take questions.

Thank you very much, Alessandro, for this super elegant talk. The floor is open for questions.

Yes. If there are two regions in the total data space which have two different intrinsic dimensions, which one should we choose?

Well, in principle you should use for each data point the local intrinsic dimension, because it enters your likelihood function. In the approach which I presented, the intrinsic dimension is only one, but in principle you can do this: there is an approach which allows detecting regions of your data landscape characterized by a constant intrinsic dimension, and you are also able to estimate how many such regions you have. It is like performing clustering based on the intrinsic dimension. There you actually get some surprises. For example, in protein folding: if I made a small exercise now, how many of you would vote for a higher intrinsic dimension in the folded state? Probably all of you would say that the intrinsic dimension is higher in the unfolded state, because it is disordered. Well, this is not the case; it is actually the opposite. The reason is that the relevant intrinsic dimension in the two states is on a totally different scale.
In the folded state the relevant scale of fluctuations is two angstroms, for example, and at two angstroms you can actually move in many different directions. In the unfolded state, since it is really disordered, fluctuations of two angstroms are totally irrelevant; what matters are scales of the order of five or ten angstroms, and on those scales the intrinsic dimension is lower. So the intrinsic dimension on the scale between five and ten angstroms is of order ten, which is the relevant one, while on the scale of one angstrom it is of order 22, if I remember correctly, so much higher. But what you said is actually very, very important.

Thanks, Alessandro, again. Maybe I missed it, but once you have done all this machinery, do you have access to the data manifold? Can you sample from it?

No, you don't have a generative model. This is actually a very interesting observation, and it is one of the things that I would really like to do. For example, I would love to merge this kind of approach with the Boltzmann generator approaches, like the things that have been done by the group of Frank Noé and many others. That is something we would like to do. In our case we simply have a set of observations, our data points, and from this set of observations we have a measure of the probability density on the embedding manifold, with error bars. That's it; no way of generating anything. Okay, let's talk afterwards. I don't know if there are questions on the chat; if there are, you can tell me. All right.

Great, thank you very much for the talk. Given a set of data points, we are able to estimate a density. I would be interested in which points I should keep so that, if I run this method on a subset of my original set, I obtain a density which is as close as possible to the one I computed originally.

This can be assessed very rigorously using the tools of statistics, since we have estimated the variance of our density. If you take a subsample of your sample of a given size, you know rigorously how your variance should scale. Trivially, if you subsample your dataset by a factor of two, you should qualitatively expect your error bar to increase by the square root of two, and of course your new estimate should be consistent with the previous one. This is actually a very powerful procedure, which one can use to test statistical independence. An implicit assumption of everything we are doing, which I haven't commented on, is that our measurements must be at least approximately statistically independent; if this is violated, the scaling I described will only be approximate, and how approximate it is tells you how correlated your data are. Like in block analysis.

Thank you.
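The subsampling check just described (halve the data, expect the error bar to grow by roughly the square root of two, and the estimates to stay consistent) is easy to sketch. This is an assumed illustration built on the simple `knn_density` estimator from the earlier snippet, not the speaker's procedure; the 1/sqrt(k) error model is itself an assumption made for the illustration.

```python
# Hedged sketch of the consistency check described above: estimate the density
# on the full set and on a random half, and compare the change within error bars.
# Reuses knn_density(X, k, d) from the earlier snippet.
import numpy as np

def subsample_check(X, k, d, seed=0):
    rng = np.random.default_rng(seed)
    half = rng.choice(len(X), size=len(X) // 2, replace=False)
    # same spatial scale: half the data contain roughly half the neighbours there
    log_rho_full = knn_density(X, k, d)[half]
    log_rho_half = knn_density(X[half], k // 2, d)
    err_full, err_half = 1.0 / np.sqrt(k), 1.0 / np.sqrt(k // 2)   # grows ~ sqrt(2)
    z = (log_rho_half - log_rho_full) / np.hypot(err_full, err_half)
    # for approximately independent data, z should resemble a standard normal
    return z.mean(), z.std()
```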
I think I'll thank you again on behalf of everyone.

Nice catch. It's not that the number of coordinates should be greater than the intrinsic dimension; it's that the maximum value of the intrinsic dimension is equal to the number of coordinates. I cannot have this sheet of paper immersed in a one-dimensional space, right? I can have a line immersed in this two-dimensional sheet of paper, but the reverse is not possible. So the intrinsic dimension is always smaller than or equal to the number of features; typically, fortunately for us, it is significantly smaller.

Coffee. Let's thank Alessandro again, and then coffee.