Thank you for being here. Before starting, I also want to thank the organizers for this amazing live conference after so many months. One second, something appeared on my screen that I want to remove before moving on: the notice that this session is being recorded. Okay. So good morning. Today, in this tutorial, we'll look at several methods of manifold characterization. This is closely connected to yesterday's talk by Alessandro Laio, who explained many of the methods we will go into today, but today the focus is on using them in practice and learning how to run the code and analyze data. I'm Aldo Glielmo. I now work at the Bank of Italy, in an applied research team in the informatics department. Of course, opinions are mine and not those of the Bank of Italy, and in any case this work was mostly done when I was a postdoc at SISSA. So let's start. I think the slides are not moving, so give me a second while I reshare my screen. First of all, why is this interesting and why is it useful for you to learn? We all know that in the last 10 to 20 years there has been a real paradigm shift in physics and chemistry, in which the scale of problems we can tackle changed by using data. This was largely driven by an extraordinary increase in information storage technology, which grew exponentially just like Moore's law, and it has completely revolutionized the field into what it is right now. At the same time, what is perhaps less studied is the increase in another quantity: not only has the amount of data increased, but so has the dimension D of the representation that we use to encode the data, that is, molecules and atoms, into numbers. One might argue that the dimension D of this matrix of features has also grown exponentially over the years. From the early times, when one could just use the interatomic distances of a crystal to represent a material, we have moved to descriptors whose size is on the order of 10^4, and we have seen many tutorials in the past days talking about this. This is the standard picture showing how many different representations we now have for atoms and molecules. And this of course represents a challenge, not the only one but a significant one, for machine learning, and the first thing I'd like you to understand is precisely this: the challenge of having high-dimensional data and having to deal with it. If you go to the website, you should find the link to the first tutorial. This is tutorial zero, because it is very introductory with respect to what we're going to do later, but I think it is very interesting: it is really a visualization of why high-dimensional data is so different from low-dimensional data. Let me show you this first tutorial; I hope you can open it, and let me know if you cannot. It should be on the tutorial page of the website, as the first link. This is really my favorite way of understanding high-dimensional data and why it is difficult to treat. We just import a few packages here; let's wait a second until it loads. It is not fundamental to run it on your own right now, you can even run it later.
The interesting thing is really to follow along. We are simply analyzing a d-dimensional Gaussian distribution, so let's imagine our coordinates are sampled from a standard Gaussian centred at zero with standard deviation one. We are all very familiar with the one-dimensional case and we know it by heart. Okay, now the link should be working, so let's look at what happens. In the one-dimensional case it is well known that essentially all of the mass of the distribution lies within about one standard deviation: if we do the integral between minus and plus 0.95 standard deviations we get about 65% of the mass. This is well known and intuitive. Now let's analyze the mass inside the thin shell around one standard deviation, the one drawn in green here, between 0.95 and 1.05 standard deviations: it contains only 4.8% of the mass. Now let's see what happens in two dimensions. In two dimensions, the mass in that thin shell grows to 7.3%, but the picture is still familiar: the typical radius simply became the square root of two, and the thin shell integrates to 7.3%. In three dimensions, which we can still visualize, the thin shell grows to almost 9% of the mass. Of course, at this point it is easy to write a for loop over the dimensionality of the space, and we do this from 1 to 1024 dimensions. At least to me the result is surprising: if we plot it, and at this point you probably understand what I am going to plot, in 1000 dimensions essentially all the mass of the probability distribution is inside the thin shell around one standard deviation. In a high-dimensional space, all points are at the border of the space; they all lie on the surface. This is something people often say in a qualitative way, that in high dimension everything is at the boundary, but here you can really see it in practice by looking at this experiment. And of course this has implications, because if everything sits at the border of this high-dimensional sphere then you also have problems with distances. Distances become essentially identical: if you compute pairwise distances in a 1000-dimensional space you get a delta function, all points are at the same distance from each other, at the border of this hypersphere. This is a huge problem if you want to apply any algorithm you have learned in the past days, say kernel regression, or the techniques that Alessandro Laio described yesterday, because there is no longer a meaningful concept of proximity: everything is at the same distance from everything else.
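To make the thin-shell experiment concrete, here is a minimal sketch of the computation behind that plot (an illustration, not the notebook itself): the squared radius of a standard d-dimensional Gaussian follows a chi-squared distribution with d degrees of freedom, so the mass in a thin shell around one "standard deviation" radius can be computed exactly with scipy.

```python
import numpy as np
from scipy.stats import chi2

def shell_mass(d, lo=0.95, hi=1.05):
    """Mass of a standard d-dimensional Gaussian inside the thin radial shell
    between lo*sqrt(d) and hi*sqrt(d), i.e. +-5% around one 'standard deviation'.
    The squared radius is chi-squared distributed with d degrees of freedom."""
    return chi2.cdf(d * hi**2, df=d) - chi2.cdf(d * lo**2, df=d)

for d in [1, 2, 3, 10, 100, 1024]:
    print(f"d = {d:4d}   mass in the thin shell = {shell_mass(d):.3f}")
# roughly 0.05 for d=1, 0.07 for d=2, 0.09 for d=3, and close to 1 for d=1024:
# in high dimension essentially all points sit on the shell.
```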
So we just understood that we have very high-dimensional spaces in chemistry and physics, and that these high-dimensional spaces are challenging, to the point that you could not do any machine learning with them directly. Why is it, then, that we can apply these algorithms and they work? The reason was already mentioned yesterday: data is never truly high dimensional. Since there are strong constraints in the physics and chemistry of atoms and molecules, you will never have a data set that looks like a 1000-dimensional Gaussian: if you take a box and start to randomly place atoms in it, you will almost surely generate infinite-energy configurations that never occur in reality. So in fact data lie on a much lower-dimensional manifold, which is the physically motivated space for atoms and molecules. And this is really what we're going to do today: investigate the properties of this low-dimensional manifold of dimension small d, embedded in these very high-dimensional spaces of dimension capital D, which can be 10,000 nowadays. You will learn to estimate in practice the intrinsic dimension d of the manifold, which we will see is orders of magnitude smaller. We will learn to estimate the density of points directly on the manifold. We will then learn to take this density and estimate its peaks in a statistically significant manner. And finally, we will learn to do all of this in practice using a software package that we are developing, which is open source and is called DADApy. What we will not do, instead, is learn to find the coordinates of this low-dimensional manifold, which is the task of dimensionality-reduction techniques like PCA, Isomap or kernel PCA. This has some advantages: because we are not interested in the coordinates, even if our manifold is complicated and not topologically equivalent to a flat surface, as Alessandro insisted in explaining yesterday, you can still apply these techniques and get accurate results. If instead you tried to find a one-dimensional space corresponding to the manifold I'm showing here, you would have a problem, because there is simply no way to make it one dimensional. Okay. So, these are the people involved in the development of the ideas and the software behind DADApy. In particular, Iuri and Diego are the experts on intrinsic dimension estimation, so if you see them around, many of them are here today and available also for questions about the package. Matteo is the expert on density estimation, Alex is really the expert on density peaks estimation, and Romina is working on something in DADApy that I am not going to talk about, called the information imbalance. There are other contributors that I show here, and of course I also want to thank the funding for my postdoc, the MaX collaboration. So, this is the outline for today; the tutorial is divided into three parts. The first part is on intrinsic dimension estimation. Then we're going to talk about density and density peaks, and finally, at the very end, for people who are interested, we are going to help you install the package and run it on your laptop on your own data. The first part will be a brief recap of the theory behind intrinsic dimension estimation followed by a practical session; the second part will also have a brief theoretical overview and a practical session; and the third part will just help you install and use the package. So now I pass the word to Iuri, who is going to overview the theory behind intrinsic dimension estimation. Good morning everybody, so, intrinsic dimension. As Alessandro told you yesterday, the intrinsic dimension basically represents the minimal number of coordinates that are necessary to describe the points of the data set without information loss.
If you attempt a dimensionality reduction, for instance, the intrinsic dimension is a strong lower bound: if you project onto a dimension lower than the intrinsic dimension, you will lose information. As Alessandro was underlining yesterday, the problem is that the intrinsic dimension is scale dependent. Here is a very simple example of this behavior. If you focus on a small region, on a small scale, you see that the intrinsic dimension of this curve is three. Differently, if you enlarge the scale at which you look at the data set, you can describe it by using a single coordinate, so the intrinsic dimension is one. Again, if you look at the picture as a whole, you once again find an intrinsic dimension of three. In our package, the method we use to estimate the intrinsic dimension is the so-called TWO-NN estimator, developed by Alessandro and coworkers in recent years. It is very simple and relies on a really small amount of information from the data: for each point, you compute the distances to its first and second neighbors. One can show that the ratio of these quantities, which we call mu, is distributed according to a Pareto distribution whose exponent is basically the intrinsic dimension. Once you have found all the empirical mu_i, from their distribution you can infer the intrinsic dimension. What about the scale? In our package DADApy we provide two different methods to probe the scaling of the intrinsic dimension. One is a simple decimation of the data set: the idea is that you use the standard TWO-NN algorithm, but on a subsample of your data. Of course, the more aggressive the decimation, the farther apart the points are and the larger the scale at which you look, and in this way you are able to profile the intrinsic dimension as a function of the scale. The other option is provided by the Gride algorithm developed by Diego and coworkers, and the idea is that, instead of decimating the data set, you look at farther neighbors: instead of the ratio of the second- to the first-neighbor distance, you use higher-order neighbors, roughly doubling the neighbor rank each time. In this way you get another profile of the ID as a function of the scale, and you will see in the tutorial how the two methods are slightly different but still provide similar information. And what if, for instance, your features lie on a lattice and your distances are discrete? You can see this this afternoon at my poster, where I will show how to deal with this kind of data set. So now, back to Aldo for the tutorial. Thank you. So, hopefully you now have the link to the second tutorial; is it online, Claudia? Okay, we will do two things in this tutorial. In the first part, we use the algorithms on very simple data sets, just to check that we get what we know we should get. Then we'll do something more complicated, on a research question from a few years ago: we will reproduce the main result of this nice paper. So, let's start by importing the standard packages and installing DADApy on Colab. Let me know if you have questions; of course we're always happy to answer. And if there are questions in the chat, maybe Claudia and the other organizers can let me know.
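As a side note before we use the package, here is a minimal from-scratch sketch of the TWO-NN idea Iuri just described (an illustration, not the package's implementation): the ratio mu = r2/r1 of second- to first-neighbor distances follows a Pareto distribution f(mu) = d * mu^(-(d+1)), so the maximum-likelihood estimate of the intrinsic dimension is simply N divided by the sum of log(mu_i).

```python
import numpy as np
from scipy.spatial import cKDTree

def twonn_id(X):
    """Minimal TWO-NN estimator: for each point take the distances to its first
    and second neighbours; mu = r2/r1 is Pareto distributed with exponent d,
    so the maximum-likelihood estimate is N / sum(log mu)."""
    dists, _ = cKDTree(X).query(X, k=3)    # column 0 is the point itself
    mu = dists[:, 2] / dists[:, 1]
    return len(X) / np.sum(np.log(mu))

# sanity check: a 2D Gaussian embedded in 5 dimensions should give an ID close to 2
rng = np.random.default_rng(0)
X = np.zeros((5000, 5))
X[:, :2] = rng.normal(size=(5000, 2))
print(twonn_id(X))
```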
The main class to know in DADApy is the Data class, as we'll see, so we start by importing it, and this here is a helper function. As I said, we start with a very basic case: a manifold without any noise, the super-standard Swiss roll. You can see that the intrinsic dimension of this manifold is two, and since there is no noise whatsoever, the manifold is flat even at short scales. If we want to compute the intrinsic dimension of this manifold, we initialize a Data object with the coordinates X of the data; we could even initialize it only with distances, and this is something we'll see later on. Then we use the compute ID 2NN method, which takes an optional decimation parameter: the decimation is the fraction of data used for computing the ID, so decimation equal to one means we use all the data and therefore look at a short scale. If we run this, we see that the scale at which the ID is computed is 0.38 and the ID is two. If we go and look at the graph, this is really the scale shown here on the right: at this short scale the manifold is flat and the ID is computed to be two. If we change the decimation to 0.1, the scale increases, it is now 1.2, and the intrinsic dimension is still around 1.9, around two. The intrinsic dimension, the scale and so on are also attributes of the object we have initialized. Now let's go to a slightly more complicated and more interesting case: a manifold with a bit of noise. It is the same Swiss roll manifold, but this time there is a three-dimensional noise; you can see this from the fact that in the right-hand graph the points are scattered all around the three-dimensional space. Once again we initialize our object with this new data and we compute the 2NN ID with decimation one, which again means short scale. What ID do we expect at a short scale in this case? I hope you have clear in mind that at a short scale the manifold is three dimensional. By running this code... well, probably we didn't re-initialize the object with the new data, that's what happened, so we have to run this cell, otherwise the data was still noiseless. Now, with the new data set, the ID at a short scale is three. If instead we decimate, and you have to decimate fairly aggressively, we retrieve the two-dimensional result: the ID goes towards two. As Iuri mentioned, it is interesting to look at a decimation analysis over different scales to understand whether there is a plateau, a constant ID across scales. This is done in DADApy using the return ID scaling 2NN command, where this parameter is the minimum number of points used, and you see that the ID goes from three at a short scale to two, with a plateau, as you expect. The same thing can be done in DADApy using the Gride algorithm, and you find more or less the same behavior, with a similar graph. You can also go to this link and look at examples of ID estimation on other data sets.
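The steps just described might look roughly like this with DADApy. This is a hedged sketch: the method and parameter names (compute_id_2NN, decimation, return_id_scaling_2NN, return_id_scaling_gride) and the intrinsic_dim attribute follow my reading of the DADApy documentation and may differ slightly between versions.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from dadapy import Data

# a noiseless Swiss roll: a 2D manifold embedded in 3D
X, _ = make_swiss_roll(n_samples=5000, noise=0.0, random_state=0)
data = Data(X)

data.compute_id_2NN(decimation=1)      # all points -> shortest scale
print(data.intrinsic_dim)              # ~2 for the noiseless roll

data.compute_id_2NN(decimation=0.1)    # keep 10% of the points -> larger scale
print(data.intrinsic_dim)

# scale analysis: look for a plateau of the ID as a function of the scale
# (the returned triples are assumed to be ids, errors, scales; check the docs)
ids, errs, scales = data.return_id_scaling_2NN()
ids_g, errs_g, scales_g = data.return_id_scaling_gride()
```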
It is also interesting to compare 2NN and Gride on another data set, a one-dimensional spiral embedded in three dimensions. You see that Gride drops much more quickly to the true ID than the 2NN decimation approach, essentially because it does not have to subsample the data set. Now, I hope everything is clear; the usage of DADApy is, I think, very intuitive and very simple, so we can now apply it to a more interesting problem: computing the intrinsic dimension of the hidden layers of a neural network. We have a ResNet-152 architecture, a convolutional deep network for image recognition, and we want to compute the ID in the input layer, where each input node represents one coordinate, so if we have 1000 pixels this is a 1000-dimensional space. Each hidden layer then provides a different representation of our data: we take the distances between images inside these hidden representations, and we want to compute the intrinsic dimension of these representations. Here we download the data set, which consists of compressed files, one for each layer we are interested in. Each compressed file contains two arrays: one is an array of distances and one is an array of indices. The distances are stored only up to the 30th nearest neighbor, so it is a very sparse format, but since we only need the nearest neighbors we can run this fairly large-scale computation, 90,000 images, in a matter of seconds. We initialize the Data objects I showed before using only nearest-neighbor distances, up to neighbor 30, and we do this in a list: a list of objects, each containing the distances of ImageNet for a specific layer, from layer zero, which is the input, to the output, plus some layers in the middle. Then we compute in a loop the intrinsic dimension for each layer of this neural network, and we assign the results to this array called ID. Now, if you have followed me, you presumably have a picture in mind of how the intrinsic dimension should change across the hidden layers. At least when I first thought about it, I expected that, since the network is learning more and more structure in the data, the intrinsic dimension should simply decrease from the input layer to the output layer. The unintuitive result that these authors found with this algorithm is that, if we plot the intrinsic dimension as a function of the layer, we get this characteristic and by now well-known shape, in which the intrinsic dimension first goes up and then comes down again. This was explained and understood by the fact that there are correlations in the input that lower the intrinsic dimension: things like gradients of luminosity or color, very basic attributes of images that are not related to the classification into cats, monkeys and so on. These correlations need to be destroyed, and this is what happens in the first part of the network; more interesting correlations are constructed in the second part, and so the ID goes back down to a low number, actually an even much lower number, in the second part of the network. Okay, so this was our first part, on intrinsic dimension estimation; I hope you could run the Colab.
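As a recap of what the loop above does, here is a hedged sketch. The file names and layer labels are hypothetical, and I am assuming that the Data class accepts precomputed neighbour distances as a (distances, indices) pair, as suggested by the DADApy documentation.

```python
import numpy as np
from dadapy import Data

layer_labels = [0, 19, 51, 82, 119, 142, 148, 152]   # hypothetical layer labels
ids = []
for layer in layer_labels:
    # each (hypothetical) file stores, for every image, the distances to and
    # indices of its 30 nearest neighbours in that layer's representation
    npz = np.load(f"layer_{layer}_distances.npz")
    d = Data(distances=(npz["distances"], npz["indices"]), maxk=30)
    d.compute_id_2NN()
    ids.append(d.intrinsic_dim)
print(ids)   # first rises, then falls across the layers
```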
Let me know if you have problems or questions at this point, because then we will proceed to the second part, which is on density estimation.

Yes. Just so I understand correctly what's happening right now: we are still at the level of trying to figure out what the underlying dimensionality of the data might be, so we are not doing anything like feature selection or dimensionality reduction yet?

Exactly, and it is very important that you raise this. We are never going to do an explicit dimensionality reduction; we will never find the intrinsic coordinates of the manifold. We just do computations on the manifold without the need to build a low-dimensional map. This is powerful because even if the manifold is complex, like the one in this figure, which you could not make one dimensional even if you wanted to, you can still compute the intrinsic dimension in a meaningful way, and you can compute the density and the density peaks on the manifold. So this is really a choice by design: there are no dimensionality-reduction techniques in this package, only techniques that compute quantities without any explicit dimensionality reduction.

Okay, and then I have one more question. In the examples you gave, like the spiral, it was clear that the intrinsic dimensionality was one. But for a neural network, one can only try to argue about what might be happening, right? So how do you reason that this is exactly what is happening, rather than just a guess?

It's a good point. In fact, what we did here was done without any scale analysis. The way you can be confident that the intrinsic dimension is well estimated is by looking at this kind of graph and finding a plateau: that is how you validate your estimate. If you find a plateau, you can be reasonably certain that the estimate is meaningful. For the neural-network analysis we didn't have time here, but they did it in the paper: they did the scale analysis for each layer and found that the intrinsic dimension remains low, of that order, independently of the scale at which you look. So that was a good question; we haven't shown it, but there is a link to the paper and it is definitely worth looking at.

Thanks. Okay, and apologies if this is too simple a question, because I'm not an expert in this field. If you don't actually do dimensionality reduction, what is the benefit of finding the intrinsic dimension, for modeling or for treating high-dimensional data?

That's also a good question, and I think there are two answers. One is that the intrinsic dimension can be interesting on its own, as a piece of information about your data: you may not know much about your data, but knowing whether the intrinsic dimension is 1000 or 5 really changes your understanding of the data set and of the correlations in it. It can also be a preliminary step before doing a dimensionality reduction: if the intrinsic dimension is very high, a PCA or kernel PCA projection down to a few dimensions is just not going to be meaningful.
Then the other reason, and I think this is a perfect question for this moment, is that estimating the intrinsic dimension is precisely what allows us to estimate the density directly on the manifold, and from that the density peaks, which are interesting in their own right. I'll give you a little spoiler with this picture, but Matteo will explain this in more detail in a moment and you'll understand better. Thanks.

So, can we treat these techniques as a way to improve the interpretability of deep learning methods?

Yes, for sure. In fact, the example we showed is one in which understanding the hidden representations of neural networks is the challenge. People have tried doing this with low-dimensional projections, with some success, but of course you cannot plot in two dimensions something that is not two dimensional, so these techniques can definitely be useful to interpret deep neural networks. I gave you one example and I will give you another in a few minutes, in which we use other techniques, again on hidden representations, to understand what is happening. So this is definitely a possibility.

I have one more follow-up: is there an assumption that the intrinsic dimensionality remains the same across the data? Because, just to think about it, you could have a bunch of points that spread out flat, like on a piece of paper, and then, moving along the x axis, they spread out and then flatten back again, so the dimensionality goes up and then down.

Yes, I think this is also a good question: how can we be sure that the intrinsic dimension is constant everywhere on the manifold? We cannot be sure; this is always going to be an approximation, apart from rather trivial cases. There are algorithms and techniques developed to partition your manifold into pieces of identical intrinsic dimension, a sort of clustering based on intrinsic dimension: look it up, it is called Hidalgo, with an H. However, for the purposes of the techniques we will use later on, like density estimation, it is typically sufficient to take this constant-ID assumption, which of course remains an approximation: essentially we estimate the mean intrinsic dimension across all the local variations and we use that. There is definitely room for improvement there.

Okay. So, density estimation. Suppose we are in a feature space of embedding dimension capital D, and we want to understand how the data in this space are distributed. As was mentioned yesterday, one possibility is to use the k-nearest-neighbor estimator. The algorithm is very simple: you fix a hyperparameter k, the number of neighbors we consider for every point. For every point we look at the distance r_k of its k-th neighbor, which defines a hyperspherical region in embedding space of volume V_k. By assuming that the density is constant over the region delimited by this radius r_k, we can simply estimate the density by dividing the number of points k by the volume they occupy. Notice already that this is an adaptive algorithm of sorts, because by fixing k the scale selected by the model is different for every point.
In fact, in a high-density region the points are denser, so the k neighbors fall within a smaller radius, within a smaller region, while in a low-density region the selected scale will be larger. Notice also that this is a non-parametric method: it only looks at the data locally, so it can handle complex topologies well, which is a problem Aldo mentioned before and Alessandro also discussed. And, again, we do not perform an explicit dimensionality reduction here: to connect with the questions from the audience, we basically only look at distances. Now to the point regarding the intrinsic manifold. We know that the data are distributed on the intrinsic manifold, of lower dimensionality than the embedding dimension; as we see in the picture, here we have a Swiss roll embedded in three dimensions, but the intrinsic dimension is evidently two. So it is important to divide the number k by the volume computed on the intrinsic manifold: if we divided by the volume in the embedding space, we would dilute the information of these k points over irrelevant, empty directions. This is why it is very important to have an accurate estimate of the intrinsic dimension, small d. The other thing we have to fix is of course the hyperparameter k, and how do we choose it? We would like to include as many points as possible, because we want to reduce the variance of our estimates; but if we select a k which is too big, we introduce a bias at some points, because we would be averaging over regions where the density is not constant. So we should select a k that picks out a scale over which the sample is locally uniform, but this scale varies across the data set. For example, here is a low-density region, slowly varying, so the density looks uniform over a large radius containing a fair number of points; if we move towards the density peak, the density varies sharply, so it is locally uniform only within a small radius, but since we are at a density peak, we have many points there. In the last case we are in a similar density condition as in the green case, but the density varies sharply, so the selected radius is small and we select fewer points. This is why we developed the algorithm that Alessandro explained yesterday, which computes adaptively, for every point, the number of neighbors over which the density can be considered constant. With this improvement, we obtain an even more adaptive version of the k-nearest-neighbor estimator, which uses a value of k computed specifically for every point. We stress that the adaptivity of the algorithm helps to mitigate the bias-variance trade-off that we mentioned yesterday and today, and thus to handle higher dimensionality better. One possible further improvement is to allow for linear corrections to the constant density, or constant free energy, in the sense that the k-NN estimator can already be formulated by deriving it from a likelihood model. We see the formula at the bottom; consider it without the blue parts. If we maximize it over F_i, where F_i is minus the logarithm of the density, that maximization already returns the k-NN density.
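Before moving to the corrections, here is a from-scratch sketch of the plain kNN estimator just described (an illustration, not the package's code): rho_i = k / (N * V_d(r_ik)), where the hypersphere volume V_d(r) = pi^(d/2) r^d / Gamma(d/2 + 1) is computed with the intrinsic dimension d rather than the embedding dimension.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gammaln

def knn_density(X, k, intrinsic_dim):
    """Plain kNN density: rho_i = k / (N * V_d(r_ik)), with the hypersphere
    volume computed in the intrinsic dimension rather than the embedding one."""
    N = len(X)
    r_k = cKDTree(X).query(X, k=k + 1)[0][:, -1]   # distance to the k-th neighbour
    log_volume = (intrinsic_dim / 2) * np.log(np.pi) \
                 + intrinsic_dim * np.log(r_k) - gammaln(intrinsic_dim / 2 + 1)
    return np.exp(np.log(k) - np.log(N) - log_volume)
```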
We can improve on this by including linear-order corrections to the free energy, the blue terms, and this gives PAk, the algorithm that Alessandro validated yesterday and that we will also see at work in the tutorial later. As you see, the maximization is carried out independently for every point i, and by doing so we treat the estimate at every point as independent. But each estimate uses k_i points, so neighboring estimates are actually correlated, and not accounting for these correlations gives estimates that are statistically unbiased, so correct on average, but rather noisy and oscillating. One possible way to treat this problem is to correctly account for the correlations by considering the density gradient. We can estimate the density gradient by realizing, which is what we did, that the sample mean shift, the average of all the vectors connecting the central point of the neighborhood with the other neighbors, the colored arrows, captures the gradient direction; and in fact we are also able to compute the proportionality factor exactly. So we can compute the free-energy gradient and the density gradient correctly, and we can use this information to obtain a smooth free-energy, or log-density, profile; you can see my poster for more details. So, after we compute the density, we can estimate the density peaks, and I'll be brief on this because then we'll go straight to the tutorial. The estimation of the density peaks of the distribution has these steps. First of all, compute the density: we can use any algorithm, preferably PAk because it is better, but we cannot always use it, for example when the data does not have many stored neighbors or there are other constraints; the density-peak estimation will work anyway. So we assume we have some estimate of the log density, the one I'm showing here. Then we find the maxima of the density. In this case you can see in the picture below that the maxima are indicated by the red dots, and you see that there are maxima that are really part of the same density peak; this can happen because the estimation of the density is prone to error, so you will find maxima that are not statistically significant, and we need a technique to get rid of them. What we do is find the saddle point between every pair of peaks, and we keep two peaks separate only if a specific condition is met: the difference in log density between a peak and any of its saddle points should be larger than Z times the sum of the errors on the peak and on the saddle point; otherwise we merge them. In the picture you see two peaks, and perhaps the middle peak is too close to the central saddle point, so you might decide to merge these two peaks into a single one. You do this operation and then you find your partition of the data set into density peaks. We'll see how this works in practice and we'll also get a bit more into the details.
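To make the merging rule concrete, this is what the significance test amounts to (a sketch in symbols, not the package's code): peak a survives with respect to a neighbouring peak b only if log(rho_a) - log(rho_saddle(a,b)) > Z * (eps_a + eps_saddle); otherwise the two peaks are merged across that saddle point.

```python
def peaks_are_distinct(log_den_peak, err_peak, log_den_saddle, err_saddle, Z=1.0):
    """Keep two peaks separate only if the drop in log-density from the peak to
    their saddle point exceeds Z times the combined error on the two estimates."""
    return (log_den_peak - log_den_saddle) > Z * (err_peak + err_saddle)
```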
Another thing I'd like to briefly mention is a very interesting and powerful visualization you can build once you have the peaks and the saddle points between them. Let's assume we have this data set. We start from the two peaks that are separated by the highest saddle point, the saddle point with the highest density; in this case they are peaks six and two. We agglomerate them, we combine them into a single element, and then we recompute the saddle points between this agglomerated element and all other density peaks as the minimum, with respect to six and two in this case, of the original saddle points. We do this iteratively, until there is only one peak left. This procedure generates a dendrogram visualization, which is really interesting: in this case it is easy because we can even visualize the data directly, but this visualization can be built for arbitrarily high-dimensional spaces, and it really shows the topography of the density peaks. Six and two are close to each other, then they merge with peak three, then with peak four; on the other hand five, seven and eight are close together, and so on. If you learn to interpret these pictures, I think they are very powerful on any data set with interesting structure. So now we will look at this in practice in the other tutorial; let me know if you can open it, and also let me know if you have questions on this short theoretical overview. Okay, once again we will do two things here: first understand how the algorithm works on a very simple data set, and then apply it to analyze the representations of neural networks. We import the packages and install DADApy, once again on Colab because it does not come pre-installed there. Then we import the main Data class. The toy illustration here is more or less the simplest thing you can imagine: a one-dimensional density with two peaks, a double-hill density. Let's wait for this to appear. Okay. The first thing we want to test is density estimation in DADApy: we sample a data set from this simple density, we assign the sample to the Data object with the usual command, and we compute the distances; maxk is the variable that sets the maximum number of nearest neighbors stored in the distance matrix. Then, as a very short, almost trivial exercise, but why not do it, we check that the intrinsic dimension is one. Then let's move on to the estimators. We have three density estimators: one is the standard, super-simple kNN, and then we have more advanced and adaptive versions. Let's start with kNN, which takes the free parameter k, and we start with k equal to 10.
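A hedged sketch of this setup, with a two-Gaussian mixture standing in for the notebook's double-hill density and with the DADApy method names (compute_distances, compute_id_2NN, compute_density_kNN, compute_density_PAk) assumed from the documentation:

```python
import numpy as np
from dadapy import Data

# stand-in for the notebook's 1D double-hill density: a mixture of two Gaussians
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 0.7, 5000),
                    rng.normal(+2.0, 0.7, 5000)]).reshape(-1, 1)

data = Data(X)
data.compute_distances(maxk=100)   # store distances up to the 100th neighbour
data.compute_id_2NN()              # sanity check: the ID should be close to 1

data.compute_density_kNN(k=10)     # plain kNN estimate of the log-density
data.compute_density_PAk()         # adaptive PAk estimate
# the estimated log-densities should then be available as attributes, e.g. data.log_den
```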
There is something interesting we can learn by changing k. With k equal to 10, a plot of estimated versus true density shows that on average the estimate is good; this can also be seen in the other graph, where you have the true density in black and the predicted density in blue. So on average kNN gives a good estimate, but the variance is huge, and this is of course due to the fact that k is small: the variance decreases as one over k, so we can reduce it easily by increasing k. Let's see what happens if we increase it to 100. At k equal to 100 the variance diminishes considerably, but now you have a problem of bias: the estimates are no longer correct even on average, at least in this case, and you get this huge bias in the regions where the density changes, where the estimate is just completely wrong. This is really why it is interesting to develop new and more adaptive algorithms, and this is the job of PAk. Let's use PAk directly here: it automatically takes care of this bias. Let's run it again; anyway, PAk automatically gives you some kind of trade-off between bias and variance. Let me reinitialize this, there is still something to be fixed because it looks a bit strange, it should be a bit better than this, but anyway. You can also inspect the adaptive k-star, the adaptive k selected for each point. I hope this was relatively clear; it was rather obvious in a way, but now let's look at how we can estimate density peaks starting from an estimate of the density. We start with the same data set; on purpose I subsample it to only 200 points, and on purpose I use a density estimator which is not great, because otherwise the problem would be trivial. The first thing we do is estimate the local maxima of the density: with this estimate of the density we find three maxima, while of course we know that the real maxima are only two, and we proceed to a preliminary assignment of points to density peaks, which will be wrong. This is what happens inside the algorithm once you run it: there is this initial assignment, maybe not very visible, but this is one cluster, this is another one, and this one, and it is a wrong assignment. That is where the merging process comes in. As I told you before, the condition for a peak to be statistically significant is that this difference is larger than Z times the sum of the errors; if it is smaller, the peak is not significant. We can look at these quantities on this graph, which shows the peaks, the saddle points and the errors on these quantities, and you can see that this peak and this saddle point really lie within each other's error bars, so presumably they should be merged. In the merging process we then run the density-peaks clustering with a value of Z; we can use Z equal to one, for instance, and look at the result: of course, the two peaks have been merged. This was easy because we knew the distribution, we knew that this peak was not statistically significant. But there is a way of understanding this in general. First of all, a value of Z should be taken between one and five, never at zero. Suppose we had taken Z equal to zero: then we could have looked at the dendrogram, even without further information on the density, and we would have seen that this peak is very close in height to its saddle point, which is really an indication that we should increase Z.
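The peak estimation and merging just described might look like this in DADApy, continuing from the sketch above (again a hedged sketch; compute_clustering_ADP and the N_clusters attribute are taken from my reading of the documentation and may differ between versions):

```python
# density peaks on the same double-hill data, deliberately made harder:
# few points and a rough density estimator
data_small = Data(X[::50])                 # subsample to about 200 points
data_small.compute_distances(maxk=100)
data_small.compute_id_2NN()
data_small.compute_density_kNN(k=10)

# Z sets how statistically significant a peak must be to survive the merging;
# values between roughly 1 and 5 are the usual range, never 0
data_small.compute_clustering_ADP(Z=1.0)
print(data_small.N_clusters)               # should be 2 after the merging
```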
It is also interesting to see the correspondence between the density and this dendrogram analysis: of course, if we increase Z the two peaks are merged and the dendrogram looks different, but it still maps one-to-one to the density. So I hope this was clear. Let's see what happens in deep neural networks. The data set is the same as before, so let's just re-download it. It is an ImageNet subset with 300 classes and, for each class, 300 points, for a total of 90,000 points. This time we also download labels that tell us whether a point belongs to the macro-class of animals or of artifacts, and these are the specific classes. We once again load everything into a list of Data objects, and we also import the relevant class here. Okay. Now let's compute the density using a simple kNN with k equal to 30: since we only have 30 stored neighbors, we cannot really do the adaptive search over k, and that is a good reason to use the maximum k available, which is already not very large. Then we compute the density peaks using the advanced density peaks class, which is the algorithm I just described. You see, this takes only seconds, which is great, even though we have 90,000 points and we are talking about ten layers, so it is not a trivial computation, but we have fast code written in Cython. Now, these are the number of peaks estimated for each layer, and this is already interesting and connects well to what we saw in the previous tutorial: the number of peaks first decreases to one and then increases again, similar to what happens in the ID estimation. There is some structure in the initial layers, presumably related to colors, gradients and luminosity, that gives rise to some peaks; this gets destroyed in the middle of the network, and interesting peaks start to arise afterwards. Now let's look at these peaks and what they represent. We do this with a simple representation: here we plot it in two dimensions, but it is not really a two-dimensional projection, because the distance in this plot reflects the saddle points, so density peaks that are close together in this plot are close together in the dendrogram sense I showed you before. At layer 148 we see many peaks arise, and what is interesting is that they are already divided between animals and non-animals. Let's go a bit earlier in the network, because I want to show you that the distinction between animals and artifacts is actually learned by the network earlier: these peaks predominantly contain either animals or artifacts already at layer 142. Then, of course, the distinction becomes finer and finer as we move forward in the network, and in the last layer we have essentially one peak for each class: one peak for dogs, one peak for monkeys, one peak for tables, or whatever. The last thing I'd like to show you here, which I think is very cool, is an analysis of the dendrogram of the last layer, because there is a rich structure there in terms of peaks and saddle points that will give us some information on what the network has learned.
So we take just the last layer. What I am doing here is removing, just for visualization purposes, all the artifacts, because I want to focus on animals. You get this dendrogram, which is still very large because there are many peaks, but it is very interpretable. Unfortunately you cannot easily zoom in on Colab, but if you run this locally you can zoom into pairs of peaks. For instance, peaks 88 and 89, which are very close to each other in this dendrogram, are two species of butterfly; peaks 3 and 25, also very close, are two owls; and so on: you have hairy dogs, and underwater animals that sit in the same mountain as pigs, in the same agglomeration of peaks. This was the result of the paper, and this is the picture you can get out of it if you analyze it more carefully: first of all mammals and non-mammals; within mammals, all the dogs inside one mountain of peaks, then monkeys, other mammals, sea animals and so on; then insects and birds, and close to the birds are the butterflies, which of course are not birds, they are just nearby. So it is very interesting to perform this kind of analysis, and it really lets us see in practice that the network has learned this hierarchical, phylogenetic-like structure automatically, by gradient descent. Okay. Questions on this tutorial? Then we can proceed to a bit of a summary, and then we'll try to download and use the package.

Yes. If I understand correctly, what has happened so far is that we have used k-nearest neighbors to do something like density estimation, and k-nearest neighbors of course has k as a parameter, which we saw; but unlike k-means, where you need to specify the cluster centers, you don't need to do that here, so somehow it figures out how many cluster centers there are. Is that true?

Exactly, that is true. And one other important aspect is the fact that we have access to the saddle points. This is something that does not come with k-means, and it is what gives us the possibility of doing this kind of analysis, because we have a lot of information: you know exactly what the saddle point between any pair of peaks is, and that is really informative.

Thank you, very interesting. Again related to this question about the relation between estimating the density peaks and clustering: I guess the difference is that in clustering every point gets assigned to a cluster, and here not every point gets assigned to a density peak, or does it?

No, every point is assigned in this case too; density peaks is just a specific type of clustering based on density, I think it is essentially a synonym for density-based clustering. It is similar to HDBSCAN and other algorithms that work like that, but it is better in the sense that it has fewer parameters.

But then I have a follow-up. Maybe this is my head being too two-dimensional, but imagine this like a topographic map of, say, Europe: the Alps have lots of peaks, and then there are regions that are very flat, like Germany. Okay.
So it seems to me that this sort of method would be biased towards the places where the peaks are, which is also where most of the data is. But what about the places with a lot of sparse data, far away from any peak?

Yeah, this is a very interesting question. Honestly, I would not have thought about this problem on my own, but it came up as a real problem that had to be solved when analyzing molecular dynamics trajectories. There you have precisely this issue: vast regions of space with very low density that are nevertheless metastable states, states stabilized by entropy. This technique does not work as well there, it does not do precisely what you want it to do. But in fact another algorithm was developed precisely for that case; it is also in DADApy, we don't discuss it today, and it is called k-peaks clustering. The idea there is to use the optimal k as a proxy, as a surrogate density, for the density-peaks procedure: the optimal k can be large also in regions where the density is low, and you then find those regions as peaks even though the density is low. This is how you can solve that problem in the specific cases where it needs to be solved, like molecular dynamics. I will give you a reference later on and we can also discuss it, so thanks.

So right now these techniques give the probability peaks and the saddle points, right? Can we systematically improve them to get, say, the width of the peaks and so on, so that we can build generative models?

So the question is whether we can systematically extract more information about the probability peaks, for instance to build generative models. This was actually also asked yesterday, and I thought it was an interesting question; it is definitely possible and it should be done, honestly there is no excuse not to do it. It is not implemented in the package, and I think it is still an open line of work: we should first think about what the best way is of, for instance, sampling from the density that we have learned; at the moment it is not possible. You can think of many heuristic ways of doing that, but there should be a serious analysis of what the best way is of sampling, heuristically or less heuristically, because there is a lot of information there, so in principle it is possible to build a generative model; there are just many ways to do it, and we should understand which one is best.

Hi, thank you for an amazing tutorial. I have two questions, and the second depends on the answer to the first. If I have understood correctly, you are using the concept of intrinsic dimension within this kNN estimator, so that you can divide by the volume of a hypersphere whose dimension is the intrinsic dimension. Is that true?

True, yes.

Okay, so the second question is: if I try to compare this with algorithms like density-based spatial clustering with noise, DBSCAN, or hierarchical DBSCAN, is the only problem that they do not have this concept of intrinsic dimension, so that they create a lot of problems when you go to higher dimensions? Am I following correctly?

That is not the only difference.
That is one difference. It is true that they do not use that information, but there are other differences too: they estimate the density somewhat differently, they estimate the peaks differently, and they do not estimate saddle points. So it is not the only difference.

Thank you. So, we have seen in the linear case that density correlates with the ability of a model to learn, so regions with high density are easier to predict. Do you also find this correlation for these models, for example for the classification of animals?

Honestly, I would not know how to validate that hypothesis; it is possible. Since we do not know the true density here, what I think could be done is to look at higher-dimensional but still known distributions and see what happens there. I don't know what they found, because I think Alex Rodriguez worked on that and knows what happens in higher dimensions; we will have to ask him. I think it can be possible; we will think more about it. Thanks.

Okay, so let's wrap up and go to the third part. In a nutshell, we have learned how different data are in high dimensions, with the hypersphere example and tutorial. We learned that in practice this is not really a problem, because due to physical constraints data have strong correlations and live on lower-dimensional manifolds, and we learned how to analyze these low-dimensional manifolds automatically using a series of algorithms: TWO-NN, Gride, kNN, PAk and advanced density peaks. We did everything in DADApy, which stands for Distance-based Analysis of DAta-manifolds in Python. One of the key features of the package is that it is distance based, so you can make it work even if you do not have coordinates: if you only have distances, or in fact only nearest-neighbor distances as in the neural-network case, you can still get all the information you want and make the package work. This is the case, for example, for genomics data sets with DNA distances; there are plenty of data sets in which you do not have an explicit space, only distances. The goal was really to provide easy access to these fundamental methods of manifold characterization. It is fast, because the bottlenecks are written in Cython; it is unit tested; the code style is guaranteed by linting tools; there is extensive automatic documentation with Jupyter tutorials; and of course it is on GitHub. With this, I want to thank once again the people who worked on this package, point you to the posters tonight, and thank my funding. These are some references, which I will perhaps share together with the slides later on. And let's go to part three. With this I am finished, and we can start the hands-on experimentation, which can run into the coffee break for those who want it. Thank you.

Thank you very much for the presentation, and for finishing well within the hour, so there are about 20 minutes left to try to install the package. By the way, Iuri, Romina, Diego, Matteo and I can help you solve problems of installation and analysis. We also prepared an optional exercise: if you do not have a data set that you would like to analyze immediately, you can analyze a sample data set that we provide online; there is a link to download a folder.
In that folder there is a data set, and there is also a Python Jupyter notebook for the analysis of that data set, with the dendrogram analysis and everything. So maybe you can try to access that if you do not already have a data set; actually, I think you should download it anyway, because it contains a sample Jupyter notebook. Let me open it, I think it is this one. You will download this notebook, which essentially contains the main commands: the initialization of the Data object, the distance computation, the intrinsic dimension, PAk, advanced density peaks, and the dendrogram, for instance. By the way, even if you have a periodic space you can use the package, by specifying a period, or even different periods along different coordinates. They are asking for the link: the link is on the web page, like the other ones, under the tutorials section, as the last link I think; it may take a couple of minutes to reload. Yes, this is the one. It takes a little while to download, because we download the whole folder, which also contains the MD trajectory file that you can visualize; this is the example that we give in our arXiv paper, the example data set analyzed in figure five, so it is a way to reproduce that figure. Anyway, with this I really want to try to help you use the package on a data set that you already have. I just want to show you the kind of final graph that you should be able to get, which is this one, and I am going to start by analyzing the data set myself. So, I think this finishes our tutorial; we will be here from now until the end of the coffee break to help you install, use and explore DADApy on your own data. Thank you once again.
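For reference, the sequence of commands in that notebook roughly corresponds to a pipeline like the following. This is a hedged sketch under the same API assumptions as the earlier snippets; the file name my_data.npy is a placeholder for your own data, and the period argument for periodic coordinates is assumed from the documentation.

```python
import numpy as np
from dadapy import Data

X = np.load("my_data.npy")                 # your own (N, D) array of features

data = Data(X)                             # e.g. Data(X, period=2*np.pi) for periodic coordinates
data.compute_distances(maxk=100)           # distance computation
data.compute_id_2NN()                      # intrinsic dimension
data.compute_density_PAk()                 # adaptive density estimate
data.compute_clustering_ADP(Z=1.5)         # statistically significant density peaks
# the peaks, the saddle points between them and their errors are then available
# as attributes of `data`, and can be used to build the dendrogram shown earlier
```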