Thanks. Hi all, thank you for connecting. I also want to thank the organizers for inviting me, and my collaborators Alessandro, Claudio, Gábor, and Bingqing for helping me with this work. It has recently been uploaded to arXiv, and it does precisely what the title says: ranking the information content of distance measures. The objective is to take two distance measures, or two feature spaces, and say whether one is more or less informative than the other.

The original motivation came from the task of feature selection. Independently of the algorithm one wants to use for a given supervised (or other) learning problem, one always has to choose a representation. The objective here is to take two feature representations, without knowing which one is better, and, independently of the specific architecture and the specific learning task, be able to say that one feature space is better than the other because it contains more information on that specific data set.

You can recast this problem as a simple question: given two spaces A and B, can we identify whether one is more informative than the other? The obvious candidate we first thought of was the mutual information between the two spaces. But it is easy to realize that this cannot be used: it is a symmetric measure of information, the information between A and B is of course the same as that between B and A, so it cannot say whether one space is more informative than the other.

Another candidate from information theory, which is asymmetric, is the conditional entropy, and one might think of using it to solve this feature selection problem. However, it has a different problem. Consider the very common situation in which a feature space A is an explicit deterministic function of another space B plus independent noise with standard deviation sigma. Under these conditions the conditional entropy diverges: as sigma goes to zero, the conditional entropy goes to minus infinity. This is problematic because in practice many spaces will be connected by a map that you might not know but that exists, so if you try to estimate this quantity on a big data set you will basically be sampling different flavors of minus infinity.

In practice we want something that is well behaved and asymmetric. This is what we came up with, first as an intuition of how it should work, and later with a more theoretical grounding. We call it the information imbalance, and we write the information imbalance from A to B with the letter delta. This is the tool I want to present to you today: how it works and what you can do with it.

The first ingredient used to compute the information imbalance is the notion of distance ranks. If you have a space A, and J is the first neighbor of I in this space, then we say that the rank of J with respect to I is one.
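To make this concrete, here is a minimal NumPy sketch of how distance ranks can be computed; the function name, the Euclidean metric, and the assumption of no ties are my own illustrative choices, not prescribed by the talk.

```python
import numpy as np

def distance_ranks(X):
    """Matrix of distance ranks for a data set X of shape (N, d).

    ranks[i, j] = k means j is the k-th nearest neighbor of i
    (rank 1 is the first neighbor; each point has rank 0 w.r.t. itself).
    """
    # Pairwise Euclidean distances, shape (N, N).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # argsort of the argsort turns each row of distances into ranks.
    order = np.argsort(dist, axis=1)
    ranks = np.argsort(order, axis=1)
    return ranks
```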
That is the distance rank. Now, the important thing I want to show in this slide is that the rank can be different depending on the space and on the distance: if you take a different space B, the same two points can have a different distance rank, and in this case the rank of J with respect to I, for the same two points, is three. Distance ranks can be used to find which of two feature spaces is the most informative. How can we do this?

I first want to introduce the problem on a simple example, the one on which we initially cast it. You have a three-dimensional space with coordinates X, Y, and Z, and a data set that lies essentially on the XY plane: the variance of this Gaussian data set is isotropic along X and Y, but the variance along Z is much smaller than the variance along X, which equals the variance along Y.

What we want to retrieve with our tool is the following set of relationships between metrics. We want to find that XYZ and XY, the full three-dimensional space and the two-dimensional space where the data lie, are equivalent: you can discard the Z coordinate and still retain full information. We would also like to find that X and Y are orthogonal, independent, in the sense that they share no information: you cannot predict one from the other. The final relationship is perhaps the most interesting: we should find that XY is more informative than X, because you can predict X from XY but not vice versa.

The way we proceed to build a measure characterizing these three types of relationships is the following. On the same data set you compute the ranks in one space and in the other, and you plot one against the other. Here we are plotting the ranks in the space XY against the ranks in the space XYZ. Since the spaces are equivalent, the correlation is very clear, but we want to be more precise and quantitative. In particular, we take all the ranks in XYZ conditioned on the rank in XY being equal to one; this corresponds to the blue shaded band in the upper plot. If you then build the conditional probability of the rank in XYZ given that the rank in XY is one, you find this histogram, which is peaked around one: if you have a small rank in one space, you are very likely to have a small rank for the same pair of points in the other space. In this case the average value of the rank under this distribution is about one.

Now let's move to the complete opposite case, two spaces that have nothing in common: X and Y. Again we plot the ranks against each other, and we condition the rank in X on the rank in Y being equal to one; this corresponds to this shaded band here. If we plot the histogram, this time it is flat, which makes sense: two points that are close to each other in one space can be anywhere in terms of rank in the other.
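As an illustration of this conditioning step, here is a sketch building on the distance_ranks helper above; the data set mimics the Gaussian example of the talk, but the exact variances and sample size are my own choices.

```python
import numpy as np

def conditional_ranks(X_a, X_b):
    """Ranks in space B restricted to pairs whose rank in space A is 1.

    A histogram peaked near 1 signals that A is informative about B;
    a flat histogram (mean near N/2) signals independence.
    """
    ranks_a = distance_ranks(X_a)
    ranks_b = distance_ranks(X_b)
    return ranks_b[ranks_a == 1]

# Gaussian data set: isotropic in the XY plane, tiny spread along Z.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3)) * np.array([1.0, 1.0, 0.01])

print(conditional_ranks(data[:, [0, 1]], data).mean())       # XY -> XYZ: ~1, peaked
print(conditional_ranks(data[:, [0]], data[:, [1]]).mean())  # X -> Y: ~N/2, flat
```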
And the same thing happens if we plot the opposite relationship: the histogram is flat in both directions, and here the average value of the rank is around N/2, because the histogram is flat between zero and N, where N is the total number of points.

The more interesting case is the one in the middle, in which we plot the ranks according to XY against the ranks according to X, which is clearly less informative since it can be predicted from XY. If you do the same analysis, you find that in one direction the histogram is flat, and in the other it is peaked around one. So in one case the average value of the rank is around N/2, and in the other it is of order one. You can clearly see that there is now an asymmetry, and this is the signal we are going to use to define our measure.

Indeed, we define the information imbalance simply by normalizing this conditional expectation of the ranks: delta from A to B equals 2/N times the average rank in B, conditioned on the rank in A being one. With this normalization factor in front, delta is statistically confined to the range between zero and one, where zero means complete information and one means complete uncertainty. In the equivalent case both deltas, delta and delta prime, are zero; in the independent case both are one; and in the third case delta is zero in one direction and one in the opposite direction. This is a clear signal of inclusion, of one metric or one space being more informative than the other.

This suggests a useful representation of the relationship between any two feature spaces: plot the two imbalances against each other. Once again, this is the same data set, a Gaussian lying on the XY plane, and we plot the information imbalance from one space to the other against the imbalance in the reverse direction. If we take two equivalent spaces, XYZ and XY, both information imbalances are zero. The complete opposite case is given by X and Y: these two spaces are independent, and both imbalances are of order one; I am just repeating what I explained on the last slide. The spaces X and Z are clearly included in the space XYZ, and this is where you find them in this plot, which we call the information imbalance plane. Simply by visualizing this graph you can see what the relationship between any two feature spaces is.

So this is interesting, and it works on this very simple case. The catch is that this is a linear data set, somehow trivial: a simple Gaussian with a small variance along Z. The nice feature of this measure, however, is that it works on arbitrarily nonlinear manifolds as well. Here is a simple example which is nonlinear: the standard deviation along Y is around five, while the standard deviation along X is around one. So if you used any linear tool to estimate the most important directions or features, you would probably find that the direction Y gets selected; this is the principal component you would obtain from principal component analysis.
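Putting the pieces together, here is a minimal sketch of the information imbalance itself, reusing distance_ranks from above; the nonlinear toy function below is my own stand-in for the example on the slide, not the authors' data.

```python
import numpy as np

def information_imbalance(X_a, X_b):
    """Delta(A -> B) = 2/N * <rank in B | rank in A == 1>.

    Close to 0: A predicts neighborhoods in B. Close to 1: it does not.
    """
    N = X_a.shape[0]
    ranks_a = distance_ranks(X_a)
    ranks_b = distance_ranks(X_b)
    return 2.0 / N * ranks_b[ranks_a == 1].mean()

# Nonlinear example: Y is a non-invertible function of X with a larger
# spread, so X should come out as the more informative coordinate.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=(1000, 1))
y = 5.0 * np.sin(6.0 * x)

print(information_imbalance(x, y))  # small: X -> Y
print(information_imbalance(y, x))  # larger: Y -> X
```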
And this is not really what you should find, because X is clearly more informative: you can predict Y from X, while the opposite is not true. Indeed, if you look at the relationship between X and Y on the information imbalance plane, in particular this one, you find that X is more informative than Y, because the information imbalance from X to Y is much smaller than the reverse. X is also seen to be equivalent to the full representation XY. Without going too much into the details, on the right-hand side the same thing happens: Z is now the most informative coordinate, and Z is seen to contain both X and Y, so to be more informative than them, and also to be equivalent to both of them, since this is a spiral rolling around the Z axis.

So this was the intuition that originally led us to define this tool and this quantity. But there is also a way to connect distance ranks, and this quantity, to more standard and grounded quantities in statistics and information theory. The main connection comes from a theorem that, if you are a statistician, you know by heart because it is very well known: Sklar's theorem. It is very powerful, in my opinion: you can take any joint distribution of two variables and always decompose it into the product of the two marginals and a distribution called the copula distribution, where the copula variables c_A and c_B are each distributed uniformly between zero and one. This tells us that, for any distribution, all the correlation structure is contained in the copula factor: since the copula variables have uniform marginals, there is no information in their marginals, and they carry only the correlation structure.

In our particular case we are looking at the joint distribution of two distances, d_A and d_B. We can decompose this distribution with Sklar's theorem, and if you take d_A and d_B as your variables, the nice connection is that the copula variables of the distances are estimated precisely by the distance ranks. Why? Because the copula variables are defined as the probability integral transform of the distances: I take the marginal and integrate it up to a given distance, which gives me the probability that the distance between a point and another point is smaller than d_A. Clearly, this probability mass can be approximated by the number of points inside the corresponding volume divided by the total number of points, which is exactly a normalized rank. This is the intuition by which you can connect the simple idea of computing ranks, and the information imbalance, to tools of statistical theory. From this alone we can say that the conditional probability of the ranks directly probes the conditional probability of the copula variables, and in this sense we probe the correlation structure of the distances simply by looking at ranks. But you can go a bit deeper than that: you can also write the mutual information as an integral involving a conditional entropy of these copula variables.
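To make the statement concrete, here is the decomposition written out; the notation follows the talk, while the hat symbols for the empirical estimates are my own.

```latex
% Sklar's theorem for the joint density of the two distances:
p(d_A, d_B) \;=\; p(d_A)\, p(d_B)\, c(c_A, c_B),
\qquad
c_A = \int_0^{d_A} p(d'_A)\, \mathrm{d}d'_A,
\quad
c_B = \int_0^{d_B} p(d'_B)\, \mathrm{d}d'_B .

% The copula variables c_A, c_B are uniform on [0, 1], and their
% probability integral transforms are estimated by normalized ranks:
\hat{c}_A \approx \frac{r_A}{N},
\qquad
\hat{c}_B \approx \frac{r_B}{N}.
```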
I won't go into the details here; I just wanted to pique your interest a bit, perhaps for a later discussion, and encourage you to read the article as soon as it is published. But essentially there is a way to connect this quantity to quantities like the mutual information as well.

I also want to briefly go through a couple of applications of this tool that we tried. The first, the motivating application, is feature selection. In this particular example there is a 350-dimensional feature space; the descriptor is called SOAP, it is very specific, used in the physics of atoms and molecules. What we want to do is compress this 350-dimensional descriptor to only the few most informative components, without knowing the algorithm that will be used on them and without knowing the target property or output quantity that we want to fit. We want to do this in a model-agnostic fashion, looking only at the information imbalance.

What we do is simply this: we compute the information imbalance plane between the full 350-dimensional feature space and each single one-dimensional component; this is the cloud plotted here. Given this cloud, and given what we have said so far about the information imbalance, you can spot the most informative single coordinate as this point here. You select that coordinate as the most informative, then look at all pairs of coordinates containing the chosen coordinate; this is the cloud given here. Again you select the most informative pair, which is going to be this one, and you iterate this process for as long as you want, selecting the most informative tuple of features and doing feature selection in this greedy optimization fashion (a minimal sketch of this loop follows at the end of this part).

If you do this, you find that, at least in this case, the process converges very rapidly: you select the first component, then the second, then the third, and if you plot this quantity against the number of components selected, you find that it converges very quickly. With only 8 or 16 components you can get a good resolution of the first neighbors of the full 350-dimensional space.

So, nice: we have done our feature selection. What is very interesting, and now I am talking about these plots here, is that you can run a supervised learning algorithm, either a Gaussian process or a neural network, on the selected features, and you get much faster convergence of the test error on a given quantity, which is not important right now. This is just to say that this feature selection scheme works for supervised learning tasks.

I am almost running out of time. Another application I want to briefly discuss is related to the spreading of the COVID-19 epidemic. We basically found, and verified, I think this is the right word, that the space of policy measures present in a given nation, school closing, gathering restrictions, movement restrictions, and so on, is indeed predictive of the space of the epidemic state: the number of deaths, the number of positives, the confirmed cases, and so on.
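As promised above, here is the greedy selection loop in code, building on information_imbalance; this is my own minimal rendering of the scheme described in the talk, not the authors' implementation, and it judges a candidate subset by how well it predicts the full space.

```python
def greedy_select(X_full, n_features):
    """Greedy forward selection of coordinates, ranked by the information
    imbalance Delta(reduced space -> full space): at every step, add the
    coordinate that lowers the imbalance the most."""
    selected = []
    remaining = list(range(X_full.shape[1]))
    for _ in range(n_features):
        best = min(
            remaining,
            key=lambda j: information_imbalance(X_full[:, selected + [j]], X_full),
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```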
This is clear from the fact that the point describing the relationship between these two spaces appears in the top left of the graph. Then, by running the feature selection algorithm on this problem, we could even identify which were the most informative policy measures out of the total of nine. We also ran a sanity check: adding random integers to the policy space indeed only deteriorates its information content.

With this I am finished. I am quite excited about some possible future developments, in particular the analysis and design of deep neural networks. This is a somewhat obvious application, because you could probe the information content of layers, of single neurons, or of groups of neurons, and this could allow a better design and understanding of the information flow in neural networks. Another application I am quite excited about is using this measure to generate or design new dimensionality reduction algorithms: with a continuous optimization one might find a low-dimensional space that carries maximum information on a very high-dimensional one. So, thank you for your attention. Thank you very much.