[Opening remarks unintelligible in the recording.]

...And one finds that these link with statistical criticality; there is a connection between relevance and statistical criticality. Here relevance means essentially what this workshop should be dealing with: the structure in data. I'll first discuss data, and then I'll discuss the properties of representations. I don't know how much time we'll have to go through all these points, but it's a perspective that, I'll try to convince you, allows you to address a number of conceptual issues that at least I find interesting, because they clarify a number of loosely defined concepts in learning.

But let me try to be more precise. There is one clear way of being structureless, which is maximum entropy. This is the statement that there is essentially nothing to be learned from data that is completely random, at maximum entropy. And there are infinitely many ways of being meaningful, of making sense. The key point is that learning, in my perspective, is exactly about understanding in which way a data set makes sense. What I have in mind is that what you want to learn is how the data that you see makes sense for the system that you are studying. So it's an intrinsic notion, and learning is finding out the way in which the data you are studying makes sense.

The other thing is that statistical learning usually focuses on the asymptotic regime, where the dimension of the problem is finite and you look at an infinite number of data points. This is a rather unnatural regime if you really want to understand learning, biological learning. Biological learning most likely occurs in what I call the undersampling regime, where the data is barely sufficient to draw conclusions or to build a model. And there are two mirror problems: on one hand you have data that should have some structure, and on the other hand you have the problem of forming an internal representation that mirrors this structure. So the problem of defining what it means for data to make sense, and what it means for an internal representation, a probability distribution, to make sense or to be meaningful, are essentially the same problem.

Of course, if you know in which sense the data makes sense, that is, if you know the model from which the data is drawn, you can quantify how much the data makes sense: you can measure, in bits, the number of bits that you learn.
And this is essentially what you do in statistics: you have some data that you think comes from a certain model that depends on some parameters, you have a prior distribution on the parameters, and then you compute your posterior. Out of this you can compute the mutual information between your data and the parameters. What you find is that the number of bits that you learn is given by this formula; m is the number of parameters, so the leading term is essentially the BIC term. Then there are subleading terms related to the Fisher information, and to how surprising the data is given your prior.

But this is really the wrong problem, because here all the information is contained in the sufficient statistics, and you decide a priori what the sufficient statistics are. So you just measure how much you learn about the parameters, not how much you learn about the structure of the data. The other thing is that you learn this many bits anyhow: whatever the model and whatever the data, you always learn this many bits to leading order. So there is no guarantee that these are really meaningful bits, or the right bits in some sense. It is also worth mentioning that the amount of information you learn is always much, much less than the information content of the data set, because it grows as the logarithm of N, whereas the number of bits needed to describe the data set grows as N. So essentially this is the wrong problem. Yeah, please.

[Audience] Can you be a bit more precise about what you mean by "you learn that many bits anyhow"?

This term here does not depend on the model; it depends only on how many data points and how many parameters you have. The shape of the model, the structure of the model, only enters in the subleading terms. And how good your fit is only enters through this ĝ here. So even if the data is completely random noise, you learn this many bits.

[Audience] And what is ĝ?

ĝ is the maximum likelihood estimate. This comes from a saddle-point calculation: you can compute this mutual information by a saddle-point calculation when N is large.

[Audience] What happens if the determinant of J, the Fisher information, is equal to zero? The Fisher information is a susceptibility, right? In multilayer neural networks it is sometimes singular.

Yes, okay. I'm not discussing neural networks here; for the moment I assume the model is regular. I'm focusing on the basic setup of statistical inference, where you think you know what the model is.
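The slide formula itself is not reproduced in the transcript; the standard asymptotic result being described here appears to be the Clarke–Barron expansion for a regular model with m parameters (supplied as a reconstruction, not taken from the talk):

$$
I(\hat{s};\theta) \;\simeq\; \frac{m}{2}\log\frac{N}{2\pi e} \;+\; \int \! d\theta\, w(\theta)\,\log\frac{\sqrt{\det J(\theta)}}{w(\theta)} \;+\; o(1),
$$

where N is the number of data points, w(θ) is the prior, and J(θ) is the Fisher information. The leading term is the model-independent BIC-like term; the Fisher and prior contributions are the subleading terms just mentioned.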
Okay, you can think of doing Bayesian model selection, but you soon realize that this is hopeless, in the sense that even for a very low-dimensional system the number of possible models is astronomical, and even for one of these models computing the posterior can be very hard. So instead the idea is to define, or estimate, how much information, how much useful information, a data set contains, using just basic information-theoretic ideas.

So imagine that you have a data set, and you think these data are drawn i.i.d. from a certain model, so the probability of the data set factorizes. This is what you have in mind when you think of the observations as coming from independent experiments, all done in the same way, under the same conditions. Then you can ask how many bits you would need to compress and store this data set, and an estimate is given by the empirical entropy, the entropy of the empirical distribution, where k_s is the number of times you observe one particular outcome s in the data set.

Now this is the total number of bits that you need to represent the data, but not all of these bits are useful. You can split them into a part which is useful, or rather an upper bound on the part which is useful, and the rest. The way you do this is by observing that you can split this entropy into the entropy of the frequencies and the entropy of the outcomes that share the same frequency k. The distribution over outcomes with the same frequency is by definition a maximum entropy distribution, because it is flat, so that part is essentially meaningless information; what remains is an upper bound on how much information you can extract from the data set. And this is what we define as relevance: the entropy of the frequency distribution. The entropy of the variable s, what you measure, is instead a measure of the resolution, because you can change the resolution by redefining what you measure. Once you define what variable you measure, or how you classify your data, that fixes the resolution; the other number, the relevance, is decided by the data.

By the way, you can combine the two arguments I gave you before to get a rough estimate of the maximum number of parameters that you can estimate from a data set, which is given by this formula. The idea is that, for any model, the amount of information that you learn about the parameters is given by the formula above, and this cannot be larger than the upper bound on the amount of useful information.

[Audience] And N is the number of data points, M the number of parameters?

N is the number of data points, and M is the number of parameters of the model.

[Audience] Mateo, is M related to the support size of the discretization you do in Ĥ[k]?

The discretization is related to how you define s. Essentially, yes.

[Audience] And is s supported on a set with M points, or not?

Not necessarily; you can define it any way you like. In fact you can also think that the support is not defined a priori. Imagine that you go out and sample, I don't know, species of flowers on an unknown island: you have a criterion for saying whether two flowers are the same species or not, but you don't know how many species there are. Okay, so.
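As a concrete illustration of resolution and relevance, here is a minimal sketch in Python, with my own variable names, not the authors' code:

```python
import numpy as np
from collections import Counter

def resolution_and_relevance(sample):
    """Empirical resolution H[s] and relevance H[k] of a discrete sample.

    Resolution: entropy of the empirical distribution over outcomes s.
    Relevance:  entropy of the frequency distribution, i.e. of how the
                observations spread over outcomes seen k times.
    Values are in nats; divide by log(2) for bits.
    """
    N = len(sample)
    k_s = Counter(sample)               # k_s: frequency of each outcome s
    m_k = Counter(k_s.values())         # m_k: number of outcomes seen k times

    # H[s] = -sum_s (k_s/N) log(k_s/N)
    H_s = -sum((k / N) * np.log(k / N) for k in k_s.values())
    # H[k] = -sum_k (k * m_k / N) log(k * m_k / N)
    H_k = -sum((k * m / N) * np.log(k * m / N) for k, m in m_k.items())
    return H_s, H_k
```

In this language, the parameter-counting bound just mentioned amounts to requiring that the roughly (M/2) log N bits learnable about the parameters not exceed the relevance-based upper bound on the useful information, consistent with the N/log N scaling that comes up in the questions at the end.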
Okay, so this is the main idea: you have your data, you compute the frequencies, and then you compute the multiplicities, how many outcomes you observe k times, and from these you compute the two numbers. And as you change the definition of what you measure on your data set, the point in this plot moves. Data clustering is an example: you can think of clustering as a way of defining a variable, and the size of a cluster is essentially its frequency. If you look at different clustering algorithms, you get different curves as the number of clusters changes from n down to one. And the point of this plot is that some ways of doing the clustering are more informative than others. In this particular example, stocks in the New York Stock Exchange, this method here is based on a more refined model of Gaussian correlations between stocks, one that takes into account that there is a market mode, et cetera. The idea is that the more you refine your model, the higher you go on this curve.

The other thing you can see in this plot is that there is a trade-off between the two quantities. When you compress your data, your relevance also changes, and you can ask, when you compress by one bit, how many bits you gain. There are essentially two regions. There is a region where the curve is very steep: here, when you compress by one bit, you gain a lot of bits of relevance, about the model. In the other region you gain less; that is the oversampling regime, where you are just compressing your model.

Okay, so having defined these quantities, you can ask: what are the data sets which are maximally informative? You can solve this simple variational problem, and what you find is that, at the maximum, the samples that maximize this curve all have a power-law distribution of frequencies. So they exhibit what is called statistical criticality. At the maximum you can also quantify the trade-off between resolution and relevance, and this trade-off is precisely related to the exponent of the power-law behavior, in the sense that when you compress by one bit in resolution, you gain μ bits in relevance. There is one particular point which separates a noisy regime, where compressing by one bit gains you more than one bit about the model, from a lossy-compression regime on the other side. This is the point where μ equals one, which is essentially the point of optimal lossless compression, and it is what is generally called Zipf's law.
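In equations, the variational problem just described (my reconstruction from the talk; μ is the Lagrange multiplier trading resolution for relevance) is

$$
\max_{\{m_k\}} \;\hat H[k] + \mu\,\hat H[s] \quad\Longrightarrow\quad m_k \propto k^{-1-\mu},
$$

so at the μ = 1 point the number of outcomes observed k times falls off as k^{-2}, which is the Zipf behavior defined next.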
Zipf's law is a statistical behavior observed in many domains that have to do, somehow, with representations: the number of items that occur k times is inversely proportional to k squared, or equivalently, if you rank objects by frequency, the frequency of the k-th ranked object decays as the rank to the minus one.

Now, of course, we checked that this makes sense in machine learning. You can look at the internal representations of, in this case, restricted Boltzmann machines and deep belief networks: you sample the internal states of these machines and check whether the relevance of each machine is close to the maximal line for its particular value of the resolution, and more or less it works reasonably well. You can also look at what happens in language, at the frequency distributions of different texts. For example, many people have studied the Holy Bible, and one author studied a hundred translations of it and computed these exponents; what you see is that, roughly speaking, later translations are compressed versions of earlier translations. At least you can interpret it that way.

We also applied this idea to inference, to try to extract meaningful information from sequences, in this case sequences of proteins. This is a system where there is no true model, and where you are really working in the undersampling regime. And we have applied it to data from the brain, where you can ask which neurons are more informative about a particular correlate, or which neurons are more relevant in general. The way neuroscientists usually do this is to look at the correlation between the neural activity and the correlate, the mutual information between the neural activity and that particular correlate; this is the case of navigation in rats. But you should be able to do this even without having any correlate, because the brain does: the brain is able to figure out that this neuron is not particularly informative, whereas these ones are informative about position. And so, I think the slide is frozen. Oh, okay. So what we came up with is a way of applying this idea, deriving a particular measure that we call multiscale relevance. What we find is that neurons with a small multiscale relevance do not carry any information about spatial correlates, whereas neurons that carry information about spatial correlates have a higher relevance. And if you take the high-relevance neurons and try to decode the position, you do essentially as well as if you had taken the neurons that are most informative about the position.
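Before moving on to representations, here is a quick numerical check of the equivalence between the two forms of Zipf's law stated above (my own illustration, not from the talk): a sample drawn with rank-frequency law f(r) ∝ 1/r should have m_k, the number of items seen exactly k times, falling off roughly as k^{-2}.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Draw a sample whose rank-frequency law is Zipf's: item r has p_r ∝ 1/r.
n_items, N = 10_000, 1_000_000
p = 1.0 / np.arange(1, n_items + 1)
p /= p.sum()
sample = rng.choice(n_items, size=N, p=p)

k_s = Counter(sample)            # frequency of each item
m_k = Counter(k_s.values())      # number of items observed exactly k times

# Log-log slope of m_k vs k, fitted where the statistics are decent.
ks = np.array([k for k in sorted(m_k) if 20 <= k <= 150])
slope = np.polyfit(np.log(ks), np.log([m_k[k] for k in ks]), 1)[0]
print(f"log-log slope of m_k: {slope:.2f}  (Zipf's law predicts about -2)")
```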
Okay, so let me switch now to, how much time do I have? Okay. Let me switch to the representations: I'm now going to move from the data to the internal state of a machine that is supposed to represent this data. And you can rephrase the argument in exactly the same way, by saying that the relevant variable is now the coding cost: how many bits do I need to represent one state? You can take the resolution to be the average coding cost, and the relevance to be the entropy of the coding cost, where p(E), the distribution of the coding cost, is the degeneracy, the number of states with coding cost E, times e^{-E}, the probability of each such state. The maximally informative representations are those that satisfy this principle of maximal relevance: they maximize this object at a given value of the resolution.

Again, the idea is that you can split the coding cost into two parts. One part is noise, because the states with the same energy form a maximum entropy distribution, whereas the other part is an upper bound on the signal. And what you get out of this is that the distribution of energy levels, the degeneracy, should be exponential, which is the analogue of statistical criticality. If you think about it, this is criticality in the sense of statistical physics: it tells you that the entropy is linear in the energy, so the specific heat, which is related to the inverse of the second derivative, diverges.

Now, this picture takes into account that learning is really very different from statistical physics. Statistical physics is described by maximum entropy distributions, which means you are looking at a system that retains the least possible information about its environment; actually it retains only one number, the temperature. As a result you have an asymptotic equipartition property: typical states have a probability e^{-S}, which is inversely proportional to the number of typical states, but this happens at just one point, one value of the energy. In a learning machine described by this maximum relevance principle, instead, you have a broad distribution of energy levels, in the sense that the linear relation between energy and entropy extends over a whole range. And this linear behavior is easy to understand: it is an optimal use of the information resources that you have. Minus log p is the number of bits you need to code one state. If you spend that many bits, you would like the number of states with that coding cost to be exponential in it, because that is the most those bits can encode. I don't know if this is clear: if E = -log p(s) is the coding cost and S(E) is the entropy, the log of the number of states with energy E, then with E bits you can code at most e^{E} states, and an optimal representation keeps S(E) as close as possible to this limit over a range of energies.

Even in this case you can think about the trade-off between resolution and relevance. And what is interesting from a statistical mechanics point of view is that you can read this as a relation between energy and entropy, but with the concavity the other way around, because this is the dual problem, a very different problem from the one of statistical mechanics.
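In code, these definitions might look as follows (a sketch under my own conventions, not the authors' implementation; the coding costs are binned because -log p is in general continuous-valued):

```python
import numpy as np

def coding_cost_resolution_relevance(p, n_bins=30):
    """Resolution and relevance of a representation with state probabilities p.

    Resolution: average coding cost <E>, with E(s) = -log p(s).
    Relevance:  entropy of the coding-cost distribution p(E).
    """
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    E = -np.log(p)                         # coding cost (energy) of each state
    resolution = float(np.sum(p * E))      # H[s] = <E>

    # p(E): probability mass falling in each energy bin
    bins = np.linspace(E.min(), E.max() + 1e-12, n_bins + 1)
    pE, _ = np.histogram(E, bins=bins, weights=p)
    pE = pE[pE > 0]
    relevance = float(-np.sum(pE * np.log(pE)))   # H[E]
    return resolution, relevance
```

A near-uniform p puts all states in one energy bin and gives relevance near zero, while an exponentially degenerate spectrum, W(E) ≈ e^{E}, spreads the mass evenly over energies, which is the maximally relevant case described above.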
In statistical mechanics, criticality only occurs at very special points, whereas in these systems criticality is a typical property. You can also establish a relation telling you that the relevance is a lower bound on how much you can learn about hidden features; I won't go through this, ask me if you want the details. And in a specific example, where you take a deep belief network modeled as a sequence of random energy models, you can really understand that criticality is needed in order to transmit information across layers. Again, I won't go into too much detail.

The other thing you understand is that Gaussian models actually do not make sense, if you believe this picture. You can take a Gaussian learning machine, a Gaussian restricted Boltzmann machine, and compute the resolution and the relevance, and what you find is that the relevance does not depend on the data at all; it is just constant. And indeed this is what you see: when you train a Gaussian machine on MNIST, it does not do a very good job. Maybe we didn't do it very well, but an RBM does much better than what we could do with the Gaussian machine. The idea is that a Gaussian learning machine, in the end, can only model one distribution, a Gaussian, whereas the structure in the data is related to disentangling a mixture of distributions.
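A one-line way to see why the relevance of a Gaussian model is constant (my reconstruction of the argument, not spelled out in the talk): for x ~ N(0, Σ) in m dimensions the coding cost is

$$
E(x) = -\log p(x) = \tfrac{1}{2}\log\det(2\pi\Sigma) + \tfrac{1}{2}\,x^{\top}\Sigma^{-1}x,
$$

and the quadratic form x^T Σ^{-1} x follows a χ² distribution with m degrees of freedom regardless of Σ. So the distribution of the coding cost is a fixed χ² shape, only shifted by a constant, and its entropy, the relevance, cannot depend on the learned covariance, that is, on the data.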
Okay, so in the last five minutes, yeah, five minutes, I would like to tell you one last thing we have been thinking about, exploiting this idea. The idea is to think about simple learning machines where the internal representation is fixed. Now that we know what properties a maximally informative internal representation should have, namely maximal relevance, we can fix it, and just learn how the data is projected onto this internal representation. Contrast this with the way a restricted Boltzmann machine learns: there, you build the representation as you learn. Indeed, if you look at the distribution of energy levels inside the hidden layer of an RBM during learning, it evolves from something resembling a spin glass, a random system, towards something with a broad distribution of energy levels. And this should be thermodynamically efficient: you can make an argument that the work needed to learn a data set is lower bounded by the mutual information between the data and the internal representation, plus the DKL between your initial and final internal distributions, and you would like to stay as close as possible to this bound for thermodynamic efficiency.

But you would also like a representation which is flexible, by which I mean that you can, say, add a new hidden variable without scrambling the whole representation you have already learned, which is what happens in a restricted Boltzmann machine: if you add a new hidden variable there, everything rearranges completely. Or, if you want to compress your representation, to pass to a more compressed one, you would like to be able to change the internal distribution without having to retrain the machine. And there are other reasons too. For example, you may want a machine such that, having learned data sets X and X', you can imagine what the relation between X and X' might be, just by marginalizing over the internal state. Or you can think about properties like imagination: once you define your internal states at the outset, it is not evident that all of them will be filled by your data set, so some states may correspond to data you have never seen.

So this is what we did: we studied this type of machine, a machine with a hierarchy of features. You start with your data set, and at first you say, well, these are all images. Then you say: there are images with feature one, and images that do not have that feature. Then there is a second feature, then a third feature, et cetera. And the remarkable thing is that a unique statistical model emerges if you impose two very natural requirements. First, a priori you should not impose that if a data point is described by a certain feature, then another feature should also be present; you want what I think is called disentanglement, features should be as independent as possible. Second, you want the distribution to be one of maximal relevance. The Hamiltonian that you get is very simple: with binary variables, s_i equal to zero or one, the energy is the maximal index for which the variable is one. If you look at the states with a certain energy E, these are all states where that spin is fixed to one; all the features below it can be present or not with probability one half, so it is like an infinite temperature state, whereas all the features above it must be absent, which is like a zero temperature state. It's a very interesting model, and you can work out its whole thermodynamics. It exhibits a phase transition between a phase where the entropy is extensive and a phase where the entropy is of order one and disentanglement is very low. And it has this continuity property: if you train on a given data set and then add more data points, you don't screw up what you have already learned; at least the lower features remain the same.
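A toy implementation of the model just described (a sketch; the enumeration is only feasible for a small number of features, and the coupling g below plays the role of the resolution-tuning parameter mentioned next):

```python
import numpy as np
from itertools import product

def energy(s):
    """Hierarchical feature model: E(s) is the largest (1-based) index i
    with s_i = 1, and 0 for the all-zero state."""
    on = np.flatnonzero(s)
    return 0 if on.size == 0 else int(on[-1]) + 1

def hierarchical_model(n_features, g):
    """Exact Boltzmann distribution p(s) ∝ exp(-g E(s)) over all 2^n states."""
    states = [np.array(s) for s in product([0, 1], repeat=n_features)]
    weights = np.array([np.exp(-g * energy(s)) for s in states])
    return states, weights / weights.sum()
```

At energy E ≥ 1 the degenerate states are exactly those with feature E on, all features above E off, and the E - 1 features below free, so the degeneracy is 2^(E-1): entropy linear in energy, which is the exponential density of states that maximal relevance requires.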
Also, if you add another feature, the representation you get has a high overlap with the representation you had before with one fewer feature. And if you change the parameter g that tunes the resolution, again the representation you get has a high overlap with the one you had before.

Okay, so let me go to the conclusions, because I think I'm over time; we did some tests. What I want to stress, and this is my last slide, is that what I tried to describe in this last part is a different modality of learning. The learning modality of the restricted Boltzmann machine is one where you have very local and generic features: if you train an RBM on one data set, those weights can reasonably well reproduce another data set too. These representations typically need overparameterization and stochastic dynamics to be trained, and they have limited plasticity: when you learn a data set and then train that model on another data set, the weights do not change much. This is essentially why transfer learning works: transfer learning exploits the fact that the internal representation does not change much, so what you need to change is only the output layer, not the earlier layers. And this tells you that the information about the data is really in the internal representation, not so much in the weights; the weights are largely independent of the data. Whereas when you fix the internal representation, all the information about the data set is in the weights, so the weights really tell you something about the structure of your data set. Because of this, when you move to a new data set the weights change completely: you have high plasticity, if you want. One can also argue that this can support higher cognitive functions; this is a very bold statement. But overall it suggests that, in a deep architecture, these two learning modalities coexist: in the early stages of processing you have RBM-type representations, and in the later stages you have this other type.

And we have tested this idea; these are preliminary results by Carlo Orientale Caputo, a master's student who just finished his thesis. We tested it on deep belief networks, and what we find is that in a deep belief network with 10 layers, trained on simple data sets like MNIST, et cetera, the plasticity indeed increases as you go deeper: if you switch from one data set to another, the weights of the early layers do not change much, while the weights of the deep layers change a lot. We also found that the early layers are very well described by low-order models, like pairwise models. And this is essentially what people mean by the hierarchy of abstraction as you go deep: in a pairwise model you don't have much abstraction, you only have low-order features; the correlation between two variables does not depend on a third variable, essentially.
Whereas, as you go deeper, pairwise models no longer work so well: higher-order interactions start to become important. And what is interesting is that we, actually Carlo, also computed the DKL between the internal representation and this hierarchical feature model, and what we found is that the deeper you go, the closer you get to the hierarchical feature model. Though you should note the scale: this is around 0.1, so you don't really get to zero; we are not quite there yet, but it gives you a hint that you converge towards a representation which is abstract, in the sense that it is derived from first principles alone.

Okay, so this is my conclusion. I have tried to convince you that there is a model-free, intrinsic measure of the structure in data, which I call relevance, that it is related to criticality, and that it allows you to think about a number of, I believe, interesting issues related to learning. Okay, thank you. Questions?

[Audience] Thank you for a really interesting talk. My question is this: it has been observed in biology, particularly in the hippocampus and olfactory cortex, that there is this phenomenon of representational drift, in which the internal representations change over time, meaning that neurons that at the beginning were responsive to a particular stimulus stop being responsive, and the whole coding changes. So, assuming that the brain knows what it is doing, and that these representations stay maximally relevant throughout, do you have an intuition for how the brain could be achieving continuously changing, maximally relevant representations of a fixed set of stimuli?

No. No, I mean, this is why we try to nail down these questions in simple systems like RBMs and DBMs. As soon as you go to more complex systems, let alone the brain, where there are connections going in both directions, it becomes very, very complicated. But yeah.

[Audience] Thank you very much for a very interesting talk. Your results look very general, so I want to ask to what extent, to which learning problems, they are applicable. For example, you showed an inequality where the number of parameters is bounded by something like the number of data points; maybe it scales as n over log n.

Yes, yes.

[Audience] But consider the simple perceptron problem in statistical mechanics, where we take the thermodynamic limit with n scaling as α times the size m. How are the two compatible?

Yeah, so, indeed. Think about network reconstruction, for example: we know that you need only a number of data points logarithmic in the number of spins to reconstruct the network of interactions, by pseudo-likelihood and things like that. But that is a situation where you know the model, and knowing the model is, in some sense, an infinite number of bits of prior information. This estimate, instead, is one that works when you have no idea what the model is.

[Audience] Right, so you mean that if we do not have prior information about the model architecture, the inequality holds.

Yes, yes. Like when you look at biological sequences, or at recordings of neurons in the brain: there, of course, you have no idea what the model is, and you want an estimate of how rich the model can be.
Of course, if, say, in network reconstruction you decide to fit an Ising model, you can estimate it even when all the configurations in the sample are different, in which case this Ĥ[k] would be zero and the bound would tell you that you cannot do it. But you have already decided that you are looking at a pairwise model, and that is a lot of information.

[Audience] Okay, thank you very much.

[Audience] Yeah, so I had a question: at some point you said something like Gaussian models carry no information, or carry no relevance, sorry, was your statement. I was wondering: suppose I have Gaussian data and I try to learn it with a Gaussian model. Say the data is very anisotropic, some directions of the covariance are much smaller than others, but otherwise it is Gaussian, and I want to learn a Gaussian model of it. How does that square with the statement that there is no relevance?

Yes, so in that case your weights could be very anisotropic, you could have some very long directions, but the relevance does not depend on the weights. And the idea is that you are not really learning anything about the model: the model you already know, because you start from a Gaussian and you end with a Gaussian, so the structure of the model is fixed throughout. If you think about learning MNIST, you think of learning one model for the ones, one model for the twos, and so on; you learn a mixture of models. You could learn a mixture of Gaussians, and a mixture of Gaussians would not have constant relevance, because a mixture of Gaussians is not a Gaussian.

[Audience] I see. The reason I am stuck on that statement is that we have shown, in one of our papers, that if you just model MNIST with a Gaussian and make it 784-dimensional, you get very good reconstruction. So MNIST is basically a multivariate Gaussian.

A multivariate Gaussian and a mixture of Gaussians are very different things.

[Audience] I'm not sure, I don't remember which one we did. But yeah, it could have been a mixture of Gaussians.

I would bet it's a mixture of Gaussians. And you know, I think tomorrow morning, in the first talk, Francesca is going to say that even if you scramble the labels, the asymptotic behavior is still ruled by a central limit theorem; you have this Gaussian equivalence thing. But once you have scrambled the labels you have lost any meaning: everything that made sense has been lost. I don't know, Francesca is here.

[Audience] Yeah, this reminds me a bit of independent component analysis, where the goal is to make the data as non-Gaussian as possible, and that is how you find your independent components. But that's just something that came to mind. Other questions?

[Audience] Just a very quick question about the inequality where the number of parameters is bounded by the relevance. How sharp is it? Is it always achievable, or is there something special?

No, it's not sharp. We give a few examples in the paper. For instance, if you take data from this protein data set, the bound is very loose, in the sense that you get a very large number there.
[Audience] But it's the first bound of this type that I have seen. My other question is: from a theoretical point of view, besides this inequality, what is the motivation, purely from an analytical point of view, for picking this notion of relevance?

Okay, so what's the motivation for this notion of relevance? Essentially that you want a notion of relevance that is model-free. It's like the entropy: the entropy is a measure of information content that is completely independent of what that information is about. Here you want a measure of how much structure your data has, irrespective of what that structure is. And that is more or less what information-theoretic measures give you.

[Audience] Okay, thanks.

[Audience] Mateo, I had the feeling you need some sort of model, a clustering model, right? Some data point comes in and you have to decide in which box to put it, and then you can count the number of data points that fall into a certain box. So in some sense you need a clustering model, maybe a mixture model or something like that, to compute the relevance in practice.

Oh well, you need the data. You just need the data.

[Audience] Yeah, but you have to do something: if they are continuous random variables, you have to count them into boxes. And that, in some sense, is clustering.

Yes, okay. If you have continuous data, then you have to discretize it.

[Audience] And that's a model.

And that's a model, yes, yes. So I agree, I agree with you. Continuous variables are hard; even defining a prior for a continuous variable is complicated. We are not solving that problem.

[Audience] Oh, I like it, it's interesting. Maybe we can thank Mateo again for the great talk.