So, the starting point is the empirical observation that modern deep networks work in the so-called over-parametrized regime, where the number of trainable parameters, which in a fully connected architecture is proportional to the size of the hidden layers squared times the depth of the network, is much larger than the size of the training set that we use to train the network. Let's say that for the MNIST dataset you have ten to the four training patterns, for instance, whereas today something like a ResNet-50 has a good ten million parameters. So there are several orders of magnitude between the two. And despite this, these networks do not overfit: their performance, their generalization, remains good. This is surprising from the viewpoint of statistical learning theory, where both mathematically rigorous theorems and more informal arguments, such as the bias-variance trade-off, tell you that having many more parameters than data should hurt you. When we think of the statistical physics approach, of course we cannot but acknowledge two important pioneers in the field. One is Elizabeth Gardner, who unfortunately passed away at the end of the 80s; the other, of course, is Haim Sompolinsky. Basically, what he did was to understand how to employ spin glass techniques to study learning in simple models of neural networks. So the statistical mechanics tradition is to study simple models, shallow models: linear perceptrons, random feature models, kernel learning, which is very related to random feature models, or shallow architectures such as parity or committee machines, where we have exact results. And the typical approach in statistical physics, but also in statistical learning theory, is to have a data set, a training data set, I'm thinking of supervised learning problems, and this data set is drawn from a certain input-output probability distribution. This is not the most general setting you can think of, because of course you cannot think of out-of-distribution problems and things like that, but let's say it's a reasonable setting.
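To put numbers and symbols on the setting described above (the notation here is an illustrative choice, not taken from the slides):

$$
\mathcal{D}=\{(x^\mu,y^\mu)\}_{\mu=1}^{P},\qquad (x^\mu,y^\mu)\sim p(x,y)\ \text{i.i.d.},\qquad
\#\text{parameters}\;\sim\; L\,N^2\;\gg\;P ,
$$

for a fully connected network of depth $L$ and hidden width $N$; with $P$ of order $10^4$ (MNIST) against roughly $10^7$ parameters, the gap is about three orders of magnitude.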
And what do we do in statistical physics? Well, the maximalist program of statistical physics would be to say something about the partition function of the system in an equilibrium ensemble, in a Gibbs ensemble, where beta is the inverse temperature and L is the loss, and you would like to say something about this partition function, because from the partition function you can compute observables. And the typical way we do this in statistical physics is by choosing some input-output probability distribution and then averaging the logarithm of the partition function. The reason why this average must be quenched I cannot explain right now, but of course this is a complication. And the complication is that we do not know how to average over arbitrary probability distributions; we can average only over simple ones. So one escape strategy is to introduce more complex models of data, such as the hidden manifold model that Sebastian introduced with his colleagues, or another strategy would be to consider the perceptual manifolds introduced by Sompolinsky; but, you know, you are very limited in the kind of data you can average over. But there is, let me say, a class of results on deep neural networks for which we do not need to average over data at all. The first important class of results in this spirit is the so-called infinite-width limit, where you basically ask that the size of the hidden layers is much larger than the size of your training set, OK? Well, in this limit you empirically observe that the weights move very little with respect to their initialization. And this ended up in beautiful formal proofs by mathematicians that infinite-width deep neural networks are equivalent to Gaussian processes. And so all you need to know is a nonlinear kernel, which is the place where you find the details of the architecture, OK? In particular, the feed-forward structure of the architecture appears in a recurrence relation that relates the kernel at layer l to the kernel at layer l minus 1, and the activation function enters in the integral that defines the kernel, OK? And here perhaps it is important to stress that in the infinite-width limit there is a difference between Bayesian learning and gradient descent, and in particular you find two different kernels: one is the NNGP kernel, the other is the neural tangent kernel. I can say more about this later, if you like, but that's the story more or less. Once you have this result, of course, you can come back to statistical physics and say: OK, I can now study data-averaged partition functions. Well, this is very nicely done in a paper by Cengiz Pehlevan at Harvard, and that paper is basically a generalization of a very old paper by Dietrich, Opper and Sompolinsky, where these authors were studying support vector machines with polynomial kernels. OK, so you can do that.
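For reference, the two objects invoked above can be written as follows (the precise conventions, the $\sigma_w,\sigma_b$ scalings and the activation $\varphi$, are illustrative assumptions rather than the speaker's exact formulas):

$$
Z(\mathcal D)=\int d\theta\,P(\theta)\,e^{-\beta\,\mathcal L(\theta;\mathcal D)},\qquad
f=-\frac{1}{\beta}\,\overline{\log Z(\mathcal D)},
$$

where the overline is the quenched average over the data distribution, and

$$
K^{(l)}(x,x')=\sigma_b^2+\sigma_w^2\;\mathbb E_{(u,v)\sim\mathcal N\left(0,\;K^{(l-1)}\big|_{\{x,x'\}}\right)}\!\big[\varphi(u)\,\varphi(v)\big],
$$

which is the standard NNGP recurrence relating the kernel at layer $l$ to the kernel at layer $l-1$; the NTK of gradient-descent training obeys a related but different recursion.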
Now, let's move to the second class of results. This is a very recent work by Li and Sompolinsky, where they consider deep linear networks: we remove all the activation functions from the network. In this way the problem is trivial from the viewpoint of expressivity, but it still retains some non-convex features, and we can try to investigate it. And this is exactly what these authors did, at the level of the partition function. The basic idea is that you can integrate the weights backwards through the network, starting from the output layer, and this is what they call back-propagating kernel renormalization. Now, if you go into the details, you can show that when you do this there is a price to pay, and the price to pay is to insert new variables, OK, that they call u_l. And you can show that in the thermodynamic limit, where you send P and the sizes of the layers to infinity keeping their ratios fixed, you can determine these u_l self-consistently. Now, if you consider the isotropic limit, where the aspect ratio of the network is the same for each layer, then what you find is that you can easily compute the generalization error, which gives you the standard bias-variance decomposition, and you have self-consistent equations for the parameter u_0 that gives you the test error on a new example. As you can see, nothing is averaged over data: this is a fixed-data result, OK? Then they do the numerics, and the numerics look very good. But since they are very smart, they do something crazier. The idea is that, as you can see, this expression depends only on the linear kernel, OK? So the idea of Li and Sompolinsky is the following: since under scalar multiplication the ReLU kernel behaves exactly like the linear kernel, you can, with some hand-waving of course, as they do, just replace in those formulas the linear kernel with the ReLU NNGP kernel, OK? And then you can see whether this theory is predictive or not at the level of the generalization error. And what you observe is that if you consider a one-hidden-layer architecture, you can plot the test error as a function of the size of the layer, and the theory is very predictive. Whereas if you move to more layers, well, you still see a good agreement as long as the number of layers is not too large, but then the theory starts to fail. Well, it's not surprising: it's a heuristic theory derived with hand-waving arguments. The hand-waving arguments are nice, but we are missing something, of course. So, let's say, I think these results are very fascinating, and for the rest of the talk I would like to understand a bit more how to rationalize them. The two ingredients I will keep are, first, this proportional, or thermodynamic, limit, where I send the size of the training set P to infinity together with the sizes of the layers, keeping the ratios constant; and second, that I will not average over data, but I will consider the data fixed, a fixed instance of the training set. I think the only thing you need, well, no, in a sense, no, and I will come back to this later, maybe when I show the numerical results. Of course, you can assume that your test set is drawn from the same probability distribution as the training set, but actually I don't need to assume that, OK; I will show you an example later and come back to this point. Well, of course we start with some terminology. You define a fully connected deep neural network in a recursive way, defining the preactivations at layer l as a function of the preactivations at layer l minus 1. We add one readout layer, which is this guy here, just one neuron in the last layer. And then we consider regression problems with a quadratic loss function and a regularization, which is not really a regularization: since I am dividing it by beta, it should be thought of more as a Gaussian prior.
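A minimal way to write down the model just described (the normalizations and symbols here are illustrative choices, not necessarily the ones on the slides):

$$
h^{(1)}_i(x)=\frac{1}{\sqrt{N_0}}\sum_{j=1}^{N_0} W^{(0)}_{ij}x_j,\qquad
h^{(l+1)}_i(x)=\frac{1}{\sqrt{N_l}}\sum_{j=1}^{N_l} W^{(l)}_{ij}\,\varphi\big(h^{(l)}_j(x)\big),\qquad
f(x)=\frac{1}{\sqrt{N_L}}\sum_{j=1}^{N_L} v_j\,\varphi\big(h^{(L)}_j(x)\big),
$$

$$
\mathcal L=\frac{1}{2}\sum_{\mu=1}^{P}\big(f(x^\mu)-y^\mu\big)^2+\frac{1}{2\beta}\sum_l\lambda_l\,\|W^{(l)}\|^2,
\qquad
e^{-\beta\mathcal L}\;\propto\;e^{-\frac{\beta}{2}\sum_\mu(f(x^\mu)-y^\mu)^2}\;\prod_l e^{-\frac{\lambda_l}{2}\|W^{(l)}\|^2},
$$

so dividing the L2 term by $\beta$ indeed turns it into a temperature-independent Gaussian prior over the weights rather than a true regularizer.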
OK, and then I define my partition function in terms of this learning problem. Of course, the partition function is linked to the posterior distribution of the weights after training, and as I told you, the regularization should be thought of as a Gaussian prior on the weights. And of course, in the maximalist program you would like to compute this partition function and evaluate observables. The main observable we are interested in, but not the only one, is the generalization error on a new example, x_0, y_0, that you have never seen during training. Now I am going to be very ambitious, because I would like to tell you something about the calculation. The partition function is a complex object; it is this, let's say, almost horrible formula. And the crucial idea, which is not mine, of course, but is also employed in the proofs of the infinite-width limit, is that you should not think of this object in parameter space; you should work in function space. So what I want to do is to integrate over the weights of the network, but I cannot do it for free. So what I do is add delta functions. Basically, this is a change of variables where I introduce the output function and the preactivation functions. Of course, I have one output variable for each pattern and one preactivation for each neuron and each pattern: in the last layer this is the variable s^mu, and in the hidden layers you have one preactivation per neuron per pattern. OK, you very easily realize that if the prior over the weights is Gaussian, then I can perform these Gaussian integrals over the weights, and I come up with an alternative representation of the partition function in terms of these output and preactivation functions. OK, now I can massage my partition function a bit and realize that the only tricky point is to identify this probability distribution in green, which looks very tempting to treat as a sort of distribution to which I can apply a central limit theorem. But unfortunately this distribution p_1 is not independent over the inputs, OK, and so you cannot really do that. But it turns out that there is a nice mathematical result, the Breuer-Major theorem, which tells you that there actually exists a generalized central limit theorem, in distribution, for this kind of observable. OK, and of course, by no means should what I am going to say be thought of as a mathematical proof. My reasoning here is that I want to invoke a Gaussian equivalence based on this Breuer-Major theorem, and I leave it to the mathematicians to decide whether this is correct or not. What I can tell you is that when I do that, I end up with something that I can actually work out analytically. OK, so I invoke this Gaussian equivalence and I find that this distribution is then Gaussian, and the only thing I have to determine is its variance. And I find that this variance is a quadratic function of the dual variables of the output variables, and the kernel that appears is exactly the NNGP kernel. OK. And now, when you do this, it is really a matter of algebra to show that if the loss function is quadratic, you end up with an approximation showing that the partition function in this limit satisfies a large deviation principle. OK. In particular, it becomes an integral over two variables that we call q and q-bar, with a sort of effective action: free energy, variational free energy, you can call it in many different ways.
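Schematically, and with notation invented here for illustration rather than copied from the slides (only $s^\mu$, $q$, $\bar q$ follow the speaker's naming), the chain of steps just described is:

$$
Z=\int dW\,P(W)\,e^{-\beta\mathcal L(f_W)}
=\int\prod_{\mu=1}^{P} ds^\mu\; e^{-\beta\mathcal L(s)}\;
\underbrace{\int dW\,P(W)\prod_{\mu}\delta\big(s^\mu-f_W(x^\mu)\big)}_{\text{prior over the outputs}} .
$$

Writing the deltas in Fourier form introduces conjugate variables $\bar s^\mu$; the Gaussian prior over the weights can then be integrated out exactly, and the Gaussian-equivalence step replaces the intractable dependence on the $\bar s^\mu$ by a quadratic form whose kernel is the NNGP kernel. For a quadratic loss one is left, schematically, with

$$
Z \;\asymp\; \int dq\, d\bar q\; e^{-S(q,\bar q)} .
$$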
And of course, if you can argue that this effective action is intensive, I can employ the saddle-point method to solve the system. OK. At this point, and probably many in the audience are interested in this, I didn't tell you anything about the relation between the size of the data set and the size of the input, and I know that Jean and Francesco, of course, are interested in this. But let's say we can talk about this later, if you like. You know, it's very easy, at least at the physics level, to show that the last two terms of this equation, which is where the data set enters, are intensive; I leave this to you as an exercise. And once you have this kind of result, you can also compute the generalization error, and you find a formula for it. I will move very soon to the numerical experiments, but first let me say something just for the mathematicians in the audience. While doing this calculation you basically realize that if you send n to infinity and keep P fixed, you of course recover the infinite-width limit, OK? In the sense that you show that these variables s-bar, which basically define the characteristic function of the prior over the outputs, are Gaussian, OK? But now, if you send P to infinity, you cannot any longer Taylor expand that, and what you find is a distribution that is well known in the literature: it is a multivariate Student-t distribution. So what we are basically claiming is that there is a link between finite-width one-hidden-layer neural networks and Student-t stochastic processes, which are quite well studied in the literature on Bayesian inference, OK? But we can say something more, because one can realize that what Sompolinsky is doing in the deep linear case is essentially also to find a relation with Student-t processes, OK? This will be useful when I say something about the L-layer case. But now let me move to the numerical experiments. Of course, now we can ask: is our theory correct? Is it capturing something in the learning curves of one-hidden-layer architectures? And what do we find? Well, on the y-axis there is the test loss, and on the x-axis there is the size of the hidden layer. We train our neural networks with discretized Langevin dynamics (a minimal sketch of such a sampler is given below), which, up to the discretization of the learning rate, is almost sampling from the Boltzmann distribution, provided you wait long enough to claim thermalization. And as you can see, there is quite a nice agreement between theory and experiments. And let me stress this: these are real data, OK? I am taking images from the MNIST data set and images from the CIFAR-10 data set, and I am making a prediction about the generalization error of one-hidden-layer architectures, OK? But now, as you can see, at least in the regime we investigated, where the smallest width is n_1 = 50, we observe monotonically increasing or decreasing learning curves.
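As an aside on the training procedure just mentioned: here is a minimal sketch of a discretized Langevin sampler for a toy one-hidden-layer regression network. The function name, the tanh nonlinearity, the step size and the prior strength are illustrative assumptions, not the actual code used for the experiments.

```python
import numpy as np

def langevin_sample(X, y, n1=50, beta=1.0, lam=1.0, eta=1e-4, steps=200_000, seed=0):
    """Discretized Langevin dynamics on a toy one-hidden-layer network.

    After thermalization this approximately samples the Gibbs posterior
    p(W, v) ~ exp(-beta * 0.5 * sum_mu (f(x_mu) - y_mu)^2 - 0.5 * lam * ||theta||^2),
    via the Euler-Maruyama step
        theta <- theta - eta * grad U(theta) + sqrt(2 * eta) * xi,  xi ~ N(0, I).
    """
    rng = np.random.default_rng(seed)
    n0 = X.shape[1]
    W = rng.standard_normal((n1, n0)) / np.sqrt(n0)   # hidden-layer weights
    v = rng.standard_normal(n1) / np.sqrt(n1)         # readout weights

    for _ in range(steps):
        h = np.tanh(X @ W.T)                          # hidden activations, shape (P, n1)
        f = h @ v / np.sqrt(n1)                       # network outputs, shape (P,)
        err = f - y
        grad_v = beta * (h.T @ err) / np.sqrt(n1) + lam * v
        grad_W = beta * ((np.outer(err, v / np.sqrt(n1)) * (1 - h**2)).T @ X) + lam * W
        v += -eta * grad_v + np.sqrt(2 * eta) * rng.standard_normal(v.shape)
        W += -eta * grad_W + np.sqrt(2 * eta) * rng.standard_normal(W.shape)
    return W, v
```

In practice one discards a long burn-in and averages the test loss over well-separated configurations; the long autocorrelation times mentioned later in the discussion are exactly what makes this expensive.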
Now, if you go to the analytical calculation, you realize that there exists an analytical criterion that tells you whether these curves will be monotonically increasing or decreasing, and it is just a function of the data set and of the infinite-width kernel. And this is actually another connection with Student-t, because you can realize that this criterion also appears in Bayesian inference with Student-t processes; this is shown in a recent paper by Tracey and Wolpert on Bayesian inference with Student-t processes. Now I can say something more general, OK? Let's say that I don't just want to make a prediction about the generalization error; I would like to say something more general. It turns out that when you restrict yourself to zero temperature, you can find from the formula for the generalization error that the bias is constant as a function of n_1, the size of the hidden layer, and also constant as a function of the Gaussian prior of the last layer. The second point is that the variance instead depends on the size of the layer, and it goes to zero; you can check that in some cases it goes as 1 over the square root of lambda_1, the Gaussian prior of the last layer. This has two physical consequences for numerical experiments. The first one is that, in this thermodynamic limit that I'm studying, increasing the magnitude of the last-layer Gaussian prior should lead to better generalization at any n_1. And the second one is that for large values of lambda_1 the dependence of the learning curve on the size of the layer should disappear. And this is the plot where we try to show this. These are the test-loss performances at three different values of lambda_1, which is the inverse variance of the last-layer Gaussian prior. And as you can see, larger lambda_1 helps the generalization performance, and it also seems that, up to reasonably small values of n_1, the generalization performance is almost independent of n_1. And these are numerical experiments where we waited for thermalization of the discretized Langevin dynamics. So now we can exploit this idea, that one-hidden-layer neural networks and deep linear networks are related to Student-t processes, to somewhat bootstrap a result for the L-layer case, OK? And the way we do this is by guessing, in a more or less educated way, what the probability distribution of the preactivations at layer l will be in terms of the preactivations at layer l minus 1. I should note that this approach is different from what Sompolinsky does, OK? Because Sompolinsky integrates backwards, but that is not really consistent if you think of the proofs made in the infinite-width case: those proofs are made by an induction principle that starts from the first layer; I'm thinking, for instance, of the paper by the Google Brain guys where they prove the infinite-width limit. And when we do this, we realize that we find, again, a recurrence relation, but for a renormalized kernel, where new parameters q-bar_l appear inside the recurrence relation (sketched schematically below). And you should think that these q-bar_l are self-consistently determined by the minimization of the effective action for the L-layer system. OK, this is more or less the idea. I'm going very quickly through this, but of course I'm available to discuss it offline.
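Very schematically, and only to illustrate the structure being described (the precise prefactors and the exact form of the effective action are in the paper; the symbols here are an illustrative paraphrase), the renormalized recursion looks like the NNGP recursion with one scalar order parameter per layer:

$$
K_R^{(l+1)}(x,x')\;\propto\;\bar q_{l+1}\;
\mathbb E_{(u,v)\sim\mathcal N\left(0,\;K_R^{(l)}\big|_{\{x,x'\}}\right)}\!\big[\varphi(u)\,\varphi(v)\big],
\qquad
\{\bar q_l\}=\arg\operatorname{ext}\; S_L(\{\bar q_l\}),
$$

so the data-dependent kernel is built forward, layer by layer, as in the infinite-width proofs, with the scalars $\bar q_l$ fixed by extremizing the L-layer effective action; when the widths are taken to infinity at fixed P one expects the $\bar q_l$ to become trivial and the standard NNGP recursion to be recovered.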
Well, now maybe I can skip this, but you easily realize, doing this calculation, that the results I showed are results for odd activation functions, let's say. ReLU is not an odd activation function, and Sompolinsky's heuristic theory is obtained by replacing the linear function, which is an odd activation function, with a non-odd one. Well, you can see that the theory is slightly different if you consider a non-odd activation function, and you can compute explicitly the corrections to the theory. This is not true in the infinite-width limit; it is only a marker of finite width. In the infinite-width limit, odd versus non-odd makes no difference between the two theories. And, well, this is how one gets the more general result. Let me show you one last slide with a preliminary verification of the theory at L layers. What you can basically derive in the L-layer case is the very same criterion that establishes whether the finite-width network outperforms the infinite-width limit or not, at least in the case of ReLU activation functions. And here there are numerical experiments close to the infinite-width limit, where we can see that up to two layers, where this quantity is less than P, where the criterion is satisfied, finite-width networks outperform the infinite-width limit; otherwise, it's the opposite. And, well, I'm skipping this, but I want to tell you something more about the reason for this criterion. Well, if you take a look at the infinite-width kernel for ReLU activations, you realize that this kernel, as L grows, develops almost singular eigenvalues, and this already happens for L equal to 4, more or less (a small numerical check is sketched below). So this is the reason: since in the criterion you see K to the minus 1, this seems a somewhat unpleasant result, OK. But of course it is not really an unpleasant result, because I'm focusing only on ReLU networks. And we know that, well, here I should say more, but basically the recurrence relation for the infinite-width kernel is described by a discrete dynamical system, OK, and what I'm saying is that this discrete dynamical system for the ReLU case converges to a fixed point. But this is not the only possibility for dynamical systems: you can have limit cycles, you can have chaos. This is something that was very well investigated in a series of influential papers by Ganguli and collaborators, and there is this edge-of-chaos conjecture on why you should stay very close to the edge of chaos, for information-theoretic reasons. But I'm not going to say anything else on this.
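Coming back to the eigenvalue collapse just mentioned: here is a minimal sketch that iterates the standard arc-cosine (ReLU) NNGP recursion on random Gaussian inputs and prints the extreme eigenvalues of the kernel at each depth. The normalization (weight variance 2, no bias term) and the toy data are assumptions made for the example, not the setup of the actual experiments.

```python
import numpy as np

def relu_nngp_kernel(X, depth, sigma_w2=2.0):
    """Iterate the arc-cosine (ReLU) NNGP recursion for `depth` hidden layers.

    K^{(0)} is the input Gram matrix; each step applies
        K^{(l+1)} = sigma_w2 * E[ relu(u) relu(v) ],  (u, v) ~ N(0, K^{(l)}),
    using the closed form sqrt(k_xx k_yy) * (sin t + (pi - t) cos t) / (2 pi),
    with cos t the correlation coefficient. sigma_w2 = 2 keeps the diagonal fixed.
    """
    K = X @ X.T / X.shape[1]
    for _ in range(depth):
        d = np.sqrt(np.diag(K))
        c = np.clip(K / np.outer(d, d), -1.0, 1.0)    # correlation matrix
        t = np.arccos(c)
        K = sigma_w2 * np.outer(d, d) * (np.sin(t) + (np.pi - t) * np.cos(t)) / (2 * np.pi)
    return K

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 784))                   # P = 200 random "images", n0 = 784
for L in range(1, 7):
    eig = np.linalg.eigvalsh(relu_nngp_kernel(X, depth=L))
    print(f"L = {L}: smallest eigenvalue ~ {eig[0]:.3e}, largest ~ {eig[-1]:.3e}")
```

In such a toy run the smallest eigenvalues shrink rapidly with depth as the kernel drifts towards a nearly degenerate matrix, which is the mechanism behind the sensitivity to K to the minus 1 in the criterion above.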
And so, in conclusion, we proposed with my collaborators an approach to investigate in an effective way the statistical physics of neural networks with fully connected hidden layers, and to move beyond the infinite-width limit. I think, at least to me, if I were a mathematician, this connection between finite-width deep neural networks and Student-t processes would seem something very worth investigating. And I discussed this already once with a very good mathematician, Boris Hanin, who actually, I forgot to mention, wrote a paper on deep linear networks where he computes the same partition function as Sompolinsky, but not in the thermodynamic limit. It's a crazy paper where all the results are given in terms of Meijer G-functions. Very nice paper, by the way. And, well, of course, there are many open questions. One is to understand the similarities and differences with a very recent work of Seroussi and Ringel, and this is something we are working out with the authors of that paper. Another thing that, at least to me, would be interesting is to understand whether I can write such an effective theory for gradient descent dynamics. But probably the most interesting part would be to understand convolutions and feature learning. And, related to what I was telling you in the last slide, maybe to understand more precisely the relationship between generalization performance at finite width and this edge-of-chaos conjecture. And also, there's a paper on the edge-of-chaos side where Yang and a colleague investigate the role of skip connections, and I think that could be a nice first step for this. And then there is the last slide, the most pleasant one, where I thank my collaborators: Sebastiano, Rosalba and Riccardo, who are in the audience; and then Mauro, who is at ENS, and Francesco, Marco, Alessandro and Raffaella, who are part of my new group in Parma, where I moved recently. Thank you all for listening.

[Question from the audience.] No, I couldn't understand precisely the work of Yoshino; I probably read it long ago, but I forgot what it does. I know a bit more precisely the work by Hugo, Lenka and Florent, where they are in the Bayes-optimal setting, and I'm studying that paper, but it is also quite recent. So let's say I cannot really tell you much about comparisons between the fixed-data and the data-averaged case. Well, maybe there is something I can tell you very quickly. When you have this partition function, with its effective action, since you have this 1 over P trace-log, you expect some self-averaging. And if you consider Gaussian data here, you are also allowed to start Taylor expanding this kernel, and what you basically realize is that you find an effective action for a linear kernel, at least in the regime where P is scaling with n_0, n_0 being the dimension of the input. And in this case this term is very easy, because it basically converges. Well, I can write it very quickly. Is there a chalk? Ah, here. Well, you expect that this 1 over P trace-log of some matrix K, which I am approximating as something like a multiple of the identity plus a correction, converges, in the limit of P and n_0 large at P over n_0 fixed, to an integral over the Marchenko-Pastur eigenvalue distribution: schematically, $\frac{1}{P}\operatorname{Tr}\log(\gamma\,\mathbb{1}+K)\;\to\;\int d\lambda\,\rho_{\mathrm{MP}}(\lambda)\,\log(\gamma+\lambda)$, something like that. So now, if you choose a teacher here, OK, you can exploit some random matrix theory, at least in the linear case. And so this is a way to avoid replicas, if you like: you exchange replicas for random matrix theory, but you have to know random matrix theory. So, good luck. Maybe I didn't reply to Francesco; I didn't clarify the point you asked me about, whether I have to make some assumptions on, OK, well. Yeah, especially on the number of points that you have, on the amount of data that you have. The amount? On the amount of data that you have. OK, let's say, as you can see in the, I don't know, wait a second: we moved up to P equal to 1,000. It was very difficult to do this, at least to obtain quantitative agreement. And the reason is that the larger P is, the longer the sampler takes; already here it takes 5 million epochs, and the autocorrelation times are huge. So you cannot take one sample to average at every step. I mean, this is because we are using a stupid algorithm, discretized Langevin dynamics, but I know that in the audience Giovanni had a poster, a nice poster, on a Bayesian sampler that is much more efficient, so, you know.
But I can tell you something: even when you are not able to thermalize, at least in many cases we observed that the qualitative form of the curve is the same, and maybe our theory is a bit below or above; it really depends on the case. If you want to see this precise agreement between theory and experiments, well, you have to wait for very long times. This is something that the other group working on this, Zohar Ringel's group in Jerusalem, is also experiencing, and we laugh a lot about how difficult these experiments are. [Inaudible remark from the audience.] Exactly. But to reply to Francesco: of course, here we are using as test examples other examples from CIFAR, planes, OK. But now imagine you use a picture of my grandmother: you can still compute the theory, and, you know, the only thing is that of course the generalization performance would be very bad, except in the case where my grandmother looks like this and resembles a plane. So we can do more or less whatever we like, also investigating out-of-distribution problems. Yeah, and as for how to do this: we didn't do that, but for the same reason I told you, well, I made this little heuristic argument to derive this, and I expect self-averaging in a sense. But no, no, of course, it's everything but trivial, I think. Yes, and sorry for not mentioning this; I know that for you, and not only for you, this is an important point. I think that, well, the reasoning we employ in the paper is that we invoke this Gaussian equivalence and then we look at the hypotheses of the Breuer-Major theorem. Then you realize that the proof of the Breuer-Major theorem requires introducing the concept of Hermite rank, and, at least looking at the hypotheses of the theorem, we can say that if the activation function has Hermite rank r, then P should scale at most as n_0 to the r over 2, at most. Of course, this is telling you that for many activation functions with Hermite rank one, which is the case, for instance, of ReLU and erf, the ones I showed you in the experiments, P should scale at most as n_0 to the one half. This is not particularly nice, OK? But then there is at least one case, the case of a linear-plus-quadratic activation function, where we can provide an argument for this result without using the Breuer-Major theorem. In that case, at least for Gaussian data, we see that the derivation works in the proportional regime, at least for that activation function. Of course, we also made some numerical experiments reducing the sizes or not, and let's say I think this would more or less work; but since for many of you, and I understand now, this is a very important point, I think we will conduct more detailed numerical experiments to understand it in detail. Thank you.
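For reference, the condition just stated can be written compactly; the notation is an illustrative paraphrase, with $\mathrm{He}_k$ the probabilists' Hermite polynomials:

$$
r=\min\big\{k\ge 1:\ \mathbb E_{z\sim\mathcal N(0,1)}\big[\varphi(z)\,\mathrm{He}_k(z)\big]\neq 0\big\},
\qquad
P\;\lesssim\; n_0^{\,r/2},
$$

so Hermite-rank-one activations such as ReLU or erf would give P at most of order $\sqrt{n_0}$ under this reading, which is exactly the caveat mentioned above.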
Actually, this is an entirely different topic. You showed a picture with the kernel singular values at different depths. So, you're using fully connected networks? Yes, of course. So, that's the infinite-width kernel. I don't know if I want to speak about convolutions; I already spoke with you about convolutions. No, no, I don't want to speak about convolutions. I am just puzzled, because the infinite-width kernel of a fully connected network should not depend on the, well, I don't know, it should not depend on the width. I know that it has basically the same reproducing kernel Hilbert space independently of the depth. So, do you understand why you see all these differences while changing the depth, or do you think it's just a matter of initialization or scaling, something like this? When we do these analyses, we are just using the well-known analytical expressions for the ReLU kernel, for instance. I don't remember it precisely, but of course it's a composition of functions; I remember an arcsine, a square root of one minus something. I mean, yes, we are using the NNGP kernel because we are considering the Bayesian case, OK, so it doesn't make too much sense to employ the NTK; that's the point. Of course, in this kernel there is a dependence on L, and you can actually see this dependence, for instance, in the eigenvalues: the eigenvalues of the kernel, once L is larger than five, are in this case very close to zero for ReLU. Of course, yes, I mean, this is not my discovery, this is something very well known. The scaling of the eigenvalues with, let's say, the wave vector (I know it's not really a wave vector in the end, because maybe we choose a different distribution or whatnot): well, maybe the scaling is OK, but then the magnitude, you are reducing the magnitude at each step. You see my point? Yeah, yeah, it's probably just the magnitude. And so, if you go to the infinite-width limit, you can clearly see that you can actually also improve generalization by increasing L in the infinite-width limit; but once you have this finite-width effect that depends on the inverse kernel, you are a bit screwed, because sooner or later, at least for the ReLU kernel, you will find these almost-zero eigenvalues. So, Pietro, you have a Student-t distribution on the conjugate variables? Of course. I didn't mention it, and this is why I say a link: I cannot compute the Fourier transform. I mean, to be a bit more precise, the characteristic function of the prior over the outputs is a Student-t distribution, if you like. But did you look, just visually, at the distribution, or at the moments? It's not really possible, because if you think about it for a while, the only reason why this distribution is not Gaussian is that I'm considering an extensive number of training inputs, OK? So it's very difficult to find an observable. Actually, there is one, the generalization error, OK? But if you want a probe of this Student-t distribution, and you want to compute only observables evaluated over the training set, I think the only thing you can do is to consider a correlation function with an extensive number of variables. This is my feeling, at least. Maybe you can do something smarter, but I don't know how to do it. OK, we have time maybe for one last question. Otherwise, I think we're going to move these questions to the coffee break. So, thanks again very much, Pietro and Francesco, for these nice talks.