So we're very fortunate to have Francesco Camilli. We're fortunate to have him here in PSP, fortunate to have him here for this talk. He's going to talk about fundamental limits of shallow neural networks with not-so-small training sets, and I think also some Gaussian equivalences. It's going to be a cool topic; looking forward to it. Take it away.

Thanks, thanks for the great opportunity. I have the honor, privilege and responsibility of being the last speaker of the week. I'm very happy to be here; it's been a great week and I've learned plenty of stuff. Thanks also to the organizers — I would like to ask for a round of applause for the organizers. Okay, so let's get to it. As Sebastien was pointing out, I cheated a little bit, because I changed the title just with this round parenthesis: "not so small dataset". We will understand why. This is joint work with Jean Barbier, who is around here, and Daria Tieplova. It is not on arXiv yet, but it will be — we submit it tonight, so hopefully it will appear in the next few days.

Okay, a quick outline of how the talk goes. First I will introduce the problem and state it as precisely as possible, and then I will state the main theorem. This should take the first 20 minutes of the talk. Since it's Friday, after that you could also turn your brain off, more or less, because what follows is a set of comparisons with other scalings that are already present in the literature — it is of course a well-studied problem, not in exactly the same scalings as ours, but a comparison was needed, also to understand what's new about our result. Lastly, I will sketch the proof a little bit in the last four or five slides.

All right, so let's start. The model we are studying is a two-layer neural network with input dimension D, so the input layer has D neurons. We have one hidden layer of size P, and a matrix of weights W multiplying the input X, which is necessarily of size P by D. We have an activation function in the middle layer, which we will assume to be regular enough; it maps R to R, and you should think of it as applied component-wise to the vector WX, which is P-dimensional. Then we have a projection onto a one-dimensional space — or low-dimensional, but here for simplicity one-dimensional — through a readout vector A. Finally, we apply a readout function F that can also be stochastic; that's why I added the capital A as a subscript, which can be a random variable introduced to model some stochasticity. We'll see an example later.

All right, the setting we are interested in is supervised learning. We want to understand what happens when we train this two-layer neural network on a given dataset, a set of couples (X_mu, Y_mu), where X_mu is D-dimensional and is fed into the network on this side — your left, maybe — and Y_mu is the label we tell the network it should produce. So we are trying to fit these couples (X_mu, Y_mu) by adjusting the weights A and W in such a way that these input-output relationships are verified up to a certain error, a certain tolerance that we introduce. There are several training procedures, of which I'm not an expert, so I will fly quickly over that.
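In formulas, the student network and the dataset just described look roughly as follows; the 1/sqrt(P) and 1/sqrt(D) normalizations are assumptions on my part, not taken from the slides:

```latex
% Student network: input x in R^D, hidden weights W in R^{P x D}, readout a in R^P,
% activation phi acting entrywise, readout function f_A possibly stochastic.
\hat{y}(x; a, W) \;=\; f_A\!\Big( \tfrac{1}{\sqrt{P}}\, a^{\top} \varphi\!\big( \tfrac{W x}{\sqrt{D}} \big) \Big),

% Training data: N input-label couples to be fitted by adjusting (a, W):
\mathcal{D}_N \;=\; \big\{ (x^{\mu}, y^{\mu}) \big\}_{\mu = 1}^{N}, \qquad x^{\mu} \in \mathbb{R}^{D}.
```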
The main goal, in general, is to produce the smallest possible generalization error, which is the true test for a neural network. When I generate a new couple from a rule that I know — an X_mu with its correct label Y_mu — what happens if I compare the correct label with the output that the trained neural network would yield? We compute this quadratic deviation, and this is our measure of generalization error. [Audience: shouldn't there be an expectation?] Yeah, that's right, that should be an expectation, in fact; you're right, it will appear later — I don't know why I didn't put it here, honestly. And if the readout is not stochastic, the definition still makes sense as written.

Okay, so what affects this generalization error? Many things; I tried to mention the ones most relevant or most interesting for us. The first is, of course, the size of the training set: we expect that the more of these couples we have, the more accurate the prediction will be. Then there is the size of the network itself, which regulates its expressive power; since we have only a two-layer neural network, this is parameterized by P, the size of the hidden layer. Then the dimensionality of the input, which is another factor we should take into account, because it quantifies how many real numbers we feed into the network each time, and it also affects the number of weights we have to tune. Then the method used to train the network, and, maybe most importantly, the nature of the dataset — i.e., what is the true underlying function that this two-layer neural network is trying to fit? There are of course many more, but the central question for us in our paper is: what is the least possible generalization error, and when is it achieved? Not how, really, because we are not focusing on algorithmic questions about SGD or training procedures, but on the setting in which this error is achieved. We will see that our results put the highlighted quantities — N, P and D, which govern the number of data points and the relative sizes of the layers in the network — in a very precise relationship. And, of course, the nature of the dataset is important.

Concerning the latter, the nature of the dataset, we work in maybe the most popular theoretical framework — well, not the most popular, but one of the most popular — which is the teacher-student setup. This amounts to saying that the dataset is not generated by God-only-knows-what function: it is generated itself by a teacher neural network with some weights A* and W* that are fixed and drawn from some distribution, which we assume to be Gaussian and completely factorized. We will see later that this is not completely silly, but for the moment I hope you can accept it. So the couples are generated this way: once you state a rule to generate X_mu, then Y_mu is determined by this relation. To be a little more precise about the stochasticity, I've added some Gaussian noise, which has some regularizing properties; it is there for technical reasons, but not only for that. Saying that Y is generated in this way is equivalent to saying that Y is drawn from this P_out, which is sometimes called the output kernel.
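Written out — with the additive form of the noise, its variance Delta, and the normalizations being my labels rather than the slide's — the generalization error and the teacher rule read roughly:

```latex
% Generalization error of a predictor y_hat on a fresh couple (x_new, y_new)
% (the expectation the audience asked about is over the fresh sample and the data):
\epsilon_g \;=\; \mathbb{E}\,\big( y_{\mathrm{new}} - \hat{y}(x_{\mathrm{new}}) \big)^2 .

% Teacher rule: same architecture with fixed Gaussian weights (A*, W*), plus
% additive Gaussian label noise of variance Delta:
y^{\mu} \;=\; f\!\Big( \tfrac{1}{\sqrt{P}}\, A^{*\top} \varphi\!\big( \tfrac{W^{*} x^{\mu}}{\sqrt{D}} \big) \Big) \;+\; \sqrt{\Delta}\, Z^{\mu},
\qquad Z^{\mu} \sim \mathcal{N}(0, 1) \ \text{i.i.d.}

% Equivalently, each label is drawn from an output kernel (channel) P_out:
y^{\mu} \sim P_{\mathrm{out}}\!\Big( \cdot \;\Big|\; \tfrac{1}{\sqrt{P}}\, A^{*\top} \varphi\!\big( \tfrac{W^{*} x^{\mu}}{\sqrt{D}} \big) \Big).
```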
That P_out depends on this quantity, A* transpose phi of W* X, et cetera, which is the same one that enters the stochastic readout F. You can realize this particular mapping between the two definitions using the P_out written below, and the two problems become equivalent. The first formulation is more easily interpretable, because you can read off the two-layer neural network directly; the second is easier to treat analytically and rigorously.

The main theoretical restriction we assume in our work concerns the X_mu's, which I still haven't specified — yes, I haven't said that yet, but I'm going to say it in the next slide. The X_mu's are all i.i.d. over mu, and their components are also i.i.d. One could build a more sophisticated model, introducing a covariance between the components of X_mu, or drawing it from a mixture of Gaussians, et cetera; that is something to be addressed in the future. That's what Bruno was pointing at, I guess.

Before, I asked what is the best possible scenario for the student. Well, this scenario is when the student matches the architecture of the teacher in every sense, which means they use the same f, the same phi, the stochasticity in f is the same and it is known by the student, as well as the prior on the weights. So, basically, a Bayes-optimal student — very informally, to wrap it up — is aware of everything except the true weights; otherwise the problem is trivial. As a consequence, the Bayes-optimal student has access to the Bayes posterior, which is this one here: I have collected in the bold theta the parameters of the network, A and W, and I've put a Gaussian prior on them, so this capital D stands for a completely factorized Gaussian measure. The normalizing constant is usually called the evidence or the partition function — we'll see it later — and what remains, the product over mu of the P_out's, is, up to a factor that I erased (I cheated a little bit), the probability of the dataset given A* = A and W* = W.

Okay, so this is the Bayes posterior. And how idealized is all of this? Maybe you're wondering if I'm trying to trick you in some way — a little bit, but not that much. Why do we place ourselves in this Bayes-optimal setting? The point is that in this setting, which is very nice theoretically speaking, we can compute the least possible generalization error, in a way I'll outline later. This generalization error is yielded by this predictor here, the Bayes-optimal predictor, which is the posterior mean of Y given the dataset and the new X that has just been fed into the network. This is not the single point estimate that you would usually get from the minimization of an empirical risk or with SGD — it's different. It's an average over many models, weighted in an optimal way, and that's why it is supposed to lead — well, it does lead; this is informal, but it can be made formal — to the optimal error. [Audience: so this is a bound for any algorithm that has the dataset D_N at its disposal?] Yes, yes — for this data, it's true: any algorithm can at most hope for this generalization error; it cannot do better.
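In formulas — again with normalizations that are my own guess at the slide's conventions — the Bayes posterior and the Bayes-optimal predictor read:

```latex
% Bayes posterior over theta = (a, W), with D(theta) a fully factorized Gaussian prior:
dP(\theta \mid \mathcal{D}_N) \;=\; \frac{1}{\mathcal{Z}_N}
  \prod_{\mu=1}^{N} P_{\mathrm{out}}\!\Big( y^{\mu} \,\Big|\, \tfrac{1}{\sqrt{P}}\, a^{\top}\varphi\big(\tfrac{W x^{\mu}}{\sqrt{D}}\big) \Big)\, D(\theta).

% Bayes-optimal predictor for a fresh input x_new: the posterior mean of the label,
% i.e. an average over many models weighted by the posterior, not a single point estimate.
\hat{y}_{\mathrm{opt}}(x_{\mathrm{new}}) \;=\; \mathbb{E}\big[\, y_{\mathrm{new}} \,\big|\, \mathcal{D}_N,\, x_{\mathrm{new}} \big]
 \;=\; \int \mathbb{E}\big[\, y_{\mathrm{new}} \mid \theta,\, x_{\mathrm{new}} \big]\; dP(\theta \mid \mathcal{D}_N).
```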
And I was saying: okay, this is idealized, but it still has theoretical relevance for these reasons, and I also wanted to remark that the choice of i.i.d. Gaussian priors is not so silly, because it is coherent with L2-norm regularizers in empirical risk minimization; I will outline that later too.

Before stating the theorem, I need to introduce these quantities — as annoying as such slides can be, they are necessary in almost every presentation I give. The first is the posterior, which you have already seen, and the second is the partition function, which is the normalization of the posterior measure. Then we have the free entropy — or minus the free energy, up to a certain beta that here we have set to 1, for those of you more familiar with the statistical mechanics jargon. And then, more importantly, we have the mutual information. This is the truly relevant information-theoretic quantity, the one that is going to yield the Bayes-optimal limit for the generalization error. As you can see, the mutual information and the free entropy are in a very close relationship: they differ just by this remainder, which still looks like a high-dimensional quantity because it involves A* and W*, but using the central limit theorem and the law of large numbers you can reduce it to a low-dimensional quantity pretty easily.

Of course, this model has a simple ancestor in the literature, which is the GLM, and for our purposes we can picture it as a one-layer neural network with no hidden layer. In the teacher-student setup this was studied by Jean and collaborators, and they were able to establish the information-theoretic limits for this problem. So we actually have a formula for the mutual information, which you can read at the bottom of the slide. It is replica symmetric, in spin-glass jargon, which means that the relevant order parameters of this statistical mechanics problem concentrate — they do not fluctuate anymore — and this yields a variational formula, in which you optimize over a finite number of parameters; the concentration is what produces this kind of structure in the final formula. I denote the quantities coming from the GLM — which, in the teacher-student setup, would produce a different dataset in principle: even if I fed in the same X_mu's, I would get different Y_mu's — with a small circle on top, to distinguish them from the two-layer neural network quantities. For this model I am again assuming something very simple: the teacher GLM weights, the V's here, are i.i.d. Gaussians, and then I inserted another noise term. I decided to write it explicitly, but it could also be reabsorbed into the P_out by a convolution with a Gaussian — it is just a deformation of the P_out, as you can also see from here. So for this model too you can build the free entropy and the mutual information and compute one of the two; once you have one, you also have the other, generally speaking. Is everything clear up to now? If there are any questions, this is a good moment. Okay, good.
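For reference, the quantities just listed can be summarized as follows; the 1/N normalization and the exact form of the remainder are assumptions of this sketch:

```latex
% Partition function (evidence) normalizing the posterior:
\mathcal{Z}_N \;=\; \int \prod_{\mu=1}^{N} P_{\mathrm{out}}\!\big( y^{\mu} \mid \cdots \big)\, D(\theta).

% Free entropy (minus the free energy at beta = 1), averaged over data and teacher:
f_N \;=\; \tfrac{1}{N}\, \mathbb{E} \log \mathcal{Z}_N .

% Mutual information between teacher parameters and the dataset: it differs from -f_N
% only by a remainder involving (A*, W*) which, by CLT / LLN arguments, reduces to a
% low-dimensional quantity.
\tfrac{1}{N}\, I\big(\theta^{*}; \mathcal{D}_N\big) \;=\; -\, f_N \;+\; \text{(remainder)} .
```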
All right. Yeah, I forgot this slide, of course. What is the relevant scaling regime for the GLM? This is a well-studied problem, and it is known that we can observe phase transitions — or, if you want, a non-trivial generalization error, one that is neither identically zero nor stuck at its trivial value for any number of samples. The non-trivial regime is when alpha, the ratio N over D — number of samples divided by the input dimension — is of order one. In this case you get a non-trivial generalization error, as in this plot. By the way, I should have cited the paper; the plot is taken from Jean's paper. This was for Gauss-Bernoulli teacher weights, if I remember correctly, but for purely Gaussian weights it should look the same. You see that as soon as alpha increases, meaning I get more and more samples relative to the input dimension, the generalization error gets better and better, and it goes to zero as alpha goes to infinity.

All right, main result number one. What time is it? Yes, 20 minutes. Main result number one. Recall these definitions: we have the model, we have the teacher — this is the two-layer neural network teacher generating the labels for the X_mu's. And on the right — left, no, your right, sorry — you have the labels that a GLM teacher would instead generate. So there are also two different datasets, call them D and D° if you want. Our theorem, proved in collaboration with Jean and Dasha, states the following. Tune rho and epsilon appropriately, which means rho equals the expectation of phi prime — the first derivative of the activation function of the hidden layer — and epsilon is this particular combination shown here; assume phi is odd and regular enough, and f is regular enough too, P_A-almost surely. Then we can control the difference between the free entropy associated to the learning problem in the two-layer neural network and the one associated to the learning problem in the GLM, with this order of error. It is ugly — not the best-looking remainder you've ever seen, of course — but it already presents some interesting features, and I'm going to get to them in the next slide. Don't be scared, because it will appear in every slide from this moment on.

This control is inherited by the mutual informations of the two problems because, if you remember well, the free entropy differs from the mutual information by just one term, and in both cases that term concentrates — converges — around the same value at a speed we can control, and this does not affect the control we have on the mutual information. So this order keeps appearing again and again, and it is therefore appropriate — and this is what we did — to identify the scaling, this lim-tilde, in which the two match. It is true that we let N, P and D go to infinity, but in order for the two mutual informations, or free entropies, to match, we need to do it very carefully, and the sequences we choose must satisfy this limit. You see from this scaling that N over D of order one, which is the interesting regime for the GLM, is still allowed, thanks to this N over D to the three halves. But there is this other term, in red: N over P.
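Schematically — this is my paraphrase of the statement, keeping only the two terms discussed in the talk, not the exact remainder of the theorem — the control and the lim-tilde scaling look like:

```latex
% Schematic form of the control proved in the theorem (the actual remainder is more detailed):
\big| f_N - \mathring{f}_N \big| \;\lesssim\; \frac{N}{D^{3/2}} \;+\; \frac{N}{P} \;+\; \cdots

% The "lim tilde" scaling: N, P, D -> infinity in such a way that both terms vanish.
% Note that alpha = N/D = O(1) is still allowed, since then N/D^{3/2} = alpha/sqrt(D) -> 0,
% but N/P must also go to zero, i.e. P must grow much faster than N.
\widetilde{\lim}: \qquad N, P, D \to \infty, \qquad \frac{N}{P} \to 0, \qquad \frac{N}{D^{3/2}} \to 0 .
```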
It needs to go to zero, which means that, if we want to make the two models collapse, we are forced to take P significantly larger than the number of samples. So the size of the hidden layer apparently plays a crucial role in the identification of these two models: it has to be very large compared to the number of samples the network has at its disposal to learn. As a consequence, again in the same scaling regime — [Audience: P is the number of hidden units?] Yes. [Audience question about P.] Not really — I have an extra layer in the middle, yes. Yeah, true; I think it comes from the analysis, it cannot be fundamental, I agree with you. Have I answered your question? Okay, yes. [Audience: you mean Gaussian equivalence principles?] I mean, there is something Gaussian going on, yes — that is the phenomenon at play. It is likely different, though; I'm going to compare with that one later. So, yes, N over D of order one is still allowed, as I said. Ah, Marco, by the way, the reason why only this N over P appears might be that we were interested in taking the limit, so we discarded some sub-leading orders that might be floating around; I will check that later. In this remainder we kept only those terms that, since we aimed at sending N, P and D to infinity, I thought would be challenging for this limit. No, it's not the case — I think something similar is going on in the mean-field regimes, which is what I was telling you yesterday.

Okay, all right. So, as I was telling you, this property carries over to the generalization error, which is then the same, in this very same scaling, for the GLM — for which we know a formula to compute it — and for the two-layer neural network, so we get a closed formula for the latter too; I have not displayed it here because it becomes more cumbersome.

Just a disclaimer on how this equivalence should be interpreted, because it has come up many times in interactions with colleagues. What we are stating is: if we train this two-layer neural network on the dataset generated by a two-layer neural network teacher, this yields a Bayes-optimal generalization error which is numerically equal to the Bayes-optimal generalization error that a GLM student would achieve if trained on the dataset generated by a GLM teacher. We cannot state, at this level, that if I train the two-layer neural network on the dataset generated by the GLM teacher, it reaches the Bayes-optimal performance. So there is no crossing in this diagram. Okay, here I will go quickly, because I want to get to some of the proofs. Yeah? Sorry, sorry, say it again. [Audience question.] Yes, it's the same — they are both Bayes-optimal, so they match every time; in some sense they go to infinity together.

Okay, so why is it not so silly to assume a completely factorized prior on the teacher weights? Because if you slightly modify your free entropy, introducing an inverse absolute temperature beta in front of this Hamiltonian — this quantity here, where you see I have inserted Gaussian priors on the weights, with variances tuned by lambda and sigma, just for the sake of generality — then this problem can yield an empirical risk minimization, as sketched below.
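Here is a minimal sketch of that connection, with lambda and sigma denoting the regularization strengths as in the talk and the rest of the notation assumed; it is the informal zero-temperature argument, not a statement from the paper:

```latex
% beta-deformed free entropy, with the Gaussian priors rewritten as explicit L2
% penalties inside the Hamiltonian:
F_N(\beta) \;=\; \mathbb{E} \log \int e^{-\beta H_N(a, W)}\, da\, dW ,
\qquad
H_N(a, W) \;=\; -\sum_{\mu=1}^{N} \log P_{\mathrm{out}}\!\big( y^{\mu} \mid \cdots \big)
\;+\; \frac{\lambda}{2}\,\|a\|^{2} \;+\; \frac{\sigma}{2}\,\|W\|_{F}^{2} .

% Informal zero-temperature (Laplace) limit: the Gibbs measure concentrates on the
% minimizers of H_N, i.e. empirical risk minimization with -log P_out as the loss
% and ridge (L2) penalties on both layers.
\frac{1}{\beta} \log \int e^{-\beta H_N(a, W)}\, da\, dW
\;\xrightarrow[\beta \to \infty]{}\; -\,\min_{a, W} H_N(a, W) .
```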
In fact, when I send beta to infinity, the integral in F_N is dominated by the weight configurations that minimize this Hamiltonian H_N, which has this minus log P_out serving as a loss, while the other two contributions are norm regularizers. So this is why I was insisting on this fact. This is very informal, by the way — just to give you the idea.

Now, about related works. An example that comes to mind is the committee machine — although I have to say I'm rather new to this literature, so please correct me if I make mistakes. In a committee machine you typically have a very wide input layer, in which you feed your X_mu's, of dimension D as before; then a narrow hidden layer; and finally a readout. This scaling is different from ours because, at least in the papers I've read — Jean worked on it, and there are older papers from the 90s by Sompolinsky and collaborators — only N and D go to infinity, that is, the number of samples and the input dimension, while P remains of order one. Strictly speaking this is not covered by our scaling, because if P stays finite you would have to make D very large for the conditions to hold again, so there is a complicated combination of constraints going on; it's not exactly the same setting.

The regime I think our scaling has most in common with is the mean-field regime studied in the seminal paper by Mei, Montanari and Nguyen, which occurs when the size of the hidden layer is much larger than the size of the input layer — in our language, when P is much, much greater than D. The analysis they carry out does not need to take the number of samples into account because, strictly speaking, they study an SGD dynamics on this two-layer neural network, and what they find is that you can track the empirical distribution of the weights of the network through a distributional equation, following a gradient flow. It seems that both in the mean-field regime they study and in our model, the fact that P becomes large has some intrinsic regularizing effect, which we need for asymptotic-Gaussianity reasons.

Then there are these other settings that are heavily studied, especially recently — again, I'm new to this literature, so I apologize if I don't cite everyone. Okay, sorry, let me explain the pictures first: the blue weights are frozen, so they are not learned or learnable. In the first picture we have a scaling regime that is closer to the mean-field one, and in fact it is connected to it. In the second one the input and hidden layers have widths of the same order, but the hidden weights are blue, so they are frozen and we do not learn them. This is the typical setting that occurs in the neural tangent kernel on one side, where the learning problem turns out to be a regression in a high-dimensional feature space; the neural tangent kernel was also explained, again via this SGD dynamics in the mean-field regime, as the very initial stage of learning, by Mei, Montanari and other collaborators. Random feature models and Gaussian-process limits of neural networks also fall into this category.
By a simple parameter count we see that this is different from our setting, because there only P parameters — the readout weights — are learned, and not the D·P + P that are learned in our case; so we have much more to learn.

More recently — and by more recently I mean from 2021 to 2023 — there has been a line of works addressing the full training of the network, where all the weights are learned and are treated as annealed variables from the statistical mechanics point of view in the free entropy, meaning they are integrated inside the partition function. These works, also for deeper networks, address the fully proportional scaling regime, in which N, P and D go to infinity together at the same rate. As far as I know this was first addressed by Li and Sompolinsky, whose analysis is restricted to linear networks, so with no non-linearities, although they argue for possible extensions; these were also explored by Ariosto, Pacelli, Pastore, Ginelli, Gherardi and Rotondo, who conjecture a formula for the empirical-risk-minimization generalization error using the statistical mechanics formalism I briefly showed before. Then, in 2023, there is the paper by Cui, Krzakala and Zdeborová, who, leveraging a Gaussian equivalence principle, were able to push the computation through and obtain the Bayes-optimal limits, exactly as we do. But there is, let's say, a problem from the rigorous side: our lim-tilde does not capture that scaling regime, so under our lim-tilde N, P and D cannot go to infinity together proportionally. The reason — let me show it to you again here — is precisely this N over P: if N and D grow together, then N over D is of order one, it stays constant, but this N over P would also be a constant, so it would no longer go to zero in our scaling regime, and we cannot reach the same scaling regime as they do. We are not sure whether something fundamental is going on or whether this is just a wall, an obstacle we meet in our proof; we still cannot say. I'm trying to convince you that there is something fundamental going on, but I'm not 100% sure myself.

Okay, so now I will sketch a little bit of the proof — yes, please? I can't hear you, sorry. [Audience: here you prove that doing two-layer networks in this specific regime gives you the same test error as a teacher-student GLM; I'm wondering whether all these papers that look at a slightly different regime observe the same phenomenology?] The last one, yes — thanks. In fact this work was partially motivated by that observation; we tried to prove it, but we were not able to catch the scaling, and we are currently also trying to understand whether something is going on there. Is there another question?
[Audience: do you expect the equivalence to hold, or do you think the N over P going to zero is really important — that when N over P is constant there is no equivalence?] Okay, if you want a personal opinion — and this is personal, it's not math — I think it can be fundamental, because we are looking at the Bayes-optimal setting, and neglecting N over P would mean neglecting some correlations that I will display later. Also, when you have a quantity, it is always easier to prove that it goes to zero than that it goes to two, I don't know. So yes, I think there is something going on there. [Audience: has any numerical work been done to verify this kind of formula, using Langevin dynamics to sample from the posterior distribution?] I did not get the first part — ah, whether numerical work has been done to verify this conjecture using Langevin dynamics to sample from the posterior. I'm not sure about Langevin dynamics specifically — any dynamics, really, because it's not easy to compute this Bayes-optimal estimator; that's true. Well, for the GLM there is GAMP, which can yield the Bayes-optimal estimator — this is a good point; we don't have an equivalent algorithm here yet. It is true, it's very hard to sample; for me, Langevin would be very painful. It could also be that we were simply not able to push the proof further; we have to be careful about that.

Okay, so — how much time do I have, Sebastien? Maybe five more minutes. Five, four — oh, this is a challenge. How familiar are you with the Nishimori identities? Okay, maybe I skip those and go directly to this slide, since the professor asked. The most annoying part of the computation is getting rid of the non-linearity in the middle layer: otherwise, if everything is linear, then, at least from the point of view of fundamental limits, we can work our way around it. What we would like to exploit, and what inspired our proof, are Gaussian equivalence principles, which should hold due to the high-dimensional nature of the problem. Informally, they amount to replacing this phi by this combination — removing the non-linearity — with rho and epsilon tuned appropriately; the xi here is an additional, independent Gaussian noise. If you think about it, it makes sense that rho is the expected value of phi prime, because the derivative measures the response of phi to variations in its argument; the rest is a remainder that can be proved, in some circumstances, to be Gaussian and independent. The question is to what extent this is applicable: in our setting it is clear now, but it was not clear before.
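Written out, the replacement just described looks roughly like this; the precise choice of epsilon is on the slides, so the second-moment matching below is my guess at the standard convention (phi odd, Z a standard Gaussian):

```latex
% Gaussian equivalence: replace the non-linearity by its linear response plus an
% independent Gaussian residual xi_i ~ N(0,1).
\varphi\!\Big(\frac{(W x)_i}{\sqrt{D}}\Big) \;\approx\; \rho\,\frac{(W x)_i}{\sqrt{D}} \;+\; \epsilon\, \xi_i ,
\qquad
\rho = \mathbb{E}\big[\varphi'(Z)\big],
\qquad
\epsilon^{2} = \mathbb{E}\big[\varphi(Z)^{2}\big] - \rho^{2} ,
\qquad Z \sim \mathcal{N}(0,1).
```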
All right — also here I go pretty fast. The strategy is the one usually employed in the statistical mechanics of disordered systems over the last twenty years, and also in inference: interpolation. We build this interpolating quantity — basically, a combination of the weights of the two teachers, the GLM teacher and the two-layer teacher, the latter being the original model. For t equal to 1 you have the GLM, only this last piece survives, and for t equal to 0 you recover the initial model you are actually interested in. From this we also build the student version of the same interpolation, without the starred variables, and we also build an interpolating dataset, because we need to change the dataset too: Y_t^mu is now generated by a P_out that depends on the teacher weights combined in this interpolating way. As a consequence we also have an interpolating free entropy: at t equal to 1 it is the free entropy of the GLM, which we know how to compute in the teacher-student setup thanks to Jean et al. and to previous works on the physics side, and at t equal to 0 it is the free entropy of the model you are actually interested in. To control the difference between the two, the simplest thing you can do is compute the derivative in t and hope for a uniform control in time, of the same order as in the statement of the theorem. And here I'm very sorry — sorry, Dasha — that I compressed this into only one slide, because it is actually a very key theorem: to carry out the proof we also need the concentration of the free entropy, which is stated in this form.
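Schematically, the interpolation just described and the bound it gives can be summarized as follows; the square-root weighting and the normalizations are my reconstruction, not the paper's exact formulas:

```latex
% Interpolating teacher pre-activation, t in [0,1]: t = 0 gives the two-layer teacher,
% t = 1 gives the equivalent GLM teacher (V* the GLM teacher weights, xi^mu extra noise).
S^{\mu}_t \;=\; \sqrt{1-t}\;\tfrac{1}{\sqrt{P}}\, A^{*\top} \varphi\!\Big(\tfrac{W^{*} x^{\mu}}{\sqrt{D}}\Big)
 \;+\; \sqrt{t}\;\Big( \tfrac{\rho}{\sqrt{D}}\, V^{*\top} x^{\mu} + \epsilon\, \xi^{\mu} \Big),
\qquad y^{\mu}_t \sim P_{\mathrm{out}}\big(\,\cdot \mid S^{\mu}_t\,\big).

% Interpolating free entropy and the basic control via the fundamental theorem of calculus:
f_N(0) = f_N \ \text{(two-layer)}, \qquad f_N(1) = \mathring{f}_N \ \text{(GLM)},
\qquad
\big| f_N - \mathring{f}_N \big| \;\le\; \int_{0}^{1} \big| \partial_t f_N(t) \big| \, dt .
```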
All right, let me go faster here — I'm sorry for this; I just want to flash a couple of things to explain why we have this bottleneck of N over P. If we compute the derivative, we get the difference of these terms. Using the Nishimori identities that I skipped, the last term, B, can be proven to be zero immediately — this part is fast. What we expect is that minus A1 counters A2 and A3, so that they cancel each other in this specific regime. Inspired by what physicists like Li and Sompolinsky, and also Rotondo et al., did in their papers, we tried at first to integrate out the weights of the network directly, which in this more rigorous setting amounts to performing integration by parts with respect to these Gaussian variables. But there is a problem: these Gaussian weights sit inside phi, so we cannot use Gaussian integration by parts right away. So we focus a little bit on this term and employ the oldest trick in mathematics, I guess: you subtract what you would like to have there, you add it back, and you control the difference between the red and the blue terms. The first line here, which you would like to handle, is now linear, so you can treat it, and it magically cancels the other two contributions A2 and A3: you integrate the A* by parts in the first line, and what you get is this more or less ugly expression. The first part goes roughly to zero at the order you see in the round parentheses, thanks to the concentration of the free entropy — this you can show. Then you have a sum of N squared terms of these differences; the accompanying quantity can be proven to be of order one, which is fine, but the problem is this difference: there are N squared of these variables, and the only thing that saves you is that the term in the square brackets concentrates around zero. So it does help you go to zero, but unfortunately, if you work out the order of this term, you find a small-o of N times a square root involving this N over P; the N gets simplified against the 1/N in front, and what remains is exactly the N over P that is our bottleneck — it is really there. The reason why I personally think something fundamental may be going on here is that, up to this slide, everything I've shown you is exact: I have taken no limit.

[Audience: do you control it as the norm of the matrix of the differences — the matrix with entries red minus blue, indexed by mu and nu? Or do you control, for fixed mu and nu, that red minus blue is small?] No — I take mu and nu, I need to take them together. [Audience: then look at the norm of the matrix, for example the operator norm of the matrix red minus blue.] I think we — sorry, sorry, can you say that again? [Audience: you say red minus blue is small, but then there is the issue of N over P. Do you look at the difference red minus blue entrywise, or as a matrix, whose operator norm might be small?] No, we don't look at the operator norm. The fact is that I've compressed a lot here, and there are some subtleties: this red-minus-blue part is not independent of the first part sitting here, and there is also this expectation, so there is some interplay going on. If the first part were not there, I could prove the concentration I would like and get rid of this N over P, and I can also estimate further orders; the problem is that the further orders are correlated with the first part, and I cannot get rid of these correlations. That's my bottleneck, and that's why we cannot improve it. Typically, in empirical risk minimization you don't have this part, because you are interested in the infimum of the Hamiltonian, so you don't have this expectation of log Z in the first part — that's why I didn't manage to replicate those results. And that's it, I'm sorry.

[Host: sorry, everybody — take your time to conclude.] So, what about N over P of order one? This is an open question I would like to discuss with people, if anyone has an idea how to improve this, because it might be either fundamental or just very technically hard; we are still not 100% sure about that. Another project I'd like to pursue is: what happens if I have partial information on W*? So I introduce this additional channel here; when sigma, which is the signal-to-noise ratio of this channel, goes to infinity, I know the weights exactly and I get back, for instance, to the random feature model, while at sigma equal to zero I know nothing, which is what I've studied so far. So this could interpolate between the two, and maybe some interesting phenomena arise from that. Then, why not more than two layers? This is also a rightful question. The fact is that at the moment we only miss the concentration of the free entropy, which needs a combinatorial argument that we hope to be able to import from other papers, by Hong-Bin Chen and Jiaming Xia, who did that for the multi-layer GLM — their inference problem is different from our learning problem, but maybe we can import some of their machinery. The other part, the simplification of the terms I've shown you, can be carried over to a multi-layer setting if one integrates the weights of the network carefully, one by one, with an inductive argument — at least I'm confident of that, and I've already started sketching something.
And the last question is what happens if we add structure to the data: this is the simplest thing you can imagine, X still Gaussian or drawn from Gaussian mixtures — why not more structured types of data? Yeah, I thank you also for your patience.

All right, thank you very much. You already handled quite a few questions, but I think we have time for maybe a few more before we go to lunch. [Audience: thanks for the nice talk. I'd like to get some intuition for this result. This generalization error, or the mutual information — is it an upper bound? Because, if I understand right, in this limit when P goes to infinity your model should become more expressive, so essentially the mutual information should go up, but at the same time your data becomes more structureless, in some sense. So can I take this as an upper bound, or a lower bound, of what you would get with finite P?] This is an excellent question. Okay — the correct answer is that I don't know. If I increase P, I would need more data to have more information about all those weights, right? I don't know, I have to think about it, honestly; I can't say. [Audience: but the equivalent model is just a very simple model.] Yes, but I have to correct that: with respect to the very simple model, let's say that the properties of the more complicated model are collapsed into these coefficients rho and epsilon that I carry along. That said, I'm not sure there is a monotonicity between the two; I can't say right now, on the spot, because I'm not very familiar with what happens in committee machines, that is, for finite P — I should first have a look at that, for sure.

Okay, we have time for maybe one last question. [Audience: maybe more of a comment than a question. I didn't think so much about it, but your question makes a lot of sense; still, it's not clear that when P gets small — let's say P is 2 — you should have equivalence, because when P is 2 you have a committee machine, which is of a very different nature from a linear model such as the perceptron. So I don't expect this kind of result to hold for small P; I don't expect this kind of two-layer network to be recovered by something equivalent to a committee machine with at least two hidden units, because it can already model non-linear rules which are much more complex than the linear GLM. So I don't think it's the same thing at all.] Yes, yes — we should comment on this; you are right.

All right, that was a very interesting session. Let's thank Francesco again for his talk.