Yes, so let's get started. I'll talk about finite-width neural networks from the landscape point of view today. The talk is mostly based on our ICML paper from 2021, but I will also present some material from my thesis, and some more numerics that we put up recently. This is joint work with François, Flavio, Arthur, and Francesco, my supervisors Clément and Wulfram, and also Johanni. I want to first start with some motivation for finite-width neural networks. Here I am pulling some empirical results from a paper by Neyshabur and colleagues from 2017. Here's a classic shallow neural network that is trained on the MNIST dataset. At width 32 the network hits zero training error, and with further over-parameterization we see the test error decreasing, saturating around width 128 or so. Similar trends carry over to other datasets, like CIFAR and SVHN. What I'm seeing in this plot is that under over-parameterization the generalization error does not decrease gradually: it first decreases at the onset of over-parameterization, and then it saturates. In this talk I'm going to try to understand what is going on at the onset of over-parameterization and also further on; in particular, what makes over-parameterization good for the generalization error? It is by now a known fact, I think, that over-parameterization helps with training, as we saw in the first slide, but I want to show a toy example here. Here's a target function generated by two hyperplanes. We can pick the hyperplanes in different orientations, which generates another target function, and we can generate more tasks of this sort. Then we train a two-neuron student network. Each row represents a different sampling of the hyperplanes, so a different dataset, and we are training 20 seeds of students. A dark blue dot indicates a zero-loss solution (the loss is 10^-16 in that case), and a yellow dot is high loss, basically. We see that in some fraction of the seeds the gradient flow converges to the zero-loss solution, but in other seeds it fails to find the solution, even in this very simple problem. This is the zero over-parameterization case, and if we keep over-parameterizing further, the number of dark blue dots increases. There is reliable convergence already at factor four for this simple problem: at factor two there are already a lot of dark blue dots, but at factor four it's almost completely dark blue, so almost every seed converges to zero loss. Similarly, if we arrange a similar problem generated by four hyperplanes, then in the zero over-parameterization case the problem turned out to be quite difficult for some teachers: in the last half of the rows we see no cases of success, it's all yellow dots. So it's a very non-convex problem. In particular, I'm going to try to give an answer to these questions in this talk: why is it hard to train without over-parameterization, and what is the benefit of over-parameterization? The go-to answer, I think, is that there's a fixed number of data points, and if we over-parameterize further there are more solutions, so we can find smoother interpolators; that's one benefit of over-parameterization.
But I will give an answer from a landscape point of view to this question. And this is the big question: how much over-parameterization do we need to prove convergence to a zero-loss solution in finite-width scenarios? By no means am I going to give a complete answer. This is very problem-specific, and I think it's a very difficult question even to study in your favorite problem of choice in a shallow-network setting, but I will give an average-case answer that one can keep in mind when choosing a problem. This is my final motivation slide: the analysis also gives a perspective on some recent work. There is the idea that over-parameterized solutions are equivalent up to permutations; here we are looking at a picture from Entezari et al. There are these basins that are identical to each other up to permutations, so each of these solutions is identical. The recent work on Git Re-Basin comes up with a similar argument: they can find a good permutation mapping solutions to each other, which are then linearly mode-connected, and they attribute this to the implicit regularization of stochastic gradient descent. But we will see in our model that all solutions are identical to each other, and, if the network is wide enough, SGD sometimes finds them, so they are naturally linearly mode-connected in this regime. With the characterization of the global minima manifold we can see that their solutions are connected, but we also know more about the structure of the solutions: we know the exact parameters, not only the functional form, but what each neuron should do, basically. Using this formalization, we can prune wide neural networks. Here we are seeing an MNIST experiment: we generate a teacher network with 30 neurons in three hidden layers, and it reaches a high loss level, represented by the orange dots on the left. Then we over-parameterize the network by a factor of three, and with over-parameterization we find better solutions, shown by the purple dots, and we can prune them down to the original network size. That gives a solution of the original non-convex problem; that's one way to solve it. Okay, so I want to talk about the loss landscapes of neural networks, but from the symmetry point of view. There are already a lot of example loss landscapes full of symmetries: for instance, dictionary learning, clustering, tensor decomposition. In Reza's talk we saw the simplest case of symmetry, a saddle point between two solutions. And there are broader classes of symmetries for loss landscapes: in the first picture here there is a rotational symmetry of the global minima, or discrete symmetries, which is the case we are going to study. There are also more versions of it, like a line of critical points in the third picture, and I'm going to talk about this in a little more detail.
So there are the symmetries, and there is the complexity, for neural networks. This complexity analysis was done for spherical spin glasses and related models; the relevant thing to do in this analysis is to count the number of critical points, because that gives an idea about how difficult the problem is and how non-convex it is in some ways. We are going to take this approach to study neural networks, and we are going to count the number of critical manifolds. Neural networks have symmetries, so it's not going to be the number of critical points in this case, because the critical points form manifolds due to the symmetries; we should count the number of manifolds instead. I'm going to do it in six steps. First I'll introduce the symmetry approach to landscape complexity, then I'll present the scaling laws, so that's the number of critical manifolds. Then I will look at one such manifold; actually, even a single manifold shows quite non-trivial behavior: on the same line all critical points achieve the same loss, but some part of it is a set of saddle points and the other part is local minima, and we will look at these transitions from a static point of view. Then I'll introduce the landscape complexity measure, and then the equivalence class of global minima, and, if time permits, I'll have a look at the pruning application in the last slides. Here I'm taking a shallow neural network generated by n neurons, so w_j is the incoming vector and a_j is the outgoing weight, and theta is the parameter vector, a concatenation of all the vectors of the neurons. I'm going to use these cartoons for my counting arguments: here's a width-n network, and each neuron is represented by a color, so when I refer to a neuron, I mean the concatenation of the incoming vector w_1 and the outgoing weight a_1. I'm picking a loss function: I'm defining a cost C and a target function f*, so the cost C measures the difference between prediction and target, and I'm averaging over the input data, denoted by P; it could be discrete or continuous, so at this stage it could be the empirical risk or the population loss. I'm assuming C and sigma are twice differentiable, and that the cost is zero if and only if the prediction and the target match. These are going to be all the assumptions I'm using. The first principle is neuron splitting. If we take one of the neurons, the green neuron, let's say, I'm splitting it into two pieces: copy the incoming vector w_1, and split the outgoing weight into two pieces by a scalar mu, so mu times a_1 is one outgoing connection and (1 - mu) times a_1 the other. The sum of the two neurons is the same as the original network function. This is rather trivial, but it was observed by Fukumizu and Amari in 2000 that if the first configuration is a critical point, then the second configuration is also a critical point of the loss landscape; this comes from a simple gradient calculation, so this is known. Then there is another, similar principle, which is zero-neuron addition. We can add two neurons to this expanded network, with incoming vector w', which can be anything, and outgoing weights a' and -a'. If we sum them together, it is a zero neuron: it doesn't change the network function. This doesn't preserve criticality, but it preserves the network function, because the two neurons cancel each other out.
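Both invariances are easy to check numerically. Here is a minimal sketch, assuming a shallow tanh network with scalar output; the setup and variable names are illustrative, not from the paper's code:

```python
# Minimal numerical check of the two invariances: neuron splitting and
# zero-neuron addition both preserve the network function f(x).
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 4                       # input dimension, number of neurons
W = rng.normal(size=(n, d))       # incoming vectors w_j
a = rng.normal(size=n)            # outgoing weights a_j

def f(x, W, a):
    # Shallow network: f(x) = sum_j a_j * tanh(w_j . x)
    return a @ np.tanh(W @ x)

x = rng.normal(size=d)

# Neuron splitting: copy w_1, split a_1 into mu*a_1 and (1 - mu)*a_1.
mu = 0.3
W_split = np.vstack([W, W[0]])                     # duplicated incoming vector
a_split = np.concatenate([a, [(1 - mu) * a[0]]])
a_split[0] = mu * a[0]

# Zero-neuron addition: a pair (w', a') and (w', -a') that cancels out.
w_prime, a_prime = rng.normal(size=d), 1.7
W_zero = np.vstack([W_split, w_prime, w_prime])
a_zero = np.concatenate([a_split, [a_prime, -a_prime]])

print(np.allclose(f(x, W, a), f(x, W_split, a_split)))  # True
print(np.allclose(f(x, W, a), f(x, W_zero, a_zero)))    # True
```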
Here I am showing a group of two zero neurons, but it could be a group of five, or one group of two and another group of three, and this is going to factor into our computations of the number of manifolds. So these are the two symmetry principles that I'm going to study, and with that I can already introduce the scaling law. Here I'm putting in an extra assumption on the dataset: the dataset is generated by a finite-width neural network, let's say a width-k network. So already with a width-k network the dataset can be interpolated. This is also known as a teacher assumption, but I want to call it a structured teacher: it is a teacher that fits the dataset perfectly, and k can be as large as we want, but it is finite. That's the assumption. Now I can also introduce what I mean by over- and under-parameterization: there's the width-k network, and smaller networks are under-parameterized, and bigger ones are over-parameterized. So I'm picking an under-parameterized network of width n, and here I'm showing an irreducible parameter: the network function generated by these four neurons cannot be represented with a smaller number of neurons. That is to say, there are no split neurons inside this configuration, so I couldn't merge any of them together; all neurons are distinct from each other, and that's why I'm using different colors in this cartoon. Then, with neuron splitting, here the green neuron is split three times and the orange one two times, and I end up in the configuration that I'm also showing in the bucket. Okay, let me get into it. What I mean by the scaling law is the number of manifolds generated by this width-n parameter in the width-m neural network; we are looking at the loss landscape of this wider, width-m network. The thing is, we have n neurons to start with, but we want to fill m positions with these n neurons. How can we do it? It's actually a partition problem; let's look at the formula. For each partition, denoted k_1 to k_n, there is a number of arrangements inside the wider network, given by the multinomial coefficient, and I have to use each neuron at least once, so the k_i are all at least one; then I sum over partitions. This can be written in another way: it is essentially a Stirling number of the second kind, and the sum counts the number of surjective functions from a set of m elements onto a set of n elements.
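As a small aside, the counting can be checked directly. A minimal sketch, assuming the expansion factor G(n -> m) is exactly the surjection count described here (the helper names are mine):

```python
# Counting sketch for G(n -> m): the number of ways to fill m slots with
# n distinct neurons, each used at least once. By inclusion-exclusion this
# is the surjection count, which equals n! * S(m, n) with S the Stirling
# number of the second kind.
from math import comb, factorial

def surjections(m, n):
    # Number of surjective functions from an m-set onto an n-set.
    return sum((-1) ** j * comb(n, j) * (n - j) ** m for j in range(n + 1))

def stirling2(m, n):
    # Stirling number of the second kind (exact integer division).
    return surjections(m, n) // factorial(n)

m, n = 7, 3
print(surjections(m, n))               # 1806
print(factorial(n) * stirling2(m, n))  # 1806, the same count
```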
Once there is the surjection, there is also the coloring: the n neurons were all different to start with, and that brings the factor of n! that takes the Stirling number, which counts partitions, to the surjection count. And then there's the second invariance we looked at: I want to also get the scaling law for preserving the function, and that is denoted by the number T. Here n was free in the first case, because from any under-parameterized network I can map into the wider network, and each choice gives a different number; but now I want to consider the zero-loss configurations, I want to count the number of ground states. That's why I'm fixing k, the data-generating width. Now, in the over-parameterized network of width m, what is the number of manifolds? There is the number of manifolds I get from neuron splitting only, denoted by the factor G, and then there's an additional factor coming from the zero-neuron addition, so that gives me an extra factor. Now I want to zoom into one of these manifolds. This is not going to affect the complexity calculations, but I just want to show an example of how funky these things actually look. I just showed the result from Fukumizu and Amari on the preservation of critical points, but we didn't talk about the second-order derivatives: is it a local minimum or is it a saddle point? Maybe it's good to look at the example first. Here we are looking at the case where only one neuron is split into two pieces, and there is a factor mu that is free: for every mu there's a critical point, and I can move mu freely, so that gives me a line of critical points. I want to study the second-order derivatives on this line as a function of mu. Basically, with a decomposition of the Hessian, we can get a theorem on the signs of the eigenvalues of the Hessian on the symmetry-induced critical points for a given mu; that's on the left-hand side.
The theorem says you should look at the Hessian spectrum of the irreducible critical point theta, and then at this new submatrix Y, whose formula I'm writing here, which comes from the Hessian decomposition. The nice thing is that this Y and the factor in mu split from each other, so I get a factor of mu(1 - mu) times the fixed matrix Y, which depends on the neuron I'm splitting; and then there is an extra zero, because the critical points live on a line, so there's a zero eigenvalue due to the symmetry. I'm going to show an example of a local minimum here, in a four-neuron network. I split the first neuron, (w_1, a_1), into two, and plot the five smallest eigenvalues of the Hessian as a function of mu. Here I see the minimum eigenvalue is negative for negative mu, so on the negative side of the line we get strict saddles; then between zero and one the factor mu(1 - mu) is positive, so I get local minima in between; and at mu = 0 the second part of the decomposition vanishes and I get only zeros, so we get a non-strict saddle connecting the saddles and the minima on the line. Here is a cartoon representation of this: the red part shows the local minima, and this is what happened in the actual example I showed with the teacher-student cartoon. But it could also have been different: if Y is a negative definite matrix, then on the outer rays of the line we would get rays of local minima, and, because of the permutation symmetry, the line is symmetric about mu = 0.5, so in between we would get a line segment of strict saddles. In any case, independent of the spectrum of Y, because the middle factor vanishes at mu equal to zero and one, the pieces are always connected through non-strict saddles.
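The sign pattern along the line is easy to mimic numerically. A toy sketch, with a randomly drawn symmetric matrix standing in for the fixed block Y from the theorem, so this only illustrates the mu(1 - mu) factorization, not an actual trained network:

```python
# Toy illustration of the mu*(1 - mu) factorization of the Hessian block
# on a line of symmetry-induced critical points. Y is a random stand-in
# for the fixed matrix from the theorem, not computed from a real network.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
Y = (A + A.T) / 2                 # a symmetric "Y" with mixed eigenvalue signs

for mu in (-0.5, 0.25, 0.5, 0.75, 1.5):
    eigs = np.linalg.eigvalsh(mu * (1 - mu) * Y)   # ascending order
    print(f"mu={mu:5.2f}  min eig={eigs[0]: .3f}  max eig={eigs[-1]: .3f}")

# For mu in (0, 1) the factor mu*(1 - mu) is positive, so the eigenvalue
# signs match those of Y; outside [0, 1] the factor is negative and all
# signs flip; at mu = 0 or 1 the whole block vanishes (non-strict saddle).
```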
Now I'm going to introduce the landscape complexity analysis, and for that I'm going to kind of forget the second-order derivatives, treat all critical manifolds equally, and just focus on the number. We've looked at the first scaling law that came from neuron splitting, the number of critical manifolds, and I want to look at this number in some limits to get an idea of how it behaves. We are studying the loss landscape of the width-m network, right? So if I take the large-m limit, so I'm in an infinitely wide network, what is the number of manifolds coming from very small networks? For small n, this expansion factor grows like n to the power m, so it is exponential in m. And these are high-loss manifolds, in the sense that the small network can only fit the dataset to some extent, so it achieves high-loss configurations; after neuron splitting the function is exactly the same, so the network is identical to a narrow network in that sense and achieves the same high loss values. But if n is close to m, we get the low-loss manifolds, and in that case the scaling is factorial in m, so faster than exponential; there are more low-loss manifolds in that sense. A better way to study these numbers is to scale n and m at the same speed, with a ratio alpha, normalizing by m to the power m on the log scale, so it's comparable to the scaling of the low-loss manifolds. Here we see a unimodal curve: the most common manifolds are the ones coming from an intermediate ratio alpha. Actually, as I was hinting, these numbers were studied in the context of combinatorics, as we realized later: the factor G is equal to the number of surjections from a set of size m onto a set of size n. People were also interested in where the peak of this unimodal curve is, and there's a nice MathOverflow discussion about it: the peak is achieved at 1/(2 log 2), and that's also what we see in our cartoon. So the most dominant critical manifolds are those coming from a network whose width is 1/(2 log 2) times the width of the landscape we are analyzing.
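As a finite-m check of the 1/(2 log 2) claim, one can scan over n and locate the peak of the surjection count; again this is a sketch under the assumption that G(n -> m) equals the surjection count:

```python
# Where is the most numerous family of critical manifolds? Scan n for a
# fixed width m and find the n that maximizes G(n -> m) = Surj(m, n).
from math import comb, log

def surjections(m, n):
    return sum((-1) ** j * comb(n, j) * (n - j) ** m for j in range(n + 1))

m = 60
n_star = max(range(1, m + 1), key=lambda n: surjections(m, n))
print(n_star / m, 1 / (2 * log(2)))   # both close to 0.72 for large m
```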
Okay, let's do this slide first, and then I can take questions. I want to also introduce the landscape complexity measure, which I'm denoting by C. We take a teacher network of width k, and we study the over-parameterized network of width m. I'm defining C by comparing the number of critical manifolds coming from networks of width k - 1, so that's the scaling law G, against the scaling of the number of zero-loss manifolds, the scaling law T, and I want to see which one scales faster. If there are many more saddle manifolds, I'm going to conclude that this is a very non-convex loss landscape, so it's hard to reach the zero-loss manifolds. If there are many more zero-loss manifolds instead, it is still a non-convex landscape, because the saddle manifolds are still growing at least exponentially, maybe factorially, but there are far more zero-loss manifolds, so it's a benign non-convexity. So I want to study how this landscape complexity behaves as a function of m. Similar to the limits we did before: when m goes to infinity, in the infinite-width limit, the complexity goes down to zero. This is nice: the infinite-width loss landscape is also benign from this point of view, there is an incredible number of zero-loss manifolds and the number of saddles is small in comparison. And there is a second limit that I think is interesting: the infinite data-complexity limit, when the teacher width k goes to infinity and we have limited over-parameterization, so m grows at the same speed; then the complexity explodes, and in this regime it should be very hard to train the network. And not only in the limits: we see the complexity decrease gradually, so over-parameterization gradually decreases the complexity. Here I have some toy examples, and I want to talk about the deep case, actually, since I've been discussing the shallow case only so far. We only used the permutation symmetry to make these scaling-law computations, and in deep networks we can apply the permutation symmetry in every hidden layer, one layer after the other, so it gives an exponential factor on the landscape complexity. That means that in the mild over-parameterization case, when the complexity is high, if the network is also deep, the complexity is going to be even higher, and we see it here: in the mild over-parameterization case the gradient flow doesn't find any solutions, so the permutation symmetry prevents it from converging, in rough terms. But if we over-parameterize further, again the number of zero-loss manifolds takes over, and when the complexity is smaller than one, depth makes it exponentially smaller still, so it's much easier in the vast over-parameterization case. Depth just makes the difference between mild and vast over-parameterization much more pronounced. And here I have an MNIST experiment: we created a teacher network from MNIST with 13 neurons, and this was funny to see. We trained it with Adam, I think, and we trained over-parameterized networks, 50 seeds each. We see that at the onset of over-parameterization it's a non-convex problem, so it doesn't converge; then there is an intermediate regime where the complexity is close to one and the curve flattens out, which comes from the flattening of the expansion factor G at 1/(2 log 2), and there only a fraction of seeds converges; and with further over-parameterization more and more seeds converge towards zero loss. I don't want to make a lot of claims from this figure, but there is a bit of a parallel between it and the complexity plot, and given the level of the analysis I didn't expect to see any match between the two; so it was surprising that there are some indications that the complexity analysis tells us something about the training dynamics. So, for very wide neural networks, do you mean? Yeah, I guess so.
Yeah, so we can actually look at this unimodal curve; let's do it like this. What happens for wide neural networks is that, whatever k is, k is going to be small compared to m. And T is always larger than G, due to the zero neurons; but let's even forget the zero neurons and consider only G. Then the ratio is basically G(k - 1 -> m) over G(k -> m), and the k - 1 factor is going to be lower than the k factor, because the function G is increasing in n in this regime. And why it flips at mild over-parameterization is that when k is close to m, the saddle manifolds are more numerous than the zero-loss manifolds. Exactly, yeah, that's the right way to read it. For choosing the parameters of the teacher? Oh, I see, yeah, that's a very good question. For sure, I know some teachers that are not going to be difficult, in the sense that even with zero over-parameterization you can learn them, even if the teacher is wide, depending on the selection of the parameters. If you choose a unit orthonormal teacher, for instance: if you sample the weights and the input dimension d is larger than k, the first-layer vectors are going to be nearly orthogonal to each other, so it's effectively a unit orthonormal teacher, and this teacher is easy to train. So my argument is going to fail there, actually, but I think this is an extra-simple case, and it's not like a teacher fitted to MNIST, where each neuron has to do something different, learn a different component of the dataset. We are trying to construct less random teachers in some sense, for example by choosing the hyperplanes and creating these XOR constellations; that's a bit the idea. Okay, yeah, we didn't do this.
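Going back to the ratio argument, here is a rough numerical check, ignoring the zero-neuron factor and comparing G(k - 1 -> m) against G(k -> m) directly, as in the answer above:

```python
# Rough check of the ratio argument, with the zero-neuron factor ignored
# and G(n -> m) taken to be the surjection count.
from math import comb

def surjections(m, n):
    return sum((-1) ** j * comb(n, j) * (n - j) ** m for j in range(n + 1))

k = 10
for m in (11, 15, 30, 60):        # from mild to vast over-parameterization
    ratio = surjections(m, k - 1) / surjections(m, k)
    print(f"m={m:3d}  G(k-1->m)/G(k->m) = {ratio:.3e}")

# Near m = k the ratio exceeds one: the saddle manifolds outnumber the
# zero-loss manifolds, the hard mildly over-parameterized regime. For
# m >> k the ratio decays towards zero: the benign, vast regime.
```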
So I want to go back to the questions that I was hinting at at the beginning of the talk. Why is it hard to train without over-parameterization? I argue that it is the extreme non-convexity. This is not true for all teachers, but for average-case teachers I think this is a good picture to keep in mind. Here I'm plotting a cartoon where the vertical axis is the loss of the critical manifolds. The first circle is the critical manifolds coming from small networks, where the expansion factor G is small; as we increase the width of the under-parameterized network, there are more critical manifolds, and we go lower in the loss levels. After a factor of 1/(2 log 2), where the critical manifolds are the most numerous at that loss level, going further down in the loss levels there is a smaller number of manifolds, and we eventually hit the zero-loss part of the loss landscape. That was the case of zero over-parameterization. But if we over-parameterize the neural network, the global structure of this loss landscape changes: we no longer fall onto the second phase of the unimodal curve we looked at before, we stay on the onset of that curve, so the saddle manifolds grow gradually as we decrease the loss level, and the zero-loss manifold is the most numerous. So it's easier to converge, in some sense. I don't know if these pictures make sense; maybe we can take it offline, and if someone is interested I would be very happy to talk about them. And the other question I hinted at at the beginning: how much over-parameterization do we need in average-case problems? This complexity analysis, at least in the toy scenarios I showed at first, points to something like a factor of four, but I have no way to make a strong claim here. It's very problem-specific, and one should really choose a problem and focus on it; but it's nice that this gives a kind of average-case analysis. So now I want to talk about the equivalence class of global minima. Here I'm adding some more assumptions. The input distribution has full support, and the teacher network is unique up to permutation, so there's only one function that interpolates the dataset. And I'm assuming the activation function is analytic and has non-zero derivatives at zero for infinitely many odd and even orders. This is to say it can't be an odd activation function and it can't be an even one, because then I would have to add more symmetries, like the mirror symmetry, to my neuron-addition catalog; I want to study the simplest symmetry scenario. In this case we get that all zero-loss points of an over-parameterized network are identical to the teacher network up to neuron splitting and zero-neuron addition. So if you have a zero-loss point, it has to be identical to the teacher up to these two symmetries. This is the ground state of the over-parameterized neural network, and I want to study the topology of this manifold. Let's look at a toy case with two teacher neurons and three student neurons. The formula gives 12, and each dot represents an affine subspace of zero-loss points; if two affine subspaces have an intersection, I put an edge between them. That gives me the connectivity graph of the global minimum manifold. I can do the same for another case, with three teacher neurons and four student neurons: that's 60 affine subspaces. Only the green and blue types of affine subspaces are really different: the blue one has a neuron that is split, and the green one has an extra neuron that is a zero neuron. With the zero neuron, I can move the parameters towards a split neuron, and from the split neuron I can move one of the split copies to another zero neuron. That's how we walk around inside this global minimum manifold.
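As a sanity check on the 12 and 60 quoted here, one back-of-the-envelope count that reproduces them: choose some slots to hold zero-type neurons, then map the remaining slots surjectively onto the k distinct teacher neurons. This decomposition is my reading of the quoted numbers, not a formula stated in the talk:

```python
# Back-of-the-envelope count of the zero-loss affine subspaces: j slots
# hold zero-type neurons, the remaining m - j slots map surjectively onto
# the k distinct teacher neurons. My reading of the quoted counts.
from math import comb

def surjections(m, n):
    return sum((-1) ** j * comb(n, j) * (n - j) ** m for j in range(n + 1))

def zero_loss_subspaces(k, m):
    return sum(comb(m, j) * surjections(m - j, k) for j in range(m - k + 1))

print(zero_loss_subspaces(2, 3))   # 12, as in the k=2, m=3 example
print(zero_loss_subspaces(3, 4))   # 60, as in the k=3, m=4 example
```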
So, maybe any questions? Then I'm going to talk about this pruning application. Here's a cartoon of what I was describing before. With zero over-parameterization we have a non-convex optimization landscape, and it's not only non-convex, it's also hard to train: it's not a benign non-convexity in the sense that all saddles are strict and we can still converge to the global minimum manifold. We have non-strict saddles, we have degenerate saddles on the lines of critical points. So one way to solve this optimization problem is to go to an over-parameterized network, wider by a factor rho, train a bunch of seeds, then collect the neurons of all these seeds together and cluster them. This is in Flavio's paper that we put on arXiv recently. Here, again, the examples are similar to the first slide: we are creating teacher networks with input dimension eight, a first hidden layer of width four, and a second hidden layer of width two. If we train at the teacher's size, we reach high loss levels, but if we train over-parameterized student networks, as expected, we reach lower loss levels across different seeds. If we then look at the neurons of these networks and combine them, kicking out the zero neurons and merging the split neurons together, we go back down to the original network size, because we are cutting off neurons, and the result actually achieves a much better loss, because we are also removing the noise in the neurons that comes from training. So this works in some settings: in the toy settings I'm showing here, and in an MNIST setting too. Here I'm showing more experiments, in line with Sebastian's question. We have teacher networks generated by these XOR-like hyperplane configurations; this is factor two of over-parameterization, and so on. The general trend is that the larger the over-parameterization factor, the easier it gets to converge, but it's hard to analyze these questions in finer detail, because it doesn't converge for every seed even at a factor of four; we still see some green dots.
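A minimal sketch of the merge-and-prune step for a shallow network, with illustrative thresholds and a simple greedy grouping in place of whatever clustering the paper actually uses:

```python
# Sketch of prune-by-symmetry: drop (near-)zero neurons, then merge
# neurons whose incoming vectors coincide by summing their outgoing
# weights. Thresholds and the greedy grouping are illustrative choices,
# not the procedure from the paper.
import numpy as np

def prune_shallow(W, a, tol=1e-3):
    # Drop neurons whose outgoing weight is numerically zero.
    keep = np.abs(a) > tol
    W, a = W[keep], a[keep]
    # Greedily merge neurons with numerically identical incoming vectors;
    # splitting is undone by summing the outgoing weights of a group.
    merged_W, merged_a, used = [], [], np.zeros(len(a), dtype=bool)
    for i in range(len(a)):
        if used[i]:
            continue
        group = [j for j in range(i, len(a))
                 if not used[j] and np.linalg.norm(W[j] - W[i]) < tol]
        used[group] = True
        merged_W.append(W[i])
        merged_a.append(a[group].sum())
    merged_W, merged_a = np.array(merged_W), np.array(merged_a)
    # A merged group that cancels out (a zero-neuron pair) is dropped too.
    keep = np.abs(merged_a) > tol
    return merged_W[keep], merged_a[keep]
```

On the toy teachers described above, stacking the neurons of the trained seeds and passing them through a merge step like this is what takes the factor-rho students back down to the teacher's width.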
So, there's a question; thank you. The question is basically observing that it's always convenient to over-parameterize more here: if I believe this plot, and of course I believe this plot, it is basically telling you that feature learning in the finite-width networks is not too important; the questioner would like to know my opinion, but it's also something for the next discussion. Yeah, it's a great question actually. So I didn't make these plots, I took them from these papers, but in these plots, yes, infinitely wide networks work better. But you don't want to train an infinitely wide network in practice; you don't need to train there. And I'm not sure this solution is exactly identical to the NTK solution, actually, because even for an infinitely wide network, if you scale your initialization large, then you converge to the NTK or NNGP solution; but if your initialization is small, then we don't have a closed-form expression for the function, so we would have to train the infinitely wide network to get the small-initialization solution in the mean-field regime. That's the first thing. And, yeah, it's an interesting question how the NTK solution would fall onto this curve. I think for feedforward networks, Matthieu Wyart and his group have a lot of work showing that the NTK solutions are better than the mean-field solutions, but for convolutional networks it's the other way around. So I don't think it's the end of the story by any means. Also, when you don't have this closed-form correspondence through the NTK, let's say you want to recover the mean-field solution in some way, the question is what network you want to train: should it be a super wide network, or do you already converge to that solution effectively with a factor of 10? So I find it interesting that this curve plateaus: you could stop at the beginning of the plateau, where you're starting to converge to this solution. The other picture would be that this curve decreases smoothly; that would be the variance-reduction picture we get from the NTK analysis. To be honest with you, I picked these kinds of pictures, because there are other pictures showing exactly the same experiments where the curve decays slowly, and that's compatible with the NTK point of view. But these plots also came from that line of work, so I was like, okay, it fits my story, I guess. I mean, it's all about how you go to the infinite-width limit. Exactly, yeah. Honestly, with my symmetry arguments, the analysis has zero sensitivity to the scaling of the training dataset. In Reza's slides we saw this nice picture of the ERM landscape: it is funkier than the population loss landscape, and I think it really looks like that. To study how the critical points disappear as the sample size increases, and how many new ones you get in the finite-sample scenarios, I have no way to do that with this approach. If I have a finite training dataset, then it should effectively have converged to the population loss; this I agree with. But to study the scaling of it, I wouldn't know; this is a difficult one. The horizontal line? The vertical line: it represents the first time it hits zero training error, so it would be my k, the width of the teacher network. Good question. I mean, there is a little bit of a peak here, yeah. Maybe there's early stopping, that could be one thing; early stopping, or any kind of regularization. Thanks. Last question. One thing for sure is that I'm only using the symmetries to make all the arguments, but I'm also using the network parameterization, because neuron splitting is a symmetry special to neural networks. If I didn't have a second layer, if the second layer were fixed, for example, it would still be permutation-symmetric, but I wouldn't have the neuron-splitting invariance. So it's not only permutation symmetry; I think it's a really neural-network-specific thing that I'm analyzing and counting here. But it's also true that in a lot of physics problems there are lots of symmetries.
For the rotation symmetry: what I can say is that for linear neural networks, when there's no activation function, these manifolds merge together under the rotational symmetry, so it's more like one or two big, thick manifolds, and I lose all the counting. So with rotations I think I would lose these scaling arguments. But we can talk more about it later. I would say let's, yeah, let's leave it there.