We're ready for the last session. And the first speaker is Enric Boix-Adserà from MIT. Enric, take it away.

Hi, can you hear me? Thanks so much for having me here, it's been great. I understand it's the last session, so please bear with me — you're probably tired at this point, but it's the last one. OK, so thanks for coming. I'm going to talk about the staircase property and the leap complexity, which, despite the name, is about neural networks. This is work with my really wonderful co-authors, Emmanuel and Theodor, especially for the last two works, which are what I'm going to focus on; the initial explorations were with Matt, Guy, Dheeraj, and Emmanuel.

So we know that deep learning works — that's why we're studying it. And why does it work? We have some neural network architecture, we're training it with some optimizer, and we have some structure in the data. This suggests the theoretical problem: given some structure in the data, can we characterize when SGD training of a neural network succeeds? That's what I'm going to be talking about today. And what's the structure in the data? We're going to assume that the target function is multi-index — I'll tell you what this means. We're first going to see that classical learning algorithms, so kernel methods, need a number of samples that depends on the degree of this multi-index function. Then we're going to see that deep learning can do better, and we argue that the complexity of deep learning is characterized by a quantity called the leap of the function, which is going to be much smaller than the degree. So deep learning is going to beat these classical methods. I'll define these terms — multi-index, degree, and leap — as we go along.

So it's a very basic setting. We have data samples (x_i, y_i), where x_i is in R^d and y_i is in R — these are the labels. For the purpose of this talk, we're going to consider x_i uniform on the hypercube. You can also do this for Gaussian inputs — see our paper, and I can say more if you have questions — but the hypercube is easier for notation. The assumption that the data is isotropic, so all of the coordinates are drawn from the same i.i.d. distribution, is important here. The labels are given by some target function f*, plus some noise with mean zero. So what's the main assumption we're making? That the target function is multi-index: the target f* is equal to some function h* applied to a subset of the coordinates i_1 through i_p, where that subset of coordinates is unknown. So h* is a function that depends only on p coordinates; it's also unknown, and those p coordinates are unknown — there are d choose p choices for them. We should think of p as being order 1. So the function f* depends on a small number of the coordinates of its input, but the input dimension grows.

I'll give you some examples, and you can ask questions. These are some whimsical examples I came up with — the assumptions are not exactly met in them, but bear with me, it's just illustrative. f*(x) could be whether a patient with some symptoms — which you can view as a vector in {+1, -1}^d, where each coordinate is whether the patient has that symptom — has a disease. So you're given data, which are the symptoms of the patient and whether the patient has the disease.
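To pin down the setting just described before continuing with the example, here it is written out — a reconstruction of the data model from the description above, with my own notation:

```latex
% Sketch of the multi-index setting described above (notation is my own).
% Data: (x_i, y_i), i = 1, ..., n, with isotropic inputs and mean-zero noise.
\[
  x_i \sim \mathrm{Unif}\big(\{+1,-1\}^d\big), \qquad
  y_i = f_\ast(x_i) + \varepsilon_i, \qquad \mathbb{E}[\varepsilon_i] = 0 .
\]
% Multi-index assumption: f_* depends only on p = O(1) unknown coordinates.
\[
  f_\ast(x) = h_\ast\big(x_{i_1}, \ldots, x_{i_p}\big),
  \qquad h_\ast : \{+1,-1\}^p \to \mathbb{R} \ \text{unknown},
  \qquad \{i_1, \ldots, i_p\} \subset [d] \ \text{unknown}.
\]
```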
And it turns out that the disease depends on a very small number of symptoms, but you don't know which ones ahead of time. So the machine learning algorithm has to figure out which are the important symptoms and then fit a function. Or whether a person has some phenotype that depends on a small number of genes being expressed. Maybe these are not very clarifying examples, but please ask questions now about the setting, because this will be the setting of the data for the rest of the talk. Is it clear? OK. I was told to wait 30 seconds, make it awkward, and then somebody will always... yeah, I was going to wait.

OK, so the question is: are all these multi-index functions created equal? For the question of representation by neural networks — can I find a neural network that approximates a given multi-index function? — the answer is yes, because you can just take a network that depends only on those p coordinates that matter. You pay 2 to the p, but this is much less than 2 to the d, which would be the curse of dimensionality. From the point of view of generalization, the answer is also yes: if you take the minimum-norm network that fits the data — so minimizing the norm of the parameters while fitting the training data — then you avoid the curse of dimensionality. But this is not an efficient algorithm, because it's an NP-hard problem to find the minimum-norm network that fits the data. In practice, networks are optimized with SGD.

[Audience question about the norm.] I think it's the L2 norm of the weights. If you take two-layer neural networks with bounded norm, you get some class of functions — I think it's called the Barron norm; you should read the Bach paper. Yeah, it should be the L2 norm. So if you were able to find the minimum-L2-norm neural network that fits the data, it would generalize, because what would happen is you would have zero weight on the coordinates that are not i_1 through i_p — the ones you don't know.

But what about SGD training? All multi-index functions are created equal, but some are more equal than others: some are significantly easier to learn than others. Here's an experiment we ran. Take a function on the binary hypercube with input dimension 30, and train with MSE, so squared loss. If you try to learn x_1 versus the product of x_1 through x_10 — which is in {+1, -1}, because this is a {+1, -1} input space — you see that the loss decreases very fast for learning x_1, but it gets stuck for the product. And here I've trained 100 times longer than in the left-hand plot, with 100 times more samples and the same hyperparameters — this is some five-layer ResNet — and it didn't succeed. So you might say, well, maybe what determines how hard a function is, is its degree. But that's not quite true, because you can also train on this function we call the staircase function. What does this staircase function look like? I'll just write it here, expanding out the sum. It's a high-degree function, because it has, in this case, 10 terms of increasing degree, with one degree-10 term, but it's still efficiently learned by neural networks with the same hyperparameters as in the parity case.
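A minimal sketch of the kind of experiment being described, assuming the staircase target S_k(x) = x_1 + x_1 x_2 + ... + x_1 ... x_k from the staircase paper; the architecture, widths, step counts, and learning rate below are placeholders, not the speaker's exact setup:

```python
# Minimal sketch of the parity-vs-staircase experiment described above.
# Everything here (architecture, widths, learning rate, step counts) is a
# placeholder; the talk's exact hyperparameters are not specified.
import torch
import torch.nn as nn

d, k = 30, 10          # input dimension, number of relevant coordinates

def sample_x(n):
    # uniform on the hypercube {+1, -1}^d
    return 2.0 * torch.randint(0, 2, (n, d)).float() - 1.0

def parity(x):
    # degree-k monomial x_1 * x_2 * ... * x_k
    return x[:, :k].prod(dim=1)

def staircase(x):
    # x_1 + x_1 x_2 + ... + x_1 ... x_k, normalized to unit L2 norm
    terms = torch.stack([x[:, :j].prod(dim=1) for j in range(1, k + 1)], dim=1)
    return terms.sum(dim=1) / k ** 0.5

def train(target_fn, steps=200_000, batch=64, lr=0.05, width=512):
    net = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for t in range(steps):
        x = sample_x(batch)                      # one-pass SGD: fresh samples
        loss = ((net(x).squeeze(-1) - target_fn(x)) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        if t % 20_000 == 0:
            with torch.no_grad():
                xt = sample_x(10_000)
                test = ((net(xt).squeeze(-1) - target_fn(xt)) ** 2).mean().item()
            print(f"step {t:7d}  test MSE {test:.3f}")
    return net

# One typically observes the staircase target being learned while the pure
# parity target stays stuck near its initial test loss at these sample sizes.
train(staircase)
train(parity)
```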
So somehow adding these extra terms has let you learn the high-degree term. [Audience: does this also work for other losses?] Adding these lower-degree terms somehow, mysteriously, improves things for the squared loss. I don't think we have anything for non-squared losses — we only ran with the squared loss, and actually I haven't really thought about how to define this for other losses. Good open question. So somehow these lower-degree components are helping you learn this high-degree component, which by itself would not be learned. There's some hierarchical structure in this function — you can think of it that way — that magically gives you this behavior where the test loss decreases. This is for input uniform on the binary hypercube.

[Audience: is this online?] I have the details written somewhere, but you get the same behavior with batch, with mini-batch, with online, or with ERM with a fixed number of samples. I could tell you the details of the specific experiment, but not off the top of my head. [Audience question about the normalization.] Actually, here I'm plotting with the target normalized by 1 over the square root of 10, so the L2 norms of the two targets are the same, and the loss gets much smaller than 1 over 10. Because you get to essentially zero error, by Parseval's theorem you know that you've actually learned the high-degree term, since you can write any function on the binary hypercube uniquely as a polynomial — so you really are learning all the coefficients. I have another plot, which I forgot to include, where you can see that the Fourier coefficients associated with each of these terms are learned in order: x_1 is learned first, then x_1 x_2, and so on. So this is a toy setting, but it inspired these investigations, and basically what I'm going to tell you is why we can learn this high-degree function.

[Audience: is this train or test loss?] If you do one pass, the train loss equals the test loss in this case, because the input space is so large. If you do ERM, then the train loss also goes to zero. But I was plotting the test loss, so this is generalization outside of your sample.

So: the number of samples you need scales as d to the degree of the function, where d is the input dimension, for classical machine learning — kernel methods — while for deep learning, we argue you pay d to the leap. We prove this in some cases, and there are cases that are still conjecturally open, but we have evidence.

OK, so let's do the d-to-the-degree part first. A popular tool to analyze the training dynamics of neural networks is the so-called NTK. Under the NTK parametrization — a specific parametrization of your neural network's initialization and learning rate — as you take the width to infinity, training with SGD is exactly as powerful as using a kernel method: there's a specific kernel associated with a given neural network initialization and architecture, and as the width goes to infinity, SGD becomes the same as training with that kernel. This is good because we understand kernel methods well — I'm not going to define them, since that's not the subject of this talk, but they're something where you can really get a good grasp on how you're going to generalize. But it's bad because it doesn't capture our experiments.
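The headline comparison, as I read it from the talk (informal; constants and log factors suppressed, and the precise conjectured exponent for SGD, max(Leap − 1, 1), comes later in the talk):

```latex
% Informal statement of the sample-complexity comparison in the talk.
% Constants, log factors, and precise algorithmic assumptions are suppressed.
\[
  \text{kernel methods (incl. NTK):} \quad n \gtrsim d^{\,\deg(h_\ast)},
  \qquad\qquad
  \text{SGD on neural networks:} \quad n \approx d^{\,\mathrm{Leap}(h_\ast)},
\]
\[
  \text{with } \mathrm{Leap}(h_\ast) \le \deg(h_\ast)
  \text{ and often much smaller, e.g. } \deg = 10,\ \mathrm{Leap} = 1
  \text{ for the staircase example.}
\]
```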
So it turns out, as I mentioned before, that you can write any function on the binary hypercube uniquely as a polynomial, where the coefficients, called the Fourier coefficients, are unique. There are 2 to the p of these coefficients, h-hat of S, each associated with one subset S of the inputs. And you can define the degree of h* as the maximum size of a set whose Fourier coefficient is non-zero. It turns out that the degree of h* controls the sample complexity when you take the NTK of a two-layer fully-connected network in the NTK parametrization — this was previously known. In fact, one can prove — it's not very complicated — that any kernel method, not just the NTK corresponding to this network but the NTK of any neural network you choose, will always need on the order of d-to-the-degree samples in this setting, where you're trying to learn f*(x) = h*(x_{i_1}, ..., x_{i_p}) and you don't know i_1 through i_p. I could get into more detail about that, but I think I should move on to the meat of this talk. Any questions? I can give you an intuition: basically, any kernel method can only output functions that lie within an n-dimensional subspace, where n is the number of training samples. And the class of functions f* you could make with any choice of i_1 through i_p — that's d-to-the-degree different possible functions — does not lie in a subspace of dimension smaller than d to the degree. So they can't all lie in the subspace of functions you can learn with the kernel. So this is the first attempt, the NTK, and you pay d to the degree.

Now the meat: why can we do better with deep learning? This is pretty much independent of what I've said before, so if you zoned out, you can come back in now. Recall the experiment, parity versus staircase. We're going to define this notion of leap, and I'm going to say a parity has a large leap and the staircase has a small leap. OK. So now let me write down the network architecture we're going to use. I'm going to take a two-layer network: the first layer is W, the second layer is a, which is in R^m, where m is the width of the network. You take your input x, feed it through the first layer W, apply a non-linearity sigma, and then add up the neurons weighted by a.

So why did the NTK analysis fail? Because you have an isotropic initialization. Here I'll fix i_1 through i_p to be the coordinates 1 through p; since the initialization is permutation-invariant, this is just as hard as learning an unknown i_1 through i_p. The coordinates 1 through p are not especially favored: for each neuron, there's equal weight on each of the inputs. And it turns out that if we just trained the second layer, we would need d-to-the-degree samples. So let's do a thought experiment. What if instead we initialized each row of the first-layer matrix with the first p coordinates of order 1 — much larger — and the remaining coordinates of order 1 over d, much smaller?
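For reference, before continuing with the thought experiment: here are the Boolean Fourier expansion, the degree, and the two-layer architecture written out explicitly — my reconstruction of the notation from the description above.

```latex
% Boolean Fourier expansion of h_* and the degree (reconstructed notation).
\[
  h_\ast(z) = \sum_{S \subseteq [p]} \hat{h}_\ast(S) \prod_{i \in S} z_i ,
  \qquad
  \deg(h_\ast) = \max\{\, |S| : \hat{h}_\ast(S) \neq 0 \,\}.
\]
% Two-layer network from the talk: first layer W in R^{m x d}, second layer a in R^m.
\[
  f_{\theta}(x) = \sum_{j=1}^{m} a_j \, \sigma\big(\langle w_j, x\rangle\big)
  = a^{\top} \sigma(W x),
  \qquad \theta = (W, a).
\]
```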
Now this favors learning functions of the first p coordinates, and it turns out that if you have such an initialization, you can learn any function of x_1 through x_p just by training the second layer. Of course, we don't have this magical initialization — we don't know which are the right coordinates. So the hypothesis is that SGD works by making the weights large on these first p coordinates. This is a heuristic intuition, so it's a good question. Imagine we were to freeze W and just train a. The dynamics there are well understood — it's just a random features model — so we can easily understand how many samples you need to generalize. It turns out that if you had sufficiently diverse but much larger weights on the first p coordinates, then you would be able to learn with 2-to-the-p samples, which is much smaller than d to the degree — remember, p is constant and d is growing. So the hypothesis is that this magical initialization is actually being provided by SGD, and I'm going to show you an experiment now.

This is some other function, which I've plotted because it gives a nice picture — you don't have to look too closely at its form; it just depends on the first eight coordinates. Let me show you a video. I can't do it in Beamer, so I'm going to switch to Keynote — this is a very hybrid talk, with Beamer and Keynote and the blackboard. What I'm plotting on the left-hand side is the test loss during training for this function, and on the right-hand side I've plotted W-transpose-W, where W is the first-layer weight matrix. The input dimension is 30 — that's just for visualization; you see this effect even more if you take larger d. So you have a 30-by-30 matrix here, and after training, W-transpose-W is basically zero everywhere except for this 8-by-8 block, which corresponds to the first eight coordinates. This is just out-of-the-box SGD on this neural network, and after training you see this 8-by-8 block with exactly the property we wanted. For our theory we take the width to infinity; here I think I took 40 neurons, not too many. Actually, I think this is a ResNet — I'm just plotting its first layer.

[Audience question about the axes.] Sorry, you're looking at this screen. OK, the horizontal axis is training steps, so number of samples — it goes from 0 to 500,000 — and the vertical axis is the test loss. And this here is W-transpose-W, the absolute values of the weights. I'm going to show you a video, it's very cool. I have an odd-degree term here — I have x_1 — so what you see here is that the initial decrease in the loss is due to picking up x_1: the initial decrease is because it learns x_1. Then it plateaus for a bit while it tries to find coordinates 2, 3, 4. Once it figures out that 2, 3, 4 are important, the weights align and the loss drops; then it plateaus again while it tries to find 5, 6, 7, 8, and when it figures out those are important, it decreases again. The length of each plateau will depend on the leap — maybe you're already starting to see the definition: it's about the new variables in each of the monomials, which you pick up sequentially. OK, so I'm going to show you the video.
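(For readers of the transcript: the right-hand panel of the video is essentially the following diagnostic — a sketch with placeholder names, reusing the network from the earlier snippet; it is not the speaker's plotting code.)

```python
# Minimal sketch of the W^T W diagnostic shown in the video: the fraction of
# first-layer weight energy sitting on the (known, for diagnostic purposes)
# support coordinates. Names and usage below are placeholders.
import torch

def support_energy(W: torch.Tensor, support_size: int) -> float:
    """W: first-layer weight matrix of shape (width, d); the first
    `support_size` input coordinates are taken to be the true support."""
    G = (W.T @ W).abs()                        # |W^T W|, a d-by-d matrix
    on_block = G[:support_size, :support_size].sum()
    return (on_block / G.sum()).item()         # fraction of mass in the block

# Usage sketch, e.g. inside the training loop from the earlier snippet:
#   W = net[0].weight.detach()                 # first Linear layer's weights
#   print(support_energy(W, support_size=8))
# During the plateaus this fraction grows block by block, matching the video.
```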
So here — I'll play it a few times — you see that as training progresses the loss decreases in stages, and I'm plotting the evolution of W-transpose-W at the same time. You can see that while the loss is plateaued here, the weights depend essentially only on the first four coordinates; then the loss falls, and now they depend on the first eight coordinates. So you can see, inside the weights of the matrix, that it's putting more attention, in some sense, onto the variables that matter for the term it's currently trying to learn. Tell me when you get bored of seeing it. So now it has the first four, and now the loss falls at the first eight. You can see it, and you can do this with other functions too: the first four, then the plateau, and now the first eight and the loss falls. OK, operation successful — I'm back here.

So the observation is that there are phases in this learning. In each phase you have d to the order-1 steps, which is a large number, during which you pick up some coordinates in the support; and then you have a smaller number of steps, like 2 to the order p, during which — if you were just training the second layer — you could fit the function of the coordinates that the first-layer weights have already picked up. So the key questions are: why do the first-layer weights grow in the direction of the support, and what governs the complexity of finding the coordinates in the support? That complexity is the leap: it will take d to the order of the leap steps — the length of the plateau — to find the new coordinates. Any questions? Then I'll go into a calculation. [Audience question about the leap.] No, that's why I put it in quotes — I haven't defined it yet; this is just to give intuition first.

[Audience question about the initialization.] Are you talking about this slide? That was a thought experiment, where I initialize each of the rows with large weights on the first p coordinates. If I had that, I would basically be done, because I could just train the second-layer weights. The point is that this initialization is found automatically by SGD: the dynamics proceed toward it. The actual initialization is isotropic. If you look at the video — let me pause it; it didn't pause fast enough, but here x_1 has already been picked up, while the rest is still isotropic, and then it grows on the first four. OK, let's move on.

Good, so you have this leap, which I'm about to define. We want to understand how training these networks works. You train the first layer with one-pass SGD with some mini-batch size — just for simplicity for these slides; you can also do it with mini-batch size equal to 1, but you get a different limit. This is hard to analyze directly, because you have stochasticity from the finite width and the finite sample size. But you can approximate the dynamics: if you take large width, large mini-batch size, and small step size, you can approximate them by an expression where I've taken the population expectation and where all the neurons interact in a mean-field way. This is what you get in the mean-field limit — I've simplified a bit, even. So you get an expression which tells you how the first-layer weights evolve on each step.
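Concretely, the kind of expression being described is, as I understand it, the population gradient of the squared loss — a reconstruction, so constants and the exact mean-field scaling may differ from the slide:

```latex
% Population-gradient approximation of the first-layer update for neuron j,
% squared loss, in the large-width / large-batch / small-step limit
% (my reconstruction; constants and the exact mean-field scaling may differ).
\[
  w_j^{(t+1)} \;\approx\; w_j^{(t)}
  + \eta \, a_j \,
    \mathbb{E}_{x}\!\Big[\big(f_\ast(x) - f_{\theta^{(t)}}(x)\big)\,
    \sigma'\!\big(\langle w_j^{(t)}, x\rangle\big)\, x \Big],
  \qquad j = 1, \ldots, m .
\]
```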
[Audience: how large does the width need to be in this limit?] Yeah, large — but in practice, you don't need it that large. It's infinite width in the theory, but actually order-one width, a large enough constant, suffices. [Audience: does it depend on the input dimension?] It does not depend on the input dimension — it depends on the final accuracy you want to achieve in the theoretical results, and, sure, p is constant, so yes, it does depend on p.

OK, so for a constant amount of training time, you can maintain this approximation — as long as it's a small constant time. When you train for a larger amount of time, you also have to keep track of additional correction terms, the residual. But in the actual result on mean-field training, we first train the first layer while keeping the second fixed, and then train the second layer while keeping the first fixed. We train the first layer only for a small amount of time, so this approximation is still valid, and then for the second layer we just use the kernel analysis.

Let me give you some intuition. I made this picture, and now I'm looking at one individual neuron — say this one. The weights of this neuron evolve according to this formula. What are the weights at initialization? Gaussian. And after one step — here I've plotted, projected onto the first two coordinates, the distribution of the weights over all the neurons — you see that the coordinate in direction 1 grows a lot. Why is this? I'll give you the calculation. You want to see the calculation? I think you do. No, it makes it more awkward if I don't — and I've been told this is how you give a talk, I don't know. Also, this is not a one, this is my cursor.

OK, so how am I perturbing the weights? I take the step size eta times the expectation over x of x_1 — this is my target function — times sigma prime of x-transpose-w, times x. This is a vector; it's the direction of the perturbation. Now let me Taylor-expand the sigma prime. The point is that x-transpose-w is small to begin with, so we can do a Taylor expansion: this is roughly sigma prime of 0, plus sigma double-prime of 0 times x-transpose-w, plus sigma triple-prime of 0 times x-transpose-w squared over 2, all times x. So what does this expression look like? It depends on the coordinate. If I look at the first coordinate — the first coordinate of x here — the expression is roughly the expectation over x of x_1 times x_1 times sigma prime of 0, which is of order 1. And if I look at any other coordinate i, not equal to 1, it's the expectation over x of x_1 times x_i times sigma prime of 0, plus sigma double-prime of 0 times x-transpose-w, plus lower-order terms. And this is much, much smaller than 1, because the first term cancels — the expectation of x_1 times x_i is 0 — and the second term also cancels, because you have an odd, degree-3 polynomial: x_1 times x_i times a degree-1 polynomial.
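A clean version of this one-step calculation, reconstructed from the spoken derivation above for the target f*(x) = x_1, inputs uniform on the hypercube, and a single neuron with small enough initialization that the Taylor expansion applies:

```latex
% One-step calculation (my reconstruction): target f_*(x) = x_1, x uniform on
% {+1,-1}^d, one neuron w with <w, x> small enough to Taylor-expand sigma'.
\[
  \Delta w \;=\; \eta \,\mathbb{E}_x\!\big[x_1\,\sigma'(\langle w, x\rangle)\,x\big]
  \;\approx\; \eta \,\mathbb{E}_x\!\Big[x_1\Big(\sigma'(0) + \sigma''(0)\langle w,x\rangle
      + \tfrac{1}{2}\sigma'''(0)\langle w,x\rangle^2\Big)\,x\Big],
\]
\[
  (\Delta w)_1 \approx \eta\,\sigma'(0)\,\mathbb{E}[x_1^2] = \Theta(\eta),
  \qquad
  (\Delta w)_i \approx \eta\,\sigma'(0)\underbrace{\mathbb{E}[x_1 x_i]}_{=0}
    + \eta\,\sigma''(0)\underbrace{\mathbb{E}\big[x_1 x_i \langle w,x\rangle\big]}_{=0}
    + O(\eta/d), \quad i \neq 1.
\]
```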
And then the remaining terms do contribute, but they're small — much smaller than 1. So basically, only the coordinate in the direction of x_1 moves. I should be wrapping up soon. So what happens if you do x_1 times x_2 instead? I could do the same calculation, but because x_1 x_2 is a higher-degree parity, you're going to be stuck at a saddle point: after one step, all the first-layer weight coordinates are still of order 1 over square root of d, and after any constant number of steps you're still stuck at the saddle. So it's actually much harder to learn — you need more than a constant number of steps, something like log d steps. But then what happens with the magic of this function x_1 plus x_1 x_2, which is the staircase? On the first step, it's the same as before: only the x_1 term matters, so you grow in the direction of x_1. After another step, the x_2 term can grow, because the x_1 coordinate has already grown — it allows you to escape the saddle, since w_1 is now large. So after only two steps, you basically have the magnitudes you want on all the relevant weights.

OK. I think I have four more minutes, and then it'll be questions. Do you have questions now? I'm going to tell you what the leap is, but do you have questions about this calculation? [Audience: why is the Taylor expansion valid?] OK, so x is in {+1, -1}^d, so the norm of x is square root of d, and w is drawn Gaussian i.i.d. with variance 1 over d, so the norm of w is of order 1. Then the dot product x-transpose-w is of order 1. For the theoretical analysis we actually take this initialization multiplied by some small constant, so the dot product is a small constant and we can do the Taylor expansion. In practice it's all standard initialization — so you're right, you've got me: there's some constant c here, and it's small. OK.

All right, good. So back to the experiment: the first-layer weights grow successfully for x_1 and for the staircase, but they get stuck at the saddle for x_1 times x_2. So let's go back to what happens for the parity versus the staircase. For the parity, the weights are stuck at a saddle, which is actually of higher order here; for the staircase, they escape the saddle by climbing the staircase, so to speak — you first learn x_1, then x_1 x_2, and so on.

OK. So now we can define the leap complexity. Again, we write the function h* as a polynomial, and we consider its set of non-zero Fourier coefficients. The leap of h* is at most l if there is some ordering of these non-zero Fourier coefficients such that, at each step of the ordering, the support grows by at most l coordinates you haven't previously seen. You can think of it as: first you learn the Fourier coefficient corresponding to S_1, and all of those coordinates grow; then S_1 and S_2, and you only really pay for the coordinates you haven't covered so far. Some examples: the degree-k parity has leap k, because it's a single Fourier coefficient of size k, and the staircase has leap 1.
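Written formally, the definition just given reads roughly as follows (my reconstruction of the notation):

```latex
% Formal version of the leap definition just given (my reconstruction).
% Let S(h_*) = { S : \hat{h}_*(S) != 0 } be the non-zero Fourier support sets.
\[
  \mathrm{Leap}(h_\ast) \le \ell
  \iff
  \exists \text{ an ordering } S_1, S_2, \ldots, S_k \text{ of } \mathcal{S}(h_\ast)
  \ \text{ s.t. }\
  \big| S_j \setminus (S_1 \cup \cdots \cup S_{j-1}) \big| \le \ell
  \ \text{ for all } j .
\]
% Examples from the talk:
%   parity    x_1 x_2 \cdots x_k:                       Leap = k
%   staircase x_1 + x_1 x_2 + \cdots + x_1 \cdots x_k:  Leap = 1
```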
And the conjecture is that for any set of non-zero Fourier coefficients, for almost all h* that have those non-zero Fourier coefficients, you need d to the max of (leap of h* minus 1) and 1 samples. This is a conjecture, and I said "almost all" because there are some degenerate cases we can prove; there's a measure-theoretic sense in which you can define almost all. Why this exponent? Because it's based on the complexity of escaping the saddle point corresponding to learning that monomial — this comes from a paper of Reza and Ben Arous and others; Reza is in the audience. [Audience question.] Yeah, because it's one-pass SGD, so we're really using one sample per step — for ERM, I think it's different. Also, there's a recent paper showing that if you smooth the loss, you can get a slightly different bound, something like leap over 2; I can talk to you about that.

OK, so any questions? I'll tell you what we've proven, and then it's over. [Audience question about the step sizes.] You have to be careful, yes. The first step can be small — it's a small constant, but still a constant; it doesn't go to 0 as d goes to infinity. We don't know of any other way of analyzing the first-layer training dynamics, because they're nonlinear, so basically we train the first layer for a very small time, so the Taylor expansion is still valid — and it requires some technical analysis. But then when you train the second layer, you can train it for a lot of time, which is still constant — a big constant, but it doesn't grow with d; it only depends on h*, really on p, the number of coordinates. [Audience question.] Yes, because of the one pass.

So how do we prove this? I have an extra slide right here, which I copied from Keynote. We analyze some PDEs, and we can prove that the dynamics get stuck in certain degenerate cases — you can construct a degenerate case, but if you perturb it slightly, you escape. In the extension to Gaussian inputs, which I didn't talk about, we don't believe there are degenerate cases, but we don't actually know.

So let me tell you what progress we've made on this conjecture. I've put some of it in gray so you don't get distracted too much. The case of leap equal to one, I believe, is mostly resolved. We have a paper which shows that if you take a mean-field network with sufficiently large width — infinitely large in the theorem, but really order-1 width, not depending on d, is enough — and you train it for order-1 time, which corresponds to roughly order-d samples if you want to play fast and loose (order-1 time can be simulated with order-d samples — though it's not necessarily true that you can't do more with order-d samples; we don't know that), then mean-field networks trained for order-1 time can learn if and only if the leap of h* is equal to 1. There's a star here because of those degenerate cases. This is proved using the mean-field approximation, then observing that, because the target function depends on a small number of coordinates, you can reduce to a PDE which is equivalent but simpler to analyze, and then you analyze it.
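In other words, the leap-1 result as I understand it from this description, stated informally:

```latex
% Informal statement of the leap-1 result just described (my paraphrase):
% mean-field two-layer networks, width independent of d, layer-wise training.
\[
  \text{train for } O(1) \text{ mean-field time } \big(\approx O(d) \text{ samples}\big)
  \;\Longrightarrow\;
  f_\ast \text{ is learned}
  \iff
  \mathrm{Leap}(h_\ast) = 1 \quad (\text{up to degenerate cases}).
\]
```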
We analyze it for any p — it's essentially the Taylor-expansion trick again. OK, so what happens with leap greater than one? This is mostly open. We do have a lower bound: we can prove that something like d to the order of the leap is needed for CSQ algorithms. And we also have some special cases in which SGD can provably learn a family of leap staircase functions — basically staircase functions where I've erased some of the coefficients, removed some terms — and also, in the Gaussian case, with Hermite polynomials of higher degree. So yeah, this is it. Any questions? Then I'll put up the summary slide.

[Audience: what are those special functions?] They're staircase functions that look like the one before, except I've erased some of the terms; or, in the Gaussian case, I have a Hermite-k factor instead. [Audience: is this for Gaussian or hypercube inputs?] In the ABM 23 paper we only do Gaussian, but the same should hold for uniform on the hypercube. Actually, I've kind of lied: one of the papers is for the hypercube; the other is for the hypercube and Gaussian, but we only really prove it for Gaussian. It turns out that for Gaussian inputs you define an isotropic leap — this is my summary slide — which is a slight variation of the leap. You have a slightly different definition of multi-index function, where it's not p coordinates that matter but p directions of the input: f*(x) = h*(Mx), where M is in R^{p times d}. That's the appropriate definition for Gaussian inputs, because the Gaussian now has rotational symmetry. And you have to define the corresponding leap isotropically, in terms of the expansion of h* in the Hermite basis. Now you actually pay more for high-degree Hermite components, even if they depend on only one direction, and that adds a complication, because even with one coordinate you now have many different possible leaps. That's partly why the first paper was only for Gaussian, and was more fully rigorous; this one is rigorous, but only for special cases. OK, that's it. Good.

[Audience question, partly inaudible, about the dependence on p and about gradient descent versus SGD.] In this case — you're asking about a 2-to-the-p versus poly-of-p dependence in the samples? I don't know whether you need extra structure for GD, because otherwise you're solving an NP-hard problem, like learning the support. If you have a staircase, I'm not sure; we didn't talk about it, and I'm not sure whether it's strictly better or worse than those bounds. This is also one-pass, and GD is a slightly different setting computationally — whether you do ERM, and so on. For the leap-one result, for example, you can do it for batch GD with a constant number of steps where the batches are of order d and independently drawn each time — so mini-batch SGD with order-d mini-batch size at each step, for a constant number of steps. I don't know if you can do it with the full batch, but maybe. The main difficulty is: OK, sure, you show that the weights adapt to the directions of the support, but then you have to show they're still diverse. This requires showing that a certain kernel is non-singular — a kernel generated by the training dynamics — and we actually use some algebraic facts to do this. And I don't know how this all interfaces.
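Going back to the Gaussian variant mentioned a moment ago, here it is written out — again a reconstruction of the notation from the description above:

```latex
% The Gaussian variant of the setting, as described above (my reconstruction):
% p relevant *directions* instead of p relevant coordinates.
\[
  x \sim \mathcal{N}(0, I_d), \qquad
  f_\ast(x) = h_\ast(M x), \qquad M \in \mathbb{R}^{p \times d},
\]
% with the leap defined "isotropically" via the expansion of h_* in the
% Hermite basis rather than the Boolean Fourier basis; higher-degree Hermite
% components cost more, even when they involve only a single direction.
```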
Yeah, I'm not sure. [Audience question about greedy low-rank learning.] If I understand correctly, the greedy low-rank learning dynamics generally arise when you have very small initialization, and that's not the setting we're assuming here — so it's a different reason for low-rank structure to occur. [Audience question about longer timescales.] It's also not a degeneracy problem; it's that we can't prove the mean-field PDE is valid on those timescales, longer than order one. For the limiting dynamics to be valid there, you would need a number of samples exponential in the training time, and here we're aiming for a poly number of steps, so poly time — that would require e-to-the-poly samples. So the analysis for leap greater than one does not use the mean-field PDE; it uses a martingale argument, kind of like the one in the Ben Arous paper — and Reza is here too, I think — plus a modification where we show that you go from saddle to saddle. Oh, I also forgot to mention: the achievability result here trains the first layer and then the second layer, and here we even cheat more — we also have a projection onto a ball at each step, just to control things. So there's a lot left to be done. Yeah, OK. Does that answer your question? I think I'm probably out of time, but people keep asking, so.

[Audience: does it then learn the parity?] Yes — because the mechanism is such that the weights grow in the direction of the relevant coordinates, you can fit any function of those coordinates by training the second layer. So after training the first layer, you train the second layer: you've basically learned the subspace. OK.