So, is this working? Okay. Great. Yes, thank you all. Thank you to the organizers for inviting me to speak, and I'm very happy to be speaking with you all today. And yes, feel free to interrupt me at any point. I aimed for a slightly shorter talk to leave time for interruptions and discussion afterwards, so please do. Okay, so I'm just going to jump right into it. I think a lot of people are familiar with the themes I'll be talking about today, but just to make sure we're on the same page with the words I'm using: the first set of results is in what we call implicit regularization, or implicit bias, or algorithmic regularization, or the inductive bias of optimization. The high-level idea is that the optimization algorithm itself, even when you don't impose any explicit form of regularization, can act as a regularizer. The classical example is gradient flow on least squares when you have more parameters than examples: if you initialize from the origin, gradient flow converges to the minimum-L2-norm solution that interpolates the data. Even though there's no L2 regularization in the problem explicitly, the use of gradient flow itself minimizes the L2 norm of the solution that's found. And something that's gotten more attention over the past few years is looking at gradient descent, gradient flow, or other optimization algorithms in classification problems, where you're minimizing, say, the exponential loss or the logistic loss, some convex surrogate for the 0-1 loss. In these settings, when there are many possible ways to fit the data, gradient flow and gradient descent converge in direction to solutions that satisfy the KKT conditions of a margin-maximization problem, that is, to the minimum-L2-norm solution subject to margin constraints. Even though nothing in the problem explicitly says anything about minimizing L2 norms or about this constrained optimization problem, this is what the optimization algorithm itself favors: solutions that satisfy the margin constraints and minimize the L2 norm subject to them. Later we'll talk about the analogue of this story for gradient flow and gradient descent on neural networks, where things are necessarily more complicated because the optimization is non-convex, but where it really seems to be key to the success of deep learning. Or at least that's what I think, and what some people think. OK, so that's the first part of the talk. The second part of the talk is on benign overfitting. By benign overfitting, I mean settings where you have some estimation or learning problem with noise, and you're looking at the generalization performance of classifiers or models that perfectly fit the training data, things that achieve 100% training accuracy under some loss, like the square loss or the 0-1 loss. Since there's noise in the problem and you're achieving a perfect fit to the data, you know that you're overfitting in the standard statistical sense of fitting your data better than the noise level in your problem should allow.
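Coming back to the first classical example for a moment, here is a minimal numpy sketch (an illustration, not code from the talk): gradient descent on an underdetermined least-squares problem, started from the origin, lands on the minimum-L2-norm interpolator, which we can compare against the pseudoinverse solution.

```python
# Minimal sketch: gradient descent on overparameterized least squares, initialized
# at the origin, converges to the minimum-L2-norm interpolating solution.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200                        # more parameters than examples
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)                       # initialize at the origin
lr = 1e-3
for _ in range(5_000):                # gradient descent on 0.5 * ||Xw - y||^2
    w -= lr * X.T @ (X @ w - y)

w_min_norm = np.linalg.pinv(X) @ y    # closed-form minimum-norm interpolator
print("train residual:", np.linalg.norm(X @ w - y))                      # ~ 0: interpolates
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))  # ~ 0
```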
But what's been observed in neural networks, and now in a variety of other settings, is that these models, even though they're overfitting to noise, can still generalize well, even optimally. The benign aspect is that you're overfitting, but it doesn't really seem to hurt your prediction performance much. A reason this has attracted a lot of interest from statisticians and learning theorists is that it seems very hard to reconcile this behavior with classical ways of understanding generalization via uniform convergence. If it's a noisy problem, the best possible risk you could achieve is at least some constant. And when you're looking at interpolators, things that achieve a perfect fit to the training data, your empirical risk is zero. That means the gap between your population risk and your empirical risk is exactly equal to your population risk, because the empirical risk is zero. And that's the striking thing: the standard approach to understanding generalization is to look at what's happening on your training data, maybe with some complexity measure of your function class, and say that as long as the number of samples is much larger than that complexity, the gap between what you see on your training data and your test data will vanish. But that's impossible in a setting where there's noise and you're overfitting. So reconciling how you can generalize even while interpolating is the main question of this new sub-area of theoretical machine learning and statistics. And just to say, as in the implicit regularization story, we have a pretty good understanding of how this happens in some restricted classes of linear models, but beyond that the story is more complicated and much harder to understand. So what I'll present are some vignettes about implicit regularization and benign overfitting in neural networks in a particular regime. The regime we'll look at is what happens when you train on what we call high-dimensional data. It's not exactly that the dimension d is much larger than the number of samples; it's something of a similar flavor that we'll define in just a second. We'll show that gradient flow can have an implicit bias, an implicit regularization effect, towards low-rank neural networks. That's the first part of the talk. In the second part, we'll talk about how to use this characterization of the implicit bias of gradient flow to say something about benign overfitting in neural networks. OK, so I just want to quickly review some recent work on implicit bias in neural network classification problems. We'll be looking at classification problems where you optimize some convex surrogate for the 0-1 loss: you can think of the logistic loss, which is this L that we have here, or the exponential loss; the same results apply. And we'll look at gradient flow, which is this equation in red: a dynamical system where the change in the parameters is governed by the negative gradient of the loss at that time. Associated to this class of neural networks is a margin-maximization problem, which is that pink equation (1).
So this is minimizing an L2 norm subject to margin constraints. If you imagine all the parameters in your neural network concatenated into one vector, the problem is to minimize the L2 norm of that concatenation subject to this margin constraint. The yi's are plus or minus 1, f(xi; theta) is the neural network output on an example xi with parameters theta, and we want every example to have a margin of at least 1. And these very wonderful results from a few years ago, by Lyu and Li and by Ji and Telgarsky, show that this holds for a very large class of neural networks, the homogeneous neural networks. If you haven't seen this before, homogeneous just means that if you scale the parameters by some multiple alpha, then the neural network output is scaled by alpha to the L for some L. So if you have a deep fully connected ReLU network, or you have max-pooling layers, all of these satisfy this homogeneity condition. So it's a really large class of neural networks that is captured by this. Positively homogeneous? Positively homogeneous, yeah. And if you haven't seen that before, it doesn't really matter; the idea is that this captures a lot of neural networks in practice, and it captures the two-layer neural networks we'll be talking about shortly. What they showed is that if you consider gradient flow over a homogeneous neural network and there's some time at which the empirical risk is small, smaller than 1/n where n is the number of examples, then gradient flow converges in direction to a first-order stationary point, a KKT point (KKT standing for the Karush-Kuhn-Tucker conditions) of this margin-maximization problem, and the empirical risk, the logistic loss or exponential loss or whatever loss you're using, goes to zero. And this is a quite striking result, because of course nothing a priori about this optimization problem suggests that doing gradient flow on the logistic loss should have anything to do with margin maximization. Moreover, the neural network can have arbitrary structure beyond homogeneity. In particular, it can have as many parameters as you'd like, it can have all kinds of weird architectures, but the same thing happens: as long as the loss reaches a small enough value, it will converge in direction to something that satisfies the KKT conditions for this margin-maximization problem. And something that's nice here is that, again, there's no assumption about the number of parameters and no assumption about initialization. So it really allows you to abstract away from some of the details that make it difficult to understand neural network training. By understanding just the geometry of this margin-maximization problem, it elucidates something about the limiting behavior of neural network training. I just want to make sure we're clear on this point, because the rest of the talk is about the consequences of the structure of neural networks that satisfy the KKT conditions for margin maximization. And if we can say anything about that, then we're really illuminating the role of optimization in whatever we're able to deduce from these KKT conditions, whether that's generalization or adversarial robustness.
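As a quick illustration of the linear version of this story (my own sketch, not from the talk): gradient descent on the exponential loss for a tiny separable dataset whose bias-free max-margin direction is (1, 0) by symmetry. The norm of the iterate grows without bound, but the direction slowly rotates toward the max-margin separator.

```python
# Sketch: gradient descent on the exponential loss over a separable linear problem.
# The iterate's norm grows forever, but its direction converges (slowly) toward the
# max-margin separator, which for this symmetric dataset is e1 = (1, 0).
import numpy as np

X = np.array([[ 1.0,  0.5],   # support vectors, symmetric about the x-axis
              [ 1.0, -0.5],
              [-1.0,  0.5],
              [-1.0, -0.5],
              [ 3.0,  2.0]])  # extra non-support point that breaks symmetry early on
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])

w = np.zeros(2)
lr = 0.05
for _ in range(100_000):                  # gradient descent on sum_i exp(-y_i <w, x_i>)
    margins = y * (X @ w)
    w -= lr * (-(y * np.exp(-margins)) @ X)

w_hat = w / np.linalg.norm(w)
print("normalized iterate:", w_hat)                   # close to (1, 0)
print("cosine with max-margin direction:", w_hat[0])  # -> 1, only at a ~1/log(t) rate
```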
If we can show that the KKT conditions for margin maximization alone imply something, then it really shows that the implicit bias of optimization is playing a key role in whatever phenomenon we're looking at. So it's a little complicated, but the easiest way to think about it is to consider what happens in the linear setting. In the linear setting, that's the result I mentioned a little earlier. Basically, the proof is a generalization of what happens when you do gradient descent on the exponential loss. Intuitively, in order to minimize the exponential loss you want to maximize the margin on all of the examples, and when you have exponentially-tailed losses, essentially only the smallest-margin example ends up dominating everything, because you have exp(-margin), and when things sit inside exponentials, the smallest one dominates. That's the one-minute idea; otherwise, you can just trust that this is true and maybe investigate why afterwards. OK, so this is just restating what I mentioned. And just as a note, I'm going to be saying "gradient flow does such-and-such" here, but actually any other algorithm that produces max-margin neural nets would have the same properties I'm talking about. What we know is that gradient flow is the thing that satisfies that, and that's what's of interest for trying to understand deep learning, but it's a little caveat to keep in mind. OK, so the actual networks we'll be talking about are these two-layer leaky ReLU networks. Leaky ReLU is just the version of the ReLU activation where, instead of the derivative being 0 on one half of the line, it's equal to gamma, so the derivative is always at least some constant gamma. And we'll look at training two-layer nets where we train only the first layer and not the second. We have upcoming work on what happens when you also train the second layer, but all the key difficulties and ideas come from training the first, because training the first layer is what allows for feature learning; it's a non-convex optimization problem. So most of the difficulty comes from training the first layer. We'll look at the logistic loss; our results also hold for the exponential loss. And again, we aren't assuming anything about the initialization, and we also won't assume anything about the number of neurons in the network. So m is the number of neurons in the network; we're not assuming anything like an NTK-type analysis or a mean-field regime. It's an arbitrary number of neurons. And a quick note: this phi is 1-homogeneous, and so the neural network output is also 1-homogeneous; in particular, all the results on the implicit bias of homogeneous neural networks hold here. OK. So here's what I mean when I say high-dimensional: these two properties here. The main one is the one in blue, and the one next to it is also needed. The idea behind this definition is that we want the training data to be nearly orthogonal. There's this aphorism that people say, that in high dimensions things are nearly orthogonal, and that's mostly true, and that's what this definition in blue is trying to formalize. If you have data that's literally orthogonal, then it's trivially satisfied.
But if you have data that's not exactly orthogonal, this is a way of formalizing that. And what I think is really nice about this definition is that it doesn't depend at all on a distribution, and it doesn't depend at all on the labels of the training data. All it depends on is a geometric property of the features x. So you might ask: when does this hold, and when does it not? The running example to have in mind is isotropic Gaussian data: when d is at least n squared, up to log factors, this condition holds. You can generalize this to sub-Gaussian distributions with independent components; then as long as a kind of effective rank of the covariance matrix is large enough, this is satisfied. But there is one setting where this is not satisfied, even when d is really, really large relative to n. Think of a Gaussian design where one component has very high variance and the rest have identity variance. This will not satisfy near-orthogonality, because whenever you take two examples, the dot product in that high-variance direction is going to be very large, and that dominates the right-hand side relative to the norms of the examples; we need the squared norm to be at least n times the maximum pairwise correlation. So that's a little caveat: this does not hold in all high-dimensional settings, but it is at least a regime of high-dimensional data. So here's our first result. Let F be this two-layer leaky ReLU network and consider this margin-maximization problem. If you concatenate the weights of a two-layer neural network where you only train the first-layer weights, the L2 norm of all the weights is just the Frobenius norm of the first-layer weight matrix; that's the associated margin-maximization problem here. And so if we have this high-dimensional data assumption, we let V be the matrix of first-layer weights, and we assume that V satisfies the KKT conditions for problem (1), then the following holds. First, the rank of this matrix is at most 2. This is an m-by-d matrix, where m is the number of neurons and d is the input dimension, so in principle its rank could be as large as the smaller of m and d; but if it satisfies the KKT conditions, it has to be rank at most 2. Second, the decision boundary is linear. This two-layer leaky ReLU network has the possibility of being nonlinear; it has the capacity to approximate any continuous function. But if you want to satisfy the KKT conditions for margin maximization, you have to have a linear decision boundary. So there's some vector z such that the sign of z·x is the same as the sign of the neural network output. And the last thing is that this z has a very simple form. In classification problems the scale doesn't matter, only the direction, and what we can say is that the direction of z is the same as the direction of a nearly uniform average of the training data: z is given by a sum of si yi xi, where the si's are strictly positive and the maximum ratio of the si's is at most a constant. And then, of course, for any initialization, gradient flow converges in direction to a network satisfying the above. There is something to show there; it's not immediate from the theorem I stated earlier, as I'll explain in a moment.
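To see when the condition in blue holds and when it fails, here's a quick numerical check (again my own sketch, not from the talk): isotropic Gaussian data with d much larger than n squared satisfies it, while a single very-high-variance coordinate, the spiked example above, breaks it even though d is huge.

```python
# Check of the near-orthogonality condition: we want
#   min_i ||x_i||^2  >>  n * max_{i != j} |<x_i, x_j>|.
import numpy as np

rng = np.random.default_rng(0)

def orthogonality_ratio(X):
    """min_i ||x_i||^2 divided by n * max_{i != j} |<x_i, x_j>| (want this >> 1)."""
    n = X.shape[0]
    G = X @ X.T
    norms_sq = np.diag(G).copy()
    off_diag = np.abs(G - np.diag(norms_sq))
    return norms_sq.min() / (n * off_diag.max())

n, d = 20, 50_000
X_iso = rng.standard_normal((n, d))          # isotropic Gaussian, d >> n^2
print("isotropic:", orthogonality_ratio(X_iso))      # > 1, grows like sqrt(d)/n

X_spiked = X_iso.copy()
X_spiked[:, 0] *= 100.0                      # one coordinate with 100x the std dev
print("spiked:   ", orthogonality_ratio(X_spiked))   # << 1, the condition fails
```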
You need to be able to say that the loss gets small enough at some point in order for these implicit bias results to kick in, and we show that in this non-convex problem the loss does get small enough for that to happen. Yeah? Can you say this again? For all labels. Correct. Even if you have completely random labels, this will hold. Yes? Yeah, so this result holds for any labeling. There's no sample complexity or anything right now, because there's no generalization statement; this is just a property of optimization derived from this geometric characterization of the training data. Later in the talk we'll discuss how to say something about generalization: if we assume the (xi, yi) pairs are IID from a distribution, we can say something about generalization. But right now, all this is saying is what properties optimization has, given that the squared norms are much bigger than the pairwise correlations and the maximum ratio of the norms is at most a constant. Yeah, well, implicitly not every n and d can satisfy these conditions. Basically, if you think about isotropic Gaussians, d needs to be at least n squared. But really, the question is when these assumptions are satisfied. Correct, there's no assumption about distributions here at all; this is purely about the structure of the optimization, and that's fundamental to the results. I think d of order n is possible. But when n is much bigger than d, this just plainly does not hold in some settings; there are some settings where it does. I mean, the thing is, for rank at most 2 to be a good thing to do, you kind of need to have a simple problem. And yeah, there are ways to relax it, but I don't think we should expect this to hold in general when n is bigger than d. Yeah, when n is bigger than d, it does not; there are settings where it can, but generally no. OK, so I just want to emphasize: think about what happens if you start from random initialization. We're not actually saying anything about initialization here; this result holds for any initialization. But if you start from a random Gaussian initialization, then the weight matrix at initialization is full rank with probability 1. So what this is saying is that gradient flow has a very strong bias towards reducing the rank when the data is very high-dimensional. And I think it's quite notable that there are many ways you could imagine perfectly fitting the data, but what gradient flow prefers for this problem is something with a linear decision boundary, and it takes a particularly simple form. OK, I'm running low on time, so I'm not going to try to communicate the proof; it's a lot of analysis of the KKT conditions for this non-convex margin-maximization problem. OK, so I want to speed up so I can mention these results on benign overfitting, and spend some more time on that. So, benign overfitting. As I mentioned before, the most well-understood models in benign overfitting are linear predictors, and in particular the ordinary least squares predictor. And in my opinion, the main reason for this is that there is a formula to use.
If you do not have a direct, simple formula, it's difficult to figure out where to start. And I think this is a reason why the linear regression setting is much better understood than even the linear classification setting: the standard estimator that comes out of the typical optimization algorithms for classification is the max-margin linear predictor, and in general there is no formula for the max-margin linear predictor. There's a suite of works that try to say, well, in some settings the linear max-margin predictor is the same as the OLS estimator, and then you can use the OLS formula to say things. But there are settings where that is not the case, and we have much less to say about those, because there's no formula. But something nice about the implicit bias results I just described is that we actually do get a formula. We can say that these two-layer leaky ReLU networks, trained by gradient flow, have a decision boundary equal to that of a linear predictor, and that linear predictor is just a nearly uniform average of the training data. And what that allows is characterizing benign overfitting in these two-layer neural networks by just looking at the generalization behavior of this classifier, this nearly uniform average of the training data. So we'll talk about one distributional setting. There's a family of distributional settings that we are able to analyze, but all the key ideas can be seen in this one, so this is the one I'll talk about. And now we actually make an assumption about the training data: that it is sampled IID from the following distribution. It's basically a class-conditional Gaussian mixture model, but with artificial label noise added. We have some mean vector mu, and we have clean labels sampled uniformly from plus or minus 1, and the x's are sampled so that one cluster is centered at mu and the other at minus mu. The cluster distribution has independent (sub-)Gaussian components; just think of an isotropic Gaussian. That's the clean distribution. And then we introduce noise by flipping each label to the opposite label with probability p. Yes, yeah. So we'll get into that in the next slide. So that's the idea: clusters centered at mu and minus mu, and artificial noise introduced by flipping the labels with probability p. OK. All right. Now let's formalize what I said before about these uniform classifiers. We'll call a vector u in Rd tau-uniform with respect to the training data if we can write u as a weighted average of the yi xi, where the weights are all strictly positive and the maximum ratio of the weights is at most tau. OK. So here are the assumptions, and as a member of the audience just asked, we do require particular assumptions on the mean separation. We have this large constant C that's hiding everywhere, and we need at least this many samples. And then assumptions 2 and 3 are that the mean separation grows like the dimension to some power beta that is strictly less than one half,
and that the dimension is large relative to n squared, and also relative to something that has to grow as beta gets closer to one half. And I think what the audience member was getting at is that there's a problem if the mean separation is very, very large: it means that one component of your distribution has very high variance, and as I mentioned before, you can't expect near-orthogonality to hold if one component has extremely high variance, because the dot products between examples will be large when we need the norms to be much larger than that. That's the reason we need an upper bound on the mean separation. So you can think of this as saying we're studying the low signal-to-noise regime of Gaussian mixture models. In particular, assumptions 2 and 3 together imply what I've been calling high-dimensional data: the squared norms are much bigger than the pairwise correlations, and the maximum ratio of the norms is at most a constant. And what our result says is that if you have some tau greater than or equal to 1, and the noise rate is strictly smaller than 1/(1 + tau), then under these assumptions, with probability at least 99%, any tau-uniform classifier satisfies the following. First, it achieves a perfect fit to the training data; that's what's in red: you get 100% training accuracy under the 0-1 loss. Second, the test error has a very nice bound. We have a lower bound of p, where p is the label noise we've artificially introduced into the problem, so we know the test error is at least p. But interestingly, we also know that the test error is at most p plus something exponentially small, of order exp(-n times norm mu to the fourth over d), which is small as long as n times norm mu to the fourth over d is large. And so what these results say is that you have benign overfitting if the mean separation grows like d to the beta, where beta is between one fourth and one half. Why is this benign overfitting? Well, it's overfitting because there's label noise in the problem and you're achieving a perfect fit to the training data, 100% training accuracy. And it's benign because the test error is very close to the noise rate. To parse this theorem a bit more, something else worth mentioning is that this test error bound is optimal for the problem: exp(-n times norm mu to the fourth over d) is the best you can hope for in this high-dimensional setting. So really, it's a very benign form of overfitting, because you achieve the optimal excess risk for this problem even while overfitting. To further understand what this result says, you can think about what label noise rate you can tolerate. If you consider tau equal to 1, that's just the literal average of the training data, which is 1-uniform, and we can tolerate any noise rate strictly smaller than one half. So even when 49% of the training labels are flipped, you can achieve test error that is exponentially close to 49%. And recalling our previous result, we said that gradient flow on two-layer leaky ReLU networks converges in direction to a classifier that is tau-uniform for tau at most a constant.
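Here's a small simulation of this (a sketch under the talk's data model, with parameters I picked for illustration): sample the noisy Gaussian mixture, form the 1-uniform classifier, the plain average of the yi xi, and check that it fits the noisy training labels perfectly while its test error stays close to the noise rate p.

```python
# Sketch: benign overfitting of the 1-uniform classifier on the noisy Gaussian mixture.
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 50, 20_000, 0.3                 # noise rate p < 1/2 (the tau = 1 case)
mu = np.zeros(d)
mu[0] = d ** 0.26                         # ||mu|| ~ d^beta with beta just above 1/4

def sample(m):
    y_clean = rng.choice([-1.0, 1.0], size=m)
    x = y_clean[:, None] * mu + rng.standard_normal((m, d))
    flip = rng.random(m) < p              # flip each label with probability p
    return x, np.where(flip, -y_clean, y_clean)

X, y = sample(n)
u = (y[:, None] * X).mean(axis=0)         # 1-uniform classifier: average of y_i * x_i

train_acc = np.mean(np.sign(X @ u) == y)
test_errs = []
for _ in range(8):                        # evaluate test error in small chunks
    Xt, yt = sample(500)
    test_errs.append(np.mean(np.sign(Xt @ u) != yt))

print(f"train accuracy = {train_acc:.3f}")                  # 1.000: fits the noisy labels
print(f"test error     = {np.mean(test_errs):.3f}  (noise rate p = {p})")  # close to p
```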
So, putting the two results together, gradient flow can tolerate noise rates up to a constant: 1/(1 + tau) with tau of order 1. We have the exact expression for this constant, but it's a lot of equations I didn't want to show. And something to emphasize: when we talk about what's happening with gradient flow on neural networks, because we have this equivalence to a tau-uniform linear classifier, the number of parameters in the network plays no role. So even if you had a neural network with a billion parameters, in this high-dimensional setting, where there are two ways to overfit, one from being high-dimensional and one from having a wide network that could fit things in many different ways, you still achieve the minimax-optimal test error for this problem while overfitting. OK. Any questions about the result? Yeah? No, a one-neuron neural network would work the same, because a one-neuron neural network is a linear predictor, and this result holds for linear predictors. I mean, that's the thing: we're working in high dimensions, and that's how you can overfit. The natural question is how you're even capable of overfitting when you just have a linear model, and the answer is that the dimension is larger than the number of samples. Yes. Not a single real number; yes, correct, a single real number will not work. Yes, yeah. And you need d at least n, right? Yeah, because d needs to be bigger than n: the number of parameters is d, you need d to be bigger than n, and d bigger than n squared is what most of these results actually require. Great question, though. Any other questions? OK, so I want to try to give you some intuition about how such a thing is even possible. What does this estimator look like, that is able to fit the training data perfectly, including the noise, and also generalize well? Because these two things seem to be in contention with one another. So let's look at what happens with the standard average: the 1-uniform classifier u, which is just the exact average of the yi xi over the training data. And let's assume the clusters are isotropic Gaussian. Recall how the data is generated: you first take y tilde, the clean label, uniform over plus or minus 1; then you take x centered at y tilde times mu, plus isotropic Gaussian noise; and then we flip the label to minus y tilde with probability p. So the training data can be partitioned into two sets: the cleanly labeled data, where yi equals y tilde i, and the noisy data, where we flipped the label. Let's write out what this classifier looks like. u is just the average of the yi xi. On the cleanly labeled examples, yi xi equals mu plus yi zi, and on the noisy examples it equals minus mu plus yi zi. So in total we get (number of clean examples minus number of noisy examples) copies of mu, plus the sum over all examples of yi zi. Since we're doing classification, scale doesn't matter, only the direction, so let's rescale everything by 1 over (clean minus noisy). What you get is that this vector has the same direction as mu plus a kind of average of the noise variables.
Let's call this average of the noise variables delta n. It's something that depends on the training data, but it's not something we actually observe, because it's built from the hidden noise variables zi. But this is how we can write the direction of this classifier: the cluster mean mu plus a noise term. We'll call the part in green the signal component, because this is really what we want to learn: if you want to generalize well, you need mu to appear, because to classify a clean test example you need to understand which cluster it came from, and for that you need to know mu. And the tension here is that this signal component helps with generalization but hurts with overfitting. If your classifier were just the vector mu and you have a noisily labeled training example, it will get that example wrong, because the noisy example looks like it came from the other cluster but carries the opposite label. So the signal component helps generalization and hurts overfitting. The overfitting component is the opposite: it helps with overfitting, because if you take delta n and use it to classify a training example (xk, yk), the result is very large and positive; but since delta n is pure noise, it is useless for generalization, there's nothing it can say about a fresh example. So the core thing one needs to do in this problem is to balance these two. And the remarkable thing is that you can balance them: the signal component is large enough to generalize well but not so large that you fail to overfit, and the overfitting component is large enough to overfit but not so large that it prevents you from generalizing well. That's the high-level idea of how this is actually possible. There's a bunch of math behind it that I can get into if it's helpful, but it's just equations; the idea is there. OK, I'm going to start to conclude now, so I can leave a few minutes for questions. So what did we talk about? We talked about the implicit bias of gradient flow in two-layer leaky ReLU networks when the data is nearly orthogonal, where I use this equation to define near-orthogonality; it doesn't capture everything about high-dimensional settings, but it captures some interesting ones. We showed that anything satisfying the KKT conditions for margin maximization in this problem has a linear decision boundary, and that we can actually get an equation for the decision boundary: it's a nearly uniform average of the training data. Then we showed how to use this characterization to say something about the setting where we actually have samples from a distribution, as long as the dimension is sufficiently large relative to the number of examples. In that setting we can show that these tau-uniform classifiers exhibit benign overfitting, and thus so do two-layer leaky ReLU networks. And I briefly talked about how the decomposition into a signal component and an overfitting component shows that, if you balance things delicately, benign overfitting is possible. With that, I'm happy to stop and take any questions. All right. Thank you. So that was a really nice talk, and really nice questions.
That was a great start. So let's keep it up throughout the week, and let's ask a couple more right now. I mean, basically, you cannot have a really high signal-to-noise ratio. The signal-to-noise ratio roughly corresponds to the norm of the expectation of y times x. Yeah, so our proof really crucially relies on this near-orthogonality, and in any setting where near-orthogonality does not hold, we don't really have a way of understanding it. Basically, to be able to understand the KKT conditions for margin maximization, having this property makes your life a lot easier, and when you don't have it, things quickly become intractable, or at least they're intractable to me. It's possible that there are ways of removing it; I'm not aware of any work that does. I mean, there are very few works that are able to say anything about benign overfitting in neural networks, so it's possible that future work will. But this is the most general setting, in the sense that, OK, there's maybe one exception: there are something like five or six papers on benign overfitting in neural networks, and five of the six, I guess, have a variant of this condition. But it's possible. I mean, I don't know; this is all quite new, so there are a lot of things to try. There are, like, two by me and some other people, I guess. There are some from my PhD advisor, Quanquan Gu; his group has a few. Oh, what's the assumption? You said five or six. Oh, they have something like this. Yeah, what's the sixth one? The sixth one, I don't really understand. Is it a kind of low signal-to-noise-ratio regime? No, no, it's a higher signal-to-noise-ratio regime. I don't really know how they show overfitting in it, though; I haven't read it carefully. There's one where they have been able to do it, but it certainly does not hold generically for any KKT point of a margin-maximization problem; it's a very careful trajectory analysis of gradient descent in a two-layer neural network. So they can do something, but it doesn't hold generically for any KKT point, which I think is my saving grace here. That's a good question. So the core thing I like about homogeneity is that it is the only setting where we know that gradient flow on a neural network converges to something satisfying the KKT conditions for margin maximization. A question is whether the analysis could be extended to any KKT point of a margin-maximization problem for a non-homogeneous neural network. But then the issue is that this would only be a statement about margin maximizers of a neural network; we don't know that gradient descent or other optimization algorithms converge to such a thing. And I think there's also the question of whether the proofs can be extended beyond homogeneity; there are some points where we use it, but I don't know. Thank you again.