Fanny, the floor is yours. First of all, I want to thank the organizers for this wonderful event. It's really great, even though I arrived a little late. I'm happy to talk here about some of our recent work, led primarily by my student Konstantin, who is also here, sitting in the middle. Most of the work was really done by him, together with three other students in the lab, some of whom are alumni now: Guillaume, who is now at EPFL, Stefan, who will join Stockholm, and Nicola, who is at Tübingen. Some of the beginning will be slightly repetitive given Yuting's talk on Monday, but okay, let's go to the next slide. So, classical wisdom says that you should avoid fitting noise. Why is that? Here is the introduction-to-machine-learning slide I show every year. The ground truth, the function you want to learn, is the red curve, but you only get to learn it by fitting the black crosses, which are noisy observations coming from that true function. If you have a very complex function class to choose from, say polynomials of high enough degree, you are able to interpolate these black crosses, and the model you learn, the blue curve, ends up very different from the red one and far from the ground truth. So what you do instead is add a penalty: rather than just minimizing the mean squared error on the training points, you add ridge regularization, a penalty term that steers you away from fitting the noise and gives you a model much closer to the true function. That's the classical wisdom: avoid fitting noise. If we look at modern, complex models, however, the picture looks a bit different. Here we are looking at the test error of neural networks trained on noisy CIFAR-10: we take the CIFAR-10 dataset, add 15% artificial label noise, and train a neural network with a first-order method. On the x-axis, you see how the test error of these models changes as you increase the width, in particular beyond the interpolation threshold where you can achieve zero training error. In a second you will see lots of curves, and each curve corresponds to the test error of models taken at different epochs of this training scheme. For now I only want you to focus on the very dark blue one, which is the model at convergence; the other curves we can ignore. If you look at the blue curve, you see the double descent phenomenon that I think Yuting talked about and that a lot of you are probably aware of. But what I want to focus on today is to compare this blue line, the model at convergence, with the model that is stopped at the optimal time for that particular width: for each width, you take the epoch at which the model has the smallest test error. So now we are comparing the blue line, the model at convergence, with the red line, the optimally early-stopped model. If you compare the blue and red lines, you see that close to the interpolation threshold there is actually a huge difference between the best early-stopped model and the model at convergence.
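Before going further with this plot, here is a minimal sketch of that classical intro-slide picture, to make the contrast concrete. The sine ground truth, the noise level, and the ridge penalty are stand-ins chosen for illustration; the actual curve and settings on the slide are not specified in the talk.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

def f_true(x):
    # Stand-in ground truth (the red curve on the actual slide is not specified).
    return np.sin(2 * np.pi * x)

n = 15
x = np.sort(rng.uniform(0, 1, n)).reshape(-1, 1)
y = f_true(x).ravel() + 0.3 * rng.standard_normal(n)  # noisy observations (black crosses)

degree = n - 1  # function class rich enough to interpolate all n points

# Unregularized least squares: (nearly) interpolates the noisy points, far from the truth.
interp = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x, y)

# Ridge regularization: the penalty keeps the fit from chasing the noise.
ridge = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=1e-3)).fit(x, y)

x_grid = np.linspace(0, 1, 500).reshape(-1, 1)
for name, model in [("interpolating fit", interp), ("ridge fit", ridge)]:
    mse = np.mean((model.predict(x_grid) - f_true(x_grid).ravel()) ** 2)
    print(f"{name:17s}: mean squared distance to ground truth = {mse:.3f}")
```

In a typical run the regularized fit sits much closer to the true curve, which is exactly the "avoid fitting noise" message of the slide; the precise numbers of course depend on the noise level and the penalty.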
Coming back to the neural-network plot: if you look further down, at very large models where the width is really large, you actually have harmless interpolation, which means that trying to avoid fitting noise is not buying you much anymore. This is different from the first slide, where avoiding noise gave you better generalization, a model closer to the true function; here that is not necessarily the case anymore. And at the same time, you also get good generalization. This is slightly startling, and we are not quite sure why it happens, so a lot of work has tried to understand it. In order to understand it, we first want to move away from neural networks, because they are complicated: they minimize non-convex losses with first-order methods, they learn features, and overparameterization there essentially means varying the width of the hidden layers. Instead of looking at these complicated objects, we want to see whether we can understand interpolation of noisy data in a much simpler setting, namely linear interpolators. Here we have reduced the complexity from non-convex losses to convex optimization problems, and we no longer have feature learning; we look at fixed features, but we still model overparameterization, the largeness of the model, by the number of features. If the number of features d is much larger than the number of samples, the hope is that we can transfer some of the intuition back to very wide neural networks. So what is known about linear interpolators, or linear models, in high dimensions? There are two facts that are already known. The first is an old truth that probably everybody here is familiar with: if the ground truth, the true parameter w*, has some simple structure such as sparsity, then we can find w*, or something close to it, by introducing a strong matching inductive bias towards that structural simplicity, in this case by looking for an estimator that is very sparse. When we do this we have a chance, because we are essentially reducing the flexibility of a large d to something much smaller: we only focus on vectors that are sparse, and that reduced flexibility gives us a chance to learn even when we have very few samples compared to the number of features. Specifically, it is well known that with noiseless data, if we interpolate by finding the minimum-l1-norm solution that fits the data, we get perfect recovery even when the number of samples is much smaller than the dimensionality, as long as it is of the order of the sparsity s, up to logarithmic factors. If we have noisy observations, however, we again need to avoid fitting noise by adding this penalty with a nonzero lambda. If we choose the right lambda as a function of the dimensionality, the sample size, and the sparsity, we get a minimax-optimal estimation error rate. So what motivated us was to look at noisy interpolation: we understand that regularized estimators like the lasso can do very well in high dimensions and achieve the minimax-optimal rate, but what happens when we enforce interpolation of noisy data?
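Before getting to that, the two known facts just described can be illustrated with a small simulation. This is only a sketch under the assumptions above (i.i.d. Gaussian design, s-sparse ground truth); the problem sizes, the noise level, and the lasso penalty of order sigma * sqrt(log(d)/n) are illustrative choices, and it assumes cvxpy and scikit-learn are available.

```python
import numpy as np
import cvxpy as cp
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, s = 100, 1000, 5                      # few samples, many features, s-sparse truth
X = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[:s] = 1.0

# Fact 1 (noiseless): minimum-l1-norm interpolation (basis pursuit) recovers w_star
# exactly once n is of the order of s, up to log factors.
y_clean = X @ w_star
w = cp.Variable(d)
cp.Problem(cp.Minimize(cp.norm(w, 1)), [X @ w == y_clean]).solve()
print("noiseless min-l1 interpolation:  ||w_hat - w*|| =", np.linalg.norm(w.value - w_star))

# Fact 2 (noisy): do not interpolate; the lasso with a penalty of the right order
# (here a standard choice of order sigma * sqrt(log(d)/n)) gives the minimax-optimal rate.
sigma = 0.5
y_noisy = y_clean + sigma * rng.standard_normal(n)
lam = sigma * np.sqrt(np.log(d) / n)
lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=50_000).fit(X, y_noisy)
print("lasso on noisy data:             ||w_hat - w*|| =", np.linalg.norm(lasso.coef_ - w_star))
```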
Now, for noisy interpolation with the minimum l1 norm we still have relatively little intuition; in particular, when can we get asymptotic consistency, and can we get it at all? We'll talk about that in a second. So this is the first, older perspective, one way to think about why noisy interpolation could work well. A lot of work has also been done recently on another interpolator of noisy data, not the minimum-l1-norm but the minimum-l2-norm interpolator. There has also been some work on l1, and I'll get to that later. One big takeaway from that line of work is an intuition for why interpolation could be harmless: when d is much larger than n, the variance decreases, and hence if I compare the minimum-l2-norm interpolator with ridge regularization, which is essentially just minimizing ||y - Xw||_2^2 + lambda ||w||_2^2, I get very similar generalization performance. So regularization doesn't really help; avoiding to fit the noise doesn't buy you anything, because the variance of the interpolator is already very small. But there is a caveat to this line of work: the estimation error is lower bounded by a constant bigger than zero, even in the asymptotic limit where n and d go to infinity. Both in the regime where d/n tends to a constant and in the regime where d/n^beta tends to a constant with beta bigger than one, so d is bigger than n and you are beyond the interpolation threshold, you get an error that is non-vanishing. So in some sense you have bad generalization: even if the ground truth is very sparse, the minimum-l2-norm interpolator is not able to take advantage of it to get a good estimation error. There is one exception, where you can speak of good generalization in terms of the prediction error: if your covariates have a low-dimensional intrinsic structure, for example a spiked covariance, then you can have both low variance and low bias, leading to a vanishing prediction error. But that is an exception; let's go back to the question of having both a small estimation error, that is, good generalization, and harmless interpolation. Is that possible? From the old perspective we saw that you can have good generalization if you regularize, and from the newer line of work we saw that you can have harmless noisy interpolation, but in a regime where generalization is not necessarily good. So is it possible to get the best of both worlds, when you have a structured ground truth but unstructured covariates, for example isotropic Gaussians? Throughout this talk we therefore focus on minimum-lp-norm interpolators, where p essentially varies the inductive bias towards sparsity. We saw that p = 2 gives bad generalization, a high mean squared error, but harmless interpolation: it doesn't matter whether you use ridge or interpolate. And for p = 1 we have very good regularized estimators. So the question is: for p between one and two, can we have an lp-norm interpolator that is also close to the minimax-optimal rate of the regularized estimator? That is the goal of today.
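To see this "harmless but not good" behaviour of the minimum-l2-norm interpolator concretely, here is a small simulation sketch. The sizes n and d, the noise level, and the ridge penalty lambda = 1 are arbitrary illustrative choices, not settings from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 2000, 0.5
X = rng.standard_normal((n, d))             # isotropic Gaussian covariates
w_star = np.zeros(d)
w_star[0] = 1.0                             # sparse ground truth
y = X @ w_star + sigma * rng.standard_normal(n)

# Minimum-l2-norm interpolator: the limit of ridge regression as lambda -> 0.
w_interp = np.linalg.pinv(X) @ y

# Ridge regression, min ||y - Xw||^2 + lam * ||w||^2, with a moderate penalty.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for name, w in [("min-l2 interpolator", w_interp), ("ridge (lam = 1)", w_ridge)]:
    print(f"{name:20s}: ||w - w*||^2 = {np.linalg.norm(w - w_star) ** 2:.3f}")
# Both errors are of the same, non-vanishing order: regularizing barely changes anything
# ("harmless interpolation"), but neither estimator exploits the sparsity of w_star.
```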
Okay, good. So these are the objects we want to study today, and what we are curious about is: can we get interpolators with good generalization error, close to the best regularized estimator we know, like the lasso? For this purpose we of course need to assume something; if we assume nothing about the ground truth, there is no hope. For simplicity we are going to focus on a one-sparse ground truth, but most of the results hold for general s-sparse ground truths. Just for simplicity of the presentation, we consider the first standard basis vector, (1, 0, 0, ..., 0), as the ground truth. Of course, we don't know the location of that particular one: it's one-sparse, but we don't know where. I will then use the following language. I'll say that estimators have a weak or no inductive bias if they essentially only encourage a small l2 norm, which gives you no sense of directionality: all directions are equally good. In contrast, estimators with a strong inductive bias are the ones encouraging a small l1 norm. So p = 1 is a strong inductive bias and p = 2 is a weak-to-no inductive bias towards the simple structure, which here is the sparse ground truth. Our results hold specifically for Gaussian covariates X, where the rows of the data matrix are i.i.d. isotropic Gaussian, and we also assume i.i.d. Gaussian noise, as opposed to, for example, adversarial noise. The Gaussianity is quite important here, since one of the tools we use is the Convex Gaussian Minimax Theorem (CGMT); some of you here are very familiar with it, or are in fact among its originators. We also look at the regime where d is of the order n^beta with beta bigger than one. This is a regime where we can hope for consistency; it is not the inconsistent regime that Yuting looked at on Monday, where d/n goes to some gamma. So we have a chance that asymptotically our error vanishes. The performance metric we look at is the prediction error, which for isotropic Gaussian covariates is just equal to the mean squared error, the squared Euclidean distance between the interpolator w-hat and the ground truth w*. We already discussed that for p = 2 you don't have much hope, because you have no guidance towards the right kind of solution when d is much larger than the number of samples, so you have a high mean squared error. What we also know is that the variance is actually quite small, that's the harmless-interpolation part, but you have a very high structural bias, because you don't know where to search, even in the noiseless case. The question, then, is what happens between p = 1 and p = 2, and even for p = 1 itself. The first result I want to present is a slow rate for p = 1. We essentially had a constant rate for p = 2, and for p = 1 we do not get much better. Before our result, there were two competing bounds here: the lower bound said you should get something like 1 / log(d/n), which is roughly 1 / log n, while the known upper bounds were of constant order, similar to p = 2. So the question was, which of the two is tight?
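Before getting to the answer, here is a minimal sketch of the object under study in exactly this setup: isotropic Gaussian design, one-sparse ground truth, i.i.d. Gaussian noise, and d of the order n^beta. The helper name min_lp_interpolator is hypothetical, the chosen n, beta, sigma and grid of p values are for illustration only, and the sketch assumes a cvxpy installation with a conic solver that handles general p-norms.

```python
import numpy as np
import cvxpy as cp

def min_lp_interpolator(X, y, p):
    """Minimum-lp-norm interpolator: argmin ||w||_p subject to X w = y (requires p >= 1)."""
    w = cp.Variable(X.shape[1])
    cp.Problem(cp.Minimize(cp.pnorm(w, p)), [X @ w == y]).solve()
    return w.value

rng = np.random.default_rng(0)
n, beta, sigma = 100, 1.5, 0.5
d = int(n ** beta)                          # overparameterized regime d ~ n^beta, beta > 1
X = rng.standard_normal((n, d))             # rows are i.i.d. isotropic Gaussians
w_star = np.zeros(d)
w_star[0] = 1.0                             # one-sparse ground truth, location unknown to the estimator
y = X @ w_star + sigma * rng.standard_normal(n)   # i.i.d. Gaussian noise

for p in [1.0, 1.1, 1.5, 2.0]:
    w_hat = min_lp_interpolator(X, y, p)
    # With isotropic covariates, the prediction error equals ||w_hat - w*||^2.
    print(f"p = {p:.2f}: estimation error ||w_hat - w*||^2 = {np.linalg.norm(w_hat - w_star) ** 2:.3f}")
```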
The answer is that the lower bound is the tight one: we prove a matching bound, tight up to leading constant factors, of sigma^2 / log(d/n), where sigma^2 is the noise variance, plus some higher-order terms that vanish faster. To make that a bit easier to parse, plug in d of the order n^beta: you get a rate of essentially 1 / log n. If you compare that with the lasso, the regularized estimator, there is really no harmless interpolation here: forcing yourself to interpolate the noise with the minimum-l1-norm interpolator gets you 1 / log n versus a really fast rate of essentially 1/n, up to logarithmic factors. So here interpolation is doing a lot of harm. Before we try to understand interpolation further, let's go back to what regularization is helping us do. Why do we get such good performance with regularization? With the lambda term I can trade off bias and variance: if I choose a large lambda, I really avoid fitting the noise, so my variance goes down, but the problem is that I get a high structural bias; and the opposite holds when lambda is small. The limit of lambda going to zero is the minimum-l1-norm interpolator, where I have very high variance even though my structural bias is small. With the lasso I can pick this optimal lambda, but with interpolators I cannot pick lambda anymore; I have no flexibility and I'm stuck with the lambda-close-to-zero picture. So this is my picture for p = 1: I have high variance even though I have a small structural bias. I know where to search, but the noise is still hurting me, and it hurts much more than for p = 2. If you compare the two, both have high errors, but the reasons are different: for p = 1 it's because of high variance, and for p = 2 it's because of high bias. Intuitively, why do I have much higher variance for p = 1? Think of w* equal to zero, so you are just fitting noise, and the variance is essentially the norm of the vector that fits the noise. In that case, with minimum-l2-norm interpolation of the noise you can search in all directions, you can use all directions to fit your noise, so the norm you need can be much smaller than in the p = 1 case, where you are essentially confining your search to sparse vectors and may have to use a vector with a larger norm to fit the noise perfectly. Roughly, the smaller flexibility here actually creates a higher variance, which is the opposite of the intuition you have for regularized estimators, where less flexibility to fit the noise means less variance. That's an interesting intuition to keep in mind. This also shows up in practice when we look at sensitivity to noise. Here we ran an experiment: fix the dimensionality and the number of samples, increase the noise, and look at the risk, the MSE, of the minimum-l1 versus minimum-l2-norm interpolators. You can see that as you increase the noise, the risk increases much faster for l1 than for l2. And actually, between p = 1 and p = 2 you can see a trade-off. Just ignore the blue line for a moment.
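A simulation along the lines of that noise-sensitivity experiment might look as follows. The fixed n and d, the grid of noise levels, and the use of cvxpy for the l1 problem and the pseudoinverse for the l2 problem are my illustrative choices, not the talk's actual experiment.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d = 50, 500                              # dimension and sample size held fixed
X = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[0] = 1.0

for sigma in [0.0, 0.25, 0.5, 1.0]:         # only the noise level is increased
    y = X @ w_star + sigma * rng.standard_normal(n)

    # Minimum-l1-norm interpolator.
    w1 = cp.Variable(d)
    cp.Problem(cp.Minimize(cp.norm(w1, 1)), [X @ w1 == y]).solve()

    # Minimum-l2-norm interpolator.
    w2 = np.linalg.pinv(X) @ y

    err1 = np.linalg.norm(w1.value - w_star) ** 2
    err2 = np.linalg.norm(w2 - w_star) ** 2
    print(f"sigma = {sigma:.2f}:  min-l1 risk = {err1:.3f}   min-l2 risk = {err2:.3f}")
# The min-l1 risk grows much faster with sigma: the strong inductive bias helps
# when the data are noiseless but is more sensitive to noise.
```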
Coming back to the figure: between p equal to one here and p equal to two here, the bias goes up (for p = 2 you don't have any clue where to search), but the variance goes down, and this leads to a U-shaped curve as a function of p, not as a function of lambda or model complexity as you are usually used to. So essentially we have a new kind of bias-variance trade-off for interpolators, as a function of the strength of the inductive bias. So maybe if we pick some p in between, we get a better rate than the minimum-l1 or minimum-l2-norm interpolator, and this is exactly the question for the rest of the talk. We have these two pretty slow rates, but what happens in between? Can we actually get rates that are close to the minimax-optimal rate, or to the lasso? This is the next result I want to present. I don't want to go into the details now, but we can prove polynomial rates, of the form one over n to the alpha. What do these rates look like? They depend on beta, the regime where d is of the order n^beta, and of course on p. So instead of showing you a bunch of formulas, we plot the rates, and I want to go through this part slowly. On the y-axis we have the rate exponent, the final rate we are interested in. What we really want is a fast rate, alpha equal to one, which would be the rate one over n. What we have for p equal to one and two is essentially a rate that is constant up to log factors; the one over log n we also treat as essentially constant. So those two are known, the two horizontal lines, and in between we plot the rate for different p's and betas, where beta is the degree of overparameterization. Everything is in the d-bigger-than-n regime, so we can interpolate the data. Let me go through the plot slowly: pick some fixed regime, say beta close to three, and look at the rates for different p's; these are the lines for different p's. If we pick p equal to 1.2, that gives us a rate that's still pretty slow. But if we decrease this p to 1.1, and further to 1.01, we get a rate that is really close to one over n, very close to minimax optimal. And if we look at beta close to two, the rate gets really, really close to one over n, arbitrarily close, if we let n be large enough. I just want to note that these techniques also apply to classification, which I'll show on the next slide, and they also allow an extension to non-isotropic Gaussians and general s-sparse w*. For classification we get the same kind of picture, but the difference is that we actually get exactly one over n for a large range of betas: for a large range of betas, if we pick p equal to 1.01, we get a minimax-optimal rate. So just to take a step back, what did we just show? We showed that we can get very good generalization compared to the best regularized estimator; we can basically match it for certain regimes and certain choices of p between one and two. And hence we also get harmless interpolation, because we are essentially matching the best regularized estimator, which is the lasso.
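One rough way to probe such rate exponents empirically, which is not the experiment from the talk, is to fit the slope of log(error) against log(n) for a fixed p and beta. The helper name min_lp_error, the choices p = 1.1 and beta = 1.5, and the small sample sizes are illustrative; at these sizes, constants and log factors will bias the estimated exponent, so this only indicates the trend.

```python
import numpy as np
import cvxpy as cp

def min_lp_error(n, beta, p, sigma=0.5, seed=0):
    """Estimation error of the min-lp-norm interpolator for one draw, with d = n^beta."""
    rng = np.random.default_rng(seed)
    d = int(n ** beta)
    X = rng.standard_normal((n, d))
    w_star = np.zeros(d)
    w_star[0] = 1.0
    y = X @ w_star + sigma * rng.standard_normal(n)
    w = cp.Variable(d)
    cp.Problem(cp.Minimize(cp.pnorm(w, p)), [X @ w == y]).solve()
    return np.linalg.norm(w.value - w_star) ** 2

beta, p = 1.5, 1.1                          # pick one overparameterization regime and one p
ns = [50, 100, 200]                         # kept small so the sketch runs quickly
errs = [np.mean([min_lp_error(n, beta, p, seed=s) for s in range(5)]) for n in ns]

# If the error behaves like C * n^(-alpha), the slope of log(error) vs log(n) is -alpha.
alpha = -np.polyfit(np.log(ns), np.log(errs), 1)[0]
print(f"beta = {beta}, p = {p}: empirical rate exponent alpha ~ {alpha:.2f}")
```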
Okay, for classification I also have a couple of experiments, on synthetic data and on real data, and the takeaway from both is essentially that in the noisy case, sigma nonzero, the orange or the grey line, the best p for the interpolator is in between one and two, whereas for noiseless data the best p is equal to one. So here you see this difference between noisy interpolation and noiseless interpolation: once you have noise, you want to move away from one to attenuate the noise better. Since there is not too much time left, let me summarize what I've been telling you today. One major point was a new intuition for interpolators of noisy data as a function of inductive-bias strength, as opposed to model complexity as is known from classical statistics. Here we have a new bias-variance trade-off as a function of p, which is essentially our proxy for the strength of the inductive bias. And we proved some new bounds for regression and classification showing that if you choose p in between, you can actually get a very good estimation error rate, close to the minimax-optimal rate, and for classification even equal to it, for very large d and n. The follow-up question is of course how, or whether, this intuition transfers back to the original first or second slide on neural networks. There is lots to be done there to transfer that intuition back, and we are working on a few experiments to see if we can validate it in any way, but there is lots more to do. I'm happy to chat with you if you have any questions after the talks, in the break, and also otherwise. Thanks a lot for your attention.

Okay, the floor is open for questions. Thank you for this great talk. My question is the following: before trying to move to neural networks, which is pretty hard, do you know if some of your conclusions extend to the case where you have a mismatch, so when the model is, let's say, non-linear, but not necessarily a neural network, something easier, and you try to fit it with a linear model like this? Yes, we actually had an experiment in the NTK case, so with a kernel, and there we saw a similar phenomenon, in the sense that you also get these curves where the best interpolator in the noisy case is different from the one in the noiseless case. So that we do see. But the question is always, if you move in that direction, what a reasonable strength of inductive bias is; you can think of sparsity in the kernel regime as well, but beyond that it gets very hard to think about a reasonable inductive bias. I was curious: you took a special case, a signal with a sparsity pattern with just one non-zero component. Do things extend to the regime where you have a linear fraction of non-zero components? Yes, this is a very good question, and we think it would be quite hard to show the same; with the same techniques, we are not sure we can do that. So for now it's just constant sparsity. Okay, but do you think it extends, or you don't know? Yeah, I can't say. Okay, thanks. More questions? I'll ask one in the meanwhile: anything special about beta equal to two, the point where you do reach the minimax-optimal rate? Why for beta equal to two and not another beta? Do you have any intuition about why that is the optimal beta, or is it just a coincidence?
Well, there is essentially a universal lower bound for all interpolators, and it actually matches this line going down. So only at that point, beta equal to two, can you even get a one-over-n rate: the lower bound is of the order n over d, so if beta is smaller than two, you cannot get one over n anyway. That's the universal lower bound, and beyond that you have a chance. I was wondering how come this goes... Ah, yes: it's basically the variance here that has to go down first, and the curves are tight here. Right, thank you very much. Another question? You have a question? Yes, a quick one. Thank you, Fanny. Just a clarification: all these results were for isotropic matrices, right? Yes, yes. Could you comment a little bit on what happens with structured covariance? If you have a spiked covariance, we don't have the result yet, but we believe that with the way we are proving it, via the CGMT, you should still be able to get similar results. As far as I know, even for the minimum l2 norm in classification, using this type of analysis, I don't know if there are results with structured covariance, correct me if I'm wrong. Maybe you have some, so that's the question. No, we don't have structured covariance yet, but it's ongoing work. Thanks. Okay, so let's thank Fanny again.