Do you hear us? Yes, yes. Wonderful, good morning. Good morning, everyone. Thank you for joining us. Yuting Wei is an assistant professor at the Wharton School of the University of Pennsylvania, and she's going to give the last talk of this morning's session. Can you share your screen? Recording in progress. Yes, we see your slides. So just take it away. We'll have about 20 minutes for your talk and then five minutes for questions, OK? OK, sure. Hello, everyone. I want to first thank all the organizers for inviting me and for putting together this wonderful workshop. I really wish I could be there in person with you, but I'm sure we will have a chance to meet in person soon. OK, so the plan of my talk today is to discuss a certain type of estimator in the regression setting which enjoys a curious multiple-descent phenomenon. This is joint work with my student, Yue Li, at CMU Statistics, and the details of this talk can be found in our arXiv preprint. As we all know, deep neural networks have found applications in many science and engineering fields. In fact, the structure of a deep neural network is not that complicated: given a feature vector x, it applies a sequence of linear and nonlinear transformations to this input and then outputs either a label or a regression value. Over the last couple of decades, people have tried to understand why this simple structure can achieve superhuman performance. The talk by Nikolai just now gave us a very nice algorithmic point of view on the performance of neural networks. However, there is still no unified answer to why neural networks perform so well, though people have identified a few key features of this framework.
For example, we know that neural networks are usually heavily over-parameterized, in the sense that the number of parameters in use often exceeds the number of observations. Also, neural networks are often trained beyond the point where zero training error is obtained. The reason people prefer larger, more over-parameterized models over simple models is that, in practice, larger models are observed to behave much better. For example, the paper by Nakkiran et al. in 2019 carried out a series of experiments showing that if you plot the test error and training error while increasing the number of parameters, you keep benefiting from using more and more parameters. In their plot, the training error always decreases, while the test error first blows up a little bit and then continues to decrease. This phenomenon has been observed not only in that paper but in a sequence of papers, and people consistently find it much better to use larger models. So this motivates the question: how do these neural networks manage to generalize? Because if you open any book on statistical learning, you will see the classical U-shaped curve, which tells you that if you increase the capacity of your model and plot the training and test errors, at some point increasing the capacity will hurt the generalization error: the bias-variance trade-off. More recently, this was nicely summarized by Belkin and his co-authors as the double descent phenomenon. Before what they call the interpolation threshold, you still see the classical U-shaped generalization curve. But beyond this threshold, in what they call the modern interpolating regime, as you further increase the capacity, the test error, or generalization error, decreases again. This, of course, conflicts with our classical statistical wisdom.
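The double descent shape described above is easy to reproduce for the simplest interpolating estimator, the minimum L2-norm least-squares fit. Below is a small simulation sketch (the sizes, the signal normalization, and the function name are illustrative choices, not the speaker's experiment): the average test risk shows the classical U-shape below p = n, a spike at the interpolation threshold p = n, and a second descent beyond it.

```python
import numpy as np

def mean_risk(p, n=50, sigma=0.5, reps=30, seed=0):
    """Average test risk of the minimum L2-norm interpolating fit
    (pseudo-inverse least squares) at aspect ratio p/n."""
    rng = np.random.default_rng(seed)
    risks = []
    for _ in range(reps):
        theta_star = rng.normal(size=p) / np.sqrt(p)   # signal with norm ~ 1
        X = rng.normal(size=(n, p))                    # i.i.d. Gaussian design
        y = X @ theta_star + sigma * rng.normal(size=n)
        theta_hat = np.linalg.pinv(X) @ y              # min L2-norm solution
        # Excess risk + noise floor under isotropic Gaussian features.
        risks.append(np.sum((theta_hat - theta_star) ** 2) + sigma**2)
    return float(np.mean(risks))

# Sweep the aspect ratio p/n: U-shape, spike near p = n, second descent.
for p in (10, 25, 45, 50, 55, 100, 400):
    print(p / 50, mean_risk(p))
```

The risk near p = n is dominated by the instability of the pseudo-inverse, which previews the condition-number explanation given later in the talk.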
So it motivates us to study classical estimators in the modern interpolating regime, where interpolation really happens. So far, the theoretical understanding is still quite limited. Most of the effort has focused on the minimum L2-norm interpolator, defined as the minimizer of a ridge-like penalty as the regularization parameter lambda goes to 0. A sequence of papers has analyzed the performance of this minimum L2-norm interpolator under the linear model, and it has been observed that even this very simple estimator exhibits the double descent curve. On the left-hand side is the paper by Hastie et al., where they show that ridgeless regression for a misspecified linear model exhibits this double descent curve. On the right-hand side is the paper by Mei and Montanari, where they show that this very simple estimator under a random features model also exhibits the double descent phenomenon observed in neural network training. However, sometimes the L2 norm may not reflect the underlying geometry, so we want to know what happens for other types of interpolators. For example, what happens for the minimum Lq-norm interpolator, where the ridge penalty is replaced by an Lq norm? In this talk, I will focus my attention on a particular minimum Lq-norm solution, namely q equal to 1. Why do we care about the minimum L1-norm solution? We know that the L1 penalty is often used as a surrogate for the L0 penalty to encourage sparse solutions. Also, because it is a convex penalty, we can use efficient convex optimization algorithms to approximate the solution. And as Nikolai also pointed out, there are many algorithms known to converge to the minimum L1-norm solution. For example, it's known that AdaBoost converges to the minimum L1-norm solution for linearly separable data.
And it is also known that certain gradient descent algorithms, for example for the matrix factorization problem, converge to the minimum nuclear-norm solution. There are also many empirical successes of dropout and model pruning in deep learning, where you train a larger model and then prune it down to a smaller model; in many empirical cases, the smaller model has better performance. So the message from this line of work is that a smaller model may behave similarly, or even better. It may not take you long to realize that the minimum L1-norm solution in the noiseless case connects back to the well-known basis pursuit estimator: if you restrict yourself to interpolators, that is, solutions with zero training error, the minimum L1-norm solution is the one among them with the smallest L1 norm. So it naturally connects to the compressed sensing literature, where people have developed a nice theory for the basis pursuit estimator in the noiseless case. However, in the noisy and over-parameterized case, the natural question is how the generalization error of this estimator depends on the ratio p over n, where p is the number of features and n is the number of observations. So first we carried out a series of experiments to see how the minimum L1-norm solution performs. In our first experiment, the x-axis is the ratio p over n and the y-axis is the rescaled generalization error. If you take the true signal theta star to be a sparse vector, where the sparsity is proportional to the number of samples and the number of features, you can see that the minimum L1-norm solution exhibits a multiple-descent phenomenon: it undergoes phases of descent, ascent, and descent. And in the end, as you keep increasing the ratio p over n, it decreases to the error of the zero estimator.
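The basis pursuit connection described above can be made concrete: the minimum L1-norm interpolator is the solution of a small linear program, obtained by splitting theta into positive and negative parts. Here is a minimal sketch (not the speaker's code; the function name and the specific instance are illustrative), assuming SciPy's `linprog`:

```python
import numpy as np
from scipy.optimize import linprog

def min_l1_interpolator(X, y):
    """Solve min ||theta||_1 subject to X theta = y as an LP.

    Write theta = u - v with u, v >= 0; then ||theta||_1 = sum(u + v)
    at the optimum, and the constraint becomes [X, -X] [u; v] = y.
    """
    n, p = X.shape
    c = np.ones(2 * p)                      # objective: sum(u) + sum(v)
    A_eq = np.hstack([X, -X])               # interpolation constraint
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    assert res.success, res.message
    return res.x[:p] - res.x[p:]

# Small over-parameterized instance (p > n), noiseless for clarity.
rng = np.random.default_rng(0)
n, p, s = 20, 60, 4
X = rng.normal(size=(n, p)) / np.sqrt(n)    # i.i.d. Gaussian design
theta_star = np.zeros(p)
theta_star[:s] = 1.0                        # s-sparse true signal
y = X @ theta_star

theta_hat = min_l1_interpolator(X, y)
print(np.max(np.abs(X @ theta_hat - y)))    # ~0: it interpolates exactly
```

Repeating this over a grid of p/n ratios with noisy observations is what produces the multiple-descent curves described in the talk.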
OK, and a similar phenomenon happens if you fix the ratio between the sparsity and the number of features. In this work, we want to understand how to theoretically characterize this descent. There are a few key challenges. For example, there is no closed-form solution for the minimum L1-norm interpolator, unlike the minimum L2-norm interpolator. Also, there is no consistent support recovery in the high-dimensional regime when the sparsity is proportional to the number of features. And there is no strong convexity, or restricted strong convexity, in this optimization problem. So how do you characterize the behavior of this estimator? Those are several key challenges. Before telling you our main result, let me formally set up the problem. We consider a simple stylized setting: a linear model, where the observation vector y equals X times the true parameter theta star plus a noise vector z, and the true signal theta star is s-sparse. We focus our attention on the proportional regime, where the sparsity, the number of features, and the number of observations all grow proportionally with each other. We also focus on Gaussian design and Gaussian noise. This is the simplest setting one can consider, where each row of the X matrix is an i.i.d. Gaussian vector and the noise is independent Gaussian. Under this setting, we consider the asymptotic risk, defined as follows: if you sample a new pair (x_new, y_new) from this linear model, the generalization error is the expectation of the squared loss. By a simple calculation, this takes the displayed form.
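The "simple calculation" at the end can be spelled out for isotropic features: if x_new has i.i.d. standard Gaussian entries and the noise has variance sigma^2, then E[(y_new - x_new . theta_hat)^2] = ||theta_hat - theta_star||^2 + sigma^2. A quick Monte Carlo check of this identity (an illustrative sketch with arbitrary sizes; the paper's normalization of the design may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
p, sigma = 30, 0.5
theta_star = rng.normal(size=p)
theta_hat = theta_star + 0.1 * rng.normal(size=p)   # any fixed estimate

# Monte Carlo estimate of E[(y_new - x_new . theta_hat)^2] over fresh draws.
m = 200_000
X_new = rng.normal(size=(m, p))
y_new = X_new @ theta_star + sigma * rng.normal(size=m)
mc_risk = np.mean((y_new - X_new @ theta_hat) ** 2)

# Closed form: squared estimation error plus the irreducible noise floor.
closed_form = np.sum((theta_hat - theta_star) ** 2) + sigma**2
print(mc_risk, closed_form)
```

The two printed numbers agree up to Monte Carlo error, which is why the talk can study the risk purely through ||theta_hat - theta_star||.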
And we consider the high-dimensional asymptotics: as the number of features and the number of observations both grow while their ratio stays at a constant delta, can we characterize the descent of this risk as a function of delta? I hope that is clear. We are not the first to study this problem; here is an incomplete list of the literature within this exact-asymptotics framework, with beautiful work on compressed sensing, robust regression, classification, and sparse recovery for spiked models. OK, so now let me state our main result. Suppose each entry of the true signal theta star is generated from a mixture of two distributions: with probability epsilon, it equals some number, which you can think of as a rescaled constant, and with probability 1 minus epsilon, theta star i equals 0. So each entry is either a constant or 0, ensuring that theta star has about epsilon times p nonzero entries. Under this assumption, we are able to show that as you increase the ratio p over n, the risk curve eventually converges to the risk of the zero estimator; in the limit, it does what the zero estimator does. We are also able to show that, at a given aspect ratio, there exists some sparsity level epsilon at which the risk decreases with p over n: if you fix delta, the p over n ratio, the curve will have a negative slope at that point once you decrease the sparsity level epsilon. What's more, we can show that around 1 and around infinity, there will always be a negative slope. And if you consider a fixed signal-to-noise ratio, then if you decrease the sparsity level, there will always be a region where the curve has an increasing phase. So we are able to show that around 1 and around infinity there will always be a negative slope.
And in the middle, there will be an increasing phase. Some heuristic explanation for this curve: first, why is there a peak at the interpolation point p equal to n? As shown by the nice three-page paper by Poggio et al. in 2020, if you just plot the condition number of X, you already see a peak at p equal to n, because the condition number is unstable there. So you should always expect a peak at the interpolation point. Then, why should you expect the risk to further increase with p over n? Because for the minimum L1-norm solution, the size of the support is always equal to n whenever p is larger than n, so it implicitly encourages a sparse solution. If your signal is not strong enough and you try to recover a sparse solution, the resulting support is actually incorrect, and when the support is incorrect, the estimator behaves even worse than the zero estimator. That is why there is an increasing phase where the risk is worse than the zero estimator's. With this, I want to emphasize that because the minimum L1-norm solution has a nice interplay between the over-parameterization ratio and the sparsity, you get a richer phenomenon. Compare this with the minimum L2-norm interpolator, where people can write out the risk explicitly and analyze its behavior by taking the derivative with respect to delta, the p over n ratio. The minimum L1-norm solution has more complicated behavior. Indeed, to prove our result, we resort to the approximate message passing (AMP) machinery. Due to the time constraint, I will not go into details, but we are able to characterize the risk of the minimum L1-norm solution in terms of a quantity tau star, where tau star is the unique solution to a set of fixed-point equations. So there is a closed-form characterization through this tau star.
And by analyzing how tau star behaves as a function of delta, we are able to characterize the ascent and descent of this risk curve. That is the high-level idea of how we prove the result. OK, I'm running out of time, but it may not take you long to realize that there is something pathological, even unreasonable, about this curve. If you take two points, one on the left and one on the right, they correspond to different p over n ratios, and the one on the left corresponds to a larger number of observations. So you can imagine that if you down-sample your data, you can actually reach a lower generalization risk than by using the bigger data set. In a new arXiv paper, we are able to show that you can go from a multiple-descent curve to a single-descent curve by adapting cross-validation to the high-dimensional regime. OK, so just to conclude, we are able to theoretically characterize the descent-and-ascent curve of the minimum L1-norm solution. As for future directions: in our simulations, if you consider a more complicated design where the features are generated with a general covariance structure, you see even more oscillations in the risk curve, with more phases of ascent and descent. How do we characterize those? Also, the theory developed in this work is mainly asymptotic; can we develop a non-asymptotic characterization of AMP? And lastly, can we generalize this theory to more complicated models to better explain the performance of neural networks? OK, so with that, I conclude my talk. Thank you for your attention. Thank you very much. That was very clear. Are there any questions for Yuting? Yes, Marco. Hi, Yuting. Very nice talk. I have a quick question. One of your results says that as p over n gets large, the risk of your estimator is the same as the risk of the all-zero estimator.
Now, I wonder whether this is just a sub-optimality of the estimator you're looking at, or whether the problem itself is becoming so much harder that anything you do is basically the same as setting everything to zero. Yeah, that's a very good question. Here, because we focus our attention on the minimum L1-norm solution, as p over n goes to infinity the signal is so weak that you cannot do better than the zero estimator. However, if you consider, for example, a lasso-type estimator and optimize over lambda, then maybe you can do better than this. Yes. Any other questions? We also encourage all of you following on Zoom to feel free to share your questions in the chat. Marco is following the chat very closely, so if you have a question, you can just leave it there and we'll relay it to the speaker. Actually, you can also unmute yourself, apparently, so either way, if you have a question, just let us know. Is there anything from the online audience? Not right now. Any other questions? Yeah, go ahead. In general, AMP theory is rather flexible with respect to the prior on the signal, so I'm wondering whether this makes it easier for you to analyze something beyond the two-point prior that you've been looking at. Yeah, very good point. The theory we developed to characterize the risk of this estimator actually does not use any of the prior information: as long as the empirical distribution of theta star converges weakly to some distribution, the theorem holds. However, to understand how tau star varies with delta, we need to plug in the two-point prior so that the fixed-point equations become easier to handle; then we can analyze the first and second derivatives of the fixed-point equations to study the trend of the risk. Yeah. Good.
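The weak-convergence condition mentioned in the answer is easy to visualize: under the two-point prior from the main result, the empirical distribution of the entries of theta star converges to the mixture that puts mass epsilon on the nonzero value and mass 1 - epsilon on 0. A tiny sketch (the value mu and the sizes are illustrative choices, not the paper's scaling):

```python
import numpy as np

rng = np.random.default_rng(1)
eps, mu = 0.1, 1.0   # sparsity level and (illustrative) nonzero value

# Two-point prior: each coordinate equals mu w.p. eps and 0 w.p. 1 - eps.
# As p grows, the empirical fraction of nonzeros tends to eps, so the
# empirical distribution converges weakly to the two-point mixture.
for p in (100, 10_000, 1_000_000):
    theta_star = mu * (rng.random(p) < eps)
    print(p, np.count_nonzero(theta_star) / p)
```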
I think at this point we can thank Yuting for answering our questions. And thank you for this very nice talk again. Thanks, everyone.