And we were like, oh, that would be cool. But it turned out better than expected, and that's thanks to all of you who made it here and made the workshop really special. So I'm going to talk about debiasing regularized estimators in high-dimensional linear regression. We coined a term for it, the spectrum-aware adjustment, and it will become clear why it's named this way. This is joint work with Yufan. Zhou already talked about Yufan a bit; the work he presented was joint with Yufan, Subhabrata, and Yihong. Yufan is a joint PhD student with me and Subhabrata, and he's just really exceptional. I wish he could have been here in person, but sometimes we have visa issues, which he'll be out of very soon, so hopefully you can see him at one of these workshops in the future. OK, so the setting for my talk today is the very traditional high-dimensional linear regression model. I apologize for using notation that a statistician is more familiar with; I should probably have changed it to A's and X's, as Zhou was very smart to do yesterday. But for me, the y's are the outcomes or response variables; they're real-valued. The x_i's are the covariates; they're p-dimensional. And the goal is to infer beta, or say whatever you can about beta. The traditional approach in high dimensions is to look at regularized estimators of this form: you take some loss function and look at its minimizer, but you add a penalty with a tuning parameter in front; that is the H function here. Now suppose we want to do inference: say I look at a finite number of coordinates of the regression vector beta, and I want to recover them as well as possible. Sometimes that means I want confidence intervals for them, or an estimator with proper uncertainty quantification guarantees. One starting point is such regularized estimators. But what you soon realize is that these estimators are biased in high dimensions. They're biased for the simple reason that you're not really minimizing a negative log-likelihood; you have this penalty term, so if you look at each coordinate of the estimator, it is typically not centered around the corresponding true coordinate of the regression vector. So if I use the high-dimensional scaling we are all familiar with in this community, where the number of features and the number of samples both diverge and their ratio converges to a constant, then we know that these kinds of estimators (I'm assuming the loss is convex and everything is convex) roughly behave like the proximal operator of the penalty H, at some threshold theta, applied to the true regression vector plus a Gaussian fluctuation. Ideally, I would want this prox not to be there: I want an estimator that is just the truth plus Gaussian noise. But that's not what you get if you use regularization. So what I want, for uncertainty quantification or confidence intervals, is to find some denoiser eta which, when I apply it to my beta hat, undoes this prox, so that the result is centered around beta with Gaussian fluctuations. In other words, I wish to invert this proximal operator suitably. And this has been done in the literature. The way this evolved historically is roughly as follows.
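To make the setup concrete, here is a minimal simulation sketch, not from the talk: an i.i.d. Gaussian design, a ridge penalty standing in for the generic penalty H (my choice; the talk keeps H general), and a check that the regularized estimate is shrunk away from the truth coordinate by coordinate. All sizes and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam, sigma = 500, 250, 1.0, 1.0          # illustrative sizes and tuning parameter
X = rng.standard_normal((n, p)) / np.sqrt(n)   # i.i.d. Gaussian design
beta = rng.choice([0.0, 2.0, -2.0], size=p)    # sparse-ish ground truth, for illustration
y = X @ beta + sigma * rng.standard_normal(n)

# Regularized estimator: argmin 0.5*||y - Xb||^2 + lam*H(b), with H(b) = 0.5*||b||^2 (ridge)
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Shrinkage bias: on the nonzero coordinates, beta_hat is not centered around beta
nz = beta != 0
print("average of beta_hat/beta on nonzero coordinates:", np.mean(beta_hat[nz] / beta[nz]))  # well below 1
```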
So there was an initial suggestion for how to construct such an eta, and it went as follows. Take your log-likelihood, or whatever your loss function L was on the previous slide, and take a gradient step on it, evaluated at beta hat. It turns out that if you assume Gaussian errors in this linear regression model, that gives you a formula of this sort: you take your regularized estimator, that's my beta hat, and you perturb it by this adjustment, which is basically coming from the score equations of the model. Then one observes that the resulting estimator is centered around the truth with Gaussian fluctuations. Now, this was the initial suggestion. It turns out that if you just do this perturbation (this is exactly the gradient of my log-likelihood evaluated at beta hat; I'm simplifying things by assuming Gaussian errors), it works, but only when the signal beta is highly, highly sparse. Whenever I say something is highly sparse, I mean the sparsity is at most little-o of n. It could be less than that, but it is not at the level of linear sparsity, so this can't really handle high-dimensional problems where a constant fraction of the coordinates is nonzero. That was an issue with this correction. Of course, people realized that, and the correction proposed to fix it takes a form like this: you have the previous estimator, which had only the score step, the gradient of the log-likelihood, but in addition you scale it by a correction factor, which I write as one minus some correction term. It turns out that this new estimator is also centered around the regression vector with Gaussian fluctuations, and the nice thing is that, by adding this denominator, you get this asymptotic normality roughly centered around the true regression vector under much weaker assumptions on the regression vector. The hope was that this could be pushed to linear sparsity; it doesn't go quite all the way there, but it's almost there, so it covers a much broader class of regression models. All right. So I want to start the story by staring at this formula and building some intuition: if somebody tells me I should add a denominator here to correct my formula and get proper debiasing (it's called debiasing for obvious reasons, because you get rid of the bias), then what should the right correction term be? Now, I'm motivating this through a statistical story, but a lot of us know approximate message passing (AMP) theory and AMP algorithms, and it turns out that with some heuristics from those algorithms it's very easy to predict what the correction term should be. So I'm going to do that heuristic on the slides with the least possible math, just to build intuition for what we should expect. Let's see how this goes. What I'm presenting here is just for i.i.d. Gaussian designs, i.i.d. Gaussian covariates. If I write down the approximate message passing algorithm that tracks my regularized estimator, this is the form it takes (sorry, the pointer is over here). We introduce two iterates. The beta hat t's are supposed to track my regression vector; they have a formula like this, and it's an iterative algorithm.
And there is another auxiliary iterate, the z_t's. The way you should think about them is: this part, if you look at it, is like a residual in your regression, and then there is a correction term added to it; that correction term is our Onsager correction term. All right. Now, AMP theory tells me that these iterates track my estimator. I'm going to take some jumps: suppose everything is nice and the algorithm has a unique fixed point. If it has that fixed point, then the fixed point satisfies these two equations simultaneously. Now let's stare at the second equation. I have the auxiliary iterates, their limit z star here, and another z star here. So from here I can write, roughly, a formula for z star in terms of beta hat star. That's just some algebra, and I write it this way: I have z star here, beta hat stars here, and another z star here which can be simplified; I won't present the simplification, but roughly this is what you get. So I have an almost closed-form equation for my auxiliary iterate z star. Nice, I'm going to keep that equation. Now I'm going to leverage some insights we know from AMP theory. What AMP theory tells us, if I pass to these fixed points, is that if I take beta hat star and add the following correction, X transpose z star (z star being the auxiliary iterate), then that object is centered around my true regression vector with Gaussian fluctuations; this is just coming from state evolution results. Nice. So what can I do? I have a closed form for z star, and I can plug it in here. That gives me a formula saying this whole object is centered around the truth with Gaussian fluctuations. Now, we know AMP is supposed to track my regularized estimator in nice problems. So if I replace this beta hat star by my estimator of interest, what do I get? Let me go back a bit. What I'm doing here is replacing beta hat star by my regularized estimator, and that gives me exactly a formula like the one I presented before: it says that my beta hat with this perturbation has the right centering. But that means I have a formula for the correction term, and this correction turns out to be exactly the Onsager term coming from AMP. So basically there is a one-to-one correspondence between approximate message passing algorithms that track your regularized estimator and an estimator that is centered around the signal you want to recover. All right. Now, this was just a simple derivation in the case of i.i.d. Gaussian designs. And since it has been a theme of this workshop to study structured data problems (we know i.i.d. Gaussian designs are things we don't like anymore), we would like to go to structured data, and the moment we do, we will have more complicated approximate message passing algorithms. But hopefully this derivation tells you that if I understand those algorithms really well, I should be able to construct these kinds of estimators, debiased versions of my regularized estimators with the right perturbations, and the Onsager terms in your approximate message passing algorithms should show up in the formula in one way or another.
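Here is a small sketch of the heuristic just described, for the concrete case of the lasso under an i.i.d. Gaussian design: a Bayati-Montanari style AMP iteration with its Onsager term, and the debiased estimate read off from the (approximate) fixed point, which coincides with the degrees-of-freedom corrected formula from the previous slide. The threshold, sizes, and signal distribution are my illustrative choices, and this is only a sanity-check implementation, not the paper's procedure.

```python
import numpy as np

def soft(v, t):
    # soft-thresholding: proximal operator of the l1 penalty
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(2)
n, p, sigma, theta = 600, 300, 0.5, 1.0          # theta: fixed threshold, illustrative
X = rng.standard_normal((n, p)) / np.sqrt(n)     # i.i.d. entries N(0, 1/n)
beta = rng.choice([0.0, 3.0, -3.0], size=p, p=[0.8, 0.1, 0.1])
y = X @ beta + sigma * rng.standard_normal(n)

beta_t, z_t = np.zeros(p), y.copy()
for _ in range(30):
    # residual plus Onsager correction; for soft thresholding, the average of the
    # denoiser's derivative is just the fraction of surviving (nonzero) coordinates
    z_new = y - X @ beta_t + (np.count_nonzero(beta_t) / n) * z_t
    beta_t = soft(beta_t + X.T @ z_new, theta)   # prox / denoising step
    z_t = z_new

# Debiasing read off the fixed point: beta_t + X^T z_t is roughly beta plus Gaussian noise.
# Solving the fixed-point relation for z gives the closed form on the second line, which is
# the degrees-of-freedom corrected formula; the two agree once the iteration has converged.
debiased_amp = beta_t + X.T @ z_t
debiased_dof = beta_t + X.T @ (y - X @ beta_t) / (1.0 - np.count_nonzero(beta_t) / n)
print("corr with truth:", np.corrcoef(debiased_amp, beta)[0, 1])
print("max gap between the two forms:", np.max(np.abs(debiased_amp - debiased_dof)))
```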
So that's the intuition. Cool. The first thing I want to show you is examples where this Gaussian formula, which was really the Gaussian Onsager correction term, does not work well; we have five running examples of design matrices that I'll use throughout the talk. Now, if I just told you this, you'd say, sure, we don't expect it to work when things are not i.i.d. Gaussian. But it actually took us some time to find concrete examples where this debiasing formula wasn't working well. Sometimes the AMP results don't hold, but the debiased estimators people constructed still work well, even in non-i.i.d.-Gaussian settings. So it took some work, and we have these running examples. I'm hoping that for those of you working on structured data, some of the matrices we suggest here could also serve as use cases to demonstrate where i.i.d. Gaussian recipes really shouldn't work in practice. Okay, so let me first explain this picture, and then I'll walk through what my five design matrices were. In this picture, we take the previous debiased formula I just presented, center it by the true signal, and scale it by its standard deviation; this is a p-dimensional vector, and I'm plotting the histogram of its empirical distribution. Those are the blue histograms. When we say sample mean here, we mean roughly the center of that histogram. The black curve is a normal density fitted to the histogram, whereas everything in red is the standard Gaussian: the red curve is the standard Gaussian density, and the red line is x equals zero. So if my debiasing theory from i.i.d. Gaussians were working well for these other designs, the black curve and the red curve would superimpose, and the centers would also match, so those two lines would coincide. But as you can clearly see, across all these examples there is a shift in the mean (the mean is different from zero), and the fit doesn't match the standard Gaussian either. So what are these designs? These are five pictures for five different design matrices, going left to right. The first one is a Gaussian object, but it's a matrix-normal distribution, where I can have correlations between both columns and rows. The column covariance is an AR(1)-type covariance with parameter 0.5, so you have correlation 0.5 with your neighbor, decaying as the locations get further apart, and the row covariances are drawn from an inverse Wishart. That's the first one, on the left. The second is a spiked-matrix design: I have a low-rank object here, alpha is some parameter, and there is Gaussian noise. This is something more familiar to all of you. The third is essentially a linear-neural-network-type object: I take a bunch of standard Gaussian matrices and multiply them together; here we took four of them, and we get this plot. The fourth is what we know in statistics as a vector autoregressive model, and I'm presenting a stylized version: here, the i-th row of the design matrix (this pointer really does not go where I want it to) is a linear combination of K of the previous rows.
K could be a small number, say K equals three or five, but basically each row is a linear combination of K of the previous rows plus a Gaussian fluctuation. That is the fourth one. And the last one is actually heavy-tailed: it's a multivariate t distribution, with three degrees of freedom for this plot. These examples were chosen in a carefully crafted way, because we are going to change them later, and each can be modified in specific ways to demonstrate different phenomena. So this is just to demonstrate that there is a host of such objects, real stuff; whether you observe them in practice or not, I don't know, I do more theory than practice, so I'm probably not the right person to judge. But hopefully these are use cases where you can clearly see that the Gaussian-based approach does not work. Okay. Any questions so far? Yes, so the question was whether the ground truth is taken to be sparse. The ground truth in all these problems is kept the same: a mixture of two Gaussians plus a point mass at zero. Of the two Gaussians, one has a positive mean and the other a negative mean; it's not even symmetric, the means can differ. So a mixture of Gaussians with a spike at zero. For these plots we kept it fixed, but you can think of each coordinate as i.i.d. from this mixture with a spike at zero. Yeah, other questions?
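For readers who want to reproduce something in this spirit, here is a rough generator for the five running designs and the signal prior just described. The talk does not give exact parameter values, so every constant below (correlation levels, rank, noise scales, VAR coefficients, mixture means) is an assumption of mine, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 400, 300   # illustrative sizes

# 1) Matrix normal: AR(1)-type column covariance (rho = 0.5), inverse-Wishart row covariance
rho = 0.5
col_cov = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
W = rng.standard_normal((n, n + 10))
row_cov = np.linalg.inv(W @ W.T / (n + 10))              # crude stand-in for an inverse Wishart draw
X1 = np.linalg.cholesky(row_cov) @ rng.standard_normal((n, p)) @ np.linalg.cholesky(col_cov).T

# 2) Spiked design: a low-rank piece plus alpha times Gaussian noise
k, alpha = 3, 0.5
X2 = rng.standard_normal((n, k)) @ rng.standard_normal((k, p)) + alpha * rng.standard_normal((n, p))

# 3) "Linear neural network": product of four i.i.d. Gaussian matrices
dims = [n, p, p, p, p]
X3 = np.linalg.multi_dot([rng.standard_normal((dims[i], dims[i + 1])) / np.sqrt(dims[i + 1])
                          for i in range(4)])

# 4) Vector autoregressive rows: each row is a combination of the K previous rows plus noise
K, coefs = 3, np.array([0.5, 0.3, 0.2])
X4 = rng.standard_normal((n, p))
for i in range(K, n):
    X4[i] = coefs @ X4[i - K:i][::-1] + 0.5 * rng.standard_normal(p)

# 5) Multivariate t with 3 degrees of freedom: i.i.d. heavy-tailed rows
nu = 3
X5 = rng.standard_normal((n, p)) / np.sqrt(rng.chisquare(nu, size=(n, 1)) / nu)

# Signal: point mass at zero plus a two-component Gaussian mixture (means are illustrative)
comp = rng.choice([0, 1, 2], size=p, p=[0.5, 0.25, 0.25])
beta = np.where(comp == 0, 0.0,
                np.where(comp == 1, rng.normal(2.0, 0.5, p), rng.normal(-1.0, 0.5, p)))
```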
So what do we do? We go to something like rotationally invariant ensembles, and by now I think all of you have seen those; there were many, many talks on this. Before this work of ours, the most closely related is a wonderful paper by Takahashi and Kabashima from 2018, where they suggest, based on replica insights, a debiasing formula for the lasso under rotationally invariant ensembles. When we looked at that, we thought: okay, we expect the Gaussian recipe not to work, and this is a much larger class of ensembles, so maybe that's what we should investigate. We know it's a rich ensemble class, and by now it has been motivated across several talks, so I don't need to motivate it much more. There have been very rich approximate message passing developments by Manfred Opper and collaborators; Zhou is here, though not giving a talk, Rishabh is here, and I'm not intentionally leaving out any of the rest of you: the people here have contributed tremendously to this area. Marco is there too. Now, there are two works that are really related to ours. One is the non-rigorous debiasing formula I just mentioned; the other is a really nice 2022 paper by Gerbelot et al., where they characterize risk-type functionals of regularized estimators under rotationally invariant ensembles. There are some crucial differences between their work and ours. They make a few assumptions that you may not be able to check in these models: one is that the limit in n and the limit in t, in AMP or vector AMP, can be interchanged, so the state evolution part is not fully rigorous; they also assume that limits of certain fundamental quantities exist, without a formal proof of that, so the statement is of the form, if the limit exists, then this is the characterization. But those are the most relevant works. So what we'll do in today's talk is leverage all of these insights and propose a fully data-driven debiasing formula for right rotationally invariant designs. I want to emphasize that the formula is fully data-driven, and you will see that the construction leverages spectral properties of the design very crucially; we know that's what you need to do for these kinds of ensembles, and we are able to do it, with rigorous proofs of everything. We have a host of statistical applications; I'll present some, and the others you can find in the paper, which is still baking in the oven but should be done in two or three weeks, I promise. And although we were really interested in debiasing, as a byproduct we ended up proving a bunch of new properties of the vector AMP algorithm underlying these models and of related quantities that arise in VAMP, so I'm hoping that for other problems people study, some of those proofs, lemmas, or propositions could be useful. All right, now to our suggestion. Any more questions before I move on? So let me first present the proposal. It's actually quite simple: we do debiasing with what we call the spectrum-aware adjustment. The estimator looks exactly the same as before: I have my regularized estimator, I have my score-type term, but now there is a correction, and I'm just going to tell you outright that the way you find this correction is by solving the following equation. Let me explain what it is. If I have a matrix X that is right rotationally invariant (I'll state formally what that means), you look at the eigenvalues of X transpose X; those are my d_i squared. H is my penalty, and I'm assuming it is twice continuously differentiable, except possibly on a finite set of points. Adjustment-hat is the variable in this equation, and the recipe is: solve this equation system, take the solution, and that is your correction term. We have formal proofs that, in our setting, this equation system always has a unique solution. No, I'm getting there: this is all in the high-dimensional scaling limit, and I'll present the formal theorems and assumptions in a bit, but this is the methodology. Note that nothing here depends on population quantities: I only need the sample, the eigenvalues of the sample covariance matrix, and H is something I know, so this is fully data-driven. Yes, for this equation we prove there is a unique solution; that needs a proof. I write it this way, but we prove that in our setting it's unique; I'll present the assumptions. Yeah, for the lasso we can't prove this in general. I'll present my assumptions, and we can prove it for the lasso if you have p less than n. I think for the lasso the right change is to replace H double prime by some smooth approximation. We tried really hard to prove the lasso case, and maybe we can in three more months, but for now we just don't include the lasso in the paper. So what are the theorems? Under regularity conditions, which I'll present soon: I should have written this, but throughout the talk the number of features and the number of samples both diverge, with their ratio converging to a constant, and the constant can take a wide range of values, below one or above one; that doesn't really matter.
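The transcript does not reproduce the equation from the slide, so the following is only a schematic of the construction's shape: it assumes you have already solved the paper's data-driven equation for the correction (here a hypothetical argument adjustment_hat) and then forms the corrected estimate from the score term, as described above. The 1/n normalization and the exact placement of the correction are my assumptions, not the paper's formula.

```python
import numpy as np

def spectrum_aware_debias(X, y, beta_hat, adjustment_hat):
    """Schematic only: regularized estimate plus an adjusted score step.

    adjustment_hat is assumed to have been obtained by solving the paper's
    data-driven equation in the eigenvalues d_i^2 of X^T X and the penalty
    curvature H''; that equation is not reproduced here.
    """
    n = X.shape[0]
    score = X.T @ (y - X @ beta_hat)   # gradient of the Gaussian log-likelihood at beta_hat
    # The 1/n scaling and the (1 - adjustment) placement are assumptions for illustration.
    return beta_hat + score / (n * (1.0 - adjustment_hat))
```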
So what we show is that this debiased estimator, centered around the right object and scaled by a certain standard deviation, which we can completely estimate, converges to a Gaussian in the Wasserstein-2 sense. So this is the usual AMP-type result, but the important point is that we have constructed a true estimator, in that everything is data-driven, and we also have an estimator for the variance. Okay, a quick peek at how this works. The first row I have already shown you: that was the Gaussian-based formula, and we were plotting the empirical distribution of that object, with the Gaussian correction, in the first row; the row below shows the empirical distribution of the new object. Remember that the red curves are standard Gaussian densities and the red line is x equals zero. Now we have an almost perfect match. For some of them there is a little deviation, but our sample size and dimension were relatively small in these simulations. And this is across a range of different design matrices: for instance, if somebody handed me a linear-neural-network-type design and a multivariate t, it's not natural to immediately expect the same procedure to perform exactly the same on both, so we are really trying to demonstrate it across a range of designs. That's the first result. As you can imagine, once you have a result like this, you can immediately construct confidence intervals. If you assume that the entries of the underlying signal are, say, exchangeable random variables independent of the design and the noise, or that the beta_j's are i.i.d. from some prior pi that is nice, say with finite second moment, then you can give similar guarantees, but coordinate by coordinate. This is not so surprising once you have the Wasserstein-2 convergence, but note again that it holds for each coordinate under the assumption that the underlying signal is exchangeable; it is not for an exactly deterministic signal.
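Given the coordinate-wise normality statement and a consistent estimate of the standard deviation, intervals follow in the usual way. A minimal sketch, with beta_debiased and sd_hat as hypothetical inputs computed as in the talk:

```python
import numpy as np
from scipy.stats import norm

def confidence_intervals(beta_debiased, sd_hat, alpha=0.05):
    # Two-sided (1 - alpha) intervals for each coordinate, using asymptotic
    # normality of the debiased estimate together with the estimated sd.
    z = norm.ppf(1.0 - alpha / 2.0)
    return np.column_stack([beta_debiased - z * sd_hat, beta_debiased + z * sd_hat])
```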
The formulas here are not important; they are just to show that there is a way to estimate everything in a data-driven way. Our goal was to estimate (let's go back here) the standard deviation, and the way we do that is by going through a bunch of intermediate quantities, all data-driven, and then there is a formula for the standard deviation that just works. And if it seems like this is coming out of the blue, it turns out it's not. If you study the vector approximate message passing algorithm underlying these problems and look at its fixed-point equations (whenever I say fixed-point equations, I mean the scalar system of equations your state evolution gives, with t sent to infinity; I'm in settings where that may have a unique solution, and we show when it does), then I can write it as a system of four equations, and it turns out that each of those parameters has a physical interpretation. One of them is exactly our debiasing correction term. The others also have physical interpretations, which I won't go over, but what you do is construct estimators for all of these parameters using your AMP insights, the interpretation of each parameter, and from there, using the Onsager correction trick I showed at the beginning, you can create a fully data-driven estimator for the correction term. You also get a data-driven estimator for your variance. And everything is a function of the spectrum of X transpose X, which is nice, and which is why we call this the spectrum-aware adjustment. Any questions? Sorry, can you say that again? Right, so let me try to repeat the question, but I may have missed part of it, so I might ask you again. I think your point is: what is the purpose of the data-driven quantities? We know that in the large-dimensional limit all of these concentrate, but they concentrate around, say, population quantities, and if I have real data, I have a single observation. Take these d_i's: there are functionals of them that will concentrate, but what is the concentrating measure? If I've observed something in practice, I don't have a sequence of d_i's along which to take a limit; I have one snapshot of the data. Theoretically I can write down a limit that I can characterize analytically, but it's not something I can compute from my data; usually I don't get snapshots as the dimension goes to infinity. The point is, that's exactly what we are trying to do rigorously here: instead of expressing things through expectations of functionals of the spectrum, we are really trying to show that you can work with the empirical spectrum, and that functions of it suffice. Exactly, yeah. Other questions? Yes, and the way we prove this is, roughly, to start from the R-transforms, discover nice properties of them, then go back, replace things by their empirical counterparts, and show that this works. You do need to prove that these work, but that's how we arrive at these equations. Any other questions? So I should mention all the assumptions. The first is the usual right-rotational-invariance assumption: if you look at the SVD of the design, the O here is uniform on the orthogonal group, independent of Q and D; that's the assumption. For the D, the empirical distribution converges, the limiting variance has to be positive, and the support is compact. The others are fairly standard: the condition on the signal is that its empirical distribution converges to something with finite variance, and, say, the second moments converge. Then we have conditions on the penalty: it is twice continuously differentiable except on a finite set of points (we can allow a finite set of points where it isn't). This assumption does preclude the lasso in high dimensions, in that we either need the eigenvalues bounded away from zero or H strongly convex for the rigorous proofs. In terms of the formula, though, we still expect that replacing H double prime by a smooth approximation should work in the case of the lasso.
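A small sketch of what the right-rotational-invariance assumption means operationally: in the SVD, the right singular-vector matrix is Haar-uniform and independent of the rest. The dimensions and the choice of spectrum below are illustrative; only the Haar factor is the substantive assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 300, 200

def haar_orthogonal(dim, rng):
    # Haar-uniform orthogonal matrix via QR of a Gaussian matrix (with the usual sign fix)
    Q, R = np.linalg.qr(rng.standard_normal((dim, dim)))
    return Q * np.sign(np.diag(R))

Q_left = haar_orthogonal(n, rng)          # left singular vectors: can be anything here
d = rng.uniform(0.5, 2.0, size=p)         # any spectrum with a compactly supported limit
O_right = haar_orthogonal(p, rng)         # Haar and independent of the rest: the key assumption
D = np.zeros((n, p))
D[np.arange(p), np.arange(p)] = d
X = Q_left @ D @ O_right.T                # a right rotationally invariant design
```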
Okay, so a quick comparison, and if I'm wrong, please correct me. The initial proposal for these problems was for the lasso, using replica methods. And from my understanding (I think I already clarified this earlier), we are trying to go to empirical functions of the spectrum instead of population functions; in the previous suggestion, the correction terms were in terms of population functions of the spectrum. And we realized, almost while finishing this, that our insights actually generalize to non-separable penalties. We have a conjecture for this, which we are not proving in this paper, it's for a future paper, but the whole construction carries over: if you have a non-separable VAMP analogue, you can do a similar construction. Okay, so some applications. The first application is fairly standard; it's just to show that this works. Once you have the Wasserstein-2 convergence, you can estimate a lot of objects: for instance the mean squared error, or the norm of the underlying signal; we can construct data-driven estimators for all of these quantities. A quick peek at how this works: it's probably not so visible from the back, but the left column is for the signal L2 norm and the right column is for the mean squared error, across the five design matrices. Let's focus on the mean squared error. The true value is, say, roughly between 60 and 70. If you use the Gaussian recipe on these kinds of design matrices, you really, really underestimate: you get something between 9 and 20. But if you do this kind of debiasing, you recover and land in the right ballpark. So that's a first-level sanity check that this is working. Now, the application I want to spend more time on is connected to something we call principal component regression in statistics, which is directly connected to PCA, something a lot of you presented. Principal component regression is regression after dimension reduction. In regression problems that are really high-dimensional, you take your sample covariance matrix and perform PCA on it. Usually you retain a subset, say K, of the eigenvectors, say those with the largest eigenvalues, and using this subset of eigenvectors you transform your original data: you take X and transform it by O_K, the matrix of retained eigenvectors, which gives you a new data matrix of reduced dimension. That's nice because it saves computation, and often K is such that on the reduced space you can just run ordinary least squares, and people typically do that. But that's not enough, because the goal was to recover the original signal, and now I'm in the reduced space, so one has to go back, and the way you go back is by multiplying the OLS estimator by O_K. This whole pipeline is known as principal component regression, and it's used a lot: when the dimension is really large, this is the first thing you do in regression problems. All right. Now, if we stare at the equation at the very end, you'll observe that if the true signal is not in the span of the selected principal components, then this PCR estimator suffers the same kind of shrinkage bias you observe for regularized regression estimators. That's expected, right?
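A minimal sketch of the plain PCR pipeline just described (which, as noted, inherits a shrinkage bias whenever the signal is not in the span of the retained components); K and the use of a plain SVD are illustrative choices.

```python
import numpy as np

def pcr(X, y, K):
    """Plain principal component regression, as described above (illustrative)."""
    _, d, Vt = np.linalg.svd(X, full_matrices=False)   # PCA of the sample covariance via the SVD of X
    O_K = Vt[:K].T                                     # p x K matrix of top-K eigenvectors of X^T X
    X_K = X @ O_K                                      # reduced-dimension design (n x K)
    gamma_hat = np.linalg.lstsq(X_K, y, rcond=None)[0] # ordinary least squares in the reduced space
    return O_K @ gamma_hat                             # map back to the original p-dimensional space
```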
But people still do it anyway, because it's a nice, simple method. Now, we realized that using this spectrum-aware debiasing machinery you can actually correct this issue. The way you correct it is: you do your first PCR step, the PCA with reduced dimension K, but you apply a different transformation to your data matrix. I take these d_k's, I scale the OLS estimator by them; I do a totally different transformation. And we realized that if you do this other transformation instead, and you started with a right rotationally invariant design matrix, then the transformed design is also right rotationally invariant, and you can use our debiasing theory on it. That's nice because you also reduce dimension. So you can apply spectrum-aware debiasing in this reduced, transformed space, and we thought this was going to work really well. The whole thing can be made rigorous if the number of retained components is proportional to the dimension. This should be the same K as here, but it's written in a different form, sorry about that. Okay, so why do I write "almost works well"? We were very excited by this, because it's natural and one can do it. But when we used it, it worked much better than anything Gaussian-based, which is not surprising, yet it started becoming pretty unstable numerically. And then we realized what's going on: these are high-dimensional scenarios, and often the matrices we tried this on had a few top eigenvalues that are really large compared to the others, and that's where the unstable behavior comes from. So we wanted to fix this, and the fix is again very natural. The main observation was that our estimator was fine, but its estimated variance was very unstable: when I scale by the estimated standard deviation, the whole thing becomes unstable, whereas if I replace it by the population quantity, it's stable. So the instability comes from having some eigenvalues much larger than the rest for specific designs. So you can do the other thing: you can do PCA and remove the top L components, something also done in high-dimensional practice, and after removing the top L, you just do spectrum-aware debiasing, with no further transformation of the data. We found that this approach works very well for inferring the true signal, even for matrices with strong correlations or heterogeneity in the design, and in some seriously heavy-tailed examples as well; I'll show you a picture in a bit. So these are our examples: the same five classes of design matrices, but, as I'll show in a moment, with their parameters completely changed to capture challenging design settings. The plot shows the same object as before. The Gaussian debiasing, which we call degrees-of-freedom debiasing, is much worse here: it's far from Gaussian, the empirical distribution is almost bimodal, this is the Gaussian density, and the shift in the centers is much, much larger. If we don't remove the top L eigenvalues and just apply our spectrum-aware procedure, you observe something like this: not yet Gaussian, but the means start to be okay.
But you can see that the spread is not quite right. And then if you remove the top few (in this plot we remove the top 20) and do the spectrum-aware procedure, you exactly match the Gaussian density. I want you to keep this last picture in mind, because I'm going to show how we tweaked the parameters of these example design matrices to make the problem much more challenging, so that spectrum-aware debiasing by itself no longer works, but with this removal step it does. Okay, so what are the examples? I had five running examples. The first was the matrix-normal example; previously the AR parameter was 0.5, so the column correlations were much lower, and now I'm going to increase it to 0.9. I'm giving it a challenge. The second was the spiked design: before you had a V W transpose, and now I introduce a diagonal matrix R in between and make its entries really imbalanced, so that you're weighting rows and columns very differently: 500, 250, 50. This is what we might call heterogeneity in the design matrix; that may not be the best name, and if you have a better one, let me know. For the linear neural networks, we were multiplying four i.i.d. Gaussian matrices before; now we take one matrix X1 and raise it to the 100th power, together with an independent factor. Same with the VAR: each row is still a linear combination of the previous three rows, but we make the weights imbalanced again, so the row three steps back is the most important. And the final one is probably the one I'm most excited about; this is entirely my student's discovery, I take no credit for this plot. I kept asking him, how much can we push this? And he did a multivariate t with one degree of freedom, constructed so that the rows are i.i.d., each entry is marginally a Cauchy distribution, and within each row there are correlations. That's the plot on the bottom right. If you remove the top few eigenvalues to deal with the heavy tails, our spectrum-aware approach can still capture what is left. Of course I'm not claiming this works for every kind of design, but I like that, at least for these five classes, we could put it to the test, and it still gives something more reasonable than anything I have seen before. Okay, so in the remaining five minutes... yes? Right, so it depends. In the heavy-tailed cases, we took the signals as still mixtures of Gaussians, so all the signal coordinates have similar contributions; we didn't try settings with additional complications coming from the signal. I'm sure all of this will break down there; something to check is whether this last approach breaks down less than the others. We also tried removing the top L and then doing the Gaussian-based correction, and it works perhaps a bit better, but the Gaussian recipe doesn't capture these design matrices. So yeah, other questions?
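Two of the ingredients just discussed, sketched with assumed parameters: the multivariate t design with one degree of freedom (i.i.d. rows, Cauchy marginals, correlations within each row), and one natural reading of the top-L removal step, namely reconstructing the design without its top L singular directions before debiasing. The within-row correlation, L, and the removal convention are my assumptions, not the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, L = 500, 300, 20

# Multivariate t with one degree of freedom: rows z_i / sqrt(w_i), z_i ~ N(0, Sigma), w_i ~ chi^2_1.
# Each entry is marginally Cauchy; entries within a row are correlated through Sigma.
rho = 0.3   # assumed within-row correlation level
Sigma_chol = np.linalg.cholesky(rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p))))
Z = rng.standard_normal((n, p)) @ Sigma_chol.T
X = Z / np.sqrt(rng.chisquare(1, size=(n, 1)))

# One reading of the removal step: drop the top-L singular directions of X before debiasing
U, d, Vt = np.linalg.svd(X, full_matrices=False)
X_trimmed = (U[:, L:] * d[L:]) @ Vt[L:]   # X reconstructed without its top-L directions
```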
Okay, so I don't have as much time for the proofs as I would have wished, but let me give you a high-level picture. There are two steps to the proof: one is to characterize the empirical distribution of this debiased estimator in terms of the fixed points of the vector approximate message passing algorithm; the other is to provide consistent estimates of those fixed points from the observed data. Now, the first step is the kind of result people are familiar with in the AMP literature: use VAMP to first track your regularized estimator, then infer about your debiased estimator. In the 2022 paper they tracked this regularized estimator, but there were two conditions that were non-rigorous. We used a different proof technique, which allowed us to make this step fully rigorous. The main thing we proved is that the VAMP iterates in this linear regression problem have a nice Cauchy convergence property, and we now have a formal proof of that; maybe it can be useful in other settings. I want to comment that an analogous Cauchy property for these kinds of algorithms has been proved in the literature for the Ising spin glass by Fan and Wu, and in the matched Bayes setting at high temperature by, again, Zhou, Subhabrata, Yihong, and Yufan. Remember, Zhou's talk was about the high-temperature case, whereas here it's an ERM problem, the beta-goes-to-infinity limit, which is a low-temperature case. So I'm actually hoping this step could be useful for people interested in extending those results to the low-temperature case, or other temperature regimes, I don't know. As for the estimation of the fixed points, it's a bunch of analysis, but it's mostly about really understanding how the VAMP fixed-point equations behave and developing insights from that, so I'll skip over it. Okay, so maybe I'll wrap up, so we can still take some questions. What we have is a data-driven debiasing pipeline under rotationally invariant designs. We tried hard to come up with use-case design matrices where you can clearly see the difference between the Gaussian recipe and these objects. I was surprised that PC removal plus the spectrum-aware procedure worked in all five cases, especially when we pushed the multivariate t all the way to one degree of freedom. Other statistical applications follow immediately: hypothesis testing, false positive rate and true positive rate characterizations, et cetera; I did not present these because they follow directly. In a nutshell, the VAMP properties we prove: I told you about the Cauchy property, but we also prove things like existence of a solution to the fixed-point equations for the elastic net (we have an actual proof of that), and that whenever a solution exists, it is unique in our regime. We first prove that, and then we give consistent estimates for every component of the solution; some of them are signal-to-noise-ratio parameters, some are other parameters, so that could be useful for other problems. Now, obviously there is a lot we can't prove; I'll mention one thing, and we can chat offline about the rest. We cannot construct confidence intervals for a pre-fixed coordinate: whenever we can construct confidence intervals, it is under the assumption that the coordinates of the signal are exchangeable, and I would love to relax that condition. I think there is a lot of interest in leave-one-out and cavity arguments for rotationally invariant designs, so if anybody has an idea how to handle a pre-fixed coordinate here, I'd be very happy to hear it. And thank you, I'll wrap up with that. Thank you very much. Let me check, I think this one works now. Yeah, this will switch it on. Yeah, that one's on.
Hello, hello. Wonderful. So now we can take some questions with the microphone. We already had quite a few during the talk, but I think Marco was first, and then, Jean, we'll come to you. Thank you, it's a really, really nice talk. I was wondering whether you can foresee an extension of these spectrum-aware adjustments to other sorts of algorithms. Here you mostly look at regularized regression, so I was thinking of things like AMP algorithms themselves. The simplest case that comes to mind is AMP for rotationally invariant PCA. There you have to estimate all the Onsager corrections, which depend on higher-order cumulants, and those are very, very bad to estimate in practice: essentially you cannot estimate anything beyond cumulant 30 or so, which effectively limits the number of iterations you can run. So I'm wondering whether you have thought about that, and whether this strategy gives ideas on how to do it. Yeah, thank you for the question. I was emphasizing that we do all this massaging with the fixed-point equations because that's what we wanted to solve, and we realized the algorithm itself is really nasty. So it's not that something immediately pops out that answers that question, but it's something we'd be interested in thinking about, and that's the general direction we were trying to go: what can we do with these kinds of data-driven insights after this work? Thank you. Jean, I think you had another question. Thanks a lot for this wonderful talk. I would like to ask something which is probably a stupid question. You are not Bayesian, it's not a Bayesian setting, and there is always this discussion around bias and variance. Is it so clear that you always improve performance by debiasing, and for which metrics? Yeah, so that was a major motivation for providing the MSE estimate. Statisticians like debiasing because in some communities, if something is biased, they're not happy; I mean communities of practitioners, not the theory or methods people. That's why we were interested in estimating the mean squared error: I now have a host of regularized estimators at my disposal, and the hope is that some of them, even after debiasing, are actually good in terms of mean squared error. And there is a tuning parameter which you can really tune to obtain a better mean squared error. So that was our motivation for providing a mean squared error estimate: you choose a bunch of penalties and a bunch of tuning parameters and just optimize over them. How this compares with Bayes-optimal procedures is unclear, because this is traditionally a mismatched-setting analogue, but our motivation for the MSE estimator was precisely so that you can also do this, and not just debias. So can you always improve the MSE? You can optimize it; wait, where did it go, hold on. This quantity is a function of the tuning parameter, right? This MSE estimator is something you can compute, and it is going to be a function of the regularizer you chose and of the tuning parameter in front.
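As a sketch of that tuning loop, with a hypothetical helper estimate_mse(X, y, lam) standing in for the paper's data-driven MSE estimate:

```python
import numpy as np

def tune_by_estimated_mse(X, y, lambdas, estimate_mse):
    # Pick the tuning parameter whose data-driven MSE estimate is smallest;
    # estimate_mse(X, y, lam) is a stand-in for the paper's estimator.
    scores = [estimate_mse(X, y, lam) for lam in lambdas]
    return lambdas[int(np.argmin(scores))]

# e.g. lam_best = tune_by_estimated_mse(X, y, np.logspace(-2, 2, 25), estimate_mse)
```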
So among your class of estimators you can optimize this and choose the best one. That makes the people who really want debiased estimates happy, and the MSE is not going to be that bad, because you can really optimize it as a function of the tuning parameter, and along the tuning-parameter curve the dip, where it's lowest, is usually pretty low, almost close to Bayes optimal. I don't know the exact gap, or whether there is one, but it's not that bad. Okay, thanks. Okay, do we have another question? Yes, we have two actually; how are we going to do this? Thank you, very nice talk. Just a simple question: is it very difficult, or not, to generalize this to the generalized linear model? So then we have to go to GAMP/VAMP. I don't think that is conceptually very difficult. I didn't want to go to GAMP/VAMP in the same paper, but conceptually, the adjustment-hat and all of these things carry over: you change the L2 pieces to some M-estimator-type objects, and qualitatively it should be similar. The proof obviously needs GAMP/VAMP, yeah. Okay, maybe a last question before we go to the break. Thank you, Pragya. I have a very simple question, just for the sake of culture. I remember that one of the earliest works on this debiasing business was by Zhang and Zhang. Yeah, the three concurrent works, yeah. Okay, I was wondering what the relationship is between the methods you are developing and those methods. I remember in particular that there was a step in that series of works where you need to somehow estimate an inverse covariance, or orthogonalize a predictor against the other predictors, which I'm not seeing appear here. So I was wondering if you could comment a bit. So the question has multiple parts. One is: this was the original Zhang and Zhang estimator, roughly, this one. And they had Sigmas, because they were not just looking at i.i.d. Gaussians; they had some covariance Sigma, and then you get a Sigma inverse here in front. But to make that data-driven, you need to estimate an inverse covariance matrix, et cetera. Now, if we introduce such a Sigma in our model, you actually get a Sigma in pretty much the same place; so this is Bellec and Zhang, and if you go to our formula, the Sigma appears in the same place if you right-multiply by Sigma to the one-half in the right rotationally invariant model. The interesting thing is that for the Sigma case we have a full conjecture, and something we observe is that in the Zhang and Zhang or Bellec and Zhang approach you had to estimate really large-dimensional objects, like the Sigma inverse, essentially all of it. For us, we were able to simplify it into traces of objects if you have a strongly convex penalty. So even if you have a Sigma, what you need to estimate is not your entire Sigma inverse. Yeah, thanks. Okay, wonderful. Let's thank Pragya one last time.