Thanks to the organizers for inviting me to speak here. I'm going to talk about some work I'm doing with Andrea Montanari on de-biasing the lasso with an inaccurate precision matrix. We're nearing completion of this work, so it should appear on the arXiv shortly. De-biasing the lasso is a topic studied in the statistics literature; the reason this talk appears in this session is that we approach the problem using exact asymptotic techniques based on the Convex Gaussian Min-max Theorem.

The setting is a linear model with a bunch of features: p features collected in the matrix X, and one more feature in the vector w, so really this is a (p+1)-dimensional linear model. I separate out this special feature w because in this talk I only care about estimating beta, and I'm interested in theta only insofar as it helps me estimate beta. We make a random design assumption: all the features, both w and X, are jointly Gaussian, but in the spirit of separating out the special feature w, I write the Gaussian distribution in the following way. First, the features X, which I'll call the nuisance features, are jointly Gaussian with some covariance matrix Sigma. Second, the special feature w depends on the nuisance features through a linear model with parameter gamma. The model for the outcome variable y I call the outcome model, and the linear model for w I call the precision model, because it's related to the inverse covariance matrix of all the features jointly, and the inverse covariance matrix is often called the precision matrix; accordingly, I call gamma the precision parameter. I make a sparsity assumption on the unknown parameter, and my goal, as I said, is to estimate beta. Our theoretical results establish consistency of a certain estimator, though what really motivates us is constructing an approximately unbiased estimator of beta. Ultimately we would love to show that this estimator is asymptotically normal; we think it probably is, but at this point that's purely a conjecture.

So let me describe some limitations of other methods. One way to estimate beta is to estimate beta and the nuisance parameter theta simultaneously using the lasso; here, again, I separate out the special feature w in my notation. Let me start with a simulation showing that the lasso estimate of beta can be quite bad. Here beta equals one, I have 400 samples, and the dimensionality of the linear model is 500, so it's larger than n, though not dramatically so. What you should have in mind, as in the other talks in this session, is a proportional high-dimensional asymptotic regime. The nuisance features are independent standard Gaussians.
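The slides are not reproduced in the transcript, so before continuing with the simulation, here is one plausible way to write out the two models from the verbal description above. The noise vectors, their variances, and the stacking of the n observations into y, w in R^n and X in R^{n x p} are my own notation, not the speaker's.

```latex
\[
\begin{aligned}
\text{outcome model:}   \quad & y \;=\; \beta\, w \;+\; X\theta \;+\; \varepsilon,
      && \varepsilon \sim \mathsf{N}(0,\sigma^2 I_n),\\[2pt]
\text{precision model:} \quad & w \;=\; X\gamma \;+\; \eta,
      && \eta \sim \mathsf{N}(0,\tau^2 I_n),
      \qquad X_{i\cdot} \overset{\text{iid}}{\sim} \mathsf{N}(0,\Sigma).
\end{aligned}
\]
% gamma is related to the inverse covariance (precision) matrix of the features
% (w, X) jointly, which is why the second display is called the precision model.
```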
What I'm going to do is set up the outcome and precision parameters so that estimation of beta is substantially degraded when you estimate it with the lasso. I take the sparsity of each of these parameters to be the same; in fact, I take the parameters to be exactly the same. I won't have time to provide intuition for why this way of choosing the parameters is particularly bad if we're interested in estimating beta. The parameter mu is a signal-strength parameter: it's the L2 norm of these parameters. In the plot, the y-axis shows the bias of my estimate of beta computed over 10 simulations, and the x-axis shows the signal-strength parameter mu; I compute the bias for three different sparsity levels. We see that when there is no signal strength, the estimate of beta is slightly biased; it's biased downward, on the order of the regularization parameter. But when I increase the signal strength and set the nuisance parameters in this way, the bias gets inflated and degrades substantially, and the degradation is exacerbated by having a more complex model. Already when the sparsity is 10, things are quite bad, and when the sparsity is 100, which is comparable to the sample size and the dimensionality, the degradation is even worse.

So how well can we hope the vanilla lasso to perform at estimating a single coordinate? There are some results on this. One result makes an incoherence assumption, meaning essentially negligible correlations between all the features, together with an ultra-sparsity assumption, meaning the sparsity is smaller than roughly the square root of n. Under those assumptions, the error in estimating beta is on the order of the regularization parameter, sqrt(log p / n). But these are very strong assumptions; they were violated in the simulation I just showed, and, although I don't know the constants, I think the simulation should lead us to believe this bound isn't really describing it either. Another set of results (the ultra-sparsity part of this slide's title doesn't really apply to this second bullet point, so perhaps I should have given the slide a different title) shows, under weaker assumptions, compatibility or restricted eigenvalue conditions, which I think are satisfied in my simulations, weaker bounds on the error in estimating beta. And these are really quite bad bounds: if you're familiar with the lasso literature, one is the L1 error of the lasso and the other is the L2 error of the lasso. So this result tells you that you can consistently estimate beta if you can consistently estimate the whole parameter vector. That's a pretty tall order: even though we're only interested in a very low-dimensional quantity, we have to estimate a very high-dimensional object.
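For concreteness, here is a minimal code sketch of the simulation described above (beta = 1, n = 400, p = 500, theta set equal to gamma with L2 norm mu, bias averaged over 10 repetitions). The noise levels, the lasso regularization parameter, the grid of mu values, and the placement of the nonzero coordinates are not stated in the talk; the values below are placeholders.

```python
# Minimal sketch of the described simulation; exact noise levels and the lasso
# regularization strength are assumptions, not values from the talk.
import numpy as np
from sklearn.linear_model import Lasso

def lasso_bias(n=400, p=500, beta=1.0, s=10, mu=2.0, lam=0.1, reps=10, seed=0):
    rng = np.random.default_rng(seed)
    biases = []
    for _ in range(reps):
        X = rng.standard_normal((n, p))                     # nuisance features, iid N(0, 1)
        theta = np.zeros(p)
        theta[:s] = mu / np.sqrt(s)                         # s nonzeros with ||theta||_2 = mu
        gamma = theta.copy()                                # precision parameter set equal to theta
        w = X @ gamma + rng.standard_normal(n)              # precision model
        y = beta * w + X @ theta + rng.standard_normal(n)   # outcome model
        fit = Lasso(alpha=lam, fit_intercept=False).fit(np.column_stack([w, X]), y)
        biases.append(fit.coef_[0] - beta)                  # first coefficient estimates beta
    return np.mean(biases)

for s in (10, 100):                                         # two of the sparsity levels mentioned
    for mu in (0.0, 1.0, 2.0, 4.0):                         # placeholder signal-strength grid
        print(s, mu, round(lasso_bias(s=s, mu=mu), 3))
```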
So now I'm going to move on from the vanilla lasso and talk about other approaches that try to target estimation of beta. I'll begin with a success story: the case where we have accurate knowledge of the precision parameter gamma. Here's one thing we can do; it's called the de-biased lasso. It starts with the lasso estimates of beta and theta and then adds a correction term. I don't have time to motivate or describe where this correction term comes from; it shows up in AMP iterations, for example. But what you have is a correction term that depends on your data w, X, and y, on your lasso estimates of the parameter of interest and the nuisance parameter, and on the true precision parameter, so it requires accurate knowledge of the precision parameter. There are lots of theorems about this estimator. I'm going to present not the strongest versions of them, because in this talk I'm not focusing on asymptotic normality, which is what's often discussed in the de-biased lasso literature. But we should expect the estimation error of this de-biased lasso to be of order 1/sqrt(n), and there are indeed results to this effect that hold in the proportional regime. In a paper by Andrea Montanari, Yuting Wei, and myself, we show that this de-biasing procedure has this type of behavior for the typical coordinate, and in a paper by Pierre Bellec and Cun-Hui Zhang, they establish that it holds for a fixed coordinate of interest, provided either you start from the assumption that the vanilla lasso is consistent, which is a strong assumption, or you know that n is bigger than p, though it need not be dramatically bigger. So this works.

In anticipation of what's to come, let me show what these methods do diagrammatically. The de-biasing procedure with an accurate precision model works as follows: you have a bunch of data, which you use to estimate beta and theta with the lasso, and you have some prior knowledge of the true parameter gamma, which doesn't come from your data; you combine them in a somewhat fancy way to construct the de-biased estimate of beta. And indeed, if you run simulations in exactly the same setting as on the previous slide, you find that the bias of this estimator looks quite different from before; the plots show two different sparsity regimes.

So that was successfully de-biasing the lasso with an accurate precision model; now let me show you unsuccessfully de-biasing the lasso with an inaccurate precision model. What if you don't know gamma? One thing to do is to estimate it from your data, say with the lasso. So here's the de-biased lasso again, exactly as on the previous slide, except that in place of the true gamma I plug in an estimate of gamma obtained from my data. Unfortunately, things don't work as well. Here's the same diagram for targeting estimation of beta: we take our labeled data, construct the lasso estimates of beta and theta as before, but now we also get an estimate of gamma from the same data, and we run the same simulation I've been showing throughout the talk.
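The correction term is only gestured at in the talk. As a point of reference, here is one standard form of the single-coordinate de-biased lasso from the literature (the Zhang-Zhang / Javanmard-Montanari style of correction, which residualizes the special feature w on the nuisance features via the precision parameter); the exact expression on the slide may differ.

```latex
\[
z \;=\; w - X\gamma, \qquad
\hat\beta^{\mathrm{d}} \;=\; \hat\beta \;+\;
  \frac{z^{\top}\bigl(y - w\hat\beta - X\hat\theta\bigr)}{z^{\top} w}.
\]
% With an inaccurate precision model, the true gamma is replaced by a lasso
% estimate obtained by regressing w on X:
\[
\hat z \;=\; w - X\hat\gamma, \qquad
\hat\beta^{\mathrm{d}} \;=\; \hat\beta \;+\;
  \frac{\hat z^{\top}\bigl(y - w\hat\beta - X\hat\theta\bigr)}{\hat z^{\top} w}.
\]
```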
In that simulation we see that when the model is very sparse, this seems to work, but with even moderately large sparsity it starts to fail. Is this just a feature of the particular procedure I'm using, or is it fundamental? There isn't a completely straightforward answer, but there are results that would lead us to believe it may be fundamental: it's really hard to get something to work when you don't know gamma. A result in this de-biasing literature, due to Tony Cai and Zijian Guo and refined by Adel Javanmard and Andrea Montanari, gives a minimax lower bound of a certain form for estimating beta. In particular, it contains a term involving the sparsity of the outcome model and the sparsity of the precision model, and if you're familiar with the lasso literature, you know that when this term goes to zero, it means we can estimate gamma or theta consistently using the lasso. So this suggests that, in order to successfully target estimation of beta, we need to consistently estimate at least one of these parameters, or know it accurately. In the proportional regime, which is the regime of this session, estimating these parameters consistently is simply not possible, information-theoretically. So perhaps we should be discouraged about consistently estimating beta in this regime.

Now that I've constructed a bit of a straw man, let me tell you how to successfully de-bias the lasso with an inaccurate precision model. This is finally the topic of our paper. The key idea is that we will have an inaccurate estimate of gamma, but it will be an unbiased estimate of gamma, and because it's unbiased, we'll actually be able to do something. Here's our new data pipeline, in a slightly new setting: we imagine a semi-supervised setting, where we have a bunch of labeled data and also some unlabeled data. We don't have so much unlabeled data that we could consistently estimate gamma, but we do have more total samples than p, the dimensionality of the parameter, so we can get an unbiased estimate of gamma. We combine the unlabeled data and the labeled data to get an OLS estimate of gamma, which is unbiased; we get the lasso estimates of beta and theta from the labeled data alone; and then we combine them in a new way, using the same form as the de-biased lasso, to create what I'll call the de-biased estimate of beta. Explicitly, and recalling that the outcome model and the precision model are both linear models, the new estimate of beta is written on the slide: it involves an empirical correlation of fitted residuals, where the residualization of w uses the estimate of gamma from both the labeled and unlabeled data, and the outcome residuals use the lasso estimate of theta that only uses the labeled data. (A rough sketch of this pipeline appears below.)
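Here is a minimal sketch of that pipeline, with the helper name and all variable names my own: it pools labeled and unlabeled features for an unbiased OLS estimate of gamma, fits the lasso on the labeled data, and applies the plug-in de-biasing form from before. The additional correction factor that the actual estimator places in front of the residual-correlation term (mentioned next) is not reproduced here.

```python
# Sketch of the semi-supervised pipeline described above.  The OLS step needs
# n_labeled + n_unlabeled > p so that gamma-hat is unbiased; the paper's
# estimator also includes a correction factor out front that is omitted here.
import numpy as np
from sklearn.linear_model import Lasso

def debias_with_ols_gamma(X_lab, w_lab, y_lab, X_unlab, w_unlab, lam=0.1):
    # Unbiased OLS estimate of gamma from labeled + unlabeled features.
    X_all = np.vstack([X_lab, X_unlab])
    w_all = np.concatenate([w_lab, w_unlab])
    gamma_hat, *_ = np.linalg.lstsq(X_all, w_all, rcond=None)

    # Lasso estimates of (beta, theta) from the labeled data only.
    fit = Lasso(alpha=lam, fit_intercept=False).fit(
        np.column_stack([w_lab, X_lab]), y_lab)
    beta_hat, theta_hat = fit.coef_[0], fit.coef_[1:]

    # Plug-in de-biasing step: correlate the residualized special feature with
    # the fitted residuals of the outcome model.
    z = w_lab - X_lab @ gamma_hat
    resid = y_lab - w_lab * beta_hat - X_lab @ theta_hat
    return beta_hat + (z @ resid) / (z @ w_lab)
```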
On top of this, the estimator has a somewhat complicated correction factor out front. So this is the de-biased estimate of beta. In our simulations, in exactly the same setting but with the added caveat that we now have some unlabeled data, giving a total of 750 samples, which is larger than 500 but not so large that we'd expect to consistently estimate the parameter gamma, we see a bit more noise, but we seem to have corrected the bias across all of the signal-strength and sparsity regimes. And indeed we prove this as a theorem. We work in a proportional asymptotic regime, but we're not making many, or really any, assumptions about empirical distributions converging or spectra of covariance matrices converging, as often occurs in this literature; we just say: take any limit in which the number of samples is a non-negligible fraction of the dimensionality of the parameter. If Sigma has bounded singular values, theta is sufficiently sparse, and some other conditions hold, then this de-biased estimate is a consistent estimate of beta. And I should remark that "sufficiently sparse" is permissive of proportional sparsity.

In my remaining time (I forgot to note when I started, so hopefully I have just a couple more minutes), let me describe the proof techniques, which I think are what really justify this presentation's presence in this session. We prove this using exact asymptotics, and I'll briefly describe what they are. It turns out that the problem of consistently estimating beta can be reduced to a separate problem involving linear models. Here we have two linear models with the same features but with different underlying parameters and with correlated noise: the noise from the two models is correlated, with covariance matrix S. The question is whether we can consistently estimate this covariance matrix of the noise; if we can, then I can use the construction of that estimate to consistently estimate beta in the original setting I described. This is related to the problem of noise variance estimation in high-dimensional statistics, and in particular to a construction of a noise variance estimate for the lasso proposed by Bayati, Erdogdu, and Montanari in 2013. To construct an estimate of the covariance matrix S, we use a regression estimator in both linear models: we use the same data set and simultaneously construct an estimator for each parameter from that same data. Now, what has been described in this session is, in part, results that give exact characterizations of the limits of each regression estimator marginally: a standard result in this very large literature, which I haven't listed exhaustively on this slide, provides a prediction for the limit of, say, the size of the residuals or the size of the estimation error, for one regression at a time.
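The auxiliary problem is only described verbally, so to fix ideas, here is one way to write it down; the design A, the parameters b^(1) and b^(2), the noise vectors, and the 2x2 covariance S are all my own notation.

```latex
\[
y^{(1)} = A\,b^{(1)} + \epsilon^{(1)}, \qquad
y^{(2)} = A\,b^{(2)} + \epsilon^{(2)}, \qquad
\bigl(\epsilon^{(1)}_i,\, \epsilon^{(2)}_i\bigr) \overset{\text{iid}}{\sim} \mathsf{N}(0, S),
\quad S \in \mathbb{R}^{2\times 2}.
\]
% Both regressions share the same features A but have different parameters, and
% the noise is correlated across the two models within each observation; the
% question is whether S can be consistently estimated from regression
% estimators fit to both models on the same data.
```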
What we need, because, if you recall, the de-biased estimator involves a correlation of residuals, is some knowledge of the limit of a quantity that involves both regressions simultaneously. So we need to know something about the joint distribution of two regression estimators computed on the same data, and that's what we provide. I'm not going to go into detail, and the arXiv paper will be posted soon, but what we do, which I think is of more general interest than just this de-biasing-the-lasso question in particular, is provide an exact asymptotic characterization of the joint distribution of two estimators in linear models. We use Gordon's comparison inequality to construct this exact characterization of the joint distribution. There was a paper by Marco Mondelli, Ramji Venkataramanan, and Christos Thrampoulidis that also used exact asymptotics, from AMP, to study the joint distribution of estimators in GLMs; here we use Gordon's inequality to arrive at the joint characterization. We're able to show consistency of the de-biased estimate based on the lasso, but we also study consistent estimation of beta when we estimate the nuisance parameters using ridge regression. I should remark that we provide more general results than the one I stated in this presentation: the results apply both to a semi-supervised setting and to a fully supervised setting, and, as I've already alluded to, the results are non-asymptotic. Really, the basis of the result is consistent estimation of the noise covariance matrix S, which is a generalization of the result by Bayati, Erdogdu, and Montanari. The paper really is coming very soon, and I hope you'll take a look when it's posted.