So, if you had looked at our schedule a few days ago, you might have expected one of our co-organizers, Manuel Sainz, to be here in this slot, but Manuel had some travel disruptions and will be arriving on Wednesday. So we are very thankful that Marco Mondelli kindly stepped up and agreed to swap his slot, and we are very happy to have Marco speaking to us today. He's from IST Austria and he'll tell us about spectral estimators and approximate message passing. So, for this, I guess, five minutes more, right? This was the deal. Okay, very good. All right, so this talk is about spectral estimators and how you can prove interesting properties of them, also via approximate message passing. This is something we've been thinking about for quite a while, so I'm not sure if it's just a sign of aging that you keep going back to the same old problems every once in a while. I'll try to structure this as a circle. We'll start at noon with spectral estimators for generalized linear models via random matrix theory. And, you know, I always have to give credits: this is joint work with Andrea, back when I was a postdoc at Stanford. Then we're going to arrive at 3 o'clock: at 3 o'clock we'll do spectral initialization of AMP. In that case, I'm still going to use the analysis of spectral estimators, but then I'll try to refine my estimate via message passing in order to do better. Then, at 6 p.m., I'm going to try to revert this a little bit: I'm going to try to use AMP to prove something about spectral estimators. The second work is with Ramji. The third work, the one on spectral estimators for mixed models, is with Ramji and Yihan, my postdoc. And, finally, at 9 p.m., we're going to talk about spectral estimators for general Gaussian designs. Here, we're going to do the analysis mostly via AMP, and this is joint work with Yihan, with Ramji, and also with Hong Chang Ji. Another thing that may be apparent is that the number of authors per paper has increased, so maybe that's a sign of laziness, again, or of aging. I don't know, so I'll let you take from this what you want. All right. So, now, we're here. So, we did about 1%. The important part, acknowledging co-authors, is perhaps the important take-home message of this. The first part is about spectral estimators for GLMs. Here there is no message passing, and I'm going to do the analysis purely via random matrix theory tools. Before that, let me first introduce the main model that we're going to be considering in this talk: the generalized linear model. Let's start easy, with linear models. First, I have a signal to recover, and this is d-dimensional. If I had a linear model, then I would have access to linear measurements, that is, to the inner products between x and the sensing vectors a_i, where the a_i are known. So you know the a_i's, and the problem is: given the inner products and given knowledge of the a_i, find x. Now, there is a twist because the model is generalized. What happens is that you have what you can think of as a noisy channel, a conditional probability distribution applied to the linear measurements. So now your measurements are distributed according to some distribution that depends on the inner products. And, to be concrete, my running example for the whole talk is going to be phase retrieval.
So, in phase retrieval, the name suggests that you lose the phase of the measurements and you would like to retrieve it. This analogy is particularly suitable when x is complex, because in that case the inner product is a complex number and you only get to see the modulus and not the phase, which you would like to retrieve; but here we're going to look at the real case anyway. So, in phase retrieval, what you get is the modulus squared of the inner product, perturbed, say, by noise, for example Gaussian noise. This has plenty of applications in imaging disciplines, such as X-ray crystallography, microscopy, or interferometry. Okay. Now, most algorithms here are iterative and require an initialization. I'll say a little bit more about algorithms later, in the second part of the talk. For the moment, in this first part, I'm going to settle for less and try to solve a weak recovery problem. Weak recovery means that I just want to find an estimator x hat such that the normalized correlation between x and x hat is bounded away from zero. And in particular, it has to be bounded away from zero in the high-dimensional limit in which the dimension of the signal d and the number of measurements n are both large and their ratio is held constant. As Justin said, the reason why we choose this scaling and this normalization is that in this linear regime pretty much all the interesting things happen; we're going to see that we have a phase transition phenomenon. Okay. So, in particular, what we want to understand is for what values of delta the limit of the normalized correlation between the estimator and the signal tends to zero, and for what other values of delta this limit is bounded away from zero. This quantity is between zero and one, and I don't want it to go to zero. Okay. So, here is a theorem, and that's joint work with Andrea. Here we look at the case where the measurements are Gaussian, so the a_i are i.i.d. Gaussian; we look at the high-dimensional limit where n and d are large and the ratio delta = n/d is held fixed; and we want to solve phase retrieval. We are given the measurement vectors a_1 up to a_n, we are given the measurements y_1 up to y_n, and I want to find x. Okay. Then we have two statements. The first statement is that when delta is bigger than one half — let's look at the case of small noise, just for simplicity of the explanation, although most of these results are a bit more general — spectral algorithms work, in the sense that they solve the weak recovery problem. In particular, as soon as delta is bigger than one half, so delta equal to one half plus epsilon, the normalized correlation is bounded away from zero, and of course it increases as delta grows. Now, this is interesting because delta equal to one half is actually information-theoretically optimal, in the sense that, for delta smaller than one half, even the Bayes-optimal estimator would not do well. In particular, the MMSE performance of the Bayes-optimal estimator is the same as that of a trivial estimator that just sets everything to zero. So, for delta smaller than one half the problem is impossible; for delta bigger than one half the problem is indeed solvable, and we can solve it with a spectral method. Again, one half is just due to the fact that I look at noiseless phase retrieval here.
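To make this setup concrete, here is a minimal numerical sketch of the measurement model and the weak-recovery metric just described (real-valued noisy phase retrieval with i.i.d. Gaussian sensing vectors); the dimensions and the noise level are illustrative choices, not values from the talk.

```python
import numpy as np

# Minimal sketch of the measurement model described above: real-valued phase
# retrieval, y_i = <a_i, x>^2 + noise, with i.i.d. Gaussian sensing vectors.
rng = np.random.default_rng(0)
d, delta = 500, 0.8                          # signal dimension and sampling ratio n/d
n = int(delta * d)
sigma_noise = 0.1                            # illustrative noise level

x = rng.standard_normal(d)
x *= np.sqrt(d) / np.linalg.norm(x)          # signal on the sphere of radius sqrt(d)
A = rng.standard_normal((n, d))              # rows a_i are i.i.d. N(0, I_d)
y = (A @ x) ** 2 + sigma_noise * rng.standard_normal(n)

def overlap(x_hat, x):
    """Normalized correlation |<x_hat, x>| / (||x_hat|| ||x||) used for weak recovery."""
    return abs(x_hat @ x) / (np.linalg.norm(x_hat) * np.linalg.norm(x))
```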
In general, you have two thresholds: a spectral threshold and an information-theoretic threshold. They're not going to match in general, so it's also not very difficult to find statistical-to-computational gaps. The paper is really about GLMs, and it crucially uses the fact that the design matrix A has Gaussian rows. We're going to see a couple of twists on this in the second part of the talk. I'll define it in the next slide. All right, that was already jumping ahead, so let me present the spectral method right away. Here is the setting. I'm going to be looking at a signal that lives on the sphere of radius square root of d. The vectors a_i are Gaussian, as I said, and I'll be looking at the case of phase retrieval with Gaussian noise. Of these three assumptions, the third one is relaxable — I could do general GLMs, but for simplicity I'm not going to. The second one is much harder to relax: if you want to do this for more general sensing vectors, that is actually more difficult. The spectral estimator is built from a linear combination of rank-one matrices, so it's basically a sum of terms a_i a_i transpose, modulated by a preprocessing function T applied to the measurements. Okay, so the performance of course depends on T, and we're going to optimize over T later. The spectral estimator, once I construct the data-dependent matrix D_n, is the top eigenvector of D_n, okay? So the task here is to understand the correlation between the spectral estimator and the signal x. Does this answer your question? So, as I perhaps hinted at, in this part we're going to do it just by using random matrix theory, and of course I'm not going to do the proof, because otherwise this would probably take me until 3 p.m. for real. I'll just use an analogy: I'm going to look at a simpler model that perhaps everyone in this audience knows, the rank-one perturbation of a random matrix. I give you the data matrix D_n, and I promise you that it is the sum of a rank-one term, theta times u u transpose, plus a noise matrix X_n — the same model, much simplified, that Justin presented. One seminal result in random matrix theory is that this model exhibits a phase transition. Theta here is a real number and it characterizes the signal-to-noise ratio of the problem: large theta means that the rank-one component is more prominent, so inference should be easier. The result says that there is a critical value of theta such that, if theta is smaller than this critical value, the spectrum of D_n looks exactly like the spectrum of the noise. Notice that D_n and X_n differ only by a rank-one term, so by interlacing they can have only one eigenvalue that's different, and in this regime this does not happen: the two spectra are asymptotically the same, and the top eigenvector is uniformly distributed on the sphere, so it does not depend on the signal u at all.
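As a rough sketch of the construction just described, the data-dependent matrix and its top eigenvector can be computed as below; the preprocessing T shown is only an illustrative bounded choice, not the optimized one from the papers discussed in the talk.

```python
import numpy as np

def spectral_estimator(A, y, T):
    """Build D_n = (1/n) * sum_i T(y_i) a_i a_i^T and return its top eigenvector."""
    n, _ = A.shape
    Dn = (A * T(y)[:, None]).T @ A / n       # weighted sum of rank-one terms a_i a_i^T
    eigvals, eigvecs = np.linalg.eigh(Dn)    # ascending eigenvalues of a symmetric matrix
    return eigvecs[:, -1], eigvals

# Illustrative preprocessing: bounded and increasing in y (not the optimal choice).
T_example = lambda y: 1.0 - 1.0 / np.maximum(y, 0.5)
```

One can then plug the result into the `overlap` function from the earlier sketch to check weak recovery numerically.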
On the other hand, if we pass the phase transition, then something interesting happens: the top eigenvalue escapes from the bulk of the spectrum, and the top eigenvector becomes correlated with the signal. In particular, here the eigenvalue gap implies the eigenvector correlation. This is a signal processing problem in my mind: what I care about is not really the spectrum, but mostly the correlation between the eigenvector and the signal, because this gives me the estimator. But there is a close relationship between the two phenomena: as soon as the eigenvalue escapes the bulk, the eigenvector becomes correlated and the spectral method works. For the spectral algorithm for GLMs, pretty much the same thing happens. Here the correlation is due to the fact that y_i depends on a_i, and in particular it depends on a_i only via a one-dimensional projection, namely the projection of a_i in the direction of x. This correlation is what creates the eigenvalue that escapes from the bulk, and we can show that a very similar phenomenon happens: if delta is smaller than a critical delta, then the spectrum of D_n is the same as if there were no signal, and the top eigenvector is uniform on the sphere; on the other hand, if delta is bigger than the critical delta, then the top eigenvalue escapes from the bulk and the top eigenvector becomes correlated with the signal. So the same phenomenology happens. And the end result of this line of work is to get precise asymptotics of the overlap: we get a formula for the normalized correlation, which is what we aimed at. There has been quite some work here. The first result is by Lu and Li, and it required the preprocessing function to be positive, because this makes some matrices PSD and it helps the analysis. With Andrea, we relaxed that, and we were able to minimize the spectral threshold, achieving the threshold of one half that I mentioned. And finally, the group of Yue Lu was able to maximize the correlation for a fixed delta. Okay, and here is a pretty picture. We showed that if you actually use the preprocessing function that comes from the theory, this works much better in an example on natural images. Here the sensing matrix is not Gaussian anymore, it is a modulated Fourier transform, which is also what allows us to handle images that have dimension of roughly a million, because then you can use fast algorithms. The bottom row is the performance of the spectral method with the optimal preprocessing that we propose, and the row above is some other spectral method using a preprocessing function constructed according to good sense, I would say. There were several proposals in the signal processing literature based on the idea that maybe if the measurement is too small you should discard it, and if it's too large maybe something else went wrong — so there were various heuristic constructions. All right, we are about a quarter of the way through and I'm also 15 minutes in, so I think I'm doing fine with time. So now, what I would like to do next is to use the spectral estimator as an initialization, which is what most people do in practice anyway. Most algorithms to solve phase retrieval are iterative in nature, so they require an initialization. One of the earliest approaches is alternating minimization, but there are plenty of them.
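For the rank-one perturbation analogy, a quick numerical check of the phase transition is easy to run. The sketch below uses a standard GOE normalization (bulk edge near 2, critical theta equal to 1) and is only meant to illustrate the outlier/overlap phenomenology, not any result from the talk.

```python
import numpy as np

# Spiked-matrix phase transition: D_n = theta * u u^T + W, with W a GOE matrix.
# Below the critical theta the top eigenvalue sticks to the bulk edge and the top
# eigenvector carries no signal; above it, an outlier appears and the overlap is positive.
rng = np.random.default_rng(1)
d = 1000
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
G = rng.standard_normal((d, d)) / np.sqrt(d)
W = (G + G.T) / np.sqrt(2)                    # GOE normalization: bulk edge near 2

for theta in [0.5, 2.0]:                      # below / above the critical value (1 here)
    Dn = theta * np.outer(u, u) + W
    vals, vecs = np.linalg.eigh(Dn)
    print(theta, vals[-1], abs(vecs[:, -1] @ u))   # top eigenvalue and eigenvector overlap
```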
Wirtinger flow, which is just a fancier name for gradient descent, iterative projections, Kaczmarz methods — and specifically here I'm going to focus on approximate message passing, because in some cases it's known to be Bayes-optimal, so I guess that's a pretty good motivation. Specifically, I'm going to consider generalized approximate message passing, where "generalized" has the same meaning as in generalized linear models: this is an AMP that works for general GLMs. It was proposed by Sundeep Rangan in this paper. And here is how the iterates look. I don't really want you to parse this, but the idea is that I want to estimate the signal x, so I'm going to try to estimate two things jointly: x and Ax, where A is the design matrix. u here is the estimate for Ax, and v is the estimate for x. It's iterative because the idea is that, as I go on, I do better and better until I get to a fixed point. Now, one nice feature of message passing, and the reason why I'm going to use it to analyze spectral methods afterwards, is that it admits what's called a state evolution characterization. In general, I would like to do performance analysis of algorithms: I would like to understand things like the inner product between the signal and the iterate, or the mean squared error, things like that. These are inner products of objects that live in d dimensions, which is very large because it's the ambient dimension. My dream would be to compute them as one-dimensional quantities. State evolution allows you to do that, because it says that the empirical joint distribution of the iterates converges to the law of random variables whose parameters I can track recursively. So, first of all, notice that the random variables U_t and V_t here are one-dimensional, and I can track their statistics. In particular, they have a component in the direction of the signal — x or Ax, depending on which iterate I look at — plus an independent Gaussian term, and I can characterize the standard deviation of that Gaussian. Essentially, I've taken a d-dimensional problem and moved it to a one-dimensional problem, and that's nice because I can solve one-dimensional problems. Then there is a deterministic recursion that tracks those parameters. mu and sigma are essentially the signal-to-noise ratios induced by the algorithm, so I would like to have large mu and small sigma, or a large ratio mu over sigma, and I can track those and try to optimize the algorithm. One key issue is that I need to initialize this algorithm somehow, in the sense that, for some problems, if I initialize at zero, then I stay stuck there. Okay, and that animal on the slide is supposed to be stuck — I'm not entirely sure whether this is apparent, but let's believe it for a second. Now, one thing that many people do is to initialize at random anyway, and here the folklore is that it takes about log n iterations for the algorithm to escape from the trivial fixed point at zero. Interestingly enough, this has been proven in a very nice paper by Yuting Wei's group at UPenn for Z2 synchronization, which is basically the rank-one perturbation model with a Rademacher prior. Now, for GLMs — that paper is already painful enough; it would be nice to extend it to GLMs, but I guess that's future work for Yuting. Here, we're going to look at something else.
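In symbols, the state evolution characterization described above says roughly the following; this is schematic, with generic notation, and the precise recursion for the parameters is the one in the GAMP literature and is not reproduced here.

```latex
% Schematic state evolution characterization (generic notation).
% The d-dimensional iterate behaves, in the empirical-distribution sense, like a
% rescaled signal plus an independent Gaussian, with parameters tracked recursively:
\[
  v^{t} \;\approx\; \mu_t\, x \;+\; \sigma_t\, g, \qquad g \sim \mathcal{N}(0, I_d)\ \text{independent of } x,
\]
\[
  \lim_{d \to \infty} \frac{\langle v^{t}, x\rangle}{\|v^{t}\|\,\|x\|}
  \;=\; \frac{\mu_t}{\sqrt{\mu_t^2 + \sigma_t^2}},
  \qquad
  (\mu_{t+1}, \sigma_{t+1}) \;=\; F(\mu_t, \sigma_t)
  \quad \text{(a deterministic, low-dimensional recursion).}
\]
```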
So we're going to look at a spectral estimator. There are a few reasons to do that. One is that it's actually quite popular in practice: people were using spectral initializations long before we started analyzing them via random matrix theory and whatnot. They're much more flexible, in the sense that they work in a variety of settings. And a spectral initialization already pushes you close to the basin of attraction of the fixed point, because you don't initialize with just epsilon correlation — it may put you relatively close already. So they have some advantages. Now, the key disadvantage is that the spectral estimator depends on the data matrix itself, so it is correlated with A, and the standard AMP analysis does not directly apply. For example, one thing I could do is sample splitting: some samples I use for the spectral method and some samples I use for AMP. If I were to do that, then I wouldn't need any of this, but that's somewhat wasteful, because I would like to reuse the same samples for everything. So the analysis needs to be changed, and that is the subject of this slide. The first thing that was studied was low-rank matrix estimation, because typically it's a little easier, and there is a nice paper by Ramji and Andrea on this that uses a different approach. The correction to the AMP algorithm was actually conjectured in a paper by Arian Maleki and a few other folks back in 2018. Here, the truth is that I tried that different approach, the one on the slide, for about two years and I couldn't make it work. So what we had to come up with, with Ramji, was an alternative approach that is again based on AMP. What we are able to do is provide the AMP correction and a provable state evolution analysis, in this paper with Ramji. And the idea is to initialize AMP with the spectral estimator and to analyze this initialization using AMP itself. It feels a little circular, but let me explain a bit how this goes, because it is the fundamental primitive that we'll use for the rest of the talk as well. The analysis goes in two phases. In the first phase, I would like to write the spectral method via AMP, because if I do that, then I have it already as a past iterate: I can just plug in the other AMP that I'm going to optimize very carefully, and my analysis will be ready. Now, how can I do that? Well, first of all, I need an algorithmic way of obtaining the spectral estimator, but this is relatively easy, because I can just run a power method, and the power method will converge to the top eigenvector. Notice that this is just a proof device, so here the initialization doesn't matter: I can initialize however I want, because this is just my proof technique. Then, in the second phase, the iterates of this artificial AMP just have the task of mimicking the iterates of the true AMP that I wish to study. Here, what I want to focus on is the first phase. So here is the power method. The power method just consists of iteratively applying the matrix M whose top eigenvector I want to find, and then normalizing by the two-norm, so that the iterates have unit norm across all iterations. Okay, so the claim is that, as the number of iterations grows large, x_t converges to the top eigenvector of the matrix M, and this claim is true when you have a spectral gap. Now, here the analysis is a little trickier because we need to consider the limit of d going large as well: I need to consider matrices whose size is growing. And notice that here I'm taking the limits consecutively.
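Here is a minimal sketch of the power iteration being referred to (a generic implementation; the point made in the talk is that, with a spectral gap, the number of iterations needed does not grow with the dimension):

```python
import numpy as np

def power_method(M, num_iter=200, rng=None):
    """Plain power iteration: x_{t+1} = M x_t / ||M x_t||.
    With a spectral gap it converges to the top eigenvector of M, and the number of
    iterations needed is dimension-free, which is what the proof strategy relies on."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.standard_normal(M.shape[0])
    x /= np.linalg.norm(x)                   # the initialization is immaterial here
    for _ in range(num_iter):
        x = M @ x
        x /= np.linalg.norm(x)
    return x
```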
So I'll first take the limit of large d, and then I'll take the limit of large t. Luckily, there is one thing that saves the day, and that is the fact that I have a spectral gap. The fact that there is a spectral gap implies that the power method converges in a number of iterations that's dimension-free, so at the end of the day, taking the limit of large d first doesn't really hurt me. Now, I have basically three slides where I attempt to do math, and this is one of them — so if you lose me, I'll see you back in five minutes when we go to the next part. The thing that I would like to show you is that, if you choose the denoisers suitably, then the fixed point of AMP looks like a power iteration. Okay, so that's what I'm going to attempt to do now. First of all, I'll choose the denoisers carefully; if I do that, then I obtain this iterate here. I have to admit that this is done very much a posteriori, in the sense that the way we actually did the proof is that first we wrote down the fixed point, and then we found the denoisers that produce that fixed point. So there is a choice of the denoisers that gives you this. Then what I'll do is just set t equal to infinity and look at the fixed point. Okay, now I have two equations in two unknowns. So I can derive u-infinity from the first equation, and I get this. Then I plug u-infinity into the second: now I have that v-infinity equals A-transpose Z u-infinity minus a coefficient times v-infinity. I take u-infinity from the first equation, plug it in, and I get this. Now, this monstrosity is actually not that bad, because it says that a coefficient times v-infinity equals another coefficient times A-transpose Z (I plus Z)-inverse A v-infinity. So this is an eigenvector equation for the matrix in the middle, and if I do one change of variable, then I'm essentially done. Again, this alone doesn't prove much, so the actual analysis is a little more involved than that. But it tells you that, if you choose the denoisers carefully, you can make the fixed point of AMP look like an eigenvector equation. So if you analyze that AMP by state evolution, you can read off the correlation between the spectral estimator and the signal. That's the basic idea. Now, the crux of the argument is that you know that there is a spectral gap, and you know that because of the first part of the talk, where I proved the spectral gap. So here we're all good and clear. One thing that I would like to do is to extend the methodology and analyze spectral estimators in setups where I can't prove the spectral gap in the same way. I'll take that step by step and start with mixed models. Here I can prove a part with random matrix theory and a part with message passing. In particular, I'll be able to prove that the eigenvalue escapes from the bulk of the spectrum via RMT, but I can't prove the correlation — I can't characterize the behavior of the eigenvector that way. So I'm going to use tools from free probability to show that the eigenvalue escapes from the bulk; this gives me the spectral gap, and then I'll do what I showed you before. Before that, let me first explain what the model is. I'll just start from GLMs and then make them mixed. What does that mean? In a GLM, I get measurements from a single signal. In a mixed GLM, I have multiple signals — in this case, for ease of explanation, two of them — and I get measurements from both of them, but I don't know which measurements come from which signal.
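To make the shape of that manipulation a bit more concrete, here is a schematic version with generic scalar coefficients c_1 and c_2 standing in for the constants that the specific denoiser choice fixes, and Z denoting the diagonal matrix of preprocessed measurements; this is only meant to show the structure of the argument, not the exact equations from the paper.

```latex
% Schematic fixed-point manipulation, generic coefficients c_1, c_2,
% Z = diag(T(y_1), ..., T(y_n)) the diagonal matrix of preprocessed measurements.
\begin{align*}
  u^{\infty} &= Z A\, v^{\infty} - c_1 Z\, u^{\infty}
     &&\Longrightarrow\quad u^{\infty} = (I + c_1 Z)^{-1} Z A\, v^{\infty},\\[2pt]
  v^{\infty} &= \tfrac{1}{n} A^{\top} u^{\infty} - c_2\, v^{\infty}
     &&\Longrightarrow\quad (1 + c_2)\, v^{\infty}
        = \tfrac{1}{n}\, A^{\top} Z (I + c_1 Z)^{-1} A\, v^{\infty},
\end{align*}
% i.e. an eigenvector equation for the matrix in the middle, which a change of
% variable relates to the spectral matrix D_n.
```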
So, in particular, all the measurements whose indices belong to the set S come from the red signal x_1, and all the measurements in the complement of S come from the signal x_2, okay? So what you want to recover are the two signals, x_1 and x_2. Again, you know the sensing vectors a_i, and they will be Gaussian. The extra difficulty here is that there is an unknown latent variable that gives you the assignment. Another way to think about it is that the measurements come from a conditional distribution, and this conditional distribution depends on eta_i times the first inner product plus one minus eta_i times the second inner product. So you can think of it as estimation from unlabeled data, where the label is eta_i. This is something that statisticians have been studying for a while, because it models a situation in which the data is heterogeneous: there are two effects that you would like to estimate. All right. Again, the way in which you do inference is very similar to the non-mixed case: you have plenty of iterative algorithms, and so you are left with the problem of how to initialize them. Here are just some examples: EM, alternating minimization, method-of-moments approaches, and many, many more. They require an initialization. There has been some previous work, but nothing that was even in the right regime. The right regime for this problem is the one in which the number of measurements is linear in the number of dimensions, and what we're going to hunt for is the correct phase transition — the correct ratio. Previous work was using concentration-type inequalities that don't allow you to get tight results. All right, so here is my model. It's very similar to what I had before, except that now my estimators are going to be the top two eigenvectors; if I had K components, I would take the top K eigenvectors. And as we'll see, the eigenvectors nicely align, in the sense that, if we have two components only and the mixing ratio is not one half — say an unbalanced case — the top eigenvector goes with the stronger signal and the second largest goes with the weaker signal. If you have ties, it's a bit more complicated and you can only characterize the spanned subspace, so I'm not going to talk about that. So here is the theorem. The theorem essentially gives you a precise asymptotic characterization of the overlap of the spectral estimators. All these theorems look pretty much the same: they say that if some condition on delta is satisfied, then I have a closed-form expression for the overlap. Okay? And we have explicit formulas, both for the condition on delta and for the overlap, which are a little bit complicated, so I haven't put them here because I guess they're not too informative. And the same holds for the second estimator. Once you have this, you can actually do much more. Here we are able to optimize the preprocessing function both for the spectral threshold and for the correlation, and we can also combine with other estimators, because the proof is based on AMP, so I get a little more information than just the normalized correlation: I really get the joint distribution of the spectral estimator and the signal. And here is how it works in practice. You can observe that you can do much better than previous estimators. In particular, the yellow one was proposed in previous work by Karamani et al.
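To fix ideas, here is a minimal sketch of the mixed measurement model described above, in the running phase retrieval example; the mixing proportion, dimensions, and noise level are illustrative, and the latent labels are drawn independently of everything else, as the analysis requires.

```python
import numpy as np

# Mixed-GLM data generation: two independent signals on the sphere, latent labels
# eta_i chosen independently of A, x1, x2; the assignment set S is not observed.
rng = np.random.default_rng(2)
d, n, alpha = 400, 2000, 0.6                 # alpha = fraction of samples from x1
sigma_noise = 0.1

x1 = rng.standard_normal(d)
x1 *= np.sqrt(d) / np.linalg.norm(x1)
x2 = rng.standard_normal(d)
x2 *= np.sqrt(d) / np.linalg.norm(x2)        # independent of x1
A = rng.standard_normal((n, d))
eta = rng.random(n) < alpha                  # latent labels (True = sample from x1)

g = np.where(eta, A @ x1, A @ x2)            # effective inner product per sample
y = g ** 2 + sigma_noise * rng.standard_normal(n)   # mixed phase retrieval measurements
```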
The green one is the spectral estimator where you use the preprocessing function that's optimal for the non-mixed case — which is reasonable enough; in the absence of better ideas, that's something you could do. The black one is the optimal spectral estimator, where you actually use this theory to optimize, and the red one combines it with a linear estimator, so you can do even better. The same happens for the recovery of x_2. Here x_1 is the strong one, because 60% of the measurements come from it, and x_2 is the weak one, because 40% of the measurements come from it. Okay, good. So the idea is again to use GAMP as a power method; this is the same slide that I showed before. There are a couple of challenges. The first challenge is that GAMP state evolution tracks the correlation with a single signal, while here we have two. The idea is to reduce to an abstract AMP that allows you to track all the correlations. So now I'm going to write an AMP that tracks both signals at the same time, and then I'm going to run it twice and choose the denoisers in such a way that one of the two signals never matters. This simplifies my life a little bit, because it essentially halves the number of order parameters that I have to take care of, and I'm going to exploit the independence of x_1 and x_2 to reduce the number of state evolution parameters. One important thing is that in this paper we handle the case where x_1 and x_2 are uniform on the sphere and independent, and pretty much everything uses both assumptions. The AMP part, if you suffer a bit more, you could do for correlated signals as well — I think there is no structural issue with that — but, as we'll see, the random matrix theory part crucially uses independence, and we're going to see that in a second. For the random matrix theory part, as I said, I need a spectral gap; this is what my AMP recipe needs. And the issue is that, if I try to mimic the same approach as before — start from the spectral estimator, reduce to a rank-one perturbation, and study that — I get additional mixing terms due to the fact that now I have two signals. So I have basically a rank-one term for each signal, hence a rank-two perturbation, and then I also have the cross terms, and the cross terms I don't know how to remove. So that's a strategy we were not able to make work. What we're going to do instead is use some tools from free probability to essentially get rid of this, and reduce the problem with two independent signals to two separate problems: one for the first signal and one for the second. So this is my second math slide, and it's one slide of free probability. All right, this is my spectral estimator: it's an average of rank-one terms a_i a_i transpose, times a modulation that depends on the measurements, okay. And here it's convenient to write y_i as a certain function q of the inner product, because this is a generalized linear model anyway — this will be convenient. q here can also encapsulate noise, so q can be stochastic. Good. Okay, so the first thing that I do is separate into the components that depend on the first signal — that's the first sum — and the components that depend on the second signal — that's the second sum. Then I notice the following: I'll use the fact that the a_i are rotationally invariant to rotate only the a_i that belong to the second signal.
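Schematically, the decomposition and rotation step being described here can be written as follows (generic notation, with y_i = q(⟨a_i, x_1⟩) for i in S and y_i = q(⟨a_i, x_2⟩) otherwise); the next paragraph explains why the rotation is allowed.

```latex
% Schematic free decomposition of the spectral matrix for the mixed GLM.
% O is a rotation that can be taken Haar-distributed, independent of the rest.
\[
  D_n \;=\;
  \underbrace{\frac{1}{n}\sum_{i \in S} T(y_i)\, a_i a_i^{\top}}_{D_n^{(1)},\ \text{depends on } x_1}
  \;+\;
  \underbrace{\frac{1}{n}\sum_{i \notin S} T(y_i)\, a_i a_i^{\top}}_{\text{depends on } x_2}
  \;\stackrel{d}{=}\;
  D_n^{(1)} \;+\; O\, \widetilde{D}_n^{(2)}\, O^{\top},
\]
% so D_n is (in distribution) the sum of two independent matrices, the second one
% rotationally invariant, and the outlier locations follow from free probability.
```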
This I can do because the a_i are all independent, so I can rotate them separately, and they are rotationally invariant — here I'm also using Gaussianity. Okay, at this step I didn't do much: I just pushed the rotation from the a_i onto x_2, so x_2 has become O-transpose x_2, and I still have the rotations O and O-transpose here. Now, this is the key step, and in this step I actually use that x_1 and x_2 are independent. If x_1 and x_2 are independent, I can rotate x_2 and I don't need to touch x_1. If they were correlated, I wouldn't be able to do that, because I would also have to rotate the component of x_2 in the direction of x_1, and that would screw me up. So if I do that, then I obtain pretty much the same formula that I had before, except that part of the spectral matrix is now sandwiched between the rotations O and O-transpose. Well, now this is something that I can study, because I have the sum of two random matrices. These random matrices are independent, because the first one depends on x_1 and on the a_i that belong to the set S, and the second one depends on x_2, which is independent of x_1, and on different a_i's. And the second one is rotationally invariant, because I sandwiched it between O and O-transpose. So I can study this with free probability theory: essentially I wrote D_n as the free sum of D_n^(1) and D_n^(2), and now I can use a hammer. I can study the two matrices separately, because each of them is the non-mixed case, which I did in the first quarter of my talk, and the sum is asymptotically free. So at this point I know the location of the outliers. Now, here it should also be clear why the outliers are easy but the eigenvector correlation is hard: I can study the correlation between the top eigenvector of the sum and the top eigenvectors of the individual matrices, and I can study the correlation between those individual top eigenvectors and the signals, but I cannot directly get the correlation between the top eigenvector of the sum and the signals. For that I need to redo the analysis. [Audience question.] Yes. No, no, S is whatever set is chosen at the beginning, so S is fixed. In my example it's going to be randomly chosen, but the randomness is independent of everything else: basically, you flip a coin, with probability eta you assign the measurement to the first signal, and with probability one minus eta you assign it to the second. The key bit is that this choice is independent of everything else — the eta_i are not allowed to depend on x_1, x_2, or the a_i — so you can condition on it and think of it as fixed. Right, so essentially here what I did is compute the outlier eigenvalue via free probability, and then I used my AMP hammer to compute the correlation of the eigenvectors. This brings me to the last part of my talk, in which I would like to study spectral estimators in yet another setup. Here I'm going to try to go beyond the isotropic Gaussian assumption and consider Gaussian designs with a general covariance matrix, and I would like to use mostly AMP tools. I'm saying mostly because there is still a bit that I need to do with random matrix theory, but it's arguably easier than locating the outlier — or at least that's what the people who actually work in that field tell me. All right, so let's start with the model. Again, it's a GLM, but there's another twist: now I'm going to look at a general Gaussian design.
This means that, whereas before the measurement vectors a_i were isotropic Gaussians — zero mean, identity covariance — now the covariance is some general matrix Sigma. Okay. That's pretty reasonable, because anisotropic covariances happen in practice pretty commonly, and there is a whole line of work on getting precise asymptotics for statistical estimators under such designs, but it has mostly been looking at the lasso. The first is a paper by Martin Wainwright, then there is a paper by Zhang, and there are a few papers by Andrea and Adel Javanmard. So there is a whole line of work, although it's mostly restricted to the lasso, and to the best of my knowledge there was no prior result on spectral estimators. Again, the spectral estimator looks exactly like before: I build my usual data-dependent matrix, the sum of T(y_i) a_i a_i transpose, and then I take the top eigenvector, okay? T is my preprocessing function, and I'll try to optimize over it later. And here is the result. The paper is not quite out yet, but hopefully soon. The result looks very similar to what we had before: we have a condition on delta, and if this condition is satisfied, then we can derive the exact asymptotics of the correlation between the spectral estimator and the signal, okay? We have explicit formulas for everything. From this, you can do a number of things. One thing we were able to do was minimize the spectral threshold — let me repeat what that is: the spectral threshold is the minimal value of delta such that weak recovery is possible, so that doing better than the all-zero estimator is achievable. One thing that you would also like to do is maximize the correlation, but this we weren't able to do: the formulas were just too ugly, or we weren't smart enough. One thing that I think is actually very interesting is that this method does not need to estimate Sigma, in the sense that the optimal preprocessing function depends on Sigma only via its first two moments — in particular, it only depends on the trace of Sigma. We also compare — we'll see that in just a second — to a naive spectral estimator that knows Sigma and tries to whiten the data. One thing I could do is whiten my data matrix to reduce myself to the isotropic case, which I can study and for which I can get precise asymptotics. The issue with that is that, if I don't have access to Sigma, estimating Sigma is actually much harder than solving the problem itself, because in the linear regime I cannot get perfect knowledge of Sigma. The nice thing is that, of course, the preprocessing function may depend on Sigma in some way, but the optimal one only depends on this average, and estimating the average consistently is much easier than estimating the whole matrix. And here are some numerical results. We tried a couple of choices of Sigma that are pretty common in the literature: one is a Toeplitz matrix, the other is a circulant one, as we'll see. In this plot, for example, this curve here corresponds to a trimmed estimator, in which I just take a linear preprocessing and then cut it off at some point. This other one, also on the low side, is a subset estimator: it just assigns zero-one scores to the measurements — if a measurement is big enough the score is one, otherwise it's zero. Those were all proposed in previous work, and those heuristic choices work pretty badly.
Our choice works much better and, surprisingly enough — I did not expect this — it also works better than the naive estimator that whitens with Sigma. I have to say that in hindsight there was no reason for it to be better; I thought that, since the naive estimator knows and exploits Sigma, it could have been better, but it turns out that's not the case. A similar picture holds for the circulant Sigma: the flashy green and the yellow curves are previous choices of the preprocessing, the purple one is ours, and the blue one is the one with known Sigma. Here known Sigma is pretty close. But again, one thing that I would like to stress is that our choice of the preprocessing does not need to estimate Sigma. Okay, so this is the same slide that I've shown before: again, the basic idea is to implement an AMP that emulates a power method, and this critically leverages the spectral gap. Here it's even worse than before, because I don't even know how to reduce to a low-rank perturbation, since that part crucially exploits rotational invariance. In the mixed case, the matrix was at least still rotationally invariant; here that's not true anymore, because the covariance is anisotropic. So I have no idea how to do this with the usual approach, and the way I'll do it is slightly different: I'm going to use AMP also to show the spectral gap. I guess that's what Feynman says — when you have one single trick that works, just hammer every single problem you get with that single trick. So here we're going to use AMP both to find the eigenvector correlation and also to establish the spectral gap. The idea is actually relatively simple; the implementation of the idea is significantly more painful. The idea is essentially that state evolution also tells you how long the iterates are — it tells you something like the two-norm of the various iterates. So if you have an upper bound on the right edge of the bulk, and the prediction that you get from AMP for the length of the iterates is bigger than this upper bound, then you must have found the outlier eigenvector. Because, first of all, your AMP implements a power method, so it must select an eigenvalue. Now — in my summary I said "mostly via AMP", because the RMT bit that I still need is an upper bound on the right edge of the bulk. But once I have that baseline, I know that all eigenvalues except possibly the first have to be smaller than it, so if my AMP iterates suggest that there is an eigenvalue bigger than that — well, gotcha, I found the top eigenvector, I found the outlier. So, in the six minutes that I have, I'll try to sketch this, and I'll do that by taking a closer look at the recursion. v_t is my AMP iterate — here I put a hat because it's not quite the AMP iterate, you need to do a little bit more work — and this almost looks like a power method, because v_{t+1} is equal to D_n, the spectral matrix, divided by some scalar gamma — you'll see the role of gamma in just a second; gamma is basically your guess of the top eigenvalue — times v_t, plus e_t. e_t is the error that you make. What we are able to show is that e_t vanishes in the high-dimensional limit, as d grows large, and at the fixed point, as t grows large. So e_t measures how far you are from the fixed point of AMP.
So here I'm going to ignore e_t; in the paper we actually have to do a perturbation analysis and show that the errors don't accumulate horribly. All right, so this is then precisely a power iteration — a power iteration on D_n. Notice that this matrix does not depend on t, and that's the important bit. Okay, so first I'll run this for t-prime steps. The reason I want to do that is to boost my spectral gap: if my spectral gap was alpha, now it has increased, because all the eigenvalues are raised to the power t-prime. Next, I'm going to look at the length of the iterates, so I'll look at one over d times the squared two-norm of this object. Okay, now I want to extract the top eigenvalue. First, I'll project D_n onto the direction of the top eigenvector — this gives me this piece — and this is what remains: the same matrix times the orthogonal projector, where Pi is the projector orthogonal to the top eigenvector. Now, what is this remainder upper bounded by? By the top eigenvalue of this matrix, to the appropriate power, times the length of v_t. And the top eigenvalue of this matrix is the right edge of the bulk, because I removed the top eigenvalue by projecting against it. So that's the piece for which I still need to do work: this you have to compute separately. Okay, now I take limits — three limits. First I take d large, because I want to reduce my d-dimensional problem to a one-dimensional problem; this allows me to pass from AMP to state evolution. Next I take t large, because this kills the error e_t that I had before, so now I live at the fixed point. And finally I take t-prime large, because this boosts the spectral gap. Okay, now if I do that, what happens? If lambda-two of D_n is smaller than gamma, then this term vanishes, because this ratio here is smaller than one, so it vanishes geometrically in t-prime. Okay, so now at this point I'm ready to describe the proof strategy. First you guess what gamma is. This you can do again with AMP, because you check the length of the iterates at the fixed point, and it gives you a good idea of what gamma is. I'll choose a value of gamma; notice that this is just a numerical value, the solution of some fixed-point equation. Next, I verify that the right edge of the bulk is smaller than gamma. This you have to do yourself: you have to compute the right edge of the bulk. If the preprocessing is positive, then the resulting matrix is PSD, and there it works almost immediately — it's easy to see that this is basically still a rank-one perturbation, and the difficult bit is only to locate the outlier eigenvalue; so in principle computing the right edge is much easier than locating the outlier, and in the paper we are able to extend this beyond PSD matrices by doing extra work. Then, at this point, this allows me to deduce that this monstrosity here is equal to rho squared, and this I can read off from state evolution, because it's just the length of an AMP iterate. Okay, so this is now a number. That's the equality from the previous slide, in which I removed the influence of all the eigenvalues other than the top one. Now at this point I'm done, because how can this equality hold if rho is different from zero? Well, if this ratio here is bigger than one, then the left-hand side explodes — the schematic below summarizes this step.
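Here is a schematic version of the quantity being tracked, in generic notation (v^(1) is the top eigenvector of D_n, Pi the projector orthogonal to it, lambda_1 and lambda_2 the top two eigenvalues, gamma the guessed outlier location, rho^2 the state evolution limit); it is only meant to summarize the steps just described, not to reproduce the exact statements from the paper.

```latex
% Schematic decomposition of the tracked norm (generic notation).
\[
  \frac{1}{d}\Big\| \Big(\tfrac{D_n}{\gamma}\Big)^{t'} v^{t} \Big\|_2^2
  \;=\;
  \Big(\tfrac{\lambda_1}{\gamma}\Big)^{2t'} \frac{\langle v^{(1)}, v^{t}\rangle^2}{d}
  \;+\;
  \frac{1}{d}\Big\| \Big(\tfrac{\Pi D_n \Pi}{\gamma}\Big)^{t'} v^{t} \Big\|_2^2,
  \qquad
  \frac{1}{d}\Big\| \Big(\tfrac{\Pi D_n \Pi}{\gamma}\Big)^{t'} v^{t} \Big\|_2^2
  \;\le\;
  \Big(\tfrac{\lambda_2}{\gamma}\Big)^{2t'} \frac{\|v^{t}\|_2^2}{d}.
\]
% Here the bulk eigenvalues are assumed bounded in magnitude by lambda_2 (as in the PSD case),
% so if lambda_2 < gamma the second term vanishes geometrically in t'. State evolution says
% the left-hand side tends to rho^2 > 0 (limits d, then t, then t' going to infinity), which
% forces lambda_1 -> gamma and <v^(1), v^t>^2 / d -> rho^2.
```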
If this ratio is smaller than one, then the left-hand side vanishes instead. So the only way in which the left-hand side neither explodes nor vanishes is that the limit of lambda-one of D_n is equal to gamma. That gives me the outlier, and it also gives me the overlap, because at this point, to make the equality hold, the normalized inner product between the top eigenvector and the iterate has to equal rho squared. Okay, and I think this sums it up: I managed to finish my circle, and thank you very much for your attention.

[Moderator] Thank you, Marco, for the great talk. Questions from the audience?

[Audience] Thank you, very clear talk. I have a question, I don't know if it makes sense: does it make sense to learn the preprocessing function T, or how do you know what T should be?

[Marco] I mean, at this point you optimize it — the point is that I have a formula for this quantity. Rho squared is, okay, pretty much an atrocity, but it's the fixed point of some integral equation.

[Audience] But in the spectral method at the beginning, where you have this...

[Marco] But I can choose it optimally a priori now, because I know what the performance is. The optimal T does not require knowledge of Sigma, so I can choose it a priori; I can derive the optimal T in closed form, because I have an analytic formula for the overlap. Now, the analytic formula is pretty bad, so we are only able to minimize the spectral threshold. So you could try to learn it to maximize the correlation; at the same time, I think the same choice should also maximize the correlation — it's just that proving it is difficult, or at least I don't know how to do it.

[Audience] I see. Okay, maybe I can ask you more questions offline. But thanks.

[Audience] Hey, thanks for the great talk, Marco. I actually had two questions. The first one is: when you say that the preprocessing function just depends on the first two moments — that's for a correlated Gaussian design, so I'm presuming Gaussianity really helps?

[Marco] Yeah, yeah — Gaussianity, or having Gaussian-like tails — but I don't think the phenomenon would go away; I'd say Gaussianity plus a universality class. Essentially, since you are doing the argument by AMP, you could get universality basically by using his results over there. And you still need universality of the right edge of the bulk, but that's something that's relatively well understood in random matrices.

[Audience] Okay, yeah. And the second question was more along these lines: for GLMs — maybe for nice GLMs — AMP gives you a system of scalar equations, and often the unknown parameters in those equations depend on some low-dimensional functions of your underlying signal. Some of them are just signal-to-noise-ratio parameters, and if you go to things like the lasso you have something a bit more complicated. But another approach you could potentially think of is: if you could estimate those parameters, then you could just initialize your algorithm at the fixed point.

[Marco] Yeah, that's how we do it.

[Audience] And then you expect that to be very fast. So this is giving you exactly an operational way to do that in a data-driven way.

[Marco] Yeah, that's an excellent point. The way the argument works is precisely this: you know what the fixed point is, so you initialize at the fixed point of state evolution and then you stay there all along. This saves you from the trouble of proving convergence of state evolution.
Now, one thing that's interesting — and it got me confused for a little while — is that if you initialize at the fixed point of state evolution, you would think that nothing moves. But that makes no sense, because I'm not initializing at the top eigenvector, yet I converge there. So what must be happening is that the low-dimensional projection that state evolution tracks doesn't move, but in the orthogonal space I'm still moving — that's essentially where the noise moves. It moves in such a way that at every iteration of state evolution the signal-to-noise ratio stays fixed, but the vector has to move, because I'm not initializing at the top eigenvector. So that's something that confused me for a little while, but yes — thanks, that's a very good point.

[Audience] Thanks, Marco. I was just wondering: in a lot of these spiked random matrix models, the characterization of the outlier location is often in terms of the root of some determinantal equation, and there isn't anything more explicit than such a description. I'm just wondering, when you do your AMP analysis to characterize these outlier locations, where that kind of structure emerges.

[Marco] That's a good question. What I can tell you is that it's relatively clear that this reduces to the GLM case that you can handle the RMT way: when Sigma is equal to the identity, a few terms simplify. In general it's a bit more complicated — you have a couple of extra equations to solve, so it's not as clean as in the identity case. So I don't know — probably you could do it, but given that I actually don't know the right way of doing it, I don't know.

[Audience] I think perhaps a different way of asking the question is: in your last result, this rho — you say you guess the value of gamma. How complicated is the characterization of gamma, or what exactly is it?

[Marco] It's the solution of two fixed-point equations — something like gamma equals an expectation involving the preprocessing T, and another equation of the same kind. So I have a pair: the pair is basically gamma and rho, and those you have to solve jointly. I don't know if there is anything more explicit than that. Having two equations is why we couldn't do the optimization of the correlation. The optimization of the spectral threshold works because it's a Cauchy-Schwarz trick: essentially, at some point there is a Cauchy-Schwarz step, you get the inequality, and it saturates. But we couldn't do it in the same way for the correlation. Still, I think it's easy enough numerically — you can put it in a solver.

[Moderator] Other questions? If not, then let's thank Marco for the great talk. We have a break now.