Rishabh today. He got his PhD from Columbia University in 2021, he is now a postdoc at Harvard, and he is going to present his work on spectral universality, joint with Subhabrata Sen and Yue Lu. Rishabh, take it away.

Hi everyone, my name is Rishabh, and I'm sorry I couldn't join in person; because of some visa issues I'm not able to travel outside the US, but I'm working on it and I hope I'll see you folks at other workshops and conferences. Today I'll talk about spectral universality in high-dimensional estimation. This talk is about an empirical phenomenon that people observe in high-dimensional estimation problems, and some work I did with my postdoc advisors Subhabrata Sen and Yue Lu to try to understand what's going on here.

Let me begin by introducing a canonical high-dimensional estimation problem. Roughly speaking, the problem is to recover an unknown signal from noisy non-linear measurements. The setup is that there is an unknown signal, a vector β*, which we would like to recover, and we observe measurements y. The measurements are a function of the unknown signal, and this function is a composition of two parts. The first part is linear and is specified by a design matrix X; depending on which community you come from, you might also call it the feature matrix, the measurement matrix, or the sensing matrix. Then there is a non-linear part, a link function g from R to R, which acts on the vector Xβ* entrywise. You might also want to model some noise in the problem, and that is the ε. Many applications where you want to recover an unknown signal can be cast into this form: in MRI applications the function g is just the identity, and in applications like X-ray crystallography g is the squared-magnitude function.

Before I tell you about the empirical phenomenon, which is called spectral universality, let me tell you what universality means in a broader context. When people say universality, they are talking about a high-dimensional probabilistic phenomenon: many large stochastic systems behave asymptotically the same even though their exact constructions may be very different. One canonical example is the central limit theorem, which says that if you have a bunch of iid random variables with mean zero and unit variance and you compute the normalized average, it converges to a Gaussian distribution. The limiting distribution is not sensitive to the exact distribution of the random variables you chose, and that is why this is a universality result. (Sorry, my computer keeps getting confused; give me one second.)

Maybe even more striking examples can be found in random matrix theory. In the three plots I'm showing you, I'm plotting the eigenvalues of a large matrix. The matrix is not symmetric, so its eigenvalues are complex-valued, with a real part and an imaginary part, and I'm showing a scatter plot of the two. The first matrix is an iid Gaussian matrix, the second matrix has iid ±1 entries, and the third matrix is not random at all: it is simply populated by the digits of π.
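As a rough illustration (this is not from the talk), here is a minimal numpy/matplotlib sketch of this experiment. The matrix size, the use of mpmath to generate digits of π, and the centering of the digits to mean zero and unit variance are all assumptions made for the sketch.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpmath import mp

rng = np.random.default_rng(0)
n = 200

# (1) iid standard Gaussian entries.
gauss = rng.standard_normal((n, n))

# (2) iid +/-1 entries: unit variance but very non-Gaussian.
signs = rng.choice([-1.0, 1.0], size=(n, n))

# (3) Not random at all: entries read off from the digits of pi,
#     centered and scaled (digits 0..9 have mean 4.5 and variance 8.25).
mp.dps = n * n + 10
digit_str = mp.nstr(mp.pi, n * n + 2, strip_zeros=False)[2:n * n + 2]  # digits after "3."
pi_mat = (np.array([int(d) for d in digit_str], dtype=float).reshape(n, n) - 4.5) / np.sqrt(8.25)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (name, A) in zip(axes, [("Gaussian", gauss), ("±1 signs", signs), ("digits of π", pi_mat)]):
    ev = np.linalg.eigvals(A / np.sqrt(n))   # normalize so the limit is the unit disk
    ax.scatter(ev.real, ev.imag, s=2)
    ax.set_title(name)
    ax.set_aspect("equal")
plt.show()
```

All three scatter plots should fill out the unit disk roughly uniformly.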
The point is that the global behavior of the eigenvalues is nearly the same in all three cases: the eigenvalues appear to be uniformly distributed over the unit disk. So this is another example of universality, and the question this talk is going to ask is: can universality help us in high-dimensional statistics?

It turns out there is a universality phenomenon in statistics which we call spectral universality, and I'll describe it now. Again, the setup is recovering an unknown signal from noisy non-linear measurements, and the phenomenon concerns the behavior of the design. If someone gives you a design, you can always write down its SVD, X = UΣVᵀ: U is the left singular basis, Σ is the diagonal matrix of singular values, and V is the right singular basis. What is frequently observed is that the statistical properties of the inference task depend only on the singular values Σ; the singular vectors are irrelevant, provided they are "generic". By statistical properties I mean the performance of your favorite signal recovery algorithm for this problem. So if you have two designs with the same singular values but very different singular vectors, this heuristic tells us to expect the two designs to behave the same. The reason this is only a heuristic is the word "generic": I'm not telling you yet what generic means, so I'm not telling you when this principle will apply and when it will fail.

Let me give you an example, in the context of the lasso estimator. It was observed in papers by Donoho and Tanner and by Monajemi et al. These folks looked at the lasso estimator for compressed sensing with three different designs, the spikes-and-sines design, the random DCT design, and the Haar design (I'll describe them in a moment), and they plotted the mean square error of the lasso estimator as the sparsity of the underlying signal changes. What they found was that the three designs behaved exactly the same: the green curve is one design, the starred curve is another, and the red dots are the third. And it's not just that the mean square error of the lasso estimator was the same; you can also look at the histogram of the lasso estimator, and even the histograms look basically the same. It is as if the distribution of the lasso estimator were identical for these three very different designs.

Let me tell you what the three designs were. The spikes-and-sines design is a deterministic design consisting of two blocks: the first block is the identity matrix and the second block is the DCT matrix, the discrete cosine transform matrix, which is a deterministic orthogonal matrix. The random DCT design is constructed by taking a big DCT matrix (again, just a deterministic orthogonal matrix) and picking some of its rows at random. The third, the Haar design, is an extremely random design: you sample a big uniformly random orthogonal matrix and then pick some of its rows.
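To make the three constructions concrete, here is a hypothetical numpy/scipy sketch. The dimensions, the 50% sampling ratio, and the 1/√2 normalization that puts all three designs on the same spectrum are assumptions for illustration; the conventions in the original papers may differ.

```python
import numpy as np
from scipy.fft import dct
from scipy.stats import ortho_group

rng = np.random.default_rng(0)
n, p = 128, 256      # n measurements, p unknowns

# 1) Spikes-and-sines: an identity block next to an orthonormal DCT block
#    (fully deterministic); scaled by 1/sqrt(2) so the rows are orthonormal.
X_spikes_sines = np.hstack([np.eye(n), dct(np.eye(n), norm="ortho", axis=0)]) / np.sqrt(2)

# 2) Random DCT: a uniformly random subset of rows of a big DCT matrix.
F = dct(np.eye(p), norm="ortho", axis=0)
X_random_dct = F[rng.choice(p, size=n, replace=False), :]

# 3) Haar: rows of a uniformly random orthogonal matrix.
X_haar = ortho_group.rvs(dim=p, random_state=0)[:n, :]

# All three share the same (flat) spectrum but have very different singular vectors.
for name, X in [("spikes+sines", X_spikes_sines), ("random DCT", X_random_dct), ("Haar", X_haar)]:
    print(name, np.round(np.linalg.svd(X, compute_uv=False)[:4], 4))
```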
What's common among these three designs is that they have the same spectrum, the same singular values, but very different singular vectors. Out of the three, we essentially only know how to understand the Haar design: there is a body of work that gives a lot of information about its behavior, while we know much less about the other two designs, and yet they behave exactly like the Haar design.

And it's not just compressed sensing. The same thing happens in the two applications I mentioned at the beginning of the talk: in compressed sensing the most important design is the randomly row-sampled Fourier design, and in X-ray crystallography there is a different structured design called the masked Fourier design, in which each block is a Fourier matrix times a random diagonal matrix. In both of these applications, spectral universality seems to apply. More generally, it seems to apply whenever you have a very structured design, and by structured I mean a design with strong dependence and very little randomness, which is exactly what makes such designs hard to analyze. I should also say that this phenomenon is not really about statistics; it seems to be a property of large stochastic systems, because it has been observed in other fields, such as statistical physics (in spin glasses) and communication systems.

Okay, so maybe I've convinced you that this phenomenon happens. Why should we care about it? The reason is that if we could prove it happens, we could use it as a tool for analyzing systems that we can't understand directly. What I mean is this: suppose you're interested in a given inference problem with a design that is handed to you. You might often not be able to understand the behavior of the inference problem with that specific design, because the design has strong dependence or limited randomness; think of designs like the random DCT design or the spikes-and-sines design. But using the spectral universality heuristic, you can construct a surrogate inference problem that behaves just like the problem you care about but is much easier to understand. The way you construct the surrogate problem is to take your original design X and replace it by a surrogate design X̃, constructed to preserve the singular values: X̃ and X have the same singular values Σ, but the singular vectors of X̃ are random, uniformly random orthogonal matrices. The exact definition of the uniform (Haar) distribution on the orthogonal group is not super important here; just think of it as the most mathematically convenient distribution on singular vectors you could possibly pick. If you construct the surrogate design in this way, then thanks to the rotational invariance of the uniform measure on the orthogonal group, the design turns out to be easy to analyze, and there are a lot of different tools that people like to use for such rotationally invariant designs; I have references for some of them. At the same time, if you believe the spectral universality heuristic, then because the surrogate design and the actual design you care about share the same singular values, you should expect the two problems to be equivalent in high dimensions. And that's what happens.
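Here is a minimal sketch of the surrogate construction, assuming Haar-distributed orthogonal matrices drawn with scipy.stats.ortho_group; the function name surrogate_design is hypothetical.

```python
import numpy as np
from scipy.stats import ortho_group

def surrogate_design(X, seed=0):
    """Keep the singular values of X; replace both singular bases by Haar ones."""
    n, p = X.shape
    s = np.linalg.svd(X, compute_uv=False)             # spectrum of the original design
    U = ortho_group.rvs(dim=n, random_state=seed)       # uniformly random orthogonal
    V = ortho_group.rvs(dim=p, random_state=seed + 1)
    S = np.zeros((n, p))
    k = min(n, p)
    S[:k, :k] = np.diag(s)
    return U @ S @ V.T

# Sanity check: the surrogate shares the spectrum of X exactly.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 100))
print(np.allclose(np.linalg.svd(X, compute_uv=False),
                  np.linalg.svd(surrogate_design(X), compute_uv=False)))  # True
```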
Okay, so let me quickly tell you what was known about this phenomenon. As I mentioned, it was first observed in the statistics literature by Donoho and Tanner, and Donoho and Tanner actually gave mathematical proofs of some of their empirical observations in the context of noiseless compressed sensing, that is, a noiseless linear regression problem with the lasso estimator. This is an influential result, and it's really hard to beat in terms of the generality of its assumptions: the result applies to nearly deterministic matrices. The reason this result isn't the end of the story is the proof. Donoho and Tanner proved their result using results from the theory of random polytopes: the basic idea is that you can pose the lasso problem as a linear program, and they were able to use known results on random polytopes to understand this universality behavior for the lasso. In particular, the proof breaks down as soon as you add a little bit of noise to the problem (we don't know how to extend the arguments), or if you go beyond the lasso to any estimator that cannot be computed by linear programming. So it doesn't quite capture the full generality of the phenomenon.

Then there is a long line of work on Gaussian universality, which basically says that highly random designs often behave like Gaussian designs; by highly random designs I mean designs with iid entries or designs with independent rows. What's great about these results is that, unlike the Donoho and Tanner result, they are extremely general: they apply to very broad classes of inference problems. The flip side is that Gaussian universality is not valid for the structured designs I'm talking about, like the spikes-and-sines or random DCT designs. It's simply not the case that these designs behave like Gaussian designs; even in simulation they behave differently.

And then there is a third line of work coming from communication systems and free probability. In very rough terms, what these works are interested in is the behavior of the spectral measure of a random matrix constructed by adding or multiplying matrices with generic eigenvectors. But it's hard to relate, for example, the performance of the lasso estimator to the spectral measure of some matrix, so it's not clear how those results would help us understand what's happening in statistical applications.

Okay. Before I tell you about our result, let me recall the heuristic we are trying to understand. The empirical observation is that the statistical properties of the inference task depend only on the singular values of X; the singular vectors don't matter so much if they are generic. We would like to convert this heuristic into a mathematical principle, in the sense that we should be able to understand when we should expect this to happen and when we should expect it to fail, and in particular, we would like to understand what it means for singular vectors to be generic, with a precise mathematical definition.
I've been a little bit obsessed with this problem and I've written way too many papers on it, but what I'll tell you about is the last paper I wrote, with my postdoc advisors Yue Lu and Subhabrata Sen. I will also point you to parallel work by Wang, Zhong, and Fan, which also tries to understand this phenomenon, under a different set of assumptions.

Let me give you the setup for the result. We look at a simpler problem, maybe the canonical linear regression problem. Again, there is an unknown signal or parameter β* that we would like to estimate, X is the design matrix or feature matrix, and we observe measurements y, which are a linear function of the signal: y = Xβ* + ε, where we also assume the noise ε is iid Gaussian. We want to understand universality in this simple setup, and in particular we look at regularized least squares estimators. These estimators are given by the minimizer of a cost function with two parts: the first is the squared Euclidean distance between the actual measurements and the fitted measurements, and the second is a regularization term that you might use to promote some low-dimensional structure in your estimate of β*. For example, if you believe the true β* is sparse, you might want to use the ℓ1 regularizer here. In the theorem I'll show you, the assumption is that the regularizer is strongly convex, so the result won't apply to the lasso as is; but if you consider the lasso with a small amount of ridge added, it applies to that.
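As a concrete instance of this estimator class, here is a short numpy sketch using the ridge penalty, which is the simplest strongly convex regularizer and admits a closed-form minimizer (the lasso-plus-ridge case would need an iterative solver instead). All dimensions and parameter values are illustrative.

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """argmin_beta 1/2 ||y - X beta||^2 + (lam/2) ||beta||^2,
    a regularized least squares estimator with a strongly convex penalty."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Tiny demo on the linear model y = X beta* + eps with iid Gaussian noise.
rng = np.random.default_rng(0)
n, p = 300, 100
X = rng.standard_normal((n, p)) / np.sqrt(n)
beta_star = rng.choice([0.0, 1.0], size=p, p=[0.7, 0.3])    # a sparse-ish signal
y = X @ beta_star + 0.1 * rng.standard_normal(n)
beta_hat = ridge_estimator(X, y, lam=0.1)
print("per-coordinate MSE:", np.mean((beta_hat - beta_star) ** 2))
```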
Here is the theorem we proved. Before I state it, I should preface it by saying that it is an asymptotic result in high dimensions: remember we are trying to estimate a parameter or signal β*, p is the dimension of this signal, and the result is an asymptotic statement as p tends to infinity. (Okay, give me one second; it froze again.)

What the result says is this. Suppose you are given two designs, and we impose some conditions on the two designs under which they will be equivalent; for now I'm just saying that the two designs lie in the same universality class, and this phrase is a placeholder for three conditions I'll tell you about in the next few slides. The two designs have to satisfy these three conditions, and then we say they lie in the same universality class. Suppose also that you sample the unknown signal or parameter from an iid prior, so each coordinate of β* is sampled iid from a prior π. Then you compute the regularized least squares estimator for the first design and for the second design: β̂₁ is the regularized least squares estimator for the first design, and β̂₂ is the estimator for the second. What the result claims is that in high dimensions, the joint distribution of the first estimator, computed on design one, together with the unknown signal β*, is the same as the joint distribution of the second estimator, computed on design two, together with β*. The sense in which this approximation holds is the following: if you take any nice bivariate test function ψ, apply it to the coordinates of the first estimator and the unknown signal and average across the coordinates, then do the same with the second estimator, the difference between the two averages goes to zero as the dimension goes to infinity. You should really think of ψ as your favorite loss function: for example, if you take ψ to be the squared loss, the theorem tells you that the mean square errors of the two estimators on the two designs are the same. But there is nothing special about the squared loss; you could take other losses.
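Here is a hedged simulation of what the theorem asserts, comparing a random DCT design with a Haar design of the same flat spectrum under a ridge regularizer; the prior, noise level, and regularization strength are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.fft import dct
from scipy.stats import ortho_group

rng = np.random.default_rng(0)
n, p = 400, 800

# Two designs with identical (flat) singular values but very different singular vectors.
F = dct(np.eye(p), norm="ortho", axis=0)
X1 = F[rng.choice(p, size=n, replace=False), :]                # random DCT
X2 = ortho_group.rvs(dim=p, random_state=1)[:n, :]             # Haar

# Signal with iid coordinates from a symmetric prior; iid Gaussian noise.
beta_star = rng.choice([-1.0, 0.0, 1.0], size=p, p=[0.15, 0.7, 0.15])
eps = 0.1 * rng.standard_normal(n)

def ridge(X, y, lam=0.1):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

psi = lambda b, bs: (b - bs) ** 2     # test function: squared loss

for name, X in [("random DCT", X1), ("Haar", X2)]:
    beta_hat = ridge(X, X @ beta_star + eps)
    print(name, np.mean(psi(beta_hat, beta_star)))   # coordinate-averaged test function
# The two printed values should agree up to fluctuations that vanish as p grows.
```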
So that's the template of the theorem, but I still owe you an explanation of what I mean by the two designs lying in the same universality class, that is, the three conditions. The first one is not very interesting: it just requires that X₁ and X₂ have the same singular values, in an approximate sense. I think the other two conditions are more interesting, so I'll focus on telling you about those.

The second condition is that the designs should have generic singular vectors. This was also a condition imposed in the heuristic, but there I didn't tell you what generic means; this is a definition we could come up with under which we could prove our theorem. I'll first state the condition mathematically and then give you some intuition for it. The condition requires that for any integer k, the k-th power of the covariance matrix of the design should look like a scaled identity matrix: entrywise, (XᵀX)^k should be within about 1/√p of c_k I, where the approximation error is measured in the entrywise infinity norm, the maximum entrywise difference between the two matrices. It is typically hard to verify the condition at the exact 1/√p rate, but you are allowed some slack, a polylogarithmic factor, and then the condition becomes easy to verify.

This condition is a bit hard to parse, but the way I like to think about it is via a thought experiment, which we call a permutation-invariant design. Suppose you take the covariance matrix of your design and compute its eigendecomposition, where the λᵢ are the eigenvalues and the uᵢ are the eigenvectors, and suppose, for the purpose of the thought experiment, that the eigenvalues are randomly matched to the corresponding eigenvectors: the coupling between the eigenvalues and the eigenvectors is given by a uniformly random permutation. Then you can look at any power of the covariance matrix and argue, by some concentration argument, that it concentrates entrywise around the expected value of the k-th power; and thanks to the random permutation, you can compute this expected value explicitly, and it is exactly the scalar multiple of the identity matrix that shows up in the condition. So for this thought experiment, verifying the condition is almost immediate.

And I want to tell you that this thought experiment is not super crazy, because it already captures some interesting designs where people saw spectral universality. For example, one way to pick random rows of the DCT matrix is to first shuffle all the rows and then pick the first few of them; that shuffle is a random permutation, so the random DCT design is exactly covered by this thought experiment, and the same goes for picking random rows of a Hadamard matrix or of a uniformly random orthogonal matrix. But the point of the condition is that you don't need an actual random permutation for it to be satisfied; what the condition captures is this approximate decoupling between eigenvalues and eigenvectors. And what is nice is that we were able to verify the condition for all the matrices that were reported to exhibit universality in the work of Monajemi et al., which was one of the works that empirically found this phenomenon. In particular: iid matrices satisfy it, linear transformations of iid matrices satisfy it, the spikes-and-sines design I told you about satisfies it, and the masked Fourier designs, which are the important designs for the phase retrieval problem in X-ray crystallography, also satisfy these conditions.
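Here is a hypothetical numerical check of this condition (the verification in the paper is analytic, so this is only an illustration). The specific design, the row rescaling that makes the spectrum non-flat, and the polylog(p)/√p reference scale are assumptions of the sketch.

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)
n, p = 256, 512

# A structured design: random rows of a DCT matrix, rescaled row by row
# so that the spectrum is not flat and the powers of X^T X are non-trivial.
F = dct(np.eye(p), norm="ortho", axis=0)
X = np.diag(rng.uniform(0.5, 1.5, size=n)) @ F[rng.choice(p, size=n, replace=False), :]

G = X.T @ X
Gk = np.eye(p)
for k in range(1, 5):
    Gk = Gk @ G                                   # k-th power of the covariance matrix
    c_k = np.trace(Gk) / p                        # the scalar multiple of the identity
    dev = np.max(np.abs(Gk - c_k * np.eye(p)))    # entrywise infinity norm
    print(f"k={k}: max entrywise deviation {dev:.4f} "
          f"(polylog(p)/sqrt(p) scale ~ {np.log(p)**2 / np.sqrt(p):.3f})")
```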
The last condition, the third one, is the sign-invariance condition. The way I'll motivate it is by noting that with the generic-singular-vectors condition alone, without imposing further conditions, universality can fail. In this plot I'm again showing the mean square error of the lasso estimator as I change the signal sparsity, for the three designs that Donoho and Tanner studied: the random DCT design, the spikes-and-sines design, and the Haar design. These three designs have the same singular values and satisfy the notion of generic singular vectors I just told you about. In the first plot, the unknown signal is sampled from a prior that is symmetric about the origin; in the second plot I run the same experiment but sample the unknown signal from a non-symmetric prior. In the second case you can see departures from universality, in the sense that the three designs no longer behave the same. This was also observed in the paper by Monajemi et al.: for example, when the unknown signal was entirely positive, they started observing departures from universality.

To bypass these cases where universality can fail, the condition we impose is on how the design matrix is constructed: you take a deterministic matrix J, and this J can be any matrix, but you randomly sign its columns, so X = J diag(s₁, ..., s_p) where s₁ through s_p are iid random signs. This is not going to be satisfied by the spikes-and-sines design, because its columns are not randomly signed; but that is for a good reason, because in some cases the spikes-and-sines design genuinely does not exhibit universality. As long as you randomly sign the columns of the spikes-and-sines design, it satisfies all the conditions. And that's really the only source of randomness in the design that we need to prove the theorem: just random signs in the design matrix; the other conditions are deterministic. Maybe the point is that this is much less randomness than in the other universality results we know. It also turns out that you don't need these random signings if there is some inherent sign symmetry in the problem: for example, if the signal is drawn from a symmetric prior and you use an even regularizer, it is possible to do away with this assumption. I think an interesting open problem here is to figure out a deterministic condition that replaces the sign-invariance condition: we know we need another condition here, but we don't know exactly what it is, and the random signing is right now a placeholder for that future condition.
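A minimal sketch of the sign-invariance construction; the helper name randomly_sign_columns is hypothetical. Note that diag(s) is orthogonal, so the random signing changes only the right singular vectors and leaves the spectrum untouched.

```python
import numpy as np
from scipy.fft import dct

def randomly_sign_columns(J, rng):
    """X = J @ diag(s_1, ..., s_p) with iid uniform random signs s_j."""
    s = rng.choice([-1.0, 1.0], size=J.shape[1])
    return J * s          # broadcasting flips the sign of column j when s_j = -1

# Example: a randomly signed spikes-and-sines design satisfies all three conditions.
rng = np.random.default_rng(0)
n = 128
J = np.hstack([np.eye(n), dct(np.eye(n), norm="ortho", axis=0)]) / np.sqrt(2)
X = randomly_sign_columns(J, rng)
print(np.allclose(np.linalg.svd(J, compute_uv=False),
                  np.linalg.svd(X, compute_uv=False)))   # True: same singular values
```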
Okay, so let me summarize. What I told you about in this talk was spectral universality: roughly speaking, the phenomenon says that the statistical properties of interest to us often depend only on the spectrum of the design and not on the singular vectors, and in the context of regularized linear regression I gave you nearly deterministic conditions on the design under which you can expect this to happen.

Okay, questions for Rishabh. In the second condition for the universality class, you consider XᵀX, right? Is it important that this is the case? Why don't you look at XXᵀ? Because this says something about the right singular vectors of X, but not the left.

Yes, this has to do with the problem structure. This theorem was about regularized linear regression, and in that problem it turns out you only need conditions on the right singular vectors. But if you look at more general problems, for example if you're fitting a generalized linear model, so your loss is not the square loss but a general loss, then you can prove a version of this theorem where you would also have conditions on XXᵀ, and there will also be some additional conditions on the incoherence between the left and the right singular vectors. Okay, thanks.

More questions? Okay, I'll ask one: could you say two words on what happens if the regularizer is not strongly convex but just convex?

Okay, maybe this will be a little bit technical. The way we prove this result is by writing down an iterative algorithm that solves the optimization problem defining the estimator; say you just write down gradient descent. What we need for the proof to work is that gradient descent converges in a constant number of iterations, and if the problem is not strongly convex, that might not happen. That's why there is this strong convexity assumption. If it's not strongly convex, you would need an analysis that is case by case: for example, if the problem is not strongly convex, it's not even clear that the solution is unique. For the lasso, say, you could probably try to do it, but it would have to be more case by case; from what I understand, there are no general conditions under which people would be able to tell you that the solution is going to be unique.

So the problem is not only the finite number of iterations; it's also the fact that the solution may not be unique. Because if it were only the finite number of iterations, assuming the iterative algorithm is something like AMP, then you could try to do a non-asymptotic analysis, like in some recent work on random initialization.

Right, I think I missed your question at first. So the question is whether the problem is that the surrogate algorithm we use to analyze regularized linear regression runs for a fixed number of iterations, independent of the ambient dimension, or that there are multiple fixed points, so depending on where you initialize you end up in different places; the second problem I wouldn't know how to fix, but for the first there may be a way. Maybe what I'll say is that it's more the first problem, in the sense that the proof is by the method of moments, and it's typically hard to do a method-of-moments argument with a diverging number of iterations. The other thing is that for problems like the lasso with Gaussian designs, people know that AMP works in a constant number of iterations, but they know it because of some special properties of the lasso problem. So the main issue is the constant number of iterations, but depending on what the problem is, you might be able to get away with it using some problem-specific structure. Thanks.
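As a toy illustration of this last point (gradient descent here is a stand-in for the iterative scheme in the proof, not the paper's actual construction), the sketch below shows the dimension-free geometric contraction that strong convexity buys on the ridge objective.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 300, 100
X = rng.standard_normal((n, p)) / np.sqrt(n)
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)
lam = 0.5

# Gradient descent on f(b) = 1/2 ||y - X b||^2 + (lam/2) ||b||^2.
H = X.T @ X + lam * np.eye(p)            # Hessian of the quadratic objective
evals = np.linalg.eigvalsh(H)
mu, L = evals[0], evals[-1]              # strong convexity and smoothness constants
step = 2.0 / (L + mu)                    # classical optimal step size
b_opt = np.linalg.solve(H, X.T @ y)

b = np.zeros(p)
for t in range(1, 21):
    b -= step * (H @ b - X.T @ y)        # gradient step
    if t % 5 == 0:
        print(f"iter {t}: distance to optimum {np.linalg.norm(b - b_opt):.2e}")
# The error contracts by a fixed factor per iteration, independent of p, so a
# constant number of iterations reaches any fixed accuracy; without strong
# convexity (mu = 0) this dimension-free contraction is lost.
```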
Any more questions? I'm not sure whether my question applies directly, but since you mentioned a connection to free probability a few times: I know that in free probability, in analogy with the classical central limit theorem, if you sum up many free variables you also find a limiting distribution. So I was wondering whether you can somehow use those results to find a condition that the matrix X should satisfy in order to display some universality; maybe this matrix X is a random matrix, and if you take independent copies then they are free, and in that case you would get this universality. Can you make some connection like this?

Maybe I can't make a connection to the free central limit theorem you're describing, because that involves averaging a large number of matrices, and here we just have one matrix. If the matrix were divisible, like a Gaussian matrix, you could construct it by averaging several independent copies of it, but the matrices we're looking at don't have this property. What you could do is make an analogy to other results in free probability, which say, for example, that if you have two matrices and you conjugate one of them by a Haar orthogonal matrix, then the two matrices become freely independent, so you can characterize the spectrum of any polynomial in those matrices. Maybe the analogy here is that the two matrices in our example would be, say for a randomly subsampled DCT design, the DCT matrix and a diagonal matrix with 0/1 entries telling you which rows you sampled. And it turns out that these matrices are free even without conjugating by Haar matrices; that is the sort of result known in free probability from the third line of work I mentioned, that sometimes you can get freeness without Haar conjugation. The other thing is that it's not clear, in statistical problems, how freeness is going to help us: freeness is a statement about spectra of polynomials, but in a statistical problem we're interested in, for example, the MSE of the lasso estimator, and how is that related to the spectrum of some matrix? It does turn out to have something to do with the behavior of polynomials in these matrices, which is shared between the two problems. I don't know if that made a lot of sense.

Any final questions? Thank you. I think the rotationally invariant AMP by Zhou Fan derives a similar result, so I wonder if there is a difference between the rotationally invariant AMP and this research.

Right, so the parallel work I mentioned by Wang, Zhong, and Fan looks at a universality theorem beyond the rotationally invariant case: even if your matrix is not rotationally invariant, sometimes it behaves like a rotationally invariant one, and that work looks at when this happens. Maybe roughly speaking I can say that the results are stated differently, in the sense that they like to put all their conditions on the matrix, while we put some conditions on the matrix and some on the unknown signal, which we assume is sampled iid. But you can transfer between the two: imposing some conditions on the signal is the same as imposing some other conditions on the matrix. If you do this transfer, then I think the fairest comparison would be that those results are stronger in the sense that they don't assume the unknown signal comes from an iid prior; they only require that the distribution of the unknown signal is permutation invariant. That's the strength. The weakness is that they assume a stronger assumption on the design matrix: they essentially assume the permutation-invariant model, that the eigenvalues and eigenvectors of the covariance matrix of the design are randomly coupled. So their conditions are not going to be satisfied by some of the designs I told you about, where there is no random permutation coupling the eigenvalues and eigenvectors. Thank you.

So I think in the interest of time we'll conclude here. Thank you, Rishabh, again.