Thank you for the introduction, and first I'd like to thank the organizers for the invitation to this workshop. Today I'm going to present our work on understanding a very simple spectral method that turns out to be surprisingly effective for several estimation and learning problems; this is joint work with my students.

We know that a central problem in learning and estimation is the following. Suppose we have a collection of input or explanatory variables that are n-dimensional vectors, and some output or response variables that we can take to be scalars. Given a collection of input-output pairs, we want to learn the functional relationship linking the input to the output. Of course, this is an ill-posed problem, so typically we have to assume some structure on the underlying function f. In this talk, we consider a very particular structure for this mapping: we assume the output y depends on the input a only through its projection onto a low-dimensional subspace. Here ξ_1 through ξ_r are basis vectors spanning an r-dimensional (rank-r) subspace, and the nonlinearity g is a function that can be stochastic and is either known or unknown. The geometry of the problem is clear: instead of trying to learn a generic function of n variables, we only need to learn a low-dimensional subspace and, potentially, a function g that depends on a small number of variables.

This is a very simple model, but it appears in many problems in learning and estimation, so I'd like to go through several concrete examples. The first example is the rank-1 case, which is of course known as a generalized linear model or single-index model. In this case the output y depends only on the projection of the input onto a one-dimensional subspace spanned by ξ, and the mapping from that scalar product to y is described by a channel, a conditional distribution. There are several instances where this model makes sense; in learning, it is also known as a one-layer perceptron with a general activation function.

In this talk we will pay special attention to one particular case, which is how I got interested in this problem: the so-called phase retrieval problem. If you have not heard of it before, it has a very simple mathematical structure. We want to estimate a vector ξ; in applications, ξ is an underlying image we want to reconstruct. The way we measure ξ is to multiply it by a known sensing matrix A that comes from our imaging setup, but the actual measurements we can take are the absolute values squared of these scalar products. What we are missing is the phase information, which is why the problem is called phase retrieval. This is clearly a special case of our model, because the output y depends only on the projection of the input (the rows of A) onto the one-dimensional subspace spanned by ξ, and the nonlinearity is the squaring function.

Last year I read an interesting paper describing a different imaging setup where one measurement can contain information about multiple unknown images simultaneously. Mathematically, the model simplifies to y_i being a linear combination of the phaseless measurements of r images. This is again just a rank-r version of the problem we consider.
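To fix notation, here is the model just described written out; the weights w_k in the last line are my own shorthand for the linear combination mentioned above:

$$y = g\big(\langle \xi_1, a\rangle, \ldots, \langle \xi_r, a\rangle\big), \qquad a \in \mathbb{R}^n,\; r \ll n \quad \text{(multi-index model)},$$
$$y_i \sim p\big(\,\cdot \mid \langle \xi, a_i\rangle\big) \quad \text{(rank-1: single-index / generalized linear model)},$$
$$y_i = \big|\langle a_i, \xi\rangle\big|^2 \quad \text{(phase retrieval)},$$
$$y_i = \sum_{k=1}^{r} w_k \,\big|\langle a_i, \xi_k\rangle\big|^2 \quad \text{(multiplexed, rank-}r\text{ phase retrieval)}.$$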
I also want to mention that a two-layer perceptron is a special case: for whatever nonlinear activation function and whatever coefficients you put on the second layer, the output y depends on the input only through its projections onto a few basis vectors ξ. So this is another example. There are other settings where the model considered in this talk appears; I'll just briefly mention, for example, learning mixtures of classifiers, phaseless PCA, and so-called max-affine regression. So this is a very general problem, and by the way it is also a very well-known model: in statistics it is called the multi-index model.

So far I hope I have provided sufficient motivation for why it makes sense to study this model, but I also want to mention its challenges. Our goal is to learn g, or the underlying subspace ξ_1 through ξ_r, from a collection of a_i's and y_i's. First of all, this is in general a non-convex optimization problem, even if you are given the function g; in the case of phase retrieval, for example, it boils down to minimizing a fourth-order polynomial, which is non-convex. A more challenging case is when you don't know what the underlying function g is. And whatever method we use, we would ideally also like some performance guarantees.

In this talk I'm going to describe a very simple spectral method that is fairly effective for solving problems like this. I can also guarantee that this is the simplest algorithm you will hear about in this workshop. It is a two-step approach. Consider the rank-1 case first, where y_i depends on the scalar product of the input with an unknown vector ξ and g is some unknown nonlinearity. In the first step, you form a symmetric matrix D that is a linear combination of the outer products a_i a_i^T of the input vectors, weighted by the y_i's. In the second step, you compute the top eigenvector of D. That's it; that's the algorithm. The claim is that the top eigenvector carries information about ξ. It is indeed a very simple algorithm: it is literally one line of MATLAB code. In the general rank-r case, instead of computing only the top eigenvector, you compute the r_1 leading eigenvectors and potentially the r_2 trailing eigenvectors; there are specific rules for selecting the values of r_1 and r_2.

I want to mention that this is also a very old idea. As far as we know, this method first appeared in a paper by Li under the name of principal Hessian directions, but it is not very well known outside the statistics literature. In signal processing, the method was reinvented in a paper by Netrapalli and co-authors in the setting of phase retrieval, and a similar approach also appears in problems such as matrix completion, blind deconvolution, and so on.
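Before getting to why this works, here is a minimal NumPy sketch of the two-step procedure just described (the talk mentions it is one line of MATLAB; the function name and the little test below, which uses the noiseless quadratic channel, are mine):

```python
import numpy as np

def spectral_estimate(A, y):
    """Rank-1 spectral method sketch: weighted outer-product matrix, then top eigenvector.

    A : (m, n) array whose rows a_i are the input / sensing vectors.
    y : (m,) array of responses used as weights.
    Returns a unit-norm estimate of the hidden direction xi (up to sign)."""
    m, _ = A.shape
    D = (A * y[:, None]).T @ A / m          # D = (1/m) * sum_i y_i * a_i a_i^T
    _, eigvecs = np.linalg.eigh(D)          # symmetric eigendecomposition, ascending order
    return eigvecs[:, -1]                   # eigenvector of the largest eigenvalue

# Quick check on noiseless phase retrieval: y_i = <a_i, xi>^2, Gaussian sensing vectors.
rng = np.random.default_rng(0)
n, m = 200, 2000
xi = rng.standard_normal(n)
xi /= np.linalg.norm(xi)
A = rng.standard_normal((m, n))
y = (A @ xi) ** 2
print("correlation:", abs(spectral_estimate(A, y) @ xi))
```

For the rank-r variant one would keep several leading (and possibly trailing) eigenvectors of the same matrix D instead of only the top one.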
So why does this very simple method work — why would the eigenvector of D contain information about ξ? Consider the special case of phase retrieval, where y is just the scalar product squared. It may look strange at first, because the input vectors a_i themselves contain no information whatsoever about the underlying ξ. The trick is that if you think of the a_i's as directions spread almost uniformly over the sphere, we are re-weighting all these directions: y_i measures how much each direction is aligned with the underlying ξ, so directions that are more aligned get higher weights and vice versa. Because of this, the top eigenvector should contain information about ξ. That is a simple deterministic explanation.

You can also consider a more probabilistic explanation, which helps when you have a general relationship — instead of a square, a general channel. Consider the case where the sensing (input) vectors are Gaussian. The matrix D that we form is nothing but an average of independent and identically distributed rank-1 random matrices. If you have enough of these components, meaning if m is large enough, you have a version of the law of large numbers: the random matrix should be close to its expectation, which you can compute. In Li's original work, he used Stein's lemma to compute this expectation, but if you use the rotational invariance of the Gaussian ensemble, the derivation is very simple. You can show that in the case of phase retrieval the expectation is nothing but the identity plus a rank-1 component, and the factor 2 gives us the eigenvalue gap between the leading eigenvector — which is the true direction ξ — and the bulk.

A similar idea holds in the rank-r case: you can again compute the expectation of this random matrix. The majority of the eigenvalues are equal to a common constant, but there can be several spikes popping out, carrying information. If you don't have enough measurements, the eigenvalues that are supposed to be equal to that common value spread out to form a bulk; but in many cases some of these information-carrying spikes still survive and pop out of the bulk. That means that by computing eigenvectors you can detect all these special directions.

All right, so how do we analyze the performance of this method? In the rank-1 case we essentially just need to measure the angle between the target — the vector drawn here pointing towards the north pole — and the estimate; ideally we want the angle theta to be close to 0. This has been analyzed in a sequence of papers. What they show is that you can make the angle smaller than any constant delta with high probability if you have enough measurements. Here m is the number of measurements and n is the dimension. There is a progression of papers along this line: initially the requirement was a number of measurements greater than a constant times n times an additional log factor, but this log factor can be removed if you do some preprocessing. So this is what was known in the literature about the performance of this method.

If you care about the proof technique, it is super simple — just a two-step recipe. The intuition is always the one I described: this is a random matrix, and it should be close to its expectation if you have enough measurements. The first step is to use a standard matrix concentration inequality to show that if m is large enough, this random matrix is indeed close to its expectation in operator norm with high probability. The second step uses a perturbation argument. (To answer the question from the audience: yes, in this case the sensing vectors are Gaussian.)
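Concretely, for the phase retrieval channel with a ~ N(0, I_n) and ||ξ|| = 1, the expectation mentioned a moment ago can be computed by decomposing a into its component along ξ and an independent orthogonal part:

$$\mathbb{E}\big[(a^{\mathsf T}\xi)^2\, a a^{\mathsf T}\big] = \mathbb{E}[g^4]\,\xi\xi^{\mathsf T} + \mathbb{E}[g^2]\,(I_n - \xi\xi^{\mathsf T}) = 3\,\xi\xi^{\mathsf T} + (I_n - \xi\xi^{\mathsf T}) = I_n + 2\,\xi\xi^{\mathsf T}, \qquad g = a^{\mathsf T}\xi \sim \mathcal{N}(0,1),$$

so the top eigenvalue is 3 with eigenvector ξ, the rest of the spectrum sits at 1, and the factor 2 is exactly the gap referred to above.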
Later on I will show extensions beyond the Gaussian assumption — it is not essential — but to prove things you would ideally start with the Gaussian case. You then use a standard matrix perturbation argument: because the expectation of D has an eigengap, you can use the Davis-Kahan sin-theta theorem to show that if the two matrices are close in norm, then their leading eigenvectors are also aligned. These are very standard procedures, but they give you these estimates.

Although the techniques are simple, the matrix concentration step gives a very useful insight, which is the following. Ideally we want to form the expectation matrix, which has an ideal eigengap of 2. But when you apply a matrix concentration inequality, you find that if the measurements y_i are unbounded, the concentration is very hard to control; that is why you need the additional log factor. A trick you can apply is to trim the measurements: you throw away any measurements that are too large, effectively making the y_i's bounded. This is the idea behind the work of Yuxin Chen and Emmanuel Candès, and with this very simple truncation scheme they show that the number of measurements only needs to be linearly proportional to the underlying dimension.

Since we know this trick is effective, we might as well consider a more general setup: instead of using y_i as the weight, we apply a function T(y_i). The T proposed in the Chen-Candès work was this truncation scheme, but different types of preprocessing have also been considered in the literature. In general, the problem becomes: design a function T(y) such that the performance of the spectral method is optimized. The open question at the time was what the optimal form of this nonlinearity is. A priori this is challenging, because you don't want to just assume a parametric form and search over its parameters; you want to know the optimal function T itself.

What was also missing in this previous work was the so-called phase transition phenomenon. If you plot the performance of the spectral method as a function of the sampling ratio — m over n, how many measurements you take per dimension — with the correlation on the vertical axis, you see that there is a critical threshold below which the correlation is always zero; above it, the performance improves. What the prior work essentially characterized was the behavior of the curve far to the right: if you have a lot of measurements, the correlation is close to one. But it could not explain the phase transition.

(Question from the audience.) Yes, I think the function in this plot is the truncation, but the phenomenon is generic for most choices of T. That's true: if the function is linear, there is no eigengap, even when m over n goes to infinity — you can compute the expectation and see this. In general, if the function is odd, this approach doesn't work; but in that case there is a much simpler version of the method. You don't form the matrix at all: you just average y_i times a_i, and you will see the correlation. (Another question: do we characterize how fast the correlation approaches one as alpha goes to infinity?) That is an interesting question — it may be possible, but no, we don't do that here; I'll come back to this point.
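Written in symbols, the two-step recipe looks as follows (a sketch under the Gaussian assumption with ||ξ|| = 1 and a generic preprocessing T; the constant in the Davis-Kahan bound varies between statements of the theorem):

$$\mathbb{E}\big[\mathcal{T}(y)\, a a^{\mathsf T}\big] = \mathbb{E}[\mathcal{T}(y)]\, I_n + \underbrace{\mathbb{E}\big[\mathcal{T}(y)\,(g^2 - 1)\big]}_{=:\ \delta,\ \text{the eigengap}}\ \xi\xi^{\mathsf T}, \qquad g = a^{\mathsf T}\xi \sim \mathcal{N}(0,1),$$

and if the concentration step gives $\|D - \mathbb{E}D\| \le \varepsilon$ with high probability, then a Davis-Kahan-type bound yields

$$\sin\theta\big(v_1(D), \xi\big) \;\lesssim\; \frac{\varepsilon}{\delta}.$$

Taking T(y) = y for the quadratic channel recovers the gap of 2 computed above; truncating large y_i keeps T bounded, which is what removes the extra log factor in the concentration step.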
So what we did a couple of years ago was to carry out a precise asymptotic analysis. The setting we consider is: everything is i.i.d. Gaussian, the ratio of the number of measurements to the dimension is fixed, and the dimension goes to infinity. What we show is that the angle between the eigenvector and the true underlying direction ξ converges to a constant that depends only on alpha, and there is also the phase transition phenomenon: below a threshold the angle is pi over 2, so the two are uncorrelated; above it, the angle is strictly less than pi over 2. All of these are constructive predictions — there are formulas for computing the thresholds and the limiting values. Initially, when we derived this result, we had a technical positivity assumption on the processed measurements, but this assumption was later removed; the same prediction holds without it.

This picture illustrates the phase transition. The north pole is the direction we want to recover. If you don't have enough measurements, the angle between the eigenvector and the true vector converges to pi over 2, which means the eigenvector lives on the equator. In this case the method is no better than a random guess: if you pick a point uniformly at random on the sphere, with high probability that point is also essentially orthogonal to ξ. So this is the trivial phase. To make things worse, in this phase the gap between the leading and second eigenvalues of your matrix also converges to 0, so if you run your algorithm you will see very slow convergence. Once you cross the threshold, things start getting better: the angle converges to something away from the equator, and there is a non-trivial eigengap.

For the experts here, this of course looks very similar to classical spiked matrix models, for example the BBP phase transition. There is indeed a strong relationship. It is not exactly the same setting, but there is a way to map this problem to a model with a BBP-type phase transition.

Some experiments: you can see that these predictions work fairly well even when the dimension of the underlying signal is pretty small — here a 64-by-64 image. The asymptotic predictions, given by the lines, match the experimental results.

This type of characterization can also be generalized to the rank-r case. Instead of studying the scalar product between one eigenvector and one direction, we need to study the correlation matrix between r eigenvectors and the r directions. The story is similar but a bit more complicated: you can show that this r-by-r correlation matrix converges to a deterministic limit, and the limiting values can be computed by solving a set of nonlinear fixed-point equations. So these are the extensions to the rank-r case.

In one experiment we consider the so-called multiplexed imaging model, where each measurement y_i is a linear combination of three phaseless measurements. What I show here are the correlations between the true images (the true directions ξ_1 to ξ_3) and the estimated eigenvectors, theory versus simulation. You can see that they match well, and there is again the generic phase transition phenomenon.
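A small simulation one could run to reproduce the flavor of this multiplexed experiment; the sizes, the number of images, and the mixing weights below are illustrative, not the ones from the talk:

```python
import numpy as np

# Rank-r sketch: multiplexed phaseless measurements y_i = sum_k w_k <a_i, xi_k>^2,
# then the same weighted outer-product matrix and its r leading eigenvectors.
rng = np.random.default_rng(0)
n, m, r = 200, 4000, 3
w = np.array([1.0, 0.7, 0.4])                       # illustrative multiplexing weights
Xi = np.linalg.qr(rng.standard_normal((n, r)))[0]   # orthonormal hidden directions xi_1..xi_r
A = rng.standard_normal((m, n))
y = ((A @ Xi) ** 2) @ w                             # one multiplexed measurement per sample
D = (A * y[:, None]).T @ A / m
U = np.linalg.eigh(D)[1][:, -r:]                    # r leading eigenvectors
overlap = np.abs(U.T @ Xi)                          # r x r correlation matrix discussed above
print(np.round(overlap, 3))
```

Sweeping the sampling ratio m/n in a loop like this is also a quick way to see the transition point for each direction empirically.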
For different images the phase transition points are different; this has to do with the fact that they are weighted differently. The first image, ξ_1, pops out first because the signal-to-noise ratio for that image is the highest. A similar experiment can be done for the two-layer neural network: suppose you have a two-layer perceptron and you want to estimate all the weight vectors in the first layer — you can use the spectral method, and this is the correlation you would get.

Now I want to come back to the problem of optimal design. Since we now have a precise asymptotic characterization, we can ask: what is the optimal preprocessing function T to apply? This can be formulated as an optimization problem, where rho is the correlation between the eigenvector and the true direction ξ. It depends on two quantities: the sampling ratio alpha and the function T that you apply, and we want to search over functions to maximize the correlation.

The first result answering this type of question — partially answering it — was given in a paper by Mondelli and Montanari. They showed that there is a critical threshold of alpha, the so-called alpha_weak, below which, no matter how you choose the preprocessing function T, the optimal correlation is always zero; you cannot do anything if the sampling ratio is below that threshold. Above the threshold, they showed that the value is positive, and they did so constructively: they designed a T such that when alpha is greater than the threshold, the correlation is positive. What they didn't know was whether their design was optimal or not; it turned out not to be. By the way, in the case of noiseless phase retrieval, this threshold value is equal to 1/2.

What we did was to actually solve this optimization problem. It turns out the problem has a closed-form solution. It is a functional optimization, but it has a very nice geometric interpretation in terms of finding a minimum-norm solution in an affine space. In any case, the optimal function has an explicit formula: it depends only on the conditional distribution — the channel — and the expectation is taken over a Gaussian random variable. What is a little surprising is that this optimal function T does not depend on the sampling ratio alpha: for any sampling ratio, you should always use this T. It is the optimal function within the framework of spectral methods.

Here is an experiment showing the potential benefit of using this optimal T. In this case the model is a Poisson channel applied to the quadratic measurement, and using the formula you can compute the optimal T, which has the form shown. You don't need to focus on the exact expression; the point I want to make is that this is not something you could guess by intuition — you have to know the optimality result to know this is the function you should apply. In the experiment, the blue curve is what you get using this optimal function. It is much better than the previous design, the trimming (truncation) scheme, which was effective but ad hoc. It is also better than the scheme proposed by Mondelli and Montanari; the two curves achieve the same threshold.
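For illustration only, here is a Monte Carlo sketch of computing a preprocessing of the form T(y) = 1 - 1/E[G^2 | Y = y], with G standard Gaussian and Y drawn from the channel given G — this particular expression is my recollection of the published formula, so treat it as an assumption rather than the exact result — evaluated for the Poisson channel on the quadratic measurement mentioned above:

```python
import numpy as np

def optimal_T_mc(y_values, channel_sample, n_mc=200_000, seed=0):
    """Monte Carlo sketch of a preprocessing of the form T(y) = 1 - 1 / E[G^2 | Y = y].
    The exact optimal formula is the one in the paper; this form is an assumption
    used purely for illustration."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal(n_mc)
    y = channel_sample(g, rng)
    T = np.zeros(len(y_values))
    for j, yv in enumerate(y_values):
        mask = (y == yv)                 # discrete channel: condition on the exact count
        if mask.any():
            T[j] = 1.0 - 1.0 / np.mean(g[mask] ** 2)
    return T

# Poisson channel acting on the quadratic measurement: Y | G ~ Poisson(G^2)
poisson_channel = lambda g, rng: rng.poisson(g ** 2)
print(optimal_T_mc(np.arange(6), poisson_channel))
```

Whatever the exact expression, the qualitative point from the talk stands: the optimal T depends only on the channel, not on the sampling ratio.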
Past the threshold, though, the optimal curve is a bit better than what they get. The same type of optimal design can be applied to the rank-r case. If you look at the improvement here from green to blue: the green curve uses an ad hoc design of the T function, and the blue curve is what you get if you optimize T — there is a dramatic improvement. What is interesting is that in the rank-r case the optimal T for different eigenvectors must be different: you have to use a different form of the T function for different dimensions of the subspace.

All right. I also want to spend some time on the assumption about the sensing vectors. So far the sensing vectors are i.i.d. Gaussian. I'm a signal processor, and if I talk to colleagues who do the actual measurements, they laugh, because nothing is i.i.d. Gaussian in practice. The model we assumed is that the sensing matrix has i.i.d. Gaussian entries, and in the case of phase retrieval, y_i is the scalar product squared. This is a very nice assumption for theoreticians, but it has little to do with practice.

Here I want to mention one setup that is closer to what is done in practice, the so-called coded diffraction measurement. You have your sample, and the image of this sample is the vector ξ we want to reconstruct. In the setup you first place a mask in front of the sample, followed by an imaging system, and the physics of this imaging setup gives you the following model: your measurement is the Fourier transform of the mask multiplied with the underlying image — the Fourier transform comes from Fourier optics — and because you can only record the intensity of the light, you lose the phase information. This is the actual setup that is close to practice. The mask W is the pattern you place in front of the sample, and you can repeat the experiment multiple times with different patterns.

Mathematically, we can still abstract the model: the sensing matrix A can be written in block form, where each block has a particular structure — each block is a DFT matrix times a diagonal matrix whose diagonal elements correspond to the pattern you place, which you can take to be pseudo-random, plus or minus 1 and plus or minus i. This is the structure of the sensing matrix we consider, and of course it is much more challenging than the i.i.d. Gaussian assumption.

There is indeed an analysis of the performance of the spectral method with this type of sensing matrix. The sufficient condition shown in that paper is that the number of measurements must be greater than a constant times n times log n to the fourth power. If you are a theoretician, this looks fine — it's just an additional poly-log factor. But if you are a practitioner and do the calculation: n is the dimension of your image, and for a 1-megapixel image, log n to the fourth power is already greater than 36,000. So this is a very pessimistic estimate. It is only a sufficient condition, of course, but you cannot use it as a guideline in practice, because it would require far too many measurements. Experimentally, things are much better.
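To make the coded diffraction model concrete, here is a small NumPy sketch for generating such measurements; the mask alphabet {+1, -1, +i, -i} follows the description above, while the sizes and the number of masks L are illustrative:

```python
import numpy as np

def cdp_measurements(xi, L, seed=0):
    """Coded diffraction pattern sketch: for each of L pseudo-random masks d_l with
    entries in {+1, -1, +i, -i}, record |F (d_l * xi)|^2, i.e. the block structure
    A = [F D_1; ... ; F D_L] with F the DFT matrix and D_l = diag(d_l)."""
    rng = np.random.default_rng(seed)
    n = xi.shape[0]
    masks = rng.choice(np.array([1, -1, 1j, -1j]), size=(L, n))
    y = np.abs(np.fft.fft(masks * xi, axis=1)) ** 2   # intensity only: the phase is lost
    return y, masks

n, L = 1024, 6          # a 1-D signal for simplicity; a 2-D image works the same way with fft2
xi = np.random.default_rng(1).standard_normal(n)
y, masks = cdp_measurements(xi, L)
print(y.shape)          # (L, n): m = L * n phaseless measurements in total
```

These measurements can then be fed into the same spectral pipeline as before, which is what the simulations described next do.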
Here we plot the same performance curve, the correlation versus the sampling ratio m over n. You can see that in simulations with the coded diffraction patterns we still get similar performance: with about six times as many measurements as the dimension, the correlation is already close to 0.75. So you do not really need an oversampling factor of 36,000. The problem is that the performance in this case differs somewhat from what we predict under the Gaussian assumption.

What we do is exploit a universality phenomenon. In this case, as in many others, you can observe strong universality in the behavior of these high-dimensional inference algorithms. What I show here are the performance curves when we change the sensing ensemble from the coded diffraction patterns to a partial unitary ensemble — a random matrix with orthonormal columns — and to a random frame, where you start from a dense unitary matrix and randomly sub-sample its columns. Empirically, all these ensembles give you the same performance. Of course, similar observations are well known; they have been made, for example, in compressed sensing and many other problems. But this observation allows us to study the CDP (coded diffraction pattern) ensemble by studying the unitary ensemble, which is something closer to the Gaussian case that we can analyze.

What we get is a similar type of prediction: the angle still converges to deterministic quantities, which in this case are expressed via the R-transform of a certain probability measure. This is something you can obtain by analyzing this random matrix; for the experts, one step in the derivation is a calculation involving the Harish-Chandra-Itzykson-Zuber (HCIZ) integral. Here are some simulation results: the signal ξ is a real image, the blue curve is the theoretical prediction, and the crosses, stars, and circles are the experimental results using the coded diffraction patterns, the unitary ensemble, and the random frame — they all follow the same curve. Initially we derived this result using the replica method, but a few months ago the prediction was rigorously established in a paper using free probability calculations.

As a summary: in this talk I described a very simple spectral method for estimating a low-dimensional subspace from nonlinear measurements. The algorithm itself is very old — it is known as principal Hessian directions — but the way we studied it is new, in the sense that we studied the high-dimensional limit. Typically in the statistics literature, people study the case where m, the number of measurements, is much larger than the dimension n. If instead you consider the regime where m and n are comparable, there are phase transition phenomena that we can predict, and from this exact characterization we can also design the optimal shrinkage or preprocessing function.

I want to mention that there is much more beyond what I presented today. One remaining mystery is that if you derive the optimal preprocessing function, it turns out to have the same form no matter which matrix ensemble you use — Gaussian, unitary, or a general ensemble with an arbitrary distribution of singular values.
You should always use the same optimal preprocessing function. It turns out there may be an explanation for this, and hopefully in a couple of months either I or my collaborators Antoine, Florian, and Lenka will be able to explain the deeper reason why this is the case. So, thanks for your attention.