Thank you. So this talk will be about the all-or-nothing phenomenon, a sharp statistical phase transition, let's say. We will go into the details, and specifically we will discuss it in the setting of sparse PCA. This is joint work with Jonathan Niles-Weed at NYU. Okay, so let me just jump into it.

What is the motivation here, the generic introduction? A nice and attractive feature of high-dimensional statistical models is that they undergo phase transitions. This is a term borrowed from physics; it corresponds to the idea that small changes in the parameters of the model can change the structure of the model in a macroscopic way. And what would be a canonical phase transition to keep in mind here? The BBP phase transition from random matrix theory. The idea is that you take a PCA model, in the statistical sense of principal component analysis, and let's see together what this model is. There is an unknown signal beta, uniform on the sphere in p dimensions, and a p-by-p noise matrix W with iid standard Gaussian entries. We observe Y, which is beta beta-transpose, the rank-one spike generated by beta, appropriately scaled, say by the square root of a constant lambda times the dimension, plus the Gaussian noise. The goal, given access to Y, is to infer beta, or beta beta-transpose if you prefer. You can consider the MMSE in the high-dimensional limit, as the dimension goes to infinity, and then there is a transition happening exactly at lambda equal to one; lambda is again a constant here. If lambda is less than one, the limiting MMSE is what we call trivial. What do we mean by trivial? It matches the performance of a trivial estimator, namely the zero estimator; since beta (or beta beta-transpose, if you want) has unit norm, this value is one. So essentially you do as well as you would with zero observations. What happens for lambda bigger than one is that you drop to something strictly less than one, in a way that is continuous at one. So pictorially, if you plot the limiting MMSE, it is trivial up to one, and then it decreases continuously for lambda bigger than one. And this is a generic picture; we see it in many models, and it is the traditional high-dimensional statistics picture. The question is: is it always the case that the limiting MMSE is a continuous, quote-unquote "smooth" curve? The all-or-nothing phenomenon is, if you want, a refutation of this, and a rather strong refutation, in generic contexts: there are cases where this transition is very sharp. So consider a variant of the PCA model, a very sparse variant of it, known as the k-sparse PCA model, where the sparsity k is sublinear in the dimension. The vector beta is on the sphere as before, but it has only k nonzero entries, and for simplicity let's say it takes two values: every entry is either zero or one over the square root of k, with exactly k of them nonzero.
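In symbols, here is my reconstruction of the dense model and of the sparse prior just described (the exact normalization on the slides may differ):

\[
Y = \sqrt{\lambda p}\,\beta\beta^{\top} + W, \qquad W_{ij} \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(0,1),
\]
with \(\beta \sim \mathrm{Unif}(\mathbb{S}^{p-1})\) in the dense case, and in the \(k\)-sparse case
\[
\beta \ \text{uniform on} \ \bigl\{ b \in \{0,\, 1/\sqrt{k}\}^{p} : \|b\|_{0} = k \bigr\}, \qquad k = o(p),
\]
where, as discussed next, the scaling of the spike has to be adjusted.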
And then it is the same model as before: Gaussian noise W, and Y is beta beta-transpose, appropriately scaled, plus W. One caveat: the transition does not happen at square root of p now; you need an appropriately different dimension-dependent normalization, but lambda is, again, a constant. And here is the other phenomenon: if you consider the MMSE now, there is again a jump at one, but a very strong jump. At lambda less than one, you have the performance of the trivial estimator. But at lambda bigger than one, the MMSE becomes vanishing: as good as it gets, at least at this scale of things. Just to recall, what we had before was a transition that bends down continuously, and the moment the sparsity goes sublinear in p, it becomes a jump from one to zero. Let me also say a word about who proved this result: it was first established by Jean Barbier, Nicolas Macris, and Cynthia Rush in 2020, and we later strengthened it with Niles-Weed. So this is the thing I would like you to keep in mind: there are, at least, these two different pictures happening here.

Again, this is in sharp contrast with what we understand about regular PCA, and about many other models which are denser in nature. A small review here. The first case where we showed this phenomenon is actually sparse linear regression, a somewhat more complicated context, and we established it in the case k = o(square root of p), which may seem a little strange, but it is a technicality; we believe it again happens in the whole sublinear regime, but we did not prove it. The history is: with David Gamarnik, in 2017, we proved it for the MLE, a particular estimator, the maximum likelihood estimator, and we conjectured that it happens in general. Later, with Galen Reeves and Jiaming Xu, we established that in 2019. Then, interestingly, over the following year or two, people kept bumping into this sharp transition, including at this conference, and started realizing that the phenomenon is actually happening quite generally in sublinear sparsity settings, or at least is expected to in many settings. For example, in Bernoulli group testing, Truong, Aldridge, and Scarlett established it in the regime of sparsity polylogarithmic in p. In sparse PCA, it was established again for sublinear k, but bigger than p to some power, 12/13 if I am right here. Also in generalized linear models, and in random graph matching. And you can see there is always some technicality: one cannot reach what we expect to be the whole sublinear regime, though we believe it holds there; graph matching is a bit of a different business, which is why there is a gap there. But this brings the question: what is causing the sharp transition? Why is there such a discrepancy? What I want to do in this talk, briefly, is tell you about a special case of this phenomenon, from recent work with Niles-Weed, in the model of sparse tensor PCA, which, as we will see, generalizes the sparse PCA model I mentioned before, and I chose it for two reasons.
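In symbols, the contrast between the two pictures (my notation, not the slides'):

\[
\text{sparse } (k = o(p)): \quad \lim_{p\to\infty} \mathrm{MMSE}(\lambda) =
\begin{cases}
1, & \lambda < 1 \quad (\text{nothing is learned}), \\
0, & \lambda > 1 \quad (\text{everything is learned}),
\end{cases}
\]
whereas in the dense case the limit equals \(1\) for \(\lambda \le 1\) and decreases continuously into \((0,1)\) for \(\lambda > 1\).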
First, someone may ask why I do not do this for sparse regression, where the phenomenon was initially established: it is just much more complicated, and even though regression is the first example, this model is much cleaner. So, the sparse tensor PCA model is as before: you observe Y, but it has a tensor form that decomposes, again, as signal plus noise. Your signal is the t-th tensor power of a hidden vector; for t equal to two it is the matrix case. You appropriately rescale it, and you add a tensor of iid Gaussian entries. And besides being a tensor, the spike is also sparse; that is the sparsity in "sparse tensor". For t equal to two, this is the model I presented on the second slide. What we did in this work is establish the phenomenon, to the best of my knowledge for the first time, for the whole sublinear sparsity regime. This improves over the result I mentioned before, by Barbier, Macris, and Rush, which covered t equal to two and k at least p^{12/13}. And the reason I want to mention it is that along the way we understood a generic equivalence for the phenomenon: it occurs exactly when a simple, probabilistic if you want, condition takes place. We boiled it down to a relative entropy calculation, and that is the development I want to communicate.

Okay, so let me jump into the details and tell you what we understood. The way we understood it is by working in a somewhat more general class in which all the PCA models I described live: the Gaussian additive model. So what is a Gaussian additive model? You may have heard it under different names in the past; it is simply: you take a signal and you add Gaussian noise. More formally, for us, it is the following. As before, there is some generality: your signal lives on the sphere, but now there is a discrete, finite subset of the sphere on which I allow the prior to be supported. Call it S, a finite subset, and beta is a uniformly chosen vector from S. Then I rescale beta by the SNR, add iid Gaussian noise, and observe Y, and from Y I want to infer beta. That is the whole model; it hides nothing, except of course this set S. And what is this S? You might ask: is this really the PCA model? This looks simpler, just vector plus noise. It is, of course, because you can define S to have whatever structure you want. For example, you can define S to encode Rademacher PCA, the hypercube-prior PCA, where we know the transition is not all-or-nothing; it is the BBP phase transition I described on the first slide. How do you do this? You take the hypercube vectors v from which you want to build the rank-one spike, you form, now in p-squared dimensions, the matrix v v-transpose, and you vectorize it; then you choose beta to be the vectorized form of v v-transpose. This is in exact one-to-one correspondence with the Rademacher PCA model.
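Here is a minimal numerical sketch of this embedding, assuming the normalizations above (the variable names are mine, purely illustrative):

    import numpy as np

    # Rademacher PCA as a Gaussian additive model Y = sqrt(lam) * beta + W,
    # with beta the vectorized rank-one spike built from a hypercube vector.
    p, lam = 50, 2.0
    rng = np.random.default_rng(0)

    v = rng.choice([-1.0, 1.0], size=p) / np.sqrt(p)  # hypercube direction, unit norm
    beta = np.outer(v, v).ravel()                     # vectorize v v^T: a point in R^{p^2}
    assert np.isclose(np.linalg.norm(beta), 1.0)      # beta lies on the sphere in p^2 dims

    W = rng.standard_normal(p * p)                    # iid Gaussian noise, vectorized
    Y = np.sqrt(lam) * beta + W                       # the Gaussian additive observation
    # Reshaping Y to (p, p) recovers the matrix form of the spiked model.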
Of course, you can also do sparse PCA: impose the condition that the vector is k-sparse, so that it takes, say, three values, giving a sparse Rademacher version of what I mentioned before, again in p-squared dimensions; that is fine. And you can do tensor PCA as well: you just go to p-to-the-t dimensions and consider the corresponding vectorized tensors. And this is a good thing, because it builds a contrast into the class: Rademacher PCA does not have a sharp MMSE transition. So when we come to the point of what makes the transition sharp or not, at least in this context, this will hopefully clarify that part. Okay, so this is the Gaussian additive model; hopefully it is clean: there is this hidden subset S, and that is essentially it.

So what is the all-or-nothing phenomenon? The phenomenon is clean to define in this language. What would it mean for a Gaussian additive model to exhibit the all-or-nothing phenomenon? It means that as you increase the SNR, at some point the MMSE jumps from one to zero. There is a critical lambda_c, which may now depend on two things, the dimension and S, such that for any constant epsilon the following transition happens. If you are a little below lambda_c, the limiting MMSE is trivial: you cannot beat the random guess; you have no information. But if you are a little above lambda_c, the limiting MMSE becomes zero. And the question is: when is this the case? When do we expect such a sharp transition for the limiting MMSE of a Gaussian additive model? Note that the window here is at the level of constant multiples of lambda_c; this is important.

So, the theorem is as follows. I need some notation. Denote, for simplicity, by M the cardinality of S, the size of the support of your prior, so that the entropy of the prior is just log M; you have M points you are choosing from. Then P_lambda is the law of your observations at SNR lambda, and Q is just P_0, the law of Y at zero SNR, which, with no signal, is just W, a standard Gaussian law. The theorem is quite simple to state: for a Gaussian additive model, all-or-nothing happens at some lambda_c if and only if two conditions hold, and that is basically all. First, the transition must happen at twice the entropy of the prior: up to 1 + o(1) factors, there is no other SNR at which this transition can occur. Second, if you compare the distribution of the observations at criticality with the zero-SNR case, they must be very close in KL distance, but not necessarily extremely close: the KL divergence between them can grow, just not as fast as log M. And basically, that is it.
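In symbols, my rendering of the statement, with \(M = |S|\), \(P_{\lambda}\) the law of \(Y = \sqrt{\lambda}\,\beta + W\), and \(Q = P_{0}\):

\[
\text{all-or-nothing at } \lambda_c
\quad\Longleftrightarrow\quad
\lambda_c = (1+o(1))\, 2 \log M
\quad\text{and}\quad
D_{\mathrm{KL}}(P_{\lambda_c} \,\|\, Q) = o(\log M).
\]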
So the all-or-nothing phenomenon happens at some lambda_c if and only if lambda_c is 2 log M and the KL condition holds: what you see at criticality looks, in this appropriate quantitative sense, very much like what you saw at zero SNR. And a quick comment, for the experts here: a lot of times in the literature you see statements like "if the KL goes to zero, then I can prove that the MMSE is trivial." This is a little stronger: it shows there is some room. The KL does not have to go to zero; it can even grow, as long as it is o(log M), and still you get nothing, still you can prove the MMSE is trivial.

Okay, so let me tell you very quickly how this applies to sparse PCA. By the way, this gives you a tool: you have your favorite Gaussian additive model, you want to prove the all-or-nothing phenomenon, so you just prove this KL condition, which is a calculus exercise, or, more precisely, an analysis exercise. So that is what we did: we applied it to sparse PCA. Let me tell you exactly what. Here S is the set of spikes your matrix can have, the k-sparse binary vectors from before. And what is M here? It is just p choose k: you only need to choose a support. So the critical lambda in sparse PCA, at least with a binary vector, would be just twice log of p choose k. That is the criticality, and that is what we show: scaling the model so that a constant lambda multiplies this critical SNR, and you do not need the spike to be a matrix, any tensor power works, for all k sublinear in p we establish the all-or-nothing phenomenon as a corollary of the theorem. If lambda is less than one, the MMSE is trivial, and if lambda is bigger than one, it vanishes. So this is just a direct application of the result; I am not hiding any conceptual calculations here.

In the minutes I have left, I would like to give you an idea of how we prove the equivalence, and hopefully give you some intuition as well. Recall that P_lambda is the law of the observation Y at SNR lambda, and M is the cardinality of S, how many points your prior chooses from. We want to prove that the all-or-nothing phenomenon happens if and only if the KL at criticality, between the law of the observations at criticality and at zero, is o(log M), and lambda_c is twice log M. The key tool here, and the reason we have such a characterization, is the I-MMSE relation: a very beautiful, clean, and quite straightforward-to-prove relation of Guo, Shamai, and Verdú from 2005. It really follows by Gaussian integration by parts, and it connects the KL, the relative entropy between the law of Y at SNR lambda and at SNR zero, with the MMSE in this model, precisely because the noise is Gaussian.
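In the normalization used here (a reconstruction; the exact constants depend on how the model is scaled), the relation reads:

\[
\frac{\partial}{\partial \lambda}\, D_{\mathrm{KL}}(P_{\lambda} \,\|\, Q) = \frac{1}{2} - \frac{1}{2}\,\mathrm{MMSE}(\lambda),
\]
with the MMSE normalized so that the trivial zero estimator attains the value one.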
The relation tells you that the derivative in lambda of the KL between P_lambda and Q is just one half minus one half times the MMSE. And this is exact; I am not hiding any asymptotic terms here. Now, what is nice: if you have this identity and you know the all-or-nothing phenomenon happens, then, at least asymptotically, the MMSE is one for lambda less than lambda_c and zero for lambda bigger than lambda_c. In particular, the derivative is one half minus one half, that is, zero, for lambda less than lambda_c, and one half for lambda bigger than lambda_c. So you can translate the MMSE phase transition into a phase transition for the derivative of the KL. If you then think about what is happening and plot the KL against lambda, you know it starts at zero (the KL between P_0 and Q is zero, because Q is P_0), it should stay there the whole time until you hit lambda_c, and then it should grow with slope one half. And essentially this gives an if-and-only-if characterization, admittedly I am exchanging some limits here, of all-or-nothing for the MMSE in terms of the KL. Formally: all-or-nothing happens at lambda_c if and only if, for every constant alpha bigger than zero, the KL between P at alpha lambda_c and Q, divided by lambda_c, converges to what it should. If you plot this limit against alpha, it is zero all the way to one, and then it grows like one half times (alpha minus one). And why do we divide by lambda_c? Because, by the chain rule, if you differentiate in alpha you get back the derivative condition; it is just a matter of the chain rule.

So we have translated the phenomenon into infinitely many KL conditions, one per alpha. That is the first step. How do you then reduce them to a single one? Here is the quick argument for how the single condition, KL at criticality o(log M), establishes all of them. It comes from analytic properties of the KL. The function mapping lambda to the KL between P_lambda and Q is non-decreasing and convex (this is an easy argument), and it starts at zero. In particular, under the assumption, at alpha equal to one the normalized KL is still zero in the limit; and since the function is non-decreasing, it must stay at zero all the way up to one. For alpha bigger than one, the derivative formula bounds the slope by one half, which gives the upper bound; the matching lower bound we then prove by direct analysis. That is how we prove the one direction: it comes from the analytic properties of the KL. How do you prove the other direction, that if the all-or-nothing phenomenon happens, then lambda_c must be 2 log M and the KL condition must hold? The key is an exact identity linking the KL to the mutual information, sketched below.
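Under the same normalization (again a reconstruction), the identity and the resulting limit for \(\alpha > 1\) read:

\[
D_{\mathrm{KL}}(P_{\lambda} \,\|\, Q) = \frac{\lambda}{2} - I(\beta;\, Y_{\lambda}),
\qquad
\frac{D_{\mathrm{KL}}(P_{\alpha\lambda_c} \,\|\, Q)}{\lambda_c}
= \frac{\alpha}{2} - \frac{I(\beta;\, Y_{\alpha\lambda_c})}{\lambda_c}
\;\longrightarrow\; \frac{\alpha}{2} - \frac{\log M}{\lambda_c}.
\]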
So if you prove that lambda_c is 2 log M, you are essentially done: plug lambda_c equal to 2 log M and alpha equal to one into the limit condition, and the quantity becomes the normalized KL at criticality while the limit becomes zero, so you get the KL condition as well. But there remains the question: why should criticality always happen at 2 log M? Why is there no other choice? This comes from the following computation. If the phenomenon gives you the limit of the normalized KL, you can translate it into mutual information by the identity above: the normalized KL is alpha over two, minus the mutual information between beta and the observation, divided by lambda_c. And because for alpha bigger than one you recover the signal, the mutual information is essentially the full entropy of the prior (by the definition of mutual information, the conditional entropy term is essentially zero), and the full entropy is log M. So the normalized KL becomes alpha over two minus log M over lambda_c. But we knew from before that this limit should be one half alpha minus one half. Equating these two expressions, the only place criticality can happen is lambda_c equal to 2 log M. Admittedly this was a super technical aside; I hope the experts got something out of it. So, that is the single KL condition, and let me wrap this up very quickly; sorry for running a few minutes over.

So, this is a new sharp MMSE phase transition. For the people who did not know the phenomenon, what I hoped to communicate is that there are such sharp transitions out there, and some interesting questions remain. Are these transitions computational as well? For example, do we expect a sharp transition in a computational sense? Approximate message passing, for example, is a good candidate; we do have some results in regression with Galen Reeves and Jiaming Xu, and I believe Jean Barbier, Cynthia Rush, and coauthors also have results in this direction. It is a good direction to investigate. We have a clear understanding for these Gaussian additive models; there are many different models where we do not. Is sublinear sparsity the crucial ingredient? I am not sure, for any of these models, that I would say it is the crucial part; this too is something to be discussed. And finally, depending on your background: in random graph theory, twenty years ago, there was a big quest to understand the difference between sharp and coarse phase transitions, culminating in the work of Friedgut and Bourgain, trying to understand what causes, say, the appearance of triangles to have a coarse threshold, as opposed to, for example, connectivity, in random graphs. I like to think, or I have the ambition, that this direction can go the same way in statistics, and that we will eventually get a characterization of when the MMSE exhibits a sharp transition. Okay. Thank you very much, and I am sorry for taking a bit of extra time.