The first speaker is Justin Ko from the École Normale Supérieure de Lyon, and he is going to talk about matrix estimation problems. Justin, take it away.

Hello everybody, thank you for the invitation to speak. Today I'll be talking about matrix estimation problems. This is joint work with Alice Guionnet, Jonathan Husson, Florent Krzakala, and Lenka Zdeborová. I'm going to present some of my current work related to matrix estimation problems; aside from some polishing of the results, it should be posted quite soon. The title, "Matrix Estimation Problems", is deliberately vague, because I'm combining several results into one talk. I'll first introduce what these matrix estimation problems are, then go over the type of mathematical results we were able to obtain, and if there is time I'll also describe some of the general proof ideas.

Let's begin with the description of the model. We are going to consider symmetric, rank-D, N by N spiked matrices with inhomogeneous noise. The matrices we are looking at have the form Y = signal plus noise, where the noise component is a standard Gaussian noise matrix with a variance profile, and the rank-D spike takes the form X X^T. We will assume that the variance profile has a block structure, and of course we can take limits to obtain more general variance profiles later. This is the general type of spiked matrix model we are studying; it has been studied by many people, and hopefully these results will shed some more light on the behavior of these problems.
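In symbols, a minimal sketch of the model just described; the scalings below are my assumption for illustration, not something stated explicitly in the talk:

```latex
% Rank-D spiked symmetric matrix with a block-constant variance profile
% (the 1/N scalings are an assumed normalization, chosen so both terms have order-one spectra):
Y \;=\; \frac{1}{N}\, X X^{\mathsf{T}} \;+\; W,
\qquad
W_{ij} = W_{ji} \sim \mathcal{N}\!\Big(0,\ \frac{\Delta_{ij}}{N}\Big) \ \text{independently for } i \le j,
```

where X is an N by D matrix whose rows are drawn i.i.d. from a prior of order one, and Delta_{ij} = Delta_{g(i)g(j)} is constant on the blocks of a fixed partition g of the indices.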
This quite general spiked matrix model contains many models that have already been studied. For example, if I take the noise to be homogeneous, so that all the blocks take the same value, and by convention call that common value lambda, then I get the standard rank-one (or rank-D) matrix estimation problem: there is a signal, there is some noise, and I want to see the signal within the noisy observation of the data. When we add a variance profile to the noise matrix, different blocks may have different noise levels, so the estimation problem can be harder or easier depending on which block we are in; there is a bit more structure to the problem. We are also going to consider the case where the rank goes to infinity, but at a sublinear rate. If the rank is o(N), the signal is a lot more complicated, and we want to ask whether we can still recover it as N goes to infinity.

There are several statistical questions we can ask. One of the most natural is: how much information about the signal can we recover from the noisy observation? Another is: can we measure how noisy the observation can be before we can no longer say anything meaningful, that is, before we can no longer do better than a random guess? And the last question is: why should we even care about spiked matrices?

The mathematical approaches to these questions are the following. Measuring how much information we can extract amounts to computing something called the free energy, which quantifies precisely how much information about the signal is contained in the noisy observation. The question about recovery is a question about conditions on the size of the noise, that is, about phase transitions: when can we do better than a random guess? That comes down to studying how the amount of recoverable information changes as we vary the parameters. The last question, which is probably the most interesting, why we should care about these spiked matrices at all, is a question about universality: how many models fall under this framework, or how general can the analysis be? I will briefly go over results related to all of these, starting with the first question.

So, how much information about the signal can be extracted from the data? There are a few ways to measure this. One is to compute the minimum mean squared error (MMSE), which is a measure of how much information we can recover, and computing it boils down to computing the conditional expectation of the signal X given the observation Y. Because the model is simple, we can write down the conditional law explicitly: by Bayes' theorem it has an explicit form, with just a quadratic function in the exponent. If we take the normalization constant of this conditional probability and look at its normalized logarithm, we get a quantity called the free energy. The free energy is itself an interesting object to study, because it encodes the mutual information up to a constant that is quite easy to compute. The hard part is computing the limit of this free energy for these models. There are many known results for limits of such free energies, and they take rather different forms, so I will just present the ones we managed to derive.
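To fix notation before stating the results, here is a hedged sketch of the posterior and the free energy just described; the exact normalization is my choice, not quoted from the talk:

```latex
% Posterior of the spike given the data (Bayes' theorem), schematically:
%   dP(x \mid Y) \propto \exp\big(H_N(x; Y)\big)\, dP_X(x),  with H_N quadratic in x,
% and the free energy is the normalized log of the normalizing constant:
F_N \;=\; \frac{1}{N}\,\mathbb{E}\,\log \int \exp\big(H_N(x; Y)\big)\, dP_X(x).
```

Up to additive and multiplicative constants that are easy to compute, this quantity is the mutual information between the signal and the observation, which is why its limit controls the minimum mean squared error.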
Let's consider the first case: a finite-rank signal with a variance profile on the noise. For this model we have an explicit formula for the limiting free energy, given as the solution of a variational problem: the limit is the supremum of a functional over a set of parameters. The precise form of the functional is not so important; what matters is which parameters we optimize over. When there is a variance profile, the number of parameters depends on how many blocks the profile has, and their dimension depends on the rank: we optimize over D by D matrices, one for each block of the variance profile, and there is an explicit formula for the functional. Once we have the limit of the free energy, if we can solve this optimization problem, we obtain an explicit formula for the minimum mean squared error in terms of the optimizers of the functional. It is almost exactly the same as in the homogeneous case, except for some averages that depend on the sizes of the blocks of the noise.

On the other hand, suppose we have homogeneous noise but a rank that diverges to infinity, though not too fast: the rank diverges at a sublinear rate. For these models, because I want to use techniques from random matrix theory, we assume that X is rotation invariant, and to make sense of the problem in the limit, X has to behave nicely in the limit. So we assume that the empirical distribution of the eigenvalues of the signal converges to some probability measure eta with compact support, and that it satisfies a large deviations principle at the appropriate speed. Again, for this model we have a limit of the free energy, with an explicit formula for the functional we optimize over. The parameter in this case can be thought of as a probability measure, which encodes the distribution of the eigenvalues. There are a lot of terms in this functional, but essentially, if you compute the rank-one problem and take limits, you get a sum of rank-one problems; since we are taking the rank to infinity, instead of a sum we take an average, that is, an integral. So this is a natural generalization of the rank-one formulas to the case where the rank grows sublinearly. These objects depend on the one-dimensional spherical integral and on the BBP transition map, which encodes the behavior of the outlying eigenvalues of these matrices. And again, once we have the limit of the free energy, we can compute the minimum mean squared error for these problems.

So I went rather quickly through some formulas for the limit of the free energy. The question is: using these formulas, what conditions on the noise allow us to say anything meaningful about the observation? Here are some results. We assume that the prior distribution, the way we generate the spike, is centered. The condition involves the matrix whose entries are one over the squared noise, weighted by the sizes of the blocks: if its top eigenvalue (its operator norm) is below some value, which corresponds to the noise being large enough, then the minimum mean squared error we can achieve is essentially that of a random guess. After a certain point, the minimum mean squared error becomes better than the random guess; that is the second result. As written, these bounds are clearly not sharp, and the precise transition depends a lot on the underlying prior. But when the prior is a standard normal, we can prove that the transition is sharp, in the sense that the first bound can be improved to match the second.

The way I have written this bound may be a bit confusing, but after normalizing, it is essentially a statement about the signal-to-noise ratio for estimation problems with inhomogeneous noise. The strength of the signal is encoded by the top eigenvalue of the squared covariance matrix of the data, and the strength of the noise is encoded by the top eigenvalue of the normalized noise matrix built from the variance profile. If this signal-to-noise ratio is above a certain threshold, we can say something meaningful about the recovery problem; below it, we cannot. So even though the noise is high dimensional, the signal-to-noise ratio is a one-dimensional quantity that we can check precisely. We know this bound is sharp for standard normal priors.
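Schematically, the one-dimensional signal-to-noise condition just described has the following flavor; the notation, weights, and the precise constant are my reconstruction and were not spelled out in the talk:

```latex
% Schematic effective signal-to-noise ratio for the inhomogeneous problem (my notation):
%   \Sigma = \mathbb{E}_{P_X}[x x^{\mathsf{T}}]  is the prior covariance,
%   \alpha_a  are the relative block sizes, and  \Delta_{ab}  the block noise variances.
\mathrm{SNR} \;\asymp\; \lambda_{\max}\!\big(\Sigma^{2}\big)\;
\lambda_{\max}\!\Big(\Big[\tfrac{\sqrt{\alpha_a \alpha_b}}{\Delta_{ab}}\Big]_{a,b}\Big),
\qquad
\mathrm{SNR}\ \text{small} \;\Rightarrow\; \text{MMSE of a random guess},
\quad
\mathrm{SNR}\ \text{large} \;\Rightarrow\; \text{better than random}.
```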
Can we say anything for more general probability measures? Mathematically, this is a harder problem: it is harder to solve the variational formulas, but we can compute them numerically. If we consider Bernoulli data with some probability of being zero, so sparse Bernoulli data, or sparse Rademacher data, then when the signal is not too sparse it appears the transition is sharp: the free energy starts changing behavior at precisely the same point as in the standard normal model. But when the prior becomes sparse, there is a gap: the free energy changes behavior at an earlier point. So it remains an interesting question to compute the transition precisely for all priors.

Those transitions were computed by studying the mutual information formulas. Another viewpoint is that we can use random matrix theory directly to study these transitions. In the rank-one homogeneous models, a common approach is to look at when the top eigenvalue separates from the bulk. To reduce our problem, which has a variance profile, to this setting, we can simply divide entrywise by the variance profile and put all the inhomogeneity on the signal instead. This reduces the problem to one with homogeneous noise, which is a very easy model to study, and we can ask when the top eigenvalue pops out for this transformed matrix. We can compute this transition precisely: the largest eigenvalue of the matrix, after moving all the inhomogeneity from the noise to the signal, separates at a certain value. This is a bit simplified, since I am considering the rank-one case, so the covariance matrix is just a variance. We obtain a condition that looks almost exactly the same as before, and the only difference is that instead of one over the variance, we have one over the standard deviation.

How does this compare with the transition we found before? In the case where Delta takes only one value, the homogeneous case, they are the same: both transitions happen at the same point. But as soon as the variance profile is non-trivial, it turns out that the region of Deltas and rhos where the top eigenvalue separates is smaller: the set of parameters for which the top eigenvalue separates is contained in the region we found before. So there is a gap for these models, which perhaps means we are not looking at the right matrix; something about having the variance profile on the noise does change the behavior slightly.
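Before moving on, here is a minimal numerical sketch of the eigenvalue-separation diagnostic just described. The normalization of the model, the block values in Delta, and the signal strength rho are my assumptions for illustration, not values from the talk.

```python
import numpy as np

# Minimal sketch of the "put the inhomogeneity on the signal" diagnostic.
# Assumed model: Y = (rho/N) x x^T + W with Var(W_ij) = Delta_ij / N (symmetric Gaussian noise).
rng = np.random.default_rng(0)
N, rho = 1500, 3.0
groups = np.repeat([0, 1], N // 2)              # two equal-size blocks
Delta = np.array([[1.0, 4.0],
                  [4.0, 0.5]])                  # block-constant variance profile (assumed values)

x = rng.standard_normal(N)                      # rank-one spike with a Gaussian prior
D = Delta[np.ix_(groups, groups)]               # entrywise variance profile Delta_{g(i) g(j)}
W = rng.standard_normal((N, N)) * np.sqrt(D / N)
W = np.triu(W) + np.triu(W, 1).T                # symmetrize the noise
Y = (rho / N) * np.outer(x, x) + W

# Divide entrywise by the standard deviation profile: the noise becomes homogeneous
# (Wigner-like, entry variance 1/N), and all the inhomogeneity moves onto the signal part.
Y_tilde = Y / np.sqrt(D)

# BBP-style check: does the top eigenvalue separate from the bulk (semicircle edge near 2)?
eigs = np.linalg.eigvalsh(Y_tilde)
print(f"top eigenvalue: {eigs[-1]:.3f}, next: {eigs[-2]:.3f}")
```

With these (assumed) parameters the top eigenvalue of the transformed matrix detaches from the bulk; shrinking rho pushes it back inside, which is exactly the kind of transition being compared against the free-energy threshold.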
To summarize what we have said so far: we have a way to compute how much information about the signal can be extracted from the noisy observation, and we have ways to determine when the phase transitions happen and how to quantify the signal-to-noise ratio in these models. My last point is this: these spiked matrices are very simplified models, so why should we care about them?

To answer that, consider a more general inference problem. We are no longer looking at spiked matrices; we generate data from some arbitrary probability measure, and we generate our observations, conditionally on this underlying signal, from some known probability measure P_ij. The observations are generated independently, but for each pair of indices i and j we can have a slightly different rule for generating them. If we write down the free energy associated with this general statistical inference problem, the partition function involves the exponential of the log-likelihood of the observations. What we can show is that if the log-likelihood obeys some regularity conditions, some smoothness and some bounds on its first and second derivatives, and assuming that X has compact support, we can define a matrix 1/Delta_ij, given explicitly as an expectation involving the first derivative of the log-likelihood. The statement is that the limit of the free energy for this general statistical inference problem is exactly the same as that of a spiked matrix model with the special choice of variance profile computed above. This means that studying these spiked matrix models allows us to compute the mutual information for these more general statistical inference problems, and of course we can also use the results for the spiked matrix models to compute the limit of the free energy for the general inference problems.

We can actually prove a slightly stronger form of universality. What we said above is that the free energies are the same; but we can also say something about the spectra of these matrices. Of course, the matrix Y generated from the general inference problem will not be so nicely behaved, but if we apply a particular transformation to it, namely if we evaluate the derivative of the log-likelihood entrywise on the matrix, then we can show that, after some normalization, the spectrum of this transformed matrix agrees with that of the corresponding normalized spiked matrix in the bulk, and that in the limit the transition, the top eigenvalue jumping out, happens at exactly the same time for both matrices. So there is a universality of the spectra of these matrices.

I also have an explicit example of a general inference problem that falls under this framework: the degree-corrected stochastic block model. It is the standard stochastic block model, where you generate a graph and the chance of an edge being present depends on which group each vertex belongs to. These models have two parameters: lambda encodes the difference between the probability of generating an edge within the same community and across different communities, and in the degree-corrected model there is another parameter theta_i, attached to vertex i, which encodes that vertex's individual propensity to attach edges. So the location of the index now matters, as well as which group you belong to. You can build an adjacency-type matrix for this problem: a matrix of plus and minus ones whose entries are generated conditionally on whether the two vertices are in the same group or not. If we look at the spectra of the two matrices, the adjacency matrix and the spiked matrix with the right variance profile, they are very, very different. But if I apply the transformation from before, then the bulks of the spectra look the same, and the top eigenvalues separate at the same point.
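For concreteness, the entrywise transformation and the effective variance profile just mentioned are, in my own reconstruction (the talk only described them verbally as expectations of derivatives of the log-likelihood), of the following Fisher-information type:

```latex
% Effective noise for the observation channel P_{ij}(y \mid w), evaluated at zero signal
% (my reconstruction, stated here as an assumption):
\frac{1}{\Delta_{ij}} \;=\; \mathbb{E}\Big[\big(\partial_w \log P_{ij}(Y_{ij} \mid w)\big)^{2}\Big]\Big|_{w=0},
\qquad
Y_{ij} \;\longmapsto\; \partial_w \log P_{ij}(Y_{ij} \mid w)\Big|_{w=0},
```

the second map being the entrywise "score" transformation, suitably normalized, whose spectrum matches that of the corresponding spiked matrix in the bulk.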
So we have a form of universality: we have some really complicated data, but if we apply a nice transformation to it and look at the corresponding normalized spiked matrix, then the spectra behave the same. So just understanding the spectrum of these spiked matrices already lets us understand at least some more general inference problems.

I have a few minutes left, so let me quickly sketch some proof ideas. In the case of spiked matrices with variance profiles, the analysis boils down to computing certain integrals. If we look at the Hamiltonian for this model, there are a lot of terms, but if I ignore the last two, the Hamiltonian is basically a multi-species Sherrington-Kirkpatrick Hamiltonian with vector spins. Mathematically these objects are very well studied, so we can use tools from spin glasses to compute the limit of the free energy, and from there we can compute the thresholds; the universality follows from the universality of the disorder for these spin glass models. On the other hand, in the growing-rank problem, the free energy boils down to computing objects called spherical integrals. The main result that allows us to compute them in the growing-rank case is that, for spherical integrals in which the rank of one of the matrices is of sublinear order, we have an explicit formula: it is given as a sum of one-dimensional spherical integrals, where the ordered eigenvalues of the low-rank (temperature) matrix are matched with the correspondingly ordered eigenvalues of the matrix A_N. So we have an explicit formula, and this behavior is very different from the extensive-rank case, but in the sublinear-rank regime the formulas for these spherical integrals are explicit. And that's it; thank you for your attention.

Questions for Justin?

Hi, thank you for the very nice talk. I was wondering, you showed at some point something on the BBP transitions for this inhomogeneous model.

Yes.

Okay, I'm not sure I completely understood the message. You were saying that basically there is a gap between when an eigenvalue pops out and when the MMSE and the free energy change behavior?

That's right, that's right. In the homogeneous case there is no gap, but this is for a very specific matrix: I was only computing the BBP transition after removing the inhomogeneity.

You're also saying, I think, that in the Gaussian prior case the free energy transition is sharp, right?

That's right.

Yeah, okay. So in the Gaussian prior case, you really can say that you have an eigenvalue that pops out before you can recover anything about the signal.

That's right, but there's a caveat: I'm looking at the BBP transition of a very specific matrix, so maybe I'm not looking at the right one. The naive choice of removing the variance profile by putting it all on the signal may not be the right choice of matrix.

Okay, but sorry, just to clarify: the BBP transition happens before or after you can recover?

Yes, so the set of parameters...

Ah, okay. So you can recover before the eigenvalue pops out.

Yes, yes.

Okay, okay, sorry. So it goes in the right way, there is no contradiction.

Yeah, okay. More questions?

And is that true at all levels? Do you fully recover the signal before all the relevant eigenvalues pop out of the bulk?

Oh, I'm not sure.
This is just a weaker statement: the minimum mean squared error for the problem is non-trivial in the region where the top eigenvalue pops out. Whether we are looking at the right matrix, I don't know; maybe if we look at a different matrix, the transition happens at precisely the same point. So what this says is that if I look at this particular matrix, the top eigenvalue pops out in a region of parameters where the MMSE is non-trivial, and there is a gap if we look at this normalized matrix.

How important is it to assume this block structure for the variance matrix? Is it just that the optimization problem becomes feasible with it, or is it also critical in other parts?

If the Deltas were generated from a positive semi-definite kernel, then we can approximate them with these block models. It definitely becomes a lot more complicated, but as long as we can make a discrete approximation, these results hold; the formulas just become more complicated in that case.

I'll also ask one: what breaks in the linear-rank regime?

Sorry?

What breaks if you want to analyze the case in which the rank is linear?

Oh, I see. In the sublinear-rank case, the limit of the spherical integrals can be written as a sum of one-dimensional problems. In the extensive-rank case there are different formulas, proven by Guionnet and collaborators, which express the limit of the spherical integrals as a different object. So the asymptotics just do not match.

Thank you very much, Justin.