It is my pleasure to introduce one of the main organizers of this workshop, Jean Barbier, who will tell us about Bayes-optimal limits in structured PCA and how to reach them. So, Jean, please take the lead.

Thank you, Francesca. Sorry, Federica, I'm sorry. I wanted to present this work because I thought it was really on point for this conference, and it's something I'm still pretty excited about, so I hope I'll be able to transmit that. This is a work that took us quite some effort to bring about. It is connected to Bayesian inference in a problem that we all know very well, which is principal component analysis. I will tell you some things about this problem in a setting slightly different from the one we are used to, where the noise is considered independent and unstructured. We will try to go a bit beyond that, which is precisely the point of this conference. This is joint work with people that you will of course recognize: Francesco Camilli, who is no longer in Bologna but is now at ICTP here, and we are lucky to have him for a few months; Marco Mondelli, who is in the crowd and whom you all know; and Manuel Sáenz, who is another organizer and is arriving tomorrow. You can find the paper on arXiv, and hopefully in a week or two it should be accessible in the Proceedings of the National Academy of Sciences.

So what do you do when you have access to a large matrix of data that can represent anything? For simplicity, in this talk I will consider all matrices to be symmetric, so eigenvalues will always be real. Of course, the first thing you do to make any sense of this data is to diagonalize the matrix and try to find the dominant eigenvalues and eigenvectors. This is PCA, or more precisely spectral PCA, because this is really a spectral algorithm. And what happens when you do spectral PCA in the ideal world? Let's start with this case. The ideal world in the realm of statistical inference of matrices is the so-called spiked model, which looks like this: the data matrix Y is constructed as a low-rank matrix of information X* X*^T, where the star denotes the unknown vector that you want to infer, plus noise Z, which is a Wigner matrix of size n by n. This was introduced in the context of statistics by Johnstone, and it has been studied by many communities, in particular in random matrix theory. It led to the understanding of one of the most famous threshold phenomena, phase transition phenomena, in information processing: the well-known BBP transition, which stands for Baik, Ben Arous, Péché, understood in the early 2000s. So what is this BBP transition? This spectrum here represents the eigenvalues of the data matrix Y. You see a bulk of eigenvalues lying below the well-known semicircular law, the Wigner law, and an outlier here, which is associated with the rank-one matrix. The eigenvector associated with it is not exactly X*, but it is correlated with it. This outlier pops out from the bulk when lambda is big enough, and so you have a phase transition: if this lambda, which we call the signal-to-noise ratio, is not big enough, the outlier is hidden inside the bulk and you cannot perform inference.
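To make the BBP picture concrete, here is a minimal numerical sketch (my own illustration, not code from the paper): it builds a spiked Wigner matrix with a Rademacher spike under a common normalization, which is an assumption on my part, in which the bulk edge sits near 2 and the transition at lambda = 1.

```python
# Minimal illustration of the BBP transition in the spiked Wigner model
#   Y = (lambda/n) x* x*^T + Z,   with Z a GOE (Wigner) matrix.
# Normalization chosen so the bulk edge is ~2 and the transition is at lambda=1.
import numpy as np

def spiked_wigner(n, snr, rng):
    x = rng.choice([-1.0, 1.0], size=n)        # Rademacher spike
    G = rng.normal(size=(n, n)) / np.sqrt(n)
    Z = (G + G.T) / np.sqrt(2.0)               # symmetrize: GOE scaling
    return (snr / n) * np.outer(x, x) + Z, x

rng = np.random.default_rng(0)
n = 2000
for snr in (0.5, 1.5):                          # below / above the transition
    Y, x = spiked_wigner(n, snr, rng)
    vals, vecs = np.linalg.eigh(Y)
    overlap = abs(vecs[:, -1] @ x) / np.sqrt(n)
    print(f"snr={snr}: top eigenvalue={vals[-1]:.2f}, overlap={overlap:.2f}")
# Below the transition the top eigenvalue sticks to the bulk edge (~2) and the
# overlap vanishes; above it, an outlier appears near snr + 1/snr and the
# top eigenvector correlates with the hidden spike.
```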
And I call that the ideal world because in this world the noisy part is assumed to be very simple, in the sense that it is essentially Gaussian: it has independent entries. Now, if you look at this problem from the perspective of Bayesian inference and information theory, the kind of picture people obtained over the years is the following phase diagram. Here I plot the so-called mean square error as a function of this signal-to-noise ratio, which controls the hardness of the problem. The generic phase diagram that was obtained features a so-called impossible regime, a hard regime, and an easy inference regime. So let me tell you what they mean.

In the impossible regime, which is below the so-called information-theoretic phase transition, the MSE of any algorithm is very poor; in particular the black curve, corresponding to the minimum mean square error, which is the best error you could reach with the best algorithm in the universe, independently of any computational concern. Usually this estimator requires an exponentially large amount of time and computation, but let's say we do not care about computational questions: this is what you would reach anyway, and the performance is still very poor. So this is a phase where there is simply not enough information in the data to say anything; it is a statistically impossible regime.

In the easy regime, the MSE is instead very low, and it actually overlaps with the performance you would reach with an efficient algorithm that we have heard about already, approximate message passing, which is this red curve. We call it easy because there exists a polynomial-time algorithm matching the performance of this usually unreachable minimum mean square error, which is really the Bayes-optimal performance. If instead you were to use the spectral PCA algorithm, which just takes the data, diagonalizes it, and looks at the principal eigenvector, the one associated with the top eigenvalue, this is the performance you would reach. It would still be non-trivial, and you see the transition is at the same point, the so-called algorithmic transition, but it would not be as good as AMP, and actually very far from this MMSE performance.

In between you have the presence of this hard phase, or what we call a computational gap, where we do not know any polynomial-time algorithm able to say anything meaningful: AMP and spectral PCA lead to that performance, while the MMSE performance is down there. This picture has been obtained through the hard work of many people; the list here could fill a whole slide, I just put some of the names. Let me be precise: this diagram was obtained for the case where the law of the signal you want to infer is a sparse vector, so essentially a fraction of the entries are zero and the rest are plus or minus one taken randomly. And let me emphasize that in this talk the law that I will consider for the signal part, this X*, will always be simple; what will be non-trivial is the law of the noise. In this way I decouple the effect of the correlations induced by the noise from the ones induced by the signal.
You have to keep in mind that until now most of the research has really focused on understanding the effect of structure in the signal part, taking more and more complicated distributions for that part while keeping very simple assumptions for the noisy part. Here we do the opposite: we want to understand the effect of structure in the noise.

So let me now go to the non-ideal world. The non-ideal world looks like this. Here is a typical picture you could extract from a random paper in neuroscience, actually probably a very interesting paper, where people look at, for example, time series of neurons spiking in time, and this is what the data matrix looks like. One of the first things people do, as you can see from this plot, is to take these matrices, multiply them with their transpose, and do spectral principal component analysis, looking at the principal directions in the data. But the striking thing when you see this data is that, whatever it contains, it is far from being just a simple rank-one matrix plus independent Gaussian noise. It is highly structured: whatever you would call the signal in this matrix and whatever you would call the noise, the noise will have a highly non-trivial structure. In this work we want to capture the effect of this structure. Here is, for example, the kind of spectrum you could get for the covariance of such structured matrices. You see that it is not described by any standard distribution in random matrix theory; it is highly non-trivial. It is hard to say what the signal part is, what the noise part is, whether these are outliers or part of the noisy bulk: you don't know. And the reason for the complexity of this law is that the noise, whatever it means, is a highly structured object, meaning it has statistical dependencies: you cannot consider the entries of the noise in this matrix to be independent of each other.

Let me motivate a bit further why considering structured noise is something we think is important. The point is the following fact: what you would call signal and what you would call noise in any task actually strongly depends on the task you want to solve. For example, if I show you these pictures, I think maybe 90% of the crowd will naturally have done a binary classification of these images into dogs and cats, because this is the standard task we are used to in supervised machine learning. But if instead I gave you this data set and told you that you want to learn something about the notion of inside versus outside, then the features corresponding to what is a dog and what is a cat would actually play the role of noise for this specific task. And of course, the features of dogs and cats are not Gaussian features; they are highly structured things. If my task is to distinguish outside versus inside, then this being a cat and this being a dog would be noise for my task, yet it is highly structured, and you want to capture this.

All right, of course studying dogs and cats is highly non-trivial, and we are theoreticians, so we want tractable models. So what we have done is to come back to our esteemed spiked model, which has exactly the same form as before. Again, the entries of the signal part X* will be taken i.i.d. according to a very simple law, which is known. The non-trivial assumptions will instead be made on the noise. So what are we going to take?
Z will be decomposed as O^T D O, where O is a Haar matrix, which appeared in the previous talk as well, meaning a uniformly sampled matrix in the group of orthogonal matrices of size n by n. The structure in this noise will instead be encoded by the distribution of eigenvalues, this diagonal matrix D of eigenvalues. The only case where this matrix Z would have independent entries, up to the symmetry constraint, is the case where the eigenvalues are drawn from the semicircle law, the Wigner distribution that I showed before. In any other case, if you plug any other type of eigenvalues in here, even if the eigenvectors are generic, uniformly distributed, this matrix will have highly non-trivial statistical dependencies among its elements.

Another way to represent this kind of so-called rotationally invariant matrices is through trace ensembles. So Z is drawn according to this measure, and here you should think of V(Z) as acting only on the eigenvalues: this so-called potential function V is really a function acting individually on each eigenvalue, while the eigenvectors are untouched. And because you have this trace appearing, if you conjugate Z by any rotation, the rotation simplifies, and therefore this law is indeed invariant under rotations. From the perspective of random matrix theory, this type of ensemble has been studied a lot; in particular, Benaych-Georges and Nadakuditi have understood very well the generalization of this so-called BBP transition for this kind of noise. But here we want to study the problem from the perspective of Bayesian inference.

Let me just give an example. If you take this potential to be a quartic function and you write down the law in terms of the matrix elements, here is what you obtain, and you see that this PDF cannot be factorized over the matrix elements. If instead you were taking V(z) = z^2, what you would obtain here would just be a product over ij of exp(-z_ij^2), and therefore you would have full factorization over the matrix entries. In the quartic case we have statistical dependencies among the matrix elements.

Let's take a concrete example. The potential I am going to consider in this talk will take the following form: an interpolation between a quadratic part and a quartic part, where the coefficients mu and gamma depend on each other in such a way that when mu goes to zero, gamma increases. Gamma is the only free parameter in this ensemble, and this will allow me to interpolate between a case with a purely quadratic potential, meaning we go back to purely unstructured noise, the standard Gaussian theory, and a case where gamma increases and mu decreases, where the noise is structured and you have the kind of four-point correlations in the law of the matrix elements that I showed before, these kinds of dependencies. I will call rho the asymptotic limit, as n goes to infinity, of the spectral density of this matrix; these d_i are the eigenvalues. Here is how it looks when mu equals 0, 0.5, and 1; when mu is zero, gamma is of course non-zero, and when mu is 0.5 we have a bit of both terms, and this is the shape you get. Let me try to give you some insight into why you get this kind of weird shape. You see, when mu equals 1, we have only the quadratic part and gamma is zero, so we have the Gaussian noise case, the pure semicircle law. (A minimal sampling sketch of this kind of ensemble follows below.)
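As a rough illustration of this construction (my own sketch, not the paper's code), one can sample such a rotationally invariant matrix by taking a Haar orthogonal matrix, obtained from the QR decomposition of a Gaussian matrix, and sandwiching eigenvalues drawn from a target spectral density. Drawing the eigenvalues i.i.d. from the limiting density, here the semicircle as a stand-in for the mu-gamma quartic density, ignores the eigenvalue repulsion but still reproduces the limiting spectrum.

```python
# Sketch: rotationally invariant noise Z = O^T D O with O Haar on O(n).
# The semicircle density below (the mu = 1 case) is a placeholder for the
# quartic-ensemble density; swap in any density to get structured noise.
import numpy as np

def haar_orthogonal(n, rng):
    # QR of a Gaussian matrix, with the signs of diag(R) fixed, is Haar on O(n)
    Q, R = np.linalg.qr(rng.normal(size=(n, n)))
    return Q * np.sign(np.diag(R))

def sample_eigenvalues(n, rho, support, rho_max, rng):
    a, b = support
    out = []
    while len(out) < n:                      # plain rejection sampling from rho
        x = rng.uniform(a, b)
        if rng.uniform(0.0, rho_max) < rho(x):
            out.append(x)
    return np.array(out)

rng = np.random.default_rng(0)
n = 500
semicircle = lambda x: np.sqrt(max(4.0 - x * x, 0.0)) / (2.0 * np.pi)
d = sample_eigenvalues(n, semicircle, (-2.0, 2.0), 1.0 / np.pi, rng)
O = haar_orthogonal(n, rng)
Z = O.T @ np.diag(d) @ O   # dependent entries unless rho is the semicircle
```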
And when mu decreases and this quartic part increases, you see this kind of double-well shape appearing, these two bumps. The reason is the following. If you write this law in terms of the eigenvalues only, because, again, the law of the eigenvectors is just the uniform law, and you do the change of variables from matrix elements to eigenvalues and eigenvectors, here is the PDF you get for the eigenvalues. A Jacobian appears in this change of variables, and in random matrix theory it is called the Vandermonde determinant. If you think of this as a kind of physical system, you can exponentiate this term, and in the exponent, which you can think of as the Hamiltonian of a kind of particle system, you get this term, a potential energy term: these particles, the eigenvalues, feel an external potential. And you have a pairwise interaction between the particles, which makes them repel each other, because this product of differences induces repulsion between the eigenvalues. So you can think of a gas of charged particles with the same charge, all repelling each other. What happens if you put such a fluid of mutually repelling particles in a strongly confining potential? At some point, if you confine them too much, which is what happens when you increase this term, because the potential goes from quadratic to quartic, the particles will climb the potential walls, and this is how you get this weird shape, okay? Perfect.

Let me now introduce the Bayesian framework I am going to consider. Here is the so-called Bayes posterior distribution of my hidden spike X given the data Y. It is a normalization, which you can think of as a partition function, times the prior distribution of X, which again is simple, fully factorized over the entries of X, times my likelihood, which you recognize with this potential here; essentially this is the law of the noise, okay? The objects I am interested in computing are, first, the entropy of the data distribution, which you would call the free energy in physics: minus the expected log partition function, where P(Y), the distribution of the data unconditionally, is the partition function. In information theory, of course, this is the standard Shannon entropy, which is identified with the free energy in the context of Bayesian inference. The second quantity is the order parameter of the problem, the minimum mean square error: it represents the scaled asymptotic limit of the squared deviation between the ground-truth spike that I am trying to infer and the best estimator in the universe, which is the posterior mean. This notation here, the expectation conditional on Y, really means the average with respect to this Bayes posterior, and you can show in one line that this is the best estimator in the universe. If you could compute this expectation, which is computationally demanding because it requires computing the normalization (as usual in statistical mechanics, that is the difficult part), you would have access to this estimator and get the best performance. But of course it is usually hard. This is also related to the so-called Bayes risk, or the overlap in physics, okay?
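Schematically, and with the caveat that the precise prefactors and scalings below are my guesses at a standard convention rather than the paper's exact definitions, the objects just introduced read:

```latex
% Bayes posterior: factorized prior times rotationally invariant likelihood
dP(x \mid Y) = \frac{1}{\mathcal{Z}(Y)} \prod_{i=1}^n dP_X(x_i)\,
  \exp\!\Big(-\frac{n}{2}\,\mathrm{Tr}\, V\big(Y - \tfrac{\lambda}{n}\, x x^\intercal\big)\Big)

% free energy = Shannon entropy of the data, up to an additive constant
f_n = -\frac{1}{n}\,\mathbb{E}_Y \ln \mathcal{Z}(Y)

% matrix MMSE, with the posterior mean as the optimal estimator
\mathrm{MMSE} = \lim_{n\to\infty} \frac{1}{n^2}\,
  \mathbb{E}\,\big\| X^* (X^*)^\intercal - \mathbb{E}[\, x x^\intercal \mid Y \,] \big\|_F^2
```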
So, free energy and overlap. Let me give a few references; this is a very biased list, but these kinds of models have been studied a lot in the literature, in particular when the noise is completely unstructured, coming from the Gaussian orthogonal ensemble, meaning just Gaussian noise: there is a long list of people who have worked on that. In that case the spectral density is just the semicircular law, so there is no structure. In the realm of random matrices, this model has been studied, as I mentioned, by Benaych-Georges and Nadakuditi, who understood this generalized BBP transition. Again, in random matrix theory, the kind of estimators people are interested in are related to the first eigenvector, the principal eigenvector of the data, so this is really the spectral PCA estimator, and this rescaling coefficient here can be computed. On the Bayesian inference side, there exists in the literature a very nice approximate message passing algorithm introduced by Manfred Opper, who is in the crowd, and Winther, and later extended by Zhou Fan in a slightly different form. Essentially, this AMP algorithm is what we call TAP equations in physics; they have been adapted to the structured PCA problem, and for this specific inference problem it is really Zhou Fan who derived it, building on that work. This AMP algorithm is, in a sense, aware of the noise structure, in the sense that it takes as input the spectral density of the noise; it is aware of the prior distribution used to generate the ground truth, and of the data, of course. But as I will show you, it is suboptimal. One of our contributions is to design a conjecturally optimal AMP for this problem, and I will give you some insight into why that AMP is suboptimal.
Nice works are those of Francesco and collaborators from Bologna, and of the group of Alice, Justin, who is in the crowd, Flo and Lenka, who studied this PCA problem with some structure, in the sense that each data point Y_ij is potentially drawn from a different law indexed by ij, conditionally on the signal of course. But there you still have independence: the law can change from point to point, yet conditionally on X* each data point is independent, and this is really a crucial assumption. What is very nice in that setting is that through this independence assumption you get a strong universality property, which implies in particular that, from an information-theoretic perspective, at the level of entropies, free energies, and order parameters like the minimum mean square error that I discussed before, the model is indistinguishable in the asymptotic limit from a Gaussian model with additive Gaussian noise, where the noise now has a kind of variance profile. So you have a strong universality taking place there, which is very nice. In the model I am discussing right now, you do not have this universality, because we lose the independence assumption among the entries of the noise.

Okay, so what are the results that we obtained? The first result is a replica-symmetric formula for the Shannon entropy of the data, which, again, as physicists you can think of as the free energy in the thermodynamic limit of large size. It comes in the form of a variational problem over an explicit replica-symmetric potential. Let me show you how explicit it is; it is a matter of taste whether you call that explicit or not. As a physicist you would maybe say yes; to someone not used to this kind of replica formula it looks a bit mysterious, and it is, actually. But you can plug it into a computer, solve the extremization problem, and actually get plots. This variational formula, in the particular case of this quadratic-plus-quartic potential used to generate the noise, depends on 13 order parameters; but actually, if you massage the formula and understand some hidden symmetries in it, you can reduce it to only 2 or 3 order parameters. Another consequence of this replica analysis is a formula for the minimum mean square error that looks much simpler, but don't get fooled: this M is actually complicated, because it is the M that appears here, the solution of the variational problem, which you must plug into this very nice-looking formula to actually get the value of the MMSE.

A nice insight that we got from this work is the optimality of spectral principal component analysis for rotationally invariant priors. What do I mean by this? It means that if the signal part in your data has no real structure, in the sense that it was drawn uniformly on the sphere or as a Gaussian vector, then we show (and let me emphasize that all of this is non-rigorous, it is replica-based) that you cannot do better than taking the principal eigenvector of the data as an estimator. You can try to do Bayesian inference; you will not gain any additional performance. So if you have no prior information about the distribution of the signal, just diagonalize the data and take the principal eigenvector: this is the best you can do, even if the noise is highly structured. That is the point. So much for the information-theoretic results.
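For orientation only: in spiked models of this kind, MMSE formulas typically collapse to a simple function of the optimal overlap. The generic shape, my paraphrase of the "very nice-looking formula" rather than the paper's exact expression, is

```latex
\mathrm{MMSE} = \rho^2 - (m^*)^2, \qquad \rho := \mathbb{E}_{P_X}[X^2],
```

where m* is the extremizer of the replica variational problem; the hard part is hidden in computing m* from the 13-order-parameter extremization.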
Now, on the algorithmic side, what we show is that what I would call the natural approximate message passing for structured PCA, the one developed by Zhou Fan, is suboptimal. Why do I call it natural? Simply because this is the kind of recursion that usually appears in the literature. Here is how an AMP typically looks: it is an iterative algorithm with two steps. You have u, which represents the estimator of your unknown signal; it is multiplied by your matrix, and then you remove a so-called Onsager term, whose role is to help the convergence of the algorithm, in the sense that it makes the statistics of the second iterate f asymptotically Gaussian. This implies, first, that you can get an asymptotic analysis of this algorithm, and it helps convergence. These big quantities, the Onsager terms, are aware of the spectral distribution of the noise. So this is a kind of power iteration; then you take this f and pass it through a non-linearity that we call the denoiser, which is aware of the prior distribution of the signal, and you repeat.

What we show is that this algorithm does not yield the conjecturally Bayes-optimal performance obtained through the replica method, this MMSE formula that we obtained before, which is supposedly the best performance you can reach in the universe: this algorithm does not saturate it, and I will tell you why. What you should do instead is something very similar, in the sense that it is still an AMP algorithm, but instead of multiplying your iterate u by the actual data, you first apply a so-called pre-processing function J, which is a polynomial function of the data, then remove the Onsager terms, which are different, and then apply your denoising function (think of h and g as essentially the same function). So you see, the only difference is that we first apply this function J to the data, and this function is non-trivial: in this case it is a polynomial of order 3, whose coefficients depend on the signal-to-noise ratio lambda of the problem and on the coefficients mu and gamma, the parameters defining the potential from which the eigenvalues are drawn. So this is the part which is aware of the noise structure. We are in the so-called Bayes-optimal setting: we know all the parameters; the only thing we don't know is the signal we want to extract. So I know these parameters, I can compute this function, plug it in there, and iterate this AMP, and I claim that this is the correct approach. It is not the only one, but it is a correct approach. (A rough sketch of the iteration follows below.)
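Here is a rough sketch of the preprocessed AMP just described. This is my own illustrative shape, not the paper's exact algorithm: in particular, the Onsager correction of rotationally invariant AMPs generally involves all past iterates, with coefficients tied to the noise spectral density; a single-memory term is shown here only for brevity.

```python
# Caricature of AMP with spectral preprocessing J(Y); names are illustrative.
import numpy as np

def amp_preprocessed(Y, j_coeffs, denoiser, onsager, u0, T=50):
    c1, c2, c3 = j_coeffs                      # depend on lambda, mu, gamma
    Y2 = Y @ Y
    JY = c1 * Y + c2 * Y2 + c3 * (Y2 @ Y)      # J(Y): degree-3 matrix polynomial
    u_prev = np.zeros_like(u0)
    u = u0
    for t in range(T):
        f = JY @ u - onsager(t) * u_prev       # asymptotically Gaussian iterate
        u_prev, u = u, denoiser(f)             # entrywise, prior-aware denoiser
    return u

# e.g. for a Rademacher prior the Bayes denoiser is a tanh(.) with a suitable
# signal-to-noise rescaling (omitted here).
```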
Let me just remark on why you get this weird polynomial of order 3. These powers here are really in the matrix sense: this is Y times Y in the matrix sense, not entrywise; equivalently, you can think of the polynomial as applied to the eigenvalues. Why is it a polynomial of order 3? Because the original potential in the problem I am considering at the moment is a polynomial of order 4. For a potential that is a polynomial of order K, the function you would need to apply here would be of order K-1, and it would be increasingly complex as the structure in the noise becomes more and more complicated.

A third result that we obtain, and this one is rigorous in the paper, is the so-called state evolution analysis for this algorithm. It essentially states that in the proper asymptotic limit, the statistics of these two kinds of iterates, this u and this f, which are high-dimensional vectors, reduce to Gaussian statistics of scalar variables that can be tracked very easily: essentially the only things you need to track are the means and covariances of these Gaussians. This kind of effective low-dimensional Gaussian recursion is able to track in time all the statistics you want to extract from this algorithm. And the fact that you can obtain this kind of theorem, this state evolution analysis, is really due to the fact that you remove this non-trivial Onsager correction, which in a sense Gaussianizes these quantities. Is there any question?
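To give the flavor of what a state evolution looks like, here is the classical scalar recursion for the Gaussian-noise special case (mu = 1) with a Rademacher prior. This is a standard textbook recursion, not the paper's structured-noise state evolution, which jointly tracks the means and covariances of several iterates.

```python
# Scalar state evolution for the spiked Wigner model with Rademacher prior:
#   q_{t+1} = E_Z[ tanh( snr^2 q_t + snr sqrt(q_t) Z ) ],  Z ~ N(0,1).
# The fixed point q* gives the asymptotic overlap of Bayes-optimal AMP.
import numpy as np

def state_evolution(snr, T=100, n_mc=200_000, seed=0):
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=n_mc)         # Monte Carlo estimate of the Gaussian average
    q = 1e-6                          # small positive initialization
    for _ in range(T):
        q = np.mean(np.tanh(snr**2 * q + snr * np.sqrt(q) * Z))
    return q                          # overlap; for this prior, MSE = 1 - q

for snr in (0.8, 1.2, 2.0):           # q* ~ 0 below the transition at snr = 1
    print(snr, state_evolution(snr))
```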
Okay, so let's check that I'm not lying to you. Here is the first plot. I am considering very simple entries for the signal: they are generated as Rademacher variables, uniformly sampled as -1s and +1s. In this case mu equals 1, which means the pure semicircle law for the eigenvalues, meaning pure Gaussian noise. You see that the replica prediction is perfectly aligned with the state evolution prediction for our algorithm, as well as with the prediction for the previously introduced approximate message passing; here is a slightly different version, whose reference is there, so these are two slightly different AMPs by Zhou Fan for this problem. Everything collapses in the case where the noise is Gaussian, as it should. This is the MSE as a function of the signal-to-noise ratio.

Let me now turn on the structure in the noise. Look at this figure, where the eigenvalues now follow this blue distribution: you see a gap appearing between the prediction of the replica theory, in black, and the state evolution, as well as actual instances of the algorithm, this Bayes-optimal AMP that we introduced, which are here, while the previously introduced AMP is a bit on top, here. So it is suboptimal, and provably so, because we have this state evolution analysis, which is rigorous and which sits below this blue curve, which is also a theorem. What is a conjecture here is that the replica prediction in black really captures the behavior of the minimum mean square error estimator. So you have this gap appearing, meaning that in a sense you really need to exploit this noise structure, in the way we introduced, to fully capture it and get a boost in performance.

Let me now discuss some aspects of universality, because this is one of the key concepts of this conference. What we wanted to see is whether the assumptions we make in this work are robust. What is the main assumption? That the noise is rotationally invariant, meaning that the eigenvectors are uniformly sampled in the set of orthogonal matrices; they are generic. So what we did is to take real data, say the set of images from CIFAR-10 that represent cats or planes, build their covariance matrices, diagonalize these covariance matrices, and take the eigenvectors of these matrices. Instead of generating the eigenvectors totally randomly, we extract them from the images of cats from CIFAR-10, and in between we sandwich the eigenvectors with eigenvalues drawn according to the distribution we are analyzing, this kind of shape. So we do the same construction, but now these are the eigenvectors of the covariance of the cat images in CIFAR-10, while the eigenvalues are still drawn according to the ensemble we are analyzing, okay? We run the algorithm and see what we get, and, surprisingly, we get a perfect agreement with the replica prediction and the state evolution analysis of this AMP, which is really derived for the case where O is Haar distributed. We do the same with another data set, coming from gene expressions in biology, and we see again a very good agreement, and there is also agreement in the dynamics, in the time evolution of the iterates of the algorithm.

Let me now push the analysis a bit further, to even more structure. The potential I am now going to consider is no longer quartic but sextic. This induces even longer-range correlations between the matrix elements of the noise, so the dependencies are stronger, and what you see now is that the gap between the previously introduced AMP and ours increases even more: the MMSE difference is now on a linear scale, while the previous plots were on a logarithmic scale. So I think this plot, where we also did the same for real instances with the CIFAR-10 images compared to the theory, really vindicates the analysis, and I like this plot very much. All right, am I on time? At what time did I start? I have 5 minutes.

Let me give you just some insight into what is happening through this, what we call optimal preprocessing of the data. Like I said, if the potential underlying the random matrix ensemble behind the noise is indeed quadratic plus quartic, this is the optimal function that you should apply to your data before processing it through AMP. So let's look at what happens to your data when you do so. Here is the spectral distribution, the empirical one in blue and the asymptotic one in orange, for the data before any preprocessing; this is just the spectral distribution of Y, and you see this kind of double-well shape that I showed you, with the outlier corresponding to the spike. Now, here is the function J, which is applied individually to each eigenvalue, and what you see is that after the application of this function, all the eigenvalues that were part of the noise bulk have been pushed onto the negative axis, while the outlier, which was not so far from the boundary of the noise, has been pushed even further to the right. So this function has a kind of cleaning effect on the spectrum. This is when the signal-to-noise ratio is not too high. Let's see what happens when the signal-to-noise ratio is higher, so you are more confident in your data, the data is of better quality. Now the function looks like this, and you see that the cleaning effect is even more dramatic: again the noisy eigenvalues have been pushed onto the negative axis, so you can read off directly what is noise and what is not, while the outlier, which was around 5 before the preprocessing, has been pushed to around 20. (A small sketch of this spectral cleaning is given below.)
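Here is an illustrative sketch of the cleaning effect. The cubic coefficients are made up: in the paper, J's coefficients are fixed by lambda, mu, and gamma. The key structural point, visible in the code, is that J acts on the eigenvalues only, so the eigenvectors, and hence the spectral-PCA estimator, are unchanged.

```python
# Apply a polynomial J to the data in the matrix (spectral) sense.
import numpy as np

def apply_spectral_polynomial(Y, coeffs_low_to_high):
    vals, vecs = np.linalg.eigh(Y)
    jvals = np.polyval(coeffs_low_to_high[::-1], vals)  # polyval wants high->low
    JY = (vecs * jvals) @ vecs.T                        # same eigenvectors, new spectrum
    return JY, vals, jvals

# With a suitable cubic, eigenvalues in the noise bulk map to negative values,
# while an outlier to the right of the "decision boundary" is pushed further
# right: signal and noise can then be read off directly from the sign.
```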
What is nice to notice is the decision boundary of this function, between what it considers as noisy eigenvalues, which will be sent to the negative axis (the transition is there: all the eigenvalues to the left of it will be pushed to the negative axis), and those to the right of this point, which is around 5, which will be pushed to the right. And indeed you see that the outlier here is precisely around 5: this function is aware of what really should be noise and what should be signal. What is also nice is to think of this problem not for large instances but for smaller ones, say n equal to, I don't know, 20 or 50. In this case the noisy eigenvalues will not be as cleanly supported below this asymptotic curve; some of them might be out there due to finite-size effects. Yet this function will still be able to optimally clean and send those eigenvalues to the negative axis, so you have this cleaning effect even for rare deviations. But keep in mind that this algorithm we introduced, this AMP, is not a spectral algorithm, because of course if you apply this function to the data, you are not changing anything about the eigenvectors; it only acts on the eigenvalues. If you were to look at the top eigenvector of this preprocessed matrix, you would get exactly the same estimator as the top eigenvector of the original data before any preprocessing. So you would not improve the spectral algorithm by applying this function; instead, the Bayesian algorithm, this AMP algorithm which is really Bayesian in nature, does feel the effect of this function acting on the eigenvalues only. So something highly non-trivial is happening there which, to be honest, we do not fully understand, and we do not yet have a recipe to guess the correct form of this preprocessing without doing the whole replica computation, which is rather painful. It would be very nice to understand what is going on there and what the generic phenomenon is. It is not yet the case, but in principle all this machinery that we developed here could be extended to arbitrarily complex noise structures, and there must be a more generic mechanism behind that phenomenon which we do not fully capture yet. But I think there are nice insights from this work.

The last thing I want to say, two minutes: let me give you one insight into why this previously introduced AMP algorithm is suboptimal. What we did is the following. Here is this AMP where the data Y is processed directly, instead of J(Y), and what we did is a replica analysis again, but for another Bayesian framework: we did the replica analysis for a case where the statistician makes the wrong assumption that the noise is Gaussian, while it is not. The noise is not Gaussian, it is structured, but the statistician is not exploiting the structure and is therefore wrongly considering a Gaussian likelihood for the noise: you see here, this is indeed a Gaussian PDF, you have a square. The prior is still the correct one; the statistician is aware of the distribution of the ground truth X, and is therefore sampling this mismatched posterior distribution. And in this case, the replica analysis falls exactly on top of the state evolution predicting the performance of this algorithm. So what does it tell us? It tells us that this algorithm is implicitly making the wrong assumption that the noise is Gaussian, while it is not. Yet the algorithm is still aware of the noise structure in some way: it exploits it through the dependencies of these Onsager reaction terms.
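Schematically, the mismatched posterior just described keeps the correct prior but replaces the true potential V by a quadratic one; as before, the normalization and scaling here are my assumptions, not the paper's exact conventions:

```latex
dP_{\mathrm{mis}}(x \mid Y) \;\propto\; \prod_{i=1}^n dP_X(x_i)\,
  \exp\!\Big(-\frac{n}{4}\,\mathrm{Tr}\big(Y - \tfrac{\lambda}{n}\, x x^\intercal\big)^2\Big)
```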
But it exploits the noise structure only in a way that makes the algorithm convergent: despite being suboptimal, the algorithm converges to the fixed point given by the replica prediction, which is there. So this AMP exploits the structure for algorithmic purposes, for convergence issues, but not in a way that improves the statistical performance.

So, my last slide. Let me give you a recipe to, in theory, better process your data through a Bayesian approach in the context of this structured spiked model. Do you have any prior information about the low-rank information hidden in the structured noise? If the answer is no, then our analysis tells you not to bother: just use spectral PCA, because you cannot beat it. If you do not know anything about the signal, the best thing you can do is spectral PCA. Now, if you have some information about the signal, and it is not just a Gaussian signal, then you should exploit it in some way. Okay, now you wonder: do you have any knowledge about the statistical dependencies in the noise? If the answer is no, then you can use the standard AMP there, which exploits the signal structure through this denoising function, which is aware of this prior distribution, and it will be convergent. So it is still a good Bayesian algorithm, but it does not fully exploit the noise structure. If instead you do know something about the noise structure as well, then we recommend using the AMP that exploits both the noise and the signal structure jointly, in order to improve the performance. All right, this is everything I wanted to say, thank you very much.

Okay, so thank you very much, Jean, for the great talk. Are there any questions for Jean?

Thank you for the talk. Do you have any idea what the optimal transformation J looks like if your potential is not a polynomial but some generic regular function? Because if it is a polynomial, you say J is a polynomial of one degree less; but what happens if it is not?

Yes, so we have an argument in the paper where we essentially show that if you have a sufficiently smooth potential, one that admits a Taylor series, you can approximate it by a polynomial, and the error in our predictions will vanish with the level of approximation that you make in the theory. So essentially you can approximate any such function by a polynomial, and the theory should be able to describe what happens. Now, what is the generic form of the pre-processing you would get for an arbitrarily complicated potential? We do not know; we have no idea. That is one of the missing things, a generic recipe for how to process the data. The problem is that the computations are very technical: we have been able to carry out the replica computation for orders four and six, and we do not yet see the generic mechanism, but of course there is one. In the limit, the replica formulas we get should depend on infinitely many order parameters, essentially a functional order parameter, and you should get kind of Parisi-like formulas; but this generic structure is still lacking, and it would be very interesting to understand.

Thanks, Jean, this is super nice work. I guess one comment or question I have: one of the reasons that I ended up analyzing this suboptimal AMP algorithm in that paper was that, for any algorithm more complicated than this, I could not figure out how to get any explicit characterization of the fixed point of the state evolution, right?
The state evolution becomes quite complicated as you consider more complex algorithmic procedures, and I was wondering, for your pre-processed AMP algorithm here, whether there is something that simplifies in the structure of the algorithm that lets you characterize what that state evolution fixed point is, or gain some more analytical understanding of it?

Yes, we thought about that; I see Marco agreeing. We tried hard, actually, but we do not have as nice a characterization as you have in your paper for the fixed points, because the state evolution that we get is simply too hard. At the moment we do not see a simple closed equation for the fixed point like the one you have, which I think is very nice. So we are not able at the moment to characterize the fixed point directly; the only results we get are for the dynamics.

Yeah, hi Jean, thanks so much for the great talk. One thing I was wondering about: in a closely related model, I guess in the Gaussian-noise setting, there is this spiked covariance model, right, where you could say that every row is independent but the covariance matrix is low-rank plus identity or something, and in the Gaussian-noise case the two models behave roughly the same. So I am wondering, here, do you have in mind any analogue of the spiked covariance setting? Because here every row in your matrix is still dependent, right? If I am hoping to come up with a model where every observation is independent, is there a way to think about that?

Yeah, no, that is a good point. The short answer is no: I do not yet see a natural setting, let's say a natural extension to the non-symmetric case. I guess there should be one, but we did not investigate this aspect too much at the moment. Yeah. Are there other questions?
Okay, so maybe I can ask a curiosity, Jean. You were mentioning in one of the previous slides that the noise is played by a sort of image, in that slide where you have the inside-versus-outside classification and then cats and dogs. I was wondering, also attaching to the previous question, whether you have considered noise structured with a structure similar to a manifold, like in that case. Do you think it would be doable, or what would be the limitations?

I do not know, honestly, because really the key thing we use in the analysis is this rotational invariance: even if you lose Gaussianity, you can still carry out the computation thanks to a very nice machinery developed in particular by Yoshiyuki, which we exploit strongly. Without this, I do not know. And maybe I did not emphasize this, so thanks for giving me the opportunity: let me just stress one point, why this model is, in a sense, not so trivial to analyze compared to other types of models. Here is the Bayesian posterior distribution; let me compare it to another class of models where rotationally invariant matrices have been studied a lot, as we have seen in the previous talk, namely regression problems, where the structure is inside the design matrix. If you study that model, you see that when you write down the posterior distribution, because the noise is Gaussian, the likelihood is still Gaussian, and so you can still carry out the analysis despite the fact that the design is structured: the Gaussian form, a quadratic form, allows you to do the computations in an easier way. Here, the main conceptual difference with respect to that line of work is really that the structure is in the noise itself, and therefore, when considered in a Bayesian setting, the likelihood is different: the potential appears at the level of the likelihood, and this makes your life harder, because now you do not have a nice square, and it becomes more painful. That is the main difference. Still, we can exploit tools developed by Yoshiyuki, in slightly different ways; without those, I do not know, but I think it would be very interesting to take a look. Thanks.

Thank you very much. If there are no further questions, I think we can...