Yep. So, for this last course, I want to show you the formula for the spiked Wigner model. The model is written here, and what I'd like to compute is this minimum mean square error, and compare it to, for example, what I'm calling naive PCA here. So, as should be clear by now, this minimum is achieved by the posterior mean of the product x_i x_j given the observation. So here I wrote the posterior distribution given Y. And this is, nothing fancy here, the corresponding Hamiltonian, which has exactly the same structure as before, but now with a product x_i x_j. So this is the posterior distribution, written just to define the normalizing constant Z, which is a function of λ. And here is the free energy. Okay, now I guess the strategy is relatively clear: it's like the toy model I showed you. You need to compute a limit for this free energy, and then everything will follow from that explicit computation. I will not give any insight into how you prove this; that is actually the part that is, I think, well written in the paper. But I will give the solution. So I should probably erase this — but keep this formula in mind, in order to compare with the minimum mean square error. So I define a function of two parameters, λ and q, where q will be the overlap. The first function, ψ_{P₀}, okay, is a little bit complicated. It is still an expectation. What is random here are the Z and the X₀, but they are scalars now: Z is, as you can guess, a standard Gaussian random variable, X₀ is a sample drawn according to the prior P₀, and you are integrating with respect to the small x. Okay, so can you interpret this formula? It is an expectation of a log, and an expectation of a log should look like a free energy. Indeed, this is the free energy of a very simple model: the scalar channel with additive Gaussian noise, where you have the prior P₀ on x. So ψ_{P₀} is the free energy corresponding to this scalar model. And now here is the theorem: for all λ, F_n(λ) converges to the sup over q of F(λ, q). Okay, so this is it. You have the formula for the limit, so now you can run all the machinery I gave you. In particular, people doing inference don't really care about the free energy; they care more about information-theoretic results, like the mutual information. So you get directly that (1/n) times the mutual information tends to a limit — this is a direct application of what we saw before, because we related the free energy to the mutual information. And now you can apply the I-MMSE theorem: if you take the derivative of this, you obtain the MMSE, which is what we are aiming for. So let's do this. In order to do this — if you really want to be rigorous, to exchange derivative and limit — you need to check that your function is differentiable. So consider the set D of positive λ at which the function q ↦ F(λ, q) has a unique maximizer q*(λ), so that the supremum is achieved only at q*. You can show that D is the positive reals minus a countable set, so basically you are covering most of what you want: the limit is differentiable on D, and you have an explicit formula for the derivative. Yes, yes — that follows from a property we saw: the free energy is non-decreasing and convex in λ, so it cannot behave too badly. So now you apply the I-MMSE theorem and you get, for λ belonging to D, the limit of the MMSE. This is a direct consequence of what we saw before. Okay.
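For reference, here is my reconstruction of what is on the board, in the notation of the Lelarge–Miolane paper (the precise normalizations are an assumption on my part; the board may differ by constants). The scalar free energy is

$$\psi_{P_0}(\gamma) \;=\; \mathbb{E}_{Z,X_0}\,\log \int \exp\!\Big(\sqrt{\gamma}\,Z\,x + \gamma\,X_0\,x - \frac{\gamma x^2}{2}\Big)\,dP_0(x), \qquad Z \sim \mathcal{N}(0,1),\; X_0 \sim P_0,$$

the two-parameter function is $\mathcal{F}(\lambda, q) = \psi_{P_0}(\lambda q) - \lambda q^2/4$, and the theorem reads $F_n(\lambda) \to \sup_{q \ge 0} \mathcal{F}(\lambda, q)$. For a prior with unit second moment this gives $\frac{1}{n} I(X;Y) \to \frac{\lambda}{4} - \sup_{q \ge 0} \mathcal{F}(\lambda, q)$, and differentiating in λ (the I-MMSE relation) yields, for λ in D,

$$\mathrm{MMSE}_n(\lambda) \;=\; \frac{1}{n^2}\,\mathbb{E}\,\big\|X X^{\mathsf T} - \mathbb{E}[X X^{\mathsf T} \mid Y]\big\|_F^2 \;\longrightarrow\; 1 - q^*(\lambda)^2.$$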
So I think you now have a self-contained blackboard: everything is written — the model on the right, the variational solution, and the limit of the MMSE, at least. Yes? [Inaudible question from the audience.] What do you mean by... I mean, this is a function of two parameters, λ and q, if you want. λ is the signal-to-noise ratio: it's the λ of the model, fixed for a given model. And you need to optimize this function as a function of q. The q is the same as what we saw this morning, actually — this is the intuition, and this is why I'm calling it q. Do you remember the matrix with ones on the diagonal and q off the diagonal? It's exactly the same parameter. But you don't need to have this interpretation in mind: if the only thing you care about is the MMSE, you compute this function for a given λ and you maximize it over q. And, okay, in my first course, when I spoke about semi-supervised learning, I also had a variational formula — I did an explicit computation of this function F in another model. And to get the MMSE there, you were supposed to maximize this function over the parameters; in my hand-waving argument there were two q's, because I was dealing with a non-symmetric case, so there was a q for the U and a q for the V, satisfying a fixed-point equation. And the fixed-point equation is basically what you get by taking the derivative of this, to ensure that you attain the maximum. You can do the same here, if you prefer to define q* as the solution of a fixed-point equation. Any other question? So now I will only give interpretations of this result in particular cases. All the hard work, which I did not show you, lies in proving this limit of the free energy; I refer to the paper for that. Now I just want to rewrite the same result in different ways. First I want to compare it to the dummy estimator. If you remember, one minus this quantity was the mean square error of the dummy estimator — the one where you are not looking at the data at all. So you can reinterpret the result as follows: there is a critical value λ_c. When λ is smaller than λ_c, you have a lot of noise; this is the regime where you see no signal. Basically, you are not able to do anything better than giving as estimator the mean of your signal. And again, we are in the Bayes-optimal setting, so this is not the statistician's fault: there is simply no information in the data. And when λ is bigger than λ_c, you can do strictly better than this: looking at the data gives you a better mean square error. Now, if you want to make comparisons, you need to compute this. There is one particular case where everything is easy: when you take as prior a standard Gaussian random variable for x. In this case you have an explicit formula for the function ψ, which is this one, and you plug it into F and so on. You do the math and you find that q*(λ) = max(0, 1 − 1/λ), meaning that the minimum mean square error, as a function of λ, converges to 1 if λ ≤ 1 and to (1/λ)(2 − 1/λ) if λ > 1. And here what you recover is exactly the mean square error of PCA. The meaning of this is that in this particular case PCA is optimal, because it achieves the minimum mean square error — and remember, my result describes the best possible estimator. So now let's see if we can find an example where PCA is not optimal.
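As a sanity check on the Gaussian case just stated, here is a sketch of the computation, assuming the normalizations written above. For $P_0 = \mathcal{N}(0,1)$, a Gaussian integral gives

$$\psi_{P_0}(\gamma) = \frac{\gamma}{2} - \frac{1}{2}\log(1+\gamma), \qquad \text{so} \qquad \mathcal{F}(\lambda, q) = \frac{\lambda q}{2} - \frac{1}{2}\log(1+\lambda q) - \frac{\lambda q^2}{4}.$$

Setting $\partial_q \mathcal{F} = 0$ gives $q = 0$ or $\lambda q = \lambda - 1$, hence $q^*(\lambda) = \max(0,\, 1 - 1/\lambda)$ and

$$1 - q^*(\lambda)^2 = \begin{cases} 1 & \lambda \le 1, \\[2pt] \dfrac{1}{\lambda}\Big(2 - \dfrac{1}{\lambda}\Big) & \lambda > 1, \end{cases}$$

which is the limit quoted above.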
So I will consider the case where the prior is a mixture of two Dirac masses. This might look weird: one atom at a positive value and one at a negative value, both depending on a single parameter p, chosen so that I get zero mean and unit variance. So this is a distribution with zero mean and unit variance, supported on only two points. If p equals one half, this is just plus or minus one, each with probability one half. In this case you can do the math, and as a function of λ you get a picture like this, where I'm plotting the mean square error: at one you are not able to do anything, and there is a threshold here. This curve is the minimum mean square error. Okay, since I have colors: the PCA curve is here. So this is a setting where PCA starts to detect signal as soon as it is possible, at λ = 1, but still the performance of PCA is suboptimal in terms of mean square error. Now, if you take p = 0.05, for example, you get a different picture, like this. So again, PCA departs from one at λ = 1, like this. And you can compute a λ_c such that the MMSE jumps there and starts to decrease. So this is a setting where there is actually a phase in which there is signal but PCA is failing — it's just not detecting it. And, okay, there is an algorithm that I will not describe, related to the analysis done here, called approximate message passing. Approximate message passing does something like this: it is no better than PCA in this range, but then its performance improves over there. So I have ten minutes, perhaps. There is something I forgot to say here; I should write it. There is another theorem saying — let me write it — that ((1/n) xᵀX)² − q*(λ)² goes to 0 in L². You remember the assumption I made on the scalar product, the squared overlap being equal to q²? This is what we are able to prove, and it implies the assumption I made this morning to derive the connection with the matrix of overlaps. So we are fine: the overlap concentrates, and everything is consistent. Yeah. Okay. The last five minutes. Since I promised to speak about graphs, I will give one application to community detection in the infamous stochastic block model. This is also related to a question I had, about whether things work only with Gaussian noise. Clearly, when you have edges and nodes, you are not dealing with Gaussian random variables, and we will see that there is some work to do, but you can apply this technique using a central limit theorem. So the model is the one you saw with Laurent. You have a graph on n vertices; I will take only two communities to make life simpler. The community of node i is encoded in the entry x_i of a vector, and you put an edge between i and j with probability M_{x_i x_j}, where the matrix M is symmetric, scaled like this — the entries are of order d/n — with a, b and c fixed parameters that do not depend on n; d will tend to infinity. So p is basically the fraction of nodes in community one, and 1 − p the fraction of nodes in community two. In the regime I'm looking at, d is the average degree of my graph, because I'm making the assumption that the graph is balanced: I have this constraint relating a, b, c and p, which ensures that d is the average degree and that your degree does not reveal which community you belong to. It is a kind of symmetry. Okay.
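To make the two-point example concrete, here is a small numerical sketch (my own illustration, not code from the lecture): it evaluates the variational formula above for the two-atom prior, approximating the Gaussian expectation inside ψ_{P₀} by Gauss–Hermite quadrature, and maximizes F(λ, q) over a grid of q.

```python
import numpy as np

def two_point_prior(p):
    """Zero-mean, unit-variance prior on two atoms, as in the lecture."""
    atoms = np.array([np.sqrt((1 - p) / p), -np.sqrt(p / (1 - p))])
    probs = np.array([p, 1 - p])
    return atoms, probs

def psi(gamma, p, deg=81):
    """psi_{P0}(gamma) = E_{Z,X0} log E_x exp(sqrt(gamma) Z x + gamma X0 x - gamma x^2/2)."""
    atoms, probs = two_point_prior(p)
    z, w = np.polynomial.hermite_e.hermegauss(deg)  # nodes/weights for weight exp(-z^2/2)
    w = w / np.sqrt(2 * np.pi)                      # normalize against the N(0,1) density
    val = 0.0
    for x0, p0 in zip(atoms, probs):
        # Exponent for each quadrature node z (rows) and each prior atom x (columns).
        a = (np.sqrt(gamma) * z[:, None] * atoms[None, :]
             + gamma * x0 * atoms[None, :]
             - gamma * atoms[None, :] ** 2 / 2)
        a_max = a.max(axis=1, keepdims=True)        # log-sum-exp for numerical stability
        log_inner = a_max[:, 0] + np.log(np.exp(a - a_max) @ probs)
        val += p0 * (w @ log_inner)
    return val

def limiting_mmse(lam, p, q_grid=np.linspace(0.0, 1.0, 401)):
    """Maximize F(lam, q) = psi(lam*q) - lam*q^2/4 over q; the matrix MMSE tends to 1 - q*^2."""
    F = np.array([psi(lam * q, p) - lam * q ** 2 / 4 for q in q_grid])
    return 1.0 - q_grid[np.argmax(F)] ** 2

if __name__ == "__main__":
    for lam in (0.5, 0.9, 1.1, 2.0):
        print(f"lambda = {lam:3.1f}   limiting MMSE ~ {limiting_mmse(lam, p=0.05):.3f}")
```

Sweeping λ for p = 0.5 and p = 0.05 should reproduce the two pictures described above: for p = 0.5 the MMSE leaves 1 continuously at λ = 1, while for the sparse prior the transition becomes discontinuous.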
So now I define x̃, making a small change of variables. The signal I want to recover is x_i, telling me whether node i is in community one or community two; I map community one to √((1−p)/p) and community two to −√(p/(1−p)), so that x̃_i has exactly the two-point law from before. With this mapping, if I look at the adjacency matrix of my graph, its entries G_{ij} form a symmetric matrix, and each entry is Bernoulli with this parameter, of order d/n. I forgot to define ε; it is related to a − b, I think. So this is really rewriting the same thing: each entry is a Bernoulli random variable with this parameter, and you see that the parameter depends on the signal through this function. Okay. The variance of this entry: when ε goes to zero, the variance is approximately (d/n)(1 − d/n), and since d/n is small you can ignore the second factor, so it is approximately d/n. So now what I'd like to do is this: these are zero-one entries, and I want to approximate them with Gaussians. What I will do is very crude: I will match the first and second moments and see how it goes. And if you do that, you get that G_{ij} is approximately its mean plus a Gaussian with the matching variance — this is why I needed to compute the variance. And here you start to see something very similar to the spiked Wigner model, except that there are constants everywhere. But the average degree is harmless, even from a statistical point of view: if you give me a huge graph, it is very easy to compute a good estimate of the parameter d — it is just the average degree of your graph, and this estimator is consistent. So what you can consider is this rescaling: a modification of my adjacency matrix where I remove the mean everywhere and rescale the entries. And with this notation you get exactly a spiked Wigner model, with a signal-to-noise ratio involving d and ε². Here there is something you need to show: that this approximation is actually correct if you want to compute, I don't know, the minimum mean square error or whatever you want to compute for the original model, where you only have zeros and ones. But it's not very difficult. The part which is trickier is that in such an application you really don't care about the mean square error. What you want to know is whether a given node belongs to community one or two; you don't want to make a small mistake on everybody. So the relevant notion of performance is what people call the overlap of the communities — not the overlap of statistical physics, but the number of nodes you classify into the right cluster. And here you need to work a little bit more, because all of my machinery works very well for the mean square error, but as soon as you change the measure of performance, the posterior mean is no longer the optimal estimator. Still, since you have a good handle on the posterior distribution itself, you can get a result on the overlap in the community sense. And if I redraw basically the same plot as here, but now as a function of p — so this is the plot for this two-point distribution in the spiked Wigner model, with p on this axis and λ on that one, you have one over here, and a line like this. This point is the symmetric stochastic block model, where the two communities have the same size.
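Written out, the reduction sketched here goes as follows (my reconstruction; in particular the exact definition of ε and the constant in λ are assumptions consistent with the moment matching, not something stated explicitly in the lecture):

$$\tilde{x}_i = \begin{cases} \sqrt{(1-p)/p} & i \in \text{community } 1,\\ -\sqrt{p/(1-p)} & i \in \text{community } 2, \end{cases} \qquad G_{ij} \sim \mathrm{Bernoulli}\Big(\frac{d}{n}\big(1 + \varepsilon\,\tilde{x}_i \tilde{x}_j\big)\Big),$$

and after centering by the mean and rescaling by the standard deviation,

$$A_{ij} \;=\; \frac{G_{ij} - d/n}{\sqrt{(d/n)(1 - d/n)}} \;\approx\; \sqrt{\frac{\lambda}{n}}\,\tilde{x}_i \tilde{x}_j + W_{ij}, \qquad \lambda = d\,\varepsilon^2,$$

with W_{ij} approximately standard Gaussian — exactly a spiked Wigner model with signal-to-noise ratio λ, the approximation becoming valid as d → ∞.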
So this line is what Laurent called the Kesten–Stigum bound. Above it, the problem is easy, in the sense that PCA is working — and since we are not interested in the minimum mean square error here, what I mean by "working" is doing better than random guessing: you are detecting some signal. And there is a value p*, which has an explicit formula. For p above p*, you can show that below the line it is impossible: we already know that below this line PCA does not detect signal, and indeed in all this regime no algorithm is able to detect signal, so it is impossible to do better than random guessing. And here, for p below p*, between the two curves, we know that it is possible — the mean square error tells you that you are detecting signal — but we don't know of any algorithm that works. So we call this phase the hard phase. And I think I will end my lecture here. It was a great honor to be invited to such a nice location to give this course; I hope you enjoyed it. All the missing details are in the paper with Léo Miolane, "Fundamental limits of symmetric low-rank matrix estimation" — in particular, if you want to see the proof that the free energy converges to the formula I gave you. So thank you for your attention. I don't know if you have questions. Any questions? We have a bit of time, actually, because the next lecturer is not here yet. I was too fast. Yeah. Sorry. Yes. [Question:] Thank you for the course. I have a question. In this case PCA is optimal — in this spherical, Gaussian-prior case. Do you know other cases where PCA is optimal? [Answer:] It depends what you mean by optimal. Detecting signal as soon as it is possible? [Q:] I mean both. First, cases where the transition is still the same. [A:] I think this happens in many cases. [Q:] Like this one? Yes. Then, cases where PCA is MMSE-optimal. [A:] So your question is: for which priors? What we saw is that for the Gaussian prior — which basically corresponds to the case where you have the least possible information — PCA is optimal. So I guess your question is whether it is the only prior for which PCA is optimal, or whether there is some criterion that would tell you in advance. That's a good question: can you guess in advance that PCA is optimal? I don't know the answer without doing the math. It is quite natural that the Gaussian prior is the one for which PCA should not be too bad, because basically there is no information in the prior, and as soon as you put information into the prior, PCA is not using it. [Q:] Then let me rephrase my question: do you know whether this natural guess — that when no information is present in the signal, PCA should be optimal — has been formalized? [A:] I don't know. The Gaussian distribution, for fixed zero mean and unit variance, is the unique distribution maximizing the entropy on the real line, so it is the distribution containing the least information — that is why I say it contains the least information. But that does not imply it is the only case in which PCA is optimal; this I don't know. [Q:] Say now the noise is not Gaussian and the prior is still Gaussian? [A:] I don't know. I mean, okay — in the SBM case the noise is not Gaussian, and showing that the approximation is correct is really not that hard, so you have a kind of universality in terms of the noise. That should be a good case.
[Q:] And say you could study models where the noise is not Gaussian anymore, but more complicated. Do you think we could capture more about the SBM, since the mapping between the SBM and the spiked Wigner model goes through universality? [A:] First, I just realized that I did not mention that this approximation is correct only when d tends to infinity, d being the average degree. [Q:] Do you think you can capture other scalings by studying more complicated versions of the spiked Wigner model, with non-Gaussian noise and things like this? [A:] So far, that's not what we did. For the sparse regime of the stochastic block model, I don't know; my feeling is that there are results in random matrix theory in the sparse regime, but they are not powerful enough to attack the problem. There, people look at the local structure of the graph and do not use this statistical-physics machinery at all. Basically, what they do is try to derive smart algorithms — smart in the sense that one hopes they are optimal. So you can show one bound this way: you can say, okay, in this regime it is possible. But it would be very nice to have a picture like that in the sparse regime, and I am not aware of any result even showing that there is a hard phase for the stochastic block model in the sparse regime. [Q:] No, there is — you need more communities, but there is. [A:] Okay, that's correct. But we don't know where the line is: we don't have an information-theoretic formula, except at a very particular point. For the symmetric case, when you add communities, you can show it, but there is no such complete picture. [Q:] Okay. Perfect. So thank you. [A:] Thanks. You're welcome.