Okay, so I will continue what we started this morning. This is the model I presented: u is a fixed direction, each sample is v_i u, a rank-one signal, plus Gaussian noise, and a fraction η of the labels is revealed. Since I had a question on this, let me define precisely what all the plots show. They are all for this quantity, what I am calling the Bayes risk, where the statistician knows everything about the parameters of the model, knows the model itself, and has no constraint in terms of computational resources or anything, so she is able to compute this infimum over all possible estimators. And again, the risk is not measured on the training set but on the test set: to model this, I sample a new point from the same distribution, one that has not been seen by the algorithm before.

So, we basically described two algorithms. The first one is the oracle, where you are given the direction u in addition, and the other is the fully supervised case where you see all the labels, which corresponds to η = 1 in my notation. In both cases, what we did is take a specific algorithm and compute the performance of that algorithm. It is not clear a priori that this performance is actually optimal, so there is a gap here. In the rest of my lectures I will not concentrate on algorithms, but directly compute the best achievable performance, what people call the information-theoretic limits for such a setting. So my analysis will not be tied to a particular algorithm, although for the presentation it is easier to present an algorithm and state its performance. And in both cases, it turns out that these algorithms are optimal.

So here is the general formula on the plots. Again, this is the risk evaluated on the test set, not on the training set. This is why, for example, the supervised algorithm run on the labeled data only performs worse than the unsupervised setting in the small-labeled-data regime; there is no contradiction. Any questions on these two plots before I go on? I don't know if there are people in the chat, but I am not monitoring it.

[Question: is it possible that the supervised one is overfitting?] What do you mean by overfitting? So basically you mean that the error is minimized on the training set but it generalizes poorly. I am not sure I understand the question correctly: my measure of performance is on the test set, and I am taking the infimum, so this is the best possible generalization error you can get. There is no notion of overfitting; the plot is not tied to any algorithm.

[Question: what would happen if you changed α?] It will change. You see, for example, the value one here, where you have the phase transition I told you about — this will change. But the global picture remains the same. Here I want to draw a 2-D plot, and since I have three parameters I need to fix some of them; this is why I chose α = 1, but there is no special property of α being one.

So now I will try to give you a little bit of intuition behind this weird mathematical formula, in a non-rigorous way.
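To make the setup concrete, here is a minimal sampler sketch for this model, under assumed conventions — u uniform on the unit sphere of R^d, n = αd samples, labels v_i uniform in {±1}, noise level σ, and a fraction η of labels revealed through s_i (these scalings are my assumptions, not taken from the slides):

```python
import numpy as np

def sample_model(d=200, alpha=1.0, sigma=1.0, eta=0.2, rng=None):
    """Sample from the semi-supervised model y_i = v_i * u + sigma * z_i.

    Assumed conventions (not from the original slides): u is uniform on
    the unit sphere of R^d, n = alpha * d, labels v_i are uniform in
    {-1, +1}, and s_i reveals v_i with probability eta (s_i = 0 means
    the point is unlabeled).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = int(alpha * d)
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                     # fixed direction on the unit sphere
    v = rng.choice([-1.0, 1.0], size=n)        # hidden +/-1 labels
    y = np.outer(v, u) + sigma * rng.standard_normal((n, d))
    s = np.where(rng.random(n) < eta, v, 0.0)  # side information: revealed labels
    return u, v, y, s
```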
In the rest of my lecture I will look at a simpler model, and there I will give the details of the mathematical proofs. If you want to do a non-rigorous computation, you need to make assumptions that allow you to do easy calculations, and the main assumption, which is not easy to prove, at least directly, is a kind of concentration result.

So, again, the two algorithms we saw are basically based on the same idea: you first get an estimate of the direction u — in the oracle it is given for free, in the other one you obtain it by averaging — and then, based on this best possible estimate of u, you get the best estimate for the new label. Now, the best estimate of u is actually the conditional mean of u given what you are seeing, the data Y and the side information S; and the best estimate of v is the same thing, conditioned on Y and S.

What we will assume is that, in the regime where both the number of samples and the dimension tend to infinity, the overlap ⟨ū, u⟩ between this quantity ū = E[u | Y, S] — a random variable depending on Y and S — and the true direction converges to a parameter q*_u between zero and one. This is a scalar product, so it is a scalar: we are averaging a lot of quantities. Unfortunately they are not i.i.d., so you cannot invoke a law of large numbers to get the result trivially, but you still make the assumption that it concentrates towards a fixed value that is not random. So q*_u here is a number which depends only on the parameters of the problem and lies between zero and one. This is an assumption we are making. There is a symmetric assumption on v: since v has ±1 entries, the scalar product ⟨v̄, v⟩ has n terms of order one, so you need to rescale it by 1/n in order to hope for a concentration result — this is why there is a 1/n there. There is no rescaling for u because u lies on the unit sphere, so when the dimension increases, all the components of u shrink. This is basically the only assumption we make, and although I have no proof of this, if you grant this assumption, the rest of my argument can probably be made rigorous quite easily. Unfortunately we are not able to prove it directly; I will show you how we actually proceed, with more mathematical details, later.

The first consequence of this assumption concerns the norm of this random vector ū, which, again, is a vector depending on Y and the side information. The equality E⟨ū, u⟩ = E‖ū‖² is basically Bayes rule — the tower property of conditional expectation; we will see it later if you are not convinced. So since this random quantity is expected to converge to q*_u, the expectation converges to the same thing, and the squared norm of ū should also be roughly q*_u.

Now you do exactly as in the oracle algorithm: you take your best estimate ū for u, and for any sample in your data set you take the scalar product of ū with what you observe, and you obtain this equation. Here, by my assumption, ⟨ū, u⟩ converges to the scalar q*_u, so the signal v_i = ±1 is rescaled by q*_u, and then you have this new noise term. In order to analyze this, you need to compute the variance of the noise term. Again, I am not rigorous here, but you expect the individual noise z_i to not affect ū much, so this term should be roughly a centered Gaussian random variable with variance equal to the squared norm of ū, which is q*_u.
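To summarize the assumptions and the projection step in one place, here is my reconstruction of the board, writing the model as y_i = v_i u + σ z_i (the placement of σ is an assumption on my part):

```latex
% Concentration assumptions, as n, d -> infinity, with
% \bar u = \mathbb{E}[u \mid Y, S] and \bar v = \mathbb{E}[v \mid Y, S]:
\langle \bar u, u \rangle \longrightarrow q^*_u \in [0,1],
\qquad
\frac{1}{n} \langle \bar v, v \rangle \longrightarrow q^*_v \in [0,1].

% Consequence of the tower property:
% \mathbb{E}\|\bar u\|^2 = \mathbb{E}\langle \bar u, u \rangle \approx q^*_u.

% Projecting a sample onto \bar u:
\langle \bar u, y_i \rangle
  = v_i \langle \bar u, u \rangle + \sigma \langle \bar u, z_i \rangle
  \approx q^*_u\, v_i + \sigma \sqrt{q^*_u}\, \tilde z_i,
\qquad \tilde z_i \sim \mathcal{N}(0,1).
```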
Now you end up with scalars. You see, you started with a bunch of equations in dimension d, and by projecting everything onto ū you end up with a single scalar equation, which is very simple and has the following form. This is what people call, in information theory, a channel: you have a signal, which is √γ times V, plus additive noise, and you observe the output and want to make a guess about the signal. This is basically our channel, where the coefficient q*_u in front of the signal becomes √γ after the rescaling I describe below.

Now — if you know a little bit about information theory, although really this is basic statistics — if you want to minimize the mean squared error for estimating V, the best estimator is just the conditional expectation of V given the output, and I denote the resulting error the minimum mean squared error. This is an explicit function of γ; there is only one parameter here, since V is ±1 with probability 1/2 each and the variance of the noise is one. So the minimum mean squared error is given by Bayes rule. Is it clear? It is perhaps a weird function, but it does exist.

Okay — here you see that in my projected equation the variance of the noise is not one, so to go from that equation to this normalized one I just need to rescale my equation so that the noise has variance one. This is done by this rescaling. You then have the best possible estimator, again in terms of mean squared error — and the mean squared error is the measure we will use during this course; I will explain why in a moment — achieved by taking this conditional expectation, and you have an explicit formula for the associated minimum mean squared error, which is this weird function I am calling MMSE(γ). Is it clear? Okay.

[Question about γ.] It is just for notational purposes, sorry: γ is just a parameter of my channel, and I take the square root of γ in front of the signal because it makes more sense this way — you can take any γ, as long as it is positive it is fine. So, as I said very quickly: here I have the coefficient q*_u in front of my signal v_i, and this is not a standard Gaussian random variable, it is a Gaussian with variance σ² q*_u. So to match this channel, I need to rescale my noise to variance one, and this is what I do here: I divide by σ√(q*_u). The main reason I write √γ is that I have the square root of a quantity here, and I want to write the quantity itself. So γ = q*_u / σ².
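For this binary channel Y = √γ V + Z, the posterior is proportional to exp(√γ Y v), so E[V | Y] = tanh(√γ Y) — that is the "weird" function in closed form. This is the standard result; the small Monte Carlo check below is my own illustration:

```python
import numpy as np

def mmse_binary_mc(gamma, n=2_000_000, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of the MMSE of Y = sqrt(gamma)*V + Z,
    with V uniform in {-1, +1} and Z ~ N(0, 1).

    The posterior of V is proportional to exp(sqrt(gamma)*Y*v), so the
    Bayes-optimal estimator is E[V | Y] = tanh(sqrt(gamma) * Y).
    """
    v = rng.choice([-1.0, 1.0], size=n)
    y = np.sqrt(gamma) * v + rng.standard_normal(n)
    v_hat = np.tanh(np.sqrt(gamma) * y)   # conditional expectation of V given Y
    return np.mean((v - v_hat) ** 2)

for gamma in [0.1, 1.0, 4.0]:
    # MMSE decreases from ~1 (no information) toward 0 as gamma grows
    print(gamma, mmse_binary_mc(gamma))
```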
So now, assuming this scalar observation is a sufficient statistic for estimating the i-th component of v, we will compute the mean squared error in a second way. The mean squared error of the estimate of v is this formula, and I decompose it over the samples. For the labeled data there is no loss of information: I am giving you v_i, you plug in the true value, so that difference is zero. So the only contributions to this mean squared error are due to the unlabeled data. And remember that the side information is s_i, and s_i = 0 means that s_i is not ±1 — this is an unobserved label. So I am saying that what I computed in this scalar channel is actually the true conditional expectation of v_i — I am cheating a little bit in this equality here — and the quantity I am computing is given by this one over here: it is 1/n times the sum, over the unlabeled data only, of my weird MMSE function with my particular γ. Since a fraction 1 − η of the data is unlabeled, this is (1 − η) times MMSE(γ). Okay.

Now, I have not yet used the assumption I made on v̄ — remember, v̄ is at the top there. The small modification I make is to say that my estimate v̂ is equal to this: I compute the true conditional expectation with this formula. So here I am using the assumption I make on the conditional expectation of v. By exactly the same argument as before for the norm of ū, this assumption implies that the squared norm of v̄, rescaled by 1/n, is also q*_v, and again q*_v is not random. So when you plug this into the last expression over there, you obtain that 1 − q*_v, which is exactly the limit of this mean squared error, equals the right-hand side over there. So I have computed the mean squared error of my estimate of v in two different ways, and I end up with two different expressions, one in terms of q*_v and one in terms of q*_u. The two must be equal, and this is the first equation relating q*_v and q*_u.

Now, the problem is completely symmetric, so I will go faster on this part. I do basically the same thing: instead of starting with the estimate of u, I start with the estimate v̄ of v. Doing exactly the same steps as before, I end up again with a scalar channel of the same general form. There is one main difference with the previous one: here the signal is not ±1, it is a centered Gaussian random variable. So when you compute the minimum mean squared error for this channel, it will not be the same as before, because you do not have the same prior on the signal. And indeed, in this particular case, where both the signal and the noise are Gaussian, this is the first channel you study in an information theory course, and you can compute everything explicitly. So again, your best estimator is the conditional expectation, and the minimum mean squared error — which was a weird function I did not make explicit before — is fully explicit now, a very simple function of the parameter γ. If you are familiar with information theory, this is a very standard result. And again, I make exactly the same concentration assumption as we did with the v estimate — now for the u vector — and I obtain an expression for the mean squared error of my estimator.
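Written out, I believe the two relations on the board read as follows — the first line is my reconstruction of the equation just derived, and the second line is the standard Gaussian-channel result being referred to:

```latex
% Binary side: the two computations of the error on v give
1 - q^*_v = (1 - \eta)\,\mathrm{MMSE}_{\pm 1}\!\Big(\frac{q^*_u}{\sigma^2}\Big),
\qquad
\mathrm{MMSE}_{\pm 1}(\gamma)
  = 1 - \mathbb{E}\!\left[\tanh^2\!\big(\gamma + \sqrt{\gamma}\, Z\big)\right],
\quad Z \sim \mathcal{N}(0,1).

% Gaussian side: for Y = \sqrt{\gamma}\, U + Z with U, Z standard Gaussians,
\mathbb{E}[U \mid Y] = \frac{\sqrt{\gamma}}{1 + \gamma}\, Y,
\qquad
\mathrm{MMSE}_{\mathcal{N}}(\gamma) = \frac{1}{1 + \gamma}.
```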
I do the same thing as before, and I end up with a new equation relating q*_u and q*_v — but now this MMSE function for u is much simpler than the one before. Putting everything together, I end up with two equations in two unknowns. The first one is the computation we did first — again, an explicit formula, but quite long to write — and the second equation, here, is hidden: it has the same form as the first one, but the MMSE function is now explicit. Okay. So if you accept the assumption from the beginning, that the conditional expectations have overlaps converging to q*_u and q*_v, then you should be convinced that these parameters — again, they are scalars now — satisfy this fixed-point system.

Now, the last part is to translate this fixed-point equation into a performance measure for the Bayes risk. Again, we use the same trick as with the oracle algorithm. Since we have access to the best possible estimator ū, you take the scalar product of your new sample with ū, and you go from a vector equation to a scalar one, which is much easier to analyze. What you get is once again a scalar channel with additive Gaussian noise; what you need to be careful about is the prior on the signal. Here the prior is again the discrete ±1 distribution, rescaled by this parameter, and your best estimator is the sign of the dot product between ū and the new sample y. Remember that in the oracle you had access to the true u; here you end up with a formula of the same shape. Okay.

And it is almost like my claim — oops — you see that it has exactly this form, but the definition of q* is not the same. Here I defined q* as the solution of a fixed-point system, with a q*_u and a q*_v and two equations. So if you want to connect the fixed-point equations to this weird function F here, you need to do a little more math: basically, you need to show that the solutions of the fixed-point equations correspond to the optimizers of my function F, and this is done by what is called, in information theory, the I-MMSE theorem. Again, we will see this in more detail a little later. What you need to remember here is that the mutual information of the scalar channel I just showed you is connected to the minimum mean squared error I computed: namely, viewing this mutual information as a function of the parameter γ only, its derivative with respect to γ gives you one half times the minimum mean squared error, I′(γ) = ½ MMSE(γ).

[Question: perhaps there is a one half missing?] Yes — well, no: you can check that the function I gave you is indeed the mutual information of the corresponding channel, so that when you take the derivative you end up, via the I-MMSE relation, with the fixed-point equation, and q* is the unique minimizer of this function. And you end up with the statement of the theorem.

So, at least if you only care about an explicit formula for this type of problem, this type of argument gives you the analytic answer. But if you want a proof — unfortunately, as I said, we are not able, and I do not think there is any easy way, to prove directly the main assumption I took at the beginning, the concentration result.
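The structure of the argument suggests solving the two coupled equations by damped fixed-point iteration. Here is a sketch: the binary MMSE is the standard closed form, but the coupling I use on the Gaussian side (γ_u = α q_v / σ²) is a hypothetical placeholder for illustration, not the exact equation from the lecture:

```python
import numpy as np

_Z = np.random.default_rng(1).standard_normal(200_000)  # fixed quadrature sample

def mmse_pm1(gamma):
    """Standard MMSE of the +/-1 channel: 1 - E[tanh^2(gamma + sqrt(gamma) Z)].
    Evaluated on a fixed Monte Carlo sample so the fixed-point map is deterministic."""
    return 1.0 - np.mean(np.tanh(gamma + np.sqrt(gamma) * _Z) ** 2)

def solve_fixed_point(alpha=1.0, sigma=1.0, eta=0.2, damping=0.5, tol=1e-9):
    """Damped fixed-point iteration for (q_u, q_v).

    Equation 1 is from the lecture: 1 - q_v = (1 - eta) * MMSE_{+-1}(q_u / sigma^2).
    Equation 2 is a HYPOTHETICAL coupling, for illustration only:
    q_u = g / (1 + g) with g = alpha * q_v / sigma^2 (Gaussian-channel MMSE = 1/(1+g)).
    """
    q_u, q_v = 0.5, 0.5
    for _ in range(10_000):
        new_q_v = 1.0 - (1.0 - eta) * mmse_pm1(q_u / sigma**2)
        g = alpha * new_q_v / sigma**2
        new_q_u = g / (1.0 + g)
        if abs(new_q_u - q_u) + abs(new_q_v - q_v) < tol:
            return new_q_u, new_q_v
        q_u += damping * (new_q_u - q_u)   # damped update for stability
        q_v += damping * (new_q_v - q_v)
    return q_u, q_v

print(solve_fixed_point())
```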
So, in the end, the concentration results are more a consequence of the final formula. What we did with Léo Miolane is basically to compute the limit of the mutual information directly, and from that you obtain all the subsequent results I showed you here. So a posteriori you can check the main assumption — the concentration of what is called in physics the overlap, the scalar product of a sample from the posterior with the truth, towards a scalar: it is a byproduct of the proof, but we are not able to show it in an easy way directly. In other words, we need to prove everything else first in order to end up with the concentration result. Here I cheated quite a bit, because I started from this concentration result and then showed you that the formula follows if you make this assumption.

Okay, I guess I will skip that and go to the proofs. I have something like fifteen minutes left, so I will switch to the blackboard. Is there any question on this before I stop?

[Question: what is the problem with the concentration hypothesis? Why is it not a reasonable assumption?] The problem is that I do not know how to prove it directly. No — it is more than reasonable, because it is true: in the end, we do prove it. But the way my proof goes, I prove everything else first, and the last point is to prove the concentration result that I started with here. So if you came up with a very simple argument showing this concentration result first, then all the rest would follow quite easily. In a sense, I think this concentration result contains all the information you need to know about the problem — it is the most difficult part of the proof, if you want. This is the more physical part; I don't know if Jean will agree with that. Okay, thanks.

So, I can move on to the more technical parts. What I had in mind is to start with some tools in Bayesian inference. This is what we will do: we start today and continue probably tomorrow. Today and tomorrow — through the first course of tomorrow — we will concentrate on a very simple model and study the posterior, the minimum mean squared error and so on, basically all the concepts I showed you here in the general setting. This is probably the most important part of the lecture, because if you are not familiar with them, these are the basic results in Bayesian statistics. Then I will move on to the information-theoretic limits for the spiked Wigner model and show you how to use these tools in that particular setting, which is a little simpler than the semi-supervised one. I will give you a proof, and then some applications of this result, also connected to what Laurent will talk about today — community detection for the stochastic block model, a random graph model — and to PCA; then, if I have time, I will speak about another application to random matrices at the very end of my talk.

Okay, so let's start with the tools in Bayesian inference. I will start, as you see, with a model I really like: the observation is Y = √λ X + Z. I am underlining letters because this means I am considering vectors instead of scalars, and I take capital letters for random variables. So this is what we just saw: X has a prior distribution P_X over R^n — we will assume only that it has a finite second moment, and that's it — and Z has i.i.d. standard Gaussian entries, independent of X, so its covariance matrix is the identity.
So the parameter λ is the signal-to-noise ratio. In this part of the course I will also introduce the basic vocabulary used both in statistics and in information theory in relation to this model. Okay: when people say we are in the Bayes-optimal setting, this means that the statistician knows everything — the model, the prior P_X and λ. So this is the setting. And we are in dimension n; nothing is tending to infinity for the moment.

Now you need to define a measure of performance. As you saw, I like the mean squared error — you will see later why. The mean squared error of an estimator θ̂ is E‖X − θ̂(Y)‖², where θ̂(Y) is also a vector: any measurable function of Y. So this is defined for any estimator you give me, and the minimum mean squared error is by definition the infimum of the mean squared error over all estimators θ̂. It does not depend on a particular estimator; typically we write the dependence on λ explicitly, but it also depends on P_X, which is fixed. So, what is the function achieving the minimum here? Yes — the conditional mean of X given Y. Okay. This is just the Pythagorean theorem.

And we will be interested not only in the posterior mean, but in the posterior distribution of X given Y, if you allow me to write it like that. Again, if I am sloppy with the notation here, please ask me. So this is proportional to — the small x here is a particular realization of the random variable, right — the density of the prior, and then I need to encode the fact that the noise is additive Gaussian, which gives the factor exp(−½‖Y − √λ x‖²). What is very simple to compute is the law of Y given X; it is basically given here. I want the opposite, the posterior of X given Y, so I basically apply Bayes rule: I look at the joint distribution. Now, to get dP(x | Y), I have a normalizing constant here, to get a distribution of total mass one, and this comes from just this equation where I remove the part that depends only on Y — the ‖Y‖² term, which goes into the normalization. And this Z, which is called the partition function — I will write it here — is just the integral of the numerator.

I will introduce a new notation. Its name is the Hamiltonian: H(λ, Y, x) is equal to this term, √λ Y·x − (λ/2)‖x‖². So this is a random term — the Y is random. I will also write it like this, where I just replace Y by its expression in terms of big X and Z. Okay. I mean, the notation is perhaps a bit weird, because here I make the dependence on Y explicit in the name, and then the Y disappears in this second expression — we need to be a bit careful. But in any case, this is a random function of small x.

And we will introduce an important notation for the expectation with respect to this posterior: I call it the Gibbs brackets, ⟨ · ⟩. I mean, this is only a notation, but it is very important. Since I use this bracket here, I will not use brackets to denote the standard dot product — you need to find another symbol for that. This bracket just means you take the expectation of a function of x with respect to this random posterior measure.

[Question: there is an expectation, but is the result still random?] No — it is conditional on Y, so it is random. You need to take another expectation if you want a deterministic quantity; it is an expectation knowing big Y. You do not agree with this notation? Yes, you can write it like that if you want. I mean, the explicit formula is ⟨f(x)⟩ = (1/Z) ∫ dP_X(x) f(x) e^{H(λ, Y, x)}. But you see, I am making λ appear in the index, while the Y is not explicit anymore here — actually I am removing it here too. It is completely correct, though, that it depends on Y: Y is fixed, and then you have the Hamiltonian. So this is the meaning; this is a short notation for that.
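As a concrete illustration — my example, not one from the lecture: for an i.i.d. Rademacher prior, x ∈ {−1, +1}^n, the term ‖x‖² = n in the Hamiltonian is constant, so the posterior factorizes over coordinates and the Gibbs bracket is explicit, ⟨x_i⟩ = tanh(√λ Y_i):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 100_000, 1.5                 # dimension and signal-to-noise ratio

# Channel Y = sqrt(lambda) * X + Z with X_i uniform in {-1, +1}, Z ~ N(0, I).
x = rng.choice([-1.0, 1.0], size=n)
y = np.sqrt(lam) * x + rng.standard_normal(n)

# For this prior, ||x||^2 = n is constant in the Hamiltonian, so the
# posterior factorizes and the Gibbs bracket (posterior mean) is explicit:
x_bar = np.tanh(np.sqrt(lam) * y)     # <x_i> = tanh(sqrt(lambda) * Y_i)

mse = np.mean((x - x_bar) ** 2)       # empirical MMSE at this lambda
overlap = np.mean(x * x_bar)          # should match 1 - mse (a Nishimori identity)
print(mse, overlap)
```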
Is there any question on this? I am done, I think — okay, I need just one minute to define one more quantity.

[Question about one of the formulas.] Which one — this one? Okay, so what I am saying is that what is given is the prior of X, and given X it is easy to write down the distribution of Y knowing X. So the law of the couple (X, Y) is just the product of these two: P_X times this conditional law of Y given X. To get the other conditional, the posterior, it is proportional to this quantity, and then you need to renormalize it if you want a probability distribution. Now, in order to get this formula, you condition with respect to Y, so you remove the dependence on Y here — this is exactly what I am doing — and then I take care of the normalization.

So we end this part with one definition and one property. The free energy F(λ) is the expectation of the log-partition function: again, Y is random, so this log Z is random, and I take its expectation, F(λ) = E log Z. People doing statistical physics are probably familiar with this notation, and should not be surprised by the formula relating it to the channel: there is an energy term here and an entropy one, which is the mutual information over there. And it is probably a good time to stop and to leave you the proof of this formula as an exercise. We will start again with it tomorrow. Is there any question — apart from the proof of this fact, which I hope is more or less clear? Okay, thanks.
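For reference, I believe the identity left as an exercise — matching the energy/entropy description above — is the standard relation between the free energy and the mutual information of this channel, together with the I-MMSE theorem mentioned earlier (my reconstruction):

```latex
% Free energy vs. mutual information for Y = \sqrt{\lambda}\, X + Z:
F(\lambda) \;=\; \mathbb{E}\log \mathcal{Z}
\;=\; \frac{\lambda}{2}\,\mathbb{E}\|X\|^2 \;-\; I(X; Y).

% I-MMSE relation (Guo, Shamai, Verdu):
\frac{\mathrm{d}}{\mathrm{d}\lambda}\, I(X; Y) \;=\; \frac{1}{2}\,\mathrm{MMSE}(\lambda).
```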