Okay, so for the people following online, can you see my slides? Yes. Okay. So thank you for the kind introduction and the invitation to this beautiful part of Italy. My name is Marc Lelarge, and I will speak about topics related to this school: statistical physics, a little bit of random matrices at some point, and a little bit of random graphs as well. I decided to do this part with slides, to give you an overview of the kind of results I am aiming for; for the rest of my lectures I will work on the blackboard and be a little more precise and rigorous in the derivation of the results. I guess I can remove my mask. Most of what I will present is joint work with a former student of mine, Léo Miolane, who is now working at G-Research. For the work I am presenting there is an arXiv version, and I think also a conference version available on the web; it is on the slide, and I can share it with you if you want.

So the topic is called semi-supervised learning. I guess you all know what supervised learning is: you have labels on the data set. Unsupervised is when you have no labels, only the data set, and you want to learn a representation. Semi-supervised learning is in between the two, and I think it is a nice model to start with because it interpolates between unsupervised and supervised learning, which I will discuss afterwards.

I will start with a little bit of motivation. My work is motivated by what people do in practice. There is this famous quote, attributed to Benjamin Brewster: in theory there is no difference between theory and practice, while in practice there is. What we are trying to do is to close the gap, or at least to understand better the gap, between theory and practice. (When I presented this, apparently somebody in the audience told me that the attribution to Benjamin Brewster is itself disputed; I am not sure, but in practice it is quoted as Benjamin Brewster anyway.)

So here is one motivation for what I will present. It is a relatively recent paper, I think from two or three years ago, which is already prehistoric ages in terms of deep learning, but still: a paper called "Realistic Evaluation of Deep Semi-Supervised Learning Algorithms". You are in a setting where you have a huge data set, of images in this case, and only a subset of your data set has labels. For this talk I will only consider classification tasks: you want to classify cats and dogs, but you only have a few labels, and you want to use the remaining, unlabeled part of your data set in order to get better predictions. This is a purely empirical paper: what the authors did was re-implement the algorithms presented in other papers, measure their performance, and compare them fairly. As you know, as soon as you do empirical work and build a new algorithm, you have a tendency to be very careful in the tuning of the parameters of your own algorithm, and not that careful with the tuning of the parameters of the algorithms you want to beat. Being equally careful with all algorithms is basically the main point of this paper, and what they found is that, if you are careful, a lot of results turn out to be not that impressive. Here is a practical example on CIFAR-10 which, if you do not know it, is a relatively small image data set with 10 classes.
These models are from the literature, and what is reported is the percentage of error, so the lower the better. The comparison is the following. You have a huge data set, and only a small part of it is labeled. You train one algorithm only on the labeled part and compute its test error; you compare it with a semi-supervised algorithm, which can use the labeled data but also the unlabeled data, and which should therefore get better performance. The gap in performance reported in the literature is large: you go from 34% error down to 12% with this Π-model algorithm, whatever it is. But the authors, from Google, re-coded everything from scratch, and you see that the improvement is actually much weaker: on the labeled data alone they did much better than previous authors, reaching an error of 20%, and the semi-supervised algorithm improves this only to 16%. So the main message of this paper, I guess, is that semi-supervised learning is not really working in practice: just training your algorithm on the labeled data and using it to do the classification already does a pretty good job.

This is what triggered my curiosity, and I asked myself: okay, if semi-supervised learning does not work in practice, can it work in theory? If you think about it a little, there are people doing even fancier things, in economics for example. Here is a quote: an economist is a man who, when he finds something that works in practice, wonders whether it works in theory. So here I have something which does not work in practice, and I will try to see whether it works in theory, the goal being that if it works in theory, then perhaps I can fix what is not working in practice.

Okay, so this is the motivation behind this work. Now I will present my theoretical framework, which will be very simple and far from real data sets: what is called a Gaussian mixture model, in high dimension and in a Bayesian setting. I will clarify all of these words in a few minutes. The model will be simple enough that we are able to compute basically everything. What we are interested in, in this setting, is the performance of the fully supervised approach, where you train only on the labeled data, compared with the best semi-supervised approach you can take when you use both labeled and unlabeled data. As you will see, and this will be true for the rest of my course too, this is not tied to a particular algorithm: what I am trying to compute are what people call information-theoretic limits, that is, the best possible performance you can achieve whatever algorithm you use. And since we are not linked to a particular algorithm, we will be able, with this simple toy model, to quantify the best possible increase in performance you can obtain by exploiting the unlabeled data in the semi-supervised setting, and to see whether it matches what is seen in practice. I am presenting with slides, but feel free to interrupt me whenever you have a question, whenever a term is not clear, or anything.

So here is the semi-supervised binary classification model. I do not have 10 classes anymore, only two, which will be encoded by +1 and −1. The model is the following. The v_i are the ±1 labels, so you have two clusters, and each point is either of class +1 or of class −1.
And u is a random vector in high dimension, uniform on the unit sphere in dimension d. If your point is in the cluster +1, it is centered at +u; if it is in the cluster −1, it is centered at −u; and then you add a little bit of Gaussian noise with standard deviation σ. So the observation is Y_i = v_i u + σ Z_i, the resulting point in R^d. This is the simplest possible setting: everything is i.i.d. in the noise Z_i, there is no covariance matrix, or rather the covariance is the identity, so the components are i.i.d. Any question on this?

So this is the data set. Now I have to define the labels which are available to you; this is what I am calling the side information. I reveal the label, the variable v_i, which is either +1 or −1, with probability η, independently of everything else; with probability 1 − η I do not reveal it, and in that case the side information s_i is 0. So if the side information is 0, you do not observe the label; if it is +1 or −1, you know it is the true label. So η is the fraction of labeled data in your data set. Now, given the side information s_i and the observations Y_i, you have to... sorry, I will not consider the task of classifying the v_i of the training set; you could try to predict the unseen labels there, but what we do is something similar: we take a new sample. Say this is your training set; you take another, test sample, which has exactly the same distribution. You do not observe its label, of course, and you try to predict, from this new sample, what its label is. This will be the measure of performance.

[Question from the audience about u.] Yes: u is a fixed direction on the unit sphere; once it is drawn it is fixed, and it is the same for everybody. So you have the two clusters at +u and −u. [Follow-up.] But the marginals... you are correct, the marginals are still uniform; anyway, thanks for the question.

I am considering a high-dimensional setting, and by now I guess everybody in this audience will buy this assumption: the number of samples n and the dimension d will both tend to infinity, with the ratio n/d fixed, equal to a constant α. So this is the model. What you need to remember is that it is fully specified by three parameters: σ, the standard deviation of the noise; η, the fraction of labels you reveal; and α, the ratio of the number of samples to the dimension.

Now, what does it mean to be in the optimal Bayesian setting? It means that the statistician knows the parameters of the model. He knows the model, of course, but he also knows its parameters: the variance of the noise σ², the fraction of labels η, and the ratio α. He also knows that u is uniformly distributed on the sphere, that the noise is Gaussian, and that the labels are ±1 with probability one half each. All of this is known to the statistician. And here is what he is trying to minimize, the Bayes risk: you have your estimator, which is given the data set Y, the side information s, and a new sample from the test set, and you want to predict whether this new sample is in class +1 or −1.
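As a concrete illustration of the model just described, here is a minimal simulation sketch in Python: labels v_i, observations Y_i = v_i u + σ Z_i, and the side information s_i. The parameter values and variable names are illustrative choices of mine, not taken from the slides.

    import numpy as np

    rng = np.random.default_rng(0)

    d = 500                  # dimension
    alpha = 2.0              # ratio n/d
    n = int(alpha * d)       # number of samples
    sigma = 1.0              # noise standard deviation
    eta = 0.2                # fraction of revealed labels

    # Hidden direction u, uniform on the unit sphere of R^d
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)

    # Labels v_i = +/-1 with probability 1/2, observations Y_i = v_i*u + sigma*Z_i
    v = rng.choice([-1.0, 1.0], size=n)
    Y = v[:, None] * u[None, :] + sigma * rng.standard_normal((n, d))

    # Side information: s_i = v_i with probability eta, and 0 otherwise
    revealed = rng.random(n) < eta
    s = np.where(revealed, v, 0.0)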
So you are making an error with this probability, and you want to minimize this error over all measurable functions of your observations. So v̂ is your estimator and, as I said, the new sample is drawn from the same distribution as the samples of the data set. Are there any questions so far? [Question.] Yes, exactly: what I am trying to measure is the generalization power of my estimator. In order to be able to do some math, I make the assumption that the new sample has the same distribution as the data set.

[Question.] Okay, so the question is: for u, can you not just take the first unit vector? It would not change the model at all; you can always apply a rotation and put yourself in that setting. But what matters is that you should not tell this to the statistician, otherwise there is a trivial algorithm. This actually corresponds to the first algorithm I will look at, which is called the oracle: when you know u, what performance can you achieve? So this is related to your question: what can you do if you know, in addition, the vector u? This is what I am calling the oracle risk, because you are giving a lot of information to the statistician.

I will not prove it here, but there is an algorithm which seems sensible indeed. If you know the direction u and you get a new sample, you just take the inner product between this new sample and your known u; if this inner product is positive you say the sample is in class +1, otherwise in class −1. So you take the sign of the inner product, sign(⟨Y_new, u⟩), and it gives you a good answer. It turns out that this algorithm is actually optimal, meaning that it minimizes the risk. So you can compute the oracle risk in this case. Since ⟨Y_new, u⟩ = v_new + σ⟨Z_new, u⟩, and the projection of the Gaussian noise on u is still a standard Gaussian G, you make a mistake exactly when σG overwhelms the signal of size one, which happens with probability 1 − Φ(1/σ), where Φ is the standard Gaussian cumulative distribution function. The important thing to notice here is that even in the oracle setting, where u is given, the risk is non-trivial, not zero. It does not depend on the parameter η, of course, nor on α the dimension ratio; it depends only on the variance of the noise. So we are in a regime where the risk we want to compute will certainly never be zero.
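Continuing the sketch from above, here is a quick Monte Carlo check of the oracle classifier: classify a fresh sample by sign(⟨Y_new, u⟩) and compare the empirical error with 1 − Φ(1/σ). Again, this is only an illustration, not code from the paper.

    from scipy.stats import norm

    # Fresh test samples drawn from the same distribution as the training set
    n_test = 10_000
    v_test = rng.choice([-1.0, 1.0], size=n_test)
    Y_test = v_test[:, None] * u[None, :] + sigma * rng.standard_normal((n_test, d))

    # Oracle classifier: sign of the inner product with the true direction u
    v_hat = np.sign(Y_test @ u)
    print("empirical oracle risk:", np.mean(v_hat != v_test))
    print("formula 1 - Phi(1/sigma):", 1.0 - norm.cdf(1.0 / sigma))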
Now there is another case which is relatively easy to study: the fully supervised case. Here you observe all the v_i, so you can remove the labels by multiplying each Y_i by v_i, obtaining v_i Y_i = u + σ v_i Z_i, where v_i Z_i is still a standard Gaussian vector. Now what you can do is take averages over the observations of the training set: you estimate u by û = (1/n) Σ_i v_i Y_i = u + (σ/√n) × (a standard Gaussian vector), and you decrease the noise just by this simple averaging: the standard deviation of the noise is divided by √n. Now take this û as your estimate of u and, when a new sample arrives, take its inner product with û. You see that you have two terms. The first term is the one of interest to you, because it contains the label v_new that you want to predict: when n and d tend to infinity, you can convince yourself that ⟨v_new u, û⟩ is roughly v_new, of order one, because u is on the unit sphere and you have an explicit formula for û; so this is exactly the signal you want to estimate. The second is a noise term, which is Gaussian again, and you need to compute the variance of this Gaussian in order to know the error, exactly as in the oracle setting. An easy computation gives, in the limit, the variance σ²(1 + σ²/α); you see that the parameter α is now present. So, even though we started in dimension d, since we projected everything onto a line, the problem reduces to a one-dimensional one: the signal, which is ±1, plus Gaussian noise with this variance. This is what we will later call a scalar channel: the signal, ±1, is observed through additive Gaussian noise, and based on the output of this channel you want the best estimate of the input. Again, you make a mistake if the noise is too big, and everything can be computed explicitly: in the oracle setting we had only 1 − Φ(1/σ); here we have an additional term, giving the similar formula 1 − Φ(1/(σ√(1 + σ²/α))). It turns out that this setting was studied by statisticians in a much more general form, where there is a covariance matrix in the noise that you need to estimate. Any question on these two simple cases?
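Still continuing the same sketch, here is the fully supervised estimator û obtained by averaging the de-labeled samples, with its test error compared against the limiting formula. The closed-form variance σ²(1 + σ²/α) is my reading of the computation sketched above, so treat this as a plausibility check rather than as a statement of the theorem.

    # Fully supervised estimate of u: average the de-labeled samples v_i * Y_i,
    # which equals u + (sigma/sqrt(n)) * (standard Gaussian vector)
    u_hat = (v[:, None] * Y).mean(axis=0)

    # Classify fresh samples by the sign of the inner product with u_hat
    v_hat_sup = np.sign(Y_test @ u_hat)
    print("empirical supervised risk:", np.mean(v_hat_sup != v_test))

    # Limiting formula: scalar channel +/-1 plus Gaussian noise
    # of variance sigma^2 * (1 + sigma^2/alpha)
    print("formula:", 1.0 - norm.cdf(1.0 / (sigma * np.sqrt(1.0 + sigma**2 / alpha))))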
So we now have basically a lower bound, the oracle risk, and an upper bound, which is this fully supervised approach applied to the labeled part of the data, and what we want to compute is in between these two bounds: the risk as a function of η. The risk, which is the error you are making, should be a monotone function of η; I computed it at η = 0 and η = 1, and I want the rest in between. It turns out that the general formula for the risk, for any η, has the same shape as before: 1 − Φ of a parameter divided by σ, schematically 1 − Φ(√(q*)/σ). The numerator was 1 for the oracle, and the square root of an α-dependent factor in the supervised case; the general form is always the same, but this q* is a little more complicated. It is given as the minimizer of a somewhat weird function which is written explicitly on the slide, and it is not clear a priori that for η = 0 or η = 1 you recover the previous results; this is true, but you see that it is much more complex. [Question: why do we recover...?] No, no, sorry, we are not recovering that, I said something wrong. So, about q*: what matters at this point is that it is defined through a function of the three parameters of the problem, α, σ, η, viewed as a function of a scalar variable q. There is nothing random in this definition; the function is explicit, it is given over there, and then everything is essentially explicit.

So here is a plot: since you have an analytical solution, you can at least plot the values to see what happens. This is the Bayes risk as a function of η. η = 0 means I have no labeled data, the unsupervised setting; η = 1 is the fully supervised setting. The oracle bound is in green here, and we are not reaching the oracle bound. The unsupervised setting is the dotted line; of course it does not depend on η, you just take the whole data set and ignore the labels. For η = 0 the two bounds match. And in red you have the best performance you can achieve with a semi-supervised algorithm, whatever algorithm you use; again, this is the information-theoretic risk. Now, what is the blue curve, which crosses the unsupervised line, does worse than the unsupervised setting with few labels, and matches the other curve when all the labels are available? Can you guess what I am calling "supervised" here? Yes, exactly: the best performing algorithm trained only on the part of the data set for which you have labels. When you have no labels, its risk is 0.5, a random guess: you are doing nothing. And you see that when you have very few labeled data, you do worse than the unsupervised setting; but as soon as you cross this line, as soon as the fraction of labeled data is above, I don't know, 0.15 or so, just training your algorithm on the labeled data achieves better performance than the unsupervised setting. I am not claiming this toy example is directly connected to practice, but at least what you see is that the algorithm trained only on the labeled data performs very well quite soon. And most of the time, training an algorithm on labeled data is very simple, whereas doing unsupervised learning might be tricky.

Here is another plot, showing the impact of the noise. Here η is fixed to 20%, so I am on this line here. In green you have the oracle: as you go in this direction you are decreasing the level of noise, and the risk decreases. The dotted black curve corresponds to the best performance of any unsupervised algorithm. What you see is that up to a noise variance of one you are basically just doing random guessing: there is no information you can recover. Then the risk starts to decrease. This is actually related to what you already saw in this course, namely a phase transition: if the noise is too high, there is no signal and you cannot do anything, and below the transition the risk starts to decrease. Then there is the curve for supervised learning on the labeled data only; here the fraction of labeled data is fixed at 20%. In red is the best semi-supervised performance: you see that at one end it is very close to supervised-on-labeled-data-only, and at the other end it is very close to unsupervised learning. The dotted blue curve is when I cheat and look at the best possible algorithm supervised on the whole data set, with all the labels revealed: it is a lower bound for everything here except the oracle, which can do better. On this picture you see that if you are in the first regime, just training on the labeled data achieves very good performance and you can throw away all the unlabeled data; and if you are in the other regime, doing the opposite, just running unsupervised learning on the whole data set, achieves almost the best possible performance, because the red curve is very close to the dotted black one. So in a sense, once you see it, it is quite intuitive.
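The red semi-supervised curve and the dotted unsupervised curve require the minimizer q* of the potential, which is not reproduced here; but the two closed-form curves can be sketched directly. For the supervised-on-labeled-data-only curve I assume that only the ηn labeled samples are used, so the effective sample ratio is ηα; that reading, like the variance formula above, is my own reconstruction.

    import matplotlib.pyplot as plt

    sigmas = np.linspace(0.05, 3.0, 300)
    oracle_risk = 1.0 - norm.cdf(1.0 / sigmas)
    # Training on the labeled part only: eta*n samples, effective ratio eta*alpha
    sup_risk = 1.0 - norm.cdf(1.0 / (sigmas * np.sqrt(1.0 + sigmas**2 / (eta * alpha))))

    plt.plot(sigmas, oracle_risk, label="oracle (u known)")
    plt.plot(sigmas, sup_risk, linestyle="--", label="supervised on labeled data only")
    plt.axhline(0.5, color="black", linewidth=0.5)  # random guessing
    plt.xlabel("noise standard deviation sigma")
    plt.ylabel("risk")
    plt.legend()
    plt.show()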
But the range of parameters for which semi-supervised learning is really useful, where you have a real gap between the red curve and the blue one, or between the red and the dotted black, is a very small region of parameter space. So the practical paper might well be right: if in deep learning you are in one of these regimes, then you can just forget about semi-supervised learning, and training on the labeled data alone will achieve almost the best possible performance. How much time do I have, ten minutes, two minutes? Okay. Yes, so I will skip the rest. Perhaps I will stop here and start again this afternoon with a non-rigorous derivation, to at least explain this formula and where these terms come from, and then proceed from there. Thank you.