Yeah. Can you hear me? Yes, I can hear you. Can you see my screen? Yes, I think I can. Okay, thank you. So first, let me thank Florent, Jean and Matteo for inviting me to this workshop; I've been learning a lot of stuff these days. Today I'm going to present some work that I did in collaboration with Federica Gerace, Marc Mézard, Lenka Zdeborová, and our dear session chair, Florent Krzakala. Hopefully it's going to be an easygoing talk, because it overlaps with a lot of material that has been presented previously — I have in mind especially the talk by Stefan on Monday and the talk by Sebastian a couple of hours ago. So there's going to be a lot of repetition, but I think that's good for learning: as we all know, several epochs are always better for learning.

So let's get into it. The million-dollar question, what we'd like to understand, is generalization in modern machine learning problems. If you look at the difference between what we teach undergrads in Stats 101 and how generalization curves actually look in modern problems, it's pretty clear that there are a couple of things we don't understand about generalization in high-dimensional statistics. If you go around the conference asking people what we are missing, the answer, as usual, will depend on who you talk to. Some people will tell you it's all about the architecture, and that adding another layer is what makes the difference. Other people will tell you it's all about the algorithms and how they are biased towards minima that generalize well — we heard about that earlier today. And yet other people will tell you it's all about the data: selecting the right features for the learning problem you want to solve is where most of the work should go. Probably everyone is right, and a good theory of learning will need a linear combination of the three. But I personally think that data — and I believe some of my colleagues at this conference will agree — is the least explored of the three aspects, so today I'm going to focus on that.

When we talk about data, there are basically two theory cultures. The first is worst-case analysis, which completely ignores the role of data; I'm thinking of worst-case learning bounds, such as VC-type bounds. The other culture is typical-case analysis, which you also heard a lot about this week, but which often models data too simplistically, just taking i.i.d. Gaussian noise. So the question we'd like to ask in this talk is: can we do better than that? If you heard the other talks, you probably already know the answer.

Before diving into that, let me give a concrete example of worst case versus typical case, which I borrowed from some colleagues in our group in Paris. Consider a very simple classification task: classifying i.i.d. Gaussian points with ±1 labels assigned by a teacher — a one-layer neural network with Gaussian weights θ.
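Here is a minimal sketch of that kind of experiment — i.i.d. Gaussian inputs, labels from a Gaussian teacher, and an out-of-the-box logistic regression solver — with the dimension, sample sizes and regularization chosen by me for illustration rather than taken from the figure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 200                          # input dimension (illustrative choice)
alphas = [0.5, 1.0, 2.0, 4.0]    # sample complexity n/d

theta = rng.standard_normal(d)   # teacher weights

def sample(n):
    X = rng.standard_normal((n, d))   # i.i.d. Gaussian inputs
    y = np.sign(X @ theta)            # +/-1 labels assigned by the teacher
    return X, y

X_test, y_test = sample(20000)
for a in alphas:
    n = int(a * d)
    X, y = sample(n)
    clf = LogisticRegression(C=1e3, max_iter=5000).fit(X, y)   # weakly regularized
    err = np.mean(clf.predict(X_test) != y_test)
    print(f"alpha = n/d = {a:.1f}: test error = {err:.3f}")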
In the figure, the orange dots are an out-of-the-box logistic regression solver that I got from scikit-learn, and the dashed green line is the prediction from the worst-case, VC-type bound for this problem, which you can actually compute — if you want to know more, look at the reference, it's a very interesting piece of work. You can see there is a big gap between the two. In black you can see the Bayes-optimal prediction, which comes from the theory that uses the prior of the data model. The typical case that we get just by using scikit-learn is much closer to the Bayes-optimal curve than to the worst-case analysis, so taking into account the structure of the data actually matters for this problem.

As I said, the question we'd like to answer is: can we do better than i.i.d. Gaussian points, as in my last example? If you heard Sebastian's talk, you probably know the answer. The idea is to model the structure of the data. We want a data model that is still analytically tractable, because we want to do theory — we cannot take the most general kind of data, because that would be a nightmare for theory — but something more complicated than i.i.d., yet not too complicated, so that we can still apply the usual methods we are used to.

The idea is to model a latent feature space underlying the input space. Think of my dog-and-lobster example: we can classify dogs by just looking at whether they have a muzzle or not — a lobster doesn't have a muzzle, whereas a dog does. So there exists a representation of this data, which I would call the latent space representation, where the muzzle is one of the coordinates. This representation is low dimensional, and whether my picture has a muzzle or not shows up on this coordinate as a yes or a no, so I would be able to perfectly separate the classes just by looking at this space.

That is the idea behind the hidden manifold model that Sebastian introduced previously. We take a data set (X, y) where X is generated from a linear combination of latent factors, and the labels depend only on the latent representation. So y depends only on c, which lives in the low-dimensional latent feature space, while X — the data that is given to the statistician, or to the student if you like to think in teacher-student terms — is hidden by a projection into a higher-dimensional space, followed by a nonlinearity, so that the problem is nonlinear and non-trivial. Note that the scaling here is such that when I take the high-dimensional limit — the latent dimension, the input dimension and the number of samples all going to infinity with their ratios fixed — everything scales nicely.

The aim of this work is to study classification and regression on this hidden manifold data set. Differently from what Sebastian presented, which was online learning for this model, here we are interested in the full-batch problem. Both classification and regression can be put under the umbrella of generalized linear models: my predicted labels are a nonlinearity applied to the dot product of a weight vector with my input vector, and I choose my weights by minimizing a loss, which can be quite general, plus a regularization term — an L2 penalty.
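To make the data model concrete, here is a minimal sketch of how one might sample from the hidden manifold model. The tanh nonlinearity, the sign link for the labels, the Gaussian projection F and the 1/√d scaling are my own illustrative assumptions — one common convention, not prescriptions from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)

def hidden_manifold(n, d_latent, theta, F, sigma=np.tanh, f0=np.sign):
    """Sample n (x, y) pairs from a hidden manifold model.

    c lives in the low-dimensional latent space; the inputs x are a fixed
    projection of c into the higher-dimensional input space, passed through
    a nonlinearity; the labels depend on c only.
    """
    C = rng.standard_normal((n, d_latent))       # latent representations
    X = sigma(C @ F.T / np.sqrt(d_latent))       # observed, higher-dimensional inputs
    y = f0(C @ theta / np.sqrt(d_latent))        # labels defined in the latent space
    return X, y

d_latent, d_input = 50, 500                      # latent-to-input ratio 0.1, as in the talk
theta = rng.standard_normal(d_latent)            # teacher weights in the latent space
F = rng.standard_normal((d_input, d_latent))     # Gaussian projection (one possible choice of F)

X_train, y_train = hidden_manifold(1000, d_latent, theta, F)
print(X_train.shape, y_train.shape)              # (1000, 500) (1000,)
```

The statistician only ever sees (X_train, y_train); the latent vectors c and the projection F stay hidden.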
By choosing my f0 and f̂ to be, for example, the identity, I get a regression task; if I choose them to be a sign function, I get a classification task. And by choosing the loss, I can also select different estimators: for example, with the logistic loss I get logistic regression. So we're going to study this generalized linear model framework, and afterwards we'll draw the consequences for the different tasks.

It's quite interesting that this data set with this learning task can be interpreted in a completely different way. If we look at the latent coordinates c as being the data points — the role previously played by X — and we incorporate the feature map on the model side instead of the data side, it becomes just a two-layer neural network where we fix the first layer with weights F, which previously were the projections, and we only train the second layer with weights W. The nonlinearity σ becomes the activation function of the hidden units. This model is closely related to the random features model introduced by Rahimi and Recht to approximate kernel methods in finite dimensions; in particular, have in mind the recent work by Mei and Montanari, who studied exactly this setting for the ridge regression task — by taking f0 to be the identity and the loss to be the square loss, we are in the same setting, and we're basically going to discuss generalizations of that to other tasks. This is the random projections point of view; I'm not going to spend much time on it because we don't have time, but it's good to keep in mind that you have these two pictures of the same model.

So what is our main result? I don't want to dive into anything technical — we use statistical physics methods, a replica computation, to derive it — but our main result is a formula for the generalization error and the training loss of this problem in the asymptotic high-dimensional limit. Bear with me: the formula looks a bit complicated, but it's not so hard to understand. What it tells you is that the generalization error and training loss depend only on a few low-dimensional order parameters. These low-dimensional parameters can be obtained by solving a set of fixed-point equations: you just plug the equations, which look complicated, into the computer, iterate them, and you get the generalization error and training loss. You can see that the result depends both on the loss, which is general, and on the task you choose, which is parameterized by f0 as discussed. And it depends on the matrix F only through the spectrum of FFᵀ — only through the Stieltjes transform of FFᵀ — so it holds for quite general F, as long as the spectral measure of FFᵀ is well defined in the high-dimensional limit. You might ask where the information about σ, the activation of the first layer in the random features picture, goes: it just appears as some coefficients in the equations. In particular, as I said before, if you choose the square loss and the task to be ridge regression, we recover the results of Mei and Montanari.
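Here is a minimal sketch of that random-features point of view in the ridge regression setting just mentioned: the latent vectors c play the role of the data, the first-layer weights F are fixed at random, and only the second-layer weights are trained with a square loss plus L2 penalty. The dimensions, the tanh activation and the regularization strength are my own illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
d, p, n = 50, 500, 1000                   # data dimension, number of random features, samples

theta = rng.standard_normal(d)            # teacher weights
F = rng.standard_normal((p, d))           # fixed first-layer weights = random projection

def features(C):
    # fixed first layer: sigma now read as the activation function of the hidden units
    return np.tanh(C @ F.T / np.sqrt(d))

C_train = rng.standard_normal((n, d))     # c now plays the role of the data points
y_train = C_train @ theta / np.sqrt(d)    # regression task: f0 = identity
C_test = rng.standard_normal((5000, d))
y_test = C_test @ theta / np.sqrt(d)

# train only the second-layer weights, square loss + L2 penalty (ridge regression)
reg = Ridge(alpha=1e-2, fit_intercept=False).fit(features(C_train), y_train)
print("test mse:", np.mean((reg.predict(features(C_test)) - y_test) ** 2))
```

Such finite-size simulations are what one would compare against the asymptotic formula obtained from the fixed-point equations.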
Let me go over this quickly, because Sebastian already spent a lot of time on the details: the main technical tool used to solve this replica computation is the Gaussian equivalence principle that was discussed previously. This was first observed in a paper by Sebastian Goldt and collaborators, and was also hinted at by Mei and Montanari. And what is nice is that I can now say it has been proved, as you have heard — it is a rigorous result, in particular in this setting of a single-layer hidden manifold model.

Okay, let's now move to the consequences of the formula. The first thing I'd like to analyze is a simple classification task with the square loss, taking the nonlinearity to be an error function; but the discussion goes through for any other nonlinearity, since, as you saw, the equations depend on it only through some coefficients. A couple of things to observe. On the left, I have the generalization error as a function of the sample complexity, fixing the ratio of latent to input dimensions to 0.1 and plotting for different regularization strengths λ. You can see there is a peak in the generalization error at sample complexity equal to one; this can be understood easily from random matrix theory, because in this case we have a closed-form solution in terms of the pseudo-inverse, so there is nothing fancy there. On the right, I'm plotting a heat map of the generalization error as a function of the ratio of latent to input dimensions, and we can see that for very low latent dimension we can always achieve good generalization. This mimics the intuition that if there are only very few latent dimensions, like the muzzle, we are able to classify well pretty easily.

Next, let's keep the classification task but look at different losses: here we're comparing the logistic loss against the square loss. I'm also plotting the training loss at the bottom, and I'm now plotting things as a function of the inverse sample complexity, which you can see as a measure of the complexity of your model. The peak in generalization error for the logistic loss is slightly before the peak for the square loss, and both of them happen at the interpolation threshold, where the training loss goes to zero. While, as I said, for the square loss this is related to the pseudo-inverse, for the logistic loss it's a bit more complicated and is actually related to the separability threshold, which I'm going to discuss in another slide very soon — keep that in mind.

Now let's analyze the effect of different F's, that is, of different projection matrices F. Here I plot both the ridge regression task and logistic regression, and you can see that orthogonal projections always outperform random Gaussian projections. In particular, they tend to the same limit for an infinite number of parameters, which is the kernel limit; but for a finite number of parameters, if you want to approximate the kernel limit, it is always better to choose an orthogonal projection rather than a Gaussian one. This has been observed previously in a work by Choromanski, Rowland and Weller, and here our model provides a concrete setting where these questions can be explored with an analytically tractable theory.
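To get a feeling for the two kinds of projections, here is a small sketch comparing a Gaussian projection with an orthogonally constrained one built via a QR decomposition. The specific construction, normalization and dimensions are my own choices for illustration, not necessarily the ensemble used in the works just mentioned:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
d, n = 100, 400                                 # latent dimension and training samples
theta = rng.standard_normal(d)

C_tr = rng.standard_normal((n, d)); y_tr = np.sign(C_tr @ theta)      # shared data
C_te = rng.standard_normal((5000, d)); y_te = np.sign(C_te @ theta)

def make_F(p, kind):
    G = rng.standard_normal((p, d))
    if kind == "gaussian":
        return G
    # one simple way to impose orthogonality for p >= d: orthonormalize the columns
    # via QR, then rescale so the entries have the same typical size as the Gaussian
    # case (this normalization is an assumption, not taken from the talk)
    Q, _ = np.linalg.qr(G)                      # Q: (p, d) with orthonormal columns
    return np.sqrt(p) * Q

def test_error(p, kind):
    F = make_F(p, kind)
    feats = lambda C: np.tanh(C @ F.T / np.sqrt(d))    # random-features map
    clf = LogisticRegression(C=1.0, max_iter=5000).fit(feats(C_tr), y_tr)
    return np.mean(clf.predict(feats(C_te)) != y_te)

for p in [100, 200, 400, 800]:                  # number of projections / hidden units
    print(f"p={p}: gaussian {test_error(p, 'gaussian'):.3f}, "
          f"orthogonal {test_error(p, 'orthogonal'):.3f}")
```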
Before finishing, I would like to come back to the position of the peak in logistic regression. As probably a lot of you know, when the number of parameters is large — when you project your data into a very high-dimensional space — you can always separate the data linearly. A valid question to ask is: what is the critical number of projections such that you can separate your data perfectly, for both random Gaussian and orthogonal projections? This is what is plotted here. You can see that, as I mentioned before, orthogonal projections always outperform Gaussian projections, and we can predict the separability threshold exactly, for any sample complexity. In particular, when the latent dimension is very large and the data becomes i.i.d. Gaussian, we recover the Cover transition at sample complexity equal to two — equivalently, inverse sample complexity 0.5 — which is a classical result from Cover's geometric argument. So this generalizes the work by Candès and Sur, which holds exactly for i.i.d. data, to more complex data sets.

Okay, just to conclude — I see I'm running out of time — with some perspectives. We only looked at very simple tasks. One question we could ask is, for example: what is the effect of training F, that is, of learning the best representation in high dimensions? This would be interesting to look at, together with harder tasks. Another thing we can look at is whether we can do the same for deeper generative models, such as the ones Sebastian was describing; this is something we are looking at at the moment, and I find it quite exciting to move towards deeper architectures. With that, I thank you for your attention, and if you're interested in the technical details, please check our paper on arXiv or write me a note.