ENS and CNRS, and we learn about universality and feature learning. As you can see, universality is the common theme for today's morning, but we'll move to training neural networks later on. Great. Take it away. Okay, thank you Pragya. Let me start by thanking all the organizers. It's a huge pleasure to be speaking at ICTP. ICTP had a massive impact on Brazilian physics. As a young undergraduate in Brazil, I remember that most of my professors did either their master's or PhD here, or worked here. So this place was like Mecca for us back in Brazil. It's a huge pleasure to be here, especially giving a talk, so thank you to the organizers for giving me this opportunity. Before starting, I'd like to acknowledge some of the collaborators whose work is gonna appear in this talk. There are many faces because I'm gonna mention many works. But there are some in red; the bulk of this talk is gonna be about the work I did with them: Florent Krzakala, Ludovic Stephan, who is here, Luca Pesce, and Yatin Dandi, all of them at EPFL. But as you can see, there are many faces here that you can recognize from the workshop. Okay, let's get to business. The outline for this talk is actually gonna follow exactly the program of this workshop. It's very rare that you give a talk where you can align perfectly with the title of the workshop, so I decided to use that as a guideline. Okay, so we're gonna start by talking about learning from structured data. I'm gonna give a little introduction to why structure is important in learning, and then we're gonna move to the other words of the title as we go on. So I think we can roughly divide theoretical machine learning into two mathematical cultures.
Maybe we are biased because most people here work in one of these cultures, but it's always good to remember that there is another one, and I'm gonna start with that one before moving to the other, because I'm on the other side. So the two cultures are basically these. On one side are people who try to get mathematical guarantees on the properties of learning in very generic settings, trying to make as few assumptions as possible on the problem, and in particular on the data distribution. I'm gonna call this the worst case, or agnostic case. It's very powerful because once you have these guarantees they are valid for any data, but let's say it's ambitious, right? Because you want to get something which is very broad. The other culture, which is what we are more used to in this workshop, is the typical case, where you make assumptions about the data. You get very sharp results, but of course these results are limited to the class of data that you assumed. So there is a trade-off; there is no free lunch here. Let's look a little bit at what the worst case looks like, and then we'll move to the other. That's gonna be perfect because it allows me to introduce some of the notation I'm gonna be using throughout this talk. So here I'm gonna be talking about supervised learning. We have some data. My data is always gonna live in R^D, and I'm always gonna assume that I have n samples. That's important because notation varies widely across this field. And I'm gonna assume this data is sampled from some probability distribution that, as you might imagine, we're gonna specify later. So in supervised learning, what is the goal? The goal is to find a function that links the inputs to the labels. Searching the space of all functions is intractable, especially if you want to put it in a computer. So what we do in practice is assume a subclass of functions. These are usually parametric functions.
It doesn't need to be parametric, but in most examples that we're gonna see it is. So, for example, neural networks or linear models. Then the problem of searching the space of functions reduces to the problem of searching the space of parameters, which is a bit more tractable. Then we want something that generalizes. The way we typically do this in machine learning, since we don't have access to the data distribution, is to minimize some loss function over the samples that we have in hand. That's called empirical risk minimization; I'm sure you all know about that. And the goal is that once we do the training, we want to evaluate how well this training on the training data generalizes, in the sense of how good the generalization error, or population risk, is. The population risk is not over the training data but is the expectation of the loss function over new samples. That's what we would like to minimize. Of course, we don't have access to the data distribution, so we cannot minimize it directly. We use the empirical risk as a proxy, hoping that by minimizing the empirical risk we get good generalization. That's the goal. And in particular, we are interested in studying this problem when the number of samples n, the number of parameters P (something that measures the size of the parameter space, so you can think of the total number of parameters of your parametric model), and the dimension D are all large. I'm gonna be precise later about what large means. Okay, so let's look at one example. A classic example of a result in statistical learning theory is the uniform bound based on the Rademacher complexity. This is a bound that tells you something about the generalization gap. You fix a hypothesis class, and then you want to know: if you trained to a certain error, how well do you generalize?
So here you have: for all functions in this function class, think of linear functions or two-layer neural networks, how good is my test error compared to my training error? This type of result bounds the gap by something called the Rademacher complexity, which is a number associated to the function class that you chose. Different function classes, different Rademacher complexities; and the Rademacher complexity is essentially the capacity of this function class to fit random labels. So here there's a connection with the previous talk of Federica, because here you're really trying to bound the capacity to generalize by the capacity to fit random labels. Of course this bound might be good or bad; it's a bound, it might be loose or tight. Typically the Rademacher complexity, and that's quite intuitive, depends on the number of parameters in your function class, right? The more parameters you have, the more expressive you are, and the better you might be able to fit random labels, and so on. The big problem is that since the function classes we consider now are huge, the Rademacher complexity might scale with P, so it might be very large and the bound might become loose. And indeed there has been a lot of discussion about whether this is a good complexity measure for this type of result, because experimentally you can just take a neural network and a dataset that you like, scramble the labels, and see whether you can fit them. You'll see that neural networks are capable of pretty much always interpolating the data even when the labels are random, which means that their Rademacher complexity is very large, which means that these bounds are very loose.
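As an aside, for a norm-bounded linear class the supremum inside the Rademacher complexity can be written in closed form, so the random-label picture is easy to play with numerically. This is a toy sketch; the dimensions, the norm bound, and the Gaussian design are my own illustrative choices, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, B = 200, 50, 1.0           # samples, dimension, weight-norm bound
X = rng.standard_normal((n, d))  # a fixed (here Gaussian) design

def empirical_rademacher(X, B, n_mc=500, rng=rng):
    """Monte Carlo estimate of the empirical Rademacher complexity of the
    linear class {x -> <w, x> : ||w|| <= B} on the fixed sample X.
    For this class the sup over w is explicit: (B/n) * ||sum_i sigma_i x_i||."""
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))  # random label vectors
    sups = B / n * np.linalg.norm(sigma @ X, axis=1)
    return sups.mean()

rad = empirical_rademacher(X, B)
# Scales roughly like B * sqrt(d/n) here: large d relative to n => loose bound.
```

The scaling with the size of the class is visible directly: growing d (more parameters) or B (a bigger class) inflates the estimate, which is exactly why these bounds loosen for huge function classes.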
Of course this is not the only bound there is; there's a whole branch of statistical learning theory trying to incorporate more information and get tighter bounds, but this is a classic result. In particular, I want you to notice that this result doesn't use anything about the data distribution, right? So again, what I was gonna say is that maybe it's too ambitious to try to have a result which is generic in the data distribution. Here is just a small example. Yeah, John? Yeah, here it's with respect to a realization of the data that comes from a certain distribution, but the Rademacher complexity depends on the data distribution indirectly, because there is this expectation there over... wait. I think it's conditional on X, and it depends only on expectations over Y given X, if I remember well, but we can verify. I think it's conditional on X in the definition. It depends on the input-output rule but not on the distribution of the X's. Yeah, so you have a realization of X, you fix the realization of X, and then you define the Rademacher complexity conditioned on X and take an expectation over Y, Y being the random label. So random according to this teacher rule? There's no teacher. So if you talk about a rule, what do you... Here, in the definition of the Rademacher complexity, you take the original labels, you scramble them, you take them to be drawn IID from {-1, +1}, and then you take this expectation with respect to Y, if I remember well. Okay, so the Y's are IID. IID binary on the hypercube. Okay, thanks. Yeah, and that's because here you're exactly measuring the capacity of the model to fit random labels, conditioned on the inputs being whatever dataset you took. Okay, thanks. If I remember well, but we can double-check that.
As you might imagine, that's not my field, but I'm giving you a flavor because I think it's important, when you talk about theory of machine learning, to have the two sides. Okay, here I'm giving you an example of a very simple task where I have a teacher. I'm drawing a teacher from a perceptron with some weights that I fixed, and some Gaussian data. And I'm comparing the performance I get from just doing logistic regression on this data with the bound you get from computing the Rademacher complexity of this hypothesis class, the linear hypothesis class. Here everything is very well understood. And you can see that for some data, Gaussian data for example, the logistic performance is much closer to the Bayes-optimal performance, which takes into account the structure of the data, than to the bound. So there is a huge gap there. Of course this is just one example; different classes of data might shift this curve up, closer to the bound, or down, closer to the Bayes-optimal performance. But it exemplifies that there is a difference between typical and worst case, and it's important to take this into account if you want theories which are very sharp. Moreover, we also know that in general the problem of training a neural network can be NP-hard in the worst case. Here is an example where you just want to train a two-layer neural network with a few hidden units, and you want to find the weights such that you perfectly interpolate the data. You can find a specific configuration of the data such that finding these weights is NP-hard; you can map it to an NP-hard problem. So in general the problem can also be algorithmically intractable, even if you have some guarantees on the generalization. Okay. Of course, there is a lot of work trying to extend statistical learning theory to take the data structure into account.
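To make the perceptron experiment concrete, here is a minimal sketch with made-up sizes: Gaussian inputs, a fixed teacher perceptron, and plain gradient descent on the (convex) logistic loss. The point of the figure in the talk is the gap between a test error like this one and the worst-case bound:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_train, n_test = 50, 2000, 5000

# Fixed teacher perceptron on the unit sphere.
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)

def sample(n):
    X = rng.standard_normal((n, d))   # Gaussian inputs
    y = np.sign(X @ w_star)           # noiseless perceptron labels
    return X, y

Xtr, ytr = sample(n_train)
Xte, yte = sample(n_test)

# Plain gradient descent on the logistic loss (empirical risk minimization).
w = np.zeros(d)
for _ in range(500):
    margins = np.clip(ytr * (Xtr @ w), -30, 30)
    grad = -(ytr / (1 + np.exp(margins))) @ Xtr / n_train
    w -= 1.0 * grad

test_error = np.mean(np.sign(Xte @ w) != yte)
```

With n well above d, the test error of this simple ERM is small, far below what a distribution-free Rademacher-type bound would suggest for the same sample size.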
For taking structure into account, I have in mind, you know, PAC-Bayes, and that's a very active research line. But there is an alternative, which is what we use and what we like: actually modeling the structure of the data and then trying to get very sharp guarantees. The problem with that, again, is that if you are too simplistic, then you might get results that are too specific to something which is not relevant, right? And that's mostly what we hear from referees at NeurIPS: you know, oh yeah, Gaussian data, whatever, right? But we like Gaussian data a lot, because with Gaussian data we can basically do theory; you can basically compute things. And this is an example. This is a result from a paper; I'm not expecting you to understand it, though some of you will, because there are a lot of people here who are into replicas. This is a theorem that we proved for a specific Gaussian covariate model that characterizes exactly the asymptotics of the error in the high-dimensional limit. We can do these kinds of things with Gaussian data. Can we do them with, let's say, more structured data? That's maybe the question: what people in this field do is try to push the boundaries towards more and more structure. Of course, these equations are horrible and hard to parse; I don't want to go through them. But, coming from a physics culture, we like to plot things, and you have seen this a lot in the previous talks. So you can put this formula in the computer, get these kinds of curves, and predict interesting properties. For example: what is the decay rate of the generalization gap as a function of n? Does it decay as one over n? One over square root of n? Is it closer to the Bayes-optimal rate? Is it closer to the Rademacher bound? And so on. This is the typical kind of result we can get with Gaussian data, and that's why we like Gaussian data.
We can get the rates, but we can also get the full curve. But of course, if you are referee number two, that's what you're gonna say: isn't this too simple? And this brings me to the first part of my talk, which is just gonna be an overview, because Federica already did a great job justifying why Gaussian data might be realistic. So universality: that's what comes to our rescue. In the past few years there has been a lot of progress in trying to justify this Gaussian assumption. Progress on the experimental side, with a lot of work showing that certain datasets perform very close to the Gaussian prediction, but also progress on the mathematical side. Many universality theorems have been proven showing that, for certain model classes, Gaussian might be good enough. These are central-limit-theorem kinds of results. Here I want to stress one thing: we're not saying that the data is Gaussian. What we're saying is that for a given task, for a given model that you want to fit on the data, maybe this model depends only on low-dimensional statistics, for example the projection of the data onto a subspace. And things being Gaussian in low dimensions is much easier than being Gaussian in high dimensions, right? Because there are central-limit-theorem kinds of results. Many of these results have been proven over the years, and a lot of them draw on seminal works from random matrix theory and from the physics literature. There is the work that Federica mentioned, by Manfred, on SVMs; universality for SVMs is one of the precursors of this whole line of results. But also a lot of results from signal processing, so results by Donoho and Tanner, and so on. And as I was hinting before, the motivation here is not to claim that the data is Gaussian.
Rather, maybe the statistics of what we want to know, let's say the error, might be equivalent to the statistics of a Gaussian predictor. These results are known as Gaussian universality on our side of the field. They were noticed in the work of Sebastian, Marc, Florent, and Lenka on the hidden manifold model, more or less concurrently with some work of Song Mei and Andrea Montanari. But as I said, this draws on a lot of ideas from the past as well. And a lot of work has been done in this direction in, let's say, the past five years; here I'm just name-dropping a lot of it. In particular, many interesting things can be computed. For example, one line of work derives the kernel rates. I like that a lot, because the features of a kernel are definitely not Gaussian; even if the data is Gaussian, the features of a kernel are not. However, if you assume the features are Gaussian and compute things using replicas, you get exactly the rates that can be derived from classical statistical theory, plus other rates that it had missed. And nowadays these are being proven by mathematicians without the Gaussian assumption. So a lot can be said using universality, and there has been a lot of progress in the past years from many people in this room and from other groups. Here I'm gonna focus on one line of work, related to the analysis of the random features model, because what I'm gonna say next builds on it. So let's take a look at some of the known results in this case. The model I'm gonna be looking at is a teacher-student model. I'm gonna assume I have a target function generated by a two-layer neural network with first-layer weights W star and second-layer weights A star. And I'm gonna assume that the inputs themselves are Gaussian.
But note that the distribution of the features is not Gaussian, because I'm passing the inputs through a matrix and then a nonlinearity. Then, given this synthetic dataset, I want to learn it with another two-layer neural network, with weights W and A, and in particular they don't need to have the same number of hidden units. On the model side, I'm gonna assume that the first-layer weights are fixed. This is what is known as the random features model: I'm just learning the second layer, not the first one. This model was introduced in the context of kernel methods, to approximate kernels with a finite number of features. So it's a toy model. You can try to justify it by saying that in the lazy regime your network might behave as if the weights were fixed at initialization, so effectively it's doing that. There are a lot of shortcomings to that, as we're gonna see, but it's one way of thinking about it. You can also think of it as a model that approximates kernel methods. Then we're gonna look at the training of the second layer, which I call A here, and which has P parameters, P being the number of hidden units of the two-layer neural network. You minimize the empirical risk with respect to some objective and some regularization. The convenient thing about fixing the first layer is that this problem is effectively convex, right? If the loss function is convex and the regularization is convex, this is either convex or strongly convex. So it's much easier than analyzing the joint training of both layers. That's good, because under these assumptions we can do theory; we are happy. We can characterize the performance of this model. This is work that we did together with Federica a couple of years ago, generalizing some results of Andrea Montanari and Song Mei, who looked at this problem for the square loss.
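Before the equations, the setup itself fits in a few lines: a fixed two-layer teacher, a student whose first layer is frozen, and a trained second layer. With the square loss, training the second layer is just ridge regression with a closed-form solution. All sizes and the tanh activations here are illustrative choices of mine, not the ones from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
d, p, n, r = 100, 300, 1000, 2       # input dim, random features, samples, teacher width
reg = 1e-2                           # ridge regularization strength

# Teacher: two-layer network with fixed weights W_star (first layer), a_star (second).
W_star = rng.standard_normal((r, d)) / np.sqrt(d)
a_star = rng.standard_normal(r)
target = lambda X: np.tanh(X @ W_star.T) @ a_star

# Student: random features model. First layer F is frozen; only a is trained.
F = rng.standard_normal((p, d)) / np.sqrt(d)
phi = lambda X: np.tanh(X @ F.T)     # the (non-Gaussian!) features

Xtr = rng.standard_normal((n, d)); ytr = target(Xtr)
Xte = rng.standard_normal((n, d)); yte = target(Xte)

# Convex problem: ridge regression on the features, solved in closed form.
Z = phi(Xtr)
a = np.linalg.solve(Z.T @ Z + reg * np.eye(p), Z.T @ ytr)

test_mse = np.mean((phi(Xte) @ a - yte) ** 2)
null_mse = np.mean(yte ** 2)         # error of the trivial zero predictor
```

The sharp asymptotic theory describes exactly curves like `test_mse` as functions of the ratios n/d and p/d; the sketch only shows the finite-size simulation side of that comparison.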
Again, I don't want you to parse these equations, because they are horrible. If you like replicas, you'll understand them; if you like Gordon's minimax approach, you'll also understand them. But they are not easy to parse if you are not used to them, so it's better to look at some plots. That's the typical kind of plot we get for the random features model. And this plot is very nice, because it reproduces a behavior that has been observed in statistics, the so-called double descent phenomenon. It's the idea that as you increase the number of parameters in your network, you first degrade your performance as you gain more and more expressivity. So we have the bias-variance tradeoff, where you expect that if you fit something with, I don't know, a polynomial of very high degree, you're gonna fit the noise. But then eventually this peak goes down, and the minimum of the generalization error appears as the number of parameters goes to infinity. This is something that was experimentally observed with neural networks back in, I don't know, 2018 or maybe '17. People thought this was something specific to neural networks: why do neural networks have this weird property? Then later the community realized that linear models can have the same property, and here is one example of a linear model that has it, putting into question some of the dogmas of classical statistics, let's say. So we learned a lot of things studying this model; my goal here is not to go into the details. But one thing I want you to notice is that even though we are happy because we can draw these curves, the Gaussian universality of the features in this model, which is the basic tool we need to solve it no matter what technique you use, makes it equivalent to a linear model with a given covariance matrix. And that's nice, because we can do theory. So Gauss is happy.
However, we have limited expressivity, because a linear model is a linear model: you cannot really fit very complex classes of functions. So it's sad. We have this tool that allows us to do theory, but on the other hand this tool is also doomed, because it limits the class of learning problems we can approach. And this is where the second part, the bulk of this talk, comes in: going beyond universality. What can we say when we don't have Gaussian universality? Gaussian universality is great, we can do many things, but it also limits us. So now I'm gonna be talking about a very recent work; I think we put it on arXiv maybe one month ago. It's the first time I'm presenting it, so I'm very happy to hear feedback and discuss more. Of course we're not gonna get into the techniques, but I'm happy to discuss at the blackboard later. This work is essentially inspired by a work from the group of Jimmy Ba in Toronto, which observed that indeed, if you have a random features model, you are limited by the performance of the best linear predictor; but if you now train the first-layer weights for one step, maybe you can get below that. So we are going beyond the random features model by doing training; not full training, but one step of training. And the performance of the model you get depends on the size of the step. If you do a step of size order one with respect to the input dimension, then you go a little bit down, but you're still bounded by this linear predictor. To actually do better than linear, you need to take a big step, a step which scales with the dimension of the problem. And notice what is interesting about these curves: the curves above the black line are curves for which we can do theory, because Gaussian universality works. So we can actually characterize the solid lines passing through those points.
Now, for the ones below this curve, we cannot really say much. We can just plot them and see that they do better than a linear model, but we cannot characterize them. So we got into the question of trying to understand what you can learn after one step of gradient descent on the first-layer weights. This brings us to one of the main theoretical results of the paper. For that, we need some notation. We're gonna define a correlation matrix here. If you are from physics, these are just the normalized overlaps: on top you have something like M, the teacher-student overlap, and on the bottom you have just the normalization by the norms. Otherwise, you can think of it as the cosine similarity between the weights of the teacher and the weights of the student. And we're gonna say that we learned something if, in the asymptotic limit, this quantity is of order one; and that we didn't learn anything if it vanishes, that is, if the normalized correlation between teacher and student vanishes after the gradient step. So whenever I say learning or not learning, it is with respect to this measure. Later we're gonna see something about generalization, but first we're just gonna look at whether, after one step, we can actually correlate our weights with the weights of the target. We're also gonna need a technical ingredient: everything is gonna depend on the Hermite coefficients of the target function. If you come more from maths, this is very natural: since the data is Gaussian and the Hermite polynomials form a complete basis for the Gaussian measure, and the hardness of the problem is due to the nonlinearity, it's quite natural to expand the nonlinearity in the basis of Hermite polynomials. In particular, the difficulty of the problem is gonna depend on the first nonzero Hermite coefficient of the target function.
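The Hermite machinery can be made concrete with a few lines of Gauss-Hermite quadrature. A sketch (the helper names are mine) computing E[f(z) He_k(z)] under the standard Gaussian, and reading off the index of the first nonzero coefficient:

```python
import numpy as np
from numpy.polynomial import hermite_e as He  # probabilists' Hermite He_k

def hermite_coeff(f, k, n_quad=80):
    """E[f(z) He_k(z)] for z ~ N(0, 1), via Gauss-Hermite quadrature.
    hermegauss gives nodes/weights for the weight exp(-z^2/2); dividing by
    sqrt(2*pi) turns the integral into a standard-Gaussian expectation."""
    z, w = He.hermegauss(n_quad)
    He_k = He.hermeval(z, [0] * k + [1])      # He_k evaluated at the nodes
    return np.sum(w * f(z) * He_k) / np.sqrt(2 * np.pi)

def leap_index(f, k_max=6, tol=1e-6):
    """Index of the first nonzero Hermite coefficient (k >= 1), if any."""
    for k in range(1, k_max + 1):
        if abs(hermite_coeff(f, k)) > tol:
            return k
    return None

relu = lambda z: np.maximum(z, 0.0)
# E[relu(z) * z] = 1/2, so ReLU has a nonzero first coefficient.
# z -> He_2(z) = z^2 - 1 is orthogonal to z, so its first nonzero index is 2.
```

For instance, `leap_index(relu)` and `leap_index(np.tanh)` both come out as 1, while `leap_index(lambda z: z**2 - 1)` comes out as 2, matching the intuition that phase-retrieval-like targets are harder.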
So we're gonna define the leap index, this L, to be the index of the first nonzero Hermite coefficient, okay? That's gonna be very important for what follows. Just remember: the leap index measures the hardness of learning and is given by the position of the first nonzero Hermite coefficient of the target. The result goes like this; this is the theorem we have in the paper. Let L be the leap index of the problem, and assume that the activation function we use to fit is expressive enough to learn this function. We do need to assume that we are able to learn; the student cannot be something like a linear network, otherwise things wouldn't work. If the leap index is one, then after one step with a linear number of samples we can learn something in a subspace which is an average over the teacher neurons. So we can learn a function beyond random features, but only functions that depend on this single direction, which is a kind of average direction of the target, with the average weighted by the Hermite coefficients. Otherwise, if the leap index is larger than one, we are gonna be able to learn a subspace which is the span of the teacher weights; however, we are gonna need a number of samples proportional to the dimension to the power of the leap index. And that is all you can learn. That's essentially what the theorem is telling you: depending on the leap index, you can learn a subspace, but to learn it you need a number of samples which scales polynomially in the dimension, with the power given by the leap index. Again, this is a technical result, and maybe it's easier to see as a figure. Here you can think about our target and ask how many samples we need to learn the directions of the teacher. We have a single-index regime where you can just learn a perceptron, the perceptron given by the average over the teacher neurons, if you have just a linear number of samples.
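Theorem aside, the mechanism behind the leap index condition can be seen in a quick simulation. The first-layer gradient after one step contains a rank-one spike proportional to the empirical average of y_i x_i, and by Stein's lemma E[y x] is aligned with the teacher direction only when the first Hermite coefficient is nonzero. A hypothetical single-index sketch of mine (not the full gradient computation from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 400, 5 * 400                      # n proportional to d

w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)         # single-index teacher direction
X = rng.standard_normal((n, d))
z = X @ w_star                           # teacher pre-activation, ~ N(0, 1)

def spike_cosine(y):
    """Cosine similarity between the empirical spike (1/n) sum_i y_i x_i
    and w_star. By Stein's lemma E[y x] is proportional to w_star, with a
    constant given by the target's first Hermite coefficient."""
    g = X.T @ y / len(y)
    return abs(g @ w_star) / np.linalg.norm(g)

cos_leap1 = spike_cosine(np.tanh(z))     # leap index 1: order-one overlap
cos_leap2 = spike_cosine(z ** 2 - 1)     # leap index 2: overlap decays with d
```

With n of order d, the leap-index-one target produces an order-one overlap with the teacher, while the leap-index-two target leaves the spike uncorrelated; recovering that direction would require the higher sample scaling the theorem prescribes.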
And to be able to learn more directions, to learn a larger subspace, we need a number of samples which scales as D to the power of the leap index; you can only learn that in this sample regime. So that's one result. It generalizes the results from the paper of the group of Jimmy Ba in Toronto, but also a paper by the group of Jason Lee in Princeton that showed parts of this result for the case D squared, so leap index equal to two. Here it's generalized to any leap index. And we can see this in a concrete example. Here we take a target function where, for concreteness, we choose an activation function given by the Hermite polynomials, and we're trying to learn two hidden units. You can see here in red the random features model, and you see that the random features model is limited by the linear predictor. If we now take something where you can learn this first direction, you can go down in error; and to go further down in error, you need more samples. So this is essentially exemplifying the earlier picture in a practical example. Another way of seeing this, which we like a lot, is to plot these cosine similarity plots. Here the big circle has radius one, cosine similarity one, meaning that we learn; each point is one hidden unit. If the points are on the boundary of the circle, it's good, because we are learning something, correlating with correlation of order one. And there is a small circle in the middle, of radius one over square root of D. Remember we're taking D to be large here. If you stay inside this ball, it means essentially that as D grows you stay at initialization: you're not gonna be able to correlate as D gets larger. So here you can see, taking the target function to be, you know, the first Hermite polynomial, then the leap index is one.
Then you have the second Hermite polynomial, leap index two; the third, leap index three; and the fourth. And the number of samples you need to exit this small circle of radius one over square root of D is gonna depend on the leap index of the problem. If the leap index is one, you can get out of it and learn this direction as soon as you have N of order D, yeah. Most natural activations that you can think of are gonna have leap index one; they're gonna have a nonzero first Hermite coefficient. That's the case for ReLU, for example, and the hyperbolic tangent, like most of the activations you can think of. Phase retrieval, no: phase retrieval is gonna have leap index two, and that's why we like it, right? Like Marco was talking about, we like phase retrieval because it's a prototypical example of a hard function to learn. So I would say, you know, polynomials of degree L can have leap index L, and usual activations have leap index one. But they might also have gaps, and this is gonna be important later; for now we're just talking about the first nonzero Hermite coefficient. So most activations you can think of have leap index one, meaning that with N proportional to D you're gonna learn a direction which is the average over the neurons of the teacher, weighted by the Hermite coefficients. So here we are taking an artificial example just to be able to illustrate the theorem. Okay, so that's everything you can do with one step. If there are no other questions, I'll move to multiple steps; this is gonna build a lot on the previous theorem, so if you have any question, interrupt me. Yeah, Marco? Yeah? X is Gaussian, yeah. And R is a constant, yeah. And you shoot for something like weak recovery, so you don't want to recover exactly; you're happy with any correlation? Yeah, exactly. I don't want to recover exactly.
I just want to correlate with the teacher, with order-one correlation that doesn't decay with D. Okay, thanks. Yeah, that's a good point. R being small is a good point, because we know that two-layer networks can basically approximate everything if they have, you know, smooth activations, for example. So there is no free lunch, right? If you have a very complex function, a function that is very discontinuous, non-smooth, et cetera, then you need a lot of hidden units to express it. If your teacher is given by that, with an R which is very, very large, you cannot expect to learn it perfectly in a simple way, with one gradient step and so on. So here we are considering functions that depend on a subspace of dimension R, with R fixed. And of course, here I take a low-dimensional example, but you can take R equal to 10 or 12 and play the same game. Okay, so can we say something about several giant steps? Remember that these steps are not order-one steps, right? They are large steps. You can think of them as several steps with an order-one learning rate, or one step with a large learning rate; and here we're gonna take several steps with a large learning rate. Here, things are gonna depend a lot on the structure of the target that you're trying to learn. Essentially, if your target function is just, you know, an average or a sum of the same activation function, the story is gonna be pretty much the same. For this type of target, not much is gonna change with several steps: you still need a number of samples proportional to the dimension to the power of the leap index to be able to learn the different directions.
If your target has a certain structure, and that's important, then with several steps we are able to learn more than with one step, but only for a particular class of target functions, and now I'm gonna discuss this class. Here I'm giving one example that is not in this class, for which things don't change as you take several steps, and actually in the paper we prove that if you take a teacher which is just an average over the same activation, whatever you want, then basically the first theorem that I described to you still holds for many steps, I mean ten steps, fifteen steps, and so on. Now, however, if you have a target function which is heterogeneous, so one that depends on some linear combination of your activation functions but in a heterogeneous way, so not all the A K equal to one, then depending on this combination, you might be able to learn more with a few steps even if you only have order D samples. So yeah, here I stress that that's important: essentially, even if your leap index is larger than one, you might be able to learn different hidden units of your target by taking a few steps, depending on what this A is, and a new direction might be learned as you take more and more steps, yeah? Yeah, so here essentially what we are thinking of is: the first example is just a perceptron, right? It's just taking your input, taking a linear combination, and passing it through a non-linearity. The second one instead is something more, a true two-layer neural network, where we have R equal to four in the second one, and then R equal to five, R equal to six, if I'm not mistaken, yeah. So the idea here is: how many samples do you need to learn each of these directions with your student?
And the first theorem tells you that if you just have N of order D samples, you can only learn one direction, so effectively you're learning a linear model. Then if you have D squared, you might be able to learn the other directions, and this is gonna depend on the leap index. So the first axis is just one step, one big step. Now we have a second axis, which is taking several steps, and essentially what I'm telling you is that, depending on the structure of the target (and I'm gonna be precise in the next slide about what type of targets can be learned in this way), you might be able to learn more than one direction even with linearly many samples in the dimension, by taking several steps. So we can go beyond theorem one. We can learn more directions, not only one, because theorem one tells you that with N proportional to D you can only learn one direction: the direction which is the average of all directions. Now, if you take several steps, you might be able to learn more directions, but not for every class of targets, and I'm gonna be precise now about which class you can. Pragya? So heterogeneity, by that do you mean just learning more directions, or is it a property of the target class, which you're gonna get to? By heterogeneity I mean just that the target function you're trying to learn is a linear combination of the hidden units, and this linear combination is not just one over the width, or all ones. All of these A star K need to be different for you to be able to learn with N of order D and more steps. And I'm gonna be precise about what kind of functions you are able to learn with more steps in this regime. There is also potentially heterogeneity contributed by the sigma star K, right? Those could also be different for each K; that would be another source.
But what we can prove is that if you just have a sigma, so you don't have A K, or rather you have A K equal to one for all K, then this function is not in the class that you can learn with several steps and just N of order D. We can also show that. So this is very specific, right? We're talking about targets, and these are very specific targets: a target which is basically an average over the hidden units; the activations can be different, but it is an average over these activation functions. Okay, so this idea was actually first introduced by a paper from the group of Emmanuel Abbe; they call these staircase functions. Staircase functions are the functions which, after several steps, you can learn even in the N proportional to D regime. So what are these function classes? These are functions for which, after you learn a linear direction in the target, conditioned on that direction, you are connected to a different direction. In which sense? Let's take an example. Let's take Y equal to lambda one plus lambda two plus lambda one squared minus lambda two squared. After one step, by theorem one (and here we are always in the N proportional to D regime, the proportional regime in the number of samples), what you can learn is one direction, as theorem one tells you, and this direction is basically the sum of the two hidden units of the teacher, W one star and W two star. After you learn this direction, conditioned on it, this function can be expressed as something that couples to the second direction. Actually, here I realize maybe I should show this. So we can rewrite this function as something plus something times something.
And then, in this change of basis, you have lambda one prime, which is lambda one plus lambda two, and you have a second direction, lambda two prime, which is lambda one minus lambda two, and you can see that the second direction is connected to the first because they are multiplied together. So this means that, conditioned on lambda one prime, the target is linear in lambda two prime. This is a staircase function: a function such that, conditioned on the first direction you learn, which is a linear combination of the directions of the target, you are connected to the other direction that is left. So here is an example in which, in two steps, you can learn these two directions. Now, what about the class of functions that you cannot learn? In the second example, the target now has a plus there instead of a minus. So if you write this in terms of the first direction plus whatever is orthogonal to the first direction, you see that the second direction is not coupled to the first. In this case, in the first step you learn the first direction, but in the second step, still in the N proportional to D regime, you don't learn the second direction. Now you need to go to D squared samples to be able to learn it. By taking more steps, you'll never get there, because the gradient is not correlated with the second direction when you condition on the first direction that you learned. So this is an example where you cannot learn with multiple steps in this N proportional to D regime. And you can generalize this: there is a description of the class of functions which you can learn after several steps, and these are gonna be the staircase functions. Yeah? Here you are always thinking of the online learning regime, where every step uses fresh data; would things change if you see the data multiple times? Yeah, that's a very good point.
So here we assume that at every step we have a fresh sample of data. So yes, we are online: we have a batch, but this batch is fresh, it's not correlated with the previous one. Proving this type of result when the batch is the same batch is much harder, so we don't have anything to say about that. However, if you just simulate, you see that the same holds. So yeah, it's kind of an open problem to show the same kind of results when you have, for example, just one batch of data that you reuse at every step. But in this case, let's say you have D squared data. Mm-hmm. So what does it mean? If you do two steps, you have seen two D squared data. Okay, but let's say you fix the number of data, say two D squared; if you then see the same data multiple times, would the phenomenology still be the same? As far as we simulated, yes. Okay. But we didn't prove that. Okay, okay. Maybe it depends on, for example, whether you do mini-batches, maybe it depends on the resampling and so on. But I guess the first step would be to just use the same batch over and over again. That's a good point; it's kind of crucial for the theorems. So, okay, if you are interested in the theorem, there is a way of formalizing what I just explained to you. Oh, sorry, yeah. Sorry, I had another question. Can you go back to the previous slide? Yeah. Can you explain the linearity condition? Does it have to be linear? Why do you require that it's linear in lambda two prime, and not just that it depends on lambda two prime? Well, the second case, for example, is not linear in lambda two prime. Right, okay. And these are exactly the cases where you are not connected; so the result is that whenever you are linearly connected to the previously learned direction, then you have this staircase structure.
If you don't, then you will need more data to be able to see this direction. That's essentially what we were saying. Okay, so it's a form of suppression, a kind of higher-order suppression. Yeah, exactly. And that's actually how the argument goes in the proof. Okay, yeah. Thanks. Okay, I was gonna say that... okay, now I have to rebuild the whole slide. What I was gonna say is that, for the general form I explained to you before, we have a mathematical statement of what that means. We haven't fully proved it, so as it stands it is a conjecture. I would be very happy to discuss it with you if you are interested in trying to prove it. But we can prove a special case of what I just said, which is up to two steps. And again, we observe that this holds for multiple steps; it is just more challenging to prove the result for multiple steps. So we can see how the gradient correlates with the target after two steps. Multiple steps is just technically more demanding: you need to write more terms in the equations. So in its general form, it is a conjecture. Let me give you an example, again with these plots showing what random features can learn versus this class of functions. Here, always with N proportional to D: on the left-hand side, a function that you cannot learn with multiple steps, so it's not a staircase; on the right-hand side, a function which is a staircase. In these correlation plots, again the circle is a circle of radius one, so if a point is on it, it means that you have correlation one with that direction. Here we have a function with two directions, and these two directions are coupled via the staircase structure. And what you can see is that after two steps you're able to perfectly go to these directions.
While in the other case, you get stuck in the first direction that you learn, because to see the other direction you would need more samples. So that's an illustration. Okay, I see I have little time, so let me just flash through what we can say about generalization. So far we only talked about correlation: about how, by doing one step, by updating your first-layer weights, you correlate or not with the target. But I didn't say anything about whether you generalize if you correlate, right? Everything I have said so far is about this correlation matrix and whether you can have learning or no learning in the high-dimensional regime. What about generalization? So now we are gonna minimize over the second layer, which we have ignored so far because we were just looking at the correlation of the first layer. Let's say you did the first step, or two or three steps, so you have updated your first-layer weights, and now you learn the second layer on top of these fixed features. So you do the first few steps and then you learn the second layer on top of that, not jointly; I'm not learning both things together. Can I say something about the quality of the learning that I get? The answer is that we have some bounds on the generalization performance, and the bounds look just like the bounds I showed you before about what random features can and cannot learn. So I told you that random features have a limited expressivity depending on the regime of the number of samples you have. Here we can say the same thing: essentially, the generalization error that you're gonna get if you do empirical risk minimization is gonna be bounded by whatever you didn't learn.
That's essentially the moral of this conjecture; in general it is also a conjecture, but again we proved a special case of it. What you have to have in mind is that, morally, assuming you have learned a subspace, we expect that in the directions we have learned we are able to have good generalization, so we are able to drive down our error in these directions. However, in the orthogonal directions, the ones we did not learn, we can only do as well as the random features model would do. So it's quite intuitive, I would say, what this implies for generalization. Essentially, if you learn the second layer, you are only gonna generalize well along the directions that you have learned, and all the rest is gonna just act as noise. That's what we can say about generalization. Of course, what we would like to have is learning curves, just like the ones I showed for random features. We don't have learning curves so far; that's what we would like, and we want an equality, not a bound, in this conjecture. So we are working towards that. It's challenging, but we believe it's doable. This would give us access to learning beyond the linear regime, including feature learning in this statistical-physics point of view of learning. Okay, so that's what I wanted to tell you today. Let me summarize briefly, because I'm just on time, I think. Gaussian universality is powerful but limiting: for example, you cannot take into account the effects of feature learning. In one giant step, we might be able to learn a subspace, and the size of this subspace depends on the quantity of data we have in hand. If you take several giant steps, you might be able to do better, but only for staircase target functions.
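To make the staircase picture from the summary concrete, here is a small numerical caricature. It is not the actual gradient-descent analysis from the talk: the second giant step is replaced by a hand-built conditioned correlation (reweighting by the coordinate learned in step one), and all names and sizes are illustrative. It contrasts a staircase target, linear in the second direction once you condition on the first, with a non-staircase one where the second direction only enters quadratically.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 150
n = 10 * d                                   # fresh batch of size n per step
u = np.zeros(d); u[0] = 1.0                  # first teacher direction  (lambda_1')
v = np.zeros(d); v[1] = 1.0                  # second teacher direction (lambda_2')

def second_direction_overlap(target):
    # Step 1 (fresh batch): plain correlation finds the linearly exposed direction.
    X = rng.standard_normal((n, d))
    y = target(X @ u, X @ v)
    g1 = (y @ X) / n
    w1 = g1 / np.linalg.norm(g1)             # learned first direction, close to u
    # Step 2 (fresh batch): caricature of the conditioned gradient, reweighting
    # by the learned coordinate <w1, x> before correlating again.
    X = rng.standard_normal((n, d))
    y = target(X @ u, X @ v)
    g2 = ((y * (X @ w1)) @ X) / n
    g2 -= (g2 @ w1) * w1                     # look only orthogonally to w1
    return abs(g2 @ v) / np.linalg.norm(g2)

# Staircase: y = l1 + l1*l2; conditioned on l1, it is linear in l2.
ov_stair = second_direction_overlap(lambda l1, l2: l1 + l1 * l2)
# Non-staircase: y = l1 + l2**2; l2 only appears quadratically.
ov_flat = second_direction_overlap(lambda l1, l2: l1 + l2**2)
```

With n of order d, the conditioned second step picks up the second direction with order-one overlap for the staircase target, while for the non-staircase target it stays at the noise level, which is the "you need d squared samples" obstruction from the talk.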
And finally, I have many open problems that I'm happy to discuss, about closing the gaps in what we can prove, and also about exact asymptotics, which is really what we would like to have. So with that, I'd like to thank you for your attention. I'm happy to take questions. Questions? Very nice talk, thanks a lot. Thanks. Could you comment a little bit on the connection, at the level of the proof strategy, between what you do with staircase functions and the approach by Abbe et al.? I know the first paper, for example, is for functions on the Boolean hypercube, but they do have some results for Gaussian inputs, so I was wondering if you had any thoughts on that. Yeah, that's a good point. Essentially, from a statistical physics perspective, the way they look at it is basically by reducing the mean-field limit to a low-dimensional set of equations, which is essentially what we would call the Saad-Solla framework for online learning. The fact that you have functions on the hypercube, with the dimension large and the hidden layer size large, means that you have an effective dynamics on a few order parameters, the overlaps, and then you can prove things from there. Here we have exactly the same strategy. It's just that instead of looking at several steps of online learning with a single sample each, we look at the overlaps: you write equations for these overlaps and see how they change after one step, but with a large batch. So, if you want, the sketch of the proof is that you start with gradient descent, you write equations for the overlaps, you prove some concentration results after one step, and then you see whether the overlaps change or not as a function of the dimension. So is this due to the fact that the learning rate is much bigger in your case? Yeah, that's crucial, because this learning rate is actually the critical learning rate.
I forgot to mention that, so thanks for reminding me. If you go above this learning rate, your dynamics is gonna diverge: after one step, it blows up with the dimension. And if you go below it, you trivially don't learn anything. That's essentially what the paper of Ba et al. has also shown. This learning rate is also what Greg Yang calls, I think, the muP parameterization: it's the critical learning rate such that you learn features, and above it your dynamics blows up. Thanks. Thanks. Any other questions? So I had a quick one. I'm trying to build some intuition for the leap index. Is there a really intuitive way to see why it shows up, and is it connected to other measures of complexity of functions? That's actually, I think, a very deep question, because the problem here is that we have this nonlinearity, right? And the only way you get away with this nonlinearity is to expand it, because you cannot do theory for a generic nonlinearity directly: you don't know, in general, the distribution of whatever goes inside it. So one way of dealing with it is to expand the nonlinearity in some basis that spans the space with respect to the measure of the inputs. Here the data is Gaussian, so Hermite is very natural. But you can imagine that if your data is not Gaussian, if your data comes from some distribution rho, and you have a complete basis of orthogonal polynomials with respect to rho, you might be able to have a similar theory. And actually, the staircase paper of Abbe et al. does exactly that for the hypercube, because on the hypercube you have the Fourier basis. And they do the sphere, because you have spherical harmonics; and the Gaussian, because you have Hermite.
Now, in general, and this is a question that bothers me: of course, finding orthogonal polynomials for a generic distribution is very hard. But I ask myself, if we had such a harmonic expansion, would these results just generalize to that? Even Gaussian equivalence: would you have some equivalence for this class of measures with respect to that basis? Personally I tend to believe yes, but the problem is that I don't know many bases for many distributions. So let's leave other questions for later, when you can catch Bruno, and thanks so much for the really fantastic talk. Thank you. Thank you.
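As a footnote to that last exchange: on the hypercube the orthogonal basis is the Walsh-Fourier one, and the staircase property is visible directly in the Fourier coefficients. Below is a minimal sketch with a hand-picked staircase function in the spirit of the Abbe et al. examples (the specific function and names are illustrative, not taken from the talk): each monomial extends the previous one by a single new coordinate.

```python
import itertools

# Walsh-Fourier coefficient f_hat(S) = E[f(x) * prod_{i in S} x_i],
# averaging over all x in {-1, +1}^k (the uniform measure on the hypercube).
def fourier_coeff(f, S, k):
    total = 0.0
    for x in itertools.product((-1, 1), repeat=k):
        chi = 1
        for i in S:
            chi *= x[i]                 # the character chi_S(x)
        total += f(x) * chi
    return total / 2**k

# A staircase function on the hypercube: x1, then x1*x2, then x1*x2*x3,
# mirroring the Gaussian staircase story above.
f = lambda x: x[0] + x[0] * x[1] + x[0] * x[1] * x[2]

c_1 = fourier_coeff(f, (0,), 3)         # weight on x1
c_12 = fourier_coeff(f, (0, 1), 3)      # weight on x1*x2
c_2 = fourier_coeff(f, (1,), 3)         # weight on x2 alone
```

The nonzero coefficients sit on a nested chain of subsets ({1}, {1,2}, {1,2,3}), which is exactly the hypercube analogue of "conditioned on the direction you just learned, you are linearly connected to a new one"; a coordinate like x2 alone carries no weight.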