Thank you, Matteo. Let me share my screen. Okay. Yeah, so thank you very much for the introduction, and thank you, Matteo, Florin and of course Jean for inviting me to this really interesting workshop and for putting it together under such adversarial circumstances. What I'd like to talk about today are two recent papers that we wrote together with Florent Krzakala, Marc Mézard, Galen Reeves and Lenka Zdeborová. The problem we wanted to look at in these papers was the problem of data structure. The key question we set out to investigate was the following: the data sets that are important in machine learning, like images, natural language or games of Go, all have a lot of structure, and when you train a network on these data sets, your aim is usually to extract some of that structure and exploit it for your downstream task. While this works really well in practice, from a theoretical point of view it is only poorly understood. So we started wondering whether and how we can analyze the impact that data structure has on learning in neural networks.

For concreteness, I'm going to focus on a supervised regression task. You have samples with a high-dimensional feature vector x and some label y, and usually the assumption is that you draw a data set i.i.d. from some unknown data distribution q. If you're working in statistics or computer science, your goal will usually be to derive a theorem, and to derive it while making the least possible amount of assumptions about this data distribution q — you want the most general statement, regardless of what your data is. We took a complementary view: let's sacrifice some of that generality, but let's put some structure into the data distribution. Let's assume a generative model for our data that has some distinctive features, and then see how these features affect learning.

The classical way of modeling data for learning is of course the teacher-student setup that emanated from the statistical physics of disordered systems sometime in the 80s. The idea is really simple: you generate your inputs by making i.i.d. draws from the normal distribution, and you generate labels for these inputs by feeding them through a random neural network that we're going to call the teacher. Whatever the output is for a given input, that's the label of that input. Throughout the talk, we're going to be interested in two-layer teachers, where you can write the output function like I did here at the bottom of the slide (there is also a code sketch of both data models a little further below). This is nice in the sense that it gives you some structure in the task that you're learning — some structure for the function that goes from x to y, in terms of the teacher features — and you can then analyze how shallow networks learn it. But what's somewhat lacking is structure in the inputs: the i.i.d. Gaussians, sort of by construction, don't have any structure.

So we wanted to go a little bit beyond this traditional teacher-student setup, and what we proposed is what we call the hidden manifold model. The idea is to generate data in a slightly different way, again starting from i.i.d. Gaussian noise, which is going to play the role of a latent variable.
From this Gaussian noise, you generate an input by feeding the latent variable through a generative neural network — here, for example, I'm showing a deep convolutional GAN. The idea of these generator networks is that they are trained on some data set, and if you then give them noise as an input, they generate, in this case, an image that looks a little bit like what they saw in the training set, but is a new, previously unseen example. So this is how we're going to generate the inputs x. The labels y we're again going to take from a teacher, only this time the teacher will not be applied to the input x; instead the teacher will be a function of the latent variable c. The intuition is again maybe best illustrated with images: for this image of a dog, for example, its class or its label doesn't really depend on every single pixel in the image; it depends on some higher-level features, and in some sense those might be better captured by the latent variables. If you think of a conditional GAN, for example, you tell the generator exactly the class of image that you want, so in that sense the important information really is in the latent variable.

The question is then: can we still analyze, say, the dynamics of two-layer networks if the data come from this generative model, the hidden manifold model? The quick answer is yes, we can, and in particular I'm going to talk about two contributions that we made. One is a theorem, the Gaussian equivalence theorem, where we give rigorous conditions on the weights of a single-layer generator such that we can do such an analysis. The second is a contribution where we derive dynamical equations that tell you how the student evolves in time — in particular how its test error evolves in time — when it's trained on data coming from this model. For the rest of the talk, I just want to take you through these two points.

Let's start with the Gaussian equivalence theorem, and to understand it, it's maybe good to take a step back and return to the classical teacher-student setup where the inputs are just i.i.d. random inputs. If I say "analyze", what I mean is that I want to compute the test error, or the prediction mean squared error, at all times. In this vanilla teacher-student setup, you can write the prediction mean squared error like this: for a given student — that's the function you're going to fit, so a two-layer network in this case — you want to know the mean squared error that this student makes with respect to the teacher, i.e. the two-layer network that created the data, with random but fixed weights. This is a high-dimensional average over the inputs x, these i.i.d. inputs x. But you can simplify this average by realizing that the inputs only enter the formula through their dot products with the teacher and the student weights respectively. We're going to call these dot products lambda and nu; they are the pre-activations of your student and your teacher. So you can replace the high-dimensional average over the inputs with a low-dimensional average over these pre-activations. Now, that's nice, but on its own it doesn't buy you so much, because you still have the non-linearity in the way.
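To make the two data-generating models concrete, here is a minimal sketch of how a single sample could be produced in the hidden manifold model with a one-layer generator. This is not the authors' code: the dimensions, activation functions and 1/sqrt scalings are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of data generation in the hidden manifold model with a
# single-layer generator. Dimensions, activations and scalings are assumptions.
rng = np.random.default_rng(0)
D, N, M = 100, 1000, 2                  # latent dim, input dim, teacher hidden units

A      = rng.standard_normal((N, D))    # generator weights (random here; could be pre-trained)
w_star = rng.standard_normal((M, D))    # teacher first-layer weights: they act on the LATENT c
v_star = rng.standard_normal(M)         # teacher second-layer weights

def hidden_manifold_sample():
    c = rng.standard_normal(D)                      # latent Gaussian variable
    x = np.maximum(A @ c / np.sqrt(D), 0.0)         # input = one-layer generator (ReLU assumed)
    y = v_star @ np.tanh(w_star @ c / np.sqrt(D))   # label = two-layer teacher of the latent c
    return x, y

# For contrast, in the vanilla teacher-student setup x would simply be
# rng.standard_normal(N) and the teacher would act on x itself.
x, y = hidden_manifold_sample()
```

The only real change with respect to the vanilla setup is where the teacher looks: at the latent c rather than at the input x.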
Let me just say at this point that these pre-activations are really the key random variables, not just for the online learning that we'll talk about later, but also if you want to look at batch learning and do a replica analysis — they're a really important object there too, and Bruno will talk later this afternoon about the batch approach and the replicas in a similar model. Okay, so at this point you want to take the average over the lambdas and nus — how are we going to do that? Well, if the inputs are i.i.d., there's actually a really nice simplification that takes place: the central limit theorem kicks in, and the lambdas and the nus are jointly Gaussian. We're going to say that this model has a Gaussian equivalence property. The lambdas and nus are jointly Gaussian, so Carl Friedrich here is going to be very happy, but we can be very happy too, because that simplifies the average: instead of being a function of all the moments of the distribution of lambda and nu, it is only a function of the second moments of that distribution, and these are the order parameters that you often introduce in statistical physics to analyze a system. The problem then becomes tracking the evolution of these order parameters in time, or finding saddle-point equations for them. That's the Gaussian equivalence property.

Now, what we found with the Gaussian equivalence theorem is that a similar result also holds if your inputs come from a generative model — more precisely, from a single-layer generator that you can write like this. So c are again the random latent variables used to create your inputs, you feed them through this one-layer neural network, and you have the teacher acting on the latent variables to give you your labels. It's not immediately clear that the local fields — the pre-activations I'm writing here — are jointly Gaussian. For one, the inputs now have non-trivial correlations, because they're not i.i.d. variables anymore, they come from this generator. And secondly, the teacher is now acting in a different space, on the latent variables, so it's not clear that the pre-activations are jointly Gaussian anymore. But what we found is that they are still asymptotically Gaussian, and in particular we were able, with the help of Galen Reeves, to prove the following theorem. Let P be the joint distribution of the pair (lambda, nu), and let P-hat be a Gaussian distribution with the same first and second moments. You can then define a scalar distance between these two distributions, and you can show that this distance scales as one over the square root of n, where n is the input dimension, and n here is very large. In other words, the distribution of the lambdas and nus and the Gaussian distribution coincide in this limit. This is very nice, because we saw before that them being Gaussian is what enables the analysis. Now let's look a little more closely at what kind of terms actually enter this theorem. On the one hand you have the student weights, the teacher weights and the weights of the generator — the matrices W, W-tilde and A — and you have two matrices M1 and M2, which are basically related to the correlations of the inputs that come out of your generator.
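To make the verbal description a bit more explicit, here is a hedged reconstruction of the quantities involved; the symbols and prefactor conventions are my assumptions, not necessarily the ones on the slides.

```latex
% Pre-activations of the student (acting on the input x) and of the teacher
% (acting, in the hidden manifold model, on the latent c), with a one-layer generator:
\lambda_k = \frac{w_k \cdot x}{\sqrt{N}}, \qquad
\nu_m = \frac{\tilde w_m \cdot c}{\sqrt{D}}, \qquad
x = \sigma\!\left(\frac{A c}{\sqrt{D}}\right)

% The prediction mean squared error reduces to a low-dimensional average
% over (\lambda, \nu) (prefactor convention assumed):
\epsilon_g = \tfrac{1}{2}\,\mathbb{E}_{(\lambda,\nu)}
  \Big[\big(\textstyle\sum_k v_k\, g(\lambda_k) - \sum_m \tilde v_m\, g(\nu_m)\big)^{2}\Big]

% Under Gaussian equivalence this depends only on the second moments,
% i.e. the order parameters
Q_{kl} = \mathbb{E}[\lambda_k \lambda_l], \qquad
R_{km} = \mathbb{E}[\lambda_k \nu_m], \qquad
T_{mn} = \mathbb{E}[\nu_m \nu_n]

% Gaussian equivalence theorem, informally: if P is the joint law of
% (\lambda, \nu) and \hat P is the Gaussian law with the same first and second
% moments, then a suitable distance satisfies d(P, \hat P) = O(1/\sqrt{N}).
```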
This Gaussian equivalence theorem connects to a series of related works that also look at Gaussian approximations for real data, and in the paper we discuss these relations in quite some detail, because it's really quite a rich literature. Here let me just make two or three brief comments. There is one series of works that also looks at wide neural networks with wide hidden layers and obtains similar Gaussian equivalence results using random matrix theory. What those results need, because they are random matrix theory results, is that the weights in the networks are random, or at least that they deviate very little from their random initial condition. What's nice about our theorem is that the weights of the student, the teacher and the generator enter the theorem directly, so we don't need them to be random — and indeed I will show later that all of our results also hold if we pre-train the generator, for example, or the teacher. Another set of really nice results came from the work of Song Mei in the group of Andrea Montanari, and from the group of Romain Couillet here in Paris, who also introduced, if you will, equivalent Gaussian models for data. That work is a little bit different in the sense that we are looking at these low-dimensional projections of the data, whereas they look at models where the generalization error, or the quantities you're interested in, can be written as an integral over a spectrum. What they show is that the spectral densities of the real data and of certain Gaussian covariate models coincide. So that's another really promising direction, but at least for us, one doesn't straightforwardly follow from the other.

So that was for a one-layer generator. Now naturally the question becomes: what if I want a more complicated generator? What if I want to make it deep? Or what if, as in this convolutional GAN, I want to have not just fully connected layers but also convolutional layers? We don't have a theorem for this yet. But what we did was to look at the following case: you generate your data by feeding Gaussian noise to this generator, and you again let the labels come from a teacher acting on the latent variable. We then take a two-layer student and train it on this data using SGD — online SGD in particular, where at each step of the algorithm you take a previously unseen example that you generate afresh, and use it to evaluate the gradient and compute your weight update. What you can do then, given the GET, is derive a closed set of equations for the order parameters Q and R that track the dynamics of a two-layer student trained on data coming from this model. In some sense this is a generalization of the seminal work of Saad & Solla and of Biehl & Schwarze in the 90s — I put the references on an earlier slide — who did this kind of analysis in the vanilla teacher-student case, where the inputs were i.i.d. and the teacher acted on the inputs directly. What we did here was basically extend this type of analysis to a deep generator and a teacher acting on the latent variables. I don't want to go into too much detail of this here.
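As a rough illustration of the online (one-pass) SGD protocol just described, here is a sketch in the same spirit as the data-generation example above; the learning-rate scaling, activations and fixed second layer are assumptions, and the ODEs themselves are not reproduced here.

```python
import numpy as np

# Sketch of online SGD for a two-layer student on hidden-manifold data.
rng = np.random.default_rng(1)
D, N, M, K = 100, 1000, 2, 3             # latent dim, input dim, teacher/student hidden units
eta = 0.2                                 # learning rate (assumed)

A      = rng.standard_normal((N, D))      # generator weights
w_star = rng.standard_normal((M, D))      # teacher weights, acting on the latent c
v_star = rng.standard_normal(M)
W      = rng.standard_normal((K, N)) / np.sqrt(N)   # student first-layer weights
v      = np.ones(K)                       # student second layer, kept fixed for simplicity

g  = np.tanh
dg = lambda z: 1.0 - np.tanh(z) ** 2

for step in range(20_000):
    # online SGD: every step uses a fresh, previously unseen sample
    c = rng.standard_normal(D)
    x = np.maximum(A @ c / np.sqrt(D), 0.0)
    y = v_star @ g(w_star @ c / np.sqrt(D))

    lam  = W @ x / np.sqrt(N)                            # student pre-activations
    err  = v @ g(lam) - y                                # error on this single sample
    grad = np.outer(err * v * dg(lam), x) / np.sqrt(N)   # gradient of 0.5 * err**2 w.r.t. W
    W   -= (eta / np.sqrt(N)) * grad                     # rate scaling with dimension assumed

# The ODEs track the order parameters Q = E[lam lam^T] and R = E[lam nu^T]
# (with nu = w_star @ c / sqrt(D)); in the analysis they are rewritten as
# integrals over the spectral density of the input covariance E[x x^T].
```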
Let me just quickly say that the trick here is basically to rewrite the order parameters as integrals over the spectral density of the input covariance. You can then derive equations of motion for these densities, and that gives you a closed set of equations that you can iterate and compare to simulations of stochastic gradient descent. To give you an idea of what this comparison to experiment looks like, let's go to a case where we know the GET holds, i.e. where we know that the lambdas and the nus are jointly Gaussian: the single-layer, fully connected generator, since that's where we have the theorem. So we can take a neural network, put it on the computer and train it on this data — that's the solid lines — and on the other hand we can integrate the equations I just showed you, and compare the two. Since in this case we know the GET holds, because we have a theorem, we expect the simulations and the predictions of the equations to overlap, i.e. to agree — and indeed, in the experiment we found that this is true. The equations are not just something you can integrate, though; you can also analyze them to, for example, predict the performance of two-layer neural networks. Again, for lack of time I'm not going to go too much into this, but here is just one example where you can use the ODEs to predict how the generalization error of the student depends on the latent dimension of your data.

What's interesting is that we then said: let's take a really powerful generative network, in this case a so-called normalizing flow. This is a generative network that was trained on CIFAR images; we used the Real NVP model from Dinh et al. In the top half you can see some CIFAR images, and in the bottom half some of the images generated by this model. It has something like six million parameters, and it was pre-trained, so the weights in the model have really strong correlations. Again we played the same game: we trained a student on this data using PyTorch, and we integrated the ODEs describing the dynamics of the student. What we found is that the two agree really well. So in other words, we can now analyze the dynamics of shallow neural networks when the inputs come from such a structured, pre-trained generative model. The fact that the ODEs agree with the simulations here means that the Gaussian equivalence that we proved for single-layer generators also holds, at least empirically, for these deeper architectures. So again, Carl Friedrich is very, very happy. And with that I'd like to finish this talk. Thank you for your attention, and I'm looking forward to your questions at the end of the session. Thank you very much.