Okay, so hello everyone, welcome to the QLSE seminar. Today we have the pleasure of having Sebastian Goldt here. Let me quickly introduce him; I think everyone knows him, but Sebastian is a statistical physicist who did his PhD in Germany on stochastic thermodynamics. He then moved to Paris for a postdoc in the group of Florent Krzakala and Lenka Zdeborová, working on the theory of neural networks. And recently we are glad to have welcomed him in Trieste: he started a position as an assistant professor at SISSA, where he created a group on the theory of neural networks. He's an expert on the dynamical properties of learning in these machines, and that's what this course will be about. Welcome, Sebastian, and thank you again.

Thank you, Jean. Thank you for the nice words, and thank you for inviting me. It's a real pleasure; like you said, I'm right next door here at SISSA, and I'm looking forward to coming over more often in the next few weeks and months, but for today let's work with what we have. So like Jean said, I'm a statistical physicist, but these days I'm mostly interested in neural networks and learning in neural networks, and in particular in what happens during learning, so in the dynamics. The goal of these two lectures, today and next week, is to discuss a particular limit of the dynamics of learning where we can analyze things in a fairly straightforward way, but which still gives us some useful insights which, as we will see, also carry over to other setups. I'd like to make this really as interactive as possible, so I'll do a lot on the whiteboard here in OneNote. If you have any questions, please feel free to interrupt; I'm not sure I will always see if you raise your hands, so just unmute yourself and ask questions as we go along.
I will also have some slides towards the end of today's lecture, but really, this is designed to be interrupted. So, what's the goal for today? The goal is, in a first step, to look at a certain model of neural networks and at its learning. The model of neural networks that we're going to look at is a two-layer neural network, and I'm just going to write it down very quickly. The setup we have in mind is one where you have some high-dimensional task: you might want to classify images, say decide whether the images you show your network contain dogs or cats or none of the above. So the task we always have in mind is what's called supervised learning. We'll always be in this supervised-learning scenario, where you have high-dimensional inputs, which we're going to call x. These inputs might be images, they might be word representations in natural language, and we take them to be high-dimensional vectors of dimension n. Each of these inputs has a label, which we're going to call y*, and this label is a scalar. This could be, for example, plus or minus one if you want to tell cats from dogs: x would be images, and y* would be plus one if it's a dog and minus one if it's a cat. And so you want to learn a function that takes you from this high-dimensional input x to the scalar variable y, and one very popular and very powerful class of functions are neural networks. The neural networks that people use in practice these days are usually very deep, in the sense that they have many layers of weights.
And that's still a big challenge for us to analyze theoretically in detail. So what we usually do in theory is look at model neural networks, or in other words simple neural networks; in particular, we look at networks with two layers of weights. Graphically, you can think about this network as an input layer, which is just your input vector x. Then you feed that through what's called a fully connected layer (I'll write down in a second what that does), which gives you some hidden or latent representations, your neurons if you will. And then you get a scalar output, which is the prediction of the student and which we're just going to call y. So y* is the true label, the one you want, and y is the prediction of your network. As a function, you can simply write this as φ_θ(x), where φ is the neural network, θ is the set of parameters, and x is the input. In this case we can write it as the following expression; I'll walk you through it in a second. You'll see this network has two layers in the sense that there are two matrices of weights, W and v. At the first layer you have the weight matrix W, which is a K × n matrix, so there are K neurons. You take the matrix product Wx, you normalize it by √n so that everything is nice and well behaved, of order one, and then you apply a nonlinear function g. This nonlinearity is a scalar function, and there are different choices you could make. We're going to look at two in particular: one is the sigmoid function, which looks like this, and a very popular function in neural networks these days is the ReLU, the rectified linear unit, which looks like this: just a linear function for positive λ and zero for negative λ.
And then we have a second layer of weights, which I'm going to call v, and you just take the sum over the neurons weighted by these second-layer weights, and that gives you the output. So this is the function class we're going to focus on: two-layer networks. Now, we're going to study these networks in a particular limit, which I'll call the ODE limit. In this limit, you take the input dimension n to infinity, while the number of neurons K stays of order one. There are other limits; there is, for example, a limit where you do the opposite, where you keep the input dimension finite and send the number of neurons, of hidden nodes, to infinity. That's what we're going to call the mean-field limit. I'm going to focus on the ODE limit here, but at the end of today's session we'll talk a little about how what I say in this two-layer case in the ODE limit relates to other limits too.

Okay, so you have all these weights, W and v, and you want to learn some sort of function. How do you find these weights? If you just put random matrices there, the network just gives random answers. The solution, of course, is that you don't write down the weight matrices by hand but find them by training. So what does that mean? We're going to train these networks using gradient descent: really just plain gradient descent on, in this case, the quadratic loss, as you would expect. Let me write this down with consistent notation. Very broadly, I take all the parameters θ of my network at step μ+1, where μ is the step of the algorithm, and this is equal to the parameters at step μ minus a learning rate, which I'm going to call η, times the gradient of some loss function, where the loss function takes in the output of my neural network, my prediction, and the true label of an input.
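To keep the notation concrete, here is a minimal sketch in Python of the two-layer network just written down. The function name `forward` is my own, and I use `tanh` as a stand-in for the sigmoidal nonlinearity; this is an illustration, not code from the lecture.

```python
import numpy as np

def forward(x, W, v, g=np.tanh):
    """Two-layer network: phi_theta(x) = sum_k v_k * g(w_k . x / sqrt(n)).

    W has shape (K, n) (first layer), v has shape (K,) (second layer).
    The 1/sqrt(n) normalization keeps the pre-activations of order one;
    tanh stands in for the sigmoidal g, and np.maximum(0, .) would give the ReLU.
    """
    n = x.shape[-1]
    lam = W @ x / np.sqrt(n)   # the K pre-activations ("local fields")
    return v @ g(lam)          # scalar prediction y

rng = np.random.default_rng(0)
n, K = 1000, 2
x = rng.standard_normal(n)        # one high-dimensional input
W = rng.standard_normal((K, n))   # first-layer weights
v = rng.standard_normal(K)        # second-layer weights
y = forward(x, W, v)              # scalar output of the network
```

With random weights, as in this snippet, the output is of course just a random answer; the weights have to be found by training, which is where we go next.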
Okay, and here, throughout this talk, to keep things simple, I'm going to look at the case where L is just the mean squared error, like this. At each step of the algorithm, you evaluate this gradient using a particular input-output pair; this is a gradient, so you need to evaluate it at a particular point, and this point will be a pair of an input and its true label. This gives you a quantity, and with that you then update your weights at each step. And basically the goal of the next two lectures is to understand what happens in these dynamics of learning. So the first question is: what are the dynamics? That's going to be the focus of today.

"This is online learning, right?" Yes, this is online learning; I'm going to get to that in a second, but you're absolutely right, I'm going to look at online learning, and I'll say in a second what that means. And the second question is: what kind of impact does the structure in the data have here? (I should write shorter sentences.) That's going to be the focus of the second lecture.

Okay, so this is a very good moment to zoom out and pause for a second. This setup might be very basic for many of you, and I'm sure many of you have seen it. I just wanted to write these things down because, one, we're going to need the notation, and two, I wasn't quite sure about everybody's background. So if you have a question about the setup, now is a good time to ask it. Are there any questions on the setup? Very nice. Okay, then let's zoom back in, and let's start by looking at the dynamics. This is, of course, a key topic in the theory of neural networks, and so there's a lot of related work and a lot of limits in which you could look at this.
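Collected in one place, the update rule just described reads (I write the quadratic loss with the conventional factor 1/2; conventions differ on this):

```latex
\theta^{\mu+1} = \theta^{\mu} - \eta \, \nabla_{\theta}\, L\big(\phi_{\theta^{\mu}}(x^{\mu}),\, y^{*\mu}\big),
\qquad L(y, y^{*}) = \tfrac{1}{2}\,\big(y - y^{*}\big)^{2},
```

where η is the learning rate and the gradient is evaluated at the particular input-output pair (x^μ, y*^μ) used at step μ.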
What we're going to do here is effectively look at the online-learning limit. So what does that mean? It means that at each step of SGD, I'm going to use a new example to evaluate the gradient. So at each step of SGD, I'm going to look at my data set with my images, take a new image, show it to the network, and let it update its weights. And then I'm never going to use that image again. That sounds like an innocuous, maybe even efficient, thing to do, but from a theoretical point of view, as we will see later, it's really important: it's what makes the analysis we're going to do possible. Indeed, extending this analysis to a finite data set, where you would resample your data points, is still to a large extent an open problem. So, we're going to look at this in the online-learning limit.

Now, what have people done already here? People have looked, for example, at linear networks; this is when your activation function is just the linear function. There's a lot of work on that going back to the '90s; as I discussed with Jean, after this talk I will share these notes and give some references to the works that looked at this in more detail. And then there is also some work, and this is what we're going to review today, on nonlinear networks in this limit, done by Saad and Solla and by Biehl and Riegler back in the '90s. That's going to be our focus today. And then, more recently, people have looked at the dynamics in the mean-field limit; this was a series of papers in 2018, and we'll come back to that at the end of today's lecture.

All right, so let's jump right in. And if I say let's jump right into analyzing the dynamics: what's actually the quantity that I'm interested in, the quantity that I want to track?
Well, I have order n parameters here; the physicists among you will recognize this situation: there are loads of parameters, and I'm not going to track every single one of them. So what's the key quantity that describes the performance of my neural network? That's going to be the test error, or generalization error, of the neural network, which we're going to call ε_g, and we define it as the average over my data distribution of the squared difference between my network's output and the true label. So the goal will be to understand how this quantity depends on time, how it evolves during training. At the beginning, the network is randomly guessing; then, during the course of training, this generalization error will go down. And we want to know how it goes down, what kinds of phases there are, what happens in the neural network while it goes down, and so on.

Now, I've written this average over the data in a very general way, and so far I haven't really said anything about where my data comes from. But if you remember, I was giving this example of images of cats and dogs, and we don't yet have the mathematical tools to analyze real data, to analyze what happens with real images; we don't really have mathematically crisp techniques to describe what's going on there. There are two ways you can go about this dilemma, two approaches to dealing with data. You can either say: well, if we can't really say something about real data, then let's just prove something really general that's independent of the data. That's the approach in a lot of statistics and a lot of statistical learning theory: people will say, okay, I'm just going to prove a bound on this generalization error, showing that it's going to be lower than some quantity.
It's going to be lower than, say, one over the number of samples in my training set, or something like this, but I'm not going to make any assumptions on the data, so my results will hold for any kind of data. In particular, that means they will hold for the worst possible kind of data. Right, so there's something very nice about pictures of dogs and cats that makes it possible for us humans, for example, to tell them apart, but you can also think of some really adversarially constructed data sets, and if you don't make any assumptions on your data, well, then you have to take these into account too.

The almost complementary way to deal with data structure is the one that in physics we would call the teacher-student approach. In the teacher-student setup, you say the following: well, okay, it's hard to make statements for real images, so how about we replace the images and labels with something that's a bit more tractable from an analytical point of view, but which has some structure, and then we see how that structure shows up in the dynamics of learning. In particular, the idea of the teacher-student setup is to train a neural network like the one I just wrote down over there, which we're now going to call the student, on data that comes from a teacher. So basically, we're going to look at the case where the inputs x are drawn i.i.d. from the normal distribution with mean zero and identity covariance, and the true label of each of these inputs is given by another neural network, which we're going to call the teacher, with parameters θ*. The teacher is just going to be another two-layer neural network with random weights, and we're going to see what the student does on this type of data. This is of course a sketch; it's a huge simplification compared to the images that we talked about just before.
But it has some advantages that make the analysis we're going to do now possible. So it's a first model of data structure. The inputs here are unstructured; they're just i.i.d. Gaussians. This is the standard approach, and you will find a thousand papers using it; we're going to talk next week about how to make it a bit more structured. But there is structure here in the labels, and that structure comes from the teacher: how many neurons does the teacher have (we're going to call that M), what activation function does the teacher use, is it a ReLU, is it a sigmoid? So there is some structure in the task that we're trying to learn, and now we can track how that structure is learned, and at which point in learning, and so on. Okay, so that's the setup: two-layer neural networks and the teacher-student setup. Very good.

"Actually, the structure usually is in the x, right?" Yes, right; usually the structure is in both. If you take a typical image classification data set, there is structure in the x in the sense that you can do unsupervised learning, you could do some clustering approaches, and that will tell you something about the data. And there is additional structure in the task, in the sense that only some of the information in the images is relevant for the task: the lighting, the angle, the orientation of the image, these don't change whether something is a cat or a dog. So there's structure in both, usually. In this classical teacher-student setup, there is in a sense no structure in the inputs, which is nice from a theoretical point of view, but I agree with you, it's terrible from a modeling point of view. What we're going to do next week is look at some work that we've done recently, which really tries to put some structure back into the x, to have a slightly more realistic model of inputs within the teacher-student setup.
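In code, the teacher-student data-generating process just described might look like this. This is my own sketch: the function names are invented for illustration, and I use `tanh` as the teacher's activation.

```python
import numpy as np

def make_teacher(n, M, rng):
    """Frozen teacher: weights drawn i.i.d. from a Gaussian once, then never trained."""
    return rng.standard_normal((M, n)), rng.standard_normal(M)

def sample_pair(W_star, v_star, rng, g=np.tanh):
    """One fresh training pair: x ~ N(0, I_n), label y* given by the teacher."""
    M, n = W_star.shape
    x = rng.standard_normal(n)            # unstructured i.i.d. Gaussian input
    nu = W_star @ x / np.sqrt(n)          # teacher local fields
    return x, float(v_star @ g(nu))       # scalar label from the teacher

rng = np.random.default_rng(0)
W_star, v_star = make_teacher(n=200, M=3, rng=rng)
x, y_star = sample_pair(W_star, v_star, rng)   # the teacher weights stay fixed
```

All the structure in the task lives in the teacher's choices: its number of neurons M, its activation g, and its frozen weights.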
"I mean, the general question, the type of problem that you are addressing, is essentially learning structure from noise, right? Or classification of noise?" Yes, exactly: it's a classification or, in this case, a regression problem on noise data. Yes. Any other questions?

"Sorry, Sebastian. Is there learning? I mean, the teacher network is not learning?" No, no, very good point. You choose the random weights and then, yes, sorry, I didn't emphasize that. The teacher is exactly like we said: we draw its weights at random at the beginning, usually i.i.d. weights from a Gaussian, but then you freeze them, and they stay constant during learning. You could also, like in our work or in the Saad-Solla line of work, make the teacher time-dependent or something, but for now it's going to be frozen. "Thanks."

"Please, I have a question. When you say at each step you have a new sample, what does it mean, really? Because I didn't understand." Yes, so when I say at each step you have a new sample, what I'm thinking about is this: I have this teacher, now frozen, and I have my student, and I'm going to train the student using gradient descent. At each step of gradient descent I'm going to evaluate the gradient of my loss function and compute some new weights. Each time I compute new weights, that's one step of the algorithm, and I use a new sample at every step of the algorithm. It's about this term here: where do I evaluate my gradient? So at each step of the algorithm I'm going to draw a new Gaussian vector, ask the teacher what the label of that vector is, and then use that pair to evaluate the gradient. Then I take a new Gaussian sample with a new label, and so on.
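Putting the pieces together, the whole online-learning protocol can be sketched in a few lines. This is my own illustration, with `tanh` activations, hand-derived gradients of the quadratic loss (with a factor 1/2), and arbitrary sizes and learning rate; none of these numbers come from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, M, eta = 100, 2, 2, 0.2
g = np.tanh                                   # activation of student and teacher

W_star = rng.standard_normal((M, n))          # frozen teacher, drawn once
v_star = rng.standard_normal(M)
W = rng.standard_normal((K, n)) / 10          # student, small random initialization
v = rng.standard_normal(K)

mse = []
for step in range(20000):
    x = rng.standard_normal(n)                     # draw a fresh Gaussian input...
    y_star = v_star @ g(W_star @ x / np.sqrt(n))   # ...and ask the teacher for its label
    lam = W @ x / np.sqrt(n)                       # student local fields
    err = v @ g(lam) - y_star                      # prediction error
    mse.append(float(err ** 2))
    # one SGD step on 0.5 * err^2 (chain rule; g' = 1 - tanh^2),
    # after which this sample is discarded and never reused
    W -= eta * np.outer(err * v * (1 - g(lam) ** 2), x) / np.sqrt(n)
    v -= eta * err * g(lam)
```

Running this, the running squared error in `mse` starts at the random-guessing level and decreases over training, which is exactly the curve whose shape the rest of the lecture sets out to explain.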
"Okay, in this case, if you consider an image, the sample would be a single image, right?" Yes, the sample would be the image and its label.

"Okay. And to go back to the way you presented the teacher-student setup: what is the difference between the student network and the teacher network?" Okay, so the difference between them is that one is the student: the student is the network that I'm training, the one whose weights I'm optimizing. The teacher is the one that generates my data. Before, I would take a new image, look at it and say, okay, this is a dog or this is a cat. What I'm doing now is taking a new noise vector and asking the teacher network to tell me the label for this "image". But the teacher network is frozen, it's fixed. Does that make sense? "Yeah, I get it a little bit. It's as if we have an image, you rotate the image to get the new label and give it to the student, something like that?" Think about it this way: in a normal machine learning task, you have images and labels, and here you have Gaussian noise vectors, each with a label which is the output of the teacher. What you have in both cases is the student, the two-layer network that you're training to learn either the one task or the other. "Okay. Yeah."

"I also have a question." Yes, please go ahead. "Is the structure of the teacher network the same as the structure of the student network, in terms of number of neurons, layers, and the function g, or can it be different?" Very good question. You have some freedom here.
What we're going to focus on are teachers which have two layers of weights, simply because we don't yet have the mathematical tools to really go deep. But you have some freedom, for example, in the number of neurons; I'm talking about the hidden neurons here. You can have more or fewer neurons in the teacher, so you can have a case where the student is more powerful than its teacher, and you can also have a case where the teacher is more powerful than the student. Actually, you can play interesting games here, because in reality the student will not be able to achieve a perfect score on real training data, so you might want to make the teacher a bit harder, a bit bigger than the student. You can also vary the respective activation functions; you can really get creative here, depending a little on what you want to model. "Thank you."

All right, now we have this new toy, this teacher; let's see what we can do with it and how it helps us with the analysis. I said that the key quantity that I want to analyze is this generalization error. Let me remind you: this is just the expectation over the data of the squared difference between the student's output and the true label. Now that I have the teacher, I can write this a bit more explicitly, and maybe we have a chance to actually do this average. So let's do that: by putting in the teacher, what I mean is that I can now replace the y*, so this is just the student minus the teacher, squared, and the average is now taken over x, the n-dimensional input vector. This is a bit tough, because it's a very high-dimensional average, and high-dimensional averages are a bit of a pain. As physicists, we'd like to have something a bit lower-dimensional in terms of how we describe the system. So let's write out in a bit more detail what this expression actually looks like.
It looks like this: I'm just writing down the definition of the student output, where again λ_k = w_k · x / √n and n is the input dimension, minus the sum over the teacher's neurons, weighted by v*_m, of g(ν_m), where ν_m is the same quantity but for the teacher. We're going to call the λ_k and ν_m the local fields, or the pre-activations, of the student and the teacher, because they are essentially what goes into the nonlinearity. And I really invite you to remember these quantities well, because they're going to be key to a lot of what we're talking about today and next week.

Why are they key? They're key because you see that in this expression for the generalization error, which is really what we're after, which we want to calculate as a function of time, the weights don't appear explicitly: they only appear through these dot products of the input with the teacher and student weights, w_k · x and w*_m · x. These are high-dimensional dot products; let me write this out just for the student: λ_k is really a sum up to n of w_ki x_i, and each x_i is a Gaussian variable. From this you find that the λ_k and the ν_m are jointly Gaussian. This is true for any given student and any given teacher, simply because the inputs x are Gaussian, and a weighted sum of Gaussian variables is itself Gaussian. I should emphasize here that I want to calculate this generalization error for a given student and a given teacher. So from now on, the question will always be: you give me a student and a teacher, what's the test error? You give me the student after 100 steps of training, what's its generalization error? And what I can do here is replace the n-dimensional average over x with a low-dimensional average over these λ's and ν's.
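One can check numerically that the local fields behave as just claimed. This sketch (my own; variable names are invented) compares the empirical second moments of λ and ν, estimated from many Gaussian inputs, with the weight overlaps they should equal:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, M, P = 500, 2, 2, 20000
W = rng.standard_normal((K, n))        # student first-layer weights
W_star = rng.standard_normal((M, n))   # teacher first-layer weights

X = rng.standard_normal((P, n))        # P i.i.d. Gaussian inputs
lam = X @ W.T / np.sqrt(n)             # student local fields, shape (P, K)
nu = X @ W_star.T / np.sqrt(n)         # teacher local fields, shape (P, M)

# The fields have mean zero, and their second moments are fixed
# by the overlaps of the high-dimensional weight vectors:
Q_emp, Q = lam.T @ lam / P, W @ W.T / n        # student-student
R_emp, R = lam.T @ nu / P, W @ W_star.T / n    # student-teacher
```

`Q_emp` matches `Q` and `R_emp` matches `R` up to Monte Carlo fluctuations, illustrating that the n-dimensional average over x can be traded for an average over K + M jointly Gaussian fields.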
There are K different random variables λ, and like we said earlier, K is, say, two or four, but in any case of order one, and there are M different ν's. So we can rewrite this average as an average over these low-dimensional fields. (I'm making a conscious effort here to keep my handwriting legible, but if I fail at that, please let me know.)

How many of you here are statistical physicists, or how many of you have maybe done replica calculations and this kind of stuff? For you, this is just going to be a quick aside: these local fields are very similar to the auxiliary fields that you introduce when you do a standard replica computation. In a replica computation, you would introduce auxiliary fields using delta functions; what you want to do is replace some high-dimensional average with an average over these low-dimensional local fields, which you then treat by saddle point, and so on. And here the idea is really the same: the analogue of the auxiliary fields of the replica calculation are exactly these local fields λ and ν. I'm talking about dynamics here, but you can also look at the statics of the problem, and for that you need replicas, and then you will find that the auxiliary fields in your replica computation are exactly these quantities. So, yeah, if you've done replicas, maybe this is helpful.

"Yes, related to what you said about replicas: I was about to make the same connection. And I wanted to ask why you said that K has to be of order one. In principle, can you take K to infinity after you take n to infinity? Like what you normally do in a committee machine, where K goes to infinity, but only after n?" Yeah, I think you should be able to do that.
I have never really looked at what happens in that limit, and actually I think an interesting question would be to compare what happens there to what happens in the mean-field limit, where you just let the number of neurons go to infinity. "Is what you call the mean-field limit the neural tangent kernel limit?" It's a bit subtle. Mean-field limit for me just means that your input dimension n is finite and your number of neurons K goes to infinity; the number of neurons goes to infinity while the dimension stays finite. Now, depending on your scaling, in this limit you can get into the NTK regime, but you can also stay in the feature-learning regime; that really depends on the scaling of your learning rate and your initialization.

"Could you explain what exactly the NTK regime is?" Yes, so, another little aside; remember this computation. Basically, let's say we have finite inputs, and my student looks like this; let's just do it like this. Now we send K to infinity. This network, when you train it with SGD, will still be able to learn, and the weights of this first layer will move a lot. But if you go to a certain scaling, and it's a bit subtle how you do that, you could for example, instead of the 1/K normalization, look at 1/√K, and choose a certain scaling of the learning rate and the initialization. Then you get a regime where you still have an infinite number of neurons, but these first-layer weights move very little, vanishingly little. This is the NTK regime, and I'm going to come back to it at the end of today's talk. It's very nice from a theoretical point of view, because it's easy to analyze and you can say a lot of things. As a model it's not as expressive, because your first-layer weights don't really move; you're essentially learning with something like random features.
You fall back into some sort of kernel learning regime. "So it's a perturbative limit? I mean, it's a perturbation theory around the purely random feature model?" Exactly. Exactly; you can think of it as a linear perturbation of the random feature model. And so you can show quite easily, quite elegantly, that there are things you can learn with the two-layer network but that you cannot learn in this lazy regime; that's why it's called lazy. It's really a hot topic in neural networks right now. These are really good questions, and I'll try to periodically come back to these points.

So, we were looking here at the calculation of the generalization error: we wanted to know, given a student, what its generalization error is, and we noticed that this high-dimensional average over the data can be replaced with a low-dimensional average over the local fields. Now, why is that useful? In general, there's a nonlinearity here, so this would be a function of all the moments of the distribution of λ and ν. But we know that the fields are Gaussian, because we have Gaussian inputs, and this leads to a huge simplification: I can now just look at the first two moments of λ and ν. The λ_k and ν_m have mean zero, because we took inputs with mean zero. We're going to call their second moments Q for the student-student overlaps, R for the student-teacher overlaps, and T for the teacher-teacher overlaps. These are our order parameters, if you want to use the physics jargon. They are low-dimensional objects, obviously: Q, for example, is a K × K matrix. And since the distribution of the λ's and ν's is Gaussian, these second moments capture the full distribution.
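Written out, the order parameters just introduced are (in the standard notation of the online-learning literature):

```latex
Q_{kl} \equiv \mathbb{E}\,[\lambda_k \lambda_l] = \frac{\mathbf{w}_k \cdot \mathbf{w}_l}{n}, \qquad
R_{km} \equiv \mathbb{E}\,[\lambda_k \nu_m] = \frac{\mathbf{w}_k \cdot \mathbf{w}^*_m}{n}, \qquad
T_{mn} \equiv \mathbb{E}\,[\nu_m \nu_n] = \frac{\mathbf{w}^*_m \cdot \mathbf{w}^*_n}{n},
```

and with these, the generalization error becomes a low-dimensional Gaussian average (written here with the conventional factor 1/2; conventions differ):

```latex
\epsilon_g = \frac{1}{2}\,\mathbb{E}_{(\lambda,\nu)}\!\left[\Big(\sum_{k=1}^{K} v_k\, g(\lambda_k) - \sum_{m=1}^{M} v^*_m\, g(\nu_m)\Big)^{\!2}\right],
\qquad
(\lambda,\nu) \sim \mathcal{N}\!\left(0,\; \begin{pmatrix} Q & R \\ R^{\top} & T \end{pmatrix}\right).
```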
So in other words, our high-dimensional object, the generalization error, which was a function of all the student parameters and all the teacher parameters, becomes a very low-dimensional object: something that I can write as a function of just the order parameters, plus the second-layer weights, but there are also just K of those. So from a statistical physics perspective, we've done the first, and maybe even the most important, part of the problem: we've found the right order parameters. They are these Q, R and T overlaps. If I tell you these overlaps, you will know the generalization error. And so any question that I have about the student, like I said earlier, "you give me a student, what's the test error", "you give me the student after 100 steps of training, what's the test error", you can now translate from a question about the student into a question about the order parameters. That, if you will, is the statics of the problem. And a question about the dynamics of θ, of the student, now becomes a question about the dynamics of Q and of R. So if we understand the dynamics of Q and R, we know everything about the system, about the learning.

"Can I ask a question? You said that the x, the inputs, are Gaussian. In this case, the fact that these auxiliary fields, these local fields, are jointly Gaussian is true for any size n, right? You don't need the central limit theorem in this case. So my question is: if you want to drop the assumption that the x are Gaussian and take, say, weakly correlated x's, I think that if they're weakly enough correlated, the central limit theorem would still apply, and therefore your argument would apply in the thermodynamic limit?" Yes, you're absolutely right.
And that's what we're going to talk about next Tuesday, for weakly correlated inputs. So Jean's point is that, you know, a sum of Gaussian random variables is always Gaussian for any N, so I don't need the thermodynamic limit here. If they're weakly correlated, so if instead of being uncorrelated, like I have right now, the inputs were weakly correlated, say the off-diagonal covariances were like order one-over-N or something, these fields would still be jointly Gaussian. And indeed what we're going to talk about next week is what happens if this is the case. And we're going to show that in some cases which are quite interesting in terms of modelling real data, this is still the case: this covariance matrix will, on average, have only small entries, and so this analysis goes through. Okay. Let's make a quick check of the time, Jean. When did we start? We started at five past three. You're still muted, sorry. Let's say you aim for 20 more minutes and then some questions, if that's okay for you. Okay. I know how hard it is on Zoom to follow, so I really appreciate all the questions so far, that's really cool, and it makes this a lot of fun. I'm not going to overrun, I promise. Okay, let's say I aim for 4pm and then I'll stop. So okay, we have this reduction, we have found the order parameters of the problem, and we have found that, if I think back to my learning dynamics up here, I don't really need to understand what's happening to every single parameter; I just need to understand what's happening to these order parameters Q and R, and then I know everything about my student during training. 
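A quick numerical illustration of this point, a sketch I'm adding rather than anything shown in the lecture: even for non-Gaussian inputs, here binary plus/minus-one entries, the local field w . x / sqrt(N) is close to Gaussian by the central limit theorem, which is exactly what the weakly-correlated extension relies on.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_samples = 400, 10000

w = rng.standard_normal(N)                         # a fixed weight vector
x = rng.choice([-1.0, 1.0], size=(n_samples, N))   # non-Gaussian inputs

lam = x @ w / np.sqrt(N)                           # local field, one per sample

# CLT: lam should be close to N(0, |w|^2 / N); check mean, variance
# and kurtosis (which equals 3 for a Gaussian).
kurtosis = np.mean((lam - lam.mean())**4) / lam.var()**2
print(lam.mean(), lam.var(), kurtosis)
```

The mean sits near zero, the variance near |w|^2/N, and the kurtosis near 3, so treating the fields as Gaussian is a good approximation already at moderate N.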
So how do you get an equation that describes the dynamics of Q and R? And here we get some help from some classics of statistical physics, by Saad and Solla, and by Riegler and Biehl, built on a whole series of earlier literature. And what they showed is that you can indeed derive a set of ODEs that govern the dynamics of your order parameters; in other words, you can write down first-order ODEs like this. And these ODEs describe exactly what's going on in the thermodynamic limit, so here we now really need the N-to-infinity limit. We have these in a series of, you know, PRLs and J. Phys. A's, where they give a heuristic derivation, and I'm just going to give you a taste of this derivation; it's really quite simple and quite elegant. And the idea is the following. Let's stay at the finite-N level, okay, let's not go to the thermodynamic limit just yet. Let's think about how to get an equation for the teacher-student overlap, for example. So what we really want is to understand what happens to R at each step. Okay, and so this is nothing but w_k at step mu plus one; here I'm just putting in the definition of the order parameter. One thing I should mention, sorry, let me just spell this out. These order parameters, as I introduced them here, may seem like a bit of an ad hoc thing, you know, okay, they're these correlation variables between the local fields, but what do they mean? Why are they important? Do they have a physical interpretation? And in fact they do, and it's quite simple to see in this case; I can illustrate it with the teacher-student overlap. So if I write this out, this is the expectation of lambda_k nu_m. 
What this really is, if I just plug in the definitions of lambda_k and nu_m, is a sum over w_ik w*_jm times the expectation of x_i x_j. Now that expectation is just a Kronecker delta, right, because our inputs are uncorrelated. So in the end what you find is that this is just one over N times the dot product between the k-th student weight vector and the m-th teacher weight vector. So our order parameter R is really an overlap: you can think of each neuron as having a weight vector, and what the teacher-student overlap measures is the angle between two such vectors, one of the student nodes and one of the teacher nodes, which are vectors in N dimensions. And intuitively, you would think that if you take a random teacher and initialize your student randomly, these two vectors will be orthogonal to each other. And you would hope that during learning the student somehow finds back the parameters of the teacher, and that this angle between the two goes from 90 degrees at initialization to something smaller. And indeed that's what you find if you actually run an experiment, if you run a simulation, or if you calculate what's happening to these overlaps. So that's the physical interpretation of the teacher-student overlap R; and then you can convince yourself that the student-student overlap Q is really the same idea but for pairs of student weight vectors, and the teacher-teacher overlap T is the same idea for the teacher weights. So that's the physical interpretation of these order parameters. So, I'm using this expression now, and I'm looking at how it changes over time. Now the teacher weights don't change over time, but the student weights do, so I'm just writing that out here. And now I'm going to substitute in the SGD equation, okay, the weight-update equation for the first-layer weights. 
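The "orthogonal at initialization" intuition is easy to check numerically; here is a small sketch of my own (not from the lecture) showing the angle between two independent random weight vectors approaching 90 degrees as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two independent random weight vectors become orthogonal as N grows:
for N in (10, 100, 10000):
    w_student = rng.standard_normal(N)
    w_teacher = rng.standard_normal(N)
    cos = w_student @ w_teacher / (
        np.linalg.norm(w_student) * np.linalg.norm(w_teacher))
    angle = np.degrees(np.arccos(cos))
    print(N, round(angle, 1))   # fluctuations around 90 shrink like 1/sqrt(N)
```

This is why a nonzero teacher-student overlap during training is a meaningful signal: at initialization the overlap is of order 1/sqrt(N), essentially zero.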
So what I'm going to find is something that looks a little bit like this. I have the learning rate, I have the one over N from my definition, I have a second-layer weight hanging around here that comes from my SGD update, I have the derivative of the student activation, and delta is basically, you know, student output minus teacher output, okay, this is a scalar. And I have the input x, and now I dot it with the teacher weights over the square root of N. And so we see that this guy, we actually met this guy before: this is just nu_m. Okay. And so, if I write this out, I see that the average change of the overlap is just an average over stuff that I kind of already know. And the average that I still need to evaluate here in front is again only an average over local fields. There's a lambda here, there's another lambda, here's a nu, here's another nu. And like we said before, these are jointly Gaussian random variables, so this expression, whatever it is, will again only depend on the order parameters. So again, you know, I will have to evaluate integrals that look like this, things like the expectation of g'(lambda_k) g(lambda_l). And so this is going to be, again, a function of only the order parameters. And so these finite-difference update equations close. And what's nice, and as a physicist you would heuristically almost expect this, is that if you now go to the thermodynamic limit, this one-over-N factor in front of my update here, you can interpret as an infinitesimal time step. Okay. So more precisely, you can introduce a continuous time t, which is just the number of samples over the input dimension, t = mu / N. And if you're a physicist you can just believe that these finite differences then converge to ODEs in the continuous-time limit as you send N to infinity. 
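Since these averages depend only on the order parameters, you can estimate them by sampling the low-dimensional Gaussian directly, never touching the N-dimensional weights. A Monte-Carlo sketch of my own, with tanh as a stand-in activation and scalar Q, R, T for a single student and a single teacher unit:

```python
import numpy as np

rng = np.random.default_rng(3)
g = np.tanh                                  # example activation

def g_prime(z):
    return 1.0 - np.tanh(z)**2

# Order parameters for one student and one teacher unit (scalars here):
Q, R, T = 1.0, 0.3, 1.0
cov = np.array([[Q, R],
                [R, T]])                     # joint covariance of (lambda, nu)

z = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)
lam, nu = z[:, 0], z[:, 1]

# One of the averages that appears in the update of R: a function of the
# two local fields only, hence of (Q, R, T) only.
estimate = np.mean(g_prime(lam) * nu * g(nu))
print(estimate)
```

For activations like the error function these averages even have closed forms, which is what makes the ODEs of Saad and Solla explicit; the sampling above is just the generic fallback.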
And so this was the idea of the heuristic derivation of Saad and Solla, and then two years ago, actually, with some colleagues, we managed to prove that these equations are rigorously correct, so that you can really take this N-to-infinity limit and you really do find back these ODEs in a mathematically robust way. Okay, but the idea, the heuristic derivation, that was all there in Saad and Solla already. So, that was just to give you an idea of the derivation. Let's zoom out a little bit in the remaining five minutes, and let's step back and think about what we've done. So we started with this idea of trying to track two-layer networks, right. And we said that analysing this on real data is a bit tough, so let's go to the teacher-student setup, the synthetic-data setup where I have Gaussian inputs and I get the labels from a random teacher, and let's see what happens in this case. And we saw that in this case the key quantity, the generalization error, is a function of just a few order parameters, Q, R, T. This is quite an appealing idea. And so any question about how the student evolves becomes a question about how these order parameters evolve. The evolution of these order parameters was then given by Saad and Solla, and it takes the form of this closed set of ordinary differential equations, and if you're worried about mathematical rigour, and at some point you should be, a little bit, you can even prove that they're correct. Now let me finish with a few plots. So this will require some... Just a quick question. Yes, please. My last, because after I have to go. My question is: is this theory robust to the fact that at every step you would take, let's say, a mini-batch of fresh data points that are never used again, so instead of taking one sample you take 10, I don't know? That's fine. That's not a problem. 
The key is really, and that's the thread here, that you need these local fields that appear in the update equations, the lambda_k's that appear in the updates of your order parameters, to still be jointly Gaussian, and their covariance to be expressible by just these simple order parameters. Okay? As soon as you reuse samples, the x that you've already seen will be correlated in a very complicated way with the weights, and then this description breaks down. So going to mini-batches of finite size, no problem. But reusing samples, that's where it gets hairy. Okay? If you reuse samples you have to look at two-time correlation functions; Q you can think of as a one-time correlation, but with reused samples you have to track two-time correlations, and it looks a bit like the Cugliandolo-Kurchan kind of stuff. Let's look at some pictures to wrap this up. So, I thought, you know, I'd at least show you that what I just told you, I didn't just make up, and that it actually works. So this is a plot where I'm training three different neural networks of three different sizes, okay, different numbers of neurons. And what I'm plotting here is the test error of these students as a function of training time, okay, which I call alpha here. And I'm plotting two things: the lines are the predictions for the test error that I get from integrating the ODEs, and the crosses are simulations. And you can see that indeed the finite networks that I train and the ODE predictions agree quite well with each other. So the ODEs capture what's going on with SGD. And to motivate you that this approach is actually quite interesting, I'd like to point out one last thing that even this simple plot and this simple model reveal about the learning process. 
And I'd like to convince you that this is actually something that happens in deeper networks too, networks that are not just trained on synthetic data. So what I'd like to draw your attention to is this plateau here in the middle. I don't know if you can see my mouse. So you see that between roughly step 10 and step 200, the error of the green and the orange student sort of plateaus, right: there's an initial decay (note this is a log scale), then a plateau, and then the error goes down to a really low value. Now what's happening at this intermediate stage, at this plateau? What's happening is that the student is learning a good linear model. Okay, so the student has some correlation with the teacher, but it's not enough to really learn the task well. And what I'm plotting here in this four-by-four plot is the teacher-student overlap, okay, so this is telling you the overlap between any student vector and any teacher vector, and you can see that all of them are non-zero; there are some variations, but they're all-to-all correlated, so every node of the student is correlated with every node of the teacher. If I look at this same teacher-student overlap plot at the end of training, the picture is quite different. Now you see that there are four yellow squares. What does that mean? It means that each student node is correlated with only one of the teacher nodes, one-to-one: it has really just basically copied the parameters from that teacher node, and it's not correlated with all the other teacher nodes. This is what's called specialization, and you can see that it has a real effect here on the performance of the student, because it makes the difference between a model with an error of around 10 to the minus one and a model with a much lower error. Okay. So this specialization is something that's been studied a lot. 
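The specialization pattern is easy to recognize programmatically. Here is a toy diagnostic of my own: the "specialized" student is constructed by hand, as a noisy permuted copy of the teacher, purely to show what the overlap matrix looks like after specialization (one dominant entry per row, as in the four yellow squares).

```python
import numpy as np

rng = np.random.default_rng(5)
N, K = 1000, 4
W_star = rng.standard_normal((K, N))     # teacher weights

# Hypothetical "specialized" student: each node has copied one teacher
# node, in scrambled order, up to a bit of noise.
perm = rng.permutation(K)
W = W_star[perm] + 0.1 * rng.standard_normal((K, N))

R = W @ W_star.T / N                     # teacher-student overlap, K x K

# Diagnostic: one dominant entry per row; the dominant columns recover
# the hidden permutation of teacher nodes.
dominant = R.argmax(axis=1)
print(R.round(2))
print(dominant, perm)
```

At the plateau, by contrast, the same matrix would have all entries of comparable (and smaller) magnitude, the all-to-all correlated picture described above.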
And you can study it in detail with these ODEs; I'm not going to go into that here. What I'm just going to demonstrate is that this specialization, where you go from a good linear model, which is what you have on the plateau, to a very good specialized model, is something you see in other regimes too. For example, you see it in this infamous mean-field regime that we talked about, the regime where you keep the input dimension finite and let the number of neurons go to infinity. You see here, this is from a really nice paper from the group of Andrea Montanari, led by Song Mei, in PNAS two years ago, where they looked at basically the same question that we're asking now for online learning: they plot the test error, which they call the risk, as a function of training time, which they call the iteration, and again the simulations are the crosses and their theory, which in this case is a PDE, is the lines, and you can also see that there's this sort of plateau. So this seems to be quite a recurring feature in these two-layer networks. What's maybe surprising is that it also happens in deep networks. So this is a very nice paper that came out at NeurIPS last year (NeurIPS is sort of the biggest machine-learning conference), by Preetum Nakkiran and the group at Harvard, where they look at specialization in deep neural networks. They trained deep networks on a real image dataset, this was CIFAR-10. And blue is a linear classifier. And at the plateau, early in training, what you basically have is the performance of just a linear classifier, in other words a classifier with just one layer of neurons. Okay, and then you specialize to something that's better. What they show here is that the blue line, the linear classifier, is really correlated with, it really goes together with, the deep neural network that they train on this image-classification task. In other words: 
in the beginning of training, these deep networks on this image-classification dataset only learn a linear model, and only after a certain time do they specialize, or learn a more complicated function. Okay. So I really like this example because it shows two things: one, that it's useful to look at simple models, because some of the things that you see in them really carry over to much more complicated settings; and two, I really want to make the point that learning in neural networks occurs in steps, and you learn functions of increasing complexity, as the people from this paper put it. So first just the linear model, which is as if you trained just one neuron, and only after a certain time, only after the student has seen a certain amount of data, do you really start to pick up on the more complicated structure, and you specialize to more complicated functions than you can express with one neuron. Now, to understand that a bit better, we should also have a slightly better model of data structure, and that's what we're going to do next week. So I'm going to stop here. Thank you for your attention, and let's have some more questions. Sebastian, how does all of this depend on the loss function? That's a really good question. The loss function is always the issue. For the analysis that I just showed you, the ODE analysis, unfortunately it's crucial that the loss is the mean-squared error. So, you know, you can always write down the ODEs, but if you want to really analyze them, look at the fixed points, do perturbation theory around them like I did, you really want the error to be the mean-squared error. And this is the same for the Mei-Montanari analysis in the mean-field regime. Now for these experiments, 
I'm not 100% sure anymore, but I think they also see this when they train with cross-entropy; you see, they train on classification. I'm not sure whether they trained on cross-entropy, let me check. Yeah, any other questions? Yeah, the Harvard guys train on cross-entropy too. So the specialization seems to be fairly robust with respect to the loss. But this specialization transition we're talking about, was that with i.i.d. data? In what I showed you, it was i.i.d. data. No, no, this is the actual CIFAR-10. So here they did, for example, MNIST, but they also did CIFAR-10, animals versus objects. And this is, I think, a convolutional network; they really did some increasingly complex experiments here. This reminds me a little bit of Saxe and Ganguli, who were showing that a linear network picks up singular values one after the other. So that's really related. So what I was alluding to, just for everybody: there's a really nice series of papers by Andrew Saxe, from back when he was in the group of Surya Ganguli, where they look at the dynamics of linear networks, so networks that have no nonlinearities in the middle. And they find something like specialization, in the sense that the linear network picks up one singular value of the input-output covariance after the other. And it's related, but not quite the same, because the linear network, of course, can never learn something better than a linear classifier, right, so in this sense the specialization is a bit different: it's really the sweeping up of the covariance singular values one by one. Is there something about the time scale at which you have this crossover between the linear and the nonlinear regime? What does it depend on? 
That's a really good question, and that's something we actually tried to look at a little bit. And it's really subtle, because it depends on really fine details of, for example, the initialization. Basically, you can think about this plateau, and the escape from the plateau into specialization, as some sort of symmetry breaking: the network has to break the symmetry between its own neurons and the neurons of the teacher. So it needs a fluctuation to kick it away from the saddle point, and this fluctuation can come from the initialization. And so small changes in the initialization will change the size of the plateau. The analysis is tough. Is this also why small batches work better than large batches? Yes, so here you really want some stochasticity. Exactly, this is really an argument for small batches, because you want some stochasticity to kick you away from the saddle point. Okay, so if there are no other questions, then I guess, Sebastian, the next lecture will be next week, same time, so next Tuesday. Okay, next Tuesday, that's Tuesday at 3pm if I remember correctly. Okay. So if there are no further questions, then we thank Sebastian again for the very nice lecture. Thank you for having me, thank you for all the questions, that was really cool. See you next week. Bye bye.