Welcome to the second day, if I haven't mentioned it already; I hope everybody is doing well. Before we start with the session, just a quick update on today's schedule. We have the two talks now in the morning, and then a discussion session at 11. We thought, since there are so many amazing people here doing interesting things, and yesterday you already had a chance to see all the posters and quite a lot of talks, that you should take over the terrace, go around, and enjoy the discussions. Then we'll have a session in the afternoon. You will have seen on the schedule that at four or four thirty or so there is another talk we wanted to advertise a little bit: it's about how to do math in 2030 and how, with transformers, we will all be out of jobs, basically. If you want to hear about that, feel free to join. Then we'll meet again for a little hike, which will really just be a walk through the castle down to the seaside, followed by a little aperitif. We'll tell you later exactly when and where we meet, but that is the plan for today.

Okay, we'll start with the first session. Our first speaker is [name unclear], who is at Northwestern, and he's going to continue our stream at this conference on the dynamics of learning and the dynamics of SGD. So thank you for coming, and we're looking forward to your talk.

Thank you very much for the invitation to be here; it's a beautiful setting, very nice to be here, and thanks for coming to the talk. I want to talk about some work from last year with Gérard Ben Arous, who is at NYU, and Aukosh Jagannath, who is at Waterloo, on high-dimensional limits for stochastic gradient descent. Let me just quickly set up the notation and framework for everyone. (You want to block these?)
The general statistical framework we're thinking about is a common one. We have an i.i.d. data stream coming in, the y_i's, with M data points in total, all drawn from a common distribution P_y. We have a loss function L, a parameter space R^p, and an input space R^d, so p and d are the dimensions of the parameter space and the input data space. There is a risk function corresponding to this loss: the empirical average of the loss over the data set. This is a common framework that captures lots of problems we've all seen before: parameter estimation tasks, classification tasks, neural networks and so on can all be put into this language. The common goal is to minimize this empirical risk, and hopefully the minimizing parameter gives a good solution to the task at hand.

Some typical features of this framework: as M goes to infinity, as long as there are some reasonable moment bounds, you have a law of large numbers for the risk function, so it converges to this Φ that will recur throughout the talk. That is the population risk, or the population loss, for me: it's just the expectation of the loss function at the point in parameter space you're considering. Here's a simple picture of the empirical risk on the left and the population loss on the right, and you see it's common for the population loss itself to be simpler in some sense than the noisy version, the empirical risk. On the other hand, symmetries of the problem often lead to non-convexity even for the population loss, so I'm really thinking about non-convex problems throughout this talk: SGD in the non-convex setting. The global minimizer, or global minimizers, of this Φ will be good solutions to the classification or estimation task you're considering; otherwise,
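In symbols (my own notation, reconstructed from the description; the slides may differ), the empirical risk and its population limit are:

```latex
R_M(x) \;=\; \frac{1}{M}\sum_{i=1}^{M} L(x, y_i),
\qquad
\Phi(x) \;=\; \mathbb{E}_{y \sim P_y}\big[L(x, y)\big],
\qquad
R_M(x) \;\xrightarrow{\;M\to\infty\;}\; \Phi(x),
```

with the convergence being the law of large numbers just mentioned, under the moment assumptions.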
it's not a good loss function, essentially. So this leads to the issue that the empirical risk R may be hard to optimize: it has lots of local minima, saddle points, things like that, and gradient descent doesn't do so well. So people use things like stochastic gradient descent, which isn't so much optimizing R as performing a stochastic approximation to Φ and optimizing there.

I think most of us are familiar with this. Stochastic gradient descent is the following algorithm, and throughout this talk I'm thinking only about the online, one-pass setting. We have our set of M data points, and I move around in parameter space by iteratively making a gradient update according to the loss L, evaluated at my current parameter point on the next data point. The point is that the next data point I'm getting is independent of where I've moved to so far, and δ will be the step size for me.

The reason this works reasonably well is that each of these incremental updates is a stochastic approximation to the gradient of the population loss: if I break up this ∇L term, I can split it into ∇Φ, which is the mean part, plus a noise term. The mean part is what I'll call the population drift; it's a gradient update on the population loss, with no randomness there, depending only on your point in parameter space. The random part, the second term, is a martingale increment, so you expect it to average out as the step size gets small and you make many updates. In this way, you're really running some kind of optimization scheme on this smoothed-out population loss landscape.
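As a minimal illustration of the online, one-pass scheme (a sketch with a hypothetical toy loss, not the speaker's code): each sample is used exactly once, and the per-sample gradient splits into the population drift plus a mean-zero martingale increment.

```python
import numpy as np

rng = np.random.default_rng(0)

def online_sgd(grad_loss, x0, data_stream, delta):
    """One-pass SGD: x_{k+1} = x_k - delta * grad_x L(x_k, y_{k+1}),
    where each data point y is used exactly once."""
    x = np.array(x0, dtype=float)
    for y in data_stream:
        x = x - delta * grad_loss(x, y)
    return x

# Toy instance: mean estimation with L(x, y) = 0.5 * ||x - y||^2, so
# grad_x L = (x - mu) + (mu - y) = grad Phi(x) + martingale noise,
# where Phi(x) = 0.5 * ||x - mu||^2 up to an additive constant.
mu = np.array([1.0, -2.0])
samples = rng.normal(loc=mu, scale=1.0, size=(20000, 2))
x_hat = online_sgd(lambda x, y: x - y, np.zeros(2), samples, delta=1e-2)
# x_hat should be close to mu, up to O(sqrt(delta)) fluctuations
```

The step size controls the trade-off the talk is about: the drift part contracts toward the minimizer of Φ, while the martingale part leaves residual fluctuations of order √δ.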
(Yes, sorry, the ∇Φ there should not have the y_k.)

The theory of stochastic approximation goes back to the work of Robbins and Monro in the 1950s, and they showed what you'd expect from the previous slide: if you look at the trajectory of stochastic gradient descent in parameter space and take the step size to zero, it converges to the solution of the ODE which is gradient flow on the population loss. So the martingale increments indeed average out, and you get this kind of law of large numbers for the trajectory of the dynamics.

Since then (this was a long time ago) lots of other work has studied stochastic gradient descent in what I'm calling fixed dimension, in the sense that the dimension of the parameter space is fixed, the task is fixed, the data space is fixed, everything is fixed, and you're just taking the step size to zero. In that limit you get this law of large numbers, and you can also get a central limit theorem for the trajectory, or an invariance principle, large deviation principles: all kinds of very precise information about how the SGD trajectory approximates the gradient flow. But the one thing I really want to emphasize about this whole story is that it is all for a fixed problem: fixed parameter space, fixed dimension, and so on, with only the step size changing. Those are the limits being taken. So the goal of this talk is to look at the high-dimensional setting for stochastic gradient descent and obtain limit theorems in appropriate ways. Any questions about this background?

In modern data science and statistics,
we're often data-constrained, so we need to think about high-dimensional settings. By that we mean the following: taking δ, the step size, to zero corresponds to taking the number of samples to infinity; but taking the number of samples to infinity relative to the dimension of the parameter space is often impossible. That's just not the regime we're usually in. So, analytically, the common approach to studying this more data-constrained setting is to let the different quantities scale together: the dimension of the problem scales with the number of samples you have access to.

In particular, what I'm going to consider is a family of statistical tasks, indexed by n, with some consistency assumed among them. Each has its input space and its parameter space, with p_n and d_n their dimensions, and a corresponding step size δ_n. For instance, one thing you can think of is what I'll call the proportional asymptotic regime, where the dimension of the problem is n and the step size is 1/n, so the number of samples is proportional to the dimension of the parameter space. This is a kind of common assumption,
I think, in high-dimensional statistics. Here are a couple of simple examples to have in mind that will recur. One: you get rank-k, n-by-n matrices corrupted by some Gaussian noise. Here the parameter space dimension is n, or n times k for the k planted vectors, and the dimension of the input data is n² for the matrices you receive; you imagine the number of samples you have access to is proportional to n. Another is classification of a mixture of k Gaussians with some arbitrary labeling on them, where similarly you have access to a number of data points proportional to the dimension of the parameter space.

When we have this kind of diverging family of tasks indexed by n, we obviously can't look at the trajectory of the full stochastic gradient descent: it lives on different spaces as I change n. So we need to project down onto some finite-dimensional observables, or what I'll call summary statistics, in order to get meaningful limits in any reasonable sense. The summary statistics will be a finite-dimensional family of observables that we track, and these are what we want to prove limit theorems about. So imagine there are k summary statistics we want to track; the important thing is that k is going to be independent of n. k is fixed, and I take n to infinity and get limit theorems. At this level of generality, many people in this audience have worked on these kinds of questions;
I'm not saying anything revolutionary. There are a few approaches to studying stochastic gradient descent, specifically in non-convex settings, that have made significant progress in recent years, and I want to outline a few of them to then talk about where our work fits in.

One approach, which I think a lot of people here have worked on, is what I'll call obtaining ordinary differential equation limits, as long as you start away from critical points of the population landscape. In the same way that we got convergence to gradient flow in the finite-dimensional setting, if you look at a family of summary statistics that are the relevant ones, you can take limits in the high-dimensional setting I described. In fact, a lot of the work I've cited here looks not just at online one-pass gradient descent but also at Langevin-type dynamics or gradient-flow-type dynamics, where there is more memory and it's harder to understand, and you can get coupled systems of ODEs as the limits of these observables. Then you can try to analyze the ODEs you get, either numerically or analytically, and progress has been made on several questions in this way.

One thing I want to emphasize, though: the population landscape I drew, or, if you imagine even a rank-one matrix corrupted by noise, a spiked matrix model, often has some inherent symmetries. So in one observable you may have a landscape like this, due to an inherent symmetry, and the important point is that you have to start macroscopically away from the saddle point, away from the critical points, in order for the ODEs you get to actually do anything on the time scales at which you get limits. So if I get some limiting trajectory that is gradient
flow on this landscape for my correlation with some planted vector, then if I start with any initializations that converge to this saddle point, they're not going to move in the limit I end up with. That's why I say this approach, in some sense, requires warm starts. The common approach here is: you take limits, you understand what these trajectories do if you start macroscopically away from the saddle point, and then you send that macroscopic amount to zero to try to read off the behavior near the saddle point.

A different approach, which has been successful in a few more problem-specific examples, in more restrictive settings, handles specific problems where you can actually take limits, or at least get some kind of upper and lower bounds on the behavior, even for initializations that converge to the saddle points. In a lot of these problems, phase retrieval for instance, you do have some basic symmetry like this, and a random initialization in your high-dimensional space leads to a sequence of initial points in the projected space that converges to this saddle point. So for a few specific problems there has been progress in understanding the actual time complexity of escaping the flatness near the saddle, and after that you can apply the kind of
warm-start understanding from the first approach.

A third direction, which is a little separate from the first two, concerns problems where there isn't a finite-dimensional set of observables capturing the relevant statistics. There has been some progress in taking empirical-measure limits; this is sometimes called the mean-field approach, or infinite-width limits, kernel methods. You can look at an empirical measure in the parameter space and get partial differential equations, instead of ordinary differential equations, as the limit. These may be pretty intractable, but maybe you can do something with them, understand them, and pull back information about the actual SGD. So that's a different direction.

My goal in this talk, and the goal of this result, was in some sense to unify the first two approaches: to allow for a finite-dimensional set of statistics that are the relevant ones, where you maybe end up with ODEs as the limits, but to allow your initializations to vary, to blow up around the saddle points at the same time, and to understand the behavior locally near the saddle points. And there, instead of ordinary differential equations,
maybe there is some residual stochasticity, because the landscape is very flat, so there's no strong drift, and the stochastic part of the stochastic gradient descent should persist in the limit as well. The general gist of the result is that you can take sequences of initializations converging to saddle points and get stochastic differential equation limits, or ones not converging to saddle points and get ordinary differential equation limits, for a class of problems where some finite number of statistics captures most of what you care about. Any questions?

Let me say a little more precisely what our setup is and what we're assuming about the problem. I apologize for this somewhat ugly slide of assumptions; I'll just try to give you the gist. First of all, we have a family of summary statistics, the observables we're interested in tracking in this n-to-infinity limit, and what I'm going to assume about the problem in terms of regularity is threefold.

The first assumption is that the summary statistics aren't too spiky: their Hessians have reasonably bounded operator norms. I'll remind you that you can imagine δ_n is approximately 1/n, or one over the number of samples; that's the proportionality to think about. We do allow some of the norms, like those of the Hessians of the observables, to blow up with the dimension; they don't have to be uniformly Lipschitz, but there are regularity assumptions on them.

The second concerns the regularity of the loss landscape. Here, again, the thing I want to emphasize is that we assume the population loss is uniformly Lipschitz, but not that the loss itself is uniformly Lipschitz. The loss landscape's Lipschitz norm can blow up with the dimension of the problem, which is
the kind of scaling you'd encounter. I'll give you examples, but for instance the spiked matrix model, the mixture of Gaussians, or a spiked tensor all fall into this class. Those are situations where the loss is more spiky than the population loss.

The third assumption is basically that the noise, L minus Φ, the noise in the loss once you recenter out the mean, isn't correlated too much with any of your summary statistics. The idea is that all, or most, of the information in the problem should be captured by the population part, not the noise part; the noise should be fairly uninformative.

One comment: in these assumptions on the loss landscape we're not assuming it is uniformly Lipschitz, but the allowed growth is the kind of scaling you'd get for, say, random Gaussian vectors in d-dimensional space, and that is what leads to this.

(Question from the audience.) Yes, this is uniform over compact sets, not everywhere in space. In principle you'd want to allow for atypical parts of the landscape with bad norms that your SGD avoids, but we weren't able to decouple the trajectory from the bad parts of the Hessian. So the norms can blow up as you go off to infinity in parameter space; if you fix a compact ball, then you have these kinds of bounds, with a constant depending on its size.

Some examples of reasonable things we can track: correlation with some planted, or ground-truth, vector. That's a linear statistic of the problem, and it will always fit into this set of observables. Something like radial components: a lot of the time, when we're thinking about a loss function,
we'll have some radial penalty confining us to some compact set, and radial penalties should always fit into this framework, so we can track them. Then there are other types of correlations, and one I especially want to mention is the population loss itself. If you want to say you succeeded at a problem, one way is that the correlation with some vector gets large; another is that the population loss gets small. So you want to include that in your family.

One thing that's really important is that the definition I gave on the last slide has enough room that not only do all of these examples fit into it, they fit even if I blow them up by a factor of order √n. What I want to imagine doing is looking at a summary statistic, with some landscape like this, with a saddle point or some critical point here; in order to get diffusive behavior near the critical point, I need to zoom in, because at the macroscopic scale you're really not moving around there. You imagine there is some central limit theorem going on, and you want to blow up an order 1/√n window, so I blow up by a factor of √n to get some limiting behavior near the critical point. The notion of regularity we assume has enough room to let you blow up by a factor of √n and still have a well-behaved summary statistic. So summary statistics are not only these kinds of examples; you can also recenter them around a critical point and blow them up, and they remain well behaved.

Up to here we've only assumed regularity. You also need some consistency between your problems in order to get meaningful limits, and that comes in the following form.
Basically, we assume that when the gradient flow on the loss acts on these observables (the first object on the top left is the gradient of the population loss contracted with the gradients of the observables), it admits some meaningful limit, along with a second-order term that I'll describe in a second. If these admit limits as n goes to infinity, so that they are approximately functions of the summary statistics themselves alone, then you have the consistency criterion that you need. The first object is a drift for the summary statistics, which is what this h will be, and the second is a diffusivity for the summary statistics.

One remark: you have the freedom to take your step size arbitrarily small. If you take the step size very small, then both the orange and the green terms drop out, and it's just gradient flow for the population loss; you recover only the finite-dimensional kind of behavior, as will become clear. If your step size isn't too small, then these kinds of second-order operators can also be relevant.

Now, you might wonder: this seems fairly restrictive; how do I come up with a family of observables such that this system closes?
Some problems won't fit: if I took a rank n^ε matrix and corrupted it by Gaussian noise, that does not fit in our framework; we really need a finite-dimensional set of observables that you can track. But to get that finite-dimensional set of observables, you can start with the population loss and then just try to close the system: take the gradient, see what other observables you need to add in order to close it, and that gives you the set of summary statistics to track for the problem.

Under these two assumptions, the main result is that if I take the trajectories of the stochastic gradient descent projected by these observables, and take the n-to-infinity limit, they converge to the solutions of the stochastic differential equation whose drift is the limit of the drift term, the first term, and whose volatility matrix is the limit of the second term.

It could be that σ is zero, and then this is an ODE: in particular, when δ goes to zero fast enough, so that you have access to enough samples, you shouldn't expect any volatility, and you should just expect the finite-dimensional ODE limits, and that's what happens. Take δ to zero fast enough, the orange and green terms drop out, and this is just gradient flow on the population loss. But it could also be that there is some residual stochasticity, which happens for instance near saddles, and then you have this space-dependent volatility matrix σ.

All right. Let me give a general remark on how this relates to the picture I was giving of when we see ODE limits and when we see SDE limits, and then I'll give a few examples of how this can be applied to study certain problems. Any questions about the statement of the result? A few comments
I want to make: one of them is that h here does not have to be a gradient. It doesn't have to be that there is some loss landscape for the finite-dimensional observables on which you're doing gradient flow; the second-order correction term can really make it a non-gradient ODE. Also, about the volatility matrix: it's common to assume lower bounds on the diffusivity, lower bounds on the variance of your stochastic increments, things like that. We make no such assumptions, so in particular the volatility can be highly degenerate in certain places. This can lead to degenerate diffusions, and to things that actually keep you stuck at saddle points because there isn't enough volatility to let you escape. So σ can be rank-deficient; it can be degenerate in that sense.

When you take the kinds of summary statistics I mentioned before, for a generic problem that fits into the framework, what often happens is that, because of the scaling relations, the first limit you get for the summary statistics is just an ordinary differential equation: the volatility part goes to zero in the limit, and you're left with the solution to the ODE du_t = h dt. Now, if you stare at this: h was the limit of those two operators, and if your step size goes to zero fast enough, it's just gradient flow for the population loss, and you recover the finite-dimensional picture. But there is exactly a critical scaling of δ:
for any family of summary statistics, there is one scaling of δ at which that middle term doesn't vanish, and at that scaling there is a second-order correction coming from the high dimensionality of the problem. This was observed by Saad and Solla a while ago, actually, in teacher-student setups: there is this kind of second-order Itô correction that persists in the high-dimensional limit. We're finding it in this general class of problems: a potential second-order operator that changes the landscape in this high-dimensional setting.

This gives you some dynamical system for your set of summary statistics. It's going to be potentially non-convex; oftentimes it will have many fixed points, some unstable, some stable. Now suppose we want to probe the behavior near an unstable fixed point, because a random initialization might converge to an unstable fixed point, or, if you start somewhere, maybe you reach some saddle point, and you want to understand how long it takes to escape that saddle point before getting to the ground truth and actually solving the problem.
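To summarize the limit theorem schematically (my own notation, reconstructed from the spoken description, so the exact form of the operators may differ from the slides): writing u^n(t) for the summary statistics of the SGD iterates on the time scale t = k δ_n, the statement is that u^n converges, as n goes to infinity, to the solution of

```latex
\mathrm{d}u_t \;=\; h(u_t)\,\mathrm{d}t \;+\; \sigma(u_t)\,\mathrm{d}B_t,
```

where h is the limit of the drift operator acting on the observables (the gradient-flow term plus the second-order Itô-type correction at the critical step-size scaling), and σσ^T is the limit of the rescaled quadratic variation, the diffusivity. When δ_n goes to zero fast enough, σ vanishes and this reduces to the ODE du_t = h(u_t) dt.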
So the next thing we can do is zoom in around a saddle point of the landscape, instead of just looking at ODEs. Suppose your initializations converge to some fixed point u* of this dynamical system. We can zoom in about the fixed point by recentering my statistics, getting new statistics ũ, which are just the ones blown up in a window around the fixed point. You're imagining this is the fixed point my initializations converge to, or midway through training my SGD is converging to some saddle point, and I want to understand the behavior locally near that saddle point, for instance how long it takes to leave. I blow up in a window around that saddle point to get new summary statistics, which are also in this well-behaved family. They admit their own drift and volatility matrix, and what typically happens is that these rescaled statistics no longer have vanishing volatility: there is persistent stochasticity. So they satisfy a stochastic differential equation limit, this one, with some new drift function and some new volatility matrix.

That's the heuristic picture of what we're doing. Let me give a few examples of how this works, for the spiked matrix type of problem that I mentioned and for a mixture-of-Gaussians type of problem.

Example one is denoising a rank-one matrix, a very simple problem.
I have some planted vector v, and we're given i.i.d. samples of λvv^T plus Gaussian noise; the signal is just corrupted by Gaussian noise. We take the log-likelihood, so the L2 loss function, which is just the squared norm of Y minus xx^T. The summary statistics for this problem: it's pretty easy to see that the things you need to track, in order to understand how this behaves under the SGD, are the correlation m with the planted vector v, and some kind of radial penalty term; here I'm just tracking the orthogonal radial part. Together the two give you the full norm of x. Obviously you have succeeded at the task if m gets close to plus or minus one; there is an inherent symmetry to the problem.

With this, if you just take the first kind of limit, the ODE limit, with no zooming in, no blowing up: you take these summary statistics, apply our theorem, and you end up with a system of coupled ODEs for the correlation and the orthogonal radius. It's not too bad; you can pretty easily stare at it and find the fixed points. In particular, you find that it has a couple of fixed points when λ is bigger than one, and just one when λ is less than one. So you recover a transition at λ = 1, which, if you look at the spectrum of the matrix, is also where an eigenvalue jumps out of the semicircle spectrum. If λ is bigger than one, so the problem can actually be solved, there is an informative fixed point, with correlation √(λ - 1), so it has positive correlation with the ground truth, and an uninformative one, the unstable one, which is basically at zero in the m-coordinate. You have this kind of picture.
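A small simulation is consistent with this picture (my own sketch: the dimension, step size, and warm start are illustrative choices, not from the talk). We run online SGD on L(x, Y) = ||Y - xx^T||_F^2 with samples Y_k = λvv^T + W_k and step size of order 1/n, tracking the correlation m = ⟨v, x⟩; for λ = 2 and a warm start, m should flow to a macroscopic informative value.

```python
import numpy as np

rng = np.random.default_rng(1)

# Online SGD for rank-one matrix denoising (toy scales):
# samples Y_k = lam * v v^T + W_k with symmetric Gaussian noise W_k,
# per-sample loss L(x, Y) = ||Y - x x^T||_F^2, so grad_x L = -4 (Y - x x^T) x.
n, lam, steps = 100, 2.0, 2000
delta = 0.5 / n                      # step size proportional to 1/n

v = np.zeros(n); v[0] = 1.0          # planted unit vector
x = 0.2 * v                          # warm start: macroscopic correlation
x[1] = 0.2                           # plus a small orthogonal component

m_path = []
for _ in range(steps):
    A = rng.normal(size=(n, n))
    W = (A + A.T) / 2.0              # symmetrized noise matrix
    # (Y - x x^T) x, computed without materializing Y:
    grad = -4.0 * (lam * (v @ x) * v + W @ x - (x @ x) * x)
    x = x - delta * grad
    m_path.append(v @ x)

m_final = m_path[-1]
# m_final should be macroscopic: the correlation escapes toward the
# informative fixed point rather than staying near its warm-start value
```

Started instead from a random initialization (correlation of order 1/√n), the same dynamics linger near the uninformative fixed point, which is exactly the regime the saddle-point blow-up below is designed to probe.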
Okay, so that's the first-order limit, and if your δ actually scales proportionally to one over the dimension, then there is a second-order correction, a second-order term acting on the radius, somewhere in there.

So you can take this limit, and then you want to ask: if I use a random initialization, my initial correlation is of order one over the square root of the dimension, so in this limit it goes to zero. The ODE just says: placed at correlation zero, I never leave the fixed point (0, 1); I'm stuck there for all time. So let's try to understand how long it actually takes to leave, and what the mechanism for leaving this flatness at the origin is. You zoom in around the unstable fixed point (0, 1): I define this rescaled correlation, blown up by √n, and a blown-up orthogonal radius, recentered around its fixed value. For these you get a system of stochastic differential equations, and they are just two Ornstein-Uhlenbeck processes.

Some interesting things: this system itself has a transition at λ = 1, which is where the problem goes from solvable to unsolvable. It goes from an Ornstein-Uhlenbeck process that is mean-reverting, so you're never going to leave the saddle point, when λ is less than one, to one that is mean-repelling, so it drifts exponentially fast away from the saddle point, when λ is bigger than one. Something fun is that the two equations actually decouple: whereas the two observables were coupled in the first limit I took, when I blow up around the fixed point you get a decoupled system of stochastic differential equations. From this you can read off that in exactly a constant times log n time,
I will leave the flatness around the origin, get into the ODE regime, and quickly move in one direction or the other toward the solution.

Yes, so at lambda equal to one this limit gives you basically just a Brownian motion: exactly at lambda equal to one, instead of being either mean-repelling or mean-reverting, you have no drift, and the process is purely stochastic. You could then do some other time rescaling; escape should take longer, because there's no drift away from the origin, but the randomness will take you away eventually, since there's also no push back toward the origin. You could take a different limit to probe that; we didn't do that. Everything here is still correct at lambda equal to one, it's just that you won't read off this exponential drift away from the origin, because it's not there; you'd have to rescale somehow.

So this is in the parameter m, the correlation parameter: the drift points one way or the other if you're macroscopically away from the origin. This is for lambda bigger than one; when lambda is less than one the landscape looks something like this instead. And what we're doing here is zooming in around this fixed point.

Yes, possibly, but we did not do this. It's a good question; I just didn't look at it.

Okay, in the couple of minutes that remain, let me give a more complicated example.
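Before the more complicated model, the mean-reverting versus mean-repelling dichotomy just described can be simulated directly with Euler-Maruyama. The drift and noise coefficients below are illustrative choices, not the exact constants from the theorem.

```python
import numpy as np

def ou_path(lam, T=4.0, n=4000, sigma=0.5, seed=1):
    """Euler-Maruyama for dM = (lam - 1) M dt + sigma dW.
    The drift coefficient (lam - 1) and noise level sigma are illustrative;
    the exact constants in the limit theorem depend on the model."""
    rng = np.random.default_rng(seed)
    dt = T / n
    M = np.empty(n + 1)
    M[0] = 1.0  # start at order one in the blown-up coordinate
    for i in range(n):
        M[i + 1] = M[i] + (lam - 1.0) * M[i] * dt \
                   + sigma * np.sqrt(dt) * rng.standard_normal()
    return M

repelling = ou_path(lam=2.0)   # lambda > 1: drift pushes away from the saddle
reverting = ou_path(lam=0.5)   # lambda < 1: drift pulls back toward it
critical = ou_path(lam=1.0)    # lambda = 1: no drift, pure Brownian motion
print(abs(repelling[-1]), abs(reverting[-1]))
```

The mean-repelling path grows exponentially while the mean-reverting one stays bounded near the saddle; the critical path just diffuses.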
This was a really simple example in some sense; here you just see the mean-reverting and the mean-repelling Ornstein-Uhlenbeck processes for it. All right, so let's take a quite a bit more complicated model. This is a mixture of four Gaussians, but labeled in an XOR structure: these are the four means of your Gaussians, and we label the antipodal pair in one direction blue and the pair in the other direction orange, so you have two labels for a mixture of four Gaussians. Why this task? Because it's a very simple task that requires a two-layer network to classify well; it's easy to see there is no linear classifier for this problem. So you can study it on a two-layer neural network, and that's what we did: we looked at a two-layer neural network where the middle layer is a fixed size, with, say, a ReLU activation on the first layer and a sigmoid on the second, and you end up with a binary cross-entropy loss function.

Okay, so then you figure out what the relevant summary statistics are, and this indeed falls into the class of problems that fits our theorem, with some 22 summary statistics: the last layer's weights, the correlations of the first-layer weights with the four means you're interested in, and then some radial terms, like orthogonal parts and inner products, things like that. With these you can approximately close the problem, and therefore you can take the limits. What you get is an awful equation, but the point is you get some equation, and you can do a stability analysis on its fixed points. If you do that, you find, for instance, fixing the middle layer width to be four, the smallest width that can express a classifier, that there are 39 critical regions.
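To see concretely why width four suffices while no linear classifier works, here is a sketch restricted to the two coordinates spanned by the means. The means, separation, and network weights are hand-set assumptions for illustration (not trained, and not the parametrization from the work): a width-4 ReLU network computing |x1| - |x2| separates the XOR mixture, while homogeneous linear classifiers stay near chance.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2          # keep only the 2-d subspace spanned by the four means
mu = 3.0       # separation of the means (assumption: well-separated clusters)
n = 4000

# XOR-labelled mixture: +/- mu*e1 -> class +1 ("blue"), +/- mu*e2 -> class -1
centers = np.array([[mu, 0.0], [-mu, 0.0], [0.0, mu], [0.0, -mu]])
labels_c = np.array([1, 1, -1, -1])
idx = rng.integers(0, 4, n)
X = centers[idx] + rng.standard_normal((n, d))
y = labels_c[idx]

relu = lambda z: np.maximum(z, 0.0)

# A width-4 ReLU network computing relu(x1)+relu(-x1)-relu(x2)-relu(-x2)
# = |x1| - |x2|: four hidden units, the smallest width expressing a classifier.
W1 = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)
a = np.array([1.0, 1.0, -1.0, -1.0])
net_acc = np.mean(np.sign(relu(X @ W1.T) @ a) == y)

# Homogeneous linear classifiers (through the origin, for simplicity):
# by the antipodal symmetry, every direction sits near chance accuracy.
lin_accs = [np.mean(np.sign(X @ w) == y) for w in rng.standard_normal((200, d))]
print(net_acc, max(lin_accs))
```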
These critical regions have varying topological dimension. You can figure out which of them are stable, and then you can look at things like over-parameterizing the problem: taking that middle layer width larger and larger and seeing how the number of critical regions changes. For instance, you can ask: what is the probability, under a random initialization, that I'm in the ballistic basin of a good classifier, in the sense that I converge to a good solution very fast, ODE-like, versus the probability that I'm in the basin of a saddle point instead? In this problem we studied some notion of the power of over-parameterization: as I take that middle layer width larger and larger, the probability that I initialize in the basin of a good classifier goes to one.

You can then also look at the critical regions that were saddle regions, so they're not stable. You can blow up around these regions, and if you do that you get diffusions instead of ordinary differential equations. Something interesting you can see is that these diffusions are indeed rank-deficient, in the sense that I mentioned earlier: they're really degenerate diffusions, they move only in certain coordinates rather than all of them, and so they can get stuck for arbitrarily long times in these critical regions. All right, thank you very much.

Yeah, sure: why m and r-perp? I guess it's what makes the equations nicest. But to get a valid set for a problem like this XOR Gaussian problem, for instance, what we started with is: okay, we need to know the correlation vectors, and then you try to close these; you try to find the minimal family that's going to close the drift.
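To build intuition for why over-parameterization helps at initialization, here is a toy proxy of my own, not the basin computation from the work: with the four XOR means along the coordinate axes, a hidden unit can only "see" a mean direction it positively correlates with, so ask how likely k random units are to cover all four directions.

```python
import numpy as np

def coverage_prob(k, trials=20000, seed=0):
    """Monte Carlo probability that k random Gaussian hidden units cover all
    four mean directions +/-e1, +/-e2 (a unit covers a mean it has positive
    inner product with).  Toy proxy for initializing in a good basin."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((trials, k, 2))   # k units, 2 relevant coordinates
    pos1 = (W[:, :, 0] > 0).any(axis=1)
    neg1 = (W[:, :, 0] < 0).any(axis=1)
    pos2 = (W[:, :, 1] > 0).any(axis=1)
    neg2 = (W[:, :, 1] < 0).any(axis=1)
    return float(np.mean(pos1 & neg1 & pos2 & neg2))

# Closed form: each coordinate shows both signs with probability 1 - 2^{1-k},
# independently across the two coordinates.
for k in [4, 6, 10]:
    print(k, coverage_prob(k), (1 - 2.0 ** (1 - k)) ** 2)
```

The coverage probability climbs toward one as the width k grows, echoing the talk's statement that the probability of a good initialization tends to one under over-parameterization.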
Maybe there are different minimal families, but that's at least how you get a valid set of finitely many summary statistics, for which you can then start to analyze the dynamical systems you get.

That's the point: if I don't do any rescaling, then I'll just be stuck here, so of course I can't probe this log d time. What I do instead is blow up space. If I blow up space by root n, then I'm actually moving: I move an order 1 over root n in order-one time, and what's happening is you're doubling in order-one time. So the first scale is of order 1 over root n, the next one is 2 over root n, then on the next time scale it's 4 over root n, and that's what leads to the log d. You can probe the log d by doing a sequence of these rescalings and stitching them together, and that stitching actually gives you the log d escape from the saddle.
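As a back-of-the-envelope check of that doubling picture, here is a tiny script; the doubling rate per window is an illustrative assumption (the true growth rate depends on lambda), but the logarithmic count of windows is the point.

```python
import math

def escape_windows(d, growth=2.0):
    """Number of order-one time windows for the correlation to go from its
    random-init scale 1/sqrt(d) to order one, if it multiplies by `growth`
    per window (growth = 2 is illustrative; the true rate depends on lambda)."""
    m, windows = 1.0 / math.sqrt(d), 0
    while m < 1.0:
        m *= growth
        windows += 1
    return windows

for d in [10**2, 10**4, 10**6]:
    print(d, escape_windows(d))   # grows like (1/2) * log2(d)
```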