So I'll just jump right into it. This is a line of work with Gérard Ben Arous and Reza Gheissari, and our goal is to start developing a classification theory of loss landscapes for the performance of gradient-type algorithms. The question is: you're doing some parameter estimation problem, and can we start to understand which problems should be easy, which should be hard, and which sit on a critical boundary, for various kinds of parameter estimation problems, not from a general perspective, but specifically for gradient-type algorithms? Are there simple quantities or invariants you can calculate that lead naturally to performance guarantees or to refutations? The focus for today's talk will be specifically on SGD, but many of the results I'm talking about are also true for gradient descent, and that will be the subject of a forthcoming paper.

So what is the question we want to consider here? A lot of people have already spoken about the performance of SGD for estimation in high dimensions, so I won't give a long introduction. To get quickly to the heart of the matter: the general question many people consider in this field is, given some estimation task, how many samples do I need to be able to estimate with my favorite gradient-type algorithm, say stochastic gradient descent, gradient descent, some batched variant, and so on; questions of sample complexity and time complexity. For today's talk I'll focus on stochastic gradient descent, so sample complexity and time complexity are the same.

This is a conference on high dimensions, but first I should mention that there is a very large, well-developed theory for the fixed-dimension setting, where the dimension is fixed and you send time to infinity: work on limits, functional central limit theorems, the ODE method, and even stronger results on rates of convergence once you start adding extra assumptions like convexity. In a lot of these works, since n is fixed and t is large, the idea is that you basically ignore a burn-in period for the algorithm, and you really only care about its behavior after a certain phase where the algorithm starts behaving in a regular fashion. More recently, a lot of people have started focusing on the high-dimensional perspective, where the standard results don't immediately apply. There have been a lot of model-by-model results in this direction, and I'm only mentioning a very few here.

The focus for today's talk is also the high-dimensional setting, and our hope is to start to develop a general framework for it, in terms of geometric or analytical properties of the loss landscape. Before actually stating what we study specifically, let's just think about what stochastic gradient descent does. Typically people feel there are two phases in stochastic gradient descent. There's an initial search phase, where the algorithm is wandering around in a landscape which is highly non-convex.
And then hopefully, after a certain amount of time, you've entered an effective trust region, and the algorithm enters what's called the descent phase. There you can treat the landscape as effectively convex, and you can hope to start benefiting from things like momentum terms and so on. One of the things I want to get at is that these convexity-based analyses really do naturally lend themselves to understanding the descent phase, but in many problems we're interested in, the non-convexity in the initial search phase can be really pronounced. So what I'd like to do for today's talk is focus specifically on the search phase.

There are three motivating questions for today's talk. One: given some loss landscape, how do I tell if it's easy, in the sense that I have linear sample complexity; almost easy, or critical, in the sense that I require quasi-linear sample complexity, so instead of order n samples I need n log n or n (log n)^2; or in the third regime, hard, in the sense that I need a sample complexity that is at least some higher power of the dimension? Two: there is so much work on understanding the descent phase, and the finite-dimensional theory essentially ignores an initial burn-in, so how much time do you actually spend in this search phase? We're spending so much energy understanding it; maybe the fraction of time spent in the search phase is quite small. Three: how does this difficulty change as you start to vary the loss function? For these generalized linear models or neural network models, in our statistical toy models we pick specific losses and analyze them. The question is, if I change the loss in a way that seems very small or trivial, can this dramatically change the performance of the algorithm?

First, let me emphasize why it is that I'm focusing on the search phase. The thing is, once you've exited the search phase, which is to say once you've weakly recovered the parameter, in the terminology of the other talks, the performance of the algorithm is very different; in fact you can treat all of these problems as effectively in an easy phase. For example, you can prove things like path-wise laws of large numbers, not just for stochastic gradient descent, but also for gradient descent, batch gradient descent, and things like this. In particular, this suggests that if you really want to understand the performance from a random initialization, you have to be careful: you can't first send the dimension to infinity and then send the correlation to zero. Just to emphasize that, here I'm showing two plots. The top is the performance on phase retrieval, first from a warm start and then from a random start, where the red curves will always be SGD and the green curves will always be the population dynamics. From a warm start you see law-of-large-numbers behavior, but when you have random starts, things start to deviate a bit more.
And these effects can get really pronounced when you start changing the activation. There are certain models that you can show are in the hard phase. There again, you'll still have law-of-large-numbers behavior, so from the perspective of first sending n to infinity and then the correlation to zero, there won't really be a huge difference in complexity between these problems. But when you start looking at the performance from a random start, you see that the stochastic gradient descent algorithm is just completely unable to recover the signal, even though the population dynamics eventually has a transition to recovery.

So with that in mind, I want to focus on a fairly simple setting, which I'll call rank-one estimation in high dimensions. What do I mean by this? We'll focus on the sphere in high dimensions, and we'll assume that the population loss, which is to say the expected value of the loss under the data distribution, is a nonlinear function applied to a linear observation. Namely, you take the correlation between your estimate and the true parameter, defined by a normalized inner product, and the population loss is a function applied to this correlation. Of course, in our settings we don't actually have access to the population loss, but from a theoretical perspective this is going to be our assumption. And just to see that this isn't a completely artificial setting, a lot of interesting models fall into this class: generalized linear models in the sense of Nelder and Wedderburn, the GLMs that have been considered earlier in today's talks, sometimes called single-layer networks, things like phase retrieval, multi-component Gaussian mixture models, and matrix and tensor PCA. So this setting does capture a large class of problems that many people in this audience consider.

The algorithm we're going to consider is stochastic gradient descent. I just want to define it exactly, since from talk to talk that can vary. It's an iterative scheme, so you have some input: a loss you're trying to optimize, M data points, and some step size. At time t you do a gradient update using the t-th data point and then project back onto the sphere, since we know that we're always living on the sphere. The output of the algorithm, after I've run through all the data, is the final iterate.

The question we want to consider, formally, is what it means to exit the search phase. In high dimensions, natural initializations typically have correlations that are of order one over the square root of the dimension, and in a lot of these problems you enter an effective trust region once this correlation is order one. So we'll say that stochastic gradient descent weakly recovers the parameter you're trying to estimate if, for some positive epsilon, the probability that this correlation is bigger than epsilon tends to one. And specifically for this talk, what we'll talk about is how many samples you need, that is, how large M needs to be with respect to the dimension, to be able to weakly recover from a uniform-at-random start.
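Just to make the algorithm concrete, here is a minimal sketch of the spherical online SGD just described, run on a toy single-index model with an identity link and square loss. The model, noise level, step size, and all the names here are illustrative choices of mine, not the exact setting of the theorems.

```python
import numpy as np

def spherical_sgd(grad, data, x0, delta):
    """One gradient step per sample, then project back onto the sphere of radius sqrt(n)."""
    n = x0.shape[0]
    x = np.sqrt(n) * x0 / np.linalg.norm(x0)
    for sample in data:
        x = x - delta * grad(x, sample)
        x = np.sqrt(n) * x / np.linalg.norm(x)          # projection step
    return x

rng = np.random.default_rng(0)
n, M = 200, 5000
v = rng.standard_normal(n)
v *= np.sqrt(n) / np.linalg.norm(v)                     # true parameter, on the sphere

A = rng.standard_normal((M, n))                         # one Gaussian "pattern" per sample
y = A @ v / np.sqrt(n) + 0.1 * rng.standard_normal(M)   # identity link plus noise

def l2_grad(x, sample):
    a, yi = sample
    # gradient in x of (1/2) * (yi - <a, x>/sqrt(n))^2
    return -(yi - a @ x / np.sqrt(n)) * a / np.sqrt(n)

x_out = spherical_sgd(l2_grad, list(zip(A, y)), rng.standard_normal(n), delta=0.2)
print("correlation m =", x_out @ v / n)                 # weak recovery: |m| bounded away from zero
```

With an identity link this is an easy (information exponent one) problem in the sense defined below, so M of order n samples already give a correlation of order one from a random start.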
The key quantity that shows up in our analysis is what we call the information exponent, and here is what it is. You have your data distribution p_n and your loss function l_n, and given a sequence of data distributions and loss functions, we say that in the large-dimension limit they have information exponent k if the nonlinearity that shows up in the population loss (remember, we assume the population loss is a nonlinearity applied to the correlation) is zero at the origin, its first k minus one derivatives are zero at the origin, its k-th derivative is strictly negative, and its (k+1)-st derivative is bounded. This is just to say that if you do a Taylor expansion of this quantity about the equator, the k-th order term is the first non-trivial term in the expansion.

Finally, we have to decide what kinds of scalings we allow. We wanted to allow pretty broad scalings and not force ourselves to work only with Gaussian models and things like this. So we just assume that this data-loss pair satisfies the following bounds: the gradient of the sample-wise error, that is, the difference between the loss and the population loss, in the direction of the parameter you're trying to infer has order one-over-root-n fluctuations, and the norm of that gradient is of order one. These are just the natural scalings you'd get from treating the gradient of the error as a generic random vector, and they fit the interesting boundary regimes for many different models that people are studying.

So why are things complicated under this scaling? You're sitting exactly in a critical regime: there is a term coming from the nonlinearity, which is roughly the correlation to the k-th power divided by n, and when the correlation is of order one over root n this is very tiny; it is competing with the noise coming from the landscape itself, which is of order one over n. These two only really become comparable once you've weakly recovered; before weak recovery, the noise is the dominant effect. And what we find is that there are three regimes: an easy regime, a critical regime, and a hard regime. We have formal statements, but just to speed things along, let me get to the summary. If you run SGD from a uniform-at-random start in a landscape with information exponent one, you need linearly many samples in the dimension. If you study a problem with information exponent two, for example phase retrieval, SGD requires at least n log n samples to recover, and at most n (log n)^2 samples suffice. And if you're studying a problem in the hard regime, then you'll need at least polynomially many samples in the dimension, with the exponent depending on the information exponent. So that's the first result.
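Since the regime is determined entirely by the information exponent, it helps to have a way of reading it off for a given model. Under the correspondence that, for the Gaussian GLMs with square loss I'll come to in a moment, the information exponent is the index of the first non-vanishing Hermite coefficient of the link (this is the Hermite-decomposition criterion alluded to below), one can compute it numerically. Here is a rough sketch of that computation; the helper names and the tolerance are mine.

```python
import numpy as np
from math import factorial
from numpy.polynomial import hermite_e as He

def hermite_coeffs(f, max_k=4, quad_deg=80):
    """Estimate c_k = E[f(Z) He_k(Z)] / k! for Z standard Gaussian, via Gauss-Hermite quadrature."""
    x, w = He.hermegauss(quad_deg)            # nodes/weights for the weight exp(-x^2/2)
    w = w / np.sqrt(2.0 * np.pi)              # renormalize to the standard Gaussian density
    coeffs = []
    for k in range(max_k + 1):
        basis = np.zeros(k + 1)
        basis[k] = 1.0                        # coefficient vector selecting He_k
        coeffs.append(np.sum(w * f(x) * He.hermeval(x, basis)) / factorial(k))
    return coeffs

def information_exponent(f, tol=1e-8):
    """Index of the first non-vanishing Hermite coefficient with k >= 1 (None if all vanish)."""
    for k, ck in enumerate(hermite_coeffs(f)):
        if k >= 1 and abs(ck) > tol:
            return k
    return None

print(information_exponent(np.tanh))                   # nonzero linear part       -> 1 (easy)
print(information_exponent(lambda z: z ** 2))          # phase-retrieval-type link -> 2 (critical)
print(information_exponent(lambda z: z ** 3 - 3 * z))  # third Hermite polynomial  -> 3 (hard)
```

Running this gives 1, 2, and 3 for the three links, matching the easy, critical, and hard examples discussed next.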
Now, to get to the second question: how much time do you actually spend in this search phase? One of the upshots we wanted to get at is that, as a consequence of that law of large numbers, once you're weakly recovered you're essentially always in the easy regime. And what you see is that if you look at the fraction of the time, which is to say the fraction of the data, that you use in the search phase, then in both the critical and the hard regime the ratio of the time spent in the search phase to the total time essentially tends to one. Equivalently, the ratio of the time spent in the descent phase to the time spent in the search phase tends to zero. You spend the vast majority of your data just trying to find your trust region in this high-dimensional setting.

The last thing I want to get to before we run out of time is one example of how this result applies to interesting problems. As a reminder, we'll be considering these supervised networks, or GLMs as in Antoine's talk, where you have observations given by some activation function applied to the inner product of a random pattern with the true parameter you're trying to infer, and an L2 loss that you're trying to minimize. Here you can exactly classify the information exponents for a lot of different problems. For example, standard activations like Adaline, sigmoid, and ReLU are all in the easy regime, where the problem has information exponent one. Things like phase retrieval or monomials have information exponent one or two, so they're in the critical setting. But more complex polynomials, specifically functions whose Hermite decomposition has no first- or second-order terms, are all in the hard phase.

To see why that can really change things, consider the following. Suppose I have data drawn from one of these GLMs, so it's some nonlinearity applied to the inner product of a Gaussian pattern with x, and I have two different sources of data, both with exactly the same draws of the pattern but with different activations: one is a cubic activation and the other is the cubic activation minus a linear term. As a consequence of our result, you find that these two estimation tasks actually have dramatically different performance. If what you're observing is just the cubic, then you're in the easy regime of information exponent one and you recover quickly; that's what's going on in the green curve. But if what you're observing is the third Hermite polynomial, which is just the cubic minus a linear term, then you're in the hard phase, where you need a huge amount of data to be able to estimate efficiently. For example, here I'm plotting the performance of SGD for these two different links, with the same patterns, in 4,000 dimensions with 300,000 samples, and you can see that while the cubic does really well, the Hermite polynomial is not doing well at all.

Since we're running out of time, perhaps I'll stop here, and if there are more questions about the precise results, I'm happy to answer them in the Q&A. Okay, thank you.

Thank you, Aukosh. Thank you to all three speakers. I guess we move to the Q&A.