Okay, so last time I told you what we're going to do and gave you a minimal amount of background on random matrix theory. Today I will start with one example, to try to understand it. So remember what we had: once we have a high-dimensional statistics or machine-learning type problem, in the end we're building a random function of very many variables. That's the important thing: it's random, it's a function of very many variables, and we want to minimize it. So there are different layers of understanding. As I explained, there are two sources of difficulty in the optimization. One is the topology of the landscape; the second is the entropy, the intrinsic high-dimensional problem. But before going into either of these, let me start with one example of a random function of very many variables. For the moment it will seem artificial, but you will see that in fact it's very related to the problem of tensor PCA that I mentioned last time. And in fact, historically, this is where the link was made for me between what I was doing before, and am still doing, in statistical physics and these questions of data science. So let's start with a very simple problem, one interesting class of examples. We are on R^n; in fact I will restrict myself to the sphere, but I don't have to, it's just to simplify a bit. And for the moment I look at a random function f of x which will simply be a homogeneous polynomial of degree p. I already mentioned that yesterday. Hard to do something simpler than that: I just take a polynomial, and even a homogeneous one. I could of course mix the different degrees, but for now I fix one degree. Okay? All right.
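To make the object concrete, here is a minimal numerical sketch of such a random homogeneous polynomial with i.i.d. Gaussian coefficients; the dimensions, the seed, and the helper name are my own choices, not anything from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_hom_poly(n, p, rng):
    """f(x) = sum_{i1..ip} J_{i1..ip} x_{i1}...x_{ip}, with J i.i.d. N(0,1)."""
    J = rng.standard_normal((n,) * p)
    def f(x):
        out = J
        for _ in range(p):        # contract one index of the tensor at a time
            out = out @ x
        return float(out)
    return f

n, p = 5, 3
f = random_hom_poly(n, p, rng)
x = rng.standard_normal(n)
x /= np.linalg.norm(x)            # a point on the unit sphere S^{n-1}
print(f(x))                       # homogeneity: f(t*x) = t^p * f(x)
```

The homogeneity check at the end is the defining property of the class: scaling the input by t scales the output by t^p.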
And I'm asking: is it easy or hard to find the minimum of this function? So first, of course, this function can be written as f(x) = sum over i_1, ..., i_p of J_{i_1...i_p} x_{i_1} ... x_{i_p}. That's my polynomial. And I'm wondering: is it topologically complex? Is it difficult to find the minimum in a short time? Hard to do simpler than that. One obvious case, which I already mentioned yesterday, is when all these coefficients are zero except one. That's obviously not a very complicated function, and finding its minimum shouldn't be too hard, as we discussed yesterday. So that one is not complicated; it's the simplest it can be. Now let me ask the other question, the question for people who like complicated things: what is the worst possible case, from the topological point of view? Before going there, I should describe a bit what I mean by something topologically complex. I take for granted that this is an audience of people on the math side, so you all know Morse theory. Let me just remind you in a couple of words what Morse theory tells you: it gives you constraints. Take a Morse function on a compact manifold — that's why I took the sphere here; it's hard to find a simpler manifold than the sphere. So on a compact manifold I have a smooth function, and I assume it's Morse, which means that at every critical point the Hessian is non-degenerate. So for instance this one on the board is not Morse, but a typical function would be. What does Morse theory tell you? That when you take a Morse function, you can bound the quantities that describe the landscape defined by this function, its topology.
You can bound it by topological invariants of the manifold, the Betti numbers. You know that, or should I explain a bit? Okay, you seem to know. All right. So what would you want to understand, for instance? I give you a function f from my manifold — here the sphere — to R, which is smooth and Morse. I told you what Morse means: the Hessian is non-degenerate at critical points. So how would you describe the topology defined by f? One way is to look at the level sets; here I will look at the sub-level sets, say the set of points where the function is less than u. Remember, we're talking about minimizing a function, so that's the interesting object. And you may want to understand the topology of these sets: for instance, their Euler characteristic. You see why this function is more complex than that one: if you take a level set here, the topology will be interesting, because you would have different connected components and this type of thing. If you don't know what the Euler characteristic is, you can still capture the topology with something simple: for instance, the number of critical points. Let's take B, a subset of the real line, and count the number of critical points of f of index k with value in B. So what is a critical point of index k? Of course, you remember: the index of a critical point is the number of negative eigenvalues of the Hessian. So a minimum has index zero, and a maximum has index equal to the dimension. So for instance here, all these points are critical points, but these ones are of index one and these ones of index zero. Of course, in higher dimension you have all the intermediate indices.
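As a toy illustration of the index — my own example, a hypothetical two-dimensional double well, not anything on the sphere — one can count negative Hessian eigenvalues numerically:

```python
import numpy as np

def h(x):
    # double well: minima at (+1, 0) and (-1, 0), saddle at (0, 0)
    return (x[0]**2 - 1.0)**2 + x[1]**2

def num_hessian(f, pt, eps=1e-4):
    """Hessian of f at pt via central finite differences."""
    d = len(pt)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.eye(d)[i] * eps
            ej = np.eye(d)[j] * eps
            H[i, j] = (f(pt + ei + ej) - f(pt + ei - ej)
                       - f(pt - ei + ej) + f(pt - ei - ej)) / (4 * eps**2)
    return H

def index(f, pt):
    """Morse index = number of negative eigenvalues of the Hessian."""
    return int(np.sum(np.linalg.eigvalsh(num_hessian(f, pt)) < 0))

print(index(h, np.array([1.0, 0.0])))   # 0: a minimum
print(index(h, np.array([0.0, 0.0])))   # 1: a saddle
```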
Okay, so you may want to understand the number of critical points in this band — this is my B — how many critical points do I have here? Here you see I have one, two, three. Right, okay, so these are natural quantities to describe the complexity of the function. And Morse theory gives you a bound between these numbers, which depend on the function — or the Euler characteristic of the sub-level set, which depends on the function — and topological characteristics of the manifold. Okay, so here, since I chose a completely trivial manifold, the sphere, topologically there is nothing complicated, and the Morse inequalities give you essentially nothing. A function on the sphere can be very simple; it doesn't have to be complicated because the manifold is complicated. Nevertheless, you could ask yourself, with this criterion: what is the maximum possible number of critical points on the manifold — here, my sphere — in my class, for a homogeneous polynomial of degree p? How complicated can it be? I took a very simple class of functions — polynomials, not even random yet — and of course it can be very simple, but I'm asking how complicated it can be. So that's a question not for me, but it's been done. You know, the nice thing in mathematics is that we're all crazy, and there is always somebody somewhere doing exactly what you want. So this has been done by somebody I don't know — Korazov, Kosasov, I'm sorry — and not that long ago. And this maximum number is 2[(p-1)^{n-1} + (p-1)^{n-2} + ... + (p-1) + 1]. Of course, that's anecdotal; I just love the idea that somebody computed that. So look at it.
So, by the way, what happens if p equals 1? Can the function be complicated? You take a homogeneous polynomial of degree 1, say x_1: nothing happens. You have two critical points, the minimum and the maximum, the south pole and the north pole, and that's it. So here the landscape is trivial. But what if p is 2? Then you have a quadratic form on the sphere; of course you can diagonalize it and write it as a sum of lambda_i u_i^2, once you've changed the basis. So how many critical points do you have for that? 2n, because you have n directions, and in each direction two critical points. So these two cases are not very fascinating. But now if p is at least 3, this tells you that for the worst polynomial, the number of critical points is exponential in n, because it's like 2(p-1)^{n-1}. So the complexity — if you take (1/n) log of the number of critical points — behaves like log(p-1). Okay, so that just tells you: beware, even functions which look very simple can be extremely complicated. Because here you take a polynomial — not even a random one — and you can find one which is terrible. All right, but again, you could say that's on the perverse side: you went looking for complicated things. So let's be reasonable. What about the typical case, not the worst case? How would you pick a typical polynomial? In many ways, but for a probabilist it's easy: take it at random. So you take a polynomial at random; let's say you take the J's — the coefficients of my polynomial — to be i.i.d. Gaussian. So you pick a polynomial at random like that.
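The p = 2 count is easy to check numerically: the critical points of a quadratic form on the sphere are exactly the ±eigenvector pairs, so there are 2n of them. A small sketch, with my own toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                 # a random symmetric matrix

lam, U = np.linalg.eigh(A)        # f(x) = <x, A x> = sum_i lambda_i u_i^2
for k in range(n):
    x = U[:, k]                   # a unit eigenvector
    # gradient of f restricted to the sphere: project 2Ax onto the tangent space
    grad_sphere = 2 * (A @ x - (x @ A @ x) * x)
    assert np.linalg.norm(grad_sphere) < 1e-10
print("critical points:", 2 * n)  # the pairs +-(each eigenvector)
```

The projected gradient vanishes exactly at eigenvectors, since Ax = lambda*x there; that gives the 2n critical points mentioned above.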
Alright, so a random polynomial will not be as complicated as the worst case, but it should be a little more complicated than the most trivial case, right? So what do you think: is it exponentially complex or not? And the bad news, or the good news, depending on which side of life you like, is that it's very complicated — still topologically, still exponentially, complex. So this is a line of work that started some 12 years ago in joint work of Antonio Auffinger, Jiří Černý, and myself, and then continued for a long while. But let me tell you, for instance, what the complexity is. Here the complexity is the annealed complexity: you take (1/n) log of the expected number of critical points — here I'm taking the total count. This converges to one half of log(p-1), that is, one half of the worst-case complexity, provided of course p is at least 3. So interestingly, the random case is half as complex as the worst case. But this is the important thing: you take a random polynomial of degree p and it's terribly complex. And you can say much more. So let me tell you what you can say in more detail. First, the result is that you can in fact compute this index by index. The expected number of critical points of index k — k is fixed here — with values in a set B behaves like exp(n theta_k(B)) for a certain number theta_k(B), which I can explain. Let me state it for a specific B, because that will be enough: if you take all the critical points below a level u, properly normalized, this is a certain number theta_k(u). And these functions are explicit. And how do you find these? Random matrix theory. So where is the random matrix? I told you last time: it's the Hessian. The tool — and I'll be a little more precise about this — is the Kac-Rice formula, which gives you a dictionary between this counting problem and a random matrix problem. But anyway, before that: when I say this, of course, I need a limit.
And this is an expectation, so this is the annealed version. Of course you want the quenched version too, and you have it. The quenched result consists in putting the expectation inside the log rather than in front. What I wrote, with the expectation in front, is called the annealed complexity; the quenched complexity is the same thing but with the expectation inside the log, and it is much more important. And indeed the two agree when u is small enough, which is the important regime, and I will describe that a little more. This is due to a sequence of works by Eliran Subag, which came a bit after the annealed results, and then by Subag and Zeitouni. Okay, so before I even mention what the Kac-Rice formula is — yes? [Question: out of curiosity, did someone study the analogue of this complexity for complex values?] It could be done, probably, but I'm not completely sure what it would give. First, it doesn't matter here, and also I wouldn't know: the nice thing here is that since everything is real, the natural thing for me to do is to look at level sets, which is easy. When the values are complex, I don't know what that would be. So let me give you the important picture. Of course, all that is after some normalization. The usual normalization in physics is to look at this f of x normalized as physicists do — topologically it's always the same problem, but if you look at the physics literature, this is what you will find. So it's my function as before, except that it's normalized by n^{(p-1)/2}, and x, instead of being on the unit sphere, lives on the sphere of radius square root of n — physicists like that. Which of course makes no difference in any way. So why square root of n, by the way?
It's because these models on the sphere are there to simplify the Ising case, where the spins are plus or minus 1. The real questions, initially, in physics are for spins in {-1, +1}^n, the vertices of the hypercube. But of course, if you are on {-1, +1}^n, your norm is square root of n. So they take the sphere with the same radius. That's all. But anyway, I take this normalization. Okay? So in this case you find the following. Let me draw the picture of the complexity theta_0, the one for the minima — remember, theta_0 is about minima. As a function of u it looks like this. This u is the energy; I look at the number of critical points with energy normalized by n below u. And the curve crosses zero at a negative value, which we call minus E_0. That's the picture. So what does that mean? Let me redraw it with the energy axis. Here is energy 0; of course the mean value of this function is 0, so that's the typical value. The fluctuations, if you compute them — because of this crazy normalization — are typically of order square root of n, so the typical values are in a window of size square root of n. That's why I cut the axis here. And then the low values, where the minimum lives, are on scale n: it's this E_0 multiplied by n. So what this tells you is the following: below the value minus E_0 times n, the complexity is negative, which means, by the obvious first-moment (Markov inequality) argument, that you have no critical points. Below this: no critical points. Above this: exponentially many. Okay? That's what this tells you. So it's a natural guess that this should be the value of the ground state — the minimum should be at this level. And it is.
Okay? So above this — [Question: what is the energy here?] That's f; I will say what it is later. Let's call it a spin glass. But for a mathematician, it's just a polynomial, picked at random — and then f is fixed; it's taken randomly with these J's. And you have exponentially many local minima above this level. [Question about where u sits on the picture.] Ah, okay, I understand — u is there, properly normalized. Sorry. So: the typical value of the energy is zero, the fluctuations are square root of n with this normalization, the extreme values are on scale n, and this will be the minimum, the ground state. Okay? But as soon as you're a little bit above it, you have exponentially many critical points — exponentially many minima. Now, let's look at theta_1. Theta_1 will have a shape of this form — and by the way, at some point it saturates at a certain value. So what does that mean? That means you have another energy threshold here, above the previous one. Above this level, you have saddle points of index one — theta_1, remember, was counting the number of critical points of index one — exponentially many of them. In between, you have exponentially many local minima, but no saddle points of index one or more. Which means that you have deep wells, but you have to climb at least up to here before you find another well. And you have exponentially many of these. That's a complicated landscape. Same thing if you go to theta_2, et cetera, and then I'll stop. All these thresholds saturate at a certain level here, which we'll call E_infinity. So what does that mean? Let me explain in words, because remember, my k was fixed over there. In this regime here, you have exponentially many critical points of finite index:
zero, the minima, then index one, index two, index three. And they stop here: above this, you don't have critical points of finite index. Okay? But now, if you take something near zero — above this, say in this regime, that's why I put something to break the axis; here I'm on a different scale — you find a lot more critical points: critical points of extensive index, where k is a fraction of n. So typically, when you are at a typical level of this crazy function — which doesn't look like a crazy function, it's just a polynomial — you have exponentially many critical points of extensive index, meaning that, say, a quarter of the directions go down and three quarters go up. So that's the kind of critical point you have at the typical value; but when you go deep in the landscape, you find these local minima and all that, and you find exponentially many of them. So this stupid function is crazy. Okay? All right. So how do you do that? How do you prove this type of thing? It's a very, very long story with many, many pages. I will just tell you the strategy. And when you compute everything, you find what I was saying here: not for the worst case, but for the typical case, it's half as complex as the worst case — I have no idea why it's exactly half; don't ask me, but that's what happens. But again, think of it. You have your sphere — of course, you cannot draw a sphere in very large dimension, even in your brain — a sphere in very high dimension, with exponentially many wells. The bottoms of these wells are very deep, and then you have a stratification.
Before finding the saddle points which allow you to go from one very deep well to another, you have to climb something which is pretty high. Okay? Good. So how do you do that? You use the Kac-Rice formula. This is not a new tool; it's a very, very old thing, which goes back to the 40s. Rice was, in fact, a statistician, and the original questions were in dimension one or two, not in diverging dimension. And it's a very simple formula. Let me give you the simplest version. If f is a smooth random Gaussian field — smooth, of course, not like Brownian motion; like my polynomial — then the expected number of critical points of f with value below u can be written as an integral over my manifold:

E[# critical points with f <= u] = ∫_M ∫_{v <= u} E[ |det Hess f(x)| | grad f(x) = 0, f(x) = v ] φ_x(0, v) dv dx,

where φ_x is the joint density of (grad f(x), f(x)). The zero is there because I condition on the gradient being zero — the point is critical — and I integrate over the levels v below u. Physicists don't even call it a formula; they call it trivial. But now look at the formula. It's just an integral. Here you have a very simple term, which is just a Gaussian density; everybody can compute it.
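Here is the kind of one-dimensional sanity check Rice's formula was born for — my own toy, a random trigonometric polynomial, counting zeros rather than critical points. For a stationary Gaussian f with covariance r, the expected number of zeros per unit time is (1/π)√(-r''(0)/r(0)):

```python
import numpy as np

rng = np.random.default_rng(2)
K = 3
ks = np.arange(1, K + 1)
# f(t) = sum_k a_k cos(kt) + b_k sin(kt), a,b i.i.d. N(0,1):
# r(0) = K, -r''(0) = sum k^2, so expected zeros on [0, 2pi) is:
predicted = 2 * np.sqrt((ks**2).sum() / K)

t = np.linspace(0, 2 * np.pi, 4000, endpoint=False)
C = np.cos(np.outer(ks, t))
S = np.sin(np.outer(ks, t))
counts = []
for _ in range(2000):
    a = rng.standard_normal(K)
    b = rng.standard_normal(K)
    f = a @ C + b @ S
    # count sign changes along the (periodic) grid
    counts.append(int(np.sum(f[:-1] * f[1:] < 0)) + int(f[-1] * f[0] < 0))
print(predicted, np.mean(counts))   # the two numbers should be close
```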
But the main ingredient is the other term. What rules this formula is the law of the Hessian conditioned on the point being critical, and maybe on the value. That's what you need. But look: f is Gaussian, so the couple (f, grad f) is a Gaussian thing too, and the triple (f, grad f, Hess f) is also Gaussian. So this conditioning is not too hard to do: in the Gaussian world, conditioning is just linear algebra. So you can compute this conditional law. And what is this object when the dimension is large? A random matrix — the Hessian is a real symmetric matrix, random because my function was random. So what am I doing here? Forgetting the level for a minute: I take a point, condition it to be critical, and then look at the Hessian. This is basically what I was describing yesterday when I said you had two pieces of information that were necessary. You take a point at random on your manifold, condition it to be critical, and look at what the Hessian looks like. It's obviously a real symmetric matrix, so it has a real spectrum. The question is: what is it? Can we understand it? And what do we need here? To understand the expectation of the absolute value of a random determinant. That's what we have in the formula: a random matrix, its determinant, and the absolute value. The absolute value is a pain in the neck, technically, but let's forget that. So for this, random matrix theory is nice. Why? Look at this very simple question: I have the product of the |lambda_i|, where the lambda_i are the eigenvalues of my random matrix.
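The statement that conditioning in the Gaussian world is just linear algebra is the Schur-complement formula for conditional Gaussians; a quick empirical check in two dimensions, with my own toy covariance:

```python
import numpy as np

# For a centered joint Gaussian (X, Y) with covariance S, the law of X
# given Y = y is Gaussian with
#   mean  S_xy / S_yy * y,   variance  S_xx - S_xy^2 / S_yy.
rng = np.random.default_rng(3)
S = np.array([[2.0, 0.8],
              [0.8, 1.0]])
L = np.linalg.cholesky(S)
Z = rng.standard_normal((200000, 2)) @ L.T
X, Y = Z[:, 0], Z[:, 1]

# Schur-complement prediction
cond_slope = S[0, 1] / S[1, 1]
cond_var = S[0, 0] - S[0, 1]**2 / S[1, 1]

# empirical: keep only samples with Y in a thin band around y0
y0 = 0.5
mask = np.abs(Y - y0) < 0.02
print(cond_slope * y0, X[mask].mean())   # conditional mean: predicted vs empirical
print(cond_var, X[mask].var())           # conditional variance: predicted vs empirical
```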
Taking the absolute value, this product is of course exp(n ψ(μ_n)), where μ_n = (1/n) Σ δ_{λ_i} is the empirical measure and ψ(μ) = ∫ log|λ| dμ(λ). Because this product of absolute values is the exponential of the sum of the logs, and all I do is insert an n and a 1/n. Okay? So now, if you know that this μ_n converges to some μ_infinity — that's typically a random matrix theory question — plus fast concentration, then you could hope to prove that (1/n) log of this expectation converges to ψ(μ_infinity). With that, it should be enough. Okay, I'm cheating terribly: there are hundreds of pages below the surface here. Because, first, this function ψ is not nice — when λ is close to zero or when λ is large, the log is singular. And second, you really need the fast concentration. But let's forget all these technical sides. Why in the world would we know that? Why would the Hessian at a critical point converge to something? In general, we don't know that. But here, why is it true? Because this random matrix is simple. Why? If you look at the function — remember, my random homogeneous polynomial with i.i.d. Gaussian coefficients J — it is isotropic, in the following sense: its distribution is invariant under rotations. That's very easy to see. And if the function is invariant under rotation, you may imagine that the Hessian inherits something of this. And there are not a million types of random matrix distributions that are invariant under rotation.
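As a sketch of the (1/n) log|det| computation — assuming the GOE normalization whose spectrum fills [-2, 2], for which the semicircle law gives ∫ log|λ| dσ(λ) = -1/2; the size and seed are my own choices:

```python
import numpy as np

# (1/n) log|det H| = mean of log|lambda_i| is a linear statistic of the
# empirical spectral measure; for a GOE normalized to have semicircle
# support [-2, 2], it should approach the integral -1/2.
rng = np.random.default_rng(4)
n = 1500
G = rng.standard_normal((n, n))
H = (G + G.T) / np.sqrt(2 * n)        # GOE scaling: spectrum ~ [-2, 2]
lam = np.linalg.eigvalsh(H)
print(np.log(np.abs(lam)).mean())     # close to -0.5
```

The log singularity at zero that the lecture warns about is visible here too: the accuracy is limited by the smallest eigenvalues, which is part of why the rigorous argument needs real work.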
So then you expect this random matrix to have something to do with the GOE. Remember, I introduced the GOE, the Gaussian Orthogonal Ensemble: it's the case of a Wigner matrix, which I explained yesterday, where the entries are Gaussian. Okay? In fact you don't even need this symmetry argument. You just take your function, differentiate it twice, compute the Hessian, and condition on the point being critical, maybe on the value — all that is linear algebra — and you find that your conditioned Hessian is in fact a GOE shifted. Conditioning on the value just shifts it. (Let's not condition on the index — the index is a much more complicated story; conditioning on the value simply induces a shift.) This is precisely why we are so proud: the physicists could not compute this part. For that we needed the large deviations of the top eigenvalue, which they didn't have. And when you can beat the physicists at a computation, usually you're proud — and they were enraged that they hadn't done it. But anyway: the conditioned Hessian is just a shifted GOE. And once you are in the world of the GOE, you can compute everything. The theorems I mentioned are direct consequences of the large deviation principles from yesterday. Just believe me; I could spend more time. So that's the dictionary. And why is it simple here? Because in this case, the random matrix is very simple. We've tried this in many other cases, in particular in work with Giulio Biroli and Antoine Maillard, and then many others, and recently on what is called the elastic manifold, which is a much harder model, where the random matrix theory becomes much heavier. It's not because you have a dictionary that, once you've translated from one language to the other, the problem becomes simple. But at least you have a way in, which is to study the random matrix problem.
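A crude way to see what a shift of the GOE does to the index — my own toy scaling, not the exact conditional law from the lecture: the spectrum fills [m-2, m+2], so the fraction of negative eigenvalues (the index per dimension) is macroscopic for small m and vanishes once the shift m exceeds 2, which is the shifted-GOE mechanism behind minima versus extensive-index saddles:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
G = rng.standard_normal((n, n))
goe = (G + G.T) / np.sqrt(2 * n)      # spectrum ~ semicircle on [-2, 2]
for m in [0.0, 1.0, 2.5]:
    lam = np.linalg.eigvalsh(goe + m * np.eye(n))
    print(m, np.mean(lam < 0))        # fraction of negative directions
```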
Okay, and in this model we had almost nothing to do: the random matrix theory was already done. So with this, that tells you the following: you take a random polynomial of degree p, p at least 3, and the landscape is terribly complex. Okay? Let me explain a little more why this was interesting. Here, I justified it from a pure math point of view: I take a polynomial, which is a simple function, and I take it randomly. But of course, this model has a name in physics. This f is in fact the Hamiltonian — the energy function, if you want — of what is called the spherical p-spin model, which has been studied at length by physicists: Parisi, Mézard, and many others — Cugliandolo, Kurchan, Crisanti, Sommers; there's a long story. So what does that mean? What are the physicists looking at? They are looking at the Gibbs measure. They are not directly interested in the complexity; they're interested in the equilibrium measure, which is mu_beta on the sphere of radius square root of n, proportional to exp(-beta f(x)) dx, where dx is the normalized uniform measure on the sphere, f is this Hamiltonian, and beta is the inverse temperature. Okay? So in physics, the important thing is to understand the behavior of this measure. Now, if you have a function which is really complicated topologically, maybe the Gibbs measure will be too — and that is what can be proven once you understand this complicated topology. The way physics usually studies things like this, it starts with beta equal to zero, which means very high temperature. At very high temperature, this is supposed to be trivial: if beta is very close to zero, this should be like the uniform measure on the sphere — not very fascinating. At beta very large, it should concentrate near the bottom of this landscape, and it becomes much more interesting and complicated.
So what do you do first when you study this thing in physics? You compute the free energy. The first thing you have to understand is the free energy — and I should mention that I'm doing this because of a lunch conversation, so I'm pointing at it. You look at the partition function Z_n, which is simply the integral of exp(-beta f(x)) over the sphere. Then you want the limit of (1/n) log Z_n — let's say the quenched version, with the expectation of the log. And this limit is given by a variational problem: this is called the Parisi formula, which goes back more than 40 years. It's the infimum of a certain functional, which I will not write, depending on the temperature, defined on the space of probability measures on [0, 1]. Okay? And with this type of tool, what the physicists show is that there is a phase transition — so they understand the Gibbs measure, so they say. When the temperature is high enough — beta smaller than a certain critical beta — you are in what is called the replica symmetric phase, where the optimal mu is a Dirac mass at zero. And when beta is larger than the critical beta, you are in a one-step replica symmetry breaking phase, where the optimal mu is a combination of two Dirac masses. All right? That's the physics story. And with the topological picture, we can understand much more. In fact, what is this measure mu in this picture? It is supposed to be what the physicists call the order parameter: you take two replicas, which means you sample two points independently under the Gibbs measure, and you look at their distance.
Since you're on the sphere, looking at their distance or looking at their inner product is the same thing. Normalized by n, we call that R_n. That's called the overlap — the overlap of the two replicas. So let's try to understand what's going on here. If you are at very high temperature, say beta almost zero, your Gibbs measure is like the uniform measure. If you take two points at random on the sphere, their overlap will be zero: two random points are orthogonal. That's what I explained yesterday — everything is on the equator. So the distribution of this overlap converges to the optimal mu. That's the order parameter: the limiting distribution of the overlap of two replicas. So, in particular, this explains the picture: at high temperature, the replicas are orthogonal; at low temperature, they are not. And what is compatible with this picture? At low temperature, when you take two replicas, their inner product is either zero — they are orthogonal — or some fixed number. How can that be? That's the one-step replica symmetry breaking. So — sorry, I will now draw the sphere in very large dimension on the board. Let me explain that in a moment; that's what this thing is supposed to mean. And now, what do we understand from the topological point of view? How is it compatible with that, and how can it even be improved? Let me do this picture again. On the sphere, I have exponentially many critical points, exponentially many local minima. So remember the picture I just erased: here you have energy zero; here, where I break the axis, minus E_infinity times n; and here the minimum, minus E_0 times n. Below this level you have nothing — this is the minimum. As soon as you climb a little bit above that, you have a gazillion wells. Right?
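The "everything is on the equator" statement is easy to see numerically: two independent uniform points on the sphere of radius √n have overlap R_n = ⟨x, y⟩/n of order 1/√n (the sizes below are my own toy choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000

def sphere_point(n):
    g = rng.standard_normal(n)
    return np.sqrt(n) * g / np.linalg.norm(g)   # uniform on radius-sqrt(n) sphere

overlaps = [sphere_point(n) @ sphere_point(n) / n for _ in range(500)]
print(np.mean(overlaps), np.std(overlaps))      # ~0, with spread ~1/sqrt(n)
```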
So when I'm at low enough temperature that the Gibbs measure concentrates in this regime, it's easy to believe that it concentrates in those wells. Right? When you are low enough in temperature — there's a value of the temperature which corresponds to this level — then, incredibly, even though you have exponentially many wells, the number of wells that actually contribute to the Gibbs measure is essentially finite: large, but not exponentially large. All of that can be proven. For those who know a little probability, their weights form a Poisson–Dirichlet process. So here is the picture: at very low temperature you have wells — okay, I cannot draw in 3D, but something like this — and the Gibbs measure is concentrated on rings inside those wells. When you fix an energy level, say here, the Gibbs measure concentrates on things like this. Okay? And you can control the statistics of all this. So, in this picture, why does mu opt, the distribution of the overlap, have two Dirac masses? Very simple. Take two replicas under the Gibbs measure. Two things can happen. Either they are in two different rings; then their overlap is zero, because the centers of those wells are orthogonal. That corresponds to the Dirac mass at zero. Or they are in the same ring; then their overlap is that fixed nonzero value. Why nonzero? Because — the picture here is misleading — each ring is itself a very high-dimensional sphere. The full sphere has dimension n minus one; each ring has dimension n minus two, still a very large sphere. So two points there are again essentially random points on a sphere, but now, conditioned to lie in the same ring, their overlap is nonzero. Right?
So that's exactly what creates this story. Okay? Now, these rings are more or less symmetric. (Question:) Do they all have the same structure? — Yes, that can be shown too: they are at the same height, and one can prove that the bottoms of the wells they belong to are essentially at the same height as well, and the Hessians are also determined. So they are very similar. Their masses are not exactly the same — they are of the same order of magnitude, and in fact the masses, ordered from the heaviest to the second heaviest and so on, form a Poisson–Dirichlet process. I don't want to get there. Okay. So this is the picture the physicists had, but then there is something else. There is another phase here. This phase is one-step replica symmetry breaking — I just explained it. This other phase is supposed to be replica symmetric. Nevertheless, in this phase you still have plenty of wells. This is what is called the shattering phase, and it has been studied by many physicists; if you want a recent paper, there is one by Aukosh Jagannath and myself. In this phase, what happens is the following: the Gibbs measure is carried by an exponentially large number of such rings. So the measure is completely shattered into exponentially many pieces. But still, the order parameter is just the Dirac mass at zero. Why don't you detect it in the Parisi formula? Because if you have exponentially many pieces, when you take two points at random they will essentially always land in two different ones, so their overlap will always be zero. But still, the structure of the Gibbs measure is not at all like the uniform measure: it is full of these shattered pieces.
And in this shattered phase, the order parameter is trivial, but, for instance, the dynamics are not. You can understand that this landscape, compared to a completely flat landscape, is very different: the dynamics are super slow — exponentially slow. So now it's time to tell you a little about dynamics, because that's what we want to do in the end. What is the dynamics here? You have this function, and let's come back to my initial problem, which is the dynamics of optimization. So you understand that there is this phase which is statically trivial — trivial for the Parisi formula — but not completely trivial, because of this shattering. For those who know the random energy model, this phase of course exists there too. So, dynamics. We could study gradient descent, the gradient flow, for this. But of course, if you're in physics, you don't study gradient flow first — and even less stochastic gradient descent; you could study it, but it would make no sense in physics. So you take the Langevin dynamics. What is Langevin dynamics? Essentially, you follow the negative gradient on the sphere plus noise: dx_t = minus beta grad f(x_t) dt + dB_t. This is a slight abuse of notation, because I write it as if B were a Brownian motion on R^n; it is a Brownian motion on the sphere. This is built to have the Gibbs measure as an invariant measure — maybe with a factor of square root of 2 here, depending on how you normalize Brownian motion, but let's forget that. And you start at x_0, distributed at time 0 like some mu_0, which is whatever you want — say the uniform measure on the sphere. (Question:) So you don't have full replica symmetry breaking here? — No: for the pure p-spin you don't have full RSB. If you take a mixture of p-spins you can, but I don't want to get there.
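The Langevin dynamics just described can be sketched numerically. This is my own toy discretization, not anything from the lecture: a crude Euler scheme that re-projects to the sphere after each step instead of using the intrinsic spherical Brownian motion, for the pure 3-spin Hamiltonian, with illustrative parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 3
beta, dt, steps = 1.0, 2e-3, 3000  # illustrative temperature, step, horizon

# Random homogeneous polynomial of degree 3 (pure 3-spin Hamiltonian):
# H(x) = n^{-(p-1)/2} * sum_{ijk} J_ijk x_i x_j x_k, x on the sphere of radius sqrt(n).
J = rng.standard_normal((n, n, n))

def H(x):
    return np.einsum('ijk,i,j,k->', J, x, x, x) / n ** ((p - 1) / 2)

def gradH(x):
    # Gradient of H: three contractions because J is not symmetrized.
    g = (np.einsum('ijk,j,k->i', J, x, x)
         + np.einsum('ijk,i,k->j', J, x, x)
         + np.einsum('ijk,i,j->k', J, x, x))
    return g / n ** ((p - 1) / 2)

# Euler discretization of dx = -beta * gradH dt + sqrt(2) dB,
# projected back to the sphere after each step.
x = rng.standard_normal(n); x *= np.sqrt(n) / np.linalg.norm(x)
for _ in range(steps):
    x = x - dt * beta * gradH(x) + np.sqrt(2 * dt) * rng.standard_normal(n)
    x *= np.sqrt(n) / np.linalg.norm(x)
print("energy per site:", H(x) / n)
```

At this (high) temperature the chain relaxes quickly and the energy per site settles around minus beta; at low temperature, as the lecture explains, the same scheme gets trapped in wells for exponentially long times.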
But here, this is the simple case of one-step replica symmetry breaking. What Thierry is alluding to is that you could imagine models where mu opt, instead of having two atoms, has three atoms — that would be called two-step replica symmetry breaking — or an infinite number of atoms, and even a continuous part; that is called full replica symmetry breaking. Here, this does not happen. But if you want it to happen, you don't have to go very far: take a random polynomial of degree 3 and a random polynomial of degree 16, mix them properly, and there is a region of parameters where it's full. So it's not crazy, right? So, the question about dynamics: the law of x_t, of course, converges to the Gibbs measure — that's what Langevin dynamics does for you. So you could ask: is it fast or not? You could look, for instance, at the mixing time. So, how fast? There are three temperatures. When the temperature is less than T_s, which corresponds to the static transition, this is exponentially slow: the time to reach equilibrium is exponential in n. When T is between T_s and what I call T_infinity — this T_d — it is still exponentially slow, but now for the relaxation time. What is the relaxation time? When you do dynamics, you have to choose your initial condition. If you are a computer scientist, you choose mu_0 to be the worst possible — that's what the mixing time measures. If you're a physicist, you don't care about worst cases; you want to do something feasible, so you choose mu_0 uniform on the sphere. That's the relaxation time. And that's the important notion for optimization and statistics, because in statistics, you don't try to make your own algorithm not work.
So you start from something random; then the relaxation time is exponentially slow. And then there is another temperature. This T_infinity is very often called T_d, d for dynamical — the s was for static. And then there is a conjecture: there is yet another temperature, higher than T_d, called T_BBM, for Barrat, Burioni and Mézard. Below it, the mixing time is exponentially slow; above it, the mixing time is fast — order one. Okay, so there is a temperature above which, wherever you start, you converge fast; below it, you converge slowly if you start from a bad place. Then, between T_d and T_BBM, the relaxation time is fast but the mixing time can be long. And below T_d, even the relaxation time is slow. Why do you have this second temperature? Because the static temperature corresponds to this energy level, but up to this higher level, you still have plenty of wells. So when you start, say, with the uniform measure, you start at zero energy, from high above, and you come down. Very close to here, you will encounter wells that block you. That's the level where your relaxation time becomes terrible. The BBM story is even worse: there are very unlikely wells which start here and reach very high, and if you are unlucky enough to start in one of those terrible things, then you will also fail to converge to the equilibrium measure, which lives lower. Anyway, that's the story. The important thing — and all of this is a very long story — is that the convergence to equilibrium is exponentially slow. There are many more things to say: the system ages, how it moves from one piece of the Gibbs measure to another, etc. That's the story of this polynomial.
So you see, just a simple function can make life difficult. And of course this function is, as I said, the simplest example of a spin glass. Here I illustrated how one computes something about the statics; I didn't tell you how to do the dynamics. But let's forget that for a moment, take a breath, and ask: why do we care? How is this related to the question I asked? So: the link with high-dimensional statistics. Let's take again the example I gave you yesterday: tensor PCA, or spiked tensors. That's a statistics problem — nothing to do with physics a priori. Remember: I have an unknown vector u on the sphere, which I want to detect or estimate, and I have access to it through a collection of data of the form lambda u^{tensor p} plus noise. Here I have a sample of size m. Lambda is a number, the signal-to-noise ratio: the larger lambda, the more signal you have. And the noise terms are i.i.d. standard Gaussian p-tensors — p-tensors in dimension n. Say p equals 3: what is a 3-tensor in dimension n? Just a cube of numbers of size n. If p equals 2 it's a matrix of numbers; if p equals 3 it's a cube. And I take the simplest possible noise structure: all entries standard Gaussian. Okay? That's the problem. Now the question: this is my sample, my data, and my task is to estimate u. You can imagine that the larger lambda is, and the larger m is, the easier it should be. So here's the story. This very simple problem has inspired a gazillion papers — because, as you'll see, it connects to things we can understand. First, there are three thresholds.
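To make the model concrete, here is a toy simulation of the spiked tensor observation model for p = 3 (the parameter values and the "matched filter" score are my own illustration, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam, m = 30, 5.0, 10  # dimension, signal-to-noise ratio, sample size

# Hidden signal: a uniform point on the unit sphere.
u = rng.standard_normal(n); u /= np.linalg.norm(u)

def spiked_tensor(u, lam, rng):
    # One observation T = lam * u^{(x)3} + W, W an i.i.d. standard Gaussian 3-tensor.
    signal = lam * np.einsum('i,j,k->ijk', u, u, u)
    noise = rng.standard_normal((len(u),) * 3)
    return signal + noise

sample = [spiked_tensor(u, lam, rng) for _ in range(m)]
# If an oracle handed us u, the score <T, u^{(x)3}> would be lam + N(0, 1):
scores = [np.einsum('ijk,i,j,k->', T, u, u, u) for T in sample]
print(np.mean(scores))  # close to lam when the signal is present
```

Of course the whole point of the problem is that we do not know u; this oracle score only illustrates what "signal strength lambda" means in the model.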
Let me give you the story. Of course, this is directly related to what I just described: the spin glass story is hidden somewhere — plus the BBP transition. So let me explain the three thresholds. The first one is the information-theoretic threshold, as I explained before, which is also called detection. What is the detection question? I observe these m tensors, T_1, ..., T_m — that's my data. Of course, I don't have the decomposition, I don't have u, otherwise we'd be done. And I'm wondering: just from those data points, can I detect whether there is a signal at all, or whether it's pure noise? The natural translation of this is to look at the total variation distance between the law with a signal and the law without. So let me call P_lambda the law of one such tensor: the law of T = lambda u^{tensor p} + W, where W is a standard Gaussian p-tensor of size n. And P_0 is the law without signal: the distribution of the purely noisy Gaussian p-tensor. Lambda is the strength of the signal. You look at the total variation distance between P_lambda and P_0. What is the total variation distance? I think you all know: it is the sup, over continuous functions F bounded by one, of the absolute value of the integral of F dP_lambda minus the integral of F dP_0. It means: can I construct a function — a statistic, as statisticians would say, an observable, as physicists would say — that behaves differently under the two distributions? Can I detect?
Detection is possible when this total variation distance is not too small. And the result is: this distance goes to zero if lambda is less than a certain threshold — which of course depends on the normalization, on m, etc., and is easy to find, but let me just call it lambda_IT. So there is a threshold below which the distance goes to zero, which means no detection is possible. If the signal is too small, you just can't detect: there is no difference between the model with signal and pure noise. And if lambda is larger than the threshold — in fact there is a small gap between the two, a tiny log, but let's forget that — then you can detect, because the total variation distance does not go to zero. So detection is possible: there exists an observable, a statistic, something you can do — but you don't necessarily know what to do. This is an important phase. It tells you that there are situations — this would never happen in dimension three, but in high dimension — where the signal is completely lost. You can't even detect. It's not that your algorithm doesn't work: there is no algorithm, no statistical procedure, nothing to do; just go back home. (Of course, a statistician always knows what to do.) This threshold, as I said, depends on m and so on. And it's intuitively clear — and can be made precise — that if you increase the number of data points, at some point you will be above the threshold. If you observe your system more, you learn more, and at some point you'll be able to detect. How does it depend on p? It also depends on p, yes; there is an explicit formula.
There is a square root of p log p somewhere, whatever. If you want all of that, the best way is to go back to the paper of Richard and Montanari — that was 2014, and then many, many papers; there's a long literature. I would also quote the paper by Afonso Bandeira and his collaborators, a little later, and many more. Okay. Now assume you are in a situation where you can learn — you are above this threshold — so now you can try to solve the problem, to estimate u. Then you choose your statistical procedure, and there are many. For instance, Bandeira uses what is called sum of squares. There are a gazillion statistical ways to try to do this. But we are ignorant in statistics: remember, I told you we only need Statistics 101. In Statistics 101 you learn two things: the least squares method and maximum likelihood. So let's stick with basic stuff and try the maximum likelihood estimator. Why do I want to try maximum likelihood? Because the noise is Gaussian — I assumed Gaussian precisely for this to be easy. If the noise is Gaussian, the whole thing is Gaussian, and I can write the density. When I take the negative log-likelihood, the exponential in the density disappears, and what do I get? A square, of course. I let you do this computation: the negative log-likelihood of this problem — I already mentioned it yesterday — is exactly our spin glass model plus a signal term. Why? Where did the spin glass come from? From this W. W is a p-tensor, with entries W_{i_1...i_p}, and when you expand, you get the sum over i_1, ..., i_p of W_{i_1...i_p} x_{i_1} ... x_{i_p} — which is exactly our random function, properly normalized. I'm not being careful about normalization.
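Filling in the computation left to the audience (up to the normalizations the lecture is deliberately not careful about either): with one observation $T=\lambda\,u^{\otimes p}+W$ and a candidate $x$ on the sphere, the Gaussian negative log-likelihood is

```latex
-\log L(x)\;\propto\;\|T-\lambda x^{\otimes p}\|^{2}
=\|T\|^{2}-2\lambda\,\langle T,x^{\otimes p}\rangle+\lambda^{2}\|x^{\otimes p}\|^{2}.
```

On the sphere the first and last terms are constant in $x$, so minimizing it amounts to maximizing

```latex
\langle T,x^{\otimes p}\rangle
=\lambda\,\langle u,x\rangle^{p}
+\sum_{i_1,\dots,i_p} W_{i_1\cdots i_p}\,x_{i_1}\cdots x_{i_p},
```

that is, exactly the random homogeneous polynomial (the spin glass) plus the rank-one signal term.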
So, taking u equal to e_1 just to simplify: the negative log-likelihood I'm trying to minimize is this — the random polynomial minus lambda times x_1 to the power p. Again, this lambda may be normalized by something depending on n and m, but I don't want to waste your energy on that. Now let's look at it. What do I have here? A mixture of this crazy random polynomial — exponentially complex — and the stupidest polynomial, not complex at all. So it's not hard to imagine that, depending on lambda: when lambda is very close to zero, the mess created by the random part wins; when lambda is large enough, maybe the signal part can do something. So you have another threshold, lambda_MLE. When lambda is larger than it — and I won't tell you how it scales with m and so on — what happens is that the estimator you obtain by MLE is better than random. What does that mean? Let me draw this eternal picture. Here is e_1 — the direction of u, the vector I want to find. Here I am not talking about dynamics or algorithms; I'm looking at this maximum likelihood function and asking where its minimum is. The maximum likelihood estimator is the point where this thing is minimal. Now, where is it? Below the threshold, the minimizer is like a random point, at the equator. Above the threshold, your estimator lands in this hemisphere — positively correlated with the signal. When you hear that, you may say: not very good. But remember my other picture: the caps near the poles have very, very small mass, exponentially small mass. The signal you want to find is here, so the fact that your estimator gets into its hemisphere at all is not that bad.
And then, of course, if you let lambda go to infinity, the maximum likelihood estimator converges to the truth. If you strengthen the signal enough, the estimator climbs all the way to the north pole. So what is this threshold? It happens here that the MLE threshold is very close to the IT threshold — essentially the same when n is large. So as soon as anything works, MLE works, okay? All of this consists of theorems that have to be proved — a long literature, but believe me. And again, look at Richard and Montanari if you want. So now we have two thresholds which are essentially the same here: below, nobody can do anything; above, many statistical procedures could work, and MLE, the simplest possible, does work. Essentially — there's a tiny log again, a tiny gap which is probably an artifact of the proofs, and people have worked hard on it. (Question:) Is it always obvious that there is a monotonicity principle, so that the threshold is a single point and not an interval? — No, it's not obvious, but here it's true. Here it's in fact easy; in some other models it's not clear. So now the third step: the computational, algorithmic threshold. But before I go there, let me explain something about this story. Remember: without a signal, the function is incredibly, exponentially complex — all sorts of wells, terrible. With a signal, look near the equator. At the equator you are on a sphere of dimension essentially n minus 2 instead of n minus 1 — still a very high-dimensional sphere. This is what I drew in the picture. And on the equator, the signal term is zero.
So on the equator, the system is exactly the spin glass we had before; on or near the equator, the system is extremely complex. That's why I drew the picture like this: the equator is full of wells and mess. So now, algorithms. You could do Langevin; you could do gradient flow; and I will talk about SGD. But let me talk about Langevin first. With the Langevin dynamics, we understood that if you stay stuck at the equator, the dynamics is essentially the one I described before: near the equator the signal term is negligible and you see essentially a pure spin glass. And you start at random — the relaxation time, that's what I'm talking about, because in statistics you don't know where to start. Of course, there are people studying what is called a warm start. What is a warm start? You start already in the right hemisphere. And that's crazy — I did publish things on warm starts, but it's crazy, because it means you already know roughly where the signal is: you start in one of those tiny regions of exponentially small measure. It's very hard to have a warm start. If you have one, things are simple. But with a random start, you start at the equator. So look at the gradient. The gradient is the gradient of the spin glass mess, plus the signal part, which is essentially proportional to x_1 to the power p minus 1. So the signal part of the gradient also vanishes at the equator: the signal is very weak there. Second derivative? Proportional to x_1 to the p minus 2 — still zero at the equator. The signal is far too weak to extract you from the mess at the equator. But once the signal begins to act and you begin to move north, you leave the zone where things are terrible, and then things should be simple. So the difficulty here is: you have to beat the entropy of the equator.
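The weakness of the signal at the equator can be written in one line. In coordinates where the signal $u$ is the north pole $e_1$, the signal part of the landscape is $\lambda x_1^{p}$, so

```latex
\partial_{x_1}\bigl(\lambda x_1^{p}\bigr)=\lambda p\,x_1^{p-1},
\qquad
\partial_{x_1}^{2}\bigl(\lambda x_1^{p}\bigr)=\lambda p(p-1)\,x_1^{p-2},
```

and for $p\ge 3$ both vanish on the equator $\{x_1=0\}$: the pull toward the pole only switches on once you already have a macroscopic latitude.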
You have a signal to help you do that, but it's very weak. And if it doesn't work, the big bad wolf of the wells will catch you, and then you'll spend an exponential time. Okay? That's the story, and it's what one sees for any of these dynamics — Langevin, gradient flow, stochastic gradient descent. So what is the threshold? There is an algorithmic threshold, and with the same coordinates it is of order n to the power (p minus 2) over 2 times lambda_IT. Which means you need much, much more signal, or much, much more data: there's a big gap. When p equals 3, the exponent is one half: say lambda_IT is 1; then you need a signal-to-noise ratio a factor of square root n larger for your dynamics to begin moving. If you are in between, your statistical procedure works perfectly in principle, but you just cannot find the minimum. Okay? What do I mean by that? It has to be explained, but it's the same behavior for the three dynamics — all the local dynamics. There is also another class, mentioned yesterday, which comes from physics — from Mézard, in fact, and Florent Krzakala, Lenka Zdeborová, and their school: Approximate Message Passing. That's a different thing; these are non-local dynamics. And this is what is called the gap — the thing theoretical computer scientists love. When you are in the window between lambda_IT and the algorithmic threshold, the model is statistically solvable, but nobody knows how to solve it efficiently. Now, if you take the non-local algorithms, they all stop too — they all block at n to the (p minus 2) over 4. So: better. I will give you a few examples, but let me tell you again, there's still a gap. So let's take p equal 3. By the way, nobody wants to try a gazillion different algorithms in statistics; in statistics everybody does SGD, and that's it. So that's the important thing.
And also because, algorithmically, SGD is doable, while some of these other constructions are not easily practical. But still, even with the non-local algorithms, you have a gap: between lambda_IT and lambda_IT multiplied by n to the one fourth, let's say for p equal 3, nobody knows how to make anything work. And of course, if you talk to theoretical computer scientists, they will tell you there are good reasons this gap cannot be filled — even though we now have at least four or five completely different non-local algorithms, and they all fail at the same level. (Question:) Does that mean that in practice the algorithm needs more knowledge than we can put in theoretically? — No, we don't really know the reason for that. Really, we don't; it's a mystery. But note: the reason for the hardness of the local dynamics is not the bad wells by themselves, it's the entropy. Think of it like this: if you don't have enough signal-to-noise ratio, you spend too much time near the equator because of the entropy, and then the big bad wolf eats you. Whereas when you have enough signal-to-noise ratio, you escape this zone, and then you don't see any complexity. So the real problem is the entropy, plus the complexity to bog you down. But for the non-local algorithms, there is not even a trajectory of a gradient or anything like that. So let me give you examples. There is the sum-of-squares method — SOS, degree 4 — that's what Afonso Bandeira did. There is a method called replicated dynamics: you run many replicas of the dynamics at the same time, and they talk to each other — one tells the others, I begin to see a well here, don't come here, this type of thing. Super complicated; certainly not something you want to do in practice. And here is another, very simple one: unfolding.
Tensor unfolding was proposed by Richard and Montanari a long time ago, and was analyzed recently by Jiaoyuan Wang and myself. So let me explain what it is, and you'll see it's not at all a dynamical algorithm — so the reason all these methods stop at the same place is mysterious. Unfolding is the following. You have your random tensor; say p equal 3, so it's a cube. You unfold it: you reshape the cube into a very long matrix. Instead of n by n by n numbers, you view them as an n-squared by n matrix. That's a very long matrix. And now you look at its singular values: you multiply the transpose of this very long matrix by the matrix itself, and you get a symmetric matrix of size n. And now, how do you see a signal in a real symmetric matrix? That's called the BBP transition. You have to extend the original BBP transition context to this setting — which we did — and you find that the BBP transition happens exactly at the same threshold, even for this unfolding of tensors. So you see: when I describe this unfolding, it's algebraic, not even an algorithm in the sense of following a trajectory or whatever, and still it blocks at the same place. Everybody tries a new method, and everything blocks at the same threshold — which makes people suggest there is a good reason for it, but we don't capture it. I don't know if you have an idea. Okay, so these are the different algorithms. Now, what's the reason behind all this? Let's try to understand. All of this is proven in the series of papers by the authors I already mentioned, but let me try to explain it in words quickly. So again, the problem is: first, notice that this function here is what I called before the empirical risk — and there is also the true risk, the population risk, the population loss.
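The unfolding recipe fits in a few lines of numpy. This is my own toy instance (lambda is deliberately set well above the n^{1/4}-type threshold so that a single run shows the BBP-style outlier and an informative singular vector):

```python
import numpy as np

rng = np.random.default_rng(3)
n, lam = 40, 60.0  # illustrative dimension and (comfortably large) SNR

# Spiked 3-tensor: T = lam * u^{(x)3} + Gaussian noise.
u = rng.standard_normal(n); u /= np.linalg.norm(u)
T = lam * np.einsum('i,j,k->ijk', u, u, u) + rng.standard_normal((n, n, n))

# Unfold the cube into an n^2-by-n matrix and take its top right singular vector;
# above the BBP-type threshold it correlates with the hidden signal u.
M = T.reshape(n * n, n)
_, s, Vt = np.linalg.svd(M, full_matrices=False)
v = Vt[0]
print("overlap with the signal:", abs(v @ u))
```

Note how static this is: one reshape and one SVD, no trajectory at all — and yet, as the lecture says, it stops working at the same n^{(p-2)/4} scale as the other non-local methods.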
The population loss is the expectation of the empirical risk. What is it? The true loss — the thing you really want to optimize, whether with gradient descent, SGD or whatever. The expectation of my random term is zero: the random polynomial is centered. So the population loss is just the signal term. The empirical risk is this random mess plus something perfectly one-dimensional and deterministic. So now let me explain things for SGD. You all know what stochastic gradient descent is, but let me write it: my parameter x at time ell is the one at time ell minus one, minus a time step delta times the gradient in x of the loss L at x at time ell minus one and the current data point, projected back to the sphere. The general context is: I have a loss function L(x, y), where x is the parameter — the value I called theta before — and y is the data; and the population loss phi is the expectation over y. This update is one step of online stochastic gradient descent. What would you want to do, of course, if you could? You would follow the gradient flow of phi, because you want to find the minimum of phi — here, we want to find the minimum of this, or the maximum of that, same thing: you want to find the north pole. But why can't you do gradient descent on phi? Because in general you don't know phi: the distribution of the data is unknown, so you don't know this expectation — you don't know that the point you're looking for on the sphere is the north pole. What you have access to are data. So one way, of course, is to look at the empirical risk: if you have a sample of size m of data y_i, you average the loss over the sample — that's the empirical risk.
And again, when m is very, very large, the empirical risk should be close to the population risk. So one strategy is to do gradient descent, or Langevin, for the empirical risk — that's one method, and the results are the same. Another way is to do SGD, and I will give you the online version. Langevin for this you know what it is: I give you a function, and Langevin is just minus the gradient of it plus Brownian motion. We've tried everything here; there's a paper for each of these cases. But let me explain SGD, because SGD is really what is used in practice. Online means the following: every time a data point comes in, I use it to define my gradient, take one gradient step, and continue. That's the simplest possible version. There are more sophisticated versions of SGD where you do what is called a batch: instead of taking one y_ell here, you take a sum over some of the y's — maybe you average over 10 of them. Why not do a little averaging? If you average over the whole sample, then you are taking the gradient of the empirical risk — you are doing exactly gradient descent on the empirical risk. So taking the largest possible batch is gradient descent; taking the smallest possible batch, size one, is called online SGD. There are things in between, but let me treat this extreme case. Why did I want to study online SGD, expecting it to behave differently from plain gradient descent? Because I was told so by the people doing it for the big guys — Google, Apple, Facebook, whatever. They tell me: no, no, SGD works better than gradient descent. So I wanted to explore that. And they're right. So let's see. (Question:) Is the advantage just computational, so that you don't have to do the sum? — Computationally it is also easier, for one obvious reason: remember that when you're really doing something serious, the size of each data point is enormous.
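The online SGD just described, specialized to the spiked tensor loss, can be sketched as follows. This is my own toy instance: each step consumes one fresh sample, and lambda is chosen comfortably above the local algorithmic threshold of order sqrt(n), so that a single run escapes the equator and climbs toward the pole.

```python
import numpy as np

rng = np.random.default_rng(5)
n, lam = 30, 50.0            # lam well above the ~sqrt(n) local threshold
delta, steps = 0.005, 1500   # step size and number of online samples

# Hidden signal on the unit sphere.
u = rng.standard_normal(n); u /= np.linalg.norm(u)
signal = np.einsum('i,j,k->ijk', u, u, u)

def grad_inner(T, x):
    # Gradient in x of <T, x (x) x (x) x> for a (non-symmetrized) 3-tensor T.
    return (np.einsum('ijk,j,k->i', T, x, x)
            + np.einsum('ijk,i,k->j', T, x, x)
            + np.einsum('ijk,i,j->k', T, x, x))

# Online SGD on the sphere for the per-sample loss L(x, T) = -<T, x^{(x)3}>:
x = rng.standard_normal(n); x /= np.linalg.norm(x)
for _ in range(steps):
    T = lam * signal + rng.standard_normal((n, n, n))  # one fresh sample
    x = x + delta * grad_inner(T, x)  # descend L, i.e. ascend <T, x^{(x)3}>
    x /= np.linalg.norm(x)            # project back to the sphere
print("latitude <u, x>:", x @ u)
```

Since p = 3 is odd, the signal drift is proportional to 3 * lam * <u, x>^2 * u, which points toward the pole whichever hemisphere you start in; with lambda closer to the threshold the same run stays diffusing near the equator.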
Each one is an image or something, so you don't want to store too many of them. So that's also a reason, you're right. But let's even forget that: suppose you are super rich, you have the largest computer on the planet, you can do that. Still, let's study this case. So you have two things to choose: the starting point, where of course we will start at random, again because we have non-informative priors, no hot start; and the step size, which is up to you. So why is SGD... SGD is practical like that, but what do we know about the performance of SGD? It's a very long story. It goes back to 1951, Robbins and Monro, and then it has been studied ad infinitum; understanding SGD is essentially the whole curriculum of optimization. If you want a very good book, there is the book by Nesterov, which is mainly about the convex case. Most of the optimization you learn in an engineering school or in a math class is about the convex case, which of course is totally irrelevant here, right? Because, as you have seen, our function is super non-convex. If you want a very nice course about this, there are the lectures by Michel Benaïm, I think from 1999, a series of lectures from which you can learn about SGD. Again, I just chose three things I know, but there are literally thousands of papers. But all of that is in finite, fixed dimension, not what we have here, where the dimension diverges. So what is the lesson you learn from all this story? That in good circumstances, SGD, whether you take it online, with small batches, or with the other variants you can imagine, converges: the SGD trajectory converges to the gradient flow of the population loss, the true loss, the one I called φ. And then you have more than that, of course.
That's the law-of-large-numbers type statement. And then you have fluctuations, central limit theorems, and even large deviations, et cetera. So this is very much studied, okay? So, of course, this thing will converge to the gradient flow of the population loss, and that gradient flow could be terribly complicated; all sorts of things could happen. Here it's not the case, right? So that's the story in fixed dimension. But what about what we are doing here, which is diverging dimension? There, there is not much literature, only the recent literature, right? People like the ones I mentioned, Lenka Zdeborová, ourselves, and many others, many people in computer science. So what if the dimension is large, goes to infinity? Then you cannot really apply those results. Is it still true that you can approximate the SGD trajectory by the gradient flow of the population loss? You don't know. Now, the example I gave you here is a very simple thing, because in the example we have, remember, our function φ(x) is a very simple function: it's just a function of the inner product with one vector. It's a one-dimensional thing, right? So if you had something like convergence to the gradient flow of this, what you would end up with is essentially a dynamical system in dimension one. Just look at the latitude: if you think of this vector as the north pole, this inner product is the latitude, and you look at how the latitude moves. You start with latitude zero and you want to end up at latitude one. So that's a dynamical system; you will find something in dimension one, right? And indeed, when you look at this dynamical system, because, as I said, p is at least three, what is it? You have one here, zero here; that's this coordinate x_1. You start very close to here. But your function φ vanishes here; its derivative vanishes.
Its second derivative vanishes too, right? So of course it takes time to move away. How much time? You can guess: n to the (p-2)/2. Let's guess it, right? It's trivial to guess, painful to prove as always, but very easy to guess. Let's write the gradient flow for the latitude: dx_1/dt = λ p x_1^(p-1), up to constants (the gradient of λ x_1^p), and I start from x_1(0). I'm sure you can solve this equation. But now, how much time does it take for me to get to a high latitude, to one or close to one? Easy, in fact, because the x_1(0) I start from is essentially 1/√n, remember? Because I was taking a point at random, and a point taken at random on a high-dimensional sphere is at distance 1/√n from the equator: its coordinate in this direction is 1/√n. And if you look at that, you see that the time it takes to move up is precisely this 1/√n raised to the proper power, which is -(p-2), and then you get n^((p-2)/2). That's all, right? It's the time for this simple differential equation to move away from here, okay? Now, just a word of caution. That's why you need this: if you want this to happen in short time, you need the λ to be very large, right? Of course, this time depends on λ. So if you take your λ growing like n^((p-2)/2), you will do it in finite time. If your λ is not growing, then it will take you a polynomial time to escape. But remember, even if your λ is very small, not growing, you will still end up at the north pole, always; simply, the time to do it becomes exponential: the time to escape the equator will be exponential, it will take a very long time, right? But the question we are asking here is: I don't want to get to the signal in exponential time, I want to get there in polynomial time, right? Or finite time, okay?
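You can check the guess numerically. The sketch below is my own, under the lecture's simplification that the latitude follows dx_1/dt = λ p x_1^(p-1) up to constants: it integrates this one-dimensional flow from the typical random starting latitude 1/√n and measures the time to escape the equator.

```python
import numpy as np

def escape_time(n, p, lam=1.0, dt=1e-3, target=0.5):
    """Euler-integrate the latitude flow dm/dt = lam * p * m**(p-1),
    starting from m(0) = 1/sqrt(n), the typical latitude of a uniformly
    random point on the n-sphere; return the time to reach `target`."""
    m, t = 1.0 / np.sqrt(n), 0.0
    while m < target:
        m += dt * lam * p * m ** (p - 1)
        t += dt
    return t
```

For p = 3 the equation dm/dt = 3λm² solves exactly to an escape time of (√n - 1/target)/(3λ), so quadrupling n roughly doubles the time, consistent with the n^((p-2)/2) = √n scaling.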
So the important thing in this example is that, in the end, and that's how the proof goes, the whole system is ruled by a one-dimensional dynamical system, okay? And you see that if here, instead of p, I had another power, I would get another exponent. So from there we understand what the important structure is: the population loss depends only on one variable, and what matters is the size of this power. When p is one, the problem is trivial. When p is two, you need a log term. When p is three or larger, you need a polynomial term, okay? So we generalized that later, and we introduced, and this covers a very large family, models where the population loss is a function ψ of a one-dimensional thing, okay? And then we introduced the notion of information exponent, which has had a lot of success. We just say the following, and I will give you examples of this which are much wider than tensor PCA. If ψ(0) = 0 and ψ'(0) is non-zero, we say the information exponent is one (the value ψ(0) itself we don't really care about). If ψ'(0) = 0 and ψ''(0) is non-zero, we say the information exponent is two. And in general, if ψ'(0) = ψ''(0) = ... = ψ^(k-1)(0) = 0 and ψ^(k)(0) is non-zero, the first non-vanishing one, we say the information exponent is k. Okay, we just count the number of vanishing derivatives; very simple, basic Taylor series. So here we had a system where the information exponent was p, larger or equal to three, but we can imagine models where it is one, two, or whatever. And what we can prove is exactly this thing in the abstract.
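In code, the bookkeeping is trivial; this sketch is my own representation, with ψ given by its Taylor coefficients at 0, and it just finds the order of the first non-vanishing derivative.

```python
def information_exponent(taylor_coeffs):
    """taylor_coeffs[k] is psi^(k)(0) / k!, the k-th Taylor coefficient at 0.

    The information exponent is the smallest k >= 1 with psi^(k)(0) != 0,
    i.e. the order of the first non-vanishing derivative at the equator.
    The value psi(0) itself (k = 0) is ignored.
    """
    for k, c in enumerate(taylor_coeffs):
        if k >= 1 and c != 0:
            return k
    raise ValueError("all listed derivatives of psi vanish at 0")
```

For tensor PCA with p = 3, where ψ(m) = -λ m³, the coefficient list is [0, 0, 0, -λ] and the exponent is 3.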
That is: if the information exponent is one, the model is simple, it goes fast, you escape the mess quickly. If the information exponent is two, it's the critical case: you have a log term, you need a little more data. If it's three or more, it's hard: you need polynomially more time and data, though still not exponential. So that's an important class: these models are called single index models in statistics. Single index because, basically, the population loss, if you knew it, is one-dimensional. Interestingly, the single index family contains very many things: phase retrieval, mixtures of two Gaussians, all sorts of things, and this explains the kind of behavior we see there. Okay, so I will spend more time on this, but let me already tell you what comes later. This was very successful, interesting, covers lots of things, cool; but of course the mathematics of it is very simple, right? A dynamical system in dimension one: not exactly fascinating, right? So then comes the next step, which I will explain, which is called summary statistics, and which is the following. Your dynamical system here, if you believe the SGD literature, should be close to the gradient flow of the population loss. But it is impossible to imagine that something so practical, something that works so well in the industry, really operates on very complicated functions in dimension 10^9; that's just impossible. So we started from the idea that if this works, it's because it's not really in dimension 10^9. If it works, it's because, hidden inside, it's in dimension 17, as I explained, right? So here, the model is such that you can see it is basically in dimension one. And all the other models I mentioned here are like that, even though you don't see it immediately.
But then the idea is: in general, when is it that you have a projection that will be autonomous, right? That's what I mentioned. This is what will be called summary statistics. In a paper from two years ago or so, we introduced Bourbaki-like conditions for when a projection will be autonomous in k dimensions, and we give examples where we can compute that. So this will be a generalization of what we did here. Of course, when this happens, you have a dynamical system in dimension 17, as I said, and it can be very complicated. I will give you an example of a problem that looks completely trivial, the XOR problem, where the projection is in dimension 12 and where the number of critical regions that are sticky is something like 695. So it's not like one critical point, as here, where the bad region is just this; in general it can be very complicated. And the probability of success is 3 out of 32, right? Every time you're a probabilist and you see a probability which is neither 0 nor 1, you think something interesting, something strange, is going on. So I will describe this; that's what I'll do next time. And then the real next question is: how do you see this? Nobody is there to hand you these 17 directions; in general you don't know them. In academic models you see them, and here, of course, you can guess that this will be the relevant direction. But otherwise, how do you find them? As I explained, BBP will find them for you. The transition is spectral. Where? Now it's along the trajectory of the optimization. That's what I will do in the last session: we will have a dynamical BBP transition, and I will illustrate it on a harder problem than all of these, the kind of basic question of machine learning, which is classification of a mixture of K classes. And then we'll see all the bad things that can happen, and all the good things. The story is a little messy, but it's there.
And so the thing that finds the right directions is simply a spectral transition. In words, what does that mean? You have this crazy landscape, you look at your trajectory moving there, in super high dimension; you are at a point, and you look at the Hessian along this trajectory. What I am saying is that, in good situations, there will be a BBP transition, which means your Hessian will have a big blob of eigenvalues, which will be whatever it is, not necessarily a semicircle or a Wishart distribution like in the academic models; when you look at them numerically, they can be crazy. And then you will have outliers pretty far from this blob, much larger eigenvalues. So what does that mean? In terms of curvature, the Hessian gives you something like the curvature of the graph of the function you are moving in. So it means that, if you essentially take this big blob and treat it as a Dirac mass at zero, I just don't care about its width because the other eigenvalues are far, all of those directions are curved but kind of flat-ish, their curvatures are very small; and then you have a few directions which are much more curved. You have 10^9 minus 17 directions that are not curved, kind of flat, and 17 directions that are very curved. So you look at the projection onto those 17 curved directions, and there you have a dynamical system in dimension 17. That's where the real action is; the rest is moving along the flat directions. That's where the performance is happening. And when these two things are far enough apart, if you reduce all the rest to zero, it's as if you are saying that this thing is completely autonomous: it's just a function of these 17 variables. And so, yes? [Question:] Do you have any results or conjectures about starting from data with low entropy, with structure in the data? No. No, but the important thing is that you're right: this is a question about hot starts.
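Here is a minimal numerical illustration of this picture; it is my own toy example, not the lecture's model: a symmetric Gaussian bulk plus a rank-one spike of strength θ > 1. Above the BBP transition, the top eigenvalue detaches from the edge of the blob (near 2 for this normalization) and its eigenvector correlates with the planted direction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta = 500, 3.0                  # spike strength theta > 1: BBP regime

# Bulk: a GOE-like symmetric matrix whose spectrum fills roughly [-2, 2].
W = rng.normal(size=(n, n)) / np.sqrt(n)
H = (W + W.T) / np.sqrt(2)

# Rank-one "signal" direction added to the noise.
v = np.ones(n) / np.sqrt(n)
M = H + theta * np.outer(v, v)

evals, evecs = np.linalg.eigh(M)     # eigenvalues in ascending order
top, next_ev = evals[-1], evals[-2]  # the outlier vs the edge of the blob
overlap = abs(evecs[:, -1] @ v)      # correlation with the planted direction
```

Above the transition the outlier sits near θ + 1/θ and the overlap is close to √(1 - 1/θ²); for θ below 1 the spike is swallowed by the blob and both signatures disappear.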
That is, you may think that if you start from somewhere informed... but to do that, you'd better already know a little bit. What I am explaining, for instance in this example, is that you start with nothing, you start in the middle of the entropy, and then you emerge. Here, when you do the analysis, you also have a BBP, a spectral transition: if you look at the Hessian along the trajectory of the SGD here, you also see this thing. So the important point, and you're right, is that that's the next question. First, we have to build even harder examples of what we are doing, with realistic networks, layers, multi-layers, et cetera, which is what we have done in the last months. But the next thing is to understand the emergence phase. When you start at random, like in this picture, your spectrum is just a blob; there is nothing outside. What we are saying is that once something comes out, this thing becomes interesting. Not all of it: some of the information can still be lost in the spectrum, it's a long story, but this dynamical system begins to rule the game. The real question is: how does it emerge? That is, of course, a BBP question; that's exactly the BBP transition, the emergence of the outliers, dynamically. What we have now is that when the BBP transition is fully realized, it's good for you. But how does it happen? That's where the meta-question comes in: how do you build an architecture such that this happens with less data? That's a real engineering question, and if you solve it, you've killed the industry, right? Because then you know how to do it. Needless to say, in this information-exponent story, many structured architectures that are used in practice have, in fact, a low information exponent, one or two. It's hard to find something in practice with information exponent 3 like this tensor PCA. Why? There is a meta-answer: typically, if something is used in the industry, it's because it works; otherwise you don't do it.
So that will be the next question, but I'm not sure we'll find the answer; I mean, this is really hard in general. But you're right, that's where things happen. So tomorrow, summary statistics, and Friday, dynamical BBP.