We'll start this session on random matrices with two talks, and the first one is going to be given by Ben McKenna, who just finished a visit to IST in Vienna and who is based at NYU. So, Ben, thank you very much for coming in person, and take it away.

Thank you very much, and thank you all for coming. Let me first say that everything I'm telling you about is joint work with Gérard Ben Arous and Paul Bourgade, who are my PhD advisors at NYU.

The title of my talk is "Landscape complexity beyond invariance", and almost the whole story of the talk is in these four pictures right here, which also explain the title. These are landscapes; that's my name for random functions, or rather for graphs of random functions, here from the unit square into R. They're nice smooth Gaussian functions: everything is smooth, differentiable, whatever. I'm thinking of the ones on the left as being more complex, because they have more critical points, and the ones on the right as being simpler, because they don't. So that's my notion of complexity: how many critical points do you have?

"Beyond invariance" means, and you can see this more in the pictures on the right, that the landscapes have different behavior in different directions: there is a steep direction and a flat direction. This is the technical novelty in what's happening. You can also see, going from left to right, the kind of phase transition that I want to prove things about: there is some signal that I'm adding, and as I add the signal, the critical points disappear and things become simple. So that's the basic story.

In notation, I have random functions f_n from R^n to R. The only misleading thing about the picture is that it's two-dimensional, because that's what I can draw, but I'm interested in high-dimensional behavior. So: functions f_n from R^n to R. They have few distributional symmetries; I'll give you an example of what I mean, since it's not so clear right now. What I want to compute is what's called the annealed complexity, which means I take the number of critical points; that's my notation Crit(f_n).
Crit(f_n) is an integer-valued random variable. I want to take its expectation and understand: is it exponential in n or not? When I say "a lot of critical points" or "not a lot", I just mean exponentially many versus sub-exponentially many. I take the 1/n log scaling and I get this number Σ. So Σ is a real number, and this whole talk is just about computing this number. Rearranged, it says that on average the number of critical points is exponential in n, with rate Σ in the exponent. You could ask for the value of Σ, but today I'll give you a simple version where you're really only interested in whether Σ is positive, which is the complex case where you have a lot of critical points, or zero, meaning you have few.

Let me give you, informally, the main theorem of today. I want to emphasize that we have some general techniques, which we illustrated on a few kinds of examples, and I picked the one I thought was best for this conference, which is a signal-plus-noise model. So later on I'm going to give you a certain function that has the form signal plus noise. But the signal strength is not a signal-to-noise ratio; it's not a positive real number. It's actually described by some probability measure, combining different signal strengths. What we do is find the exact threshold for this model between the complex case and the simple case, and it turns out that this threshold is described by the inverse second moment of this probability measure. So I have a signal measure, and somehow the observable of it that matters is the inverse second moment. The proof, which I'll sketch at the end, goes via new results on determinants of random matrices.

Let me suggest why asking these kinds of questions might be interesting. I just copied over what we're looking at. I'm going to give you three reasons you might care, and I emphasize that these are really rules of thumb; they're not theorems.

The first reason, maybe the most relevant here, is that you might hope to predict the dynamics of optimization on f_n. So f_n is a random function of many variables, maybe something like a loss function, and you want to find the minimum with something like (stochastic) gradient descent. All the local minima that are not the global one are traps: you spend a lot of time in them. So you would imagine, at a first pass, that if Σ is not positive, meaning there are sub-exponentially many critical points, then optimization should be easier, whereas if there are a lot of critical points, optimization should be harder. This is just a way to guess.

Alternatively, you can use it to locate, or guess, the extreme values. If you ask where the global minimum of the function is, you can imagine a variant Σ(t) that scans sub-level sets, as in the second display below.
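In symbols, the two counts just described read schematically as follows (my rendering; the exact normalization of the level t varies across the literature):

```latex
\Sigma \;=\; \lim_{n\to\infty}\frac{1}{n}\,\log \mathbb{E}\,\#\big\{\sigma\in\mathbb{R}^n : \nabla f_n(\sigma)=0\big\},
\qquad
\Sigma(t) \;=\; \lim_{n\to\infty}\frac{1}{n}\,\log \mathbb{E}\,\#\big\{\sigma : \nabla f_n(\sigma)=0,\ f_n(\sigma)\le t\,n\big\}.
```

Σ > 0 is the complex case; Σ = 0 is the simple case.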
If I look at very low sub-level sets, I don't see anything, because there's no function down there. As I raise the level up and hit the global minimum, I pick up one critical point, which is the global minimum, and as I keep going higher I get more and more. In this way you can actually find where the global minimum is.

The last reason is phrased a little vaguely, but that's also because it's not so clear on the physics side; this is really a physics motivation. It should be connected to replica symmetry breaking, meaning that if f_n is a Hamiltonian, then at low temperature the Gibbs measure with Hamiltonian f_n should be dominated by local minima with low energy values. You can imagine studying how these are arranged, which is basically the question about replica symmetry breaking, via this quantity that looks at sub-level sets, but a version that counts only minima. For experts: you probably also want the quenched version. So variants of this function can tell you interesting things about the function you care about.

I emphasize that for all these points, what really matters is the sign of Σ, not its value. That might seem easier, but it's probably harder, because with our techniques and others it may be possible to get variational formulas for Σ, and then it's really not clear what the sign is. I'm mostly going to hide this issue, because in today's example there is a small miracle that lets you actually compute the sign. But I wanted to mention that it's not trivial.

So let me give you the classical model; I'll give you the extension in a moment. For a signal-plus-noise model, the easiest noise you could start by asking for is Gaussian, centered, and isotropic, which is a phrase from Gaussian process theory that informally means it looks statistically the same everywhere. The formal version says that the covariance between two points is some function of the distance between them alone, as in the display below. I put this reference up because the question of which functions B you can put here was answered quite a long time ago. I need the function B to phrase the theorems properly, but I don't need to worry about it that much, so we'll put that issue aside. Now, the function V_n itself looks the same everywhere, so it has infinitely many critical points; this is basically a picture of it, V_n on all of R^n.
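Schematically, and with one common choice of normalization (the n-scaling and factors of 2 differ between papers, so treat the precise form as an assumption), isotropy means:

```latex
\operatorname{Cov}\big(V_n(x),\,V_n(y)\big) \;=\; n\,B\!\Big(\frac{\|x-y\|^2}{n}\Big),
\qquad x,\,y \in \mathbb{R}^n,
```

for a fixed function B. The old classification of admissible B alluded to here is presumably Schoenberg's result on completely monotone functions.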
So the classic solution is that you confine it quadratically: you add μ‖x‖², where μ is now a signal-to-noise ratio that you can pick. I think of this μ‖x‖² as the signal, and I ask what changes as I vary μ. If μ is very small, the picture basically looks like before: it's essentially a noise model, and I have a lot of critical points. Whereas if μ is really big, it's basically quadratic; it's a little bit wiggly, but those wiggles aren't really critical points. So the idea is that increasing μ should simplify the landscape, and you can see that in this video here: as I increase μ, the critical points far away from the origin, these little cups, start tipping until they're not critical points anymore. Let's watch this one more time.

From this you can guess the theorem, which is that there's some phase transition in μ: when μ is large I have very few critical points, and when μ is small I have lots. Indeed, that's the result; I'll give you the formal version in a moment. The complexity of the model with parameter μ has a phase transition: there is a critical μ_c, and if μ is at most this threshold, then the noise wins, which is a way of characterizing the idea that there are a lot of critical points; whereas if μ is at least this threshold, then there are very few, namely sub-exponentially many. This B appears, right; there's always this function B, which determines the noise, that I'm carrying around but don't want to worry about too much. Everything here is explicit; I'm just not writing down the formulas to save space. Because it's explicit, you can see that this function Σ(μ) is continuous, and it's quadratic at criticality: here's μ, here's Σ(μ), here's μ_c; Σ is zero from here on, and the local behavior is quadratic. So it's a second-order phase transition.

This is a result of Yan Fyodorov from 2004, which is also nice because it's essentially the first theorem in this whole landscape complexity program. A few years later, together with Ian Williams, he extended it to a similar result for local minima: not counting all the critical points, just local minima, which is maybe more relevant for something like gradient descent. There you have the same sort of behavior: there is a μ_c, the same one actually, which is not obvious, and you have a cubic rate at criticality, so it's a third-order phase transition.

Just to give you a name you'll see in the literature: this phenomenon is called topological trivialization. The idea is that when μ is small I have a complicated function, and as the signal strength increases everything simplifies, and you can see this in the change of the number of critical points. So if you see that phrase, that's what it means.
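As an editorial aside, here is a minimal one-dimensional toy of this trivialization (my own sketch, not the speaker's code): a smooth stationary Gaussian process built from random Fourier features, plus a confining term μx². Counting sign changes of the discrete derivative shows the critical points disappearing as μ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
n_modes, grid = 40, 10_000
x = np.linspace(-5.0, 5.0, grid)

# One fixed realization of a smooth stationary Gaussian process on [-5, 5],
# built from random Fourier features (random frequencies and amplitudes).
w = 3.0 * rng.normal(size=n_modes)                    # frequencies of order 3
a = rng.normal(size=(2, n_modes)) / np.sqrt(n_modes)  # cos / sin amplitudes
noise = a[0] @ np.cos(np.outer(w, x)) + a[1] @ np.sin(np.outer(w, x))

def n_critical(mu):
    """Count sign changes of the discrete derivative of noise + mu * x**2."""
    df = np.diff(noise + mu * x**2)
    return int(np.sum(df[:-1] * df[1:] < 0))

for mu in [0.0, 0.3, 1.0, 5.0]:
    print(f"mu = {mu:3.1f}: {n_critical(mu)} critical points")
```

The count should drop markedly as μ grows, from many critical points at μ = 0 to essentially one for large μ; a one-dimensional cartoon of the tipping cups in the video.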
I want to give you an extension of this, and I will explain in a moment why the extension is natural. But first, if you'll allow me, let me do something that just looks fun. Instead of this quadratic term, which has the same steepness μ everywhere, I'm going to consider a variant that is still quadratic, but where I give half of my directions steepness one and the other half steepness two, with an overall ν in front for scaling. That's my function g_n(x), the two-directions landscape, and there are pictures of it down here: it looks like before, except that when ν is really large you can see that it really is a two-directions thing. From the picture, or otherwise, it's easy to guess that there should be some critical ν_c, and the question is what it is and how it compares to the model where everything is the same.

When I started thinking about this, there were three guesses that seemed natural. One is that the flattest directions dominate, which basically says that g_n behaves like ν‖x‖²; in that case ν_c would be the same as μ_c. The opposite guess is that the steepest directions dominate, so that g_n is basically like 2ν‖x‖²; because of that factor of two you would need a one half, so ν_c would be μ_c/2. Or, the guess that seems most natural: the directions average in the simplest way, so g_n is like (3/2)ν‖x‖², in which case ν_c would be (2/3)μ_c. Those were the guesses I thought were natural, and the fun answer is that they're all wrong. Actually ν_c is √(5/8) times μ_c, which is about 0.79. So the directions do average, but not in the simplest way; in a more complicated way.

So here's the general model, and why you might consider it. I take D_n to be some n×n matrix and consider the quadratic form ⟨x, D_n x⟩. Before, D_n was a diagonal matrix which was half ν and half 2ν. I want D_n to be positive definite, because I want everything to go up; I want it to actually be confining. Of course, when D_n is a constant times the identity you recover the previous model, so this is really a generalization that allows you to have different steepnesses in different directions.
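In formulas, as far as the slides can be reconstructed from the talk (take the precise constants as assumptions), the general model and the two-directions example are:

```latex
f_n(x) \;=\; V_n(x) \;+\; \langle x,\, D_n x\rangle,
\qquad
g_n(x) \;=\; V_n(x) \;+\; \nu\Big(\sum_{i \le n/2} x_i^2 \;+\; 2\!\sum_{i > n/2} x_i^2\Big),
```

where V_n is the isotropic noise from before, D_n is deterministic and positive definite, and D_n = μ·Id recovers Fyodorov's model.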
Why would you care about this? I want to suggest that it is a toy model for how a general Gaussian function might look near its global minimum. I'm not saying it's a good model; I think of it as a toy model. If I have another Gaussian function, I want to Taylor expand it at its global minimum: say the minimum is at zero and expand to second order. There is a constant term; the first-order term vanishes because it's a minimum; so I'm left with the second-order term. For the question of what the landscape looks like, the constant doesn't matter, it's just a vertical shift. So it's really this second-order term, and I'm going to do something kind of dumb, which is to write the Hessian as its expectation plus the Hessian minus its expectation. The expectation term is then a quadratic form, where the matrix inside is positive definite because we're at a global minimum; but from this perspective it's very natural to allow it to be something general, not a constant times the identity anymore. So this expectation I model by ⟨x, D_n x⟩, and the remainder is centered, so I model it by V_n. It is a fairly strong assumption, but at the moment it's something.

Then the theorem. I have to tell you in what sense the matrices converge as n grows: I have a sequence of deterministic n×n matrices, and the right way to think about the assumption is that their empirical spectral measures, the averages of delta masses at the eigenvalues, tend to some limit measure μ_D. This μ_D, let me write it here, is really my signal measure now. It says I have some steepnesses over in this direction, some others over there: a quadratic with different steepnesses in different directions. It is some compactly supported measure, and it should live on the right half-line, because I want everything to go up; I really want this to be confining. And B is the function from before. In the example I gave you, μ_D was an average of two delta masses, at ν and 2ν, but you could have something much more general.

And the result, which is the main theorem of the talk, is that you again see a phase transition; again there is this issue of B that I'm hiding from you. The quantity that you have to compare to a certain threshold is the inverse second moment of the measure, ∫ λ^{-2} μ_D(dλ). You can ask a natural question: given this measure, there should be some scalar you extract from it that plays the role of the effective signal-to-noise ratio. The guesses from before would be the left endpoint, or the right endpoint, or the mean; but it's none of these, it's the inverse second moment. And this is the right way the inequalities have to go.

This phase transition is continuous; there are formulas that are explicit-ish, I mean explicit enough to see that it's continuous; and it's quadratic. What I mean by quadratic, since I now have a whole measure, is that I fix the measure and vary the noise, which appears as a scalar that I'm hiding. We have an analogous result for local minima, which is again more relevant for gradient descent, with the same threshold and a cubic rate. The fact that the inverse second moment is the effective signal-to-noise ratio is new, as far as we know. You can also think of this as a universality result for the quadratic and cubic exponents, saying it's a second-order or third-order phase transition, because those exponents already appeared in the results of Fyodorov and Fyodorov–Williams, which we recover by taking the signal measure μ_D to be a single delta mass, meaning the same steepness in every direction. Maybe I'll pause at this point; that's the main result.
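A quick worked check of the √(5/8) from earlier, using only the statement that the inverse second moment of μ_D is the effective signal-to-noise ratio:

```latex
\mu_D=\delta_\mu:\quad \int \lambda^{-2}\,\mu_D(d\lambda)=\frac{1}{\mu^{2}};
\qquad
\mu_D=\tfrac12\delta_\nu+\tfrac12\delta_{2\nu}:\quad
\int \lambda^{-2}\,\mu_D(d\lambda)=\frac12\Big(\frac{1}{\nu^{2}}+\frac{1}{4\nu^{2}}\Big)=\frac{5}{8\nu^{2}}.
```

Equating the two at criticality, 5/(8ν_c²) = 1/μ_c², gives ν_c = √(5/8)·μ_c ≈ 0.7906·μ_c, the "about 0.79" quoted above.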
Let me give you a bit of a proof sketch. The main technique in this whole business is called the Kac–Rice formula; every proof in this business starts with "apply the Kac–Rice formula". It's a fairly old formula, which you can see here, and it's an exact formula at finite n. It says that if f_n from R^n to R is a nice Gaussian process, meaning C², almost surely, maybe a bit more (we're not talking about non-differentiable things; it's as nice as you want, basically), then the expected number of critical points is given as an integral over R^n, the base space, of the expected absolute value of the determinant of the Hessian, conditioned on criticality, times a density factor φ, which is easy, so I don't really want to worry about it. The point is that the hard thing is the determinant term. The formula converts the problem into a problem about the Hessian, and the Hessian of a random function is a real symmetric random matrix. So what this really does is transform random geometry on the left-hand side into random matrices on the right-hand side, and then it becomes a random matrix problem.

So what is the matrix? It's a real symmetric Gaussian matrix: the Hessian of a Gaussian function. In most of the models studied so far, the matrix is closely related to the Gaussian Orthogonal Ensemble, the GOE, basically because the function has a lot of symmetries and these appear as symmetries satisfied by the Hessian. Whereas for non-invariant models like the one I described, what you need is to understand an H_n that is a Gaussian matrix with fewer symmetries. And if I want to put a 1/n log on the left, I put it here also, so it becomes the random matrix problem of understanding the expected absolute value of the determinant of some large Gaussian random matrix.

So we also have results on this side. How could you guess what this kind of thing should be? Here's an easy way to see it; I'm not doing anything fancy at the moment. The determinant is just the product of the eigenvalues. Because I have an absolute value, I can move it upstairs into an exponential, and the sum of logs is the same as n times the integral of the test function log|·| against the empirical measure, the average of delta masses at the eigenvalues. Once it's written like this, it's easy to guess that if I put a 1/n log here, and if the empirical measure μ̂_n tends to some limit measure μ_∞ (for Wigner matrices it tends to the semicircle), then the result should just be the logarithmic potential at zero of my limit measure.

And indeed we can do this. It's not trivial; it's actually a different paper. And we can do it fairly broadly: for example, when the matrix H_n is a Wigner or sample covariance matrix with exponential tails; an Erdős–Rényi graph or a d-regular graph with parameters just above the phase transition; a one-dimensional band matrix with any polynomial bandwidth; or a Gaussian matrix, which is what matters for Kac–Rice, with a mean, a variance profile, or some correlations. I won't go through these or define them, in the interest of time. I put a little star on the last case because there the relevant limit measure is a little bit complicated: it's given by the matrix Dyson equation, which I think Dominik might talk about in the next talk.
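Schematically, the two displays just described read as follows (my rendering; φ denotes the density of the gradient ∇f_n(σ) evaluated at zero):

```latex
\mathbb{E}\,\operatorname{Crit}(f_n)
\;=\; \int_{\mathbb{R}^n} \mathbb{E}\Big[\big|\det \nabla^2 f_n(\sigma)\big| \;\Big|\; \nabla f_n(\sigma)=0\Big]\,
\varphi_{\nabla f_n(\sigma)}(0)\, d\sigma,
\qquad
\frac{1}{n}\,\log \mathbb{E}\,\big|\det H_n\big| \;\longrightarrow\; \int \log|\lambda|\;\mu_\infty(d\lambda).
```

And here is a minimal numerical sanity check of the Wigner case (my own snippet, not from the talk): for a GOE matrix normalized so that the spectrum fills [−2, 2], the logarithmic potential of the semicircle law at zero is exactly −1/2, and the empirical average of log|λ_i| should be close to it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
A = rng.standard_normal((n, n))
H = (A + A.T) / np.sqrt(2 * n)       # GOE, eigenvalue density -> semicircle on [-2, 2]
lam = np.linalg.eigvalsh(H)
# Empirical log potential at zero; the semicircle value is exactly -1/2.
print(np.mean(np.log(np.abs(lam))))  # ~ -0.5 up to finite-n fluctuations
```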
Let me summarize and then give you some open questions. The summary: I'm studying random functions f_n from R^n to R that are signal plus noise, where the signal is quadratic, with different strengths in different directions, stored in some probability measure; so the signal strength is stored in this probability measure. The question I'm asking about these functions is how many critical points they have: a lot or not a lot. That's my number Σ, the annealed complexity. We found that what separates the two phases is the inverse second moment of the signal measure. The proof goes through the Kac–Rice formula, which is very standard, but also through new results on determinant concentration for random matrices.

I'll leave you with some open questions, which are fun to think about. First, I keep talking about this inverse second moment. In the proof it's very natural for the random matrix folks: it comes out of the Stieltjes transform of the signal measure (note that ∫ λ^{-2} μ_D(dλ) is the derivative at zero of the Stieltjes transform m(z) = ∫ (λ − z)^{-1} μ_D(dλ)). But I have no heuristic for it on the landscape side, so if people have suggestions, I would love to hear them. Second, you would want to see this threshold algorithmically. I'm not coding anything, I'm just proving things about functions, so the question would be: if you're on the wrong side of the threshold, does gradient descent fail? Third, we computed what's called the annealed complexity, where the log is outside the expectation, but it's probably more natural to put it inside; that's called quenched. In cases where quenched happens to equal annealed, which is true for some spherical spin glasses, we kind of know what to do; you see some results here. But in cases where quenched differs from annealed, as far as I know there is no single model where we have a rigorous understanding of the quenched asymptotics. We would need a Parisi-type theory; no one knows what to do in those cases to actually compute it.

The last, somewhat wild, thing I'll suggest: what if I take functions defined not on R^n or a sphere, but discrete ones, say random functions on the hypercube? It's not clear what the analogous complexity theory would be. There is one instance in spin glasses, for those who know: TAP complexity is sort of a special case. But in general there are no critical points, so it's not clear how you would even define the question; there's no Kac–Rice in particular. Since you're asking questions about algorithms failing or succeeding, you can imagine some kind of analogue that would be related to this in some sense, but it's not clear how to define anything. So I will stop here. Thank you very much for your time.

Thank you, Ben, very nice talk. Just a small technical thing before we move on to more interesting questions: the x squared, or early on in the slides you called it lambda squared, with which you rescale the measure. On your summary slide, for example, this inverse second moment thing has an x squared. What is x? Maybe I missed that somehow.
It's just the integration variable; that's what you integrate over. Sorry, I guess it was lambda on the other slides; it should have been the same everywhere. It's just the inverse second moment of the measure.

The natural guess for me would have been the first moment, and instead it's the inverse second one.

Yes, that would have been natural.

Okay, I'm sure there are some more questions from the audience, and I already see one over there.

Two questions actually, sorry for that. The first one: can you do the same calculation, the Kac–Rice, while also conditioning on a certain energy level?

Yes. There is a variant of Kac–Rice that considers critical points at which the function takes values in some Borel set: all that happens is that you insert into the integrand an indicator that f_n(σ) lies in some Borel set B in R. We didn't write it up like this for the example I gave you, but I think we probably could have.

And since you mentioned that you also have the computation for local minima instead of general critical points: does the actual value of the complexity coincide between the two, maybe not overall, but at the lowest energy levels?

The total complexity, considering all energy levels, is different. We did not do it at low energy levels, but I agree that at low energy levels the two should coincide, because there the count should be dominated by local minima.

And sorry, one last comment: for the discrete case, maybe as a proxy for the definition of a critical point you can consider stability against a single spin flip, so that for a greedy Monte Carlo you would essentially be in a minimum.

Yes. It's not the same techniques, you cannot apply Kac–Rice, but that would be a definition of a local minimum: one where, if I flip any spin, I go up. So I agree, this is a natural definition of a local minimum. There's no natural definition of a saddle point that I know of, though: you could say a saddle point is one with at least one direction down, but then everything is a saddle point. With the spin-flip definition the question makes more sense for minima; I just have no idea how to count them anymore.

Thank you.

Thank you, Carlo. Are there any more questions from the audience? Or maybe, Marco, are there any questions in the chat, anything that popped up there? In the meantime, we have a question here.

Yeah, actually I have two questions. The first one: when you assume this D term, the diagonal term, does it make it harder to take a general positive semi-definite matrix, or can you do a change of basis to make it always diagonal?

Because V is isotropic, its distribution is unchanged under orthogonal transformations, so you can diagonalize; it's really just a question of the eigenvalues of D_n.

And in terms of your log-determinant result, this limiting result: your examples are all Hermitian models. Have you considered the non-Hermitian case?

No, we have not. Non-Hermitian matrices wouldn't appear in Kac–Rice, because there the matrix arises as the Hessian of some nice smooth function, so it's always real symmetric.
But this is also a question you could ask, obviously, for non-Hermitian matrices. I have not thought about it; I don't know what would happen. From the random matrix perspective it would be interesting; I don't know how it would apply on this side.

Thank you. Wonderful. We have time for one last quick question. Francesco?

Hi, first of all, I'm sorry if I missed it, but what are the conditions on the distribution of the eigenvalues in the end? Do you just require that this average of one over lambda squared is finite, or?

The condition is that the limit measure should have compact support, and the support should be inside (0, ∞). So you cannot have directions at zero, and you cannot have things going off to infinity; you need everything to be gapped away from zero. So the limit has to look, you know, like this: it doesn't have to have a density, but it has to sit in some compact set that doesn't touch zero and doesn't go to infinity. And as you approach this limit, you can't have eigenvalues touching zero. Let me just write the name down here where I have space: after this, there were results of Fyodorov and a coauthor, in a paper that does allow the measure to touch zero a little bit, in a very special way. But other than this there are no further assumptions: no assumption of a density, nothing like that. And for all measures of this type the inverse second moment obviously exists, because the support is compact and away from zero.

Thank you for the question. Wonderful. Let's thank Ben one last time. Thank you.