Okay, let me know if it's not big enough to read. I'll write the title and we'll start with that.

So, this is joint work with Tim Kunisky and Alex Wein, who are working with me at NYU. What I'll talk about is statistical-to-computational gaps, in particular in hypothesis testing. I'll talk about a method different from the heuristics we are used to seeing that are motivated by physics, known as the low-degree method, and for this audience I thought I should highlight the power of this method by saying something about hardness of certifying certain bounds on the Sherrington-Kirkpatrick model.

My main interest these days has really been in trying to understand which statistical estimation problems, hypothesis testing problems, or optimization problems over random landscapes have statistical-to-computational gaps on average. For which problems do we know that they should be solvable if we had unlimited computational time, but where, with restricted computation — say polynomial time, or linear time, with efficient methods — we really believe there is some fundamental difficulty to solving them? As we get more and more data, bigger and bigger data sets, these limits should really correspond to the real information limits, the real limits of what can be done in statistics and data science.

I got interested in these questions exactly because of papers from a few of you in the audience, from whom I learned some of the heuristics coming from statistical physics and some of the ideas that let you make incredible predictions about what can and cannot be done efficiently. I have been exploring other ways of trying to come up with such predictions, and today I am going to talk about something informally known as the low-degree method, which is motivated a bit more from the algebraic side; if you are a computer scientist, or maybe come from real algebraic geometry, it is very related to the so-called sum-of-squares hierarchy.

I am going to split the talk into two parts. In the second part I will talk about hypothesis testing; in the first I will present a problem about the Sherrington-Kirkpatrick model that we can say something about, and afterwards I will reduce it to a hypothesis testing problem and explain how we say something about that problem.

So let me introduce the Sherrington-Kirkpatrick model; I had better not misspell anyone's name. My interest in it is a little different, so let me write it this way. This is just the Hamiltonian. I usually write my random matrices as W, as if they were a Wigner matrix; what I mean by this is that W is drawn from a GOE, or simply that the entries W_ij are Gaussians, the matrix is n by n, and it is symmetric. So this is the Sherrington-Kirkpatrick Hamiltonian, and the way I am interested in it is that I want to optimize this function.
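Written out, the object on the board is roughly the following; this is a hedged reconstruction, and in particular the 1/n^{3/2} normalization and the exact variance convention for the GOE are my assumptions, chosen so that the quantities discussed below are of constant order.

```latex
% Hedged reconstruction of the blackboard setup; normalization and variance conventions are assumptions.
\[
  W \sim \mathrm{GOE}(n): \quad W = W^{\top}, \qquad W_{ij} \sim \mathcal{N}(0,1) \ \ \text{i.i.d. for } i < j,
\]
\[
  H_W(x) \;=\; \frac{1}{n^{3/2}} \, x^{\top} W x \;=\; \frac{1}{n^{3/2}} \sum_{i,j=1}^{n} W_{ij}\, x_i x_j,
  \qquad x \in \{\pm 1\}^{n}.
\]
```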
I want to minimize this over x in the hypercube. One can study — and in statistical physics one does study — properties of this Hamiltonian, the associated Gibbs distribution, and so on. What I am interested in is understanding the optimization problem of minimizing this random function over the hypercube; in the language of physics, I am interested in understanding something about the ground state of this Hamiltonian.

We know, from the work of Parisi in the '80s and of Talagrand, I think in 2006, that this minimum converges to the so-called Parisi value. For the sake of this talk, think of it as minus 1.5, give or take.

So let's see: I have a random function — the weights are random — and I know that in the limit the minimum of this random function lies around minus 1.5. This function, in theory at least, is hard to optimize, because it is over integers, essentially; it is a discrete, non-convex thing, and all of that. But still: is it easy to optimize this function from, say, a computer science and optimization viewpoint?

Let me define another object just to make things a little easier. I will write G(W) for the ground state corresponding to the matrix W — exactly this quantity. From the algorithmic viewpoint, my interest is in understanding the algorithmic hardness of problems associated with this random function. There are two main problems: one is more natural, and the other might at first glance not appear as natural, and it is the one I am going to talk about. Let me try to motivate it, and please ask questions if you don't feel it has been motivated enough, because I can give more evidence for why one should care about it; that is the good thing with board talks, I can take a detour and it can look like I had prepared it that way all along.

One natural question is the question of search. What do I mean by search? Given the matrix W drawn from the GOE — this just means a matrix W drawn according to that distribution — the goal is to find some vector x̂ such that my objective is as small as possible. I get the matrix, I run whatever procedure I like, randomized or not, it gives me an x̂, and I evaluate its objective. The output of the algorithm is of course random, because the matrix itself is random, and now I can ask: how well do I do on average, or with high probability?

We know, since very recently, that search has no computational gap. This is due to Montanari, last year I think, and it is based on ideas from a paper on optimization of full-RSB spherical models by Eliran Subag in the same year. Can you still read if I write here, or should I go one more line? Let me be a little more precise: for any epsilon bigger than zero, there exists an efficient algorithm that, given W, finds an x̂ whose objective value is at most the ground state plus epsilon.
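In symbols, the guarantee is roughly the following; this is a hedged restatement, and the additive form of the error term is my simplification of what was on the board.

```latex
% Hedged restatement of the search result (Montanari, building on Subag); additive error is a simplification.
\[
  \forall \varepsilon > 0: \ \exists \ \text{a polynomial-time algorithm returning } \hat{x} \in \{\pm 1\}^{n}
  \ \text{ with } \ H_W(\hat{x}) \;\le\; G(W) + \varepsilon \ \text{ w.h.p.,}
\]
\[
  \text{where } \ G(W) \;=\; \min_{x \in \{\pm 1\}^{n}} H_W(x) \;\longrightarrow\; \approx -1.526
  \quad (\text{the Parisi value, ``minus 1.5, give or take''}).
\]
```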
So what does this statement mean? It means that even though this problem corresponds to a certainly NP-hard problem under any model we have for worst-case hardness, the search problem has no computational gap: if I give you W, there is an algorithm that, for any value of epsilon, can find a configuration x on the hypercube such that the value of the Hamiltonian at that x is arbitrarily close to the actual ground state. All of these claims are in the large-n limit; they are all with-high-probability, large-n statements, but everything can be made precise, and as n goes to infinity everything concentrates extremely well, so you can just think of all of this as equalities.

The problem I am interested in is slightly different: it is the problem of certification. I want to give lower bounds on the ground state. Let me see if I can define this in an intuitive way — and please, if it is not clear what I mean here, ask questions; I am happy to clarify, since this is the most important part.

So, what is a lower bound? Let's call it f(W): given W, a lower bound f(W) is a quantity that needs to be smaller than or equal to G(W). So far I haven't really done anything; I am just defining what a lower bound is. A certification of a lower bound of b is a function f(W) that is a lower bound, that achieves the value b with high probability, and that is efficient to compute. This is worth taking a second to digest. I have this random function G(W), which in general is not easy to compute, and I want a lower bound for it — a certification of a lower bound — satisfying the following properties: it needs to be a lower bound always, meaning the inequality f(W) ≤ G(W) has to hold for all matrices W, not just with high probability; it needs to be at the certified value b with high probability over the randomness of the matrix; and I need to be able to compute it efficiently.
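To fix a convention — the signs below are my choice, since on the board I keep flipping between minimization and maximization — a certificate of a lower bound b is a function f with:

```latex
% My sign convention (minimization throughout).
\[
  f(W) \;\le\; G(W) \quad \text{for every symmetric } W \qquad \text{(it is always a valid lower bound),}
\]
\[
  f(W) \;\ge\; b \quad \text{with high probability over } W \sim \mathrm{GOE}(n) \qquad \text{(it actually certifies the value } b\text{),}
\]
\[
  f \ \text{is efficiently computable.}
\]
```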
Let me give you an example. A very trivial lower bound on this quantity is obtained by just replacing the hypercube with a sphere of the same radius; I can make that my f(W). Is it clear to everyone why this is always a lower bound? Before, I was minimizing over the hypercube; now I am minimizing over the sphere, a larger set, so for sure this quantity will be smaller. And it is a lower bound not just with high probability over Gaussian matrices: it is a lower bound always.

Now, to understand what kind of certificate this gives, one needs to understand what the value of this relaxation is with high probability. So now we are back in the high-probability regime, back to the fact that W is drawn from the GOE. And this quantity is easy to compute, because it is just an eigenvalue: it is exactly the smallest eigenvalue of the matrix W, suitably normalized, and we know that this goes to minus 2. It is probably already in Wigner's work, I think.

Yes, please — so in this case the b would correspond to minus 2? Wait, did I... I usually think of everything as a maximization, and for this physics audience I flipped the signs, so it is possible I flipped everything except one thing in one place. Yes, sorry, I always think of it the other way and didn't flip all the signs: b is minus 2, and in reality this f(W) will just always be essentially equal to b.

So, is it clear how this provides a certification of minus 2? If I have a matrix drawn from the GOE, this is always a lower bound on the ground state of the Sherrington-Kirkpatrick model, and the value of the lower bound will be minus 2; as n goes to infinity it concentrates extremely quickly around minus 2.
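As a quick numerical aside — my own sketch, not something from the talk — for very small n one can brute-force the hypercube minimum and compare it with the spectral lower bound; the normalization matches the one assumed above.

```python
# Sketch: compare the brute-force ground state G(W) with the spectral certificate
# lambda_min(W / sqrt(n)) on small GOE matrices. Conventions as assumed above.
import itertools
import numpy as np

def sample_goe(n, rng):
    # Symmetric Gaussian matrix: off-diagonal entries have variance 1, diagonal variance 2.
    A = rng.standard_normal((n, n))
    return (A + A.T) / np.sqrt(2)

def ground_state(W):
    # G(W) = min over x in {+-1}^n of x^T W x / n^{3/2}; brute force, so keep n small.
    n = W.shape[0]
    best = np.inf
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        x = np.array(signs)
        best = min(best, x @ W @ x / n ** 1.5)
    return best

def spectral_certificate(W):
    # f(W) = lambda_min(W / sqrt(n)): always <= G(W), and close to -2 for GOE matrices as n grows.
    n = W.shape[0]
    return np.linalg.eigvalsh(W / np.sqrt(n))[0]

rng = np.random.default_rng(0)
for n in (8, 12, 16):
    W = sample_goe(n, rng)
    print(n, round(ground_state(W), 3), round(spectral_certificate(W), 3))
```

For such small n the values are still far from their large-n limits of roughly minus 1.526 and minus 2, but the inequality f(W) ≤ G(W) holds for every sampled matrix, which is the point of the example.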
And so a natural question that a few of us asked ourselves a few years ago is: can one do better than this? There are a few reasons why this is an interesting question. Perhaps the most natural is that one of the ways people try to understand computational hardness — not only in the average case, but to some extent even hardness of approximation — is through one of the most popular techniques in theoretical computer science: convex relaxations, and in particular the sum-of-squares hierarchy and things of that type. Those techniques always work by certifying; they always work by relaxing to a convex problem, which is essentially exactly what is happening here too. So to understand the limits of such techniques, one needs to know whether certification is possible; if it is not, then those techniques already have a fundamental issue they need to deal with. The other reason is that if there is a computational gap here, it is one of a slightly different nature, and it should be better understood.

Our conditional theorem — as with most things in computational complexity, it is conditional on something — is that this minus 2 is unimprovable. Let me try to write the conditional theorem as precisely as I can. Maybe a way to draw this: we have the actual ground state, around minus 1.5. We know that search can get arbitrarily close to it, so there is no computational gap in the sense of finding good solutions. But there is — or we believe there is — a fundamental gap in the sense of certification, and the bound of minus 2 is unimprovable with fast algorithms.

So the theorem that we prove, stated a little informally, is the following. Conditional on a conjecture, which I will describe when I talk about hypothesis testing, for any epsilon bigger than zero it is impossible to provide a certificate, in the sense of certificate defined above, that certifies a bound of minus 2 plus epsilon — that is, a bound strictly better than minus 2 — in time essentially less than exponential, say 2 to the n to the 0.999. This 0.999 is just because these techniques are not quite precise enough to really say exponential, but they get as close as you like; the 0.999 can be replaced by any constant less than one. So it is not only that it cannot be done in polynomial time: it will essentially take you exponential time, at which point you might as well just try all the points on the hypercube, and then of course you have a certificate, because you tried everything. Although it is maybe worth saying that I still believe something stronger: even if I gave you 2 to the n time, so that you could try everything, if you then had to convince me with a certificate that I could check in polynomial time, I still think this is impossible — but that is a lot harder to show, because it is a lot harder to make formal.
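Collected in one display, my informal restatement of the theorem, in the minimization convention fixed above:

```latex
% Informal; the conjecture it is conditional on is the low-degree conjecture described in the second part of the talk.
\[
  \text{Theorem (informal, conditional).}\quad
  \forall \varepsilon > 0: \ \text{no } f \ \text{computable in time } 2^{\,n^{0.999}} \ \text{satisfies both}
\]
\[
  f(W) \;\le\; G(W) \ \ \text{for all } W
  \qquad \text{and} \qquad
  f(W) \;\ge\; -2 + \varepsilon \ \ \text{with high probability over } W \sim \mathrm{GOE}(n).
\]
```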
So, how do we do something like this? Maybe this is a good point to pause: is it clear what the statement says? How much time do I have, sorry, and how much should I leave for questions? Eighteen, okay.

So this is a problem on which the phenomena of how hard it is to search for a solution and how hard it is to convince someone that there is no better one seem to be very, very different from the computational viewpoint. How do we show this? I will try to almost show the proof in ten minutes and then leave some time for questions. We reduce to hypothesis testing, and then use — and develop — a technique to argue that the hypothesis testing problem is computationally hard.

So how do we reduce to hypothesis testing? The whole idea is in coming up with a distribution of matrices W' drawn from some tampered distribution, let's call it T. I create a new distribution of matrices: I tamper with it in some way, and I now draw W' from it. This distribution has the property that, with high probability, the ground state G(W') — the same quantity I still have on the board — is essentially minus 2; if we want to be precise, you can think of it as smaller than or equal to minus 2 plus epsilon, but essentially minus 2. So I am drawing matrices from some other distribution, certainly not the GOE, such that with high probability the ground state for such a matrix is very close to minus 2. And the claim is that hypothesis testing between H0 — or, as computer scientists would call it, the distribution Q, depending on which community you belong to — where W is drawn from the GOE, and H1 — or P — where W is drawn from T, is computationally hard.

Let me explain this in words again. I have a new distribution. It is still the same Hamiltonian, but now with weights W_ij — or J_ij in the notation of statistical physics — that do not come from a Gaussian; I built them in some other way, and I do it so that the ground state is actually minus 2. So it is now a completely different problem, in which there is a hole down in the landscape where the value is actually minus 2, and I claim that I do it quietly, or hiddenly — depending on which community you belong to you might call it different things — meaning I do it in a way such that distinguishing whether the weights came from the real GOE distribution or from this tampered distribution is, I believe, computationally hard.

Oh, you mean if I want to understand what the gap is for a worst-case matrix W? I think I can do that, but it then becomes a question that theoretical computer science has looked at for some years, on approximation algorithms and so on, and the setting is completely different; in particular I believe one can create gaps of order log n in that setting. Here the statement is different. I get a matrix W and I compute a lower bound for it. It needs to be a lower bound for any matrix — such as the minimum eigenvalue of W: it is something you can compute and it is for sure a lower bound. But the way you evaluate how good your lower bound is, is still on average with respect to the GOE; the minus 2 comes from understanding what the minimum eigenvalue is when the matrix is drawn from the GOE. So it is still average-case. The lower bound you build needs to always be a lower bound, in the same way that when you do search you are effectively finding an upper bound on the ground state: regardless of how good your search is, or how bad your matrix is, what you output is always an upper bound on the ground state. Here, it is always a lower bound. But then I evaluate the lower bound on the random measure, so it is still a statement about average-case complexity in that sense. Is it more clear? Yes — but note it is only minus 2 if W is typical; otherwise I could come up with a W whose minimum eigenvalue is minus log n, or anything. So it is true that the minimum eigenvalue is always a lower bound; it only has value minus 2 under the GOE measure, and the statement is about how no procedure can do better than that on this measure.

No, no — what I mean by the reduction is the following; let me explain it in words, since writing much may not help. You get one single sample: one matrix, either drawn from the GOE, or drawn from what I am calling W', that is, from T — this is a bit of an abuse of notation. Say I even drew it with probability one half from one bag and one half from the other. I give you one of them and ask you to make an intelligent guess as to whether it came from the GOE or from the tampered distribution. The claim is that this is computationally hard in the limit of very large matrices.

Why isn't this easy? Well, in the sense of statistics, of information theory, of distance between distributions, these two distributions are very, very different, because if you had unlimited computational power you would just go ahead and compute the ground state.
Under the GOE it will be around minus 1.5; under the tampered distribution it will be around minus 2. Both concentrate incredibly fast, so my statistical test would just be: is the ground state above or below minus 1.7? And this statistical test will have incredible power. The point is that this statistical test is not something that one can do in polynomial time, we believe, and the claim is that there is no statistical test that can be done quickly — say in polynomial time, or even, more ambitiously, in time 2 to the n to some power less than one — that is able to do this task.

No, it shouldn't — otherwise that would contradict the theorem, or rather it would refute the conjecture the theorem is conditional on, the low-degree conjecture, which I might not have time to explain in detail, but that's okay. So what we expect happens is this: if I run the search algorithm — say I run Andrea Montanari's algorithm — on both W and W', then, because it is a polynomial-time procedure, its statistics should be essentially the same on W and on W'. So what it will do on W' is that it will also find, not the ground state, but something whose value is very close to minus 1.5, and it will completely miss the fact that somewhere else in space there is a huge basin at minus 2 that is just hidden. Montanari's algorithm will still work here, but it will work at finding something of value around minus 1.5, not at finding the true ground state for this model. Is that more clear?

Yes — and the statement is even stronger, in the sense that even a global view does not help, as long as you do not give yourself exponential time to compute; that is roughly what one means by local versus global here. Of course, I do not actually prove this outright: proving that things take exponential time is something we have essentially no way of doing. But conditional on a certain conjecture, if you believe that conjecture, it says that no efficient procedure can do this. (Okay, this is the most fun talk I have ever given.)

No, it is constructive. I was not going to give you the recipe, because I think it is more interesting to see how one argues that this is computationally hard, and the recipe is actually quite simple; I do not think you learn much from it. You essentially rotate the GOE to point at a random hypercube point; that is all it does.

There will be some, but there will be very few. At least intuitively speaking, I just planted a hypercube point down at the bottom, so anything that correlates positively with that hypercube point gains traction, and there will be a sort of cone of solutions that has constant overlap with the planted one. So there will be some, but they will be restricted to an exponentially small part of the space.
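To collect the testing problem in one place — my restatement, with minus 1.7 just the illustrative threshold from a moment ago:

```latex
% T denotes the tampered ("quietly planted") distribution.
\[
  H_0 \ (\text{or } Q): \ W \sim \mathrm{GOE}(n), \ \ G(W) \approx -1.526,
  \qquad\qquad
  H_1 \ (\text{or } P): \ W \sim T, \ \ G(W) \approx -2 .
\]
\[
  \text{Brute-force test with full power: decide } H_1 \iff G(W) \le -1.7 .
  \qquad
  \text{Claim: no test running in time } 2^{\,n^{0.999}} \text{ can do this.}
\]
```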
Now, one thing that I should say again, even if I have said it already, is why this is actually a proper reduction. For those who have seen this before, sorry, but I think it is good to say it again. If I had a way of certifying a lower bound that does better than minus 2, then this task I just gave the audience — I take things from a bag and ask you which one came from which — would become quite easy. If I drew you something from the GOE, you would just run that magical procedure that certifies a lower bound better than minus 2, and if it succeeds, then for sure you are not looking at the tampered distribution, so you just say: GOE. Otherwise you say: tampered distribution. So the hypothesis testing problem here reduces to the other one: if you could do certification, you could also do hypothesis testing; this testing problem is strictly easier than the problem of certifying better than minus 2.

What do you mean? Oh, yes — here I think of it as n goes to infinity, and I want all the power to go to one, so the hypothesis test needs to succeed with probability of failure going to zero, because that is the right equivalence if I want a certificate that certifies with high probability. Yes — but here the claim is: if you have a procedure that, given a matrix W, produces a lower bound that is, with high probability, better than minus 2, then, in the definition I gave of hypothesis testing, that gives an algorithm that performs the hypothesis test. And I argue — prove, conditionally — that performing hypothesis testing in this way is computationally hard; therefore it is also computationally hard to do certification. It is a mathematical reduction.
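Before the low-degree part, here is that reduction in symbols — again my restatement:

```latex
% Certification of -2 + eps would give an efficient test; hence hardness of testing implies hardness of certification.
\[
  \text{Suppose } f \ \text{is efficient,} \quad f(W) \le G(W) \ \ \forall W, \quad f(W) \ge -2+\varepsilon \ \ \text{w.h.p. over } \mathrm{GOE}.
\]
\[
  \text{Test: answer ``GOE'' } \iff f(W) \ge -2 + \varepsilon .
  \qquad
  \text{Under } H_0 \ \text{this is correct w.h.p.; under } H_1, \ f(W) \le G(W) \approx -2 < -2+\varepsilon, \ \text{also correct w.h.p.}
\]
```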
Okay, all right — now, can I have five more minutes, with questions, or...? I see; this is the best part, I am happy to take more, so let's see. I will try to describe the low-degree method in five minutes, and if anyone wants to see the calculations, come talk to me offline, or I can send you some references; these calculations are actually not hard to do at all. It would take me maybe another thirty minutes to do the whole thing, but I am happy to show it offline or send you references.

So, what is the idea of the low-degree method? It is hard to pin down in the literature where this showed up for the first time, because it depends on how explicitly it was described, but I think the consensus is that it appeared in some form or another in Barak et al. in 2016, was then made a bit more precise in the work of Sam Hopkins and David Steurer in 2017 and in another paper by a subset of the union of those authors, also in 2017, and then some more specific conjectures were made in Sam Hopkins's thesis. We made some slightly different conjectures, worked out some ways of doing these computations, and surveyed the whole thing; the survey is available on my webpage.

The idea is the following. Let's say I have my hypothesis testing problem, and I am going to use the Q-versus-P notation: Q you should think of as H0, and P as H1. I get a sample from one or the other, and I just want a procedure to tell me whether I received a sample from Q or a sample from P — a sample of W or a sample of W'.

The optimal ways of doing this go back to the work of Neyman and Pearson in the 1930s — maybe even earlier, but sometime around the '30s. Let me write it this way: I am testing a random variable Y coming from Q versus Y coming from P, and what they say is that the optimal test is given by thresholding the so-called likelihood ratio. Of course, P needs to be absolutely continuous with respect to Q and so on; I am sweeping a bit of that under the rug. What I mean by the optimal test is this: there are the type I and type II errors, the false-positive and false-negative errors; if you fix one, you want to minimize the other, so there is a whole family of tests, one for each value you fix for the first error, and that whole family is obtained by different thresholdings of this function. If this function is very large, the likelihood under P is much larger than under Q, so you are probably in P; if the likelihood ratio is very small, you are probably in Q. It is the ratio of the densities, or likelihoods, and so on.

Now, there is another way of thinking about this. I do not think it is how Neyman and Pearson thought about it, but it makes what I am going to do next easier: this function is optimal also in an L2 sense. What I mean is that the likelihood ratio is the function that maximizes the expectation under P subject to having bounded variance — bounded second moment — under Q. So now I am turning things into L2, so that I can use inner products and projections. This is not usually how we think of optimality in hypothesis testing, but it turns out to be equivalent in this case: I want a function whose expectation is as large as possible when the data is drawn from P, while its second moment is controlled when the data is drawn from Q.
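In symbols — my reconstruction of the board, with Y the observed sample and absolute continuity swept under the rug, as in the talk:

```latex
% Likelihood ratio, the Neyman-Pearson test, and the L2 variational characterization.
\[
  L(y) \;=\; \frac{\mathrm{d}P}{\mathrm{d}Q}(y),
  \qquad
  \text{Neyman--Pearson: the optimal tests are } \ \text{``decide } P \text{''} \iff L(y) > \tau .
\]
\[
  \text{L2 version:} \qquad
  \max_{f} \ \mathbb{E}_{Y \sim P}\big[f(Y)\big]
  \quad \text{subject to} \quad
  \mathbb{E}_{Y \sim Q}\big[f(Y)^{2}\big] \;\le\; 1 .
\]
```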
This is now a much nicer thing to work with — the Neyman-Pearson lemma is also easy to prove, but this now becomes a trivial fact of linear algebra. So I just do a change of variables: the expectation under P of f(Y) is just the expectation under Q of the likelihood ratio times f(Y). Sorry, the time constraint is making me go through the algebra a little fast. If I take the natural inner product on L2 of Q — the inner product of f and g is just the expected value under Q of f times g — then what I have there is that I am maximizing the inner product of L with f over functions f of norm at most one. And that is a trivial thing to do in linear algebra: the optimum is just L normalized to have norm one, and moreover the value of the optimum is just the L2 norm of L. Maybe the optimizer itself is not as important as the value. So the likelihood ratio L is optimal also in this L2 sense.

Do I have — sorry — can I take three more minutes? Okay. So the optimum is given by L. Now, Le Cam, in the 1960s — these are the ideas of contiguity and so on — tells us that in order to see whether hypothesis testing is possible or not in the limit, a good condition to look at is the expectation under Q of the square of the likelihood ratio, in other words its L2 norm under Q. Essentially, what he shows is that if this stays finite as n grows to infinity, then hypothesis testing with full power is impossible. The proof of this theorem is a really nice two-line use of Cauchy-Schwarz that I do not have time to do, but he showed that if this quantity is bounded, then testing is impossible.

The whole idea of the low-degree method — which comes motivated from the sum-of-squares hierarchy, which in turn comes motivated by things like the Positivstellensatz and Hilbert's Nullstellensatz — is that a natural way to look at these things in a computationally bounded way is to restrict our attention to low-degree polynomials: polynomials of degree up to roughly log n, for technical reasons. Then the natural question is: what if, instead of asking for the best possible function, I ask for the best possible function in this L2 sense restricted to low-degree polynomials? I add the restriction that f is a low-degree polynomial. Because low-degree polynomials form a subspace of L2 of Q, all I am doing is projecting onto a subspace, so the optimum becomes simply the projection of the likelihood ratio onto low-degree polynomials. And the low-degree method, or the conjecture, is that we should do exactly what Le Cam told us to do, but with this projection of the likelihood ratio rather than with the likelihood ratio itself. Because it is a projection, computing its norm is just computing inner products with basis functions, which in the Gaussian case corresponds to integrals against Hermite polynomials, and those are actually quite easy to do. Therefore we can compute this explicitly, or almost explicitly, and then understand when it is bounded or not in this computationally restricted sense of low-degree polynomials.

I am happy to talk more about this offline, or to answer any questions. Sorry for rushing at the end. Thank you very much.
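For completeness, the last few formulas from the board collected in one place — my restatement, with the degree roughly log n as mentioned above:

```latex
% Change of variables, the L2 optimum, Le Cam's second-moment criterion, and its low-degree analogue.
\[
  \mathbb{E}_{P}\big[f(Y)\big] = \mathbb{E}_{Q}\big[L(Y)\,f(Y)\big] = \langle L, f\rangle_{Q},
  \qquad
  \text{optimizer } f = \frac{L}{\|L\|_{Q}}, \qquad \text{optimal value } \|L\|_{Q}.
\]
\[
  \text{Le Cam: if } \ \|L\|_{Q}^{2} \;=\; \mathbb{E}_{Q}\!\left[\Big(\tfrac{\mathrm{d}P}{\mathrm{d}Q}\Big)^{2}\right] \;=\; O(1) \ \text{ as } n \to \infty,
  \ \text{ then no test has both errors} \to 0 .
\]
\[
  \text{Low-degree method: replace } L \ \text{by its projection } L^{\le D} \ \text{onto polynomials of degree} \le D \approx \log n;
  \quad
  \|L^{\le D}\|_{Q} = O(1) \ \overset{\text{conj.}}{\Longrightarrow} \ \text{no efficient test.}
\]
```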