So welcome to this joint MIT-Harvard course — and in some sense also a joint Princeton course — on proofs, beliefs, and algorithms through the lens of the sum of squares. Pablo and I will be giving the lectures here. Pablo is also teaching another course, so the split will probably not be exactly 50-50; and because he's already so used to teaching, the marginal cost for him of giving a lecture here is lower. We share a website with Princeton, but they will progress on their own schedule. This is the website, if you haven't seen it yet. David and I are also writing lecture notes that may eventually turn into a monograph or something like that, and these notes are available at sumofsquares.org.

Okay, so before we start talking about actual math, let me talk a little about some administrative issues. First of all: the lecturers are me and Pablo, and this is the website. The first thing you have to do is join Piazza if you haven't done so already. The way Piazza is set up, you can't sign up without a Harvard email address, so if you don't have a Harvard email address, fill out that Google form; I go over there every couple of days and add the people who signed up on the form. This course is on Fridays, 10 a.m. to 1 p.m., and it will rotate between Harvard and MIT. The rotation will generally be one week here, one week there, but because of various constraints it might not always be exactly like that, so there is a Google calendar on the website with the exact locations. In particular, apparently MIT is its own different country and has its own MIT holidays that I had never heard about before, and I guess we have to observe the MIT holidays as well. I'm still not sure exactly how you celebrate them, but apparently one way is that you don't have classes. So we'll probably want to do some makeup lectures; one reasonable time that seems not to conflict with other things would be, if we can't do a lecture on a Friday, to do it on the Thursday before, 4 to 7 p.m., or something like that. I'll run a poll to find times for these.

Now, what you need to do if you want to take this course for credit. First of all, I'm happy to have you show up regardless of whether you take it for credit or not. I'm also very happy to have you sign up for credit — it would be nice if the first joint MIT-Harvard theory course did not have zero people enrolled in it. It's a graduate class, and I don't really care about grades here; it's more that what you put into this class is what you'll get out of it. So basically what you need to do is show up, and read what I post on Piazza to read before each lecture. We don't have TAs and there will be no submission of exercises, but I would expect you to try to solve them on your own or with friends. They don't have to be typed up, since you're not submitting them; you can just get a bunch of friends together over some snacks or drinks, go up to a whiteboard or a napkin, and try to solve the exercises together. And please do participate in class and on Piazza.
I hope that because you will be reading the lecture notes before each lecture, we can make the lectures a bit more of a discussion rather than just me talking at you. You can also discuss the homework and exercises on Piazza — don't hesitate to do that. Just use the same mechanism as when you discuss a movie: if you're going to post about your solution, put a spoiler alert, so that someone who hasn't done the exercise and doesn't want it ruined for them knows not to read it. Note that most of these things — doing the reading, doing the exercises — I can't really verify that you did. Like I said, it's a grad course; I don't really care. You'll get out of it what you put into it, and I trust that if you take this course you'll follow along, so it will be more beneficial for you. So we're not going to have a lot of formal requirements; it's more about learning, and I hope we'll all enjoy it. Any questions about administrative stuff? Okay.

So let's start with the course. What are the two most beautiful words in the English language? Of course there's "free cookies," but apart from those, the two most beautiful words are — as was exactly your first guess — what any person on the street will tell you: "linear" and "convex." At least on the street around here. When we have problems that are linear and convex, we can solve them; we are happy, and that's what we like. But unfortunately the world is a cruel place, and roaming it are these ugly creatures: non-convex and non-linear problems. Almost every time we want to solve something in life, we actually encounter non-convexity and non-linearity. In optimization, we have discrete problems: we are trying to find an actual assignment, or we want to put something in the knapsack and often cannot break it into pieces and put some part here and some part there. In machine learning, the objectives are often non-convex — for example, when trying to learn neural networks. And in control — I don't know much about it, Pablo is the expert — I think basically everything is non-linear; we only have linear approximations, and they are not always so great.

So we have to deal with this non-linearity and non-convexity, and we generally cannot: we believe these problems are NP-hard, computationally hard, so we can't solve them in general. What we typically do instead is design various algorithms tailored to the particular problem at hand. In theoretical computer science, for every different discrete problem — whether it's max cut or bisection or sparse cut or whatever, and other problems that escape me at the moment — there is a paper saying: here is this problem, here is the algorithm I give for it, and here are its guarantees. Maybe it's an approximation algorithm, if the problem is NP-hard, and maybe I improved the approximation factor by something.
And in practice, people use various heuristics: they have a problem, they run a SAT solver, that SAT solver doesn't work, they try another SAT solver, they try to massage the problem; in learning, they use this kind of gradient descent and that kind of gradient descent. Basically, because we can't solve the problems in full generality, for every different instance we try to do the best we can for that particular instance. Our focus in this course is on a somewhat different approach: a general framework that can be applied to basically any problem. This is the sum of squares semidefinite program, of which Pablo is one of the originators. The advantage is, first of all, that it really is applicable to essentially any problem — sometimes it's not canonical how you phrase the problem in a form the sum of squares semidefinite program can be run on, but we'll get to that. And because it's applicable to any problem, and some of these problems are computationally hard, it will not always work. To be more accurate, it will always work, but it might take a very long time; so if you give it a finite time budget, it will not always succeed. But the interesting thing is that often, within that budget, we don't know of any other algorithm that does significantly better. So often, even though it's a very general framework that apparently doesn't try any of the ad hoc tricks people apply to particular problems, it achieves the same results that you would achieve with a tailor-made algorithm. And a thing I particularly like about it is that even when it doesn't work — even when it fails — you can look at the state of the algorithm and try to extract from it some kind of partial knowledge about the solution that it failed to recover. Any more questions? Okay. So that's one answer that will be meaningless right now: this notion of a pseudo-distribution. But hopefully by 1 p.m. it will not be meaningless.

So let me give an abbreviated history of sum of squares. I guess you've already read the introduction that I posted online, so you kind of know it, but it doesn't hurt to repeat it. There are two tasks that are very closely related to one another. One is: you're given some function f over some domain Omega, and you're trying to find the x that minimizes it — either to compute the minimum value or to find the actual minimizer x; we are not going to distinguish much between these two variants. The second type of problem is to certify that the minimum is not too small, to give a bound showing that the minimum is at least something. These are two dual problems. Ideally, if the actual minimum of f is alpha*, then you are able to do both things: find the x* that achieves alpha*, and certify that it is actually the global minimum, in the sense of certifying that f(x) >= alpha* for every x. Those are the two general problems we always want to solve. And around the turn of the 20th century, people looked at the particular case where f is a polynomial and Omega is just R^n — so f is an n-variate polynomial over the reals.
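In symbols — this is my own phrasing of the two tasks just described, not notation from the slides:

```latex
% The two dual tasks:
\text{(search)}\quad \text{find } x^\star \in \Omega \text{ with } f(x^\star) = \alpha^\star := \min_{x \in \Omega} f(x);
\qquad
\text{(certification)}\quad \text{prove that } f(x) \ge \alpha^\star \ \text{for all } x \in \Omega .
```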
And Minkowski asked the question, I think around the 1890s: can you always certify — let's just assume alpha is 0 — that f(x) is non-negative by writing f as a sum of squares of other polynomials? In his PhD thesis, Hilbert showed that the answer is no, and in fact he even gave a characterization of exactly for how many variables and what degrees you can do it and when you can't. Generally, you can do it if the polynomial is at most quadratic or if it has at most two variables. Hilbert's counterexample was non-constructive, and in fact it took a long time for people to find a constructive counterexample; this was done by Motzkin, I guess about 70 years later. Motzkin's polynomial — displayed below — is always non-negative because of the arithmetic mean-geometric mean inequality: you can see it more easily if you divide by three, and then the geometric mean of the first three terms gives the remaining one. So this polynomial is always non-negative, but it is not a sum of squares. However, this polynomial does happen to be a sum of squares of rational functions, and Hilbert predicted that maybe every non-negative polynomial can be written as a sum of squares of rational functions. He posed this as his 17th problem in his famous 1900 address, where he listed his open problems, and it was resolved by Artin in 1927, who showed that yes, this is the case. Later this was generalized to a much more general setting — not just one f, and Omega not just R^n but an arbitrary variety — and this is known as the Positivstellensatz. Pablo, is this accurate so far? Okay, wow, that's surprising; I typically don't have an accurate slide in my talks.

So that was roughly the first half of the 20th century, but of course in the second half we started to have computers, and we started asking quantitative questions: not just whether something in principle has a certificate, but how simple or complicated that certificate is. Different people approached this problem from different angles. On the proof side, Grigoriev and Vorobjov, in 1999, suggested measuring the complexity of these proofs — sums of squares of rational functions, or some more general forms — by the maximum degree that appears in them, and Grigoriev showed that this maximum degree can be linear in the number of variables for problems like 3XOR and knapsack. We'll see that for these kinds of Boolean problems, that is actually the worst possible. On the algorithmic side, even before them, Shor gave an n^O(d) algorithm for finding degree-d proofs in some restricted setting, and this was extended to the more general setting by Pablo and by Lasserre. So basically they showed that when there is a degree-d proof, you can find it in n^O(d) time; and Grigoriev and Vorobjov showed that sometimes d has to be linear in n, so this can sometimes take exponential time — but sometimes d can also be small.
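Going back for a moment to the counterexample: the standard constructive example — which I believe is the one that was on the slide — is Motzkin's polynomial, and the AM-GM argument looks like this:

```latex
% Motzkin's polynomial:
M(x,y) \;=\; x^4 y^2 + x^2 y^4 - 3\,x^2 y^2 + 1 .
% AM-GM on the three terms x^4 y^2, x^2 y^4, and 1:
\frac{x^4 y^2 + x^2 y^4 + 1}{3} \;\ge\; \sqrt[3]{x^4 y^2 \cdot x^2 y^4 \cdot 1} \;=\; x^2 y^2 ,
% so M(x,y) \ge 0 everywhere, yet M is not a sum of squares of polynomials.
```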
The typical kind of question we ask is the following. Suppose the true answer is that the minimum really is alpha*. Then we can ask: what is the smallest d* such that we can certify this with a degree-d* proof? (I know I still haven't defined what a degree-d sum of squares proof is — we'll get to that, and then this notation will make sense.) So: what is the smallest degree — the simplest, shortest proof, if you like — that certifies the true bound, that certifies that f really is at least alpha*? You can think of this d* as measuring the complexity of that statement. Another question is: given a complexity budget, where you're only allowed to use proofs of a certain degree, what is the best bound you can prove? Clearly you might not be able to prove the actual minimum; maybe you'll only be able to show something much weaker — say, only that f is at least minus a million. Maybe f really is non-negative, but that requires a long proof, whereas proving that f is at least minus a million requires only a short one. So that's the kind of question we ask: given a complexity budget, what's the best statement we can prove? Generally speaking, when we can prove the optimal statement, we can typically also find the actual minimizer; when we can't prove the optimal statement, we typically cannot find the actual minimizer either. But we can ask: can we still get partial information about x*? What partial information means is problem-dependent: maybe we want something close to x*, or maybe some vector that is not close to x* but achieves a similar, or related, value.

Yes? [Student asks about the turnstile symbol.] Yes — that symbol, we haven't yet formally defined it, but what it will mean is that you can prove the statement by a degree-d sum of squares proof. So for now, think of it as meaning that f minus alpha is a sum of squares of degree-d polynomials or rational functions; later we'll make it more formal. But yes, this is a good question, because I'm using a symbol that I didn't define. So these are the high-level questions of this research area. We're going to move to the whiteboard soon and start doing actual math, but any other questions at this point? Yes? [Student asks whether the number of polynomials in a proof must be bounded.] Yes — that's one of the first things we'll do, or maybe one of the first things you'll do, because I think it will actually be an exercise. We'll work with polynomials modulo some ideal, and what we'll show is that if the degree is bounded, you also don't need too many of those polynomials. But that's something that needs to be proven.

So, I guess we'll move to the whiteboard. How do you turn on the lights here? Oh, okay. And this, and maybe this — ah, okay, very good. I should probably take my notes so I know what I'm talking about. Okay. So for the first few lectures, I'm going to focus on a special case that makes things somewhat technically simpler, but still captures most of what we're interested in. This is minimizing a polynomial over Boolean inputs: we're going to focus on computing min f(x) over x in {0,1}^n. This captures many combinatorial problems — 3SAT, Max Cut, and various other problems can be presented in this way. And now I'm going to define the thing that I used before.
We say the following. The statement "f(x) >= 0 for all x in {0,1}^n" has a degree-d SOS proof — which we denote by saying you can use degree d to prove this statement — if the following holds: there exist polynomials p_1, ..., p_m, each of degree at most d, such that f(x) = sum_i p_i(x)^2, where this equality only needs to hold over the cube. So this is not exactly the same as Minkowski's question, because we don't require the identity over all of R^n; and we generally only ask this about functions f that themselves have degree at most d — typically the functions we talk about will have degree at most, say, 10. So notice: we don't ask that f equals sum_i p_i^2 as a polynomial in n real variables, only that f equals sum_i p_i^2 over the cube. But this is clearly enough to certify that f is non-negative over the cube.

Okay, so let me state the basic properties of these proofs, and then we'll go through them one by one and either prove them or leave them as exercises. Property 1: degree-d proofs have at most n^O(d) numbers in them. Even though a priori I didn't bound m, you never need more than n^O(d) polynomials: clearly every polynomial of degree d has at most n^O(d) coefficients, and what this property says is that if there is a degree-d proof, there is one with at most n^O(d) polynomials. Now, if you want to properly count the length of proofs, you also have to worry about bit representations, but we'll ignore these issues, at least for the beginning of this course; most of the time they can be safely ignored, especially if you don't really care about the difference between certifying that f is non-negative and certifying that f is at least minus epsilon. Property 2: we can check a degree-d proof in n^O(d) time. This is not immediately obvious, because a priori this is a statement you'd need to check on 2^n inputs, but we'll see that you can actually check it efficiently. And Property 3, which is the most surprising, and is basically this restricted version of the sum of squares algorithm: you can actually find a degree-d proof in n^O(d) time. So not only can you check it, you can even find it. Yes? [Student: when you talk about degree, is that the total degree of a monomial, or the maximum degree of any single variable?] It's the maximum total degree of any monomial. So x times y has degree 2.

Property 4 is that if f is actually non-negative — again, we are only working over the cube here, so I'm not going to write that every time — then you can always prove this fact with degree at most 2n. So you never need more than degree 2n, which, if you think about it, makes sense: if f is non-negative, you never need more than 2^n time to verify that, so it makes sense that you shouldn't spend significantly more time than that finding a proof of it. But what we'll show is that sometimes we can certify this kind of thing with d much, much smaller than n.
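To collect in one place the definition just given and the two questions from earlier — this is my own summary, using the turnstile notation the lecture promised to define:

```latex
% "f has a degree-d SOS proof over the cube", written \vdash_d f \ge 0, means
\exists\, p_1,\dots,p_m \text{ with } \deg p_i \le d:\quad
f(x) \;=\; \sum_{i=1}^{m} p_i(x)^2 \quad \forall x \in \{0,1\}^n ,
% and, writing \vdash_d f \ge \alpha for \vdash_d (f - \alpha) \ge 0,
% the two complexity questions become
d^\star(f) = \min\{\, d \,:\, \vdash_d f \ge \alpha^\star \,\},
\qquad
\alpha(f,d) = \max\{\, \alpha \,:\, \vdash_d f \ge \alpha \,\} .
```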
So sometimes these proofs can be very succinct, and what's nice about these sum of squares proofs is that they turn out to capture a lot of what we intuitively do in mathematics: a lot of statements, a lot of the inequalities that we use and love all the time — Cauchy-Schwarz, Hölder's inequality, arithmetic mean-geometric mean, all these kinds of things — often have low-degree sum of squares proofs.

OK, so let me go over these properties. Property 1 I will leave as an exercise, but I will give you a hint: you can represent the proof as an n^d-by-n^d matrix of all possible coefficients. We'll see that the condition is basically that all the eigenvalues of this matrix are non-negative, and the number of these eigenvalues — the rank of this matrix — will never be more than n^d. You can do it in several ways; it's basically a linear independence argument showing that you really don't need more than n^d terms. And it's related to Property 2, where we say that even though this looks like 2^n conditions, it can really be checked in n^O(d) time, and the reason is the following. The idea is this: if you have two polynomials, then f(x) = g(x) for every x in {0,1}^n if and only if f = g modulo the equations x_i^2 - x_i = 0 for i = 1, ..., n. Let me explain what I mean by this. First, notice that {0,1}^n is characterized by the equations x_i^2 = x_i: these are equations that only 0 and 1 satisfy. In particular, this means that if you take a polynomial and, every time you see a variable raised to some power larger than 1, you reduce that power to 1 (inductively, for any power), then you are not going to change the value of the polynomial over the cube. So clearly, if two polynomials' coefficients agree after we make them multilinear, then they agree on every x in the cube. And you can also show the converse: if they disagree as multilinear polynomials, then there must be a point of the cube that distinguishes them. Yes? [Student asks what happens to each power of x_i.] Yes — you write it as a polynomial, and then, if you see a power like x_i^5, you make it x_i. If after you do this f and g agree with each other, then they agree with each other on every point of the cube; if they don't agree, there must be a point which differentiates the two. Again, I leave this as an exercise — I've shown one direction, that if the reduced coefficients agree then the functions agree, and you can also show the other direction. But in some sense, by the way, the direction that I showed is enough here. So basically, to check the claimed equality, given all the coefficients of all these polynomials, you reduce all the coefficients to multilinear ones, and then check that the coefficients on the left-hand side and the right-hand side are the same. That's how we check it.
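Here is a minimal sketch — my own illustration, not the course's code — of this Property 2 check: represent polynomials as dictionaries from monomials (frozensets of variable indices, since x_i^2 = x_i makes every monomial multilinear) to coefficients, and verify f = sum_i p_i^2 modulo the ideal.

```python
from itertools import product

def multiply(p, q):
    """Product of two multilinear polynomials, reduced mod x_i^2 - x_i."""
    out = {}
    for (mp, cp), (mq, cq) in product(p.items(), q.items()):
        m = mp | mq                      # x_i^2 -> x_i: union of index sets
        out[m] = out.get(m, 0) + cp * cq
    return out

def add(p, q):
    out = dict(p)
    for m, c in q.items():
        out[m] = out.get(m, 0) + c
    return out

def check_sos_proof(f, ps, tol=1e-9):
    """True iff f == sum_i ps[i]^2 as multilinear polynomials."""
    total = {}
    for p in ps:
        total = add(total, multiply(p, p))
    keys = set(f) | set(total)
    return all(abs(f.get(m, 0) - total.get(m, 0)) < tol for m in keys)

# Example on {0,1}^2: f = x0 + x1 - 2*x0*x1 equals (x0 - x1)^2 mod the ideal.
x0, x1 = frozenset([0]), frozenset([1])
f = {x0: 1, x1: 1, frozenset([0, 1]): -2}
p = {x0: 1, x1: -1}                      # p = x0 - x1
print(check_sos_proof(f, [p]))           # True
```

Note that the check never touches the 2^n points of the cube; it only compares the n^O(d) multilinear coefficients, which is exactly the point of Property 2.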
Property 3 is the more complicated part, so let me skip ahead to Property 4 and then go back to 3. Property 4 says: if f is non-negative over the cube, we can always prove it with a degree-2n sum of squares proof. The idea is the following. Suppose f(x) >= 0 for every x. Define the polynomial

p(x) = sum over y in {0,1}^n of sqrt(f(y)) * product for i = 1 to n of (1 - x_i - y_i + 2 x_i y_i).

Let's parse this. p(x) is a polynomial in x; each factor is linear in x, so it has degree at most n. Now let's see what it does. We sum over all y, and for any particular y, if one of the x_i's differs from the corresponding y_i — say x_i = 0 and y_i = 1 — then that factor is 1 - 0 - 1 + 0 = 0, so the whole term vanishes. The only case where the term doesn't vanish is when x = y, where every factor equals 1. So on the cube, p(x) = sqrt(f(x)): p is a polynomial of degree at most n such that f = p^2 as a polynomial over the cube. Yes? [Student asks whether degree close to 2n is ever actually necessary.] We do have examples — we'll see some later. But I can even give you now an example that requires, I don't know if it's 2n, but let's say n/10 or something like that. Suppose n is odd, and define

f(x) = (sum_i x_i - n/2)^2 - 1/10.

Okay, we want to show this is non-negative. Because n is odd, sum_i x_i can never equal n/2: the closest it can get is within one half, and then the square is at least one quarter. So this polynomial is always at least 1/4 - 1/10, which — it's calculable by someone — is a positive number. So this polynomial is positive over the cube, but Grigoriev showed that it requires degree at least a constant times n to certify this fact. So this is a nice example. And notice that we can easily certify a slightly weaker bound: f + 1/10 is itself a square, so we can certify that f is at least -1/10 with a tiny degree — and generally, we could put 1/4 here and certify f >= -1/4 the same way. So in some sense this is one of those cases where it takes degree n to certify a certain bound, but if you're willing to weaken the bound by a little — a constant, which is relatively small compared to the sum of the coefficients of this polynomial, which is about n — then you can get a certificate of very, very small degree, at the cost of a somewhat weaker bound. So this example demonstrates both sides: sometimes you require degree on the order of n, and sometimes you can do much better.
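Before moving on, here's a quick numerical sanity check — my own illustration, not from the lecture — that the Property 4 interpolation construction above really does give p^2 = f on the cube:

```python
from itertools import product
from math import sqrt, prod

n = 3
# Any non-negative f on the cube, stored as a table of values; here we
# reuse the knapsack-style example (sum x_i - n/2)^2 with n odd.
f = {y: (sum(y) - 1.5) ** 2 for y in product((0, 1), repeat=n)}

def p(x):
    """p(x) = sum_y sqrt(f(y)) * prod_i (1 - x_i - y_i + 2 x_i y_i)."""
    return sum(sqrt(f[y]) * prod(1 - xi - yi + 2 * xi * yi
                                 for xi, yi in zip(x, y))
               for y in f)

assert all(abs(p(x) ** 2 - f[x]) < 1e-9 for x in f)
print("f = p^2 on the whole cube")
```

The product term acts as the indicator of x = y on the cube, which is exactly the vanishing argument above.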
Yes? [Student: what about a random function f — can we say what degree it requires?] So a random function is a little subtle, because we really want low-degree functions here. And in some sense, one of the answers is that we simply don't know. Here's an open question. Suppose f is a random degree-4 polynomial — and what do I mean by random? Let's write

f(x) = sum over i,j,k,l of f_{ijkl} x_i x_j x_k x_l,

where each f_{ijkl} is, say, uniformly random in {+1,-1}, or Gaussian, whatever you prefer. Now, what do we know? We can do a probabilistic analysis to bound how large |f(x)| can get — I typically get this wrong, but let's try. For any particular x, the sum of these f_{ijkl}'s has mean zero and standard deviation about n^2. There are 2^n choices of x, so you'd expect the maximum to be about sqrt(n) standard deviations above the mean, since a sqrt(n)-standard-deviation event has probability about 2^{-n}. So the true maximum of |f(x)| over all x is roughly O-tilde of n^{2.5}. That we know to be true, so for example f(x) + n^{2.51}, say, is a non-negative polynomial — but we don't know the best such bound we can certify. We have some weaker bounds: n^3, I think, we can certify with a degree-4 sum of squares proof, but we don't really know what the answer is, at least not in general. So even this basic question — which is in some sense one of the first questions a smart student would ask when seeing this material — we still don't know how to fully answer, which is why I think this is a very nice research area.

OK, so let me erase this board and do Property 3. Basically, what we want to do is the following. We said that we can always assume m is at most n^O(d), and we can in fact assume m equals n^O(d) exactly, by padding with zero polynomials. So a sum of squares proof is a vector of n^O(d) polynomials. Now suppose f = sum_i p_i^2 over the cube, and f' = sum_j (p'_j)^2. If we take any non-negative combination alpha*f + beta*f' with alpha, beta >= 0, then this equals

sum_i (sqrt(alpha) * p_i)^2 + sum_j (sqrt(beta) * p'_j)^2,

so there also exists a sum of squares proof for the combination, and we know we can find one with at most n^O(d) coefficients. So if we can certify that f is non-negative and that f' is non-negative, then for every non-negative combination of them we can also certify non-negativity, with the same degree. So if we define K_d to be the set of all f such that you can certify in degree d that f is non-negative, then this is what's called a positive cone, or a convex cone: it's closed under convex combinations and under multiplication by a positive number.
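In one line — my own summary of the computation just done:

```latex
K_d = \{\, f \,:\, \vdash_d f \ge 0 \,\};\qquad
f = \sum_i p_i^2,\;\; f' = \sum_j (p'_j)^2,\;\; \alpha,\beta \ge 0
\;\Longrightarrow\;
\alpha f + \beta f' = \sum_i \bigl(\sqrt{\alpha}\,p_i\bigr)^2 + \sum_j \bigl(\sqrt{\beta}\,p'_j\bigr)^2 \;\in\; K_d .
```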
That suggests you might be able to use convex programming, but it's not immediate, because you need an efficient way to certify membership in this cone. The way this works is by what I alluded to before — it's written in more detail in the lecture notes, and we might also go into more detail later — but for now let me say roughly how it goes. You want to say the following: f is equal to a sum sum_i p_i^2 if and only if there exists an n^d-by-n^d matrix M such that, for every subset S of [n] of size at most d, the coefficient f-hat(S) — the coefficient of the monomial x_S, the product of x_i over i in S — satisfies

f-hat(S) = sum over all A, B with A union B = S of M[A, B].

So what this means is that we think of M as an (n choose at-most-d/2)-by-(n choose at-most-d/2) matrix, indexed by sets of size at most d/2, and we think of it as corresponding to a polynomial of a particular form. Let me write this a little more properly. First a definition: suppose M is a matrix indexed by subsets of [n] of size at most d/2. Then we define the polynomial corresponding to M to be

p_M(x) = sum over A, B of size at most d/2 of M[A, B] * x_A * x_B, where x_S denotes the product of x_i over i in S.

And here is a lemma: if M is such a matrix and M is PSD, then p_M = sum_i p_i^2 for some polynomials p_i. Proof: if M is PSD, then we can write M = sum_i v_i v_i^T for some vectors v_i — that's one of the equivalent definitions of a PSD matrix. And if we look at the polynomial corresponding to a single term v_i v_i^T, it is exactly p_i(x)^2, where p_i(x) = sum over A of (v_i)_A * x_A. So if M is PSD, then p_M is a sum of squares. Conversely, it turns out that if p is a sum sum_i p_i^2, we can use this correspondence to easily define a PSD matrix M with p = p_M. So basically: p equals a sum of squares if and only if there exists a PSD matrix M such that p_M = p — again, modulo the equations x_i^2 = x_i. So to check that p can be proven non-negative by a degree-d proof, we need to check that there exists a positive semidefinite matrix satisfying certain linear equations, and this is not just a convex program, but in fact a convex program that we can solve efficiently.
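Here is a small numerical sketch of the lemma — my own illustration using numpy, with n = 2 and the monomial basis [1, x0, x1]: eigendecompose a PSD matrix M into sum_i v_i v_i^T and read off an explicit SOS decomposition of p_M.

```python
import numpy as np

# Any PSD example (weakly diagonally dominant, so PSD).
M = np.array([[ 2.0, -1.0, -1.0],
              [-1.0,  1.0,  0.0],
              [-1.0,  0.0,  1.0]])

# M = sum_i lam_i u_i u_i^T; with v_i = sqrt(lam_i) u_i we get
# M = sum_i v_i v_i^T, hence p_M = sum_i p_i^2 with p_i(x) = <v_i, (1,x0,x1)>.
lam, U = np.linalg.eigh(M)
assert (lam >= -1e-9).all(), "M must be PSD"
V = U * np.sqrt(np.clip(lam, 0, None))       # columns are the v_i

def p_M(x):
    b = np.array([1.0, x[0], x[1]])          # monomial basis vector
    return b @ M @ b                         # sum_{A,B} M[A,B] x_A x_B

def sos(x):
    b = np.array([1.0, x[0], x[1]])
    return sum((V[:, i] @ b) ** 2 for i in range(V.shape[1]))

for x in [(0, 0), (0, 1), (1, 0), (1, 1), (0.3, -2.0)]:
    assert abs(p_M(x) - sos(x)) < 1e-9
print("p_M = sum of squares, as the lemma claims")
```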
This is what's known as semidefinite programming, and I'm not going to go into the theory of semidefinite programming, but maybe an intuition for why this is easy: you can test whether a matrix is positive semidefinite — you just do the eigenvalue decomposition and check that all the eigenvalues are non-negative — so it is, in that sense, an easy set to optimize over. Right. So now, when we're working in this Boolean setting, we don't consider rational sums of squares, but we see that in some sense we don't need rationals: because we are working over the Boolean domain, and because we have this power of reducing modulo the ideal, we can prove any non-negative polynomial to be non-negative with finite degree — at most 2n. Yes? [Student: but could the degree be lower if rational functions are allowed?] So I think with rational functions it's a bit complicated; you probably want them to have a common denominator, and then you multiply the denominator against f, and it's not clear whether that gives you extra power. On the cube, I don't know — I don't think it does, and I don't think anybody knows.

By the way: Tuesday at 4 p.m. I'm going to give a high-level talk about sum of squares, and in particular the Bayesian interpretation of it. That will be a somewhat more philosophical, high-level talk, whereas here I try mostly to stick to the math. I think you will not be bored if you go to that talk even having been here, because there will not be much overlap — largely because that talk will be a little bit content-free; it will be mostly philosophy and general proclamations. So if you don't like those kinds of ruminations, don't go to the talk, regardless of whether you were here or not. But it might be interesting.
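To make the semidefinite program above concrete, here is a toy degree-2 feasibility SDP — my own sketch using cvxpy, not the course's software — certifying that f = x0 + x1 - 2*x0*x1 is non-negative on {0,1}^2. The variable M is indexed by the monomial basis [1, x0, x1], and the linear constraints match the coefficients of p_M to those of f modulo (x_i^2 - x_i):

```python
import cvxpy as cp

M = cp.Variable((3, 3), PSD=True)
constraints = [
    M[0, 0] == 0,                          # constant coefficient of f
    M[0, 1] + M[1, 0] + M[1, 1] == 1,      # x0 (note x0*x0 reduces to x0)
    M[0, 2] + M[2, 0] + M[2, 2] == 1,      # x1
    M[1, 2] + M[2, 1] == -2,               # x0*x1
]
cp.Problem(cp.Minimize(0), constraints).solve()
print(M.value)   # up to solver tolerance: [[0,0,0],[0,1,-1],[0,-1,1]],
                 # i.e. f = (x0 - x1)^2 over the cube
```

Here the constraints pin M down uniquely, but in general any feasible PSD matrix yields a valid degree-2 certificate via the eigendecomposition trick above.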
Okay. So, generally speaking, here is the ideal case — let's focus right now on the certification problem. We have a function whose true global minimum is zero; it's non-negative. Ideally, we have some reasonable time budget — maybe we can run for, I don't know, n-squared-ish time, so we run the sum of squares algorithm with d = 2 — and ideally it spits out a degree-d proof that the function is non-negative. It certifies the true bound on the function with a small d, we are very happy, and we've solved the certification problem. But sometimes you crank up Pablo's software, put in the function — suppose you even know it's non-negative for some other reason, or you have good reasons to believe it — and you try to get a proof that it's non-negative, and the algorithm doesn't give you one. You run it, and the algorithm in effect prints an error message that says, more or less: "Dude, this problem is NP-hard; you didn't expect me to solve it in all cases." We can't expect the sum of squares algorithm to always output a short proof that a function is non-negative, because this is in general an NP-hard problem. But the nice thing about the sum of squares algorithm is that even when it doesn't output a proof, it outputs something, and now I want to talk about what that something is — what it does when it fails to find a proof. This is the notion of a pseudo-distribution.

So what does it mean to fail to find a proof? We said we have this set K_d of all the f's for which you can prove in degree d that f is non-negative. Failing to find a proof means we have this function — call it f_0 — that happens to be non-negative but is not in this set. Now, the hyperplane separation theorem says that if you have a convex set and a point not in the set, then there is a hyperplane separating the two. What is the space we live in? Just to recall: we live in the space of all functions from {0,1}^n to R — in other words, a 2^n-dimensional space. So by the hyperplane separation theorem, there exists some mu: {0,1}^n -> R such that the dot product <mu, f> is non-negative for every f in K_d, but <mu, f_0> is negative. And since one of the functions in K_d is the all-ones function, we can assume without loss of generality, by rescaling, that <mu, 1> = 1. So, because of convex duality, when the sum of squares algorithm doesn't give you a proof, it gives you a certificate that f_0 does not have a proof, and that certificate is simply this mu: you either get a degree-d proof that f is non-negative, or a mu certifying that f doesn't have one. And this mu is what we call a pseudo-distribution. Let me try to motivate that name a bit more. We define the pseudo-expectation of f with respect to mu to be simply the dot product of mu and f:

pseudo-E_mu[f] = <mu, f> = sum over x in {0,1}^n of mu(x) * f(x).
Notice that if mu were an actual probability measure, this would be the expectation of f under that measure. Now we define: mu is a degree-d pseudo-distribution if pseudo-E_mu[1] = 1 and pseudo-E_mu[p^2] >= 0 for every p of degree at most d/2. So we say mu is a pseudo-distribution of degree d if it satisfies these two things — and right now this is just a definition. Why do I call it a pseudo-distribution? First, notice the one thing I didn't require: that mu(x) >= 0 for every x. If mu(x) were non-negative for every x, mu would just be an actual distribution: it sums to one and consists of non-negative numbers. And if it were an actual distribution, then in particular the expectation of any square would be non-negative. So we have a very simple Fact 1: if mu is a distribution, then mu is also a degree-d pseudo-distribution for every d. There is also a kind of weak converse, Fact 2: if mu is a degree-2n pseudo-distribution, then mu is an actual distribution. The proof is the following. If mu is a degree-2n pseudo-distribution, then in particular, for the polynomial we wrote before,

pseudo-E_mu[ ( product for i = 1 to n of (1 - x_i - y_i + 2 x_i y_i) )^2 ] >= 0 for every y,

since the square of this degree-n product has degree 2n. But the product vanishes on every x except x = y, where it equals 1, so this pseudo-expectation simply equals mu(y). So if mu is a degree-2n pseudo-distribution, then mu(y) has to be non-negative for every y — and then it is an actual distribution. Yes? [Student: these pseudo-distributions in general — how long a description do they have? Do I have to write down 2^n numbers?] That's a great question. When did I plan to get to that? Let's hold that for five minutes; I'll give you the spoiler that we'll find a succinct representation for these pseudo-distributions. The way I'm describing them right now takes 2^n numbers, but we'll find an n^d-length succinct representation for them.

Okay. Fact 3, for which we'll maybe see an example later: just as there are true statements that we cannot prove with small-degree proofs, there also exist degree-d pseudo-distributions — for d, say, much smaller than n; n/10, or probably even closer to n, would be good enough — such that mu is not an actual distribution. So pseudo-distributions are a strict generalization of distributions: they can sometimes actually have these negative "probabilities." A pseudo-distribution is an object that sometimes behaves like a distribution and sometimes doesn't, and a lot of understanding sum of squares is understanding to what extent we can pretend they are distributions and when we have to be more careful.

So now let's talk about this notion of succinct representation. For this I'll make another definition. Let's write P_d for the set of polynomials from {0,1}^n to R of degree at most d.
You can think of P_d as essentially the same as R^(n^d), or R^(n choose at-most-d), something along those lines: it's a linear space of dimension roughly n^d. Now suppose E is a linear operator — a linear map from P_d to R. We say that E is a degree-d pseudo-expectation operator if E[1] = 1 and E[p^2] >= 0 for every p of degree at most d/2. (Note that the two 1's in the first condition are not the same thing: the first is the polynomial whose constant coefficient is 1, the second is the number 1, since E maps polynomials to real numbers.) So that is the definition of a pseudo-expectation operator. And note that to specify a pseudo-expectation operator, you only need n^d numbers: it's a linear operator, so you can pick some basis for these polynomials — say the monomial basis — and you just need to specify its values on the basis. So this does have a succinct representation.

Now we want to say that there is, in some sense, an equivalence between pseudo-distributions and pseudo-expectations, in the following way. First — and this is basically by definition — if mu is a degree-d pseudo-distribution, and we define E to be pseudo-expectation with respect to mu, then E is a degree-d pseudo-expectation operator: the linear operator "take pseudo-expectation with respect to mu" satisfies the two conditions. But it also goes the other way around. Here's another lemma: if E is a degree-d pseudo-expectation operator, then there exists some mu that is a degree-d pseudo-distribution such that E = pseudo-E_mu. The proof is that these are just linear equations on mu: pick a basis P_1, ..., P_N for P_d, and you want the equations <mu, P_i> = E[P_i] for i = 1, ..., N — and N is about n^d, much, much smaller than 2^n. These are linear equations, and we know they can be satisfied because E itself satisfies them, so this is an under-determined system of linear equations, and we can find some mu that satisfies it. So: given a pseudo-distribution, we can find a pseudo-expectation operator that agrees with it on all low-degree polynomials; and given a pseudo-expectation operator, there always exists a pseudo-distribution that maps to it. And if the only thing we care about, given a pseudo-distribution, is what expectations it gives to low-degree polynomials — and typically it is, because otherwise we'd get nonsensical results — then we can always represent it with n^d numbers, and we have an equivalence between pseudo-distributions and pseudo-expectations: if we consider mu and mu' equivalent when they agree on all degree-d polynomials, then there is a one-to-one mapping between these equivalence classes and pseudo-expectation operators.
So, as far as we are concerned, we can basically assume that these pseudo-distributions have an n^d-sized representation. Does that answer your question? Yes? [Student: the mu you chose might give a negative value to the function you're trying to disprove — why is that consistent with non-negativity on squares?] So, typically we think of the function we're trying to disprove as itself having degree at most d, so the pseudo-expectation operator is defined on that function as well, and the function need not be a square. In some sense, another way to say it: by the duality theorem, a function cannot be proven non-negative via a degree-d sum of squares proof if and only if there exists a degree-d pseudo-expectation operator that gives that function negative expectation. And notice that if we take the degree d all the way to 2n, then — since degree-2n pseudo-distributions are actual distributions — this says that a function has a negative point if and only if there exists an actual distribution under which the function has negative expectation; in particular, that distribution could be concentrated on the negative point. So that's another way of seeing that at degree 2n we can prove that all the actually non-negative functions are non-negative: if we couldn't, there would have been an actual distribution giving them a negative value. But when a function is non-negative and doesn't have a small-degree proof, you can think of it as being sneakily non-negative. In truth there are no negative points, but for us — bounded observers who cannot certify that — we in some sense need to allow for the existence of negative points, because we cannot rule them out. The best bounds we can give are weaker; to us mortal observers, it looks as if the function does have actual negative points.

Okay. So now, having defined pseudo-distributions and pseudo-expectations, I think we are in a position to state the sum of squares algorithm — the Boolean sum of squares algorithm. The theorem is basically the following claim, ignoring some numerical issues: for every d and every f of degree at most d (let's say, to avoid some trivialities), exactly one of the following holds. Either (1) there is a degree-d sum of squares proof that f is non-negative, or (2) there exists a degree-d pseudo-distribution mu that gives f negative pseudo-expectation. Moreover, we can find (1) or (2), whichever is the truth, in n^O(d) time. To make this an actual theorem you need to be a bit careful: the real statement is something like, if you cannot certify that f is at least minus epsilon, then you can find a pseudo-expectation that gives f at most plus epsilon, for arbitrarily small epsilon, with some dependence on the coefficients of the polynomial. But let's ignore these issues of numerical accuracy; morally speaking, most of the time we can pretend the theorem is true as stated.
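Here is a concrete toy instance of alternative (2) — my own numbers, not from the lecture — tying together Fact 3 and the knapsack example from before. On {0,1}^3, take Etilde[x_i] = 1/2 and Etilde[x_i x_j] = 1/8 for i != j; the moment matrix over the monomial basis [1, x1, x2, x3] is PSD, so Etilde[p^2] >= 0 for every linear p, i.e. this is a valid degree-2 pseudo-expectation. Yet it gives f = (x1+x2+x3 - 3/2)^2 - 1/10 the value -1/10, while f >= 0.15 everywhere on the cube — so no degree-2 SOS proof of f >= 0 exists, and this pseudo-distribution agrees with no actual distribution.

```python
import numpy as np

t = 1 / 8
M = np.array([[1.0, 0.5, 0.5, 0.5],   # rows/cols indexed by 1, x1, x2, x3;
              [0.5, 0.5, t,   t  ],   # M[A,B] = Etilde[x_A * x_B], with
              [0.5, t,   0.5, t  ],   # x_i^2 already reduced to x_i on the
              [0.5, t,   t,   0.5]])  # diagonal
print(np.linalg.eigvalsh(M))          # all >= 0 (up to rounding): valid

b = np.array([-1.5, 1.0, 1.0, 1.0])   # the linear polynomial x1+x2+x3 - 3/2
print(b @ M @ b - 0.1)                # Etilde[f] = -0.1, yet min f = 0.15
```

The vector b is in the kernel of M, so Etilde[(x1+x2+x3 - 3/2)^2] = 0 — something no real distribution on the cube can achieve, since the sum of three bits is never exactly 3/2.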
So in some sense we could talk entirely in terms of pseudo-expectations. The thing that's actually useful for us is to pretend that pseudo-expectations are real expectations, and it's sometimes useful to pretend there is an actual distribution there — but that's mostly for notational reasons: it lets us talk about pseudo-expectations "with respect to a distribution," because otherwise you would have to give two different names to two different expectation operators, instead of pretending they correspond to two different distributions. There is an analogy here — I think Pablo mentioned it once. I think there was this paper by John Bell in quantum mechanics, something about "unspeakables" in quantum mechanics, saying roughly that there are certain quantities you cannot speak about — that the uncertainty principle rules out speaking about certain quantities. In a similar way, a quantity like the expectation of a polynomial of degree higher than d under a degree-d pseudo-distribution is an unspeakable: you can define some number for it, but it's not really meaningful. So if you find yourself taking expectations of polynomials of degree higher than d under a degree-d pseudo-distribution, you're doing something wrong.

Yes? [Student: in a practical sense, given a pseudo-distribution, can you basically rescale it to make it positive?] So — that's a very good question, and in some sense it's exactly the next thing I was going to talk about: what do we do in practice when we're given this pseudo-distribution? Let me just take the other question first and then come back to that. Yes? [Student: what happens in settings where you don't get this correspondence between pseudo-distributions and pseudo-expectations?] So it's not really about being discrete — for example, everything I did on the Boolean cube you can also do over the unit sphere. The main thing that's important here is that we have this very nice polynomial basis for the domain, coming from x_i^2 - x_i. In domains where we don't have that, some weird things can happen, and at that point we do what Pablo says: we forget about pseudo-distributions and just talk about pseudo-expectations, because we don't necessarily always have a nice way to define pseudo-distributions. So in some sense pseudo-distributions are just a mental crutch; if it hurts you more than it helps you, feel free to go without it. [Student: given an expectation operator, can we distinguish whether it comes from a real distribution or a pseudo-distribution?] That's another great question. Let me give the spoiler that the answer is no — it's NP-hard — but we'll see why. But let me go back to the earlier question: we get this pseudo-distribution, and in some sense we don't like it; we want to make it a distribution.
Typically, the way to make it into a distribution is not to open it up into its 2^n coefficients and try to make them positive; rather, we want some smarter way. This is the typical goal in a sum of squares based algorithm. Suppose you're looking not at the task of certifying that f is non-negative, but at the task of actually finding the minimizer of f. What you want then is what's known as a rounding algorithm. The name is not necessarily a good one — it comes from the fact that this can be seen as a vast generalization of simple algorithms for turning a solution of a linear programming relaxation into an integer solution — but it's not really about integers or anything like that. So what is a rounding algorithm? Its input is a pseudo-distribution mu under which f has some small pseudo-expectation, call it alpha-lower-bar. Its output is an actual distribution — call it rho — such that the expectation under rho of f is at most some alpha-upper-bar. In some sense, the integrality gap is exactly the difference between these two. Your goal is to find the minimizer, and you are given something that pretends to be a distribution over points where f is very small; what you typically want to output is one point — or, if it's a randomized rounding algorithm, a distribution over points — whose value is not too far from what you were promised. Typically you can't hope for better: alpha-upper-bar will always be at least alpha-lower-bar, because you cannot get something out of nothing, and how big the gap is is problem-dependent. This is where the design, and especially the analysis, comes in. Suppose you prove that such an algorithm always guarantees that alpha-upper-bar minus alpha-lower-bar is at most one — just some arbitrary number. Then you know that the bound the sum of squares algorithm certifies is off from the optimum by at most one, because if it gave you a bound that was too low, there would actually be a point whose value is only one bigger than that bound. So proving a rounding algorithm also certifies something about the bound. And when we look at this integrality gap — again, it's a general name; it's not always that we want to subtract alpha-lower-bar from alpha-upper-bar, sometimes we want to compare them by some other function — generally speaking, we want these two to be as close as possible. When we have negative results, those are results showing that the difference between the two can sometimes be very, very big. So a lot of the time, this is what the analysis amounts to: you want to show that these pseudo-distributions are not too far from actual distributions, that they cannot be too crazy.
So you want to show that you can map pseudo-distributions to actual ones; the map might sometimes genuinely try to sample from the pseudo-distribution in some sense, but sometimes it does things that are somewhat different. Yes? The question is whether f is also part of the input, or whether this is over all f. Typically f is part of the input; the rounding algorithm knows f. But generally, if you want to show a bound on the integrality gap, you restrict f to a certain family. For example, you would say f corresponds to the maximum cut of some graph, and you look at the family of all functions that correspond to cuts, where minimizing f corresponds to maximizing the value of the cut. Then you want to say that the gap is not too big: if you have a pseudo-distribution that pretends to be a distribution over very, very good cuts, then you can find an actual distribution over decent cuts. Let me jump ahead and show you one of the main tools we actually use to do these things; this is related to the earlier question as well. A lot of the questions we ask are of the form: how close are pseudo-distributions to actual distributions? Here are two computational questions that are very natural for us to ask. Question: given a degree-D pseudo-distribution μ, is there an actual distribution ρ such that E_ρ[P] = Ẽ_μ[P] for every polynomial P of degree at most D? We know it's not always the case, but is there an algorithm to decide whether it is, and, when it is, to find such a ρ? Generally the answer will be depressing: no. There are cases where no actual distribution agrees with the pseudo-distribution, it is hard to tell whether one exists, and, given that it's hard to tell whether one exists, it's also hard to find one when it does. But there is one very useful tool, which is also used a lot in practice: the quadratic sampling lemma. What it says is that the answer is yes if we restrict to D = 2 and allow ρ to range over R^n. So even though we may want distributions over {0,1}^n, we need to allow ρ to range over R^n, and we can only do it for degree 2. But this is still a very, very useful lemma, and in fact we use it in rounding algorithms; let me state it more formally in a second. People even use it in practice. I know people use something like this in finance, and, to take an example that is more relevant right now, I'm guessing that Nate Silver at FiveThirtyEight does something like this, though I'm not 100% sure.
They have a matrix of correlations between different states, or maybe different populations, and so on: they know that Pennsylvania has some correlation with Ohio, some correlation with New Jersey, or whatever. So they have this matrix of correlations between states, and they have some expectation of, say, the number of votes Clinton will get in each state. Now they want to sample from a distribution over all 50 states that agrees with these correlations and these expectations. This is a simplification of what they do, but I think something like that has to be at the heart of it. Yes? So the question is about the quadratic sampling lemma: since ρ is defined over R^n, there might be two polynomials that have the same values on the grid points but not over all of R^n, so two p's could have different expectation values while agreeing on the cube. So, I'll state the sampling lemma more precisely, but ρ is a concrete distribution: given μ, the lemma gives you an actual distribution ρ, and once the distribution is fixed, the expectation of each polynomial is fixed, so it assigns a single number to each polynomial. It's true that it does this in a somewhat cheating manner, because it uses the trick that even though we intended the polynomial to be applied only on the cube, it is a polynomial: you can feed real numbers to it and see what comes out. But it is a single number; ρ makes it a single number. It's just that allowing this flexibility makes life easier for us. The follow-up: what about x_1 versus x_1 squared? Yes, so let's say we consider only multilinear polynomials. Life is also simpler here because the total degree is at most two; in that case it doesn't even have to be multilinear, the polynomials can have x_i² terms, but let's say multilinear to make life simpler. That's a good point. So let me state the lemma and then go back to how someone like Silver would use it. The lemma says the following: for every degree-2 pseudo-expectation operator Ẽ, there exists an actual multivariate Gaussian distribution ρ over R^n, given by random variables Y_1 through Y_n, such that, writing things in the monomial basis, the actual expectation E[Y_i] equals what the operator gives for the polynomial x_i, and the actual expectation E[Y_i Y_j] equals what the operator gives for the polynomial x_i x_j. This is the quadratic sampling lemma: given a degree-2 pseudo-expectation operator, you can find an actual distribution that matches these moments, and you can find it efficiently. The way people use it in practice is, say, something like this: you know the correlations between states and the expected vote for Clinton in each state, and now you want to compute the probability that Clinton wins the presidency. That is a somewhat complicated event, right?
Because it depends on the probability of carrying some subset of states that adds up to 270 electoral votes, so it's not exactly well defined just from these pairwise correlations and expectations. But somehow they have to come up with a number. So what do they do? They take this matrix of correlations and the vector of expectations, and, probably after performing some other hacks, they come up with some distribution that matches these moments. Then they can sample from this distribution: each sample is a vector of dimension 50, they add up the electoral votes of the states she carries in that sample, and if it comes to more than 270 they say she won in that simulation; otherwise she didn't. Then they output the fraction of simulations in which she won. Now, I'm pretty sure they do something a little more complicated than that, but the basic idea has to be along these lines, because you are given some partial information, some correlations or moments, and you need to find a distribution that matches those moments. Yes? So the question is whether ρ lies in the dual cone to our original cone. It's not exactly dual; in some sense it's a subset, because pseudo-distributions are a generalization of distributions, so the set of actual distributions matching certain moments is a subset of the set of pseudo-distributions matching those moments. So let me prove the lemma; it's actually pretty simple. The proof is the following. We know Ẽ is a pseudo-expectation operator, so for every degree-1 polynomial p, which is simply some Σ_i p_i x_i, we know that Ẽ[p²], which is Σ_{i,j} p_i p_j Ẽ[x_i x_j], is non-negative: a pseudo-expectation operator must give non-negative values to squares. In fact, to make my life simple, let me start by shifting: assume Ẽ[x_i] = 0 for every i. This is not really important; you can always shift to get to this case, so let's just assume it. Now, what the non-negativity really means is that if we define the matrix M by M_ij = Ẽ[x_i x_j], then M is a PSD matrix, because for every vector p, p^T M p is non-negative. So we can write M = Σ_i λ_i v_i v_i^T, where the λ_i are non-negative, choose a standard Gaussian z = (z_1, ..., z_n), and define y to be the vector √M z; in other words, the vector you get by taking Σ_i √λ_i v_i times the dot product of v_i and z.
So I showed you how to sample from this distribution: you sample a standard Gaussian z, where every entry is an independent normal variable with mean zero and standard deviation one, and you compute y as Σ_i √λ_i v_i ⟨v_i, z⟩. Or, thinking of z as a column vector, you simply apply the matrix √M, which is the same matrix M with its eigenvalues replaced by their square roots. Now the only thing left is to verify that we match the moments. We want to show that E[y_i y_j] = M_ij, or, thinking of y as a column vector, that E[y y^T] = M. That is E[(√M z)(√M z)^T], and by some manipulations this is √M E[z z^T] √M^T; I see there are some physicists here, so clearly if the dimensions match, all the transposes will be okay and we don't have to worry about them. So what is E[z z^T]? For a standard Gaussian, E[z_i z_j] equals 1 if i = j and 0 otherwise, so this is just the identity matrix. So we get √M times the identity times √M, which is M. These are the kind of linear algebra manipulations that can seem magical if you're not used to them, but there is nothing deep going on; you can also do the calculation index by index. If I do the calculation that way I always get it wrong at first, but eventually you chase the i's and j's and see it. So it's actually very, very simple, and what it means is that we can match the first two moments, if we allow ourselves the flexibility of producing real numbers rather than 0/1 values. And like I said, this is used in some form in practice. Let me hedge, in case I post this online and Nate Silver comes suing me: I have no idea what they actually use at FiveThirtyEight. But I definitely know this comes up in finance, where you have a similar situation: you might have a somewhat complicated financial product that depends on a lot of underlying assets or markets, you know some correlations between these markets, and now you want to estimate the probability that something bad happens to your complicated product, so you need some way to run tons of simulations that agree with the information you know.
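Here is a minimal sketch of this proof as code, in Python with NumPy. It assumes the moments have already been centered so that Ẽ[x_i] = 0, and takes as input the matrix M with M_ij = Ẽ[x_i x_j]; the specific correlation numbers, and the election-style usage at the end, are made up purely for illustration.

```python
import numpy as np

def quadratic_sample(M, n_samples, rng):
    """Sample y = sqrt(M) z with z a standard Gaussian, so E[y y^T] = M."""
    lam, V = np.linalg.eigh(M)               # M = V diag(lam) V^T
    lam = np.clip(lam, 0.0, None)            # guard against tiny negative eigenvalues
    sqrtM = V @ np.diag(np.sqrt(lam)) @ V.T  # same eigenvectors, square-rooted eigenvalues
    z = rng.standard_normal((n_samples, M.shape[0]))
    return z @ sqrtM                         # sqrtM is symmetric; each row is one sample y

rng = np.random.default_rng(0)
M = np.array([[0.25, 0.10, 0.05],            # hypothetical centered second moments
              [0.10, 0.25, 0.08],
              [0.05, 0.08, 0.25]])
Y = quadratic_sample(M, 200_000, rng)
print(np.cov(Y, rowvar=False).round(3))      # empirical covariance is close to M

# Election-style use (all numbers hypothetical): add the means back in and
# estimate by Monte Carlo the probability that the weighted sum clears a threshold.
means = np.array([0.52, 0.49, 0.51])         # pretend per-state vote shares
votes = np.array([20, 18, 10])               # pretend electoral votes
print(((Y + means) @ votes > votes.sum() / 2).mean())
```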
Right, so we had this question, and the quadratic sampling lemma gives us one answer: we can do it for degree 2. But this was a very simple proof; probably Gauss could have done it, I mean with the knowledge of the time. It uses fairly simple tools. So you might ask: maybe if you just work a little bit harder, you can do it for d = 3, or for d = 2 with Boolean output? Generally the answer is no, and before proving that the answer is no, let me tell you why the answer should be no, why there should not be a {0,1} quadratic sampling lemma. Given a graph G, we can define f_G from {0,1}^n to R; there is probably a nicer normalization, but let me write f_G(x) = −Σ_{i∼j} (x_i − x_j)², where the sum is over pairs i, j that are neighbors in G. This is a degree-2 polynomial, and if you think of a vector x as a cut of the graph, then minimizing this polynomial is maximizing the cut. Max cut is an NP-hard problem, so there is no general efficient algorithm that, given G, computes the minimum of f_G. Now let's try to understand what this means. Claim: for every D much, much smaller than n, there exists a graph G and a degree-D pseudo-distribution μ such that the pseudo-expectation Ẽ_μ[f_G] is smaller than the actual minimum over all x of f_G(x). Why is that the case? Maybe I should do this slowly, because this is our first proof and there are several quantifiers. Suppose otherwise: then there exists some constant D₀ such that for every graph G and every degree-D₀ pseudo-distribution μ, the pseudo-expectation Ẽ_μ[f_G] is at least min_x f_G(x). If that were the case, then by duality we could prove, in degree D₀, that f_G is at least the minimum of Ẽ_μ[f_G] over all degree-D₀ pseudo-distributions μ. So let's again stop and parse this. Pseudo-distributions are more general than actual distributions. If I take the minimum of E_ρ[f_G] over all actual distributions ρ, what do I get here?
Just the minimum of f_G, because the actual distribution that minimizes it is the one concentrated on the minimal element. The minimum over pseudo-distributions is a minimum over a richer set, allowing not just actual distributions, so in general it could be smaller; but under our assumption it actually equals the true minimum. So what we have proven is that, under our assumption, for every graph G with max cut value α, we can certify with a degree-D₀ proof that the max cut of G is at most α. And we clearly cannot certify that it is less than α, because that is not true. So we can certify the correct value of the max cut. That means we can search: not even binary search; if there are m edges in the graph, the max cut value is between 0 and m, so we can just check what is the best value we can certify, and that computes the true max cut value. But max cut is NP-hard, so if P is different from NP, this cannot happen. So this claim really holds under the assumption that P ≠ NP, which is of course a stronger assumption than 2 + 2 = 4, though still a believable one. So this already suggests that the claim should be true; and then we want to prove the claim unconditionally, and we can. This is also a general paradigm for understanding these things: sometimes we take a computational hardness result and it gives us certain predictions. For example, we believe not just that P ≠ NP, but that you cannot really beat brute force for NP, and that suggests the claim should be true not just for constant D, but for D as large as maybe √n or so: it should still be true that we cannot compute the max cut exactly. And indeed one can prove that, by translating computational hardness results into unconditional hardness results for these semidefinite programs.
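As a toy sanity check (my own illustration, not something from the lecture notes), here is the polynomial f_G in Python, with a brute-force minimization over the cube for a small graph. The exponential loop is exactly what the hardness discussion says you cannot avoid in general.

```python
from itertools import product

def f_G(x, edges):
    """f_G(x) = -sum over edges (i, j) of (x_i - x_j)^2; minimizing f_G maximizes the cut."""
    return -sum((x[i] - x[j]) ** 2 for i, j in edges)

# The 4-cycle: its max cut value is 4 (the alternating cut), so min f_G = -4.
four_cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(min(f_G(x, four_cycle) for x in product([0, 1], repeat=4)))  # -4
```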
So now let's show a concrete example, or at least state one. Here is a very simple example: take the three-cycle, the triangle. No matter how you cut it, you cannot cut all three edges, right? So in this case we know that f_G(x) ≥ −2 for every x in {0,1}³. If you take, say, a degree-6 pseudo-distribution, it will obviously certify this fact, but you can show that there is a degree-2 pseudo-distribution that doesn't. Let me just write what it is without proving it. The degree-2 pseudo-distribution has Ẽ[x_i] = 1/2 for each i, and, writing the expectations the way we wrote the edges, Ẽ[(x_1 − x_2)²] = Ẽ[(x_2 − x_3)²] = Ẽ[(x_3 − x_1)²] = 3/4; these are the three edges, each getting three quarters. Exactly how you get this is in the notes. So instead of cutting at most two edges, this pretends to cut three times three quarters of an edge each, which is 9/4 in expectation; what is important about 9/4 is that it is larger than 2. You can define a valid degree-2 pseudo-distribution with these moments, and the proof that it is PSD is in the lecture notes. The point is that this degree-2 pseudo-distribution pretends that in expectation you can cut 9/4 = 2.25 edges rather than 2 edges. So we know there is no actual distribution that would match these parameters; this is a very simple example of a pseudo-distribution whose moments you cannot match. And the NP-hardness discussion also tells you that the question of whether you can or cannot match the moments of a given pseudo-distribution is itself NP-hard, because if you could solve that question, then again you could solve max cut: you could basically search for a pseudo-distribution that has a corresponding actual distribution, or at least you would have a short certificate for the max cut value, by giving some moments and running the algorithm to verify that the moments correspond to an actual distribution. So it is still a hard question, given some moments, to decide whether or not they come from a distribution. Right now this is a qualitative statement, in the sense that it just says you cannot do it exactly, but typically what we are interested in is quantitative. Yes? The question is whether Ẽ[x_i] = 1/2 already implies those three expressions. It doesn't; this is only a partial definition of the moments. In the lecture notes the covariance matrix of this thing is written out: it is basically one quarter on the diagonal and minus one eighth off the diagonal.
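A quick numeric check of this example (again my own illustration; the actual proof is in the lecture notes): the degree-2 moment matrix, indexed by the monomials (1, x_1, x_2, x_3), has Ẽ[x_i] = Ẽ[x_i²] = 1/2 and Ẽ[x_i x_j] = 1/8, so that Ẽ[(x_i − x_j)²] = 1/2 + 1/2 − 2·(1/8) = 3/4, and one can verify that it is PSD.

```python
import numpy as np

# Moment matrix of the triangle pseudo-distribution over monomials (1, x1, x2, x3).
M = np.array([[1.0, 0.5,   0.5,   0.5],
              [0.5, 0.5,   0.125, 0.125],
              [0.5, 0.125, 0.5,   0.125],
              [0.5, 0.125, 0.125, 0.5]])
print(np.linalg.eigvalsh(M))  # all eigenvalues >= 0: a valid degree-2 pseudo-expectation
print(3 * 0.75)               # "cuts" 2.25 > 2 edges of the triangle in pseudo-expectation
```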
So this covariance matrix behaves as if you toss three unbiased coins, with some negative correlation between each pair of coins, and the negative correlation is such that the matrix is still PSD, but it is more negative than you can actually achieve in any real distribution. It behaves as if you toss three coins where every pair of distinct coins has probability three quarters of being different; the matrix still turns out to be PSD, so it is a valid degree-2 pseudo-distribution, but there is no actual distribution of three coins that can achieve this. Yes? The question is about my claim that checking whether given moments come from an actual distribution is hard. Okay, let's see; you might be right that, as I stated it, this needs care: what is the easiest way to see the hardness? That's a good question. I think there is definitely some computational hardness even for the problem of, given a set of moments, deciding whether there exists a distribution that matches them, but it might require a little bit of work to show it; let me get back to you on this. So, these pseudo-distributions somehow may behave like quite different objects from distributions, so you might ask: why do I give them these names, and why do I imagine that they are distributions? The reason is that even though they are not exactly distributions, they still satisfy some nice properties, and we can sometimes pretend that they are distributions. We've seen some bad examples; let's see some good ones. For example, we can prove the following theorem: for every degree-2D pseudo-distribution μ, or whatever the degree needs to be so that this makes sense, and for all P and Q of degree at most D, the pseudo-expectation satisfies Ẽ[PQ] ≤ √(Ẽ[P²] · Ẽ[Q²]).
So we have Cauchy–Schwarz. This is one nice thing that distributions have, and pseudo-distributions have it too. The way you prove it is the following. This is a scale-invariant inequality, so without loss of generality we can assume Ẽ[P²] = Ẽ[Q²] = 1, and now we need to prove that Ẽ[PQ] ≤ 1. We know that Ẽ[(P − Q)²] is non-negative; after all, being non-negative on squares is exactly what a pseudo-expectation does. And that equals Ẽ[P²] + Ẽ[Q²] − 2·Ẽ[PQ]. So if this is non-negative, and the first term is one and the second is one, it means Ẽ[PQ] can be at most one, right? So we get Cauchy–Schwarz. And we'll see more and more examples like this in the course: things we prove for distributions, Cauchy–Schwarz, AM–GM, all these kinds of things, and even more sophisticated things like hypercontractivity and the invariance principle. If we have a mathematical proof of such a fact, it often turns out, even if the proof is complicated, even if it's literally an analysis paper, that the proof itself is a low-degree sum-of-squares proof. In fact, the main way to generate examples that are robustly not sum-of-squares is the probabilistic method, which maybe explains why it took so much time for people to come up with concrete examples for Hilbert's question, and even those concrete examples were somewhat fragile; Motzkin's example, for instance, can be certified by very low degree rational functions. So in some sense, if your proof stays away from the probabilistic method, if it doesn't use Chernoff-plus-union-bound type arguments, typically it will be SOS-able. So there are basically two main tools we use when we do sum-of-squares research, and they are related to each other. The first is Marley's paradigm, which is basically: if you stay away from the probabilistic method, then every little thing is gonna be alright, so don't worry.
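Before moving on, a quick numeric sanity check of the Cauchy–Schwarz inequality from a moment ago (my own illustration): for degree-1 polynomials P and Q with coefficient vectors p and q over the monomials (1, x_1, x_2, x_3), the pseudo-expectation Ẽ[PQ] is just p^T M q, so the inequality is exactly the statement that the PSD moment matrix M defines a semi-inner product. Here M is the triangle pseudo-distribution from before.

```python
import numpy as np

# Degree-2 moment matrix of the triangle pseudo-distribution, as above.
M = np.array([[1.0, 0.5,   0.5,   0.5],
              [0.5, 0.5,   0.125, 0.125],
              [0.5, 0.125, 0.5,   0.125],
              [0.5, 0.125, 0.125, 0.5]])

rng = np.random.default_rng(1)
for _ in range(1000):
    p, q = rng.standard_normal(4), rng.standard_normal(4)
    # Etilde[PQ]^2 <= Etilde[P^2] * Etilde[Q^2], up to floating-point slack
    assert (p @ M @ q) ** 2 <= (p @ M @ p) * (q @ M @ q) + 1e-9
print("pseudo-expectation Cauchy-Schwarz verified on 1000 random degree-1 pairs")
```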
So typically the way Marley's paradigm works is this: you analyze, say, a rounding algorithm while pretending that the pseudo-distribution is an actual distribution, you do your whole proof, and then you go back to the proof and SOS-ify it, and the idea is that as long as you didn't use the probabilistic method, it will be okay. Basically, you design a rounding algorithm pretending you are given the moments of an actual distribution μ over x's such that, say, f(x) ≤ α, and you use that to find some point x* with f(x*) at most some ᾱ. Then you look at the analysis of the algorithm. Suppose you proved some bound δ on the integrality gap, that is, you proved that ᾱ is at most δ worse than α. You go look at the proof of this statement, which at the end of the day is an inequality, and try to see that in the proof you only used things like Cauchy–Schwarz. And how do we prove inequalities anyway? We use Cauchy–Schwarz; we're dumb computer scientists, we don't know how to do anything else. So if you proved this inequality using only Cauchy–Schwarz, Hölder, and these kinds of things, then even though in your mind you thought you had an actual distribution, the proof remains true for a pseudo-distribution, and you conclude that even given the moments of a pseudo-distribution rather than an actual distribution, you would still output a point at most δ worse than the value you were promised. So this is a general paradigm for designing rounding algorithms for sum of squares. One way I like to think about it is as a non-type-safe programming language: a paradigm that allows you to prove wrong things, but gives you a lot of power and freedom to come up with an algorithm without getting bogged down in a lot of detail. You come up with an algorithm while pretending you have an actual distribution, which is often much easier to think about, and then you go back to your analysis of the algorithm and extract a proof from it. And what often happens is that this separates the work of designing a rounding algorithm into two parts. The first part is the creative one, where you come up with an algorithm given actual moments; it requires creativity, but typically not a lot of calculations and bounds.
The second part is not creative, because it consists of mechanically going over your proof line by line and making sure all your arguments are sum-of-squares-able, but it will often be the longer part of the paper and take more technical effort. So the paradigm separates the work nicely into two parts, which can be useful. Now, there is something deeply troubling about pretending this is an actual distribution. Suppose we are talking about max cut: we are given a graph G, we want a rounding algorithm for max cut, which is what we'll do next week, and we pretend we are given a distribution μ over cuts of G with value α. Really we are given the moments of this distribution, so we can try to compute things like the marginals, say Ẽ_μ[x_i], and maybe some higher moments. And in almost all cases you will come to the conclusion that this distribution has a lot of entropy: typically the marginals will be one half, and the correlations between pairs will be small, so based on these moments, the distribution would have a lot of support if it were an actual distribution. But this is somewhat confusing, because often in a graph, if you take the absolute maximum cut, there is only one of them. Someone might even promise you: I generated this graph in a way that there is a single, unique maximum cut. Still, the pseudo-distribution will seem to tell you: I am giving you a distribution over many such cuts, not just that one. And maybe the graph doesn't even have a cut of that value at all, yet you will still find a pseudo-distribution pretending there are many of them. So it can be hard to interpret: how can I pretend this is an actual distribution when I have good reasons to believe there is no actual distribution over cuts of this graph with these properties? This is the other paradigm we use, which I call drinking the Bayesian Kool-Aid. We think like Bayesians. If you think like a Bayesian, you might say the following: yes, there is only one cut, but because I don't know it, I have uncertainty about it; this entropy is not entropy because there are many cuts, it's entropy because I have a lot of uncertainty. It's somewhat unusual, because typically Bayesians think of uncertainty as coming not from being computationally bounded, but from not having enough information; but it's a similar idea, right?
It's like this example: say I personally don't know the eye color of my great-grandfather. He either had blue eyes or he didn't; it already happened, there is no probability involved. But to me, the event of whether he had blue eyes is something I can put a probability on, based, say, on all his descendants. In some sense it makes no sense, because he either had them or he didn't, it already happened; but I don't have knowledge about it, so I can put a probability on it. You can think of these pseudo-distributions the same way: they typically show you more uncertainty than actually exists, and it's really uncertainty about your own knowledge, but this time it's because you are computationally bounded, not because the information is missing. The graph is given to you in all its glory, all the edges; if you just had exponential time at your disposal, you would be able to find the maximum cut and you would have zero entropy about what that cut is. But because you don't have that amount of time, you have uncertainty. And the reason I call this the Bayesian Kool-Aid is that what you do, which I think is also what Bayesians do, is take these probabilities that in some sense don't make any sense, that just talk about your uncertainty, pretend they are actual probability distributions, do whatever manipulations on them as if they were actual probability distributions, cross your fingers, and hope that either you didn't make any mistake, or at least, if you did, that the referees won't find out. I think this is all I wanted to say for today. Any questions? If there are no questions: again, everyone please go to the website and join Piazza, and I'll probably mail you some reading homework on Monday. If you haven't yet read the lecture notes for today's lecture, they are already there on the web page, that's a reminder, and on Monday I'll give some more reading toward Friday's lecture. I hope to see you on Friday at Harvard, Maxwell Dworkin; it's only about a $12 Uber away, and you can share it with three people. And if you have any comments about the course, things you would like me to cover, things that are confusing, feel free to talk to me, email me, send a private message on Piazza, or post on Piazza anonymously or non-anonymously, whatever works.