Okay. So let's start with any questions that people had while working on the exercises. Yes? I was wondering if you could talk a little bit about how finding the sum-of-squares proof works, because I see that there are about n^O(d) decision variables, but it seems like there are exponentially many constraints, so there's a bit of a hole there. Okay, so let me talk about that; that's a good idea. I was planning to do a quick overview of convex programming anyway. I mean, Pablo sent some very good resources that give you the full picture, but it might not be a bad idea to try to give some summary. Any other questions? So let me give you a crash course in convex programming. Probably many of you know it, some of you may not, some of you know it better than me, but it doesn't hurt. Okay. So one thing that you want to remember is that two things are generally easy for computers, and we'll talk a little bit about why: you want to maximize a concave function or minimize a convex function. Now, this is not something you have to remember by heart; the way to see it is the following. This is a convex function, and you can see that it's easy to minimize, because if you put a marble here, it will just roll down to the global minimum. This is basically the idea behind algorithms for minimizing a convex function or maximizing a concave function: you have some marble and you keep going downwards until you hit the bottom. So, is that exactly gradient descent, like a marble rolling? Yes, gradient descent. You typically have to decide what steps to take with the marble, and this is actually what we'll talk about soon. So generally, what is a convex set? A set K in R^n is convex if for every a and b in K, the line segment between a and b is in K, or more formally, for every p in [0, 1], pa + (1 - p)b is in K.
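The marble picture is just gradient descent; here is a minimal sketch of the idea (the function f(x) = (x - 3)^2 and the step size are my own illustrative choices, not from the lecture):

```python
# Gradient descent as the "marble rolling downhill" picture: on a convex
# function, repeatedly stepping against the gradient reaches the global minimum.

def f(x):          # a convex function with its minimum at x = 3
    return (x - 3.0) ** 2

def grad_f(x):     # derivative of f
    return 2.0 * (x - 3.0)

x = 0.0            # starting point ("drop the marble here")
step = 0.1         # step size (how far the marble rolls each time)
for _ in range(100):
    x = x - step * grad_f(x)

print(round(x, 4))   # 3.0 -- the marble has settled at the global minimum
```

The choice of step size is exactly the question of "how much to take the marble": too large and it overshoots, too small and it converges slowly.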
So a convex set contains every line segment between its points. One way to think of a convex set, and often it's the way the sets we talk about arise, is that K is the set of x satisfying some constraints f_1(x) ≤ α_1, ..., f_m(x) ≤ α_m, where the f_i's are themselves convex functions. You can see that the intersection of several convex sets is convex, and that a set of the form {x : f(x) ≤ α} where f is convex is also a convex set. The definition of a convex function, right, is that f(pa + (1 - p)b) ≤ p·f(a) + (1 - p)·f(b) — or more generally, I'm always confused with these things, but the point is that f of the average is smaller than the average of f. Another way to say it is that f(E[X]) ≤ E[f(X)]. Now, what is convex programming? In convex programming, we want to do the following: minimize some f(x), where f is convex, subject to x being inside some convex set K. And basically, if we guess the minimum value α, this is the same as checking whether K' is the empty set, where K' = K ∩ {x : f(x) ≤ α}. So we can ask the minimization question or the feasibility question — whether a given convex set is empty or not. And one of the basic facts about convex sets is the hyperplane separation theorem. It tells you that if you have a convex set K and a point a that is not in K, then there is a hyperplane separating the two. And using this theorem, there is the ellipsoid algorithm. The ellipsoid algorithm says: suppose you have some set K' and you want to know if it's empty or not. And you know that if it's not empty, it has to contain a ball of some radius r.
And it's also always going to be inside some bigger ball of radius R. Then the ellipsoid algorithm basically does the following: you take some point inside the big ball. If it's inside your set, you're done — you know the set is not empty. If it's not inside your set, then there is some hyperplane that separates the two, and then you can take the smallest ellipsoid containing the intersection of the ball and the half-space on the feasible side; it's smaller in volume than the original ball, so you make some progress. Generally speaking, the running time of this will be poly(n, log(R/r)), which means it will be polynomial time as long as, say, r is at least 2^(-poly(n)) and R is at most 2^(poly(n)): the ratio between these two things is 2^(poly(n)), so its logarithm is still polynomial. And what you can see from this vague description is that to run this algorithm, what you need is what's called a separation oracle. So first of all, you need a way to test, given x, whether x is in K. Otherwise we cannot even know if we found a solution. And this is not trivial: there are some convex sets where we don't know how to efficiently test membership. In fact, in almost all natural cases that I know of, if we have an efficient algorithm to test whether a point is in the set, that algorithm typically also gives you a certificate when the point is not in the set. So basically, you want some kind of algorithm where if x is in K, you get one answer — typically, if K is defined by some kind of constraints, you can just check them and get a certificate that x is in there — and if x is not in K, then you get some y such that ⟨x, y⟩ < 0 but ⟨a, y⟩ ≥ 0 for every a in K. This is known as a separation oracle. And modulo some technical conditions,
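The ellipsoid iteration described above can be sketched in a toy 2-D feasibility problem. Everything concrete here — the target ball, the radii, the iteration cap — is an illustrative choice of mine; the update formulas are the standard central-cut ellipsoid step:

```python
import numpy as np

# Toy ellipsoid method in n = 2 dimensions: find a point in the ball of
# radius r around `target`, which is known to lie inside the ball of radius R.
n, R, r = 2, 10.0, 0.5
target = np.array([3.0, 3.0])

c = np.zeros(n)            # ellipsoid center
P = (R ** 2) * np.eye(n)   # shape matrix: E = {x : (x-c)^T P^{-1} (x-c) <= 1}

def separation_oracle(x):
    """Return None if x is feasible; otherwise a direction a such that
    <a, y> < <a, x> for every feasible y (a separating hyperplane)."""
    if np.linalg.norm(x - target) <= r:
        return None
    return x - target      # normal pointing from the feasible ball toward x

found = None
for _ in range(500):
    a = separation_oracle(c)
    if a is None:
        found = c          # the center landed inside the set: not empty
        break
    # Central-cut update: replace the ellipsoid by the smallest ellipsoid
    # containing its intersection with the half-space {x : <a, x> <= <a, c>}.
    g = a / np.sqrt(a @ P @ a)
    Pg = P @ g
    c = c - Pg / (n + 1)
    P = (n ** 2 / (n ** 2 - 1.0)) * (P - (2.0 / (n + 1)) * np.outer(Pg, Pg))

print(found is not None and np.linalg.norm(found - target) <= r)  # True
```

The volume shrinks by a constant factor per step, which is where the poly(n, log(R/r)) iteration count comes from.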
basically, if you have an efficient separation oracle for the set K, then you can optimize over it. I should also mention what people actually do in practice: they typically don't use the ellipsoid algorithm, they use algorithms known as interior point methods. You can think of interior point methods as a high-dimensional generalization of Newton's method: you are at some given point, you take the derivative — a linear approximation — and then you use the second derivative to understand the ball in which this linear approximation is still somewhat accurate, where the first derivative still dominates the effect of the second derivative. You move in the right direction inside this ball, and then you continue onwards. But for us, let's just say that once you have a separation oracle, we know this problem is solvable. The point I wanted to make is that if you try to apply the ellipsoid theorem as is, it will have horrendous running time; in practice, people have better ways to do it. (And maybe you can find a seat — you can even sit next to Pablo; I guess you can literally sit on the shoulders of giants here. There is a chair there, one more in the front left.) So the one set that we care about, and that we do have a separation oracle for, is the set of positive semidefinite matrices. So take K to be the set of matrices M such that M is positive semidefinite — we have seen various equivalent characterizations of that. Then we can efficiently test if a matrix is in K: again, ignoring issues of numerical accuracy, this is basically just computing the eigenvalues of M and checking that they are all non-negative. And how do we get a separation oracle?
Suppose we found that M has a negative eigenvalue. This means that if M is not PSD, we can efficiently find some v such that vᵀMv — which is the same as the dot product of M and vvᵀ, if you think of them as n²-dimensional vectors — is negative. So if M is not in K, then we can efficiently find this v. But for every N that is PSD, the dot product of N and vvᵀ, which is vᵀNv, is always non-negative. So this minimum eigenvector gives you an efficient separation oracle, and you can optimize over semidefinite matrices. Now, going back to your question: let's look at the set of pseudo-distributions — maybe I'll just rewrite it. So let's think of the set K. Let's start with the set of actual pseudo-distributions, which are 2ⁿ-dimensional: we can look at the set K of functions μ : {0,1}ⁿ → R such that μ is a degree-d pseudo-distribution. And we can look at the set K̄ of linear functions Ẽ from polynomials of degree at most d to R, such that Ẽ is a degree-d pseudo-expectation. So what we have shown is that basically we have a mapping in both directions: every pseudo-distribution corresponds to a pseudo-expectation, and every pseudo-expectation corresponds to a pseudo-distribution. So if we only care about the expectations of low-degree polynomials, we can work with this set K̄ instead. The nice thing about K̄ is that you can specify its elements by their values on a basis of the polynomials of degree at most d, so K̄ lives in dimension n^O(d) — it's not as big. And it's still a convex set: we have a linear map from K to K̄, so K̄ is basically a projection of the convex set K, and a projection of a convex set is convex.
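The minimum-eigenvector separation oracle for the PSD cone can be sketched directly; the example matrix is an illustrative choice of mine:

```python
import numpy as np

# Separation oracle for the PSD cone: if M is not PSD, the eigenvector v of
# its most negative eigenvalue separates M from the cone, since
# <M, v v^T> = v^T M v < 0 while <N, v v^T> = v^T N v >= 0 for every PSD N.

def psd_separation_oracle(M):
    eigvals, eigvecs = np.linalg.eigh(M)   # M assumed symmetric; eigenvalues ascending
    if eigvals[0] >= 0:
        return None                        # M is inside the PSD cone
    return eigvecs[:, 0]                   # eigenvector of the minimum eigenvalue

M = np.array([[1.0, 2.0],
              [2.0, 1.0]])                 # eigenvalues 3 and -1: not PSD
v = psd_separation_oracle(M)
print(v @ M @ v < 0)                       # True: v certifies that M is not PSD
```

Viewing vvᵀ as the hyperplane normal in n²-dimensional space is exactly the ⟨x, y⟩ < 0 versus ⟨a, y⟩ ≥ 0 dichotomy from the separation-oracle definition.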
And now we can efficiently test membership of Ẽ in K̄. It basically amounts to testing some linear constraints on Ẽ, plus one PSD constraint. So we want to test things like Ẽ[1] = 1 — that's just a linear constraint. And then we define a matrix M whose rows and columns correspond to monomials: here there will be some monomial x^α, here a monomial x^β, and the entry will be Ẽ applied to the polynomial x^(α+β). Then the condition that Ẽ[p²] is non-negative for every p of degree at most d/2 amounts to this matrix being positive semidefinite. So basically, this set K̄ is a set of positive semidefinite matrices plus some linear constraints, and we can efficiently optimize over it in time n^O(d). And again, there are issues of bit complexity, which I probably shouldn't ignore, but let me ignore them right now; in the cases I'll talk about, they make no difference. Say we actually believe a sum-of-squares proof exists, and we actually want to find the proof itself — can we do that? Right. So the sum-of-squares proof is the dual of the pseudo-distribution. You can do either one of these things: you can also optimize over sum-of-squares proofs, which is again going to be a positive semidefinite program. So basically, suppose you want to prove that f is non-negative, with a degree-d sum-of-squares proof. Then what you ask for is the following: you want some M which is PSD, under the linear constraint that the polynomial corresponding to M is equal to f. So again, finding the sum-of-squares proof is basically optimizing over PSD matrices. And you can also do it the other way: sum-of-squares proofs and pseudo-distributions are the duals of one another.
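To make the moment-matrix construction concrete, here is a sketch that builds the degree-2 moment matrix of an actual distribution (uniform over the cube) and checks the constraints a pseudo-expectation must satisfy; the choice of distribution and of n is mine, for illustration:

```python
import numpy as np

# Degree-2 moment matrix, indexed by the monomials 1, x_1, ..., x_n.
# Entry (alpha, beta) is E[x^alpha * x^beta], reduced modulo x_i^2 = x_i
# (multilinear, since we work over the cube {0,1}^n).  For a genuine
# distribution this matrix must be PSD -- the same constraint that the
# pseudo-expectation E~[p^2] >= 0 imposes.

n = 3
cube = [np.array(bits) for bits in np.ndindex(*(2,) * n)]  # all of {0,1}^n

monomials = [frozenset()] + [frozenset([i]) for i in range(n)]  # 1, x_1..x_n

def moment(S):
    """E[prod_{i in S} x_i] under the uniform distribution on {0,1}^n."""
    return np.mean([np.prod([x[i] for i in S]) if S else 1.0 for x in cube])

k = len(monomials)
M = np.array([[moment(monomials[a] | monomials[b])  # union = product mod x^2 = x
               for b in range(k)] for a in range(k)])

print(M[0, 0] == 1.0)                       # the linear constraint E[1] = 1
print(np.linalg.eigvalsh(M)[0] >= -1e-9)    # the PSD constraint holds
```

A degree-d pseudo-expectation is specified by exactly this kind of table of "moments", with the PSD condition imposed even though no actual distribution need exist behind it.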
So if you can't find a sum-of-squares proof, you get a pseudo-distribution out of it by duality — it basically means there is no degree-d certificate, and you get a witness for that, which is the pseudo-distribution. And vice versa: if you can't find a pseudo-distribution, you can find a sum-of-squares proof. So basically, again, what I mean by the polynomial corresponding to M: you take a matrix whose entries correspond to monomials — this row is x^α, this column is x^β, and this is the entry M_αβ. The polynomial corresponding to M is simply P_M = Σ_{α,β} M_αβ · x^(α+β). Several entries can give the same monomial, so the same polynomial can correspond to several matrices. But it's just a linear constraint: requiring P_M to be equal to f is just a linear constraint on the entries of M. — You don't require it coefficient-wise; why is that? — No, we do require it coefficient-wise. It is still a linear constraint, because for a particular coefficient, like the coefficient of x^(α+β), there might be several entries contributing to it, so we basically require that the sum of those entries of M equals the corresponding coefficient of f. — Oh, OK. I thought it was P_M equal to f pointwise. — No, coefficient-wise, as a multilinear polynomial. — So you reduce to multilinear? — That's why it's on the cube: when I write x^(α+β), it's really modulo the cube, so x_i² is x_i. So it all basically comes down to optimizing over a set of matrices that are restricted to be positive semidefinite, with linear constraints. Yes? — So there would not be a duality gap? — I mean, under some relatively mild conditions, which hold in the cases we care about, there will not be a duality gap. It would be exactly the same.
So generally, in the settings we're talking about right now — maximizing some function over the sphere, or over the cube, some concrete polynomial over the cube — there will not be one. So there is more to say about convex programming: doing it efficiently, actually understanding the conditions, the conditions on bit complexity, duality, et cetera — and semidefinite programming in particular is a whole topic. Basically, Pablo's lecture notes and books are a very good source for that. And I think you don't even have to read so much, right, Pablo? Like two or three lectures give you the basics. I sent something on Piazza, I think, with particular lectures, but maybe Pablo can also say which ones to read if you want the minimum that will make you feel you understand it at better than a hand-waving level. (There are still two seats empty here, OK.) So now that we have quasi-pseudo-convinced ourselves that we can solve sum-of-squares programs, let's move to the topic of this lecture, which is various problems on graphs and matrices, in particular Max Cut, expansion, and Grothendieck's inequality. So first of all, graphs. A graph G = (V, E) has n vertices and is d-regular. Now, I think there are scientists who believe that non-regular graphs exist, but I think it's actually a hoax perpetrated by the Chinese to bring down the American economy. I don't believe in non-regular graphs. All graphs will be regular, at least in this lecture. So: n vertices, d-regular. A will be the adjacency matrix, so A_ij equals 1 if i is a neighbor of j, and 0 if i is not a neighbor of j. We'll also look at the Laplacian of the graph, which is L_G = I − (1/d)A. And we'll look at the following polynomial f_G(x).
It's f_G(x) = xᵀL_G x, which, if you do the math, comes out to be (1/d) Σ_{{i,j}∈E} (x_i − x_j)². Now, I'm always confused whether you sum over undirected edges, or over directed edges where you count every edge twice and get an extra factor of 2. I never remember it, and it doesn't matter — everything will be up to a factor of 2. (Yes, right, you were right: f_G is 1/d times this sum; the 1/d matters more than the 2.) OK. So what is the max cut? These are the parameters we care about. So basically, you take some subset S of vertices; the cut value cut_G(S) is the number of edges from S to its complement, normalized by the total number of edges. And the max cut of G is the maximum of cut_G(S) over all subsets S. We can also write it as the maximum of f_G(x) over x ∈ {0,1}ⁿ with the appropriate normalization, which in this case is d/|E| — and that's basically 2/n, because |E| = dn/2. OK, so that's the cut. Then there is the expansion of the graph: φ_G(S) is the number of edges from S to its complement divided by |S|·(n − |S|), and again you normalize — there's a factor of n/d. And φ(G), the minimum expansion over all S, can again be written as the minimum over x ∈ {0,1}ⁿ of f_G(x) divided by |x|·(n − |x|), times the normalizing factor in n and d. So the max cut problem is to compute the maximum cut value of a graph, or approximate it, and to find the set that actually achieves it. The expansion problem is to compute the minimum expansion of the graph, or approximate it, or find the set that actually achieves it. Let me just say that I will not make a great distinction between computing the value and finding the set or assignment that actually achieves the value,
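The identity f_G(x) = xᵀL_G x = (1/d)·Σ_edges (x_i − x_j)² and the cut normalization can be checked on a small example; the 6-cycle is my illustrative choice (it is 2-regular and bipartite, so its max cut is 1):

```python
import numpy as np
from itertools import product

# Check on the n-cycle (d = 2):
#   f_G(x) = x^T L_G x = (1/d) * sum_{edges {i,j}} (x_i - x_j)^2,
# and brute-force the max cut; an even cycle is bipartite, so max cut = 1.

n, d = 6, 2
edges = [(i, (i + 1) % n) for i in range(n)]
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.eye(n) - A / d                 # Laplacian L_G = I - (1/d) A

def f_G(x):
    return x @ L @ x

best = 0.0
for bits in product([0, 1], repeat=n):
    x = np.array(bits, dtype=float)
    edge_sum = sum((x[i] - x[j]) ** 2 for i, j in edges)
    assert abs(f_G(x) - edge_sum / d) < 1e-9   # the quadratic-form identity
    best = max(best, edge_sum / len(edges))    # fraction of edges cut by x

print(best)   # 1.0 -- the alternating 2-coloring cuts every edge
```

Note the normalization: the cut fraction is (d/|E|)·f_G(x) = (2/n)·f_G(x), matching the d/|E| factor on the board.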
because in all the settings of this kind that we know, the algorithms that give us one also give us the other. So I'll not make a big distinction between these two things. So, what do we know? OK — why do we care about max cut or expansion? There are tons of papers written about these problems. People have gotten tenure on the basis of these problems, so that's obviously some reason to care about them. And there are also real applications — I probably can't, but someone probably can, give you applications of these problems, like designing VLSI circuits, or communication networks, or, I don't know, discovering patterns in biological networks, et cetera. And these are true; there are applications. But more than these particular applications, these are very canonical problems that seem to really crystallize some basic algorithmic questions. And often, algorithms that were developed for these problems turned out to be extremely useful elsewhere, and likewise techniques that were developed for these problems turned out to be very useful elsewhere. So these problems really serve as a kind of clean algorithmic testbed to develop better algorithmic understanding and better techniques that can be used elsewhere. And by the way, for applications of not exactly these problems, but of techniques that are very related to what we talk about: I was watching the talk of Emmanuel Candès at the last International Congress of Mathematicians, where he talked about problems like PhaseLift and finding low-rank plus sparse decompositions — some things Pablo was also involved in — and how they can be used for a lot of applications, including, I think, some actually truly life-saving applications, right?
Like being able to do an MRI on children that you would not be able to do otherwise, because before you had these techniques, you had to actually make them stop breathing, otherwise you couldn't do the MRI. So yeah, there are tons of applications, but more than that, it's a very clean problem; it helps us learn. OK. So what do we know about max cut? The max cut of a graph is always a number between 0 and 1. And it's 1 if and only if the graph is bipartite, right? A cut that cuts all the edges demonstrates that the graph is bipartite. And a random set will cut about half the edges. In a random graph, all sets are typical, so the max cut will be like half plus o(1). This o(1) is actually a function of the degree: as the degree goes to infinity, it's like 1/√d or something like that. So one way to think about the max cut question is that you're trying, for example, to distinguish whether a graph is random or almost bipartite. And a priori, that doesn't seem like a super easy thing to do, because think of the following. You can take some vertices and put a random graph on them. Or you can take some vertices, color some random subset of them red and the rest green, and now put down a random d-regular graph where, with all but an ε fraction of the edges going from red to green, only an ε fraction of the edges stay inside a color class. Now, if you don't know this coloring and you're just looking locally at every small neighborhood of this graph, this graph and that graph will look the same. In some sense, you could imagine that first you chose the edges, and then you chose the coloring. And you won't really run into contradictions until you go pretty far out.
So it's not so clear whether you could distinguish between these two cases — between an almost bipartite graph and a random graph. And indeed, some simple combinatorial algorithms, and some linear programs, cannot distinguish between these two cases. So, Erdős, in I think 1967, showed basically a half-approximation algorithm for max cut, which basically just relied on the random-set observation. But this was one of the first times that someone ever thought of the notion of an approximation algorithm and analyzed one. And for a long time, this was basically the best we knew how to do in the worst case, until Goemans and Williamson in 1994 showed a 0.878-approximation. And again, the point is not really that 0.878 is much bigger than 0.5. The point is that they discovered a new technique, and that technique has found an immense number of uses. That technique is basically semidefinite programming — degree-2 sum of squares. And for expansion: the expansion of a graph, the way I normalized it, is between 0 and 2. The interesting part in this case is not really the 2, it's the 0: the expansion is 0 if and only if the graph is disconnected — then you have a set that has no edges going out. A random set gives about 1, the way I scaled things, and in a random graph every set gives about 1 ± o(1). And again, you can think of the same kind of examples: you can take a random graph and compare it with a red-and-green-colored graph where now most edges try to stay inside the same color. And again, it's not clear how to distinguish. Here — and now we understand the connection between these two problems better than we did at the time — in the 1980s, several people, Alon–Milman and Dodziuk, showed a discrete analog of what's known as Cheeger's inequality. And it gives you the following kind of approximation.
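As a sanity check on the discrete Cheeger inequality, here is a quick numerical sketch on the cycle. Note I use the standard conductance normalization λ₂/2 ≤ φ(G) ≤ √(2λ₂), which differs by constant factors from the normalization on the board; the graph and its size are illustrative choices:

```python
import numpy as np

# Discrete Cheeger inequality check on the n-cycle (d = 2), using the
# standard conductance normalization
#   phi(S) = |E(S, S-complement)| / (d * min(|S|, n - |S|)).
# Cheeger:  lambda_2 / 2  <=  phi(G)  <=  sqrt(2 * lambda_2),
# where lambda_2 is the second eigenvalue of L = I - A/d.

n, d = 40, 2
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
L = np.eye(n) - A / d
lam2 = np.sort(np.linalg.eigvalsh(L))[1]    # second-smallest eigenvalue

# For a cycle, the sparsest cuts are contiguous arcs S = {0, ..., k-1};
# any arc has exactly 2 boundary edges.
def phi_of_arc(k):
    return 2.0 / (d * min(k, n - k))

phi = min(phi_of_arc(k) for k in range(1, n))

print(lam2 / 2 <= phi + 1e-12)              # True: the easy direction
print(phi <= np.sqrt(2 * lam2) + 1e-12)     # True: the "square root" direction
```

The √ in the upper bound is the same square-root loss that shows up in the ε versus √ε approximation discussed next.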
The way we typically think of this guarantee: it gives you an ε versus O(√ε) approximation. So if φ(G) is at most ε, you can certify that it is at most O(√ε). And let me state — you've seen it in the lecture notes, but let me state the results slightly differently — what we know about these problems. It's a good exercise to see how to phrase every such result both in the language of SOS proofs and in the language of pseudo-distributions. So the theorem of Goemans–Williamson you can phrase in two ways. The SOS-certificate version is: if maxcut(G) is at most α, then we can prove with degree-2 SOS that this polynomial f_G, with the right normalization, whatever it was — d/|E| — is at most α/0.878. (Maybe x is a bad name for the value; let me call it α, because I typically think of x as the variables.) So this gives you an algorithm that computes the value up to this constant: you look at the best upper bound you can certify, and you know it is at most a factor of 0.878 off. And there is another way to phrase the same thing: if there exists a degree-2 pseudo-distribution μ such that the pseudo-expectation Ẽ_μ[(d/|E|)·f_G] is at least γ, then there exists some set S such that cut_G(S) is at least 0.878·γ. So this is basically the same thing — can everyone see why it's the same thing? And then you can also think of other regimes for these results, which I sometimes like because then I have calculations to do. You can say: if the max cut is at most 1 − ε, then I can certify with degree 2 a bound of 1 − Ω(ε²).
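The Goemans–Williamson constant quoted above comes from comparing, for unit vectors at angle θ, the probability θ/π that hyperplane rounding cuts the edge with the SDP contribution (1 − cos θ)/2; the worst-case ratio over θ is the constant. A minimal numerical sketch:

```python
import numpy as np

# The Goemans-Williamson constant:
#   alpha_GW = min over 0 < theta <= pi of (theta / pi) / ((1 - cos(theta)) / 2),
# the worst-case ratio between the rounding's cut probability and the SDP value.

theta = np.linspace(1e-6, np.pi, 2_000_000)
ratio = (theta / np.pi) / ((1.0 - np.cos(theta)) / 2.0)
alpha = ratio.min()
print(round(alpha, 4))   # 0.8786
```

The minimum is attained around θ ≈ 2.33 radians, i.e. for moderately anti-correlated vectors, not at the extremes.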
And another way to say it: if the pseudo-expectation value is 1 − δ, I can find a cut of value 1 − O(√δ). So in particular, this is enough to distinguish between graphs that are almost bipartite and graphs that are random. For expansion, we get basically the same thing. So, the theorem of Alon–Milman (and Dodziuk): one way to write it is that if φ(G) is at least, say, ε, then we can prove with a degree-2 SOS proof that φ(G) is at least Ω(ε²) — and we'll talk about how you prove that exactly, because φ_G is a rational function and not a polynomial, but you can still do it. And another way to say it is: if there exists a degree-2 pseudo-distribution μ under which φ_G(x) is at most δ — again, this is a rational function, so we'll have to do a little something here, but it works — then there exists some S such that φ_G(S) is at most O(√δ). So again, these two statements are equivalent. So maybe you can see the similarities between these two problems: basically, when you look at them in the right way, the techniques are very similar. And these theorems capture the best efficient algorithms we know for these problems. And improving them — I didn't even state the exact constants here — with any polynomial-time algorithm, or even a somewhat sub-exponential-time algorithm that does better, would refute either Khot's Unique Games Conjecture or the related Raghavendra–Steurer small-set expansion hypothesis. So there are concrete conjectures that imply these are the best possible. In particular, if these conjectures are true, then if I put degree 100 here instead of degree 2, it wouldn't change any of these theorems. We don't know how to prove that, and we don't know if it's true — we don't really have very strong evidence that it's true — but it's a very interesting question. OK, so: other questions about the statements of these things?
Where does the 1 − ε versus 1 − ε² in the GW result come from? Well, I'll show you that proof. There is actually an exercise about it in the lecture notes, and that is actually the only proof I'll show: you have the proofs in the lecture notes and I'm not going to go over all of them, but the one thing I am going to show is where this comes from. That's what we'll do after the break. OK, so let's take a six-minute break. Ah, by the way — I mentioned this on Piazza, but Friday, part of the audience is celebrating an MIT holiday. What is this MIT holiday? Does it have a name? Because the MIT students will be busy howling at the moon or whatever they do on their holidays, we'll cancel the class. We don't want to be intolerant of religious minorities. We'll have the make-up class on Monday, from 5 until 8 PM. And we'll have it at Harvard, because the first-year grad students have some mandatory discussion on Monday, 4 to 5, right? So we'll have it 5 to 8 — Monday seemed to be the most popular option in the survey I sent. And by the end of this weekend, I should be able to send you at least the beginning of the notes for the next lecture. In the next lecture, we'll talk about how to prove negative results for sum of squares.
I have a plan which might be a little bit ambitious, but we'll see: to cover integrality gap results showing that these results are tight — showing, say, for example, that this theorem as stated, with this constant and everything, is tight — integrality gap results for constraint satisfaction problems, integrality gap results for the planted clique problem, and the connection between computational hardness results and integrality gaps. That includes how we take hardness results and turn them into integrality gaps, and a very fascinating result of Raghavendra that works in the other direction — it takes an integrality gap and turns it into a hardness result — all based on related concepts. But anyway, OK, that's for next week. Let's now try to prove something. Again, I'm allowing myself to be a little bit informal in the proofs, because everything is written: the full proofs are always available in the lecture notes. But I want to give the flavor of how the proof looks, and also talk about where this square root comes from, which seems to be a recurring phenomenon. So I want to prove this theorem, in particular: I'm given a degree-2 pseudo-distribution which gives me a cut value of 1 − δ, and I want to prove that there exists an actual cut — or, let's say, what I'll actually produce is a distribution over cuts — that cuts on average a 1 − O(√δ) fraction of the edges. So, the proof. First of all, what are we given? We're given a pseudo-distribution μ such that the average over the edges of Ẽ_μ[(x_i − x_j)²] is 1 − δ. Right: think of x_i as 1 if vertex i is on one side of the cut and 0 if it's on the other side. Then (x_i − x_j)² is 1 if the edge is cut and 0 if it's not, and on average over the edges, we cut a 1 − δ fraction.
This is the pseudo-distribution we are given. OK. Now, if we had an actual distribution, we could always modify it by first flipping a coin and deciding whether to swap the roles of 0 and 1 — it doesn't matter for the cut. And the same is true for pseudo-distributions, so we can assume without loss of generality that Ẽ[x_i] = 1/2 for every i: each marginal is 0 with probability half and 1 with probability half. Now, if you have a 0/1 random variable with expected value half, what is its standard deviation? It's half, right? Because it's always exactly half away from the mean: the mean is half, and the variable is either 0 or 1. So the standard deviation is half, and this also implies that Ẽ[(x_i − 1/2)²] = 1/4: the variance is a quarter. Now, you can say: well, I told you this about an actual random variable — how do we know it also holds for pseudo-distributions? But (x_i − 1/2)² is a square — do the calculation and check — so, another way to say it, we appeal to Marley's corollary that I mentioned in the previous lecture: I didn't use the probabilistic method, I didn't say the word "Chernoff", so everything should be all right. So we have this property. Now, by the quadratic sampling lemma, we can find jointly Gaussian random variables X_1, ..., X_n such that, first of all, each X_i is distributed as a normal with expectation 1/2 and variance 1/4 — so it matches these moments. Yes? — Aren't we missing the d/|E| factor? On the left side, in the box... — I think f_G has the 1/d factor inside. But basically, if you think about it, this average over edges is the probability that you cut an average edge; f_G was 1/d times this sum.
So if you have a 0/1 vector x that gives you a cut of a certain value, then the vector 1 − x, where you flip zeros to ones, gives you the same value. So if you have one distribution, you can modify it to a distribution where the expected value is half by flipping a coin and inverting x. So you can always make this assumption, because it's just about the marginals. So basically, what the quadratic sampling lemma tells us is that we can find Gaussian random variables that match the first and second moments. So first of all, each individual X_i is distributed like a normal with mean half and variance a quarter. And for every edge (i, j), if Ẽ_μ[(x_i − x_j)²] = 1 − δ_ij, then this should also be true for the Gaussians: E[(X_i − X_j)²] = 1 − δ_ij. So basically what it means is that these guys are very anti-correlated. Think of the δ_ij as averaging to δ over the edges. So these Gaussians are very anti-correlated. By some calculation, you can open this up, and you get that the covariance of these guys, E[(X_i − 1/2)(X_j − 1/2)], is 1/4 − (1 − δ_ij)/2, which is −1/4 + δ_ij/2. And the way we should really think about it is after normalizing: dividing by the product of the two standard deviations, which is 1/4, gives the correlation. So if they were perfectly anti-correlated, δ_ij would be 0: that means that whenever one guy is 1, the other guy is 0 and vice versa. But with covariance −1/4 + δ_ij/2 for small δ_ij, they're still pretty anti-correlated.
And now let's just define y_i to be the normalized Gaussian: y_i is (X_i − 1/2) divided by 1/2. So this is distributed like N(0, 1): we subtracted the mean and divided by the standard deviation, so we get a normalized random variable. And what the covariance computation says is that E[y_i y_j] = −1 + 2δ_ij. So these guys are very anti-correlated. So now it's basically a matter of defining the set S: we take i into S if X_i is larger than half, and out of S if X_i is smaller than half. Or, another way to say it: if y_i is positive, we take i to be in S, and if y_i is negative, we take i to be outside S. So the probability that we cut the edge (i, j) is basically the probability that the signs don't agree: it's 1 minus the probability that y_i has the same sign as y_j. So we have two standard Gaussian random variables, we know their correlation, we know they are fairly anti-correlated, and we want to calculate the probability that their signs disagree. At this point it's a calculation, but there are several ways to see it. One way is to think of it geometrically: two guys with correlation −(1 − δ), up to a factor 2 in δ that we won't care about, correspond to vectors at angle θ, where cos θ = −(1 − δ). Yes? No, the anti-correlation just comes out of this calculation. I'm basically defining δ_ij: δ_ij is defined to satisfy this equation. And we know that the average over all edges of δ_ij is δ. So I know that on average it's anti-correlated; maybe for some edges δ_ij could be very big. So far it's intuition; nothing really assumes that δ is small. I just use it for the picture, but nothing is lost here.
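Before the geometric argument, here is a small numerical sanity check of the square-root behavior it will give (a sketch, not part of the lecture): for standard Gaussians with correlation −(1 − δ), the probability that the signs agree is arccos(1 − δ)/π, and arccos(1 − δ) behaves like √(2δ) for small δ.

```python
import math

for delta in [1e-4, 1e-2]:
    # P[sign(y_i) = sign(y_j)] when corr(y_i, y_j) = -(1 - delta)
    p_agree = math.acos(1 - delta) / math.pi
    # arccos(1 - delta) ~ sqrt(2*delta), so p_agree / sqrt(delta) -> sqrt(2)/pi
    assert abs(p_agree / math.sqrt(delta) - math.sqrt(2) / math.pi) < 0.01
```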
So you can think of it as two vectors at angle θ, where cos θ = −(1 − δ_ij), that is, θ = π − arccos(1 − δ_ij). And the basic fact, which is really just the geometry of sin² + cos² = 1, or cos θ ≈ 1 − θ²/2, is that arccos(1 − δ) is of order √δ, something like √(2δ). So the angle deficit from π is about √δ_ij, and the probability that we cut, which is θ/π, comes out to 1 − O(√δ_ij). It's some kind of calculation, but that's the way to think of it. Is the convexity going to be applied to say that on average? Yes, at the end we'll use convexity; we haven't used it yet. At the end, when we want to move from δ_ij to δ, we'll use convexity. But at this point, I just want to say: all the deltas I've written are actually δ_ij. Yes, yes, yes. So at this point I just want to say that the probability we cut is 1 − O(√δ_ij), OK? And that's δ_ij, which is not necessarily small. Yes, for a single edge δ_ij could be large, and then the statement is trivial: it would say that the probability we cut is at least some negative number, which, yeah, of course, is not that impressive. And let me also, because it's sometimes useful to see something both visually and arithmetically, give another way to think about guys with this correlation. It basically means that you select two Gaussians, y and y′, where y′ is −(1 − δ_ij) y plus √δ_ij, again up to some factor, maybe 2 or something like that, times an independent Gaussian; let's call it z.
So how do I select two Gaussians that are −(1 − δ_ij) correlated? I select the first one completely at random, and I set the second one to be −(1 − δ_ij) times the first, to get this correlation, plus a completely independent Gaussian z, scaled by a factor so that the total variance is 1. For the variance to be 1, since the square of 1 − δ_ij is basically 1 − 2δ_ij, I need the factor on z to be about √(2δ_ij). OK, and now we can ask: what's the probability that the signs agree? For the signs to agree, we definitely need one event to happen: the z part has to beat the y part. That is, we need |y| to be smaller than about √(2δ_ij) |z|, because if the z part is smaller than the y part, then the sign of y′ is determined by the −(1 − δ_ij) y part, and the signs disagree. Let's say δ_ij is small, so the 1 − δ_ij factor isn't even important. So the signs can agree only if y is smaller in magnitude than z times √(2δ_ij). Agreed? Otherwise the signs will disagree: y and y′ will have opposite signs. So we take two standard normals, y and z, and we ask: what's the probability that |y| < √(2δ_ij) |z|? There are two ways this can happen. Either z is going to be abnormally big, or y is going to be abnormally small. Think of √(2δ_ij) as, I don't know, 1/100: what's the probability that |y| is going to be smaller than |z|/100?
So there are two possibilities. Either z is abnormally big, which is extremely unlikely: for z to be like 100 standard deviations too big, the probability is something like 2 to the minus 10,000. The other possibility is that y is too small, and that actually can happen. If you look at a standard Gaussian, within about one standard deviation of the mean the density is basically flat. So z will typically be of order 1, and then the probability that |y| is less than 1/100 is about 1/100. So we can upper bound the probability of this event (we could also lower bound it, but we care about the upper bound): the probability that sign(y) equals sign(y′) is at most O(√δ_ij). This is the same calculation; I've just re-stated it in a different way, which I think is sometimes useful. And then at the end we use concavity. The fraction of uncut edges is at most (1/|E|) Σ over edges (i,j) of O(√δ_ij), and by concavity of the square root, this is at most O(√((1/|E|) Σ δ_ij)), which is O(√δ). So this is basically the proof. This is the last chance to stop me: I'm going to write the QED, so if you have any objections, speak now or forever hold your peace. So we declare this theorem proved. And now I would say, OK, there is something interesting here. There is a reason why, even though it's written up, I showed you these calculations: this square root thing is something that pops up again and again and again.
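The concavity step at the end is just Jensen's inequality for the square root; a one-line check with hypothetical per-edge values δ_ij:

```python
import math

deltas = [0.01, 0.04, 0.25]  # hypothetical delta_ij values for three edges
avg_of_sqrt = sum(math.sqrt(d) for d in deltas) / len(deltas)
sqrt_of_avg = math.sqrt(sum(deltas) / len(deltas))
# concavity of sqrt: the average of square roots is at most the square root of the average
assert avg_of_sqrt <= sqrt_of_avg
```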
And it also appears in Cheeger's inequality, which I don't know if we'll cover, maybe we'll see; I mentioned it a little bit. So where does this square root phenomenon come from? One way to say it is that we suffer this square root because we took these Gaussian random variables and wanted to transform them into Bernoulli random variables: we wanted to take a Gaussian and round it to a 0/1 random variable. And there is a sense in which it's related to another situation: you have an ε-biased coin, and you need about 1/ε² samples to detect the bias; the same kind of phenomenon appears there. It's also related to the difference between KL divergence and L1 distance (L2 also). If you haven't seen these topics, we'll get to them later. And again, we saw it in the geometry: cosine of ε is roughly 1 − ε², and so on. These are the kinds of places where you suffer a loss of a square root factor that you don't really want to lose. It's interesting that sometimes the exact same phenomenon is used for positive results: Grover's search in quantum computing uses exactly the same kind of geometric phenomenon to gain a factor of square root, and this is also related to Chebyshev polynomials, which again can save a factor of square root. So this square root appears again and again. And one way to phrase the Unique Games Conjecture is that it basically tells you this square root loss is inherent, if you want an efficient algorithm.
If you allow an inefficient algorithm, you could do better, but basically the conjecture tells you that for efficient algorithms the square root loss is inherent. And maybe it also tells you why it would be very interesting to refute the Unique Games Conjecture: if you can beat this square root in one situation, maybe you can beat it in others as well, and it seems to be something we run into again, and again, and again. So yes? Are these all the same square root phenomenon? Yes. Is it possible to draw an actual equivalence between them? I mean, in some sense, yes. We've definitely seen in this proof that the geometric fact and the rounding loss are the same thing: this came from here. And if you look at the analysis of Grover's search, you see the same kind of geometric thing. The Gaussian-versus-Bernoulli issue is related to the sampling question: you have an ε-biased coin and you get many samples from it, and the reason you need 1/ε² samples is that even though the L1 distance between the two coins is ε, the KL divergence is ε², and KL is what controls what happens when you get many samples, not L1. So you have this square root loss between KL and L1. And what we are doing here is really trying to round a Gaussian to a Bernoulli, and again we suffer this kind of square root. But there is also something here that's very specific to the degree-2 setting, because Gaussians and Bernoullis agree with each other in the first moment and the second moment.
But then they go their separate ways, and we can ask whether, if instead of d = 2 we use d = 4, d = 10, d = 100, maybe we can improve this square root into, I don't know, a cube root or anything like that. And basically the answer is that we don't know, but I think it's super interesting. And these things are not just the first hits you get if you do Google Scholar for "square root": there are actual technical relations between them, a sense in which these things are related to each other. So yeah, with Cheeger we get the same square root phenomenon. Maybe I'll even give a spoiler that I meant to talk about next week, but let me just mention it. This square root is inherent for both Cheeger and this max-cut statement, and there is a super simple example showing it: the cycle, or in the case of max-cut, the odd cycle. So look at this graph, the n-cycle: if n is even, this is a bipartite graph; if n is odd, it is not bipartite. If it's not bipartite, you cannot cut all the edges; you have to miss at least one. So the max-cut of G, if it's an odd cycle, is at most 1 − 1/n: there is an edge you have to miss. But it turns out that there exists a pseudo-distribution μ, a slightly weird pseudo-distribution, such that Ẽ_μ[cut_G(x)] is 1 − O(1/n²). Which is a little bit weird, right? It's a graph with only n edges, and so in expectation the pseudo-distribution misses only about a 1/n fraction of a single edge. But this is the difference between a pseudo-distribution and an actual distribution.
And it's a good exercise to come up with this example. This example is also a bad example for Cheeger. It's easy to see that the expansion of this graph is at least about 1/n: if you want to minimize expansion, the worst case is to take an interval of about n/2 vertices, and then it has two boundary edges going out, which gives expansion O(1/n). So the expansion is on the order of 1/n, but the best certificate, which really comes from the eigenvalues, is only about 1/n². For this particular graph, and it's a good exercise, you can completely diagonalize it, get the whole spectrum, and do these calculations yourself. You'll see that for the adjacency matrix, the second largest eigenvalue is d(1 − O(1/n²)), and the smallest eigenvalue is −d(1 − O(1/n²)), where d = 2. And from the corresponding eigenvector, because it's a symmetric graph and you can rotate things around, you can get a pseudo-distribution. This will also be in the lecture notes for the next lecture, but it's a useful exercise, and it shows that, at least as stated, these theorems are tight. It doesn't show the right constant: this example loses in the O notation. The examples showing the right constants, which in particular give you this 0.878, are more involved, but you can think of them a little bit as higher-dimensional generalizations of this odd cycle. So if you really understand the odd cycle, then in some sense you understand what's going on.
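Since the cycle diagonalizes explicitly (the adjacency eigenvalues are 2cos(2πk/n)), the claim about the extreme eigenvalues can be checked directly; a sketch with a hypothetical odd n:

```python
import math

n = 101  # odd cycle; degree d = 2
eigs = [2 * math.cos(2 * math.pi * k / n) for k in range(n)]
lam_min = min(eigs)
# for odd n the smallest adjacency eigenvalue is -2*cos(pi/n) = -d*(1 - O(1/n^2))
assert abs(lam_min + 2 * math.cos(math.pi / n)) < 1e-9
assert 2 + lam_min < 20 / n ** 2  # only Theta(1/n^2) away from -d
```

Compare: the true max-cut deficit of the odd cycle is 1/n, but the spectral quantity that certifies it is only Θ(1/n²) away from its extreme value, which is exactly the square-root gap.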
But the odd cycle is also kind of a tempting reason to maybe doubt that this square root loss is inherent, because it's clearly not a very hard instance: you look at it and you say, well, it's not so hard for me to tell that this graph is not bipartite. But there are more interesting examples, and the question of whether this square root loss is inherent is mostly open. Would you say it implies that for all worst-case problems? No: for max-cut, the Unique Games Conjecture implies it, and the small-set expansion hypothesis implies it for Cheeger. For Grothendieck, we have a very similar algorithm; it's not exactly a square root there, and maybe if I dug enough I could find a square root hiding there, but the way it's typically described, there is some constant. But the conjecture also implies that that constant is inherently the best. In many situations like that, there is also some other version of max-cut where there is a kind of square root in comparing value half plus ε versus half plus something. Wait, so you're talking about algorithmically the best, right? Algorithmically the best, yes, yes. Most of the recent things show that the constant in Grothendieck's inequality itself is not tight, but that's not algorithmic. So, the interesting thing, which I'll talk about after the break, is that the UGC implies that there is no algorithm that does better than the Grothendieck constant, without telling you what the constant is. We'll talk about it today and maybe also next week, where you can see how it is possible to prove such a statement. You're not talking about the specific constant that occurs here? I mean, for this question, max-cut, you do, because here we do know the constant.
So here we do know that the Unique Games Conjecture predicts that you cannot do better than 0.878..., and if you calculate really precisely what the constant in the O is, whatever it is, and people have done it, this is tight. So basically, the Unique Games Conjecture, or the small-set expansion hypothesis (Unique Games is typically used for max-cut and small-set expansion for Cheeger), both of them basically tell you that these algorithms are the best, and if these conjectures are true, there is not going to be any polynomial-time algorithm that does better. In particular, they predict that if you replace this degree 2 here with a hundred, it's not going to change any of the constants. Whether that's true or not, we don't know. And there will be an exercise, by the way, in the next lecture, which you can already start: this is definitely false for the odd cycle. So this statement is true: there is a degree-2 pseudo-distribution that has this property. If instead of degree two you move to degree six, then it is no longer true: with degree-6 pseudo-distributions you cannot get anything better than the right number for the odd cycle. So definitely, if there is a counterexample, it will have to go beyond the odd cycle, which might not sound so impressive, because there are some other graphs besides the odd cycle. But if you have your favorite family of graphs, where the first graph in the family is the odd cycle, try to see what the second and third ones in the sequence do; you can try to see if they show anything interesting here. OK, so I think what I'm going to do is take a break rather than talk about Cheeger, and go to Grothendieck after the break.
But if you haven't seen Cheeger's inequality, you should definitely see it; part of the reason I'm skipping it is that Cheeger is something you can find in many places. Let me just state it. Cheeger's inequality: I phrased it here as a statement about pseudo-distributions, et cetera, but the way it's typically phrased is the following. If λ is the second smallest eigenvalue of L_G, which is the same as saying that d − dλ is the second largest eigenvalue of the adjacency matrix A, because L_G is just I − (1/d)A, then φ, the expansion of the graph, is at most O(√λ). And it's also easy to show that the expansion of the graph is always at least λ, and this is actually by a kind of sum-of-squares proof. If φ_G weren't a rational function, if it were a polynomial, what I would write is that for every x, φ_G(x) equals λ plus a sum of squares of linear functions, and this is basically true. When you talk about rational functions and you want to talk about sum-of-squares proofs: if you want to prove that P/Q is always at least λ, this is of course the same as saying that P − λQ is always non-negative. So a sum-of-squares proof that P/Q is at least λ is basically a sum-of-squares proof that P − λQ is non-negative. And it's easy to show that the second smallest eigenvalue gives you this; if we write φ_G as a rational function, I just recall that P(x) was basically xᵀL_G x divided by |x|(n − |x|). So, yes? The question was that this wasn't meant to give us a way to talk about the value of the rational function under one pseudo-distribution.
But rather to make a statement about sets of pseudo-distributions that are consistent with the rational function, or something. No, what you want to prove is the following theorem. So first of all, here is an exercise. Recall that φ_G(x) is proportional to xᵀL_G x divided by |x|(n − |x|), where |x| is the sum of the x_i, which is also the sum of the x_i², because x_i² = x_i here. And what we basically want to show, so one exercise, is the following. If λ is the second smallest eigenvalue of L_G (and by the way, the smallest eigenvalue always corresponds to the all-ones vector: the all-ones vector transposed times L_G times the all-ones vector is zero), then the second smallest eigenvalue is the minimum over x orthogonal to the all-ones vector of xᵀL_G x divided by ‖x‖². And notice that ‖x‖² = Σ x_i² = Σ x_i = |x| when x is in {0,1}ⁿ. So basically, you can show that it's always the case that xᵀL_G x is at least λ times, whatever the normalization factor is for this to make sense, times |x|(n − |x|). So you have a degree-2 SOS proof that the difference is non-negative, and this demonstrates that the expansion is lower bounded by λ. And Cheeger's theorem, the direction that is harder to prove, though the proof is somewhat along the lines of what we have seen, is that if there exists some pseudo-distribution μ with small value (typically this is phrased with an actual vector, but that turns out to be the same as "if there exists")...
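The easy direction just stated can be sanity-checked by brute force on a small cycle (a sketch with a hypothetical n, using the combinatorial Laplacian L = 2I − A, whose second smallest eigenvalue is 2 − 2cos(2π/n)): for every 0/1 vector x, xᵀLx ≥ μ₂ · |x|(n − |x|)/n.

```python
import itertools
import math

n = 7  # small odd cycle C_n
mu2 = 2 - 2 * math.cos(2 * math.pi / n)  # second smallest eigenvalue of L = 2I - A

def quad_form(x):
    # x^T L x equals the sum over edges of (x_i - x_j)^2
    return sum((x[i] - x[(i + 1) % n]) ** 2 for i in range(n))

# the spectral lower bound, checked over the whole cube {0,1}^n
for x in itertools.product([0, 1], repeat=n):
    k = sum(x)
    assert quad_form(x) + 1e-9 >= mu2 * k * (n - k) / n
```

The inequality follows by writing x as its projection onto the all-ones vector plus an orthogonal part, whose squared norm is exactly k(n − k)/n; the brute-force loop just confirms it.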
What it basically says is that there always exists a set S such that φ(S) is at most O(√λ), and it's kind of a constructive thing: you take the vector x that minimizes this quantity and you transform it into a set, and you lose a square root in the process. And you can think of this vector as giving you a pseudo-distribution as well. So Cheeger's theorem is basically a rounding algorithm, like the one we saw for max-cut, that takes a pseudo-distribution, or a vector, and produces a set that certifies that the expansion is not too big. So you should look at the proof of Cheeger if you haven't seen it already. We don't have all of the proof in the notes right now; eventually we'll probably add it, but we do show how the typical way you phrase Cheeger can be phrased in the pseudo-distribution language, and we give pointers to some nice lecture notes of Luca Trevisan. They do have the proof; it's not a complicated proof, and if you haven't seen it already, you should see it. By the way, if you want to know more about expansion, expanders, et cetera, there is a very, very nice monograph by Hoory, Linial, and Wigderson that talks about lots of things related to how you construct expanders and how you use expansion in several applications. And then there are the more recent results on interlacing families of characteristic polynomials, for the constructions of Ramanujan graphs, that are also very much worth looking into. You have a result in this area, right? And the result is that the polynomials that let you construct Ramanujan graphs are actually easy to compute. Ah, that's very good. So anyway, let's take another break, and when we come back we'll talk about the Grothendieck inequality. Let's take a nine minute break, until 11:51. So, Grothendieck was a very, very interesting character.
I wrote in my notes, and maybe some of you know more about him, that he was one of the greatest mathematicians of the 20th century: really the one who pushed towards greater abstraction, towards showing that sometimes you can solve hard problems by showing that they are special cases of even harder problems. He has had a lot of influence on mathematics. And one of his remarkable papers was this 1953 paper known as the Résumé, where he showed this inequality that has later found tons of uses in a whole lot of fields. Most of them I know nothing about, but there is a survey that I hope some of you will read and then tell me and teach me, because it does seem very interesting. There are various ways to phrase Grothendieck's inequality, but of course the right way to phrase it is using pseudo-distributions. You can write it as follows. For every matrix A, it can be generally m by n, but let's just say n by n to save a letter, you look at the maximum over plus-minus-one vectors x and y of yᵀAx (or xᵀAy, it doesn't matter). We compare this to the maximum of the pseudo-expectation of the same thing, Ẽ_μ[yᵀAx], where μ is a degree-2 pseudo-distribution over x, y in {±1}ⁿ. Clearly the pseudo-distribution side is more general: we could also write the first quantity as a maximum over actual distributions, it would not matter, and degree-2 pseudo-distributions are more general than actual distributions. So clearly the first quantity is at most the second; that's the trivial direction. The heart of Grothendieck's inequality is that, up to a constant, this is also a lower bound: the pseudo-distribution value is at most the plus-minus-one value times some number called K_G. What we know about K_G is that, I think, it's at least 1.67 and at most, I don't remember, 1.78 or something like that.
Probably I got a digit wrong, but it's definitely between 1.6 and 1.8, and ever since Grothendieck posed this, finding the actual number has been an open problem. Krivine gave a very nice proof with a clean upper bound, and we are basically going to reproduce this proof today. He conjectured that his upper bound is tight, but in a fairly recent work, by Braverman, Makarychev, Makarychev, and Naor (I hope I'm not missing anyone), in 2011, they showed that Krivine's bound is not tight, and they gave a somewhat better upper bound on this constant; but we still don't know exactly what it is. But basically, this inequality has found a really large number of applications. One application that people have used in algorithms is related to the cut norm of a matrix: you have, say, a plus-minus-one matrix, and you're trying to find the sets S and T that maximize the discrepancy; that is, if you sum up all the entries in the S-by-T submatrix, you maximize the difference between the pluses and the minuses. And basically, this result gives you a constant-factor approximation for this discrepancy, and that has been useful in several algorithmic applications. One thing I should say is that I subtly introduced some new notation: pseudo-distributions over plus-minus one instead of zero-one. You can transform one to the other by a linear transformation, so it really doesn't matter, and you can define pseudo-distributions over plus-minus one in the same way as we did for zero-one. The only thing is that now, when you want to move from a general polynomial to a multilinear one, instead of reducing using the rule x_i² = x_i, you use x_i² = 1. So this doesn't really matter. OK, so Grothendieck's inequality is very cool.
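To make the cut-norm connection concrete, here is a brute-force sketch (a small random matrix; everything here is illustrative) of the standard fact that the 0/1 "cut" maximum and the ±1 bilinear maximum are within a factor of 4 of each other, so approximating one approximates the other up to constants:

```python
import itertools
import random

random.seed(0)
m = n = 4
A = [[random.choice([-1, 1]) for _ in range(n)] for _ in range(m)]

def bilinear(u, v):
    return sum(A[i][j] * u[i] * v[j] for i in range(m) for j in range(n))

pm = [-1, 1]
# the quantity in Grothendieck's inequality: max over +/-1 vectors of y^T A x
inf_to_1 = max(bilinear(y, x)
               for y in itertools.product(pm, repeat=m)
               for x in itertools.product(pm, repeat=n))
# the cut norm: max discrepancy over 0/1 indicator vectors of sets S, T
cut = max(abs(bilinear(s, t))
          for s in itertools.product([0, 1], repeat=m)
          for t in itertools.product([0, 1], repeat=n))
assert cut <= inf_to_1 <= 4 * cut
```

Both inequalities follow by the substitutions s = (1 + y)/2 and y = 2s − 1, which expand each objective into at most four terms of the other kind.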
Let's take it on faith, and let's now talk about the proof. So, here's the general idea. One idea is to simply say: well, this looks a little bit like max-cut, it doesn't look so different, so maybe we can just use the proof of max-cut to prove this thing. So what we could do is say: OK, (1) using the quadratic sampling lemma, we can get Gaussians X₁, ..., Xₙ, Y₁, ..., Yₙ such that E[X_i Y_j] equals the pseudo-expectation Ẽ[x_i y_j]. Then, (2) basically like in max-cut, we can show that if we look at E[sign(X_i) sign(Y_j)], this will be, up to a constant, Ẽ[x_i y_j]. And then (3) we can hope that this means that Σ_ij a_ij E[sign(X_i) sign(Y_j)] will be, up to a constant, by linearity, Σ_ij a_ij Ẽ[x_i y_j]. And that would demonstrate the inequality, right? We take the pseudo-distribution that achieves the relaxed value, and we find actual plus-minus-one vectors that achieve it up to a constant. But this doesn't work: steps one and two are fine, but three doesn't work. Do you have a sense of why three doesn't work? Exactly. The point is that maybe we can prove that each term is, say, between half and two times the corresponding term; but the constant could be different for different i and j. And the point is that because you have these negative coefficients, you could have these weird cancellations: the constants could conspire against you, and when you sum up these things, say you have these n² terms...
One way to think about it: because some of the a_{ij} are negative, it could be that the pseudo-distribution has a slight bias, where Ẽ[x_i y_j] tends to be, say, a factor 1.1 bigger when a_{ij} is positive than when a_{ij} is negative. If it has this bias, the sum is very large. But maybe the rounding conspires against you, and the constant-factor approximation exactly kills this bias, and now you basically get zero: the rounded sum could vanish because you destroyed the bias toward the positive terms. So this makes our life somewhat complicated. What we want is to get Gaussians where we don't mind losing a constant, but we lose the same constant everywhere: we want E[sign(X_i) sign(Y_j)] to be exactly equal to some constant c times Ẽ[x_i y_j], with the same c for every i and j. If we have that equality, then we are good. And we no longer need the Gaussians to match the pseudo-moments exactly; we never really cared about that part, it was just a tool. So we probably have to use a somewhat different lemma: we want to find Gaussians satisfying this property. The idea is the following. We look back at the analysis from the previous lecture, and imagine I gave you the full analysis, the one written in the lecture notes that gave the 0.878 bound for MaxCut. In some sense, what that analysis said was the following: it gave a formula for the probability that sign(X_i) equals sign(Y_j).
Or, if you prefer, we can write the expectation of the product sign(X_i) sign(Y_j) as the probability that the signs agree minus the probability that they disagree. This expectation turns out to have an exact formula. The way to think about it: we have two unit vectors, and the angle between them is ρ = arccos of the dot product, which is arccos(E[X_i Y_j]). The probability that the signs disagree is exactly ρ/π, so the probability that they agree is 1 − ρ/π. Therefore E[sign(X_i) sign(Y_j)] = 1 − 2ρ/π, and a little trigonometry turns this into the identity E[sign(X_i) sign(Y_j)] = (2/π) arcsin(E[X_i Y_j]). I never remember if it's the arcsine or the arccosine, but I think it's the arcsine; this is what we know. What we want is for the expectation of the signs to equal a constant times Ẽ[x_i y_j]. Combining with the identity, this means we want Gaussians such that E[X_i Y_j] = sin(c · Ẽ[x_i y_j]) for some constant c; let's ignore the exact constant, it doesn't really matter. That would be good enough for us. Now, the nice thing is that the quadratic sampling lemma tells us we can always create such Gaussians from any PSD matrix. So what we really want is this: we are given a 2n-by-2n matrix, where one block of indices corresponds to the X part and the other to the Y part.
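The sign identity just stated, E[sign(X) sign(Y)] = (2/π) arcsin(ρ) for standard Gaussians with correlation ρ, can be checked by Monte Carlo; this is a sketch with numpy, with the sample size and seed chosen arbitrarily by me:

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.3
cov = np.array([[1.0, rho], [rho, 1.0]])

# Draw correlated standard Gaussians and compare the sign correlation
# against the exact Grothendieck identity (2/pi) * arcsin(rho).
samples = rng.multivariate_normal([0.0, 0.0], cov, size=500_000)
empirical = np.mean(np.sign(samples[:, 0]) * np.sign(samples[:, 1]))
exact = (2.0 / np.pi) * np.arcsin(rho)
print(empirical, exact)
```

With half a million samples the two numbers agree to about two decimal places, which is what the ρ/π disagreement-probability argument predicts.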
If we define M_{ij} to be the pseudo-expectation Ẽ[x_i y_j], we know the full moment matrix is PSD. What we want, at position (i, j) of the X-Y block, is the entry sin(c · M_{ij}), and similarly in the transposed block; on the diagonal we want the Gaussians to be standard, so those entries should be one. In the two diagonal blocks we otherwise have freedom. So to finish the proof, what we need to show is that we can use this freedom to make the whole matrix PSD; if we can make this matrix PSD, we are done. Now, if the Taylor expansion of the sine function happened to have all positive coefficients, it would be quite easy to show that applying sine entrywise to a PSD matrix still gives a PSD matrix. For sine that's not exactly the case, but you can correct for it, and the idea is to correct using the hyperbolic sine or hyperbolic cosine, I don't remember which, in the blocks where you have freedom, to make sure the matrix is still PSD. There are exercises in the lecture notes to complete this proof, but let me give you the heart of the matter, which is a useful thing to know. It's the following simple but often useful lemma. Why is it, for example, that if you take a PSD matrix and square every entry, it's still PSD? Here is the lemma: suppose M and N are PSD, and take the Hadamard (entrywise) product M ∘ N, that is, the matrix L with L_{ij} = M_{ij} N_{ij}. Then L is also PSD. So the entrywise product of two PSD matrices is a PSD matrix.
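The lemma (the Schur product theorem) is easy to check empirically; a minimal sketch, assuming numpy, with a hypothetical helper `random_psd` that builds Gram matrices:

```python
import numpy as np

def random_psd(n, rng):
    """A random PSD matrix, built as a Gram matrix B B^T."""
    B = rng.standard_normal((n, n))
    return B @ B.T

rng = np.random.default_rng(2)
for _ in range(100):
    M = random_psd(5, rng)
    N = random_psd(5, rng)
    L = M * N  # entrywise (Hadamard) product, not matrix multiplication
    # Schur product theorem: L should have no negative eigenvalues
    # (up to floating-point rounding).
    assert np.min(np.linalg.eigvalsh(L)) > -1e-8
print("Hadamard products of PSD matrices stayed PSD in all trials")
```

Note the `*` here is numpy's entrywise product, which is exactly the Hadamard product in the lemma; matrix multiplication would be `@`.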
In particular, if you take a PSD matrix and square every entry, you get a PSD matrix, and that is used to say that if you apply a polynomial with nonnegative coefficients entrywise to a PSD matrix, you still get a PSD matrix; the interesting part is that if you apply something with negative coefficients in one block, you need to correct for it in the other blocks. The proof of this lemma is simple, but it's useful, and it's another reminder of what a PSD matrix is. A PSD matrix means that M_{ij} is the dot product ⟨U_i, U_j⟩ of some vectors: with every index we associate a vector, M_{ij} = ⟨U_i, U_j⟩, and similarly N_{ij} = ⟨V_i, V_j⟩ for some other vectors. Now we can write L_{ij} = ⟨U_i ⊗ V_i, U_j ⊗ V_j⟩. The tensor product takes two vectors and makes a vector whose dimension is the product of the two dimensions, containing all pairwise products of entries, and the dot product of tensor products is the product of the dot products. So this exhibits L as a Gram matrix, demonstrating that L is PSD. Again, if you don't exactly follow, do the exercises and then you'll see. Using the exercises, you can complete this proof. So that tells us we can get Gaussians whose correlations are exactly right, and this is the nice thing about the proof: you arrange things exactly, and then you don't worry about the negative coefficients, because all the cancellations work out. It was crucial for this proof that we had the freedom to select the diagonal blocks as we wanted. And there is a more general notion of Grothendieck-type inequalities, which is the following.
So, generalized Grothendieck inequalities: given some graph H, say on 2n vertices to keep things consistent with before, define K_H to be the maximum, over matrices A whose support lies in H, that is, A_{ij} = 0 whenever (i, j) is not an edge of H, of the ratio between the maximum of Ẽ_μ[x^T A x] over degree-2 pseudo-distributions μ over {±1}^{2n}, and the maximum of x^T A x over actual x ∈ {±1}^{2n}. So you take this ratio over all matrices supported on the graph. The usual Grothendieck inequality corresponds to restricting to A's where the two diagonal blocks are zero: the usual Grothendieck constant is the constant of the complete bipartite graph. We partitioned our variables into two parts and only allowed variables in different parts to interact; that was crucial for the flexibility in our proof. But you can ask: for a different graph, what do we know? It turns out that sometimes this constant is not a constant, but K_H is always at most O(log χ(H)), where χ(H) is the chromatic number of H, the number of colors you need to properly color it. For a bipartite graph the chromatic number is two, which is why the Grothendieck constant is a constant; if the graph is five-colorable, that's also fine. And since every graph is n-colorable, K_H is always at most O(log n), which is still a nontrivial approximation. But there is also a lower bound: K_H is always Ω(log ω(H)), where ω(H) is the clique number of H.
So if the graph contains a clique of size t, there is some matrix demonstrating that K_H is at least on the order of log t. In particular, there are graphs where this Grothendieck constant is logarithmic, Θ(log n), which is the maximum it can be. This more general notion has been used too. Basically, what you can see is that for these quadratic Boolean optimization problems, we get a decent approximation, at worst logarithmic, and the quality depends on the structure of the instance. Once more, there are proofs that this is tight, assuming, I don't even remember if it's the Unique Games Conjecture or the Small Set Expansion hypothesis, but one of them shows that you cannot improve over the Grothendieck bound. So even though we don't know the exact constant, the result says there will be no polynomial-time algorithm that gives a better certificate than the one you get by looking at the best degree-2 sum-of-squares proof. Next week we'll talk about how you can prove something like that. Here is one riddle to think about: how do you prove that there is no algorithm that does better than algorithm A, without being able to compute what algorithm A actually achieves? It's kind of interesting that this can happen. But what we see is that both Cheeger's inequality and Grothendieck's inequality may not have complicated proofs, but they are highly nontrivial, highly useful inequalities, and they have degree-2 sum-of-squares proofs. So a degree-2, or more generally a small-degree, sum-of-squares proof doesn't mean that something is trivial. And there is a general question: which inequalities that we know are true can we prove with low-degree sum-of-squares proofs?
Let me give some examples that we do know. For example, Cauchy-Schwarz: with degree two, or maybe degree four to write it properly, you can prove that (Σ x_i y_i)² ≤ (Σ x_i²)(Σ y_i²). [Question:] When you say you have a sum-of-squares proof, are you saying that for every fixed matrix you can prove the numerical statement with sum of squares, or is there actually some way to formulate Grothendieck's inequality itself in sum of squares, proving the much stronger statement? I see. Okay, so I think I'm saying the following: I believe the second statement, though probably the thing that is actually proven is the first. You have to be a little careful how you phrase it: if there is a natural way to write it as a polynomial in the n² variables determining A being nonnegative, then I think we probably can prove, at least approximately, that this has a sum-of-squares certificate. [Question:] Is there some kind of meta-theorem saying that if a statement can be proven in SOS in every specific case, then the general statement can too? It's a good question, there might be some kind of compactness argument or something like that, I don't really know. I think it's a good question. I guess the reason I believe it is that we haven't used any probabilistic method or anything of that sort. [Comment:] It does contain arcsines. Yeah, but all these things you can basically Taylor-approximate, et cetera. Maybe you lose a constant or something like that, but I kind of believe that it should be okay.
It's a good question; you would have to phrase exactly what it means, but I think it should be okay. So, for example, Cauchy-Schwarz: if I try to expand it out I'll probably get it wrong, but if you subtract one side from the other, the difference is a sum of squares. So Cauchy-Schwarz does have a degree-four sum-of-squares proof. Likewise Hölder's inequality: that is a little more challenging, because if I try to state it I'll probably state it wrong, but it is possible to state it correctly and prove it with a sum-of-squares proof whose degree depends on the exponents involved. So for many inequalities of this kind, we do know low-degree sum-of-squares proofs. Let me mention an open problem. There is a kind of super-inequality that implies all of these, not Grothendieck, but it implies Hölder, Cauchy-Schwarz, and basically almost any inequality that carries someone's name; maybe it even implies Grothendieck in some way. This is known as the Brascamp-Lieb inequality. You can think of it as a generalization of the entropy inequality that the joint entropy H(X_1, …, X_n) is at most the sum of the entropies Σ_i H(X_i). Brascamp-Lieb is a generalization of that. To some extent, it says the following: you can write the above as H(X) ≤ Σ_i H(f_i(X)), where f_1, …, f_l are the projection functions (l is a better name than n here, because we want to think of it as some constant).
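Going back to the Cauchy-Schwarz example above: the degree-4 certificate is just Lagrange's identity, which says the Cauchy-Schwarz slack equals an explicit sum of squares. A numerical sketch (assuming numpy; checking the identity at random points rather than symbolically):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
for _ in range(100):
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)
    # Cauchy-Schwarz slack: (sum x_i^2)(sum y_i^2) - (sum x_i y_i)^2
    lhs = (x @ x) * (y @ y) - (x @ y) ** 2
    # Lagrange's identity: the slack is (1/2) * sum_{i,j} (x_i y_j - x_j y_i)^2,
    # an explicit degree-4 sum of squares.
    sos = 0.5 * sum((x[i] * y[j] - x[j] * y[i]) ** 2
                    for i in range(n) for j in range(n))
    assert abs(lhs - sos) < 1e-8
print("Cauchy-Schwarz slack matched the Lagrange sum of squares at all points")
```

Since two polynomials of degree 4 in 8 variables that agree on enough random points agree identically, this is morally the symbolic identity, which is exactly the degree-4 SOS proof of Cauchy-Schwarz.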
What you can ask is, for some given family of functions, what is the best inequality of this form. A Brascamp-Lieb inequality for f_1, …, f_l is of the form: there are numbers γ_1, …, γ_l and γ such that H(X) ≤ Σ_i γ_i H(f_i(X)) + γ. And Brascamp-Lieb tells you that if you want to understand, for a particular set of functions and a particular set of γ_i's, what would be the best γ so that this inequality holds for every random variable X, then a multivariate Gaussian distribution is always the worst case, the one that is tight for this inequality. And the parameters of that extremal Gaussian you can then try to find by some explicit convex optimization problem. Generally, there is also the question of how you would even phrase this as a sum-of-squares proof. There is a version of this that doesn't talk about entropy, where you exponentiate everything, which might make more sense for a sum-of-squares proof. I think it's a great question to show that whenever you fix the number of functions and the γ's, there is some degree D, a constant depending only on these parameters, such that degree-D sum of squares can certify this inequality no matter what random variable you look at. And there is one special case, which might be solvable, one example of Brascamp-Lieb where I don't know if people know a sum-of-squares proof; it might be somewhere between an open problem and an exercise. It's the following.
So this is one special case of Brascamp-Lieb. Suppose you have some body, some set S inside a universe U × U × U, with measure μ(S) = α. We can look at the projections of S onto the three coordinate planes: S_{12}, S_{23}, S_{13}. What we know is that μ(S)² ≤ μ(S_{12}) μ(S_{23}) μ(S_{13}), where the μ on the right is the two-dimensional measure. What's the right scaling? Check it on a box S = A × B × C where each side has measure α: the left side is (α³)² = α⁶, and each projection has measure α², so the right side is also α⁶. Another way to write this: if you have functions F, G, H from U × U to [0, 1], then, writing it more properly, (E_{i,j,k}[F(i,j) G(i,k) H(j,k)])² ≤ E[F²] E[G²] E[H²], where the expectations on the right are over pairs of indices. And now you can think of F, G, H as just strings in [0, 1]^{N²}, where N is the size of U, think of U as {1, …, N}, and you can try to prove this with a sum of squares. This I actually don't know; like I said, I think it's somewhere between an open problem and an exercise, and I think it would be very interesting to show this inequality with a sum-of-squares proof, and then maybe try to see if it generalizes to the general Brascamp-Lieb inequality.
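The functional form of this special case (a Loomis-Whitney-type inequality) is easy to test numerically; here is a sketch assuming numpy, checking it on random 0/1 functions (the universe size and trial count are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6  # |U|
for _ in range(200):
    # Random 0/1 "indicator" functions F, G, H on U x U.
    F = rng.integers(0, 2, size=(n, n)).astype(float)
    G = rng.integers(0, 2, size=(n, n)).astype(float)
    H = rng.integers(0, 2, size=(n, n)).astype(float)
    # (E_{i,j,k}[F(i,j) G(i,k) H(j,k)])^2 ...
    lhs = (np.einsum("ij,ik,jk->", F, G, H) / n**3) ** 2
    # ... should be at most E[F^2] E[G^2] E[H^2].
    rhs = (F**2).mean() * (G**2).mean() * (H**2).mean()
    assert lhs <= rhs + 1e-12
print("(E[F G H])^2 <= E[F^2] E[G^2] E[H^2] held on all random trials")
```

Of course, passing random trials is evidence, not a certificate; the open question in the lecture is precisely whether a low-degree sum-of-squares proof certifies this for all F, G, H.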
Showing the general Brascamp-Lieb inequality with a sum-of-squares proof is definitely an open problem, one where we don't know the answer, and it's definitely interesting to people; I think this question has been asked explicitly, so it would be very interesting to show that it has some sort of SOS proof. Yes. [Question:] Would it break an interesting natural algorithm? That would be interesting: there is a very recent paper, by Garg, Gurvits, Oliveira, and Wigderson if I remember correctly, giving a kind of algorithmic version of Brascamp-Lieb using operator scaling, and an SOS proof might give an alternative algorithm, so I think it's worthwhile. Even for the general Brascamp-Lieb, it's also a question exactly how you phrase it; for this special case, maybe you can solve it by some kind of iterated Cauchy-Schwarz. But basically you want to certify that you can bound the volume of a set by some kind of average, not quite the geometric average, of its projections onto pairs of coordinates, and this is a general phenomenon. If you think of entropy as the log of the volume, then this is a version of the following: you have random variables X_1, X_2, X_3, and you want to say that H(X_1, X_2, X_3) ≤ ½ H(X_1, X_2) + ½ H(X_2, X_3) + ½ H(X_1, X_3); that is the entropy version of this inequality. And maybe this is not hard, but it could be a first step. So, any questions?
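The entropy version just stated (a case of Shearer's inequality) can also be checked numerically; a minimal sketch assuming numpy, on a random joint distribution of three small variables:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a probability vector."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

rng = np.random.default_rng(3)
P = rng.random((3, 3, 3))
P /= P.sum()  # a random joint distribution of (X1, X2, X3)

joint = entropy(P.ravel())
# Marginalize one coordinate at a time to get the pairwise entropies.
pair_sum = 0.5 * (entropy(P.sum(axis=2).ravel())    # H(X1, X2)
                  + entropy(P.sum(axis=0).ravel())  # H(X2, X3)
                  + entropy(P.sum(axis=1).ravel())) # H(X1, X3)
assert joint <= pair_sum + 1e-9
print(f"H(X1,X2,X3) = {joint:.3f} <= half-sum of pair entropies = {pair_sum:.3f}")
```

The inequality holds for every joint distribution, with equality, up to lower-order terms, when the distribution looks like a product; random distributions sit strictly inside.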
If there are no questions, I'll just give you a quick preview, and we want to finish a few minutes, maybe five minutes, early, because people who want to go to the other seminar talk can stretch and walk one floor down. So let me give you a quick preview of what we're going to talk about next week, or I guess in ten days, because we canceled a lecture. What we have seen so far are algorithmic implications: algorithms using sum of squares, and also, to a little bit of an extent, proofs using low-degree sum of squares. There is one more algorithm we will see, a little later in this course, and this is the algorithm of Arora, Rao, and Vazirani, which gives an O(√log n) approximation for the expansion φ(G). So far we have seen an approximation that approximates φ(G) up to a square-root factor, and if φ(G) is like 1/n, that square-root factor can be terrible. Using linear programming, you can show that you can always approximate this up to a log n factor; this was shown by Leighton and Rao, it's not super difficult, but for a long time that was the best we could do. Arora, Rao, and Vazirani gave this better algorithm; they didn't think about it this way, but it is basically using degree-6 sum of squares, and we will see an analysis of this algorithm also in the language of pseudo-distributions. Whether that's significantly simpler or not is a matter of debate, but it's anyway a very worthwhile algorithm to see. [Question:] So does it run in time n to the sixth?
No, no, and generally they use a very specific property of degree-6 pseudo-distributions, in some sense the same square-triangle inequalities that an exercise asks you to use to show that SOS captures the exact MaxCut value on odd cycles. What happens often with sum-of-squares-based algorithms is this: degree six trivially shows the algorithm runs in some polynomial time, like n to the sixth, but then, if you really care, you go into the analysis and see that you didn't actually need a general SDP solver. Their algorithm has indeed been improved; I don't even remember the running time of the first version, maybe it was n cubed or something like that, but people have improved it, and by now I think they know how to get this guarantee in roughly quasi-linear time. But this will be probably not next week but the week after. In the next lecture, we are going to talk about lower bounds, that is, limitations of the sum-of-squares algorithm. We'll show some limitations for degree two; in particular, what is known is that basically everything in our analysis was tight, to some extent, except the constant, which we don't know exactly. In particular, we'll show an instance, due to Feige and Schechtman, showing that the MaxCut analysis was tight. Then we want to show limitations beyond degree two, because the degree-2 lower bound doesn't really tell us a lot; it doesn't rule out that there is a good algorithm at degree, say, six. So we'll show limitations for degree d as large as some little-o of n, that is, cases where sum of squares is going to take exponential time. These will be for different problems, and these problems are motivated by NP-hardness: basically, we will see how it is possible to take insights from NP-hardness proofs and move to SOS lower bounds, which in technical parlance are called
integrality gaps. So we'll take NP-hardness proofs and move to SOS lower bounds. And we'll also show a very interesting result of Raghavendra, which goes in the other direction: it takes a sum-of-squares lower bound, in fact only degree two, and moves it, ideally it would have been to NP-hardness, but what he knows how to do is Unique-Games-hardness, which is still super interesting. This direction is also not obvious, but at least the first direction is somewhat natural: if a problem is NP-hard, you predict that there is not going to be any sum-of-squares algorithm for it, and maybe you can use this prediction to show a lower bound. But here you have to ask: if a problem is only hard for this particular algorithm, you have one instance on which this particular algorithm fails; how do you turn that into a general hardness result? We will at least outline how he does that. And I should say that it is a great open question to find a transformation from SOS lower bounds to NP-hardness, and the best result in this direction is from Siu On Chan in 2013, where he didn't give any general transformation, but he did show a new NP-hardness result that was inspired by a sum-of-squares lower bound. That was maybe the first time we had the sum-of-squares lower bound first and then the NP-hardness; before, it was typically the other way around. So this is the preview: the next lecture is going to be about the limitations of the sum-of-squares algorithm, but understanding the limitations also gives you good intuition about what kinds of things it cannot do. If there are no questions, we can enjoy twenty minutes of break, and then, if you're interested, you can go to the other seminar. If you don't know that seminar series, the one one floor down, it's a very informal three-hour whiteboard seminar series, and it's absolutely fine for someone to go down, have the pizza, and
just stay for the first half of the talk and leave at the break. We typically tell the speaker that some people will want just the high-level part and will leave after the break, and we ask them to organize the talk so that you get something out of it even if you leave at half time. But of course, there is no obligation at all.