OK, so let's start. Today we're going to talk about approximation algorithms for graph expansion. The general problem is the following: we're given some graph, and we want to find a cut that minimizes the ratio between the number of edges crossing the cut and the size of the smaller side. That's an incredibly useful thing — I don't even know all the places where it's used. If you want to analyze, say, random walks, Markov processes, et cetera, you want to understand the obstructions to starting from a given point and reaching the entire space. And a lot of divide-and-conquer algorithms work by splitting the graph into two pieces and working on each piece separately. So it's a very useful quantity to have. So if φ(G) is the minimum, over sets S of size at most n/2, of the number of edges from S to its complement divided by the number of possible edges going out of S — which is the degree times |S| — then, up to constants, this is the expansion. And we have basically seen one approximation for this: Cheeger's inequality gives us a √φ versus φ approximation. For some graphs this could be terrible. For a graph like the cycle, where the expansion is about 1/n, the set that Cheeger gives you could have expansion as bad as 1/√n. So if you count the approximation ratio as the ratio between the expansion of the set you produce and the best set possible, this could be as bad as √n or so — a very bad approximation ratio. Later, Leighton and Rao gave an O(log n) approximation algorithm; we will see one way to get an O(log n) approximation.
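Since this objective comes up throughout the lecture, here is a minimal sketch of computing the edge expansion of a given cut. The edge-list representation and the function name are my own choices, not from the lecture; the cycle example illustrates the ≈1/n expansion mentioned above.

```python
def expansion(n_vertices, edges, S):
    """Edge expansion of cut S: |E(S, S-bar)| / (d * min(|S|, |S-bar|)),
    with d the (average) degree.  A sketch, assuming an undirected edge list."""
    S = set(S)
    d = 2 * len(edges) / n_vertices          # equals d for d-regular graphs
    boundary = sum(1 for (u, v) in edges if (u in S) != (v in S))
    return boundary / (d * min(len(S), n_vertices - len(S)))

# the n-cycle: an arc of n/2 vertices cuts 2 edges, giving expansion ~ 1/n
n = 100
cycle = [(i, (i + 1) % n) for i in range(n)]
print(expansion(n, cycle, range(n // 2)))
```

On the 100-cycle this returns 2/(2·50) = 0.02, i.e. on the order of 1/n, matching the cycle example in the lecture.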
It's not very hard to get such an algorithm, but this algorithm has been very, very useful in lots of applications. And you might feel like log n is a natural bound for this problem — it's a natural algorithm, so you might think that's optimal. That's why people were very impressed by the result of Arora, Rao, and Vazirani, who gave an O(√log n) approximation algorithm. Beyond the fact that they beat the natural bound, the techniques — especially the techniques of the analysis — were very different from things we had seen before. That also makes it a very good algorithm to study, and particularly to study its analysis. One disadvantage of the analysis is that it's not super simple. So let's see how this goes — [testing] OK, it works. It's not super simple, but I think it's worthwhile to see. I heard that in a theory course at Berkeley they spent four to five lectures on the analysis of this algorithm, each lecture about one and a half hours — so about seven and a half hours. We basically have three hours, but you guys are twice as smart as Berkeley students, so it works out. [Student question about the two notions of expansion.] Yeah, they agree up to constants. They're not exactly equal, but it's kind of an exercise to show they're equal up to constants. OK, so what is the ARV theorem? It's the following. Suppose we have a pseudo-distribution. Let me make things slightly simpler — the theorem gives an approximation in general, but let's just assume the pseudo-distribution pretends that there is a set of size exactly n/2. Then the hypothesis is: the average, over edges (i, j) of the graph, of the pseudo-expectation Ẽ[(x_i − x_j)²] — which is basically the probability that the edge is cut — is at most some number φ.
And that's enough for what we need: the conclusion is that there is an algorithm to find a set S whose expansion is at most O(√log n · φ). So this gives you an O(√log n) approximation algorithm from ARV: we run the sum-of-squares algorithm, we find a degree-4 pseudo-distribution, and we apply this theorem. So now let's prove it. By the way, the proof I present deviates slightly in some parts from the one in the notes. David and I kind of figured out yesterday what we think is maybe a slightly cleaner way to explain it. I hope it is also correct — comments or suggestions are welcome. We'll see how it goes. [Student: don't you lose something by assuming the balanced case?] Yes — but it's not very hard to reduce to the balanced case at the cost of a constant factor, so the interesting case is really this one. OK. So really the whole proof of ARV is just one main lemma, and it's the following. Set δ to be something like ε/√log n for some very small ε. Define H to be the graph where i and j are adjacent when Ẽ[(x_i − x_j)²] ≤ δ. Then, for any pseudo-distribution as above, there exist sets A and B, each of size Ω(n), such that there is no H-edge between A and B. So this lemma says: you can take this graph H and break it — find two parts that are not completely tiny such that there are no edges between them. Basically everything we are going to focus on is how to prove this lemma. But we'll start by showing that the lemma implies the theorem that we want. Maybe the way I'll do it is use this board to show why the lemma implies the theorem, and then move on to proving the lemma. It would have been very nice if I had not erased the theorem before doing this, but hopefully we'll still manage.
So why does the lemma imply the theorem? What we do is the following. We have this set A and this set B, and S will be something like a set that kind of contains A — it might even contain some parts of B. We choose t at random in [0, 1], and we define S_t to be the set of all i such that there exists a j in A with Ẽ[(x_i − x_j)²] < t. OK, so first of all, what do we know? For a typical edge (i, j) of the original graph, we know that Ẽ[(x_i − x_j)²] is at most φ, right? Now define τ_i to be i's distance from A — that is, Ẽ[(x_i − x_a)²] for the closest a in A — and τ_j to be j's distance from A. Suppose the edge length Ẽ[(x_i − x_j)²] is φ. What do we know about τ_j − τ_i? It's at most φ, right? Can you see it? Take the guy closest to i, some little a in A, so the "probability" of a and i being different is at most τ_i, and the "probability" of i and j being different is at most φ. By the union bound, the probability of a and j being different is at most τ_i + φ, hence τ_j ≤ τ_i + φ. Now, that's the argument if these were actual probabilities; for a degree-4 pseudo-distribution, you have to prove that these quantities behave like actual probabilities. This is known as the squared triangle inequality. And that's basically the only thing ARV really uses about the degree-4 pseudo-distribution beyond degree 2.
So maybe we should write it down: Ẽ[(x_i − x_k)²] ≤ Ẽ[(x_i − x_j)²] + Ẽ[(x_j − x_k)²]. The reason this works for actual distributions is that these really are probabilities: the probability that x_i differs from x_k is at most the probability that x_i differs from x_j plus the probability that x_j differs from x_k, because if x_i differs from x_k, one of these two events has to occur. [Student: in the main lemma, is that an actual expectation or a pseudo-expectation?] Pseudo-expectation, yes. And that's really the only thing we use. It actually turns out that the main lemma is not much different to prove for actual expectations than for pseudo-expectations — the proof is not easier for actual expectations; it's really the same. So, basically: if the difference of distances along an edge is at most the edge's length, it means that when we choose a random t, the probability that we cut any particular edge is at most its length φ. So the expected number of edges cut is at most φ times the total number of edges. And the expected size of the smaller side: B sits at distance at least δ from A, because there are no H-edges between them, so its Ω(n) vertices are all at distance at least δ. So with probability at least δ over t, all of B lands in the complement of S_t, and the expected size of the complement — which might be the smaller side — is at least Ω(δn). So in expectation there will be some t where the ratio of edges cut, E(S_t, S̄_t), to the size of the smaller side is at most O(φ/δ) times d.
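For actual 0/1 random variables the squared triangle inequality really is just the union bound above. Here is a tiny sanity check on a made-up joint distribution; the support set is arbitrary, chosen only for illustration:

```python
from itertools import product

# a toy joint distribution: uniform over this support in {0,1}^3
support = [(0, 0, 0), (0, 1, 1), (1, 1, 0), (1, 0, 1)]

def dist(i, j):
    """E[(x_i - x_j)^2] = Pr[x_i != x_j] for 0/1 variables under the
    uniform distribution on `support`."""
    return sum(1 for x in support if x[i] != x[j]) / len(support)

# the squared "triangle inequality" ARV needs -- here a plain union bound
for i, j, k in product(range(3), repeat=3):
    assert dist(i, k) <= dist(i, j) + dist(j, k) + 1e-12
print("triangle inequality holds on this distribution")
```

For genuine distributions the assertion can never fail; the content of the ARV observation is that degree-4 pseudo-distributions are forced to satisfy the same inequality.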
So the expected number of edges cut: the total number of edges is dn/2, and each one of them is cut with probability at most φ — or at least, on average it's cut with probability φ — so the expectation is at most φ · dn/2. [Student: what happens with S_t here?] Yeah, t is random between 0 and 1. The procedure is: you pick t, and the cut is S_t and its complement. So the cut, on average, cuts a φ fraction of the edges. And the smaller side of the cut has expected size at least on the order of δn: there are at least Ω(n) guys at distance at least δ from A, so with probability δ over t, they all land on the far side. So there will be some t where the ratio is at most O(φ/δ) times d times the size of the set. [Student: which graph is this?] This is the original graph. We are cutting the original graph, but we are looking at sets that came to us from the pseudo-distribution, and what we know about those sets is exactly these two expectation bounds. So we play a little bit with expectations here. When we measure the expectation, we use the fact that we choose t at random; what we care about at the end is one particular t. But to prove that a good t exists, we only need to show that one exists. It uses the following fact: if you have (possibly correlated) random variables X and Y ≥ 0 such that E[X] ≤ c · E[Y], then there is always some sample where X ≤ c · Y — the probability of that event is positive, since E[X − cY] ≤ 0.
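The averaging fact at the end — E[X] ≤ c·E[Y] for correlated nonnegative X, Y forces some sample with X ≤ c·Y — can be sanity-checked numerically. The particular distributions below are arbitrary, chosen only to make the pairs correlated:

```python
import random

rng = random.Random(0)
# correlated nonnegative pairs (X, Y): Y is built from X plus noise
xs = [rng.random() for _ in range(1000)]
samples = [(x, 0.7 * x + rng.random()) for x in xs]

c = sum(x for x, _ in samples) / sum(y for _, y in samples)  # empirical E[X]/E[Y]
# sum of (X - c*Y) over the sample is exactly 0, so X <= c*Y cannot fail everywhere
witnesses = [(x, y) for x, y in samples if x <= c * y]
print(len(witnesses), "samples satisfy X <= c*Y")
```

This is the same one-line argument the lecture uses to extract a single good threshold t from the two expectation bounds.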
And so if you know that, over the choice of t, the expected number of edges cut divided by the expected size of the set is at most something, then there will be some t where the actual ratio is at most that. So is everybody clear on why the ARV main lemma implies the ARV theorem? Good time to ask a question. [Student: how do we find t?] OK, so I talked about t between 0 and 1, but you can see that because there are only n vertices, you really only care about n values of t. So you can enumerate over all of them: sort the vertices according to their distance from A, try all n cuts, and one of them will work. Yeah, of course, you need the sets A and B first, but the proof of the ARV main lemma actually gives you the sets. OK, so basically what this means is that now we only need to prove the ARV main lemma. And one thing to remember: when we prove the main lemma, we forget about the original graph — the parameter φ doesn't matter anymore. All we need to prove is that if you have a graph H defined on a pseudo-distribution, where you put an edge between two guys that are very close, then that graph can be split. And again, for the sake of convenience, we are going to assume that Ẽ[x_i] = 1/2 for every i — a pseudo-distribution of balanced things — because otherwise it could be, say, an actual distribution that is always identically zero, and then the graph H is a clique. So what we want to say is: in any graph like that, we can find a large set A and a large set B such that every guy in A and every guy in B are far from each other — thinking of the x's as a cut, the probability that the pair lands on different sides of the cut is not too small.
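The "only n thresholds matter" remark can be written out directly. This is a sketch of the enumeration, with `tau[i]` standing for vertex i's pseudo-distance from A; the function name and edge-list representation are mine:

```python
def threshold_rounding(tau, edges, degree):
    """Try every threshold t = tau[i] and return (expansion, t) for the best
    cut S_t = {i : tau[i] <= t}.  A sketch of the enumeration in the lecture;
    `tau` would come from the SDP/pseudo-distribution solution."""
    n = len(tau)
    best = (float("inf"), None)
    for t in sorted(set(tau)):
        S_size = sum(1 for x in tau if x <= t)
        small = min(S_size, n - S_size)
        if small == 0:
            continue   # skip the trivial cut
        cut = sum(1 for (u, v) in edges if (tau[u] <= t) != (tau[v] <= t))
        best = min(best, (cut / (degree * small), t))
    return best
```

On a 10-vertex path with `tau[i] = i/10` and degree 2, the best threshold is the middle one: one edge cut, smaller side 5, expansion 1/10.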
So maybe, as always, it's a useful thing before proving something to understand where this √log n comes from. So let's look at an example that shows that the √log n in the main lemma is tight. And to give a proper example, let's look at an actual distribution — I'll show you an actual distribution where the √log n is tight. Note that this doesn't show that you cannot get a better algorithm; it shows that if you want to do better, you have to use a different lemma. So here is the distribution. The way I think about it is the following: think of n as 2^L, and think of the random variables as indexed by strings — x_a for a in {0,1}^L. These are correlated 0/1 random variables, and here is how we sample them: we pick a coordinate i uniformly at random in [L], and x_a is simply a_i, the i-th bit of a. So if you think about it, E[(x_a − x_b)²] is exactly the Hamming distance of a and b divided by L — the probability that they differ is the fraction of coordinates in which they differ. OK? So now, first of all, in our case δ is going to be, let's say, 1/(100 √log n). What is log n in this graph? L, right? So δ = 1/(100 √L). So what does it mean for a to be a neighbor of b in this graph H — what's the Hamming distance of a and b, in how many coordinates do they differ?
At most 0.01 √L coordinates, right? Every random variable is indexed by a string, and the variable just reads a random coordinate, so two variables differ with probability equal to their normalized Hamming distance. They differ with probability at most δ exactly when they differ in at most δL coordinates, which in our case means at most 0.01 √L coordinates. So now let me tell you what the sets are in this case. I can take A to be all the a's such that Σ_i a_i < L/2 − √L, and B to be all the b's such that Σ_i b_i > L/2 + √L. So what's the Hamming distance between any a in A and b in B? At least 2√L, right? So there is definitely no H-edge between them. And the measure of A and B: √L is on the order of one standard deviation of the sum (the standard deviation is √L/2), so A is everyone whose sum is a couple of standard deviations below the mean — a constant-measure set, and similarly B. So basically in this case you order things according to the sum: the middle is L/2, A sits below L/2 − √L, and B above L/2 + √L. And it turns out — I won't prove it right now, but it's not hard to prove using some concentration inequality — that we have an isoperimetric phenomenon here: this is essentially the best you can do on the cube. One way to say it is the following theorem, which I won't prove, though it's not super hard: if |A| is at least, say, n/1000, and you look at the set A + x — everything within Hamming distance at most C√L of A, for some constant C — then that set has measure at least 0.999. So if A has some significant mass, and you look at everyone within a bounded number of standard deviations of it, you actually get essentially everything.
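The tight example is small enough to check by brute force for modest L. Here L = 10 is an arbitrary choice; on such a small cube the "one standard deviation" sets are still tiny fractions of 2^L, but the no-edge property is already visible:

```python
import itertools

L = 10                              # n = 2**L vertices, one per string in {0,1}^L
h_threshold = 0.01 * L ** 0.5       # H-edge: Hamming distance at most ~0.01*sqrt(L)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

cube = list(itertools.product([0, 1], repeat=L))
A = [a for a in cube if sum(a) < L / 2 - L ** 0.5]
B = [b for b in cube if sum(b) > L / 2 + L ** 0.5]

# every a in A and b in B differ in more than 2*sqrt(L) coordinates,
# so no H-edge crosses between A and B
min_dist = min(hamming(a, b) for a in A for b in B)
print(len(A), len(B), min_dist, ">", 2 * L ** 0.5, ">", h_threshold)
```

For L = 10 the sets are the strings with at most one 1 and at least nine 1s, and every crossing pair differs in at least 8 coordinates, comfortably more than 2√10.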
And you can show this — this is basically concentration of measure. A lot of what concentration of measure says is that you have a zero–one phenomenon: you either get nothing or you get everything. So here, you basically say: if A wasn't negligible, then fatten it a little bit and you get everything. That's something that repeats itself again and again. And just know that this is a kind of isoperimetric inequality: it says that the best way to get these separated sets is this nice geometric way. So this shows the ARV main lemma is tight. But of course, it doesn't show that it's true — maybe there is a worse example. And that's basically what we are going to do now: prove the ARV main lemma. So, we actually do know one more trick — it might not seem this way from what I've said so far, and it will not seem this way after this lecture, but we do have some other tricks. What's the one trick we have used so far for rounding these pseudo-distributions? Quadratic sampling, right? That's the first thing you do: you look at the pseudo-distribution, you always do quadratic sampling. Never hurts. So that's what we're going to do. So here is the proof. We let y_1, …, y_n be jointly Gaussian random variables — an actual distribution — matching the first two moments of x_1, …, x_n. And let's normalize the Gaussians. What do I mean by normalized? If you match x_1, …, x_n, each y_i has expectation 1/2 and standard deviation 1/2; subtract the expectation and divide by the standard deviation, so that E[y_i] = 0 and E[y_i²] = 1.
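The quadratic-sampling step — draw jointly Gaussian variables matching a given second-moment matrix — can be sketched with a hand-rolled Cholesky factorization. In practice one would use a numerical library; the covariance matrix below is a stand-in for illustration, not from the lecture:

```python
import math, random

def cholesky(C):
    """Plain Cholesky factorization C = L L^T for a small PSD matrix."""
    n = len(C)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(C[i][i] - s) if i == j else (C[i][j] - s) / L[j][j]
    return L

def gaussian_sample(C, rng=random):
    """One sample of centered Gaussians y_1..y_n with covariance C: the
    'quadratic sampling' step matches second moments and nothing else."""
    n = len(C)
    Lf = cholesky(C)
    g = [rng.gauss(0, 1) for _ in range(n)]
    return [sum(Lf[i][k] * g[k] for k in range(n)) for i in range(n)]
```

After the normalization in the lecture the diagonal of C is all ones, so each y_i individually is a standard Gaussian.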
So everyone individually is like a standard Gaussian. It doesn't really matter much for what we are doing — it's just slightly more convenient notation. So we're going to define A′ to be {a : y_a < −1} and B′ to be {b : y_b > 1}. These are sets of constant measure, right? No problem about that. And the first thing we try is to output those guys. That would be great, but maybe we don't really succeed: maybe there are a few H-edges between A′ and B′. So what we do is find some maximum matching M in H between A′ and B′. And basically, if |M| is less than, I don't know, n/1000, then we are happy: we remove the matched vertices, and there are no more edges between the two remaining sides — because if there were an extra edge, we could have added it to the matching. Since A′ and B′ had size around, I don't know, n/5 or whatever, removing n/1000 vertices still leaves sets of size Ω(n). So if the maximum matching has size less than n/1000, we are happy, and otherwise we are sad, OK? So basically, the main claim is: with probability at least one half, we are not sad — the matching can't be too big. [Student: here you're only using degree 2, no? You used degree 4 when deducing the theorem from the lemma, but the statement of the lemma seems to live at degree 2.] No — notice that the property above, the squared triangle inequality, is true about the Gaussians as well, right? It's a property of the degree-2 moments. But those moments only satisfy it because they came from a degree-4 pseudo-distribution. From then on, you only use the Gaussians — and the Gaussians still satisfy this property, right?
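The "happy case" cleanup can be made concrete. A greedy maximal matching already has the property the lecture needs (any leftover crossing edge could have been added to it); the representation here is my own:

```python
def maximal_matching_between(A, B, edges):
    """Greedy maximal matching on the H-edges between A and B.  After removing
    the matched endpoints, no H-edge crosses between the remaining sides:
    any such edge would have been added to the matching."""
    A, B = set(A), set(B)
    matched = set()
    M = []
    for (u, v) in edges:
        if u in B and v in A:
            u, v = v, u                      # orient the edge A -> B
        if u in A and v in B and u not in matched and v not in matched:
            M.append((u, v))
            matched.update((u, v))
    return M, A - matched, B - matched
```

If the matching found is small (below n/1000 in the lecture's accounting), the leftover sets are still of size Ω(n) and have no H-edges between them, which is exactly the main lemma's conclusion.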
So you could state this main lemma as saying: if you have Gaussians that satisfy the squared triangle inequality, then you can find such sets. That's the only thing this proof uses about degree-4 sum-of-squares — the squared triangle inequality, this property here. [Student: and you're going to use that for the Gaussians, in the proof of the lemma?] Yes, yes — we are definitely going to use it today, in everything in the proof of the lemma. [Student: before, we had Ẽ[(x_i − x_j)²]; after transforming to the y_i, what happens to it?] It's the same up to constants: the expectation of x_i was 1/2, and subtracting 1/2 doesn't change x_i − x_j; rescaling multiplies everything by a constant, which we don't care about anymore. So don't worry about the normalization. Right, so basically, the ARV main claim is that you cannot have a large matching. We'll prove the main claim after the break, but let me give the outline — or actually, let's start with a warm-up. Suppose δ were like 1/(100 log n) — that is, suppose we only wanted an O(log n)-type approximation. Then I claim we wouldn't even need this matching business: the original A′ and B′ would already be good. Because what would a bad edge (a, b) — an H-edge between A′ and B′ — satisfy? On the one hand, E[(y_a − y_b)²] is less than δ.
But on the other hand, the actual value of y_a − y_b happened to be at least 2 — and the way I want to think of 2 is as k·√δ with k = 2/√δ ≈ 20 √log n. Why do I write it this way? Because y_a − y_b is a Gaussian variable with standard deviation √δ, and we're asking for the probability that its value reaches 2: for the edge e to be bad, one endpoint is on this side and one on the other, so the difference had to be at least 2. So what's the probability that a Gaussian random variable is k standard deviations or more away from where it should be? Something like e^{−k²/2}, right? So basically, for every particular a and b, the probability that (a, b) is bad is less than, say, n^{−100}, in this case where δ = 1/(100 log n). So we can take a union bound over all the n² pairs: with very, very high probability, there will not be even one bad pair. So we are done — we don't need the matching, nothing like that, right? If we wanted just log n, life would be easy. The problem is that what we want is δ = 1/(100 √log n). And then this probability is just basically something like 2^{−O(√log n)}. There might be bad guys — probably there will be bad guys. A priori, the graph H could have n² edges, and n² · 2^{−O(√log n)} is much larger than n, so you might get up to n^{1.99} bad edges. You might look at this and say: well, I'm probably screwed, because if I had n^{1.99} edges, probably every guy here will be connected to some guy there, and I'm not going to be able to make the sets far from each other.
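The back-of-envelope comparison between the two regimes can be checked with the lecture's constants. Where exactly the union bound breaks depends on those constants, so the numbers below are only illustrative; working with ln n directly avoids astronomically large n:

```python
import math

def log_union_bound(ln_n, delta):
    """ln of (n^2 * exp(-k^2/2)) with k = 2/sqrt(delta): negative means the
    union bound over all <= n^2 pairs rules out every bad edge."""
    return 2 * ln_n - 2 / delta

# delta = 1/(100 ln n): the bound is 2 ln n - 200 ln n, negative for every n
for ln_n in (10.0, 100.0, 1e5):
    assert log_union_bound(ln_n, 1 / (100 * ln_n)) < 0

# delta = 1/(100 sqrt(ln n)): the bound is 2 ln n - 200 sqrt(ln n);
# fine for moderate n, but positive once ln n passes the crossover (100)^2
print(log_union_bound(1e3, 1 / (100 * math.sqrt(1e3))))   # still negative
print(log_union_bound(1e5, 1 / (100 * math.sqrt(1e5))))   # positive: bad edges possible
```

The point is the shape of the exponents: 2/δ grows like log n in the easy regime but only like √log n in the regime ARV needs, so for large enough n the n² pairs overwhelm the Gaussian tail.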
But somehow a much more sophisticated probabilistic analysis shows that this is still OK, and that's what we'll see after the break. It's now 10:56 — enjoy a full nine-minute break until 11:05. [Inaudible student remark about the analysis being probabilistic rather than geometric.] OK, so I think we can continue. We need to prove the ARV main claim that we are not sad, and the way we are going to do it is the following. We have this graph H — just to remind you, i is adjacent to j when E[(y_i − y_j)²] ≤ δ. Now we define the following random variable: z_{i,t} is the maximum of y_j − y_i over all j at distance at most t from i in the graph H — all j with a path of length at most t to i. These are non-negative variables, since i is connected to itself at distance 0. So you look at all the neighbors up to t steps away and see how much you can grow beyond y_i; that's z_{i,t}. It's a random variable correlated with the y's: for every sample of y_1, …, y_n, we get these values z_{i,t}. And the main thing we'll show is that if we are sad — if with constant probability there is a matching of size Ω(n) — then, as long as t is smaller than something like 1/(1000δ), the average (1/n) Σ_i E[z_{i,t}] is at least 0.01·t. OK? We'll show this, and it will actually conclude the proof. Because here is what it means: recall δ is something like ε/√log n, and we set t as large as allowed, t = 1/(1000δ) = √log n/(1000ε).
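The quantity z_{i,t} is directly computable by breadth-first search. This is a slow but faithful transcription of the definition; the adjacency-list representation is my choice:

```python
from collections import deque

def z_values(adj, y, t):
    """z[i] = max over j within H-distance <= t of (y[j] - y[i]),
    computed by BFS from each vertex -- a direct transcription of the
    definition, not an efficient implementation."""
    n = len(y)
    z = []
    for i in range(n):
        dist = {i: 0}
        best = y[i]
        q = deque([i])
        while q:
            u = q.popleft()
            if dist[u] == t:
                continue                   # don't expand beyond radius t
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    best = max(best, y[v])
                    q.append(v)
        z.append(best - y[i])              # >= 0 since j = i is allowed
    return z
```

On a 3-vertex path with y = (0, 5, 1) and t = 1, the values are (5, 0, 4): the middle vertex is already the maximum of its ball, so its z is 0.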
Then we get that there exists at least one i such that E[z_{i,t}] is at least this bound — which, if ε is small enough, is at least 100 √log n. So in expectation, there is some i such that if you go from i to distance t in H, you can gain a difference of 100 √log n. But we have essentially argued before that this event never happens: even if you go over the whole graph and look at all pairs, you will never find a pair whose difference is much more than, say, 2 √log n. Why did I say that? Just to make sure: because we know that E[(y_j − y_i)²] is always at most a constant — each y has variance 1, so the standard deviation of the difference is at most 2 — and the maximum of at most n² such Gaussians cannot be 100 √log n standard deviations away. OK? Clear? So this claim is really the heart of the matter. We want to show that, as long as t is not too big, each additional step increases the expected gap you can reach. And let's start with the case where the matching is, with some probability, a perfect matching between a left side and a right side that partition everything, say, equally. It's actually a partial matching — but for starters, think of it as a perfect matching. So we assume that with some probability, we get a matching M from left to right such that for every i in the left, y_{M(i)} − y_i is at least 1 — in fact at least 2, since one endpoint has y < −1 and the other y > 1, but 1 is all we need. OK? So we assume that we are sad.
So with some probability, we get this matching, which basically splits these guys into left and right and gains you at least 1: if this is y_i, going to M(i) increases the value by at least 1. Notice this matching is a random variable: for every sample of y_1, …, y_n, we get a different matching, and maybe with some probability we don't get a matching at all — but with some probability we do. So now, the basic observation: we want to prove this by induction. We want to prove that, in expectation, (1/n) Σ_i z_{i,t} ≥ (1/n) Σ_i z_{i,t−1} + 0.01. This is what we want to prove. And the key property we want to use is the following: if in t − 1 steps from M(i) you can increase your value by some amount ρ, then, when the matching exists, in t steps from i you can increase your value by 1 + ρ — you go to M(i), which gains 1 more, and then take the (t−1)-step path. And because it's a matching, the matched partners are distinct, so summing over the matching should give basically the same expectation on both sides. But let's be more careful, because you do need to be more careful here — otherwise you can prove something that's actually wrong. [Student: is that y or z?] Ah, it should be z, sorry — we want z throughout. So we want to say the following: we are going to sum up this inequality and take expectations.
[Student: with expectations?] Yes, expectations — thank you. So what we know is the following: if we sum over i in the left, Σ_{i∈L} z_{i,t} is at least Σ_{j∈R} z_{j,t−1} + |M| — in the perfect-matching case, that's a +1 for each of the n/2 pairs. For every sample where we get this matching, we have this inequality, pointwise. Why? Because z_{i,t} ≥ z_{M(i),t−1} + 1: if you want a t-step path from i that increases the maximum, you can always first go to M(i), add at least 1, and then take M(i)'s best (t−1)-step path and add whatever that adds. So we know that this is true, and we add it up over all the guys in the left; because it's a matching, the right-hand sides range over all the guys in the right, and we get the inequality. So now we take expectations. For each i, we take the probability that it's in the left times the conditional expectation: Σ_i E[z_{i,t} | i ∈ L] · P[i ∈ L] is at least Σ_i E[z_{i,t−1} | i ∈ R] · P[i ∈ R], plus the probability that this matching actually exists times the gain — if that probability is not 1 but, say, at least a half, we pick up some constant, say 1/10. [Student: where did the t come from?] I'm just summing up — for each guy, we ask what's the probability of it participating in this inequality.
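The chain we just argued can be written compactly. This is my cleaned-up summary of the step, not a formula from the board; the gain per matched pair is at least 1, and the final constant 1/10 absorbs the probability that the matching exists:

```latex
\begin{aligned}
z_{i,t} &\ge \bigl(y_{M(i)} - y_i\bigr) + z_{M(i),\,t-1} \;\ge\; 1 + z_{M(i),\,t-1}
  &&\text{for } i \in L \text{ with } \{i, M(i)\} \in E(H),\\
\sum_{i \in L} z_{i,t} &\ge \sum_{j \in R} z_{j,\,t-1} + |M|
  &&\text{summing over the matching } M,\\
\frac1n \sum_{i=1}^{n} \mathbb{E}\,[z_{i,t}] &\ge
  \frac1n \sum_{i=1}^{n} \mathbb{E}\,[z_{i,\,t-1}] + \frac{1}{10}
  &&\text{once the conditioning on } i \in L \text{ is removed.}
\end{aligned}
```

The remaining work — the hard part of the analysis — is exactly the last step: replacing the conditional expectations by unconditional ones.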
And we are going to sum this up. Now the key property: let's assume it's a perfect matching. First of all, you can see that this matching is completely symmetric, because the y's are symmetric random variables: for an edge (a, b), the event with y_a on the left has exactly the same probability as the mirror event with y_b on the left. So the probability of being on the left is the same as the probability of being on the right: one half each. So you can sum all this up and you get this thing. The only problem could be the conditioning. If you think about it, when you condition on the event that you happen to be on the right side, it means you're already kind of far out to the right, so maybe you have less room to go. So the conditioning could spoil what we want. So this is what we have; what we want is the same thing without conditioning: (1/n) sum_{i=1}^n E[z_i^t] >= (1/n) sum_{i=1}^n E[z_i^{t-1}] + 1/10. If we could get this, without conditioning on left or right, we'd be done. The nice thing is that, as has happened before in this course, we can again stand on the shoulders of giants here. The crucial observation: for every i, when we sum up this inequality, what appears is the conditional expectation multiplied by the probability of being on the left, and this probability is always one half, equal to the probability of being on the right. So you can think of it as summing, over every i, z_i^t times the indicator of i being on the left, which is at least (2/n) times the sum over i from 1 to n of z_i^{t-1} times the indicator of i being on the right, plus a half.
And then, up to scaling by this probability of being on the left or on the right, which is the same probability and is a constant, you take expectations and you get this inequality: the conditional expectation given that you're on the left, multiplied by the probability of being on the left. So if we could just argue that these conditional expectations are the same as the unconditional expectations, we would be done. But notice that, intuitively, the correlations work against us. What is z_i^t? It's how much more you can grow in t steps, how much you can increase your value by moving to the right. And when you know that you're already on the right, that's probably correlated with z_i^t being a little bit smaller, because you have less room to go. So the correlations are against us; the question is how bad they can be. And now comes the important part. What is z_i^t? It's the maximum, over all j at distance at most t from i, of y_j - y_i. What kind of random variable is each y_j - y_i? Gaussian, right? And what do we know about it? It has expectation 0, and its variance E(y_j - y_i)^2 is at most t*delta because of the triangle inequality, which is much less than 1, since we chose t significantly smaller than 1/delta. Now it turns out that if you have a collection of Gaussians that might be correlated in some weird way you don't even know, and you don't even know how many of them there are, just some collection X_j for j in some index set J, possibly correlated in some arbitrary way...
...then, does anyone know the name for this? It's actually a well-studied concept, known as a Gaussian process. Let me just state the fact that we need; afterwards maybe Pablo can say a little about why we care about Gaussian processes even if you don't care about ARV. The main cool fact, and the only fact we need, is this: if the maximum over j of E[X_j^2] is much smaller than 1, then the probability that the supremum sup_j X_j differs from its expectation E[sup_j X_j] by more than, say, 1/10,000 is o(1). And that is basically enough for us. What it means is that the event up there, the event of being on the left, which happens with constant probability, cannot shift the expectation by much, because the probability of a deviation bigger than a subconstant is o(1). So: if you have a Gaussian process with subconstant standard deviations, and you condition on an event that happens with constant probability, you cannot shift the expectation of the supremum by more than a subconstant amount. Conditioning changes things by an additive o(1), which is not going to matter; we'll maybe get 1/20 instead of 1/10. This is known as concentration of measure for Gaussian processes. But I wanted Pablo to say a little about why we care about Gaussian processes regardless, because he knows more about this than me. Let me just write a couple of things.
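As a sanity check (not part of the lecture), here is a small simulation of the concentration fact just stated: a collection of arbitrarily correlated Gaussians, each with standard deviation sigma much smaller than 1, whose supremum stays within roughly sigma of its mean no matter what the correlation structure is. The mixing matrix A and all parameter values below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, sigma = 200, 20000, 0.05

# Build an arbitrary correlation structure: X_j = sigma * <a_j, g>
# with each row a_j unit-norm, so every X_j has std exactly sigma.
A = rng.normal(size=(n, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)

G = rng.normal(size=(n, trials))      # one standard Gaussian vector g per trial
X = sigma * (A @ G)                   # correlated process, shape (n, trials)
sups = X.max(axis=0)                  # sup over the index set, one value per trial

# Concentration of measure: the std of the sup is at most sigma
# (the Borell-TIS / Gaussian Poincare phenomenon), however A was chosen.
print("mean of sup:", sups.mean())
print("std of sup: ", sups.std())
```

The point is that the fluctuation of the supremum is controlled by the largest individual standard deviation, with no dependence on how many variables there are.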
There are many reasons to care about Gaussian processes, but at least one of the most important ones in optimization is that they let you understand the solutions of random optimization problems. Quite often what you have is something of the following form: maximize a random linear function <g, x> over x in some set X, where g is a random vector. So I have a random cost function, I'm optimizing it over some set X, and typically I want to understand, for instance, the expectation of the optimum. When you were concerned before with understanding the maximum of a random polynomial, that was exactly something of this type. The point is that there are very, very strong concentration results for this; I guess we already saw an example of these kinds of problems. But also, one of the nicer things about Gaussian processes is that there is a very nice, intrinsic geometry associated with them. For instance, a fact many of you may know: if I have n standard Gaussian random variables, what's the expected value of the maximum of the X_i's? It's about sqrt(2 log n), right? And at first sight, it seems the number of random variables has to appear in the bound. But it turns out that even if I have an infinite number of random variables, as long as I understand the covariance structure (the infinite covariance matrix, if you want), I can give a bound that depends only on the intrinsic geometry of these random variables. And there are beautiful techniques, things like chaining, that allow you to bound exactly these kinds of quantities. It's a very beautiful topic, and it appears all over the place.
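A quick numerical illustration of the sqrt(2 log n) fact just mentioned (the trial counts are my own choice):

```python
import numpy as np

rng = np.random.default_rng(1)
for n in [10, 100, 1000, 10000]:
    # Empirical mean of the max of n iid standard Gaussians, vs sqrt(2 ln n).
    samples = rng.normal(size=(500, n)).max(axis=1)
    print(n, round(float(samples.mean()), 3),
          round(float(np.sqrt(2 * np.log(n))), 3))
```

The empirical mean sits a bit below sqrt(2 ln n), since the asymptotics carry lower-order corrections, but the square-root-of-log growth in n is clearly visible.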
If you've heard of, for instance, compressed sensing or this matrix completion stuff, there is a very natural way in which Gaussian processes appear there and allow you to bound all kinds of things. So, back to our proof: the thing we use here is that we don't really know how many guys there are; there could be as many as n of them, and it's not something where we can do a union bound, et cetera. But thinking of it as a Gaussian process, we say: whatever the expected maximum is, its actual value will depend on the geometry of these random variables, the correlation structure; what that value is, is not clear, and it's hard to compute. But the supremum will be concentrated: whatever the value is, it will be within an additive o(1) of it. And this is enough; basically, at this point we are done. When the matching is not perfect, it doesn't really matter, because you still have a constant probability of being on the left or on the right. Possibly one clean way to think about it is that you match a vertex to itself with some probability; when you match it to itself you gain zero, no advantage, but you would still get, on average, a constant improvement from the matching, and then you can do the same thing. What are these z's, are they Gaussian? Right, so the increments are simply y_j - y_i, and what we want to say is that z_i^t, for every t and every i, is the supremum of a Gaussian process whose maximum standard deviation is o(1). And that's the only thing we care about: it's some Gaussian process with standard deviations o(1). Let me maybe do it explicitly.
So in particular, that means the following: if A is an arbitrary event with probability at least a constant, then the expectation of z_i^t conditioned on A, minus the unconditional expectation, is o(1). The reason is that even if A tried to be the worst possible event and put all its mass where z_i^t deviates from its expectation, it wouldn't matter: conditioning on a constant-probability event can inflate a deviation probability only by a constant factor, and the deviation itself is subconstant. So we can use this fact because this is a Gaussian process. Notice that this is exactly why we cannot take t too large: we need the standard deviations to remain o(1). And we used the triangle inequality here in a crucial way. Because j is at distance at most t from i, the variance only grows like t*delta: the variance grows by a factor of t, so the standard deviation only grows by a factor of sqrt(t). In some sense, the fact that the standard deviation grows like sqrt(t) instead of t is really the reason we manage to go from log n to sqrt(log n). And this is exactly what Pablo said: often this is exactly the question you want to ask. If you're thinking of maximizing <g, x> over x in some set X, you can define a Gaussian random variable chi_x = <g, x> for each x, so you're really looking at a collection {chi_x : x in X} of Gaussian variables and asking what the maximum is going to be. Does this remind you of martingales? Well, typically for martingales you need some kind of ordering of your process. Here you don't, right?
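Spelled out, the variance bound being used (my notation: delta is the per-edge bound on the squared increments) is the l2-squared triangle inequality summed along a path:

```latex
% Take a path i = v_0, v_1, \dots, v_s = j in the graph with s \le t,
% where every edge satisfies \mathbb{E}(y_{v_{k+1}} - y_{v_k})^2 \le \delta.
% The \ell_2^2 (triangle-squared) inequality lets variances add along the path:
\mathbb{E}\,(y_j - y_i)^2
\;\le\; \sum_{k=0}^{s-1} \mathbb{E}\,(y_{v_{k+1}} - y_{v_k})^2
\;\le\; t\,\delta \;\ll\; 1 \qquad \text{for } t \ll 1/\delta .
% So the standard deviation after t steps grows like \sqrt{t\delta},
% a factor \sqrt{t} rather than t; this is the source of the
% \log n \to \sqrt{\log n} improvement.
```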
The name does come from the time-indexed setting: people often think of Gaussian processes as indexed by something you interpret as time, and if you search for Gaussian processes, that's the kind of setup you'll see. But there's a lot of beautiful stuff here, and it shows up in many places. In some sense, the nice thing is that they give a general way to think about a situation that could look dauntingly complicated: you have these Gaussians, you don't know exactly their structure, and that's exactly the case here. A priori you maybe wouldn't have expected that we could improve over log n, because you could say, well, all we can do is a union bound. This machinery allows you to take advantage of the structure. And that is perhaps one of the most important messages: anything you do that's based on union bounds, or arguments of that type, pays these logarithmic factors in the number of variables. But in certain situations, if you exploit the correlation structure, you get bounds that depend only on the geometry. That's exactly the case here: we have no dependence on the degree of the graph. How many guys are at distance t from i? We have no idea, and we don't care. So that's a very useful thing to know. Like I said before, the actual value of the expected maximum would depend both on the number of guys and on how they're structured, but to control the deviation around it, we're fine. In general these become geometric problems: you have a metric space of random variables, and the answer really depends on its intrinsic geometry.
I don't know if this question makes any sense at all, but does this have connections to, say, a Gaussian-process analogue of Sauer-Shelah or VC-dimension-type bounds? I don't know offhand. Was there another question? Yes? You said these are geometric bounds, independent of the number of variables, but here it just so happens that we get a square root of log n, and ARV also gets a square root; is that related? I don't know. I guess to some extent you can say that these Gaussian-process bounds let you handle everything between the case where the Gaussians are all perfectly correlated and the case where they are all independent. And, this is also not in the description of ARV as you would see it in the paper, but maybe to some extent you can say how you would get optimistic: you initially make some guesses and say, if everything were independent, we would be fine and get sqrt(log n) for one reason; if everything were perfectly correlated, we would be fine and get maybe even better than sqrt(log n) for another reason. And these bounds somehow allow you to get a general statement that works in all cases. In this case the actual maximum does depend on n, and I think the sqrt(log n) really comes from this condition here: once you increase t too much, the standard deviation becomes too big, and then you don't have this concentration anymore. This might seem like something very technical that you could get around, except that I've shown you before an example that this lemma is actually tight. So this basically concludes the proof of ARV.
I guess we did this in two hours, so we didn't need the whole three hours for something like that. Next, we're going to talk about whether ARV is tight: what kinds of examples we know, and what evidence we have about whether we can improve beyond it or not. But before that, and before the break, this is the time to ask questions about the proof, or other questions. There is a proof written in the lecture notes, in slightly different language; this version will probably be there in a couple of days. But I'm really happy to answer any questions now. Is it clear to everyone, even if you didn't follow every line, how the pieces basically fit together? By the way, the other proof, the one I messed up in the lecture notes, is by now already fixed: the CSP one from last lecture. Question: if you were trying to prove this yourself for the first time, all these constants, do you just tweak them by hand until they work? So, yeah, these constants: one way, which is how David wrote this proof in the notes and which is not completely kosher but makes things cleaner, is to never write explicit constants. Always use O's and Omegas, and then check at the end that the dependence among the constants is acyclic. And it is. So one way to think about it: if something is a constant, just call it a constant, and that's it. Another useful heuristic: whenever you see O(f), think of it as a million times f, and whenever you see Omega(f), think of it as f over a million.
And similarly, whenever you see a constant larger than one, call it an O; smaller than one, call it an Omega. Most of the time it will work out fine, and like I said, I'm kind of a big fan of non-type-safe notations. OK, so after the break we'll talk about whether ARV is tight and what we know about that question. Let's take an eleven-minute break; don't spend it all in one place. OK, I think we can start. Since this was a long proof and you might be losing the forest for the trees, or whatever the right metaphor is, let's remind ourselves of the grand scheme of things, so that if there was a point you didn't understand, you at least know where it sat. The way it worked is the following. We wanted an O(sqrt(log n)) approximation for the expansion, and we showed that it's enough to find sets A and B that are well separated, in the sense that E(x_a - x_b)^2 is at least 1/(1000 sqrt(log n)) for all a in A and b in B. The way we did that was with this region-growing argument, where we selected a random t and looked at all the guys at distance at most t from A. That was what we called the ARV main lemma: finding A and B that are well separated. Then, to find such A and B, we sampled Gaussians, and we showed that it's enough that there is no large matching of stretched edges. So what is a stretched edge?
It means that we sampled y_1, ..., y_n using the quadratic sampling lemma, and we call an edge (i, j) stretched if y_j - y_i is larger than 1 even though E(y_j - y_i)^2 is smaller than delta. So what we want to say is that the only way we can fail, in some sense, is if whenever we sample y_1, ..., y_n, we get a very large matching of stretched edges, one that touches all or almost all of the vertices. That's really the bad case. Again, the picture is: we did this quadratic sampling, selecting for every vertex a number, a Gaussian, and we sort these numbers. We said, OK, this is the set A, the things smaller than minus one, and this is the set B, the things bigger than plus one. And the bad cases were edges between A and B: guys whose expectation of (y_j - y_i)^2 is at most delta, but that nevertheless ended up at distance 2 or so apart in this sorted picture. So it suffices to show there is no large matching. Then what we showed is the following. A large matching immediately implies that on average, in one step, you can grow by Omega(1). That is: take a random vertex and ask, if I went to my neighbor that's farthest to the right, by how much could I grow? If there is a large matching, then with constant probability you're on the left side of this matching, and you can go to the right side and increase your value by 1. So a large matching means that on average, in one step, you can grow by Omega(1).
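A tiny simulation of the quadratic sampling step as I understand it (the vectors V below are arbitrary stand-ins for SDP solution vectors): set y_i = <g, v_i> for a single standard Gaussian vector g, so that each difference y_j - y_i is Gaussian with variance exactly ||v_j - v_i||^2, matching the pseudo-expectation of (x_j - x_i)^2.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, trials = 20, 6, 100000
V = rng.normal(size=(n, d))        # stand-in "SDP vectors" v_1, ..., v_n

G = rng.normal(size=(trials, d))   # one standard Gaussian vector g per trial
Y = G @ V.T                        # y_i = <g, v_i>, shape (trials, n)

# y_j - y_i should be Gaussian with variance ||v_j - v_i||^2:
i, j = 0, 1
emp_var = float(np.var(Y[:, j] - Y[:, i]))
true_var = float(np.sum((V[j] - V[i]) ** 2))
print(emp_var, true_var)
```

An edge (i, j) is then "stretched" when |y_j - y_i| > 1 despite ||v_j - v_i||^2 < delta, which for small delta is exactly a large-deviation event of this Gaussian.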
Then, and this is where we use Gaussian-process concentration, we showed that as long as t is not too big, and using the triangle inequality, on average in t steps you can grow by Omega(t). This is where we get this potential function measuring how much, on average, you can grow in t steps. The two things we use are: first, because of this triangle inequality, or rather triangle-squared inequality (maybe that's the better name, so let me call it the triangle-squared inequality), the standard deviation doesn't grow too much as we take steps; and second, the concentration bounds for Gaussian processes. Together they let us argue that on average, in t steps, you grow by Omega(t). Then, for t equal to some number, I guess it was 1/(100 delta), which the way we chose things comes out to a large multiple of sqrt(log n), this gives a contradiction. The reason is that for every i and j, E(y_i - y_j)^2 is at most 2, so each difference has constant standard deviation, and by a union bound over all n^2 pairs, the probability that some pair ends up so far apart is very small. That's basically the roadmap of the proof. What was z_i^t again? So z_i^t was the growth: (1/n) sum_{i=1}^n z_i^t was our potential function, basically the average growth in t steps. We proved by induction that every additional step gains you a constant, at least 0.01, as long as you don't take t too big, so that you keep this concentration. Right, and we did this for t steps.
And eventually we showed that after t steps, with t roughly 1/(100 delta), which the way we chose delta is something like 100,000 times sqrt(log n), with at least some constant probability there exist i and j such that y_j - y_i is larger than 100 sqrt(log n). (If you can grow that much in t steps, then in particular the maximum over pairs is that large too.) But y_j - y_i is a Gaussian random variable with constant standard deviation, so the probability of this happening is something like n^{-100}, and by a union bound over all pairs i, j, it will essentially never happen. So that's the contradiction. Note that if we were willing to live with a log n approximation, we could have just taken one step, and we wouldn't need Gaussian processes at all, just Gaussians; life would be much simpler. But the Gaussian processes together with this triangle inequality allowed us to keep stretching the bound until we got the contradiction. Yes, this is the overall picture of the proof. Where does the assumption that the matching is large come in? So, the assumption that the matching is large gave us the per-step gain: if the matching covers, say, a tenth of the vertices, it tells us that with each extra step we gain about a tenth. The way the Gaussians interact with the matching is this. Suppose from j, in t - 1 steps, you can reach some y_k with y_k - y_j equal to some value rho. Now suppose i is matched to j, with i on the left and j on the right, which means the stretched edge contributes at least 1. Then clearly z_i^t is at least rho + 1. So if every i gets matched a constant fraction of the time, then on average what you can do in t steps is a constant more than what you could do in t - 1 steps.
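The final union bound, taking the board's constants at face value, reads:

```latex
% Each difference y_j - y_i is Gaussian with variance
% \mathbb{E}(y_j - y_i)^2 \le 2, so the Gaussian tail gives
\Pr\!\left[\, y_j - y_i \ge 100\sqrt{\log n} \,\right]
\;\le\; \exp\!\left( -\frac{(100\sqrt{\log n})^2}{2 \cdot 2} \right)
\;=\; n^{-2500}.
% A union bound over the n^2 pairs (i, j) leaves probability at most
% n^{-2498} that any pair is this far apart, contradicting the
% constant-probability growth produced by the induction.
```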
And that's where we use that the matching is large: every edge of the matching gives you this gain, and you want to say that on average every vertex is matched a constant fraction of the time. Where do we use that it's actually a matching and not just a large set of edges? Right, so we used the matching structure for two things. First, to argue that the left side is significant, so that we can condition on being in it. Second, so that when we sum everything up, every guy appears exactly once: we measure progress by the total progress over all vertices, so we want to make sure that when we sum over all i on the left side, we also sum over all j on the right side, without counting anyone twice. That's why we cannot use just a collection of, say, n edges; it has to be a matching. To make sure I understand: could we phrase this equivalently by saying that if there exists a matching, then there exists some pair i, j with y_j - y_i very large, and that can only happen with very low probability? Yeah, OK, it's a little bit delicate to say it that way, because we very much use the fact that our potential function is an expectation. We don't prove a theorem about one fixed matching. If you just take a fixed matching, there is no length-t path in a matching; there isn't even a length-two path in a matching. It's just a matching.
What we want to say is that because the matching exists with this probability, there will actually be a way to improve along a length-t path. So it's a bit more delicate than that, and it's very useful for us to track this potential function, the average improvement, rather than the improvement in one particular sample of the random variables. If you go over this proof, the thing to try to remember is when we take expectations and when we look at the actual sampled random variables: the matching is probabilistic, the z's are probabilistic, and then we take expectations. So that's the high-level view of the proof. Now I want to talk a little bit about whether it's tight. Even before ARV came up with their algorithm, people had already realized that these triangle inequalities could potentially be useful for this kind of problem. In fact, Goemans and Linial made the following conjecture even before the ARV paper; I don't remember exactly when, maybe '97 or so. The way I would phrase it (not exactly the way they phrased it): suppose the x_i's are given by a pseudo-distribution over {0,1}^n that satisfies the triangle-squared inequality, the l2-squared triangle inequality, meaning that the pseudo-probability that x_i differs from x_k is at most the pseudo-probability that x_i differs from x_j plus the pseudo-probability that x_j differs from x_k. When I write pseudo-probability, you can read the pseudo-expectation of (x_i - x_j)^2; and if these were actual probabilities, you can see why the inequality would be true. So this is the triangle-squared inequality.
Then what they conjectured is that there exist actual Boolean y_i's such that, for every i and j, the probability that y_i differs from y_j agrees with the pseudo-probability that x_i differs from x_j up to some constant factor c. They didn't really specify the constant; just some constant c. That was the conjecture, and if it were true, you can see that you would get a constant-factor approximation for expansion. By the way, the way this conjecture is usually phrased (I'm not going to explain all the notation) is that there is a way to embed the metric space l2-squared, which is the kind of space of pseudo-distributions satisfying these triangle inequalities, into L1, which is the space of distances you get from a distribution over cuts, where the distance between i and j is the probability that the cut separates them. If you look up the Goemans-Linial conjecture, that's typically the first statement you'll see. But basically they conjectured that this holds up to a constant. It was known to be true up to log n; given what we have seen, I think it's an exercise for you, and a good one: prove it for c = log n. Then ARV came in 2004 and showed you can do it with c not a constant, but c = sqrt(log n). (Maybe I'm missing some references here; I don't know if ARV themselves proved exactly this statement, or whether some follow-up work got this exact statement.) Now, there is this philosophy that sqrt(log n) is not a proper quantity. What does sqrt(log n) even mean? Log n or a constant, fine, but sqrt(log n) is not a gentlemanly kind of approximation factor.
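For reference, the embedding form of the conjecture mentioned here can be written as follows (C is the unspecified constant; the exact normalization is my choice):

```latex
% Goemans--Linial conjecture, embedding form: every n-point metric of
% negative type, i.e. an \ell_2^2 metric d(i,j) = \|v_i - v_j\|^2 that
% satisfies d(i,k) \le d(i,j) + d(j,k), embeds into L_1 with distortion C:
\exists\, f : \{1,\dots,n\} \to L_1 \quad\text{such that}\quad
d(i,j) \;\le\; \|f(i) - f(j)\|_1 \;\le\; C\, d(i,j) \quad \forall\, i,j .
% Since L_1 metrics are exactly nonnegative combinations of cut metrics,
% a constant C would give a constant-factor approximation for sparsest cut.
```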
So you could say ARV was evidence that maybe Goemans and Linial were right and the truth is a constant. But then it turned out (let me give some dates; I don't know exactly, my standard deviation here is like two years) that if this conjecture, the small set expansion hypothesis, which you can think of as some version of the unique games conjecture, is true, then there is no constant-factor approximation for expansion. In fact, it implies that the Cheeger bound is optimal in some regimes. (Again, maybe I'm not remembering all the references exactly.) Let's put it this way: if the small set expansion hypothesis is true, then these two things contradict each other, so one of them has to be false: either the Goemans-Linial conjecture or the small set expansion hypothesis. So once you have a prediction like that, you can try to derive an integrality gap from it. And this is what Khot and Vishnoi did: they took an integrality gap for small set expansion and got from it an integrality gap for sparsest cut, and in particular they showed that the Goemans-Linial conjecture is false.
In particular, it was even more than that. These implications follow from a later paper [Raghavendra–Steurer–Tulsiani '11], while Khot and Vishnoi did their work in 2004, so the history is actually a little bit backwards: they didn't know of the small set expansion hypothesis, but what they basically showed is that they could almost get Unique Games hardness for sparsest cut. They couldn't exactly get Unique Games hardness, but they could use these ideas to get an integrality gap and thereby contradict the Goemans–Linial conjecture. Logically, though, had they known about the small set expansion hypothesis, the order of things should have been this way: you first get hardness, then you use this general principle that hardness predicts an integrality gap, and then you exhibit the integrality gap.

So now there is a question. If you believe the small set expansion hypothesis — and hence the Unique Games Conjecture, which it implies — then you believe there should be an integrality gap not just for pseudo-distributions satisfying the triangle inequalities, but also for pseudo-distributions of larger degree — degree 100, or log n, or whatever. So there is the question of whether, say, the coefficient of the integrality gap still works for degree-100 pseudo-distributions. It turns out that the answer to this is no. This is a paper I worked on with Brandão, Harrow, Kelner, Steurer, and Zhou in 2012, but in some sense it was also already known, in an interesting way. Let me just give a high-level sketch of why the answer here is no.
So basically, the bottom line is: we know that the Goemans–Linial conjecture is false. If we believe the small set expansion hypothesis, then we believe there should be an integrality gap — even showing just a super-constant gap — even for, say, degree-100 pseudo-distributions. We don't know an example of such an integrality gap, even a super-constant one, for degree-100 pseudo-distributions; and we also don't know how to use degree 100 to do anything better than √log n. Later, after this work, I think James Lee and Assaf Naor showed that for the Goemans–Linial conjecture this is really tight — √log n is the right answer there — so if you just have the triangle inequalities, you cannot do better than √log n. But for degree-100 pseudo-distributions we don't know: maybe you can get (log n)^(1/100), maybe you can't do anything better than √log n. We really don't know, and it's a great question.

So let me say a little — this part will be very high level — about how this integrality gap works. I think this was maybe the first paper that really used this approach of getting interesting integrality gaps through hardness, or at least used it explicitly. The integrality gap really looks like a hardness reduction in the following sense. They didn't exactly think of it this way, but you can think of it as follows: you take some integrality gap for small set expansion — I'll explain in a second what that means — and compose it with some gadgets. So you do a gadget reduction, replacing each piece by some gadget, which is actually similar to gadgets we've already seen, and then you argue that because it's a reduction: if in the original graph there was no small set that doesn't expand, then in the new graph there is no set at all that doesn't expand. That's the soundness. Completeness says that if you can pretend there is a non-expanding set — where "pretend" means there is a pseudo-distribution — then here too you can pretend there is a set of measure at most, say, δ that doesn't expand. (The notion of "doesn't expand" also changes under the reduction.) I'm not going to show the details of this reduction.

What I want to focus on — in order to understand why this reduction refutes the Goemans–Linial conjecture but does not give an integrality gap for higher-degree sum of squares — is the soundness proof. The point is the following. At the end of the day, the soundness proof tells you something like this: if the objective of the original instance is bad — this is a minimization problem, you're trying to minimize expansion, so "bad" means there is no good solution — then the objective value of the reduction applied to the instance is also bad. That's what you want to prove in soundness. And it turns out that the components this proof used — Khot and Vishnoi used hypercontractivity, which I'll say what it is, and the invariance principle we have seen, which itself also relies on hypercontractivity — it turns out that these facts, or at least good enough approximations of them, have, I don't know, degree-10 sum-of-squares proofs. What that means is the following. A soundness proof is a translation in this direction: it takes a solution x′ for the reduced instance and produces a solution x for the original instance. It says: if you have a solution demonstrating that the reduced instance is not so bad, then you have a solution demonstrating that the original instance was not so bad.

But basically, if the soundness proof is a degree-10 sum-of-squares proof, it means that if you have a pseudo-distribution over solutions for the reduced instance, you also get a pseudo-distribution over solutions for the original instance. And if for the original instance there was a low-degree sum-of-squares proof that it has no good solution, then you have a contradiction — there will be no gap. So the same proof they used to show that this is an integrality gap for degree 2 plus triangle inequalities can be used to show that it is actually not an integrality gap for degree-10 sum of squares: inadvertently, when they proved the soundness statement, they proved something stronger than "the actual objective is small" — they proved it in a way that sum of squares can verify. I think it's interesting that this is inherent in the techniques they used: it's not just an accident that it didn't end up being an integrality gap for degree-10 sum of squares — anything that uses similar techniques will not give you an integrality gap like that.

So let me tell you the simplest thing that, in some sense, is at the heart of their integrality gap. It's a useful fact to know, and it is once more something having to do with isoperimetry. Let's talk about what small set expansion actually is, at a high level. Generally speaking, if you take some large set in a graph, then — even if the graph were completely random — you'd expect about half of the edges touching the set to stay inside it, just because it's so large, right?
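To make the large-versus-small contrast quantitative, here is a quick Monte Carlo sketch (my own code; the parameters are arbitrary) using the δ-noisy Boolean cube that the lecture introduces shortly: one step from the half-cube {x : x₁ = 0} leaves it with probability only δ, while one step from the subcube fixing t coordinates leaves it with probability 1 − (1 − δ)^t, which tends to 1 once t ≫ 1/δ.

```python
import random

# Sketch (illustrative parameters): one step of the delta-noisy cube flips
# each coordinate independently with probability delta.  For the subcube
# S_t = {x : x_1 = ... = x_t = 0}, a step leaves S_t iff one of the
# first t coordinates flips, so Pr[leave] = 1 - (1 - delta)**t.
random.seed(0)
delta, trials = 0.1, 100_000

def leave_prob(t):
    """Monte Carlo estimate of Pr[a noisy step leaves S_t]."""
    left = 0
    for _ in range(trials):
        # Only the t fixed coordinates matter; free coordinates can't kick us out.
        left += any(random.random() < delta for _ in range(t))
    return left / trials

for t in (1, 30):
    print(t, leave_prob(t), 1 - (1 - delta) ** t)
# t = 1  (a set of measure 1/2):  leaves with probability about 0.10
# t = 30 (measure 2**-30):        leaves with probability about 0.96
```

This is exactly the phenomenon the lecture exploits: the same graph looks like a poor expander at measure 1/2 and an excellent one at measure 2^(−t) once t is much larger than 1/δ.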
If you take a small set, then maybe almost all of the edges go out, and often there are graphs where the expansion of small sets is much better than the expansion of large sets. Understanding whether a graph has this property is often very useful. For example, sometimes the graphs we look at describe the evolution of some process — we run some probabilistic algorithm, a Markov chain Monte Carlo algorithm or something like that — and we want to know at what point we reach the entire graph. We start with a single point and start walking: this is what we can reach after one step, this is what we can reach after two steps, and so on. It's often the case that if we just use the expansion of the whole graph to argue that we spread out fast, the bound is very crude and we don't get the right result, because we want to take advantage of the fact that the graph expands much more rapidly for small sets — at the beginning of the random walk you expand much more rapidly. So sometimes we want to ask a different question: given some measure δ, what is the set of measure at most δ that expands the least? Basically, the small set expansion hypothesis says that this question is computationally hard, and we don't know whether it's true. But we do have some interesting graphs where a priori it may not be clear how you certify that they have good small set expansion.

[Question: how small does the set need to be?] It's a function, yes. You can talk about general parameters, but let's just define it like this: say G is a small-set expander if, whenever |S| = o(|V|), the number of edges going from S to its complement is at least, I don't know, 0.99 of the total number of edges touching S — say, some constant. So basically you're saying that if you take small enough sets, essentially all the edges go out.

Let me give you an example of a graph that's not a good expander, but is a good small-set expander in that sense — it becomes a much better expander as you take the sets to be smaller. It's the following graph. The vertices are again the Boolean cube {0,1}^n, and a random neighbor b of a vertex a is obtained by flipping each bit of a independently with probability δ, where you should think of δ as something like a very small constant — the δ-noisy cube. It's not a great expander, because you can consider the following set: the first coordinate is 0 and everything else can be anything. What's the size of this set? Half the number of vertices — 2^(n−1). And what's the probability that, if you take a random edge (a, b) with a in the set, b is not in the set? δ — because in some sense, taking a random neighbor is like flipping every bit with probability δ, and you leave the set exactly when the first bit flips. So basically, a 1 − δ fraction of the edges stay inside the set: the graph is not a very good expander; most of the edges stay inside the set.

But it turns out that if you look at sets of measure, say, 2^(−t), the best thing you can do — this requires a proof; it's basically some kind of isoperimetric statement — the sets that expand the least are of the form {x : x₁ = ⋯ = x_t = 0}: the first t coordinates are 0 and the rest is anything. And what's the probability that, for a in S, a random neighbor b stays in S? (1 − δ)^t, right: you flip each bit with probability δ, so the probability that all of the first t bits stay the same is (1 − δ)^t. You can see that once t becomes, say, larger than 1/δ, you start to get very good expansion. So this is an example of a graph that's not a great expander for large sets but becomes a very good expander for small sets. Of course, this requires proving the isoperimetric inequality: you have to show that every set of measure 2^(−t) or so cannot do much better than these subcubes. And this graph was basically what Khot and Vishnoi had as their initial instance: it's a small-set expander, and there is a sense in which it "pretends" not to be — I'm not going to show it, but it's not that hard to show that you can find a pseudo-distribution that satisfies the inequalities and pretends to be a small non-expanding set.

So here is a theorem that I'm not going to prove, but I'm going to sketch why it's true. If S ⊆ {0,1}^ℓ and, let's say, |S| ≤ 2^(−t)·2^ℓ, then the probability, for an edge (a, b) with a ∈ S, that b ∈ S is at most roughly (1 − δ)^Ω(t). Maybe you need to think of t and ℓ as not too crazy, and this is true for, say, any constant δ. There are finer theorems you can prove — more exact statements — but basically this says that the subcubes are the worst case for this isoperimetric question.

Let me sketch very roughly how one proves this kind of theorem. The idea is the following. You use a structure we have seen in the lecture notes: the Boolean cube {0,1}^ℓ is a group, with the group action being XOR, and we connect a and b with an edge weight corresponding to getting from a to b by flipping each coordinate independently with probability δ. So this is a Cayley graph, and the nice thing about these Cayley graphs is that we understand their eigenvectors exactly and can compute their eigenvalues. So first of all, what are the eigenvectors? An eigenvector of this graph is a vector, that is, a function mapping the vertices to real numbers — some χ: {0,1}^ℓ → ℝ. (The eigenvectors of G and of the Laplacian of G are the same.) It turns out we can understand this completely: the functions χ_S(x) = ∏_{i∈S} (1 − 2x_i) are the eigenvectors of this graph — these functions really correspond to those subcubes — and the corresponding Laplacian eigenvalue λ_S is basically 1 − (1 − 2δ)^|S|, or something like that, something along those lines. Let's not try to nail down the constants — that would not be sportsmanlike. These flip probabilities are between 0 and 1, the adjacency eigenvalues are between −1 and 1; I'm actually talking about the eigenvalues of the Laplacian. One sanity check: if S is empty, then the Laplacian eigenvalue should be 0, and indeed 1 − (1 − 2δ)^0 = 1 − 1 = 0 — so I think this is OK, say for δ at most one half. But what we really care about is that λ_S is roughly δ·|S|, up to constants, in the range we care about — namely when |S| is at most about 1/δ; at some point it becomes something else, but that's the range that matters.

So now we want to actually prove this. What we want to show is that for every set S of this measure, if you take 1_S and hit it with the Laplacian, what comes out is large. Let's just pick something: suppose t is at least 1/δ or so. Then we want to show that the quadratic form ⟨1_S, L·1_S⟩ is at least, say, one half times the norm squared of 1_S. The Laplacian, if you remember, measures the edges: ⟨x, Lx⟩ = (1/d) Σ_{i∼j} (x_i − x_j)², so it measures the number of edges that leave the set, and we want to say that at least half of the edges that touch S leave S. That's basically what we need to prove.

Another way to say it: if we define V to be the span of the χ_S such that λ_S is at most one half, then we want to say that when S is so small, 1_S is far from the subspace V — far from being contained in the span of the eigenvectors with eigenvalues smaller than one half. Most of its mass is on the eigenvectors with eigenvalues larger than one half, and that's why you get at least one half — or one third, or whatever; you get at least some constant. Maybe we should say eigenvalues larger than 0.51 or whatever, since exactly one half would not quite be enough. So we want to say that most of the mass of 1_S is not inside V; therefore, if, say, 99.99% of its mass is on the span of eigenvectors with eigenvalues larger than 0.51, then the quadratic form is going to be large. So this is one way to prove what we are trying to prove: if you have a very small set, its characteristic vector has a large quadratic form with the Laplacian — it is very far from the space of vectors with small Laplacian eigenvalues.

And the point is that λ_S ≤ 1/2 basically means |S| is less than some constant over δ, which by our choice of t is less than, say, t — or 10t, or whatever. So what is the span of all the χ_S with |S| at most t? A function f is in this span if and only if f = Σ_{|S|≤t} α_S χ_S — remember, f is some function from {0,1}^ℓ to ℝ and the χ_S are those characters. [Question from the audience.] Yes — hypercontractivity; that's the term. So, what do we call a function of the form Σ α_S χ_S where |S| is at most t? There is a name for functions like that: low-degree polynomials — polynomials of degree at most t. So what we basically want to show is that you cannot find a polynomial of degree t that is close to 1_S: polynomials of degree t are not close to indicators of sets of measure 2^(−100t). And if you think about it, this makes sense: if you wanted a polynomial to be as "sparse" as possible, it seems you'd basically look at something like x₁·x₂⋯x_t — the AND polynomial, which is only nonzero when all of those coordinates are nonzero — and it seems like that's about the best you can do.

But let's actually try to prove that you cannot get this. Actually, now I see the notation is really bad — I'm using S both for the set and for the Fourier index. OK, let's call the degree bound capital T. So what you want to show is that 1_S is not a polynomial like that. I guess we don't have a lot of time, but one thing we can say about the set S is the following. The expectation of 1_S(x) over x ∈ {0,1}^ℓ — what is this expectation? It's just the measure of S: |S| divided by 2^ℓ, which in our case is, say, 2^(−100T). And let's write it to the fourth power — it doesn't matter whether I raise it to the fourth or the second power, since 1_S is 0/1-valued. So in particular, what I can say is that E[1_S⁴] ≥ 2^(100T)·(E[1_S²])². So if a set is sparse, the expectation of 1_S⁴ is much larger than the square of the expectation of 1_S² — one way to measure sparsity is that the 4-norm is much larger than the 2-norm.

Let me just finish by saying one thing, and then we'll end, because we're out of time. There is a theorem — it's sometimes called the Bonami–Beckner theorem, the hypercontractivity theorem — that says: if f is a polynomial of degree at most d, then the expectation over the Boolean cube of f⁴ is at most 9^d·(E[f²])². Since 9^T is much smaller than 2^(100T), this theorem implies that a degree-T polynomial cannot actually be the indicator of such a sparse set, and it can also be shown to imply that it cannot even be close to one. And so it implies the small-set expansion of the noisy Boolean cube. But the last thing to say about this theorem is that it's an inequality in the coefficients of f — a degree-4 inequality, if you think about it — and it turns out that you can prove it by showing that 9^d·(E[f²])² − E[f⁴] is a sum of squares. Because of that, sum of squares already "knows" that the noisy Boolean cube is a small-set expander, and because of that, plus some other work, it can refute the Khot–Vishnoi integrality gap. So let's stop here.
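The closing moment argument is easy to check numerically. Below is a small sketch (my own code; the parameters ℓ = 6, d = 2 are arbitrary): the indicator of a set of measure μ has E[f⁴] = (1/μ)·(E[f²])², which blows up as the set gets sparser, while polynomials of degree d in the χ_S basis obey the hypercontractivity bound E[f⁴] ≤ 9^d·(E[f²])² — so a low-degree polynomial cannot equal (or come close to) the indicator of a very sparse set.

```python
import random
from itertools import product, combinations

random.seed(1)
ell, d = 6, 2
cube = list(product([0, 1], repeat=ell))

def chi(S, x):
    """Fourier character chi_S(x) = prod_{i in S} (1 - 2*x_i)."""
    p = 1
    for i in S:
        p *= 1 - 2 * x[i]
    return p

# A sparse set: its indicator has E[f^4] / (E[f^2])^2 = 1/mu(S).
S0 = cube[:2]                       # measure mu = 2/64 = 1/32
f0 = [1.0 if x in S0 else 0.0 for x in cube]
m2 = sum(v * v for v in f0) / len(cube)
m4 = sum(v ** 4 for v in f0) / len(cube)
assert abs(m4 / m2 ** 2 - 32) < 1e-9    # ratio blows up like 1/mu

# Random degree-d polynomials respect E[f^4] <= 9**d * (E[f^2])**2.
subsets = [S for k in range(d + 1) for S in combinations(range(ell), k)]
for _ in range(20):
    coef = {S: random.gauss(0, 1) for S in subsets}
    vals = [sum(c * chi(S, x) for S, c in coef.items()) for x in cube]
    m2 = sum(v * v for v in vals) / len(cube)
    m4 = sum(v ** 4 for v in vals) / len(cube)
    assert m4 <= 9 ** d * m2 * m2 + 1e-9
```

The lecture's last point is that the inequality checked in the loop is not just true but has a sum-of-squares proof in the coefficients of f, which is what lets constant-degree SOS certify small-set expansion of the noisy cube.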