So in this balancing game, you play and you optimize in t. And so, to cut a long story short, let me just tell you what you get. As it turns out, after you optimize, this quantity is exponentially small: it is bounded by e to the minus a small constant c times this big constant C squared, times n. So our original quantity is bounded by the entropy cost, 8 to the n, which is big, times this large deviation probability, which is small. Here little c is an absolute constant, something very explicit, like one quarter, and big C is something that we didn't specify. But if you take C large enough, this exponential gain will overcome the entropy cost, and then you win. So that's basically it, modulo this computation, which I'm going to skip.

All right, so that's the largest singular value. Now you can try the same method for the least singular value. This turns out to work for rectangular matrices, but not so well for square matrices. So the next result I'm going to state is: suppose you have a matrix which is genuinely rectangular, so p is not just less than n, but p is at most 1 minus delta times n, for some delta. (Actually, I might be using epsilon for too many things; let me just go with delta.) And M is Bernoulli. What we're going to show is that, just as the largest singular value is bounded above by a constant times root n, the least singular value is bounded below by a constant times root n, with exponentially high probability. But now the constant depends on delta, and it will go to 0 as delta goes to 0; in fact, it has to. OK, so it's a very similar result.
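None of this appears in the lecture itself, but the two-sided claim is easy to sanity-check numerically. The sketch below uses arbitrary illustrative values (n = 400, delta = 1/2): it draws one rectangular Bernoulli matrix and compares its extreme singular values to root n.

```python
import numpy as np

# Illustrative check (made-up sizes, not from the lecture): for a
# rectangular p x n Bernoulli sign matrix with p = (1 - delta) * n,
# the singular values should sit between c*sqrt(n) and C*sqrt(n).
rng = np.random.default_rng(0)
n, delta = 400, 0.5
p = int((1 - delta) * n)
M = rng.choice([-1.0, 1.0], size=(p, n))
svals = np.linalg.svd(M, compute_uv=False)   # sorted in decreasing order
sigma_max, sigma_min = svals[0], svals[-1]
print(sigma_max / np.sqrt(n), sigma_min / np.sqrt(n))
```

Both ratios are order one, with the lower one shrinking as delta shrinks, in line with the statement that the constant in the lower bound degenerates as the matrix becomes square.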
OK, so just as the largest singular value can't get much bigger than root n, in the rectangular case the smallest singular value can't get much smaller than root n. So the singular spectrum is wedged between two multiples of root n. As the matrix becomes more square, what happens is that this little c goes to 0. But for a genuinely rectangular matrix, there is a gap which you can exploit.

So you can play the same sort of game as before, but the main thing you need is a variant of this bound here. So what are we trying to do? We're trying to bound the probability that the least singular value is small. So now we have an inf rather than a sup, and the inequality is going to go the other way. Once again, if you think about it, this is still a union of a bunch of events, but an uncountable union, so you can't use the union bound right away. You need an inequality like this in order to proceed.

So here is the analogous inequality: if Sigma is a maximal epsilon-net, for some epsilon bigger than 0 and less than 1, then the inf over the sphere is bounded below by the inf over the net, but you pay a price, and the price is epsilon times the operator norm. And it's the same proof: the inf is attained at some x, you approximate x by some y in the net, and then you use the triangle inequality in the other direction to estimate the error, and the error you just bound by the operator norm. Previously, the operator norm was the same quantity that appeared on the left-hand side, so you could absorb that term into the left-hand side. Now it's a different quantity, so you can't just get rid of it. But we now have a theorem that bounds it: the operator norm is at most a constant times root n with really high probability. So this error is tolerable.
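As a toy illustration of this net inequality, in the assumed setup where M is an n by p sign matrix acting on the p-dimensional sphere, one can take p = 2 (so the sphere is a circle we can enumerate) and verify numerically that the inf over the sphere is at least the inf over an epsilon-net minus epsilon times the operator norm.

```python
import numpy as np

# Toy check of the net inequality (assumed setup: M is n x p, x ranges over
# the p-dimensional unit sphere, p = 2 so the sphere is just a circle):
# inf over the sphere >= inf over an eps-net - eps * ||M||_op.
rng = np.random.default_rng(1)
n, p, eps = 50, 2, 0.1
M = rng.choice([-1.0, 1.0], size=(n, p))
op_norm = np.linalg.norm(M, 2)               # largest singular value

# net points spaced eps apart in angle, so every sphere point is within eps
angles = np.arange(0.0, 2 * np.pi, eps)
net = np.stack([np.cos(angles), np.sin(angles)], axis=1)
inf_net = np.linalg.norm(net @ M.T, axis=1).min()

# a dense sampling of the circle stands in for the true infimum
dense = np.arange(0.0, 2 * np.pi, 1e-4)
sphere = np.stack([np.cos(dense), np.sin(dense)], axis=1)
inf_sphere = np.linalg.norm(sphere @ M.T, axis=1).min()

print(inf_sphere, inf_net, eps * op_norm)
```

Since the net is a subset of the sphere, the inf over the net is also at least the inf over the sphere; the epsilon times operator norm term is exactly the price of passing between the two.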
And because of that, you can basically bound this inf over the sphere by the inf over the net, up to an error which you want to be less than, say, c over 2 times root n, and that works if you pick epsilon small enough. Previously we chose epsilon equal to one half; now epsilon has to be a bit smaller, depending on little c and big C. In fact, the precise epsilon I need is something like c over 2 big C. So you have to pick little c first and then pick epsilon. But let me not keep too careful track of these parameters.

OK, maybe just one tiny technical point here. There is an exceptional event on which the operator norm bound fails, and there is this uncountable union floating around, which looks bad. But this is a single event; it doesn't depend on x. The event that the operator norm is too big is just one event, so you take it out once rather than many times. So this is not a problem.

OK, and then this, again, you can bound by the entropy cost times the probability of a single event. And here something very crucial happens. The entropy cost is not 8 to the n as before, or even some bigger constant to the n. It is actually a constant depending on epsilon raised to the power p, not n, because my net takes values in the p-dimensional sphere. So the loss here is like one over epsilon to the p. This will turn out to be very important.
OK, and then this is something like c epsilon. All right. So what's going to happen is that the entropy cost is now this relatively big one over epsilon to the p, but each individual term here is going to be something like epsilon to the n, roughly speaking. And in order to get an exponential gain, it becomes crucial that n is a little bit bigger than p, so that the epsilon to the n can cancel the one over epsilon to the p; otherwise you can't win. Actually, technically I will lose a little bit: I won't quite get epsilon to the n, I'll get epsilon to something slightly less than n, but still bigger than p. So you can already see that the least singular value is much more delicate than the largest singular value; you have to pay much more attention to how the various constants depend on each other.

OK, so this expression looks very much like what we had before; in fact, I can even reuse some of the earlier work. Rather than bounding the probability that this sum is very big, we now want to bound the probability that a sum like this is very small. So we have a sum of independent random variables, and we want to bound the probability that the sum is small. Now, when we were bounding the upper tail, we could use the exponential moment method, which was very convenient. For the lower tail, it turns out this is not as convenient; you could try playing around with a negative exponential, but that turns out not to work so well. So we're not going to use exponential moments here; we'll do something slightly different.

OK, so if this sum is small, then on average each term is small; this is saying that each term is bounded by c squared epsilon squared on average. Now, it would be really nice if I could replace this "on average" with "uniformly".
OK, so if bounding the average were the same as bounding each summand pointwise by c squared epsilon squared, then, because these are all i.i.d. random variables, the probability that every summand is bounded by c squared epsilon squared would just be the probability that a single term is bounded by c squared epsilon squared, raised to the power n. And if I could do that, then I'd basically be done, hopefully. Because what is this random variable xi dot x? (There's a normalization which I thought I'd lost; ah, here it is. Remember, xi is just a vector of plus or minus 1s, and x is some vector on the unit sphere, so there is no factor of n. That worried me for a moment, but it's fine.) So xi dot x is plus or minus x_1, plus or minus x_2, and so on; it's a random walk. Or, if you like, it's a sum of independent random variables, the first of mean 0 and variance x_1 squared, the next of mean 0 and variance x_2 squared, and so forth. So the whole sum is a random variable of mean 0 and variance 1. And since we're summing lots and lots of independent random variables, the expectation is that some sort of central limit theorem should kick in: we expect the distribution of each of these sums to behave roughly like a normal random variable of mean 0 and variance 1. That's what the central limit theorem teaches us. This is not rigorous yet; I'll come back to that later.
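The heuristic can be simulated directly. This sketch (made-up sizes, purely illustrative) draws many random sign vectors against a delocalized unit vector x and checks that xi dot x has mean about 0 and variance about 1.

```python
import numpy as np

# Simulation of the CLT heuristic (toy check, not part of the proof):
# xi . x is a sum of independent terms of variance x_i^2, so it has
# mean 0 and variance |x|^2 = 1, and for a spread-out unit vector x
# it should look roughly like a standard Gaussian.
rng = np.random.default_rng(2)
d, trials = 1000, 5000
x = np.ones(d) / np.sqrt(d)                      # delocalized unit vector
s = rng.choice([-1.0, 1.0], size=(trials, d)) @ x
print(s.mean(), s.var())                         # close to 0 and 1
```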
But if this were normally distributed, then this probability would indeed be of order epsilon, because you're asking for the probability that a normal random variable lands within roughly a constant times epsilon of a point, and that probability is of order epsilon. So each individual probability would be of size epsilon, and raised to the power n, that gives you the epsilon to the n, which would just barely counteract the one over epsilon to the p entropy cost when n is a little bit bigger than p. OK? So that's the strategy.

So there are two things to clean up here. One is that we don't have a uniform bound, only a bound on average. The other is that we need to make the central limit theorem heuristic rigorous.

Replacing "on average" with "uniform" is not too hard; it's just Markov's inequality. We can't say that every term is less than the mean, but we can say that every term is less than the mean times, say, 2 over delta, for at least (1 minus delta over 2) times n values of i. So Markov's inequality says that while not every term is bounded by the average, most of the terms are bounded by some large multiple of the average, and it turns out that the right factor to use is 2 over delta here. So most terms are bounded by some larger constant multiple of epsilon squared; we don't get all n indices anymore, only most of them. And that's why, at the end of the day, instead of getting epsilon to the n, we get epsilon to something slightly less than n, but that's still bigger than p, so we're still going to squeeze out a win. Now, unfortunately, Markov's inequality tells you that most of the terms are bounded, but it doesn't tell you precisely which ones.
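The Markov step can be illustrated on made-up data: whenever nonnegative terms have average m, at most a delta over 2 fraction of them can exceed (2 over delta) times m.

```python
import numpy as np

# Markov's inequality step on arbitrary nonnegative data (illustrative
# values): if the terms a_i have average m, then at most (delta/2)*n of
# them can exceed (2/delta)*m, so at least (1 - delta/2)*n are below
# that threshold; this holds for any nonnegative sample.
rng = np.random.default_rng(3)
n, delta = 1000, 0.2
a = rng.exponential(size=n)          # arbitrary nonnegative "summands"
m = a.mean()
good = int(np.sum(a <= (2.0 / delta) * m))
print(good, (1 - delta / 2) * n)     # good always meets the guarantee
```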
So of the indices 1 to n, there are (1 minus delta over 2) times n of them which are bounded, but you don't know which ones. Because you don't know which ones, you have to apply the union bound one more time, and so there's an additional entropy cost of n choose (1 minus delta over 2) n, which is the number of ways you can choose the indices. So when you actually run the argument properly, this is an extra cost that you have to take care of. But when delta is small, this turns out to be not too bad: you can absorb it in the losses that you already have. So let me just mention that. This is how you deal with the first problem, the lack of uniform control.

And then, in order to make the central limit theorem precise, you need some version of the Berry-Esseen theorem. The Berry-Esseen theorem is basically designed to make quantitative the central-limit-theorem-type heuristics. The precise version of the CLT that is actually needed is the following, and it will actually be one of the first exercises in the problem set. Let xi be random on the cube, that is, a random vector of signs, and let x be a unit vector. Then xi dot x should behave like a Gaussian, and in particular it can't concentrate too much at any given point: for any t and any epsilon, the probability that xi dot x lies within epsilon of t is bounded by an absolute constant times epsilon, plus the third moment, the sum of the |x_i| cubed. Anyone who has actually seen the Berry-Esseen theorem will recognize that the third moment shows up pretty much every time. Now, the second moment is equal to 1, so another way to think about the third moment is that you can bound it by the sup of the |x_i|. So basically this works as long as the vector is fairly spread out. You have a unit vector, so the total mass of the vector is 1.
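This anti-concentration bound can be sanity-checked by simulation before moving on. In this sketch (illustrative parameters only) x is the fully delocalized unit vector, whose third moment is 1 over root d, and the small-ball probability indeed comes out comparable to epsilon plus that third moment.

```python
import numpy as np

# Numerical sanity check of the Berry-Esseen-type small-ball bound
# (illustrative constants, not the exercise itself): for a delocalized
# unit vector x, P(|xi . x - t| <= eps) should be at most an absolute
# constant times (eps + sum |x_i|^3).
rng = np.random.default_rng(4)
d, trials, eps, t = 400, 20000, 0.1, 0.55
x = np.ones(d) / np.sqrt(d)
third = np.sum(np.abs(x) ** 3)                   # = 1/sqrt(d) = 0.05
s = rng.choice([-1.0, 1.0], size=(trials, d)) @ x
prob = np.mean(np.abs(s - t) <= eps)
print(prob, eps + third)                         # prob is comparable to eps + third
```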
So as long as your vector is delocalized, meaning it doesn't concentrate its l2 mass too much at any given coordinate but spreads it out over many, many different coordinates, then the central limit theorem starts kicking in, this random sum becomes more and more Gaussian, and you get this Gaussian-type behavior. And if it weren't for this third-moment term, you would get an epsilon here, which would give you the epsilon that you need over there, and that would let you close the argument.

So this Berry-Esseen bound turns out to give you the right estimate as long as x is what's called incompressible, which means that it is not supported on too sparse a set. Actually, what is the precise notion? The precise notion of incompressibility that I need is this: if you look at the small components, their total mass is bigger than some fixed constant eta. Remember, the total sum of squares is equal to 1; you have this vector whose total magnitude sums to 1. What this says is that the components which are small should already carry a large amount of mass, some significant fraction of that 1. And eta is an absolute constant; you can see in the notes exactly what to take it to be. So as long as your vector is somewhat spread out over many, many different coordinates, this sort of Berry-Esseen-type anti-concentration result gives you the right sort of bound needed to close up the argument.

So the one last thing that you need to deal with is the compressible vectors. The name comes from image processing: these are vectors that are almost sparse, vectors x_1 through x_p that are close to a sparse vector. So the opposite regime is when there is a very small number of coordinates from 1 to p which capture most of the mass.
And everything else captures only a negligible amount of mass. So the vector is almost sparse; it has an almost-sparse representation. You could compress it by throwing away all the noise and get a much smaller representation of your vector, just as in image compression you capture the most important features and can often shrink the size of your image file. Anyway, there is this exceptional set of compressible vectors. For example, (1, 0, 0, ..., 0) is a good example of a very compressible vector, the most compressible vector. In that case, the Berry-Esseen bound fails miserably: the random sum is now just plus or minus 1, there is no central limit behavior going on at all, and the probability of lying within epsilon of 1 is one half; it doesn't go to 0 as epsilon goes to 0. So you lose the central limit theorem, you can't use Berry-Esseen, and you do have to deal with these vectors separately.

But how do you do that? The one remaining observation is that these vectors have much less entropy. So remember, we were covering the entire sphere by some epsilon-net. The compressible vectors form a subset of the sphere which, in a sense, is much smaller. And if you take a maximal epsilon-net of just the compressible vectors, then, depending on how you chose the parameters, the size of the net you need is no longer exponentially big; it turns out to be only polynomially big. And so the entropy cost is much, much smaller for these vectors, and because of that, you can get by with a much worse bound on this probability here. So for the compressible vectors, the net is only polynomial in size rather than exponential, and this probability here, rather than being like epsilon to the n, can be something like 0.999 to the n.
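The contrast between the two regimes is easy to see in a simulation (toy parameters, not from the notes): for the most compressible vector x = (1, 0, ..., 0), the random sum sits within epsilon of 1 with probability one half however small epsilon is, while for a spread-out vector the same small-ball probability is much smaller.

```python
import numpy as np

# Toy contrast between a compressible and a delocalized unit vector:
# xi . sparse is a single random sign, so it lies within eps of 1 with
# probability 1/2 no matter how small eps is; for the spread-out vector
# the Gaussian heuristic makes that probability far smaller.
rng = np.random.default_rng(5)
p, trials, eps = 100, 50000, 0.01
signs = rng.choice([-1.0, 1.0], size=(trials, p))

sparse = np.zeros(p); sparse[0] = 1.0            # most compressible vector
spread = np.ones(p) / np.sqrt(p)                 # delocalized vector

prob_sparse = np.mean(np.abs(signs @ sparse - 1.0) <= eps)
prob_spread = np.mean(np.abs(signs @ spread - 1.0) <= eps)
print(prob_sparse)   # approx 0.5
print(prob_spread)   # much smaller
```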
So you only get a very tiny exponential gain here, but you don't need the epsilon factor anymore. Because notice that even in the worst case, when the vector is (1, 0, 0, ..., 0), this probability, while no longer of order epsilon, is still at most one half: even with a single sign, you can't concentrate the whole random walk at a single point. So you can squeeze out some number less than one here, and here you just pay a polynomial cost, and that's still good enough when you combine these two.

So yeah, I'm skipping a lot of details, but they are provided in the notes, and I think Nick will go through some of these things in a little bit more detail in the first TA session. But this is the general strategy of the epsilon-net argument. You are trying to bound a sup or an inf, and initially it ranges over an uncountable, or at least very big, set. So you first discretize your parameter space to something smaller, and then you can bound each term separately, paying an entropy cost. And then you just try to control, as best you can, the contribution of each individual point in your parameter space. What normally happens in these arguments is that for most of the points you get a very good bound, but then there are these exceptional points for which you can only get a lousy bound. What often saves you is that in the exceptional case, the entropy is a lot smaller, and so you have to go back and compute the entropy again. So yes, the arguments get quite sophisticated as you push for harder and harder estimates, but this is the general strategy of the epsilon-net method. And we'll be able to adapt this method to handle square matrices in the next lecture.

OK, off the top of my head: maybe if your matrix is very, very rectangular, it might be possible, but I doubt it. If your matrix is almost square, say 0.99 n by n, I don't think so. But it's possible if you're very clever and use really efficient estimates everywhere.
But certainly the closer you get to square matrices, the more delicate this becomes, and you have to be very, very efficient everywhere. If the matrix is very, very eccentric, though, maybe it's possible.