We're going to take M_n to now be an n by n square matrix of plus and minus 1's, and we're interested in the behavior of the smallest singular value. In the rectangular case, the smallest singular value was of size root n with my normalizations. But now it's actually a lot smaller.

You can use the moment method to understand the bulk distribution. If you look at all n of the singular values at the same time, their empirical distribution, once you normalize by root n, converges to what's called the quarter circle law, a special case of the Marchenko-Pastur law. So instead of a semicircle, which is what happens with the eigenvalues of a Hermitian matrix, for a non-Hermitian matrix the singular values (which of course have to be non-negative) are distributed according to this quarter circle in the square case. In the rectangular case there's a gap between 0 and the bulk of the distribution, but in the square case the density extends all the way down to 0. So you expect the n singular values to be spread out roughly uniformly, and this suggests that the smallest singular value should be of size about 1 over root n: the largest one is of size root n, and the singular values distribute with more or less bounded density. This by itself isn't a proof, although if you get a sufficiently good local version of the quarter circle law, you can get close to it. But anyway, this is actually the truth.

By the way, I forgot to mention that the results on rectangular matrices that I mentioned in the first lecture are due to Litvak, Pajor, Rudelson, and Tomczak-Jaegermann. And thanks to the work of Rudelson and Vershynin, we now have a good understanding of the square case too. So we do know that the least singular value here is of size 1 over root n. Let me just state a weak version of this. For any epsilon > 0, if lambda is sufficiently small and n is sufficiently large, then the probability that the least singular value is less than lambda over root n is small, less than epsilon; and there's also a bound in the opposite direction: the probability that the least singular value is much larger than 1 over root n, say larger than 1 over (lambda root n), is also small. So the least singular value is usually of order 1 over root n; it is not much smaller or much bigger than that.

In fact, we can say more now. We now know, in effect, a central limit theorem: there is an asymptotic distribution for the least singular value, which is discussed in the notes, but I think I will not reach it in this lecture. I probably won't be able to reach the upper tail either, so I'm going to focus mostly on the lower tail, which is actually the type of result that is most important for the application to the circular law. What I've stated is a qualitative version of the theorem, in which I don't specify the precise relationship between lambda and epsilon. There is in fact a more precise statement: as you send epsilon to 0, the lower tail actually decays linearly in lambda for quite a long time, until epsilon becomes exponentially small, at which point an extra term kicks in. We actually have a much stronger estimate of this kind, proven by Rudelson and Vershynin.
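As a concrete illustration of the 1 over root n prediction, here is a minimal numpy sketch (an added illustration, not part of the lecture; the matrix sizes and trial counts are arbitrary choices):

```python
# Illustration: for random sign matrices, the least singular value
# should be of order 1/sqrt(n), so sqrt(n) * sigma_min should
# stabilize around a constant as n grows.
import numpy as np

rng = np.random.default_rng(0)

for n in [100, 200, 400, 800]:
    vals = []
    for _ in range(20):
        M = rng.choice([-1.0, 1.0], size=(n, n))
        sigma_min = np.linalg.svd(M, compute_uv=False)[-1]
        vals.append(np.sqrt(n) * sigma_min)
    print(n, np.mean(vals))
```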
But I will just focus on this more qualitative bound here. For this type of result, the epsilon net type argument that works in the rectangular case doesn't really work all that well. You can again try to understand the probability that an individual vector is almost annihilated and take the union over some net, but now the probability of each individual event like this is not all that small, whereas the entropy is still exponentially big, and so a direct application of the epsilon net argument is not all that effective. The problem is that there are just too many x's, and you need a way to cut the entropy down to a more reasonable level. What ends up working is to split your matrix into rows and use some of the rows to work out what x should be. In fact, if you do it correctly, you can make x more or less deterministic using some portion of the rows, and then you figure out how that x interacts with the remaining rows of your matrix.

To illustrate the idea, let me first prove a simpler result, which is due to Komlós in 1967, I think. The quantitative result above was from around 2008; the result I'm going to mention, which is a lot simpler, is from 1967. As a special case, rather than bound the lower tail, you can look at the most extreme case, when the least singular value is actually 0. This is of course weaker than the tail bound; again, we assume n is sufficiently big, depending on epsilon. In fact, if you work out the argument that Komlós gives precisely, you get a more explicit decay rate for this singularity probability. The least singular value being 0 is the same thing as M_n being singular. So you take a random n by n matrix of signs and ask: when is this singular? Komlós showed that this probability goes to 0, which makes sense, because the set of singular matrices is a small subset of the set of all matrices; it's a measure zero set, for instance. So for a continuous random matrix, the probability of being singular would be 0; for a Gaussian matrix, for example, it is 0. But for a discrete matrix it's not 0. Komlós in fact proved the bound O(1 over root n). It's not actually known exactly what the rate is.

So a plus or minus 1 random matrix can actually be singular. For example, if two rows are the same, say the first two rows, then the matrix is singular. So there's certainly a lower bound of about 2 to the minus n; and because there are n choose 2 different pairs of rows, you can upgrade this to about n choose 2 times 2 to the minus n, give or take a small error. And it's actually conjectured that this is basically the truth: the singularity probability should decay like (1/2 + o(1)) to the n, so that the easiest way to make a random sign matrix singular is to make two of the rows the same, or one row the negative of another. This is not known. There have been steady improvements to this bound over the years. The best bound currently is halfway between the two: it is (1 over root 2, plus o(1)) to the n, due to Bourgain, Vu, and Wood, in 2009 or 2010. So there's still a little bit of a gap between the upper and lower bounds.
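A quick Monte Carlo sanity check on these quantities (an added illustration; at such small n the asymptotics have not kicked in, so the comparison with the pair-of-rows heuristic is loose):

```python
# Estimate the singularity probability of an n x n sign matrix and
# compare with the "two equal or opposite rows" heuristic, which
# contributes about n*(n-1)*2^{-n}. Sign matrices have integer
# determinants, so testing the rank is an exact singularity test.
import numpy as np

rng = np.random.default_rng(0)

for n in [4, 6, 8]:
    trials = 50_000
    M = rng.choice([-1, 1], size=(trials, n, n))
    singular = sum(np.linalg.matrix_rank(M[t]) < n for t in range(trials))
    print(n, singular / trials, n * (n - 1) * 2.0 ** (-n))
```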
But we do know that it is exponentially small. This, of course, is consistent with what I said before, with the linear bound: if you send lambda to 0, you should get an exponential bound, and indeed we have this explicit bound in the singular case. For the least singular value in general we don't have as good a constant, but it is still explicitly computable. I'm not going to prove these more difficult results; I just want to focus on the singularity probability.

So let me give a heuristic proof of this claim. As I said, it's all about playing with the rows. We think of this matrix as a matrix of n rows x_1, ..., x_n, where each row is a random vector in the discrete cube {-1, 1}^n, chosen uniformly and independently. So you pick n such vectors and ask for the probability that the least singular value is 0. We don't use the infimum formulation anymore: the least singular value being 0 is the same as asking for the x's to have a linear dependence, so that the matrix is not of full rank. What that means is that one of these row vectors must be a linear combination of the other vectors.

So let's first cheat a little. We know that one of the rows is a linear combination of the others; let's say it's the last one, that x_n is a linear combination of the first n-1. Now, this is not the only possibility, so this is actually a smaller event than singularity; the two are not quite equal. Using the union bound, you can very crudely bound the singularity probability by n times this probability, because one of the n vectors has to be a linear combination of the others, and by symmetry all n of these events have the same probability. So up to a factor of n, this is the truth. That's not a very good loss, particularly since we're only going to be getting a bound of 1 over root n. But never mind that for the moment; let's just assume that the linear dependence arises with the last vector being a linear combination of the first n-1.

So the first n-1 rows span some subspace of dimension at most n-1; call it V_n. It will probably be (n-1)-dimensional, that is, a hyperplane, although it could potentially be lower dimensional than that. We're asking for the probability that the last vector lies in the span of the first n-1 vectors. As I said, typically this span is a hyperplane, so let's pretend for the sake of argument that it is. Then V_n has some normal vector. I would call it n, but n is already in use, so let's call omega the unit normal to V_n. (There might be more than one: it's only defined up to sign, and if V_n is lower dimensional there can be multiple unit normals. Let's just pick one.) Then x_n lying in this hyperplane is the same as x_n being orthogonal to the unit normal, so this probability should basically be the probability that the dot product of x_n with omega is 0. If V_n is a hyperplane, this is actually an equality; if V_n is lower dimensional, it's not quite equal, but again, let's pretend.

Now, omega is a rather ugly vector. This unit normal depends in a complicated way on all of the first n-1 rows. You can work out what it is using Cramer's rule: omega is proportional to the vector whose entries are the (n-1) by (n-1) minors (with alternating signs, the cofactors) of the matrix of the first n-1 rows.
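Here is a short numpy sketch of this reduction (an added illustration; it assumes the first n-1 rows do span a hyperplane, which is the typical case):

```python
# Illustration of the reduction: the normal to the span of the first
# n-1 sign rows has entries proportional to the signed (n-1)x(n-1)
# minors (Cramer's rule), and it is independent of the last row x_n.
import numpy as np

rng = np.random.default_rng(0)
n = 10
A = rng.choice([-1.0, 1.0], size=(n - 1, n))   # first n-1 rows

# Cofactor expansion: omega_j = (-1)^j * det(A with column j deleted).
omega = np.array([(-1.0) ** j * np.linalg.det(np.delete(A, j, axis=1))
                  for j in range(n)])
omega = np.rint(omega)   # cofactors of a sign matrix are integers

# Fresh sign rows, independent of A; the dot products are integers,
# so comparing with 0 is exact.
X = rng.choice([-1.0, 1.0], size=(200_000, n))
print("P(x_n . omega = 0) ~", np.mean(X @ omega == 0))   # about c/sqrt(n)
```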
So it's a bit ugly. It's random, because it depends on all these random rows. But the key point is that it depends only on the first n-1 rows and not on the last row, and the rows are independent. So while omega is random, it is independent of x_n, and this is what makes the argument have a chance of working. Omega is a very messy random vector with a complicated distribution, but you're taking its dot product with something independent of it, and that is fairly easy to understand. Remember, x_n, by contrast, is a very simple random vector: it's just a vector of random signs. So what we have here is a random sign combination of the weights omega_1, ..., omega_n: you generate the numbers omega_1 to omega_n in some complicated fashion, then you take a random walk with those signs, and you ask for the probability that this random walk returns to 0.

Now, the study of when random walks return to 0 is something that has been studied for quite a while. It goes under the name of Littlewood-Offord theory: the study of how random walks, such as the one I just wrote down, concentrate. There are many, many theorems describing or computing this probability. One of the simplest results is due to Erdős (Paul Erdős). Erdős showed the following simple bound: if, say, k of the weights are non-zero, then the probability that the random walk returns to the origin is bounded by a constant over root k. Of course, you need some sort of non-degeneracy condition, because if all the weights were 0, then the walk would be at 0 with probability 1. But the more non-zero terms you have in this random walk, the rarer it gets for it to return to 0, and root k is the right expression. For example, suppose you had k non-zero weights, all equal to 1. Then you're summing plus or minus 1, k times, and that's just a binomial-type distribution with mean 0 and standard deviation root k. So the probability that it returns to 0 (say k is even) should be something like 1 over root k. So this bound is actually the truth.

This is a simple theorem. It's proven combinatorially, using a tool from combinatorics called Sperner's theorem. You can also prove it using Fourier methods: you can write this probability as an integral of a product of a bunch of cosines and estimate that product by hand. But let me just take this theorem as a black box for now; the proof is sketched in the notes.

Anyway, if you believe this bound, you now divide into two cases depending on how omega is expected to behave. As in the previous lecture, there is a compressible case and an incompressible case. (What compressible and incompressible mean changes depending on what problem you're working on; this is not a fixed definition.) In this case, incompressible means lots of non-zero entries: a positive fraction of the coefficients of omega are non-zero.
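For that extreme example the return probability can be computed exactly; a quick check with the standard library (added for illustration):

```python
# Exact return probability when the k non-zero weights are all 1:
# P(sum of k random signs = 0) = binom(k, k/2) / 2^k for even k, and
# Stirling's formula gives the asymptotic sqrt(2 / (pi * k)), matching
# the C / sqrt(k) bound of Erdős.
from math import comb, pi, sqrt

for k in [10, 100, 1000]:
    exact = comb(k, k // 2) / 2 ** k
    print(k, exact, sqrt(2 / (pi * k)))
```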
In the compressible case, only a small number of the entries, let's say at most epsilon n, are non-zero, so your vector is very sparse. Otherwise your vector is non-sparse. If this normal vector is non-sparse, then you can just use this bound of Erdős, and it will give you the right bound, the bound of 1 over root n. You don't use any randomness in the omegas; you just use the randomness of the signs. Of course, for the Erdős bound you can think of the omegas as fixed; they can be deterministic. As long as there are at least k non-zero entries, the Erdős bound holds true. And once you've proven the bound for deterministic omegas, then by Fubini's theorem it's also true for random omegas, as long as the random signs are independent of the random omegas, which is the key point here. So as long as many of the omegas are non-zero, you get the right bound.

Then there is the remaining case, where very few of the omegas are non-zero. But this case happens very rarely. To give one extreme example, the normal vector could potentially be (1, 0, 0, ..., 0), which is a very, very compressible vector. But actually, that can't happen at all: a sign vector always has dot product plus or minus 1 with it. So let's make it (1 over root 2, minus 1 over root 2, 0, ..., 0), which is almost as compressible. For this to be a unit normal to V_n means that this vector is orthogonal to every single one of the n-1 rows, which, because of its form, means that the first column and the second column agree in those rows. But that's really rare; that's exponentially rare. That can only happen with probability something like 2 to the minus n. So this particular compressible vector can only occur with exponentially small probability. And there are not that many compressible vectors lying around; remember, the intuition is that a compressible vector should have low entropy. So when you add that up with the union bound, the compressible contribution should also be small, in fact much smaller than 1 over root n.

OK, so this is the general strategy of Komlós's argument. But I had to cheat a few times to get here. A couple of things. Firstly, when I had the linear dependence, I assumed that x_n was a linear combination of all the other vectors, but the dependence could be something else. Then I also cheated a little to get to this dot product. And finally, I didn't quite rigorously show why the compressible case is rare. So let's start doing that.

Let's first deal with the compressible vectors. Having a compressible normal vector here is like saying that there's some sparse vector which is orthogonal to all the rows, in other words a sparse linear relation among the columns. So we also need to work with the columns as well as the rows. Let's call y_1 to y_n the n columns of this matrix, and let's compute the probability that there's a sparse linear relation between these columns: the probability that M_n annihilates a vector x for some non-zero, epsilon-n-sparse x. So there is some vector x which is mostly zero, with at most epsilon n of its n entries non-zero, and suppose that this vector is annihilated by M_n.
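The orthogonality computation for this almost-compressible vector is easy to check numerically (an added illustration):

```python
# For omega = (1/sqrt(2), -1/sqrt(2), 0, ..., 0) to be orthogonal to
# all n-1 sign rows, the first two entries of every row must agree,
# which happens with probability exactly 2^{-(n-1)}: exponentially rare.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 12, 200_000
rows = rng.choice([-1, 1], size=(trials, n - 1, 2))  # only 2 columns matter
freq = np.mean(np.all(rows[:, :, 0] == rows[:, :, 1], axis=1))
print(freq, 2.0 ** (-(n - 1)))
```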
So this is the same as saying that there is a non-trivial linear relation between the n columns that involves at most epsilon n of the columns. We're going to compute the probability that this matrix has some short linear relation between the columns, where by short I mean involving at most epsilon n columns. This turns out to be quite small; this event will be exponentially small.

The relation involves some number of the y's, and we can focus on the shortest relation: call k the fewest number of y's involved in some linear relation. You can then bound this probability by the sum over all k up to epsilon n of the probability that some k of the y_i's are linearly dependent but no k-1 of them are. Among all the linear relations there is a shortest one, of some length k; so for each k up to epsilon n, we want the probability that some k of these columns are linearly related, with k minimal. For example, if k is 2, this event is just the event that two of the columns are either the same or negatives of each other, and that's a very small event, something like 2 to the minus n times n choose 2.

All right, so how big is this? There are at most epsilon n terms here, so I'm just going to crudely bound the sum by epsilon n times the largest term. This is a very small entropy cost, because I'm going to gain an exponential factor later on, so I don't care about losing this factor. Now, what happens inside? Some k of these columns are going to be dependent, but which k is not specified: we don't know which k columns carry the dependence, and there are n choose k different k-tuples that could carry it. So we just use the union bound again and pay an entropy cost of n choose k, for some k in this range; and to fix the specific k columns which are going to be dependent, by symmetry we may assume it's the first k columns. So by the union bound and symmetry, I can bound this by n choose k (the entropy cost) times the probability that the first k columns are linearly dependent but no k-1 of them are. (You could keep the supremum over k, or keep the sum, if you wanted; it doesn't really matter. I'm now thinking of k as fixed.)

So what does it mean for the first k columns to be linearly dependent? We're now looking at the skinny n by k matrix they form. The first k columns being linearly dependent means this matrix is not of full rank: its rank is less than k, in fact exactly k-1, since no k-1 of the columns are dependent. Now the next thing to do is to switch back to the rows. This skinny matrix has rows that are shorter than the original rows, so I won't call them x's; I'll call them z's. They're just truncated versions of the x's. So if the y's are linearly dependent, then this matrix has rank at most k-1, which means that there are k-1 rows among z_1, ..., z_n which span all the other rows. Let me just check that I'm doing this correctly. No, that's not what I want to do; it's true, but let's do this differently. Having this linear dependence among the columns means that this matrix, let me call it M_k, has a null vector.
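The entropy bookkeeping here is worth seeing in numbers; a back-of-the-envelope sketch (the exponential gain rate c below is an assumed placeholder for illustration, not a value from the lecture):

```python
# Union-bound bookkeeping: for k <= eps*n the entropy cost binom(n, k)
# is roughly 2^{H(eps)*n} (H = binary entropy), while each event will
# turn out to have exponentially small probability, say 2^{-c*n}.
# If H(eps) < c, the total eps*n * binom(n, k) * 2^{-c*n} still
# decays exponentially.
from math import comb, log2

n, eps = 200, 0.05
c = 0.5                             # assumed gain rate (illustrative)
k = int(eps * n)
cost = log2(eps * n * comb(n, k))   # bits spent on the union bound
gain = c * n                        # bits gained from the event probability
print(f"cost ~ {cost:.1f} bits, gain ~ {gain:.1f} bits")
```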
Yes, it has a null vector: there is some non-zero vector which is annihilated by M_k. In fact, that's an equivalent way of saying that the columns are linearly dependent. So we're trying to find the probability that this matrix annihilates some vector. Now, we could again try the union bound over all the different x's, but that's an uncountable set, and that would be terrible. You could try using nets and so forth, but you still pay the cost of the many, many x's that could potentially be annihilated.