So thank you for having me. It's a pleasure to be at ICTP. When I was an undergraduate, I was thinking that I would study theoretical physics. But then I became a wayward child, and I went and did mathematics and some statistics. So it's good to be here. I'm going to tell you today what I think is a simple takeaway for data scientists when looking at the size of problems.

So on the board, I've written down some dimensions. These three numbers are going to appear over and over throughout the talk. There's n, the number of data points, d, the dimension of the data, and k, which — I'm going to do a classification problem, so this is going to be about perfect linear separation — I'm going to want to separate k pieces of the data from the remaining n minus k. And k is always going to be the smaller size: the number of entries in the smaller class that I'm wishing to classify. And just to give you an idea, if you were thinking about MNIST, what would these numbers look like? For MNIST, of course, you have a training set, which would be 50,000. The images are of size 28 squared, so d would be 784, and k would be 5,000, because there are 10 digits and you want to separate out one class from the rest. And in this talk, one of the things that we'll see is that for problems like this, some of the phenomena that you see, such as being able to drive the training error to zero reliably, are not a surprise. And this will give us some pointers to what we would require if we wanted to find problems for which classifying is harder — problems for which a separation is, in some sense, statistically significant.

So we're going to talk about the statistical significance of perfect linear separation. It would be nice to be talking about deep learning. This is vaguely connected, but it's not deep learning. But in essentially all the deep learning problems that you'll see — not all of them, but many of them — one of the key features is some final task of doing some separation. So at the end of your deep net, you're wanting to do some linear classification, and the work in the deep net is the preprocessing you do before you apply that. We're not going to talk about that preprocessing. We're going to see that it turns out just knowing the dimensions of the data will tell us something interesting, regardless of how we're going to preprocess the data.

Now, in particular, what we're going to look at is this: we're going to quantify, based upon these problem dimensions, when it would be easy to separate and to classify random data. Now, we expect that when we do classification tasks, the reason we can do it is because there's some structure in the data. And when we find this linear separating hyperplane, and we say, ah, this is of one class, and the other is of the other class, we think that the reason why we've been able to do this is because of structure in the data. We're going to show that up to a certain point, based upon the dimensions, you could actually do this for random data, which has no structure at all. And that would tell you that if you don't have the data dimensions appropriately compared to each other, then if you do find a separation, it might not be telling you that there is structure at all. It might just be happenstance, and you would never know whether there is structure.
On the other hand, if k, the number of points in the smaller class, is large enough — and we'll give formulas and expressions for this — then if you had random data, you would not be able to classify it. You would not be able to separate it linearly. And as a result, if you can separate it, that gives you reason to believe that the separation is there because the data has some structure, and it's that structure that's allowing you to do the separation task, because it wouldn't work for random data.

OK. So here's our data matrix: vectors of length d, and we have n of them. We're going to have class labels, 0 or 1, and we're going to just ask whether we can do a perfect linear separation, and this question of statistical significance.

Now I'm going to begin with a data set which, for those of you who work in data science, will seem very strange. When Dave showed me this data set, I thought it was very strange. But it turns out that this is a data set that statisticians know and love. Here, I'm not really a statistician, but if you were an undergraduate in statistics, you might well have run into this data set. This is called the car data set. Donoho and a collaborator of his, back when they were PhD students, came up with this data set. They combed through magazines, and they read off information about cars: how fuel efficient it is, miles per gallon, how many cylinders it has, how big the displacement is, what the horsepower is, the weight of the car, the acceleration, the year, where it was built, these kinds of things. And they made this very small data set. This is the land of small data. n is 406 — there are 406 different kinds of cars. There are eight things that you can record about them. We're not normally going to use all eight, because some of them, like origin, there's just very little there. And we have 392 complete cases; I guess some of them have some missing information. So this is an idea of what this data looks like.

Now, why am I telling you about this seemingly silly data size, this peculiar data problem? It's not even MNIST, let alone ImageNet or something more interesting. Part of the point is we're going to do some quantitative things, and we want to show that they're true when the dimension is very small. They're going to be true when it's big — and it would be not so surprising that they're true when it's big — but you might wonder at what problem size this kicks in. This problem is very small. And if you look at the data — if I plot, say, the miles per gallon versus the cylinders — I see the different cars, different six-cylinder engines with different miles per gallon, the displacement, these kinds of things. And if you look at this data, one of the things that you should take away is that this data does not look Gaussian. It's low dimensional, and it's not Gaussian. Now, I'm going to tell you things about Gaussian data, and then I'm going to tell you things about this data. And the interesting thing we're going to see is that there's not much difference, even though this data in no way looks Gaussian. So again, this is the data, just in a 3D cube. It doesn't really look Gaussian per se. There's some kind of structure there. Fine.

So here's the problem that we're going to solve. We're going to look for a separating hyperplane. So we're going to look for a vector, beta, which is going to be of length d plus 1.
Beta is going to have the linear part, of length d, and then we're going to have the affine shift. And what we're going to look for with this vector — this is going to be our linear separating hyperplane — is that on one side of 0, that's going to be everything in class 1, and on the other side is going to be everything in class 0. And this is a very easy problem that we can solve. To give you an idea of how easy it is — I understood that there might be students here, so we put up a slide of how easy it is. This is the MATLAB software to do it. This isn't sort of the software; this is really the whole software. You type that into MATLAB, have CVX installed, and away it goes. So you just have a constraint that on one side it's 1, and on the other side it's minus 1 — here we did 1 and minus 1 instead of 1 and 0 — and then you normalize the variable in some way so that it doesn't give you a value which is 0.

Now here's the question that we're going to be interested in. If we took the labels and we were to randomly permute them — so we said, OK, we're looking for separating k data points from the rest. Now, if there's some structure there, we might imagine we could do that. But if we randomly permute the labels, then any structure that was built into those labels is now, presumably, gone. And we're going to ask: if we do that, what is the probability that we're going to be able to separate them? If we can separate them, that's troubling, because then we can separate even though there isn't structure.

Now for those of you who are in the deep learning community, you might be familiar with this work by Ben Recht, Zhang, and co-authors on rethinking generalization. And the point there is essentially that you can drive training error to 0 in a certain case: they do experiments where they take a net, they train it on something like MNIST, then they randomly permute the labels, and they train it again, and it works fine. They can train it to fit random labels just as well as non-random labels. And part of why that's the case is because of these numbers, and we'll see that in just a few moments, OK? So we'll give some explanation of the thing that they were seeing.

OK, so this p-value — p of x, given this permutation of y — is this observed feasibility: the probability that the randomly relabelled problem is separable. So this is just a classical permutation test that's commonly used. And as I said on the board here, k is going to be the smaller of the two class sizes. And what we're going to show is that the result for this p-value depends roughly on just these three values, k, d, and n — and in particular on the ratios of these values if the problem size is very large. But we'll look at something extremely small to begin with. And the main takeaways of this paper are two things: we're going to give you the ability to actually compute these quantities, with software that will do it, and we're also going to give you an analysis with the large deviation exponents for what happens in large dimensions.
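To make that concrete, here is a minimal sketch of the feasibility check and the permutation test — in Python with numpy and scipy rather than the MATLAB/CVX on the slide, so the function names, defaults, and solver choice here are illustrative assumptions rather than the actual code:

import numpy as np
from scipy.optimize import linprog

def is_separable(X, y):
    # True if some affine hyperplane w.x + b puts class 1 strictly on one side.
    # As on the slide, we ask for margins of +1 and -1, which also rules out
    # the trivial all-zero solution.
    n, d = X.shape
    s = np.where(y == 1, 1.0, -1.0)            # labels recoded as +/-1
    A = np.hstack([X, np.ones((n, 1))])        # fold the affine shift into the variable
    # feasibility LP: s_i * (A_i @ [w, b]) >= 1  <=>  -s_i * A_i @ [w, b] <= -1
    res = linprog(c=np.zeros(d + 1),
                  A_ub=-s[:, None] * A,
                  b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.success

def permutation_p_value(X, y, trials=200, seed=0):
    # Estimated probability that a random relabelling is perfectly separable.
    rng = np.random.default_rng(seed)
    return float(np.mean([is_separable(X, rng.permutation(y)) for _ in range(trials)]))

With k ones and n minus k zeros in y, this estimated p-value is the quantity the rest of the talk studies.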
OK, now this work builds very much on some earlier work by two people that we'll feature here. I'll show you some pictures and some old papers: Brad Efron and Tom Cover. Efron looked at this question of the various expectations concerning the convex hull of n independent, identically distributed random points in a plane or in space. So in particular, n could be anything and k would be 1 — a single point being a vertex. So let's see what Efron was telling us.

He gave us integrals to calculate, for any n and for d being 2 or 3, the expected number of vertices that you get when you put n points down in 2 or 3 dimensions, where the points are put down according to a Gaussian distribution. Now, this is exactly telling us what this p0 is when k equals 1, for any n, for d being 2 or 3. So the answer is known in these low dimensional cases: you just take the expected number of vertices divided by the number of points, and there you go — there's your expectation.

So Cover and Efron later extended this work, and they considered — of course, you have to have k being less than n over 2, because it's the smaller of the two sizes — but what happens when n is greater than 2d? So here, the number of data points in this case is approximately, not quite 20d, something like 15d. What happens in the case where n is greater than 2d? It turns out that they showed that p is going to be 1. So in this setting of MNIST, this result from back in — I don't see the year, but I think it was in the 60s — would tell you that for this problem, if I were to randomly permute the labels on MNIST and look for a perfect linear separator — don't do any deep learning, just ask if I can find a linear separator — the answer is, yeah, I can. I randomly permute again: yeah, no problem. And that's for random data, let alone for MNIST. Now, if you run it through a deep net and then you try this again, will it happen? Yeah, yeah, no problem. Just because of the problem size, because the amount of data in the smaller class is too small compared to the dimension. If you wanted it to be hard, you'd have to crank up the dimension of the data, to go to something like ImageNet, where actually you would have a large data size. So we're going to consider the case where it's not trivial, and we're going to find regions where, if you do find a linear separating hyperplane, you would expect that that was because of some underlying geometry in the data.

Now, this is connected to some prior work. I just mentioned the work of Efron and of Cover and Efron. There is some related work that Dave and I did years ago for compressed sensing, and some more recent work that Emmanuel Candès and Pragya Sur did recently for logistic regression, where coincidentally they get the same phase transition. Yep — then you should definitely ask a question. You will be able to, yeah — it is possible to separate. Because what's happening — think of the extreme case. Imagine what is the easiest case to separate. That would be when the smaller class is very small. So imagine you have some data set and k is one. Is it easy to separate one point? Generally speaking, yes. What if you let k grow? Eventually it will be hard to separate, right? And the question is, how big does it need to be compared to the whole number of data, and critically also the dimension of the data? So here — I shouldn't have compared this and this; I should have compared, yeah, these two — here the n is significantly bigger than 2d. This is one of the experiments in this generalization paper by Recht. Is it a statement that...? Almost — that's almost the same statement: that if you do a random permutation, there is a linear separating hyperplane. Now, it's almost the same thing as this work. So this work is a bit stronger.
So with this work, what you would say is: there is a transition, there is a region based upon these problem sizes — and I could give a separate talk on that — where if I take random data points, then any k points will be separable from the rest. And in fact, all subsets of size k are separable. And it turns out that that, which seems like it's the same problem, is not quite the same problem, and that phase transition is lower.

Yes? So the question is, it seems intuitively incorrect that as I keep cranking up n, it gets easier. Yes. And then I'm considering all the others as, like — so I'm classifying — so I'm still stuck with what we are talking about here. So in your example, we only had two labels, and here you are taking, like, the other 45,000 to be one of the labels and 5,000 to be the other label. Yes. OK. So can I separate the digit 5 from the other digits — everything that's not a 5? Not a problem. I guess my intuition is about the case where k would be just n over 2. Maybe there's the — that would be exactly balanced, which would be the case where it's the hardest to separate. Yes. So in that case, I mean, my knowledge of this Cover work tells me that n needs to be smaller than twice d to be able to separate the random labels linearly. So I just don't get the direction of the inequality. Well, let me show you some examples, and then we'll see, and then we can go back to that.

So here's the main takeaway, really, which is a formula: if x is drawn Gaussian, then the probability that k data points are separated from the remaining n minus k data points has a formula, and this is exactly the formula. And one can calculate these things. For large problem sizes, you can't calculate them very reasonably; you have to do some large deviation asymptotics, and we'll give you some exponents and some probability bounds, yadda yadda. But this is a formula, and it's in some ways surprising that it isn't exactly the formula from these earlier compressed sensing results on the number of k-dimensional faces of a polytope. It's not exactly that. That's what we thought the answer was. In fact, we were sure — we started a paper: this tells you something. And then we ran the experiment, and it didn't agree. And then we had to figure out why it wasn't exactly that. But it turns out that all the participants in this party are these internal and external angles, and this double sum looks very, very much like what you would see in those results. But it isn't exactly the same. But there is a bizarre coincidence — I'm not sure if it's a coincidence — that I will tell you later in the talk.

OK. So we're going to consider this car data set where n is equal to 14 and d is equal to 7. Now, so we're going to see some concrete examples from that car data set. Sorry — 14, I guess this would be n. And then we're going to consider k up to n over 2. And what we're going to do is, because we now have this 7 by 7 grid of possible values of k and d for n fixed, we're going to actually just empirically try all possible combinations. Because the problem is so small, for that data set we can just try everything: we look for a separating hyperplane where we've randomly chosen one dimension of the data, and then 2, and then 3, and then 4, and then 5, and we go up. And what we see is that here, where the dimension of the data is high, the probability of getting a linear separation is very close to 1. Alternatively, if I were to take a dimension and fix it, say d, here it's easy to separate, and then as I grow k, it becomes harder and harder to separate. So this is what it is. This is specific for a particular data set.
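That kind of sweep is easy to mimic on synthetic data. Here is a rough sketch — on Gaussian data rather than the car data, reusing the is_separable helper sketched earlier, with sizes and trial counts that are illustrative guesses rather than the ones behind the slide:

import numpy as np

def empirical_phase_diagram(n=14, d_max=7, k_max=7, trials=100, seed=1):
    # For each (d, k) cell, estimate the probability that a labelling with k ones
    # and n - k zeros of n random Gaussian points in R^d is perfectly separable.
    # Uses is_separable from the sketch above.
    rng = np.random.default_rng(seed)
    P = np.zeros((d_max, k_max))
    for d in range(1, d_max + 1):
        for k in range(1, k_max + 1):
            hits = 0
            for _ in range(trials):
                X = rng.standard_normal((n, d))   # fresh random data every trial
                y = np.zeros(n)
                y[:k] = 1                         # k points in the smaller class
                hits += is_separable(X, y)
            P[d - 1, k - 1] = hits / trials
    return P   # rows indexed by d, columns by k

Plotted as a heat map over (d, k), this should reproduce the qualitative picture just described: probability near 1 when d is large, falling off as k grows with d fixed.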
Now, here's the thing that's interesting, at this tiny problem size. There's the original — this is the actual car data set. This is what the theory tells you from the formula, if you compute those quantities, and you can compute them for problems of this size. And this is the difference — these scores are the difference, telling you it's more or less the same. OK. Now, if you scale up the problem size, then this would be the empirical phase transition from trying some problems, and this is the thing that you can go and calculate, and they're essentially indistinguishable.

But the key here — coming back to the question that you had in mind, which would be: what's happening out in this region, where this p would always be 1? Why is this happening? For the physics audience, the simplest analogy for why it's happening that I can give you is from these earlier results; these are slightly more complicated. The earlier results would say: imagine I'm in dimension 100, and I now start throwing down data points in dimension 100 from a Gaussian distribution. We know that they're going to land essentially on a sphere. Now, it turns out that if I keep throwing these points down, and I look at, let's say, the ratio of the number of data points to the dimension — let's say it's 2 to 1, so I'm in 100 dimensions and I throw down 200 — it turns out we can calculate exactly a size k, the size of a smaller set of points — so we take k out of these 200 data points, say 10 — such that for any k points that I pick, with high probability they will be on the convex hull of this object, which means they're on the outside. And then I can find a linear separating hyperplane. It's a property of high-dimensional geometry. If you throw down too many points, then this won't happen. But the number of points that you have to throw for this not to occur has to be exponential in the dimension.

It's like if I draw Gaussian points — I draw n Gaussian points — and I ask, what is the angle between them? Let's say I draw n Gaussian points and I look at all the pairwise angles. They will be essentially right angles; the deviation will be in expectation like 1 over the square root of the length of the vector. But if I draw exponentially many of them, then I'll get a covering of the whole space, and then it turns out, if I were to ask what the minimum angle between them was, it would no longer be essentially a right angle. But until I get exponentially many, they will all be essentially orthogonal. Because the size of the high-dimensional space is like 2 to the dimension, just looking at the coordinates. So unless you get exponentially many points, you haven't really filled out the space. That's kind of the geometric intuition. Did I, vaguely — are you more convinced?

Do I understand your plot correctly, that d needs to be at least n over 2 to be able to separate? So if d is — so this is at 60 — if d is more than 60, if you're out here, then the probability of getting a linear separation is going to be 1, or essentially 1. So d needs to be large, and n needs to be small, right? So your MNIST is in the case where you cannot separate. Right, right, right. Have I flipped a sign on that prior slide? I think on the Cover slide you did, yeah. d needs to be bigger than n over 2, not the other way around. If d is bigger than n over 2, then what happens? We separate — then it's trivial to separate. Yes, yes. Yes, so I think you had it the other way round on the Cover slide. And MNIST — we cannot separate it. You said correctly that d would need to be larger in order to be able to separate it, but as it is, we cannot separate randomly permuted labels on MNIST. But is this not what was observed in Ben's paper? Because there they were not linearly separating, they were separating with a deep net, and that's very different. OK, thank you.
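For reference, the classical Cover counting bound makes that n-versus-2d threshold easy to check numerically. It is the formula for a uniformly random labelling of points in general position — related to, but not the same as, the fixed-k probability studied in the talk — and the examples below are illustrative, not taken from the slides:

from math import comb

def cover_fraction(n, d):
    # Fraction of all 2**n labellings of n points in general position in R^d
    # that an affine hyperplane can realize: 2**(1-n) * sum_{i=0}^{d} C(n-1, i).
    # It crosses 1/2 at n = 2(d + 1), which is the "n versus 2d" rule of thumb.
    return sum(comb(n - 1, i) for i in range(d + 1)) / 2 ** (n - 1)

print(cover_fraction(60, 40))   # n well below 2(d + 1): close to 1
print(cover_fraction(60, 20))   # n well above 2(d + 1): close to 0
# For MNIST-sized numbers (n = 50,000, d = 784), n is far above 2(d + 1),
# so this fraction is essentially zero: random labels are not linearly separable.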
OK, so I mentioned these angles before. There's the internal angle, which is kind of what you would imagine. You take your geometric object, and there are two faces in play: one of them is the object that you're considering, and the other one is a lower dimensional face of that object. So if we go back to this formula, what you're considering here is the dimensions of these faces, u and v, which add up to d plus 1 minus 2s, and then you sum over them, and this is looking at the angle between one face and another. So this is the dimension of that face, u and v, where we're going from one simplex to another simplex. So we look at this convex polytope, we take a k-dimensional face, we put a cone on it — we look at all directions pointing inward from that face into the object — and we look at the fractional volume of that cone. The external angle is a little less straightforward; we're not going to go into it. There are formulas for these things. They are computable, but not easy to compute for large problem sizes, for large dimensions, unless you do some asymptotic approximation. So this leads us back to this earlier work by Efron, where he gave us formulas for how to calculate these things — this is an example. They tend to have factorials involved, which is why it tends to be difficult to compute them for large problem sizes. You can do it, if you push yourself, maybe up to about 60, but going beyond that gets to be a problem.

OK, so if you take this problem and you consider large problem sizes, and you don't want to try to calculate these individual quantities — which is of course something you wouldn't be doing for large problem sizes — then it turns out that everything is going to depend upon the ratio between k and n. So kappa now is this fraction: what fraction of the data is in the smaller class? And delta is telling you about the dimension compared to n. And what we end up with is a function, a smooth function, which goes from 0 to one half — so delta goes from 0 to one half. And in the case of this graph, I don't think this is 0; I think this is 1. I'll show you on the next slide — I think that's incorrect, I think this is 1. So what we get are two large deviation exponents. We're going to see a plot where the bound on the probability is e to the minus n times psi. So the minus sign means that if we want the probability to go to 0, we want to be in this case; in this case, we would have the probability going to 0, and in the other case, above this other threshold, we'd have the probability going to 1. And we can compute what these curves are. So this would give you, again, this case where n was 60. But then, overlaid on it here is this curve that goes — no, OK: the k is given as a fraction of n, so k is scaled up to one half, which is correct.
So my remark over here, where I wasn't sure if this was a half or 1: we've rescaled the problem slightly. We used to rescale it so that k was depending on d, not on n, and then it would go to 1. But now we're doing it so that it goes as a function of n, so it does go up to a half. So this is the curve where you go to a half and to a half. And this is the curve here; this is another curve related to a different phenomenon — like a higher probability phenomenon — but we're not going to go into that. So this curve is the 50% level curve of the phase transition, where the large deviation exponent psi is equal to 0; psi equal to 0 defines this curve here. Now, this earlier result in compressed sensing would give you a curve which would be a bit lower than this. Turns out — OK.

Now here's the strange coincidence. It is a strange coincidence: we don't have a clear understanding of why this is true. We thought about it quite a bit, to see if there was an obvious reason why the answer is what it is. But it turns out that this expression is exactly the phase transition for randomly projecting a simplex — for the expected number of faces of a simplex — but with a scaling where the internal dimension is scaled by two, and then you divide the phase transition by 2. And we don't have a geometric understanding as to why this is the case, but it's not approximately true; it's exactly true. So if you look at the large deviation exponent, you can see that this thing is actually the same thing. Maybe it's a coincidence — it seems like a strange coincidence — although the formulas all involve the same angles, so maybe that's the reason. This same curve was also discovered recently by Candès for the well-definedness of the maximum likelihood estimator in binary logistic regression. And again, if you're below this curve, then the MLE would not be well-defined, just as this linear separation would be very probable, even for random data.

So this is the large deviation exponent. And on that plot before, where we had this ratio — rho and kappa; so let's go back — rho being, sorry, I guess it's delta and kappa, sorry about that: the dimension as a fraction of n, and k as a fraction of n. And if we go here, we see what this psi looks like. And the curve that we saw in the prior plots is the maximum of this curve. Because of course, what you're looking at is the sum as you go along this line. So for this curve, when you consider whether you have the probability here, you would have that it's below; and then past this, you would not be going down, but you would be taking the sum of this and everything past it. So that shows you why it's going to be large.

Good. So the elements of the proof: well, we just write out what the large deviation exponents are for this thing by taking a log and then simplifying, pulling out polynomial factors. So this is a formula for what this thing is. It has a closed form expression; you can compute it. It doesn't have an expression that is trivial to evaluate, but it can be evaluated, and we have software that will do that for you. So this is related to this linear feasibility problem. But I think that's probably — I would say that's probably enough for now. Anything past that gets very detailed. All right. Thank you very much.