So firstly, thanks very much to the organizers for inviting me. It's a real privilege to have the opportunity to present here, and thanks to everyone for coming. Today I'll be talking about statistical and computational, or algorithmic, perspectives on randomized sketching algorithms. In particular, I'll be trying to unify, or at least explain, two different perspectives on sketching, which I'll introduce. The broad focus of this workshop is that when you have large-scale data problems in machine learning, statistics, optimization or computer science, you're looking at either the statistical or the computational perspective, or both, and that's a real issue in many modern large-scale data problems. In an ideal world, what you really want is something that is computationally feasible, that you can run on whatever computational resources you have available, but that also has good performance under some statistical or other metric you might be interested in. There are many different ways people have tried to do that, and we've had a few different approaches in this workshop and will hear more today, but one approach that's received a fair amount of attention in the last five to ten years is this notion of sketching, and I'll define precisely what that means in the context of the problem I'll talk about. What sketching refers to is this: if you have very large-scale data, you sketch or project it onto something lower dimensional, and then run whatever algorithm you were going to run on the large data set on this sketched, smaller data set instead. That's one way to reduce computation, because you've reduced the size of your data, and ideally you have not lost too much from a statistical or accuracy perspective. How you do the sketching depends on the problem, and this is a very sparse sampling of the sketching literature (there has been lots and lots of other work), but here are a few relevant examples. If you do things like randomized projections or sub-sampling, there are results showing that, in some sense, you gain computationally because you've reduced the data size and run your algorithm on a much smaller data set, while also potentially not losing too much in terms of accuracy. This is broadly based on ideas like the Johnson-Lindenstrauss lemma for dimension reduction and concentration of measure. In particular for ordinary least squares, which is what I'll be talking about today, there are results showing that you potentially don't lose much. There's also some work related to the CUR decomposition, which is about computing SVD-type decompositions, or other low-dimensional reduction techniques, but using random projections to speed up the computation.
And there's also been a line of work on iterative sketching, for instance if you want to solve linear systems using spectral sparsification. So these are just a couple of examples where this idea of sketching, or reducing the data size, potentially allows us to do faster computation without losing too much. I guess just to explain the genesis of this problem: there are really two perspectives on it. A lot of the work on sketching prior to this has largely been in the applied maths and theoretical computer science literature, and I'm describing that as the algorithmic or computational perspective on sketching. People like Michael Mahoney, Dan Spielman and others have worked on this a lot. The general principle is that by doing sketching you still get close to optimal worst-case error bounds (and I'll define what that means for this particular problem), so you're gaining a lot computationally and basically not losing much from an accuracy perspective, and there are results that support this. But I originally trained in statistics and I'm used to thinking about things in a statistical way, and when I think about the sketching methods being talked about in this literature, in some sense you're throwing away a lot of your data. I'll talk about exactly how much for the problem I'll be focusing on, but it's a lot, more than half your data, basically. And in some sense what these results are saying is that you're not losing much by throwing away a lot of your data. That didn't really make much sense to me, because obviously if you have more data you should do better, and if you keep less than half your data you should be losing something. So this was something that, when I started working on this problem in early 2013, I didn't really understand: why you were gaining something, or why you were not losing too much. The goal of this talk is basically to unify these two perspectives. Essentially, I ran into Michael at a conference at the beginning of 2013, and having read some of this work on sketching, I didn't really understand how you were not losing too much. We'll see that the answer, in the context of this ordinary least squares or regression-type problem, is simply that we were coming at it from two different perspectives. We had a lot of long discussions and arguments about this, and eventually we decided to resolve it by writing a paper together, which I'll talk about. So, just to be concrete about the problem I'm looking at: this is about the simplest problem you can think of, but it's still very widely used. It's doing least squares for large-scale problems, or analogously solving large-scale linear systems. If we think about ordinary least squares, you have two things, your data: your X and your Y.
So X is an n by p matrix and Y is an n-dimensional vector, and to make things concrete, we're assuming that both n and p are extremely large, but n is much, much larger than p. For simplicity, and without loss of generality, we're assuming that the rank of X is exactly p; you can easily adapt this if you have lower-rank structure, but we'll assume rank p. And we all know how to solve the ordinary least squares problem; it can be done, but the computational cost is of order n p squared, which is perfectly reasonable in many cases, but when you're in the extremely large-scale data setting it is potentially not feasible, particularly if you're doing least squares iteratively, in which case you would ideally want to reduce this computation. And although this is a very simple problem that in some sense you would think is more or less solved, as has been described in earlier talks (Alex touched upon this and Vinay mentioned it), a lot of problems reduce to solving a least squares problem, so any computational gains you can get there are certainly beneficial. The idea of doing sketching in the least squares sense is, again, pretty simple. You apply some sketching matrix S, where S is an r by n matrix with r much, much smaller than n, and then you do your least squares, but now on the sketched data. So you have SX and SY, where SX is r by p and SY is just an r-dimensional vector, and you get an estimator based on minimizing this sketched least squares problem. The computational complexity of this is of order r p squared, plus whatever computation was involved in getting the sketch. So if you can come up with a computationally efficient way to get the sketch, then you can potentially save a lot of computation by doing this, and I'll talk a little bit later about what reasonable choices of sketching matrices are. The idea is that you want beta S to be a reasonably good approximation to beta OLS under some metric, and I'll talk about how the choice of metric is really important. Okay, so this is the prior work basically saying that sketching works really well. This is work by Drineas and Mahoney, around 2011, and the message is that by applying sketching to least squares you gain a lot computationally and don't lose much from an accuracy perspective. What you assume is that Y and X are fixed, and they can be whatever they are, and a reasonable way of measuring how much you lose by doing the sketching is to compare these residual mean squared errors: Y minus X beta S squared, divided by the corresponding quantity for the original OLS estimator. And here, to make it worst case, I'm taking a supremum over Y.
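Before getting further into the metrics, here is a minimal sketch of the sketch-and-solve estimator itself, just to fix notation. This is my own illustration, not code from the paper: the dimensions n, p, r are arbitrary choices, and I use a plain Gaussian sketching matrix for simplicity. Note that forming SX with a dense Gaussian S costs order n p r, so in practice you would use a structured or sampling-based sketch to make that step cheap.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 10_000, 20, 500          # n >> p, and p < r << n (illustrative values)

X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Full ordinary least squares: order n p^2.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gaussian sketching matrix S (r x n), scaled so that E[S^T S] = I_n.
# (Dense Gaussian sketch used only for illustration; forming S @ X is itself
# order n p r here, so a fast sketch would be used in practice.)
S = rng.standard_normal((r, n)) / np.sqrt(r)

# Sketched least squares on (SX, Sy): order r p^2 once the sketch is formed.
beta_s, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)

print("||beta_s - beta_ols|| =", np.linalg.norm(beta_s - beta_ols))
```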
You can, equivalently, ask a question you might have: why not take a supremum over X as well? For most of the results I talk about you can do that, but there's going to be one result where it gets a little bit complicated, so essentially you can think of it as a worst-case setup. And what the results show is that this quantity is only a factor of one plus delta worse than the original least squares estimator. That's the sense in which you're not losing anything, and this was shown for various sketching schemes, which I'll illustrate shortly, like random projections or sub-sampling-type sketching schemes. So this was the computational, algorithmic perspective, which was saying that you're not really losing anything by doing sketching. All right, so again, going back to the original goal, this was something I didn't get when I saw these results. How can you throw away a lot of your data? R is much, much smaller than n, so you're keeping only an r out of n fraction of the samples, and you would think you must be losing something; and yet, according to these results, you're not really losing anything: up to constants you were doing just as well as before. So this was something I wasn't really understanding. A question: was there any condition on n and p in the previous result? Yes, the only condition you need is that r has to be bigger than about p log p. That's the only condition, r bigger than p log p, so you're getting enough of the rows that you capture most of the structure. Okay. And so the way I'm used to thinking about things, the statistical lens through which I typically view this, is that when you're trying to solve least squares, it comes from some underlying generative model, like a typical Gaussian linear model. So you have the simple Gaussian linear model, where you have some true parameter beta, and then you add some noise, which we're assuming for simplicity is isotropic (that's a fairly easy assumption to relax), so it's zero-mean, isotropic noise. Then there are a number of different metrics you can choose; I'm going to talk about two particular metrics here. In statistics, you're typically used to talking about efficiency when comparing estimators, and you can think of your sketched least squares solution as an estimator, and your original OLS solution as an estimator too. The first metric, which looks quite similar to the original one, is the so-called residual efficiency, which is just comparing the residuals; the difference is that you're now taking an expectation, and that expectation is taken over the noise epsilon. The second one, which is often what statisticians are more interested in, is the so-called prediction efficiency. We're used to talking about prediction error, which is like X beta hat minus X beta, where beta is the true parameter.
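In symbols, this is my reading of the model and the two statistical criteria, with beta hat S the sketched estimator and beta hat OLS the full least squares estimator; the expectations are over the noise epsilon (and, for a randomized sketch, over S as well):

```latex
% Generative model (isotropic noise for simplicity):
y = X\beta + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0, \quad \operatorname{Cov}(\varepsilon) = \sigma^2 I_n

% Residual efficiency: expected residuals of the sketched vs. the full estimator
\mathrm{RE}(\hat\beta_S) \;=\; \frac{\mathbb{E}\,\lVert y - X\hat\beta_S \rVert_2^2}{\mathbb{E}\,\lVert y - X\hat\beta_{\mathrm{OLS}} \rVert_2^2}

% Prediction efficiency: expected in-sample prediction error against the true beta
\mathrm{PE}(\hat\beta_S) \;=\; \frac{\mathbb{E}\,\lVert X\hat\beta_S - X\beta \rVert_2^2}{\mathbb{E}\,\lVert X\hat\beta_{\mathrm{OLS}} - X\beta \rVert_2^2}
```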
And that's the second metric we're going to look at, and we'll see that there's a very important distinction between these two metrics. Ideally you want to find sketching schemes that do well in terms of all three of these criteria, but these two are really statistical criteria, while the original one, which I presented on the previous slide, is an algorithmic or computational criterion. Okay, so now I'll talk a little bit about which sketching schemes are actually typically used in practice. There are a number that are widely used; essentially there are two classes of sketching that are very common, and they're both based on randomization. There are also some deterministic sketching schemes that have been suggested, but computationally, randomized sketching schemes generally work well. One class is based on sampling, and in particular I'll talk shortly about what leverage score sampling is: trying to sample rows (or columns) at random, or doing something more clever than that. The other approach that's widely used is random projection: you can imagine S being a random Gaussian or Bernoulli matrix and projecting with that, and there's also the Hadamard projection, which is a different kind of projection. But essentially there are two classes of randomized sketching that are widely used, and I'll be focusing on these: either sampling or projection. Okay, so I mentioned this notion of leverage score sampling. This is something that, particularly through the work of Michael Mahoney and others, has gained a lot of traction in the last five years or so in the theoretical computer science and machine learning literature. So what is a leverage score? When you're thinking about sampling your X's and Y's, the natural thing to do would be just to sample uniformly. That works okay, but, and we'll see this with some simulations later, if you can do the sampling in a more clever or efficient way that picks the samples that help most in terms of mean squared error, you can potentially do more with fewer samples. One way to do this is to pick the samples that have what is known as high leverage score. To explain that, think about the singular value decomposition of X, so X equals U Sigma V transpose. If you take U, the matrix of left singular vectors, and look at the squared two-norms of its rows, these are the so-called leverage scores. They are numbers between 0 and 1, and the n leverage scores add up to p. The idea of leverage score sampling is, roughly, to favour the rows with the highest leverage scores. And it's been shown that the points with high leverage are, in some sense, the best ones for reducing the mean squared error we've talked about, and we'll see in simulations that this tends to work pretty well.
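As a rough illustration of the definition and the sampling step, here is a minimal sketch. The helper names are mine, the parameter values are arbitrary, it goes through the exact SVD (the fast approximate route is discussed below), and it samples rows with probability proportional to their leverage scores, which is the randomized variant of "favouring the high-leverage rows".

```python
import numpy as np

def leverage_scores(X):
    """Exact leverage scores: squared row norms of U from the thin SVD X = U S V^T.

    They lie in [0, 1] and sum to p (the rank of X).
    """
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return np.sum(U**2, axis=1)

def leverage_sample(X, y, r, rng):
    """Sample r rows (with replacement) with probability proportional to leverage."""
    scores = leverage_scores(X)
    probs = scores / scores.sum()          # normalize: the raw scores sum to p
    idx = rng.choice(X.shape[0], size=r, replace=True, p=probs)
    return X[idx], y[idx]

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 50))
y = rng.standard_normal(1000)

Xs, ys = leverage_sample(X, y, r=200, rng=rng)
beta_s, *_ = np.linalg.lstsq(Xs, ys, rcond=None)   # sketched least squares fit
```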
And on another note, this is a quick diversion from what I'm talking about, but it's related: in general this concept of leverage scores, and of points with high leverage, actually goes back to work in the robust statistics community in the 50s and 60s. In fact, if you present this idea of sampling points with high leverage score to a statistics audience, they'll often say it's really not a smart thing to do at all, and the reason is that points with high leverage typically reflect points that may be outliers. So here it's being proposed as a scheme to include the points that reduce your mean squared error, but a statistician might say you're picking exactly those points because they're outliers. This again illustrates a difference in perspective between how a statistician might look at high-leverage points and how someone in numerical linear algebra or computer science might. In some sense, the high-leverage points are best for reducing mean squared error because, if you view your least squares problem as having no outliers and no model mis-specification, those points are the ones that contain the most information, and they influence your ordinary least squares estimate the most. But if what you're actually fitting is the fact that they're outliers or noise, which is what people in statistics typically worry about, then you're essentially biasing your estimator in completely the wrong way. So that's a side note on another difference in perspective. For now I'm focusing on the case where your model is right, which is obviously a very strong assumption, and we're picking these high-leverage points because they reduce the overall mean squared error. Okay, any questions at this point? A natural question you might have is this: our original goal was to reduce the computation of ordinary least squares, and to get these leverage scores, which we're going to use for sampling, we had to compute an SVD, which is the same order of computation as ordinary least squares. So why are we doing this, and are we gaining anything? There is other work, by Drineas, Mahoney and others, showing that you can compute approximate leverage scores (not the exact ones, but approximations) fairly efficiently, in time much less than the order n p squared needed for the exact SVD. So you can potentially do this step efficiently as well, and computing the leverage scores isn't something as hard as solving the original problem. Can you do this sampling-based sketching in an online fashion, or do you have to store all your data and then pick what you're going to keep? Yeah, that's a good question. I don't know of any scheme that allows you to do it online, because to compute the leverage scores... For uniform sampling you can. Sorry? If you just do uniform sampling, you can. Yeah, absolutely, I agree.
So that's one of the potential weaknesses: if you wanted to do this in an online way, I don't think there's been any work on that, and I don't know how you'd do it, because you essentially need to know all of the rows. Okay, so I'm building up to the goal of explaining these two different perspectives, and just to build up to that, there's a slide that relates how these three metrics behave to properties of a simple projection matrix. Typically, when you're solving ordinary least squares (recall we had the SVD of X on the previous slide), the least squares fit is just a projection: it projects Y onto the column space of X. If we do sketching, then rather than projecting onto the column space of X directly, you're doing a projection using an oblique projection matrix. It is a projection matrix in the sense that if you square it you get back what you had, but it's oblique in the sense that it's not symmetric: if you take the transpose, it's not the same thing. And you can relate all of these quantities to this oblique projection matrix. The main thing to take away is that, if you look at the last two, statistical, metrics, you can see that, up to constants, the prediction efficiency scales like n over p times the residual efficiency. That, in a nutshell, explains the difference between the two perspectives, and why you can do very well in terms of worst-case error and residual efficiency but not so well in terms of prediction efficiency, which is what statisticians would typically be most interested in. So this is, I guess, the main result, and probably the one thing you should take away from this talk. Let's look at a comparison of these results; remember that p is the low dimension of your least squares problem, r is the number of rows you've sketched down to, and n is the big dimension, the number of samples you have. These are three particular sketching schemes: SR stands for leverage score sampling, with the correction for the bias, SGP is a Gaussian or sub-Gaussian type projection, and HAD is a Hadamard projection. For all of these sketching schemes you see that, for the worst-case and residual errors, you get a one plus p over r term. So as long as r is a bit bigger than p (p log p, relating to what Simone asked), you get performance guarantees like the ones Drineas and Mahoney were getting earlier. But if you look at the prediction efficiency, you get something very different, which is consistent with my initial intuition: you need r to be of order n to even get close in terms of performance according to this metric, which is the one statisticians really care about. So in some sense, sketching isn't really helping you in any reasonable way if you're looking at this prediction efficiency metric.
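A quick Monte Carlo sanity check of that scaling, under the Gaussian linear model with a plain Gaussian sketch. The parameter choices here are mine and arbitrary; this is only meant to illustrate the one plus p over r versus n over r gap, not to reproduce the paper's experiments, and it averages realized error ratios rather than forming the exact ratio of expectations.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, r, sigma, n_rep = 2000, 20, 200, 1.0, 200

X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)

res_ratio, pred_ratio = [], []
for _ in range(n_rep):
    y = X @ beta + sigma * rng.standard_normal(n)
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

    S = rng.standard_normal((r, n)) / np.sqrt(r)          # Gaussian sketch
    b_s, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)

    # Residual-type comparison vs. prediction-error comparison.
    res_ratio.append(np.sum((y - X @ b_s) ** 2) / np.sum((y - X @ b_ols) ** 2))
    pred_ratio.append(np.sum((X @ (b_s - beta)) ** 2) / np.sum((X @ (b_ols - beta)) ** 2))

print("residual efficiency   ~", np.mean(res_ratio), " (compare 1 + p/r =", 1 + p / r, ")")
print("prediction efficiency ~", np.mean(pred_ratio), " (compare n/r =", n / r, ")")
```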
So even though it works well from a worst-case perspective, it's not working well at all in terms of prediction efficiency. But do you have lower bounds? That's a good question; I'm getting to that on the last slide. Okay. So yes, as was pointed out, these are just upper bounds for these sketching schemes; are there lower bounds? There are, and I'll talk about those. Is it possible to remove the expectations in the ratios, so you get high-probability results? I suspect it probably is; we haven't done that, it made the analysis easier not to, but I think you can probably get away with it. Do you think it could change the last bound, because it's a ratio of expectations? You mean, would it change to something better than n over r? I'll talk about lower bounds at the end that suggest you can't really do better than that, unless you have very specialized cases, which I'll get to on the next slide, but in general you can't really do better than that. Again, this next result is just a different upper bound, and it's for a particular sketching scheme that is not traditionally used in the sketching literature, but it tries to take advantage of the fact that you potentially have a leverage score distribution you can exploit, and it also supports some of the simulation results I'll show. It shows that you can potentially do better in terms of the statistical metrics if you do this leverage score sketching without the correction for bias. The K here is typically going to be less than n, because it's taking the largest K leverage scores, so you can potentially do better using this sketching scheme, but it's under a very, very strong assumption. Now, some simulations to compare these different sketching schemes and also to compare the different metrics. There are going to be six different sketching schemes: the first four are related to the sampling-type approaches I talked about, and the last couple are projection-type approaches. This is not a very large data setting (I have some simulations in a larger data setting as well), but here we've got n equals 1,000 and p equals 50, and we vary r from around 50 to about 1,000, because we typically want r to be bigger than p but smaller than n. We're going to sample the rows of X from a multivariate t-distribution; the reason for that is that I've talked a bit about leverage score sketching and leverage score sampling, and this allows us to generate X's with different leverage score distributions, some with very uniform leverage scores and some without. So we're going to use t-distributions with three different numbers of degrees of freedom, one, two and ten, and compare these six different sampling or sketching schemes. In particular, just to illustrate why we're using these different distributions, this is a plot of what the leverage scores look like for these three different values.
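For reference, here is one way to generate such a design and inspect its leverage score distribution. This is a sketch of my own: the helper names are hypothetical, nu is the degrees-of-freedom parameter of the multivariate t (smaller nu gives heavier tails and less uniform leverage scores), and the scale matrix is taken to be the identity for simplicity.

```python
import numpy as np

def multivariate_t(n, p, df, rng):
    """Draw n rows from a p-dimensional multivariate t (identity scale, df degrees of freedom)."""
    z = rng.standard_normal((n, p))
    chi2 = rng.chisquare(df, size=n)
    return z / np.sqrt(chi2 / df)[:, None]

def leverage_scores(X):
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return np.sum(U**2, axis=1)

rng = np.random.default_rng(3)
n, p = 1000, 50
for df in (1, 2, 10):
    lev = leverage_scores(multivariate_t(n, p, df, rng))
    # With uniform leverage, every score would be about p/n; heavy tails push some scores toward 1.
    print(f"nu = {df:2d}: max leverage = {lev.max():.3f}, uniform value p/n = {p / n:.3f}")
```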
So the intuition is that nu equals one is a very heavy-tailed distribution, which means you're going to get more non-uniform leverage scores. With fewer degrees of freedom you have heavier tails, and so a less uniform leverage score distribution, which is what you see here; for nu equals ten you're getting close to a normal distribution, and so you have very, very uniform leverage scores. So we're just comparing the performance for different choices of X. If we look at these six different sketching schemes (sorry, it's a bit crowded), there are a couple of main points. One of them, particularly relating to the theoretical results, is to look at the scale of the plot. This panel is looking at the algorithmic or computational perspective, and if you look at the y-axis, everything is pretty close to one, particularly for the first two cases. So this supports the theory that you're within a one plus delta factor if you look at the worst-case type of performance that Drineas and Mahoney were dealing with. On the other hand, if you look at the statistical perspective, that third metric I talked about, the prediction efficiency, you can see that the y-axis is substantially larger than one, because in some sense you're losing efficiency by having less data to work with. So this is just showing the plots for these two different metrics. Okay, so we can see that, in general, sampling schemes and projection schemes both seem to work reasonably well, but the reason I showed that second theoretical result is that one scheme seems to do the best in the simulations even though there's not much theory for it: the leverage score sketching scheme without any bias correction. What is the green curve? The green curve is what Drineas and Mahoney typically do, which is leverage score sketching, but with a bias correction, a rescaling. That is what's typically proposed and has the most theoretical guarantees, but it doesn't do so well in practice. The red is doing the same thing but without the bias correction or rescaling; that's not what is proposed in the literature, and there's very little theory for it. What is the rescaling? Typically, when you're sampling and you take r samples, you need to rescale things to correct for the fact that you have fewer samples, so you multiply by something like the square root of n over r to keep everything calibrated. And for leverage score sampling, what do you multiply by?
So if you were taking r samples uniformly, you would multiply each sampled row by the square root of n over r to make sure everything works out. For leverage scores you have to do things a little differently, because every point has a different sampling probability: instead of rescaling by the square root of n over r, you rescale each sampled row by one over the square root of r times its sampling probability, which is proportional to its leverage score. Is this to get the correct expectation? Yes, exactly, that's exactly right, to get the correct expectation. But what's interesting is that, according to the simulations, you often do a lot better if you don't do this correction, and that's why I included it, and why we had that second theoretical result: we wanted some theoretical justification for why it was doing so much better in the simulations. But it has to be pointed out that, while I think this works very well under this model, under model mis-specification it's unclear how well it would really do. Could you show the previous slide again? Yeah. Okay, so for both metrics green is much worse than red? Yeah, exactly, for both metrics green is in general much worse than red, but there's more theoretical justification for green than for red. And in general, doing sampling, as you might expect, is less stable than doing projections, because with projections you're keeping all of the data but rotating it, whereas with sampling you're discarding a lot of your data; and as you'd expect, the performance is also worse, or more unstable, for nu equals one, which is the case with more non-uniformity in the leverage scores. Okay, so getting back to Francis's earlier question: this is work that was actually done subsequently by Pilanci and Wainwright. We suspected there was a lower bound, but while we were trying to prove it we discovered that Pilanci and Wainwright had been working on this independently (they've done a bunch of other things too). If we look at one of their results, and it wasn't written exactly this way: if you assume that your sketching matrix has this property (this is for randomized sketching, so to be clear, this expectation is taken over S, over the sketching scheme you're using), a property which more or less says that you've got r rows in your sketch, then the prediction efficiency scales no better than n over r, basically. This suggests that the upper bounds we had cannot really be improved, either by using a different sketching scheme or by tightening the analysis. It really confirms what we thought: this prediction efficiency, which is what statisticians typically think about, is an intrinsically harder metric than what was being looked at in the algorithmic or computational setting. And this was proved using pretty standard information-theoretic lower bounding techniques. Okay, and this condition is satisfied for all of the S's we talked about except, I guess, the red line here: for the red line we saw that we actually got a better result in that particular case, and that's because it does not satisfy this condition, but all of the other lines I plotted basically satisfy it.
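Going back to the rescaling discussion above, here is a hedged sketch of the two leverage-score sampling variants: with the one over the square root of r times p_i rescaling (the bias-correcting version, the green curve as I read the slides) and without it (the plain version, the red curve). The function names and parameter values are mine, for illustration only.

```python
import numpy as np

def leverage_probs(X):
    """Sampling probabilities proportional to the leverage scores."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    lev = np.sum(U**2, axis=1)
    return lev / lev.sum()

def sketch_leverage(X, y, r, rng, rescale=True):
    """Leverage-score sampling sketch of (X, y).

    With rescale=True each sampled row i is multiplied by 1/sqrt(r * p_i),
    so that E[S^T S] = I (the bias-correcting convention).
    With rescale=False the sampled rows are kept as-is (the variant that did
    better in the simulations, but with weaker theory).
    """
    probs = leverage_probs(X)
    idx = rng.choice(len(y), size=r, replace=True, p=probs)
    w = 1.0 / np.sqrt(r * probs[idx]) if rescale else np.ones(r)
    return w[:, None] * X[idx], w * y[idx]

rng = np.random.default_rng(4)
X, y = rng.standard_normal((1000, 50)), rng.standard_normal(1000)

for rescale in (True, False):
    Xs, ys = sketch_leverage(X, y, r=300, rng=rng, rescale=rescale)
    beta_s, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
```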
Okay, and the work they did also led somewhere further. What it's basically saying is that doing one sketch of your data, when you do ordinary least squares or even constrained least squares, is really not enough if you want to do well in terms of the statistical metric. What they proposed is iterative sketching schemes. This is computationally more intensive, because you're taking more and more sketches, but if you do iterative sketching then you can potentially get the original one plus delta type bound that you want. So one sketch isn't enough when you're doing least squares and you're interested in this prediction efficiency. Is there any better bound there? Sorry? The bound holds with probability around one half; can that be improved? Well, sure, you can make that 128 bigger and make that one half smaller. Typically how this works is that there are constants here, so you get a one minus delta, and that 128 depends on whatever that probability is, so if you want to make the probability closer to one you make the constant bigger. These constants are very loose, for both the upper and the lower bounds; the upper bounds are based on concentration of measure type arguments and the lower bounds use information theory techniques, but the constants are really, really loose. Okay, any other questions? All right, so just to conclude: we looked at a very simple problem, and there are a number of ways this work can be extended. What this analysis shows is that we had these two perspectives: the computational, algorithmic perspective saying that doing sketching once preserves most of the information you want and does well, whereas we came in thinking that you're losing a lot of your data. This was resolved by the fact that we were, in some sense, looking at different criteria and different models. In terms of prediction efficiency, which is what you care about in a statistical setting, sketching is substantially more challenging than in the standard algorithmic or worst-case setting. And then there was this other interesting thing that came up: when you do leverage score sampling without correcting for the bias to fix the expectation, it seems to do well from a simulation performance perspective. But, as was pointed out, and relating to some of the questions, there are a number of ways this work can be extended. There's the online case: this only works in the batch setting, so is there a way to extend these ideas to the online setting? And this ordinary least squares problem is the simplest problem you can look at, so what we're looking at right now is the case where you have low-rank matrices or tensors. There's this notion of doing sketching to come up with an approximation to the SVD, the so-called CUR decomposition, and right now we're doing an analysis of sketching for that setting. There's also a
perspective on sketching that's related to regularization and implicit regularization, where we're exploring the connection, and we're also looking at whether sketching can in some sense recover low-rank tensor structure. In general, tensors are more difficult to deal with than vectors or matrices, and we're seeing if sketching can potentially give you computational gains there, if you do the sketching in a reasonably clever and efficient way. All right, thanks very much. Okay, so we have time for questions. On the leverage score sampling, could you show the results again? Yeah, sure, the simulations. Yes, the simulations; so you had a fast version of one of those leverage score schemes? Yeah, exactly. That's this SHR, which is, sorry, the pink one. Is this like the analog of the SR one but fast? Yeah. Do you have an analog of the SNR one which is fast? Yes, and that also does better; so the SNR one which is fast does better than the SR one which is fast. Fast means you estimate the leverage scores approximately? Yeah, exactly. So we use that scheme to estimate the leverage scores approximately, and so you can see, you do what? You do worse than if you knew them exactly. But does that scheme actually work? I have the paper, and it seems like they prove something, but was this actually implemented? Yeah, this was implemented, and it works reasonably well. Which colour is this? That's the pink. So you compare pink to green? Yeah, that's right, pink to green basically. So pink is better than green? Yes, and that's probably because you're doing some kind of regularization by using the approximate leverage scores, and that actually does better, which is interesting. And could you say the same thing about scaling versus not scaling? The fact that you're not scaling probably acts as a regularization, and that's why it works better. Yeah, I agree, I think it is a regularization-type argument. Can you write that down in a formal way? That is something we tried to do, and I guess that's what this theorem was trying to get at, but we needed to assume something very specific, so we're trying to do that in a more formal way. Right, so with non-uniform leverage scores you're dividing by the leverage scores, I guess. Yeah. That may introduce a lot of variance as well. Exactly, that's exactly right, and that's exactly why it does terribly when the leverage scores are very non-uniform, because you're dividing by something that's potentially very close to zero. You're doing something like importance weighting. Yes, exactly. And importance weighting can be terrible when the weights are extreme and the distributions don't overlap well. Yeah, it's terrible. One way other people try to overcome that is to use a linear combination of the leverage score distribution and the uniform distribution, but that still seems to do worse than not doing any re-weighting at all. And there's another technique, which is a random rotation that makes the leverage scores uniform, and then you sample uniformly. Yeah, that's right. But that's intractable, because there's no way to speed up the random rotation step? As far as I understand, that's true, yes, the random rotation step.
Yeah, that's as far as I understand it. Does a fast version of the rotation exist? Well, I guess not; it's like with those structured matrices, there's a limit on how sparse you can make them, and I guess it might be the same issue here. And the idea of the random rotation is that once you've done the rotation, the leverage scores are essentially uniform, so you don't care about computing them anymore. So can you do the random rotation with something very simple and fast? I wish I knew. Yeah, I don't know that there's a fast way to do that, but that's a good point. Is there any other question? Do you think your lower bound is breakable, and how far do you think it's possible to go, maybe under some assumptions? Oh, so when you say breakable, you mean, sorry... I mean the lower bound. Well, it's not my bound, but all right. You said that without the normalization, you do not satisfy the assumption. Yeah. So you think you can go below it? Yeah, so certainly the lower bound holds under that assumption, so you can potentially go below it. As for how far you can go: for example, if you have a leverage score distribution that is very non-uniform, then you can do much, much better than the lower bound, but again, that is under a very restrictive assumption. Basically, what it's saying is that if all of your leverage score mass sits in, say, p of the samples, then you can just take those p samples and that takes care of everything. But is that often satisfied? It depends, I think, on what your X matrix is like, but for a general X, I don't know if you can do much better than the lower bound. Okay, another question? Yes, cool. What's the quick intuition for why SNR doesn't satisfy the assumption? The intuition is this: think about an example where p of the leverage scores are equal to one and the rest are all zero; you could just take those p rows, and the rest of the rows contribute nothing. Essentially, the S S-transpose matrix would then behave very differently, because what the assumption is basically saying is that you're effectively getting an r over n fraction of the leverage score weight, and if you don't do the rescaling, you're potentially getting way more or way less than that.