Thank you very much. Thanks to all the organizers for bringing us here and for the invitation; I'm very happy to be speaking. The title of my talk is, I think, a subset of the title of the workshop: I'm going to talk about trade-offs in statistical learning, computational trade-offs, but not only, multiple trade-offs. I should mention that this is joint work with co-authors all over the world: Philippe Rigollet, Richard Samworth, Tengyao Wang, Venkat Chandrasekaran, and Jordan Ellenberg. A lot of people talk about big data, and everyone has their own idea of what it is. In general, you can describe it as a broad phenomenon, a collection of challenges in data science; much like global warming, it's a bit hard to define, but it's a lot of things we're working on. There are several important aspects, but one that we have focused on a lot, at least in statistics, is that a great deal of data is collected and it's not always useful: most of the data might not be relevant to the problem at hand. The data can also be very complex and heterogeneous, and there can be many issues attached to it: we might want to keep it private, there might be a lot of errors in it. All this complexity inherently creates trade-offs in learning problems. This is one of those introduction slides with a lot of big numbers. The idea is that when we have a lot of data like this and we try to say something about a very high-dimensional parameter (d here can be very large, n can be very large), we often introduce assumptions about the model, structure, sparsity; we try to make the problem tractable from a statistical point of view by introducing all this. And statisticians' answers tend to be optimization problems. This creates combinatorial problems when you focus on likelihood methods: if you're doing estimation and want to solve an optimization problem, this might be intractable; if you're doing hypothesis testing and want to compute averages over very large sets, this might also be intractable. So this is a barrier when our objective is to do computationally efficient statistics. We want to be able to tell a practitioner what is going to happen when they run a method on their computer. And for doing statistics in a computationally efficient way, the worst-case hardness of a likelihood problem is more or less zeroth-order information: it tells us that the first thing we would want to do, the first idea, is not always going to be possible, but it doesn't say much more. First of all, it's possible that an algorithm exists that works for frequent instances of an NP-hard problem. In clustering, in non-convex regression, a lot of the time we can run stochastic gradient descent or alternating minimization, and if the right conditions hold, which will be the case for natural signals in general, then we can solve these problems on a lot of instances. It's also possible to create another optimization problem, a proxy for the original one. This is what we do with convex relaxations, with the usual trick of transforming the L0 norm into the L1 norm.
And so in order to say that something is computationally hard from a statistical point of view, we have to rely on average-case hardness hypotheses, not only worst-case hardness. We need to say that some task is hard to achieve consistently and efficiently. So the big question is: can we have our cake and eat it too, le beurre et l'argent du beurre as we say in French (the butter and the money for the butter)? Can we have good statistical performance and algorithmic efficiency at the same time? This is a figure that a lot of people have started using, describing the statistical error more or less as a function of the running time. Above this line are essentially the things that you can do, the things for which algorithms exist. And below it is the information-theoretic threshold, which says that no matter what the running time is, you can't push your statistical error below some particular level. But there's been more and more work suggesting that this achievable statistical error actually increases when we are constrained on the algorithmic side. There are two points of view around these kinds of results. The first is to say: we have in mind what reasonable computing time means, so there are going to be two regions. One is where we have infinite computing time, where the barrier is information-theoretic. The other is a region where we know how to compute things, where there is another lower bound that might be higher than the first one. This is a binary kind of result, but in this setting we are able to get lower bounds. The other point of view is smoother, a finer analysis of this curve. In some work, people do more and more relaxations, so they get more and more points, upper bounds, on this side of the curve. In other cases, people look at a particular computation model, so they have more or less a cursor to describe what running time means, and they are able to show that the statistical error of their particular procedure increases as they move it. But this is the picture I want you to have in mind when we look at statistical error and algorithmic efficiency at the same time. So, to give an idea (for the people who've seen this before, you can sleep for ten minutes and wake up after): this is sparse principal component analysis, in particular sparse principal component detection. We have X1, ..., Xn, independent centered random vectors, and the question is: do these vectors have an isotropic distribution, or is there a sparse unit direction v along which there is more variance, in particular 1 + theta instead of 1? This is different from plain principal component detection, where v can be anything. In high dimension, that is a very complicated problem: random matrix theory tells us, essentially, that we need a very strong signal, that the added variance must grow with the dimension, which is really problematic in a high-dimensional setting. That is why we focus on sparse vectors, vectors aligned with only a few axes. The signal strength is theta; it quantifies, in some sense, the distance between these two distributions, how different they are. Of course, if theta is very small this is a hard problem, and if it's very large it's an easy problem.
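To fix ideas, here is a minimal sketch of this detection problem in code, together with the combinatorial test statistic analyzed next: the k-sparse largest eigenvalue of the empirical covariance, computed by brute force over supports. All parameter values are mine, chosen purely for illustration; the brute-force loop over supports is exactly the exponential-time step discussed below.

```python
# Toy sparse PCA detection: isotropic null vs a k-sparse spiked
# alternative, tested with the brute-force k-sparse eigenvalue
# statistic (exponential in d -- this is the intractable step).
import itertools
import numpy as np

def k_sparse_lambda_max(sigma_hat, k):
    """Max of v' Sigma_hat v over unit vectors v with at most k nonzeros."""
    d = sigma_hat.shape[0]
    best = -np.inf
    for support in itertools.combinations(range(d), k):
        sub = sigma_hat[np.ix_(support, support)]
        best = max(best, np.linalg.eigvalsh(sub)[-1])
    return best

rng = np.random.default_rng(0)
n, d, k, theta = 200, 20, 3, 0.8

v = np.zeros(d)
v[:k] = 1 / np.sqrt(k)                      # k-sparse unit direction
x_null = rng.standard_normal((n, d))        # isotropic distribution
x_alt = x_null + np.sqrt(theta) * rng.standard_normal((n, 1)) * v

for name, x in [("null", x_null), ("alternative", x_alt)]:
    sigma_hat = x.T @ x / n
    print(name, round(k_sparse_lambda_max(sigma_hat, k), 3))
# Null value ~ 1 + O(sqrt(k log d / n)); alternative value ~ 1 + theta.
```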
And we have some results here, actually. If you only look at the statistical point of view, first we have a really tight result. Essentially, we take the empirical covariance matrix and we maximize the quadratic form over unit k-sparse vectors. If we had Sigma, the population covariance matrix, here instead, we would get 1 under the null and 1 + theta under the alternative. So we just have to control the deviations of this statistic when we replace the population covariance matrix by the empirical one. You do this analysis and you find that theta needs to be of order sqrt(k log d / n). So we don't really have d here; we have k, which is, in some sense, the true dimension of the problem, and we lose a logarithmic factor. But this analysis is tight: from an information-theoretic point of view, we have matching lower bounds, more or less of the same order. Of course, we don't know how to optimize this statistic algorithmically. And when we do the analysis with a relaxation, the usual trick of substituting an L1 norm for the sparsity assumption, with the added relaxation of lifting to the positive semidefinite cone (a relaxation due to a few people, some of them in the audience here), we show that to be effective at discriminating these two hypotheses it actually requires a much stronger signal: theta needs to be of order sqrt(k^2 / n) this time. This can still be better than the first thing we saw, lambda_max, where you need theta greater than sqrt(d / n), but it's still suboptimal compared to the combinatorial result. So if we go back to the slide I had: we have the information-theoretic barrier here and a suboptimal result there, and we don't know whether it's because the curve truly sits there, because our analysis is not good enough, or because this method is not good. The overall picture for the testing problem is that computationally efficient tests seem to require theta of order sqrt(k^2 / n). It seems to say that below some level no detection is possible, above it a combinatorial method works, and only at this higher level do polynomial-time methods start working. But of course this is just a suggestion right now; this is just an upper bound that we have. The true situation could be very different: there could be an alpha smaller than 2 such that some algorithm starts working at the signal level sqrt(k^alpha / n), for alpha between 1 and 2. And so in order to show that this is not the case, we need complexity-theoretic lower bounds, a little bit like the information-theoretic lower bounds of minimax analysis in statistics, but this time taking into account the algorithmic efficiency of the testing procedure. And in order to show that this is not possible, we have to use an assumption from computer science: the fact that some problem, at least, is hard on average. The problem we are going to look at is the planted clique problem. It's very easy to describe. It's about random graphs, the easiest possible kind of random graphs, where each edge is present with probability one half, independently. Here is the adjacency matrix. Under this distribution, the adjacency matrix is essentially pure noise, a bunch of IID coefficients up to symmetry, and its expectation is the constant matrix one half. And the other distribution is the planted clique distribution.
So this time a clique, a subset of completely connected vertices, here in orange, is planted in the graph; it is added. In the adjacency matrix, it is going to be a small square like this. Again, it's hidden. Of course, on a small graph like this it's kind of easy to see that there is one, but in general, what we can say is that this random instance has an expectation with a sparse signal structure. A little bit like the original sparse PCA problem, where either everything was completely independent, pure noise, or a sparse signal was present, and we had to differentiate these two possible cases. And of course this is just a schematic: the clique is drawn as one square, but it could be anywhere, and that is what makes it a truly hard problem, because we would have to look everywhere to find it. If we look at this problem from a purely probabilistic or statistical point of view: under the null hypothesis, you can show that the largest clique has size smaller than 2 log m with very high probability, and under the alternative, it has size at least kappa by definition, because we've planted a clique of size kappa. So when kappa is greater than 2 log m, it is possible in principle to detect the presence of this clique, but it is NP-hard to compute this clique number. So people have looked at other methods to try and detect, or even locate, the clique. What they found is a polynomial-time detection method that works but requires the clique size to be greater than sqrt(m), where m is the number of vertices, the size of the graph. And there is a very strong hypothesis in theoretical computer science that says it is essentially impossible to do better: if kappa is a power of m smaller than one half, then no polynomial-time method works, even just to distinguish these two distributions. This assumption is considered pretty strong for a few reasons. First, a lot of very smart people have tried to improve this result and haven't been able to; that's the authority argument. People have also shown that if you focus on particular computational models, so Markov chains, all kinds of convex relaxations for this problem, sqrt(m) is where things start to fall apart. And people now use it as a primitive to prove the hardness of other things: to prove that it's hard to approximate Nash equilibria, that some cryptographic systems are secure. So it is considered pretty secure, and it is something we can use to show that other problems in statistics are hard. It will be a primitive for an average-case reduction: we show a link between problems. We say: if there were an algorithm that worked really well for the statistical problem that we have, like the sparse PCA problem I talked about in the beginning, then we could take a graph as input, modify it a little bit, feed it into the algorithm for the original statistical problem, and it would contradict this hypothesis. And that's what we've been able to do. We've been able to link these two problems, planted clique and sparse PCA, and to show that any kind of detection below this threshold would contradict the planted clique hypothesis.
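To make the two graph distributions concrete, here is a small generator; a minimal sketch, with m and kappa chosen only for illustration.

```python
# G(m, 1/2) versus G(m, 1/2) with a planted clique of size kappa.
import numpy as np

def erdos_renyi(m, rng):
    a = np.triu(rng.integers(0, 2, size=(m, m)), 1)
    return a + a.T                      # symmetric 0/1 adjacency, zero diagonal

def planted_clique(m, kappa, rng):
    a = erdos_renyi(m, rng)
    clique = rng.choice(m, size=kappa, replace=False)
    a[np.ix_(clique, clique)] = 1       # fully connect the planted vertices
    np.fill_diagonal(a, 0)
    return a, clique

rng = np.random.default_rng(1)
a, clique = planted_clique(m=500, kappa=40, rng=rng)
# The largest clique in G(500, 1/2) is ~ 2 log2(500) ~ 18, so kappa = 40
# is detectable in principle; and sqrt(500) ~ 22, so this instance also
# sits in the regime where polynomial-time methods are believed to work.
```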
So the take-home message here is that when you restrict to computationally efficient detection methods, at least in this problem, there is a gap of order sqrt(k). In the curve that I had in the beginning, between these two regions, infinite computing time and reasonable computing time, there really is a gap between the barriers for statistical efficiency. And the same result is true for estimation; this is something we worked on with colleagues in Cambridge later. The picture is a little bit more complex there, because then there are two scales: the size of the signal and the rate of convergence. It's not just whether you can detect, but how well. We show again that in certain signal regimes there is a link between estimating v and recovering the clique, and that when you restrict to computationally efficient methods there is, again, a gap of order sqrt(k).

[Audience] But this gap is still a conjecture, right? Because what you have is: a lot of people have tried to solve this problem and failed.

Absolutely.

[Audience] But you could also argue that P different from NP is an unproved conjecture, and so maybe we can compute the k-sparse largest eigenvalue in polynomial time.

Sure, but is this kind of problem of a similar sort to P versus NP? No, it's not as hard; it's not as secure, in some sense, as P different from NP. It's possible that this conjecture is false and that P different from NP still stands. However, it has been shown that if you want to prove lower bounds for average-case problems in general, it seems very likely that you cannot use worst-case assumptions. To spare you the technical details, the whole polynomial hierarchy would collapse if that were possible. So it is assumed that you need an average-case assumption; you can't prove these results just from P different from NP. You're always going to have to link different problems in the average case, with specific distributions on instances. Other questions before I close the sparse PCA parenthesis?

[Audience] Can you show it is equivalent to the planted clique problem, or is it only one way?

That's a very good question. It's not clear that it's equivalent in all parameter regimes. You can build the reduction the other way as well, but not in very natural parameter regimes. OK? So this was one particular problem where you cannot have your cake and eat it too. But of course, this is not at all a universal situation. In the last few decades, we've seen the power of convex relaxation in statistics and machine learning. Even more recently, I think, there has been more focus on the fact that a lot of things work by magic a lot of the time: even when SGD is not supposed to work, it works, and you can actually prove that it works a lot of the time. So there are many situations where you can have these two things at the same time. And other people have worked on showing that you can have statistical performance together with something other than computational efficiency: privacy, data security, robustness to errors, the fact that the data is distributed. You can have statistical performance and that other thing, and eventually lose only a little bit of statistical performance when you do it.
So now the question that I had in mind is: can you have all three? Can you have your cake, eat it too, and have the cherry on top as well? I like to be a buzzkill; I like negative results in general. So I focused on problems where you cannot have all three at the same time. You can have two if you want: statistical performance and algorithmic efficiency, or statistical performance and some other property. But I've noticed that in a lot of these analyses, it is shown that things work for the MLE, or for something that requires a lot of strong assumptions. And I've wondered: can you show that sometimes it doesn't work for any estimator? In particular, can you show that it doesn't work for any algorithmically efficient estimator? Are there cases where we can have these two but not that one? One example that is actually pretty natural, which I'd like to talk about now, is the tension between statistical efficiency, algorithmic efficiency, and the ability to distribute your data. This is related to sparse submatrix signals. There are a lot of motivations for distributed data. First of all, as Ohad said yesterday, the data can be too big to store or move around. But you can also have other types of constraints: maybe the data cannot be shared freely. You have a country with datasets in different regions, and maybe these are hospitals that are not willing to share their data, but that may be ready to share one average, or a small part of their data. The question is then: can you do as well as if you had all the data? This is the classical picture in this case. You have X1, ..., Xn distributed according to some distribution, and you want to estimate something about this distribution. This is when you have all the data and you have good guarantees for theta-hat: deviation bounds holding with high probability, et cetera. The question is: when you distribute the data, when you cut it up into M blocks of n/M samples each, and estimation can only be done locally, can you do as well? You compute a bunch of local estimators and then you aggregate them. Here I wrote an average, but it doesn't have to be an average; it can be all kinds of things. And it's assumed that all these blocks are IID. Of course, there are practical advantages to doing things like this. As I said in the beginning, it can be a constraint: you might have security issues or size issues. But you also have running-time advantages. If you are able to completely parallelize all the computations, the computing time is divided by the number of blocks. And even if you have to process the blocks one by one, first this one, then this one, and so on, it will still speed up computations, because each computation starts from a smaller sample size. If you have n^(1/4) blocks and the running time is n^5, you improve that power even when you process them one by one. Of course, there are issues when you do things like this. You have a loss of signal: each local estimator is based on fewer samples. So the question is, how does this loss scale? And that is going to be different for different problems. If you're computing an average, everything commutes: the aggregate of the local averages equals the global average.
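Schematically, the protocol looks like the sketch below: a minimal version, assuming the simplest aggregation, an average of local estimators. For the sample mean everything commutes, as just said.

```python
import numpy as np

def distributed_estimate(x, n_blocks, local_estimator, aggregate=np.mean):
    """Split the sample into IID blocks, estimate locally, then combine
    only the local estimates -- the raw data never leaves a block."""
    blocks = np.array_split(x, n_blocks)
    return aggregate([local_estimator(b) for b in blocks])

rng = np.random.default_rng(2)
x = rng.normal(loc=1.5, scale=1.0, size=10_000)

# With equal block sizes, the distributed sample mean equals the
# full-data sample mean exactly.
print(distributed_estimate(x, 100, np.mean), x.mean())
```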
But for a lot of problems, things don't commute like this, and the loss doesn't scale well. What people have focused on a lot recently are positive results for this. They show that there is a large class of estimators that you can parallelize well: things based on convex minimization. If you have strong assumptions on your MLE, you can analyze the geometry of the problem and show that the loss from distributing is minimal. So the natural question I had in the beginning is: can this be extended to any kind of estimator of theta? In particular, in situations where you cannot compute the MLE, when this is not a convex problem, can you do anything? And the problem that I found, which is pretty natural but raises some issues, is pretty simple. You have a big matrix of observations. The mean, the signal, is AA' for a sparse vector A, and the noise is just a bunch of IID coefficients; the matrix is of size n. And just to fix the scaling, we assume the coordinates A_i are a constant alpha when there is signal, which happens with probability k/n, but most of the time they are equal to zero; so k is more or less the number of nonzero coefficients. The noise, as I said, has independent coefficients, centered with variance one, with the usual sub-Gaussian tails. And in this problem, A is sparse and we fix this sparsity to be a constant times sqrt(n), because this is where the interesting phenomena appear.

[Audience] Why do you fix the variance to one? Don't you lose the possibility of a scaling in alpha?

I'm sorry?

[Audience] Alpha is between one half and one?

Yeah, between one half and one, or really any constant.

[Audience] Wouldn't it be better to have more noise? You're fixing the amount of noise.

Yes, yes. I'm fixing all of these because this is where the interesting stuff happens.

[Audience] Oh, so you chose these values on purpose, because this is where things get difficult?

Well, this is where things start to be interesting once you distribute the data. But you could change things a little bit: you could say that alpha_i is smaller than this, and then you have to change the sparsity, and the same kind of phenomenon will appear. This is just one example where interesting things appear. Okay? And so what you want to do here is estimate A, or even something very simple, the mean of the alpha_i's. You can think of this as the study of an epidemic. You have a bunch of patient profiles and you don't really know yet what the disease looks like, so you compare all their files. If you have all the patients' files, you have this big matrix of observations, and the idea is that whenever two patients both have the disease, their profiles look alike and the corresponding coefficient spikes up. All right, this is the problem we want to look at. But of course, we don't actually have access to all this data. These patients are in certain cities all over the country, their profiles are in a hospital, and they are not going to be shared. So the only data that actually exists are the diagonal blocks. You have n^epsilon blocks, for some constant epsilon, and in each block you have n^(1-epsilon) profiles, or samples. And so the question is, of course: is the statistical problem still possible? First of all, is it possible before you distribute the data? Then, is it possible when you do this? And then, how does algorithmic efficiency come into the mix?
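Here is a sketch of the observation model as I read it from this description; the constants c and alpha are illustration values of mine, consistent with the "between one half and one" remark above.

```python
# Y = A A' + E: A_i = alpha with probability k/n (k ~ c sqrt(n)),
# and 0 otherwise; E has IID centered unit-variance entries.
import numpy as np

def sparse_signal_matrix(n, c=3.0, alpha=0.75, rng=None):
    rng = rng or np.random.default_rng()
    k = c * np.sqrt(n)                       # sparsity level ~ c sqrt(n)
    a = alpha * (rng.random(n) < k / n)      # sparse signal vector
    return np.outer(a, a) + rng.standard_normal((n, n)), a

y, a = sparse_signal_matrix(n=2_000, rng=np.random.default_rng(3))
print(int((a != 0).sum()), "nonzero coordinates out of", len(a))
```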
One first place where we can see that there are going to be issues: even for a very simple heuristic for alpha, the mean of the alpha_i's, namely summing all the coefficients, we can see why things become complicated when we divide the data like this. If we sum all the coefficients of this matrix, then, because we've raised the mean of some coefficients, the overall mean should be higher as well, and we can estimate alpha up to small constants with high probability. And if the constant coming from the sparsity (the sparsity is c times sqrt(n)) is large enough, then we can detect the presence of signal and we can get a pretty good estimate of alpha. So this is when we have all the data; this is the baseline result, the benchmark in some sense, okay? But when we cut up the data like this, the sparsity locally, compared to the size of the block, has changed: when you zoom in, things essentially look sparser. You can do the computations and you'll see that the sparsity within one block scales like the block size raised to a power smaller than one half. And so if you try to sum within each individual block, the added mean is smaller than the size of the deviations, and things do not work. Now, you can challenge what I'm saying here a little bit. First, this is just a simple heuristic: maybe this one doesn't work, but other things will. And second, of course you've made the problem harder; there is less data now, so maybe you've simply made the problem impossible. So first, we'll see that you actually do not make the problem impossible by cutting up the data like this: it is still possible to estimate A when you only have access to the small diagonal blocks. We first do the analysis when all the data is available, and the analysis is based on a sparse spectral method. For now we don't care about computing time at all, so we assume again that we can maximize this quadratic form over sparse vectors, the same type of statistic that we had for sparse PCA with the likelihood approach. We get a deviation bound for this vector governed by two things: the spectral gap of the signal matrix and a certain norm of the added noise. You can control this norm of the noise, and the spectral gap, in terms of the parameters of your problem, and you find a bound that goes to zero as n grows. So when n is very large, you have a good idea of what this vector is. This allows you to create a good candidate set of nonzero coefficients; these are the people you strongly suspect have the disease. You show that it has a good intersection with the true support of A. And then you can do a second refinement phase: you look at, essentially, the profiles that are highly correlated with those in W, you take the highest coefficients, and this allows you to recover the whole support. So this is when you have all the data, but the same analysis, with the parameters changed just a little bit, will still work when you do it locally on each block. You will still find that this bound goes to zero as n grows. The rate will be a little bit slower, of course, because we have a bit fewer samples, but it still works.
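A back-of-the-envelope check of the summing heuristic from the start of this passage, using the same toy model and the same illustration constants as before: globally, the signal in the total sum is about (alpha k)^2 = alpha^2 c^2 n against a noise deviation of order n, so a large enough c separates the hypotheses; in a diagonal block of side b, the expected signal is only about (alpha k b / n)^2, which is swamped by the local noise deviation of order b.

```python
import numpy as np

rng = np.random.default_rng(4)
n, c, alpha = 2_000, 3.0, 0.75
a = alpha * (rng.random(n) < c / np.sqrt(n))
y = np.outer(a, a) + rng.standard_normal((n, n))

# Global sum: signal ~ (alpha * k)^2, noise deviation ~ n.
print("global sum:", round(y.sum()), " vs noise scale ~", n)

# Cut the diagonal into ~ n^(1/4) blocks and sum each one: the local
# signal now sits below the local noise scale.
m = int(n ** 0.25)
edges = np.linspace(0, n, m + 1).astype(int)
block_sums = [y[lo:hi, lo:hi].sum() for lo, hi in zip(edges, edges[1:])]
print("typical |block sum|:", round(np.mean(np.abs(block_sums))),
      " vs noise scale ~", n // m)
```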
And we're still able (this should be a W-hat sub S on the slide, right?): we have a candidate set in each block, we run the second phase in each block as well, and it works. So the problem has not been made impossible from a statistical point of view by cutting up the data like this. But of course, the method I've presented here is computationally hard. So now we could say: well, maybe this problem is computationally hard whether or not it's in a distributed setting. And actually, it's not. You can do the same analysis for a plain spectral method: you forget about the sparsity assumption here. Now you'll have a different norm, because this maximization problem is less constrained; you'll still have the spectral gap; and when you analyze this (this is why we chose k to be of order sqrt(n)), you get something here that is less than one. So you have a non-trivial angle between your estimated vector and the true vector. This means that the candidate set will still have a pretty good intersection with the true support, and the second phase will still work. All you need for the second phase to work is a bound that is a small constant in this norm; it doesn't have to go to zero. Okay, so the problem initially is not computationally hard: you can just do a spectral analysis and it will work. And you can solve it in a distributed manner with a sparse spectral method. The question is: can you do the same thing as when I went from this slide to that slide and said there was only a minimal loss? Can you then do the same analysis in each block here? Of course, since I'm asking the question: it's not possible. You do the same analysis of the operator norm of the noise and of the spectral gap in each small block, and you find that this upper bound here is non-informative. It is a distance on the sphere, in some sense, a bounded quantity; so if you have a bound on it that goes to infinity, it doesn't tell you anything. Any vector on the sphere would satisfy it. So you cannot do the second phase and you have no information. Again, this only says that this particular method doesn't work; maybe something else will. So here our objective is a statistical procedure for A that satisfies several properties at once. This is the picture I showed in the beginning: having your cake, wanting to eat it too, and having something else on the side. You can have any two of these things at the same time. You can solve the statistical problem with the data completely distributed: that's the sparse spectral method. You can also solve your statistical problem efficiently if you have all the data: that's the plain spectral approach. But if you want to do all three at the same time, so far everything we've tried doesn't work. And you can do the same thing as in the previous problem I talked about. The global idea is that estimating A in a distributed manner means estimating all the local pieces of A. If I go back to the block picture: this is the only data that exists, so if you want to estimate the whole matrix, or the vector A, when this is the data available and estimation is done locally on each block, the only way to do it is to estimate each local piece of A well. If there is one region where you're not recovering the support, then that is going to be a problem for estimating the whole matrix as well.
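Here is a rough sketch of the two-phase procedure with the plain spectral first phase just described. Variable names and thresholds are mine, not the paper's, and the constants are chosen so that the signal eigenvalue alpha^2 k comfortably beats the noise operator norm, which is of order 2 sqrt(n).

```python
import numpy as np

def two_phase_recovery(y, k):
    # Phase 1: leading eigenvector of Y, with no sparsity constraint.
    _, eigvecs = np.linalg.eigh(y)
    u = eigvecs[:, -1]
    w = np.argsort(-np.abs(u))[:k]           # candidate set W
    # Phase 2: refine by correlation with the candidate profiles.
    scores = y[:, w].sum(axis=1)
    return np.sort(np.argsort(-scores)[:k])

rng = np.random.default_rng(5)
n, c, alpha = 2_000, 4.0, 1.0
a = alpha * (rng.random(n) < c / np.sqrt(n))
e = rng.standard_normal((n, n))
y = np.outer(a, a) + (e + e.T) / np.sqrt(2)  # symmetrized unit-variance noise

true_support = np.flatnonzero(a)
support_hat = two_phase_recovery(y, k=true_support.size)
print("recovered", np.intersect1d(support_hat, true_support).size,
      "of", true_support.size, "signal coordinates")
```

In this regime the printout should show near-complete recovery; shrinking alpha or c degrades the first phase, matching the spectral-gap condition above.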
So if you could estimate the whole matrix in a distributed way, it would mean that you can estimate all the small blocks individually. And now the signal strength in each individual block is too small; the local sparsity is too small as well; these two are related, and in particular it becomes hard for computationally efficient procedures. So we again do a reduction from the planted clique problem that I talked about in the beginning. We show that if you take these graphs, the two possible distributions on graphs, take the adjacency matrix, do some manipulations, add some noise to the coefficients, et cetera, you will transform the adjacency matrices coming from that problem into instances of this problem. Here the connection with sparse PCA is, I think, even more evident. In particular, we show that if you had something that worked on these small matrices, a good local estimator, then you could use it to detect the presence of very small cliques in graphs, which would contradict the planted clique hypothesis. So in the end, you can't have it: we are able to show that this is the picture. You can do statistics in an algorithmically efficient way; you can do it in a distributed way; but you cannot do both at the same time. And this is more the big-picture message that I want to convey here: when we look at statistical problems and we want to add something to our statistical procedures, these additions should not be considered individually. We should consider all of them at the same time, because otherwise we face multiple trade-offs. This is why the title of my talk was trade-offs, plural, in statistical learning. We can have impossibility results when we want a lot of things at the same time in a statistical problem. There is another type of trade-off that I've studied as well. This is the work I mentioned in the beginning with Jordan Ellenberg. This time, the other thing that we want to add is not distributed computation or distributed data; it's robustness to errors. The statistical problem is a little bit more of a toy problem, a bit less natural than the two I talked about before; it's related to satisfiability. But the overall picture, even without going too much into the details, is that in this problem, optimal statistical performance can be reached by an NP-hard method. So it is like sparse PCA; it is like the other problem on sparse submatrices that I just talked about. And two improvements can each be made with only a small statistical loss, but only individually. One of the things you can do is use another statistical procedure that is computationally efficient; you will have some statistical loss with this procedure, but it will be minimal. The other thing you can do is allow for a constant proportion of errors: you change your statistical procedure differently, and even if 99% of your data is pure trash, you'll still be able to distinguish complete noise from the presence of a little bit of signal in this problem. The issue is that you cannot do both at the same time. So again we have this type of picture, where statistical efficiency is compatible with each of two other things individually, here either robustness to errors or computational efficiency, but making one improvement means giving up the other.
The two improvements cannot be made simultaneously. And we actually have a negative result that tells us there is an obstruction in this problem if we try to do both at the same time. It's not related to the planted clique problem; it's related to another problem in theoretical computer science called learning parity with noise, another one of these problems that is assumed to be hard on average.

[Audience] Hard how? Because of P versus NP, or the unique games stuff?

No, it's not unique games, and it's not P different from NP either. It's the same kind of situation as planted clique: people have tried and not been able to improve the results that exist. People have also shown (this was actually one of the first computational lower bounds of this kind) that if you focus on one particular model, statistical queries, where you count the number of queries you can make to an oracle, the problem is provably hard. And this is related to the whole literature on statistical algorithms that Vitaly Feldman is developing. Essentially, these are problems that are hard when you only use averages of the data.

[Audience] Here you're always a bit fuzzy. Statistical lower bounds are very clear, but for computation it always seems fuzzy. Do you need something like P equal to NP, or can you formally define what it means to be hard computationally? What is a computational lower bound, precisely?

A computational lower bound, precisely, says: if you had an algorithm that improves on the assumed bound that we have, then you could construct another algorithm that contradicts something.

[Audience] P equals NP?

No, an algorithm that would perform better than thought possible on another problem. It always rests on a conjecture.

[Audience] P different from NP is a big conjecture, but here you are sometimes using something different, and maybe not as widely conjectured.

Absolutely. As I said to Simon, people have looked at this. This is true in all of computational learning theory; people are well aware that this is an issue. But they've shown that, unless the whole polynomial hierarchy collapses, you very likely cannot show average-case hardness of learning by using just P different from NP, a worst-case assumption. So essentially, not all assumptions are equal: P different from NP is much, much stronger than the assumptions we're using here.

[Audience] But it is still an unproved conjecture, right? It's not as clean as a statistical lower bound, where you have Fano, you have bounds on the total variation distance.

Of course, it's not as clean as that, and there's no grand recipe or method that tells you how to construct these lower bounds. In statistics, you have all these theorems that tell you: you compute KL divergences, you compute chi-squared divergences, and you find your lower bounds. Here, there is a little bit of a recipe, which is essentially: you find one of these problems in theoretical computer science that is assumed to be really hard on average, and you try to see whether it is related to your problem. Or you reverse-engineer it and try to find a statistical problem that's linked to one of these assumptions.
But it's more or less something that you're always going to have to do if you want to say anything negative about statistics done in a computationally efficient manner. And I agree, and this is, I think, in my next conclusion slide: it would be interesting to have a finer analysis of these trade-offs. But I think this is always tied to upper bounds. Going back to the beginning (I think I still have a bit of time, so it's okay): it would be really great to be able to say, with proof, without any assumption, first, that there is a lower bound here that is different from just the information-theoretic lower bound, and second, to describe in a very fine manner how this curve behaves, where exactly the boundary is. That would be really great for practitioners: you tell them, with this much running time, this is the added error you're going to make, and that way they can manage all these trade-offs, et cetera. That is a little bit possible when you focus on upper bounds, or when you focus on one particular algorithm: your estimator is based on t iterations of some kind of gradient descent, say, and then you can maybe show a lower bound for that particular optimization problem, for that particular estimator. So, going back to the end. The results that we have here are computational limits on doing estimation in a distributed manner, or in a manner that is robust to a lot of errors. One thing that I didn't completely talk about, because essentially I'm still working on it, is that you can avoid some of these issues, if that's what you're interested in, by adding some redundancy. Here, in some sense, a very nice aspect of having your data distributed like this is that if someone gets their hands on one block, they cannot say anything about the whole dataset, even with a lot of computing time, as long as it's not infinite. So if you replicate your data in a minimal way, by creating blocks that don't exist yet, you can create a situation where anyone with access to one of these blocks individually, in reasonable computing time, is not going to be able to extract anything from the data, so in some sense it stays secure. But now the whole data exists somewhere, so a central agency that communicates with all these data centers can still, say, sum all the coefficients and tell that something is going on. So this is something interesting, something that could be nice if you're interested in positive results, in fixing these issues a little bit. And then there are more questions: if you think that this is actually a good thing, that your dataset has in some sense been made more secure by nobody being able to look at it and extract anything from it, can you systematically make datasets more secure? Can you take a dataset, transform it into something that looks like this sparse submatrix signal problem, and make it secure against someone who is bounded in computing time? This is a little bit related to issues in cryptography, but from a statistical point of view.
And then, as I said when I first came to this slide: can we have a finer analysis of these trade-offs, and what are we going to have to pay, in some sense, to get it? Are we going to have to focus on one particular computational model? Are we going to have only upper bounds? It's already complicated to prove lower bounds in this very binary manner, with a lot of conjectures, et cetera; it's going to be even harder without all these things. And yeah, this is supported by DS Technical Trust. So thanks a lot for your attention. Are there questions?

[Audience] In optimization, you get lower bounds by using a specific model of computation, like combinations of gradients. Could you imagine something like that for a graph problem, where you access the graph locally, by sequences of local moves? Are there already lower bounds like this? Have you thought about this?

Yes. Montanari and one of his students have worked on that; I think on one of my slides on trade-offs I had a reference. In the planted clique problem, they have a method that works well for estimating cliques of size, say, sqrt(n) times a small constant. And they've shown that the same kind of methods you're talking about, local methods essentially, based on looking at local tree-like neighborhoods, where you walk around locally in the graph, are not able to improve that constant. So there are results like this, and I think it would also be interesting to look at situations where, again, you only use gradients or Hessians or inverses, et cetera, but the data is distributed: does that also create these types of problems? I think that would be really interesting. It would be based on information theory, but you have to make some assumptions about your model. But yes, I think that's an interesting open problem.

[Audience] This is more of a comment than a question. You're absolutely right that understanding the fine-grained behavior, in terms of the runtime, is very difficult, because we don't really have good tools to get lower bounds on running time beyond polynomial versus non-polynomial; and even then, you need assumptions. But sometimes there are more information-theoretic quantities that you can prove things about without making any assumptions, which are still a proxy for the runtime: things like memory, or the amount of communication. For some of the problems you mentioned, like sparse PCA, if you showed that to get something optimal you would need, say, memory or an amount of communication that scales quadratically in the dimension, you could argue that such a method probably won't be efficient, whereas it would be if it were linear in the dimension. And at least when you ask it this way, you can understand things much better; you can get the fine-grained analysis.

I really agree with your comment. And this is a little bit what I was also describing: you take some given method or framework, so you're creating a proxy, as you said, for running time, and then you prove bounds based on information theory. There has been recent work that I'm aware of on sparse linear regression with bounded memory, and there have been a lot of results like that where you bound some computational resource.
The only issue that I have with this is that sometimes, as with statistical algorithms, you have a framework in which you are able to show true lower bounds, based on information theory, that suggest a certain trade-off between running time and error. But then you find something like Gaussian elimination that doesn't fall into this framework and that actually allows you to solve the problem. So then you have the other problem: it is hard to argue that your framework is a true description of running time. This is something I'm a little bit worried about: creating a framework, proving all these bounds, and then someone noticing that something really easy doesn't fall into that framework. Other comments or questions?

[Audience] I have one last question. In the distributed setting, there is a gap between what you can do distributed and what you can do with the full information. Could you imagine, at least for upper bounds, forcing the hospitals to share a part of their information, so that you get a continuum between full information and having only the diagonal blocks?

This is something I've looked at. Having a continuum like this is actually kind of hard: essentially, as soon as you ask them to share even a little bit of information, suddenly you can recover everything. This can be good: if you're a central agency and you really want to know whether an epidemic is going on, you tell the hospitals, come on, you have to share at least a little; we'll make sure it stays in secure datasets; no one has all the data; it cannot fall into the wrong hands. And really quickly, you can recover everything.

[Audience] Okay, so that's the positive result?

Yeah, the positive result kicks in quite fast, with a bit more sharing. But trying to do it in a smoother way is kind of hard, I think, at least in this model. Maybe it's also an issue with the model. Okay. Thanks a lot. Thanks again.